
Deep Compressed Sensing for Terahertz Ultra-Massive MIMO Channel Estimation


Abstract:

Envisioned as a pivotal technology for sixth-generation (6G) and beyond, Terahertz (THz) band communications can potentially satisfy the increasing demand for ultra-high-speed wireless links. While ultra-massive multiple-input multiple-output (UM-MIMO) is promising in counteracting the exceptionally high path loss at THz frequencies, the channel estimation (CE) of this extensive antenna system introduces significant challenges. In this paper, we propose a deep compressed sensing (DCS) framework based on generative neural networks for THz CE. The proposed estimator generates realistic THz channel samples to avoid complex channel modeling for THz UM-MIMO systems, especially in the near field. More importantly, the estimator is optimized for fast channel inference. Our results show significant superiority over the baseline generative adversarial network (GAN) estimator and traditional estimators. Compared to conventional estimators, our model achieves at least 8 dB lower normalized mean squared error (NMSE). Against the GAN estimator, our model achieves around 3 dB lower NMSE at 0 dB SNR with one order of magnitude lower computational complexity. Moreover, our model achieves lower training overhead compared to GAN, with empirically 4 times faster training convergence.
Page(s): 1747 - 1762
Date of Publication: 21 February 2025
Electronic ISSN: 2644-125X

SECTION I.

Introduction

Following the commercial launch of fifth-generation (5G) networks in 2020, the vision and planning of sixth-generation (6G) systems, which aim to provide communication services for the 2030s, has begun [1]. 6G wireless systems are anticipated to attain a peak data rate of 1 Tbps, a tenfold increase over 5G, with the spectral efficiency expected to double to 60 bps/Hz. Significant enhancements in end-to-end performance are also expected, with a packet error rate of 10^{-9} and a latency of 10^{-4} seconds. Moreover, the number of connected devices will be 100 times greater than in 5G due to the extensively improved energy efficiency, which provides the potential to achieve millimeter-level sensing resolution for 6G Internet of Things (IoT) devices [2].

Millimeter-wave (mm-Wave) (30-300 GHz) communications below 100 GHz have been utilized in 5G systems to support higher data rates. However, it is challenging for mm-Wave systems to achieve a Tbps-level data rate due to the limited bandwidth (up to 20 GHz) and spectral efficiency (below 30 bps/Hz) [2]. Hence, the Terahertz (THz) band (0.1-10 THz) is nominated among all available frequency bands due to its ultra-wide bandwidth of up to hundreds of GHz, which is mostly unlicensed [2]. However, the THz band suffers from strong free-space path loss due to the high operating frequency, and from severe absorption loss caused by atmospheric gases [3]. On the other hand, the ultra-high operating frequency allows the integration of thousands of tightly packed antennas within an area of 1 mm², making it possible for ultra-massive multiple-input multiple-output (UM-MIMO) systems to achieve high beamforming gain that compensates for the high path loss and enables Tbps wireless communications [4]. Moreover, extremely large-scale antenna arrays can be used for spatial multiplexing to increase spectral efficiency [5].

Beamforming and spatial multiplexing rely heavily on accurate channel state information (CSI) obtained through channel estimation (CE). However, CE in the THz band poses significant challenges, and conventional CE techniques cannot provide satisfactory performance for the following reasons. First, channel modeling in the THz band is more complex due to the near-field effect [6], [7], [8]. Second, the computational complexity is extremely high, given the huge channel and dictionary matrices involved in conventional compressed sensing (CS)-based methods. Moreover, conventional schemes suffer from high communication overhead, since prior knowledge of channel statistics is often needed but usually unavailable [9].

A. Related Works

In the following, we present state-of-the-art CE schemes considering both traditional methods and emerging deep learning (DL)-based techniques.

1) Conventional Methods

Beyond the conventional least squares (LS) and linear minimum mean square error (LMMSE) solutions, traditional CE methods involve dictionary-based CS. More precisely, these methods assume that the angle of arrival (AoA) and angle of departure (AoD) are taken from a fixed grid and exploit the channel sparsity in the angular domain, where CS algorithms such as orthogonal matching pursuit (OMP) [10] and approximate message passing (AMP) [11] can be utilized. Moreover, the authors in [12] propose a generalized simultaneous OMP (GSOMP) algorithm, a variant of the traditional OMP algorithm, that exploits the common support property and treats the CE problem as a generalized multiple measurement vector (GMMV) problem, where multiple sensing matrices are employed. To overcome the off-grid problem, where the actual AoA and AoD deviate from the grid, the angle estimates obtained from GSOMP are refined in [13] using electromagnetic (EM)-based methods. Due to the reduced communication distance, several works are tailored to cross-field CE problems. In [14], a simultaneous OMP (SOMP)-based algorithm with dictionary reduction is proposed to solve the cross-field CE problem with lower complexity. In [15], the authors propose an on-grid polar-domain SOMP (P-SOMP) algorithm to estimate channels effectively in the near field. To accommodate the model mismatch of the line-of-sight (LoS) path in the near field, a LoS-NLoS-separated CE algorithm is proposed in [16], where the former is based on parameter estimation and the latter on P-SOMP. LoS sensing-based CE in unmanned aerial vehicle (UAV)-assisted systems is also investigated in [17].

Conventional dictionary-based methods rely heavily on accurate channel modeling and dictionary construction. To mitigate model mismatch, methods are being developed that either use more accurate channel models or refine the dictionary. Nevertheless, these methods may not perform as well as in lower frequency bands: as the antenna dimension grows in the THz band, the dictionary becomes larger and the computational load much heavier.

2) Deep-Learning-Based Methods

The success of DL in various fields makes it a promising candidate for MIMO CE. DL-based CE algorithms can be categorized into two classes: data-driven methods and model-driven methods. Data-driven approaches, also known as black-box methods, aim to provide an end-to-end mapping from the received signal to the full channel matrix or its corresponding parameters. The authors in [18] utilize the deep kernel learning (DKL) algorithm based on Gaussian process regression, where second-order statistics are learned using a multilayer perceptron (MLP) neural network. Moreover, deep convolutional neural networks (DCNNs) are extensively exploited in the literature [19], [20], [21], [22], since channel matrices are analogous to pixel-based images, whose intrinsic properties can be captured by convolutional layers. In [19], the channel is initially coarsely estimated using AMP and then refined by a DCNN. In addition, insignificant neurons are pruned to reduce inference complexity. In [20], the authors propose a federated learning (FL)-based framework for CE to reduce communication overhead. The works mentioned above leverage the far-field dictionary, which results in high estimation errors in the near field. To tackle the near-field CE problem, the channel matrix is first built upon planar wave assumptions in [21] and then multiplied by a correction matrix representing the phase difference between planar and spherical waves. In [22], the authors create a near-field dictionary with an additional dimension for distance, yielding a denser grid with higher complexity. Apart from DCNN structures, generative adversarial network (GAN)-based CE schemes have emerged in recent years [23], [24], [25], [26]. The authors in [23] propose a GAN-based CE framework for wideband channels: a GAN is trained to learn the channel distribution from data and generate channel samples from the learned distribution, and gradient descent and its variants are then used to find the best-generated channel based on the measurements. The authors in [24] and [25] employ the Wasserstein GAN with gradient penalty (WGAN-GP) algorithm to stabilize training and propose an FL-based algorithm that distributes the training to multiple users, respectively. In [26], the authors propose a score-based generative model to avoid adversarial training, relying on less stringent assumptions regarding the low-dimensional characteristics of wireless channels. DL-based CE in orthogonal frequency-division multiplexing (OFDM) and orthogonal time frequency space (OTFS) systems is studied in [27] and [28], respectively.

In contrast to the data-driven approaches mentioned above, model-driven methods intertwine DL with domain knowledge by unfolding traditional algorithms and substituting non-linear estimators (NLEs) with a neural network. Initial deep unfolding methods truncate a set of algorithms to a fixed number of layers, incurring instability and unaffordable training costs [29], [30]. In [9], a fixed point network (FPN) featuring a cyclical topology that offers an adjustable trade-off between accuracy and computational complexity is proposed based on the orthogonal approximate message passing (OAMP) algorithm. In addition to minimizing the mean-square error (MSE) loss in training, a constraint on input and output is imposed to enhance the stability of the algorithm. In [31], a joint data-driven and model-driven CE approach is investigated under imperfect hardware considerations.
Despite all the merits of DL-based methods, current DL-based estimators often prioritize estimation accuracy while neglecting computational complexity, which is critical for practical implementation. In this paper, we aim to maintain high estimation accuracy while minimizing the inference complexity by using a principle similar to meta-learning but with only one task dataset [32].

B. Contribution

The GAN-based estimator [23], [24], [25], [33] has demonstrated superior performance to traditional methods such as OMP [34], the least absolute shrinkage and selection operator (LASSO) [35], and expectation-maximization Gaussian-mixture approximate message passing (EM-GM-AMP) [36], as well as to DL-based methods such as ResNet [37]. Specifically, [23] shows that the GAN estimator achieves more than 5 dB lower normalized mean squared error (NMSE) than ResNet while using only 6% of the model parameters. In low compression-ratio scenarios, [24] demonstrates that the GAN estimator outperforms LASSO and EM-GM-AMP by a margin of more than 5 dB. The superiority of the GAN estimator is further validated in [25] across all considered 3GPP delay profiles (CDL-A through CDL-E), where it consistently achieves higher accuracy than EM-GM-AMP. The GAN model can create a highly compressed latent representation of high-dimensional channels. It also offers lower communication overhead and high estimation accuracy, especially at low signal-to-noise ratio (SNR). However, the GAN-based estimator is not optimized for CS tasks. First, reconstruction with a GAN-based estimator is slow, involving hundreds to thousands of gradient descent steps with several random restarts [38]. Second, a GAN is difficult to train due to the nature of the problem, i.e., finding a Nash equilibrium between two neural networks [39]. The adversarial loss also provides little insight into the model performance in CS, making it difficult to evaluate the model during training. Third, training convergence is slow, since most computational resources are spent on the discriminator, while the performance of the CS model relies only on the generator.

To address the aforementioned problems with GAN estimators, we propose deep compressed sensing (DCS), a CE framework for frequency-selective THz UM-MIMO CE tasks. The main contribution can be summarized as follows.

  • We propose DCS, a CE framework based on generative neural networks for THz UM-MIMO systems. We train a neural network that can produce realistic THz channel samples, eliminating complex channel modeling, especially in the near field.

  • We demonstrate how to integrate the CS framework into training a generative neural network. Our model is trained to adapt to the CE task with at least 8 dB lower NMSE compared to conventional estimators, namely LS, LMMSE, and OMP. Furthermore, our model achieves around 3 dB lower NMSE in comparison to the GAN estimator while reducing the number of online inference steps by one order of magnitude, which solves the most detrimental problem of the GAN estimator.

  • We design a loss function that accounts for both the reconstruction error and enforces the restricted isometry property (RIP) to ensure successful channel reconstruction with high probability. Moreover, the designed loss function provides a good measure of model performance compared to adversarial loss in GAN training, which is less related to the performance in CE tasks.

  • Our model waives the training of a discriminator that consumes the majority of the training computation. The task-aware nature of our model results in 4 times faster training convergence compared to GAN.

C. Notations and Structure of the Paper

Throughout this paper, we adhere to the following notation: \mathbf{A} represents a matrix, a is a scalar, and \mathbf{a} is a vector. The \ell_2-norm of \mathbf{a} is denoted as \|\mathbf{a}\|_{2}. The element in the i-th row and j-th column of \mathbf{A} is denoted as \mathbf{A}[i,j]. Additionally, various transformations of \mathbf{A} are represented as follows: \mathrm{vec}\{\mathbf{A}\} for vectorization, \mathbf{A}^{\top} for the transpose, \mathbf{A}^{\mathrm{H}} for the Hermitian (conjugate transpose), \mathbf{A}^{*} for the conjugate, \mathbf{A}^{-1} for the inverse, and \mathbf{A}^{\dagger} for the pseudo-inverse. \mathbf{I}_{N} denotes the N \times N identity matrix, and \mathbf{0}_{N} signifies an N-dimensional vector of zeros. The Kronecker product is represented by \mathbf{A} \otimes \mathbf{B}. The notation \mathcal{CN}(\mathbf{m}, \mathbf{R}) describes a circularly-symmetric complex Gaussian vector with mean \mathbf{m} and covariance matrix \mathbf{R}. The gradient of a function f with respect to \mathbf{a} is represented by \nabla_{\mathbf{a}}(f). The set of complex numbers is denoted as \mathbb{C}, and the Dirac delta function is represented as \delta(\cdot). Lastly, \mathbb{E}\{\cdot\} denotes the expectation operator, and \mathbb{P}\{\cdot\} denotes the probability distribution. The main symbols used in the paper are summarized in Table 1.

TABLE 1. List of Symbols.

The remainder of the paper is organized as follows. In Section II, the system model is described. In Section III, the proposed DCS-based channel estimator is explained. In Section IV, the numerical results are presented. Finally, in Section V, conclusions are provided.

SECTION II.

System Model

In this section, we consider a multi-carrier THz UM-MIMO communication system with hybrid beamforming and combining techniques [40], as shown in Figure 1.

FIGURE 1. Illustration of a THz UM-MIMO system with N_{\mathrm{t}} antennas at Tx and N_{\mathrm{r}} antennas at Rx.

A. Array-of-Subarrays (AoSA)

We assume that the transmitter (Tx) and receiver (Rx) both deploy an array-of-subarrays (AoSA) structure with planar antenna arrays distributed on the Y-Z plane of their local Cartesian coordinates, where the origins lie at the centers of the Tx and Rx. The total number of sub-arrays (SAs) is N_{\mathrm{SA}}= N_{z}\times N_{y}, where N_{z} and N_{y} denote the number of SAs along the local Z-axis and Y-axis, respectively. In turn, each SA consists of N_{\mathrm{AE}}=\bar{N}_{z}\times\bar{N}_{y} densely arranged antenna elements (AEs) controlled by a unique baseband-to-radio frequency (RF) chain. To distinguish the structures of the Tx and Rx, we apply the subscripts t and r to denote the associated parameters at the Tx and Rx, respectively. Thus, the AEs at the Tx form N_{\mathrm{RF},\mathrm{t}}=N_{\mathrm{SA},\mathrm{t}}=N_{z,\mathrm{t}}N_{y,\mathrm{t}} RF chains, and the total number of Tx antennas is N_{\mathrm{t}}=N_{z,\mathrm{t}}N_{y,\mathrm{t}}\bar{N}_{z,\mathrm{t}}\bar{N}_{y,\mathrm{t}}. The corresponding parameters at the Rx side, the number of receive RF chains N_{\mathrm{RF},\mathrm{r}} and receive AEs N_{\mathrm{r}}, are obtained similarly.

B. Signal Model

To estimate wideband THz channels, we adopt OFDM and allocate pilot signals with N_{s} (N_{s}\leq \min\{N_{\mathrm{RF},\mathrm{t}}, N_{\mathrm{RF},\mathrm{r}}\}) data streams to K subcarriers at frequency f_{k}=f_{c}+B(k-(K-1)/2)/K, \forall k=0,1,\ldots,K-1, where f_{c} is the center frequency and B is the total bandwidth. We assume that the coherence bandwidth of the THz channel is larger than the subcarrier bandwidth, so the wideband CE problem becomes K narrowband CE problems.¹ We denote the vector of pilot symbols on the k-th subcarrier as \mathbf{s}[k] \in \mathbb{C}^{N_{s} \times 1}, which satisfies the power constraint \mathbb{E}\{\mathbf{s}[k]\mathbf{s}^{\mathrm{H}}[k]\}=(P/(K N_{s}))\mathbf{I}_{N_{s}} with total transmit power P. The baseband symbol vector is then precoded by a digital and an analog precoder to down-scale the hardware complexity. Thus, the transmitted signal at the k-th subcarrier is expressed as

\begin{align*} \mathbf{x}[k] &= \mathbf{F}_{\mathrm{RF}} \mathbf{F}_{\mathrm{BB}}[k] \mathbf{s}[k] \\ &= \mathbf{F}[k]\mathbf{s}[k], \tag{1}\end{align*}

where \mathbf{F}_{\mathrm{BB}}[k] \in \mathbb{C}^{N_{\mathrm{RF},\mathrm{t}} \times N_{s}} denotes the digital precoding matrix, \mathbf{F}_{\mathrm{RF}} \in \mathbb{C}^{N_{\mathrm{t}}\times N_{\mathrm{RF},\mathrm{t}}} represents the frequency-independent analog precoding matrix, and \mathbf{F}[k] = \mathbf{F}_{\mathrm{RF}} \mathbf{F}_{\mathrm{BB}}[k] is the overall precoding matrix.

Due to the partially-connected structure in which the AEs in an SA share a unique RF chain, the analog beamforming matrix \mathbf{F}_{\mathrm{RF}} has a block-diagonal structure [40], i.e.,

\begin{align*} \mathbf{F}_{\mathrm{RF}}= \begin{bmatrix} \mathbf{f}_{\mathrm{RF}}^{1} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{f}_{\mathrm{RF}}^{2} & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{f}_{\mathrm{RF}}^{N_{\mathrm{RF},\mathrm{t}}} \end{bmatrix}, \tag{2}\end{align*}

where \mathbf{f}_{\mathrm{RF}}^{j} \in \mathbb{C}^{N_{\mathrm{AE},\mathrm{t}} \times 1}, \forall j \in \{1,2,\ldots,N_{\mathrm{RF},\mathrm{t}}\}. Since each AE is connected to a finite-bit phase shifter with equal power within an RF chain, the non-zero elements in \mathbf{F}_{\mathrm{RF}} have the same magnitude and varying phases. More precisely, the i-th element of \mathbf{f}_{\mathrm{RF}}^{j} can be expressed as

\begin{equation*} \mathbf{f}_{\mathrm{RF}}^{j}[i] = \frac{1}{\sqrt{N_{\mathrm{AE},\mathrm{t}}}} e^{j \psi_{i,j}}, \quad \forall i \in \{1,2,\ldots,N_{\mathrm{AE},\mathrm{t}}\}, \tag{3}\end{equation*}

where \psi_{i,j} \in [0,2\pi] represents the phase shift applied to the i-th AE of the j-th SA.
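As a concrete illustration, a minimal NumPy sketch (our own; the function and argument names are hypothetical) that assembles such a block-diagonal F_RF with uniformly random phase shifts, following (2)-(3), could read:

```python
import numpy as np

def analog_precoder(n_rf_t: int, n_ae_t: int, rng=None) -> np.ndarray:
    """Block-diagonal analog precoder F_RF as in (2)-(3): each SA's n_ae_t
    AEs are driven by constant-modulus phase shifters on its own RF chain."""
    if rng is None:
        rng = np.random.default_rng(0)
    f_rf = np.zeros((n_rf_t * n_ae_t, n_rf_t), dtype=complex)
    for j in range(n_rf_t):
        psi = rng.uniform(0.0, 2.0 * np.pi, size=n_ae_t)  # phase shifts psi_{i,j}
        f_rf[j * n_ae_t:(j + 1) * n_ae_t, j] = np.exp(1j * psi) / np.sqrt(n_ae_t)
    return f_rf
```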

The received signal at the Rx is processed by the analog combiner \mathbf{C}_{\mathrm{RF}} \in \mathbb{C}^{N_{\mathrm{r}} \times N_{\mathrm{RF},\mathrm{r}}} and the digital combiner \mathbf{C}_{\mathrm{BB}}[k] \in \mathbb{C}^{N_{\mathrm{RF},\mathrm{r}} \times N_{s}}. The combining matrices are subject to the same constraints as the precoding matrices. The combining process results in the baseband received signal \mathbf{y}[k] \in \mathbb{C}^{N_{s} \times 1} as

\begin{equation*} \mathbf{y}[k]=\mathbf{C}^{\mathrm{H}}[k]\mathbf{H}[k]\mathbf{x}[k]+\mathbf{C}^{\mathrm{H}}[k]\mathbf{n}[k], \tag{4}\end{equation*}

where \mathbf{H}[k] \in \mathbb{C}^{N_{\mathrm{r}} \times N_{\mathrm{t}}} denotes the frequency-domain channel matrix for the k-th subcarrier, \mathbf{n}[k] \sim \mathcal{CN}(\mathbf{0}_{N_{\mathrm{r}}}, \sigma^{2}\mathbf{I}_{N_{\mathrm{r}}}) is additive noise drawn from a circularly-symmetric complex Gaussian distribution, and \mathbf{C}^{\mathrm{H}}[k]=\mathbf{C}_{\mathrm{BB}}^{\mathrm{H}}[k]\mathbf{C}_{\mathrm{RF}}^{\mathrm{H}} is the overall combining matrix.

C. Channel Model

The frequency-selective THz channel model is built upon the division of multiple subcarriers, each consisting of multiple sub-bands. Within this framework, we consider the presence of both LoS and non-line-of-sight (NLoS) components for each subcarrier. Considering a delay tap with length N_{u} in the time domain, the delay-u channel matrix at subcarrier frequency f_{k} can be expressed as the sum of the LoS and NLoS components as [8], [16]

\begin{equation*} \mathbf{H}^{u}=\mathbf{H}_{\mathrm{LoS}}^{u} + \mathbf{H}_{\mathrm{NLoS}}^{u}, \tag{5}\end{equation*}

where u \in [0, N_{u}-1] represents the delay tap index in the time domain.

In near-field UM-MIMO systems, each Tx-Rx AE pair experiences a unique propagation path for the LoS component due to spherical wave propagation [16]. Thus, we model the LoS channel under the geometric free-space propagation assumption for each pair of Tx and Rx AEs, instead of using the array response vectors commonly adopted in the Saleh-Valenzuela (SV) model. Specifically, letting \mathbf{H}_{\mathrm{LoS}}^{u}[n_{\mathrm{r}}, n_{\mathrm{t}}] denote the channel between the n_{\mathrm{r}}-th Rx AE and the n_{\mathrm{t}}-th Tx AE, it can be written as

\begin{equation*} \mathbf{H}_{\mathrm{LoS}}^{u}[n_{\mathrm{r}}, n_{\mathrm{t}}] = g_{0}(f_{k}, d_{n_{\mathrm{r}}, n_{\mathrm{t}}})\, \delta(u T_{s}-\tau_{n_{\mathrm{r}}, n_{\mathrm{t}}}), \tag{6}\end{equation*}

where T_{s} is the sampling period, d_{n_{\mathrm{r}}, n_{\mathrm{t}}} is the distance between the AE pair, and g_{0}(f_{k}, d_{n_{\mathrm{r}}, n_{\mathrm{t}}}) and \tau_{n_{\mathrm{r}}, n_{\mathrm{t}}} denote the path gain and propagation delay between the n_{\mathrm{r}}-th Rx AE and the n_{\mathrm{t}}-th Tx AE, respectively. We adopt the SV model for the NLoS channel [43], which can be expressed as

\begin{equation*} \mathbf{H}_{\mathrm{NLoS}}^{u} = \sum_{\ell=1}^{L} g_{\ell}\, \mathbf{a}_{\mathrm{r}} \mathbf{a}_{\mathrm{t}}^{\mathrm{H}}\, \delta(u T_{s}-\tau_{\ell}), \tag{7}\end{equation*}

where L is the number of NLoS propagation paths, g_{\ell} is the complex gain of the \ell-th path, \tau_{\ell} is the delay of the \ell-th path, and \mathbf{a}_{\mathrm{r}} and \mathbf{a}_{\mathrm{t}} are the model-dependent array response vectors. For large Tx-Rx separations, a planar wave model is sufficient, where these vectors depend solely on the AoA and AoD.
However, for short Tx-Rx distances, a spherical wave model becomes necessary, incorporating the scatterer distance as an additional dimension along with the AoA and AoD as [44]

\begin{align*} \mathbf{a}_{\mathrm{t}}(\theta_{\mathrm{t}}^{\ell}, d_{\mathrm{t}}^{\ell}) &= \frac{1}{\sqrt{N_{\mathrm{t}}}}\left[e^{-\mathrm{j}\frac{2\pi}{\lambda}(d_{\mathrm{t}}^{\ell}(1)-d_{\mathrm{t}}^{\ell})}, \ldots, e^{-\mathrm{j}\frac{2\pi}{\lambda}(d_{\mathrm{t}}^{\ell}(N_{\mathrm{t}})-d_{\mathrm{t}}^{\ell})}\right]^{\mathrm{H}}, \tag{8}\\ \mathbf{a}_{\mathrm{r}}(\theta_{\mathrm{r}}^{\ell}, d_{\mathrm{r}}^{\ell}) &= \frac{1}{\sqrt{N_{\mathrm{r}}}}\left[e^{-\mathrm{j}\frac{2\pi}{\lambda}(d_{\mathrm{r}}^{\ell}(1)-d_{\mathrm{r}}^{\ell})}, \ldots, e^{-\mathrm{j}\frac{2\pi}{\lambda}(d_{\mathrm{r}}^{\ell}(N_{\mathrm{r}})-d_{\mathrm{r}}^{\ell})}\right]^{\mathrm{H}}, \tag{9}\end{align*}

where \theta_{\mathrm{t}}^{\ell} and \theta_{\mathrm{r}}^{\ell} denote the AoD and AoA of the \ell-th path, respectively; d_{\mathrm{t}}^{\ell} and d_{\mathrm{r}}^{\ell} represent the distances between the \ell-th scatterer and the centers of the Tx and Rx antenna arrays, respectively; and d_{\mathrm{t}}^{\ell}(n_{\mathrm{t}}) and d_{\mathrm{r}}^{\ell}(n_{\mathrm{r}}) are the distances between the \ell-th scatterer and the n_{\mathrm{t}}-th element of the Tx array and the n_{\mathrm{r}}-th element of the Rx array, respectively.

The frequency-domain channel response is related to the time-domain response via the Fourier transform (FT) as

\begin{align*} \mathbf{H}[k] &= \sum_{u=0}^{N_{u}-1} \left(\mathbf{H}_{\mathrm{LoS}}^{u} + \mathbf{H}_{\mathrm{NLoS}}^{u}\right) e^{-j\frac{2\pi k}{K}u} \\ &= \mathbf{H}_{\mathrm{LoS}}[k] + \mathbf{H}_{\mathrm{NLoS}}[k] \\ &= \mathbf{H}_{\mathrm{LoS}}[k] + \sum_{\ell=1}^{L} g_{\ell}\, \mathbf{a}_{\mathrm{r}} \mathbf{a}_{\mathrm{t}}^{\mathrm{H}} e^{-j 2\pi \frac{kB}{K}\tau_{\ell}}, \tag{10}\end{align*}

where

\begin{equation*} \mathbf{H}_{\mathrm{LoS}}[k][n_{\mathrm{r}}, n_{\mathrm{t}}] = g_{0}(f_{k}, d_{n_{\mathrm{r}}, n_{\mathrm{t}}})\, e^{-j 2\pi \frac{kB}{K}\tau_{n_{\mathrm{r}}, n_{\mathrm{t}}}}. \tag{11}\end{equation*}
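To make the mapping in (10) concrete, a minimal NumPy sketch (our own illustration; names are hypothetical) that converts delay-tap channel matrices into per-subcarrier responses could read:

```python
import numpy as np

def freq_response(h_taps: np.ndarray, K: int) -> np.ndarray:
    """DFT of delay-tap channels H^u (shape (N_u, N_r, N_t)) into K
    subcarrier responses H[k], following (10)."""
    n_u = h_taps.shape[0]
    u = np.arange(n_u)
    h_freq = np.empty((K,) + h_taps.shape[1:], dtype=complex)
    for k in range(K):
        phase = np.exp(-1j * 2.0 * np.pi * k * u / K)  # e^{-j 2 pi k u / K}
        h_freq[k] = np.tensordot(phase, h_taps, axes=(0, 0))
    return h_freq
```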

D. Problem Formulation

Here, the channel coherence time is assumed to be much longer than the symbol duration and is divided into two stages: training and data transmission. This assumption is valid since the symbol duration is on the order of picoseconds [2], while the channel coherence time spans milliseconds [45]. For simplicity, the subcarrier index is dropped from here on.

During the training phase, N_{p} sequential pilot signals are sent from the Tx and then used for CE. Similar to (4), the received signal for the m-th pilot symbol can be written as

\begin{equation*} \mathbf{y}_{m}=\mathbf{C}_{m}^{\mathrm{H}}\mathbf{H}\mathbf{x}_{m}+\mathbf{C}_{m}^{\mathrm{H}}\mathbf{n}_{m}, \tag{12}\end{equation*}

where \mathbf{C}_{m} is the combining matrix for the m-th pilot symbol, \mathbf{x}_{m} is the m-th pilot signal, and \mathbf{n}_{m} is the additive white Gaussian noise.

Vectorizing the channel matrix, we obtain the linear system

\begin{align*} \mathbf{y}_{m} &= \underbrace{\left(\mathbf{x}_{m}^{\top}\otimes \mathbf{C}_{m}^{\mathrm{H}}\right)}_{\boldsymbol{\Phi}_{m}} \mathbf{h}+ \mathbf{C}_{m}^{\mathrm{H}}\mathbf{n}_{m} \\ &= \boldsymbol{\Phi}_{m}\mathbf{h}+\mathbf{C}_{m}^{\mathrm{H}}\mathbf{n}_{m}, \tag{13}\end{align*}

where \mathbf{h} = \mathrm{vec}\{\mathbf{H}\} is the vectorized channel. Then, we concatenate all the received signals and obtain

\begin{align*} \underbrace{\begin{bmatrix} \mathbf{y}_{1} \\ \mathbf{y}_{2} \\ \vdots \\ \mathbf{y}_{N_{p}} \end{bmatrix}}_{\tilde{\mathbf{y}}} = \underbrace{\begin{bmatrix} \boldsymbol{\Phi}_{1} \\ \boldsymbol{\Phi}_{2} \\ \vdots \\ \boldsymbol{\Phi}_{N_{p}} \end{bmatrix}}_{\boldsymbol{\Phi}} \mathbf{h} + \underbrace{\mathrm{diag}\left(\mathbf{C}_{1}^{\mathrm{H}},\ldots,\mathbf{C}_{N_{p}}^{\mathrm{H}}\right)}_{\boldsymbol{\Psi}} \underbrace{\begin{bmatrix} \mathbf{n}_{1} \\ \mathbf{n}_{2} \\ \vdots \\ \mathbf{n}_{N_{p}} \end{bmatrix}}_{\tilde{\mathbf{n}}}. \tag{14}\end{align*}

Simplifying the expression, we obtain the linear system used for CE as

\begin{equation*} \tilde{\mathbf{y}}=\boldsymbol{\Phi}\mathbf{h}+\boldsymbol{\Psi}\tilde{\mathbf{n}}, \tag{15}\end{equation*}

where \tilde{\mathbf{y}} \in \mathbb{C}^{N_{s}N_{p} \times 1} is the overall received signal, \boldsymbol{\Phi} \in \mathbb{C}^{N_{s}N_{p} \times N_{\mathrm{t}}N_{\mathrm{r}}} is the measurement matrix, \boldsymbol{\Psi} \in \mathbb{C}^{N_{s}N_{p} \times N_{p}N_{\mathrm{r}}} is the noise projection matrix, and \tilde{\mathbf{n}} \in \mathbb{C}^{N_{p}N_{\mathrm{r}} \times 1} is the concatenated noise vector.
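The stacking in (13)-(15) can be sketched as follows (our own illustrative code; the Kronecker identity vec(C^H H x) = (x^T ⊗ C^H) vec(H) yields each block Phi_m):

```python
import numpy as np

def stack_measurements(x_list, c_list, h_vec, noise_std, rng=None):
    """Build y_tilde and Phi from per-pilot x_m (shape (N_t,)) and
    C_m (shape (N_r, N_s)), applying C_m^H to the noise as in (13)."""
    if rng is None:
        rng = np.random.default_rng(0)
    y_blocks, phi_blocks = [], []
    for x_m, c_m in zip(x_list, c_list):
        phi_m = np.kron(x_m[None, :], c_m.conj().T)  # Phi_m = x_m^T kron C_m^H
        n_m = noise_std * (rng.standard_normal(c_m.shape[0])
                           + 1j * rng.standard_normal(c_m.shape[0])) / np.sqrt(2)
        y_blocks.append(phi_m @ h_vec + c_m.conj().T @ n_m)
        phi_blocks.append(phi_m)
    return np.concatenate(y_blocks), np.vstack(phi_blocks)  # y_tilde, Phi
```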

Conventional methods can be used to solve the linear system. The LS solution is given as [46]

\begin{equation*} \hat{\mathbf{h}}_{\mathrm{LS}}=\boldsymbol{\Phi}^{\dagger}\tilde{\mathbf{y}}, \tag{16}\end{equation*}

where \dagger denotes the pseudo-inverse operation. The LMMSE estimator assumes that the first- and second-order statistics, i.e., the mean and the covariance matrix, are available. The LMMSE estimator is given by [46] as

\begin{equation*} \hat{\mathbf{h}}_{\mathrm{LMMSE}} = \mathbf{R}_{hh}\boldsymbol{\Phi}^{\mathrm{H}}\left(\boldsymbol{\Phi}\mathbf{R}_{hh}\boldsymbol{\Phi}^{\mathrm{H}} +\boldsymbol{\Psi}\mathbf{R}_{nn}\boldsymbol{\Psi}^{\mathrm{H}}\right)^{-1}\tilde{\mathbf{y}}, \tag{17}\end{equation*}

where \mathbf{R}_{hh}=\mathbb{E}\{\mathbf{h}\mathbf{h}^{\mathrm{H}}\} and \mathbf{R}_{nn}=\mathbb{E}\{\tilde{\mathbf{n}}\tilde{\mathbf{n}}^{\mathrm{H}}\} are the autocorrelation matrices of the channel and the noise, respectively.
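Both baselines translate directly into code; a brief sketch (our own, assuming the autocorrelation matrices are known) is:

```python
import numpy as np

def ls_estimate(phi, y):
    """LS solution (16): h_hat = Phi^dagger y_tilde."""
    return np.linalg.pinv(phi) @ y

def lmmse_estimate(phi, psi, y, r_hh, r_nn):
    """LMMSE solution (17) given R_hh and R_nn."""
    a = phi @ r_hh @ phi.conj().T + psi @ r_nn @ psi.conj().T
    return r_hh @ phi.conj().T @ np.linalg.solve(a, y)
```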

SECTION III.

Deep Compressed Sensing for THz Channel Estimation

In this section, we propose a novel CE framework, DCS, based on a task-aware generative neural network. First, we provide the core idea and inference process of DCS. Then, we propose a novel training algorithm targeting the drawbacks of GAN estimators.

A. Structural Constraint via Neural Networks

Conventional CS schemes utilize channel sparsity in the angular domain and impose a structural constraint on the channel matrix to solve the underdetermined system. However, their performance is limited for several reasons. First, the reconstruction depends heavily on how well the actual AoAs, AoDs, and distances match the constructed grid. Second, THz systems may operate in the near-field region due to the joint effect of wavelength and array aperture [47], resulting in an additional dimension in channel modeling, which expands the search space of the dictionary and increases the algorithm execution time. Third, the SV model may not accurately capture LoS characteristics, causing a higher estimation error [16]. Although the authors in [16] propose a LoS-NLoS-separated estimation algorithm, the additional computation for estimating the LoS channel makes the inference time even more critical. Finally, some properties of THz channels are not yet captured by mathematical models, so the performance in practice is even worse; these properties, however, can be learned from data by DL models.

The idea of DCS is to replace the sparsity constraint with the structural constraint imposed by a generative neural network, e.g., a variational autoencoder (VAE) or GAN, which provides a mapping from a latent representation to the signal space. Thus, instead of requiring sparse signals, the neural network implicitly constrains its output to a low-dimensional manifold via its weights and biases learned from the data [48]. In other words, the channel estimate belongs to a space defined by a neural network G_{\vartheta} parameterized by \vartheta as

\begin{equation*} \hat{\mathbf{h}} = G_{\vartheta}(\mathbf{z}), \tag{18}\end{equation*}

where \mathbf{z} is a low-dimensional input vector.

B. Inference

In the context of CE, the neural network G_{\vartheta} is trained offline and deployed for time-sensitive online inference without parameter updates. The inference task is equivalent to finding the optimal input \mathbf{z} that minimizes the reconstruction error as

\begin{align*} \hat{\mathbf{z}} &= \underset{\mathbf{z}}{\arg\min}\left\|\tilde{\mathbf{y}}-\boldsymbol{\Phi}G_{\vartheta}(\mathbf{z})\right\|_{2}^{2} \\ &= \underset{\mathbf{z}}{\arg\min}\,{\mathcal{L}}_{\vartheta}(\tilde{\mathbf{y}}, \boldsymbol{\Phi}, \mathbf{z}), \tag{19}\end{align*}

where {\mathcal{L}}_{\vartheta}(\tilde{\mathbf{y}}, \boldsymbol{\Phi}, \mathbf{z}) = \|\tilde{\mathbf{y}}-\boldsymbol{\Phi}G_{\vartheta}(\mathbf{z})\|_{2}^{2} is the reconstruction loss. A regularization term could be added to the loss to explore more of the preferred region of the generator [38]. Due to the non-linearity introduced by the neural network, the reconstruction loss is highly non-convex. However, gradient descent and its variants, e.g., Adam [49] and RMSProp [50], can be used to solve (19). Finally, the channel estimate is given as

\begin{equation*} \hat{\mathbf{h}} = G_{\vartheta}(\hat{\mathbf{z}}). \tag{20}\end{equation*}

Since finding the optimal \hat{\mathbf{z}} is an iterative process, the network parameters must be optimized during training to ensure accurate channel reconstruction within minimal inference steps. Poor network initialization would lead to more iterations, which would exceed the short channel coherence time in THz systems.
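A minimal PyTorch sketch of this inference loop (our own illustration; it assumes real-valued phi and y, with real/imaginary parts stacked as described in Section IV, and a generator that maps a (1, d_latent) input to a channel flattening to phi's column dimension) is:

```python
import torch

def dcs_infer(generator, phi, y, d_latent=100, steps=20, lr=0.1):
    """Solve (19) by gradient descent on z with a frozen generator,
    then return the channel estimate (20)."""
    z = torch.randn(1, d_latent, requires_grad=True)
    for _ in range(steps):
        loss = torch.sum((y - phi @ generator(z).flatten()) ** 2)  # reconstruction loss
        (grad_z,) = torch.autograd.grad(loss, z)
        with torch.no_grad():
            z -= lr * grad_z  # gradient step
            z /= z.norm()     # unit-sphere projection, cf. (23)
    return generator(z).detach()
```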

C. Training

The main goal of training is to learn the channel distribution from a dataset of real THz channels. We first introduce GAN training preliminaries before presenting our training method.

1) GAN Preliminaries

A GAN can be trained to learn the channel distribution from a channel dataset. GAN training involves a competition between two networks: a generator G that converts a noise source into a fake sample, and a discriminator/critic D that differentiates genuine and generated samples. The discriminator D is trained using both true and fake channel samples to encourage it to discriminate between them. The generator G is trained to produce higher-quality samples that fool the discriminator into classifying the fake samples as valid. Through this alternating training process, both networks progressively improve, leading G to generate increasingly realistic channel samples. Formally, the training involves an adversarial game of the following min-max problem

\begin{equation*} \min_{G}\max_{D}\ \underset{\mathbf{h}\sim\mathbb{P}_{r}}{\mathbb{E}}\left[\log\left(D(\mathbf{h})\right)\right]+\underset{\tilde{\mathbf{h}}\sim\mathbb{P}_{g}}{\mathbb{E}}\left[\log\left(1-D(\tilde{\mathbf{h}})\right)\right], \tag{21}\end{equation*}

which minimizes the Jensen-Shannon (JS) divergence between the true data distribution \mathbb{P}_{r} and the fake distribution \mathbb{P}_{g} defined by \tilde{\mathbf{h}}=G(\mathbf{z}), where \mathbf{z}\sim\mathbb{P}(\mathbf{z}) is sampled from a known distribution, e.g., a standard normal distribution, with its dimension much smaller than that of \mathbf{h}.
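In code, the two objectives in (21) are usually implemented as separate discriminator and generator losses; a common non-saturating variant (our own sketch, not necessarily the exact objective of the WGAN-GP baseline used later in Section IV) looks like:

```python
import torch
import torch.nn.functional as F

def gan_losses(d_real_logits, d_fake_logits):
    """Discriminator and (non-saturating) generator losses for the
    min-max game in (21), on raw discriminator logits."""
    loss_d = (F.binary_cross_entropy_with_logits(d_real_logits,
                                                 torch.ones_like(d_real_logits))
              + F.binary_cross_entropy_with_logits(d_fake_logits,
                                                   torch.zeros_like(d_fake_logits)))
    loss_g = F.binary_cross_entropy_with_logits(d_fake_logits,
                                                torch.ones_like(d_fake_logits))
    return loss_d, loss_g
```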

While GAN-based CE has demonstrated superior performance over traditional techniques [23], [24], [25], it faces several critical limitations. The primary drawback is its poor run-time efficiency, as GAN training is not optimized for CS tasks. Reconstruction with a GAN estimator is slow and sensitive to the initial latent vector that is sampled from a known distribution. Reconstruction of a single sample typically requires hundreds to thousands of gradient descent steps with multiple random restarts [38], making it impractical for THz CE where channel coherence time is limited to milliseconds. Furthermore, GAN training suffers from computational inefficiency, with substantial resources devoted to training a discriminator that is discarded after training. The model’s ability for CE is also difficult to evaluate during training, as the adversarial loss only indicates the discriminator’s perceived similarity between learned and real distributions.

To address these limitations while maintaining the advantages of the structural constraints imposed by generative neural networks, we propose to integrate the CS framework into the training process. This approach enables the neural network not only to learn the THz channel distribution but also to facilitate rapid inference by jointly optimizing the latent vector and training the generator.

2) Deep Compressed Sensing

We propose that training through the latent optimization process in (19) can enhance run-time efficiency while maintaining an equivalent level of estimation accuracy. This approach involves back-propagating through gradient descent to update the model parameters \vartheta in the direction that minimizes the estimation error within a small number of steps, as suggested by [32]. As mentioned earlier, the latent optimization of the GAN model requires hundreds or thousands of iterations. By refining this optimization process, we aim to obtain comparable outcomes with significantly fewer iterations. The DCS training process is shown in Figure 2 and the training algorithm in Algorithm 1.

FIGURE 2. DCS diagram: (a) model parameter update, (b) latent optimization.

Algorithm 1 DCS Training Algorithm

Require: Initial generator parameters \vartheta_{0}, minibatches of data \{\mathbf{h}_{i}\}_{i=1}^{m}, learning rates for inner and outer optimization \alpha_{1} and \alpha_{2}, fixed optimization step count T.

while \vartheta has not converged do
  for i = 1, \ldots, m do
    Make noiseless measurements of the channel: \tilde{\mathbf{y}}_{i} \gets \boldsymbol{\Phi}_{i}\mathbf{h}_{i}
    Sample \mathbf{z}_{i} \sim \mathbb{P}(\mathbf{z})
    for t = 1, \ldots, T do
      \mathbf{z}_{i} \gets \mathbf{z}_{i}-\alpha_{1}\nabla_{\mathbf{z}_{i}}{\mathcal{L}}_{\vartheta}(\tilde{\mathbf{y}}_{i}, \boldsymbol{\Phi}_{i}, \mathbf{z}_{i})
      \mathbf{z}_{i} \gets \mathbf{z}_{i}/\|\mathbf{z}_{i}\|_{2}
    end for
  end for
  {\mathcal{L}}_{G} = \frac{1}{m}\sum_{i=1}^{m}\mathrm{NMSE}[\mathbf{h}_{i}, G_{\vartheta}(\mathbf{z}_{i})]
  Compute {\mathcal{L}}_{F} using (26)
  \vartheta \gets \vartheta - \alpha_{2}\nabla_{\vartheta}({\mathcal{L}}_{G} + {\mathcal{L}}_{F})
end while

This approach incorporates a dual-loop framework designed to optimize both the latent vector \mathbf{z} and the model parameters \vartheta for effective CE tasks. The inner loop refines the latent vector \mathbf{z} through gradient descent, aiming to minimize the reconstruction error over a specified number of steps T as

\begin{equation*} \mathbf{z} \leftarrow \mathbf{z}-\alpha_{1}\nabla_{\mathbf{z}}{\mathcal{L}}_{\vartheta}(\tilde{\mathbf{y}}, \boldsymbol{\Phi}, \mathbf{z}), \tag{22}\end{equation*}

where \alpha_{1} is the learning rate for the inner optimization; the same learning rate is used in the testing stage. To enforce fast inference, the number of steps T is set to a small value and the optimization is performed in a single pass without random restarts. After each step, we normalize the optimization variable to project it onto the unit sphere, which creates a better latent distribution [51]:

\begin{equation*} \mathbf{z} \leftarrow \mathbf{z}/\|\mathbf{z}\|_{2}. \tag{23}\end{equation*}

The outer loop updates the model parameters \vartheta by back-propagating a loss function characterized by the task performance achieved after T iterations. The loss function of the outer loop consists of two parts. First, the model is trained to minimize the expected NMSE, formulated as

\begin{equation*} {\mathcal{L}}_{G}=\mathbb{E}_{\mathbf{h}\sim\mathbb{P}_{\mathrm{r}}}\left\{\mathrm{NMSE}\left[\mathbf{h}, G_{\vartheta}(\hat{\mathbf{z}})\right]\right\}, \tag{24}\end{equation*}

where \hat{\mathbf{z}} is obtained by T inner optimization steps. The outer loop uses the expected NMSE rather than the reconstruction loss as its criterion, leveraging the availability of ground-truth channel information during training to guide the updates of the model parameters \vartheta effectively.

However, merely minimizing (24) would fail, since the generator would exploit the measurement matrix \boldsymbol{\Phi} and map its output to the null space of \boldsymbol{\Phi}, leading to a small reconstruction error while containing useless information [48]. Thus, it is important to introduce the RIP [52], described as

\begin{equation*} (1-\delta)\|\mathbf{h}\|_{2}^{2} \leq \|\boldsymbol{\Phi}\mathbf{h}\|_{2}^{2} \leq (1+\delta)\|\mathbf{h}\|_{2}^{2}, \tag{25}\end{equation*}

where \delta\in(0,1) is a small constant. The RIP ensures that a projection matrix \boldsymbol{\Phi} approximately preserves the distance information of sparse signals by factors between 1-\delta and 1+\delta. This property typically holds with high probability for various types of random matrices \boldsymbol{\Phi} and for sparse signals \mathbf{h}. Thus, we enforce the RIP via training by adding the measurement loss {\mathcal{L}}_{F} to the objective, given by

\begin{align*} {\mathcal{L}}_{F}=\mathbb{E}_{\mathbf{h}_{1},\mathbf{h}_{2}\sim\{\mathbb{P}_{\mathrm{r}},\mathbb{P}_{\mathrm{g}}\}}\left[\left(\left\|\boldsymbol{\Phi}(\mathbf{h}_{1}-\mathbf{h}_{2})\right\|_{2}-\left\|\mathbf{h}_{1}-\mathbf{h}_{2}\right\|_{2}\right)^{2}\right], \tag{26}\end{align*}

where \mathbf{h}_{1} and \mathbf{h}_{2} are channel samples drawn either from the real channel dataset \mathbb{P}_{\mathrm{r}} or from the fake channel distribution \mathbb{P}_{\mathrm{g}}, so that the RIP holds for both real and generated channel distributions. A practical way is to randomly sample one channel realization from the dataset and two fake channel realizations at the beginning and end of the latent optimization. Thus, the overall loss function for the outer loop is {\mathcal{L}}_{G}+{\mathcal{L}}_{F}, which is used in the backpropagation process to update the model parameters, both enhancing estimation accuracy within only T steps and enforcing the RIP, as

\begin{equation*} \vartheta \gets \vartheta - \alpha_{2}\nabla_{\vartheta}({\mathcal{L}}_{G} + {\mathcal{L}}_{F}), \tag{27}\end{equation*}

where \alpha_{2} is the learning rate for the outer optimization.
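The following PyTorch sketch illustrates one outer update of Algorithm 1 (our own simplified rendering with real-valued tensors; for brevity, the L_F pair here uses only the two fake samples from the start and end of latent optimization, whereas the paper also samples a real realization):

```python
import torch

def dcs_outer_step(generator, opt_outer, phi, h_real, d_latent=100,
                   steps=20, lr_inner=0.1):
    """Differentiate through T inner latent steps (create_graph=True) so
    that the outer loss (27) reaches the generator parameters."""
    y = phi @ h_real                              # noiseless measurements
    z = torch.randn(1, d_latent, requires_grad=True)
    h_start = generator(z).flatten()              # fake sample at start of latent opt.
    for _ in range(steps):
        loss_rec = torch.sum((y - phi @ generator(z).flatten()) ** 2)
        (g,) = torch.autograd.grad(loss_rec, z, create_graph=True)
        z = z - lr_inner * g                      # inner step (22), kept in the graph
        z = z / z.norm()                          # projection (23)
    h_hat = generator(z).flatten()
    loss_g = torch.sum((h_real - h_hat) ** 2) / torch.sum(h_real ** 2)  # NMSE term (24)
    diff = h_start - h_hat
    loss_f = (torch.norm(phi @ diff) - torch.norm(diff)) ** 2           # RIP term (26)
    opt_outer.zero_grad()
    (loss_g + loss_f).backward()                  # outer update (27)
    opt_outer.step()
```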

The overall DCS algorithm for THz CE is given in Algorithm 2. During the offline phase, we train the DCS model using a THz channel dataset. Then, the model is uploaded to the Rx for online channel inference without local training.

Algorithm 2 Channel Estimation via DCS

Require: Channel dataset.

Train the DCS model using Algorithm 1
Extract the model G_{\vartheta}
for each channel coherence time do
  Sample \mathbf{z} \sim \mathbb{P}(\mathbf{z})
  for t = 1, \ldots, T do
    \mathbf{z} \gets \mathbf{z}-\alpha_{1}\nabla_{\mathbf{z}}{\mathcal{L}}_{\vartheta}(\tilde{\mathbf{y}}, \boldsymbol{\Phi}, \mathbf{z})
    \mathbf{z} \gets \mathbf{z}/\|\mathbf{z}\|_{2}
  end for
  \hat{\mathbf{h}} = G_{\vartheta}(\mathbf{z})
end for

3) Remarks

The proposed CE framework learns the channel distribution via a generative model that captures the underlying statistical properties of THz channels. More importantly, by constraining the number of optimization steps T to be small, we enforce an implicit optimization of the generator’s latent space. In contrast, the latent space of a GAN is not optimized specifically for CE tasks, resulting in unpredictable convergence behavior with varying numbers of optimization steps and high sensitivity to initial points. Moreover, our model waives the training of a discriminator, which typically consumes the majority of computing resources in GAN training while being discarded during the inference phase. Lastly, our training loss function provides insights into how the model performs in CE tasks since the NMSE is directly incorporated into the optimization objective, indicating the estimation accuracy. It is hard to predict the performance of a GAN during the training phase simply because the adversarial loss only indicates the discriminator’s perceived similarity between real and learned channel distribution, without directly measuring the estimation accuracy.

SECTION IV.

Performance Evaluation

In this section, we evaluate the proposed DCS framework in terms of convergence, estimation error, and run-time efficiency. The estimation error is evaluated using the NMSE, given as

\begin{equation*} \mathrm{NMSE}=\frac{\sum_{k=0}^{K-1}\|\mathbf{h}[k]-\hat{\mathbf{h}}[k]\|_{2}^{2}}{\sum_{k=0}^{K-1}\|\mathbf{h}[k]\|_{2}^{2}}. \tag{28}\end{equation*}
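For reference, a direct PyTorch implementation of this metric (our own helper; the name is hypothetical) is:

```python
import torch

def nmse_db(h_true: torch.Tensor, h_est: torch.Tensor) -> torch.Tensor:
    """NMSE over all K subcarriers as in (28), returned in dB."""
    num = torch.sum(torch.abs(h_true - h_est) ** 2)
    den = torch.sum(torch.abs(h_true) ** 2)
    return 10.0 * torch.log10(num / den)
```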

A. Dataset Preparation

Since no standardized datasets exist yet, several THz channel simulators or community datasets can be adopted to obtain training data, e.g., TeraMIMO [8], Remcom Wireless InSite [53], and DeepMIMO [54].

In this paper, we adopt TeraMIMO, a statistical THz channel simulator built on the measurement data in [55], [56] for the low THz band. The main simulation parameters are summarized in Table 2. In our setup, the Tx and Rx antenna apertures D_{\mathrm{t}} and D_{\mathrm{r}} are 14\lambda and 12\lambda, respectively. Thus, the near-field threshold is given by [7]

\begin{equation*} \frac{2(D_{\mathrm{t}}+D_{\mathrm{r}})^{2}}{\lambda} = 1.35\,\mathrm{m}. \tag{29}\end{equation*}

We choose the distance between Tx and Rx to be 1 m to introduce the near-field effect. For near-field considerations, the original TeraMIMO employs a spherical wave model at the SA level while adopting a planar wave model at the AE level, owing to its compact footprint, to reduce complexity. We modified it to use an AE-level spherical wave model for higher modeling accuracy. In the simulator's default setting, the Tx and Rx antenna arrays are parallel. To simulate more complex wireless environments, we allow the Rx array to rotate along all three axes with a maximum permitted angle of \dot{\phi}=\pi/6 in each dimension. Molecular absorption is also considered in our dataset, based on the high-resolution transmission molecular absorption (HITRAN) database [57]. In contrast to previous studies that assume a fixed number of NLoS paths, our channel dataset employs independent Poisson processes to model the arrival rates of clusters and rays [8], [43]. Specifically, we consider a cluster arrival rate \Lambda = 0.13~\mathrm{nsec}^{-1} and a ray arrival rate \dot{\Lambda}=0.37~\mathrm{nsec}^{-1} within a time margin of T_{\mathrm{m}} = 50~\mathrm{nsec}. Thus, the number of NLoS paths L during the time margin follows a Poisson distribution L \sim \mathrm{Pois}(\dot{\Lambda}T_{\mathrm{m}}). Hence, the expectation and variance of the number of NLoS paths are \dot{\Lambda}T_{\mathrm{m}}= 18.5, which is less than the number of antennas \min\{N_{t}, N_{r}\}=64 and makes the considered THz channels statistically sparse in the angular domain.
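Drawing the per-realization path count then reduces to a one-line Poisson sample (an illustrative snippet with the rates above):

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.poisson(lam=0.37 * 50.0)  # L ~ Pois(ray_rate * T_m), E[L] = Var[L] = 18.5
```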

TABLE 2. Main System Parameters.

We use TeraMIMO to generate the channel dataset in the frequency domain, where a channel realization is of size (N, K, N_{\mathrm{r}}, N_{\mathrm{t}}). Since DL algorithms work only on real-valued samples, we separate the real and imaginary parts and stack them along the second dimension, i.e., the size of the real-valued channel is (N, 2K, N_{\mathrm{r}}, N_{\mathrm{t}}), which is then used in PyTorch [58]. The channel dataset is similar to a typical image dataset, with the second dimension playing the role of RGB color channels and the last two dimensions being spatial. Before feeding the data into the training process, we perform channel-wise² normalization, as is common in image processing. More precisely, for a given dataset \mathbf{H} of size N, the c-th channel of the dataset is normalized as

\begin{align*} \mathbf{H}[:,c,:,:] \gets \frac{\mathbf{H}[:,c,:,:]-\mathrm{mean}[c]}{\mathrm{std}[c]}, \quad \forall c = 1, 2,\ldots, 2K, \tag{30}\end{align*}

where \mathrm{mean}[c] and \mathrm{std}[c] are computed as

\begin{align*} \mathrm{mean}[c] &= \frac{\sum_{n_{\mathrm{t}}}^{N_{\mathrm{t}}}\sum_{n_{\mathrm{r}}}^{N_{\mathrm{r}}}\sum_{n}^{N}\mathbf{H}[n,c,n_{\mathrm{r}},n_{\mathrm{t}}]}{N_{\mathrm{t}}N_{\mathrm{r}}N}, \tag{31}\\ \mathrm{std}[c] &= \sqrt{\frac{\sum_{n_{\mathrm{t}}}^{N_{\mathrm{t}}}\sum_{n_{\mathrm{r}}}^{N_{\mathrm{r}}}\sum_{n}^{N}\left|\mathbf{H}[n,c,n_{\mathrm{r}},n_{\mathrm{t}}]-\mathrm{mean}[c]\right|^{2}}{N_{\mathrm{t}}N_{\mathrm{r}}N-1}}. \tag{32}\end{align*}
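In PyTorch, this channel-wise normalization reduces to per-channel statistics over all but the second dimension (a minimal sketch of (30)-(32); the helper name is ours):

```python
import torch

def normalize_channels(h: torch.Tensor) -> torch.Tensor:
    """Channel-wise normalization (30)-(32) of a real-valued dataset
    of shape (N, 2K, N_r, N_t)."""
    mean = h.mean(dim=(0, 2, 3), keepdim=True)
    std = h.std(dim=(0, 2, 3), keepdim=True)  # unbiased (N-1) denominator, cf. (32)
    return (h - mean) / std
```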

To generate the measurement matrix \boldsymbol{\Phi}, we draw the pilot symbols \mathbf{s}_{m}[k] from a zero-mean circularly-symmetric Gaussian distribution [19]. Since a linear combination of Gaussian random variables remains Gaussian, we set the digital precoder \mathbf{F}_{\mathrm{BB}} and combiner \mathbf{C}_{\mathrm{BB}} to identity matrices without loss of generality. The phase shifts of the analog precoder and combiner are random with one-bit resolution [23].
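A sketch of this measurement setup for a single pilot slot (our own illustration; the block-diagonal AoSA structure of (2) is omitted for brevity) could be:

```python
import numpy as np

def random_pilot_measurement(n_t, n_r, n_s, rng=None):
    """Gaussian pilot, identity digital precoder/combiner, and one-bit
    random analog phases; returns Phi_m = x_m^T kron C_m^H, cf. (13)."""
    if rng is None:
        rng = np.random.default_rng(0)
    s = (rng.standard_normal(n_s) + 1j * rng.standard_normal(n_s)) / np.sqrt(2)
    f_rf = np.exp(1j * np.pi * rng.integers(0, 2, (n_t, n_s))) / np.sqrt(n_t)  # phases in {0, pi}
    c_rf = np.exp(1j * np.pi * rng.integers(0, 2, (n_r, n_s))) / np.sqrt(n_r)
    x_m = f_rf @ s                                # transmitted pilot x_m
    return np.kron(x_m[None, :], c_rf.conj().T)   # Phi_m
```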

B. Neural Network Architecture & Training

We use the deep convolutional GAN (DCGAN) [59] architecture for the generator and discriminator and provide a general neural network structure that can be used for different input sizes. The neural network architectures of the generator G and discriminator D are summarized in Table 3, following standard PyTorch notation. In image processing tasks, a \tanh activation function is typically applied at the last layer of the generator to scale the output to (-1,1), which is then converted to [0,255] to display RGB colors [59]. However, there is no such constraint on the elements of the channel matrix; hence, we remove this activation layer. The WGAN-GP algorithm is used to train the GAN, so we use instance normalization in place of the commonly used batch normalization [60]. To fairly compare the performance of the proposed estimator, we use the same generator model for both GAN and DCS; the discriminator is used only in the GAN model, since DCS does not require one.
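A DCGAN-style generator without the final activation can be sketched as follows (a minimal illustration; the layer sizes are placeholders, not the exact Table 3 configuration):

```python
import torch.nn as nn

class Generator(nn.Module):
    """DCGAN-style generator sketch with instance normalization (matching
    the WGAN-GP setup); the final tanh is removed since channel-matrix
    entries are unbounded."""
    def __init__(self, d_latent=100, out_ch=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(d_latent, 256, 4, 1, 0), nn.InstanceNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.InstanceNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.InstanceNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, out_ch, 4, 2, 1),  # no output activation
        )

    def forward(self, z):
        # z: (batch, d_latent) -> (batch, d_latent, 1, 1) feature map
        return self.net(z.view(z.size(0), -1, 1, 1))
```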

TABLE 3. Neural Network Architecture.

The optimized hyperparameters used in our experiments are summarized in Table 4. We use the Adam optimizer [49] for all optimization processes. Both models use a learning rate of 0.1 for the latent optimization and 0.0002 for the model update. Although we use the same batch size for both models, the GAN model is trained for more epochs because, empirically, it takes longer to converge. For GAN training, we update the discriminator (critic) five times for every generator update (n_{\mathrm{critic}} = 5).

TABLE 4. Hyperparameters.

All experiments are performed in Ibex computing clusters at King Abdullah University of Science and Technology (KAUST) with an AMD EPYC 7713P 64-core CPU and an NVIDIA A100 GPU. The dataset is generated in a MATLAB (R2022a) environment, while the training and testing are implemented in a PyTorch environment.

C. Numerical Results

1) Latent Dimension

To evaluate the pure compression capability of our generative model, we first perform experiments at high SNR (40 dB), where measurement noise has a negligible impact. This quasi-noiseless setting allows us to isolate and analyze the fundamental trade-off between latent dimension and reconstruction accuracy. Figure 3 shows the relationship between NMSE and compression ratio \rho = N_{p}/N_{\mathrm {t}} for different latent dimensions d_{l} . As shown in the figure, while larger latent dimensions generally yield better performance, the improvement diminishes beyond d_{l} = 100 . Specifically, d_{l} = 120 only marginally outperforms d_{l} = 100 by approximately 0.2 dB when \rho \geq 0.11 , while both outperform smaller dimensions of d_{l} = 50 and d_{l} = 80 . Moreover, the NMSE curves flatten after \rho =0.15 for all configurations, suggesting that additional pilots offer diminishing returns. This indicates that d_{l} = 100 provides a good trade-off between compression efficiency and reconstruction accuracy, effectively representing a 32\times 256\times 8 complex channel (262144 real values) using just 100 parameters.

FIGURE 3. NMSE of DCS as a function of pilot length for various latent dimensions d_{l}\in\{50,80,100,120\}.

2) Optimization Steps

We investigate the impact of the number of optimization steps T on the gradient descent process during inference. As shown in Figure 4, increasing the number of steps from 5 to 20 improves the NMSE across all SNR regimes, indicating better convergence to the optimal latent vector. However, beyond 20 steps, the improvement diminishes or even degrades at low SNR. To better understand this degradation, we examine the convergence trajectories for a testing sample using 40 steps at SNR = 0, −10, and −20 dB, as shown in Figure 5. At SNR = 0 dB, the NMSE steadily improves until around step 20 and then gradually converges to −16 dB. When the SNR decreases to −10 dB, the trajectory is similar to that at SNR = 0 dB before 20 steps, though it converges to a slightly higher NMSE of around −14 dB and exhibits minor fluctuations after convergence. However, at SNR = −20 dB, while the optimization initially improves until step 17, the performance degrades afterwards, with the NMSE fluctuating between −8 and −6 dB. This degradation occurs because measurement noise corrupts the gradient information, creating multiple local minima in the optimization landscape. Extended optimization under noisy conditions can cause the solution to drift away from good local minima or oscillate between multiple suboptimal points. These results indicate that 20 steps achieve an optimal trade-off between convergence and stability, delivering −14 dB NMSE at high SNR while maintaining robustness across different noise levels.

FIGURE 4. NMSE of DCS for different inner optimization steps.

FIGURE 5. Optimization trajectory with 40 steps at SNR = 0, −10, and −20 dB.

3) Pilot Length

To evaluate pilot efficiency, we examine the performance of our model with varying pilot lengths during inference while keeping the training pilot length fixed at N_{p}=100. This training strategy offers several practical advantages. First, it enables deployment flexibility: network operators can dynamically adjust the pilot length based on SNR conditions or overhead constraints without retraining the model. Second, maintaining a single model reduces storage requirements and simplifies deployment compared to training a separate model for each pilot length.

Figure 6 demonstrates both the strong generalization capacity and the pilot efficiency of our approach across different pilot lengths. Even at a low SNR of −5 dB, the model performs well with N_{p}\in\{50, 100, 200\}, maintaining a maximum NMSE gap of only 1 dB between configurations. The proposed approach enables accurate THz UM-MIMO channel estimation at a compression ratio of \rho = N_{p}/N_{\mathrm{t}} = 50/256 \approx 0.2, i.e., with only N_{p}=50 pilots. In particular, our method achieves good performance with significantly fewer pilots than recent works such as [19], which requires N_{p}=1000 pilots for N_{\mathrm{t}}=512 Tx antennas. While N_{p}=50 demonstrates the model's capability for extreme compression, N_{p}=100 pilots provide an effective trade-off between estimation accuracy and pilot overhead for practical deployments.
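To make the deployment-flexibility point concrete, the sketch below reuses one trained generator across pilot lengths; the random measurement matrix and the linear stand-in generator are our own illustration, not the paper's pilot design.

import torch

# The trained generator is reused unchanged; only the measurement matrix Phi
# changes its row count with N_p. Sizes follow the dimensions quoted in the text.
N_t, N_r, N_s, d_l = 256, 32, 8, 100
G = torch.nn.Linear(d_l, N_t * N_r)          # stand-in for the trained G
h = torch.randn(N_t * N_r)                   # stand-in vectorized channel

for N_p in (10, 30, 50, 100, 200):
    Phi = torch.randn(N_s * N_p, N_t * N_r) / (N_t * N_r) ** 0.5
    y = Phi @ h                              # noiseless measurements
    # ...run the inference loop sketched above with this (Phi, y);
    # no retraining of G is required.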

FIGURE 6. NMSE performance for various pilot lengths.

In the following experiments, unless otherwise stated, we use a latent dimension of 100, 20 optimization steps, and a pilot length of 100 by default.

4) Random Restarts

We show an example of channel inference using DCS and GAN in Figure 7. We test both models with five random restarts, where each restart takes 100 steps for GAN and 20 steps for DCS. It is clear from Figure 7 that the GAN estimator is highly dependent on the initial point. While the second, third, and fifth starting points result in an NMSE of around −13 dB, inference from the first and fourth initial points exhibits an extremely high estimation error of around 0 dB. This demonstrates why random restarts are necessary for the GAN estimator: the impact of bad initial points is detrimental. In contrast, our model is barely sensitive to the initial point, with the NMSE for all restarts showing no measurable difference. Thus, our model does not need random restarts, which is key to reducing the total number of steps.
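The restart strategy for the GAN baseline can be sketched as follows; keeping the candidate with the smallest measurement residual is our assumed selection rule (the true channel is unknown at inference time), and all names are placeholders.

import torch

# GAN-style inference with R random restarts of T steps each.
def gan_infer_restarts(G, Phi, y, d_l=100, R=10, T=50, lr=0.1):
    best, best_res = None, float("inf")
    for _ in range(R):
        z = torch.randn(d_l, requires_grad=True)   # random initial point
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(T):
            opt.zero_grad()
            loss = torch.sum((y - Phi @ G(z)) ** 2)
            loss.backward()
            opt.step()
        with torch.no_grad():
            est = G(z)
            res = float(torch.sum((y - Phi @ est) ** 2))
        if res < best_res:                         # keep best residual
            best, best_res = est, res
    return best                                    # R*T steps total, vs. T for DCS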

FIGURE 7. An example NMSE trajectory with 5 random restarts for GAN and DCS, showing that the GAN estimator is highly dependent on initial points.

For the sake of fairness, we compare our DCS estimator to GAN under different configurations, as shown in Table 5. The average NMSE is evaluated at SNR = 0 dB. With an identical configuration (20 steps and 1 trial), GAN achieves −1.61 dB NMSE, which is practically unusable compared to −14.33 dB for DCS. Although increasing both the steps per restart and the number of restarts enhances GAN's performance, we observe diminishing returns beyond 50 steps with 10 restarts (500 total steps). The improvement from 500 to 1000 total steps is merely 0.11 dB, which does not justify doubling the computational cost. Therefore, we adopt the configuration of 50 steps and 10 restarts for GAN in subsequent experiments. Even with this optimized setting, GAN's performance (−11.75 dB) still falls short of DCS by approximately 2.6 dB, despite using 25 times more steps. Even with just 10 steps, DCS achieves −12.82 dB, surpassing GAN by 1 dB while using only 2% of the steps. These results clearly demonstrate that DCS not only provides better estimation accuracy but also does so with remarkably fewer steps than the GAN-based estimator.

TABLE 5. Average NMSE Evaluated at SNR = 0 dB for DCS and GAN With Different Configurations

5) Estimation Accuracy

We compare our proposed DCS estimator with several benchmarks: standard LS, LMMSE with estimated second-order statistics, OMP with an on-grid angular dictionary, and the pre-trained GAN estimator with 10 random restarts and 50 steps per restart. Our model is evaluated with 20 steps and no random restarts. As shown in Figure 8, our method demonstrates superior performance across the entire SNR range from −20 dB to 15 dB. Consistent with [19], OMP with an on-grid angular dictionary performs poorly for THz channels, especially in the near field, showing worse performance than LMMSE. Both LS and OMP lag significantly behind, with nearly 20 dB higher NMSE at low SNR and over a 12 dB gap at high SNR. Our method outperforms LMMSE by approximately 8 dB in NMSE while avoiding the computationally expensive matrix inversion of LMMSE. While the GAN achieves similar NMSE at −20 dB SNR, our method shows increasing advantages as the SNR improves, maintaining at least a 2.5 dB lead from −5 dB SNR onward. Notably, since GAN and DCS use identical generator neural network architectures, the computation per step is equivalent, and this superior performance is achieved with only 20 optimization steps compared to GAN's 500, demonstrating both better accuracy and higher efficiency.
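For reference, the two classical baselines admit compact closed forms under the vectorized measurement model \tilde{\mathbf{y}} = \boldsymbol{\Phi}\mathbf{h} + \mathbf{n}. The sketch below uses textbook expressions; the estimation of the channel covariance R_h and noise variance sigma2 in our experiments follows the cited baselines, not this illustration.

import torch

# Textbook LS and LMMSE estimators for y = Phi @ h + n.
def ls_estimate(Phi, y):
    # LS: pseudo-inverse solution, no prior channel statistics
    return torch.linalg.pinv(Phi) @ y

def lmmse_estimate(Phi, y, R_h, sigma2):
    # LMMSE: requires channel covariance R_h and noise variance sigma2
    A = Phi @ R_h @ Phi.conj().T + sigma2 * torch.eye(Phi.shape[0], dtype=Phi.dtype)
    return R_h @ Phi.conj().T @ torch.linalg.solve(A, y)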

FIGURE 8. NMSE performance for different estimators.

6) Computation Complexity

We summarize the computation complexity of the aforementioned CE algorithms in Table 6. The computation complexity of the DCS model in the online inference stage mainly comes from computing the gradient in (22). The gradient can be written as
\begin{equation*} \nabla_{\mathbf{z}}\left\|\tilde{\mathbf{y}}-\boldsymbol{\Phi}G_{\vartheta}(\mathbf{z})\right\|_{2}^{2} = -2\left(\tilde{\mathbf{y}}-\boldsymbol{\Phi}G_{\vartheta}(\mathbf{z})\right)^{\top}\boldsymbol{\Phi}\,\nabla_{\mathbf{z}}G_{\vartheta}(\mathbf{z}). \tag{33}\end{equation*}

TABLE 6. Complexity of Different Methods of CE

To compute the forward pass, i.e., G_{\vartheta}(\mathbf{z}), the input goes through a linear layer of complexity \mathcal{O}(d_{l}N_{\mathrm{t}}N_{\mathrm{r}}) followed by four transposed convolutional layers, each with complexity \mathcal{O}(F_{\kappa}^{2}N_{\kappa-1}N_{\kappa}P_{\kappa}), where F_{\kappa} is the filter size, N_{\kappa} is the number of filters, and P_{\kappa} is the number of output pixels of the \kappa-th transposed convolutional layer. Thus, the total computation complexity of the forward pass is \mathcal{O}(d_{l}N_{\mathrm{t}}N_{\mathrm{r}} + \sum_{\kappa=1}^{4}F_{\kappa}^{2}N_{\kappa-1}N_{\kappa}P_{\kappa}). Since d_{l}\ll N_{\mathrm{t}}N_{\mathrm{r}} and the intermediate layers are smaller than the final output of G_{\vartheta}, i.e., P_{\kappa}\leq N_{\mathrm{t}}N_{\mathrm{r}}, the complexity of the forward pass simplifies to \mathcal{O}(N_{\mathrm{t}}N_{\mathrm{r}}). The complexity of the backward pass is typically about twice that of the forward pass [61]; hence, the total complexity to compute G_{\vartheta}(\mathbf{z}) and \nabla_{\mathbf{z}}G_{\vartheta}(\mathbf{z}) in (33) is \mathcal{O}(N_{\mathrm{t}}N_{\mathrm{r}}). G_{\vartheta}(\mathbf{z}) and \nabla_{\mathbf{z}}G_{\vartheta}(\mathbf{z}) are then multiplied by the measurement matrix \boldsymbol{\Phi}\in\mathbb{C}^{N_{s}N_{p}\times N_{\mathrm{t}}N_{\mathrm{r}}}, which yields \mathcal{O}(N_{s}N_{p}N_{\mathrm{t}}N_{\mathrm{r}}) complexity. The complexity of the remaining vector subtraction and inner product is negligible. Thus, the computation complexity of a single DCS iteration is \mathcal{O}(N_{s}N_{p}N_{\mathrm{t}}N_{\mathrm{r}}). Since each iteration depends on the previous one and has the same cost, the total complexity is T times that of a single iteration, i.e., \mathcal{O}(TN_{s}N_{p}N_{\mathrm{t}}N_{\mathrm{r}}). For the GAN-based estimator with the same generator architecture as DCS, the per-iteration complexity is identical. Considering T^{\prime} steps and R random restarts, the total computation complexity of the GAN-based estimator is \mathcal{O}(RT^{\prime}N_{s}N_{p}N_{\mathrm{t}}N_{\mathrm{r}}). Meanwhile, the LS and LMMSE algorithms require matrix inversion, giving a computation complexity of \mathcal{O}(N_{\mathrm{t}}^{3}N_{\mathrm{r}}^{3}). The complexity of OMP is \mathcal{O}(SN_{\mathrm{t}}N_{\mathrm{r}}), where S reflects the channel sparsity.
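As a sanity check on (33), the analytic gradient can be verified against automatic differentiation with a toy generator; all sizes below are illustrative (the real generator uses transposed convolutions), and the code is our own sketch.

import torch

# Numerical check of (33): the analytic gradient
# -2 * J_G(z)^T Phi^T (y - Phi G(z)) matches PyTorch autograd.
torch.manual_seed(0)
d_l, n, m = 4, 32, 8
G = torch.nn.Sequential(torch.nn.Linear(d_l, 16), torch.nn.Tanh(),
                        torch.nn.Linear(16, n))
Phi, y = torch.randn(m, n), torch.randn(m)
z = torch.randn(d_l, requires_grad=True)

loss = torch.sum((y - Phi @ G(z)) ** 2)
(g_auto,) = torch.autograd.grad(loss, z)

J = torch.autograd.functional.jacobian(lambda v: G(v), z)  # (n, d_l) Jacobian
r = (y - Phi @ G(z)).detach()                              # measurement residual
g_manual = -2.0 * J.T @ (Phi.T @ r)

assert torch.allclose(g_auto, g_manual, atol=1e-5)         # gradients agree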

From this complexity analysis, the traditional LS and LMMSE methods exhibit cubic complexity \mathcal{O}(N_{\mathrm{t}}^{3}N_{\mathrm{r}}^{3}), making them computationally prohibitive for large-scale MIMO systems. While OMP has lower complexity \mathcal{O}(SN_{\mathrm{t}}N_{\mathrm{r}}), its performance depends heavily on the channel sparsity S. Both the GAN-based and DCS approaches have complexity \mathcal{O}(N_{\mathrm{t}}N_{\mathrm{r}}) per optimization step, but DCS achieves better runtime efficiency by eliminating the R random restarts and requiring fewer optimization steps (T \leq T^{\prime}) to converge to a local minimum. As shown in Table 5, with R=10 and T^{\prime}=50 for GAN, DCS achieves 2.6 dB better NMSE at 0 dB SNR while reducing the number of steps by 96%. Even compared to a lighter GAN configuration (R=10, T^{\prime}=20), DCS still provides 3.3 dB better NMSE at 0 dB SNR with 90% fewer steps. By eliminating random restarts, our model achieves one order of magnitude lower computational complexity while providing around 3 dB better estimation accuracy.

7) Training Convergence

The convergence performance of the proposed DCS estimator is shown in Figure 9 and Figure 10, which plot the training and testing loss as a function of the iteration count and the NMSE of the model for various training epochs, respectively. One epoch refers to a complete training cycle over the entire training dataset. With 5000 training data points and a batch size of 100, one epoch in Figure 10 corresponds to 50 iterations in Figure 9. As shown in Figure 9, the training and testing losses align well, indicating that our model does not overfit. Furthermore, our proposed model converges faster than the GAN estimator. With only 15 training epochs, DCS already achieves an NMSE equivalent to that of the fully converged GAN estimator. DCS reaches its best performance around epoch 150 and remains stable thereafter, while GAN requires 600 epochs and exhibits larger ongoing fluctuations. This significantly reduces training overhead compared to the GAN estimator, with 4 times faster convergence. The fast convergence is due to the task-aware nature of our proposed model: the model learns fast inference within a limited number of steps during training. In contrast, GAN training only focuses on learning the distribution of the channel dataset, leaving the inference task aside and making inference computationally expensive. The slow convergence of GAN also stems from the substantial computation spent training a discriminator, which is no longer needed after training.
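The task-aware behavior described above can be illustrated with a simplified training step in which the generator weights are updated on the loss obtained after the T-step inner latent optimization. This is a conceptual sketch under our own assumptions (the outer loss and the plain gradient-descent inner loop in particular), not the paper's exact algorithm.

import torch

# One task-aware training step: run the inner latent loop first, then
# update the generator so that fast inference is learned during training.
def dcs_train_step(G, opt_model, Phi, y_batch, h_batch, d_l=100, T=20, lr_z=0.1):
    opt_model.zero_grad()
    batch_loss = 0.0
    for y, h in zip(y_batch, h_batch):
        z = torch.zeros(d_l, requires_grad=True)
        for _ in range(T):                           # inner latent optimization
            inner = torch.sum((y - Phi @ G(z)) ** 2)
            (g,) = torch.autograd.grad(inner, z)     # gradient w.r.t. z only
            with torch.no_grad():
                z -= lr_z * g
        batch_loss = batch_loss + torch.sum((G(z.detach()) - h) ** 2)
    batch_loss.backward()                            # model update (Adam, lr=2e-4)
    opt_model.step()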

FIGURE 9. Training and testing loss of the DCS model.

FIGURE 10. NMSE of GAN and DCS at 0 dB SNR for varying training epochs.

Another merit of our proposed model is a more meaningful loss function directly tied to the inference task, where a lower loss indicates better model performance. In contrast, as shown in Figure 11, the baseline GAN model lacks a task-specific loss due to the adversarial game between the generator and the discriminator. The adversarial losses merely indicate the dynamics between the generator and discriminator during training and do not directly reflect channel estimation accuracy. This is evident when comparing Figure 11 with Figure 10: despite the fluctuating adversarial losses, the NMSE steadily improves and converges to −11.56 dB after 600 epochs. This disconnect between the adversarial training objectives and the actual channel estimation performance highlights a limitation of the GAN approach, as the model optimization is not directly guided by channel estimation accuracy metrics.

FIGURE 11. Training loss of the GAN model.

8) Generalization Capability

In [23], the generalization capability of the GAN estimator was evaluated by testing with varying numbers of clusters and rays. While our dataset inherently incorporates random numbers of clusters and rays, we extend the generalization analysis to two additional out-of-distribution (OOD) scenarios: measurement distribution shift and channel distribution shift.

For the measurement distribution shift, as shown in Figure 6, we train our model with N_{p}=100 pilots but test it across N_{p}\in\{10, 30, 50, 100, 200\}. The results demonstrate robust performance, with a maximum NMSE gap of only 1 dB between configurations at SNRs above −5 dB. For the channel distribution shift, we evaluate our model under the scenarios listed in Table 7, using 10 online optimization steps. When tested on LoS-only scenarios despite being trained on both LoS and NLoS conditions, the model achieves better performance, with an NMSE of −16.55 dB compared to −12.82 dB, attributable to the simpler channel structure. The model remains stable under antenna spacing (d_{\mathrm{AE}}) variations from \lambda_{c}/2 to \lambda_{c}/5 or \lambda_{c}/10, with NMSE degradation below 0.6 dB. For maximum receiver rotation angle shifts from \dot{\phi}=\pi/6 to \dot{\phi}=2\pi/9, the performance decrease of 1.6 dB remains practically acceptable.

TABLE 7. Generalization Performance Under Different Scenarios

It is worth noting that OOD generalization is not a critical concern for our approach. As a model-free estimation scheme, the generative model learns directly from data without requiring explicit channel models. This means the model can be efficiently retrained when deployment scenarios differ significantly from the training distribution, providing a flexible solution for practical applications.

SECTION V.

Conclusion

In this paper, we propose a DCS model for THz UM-MIMO CE. Unlike the GAN-based estimator, our model is tailored to the CE task and inherently learns fast inference without random restarts by jointly optimizing the latent vector and network parameters during training. The proposed DCS model outperforms conventional techniques, with at least 8 dB lower NMSE. In addition, our model overcomes key drawbacks of the GAN-based channel estimator while achieving even better performance. First, DCS provides around 3 dB lower estimation error than the GAN-based estimator, with one order of magnitude lower computation complexity. Second, we design an informative loss function that allows easier model evaluation during training, addressing a common challenge with GANs. Finally, our model empirically achieves 4 times faster training convergence than GAN. In future work, we plan to extend our framework to end-to-end system performance metrics such as the bit error rate (BER) through hardware-aware co-design, considering practical aspects of beamforming and decoding schemes in resource-constrained systems.
