Deep-Learning-Based Robust Channel Estimation for MIMO IoT Systems

When the second-order statistics of channel and noise, such as their covariance matrices, are not exactly known, the acquisition of accurate channel state information (CSI) for a wireless propagation environment becomes quite challenging. In this article, we tackle the problem of robust channel estimation for multiple-input–multiple-output (MIMO)-aided Internet of Things (IoT) systems in the presence of uncertainties in the channel and noise covariance matrices. Our goal is to minimize the mean square error (MSE) of the channel estimation under the channel and noise covariance uncertainties by jointly optimizing the channel estimator and pilot signal, which is however highly nonconvex and mathematically intractable. To effectively and intelligently cope with this issue, we exploit a deep learning (DL) technique and propose a novel network architecture with two modules, namely, the pilot optimizer and channel predictor, both of which are designed by neural networks with their own local connections and weight sharings. Moreover, a novel and effective training strategy for the proposed DL model is devised in a self-supervised manner, in which samples obtained by properly compensated channel and noise covariance matrices are utilized to overcome any adverse impacts of the underlying uncertainties on the channel estimation. Through extensive numerical results simulated in realistic propagation environments, we substantiate the superior performance and effectiveness of the proposed scheme.


Jae-Mo Kang, Member, IEEE

I. INTRODUCTION
Internet of Things (IoT) is now becoming an essential part of ubiquitous connections between various devices and services, supporting seamless interactions in our daily lives [1], [2], [3]. In particular, the extensive connectivity of IoT, together with the substantial data collected by diverse devices, will be very useful in many deployment scenarios, such as smart cities, smart homes, smart factories, and smart transportation [2], [3]. However, the rapid proliferation of IoT and its applications in various fields has led to ever-increasing demands for reliable wireless communications. Multiple-input-multiple-output (MIMO), which exploits the spatial diversity of a wireless channel through the use of multiple antennas at both the transmitter and receiver [4], [5], has recently emerged as a key enabling technology to address such challenges [1], [2]. Accordingly, MIMO-aided IoT systems have great potential to substantially enhance the reliability, coverage, and energy efficiency of communication links. However, the merits that MIMO IoT systems can offer are fully realizable only when accurate channel state information (CSI) is available. In practice, the CSI needs to be acquired by sending an a priori known pilot (or training) signal, and the accuracy of the resulting CSI estimate critically relies on how well the channel training process is designed according to the statistical knowledge of a given wireless propagation environment, such as second-order statistics or covariance matrices of channel and noise [6], [7], [8], [9], [10], [11], [12], [13].
The author is with the Department of Artificial Intelligence, Kyungpook National University, Daegu 41566, South Korea (e-mail: jmkang@knu.ac.kr).
Digital Object Identifier 10.1109/JIOT.2023.3324667
In practice, the channel and noise covariance matrices also need to be estimated from real samples of the channel estimate and noise measurement, respectively. As a result, the estimates of these covariance matrices are inevitably erroneous (especially in extreme propagation environments, such as unmanned aerial vehicle (UAV) or satellite communication scenarios with high mobility) due to imperfections in the channel estimation and noise measurement processes [14], [15], [16], [17], [18], [19], [20], [21], [22]. Moreover, the estimated covariance matrices may be further distorted by operations such as quantization, compression, data embedding, and feedback transmission, inducing additional discrepancies with the actual values [17], [18], [19], [20], [21], [22]. With knowledge of only the estimated channel and noise covariance matrices, a traditional signal processing approach, such as the linear minimum mean square error (LMMSE) channel estimation method, may fail to accurately estimate the CSI due to the mismatches between the actual and estimated covariance matrices [19]. Even though the CSI can be estimated without any knowledge of the channel and noise statistics, such as via the least squares (LS) channel estimation method, the resulting performance might not be satisfactory, especially at low signal-to-noise ratio (SNR), due to noise amplification [8]. To properly cope with these issues while guaranteeing robustness, therefore, the uncertainties involved in the channel and noise covariance matrices have to be taken into account during the channel estimation.
In the literature, robust pilot signal design techniques in the presence of covariance uncertainties were investigated for multiple-input-single-output (MISO) systems [16], MIMO systems [17], [18], [19], [20], UAV-assisted communication systems [21], and MIMO relaying systems [22]. These techniques, however, have several major limitations. First, in [16], [17], [18], [19], [20], [21], and [22], an idealistic assumption was made that the LMMSE channel estimation could be performed with the exact knowledge of the channel and noise covariance matrices. Thus, the channel estimation methodology considered therein is not actually robust. Furthermore, in the majority of the works [16], [17], [18], [19], [20], [21], only the impact of the channel covariance uncertainty was considered in the pilot design, while that of the noise covariance uncertainty was neglected. In addition, the design approaches of [17], [18], [19], [20], [21], and [22] all presumed a specific channel model (namely, the Kronecker channel model), and thus, there is a lack of generality in their applicability, especially for the scenarios where the presumed channel model is violated. Even if the presumed channel model is valid, the pilot signal designs need to be carried out through complicated iterative algorithms with high computational complexities (except for some special cases investigated in [16] and [20]), which hinders their applicability to practical IoT systems with stringent real-time requirements.
Recently, deep learning (DL) techniques have emerged as promising solutions to improve the performance of wireless communication systems by effectively and intelligently overcoming various technical challenges [23], [24], [25], [26], [27], [28], [29], [30], [31]. Particularly, in [32], [33], [34], [35], [36], [37], [38], and [39], DL-based channel estimation techniques have been developed for various MIMO systems by considering different design approaches. Specifically, in [32], a beamspace channel estimation technique has been devised for millimeter-wave massive MIMO systems based on a learned denoising-based approximate message passing (LDAMP) neural network, which incorporated a denoising convolutional neural network (CNN) into an iterative sparse signal recovery algorithm. In [33], two algorithms for direction-of-arrival (DOA) estimation and channel estimation have been developed for massive MIMO systems based on (deep) feedforward neural networks (FNNs) by leveraging the spatial structure. In [34], a DL compressed sensing (DLCS) channel estimation scheme has been proposed for multiuser millimeter-wave massive MIMO systems and a DL quantized phase (DLQP) hybrid precoder design method has been developed subsequent to the channel estimation. Yang et al. [35] proposed a sparse complex-valued neural network (SCNet) for the downlink CSI prediction in frequency division duplex (FDD) massive MIMO systems via the uplink-to-downlink mapping function. Moreover, joint channel estimation and pilot signal design schemes with different DL architectures have been suggested for massive MIMO systems via data-aided iterative channel estimation [36], multiuser MIMO systems via successive interference cancelation [37], MIMO systems via received SNR feedback [38], and MIMO-orthogonal frequency division multiplexing (OFDM) systems via neural network pruning [39]. Unfortunately, however, the aforementioned works [32], [33], [34], [35], [36], [37], [38], [39] did not consider the impacts of the channel and noise covariance uncertainties in the channel estimation (as well as pilot design) process, and thus, their performance will deteriorate seriously in real-world scenarios with imperfect covariance information (i.e., only with the knowledge of inexact covariance matrices). Accordingly, it is highly necessary to develop an innovative channel estimation scheme that achieves high accuracy of the channel estimate even under the channel and noise covariance uncertainties.
To the best of our knowledge, none of the aforementioned critical issues has been addressed yet in the literature, which motivated our work. In this article, we study the problem of robust channel estimation for MIMO-aided IoT systems in the presence of uncertainties in both the channel and noise covariance matrices, based on DL.¹ Note that, to the best of our knowledge, our work is the first to present a DL framework for robust channel estimation and pilot signal design that does not resort to the assumptions invoked in [16], [17], [18], [19], [20], [21], and [22], and hence, has wider applicability. To accomplish our design goal, we aim to minimize the mean square error (MSE) of the channel estimation under the channel and noise covariance uncertainties by jointly optimizing the channel estimator and pilot signal, which is however generally hard to tackle due to nonconvexity and intractability. To break through this technical challenge, we develop an effective and intelligent DL technique. The main contributions of this article are as follows.
1) We propose a novel and effective DL model for the robust MIMO channel estimation with two modules, namely, the pilot optimizer and channel predictor, which has never been reported in the literature to the best of our knowledge.² The pilot optimizer is constructed by a locally connected and weight-shared FNN with a specifically designed layer, called the pilot layer, such that the shared weights between the locally connected nodes correspond to the pilot signal, enabling the optimization of the pilot signal through training. Moreover, we construct the channel predictor by adopting a CNN structure such that useful features for the robust MIMO channel estimation are efficiently learnable.
2) In addition, we devise a novel and effective training strategy for the proposed DL model in a self-supervised manner, in which the channel and noise covariance matrices are appropriately compensated to overcome any adverse impacts of the underlying uncertainties on the channel estimation, and then, the samples drawn from the compensated covariance matrices are used to jointly train the two modules of the proposed DL model based on two different gradient descent approaches such that the MSE loss function of the prediction is minimized.
3) We present extensive simulation results, through which the superior performance and effectiveness of the proposed DL model are demonstrated compared to baseline schemes and some useful engineering insights into the practical design are drawn. We also analyze the computational complexities of the proposed and baseline schemes.

This article is organized as follows. In Section II, the system model is described and the robust channel estimation problem under consideration is formulated. In Section III, the proposed DL architecture and training strategy are elaborated. Section IV presents the simulation results along with thorough discussions and complexity analysis. Section V concludes this article.

¹ The robust channel estimation in this article means the robust estimation of the CSI of the MIMO system with imperfect knowledge of the channel and noise covariance matrices, the goal of which is to acquire a CSI estimate of the MIMO system that is robust to the channel and noise covariance uncertainties. The CSI estimate acquired by the robust channel estimation can be used for subsequent tasks, such as the robust beamforming design. Our proposed scheme can also be used for this purpose.
² The key novelty of our work lies in constructing the network structure of the pilot optimizer based on our own innovative construction inspired by the MIMO system model for transmission and reception of the pilot signal. In turn, the whole network structure of the proposed DL model combining the pilot optimizer and the channel predictor is entirely new and specialized in dealing with the robust channel estimation. In addition, a new finding from our work is that the constructed pilot optimizer (i.e., a very special neural network) still works well with the channel predictor, further supporting and demonstrating the universal superiority and effectiveness of constructing the prediction part with the CNN structure.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Notations: $\mathbb{R}^{a \times b}$ and $\mathbb{C}^{a \times b}$ stand for the sets of $a \times b$ real- and complex-valued matrices, respectively. Also, $\mathbf{A}^T$, $\mathbf{A}^H$, $\mathbf{A}^{1/2}$, $\mathbf{A}^{-1}$, and $\mathbf{A}^{\dagger}$ denote the transpose, conjugate (or Hermitian) transpose, (Hermitian) square root, inverse, and pseudo-inverse of a matrix $\mathbf{A}$, respectively. $\mathbf{a} = \mathrm{vec}(\mathbf{A})$ is the vectorization of a matrix $\mathbf{A}$, which stacks all column vectors of $\mathbf{A}$ into a long column vector $\mathbf{a}$, and its inverse operator is denoted by $\mathbf{A} = \mathrm{vec}^{-1}(\mathbf{a})$. Also, $\mathbf{A} \otimes \mathbf{B}$ is the Kronecker product between matrices $\mathbf{A}$ and $\mathbf{B}$. The maximum eigenvalue of a matrix $\mathbf{A}$ is denoted by $\lambda_{\max}(\mathbf{A})$, and the $(a, b)$th entry of $\mathbf{A}$ by $[\mathbf{A}]_{a,b}$. The expectation of a random variable is denoted by $\mathbb{E}[\cdot]$. The real and imaginary parts of a complex-valued argument $a$ are denoted by $\mathrm{Re}\{a\}$ and $\mathrm{Im}\{a\}$, respectively. The cardinality of a set $\mathcal{A}$ is denoted by $|\mathcal{A}|$. An $a \times a$ identity matrix is denoted by $\mathbf{I}_a$, and a zero matrix with an appropriate size by $\mathbf{0}$. In addition, $\mathbf{A} \succeq \mathbf{0}$ means that a Hermitian matrix $\mathbf{A}$ is positive semi-definite. The probability distribution of a circularly symmetric complex Gaussian (CSCG) random vector with mean $\mathbf{a}$ and covariance matrix $\mathbf{B}$ is denoted by $\mathcal{CN}(\mathbf{a}, \mathbf{B})$. Also, $\mathcal{W}(a, \mathbf{B})$ denotes a Wishart distribution with $a$ degrees of freedom and a scale matrix $\mathbf{B}$.

A. System Model
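As shown in Fig. 1, we consider an MIMO IoT system composed of a transmitter (e.g., a mobile or IoT device) and a receiver (e.g., a base station or gateway),³ which are equipped with $M$ and $N$ antennas, respectively. For the purpose of CSI acquisition, the transmitter sends an a priori known pilot signal of length $L$, denoted by $\mathbf{S} \in \mathbb{C}^{L \times M}$, to the receiver, whose transmission power is constrained such that $\mathrm{Tr}(\mathbf{S}^H \mathbf{S}) \le P$, where $P$ denotes the maximum power budget. The received pilot signal at the receiver, denoted by $\mathbf{Y} \in \mathbb{C}^{N \times L}$, is then given by
$$\mathbf{Y} = \mathbf{H}\mathbf{S}^T + \mathbf{Z} \tag{1}$$
where $\mathbf{H} \in \mathbb{C}^{N \times M}$ and $\mathbf{Z} \in \mathbb{C}^{N \times L}$ are the matrices of MIMO channel coefficients and received additive noises (possibly accounting for interference from other links), respectively. Using $\mathrm{vec}(\mathbf{A}\mathbf{X}\mathbf{B}) = (\mathbf{B}^T \otimes \mathbf{A})\mathrm{vec}(\mathbf{X})$ [40], the received signal in (1) can be written in a vector form as
$$\mathbf{y} = (\mathbf{S} \otimes \mathbf{I}_N)\mathbf{h} + \mathbf{z} \tag{2}$$
where $\mathbf{y} = \mathrm{vec}(\mathbf{Y})$, $\mathbf{h} = \mathrm{vec}(\mathbf{H})$, and $\mathbf{z} = \mathrm{vec}(\mathbf{Z})$.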
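The step from the matrix form to the vector form of the received pilot signal can be checked numerically. The following is a minimal sketch, assuming the matrix model Y = H S^T + Z (the form consistent with the stated dimensions of S, H, and Y and with the quoted vec identity); the antenna numbers, pilot length, and the helper names `crandn` and `vec` are illustrative choices, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, L = 2, 3, 4  # transmit antennas, receive antennas, pilot length (illustrative)

def crandn(*shape):
    """Complex Gaussian samples with unit variance per entry (hypothetical helper)."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

def vec(A):
    """Stack the columns of A into one long vector (column-major order)."""
    return A.reshape(-1, order="F")

S = crandn(L, M)   # pilot signal, L x M
H = crandn(N, M)   # channel matrix, N x M
Z = crandn(N, L)   # noise matrix, N x L

Y = H @ S.T + Z                                   # assumed matrix form, N x L
y_vec = np.kron(S, np.eye(N)) @ vec(H) + vec(Z)   # vector form via vec(AXB) = (B^T kron A) vec(X)

max_gap = np.max(np.abs(vec(Y) - y_vec))
print(max_gap)
```

The two forms agree up to floating-point rounding, which is why the later real-valued network layer can work directly with the stacked vectors.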
Let $\mathbf{C}_h \succeq \mathbf{0}$ and $\mathbf{C}_z \succeq \mathbf{0}$ denote the (actual) channel and noise covariance matrices, respectively. In practice, the values of $\mathbf{C}_h$ and $\mathbf{C}_z$ are not exactly known, as they have to be estimated or acquired from real (yet erroneous) samples of the channel estimate and noise measurement, respectively.⁴ On top of such incomplete acquisition, further errors may also arise in subsequent processes, such as quantization, compression, data embedding, and feedback transmission. Consequently, in practice, the estimated channel and noise covariance matrices are generally inaccurate and unavoidably subject to some errors. Considering such imperfection, in this article, the mismatches between the actual values of $\mathbf{C}_h$ and $\mathbf{C}_z$ and their estimated values (denoted by $\hat{\mathbf{C}}_h \succeq \mathbf{0}$ and $\hat{\mathbf{C}}_z \succeq \mathbf{0}$, respectively) are modeled as follows [16], [17], [18], [19], [20], [21], [22]:
$$\mathbf{C}_h = \hat{\mathbf{C}}_h + \mathbf{E}_h \tag{3}$$
$$\mathbf{C}_z = \hat{\mathbf{C}}_z + \mathbf{E}_z \tag{4}$$
where $\mathbf{E}_h \in \mathcal{E}_h$ and $\mathbf{E}_z \in \mathcal{E}_z$ denote the corresponding error matrices, both of which are (generally indefinite) Hermitian matrices (i.e., $\mathbf{E}_h = \mathbf{E}_h^H$ and $\mathbf{E}_z = \mathbf{E}_z^H$) such that $\hat{\mathbf{C}}_h + \mathbf{E}_h \succeq \mathbf{0}$ and $\hat{\mathbf{C}}_z + \mathbf{E}_z \succeq \mathbf{0}$, respectively. Furthermore, $\mathcal{E}_h$ and $\mathcal{E}_z$ denote unitarily invariant sets of the channel and noise covariance uncertainties, respectively, such that if $\mathbf{E}_h \in \mathcal{E}_h$ [...] $\mathrm{Tr}((\mathbf{E}_a^H \mathbf{E}_a)^{1/2}) \le \epsilon_a\}$ (i.e., nuclear norm-bounded set) for $a \in \{h, z\}$, where $\epsilon_a \ge 0$ denotes an upper bound [20], [21].

³ This single-user scenario is very fundamental, and useful and insightful (and thus, still meaningful) for in-depth inspection of the robust channel estimation with DL. Nevertheless, our work is not confined to the single-user scenario, but can be readily extended to the multiuser or massive access scenario. Specifically, the proposed DL model developed for the single-user scenario can be readily extended to the multiuser or massive access scenario in uplink with only a very minor modification of the parameter update for the pilot optimizer to deal with individual transmission power constraints on the pilot signals of multiple transmitters. Also, the proposed scheme can be directly applied to the multiuser or massive access scenario in downlink by treating the whole of multiple receivers as a large-size single receiver. Due to the scalability issue, however, the maximum number of accessible users should be limited or judiciously determined in practice according to the model capacity of the constructed DL network as well as the system requirements on the computational cost/capability/burden and the inference latency/delay.
⁴ If the actual covariance matrices $\mathbf{C}_h$ and $\mathbf{C}_z$ were exactly known, the system CSI could be acquired via a linear channel estimation technique, such as the LMMSE channel estimation, which would be given by (27 [...]
1) In [16], [17], [18], [19], [20], [21], and [22], the actual values of the channel and noise covariance matrices were assumed to be exactly known at the receiver, which is idealistic and impractical. In contrast, in this article, we make a more practical assumption that those actual values are not knowable even at the receiver side.
2) Most of the previous works [16], [17], [18], [19], [20], [21] assumed that uncertainty existed only in the channel covariance matrix, whereas there was no uncertainty in the noise covariance matrix. More generally, in this article, we assume that uncertainties exist in both the channel and noise covariance matrices.
3) In [17], [18], [19], [20], [21], and [22], the channel and noise covariance matrices were assumed to be Kronecker-separable; that is, each of them can be factorized into the Kronecker product of two smaller matrices such that $\mathbf{C}_h = \mathbf{A}_h \otimes \mathbf{B}_h$ and $\mathbf{C}_z = \mathbf{A}_z \otimes \mathbf{B}_z$ for some $\mathbf{A}_h \succeq \mathbf{0}$, $\mathbf{B}_h \succeq \mathbf{0}$, $\mathbf{A}_z \succeq \mathbf{0}$, and $\mathbf{B}_z \succeq \mathbf{0}$. However, this assumption is generally inaccurate in practice [14] and may even be violated in certain scenarios where a strong coupling between the transmitter and receiver exists due to proximity and/or in certain types of propagation environments where the transmitter and receiver share some of the scatterers [14]. In this article, for generality and universality of the practical applicability, it is assumed that the channel and noise covariance matrices are Kronecker-inseparable, i.e., $\mathbf{C}_h \neq \mathbf{A}_h \otimes \mathbf{B}_h$ and $\mathbf{C}_z \neq \mathbf{A}_z \otimes \mathbf{B}_z$, respectively.

B. Problem Formulation
The goal of the robust channel estimation in this article is to estimate the MIMO channel vector $\mathbf{h}$ as accurately as possible in the presence of the uncertainties in the channel and noise covariance matrices, which is a rather challenging task. Obviously, with knowledge of only the estimated covariance matrices $\hat{\mathbf{C}}_h$ and $\hat{\mathbf{C}}_z$, a traditional signal processing approach, such as the LMMSE channel estimation method, may fail to achieve this goal due to the mismatches between the actual and estimated covariance matrices.
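The effect of a covariance mismatch on a linear estimator can be illustrated with the standard closed-form MSE of a linear filter. The sketch below is only an illustration under stated assumptions: zero-mean, mutually uncorrelated channel and noise, a linear observation y = A h + z with A = S ⊗ I_N, randomly generated "true" and "estimated" covariances (an arbitrary and deliberately large mismatch), and hypothetical helper names `random_psd`, `lmmse_filter`, and `mse`.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, L, P = 2, 2, 2, 4.0   # illustrative sizes and power budget

def random_psd(n):
    """Random Hermitian positive-definite matrix used as an illustrative covariance."""
    G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return G @ G.conj().T / n + 0.1 * np.eye(n)

def lmmse_filter(A, Ch, Cz):
    """Linear MMSE filter W (h_hat = W y) designed for covariances Ch and Cz."""
    return Ch @ A.conj().T @ np.linalg.inv(A @ Ch @ A.conj().T + Cz)

def mse(W, A, Ch, Cz):
    """MSE of h_hat = W y when the true covariances are Ch and Cz."""
    D = np.eye(Ch.shape[0]) - W @ A
    return float(np.real(np.trace(D @ Ch @ D.conj().T + W @ Cz @ W.conj().T)))

S = np.sqrt(P / M) * np.eye(L, M)          # orthogonal pilot with Tr(S^H S) = P
A = np.kron(S, np.eye(N))

Ch_true, Cz_true = random_psd(N * M), random_psd(N * L)
Ch_est, Cz_est = random_psd(N * M), random_psd(N * L)   # mismatched knowledge

mse_matched = mse(lmmse_filter(A, Ch_true, Cz_true), A, Ch_true, Cz_true)
mse_mismatched = mse(lmmse_filter(A, Ch_est, Cz_est), A, Ch_true, Cz_true)
print(mse_matched, mse_mismatched)
```

Because the matched LMMSE filter minimizes the MSE over all linear filters, the mismatched design can never do better, and the gap grows with the size of the mismatch.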
To mathematically formalize an optimization problem for the robust channel estimation under consideration, let $\hat{\mathbf{h}} = f_{\theta}(\mathbf{y}; \mathbf{S})$ denote an MIMO channel estimate, which is specified by a (possibly nonlinear) function of the received signal $\mathbf{y}$ for a given pilot signal $\mathbf{S}$, parameterized by a set $\theta$ of parameters. In this article, we aim to find the channel estimator $f_{\theta}$ as well as to design the pilot signal $\mathbf{S}$ with the transmission power constraint such that the MSE of the channel estimation is minimized under the channel and noise covariance uncertainties as follows:
$$\text{(P1)}: \ \underset{\theta,\, \mathbf{S}}{\text{minimize}} \ \ \mathbb{E}\big[\|\mathbf{h} - f_{\theta}(\mathbf{y}; \mathbf{S})\|^2\big] \quad \text{subject to} \ \ \mathrm{Tr}(\mathbf{S}^H \mathbf{S}) \le P. \tag{5}$$
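This section does not state how the power constraint Tr(S^H S) ≤ P is enforced while the pilot signal is being optimized. One common option, shown here purely as an assumption and not as the article's method, is to rescale an infeasible pilot matrix back onto the constraint set, which is the Euclidean projection onto a Frobenius-norm ball. The function names `pilot_power` and `project_power` are hypothetical.

```python
import numpy as np

def pilot_power(S):
    """Transmission power Tr(S^H S) of a pilot matrix S."""
    return float(np.real(np.trace(S.conj().T @ S)))

def project_power(S, P):
    """Euclidean projection of S onto {S : Tr(S^H S) <= P} (assumed enforcement rule)."""
    power = pilot_power(S)
    if power <= P:
        return S.copy()              # already feasible: leave unchanged
    return S * np.sqrt(P / power)    # otherwise scale back onto the boundary

rng = np.random.default_rng(0)
L, M, P = 4, 2, 1.0                  # illustrative pilot length, antennas, power budget
S_raw = rng.standard_normal((L, M)) + 1j * rng.standard_normal((L, M))
S_feasible = project_power(S_raw, P)
print(pilot_power(S_raw), pilot_power(S_feasible))
```

Applying such a projection after each gradient step keeps the pilot feasible without changing its direction in the matrix space.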
Note that problem (P1) is nonlinear and nonconvex. In particular, it even involves functional optimization that is mathematically intractable. As a result, problem (P1) is generally NP-hard, and thus, it is infeasible to tackle problem (P1) directly.⁵ Even for a simpler case where the channel estimator is restricted to be linear, as well as for a very simplistic case where the LMMSE channel estimator is adopted with an idealistic assumption that the exact knowledge of the channel and noise covariance matrices is available, problem (P1) still remains very difficult to solve even numerically due to the nonconvexity. Furthermore, as discussed in Remark 1, the robust channel estimation problem formulated in (P1) is much more challenging than those considered in the previous works [16], [17], [18], [19], [20], [21], [22]. Consequently, the existing solution approaches in [16], [17], [18], [19], [20], [21], and [22] from the optimization perspective are neither effective nor applicable in solving (P1). To break through these technical challenges effectively and intelligently, in the following section, we derive a new and innovative solution to problem (P1) based on DL via the construction of a novel neural network.

III. DEEP LEARNING-BASED ROBUST MIMO CHANNEL ESTIMATION
In this section, we first elaborate on the network structure of the proposed DL model developed for the robust MIMO channel estimation. Then, the training methodology with a strategy of channel and noise covariance compensation is presented.

A. Network Structure
The whole network structure of the proposed DL model is presented in Fig. 2. The covariance compensation is initially carried out to obtain training samples. Then, two DL modules, the pilot optimizer and the channel predictor, which are connected in tandem, are trained via two different gradient descent methods with the MSE loss function. In what follows, these three components are elaborated and specified.
1) Covariance Compensation: Note that since the actual covariance matrices are not known, these cannot be used for the training. On the other hand, although the estimated covariance matrices are known, using only these for the training results in severely limited performance due to the discrepancies with the actual covariance matrices (as will be demonstrated by the simulation results in Section IV). To properly deal with these issues and to guarantee the robustness against the covariance uncertainties by compromising between the actual and estimated covariance matrices, in the proposed DL model, we compensate the estimated channel and noise covariance matrices, $\hat{\mathbf{C}}_h$ and $\hat{\mathbf{C}}_z$, by intentionally adding some distortions $\bar{\mathbf{E}}_h$ and $\bar{\mathbf{E}}_z$, respectively, as follows:
$$\bar{\mathbf{C}}_h = \hat{\mathbf{C}}_h + \bar{\mathbf{E}}_h \tag{6}$$
$$\bar{\mathbf{C}}_z = \hat{\mathbf{C}}_z + \bar{\mathbf{E}}_z \tag{7}$$
Here, $\bar{\mathbf{E}}_h$ and $\bar{\mathbf{E}}_z$ have the same statistics as those of $\mathbf{E}_h$ and $\mathbf{E}_z$, respectively, i.e., $\bar{\mathbf{E}}_h \in \mathcal{E}_h$ and $\bar{\mathbf{E}}_z \in \mathcal{E}_z$. Consequently, $\bar{\mathbf{C}}_h$ and $\bar{\mathbf{C}}_z$ have the same statistics as those of the actual covariance matrices $\mathbf{C}_h$ and $\mathbf{C}_z$, respectively (although the values of $\bar{\mathbf{C}}_h$ and $\bar{\mathbf{C}}_z$ are not exactly the same as those of $\mathbf{C}_h$ and $\mathbf{C}_z$).
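The compensation step can be sketched as follows. This is a minimal illustration, not the article's procedure: how the distortion is actually drawn is not specified in this section, so the sketch assumes a random Hermitian distortion placed on a nuclear-norm ball of radius `eps` and then shrunk until the compensated matrix is positive semi-definite. The names `random_hermitian`, `compensate`, `eps`, and `max_halvings` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_hermitian(n):
    """Random (generally indefinite) Hermitian matrix."""
    G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return (G + G.conj().T) / 2

def compensate(C_est, eps, max_halvings=30):
    """Return (C_bar, E_bar) with C_bar = C_est + E_bar, E_bar Hermitian,
    nuclear norm of E_bar at most eps, and C_bar positive semi-definite."""
    E_bar = random_hermitian(C_est.shape[0])
    E_bar = E_bar * (eps / np.linalg.norm(E_bar, "nuc"))   # nuclear norm equals eps
    for _ in range(max_halvings):
        C_bar = C_est + E_bar
        if np.linalg.eigvalsh(C_bar).min() >= -1e-12:
            return C_bar, E_bar
        E_bar = E_bar / 2                                  # shrink until the sum is PSD
    return C_est.copy(), np.zeros_like(C_est)              # fall back to no distortion

n = 4
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
C_est = G @ G.conj().T / n + 0.1 * np.eye(n)   # illustrative estimated covariance
eps = 0.5
C_bar, E_bar = compensate(C_est, eps)
print(np.linalg.norm(E_bar, "nuc"), np.linalg.eigvalsh(C_bar).min())
```

Drawing a fresh distortion for each batch of training samples would expose the network to many plausible covariance matrices around the estimate, which is the intent of the compensation described above.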
In the sequel, we will refer to $\bar{\mathbf{C}}_h$ and $\bar{\mathbf{C}}_z$ as the compensated channel and noise covariance matrices, respectively, or simply, the compensated covariance matrices.⁶ Let $\bar{\mathbf{h}}$ and $\bar{\mathbf{z}}$ denote the channel and noise vectors whose covariance matrices are equal to the compensated channel and noise covariance matrices $\bar{\mathbf{C}}_h$ and $\bar{\mathbf{C}}_z$, respectively (these will be simply referred to as the compensated channel and noise vectors in the sequel). Then the inputs of the proposed DL model are $\bar{\mathbf{h}}$ and $\bar{\mathbf{z}}$, and the output is the prediction of $\bar{\mathbf{h}}$, denoted by $\hat{\mathbf{h}}$. In what follows, we further elaborate and specify the two modules of the proposed DL model.

2) Pilot Optimizer (DL Module I):
The goal of employing the pilot optimizer in the proposed DL model is to make the pilot signal $\mathbf{S}$ optimizable or designable as in problem (P1) by learning the system model of (2) for the transmission and reception of the pilot signal through the noisy MIMO channel.⁷ To achieve this design goal, we construct the pilot optimizer using a single-layer FNN with a novel local connection and weight sharing between the input and output nodes such that the shared weights between the locally connected nodes correspond to the pilot signal $\mathbf{S}$ to be optimized. We refer to this specifically designed layer as the pilot layer.
The network structure of the pilot optimizer (i.e., DL module I) that we construct is shown in Fig. 3; detailed explanations and specifications are given as follows.
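1) Configuration: In the constructed pilot optimizer, the number of input nodes (i.e., input size) is $2N(M + L)$ and the number of output nodes (i.e., output size) is $2NL$. For the sake of notational brevity, we denote the first $NL$ output nodes by [...]
3) Operation: At each output node, the weighted sum of the inputs is computed, followed by passing through an activation function $\phi(\cdot)$. Thus, the operation of each output node is given by (8), or equivalently
$$\mathbf{a}' = \phi\big((\mathbf{W}' \otimes \mathbf{I}_N)\mathbf{x}' - (\mathbf{W}'' \otimes \mathbf{I}_N)\mathbf{x}'' + \mathbf{b}'\big), \quad \mathbf{a}'' = \phi\big((\mathbf{W}'' \otimes \mathbf{I}_N)\mathbf{x}' + (\mathbf{W}' \otimes \mathbf{I}_N)\mathbf{x}'' + \mathbf{b}''\big) \tag{9}$$
where [...] Also, $\mathbf{a}''$, $\mathbf{W}''$, $\mathbf{x}''$, and $\mathbf{b}''$ are defined similarly as in (10)-(13), respectively.
4) Design Inspiration: The network structure of the pilot optimizer is inspired by the system model of (2). Specifically, we can decompose (2) into real and imaginary parts as
$$\mathrm{Re}\{\mathbf{y}\} = (\mathrm{Re}\{\mathbf{S}\} \otimes \mathbf{I}_N)\mathrm{Re}\{\mathbf{h}\} - (\mathrm{Im}\{\mathbf{S}\} \otimes \mathbf{I}_N)\mathrm{Im}\{\mathbf{h}\} + \mathrm{Re}\{\mathbf{z}\}, \quad \mathrm{Im}\{\mathbf{y}\} = (\mathrm{Im}\{\mathbf{S}\} \otimes \mathbf{I}_N)\mathrm{Re}\{\mathbf{h}\} + (\mathrm{Re}\{\mathbf{S}\} \otimes \mathbf{I}_N)\mathrm{Im}\{\mathbf{h}\} + \mathrm{Im}\{\mathbf{z}\}. \tag{14}$$
Notably, we can observe that the mathematical expression of (9) is equivalent to that of (14) provided that $\phi(x) = x$, $\mathbf{a}' = \mathrm{Re}\{\mathbf{y}\}$, $\mathbf{a}'' = \mathrm{Im}\{\mathbf{y}\}$, $\mathbf{W}' = \mathrm{Re}\{\mathbf{S}\}$, $\mathbf{W}'' = \mathrm{Im}\{\mathbf{S}\}$, $\mathbf{x}' = \mathrm{Re}\{\mathbf{h}\}$, $\mathbf{x}'' = \mathrm{Im}\{\mathbf{h}\}$, $\mathbf{b}' = \mathrm{Re}\{\mathbf{z}\}$, and $\mathbf{b}'' = \mathrm{Im}\{\mathbf{z}\}$. This essentially means that the physical mechanism (as well as the relevant important features) for the transmission and reception of the pilot signal through the noisy MIMO channel can be learned by the constructed pilot optimizer. In particular, the weights of the constructed pilot optimizer correspond to the pilot signal (i.e., $\mathbf{W}' = \mathrm{Re}\{\mathbf{S}\}$ and $\mathbf{W}'' = \mathrm{Im}\{\mathbf{S}\}$), and thus, the optimization of the pilot signal can be readily carried out through the training procedure, which is clearly a significant design advantage.
5) Parameter Setup: According to the aforementioned design inspiration, in the sequel, we treat the weights of the pilot optimizer as the pilot signal to be optimized, i.e., we set $\mathbf{W}' = \mathrm{Re}\{\mathbf{S}\}$ and $\mathbf{W}'' = \mathrm{Im}\{\mathbf{S}\}$.
6) Activation Function: Also, the activation function at each output node is set to the linear function, i.e., $\phi(x) = x$.
[...] Unfortunately, however, this is infeasible since the actual covariance matrices are not known. To address this issue, we instead propose to take the real and imaginary parts of the compensated channel and noise vectors (rather than the actual values that are unknown) as the inputs of the pilot optimizer as follows:
$$\mathbf{x}' = \mathrm{Re}\{\bar{\mathbf{h}}\}, \quad \mathbf{x}'' = \mathrm{Im}\{\bar{\mathbf{h}}\}, \quad \mathbf{b}' = \mathrm{Re}\{\bar{\mathbf{z}}\}, \quad \mathbf{b}'' = \mathrm{Im}\{\bar{\mathbf{z}}\}.$$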
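The correspondence between the pilot layer and the complex-valued signal model can be checked with a small forward pass. The sketch below is illustrative only: it assumes a linear activation, the vector model y = (S ⊗ I_N) h + z, and real-valued weight blocks that play the roles of Re{S} and Im{S}; the function name `pilot_layer` and the sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, L = 2, 3, 4   # illustrative antenna numbers and pilot length

def pilot_layer(W_re, W_im, x_re, x_im, b_re, b_im, N):
    """Linear pilot layer: the shared weights (W_re, W_im) stand for Re{S} and Im{S}."""
    K_re = np.kron(W_re, np.eye(N))   # Kronecker structure = local connections + weight sharing
    K_im = np.kron(W_im, np.eye(N))
    a_re = K_re @ x_re - K_im @ x_im + b_re
    a_im = K_im @ x_re + K_re @ x_im + b_im
    return a_re, a_im

def crandn(n):
    """Complex Gaussian vector with unit variance per entry (hypothetical helper)."""
    return (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)

S = crandn(L * M).reshape(L, M)
h, z = crandn(N * M), crandn(N * L)

a_re, a_im = pilot_layer(S.real, S.imag, h.real, h.imag, z.real, z.imag, N)
y = np.kron(S, np.eye(N)) @ h + z            # complex-valued reference
gap = np.max(np.abs((a_re + 1j * a_im) - y))

n_shared = 2 * L * M                  # trainable weights in the pilot layer
n_dense = (2 * N * L) * (2 * N * M)   # weights of an unconstrained dense map on the channel inputs
print(gap, n_shared, n_dense)
```

The layer reproduces the received pilot signal exactly while exposing only 2LM trainable weights, so a gradient step on those weights is a gradient step on the pilot signal itself.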

3) Channel Predictor (DL Module II):
In the proposed DL model, we also employ a (deep) neural network subsequent to the pilot optimizer, called the channel predictor, which serves as the channel estimator $f_{\theta}$ in problem (P1), where $\theta$ denotes the set of all learnable parameters. This is motivated by the universal approximation theorem: even a neural network with a single hidden layer is capable of approximating any nonlinear function to within an arbitrary accuracy [41], [42], [43]. Particularly, we construct the channel predictor by adopting a CNN structure that is characterized by local connection and weight sharing, in order to pursue a structural match with the pilot optimizer as well as to efficiently learn and extract useful features for the robust MIMO channel estimation under the channel and noise covariance uncertainties.
The network structure of the constructed channel predictor (i.e., DL module II) is shown in Fig. 4, which consists of a sequential connection of several convolutional and pooling layers, followed by a fully connected (FC) layer. Details are given in the following.
1) Reshaping: The channel predictor in the proposed DL model first reshapes the output of the pilot optimizer for an appropriate processing by the CNN.Specifically, Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
the output [a T , a T ] T ∈ R 2NL×1 of the pilot optimizer is divided into 2N patches, each of length L, as follows: where A = vec −1 (a) and A = vec −1 (a ).2) 1-D Conv Layers: In the convolution stage, we employ three 1-D convolutional layers.a) Input: The first convolutional layer takes the reshaped output of the pilot optimizer in (17) as the input.The second and third convolutional layers take the outputs of the first and second max-pooling layers as the inputs, respectively.In each convolutional layer, the input is properly zero padded such that the output has the same size as the input.b) Operation: The kth convolutional layer performs 1-D convolution between the zero-padded input and c k different kernels, each of size (or length) k , with unit stride rate.Let {χ i,j } denote the input to the kth convolutional layer.Then the output of the kth convolutional layer of size r k is given as [42], [43] for i = 1, . . ., c k and j = 1, . . ., r k , where {ω i,p,q } is the set of weights of the ith kernel and {β i } denotes the set of bias terms.For k = 1, we have c 0 = 2N, 0 = L, and {χ i,j } is properly formed by entries of [A T , A T ] T covered by the c 1 kernels.Also, ϕ denotes an activation function.c) Number and Size of Kernels: The number of kernels in the kth convolutional layer is set to c k = 8kMN, k = 1, 2, 3. 
Also, the kernel size is set to $\ell_k = 3$ $\forall k$, which has been demonstrated to be an adequate size to extract sufficient spatial features of the input data [44].
d) Activation Function: The activation function in each convolutional layer is chosen as the exponential linear unit (ELU), i.e., $\phi(x) = \exp(x) - 1$ for $x \le 0$ and $\phi(x) = x$ for $x > 0$ [45].

3) Max-Pooling Layers: In the pooling stage, we employ three max-pooling layers, one placed after each convolutional layer, such that the dimensionality of the features extracted by the preceding convolutional layer is gradually reduced via 1-D down-sampling in order to prevent overfitting and to make the features more robust against noise, shift, distortion, etc.
a) Input: The $k$th max-pooling layer takes the output of the $k$th convolutional layer as the input.
b) Operation: Let $\bar{\ell}_k$ denote the kernel size as well as the stride in the $k$th max-pooling layer. Then the output of the $k$th max-pooling layer, of size $\bar{r}_k$, is given by [43]

$$\bar{\alpha}_{i,j} = \max\left\{\alpha_{i,q} : q \in \mathcal{Q}_j\right\} \tag{19}$$

for $i = 1, \ldots, c_k$ and $j = 1, \ldots, \bar{r}_k$, where $\mathcal{Q}_j = \{\bar{\ell}_k(j-1)+1, \bar{\ell}_k(j-1)+2, \ldots, \bar{\ell}_k(j-1)+\bar{\ell}_k\}$.
c) Kernel Size: In each pooling layer of the constructed channel predictor, the kernel size is set to $\bar{\ell}_k = 2$, $k = 1, 2, 3$, such that the size of the feature extracted by each convolutional layer is halved after passing through each pooling layer.

4) Concatenation: The output of the last max-pooling layer is concatenated into a 1-D vector.

5) FC Layer: The final stage of the channel predictor is the FC layer, which is constructed as a single-layer, fully connected FNN for the purpose of fine-tuning the features obtained by the convolutional and pooling layers.
a) Input: The FC layer takes the concatenated output of the last pooling layer as the input.
b) Operation: Let $x_F$ denote the input of the FC layer, i.e., the concatenated output of the last max-pooling layer. Then the FC layer processes the input by first multiplying it by a weight matrix $W_F$ and then adding a bias vector $b_F$, followed by passing the result through an activation function $\phi_F(\cdot)$. Thus, the output of the FC layer is given by $\phi_F(W_F x_F + b_F)$. The activation function in the FC layer is set to the linear function, i.e., $\phi_F(x) = x$.

6) Output: The channel predictor takes the output of the FC layer as its output such that the real and imaginary parts of the compensated channel vector $\check{h}$ are predicted as

$$\begin{bmatrix} \mathrm{Re}\{\hat{h}\} \\ \mathrm{Im}\{\hat{h}\} \end{bmatrix} = W_F x_F + b_F.$$
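The layer-by-layer description above can be sketched as a plain NumPy forward pass. This is a minimal illustration under stated assumptions, not the author's implementation: the function names (`conv1d_same`, `max_pool1d`, `channel_predictor`), the toy dimensions, the random weights, and the centered zero-padding alignment are all hypothetical choices made for the example.

```python
import numpy as np

def elu(x):
    # ELU as defined in the text: exp(x) - 1 for x <= 0, x for x > 0.
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def conv1d_same(x, w, b):
    """1-D convolution with unit stride and zero padding (output length == input length).

    x: (c_in, r) input, w: (c_out, c_in, ell) kernels, b: (c_out,) biases.
    The centered padding alignment is an assumption of this sketch.
    """
    c_out, c_in, ell = w.shape
    r = x.shape[1]
    left = (ell - 1) // 2
    xp = np.pad(x, ((0, 0), (left, ell - 1 - left)))
    out = np.empty((c_out, r))
    for j in range(r):
        # Sum over all input channels and kernel taps, then add the bias.
        out[:, j] = np.tensordot(w, xp[:, j:j + ell], axes=([1, 2], [0, 1])) + b
    return elu(out)

def max_pool1d(x, size=2):
    """Non-overlapping max pooling (kernel size equals stride), as in (19)."""
    c, r = x.shape
    r_out = r // size
    return x[:, :r_out * size].reshape(c, r_out, size).max(axis=2)

def channel_predictor(A, conv_params, W_F, b_F):
    """A: (2N, L) reshaped pilot-optimizer output. Returns stacked [Re; Im] of the estimate."""
    x = A
    for w, b in conv_params:
        x = max_pool1d(conv1d_same(x, w, b), size=2)
    x_F = x.reshape(-1)        # concatenation into a 1-D vector
    return W_F @ x_F + b_F     # FC layer with linear activation

# Toy dimensions (hypothetical): M = N = 2, L = 8.
M, N, L = 2, 2, 8
rng = np.random.default_rng(0)
channels = [2 * N] + [8 * k * M * N for k in (1, 2, 3)]   # c_0 = 2N, c_k = 8kMN
conv_params = [
    (0.1 * rng.standard_normal((channels[k], channels[k - 1], 3)), np.zeros(channels[k]))
    for k in (1, 2, 3)
]
n_F = channels[3] * (L // 8)   # feature length after three halvings
W_F = 0.1 * rng.standard_normal((2 * M * N, n_F))
b_F = np.zeros(2 * M * N)

A = rng.standard_normal((2 * N, L))
h_hat_stacked = channel_predictor(A, conv_params, W_F, b_F)
print(h_hat_stacked.shape)
```

The output has length $2MN$, matching the stacked real and imaginary parts of an $MN$-dimensional channel vector. A trained model would of course use learned rather than random weights.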

B. Proposed Training Procedure
Through the training procedure, we jointly train the two modules of the proposed DL model, the pilot optimizer and the channel predictor, in a self-supervised manner such that the joint optimization in (P1) can be carried out. The detailed process is explained in the following.

1) Training Data Acquisition:
Once the channel and noise covariance matrices are compensated as in (6) and (7), respectively, the samples of the compensated channel and noise vectors, $\check{h}$ and $\check{z}$, i.e., the inputs of the proposed DL model, are drawn according to the compensated covariance matrices $\check{C}_h$ and $\check{C}_z$, respectively. For example, for the case of Rayleigh fading with additive Gaussian noise (i.e., $h \sim \mathcal{CN}(0, C_h)$ and $z \sim \mathcal{CN}(0, C_z)$), the samples can be obtained such that $\check{h} \sim \mathcal{CN}(0, \check{C}_h)$ and $\check{z} \sim \mathcal{CN}(0, \check{C}_z)$, respectively.
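For the Gaussian case, drawing such samples can be sketched with a Cholesky factor of the covariance matrix. The helper name `sample_complex_gaussian` and the placeholder covariance below are assumptions for illustration only; the actual compensated matrices come from (6) and (7), which are not reproduced here.

```python
import numpy as np

def sample_complex_gaussian(C, num_samples, rng):
    """Draw samples from CN(0, C) for a Hermitian positive-definite matrix C.

    Returns an array of shape (n, num_samples) whose columns are the samples.
    """
    n = C.shape[0]
    Lc = np.linalg.cholesky(C)   # C = Lc @ Lc^H
    # Unit-variance circularly symmetric complex Gaussian entries, i.e., CN(0, I).
    w = (rng.standard_normal((n, num_samples))
         + 1j * rng.standard_normal((n, num_samples))) / np.sqrt(2.0)
    return Lc @ w

rng = np.random.default_rng(0)
n = 4
# Placeholder covariance (hypothetical), standing in for a compensated matrix.
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
C_comp = B @ B.conj().T / n + np.eye(n)

samples = sample_complex_gaussian(C_comp, 200_000, rng)
C_emp = samples @ samples.conj().T / samples.shape[1]
rel_err = np.max(np.abs(C_emp - C_comp)) / np.max(np.abs(C_comp))
print(rel_err)
```

The empirical covariance of the drawn samples should approach the target matrix as the number of samples grows, which is a quick sanity check on the sampler.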
2) Parameter Update: The parameters of the proposed DL model, i.e., the weights $\bar{W} = \mathrm{Re}\{S\}$ and $\tilde{W} = \mathrm{Im}\{S\}$ in the pilot layer of the pilot optimizer and the set $\theta$ of the weights and biases in the convolutional and FC layers of the channel predictor, need to be jointly optimized. For this purpose, the loss function for the training is selected as the empirical MSE between the compensated channel vector $\check{h}$ and the predicted value $\hat{h}$ (i.e., the output of the proposed DL model) as follows:

$$L(S, \theta) = \frac{1}{|\mathcal{T}|} \sum_{(\check{h}, \check{z}) \in \mathcal{T}} \left\| \check{h} - \hat{h} \right\|^2 \tag{21}$$

where $\mathcal{T}$ denotes the set of training samples.

Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
To minimize the loss function in (21), the parameters of the proposed DL model can be updated based on two different gradient descent methods through the backward computations. Specifically, the update of $\theta$ can be performed via the stochastic gradient descent (SGD) method as

$$\theta \leftarrow \theta - \gamma \frac{\partial L(S, \theta)}{\partial \theta} \tag{22}$$

where $\gamma > 0$ is the step size or learning rate. On the other hand, the weight update for the pilot optimizer can be carried out based on the projected SGD (PSGD) method such that the power constraint of the pilot signal in problem (P1) is fulfilled as [38]

$$\begin{bmatrix} \mathrm{Re}\{S\} \\ \mathrm{Im}\{S\} \end{bmatrix} \leftarrow \Pi_{\mathcal{S}}\left( \begin{bmatrix} \mathrm{Re}\{S\} \\ \mathrm{Im}\{S\} \end{bmatrix} - \gamma \begin{bmatrix} \dfrac{\partial L(S, \theta)}{\partial \mathrm{Re}\{S\}} \\ \dfrac{\partial L(S, \theta)}{\partial \mathrm{Im}\{S\}} \end{bmatrix} \right) \tag{23}$$

where $\Pi_{\mathcal{S}}$ denotes the projection operator onto the feasible set $\mathcal{S} = \{S \in \mathbb{C}^{L \times M} : \mathrm{Tr}(S^H S) \le P\}$, which is given by (24). That is, if the power constraint is violated after the SGD update, the updated $S$ is immediately normalized such that $\mathrm{Tr}(S^H S) = P$.
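The projection step described above amounts to rescaling the pilot matrix whenever its power exceeds the budget. The following is a minimal sketch of that behavior; the function names `project_power` and `psgd_step` are hypothetical, and the gradients are random stand-ins rather than outputs of backpropagation.

```python
import numpy as np

def project_power(S, P):
    """Project S onto {S : Tr(S^H S) <= P}; rescale only if the constraint is violated."""
    power = np.real(np.trace(S.conj().T @ S))
    if power <= P:
        return S
    return np.sqrt(P / power) * S

def psgd_step(S, grad_re, grad_im, gamma, P):
    """One projected SGD step on the real and imaginary parts of S, followed by projection."""
    S_new = (S.real - gamma * grad_re) + 1j * (S.imag - gamma * grad_im)
    return project_power(S_new, P)

rng = np.random.default_rng(1)
L, M, P = 4, 2, 1.0
S = project_power(rng.standard_normal((L, M)) + 1j * rng.standard_normal((L, M)), P)
# Stand-in gradients (hypothetical); in training they would come from backpropagation.
g_re, g_im = rng.standard_normal((L, M)), rng.standard_normal((L, M))
S_next = psgd_step(S, g_re, g_im, gamma=0.5, P=P)
power_next = np.real(np.trace(S_next.conj().T @ S_next))
print(power_next)
```

Because the feasible set is a Frobenius-norm ball, uniform rescaling is the Euclidean projection, so a feasible point is returned after every update regardless of the step size.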
3) Deployment of Proposed DL Model After Training: Note that although the two modules of the proposed DL model are jointly trained in the (offline) training phase, they are used separately at the (online) deployment stage for different purposes, as depicted in Fig. 1. Specifically, the trained channel predictor is used to robustly estimate the MIMO channel $h$ from the received pilot signal $y$ via $\hat{h} = f_\theta(y; S)$. On the other hand, the trained pilot optimizer is not used directly; instead, its learned weights are utilized as the optimized pilot signal.

IV. SIMULATION RESULTS
In this section, we present the simulation results to validate the performance and effectiveness of the proposed DL model.

A. Simulation Setups
In the simulations, the estimated channel and noise covariance matrices, $\hat{C}_h$ and $\hat{C}_z$, are obtained by the well-known exponential model [46], where $0 \le \rho_h \le 1$ and $0 \le \rho_z \le 1$ denote the channel and noise correlation coefficients, respectively. Also, $\sigma^2$ denotes a parameter that sets the system SNR.

In the training phase, the proposed DL model is trained with $10^5$ samples of the compensated channel and noise vectors, $\check{h}$ and $\check{z}$, drawn from $\mathcal{CN}(0, \check{C}_h)$ and $\mathcal{CN}(0, \check{C}_z)$, respectively. During the training (respectively, after the training), the performance of the proposed DL model is validated through the validation step (respectively, evaluated through the test step) using $3 \times 10^4$ samples of the actual channel and noise vectors, $h$ and $z$, drawn from $h \sim \mathcal{CN}(0, C_h)$ and $z \sim \mathcal{CN}(0, C_z)$, respectively.
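As a rough illustration of the exponential model, one common real-valued form sets the $(i,j)$th entry of the covariance matrix to $\rho^{|i-j|}$. The sketch below assumes that form and assumes that the noise covariance is simply scaled by $\sigma^2$; the exact expressions and the SNR definition used in the simulations are not reproduced here, so the function name, the scaling, and the numerical values are all hypothetical.

```python
import numpy as np

def exponential_correlation(n, rho):
    """n x n matrix with entries rho^{|i - j|} (one common exponential model; an assumption)."""
    idx = np.arange(n)
    return float(rho) ** np.abs(idx[:, None] - idx[None, :])

# Hypothetical settings for illustration.
M, N, L = 2, 2, 4
rho_h, rho_z, sigma2 = 0.6, 0.3, 0.1

C_h_hat = exponential_correlation(M * N, rho_h)            # size MN x MN
C_z_hat = sigma2 * exponential_correlation(N * L, rho_z)   # size NL x NL (sigma^2 scaling assumed)
print(C_h_hat.shape, C_z_hat.shape)
```

With $\rho = 0$ the matrix reduces to the identity (uncorrelated entries), and as $\rho$ approaches 1 it approaches the all-ones matrix (fully correlated entries).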

B. Ablation Studies
We first conduct ablation studies to examine the effects of the parameters and configurations of the proposed DL model and to gain the relevant design insights. Fig. 5(a) and (b) show the training and validation performance of the proposed DL model, respectively, for various values of the step size $\gamma$ when trained over $10^3$ epochs with a mini-batch size of $10^3$ (i.e., over $10^5$ iterations in total). As can be seen from Fig. 5(a) and (b), larger values of $\gamma$, such as $\gamma = 0.1$ and $\gamma = 0.01$, result in unstable learning with poor generalization behaviors, whereas smaller values of $\gamma$, such as $\gamma = 0.001$ and $\gamma = 0.0001$, yield stable learning with good generalization behaviors. Also, it can be observed that although the training with $\gamma = 0.0001$ converges more slowly than that with $\gamma = 0.001$, the former has better generalization performance than the latter due to less overfitting. For this reason, in the subsequent simulations, we use $\gamma = 0.0001$ when training the proposed DL model.
In Fig. 6, we compare the performance of the proposed DL model to that of the following variants.
1) The proposed DL model without (w/o) the pilot optimizer, in which the channel predictor is solely employed by adopting the orthogonal pilot signal with the transmission power equal to $P$, i.e., $S^H S = (P/M) I_M$.
2) The proposed DL model trained without the covariance compensation, in which the proposed DL model is trained with the samples of the uncompensated channel and noise vectors drawn from $\mathcal{CN}(0, \hat{C}_h)$ and $\mathcal{CN}(0, \hat{C}_z)$, respectively.
3) The proposed DL model without the CNN in the channel predictor, in which $K$ convolutional and pooling layers are replaced by $K$ FC layers, where the number of learnable parameters in each FC layer is set to be the same as that in each convolutional layer.

From Fig. 6, it can be observed that the proposed DL model provides the best generalization performance (although the proposed DL model and all of its variants are stably trainable), thereby demonstrating the effectiveness and completeness of the proposed network architecture and training strategy. In particular, solely using the channel predictor without the pilot optimizer is ineffective due to severe overfitting with very poor generalization performance.

⁸It is practically valid and reasonable to use the Wishart distributions to model the distributions of the (sample) estimation error covariance matrices of the channel and noise when the corresponding estimation errors follow the Gaussian distributions. Meanwhile, the proposed DL model can still be used even when the estimation error covariance matrices follow any other distributions.
In Fig. 7, the pilot signal learned by the proposed DL model is visualized.

C. Performance Comparisons
Now, we compare the performance of the proposed DL model with that of the following baseline schemes.
1) Baseline Scheme I: This scheme corresponds to a nonrobust LMMSE channel estimation with the estimated channel and noise covariance matrices, in which the MIMO channel is estimated as

$$\hat{h} = \hat{C}_h (S \otimes I_N)^H \left( (S \otimes I_N) \hat{C}_h (S \otimes I_N)^H + \hat{C}_z \right)^{-1} y. \tag{27}$$

Also, the pilot signal is set such that $S^H S = (P/M) I_M$.
2) Baseline Scheme II: This scheme is an extension of the existing scheme in [20, Th. 2] to the case with both the channel and noise covariance uncertainties, in which the MIMO channel is estimated as in (27) with $\hat{C}_h$ and $\hat{C}_z$ replaced by $\hat{C}_h + \epsilon_h I_{MN}$ and $\hat{C}_z + \epsilon_z I_{NL}$, respectively. Also, the pilot signal is set such that $S^H S$ has a single nonzero leading block and zeros elsewhere, where the design involves a $\nu \times \nu$ diagonal matrix containing the $\nu$ largest eigenvalues of $\hat{C}_h$ in descending order on the diagonal.
3) Baseline Scheme III: This scheme is the LS channel estimation with no knowledge of the channel and noise covariance matrices, in which the MIMO channel is estimated as $\hat{h} = (S^\dagger \otimes I_N) y$. Also, the pilot signal is set such that $S^H S = (P/M) I_M$.

In Fig. 8, the channel estimation MSEs of the proposed and baseline schemes are shown versus the SNR. From Fig. 8, it can be observed that the proposed scheme consistently outperforms the baseline schemes over the entire SNR range, indicating that the proposed DL model effectively copes with the channel and noise covariance uncertainties via the covariance compensation strategy during the training. The baseline scheme III performs worst, mainly due to noise amplification and because it does not utilize the channel and noise covariance matrices for the channel estimation. Even though the baseline schemes I and II perform better than the baseline scheme III, their performance is still limited by the mismatches between the actual and estimated covariance matrices. Overall, the results of Fig. 8 reveal that properly utilizing the channel and noise statistical information and overcoming the uncertainties in such information play crucial roles in improving the performance of the MIMO channel estimation in practice.
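The LMMSE and LS baselines can be sketched in a few lines of NumPy. The sketch assumes the vectorized signal model $y = (S \otimes I_N) h + z$, which is what the LS expression $\hat{h} = (S^\dagger \otimes I_N) y$ implies; the function names and the toy setup (identity covariances, a QR-based orthogonal pilot, a noiseless observation) are hypothetical.

```python
import numpy as np

def lmmse_estimate(y, S, C_h, C_z, N):
    """LMMSE estimate of h from y = (S kron I_N) h + z, given covariances C_h and C_z."""
    A = np.kron(S, np.eye(N))
    G = A @ C_h @ A.conj().T + C_z          # covariance of the received pilot signal
    return C_h @ A.conj().T @ np.linalg.solve(G, y)

def ls_estimate(y, S, N):
    """LS estimate using the pseudo-inverse of the pilot matrix."""
    return np.kron(np.linalg.pinv(S), np.eye(N)) @ y

rng = np.random.default_rng(2)
M, N, L, P = 2, 2, 4, 4.0

# Orthogonal pilot with S^H S = (P/M) I_M, built from a reduced QR factorization.
Q, _ = np.linalg.qr(rng.standard_normal((L, M)) + 1j * rng.standard_normal((L, M)))
S = np.sqrt(P / M) * Q

h = (rng.standard_normal(M * N) + 1j * rng.standard_normal(M * N)) / np.sqrt(2.0)
y_clean = np.kron(S, np.eye(N)) @ h         # noiseless observation for a sanity check

h_ls = ls_estimate(y_clean, S, N)
h_lmmse = lmmse_estimate(y_clean, S, np.eye(M * N), 1e-6 * np.eye(N * L), N)
print(np.linalg.norm(h_ls - h), np.linalg.norm(h_lmmse - h))
```

In the robust variant (baseline scheme II), the same `lmmse_estimate` call would be made with the diagonally loaded covariance matrices in place of the estimated ones.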
In order to investigate the impacts of the strengths of the channel and noise correlations on the channel estimation performance, in Fig. 9, the values of $\rho_h$ and $\rho_z$ are set to the same value $\rho$ (i.e., $\rho = \rho_h = \rho_z$). Then we depict the channel estimation MSEs of the various schemes as functions of the (common) correlation coefficient $\rho$. From Fig. 9, we can observe that the performance of all the baseline schemes degrades as $\rho$ increases, and very large values of $\rho$ eventually result in the baseline scheme I performing even worse than the baseline scheme III, meaning that the adverse impacts of the channel and noise covariance uncertainties on the channel estimation become (much) more severe in (extremely) strongly correlated environments. On the other hand, the performance of the proposed scheme initially improves until about $\rho = 0.6$ and then degrades, suggesting that the robust channel estimation with the proposed DL model will be most effective in moderately correlated environments.
We further investigate the impacts of the degrees of the channel and noise covariance uncertainties on the channel estimation performance by introducing and controlling a parameter $\beta$ such that $\beta = \beta_h = \beta_z$. In Fig. 10, the channel estimation MSEs of the proposed and baseline schemes are shown for different values of $\beta$. It can be seen from Fig. 10 that as $\beta$ increases, the performance of all the schemes degrades, as expected, because there are more uncertainties (or errors) in the channel and noise covariance matrices. Nevertheless, the proposed scheme still performs better than the other schemes, and the performance gaps are more pronounced for medium values of $\beta$. Therefore, the proposed DL model will indeed be very useful in practical IoT applications where only a coarse (not fine-grained) estimation of the channel and noise covariance matrices is possible with insufficient or erroneous samples.
In Fig. 11, the channel estimation performance of the various schemes is shown versus $L$ to examine the effects of the pilot length. Also, in Fig. 12, setting $\mu = M = N$, we plot the channel estimation performance of the various schemes for different values of $\mu$ to investigate the effects of the number of antennas. As can be seen from Fig. 11 (respectively, Fig. 12), the performance of all the schemes improves when $L$ increases (respectively, $M$ decreases) since the effect of noise is reduced given the same SNR in our simulation setting (respectively, the number of channel coefficients to be estimated decreases given the same resources). Nonetheless, the proposed scheme is observed to still surpass the other schemes, where the performance gaps are more pronounced as $L$ increases or $M$ decreases.

D. Computational Complexity Analysis
Finally, we analyze and compare the computational complexities of the proposed and baseline schemes in terms of the training and inference complexities, which are summarized in Table I.

1) Inference Complexity: The inference complexities of the baseline schemes I and II are both dominated by the computation of the LMMSE channel estimate in (27), which requires $O(LMN^2 + N^3 L^3)$ [47]. Similarly, the inference complexity of the baseline scheme III is given by $O(LMN^2 + L^3)$ [47]. On the other hand, for the inference of the proposed scheme, the feedforward computations of the pilot layer, the $K$ convolutional/pooling layers, and the FC layer need to be sequentially performed, whose computational complexities are $O(LMN^2)$, $O(LN \sum_{k=1}^{K} \ell_k c_{k-1} c_k)$, and $O(MN n_F)$, respectively [38], [43], where $\ell_k$ denotes the kernel size of the $k$th convolutional layer and $n_F$ denotes the size of the input in the FC layer of the channel predictor. Thus, the total inference complexity of the proposed scheme is estimated as $O(LMN^2 + LN \sum_{k=1}^{K} \ell_k c_{k-1} c_k + MN n_F)$.

2) Training Complexity: The training of the proposed scheme can be done via the backpropagation algorithm, which requires performing multiple iterations of forward and backward computations [43]. As analyzed above, one iteration of the forward computation of the proposed scheme has a computational complexity of $O(LMN^2 + LN \sum_{k=1}^{K} \ell_k c_{k-1} c_k + MN n_F)$. Also, the backward computation of the proposed scheme in one iteration has a complexity similar to that of the forward computation [43], which is thus still given by $O(LMN^2 + LN \sum_{k=1}^{K} \ell_k c_{k-1} c_k + MN n_F)$. Overall, the training complexity of the proposed scheme is estimated as $O(N_{\mathrm{iter}} (LMN^2 + LN \sum_{k=1}^{K} \ell_k c_{k-1} c_k + MN n_F))$, where $N_{\mathrm{iter}}$ denotes the number of iterations.

Importantly, the above complexity analysis implies that the inference complexity of the proposed DL model can be even lower than those of the baseline schemes if the values of the parameters are properly chosen. For example, when $(\sum_{k=1}^{K} \ell_k c_{k-1} c_k)/(N^2 L^2) + (M n_F)/(N^2 L^3) \le 1$, the proposed scheme clearly has a lower inference complexity than the baseline schemes I and II.
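The comparison above can be checked numerically by evaluating the leading-order operation counts. The sketch below drops all constant factors, so the numbers are only order-of-magnitude indicators; the function names and the two parameter settings are hypothetical.

```python
def conv_term(kernel_sizes, channels):
    """Sum over k of (kernel size) * c_{k-1} * c_k, with channels = [c_0, c_1, ..., c_K]."""
    return sum(ell * c_prev * c
               for ell, c_prev, c in zip(kernel_sizes, channels[:-1], channels[1:]))

def proposed_inference_cost(L, M, N, kernel_sizes, channels, n_F):
    """Leading-order count for the pilot layer, conv/pooling layers, and FC layer."""
    return L * M * N**2 + L * N * conv_term(kernel_sizes, channels) + M * N * n_F

def lmmse_inference_cost(L, M, N):
    """Leading-order count for the LMMSE baselines (schemes I and II)."""
    return L * M * N**2 + N**3 * L**3

def condition_holds(L, M, N, kernel_sizes, channels, n_F):
    """Condition from the text under which the proposed scheme is no more costly."""
    return (conv_term(kernel_sizes, channels) / (N**2 * L**2)
            + M * n_F / (N**2 * L**3)) <= 1

def setting(M, N, L):
    """Build a configuration with c_0 = 2N, c_k = 8kMN, kernel size 3, three halvings."""
    channels = [2 * N] + [8 * k * M * N for k in (1, 2, 3)]
    n_F = channels[3] * (L // 8)
    return dict(L=L, M=M, N=N, kernel_sizes=[3, 3, 3], channels=channels, n_F=n_F)

small = setting(M=2, N=2, L=8)       # small system (hypothetical)
large = setting(M=2, N=16, L=128)    # larger N and L (hypothetical)

for name, cfg in (("small", small), ("large", large)):
    print(name,
          proposed_inference_cost(**cfg),
          lmmse_inference_cost(cfg["L"], cfg["M"], cfg["N"]),
          condition_holds(**cfg))
```

Because the LMMSE cost grows as $N^3 L^3$ while the convolutional term grows more slowly in $N$ and $L$, the condition tends to hold only once $N$ and $L$ are large enough relative to the channel counts.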

V. CONCLUSION
This article investigated the robust channel estimation problem for the MIMO-aided IoT system in the presence of channel and noise covariance uncertainties. To solve this problem, a novel DL model composed of two modules, the pilot optimizer and the channel predictor, was proposed. An effective training strategy for the proposed DL model was also devised by properly compensating the channel and noise covariance matrices such that the adverse impacts of the underlying uncertainties are overcome. The extensive simulation results confirmed that the proposed DL model performs better and is more effective than the baseline schemes, rendering it highly useful in practice.

Manuscript received 7 August 2023; accepted 9 October 2023. Date of publication 16 October 2023; date of current version 7 March 2024. This work was supported in part by the National Research Foundation of Korea (NRF) Grant funded by the Korea Government (MSIT) under Grant 2022R1A4A1033830, and in part by the Korea Agency for Infrastructure Technology Advancement (KAIA) Grant funded by the Ministry of Land, Infrastructure and Transport under Grant RS-2023-00251002.

Fig. 1. MIMO IoT system with the proposed DL model for the robust channel estimation.

Fig. 2. Whole network structure of the proposed DL model.

Fig. 3. Network structure of the pilot optimizer in the proposed DL model. (a) Weight connection between the input and output nodes for each $n \in \{1, \ldots, N\}$ and $l \in \{1, \ldots, L\}$, where not all nodes are shown; instead, only the nodes with connections are shown for brevity. The connections with weights $\{\bar{w}_{l,m} : m = 1, \ldots, M\}$ and $\{\tilde{w}_{l,m} : m = 1, \ldots, M\}$ are denoted by dashed and dashed-dotted lines, respectively. Also, the connections with weights fixed to 1 are denoted by solid lines. (b) Network structure of the pilot optimizer when $M = L = 1$ and $N = 2$.

Fig. 4. Network structure of the channel predictor in the proposed DL model.

7) Input: To enable the optimization of the pilot signal through the training procedure, the pilot optimizer needs to take the actual values of $h$ and $z$ as inputs such that $\bar{x} = \mathrm{Re}\{h\}$, $\tilde{x} = \mathrm{Im}\{h\}$, $\bar{b} = \mathrm{Re}\{z\}$, and $\tilde{b} = \mathrm{Im}\{z\}$.
In Fig. 7(a) and (b), the weights of the trained pilot optimizer and the Gram matrix $S^H S$ of the optimized pilot signal (i.e., the weight matrix of the trained pilot optimizer in the proposed DL model) are shown for various SNR values, respectively. From Fig. 7(a), it can be seen that the design or optimization pattern of the pilot signal varies depending on the SNR value. Also, from Fig. 7(b), we can observe that as the SNR increases, the off-diagonal values of the Gram matrix and the variance of the diagonal values get smaller, meaning that the Gram matrix of the pilot signal learned by the proposed DL model moves closer to a scaled identity matrix.

Fig. 7. Visualization of the pilot signal learned by the proposed DL model.(a) Weights of the trained pilot optimizer.(b) Gram matrix of the optimized pilot signal.

Fig. 8. Channel estimation MSEs of the proposed and baseline schemes over different SNR values.

Fig. 9. Channel estimation MSEs of the proposed and baseline schemes over different values of ρ.

Fig. 10. Channel estimation MSEs of the proposed and baseline schemes over different values of β.

Fig. 11. Channel estimation MSEs of the proposed and baseline schemes over different values of L.

Fig. 12. Channel estimation MSEs of the proposed and baseline schemes over different values of M.
$E_z \in \mathcal{E}_z$, then $U E_h U^H \in \mathcal{E}_h$ and $V E_z V^H \in \mathcal{E}_z$ for arbitrary unitary matrices $U$ and $V$. Examples of such unitarily invariant sets are norm-bounded sets, such as $\mathcal{E}_a = \{E_a : \mathrm{Tr}(E_a^H E_a) \le \epsilon_a\}$ (i.e., Frobenius norm-bounded set) and $\mathcal{E}_a = \{E_a : \lambda_{\max}(E_a^H E_a) \le \epsilon_a\}$ (i.e., spectral norm-bounded set).

with $\hat{C}_h$ and $\hat{C}_z$ replaced by $C_h$ and $C_z$, respectively.

The spectral norm-bounded covariance uncertainty sets are considered here. Specifically, the values of $E_h$ and $E_z$ are randomly generated such that $E_h \sim \mathcal{W}(MN, I_{MN})$ and $E_z \sim \mathcal{W}(NL, I_{NL})$, respectively,⁸ and then they are normalized such that $\lambda_{\max}(E_h^H E_h) = \epsilon_h$ and $\lambda_{\max}(E_z^H E_z) = \epsilon_z$, respectively.

TABLE I COMPUTATIONAL COMPLEXITIES OF PROPOSED AND BASELINE SCHEMES