Framework on Deep Learning Based Joint Hybrid Processing for mmWave Massive MIMO Systems

For millimeter wave (mmWave) massive multiple-input multiple-output (MIMO) systems, a hybrid processing architecture is essential to significantly reduce complexity and cost, but it is challenging to optimize jointly over the transmitter and receiver. In this paper, deep learning (DL) is applied to design a novel joint hybrid processing framework (JHPF) that allows end-to-end optimization by using back propagation. The proposed framework includes three parts: a hybrid processing designer, which outputs the hybrid processing matrices for the transceiver by using neural networks (NNs); a signal flow simulator, which simulates the signal transmission over the air; and a signal demodulator, which maps the detected symbols to the original bits by using an NN. By minimizing the cross-entropy loss between the recovered and original bits, the proposed framework optimizes the analog and digital processing matrices at the transceiver jointly and implicitly instead of approximating pre-designed label matrices, and its trainability is proved theoretically. It can also be directly applied to orthogonal frequency division multiplexing systems by simply modifying the structure of the training data. Simulation results show that the proposed DL-JHPF outperforms the existing hybrid processing schemes and is robust to mismatched channel state information and channel scenarios, with significantly reduced runtime.


I. INTRODUCTION
Due to the huge bandwidth, millimeter wave (mmWave) communications have been recognized as one of the key technologies to meet the demand for unprecedentedly high data rate transmission in future mobile networks [1]. By equipping large-scale antenna arrays, massive multiple-input multiple-output (MIMO) can provide sufficiently large array gains for spatial multiplexing and beamforming [2]. MmWave massive MIMO communications combine the merits of both and thus have attracted significant interest [3]. However, the expensive and power-hungry hardware used in mmWave bands is the main obstacle to equipping a dedicated radio frequency (RF) chain for each antenna. The mainstream solution to this problem is the two-stage hybrid architecture, where a large number of antennas are connected to much fewer RF chains via phase shifters [4], [5].

A. Related Work
For mmWave massive MIMO systems with the hybrid architecture, both the analog and digital processing should be carefully designed to achieve performance comparable to that of fully-digital systems. In [4], a low-complexity hybrid precoding scheme at the base station (BS) has been proposed for the massive MIMO downlink with single-antenna users. The hybrid architecture has been further introduced to the user side in [6], where hybrid block diagonalization (HBD) has been used for the analog and digital processing design. By exploiting the sparsity of mmWave channels, the hybrid precoding and combining at both the transmitter and receiver have been optimized in [7]. The heuristic hybrid beamforming design in [8] can approach the performance of the fully-digital architecture. The alternating minimization algorithms in [9] for both fully-connected and sub-connected hybrid architectures have low complexity and limited performance loss. In [10], the hybrid processing along with channel estimation has been designed and analyzed for both sparse and non-sparse channels. The uniform channel decomposition and nonlinear digital processing have been introduced in [11] for hybrid beamforming design. In the existing works, the hybrid processing matrices at the transmitter and receiver are usually optimized separately because joint optimization under non-convex constraints is intractable, which leaves room for further performance improvement through joint optimization.
Deep learning (DL) has achieved great success in various fields, including computer vision [12], speech signal processing [13], natural language processing [14], and so on, due to its unique ability to extract and learn inherent features. It has been recently introduced to wireless communications and has proved quite powerful in the optimization of communication systems [15]-[18] and resource allocation [19]-[23]. In [17], DL has been successfully applied in pilot-assisted signal detection for orthogonal frequency division multiplexing (OFDM) systems with non-ideal transceiver and channel conditions. For wideband mmWave massive MIMO systems in time-varying channels, channel correlation has been exploited by a deep convolutional neural network (CNN) in [24] to improve the accuracy and accelerate the computation of channel estimation. A deep neural network (DNN) has been utilized in [25] to model the mapping relationship among antennas for reliable channel estimation in massive MIMO systems with mixed-resolution ADCs. An autoencoder-like DNN has been developed in [26] to reduce the overhead for channel state information (CSI) feedback in the frequency division duplex massive MIMO system. In [27], a CNN has been utilized in CSI compression and decompression to significantly improve the recovery accuracy. By combining the residual network and CNN, an efficient bit-level channel quantization scheme has been proposed in [28]. DL based end-to-end optimization has been developed in [29] and [30] by breaking the block structures at the transceiver. DL has been recently used to design the hybrid processing matrices for massive MIMO systems with various transceiver architectures [31]-[35]. In [31], the analog and digital precoder design has been modeled as a DNN mapping based on geometric mean decomposition. In [32], a DNN has been applied to design the analog precoder for massive multiple-input single-output (MISO) systems. A deep CNN has been applied to learn the phases of the analog precoder and combiner for mmWave massive MIMO systems in [33]. For the same system, channel estimation and analog processing have been jointly optimized by DL with reduced pilot overhead in [34]. In [35], a deep CNN along with an equivalent channel hybrid precoding algorithm has been proposed to design the hybrid processing matrices.

B. Motivation and Contribution
The research on DL based hybrid processing for mmWave massive MIMO systems is still in the exploratory stage and has many open issues. The existing works have applied DL to design the analog precoder [32], the analog combiner [35], the analog precoder and combiner [33], [34], and the analog and digital precoders [31]. That is, DL currently designs only part of the hybrid processing for the mmWave transceiver. In addition, conventional hybrid processing schemes are usually used to generate label matrices for the DNN to approximate, which limits the performance of the DL based approaches. These problems motivate us to propose a general DL based joint hybrid processing framework (DL-JHPF) with the following two unique features:
1) The framework jointly optimizes the analog and digital processing matrices at both the transmitter and receiver in an end-to-end manner without pre-designed label matrices. As a result, it can be applied to various types of mmWave transceiver architectures and has the potential to exceed the performance of the existing schemes.
2) The framework enables end-to-end optimization but still preserves the block structures at the transceiver, considering the hardware and power constraints in practical implementation of the hybrid architecture, which is quite different from the end-to-end optimization in [29] and [30].
The main contributions of this paper are summarized as follows.
1) We model the joint analog and digital processing design for the transceiver as a DL based framework, which consists of the NN based hybrid processing designer, signal flow simulator, and NN based signal demodulator. For the sake of practical implementation, it does not break the original block structures at the transceiver but still allows back-propagation (BP) based end-to-end optimization by minimizing the cross-entropy loss between the original and recovered bits.
The rest of the paper is organized as follows. Section II describes the channel model and signal transmission process for the considered mmWave massive MIMO system. The proposed DL-JHPF is elaborated in Section III. Simulation results are provided in Section IV to verify the effectiveness of the proposed framework, and Section V gives concluding remarks.
$[\mathbf{X}]_{i,j}$ and $[\mathbf{x}]_i$ denote the $(i, j)$th element of matrix $\mathbf{X}$ and the $i$th element of vector $\mathbf{x}$, respectively. $|\cdot|$ denotes the amplitude of a complex number.

II. SYSTEM MODEL
As shown in Fig. 1, we consider a point-to-point massive MIMO system working at mmWave bands, where the transmitter and the receiver have $N_T$ and $N_R$ antennas, respectively. To reduce the hardware cost and power consumption, $N_T^{RF}$ ($< N_T$) and $N_R^{RF}$ ($< N_R$) RF chains are used at the transmitter and the receiver, respectively, and are connected to the large-scale antenna arrays via phase shifters.

A. Channel Model
Due to the sparse scattering property, the Saleh-Valenzuela channel model is widely used to depict the mmWave propagation environment, where the scattering of multiple rays forms several clusters. According to [7], the $N_R \times N_T$ channel matrix between the receiver and the transmitter can be represented as

$$\mathbf{H} = \sqrt{\frac{N_T N_R}{N_{cl} N_{ray}}} \sum_{n=1}^{N_{cl}} \sum_{m=1}^{N_{ray}} \alpha_{n,m}\, \mathbf{a}_R(\phi_{n,m})\, \mathbf{a}_T^H(\varphi_{n,m}), \quad (1)$$

where $N_{cl}$ and $N_{ray}$ denote the number of scattering clusters and the number of rays in each cluster, respectively, $\alpha_{n,m} \sim \mathcal{CN}(0, \sigma_\alpha^2)$ is the propagation gain of the $m$th path in the $n$th cluster with $\sigma_\alpha^2$ being the average power gain, and $\phi_{n,m}$ and $\varphi_{n,m} \in [0, 2\pi]$ are the azimuth angles of arrival and departure (AoA/AoD) at the receiver and the transmitter, respectively, of the $m$th path in the $n$th cluster. For a uniform linear array with $N$ antenna elements and an azimuth angle of $\theta$, the response vector can be expressed as

$$\mathbf{a}(\theta) = \frac{1}{\sqrt{N}} \left[ 1, e^{j 2\pi \frac{d}{\lambda} \sin\theta}, \ldots, e^{j 2\pi (N-1) \frac{d}{\lambda} \sin\theta} \right]^T, \quad (2)$$

where $d$ and $\lambda$ denote the distance between adjacent antennas and the carrier wavelength, respectively.
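As an illustration of the clustered channel model above, the following NumPy sketch draws one narrowband channel realization. It is only a minimal sketch under stated assumptions: the function names `ula_response` and `sv_channel` are hypothetical, half-wavelength antenna spacing is assumed, and every ray angle is drawn independently and uniformly from [0, 2π], which ignores the per-cluster angular spread of a full clustered model.

```python
import numpy as np

def ula_response(n_ant, theta, d_over_lambda=0.5):
    """Normalized ULA response vector for azimuth angle theta (radians)."""
    k = np.arange(n_ant)
    return np.exp(1j * 2 * np.pi * d_over_lambda * k * np.sin(theta)) / np.sqrt(n_ant)

def sv_channel(n_r, n_t, n_cl, n_ray, sigma2_alpha=1.0, rng=None):
    """Draw one N_R x N_T narrowband clustered channel matrix."""
    rng = np.random.default_rng() if rng is None else rng
    h = np.zeros((n_r, n_t), dtype=complex)
    for _ in range(n_cl * n_ray):
        # alpha ~ CN(0, sigma2_alpha): real and imaginary parts each carry half the power
        alpha = np.sqrt(sigma2_alpha / 2) * (rng.standard_normal() + 1j * rng.standard_normal())
        # Assumption: AoA and AoD are i.i.d. uniform (no cluster-level angular spread)
        aoa, aod = rng.uniform(0.0, 2 * np.pi, size=2)
        # Rank-one contribution a_R(aoa) a_T(aod)^H of this ray
        h += alpha * np.outer(ula_response(n_r, aoa), ula_response(n_t, aod).conj())
    return np.sqrt(n_t * n_r / (n_cl * n_ray)) * h

H = sv_channel(n_r=16, n_t=32, n_cl=3, n_ray=20, rng=np.random.default_rng(0))
```

Passing an explicit `rng` keeps the draw reproducible, which is convenient when the same channel set must be shared between training and evaluation.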
In the above channel model, we assume that the transmitted signal is narrowband and therefore the channel matrix is independent of frequency. For wideband transmission, OFDM is used to convert a frequency-selective channel into multiple flat fading channels, and the corresponding channel matrices differ across subcarriers. Accordingly, the design of DL-JHPF in Section III starts with narrowband systems and is then extended to wideband OFDM systems.

B. Signal Transmission
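The transmitter sends $N_s$ parallel data streams to the receiver through the wireless channel. The bits of each data stream are first mapped to symbols by $M$-ary modulation. The symbol vector intended for the receiver, $\mathbf{x} \in \mathbb{C}^{N_s \times 1}$ with $\mathbb{E}[\mathbf{x}\mathbf{x}^H] = \frac{1}{N_s} \mathbf{I}_{N_s}$, is successively processed by the digital precoder, $\mathbf{F}_{BB} \in \mathbb{C}^{N_T^{RF} \times N_s}$, at the baseband and the analog precoder, $\mathbf{F}_{RF} \in \mathbb{C}^{N_T \times N_T^{RF}}$, through the phase shifters, yielding the transmitted signal

$$\mathbf{s} = \sqrt{P}\, \mathbf{F}_{RF} \mathbf{F}_{BB} \mathbf{x}, \quad (3)$$

where $P$ denotes the transmit power. $\mathbf{F}_{RF}$ represents the phase-only modulation by the phase shifters and thus has the constraint $|[\mathbf{F}_{RF}]_{i,j}| = \frac{1}{\sqrt{N_T}}, \forall i, j$. $\mathbf{F}_{BB}$ is normalized as $\|\mathbf{F}_{RF} \mathbf{F}_{BB}\|_F^2 = N_s$ to satisfy the total power constraint at the transmitter. Then the received signal at the receiver is given by

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n} = \sqrt{P}\, \mathbf{H} \mathbf{F}_{RF} \mathbf{F}_{BB} \mathbf{x} + \mathbf{n}, \quad (4)$$

where $\mathbf{n} \in \mathbb{C}^{N_R \times 1}$ is additive white Gaussian noise (AWGN) with $\mathcal{CN}(0, 1)$ elements. The received signal $\mathbf{y}$ is then processed by the hybrid architecture at the receiver as

$$\mathbf{r} = \mathbf{W}_{BB}^H \mathbf{W}_{RF}^H \mathbf{y}, \quad (5)$$

where $\mathbf{W}_{RF} \in \mathbb{C}^{N_R \times N_R^{RF}}$ and $\mathbf{W}_{BB} \in \mathbb{C}^{N_R^{RF} \times N_s}$ represent the analog combiner and digital combiner, respectively. Similar to $\mathbf{F}_{RF}$, a hardware constraint is imposed on $\mathbf{W}_{RF}$ such that $|[\mathbf{W}_{RF}]_{i,j}| = \frac{1}{\sqrt{N_R}}, \forall i, j$. Then the detected signal vector, $\mathbf{r}$, is demodulated to recover the original bits of the $N_s$ data streams.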
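Since the performance of a digital communication system is ultimately determined by the bit error rate (BER), we aim to jointly design $\mathbf{F}_{RF}$, $\mathbf{F}_{BB}$, $\mathbf{W}_{RF}$, and $\mathbf{W}_{BB}$ to minimize the BER between the original and demodulated bits, that is,

$$\min_{\mathbf{F}_{RF}, \mathbf{F}_{BB}, \mathbf{W}_{RF}, \mathbf{W}_{BB}} P_e(\mathbf{F}_{RF}, \mathbf{F}_{BB}, \mathbf{W}_{RF}, \mathbf{W}_{BB}) \quad (6)$$
$$\text{s.t.} \quad |[\mathbf{F}_{RF}]_{i,j}| = \frac{1}{\sqrt{N_T}}, \ \forall i, j, \quad (7)$$
$$|[\mathbf{W}_{RF}]_{i,j}| = \frac{1}{\sqrt{N_R}}, \ \forall i, j, \quad (8)$$
$$\|\mathbf{F}_{RF} \mathbf{F}_{BB}\|_F^2 = N_s. \quad (9)$$

The BER in (6) is a complicated nonlinear function of $\mathbf{F}_{RF}$, $\mathbf{F}_{BB}$, $\mathbf{W}_{RF}$, and $\mathbf{W}_{BB}$ without a closed-form expression, and the constraints in (7) and (8) are non-convex, which makes this optimization problem intractable for traditional approaches. DL is a potential solution by using the BP algorithm, and thus we develop DL-JHPF to address this problem.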
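The transmission chain described in this subsection (precoding, channel, combining) can be sketched in a few lines of NumPy. This is a hedged illustration, not the authors' code: `qpsk_symbols` and `signal_flow` are hypothetical names, the Gray-style bit-to-symbol mapping is an assumption, and the combiner is applied as a Hermitian transpose, which is the only choice consistent with the stated matrix dimensions.

```python
import numpy as np

def qpsk_symbols(bits):
    """Map an (N_s, 2) bit array to QPSK symbols, each with power 1/N_s."""
    n_s = bits.shape[0]
    # Assumed mapping: bit 0 -> +1, bit 1 -> -1 on the I and Q branches
    sym = ((1 - 2 * bits[:, 0]) + 1j * (1 - 2 * bits[:, 1])) / np.sqrt(2)
    return sym / np.sqrt(n_s)

def signal_flow(h, f_rf, f_bb, w_rf, w_bb, x, noise, power):
    """Return the detected signal r for one channel use."""
    s = np.sqrt(power) * f_rf @ f_bb @ x          # transmitted signal
    y = h @ s + noise                             # received signal with AWGN
    return w_bb.conj().T @ w_rf.conj().T @ y      # hybrid combining

# Scalar sanity check: every block is 1x1, no noise
r = signal_flow(np.array([[2.0]]), np.ones((1, 1)), np.ones((1, 1)),
                np.ones((1, 1)), np.ones((1, 1)), np.array([1.0]),
                np.zeros(1), power=4.0)
```

In the scalar check, the transmitted amplitude is sqrt(4) = 2 and the channel gain is 2, so the detected value is 4.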

III. PROPOSED DL-JHPF
In this section, we first briefly review the existing work on the DNN based end-to-end communications. Then we propose DL-JHPF, where the framework is first described, followed by the details of training, deployment, and testing along with the corresponding complexity analysis. Finally, we extend the framework to OFDM systems over wideband mmWave channels.

A. DNN based End-to-End Communications
Prior works have shown that DNN based end-to-end optimization is an efficient tool to minimize BER. The BP algorithm makes DNN based end-to-end communications over the air possible so long as the optimized performance metric is differentiable [15], [29], [30]. For the DNN based end-to-end communication system, the modules at the transmitter and the receiver are replaced by two DNNs, respectively. Specifically, the DNN at the transmitter encodes the original symbols into the transmitted signal and the one at the receiver recovers the original symbols from the output of the wireless channel. In the training stage, the error between the original and recovered symbols is computed and the weights of the two DNNs are adjusted iteratively based on the error gradient propagated from the output layer of the DNN at the receiver to optimize the recovery accuracy. In this paper, we focus on the DL based joint analog and digital processing design for the transceiver in mmWave massive MIMO systems. The existing DNN based end-to-end communication is not suitable for this task since it integrates the modules of the transceiver into two DNNs and thus cannot meet the hardware and power constraints in practical implementation. To address this challenge, we design DL-JHPF in the following.

B. Framework Description
As shown in Fig. 2, the proposed DL-JHPF consists of three parts: hybrid processing designer, signal flow simulator, and NN demodulator, which are elaborated as follows.
Hybrid processing designer: It includes six fully-connected NNs and generates the analog and digital processing matrices for the transmitter and the receiver based on the channel matrix, $\mathbf{H}$. Specifically, $\mathbf{H} \in \mathbb{C}^{N_R \times N_T}$ is first converted to a $2 N_T N_R \times 1$ real-valued vector. Then it is input into two NNs, called precoder phase NN (PP-NN) and combiner phase NN (CP-NN), to generate the corresponding phases, $\boldsymbol{\phi}_P \in \mathbb{R}^{N_T N_T^{RF} \times 1}$ and $\boldsymbol{\phi}_C \in \mathbb{R}^{N_R N_R^{RF} \times 1}$, respectively, for the phase shifters. With $\boldsymbol{\phi}_P$ and $\boldsymbol{\phi}_C$, two complex-valued vectors with constant amplitude elements are generated as

$$\mathbf{f}_{RF} = \frac{1}{\sqrt{N_T}} e^{j \boldsymbol{\phi}_P}, \quad (10)$$
$$\mathbf{w}_{RF} = \frac{1}{\sqrt{N_R}} e^{j \boldsymbol{\phi}_C}, \quad (11)$$

based on which $\mathbf{F}_{RF}$ and $\mathbf{W}_{RF}$ are given by

$$\mathbf{F}_{RF} = T_{v \to m}(\mathbf{f}_{RF}), \quad (12)$$
$$\mathbf{W}_{RF} = T_{v \to m}(\mathbf{w}_{RF}), \quad (13)$$

where the exponential is applied element-wise and $T_{v \to m}(\cdot)$ denotes the operation reshaping a vector to a matrix. Then, $\mathbf{F}_{RF}$ and $\mathbf{W}_{RF}$ along with $\mathbf{H}$ are used to generate a low-dimensional equivalent channel, i.e.,

$$\mathbf{H}_{eq} = \mathbf{W}_{RF}^H \mathbf{H} \mathbf{F}_{RF}. \quad (14)$$

$\mathbf{H}_{eq} \in \mathbb{C}^{N_R^{RF} \times N_T^{RF}}$ is converted to a $2 N_T^{RF} N_R^{RF} \times 1$ real-valued vector before it is input into four parallel NNs. The first two NNs, corresponding to the real part digital combiner NN (ReDC-NN) and the imaginary part digital combiner NN (ImDC-NN), output two $N_s N_R^{RF} \times 1$ vectors, $\mathbf{w}_{BB,re}$ and $\mathbf{w}_{BB,im}$, respectively. Then $\mathbf{W}_{BB}$ can be obtained as

$$\mathbf{W}_{BB} = T_{v \to m}(\mathbf{w}_{BB,re} + j \mathbf{w}_{BB,im}). \quad (15)$$

Another two NNs, corresponding to the real part digital precoder NN (ReDP-NN) and the imaginary part digital precoder NN (ImDP-NN), output two $N_s N_T^{RF} \times 1$ vectors, $\tilde{\mathbf{f}}_{BB,re}$ and $\tilde{\mathbf{f}}_{BB,im}$, respectively. Then the unnormalized digital precoder $\tilde{\mathbf{F}}_{BB}$ is given by

$$\tilde{\mathbf{F}}_{BB} = T_{v \to m}(\tilde{\mathbf{f}}_{BB,re} + j \tilde{\mathbf{f}}_{BB,im}). \quad (16)$$

The following normalization utilizes $\tilde{\mathbf{F}}_{BB}$ and $\mathbf{F}_{RF}$ in (12) to output the final digital precoder as

$$\mathbf{F}_{BB} = \frac{\sqrt{N_s}}{\|\mathbf{F}_{RF} \tilde{\mathbf{F}}_{BB}\|_F} \tilde{\mathbf{F}}_{BB}. \quad (17)$$

Signal flow simulator: In the training stage, it simulates the process from the original bits, $\mathbf{X}_b$, to the detected signal, $\mathbf{r}$, over the channel, $\mathbf{H}$, with AWGN, $\mathbf{n}$, where $\mathbf{X}_b$ with the size of $N_s \times \log_2 M$, $\mathbf{H}$, and $\mathbf{n}$ are generated in the simulation environment. It bridges the back propagation of the error gradient from the NN demodulator to the hybrid processing designer, as we will elaborate in Section III.C. In the deployment and testing stage, the signal flow simulator is replaced by the actual transceiver and the actual wireless fading channel. In these two stages, the analog and digital processing matrices at the transceiver are provided by the hybrid processing designer based on the simulated or actual $\mathbf{H}$.
NN demodulator: It is a fully-connected NN, which receives the detected signal, $\mathbf{r}$, from the signal flow simulator (in the training stage) or the actual receiver (in the testing stage) and outputs the recovered bits $\hat{\mathbf{x}}_b \in \mathbb{R}^{N_s \log_2 M \times 1}$ with each element lying in the interval $[0, 1]$. $\hat{\mathbf{x}}_b$ is then reshaped to $\hat{\mathbf{X}}_b$ with the same size as $\mathbf{X}_b$.
Remark 1. The learning of the hybrid processing matrices, $\mathbf{F}_{RF}$, $\mathbf{W}_{RF}$, $\mathbf{F}_{BB}$, and $\mathbf{W}_{BB}$, in DL-JHPF is embedded into the signal transmission and demodulation process instead of approximating pre-designed label matrices. All NNs are optimized jointly and share the mapping from $\mathbf{X}_b$ at the transmitter to $\hat{\mathbf{X}}_b$ at the receiver, which resembles an autoencoder. By minimizing the error between $\mathbf{X}_b$ and $\hat{\mathbf{X}}_b$, each NN in the hybrid processing designer implicitly learns to output vectors with a specific meaning, i.e., the phases of the phase shifters and the real and imaginary parts of the digital precoder and combiner. As a result, DL-JHPF has the potential to exceed the performance of the existing schemes.
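The non-trainable operations of the hybrid processing designer described in this subsection (real/imaginary stacking, phase-to-matrix mapping, equivalent channel, and precoder normalization) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: all function names are hypothetical, the NN outputs are replaced by random vectors, and column-major reshaping is assumed because the source does not state the reshape order.

```python
import numpy as np

def complex_to_real_vector(m):
    """Stack real and imaginary parts of a matrix into one real vector."""
    v = m.reshape(-1, order="F")
    return np.concatenate([v.real, v.imag])

def phases_to_analog(phi, n_ant, n_rf):
    """Constant-modulus entries of amplitude 1/sqrt(n_ant), reshaped to n_ant x n_rf."""
    vec = np.exp(1j * phi) / np.sqrt(n_ant)
    return vec.reshape(n_ant, n_rf, order="F")  # reshape order is an assumption

def equivalent_channel(h, f_rf, w_rf):
    """Low-dimensional channel seen between the RF chains."""
    return w_rf.conj().T @ h @ f_rf

def normalize_precoder(f_rf, f_bb_tilde, n_s):
    """Scale the digital precoder so that ||F_RF F_BB||_F^2 = N_s."""
    scale = np.sqrt(n_s) / np.linalg.norm(f_rf @ f_bb_tilde, "fro")
    return scale * f_bb_tilde

rng = np.random.default_rng(0)
n_t, n_r, n_rf, n_s = 32, 16, 3, 3
h = rng.standard_normal((n_r, n_t)) + 1j * rng.standard_normal((n_r, n_t))
# Random stand-ins for the PP-NN / CP-NN / Re/ImDP-NN outputs
f_rf = phases_to_analog(rng.uniform(0, 2 * np.pi, n_t * n_rf), n_t, n_rf)
w_rf = phases_to_analog(rng.uniform(0, 2 * np.pi, n_r * n_rf), n_r, n_rf)
h_eq = equivalent_channel(h, f_rf, w_rf)
f_tilde = rng.standard_normal((n_rf, n_s)) + 1j * rng.standard_normal((n_rf, n_s))
f_bb = normalize_precoder(f_rf, f_tilde, n_s)
```

Because every step is a composition of elementary differentiable operations, the same functions could be expressed as custom layers in an automatic-differentiation framework without changing their structure.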

C. Framework Training
The goal of offline training is to determine the weights of the NNs in the hybrid processing designer and the NN demodulator based on training samples with the input tuple $\langle \mathbf{H}, \mathbf{X}_b, \mathbf{n} \rangle$ and the label $\mathbf{X}_b$, where $\mathbf{H}$ is generated by a given channel model and $\mathbf{n}$ is generated according to the $\mathcal{CN}(0, 1)$ distribution. By minimizing the end-to-end error between the original bits, $\mathbf{X}_b$, and the recovered bits, $\hat{\mathbf{X}}_b$, the weights of each NN in DL-JHPF are adjusted iteratively. The training procedure is elaborated as follows.
The proposed DL-JHPF is actually an integrated DNN consisting of neuron layers and custom layers. The training model in Fig. 3 demonstrates the detailed training process of the framework. For each training sample, $\mathbf{H}$ is converted into a real-valued vector by matrix-to-vector reshaping and real and imaginary parts stacking, which is input into PP-NN and CP-NN consisting of dense and batch normalization (BN) layers to generate the corresponding phases, $\boldsymbol{\phi}_P$ and $\boldsymbol{\phi}_C$, respectively. Then (10) and (11) are executed by the same custom layer. Afterwards, the output vectors are reshaped according to (12) and (13) to generate $\mathbf{F}_{RF}$ and $\mathbf{W}_{RF}$, respectively. Next, (14) is executed by a custom layer to generate $\mathbf{H}_{eq}$, followed by matrix-to-vector reshaping and real and imaginary parts stacking. This vector is input into four NNs consisting of dense and BN layers, i.e., ReDC-NN, ImDC-NN, ReDP-NN, and ImDP-NN. The output vectors of the former two NNs are used to generate $\mathbf{W}_{BB}$ through real and imaginary parts combining and vector-to-matrix reshaping as in (15). Using the same operation, the output vectors of the latter two NNs are used to generate $\tilde{\mathbf{F}}_{BB}$ as in (16). After obtaining $\tilde{\mathbf{F}}_{BB}$, a custom layer performs the normalization in (17) to generate $\mathbf{F}_{BB}$. Then (5) is executed through a custom layer by using the input tuple $\langle \mathbf{H}, \mathbf{X}_b, \mathbf{n} \rangle$ and the generated $\mathbf{F}_{RF}$, $\mathbf{F}_{BB}$, $\mathbf{W}_{RF}$, and $\mathbf{W}_{BB}$ to yield the detected signal, $\mathbf{r}$. After real and imaginary stacking, $\mathbf{r}$ is converted to a real-valued vector and input into the NN demodulator consisting of dense and BN layers to output the recovered bits, $\hat{\mathbf{x}}_b$, which is then reshaped to $\hat{\mathbf{X}}_b$. The binary cross-entropy (BCE) loss between $\mathbf{X}_b$ and $\hat{\mathbf{X}}_b$ is calculated as

$$L(\Theta) = -\frac{1}{N_{tr} N_s \log_2 M} \sum_{n=1}^{N_{tr}} \sum_{i=1}^{N_s} \sum_{j=1}^{\log_2 M} \Big( [\mathbf{X}_b^n]_{i,j} \ln [\hat{\mathbf{X}}_b^n(\Theta)]_{i,j} + \big(1 - [\mathbf{X}_b^n]_{i,j}\big) \ln \big(1 - [\hat{\mathbf{X}}_b^n(\Theta)]_{i,j}\big) \Big), \quad (18)$$

where $N_{tr}$ denotes the number of training samples, the superscript $n$ indicates the index of the training sample, and $\hat{\mathbf{X}}_b^n$ is expressed as a function of the parameter set of all NNs in DL-JHPF, i.e., $\Theta$.
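Recalling the optimization problem in (6), the BER over the training set can be written as

$$P_{e,tr}(\mathbf{F}_{RF}, \mathbf{F}_{BB}, \mathbf{W}_{RF}, \mathbf{W}_{BB}) = P_{e,tr}(\Theta) = \frac{1}{N_{tr} N_s \log_2 M} \sum_{n=1}^{N_{tr}} \sum_{i=1}^{N_s} \sum_{j=1}^{\log_2 M} \Big| [\mathbf{X}_b^n]_{i,j} - [\hat{\mathbf{X}}_{b,bin}^n(\Theta)]_{i,j} \Big|, \quad (19)$$

where $\hat{\mathbf{X}}_{b,bin}^n(\Theta)$ is the binary demodulated bit matrix obtained by hard decision on the elements of $\hat{\mathbf{X}}_b^n(\Theta)$. Minimizing the BCE loss in (18) drives each element of $\hat{\mathbf{X}}_b^n(\Theta)$ toward the corresponding element of $\mathbf{X}_b^n$, which also minimizes $P_{e,tr}(\Theta)$ in (19). Therefore, DL-JHPF can directly minimize the BER over the training set by minimizing the BCE loss, and the feasibility is guaranteed by the following theorem.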
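The relation between the BCE loss and the training-set BER can be illustrated numerically. The sketch below is hedged: `bce_loss` and `bit_error_rate` are hypothetical helper names, both quantities are averaged over all bit positions, the soft outputs are clipped to avoid log(0), and a hard-decision threshold of 0.5 is an assumption that the source does not state.

```python
import numpy as np

def bce_loss(x_b, x_hat, eps=1e-12):
    """Binary cross-entropy averaged over all bit positions."""
    x_hat = np.clip(x_hat, eps, 1 - eps)  # guard against log(0)
    return -np.mean(x_b * np.log(x_hat) + (1 - x_b) * np.log(1 - x_hat))

def bit_error_rate(x_b, x_hat, threshold=0.5):
    """Fraction of bits that differ after a hard decision (threshold assumed)."""
    return np.mean((x_hat > threshold).astype(int) != x_b)

x_b = np.array([[1, 0], [0, 1]])              # original bits, N_s = 2, QPSK
good = np.array([[0.9, 0.2], [0.4, 0.6]])     # soft outputs on the correct side
bad = np.array([[0.9, 0.2], [0.4, 0.3]])      # last bit on the wrong side

loss_good, loss_bad = bce_loss(x_b, good), bce_loss(x_b, bad)
ber_good, ber_bad = bit_error_rate(x_b, good), bit_error_rate(x_b, bad)
```

In this toy case the sample with one wrong hard decision has both the larger loss and the larger BER, which is the behaviour the argument above relies on; the BCE loss is used for training because, unlike the hard-decision BER, it is differentiable.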
Theorem 1. The proposed DL-JHPF is trainable and can minimize the BCE loss through BP algorithm.
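Proof: Considering mini-batch training, the BCE loss over a batch is written as

$$L_{bat}(\Theta) = -\frac{1}{N_{bat} N_s \log_2 M} \sum_{n=1}^{N_{bat}} \sum_{i=1}^{N_s} \sum_{j=1}^{\log_2 M} \Big( [\mathbf{X}_b^n]_{i,j} \ln [\hat{\mathbf{X}}_b^n(\Theta)]_{i,j} + \big(1 - [\mathbf{X}_b^n]_{i,j}\big) \ln \big(1 - [\hat{\mathbf{X}}_b^n(\Theta)]_{i,j}\big) \Big), \quad (20)$$

where $N_{bat}$ denotes the batch size. Then $\Theta$ will be updated $N_{tr}/N_{bat}$ times in each epoch.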
To prove Theorem 1, we need to show that $L_{bat}$ is differentiable with respect to each parameter in $\Theta$. According to [36], the outputs are differentiable with respect to the corresponding weights and inputs for each NN in DL-JHPF. Since DL-JHPF can be viewed as an integrated DNN consisting of neuron layers and custom layers, the chain rule reduces the proof to showing that $L_{bat}$ is differentiable with respect to the outputs of each NN. In the following, we prove the differentiability of $L_{bat}$ with respect to the outputs of each NN by incorporating the custom layers.
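NN demodulator: From (20), $L_{bat}$ is differentiable with respect to $[\hat{\mathbf{X}}_b^n(\Theta)]_{i,j}$, $\forall i, j$.

Re/ImDC-NN: As mentioned in Section III.B, $\mathbf{w}_{BB,re}$ and $\mathbf{w}_{BB,im}$ are the outputs of ReDC-NN and ImDC-NN, respectively. Without loss of generality, we prove that $L_{bat}$ is differentiable with respect to $[\mathbf{w}_{BB,re}]_1$ and $[\mathbf{w}_{BB,im}]_1$. According to (5) and (15), the real and imaginary parts of $\mathbf{r}$ are linear in these two elements, so the required derivatives follow from the chain rule through the NN demodulator.

Re/ImDP-NN: Since $\tilde{\mathbf{f}}_{BB,re}$ and $\tilde{\mathbf{f}}_{BB,im}$ are the outputs of ReDP-NN and ImDP-NN, respectively, we also aim to prove that $L_{bat}$ is differentiable with respect to $[\tilde{\mathbf{f}}_{BB,re}]_1$ and $[\tilde{\mathbf{f}}_{BB,im}]_1$. Considering the normalization in (17), these two elements affect every element of $\mathbf{F}_{BB}$, both directly and through the norm $\|\mathbf{F}_{RF} \tilde{\mathbf{F}}_{BB}\|_F$, which is differentiable wherever it is nonzero. Since $\mathbf{r}$ in (5) is linear in $\mathbf{F}_{BB}$, the derivatives of $L_{bat}$ with respect to the elements of $\mathbf{F}_{BB}$ are obtained in the same manner as (22) and (23), and the chain rule through (17) gives the derivatives with respect to $[\tilde{\mathbf{f}}_{BB,re}]_1$ and $[\tilde{\mathbf{f}}_{BB,im}]_1$.

PP-NN: We still aim to prove that $L_{bat}$ is differentiable with respect to $[\boldsymbol{\phi}_P]_1$, which is one of the outputs of PP-NN and, from (10), generates $[\mathbf{f}_{RF}]_{1,re} = \frac{1}{\sqrt{N_T}} \cos [\boldsymbol{\phi}_P]_1$ and $[\mathbf{f}_{RF}]_{1,im} = \frac{1}{\sqrt{N_T}} \sin [\boldsymbol{\phi}_P]_1$, both of which are differentiable in $[\boldsymbol{\phi}_P]_1$. From (5) and (14), this element of $\mathbf{F}_{RF}$ affects $\mathbf{r}$ through the transmitted signal, through the normalization in (17), and through $\mathbf{H}_{eq}$, which is the input of the four digital processing NNs; each of these paths is differentiable. By considering (28)−(30) and applying the chain rule over these paths, we conclude that $L_{bat}$ is differentiable with respect to $[\boldsymbol{\phi}_P]_1$.

CP-NN: The proof is similar to that of PP-NN and thus is omitted for simplicity. Now we have shown that $L_{bat}$ is differentiable with respect to each parameter in $\Theta$, which completes the proof.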
It can be seen that the proposed DL-JHPF is abstracted into an integrated DNN, where the hybrid processing matrices, $\mathbf{F}_{RF}$, $\mathbf{F}_{BB}$, $\mathbf{W}_{RF}$, and $\mathbf{W}_{BB}$, are essentially the trainable weights therein. From the proof of Theorem 1, each weight of this integrated DNN can be optimized iteratively through the BP algorithm by minimizing the BCE loss. In this way, the precoding and combining matrices are optimized on the training set.
For the NNs in Fig. 3, each dense layer uses the rectified linear unit (ReLU) activation function and is followed by a BN layer to avoid gradient diffusion and overfitting. The number of dense layers and the number of neurons in each dense layer need to be adjusted according to the input and output dimensions. Since the outputs of the NNs will be used for hybrid processing at the transmitter and the receiver, the activation functions of the output layers should be carefully designed and are elaborated as follows.
PP-NN and CP-NN: The two NNs generate the phases for $\mathbf{F}_{RF}$ and $\mathbf{W}_{RF}$, respectively. Since (10) and (11) are periodic functions of the phases, the ReLU activation function is used in the output layer to provide unbiased output for all possible phases. We may also use Sigmoid or hyperbolic tangent as the activation function, after which the outputs are multiplied by $2\pi$ or $\pi$ to obtain the final phases with the range of $[0, 2\pi]$ or $[-\pi, \pi]$. According to the simulation trials, ReLU and hyperbolic tangent achieve almost the same performance while Sigmoid performs worse. Therefore, ReLU is preferable since it is simple and requires no exponential operations.
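The three output-activation options for the phase NNs can be compared with a short sketch. It is illustrative only: `phase_from_preactivation` is a hypothetical helper, and the pre-activation values are arbitrary. The point it demonstrates is that any real-valued phase, bounded or not, yields a unit-modulus entry after the complex exponential, so an unbounded ReLU output is admissible.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def phase_from_preactivation(z, kind="relu"):
    """Map output-layer pre-activations to phases under each option."""
    if kind == "relu":
        return relu(z)                    # range [0, inf), wrapped by periodicity
    if kind == "tanh":
        return np.pi * np.tanh(z)         # range (-pi, pi)
    if kind == "sigmoid":
        return 2 * np.pi * sigmoid(z)     # range (0, 2*pi)
    raise ValueError(f"unknown activation: {kind}")

z = np.array([-1.0, 0.0, 2.5, 9.0])       # arbitrary pre-activations
entries = {k: np.exp(1j * phase_from_preactivation(z, k))
           for k in ("relu", "tanh", "sigmoid")}
```

A phase of 9.0 rad from the ReLU branch gives the same phase-shifter setting as 9.0 − 2π rad, which is why no explicit wrapping step is needed.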
Re/ImDP-NN and Re/ImDC-NN: The four NNs generate the real and imaginary parts for $\mathbf{F}_{BB}$ and $\mathbf{W}_{BB}$, respectively. Since $\mathbf{F}_{BB}$ is normalized by (17) while $\mathbf{W}_{BB}$ has no constraint, the output layers do not apply any activation function to impose constraints and directly output the values that are input into the neurons.
NN demodulator: This NN approximates the original bits, $\mathbf{X}_b$, based on $\mathbf{r}$. The approximation for each element in $\mathbf{X}_b$ is a binary classification and thus the Sigmoid activation function is used for the output layer of the NN demodulator.

D. Deployment and Testing
In this subsection, we elaborate the deployment and testing of the trained DL-JHPF for practical implementation, where $\mathbf{H}$ is assumed to be available at both the transmitter and the receiver. The practical deployment of DL-JHPF includes the following three parts:
Deployment of hybrid processing designer: PP-NN and CP-NN will be deployed together at both the transmitter and the receiver to output the analog processing matrices, $\mathbf{F}_{RF}$ and $\mathbf{W}_{RF}$, based on which the equivalent channel, $\mathbf{H}_{eq}$, can be generated via (14). ReDP-NN and ImDP-NN are equipped at the transmitter to generate the digital precoder, $\mathbf{F}_{BB}$, while ReDC-NN and ImDC-NN are equipped at the receiver to generate the digital combiner, $\mathbf{W}_{BB}$, both based on $\mathbf{H}_{eq}$.
Deployment of signal flow simulator: It is only used for the training stage and will be replaced by the actual transceiver and wireless fading channel in the deployment and testing stage.
Deployment of NN demodulator: It will be deployed at the receiver to output the recovered bits, $\hat{\mathbf{X}}_b$, based on the detected signal, $\mathbf{r}$, after compensating for the impact of the fading channel.
When testing the trained DL-JHPF in the real world, the channel may change rapidly due to the relative motion of the transceiver and scatterers, in which case DL-JHPF will face new propagation scenarios with channel statistics different from those in the training stage. This channel scenario discrepancy places a high requirement on the robustness of DL-JHPF. Fortunately, the offline trained framework in Section III.C is quite robust to new channel scenarios that have not been observed before, as shown by our simulation results (Figs. 7 and 9). Further online fine-tuning may provide only marginal performance improvement but requires a relatively large overhead and needs to be performed frequently in rapidly changing channel scenarios. In addition, only the NNs at the receiver can be fine-tuned, and thus the performance after fine-tuning will still have an intrinsic loss compared to the end-to-end training in Section III.C. To sum up, the proposed framework can cope with the mismatch of the channel scenario without relying on fine-tuning in most cases.

E. Complexity Analysis
In this subsection, we analyze the computational complexity of the proposed DL-JHPF in the testing stage by using the required number of floating point operations (FLOPs) as the metric. According to Fig. 3, the total number of FLOPs required by all neural layers in DL-JHPF is obtained by summing over the set $\mathcal{N}$ of all NNs in DL-JHPF, where $L_\Delta$ and $N_i^\Delta$ represent the number of neural layers and the number of neurons of the $i$th neural layer of the NN $\Delta \in \mathcal{N}$, respectively.
In addition, the matrix multiplications in the framework contribute to the complexity, and the total complexity of the proposed DL-JHPF is the sum of these two parts. It is noted that the NNs can be run efficiently via parallel computing on the graphic processing unit (GPU), and the simple matrix multiplications cause only negligible computational load for the central processing unit (CPU) compared with the existing schemes. Therefore, the proposed DL-JHPF has low complexity and requires very limited runtime.
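As a rough illustration of how the neural-layer cost scales with the layer widths, the sketch below counts FLOPs for fully-connected networks. It is not the paper's complexity expression: the counting convention (one multiplication and one addition per weight, i.e., 2 × inputs × outputs per dense layer, with biases, activations, and BN ignored) is an assumption, the function names are hypothetical, and the hidden-layer widths are placeholders rather than the values in Table I.

```python
def dense_flops(layer_widths):
    """FLOPs of one fully-connected NN, given [input, hidden..., output] widths.

    Assumed convention: 2 * n_in * n_out per dense layer.
    """
    return sum(2 * n_in * n_out
               for n_in, n_out in zip(layer_widths[:-1], layer_widths[1:]))

def total_flops(networks):
    """Sum the dense-layer FLOPs over a dict of {name: layer_widths}."""
    return sum(dense_flops(widths) for widths in networks.values())

# Input/output sizes follow the text for N_T = 32, N_R = 16, N_RF = 3;
# the hidden widths (512, 256) are hypothetical placeholders.
example = {
    "PP-NN": [2 * 32 * 16, 512, 256, 32 * 3],
    "CP-NN": [2 * 32 * 16, 512, 256, 16 * 3],
}
flops = total_flops(example)
```

Under this convention the first dense layer dominates, because its input dimension 2·N_T·N_R grows with the product of the array sizes.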

F. Extension to OFDM Systems
In this subsection, we extend the proposed DL-JHPF to the wideband OFDM systems. Two key issues need to be considered for the extension: 1) In the OFDM systems, the digital precoder and combiner can be designed independently for different subcarriers while the analog precoder and combiner must be shared by all subcarriers. It is critical to design the unified analog precoder and combiner performing well for all subcarriers. 2) It is important to maintain the relatively small size, i.e., the number of hidden layers and the number of neurons in each layer in the NNs, and short training time for DL-JHPF when the number of subcarriers is large. In the following, we study how to address the two issues when extending DL-JHPF to the OFDM systems.
According to [24], the $N_R \times N_T$ channel matrix between the receiver and the transmitter of the $k$th subcarrier can be expressed as

$$\mathbf{H}[k] = \beta \sum_{n=1}^{N_{cl}} \sum_{m=1}^{N_{ray}} \alpha_{n,m}\, e^{-j 2\pi \tau_n f_s \frac{k}{K}}\, \mathbf{a}_R(\phi_{n,m})\, \mathbf{a}_T^H(\varphi_{n,m}),$$

where $\beta = \sqrt{\frac{N_T N_R}{N_{cl} N_{ray}}}$, and $\tau_n$, $f_s$, and $K$ denote the delay of the $n$th cluster, the sampling rate, and the number of OFDM subcarriers, respectively. The signal transmission model in (5) becomes subcarrier dependent and the detected signal of the $k$th subcarrier is given by

$$\mathbf{r}[k] = \mathbf{W}_{BB}^H[k]\, \mathbf{W}_{RF}^H \left( \sqrt{P}\, \mathbf{H}[k]\, \mathbf{F}_{RF}\, \mathbf{F}_{BB}[k]\, \mathbf{x} + \mathbf{n} \right).$$

Although $\mathbf{x}$ and $\mathbf{n}$ are also different for different subcarriers, they are independent of the channel and thus the index $k$ in them is omitted. In the following, we propose a simple method to design the structure of the training data so that the DL-JHPF in Section III.C can be flexibly extended to OFDM systems without changing the framework architecture. That is, neither the framework size nor the training time will be increased. The process of training and testing is detailed as follows.
Training: Compared to the training sample with the input tuple $\langle \mathbf{H}, \mathbf{X}_b, \mathbf{n} \rangle$ in Section III.C, we modify the input tuple as $\langle \bar{\mathbf{H}}, \mathbf{H}[i], \mathbf{X}_b, \mathbf{n} \rangle$, where $\bar{\mathbf{H}}$ is the channel matrix of a given subcarrier, e.g., the $q$th subcarrier, the same for all training samples, while $\mathbf{H}[i]$ is the channel matrix of an uncertain subcarrier with $i$ randomly generated from the set $\{1, 2, \ldots, K\}$ for each training sample. As shown in Fig. 4, $\bar{\mathbf{H}}$ is input into PP-NN and CP-NN to generate $\mathbf{F}_{RF}$ and $\mathbf{W}_{RF}$, which along with $\mathbf{H}[i]$ yield the equivalent channel of the $i$th subcarrier, based on which $\mathbf{F}_{BB}[i]$ and $\mathbf{W}_{BB}[i]$ can be obtained through Re/ImDP-NN and Re/ImDC-NN. On the other hand, $\mathbf{H}[i]$ is also input into the signal flow simulator to act as the fading channel since this training sample is used to simulate the transmission of the $i$th subcarrier. Then the end-to-end training can be performed by minimizing the BCE loss between $\mathbf{X}_b$ and $\hat{\mathbf{X}}_b$. Through training, we can obtain the unified analog precoder and combiner that match the channel of each subcarrier well without complicating the architecture of DL-JHPF.
Testing: With $\mathbf{H}[k]$, $k = 1, 2, \ldots, K$, available at the transceiver, choose the channel matrix of the $q$th subcarrier as $\bar{\mathbf{H}}$. Input $\bar{\mathbf{H}}$ into PP-NN and CP-NN to generate the unified $\mathbf{F}_{RF}$ and $\mathbf{W}_{RF}$ for all subcarriers. The unified $\mathbf{F}_{RF}$ and $\mathbf{W}_{RF}$ along with the channel of each subcarrier, $\mathbf{H}[k]$, $k = 1, 2, \ldots, K$, are used to generate the corresponding equivalent channel, which will be input into Re/ImDP-NN and Re/ImDC-NN to generate $\mathbf{F}_{BB}[k]$ and $\mathbf{W}_{BB}[k]$ for channel equalization in each subcarrier. The NN demodulator will be used to recover the original bits for each subcarrier based on the detected signal, $\mathbf{r}[k]$.
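The training-data structure for the OFDM extension can be sketched as a small sampling routine. This is a hedged illustration: `make_ofdm_sample` and the dictionary keys are hypothetical names, subcarriers are indexed from 0 rather than 1, and random arrays stand in for the per-subcarrier channels of one realization.

```python
import numpy as np

def make_ofdm_sample(h_all, q, x_b, noise, rng):
    """Build one training tuple from the K per-subcarrier channels of a realization.

    h_all : array of shape (K, N_R, N_T)
    q     : fixed reference subcarrier, identical for every training sample
    """
    i = int(rng.integers(h_all.shape[0]))   # random subcarrier, 0-based index
    return {
        "h_bar": h_all[q],   # fed to PP-NN / CP-NN for the shared analog matrices
        "h_i": h_all[i],     # used for the equivalent channel and as the fading channel
        "x_b": x_b,
        "n": noise,
        "i": i,
    }

rng = np.random.default_rng(1)
K, n_r, n_t, n_s = 64, 16, 32, 3
h_all = rng.standard_normal((K, n_r, n_t)) + 1j * rng.standard_normal((K, n_r, n_t))
x_b = rng.integers(0, 2, size=(n_s, 2))     # QPSK: log2(M) = 2 bits per stream
noise = (rng.standard_normal(n_r) + 1j * rng.standard_normal(n_r)) / np.sqrt(2)
sample = make_ofdm_sample(h_all, q=0, x_b=x_b, noise=noise, rng=rng)
```

Because only the sampling of the training tuple changes, the network input and output sizes are the same as in the narrowband case, which is why the framework size does not grow with the number of subcarriers.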

IV. SIMULATION RESULTS
In this section, the effectiveness of the proposed DL-JHPF is verified in several cases. Six hybrid processing schemes and the fully-digital transceiver architecture are used as the baseline schemes for comparison:
1) HBD scheme in [6];
2) Beam sweeping (BeS) scheme in [7];
3) Discrete Fourier ... [33] while the digital precoder and combiner are jointly optimized according to [38];
4) Joint digital beamforming with alternating minimization (JDB-AltMin), where the optimal precoding and combining matrices are first designed according to [38], based on which the hybrid precoding and combining matrices are constructed according to the PE-AltMin algorithm in [9];
5) Hybrid beamforming via deep learning (HBDL) scheme in [33];
6) Deep learning for direct hybrid precoding (DLDHP) scheme in [34];
7) Fully-digital transceiver architecture.
A. Simulation Settings
1) System Settings: We set $N_{\mathrm{T}} = 32$ and $N_{\mathrm{T}}^{\mathrm{RF}} = 3$ for the transmitter and $N_{\mathrm{R}} = 16$ and $N_{\mathrm{R}}^{\mathrm{RF}} = 3$ for the receiver. The number of data streams is set as $N_s = 3$. The channel data are generated according to the 3GPP TR 38.901 Release 15 channel model [37]. Specifically, we use the clustered delay line models with $N_{\mathrm{cl}} = 3$ clusters and $N_{\mathrm{ray}} = 20$ rays in each cluster. The carrier frequency is $f_c = 28$ GHz. For OFDM systems, the sampling rate is $f_s = 100$ MHz and the number of subcarriers is $K = 64$. Two channel scenarios, the urban micro (UMi) street non-line-of-sight (NLOS) scenario and the urban macro (UMa) NLOS scenario, are considered. According to the parameters for the UMi NLOS and UMa NLOS scenarios defined by [37], we use the system object nr5gCDLChannel, embedded in the 5G Library for LTE System Toolbox in MATLAB, to generate the corresponding channel data. Quadrature phase shift keying (QPSK) is used as the modulation method.
2) Proposed DL-JHPF Settings: The training set, validation set, and testing set contain 261,000, 29,000, and 10,000 samples, respectively. The NN architectures are given in Table I, where the BN layer is added after each dense layer and thus is not listed in the table for simplicity.
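The system settings above can be collected in a small configuration object, sketched here for illustration. The class and field names are hypothetical; the subcarrier spacing is derived under the assumption that the FFT size equals the number of subcarriers $K$, and the dimension check encodes the usual hybrid-architecture constraint that the number of streams does not exceed the number of RF chains.

```python
from dataclasses import dataclass

SPEED_OF_LIGHT = 299_792_458.0  # m/s


@dataclass(frozen=True)
class SimulationConfig:
    """Hypothetical container for the settings listed in this subsection."""
    n_t: int = 32                  # transmit antennas
    n_rf_t: int = 3                # transmit RF chains
    n_r: int = 16                  # receive antennas
    n_rf_r: int = 3                # receive RF chains
    n_s: int = 3                   # data streams
    n_clusters: int = 3            # CDL clusters
    n_rays: int = 20               # rays per cluster
    carrier_hz: float = 28e9       # carrier frequency
    sample_rate_hz: float = 100e6  # OFDM sampling rate
    n_subcarriers: int = 64        # K
    bits_per_symbol: int = 2       # QPSK

    @property
    def subcarrier_spacing_hz(self):
        # Assumption: FFT size equals the number of subcarriers K.
        return self.sample_rate_hz / self.n_subcarriers

    @property
    def wavelength_m(self):
        return SPEED_OF_LIGHT / self.carrier_hz

    @property
    def bits_per_channel_use(self):
        return self.n_s * self.bits_per_symbol

    def validate(self):
        """Check N_s <= N_RF <= N_antennas at both ends of the link."""
        if not (self.n_s <= self.n_rf_t <= self.n_t):
            raise ValueError("transmitter dimensions are inconsistent")
        if not (self.n_s <= self.n_rf_r <= self.n_r):
            raise ValueError("receiver dimensions are inconsistent")


cfg = SimulationConfig()
cfg.validate()
print(cfg.subcarrier_spacing_hz, cfg.bits_per_channel_use)
```

Under these assumptions the subcarrier spacing is 100 MHz / 64 = 1.5625 MHz, and each channel use carries 3 streams × 2 bits = 6 bits.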

B. Performance Evaluation
In Figs. 5-7, the proposed DL-JHPF is first evaluated in narrowband systems, while the performance in wideband OFDM systems is presented in Figs. 8 and 9. Fig. 5 shows the BER performance of HBD, BeS, DCJDB, JDB-AltMin, HBDL, DLDHP, the proposed DL-JHPF, and the fully-digital architecture versus signal-to-noise ratio (SNR) in the UMi NLOS scenario with perfect CSI. From the figure, the BER curve of DL-JHPF has a steeper slope and DL-JHPF outperforms the other six hybrid processing schemes above SNR = 0 dB, although it does not perform as well in the low-SNR regime. At BER = $10^{-2}$, the proposed DL-JHPF achieves gains of about 0.2 dB, 1 dB, 1.2 dB, 2 dB, 6 dB, and 8 dB compared to JDB-AltMin, DLDHP, DCJDB, BeS, HBDL, and HBD, respectively. The advantage of DL-JHPF becomes more pronounced as SNR increases: its BER is below $10^{-4}$ at SNR = 10 dB, while the BER of four other schemes remains above $10^{-3}$. With a significantly larger number of RF chains, the fully-digital beamforming obtains substantial diversity gains, which directly leads to better BER performance than all the hybrid processing schemes. The performance gap between the proposed DL-JHPF and the fully-digital beamforming is about 4 dB.
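The dB gains quoted at a fixed BER can be read off two BER curves by interpolating each curve in the log-BER domain and taking the SNR difference. The following is a minimal sketch of that calculation; the function name is hypothetical and the two curves are made-up placeholders, not the data behind Fig. 5.

```python
import math


def snr_at_target_ber(snr_db, ber, target_ber):
    """Return the SNR (dB) at which a monotonically decreasing BER curve
    crosses target_ber, using linear interpolation of log10(BER)."""
    log_target = math.log10(target_ber)
    for i in range(len(snr_db) - 1):
        log_hi = math.log10(ber[i])      # BER at the lower SNR point
        log_lo = math.log10(ber[i + 1])  # BER at the higher SNR point
        if log_lo <= log_target <= log_hi and log_hi != log_lo:
            frac = (log_hi - log_target) / (log_hi - log_lo)
            return snr_db[i] + frac * (snr_db[i + 1] - snr_db[i])
    raise ValueError("target BER is not bracketed by the curve")


# Placeholder curves for illustration only (not simulation results).
snr_grid = [0.0, 5.0, 10.0]
ber_scheme_a = [1e-1, 1e-2, 1e-3]
ber_scheme_b = [3e-1, 1e-1, 1e-2]

gain_db = (snr_at_target_ber(snr_grid, ber_scheme_b, 1e-2)
           - snr_at_target_ber(snr_grid, ber_scheme_a, 1e-2))
print(f"gain of scheme A over scheme B at BER = 1e-2: {gain_db:.1f} dB")
```

Interpolating log10(BER) rather than BER itself matches how the curves appear on the usual semi-logarithmic plot, where they are close to piecewise linear.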
Perfect CSI is used in framework training, while only estimated CSI is available at the practical transmitter and receiver, which leads to CSI mismatch. Fig. 6 shows the robustness of the proposed DL-JHPF with mismatched CSI, where the BER curve tested with perfect CSI in Fig. 5 is also plotted as the lower bound. We use the approach in [24] to estimate channels at SNR = 10 dB and 20 dB, respectively, for hybrid processing design. From Fig. 6, when tested with the CSI estimated at 20 dB, DL-JHPF achieves almost the same BER performance as in the perfect CSI case and outperforms the other six hybrid processing schemes above SNR = 0 dB, indicating that DL-JHPF is hardly affected by the mismatched CSI estimated at 20 dB. When tested with the CSI estimated at 10 dB, DL-JHPF suffers an acceptable performance loss. The loss is less than 1 dB at BER = $10^{-2}$, and DL-JHPF still shows clear superiority above SNR = 2.5 dB, even compared to the other hybrid processing schemes with the CSI estimated at 20 dB.
As mentioned in Section III.D, DL-JHPF is very likely to encounter different channel scenarios in practical testing. In Fig. 7, we further consider this channel scenario mismatch. From Fig. 7, the channel scenario mismatch causes less than 0.5 dB performance loss for DL-JHPF. The total loss caused by the aggregate impact of channel scenario and CSI mismatch is less than 1 dB. The proposed DL-JHPF has learned the inherent structure of the mmWave channels and thus is able to maintain its advantage even with mismatched channel scenarios and CSI.
Fig. 8 shows the BER performance of HBD, BeS, DCJDB, JDB-AltMin, HBDL, DLDHP, the proposed DL-JHPF, and the fully-digital architecture in OFDM systems with the UMi NLOS scenario and perfect CSI, which is similar to that in Fig. 5. In addition, we plot the BER performance of an ideal case with matched analog processing (AP) for DL-JHPF, where different analog processing matrices are designed for different subcarriers to match the corresponding channels. This cannot be implemented in practical systems, and we use it only to quantify the performance loss caused by using the unified analog processing matrices for all subcarriers. From Fig. 8, only about 1 dB loss is incurred, which demonstrates the effectiveness of DL-JHPF in OFDM systems by simply modifying the structure of training data, without changing the framework architecture or increasing the training time.
In Fig. 9, we further test the robustness of DL-JHPF in OFDM systems with the mismatched channel scenario and CSI. The aggregate impact of channel scenario and CSI mismatch is still limited, and DL-JHPF tested in the UMa NLOS scenario with the CSI estimated at 20 dB (mismatched channel scenario and CSI) even outperforms that tested in the UMi NLOS scenario with perfect CSI above SNR = 8 dB, which verifies the effectiveness and robustness of the proposed DL-JHPF in OFDM systems. In addition, the performance gap between the proposed DL-JHPF and the fully-digital beamforming remains at about 4 dB.
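The CSI mismatch considered in the robustness tests can be emulated in a simple way for experimentation. The sketch below does not implement the estimator of [24]; as a plainly different, simplified stand-in, it adds complex Gaussian error to the true channel so that the ratio of channel power to error power equals a chosen estimation SNR. The function names and the i.i.d. Gaussian test channel are hypothetical.

```python
import math
import random


def add_estimation_error(h, est_snr_db, rng):
    """Return a noisy copy of channel matrix h (list of rows of complex values).
    Assumption: additive complex Gaussian error whose variance is the average
    channel power divided by the linear estimation SNR."""
    n_entries = sum(len(row) for row in h)
    avg_power = sum(abs(x) ** 2 for row in h for x in row) / n_entries
    noise_var = avg_power / (10.0 ** (est_snr_db / 10.0))
    sigma = math.sqrt(noise_var / 2.0)  # per real/imaginary component
    return [[x + complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for x in row]
            for row in h]


def nmse(h, h_hat):
    """Normalized mean-squared error between the true and estimated channels."""
    err = sum(abs(a - b) ** 2 for ra, rb in zip(h, h_hat) for a, b in zip(ra, rb))
    ref = sum(abs(a) ** 2 for ra in h for a in ra)
    return err / ref


rng = random.Random(1)
# Placeholder 16 x 32 channel with i.i.d. Gaussian entries (not the CDL model).
h_true = [[complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(32)]
          for _ in range(16)]
for est_snr_db in (10.0, 20.0):
    h_est = add_estimation_error(h_true, est_snr_db, rng)
    print(est_snr_db, 10.0 * math.log10(nmse(h_true, h_est)))
```

Feeding such a perturbed channel to the hybrid processing designer while transmitting over the true channel reproduces the train/test mismatch in spirit, although the error statistics of a real estimator will generally differ.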

C. Computational Complexity Comparison
For mmWave mobile communications, the coherence time is shorter than that in sub-6 GHz bands, and thus the runtime of a hybrid processing scheme is vital. Based on the simulation settings mentioned above, we compare the runtime of the proposed DL-JHPF in the testing stage with the baseline schemes in Table II. The HBD, BeS, DCJDB, and JDB-AltMin schemes are run on the Intel(R) Core(TM) i7-3770 CPU, while the proposed DL-JHPF is run on the NVIDIA GeForce RTX 2080 Ti GPU. For HBDL and DLDHP, the predictions of $\mathbf{F}_{\mathrm{RF}}$ and $\mathbf{W}_{\mathrm{RF}}$ are implemented via DNN on the GPU, while the subsequent design of $\mathbf{F}_{\mathrm{BB}}$ and $\mathbf{W}_{\mathrm{BB}}$ is executed on the CPU. By moving the time-consuming design of analog processing to the GPU, which enables efficient parallel computing, the DL based schemes reduce the runtime significantly compared to the conventional schemes. Through careful design, the proposed DL-JHPF is fully GPU-driven when generating hybrid processing matrices and thus consumes the least time among the three DL based schemes. Therefore, the proposed DL-JHPF is more suitable for mmWave communications, especially for high-mobility scenarios.

V. CONCLUSION
In this paper, DL is applied to joint hybrid processing design at the transceiver in mmWave massive MIMO systems. A novel DL-JHPF is developed to learn the optimal analog and digital processing matrices by minimizing the end-to-end BCE loss between the original and recovered bits. The elaborate architecture of the proposed DL-JHPF guarantees the BP-enabled training of each NN therein. By simply modifying the structure of training data, DL-JHPF can be flexibly extended to OFDM systems without changing the framework architecture or increasing the training time. Simulation results show the superiority and robustness of DL-JHPF in various non-ideal conditions with significantly reduced runtime.