Deep Learning Assisted Detection for Index Modulation Aided mmWave Systems

In this paper, we propose deep learning assisted detection for index modulation millimeter wave (mmWave) systems, where we train a neural network (NN) to jointly detect the transmitted data and index information without relying on explicit channel state information (CSI). As a design example, we first employ multi-set space-time shift keying (MS-STSK) combined with beamforming for transmission over the mmWave channel, where the information is conveyed implicitly using the index of the antennas, the dispersion matrix and the M-ary constellation. Then, we analyze our design when MS-STSK transmission is considered in conjunction with beam index modulation (BIM), where the information is also conveyed by the beam index in addition to the MS-STSK information. In contrast to the MS-STSK’s conventional maximum likelihood (ML) detector, our learning-assisted detection dispenses with the channel estimation stage. We demonstrate by simulations that the learning assisted detection outperforms the ML-aided detection in the face of channel impairments with low complexity. Furthermore, we show by simulations that ML-aided detection produces an error floor, when the MS-STSK transmission is coupled with BIM, when realistic channel estimation errors are considered. Additionally, we present qualitative discussions on the receiver complexity in terms of its search space as well as the number of computations required.

The challenge of providing massive connectivity to mobile users has motivated wireless communication researchers to migrate from sub-6 GHz to the frequencies spanning between 30 GHz and 300 GHz, which is referred to as the millimeter wave (mmWave) band [1], [2]. Owing to the large available bandwidth, mmWave frequencies have the potential of accommodating a large number of users while simultaneously providing high data rates for each. However, harnessing mmWave frequencies faces several technical challenges, since they suffer from high propagation losses compared to the sub-6 GHz spectrum. To mitigate the substantial losses due to atmospheric absorption, rain-induced fading and foliage, beamforming-aided directional transmission is envisaged [3], [4] by relying on large antenna arrays (AA) to derive beamforming gains that can compensate for the loss [1]. Furthermore, the employment of spatial-multiplexing based multiple-input multiple-output (MIMO) transmission has proved beneficial for the enhancement of the data rates. In the literature, there is a vast body of MIMO-aided transmission schemes, but Bell-Labs' Layered Space-Time (BLAST) [5] stands as the seminal technique of achieving high multiplexing gains. By contrast, when aiming for diversity gains, the family of diversity-oriented schemes exemplified by space-time block coding (STBC) [7] and orthogonal space-time block coding (OSTBC) [6] may be invoked. A further extension of space diversity design for multiple access is known as space-time spreading [20], [21]. Later, a new MIMO scheme was conceived by the amalgamation of diversity-, multiplexing- and beamforming-oriented arrangements, which is referred to as a multi-functional MIMO (MF MIMO) design [11]. In this design, diversity, multiplexing and beamforming gains can all be obtained for enhancing the capacity. Explicitly, the MF MIMO concept relies on the amalgamation of two or more MIMO schemes. For example, Satyanarayana et al. [4] designed an MF MIMO that plays a dual role by providing both diversity and beamforming (BF) gains. Another example is the layered steered space-time coding (LSSTC) of [10], where diversity, multiplexing and beamforming gains are achieved by the amalgamation of V-BLAST, STBC and BF. Amongst other MF MIMO techniques, space-time shift keying (STSK) [12], [15] is popular because it strikes a design trade-off between the attainable multiplexing and diversity gains. The STSK design is conceived as an extension of the concept of spatial modulation (SM) [9], [22], where a single antenna is activated at any time. To elaborate a little further, in the STSK design, a single dispersion matrix (DM) [8] is activated from a set of DMs at any symbol instant. In other words, information is conveyed implicitly by the index of the DM in addition to the complex-valued signal drawn from an M-ary constellation. As an extension of the STSK concept, a multi-set (MS) STSK scheme was proposed in [16], which is formed by combining the concepts of STSK and SM. This design is capable of increasing the data rate, since the information is carried by the classic M-ary alphabet, by the DM index and by the antenna index combination. In mmWave communications, where the channel supports a few clusters of rays, the data rate of the MS-STSK design can be further enhanced by coupling it with the concept of beam index modulation (BIM) [17]. In BIM-aided transmission, information is implicitly conveyed by the index of the beams in addition to the classic M-ary constellation.
However, a growing concern in index modulation transmission schemes, such as MS-STSK, is the search complexity imposed on the receiver [16]. Furthermore, like any communication system, MS-STSK also requires accurate channel state information (CSI) at the receiver for achieving a low bit error ratio (BER) [23]. In frequency division duplex (FDD) systems, CSI estimation relies on pilots, which reduces the effective data rate in addition to imposing extra complexity for channel estimation.
To circumvent this problem, a machine learning based detection approach may be employed, where symbol detection is carried out without explicit CSI knowledge. This philosophy makes the design more spectrally efficient.
There is a vast body of literature on machine learning aided physical-layer communications [24]-[26]. Samuel et al. [27] conceived a deep neural network (DNN) assisted architecture for data detection, while Aoudia and Hoydis [28] demonstrated the feasibility of deep learning for end-to-end communications. By contrast, He et al. [18] employed deep learning for MIMO detection in a model-based scenario. Additionally, Xia et al. [19] designed a DNN-assisted MIMO detector operating in the presence of correlated interference, while Xiang et al. [29] employed a DNN for joint channel estimation and data detection of SM. In the context of channel coding, Huang et al. [30] employed reinforcement learning for the construction of polar codes, while Kurka and Gündüz [31] demonstrated the power of deep learning in improving the performance of joint source-channel coding. Furthermore, the so-called autoencoder-aided model was proposed by Van Luong et al. [32] for non-coherent index modulation (IM) detection.
Table 1

5) We show by simulations that our proposed design outperforms ML-aided detection in the face of channel estimation errors whose variance is as low as 0.15.
6) We then extend our design to link-adaptation, where the transmitter can be adapted between the beamforming-aided MS-STSK and MS-STSK intrinsically amalgamated with BIM. The rationale of this design is to increase the data rate at a given target BER.
7) A qualitative complexity discussion is presented both for learning-assisted detection and for ML-aided detection in terms of the search space volume as well as the number of computations.
The rest of the paper is organized as follows. In Section II, we detail the system models of MS-STSK amalgamated with beamforming and MS-STSK with BIM in mmWave communication, while in Section III we present our proposed learning-assisted detector design. Section IV presents the complexity of the design in terms of the computations as well as the search space, while Section V and Section VI discuss our results and conclusions, respectively.
Notations: We use upper case boldface, $\mathbf{A}$, for matrices and lower case boldface, $\mathbf{a}$, for vectors. We use $(\cdot)^T$, $(\cdot)^H$, $\|\cdot\|_F$, $\mathrm{Tr}(\cdot)$ and $\mathbb{E}(\cdot)$ for the transpose, Hermitian transpose, Frobenius norm, trace and expectation operator, respectively. We adopt $\mathbf{A}(m, n)$ to denote the element in the $m$-th row and $n$-th column of $\mathbf{A}$, $\mathbf{I}_N$ is the identity matrix of size $N \times N$, and $\mathbf{A} \succ 0$ indicates that $\mathbf{A}$ is a positive definite matrix. Finally, we use $\mathcal{CN}$, $\mathcal{U}$, and i.i.d. to represent the complex-valued normal distribution, the uniform distribution, and independent and identical distribution, respectively.

II. SYSTEM MODEL
Let us consider the system model shown in Fig. 2, where the transmitter is equipped with $N_t$ antenna arrays (AA) of $K$ antenna elements (AE) each. In Fig. 2, the transmitter employs an MS-STSK scheme, where the information is conveyed by both the STSK symbols and the antenna combination (AC) information. In our system model of Fig. 2, the MS-STSK scheme relies on using $M$ AAs (RF chains), where the AC selection is performed by selecting $M$ AAs out of the $N_t$ AAs. More explicitly, the MS-STSK codeword is comprised of two parts, where the first part conveys $\log_2(M_c M_Q)$ bits, with $M_c$ being the constellation size and $M_Q$ being the number of dispersion matrices [8]. The remaining $\log_2(N_t/M)$ bits are mapped to a specific AC in the antenna selection unit of Fig. 2. It is important to emphasize that during the MS-STSK transmission only $M$ AAs are activated at any symbol interval, while the other antennas remain silent. The output of a typical STSK encoder is given by (1): $\mathbf{X} = x_l \mathbf{A}_q$, where $x_l$ is the M-QAM/PSK symbol, and $\mathbf{A}_q$ is the $q$-th dispersion matrix of size $M \times T$ from the set $\mathcal{A} = \{\mathbf{A}_1, \ldots, \mathbf{A}_{M_Q}\}$. The physical significance of the matrix $\mathbf{A}_q$ is that it disperses the symbol $x_l$ over $M$ AAs during $T$ time slots. For example, consider the 4-bit sequence '0110': the two rightmost bits, '10', are mapped to one of the classic 4-QAM symbols, while the remaining two bits, '01', are mapped to one of the four dispersion matrices from the set $\mathcal{A}$ having the cardinality of 4, i.e. $\mathcal{A} = \{\mathbf{A}_1, \mathbf{A}_2, \mathbf{A}_3, \mathbf{A}_4\}$. It is also possible that three bits are mapped to 8-QAM while the remaining bit is used for the selection of one of two dispersion matrices, depending on the specific design requirements.
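The bit-to-codeword mapping of the '0110' example can be sketched in a few lines of Python. This is only an illustrative sketch: the Gray labelling of the 4-QAM alphabet and the four 2 × 2 dispersion matrices below are placeholders assumed for the example, not the optimized set used in the paper, and the names `QAM4`, `DISPERSION_SET` and `stsk_encode` are hypothetical.

```python
import math

# Hypothetical Gray-labelled 4-QAM alphabet with unit average energy.
QAM4 = {
    "00": complex(1, 1) / math.sqrt(2),
    "01": complex(-1, 1) / math.sqrt(2),
    "11": complex(-1, -1) / math.sqrt(2),
    "10": complex(1, -1) / math.sqrt(2),
}

# Four placeholder M x T = 2 x 2 dispersion matrices (assumed, not optimized).
DISPERSION_SET = [
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
    [[1, 0], [0, -1]],
    [[0, 1], [-1, 0]],
]


def stsk_encode(bits):
    """Return (q, x_l, X) for a 4-bit word laid out as <DM bits><symbol bits>.

    q is a zero-based index, so q = 1 selects the second matrix of the set.
    """
    if len(bits) != 4 or set(bits) - {"0", "1"}:
        raise ValueError("expected a 4-character binary string")
    q = int(bits[:2], 2)   # leftmost pair selects the dispersion matrix
    x_l = QAM4[bits[2:]]   # rightmost pair selects the 4-QAM symbol
    # STSK codeword: the symbol x_l dispersed over M antennas and T slots.
    X = [[x_l * a for a in row] for row in DISPERSION_SET[q]]
    return q, x_l, X
```

For the word '0110', the sketch picks the dispersion matrix addressed by '01' and the 4-QAM symbol labelled '10', and scales every entry of that matrix by the symbol.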
Having expounded on the MS-STSK design, in the next subsections, we focus our attention on the system model of the MS-STSK design combined with beamforming followed by the description of amalgamating MS-STSK with the BIM concept.

A. MS-STSK COMBINED WITH BEAMFORMING
It is important to emphasize that (1) represents the STSK symbol associated with a specific combination of active AAs. However, since the transmitter is equipped with $N_t$ AAs, there are $N_c$ ACs in total for the STSK symbol transmission, where the value of $N_c$ is rounded down using the floor function to allow only an integer number of bits. Note that the number of ACs ($N_c$) is calculated by combining sets of $M$ consecutive antennas together. During transmission, an MS-STSK symbol is formed when an STSK symbol is fed to the ST mapper of Fig. 2, where a specific AC is selected depending upon the input bit-sequence. In other words, a part of the input bit-sequence determines the specific AC to be selected for transmission. To expound a little further, let us again consider the bit-sequence where two additional bits are appended to the left of the aforementioned bits, i.e., '110110'. In this scenario, the bits '11' convey the index of one of four ACs. Thus, in this design we have a total of $\log_2(N_c M_c M_Q)$ bits per channel use (bpcu), where the additional $\log_2(N_c)$ bits convey the AC information. Therefore, an MS-STSK symbol formed at the output of the ST mapper of Fig. 3 from the $c$-th AC can be expressed as [15] $\tilde{\mathbf{X}} = \tilde{\mathbf{A}}_{q,c}\, x_l$, where $\tilde{\mathbf{A}}_{q,c}$ is the MS-STSK dispersion matrix whose entries are constituted by the dispersion matrix of the selected AC, $q$ denotes the index of the dispersion matrix and $c$ denotes the antenna combination. Then the MS-STSK symbol is steered in the desired direction for transmission over the mmWave channel to the desired user using the RF analog BF matrix $\mathbf{F}_{RF}$, as shown in Fig. 3(a), where the block-based received signal $\mathbf{Y}$ of size $N_r \times T$ after RF analog combining using the matrix $\mathbf{W}_{RF}$ is given by (5): $\mathbf{Y} = \mathbf{W}_{RF}\mathbf{H}\mathbf{F}_{RF}\tilde{\mathbf{X}} + \mathbf{V}$, where $\mathbf{V}$ is the Gaussian noise having the distribution $\mathcal{CN}(0, \sigma^2)$, while $\mathbf{F}_{RF}$ is constructed from the vectors $\mathbf{F}^i_{RF}$, with $\mathbf{F}^i_{RF}$ being the BF vector of the $i$-th AA of size $K \times 1$. Similarly, $\mathbf{W}_{RF}$ is the analog RF combining matrix of size $M \times N_r$.
Furthermore, $\mathbf{H}$ represents the statistical mmWave channel model of (8), expressed in terms of the matrices $\mathbf{H}_i$, where $\mathbf{H}_i$ is the statistical channel matrix of size $N_r \times K$ given in (9). The variables $N_c$ and $N_{ray}$ in (9) are the number of clusters and the number of rays, respectively, while $\alpha$ obeys the distribution $\mathcal{CN}(0, 1)$, and $\mathbf{a}_i$ as well as $\mathbf{a}_r$ represent the array response vectors at the $i$-th AA of the transmitter and at the receiver array, respectively. Note that our system model of (5) corresponds to Fig. 3(a), where the MS-STSK symbol is transmitted over the channel matrix $\mathbf{H}$ to its intended receiver with the aid of beamforming, where all the beams (one or many) supported by the channel are utilized.
Additionally, by letting $\mathbf{W}_{RF}\mathbf{H}\mathbf{F}_{RF} = \mathbf{H}_{eff}$ and invoking the vectorial stacking operation, Eq. (5) becomes equivalent to an SM system, which is detailed in [15]. In other words, Eq. (5) can be re-written in the vector form of [15], [16], in terms of its vectorized constituent matrices. At the receiver, the vectorized received signal $\mathbf{y}$ is used during the detection process. Conventionally, the detection of the MS-STSK symbol, where the estimates $\hat{q}$, $\hat{l}$, $\hat{c}$ of (5) are obtained, is carried out by employing the ML detection of (18), which relies on the CSI estimated at the receiver and takes the form $\langle \hat{q}, \hat{l}, \hat{c} \rangle = \arg\min_{q,l,c}$ of the ML metric. Additionally, the DCMC capacity of the MS-STSK scheme is given in [33]; by taking into account the pilot overhead $f_p$, which is the ratio of the number of pilots to the number of data symbols, an effective DCMC capacity is obtained.
202742 VOLUME 8, 2020
We note that index modulation schemes, such as spatial modulation and MS-STSK, suffer from a potentially higher performance loss than the classic MIMO configurations, because the CSI accuracy affects the antenna and beam indices of our system, both of which form an integral part of the total information symbol stream. Hence, index modulation systems are more susceptible to CSI impairments. In order to overcome this difficulty, typically a higher pilot overhead is imposed with the objective of obtaining a more accurate CSI. Furthermore, the channel estimation error variance of the LMMSE estimator, which is typically employed in practice, is given by (23) [34], where $\rho_t$ is the pilot transmission power and $\tau_p = \tau - \tau_d$ is the pilot symbols' transmission time duration, while $\tau$ is the total transmission time and $\tau_d$ is the data duration. Equation (23) can also be equivalently written in terms of the pilot ratio $f_p$ and the total number of symbols $\eta$.

B. MS-STSK COMBINED WITH BIM
In this model of Fig. 3(b), information is also conveyed by the index of the beam in addition to the information conveyed by the MS-STSK symbol.
More explicitly, when the channel of (8) supports a plurality of beams, say $N_b$ beams, instead of transmitting in all beams at once, the transmitter selects a specific beam for transmission depending upon the input bit-sequence. Thus, this design is capable of conveying an additional $\log_2(N_b)$ bits per channel use compared to its counterpart dispensing with BIM.
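The bit budget described in this section can be tallied with a short Python sketch. It assumes that $N_c$, $M_c$, $M_Q$ and $N_b$ are rounded down to powers of two, in the spirit of the floor operation mentioned for $N_c$. The function names and the `n_beam_selectors` parameter (the number of AAs that each pick a beam independently) are hypothetical and introduced only for illustration.

```python
def floor_log2(n):
    """Largest integer b such that 2**b <= n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return n.bit_length() - 1


def bits_per_channel_use(n_c, m_c, m_q, n_b=1, n_beam_selectors=1):
    """Bits per channel use of MS-STSK, optionally combined with BIM.

    n_c: number of antenna combinations, m_c: constellation size,
    m_q: number of dispersion matrices, n_b: beams available per AA.
    n_beam_selectors is an assumed knob, not a symbol from the paper.
    """
    stsk_bits = floor_log2(m_c) + floor_log2(m_q)   # log2(M_c * M_Q)
    ac_bits = floor_log2(n_c)                       # log2(N_c)
    beam_bits = n_beam_selectors * floor_log2(n_b)  # log2(N_b) per selector
    return ac_bits + stsk_bits + beam_bits
```

With four ACs, 4-QAM and four dispersion matrices the sketch returns 6 bits, as in the '110110' example, and selecting one of four beams adds 2 bits. Setting `n_beam_selectors=2` with four beams gives 10 bits, which is one way of arriving at the total quoted for the toy example that follows.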
To elaborate a little further, let us consider the 'toy' example shown in Fig. 4, where the channel seen from each AA supports a total of 4 beams for transmission. In order to increase the spectral efficiency, BIM is invoked by relying on the index of the beam used for transmission. In other words, MS-STSK transmission can be carried out by one of the four beams from each AA by allowing additional bits to be conveyed by the index of the beam. Naturally, this philosophy can only be exploited when there is more than one beam (see Footnote 6). If there is only a single beam, then MS-STSK will be combined with conventional beamforming. As an example, in Fig. 4, beam 2 is selected for MS-STSK transmission according to the bit sequence '01' representing the beam index, while beam 3 is selected from the other AA for the bits '11'. Thus, the total number of bits per channel use when BIM is coupled with the MS-STSK example discussed in Section II-A is 10, i.e., '1101…'.
Footnote 6: At this stage the question arises: how do we best configure our $N_t K$-element antenna for a specific diversity, spatial multiplexing and beamforming order? Naturally, this depends on the particular application in mind, as well as on its specifications. In the example of Fig. 4, we opted for invoking BIM for implicitly conveying two extra bits instead of the classic space-division multiple access principle, because BF relies on $\lambda/2$-spaced elements; but such a tight element-spacing would result in modest STSK multiplexing gains, because the adjacent AEs receive correlated signals, which are hard to separate at the receiver. In a nutshell, the specific assignment of AEs to the baseband signal processing functions has to be carefully considered.
Now let us again consider Fig. 3(b), where the channel seen from each AA at the transmitter supports $N_b$ beams and only one of the $N_b$ beams is selected for transmission depending on the input bit-sequence. The block-based received signal $\mathbf{Y}_{BI}$ of this scenario can be expressed as (24): $\mathbf{Y}_{BI} = \mathbf{W}^n_{RF}\mathbf{H}^n_{BI}\mathbf{F}^n_{RF}\tilde{\mathbf{X}} + \mathbf{V}$, where $\mathbf{H}^n_{BI}$ is the statistical channel model of (8) in the $n$-th beam, while the sizes of the matrices $\mathbf{H}^n_{BI}$, $\mathbf{W}^n_{RF}$, $\mathbf{F}^n_{RF}$ and $\tilde{\mathbf{X}}$ are still the same as described in Sec. II-A. We observe that the only difference between (5) and (24) lies in the manner of exploiting the beams.
Similar to (13), Eq. (24) can be vectorized for the $n$-th beam. At the receiver, the vectorized received signal $\mathbf{y}_{BI}$ is used during the detection process. In this setting, the estimates $\hat{q}$, $\hat{l}$, $\hat{c}$, $\hat{n}$ are obtained by employing the ML detection of (25), which relies on the CSI of the $n$-th beam at the receiver and takes the form $\langle \hat{q}, \hat{l}, \hat{c}, \hat{n} \rangle = \arg\min_{q,l,c,n}$ of the ML metric. It is important to emphasize that both (18) and (25) are heavily reliant on the availability of accurate CSI for the successful detection of the symbols, thereby imposing both the usual pilot overhead required for channel estimation and an additional channel estimation complexity. Furthermore, we will show in the subsequent section that both (18) and (25) produce an error floor when the CSI estimation error variance is set to 0.15.
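To make the exhaustive nature of the ML search concrete, the following Python sketch enumerates every (q, l, c) hypothesis and keeps the one closest to the received block in the Euclidean sense, which is the ML rule under Gaussian noise. Several simplifications are assumed here and are not taken from the paper: the effective channel has one column per AA after beamforming, the c-th AC activates M consecutive AAs, the search operates on the block-based signal rather than the vectorized one, and the helper names are hypothetical.

```python
import itertools


def ms_stsk_codeword(q, l, c, disp, alphabet, n_t):
    """Place x_l * A_q in the rows of the c-th group of M consecutive AAs."""
    m, t = len(disp[q]), len(disp[q][0])
    x = [[0j] * t for _ in range(n_t)]
    for r in range(m):
        for k in range(t):
            x[c * m + r][k] = alphabet[l] * disp[q][r][k]
    return x


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def ml_detect(y, h_eff, disp, alphabet, n_c):
    """Exhaustive search over all M_Q * M_c * N_c hypotheses."""
    n_t = len(h_eff[0])
    best, best_metric = None, float("inf")
    for q, l, c in itertools.product(range(len(disp)),
                                     range(len(alphabet)), range(n_c)):
        hx = matmul(h_eff, ms_stsk_codeword(q, l, c, disp, alphabet, n_t))
        # Squared Frobenius distance between received and hypothesised block.
        metric = sum(abs(y[i][j] - hx[i][j]) ** 2
                     for i in range(len(y)) for j in range(len(y[0])))
        if metric < best_metric:
            best, best_metric = (q, l, c), metric
    return best
```

The loop makes the dependence on CSI explicit: every hypothesis is multiplied by `h_eff`, so any estimation error in the channel enters every candidate metric. Extending the search to the beam index would add a fourth loop variable over the beams.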
Additionally, the DCMC capacity of the MS-STSK scheme combined with BIM is formulated similarly to that of Section II-A, and by taking into account the pilot percentage $f_p$, the corresponding effective DCMC capacity is obtained. In the next section, we propose a design, where we employ deep detection relying on a neural network while dispensing with the requirement of having CSI knowledge at the receiver. The advantage of our design is that it avoids the reliance on CSI and consequently circumvents the pilot overhead and the complexity of channel estimation. This philosophy makes our design spectrally efficient.

III. PROPOSED LEARNING ASSISTED DETECTOR DESIGN
In this section we commence our discussion by presenting some preliminaries on deep learning, in order for the paper to be self-contained. Then we expound on the proposed design, where we invoke deep learning philosophy for the data detection.

A. DEEP LEARNING PRELIMINARIES
Artificial Neural Networks (ANN) rely on a computational model inspired by the structural and functional aspects of biological neural networks [37]. As a benefit of their ability to learn and generalize, they have become one of the most popular machine learning techniques. They mimic the functions of the human brain in terms of organizing neurons to evaluate certain operations. The model of a typical neural network is shown in Fig. 5. A neural network such as that of Fig. 5 consists of multiple layers, where the first and the last layers are the input and the output layers, while the layers between them are referred to as hidden layers. Note that a neural network having more than a single hidden layer is referred to as a deep neural network [38], which belongs to the family of deep learning techniques. To elaborate a little further, the input vector $\mathbf{x}_i$ is fed to the input layer, which is connected to the left-most hidden layer of Fig. 5. The output from each neuron of the hidden layer is governed by a so-called activation function $f(\cdot)$. The activating function represents the rate of change of the neuron's potential from its resting state before it is stimulated. Typically, the activating function is a function of a weight vector and a bias. In our example of Fig. 5, the first column of the weight matrix $\mathbf{W}_1$ and the first element of the bias vector $\mathbf{b}_1$ belong to the first neuron of the first hidden layer. Similarly, the outputs from the other neurons are calculated and fed to the next layer, whose outputs are again governed by the activating function of that layer. Finally, the predicted response of the neural network is obtained from the output layer. It is instructive to note that the only mathematical condition on the choice of the activating function is differentiability [37], [38]. Some commonly used activating functions include a simple threshold function, the ReLU function, a piecewise-linear function, and the sigmoid function.
Having discussed the structure of a typical neural network, we now focus our attention on the employment of an ANN. The employment of an ANN comprises two stages: the training phase and testing phase. In the training phase, the known input and output samples are used for computing the weight matrices and bias vectors. In other words, the weights and biases are specifically designed for minimizing the error between the known output and the predicted output. Then in the testing phase, the pre-designed weights and biases are applied to the new input data, outside the training set for predicting the output.
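The layer-by-layer evaluation described above can be sketched as a plain Python forward pass. This is a minimal illustration under stated assumptions: each row of `weights` holds the incoming weights of one neuron (the transpose of the column convention used for $\mathbf{W}_1$ in the text), the hyperbolic tangent is used for the hidden layers, and the function names are hypothetical.

```python
import math


def dense(x, weights, bias, activation):
    """Evaluate one fully connected layer.

    weights[j] holds the incoming weights of neuron j and bias[j] its bias.
    """
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def forward(x, layers):
    """Propagate x through a list of (weights, bias, activation) layers."""
    for weights, bias, activation in layers:
        x = dense(x, weights, bias, activation)
    return x
```

In the testing phase only `forward` is needed, with the weights and biases frozen at the values found during training.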
In the next subsection, we rely on this philosophy, where the detection is carried out relying on the weights trained during the training phase.

B. LEARNING ASSISTED DETECTION
Let us now focus our attention on the learning-assisted detection for our system model. As discussed in the previous section, we first aim for designing the training weights and biases for our neural network. In our design, the number of hidden layers is set to 2, while the number of neurons is adjusted in such a way that the network faithfully reproduces the output during the training stage. In our MS-STSK symbol transmission of Eq. (5), the vectorized matrix $\mathbf{y}$ serves as the input to the neural network, while the detected dispersion matrix index, the antenna index and the complex-valued classic symbol drawn from the M-QAM constellation constitute the output vector, as shown in Fig. 6(a). Similarly, when the MS-STSK encoder is amalgamated with a beam index, as in Eq. (24), the vectorized matrix $\mathbf{y}_{BI}$ serves as the input of the ANN. In this scenario, we have an additional element at the output of the ANN, namely the beam index, as shown in Fig. 6(b).
Having defined the input and the output vectors of the neural network, training of the network is carried out using a set of known input and output samples. We note that before the training process, the weight matrices and the bias vectors of all layers are set to random values from the distribution $\mathcal{N}(0, 1)$ [39]. Furthermore, in our design, a member of the hyperbolic tangent function family is used as the activating function as a benefit of its smoothness and asymptotic properties, which is given in (29) as $f(x, a, v)$ [37], where $f(x, a, v)$ is a mapping on $x$, $a$ is the slope parameter, and $v$ affects the function position. The activating function of (29) is applied at every neuron of the network, whose output is fed as the input to the next layer of neurons. In other words, the activating function of (29) maps its input vector $\mathbf{x}_i$ of the $i$-th training sample using the weight matrix $\mathbf{W}$ and bias vector $\mathbf{b}$ of that layer. Then this mapping serves as the input vector to its succeeding layer and so forth. After the final mapping, which is at the output layer, the error between the known output and the predicted output is computed. The error is formulated as the loss function of (30) [40], where $S$ is the cardinality of the training set, $\hat{\mathbf{y}}_i$ and $\mathbf{y}_i$ are the predicted and the known output vector of the $i$-th training sample, respectively, while $\rho_1$, $\rho_2$, $\rho_3$ are the regularization factors used for avoiding over-fitting [40]. Then, the weight matrices and bias vectors are designed for minimizing the loss function of (30). This is typically carried out using the technique of back-propagation, where the gradient of (30) is evaluated with respect to the weight matrices $\mathbf{W}_1$, $\mathbf{W}_2$, $\mathbf{W}_3$ and bias vectors $\mathbf{b}_1$, $\mathbf{b}_2$, $\mathbf{b}_3$. Note that the solution obtained is not guaranteed to be globally optimal. A more detailed explanation of back-propagation is presented by Chauvin et al. [41].
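A regularized training loss of the kind described above can be sketched as follows. The exact form of (30) is not reproduced here; the sketch assumes a mean squared error over the $S$ training samples plus one squared-Frobenius-norm penalty per weight matrix, weighted by the factors $\rho_1$, $\rho_2$, $\rho_3$. That specific form, and the name `regularized_loss`, are assumptions made for illustration.

```python
def regularized_loss(predictions, targets, weight_matrices, rhos):
    """Mean squared error plus L2 penalties on the weight matrices.

    predictions, targets: lists of S output vectors (lists of floats).
    weight_matrices: one list-of-lists matrix per layer.
    rhos: one regularization factor per weight matrix (assumed L2 form).
    """
    s = len(targets)
    data_term = sum(
        sum((p - t) ** 2 for p, t in zip(pred, targ))
        for pred, targ in zip(predictions, targets)
    ) / s
    penalty = sum(
        rho * sum(w * w for row in matrix for w in row)
        for rho, matrix in zip(rhos, weight_matrices)
    )
    return data_term + penalty
```

Larger regularization factors pull the weights towards zero, trading a slightly worse fit on the training set for less over-fitting.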
Note that in our design we assumed that the channel evolves in time according to Jakes' model, where the channel's correlation coefficient in time is defined by the zero-order Bessel function of the first kind as $J_0(2\pi f_d \tau)$ (31) [42], where $f_d$ is the maximum Doppler frequency and $\tau$ is the sample time. Therefore, the number of neurons and the time required for designing the weights and biases during the training phase depend on the Doppler spread $f_d \tau$ (see Footnote 7). Furthermore, the Doppler spread also plays a key role in deciding how often the training of the weights and biases is required for estimating the indices with a high integrity. After designing the neural network parameters, the testing phase ensues, where the vector $\mathbf{y}$ (or $\mathbf{y}_{BI}$) from the receive AA is fed to the input of the neural network. Here the weights and biases computed during the training phase are applied to the input vector for estimating the indices at the output of the network.
Footnote 7: At a high Doppler spread, a larger number of neurons may be required for training.
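The temporal correlation under Jakes' model can be evaluated numerically without any special-function library. The sketch below uses the standard integral representation $J_0(x) = \frac{1}{\pi}\int_0^{\pi}\cos(x \sin\theta)\,d\theta$ with a midpoint rule; the function names and the number of integration steps are choices made for this illustration.

```python
import math


def bessel_j0(x, steps=2000):
    """Zero-order Bessel function of the first kind via a midpoint rule."""
    h = math.pi / steps
    total = sum(math.cos(x * math.sin((k + 0.5) * h)) for k in range(steps))
    return total * h / math.pi


def jakes_correlation(f_d, tau):
    """Correlation coefficient J0(2*pi*f_d*tau) of Jakes' model.

    f_d: maximum Doppler frequency in Hz, tau: time lag in seconds.
    """
    return bessel_j0(2.0 * math.pi * f_d * tau)
```

The correlation equals one at zero lag and decays as the product of Doppler frequency and lag grows, which is why trained weights become outdated faster at higher Doppler spreads.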
Remark 2: The input of the neural network takes only real values; therefore, we have split the received vector y into real part R(y) and imaginary part I(y) before feeding it to the neural network.
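The split of Remark 2 amounts to a one-line transformation. In the sketch below the real parts are listed first and the imaginary parts second, so that a single real-valued vector of twice the length is produced; this ordering, and feeding both parts to one network rather than to separate ones, are assumptions of the example.

```python
def to_real_input(y):
    """Stack the real parts and then the imaginary parts of a complex vector."""
    return [v.real for v in y] + [v.imag for v in y]
```

A received vector with n complex entries therefore yields 2n real-valued network inputs.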

C. TRANSCEIVER ADAPTATION
Having discussed the learning-assisted detection for the MS-STSK mmWave systems, we now extend our design to link-adaptation, where the transmitter adapts between beamforming-aided MS-STSK and MS-STSK intrinsically amalgamated with BIM. The rationale of this design is to increase the data rate at a given target BER. As a design example, we consider the target BER of $10^{-3}$. Then it can be seen in Fig. 8(a) that at the SNR of -10 dB, the beamforming-aided MS-STSK design switches to MS-STSK combined with BIM for increasing the data rate while still maintaining the target BER. Fig. 8(b) shows the rate achieved by the adaptive design. More explicitly, it can be seen that the adaptive design supports higher-rate transmission than the beamforming-aided MS-STSK while also satisfying the target BER of $10^{-3}$. It is instructive to note that the MS-STSK with BIM provides a higher rate than the adaptive design for SNRs below −7.5 dB, but it fails to satisfy our target BER criterion.
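The switching rule behind this link adaptation can be sketched as choosing, among the modes that meet the target BER at the current SNR, the one with the highest rate. The mode labels and the BER and rate figures used with this function are hypothetical placeholders; in practice they would come from measured or simulated curves such as those of Fig. 8.

```python
def select_mode(ber_by_mode, rate_by_mode, target_ber=1e-3):
    """Return the highest-rate mode whose BER meets the target, else None.

    ber_by_mode: mapping from mode name to the BER at the current SNR.
    rate_by_mode: mapping from mode name to its rate in bits/channel use.
    """
    feasible = [m for m, ber in ber_by_mode.items() if ber <= target_ber]
    if not feasible:
        return None
    return max(feasible, key=lambda m: rate_by_mode[m])
```

For example, if only the beamforming-aided mode satisfies the target at a given SNR it is selected despite its lower rate, whereas once both modes satisfy the target the BIM-aided mode is preferred.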

IV. RECEIVER COMPLEXITY
In this section, we focus our attention on the receiver design of the proposed model. Fig. 7 illustrates the block diagram of a typical receiver employing ML detection as well as the learning-assisted detection. To expound further, Fig. 7(a) shows the schematic of ML detection. In this design, the receiver first combines the signal in the RF stage and then performs down-conversion for further digital processing in the baseband. Given the necessity of having the CSI, channel estimation is carried out with the aid of pilots prior to the detection. The ML detection is invoked for estimating the MS-STSK symbol after the CSI estimate is obtained.
By contrast, Fig. 7(b) shows the receiver design relying on a neural network, where the training weights are designed offline. It can be seen in Fig. 7(b) that, in contrast to Fig. 7(a), this design dispenses with the channel estimation stage.
More explicitly, the signal received after down-conversion is fed to an ANN, where the NN parameters learned during the training phase are applied for estimating the MS-STSK indices.
Having briefly discussed the receiver structure of both designs, let us now focus our attention on the complexity quantified in terms of the search space volume and the number of computations. Let us consider again the MS-STSK symbol of (5) as a 'toy' example, where there are $M_Q$ dispersion matrices, $M_c$ complex-valued symbols and $N_c$ ACs. In this scenario, the ML detection of (18) has to estimate the index of the dispersion matrix, the index of the M-QAM symbol and the index of the AC. Thus, the run-time complexity of ML detection would be on the order of $O(M_c M_Q N_c)$. Furthermore, the ML detection requires CSI knowledge, which relies on pilots and imposes additional complexity during the channel estimation stage of Fig. 7(a), while also significantly reducing the data rate because of the pilot overhead. In contrast to the ML detection, our learning-assisted detection improves the data rate by eliminating the pilot overhead. In other words, once trained, the neural network dispenses with the CSI. This philosophy makes our design spectrally efficient. Furthermore, the pre-determined parameters of the network learned during the training process allow us to estimate the indices of (5) with a high integrity, as we will show later in Section V.
On the other hand, the complexity of the proposed design depends on the number of neurons in each hidden layer. More explicitly, the complexity of a typical NN is jointly determined by the forward propagation and the backward propagation. To elaborate a little further, let us assume that there are $n$ neurons in a hidden layer. Let us also assume that the input and output vectors are of sizes $n_i$ and $n_o$, respectively. Each layer's activating function of (29) is computed using the network parameters of the respective layer. In other words, the pre-determined weight matrix and bias vector values are substituted into the activating function of (29), which operates on the input vector of size $n_i$ for computing the intermediate output of size $n_o$, which in turn serves as the input of the next layer. In (29), $x$ is the input, $a$ is the weight and $v$ is the bias. Contrasting it to the ML receiver's search complexity by considering each search operation as a node of Fig. 6, the complexity of the proposed design would be on the order of $O(n_i n_{h_1} n_{h_2} n_o)$, where $n_{h_1}$ and $n_{h_2}$ denote the numbers of neurons in the first and second hidden layers. It is important to emphasize that in contrast to the ML detection of Fig. 7(a), this design does not require additional computations for channel estimation and also avoids the pilot overhead.
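As a rough numerical companion to this discussion, the Python sketch below counts the ML hypotheses and the multiply-accumulate operations of one forward pass through a fully connected network. The multiply-accumulate count is a standard accounting for dense layers and is offered only as an assumption-laden illustration; it is not the order expression quoted above, and the function names and layer sizes are hypothetical.

```python
def ml_search_space(m_c, m_q, n_c, n_b=1):
    """Number of hypotheses visited by an exhaustive ML search."""
    return m_c * m_q * n_c * n_b


def dense_forward_macs(layer_sizes):
    """Multiply-accumulate operations of one forward pass.

    layer_sizes lists the widths from input to output, for example
    [n_i, n_h1, n_h2, n_o]; each pair of adjacent layers contributes
    the product of their widths.
    """
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
```

For the toy parameters $M_c = M_Q = N_c = 4$ the ML search visits 64 hypotheses, and 256 when one of four beams also has to be identified, with each hypothesis additionally requiring a channel-dependent metric evaluation.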
Let us now delve into the complexity in terms of the number of complex multiplications for both designs, namely for the ML detection associated with the transmission parameters of (5) and (24), and for the NN associated with the aforementioned transmission parameters. Table 2 illustrates the number of complex multiplications required for (18) with the simulation parameters listed in Table 3.

V. SIMULATIONS
In this section we present our simulations characterizing the performance of the proposed design and of the ML detection. More particularly, we performed Monte Carlo simulations for comparing the performance of the learning-assisted detection and of the ML-aided detection. Our simulation parameters are listed in Table 3. Furthermore, in our simulations, we have configured the NN to have 2 hidden layers with 30 neurons each, where we trained the NN using 1000 samples and the classic stochastic gradient descent technique having a step size of 0.005 for 92 epochs. Fig. 9 shows the probability density function (PDF) of the antenna index estimate generated from the output of the NN for the system model of Fig. 3(a). The plots shown in Fig. 9 are generated during the testing phase of the NN for SNRs of 0, 5 and 15 dB. To obtain this plot, we have set the number of hidden layers to 2, where the NN underwent training using 2000 samples. Additionally, we note that the input vector of the NN of Fig. 6 takes only real values as discussed in Section III, hence the received vector y is split into its real and imaginary parts. In this setting, we empirically observed that the NN makes an accurate inference between the output and the input vectors, when the number of neurons is set to 12 and 13 in hidden layers 1 and 2, respectively for both the real and imaginary parts of the network.
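The stochastic gradient descent update quoted above (step size 0.005, 92 epochs) follows the usual rule of moving each parameter against its gradient. The sketch below shows that rule on a deliberately trivial one-parameter loss, $(w - 1)^2$, chosen only to keep the example self-contained; it does not reproduce the training of the detector, and the function names are hypothetical.

```python
def sgd_step(params, grads, step_size=0.005):
    """One plain SGD update: p <- p - step_size * dL/dp for every parameter."""
    return [p - step_size * g for p, g in zip(params, grads)]


def train_toy(epochs=92, step_size=0.005):
    """Minimise the toy loss (w - 1)^2, whose gradient is 2 * (w - 1)."""
    w = [0.0]
    for _ in range(epochs):
        grad = [2.0 * (w[0] - 1.0)]
        w = sgd_step(w, grad, step_size)
    return w[0]
```

With this step size each update shrinks the distance to the minimum by one percent, so the toy parameter moves steadily towards 1 without overshooting.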
In this setting, Fig. 9(a) shows the PDF of the antenna index output by the NN for an SNR of 0 dB. Ideally, the estimate of the output representing the antenna index is expected to be either '0' or '1', but it can be seen from the figure that the output of the NN is not exactly binary; instead, it is a set of continuous values spanning from -0.4 to 1.5. More explicitly, for the bit '0' the output ranges from -0.4 to 0.5, while for the bit '1' the output of the NN is a real value between 0.3 and 1.6. It is observed empirically that the output for the bit '0' follows a near-Gaussian distribution with a mean of 0 and variance of 0.15. Similar trends are valid for the bit '1'. However, as the SNR increases from 0 to 5 to 15 dB, as shown in Figs. 9(b) and 9(c), it may be observed that the variances of the distributions reduce gradually. In other words, the range of values seen for an SNR of 0 dB becomes narrower as the SNR increases, which results in a confident estimate of the antenna index. To understand the behavior of the NN in terms of index estimation, we have analyzed the output of the antenna index as an example. The PDFs for the other indices can also be readily obtained.

FIGURE 10. BER of the proposed design for different numbers of frames as the channel evolves in time according to Jakes' correlation coefficient (31). The simulation parameters are listed in Table 3.

FIGURE 11. Pilot transmission for the proposed learning-assisted design and the conventional design. In ML-aided detection, pilots P are transmitted in every frame, which significantly affects the spectral efficiency, while training data $T_d$ is requested by the user only after $N_f$ frames, which is contingent on the Doppler spread. Note that the channel is not time-invariant over $N_f$ frames: the channel is different for every frame according to the Doppler frequency. However, for NN-assisted designs, the same training weights can be employed for the duration of $N_f$ frames for successful symbol detection, thanks to the DNN.

Fig. 10 analyzes the BER of the proposed design for different numbers of frames as the channel evolves in time according to Jakes' correlation coefficient of (31). It is evident from Fig. 10 that as the number of frames increases from 10 to 100, the BER of the proposed design degrades. This phenomenon is observed because the weights trained during the first few frames become outdated after a certain number of frames; hence retraining the NN parameters becomes necessary. We note that, as expected, the number of frames
transmitted before the NN weights become outdated depends directly on the Doppler spread. For example, in Fig. 10, the number of frames transmitted before the BER starts to degrade is higher for the normalized Doppler spread of 0.0005 than for 0.001. Therefore, in this scenario, the receiver requests the BS to transmit pilots for recalibrating its weights depending on the BER observed. Fig. 11 shows the schematic of the pilot transmission for both the learning-assisted design and ML detection. It is important to emphasize that the pilots are transmitted in every single frame in the case of ML detection. By contrast, our learning-assisted detection requires the training data for recalibrating the NN weights only after every $N_f$ frames, as shown in Fig. 11, while performing detection without explicit CSI in the rest of the frames. This can also be interpreted as online learning. More explicitly, the weights may be computed using the loss function of (30). Note that at the commencement of the process, random weights are used. By contrast, during retraining, the initial weights are set to the previously trained weights. In this way, the NN adjusts its weights. The length of the known sequence depends on the Doppler frequency and is determined empirically. It is important to note that the amount of training data required for recalibration decreases as the Doppler spread is reduced. The overhead involved in retraining the NN of our design is proportional to $N_p / (N_f (N_d + N_p))$, where $N_f$ is the number of frames, $N_d$ is the number of data streams, and $N_p$ is the number of pilots, while the pilot overhead involved in the channel estimation for ML detection is $N_p N_f / (N_f (N_d + N_p))$.

Fig. 12 shows the BER of the learning-assisted detection, of the ML-aided detection with perfect CSI, and of the ML detection with imperfect CSI for the MS-STSK transmission dispensing with the BIM.
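To give a feel for why the trained weights age faster at higher Doppler spreads, the sketch below evaluates a temporal correlation coefficient for the two normalized Doppler spreads mentioned above. Equation (31) is not reproduced in this section, so the widely used Jakes form $\rho(k) = J_0(2\pi f_d k)$, with $f_d$ the normalized Doppler spread and $k$ the frame lag, is assumed here; the function names are hypothetical, and the power series used for $J_0$ is only intended for the small arguments that arise in this example.

```python
import math

def bessel_j0(x, terms=30):
    """Zeroth-order Bessel function of the first kind via its power series."""
    total = 0.0
    for k in range(terms):
        total += (-1) ** k * (x / 2) ** (2 * k) / math.factorial(k) ** 2
    return total

def jakes_correlation(normalized_doppler, frame_lag):
    """Assumed Jakes correlation between channel realizations frame_lag frames apart."""
    return bessel_j0(2 * math.pi * normalized_doppler * frame_lag)

# Compare the two normalized Doppler spreads discussed in the text.
for f_d in (0.0005, 0.001):
    values = [round(jakes_correlation(f_d, lag), 4) for lag in (10, 100)]
    print(f_d, values)
```

Under this assumed model, the correlation at a given frame lag is lower for the larger Doppler spread, which is consistent with the earlier onset of BER degradation observed for 0.001 in Fig. 10.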
Since no BF index is considered, it can be assumed that the channel supports only a single beam, or that all potential beams are utilized for the transmission. It can be seen in Fig. 12 that, for the aforementioned NN parameters, there is a gap of around 6 dB at the
BER of $10^{-3}$ between the learning-assisted detection and the ML-aided detection relying on perfect CSI. Although the ML-aided detection relying on perfect CSI outperforms the learning-assisted detection by an SNR gain of 6 dB, this is achieved under the idealized simplifying assumption of having perfect CSI. On the other hand, for a CSI estimate having an error variance of 0.16, which is obtained by substituting $f_p = 0.05$ in (23), the SNR gap reduces to around 3 dB. Furthermore, the ML-aided detection produces an error floor (see footnote 9) for the CSI error variance of 0.25, when $f_p = 0.03$. By contrast, the learning-assisted detection remains capable of accurately estimating the indices of the MS-STSK transmission regardless of the nature of the CSI, while also circumventing pilot-assisted channel estimation altogether (see footnote 10).

While Fig. 12 shows an SNR gap between the proposed scheme and ML detection, it is also pertinent to study the effective throughput of both designs for the sake of fairness, since the SNR gain observed for ML detection in Fig. 12 is critically contingent on the CSI estimation accuracy, which improves in proportion to the pilot density. However, increasing the pilot density reduces the effective throughput of the design. Therefore, Fig. 13 characterizes the Discrete-input Continuous-output Memoryless Channel (DCMC) capacity of the learning-assisted detection and of the ML detection for different pilot overheads. It can be seen in the figure that the capacity of the design is strictly governed by the pilot overhead. More explicitly, for the

Footnote 9: Note that we have used LMMSE for channel estimation, since it is employed in LTE. It is instructive to note that employing other sophisticated channel estimation techniques may also increase the complexity of the design, which defeats the purpose of our work.

Footnote 10: We emphasize here that the performance of the DNN method cannot be better than that of the conventional ML method.
It is also instructive to note that the loss function computed for designing the DNN weights is itself deduced from the ML method. The purpose of using the DNN method is to reduce the detection complexity. Note that the DNN method can outperform the ML method only in the face of channel impairments.

Additionally, we note that there is an SNR gain of 0.5 dB and 1 dB for our learning-assisted detection over the ML detection with 5% overhead when aiming for a rate of 3 [bpcu] and 4 [bpcu], respectively. This gain is around 1 dB and 3 dB for the learning-assisted detection over the ML detection with 10% overhead at the rates of 3 [bpcu] and 4 [bpcu], respectively.

Fig. 14 shows the BER of the learning-assisted detection, of the ML-aided detection with perfect CSI, and of the ML-aided detection with imperfect CSI for MS-STSK transmission in conjunction with BIM. In this setting, we assumed that the channel supports two beams for each AA of Fig. 3(b). In other words, in Fig. 3(b) there is an extra index to be estimated, namely the beam index. In this scenario, it is empirically observed during the training phase that the number of neurons has to be set to 30 for both the real and imaginary constituents of the NN. It can be seen from Fig. 14 that adding a further index for estimation increases the SNR gap between the learning-assisted detection and the ML detection using perfect CSI to 8 dB. Again, the superior performance of the ML detection is due to the unrealistic

assumption of having perfect CSI. However, when a CSI estimate associated with the error variance of 0.15 is introduced, the ML-aided detection starts to produce an error floor from around −10 dB, while the BER remains flat for the CSI error variance of 0.25. On the other hand, despite the absence of CSI, the learning-assisted detection estimates both the MS-STSK indices and the beam index with high integrity.

Fig. 15 characterizes the DCMC capacity of the proposed design and of the ML design when MS-STSK transmission is amalgamated with BIM. In this simulation, it is assumed that the channel supports two beams and that BIM is employed for exploiting these two beams, where only one beam is activated depending on the input bit sequence. It is evident from the figure that the DCMC capacity of the ML detection is inferior to that of the learning-assisted detection. This is because of the overhead imposed by the pilots used for channel estimation to aid the ML detection process. This becomes especially pronounced when the pilot overhead is set to 10%, as seen in the figure, where the DCMC capacity of the ML detection is less than 5.5 [bpcu], while that of the learning-assisted detection is 6 [bpcu]. In other words, the pilots required for estimating the CSI consume part of the physical resources, thereby reducing the effective capacity of the system. However, this behavior is avoided by the learning-assisted system, since it achieves accurate symbol detection at a retraining overhead as low as 0.002%. Furthermore, we observe an SNR gain of 3 dB at the rate of 5 [bpcu] for our learning-assisted detection over the ML detection at 10% overhead, while the gain is around 2 dB at the rate of 4 [bpcu].
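The two overhead expressions given earlier in this section can be compared numerically with a short Python sketch. The values of $N_p$, $N_d$ and $N_f$ below are hypothetical and chosen only to produce a 10% pilot overhead; the proportionality constant of the retraining overhead is assumed to be 1, and the function names are illustrative rather than taken from the paper.

```python
def retraining_overhead(n_p, n_d, n_f):
    """Retraining overhead N_p / (N_f (N_d + N_p)), assuming unit proportionality."""
    return n_p / (n_f * (n_d + n_p))

def ml_pilot_overhead(n_p, n_d, n_f):
    """Pilot overhead of ML detection, N_p N_f / (N_f (N_d + N_p))."""
    return (n_p * n_f) / (n_f * (n_d + n_p))

# Hypothetical frame structure: 10 pilots and 90 data symbols per frame,
# with the NN retrained once every 100 frames.
n_p, n_d, n_f = 10, 90, 100
pilot = ml_pilot_overhead(n_p, n_d, n_f)
retrain = retraining_overhead(n_p, n_d, n_f)
print(f"ML pilot overhead:   {pilot:.4%}")
print(f"Retraining overhead: {retrain:.4%}")
```

The ML pilot overhead reduces to $N_p / (N_d + N_p)$ and is therefore independent of $N_f$, whereas the retraining overhead is smaller by a factor of $N_f$, so it shrinks as the channel varies more slowly and retraining is needed less often.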

VI. CONCLUSION
In this paper, we proposed deep learning assisted detection for index modulation aided mmWave systems, where we trained an NN for estimating the symbol indices without relying on explicit CSI knowledge. As a design example, we first employed MS-STSK and then analyzed our design when MS-STSK transmission is considered in conjunction with BIM. In contrast to the conventional ML detector of MS-STSK, our learning-assisted detection dispenses with the channel estimation stage, which makes it more spectrally efficient than ML detection. We demonstrated by simulations that the learning-assisted detection outperforms ML detection in the face of realistic channel impairments. Furthermore, we showed by simulations that ML detection produces an error floor when the MS-STSK transmission is coupled with BIM. Additionally, we presented a qualitative discussion of the receiver complexity in terms of both its search space and the number of computations required.

ALAIN A. M. MOURAD (Member, IEEE) received the Ph.D. degree in telecommunications from ENST Bretagne, France. He has over 15 years' experience in the wireless networks industry. He is currently leading the research and development of next-generation radio access networks at InterDigital International Labs (London, Berlin, Seoul). Prior to joining InterDigital, he was a Principal Engineer at Samsung Electronics R&D, U.K., and previously a Senior Engineer at Mitsubishi Electric R&D Centre Europe, France. Throughout his career, he has been active in the research and standardization of recent communication networks (5G/4G/3G) and broadcasting systems (ATSC 3.0/DVB-NGH/DVB-T2). He has held various leadership roles in the industry, invented over 35 granted patents and several other patent applications, and has authored over 50 peer-reviewed publications. He