Enumeration and Identification of Active Users for Grant-Free NOMA Using Deep Neural Networks

In next-generation mobile radio systems, multiple access schemes will support a massive number of uncoordinated devices exhibiting sporadic traffic, transmitting short packets to a base station. Grant-free non-orthogonal multiple access (NOMA) has been introduced to provide services to a large number of devices and to reduce the communication overhead in massive machine-type communication (mMTC) scenarios. In grant-free communication, there is no coordination between the device and base station (BS) before the data transmission; therefore, the challenging task of active users detection (AUD) must be conducted at the BS. For NOMA with sparse spreading, we propose a deep neural network (DNN)-based approach for AUD called active users enumeration and identification (AUEI). It consists of two phases: firstly, a DNN is used to estimate the number of active users; then in the second phase, another DNN identifies them. To speed up the training process of the DNNs, we propose a multi-stage transfer learning technique. Our numerical results show a remarkable performance improvement of AUEI in comparison to previously proposed approaches.


I. INTRODUCTION
In recent years, mMTC has gained a lot of attention due to applications such as smart grid and metering, smart factories, autonomous driving, and public health [1], [2]. In cellular scenarios, mMTC has to provide connectivity between BSs and a very large number of devices [3].
In a conventional multiple-access scenario consisting of a relatively small number of human-type users, the BS assigns radio resources to each user in a coordinated fashion. In an mMTC scenario, by contrast, this resource allocation approach yields a tremendous control signaling overhead, which may be large in comparison to the size of the data, making the protocol highly inefficient.
To cope with these limitations, grant-free approaches have been proposed. In grant-free random access, signaling overhead and latency are reduced as the active devices transmit data without a grant procedure. In contrast to orthogonal multiple access, NOMA permits sharing of the same time-frequency resources; therefore, it can support a massive number of devices in a limited radio spectrum. In code-domain NOMA, each user is assigned a sparse spreading sequence, known to the BS. The length of the spreading sequences is kept low to efficiently utilize the radio spectrum. Due to the large number of users, the sequences are non-orthogonal. Despite this, decoding is possible in mMTC because the number of active devices at any given time is a small fraction of the total number of devices. Since there is no previous coordination or grant procedure, the BS must identify the active users to be able to decode them by their respective spreading sequences. Thus, the first crucial step is active user detection. Due to the sparseness of the users' activation pattern, compressive sensing (CS)-based techniques have been proposed in NOMA to identify them [4], [5], [6]. In [7], the authors proposed a low-complexity algorithm for active users detection using pilot sequences with a massive number of antennas at the BS. A receiver which works independently of parameters such as signal-to-noise ratio (SNR) and user activity ratio in a NOMA setting is proposed in [8]. However, it has been shown that the performance of CS-based detection schemes degrades considerably as the sparsity level (number of active devices) increases [6]. Moreover, CS-based algorithms fail to consider time constraints [9]. For instance, the number of iterations of block iterative hard thresholding (BIHT) presented in [10] depends on the sparsity level, i.e., the algorithm takes more time to converge as the sparsity level increases.
The associate editor coordinating the review of this manuscript and approving it for publication was Ding Xu.
To overcome some of these issues, deep learning (DL) methods could be used instead of CS. Indeed, it has been shown that a DNN can learn a large number of piecewise smooth functions [11], and since then DL methods have been successfully proposed in various fields, such as speech recognition [12], computer vision [13], and language translation [14]. DL techniques find several applications in the wireless communication domain as well [9], [15], [16], [17]. In contrast to CS solutions, DL requires a large amount of data for training, but once the network is trained the complexity becomes low. Indeed, in the operational mode, DL involves only multiply-accumulate operations and element-wise nonlinear evaluations, which are far less computationally expensive than the CS-based techniques [9], [18]. Thus, some studies have been carried out to identify active users in NOMA scenarios using DL algorithms [3], [18]. Specifically, a recurrent neural network (RNN) has been proposed for both AUD and channel estimation considering a NOMA scenario with sparse spreading sequences in [3]. Another approach that deals with AUD using a DNN architecture with residual connections has been proposed in [18]. The existing DNN-based algorithms for AUD can be divided into three categories: i) assuming the number of active users is perfectly known [19]; ii) without preliminary estimation of the number of active users [3], [20], [21]; iii) estimating this number through thresholding-based algorithms [18]. Assuming perfect knowledge of the number of active users is unrealistic. Also, sparsity estimation by thresholding-based algorithms is not an easy task, as the threshold level would depend on several system parameters in an unknown way, leading to poor results when compared with the other categories [3].
In this paper, we assume that at the beginning of the transmission the BS is unaware of the number of active users. The main contributions of this paper are summarized as follows:
• we propose a new solution to active users detection comprised of two novel DNN architectures, one for sparsity estimation called active users enumeration (AUE), and the other one for identifying the active users called active users identification (AUI);
• we compare our solution with previous approaches to assess the performance improvement;
• we also report the false alarm rate to completely characterize the performance of our model. The false alarm rate has never been analyzed in the literature on AUD to the best of our knowledge;
• we investigate a multi-stage transfer learning approach to reduce the training time of the DNNs.
The rest of the paper is organized as follows. We present the system model along with the concept of spreading sequences and multiple measurements in Section II. In Section III, we explain the DNN architecture for AUD and sparsity estimation. Section IV contains the simulation settings and results. Section V concludes the study.
We use boldface uppercase, boldface lowercase, and lowercase letters to denote matrices, vectors, and scalars, respectively. Also, $\mathrm{abs}(v)$ and $\arg(v)$ denote the magnitude and argument of the complex number $v$, respectively. The operator $\mathrm{diag}(\mathbf{v})$ outputs a diagonal matrix with the entries of the vector $\mathbf{v}$ along the diagonal, and $\|\cdot\|_p$ represents the $p$-norm.

II. SYSTEM MODEL
We consider a synchronized uplink grant-free NOMA system scenario as in [18] and [3], in which $N$ machine-type devices can transmit to the BS (see Fig. 1). Both the machine-type devices and the BS are equipped with a single antenna, and each device is assigned a preconfigured sequence (or codeword), known by the BS. A small number of devices $K$ are active at a given time, with $1 \le K \le K_{\max}$ and $K_{\max} \ll N$, where $K_{\max}$ is a system parameter representing the maximum number of active users under consideration. The symbols generated by each active device are spread with its device-specific non-orthogonal codeword. Then, the samples are transmitted through parallel frequency-flat channels, e.g., by Orthogonal Frequency-Division Multiplexing (OFDM).
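For instance, if the $i$th device wants to send a symbol $s_i^{(t)}$ at time $t$, it transmits the vector $\mathbf{q}_i^{(t)} = s_i^{(t)} \mathbf{w}_i$, where $\mathbf{w}_i \in \mathbb{C}^S$ is the codeword of length $S$ associated with the $i$th device. The $S$ elements of $\mathbf{q}_i^{(t)}$ are sent over $S$ parallel additive white Gaussian noise (AWGN) channels with gains $\mathbf{h}_i = [h_{i,1}, \ldots, h_{i,S}]^T$. Overall, the received vector at the BS at time $t$ can be written as
$$\mathbf{y}^{(t)} = \sum_{i=1}^{N} \delta_i \, \mathrm{diag}(\mathbf{h}_i) \, \mathbf{q}_i^{(t)} + \mathbf{n}^{(t)} \qquad (1)$$
where $\mathbf{n}^{(t)}$ denotes the complex Gaussian noise vector, $\mathbf{n}^{(t)} \sim \mathcal{CN}(\mathbf{0}, \sigma_n^2 \mathbf{I})$. The device indicator $\delta_i \in \{0, 1\}$ indicates the activity status of the $i$th device, with $\delta_i = 0$ for inactive and $\delta_i = 1$ for active devices.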
To minimize the inter-user interference, we employ low-density signature (LDS) codewords, i.e., each codeword has only a small number $n_S$ of non-zero values [22]. Similarly to [22], [18], and [3], to generate a codeword we first randomly pick $n_S$ positions, and then generate the non-zero entries as independent and identically distributed (i.i.d.) samples from a complex Gaussian distribution $\mathcal{CN}(0, \sigma_w^2)$. Assuming the devices transmit $N_d$ consecutive symbols, the received measurements can be arranged in a vector as follows
$$\tilde{\mathbf{y}} = \left[ (\mathbf{y}^{(1)})^T, \ldots, (\mathbf{y}^{(N_d)})^T \right]^T \qquad (2)$$
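The signal model of this section can be illustrated with a minimal sketch. It assumes that the channel gains stay fixed over the $N_d$ consecutive symbols and uses the parameter values reported later in the simulation setup ($N = 100$, $S = 10$, $n_S = 2$, $N_d = 7$); the function names are hypothetical and chosen only for illustration.

```python
import random

def complex_gaussian(variance=1.0):
    """Draw one sample from CN(0, variance)."""
    std = (variance / 2.0) ** 0.5   # variance splits evenly over real/imag parts
    return complex(random.gauss(0.0, std), random.gauss(0.0, std))

def lds_codeword(S, n_S, sigma_w2=1.0):
    """Length-S codeword with n_S non-zero CN(0, sigma_w2) entries at random positions."""
    codeword = [0j] * S
    for pos in random.sample(range(S), n_S):
        codeword[pos] = complex_gaussian(sigma_w2)
    return codeword

def received_vector(codewords, active, N_d, sigma_n2):
    """Stack N_d noisy superpositions of the active users' spread QPSK symbols."""
    S = len(codewords[0])
    qpsk = [complex(a, b) / 2 ** 0.5 for a in (1, -1) for b in (1, -1)]
    # One gain per user and sub-channel, held fixed over the N_d symbols (assumption).
    gains = {i: [complex_gaussian(1.0) for _ in range(S)] for i in active}
    y_tilde = []
    for _ in range(N_d):
        y = [complex_gaussian(sigma_n2) for _ in range(S)]   # start from the noise
        for i in active:
            s = random.choice(qpsk)                           # unit-energy symbol
            for j in range(S):
                y[j] += gains[i][j] * s * codewords[i][j]
        y_tilde.extend(y)
    return y_tilde

random.seed(0)
N, S, n_S, N_d = 100, 10, 2, 7
codewords = [lds_codeword(S, n_S) for _ in range(N)]
active = random.sample(range(N), 3)          # three active users, 0-based indexes
y_tilde = received_vector(codewords, active, N_d, sigma_n2=0.1)   # SNR = 10 dB
```

The stacked vector has $S N_d = 70$ complex entries, which is the dimension seen by the detector.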
Assuming a maximum number of active users $K_{\max}$, AUD can be formulated as the support identification problem (4), where the candidate supports $\Omega$ are the subsets of $\{1, 2, \ldots, N\}$, and $\hat{\Omega}$ contains the indexes of the estimated active users. One possible approach to solve (4) consists of applying CS-based techniques, which, however, could be challenging for real-time applications [6], [10], [23], [24], [25]. By contrast, once a DNN is trained, estimating $\hat{\Omega}$ is less computationally expensive than with CS-based approaches. In the next section, we discuss our approach, which is based on DL.

III. DEEP LEARNING-BASED AUD
Different approaches based on DNNs have been proposed in the literature for AUD, all employing thresholding-based algorithms for determining the number of active users [18], [26]. Here we present a different solution composed of two separate DNN architectures, one for active users enumeration and the other for active users identification. To the best of our knowledge, this is the first work which utilizes a DNN-based architecture for enumerating the active users in a NOMA scenario. The task of the AUE network is to output the number of active users, while a set of AUI networks, each trained for a different sparsity level, identifies the active users. More precisely, the former learns the mapping between the received vector $\tilde{\mathbf{y}}$ and the estimated number of active users $\hat{K}$, while the latter learns the mapping between the received vector $\tilde{\mathbf{y}}$ and $\hat{\Omega}$ for the cardinality $|\hat{\Omega}| = \hat{K}$. The networks provide the result as follows
$$\hat{K} = f(\tilde{\mathbf{y}}, \Theta), \qquad \hat{\Omega} = g_{\hat{K}}(\tilde{\mathbf{y}}, \Theta_{\hat{K}}) \qquad (5)$$
where $\Theta$ and $\Theta_k$ are the sets of weights and biases associated with the enumeration DNN and the identification DNN for sparsity $k \in \{1, 2, \ldots, K_{\max}\}$, respectively.

A. DNNs ARCHITECTURE
The received vector obtained through (2) has complex elements. To work with common DNNs, which assume real numbers as input, we split each element into its magnitude and phase. More precisely, for a received vector $\tilde{\mathbf{y}} = [y_1, \ldots, y_m]^T \in \mathbb{C}^m$, the input to the DNNs is $\hat{\mathbf{y}} = [\mathrm{abs}(y_1), \arg(y_1), \ldots, \mathrm{abs}(y_m), \arg(y_m)]^T$. Fig. 2 shows the architecture for the AUE and AUI. Both DNNs consist of convolutional layers, fully-connected layers, batch normalization layers, a dropout layer, and activation layers. The difference between the AUE and AUI is in the output layer, which is a softmax layer for the AUE and a sigmoid layer for the AUI. These output layers are described precisely below. The input to the DNNs $\hat{\mathbf{y}}$ is reshaped to a 2-D feature map $(N_d, 2S)$ using the reshape layer, where the first dimension corresponds to the channels, analogous to the channels in a colour image. The 1-D convolution operation is performed using filters of size 2 and 4, with a stride equal to the filter size. We perform valid convolution, i.e., the output is only considered when the filter is fully contained in the feature map, and the output feature map is reduced according to the input feature map, filter size and stride [27]. The output feature maps from the convolutional layers are passed through a Rectified Linear Unit (ReLU) activation function (described below). The output from the activation function is flattened to 1-D and then concatenated through the concatenation layer. The rationale behind using convolutional layers is to reduce the computational complexity and to extract the features shared among the $N_d$ multiple measurements. The fully-connected layer, with input $\mathbf{a} \in \mathbb{R}^{\mathrm{in}}$ and output $\mathbf{z} \in \mathbb{R}^{\mathrm{out}}$, can be expressed as
$$\mathbf{z} = \mathbf{W}\mathbf{a} + \mathbf{b} \qquad (6)$$
where $\mathbf{W} \in \mathbb{R}^{\mathrm{out} \times \mathrm{in}}$ represents the weight matrix matching the dimensions of the output (out) and input (in) vectors, and $\mathbf{b} \in \mathbb{R}^{\mathrm{out}}$ describes the bias [27]. The fully-connected layers consist of $\alpha$ neurons, except for the last one.
In fact, the dimension of the last fully-connected layer must agree with the output layer dimension, so it contains $K_{\max}$ and $N$ neurons for the enumeration and the identification DNNs, respectively. The batch normalization layer normalizes each element $a_i$ of its input vector over the mini-batch $\mathcal{B}$ to zero mean and unit variance, and then scales and shifts it using the trainable parameters $\gamma$ and $\beta$
$$\bar{a}_i = \gamma \, \frac{a_i - \mu_i}{\sigma_i} + \beta \qquad (7)$$
where $\mu_i$ and $\sigma_i^2$ are estimates of the mean and variance of the $i$th element of the vector, respectively, obtained by moving average [28]. The activation layers are introduced so that the DNNs can learn non-linear functions. ReLU is a common choice as an activation function in the hidden layers for numerous DNN architectures [27], [29], and it can be mathematically described as
$$\mathrm{ReLU}(\mathbf{a}) = \max(\mathbf{0}, \mathbf{a}) \qquad (8)$$
where the operation is to be considered element-wise. A DNN with many hidden layers becomes challenging to train due to the vanishing/exploding gradient problem [30]. Therefore, we adopt the residual connections scheme proposed in [31]. Residual connections directly pass information from the previous layer to the next layer as depicted in Fig. 2.
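The input preprocessing described at the start of this subsection (magnitude/phase splitting, reshaping to $(N_d, 2S)$, and valid convolutions whose stride equals the filter size) can be sketched as follows. The sketch assumes that the stacked vector is ordered measurement by measurement, so that each row of the feature map holds one measurement; the helper names are hypothetical.

```python
import cmath

def to_real_features(y_tilde):
    """Interleave magnitude and phase: [abs(y1), arg(y1), ..., abs(ym), arg(ym)]."""
    features = []
    for y in y_tilde:
        features.append(abs(y))
        features.append(cmath.phase(y))
    return features

def reshape_rows(features, N_d, S):
    """View the 2*N_d*S features as N_d rows of length 2*S (one row per measurement)."""
    width = 2 * S
    assert len(features) == N_d * width
    return [features[r * width:(r + 1) * width] for r in range(N_d)]

def valid_conv_length(input_len, filter_size, stride):
    """Output length of a 'valid' 1-D convolution (filter fully inside the input)."""
    return (input_len - filter_size) // stride + 1

N_d, S = 7, 10
y_tilde = [complex(k, -k) for k in range(1, N_d * S + 1)]   # placeholder samples
rows = reshape_rows(to_real_features(y_tilde), N_d, S)

# Filters of size 2 and 4 with stride equal to the filter size, applied along 2*S = 20.
len_filter2 = valid_conv_length(2 * S, filter_size=2, stride=2)
len_filter4 = valid_conv_length(2 * S, filter_size=4, stride=4)
```

With a filter of size 2 and stride 2, each filter position covers exactly one (magnitude, phase) pair, which gives 10 output positions per filter; the size-4 filter covers two pairs at a time and gives 5.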
The output layer of the AUE has dimension equal to the maximum sparsity level $K_{\max}$. The softmax layer takes as input a vector $\mathbf{z}$ and normalizes it to a probability distribution $\hat{\mathbf{p}}$
$$\hat{p}_i = \frac{e^{z_i}}{\sum_{j=1}^{K_{\max}} e^{z_j}} \qquad (9)$$
where $z_i$ and $z_j$ are the $i$th and $j$th element of $\mathbf{z}$, while $\hat{p}_i$ is the $i$th element of $\hat{\mathbf{p}}$. The final estimate is
$$\hat{K} = \arg\max_{i \in \{1, \ldots, K_{\max}\}} \hat{p}_i \qquad (10)$$
For active user identification we use $K_{\max}$ neural networks $g_1(\cdot), g_2(\cdot), \ldots, g_{K_{\max}}(\cdot)$, as defined in (5). The architecture of all the $K_{\max}$ AUI networks remains the same as shown in Fig. 2b; only the dataset used for training each AUI is different. For instance, for training $g_k(\cdot)$, a dataset comprising $k$ active users is considered. Here, a dropout layer is employed to avoid overfitting of the model on the training dataset. In this layer, during the training phase, a fraction of the input and output connections from the neurons are dropped [28]. To identify active users, we adopt as output a sigmoid layer with $N$ outputs, one per user. Each output is calculated as
$$\hat{q}_i = \frac{1}{1 + e^{-z_i}} \qquad (11)$$
where $\hat{q}_i$ represents the likelihood of user $i$ being active.

Algorithm 1 Deep Learning-Based AUEI
Input: $\tilde{\mathbf{y}}$, $K_{\max}$
Output: $\hat{\Omega}$
1: Pass $\tilde{\mathbf{y}}$ through the enumeration DNN to obtain $\hat{\mathbf{p}}$
2: $\hat{K} \leftarrow \arg\max_{i \in \{1, \ldots, K_{\max}\}} \hat{p}_i$
3: $\hat{\Omega} \leftarrow g_{\hat{K}}(\tilde{\mathbf{y}}, \Theta_{\hat{K}})$
Return: $\hat{\Omega}$
In previous approaches, a comparison with a threshold was proposed to decide which users were active, but these methods suffer from the difficulty of finding a suitable threshold value. In our approach, summarized in Algorithm 1, we rely on $\hat{K}$ from the AUE network and then consider the AUI network trained for $\hat{K}$ active users. With this network, using $\tilde{\mathbf{y}}$ as input, the $\hat{K}$ users with the largest likelihoods are considered as active
$$\hat{\Omega} = \arg\max_{\Omega \subseteq \{1, \ldots, N\},\, |\Omega| = \hat{K}} \sum_{i \in \Omega} \hat{q}_i \qquad (12)$$
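The two-phase procedure of Algorithm 1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the trained networks are replaced by hypothetical callables (`enumeration_dnn` returning $K_{\max}$ pre-softmax scores, `identification_dnns[k]` returning $N$ per-user likelihoods), and the toy values are invented for the example.

```python
import math

def softmax(z):
    """Normalize a real vector to a probability distribution."""
    shift = max(z)                        # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def auei(y_tilde, enumeration_dnn, identification_dnns):
    """Two-phase detection: estimate K, then keep the K most likely users.

    enumeration_dnn(y) returns K_max pre-softmax scores (hypothetical callable);
    identification_dnns[k](y) returns N per-user likelihoods (hypothetical).
    """
    p_hat = softmax(enumeration_dnn(y_tilde))
    K_hat = max(range(len(p_hat)), key=p_hat.__getitem__) + 1   # classes are 1..K_max
    q_hat = identification_dnns[K_hat](y_tilde)
    ranked = sorted(range(len(q_hat)), key=q_hat.__getitem__, reverse=True)
    return sorted(i + 1 for i in ranked[:K_hat])               # 1-based user indexes

# Toy stand-ins for trained networks (K_max = 3, N = 5).
toy_aue = lambda y: [0.1, 2.0, 0.3]
toy_aui = {k: (lambda y: [0.05, 0.90, 0.10, 0.80, 0.20]) for k in (1, 2, 3)}
detected = auei(None, toy_aue, toy_aui)
```

In this toy case the enumeration scores peak at the second class, so $\hat{K} = 2$, and the two largest likelihoods belong to users 2 and 4. No threshold on the likelihoods is needed, since the number of selected users is fixed by $\hat{K}$.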

B. DNNs TRAINING
Sparsity estimation can be seen as a multi-class classification, in which we are categorizing the input $\tilde{\mathbf{y}}$ in one of the categories ranging from 1 to $K_{\max}$. To this aim, we employ a categorical cross-entropy loss. Let us indicate the true label vector as $\mathbf{p} = [p_1, p_2, \ldots, p_{K_{\max}}]$. If the number of active users is $k$, it will be $p_k = 1$ and $p_j = 0$ $\forall j \neq k$. For instance, if the number of active users is 2, then $\mathbf{p} = [0, 1, 0, \ldots, 0]$. The categorical cross-entropy loss $J_S(\mathbf{p}, \hat{\mathbf{p}})$ is defined as
$$J_S(\mathbf{p}, \hat{\mathbf{p}}) = -\sum_{i=1}^{K_{\max}} p_i \log \hat{p}_i \qquad (13)$$
User activity identification can be seen as a multi-label classification problem, in which we are selecting $\hat{K}$ out of $N$ users. To this aim, we employ a binary cross-entropy loss. Let us indicate the true label vector as $\mathbf{q} = [q_1, q_2, \ldots, q_N]$, where each element represents the user as active ($q_i = 1$) or inactive ($q_i = 0$). For instance, if $\Omega = \{2, 4\}$ then $\mathbf{q} = [0, 1, 0, 1, \ldots, 0]$. The binary cross-entropy loss is defined as
$$J_A(\mathbf{q}, \hat{\mathbf{q}}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ q_i \log \hat{q}_i + (1 - q_i) \log(1 - \hat{q}_i) \right] \qquad (14)$$
In order to determine the parameters $\Theta$ and $\Theta_k$ in (5), we need to minimize the loss functions $J_S(\mathbf{p}, \hat{\mathbf{p}})$ and $J_A(\mathbf{q}, \hat{\mathbf{q}})$ for enumeration and identification, respectively. For that purpose, we employ the well-known Adam optimizer [32].
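As a small worked example of the label encodings and losses used for training, the following sketch builds the one-hot sparsity label and the multi-hot activity label and evaluates the two cross-entropy losses. It assumes that the binary cross-entropy is averaged over the $N$ users (a common convention; the normalization does not change the minimizer), and the helper names are hypothetical.

```python
import math

def categorical_cross_entropy(p, p_hat, eps=1e-12):
    """-sum_i p_i * log(p_hat_i) for a one-hot label p."""
    return -sum(t * math.log(max(e, eps)) for t, e in zip(p, p_hat))

def binary_cross_entropy(q, q_hat, eps=1e-12):
    """Mean over users of -[q_i log(q_hat_i) + (1 - q_i) log(1 - q_hat_i)]."""
    total = 0.0
    for t, e in zip(q, q_hat):
        e = min(max(e, eps), 1.0 - eps)   # keep the logarithms finite
        total -= t * math.log(e) + (1 - t) * math.log(1.0 - e)
    return total / len(q)

def one_hot_sparsity(k, K_max):
    """Label for the enumeration network: 1 at position k, 0 elsewhere."""
    return [1 if i == k else 0 for i in range(1, K_max + 1)]

def activity_label(omega, N):
    """Label for the identification network: 1 for each active user index."""
    return [1 if i in omega else 0 for i in range(1, N + 1)]

p = one_hot_sparsity(2, K_max=8)          # two active users
q = activity_label({2, 4}, N=6)           # users 2 and 4 active
loss_uniform = categorical_cross_entropy(p, [1 / 8] * 8)
loss_half = binary_cross_entropy(q, [0.5] * 6)
```

An uninformative enumeration output (uniform over the 8 classes) gives a loss of $\log 8$, and an identification output of 0.5 for every user gives $\log 2$; a trained network should fall well below both.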
With the proposed approach, we have to train $K_{\max}$ AUI networks, which is a time-consuming and computationally expensive task. To counter this, we propose a multi-stage transfer learning technique. To train the AUI network $g_k(\cdot)$ in (5) for $k \ge 2$ through this technique, we start from the weights of $g_{k-1}(\cdot)$. More precisely, the weights of $g_1(\cdot)$ are initialized according to [33]. Then, $g_1(\cdot)$ is trained until the network converges, i.e., there is no significant change in the network weights. Instead of initializing the weights of $g_2(\cdot)$ randomly, they are initialized with the trained weights of $g_1(\cdot)$; this way, $g_2(\cdot)$ leverages the information learnt by $g_1(\cdot)$ and converges faster than its randomly initialized counterpart. In general, the weights of $g_k(\cdot)$ are hence initialized with the trained weights of $g_{k-1}(\cdot)$, for $k = 2, \ldots, K_{\max}$.
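The chaining of the AUI networks can be sketched as a simple loop. This is only an illustration of the warm-start order: `init_weights` and `train` are hypothetical callables supplied by the caller, and the default epoch counts (10 for $g_1$, 3 for the following networks) are the values used later in the simulations.

```python
import copy

def train_aui_chain(K_max, init_weights, train, epochs_first=10, epochs_next=3):
    """Train g_1..g_K_max, warm-starting each g_k from the trained g_{k-1}.

    init_weights() returns fresh weights for g_1 and train(weights, k, epochs)
    returns trained weights for sparsity k; both are hypothetical callables.
    """
    trained = {}
    weights = init_weights()
    for k in range(1, K_max + 1):
        epochs = epochs_first if k == 1 else epochs_next
        # Copy so that the stored weights of g_{k-1} are not modified in place.
        weights = train(copy.deepcopy(weights), k, epochs)
        trained[k] = weights
    return trained

# Toy example: "weights" is a single number and training adds k * epochs to it.
log = []
def toy_train(w, k, epochs):
    log.append((k, w, epochs))       # record the starting point of each stage
    return w + k * epochs

nets = train_aui_chain(3, init_weights=lambda: 0, train=toy_train)
```

The log shows that each stage starts from the result of the previous one rather than from a fresh initialization, which is the whole point of the technique.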

C. COMPUTATIONAL COMPLEXITY
In this subsection, the computational complexity of the AUEI is presented in terms of floating point operations (FLOPs). We count each addition, subtraction, multiplication, division and exponential computation as a single floating point operation, as in [18]. The FLOPs of the convolutional layers depend on $N_{\mathrm{conv}*}$, $F_{\mathrm{conv}*}$ and $\mathrm{out}_{\mathrm{conv}*}$, which represent the number of convolution filters, the size of the filter and the output shape, respectively. The output of the convolutional layers is fed into a ReLU, which adds its own computational complexity. The number of FLOPs in a fully-connected layer (6) is dictated by the input (in) and output (out) size
$$C_{FC} = \mathrm{in} \cdot \mathrm{out} + (\mathrm{in} \cdot \mathrm{out} - \mathrm{out}) + \mathrm{out}$$
The number of multiplication and addition operations in $\mathbf{W}\mathbf{a}$ is given by the terms $(\mathrm{in} \cdot \mathrm{out})$ and $(\mathrm{in} \cdot \mathrm{out} - \mathrm{out})$, respectively. The last term $(\mathrm{out})$ is the number of addition operations due to the bias $\mathbf{b}$. The computational complexity of the fully-connected layer simplifies to
$$C_{FC} = 2 \cdot \mathrm{in} \cdot \mathrm{out}$$
Consequently, the FLOPs of the input fully-connected layer follow from this expression, with $\mathrm{out} = \alpha$ and $\mathrm{in}$ equal to the size of the concatenated convolutional output. The batch normalization (7) involves four operations; therefore, the complexity of the input batch normalization layer can be expressed as $4\alpha$. The hidden layer is composed of two fully-connected layers, two batch normalization layers, two activation functions, one dropout layer and one residual connection. The dropout layer and residual connection are element-wise multiplication and addition operations; therefore, each contributes $\alpha$ to the complexity of the algorithm. The overall complexity of the $L$ hidden layers is obtained by summing these contributions over the $L$ layers. The computational cost incurred at the output fully-connected layer of AUE and AUI is $C^{AUE}_{FC_{out}} = 2\alpha K_{\max}$ and $C^{AUI}_{FC_{out}} = 2\alpha N$, respectively.
The softmax layer (9) in AUE invokes $K_{\max}$ exponentials, $K_{\max}$ divisions and $K_{\max} - 1$ additions, for a total of $3K_{\max} - 1$ operations. The number of floating point operations in the sigmoid layer (11) is counted similarly. According to [34], finding the largest probabilities in (10) and (12) yields the complexities $C^{AUE}_{\max} = K_{\max} - 1$ and $C^{AUI}_{\max} = KN - \frac{K(K+1)}{2}$, respectively. The overall computational complexity of the AUE, given in (15), collects the contributions of its layers. Likewise, the computational complexity of the AUI is given in (16). Finally, the complexity of the AUEI, given in (17), is the sum of the AUE and AUI complexities. In the next section, we compare this complexity with that of the algorithm presented in [18].
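The per-layer counts that are stated explicitly in this subsection can be checked numerically. The sketch below covers only those terms (fully-connected layers, softmax, and the two maximum searches) and evaluates them for $\alpha = 1000$, $K_{\max} = 8$ and $N = 100$, the values used later in the simulations; it is not a full evaluation of the overall complexity, and the function names are hypothetical.

```python
def fc_flops(n_in, n_out):
    """Fully-connected layer: in*out multiplications, (in*out - out) additions
    inside W a, and out additions for the bias, i.e. 2*in*out in total."""
    return n_in * n_out + (n_in * n_out - n_out) + n_out

def softmax_flops(K_max):
    """K_max exponentials + K_max divisions + (K_max - 1) additions."""
    return K_max + K_max + (K_max - 1)

def aue_max_flops(K_max):
    """Cost of finding the largest of K_max probabilities."""
    return K_max - 1

def aui_max_flops(K, N):
    """Cost of picking the K largest of N likelihoods: K*N - K*(K + 1)/2."""
    return K * N - K * (K + 1) // 2    # K*(K + 1) is always even

alpha, K_max, N = 1000, 8, 100
summary = {
    "hidden FC (alpha x alpha)": fc_flops(alpha, alpha),
    "AUE output FC": fc_flops(alpha, K_max),
    "AUI output FC": fc_flops(alpha, N),
    "AUE softmax": softmax_flops(K_max),
    "AUE max search": aue_max_flops(K_max),
    "AUI top-K search (K = 8)": aui_max_flops(8, N),
}
```

The numbers make the scaling visible: a single hidden fully-connected layer costs about $2 \cdot 10^6$ FLOPs, whereas the only term that depends on the sparsity $K$, the top-$K$ search, stays below $10^3$ FLOPs even for $K = 8$.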

IV. SIMULATION SETTINGS AND RESULTS
A. SIMULATION SETUP
We generate samples according to the system model described by (2) for training and testing our DNNs.
To compare our proposal with other algorithms from the literature, we choose the same simulation parameters as in [18], namely a total number of users $N = 100$, a maximum number of active users $K_{\max} = 8$, spreading codewords with sparsity $n_S = 2$ and length $S = 10$, and $N_d = 7$ successive measurements. The case of zero active users can be handled with less computationally expensive spectrum sensing techniques or machine learning algorithms, as described, e.g., in [35], [36], and [37]. The non-zero values of the LDS codewords are generated from the distribution $\mathcal{CN}(0, \sigma_w^2)$ with $\sigma_w^2 = 1$. We use a Rayleigh fading channel model with perfect power control, so that the gains $h_{i,j} \sim \mathcal{CN}(0, 1)$ are i.i.d. complex Gaussian. Note that owing to perfect power control, the distance of the devices from the BS does not contribute towards the received vector. The data symbols $s_i$ are unit-energy quadrature phase-shift keying (QPSK), so that the SNR is defined as $\mathrm{SNR} = 1/\sigma_n^2$. For the AUE network dataset, the number of active users in each sample varies from 1 to $K_{\max}$. For the training, we generate $13.5 \cdot 10^6$ samples. The dataset generation for the $k$th AUI network $g_k(\cdot)$ involves randomly activating $k$ users from a total of $N$. We generate $9 \cdot 10^6$ training samples and $10^6$ testing samples per AUI network.
The architecture of both the AUE and AUI DNNs consists of $L = 2$ hidden layers. The convolutional layers consist of 64 filters. Except for the last fully-connected layer, each fully-connected layer consists of $\alpha = 1000$ neurons. For the AUE and AUI, the last fully-connected layer contains $K_{\max} = 8$ and $N = 100$ neurons, respectively.
We train the sparsity estimation DNN for 10 epochs. Regarding the AUI networks, in order to minimize the training time, we adopt the multi-stage transfer learning approach. Hence, the first AUI network, $g_1(\cdot)$, is trained for 10 epochs with He initialization [33], while for the $g_k(\cdot)$ network the weights are initialized from the trained weights of $g_{k-1}(\cdot)$. We employ the Adam optimizer for learning the weights in both DNNs. For the optimizer, we consider the following configuration: learning rate $= 0.001$, $\beta_1 = 0.9$, and $\beta_2 = 0.999$. In the training phase, we consider a mini-batch of size $|\mathcal{B}| = 1000$. The dropout rate is set to 0.1.
For the implementation of the deep learning algorithms, we employ the Keras deep learning framework with TensorFlow as backend [28], [38]. We trained the DNNs on a GPU server consisting of two Nvidia Quadro RTX 5000 cards, two Intel Xeon Gold 5222 processors and 128 GB of RAM.

B. RESULTS
As performance metrics, we use the recall, defined as $R = \mathrm{TP}/(\mathrm{TP} + \mathrm{FN})$, and the false alarm rate $F = \mathrm{FP}/(\mathrm{FP} + \mathrm{TN})$, where TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative, respectively. True positives (TP) and true negatives (TN) indicate the number of occurrences when the active/inactive users are correctly identified as active/inactive, respectively. Similarly, false positives (FP) and false negatives (FN) represent the number of occurrences when the inactive/active users are misclassified as active/inactive, respectively. In the following, one iteration means updating the weights over a mini-batch.
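The two metrics can be computed directly from the true and detected sets of active users. The sketch below is a minimal illustration with invented example sets; the function name is hypothetical, and it assumes at least one active and one inactive user so that both denominators are non-zero.

```python
def recall_and_false_alarm(true_active, detected, N):
    """Return (R, F) for one detection outcome among N users."""
    true_active, detected = set(true_active), set(detected)
    tp = len(true_active & detected)    # active users correctly detected
    fn = len(true_active - detected)    # active users that were missed
    fp = len(detected - true_active)    # inactive users flagged as active
    tn = N - tp - fn - fp               # inactive users correctly left out
    recall = tp / (tp + fn)
    false_alarm = fp / (fp + tn)
    return recall, false_alarm

# Four active users, one of them missed and one inactive user flagged instead.
R, F = recall_and_false_alarm({2, 4, 7, 9}, {2, 4, 7, 30}, N=100)
```

Here three of the four active users are found, so $R = 0.75$, while one of the 96 inactive users is flagged, so $F = 1/96$. Because the inactive users vastly outnumber the active ones, a single wrong pick moves the recall far more than the false alarm rate.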
Let us first investigate the training phase, specifically the rate of convergence of the weights for the AUI networks. In this regard, in Fig. 3 we report the loss versus the number of iterations. Comparing the curves with and without transfer learning, where $g_k(\cdot)$ for $k = 2$, 4 and 8 is trained for 3 epochs in the transfer learning approach, a considerable improvement in the speed of convergence of the training can be observed. The improvement is substantial for all sparsity levels $K$, and it is particularly large for the networks designed for large $K$ (see, e.g., the case $K = 8$). In the case $K = 2$ the advantage due to transfer learning is less pronounced. The improvement is also visible in terms of recall in Table 1, where we report the results with and without transfer learning for $K = 8$ and SNR = 10 dB. For obtaining the recall values through the multi-stage transfer learning, $g_1(\cdot)$ is trained for 10 epochs while $g_k(\cdot)$ for $2 \le k \le 8$ are trained for the number of epochs given in the first column of Table 1. The networks which are trained without the transfer learning approach are initialized through [33].
We compare the recall of the proposed architecture with points taken from the literature for other algorithms under the same simulation parameters, namely the Deep AUD (D-AUD) [18] and the compressed-sensing Approximate Message Passing (AMP) [18]. The curves for the proposed AUEI are obtained through the multi-stage transfer learning approach: $g_1(\cdot)$ is trained for 10 epochs while $g_k(\cdot)$ for $2 \le k \le 8$ is trained for 3 epochs. The proposed approach shows improved recall values with respect to the other algorithms, as can be seen in Fig. 4 and Fig. 5 for SNR = 10 dB and SNR = 20 dB, respectively. In contrast to our approach, the other algorithms suffer from substantial performance degradation for high sparsity levels. We present in Table 2 the false alarm rate for the proposed architecture with multi-stage transfer learning. It can be observed that our approach, besides the previously discussed high recall, yields a negligible false alarm rate.
In Fig. 6, we compare the performance of our algorithm with D-AUD and AMP in terms of recall for the SNR range 0 to 20 dB, $N_d = 7$ and $K = 4$. It can be observed that our approach outperforms the other approaches, especially in the low SNR regime. To check the robustness of our algorithm, we illustrate the performance for overloading factors 125% and 250% in Figs. 7 and 8, respectively. The overloading factor is defined as $N/(N_d S)$. For different overloading factors, we assume a fixed length of the spreading sequence, $S$, and a fixed number of users, $N$, while varying the number of measurements, $N_d$. A significant performance improvement can be observed for $N_d = 8$ in comparison to $N_d = 4$ for all the algorithms. In other words, increasing the number of measurements $N_d$, i.e., reducing the overloading factor, yields better performance. We observe that the proposed algorithm outperforms D-AUD and AMP in both scenarios, confirming the reliability of AUEI.
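As a quick check of the two overloading factors quoted above, the definition $N/(N_d S)$ can be evaluated for $N = 100$ and $S = 10$ with $N_d = 8$ and $N_d = 4$. The function name below is hypothetical.

```python
def overloading_factor(N, N_d, S):
    """Ratio between the number of users and the number of received samples."""
    return N / (N_d * S)

N, S = 100, 10
factor_nd8 = overloading_factor(N, N_d=8, S=S)   # 100 / 80
factor_nd4 = overloading_factor(N, N_d=4, S=S)   # 100 / 40
```

The two values are 1.25 and 2.5, i.e., 125% and 250%: halving the number of measurements doubles the overloading factor.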
Finally, we present the numerical comparison of computational complexity between AUEI (see Section III-C) and D-AUD, whose complexity for a given sparsity $K$ is stated in [18]. For calculating the overall D-AUD complexity, we also take into account the algorithm proposed in [18] for sparsity estimation. In this algorithm, the received vector is passed first through the D-AUD trained for sparsity level $K = 1$. If the output satisfies the threshold-based condition, this is considered as the sparsity level. Otherwise, the received vector is passed through the D-AUD network trained for $K = 2$, and so on. The procedure is repeated until the threshold-based condition is met or the maximum sparsity level is reached. Thus, for a given sparsity $K$, the received vector is passed through $K$ D-AUDs. For this reason, the complexity of the D-AUD algorithm grows linearly with the sparsity level. Considering that, the overall computational complexity of D-AUD is given in (18). Table 3 shows the computational complexity of AUEI and D-AUD for $N_d = 7$ and $K = 1$, 2, 4, and 8, calculated through (17) and (18). The number of hidden layers for AUEI and D-AUD is $L = 2$ and $L = 6$, respectively. As observed, the computational complexity of D-AUD increases linearly with the sparsity level, while the complexity of AUEI remains practically constant. This is due to the fact that the dependence on the sparsity level $K$ in (16) has a negligible effect on the overall computational complexity. Specifically, for all cases with more than one active user, the AUEI shows a significant gain in terms of complexity. So, despite having two separate architectures instead of one as in D-AUD, our approach yields a lower complexity and better performance.
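The difference in scaling can be illustrated by counting forward passes. The sketch below mimics the sequential search described above with hypothetical stand-ins: `networks[k]` and `condition_met` are placeholder callables (the actual threshold rule is the one defined in [18]), and the toy condition is simply met at the true sparsity.

```python
def sequential_sparsity_search(y, networks, condition_met, K_max):
    """Try the networks for K = 1, 2, ... until the threshold condition holds.

    networks[k](y) and condition_met(output, k) are hypothetical callables.
    Returns (estimated sparsity, number of forward passes).
    """
    for k in range(1, K_max + 1):
        output = networks[k](y)
        if condition_met(output, k):
            return k, k
    return K_max, K_max

def auei_forward_passes():
    """AUEI always runs one enumeration network and one identification network."""
    return 2

# Toy stand-ins: network k just reports k, and the condition holds at the true K.
K_max = 8
toy_networks = {k: (lambda y, k=k: k) for k in range(1, K_max + 1)}
passes = {K: sequential_sparsity_search(None, toy_networks,
                                        lambda out, k, K=K: out == K, K_max)[1]
          for K in (1, 2, 4, 8)}
```

In this toy setting the sequential search needs 1, 2, 4 and 8 passes for $K = 1$, 2, 4 and 8, whereas the two-phase approach needs two passes regardless of $K$.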

V. CONCLUSION
In this paper, we have proposed an active users detection method, realized by one DNN for active users enumeration and a set of DNNs for active users identification. We designed the deep neural network architectures to extract relevant features from the multiple measurements for enumeration and identification. Besides the fully-connected layers, both DNNs include convolutional layers to reduce the computational complexity. To minimize the training time for the active users identification networks, we adopted the multi-stage transfer learning technique. The numerical results demonstrate that our approach is more effective than previously known methods in identifying the active users, especially for high sparsity levels and low SNR. We also analyzed the false alarm rates, which are negligible for the scenarios of interest, and the computational complexity, which is lower than that of other approaches. Future work will include an analysis of the scalability of the proposed algorithm for different numbers of users and further reduction of the computational cost.