Joint User and Data Detection in Grant-Free NOMA With Attention-Based BiLSTM Network

We consider the multi-user detection (MUD) problem in uplink grant-free non-orthogonal multiple access (NOMA), where the access point has to identify the total number and correct identity of the active Internet of Things (IoT) devices and decode their transmitted data. We assume that IoT devices use complex spreading sequences and transmit information in a random-access manner following the burst-sparsity model, where some IoT devices transmit their data in multiple adjacent time slots with a high probability, while others transmit only once during a frame. Exploiting the temporal correlation, we propose an attention-based bidirectional long short-term memory (BiLSTM) network to solve the MUD problem. The BiLSTM network creates a pattern of the device activation history using forward and reverse pass LSTMs, whereas the attention mechanism provides essential context to the device activation points. By doing so, a hierarchical pathway is followed for detecting active devices in a grant-free scenario. Then, by utilising the complex spreading sequences, blind data detection for the estimated active devices is performed. The proposed framework does not require prior knowledge of device sparsity levels and channels for performing MUD. The results show that the proposed network achieves better performance compared to existing benchmark schemes.


I. INTRODUCTION
Grant-free non-orthogonal multiple access (NOMA) is a promising solution to support machine-type communications in the 6G Internet of Things (IoT) [1]. In traditional grant-based orthogonal multiple access (OMA) schemes, the maximum number of devices being serviced is limited by the number of available orthogonal resources. Therefore, scheduling is required to allow the devices to share the orthogonal resources. In contrast, grant-free NOMA allows devices to transmit their data in an arrive-and-go manner by randomly choosing a resource block without going through the grant-access process [2], [3]. When multiple devices choose the same resource block, a collision occurs, which requires retransmission. These collisions are significantly reduced due to the different multiple access signatures in NOMA [4]. Therefore, from a practical perspective, grant-free NOMA is considered an attractive solution for sporadic IoT traffic use cases.
Saud Khan and Salman Durrani are with the School of Engineering, The Australian National University, Canberra, 2601, Australia (Email: {saud.khan,salman.durrani}@anu.edu.au). Saud Khan, Chandra Thapa and Seyit Camtepe are with Data61, Commonwealth Scientific and Industrial Research Organization (CSIRO), Sydney, 2122, Australia (Email: {chandra.thapa,seyit.camtepe}@data61.csiro.au). This research was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government. The work from Muhammad Basit Shahab and Sarah J. Johnson was supported by the Australian Government through the Australian Research Council Discovery Projects scheme (Projects DP180100606 and DP210102239).
The basic principle of grant-free NOMA is to allow the devices to randomly access the resource blocks through multiple access signatures, such as power levels, spreading sequences, scrambling, and interleaving [2, Table III]. Among these signatures, spreading sequences are considered superior because they can efficiently mitigate multi-user interference [5]. The spreading sequences allow device-specific, low cross-correlation codes to enable grant-free communication. However, in spreading-based signatures, longer sequences are needed as the number of devices increases. In this regard, complex spreading sequences, as proposed in multi-user shared access (MUSA) [6], support a significantly larger number of devices than pseudo-random sequences, i.e., a higher overloading factor without increasing the sequence length.
In spreading-based grant-free NOMA, each active device randomly and independently selects a spreading sequence from a predefined set [7]. Therefore, the key research challenge is to correctly detect the spreading sequences of the active devices, also known as the multi-user detection (MUD) problem [8]. In this regard, two sub-problems play a key role: identifying the total number of active devices, known as the active user detection (AUD) sub-problem, and correctly identifying which devices are active, known as the active user support set sub-problem. This research challenge is addressed in this work.

A. Related Work
The quality of the active user support set devised from AUD directly impacts the performance of MUD. In many practical IoT use cases, while the total number of devices is large, only a small percentage of the total devices may be active in a given time frame [9]-[11]. Using this inherent sparsity of IoT devices, the AUD problem can be readily formulated as a sparse recovery problem, which can be solved using compressed sensing (CS) [12], [13] or machine learning (ML) [2]. Considering the inherent sparsity and the sporadic device activity, it is then crucial to correctly model the activity pattern of devices over a time frame. The activity pattern of devices over a given time frame, whether independent or temporally correlated, greatly impacts the performance of MUD.
In the literature, as summarised in Table I, the frame-wise sparsity and burst-sparsity are two prominent models for device activity patterns adopted for both CS-based and ML-based MUD. In the frame-wise sparsity model, the activity and inactivity of devices remain constant over an entire data frame, i.e., the temporal correlation of device activity patterns is fixed. In the burst-sparsity model, devices generally transmit their data in multiple adjacent time slots with a high probability. In many IoT applications where devices are deployed to detect an event of interest [14], and device data is sporadic in time [15], [16], e.g., smart meters and environmental monitoring, IoT devices transmit their data in the form of data bursts due to the size of their payload. However, some IoT devices can still complete their transmission in a single time slot. Thus, the burst-sparsity model is regarded as a more practical model since it provides a balance between the frame-wise sparsity model (slightly impractical since the temporal correlation of the device activity patterns is fixed) and the completely random transmission model (also slightly impractical as IoT traffic is not entirely random and consists of data bursts and traffic patterns).
CS-based solutions with frame-wise sparsity model: Many works have considered CS-based solutions for the MUD problem in spreading-based grant-free NOMA with frame-wise sparsity [17]-[21]. In [17], the frame-wise joint sparsity model is exploited to achieve better performance of device detection using an iterative order recursive least square (IORLS) algorithm based on the orthogonal matching pursuit (OMP) algorithm. However, the authors considered prior knowledge of the device sparsity level at the AP, which is typically unknown in practical scenarios. In [18], the authors proposed alternating direction method of multipliers (ADMM)-based CS to show improvement in the device detection performance using a partial active device set as prior knowledge. However, obtaining the prior information on either the sparsity level, equivalent channel matrices, or both in practical systems is difficult. In [19], the device detection problem was modelled as a multiple measurement vector (MMV) problem, and a block sparsity adaptive subspace pursuit (BSASP) algorithm was used to solve it. However, pilot symbols are transmitted before every data packet, which leads to a significant system overhead. Similarly, the authors in [20], [21] developed greedy algorithms for joint device activity and data detection. However, these algorithms assume complete channel gain knowledge at the AP or pilot symbols for channel estimation.
CS-based solutions with burst-sparsity model: Some recent works have considered CS-based solutions for the MUD problem in spreading-based grant-free NOMA with burst sparsity [22]-[24]. In [22], a dynamic CS-based multi-device detection was proposed, which utilised the temporal correlation between device transmissions in the previous frame to achieve the performance gain. This algorithm was developed based on the assumption that the device sparsity level is known, which requires a training stage to learn such information accurately. Alternatively, the prior-information aided adaptive subspace pursuit (PIAASP) algorithm was proposed in [23], which utilised the prior support according to the additional quality information (the number of common support sets shared in time slots). However, the preceding work is heavily dependent on the inertia of device support; thus, it is unsuitable when the active device support varies rapidly in adjacent time slots, as is often the case in practice. Similarly, the authors in [24] proposed an algorithm to take advantage of the temporal correlation, where the frame is divided into subframes. Each subframe contains adjacent time slots and considers the active and inactive devices sharing common support in all the time slots. Also exploiting the temporal correlation, the authors in [25] used $\ell_{2,1}$ minimisation to jointly detect the user activity and data.
ML-based solutions: Recent works have adopted ML and demonstrated higher detection accuracy than conventional iterative algorithms [27]-[31], [33]. The authors in [27] and [28] considered pseudo-random noise-based and complex spreading sequences, respectively, and proposed deep neural networks (DNN) for active user detection (D-AUD) in a grant-free NOMA system by using the received signal as the input to the DNN. However, since the preceding works utilised a vanilla DNN for this purpose, the temporal activity of the devices cannot be taken advantage of, leaving room for improvement. To tackle this, the authors in [29] utilised a long short-term memory (LSTM) network to predict the activity of the devices based on their activation history. However, the dependence of LSTM on the previous activation history of devices makes the overall system prone to misclassification since the activation history is vaguely modelled. Adopting a different approach, the authors in [30] considered the use-case of generative networks to tackle the issue of detecting devices in different overloading factors with a single trained model. However, this work did not take the temporal correlation of device activity patterns into account. The authors in [31] provided a somewhat different approach by utilising power-domain NOMA instead of code-domain NOMA as the multiple access signature. However, the system faces extreme degradation due to this choice as the number of active devices in the cell increases. Furthermore, pilot symbols are included after every data symbol, drastically increasing the system's overhead. In a similar fashion, the authors in [33] utilised a bi-directional deep neural network for detection in a two-user power-domain NOMA scenario. However, this differs from the grant-free NOMA scenario considered in this work since it does not use spreading-based signatures and assumes the connection of devices using prior access procedures. Similarly, the authors in [32] assigned non-orthogonal pilots to devices for transmission, leading to a larger system overhead as the number of devices increases. Thus, a more resilient approach is required in the context of deep learning, which can exploit the temporal correlation of active devices in the adjacent time slots whilst providing accurate detection of devices.

B. Contributions
As evident from Table I, prior works on MUD in grant-free NOMA based on compressed sensing typically assumed a known user sparsity level or knowledge of the channel. Most works based on ML focused on the AUD problem with the frame-wise sparsity model, and the complete MUD problem was not considered. To the best of our knowledge, no prior work has fully addressed the MUD problem, along with the AUD and active device support set identification sub-problems, in grant-free NOMA under the burst-sparsity model.
In this paper, we consider a complex spreading sequence-based grant-free NOMA scenario, where multiple devices communicate with the AP simultaneously in the uplink following a burst-sparsity model. To address the AUD problem, we design an attention-based bidirectional LSTM (BiLSTM) network, which aims to create a mapping function between the superimposed received signal at the AP and the indices of active devices in the transmit signal. The proposed framework does not require active user sparsity or channel state knowledge to carry out AUD. Using the estimated active user support set, we then design a MUD framework to find the user sparsity and carry out blind data detection at the AP. The main contributions of this work are as follows:
• We design a BiLSTM network with an attention mechanism to carry out AUD. The BiLSTM network utilises two LSTM networks jointly in opposite temporal directions. The attention mechanism exploits the temporal correlation in the active user set and facilitates the BiLSTM network by providing context to the important activation history of the active devices. By training the network in the offline stage, the proposed network maps the superimposed received signal to the active user support set, detecting a larger number of active devices with higher accuracy.
• By detecting the active user support set using our proposed BiLSTM network, we then provide a framework to carry out blind data detection at the AP [34] without the need for explicit channel training. Using the estimated active user support set and complex spreading sequences, a blind minimum mean square error (MMSE) weight is obtained, from which the received signal is reconstructed without the explicit need for statistical channel information.
• Compared to the benchmark OMP scheme, our results show an improvement of around 30% when detecting the number of active devices and an improvement of around 29% when identifying the active device support set. Additionally, the proposed network achieves a gain of around 2.3 dB in bit-error-rate (BER) compared to the OMP scheme.
• Compared to the ML-aided LSTM-based CS scheme, our results show an improvement of around 10% when detecting the number of active devices and an improvement of around 6% when identifying the active device support set. Additionally, the proposed network achieves a gain of around 0.9 dB in BER compared to the LSTM-based CS scheme. The computational complexity of the proposed network increases only marginally as the number of active devices increases.

C. Paper Organisation and Notations
The rest of this paper is organised as follows. In Section II, we present the system model and MUD problem. Section III describes the proposed attention-based BiLSTM scheme and the neural network's architecture. Section IV discusses the network's training details and complexity analysis. In Section V, we present the simulation results to verify the performance gain of the proposed technique. Finally, Section VI concludes the paper.
We use the following notations in this paper. Lower and upper case boldface letters are used for vectors and matrices, respectively. The transpose of a vector $\mathbf{a}$ is $\mathbf{a}^T$. The exponential function is calculated as $e^{(\cdot)}$, where $e$ is the base of the natural logarithm. The norm is denoted by $\|\cdot\|$. $\mathbb{R}^{x \times y}$ and $\mathbb{C}^{x \times y}$ denote the real- and complex-valued spaces of size $x \times y$, respectively. $\mathrm{diag}(\cdot)$ denotes the diagonal operation, $\odot$ denotes the Hadamard product, and $\oplus$ denotes the symmetric difference. $\Re(\cdot)$ and $\Im(\cdot)$ denote the real and imaginary parts of a complex number, respectively. The gradient differential operator is denoted by $\nabla$. The symbol $\mathbf{z}$ and its accented variants (such as $\hat{\mathbf{z}}$ and $\check{\mathbf{z}}$) represent the output states of the proposed network at their respective intermediate stages. Table II summarises the important symbols used in this work, including the dimensions of vectors and matrices.

II. SYSTEM MODEL
We consider a spreading-based uplink grant-free NOMA system comprising an AP and $K$ IoT devices, as shown in Fig. 1. Without loss of generality, all devices and the AP are assumed to be equipped with a single antenna. We consider an overloaded system where the number of resource blocks $N$ is less than the number of IoT devices, i.e., $N < K$. During transmission, a subset of the $K$ devices sporadically and randomly become active when they have data to transmit. We adopt the burst-sparsity model in this work, i.e., some transmissions continue for several consecutive time slots while others last for one time slot only [22]-[24].

A. Signal Model
Considering an arbitrary symbol interval, an active device transmits its complex modulated symbol towards the AP; the symbols are independent random variables drawn from a standard symmetric discrete constellation set $\mathcal{M}$. For inactive devices, the transmit symbol is equal to zero. In this work, we consider that the device symbols are spread with a family of short complex-valued spreading sequences with low cross-correlation values [6]. These short complex-valued spreading sequences can be generated naturally based on the binary sequence elements. For instance, for $M = 3$, the real and imaginary parts of each element of the complex spreading sequence are taken from the set $\{-1, 0, 1\}$ [28].
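After modulation, the symbol $s_k$ from the $k$-th device is spread onto a spreading sequence $\mathbf{c}_k \in \mathbb{C}^{N \times 1}$, which is randomly and independently selected from a pre-defined set. The received signal $\mathbf{y}$ at the AP is the superposition of all signals, given as

$$\mathbf{y} = \sum_{k=1}^{K} \mathrm{diag}(\mathbf{c}_k)\,\mathbf{h}_k s_k + \mathbf{w} = \mathbf{C}\mathbf{v} + \mathbf{w}, \tag{1}$$

where $\mathbf{h}_k \in \mathbb{C}^{N \times 1}$ denotes the channel vector between the AP and the $k$-th device over $N$ sub-carriers, and $\mathbf{w} \sim \mathcal{CN}(\mathbf{0}, \sigma^2\mathbf{I})$ represents the complex Gaussian noise vector. Moreover, $\mathbf{C} = [\mathrm{diag}(\mathbf{c}_1), \mathrm{diag}(\mathbf{c}_2), \ldots, \mathrm{diag}(\mathbf{c}_K)] \in \mathbb{C}^{N \times NK}$ refers to the codebook matrix of all devices, and $\mathbf{v} = [(s_1\mathbf{h}_1)^T, (s_2\mathbf{h}_2)^T, \ldots, (s_K\mathbf{h}_K)^T]^T \in \mathbb{C}^{NK \times 1}$ is the synthesis of the transmit symbols and channel vectors.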
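The received-signal model above can be checked numerically. The following is a minimal NumPy sketch, not the authors' simulator: the sizes, the active indices, the Rayleigh-fading channels, the QPSK constellation and the noise level are all illustrative assumptions. It builds the superposition of spread symbols and confirms that it matches the matrix form with the codebook matrix C and the composite vector v.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 4, 6                      # resource blocks and devices (overloaded: N < K)
active = np.array([1, 4])        # hypothetical indices of the active devices

# MUSA-style sequences: real and imaginary parts drawn from {-1, 0, 1}.
levels = np.array([-1.0, 0.0, 1.0])
c = rng.choice(levels, size=(K, N)) + 1j * rng.choice(levels, size=(K, N))

# Rayleigh-fading channel vectors h_k (an illustrative assumption).
h = (rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))) / np.sqrt(2)

# QPSK symbols for active devices, zero for inactive ones (assumed constellation).
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
s = np.zeros(K, dtype=complex)
s[active] = rng.choice(qpsk, size=active.size)

sigma = 0.1
w = sigma * (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)

# Superposition form: y = sum_k diag(c_k) h_k s_k + w.
y_sum = sum(np.diag(c[k]) @ h[k] * s[k] for k in range(K)) + w

# Matrix form: y = C v + w with C = [diag(c_1), ..., diag(c_K)].
C = np.hstack([np.diag(c[k]) for k in range(K)])        # shape (N, N*K)
v = np.concatenate([s[k] * h[k] for k in range(K)])     # shape (N*K,)
y_mat = C @ v + w
```

Only the blocks of `v` belonging to active devices are nonzero, which is the sparsity that the detection problem later exploits.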

B. Consecutive-Time Slot Dynamic Model
Exploiting the sparsity in the data transmission (i.e., only a subset of devices wake up to transmit) and the temporal correlation of the device activity pattern (i.e., data transmission is bursty in general), we can formulate the vector $\mathbf{v}$ as a sparse vector and extend our system model in (1) to a continuous-time slot model.
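The idea is to utilise the bursty nature of $\tilde{\mathbf{v}} = [(\mathbf{v}^{[1]})^T, (\mathbf{v}^{[2]})^T, \ldots, (\mathbf{v}^{[J]})^T]^T \in \mathbb{C}^{NJK \times 1}$, where $\mathbf{v}^{[j]}$ is the signal at the $j$-th time slot, to retrieve it from the received signals $\tilde{\mathbf{y}} = [(\mathbf{y}^{[1]})^T, (\mathbf{y}^{[2]})^T, \ldots, (\mathbf{y}^{[J]})^T]^T \in \mathbb{C}^{NJ \times 1}$ in the $J$ successive time slots. This formulation helps in capturing the temporal correlation of the active devices by detecting the transmit signals $\mathbf{v}$ in the continuous-time slots. The stacked received signal vector $\tilde{\mathbf{y}}$ can be represented as

$$\tilde{\mathbf{y}} = \begin{bmatrix} \mathbf{C}^{[1]} & & \\ & \ddots & \\ & & \mathbf{C}^{[J]} \end{bmatrix} \begin{bmatrix} \mathbf{v}^{[1]} \\ \mathbf{v}^{[2]} \\ \vdots \\ \mathbf{v}^{[J]} \end{bmatrix} + \begin{bmatrix} \mathbf{w}^{[1]} \\ \mathbf{w}^{[2]} \\ \vdots \\ \mathbf{w}^{[J]} \end{bmatrix} = \tilde{\mathbf{C}}\tilde{\mathbf{v}} + \tilde{\mathbf{w}}, \tag{2}$$

where $\tilde{\mathbf{C}} \in \mathbb{C}^{NJ \times NJK}$ is the equivalent codebook matrix of all devices, which contains the complex spreading sequences of all $K$ devices, $\tilde{\mathbf{v}}$ is the composite of the transmitted symbol and channel vector, and $\tilde{\mathbf{w}}$ is the Gaussian noise vector.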
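To make the stacking over J time slots concrete, the sketch below assembles the stacked quantities with NumPy. It assumes that the equivalent codebook is block-diagonal across slots, which is consistent with the stated dimensions (NJ × 1 for the received vector and NJK × 1 for the transmit vector) but is an assumption of this illustration. The per-slot codebooks here are random placeholders, and one hypothetical active device is drawn per slot.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, J = 4, 6, 3     # resource blocks, devices, time slots (illustrative sizes)

def random_complex(*shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# Per-slot codebooks (N x NK), sparse transmit vectors (NK) and noise (N).
C_slots = [random_complex(N, N * K) for _ in range(J)]
v_slots = []
for j in range(J):
    v = np.zeros(N * K, dtype=complex)
    k = rng.integers(K)                       # one hypothetical active device per slot
    v[k * N:(k + 1) * N] = random_complex(N)
    v_slots.append(v)
w_slots = [0.05 * random_complex(N) for _ in range(J)]

# Block-diagonal equivalent codebook of size (NJ x NJK).
C_tilde = np.zeros((N * J, N * J * K), dtype=complex)
for j in range(J):
    C_tilde[j * N:(j + 1) * N, j * N * K:(j + 1) * N * K] = C_slots[j]

v_tilde = np.concatenate(v_slots)
w_tilde = np.concatenate(w_slots)
y_tilde = C_tilde @ v_tilde + w_tilde

# Stacking the per-slot received signals gives the same vector.
y_per_slot = np.concatenate([C_slots[j] @ v_slots[j] + w_slots[j] for j in range(J)])
```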
The AP receives a multi-device vector $\tilde{\mathbf{y}}$ with no knowledge of the active transmitting devices or the locations of the non-zero symbols. The active device support set $\Gamma^{[j]}$ varies over different time slots because devices transmit randomly in a grant-free fashion. With this in mind, let $\mathbf{u}^{[j]} = [s_1, s_2, \ldots, s_K]^T$ collect the transmit symbols of all $K$ devices in the $j$-th time slot. Then, the active device support set $\Gamma^{[j]}$ of the signal $\mathbf{x}^{[j]}$ in the $j$-th time slot is defined as [23]

$$\Gamma^{[j]} = \left\{ k : u_k^{[j]} \neq 0,\ 1 \le k \le K \right\}. \tag{3}$$

From this, the number of transmitting active devices is defined through the cardinality of the active device support set $\Gamma^{[j]}$, given as [23]

$$S = \left\| \Gamma^{[j]} \right\|_0. \tag{4}$$

Since IoT traffic is not entirely random and often consists of data bursts and traffic patterns, in this work, we consider the burst-sparsity model where only a subset of active devices in the previous time slot also transmit in the next time slot. That is, only a subset of indices in $\Gamma^{[j-1]}$ are present in $\Gamma^{[j]}$. Therefore, to quantify the commonality of active devices transmitting in consecutive time slots, we define $\eta$ as the level of temporal correlation between the previous time slot $\Gamma^{[j-1]}$ and the current time slot $\Gamma^{[j]}$. It is given as

$$\eta = \frac{\left\| \Gamma^{[j-1]} \cap \Gamma^{[j]} \right\|_0}{\left\| \Gamma^{[j]} \right\|_0}. \tag{5}$$

Note that in (5), $\eta$ characterises the overlapping level of the active devices transmitting in consecutive time slots. For instance, when $\eta = 0.5$, half of the devices transmit in two or more consecutive time slots, whereas the remaining devices transmit only once during the whole process. In Section V, we will show how the variation of temporal correlation $\eta$ affects the overall system performance.
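The support set and the temporal correlation level can be illustrated with a few lines of plain Python. This is a sketch under one assumption that should be read carefully: the overlap is normalised by the size of the current support set, which is one plausible reading of the overlapping level described above. The function names and the two example symbol vectors are hypothetical.

```python
def support_set(symbols):
    """Indices k whose transmit symbol is nonzero in one time slot."""
    return {k for k, s in enumerate(symbols) if s != 0}

def temporal_correlation(prev_support, curr_support):
    """Fraction of currently active devices that were also active in the previous slot.

    Normalising by the current support size is an assumption of this sketch.
    """
    if not curr_support:
        return 0.0
    return len(prev_support & curr_support) / len(curr_support)

# Transmit symbols of K = 8 devices in two consecutive slots (0 means inactive).
u_prev = [0, 1 + 1j, 0, -1 + 1j, 0, 1 - 1j, -1 - 1j, 0]
u_curr = [0, 1 - 1j, -1 + 1j, 0, 0, 1 + 1j, 0, 1 + 1j]

gamma_prev = support_set(u_prev)                       # {1, 3, 5, 6}
gamma_curr = support_set(u_curr)                       # {1, 2, 5, 7}
eta = temporal_correlation(gamma_prev, gamma_curr)     # 2 / 4 = 0.5
```

In this example, devices 1 and 5 continue their burst into the second slot, while devices 2 and 7 are newly active.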

C. Multi-User Detection Problem
When multiple active devices communicate with the AP simultaneously in a grant-free manner, the first task for the AP is to detect the active devices that contributed to the received signal.Therefore, the identification of active devices leads to the problem of finding the support of the transmitted signal.
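In this regard, the rows in (2) can be rearranged. We also introduce an active device criterion $\delta_k \in \{0, 1\}$, where $\delta_k = 1$ and $\delta_k = 0$ correspond to active and inactive devices, respectively [27]. Using this, the stacked received signal vector $\tilde{\mathbf{y}}$ can be written as

$$\tilde{\mathbf{y}} = \sum_{k=1}^{K} \boldsymbol{\xi}_k \mathbf{x}_k + \begin{bmatrix} \mathbf{w}^{[1]} \\ \vdots \\ \mathbf{w}^{[J]} \end{bmatrix} = \boldsymbol{\xi}\mathbf{x} + \tilde{\mathbf{w}}, \tag{6}$$

where $\boldsymbol{\xi} = [\boldsymbol{\xi}_1, \boldsymbol{\xi}_2, \ldots, \boldsymbol{\xi}_K] \in \mathbb{C}^{NJ \times NJK}$ and $\mathbf{x} = [\mathbf{x}_1^T, \mathbf{x}_2^T, \ldots, \mathbf{x}_K^T]^T \in \mathbb{C}^{NJK \times 1}$, such that for the $k$-th device, $\boldsymbol{\xi}_k = \mathrm{diag}\big([(\mathbf{c}_k^{[1]})^T, \ldots, (\mathbf{c}_k^{[J]})^T]^T\big) \in \mathbb{C}^{NJ \times NJ}$ and $\mathbf{x}_k = \delta_k [(s_k^{[1]}\mathbf{h}_k^{[1]})^T, \ldots, (s_k^{[J]}\mathbf{h}_k^{[J]})^T]^T \in \mathbb{C}^{NJ \times 1}$, respectively. From (6), it is inferred that out of $K$, only a subset of devices, say $S$, are active. This means that the sparse vector $\mathbf{x}$ has $S$ nonzero blocks corresponding to the $S$ active devices. Therefore, $\tilde{\mathbf{y}}$ in (6) can be represented as a linear combination of $S$ submatrices of $\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_K$ perturbed by the noise [27]. We assume the codebook entries of $\boldsymbol{\xi}$ are available at the AP [2]. However, the AP does not know which spreading sequence is chosen by the different active devices. Thus, the AP needs to identify the sub-matrices $\boldsymbol{\xi}_S$, which are analogous to $\boldsymbol{\xi}$, by processing $\tilde{\mathbf{y}}$.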
From this, the MUD problem becomes a 2-dimensional CS problem, which is common in the CS paradigm [35]. With this in mind, the MUD problem is readily articulated as an active device support estimation problem, given as

$$\Upsilon = \arg\min_{\|\Upsilon\|_0 = S} \frac{1}{2} \left\| \tilde{\mathbf{y}} - \boldsymbol{\xi}_{\Upsilon}\mathbf{x}_{\Upsilon} \right\|_2^2, \tag{7}$$

where $\boldsymbol{\xi}_{\Upsilon}$ and $\mathbf{x}_{\Upsilon}$ collect the sub-matrices of $\boldsymbol{\xi}$ and the blocks of $\mathbf{x}$ indexed by the support $\Upsilon$. This detection problem in (7) can be solved using classical CS approaches. The approaches based on exhaustive searches, such as $\ell_1$-minimisation [36], provide theoretical performance gains but suffer from heavy computational complexity. The approaches based on greedy algorithms [37] have comparably lower complexity but result in a sub-optimal solution and require a larger number of measurements for signal recovery. The biggest drawback of conventional CS-based schemes is that they assume perfect knowledge of the channel and active device sparsity levels. Furthermore, the enormous computational complexity and the latency of iterative algorithms make them a practical solution only for a small number of active devices. When there is a larger number of active devices, the performance of conventional CS-based schemes degrades due to their sole dependence on the residual vector in each iteration.² Due to this, as the number of active devices increases, conventional CS-based schemes are not suitable solutions to facilitate grant-free communication. This motivates us to pursue a machine learning-aided solution presented in the next section.
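As a point of reference for the greedy CS approaches discussed above, the following is a minimal sketch of plain (non-block) orthogonal matching pursuit for support recovery. It is not the benchmark implementation used in the paper; the function name `omp_support` and the toy dictionary are hypothetical. Note that the sparsity level is passed in as an argument, which is exactly the prior knowledge identified above as a drawback, and that every selection depends only on the current residual.

```python
import numpy as np

def omp_support(A, y, sparsity):
    """Greedy support recovery: pick `sparsity` columns of A that best explain y."""
    n_cols = A.shape[1]
    picked = np.zeros(n_cols, dtype=bool)
    residual = y.astype(complex)
    support = []
    for _ in range(sparsity):
        # Correlate every column with the current residual.
        corr = np.abs(A.conj().T @ residual)
        corr[picked] = -1.0                      # never pick the same column twice
        idx = int(np.argmax(corr))
        picked[idx] = True
        support.append(idx)
        # Least-squares fit on the chosen columns, then update the residual.
        coeffs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coeffs
    return sorted(support), residual

# Toy example with an orthonormal dictionary, where recovery is exact.
A = np.eye(8, dtype=complex)
x = np.zeros(8, dtype=complex)
x[[1, 5]] = [2.0 + 1.0j, -1.0 + 0.5j]
y = A @ x
support, residual = omp_support(A, y, sparsity=2)
```

With correlated columns, as with spreading sequences in an overloaded system, an early wrong pick propagates through the residual, which is the degradation the text describes.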

III. DEEP LEARNING AIDED MUD
To tackle the MUD problem in Section II-C, we propose a solution using deep learning. In essence, we aim to learn a nonlinear mapping between the stacked received signal $\tilde{\mathbf{y}}$ and the support of $\mathbf{x}$ using deep learning, and thereby perform MUD at the AP. The resulting active device support estimation problem $\Upsilon$ is then defined as

$$\Upsilon = g(\tilde{\mathbf{y}}; \Theta), \tag{8}$$

where $g(\cdot)$ denotes the nonlinear mapping and $\Theta$ represents the weights and corresponding biases of the learning architecture.

A. Learning Architecture
In this work, we adopt an attention-based BiLSTM network to solve the MUD problem, as illustrated in Fig. 2. The attention mechanism is discussed in Section III-B, while in this section, we discuss the BiLSTM network. The motivation for adopting the BiLSTM network is as follows.
Standard unidirectional LSTM networks process sequences in forward temporal order, ignoring future context. This is because a unidirectional LSTM only preserves the information of the previous time steps since it has exclusively seen inputs from the past. On the other hand, BiLSTM networks take unidirectional LSTM networks one step further by setting up a second LSTM layer, where the gradients in the hidden connections flow in the opposite temporal direction. That is, BiLSTM runs the inputs in two ways, one from past to future (left to right, i.e., forward) and another one from future to past (right to left, i.e., backward). This gives BiLSTMs the ability to exploit more information, thereby simultaneously obtaining contextual features from forward and reverse temporal directions. In essence, more features from both directions are captured for mapping active devices transmitting in consecutive time slots. The LSTM in the reverse direction is calculated in the same fashion as the forward direction. Noticeably, since the direction is reversed, the time information is passed from future to past.
² A nonzero submatrix of $\boldsymbol{\xi}$ with an index chosen at the $i$-th iteration is given as $k^{(i)} = \arg\max_{k} \|\boldsymbol{\xi}_k^H \mathbf{r}^{(i-1)}\|_2^2$, where $\mathbf{r}^{(i-1)} = \tilde{\mathbf{y}} - \boldsymbol{\xi}\hat{\mathbf{x}}^{(i-1)}$ is the residual vector, $(\cdot)^H$ denotes the conjugate transpose, and $\hat{\mathbf{x}}^{(i-1)}$, computed from $\tilde{\mathbf{y}}$, is an approximation of the transmitted signal $\mathbf{x}$ in the $(i-1)$-th iteration. It is understood that the performance of active user support identification is influenced primarily by $\boldsymbol{\xi}$, which is generated through the codebook $\mathbf{C}$, and the residual vector $\mathbf{r}^{(\cdot)}$.
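For input $\tilde{\mathbf{y}}_t$ at the current time step $t$, the BiLSTM network calculation is given by

$$\overrightarrow{\mathbf{h}}_f = \sigma\left(\mathbf{W}_f [\mathbf{h}_{t-1}, \tilde{\mathbf{y}}_t] + \mathbf{b}_f\right), \tag{9}$$

$$\overleftarrow{\mathbf{h}}_r = \sigma\left(\mathbf{W}_r [\mathbf{h}_{t+1}, \tilde{\mathbf{y}}_t] + \mathbf{b}_r\right), \tag{10}$$

where $\sigma$ represents the activation function, $t-1$ and $t+1$ represent the forward and reverse direction time steps, respectively, $\mathbf{h}_{t-1}$ and $\mathbf{h}_{t+1}$ represent the previous and next hidden states, respectively, $\mathbf{W}_f$ and $\mathbf{W}_r$ represent the forward and reverse direction input weights, respectively, and $\mathbf{b}_f$ and $\mathbf{b}_r$ represent the forward and reverse direction learnable bias parameters, respectively. $\overrightarrow{\mathbf{h}}_f$ and $\overleftarrow{\mathbf{h}}_r$ represent the forward and reverse direction LSTM network outputs, respectively. Finally, the output of the BiLSTM $\mathbf{z}_t$ is

$$\mathbf{z}_t = \mathbf{W}_z \mathbf{h}_t + \mathbf{b}_z, \tag{11}$$

where $\mathbf{W}_z$ represents the BiLSTM output weights, $\mathbf{b}_z$ represents the BiLSTM output learnable bias parameter, and $\mathbf{h}_t = [\overrightarrow{\mathbf{h}}_f, \overleftarrow{\mathbf{h}}_r]$ is the concatenated hidden state of the forward and reverse direction LSTMs.

Fig. 2 shows the proposed attention-based BiLSTM network applied to our MUD problem. For each training iteration, we use $U$ training data copies $\tilde{\mathbf{y}}^{(1)}, \ldots, \tilde{\mathbf{y}}^{(U)}$. Next, since the stacked received signal $\tilde{\mathbf{y}}^{(u)}$ is a complex-valued modulated vector, we split the real and imaginary parts and use $\bar{\mathbf{y}}_t^{(u)} = [\Re(\tilde{y}_1^{(u)}), \ldots, \Re(\tilde{y}_{NJ}^{(u)}), \Im(\tilde{y}_1^{(u)}), \ldots, \Im(\tilde{y}_{NJ}^{(u)})]^T$ as an input vector to the network. With this in mind, the unit output in (11) is substituted as

$$\mathbf{z}_t^{\mathrm{in},(u)} = \mathbf{W}_z^{\mathrm{in}} \bar{\mathbf{y}}_t^{(u)} + \mathbf{b}_z^{\mathrm{in}}, \tag{12}$$

where $\bar{\mathbf{y}}_t^{(u)} \in \mathbb{R}^{2NJ \times 1}$ is the input vector, $\mathbf{W}_z^{\mathrm{in}} \in \mathbb{R}^{\alpha \times 2NJ}$ is the input weight, and $\mathbf{b}_z^{\mathrm{in}} \in \mathbb{R}^{\alpha \times 1}$ is the input learnable bias term.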
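The bidirectional processing and the real/imaginary split described above can be sketched in NumPy as follows. This is a deliberately simplified illustration, not the proposed network: each direction uses a plain tanh recurrent cell instead of a gated LSTM cell, the weights are random rather than trained, and the ordering of real and imaginary parts in the input is an assumption. The names `W_f`, `W_r`, `b_f`, `b_r`, `W_z` and `b_z` mirror the symbols in the text, and all sizes are illustrative.

```python
import numpy as np

def split_complex(y):
    """Stack real and imaginary parts so a real-valued network can consume y."""
    return np.concatenate([y.real, y.imag], axis=-1)

def bidirectional_pass(inputs, W_f, b_f, W_r, b_r, W_z, b_z):
    """Simplified bidirectional recurrence (plain tanh cells, no LSTM gates)."""
    T, hidden = inputs.shape[0], b_f.shape[0]
    h_fwd = np.zeros((T, hidden))
    h_bwd = np.zeros((T, hidden))
    h = np.zeros(hidden)
    for t in range(T):                               # past -> future
        h = np.tanh(W_f @ np.concatenate([h, inputs[t]]) + b_f)
        h_fwd[t] = h
    h = np.zeros(hidden)
    for t in reversed(range(T)):                     # future -> past
        h = np.tanh(W_r @ np.concatenate([h, inputs[t]]) + b_r)
        h_bwd[t] = h
    h_cat = np.concatenate([h_fwd, h_bwd], axis=1)   # concatenated hidden state
    return h_cat @ W_z.T + b_z, h_fwd, h_bwd

rng = np.random.default_rng(2)
T, N, hidden, out = 5, 4, 8, 3
y = rng.standard_normal((T, N)) + 1j * rng.standard_normal((T, N))
x = split_complex(y)                                 # shape (T, 2N)
W_f = 0.3 * rng.standard_normal((hidden, hidden + 2 * N))
W_r = 0.3 * rng.standard_normal((hidden, hidden + 2 * N))
W_z = 0.3 * rng.standard_normal((out, 2 * hidden))
b_f, b_r, b_z = np.zeros(hidden), np.zeros(hidden), np.zeros(out)
z, h_fwd, h_bwd = bidirectional_pass(x, W_f, b_f, W_r, b_r, W_z, b_z)

# Changing only the last input leaves the first forward state unchanged,
# but alters the first backward state, which has already seen the "future".
x_mod = x.copy()
x_mod[-1] += 1.0
_, h_fwd_mod, h_bwd_mod = bidirectional_pass(x_mod, W_f, b_f, W_r, b_r, W_z, b_z)
```

The final comparison shows why the reverse pass matters for bursty traffic: the state at an early slot carries information about activity in later slots.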
In this work, we employ batch normalisation to help coordinate the update of multiple layers by standardising the inputs of each layer to have fixed means and variances. This is important because when active devices experience different wireless channels and transmit their data in a grant-free manner, the resulting stacked received signal $\tilde{\mathbf{y}}$ has substantial variations. These significant variations make it difficult for the network to learn the device activation pattern. By standardising the inputs of each layer, batch normalisation reduces the variations and helps to overcome this difficulty. Thus, the $U$ output vectors from (12) are put together in the mini-batch $\mathcal{B} = [\mathbf{z}_t^{\mathrm{in},(1)}, \ldots, \mathbf{z}_t^{\mathrm{in},(U)}]$. Once arranged in a mini-batch, these vectors are scaled and shifted using their respective hidden weights and batch normalised. The output for each element $z_{t,i}^{\mathrm{in},(u)}$ of the batch normalisation (BatchNorm) is given as

$$\tilde{z}_{t,i}^{(u)} = \beta\, \frac{z_{t,i}^{\mathrm{in},(u)} - \mu_{\mathcal{B},t,i}}{\sqrt{\sigma_{\mathcal{B},t,i}^2}} + \gamma, \tag{13}$$

where $\mu_{\mathcal{B},t,i} = \frac{1}{U}\sum_{u=1}^{U} z_{t,i}^{\mathrm{in},(u)}$ calculates the batch-wise mean, $\sigma_{\mathcal{B},t,i}^2 = \frac{1}{U}\sum_{u=1}^{U} \big(z_{t,i}^{\mathrm{in},(u)} - \mu_{\mathcal{B},t,i}\big)^2$ calculates the batch-wise variance, $\beta$ is used as a scaling parameter, $\gamma$ is used as the shifting parameter, and $\alpha$ represents the width of the hidden layers.
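The batch normalisation step can be sketched in a few lines of NumPy. The sketch follows the naming used in the text, where β scales and γ shifts (many libraries use the opposite naming). The small constant `eps` added to the variance is a common numerical safeguard and an assumption of this illustration, as are the batch size, hidden width and parameter values.

```python
import numpy as np

def batch_norm(z_batch, beta, gamma, eps=1e-5):
    """Normalise each element over the mini-batch, then scale by beta and shift by gamma."""
    mu = z_batch.mean(axis=0)                 # batch-wise mean per element
    var = z_batch.var(axis=0)                 # batch-wise variance per element
    return beta * (z_batch - mu) / np.sqrt(var + eps) + gamma

rng = np.random.default_rng(3)
U, alpha = 64, 8                              # mini-batch size and hidden-layer width
z_in = 5.0 * rng.standard_normal((U, alpha)) + 10.0   # inputs with large offset and spread
z_bn = batch_norm(z_in, beta=2.0, gamma=0.5)
```

After normalisation, every element has mean γ and standard deviation close to β across the mini-batch, regardless of how widely the raw inputs varied.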
The proposed scheme learns to create a mapping function between the stacked received signal $\tilde{\mathbf{y}}$ and the current active device support set $\Gamma$. However, the estimate of the current active device support set $\Upsilon$ is strongly affected by the activation patterns of the neurons, which in turn are dependent on perturbations and precision errors. This issue is further compounded as the spreading sequences in the sensing codebook matrix $\boldsymbol{\xi}$ are correlated. Accordingly, the estimate of the current active device support set $\Upsilon$ might not be accurate and will misclassify in the presence of random perturbations. In addition, when the device activity pattern is similar in consecutive time slots, the network is more prone to overfitting due to the unchanging device activation pattern. We use the ReLU activation function and a dropout layer to address these issues. By using the ReLU activation function $f(x) = \max(x, 0)$, the computed values at every iteration are restricted to a non-negative range, which is then used as

$$\hat{z}_{t,i}^{(u)} = \max\left(\tilde{z}_{t,i}^{(u)}, 0\right), \tag{14}$$

where $\tilde{z}_{t,i}^{(u)}$ is the BatchNorm output. In the dropout layer, the activated neurons in a hidden layer are randomly halted with a probability $\rho_{\mathrm{drop}}$, given as

$$\check{\mathbf{z}}_t^{(u)} = \mathbf{d}^{(u)} \odot \hat{\mathbf{z}}_t^{(u)}, \quad d_i^{(u)} \sim \mathrm{Bern}(\rho_{\mathrm{drop}}), \tag{15}$$

where $d_i^{(u)}$ is the $i$-th element of the dropout vector $\mathbf{d}^{(u)}$, and $\odot$ is the Hadamard product. $\mathrm{Bern}(\rho_{\mathrm{drop}})$ is the Bernoulli random variable which takes the value 0 with the dropout probability $\rho_{\mathrm{drop}}$ and 1 with the probability $1 - \rho_{\mathrm{drop}}$. The dropout mechanism deliberately makes the training process noisy by deactivating neurons randomly, forcing the remaining neurons to take more responsibility in creating a different path for the gradient flow. This random dilution of neurons provides rigorous circumstances where network layers co-adapt to rectify mistakes from prior layers, which helps create a more generalised network capable of estimating the current active device support set with more accuracy. Therefore, removing incoming and outgoing connections of the dropped neurons with a random probability systematically resolves the similarity of the device activation patterns among correlated support sets.
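The ReLU and dropout steps can be illustrated with a short NumPy sketch. The mask follows the convention stated in the text (0 with probability ρ_drop, 1 otherwise) and applies no rescaling to the surviving neurons; many deep learning frameworks instead use inverted dropout, which rescales survivors by 1/(1 − ρ_drop) during training. The input values and the helper names are hypothetical.

```python
import numpy as np

def relu(z):
    """Element-wise max(z, 0)."""
    return np.maximum(z, 0.0)

def dropout(z, rho_drop, rng):
    """Zero each element with probability rho_drop (no rescaling of the survivors)."""
    mask = rng.random(z.shape) >= rho_drop    # False (0) with prob. rho_drop, True (1) otherwise
    return mask * z, mask

rng = np.random.default_rng(4)
z_tilde = np.array([-1.5, 0.2, 3.0, -0.1, 0.7])     # stand-in for BatchNorm outputs
z_hat = relu(z_tilde)                                # negative entries clipped to zero
z_check, d = dropout(z_hat, rho_drop=0.5, rng=rng)   # random subset halted
z_keep, _ = dropout(z_hat, rho_drop=0.0, rng=rng)    # nothing dropped
z_none, _ = dropout(z_hat, rho_drop=1.0, rng=rng)    # everything dropped
```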
After the dropout layer, the output vector $\check{\mathbf{z}}$ makes its way through multiple hidden layers.³ Subsequently, every hidden layer comprises a BiLSTM layer, a BatchNorm layer to reduce the variation of $\mathbf{z}^{[l]}$, a ReLU activation function applied to the BatchNorm output to determine whether the information ($\hat{z}_\alpha$) generated by the hidden unit is activated or not, and finally a dropout layer to overcome overfitting of the network (see Fig. 2). The output of the $l$-th hidden layer's BiLSTM is given as

$$\mathbf{z}^{[l]} = \mathbf{W}^{[l]} \check{\mathbf{z}}^{[l-1]} + \mathbf{b}^{[l]}, \tag{16}$$

where $\mathbf{W}^{[l]} \in \mathbb{R}^{\alpha \times \alpha}$ and $\mathbf{b}^{[l]} \in \mathbb{R}^{\alpha \times 1}$ are the weight and bias in the $l$-th hidden layer, respectively.

B. Attention Mechanism
Fig. 3 shows the working of the attention mechanism and its integration with the BiLSTM network architecture.The motivation for adopting the attention mechanism is twofold: (i) a neural network that creates a mapping function for the active device detection problem in (7) by analyzing the whole input at every step ignores the temporal correlation of the device activity pattern and (ii) with the increasing number of active devices, it becomes difficult for a neural network to learn its activation pattern due to its inherent sequential path architecture, causing problems such as vanishing and exploding gradients [38].The attention mechanism allows the neural network to apply context to specific parts of the data at every time step.That is, instead of finding the active devices in all of the input vectors altogether, a neural network with an attention mechanism breaks down the data, applies a contextual vector to it, and then gives a score to the parts where active devices are present and transmitting consecutively.This mechanism brings additional temporal-based reasoning into the overall architecture for active device detection, helping the neural network load more active devices for detection.
The output of the attention mechanism is computed as a weighted summation of the output of the BiLSTM network $\check{\mathbf{z}}^{[L]}$, given as

$$\mathbf{z}_t^{\mathrm{att}} = \sum_{k=1}^{K} \zeta_k\, \check{\mathbf{z}}_{t-(k-1)}^{[L]}, \tag{17}$$

where $\zeta_k$ is the temporal attention value at time step $t-(k-1)$, computed as

$$\zeta_k = \frac{e^{s_k}}{\sum_{i=1}^{K} e^{s_i}}, \tag{18}$$

where the scores $\mathbf{s} = [s_1 \cdots s_K]^T$ indicate the repeated activation pattern of active devices in the time slots, which is obtained as

$$s_k = \mathbf{W}_{\mathrm{rel}}^T \tanh\left(\mathbf{W}_a \check{\mathbf{z}}_{t-(k-1)}^{[l]} + \mathbf{Z}_a \check{\mathbf{z}}_t^{[L]}\right), \tag{19}$$

where $\check{\mathbf{z}}^{[l]}$ is the output of the previous hidden layers, $\mathbf{W}_a$ and $\mathbf{Z}_a$ are the attention learnable parameters that learn to project each context element and hidden state into a latent space, and $\mathbf{W}_{\mathrm{rel}}$ denotes the relevance parameter [39]. Evidently from (18) and (19), at time step $t$, $\zeta$ depends on the input $\check{\mathbf{z}}_t^{[l]}$. Furthermore, $\zeta$ is also dependent on the hidden variables $\check{\mathbf{z}}_t^{[L]}$ in the previous and current time step $t$. The attention value $\zeta$ can also be regarded as activating the active device detection gate. That is, the amount of information flow into the BiLSTM network is controlled by setting the gates. With this in mind, the final prediction result is influenced by a larger activation value, which results in a larger flow of information. It should also be noted that the standard LSTM network cannot detect many active devices based on the previous activation pattern due to the large memory overhead incurred. The BiLSTM network with an attention-based mechanism can capture device activation patterns in consecutive time slots with long-range dependencies. The information not required can be suppressed to improve the accuracy and efficiency of active device detection.
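The attention computation can be sketched with NumPy as an additive scoring function followed by a softmax and a weighted sum. This is an illustration under assumptions rather than the authors' implementation: the exact score function, the choice of the latest hidden state as the query, and all sizes are assumed here, and the parameters are random rather than learned. The names `W_a`, `Z_a` and `w_rel` mirror the attention and relevance parameters named in the text.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - np.max(s))      # subtract the max for numerical stability
    return e / e.sum()

def temporal_attention(H, W_a, Z_a, w_rel):
    """Additive attention over the rows of H (one row per time step, latest last)."""
    query = H[-1]                                        # current hidden state
    scores = np.array([w_rel @ np.tanh(W_a @ h + Z_a @ query) for h in H])
    zeta = softmax(scores)                               # attention values, sum to 1
    context = zeta @ H                                   # weighted sum of hidden states
    return context, zeta

rng = np.random.default_rng(5)
T, width, latent = 6, 8, 4
H = rng.standard_normal((T, width))                      # stand-in for BiLSTM outputs
W_a = rng.standard_normal((latent, width))
Z_a = rng.standard_normal((latent, width))
w_rel = rng.standard_normal(latent)
context, zeta = temporal_attention(H, W_a, Z_a, w_rel)
```

Time steps with larger scores receive larger attention values and therefore contribute more to the context vector passed on to the output layer.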
After passing through the L BiLSTM layers and the attention-based mechanism, the FC layer at the output produces K values corresponding to the total number of devices. The output vector z_out is obtained using the corresponding weight W_out and bias b_out. The softmax layer then maps the K output values into K probabilities (p̂_1, ⋯, p̂_K), each representing the likelihood of being a true support element in the estimated active device support set Υ. The k-th probability p̂_k calculated through the softmax in (21) is p̂_k = exp(z_out,k) / Σ_i exp(z_out,i), where the sum runs over i = 1, ⋯, K. Finally, an estimate of the active device support set Υ is obtained in (22) by picking, from the K elements, those having a probability greater than the threshold τ, i.e., Υ = {k : p̂_k > τ}. Once Υ in (22) is obtained, the estimated sparsity level Ŝ is extracted as the cardinality of the estimated active device support set, i.e., Ŝ = ||Υ||_0 for the j-th time slot. We later show how the estimated active device support set and the estimated sparsity level are used to evaluate the MUD performance and device identification accuracy.
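The output stage described above (softmax followed by thresholding) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the toy FC outputs and the threshold value are assumptions, and device indices are 0-based here.

```python
import math

def softmax(z):
    """Map K real-valued FC outputs to K probabilities."""
    peak = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def estimate_support(z_out, tau):
    """Return the estimated support set and its cardinality."""
    probs = softmax(z_out)
    support = [k for k, p in enumerate(probs) if p > tau]
    return support, len(support)

# Toy example with K = 5 potential devices (values are illustrative).
z_out = [2.0, -1.0, 2.0, -1.0, -3.0]
support, s_hat = estimate_support(z_out, tau=0.2)
```

Because the sparsity estimate is simply the size of the thresholded set, no prior knowledge of the number of active devices is needed at this stage.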

IV. MODEL TRAINING, USER DETECTION, AND COMPLEXITY ANALYSIS
In this section, we discuss the model training, which in turn is used for signal reconstruction, and find the computational complexity of the proposed attention-based BiLSTM network.

A. Model Training
During the offline training phase, the network parameter set Θ* is computed by minimising the loss function J(Θ), i.e., Θ* = arg min_Θ J(Θ). During every training iteration, the network parameters are updated using the gradient descent method, provided that the loss function J(Θ) is differentiable. Specifically, using the Adam optimiser, the network parameters Θ_i are updated in the direction of the steepest descent in the i-th training iteration, where ψ is the learning rate determining the step size and ε is a smoothing term that prevents division by zero. Furthermore, m_i and v_i are estimates of the mean and uncentered variance of the gradients, respectively, defined as in [40], where δ_1 and δ_2 are the decay rates of the moving averages. The moving average parameters help control the step size of the optimiser in order to correctly identify the global optimum solution for the training set, and they prevent the network from looping in a local solution when the training data is not sparse [40].
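For reference, one Adam update can be sketched as below. The sketch follows the commonly used form of Adam with bias-corrected moment estimates; whether the update used in this work includes the bias correction is an assumption, and the default decay rates and the toy objective are illustrative only.

```python
import math

def adam_step(theta, grad, m, v, i, psi=0.001,
              delta1=0.9, delta2=0.999, eps=1e-8):
    """One Adam update of the parameter list theta at iteration i >= 1."""
    new_theta, new_m, new_v = [], [], []
    for t, g, m_prev, v_prev in zip(theta, grad, m, v):
        m_i = delta1 * m_prev + (1 - delta1) * g       # mean estimate
        v_i = delta2 * v_prev + (1 - delta2) * g * g   # uncentered variance
        m_hat = m_i / (1 - delta1 ** i)                # bias correction
        v_hat = v_i / (1 - delta2 ** i)
        new_theta.append(t - psi * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

# Toy objective J(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = [1.0], [0.0], [0.0]
first, _, _ = adam_step(theta, [2 * theta[0]], m, v, 1, psi=0.01)
for i in range(1, 2001):
    theta, m, v = adam_step(theta, [2 * theta[0]], m, v, i, psi=0.01)
```

In the first iteration the bias-corrected ratio has unit magnitude, so the parameter moves by approximately ψ regardless of the gradient scale, which is why ψ directly sets the step size.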
Recall that the final output of the attention-based BiLSTM network is the K-dimensional vector p̂, each element of which represents the probability of being a support element of the estimated active device support set Υ. In this regard, p̂ = [p̂_1 ⋯ p̂_K] needs to be compared against the true probability p in the loss function calculation. We employ the cross-entropy loss J(p, p̂, Θ) for network training, defined as in [38], where p_k is the ground truth (actual active device), p̂_k is the estimate (estimated active device) of the attention-based BiLSTM network, and λ is the L2 regularisation coefficient, which is used for weight decay and, in turn, improves the generalisation performance of the network.
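A minimal sketch of a cross-entropy loss with L2 weight decay is given below. The averaging over K, the factor of one half on the penalty and the small constant inside the logarithm are assumptions made for illustration and may differ from the exact expression used in this work; `cross_entropy_l2` is a hypothetical helper.

```python
import math

def cross_entropy_l2(p_true, p_est, weights, lam=1e-4, tiny=1e-12):
    """Cross-entropy between p_true and p_est plus an L2 penalty."""
    # Average cross-entropy over the K entries (assumed normalisation).
    ce = -sum(p * math.log(q + tiny)
              for p, q in zip(p_true, p_est)) / len(p_true)
    # L2 weight decay on the flattened network weights.
    penalty = 0.5 * lam * sum(w * w for w in weights)
    return ce + penalty

# K = 4 devices, two of which are truly active (toy values).
p_true = [0.5, 0.5, 0.0, 0.0]
good = cross_entropy_l2(p_true, [0.5, 0.5, 0.0, 0.0], weights=[0.1, -0.2])
bad = cross_entropy_l2(p_true, [0.1, 0.1, 0.4, 0.4], weights=[0.1, -0.2])
```

An estimate that places its probability mass on the truly active devices yields the smaller loss, while the penalty term grows with the weight magnitudes independently of the data.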

B. Blind Data Detection of Active Devices
In grant-free transmission, the codebook or the spreading sequence is unknown before the signal is detected. This can significantly increase the detection complexity. However, identifying the active devices from the received signal also allows their adopted spreading sequences to be recovered, since a local copy of the spreading sequences is available at the AP. Thus, the detection and decoding computational complexity can be reduced significantly while keeping the practical constraints of grant-free NOMA systems intact.
First, AUD is carried out as in (22), where the estimated active device support set Υ and the estimated sparsity level Ŝ are obtained from the attention-based BiLSTM network using the stacked received signal ỹ at the AP. Next, using this estimated active device support set Υ, the stacked received signal ỹ is transformed into a sparse signal ý, which contains the received data for the estimated active devices. Given the estimated active devices and their received bits, the spreading sequences employed by these active devices are obtained by selecting the Ŝ spreading sequences having the highest correlation probability with the spreading sequences at the AP [43].
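The sequence selection step above can be sketched as a simple correlation ranking. This is an illustrative assumption about how the correlation is computed (magnitude of the inner product between the sparse received signal and each local sequence), with channel gains and noise ignored; the codebook and the helper names are hypothetical.

```python
def correlation(a, b):
    """Magnitude of the inner product of two complex sequences."""
    return abs(sum(x * y.conjugate() for x, y in zip(a, b)))

def select_sequences(received, codebook, s_hat):
    """Indices of the s_hat local sequences most correlated with the signal."""
    ranked = sorted(range(len(codebook)),
                    key=lambda k: correlation(received, codebook[k]),
                    reverse=True)
    return sorted(ranked[:s_hat])

# Toy codebook of three length-4 complex spreading sequences.
codebook = [
    [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j],
    [1 + 1j, -1 - 1j, 1 + 1j, -1 - 1j],
    [1 - 1j, 1 - 1j, 1 - 1j, 1 - 1j],
]
# Noise-free superposition of the sequences of devices 0 and 2.
received = [a + b for a, b in zip(codebook[0], codebook[2])]
chosen = select_sequences(received, codebook, s_hat=2)
```

The ranking only needs the local copy of the codebook at the AP, which is why no additional signalling from the devices is required.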
Once the spreading sequences employed by the active devices are determined, blind detection can be carried out. In blind detection, the active device channels are unknown, while the spreading sequences are known. Therefore, based on the sparse signal ý, which includes the statistical information of the channels of all active devices, the blind MMSE weight w can be obtained without knowledge of the device channels. The MMSE weight is calculated as in [34], where ĝ is the estimated channel between the AP and the devices. After rearranging (6), the transmitted bits of the reconstructed sparse signal χ for the active devices can then be estimated. By doing so, the active devices' bits are estimated, and the sparse signal ý is reconstructed without the need for explicit channel estimation. The entire process is summarised in Algorithm 1.

Algorithm 1: The Proposed Attention-based BiLSTM Network.
Input: Received signal ŷ
Output: Estimated active user support set Υ, estimated sparsity level Ŝ, bits of reconstructed sparse signal χ
Active device support and sparsity estimation
1: for j = 1 to J do
…
11: end for
Blind data detection of active devices
12: …
13: ξ mapped to the nearest symbol
14: end for
Return Υ, Ŝ, χ
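The final step of Algorithm 1, in which ξ is mapped to the nearest symbol, can be illustrated for a QPSK constellation as follows. The Gray bit labelling, the unit-energy normalisation and the function name `nearest_symbol` are assumptions made for this sketch.

```python
import math

A = 1 / math.sqrt(2)  # unit-energy QPSK amplitude per dimension

# Assumed Gray-labelled QPSK constellation: bit pair -> complex symbol.
QPSK = {
    (0, 0): complex(A, A),
    (0, 1): complex(-A, A),
    (1, 1): complex(-A, -A),
    (1, 0): complex(A, -A),
}

def nearest_symbol(xi):
    """Return the bit pair and symbol closest to the soft estimate xi."""
    bits = min(QPSK, key=lambda b: abs(xi - QPSK[b]))
    return bits, QPSK[bits]

bits_a, _ = nearest_symbol(0.9 + 0.2j)
bits_b, _ = nearest_symbol(-0.3 - 1.1j)
bits_c, _ = nearest_symbol(0.4 - 0.6j)
```

For QPSK this minimum-distance rule reduces to checking the signs of the real and imaginary parts of ξ.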

C. Computational Complexity
In this subsection, we evaluate the computational complexity of the proposed attention-based BiLSTM network. We measure the complexity in terms of the number of floating-point operations (flops) [27], taking into account the complexity of the hidden and deep learning layers of the proposed BiLSTM network at the j-th time slot.
TABLE III: Computational complexity comparison for different sparsity levels (the total number of potential devices K = 200, the number of subcarriers N = 100, the number of hidden layers L = 3, width of hidden layer α = 1000).

In the first layer of the attention-based BiLSTM network, the input vector is ŷ ∈ R^(2N×1), whereas the weight and bias have the dimensions W_in ∈ R^(α×2N) and b_in ∈ R^(α×1), respectively. Furthermore, the BiLSTM has four gates, each of which performs a forward pass and a backward pass, and these determine the generic flop count per BiLSTM block. Next, in the BatchNorm, the element-wise scalar multiplication and addition are carried out twice for normalisation, which gives the BatchNorm complexity C_bn. Subsequently, in the BiLSTMs of the succeeding hidden layers, the hidden weight W ∈ R^(α×α) is multiplied with the input vector, and the bias term b ∈ R^(α×1) is then added to it. After passing through the subsequent BatchNorm, each element is passed through the ReLU activation function. For generalisation, the dropout vector d is then multiplied by the ReLU output ẑ. Consequently, the complexity of the L hidden layers, C_hide, is given in (30). Following this, the Attention layer performs weighted matrix multiplications with the input and previously sampled data, adds a bias term to the latent data, and multiplies by the learnable parameter matrix to compute the scores, which gives the Attention layer complexity C_atten. Next, the FC layer at the output multiplies its weight W_out ∈ R^(K×α) with the output of the L hidden layers and the Attention mechanism and adds the bias term b ∈ R^(K×1), which gives the complexity C_out. The softmax layer computes the K probabilities of the potential devices, as in (21), with complexity C_sm. From (28) to (33), the final complexity of the proposed attention-based BiLSTM network is obtained by adding these terms. For an unbiased analysis, we compare the proposed attention-based BiLSTM network with two deep learning-based techniques, D-AUD [27] and LSTM-CS [44], and a conventional technique, least squares orthogonal matching pursuit (LS-OMP) [45], for complexity comparison in Table III.
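As a rough illustration of how such flop counts arise, the sketch below counts the operations of a single dense step (matrix-vector product plus bias) for the layer dimensions quoted above. It is not the expression used for the full BiLSTM block, whose gate-level terms are not reproduced here; `dense_flops` is a hypothetical helper.

```python
def dense_flops(out_dim, in_dim):
    """Flops of y = W x + b with W of size out_dim x in_dim."""
    multiplies = out_dim * in_dim          # one product per weight entry
    additions = out_dim * (in_dim - 1)     # summing each row's products
    bias = out_dim                         # one addition per output
    return multiplies + additions + bias

N, K, alpha = 100, 200, 1000
input_layer = dense_flops(alpha, 2 * N)    # W_in is alpha x 2N
hidden_layer = dense_flops(alpha, alpha)   # W is alpha x alpha
output_layer = dense_flops(K, alpha)       # W_out is K x alpha
```

None of these counts involves the number of active devices S, which is consistent with the observation that the network complexity is governed by L and α rather than by the sparsity level.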
In addition, for a fair comparison, the MMSE estimation term has been added to the D-AUD and LSTM-CS techniques for signal detection purposes, such that C_MMSE = 2N + S((14/3)N^3 + N^2 − N). We examine the computational complexity in flops for different sparsity levels. We observe that the complexity of the proposed attention-based BiLSTM network is slightly higher than that of D-AUD and LSTM-CS but much lower than that of conventional approaches. The D-AUD technique utilises vanilla FC layers for its network, which do not exploit the temporal correlation of the data. For this reason, the performance of such networks might degrade with a higher number of active devices. The LSTM-CS uses a unidirectional LSTM and therefore has lower computational complexity than the proposed attention-based BiLSTM network. However, as shown in Section V, this results in performance degradation. It is important to note that the complexity of ML-based techniques depends heavily on the network parameters (L and α), but not on the system parameters (the number of active devices S and the total number of devices K). Thus, when S increases from 10 to 20, the computational complexity of the ML networks increases marginally, but that of LS-OMP increases sharply. Therefore, in a practical grant-free NOMA setting with a higher number of active devices, the ML schemes are more competitive in computational complexity than conventional schemes.

D. Convergence
We examine the validation loss J_v(Θ) for different numbers of hidden layers L for the proposed network, as shown in Fig. 4. We can see that a lower L results in a network that is unstable during training, whereas a higher L results in a more stable network but with a slower convergence rate. Thus, we adopt L = 3 for training dataset generation and also for the simulations in this work. Note that the sudden increase in validation loss for the L = 1 curve is due to the model overfitting to the training data, causing it to fit noise and outliers and perform poorly on the validation set.

E. Training Dataset Generation
In order to determine the optimal network mapping function g* for the stacked received signal ỹ and the support of x, a comprehensive training dataset is required. A good option in this regard is acquiring a dataset produced using real received signals; however, there is no open-source dataset for the grant-free NOMA scenario at this stage. The temporal transmission nature of devices is captured based on (5), and η is set to 0.5. The length of the time frame is J = 7. The number of hidden BiLSTM layers is set as L = 3, each with a width of α = 1000 and each followed by a ReLU activation function. The attention mechanism is placed before the final hidden layer. The output layer is preceded by an FC layer whose width corresponds to the number of classes. The dropout probability for the dropout layer is set to ρ_drop = 0.3. The batch size is set as 20, and Adam is used as the optimiser. The learning rate ψ is set to 0.001.

V. RESULTS AND DISCUSSION
In this section, we evaluate the performance of the proposed attention-based BiLSTM network in solving the MUD problem. We also plot the performance of four benchmark solutions: two traditional CS solutions, LS-OMP [45] and the dynamic CS-based MUD method [22], one ML-based LSTM-CS MUD method [44], and the Oracle least squares (LS) algorithm.
The motivation for considering these four benchmarks is as follows. We consider LS-OMP as it is the standard CS technique that is always considered as one of the benchmarks in this research field. The dynamic CS-based and ML-based LSTM-CS methods are considered because they take temporal correlation into account during MUD. Additionally, the ML-based LSTM-CS method demonstrates the advantage gained from using the proposed BiLSTM over a vanilla LSTM. The Oracle LS algorithm is considered as it provides the theoretical performance bound, although it is impractical in real-world situations where perfect knowledge is unavailable. For a fair comparison, we make the following assumptions in the implementations of the four benchmark schemes:
• For the two traditional benchmark solutions, the sparsity level is assumed to be known at the AP due to the assumption of the channels being perfectly known; only the sparse support location is unknown at the AP.
• For the ML-based LSTM-CS MUD method, the core working of the method is adopted from [44], but the LSTM layer is adapted to our architecture (as in Fig. ( )) for a fair comparison.
• For the ML-based LSTM-CS MUD method, we assume that it does not need any channel state information, i.e., it is unaware of the sparsity level, sparse support location and the channels.
• For the Oracle LS algorithm, we assume perfect knowledge of the channel state information, user sparsity level and sparse support location.
In the simulations, unless otherwise stated, K = 200 potential devices simultaneously share N = 100 orthogonal resource blocks. Thus, the overloading factor is 200%. The number of active devices is in the range S = 10 to 40. We employ M = 4-ary complex spreading sequences, where both the real and imaginary parts take values from the set {−2, −1, 0, 1}. In every time slot, there are S active devices, where the active device support set Γ[j] in each time slot has S/2 devices that also transmit in the next time slot, i.e., η = 0.5, while the remaining devices are randomly selected from {1, 2, ⋯, K}. The number of time slots is fixed at J = 7 to conform to the LTE-Advanced protocol [46]. The transmitted signals are modulated using Quadrature Phase Shift Keying (QPSK). Furthermore, all channels are assumed to follow independent Rayleigh fading, and the channel fading coefficient is generated following g_n,k ∼ CN(0, 1).
The path loss between the AP and the k-th device is modelled as 128.1 + 37.6 log10(d_k), where d_k is the distance (in km) [46]. The results are averaged over 1000 Monte Carlo trials.
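The device activity model and the path loss expression in this setup can be sketched as follows. The sketch assumes that the devices carried over to the next slot are chosen uniformly from the current support and that the remaining devices are drawn from the devices not carried over, so that each slot has exactly S distinct active devices; indices are 0-based and the function names are hypothetical.

```python
import math
import random

def generate_supports(K, S, J, eta, rng):
    """Active device support sets for J slots with temporal correlation eta."""
    supports = [sorted(rng.sample(range(K), S))]
    n_keep = round(eta * S)  # devices that keep transmitting
    for _ in range(1, J):
        kept = rng.sample(supports[-1], n_keep)
        kept_set = set(kept)
        pool = [k for k in range(K) if k not in kept_set]
        fresh = rng.sample(pool, S - n_keep)  # newly activated devices
        supports.append(sorted(kept + fresh))
    return supports

def path_loss_db(d_km):
    """Path loss in dB for a device at distance d_km kilometres."""
    return 128.1 + 37.6 * math.log10(d_km)

rng = random.Random(0)  # fixed seed for reproducibility
supports = generate_supports(K=200, S=20, J=7, eta=0.5, rng=rng)
```

Setting `eta=1.0` reproduces the frame-wise joint sparsity case, in which the same devices stay active over the entire frame.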

A. Performance Metrics
In order to appropriately evaluate the performance, including the quality of support estimation, device identification, and multi-device data detection, we use the following performance metrics: the detection probability (ρ_d), the accuracy, and the average bit-error rate (BER). Given the bits of the reconstructed sparse signal χ[j] k for device k at the j-th time slot, the performance metrics are defined as follows.
• Detection probability: This metric evaluates the performance of support estimation. It is defined as the ratio of the number of detected active devices to the number of all active devices.
• Accuracy: This metric evaluates the quality of support estimation for device identification. It is defined as the ratio of the number of correctly identified active devices to the number of all active devices, expressed as a percentage.
• Average BER: This metric evaluates the performance of multi-device data detection. It is defined as the ratio of incorrectly recovered bits transmitted by the active devices to all bits transmitted by the active devices. It should be noted that the average BER includes a penalty for decoding the wrongly detected active devices.

Fig. 5 plots the detection performance, ρ_d, versus the SNR (dB) for S = 10 and S = 20, with K = 200 and N = 100. The following trends can be observed from the figure. The Oracle LS gives the theoretical best performance (100% detection probability for the considered scenario), which is the same for all SNR values. As the SNR increases, the performance of all the schemes gradually approaches that of the Oracle LS. The LS-OMP performs the worst since it ignores the temporal correlation in the device activation history. The dynamic CS-based MUD method performs better than LS-OMP since it considers the temporal correlation in the device activation history. The ML-based LSTM-CS method performs better than the two traditional algorithms but cannot match the proposed BiLSTM network due to its unidirectional architecture. The proposed attention-based BiLSTM network outperforms all these benchmark algorithms, i.e., it exhibits a higher probability of successfully identifying the correct number of active devices than all other schemes. For instance, the proposed attention-based BiLSTM network achieves the Oracle LS detection performance at SNR = 8 dB and SNR = 12 dB for S = 10 and S = 20 active devices, respectively. It should be noted that the proposed attention-based BiLSTM network is unaware of the device sparsity level and detects the active devices based on the received signal only, in contrast to the traditional algorithms, which assume known channels and a known device sparsity level. As the number of active devices S increases from 10 to 20, the detection performance of the proposed attention-based BiLSTM network decreases gradually. The decrease in performance is attributed to the introduction of additional interference, variability, and overlapping patterns. These complexities make it harder for the model to effectively capture and learn the underlying patterns and relationships.

C. Device Identification
Device identification can help the AP prioritise service provision considering the available resources and provide access to devices based on their priority. Table V shows the accuracy of correctly identified active devices at K = 200, N = 100, and SNR = 6 dB. It should be noted again that the traditional schemes in this regard assume complete knowledge of the device sparsity level and that their accuracy is based on identifying the actual active device support set only. In contrast, the proposed attention-based BiLSTM network follows a practical approach where the active device sparsity level is first estimated. Then, the actual active device support set is identified based on the estimated sparsity level.
We can see from the table that the trends between the various benchmark schemes are the same as in Fig. 5. The proposed attention-based BiLSTM network outperforms the benchmark schemes by correctly identifying the actual active device support set with higher accuracy. The ML-based LSTM-CS method cannot correctly identify all the active devices because it relies on a forward-direction architecture only. In contrast, due to its forward and reverse direction architecture, the proposed BiLSTM network can identify more active devices correctly. It can be seen that with the increasing number of active devices, the accuracy of correctly identifying the actual active device support set decreases, which is to be expected when grant-free NOMA systems operate in overloaded conditions.

Fig. 6 plots the average BER of the considered algorithms against the SNR (dB), with K = 200, N = 100, and S = 20. The proposed attention-based BiLSTM network outperforms the benchmark schemes, including the ML-based LSTM-CS method, over the whole considered range of SNR. For SNR > 4 dB, the gap between the proposed attention-based BiLSTM network and the Oracle LS algorithm is only about 3 dB. This performance gap arises because the Oracle LS algorithm assumes full knowledge of the active device sparsity level and active device support set. The inaccurate active device estimation causes the performance gap as a side effect of the grant-free NOMA system.

Fig. 7 plots the average BER against the active device sparsity S, with K = 200, N = 100, and SNR = 6 dB. Unlike the computational complexity of the proposed network in Section IV-C, the BER performance is impacted by the number of active users. For all methods, the BER degrades as the number of active devices increases. Even so, the proposed attention-based BiLSTM network exhibits consistently lower BER than the benchmark schemes throughout the whole considered range of S. The ML-based LSTM-CS method performs better than the traditional methods initially but saturates with a high number of active devices since it cannot capture their temporal activation pattern due to its unidirectional architecture. The consistent performance gains of the proposed attention-based BiLSTM network show that the network has precisely mapped the underlying relationship between device activity and received signals, even though the network is trained for S = 20 active devices.

Fig. 8 plots the average BER against the SNR (dB) for varying overloading factors, with N = 100 and S = 20. It is evident that the average BER for all benchmark techniques increases with a higher overloading factor as the number of potential devices K increases, making the system prone to correlation errors. Even so, the average BER of the proposed attention-based BiLSTM network is lower than that of the conventional techniques, indicating that the proposed attention-based BiLSTM network can accommodate more devices with the same training configuration. This is because the proposed attention-based BiLSTM network has higher tolerance and robustness against increased overloading factors due to decoupled correlated activation patterns.

Fig. 9 plots the average BER against the temporal correlation parameter, η, with K = 200, N = 100, S = 20 and SNR = 6 dB. Note that the result for η = 1 corresponds to the special case of frame-wise joint sparsity, i.e., devices' activity remains constant over an entire data frame. We can see that the proposed network performs well for all values of η. Herein, the LS-OMP algorithm performs poorly because it does not utilise the extra information on temporal activity present in the previous time slots. Similarly, the BER of the dynamic CS-based method is relatively high due to its dependence on the devices' activity in the (j − 1)-th time slot only. The ML-based LSTM-CS method performs better than the dynamic CS-based method because it takes into account the temporal activity of devices in all time slots. However, because the ML-based LSTM-CS utilises a forward-direction LSTM only, it does not completely capture the activation pattern of active devices. In contrast, it can be seen that increasing the temporal correlation parameter η enhances the BER performance of the proposed attention-based BiLSTM network. This is because the proposed attention-based BiLSTM network has bidirectional LSTM units, which successfully capture the underlying mapping of the stacked received signal ỹ with the temporal correlation of device activity between different time slots using the estimated support of x. This further testifies to the generalisability of the proposed attention-based BiLSTM network to different transmission patterns. The Oracle LS algorithm outperforms the proposed algorithm and remains consistent since it assumes a complete active device support set.

E. Discussion on Robustness, Scalability and Generalisation
The results in Figs. 5-9 show that the proposed attention-based BiLSTM network, which is trained on S = 20, N = 100, K = 200 and η = 0.5, is robust to changes in the key system parameters. We can see that the trained BiLSTM network still performs well when there is a change in the number of active devices (Figs. 5 and 7), the number of potential devices or, equivalently, the overloading factor (Fig. 8), or the temporal correlation model (Fig. 9), and does not need to be retrained for the considered practical range of values (10 ≤ S ≤ 40, 0.5 ≤ η ≤ 1 and 150 ≤ K ≤ 300). This is because training the network at η = 0.5 allows it to learn the important features of the device activation patterns, so it still performs well when the parameters change. This shows that the proposed network is generalisable to different system parameters.
In addition, the proposed network is a good solution for grant-free NOMA systems to provide faster access for massive IoT devices. As demonstrated in Table III, the proposed network's computational complexity is comparable to the state-of-the-art ML-based solution and does not heavily depend on the system parameters. Thus, when the number of active devices increases, or the number of potential devices in the system becomes large, the computational complexity increases only marginally. Thus, the proposed scheme is scalable and is suitable for faster access in massive IoT device scenarios.

VI. CONCLUSIONS
In this paper, we proposed an attention-based BiLSTM network for AUD in an uplink grant-free NOMA system by exploiting the temporal correlation of active user support sets. First, a BiLSTM network is used to create a pattern of the device activation history in its hidden layers, whereas the attention mechanism provides essential context to the device activation history pattern. Then, the complex spreading sequences are utilised for blind data detection without explicit channel estimation from the estimated active user support set. Thus, the proposed mechanism is efficient and does not depend on impractical assumptions, such as prior knowledge of active user sparsity or channel conditions. Through simulations, we demonstrated that the proposed mechanism outperforms several existing benchmark MUD algorithms while maintaining lower computational complexity than conventional schemes. In this work, we have applied the proposed framework to a spreading-based grant-free NOMA scheme. Future work can investigate the generalisation of the proposed framework to other signature-based grant-free NOMA schemes.

Fig. 2: Detailed architecture and working of the proposed attention-based BiLSTM network.

Fig. 4: Validation loss J_v(Θ) for different numbers of hidden layers L, with total number of devices K = 200, number of subcarriers N = 100, and number of active devices S = 20.

Fig. 5: Probability of detection, ρ_d, versus SNR (dB) for the number of active devices (a) S = 10 and (b) S = 20, with the total number of potential devices K = 200, and the number of subcarriers N = 100.


Fig. 6 plots the average BER of the considered algorithms against the SNR (dB), with K = 200, N = 100, and S = 20. In all scenarios, our proposed attention-based BiLSTM network outperforms the benchmark schemes over the whole considered range of SNR, including the ML-based LSTM-CS method. For SNR > 4 dB, the gap between the proposed attention-based BiLSTM network and the Oracle LS algorithm is about 3 dB only. This performance gap with the Oracle LS algorithm is because it fully assumes the active device

Fig. 6: Average BER versus the SNR (dB), with the total number of potential devices K = 200, the number of subcarriers N = 100, and the number of active devices S = 20.

Fig. 9 plots the average BER against the temporal correlation parameter, η, with K = 200, N = 100, S = 20 and SNR = 6 dB. Note that the result for η = 1 corresponds to the special case of frame-wise joint sparsity, i.e., devices' activity remains constant over an entire data frame. We can see that the proposed network performs well for all values of

Fig. 7: Average BER versus the number of active devices S, with total number of potential devices K = 200, the number of subcarriers N = 100, and SNR = 6 dB.

Fig. 8: Average BER versus SNR (dB) for varying overloading factor, with number of subcarriers N = 100, and number of active devices S = 20.

Fig. 9: Average BER versus the temporal correlation parameter η, with total number of devices K = 200, number of subcarriers N = 100, number of active devices S = 20, and SNR = 6 dB.

TABLE I: Comparison of this work with recent related works in grant-free NOMA.

TABLE II: Important symbols used in this work.

TABLE IV: Parameter values used in generating the training dataset.

TABLE V: Device identification accuracy versus the number of active devices S, with the total number of potential devices K = 200, the number of subcarriers N = 100, and SNR = 6 dB.