Data-Driven Low-Complexity Detection in Grant-Free NOMA for IoT

This article proposes a low-complexity data-driven multiuser detector for grant-free nonorthogonal multiple access (GF-NOMA), which has gained significant interest in Internet of Things (IoT). IoT traffic is predominantly sporadic, where devices become active whenever they have data to transmit. The conventional grant-access procedure for requesting a transmission slot every time results in significant signaling overhead and latency. In power domain GF-NOMA, multiple devices can be preallocated the same channel resource, but different power levels. Whenever a device has data, it starts transmission directly using the allocated power level without any grant request. While this significantly reduces the signaling overhead, the access point has to perform the complex task of identifying the active devices and decoding their data. Conventional receivers for power domain NOMA fail in such GF scenarios and the typical solution is to limit transmissions to be packet-synchronized and add carefully chosen pilots in every packet to facilitate activity detection. However, in fairly static IoT networks with low-complexity devices and small packet sizes, this represents a significant overhead and reduces efficiency. In this work we solve the GF-NOMA detection problem without these constraints, by analyzing the boundaries of the received constellation points in power domain GF-NOMA for all activation combinations at once. A low-complexity decision tree-based receiver is proposed, which performs as well as the maximum likelihood-based benchmark receiver, and better than traditional data-driven detectors for GF-NOMA. Comprehensive simulation results demonstrate the performance of the proposed detector in terms of its detection efficiency and parameter learning with minimal training data.

type communications (MTCs), which provides the necessary framework for devices to connect with each other and the access point (AP) with little or no human intervention. Unlike conventional human-type communications, MTC traffic is characterized by high device density, predominantly uplink communication, very small data size per device, and, most importantly, sporadic transmissions [3]. Providing connectivity to this massive number of sporadically transmitting IoT devices poses many challenges and requires new resource allocation and channel access mechanisms.
The massive number of IoT devices and the comparatively limited number of available channel resources in current wireless networks require efficient channel utilization. Existing orthogonal multiple access techniques do not achieve this: because they allocate nonoverlapping (orthogonal) resources to users, they suffer from connectivity limitations. To this end, nonorthogonal multiple access (NOMA) has gained tremendous interest as a potential solution, where a time-frequency resource can be simultaneously used by multiple users by employing user-specific multiple access signatures, which are exploited by the receiver to separate their signals [4], [5], [6], [7], [8], [9]. Literature from both academia and industry shows that NOMA is a promising technology for achieving massive connectivity. This includes surveys [4], [5], books [6], [7], and technical reports from the Third Generation Partnership Project (3GPP), where comprehensive link- and system-level analysis of NOMA is provided for different 5G use cases [8], [9].
The connectivity potential and performance gains achieved by NOMA depend significantly on the type of signature used for multiple access [10]. In this context, the 3GPP study on NOMA for 5G presented various possible operations for NOMA signature design [8], [9]. Accordingly, different signatures, such as spreading sequences [11], [12], [13], [14], scrambling and interleaving patterns [15], [16], power levels [17], [18], [19], [20], [21], [22], [23], [24], etc., can be employed as multiple access signatures, resulting in different NOMA schemes, each having its own dynamics and signal structure. For instance, in uplink power domain NOMA (referred to simply as NOMA later in this article), which is the focus of this work, multiple users can transmit their signals with different powers over the same resource block (RB), and the AP exploits this power difference in the received superimposed signal to separate their data for multiuser detection (MUD). It is known that, with efficient MUD at the AP, uplink NOMA can achieve high device connectivity [24], [25], [26], [27].
© 2023 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
While uplink NOMA schemes can achieve high connectivity through efficient resource allocation, another major challenge in supporting massive and sporadic IoT traffic is the way the devices access the channel resources. The channel access mechanism, i.e., how a user accesses a channel resource, in existing wireless networks (e.g., LTE/LTE-A) is mainly grant-based. Any user that has data to transmit needs to request a data transmission channel/slot from the AP through a random access (RA) process. This is generally a four-step contention-based handshake, where any candidate user randomly chooses a preamble from the available set and sends a transmission request [28]. If the request is successful, the AP can allocate a channel resource to the successful user to initiate its communication. If the request is unsuccessful, e.g., when two or more users choose the same preamble resulting in a preamble collision, the handshake fails and the transmission cannot be initiated.
This four-step handshake in grant-based RA is identified as a source of excessive delay and signaling overhead [29], [30]. The mechanism is suitable for a small number of users, but in IoT settings with a massive number of sporadically activating devices, it can result in significant signaling overhead, network congestion, latency, and packet loss [31], [32]. To this end, grant-free (GF) access has attracted significant research interest from academia and industry in recent years, where devices can transmit their data in an "arrive and go" manner without any grant requests, hence avoiding the signaling overhead and latency issues [33]. Because devices transmit their data directly without any grant request, efficient data transmission protocols and receiver design are key: the AP must identify the active devices and recover their data when transmissions succeed, or otherwise identify unsuccessful transmissions and notify the devices to resend their data.
Overall, by considering the high connectivity potential of different uplink NOMA schemes and the lower signaling overhead and latency benefits of GF access-based communications, grant-free NOMA (GF-NOMA) has been identified as a potential solution by academia [34], [35] and industry [8], [9], [36], [37] to tackle the massive and sporadic IoT traffic connectivity demands. Accordingly, a variety of GF-NOMA schemes have been presented by academia and industry by combining the various NOMA variants with GF access. In this context, Shahab et al. [34] recently provided a comprehensive survey of the existing works on GF-NOMA, their research/practical challenges, some possible solutions, and future directions.
GF access can be either contention-based or contention-free. In contention-based access, an active device with data to transmit randomly chooses an RB and a NOMA signature from a resource pool and transmits its data. If two or more devices simultaneously choose the same RB and NOMA signature, a collision occurs, and the AP cannot recover their data, requiring the colliding devices to retransmit later. The method is quite flexible and suitable when the number of devices is larger than the size of the NOMA resource pool [34]. However, collisions do happen, and detecting the transmission status (collision or success) at the AP, along with active device identification and data recovery, is quite complex.
In contrast, in contention-free GF-NOMA, the NOMA signatures are uniquely preallocated to different devices. Whenever a device has data to transmit, it becomes active and directly transmits its data using the preallocated signature. The AP then processes the received signals to identify the active devices and recover their data [34]. Due to the unique signature allocation, collisions do not happen here, and the receiver is less complex than in contention-based access. However, this model assumes that the overall resource pool (set of RBs and NOMA signatures) can accommodate the number of devices in the system. Here, power domain NOMA can be particularly effective, as adding a second power level per RB doubles the total size of the resource pool.
This work specifically focuses on contention-free GF-NOMA, where unique NOMA signatures are preallocated to devices. In contention-free power domain GF-NOMA, the devices over an RB can have preallocated unique power levels. A device with data becomes active and directly transmits its data to the AP using the allocated RB and power level. From the received superimposed signal, the AP needs to perform active device and data detection [38], [39].
From a receiver design perspective, in conventional uplink grant-based NOMA, where the AP knows exactly how many users are transmitting over an RB, the received signal at the AP is always a superposition of symbols from a known number of paired users. This makes the MUD process at the AP straightforward using conventional receivers, such as successive interference cancelation (SIC) [17], [18], [19], [20], [21] or joint maximum likelihood (JML) [22], [23]. However, in GF-NOMA, considering sporadic device activity, the number of active devices over an RB keeps changing, and the AP needs to efficiently estimate the changing number of devices transmitting over a particular RB, identify the active ones, and recover their signals.

A. Related Work
Receiver design for GF access plays a vital role in realizing the true potential of these schemes. However, the existing literature on power domain GF-NOMA (simply GF-NOMA from here on) mainly focuses on transmission protocols or sum rate maximization, and does not provide detailed insights into the receiver design. For instance, in [40], [41], and [42], protocol designs for GF-NOMA are discussed, where multiple power levels are defined over each RB, and each active user transmitting over a particular RB adjusts its transmit power to reach one of the defined receive power levels. While these works provide a good starting point toward GF-NOMA, they do not provide any details about the actual receiver design and, therefore, no results on the detection performance; only an access throughput analysis is provided. Moreover, the models allow users to randomly choose one of the defined power levels and are, therefore, contention-based GF protocols, where collisions can happen if multiple users randomly choose the same RB and power level. In another work [43], a transmission power pool design to maximize the achievable data rate for GF-NOMA using deep reinforcement learning was proposed to tackle the absence of closed-loop power control in GF access.
Different from these, Emir et al. [39] proposed a deep learning-based detector for GF-NOMA under a tight preconfigured setup, where a number of devices are multiplexed over an RB with each device having a unique power level, i.e., contention-free access. To facilitate MUD at the AP, a transmission frame structure consisting of pilots and zero padding followed by data symbols is designed for all multiplexed devices such that their pilots do not overlap. However, such a configuration incurs throughput loss due to the zero padding and pilot insertions in each frame or packet of the devices, and this loss grows with the number of multiplexed devices. Second, the frame length needs to be redesigned for any change in the number of multiplexed users over an RB to avoid pilot collision. Moreover, the transmissions from active devices need to be strictly synchronized.
Different from this, Shahab et al. in their previous work [38] proposed multiple receivers for contention-free GF-NOMA with two power levels. Initially, an extension of the conventional maximum-likelihood-based JML receiver for NOMA, i.e., extended JML (EJML), is investigated to incorporate sporadic device activity for GF access. The receiver is able to accurately identify active devices and perform data recovery. However, it suffers from high computational complexity, which increases exponentially with the number of power levels. The work also proposes a low-complexity detector based on activity indicator symbols (pilots), which performs efficiently with few pilot symbols. However, the receiver requires the devices' signals to be perfectly synchronized. Furthermore, the receiver assumes symmetric downlink/uplink channels, as well as perfect channel estimation and corresponding signal adjustments at the device end, which is not feasible in practical scenarios.

B. Motivation and Contributions
While GF-NOMA avoids the signaling overhead and latency issues of traditional grant-based access, and reduces complexity at the transmitting device, the receiver design for GF-NOMA plays a vital role in realizing the true potential of these schemes. Existing receiver designs proposed in the literature are either computationally complex or require a tight preconfigured frame structure and device synchronization. A key motivation here is to design a low-complexity detection scheme that works for GF-NOMA without requiring devices to be frame synchronized. To this end, we aim to leverage machine learning.
Recent research continues to confirm the strong capabilities of machine learning technologies in enhancing the efficiency of transmitters and receivers in wireless communication [44], [45]. Instead of relying on mathematical models and equations, machine learning algorithms search for patterns in the provided data to make nearly optimal decisions. The robustness of machine learning algorithms and models is especially desirable in wireless communication systems because of the dynamic nature of the networks, whether it is the fast-changing channel states, the dynamic network traffic, or even the network topology and scheduling.
For grant-based NOMA systems, machine learning algorithms have been applied to several of its NP-hard problems, such as acquiring channel state information (CSI), resource allocation, power allocation, complex joint decoding, and the fundamental tradeoffs among them [46], [47], [48], [49]. This is especially useful in massive IoT settings, as the complexity of these processes grows exponentially with the number of devices. Accordingly, for GF-NOMA scenarios, machine/deep learning methods have recently shown success at joint activity and data detection. However, most of the existing works focus on spreading-based NOMA schemes [50], [51], [52]. Motivated by the strong performance of these data-driven receivers in spreading-based GF-NOMA scenarios, we aim to utilize such methods for the power domain GF-NOMA problem at hand. The principal contributions of the work are as follows.
1) This article proposes a low-complexity decision tree (DT)-based receiver design for active device and data detection in GF-NOMA. By exploiting the knowledge of the possible constellation points based on the maximum number of NOMA power levels, the modulation types/sizes of the multiplexed devices, and their sporadic activity, optimum boundaries between the constellation points are carefully analyzed and used to give a primary structure to the DT. Once the basic structure is defined, the DT can be easily trained online or offline, and then used to efficiently perform active device and data detection in GF access.
2) A training algorithm for the DT is proposed along with a detailed explanation of the training process. While the initial tree structure is designed by considering no fading conditions, where the signals from the devices only contain Gaussian noise, this article also provides insight into the effects of practical transmission channels on the boundaries of the DT, where the signals from devices are faded and have phase rotations. To this end, the work proposes how the boundaries of the tree can be optimized to tackle the channel effects for efficient detection performance.
3) A detailed analysis of the computational complexity of the proposed DT is provided for both the ideal and practical channel conditions, as the tree structure is slightly different in the two cases. It is shown that the proposed DT needs very few computations to perform active device and data detection compared to the benchmarks.
4) Moreover, a detailed performance analysis of the proposed detector in terms of activity and data detection error is provided. To this end, perfect channel estimation is initially assumed, where the proposed receiver performs exactly the same as the benchmark maximum likelihood (ML)-based detector. Besides that, the detection error performance is comprehensively analyzed with practical channel estimation, where the proposed receiver performs extremely close to the benchmark receiver employing perfect channel estimation.
5) Due to the predefined structure, the tree does not need training symbols (labels) for each constellation point, unlike other classification models, and can construct the boundaries using whatever symbols it is provided with, demonstrating its robustness against the quantity of training data.
6) Finally, while the proposed DT does not rely on synchronization between active devices, a modified version of the proposed receiver is also designed for the case of frame-synchronized transmissions. The receiver is compared with the pilots-based detector in [38] to demonstrate its efficient error rate performance in scenarios where data transmissions from the active devices are strictly synchronized.
The remainder of this article is organized as follows. Section II provides details about the system model, the transmission protocol, and the formulation of the detection problem. The existing NOMA receivers are discussed in Section III, followed by the design of the proposed DTs in Section IV. The training process for the trees is presented in Section V, whereas a comprehensive performance evaluation of the receivers is provided in Section VI. Section VII discusses some practical challenges and future directions. Finally, Section VIII concludes this article.

A. System Model
Consider an uplink GF-NOMA system, where $N$ devices ($D_1, D_2, \ldots, D_N$) are multiplexed over an RB with predefined device-specific power levels $P_n$ ($n \in \{1, 2, \ldots, N\}$). The channel between $D_n$ and the AP is denoted by $h_n \sim \mathcal{CN}(0, \lambda_n)$ with $\lambda_n = d_n^{-v}$, where $\lambda_n$ is the channel variance, $d_n$ is the $D_n$-to-AP distance, and $v$ is the path loss exponent. Moreover, let $x_n$ ($n \in \{1, 2, \ldots, N\}$) represent the symbol transmitted by the $n$th device at a specific time instance, where $x_n$ is taken from a complex constellation set $\chi$, e.g., $M$-ary quadrature amplitude modulation (QAM), whose cardinality is $M$. For inactive users, their transmission is equivalent to transmitting zero. Accordingly, the augmented complex constellation set $\chi_{\text{aug}} \triangleq \{\chi \cup 0\}$ denotes the modulated symbol set of both active and inactive devices. Moreover, $y$ represents the received signal at the AP. It is important to understand that the GF-NOMA model considered here is similar to conventional grant-based uplink NOMA, except that not all the devices multiplexed over an RB transmit at a particular time, thereby causing a variable load on an RB over time.

B. Transmission Protocol
Considering the system model defined above, the received signal $y$ at the AP at any time instance can be written as
$$y = \sum_{n=1}^{N} h_n \sqrt{P_n}\, x_n + \eta \tag{1}$$
where $\eta$ represents additive white Gaussian noise (AWGN).
Considering the received signal in (1), some important aspects of the considered GF-NOMA transmission protocol are explained as follows. Throughout the discussion, we keep $N = 2$ devices per RB, which keeps the complexity reasonable while doubling the number of IoT devices that can transmit over a network, i.e., a 200% overloading [34].
As IoT traffic is predominantly sporadic in nature, the devices here are considered to be active and transmitting sporadically. Hence, for two devices with each having a unique power level, at any particular time instant, either both, one, or none of the devices can be active. Accordingly, the received signal at the AP is not always a superposition of $M$-ary signals from both devices, unlike conventional grant-based NOMA. For this GF-NOMA scenario, we define a set $E$ consisting of the possible events according to the users' activity status, given as
$$E = \{E_0, E_1, E_2, E_{1,2}\}. \tag{2}$$
Here, $E_0$ means no active device so that $y$ contains only noise, i.e., $y = \eta$; $E_1$ means only $D_1$ is active, i.e., $y = h_1 \sqrt{P_1}\, x_1 + \eta$; $E_2$ means only $D_2$ is active, i.e., $y = h_2 \sqrt{P_2}\, x_2 + \eta$; and $E_{1,2}$ means both devices are active, i.e., $y = h_1 \sqrt{P_1}\, x_1 + h_2 \sqrt{P_2}\, x_2 + \eta$. As the device activity changes, it is the task of the AP to identify the correct event at a specific time instance and recover the devices' data accordingly.
We consider that each device uses quadrature phase shift keying (QPSK) as its data modulation scheme, i.e., its modulation set is $\chi = \{s_1, s_2, s_3, s_4\}$. However, considering that the devices transmit sporadically and may frequently become active/inactive, then, as defined in the system model, the augmented complex constellation set of a device is given as $\chi_{\text{aug}} \triangleq \{\chi \cup 0\}$, which becomes $\chi_{\text{aug}} = \{0, s_1, s_2, s_3, s_4\}$, where $x_n = 0$ means inactivity or no transmission from the device. Hence, for any $n$th device, its transmitted symbol $x_n$ is taken from the complex constellation set $\chi_{\text{aug}}$. Now, for the two devices $D_1$ and $D_2$, allocated different power levels $P_1$ and $P_2$, respectively, where $P_1 > P_2$, and their augmented complex constellation set $\chi_{\text{aug}}$, the sample space consisting of a total of 25 received points $y$ at the AP according to (1), considering the four possible events in (2), is shown in Fig. 1; the events are shown using four different colors for ease of understanding. Considering the set of possible events defined in (2), the points in Fig. 1 can be explained as follows.
For the event $E_{1,2}$, i.e., both devices active, their superimposed received symbol at any time instance should correspond to one of the 16 purple constellation points, i.e., the symbol combinations $\{(s_1, s_1), (s_1, s_2), \ldots, (s_4, s_3), (s_4, s_4)\}$ from the two devices with different powers, which is similar to what would be seen in conventional grant-based power domain NOMA systems. However, since this is GF-NOMA, we do have other possible events. For event $E_1$, i.e., only $D_1$ active, we get the four green high-power QPSK symbols from $D_1$. Similarly, for event $E_2$, i.e., only $D_2$ active, the only possible received points are the four low-power gray QPSK symbols. Finally, for event $E_0$, i.e., when no device is active, the received signal only contains AWGN, i.e., the single red point around the origin.
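As a rough illustration of how the 25-point sample space arises, the following Python sketch enumerates the noiseless received points for two QPSK devices and groups them by event. The power values `P1 = 9` and `P2 = 1`, the ideal channel gains `H1 = H2 = 1`, the unit-energy QPSK mapping, and all function names are assumptions made for this example; they are not taken from the article.

```python
import itertools
import math

# Assumed example values (not from the article): powers and ideal channels.
P1, P2 = 9.0, 1.0
H1, H2 = 1 + 0j, 1 + 0j

# Unit-energy QPSK set chi and the augmented set chi_aug = chi U {0}.
QPSK = [complex(a, b) / math.sqrt(2) for a in (1, -1) for b in (1, -1)]
CHI_AUG = [0j] + QPSK


def event_of(x1, x2):
    """Map a transmitted symbol pair to its activity event label."""
    if x1 == 0 and x2 == 0:
        return "E0"
    if x2 == 0:
        return "E1"
    if x1 == 0:
        return "E2"
    return "E12"


def received_points():
    """Return {event: [noiseless received points]} over all symbol pairs."""
    points = {"E0": [], "E1": [], "E2": [], "E12": []}
    for x1, x2 in itertools.product(CHI_AUG, repeat=2):
        y = H1 * math.sqrt(P1) * x1 + H2 * math.sqrt(P2) * x2
        points[event_of(x1, x2)].append(y)
    return points


if __name__ == "__main__":
    for event, pts in received_points().items():
        print(event, len(pts))
```

Under these assumptions the four groups contain 1, 4, 4, and 16 points, matching the event breakdown described above. Note that a power ratio of exactly 4 would make some points of different events coincide, which is why a different ratio is assumed here.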

III. EXISTING RECEIVERS FOR GRANT-BASED AND GF-NOMA
As discussed previously, the device activity and data detection problem in GF-NOMA is different from the data-only detection problem in conventional grant-based NOMA, as the latter does not require device activity detection due to the prior grant-access process. To this end, some prominent grant-based and GF-NOMA receivers are briefly discussed here.

A. Conventional Grant-Based NOMA Receivers
Due to the prior grant-access process, receivers in conventional grant-based NOMA always know the number of transmitting devices, and exploit the received superimposed signal accordingly to recover the data of each device. To this end, two prominent power domain NOMA receivers are SIC and JML [38].
1) SIC: SIC decodes the devices in decreasing order of their received power, where the signal of a higher power device is decoded first and is subtracted from the received signal to decode the next device, and so on. For the $N = 2$ case, assuming $P_1 > P_2$, $D_1$'s data is recovered first from the received superimposed signal $y$, and is subtracted from $y$ to recover $D_2$'s data.
SIC inherently relies on prior knowledge of the number of transmitting devices. For $N = 2$ multiplexed devices, the SIC receiver assumes both devices to be transmitting and considers any received symbol as a superposition of their transmitted symbols. If one or both devices are inactive, the receiver will still perform the same steps, and will therefore decode noise as the signal of the inactive device(s).
2) JML: JML, inspired by conventional ML detection, makes a joint estimate of the transmitted symbols of the paired devices [22], [23]
$$(\hat{x}_1, \hat{x}_2) = \arg\min_{x_1, x_2 \in \chi} \left| y - \left( h_1 \sqrt{P_1}\, x_1 + h_2 \sqrt{P_2}\, x_2 \right) \right|^2. \tag{3}$$
JML assumes the normal transmit symbol set for each device, i.e., $\chi = \{s_1, s_2, s_3, s_4\}$ considering QPSK, and not the augmented constellation set $\chi_{\text{aug}}$, i.e., it assumes both devices to be active for the $N = 2$ case. In the case of $N = 2$ and $M = 4$, this becomes a search space of the 16 (purple) constellation points shown in Fig. 1. Accordingly, if a device is inactive, the AP will still check the wrong set of constellation points, and will end up recovering wrong symbols for the devices.
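The behavior of both grant-based receivers under sporadic activity can be sketched in a few lines of Python. This is a minimal illustration under assumed conditions (powers `P1 = 9`, `P2 = 1`, ideal channels `H1 = H2 = 1`, unit-energy QPSK); the helper names `nearest`, `sic_detect`, and `jml_detect` are hypothetical and are not the implementations used in [38].

```python
import itertools
import math

# Assumed example values (not from the article).
P1, P2 = 9.0, 1.0
H1, H2 = 1 + 0j, 1 + 0j
QPSK = [complex(a, b) / math.sqrt(2) for a in (1, -1) for b in (1, -1)]


def nearest(z, candidates):
    """Return the candidate closest to z in Euclidean distance."""
    return min(candidates, key=lambda c: abs(z - c))


def sic_detect(y):
    """SIC: decode the stronger device D1 first, subtract it, then decode D2."""
    g1, g2 = H1 * math.sqrt(P1), H2 * math.sqrt(P2)
    x1_hat = nearest(y / g1, QPSK)
    x2_hat = nearest((y - g1 * x1_hat) / g2, QPSK)
    return x1_hat, x2_hat


def jml_detect(y):
    """JML: joint minimum-distance search over the 16 pairs in chi x chi."""
    g1, g2 = H1 * math.sqrt(P1), H2 * math.sqrt(P2)
    return min(
        itertools.product(QPSK, repeat=2),
        key=lambda pair: abs(y - (g1 * pair[0] + g2 * pair[1])) ** 2,
    )
```

When only $D_2$ transmits, e.g., `y = H2 * math.sqrt(P2) * QPSK[0]`, both functions still return a nonzero QPSK symbol for $D_1$, because neither search space contains the zero (inactive) symbol. This is the failure mode described above.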

B. Receivers for Power Domain GF-NOMA
As GF-NOMA with sporadic transmissions poses a different problem than conventional NOMA, some relevant detectors are discussed below.
1) Extended JML Receiver: Motivated by JML discussed above, and considering the sporadic device activity problem, Shahab et al. [38] in one of their previous works proposed an EJML receiver for power domain GF-NOMA. As the name indicates, it extends conventional JML by considering the augmented constellation set $\chi_{\text{aug}}$ in its detection process. The joint estimate of device activity and their transmitted data, similar to JML but using $\chi_{\text{aug}}$ rather than $\chi$, is given as
$$(\hat{x}_1, \hat{x}_2) = \arg\min_{x_1, x_2 \in \chi_{\text{aug}}} \left| y - \left( h_1 \sqrt{P_1}\, x_1 + h_2 \sqrt{P_2}\, x_2 \right) \right|^2. \tag{4}$$
By including $\chi_{\text{aug}}$ in the modulation search space of each device, EJML is able to check all four events in (2), i.e., all 25 points in the constellation set of Fig. 1.
While EJML is shown to perform very well [38], it comes at the cost of high computational complexity. The number of points to be generated for the Euclidean distance calculation in (4), followed by the minimum distance search, is $(M + 1)^N$. For $N = 2$ and QPSK ($M = 4$), the constellation size is 25, which increases to 125 for $N = 3$, so the computational complexity grows exponentially with the number of multiplexed devices.
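The EJML search and its growth in size can be sketched as follows. This is a minimal Python illustration assuming ideal channels (`H1 = H2 = 1`), example powers `P1 = 9` and `P2 = 1`, and unit-energy QPSK; the names `ejml_detect` and `ejml_search_size` are hypothetical and do not come from [38].

```python
import itertools
import math

# Assumed example values (not from the article).
P1, P2 = 9.0, 1.0
H1, H2 = 1 + 0j, 1 + 0j
QPSK = [complex(a, b) / math.sqrt(2) for a in (1, -1) for b in (1, -1)]
CHI_AUG = [0j] + QPSK  # the zero symbol models an inactive device


def ejml_detect(y):
    """Joint activity and data estimate over chi_aug x chi_aug (25 candidates)."""
    g1, g2 = H1 * math.sqrt(P1), H2 * math.sqrt(P2)
    return min(
        itertools.product(CHI_AUG, repeat=2),
        key=lambda pair: abs(y - (g1 * pair[0] + g2 * pair[1])) ** 2,
    )


def ejml_search_size(num_devices, mod_order):
    """Number of candidate points EJML evaluates: (M + 1) ** N."""
    return (mod_order + 1) ** num_devices
```

A returned zero in either position indicates that the corresponding device is estimated to be inactive, so activity and data are detected in a single search. The price is that every received symbol requires a distance computation to each of the `ejml_search_size(N, M)` candidates.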
2) Pilot Signals-Based Detectors: Shahab et al. [38] presented a low-complexity receiver based on flag signals (pilots), named S-Hybrid, which relies on frame synchronization and pilot symbols transmitted by each active device at the start of every frame to identify the right event, and accordingly switches between JML (for event $E_{1,2}$), ML (for $E_1$ or $E_2$), and no detection (for $E_0$). The detector is a good first step in the design of low-complexity GF-NOMA receivers. However, it assumes that the devices, rather than the AP, estimate their downlink channels and then adjust their transmit signals to counter the channel effects. It further assumes perfect channel estimation and power control, and that the downlink and uplink channels are perfectly reciprocal. It also requires the signals of both devices to be synchronized and pilots to be transmitted at the start of every frame, causing signaling overhead and throughput loss.
Similarly, Emir et al. [39] proposed a pilot-based deep learning detector for GF-NOMA under frame synchronization and a tight preconfigured setup. To facilitate detection, a frame structure consisting of pilots and zero padding followed by data symbols is designed for multiplexed devices such that their pilots do not overlap; at the pilot sequence location in one device's frame, the other devices multiplexed over that RB need to have zeros in their frames. However, such a configuration causes signaling overhead and throughput loss due to the pilot sequence and zero padding in each frame of all devices, which increases with the number of multiplexed devices. It further requires a redesign of the frame length for any change in the number of multiplexed devices over an RB to avoid pilot collisions. Moreover, the transmissions from active devices need to be strictly synchronized. Finally, and importantly, the model relies on offline training with a large data set to work efficiently.
For asynchronous transmissions, the work in [53] proposes an efficient activity and data detection method that inserts pilot sequences and guard spaces in the data frames and uses a generalized expectation consistent signal recovery-based algorithm. The work, however, is based on spreading sequences as signatures in the uplink, whereas power domain NOMA is used in the downlink transmissions from the AP.
Overall, the pilot-based detectors mostly require devices to transmit pilot symbols at the start of every frame to facilitate activity detection and, therefore, require frame synchronization and result in throughput loss, signaling overhead, and higher energy consumption. Such models may still be useful in fast-changing channel conditions, where the pilots in each frame can be used to estimate the channel. However, in many IoT settings, especially indoor scenarios, channel conditions are quite stable and do not require channel estimation for every packet. In such settings, channel estimation can be performed occasionally, at regular intervals or as required. For instance, in a smart home, channel estimation might be required just once a day, or only when some settings in the home change. In these scenarios, the AP can request the multiplexed devices to send some pilot symbols at a specific time, e.g., once a day, and then estimate the channel, which can be assumed to remain fairly constant until the next estimation phase. Using this estimate, the devices then only need to transmit their actual data whenever they are active, and the AP is required to perform activity and data detection from the data symbols alone, without any pilots. Such a receiver design is the key objective of this work.

IV. RECEIVER DESIGN FOR GF-NOMA
As mentioned earlier, this article focuses on the design of low-complexity receivers for GF-NOMA. To this end, a DT-based detector is proposed by exploiting the structure of the received constellation sample space to adjust decision boundaries that are used by the detector to jointly estimate the device activity and transmitted data.

A. Proposed Decision Tree-Based Detector
This section focuses on the design of a data-driven low-complexity DT receiver for GF-NOMA. The goal is to achieve near-optimal error rate performance for the multiplexed devices for offline training with large data size or online training with minimal training data. The initial focus is on identifying the optimum decision boundaries between the constellation points, followed by designing training and testing mechanisms for the receiver.

1) Decision Tree Under Ideal Channel Conditions:
We start with a DT having boundary lines to separate events based on the constellation diagram in Fig. 1. The boundary lines are shown in Fig. 2 and summarized in Table I. Initially, we consider only AWGN channels, or alternatively assume the devices to perfectly estimate their downlink channel and adjust uplink transmissions to counter the channel effects. Considering $N = 2$, and $P_1$ and $P_2$ as the $D_1$ and $D_2$ powers, respectively, such that $P_1 > P_2$, the boundary values in Fig. 2 can be calculated from (1).
The x-y axis boundary lines are represented by $T_0$, whereas $T_4$ represents the shifted versions of the x-y lines centered at the $D_1$ QPSK symbols. Moreover, $T_1$ to $T_3$ correspond to squares that require four equations to be defined, one for each side of the square. Accordingly, we use the notation $T_i^k$, where $i = 1, 2, 3$ represents the three types of squares (small centered at the origin, big centered at the origin, and the small outer squares in each quadrant), and $k = 1, \ldots, 4$ represents the four lines of any square such that $k = 1$ is the right-top, $k = 2$ the right-bottom, $k = 3$ the left-top, and $k = 4$ the left-bottom line of any square, irrespective of its quadrant. The locations of the boundary lines in the x-y plane depend on channel conditions and need to be learned from the training data.
It can be seen that lines of all squares have slopes of 1 and −1 here.This is because there is no channel imperfection or phase rotation considered in this figure, which means the squares are perfectly aligned along the axis.This is not the  case in practical scenarios as we discuss in next sections where practical channels and phase rotated constellations are also considered.The boundary lines are also summarized in Table I, where T 0 and T 4 are the horizontal-vertical lines and T 1 − T 3 the squares.Moreover, while the notation T 1 and T 2 refer to one square each, T 3 refers to a set of four squares, one in each quadrant as in Fig. 2. Accordingly, we further split the notation T 3 into T 3,1 , T 3,2 , T 3,3 , and T 3,4 to denote the squares in first, second, third, and fourth quadrants, respectively.Using these boundaries, the tree structure is shown in Fig. 3.
It should be noted that, because line equations are generally written in terms of x-y variables, we represent the received symbol by $s$ in Fig. 3 to avoid confusion with the received symbol $y$. The decision process starts by checking the signs of the real and imaginary parts of the received symbol $s$ to find the correct quadrant $Q$. Based on $Q$, it checks whether the point is outside or inside the big square. If the point is inside, i.e., $s \in \{E_0, E_2\}$, the small center square line in that $Q$ is checked for the right event and data. Otherwise, if $s \in \{E_1, E_{1,2}\}$, the outer x-y axis lines followed by the relevant boundary of the smaller square in that $Q$, i.e., $T^k_{3,Q}$, are checked to find the right event and recover the corresponding symbols.
This DT with square boundary lines is expected to perform exactly the same as EJML, as also shown in the results in Section VI. Moreover, the computational complexity of this tree is expected to be quite low and is provided in detail in Section VI-A. For instance, to detect an $E_0$ related point, i.e., no activity, the tree goes through the checks $s_r > 0$, $s_i > 0$, $T^k_1 > 0$, and $T^k_2 > 0$, i.e., four checks requiring very few addition/multiplication/comparator operations, significantly fewer than EJML as explained in detail in Section VI-A.
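The decision process described above can be illustrated with a short sketch. It is not the authors' implementation: the function name `dt_detect`, the per-axis amplitudes `a1` and `a2`, and the requirement `a1 > 2 * a2` (so that the inner and outer clusters do not overlap) are assumptions made for illustration. The sketch folds the received sample into the first quadrant, which is equivalent to selecting the matching side of each square under the symmetric, unrotated constellation.

```python
def dt_detect(s, a1, a2):
    """Return (D1 symbol, D2 symbol) for one received sample s.

    a1, a2: hypothetical per-axis received amplitudes of D1 and D2,
    assumed to satisfy a1 > 2 * a2. A symbol is a pair of signs
    (+1/-1, +1/-1); None means the device is inactive.
    """
    sign = lambda v: 1 if v > 0 else -1
    q = (sign(s.real), sign(s.imag))       # x-y axis check: quadrant Q
    x, y = abs(s.real), abs(s.imag)        # fold into Q1 by symmetry
    if x + y < a1:                         # inside the big square (slope -1 side)
        if x + y < a2:                     # inside the small centre square
            return None, None              # E0: noise only
        return None, q                     # E2: only D2 active
    dx, dy = x - a1, y - a1                # shifted axes centred at the D1 symbol
    if abs(dx) + abs(dy) < a2:             # inside the small outer square
        return q, None                     # E1: only D1 active
    return q, (q[0] * sign(dx), q[1] * sign(dy))   # E1,2: both active
```

With noiseless inputs and `a1 = 3`, `a2 = 1`, this sketch maps each of the 25 constellation points back to its own event and symbols. Every path uses only sign tests and slope ±1 line tests, which is the source of the low operation count discussed in Section VI-A.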
Throughout the discussion above, one critical point behind the simplicity and efficiency of the DT is the assumption of perfect channel estimation at the device end, or otherwise an AWGN channel environment, which keeps the points in accordance with their original QPSK modulated symbol structure. The extended 25-point constellation is symmetric across the axes. However, as soon as this assumption is relaxed, the constellation may lose this symmetric structure, which affects how the decision boundaries are derived. We take an in-depth look into this next.
2) Optimized Tree Under Practical Channel Conditions: We relax the assumption of an AWGN channel here, and investigate the tree dynamics under practical channel conditions, particularly in conditions where constellation points from either one or both devices are phase rotated, affecting the overall extended 25 point constellation.For the ease of understanding, a scenario where D 2 's (the low-power device) constellation is phase rotated is shown in Fig. 4. Accordingly, it can be seen that the alignment of the constellation points with respect to each other and with respect to the x-y axis has changed, requiring an adjustment of the boundary definitions.Some key observations in this scenario are summarized here.
1) The first issue, shown as (a) in Fig. 4, concerns the x-y axis lines with respect to the devices' constellation points. Previously, there was no rotation in the constellation points from any device. Therefore, the x-y axis lines ($T_0$) of these devices were aligned with each other and with the general x-y axis lines. However, as the $D_2$ constellation is rotated here, its corresponding x-y axis lines are also rotated and are now at some angle with respect to the general and $D_1$ axis lines. Similarly, $D_2$'s shifted axis lines ($T_4$) in the outer clusters were previously the same for all outer clusters and were simply horizontal/vertical lines. Now they are also rotated with respect to the general axis lines. These rotated lines across the origin, and in the outer clusters, need to be incorporated differently in the DT to make efficient decisions.
2) Moreover, as shown as (b) in the first $Q$, the big square boundary line no longer provides the best separation between $D_2$'s point and the nearest $E_{1,2}$ related point from the outer cluster. Previously, with no rotation, both devices' constellations were aligned with each other, and therefore, we did not have this problem. However, because the big/small square lines were drawn only based on their respective devices' constellation, the big square does not take the rotation of $D_2$ into account and remains the same as before. Accordingly, it does not efficiently divide the two points in $Q_1$, and also in the other quadrants, and touches some parts of both constellation points as shown by the two circles.
3) In this context, as shown in (c), one solution is to have a perfect separator between the two points by drawing a perpendicular to the line connecting these two points. However, as shown through a dashed circle in $Q_1$, this line then causes a decision problem with another constellation point from the outer cluster, and is not optimal.
Considering these points, and through comprehensive analysis, the optimum boundaries are shown in Fig. 5.
Compared to the squares and horizontal-vertical lines in Fig. 4, the boundaries in Fig. 5 under phase rotations have some obvious differences as summarized below.
1) There are now two sets of x-y axis lines that depend entirely on the phase rotation of each device: (i) the x-y lines $T^j_{0,1}$ ($j \in \{x, y\}$) across the origin with respect to $D_1$, and (ii) the x-y axis lines $T^j_{0,2}$ ($j \in \{x, y\}$) across the origin with respect to $D_2$, along with $T^{jk}_4$, $j \in \{x, y\}$, $k = 1, \ldots, 4$, in the outer clusters in the four quadrants.
2) The big square lines, previously $T^k_1$, are replaced by two lines in each quadrant to correctly divide the decision region between the $E_2$ point and the two nearest $E_{1,2}$ related points. These lines, $T^k_{1,1}$ and $T^k_{1,2}$, thus represent the first and second lines in the $k$th quadrant.
3) The small square lines in set $T_2$ around the origin and $T_3$ in the outer clusters are similar to the previous ones, but rotated.
Considering these new boundary lines, the final optimum DT that is capable of dealing with phase rotations is shown in Fig. 6. As expected, some steps in the DT are quite different from the previous tree. The working of this tree is as follows.
1) As in the previous tree, the first step is to identify the quadrant of the received symbol. Unlike previously, we now have two different sets of x-y lines across the origin due to the relative phase difference between the two devices. The tree starts by checking the received symbol with respect to the high-power device $D_1$'s x-y axis lines $T^x_{0,1}$ and $T^y_{0,1}$ to decide on the possible $Q$.
2) For any given $Q$, unlike the previous tree where the point was checked against the boundary lines $T^k_1$ of the big square, the tree now checks the received symbol against the two newly added lines in each quadrant. For instance, in $Q_1$, the received point is checked jointly against $T^1_{1,1}$ and $T^1_{1,2}$.
3) If the point is found to be outside the two lines, i.e., corresponding to the outer cluster ($E_1$ or $E_{1,2}$), then the process for further detection is similar to the previous tree, i.e., checking the $T_4$ x-y lines followed by checking the corresponding small square boundary $T_3$. However, if the symbol is found to be inside the two lines, i.e., corresponding to $E_0$ or $E_2$, the process differs from the previous tree and is as follows.
4) As the symbol is now either from $E_0$ or $E_2$, the x-y lines with respect to $D_1$ that were used at the start are not relevant. The relevant lines are the x-y axis lines with respect to $D_2$'s points and rotation, i.e., $T^y_{0,2}$ and $T^x_{0,2}$. Based on the rotation of $D_2$'s constellation points, these lines are responsible for dividing the region into four quadrants. Accordingly, the received symbol is checked against these two lines to identify the true $Q$.
5) Once the true $Q$ is identified, the final step is again similar to the previous trees, i.e., checking against the small square boundaries $T_2$ to identify the symbol as an $E_0$ or $E_2$ related point.
All these steps are shown in Fig. 6. It can also be seen that, for better readability, the latter process of detecting $E_0$ or $E_2$, which is the same for all four quadrants, is labeled as A, and is only drawn once on the left. It is important to note that the boundaries for both trees described above were drawn using known channel impacts on the signal, such as phase rotation or any power change. In practice, however, this needs to be done either using offline training or on the fly using online training, as discussed later in Section V.
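Two geometric primitives underlie the rotated tree: a quadrant test against a device's rotated axes, and a side test against the perpendicular bisector of two constellation points. The following is a minimal sketch of both, assuming the phase rotation `theta` of a device is already known or estimated. The function names are hypothetical, and the sketch does not reproduce the exact choice of lines in Fig. 5.

```python
import cmath

def quadrant_signs(s, theta):
    """Signs of s relative to x-y axes rotated by theta (radians)."""
    r = s * cmath.exp(-1j * theta)          # undo the device's phase rotation
    return (1 if r.real > 0 else -1, 1 if r.imag > 0 else -1)

def closer_to_first(s, u, v):
    """True if s lies on u's side of the perpendicular bisector of u and v."""
    midpoint = (u + v) / 2
    direction = v - u
    # Real part of (s - midpoint) * conj(direction) is their inner product.
    return ((s - midpoint) * direction.conjugate()).real < 0
```

For example, a $D_2$ symbol rotated by one radian falls in the second quadrant of the general axes but in the first quadrant of $D_2$'s own rotated axes, which is why step 4 switches from the $D_1$ axis lines to the $D_2$ axis lines.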

B. Decision Tree in Scenarios With Higher Modulation Sizes and Device Overloading
The DT models in Sections IV-A1 and IV-A2 focus on the basic two-device GF-NOMA model under ideal and practical channel conditions, respectively, where both devices are considered to use the same low-size QPSK modulation for their data transmission.While the low-rate IoT devices mainly use lower order modulation schemes, the receiver presented here can be extended to higher modulation sizes.
In this section, we initially consider such a scenario, where $D_1$ uses 16QAM and $D_2$ uses QPSK. We again have a set of four events, $E \in \{E_0, E_1, E_2, E_{1,2}\}$. Here, $E_0$ has one constellation point at the origin representing channel noise only. $E_2$, when only $D_2$ is active, results in four points with power $P_2$ since $D_2$ uses QPSK. Regarding $E_1$, since $D_1$ uses 16QAM, this event results in 16 constellation points with power $P_1$, unlike the earlier cases where $E_1$ also had four points due to the use of QPSK. Finally, for the event $E_{1,2}$, the NOMA constellation results in $16 \times 4 = 64$ constellation points, as each of the 16 higher power QAM points is surrounded by four low-power shifted QPSK points. Hence, the constellation of possible events contains $1 + 4 + 16 + 64 = 85$ points in total.
All these constellation points are symmetric across the four quadrants, and therefore, a zoomed part, i.e., quadrant 1 of the constellation, is shown in Fig. 7. Besides the $E_0$ point (red) ...
Regarding tree boundaries in this scenario, we have five sets of boundary lines here compared to four previously. The DT for this case starts with the two main $T_0$ lines to check the sign and quadrant. Once the quadrant is identified (e.g., $Q_1$ here), the point is checked against the big central square line $T^k_1$ in that quadrant (e.g., $T^1_1$ if $Q_1$). If $s$ is found to be inside the big square, i.e., belonging to either $E_0$ or $E_2$, then it is checked against the smaller center square line $T^k_2$ in that quadrant (e.g., $T^1_2$ if $Q_1$) to find the right event and the transmitted symbol. Up to this point, the tree works in the same way as in the previous case of both devices using QPSK. However, if $s$ was found to be outside the big square in the earlier step, i.e., $s \in \{E_1, E_{1,2}\}$, then there is one additional step toward identifying the right event, as we have four outer clusters in each quadrant compared to one cluster in the previous cases of both QPSK devices. Hence, $s$ is first checked against two newly introduced $T_3$ lines to find the right outer cluster. For the estimated cluster, the last two steps include checking $s$ against the two $T_4$ lines in that cluster to find the subquadrant, followed by checking the small square boundary line $T_5$ in that subquadrant to estimate the right event and symbols.
It can be noticed that, while the number of constellation points is around three times higher when we consider a 16QAM and a QPSK device compared to the simpler case of both QPSK devices, there is only one additional set of lines to check in order to estimate the right event or activity status and the transmitted symbols of the devices. This indicates the potential for scalability of the proposed DT to higher constellation sizes, as the increase in constellation points only slightly increases the number of boundary checks, causing only a slight increase in the computational complexity of the proposed DT as discussed in detail in Section VI-A. The detection method can also be extended to cases where a higher number of devices are multiplexed over an RB, e.g., three devices per RB. The number of constellation points for the DT also increases with a higher number of devices. For example, for $N = 3$ devices multiplexed over an RB using $M = 4$-ary QPSK modulation each, a conventional NOMA system results in $M^N = 4^3 = 64$ constellation points. Considering GF-NOMA for this $N = 3$ scenario, due to sporadic transmissions, we have eight possible events, i.e., $E \in \{E_0, E_1, E_2, E_3, E_{1,2}, E_{1,3}, E_{2,3}, E_{1,2,3}\}$. Again, these points are symmetric across the four quadrants, and $Q_1$ is shown here in Fig. 8. Regarding boundaries, the two $T_0$ lines identify the correct quadrant, e.g., $Q_1$ here. Then, $T^1_1$ is checked to determine whether the point belongs to the inner cluster, i.e., $\{E_0, E_2, E_3, E_{2,3}\}$, followed by checking either $T^1_3$ for $\{E_0\}$ and $\{E_3\}$, or two $T_4$ and a $T_5$ to identify $\{E_2\}$ and $\{E_{2,3}\}$. Otherwise, if the point belongs to the outer cluster (i.e., outside $T^1_1$), checking two $T_6$, a $T_7$, and a $T_8$ can identify $E_1$ and $E_{1,3}$, whereas checking the two $T_6$, $T_7$, two $T_9$, and a $T_{10}$ can lead to $E_{1,2}$ and $E_{1,2,3}$. It can be seen that the receiver still needs very few steps to estimate the right event and transmitted symbols despite the many points.
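The point counts quoted in this section follow from enumerating every subset of active devices and multiplying the modulation sizes of the active ones. A minimal sketch of that bookkeeping is given below; the function name `event_point_counts` and the tuple-based event labels are assumptions made for illustration.

```python
from itertools import combinations
from math import prod

def event_point_counts(mod_sizes):
    """Map each activity event to its number of constellation points.

    mod_sizes[i - 1] is the modulation size of device D_i. An event is
    labelled by the tuple of active device indices; () is E0 (noise only).
    """
    n = len(mod_sizes)
    counts = {}
    for k in range(n + 1):
        for active in combinations(range(1, n + 1), k):
            # Product over active devices; the empty product is 1 (the origin).
            counts[active] = prod(mod_sizes[i - 1] for i in active)
    return counts
```

Under these assumptions, `event_point_counts([16, 4])` gives 1, 16, 4, and 64 points for $E_0$, $E_1$, $E_2$, and $E_{1,2}$, i.e., 85 in total, and three QPSK devices give eight events with 125 points in total, matching the figure used in Section VI-A.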

V. TRAINING THE DECISION TREE
Given the structure of the DT derived from our communications scenario, the main problem at hand is the training mechanism, so that the DT can efficiently construct these boundaries itself according to the real channel environment using some training data. Training can either be done offline using a large data set or online using a few training/pilot symbols. We consider the two-device scenario with QPSK modulation here for discussion.
The proposed training process is carefully designed to exploit symmetry in the constellation structure, so that we can jointly analyze the training samples related to different labels (constellation points) and exploit their mutual relationships with fewer data points overall. For instance, for an event, say $E_1$, while there are four constellation points relating to the four QPSK symbols that can be transmitted by $D_1$, these points can be collectively used to estimate the channel attenuation and phase rotation for $D_1$, to efficiently draw the relevant decision boundaries for the tree and achieve better performance with minimal training data. Accordingly, the training methodology is presented in Algorithm 1.
The input here is the received symbols from the devices. The training model does not require samples for all points in the 25-point constellation. It can work with any number of samples as long as they contain transmissions from each of the two devices, so that their channel states can be estimated. This means that the training model can calculate all the tree boundaries with a minimum of just two training symbols, one from each device, although the training accuracy will be limited with only two sample points given the channel noise. As the 25-point constellation contains the $E_1$ and $E_2$ symbols along with the $E_{1,2}$ points, which are combinations of these $E_1$ and $E_2$ points, the training model only requires $E_1$ and $E_2$ related symbols (set $S$ here), as they are sufficient to provide the required information about the channel state of both devices. As a result of training, the outputs are the equations of the tree boundary lines. It is important to note that the training process needs to know the number of devices multiplexed over an RB and their modulation type (QPSK here) to construct the tree boundaries.
The training process first divides the $D_1$ and $D_2$ related symbols into sets $S_1$ and $S_2$, respectively. For each $i$th symbol in set $S_1$, its amplitude $|S_1(i)|$ and phase $\theta(S_1(i))$ are calculated. Once this is done for all $S_1$ points, the amplitudes and phases are averaged to get an estimate of $D_1$'s amplitude (which also contains the impact of $D_1$'s power level) and phase. As this is an average, it does not require samples for all four QPSK symbols from $D_1$; even one or two symbols can work, as this is just for channel estimation purposes. Once this is done for $D_1$, the same is repeated for $D_2$ using its samples in $S_2$. Then, by using $|S_1|$, $\theta(S_1)$, $|S_2|$, and $\theta(S_2)$, the boundary lines can be determined. To do this, slopes and y-intercepts are calculated using different combinations of points as described in Algorithm 1.
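The per-device averaging step can be sketched as follows. This is not Algorithm 1 itself. It assumes that the transmitted training symbols are known at the AP, so that the modulation can be removed by division, and it swaps the plain average of phases for a circular mean (the phase of the averaged unit phasors) to avoid wrap-around near ±π. The function name `estimate_gain` is hypothetical.

```python
import cmath

def estimate_gain(received, sent):
    """Estimate a device's received amplitude and phase rotation.

    received: training samples observed while only this device is active.
    sent: the known symbols that were transmitted (assumed available).
    """
    ratios = [y / x for y, x in zip(received, sent)]    # strip the modulation
    amplitude = sum(abs(r) for r in ratios) / len(ratios)
    # Circular mean of the phases: phase of the summed unit phasors.
    phase = cmath.phase(sum(r / abs(r) for r in ratios))
    return amplitude, phase
```

Calling this once on the $S_1$ samples and once on the $S_2$ samples yields the four quantities from which the slopes and intercepts of the boundary lines are then derived. Because the result is an average, it works with a single sample per device, at the cost of a noisier estimate.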

A. Computational Complexity Analysis
The complexity of EJML and the proposed trees is analyzed here, and the results for $N = 2$ and $M = 4$-ary modulation are summarized in Table II. The complexity of conventional SIC is also discussed at the end for comparison purposes.
1) Complexity of EJML Receiver: The detection process of EJML was shown in (4). It can also be written as
$$\arg\min_{x_1, x_2 \in \chi_{\text{aug}}} \| y - \hat{g}_1 x_1 - \hat{g}_2 x_2 \|^2$$
where $\hat{g}_1 = \sqrt{P_1}\hat{h}_1$, $\hat{g}_2 = \sqrt{P_2}\hat{h}_2$, and $\chi_{\text{aug}} = \{0, s_1, s_2, s_3, s_4\}$. This means that EJML calculates the distance of the received symbol $y$ from each of the 25 possible points and then chooses the point with minimum distance from $y$. Considering that $y$ is a complex number, as are $x_1$, $x_2$, $\hat{g}_1$, and $\hat{g}_2$, by writing them in terms of their real and imaginary parts, i.e., $y = y_r + jy_i$, $x_1 = x_{1r} + jx_{1i}$, $x_2 = x_{2r} + jx_{2i}$, $\hat{g}_1 = \hat{g}_{1r} + j\hat{g}_{1i}$, and $\hat{g}_2 = \hat{g}_{2r} + j\hat{g}_{2i}$, and using the complex number multiplication formula $(a + jb)(c + jd) = (ac - bd) + j(ad + bc)$, the distance calculation in the EJML equation can be written as
$$\sqrt{(y_r - f_r)^2 + (y_i - f_i)^2} \tag{5}$$
where $f_r = \hat{g}_{1r}x_{1r} - \hat{g}_{1i}x_{1i} + \hat{g}_{2r}x_{2r} - \hat{g}_{2i}x_{2i}$ and $f_i = \hat{g}_{1r}x_{1i} + \hat{g}_{1i}x_{1r} + \hat{g}_{2r}x_{2i} + \hat{g}_{2i}x_{2r}$ are the real and imaginary parts of $\hat{g}_1 x_1 + \hat{g}_2 x_2$.

TABLE II COMPUTATIONAL COMPLEXITY ANALYSIS
Calculating $(y_r - f_r)^2$ and $(y_i - f_i)^2$ requires four additions and five multiplications each. Thus, $(y_r - f_r)^2 + (y_i - f_i)^2$ requires nine additions and ten multiplications. Finally, counting the square root as a single multiplication operation, the total becomes nine additions and 11 multiplications. Now, the number of constellation points in EJML for two devices is $M^2 + 2M + 1$, which for $M = 4$ becomes $16 + 8 + 1 = 25$ points. Hence, (5) is calculated for each of these points, which results in $25 \times 9 = 225$ additions and $25 \times 11 = 275$ multiplications. Finally, the min function results in 24 comparator operations. These calculations are summarized in Table II.
It can be seen that EJML takes a large number of computations to perform the activity and data detection. Furthermore, these computations increase significantly with an increase in the number of devices or their modulation sizes. For instance, for the two-device scenario, if the device with higher power uses 16QAM and the lower power device uses QPSK, we end up with a constellation space of $M_1 M_2 + M_1 + M_2 + 1 = 85$ points. Considering the Euclidean distance calculations above, this would require $85 \times 9 = 765$ additions and $85 \times 11 = 935$ multiplications followed by 84 comparator operations. Similarly, for the case of three devices multiplexed over an RB and transmitting sporadically using QPSK modulation, we have 125 constellation points, resulting in 1125 additions, 1375 multiplications, and 124 comparator operations, which is significantly high.
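The exhaustive search and its operation count can be sketched as follows. This is an illustrative rendering rather than the reference implementation of [38]: the names `ejml_detect` and `ejml_ops`, and the unnormalized `QPSK` alphabet, are assumptions, and the per-point costs of nine additions and 11 multiplications are taken directly from the analysis above.

```python
from itertools import product
from math import prod

QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]    # unnormalized, for illustration

def ejml_detect(y, gains, alphabets):
    """Search every activity/symbol combination for the closest point.

    Each device either stays silent (symbol 0) or sends one symbol of its
    alphabet, so the search space has prod(M_i + 1) candidates.
    """
    augmented = [[0] + list(a) for a in alphabets]
    return min(
        product(*augmented),
        key=lambda xs: abs(y - sum(g * x for g, x in zip(gains, xs))) ** 2,
    )

def ejml_ops(mod_sizes):
    """(additions, multiplications, comparators) for one detection."""
    points = prod(m + 1 for m in mod_sizes)
    return 9 * points, 11 * points, points - 1
```

Under these assumptions, `ejml_ops([4, 4])`, `ejml_ops([16, 4])`, and `ejml_ops([4, 4, 4])` reproduce the three operation counts quoted above, and the count grows with the product of the augmented alphabet sizes.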
2) Complexity of Decision Tree Receiver: Compared to EJML, we first take a look at the eventwise complexity of the DT shown in Fig. 2 under no phase rotations for the simple two-device QPSK scenario. As mentioned earlier, to avoid confusion with the x and y variables used in line equations, we use $s$ to represent the received symbol for the DT. For detecting $E_0$ or $E_2$, the steps are $s_r > 0$, $s_i > 0$, $T_1(s) > 0$, and $T_2(s) > 0$, resulting in four comparators, four additions, and two multiplications in total; $T_1(s) > 0$ and $T_2(s) > 0$ are line equations of the form $y - mx - c > 0$, each requiring two additions/subtractions, one multiplication, and one comparator. Similarly, for $E_1$ or $E_{1,2}$, the decision steps are $s_r > 0$, $s_i > 0$, $T_1(s) > 0$, $s_r > T_4$, $s_i > T_4$, and $T_3(s) > 0$, resulting in six comparators, four additions, and two multiplications in total. It is noticeable that the complexity of the tree varies with the event to which a received symbol belongs. Here, $E_0$ and $E_2$ have slightly lower complexity than $E_1$ or $E_{1,2}$.
To analyze how the complexity varies with higher constellation sizes, we again consider the case where $D_1$ uses 16QAM and $D_2$ uses QPSK. For the DT without phase rotations, the maximum number of boundary checks is needed for points related to events $E_1$ and $E_{1,2}$. Here, for a received symbol $s$, the boundary checks are $s_r > 0$, $s_i > 0$, $T_1(s) > 0$, $s_r > T_3$, $s_i > T_3$, $s_r > T_4$, $s_i > T_4$, and $T_5(s) > 0$. This results in four additions, two multiplications, and eight comparators, which is almost the same as that calculated above for the case where both devices use QPSK; the only difference is an increase in the number of comparators, which numbered four to six in that case. Similarly, for the case of three devices multiplexed over an RB and sporadically transmitting using QPSK, the maximum number of line checks is nine for events $E_{1,2}$ and $E_{1,2,3}$, which results in six additions, three multiplications, and nine comparators, i.e., a very low computational complexity. While the complexity increases slightly in phase rotation scenarios, the overall computational complexity remains significantly lower than that of EJML.
3) Complexity of Conventional SIC Receiver: While conventional SIC is not suitable for GF scenarios in its actual form, as the exact number of active devices is required to be known [38], we analyze its computational complexity for comparison purposes. For the two-device scenario, assuming both devices are active, the SIC receiver recovers the data of $D_1$ first, subtracts it from the superimposed signal, and finally recovers the signal of the low-power device $D_2$. Considering QPSK modulation, and ignoring the complexity of any power normalization, the QPSK demodulation of $D_1$ involves four Euclidean distance calculations followed by comparators. Since one Euclidean distance involves nine additions and 11 multiplications as shown for EJML earlier, $D_1$ QPSK decoding results in $4 \times 9 = 36$ additions and $4 \times 11 = 44$ multiplications followed by three comparators. The SIC stage for subtracting this recovered $D_1$ symbol from the received signal $s$, i.e., $s - \hat{g}_1 \hat{x}_1$, where $\hat{x}_1$ is the decoded signal of $D_1$, involves five additions and four multiplications. Finally, the recovery of the $D_2$ symbol results in $4 \times 9 = 36$ additions and $4 \times 11 = 44$ multiplications followed by three comparators, as for $D_1$. As a result, the overall decoding of the SIC receiver results in 77 additions, 92 multiplications, and six comparators as summarized in Table II, which is significantly higher than the proposed DT.
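The per-event counts above follow a simple rule: a sloped line check of the form $y - mx - c > 0$ costs two additions, one multiplication, and one comparator, whereas an axis or threshold check costs one comparator. A small sketch of that rule is given below; the function name `dt_ops` is hypothetical, and the split of the nine checks in the three-device case into three sloped and six threshold checks is inferred from the quoted totals rather than stated in the text.

```python
def dt_ops(sloped_checks, threshold_checks):
    """(additions, multiplications, comparators) for one path through the DT."""
    additions = 2 * sloped_checks            # y - m*x - c: two add/subtract
    multiplications = sloped_checks          # one product m*x per sloped line
    comparators = sloped_checks + threshold_checks
    return additions, multiplications, comparators

# Two QPSK devices, E0 or E2: two axis checks plus two sloped lines.
inner_qpsk = dt_ops(2, 2)
# Two QPSK devices, E1 or E1,2: four threshold checks plus two sloped lines.
outer_qpsk = dt_ops(2, 4)
# 16QAM + QPSK, E1 or E1,2: six threshold checks plus two sloped lines.
outer_16qam = dt_ops(2, 6)
```

The additions and multiplications depend only on the number of sloped lines on the path, which is why moving to 16QAM adds comparators but no arithmetic.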

TABLE III SIMULATION PARAMETERS
For the case where $D_1$ uses 16QAM and $D_2$ uses QPSK, the computations become 185 additions, 224 multiplications, and 18 comparators. Similarly, for the case of three devices multiplexed over an RB and transmitting sporadically using QPSK, the demodulation of the QPSK signals for the three devices and the two SIC subtraction stages result in 118 additions, 140 multiplications, and nine comparators, which are still quite high compared to the DT.
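The SIC totals can be checked with a short tally, assuming every device is active and using the per-operation costs stated in the analysis: nine additions and 11 multiplications per Euclidean distance, $M - 1$ comparators per demodulation, and five additions and four multiplications per subtraction stage. The function name `sic_ops` is an assumption made for illustration.

```python
def sic_ops(mod_sizes):
    """(additions, multiplications, comparators) for SIC over all devices.

    mod_sizes lists the modulation size of each device in decoding order.
    """
    additions = multiplications = comparators = 0
    for m in mod_sizes:
        additions += 9 * m             # m Euclidean distances
        multiplications += 11 * m
        comparators += m - 1           # pick the minimum of m distances
    stages = len(mod_sizes) - 1        # one subtraction between decodings
    return additions + 5 * stages, multiplications + 4 * stages, comparators
```

With these costs, `sic_ops([4, 4])`, `sic_ops([16, 4])`, and `sic_ops([4, 4, 4])` reproduce the three sets of figures quoted for SIC.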

B. Error Rate Analysis
This section provides the detection error rate comparison of the proposed receivers with other benchmark detectors. Unless specified otherwise, the basic simulation parameters are as per Table III, and are explained as follows.
The considered user overloading is 200%, i.e., two devices per RB, which can be randomly located anywhere in an area with a normalized distance between 0.1 (close to the AP) and 1 (at the edge of the area). The modulation type for the data transmission is considered to be QPSK, and the power ratio ($P_1/P_2$) between any two power levels is considered to be 3, i.e., power level $P_1$ is three times higher than $P_2$.¹ The activation probability of the devices, i.e., the probability that a device becomes active and transmits data, is set to be the same for all devices.² Both online and offline training results are provided. The data size for the training is therefore variable, and is explicitly given for each result in this section.

1) Benchmark Receivers for Comparison:
We consider various receivers as benchmarks for performance comparison purposes. The first obvious one is EJML [38], which is an ML-based receiver that checks all the points before making a decision on the device activity and data detection and therefore provides a lower bound on the error rate. Moreover, since the proposed DT is a data-driven receiver, and considering that there does not exist a data-driven receiver for power domain GF-NOMA that performs detection directly from the data, we analyze some existing machine learning-based classification models on the considered problem, and choose the high-performing ones for comparison with the proposed tree. Finally, we propose a modified version of the tree-based receiver suitable for frame-synchronized transmissions and compare it with the pilots-based scheme in [38].
¹ The selection of power levels for the devices is based on one of the authors' previous works [54], which provides a comprehensive analysis of the impact of power levels on NOMA detection performance.
² The activation probability in practice depends on the type of the system, where different devices may have different activation probabilities, and some devices may have much lower values. However, for simulation purposes, we have kept the same value for all devices as per existing literature. Generally, a lower activation probability results in slightly lower average error rates.
Regarding comparison with supervised learning-based classification models, supervised learning aims to learn the mapping function between some input data and its respective output (called its label) by minimizing the function approximation error. The 25 constellation points of Fig. 1 serve as the 25 labels or possible outputs, making it a multiclass classification problem. To choose the right benchmarks, a performance comparison of some commonly used classification models was first performed in MATLAB. This included classifiers such as K-nearest neighbors (KNN), DT, Naive Bayes (NB), and support vector machines (SVM).
The classifiers were trained using labeled data of 20 000 samples (800 for each of the 25 points) with a mixture of AWGN noise values ($1/N_0$ between 0 and 15 dB). $K = 5$-fold cross validation was used. The models were then tested on data across different noise levels. Using these results, the following models were chosen for comparison with the proposed decision tree: KNN (neighbors: 10, distance metric: Euclidean, distance weight: equal), NB (distribution: Gaussian), SVM Kernel approximation (box constraint level: 1, iteration limit: 1000, multiclass method: one-vs-one, regularization strength $\lambda$: 1/2e4), and DT (maximum number of splits: 100, split criterion: Gini's diversity index).
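A labeled training set of the kind described above could be generated along the following lines. This is a sketch under stated assumptions, not the authors' MATLAB setup: the function name `make_training_set`, the per-axis amplitudes `a1` and `a2`, the unnormalized signal scaling, and the choice to draw $1/N_0$ uniformly in dB for every sample are all illustrative.

```python
import math
import random

QPSK_SIGNS = [(1, 1), (1, -1), (-1, 1), (-1, -1)]

def make_training_set(a1, a2, per_point, snr_db=(0.0, 15.0), seed=0):
    """Return (sample, label) pairs for the 25 points of the two-QPSK case.

    A label is (D1 symbol, D2 symbol), with None for an inactive device.
    """
    rng = random.Random(seed)
    options = [None] + QPSK_SIGNS
    data = []
    for x1 in options:
        for x2 in options:
            re = (a1 * x1[0] if x1 else 0.0) + (a2 * x2[0] if x2 else 0.0)
            im = (a1 * x1[1] if x1 else 0.0) + (a2 * x2[1] if x2 else 0.0)
            for _ in range(per_point):
                n0 = 10 ** (-rng.uniform(*snr_db) / 10)   # 1/N0 drawn in dB
                sigma = math.sqrt(n0 / 2)                 # std per dimension
                sample = complex(re + rng.gauss(0, sigma),
                                 im + rng.gauss(0, sigma))
                data.append((sample, (x1, x2)))
    return data
```

With `per_point=800`, the set contains 20 000 samples over 25 labels, matching the training set size stated above. The real and imaginary parts of each sample would serve as the two input features of a classifier.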
Using the aforementioned benchmarks, we comprehensively investigate the performance of the proposed DT (referred to as prop.DTree) under a variety of scenarios that include perfect channel estimation case (no training), offline training (large training data size), and online training (with very small training data) under practical channel conditions with channel attenuation and phase rotations.
2) Performance Under Perfect Channel Estimation (No Training): First, we investigate the performance of the proposed DT versus EJML by assuming perfect channel estimation, i.e., no training required (or in other words a perfectly trained tree).The comparison is shown in Fig. 9, where the error rates for the two receivers are presented over a signal to noise (SNR) range from 0 to 25 dB.Specifically, Fig. 9(a) presents the average error rate of the two devices, whereas Fig. 9(b) provides their individual error rates.It is to be noted that the boundaries of the DT here are not learned through training, but by using the already given perfect received powers of the devices and their channel parameters.
It can be seen that the error rate of the DT is similar to that of EJML lower bound, 3 implying that, under perfect CSI, the DT boundaries perfectly divide the decision regions.In practical scenarios, the channel parameters need to be estimated or learned.Some works consider that the devices estimate the downlink channel through the signals broadcasted by the AP, and then adjust their uplink transmit signals accordingly to counter the channel effects and reach the desired receiver power levels at the AP.These works mostly assume the downlink and uplink channels to be similar.On the contrary, other traditional works consider channel estimation at the AP through any of the existing channel estimation methods in literature, which majorly rely on the use of some pilot signals/symbols.For our proposed DT-based receiver, we also assume the availability of pilot symbols for channel estimation and training the DT, and consider both online and offline training scenarios.For EJML, which serves as the benchmark, we mainly consider perfect CSI.
3) Performance Under Offline Training Scenarios-Practical Channel Estimation: The performance of the proposed tree is compared here with various classification models using offline training with a large data set.Offline training-based receivers may only work when the devices' channels are assumed to remain almost same once the training is completed.While this is not very practical, this gives us a good starting point to investigate various potential receivers.The selected classification models for evaluation are NB, SVM, KNN, and DT with the aforementioned hyperparameters.The results are shown in Figs. 10 and 11, where the offline training size is 10 000 symbols.In these figures, it is assumed that the transmitted symbols from the devices are attenuated but not phase shifted.Fig. 10 presents the average error rates of both devices for each receiver.The probability of occurrence of all four events is kept equal, i.e., E 0 = E 1 = E 2 = E 1,2 .To calculate the error rate for a specific SNR value, the models are trained with training data having the same SNR.That means, for calculating error at 10 dB, all models were trained with a training data with 10-dB SNR.Overall, it can be seen that some of the classification models and the proposed tree perform very well and quite similar to EJML (with perfect  channel knowledge).In particular, the proposed tree, NB, and SVM perform very close to EJML.KNN also performs close, whereas DT does not perform well.From the results, it can be seen that, for a large size training in offline case, the results of many classification models are very promising.
Similarly, Fig. 11 analyzes the individual detection error rates of D 1 and D 2 for the considered GF-NOMA model, using the same simulation settings and models as for Fig. 10. It can be seen in Fig. 11 that, for all receivers, the error rate of D 1 is lower than that of D 2 because P 1 > P 2 . It can also be seen that the average error rates shown earlier in Fig. 10 reflect the individual error rate trends for the different receivers in Fig. 11.
Overall, it can be seen that for the offline training scenario with a large training data size, the performance of many classification models and the proposed DT is almost the same as that of the benchmark EJML. However, offline training may not be practical since IoT settings, for example in a smart home or a smart factory, may still vary with time, and it would be inefficient to ask the devices to frequently transmit such a large amount of training data to retrain the models. Hence, the critical point here is to investigate how efficiently these models perform when trained online with a very small training data size.
4) Performance Under Online Training Scenarios-Practical Channel Estimation: Considering that offline training may not be practical here, and realizing the importance of online training with a small training size, this section comprehensively evaluates the performance of all these receivers over small training data sizes and a practical channel where both attenuation and phase rotation of the constellation points are present. To this end, Fig. 12 presents the average error of the DT, EJML (perfect CSI), and other classification models for a training size of 50 samples. While this is still too large for online training, it gives an insight into the high performance of the proposed tree compared to other models and its closeness to EJML. For smaller training sizes, the error for each noise value can sometimes become too high due to the accidental random selection of mostly bad quality training symbols, which significantly affects the performance of the classification models. But even with this training size, it can be seen that the proposed tree significantly outperforms the classification models and performs close to EJML with perfect CSI.
While the proposed tree outperforms other classification models for a small training size of 50 symbols, we further compare its performance with the EJML benchmark (perfect CSI) for training sizes down to 8 symbols in Fig. 13; the result for 50 training symbols from Fig. 12 is also plotted for reference. It can be seen that even with a very small training size, the DT still performs satisfactorily compared to EJML with perfect CSI.
To understand the reason behind the high performance of the proposed tree with small training sizes, some training results for the tree are shown in Fig. 14 for a range of training sizes. First, it can be seen that only the points related to E 1 and E 2 are used for training the proposed tree. Fig. 14(a) uses a training data set of 800 samples (100 samples per constellation point). It can be seen that the training mechanism for the tree does not even need samples for all points in E 1 and E 2 , and is still able to derive all boundaries even with only one sample for just one point in each of E 1 and E 2 , as shown in Fig. 14(c). Obviously, the training accuracy decreases in this case, as is generally the case in machine learning. However, it can be seen that even with two training samples, the training of the tree does a reasonable job of constructing the decision boundaries. Note that these training symbols are only used for defining the boundaries; they need to be sent only once, during the training process, and not during the detection phase.
Overall, there are two primary reasons for the high performance of the DT: 1) knowledge of the communications problem is incorporated in the fixed structure of the DT and 2) the training data is used to collectively estimate the channel state and decide the boundary lines through averages, as shown earlier in Algorithm 1. In other classification models, for a training size of 100 samples, we have (100/25) = 4 samples per constellation point on average for training. However, for the DT, given that the training data only requires E 1 and E 2 points, with the same number of 100 training samples, we have (100/2) = 50 points for each of E 1 and E 2 . Moreover, considering the training process explained earlier, all these 50 points from a device are jointly used to estimate its phase and amplitude, followed by exploiting the computed phases and amplitudes of both devices to reconstruct the whole constellation and related boundaries. This results in a significant performance improvement for the proposed DT.
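From the training plots in Fig. 14(a)-(c), it can be seen that by knowing the tree structure, the training model can successfully draw boundaries with as few as two training symbols, which can be very handy in online training during real-time communications. Obviously, this does not guarantee satisfactory model accuracy with such a small training size, which will therefore be assessed here. However, this training size is very low compared to the general classification models in Section VI-B1, which need at least one sample per point for training, i.e., at least 25 points in total. Even this is only possible when there is no validation process for the training and each training symbol is explicitly used for the training, as there is only one sample available per constellation point. Moreover, some models, such as Gaussian NB, require at least two samples per point to have some variance in the samples for each constellation point, hence a minimum of 50 samples in total. The training process of the proposed tree, on the contrary, does not require samples for every constellation symbol, and therefore can perform better with smaller training sizes.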
Considering this, we further demonstrate the error rate performance of all detectors with respect to training data size in Figs. 15 and 16. In Fig. 15, the training size is varied from 100 samples (closer to online training) to 5000 samples (offline training). It can be seen that, similar to the previous figures, the error rate of some models and the proposed tree at larger training sizes is nearly the same as that of EJML. However, for smaller training sizes (100, which is still very high for online scenarios), most of the classification models perform poorly; the best of them is SVM, which still does not perform very well. Compared to these, the proposed DT performs well for the reasons explained earlier. It is also important to highlight that the training here is done with good quality data with SNR between 12 and 15 dB, and tested on data with similar SNR.
Fig. 16 further evaluates the performance of all receivers versus EJML (with perfect CSI) over variable training sizes that start from a minimum of two symbols. Given its training mechanism and known structure, the proposed DT, besides achieving nearly perfect detection in offline training scenarios with large training data, also performs well even with a very small training data size. This substantiates its efficiency as a potential detector, especially in online scenarios where the AP can be trained on the fly.

5) Performance Comparison With Pilot-Based Methods for Special Case of Frame-Synchronized Transmissions: When transmissions are assumed to be frame-synchronized, it is possible to perform device activity detection using specially chosen pilot symbols at the start of each frame [38]. Activity detection is thus carried out on a per-frame basis rather than a per-symbol basis. For such a scenario, this section compares the performance of the proposed DT with pilot-based receivers, which rely on pilots for activity detection. In this context, a modified version of the proposed DT is compared with the S-Hybrid receiver designed in [38].
S-Hybrid relies on the transmission of pilots by active devices at the start of each data frame; the pilot used in [38] is a simple QPSK symbol (1 − 1i)/√2, which every active device transmits at the start of its frame using its allocated power level. If multiple devices are active over an RB, their pilots add up, and the receiver can use the superimposed pilots to estimate the device load over the RB and identify the active devices based on their power levels. Based on the detected activity, S-Hybrid then performs the data detection accordingly. For the two-device GF-NOMA scenario, if event E 0 is detected, the receiver treats the rest of the frame as no activity. For E 1 or E 2 , it only checks the rest of the data frame against the points related to D 1 or D 2 , respectively, for data recovery using a normal M-ary demodulator, e.g., a QPSK demodulator. Finally, for E 1,2 , when both devices are active, i.e., a NOMA received signal, only the 16 NOMA points are used for detection using a particular NOMA receiver; the work in [38] uses a JML receiver, whereas we consider both JML and SIC here in the plots. To work properly, the S-Hybrid receiver requires transmissions from active devices to be slotted and perfectly synchronized. Moreover, as the data recovery in the frame entirely depends on the detected activity, more pilots need to be sent to improve accuracy, which results in reduced throughput [38].
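To make the idea of load estimation from superimposed pilots concrete, the following is a minimal sketch of one possible decision rule: average the received pilot symbols and pick the activity hypothesis whose expected superimposed pilot is closest. This nearest-hypothesis rule, the AWGN assumption with known power levels, and the function name are illustrative assumptions; the exact rule used by S-Hybrid in [38] may differ.

```python
import math

# Pilot symbol stated in the text: (1 - 1i)/sqrt(2).
PILOT = (1 - 1j) / math.sqrt(2)

def detect_activity(pilots_rx, p1, p2):
    """Return the activity event whose expected pilot is closest to the average."""
    avg = sum(pilots_rx) / len(pilots_rx)
    # Expected pilot amplitude per event (assumes amplitude = sqrt(power), no fading).
    amplitudes = {
        "E0": 0.0,
        "E1": math.sqrt(p1),
        "E2": math.sqrt(p2),
        "E12": math.sqrt(p1) + math.sqrt(p2),
    }
    return min(amplitudes, key=lambda e: abs(avg - amplitudes[e] * PILOT))

# Three noise-free pilots from D2 only (example power levels, chosen arbitrarily).
frame_pilots = [math.sqrt(0.25) * PILOT] * 3
print(detect_activity(frame_pilots, p1=1.0, p2=0.25))
```

Averaging over more pilot symbols reduces the noise on the decision statistic, which is consistent with the remark that more pilots improve accuracy at the cost of throughput.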
The DT does not rely on frame synchronization, and the decision on each received symbol is made independently. However, a simple modification of the DT-based detector can be made to use the frame synchronization to make a frame-based decision on activity. A running counter-based modified tree is thus proposed, which uses the standard tree decision on a small set of initial frame symbols to make an overall activity decision on the frame, followed by using only the relevant tree boundaries for that activity decision for further detection. Note that no pilot symbols are required, as the frame activity decision is still based on tree decoding of data symbols.
The activity and data detection error performance of S-Hybrid and the modified tree is compared in Fig. 17 for the frame-synchronized GF-NOMA case. The data frames are of 32 symbols each. For S-Hybrid, three pilot symbols are inserted at the start of each frame for activity detection, as in [38]. Moreover, as S-Hybrid in [38] assumed perfect CSI, for a fair comparison, we consider a simple AWGN channel for both S-Hybrid and the modified tree here.
For the modified tree, the whole frame is still a normal data frame with nothing changed or inserted. However, at the receiver, the decoded data on the first seven data symbols (this number can be larger or smaller) is used to estimate the activity event for the entire frame. That is, the first seven data symbols in a frame are initially decoded independently using the proposed DT. The detected events over these seven symbols are used to make a final decision on the activity. The rest of the frame symbols, along with any wrong decisions in the initial seven, are then decoded using only the decision boundaries relevant to the detected event. For instance, if only D 1 is active over a frame (event E 1 ), the DT may decode the first seven symbols as {E 1 , E 1 , E 1 , E 1,2 , E 1 , E 1 , E 1 }. From this, an overall estimate of the frame event is made, which is E 1 . Accordingly, the rest of the frame data, and any wrongly decoded symbols in the first seven (the fourth symbol here), are decoded using only the boundary checks of the DT related to the estimated event (i.e., a subtree). For E 1 here, only the first two checks of the DT in Fig. 6, i.e., T y 0,1 (s) > 0 and T x 0,1 (s) > 0, are sufficient for data decoding of D 1 , which further reduces the computational complexity of the DT.
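The frame-level activity decision described above can be sketched as a simple vote over the per-symbol events of the first few symbols. The majority rule, the tie-breaking behaviour, and the function name are assumptions made for illustration; the text only states that the detected events over the initial symbols are used to reach a final decision.

```python
from collections import Counter

def frame_event(symbol_events, window=7):
    """Estimate the activity event of a frame from its first `window` symbol decisions.

    symbol_events: per-symbol events already produced by the (unmodified) tree.
    Ties are resolved in favour of the event seen first (an assumption).
    """
    votes = Counter(symbol_events[:window])
    return votes.most_common(1)[0][0]

# Example from the text: one symbol is wrongly detected as E1,2.
decisions = ["E1", "E1", "E1", "E12", "E1", "E1", "E1"]
print(frame_event(decisions))  # prints "E1"
```

Once the frame event is fixed, the remaining symbols (and the mis-detected fourth symbol in this example) would be decoded again using only the boundary checks belonging to that event.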
Overall, it can be seen in Fig. 17 that the modified tree and S-Hybrid perform similarly, with S-Hybrid performing slightly worse in the low-SNR regime. This can be improved by increasing the number of pilot symbols, but doing so will cause further throughput loss. Moreover, S-Hybrid employing JML and S-Hybrid employing SIC perform similarly to each other; in fading scenarios, SIC-based S-Hybrid may reach an error floor due to the error propagation in SIC. On the other hand, the modified tree performs as well as S-Hybrid but with no throughput loss, as it only uses the data symbols to improve its decision making, and can actually result in reduced complexity for the rest of the data frame. Overall, it can be seen that the DT-based slotted detector can perform better than the considered benchmark.

VII. PRACTICAL CHALLENGES, LIMITATIONS, AND FUTURE DIRECTIONS
The results above show the promising performance of the proposed technique in terms of computational complexity and active device and data detection. Here, we shed light on some of the related practical constraints and challenges, which are subject to our future work.

A. Generalization of the Decision Tree
It is important to note that the DT boundaries and the structure of the DT depend on the number of devices multiplexed over a particular RB and their modulation types/sizes. For instance, for the considered two devices per RB scenario, when both devices use the same QPSK modulation, the number of tree boundaries and the DT checks are different from the case when the lower power device uses QPSK but the higher power device uses a higher-order modulation such as 16QAM. The number of boundary lines in the latter case is larger, and the tree has slightly different checks. The same would happen if the number of devices multiplexed over an RB is increased. This means that a single tree cannot be applied directly to all scenarios and needs slight modifications to accommodate different cases. A solution to manage this is to store multiple trees at the AP for different combinations of the number of devices multiplexed over an RB and their modulation types/sizes. Given this, the AP can invoke the right tree by considering the number and modulation types/sizes of the devices when it allocates them RBs and power levels for GF transmissions.

B. Managing Larger Number of Devices or Power Levels
The number of devices supported by the proposed technique in particular, and NOMA in general, over a particular RB depends on the values of the power levels. For the proposed technique, if a larger number of devices are multiplexed over an RB, the number of constellation points for activity and data detection will increase. While the final DT computational complexity will still be low, as shown in Section VI-A2, the power levels will need to be properly designed in order for the decision-tree boundaries to be efficiently designed. To this end, it is more feasible to have two or three power levels, i.e., an overload of 200% or 300%, which is still a good connectivity gain. However, it would be interesting to look into the error bounds for the technique in higher overloading scenarios.
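The growth in the number of constellation points can be quantified with a short counting sketch. Each device is either silent or sends one of its M symbols, so K devices with modulation orders M 1 , ..., M K give (M 1 + 1) × ... × (M K + 1) activity/data combinations. This count assumes the power levels are chosen so that no two combinations produce the same received point; the function name is illustrative.

```python
def num_received_points(mod_orders):
    """Number of activity/data combinations for devices with the given modulation orders."""
    total = 1
    for m in mod_orders:
        total *= m + 1  # m symbols plus the "inactive" state
    return total

print(num_received_points([4, 4]))     # two QPSK devices
print(num_received_points([4, 4, 4]))  # three QPSK devices
print(num_received_points([16, 4]))    # 16QAM plus QPSK
```

For two QPSK devices this gives the 25 points considered in this work, and a third QPSK device raises the count to 125, which illustrates why the power levels need careful design as the overload increases.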

C. Considering Transmission Channel Dynamics
In the considered models, we mainly assume that the channel does not vary much, considering static IoT devices such as sensors. This allows us to keep the tree boundaries essentially fixed during the activity and data detection period. However, in the case of a more rapidly varying channel, the training symbols will need to be sent much more frequently (as would be the case for pilot symbols in more traditional channel equalization). Fortunately, our work has shown that the number of training symbols that need to be sent is small, making this approach feasible even for more rapidly varying channels. Alternatively, another solution is to consider the channel between each device and the AP as reciprocal, so that the devices can estimate their channel to the AP using the pilot signals periodically broadcast by the AP, and then adjust their transmitted signal accordingly to achieve the required power level at the AP [55], [56]. In future work, it will be interesting to see how the DT model can be modified to accommodate rapid channel variations and optimize the boundary lines accordingly.
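As an illustration of the reciprocity-based alternative, the sketch below shows a device pre-compensating its symbol with a downlink channel estimate so that the symbol arrives at the AP with the allocated power level and no phase rotation. Full channel inversion is only one possible way to do this and is an assumption here; the schemes in [55], [56] may differ, and a practical device would also have to respect its transmit power limit.

```python
import math

def precompensate(symbol, h_est, target_power):
    """Scale and rotate `symbol` so that h_est * output has power `target_power`.

    h_est: complex channel estimate obtained from the AP's broadcast pilots
           (assumed equal to the uplink channel by reciprocity).
    """
    if h_est == 0:
        raise ValueError("channel estimate is zero; inversion is undefined")
    return math.sqrt(target_power) * symbol / h_est

h = 0.3 - 0.4j                   # example channel, chosen arbitrarily
s = (1 + 1j) / math.sqrt(2)      # unit-energy QPSK symbol
tx = precompensate(s, h, target_power=4.0)
received = h * tx                # equals sqrt(4.0) * s up to rounding
```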

VIII. CONCLUSION
This article focuses on a novel low-complexity data-driven receiver design for joint activity and data detection in uplink GF-NOMA, considering an IoT scenario with sporadic transmissions where devices can transmit their data in an arrive-and-go manner without going through any grant-access procedure. Conventional NOMA receivers cannot be applied in such scenarios, and the benchmark exhaustive search-based optimal EJML receiver suffers from significant computational complexity. By exploiting the structure of the received signal constellation and identifying the optimal decision boundaries, low-complexity DT-based receivers are presented. It is shown that the proposed receivers perform very close to, or the same as, the EJML receiver, and better than some other typical classification model-based detectors for GF-NOMA. Moreover, with a slight modification, the proposed receiver can also be applied to a frame-synchronized scenario, where it outperforms the considered pilot-based detectors that suffer from throughput loss. Comprehensive simulation results are provided to show the performance of the proposed detector in terms of its detection efficiency and parameter learning with minimal training symbols.
While the results are promising, some practical challenges and constraints for the technique and power domain NOMA are also discussed, including generalization of the decision-tree model, managing a higher number of power levels or modulation sizes, and considering transmission channel dynamics. Some possible solutions to tackle these challenges are suggested, whereas detailed insight into these challenges and their potential solutions is subject to our future work.

Fig. 3. DT for active device and data detection under ideal channel conditions; no phase rotations.

---- Channel estimation and constellation construction ----
4: Place all QPSK symbols from D 1 and D 2 in some subsets S 1 and S 2 , respectively, such that S 1 ∪ S 2 = S
5: for i = 1 : length(S 1 ) do    ▷ for D 1 related points
6:     Calculate |S 1 (i)| & θ(S 1 (i))    ▷ amplitude and phase
7: end for
8: Compute average |S 1 | and θ(S 1 )    ▷ |S 1 | represents the received power of D 1 and θ(S 1 ) the phase rotation
9: Do the same for D 2 points in S 2 to calculate the amplitude |S 2 | and phase θ(S 2 ) for D 2
10: Using |S 1 |, θ(S 1 ), |S 2 | and θ(S 2 ), produce the 25 possible points through their combinations
---- Computing the boundary lines ----
11: Using the four D 1 constellation points, compute the four centres, one each between two of the four points.
12: Connect the opposite centres to draw the two x − y lines T
13: Follow the previous two steps using the four D 2 constellation points to draw the two x − y lines T x 0
14: Using the centres between each pair of the D 2 points, compute the equations for the small square boundary lines located across the origin, i.e., T k 2 , k = 1 : 4.
15: Using the set of four edge points in each of the outer clusters around D 1 points, and similar to the previous two steps, compute four sets of x − y axis equations T jk 4 , j ∈ {x, y}, k = 1 : 4 for the outer clusters, and the four sets of square boundary lines T k 3 for each outer cluster.
16: Finally, regarding the two lines T k 1,1 and T k 1,2 in each Q in place of the previous big square lines, use the previously calculated |S 1 |, θ(S 1 ), |S 2 | & θ(S 2 ) to find the two outer cluster points in each Q closest to the one D 2 point in that quadrant.
17: Compute perpendicular lines to each pair between the D 2 point and the two closest outer cluster points to find T 1,1 and T 1,2 .
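The channel-estimation part of the procedure above (steps 4 to 10) can be sketched as follows. The sketch assumes that the transmitted training symbols are known at the AP, so that the modulation can be removed before the phase is averaged, and it uses a circular mean for the phase to avoid wrap-around; neither detail is stated in the listing, and the function names are illustrative. The boundary-line computation (steps 11 to 17) is not covered.

```python
import cmath
import math

# Unit-energy QPSK alphabet (assumed mapping).
QPSK = [complex(a, b) / math.sqrt(2) for a in (1, -1) for b in (1, -1)]

def estimate(received, sent):
    """Average amplitude and phase rotation of one device from labelled samples."""
    amp = sum(abs(r) for r in received) / len(received)
    # r * conj(s) removes the known QPSK phase; summing unit vectors gives a circular mean.
    rot = cmath.phase(sum(r * s.conjugate() / abs(r) for r, s in zip(received, sent)))
    return amp, rot

def build_constellation(amp1, rot1, amp2, rot2):
    """Reconstruct the 25 points: idle, D1 only, D2 only, and both active."""
    d1 = [cmath.rect(amp1, rot1) * s for s in QPSK]
    d2 = [cmath.rect(amp2, rot2) * s for s in QPSK]
    return [0j] + d1 + d2 + [a + b for a in d1 for b in d2]

# Noise-free example: D2 attenuated to amplitude 0.5 and rotated by 20 degrees.
h2 = cmath.rect(0.5, math.radians(20))
amp2, rot2 = estimate([h2 * s for s in QPSK], QPSK)
points = build_constellation(1.0, 0.0, amp2, rot2)
```

Because every training sample of a device contributes to the same two estimates (one amplitude and one phase), the estimates improve with the total number of samples rather than with the number of samples per constellation point.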

Fig. 9. Error rate of EJML and proposed tree under zero phase rotation and perfect channel estimation (no training). (a) Average errors comparison. (b) Individual errors comparison.

Fig. 10. Average device error rate of both devices for different receivers; offline (10 000 training symbols per noise value), no phase rotation.

Fig. 11. Individual device error rate of both devices for different receivers; offline (10 000 training symbols per noise value), no phase rotation.

Fig. 12. Average error rate performance comparison with 50 training samples; no relative phase rotation between D 1 and D 2 .

Fig. 13. Average error rate comparison between the proposed tree (different training sizes) and EJML (perfect CSI).
(a)-(c), it can be seen that by knowing the tree structure, the training model can successfully draw boundaries with as few as two training symbols, which can be very handy in online training during real-time communications. Obviously, this does not guarantee satisfactory model accuracy with such a small training size, which will therefore be assessed here. However, this training size is very low compared to the general classification models in Section VI-B1, which need at least one sample per point

Fig. 14. Channel estimation and training results for the proposed tree with phase rotations; D 1 not rotated, D 2 rotated by 20°. (a) Training size is 800; 400 for each of E 1 and E 2 . (b) Training size is 8; 4 for each of E 1 and E 2 . (c) Training size is 2; 1 for each of E 1 and E 2 .

Fig. 15. Average error rate versus variable training size: 100 to 5000 samples; training and testing data with high SNR.

Fig. 16. Average error rate versus training size: 2 to 250 samples; training and testing data with SNR between 12 and 15 dB.

Fig. 17. Error rate comparison of modified tree receiver and pilot-based detector (S-Hybrid [38]) in frame-synchronized transmissions; three pilot symbols for S-Hybrid. (a) Average error rates comparison. (b) Individual error rates comparison.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
and E 1,2 means that both D 1 and D 2 are active, i.e., y = h 1