Next Generation Multiple Access: Performance Gains From Uplink MIMO-NOMA

The use of multiple-input multiple-output (MIMO) non-orthogonal multiple access (NOMA) based communication protocols is proposed and investigated for the uplink of wireless networks with buffered data-sources, which is the basis of the introduced medium access control (MAC)-layer protocol. To this end, the long-term average throughput is maximized by optimizing the set of users that transmit information at each time slot and their transmit power, the number of packets that are admitted in each user’s queue, and the transmission rates, assuming that the instantaneous channel state information is not available at the transmitters. Also, considering a receiver with multiple antennas, two detection techniques are used to mitigate the interference when two users are chosen to simultaneously transmit information in the same resource block, namely successive interference cancellation (SIC) and joint decoding (JD). More specifically, the outage probability for both considered techniques is derived in closed-from, which is a prerequisite for the derivation and the optimization of the throughput. The formulated multi-dimensional long-term stochastic optimization problem is solved by using the Lyapunov framework. Finally, simulation results verify the gains by using MIMO-NOMA as the basis of the next generation multiple access and illustrate the superiority of JD compared to SIC with respect to the number of the receiver’s antennas.

access (NOMA), according to which, as opposed to orthogonal multiple access (OMA), multiple users are served in each orthogonal resource block, e.g., a time slot, a frequency channel, a spreading code, or an orthogonal spatial degree of freedom [2], [3], [4]. NOMA is based on well-established information-theoretic techniques, such as superposition coding, successive interference cancellation (SIC), and the message passing algorithm [5], [6]. Interestingly, several NOMA-based techniques have been proposed, including power-domain NOMA (PD-NOMA), multicarrier NOMA, sparse code multiple access, pattern division multiple access, low density spreading, lattice partition multiple access, and interleave division multiple access [2], [3], [4], [7]. Among them, PD-NOMA has received considerable attention due to its potential to achieve the capacity of the broadcast and multiple access channel. In PD-NOMA, multiuser detection techniques, such as SIC, are used for mitigating the impact of interference at the decoding of the received messages [8]. More specifically, SIC can mitigate interference, by subtracting the signals that have already been decoded from the received superposed signal. For single-antenna systems, some indicative works that investigate the advantages of PD-NOMA compared to orthogonal multiple access are [9], [10], [11], [12]. Considering the increasing decoding complexity of PD-NOMA with the number of multiplexed messages, a practical form of PD-NOMA is with hybrid user-pairing, according to which solely up to two users can use the same resource block, while different pairs of users are multiplexed by using orthogonal resources [13]. This form of NOMA is also aligned with its implementation in the 3rd Generation Partnership Project (3GPP) specifications [14]. Moreover, it has been shown that NOMA can be used together with multiple-input multipleoutput (MIMO) techniques to further increase the spectral efficiency and connectivity, exploiting the characteristics and advantages of both technologies [15]. For example, in [16], the signal alignment concept has been used to develop a novel MIMO-NOMA framework, the performance of which has been evaluated assuming randomly deployed users and interferers. Also, in [17] a novel minorization-maximization method has been introduced for optimizing the sum rate in NOMA downlink. In [11], the use of quasi-degradation concept for downlink NOMA has been investigated and a lowcomplexity sequential user pairing algorithm was introduced to improve the system performance in terms of total transmission power of the base station (BS). Also, in [18] and [19] it was illustrated that downlink MIMO-NOMA outperforms MIMO-OMA in terms of both sum-rate and ergodic sumrate. Also, in [20], the trade-off between energy efficiency and spectral efficiency has been explored, considering the use of downlink in a multi-input single-output (MISO) system and perfect channel state information at the BS. However, it deserves to be mentioned that most of the aforementioned works solely focus on downlink NOMA and cannot be directly extended to the case of uplink NOMA.
Although downlink NOMA has been well-investigated in existing literature, uplink NOMA -and especially uplink MIMO-NOMA-has not been sufficiently investigated, despite its high practical value for a vast number of wireless applications, such as IoT and teleconference. It is highlighted that the concept of uplink NOMA presents important differences compared to downlink NOMA, which is due to the fact that in the uplink the interfering messages are received from different sources, i.e., via different links. Also, the implementation of NOMA in the uplink does not increase complexity at the mobile users' side, since joint processing is only applied at the BS. Indicatively, the ergodic sum-rate gain of NOMA over OMA has recently been investigated in the pioneering work [21], providing useful insights for the scaling of the gain as the number of antennas increases. Interestingly, in uplink PD-NOMA, SIC can still be effectively applied [22], [23], which also is the basis for Vertical-Bell Laboratories Layered Space-Time (VBLAST) [24]. However, when jointdecoding (JD) is used instead, the received signals of all active sources are jointly decoded and the capacity region of both Gaussian and fading multiple access channel can be achieved [25], [26]. This characteristic has been exploited by [27] in order to design a novel scheme that improves the performance of wireless powered networks in terms of throughput and fairness among users. More specifically, in [27] the sum and the minimum achievable among users has been optimized for both the cases of SIC and JD, assuming perfect channel state estimation and that all nodes are equipped with a single antenna, while the performance gain compared to OMA has also been illustrated. The sum-rate maximization has also been investigated for the case of distributed uplink NOMA in [28], where the use of multiple remote radio heads has been assumed. Although the outage probability of JD was investigated long before the concept of NOMA became popular [26], the outage probability for uplink MIMO-NOMA is particularly challenging and has not been investigated before in the existing literature either for the case of SIC or JD, since the recent work [21] has only focused on the expression of ergodic sum-rate for the case of SIC. However, deriving the expression of the outage probability for such systems has high practical value, since it constitutes the basis to design and optimize practical medium access control (MAC) schemes, where the instantaneous channel state information (CSI) is not available at the transmitters.
Moreover, a different approach to further improve the efficiency of wireless networks is to focus on increasing the long-term quality of experience of all users, instead of focusing on maximizing a common instantaneous data rate among them. This is facilitated by incorporating the delay constraints of data-driven applications jointly with conventional communication resource-constraints, i.e., power, time, frequency, and antennas to further boost performance. To this direction, data-buffer-aided communications can be used, such as in [29], [30], in which the knowledge of buffer-states is employed to prioritize users with long queues. Thus, users avoid to transmit information when their average channel conditions are weak and the queue length is relatively low.
In general, buffers can be used either at the sources or at the relays at cooperative communication networks. In [31], both a single link and a multihop sensor network has been considered, in which the transmitting nodes are equipped with a data and an energy buffer, focusing on the maximization of the long-term average sensing rate. In [32], the outage probability of buffer-aided relaying system with downlink NOMA has been derived. Moreover, in [33], the long-term average network utility has been optimized, by also assuming a cooperative downlink NOMA system, where a source serves multiple users through a buffer-aided relay. Also, in [34], [35], [36], [37], relay and mode selection schemes for buffer-aided cooperative communication systems have been proposed, focusing on downlink NOMA and the use of buffers at the relays. However, the use of buffered-sources in uplink NOMA systems has not been sufficiently investigated by existing literature. More specifically, it has been solely considered by [38], [39], within the context of wireless networks with energy harvesting, based on the assumption that instantaneous channel state information is perfectly and globally known, for which outage never occurs. It also deserves to be mentioned that resource allocation is particularly challenging and conventional methods cannot be used to optimize the long-term performance when buffers are used, due to the stochastic and combinatorial nature of the corresponding problems. In addition, in order to optimize uplink NOMA for the case of a general multi-antenna receiver, the derivation of the outage probability is needed.

B. CONTRIBUTION
The main objective of this work is to answer the following question: Is NOMA useful in the uplink of realistic communication systems with buffered sources when a general multi-antenna receiver is used? It is noted that this is a very critical question, since, as it has been widely accepted, the adoption of NOMA in beyond 5G systems, as the basis of next generation multiple access, depends on the provision of non-incremental gains, especially in systems with multiple antennas at the base station (BS) [40], [41]. To reply to this question, the outage probability of uplink MIMO-NOMA for both considered decoding techniques, i.e., JD and SIC, is derived, under the practical assumption that instantaneous CSI is not available at the transmitters. Also, a practical medium access control (MAC)-layer protocol is designed which is necessary in order to quantify the performance gain. More specifically, the contribution of this paper is summarized as follows: • A practical MAC-layer protocol is proposed that is based on NOMA with hybrid user-pairing, taking into account the packet arrival and departure processes, as well as the coordination requirements. It is noted that both pure NOMA with user-paring and OMA can be considered as special cases of the proposed protocol, since in the latter the users can choose if they prefer to transmit information simultaneously or in different time slots.
• To enable the derivation of the long-term average transmission throughput in closed-form, the outage probability, for joint transmission of N transmit antennas to two non-collocated receive antennas, is derived for both considered detection techniques, i.e., JD and SIC. As an application of this calculation, we obtain the diversity-multiplexing tradeoff exponents for both schemes. • The long-term average of the weighted sum of users' throughput is maximized for both SIC and JD, by optimizing the set of users that transmits information at each time slot and their transmit power, the number of packets that are admitted in each user's queue, and the transmission rate. To this end, the formulated multidimensional long-term stochastic optimization problem is solved by using the Lyapunov framework. The proposed solution is appropriate for the practical case, since for its implementation solely the buffer states and the statistical CSI are needed, both of which vary in a comparatively slow rate, reducing the complexity and the required overhead. • Simulation results are presented to evaluate the performance of the proposed framework and validate the analytical results. To this end, the proposed framework is compared to the special case of OMA, according to which solely a single user can be scheduled to transmit information in each time slot, as well as to the optimal scheduling scheme assuming that the scheduler is not aware of the queue sizes. Specifically, it is shown that the proposed schemes -and especially the one that is based on JD-significantly outperform the considered benchmarks in terms of long-term average throughput and blocking probability.

C. STRUCTURE
The rest of the paper is organized as follows: Section II describes the system model and the proposed communication protocol. In Section III, the outage probability for both SIC and JD is derived, considering the general case of a multiantenna receiver. In Section IV, the proposed scheduling and resource allocation framework is developed. In Section V, simulation results are provided to corroborate the derived analytic results and to illustrate the performance of the proposed transmission algorithm. Finally, closing remarks and discussions are provided in Section VI.

II. SYSTEM MODEL
We consider the multiple access uplink system model depicted in Fig. 1. In particular, an M-user network is considered where each user is equipped with a data buffer and an admission control mechanism that determines if a packet will be transmitted or discarded, a queue that stores the packets to be transmitted and a transmission subsystem that transmits the available data according to scheduling, transmission rate selection and resource allocation decisions. Also, the BS is assumed to be equipped with N antennas. The considered system model is of high practical value in any uplink communication scenario, in which single antenna users transmit information to a multi-antenna BS. This is because, in practice, all communication nodes contain a buffer, although this issue in often overlooked in the existing literature. Of course, different applications require different buffer sizes and delay requirements, which are both taken into account in this work. Among others, the proper use of buffers is critical in IoT applications, including data collection in smart grids, manufacturing, smart cities and communities, smart food and farming, and healthcare applications [42]. Also, the considered system model could correspond to a scenario in which multiple aerial (e.g., unmanned aerial vehicles) or marine nodes communication with a terrestrial BS. The aforementioned scenarios are of paramount importance in beyond 5G networks, which is envisioned to be the communication platform of the Next generation IoT [43], as well as to achieve the convergence of terrestrial, aerial, and marine communications [44].
We further assume that the system operates over a slotbased structure. In more detail, the timeline is divided into successive non-overlapping time slots, denoted by t, with t = 0, 1, 2, . . . corresponding to the time interval [t, t + 1). Moreover, t and t + 1 are defined as the "beginning" and the "end" of the time slot, respectively, while S d is the time slot duration. It is assumed that the number of packets that arrives to queue of user i in a specific timeslot depends on the specific application. Let A(t) = (A 1 (t), . . . , A M (t)) denote the process of random packet arrivals, where A i (t) is the number of packets that arrive to user i during slot t. It is further assumed that only part of the aforementioned packets are admitted in the queue of user i, with the corresponding number being equal to R i .
We assume that PD-NOMA with hybrid user-paring is used, i.e., at most two users are allowed to transmit concurrently in one time slot, while either JD or SIC is applied at the receiver. Hereinafter, z i (t) ∈ {0, 1} denotes whether the i-th user is scheduled to transmit information with transmission rate ρ i (t), while P i (t) represents the power allocated to the i-th user at the t-th time slot. Moreover, without loss of generality, the use of a single frequency channel is assumed. Furthermore, the vectors ρ = {ρ 1 (t), . . . , ρ M (t)}, P(t) = {P 1 (t), . . . , P M (t)} and z(t) = {z 1 (t), . . . , z M (t)} contain all the selected transmission rate, power allocation and scheduling decisions, respectively, for the M users at t-th time slot.
The channel gain between the i-th user and the n-th receive antenna is given bỹ where ω i (t) is the channel state that models large scale phenomena, such as atmospheric processes or shadow-fading, and varies stochastically according to the Markov chain defined by (2).
It is further assumed that the channel state process ω(t) remains constant within a time slot, and evolves according to a general irreducible discrete time Markov chain with a finite state space . Hence, the stationary distribution of ω(t) is given by where π j represents the steady state probability distribution of the Markov chain over states j ∈ . This channel model has been adopted because it is quite generic and can capture, in an abstract way, all channel effects that affect the propagation of signals in a wireless channel, such as large scale fading phenomena, mobility conditions or weather effects, by just modifying the underlying transition probabilities of the Markov chain. In the literature, there exist several works that adopt the same approach in different propagation scenarios (see for example [45], [46], [47]). It should be noted that the presented transmission protocol does not require the knowledge of the underlying transition probabilities of the channel state Markov chain, which may be hard to obtain; only the current channel state knowledge is required, which in practice is slowly varying. Next, more details, about the arrival processes, the departure process, and the queuing dynamics are provided.

A. ARRIVAL PROCESS
As it has already been mentioned, at the end of each time slot, user i has to take an admission control decision on the number of the packets to be admitted in its queue, For simplicity, it is assumed that all packets have fixed size, denoted by S p . Also, A i (t) cannot exceed a maximum value defined by the application, i.e., A i (t) ≤ A max for all users i and slots t. Moreover, let λ = E[A(t)] = {λ 1 , . . . , λ M } include the rate of the packet arrival process for all M users. The arrival vector A(t) is assumed to be independent and identically distributed (i.i.d.) over slots. Further, the arrival processes A i (t) for different users in each slot are assumed to be independent.

B. DEPARTURE PROCESS
At the beginning of time slot t, the scheduling scheme defines which users are allowed to transmit according to the vector z(t). According to the main principle NOMA with hybrid user-pairing, it is assumed that at most two users are allowed to transmit simultaneously in each time slot, hence, it holds that The selection of z i is followed by transmission rate selection and a power allocation control decisions, which are received according to the vectors ρ(t) and P(t), respectively. Considering a peak power constraint for each user, it must hold that with i = 1, . . . M and P i (t) being selected from a set P, the maximum value of which is P max . Furthermore, it is assumed that the transmission rates are selected from the set , depending by the coding schemes available at the users' transmitters, with the corresponding maximum value being equal to ρ max . It is noted that the resource allocation is performed based on the channel states that correspond to large-scale phenomena, while the instantaneous CSI is not available at the users, thus, outage may occur. According to this, the departure processes from each user's queue can be modeled as random processes that depend on the transmission rate selection ρ(t), the resource allocation P(t) and scheduling decisions z(t) as well as the channel state ω(t), and are expressed as where W is the available bandwidth and 1 i (P(t), z(t), ω(t), ρ(t)) are random indicator functions that represent the random success transmission outcomes of user i. Thus, μ i (t) represents the number of the stored packets that depart from the queue and is not necessarily integer, due to the coding rate that is subject to optimization. Moreover, it holds that

), z(t), ω(t), ρ(t)) = E 1 i (P(t), z(t), ω(t), ρ(t))
where P out,i (P i (t), P j (t), ρ i (t), ρ j (t), ω(t)) denotes the outage probability when the i-th and j-th users are scheduled to transmit, which depends from the power allocated to these users and the selected transmission rates, i.e., P i (t), P j (t), ρ i (t) and ρ j (t) respectively, and the channel state ω(t), while E[ · ] denotes expectation. More details regarding the outage probability are given in the next section.

C. QUEUING DYNAMICS
Let Q(t) = (Q 1 (t), . . . , Q N (t)) denote the number of packets currently stored in each of the N queues. In view of the above, each queue evolves as where . It is noted that although the number of packets that is admitted in the queue is integer, Q i (t) is not necessarily integer, since the coding rate is also subject to optimization. This is because the number of packets that are admitted in the queue is measured assuming that all admitted packets have fixed size (S p ) and each packet only contains information bits. However, as it has already been mentioned, the packets that depart from the queue do not necessarily contain the same number of information bits, i.e., their equivalent size is not necessarily equal to S p . Thus, although the number of packets that depart from the queue is also an integer in practice, μ i is not an integer, since it corresponds to the equivalent number of packets with size equal to S p .

III. OUTAGE PROBABILITY
To make further analytical progress, we need to determine the statistics of the channel and specifically the outage probability appearing in (7). In this section, we provide details on the outage probability of each user for a given state ω(t), assuming, without the loss of generality, that users 1 and 2 are scheduled to transmit with transmission rates ρ 1 (t) and ρ 2 (t) and allocated power P 1 (t) and P 2 (t), respectively. As it has already been mentioned, the received signals can be decoded either successively, as in SIC, or simultaneously, as in JD. The challenge in the calculation, as we shall see, lies in the fact that the channel matrix has size N × 2, for N antennas at the receiver, a case for which very few results currently exist. In more detail, the optimal outage probability is achieved when joint decoding between the two users is possible. When JD is adopted, the receiver simultaneously decodes all messages by using the random coding technique [48]. In the case of JD, all rate pairs that belong in the capacity region of the multiple access channel are achievable. In more detail, assuming that the BS's antennas are uncorrelated, perfect CSI is available at the BS, and following [49], the outage probability for the case of JD for user 1 can be defined as The physical interpretation of (9) can be realized by considering the capacity region of the multiple access channel (MAC), which is illustrated in Fig. 2. More specifically, user 1 is in outage when the pair of the target rates (ρ 1 , ρ 2 ) belongs either in the area AB or Bc 1 ρ 1 . This is because every pair of target rates inside the region c 1 BAc 2 is achievable, while if ρ 1 <c 1 , the user of message 1 can be decoded by handling the message of user 2 a interference. The key difference compared to the usual capacity region analysis is that in Fig. 2 the quantities c 1 ,c 1 , c 2 ,c 2 are random and we evaluate the probability their values to be within a given range.
In (9), c 1 , c 2 , and c 12 are given by where s i is the transmit signal-to-noise ratio (SNR) defined as with N 0 being the noise power, and h i is the N-dimensional channel vectors of users i ∈ {1, 2}. The first two expressions correspond to the capacities of users 1 and 2, respectively, in the absence of each other, while c 12 is the sum capacity for both users. In addition,c 1 andc 2 can be expressed as corresponding to the maximum rates of users 1 and 2, when treating each other as interference. Since user 1 will be in outage when either he cannot decode his message even in the absence of user 2, i.e., if ρ 1 > c 1 , or when neither of the two users can decode their respective messages jointly or separately, i.e., in the parameter region where the inequalities (ρ 1 + ρ 2 > c 12 , ρ 1 >c 1 and ρ 2 >c 2 ) hold. Also, the outage probability of user 2, i.e., P JD out,2 is defined by exchanging indices 1, 2.
On the other hand, a simpler but quite effective decoding methodology is VBLAST, first suggested in [24]. Although, strictly speaking VBLAST is a demodulating algorithm, it can be seen as the generalization of SIC for multiple receive antennas. In SIC, a user's signal is decoded first by treating the other signal's as noise. If the corresponding rate that can be achieved for this user belongs in the capacity region, his message can be successfully decoded and subtracted from the observation when decoding subsequent messages. In general, SIC cannot achieve any point of the capacity region of the multiple access channel, but it can achieve all corner points and sets of rates that are dominated by the corner points. It is noted that different corner points correspond to different decoding orders. Since the receiver (the BS) has perfect CSI, an optimal decoding order can be chosen. Thus, the outage probability for the case of SIC can be defined as According to (13), user 1 is in outage either when ρ 1 >c 1 and ρ 2 >c 2 hold, in which case neither user can decode their message alone, or when ρ 2 <c 2 and ρ 1 > c 1 hold, in which case again user 2 can decode its message separately, but user 1 cannot. It is worth noting that the difference with the JD is that the VBLAST (SIC) system is in outage when ρ 1 >c 1 , ρ 2 >c 2 , but at the same time when ρ 1 + ρ 2 < c 12 . Also, P SIC out,2 is defined by (13) by exchanging indices 1, 2.
Similarly to the above discussion for JD, here the physical interpretation of (13) can be obtained from Fig. 2. More specifically, user 1 is in outage when the pair of the target rates (ρ 1 , ρ 2 ) belongs either in the area ACB or Bc 1 ρ 1 . This is because if ρ 1 <c 1 , the user of message 1 can be decoded no matter what user 2 does, while in the region c 1 CBc 1 user 2 can be decoded first and hence also user 1. Thus in the complement of these regions user 1 cannot be decoded using SIC.
For notational convenience, we also define the constants y 10 , y 20 , y 30 through the following equations: According to (14), y 10 , y 20 , y 30 are the values of SNR that correspond to the target rates, since ρ 1 , ρ 2 , ρ 1 + ρ 2 , with ρ 1 and ρ 2 being the target rate (or equivalently the transmission rate) of user 1 and 2, respectively. We also define the functions where (N) is the Gamma function and γ (x, N) is the corresponding normalized incomplete Gamma function defined through Based on the above, we may now express the outage probability for each user for the two considered decoding schemes, namely JD and SIC.
Theorem 1 (JD Outage): For s 1 = s 2 , the outage probability for user 1 when user 2 is present can be expressed as When s 1 = s 2 the outage probability can be evaluated from above by taking the limit s 1 → s 2 ≡ s, resulting to Similar expressions for P JD out,2 is obtained by exchanging indices 1, 2.
Proof: The proof is provided in Appendix A.
where y * 1 = while y * 2 is obtained by exchanging the indices 1, 2 in the expression for y * 1 . In the limit that y 10 y 20 = 1, the last term vanishes since g N (x) → 0, when x → ±∞.
A similar expression for P SIC out,2 is obtained by exchanging indices 1, 2.
Proof: The proof is provided in Appendix B.
The above exact expressions for the outage probabilities will allow us to facilitate the transmission algorithm, which is the focus of this paper. However, they are interesting on their own right. For example, we may re-obtain the diversitymultiplexing tradeoff exponents for the optimal JD as well as the suboptimal VBLAST demodulating algorithms, including their coefficients analytically. To do so, we take the large SNR limit, setting for convenience s 1 = s 2 = s and y 10 = y 20 = s r , so that, to leading order the rates ρ 1 = ρ 2 = r log s, with r ≤ 1. Then, starting with the case for SIC and asymptotically expanding each term, we see that the leading term in (19) is always the last, so that we have confirming that the diversity exponent of SIC is d SIC = (N − 1)(1 − r) [50]. The optimal JD case is more interesting. For N ≥ 2, the leading term in the outage is always the first in (18), thus getting thus having diversity exponent d JD,N≥2 = N(1−r). However, for N = 1, due to fact that the argument of the function g N (·) is s 2r−1 , which can be large or small depending on whether r < 1/2, the term proportional to g N (·) may be O(1) if r > 1/2, and hence with finite outage probability, or proportional to s 2(2r−1) otherwise. Comparing this term with the first, we finally get for N = 1

IV. PROBLEM FORMULATION AND TRANSMISSION ALGORITHM
After obtaining closed-form expressions for the outage probabilities, this section aims at designing a MAC-layer protocol that maximizes the users' throughput by taking the appropriate control decisions in each time slot, while also considering different priorities among users, as well as ensuring that the scheduling and power constraints are satisfied. As it has already been mentioned, the optimal control decision in each time slot needs to be taken based on the information about the queue sizes and the channel states that correspond to large scale phenomena. To this end, the corresponding stochastic optimization problem is formulated and solved by using the well-established Lyapunov optimization framework [51]. In addition to its practical value the development of such a MAC-layer protocol is necessary to evaluate the potential of NOMA in practical buffer-aided uplink systems. Specifically, the communication, simultaneously to two users, creates the risk of excessive interference due to the non-orthogonality of the channel. Hence, a robust MAC-layer protocol, which evaluates both the channel diversity of the multiple users, as well as the buffer-state information, can help alleviate such adverse conditions. While the increase of the number of receive antennas is expected to gradually diminish the non-orthogonal character of the system, it is important to quantify this potential improvement.

Consider any policy that makes admission control R(t) = {R 1 (t), . . . , R M (t)}, scheduling z(t), transmission rate selection ρ(t), and power allocation P(t) decisions. A policy is
called admissible if the user queues are mean-rate stable, i.e., the output and input average rates of each queue are equal. Under an admissible policy, each user obtains a long-term average transmission throughput The goal of the proposed optimization framework is the selection of a feasible admissible policy that maximizes the weighted sum of average users' throughput, i.e., M i=1 θ iRi , where {θ i } M i=1 are positive coefficients provided by the upper layers that facilitate the assignment of different priorities to different users and provides certain notions of fairness. Thus, the corresponding optimization problem can be formulated as max

P(t),z(t),ρ(t),R(t),∀t
whereμ It deserves to be noted that since the formulated optimization problem in (25) is based on the values of the large scale fading coefficients, which are not frequencyselective, the analysis can easily be extended to the case of multiple frequency channels. This is because the channel state that corresponds to user i has the same value for all different frequency channels. Consequently, the problem of throughput maximization multiple frequency channels can be decomposed to the throughput maximization problem at each frequency channel, which corresponds to (25).
Taking into account well-known results on stochastic network optimization [51], [52], [53], an optimal control policy for the above optimization problem can be obtained by solely considering the class of stationary, randomized policies which take control actions based on a stationary probability distribution over the control action set, ignoring past history [53]. In more detail, given that set of admissible policies is non-empty, it is proved that there exists a stationary, randomized policy, which takes control decisions in each time slot purely as a randomized function of the rate of the arrival process and channel state, maximizing the objective function of (25), while satisfying the constraints C 1 − C 6 .
In the considered setup though, there are multiple channel states, while the aforementioned approach to solve the optimization problem requires complete knowledge of the channel state statistics which may be difficult or impossible to be obtained. In order to avoid such a requirement, the Lyapunov optimization framework will be used to solve (25) [51], which is based only on the queue sizes and the observations of the channel state in each time slot. Moreover, the Lyapunov optimization framework offers a mechanism to regulate the trade-off between increasing throughput and reducing delay.

B. MINIMIZING THE "DRIFT PLUS PENALTY"
According to the Lyapunov optimization framework, a "Drift Plus Penalty" expression is derived that is minimized for every time slot [51], [52], [53]. Let Q(t) represent the system queue backlog at time slot t and L(Q(t)) denote the quadratic Lyapunov function that is defined as Moreover, let the state ω 0 ∈ be defined as the renewal state of . Therefore, the sequence {T r } ∞ r=0 , which represents the recurrences times to state ω 0 , is i.i.d., while E{T r } and E{T 2 r } are finite constants given by [54] and respectively, with T being a random variable that follows the stationary distribution of T r . Let t r denote the time of the r-th revisitation to a specific state of the Markov chain, i.e., t r = r−1 j=0 T j . Next, the variable-slot Lyapunov drift is defined as the expected change in the Lyapunov function from renewal time t r to renewal time t r+1 conditioned on Q(t r ), i.e., (Q(t r )) = E L(Q(t r+1 )) − L(Q(t r ))|Q(t r ) . (30) According to the T-slot Drift lemma presented in [51, pp. 73] and using (8), the variable-slot Lyapunov drift can be upper bounded as where B is a finite constant that satisfies for all t Considering (3), as well as the inequalities R i (t) ≤ A max and μ i (t) ≤ ρ i,max , it follows that the above equation can be satisfied by choosing By adding the penalty term −VE[ t r +T r −1 ] to both sides of (31), where V is a utility-delay trade-off parameter, and after some basic algebraic manipulations, the "Drift Plus Penalty" is derived as (34) According to the Markov modulated processes theorem presented in [51, pp. 77], the right-term of Eq. (34) can be minimized by observing Q(t) and ω(t) in every time slot and opportunistically minimizing the expectations. This is achieved by the Multiple User Access (MUA) algorithm, described in the next subsection.

C. MULTIPLE USER ACCESS ALGORITHM
Taking into account that Lyapunov function of the queues' backlog defined by (27) is a measure of the queue congestion in the system, by bounding and minimizing the bound of the difference between the average values of this function in two renewal time instances, conditioned on the current queue backlog, i.e., the drift defined by (30), the system queues become mean rate stable. This means that the constraints of the optimization problem are satisfied, while the queues backlog is bounded and does not grow on infinity. This leads to the development of the algorithm that is proposed in the continue of this subsection.
Considering the current values of Q(t) and ω(t), in each time slot t, the proposed MUA algorithm returns the optimal admission control, scheduling, and resource allocation decisions, based on which the queues of all users are updated. In what follows, the main blocks of the MUA algorithm are discussed in detail.
1) Admission control: The number of admitted packets is selected according to With this equation the queue data size is restricted to A max + θ i V.
2) Scheduling and resource allocation: The scheduling, the transmission rate and power allocation controls, i.e., z(t), ρ(t) P(t), are selected by the solution of the following optimization problem: The optimization problem in (36) is a deterministic combinatorial one and can be optimally solved by searching (e.g., via brute-force) over a finite search space. It can be solved by any node that is aware of the queue sizes and the channel states statistics, i.e., Q and ω, which are acquired by dedicated feedback links and pilot symbols, respectively. For the sake of simplicity of implementation, it is assumed that this problem is solved in a centralized fashion, since it is easier for the BS to acquire this information. Thus, after solving the problem, the BS reports the optimal values to the users. Another alternative though is all users to be aware of the channel state statistics and the queue sizes by receiving feedback (either by the BS or directly by the users) and solve the same optimization locally. However, it deserves to be noted that the use of centralized scheme in which the BS is responsible for the network's coordination are more suitable for practical implementation, compared to distributed iterative ones, in which increased overhead and synchronization are major challenges. Also, despite the fact that (36) is solved based on exhaustive search, it can be solved with polynomial complexity, due to the fact that user pairing is used for practical reasons. To be more precise, user pairing retains the decoding complexity and the resource allocation complexity acceptable. Also, the structure of (36) allows its parallel processing for each pair of users, with the number of different pairs in the extreme case being equal to M(M−1) 2 . However, in practice the number of users pairs that need to be considered is lower, since the selection of users that simultaneously have better channel conditions and more packets stored in their queues dominate other solutions. For each pair of users, for the case SIC the complexity depends on the size considered power and transmission rate levels, i.e., the cardinality of P and . On the other hand, for the case of JD, the complexity solely depends on the size of , since JD can achieve any point of the capacity region and the latter is maximized when users transmit information with the maximum power, i.e., P max . In any case, the number of considered power and transmission rates levels can be retained sufficiently low to guarantee that the required computational time is acceptable. Furthermore, since in (36) the large scale fading is considered instead of the small-scale one, the optimal solution varies more slowly. This means that the problem needs to be solved less frequently and with less stringent time constraints.
3) Queue Update: After completing the selection of the appropriate controls, the queues are updated according to (8).
For the analytical evaluation of the performance of the proposed MUA algorithm, useful bounds are provided in the following theorem, which also provide insights for the trade-off between the average throughput and delay.
Theorem 3: Let's assume that the MUA algorithm is implemented over all time slots t, with a control parameter V > 0 and initial queue conditions Q i (0) = 0, for all users i. The queue backlog is upper bounded by a constant Q max , which can be expressed as where i ∈ N. Furthermore, the long-term time average throughput utility achieved by the MUA algorithm is within O( 1 V ) of the optimal value, i.e., where υ * denotes the optimal value of the objective in the optimization problem in (25). Proof: The proof is based on the approach that has been introduced in [51]. Specifically, for (37), let's assume Q i (t) ≤ Q max at the slot t. It will be shown that the same holds at the frame t + 1.
the admission control part of the MUA algorithm sets R i (t) = 0, which results to Q i (t + 1) ≤ Q i (t) ≤ Q max . Moreover, the second part of Theorem 3 can be directly derived by using the results of the theorem presented in [51,Sec. 5.6].
According to Theorem 3, the long-term time-average throughput can be pushed to within O( 1 V ) of the optimal value at the expense of the worst case queue backlog. Taking into account the Little's Theorem, this leads to an O( 1 V , V) utility-delay tradeoff.

V. SIMULATION RESULTS
In this section we present simulation results, which illustrate the performance of the presented algorithm. For deriving the results, a scenario of three users with one antenna each is considered, i.e., M = 3. The channels of each user to the receiver antennas are assumed to fade as discussed in Section III. In addition, to model shadow fading, it is assumed that the channel state process for each user evolves independently according to a general irreducible discrete time Markov chain with two states, as depicted in Fig. 2. More specifically, if the channel of the i-th user is in the 1-st state it holds that P max ω i N 0 = 0 dB, while in the 2-nd state it holds that P max ω i N 0 = 10 dB. Also, two power levels have been considered for the users, i.e., P = {0, P max }. Thus, the received SNR s i can receive the following values: for P i = 0, 1, for P i = P max and ω i ∈ {state 1}, 10, for P i = P max and ω i ∈ {state 2}.
In addition, the available transmission rates selected by the users are given by the vector ρ = { 1 4 , 1 2 , 1, 2}. Finally, without loss of generality, the packet size, the slot duration, and the available bandwidth are assumed to be equal to S p = 50Kbit, S d = 1 sec, and W = 50 KHz, respectively, while the users input process is assumed to be Bernoulli with rate given by the vector λ = {λ, λ 2 , λ 3 }. Fig. 4 investigates the performance of the proposed transmission algorithm in terms of sum throughput for a specified value of the maximum queue length, i.e., the V parameter (V = 100), when various values of N are assumed and both JD and SIC are employed. Also, for the sake of comparison, the performance that is achieved by OMA is also illustrated, according to which all control actions are also optimized, with the constraint though that solely a single user transmits information in each time slot. In more detail, OMA has been evaluated by considering (36) and by replacing C 5 with M i=1 z i (t) ≤ 1, according to which at most one user can be scheduled for each time slot. Thus, as it has already been mentioned, OMA can be seen as a special case of the proposed NOMA-based framework that enables the information transmission by at most two users in the same time slot. When observing the figure, it becomes evident that the sum throughput increases linearly with respect to the total input rate, until a maximum value is reached. After a specific value of the input rate, although the users queues sizes start to grow, due to the admission control mechanism, the queues sizes are upper-bounded according to the bound of (37). Thus it becomes evident that the constraint C1 of the optimization problem in (25) is satisfied. Furthermore, the comparison between the considered decoding schemes shows that JD achieves higher sum throughput than SIC. However, the superiority of JD over SIC decreases as the number of the receive antennas N increases. In particular, an approximate maximum difference of 10 Kbps between these two decoding schemes is observed when N = 1, which decreases to 1 Kbps when N = 4. This observation can be explained by the fact that multiple antennas make the received signal vectors more orthogonal, and hence allow the receiver to better demodulate and decode the transmitted signals, even with an inferior decoding algorithm, such as SIC-VBLAST. Moreover, although it was expected that NOMA outperforms OMA, an importance performance gain is revealed, especially when JD is used, showing that the use of NOMA in buffer-aided systems is particularly promising, even for a small number of antennas at the receiver, e.g., N = 1.
Figs. 5 and 6 illustrate the performance of the proposed MUA algorithm when JD is employed, for various values of V parameters. In particular, Fig. 5 depicts the blocking probability, P bl defined as the ratio of the difference between the total input rate and the sum throughput to the total input rate, i.e., while Fig. 6 depicts the average total queue size. When comparing the blocking probability of the proposed MUA algorithm for various values of V, it is observed that the range of total input rate for which the blocking probability is approximately equal to zero is increased as the values of the parameter V increases. This was expected since, according to (38), the sum throughput achieved by the MUA algorithm increases as V increases, pushing it within O( 1 V ) to that of the optimal stationary policy. On the other hand, the increase of V leads to an increase of the average total queue backlog, inducing higher packet transmission delays according to the Little's theorem and hence verifying the utility-delay trade-off.
Furthermore, Fig. 7 compares the performance of the MUA algorithm with a suboptimal transmission algorithm that does not have exact knowledge of the queue size. Specifically, the considered suboptimal transmission scheme takes resource, scheduling and code selection decisions based on (36), by replacing Q i (t) = 1, ∀i. Thus, the optimization problem that needs to be solved by the suboptimal transmission scheme remains the same as the original problem of the MUA algorithm, with the only difference of the objective ρ(t)). Moreover, it should be noted that the implementation of the suboptimal transmission algorithm still requires the central node to be aware of the states ω i and the optimal solution to be announced to the users. It can be easily observed by the figure that MUA transmission algorithm has always higher performance than the suboptimal scheme; however, this comes at the expense of increased feedback, since the scheduler needs to be informed about the queue status of all users in every time slot. Nevertheless, since this information does not change rapidly in a dramatic way, it can be presumably given at a slower rate. Furthermore, at high values of total input rate, the performance of the MUA and suboptimal schemes become identical. This was expected since at these values of total input rates, users become infinitely backlogged, and users queue sizes do not affect the decision rule of (36). Interestingly, when increasing antenna numbers at the receiver, the relative gain at a fixed blocking probability is higher, indicating an additional advantage for multiple receiver antennas. Hence, the suboptimal transmission scheme can be considered as a reliable less complex alternative when users are infinitely backlogged.
Finally, in Fig. 8, we present the behavior of the outage probability as a function of the transmission rate of one user at a given transmission rate of the other user. The SNR of both users is set to 20dB. We see that for N = 1, the joint decoding scheme performs far better than the VBLAST-SIC scheme. This improved relative performance continues in the case of multiple receive antennas (depicted N = 2), but since both outage probabilities become much smaller, the actual effect observed, e.g., in Fig. 4 becomes smaller. The superiority of joint decoding scheme compared to VBLAST-SIC is due to the fact the latter does not achieve any point of the capacity region presented in Fig. 2. Also, since the considered channels are symmetrical (i.e., the SNR is equal for both users), the two users have the same outage probability when their transmission rates are equal, i.e., when R = 2 bps/Hz. Moreover, it deserves to be noted that the improvement offered by joint decoding in terms of outage probability is capitalized by the proposed scheme in terms of throughput, as it has been shown in Fig. 4.

VI. CONCLUSION
In this paper, the synergy of uplink NOMA, buffered-sources and multiple antennas at the receiver has been investigated and optimized, under realistic assumptions about the channel state model and the corresponding available information at the central controller. More specifically, based on the Lyapunov framework, a novel practical MAC-layer protocol has been developed, according to which the central controller only needs to know the slowly-changing statistical channel information and the queue states. To this end, two different detection techniques were taken into account, namely SIC and JD, while the corresponding expressions for the outage probability have been derived in closed form. In turn, these expressions were used to maximize the weighted sum of the users' long-term average throughput, by optimizing the set of users that transmit information at each time slot and their transmit power, the number of packets that are admitted in each user's queue, and the transmission rates. Moreover, the simulation results have shown that as the average queue size increases, the average throughput also increases, which leads to the reduction of the blocking probability. Furthermore, the use of JD offers important improvement of performance compared to SIC, especially for a small number of antennas at the receiver. In addition, it is shown that the proposed schemes significantly outperform OMA, revealing the potential of using NOMA in practical buffer-aided communication systems. Finally, it deserves to be noted that since the optimization has been performed based on the values of the large scale fading coefficients, which are not frequency-selective, extending the analysis to the case of multiple frequency channels is straightforward. This is because the channel state that corresponds to a specific user has the same value for all different frequency channels. Consequently, the problem of throughput maximization multiple frequency channels can be decomposed to the throughput maximization problem at each frequency channel, which corresponds to the problem that has been solved in this work.

APPENDIX A PROOF OF THEOREM 1
To prove the theorem we assume that h 1 and h 2 are independent and identically distributed complex Gaussian N-dimensional channel vectors with zero-mean and unitvariance. In this case the random variables y i = s i h † i h i for i = 1, 2 are gamma-distributed with diversity exponent N−1, i.e., the probability density function (PDF) of y i is In addition, the cosine-squared of the inner product is β-distributed, i.e., its PDF is given by To prove the above result, we note that q is the modulussquared of an element of a Haar-distributed N-dimensional unitary vector. In passing we also remark that the form of f (q) indicates that as N becomes larger, the vectors h 1 and h 2 become increasingly orthogonal to each other. We now move to the actual proof of the outage expression of Theorem 1. It is more convenient to breakup the two cases of (9), i.e., ρ 1 > c 1 and ρ 1 +ρ 2 > c 12 ∩ρ 1 >c 1 ∩ρ 2 >c 2 , into the region ρ 1 > c 1 and the region where ρ 1 + ρ 2 > c 12 and c 1 < ρ 1 < c 1 hold. Then, by letting Pr(·) denote probability, Pr(ρ 1 > c 1 ) simply be expressed as Pr(ρ 1 > c 1 ) = Pr(y 1 < y 10 ) To find Pr(c 1 < ρ 1 < c 1 ∩ c 12 < ρ 1 + ρ 2 ) we see that the constraints in the parenthesis are equivalent to y 1 > y 10 , 1 − q < y 10 (1 + y 2 ) − y 1 y 1 y 2 ≡ σ 1 (y 1 , y 2 ), where the last two equations define the quantities σ 1 and σ 2 . Note that due to the first constraint y 1 > y 10 , we have σ 1 (y 1 , y 2 ) < 1. As a result, the integration over q splits into two regions, depending on the values of y 1 and y 2 . In the first σ 2 (y 1 , y 2 ) > σ 1 (y 1 , y 2 ) > 0, which corresponds to the region y 2 < y 20 and y 10 < y 1 < y 10 (1 + y 2 ). In this region the integral over f (q) becomes 1 1−σ 1 (y 1 ,y 2 ) f (q)dq = σ 1 (y 1 , y 2 ) N−1 .
As a result, the probability Pr(c 1 < ρ 1 < c 1 ∩ c 12 < ρ 1 + ρ 2 ) can be expressed as Pr(c 1 < ρ 1 < c 1 ∩ c 12 < ρ 1 + ρ 2 ) To integrate the above expressions we note first that the denominator in the σ 1 and σ 2 expressions exactly cancels the powers of y 1 and y 2 in p 1 (y 1 ) and p 2 (y 2 ), respectively. Then we proceed to integrate the outer y 1 integrals by parts, producing after some manipulations the final result (17) in Theorem 1. Concluding, it is important to point out how the existence of multiple antennas at the receiver for N > 1 affects the above results. Indeed, since σ 1 , σ 2 < 1 the scaling factors in (48) σ N−1 1 and σ N−1 2 are responsible for reducing the outage probability, and in fact in an increasing manner as N becomes large.

APPENDIX B PROOF OF THEOREM 2
In this section we will provide details on the proof of Theorem 2. As in the previous section and considering (13), it is convenient to analyze the outage region split into the region where ρ 1 > c 1 and the region where the inequalities c 1 < ρ 1 < c 1c2 < ρ 2 hold. As before, the probability for the case ρ 1 > c 1 is To evaluate Pr(c 1 < ρ 1 < c 1 ∩c 2 < ρ 2 ), we see that the constraints are equivalent to y 1 > y 10 , 1 − q < σ 1 (y 1 , y 2 ), 1 − q < y 20 (1 + y 1 ) − y 2 y 1 y 2 ≡ σ 3 (y 1 , y 2 ).
This corresponds to the intersection of the half-planes y 10 < y 1 , y 1 < y 10 (1 + y 2 ) and y 2 < (y 20 − y 10 + y 1 (1 + y 20 ))/ (1 + y 10 ). When y 10 y 20 < 1 the above equations create a triangular region between the points (y 10 In contrast, when y 10 y 20 > 1 the intersection extends to infinity. As a result the contribution to the outage from this region can be expressed as where m = y * 1 when y 10 y 20 < 1 and m = ∞ otherwise. In the second region we have 0 < σ 3 (y 1 , y 2 ) < σ 1 (y 1 , y 2 ), in which the q-integral is equal to This region is the intersection of the half-planes y 10 < y 1 , y 2 < y 20 (1 + y 1 ) and y 2 > (y 20 − y 10 + y 1 (1 + y 20 ))/(1 + y 10 ). As above, when y 10 y 20 < 1 the above equations create a triangular region between the points (y 10 , y 20 ), (y 10 , y 20 (1 + y 10 )) and (y * 1 , y * 2 ), with last point moving to infinity when y 10 y 20 > 1. In this case, the contribution to the outage is Both (53) and (55) can be evaluated as discussed in the previous section by first canceling the denominators in σ 1 (y 1 , y 2 ), σ 3 (y 1 , y 2 ) with the corresponding ones in p 1 (y 1 ), p 2 (y 2 ) and then integrating the outer integrals by parts. Adding the corresponding results to (49) we obtain the final result in (19) in Theorem 2.