Terahertz-Band MIMO-NOMA: Adaptive Superposition Coding and Subspace Detection

We consider the problem of efficient ultra-massive multiple-input multiple-output (UM-MIMO) data detection in terahertz (THz)-band non-orthogonal multiple access (NOMA) systems. We argue that the most common THz NOMA configuration is power-domain superposition coding over quasi-optical doubly-massive MIMO channels. We propose spatial tuning techniques that modify antenna subarray arrangements to enhance channel conditions. Towards recovering the superposed data at the receiver side, we propose a family of data detectors based on low-complexity channel matrix puncturing, in which higher-order detectors are dynamically formed from lower-order component detectors. We first detail the proposed solutions for the case of superposition coding of multiple streams in point-to-point THz MIMO links. We then extend the study to multi-user NOMA, in which randomly distributed users are grouped into narrow cell sectors and are allocated different power levels depending on their proximity to the base station. We show that successive interference cancellation is carried out with minimal performance and complexity costs under spatial tuning. We derive approximate bit error rate (BER) equations, and we propose an architectural design to illustrate complexity reductions. Under typical THz conditions, channel puncturing introduces more than an order of magnitude reduction in BER at high signal-to-noise ratios while reducing complexity by approximately 90%.

THz signal propagation incurs very high losses that severely undermine the promised gains, such as achieving a terabit-per-second (Tbps) data rate [5]. Infrastructure and algorithmic enablers should thus complement THz communications. At the infrastructure level, ultra-massive multiple-input multiple-output (UM-MIMO) antenna arrangements [8] and intelligent reflecting surfaces (IRSs) [9] are required to overcome the very short communication distances. At the algorithmic level, THz-specific signal processing techniques [10] can get around the limitation of THz quasi-optical propagation and realize seamless connectivity. In particular, optimized distance-adaptive resource allocation solutions [11] are required to tackle spectrum shrinking due to molecular absorption. Furthermore, THz single-carrier modulations can replace orthogonal frequency-division multiplexing (OFDM) to reduce the baseband complexity and avoid the peak-to-average power ratio problem. Most importantly, efficient THz baseband signal processing is crucial to reduce the gap between the huge promised bandwidths and the limited state-of-the-art digital sampling speeds and processing capabilities [12].
Achieving higher spectrum utilization and enhancing the already large available bandwidths in the THz-band can be realized using advanced multiple access methods. Orthogonal multiple access (OMA) systems in which wireless resources are allocated for various users orthogonally in time, frequency, or code domains have traditionally dominated wireless communication standards. OMA's main drawback is its low spectral efficiency when allocating resources to users having poor channel conditions. Non-orthogonal multiple access (NOMA) [13], [14] is introduced to solve this problem by enabling users with significantly different channel conditions to share resources. NOMA can achieve such sharing through superposition coding (SC) of data streams at the transmitter, followed by successive interference cancellation (SIC) at the receiver. Proper user pairing criteria and power allocation policies can guarantee user fairness, affordable complexity overheads, and low signaling costs.
While spectrum is not the main bottleneck at THz frequencies, more spectral efficiency is always desirable, especially in low-complexity single-carrier systems. Maintaining fairness among users in a congested THz band is another motive for THz NOMA, where the gap between the best and worst user scenarios is significantly large. NOMA schemes can further mitigate the hardware constraints that limit the beamforming capabilities in THz devices. Since high-frequency NOMA [15] is likely to be conducted over line-of-sight (LoS) UM-MIMO links, we can apprehend the concept of THz multiple access through studying superposition coding of different data streams over a single point-to-point link. Very few THz-NOMA works exist in the literature. In [16], the prospects of enhancing the achievable data rates in THz NOMA are highlighted. Furthermore, a bandwidth-aware THz NOMA solution is proposed in [17], whereas THz MIMO-NOMA energy efficiency is addressed in [18] by optimizing power allocation, user clustering, and hybrid precoding.
THz UM-MIMO systems will most probably follow adaptive hybrid arrays-of-subarrays (AoSA) architectures [10], [19] where each subarray (SA) typically supports independent analog beamforming. A variety of probabilistic shaping and index modulation techniques [20] can be explored for adaptive array usage, especially in plasmonic THz solutions where antenna elements (AEs) can tune the frequency of operation by simple material doping or electrostatic bias [7]. Such reconfigurability can simultaneously accomplish resource allocation, user clustering, and beamforming. MIMO-NOMA configurations can be single-cluster or multi-cluster. In a single-cluster setting, all users except one conduct SIC, whereas in a multi-cluster setting, users are first partitioned into clusters to reduce interference, and SIC follows. The full potential of THz MIMO-NOMA is realized by the joint optimization of SIC, power allocation, beamforming, and user clustering [21].
MIMO NOMA systems are also largely affected by the data detection scheme at the receiver side. In conventional massive MIMO, linear detectors are near-optimal due to channel hardening [22]. However, correlated doubly-massive THz MIMO architectures require better-performing and more complex non-linear detectors [23], which creates a true bottleneck at the baseband of THz systems [10]. Recently, a family of non-linear subset-stream MIMO detectors based on QR decomposition (QRD) has been proposed. The least-complex member of this family is the nulling-and-cancellation (NC) detector [24], followed by the chase detector (CD) [25] and the layered orthogonal lattice detector (LORD) [26], respectively. Furthermore, a less complex alternative family of subspace detectors [27], [28] decomposes the channel using a punctured QRD, namely, WR decomposition (WRD). We argue that the latter provides a range of performance and complexity tradeoffs suitable for THz UM-MIMO-NOMA scenarios.
This paper studies the downlink of THz UM-MIMO-NOMA systems, assuming adaptive hybrid AoSA beamforming, and focuses on data detection. We investigate two SC use cases: single-user and multi-user (NOMA). In the single-user case, SC is employed to transmit multiple data streams over a point-to-point link. In the multi-user case, randomly distributed users in a cell are clustered and allocated different powers depending on their distance from a base station (BS), and users in each cluster are allocated the same time and frequency resources. We assume a single-cell scenario and neglect inter-cluster interference; at the receiver side, intra-cluster interference is suppressed via SIC. Since THz beams are highly directional, deriving the bounds on the achievable bit error rates (BERs) of the single-user case provides sufficient insight into the proposed NOMA data detectors' performance. The main paper contributions are:
1) We study the performance of NOMA systems under THz-specific UM-MIMO channel models while applying recently reported adaptive spatial tuning techniques [10] that enhance THz channel conditions.
2) We present a family of QRD-based and WRD-based detectors tailored for THz UM-MIMO-NOMA. These detectors are extensions to reference subspace detectors that were previously proposed [27] in the context of conventional large MIMO systems. We aim to significantly lessen the baseband computational complexity and simplify THz NOMA detectors' implementation while minimizing performance loss.
3) We propose a simple low-complexity power allocation (through AE allocation per SA) and user clustering NOMA solution that exploits the distance-based THz path-loss between the users and the BS to mitigate intra-cluster interference.
4) We analyze the proposed detectors' BER performance by deriving approximate closed-form equations. Without loss of generality, we analyze the proposed detectors' performance in the single-user case and empirically illustrate their scalability in NOMA settings.
5) We propose an efficient architectural design that realizes the proposed detectors, and we study the corresponding computational complexity under THz channel conditions and Tbps baseband constraints.
The remainder of this paper is organized as follows: We first detail the system models in Sec. II. Then, we present THz spatial tuning techniques in Sec. III. Afterward, we illustrate the proposed single-user detectors in Sec. IV, followed by the proposed NOMA clustering, power allocation, and detection schemes in Sec. V. We derive the BER equations corresponding to SIC error propagation in the proposed detectors in Sec. VI, and conduct the complexity study in Sec. VII, showcasing an efficient architecture that realizes the proposed solutions. We present the simulation results in Sec. VIII and draw conclusions in Sec. IX.
Concerning notation, lower case, bold lower case, and bold upper case letters correspond to scalars, vectors, and matrices, respectively. We denote scalar norms, vector $L_2$ norms, and Frobenius norms by $|\cdot|$, $\|\cdot\|$, and $\|\cdot\|_F$, respectively. We also denote by $(\cdot)^T$, $(\cdot)^*$, $\mathrm{Tr}(\cdot)$, $\Re(\cdot)$, and $\mathbb{E}[\cdot]$ the transpose, conjugate transpose, trace function, real part, and expected value, respectively. $\mathcal{CN}(\cdot)$ denotes the complex normal distribution, $Q(\cdot)$ refers to the Q-function, where $Q(x) = \int_x^{\infty} e^{-u^2/2}/\sqrt{2\pi}\, du$, and $\mathbb{P}[\cdot]$ is the probability function. $\mathring{\mathbf{R}} = [\mathring{r}_{ij}]$ with entries $\mathring{r}_{ij}$ is a punctured matrix, and $\mathbf{I}_N$ is an identity matrix of size $N$.

II. SYSTEM MODEL
We utilize the three-dimensional (3D) THz UM-MIMO model of [20]. The AoSAs consist of $M_t \times N_t$ SAs at the transmitter and $M_r \times N_r$ SAs at the receiver. Each SA is further formed of $Q \times Q$ AEs. We denote by $\delta$ and $\Delta$ the separation between two AEs and two SAs, respectively. THz propagation is highly directional because of low reflection losses, negligible scattered and refracted components, high-gain directional antennas, and large array beamforming gains. These factors result in LoS dominance with typical survival of a single path: a "pencil beam" generated by each SA through analog beamforming. Therefore, we adopt a system model of LoS transmission over single-carrier frequency-flat THz channels in this paper's main embodiments. We then account for a more accurate channel model with persisting multipath components in simulation results. For simplicity, we virtually vectorize the AoSAs in the remainder of this work by setting $M_t \times N_t = N$ and $M_r \times N_r = M$. With each SA being allocated a dedicated RF chain, SAs become the smallest addressable elements of the MIMO multiplexing system, and the role of baseband precoding and combining reduces to simply defining the utilization of the SAs.

A. Use Case 1: Single-User THz MIMO SC
In power-domain SC, we consider concurrently sending multiple data streams across various overlapping combinations of transmitting and receiving SAs (overlying channel matrices). We denote by $\mathcal{S}$ the set of superposition-coded data streams of dimensions $N_s$, $s = 1, \cdots, |\mathcal{S}|$, where $N_s \geq N_{s+1}$ and $N_1 = N$. The multiplexed transmitted symbol vector $\mathbf{x}_s = [x_{s,1} \cdots x_{s,n} \cdots x_{s,N_s}]^T \in \mathcal{X}_s^{N_s \times 1}$ is mapped to a contiguous set of antennas of indices $N - N_s + 1, \cdots, N$, where $\mathcal{X}_s$ is a quadrature amplitude modulation (QAM) constellation. We thus have the effective channel matrices $\mathbf{H}_s \in \mathbb{C}^{M \times N_s}$ being comprised of the columns $N - N_s + 1, N - N_s + 2, \cdots, N$ of the overall channel matrix $\mathbf{H} = [\mathbf{h}_1\ \mathbf{h}_2 \cdots \mathbf{h}_N] \in \mathbb{C}^{M \times N}$. Note that we assume the SC of data symbols to happen at the higher SA indices for convenience. However, SC can still occur on arbitrary SA subsets, and channel column permutations would approximate the desired structure. Furthermore, in the proposed channel-punctured solutions, the selected SAs need not be contiguous as long as one SA is common across all streams (more on that in Sec. VII). The equivalent input-output baseband system model can then be expressed as
$$\mathbf{y} = \sum_{s=1}^{|\mathcal{S}|} \mathbf{H}_s \mathbf{x}_s + \mathbf{n},$$
where $\mathbf{y} = [y_1 \cdots y_m \cdots y_M]^T \in \mathbb{C}^{M \times 1}$ is the received symbol vector and $\mathbf{n} \in \mathbb{C}^{M \times 1}$ is the $\mathcal{CN}(0, \sigma^2)$ noise, with $\mathbb{E}[\mathbf{n}\mathbf{n}^*] = \sigma^2 \mathbf{I}_M$. SC designates different power levels to the superposed transmitted symbol vectors. We consider allocating a higher power level to smaller-dimension symbol vectors, i.e., $\rho_s < \rho_{s+1}$. Hence, each symbol $x_{s,n}$ is an element of a scaled complex constellation $\mathcal{X}_s$ ($\mathbb{E}[x_{s,n} x_{s,n}^*] = \rho_s$), and we thus have $\mathbf{x}_s \in \bar{\mathcal{X}}_s$, where $\bar{\mathcal{X}}_s$ is the lattice formed from all possible symbol vectors that can be generated from the $\mathcal{X}_s$ constellations.
The system model with $|\mathcal{S}| = 3$ multiplexed data streams and $M = N = 8$ SAs is illustrated in Fig. 1(a), where $\mathbf{x}_1$, $\mathbf{x}_2$, and $\mathbf{x}_3$ are transmitted from $N_1 = 8$, $N_2 = 4$, and $N_3 = 2$ antennas, respectively. Since the SIC order is the same as that of power levels, $\mathbf{x}_3$ will be decoded first, followed by $\mathbf{x}_2$ after canceling the effect of $\mathbf{x}_3$, and finally $\mathbf{x}_1$ after canceling the effects of $\mathbf{x}_2$ and $\mathbf{x}_3$.
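The stream-to-SA mapping and the detection order described above can be sketched in a few lines. This is a minimal illustration only; the function names `stream_sa_indices` and `sic_order` are hypothetical and not part of the paper, and SA indices are 1-based to match the text.

```python
def stream_sa_indices(n_tx, stream_sizes):
    """Map stream s (1-based) to the contiguous SA indices N - N_s + 1, ..., N."""
    if stream_sizes[0] != n_tx:
        raise ValueError("the first stream must span all N SAs (N_1 = N)")
    if any(a < b for a, b in zip(stream_sizes, stream_sizes[1:])):
        raise ValueError("stream sizes must satisfy N_s >= N_{s+1}")
    return {
        s: list(range(n_tx - n_s + 1, n_tx + 1))
        for s, n_s in enumerate(stream_sizes, start=1)
    }


def sic_order(stream_sizes):
    """Streams are decoded from the highest-power (smallest) one down to stream 1."""
    return list(range(len(stream_sizes), 0, -1))


# Configuration of the three-stream example: N = 8 and (N_1, N_2, N_3) = (8, 4, 2).
mapping = stream_sa_indices(8, [8, 4, 2])
order = sic_order([8, 4, 2])
```

In this configuration, stream 3 occupies the last two SAs and is decoded first, which matches the ordering stated in the text.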
An element of $\mathbf{H}$, $h_{m,n}$, the frequency response between the $n$th transmitting and $m$th receiving SAs, is defined in terms of the path gain $\alpha$, the receive and transmit SA steering vectors $\mathbf{a}_r$ and $\mathbf{a}_t$, the receive and transmit antenna gains $G_r$ and $G_t$, and the receive and transmit azimuth/elevation angles of arrival and departure, $\phi_r/\theta_r$ and $\phi_t/\theta_t$, respectively. The LoS path gain is defined in terms of the communication distance $d_{m,n}$, the speed of light $c$, the center carrier frequency $f$, and the molecular absorption coefficient $\mathcal{K}(f)$. $\mathcal{K}(f)$ is computed [29] as a summation of absorption contributions from isotopes of gases in a medium. Note that we neglect the effect of mutual coupling, assuming sufficient antenna separations. The ideal analog steering vector per SA at the transmitter side collects the phase shifts $\Phi_{p,q}$ that correspond to AEs $(p, q)$, where each phase shift is defined for a wavelength $\lambda$ and the AE 3D coordinates. Note that we adopted a plane wave assumption for steering vectors because separations between antenna elements can be very small in plasmonic solutions [7].
The THz channel can still be frequency-selective, especially in indoor sub-THz scenarios where sufficient multipath components persist, although much sparser than at mmWave frequencies. The non-LoS (NLoS) component $h^{\mathrm{NLoS}}$ of THz multipath channels can be expressed using the Saleh-Valenzuela (S-V) model [30] as a sum over $N_{\mathrm{clu}}$ multipath clusters and $N_{\mathrm{ray}}^{(i)}$ paths in the $i$th cluster, with each path having random angles of departure and arrival within a beam region. The path gains are further governed by the cluster and ray times of arrival (following paraboloid or exponential distributions) and by the decay factors $\Gamma$ and $\gamma$ of the clusters and rays, respectively. The angles of departure and arrival are calculated from the cluster and ray angles, where $\Phi_t^{(i)}/\Phi_r^{(i)}$ and $\Theta_t^{(i)}/\Theta_r^{(i)}$ are the cluster azimuth and elevation angles of departure/arrival that follow uniform distributions over $(-\pi, \pi]$ and $[-\frac{\pi}{2}, \frac{\pi}{2}]$, respectively, and $\phi_t^{(i,l)}/\phi_r^{(i,l)}$ and $\theta_t^{(i,l)}/\theta_r^{(i,l)}$ are the ray azimuth and elevation angles of departure/arrival that follow a zero-mean second order Gaussian mixture model.

B. Use Case 2: Multi-User THz MIMO-NOMA
For NOMA, we assume the users in a cell to be divided into two groups: Users in the first group are distributed over an inner disk ($\mathcal{D}_1$) of radius $R_N$ centered at the BS, whereas users in the second group are uniformly distributed over an outer disk ($\mathcal{D}_2$) from $R_N$ to $R_C$. We assume a BS with $N$ SAs to service two users simultaneously and over the same frequency (power-domain SC): an $M_1$-SA user 1 in $\mathcal{D}_1$ and an $M_2$-SA user 2 in $\mathcal{D}_2$. In the corresponding received signal models, $\mathbf{x}_1$ and $\mathbf{x}_2$ are the transmitted power-multiplexed symbol vectors (from all SAs) and $\mathbf{H}_1$ and $\mathbf{H}_2$ are the equivalent channel sub-matrices, with $\sigma_{H_1}$ and $\sigma_{H_2}$ denoting the distance-dependent large scale fading coefficients (mainly due to path loss). Therefore, NOMA is realized by clustering inner disk users with outer disk users and designating different power levels to the transmitted superposed symbol vectors.
The single-cell multi-user MIMO-NOMA scenario is illustrated in Fig. 1(b). We consider the number of users to be distributed according to a homogeneous Poisson point process (PPP): $\Phi_1$ with density $\bar{\lambda}_1$ in $\mathcal{D}_1$ and $\Phi_2$ with density $\bar{\lambda}_2$ in $\mathcal{D}_2$, where $\mathbb{P}[\Phi = k] = e^{-\bar{\lambda}} \bar{\lambda}^k / k!$. We denote by $K$ the equal number of users in both disks, which is a Poisson random variable. Since future THz communication systems are expected to support a massive number of users/devices, we assume high user concentration. Such dense scenarios facilitate grouping users over narrow sectors as dictated by the narrow THz beamwidths.
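As a rough illustration of this user-drop model, the sketch below draws a Poisson number of users and places that many users uniformly in each disk. The function names, the mean value, and the radii are hypothetical choices for illustration; the Poisson sampler is Knuth's multiplication method, which is adequate only for small means.

```python
import math
import random


def sample_poisson(mean, rng):
    """Knuth's multiplication method for a Poisson draw (small means only)."""
    limit = math.exp(-mean)
    k, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return k
        k += 1


def uniform_in_annulus(r_in, r_out, rng):
    """Uniform point in the annulus r_in <= r < r_out, returned as (distance, angle)."""
    r = math.sqrt(rng.random() * (r_out ** 2 - r_in ** 2) + r_in ** 2)
    return r, rng.uniform(-math.pi, math.pi)


def drop_users(mean_users, r_n, r_c, rng):
    """Place the same Poisson number K of users in the inner and outer disks."""
    k = sample_poisson(mean_users, rng)
    inner = [uniform_in_annulus(0.0, r_n, rng) for _ in range(k)]
    outer = [uniform_in_annulus(r_n, r_c, rng) for _ in range(k)]
    return inner, outer


rng = random.Random(7)
inner_users, outer_users = drop_users(mean_users=6.0, r_n=2.0, r_c=10.0, rng=rng)
```

The square root in `uniform_in_annulus` keeps the user density uniform over area rather than over radius.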

III. SPATIAL TUNING IN THE THZ BAND
UM-MIMO systems at THz frequencies are mainly employed to overcome the high absorption and propagation losses. The corresponding high channel correlation in such systems, however, limits the achievable spatial multiplexing gains. While the severity of channel correlation at THz frequencies is challenging, novel reconfigurable THz devices enable unique opportunities to deal with such correlation. In particular, by tuning the separation between SAs and AEs at the transmitter and the receiver, and without complex precoding and combining schemes, good channel conditions can be maintained [31]. Specifically, for each communication distance ($D$), there is an optimal inter-antenna separation ($\Delta$) for which the channel is orthogonal, thus supporting a maximum number of eigenchannels over which we can transmit multiple data streams. Such dynamic tuning of antenna separations can be achieved in real-time, especially in plasmonic solutions [10]. A shorter wavelength $\lambda$ and a smaller $D$ both result in a shorter optimal SA separation $\Delta_{\mathrm{opt}}$; for symmetric MIMO systems, with the same number of SAs along every array dimension at both the transmitter and the receiver, [20] gives $\Delta_{\mathrm{opt}}$ in closed form, with one solution for each odd value of an integer parameter. Nonetheless, spatial tuning fails if $D$ is very large, larger than the so-called "Rayleigh" distance (a function of physical array dimensions) [32]. Fig. 2 illustrates the Rayleigh distance as a function of $\Delta$ for different frequencies and array sizes (2 and 128 SAs). For very small values of $\Delta$ (a few millimeters), a large number of SAs is required to achieve a few meters of efficient communications under spatial multiplexing. Note that for the same $\Delta$, higher frequencies and more antennas extend the multiplexing-achieving distance. However, for a fixed footprint, a larger number of SAs results in a Rayleigh distance reduction that is quadratic in $\Delta$.
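To give a feel for the scales involved, the sketch below evaluates the textbook LoS-MIMO orthogonality condition for uniform arrays, $\Delta = \sqrt{z \lambda D / M}$ with odd $z$ and $M$ SAs per array dimension. This expression is an assumption taken from the general LoS-MIMO literature and may differ in detail from the closed form of [20]; the function name and the numerical values are illustrative only.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def optimal_sa_separation(distance_m, freq_hz, num_sas, z=1):
    """Assumed orthogonality condition: Delta = sqrt(z * lambda * D / M), z odd."""
    if z < 1 or z % 2 == 0:
        raise ValueError("z must be a positive odd integer")
    wavelength = SPEED_OF_LIGHT / freq_hz
    return math.sqrt(z * wavelength * distance_m / num_sas)


# Illustrative numbers: a 1 m link at 1 THz with 128 SAs per array dimension.
delta_1thz = optimal_sa_separation(1.0, 1e12, 128)
delta_300ghz = optimal_sa_separation(1.0, 0.3e12, 128)
delta_2m = optimal_sa_separation(2.0, 1e12, 128)
```

Under this assumed expression, the separation at 1 THz and 1 m is on the order of 1.5 mm, and it shrinks with a shorter wavelength or a shorter distance, in line with the trend stated above.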
A massive number of AEs can be fitted in a few millimeters for antenna arrays operating in the THz band. Such compactness is further emphasized with plasmonic antennas, in which $\delta$ can be reduced below $\lambda/2$ without exciting mutual coupling effects [7]. THz spatial tuning consists of tuning $\Delta$ and calibrating the required number of AEs per SA. Tuning $\Delta$ can be achieved by maintaining a specific number of idle AEs between active SAs. In order to make better use of the idle AEs, multicarrier [9] configurations have been introduced (plasmonic AEs can be tuned to different frequencies without changing their physical dimensions). Furthermore, for a given communication distance, the required AEs per SA are allocated to achieve the target beamforming gain; the possible combinations of SAs (one RF chain per SA) dictate the achievable diversity gain. We propose leveraging such adaptability for power allocation and user clustering in SC and NOMA systems, where SA dimensions (number of AEs per SA) can realize power allocation schemes. Note that in this discussion, we did not account for non-uniform array architectures, wideband channel effects, and more accurate spherical wave assumptions.

IV. PROPOSED SC DETECTORS
This section considers the single-user SC scenario of Sec. II-A. We propose MIMO detectors that build on three QRD-based detectors (NC, CD, and LORD) and three WRD-based detectors [27] (punctured NC (PNC), punctured CD (PCD), and the subspace detector (SSD)). While the formulations per data stream are simple extensions to [27], under SC, we further account for proper stream ordering, sub-matrix selection, and inter-stream interference cancellation.

A. Proposed QRD-Based Detectors
With perfect knowledge of the channel at the receiver side, QRD decomposes $\mathbf{H}$ into $\mathbf{H} = \mathbf{Q}\mathbf{R}$, where $\mathbf{Q} \in \mathbb{C}^{M \times N}$ is formed of orthonormal columns ($\mathbf{Q}^*\mathbf{Q} = \mathbf{I}_N$), and $\mathbf{R} = [r_{ij}] \in \mathbb{C}^{N \times N}$ is a square upper-triangular matrix (UTM) having real and positive diagonal entries. The modified baseband model is expressed as $\tilde{\mathbf{y}} = \mathbf{Q}^*\mathbf{y} = \mathbf{R}\mathbf{x} + \mathbf{Q}^*\mathbf{n}$, with $\mathbf{n}$ and $\mathbf{Q}^*\mathbf{n}$ being statistically identical. By construction, we assume that the streams allocated higher power levels and consequently detected first at the receiver are transmitted via smaller sets of contiguous antennas, including the last SA $N$. Consequently, a single channel decomposition is sufficient to detect all streams (more on that in Sec. VII). After channel matrix decomposition, $\mathbf{x}_{|\mathcal{S}|}$ is first detected, followed by $\mathbf{x}_{|\mathcal{S}|-1}$, and so forth. In what follows, subscript $s$ indicates that the detection routine corresponds to detecting symbol vector $\mathbf{x}_s$ of the $s$th data stream.
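The decomposition can be illustrated with a small pure-Python routine. This is a sketch based on modified Gram-Schmidt, chosen here only because it is short and yields real positive diagonal entries of $\mathbf{R}$ directly; the paper does not specify how the QRD is computed, and a hardware design would more likely rely on Givens rotations or Householder reflections.

```python
import math


def qr_gram_schmidt(h):
    """Thin QR of an M x N complex matrix (M >= N, full column rank).

    Returns (q, r) as nested lists with H = Q R, Q^* Q = I_N, and R upper
    triangular with real, positive diagonal entries.
    """
    m, n = len(h), len(h[0])
    q_cols = []
    r = [[0j] * n for _ in range(n)]
    for j in range(n):
        v = [h[i][j] for i in range(m)]  # column j of H
        for k in range(j):
            # r_kj = q_k^* v, computed on the partially orthogonalized vector
            r[k][j] = sum(qk.conjugate() * vi for qk, vi in zip(q_cols[k], v))
            v = [vi - r[k][j] * qk for vi, qk in zip(v, q_cols[k])]
        norm = math.sqrt(sum(abs(vi) ** 2 for vi in v))
        r[j][j] = complex(norm)
        q_cols.append([vi / norm for vi in v])
    q = [[q_cols[j][i] for j in range(n)] for i in range(m)]
    return q, r


# Small 3 x 2 example with arbitrary entries.
h_example = [[1 + 1j, 2], [0, 1j], [1, 1 - 1j]]
q_example, r_example = qr_gram_schmidt(h_example)
```

Multiplying the received vector by the conjugate transpose of `q_example` would then give the modified model with the triangular `r_example`.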
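We assume optimality in the log-max sense; the maximum likelihood (ML) detector exhaustively searches the lattice $\bar{\mathcal{X}}_s$ for
$$\hat{\mathbf{x}}_s^{\mathrm{ML}} = \arg\min_{\mathbf{x} \in \bar{\mathcal{X}}_s} \|\mathbf{y} - \mathbf{H}_s\mathbf{x}\|^2 \approx \arg\min_{\mathbf{x} \in \bar{\mathcal{X}}_s} \|\tilde{\mathbf{y}}_s - \mathbf{R}_s\mathbf{x}\|^2,$$
where $\mathbf{R}_s$ is the bottom right square submatrix of $\mathbf{R}$ of size $N_s$, $\mathbf{H}_s$ consists of the last $N_s$ columns of $\mathbf{H}$, and $\tilde{\mathbf{y}}_s$ consists of the last $N_s$ elements of $\tilde{\mathbf{y}}$. Although this approximation does not take full advantage of receive diversity, it is a key observation that allows for a cost-efficient modular architecture (Sec. VII). All proposed detectors will thus exploit the alternative system model $\tilde{\mathbf{y}}_s = \mathbf{R}_s\mathbf{x}_s + \tilde{\mathbf{n}}_s$, where $\tilde{\mathbf{n}}_s$ consists of the last $N_s$ elements of $\tilde{\mathbf{n}} = \mathbf{Q}^*\mathbf{n}$. Note that we do not include inter-stream interference in this equation; we account for such interference in the BER analysis of Sec. VI.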
A low-complexity NC detector first performs nulling by multiplying $\mathbf{y}$ with $\mathbf{Q}^*$, which is an operation that is common to all streams, to suppress interference at layer $n$ from layers $j$ ($n > j$). Co-antenna interference is then suppressed via back-substitution and slicing.
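The back-substitution and slicing step can be sketched as follows for a single stream. The example assumes a QPSK constellation and a noise-free received vector purely for illustration; `slice_qpsk` and `nc_detect` are hypothetical names, and the upper-triangular matrix is taken as given (nulling with $\mathbf{Q}^*$ has already been applied).

```python
def slice_qpsk(value, scale=1.0):
    """Map a complex value to the nearest point of a scaled QPSK constellation."""
    re = scale if value.real >= 0 else -scale
    im = scale if value.imag >= 0 else -scale
    return complex(re, im)


def nc_detect(r, y_tilde, slicer=slice_qpsk):
    """Nulling-and-cancellation on y_tilde = R x + noise, with R upper triangular."""
    n = len(r)
    x_hat = [0j] * n
    for i in range(n - 1, -1, -1):  # from the root layer N up to layer 1
        # cancel the contribution of the already-detected layers j > i
        interference = sum(r[i][j] * x_hat[j] for j in range(i + 1, n))
        x_hat[i] = slicer((y_tilde[i] - interference) / r[i][i])
    return x_hat


# Noise-free check with an arbitrary 3 x 3 upper-triangular R.
r_nc = [[2.0, 0.5 + 0.5j, 0.3], [0.0, 1.5, -0.4j], [0.0, 0.0, 1.0]]
x_true = [1 + 1j, -1 + 1j, 1 - 1j]
y_nc = [sum(r_nc[i][j] * x_true[j] for j in range(3)) for i in range(3)]
x_nc = nc_detect(r_nc, y_nc)
```

Because each layer relies on the hard decisions of the layers below it, a wrong root decision propagates upward, which is the weakness that the chase-based detectors address.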
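With CD, the error propagation in back-substitution and slicing is mitigated by searching a reduced candidate symbol vector list $\mathcal{L}_s(\tilde{\mathbf{y}}_s, \mathbf{R}_s)$ before making a final decision. We first partition $\tilde{\mathbf{y}}_s$, $\mathbf{R}_s$, and $\mathbf{x}_s$ into their root-layer entries and the remaining upper layers, with $\mathbf{A}$ denoting the upper-left block of $\mathbf{R}_s$. For each root-layer value $x_{N_s}$, a candidate vector is constructed as in (2) and appended to $\mathcal{L}_s$. After populating $|\mathcal{X}_s|$ candidate vectors, the final hard-output (HO) solution is selected from $\mathcal{L}_s$ as the candidate with the minimum distance.
By repeating the CD routine, LORD iterates chase detection over various layer orderings, for different root layers, by shifting the columns of $\mathbf{H}_s$ cyclically and accumulating the root-layer symbol of every CD output. Every permuted $\mathbf{H}_s$ at step $t$, $t = 1, \cdots, N_s$, is QR-decomposed into $\mathbf{Q}^{(t)}$ and $\mathbf{R}^{(t)}$ following (3). Denoting by $\hat{\mathbf{x}}^{\mathrm{CD},(t)}$ the CD output at step $t$, the overall LORD solution collects the root-layer symbols of these outputs.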
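A compact sketch of chase detection for one stream is given below. It enumerates every root-layer symbol, completes each candidate by back-substitution and slicing, and keeps the candidate with the smallest Euclidean distance. The function names and the two-layer toy example are illustrative assumptions, not taken from the paper.

```python
def nearest_point(constellation, value):
    """Slice a complex value to the closest constellation point."""
    return min(constellation, key=lambda point: abs(point - value))


def chase_detect(r, y_tilde, constellation):
    """Chase detection on y_tilde = R x + noise, with R upper triangular."""
    n = len(r)
    best, best_dist = None, float("inf")
    for root in constellation:  # one candidate per root-layer symbol
        x = [0j] * n
        x[n - 1] = root
        for i in range(n - 2, -1, -1):
            interference = sum(r[i][j] * x[j] for j in range(i + 1, n))
            x[i] = nearest_point(constellation, (y_tilde[i] - interference) / r[i][i])
        dist = sum(
            abs(y_tilde[i] - sum(r[i][j] * x[j] for j in range(i, n))) ** 2
            for i in range(n)
        )
        if dist < best_dist:
            best, best_dist = x, dist
    return best, best_dist


QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
# Weak root layer (r_22 = 0.2) with a noise sample that flips a plain root slice.
r_cd = [[1.0, 1.0], [0.0, 0.2]]
y_cd = [2 + 2j, -0.1 + 0.2j]  # x = [1+1j, 1+1j] plus noise -0.3 on the root layer
x_cd, dist_cd = chase_detect(r_cd, y_cd, QPSK)
```

In this toy case a plain slice of the root layer would return the wrong symbol, whereas the exhaustive root search recovers the transmitted vector because the upper layer disambiguates the candidates.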

B. Proposed WRD-Based Detectors
Channel puncturing can significantly reduce the complexity of QRD-based detectors. WRD transforms $\mathbf{H}$ into a punctured UTM $\mathring{\mathbf{R}} = [\mathring{r}_{ij}] \in \mathbb{C}^{N \times N}$ with $\mathring{r}_{ii} \in \mathbb{R}^+$ by zeroing out the entries between column $N$ and the diagonal, via a matrix multiplication $\mathbf{W}^*\mathbf{H} = \mathring{\mathbf{R}}$, where $\mathbf{W} \in \mathbb{C}^{M \times N}$. The brute-force procedure for computing $\mathbf{W}$ [33] requires complex matrix inversions that are also prone to roundoff errors. Nevertheless, a simpler alternative procedure [34] applies QRD followed by elementary matrix operations. The modified symbol vector at the receiver is $\bar{\mathbf{y}} = \mathbf{W}^*\mathbf{y} = \mathring{\mathbf{R}}\mathbf{x} + \mathbf{W}^*\mathbf{n}$.
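We similarly define $\mathring{\mathbf{R}}_s$ as the bottom right square submatrix of size $N_s$ of $\mathring{\mathbf{R}}$, and $\bar{\mathbf{y}}_s$ as the last $N_s$ elements of $\bar{\mathbf{y}}$. Then, by analogy with (3), the same partition applies for the $s$th stream, where in this case $\mathring{\mathbf{A}} \in \mathbb{R}^{(N_s-1) \times (N_s-1)}$ is diagonal.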
With PNC, we pre-multiply with $\mathbf{W}^*$ instead of $\mathbf{Q}^*$ for nulling, and perform back-substitution and slicing to obtain $\hat{\mathbf{x}}^{\mathrm{PNC}}$. For all streams, slicing on layers $n = N_s - 1, \cdots, 1$ is executed in parallel because $\mathring{\mathbf{A}}$ is diagonal. PCD performs the chase detection operations following the partition in (4). An altered list of relevant symbol vectors $\mathcal{P}_s(\bar{\mathbf{y}}_s, \mathring{\mathbf{R}}_s)$ is thus populated, together with the corresponding distance $\bar{d}$ to each vector $\mathbf{x} = [\mathbf{x}_1^T, x_{N_s}]^T$. For every $x_{N_s} \in \mathcal{X}_s$, this distance is minimized over $\mathbf{x}_1$, which is a vectorized slicing operation, and $\mathbf{x}(x_{N_s}) = [\mathbf{x}_1(x_{N_s})^T, x_{N_s}]^T$. We then add the symbol vector $\mathbf{x}(x_{N_s})$ to $\mathcal{P}_s$ and save the corresponding distance $\bar{d}^*(\mathbf{x}(x_{N_s}))$. The HO solution $\hat{\mathbf{x}}^{\mathrm{PCD}}$ is selected from $\mathcal{P}_s$ as the vector with the minimum distance.
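The effect of the punctured structure on detection can be sketched as follows. Once the root layer is sliced, every other layer depends only on its own diagonal entry and on the last column, so the remaining slices are mutually independent. The sketch takes the punctured matrix as given and does not implement WRD itself; the names and the QPSK toy example are illustrative assumptions.

```python
def slice_qpsk(value, scale=1.0):
    """Map a complex value to the nearest point of a scaled QPSK constellation."""
    re = scale if value.real >= 0 else -scale
    im = scale if value.imag >= 0 else -scale
    return complex(re, im)


def pnc_detect(r_punct, y_bar, slicer=slice_qpsk):
    """Punctured NC: R has non-zeros only on its diagonal and in its last column."""
    n = len(r_punct)
    root = slicer(y_bar[n - 1] / r_punct[n - 1][n - 1])
    # Each upper layer needs only the root decision, so these slices have no
    # dependency on one another and could run in parallel in hardware.
    upper = [
        slicer((y_bar[i] - r_punct[i][n - 1] * root) / r_punct[i][i])
        for i in range(n - 1)
    ]
    return upper + [root]


# Noise-free check with an arbitrary punctured 3 x 3 matrix.
r_p = [[2.0, 0.0, 0.3 - 0.2j], [0.0, 1.5, 0.4j], [0.0, 0.0, 1.0]]
x_p_true = [1 - 1j, -1 - 1j, -1 + 1j]
y_p = [sum(r_p[i][j] * x_p_true[j] for j in range(3)) for i in range(3)]
x_p = pnc_detect(r_p, y_p)
```

Compared with the sequential back-substitution of NC, the list comprehension over the upper layers has no loop-carried dependency, which is the source of the parallelism noted in the text.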
SSD picks from the PCD HO vector the symbol at the root layer for each step $t$. Therefore, the SSD HO symbol vector is assembled over $N_s$ executions of PCD, one symbol at a time. Note that for all proposed detectors in the single-user scenario, we detect $\mathbf{x}_{|\mathcal{S}|}$ by treating the interference caused by other streams as unknown. Every time a symbol vector $\mathbf{x}_s$ is detected, by treating streams $s - 1$ down to 1 as unknown interference, the received signal component due to $\mathbf{x}_s$ gets canceled. This paves the way to detecting $\mathbf{x}_{s-1}$ from the remaining part of the received signal in the next step. In the particular case of SSD, the received vector is updated before every step.

V. EXTENSIONS TO MULTI-USER MIMO-NOMA
Having detailed the proposed detectors, we next study their utilization in a NOMA setting. We start by proposing a low-complexity joint clustering and power control mechanism.

A. Joint Clustering and Power Control
Although NOMA settings result in intra-cluster interference (ICI), efficient user clustering enhances ICI cancellation in SIC at the receiver. The SIC process distinguishes same-cluster users by the difference in their power, where users are allocated power levels based on their corresponding channel vector norms. Hence, an efficient clustering approach couples two users with significantly different channel vector norms, typically a user far from the BS with a near user. Motivated by this realization, we propose a low-complexity joint distance-based (path-loss-based) clustering and power control scheme (JDCP). We assume sufficiently dense networks that guarantee a sufficient number of users in a beamwidth-limited cell sector.
The proposed clustering approach operates as follows: First, the farthest user in disk $\mathcal{D}_1$ is grouped with the farthest user in disk $\mathcal{D}_2$. Then, the second farthest user in $\mathcal{D}_1$ is grouped with the second farthest user in $\mathcal{D}_2$, and so on. Under such pairing, SIC efficiency is guaranteed because we always allocate more power to the weak user. SIC decoding is only needed at the receiver of the strong NOMA user. User 1 with better channel conditions is the strong user, and user 2 is the weak user (the SNR at user 1 is higher than that at user 2). Hence, we have $\sigma^2_{H_1} \geq \sigma^2_{H_2}$, which indicates that the central user is user 1 and the cell-edge user is user 2. Therefore, user 2 will be allocated more power. Subsequently, user 2 directly decodes its own data $\mathbf{x}_2$, treating the interference from $\mathbf{x}_1$ as unknown, while user 1 applies SIC to cancel out $\mathbf{x}_2$ before decoding its own symbol vector $\mathbf{x}_1$.
Following clustering, the proposed low-complexity power control (PC) mechanism exploits the cellular link's CSI to minimize the interference between NOMA pairs. We select the transmit power of NOMA pairs based on channel conditions; specifically, the distance-based path-loss. At THz frequencies, the distance-based path-loss includes the distance-based absorption loss in addition to propagation losses. However, instead of using (1), we consider the equivalent THz path-loss model that accounts for additional losses in the path-loss exponent. Reported LoS path-loss exponent values at sub-THz frequencies are around $\eta = 2.2$ [35]. The allocated power for the $k$th close user (in $\mathcal{D}_1$) is based on channel inversion and is determined by $d_{1,k}$, the distance separating the BS and the $k$th user equipment (UE) in $\mathcal{D}_1$, the path-loss exponent $\eta$, and $P_{\mathrm{rx}}$, the minimum required power for UE signal recovery (also referred to as receiver sensitivity). The power allocated to the $k$th far user in $\mathcal{D}_2$ additionally depends on a NOMA PC parameter $\epsilon$, on $d_{2,k}$, the distance between the $k$th UE and the BS in $\mathcal{D}_2$, and on $P_{\max}$, the maximum transmit power. The adopted channel inversion technique does not compensate for small-scale fading, which is negligible at THz frequencies; it only accounts for the large-scale path-loss effects. Consequently, the proposed PC scheme does not require establishing instantaneous CSI at the transmitter, which is costly at high frequencies and massive dimensions. Moreover, the BS can accurately estimate distances via location updates as defined in the 3GPP TS 23.032: Universal geographical area description (GAD) [36]. We further argue that the ranging accuracy is much higher with THz signals [2]. Furthermore, this scheme is particularly suitable for SIC decoding since it allocates much more power to far users and much less power to close users, which avoids the worst-case scenario of allocating equal power to both users, a scenario that must be avoided in NOMA. The JDCP scheme is summarized in Algorithm 1.
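A rough sketch of the clustering and power control steps is given below. The pairing rule (farthest with farthest, then second farthest with second farthest) follows the text. The power expressions are assumptions made for illustration: the near user receives plain channel inversion, $P_{\mathrm{rx}} d^{\eta}$, and the far user receives an $\epsilon$-scaled inversion capped at $P_{\max}$. The exact far-user rule and any reference-distance constant of the paper are not reproduced here, and `jdcp` is a hypothetical name.

```python
def jdcp(inner_dists, outer_dists, p_rx, p_max, eta=2.2, eps=2.0):
    """Pair users by distance rank and assign per-pair transmit powers.

    Powers are in linear units (e.g., watts) and distances in meters. The
    far-user rule min(p_max, eps * p_rx * d**eta) is an illustrative
    assumption, not the paper's exact expression.
    """
    if len(inner_dists) != len(outer_dists):
        raise ValueError("both disks must contain the same number of users")
    near_sorted = sorted(inner_dists, reverse=True)  # farthest inner user first
    far_sorted = sorted(outer_dists, reverse=True)   # farthest outer user first
    clusters = []
    for d_near, d_far in zip(near_sorted, far_sorted):
        p_near = min(p_max, p_rx * d_near ** eta)      # channel inversion
        p_far = min(p_max, eps * p_rx * d_far ** eta)  # assumed scaled inversion
        clusters.append(
            {"d_near": d_near, "d_far": d_far, "p_near": p_near, "p_far": p_far}
        )
    return clusters


# Two users per disk; only distances are needed, not instantaneous CSI.
clusters = jdcp([1.0, 0.5], [3.0, 4.0], p_rx=1e-9, p_max=1.0)
```

Only distances enter the computation, which reflects the point made above that instantaneous CSI is not needed at the transmitter.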
Spatial domain multiplexing of multi-clusters and multi-cells can further result in multi-cluster interference (MCI) and multi-cell interference (MCeI). However, the probability of such interference is very low at THz frequencies due to shorter communications distances and narrower beams. Coordinated beamforming techniques can be used, alongside intra-cluster SIC, to suppress intra-cluster interference, inter-cluster interference, and multi-cell interference.

B. Multi-User MIMO-NOMA Detection
Following JDCP, user 2 decodes its symbol vector $\mathbf{x}_2$ directly by treating the interference due to $\mathbf{x}_1$ as unknown interference. Such detection can be achieved by using any of the proposed detectors in Sec. IV, with computations corresponding to the case of the first detected stream ($s = |\mathcal{S}| = 2$). This concludes the operations at user 2, where no SIC is required. Hence, the detection routine at user 2 is, in fact, regular MIMO detection. Nevertheless, SIC-based detection applies to user 1. First, the symbol vector $\mathbf{x}_2$ is detected while treating $\mathbf{x}_1$ as unknown interference (since we assume no communication between users, where each user decodes its own information independently). Then, user 1 cancels the portion of the received signal that is caused by $\mathbf{x}_2$ and decodes $\mathbf{x}_1$ from the remainder of the received signal. By analogy with the construction in Sec. IV, the operation at user 1 can be modeled as dual-stream detection ($|\mathcal{S}| = 2$). The difference here is that the output at the second iteration ($s = 1$) is the only desired output and the output $\hat{\mathbf{x}}_2$ at the first iteration is discarded.
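The two-step routine at user 1 can be sketched as follows. The example works on the triangular model after nulling, uses NC as the component detector, and assumes QPSK symbols scaled to amplitude 2 for the far user and 0.5 for the near user; these choices, the noise-free setting, and all names are illustrative assumptions.

```python
def slice_qpsk(value, scale):
    """Map a complex value to the nearest point of a scaled QPSK constellation."""
    re = scale if value.real >= 0 else -scale
    im = scale if value.imag >= 0 else -scale
    return complex(re, im)


def nc_detect(r, y_tilde, scale):
    """Nulling-and-cancellation detection on an upper-triangular model."""
    n = len(r)
    x_hat = [0j] * n
    for i in range(n - 1, -1, -1):
        interference = sum(r[i][j] * x_hat[j] for j in range(i + 1, n))
        x_hat[i] = slice_qpsk((y_tilde[i] - interference) / r[i][i], scale)
    return x_hat


def sic_user1(r, y_tilde, scale_far, scale_near):
    """Detect x2 (high power), cancel it, then detect the desired x1."""
    n = len(r)
    x2_hat = nc_detect(r, y_tilde, scale_far)  # x1 treated as unknown interference
    residual = [
        y_tilde[i] - sum(r[i][j] * x2_hat[j] for j in range(i, n)) for i in range(n)
    ]
    x1_hat = nc_detect(r, residual, scale_near)
    return x1_hat, x2_hat  # x2_hat is returned for inspection only


r_u1 = [[2.0, 0.5], [0.0, 1.0]]
x1_true = [0.5 - 0.5j, -0.5 + 0.5j]  # near user (low power)
x2_true = [-2 + 2j, 2 + 2j]          # far user (high power)
y_u1 = [
    sum(r_u1[i][j] * (x1_true[j] + x2_true[j]) for j in range(2)) for i in range(2)
]
x1_hat, x2_hat = sic_user1(r_u1, y_u1, scale_far=2.0, scale_near=0.5)
```

The first pass succeeds here because the power gap keeps the low-power symbols from crossing the decision boundaries of the high-power constellation, which is the condition that the power control scheme is designed to maintain.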

VI. CHARACTERIZATION AND ANALYSIS OF BER
Since this work's primary objective is to investigate the performance of detection schemes at the receiving side without optimizations at the transmitter side, the suitable metric for performance analyses is BER rather than achievable sum rates. In what follows, we formulate approximate BER equations that provide insight into the resulting system performance of the proposed QRD-based and WRD-based NOMA detectors. We consider single-user SC and assume the case where all streams are of equal size ($N_s = N$ for all $s$). We assume a Gaussian channel case to derive closed-form BER equations, and we generate empirical approximate BER bounds for THz channels. The relative BER performances of QRD-based and WRD-based detectors are studied in [27] for OMA-MIMO systems. The main factors that affect the performance under puncturing are: the reduction in error propagation over MIMO layers, the variation in the statistical properties of the elements of $\mathring{\mathbf{R}}$ compared to $\mathbf{R}$, and noise coloring. In this section, we assume the generic case of detecting stream $s$ after canceling the interference from stream $s + 1$. We drop the index $s$ from the symbols for clarity of presentation. For other layers, if slicing at layer $N$ is accurate, we have

A. NC and PNC
where is the th component of the noise vector W * n. Noting that −ˆN C = ±2 √ , the variance of interference plus noise is 2 +4 . This analysis holds when the streams (or users in the NOMA scenario) are adequately separated in power, such that no error propagates from stream + 1 to and the interference from stream − 1 is negligible. This assumption is made by most studies on MIMO-NOMA that consider capacity maximization. However, an additional error component is introduced by inter-stream SIC. Assuming that a one-bit slicing error occurs when detecting stream + 1 (with Gray mapping), the corresponding error component is proportional to +1 = 2 √ +1 log 2 −1 for a scaled -QAM X +1 . Furthermore, the interference power from user −1 (neglecting farther streams) is proportional to −1 2 H −1 . Hence, we have ´(˚) = √︂ , and the resultant BER when detecting can be expressed as
The QRD-based NC BER at layer ( = −1, · · · , 1) is studied in [24]. By analogy, and following a similar derivation to that for PNC, we have where +1 ( +1 ) is recursively obtained, and is an occurrence of , the complete set of possible error patterns up to layer . Since ´ = 2 < | | = 2 − +1 with WRD, error propagation is significantly lessened. Nevertheless, this does not guarantee an enhanced BER performance. For a Gaussian channel, puncturing reduces error propagation but also results in performance degradation. At layer , the BER is derived by taking the average over 2 and ˚2 . For Gaussian channels, the off-diagonal elements of R are circularly symmetric complex Gaussian random variables, and the square of the th diagonal element is chi-squared distributed with 2( − +1) degrees of freedom. Although the distributions of non-zero off-diagonal elements remain intact in the punctured matrix, the distributions of diagonal elements at upper layers = 1, · · · , − 3 lose degrees of freedom from 2( − +1) down to 4. Since ˚2 is on average smaller than 2 for 1 ≤ ≤ −2 (lower degrees of freedom), PNC results in performance loss.
However, both NC and PNC are dominated by ´(˚) = ( ) at the root layer. For a THz channel, we distinguish between two cases: spatial tuning and highly correlated channels. Under spatial tuning, the channel is nearly diagonal (up to some quantization errors), and the punctured and unpunctured detectors reduce to the same detector. Under high channel correlation, on the other hand, the diagonal elements of R do not possess high degrees of freedom to begin with, so little is lost through puncturing. Hence, the reduction in error propagation across symbol layers is further emphasized in correlated THz scenarios, where puncturing could even result in performance enhancement. Note that both configurations, with or without spatial tuning, are somewhat deterministic. Hence, no averaging is required over channel realizations; we could directly simulate empirical BER results following equations (5) and (6), for example. Nevertheless, in an indoor THz scenario with sufficient multipath components, the proposed detector's performance can still mimic the Gaussian case.
We next assume Rayleigh fading, where we can derive closed-form BER equations and extend the analysis to arbitrary modulation types. We denote by ( ,¯, ) the function that produces the average BER for an L-QAM constellation over -fold diversity Rayleigh fading with mean branch SNR [37]. The average PNC BER for layer 1 ≤ ≤ −1 is , (1 −´) log 2 −1 , and because of puncturing, the layers 1 ≤ ≤ −1 only provide 2-fold diversity. Furthermore, the average NC BER at layer < is derived by replacing (err| , +1 ) in equation (6) by its average over . Since +1 in +1 is much larger than , the residual error from SIC, when it occurs, is much more severe than the residual error from back-substitution and slicing. Nevertheless, setting +1 and −1 renders SIC error and interference negligible, respectively. This is because ´( +1) will approach zero in this case despite the increase in +1 , and the interference component −1 2 H −1 will also be negligible (we seek maximal power separation between same-cluster users).
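To illustrate how the diversity order enters such an average-BER function, the sketch below evaluates the textbook closed form for BPSK with maximal-ratio combining over independent Rayleigh branches. This is a deliberately simpler stand-in for the L-QAM function of [37], not the expression used in the analysis; the function name and the chosen SNR points are assumptions.

```python
import math


def bpsk_rayleigh_ber(mean_snr, diversity):
    """Average BPSK BER over i.i.d. Rayleigh branches with maximal-ratio combining.

    mean_snr  : mean SNR per branch (linear scale)
    diversity : number of independent branches (diversity order)
    """
    mu = math.sqrt(mean_snr / (1.0 + mean_snr))
    head = ((1.0 - mu) / 2.0) ** diversity
    tail = sum(
        math.comb(diversity - 1 + k, k) * ((1.0 + mu) / 2.0) ** k
        for k in range(diversity)
    )
    return head * tail


# Compare 2-fold diversity (as on punctured layers) with 4-fold diversity.
for snr_db in (10, 20, 30):
    snr = 10 ** (snr_db / 10)
    print(snr_db, bpsk_rayleigh_ber(snr, 2), bpsk_rayleigh_ber(snr, 4))
```

At high SNR the BER decays roughly as the mean SNR raised to the negative diversity order, which is why limiting the upper layers to 2-fold diversity translates into a shallower BER slope.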

B. CD and PCD
An approximate approach for capturing the BER performance of the CD builds on the NC BER equations. Since no error propagates from layer when searching all its candidate symbols, we sum the BER combinations on layers < assuming ´(˚) = ( ) = 0. In the particular case of Gaussian channels, the BER of PCD is given by , .
The approximate BER performances of NC, PNC, CD, and PCD are shown in Fig. 3, for a three-stream SC MIMO scenario. The reduction in complexity under puncturing comes at a graceful performance cost, and inter-stream interference results in error floors. However, under spatial tuning, all detectors behave identically and are robust to inter-stream interference. Note that without spatial tuning, channel correlation results in severe performance degradation (more results in Sec. VIII).
For LORD and SSD, a comparative BER analysis in the context of OMA-MIMO is conducted in [27]. It is argued that with correlated channels, SSD outperforms LORD at high SNR. By only considering the root-layer symbols of the PCD solution after cyclically shifting the layers of H in each SSD iteration, intra-channel interference is mitigated (under puncturing, all layers are only dependent on the root layer). Therefore, using SSD instead of LORD not only reduces complexity but also enhances performance. Note that we considered a low complexity puncturing mechanism in this work, which does not necessarily guarantee maximum achievable rates. In [28], an augmented channel is punctured instead of the true channel, where the augmentation accounts for a minimum mean square error (MMSE) prefiltering and channel gain compensation. The resultant scheme maximizes the achievable rates at an additional complexity cost.

VII. ARCHITECTURE AND COMPLEXITY ANALYSIS
This section details an efficient architecture that realizes our proposed detectors in a modular and low complexity design.
We similarly assume the case of detection at the th stream in a single-user SC setting with = , and drop the index for convenience. Figure 4 illustrates the architectural design for WRD-based detectors. The complexity reduction is on multiple levels:
1) A single channel matrix decomposition (one decomposition for PNC and CD and decompositions for SSD) is required for all streams, as all subsequent streams are assumed to use contiguous subsets of transmitting SAs. Consequently, a global channel matrix QRD/WRD can be stored in hardware, and a multiplexer would select the required channel at input . In the specific case of WRD, we can relax the constraint on the antenna subsets being contiguous to the condition of only including the th SA in all streams. Arbitrary combinations of SAs are thus tolerated. This observation is valid because all layers other than are independent under puncturing and can thus be flexibly arranged in any order.
2) The choice of detectors, whether QRD-based or WRD-based, is such that higher-complexity detectors can make use of their lower-complexity counterparts as building blocks. Hence, we propose a hierarchical architectural design in Fig. 4, where SSD uses PCD components that themselves use PNC. With SSD, the computed PCD distances at a layer of interest are forwarded, alongside the corresponding symbol vectors, to a decision processing unit. This occurs on all layers in parallel, where the aggregate output vector is ready after a full-layer processing delay. Similarly, a QRD-based design features LORD using CD and NC as building blocks, but the resulting architecture is not fully parallelizable.
The proposed design provides the flexibility to adapt detector types depending on varying channel conditions or resource requirements while using a single dedicated hardware processor.
3) Channel puncturing reduces the computational complexity. The PNC routine, which gets executed the most as the lowest-level building block, requires reduced back-propagation computations due to puncturing-induced sparsity (significantly less complex than NC).
4) In the specific case when symbols on each layer are chosen from the same unscaled modulation X for each stream/user, we can store in memory an exhaustive set of the products of x with R or with the punctured matrix, for all x. Then, a simple scaling by at layer would replace the matrix-vector multiplication. This feature is more useful when all combinations of symbol vectors are required, which is more feasible with relatively low-order MIMO-NOMA systems.
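The hierarchy described in items 2) and 3) can be sketched as follows. The example assumes that the punctured upper-triangular matrix retains only its diagonal and its last (root-layer) column, which is consistent with all non-root layers depending only on the root layer; the data layout (`diag`, `last_col`), the function names, and the QPSK constellation are illustrative assumptions. The WRD itself, noise coloring, and the cyclic layer shifts of SSD are outside the scope of this sketch.

```python
import math

QPSK = [complex(a, b) / math.sqrt(2) for a in (-1, 1) for b in (-1, 1)]


def slice_symbol(z, constellation=QPSK):
    """Return the constellation point closest to z."""
    return min(constellation, key=lambda s: abs(z - s))


def pnc(y, diag, last_col, root=None):
    """Punctured nulling and cancellation on a sparse triangular system.

    y        : N received values after decomposition
    diag     : N diagonal entries of the punctured matrix
    last_col : N-1 last-column entries above the diagonal
    root     : optional fixed root-layer symbol (used by pcd)
    """
    n = len(y)
    x = [0j] * n
    x[n - 1] = slice_symbol(y[n - 1] / diag[n - 1]) if root is None else root
    # Non-root layers depend only on the root layer, so this loop is parallelizable.
    for i in range(n - 1):
        x[i] = slice_symbol((y[i] - last_col[i] * x[n - 1]) / diag[i])
    return x


def distance(y, diag, last_col, x):
    """Squared Euclidean distance between y and the punctured product with x."""
    n = len(y)
    d = abs(y[n - 1] - diag[n - 1] * x[n - 1]) ** 2
    for i in range(n - 1):
        d += abs(y[i] - diag[i] * x[i] - last_col[i] * x[n - 1]) ** 2
    return d


def pcd(y, diag, last_col, constellation=QPSK):
    """Punctured chase detection: one PNC run per root-layer candidate."""
    candidates = []
    for c in constellation:
        x = pnc(y, diag, last_col, root=c)
        candidates.append((distance(y, diag, last_col, x), x))
    best = min(candidates, key=lambda t: t[0])
    return best, candidates  # candidates would feed an SSD decision unit


# Noise-free toy example with three layers.
diag = [1.0, 0.8, 0.5]
last_col = [0.3 + 0.1j, -0.2j]
x_true = [QPSK[0], QPSK[1], QPSK[2]]
y = [
    diag[0] * x_true[0] + last_col[0] * x_true[2],
    diag[1] * x_true[1] + last_col[1] * x_true[2],
    diag[2] * x_true[2],
]
best, candidates = pcd(y, diag, last_col)
print(best[1] == x_true, len(candidates))
```

Because `pcd` only calls `pnc` with different root symbols, the same low-level block can be instantiated once per candidate in hardware, and an SSD stage would repeat `pcd` once per layer ordering.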
We next analyze the detectors' complexity in terms of floating-point operations (flops), as a function of real addition (RAD) and real multiplication (RML) operations. When the matrix-vector product with the punctured matrix is executed in lieu of Rx for an $N \times N$ system, a reduction of $(N-2)(N-1)/2$ multiplications is noted, which is equivalent to $(N^2 - 3N + 2)$ RAD $+ (2N^2 - 6N + 4)$ RML flops. This reduction amounts to 77% and 88% of the multiplications in a 16 × 16 MIMO system and a 32 × 32 MIMO system, respectively. However, QRD ... Table I summarizes these results. With flat-fading THz channels, such decomposition computations can be stored in memory for a large number of frames.
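The quoted percentages can be checked with a short calculation. The sketch below assumes that the full product involves the n(n+1)/2 non-zero entries of an n × n upper-triangular matrix and that each saved complex multiplication costs 4 real multiplications and 2 real additions; the function name is hypothetical.

```python
def punctured_savings(n):
    """Multiplications saved by puncturing an n x n upper-triangular product."""
    full = n * (n + 1) // 2          # non-zero entries of the unpunctured matrix
    saved = (n - 2) * (n - 1) // 2   # entries zeroed by puncturing
    rad = 2 * saved                  # real additions saved: n^2 - 3n + 2
    rml = 4 * saved                  # real multiplications saved: 2n^2 - 6n + 4
    return saved, rad, rml, saved / full


for n in (16, 32):
    saved, rad, rml, fraction = punctured_savings(n)
    print(f"{n}x{n}: {saved} multiplications saved "
          f"({rad} RAD + {rml} RML), {100 * fraction:.0f}% of the total")
```

Under these assumptions the saved fraction is 105/136 for a 16 × 16 system and 465/528 for a 32 × 32 system, matching the 77% and 88% figures.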
The complexity tradeoffs of the proposed detectors are particularly important for THz systems. Although THz communications promise Tbps data rates, state-of-the-art baseband clock speeds are confined to a few GHz [10]; at 1 Tbps with a 1 GHz clock, 1000 bits need to be processed per clock cycle. Furthermore, THz baseband processing capabilities are limited, and there are no energy-efficient transceivers capable of supporting 1 Tbps. While the main complexity burden comes from channel coding and channel code decoding, efficient data detection can still significantly reduce the overall complexity, especially in UM-MIMO scenarios. High parallelism is thus an architectural requirement. The parallelizability of the proposed subspace detectors can be exploited to reduce the frame length at the decoders' input. By splitting the code into sub-blocks corresponding to multiple channel decomposition outputs, each sub-block can be processed on a separate decoding core, reducing complexity and memory usage. However, this comes at the expense of additional calculations to mitigate the loss in performance at the sub-block borders.
Several extensions can further enhance the performance of the proposed detectors. For instance, we can easily modify the construction to generate log-likelihood ratios (LLRs) as reliability information in a soft-output (SO) setting. To generate LLRs in SSD, we decouple the streams in steps (assuming a single-user scenario with = ). In each step ∈ {1, · · · , } we calculate the LLRs of the bits of symbol ( = ). By exchanging LLRs between detection and decoding blocks, iterative detection and decoding schemes are realized following the "Turbo principle". In particular, the decoder can be fed a priori information LLR , the difference between the detector's SO and its own SO from the previous decoding iteration. The decoder then generates extrinsic LLRs in the form of a posteriori information denoted LLR .
Although iterative schemes are more complex and naturally ill-suited for Tbps constraints, we can adapt the number of iterations according to the THz channel conditions. We can maintain a trade-off between complexity and performance by favoring detection iterations and lowering the number of decoder iterations, for example. In particular, with inherent parallelizability in our proposed detectors, decoding iterations can be saved from specific sub-decoders and distributed to other blocks for better efficiency. Such adaptive iterative detection and decoding can be complemented by an adaptive transmission scheme (mapping bits to symbols). Note that the subspace detectors themselves can be made iterative [38].
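One common way to obtain such soft outputs from a list of candidate distances is the max-log approximation, sketched below together with the extrinsic subtraction used in iterative detection and decoding. This is a generic illustration rather than the exact LLR expression of the proposed detectors; the sign convention (positive values favor bit 0), the function names, and the numbers are assumptions.

```python
def max_log_llrs(candidates, noise_var):
    """Max-log LLRs from (bits, distance) pairs.

    candidates : list of (bits, squared_distance), bits being a tuple of 0/1
    noise_var  : complex noise variance
    Returns one LLR per bit, approximating log P(bit = 0) / P(bit = 1).
    """
    num_bits = len(candidates[0][0])
    llrs = []
    for b in range(num_bits):
        d0 = [d for bits, d in candidates if bits[b] == 0]
        d1 = [d for bits, d in candidates if bits[b] == 1]
        if not d0 or not d1:
            raise ValueError("candidate list lacks a hypothesis for bit %d" % b)
        llrs.append((min(d1) - min(d0)) / noise_var)
    return llrs


def extrinsic(posterior, prior):
    """Extrinsic LLRs: a posteriori information minus a priori information."""
    return [p - a for p, a in zip(posterior, prior)]


# Hypothetical distances for the four QPSK hypotheses of one symbol.
candidates = [((0, 0), 0.1), ((0, 1), 1.3), ((1, 0), 0.9), ((1, 1), 2.5)]
posterior = max_log_llrs(candidates, noise_var=0.5)
print(posterior, extrinsic(posterior, [0.5, -0.2]))
```

A candidate list that covers both values of every bit is needed for the LLRs to be defined, which is one reason list-based detectors search all root-layer symbols.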

VIII. SIMULATION RESULTS AND DISCUSSIONS
The proposed detectors are simulated according to the system model in Sec. II. Single-user SC and multi-user MIMO-NOMA scenarios are considered. Fig. 5 shows the BER plots for a single-user setting. For reference, all proposed detectors are simulated alongside the ML detector in Fig. 5(a), for 4×4 MIMO with QPSK, assuming two power-multiplexed data streams of the same size (4 each) and a power ratio of 1000 between streams 2 and 1. First, we note that the detectors maintain their diversity gains under SC, which means that the power separation is sufficient to cancel residual SIC errors. However, this comes at the expense of a larger SNR span. The best-performing detector is SSD, which achieves near-ML performance, followed by LORD, CD, and PCD, respectively (NC and PNC have a diversity order of 1). These results agree with the BER analysis in Sec. VI, where PCD is argued to lag behind CD due to the performance loss caused by puncturing. Nevertheless, puncturing is argued to result in performance enhancement in SSD compared to LORD.
The results of a THz UM-MIMO scenario where three data streams are multiplexed in a 16 × 16 configuration (stream sizes of 16, 8, and 4 for streams 3, 2, and 1, respectively) with 16-QAM are then shown in Figs. 5(c) to 5(d). Note that a 16 × 16 configuration at the level of SAs can still be considered a UM-MIMO setting because a very large number of AEs is required in each SA to achieve the required power gains. In Fig. 5(c), the power separation is on the order of 100 (a ratio of 100 between streams 3 and 2 and between streams 2 and 1) and THz spatial tuning is applied. No error floors are noted, which indicates that the power separation successfully decouples the NOMA streams. The best-performing user is user 3, and the worst-performing is user 1. As expected, all detectors show identical performance under orthogonal channels. In Figs. 5(b) and 5(d), for THz multipath and LoS channels, spatial tuning is relaxed, which introduces significant error floors due to channel correlation, despite a power separation on the order of 1000 to remove the impact of the residual SIC error. The gaps between different detectors are clearer at the user with the highest allocated power. The performance of LORD significantly deteriorates at higher SNR values under severe channel correlation, whereas SSD shows the highest resilience (more than an order of magnitude difference in BER). Note that the observed very high SNR values could be significantly reduced by adding antenna and beamforming gains in UM-MIMO, as argued in Sec. III. For instance, 1000 AEs per SA on both the transmitting and receiving sides would result in a 60 dB SNR gain (30 dB of array gain per side).
The BER plots for the multi-user NOMA setting of Sec. II-B are shown in Fig. 6 (16 × 16 MIMO and 16-QAM), where two users are accommodated per cluster. The simulated THz-specific NOMA system parameters are summarized in Table II. Three different detectors are tested: NC, LORD, and SSD. The detectors are applied directly at user 2, and successively to detect both symbol vectors at user 1.
Four different scenarios are simulated, all of which assume equal antenna numbers at the BS and the two users. The proposed JDPC scheme (solid curves) is compared to a reference optimal power control (PC) scheme (dotted curves) [39], which formulates the power allocation problem as an ergodic capacity maximization problem. The optimal PC scheme suffers from high complexity and slow convergence since it employs a bisection search method. In contrast, our joint clustering and power control scheme has low complexity and relies only on the distance-based path-loss parameter for channel inversion. The distance-based path loss is a highly relevant metric since THz channel conditions are strongly distance-dependent, and the low-complexity implementation of JDPC is crucial under Tbps baseband constraints. Unlike the optimal approach, our proposed JDPC scheme guarantees more power to the far user in a cluster, which is suitable for SIC. Furthermore, our proposed scheme results in lower power consumption on average: the total power allocated to the two users does not exceed the maximum transmit power, whereas the optimal scheme always transmits at the maximum power.
The results for a system where spatial tuning is configured on both users are shown in Fig. 6(a). Both optimal power control and channel-inversion-based power control achieve similar BER performances. Furthermore, SSD is clearly shown to outperform LORD at a lower complexity. The superiority of user 1 is also noted. However, tuning SA separations at the transmitter to achieve orthogonality on the channels of both users (at different distances) is not realistic, although optimization schemes can approach such solutions. Figures 6(b) to 6(d) illustrate the corresponding results when such tuning is relaxed on either or both of the channels. It is noted that spatial tuning of SA separations is superior to simple power allocation optimization, where the user with an orthogonalized effective channel avoids error floors.
In the presence of error floors, SSD schemes are more resilient.
Finally, it is worth noting that although the achievable gains of power-domain MIMO-NOMA systems are not entirely clear, high-frequency scenarios offer a compelling case for their utilization. In [40], the authors argue that MIMO-NOMA solutions can misuse the spatial dimension because they incur a multiplexing gain loss due to fully decoded streams in SIC. In particular, such loss is noted when comparing MIMO-NOMA to other candidate MIMO schemes such as conventional multi-user linear precoding (MU-LP) and newly proposed rate splitting (RS) techniques, but not when compared to OMA. On the one hand, our proposed efficient SIC subspace detectors can combat this reduction in multiplexing gain. On the other hand, with near-singular THz channels, spatial precoding in MU-LP fails to reduce inter-stream interference. Therefore, the power domain remains a crucial enabler for multiplexing data. Moreover, as an extension to this work, and given the importance of IRSs alongside UM-MIMO in THz systems, IRS-assisted multi-beam NOMA techniques can be considered [41]; passive IRSs can improve the performance of weak users without requiring additional transmit power.

IX. CONCLUSIONS
In this paper, we propose low-complexity subspace detectors for THz MIMO-NOMA systems. We leverage adaptive spatial tuning techniques to allocate NOMA resources and enhance channel conditions. The proposed detectors are studied analytically by deriving approximate error probability expressions and empirically via simulations of single-user (SC) and multi-user scenarios. We propose a low-complexity joint clustering and power control scheme that exploits the THz distance-based path-loss parameter to guarantee efficient SIC demodulation. We further present a simple architectural implementation design in which lower-complexity detectors are used as building components of more complex detectors. We demonstrate that the proposed detectors achieve significant parallelism and computational savings at low performance costs, both of which are much needed for realizing a Tbps baseband for THz communication applications.