GPU-accelerated partially linear multiuser detection for 5G and beyond URLLC systems

In this feasibility study, we have implemented a recently proposed partially linear multiuser detection algorithm in reproducing kernel Hilbert spaces (RKHSs) on a GPU-accelerated platform. Partially linear multiuser detection, which combines the robustness of linear detection with the power of nonlinear methods, has been proposed for massive connectivity scenarios with non-orthogonal multiple access (NOMA). This is a promising approach, but detecting payloads within a received orthogonal frequency-division multiplexing (OFDM) radio frame requires the execution of a large number of inner-product operations, which constitute the main computational burden of the algorithm. Although inner-product operations consist of simple kernel evaluations, their vast number poses a challenge in ultra-low latency (ULL) applications, because the time needed for computing the inner products might exceed the sub-millisecond latency requirement. To address this problem, this study demonstrates the acceleration of the inner-product operations through massive parallelization. The result is a GPU-accelerated real-time OFDM receiver that enables sub-millisecond detection latency to meet the requirements of 5th generation (5G) and beyond ultra-reliable and low latency communications (URLLC) systems. Moreover, the parallelization and acceleration techniques explored and demonstrated in this study can be extended to many other signal processing algorithms in Hilbert spaces, such as those based on projections onto convex sets (POCS) and the adaptive projected subgradient method (APSM). Experimental results and comparisons with the state of the art confirm the effectiveness of our techniques.


Introduction
Recently, a large body of research has been devoted to non-orthogonal multiple access (NOMA) [1][2][3][4] because the requirements of massive connectivity beyond 5th generation (5G) mobile networks necessitate the efficient use of time-frequency resources. In contrast to traditional orthogonal multiple access (OMA), such as orthogonal frequency-division multiple access (OFDMA), NOMA allocates the same time-frequency resource to multiple users in the same cell, which may result in strong multiuser interference. To deal with the strong interference, partially linear multiuser detection in reproducing kernel Hilbert spaces (RKHSs) has been proposed (e.g., see [5][6][7][8][9][10]). The reason for considering partially linear receivers is that, while nonlinear detectors can outperform linear detectors significantly in scenarios with strong multiuser interference, they may be highly sensitive even to small changes in a wireless environment (e.g., those caused by intermittent interference and multipath scattering in massive machine-type communication (mMTC) scenarios). In fact, theoretical studies [11] have shown that linear detectors can achieve a spectral efficiency comparable to nonlinear methods in massive multiple-input multiple-output (MIMO) systems. Therefore, in massive MIMO NOMA systems, purely nonlinear detectors may be inefficient. Based on these facts, the studies in [8][9][10] have designed a hybrid multiuser detector in RKHSs that combines the strengths of nonlinear and linear filters. The proposed supervised projection-based learning algorithm, which is a special case of the adaptive projected subgradient method (APSM) [12], has some very desirable features. For example, the algorithm learns to detect the modulation symbols of a user directly, without any intermediate parameter estimation, which is often prone to errors. Furthermore, if channel information is available, the algorithm can be initialized by a conventional linear filter, and the performance can be further improved in scenarios for which linear filters are insufficient when dealing with strong interference.
Basically, the APSM-based algorithm tracks the intersection of time-varying closed convex sets to which the desired solution (often a vector or a function) belongs. In this regard, it resembles classic projection-based set-membership estimation (e.g., see [13,14]). In general, such algorithms consist of a sequence of projections onto closed convex subsets of an appropriately defined Hilbert space. The closed convex subsets are often chosen such that the projections (which require computing inner products) admit a closed-form solution; for example, projections onto hyperplanes, closed balls, closed halfspaces, etc. It is known that the computational effort of projection-based algorithms is dominated by the cost of computing projections. In ultra-low latency (ULL) applications, however, straightforward implementations of the large number of seemingly simple inner products in an orthogonal frequency-division multiplexing (OFDM) radio frame may consume too much time to satisfy the temporal and latency constraints. Fortunately, many algorithms that involve projections and inner products in Hilbert spaces lend themselves very well to massive parallelism. In order to realize our goal of real-time GPU-accelerated multiuser detection, we exploit the intrinsic parallelism of APSM and carefully schedule operations across the memory hierarchy of the graphics processing unit (GPU).
Before we move on to our contributions, we note that in the future many Radio Access Network (RAN) functions will be virtualized on general-purpose commercial off-the-shelf (COTS) hardware. Recent industry trials have shown that the virtualization of RAN functions in the Distributed Units (DUs), which consist of computationally intensive baseband operations, may not be cost-effective (compared to traditional Application-Specific Integrated Circuit (ASIC) hardware) or viable without modern acceleration platforms. Therefore, baseband algorithms that are suited to such platforms are of current interest in the wireless industry.

Contributions
Our main contribution is a massively parallel GPU-accelerated NOMA system that can fulfill the real-time constraints of ULL in an OFDM-based system. Below, we summarize our contributions.
• Our focus in this study is to accelerate the detection of an OFDM radio frame, which may involve the computation of a large number of inner products. The acceleration is achieved by state-of-the-art parallelization and memory management techniques.
• The performance of the developed GPU-accelerated system is then demonstrated in a setup with 6 transmit and 16 receive antennas. Compared to the existing proof-of-concept (PoC) MATLAB implementation [8], we demonstrate that using massively parallel processor structures on a GPU reduces the processing time to milliseconds.
• GPU-accelerated signal processing is an important part of 5G and beyond communication systems [15]. Such platforms are particularly suited to signal processing algorithms consisting of a large number of projections requiring inner products that can be executed in parallel. This means that the techniques that we have developed in this work can also be used in other inner-product (and projection) based algorithms, such as projection onto convex sets (POCS) signal processing in Hilbert spaces.

Structure
The paper is organized as follows. In Section 2, we review the algorithm for multiuser detection proposed in [9]. In particular, some practical aspects of the algorithm are explained along with its potential for acceleration by parallel signal processing. Section 3 discusses the implementation of the algorithm and the techniques to optimize parallel processing and memory access on the targeted GPU platform. Section 4 describes the hardware equipment and the experimental setup deployed in the proof of concept. Finally, the results are presented in Section 5.

Machine learning based Multiuser Detection
In this section, we describe the APSM-based algorithm for multiuser detection that has been studied in [8][9][10]. Here, we focus on the implementation and computational aspects of the algorithm, and we avoid repeating unnecessary details.

Preliminaries
In the following, the sets of natural numbers, non-negative integers, real numbers, and complex numbers are denoted by N, N_0, R, and C, respectively. For i, j ∈ N_0 with i ≤ j, we define the integer range i, j := {i, i + 1, ..., j}. This study deals with adaptive filtering in special (real) Hilbert spaces known as RKHSs. Briefly, given an arbitrary (linear) subspace U ⊆ R^l, an RKHS (H, ⟨·, ·⟩_H) is a Hilbert space, with inner product ⟨·, ·⟩_H inducing the norm ‖·‖_H := √⟨·, ·⟩_H, that is uniquely associated with a positive definite function known as the reproducing kernel κ : U × U → R of H. The defining reproducing property

(∀u ∈ U)(∀f ∈ H) f(u) = ⟨f, κ(u, ·)⟩_H,    (1)

implies in particular that ⟨κ(u, ·), κ(v, ·)⟩_H = κ(u, v), i.e., inner products between kernel functions reduce to kernel evaluations. In this study, κ is either the linear kernel, denoted by κ_L and defined as (∀u, v ∈ U) κ_L(u, v) := u^T v, or the Gaussian kernel, denoted by κ_G and defined as (∀u, v ∈ U) κ_G(u, v) := exp(−‖u − v‖²_{R^l} / (2σ²)), where ‖·‖_{R^l} is the Euclidean norm in R^l and the variance σ² > 0 is a design parameter for the kernel width. The RKHSs associated with κ_L and κ_G are denoted by (H_L, ⟨·, ·⟩_{H_L}) and (H_G, ⟨·, ·⟩_{H_G}), respectively.

Partially Linear Filter Design
The studies in [8][9][10] combine the robustness of linear beamforming filters with the higher spatial resolution of nonlinear beamforming filters by designing a partially linear filter in RKHSs. In more detail, the RKHS H_G associated with the Gaussian kernel κ_G is combined (i.e., summed) with the RKHS H_L associated with the linear kernel κ_L to obtain a sum RKHS of partially linear filters. To be more precise, a partially linear filter f = f_L + f_G, with linear part f_L ∈ H_L and Gaussian part f_G ∈ H_G, is defined as an element of the real sum RKHS H := H_L + H_G associated with the kernel

κ(u, v) := w_L κ_L(u, v) + w_G κ_G(u, v),    (2)

where w_L, w_G ≥ 0 are fixed weights for the linear and the Gaussian part, respectively. Fact 1 shows how the kernel and inner products are computed in H:

Fact 1 (Reproducing kernel of the weighted sum space [16]). Assume that the input space U ⊆ R^l has a nonempty interior. Then, given any w_L, w_G > 0 and u, v ∈ U, κ(u, v) in (2) is the reproducing kernel of the sum space H equipped with the inner product

⟨f, g⟩_H = (1/w_L) ⟨f_L, g_L⟩_{H_L} + (1/w_G) ⟨f_G, g_G⟩_{H_G},    (3)

where f = f_L + f_G and g = g_L + g_G with f_L, g_L ∈ H_L and f_G, g_G ∈ H_G.

The inner products in the two component RKHSs in (3) are simple to compute because, according to (1), they consist of (mainly) kernel evaluations. Note that the closed form in (3) is in general not valid for arbitrary choices of the two kernels.
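To make the kernel and inner-product computations above concrete, the following CUDA C++ sketch evaluates the linear kernel, the Gaussian kernel, and the weighted sum-space kernel in (2) for two real input vectors. All identifiers are our own illustrative choices and do not correspond to the library released in [23].

```cuda
#include <cmath>

// Linear kernel kappa_L(u, v) = u^T v for vectors of dimension dim.
__host__ __device__ inline float kappaLinear(const float* u, const float* v, int dim) {
    float acc = 0.0f;
    for (int i = 0; i < dim; ++i) acc += u[i] * v[i];
    return acc;
}

// Gaussian kernel kappa_G(u, v) = exp(-||u - v||^2 / (2 * sigma^2)).
__host__ __device__ inline float kappaGaussian(const float* u, const float* v, int dim, float sigma2) {
    float dist2 = 0.0f;
    for (int i = 0; i < dim; ++i) {
        float d = u[i] - v[i];
        dist2 += d * d;
    }
    return expf(-dist2 / (2.0f * sigma2));
}

// Reproducing kernel of the weighted sum space H, cf. (2) and Fact 1:
// kappa(u, v) = w_L * kappa_L(u, v) + w_G * kappa_G(u, v).
__host__ __device__ inline float kappaSum(const float* u, const float* v, int dim,
                                          float wLin, float wGauss, float sigma2) {
    return wLin * kappaLinear(u, v, dim) + wGauss * kappaGaussian(u, v, dim, sigma2);
}
```

Note that each evaluation is a simple component-by-component loop followed by an accumulation, which is exactly the structure exploited for parallelization in Sections 2.5 and 3.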

Multiuser Detection
Consider a multiuser uplink with K users and M receive antennas. We assume a non-dispersive channel, so that the received signal (sampled at a fixed symbol rate) at time t ∈ N_0 is given by

r(t) := Σ_{k=1}^{K} √(p_k) h_k b_k(t) + n(t) ∈ C^M,    (4)

where p_k ∈ R is the transmit power of user k ∈ 1, K, and b_k(t) ∈ C is the modulation symbol. The vectors h_k ∈ C^M and n(t) ∈ C^M stand for the channel signature of user k and the additive noise, respectively. Note that we do not assume any particular distribution of the noise or any particular structure of the receive antenna array. The objective of multiuser detection considered in this study is to design a filter g_k : C^M → C for a selected user k such that (∀t ∈ N_0) |g_k(r(t)) − b_k(t)| ≤ ε, where ε > 0 is a small predefined noise tolerance. In other words, the goal is to detect the desired modulation symbols directly, without any intermediate parameter estimation.

Adaptive Detection in Sum RKHS
In this section, we describe the APSM-based detection algorithm of [8][9][10]. We convert the complex received vector r(t) in (4) into real-valued vectors

r_1(t), r_2(t) ∈ R^{2M},    (5)

constructed from the real and imaginary parts of r(t), which enables processing in real Hilbert spaces as considered in [17,18]. Similarly, the training modulation symbols are converted to real values. The proposed filter f : R^{2M} → R operates on r_1(t) and r_2(t) separately. The relation between f and the complex-valued filter g described in Section 2.3 is given by (∀t ∈ N_0) g(r(t)) = f(r_1(t)) + i f(r_2(t)), where i is the solution to the equation i² = −1. To simplify indexing, we define a new time index n ∈ N_0 that enumerates the real-valued received vectors, so that each time instant t contributes the two samples r_1(t) and r_2(t) to the sequence (r_n)_{n∈N_0}, with corresponding real-valued training symbols (b_n)_{n∈N_0}. Henceforth, we denote the input space of received signals by U := {r_n ∈ R^{2M} : n ∈ N_0}. We now turn our attention to the design of an adaptive filter f such that (∀n ∈ N_0) |f(r_n) − b_n| ≤ ε, where the precision is controlled by the design parameter ε > 0. We assume that f ∈ H and that a training sample (r_n, b_n) ∈ U × R is available for every n ∈ N_0. Then, a closed and convex set of functions in H consistent with the training sample at time n is given by

C_n := {f ∈ H : |f(r_n) − b_n| ≤ ε}.    (6)

In the online learning setting considered here, the training samples arrive as a sequence, and each sample defines a set of the form in (6). Ideally, the objective is to find a filter f* ∈ H such that f* is a member of all these sets, i.e., f* ∈ ∩_{n∈N_0} C_n. However, since it is challenging to find a low-complexity algorithm to solve this problem, we allow a finite number of sets not to share a common intersection and consider the simplified problem

find f* ∈ ∩_{n≥n_o} C_n    (7)

for some n_o ∈ N_0, under the assumption that ∩_{n≥n_o} C_n ≠ ∅. The advantage of the above formulation is that we can find an f ∈ H that is close to the intersection in (7) by utilizing an APSM-based [17,19] algorithm, which we describe below.
As in [18,19], given an index set J_n of size W, and starting from an arbitrary f_0 ∈ H, we construct a sequence of filter estimates in H given as

f_{n+1} := f_n + Σ_{j∈J_n} q_j^n ( P_{C_j}(f_n) − f_n ),    (8)

where

P_{C_j}(f_n) := f_n + β_j^n κ(r_j, ·)

is the projection of f_n onto the set C_j, with β_j^n given by

β_j^n := (b_j − ε − f_n(r_j)) / κ(r_j, r_j)  if f_n(r_j) < b_j − ε,
β_j^n := 0                                    if |f_n(r_j) − b_j| ≤ ε,    (9)
β_j^n := (b_j + ε − f_n(r_j)) / κ(r_j, r_j)  if f_n(r_j) > b_j + ε,

and where (q_j^n)_{j∈J_n} are non-negative weights satisfying Σ_{j∈J_n} q_j^n = 1.
The index set J_n, defined as J_n := n − W + 1, n if n ≥ W − 1 and J_n := 0, n otherwise, allows for a subset of the sets C_1, C_2, ..., C_n to be processed concurrently to accelerate convergence, and the weights q_j^n can be used to adaptively prioritize the sets. The computational advantage of this algorithm is that the projection P_{C_j}(f_n) only requires simple inner products, and the overall algorithm is amenable to parallelization, resulting in significant acceleration as discussed in Section 2.5.
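As a purely serial reference, the following CUDA C++ sketch spells out one update of the form (8)-(9) under simplifying assumptions: the filter is stored as a single kernel expansion f_n = Σ_i γ_i κ(r_i, ·) over the sum kernel, the weights are uniform (q_j^n = 1/W), and all identifiers are ours. The actual implementation in [8][9][10] and [23] keeps a fixed linear dictionary and a growing Gaussian dictionary, as described below, and it parallelizes exactly the inner products that appear in this loop.

```cuda
#include <cmath>
#include <cstddef>
#include <vector>

// Serial reference sketch of one APSM update (8)-(9) under simplifying assumptions
// (single sum-kernel dictionary, uniform weights). Not the optimized GPU code path.
struct ApsmFilterSketch {
    std::vector<std::vector<float>> centers;  // dictionary entries r_i (dimension 2M each)
    std::vector<float> gamma;                 // expansion coefficients gamma_i
    float wLin = 0.5f, wGauss = 0.5f;         // weights of the sum kernel
    float sigma2 = 0.05f;                     // Gaussian kernel variance
    float eps = 1e-2f;                        // noise tolerance of the sets C_j

    // kappa(u, v) = w_L * u^T v + w_G * exp(-||u - v||^2 / (2 sigma^2)), cf. (2).
    float kappa(const std::vector<float>& u, const std::vector<float>& v) const {
        float dot = 0.0f, dist2 = 0.0f;
        for (std::size_t i = 0; i < u.size(); ++i) {
            dot += u[i] * v[i];
            float d = u[i] - v[i];
            dist2 += d * d;
        }
        return wLin * dot + wGauss * std::exp(-dist2 / (2.0f * sigma2));
    }

    // f_n(r) = <f_n, kappa(r, .)>_H: the inner product dominating the complexity.
    float evaluate(const std::vector<float>& r) const {
        float acc = 0.0f;
        for (std::size_t i = 0; i < centers.size(); ++i) acc += gamma[i] * kappa(centers[i], r);
        return acc;
    }

    // One update over the window of W samples (r_j, b_j) with uniform weights q = 1/W.
    // Each active projection onto C_j adds the kernel function beta_j * kappa(r_j, .).
    void update(const std::vector<std::vector<float>>& rWin, const std::vector<float>& bWin) {
        const float q = 1.0f / static_cast<float>(rWin.size());
        for (std::size_t j = 0; j < rWin.size(); ++j) {
            float err = evaluate(rWin[j]) - bWin[j];             // f_n(r_j) - b_j, cf. (9)
            float beta = 0.0f;
            if (err > eps)       beta = (eps - err) / kappa(rWin[j], rWin[j]);
            else if (err < -eps) beta = (-eps - err) / kappa(rWin[j], rWin[j]);
            if (beta != 0.0f) {                                   // projection moves the estimate
                centers.push_back(rWin[j]);                       // grow the dictionary
                gamma.push_back(q * beta);                        // weighted contribution in (8)
            }
        }
    }
};
```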
Before we move on to the next section, we note that the filter estimate generated by (8) can be decomposed into a linear and a Gaussian component, f_n = f_{L,n} + f_{G,n} with f_{L,n} ∈ H_L and f_{G,n} ∈ H_G, each represented by a dictionary of functions and the corresponding coefficients γ_i^(n) [9]. Since H_L is nothing but the Euclidean space R^{2M}, it is spanned by the Euclidean basis (which we refer to as the linear dictionary in the following) D_L := {e_1, ..., e_{2M}}, where e_m ∈ R^{2M} is the canonical basis vector having a one at the m-th index and zeros elsewhere. Hence, every κ_L(r_n, ·) appearing in (8) can be expressed in terms of this fixed basis, so the linear dictionary does not grow. In contrast to the linear component, the Gaussian dictionary D_{G,n} := {κ_G(r_i, ·)}_i, which collects the Gaussian kernel functions centered at the received training vectors, grows with time n as the iterations progress.
Remark 1 (ML: Computation and Communication). It is common in the literature to assume that the channels between the users and the base stations remain constant only for a certain time period known as the channel coherence time [20]. Therefore, the training time should be much smaller than the coherence time, such that a large portion of the coherence time can be used for detection/communication. From a processing and computation point of view, the training time is the total time it takes to collect/sample the training set and learn a good detection filter. This means that learning algorithms that have low complexity and can be accelerated are highly desired. Furthermore, in high-data-rate communication, fast detection is also highly desirable, so that as much data as possible can be detected during the coherence time.

Acceleration via Parallel Processing
As mentioned above, the concurrent projections in (8) accelerate the convergence. Unfortunately, large values of W result in a significant computational and memory burden, which may cause intolerable latency in real-time applications. However, note that the W projections in (8) are independent, which means that they can be computed in parallel on platforms equipped with GPUs optimized for such applications. In addition, the computation of β_j^n in (9) involves the inner product ⟨f_n, κ(r_j, ·)⟩_H = f_n(r_j). Because f_n is a weighted sum of kernel functions κ(r_i, ·), this inner product is a sum of (up to) n independent inner products (equivalently, kernel evaluations due to (1)), so these can also be computed in parallel. Furthermore, each linear kernel evaluation requires the dot product u^T v, while the Gaussian kernel evaluation requires the calculation of the squared Euclidean norm (u − v)^T (u − v) followed by further operations. This means that both kernel evaluations are independent component-by-component computations for any two input vectors, followed by further operations on the sum or accumulation of these computations. The complexity of the above operations is linear in the dimension of the input vectors (proportional to the number of antennas M in this study); however, the computation time can be reduced drastically if the component-by-component computations are executed in parallel. Finally, each computation requires access to data stored in various memory locations (discussed in Section 3) on the computing platform, which means that, in addition to the computational aspects, special attention should be paid to memory allocation and access. Before we move on to the real-time implementation of the algorithm in (8), we remark on how this APSM-based algorithm can be implemented in practical communication systems.
Remark 2 (Focus on Detection). APSM is an online algorithm that keeps track of the set of minima of an infinite number of time-changing objective functions. As mentioned in Remark 1, in a traditional communication system, an initial "training phase" is generally carried out during which the desired user sends a certain number of training pilots to the receiver, which allows the receiver to approximate a good initial receive filter using APSM. The training phase is then stopped, and it is followed by the communication or "detection phase", during which the response of the trained filter is used as an estimate of the desired user's modulation symbols. In principle, retraining is only required if the communication channels change abruptly, resulting in large errors. In situations where communication channels change gradually, such as mobile scenarios, we can, in principle, track these changes using APSM through the data symbols. Therefore, after the initial training phase, the latency mainly depends on the time incurred during the detection of radio frames, which involves the computation of a large number of inner products according to (1) and (3). For this reason, the acceleration of detection is the main focus of our work in the following.

Real-time implementation
In this section, we review the target platform, followed by a detailed description of the parallelization and optimization techniques used to achieve real-time performance.

Platform overview
GPUs are massively parallel processing devices that share a common architecture for general-purpose programming using the Compute Unified Device Architecture (CUDA) programming model. CUDA is a C++ extension that allows parallel code to be described as threads organized in a hierarchical structure, while also providing a framework for the communication between the host and the device. Each thread represents a logically independent sequence of instructions, similar to a thread in a classic programming model. Threads are grouped into blocks (groups of threads) and grids (groups of blocks). This hierarchical organization maps very well onto the device structure, with threads organized within the Streaming Multiprocessors (SMs) as warps on the hardware cores. A warp is a common grouping of cores executing in parallel as single instruction multiple threads (SIMT), and an SM can have one or more warps active at any given time, depending on the number of cores available per SM architecture. GPUs are composed of one or more SMs, as depicted in Figure 1.
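The following minimal CUDA C++ example illustrates this hierarchy: a grid of blocks is launched, and each thread computes one element of an element-wise product, a toy stand-in for the per-component work that appears in the kernel evaluations. The function names and launch parameters are illustrative only.

```cuda
#include <cuda_runtime.h>

// Each thread handles one vector component: blocks and grids simply partition the index range.
__global__ void elementwiseProduct(const float* u, const float* v, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index of this thread
    if (i < n) out[i] = u[i] * v[i];
}

void launchElementwiseProduct(const float* dU, const float* dV, float* dOut, int n) {
    int threadsPerBlock = 256;                                  // threads grouped into a block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // blocks grouped into a grid
    elementwiseProduct<<<blocks, threadsPerBlock>>>(dU, dV, dOut, n);
}
```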
The above-mentioned hierarchy implies several types of memory, which differ in their sizes, bandwidths, and latencies. Global memory is available to all threads and is the largest (several gigabytes, off chip), but it also has the highest latency (from a hundred up to thousands of cycles). Shared memory resides within the SMs and is only available to the threads belonging to the same block. It has a lower latency (tens of cycles) than global memory and is smaller in size (tens of kilobytes, on chip). Finally, registers are local to every thread, limited in number (at most a few hundred per thread), and have the lowest latency.
Note that threads can cooperate with each other to perform tasks more efficiently, for example by operating in batches to reduce the perceived latency of the overall memory access (known as latency hiding). Finally, the architecture provides only limited guarantees on execution order and memory consistency across the different levels of the hierarchy. A detailed description of the platform can be found in the CUDA programming guide [21].

Basic Implementation
As discussed at the end of Section 2.4, to access the filter estimate f_n in (8), we require access to the overall dictionary D_n := D_{G,n} ∪ D_L and the coefficients γ_i^(n) at time n. The new filter estimate f_{n+1} is computed using projections onto a sequence (sliding window) of W convex sets constructed from f_n, the training sample received at time n, and D_n. The Gaussian dictionary is then extended with the newly arriving training sample and, along with the new coefficients γ_i^(n+1), it is used to store f_{n+1}.
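One possible device-memory layout for this state is sketched below; the flat row-major arrays, the preallocation strategy, and all names are our own assumptions and do not necessarily match the released library [23].

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Illustrative device-side storage for the overall dictionary D_n and the coefficients.
// The linear dictionary is the fixed canonical basis, so only its 2M coefficients are stored;
// the Gaussian dictionary grows by (at most) one center per processed training sample.
struct DeviceFilterState {
    float* linearCoeffs;     // [dim]            coefficients of the linear part (dim = 2M)
    float* gaussianCenters;  // [capacity x dim] Gaussian dictionary entries r_i, row-major
    float* gaussianCoeffs;   // [capacity]       coefficients gamma_i of the Gaussian part
    int dim;                 // input dimension 2M
    int size;                // current number of Gaussian dictionary entries
    int capacity;            // preallocated maximum, so training never reallocates
};

// Preallocate the state on the device so the dictionary can grow without reallocation.
inline cudaError_t allocateFilterState(DeviceFilterState& s, int dim, int capacity) {
    s.dim = dim; s.size = 0; s.capacity = capacity;
    cudaError_t err = cudaMalloc(reinterpret_cast<void**>(&s.linearCoeffs), dim * sizeof(float));
    if (err == cudaSuccess)
        err = cudaMalloc(reinterpret_cast<void**>(&s.gaussianCenters),
                         static_cast<std::size_t>(capacity) * dim * sizeof(float));
    if (err == cudaSuccess)
        err = cudaMalloc(reinterpret_cast<void**>(&s.gaussianCoeffs), capacity * sizeof(float));
    return err;
}
```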
Every training sample (the received vector in (5)) consists of samples from multiple antennas at time n during the training phase. One CUDA block can be used for each vector in the sliding window, with multiple threads inside the block computing the inner products ⟨f_n, κ(r_j, ·)⟩_H in parallel. The final summation is performed in a reduction step, which can also be executed in parallel. As a new training sample arrives, the sliding window advances from n to n + 1 and the process is repeated. This implementation variant is depicted in Figure 2. Furthermore, as discussed in Section 2.5, since both the inner products and the per-vector operations can be computed in parallel, we can use the GPU resources efficiently, as detailed later in this section.
The detection stage computes f_n(r_n) (according to the formulas in (1) and (3)) as an estimate of b_n, where f_n is the trained filter at detection time n and r_n is the received vector. It does not modify D_n or γ_i^(n). Since f_n(r_n) is essentially the inner product ⟨f_n, κ(r_n, ·)⟩_H, the estimation can also be parallelized by processing each input vector in parallel, one per block. The required n − 1 inner products are computed, one per thread, within the block. Finally, the results from the threads in a block are summed to produce the symbol estimate for the input vector. This is depicted in Figure 3. We point out that further parallelization is possible, as the detection of each desired user can be performed independently. The number of users that can be detected in parallel is therefore only limited by the available GPU resources.
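A minimal sketch of this mapping is given below: one CUDA block per received vector, each thread striding over dictionary entries, and a block-wide tree reduction producing the estimate. The row-major data layout, the power-of-two block size, and all names are our own illustrative assumptions.

```cuda
#include <cuda_runtime.h>

// One block per received vector r_n; each thread accumulates gamma_i * kappa(r_i, r_n)
// for a strided subset of dictionary entries; a shared-memory tree reduction forms f_n(r_n).
// Launch example: detectSymbols<<<numVectors, 256, 256 * sizeof(float)>>>(...);
__global__ void detectSymbols(const float* __restrict__ received,   // [numVectors x dim]
                              const float* __restrict__ dictionary, // [dictSize  x dim]
                              const float* __restrict__ gamma,      // [dictSize]
                              float* __restrict__ estimates,        // [numVectors]
                              int dim, int dictSize,
                              float wLin, float wGauss, float sigma2) {
    extern __shared__ float partial[];              // one partial sum per thread
    const float* r = received + blockIdx.x * dim;   // vector handled by this block

    float acc = 0.0f;
    for (int d = threadIdx.x; d < dictSize; d += blockDim.x) {
        const float* c = dictionary + d * dim;
        float dot = 0.0f, dist2 = 0.0f;
        for (int i = 0; i < dim; ++i) {             // component-by-component work
            dot += c[i] * r[i];
            float diff = c[i] - r[i];
            dist2 += diff * diff;
        }
        acc += gamma[d] * (wLin * dot + wGauss * expf(-dist2 / (2.0f * sigma2)));
    }
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction over the block (assumes a power-of-two number of threads).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) estimates[blockIdx.x] = partial[0];   // f_n(r_n)
}
```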

Optimization of Inner Products
Both training and detection share the computation of the inner products ⟨f_n, κ(r_n, ·)⟩_H, which is the dominant operation of projection-based algorithms. Implemented as a CUDA kernel to be executed on the GPU (not to be confused with the kernel functions κ), it is the primary target for optimization with respect to the latency of the system.
Parameters like the number of threads per group, the number of groups per block, and the total number of vectors cached in shared memory all influence the performance. Device properties like the type and speed grade of global memory, the number of SMs, and the amount of shared memory are also important to consider. The optimal set of parameters is found by sweeping the parameter space, combining experimentation with guidance from profiling tools.
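A convenient starting point for such a sweep is to query the CUDA occupancy calculator at run time, as in the example below; the placeholder kernel merely stands in for the inner-product kernel being tuned, and the reported numbers are then refined by profiling.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder standing in for the inner-product CUDA kernel whose launch parameters are tuned.
__global__ void innerProductKernel(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] * b[i];
}

int main() {
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for a block size that maximizes occupancy for this kernel;
    // the result is only a starting point for the parameter sweep described above.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, innerProductKernel, 0, 0);

    int blocksPerSm = 0;   // achievable resident blocks per SM at that block size
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, innerProductKernel, blockSize, 0);

    std::printf("suggested block size: %d, resident blocks per SM: %d\n", blockSize, blocksPerSm);
    return 0;
}
```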
While optimizing the parameters that determine the execution of the CUDA kernel is part of performance optimization, algorithm transformations for massively parallel architectures are required beforehand. In the following sections, we explore the principal techniques that allow us to achieve the desired performance of a detection latency below 1 millisecond.

Multiple Sample Vectors per Block
Our first technique aims at improving the occupancy of the device, which is a measure of the utilization of the resources available within the GPU. Utilization, which is a function of several factors, can be maximized by allocating enough work to keep the device busy within the constraints of the hardware. The most important constraints are the number of threads per SM, the number of registers used per thread, and the amount of shared memory utilized by a CUDA block.
To improve the occupancy of the CUDA kernel, we describe the algorithm using Cooperative Groups (which are primitives for partitioning work inside a CUDA block, see [21]) and assign the processing of an input vector r_n in (5) to a single group. The threads within this group are responsible for performing the parallel computations for a particular input vector, and the result is reduced (accumulated) at the end in parallel to obtain the estimated symbol. We then assign multiple groups to one CUDA block, with each group processing a different received vector within the block. We can therefore use the number of groups in a block to maximize the occupancy of the SM.
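The sketch below illustrates this partitioning with the Cooperative Groups API: each block is split into warp-sized tiles, and each tile processes one received vector. For brevity, only the linear part of the kernel expansion is evaluated, and the group size, data layout, and names are our own assumptions.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

constexpr int GROUP_SIZE = 32;   // one warp per group (a tunable assumption)

// Each tile of GROUP_SIZE threads handles one received vector; several tiles share a block.
__global__ void detectGrouped(const float* __restrict__ received,   // [numVectors x dim]
                              const float* __restrict__ dictionary, // [dictSize  x dim]
                              const float* __restrict__ gamma,      // [dictSize]
                              float* __restrict__ estimates,        // [numVectors]
                              int dim, int dictSize, int numVectors) {
    cg::thread_block block = cg::this_thread_block();
    auto group = cg::tiled_partition<GROUP_SIZE>(block);

    int groupsPerBlock = blockDim.x / GROUP_SIZE;
    int vec = blockIdx.x * groupsPerBlock + group.meta_group_rank();  // vector of this group
    if (vec >= numVectors) return;
    const float* r = received + vec * dim;

    float acc = 0.0f;
    for (int d = group.thread_rank(); d < dictSize; d += GROUP_SIZE) {
        const float* c = dictionary + d * dim;
        float dot = 0.0f;
        for (int i = 0; i < dim; ++i) dot += c[i] * r[i];   // linear kernel evaluation only
        acc += gamma[d] * dot;
    }
    // Reduce the per-thread partial sums within the group via shuffles.
    for (int offset = GROUP_SIZE / 2; offset > 0; offset /= 2)
        acc += group.shfl_down(acc, offset);
    if (group.thread_rank() == 0) estimates[vec] = acc;
}
```

The number of groups per block (blockDim.x / GROUP_SIZE) then becomes the tuning knob used to saturate the SMs.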

Usage of Shared Memory
Our second technique benefits from using the shared memory within the SMs of a device. As mentioned before, each block processes multiple input vectors and, for each input vector, computes inner products using the same dictionary entries. Hence, each entry of the dictionary may be loaded once and reused multiple times. Shared memory is well suited for this because it is accessible to all groups within a block. As a consequence, global memory traffic is reduced by lowering the amount of data requested, which reduces the time to load the data.
Every thread in the block reads the vector r_n from memory and then calculates the inner products with one entry of the cached shared slice of D_n. Once the inner products are calculated for this slice, a new slice of D_n is read into the cache, and this process continues until the inner products for all members of D_n are computed. This is depicted in Figure 4 for the threads of one group; other groups (not depicted) in the same block operate over different input vectors but the same cached slice of the dictionary.
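The following sketch shows one way to stage dictionary slices through shared memory, here for a single received vector per block and the linear kernel only; the tile and block sizes, the layout, and all names are illustrative assumptions rather than the tuned configuration reported later.

```cuda
#include <cuda_runtime.h>

constexpr int DIM   = 32;   // dimension of r_n (2M); an assumption for the example
constexpr int TILE  = 64;   // dictionary entries cached per slice
constexpr int BLOCK = 64;   // threads per block (power of two)

// Dictionary entries are staged slice by slice in shared memory and reused by all threads.
// Launch example: detectSharedDict<<<numVectors, BLOCK>>>(...);
__global__ void detectSharedDict(const float* __restrict__ received,   // [numVectors x DIM]
                                 const float* __restrict__ dictionary, // [dictSize  x DIM]
                                 const float* __restrict__ gamma,      // [dictSize]
                                 float* __restrict__ estimates,        // [numVectors]
                                 int dictSize) {
    __shared__ float sliceVec[TILE][DIM];   // cached dictionary slice
    __shared__ float sliceGam[TILE];        // cached coefficients for the slice
    __shared__ float partial[BLOCK];        // per-thread partial sums for the final reduction

    const float* r = received + blockIdx.x * DIM;   // one received vector per block
    float acc = 0.0f;

    for (int base = 0; base < dictSize; base += TILE) {
        // Cooperative load of the next slice of the dictionary into shared memory.
        for (int idx = threadIdx.x; idx < TILE * DIM; idx += BLOCK) {
            int e = idx / DIM, i = idx % DIM;
            sliceVec[e][i] = (base + e < dictSize) ? dictionary[(base + e) * DIM + i] : 0.0f;
        }
        for (int e = threadIdx.x; e < TILE; e += BLOCK)
            sliceGam[e] = (base + e < dictSize) ? gamma[base + e] : 0.0f;
        __syncthreads();

        // Each thread processes a subset of the cached entries, reusing the slice.
        for (int e = threadIdx.x; e < TILE; e += BLOCK) {
            float dot = 0.0f;
            for (int i = 0; i < DIM; ++i) dot += sliceVec[e][i] * r[i];
            acc += sliceGam[e] * dot;
        }
        __syncthreads();
    }

    partial[threadIdx.x] = acc;
    __syncthreads();
    for (int s = BLOCK / 2; s > 0; s >>= 1) {       // block-wide tree reduction
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) estimates[blockIdx.x] = partial[0];
}
```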

Tuning and Hiding Latency
While calculating the inner products, the CUDA kernel spends most of its time accessing shared memory, and the computations that follow the memory reads are faster than the memory accesses, which can be verified with a profiler tool. Therefore, it is desirable to have a balance between memory accesses and computation, so that neither dominates (and possibly limits) the performance. Both inner products (i.e., the linear and the Gaussian kernel evaluations) require access to the received vector r_n and to members of the dictionary D_n. The simple approach described in Section 3.3.2 can be optimized by changing the order of the computations and increasing the amount of data in the cached slice (see Figure 5). By fetching only a subset of r_n once and reusing it across a bigger section of the cache, without computing the full inner product in one go, the number of cache loads is reduced and the efficiency is increased. Note that this comes at the relatively small cost of requiring intermediate storage in shared memory until the entire r_n has been processed. This rearrangement of the inner-product computations is shown in Figure 5; it achieves a better balance between memory accesses and computations, allowing loads and computation to overlap and cutting the overall inner-product computation time roughly in half.

Experimental Setup
This section describes the hardware components of our real-time multiuser detection setup. The transmitter, receiver, and signal processing equipment are shown in Figure 6. Except for the Uniform Planar Array (UPA) shown in Figure 7, all components are COTS devices.

Signal Processing
For development and testing, we use different local servers as well as several high-performance computing (HPC) units that are equipped with multiple central processing unit (CPU) cores and at least one GPU. Our local server, shown in Figure 6(a), is equipped with a Xeon W-3245 CPU and two RTX 2080 Ti consumer GPUs. This server also handles the signal processing and the data transfer from and to the Software-Defined Radios (SDRs).

Radio Access
The base station is composed of four Ettus USRP N310 SDRs. Furthermore, we use a single National Instruments (NI) OctoClock as a global positioning system (GPS) disciplined clock and timing source for our SDRs. With this setup, all four SDRs, each equipped with four ports on the transmitter (Tx) and receiver (Rx) paths, behave like a single SDR system with up to sixteen synchronized physical antenna ports for both the Rx and Tx paths.
A subset of our system parameters is shown in Table 1 and is also given in Section 5. For over-the-air signal transmission, we use an OFDM system with a sampling rate of 30.72 MHz. Note that we can switch between radio signals from our UPA, shown in Figure 7(a), and our Uniform Circular Array (UCA), shown in Figure 7(b). The receive antenna arrays and the user equipment (the transmitters) are described below.

Users Antennas
Each user antenna is installed on a tripod, allowing for conveniently adjusting its location and height relative to the receiving antenna array. Every single antenna shown in Figure 6(b) represents a single transmitting user, sending its own radio signal, and is connected to one of the eight transmitter ports on our SDR module [22]. According to the datasheet, the antenna gain is 5 dBi at 2.5 GHz and up to 7 dBi at 5.7 GHz, and the antenna polarization is vertical.

Uniform Planar Array
The UPA, shown in Figure 7(a), is equipped with 32 cross-polarized patch antenna elements. These elements are arranged in 4 rows and 8 columns. Each of the 32 equidistant patch elements operates at a center frequency of 2.442 GHz.

Uniform Circular Array
The sixteen physical Rx antennas are arranged as a UCA, as shown in Figure 7(b). Uniform spacing is ensured by a ring retainer around the 16 antennas. The antennas operate in the 2.4 GHz and 5 GHz Wireless Local Area Network (WLAN) bands with an omnidirectional radiation pattern and vertical polarization.

Performance Results
In this section, we present performance results for the experiments based on the setup presented in Section 4. The objective is to demonstrate the real-time performance of the partially linear filtering algorithm presented in Section 2.4 in combination with the parallelization techniques and optimizations presented in Section 3. In the following, we first show the performance of the partially linear filtering for various settings and parameters. We then compare the detection latencies of multiple platforms and show the acceleration in processing achieved by the techniques proposed in this study.
For the algorithm in (8), unless noted otherwise, we use uniform Gaussian and linear weights, i.e., w_L = w_G = 0.5, a sliding window size of W = 20, and a Gaussian kernel variance of σ² = 0.05. We use 16 antennas and 6 users with uniform transmit power. Each data point is an average over 100 transmissions, each consisting of 685 training symbols and 3840 data symbols. The symbols consist of Gray-coded constellation points. Modulated training and data symbols occupy up to 144 subcarriers on 5 OFDM symbols and 27 OFDM symbols, respectively. We use 15 kHz subcarrier spacing, which results in a total signal bandwidth of 2.16 MHz.

Figure 8 shows the BER performance of standard modulation schemes. All experiments were performed with the above-mentioned parameters. The error rate is negligible for binary phase-shift keying (BPSK) and quadrature phase-shift keying (QPSK), while for higher-order modulations a significant number of bit errors occur. As shown in Figure 6(b), the angular separation of the users is very small, which renders the task of separating them difficult. Nevertheless, we observe that our algorithm can reach acceptable bit error rates even for high modulation schemes. Note that in a practical system our algorithm would be followed by forward error correction (FEC), which will correct the remaining bit errors.
In Figure 9, we demonstrate the BER performance as a function of an increasing number of antennas, where the specified number of antennas was chosen randomly among the 16 antennas available in our system. As mentioned in Section 2.4, the algorithm does not assume any particular antenna array structure. The measurements were executed for the two types of antenna arrays presented in Section 4, and we note that the results for both antenna types are very similar. We observe that the users become easier to separate with an increasing number of antennas, and hence the bit error rate tends toward zero. Even for a relatively small number of antennas compared to the number of users, the performance is acceptable, because the Gaussian kernels provide the capability to separate users that cannot be separated linearly.
In Table 2, we present the detection latencies of the different implementations presented in Section 3.3 on different GPUs. The GPUs represent two different device families: the RTX 2060 Super and the RTX 2080 Ti are consumer-grade boards with Graphics Double Data Rate (GDDR) memory, while the Titan V is a data-center class board equipped with High Bandwidth Memory (HBM) and more computing resources. All GPU cards execute the same code (neither device-specific optimizations nor tuning was carried out). The table does not show the time elapsed during training, which amounts to roughly 100 ms. As mentioned in Remark 2, the relatively long training time is not of concern, because retraining is, in principle, only required if the environment changes abruptly. For comparison, a MATLAB implementation of our algorithm takes more than 40 s to execute on an i7-6700T CPU with simple parallelization (up to 8 threads on 4 cores) using MATLAB's Parallel Computing Toolbox. The comparison with signal processing in MATLAB may not be entirely fair, because we did not optimize the MATLAB code for speed. Nevertheless, one may conclude that the optimized CUDA implementation is several orders of magnitude faster than a native CPU-based implementation.

Conclusions
Our proof of concept provides a practical and fast implementation of the recently proposed partially linear adaptive filtering for multiuser detection on GPU-accelerated platforms. We exploit the parallelism intrinsic to the mathematical formulation and distribute the computations in an "optimal" way across the targeted GPU. The techniques developed for hiding latency and accelerating memory access are key to reaching real-time performance. As a result, we can perform multiuser detection with a detection latency below 1 millisecond with our COTS laboratory setup. In future work, similar techniques will be applied to drastically reduce the total amount of processing time spent on training. Moreover, channel coding will be included in the future. Finally, we note that APSM shares many operations with similar projection-based algorithms in Hilbert spaces, which have seen many applications in signal processing and machine learning. Therefore, our acceleration techniques may be generalized to such projection-based algorithms and to other algorithms based on reproducing kernel Hilbert spaces. The CUDA APSM library developed and used in this work is publicly available on the GitHub page of Fraunhofer HHI [23].

Figure 1.
Figure 1. CUDA-device mapping. A GPU consists of thousands of CUDA cores grouped into SMs executing code in SIMT style. The number of SMs is determined by the size of the device. See Section 3.1.

Figure 2.
Figure 2. Training overview. The received vectors in the sliding window are processed by the CUDA blocks in parallel, computing projections of f_n onto the W convex sets defined by the W received vectors and the corresponding modulation symbols. The resulting estimate (bottom) is used to update the dictionary and the coefficients (left), which then represent f_{n+1}.

Figure 3.
Figure 3. The top depicts the sequence of input vectors r_n for which we need to calculate f_n(r_n) as an estimate of b_n. Each input vector is processed by one CUDA block in parallel, and inside each CUDA block, the corresponding inner product is computed by multiple threads in parallel. The dictionary and the coefficients represent f_n. The result of each block is the approximation of the symbol b_n, which is stored in an array (bottom).

Figure 4.
Figure 4. Shared-memory implementation. Each thread in the CUDA block (center) computes the inner product using one dictionary entry by performing the required computations over the components of the vector r_n (top). Multiple elements of the dictionary (left) are cached in shared memory (center, left) in batches, allowing for the efficient processing of a subset of the dictionary. The final result is accumulated and stored as the approximated modulation symbol (right).

Figure 5.
Figure 5. Balanced implementation. Threads in a CUDA block process subsets of both the input vectors and the dictionary entries, in contrast to Figure 4. Since only a subset of the vector r_n is processed, the intermediate results (before the entire r_n is processed) have to be temporarily stored. The size of the cached dictionary elements is independent of the number of threads in the block (in contrast to Figure 4), and it is tuned to balance the time spent on computation and data loading.

Figure 9.
Figure 9. The average relative bit error rate for different numbers of antennas using QPSK. Shaded regions show the ±1 SD confidence interval.

Figure 6.
Figure 6. Experimental equipment: (a) the local signal processing server; (b) the user antennas (transmitters).

Figure 8.
Figure 8. Average relative bit error rate for different modulation schemes. Black bars show the ±1 SD confidence interval. BPSK and QPSK were omitted from the plot because the bit error rate (BER) is negligible.

Table 2.
Table 2. Summary of the performance improvements of the detection CUDA kernel for the various optimization steps described in Section 3.3.