FPGA Acceleration of 3GPP Channel Model Emulator for 5G New Radio

The channel model is by far the most computationally intensive part of link-level simulations of multiple-input and multiple-output (MIMO) fifth-generation new radio (5GNR) communication systems. The simulation effort increases further when using more realistic geometry-based channel models, such as the three-dimensional spatial channel model (3DSCM). Channel emulation is used for functional and performance verification of such models in the network planning phase. These models use multiple finite impulse response (FIR) filters and have a very high degree of parallelism, which can be exploited for accelerated execution on Field Programmable Gate Array (FPGA) and Graphics Processing Unit (GPU) platforms. This paper proposes an efficient reconfigurable implementation of the 3rd generation partnership project (3GPP) 3DSCM on FPGAs using a design flow based on high-level synthesis (HLS). It studies the effect of various HLS optimization techniques on the total latency and hardware resource utilization on Xilinx Alveo U280 and Intel Arria 10GX 1150 high-performance FPGAs, in both cases using the vendor's commercial HLS tools. The channel model accuracy is preserved by using double-precision floating-point arithmetic. This work analyzes in detail the effort required to target the FPGA platforms using HLS tools, both in terms of common parallelization effort (shared by both FPGAs) and in terms of platform-specific effort, which differs for Xilinx and Intel FPGAs. Compared to the baseline general-purpose central processing unit (CPU) implementation, the achieved speedups are 65X and 95X using the Xilinx UltraScale+ and Intel Arria FPGA platforms, respectively, when using a Double Data Rate (DDR) memory interface. The FPGA-based designs also achieved ~3X better performance than an NVIDIA GeForce GTX 1070 GPU built on a similar technology node, while consuming ~4X less energy. The FPGA implementation speedup improves up to 173X over the CPU baseline when using the Xilinx UltraRAM (URAM) and High-Bandwidth Memory (HBM) resources, also achieving 6X lower latency and 12X lower energy consumption than the GPU implementation.


I. INTRODUCTION
Channel model simulation has become an essential part of mobile network planning. Every new cellular technology undergoes a critical simulation phase both before and during the physical deployment phase. It is thus essential to model the channel accurately for the design and evaluation of fifth-generation new radio (5GNR) and beyond wireless networks [1]. To obtain a realistic representation of the propagation effects, thousands of radio frequency parameters need to be adjusted. Parameter recalculation is needed even after the deployment, whenever the network configuration changes (i.e., the number or position of antennas changes). The geometry-based stochastic model (GBSM) is a popular method for accurately characterizing channels in simulation environments [2]. Several channel models have been developed by different groups, such as 3GPP [3], Wireless World Initiative New Radio II (WINNER II) [4], European Cooperation in Science and Technology (COST) 2100 [5], Mobile and wireless communications Enablers for the Twenty-twenty Information Society (METIS) [6], International Telecommunications Union Radio-communication Sector (ITU-R) [7], Millimeter-Wave Evolution for Backhaul and Access (MiWEBA) [8] and NYU WIRELESS (NYUSIM) [10]. These channel models share many similarities and can be grouped into two main categories: 1) 3GPP/ITU based channel models for frequencies below 6 GHz, with modifications to accommodate up to 100 GHz, and 2) NYUSIM [1] based channel models for frequencies ranging from 0.5 GHz to 100 GHz, which provide new features and enhancements, such as spatial consistency, mobility, and spherical wave propagation.
The associate editor coordinating the review of this manuscript and approving it for publication was Miguel López-Benítez.
The 3GPP channel model [3] that we chose in this work supports channel bandwidth up to 2 GHz and frequencies ranging from 0.5 GHz to 100 GHz. It provides accurate simulation at the cost of higher complexity than alternatives, and can also model additional components, such as oxygen absorption, blockage, large antenna arrays, and spatial consistency.
COST 2100 [5] is a GBSM for frequency bands below 6 GHz. Cluster power, delays and angles in the COST 2100 model are drawn from fixed geometry locations. This model suffers from limited frequency range and lack of support for scenarios requiring dual mobility, such as device-to-device (D2D) and vehicular-to-vehicular (V2V) communication.
METIS [6] fulfills most of the requirements for fifth-generation (5G) channel modeling, such as blocking, specular reflection, diffraction and spherical wave propagation. It also adds support for spatial consistency with dual mobility. This model is based on ray-tracing and provides high accuracy at the cost of very high computational complexity.
Network simulators are used to model routing protocol performance and traffic flows, and to evaluate the efficiency of the communication system using real-life parameters in a virtual environment [9]. Several channel simulators have been developed previously in the literature [10], [11], [12], [13], [14], [15], [16], [17]. NYUSIM [10], [11] is a geometry-based channel simulator for the physical and link layers of 5G communication systems for frequencies from 0.5 GHz to 100 GHz. A geometry-based channel model for millimeter wave (mmWave) frequencies that considers the effect of the ground reflection is proposed in [12]. A three-dimensional (3D) multi-cell channel model is reported in [13] for predicting the performance of an urban macro-cell setup with enhanced features such as 3D antenna patterns, 3D propagation time evolution, variable terminal speeds, and scenario transitions. A channel simulator for machine-to-machine (M2M) communication in indoor environments is presented in [14].
A tutorial on an end-to-end simulation system for the mmWave module in 5G communication systems is presented in [15]. K-Simulator [16] is an open-source, standard-compliant modular tool based on the 3GPP roadmap for 5G. A stochastic channel model that adds support for dual mobility and spatial correlation is proposed in [17].
Accurate 5G channel model simulations require very high computational effort and incur very long execution times on general-purpose processors. Hardware acceleration of such functions is an option to speed up the execution and hence reduce the simulation time. Hardware accelerators based on FPGAs improve the runtime performance and system energy efficiency of computationally intensive accurate channel simulators with respect to both CPUs and GPUs [18]. FPGAs can achieve better fine-grained parallelism by customizing the computing engines and memory hierarchy. For example, an FPGA implementing a distributed unit (DU) receiver can improve the performance even under varying computational load conditions, with optimized power consumption and less area [19]. Civerchia et al. [20] studied the optimization of Open Computing Language (OPENCL) designs implementing an orthogonal frequency division multiplexing (OFDM) module of the 5G stack on FPGA platforms. Alimohammad et al. [21] proposed an FPGA implementation of infinite impulse response (IIR) models for Rayleigh fading channels. Xiao et al. [22] studied the use of FPGAs in 5G combined with neural network optimizations.
Several GBSM emulators have been reported in the literature [23], [24], each with one or more application-specific target scenarios. Hofer et al. [23] proposed a parameterized GBSM emulator for FPGAs. It splits the channel into several stationary regions with fixed Doppler frequencies, hence it is not suitable for fast time-varying models like those used in vehicular mobility scenarios. Another emulator for 3D GBSM for fixed-to-mobile channels is presented in [24]. The channel emulator presented in [25] considers a linearly changing Doppler frequency in the stationary regions, but it has non-continuous output fading and hence suffers from accuracy loss. A ray-tracing based channel emulator with support for dual mobility is proposed in [26]. That technique relies on pre-computed ray coefficients, hence it introduces errors in the ray amplitudes. The authors of [27] proposed a technique to accelerate the 3GPP channel model by reducing its computational complexity. It considers a single sub-path, which lowers the accuracy and limits its applicability to real propagation environments. As discussed, most techniques proposed in the literature have some limitations in terms of either accuracy or potential areas of application.
This work started from application requirements of the Innovation Department of TIM, a major Italian telecommunication provider. Their goal is to exploit the accuracy and generality of the 3GPP GBSM [3] to study the evolution of the radio standard and to maximize the planning quality of mobile networks by means of fast simulation tools, leveraging advanced methods and optimizations for acceleration on FPGA platforms. The channel model, initially developed for execution on a general-purpose CPU, is adapted for the target Xilinx and Intel FPGA acceleration platforms, followed by the application of different FPGA optimization techniques. Hence, we analyze the effort required to use the different synthesis tools for these platforms. While the main goal of this effort is performance optimization compared to a channel model targeted for general-purpose CPUs, the reduction of the energy per computation is also analyzed in comparison with implementations on CPU and GPU platforms. The results in this paper indicate that the proposed techniques allow the creation of fast, efficient and accurate communication channel models.
In this article, we investigate the use of efficient high-performance FPGAs for accelerating the channel model in radio link simulators by means of various multi-objective optimizations. On the one hand, we focus on improving the code structure as well as the memory architecture of the 5GNR channel model on FPGA platforms to maximize the exposed parallelism and match memory access and computational capabilities by leveraging the analysis and synthesis capabilities of HLS design environments. On the other hand, we analyze the performance of different HLS tools while following fairly similar optimization flows. The accelerated channel model is then integrated within a MATLAB-based simulation system (developed by the Innovation department of TIM S.p.A.) via a socket-based client/server architecture, in order to make it easier for several groups of researchers to use in a shared fashion. To analyze and compare the achievable performance on GPU platforms, the CPU implementation is ported to the Compute Unified Device Architecture (CUDA) framework and optimized for GPU targets. The performance achieved is reported for the NVIDIA GeForce GTX 1070 GPU platform, which is manufactured on a technology node similar to that of the FPGAs that we used. Finally, we analyze the effort required when targeting the FPGA platforms using HLS tools.
The rest of the article is organized as follows. Section II introduces the 5G cellular technology and the channel model used at the system and link levels. Section III discusses the flow and technologies used for hardware acceleration and the key benefits associated with them. Section IV explains the different optimization methodologies and techniques adopted to make efficient use of FPGA-based acceleration platforms. Section V describes the overall channel emulation setup adopted in this work and the way the different optimizations are applied to the channel model. In Section VI, we discuss and evaluate the experimental results for the FPGA acceleration platforms and present a comparative analysis of the performance achieved for CPU, FPGA and GPU platforms. Section VII concludes the work performed in this research.

II. FIFTH-GENERATION MOBILE NETWORK
5G mobile networks promise important communication features, such as very low latency, very high data rates, and support for high density of devices and base stations (BS).
The new cellular network technology is expected to have a substantial impact and aid several sectors, including corporate networks, public networks and infrastructure. Transmission techniques using multi-antenna configurations and MIMO channels are crucial for enhancing the reliability and spectral efficiency of a radio link. For assessing standardized technologies operating with a BS equipped with horizontally arranged antennas, 3GPP has used a two-dimensional spatial channel model (2DSCM) on the horizontal cross-section of wireless channels [28]. These models capture the characteristics of a real channel poorly, as they consider a two-dimensional (2D) plane, and the transmission techniques for MIMO systems (spatial multiplexing, beamforming, precoding, etc.) are limited to the azimuth dimension. A 3D channel model is required to evaluate communication techniques such as vertical sectorization. A narrow elevation beam is tailored to each vertical sector or user equipment (UE) specific elevation to efficiently adapt both the transmission elevation and azimuth for the UE [28].

A. THREE-DIMENSIONAL CHANNEL MODEL
A system-level simulation with many detailed scenarios, a large number of parameters, and sophisticated evaluation metrics requires both significant on-chip data storage and high computational power. Interference calculation becomes even more sophisticated with the inclusion of more complex scenarios. Thus, the requirements for system-level simulators must evolve in different directions, such as propagation channel modeling, interference modeling, and clustering. The propagation effect of a wireless channel can be modeled by combining a large scale propagation model with a small scale fading model of the channel. The former predicts the characteristics of the wireless channel model that change slowly, such as shadowing and path losses. The small scale fading model predicts instead the effect of changes due to the Doppler or multipath effects on a wireless channel.
To model the correlation between the different antenna elements, researchers use the spatial channel model (SCM). Unlike other traditional models, SCM incorporates a random power delay profile and an angular profile, and defines the large-scale parameters (LSPs) and the small-scale parameters (SSPs) of the channel model separately. Although the BS antenna arrays generate 3D radio beams, the channel is sometimes modeled in 2D to simplify the calculations by ignoring the elevation angles [29]. 3GPP developed a 3D generic channel model for frequencies ranging from 0.5 GHz to 100 GHz for link-level and system-level simulations [3]. The proposed model is a GBSM that extends the ITU/WINNER II 2D channel models. It is also influenced by the WINNER II/WINNER+ expansion from the 2D channel model to the 3D channel model [4] and takes into account the elevation angles and the azimuth angle to model small-scale fading effects and correlation among the antenna elements.
In the 3GPP GBSM, a cluster is composed of several rays that originate from the same scatterers and have similar characteristics, such as arrival and departure angles. These clusters consist of multipath components having a common propagation direction. Fig. 2 illustrates the scattering of different sub-paths in GBSM. Several usage scenarios are defined in the 3GPP specification. For elevation beamforming, the urban micro street canyon and open area, 3D urban macro (3DUMA) with outdoor next generation NodeBs (gNBs), backhaul, device-to-device (D2D), vehicle-to-vehicle (V2V), and outdoor to indoor (O2I) are examples of some common usage scenarios. For each of these propagation scenarios, different parameters are defined to calculate path losses, microscopic [...] Table 1 are used for the realization of the radio channel using the step-by-step method shown in Fig. 3. The first step of the channel modeling is the identification of the application environment. A simulation scenario is first chosen, then the corresponding network layout (number of BS and UE) and the antenna parameters are specified. The final step for setting LSPs is the calculation of path losses for the assigned propagation conditions.

B. FAST FADING CHANNEL MODEL
Fast fading coefficients model the fluctuating behaviour of the wireless channel due to UE movement or multipath propagation [30]. The fast fading channel model calculates LSPs to generate the channel coefficients. A downlink connection is assumed here for the notation, hence the arrival angles are defined at the UE side and the departure angles at the gNB side. For the uplink, the departure and the arrival parameters have to be swapped to obtain the respective realizations. At this stage, the azimuth angle spread of arrival (ASA) and of departure (ASD) and the azimuth angles of arrival (AOA) and departure (AOD) are generated, in addition to the zenith angle spread of arrival (ZSA) and of departure (ZSD) and the zenith angles of arrival (ZOA) and departure (ZOD). Random coupling among the arrival and departure angles is performed for the different multipath components of the composite channel. Finally, taking into account these parameters, the channel coefficients are generated. Considering N cluster scatterers with M resolvable paths each, the channel impulse response (CIR) for ray m in cluster n, UE antenna element u, and BS antenna element s is denoted $H_{u,s,n,m}(\tau, t)$.
In the conventional approach to beamforming, the array factor is applied to the field pattern of a single antenna element in a uniform array. In the 3D GBSM model, the array factor is applied to the coefficients of each channel, for each antenna element. Considering the delays and ray mappings listed in [3, Table 7.5-5], the final CIR $H_{u,s}(\tau, t)$ is calculated by combining the partial coefficients for each transmitting and receiving antenna element in each cluster and scatterer [31].
The channel can be represented either as a tapped delay line (TDL) or as a clustered delay line (CDL). For simplified evaluation, the TDL model is defined as an impulse response, in which a radio channel is characterized by several delay taps, while the CDL model is characterized by the arrival and departure directions in 3D space, which allows a better representation of beamforming. The TDL model defines the correlation between the antenna elements through a static correlation matrix, whereas the CDL model depends on the geometry of the antenna elements and how the channel propagates. To obtain a TDL model, a brick wall window is applied to the delay-scaled CDL model, followed by power normalization. The pseudo-code in Algorithm 1 shows the procedure for the channel coefficient generation. The channel coefficients in the GBSM depend on the location of the UE in 3D space, hence they must be calculated dynamically. For nTx transmitting antennas, nRx receiving antennas, nClust clusters, oversampling factor NSPS, sampling frequency f, and transmission time interval length TTI, the number of partial coefficients calculated is $nTx \cdot nRx \cdot nClust \cdot NSPS \cdot f \cdot TTI$. Thus, considering simulation parameters nTx = 32, nRx = 2, nClust = 23, NSPS = 4, f = 122.88 MHz and TTI = 0.25 ms, a total of 180 879 360 partial coefficients are generated.
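As a quick check of the count quoted above, the following sketch assumes that the number of partial coefficients is the plain product of the antenna, cluster and oversampling counts with the number of samples per TTI (f multiplied by TTI). The function name and the choice of integer units (Hz and microseconds) are illustrative assumptions, not part of the original model code.

```cpp
#include <cassert>
#include <cstdint>

// Number of partial channel coefficients generated for one TTI, assuming the
// count is nTx * nRx * nClust * NSPS * (f * TTI).
// f is passed in Hz and the TTI in microseconds so that the arithmetic stays
// in exact integers (hypothetical interface, chosen for this sketch).
std::uint64_t partial_coefficients(std::uint64_t n_tx, std::uint64_t n_rx,
                                   std::uint64_t n_clust, std::uint64_t nsps,
                                   std::uint64_t f_hz, std::uint64_t tti_us) {
    const std::uint64_t scaled = f_hz * tti_us;       // samples per TTI, times 1e6
    assert(scaled % 1000000ULL == 0);                 // expect a whole number of samples
    const std::uint64_t samples_per_tti = scaled / 1000000ULL;
    return n_tx * n_rx * n_clust * nsps * samples_per_tti;
}
```

With nTx = 32, nRx = 2, nClust = 23, NSPS = 4, f = 122 880 000 Hz and TTI = 250 µs, the function returns 180 879 360, which matches the figure in the text. This is the workload that the loop and memory optimizations of Section IV have to cover for each TTI.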

III. FPGA ACCELERATION USING OPENCL AND HLS
Several HLS tools have been introduced for rapid prototyping and hardware development using large FPGAs. These tools take as input programs written in C, C++, OPENCL [32], [33], [34], [35], [36], [37], and other high-level languages [38], [39], [40], [41], [42], alongside some design constraints and pragmas, and translate them into a lower-level description, such as register transfer level (RTL) or hardware description language (HDL), with equivalent functionality. This translation from the high-level description into HDL is done by HLS toolchains. The translated design is then transformed into a gate-level description by the synthesis toolchain and mapped onto the hardware resources of the target device. This circuit description is then mapped to the actual locations on the target device to reduce the length of the critical paths. The final stage is encoding the circuit description into a binary format (bitstream), which is then used to configure the FPGA on-chip resources and define the initial on-chip static RAM (SRAM) contents.

Algorithm 1 Implementation of the channel impulse response generation in the 3GPP channel model
Input: Input symbols
Output: CIR
for u = 0 to nRxAntenna do
    for s = 0 to nTxAntenna do
        for n = 0 to nCluster do
            for m = 0 to nCDL do
                calculate $\hat{r}_{rx,n,m}$ as in (2)
                calculate $\hat{r}_{tx,n,m}$ as in (3)
            end for
        end for
        for l = 0 to nSymbol do
            for n = 0 to nCluster do
                calculate $H_{u,s,n,m}(t)$ as in (6)
            end for
        end for
    end for
end for

OPENCL is a parallel programming language for multi-core and heterogeneous computing platforms [43]. OPENCL is developed as an open standard by the Khronos group, thus it has an edge over a similar framework, CUDA, which is fully controlled by NVIDIA and only available for its devices. OPENCL is designed so that an application can be adapted across different computing platforms.
Although OPENCL provides functional portability, platform-specific optimizations are necessary to exploit most of the target platform computational power. This allows software programmers to exploit the architectural features of the underlying platforms, such as the distinction between the local on-chip memory, the global memory, and registers, just like they can do for GPUs [44]. An OPENCL application is comprised of one or more device or kernel functions, and host code. Device code is the part of the code which is highly data parallel and computationally intensive, and will be executed on the accelerator. Host code is the part which sets up the environment and controls data movement to and from the accelerator device and is executed on a general-purpose CPU.
OPENCL devices include one or more compute units (CUs), and each may contain one or more processing elements (PEs), depending on the platform and the designer's implementation choices. OPENCL splits the computations into parallel threads called work-items (WIs), which are then combined into work-groups (WGs). This approach adds support for data-parallel computations, and thus some ''doall'' loop iterations without inter-iteration dependencies (in particular those over WGs) can be mapped to kernel instances that execute in parallel. Not all applications, though, expose high ''doall'' parallelism at the top of the kernel level. Moreover, FPGA architectures permit finer-grained control over the implementation parallelism, e.g., between tasks within a kernel or iterations of an inner loop. For this reason, OPENCL also offers an execution model, more suitable for CPUs and FPGAs than for GPUs, that repeatedly executes a single instance of the kernel and is called a single work-item kernel. In this approach, the available parallelism must be defined at a finer grain, using FPGA-specific pragmas. A similar approach can also be used to synthesize C or C++ code into a concurrent FPGA implementation, as discussed below. Fig. 4 shows the main elements of an OPENCL design. It relies on a single instruction multiple data (SIMD) paradigm similar to GPUs to better exploit the hardware platforms. It enables the developers to generate efficient code that fits the architecture of the target device by providing an abstract but non-uniform memory hierarchy.
OPENCL divides memory into different spaces namely global, local, private, and constant memory. Global and constant memories are shared among all the CUs in a device and with the host CPU, reside in external dynamic RAM (DRAM), and hence have the highest latency. Local memory is shared among WIs in a WG, has lower latency than global memory and is often mapped to on-chip SRAM. Each WI finally has its private memory space, which is mapped to the register file and has the lowest latency.

IV. IMPLEMENTATION AND OPTIMIZATION FOR FPGAs
The use of a high-level implementation-independent model written in OPENCL, C, or C++ brings dual benefits to FPGAs. On one side, it enables the designers to generate an application-specific hardware architecture instead of using the fixed datapath of a CPU or GPU. On the other side, it brings high-level programming capabilities to hardware design. In order to use the FPGA device efficiently for accelerating an application, the computation bottlenecks have to be identified and then offloaded to the accelerator device by specifying them as kernels. Calculating $H_{u,s,n,m}(\tau, t)$ with (6) for different combinations of the input parameters (u, s, n, m) requires extensive computations. This increases the simulation time significantly and limits the number of input parameter combinations that can be explored within a reasonable amount of execution time. However, (6) offers a very high level of parallelism that can be exploited to significantly speed up the computation using a GPU or FPGA.

A. LOOP BASED OPTIMIZATIONS
Since the channel model considers multiple antennas and scatterers, and hence multiple paths, the implementation is organized as a set of nested loops, often without inter-iteration dependencies (also known as ''doall'' loops). To exploit the available parallelism, however, the designer has to provide explicit optimization directives and often restructure the original CPU-oriented code, because the out-of-the-box optimization of the HLS tools is insufficient, as discussed below. In the following, we briefly discuss the main loop-based optimization techniques.

1) LOOP PIPELINING
When a loop is executed sequentially, the next input data are accepted only after the previous computation has fully completed. Some of the resources, however, can be used much more efficiently by organizing the computation in stages. Pipelining is a form of computation parallelism that splits a sequential operation chain into several stages and introduces storage elements (SRAM or flip-flops) to store the intermediate results. Pipelines are characterized by two primary attributes, namely latency and initiation interval (II). Latency is the total number of clock cycles elapsed for an input data item to reach the exit point. The II, or gap, is the number of clock cycles that must elapse before the loop can accept new input data. For a pipeline with initiation interval II and latency L that executes N iterations, the total execution time T when operating at frequency f can be described as in [45]
$$T = \frac{L + II \times (N - 1)}{f}$$
If a design includes two or more chained pipelines, also known as task-level pipelining, the overall II is determined by the slowest one. To achieve maximum performance for a large number of iterations N, it is typically desirable to reduce the II and implement deep pipelines with many stages, hence reducing the overall execution time. Fig. 5 shows the execution of the code in Listing 1 in a sequential and in a pipelined manner.
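The effect of II and latency on the execution time can be sketched with a small cycle-count model. It assumes the usual accounting in which the first iteration leaves the pipeline after L cycles and every further iteration adds II cycles, and, for comparison, a sequential loop in which every iteration costs the full latency. The function names are illustrative and not taken from the original code.

```cpp
#include <cassert>
#include <cstdint>

// Cycles needed by a pipelined loop: the first result appears after `latency`
// cycles and each of the remaining (n - 1) iterations adds `ii` cycles.
std::uint64_t pipelined_cycles(std::uint64_t latency, std::uint64_t ii,
                               std::uint64_t n) {
    assert(latency >= 1 && ii >= 1 && n >= 1);
    return latency + ii * (n - 1);
}

// Cycles needed when the loop is not pipelined: an iteration starts only
// after the previous one has fully completed.
std::uint64_t sequential_cycles(std::uint64_t latency, std::uint64_t n) {
    assert(latency >= 1 && n >= 1);
    return latency * n;
}

// Converts a cycle count into seconds for a clock frequency given in Hz.
double execution_time_s(std::uint64_t cycles, double f_hz) {
    assert(f_hz > 0.0);
    return static_cast<double>(cycles) / f_hz;
}
```

For example, with L = 10 and N = 1000, a pipeline with II = 1 needs 1009 cycles against 10 000 cycles for sequential execution, while II = 2 roughly doubles the pipelined figure to 2008 cycles. For large N the II therefore dominates and the latency becomes almost irrelevant.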

Listing 1. Loop pipelining example.
Loop pipelining can be specified in OPENCL kernels with __attribute__((xcl_pipeline_loop(N))), for C/C++ kernels in Xilinx Vitis [46] with #pragma HLS pipeline II=<N>, or for Intel FPGA SDK for OPENCL [47] with #pragma ii <N>. Pipelining may slightly increase the resource usage due to the insertion of extra control logic and intermediate storage elements, but it generally increases the overall design throughput by decoupling it from the iteration latency.
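A minimal sketch of how the Vitis directive named above is attached to a loop is given below. The kernel, its name and its arguments are hypothetical and are not the paper's Listing 1; a standard C++ compiler ignores the HLS pragma, so the function also runs unchanged on a CPU for functional checks.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical ''doall'' kernel: applies a gain and an offset to each sample.
// The iterations are independent, so the tool can start a new one every cycle.
void scale_offset(const double* in, double* out, std::size_t n,
                  double gain, double offset) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        // Request an initiation interval of one clock cycle for this loop.
#pragma HLS pipeline II=1
        out[i] = gain * in[i] + offset;
    }
}
```

Whether II = 1 is actually met depends on the memory ports feeding `in` and `out` and on the latency of the double-precision operators; the HLS report states the II that was achieved.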

2) LOOP UNROLLING
If there are no data dependencies among the iterations of a loop, its execution performance can be improved by executing multiple iterations in parallel. For such a ''doall'' loop with trip count N, a theoretical speedup of N times can be achieved by dispatching all the iterations in parallel, at the cost of increasing the resources by the same factor N. If an increase of the overall resources by a factor of N is not acceptable, unrolling is often applied partially by creating X copies of the loop body, where X < N. A loop can be fully or partially unrolled depending upon the performance requirements and resource or data availability. Loops can be unrolled by using #pragma HLS unroll factor=N in Vitis HLS or #pragma unroll N in Intel HLS, where N is the required number of iterations to be executed in parallel. Fig. 6 shows the execution of the code in Listing 2 in rolled, partially unrolled and fully unrolled fashion. Note that unrolling increases the data access parallelism of the loop as well, hence it requires memory architecture restructuring, as discussed below, to achieve the best performance.
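Partial unrolling can be illustrated with the sketch below, which shows the same element-wise product once with the Vitis directive and once unrolled by hand with X = 4 copies of the loop body plus a remainder loop. The function names and the unroll factor are assumptions made for this example; the pragma is ignored by a standard C++ compiler, so both versions can be compared on a CPU.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kUnroll = 4;  // hypothetical unroll factor X

// Version 1: the tool creates the copies of the loop body.
void multiply_pragma(const double* a, const double* b, double* out,
                     std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
#pragma HLS unroll factor=4
        out[i] = a[i] * b[i];
    }
}

// Version 2: the same transformation written out by hand. Four independent
// products are issued per iteration; a remainder loop handles the tail when
// n is not a multiple of the unroll factor.
void multiply_manual(const double* a, const double* b, double* out,
                     std::size_t n) {
    std::size_t i = 0;
    for (; i + kUnroll <= n; i += kUnroll) {
        out[i]     = a[i]     * b[i];
        out[i + 1] = a[i + 1] * b[i + 1];
        out[i + 2] = a[i + 2] * b[i + 2];
        out[i + 3] = a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i) {
        out[i] = a[i] * b[i];
    }
}
```

The hand-written form makes the cost visible: four multipliers and four reads of each input array are needed per cycle, which is why the arrays usually have to be partitioned or widened before the unrolled loop can run at full rate.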

3) LOOP TILING
FPGAs have limited on-chip storage resources, which are often insufficient to store all the inputs and intermediate results required by a given algorithm. In that case, if each iteration of a given loop uses different input, intermediate, and output data, it is possible to split the loop into two nested loops, where the innermost one requires a manageable amount of on-chip storage, and to transfer only the required data on-chip at each iteration of the outer tiled loop [48]. For a better understanding of the tiling-based optimization, Listing 3 shows an example of nested loops with a large trip count and hence a larger memory footprint. Loop tiling is applied as shown in Listing 4, resulting in a smaller memory footprint. This optimization can be used to add support for larger designs on platforms with limited memory resources.
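The tiling transformation described above can be sketched as follows. The example is not the paper's Listing 3 or 4; the kernel, the tile size of 256 elements and the names are assumptions chosen for illustration. Each outer iteration copies one tile into small local buffers (standing in for on-chip memory), computes on it, and writes it back.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kTile = 256;  // hypothetical tile size (on-chip buffer depth)

// Scales n input values by `gain`, processing them in tiles of kTile elements
// so that only 2 * kTile doubles of local storage are needed, whatever n is.
void scale_tiled(const double* in, double* out, std::size_t n, double gain) {
    assert(in != nullptr && out != nullptr);
    double buf_in[kTile];   // would map to BRAM/URAM in an FPGA implementation
    double buf_out[kTile];

    for (std::size_t base = 0; base < n; base += kTile) {  // outer tiled loop
        // The last tile may be shorter than kTile.
        const std::size_t len = (n - base < kTile) ? (n - base) : kTile;

        for (std::size_t i = 0; i < len; ++i) {   // load one tile
            buf_in[i] = in[base + i];
        }
        for (std::size_t i = 0; i < len; ++i) {   // compute on local data
            buf_out[i] = gain * buf_in[i];
        }
        for (std::size_t i = 0; i < len; ++i) {   // store the results
            out[base + i] = buf_out[i];
        }
    }
}
```

The tile size trades on-chip memory against the number of outer iterations; keeping the three inner loops separate also lets the transfers be issued as long sequential accesses, independently of the computation.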

4) LOOP FLATTENING/COALESCING
Nested loops can be coalesced into a single loop to improve performance by reducing the overhead of nested loop control. However, in both HLS tools that we consider this can only be done automatically for loops where there is no logic specified between the loop statements, only the innermost has a body and all loop bounds are constant except for the outermost loop bound, which can be variable. Listing 5 shows the coalesced structure of nested loops in Listing 3. In Vitis HLS the #pragma HLS loop_flatten must be specified inside each coalesced loop, while on the Intel platform the loops can be coalesced using #pragma loop_coalesce <loop_nesting_level> on the outermost loop.
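A sketch of loop flattening, under the conditions stated above, is shown below: a perfect two-level nest in which only the innermost loop has a body and the inner bound is constant, followed by a hand-flattened equivalent with a single loop counter. This is not the paper's Listing 5; the names and the inner bound of 8 are illustrative assumptions, and the pragma is ignored by a standard C++ compiler.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCols = 8;  // hypothetical constant inner loop bound

// Perfect loop nest: no statements between the two loop headers, constant
// inner bound, variable outer bound. The pragma asks Vitis HLS to flatten it.
void add_nested(const double* a, double* out, std::size_t rows, double c) {
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t col = 0; col < kCols; ++col) {
#pragma HLS loop_flatten
            out[r * kCols + col] = a[r * kCols + col] + c;
        }
    }
}

// Hand-flattened equivalent: one loop over rows * kCols iterations, with the
// original indices recovered from the single counter.
void add_flattened(const double* a, double* out, std::size_t rows, double c) {
    const std::size_t total = rows * kCols;
    for (std::size_t idx = 0; idx < total; ++idx) {
        const std::size_t r = idx / kCols;
        const std::size_t col = idx % kCols;
        out[r * kCols + col] = a[r * kCols + col] + c;
    }
}
```

With a single loop, a pipeline is filled and drained once instead of once per outer iteration, which is where the saving in loop-control overhead comes from.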

B. MEMORY OPTIMIZATIONS
Off-chip DRAM is required to store most input and output data for the channel model and to communicate with the host. However, DRAM accesses are much slower than the on-chip SRAM accesses (SRAM is also called block RAM (BRAM) on FPGAs). Hence, to compute the CIR with sufficient performance, the input parameters are read from external DRAM to on-chip memory and then accessed repeatedly on-chip by the unrolled pipelined loop bodies. The computation results are written back using the same strategy.
To support the loop optimizations discussed above, the memory hierarchy and off-chip memory interfaces must be optimized. These optimizations include array partitioning, reshaping, banking, and resource allocation:
Data reuse. On FPGAs, the exploitable parallelism is in most cases limited by the number of off-chip memory ports. If the same data are accessed multiple times, they can be stored in on-chip buffers, which have low latency and thus reduce the total access time [49].
Memory access separation. If the innermost loop frequently accesses slow external DRAM, its performance is low due to the off-chip latency and bandwidth limitations mentioned above. Separating these memory transfers from the computations allows both to be optimized independently to achieve maximum throughput.
Buffering. If memory is accessed deep inside the code, the bandwidth may be used inefficiently, degrading the timing and energy performance. To reduce these penalties, global memory can be accessed ahead of the actual use of the data in the kernel computation. These accesses read memory in bursts into deep buffers, using DRAM interfaces wider than the model data type (double-precision floating-point) to fully exploit the parallelism offered by the on-chip DRAM controllers. Performance can also be improved by clocking such memories at a higher frequency than the PE (pumping).
Memory banking/striping. Modern memory interfaces provide access through multiple banks with dedicated access channels, e.g., HBM lanes or DDR channels. Hence, the access bandwidth of an array can be increased by striping it across the different memory interfaces (banks) available on the board. The same considerations apply to on-chip BRAM banks, to increase the on-chip memory bandwidth to match the requirements of the data computations. In HLS, this kind of on-chip memory striping must be performed explicitly by inserting modules to manage data from multiple interfaces. To split data across N banks on the Intel platform, the __attribute__((numbanks(N))) directive is applied to the local memories. In the Xilinx Vitis platform, a memory can be partitioned either completely (into registers) or in a cyclic or block manner using #pragma HLS array_partition variable=<name> type=<type> factor=<int> dim=<int>. For off-chip DRAM, on the other hand, the designer must explicitly break a single array into multiple sub-arrays mapped to different HBM or DDR channels, because HLS currently offers no automated or aided off-chip memory striping.
Regular memory accesses. Irregular and unaligned accesses to the memory subsystems, in particular to DRAM, lead to severe performance penalties. Hence, memory accesses must be kept carefully aligned, so that they can be combined (memory coalescing) by packing the transactions into a single request, making efficient use of the DDR and HBM bandwidths.
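The manual striping of a single array into sub-arrays can be illustrated with the index arithmetic of a cyclic layout. The sketch below is a host-side C++ model under stated assumptions: the bank count `kBanks` and the helper names `cyclic_addr`, `stripe`, and `read_striped` are hypothetical, and each inner vector merely stands for a sub-array that would be mapped to its own bank or channel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical number of banks (e.g., DDR or HBM channels, or BRAM banks).
constexpr std::size_t kBanks = 4;

struct BankAddr {
    std::size_t bank;    // which sub-array holds the element
    std::size_t offset;  // position inside that sub-array
};

// Cyclic mapping: consecutive elements go to consecutive banks.
inline BankAddr cyclic_addr(std::size_t i) {
    return {i % kBanks, i / kBanks};
}

// Split one logical array into kBanks sub-arrays.
std::vector<std::vector<double>> stripe(const std::vector<double>& a) {
    std::vector<std::vector<double>> banks(kBanks);
    for (std::size_t i = 0; i < a.size(); ++i) {
        // Elements reach each bank in increasing index order, so the
        // push position equals i / kBanks.
        banks[cyclic_addr(i).bank].push_back(a[i]);
    }
    return banks;
}

// Read logical element i back from the striped layout.
double read_striped(const std::vector<std::vector<double>>& banks, std::size_t i) {
    const BankAddr p = cyclic_addr(i);
    assert(p.offset < banks[p.bank].size());
    return banks[p.bank][p.offset];
}
```

With this layout, any `kBanks` consecutive elements reside in different banks, so a loop unrolled by that factor can in principle fetch all of its operands through separate ports. A block layout would instead keep contiguous ranges together, which suits accesses that are far apart in the index space.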

V. EXPERIMENTAL SETUP
We used the setup shown in Fig. 7. Table 2 lists the main resources available on these platforms. The final accelerated channel emulator is deployed on these platforms to be used inside the 5G simulation stack.
When targeting FPGAs using HLS tools, the baseline implementation is often a version of the code that has been developed for CPUs using C/C++, or OPENCL code parallelized for GPUs. In our case, the channel model was developed in C++ and executed in the MEX co-simulation environment, with the remainder of the 5G stack being executed inside MATLAB. The design is then ported to the Xilinx Vitis development and Intel FPGA SDK for OpenCL environments. The accelerated function (kernel) can be defined either in OPENCL or in C/C++. These kernel modeling styles differ mainly in the way the kernel input parameters and optimization pragmas are defined. OPENCL defines the interfaces to the external DRAM automatically, based on the __global memory attribute, whereas in the case of C/C++ these are configured via pragmas. Another difference is how optimizations such as memory partitioning, loop pipelining, and unrolling are specified, and where these pragmas should be placed. Finally, OPENCL could in principle allow explicit modeling of data parallelism via WGs and WIs. However, in this project we did not exploit this opportunity because we wanted to exercise finer control over the loop pipelining and unrolling, which is possible only with a single-WI modeling style. Fig. 7 describes the overall socket-based acceleration architecture used for validation and evaluation in this research. Using sockets to connect the MATLAB client to the acceleration servers makes it possible to serve multiple remote clients without mounting a physical card in each machine that runs a MATLAB instance.
The baseline implementation of the channel model is co-simulated with MATLAB using MEX and used for validation. To accelerate the channel functions using an FPGA, and thus support complex simulations with higher numbers of antenna elements and more UE speeds, the design is split into two major parts: host and kernel code. The host code performs tasks related to control and data movement, such as allocating space in the device memory, receiving data via the socket from a client, launching the kernel, and copying the results back to the client via another socket. The design is ported to the respective development environment for each target FPGA. A new validation step is used to check that the split between client code, server host code, and server kernel code was performed correctly. This out-of-the-box (OOB) implementation is very inefficient, since all the data reside in global DRAM, which limits the scope of automatic or manual optimizations.
The optimizations adopted in this work are described in more detail in Section IV and can be grouped into the following categories:
Generic HLS optimization. These optimizations are generic for any HLS flow and can be adopted on any of the target platforms. They are further divided here into two categories, highlighting their impact on performance and resource utilization respectively.
On-chip buffers. Access to global memory, i.e., off-chip DRAM, is costly in terms of both time and energy. To overcome the memory bottleneck, the data used by the kernel must be copied into low-latency on-chip SRAM buffers at the beginning of the kernel execution, and back to DRAM at the end. This brings two benefits: first, DRAM access requests are issued that are wider than the single words used in the model computation, thus utilizing the full DRAM bandwidth; second, the number of such requests is reduced by exploiting data reuse. Loop-based optimizations are then applied to make efficient use of the on-chip resources. These optimizations include pipelining, unrolling, tiling, and loop coalescing.
Multi-port. Parallel access to on-chip buffers by unrolled loop bodies is still limited by the number of available ports. To prevent stalling of the computations, these buffers should be partitioned to allow multiple accesses through an adequate number of ports. For example, the FIR coefficients are the result of intermediate computations (6). These partial results are kept in low-latency SRAM buffers, which makes it possible to compute the coefficients for all clusters in SIMD fashion and hence reduces the total latency.
Application-specific optimizations (ASO). These optimizations are specific to the channel model application and may require modifications of the original CPU-oriented algorithm to achieve the best performance. For example, off-chip memory accesses are not aligned initially and hence result in poor bandwidth utilization because of memory stalls.
1) Regular access pattern. To improve performance using the optimization mechanisms provided by HLS tools, such as loop pipelining, unrolling, and loop flattening, it is essential to enable the tools to understand the access pattern of the underlying design. Random or irregular memory accesses can result in poor bandwidth utilization, limiting the achievable speedup. In Algorithm 1, line 11 accesses random memory locations that depend on the position of the cluster scatterers, which is determined at runtime. This is a bottleneck for the achievable parallelism. To solve this, memory accesses are separated into different banks for each cluster.
2) Closed-form computation. The algorithm structure can be exploited to reduce the number of arithmetic operations, and hence of memory accesses. In our case, the iterative computation of an arithmetic sequence was replaced with its closed form, which requires only a multiplication by a constant, thus reducing both the floating-point operations and the on-chip memory accesses. This also removed false inter-iteration dependencies and enabled unrolling to parallelize the computations.
3) False dependence removal. Memory dependencies occur when a single memory location is both read and written, or written multiple times, within a section of code (typically a loop body). While true dependencies must be preserved to keep the computations correct, false dependencies are the result of conservative estimations performed by an HLS tool when it cannot exactly analyze the access sequences.
These dependencies can never occur during actual execution of the code and can be resolved after a careful manual analysis of the access patterns in the code. False dependence removal pragmas for HLS tools are used to identify such dependencies and improve the effectiveness of loop transformations.
Platform-specific optimizations (PSO). Some FPGA platforms offer extra resources, such as off-chip HBM and on-chip URAM on the Xilinx Alveo U280, which can be used to further increase the maximum achievable performance. HBM can be used by specifying a separate interface for each global memory array and can help reduce memory contention and bank conflicts. URAM is a special kind of memory that is wider and deeper than BRAM and can be used to store large data structures. Since it is currently not possible to control these features in OPENCL, we ported the kernel to C++, which allows more control over these optimizations, while losing some of the portability between different FPGA vendors that is afforded by OPENCL. This version of the kernel achieves a much more balanced resource utilization and hence would allow the creation of multiple instances of the kernel on the target FPGA, to simulate multiple channel models concurrently and independently.
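The closed-form replacement of an iteratively computed arithmetic sequence can be sketched as follows. The exact expression used in the channel model is not reproduced here; the form `start + i * step` and the function names are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Iterative form: every iteration needs the value of `acc` produced by the
// previous one, i.e., a loop-carried dependence that serializes the loop.
std::vector<double> sequence_iterative(double start, double step, std::size_t n) {
    std::vector<double> out(n);
    double acc = start;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = acc;
        acc += step;
    }
    return out;
}

// Closed form: each element depends only on its own index, so the
// iterations are independent and the loop can be unrolled or pipelined.
std::vector<double> sequence_closed_form(double start, double step, std::size_t n) {
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = start + static_cast<double>(i) * step;
    }
    return out;
}
```

Note that the two forms are mathematically equivalent but need not be bit-identical in floating point: the iterative version accumulates one rounding error per addition, whereas the closed form rounds once per element. For steps that are exactly representable (such as 0.25) the results coincide.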

VI. RESULTS AND ANALYSIS
We analyze here the performance of the accelerators before and after optimizations for the two target platforms. We used the reference parameter values from the 3GPP specification [3] for this phase. Table 3 lists some of these parameters and the chosen propagation condition for the channel emulator. To measure the efficiency of the accelerated designs, performance metrics based on resource utilization, latency, and energy consumption are used. The OOB implementation of the channel model on the FPGA platforms is purely a synthesizable version of the original code targeted for CPU execution. Although it is functionally correct, it is very inefficient in terms of computation and memory access. Data reside in the global DRAM and even though the data access is mostly sequential, none of the HLS tools was able to optimize the performance of the memory accesses and exploit the abundant opportunities for data reuse. To reduce frequent accesses to global memory, the data are copied to on-chip buffers before starting the computation core of the kernel, and copied back to the global memory at the end of the execution. To achieve computation parallelism, the data should be accessible in parallel. This is limited by the number of access ports available on the requested memory. Memory partitioning allows multiple accesses in parallel at the cost of increased resource utilization. The final implementation combines all these optimizations with algorithm modifications to improve the regularity of the memory accesses and thus to simplify the addressing logic.

A. LATENCY
Since the primary task of this research is the acceleration of the channel model, the main focus is the reduction of the overall latency of the kernel execution. Table 4 lists the latency of the kernel on the baseline CPU and on the various acceleration platforms, using a single SLR for the US+. For comparative analysis, we also report here the speedups achieved after the application of each optimization.
In the baseline implementation, the CPU cache provides very good DRAM access bandwidth without any programming effort, but the maximum achievable performance is limited by the number of available computational resources. The OOB implementations on the two FPGA platforms have lower performance than the CPU, mainly because of time-consuming memory access requests, since all the data reside in off-chip DRAM. This cost is significantly reduced by copying the data into on-chip buffers before the channel model computation starts. Parallel access to these buffers is still limited by the availability of access ports, which prevents many HLS optimizations, such as pipelining and unrolling. To overcome this, the on-chip memory is partitioned, which effectively means using several separate banks, to increase the number of access ports on these buffers and thus enable the HLS tools to schedule more access requests in parallel. Memory bandwidth utilization, however, is still poor due to the unaligned access patterns, which require a significant amount of multiplexing. To overcome this, ASO are applied (see Section V), yielding overall 95X and 65X speedups on the US+ and Arria platforms respectively compared to the baseline CPU implementation. The FPGA-based designs also achieved ∼3X better performance than that achieved on the GPU platform. At this stage, the maximum achievable performance is limited by the number of available FPGA on-chip resources.
In addition, the Xilinx US+ platform offers high-capacity URAMs and HBM with a number of interfaces, which are exploited next through PSO (see Section V). To increase the number of global memory access ports, separate HBM interfaces are used, which reduces bus and memory controller contention on the interfaces and improves bandwidth utilization. The complex input and output data structures of the channel model were split into real and imaginary parts, each assigned to a separate HBM channel. A total of 12 HBM channels were used to avoid interface stalls and to fully pipeline the loops in the implementation. This leads to a speedup of 151X. Performance is further improved by using URAM resources for some of the data structures, to better balance the resource utilization of the design, yielding an overall 173X and 6X speedup compared to the baseline CPU and GPU-based implementations respectively. Fig. 8 shows the latency of the design after various optimizations. As this study follows a step-by-step procedure, the new optimizations are added on top of the ones applied in earlier steps.
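The split of complex data into separate real and imaginary arrays can be modeled on the host side as shown below. This is a sketch under assumptions: the type `SplitComplex` and the helpers `split_complex` and `merge_complex` are hypothetical names, and the binding of each array to its own HBM channel (done in the kernel interface and link configuration) is not shown.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Two plain arrays of doubles; on the FPGA, each one would be bound to its
// own memory channel so that both parts can be streamed concurrently.
struct SplitComplex {
    std::vector<double> re;
    std::vector<double> im;
};

// De-interleave an array of complex samples into real and imaginary parts.
SplitComplex split_complex(const std::vector<std::complex<double>>& x) {
    SplitComplex s;
    s.re.reserve(x.size());
    s.im.reserve(x.size());
    for (const auto& v : x) {
        s.re.push_back(v.real());
        s.im.push_back(v.imag());
    }
    return s;
}

// Re-interleave the two arrays, e.g., after copying results back to the host.
std::vector<std::complex<double>> merge_complex(const SplitComplex& s) {
    assert(s.re.size() == s.im.size());
    std::vector<std::complex<double>> x(s.re.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] = std::complex<double>(s.re[i], s.im[i]);
    }
    return x;
}
```

Each of the two arrays is a contiguous run of doubles, so reads from either one remain sequential and burst-friendly, and a complex sample can be fetched in one step from two independent interfaces instead of two steps from a single one.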

B. RESOURCE UTILIZATION
Optimization pragmas affect the resources used by the accelerated function. Since in the OOB implementation all data reside in off-chip DRAM, resource utilization is very low for the US+ platform. The HLS tool for the Arria platform, on the other hand, tries to optimize the memory accesses automatically, but fails due to the unaligned access patterns and inter-iteration dependencies. Table 5 lists the resource usage for the various designs. The total available on-chip resources are listed in Table 2. The next step in the optimization flow, making the on-chip buffers multi-port, creates an architecture that is best suited for both HLS tools. This also increases the resource utilization, but it does not yet achieve the best implementation performance due to the unaligned memory accesses and inter-iteration dependencies, which are tackled only by ASO. Resource usage for Intel is reported in Fig. 9. The HBM and URAM resources on the US+ platform are then used by the platform-specific optimizations, to balance the resource utilization and further increase the achievable performance. Fig. 10 shows the percentage utilization of resources on US+. Since all designs tile the on-chip buffering of the DRAM arrays to support large problem sizes, resource usage is not affected by increases in the total number of channel coefficients, which affect only the total latency.

C. POWER AND ENERGY
Improving the energy efficiency is one of the key advantages of offloading an application function to specialized hardware. General-purpose processors focus on flexibility and hence are not optimized for maximum efficiency for each application. Offloading functions to FPGA hardware accelerators can enhance energy efficiency not only by reducing the execution time, but also by using only the hardware required to execute the task. Table 6 reports the energy consumption of the design at various optimization levels. The CPU implementation has the highest energy consumption due to its higher thermal design power (TDP). Among the FPGA designs, the energy consumption is highest for the OOB designs because of their higher latency and power-expensive accesses to memory, since all the data reside in DRAM. The optimizations applied help reduce the total latency and make efficient use of the available resources, which also reduces the energy consumption. Fig. 11 shows the energy consumption of the implemented design at different optimization stages. The GPU platform implementation consumes more energy since the GPU that we used has less on-chip memory than the FPGAs.
FPGA platforms consume much less power than CPUs and GPUs with respect to their computational capabilities. Moreover, optimizations for resources also save power and those for latency also reduce energy consumption.
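The link between latency and energy can be made explicit with a short worked relation. It assumes that the energy of one kernel run is the product of the average power drawn and the kernel latency, which is a simplification and not necessarily the measurement method used for Table 6; the symbols below are introduced only for this illustration.

```latex
% Assumption: energy = average power x latency for one kernel run.
E = \bar{P}\, T
\qquad\Longrightarrow\qquad
\frac{E_{\mathrm{GPU}}}{E_{\mathrm{FPGA}}}
  = \frac{\bar{P}_{\mathrm{GPU}}}{\bar{P}_{\mathrm{FPGA}}}
    \cdot \frac{T_{\mathrm{GPU}}}{T_{\mathrm{FPGA}}}

% With the reported figures for the HBM/URAM design
% (6X lower latency, 12X lower energy than the GPU):
\frac{\bar{P}_{\mathrm{GPU}}}{\bar{P}_{\mathrm{FPGA}}}
  \approx \frac{12}{6} = 2
```

Under this assumption, roughly half of the reported energy advantage (on a logarithmic scale) would come from lower average power and the rest from shorter execution time. For the DDR-based designs, the approximate 3X latency and 4X energy figures would give a power ratio of about 1.3.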

VII. CONCLUSION
The simulation of the 3GPP 3DSCM channel model can be significantly accelerated using FPGA platforms from different vendors (we report results for Xilinx and Intel) by applying a range of optimization techniques. The achievable speedup is limited by the memory, as the bandwidth limit is reached before the computing resource limit. A nominally portable OPENCL implementation makes it possible to design for FPGAs using HLS tools. However, a straightforward port to OPENCL of the original C++ code targeted at a CPU does not achieve good out-of-the-box results. A comprehensive set of memory and loop-based optimization techniques is needed to tackle this challenge, and can improve the performance by many orders of magnitude. While the initial implementation was much slower than CPU execution, with optimizations its execution is two orders of magnitude faster.
Using the accelerated channel model, a higher number of parameters can be simulated in a comparable amount of time than with the model running in the MATLAB environment or as C++ code on a CPU. Hence, the accelerated model supports simulation of a wider speed range for the UE, and more antenna elements can be considered.
To the best of our knowledge, this work is the first implementation of the 3GPP 3DSCM on both Xilinx and Intel FPGA platforms, with a detailed comparison of the achievable results on both. Impressive speedups of 95X and 65X were achieved on the US+ and Arria platforms compared to the baseline CPU implementation, through a combination of generic and application-specific optimizations. The performance on the Xilinx US+ platform improved even further, to 173X, by exploiting the on-chip URAMs and HBM. Since the data types were kept the same as those in the baseline CPU implementation, i.e., double-precision floating point, there was no change in accuracy for the FPGA implementations. This was particularly challenging, because the FPGAs considered in this study, despite both being aimed at data center applications, lack optimized support for double-precision floating-point adders or multipliers.
To compare our results with those achievable on another highly parallel acceleration platform, namely GPUs, the channel model was re-implemented and optimized using the Compute Unified Device Architecture (CUDA) framework and language for an NVIDIA GPU implemented on a technology node comparable to that of our FPGAs. One of the FPGA platforms exhibits both better performance, between 3X and 6X, and lower energy consumption, between 4X and 12X, than the GPU. This is thanks to the better memory access achievable for the memory-intensive channel model on the FPGAs, since they have a larger on-chip memory but comparatively fewer computational resources than the GPU.
Over the timeline of this research work, 20% of the time was spent porting the CPU-based application to the FPGA platforms, to make it synthesizable. At this stage, reports from the HLS tools helped in analyzing the bottlenecks of the design. 10% of the development time was spent on each of the on-chip and multi-port memory optimizations. The application-specific optimizations required in-depth analysis of the access pattern, computational flow, and memory dependencies, and took around 30% of the total development time. Platform-specific optimizations for HBM took about 20% of the total time and included memory access analysis and partitioning into separate HBM channels. Finally, exploiting URAM resources to balance the BRAM utilization required the final 10% of the development time.
The HLS tool for one of the FPGAs provides a user friendly graphical interface for rapid development and debugging, while there is no such Integrated Development Environment for the other FPGAs. However, both tool sets provide detailed design reports that enable micro-architectural optimizations. OOB implementation of code not specifically written for FPGAs is obviously sub-optimal, due to the lack of an efficient on-chip memory architecture that is comparable to the cache in a CPU.
The HLS tool for one of the FPGAs tries to automatically create an optimized memory architecture, but since the algorithm memory access pattern, albeit regular, was hard to analyze, the tool worsens the performance and increases the resource usage. This highlights the need for experienced hardware designers, familiar with HLS tools, who can partially rewrite the top application and manually optimize memory access. Lastly, the C++ based kernels for one of the FPGAs offer more control over the optimizations, such as the parameters of the DRAM interfaces or the choice of some specific computational resources, than in the nominally more portable OPENCL flow. Although HLS still has some shortcomings compared to hand-crafted RTL implementations, it enables rapid design space exploration and thus ultimately can achieve respectable quality of results with a reasonable design optimization time.