On-Board Processing of Synthetic-Aperture Radar Backprojection Algorithm in FPGA

Abstract—Synthetic-Aperture Radar (SAR) is a microwave remote sensing technique for extracting image information of a target. Electromagnetic waves reflected from the target are acquired by aircraft or satellite receivers and sent to a ground station, where images are extracted from the raw data by computationally demanding algorithms. However, novel applications require real-time analysis and decisions, and therefore onboard processing is necessary. Running computationally demanding algorithms on onboard embedded systems with limited energy and computational capacity is a challenge. This paper proposes a configurable hardware core for the execution of the backprojection algorithm with high performance and energy efficiency. The original backprojection algorithm is restructured to expose computational parallelism and then optimized by replacing floating-point with fixed-point arithmetic. The backprojection core was integrated into a system-on-chip architecture and implemented in an FPGA (Field-Programmable Gate Array). The proposed solution runs the optimized backprojection algorithm over images of sizes 512 × 512 and 1024 × 1024 in 0.14 s (0.41 J) and 1.11 s (3.24 J), respectively. The architecture is 2.6× faster and consumes 13× less energy than an embedded Jetson TX2 GPU. The solution is scalable, so a tradeoff can be established between performance and resource utilization.


I. INTRODUCTION
Synthetic-Aperture Radar (SAR) is a remote sensing technique widely used to monitor the surface of the Earth, with applications in many different areas, including ship and oil spill tracking, terrain erosion, droughts and landslides, deforestation and fires [1]. One of the main features of SAR that makes it a very attractive technique for remote sensing is its capacity to operate under adverse weather conditions, with clouds, smoke or rain, and without a light source. SAR sensors can be installed on satellites, aircraft or unmanned aerial vehicles, depending on the application. Data collected from these sensors must be processed to obtain images of the Earth's surface. The Backprojection (BP) [2], [3] time-domain algorithm is one of the most used algorithms for SAR image formation due to its robustness and the quality of its results.
SAR algorithms, and BP in particular, are quite computation and memory intensive, and therefore processing is usually done at a ground station. However, generating the images onboard opens a vast new set of applications over the recovered images, such as image classification and detection, making it possible to undertake real-time analysis and decisions onboard.
Using high-performance computing platforms, like multicore processors and Graphics Processing Units (GPUs), to run SAR image formation algorithms is not feasible onboard, since these platforms require high power while onboard computing systems are usually low-power.
An alternative is to consider dedicated hardware architectures implemented in Field-Programmable Gate Arrays (FPGAs) or Application-Specific Integrated Circuits (ASICs). These allow the design of an optimized hardware architecture, which improves performance and reduces power. Compared to ASICs, FPGAs offer lower performance and consume more power. However, they allow redesigning the algorithm after deployment and are less costly for medium-volume production.
Therefore, FPGAs are attractive devices for onboard processing systems since they offer high performance, compact size, reduced weight, and low power, while at the same time allowing onboard hardware redesign to cope with the needs of different missions and algorithmic modifications.
The hardware architecture designed with FPGAs can be fully optimized for each particular algorithm, like the BP algorithm. The data representation and the arithmetic operators can all be explored and designed to obtain the best architectures with particular tradeoffs between hardware resources, performance and energy. This is particularly true for hardware accelerators, where operations are preferably implemented using fixed-point arithmetic due to the overhead introduced by the implementation of floating-point units. Therefore, this work considers a full optimization of the representation of the variables and the arithmetic operators.
Another major aspect of custom-designed hardware architectures is the right balance between computation and communication. The BP algorithm requires memory transfers of large volumes of data. As the parallel computation increases, the pressure over the available memory bandwidth increases up to a point where increasing the parallelism leads to no further performance improvements. To reduce the memory bandwidth pressure and allow further parallelization, data must be cached on-chip and reused as much as possible. The memory architecture is therefore a major aspect when designing these algorithms. The work proposed in this paper redesigns the BP algorithm so that a tradeoff is established between on-chip memory and parallelism, that is, the on-chip memory size determines the degree of parallelism of the design. The integrated design of software and hardware, where the software algorithm is reorganized to suit the hardware design, further improves the computing architecture.
The main contribution of this work is therefore a novel hardware architecture for the execution of the backprojection algorithm for SAR image generation. The proposed architecture is configurable to achieve different tradeoffs between performance and hardware cost. The design of the architecture integrates several contributions, namely the rescheduling of the BP algorithm for balanced computation and communication, and data type analysis and design to improve the efficiency of the architecture.
The proposed core was integrated in a system-on-chip architecture with a soft processor, implemented in a medium-density FPGA and tested with synthetic and real images obtained from the AFRL dataset. The system was compared to the execution of the algorithm on a desktop i7 processor, on an embedded GPU in a Jetson TX2 board, on an NVIDIA GeForce RTX 2060 GPU, and to previous FPGA designs for BP processing.
This paper is organized as follows. Section II describes the related work on algorithms and architectures for the fast execution of the BP algorithm. Section III presents the BP algorithm in detail. Section IV describes the design of the BP in hardware, including algorithm rescheduling, data type optimization, and hardware design. Section V details the performance and resources of the proposed system. Section VI concludes the paper.

II. RELATED WORK
Two types of algorithms are commonly used for SAR image formation: frequency-domain and time-domain algorithms [4]. Frequency-domain algorithms have a computational complexity of O(N² log N), for N pulses and an image of N × N samples, lower than the O(N³) complexity of time-domain algorithms. The disadvantage of frequency-domain algorithms is that they are not applicable to all cases and require multiple assumptions, including geometric approximations and particular flight trajectories. Other associated problems include imperfect motion compensation, which can lead to image focusing problems, and interpolation in the frequency domain, which can generate unexpected artifacts in the formed image.
Despite their higher complexity, time-domain algorithms are not affected by the limitations associated with frequency-domain algorithms. The Backprojection (BP) [2], [3] time-domain algorithm does not require a separate motion compensation and is not subject to the constraints mentioned before. Besides, it works with spotlight, stripmap and scan imaging modes. The backprojection algorithm implementation for circular SAR designed by LeRoy Gorham and Linda Moore [2] is a ready-to-use MATLAB implementation that accepts different dataset formats, like the Gotcha Volumetric SAR Dataset [5], the Backhoe Data Dome [6], and the GMTI Challenge Problem [7]. The PERFECT Suite [8] provides an implementation of the backprojection algorithm, among others, with CUDA (Compute Unified Device Architecture) and OpenMP versions.
The BP algorithm is quite computation and memory intensive. Therefore, it has been modified to speed up the SAR image formation process. The fast backprojection algorithm [9], the fast factorized backprojection algorithm [10] and the accelerated backprojection algorithm [11] are examples of improved BP algorithms. In [12] the BP algorithm was designed with fixed-point arithmetic and executed on a CPU, with performance improvements of up to 25%, depending on the image size, and negligible image quality reduction. The time to process the SAR data was reduced from 84 to 75 seconds. These optimizations reduce the computation time of SAR data but are still unable to meet real-time requirements.
To further reduce the computation time of the BP algorithm, multiprocessing computing architectures are used. The BP algorithm is highly parallelizable, so multiple operations can be executed in parallel, which reduces the execution time of the algorithm. GPUs [13]-[18] and multicore processors [19] are the two most common parallel processing architectures considered for BP algorithm speedup.
In [19] several approximate strength-reduction optimizations, such as quadratic polynomial approximations, Taylor series for square-root calculation and trigonometric function strength reduction, are used to reduce the computational load of the algorithm. Large images are processed in one second using a multicore platform with 16 processors. In [15] the authors note that, in practice, not all pulses contribute to every pixel of a SAR image. This is taken into consideration to reduce the number of computations.
High-performance computing platforms require high power, which is not available in onboard computing systems, which are low-power. Embedded GPUs can be considered since they have lower power requirements. In [16] a mobile graphics processor with a peak performance close to 1 TFLOPS is used to run the BP algorithm in about 3 seconds. The work in [18] reduces the processing time of the fast factorized BP on a Jetson TX2 GPU for images of size 2k × 2k from 35 to 5 seconds, depending on the stage factor of the algorithm. Embedded GPUs trade off power for computing capacity and are still based on general-purpose architectures with some energy inefficiencies.
FPGA and ASIC devices allow the design of custom hardware accelerators with better performance and energy efficiency. ASICs are faster and consume less energy than FPGAs, but are expensive and, unlike FPGAs, do not allow hardware modifications after fabrication.
In [20] the backprojection algorithm is implemented in an FPGA with a mesh of 64 processor cores to provide real-time processing. Cholewa et al. [21] developed a BP module that generates one line of the final image at a time, looping over all pulses for each line. The results obtained show that the implementation scales almost linearly with the parallelization factor. In [22] the author provides an automatic framework based on OpenCL to map the backprojection algorithm to FPGA. The solution explores the available parallelism of the algorithm and shows good results when compared to designs generated from hardware description languages. Another implementation of the BP algorithm in FPGA [23] considers multiple independent core units that receive raw data and generate a pixel contribution, to be accumulated into the current pixel value. In this implementation, an Arria-V SoC from Altera is used, which also integrates an ARM Cortex-A9 dual-core processor. The design is able to process an image in 120 milliseconds. The work in [18] also runs the fast factorized BP algorithm on two different FPGAs, with execution times for images of size 2k × 2k ranging from 0.8 to 42 s, depending on the target FPGA and the algorithm configuration. Data in the FPGA is represented with 16 bits. In [24] a hardware/software solution is proposed for SAR image processing. The proposal partitions the solution into hardware and software, mapping some of the most computation-intensive parts of the algorithm to hardware.
The utilization of 16-bit data in the FPGA design of [18] is a major design optimization. One of the main concerns when implementing algorithms on accelerators is the tradeoff between performance and precision. GPUs, for example, tend to use single- or half-precision instead of double-precision floating-point, either because the devices do not support it or because of the overhead introduced. However, assuming a constant fixed-point size for all variables is inefficient: a constant size may not be enough for some variables, which reduces precision, or may be too much for others, which requires unnecessary hardware. Instead of a single fixed-point representation for all variables of the algorithm, this work considers a full optimization of the representation of each variable.
Memory organization is another major aspect of any computing platform. Previous FPGA designs rely on a high external memory bandwidth to feed multiple parallel cores. To reduce the memory bandwidth requirements as the parallelism increases, data must be properly cached on-chip and reused. The architecture proposed in this paper redesigns the BP algorithm to allow a scalable architecture without the need to increase the memory bandwidth.

III. BACKPROJECTION ALGORITHM
The Backprojection (BP) algorithm is a SAR image formation algorithm that converts radar echo data into a SAR image. The algorithm determines the contribution of each reflected pulse to each pixel of the output SAR image. The backprojection algorithm takes as input the location of the radar platform for each pulse, the location of the output pixels, and the SAR data set. In the description of the BP algorithm the following nomenclature is considered:
• R - Distance from the platform location to each pixel.
• R0 - Distance to the first range bin. This value is constant.
• xk, yk, zk - Radar platform location in Cartesian coordinates.
• x, y, z - Pixel location in Cartesian coordinates.
• rc - Range to the center of the swath from the radar platform.
• gx,y(rk, θk) - Wave reflection received at rk at θk (calculated using the linear interpolation in Equation 3).
• Data(p, n) - Wave sample of pulse p and range bin n.

The BP algorithm performs a sequence of steps for each input point (pixel, pulse) as follows:
1) Computes the distance from the radar platform to the pixel under analysis (Equation 1);
2) Converts the distance to an associated range position in the data set of received echoes (Equation 2). Assuming that range bins are equally spaced, the factor 1/(R(n+1) − R(n)) can be replaced by a multiplication with the constant 1/∆R;
3) Obtains the sample at the computed range via linear interpolation, using Equation 3 [23];
4) Computes the complex exponential for the matched filter from R (see Equation 4);
5) Scales the sampled value by the matched filter to determine the pixel contribution.
prod(x, y, k) = gx,y · e^(i·ω·2·|R|)   (5)

6) Accumulates the contribution into the pixel. The final value of each pixel is given by Equation 6.
The backprojection algorithm with the steps enumerated above is described in Algorithm 1, where ku = 2πfc/c represents the wave number, fc is the carrier frequency of the waveform and c is the speed of light.
The computational complexity of the algorithm for an image of size iy × ix and p pulses is proportional to iy × ix × p, which increases quadratically with the size of the image and linearly with the number of pulses.
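As an illustration, the per-pixel processing described in the steps above can be sketched in software. The following is a hedged NumPy sketch; the function name, argument layout and the out-of-range check are our own, not the paper's implementation:

```python
import numpy as np

def backproject_pixel(x, y, z, plat_pos, data, r0, dr, ku):
    """Accumulate the contribution of every pulse into one output pixel.

    plat_pos: (P, 3) platform positions; data: (P, N) complex range
    profiles; r0: range of the first bin; dr: bin spacing; ku: wave
    number 2*pi*fc/c. Names and signature are illustrative only.
    """
    acc = 0.0 + 0.0j
    n_bins = data.shape[1]
    for p in range(plat_pos.shape[0]):
        # Step 1: distance from the radar platform to the pixel.
        dx = x - plat_pos[p, 0]
        dy = y - plat_pos[p, 1]
        dz = z - plat_pos[p, 2]
        r = np.sqrt(dx * dx + dy * dy + dz * dz)
        # Step 2: fractional range-bin index (multiplying by 1/dr
        # replaces the per-sample division, as noted in the text).
        b = (r - r0) / dr
        n = int(np.floor(b))
        if n < 0 or n >= n_bins - 1:
            continue  # this pulse does not contribute to the pixel
        # Step 3: linear interpolation between adjacent range bins.
        w = b - n
        g = (1.0 - w) * data[p, n] + w * data[p, n + 1]
        # Steps 4-6: matched filter, scale and accumulate.
        acc += g * np.exp(1j * 2.0 * ku * r)
    return acc
```

The hardware design in Section IV implements exactly this chain of operations as a pipeline, one (pixel, pulse) pair per clock cycle.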
To improve the interpolation quality of the BP algorithm it is common to upsample data in the range dimension prior to backprojection. This work considers 8× upsampling via FFT/IFFT: the input data is first translated to the frequency domain using an FFT, the result is zero-padded, and the padded spectrum is translated back to the time domain using an IFFT.

Algorithm 1 Backprojection algorithm pseudocode, based on [8].
1: for all Y pixels y do
2:   for all X pixels x do
3:     for all pulses p do
4:       R ← sqrt((x − xk)² + (y − yk)² + (z − zk)²)
5:       b ← (R − R0)/∆R
6:       if 0 ≤ b < N − 1 then
7:         n ← ⌊b⌋
8:         gx,y ← Data(p, n) + (b − n) · (Data(p, n + 1) − Data(p, n))
9:         mf ← e^(i·2·ku·R)
10:        prod ← gx,y × mf
11:        acc(x, y) ← acc(x, y) + prod
12:      end if
13:    end for
14:  end for
15: end for
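The FFT/IFFT range upsampling amounts to zero-padding the spectrum between the positive- and negative-frequency halves. A minimal sketch, assuming a complex range profile (the function name and interface are our own):

```python
import numpy as np

def upsample_range(pulse, factor):
    """Upsample one complex range profile by zero-padding its spectrum.

    Zeros are inserted at the highest frequencies (the middle of the
    FFT output), so the original samples reappear at stride `factor`
    in the upsampled profile.
    """
    n = pulse.shape[0]
    spec = np.fft.fft(pulse)
    padded = np.zeros(n * factor, dtype=complex)
    half = n // 2
    padded[:half] = spec[:half]            # positive frequencies
    padded[-(n - half):] = spec[half:]     # negative frequencies
    # Scale so that interpolated amplitudes match the original samples.
    return np.fft.ifft(padded) * factor
```

With factor = 8 this reproduces the 8× range upsampling used in this work; every 8th output sample equals the corresponding input sample.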

IV. DESIGN AND IMPLEMENTATION OF THE HARDWARE ARCHITECTURE
In this section, the hardware architecture to run the backprojection algorithm on FPGA is described. The methodology used to design the system has the following steps:
1) Algorithm rescheduling to improve memory accesses. The algorithm was reorganized to improve data accesses to external memory and to increase on-chip memory reuse. Loop tiling exploits spatial and temporal locality, allowing data to be accessed in tiles so that the same operations are executed over a block of data stored in local memory;
2) Data type optimization. The data type representations were converted from single- and double-precision floating-point to fixed-point. This reduces the complexity of the arithmetic operators implemented in hardware and the memory required to store data. The fixed-point representation (number of integer and fractional bits) was found for each variable in order to achieve an algorithm accuracy close to the SNR specified by the designer. The fixed-point format determines the hardware area of the final solution as well as its performance;
3) Hardware design of the upsampling and backprojection algorithms. A dedicated configurable accelerator is designed to run both the upsampling and the backprojection algorithm.

A. Algorithm Rescheduling
The output pixels of the backprojection algorithm can be calculated independently of each other, so the algorithm is highly parallelizable and the outcome is not affected by the order in which pixels are calculated.
Two main methods for improving the execution of an algorithm in hardware are parallelization of computations and datapath pipelining. A pipelined datapath achieves a throughput of one output pixel per clock cycle, but this requires reading the input data points from memory in a single cycle. This is difficult to achieve in the backprojection algorithm since, in each clock cycle, two complex data points, g(p, b) and g(p, b + 1), would have to be retrieved from memory for each pulse p. Since the reading order of data points may not be sequential (it depends on b), several clock cycles would be needed to read each data point from external memory. A workaround consists of storing all bins of a pulse in on-chip RAM (Random Access Memory). This on-chip memory allows single-cycle random access to data, which permits pipelined execution.
The execution throughput of the algorithm can be further improved with multiple pipelined datapaths. Using on-chip memory to cache data permits multiple simultaneous accesses to data points using multi-port memories. Memories with multiple ports can be implemented on FPGA by replicating data across multiple on-chip memories, allowing multiple parallel accesses.
Storing all data on-chip would require a large on-chip memory that may not be available even in large FPGA devices. For example, an image with 512 pulses of 512 single-precision floating-point complex samples, upsampled 8 times, needs 512 × 512 × 8 × 4 × 2 = 16 MBytes of on-chip memory. Therefore, only a subset of the data can be loaded at a time. Since there are no data dependencies between the calculation of different output pixels and there are no constraints over the order in which output pixels can be produced, input data, g(p, b), can be reused by calculating the contribution of each pulse for a set of pixels (see Algorithm 2).
Algorithm 2 New scheduling for the backprojection algorithm, where the contribution of a single pulse is calculated for all output pixels in a line.
1: for all Y lines y do
2:   for all pulses p do
3:     load data of pulse p into on-chip memory
4:     for all pixels in X x do
5:       R ← sqrt((x − xk)² + (y − yk)² + (z − zk)²)
6:       b ← (R − R0)/∆R
7:       n ← ⌊b⌋
8:       gx,y ← Data(p, n) + (b − n) · (Data(p, n + 1) − Data(p, n))
9:       mf ← e^(i·2·ku·R)
10:      prod ← gx,y × mf
11:      acc(x) ← acc(x) + prod
12:    end for
13:  end for
14: end for
This scheduling of the algorithm calculates the contribution of each pulse to all pixels in a line before proceeding to the next pulse. Therefore, the data associated with a single pulse is reused a number of times equal to the number of pixels in a line. This reduces the pressure over the link to external memory to retrieve pulse data, since this data is cached in on-chip memory and reused multiple times. It also allows parallel pixel calculation, since data can be easily replicated to offer multiple memory ports.
Another aspect of this algorithmic organization is that data loading can be done in parallel with data processing, that is, data for the next pulse can be loaded while the data of the present pulse is being processed. The execution time of each iteration is therefore determined by the highest latency between data transfer and/or upsampling, and data processing.
The most efficient solution is the one with a balanced latency between communication/upsampling and processing.
In case the pipelined data processing is slower, parallel datapaths can be used. Otherwise, data reuse can be increased to extend the time available for data transfer and upsampling. Increasing data reuse is easily achieved in the backprojection algorithm by increasing the number of pixels processed for each pulse (see Algorithm 3).
Algorithm 3 New scheduling with tiling for the backprojection algorithm, where the contribution of a single pulse is calculated for all output pixels in a set of lines.
1: for all tiles of T L lines t do
2:   for all pulses p do
3:     load data of pulse p into on-chip memory
4:     for all lines l in tile t do
5:       for all pixels in X x do
6:         R ← sqrt((x − xk)² + (y − yk)² + (z − zk)²)
7:         b ← (R − R0)/∆R
8:         n ← ⌊b⌋
9:         gx,y ← Data(p, n) + (b − n) · (Data(p, n + 1) − Data(p, n))
10:        mf ← e^(i·2·ku·R)
11:        prod ← gx,y × mf
12:        acc(x, l) ← acc(x, l) + prod
13:      end for
14:    end for
15:  end for
16: end for

In this new rescheduling of the algorithm, each pulse is used for all pixels in a subset of lines, T L, that is, a tile of T L lines. Pulse data reuse is therefore increased T L×, which reduces the pressure over the external memory access and the upsampling calculation by a proportional amount. The drawback of this solution is that more on-chip memory is necessary to store the partial accumulations for all pixels under processing.
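The effect of the tiled, pulse-major schedule on external memory traffic can be illustrated with a small software model. This is an illustrative sketch: the names and the `contribute` callback, which abstracts steps 1-5 of the algorithm, are our own:

```python
import numpy as np

def backproject_tiled(image_shape, pulses, contribute, tile_lines):
    """Pulse-major schedule with tiling.

    Each pulse is fetched once per tile of `tile_lines` image lines and
    reused for every pixel in the tile. `contribute(p, y, x)` returns one
    pulse's complex contribution to pixel (y, x). Returns the image and
    the number of pulse fetches from external memory.
    """
    iy, ix = image_shape
    img = np.zeros(image_shape, dtype=complex)
    fetches = 0
    for y0 in range(0, iy, tile_lines):            # one tile of TL lines
        for p in range(pulses):                    # fetch pulse once per tile
            fetches += 1
            for y in range(y0, min(y0 + tile_lines, iy)):
                for x in range(ix):                # reuse the cached pulse
                    img[y, x] += contribute(p, y, x)
    return img, fetches
```

The number of fetches is pulses × iy / TL, so doubling the tiling factor halves the external memory traffic, at the cost of a larger on-chip accumulator buffer.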

B. Data Type Optimization
The original implementation of the backprojection algorithm uses double-precision floating-point arithmetic. This offers a large dynamic range of values with high precision. However, it is more computationally intensive and occupies more memory than other floating-point representations, like float or half-float, and fixed-point or integer representations. When implemented in hardware, the latency and the occupied resources are larger than those of simpler representations.
Therefore, the hardware implementation of an algorithm is usually preceded by an analysis of the dynamic range of all data so that hardware-friendly data representations can be used. In this work, fixed-point representations were used to replace double-precision floating-point with negligible accuracy loss. To achieve this, the methodology represented in Fig. 1 was followed.
Performance is not considered explicitly as a metric because the circuit is fully pipelined. Changing the bitwidth of a fixed-point number may influence the critical path and consequently the maximum operating frequency; reducing the bitwidth thus has an impact on both area and performance. Since the area is being optimized, performance will also tend to improve. The optimization is therefore guided by area, and the operating frequency is determined for the final solution. The results for all architectures assume a fixed frequency, even though smaller designs can run at a higher frequency. The frequency is mostly influenced by routing when the FPGA has a high percentage of utilization; these factors are difficult to include in the design exploration tool, so they are omitted.
The hardware area of the circuit used in the methodology is obtained from models of the hardware implementation of the arithmetic operators. A model was considered for each type of arithmetic unit: addition/subtraction, accumulation, multiplication, square root, and trigonometric functions (sin/cos), within the variation range of the operands. For example, the number of LUTs of an adder equals the size of its operands. For the remaining operations, a table with the number of LUTs and DSPs was determined for all operand sizes within the range of the variables involved in each operation.
Initially, the value range of all variables is determined for a set of images. Considering the integer part of the highest values of each variable, the number of necessary bits to represent the integer part is found. Then, considering each variable at a time, we find the maximum and minimum number of bits to represent the fractional part. The maximum number corresponds to the necessary number of bits to keep the accuracy loss as small as possible. The minimum number of bits corresponds to those bits necessary to obtain an accuracy higher than a user specified value. This value determines the achievable reduction in hardware resources.
Then, starting with the highest number of fractional bits for all variables, the number of fractional bits is reduced. The variable with the lowest ratio (∆acc/Totalacc) / (∆hw/Totalhw) is chosen, and one fractional bit is removed from it. Here, ∆acc and ∆hw represent the variations in accuracy and in the hardware area estimate, respectively, and Totalacc and Totalhw represent the total accuracy and the total hardware area estimate of the algorithm.
The process is iteratively repeated until achieving the lowest acceptable accuracy defined by the user. The range of values for the number of fractional bits determines the range of the accuracy and the utilization of hardware resources.
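The greedy bit-reduction loop can be sketched as follows. This is an illustrative sketch, not the paper's tool: the interface, with user-supplied `snr_of` and `area_of` models mapping a bit allocation to an SNR and an area estimate, is our own assumption:

```python
def greedy_bit_reduction(frac_bits, snr_of, area_of, snr_floor):
    """Greedily trim fractional bits, one at a time.

    At each step, remove one fractional bit from the variable whose
    relative accuracy loss per relative area saving is smallest, while
    keeping the algorithm SNR at or above `snr_floor`.

    frac_bits: dict var -> current fractional bits.
    snr_of/area_of: callables taking such a dict and returning the
    algorithm SNR and the hardware-area estimate (illustrative models).
    """
    bits = dict(frac_bits)
    while True:
        best, best_ratio = None, None
        snr0, area0 = snr_of(bits), area_of(bits)
        for var in bits:
            if bits[var] == 0:
                continue
            trial = dict(bits)
            trial[var] -= 1
            snr1, area1 = snr_of(trial), area_of(trial)
            if snr1 < snr_floor:
                continue  # this cut would violate the accuracy target
            d_acc = (snr0 - snr1) / snr0    # relative accuracy loss
            d_hw = (area0 - area1) / area0  # relative area saving
            ratio = d_acc / d_hw if d_hw > 0 else float("inf")
            if best_ratio is None or ratio < best_ratio:
                best, best_ratio = var, ratio
        if best is None:
            return bits  # no further cut keeps SNR above the floor
        bits[best] -= 1
```

Running the search with different `snr_floor` values yields the family of architectures (40 dB to highest SNR) evaluated in Section V.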

C. Hardware Design of the Backprojection Algorithm
This section presents the proposed hardware design of Algorithm 3 described in section IV-A. The proposed hardware module, SAR IP, is configurable in the number of fractional bits, the tiling factor and the number of parallel datapaths. The full circuit with a single datapath is represented in Figure 2.
The circuit is fully pipelined to increase throughput and includes several on-chip buffers for data reuse and communication/computation overlap. Figure 2 identifies the main modules, which are all connected in a stream-like, fully pipelined architecture. Synchronization only happens at the input and output of data, with valid/ready signals. All operators, including the upsampler, are pipelined.
The module includes one on-chip buffer to store the platform positions. The contents of this memory are constant throughout the execution of the algorithm. A second on-chip buffer is used to store the data points. At each time, all data points of a single pulse are stored in the buffer and reused. While a set of data points of a particular pulse is being used, the data points of the next pulse are loaded to the on-chip buffer. Therefore, this on-chip buffer is split into memories that work in a ping-pong fashion.
The pixel accumulation is also implemented with on-chip memory. Likewise, while the output pixels associated with a set of data points are being generated, the previous output pixels are being transferred to external memory. So, this buffer also works in a ping-pong fashion.
The size of the operators is statically configured before synthesizing the circuit. The tiling factor, which determines the degree of data reuse, is also configurable. The larger the tiling block, the larger the on-chip memories needed to store the accumulations. The hardware datapath that implements the BP algorithm can be replicated to allow the parallel calculation of output pixels. Algorithmically, this corresponds to unrolling the X loop of Algorithm 3. The nested loops of the designed architecture accept a continuous flow of data with an initiation interval of one. Figure 3 illustrates the implementation of the SAR IP core with two parallel datapaths.
Increasing the number of parallel datapaths requires a proportional increase of the hardware resources and memory. The upsample and the position memories are shared by the parallel datapaths. Since each datapath needs dual port access to the data memory, this memory is replicated with the same contents for each datapath to increase the number of memory ports.
The execution time of the backprojection algorithm can be theoretically estimated for a particular image size, ix, iy, number of pulses, p, tiling T L, upsample factor, U , and number of parallel datapaths, DP . Since the architecture is fully pipelined and neglecting the initial pipeline filling and final pipeline emptying, the number of clock cycles, BP cycles , to execute the whole backprojection algorithm, including upsampling, is estimated by Equation 7.
where Upsample(p, U) is the number of cycles to execute the upsample of p pulses by the upsample factor U . The expression considers the number of cycles to run the backprojection algorithm and the cycles to run the upsample. The worst case determines the total latency of the circuit.
The number of cycles is therefore determined by the maximum number of cycles between upsampling and backprojection. The designer must determine the best hardware configuration for a particular image size, memory bandwidth and operating frequency to obtain the most efficient solution, that is, a solution with the minimal idle times among all processing modules.
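This latency model can be sketched in a few lines, under the assumption that one pulse iteration takes the maximum of the datapath cycles (ix · TL pixels at DP pixels per cycle) and the cycles to load and upsample the next pulse. This is a hedged reconstruction of the max() behaviour described in the text, not the paper's exact Equation 7:

```python
def bp_cycles(ix, iy, pulses, tl, dp, upsample_cycles):
    """Estimate total cycles of the fully pipelined design.

    Per pulse, the datapaths process ix*tl pixels at dp pixels per
    cycle, while the next pulse is loaded and upsampled in parallel;
    the slower of the two activities sets the iteration time.
    upsample_cycles is the assumed per-pulse load/upsample latency.
    """
    per_pulse = max((ix * tl + dp - 1) // dp, upsample_cycles)
    tiles = (iy + tl - 1) // tl  # number of TL-line tiles in the image
    return tiles * pulses * per_pulse
```

The model reproduces the balancing rule discussed below: once the upsample term dominates, adding datapaths no longer helps, and the tiling factor must grow with DP.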
Another important aspect of the architecture is the required on-chip memory. The total on-chip memory, OCM, in bytes, of the proposed architecture is given by Equation 8:

OCM = ix × sizeof(PlatPos) + …

The maximum tiling factor and the number of parallel datapaths are constrained by the resources of the target FPGA device.
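A rough on-chip memory budget along these lines can be sketched as follows. The buffer structure (ping-pong pulse buffer replicated per datapath, ping-pong accumulators) follows Section IV-C, but the byte sizes and the exact terms are assumptions for illustration, not the paper's Equation 8:

```python
def ocm_bytes(ix, tl, dp, upsample,
              sample_bytes=8, pos_bytes=12, acc_bytes=8):
    """Rough on-chip memory budget, in bytes, for one configuration.

    Assumptions: a complex sample is 8 bytes (2 x 32-bit), a platform
    position is 3 x 32-bit, accumulators are 8 bytes; the pulse buffer
    is double-buffered (ping-pong) and replicated once per datapath to
    provide enough memory ports.
    """
    positions = ix * pos_bytes                          # platform positions
    pulse_buf = 2 * ix * upsample * sample_bytes * dp   # ping-pong pulse data
    acc_buf = 2 * ix * tl * acc_bytes                   # ping-pong accumulators
    return positions + pulse_buf + acc_buf
```

Evaluating the model for candidate (TL, DP) pairs gives a quick feasibility check against the BRAM capacity of the target device.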

V. EXPERIMENTAL RESULTS
The SAR IP core has been described in VHDL, integrated in a System-on-Chip architecture, and tested on FPGA. The target device is an Artix-7 XC7A200T FPGA on the Nexys Video trainer board. All architectures were tested with synthetic images of sizes 512 × 512 with 512 pulses and 1024 × 1024 with 1024 pulses, and with a real image from the AFRL dataset with a size of 512 × 512 and 512 pulses. The 1024 × 1024 output images are the same as those with half the size, but with half the sampling. Any other size can be considered with a proper configuration of the architecture.

A. Analysis of the Error with Fixed-Point Design
The proposed methodology for floating-point to fixed-point conversion was used to convert the algorithm to a custom fixed-point design. Different SNR constraints were considered, namely 40 dB, 60 dB, 80 dB, 100 dB, 120 dB and the highest achievable SNR. In all cases, the fixed-point representation of all variables was determined (see the results for images of size 512 × 512 in Table I).
From the table it is possible to observe the number of bits considered for the integer part and the range of fractional bits for each variable of Algorithm 1. The square root is one of the most critical variables, with a narrow range of variability in the fractional part. The SNR with double-precision floating-point is 138.87 dB. The highest SNR achieved with fixed-point was 137.65 dB, close to the floating-point implementation. The upsample operations were also implemented with fixed-point, with the variable range identified in the table as g(data).
The fixed-point representations are similar for other image sizes, with a small increase in the number of bits of the integer part, namely in the R, b and acc variables, with 1 to 2 bits more. The last column shows the results for double-precision floating-point arithmetic. The difference in accuracy is negligible.

[Table fragment: Upsample 11232 / 88 / 78; BP 15878 / 63 / 58; Total 26945 / 151 / 134.]

B. Analysis of the Circuit Area
Algorithm 3 was designed and implemented in the FPGA as an accelerator in the form of an IP core (SAR IP). The full-precision algorithm was also implemented to obtain a reference against which the fixed-point implementations can be compared (see Table II).
The accelerator is configurable and, therefore, can be designed with custom data-point representations for all variables. Different architectures (Arq.) were generated and implemented considering the fixed-point representations reported in Table I for different SNR (see the occupation of FPGA resources in Table III for an image of size 512 × 512).
The best architecture has an SNR close to that achieved with double-precision floating-point. Also, the architecture with the lowest SNR uses about 30% fewer resources than the architecture with the best SNR. Considering images of size 1024 × 1024, the upsample module was configured for 1024 points (see results in Table IV).
The architecture for the larger images differs mainly in the upsample module, since the datapath only differs in the final accumulators. The architecture with the highest SNR needs more resources, as expected. Choosing the right architecture will depend on the acceptable SNR and the resources of the target device.
Since the proposed architecture is configurable in terms of parallelism and tiling factor, we have analyzed the occupied resources of the SAR-IP for different configurations of tiling and number of datapaths (see Tables V and VI).
The core was configured with up to 8 parallel datapaths. The most balanced solution depends on the correct computation balance between the upsample and the BP hardware modules. When the number of datapath units increases, the execution time of the BP algorithm decreases proportionally. Since the upsample always needs the same time to compute, the tiling has to increase proportionally to the number of datapaths to balance the execution of both the upsample and the BP. Assuming that both modules work at the same frequency, a balanced solution is achieved when both parallelism and tiling increase proportionally.
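This balance condition can be sketched with a toy cycle-count model (all cycle counts below are hypothetical, not measured from the core): per pulse, the upsample takes a fixed number of cycles, while the BP work grows with the tiling factor and is divided among the datapaths, so the smallest tiling that hides the upsample time grows proportionally with the number of datapaths.

```python
import math

def balanced_tiling(n_datapaths, upsample_cycles, bp_cycles_per_tile):
    """Smallest tiling factor whose BP work per pulse covers the fixed
    upsample time, assuming both modules run concurrently at one clock:
    tiling * bp_cycles_per_tile / n_datapaths >= upsample_cycles."""
    return math.ceil(upsample_cycles * n_datapaths / bp_cycles_per_tile)

# With hypothetical cycle counts, doubling the datapaths doubles the tiling.
for dp in (1, 2, 4, 8):
    print(dp, balanced_tiling(dp, upsample_cycles=1024, bp_cycles_per_tile=1024))
```

The linear relation between datapaths and tiling in this model matches the configurations reported in Tables V and VI.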

C. Performance Analysis of the Hardware/Software Architecture
The architectures with SNR ≈ 100 dB were considered for testing in the FPGA with images of size 512 × 512 and 1024 × 1024. Results can be obtained for other configurations of the architecture with different SNR and other image sizes by configuring the SAR IP and resynthesizing the system.
The core was integrated in a System-on-Chip (SoC) with an embedded RISC-V CPU (see Figure 4).
The system uses a low-performance RISC-V soft-processor to control the memory sub-system and peripherals, including the SAR IP core. The set of peripherals includes internal memory to store the firmware, external memory to store the input and output images and one UART. The data communication between the external memory and the core is done with a DMA (Direct Memory Access) through the DDR memory controller. The DMA is configured by the CPU and allows direct access between the core and the memory. The area of the individual modules and of the full hardware/software architecture for a configuration with DP = 8 and tiling = 8 is shown in Table VII. The most used resources for the configuration with DP = 8 are BRAMs and LUTs, which determine the highest parallelism factor. DP factors above 8 could be considered to increase the utilization of the available resources and improve the performance. However, DP factors that are not powers of two are not yet supported by the configurable core, so they were not considered in this study. This is left for a future upgrade of the core.
To test the architecture, the DDR memory available in the board was utilized to store the input data. The full SoC system was implemented in the FPGA with a frequency of 125 MHz considering different configurations of the SAR IP core. From these, the execution times were determined (see Table VIII).
The SAR IP runs the BP algorithm for images of size 512 × 512 in an Artix-7 FPGA in 0.14 seconds. Real-time processing of images with size 512 × 512 is achieved with all configurations, according to the SAR image capture times reported in [20]. According to [20], real-time processing is achieved with 60000 pixels/s. For images of size 1024 × 1024, the core takes 1.11 seconds to run the BP algorithm using 8 parallel datapaths. From Table VIII it is possible to verify the scalability of the architecture, since the execution time reduces proportionally to the increase of the number of datapaths.
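The real-time claim follows directly from the measured times: dividing the output pixels by the execution time and comparing against the 60000 pixels/s bound from [20] shows both image sizes clear the bound with a wide margin.

```python
# Measured execution times (8 parallel datapaths) for the two image sizes.
runs = {(512, 512): 0.14, (1024, 1024): 1.11}

REALTIME_PX_PER_S = 60_000  # real-time bound reported in [20]

for (h, w), t in runs.items():
    throughput = h * w / t  # output pixels per second
    print(f"{h}x{w}: {throughput:,.0f} px/s "
          f"(real-time: {throughput >= REALTIME_PX_PER_S})")
```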

D. Comparison of SAR IP with other Computing Devices
The BP algorithm was executed on a PC desktop with a quad-core Intel Core i7-4700MQ processor with 16GB of RAM, on an NVIDIA GeForce RTX 2060 GPU and on the Jetson TX2 module with a 256-core NVIDIA Pascal GPU. The algorithm was run for both image sizes (see results in Table IX).
The fastest platform is the GPU and the slowest is the CPU. However, neither is appropriate for embedded computing since their power consumption is too high. Compared to the embedded GPU, the proposed FPGA architecture is 2.6× faster for images of size 512 × 512 and 2.5× faster for images of size 1024 × 1024. Also, the FPGA has an energy consumption 13× smaller than the embedded GPU. This is an important advantage for embedded computing.

E. Comparison of SAR IP with other FPGA-Based Platforms
The execution time and the energy required to process an image with the proposed architecture were also compared with previous FPGA works. Since different FPGA devices are used in previous works, efficiency metrics were also considered for a fair comparison. The efficiency metrics consider the number of pixels/pulses (PP) per LUT, DSP and BRAM that are processed per second. For example, as shown in Table IX, the proposed solution processes 512 × 512 × 512 PP in 0.14 seconds with 80589 LUTs. In this case, the LUT efficiency equals 512 × 512 × 512/0.14/80589 = 11,896 PP/s/LUT. Another metric was considered to determine the energy efficiency (kPP/Energy), that is, the number of processed backprojection pixels per joule. The comparative results with previous implementations on FPGA are shown in Table X.
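The efficiency figures quoted above can be reproduced in a few lines. The numbers come from the text (the 0.41 J energy figure is taken from the abstract for the 512 × 512 configuration):

```python
pixels = 512 * 512   # output image pixels
pulses = 512         # pulses backprojected per pixel
time_s = 0.14        # execution time for the 512x512 image
luts = 80_589        # occupied LUTs
energy_j = 0.41      # energy for the 512x512 image (from the abstract)

pp = pixels * pulses                  # pixel-pulses (PP) processed
lut_eff = pp / time_s / luts          # PP/s/LUT
energy_eff = pp / 1e3 / energy_j      # kPP per joule

print(round(lut_eff))   # ≈ 11,896 PP/s/LUT, matching the text
print(round(energy_eff))
```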
All previous works consider large FPGAs with high consumption for an embedded system, except the work from [24] that was implemented in a ZYNQ7020. However, it runs partially in the embedded processor, so it takes over one minute to run the algorithm.
Considering the area efficiency, the proposed solution is the most efficient in terms of LUTs (1.8× better) and BRAMs (1.5× better). Since multiplications can be implemented with DSPs or LUTs, a design with a lower ratio of DSPs has a higher ratio of LUTs and vice-versa. This means that the proposed solution achieves the highest pixel processing per second per area. Since the architecture is scalable, the ratio closely applies to different image sizes. The fully optimized pipelined architecture also achieves the best performance efficiency (PP/Time/kLUT) and the best energy efficiency (kPP/Energy), by almost 2× and 4×, respectively, compared to the best previous architecture. The improvements in performance and energy achieved with the proposed architecture come mainly from the fine customization of the fixed-point representation and from a fully pipelined, scalable architecture with configurable parallelism.
The proposed methodology for floating-point to fixed-point conversion of the algorithm helped to obtain an architecture with better performance and energy efficiency. The methodology can be applied to other algorithms as an initial design step, before hardware design, to improve the final architecture.
The proposed architecture is also scalable, which allows its custom design for FPGAs with different hardware densities and for images with different dimensions. It can also be designed with different levels of parallelism and different external memory bandwidths so that the right balance between communication and computation can be obtained. It also allows the fine-grain customization of fixed-point formats so that solutions with different tradeoffs between performance and accuracy can be generated with high efficiency.
In terms of energy consumption and power, the proposed accelerator has shown high performance with the highest efficiencies compared to other embedded processing solutions and FPGAs. This is particularly relevant for on-board processing of highly complex algorithms with low energy.
Another relevant aspect of the proposed solution is the integration of both upsampling and SAR image formation in a single custom hardware architecture. This improves the quality of the generated image while optimizing the performance and energy efficiency. Previous works do not consider this design aspect.
The work described in this paper has therefore contributed to the advance of high-performance on-board processing at low cost and energy.

VI. CONCLUSIONS
On-board processing systems have recently emerged to avoid transferring huge amounts of data from the satellite to the ground station. SAR imaging is a remote sensing technology that can benefit from on-board processing. This paper proposes an FPGA-based hardware core for on-board processing of SAR images.
The original algorithm was reorganized to improve the accesses to data stored in external memory. The proposed architecture has been designed in a development board with an Artix-7 FPGA. Experimental results indicate that the proposed SAR IP core can fulfill real-time requirements with low resource usage in a low-cost FPGA. The architecture is more energy efficient than other computing platforms and previous FPGA implementations. The proposed core is configurable and therefore can be tailored to different performance, area and power requirements.