A Real-Time Wideband Subband LMS Algorithm for Full-Duplex Communications

In-band full-duplex (IBFD) communications has the potential to nearly double the spectrum efficiency of existing 5G communications. Digital signal processing to mitigate self-interference is one way to realize this full-duplex capability. However, the aggregated wideband nature of 5G coupled with the potential for large delay spreads and short coherence times makes the realistic implementation of such an IBFD processor challenging. In this paper, we address the architectural considerations and describe the practical implementation of a real-time IBFD digital processor and its performance when operating on wideband data collected in a highly dynamic environment.


I. INTRODUCTION
IN-BAND full-duplex communications can significantly improve the spectral efficiency of 5G cellular communications, with the potential to double both the uplink and downlink throughput [1]. Although there are signal processing approaches to realize digital cancellation for full-duplex communications [2]-[4], most papers in the open literature that describe practical full-duplex implementations have focused on RF and analog cancellation circuits [5]. For digital implementations, 5G presents some unique challenges; most notably, wireless data transfers must occur in as much as 400 MHz of instantaneous bandwidth [6]. With the potential for RMS delay spreads of 100-300 ns [7] and sample rates in excess of 1 Gsps, a finite impulse response (FIR) filter with several hundred coefficients may be needed to suppress self-interference. These difficulties, along with a relatively short coherence time [8], make it a significant challenge to implement a real-time canceler in digital hardware that is capable of tracking a rapidly changing channel.
In this paper, we consider both the problem of, and a candidate solution to, the digital baseband implementation of a full-duplex communications system hosted on a Xilinx field-programmable gate array (FPGA). We describe the design of a subband adaptive filter (SAF) least mean squares (LMS) architecture operating in 1 GHz of instantaneous bandwidth. The SAF LMS architecture described in this paper is preferable to a frequency-domain block LMS algorithm [9], as it is better suited to following rapid changes in the channel impulse response and is not susceptible to the extreme dynamic range challenges arising from quantization error. To the best of our knowledge, there are no other papers in the open literature that describe the implementation of a real-time LMS algorithm operating in excess of 2 Gsps (1 GHz of instantaneous bandwidth), not only for full-duplex operation [10], [11] but for any application.
This paper is organized as follows: Section II describes the technical approach to achieve such high data rates, Section III describes experimental results demonstrating the approach's efficacy, and Section IV concludes with the novelty of this approach.

II. TECHNICAL APPROACH
Baseband full-duplex operation was realized using a "two-channel" architecture [12] presented in [13] and depicted in Fig. 1. In this architecture, a reference channel (REF) samples the output of the transmitter, $y_{Tx}$, after its convolution with the REF impulse response of the receiver, $h_{ref}$, which includes the response of the analog-to-digital converter (ADC). A second channel, the over-the-air (OA) channel, receives the external signal of interest (SOI), $y_{soi}$, and a backscattered transmitted signal after its convolution with the OA impulse response, $h_{oa}$. Note that $h_{oa}$ represents both the channel backscatter and the impulse response of the RF receiver. The SOI experiences a different multipath environment than the transmitted signal and thus is only subject to the impulse response of the RF receiver and not $h_{oa}$. As such, the signals associated with the REF and OA channels are, respectively,
$$y_{ref} = h_{ref} * y_{Tx}, \qquad y_{oa} = h_{oa} * y_{Tx} + y_{soi}$$
where the $*$ operator denotes convolution. Within the context of the canonical LMS system model, $y_{ref}$ is the system input $x(n)$ and $y_{oa}$ is the desired signal $d(n)$.
Hardware impairments arising from, for example, the high-power amplifier (HPA) in the transmitter can significantly decrease full-duplex cancellation performance. However, by employing the two-channel architecture depicted in Fig. 1, the deleterious effects of transmitter noise and distortion are mitigated by sampling the transmitter output in the process of forming a linear equalization filter [14]. This enables cancellation performance which would otherwise be unachievable using a single-channel receiver with linear equalization alone [15]. Additionally, sampling the transmitter output with a two-channel architecture enables the system to cancel self-interference irrespective of the modulation, or other signal statistics, of either the transmitter or the external SOI.
A SAF LMS approach is then used to identify the equalizer, $h_{eq}$, whose ideal impulse response satisfies
$$h_{eq} * h_{ref} = h_{oa}$$
Lastly, the SOI is recovered by subtracting the equalized REF channel from the OA channel:
$$\hat{y}_{soi} = y_{oa} - h_{eq} * y_{ref} \tag{2}$$
The LMS algorithm is among the most popular adaptive filtering algorithms because of its algorithmic simplicity. However, LMS suffers from various issues, including convergence speed and stability in the presence of in-band interference [16]. With the canonical LMS formulation, the error signal $e(n)$ is expected to converge to zero. However, when other in-band signals are present, such as orthogonal frequency-division multiplexing (OFDM) signals or, in the case of full-duplex communications, an external SOI received alongside the self-interference, the error signal should decrease but not converge to zero. This is because, after the adaptive filter has converged and reached steady-state, the error signal is expected to be identical to the signal desired to be received (i.e., the SOI). More specifically, the estimate of the SOI $\hat{y}_{soi}$ in Eq. (2) is identical to the error signal $e(n)$.
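The two-channel model and the SOI recovery step can be illustrated with a short NumPy sketch. All impulse responses, signal levels, and variable names below are hypothetical placeholders rather than measured values; the OA response is deliberately built as the REF response convolved with a short filter `g`, so that the ideal equalizer is known in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096

y_tx = rng.standard_normal(n)           # transmitter output (would include HPA noise and distortion)
y_soi = 0.01 * rng.standard_normal(n)   # weak external signal of interest

# Hypothetical impulse responses (assumptions for illustration only).
h_ref = np.array([1.0, 0.5])            # REF path of the receiver
g = np.array([0.5, 0.0, 0.2, -0.1])     # leakage and backscatter relative to REF
h_oa = np.convolve(h_ref, g)            # OA path, chosen so that g * h_ref = h_oa

y_ref = np.convolve(y_tx, h_ref)[:n]             # REF channel
y_oa = np.convolve(y_tx, h_oa)[:n] + y_soi       # OA channel

h_eq = g                                         # ideal equalizer for this toy setup
y_soi_hat = y_oa - np.convolve(y_ref, h_eq)[:n]  # subtract equalized REF from OA

residual = np.max(np.abs(y_soi_hat - y_soi))
print(f"max recovery error: {residual:.2e}")
```

In this idealized setting the recovered signal matches the SOI to within floating-point rounding; in practice the equalizer is unknown and must be identified adaptively.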
Here we propose the integration of two separate enhancements to the traditional LMS algorithm: subbanding and parallelization. Additional unique and novel enhancements and adaptations are performed to realize real-time operation at 2 Gsps. Therefore, both digital signal processing methodology and approaches for practical hardware implementation must simultaneously be considered.

A. SUBBAND LMS
Consider the canonical system model [17]
$$d(n) = \mathbf{w}^T \mathbf{x}(n) + \eta(n)$$
where $n$ denotes the sample index, $T$ denotes the vector transpose operation, $\mathbf{w} = [w_0, w_1, \ldots, w_{P-1}]^T$ denotes the vector of $P$ unknown system coefficients, $\mathbf{x}(n) = [x(n), x(n-1), \ldots, x(n-P+1)]^T$ denotes the vector of the most recent $P$ system inputs, and $\eta(n)$ denotes any source of noise or interference that is independent of the system input. With the a priori error signal defined as
$$e(n) = d(n) - \hat{\mathbf{w}}^T(n)\,\mathbf{x}(n)$$
where $\hat{\mathbf{w}}(n)$ denotes the vector of $M$ adaptive filter coefficients (with $\mathbf{x}(n)$ correspondingly taken over the most recent $M$ inputs), the canonical normalized adaptive coefficient update is [17]
$$\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \mu\,\frac{\mathbf{x}(n)\,e(n)}{\mathbf{x}^T(n)\,\mathbf{x}(n)} \tag{5}$$
where $\mu$ denotes the step-size. The implication of Eq. (5) is that the adaptive filter coefficients $\hat{\mathbf{w}}(n)$ are updated with every new sample of the received signal $x(n)$. However, this canonical approach cannot be realized in real-time at 2 Gsps because current hardware is not able to achieve such high clock rates. Therefore, subband LMS architectures are considered for a practical implementation because they enable the adaptive filter coefficients to be updated at a much slower rate than the data rate while still being able to quickly adapt to a rapidly changing channel.
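For reference, a minimal software sketch of the canonical sample-by-sample normalized LMS update is given below. It assumes a noise-free system identification problem with a hypothetical 8-tap system; the function name `nlms`, the regularization term `eps`, and the step-size value are illustrative assumptions and not part of the hardware design.

```python
import numpy as np

def nlms(x, d, M, mu=0.5, eps=1e-12):
    """Sample-by-sample normalized LMS; returns final coefficients and a priori errors."""
    w_hat = np.zeros(M)
    x_vec = np.zeros(M)                  # [x(n), x(n-1), ..., x(n-M+1)]
    e = np.zeros(len(x))
    for n in range(len(x)):
        x_vec = np.roll(x_vec, 1)        # shift the delay line by one sample
        x_vec[0] = x[n]
        e[n] = d[n] - w_hat @ x_vec      # a priori error
        # Normalized update; eps (an assumption) guards against division by zero.
        w_hat = w_hat + mu * x_vec * e[n] / (x_vec @ x_vec + eps)
    return w_hat, e

rng = np.random.default_rng(1)
w_true = rng.standard_normal(8)          # hypothetical unknown system, P = 8
x = rng.standard_normal(5000)
d = np.convolve(x, w_true)[:len(x)]      # noise-free desired signal
w_hat, e = nlms(x, d, M=8)
print("misalignment:", np.linalg.norm(w_hat - w_true))
```

The coefficient vector is touched on every input sample, which is exactly the property that makes this form impractical at 2 Gsps.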
Subband LMS approaches have been widely explored [18]-[20], most often to reduce computational complexity and improve convergence speed. However, these methods often require specially designed filter banks. Other methods have mitigated the additional delay introduced by subband filtering [21]. However, such an approach uses block Fourier transforms which introduce fixed-point challenges with extraordinarily high dynamic range requirements when implementing on an FPGA. Due to these challenges, an approach from [22] is adopted.
Using the adopted subband approach [22], the coefficient update becomes
$$\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \mu \sum_{k=0}^{N-1} \frac{\mathbf{x}_k(n)\,e_{k,D}(n)}{\mathbf{x}_k^T(n)\,\mathbf{x}_k(n)}$$
where $N$ denotes the number of subbands, $\mathbf{x}_k(n) = [x_k(nN), x_k(nN-1), \ldots, x_k(nN-M+1)]^T$ is the system input vector for the $k$th subband, $e_{k,D}(n)$ denotes the decimated error signal for the $k$th subband, $M$ denotes the number of coefficients of the adaptive filter $\hat{\mathbf{w}}(n)$, and $\mu$ denotes the step-size. This adopted SAF approach also allows the flexibility to employ a user-selected filter bank architecture. Therefore, a generic cosine modulated filter bank with prototype filter based on [23] is selected. Then, the filter bank is constructed by [24]
$$h_k(n) = 2 h_p(n) \cos\left( (2k+1)\frac{\pi}{2N}\left(n - \frac{L-1}{2}\right) + (-1)^k \frac{\pi}{4} \right)$$
where $k = 0, 1, \ldots, N-1$ is the subband index, $h_p(n)$ is the prototype filter, and $L$ denotes the number of coefficients in $h_p(n)$. Note that $h_k(n)$ only represents the analysis filter bank. The synthesis filter bank is unnecessary for the real-time operation of the SAF LMS architecture and is omitted to simplify the design and reduce computational complexity.
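A cosine modulated analysis filter bank of this kind can be sketched in a few lines of NumPy. The sketch assumes the standard cosine modulation with band centers at $(2k+1)\pi/(2N)$ and substitutes a Hamming-windowed sinc for the prototype filter; the actual prototype design of [23] is not reproduced here, and the function name is hypothetical.

```python
import numpy as np

def cosine_modulated_bank(N, L):
    """Return an (N, L) array whose k-th row is the analysis filter h_k(n)."""
    m = np.arange(L) - (L - 1) / 2                 # n - (L-1)/2
    wc = np.pi / (2 * N)                           # prototype cutoff (assumption)
    # Hamming-windowed sinc low-pass prototype (stand-in for the design in [23]).
    h_p = (wc / np.pi) * np.sinc(wc * m / np.pi) * np.hamming(L)
    k = np.arange(N)[:, None]
    phase = (2 * k + 1) * np.pi / (2 * N) * m + (-1.0) ** k * np.pi / 4
    return 2 * h_p * np.cos(phase)

N, L = 4, 32
H = cosine_modulated_bank(N, L)

# Magnitude response of every filter evaluated at every band center.
centers = (2 * np.arange(N) + 1) * np.pi / (2 * N)
E = np.exp(-1j * np.outer(np.arange(L), centers))  # shape (L, N)
gain = np.abs(H @ E)                               # gain[k, j]: filter k at center j
print(np.round(gain, 3))
```

Each row of `gain` should be dominated by its diagonal entry, confirming that filter $k$ passes its own band and rejects the others.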
A naive implementation of the adopted SAF architecture is shown in the block diagram in Fig. 2. While the methodology is correct, this architecture cannot operate at a sample rate of 2 Gsps due to excessive FPGA resource consumption, primarily because the adaptive filter $\hat{\mathbf{w}}(n)$ is replicated across all $N$ subbands. This replication suggests that substantial optimizations may be realized. Options to reduce FPGA area consumption, which are not generally considered for a non-real-time software implementation, must therefore be considered for a practical hardware implementation. Equivalent functionality may be achieved by instantiating the adaptive filter only once, transposing $\hat{\mathbf{w}}(n)$ with the subband filters $H_k(z)$ so that it sits at the beginning of the signal processing chain. Consequently, the loop architecture is rearranged such that the error signal $e(n)$ undergoes subband filtering rather than the desired signal $d(n)$. This eliminates the computation of an error signal for each subband; instead, a single error signal is computed as in the canonical LMS architecture. A block diagram of the optimized subband architecture is shown in Fig. 3. The development of this architecture targeted for an FPGA implementation operating at 2 Gsps is unique and novel and is solely motivated by the real-time requirements and constraints imposed by such extremely high data rates.
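One possible software reading of the optimized loop is sketched below: a single full-band adaptive filter produces a single error signal, the error and the input are each passed through the analysis filters, and the coefficients are updated once every $N$ samples. This is a floating-point illustration under stated assumptions (windowed-sinc prototype, hypothetical 16-tap system, step-size of 0.05, and helper names `cmfb` and `saf_lms`), not the fixed-point FPGA design.

```python
import numpy as np

def cmfb(N, L):
    """Cosine-modulated analysis bank with an assumed windowed-sinc prototype."""
    m = np.arange(L) - (L - 1) / 2
    wc = np.pi / (2 * N)
    h_p = (wc / np.pi) * np.sinc(wc * m / np.pi) * np.hamming(L)
    k = np.arange(N)[:, None]
    return 2 * h_p * np.cos((2 * k + 1) * np.pi / (2 * N) * m + (-1.0) ** k * np.pi / 4)

def saf_lms(x, d, N=4, L=32, M=16, mu=0.05, eps=1e-9):
    H = cmfb(N, L)
    # Subband-filtered copies of the input, one row per analysis filter.
    x_sub = np.array([np.convolve(x, h)[:len(x)] for h in H])
    w_hat = np.zeros(M)
    e = np.zeros(len(x))
    for n in range(max(L, M), len(x)):
        # Single full-band adaptive filter and a single error signal.
        e[n] = d[n] - w_hat @ x[n - M + 1:n + 1][::-1]
        if n % N == 0:                                # decimated update instant
            e_sub = H @ e[n - L + 1:n + 1][::-1]      # subband-filtered error
            for k in range(N):
                x_k = x_sub[k, n - M + 1:n + 1][::-1]
                w_hat = w_hat + mu * x_k * e_sub[k] / (x_k @ x_k + eps)
    return w_hat, e

rng = np.random.default_rng(2)
w_true = rng.standard_normal(16)          # hypothetical unknown system
x = rng.standard_normal(20000)
d = np.convolve(x, w_true)[:len(x)]
w_hat, e = saf_lms(x, d)
misalignment = np.linalg.norm(w_hat - w_true) / np.linalg.norm(w_true)
print(f"relative misalignment: {misalignment:.2e}")
```

Because the error is filtered after it is formed, the update sees a slightly delayed view of the coefficients, which is one reason a small step-size and short filters are favored in this arrangement.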

B. PARALLELIZED LMS
Current FPGA clock rates are generally limited to approximately 500 MHz in practical operation despite being advertised to operate at faster clock speeds. To achieve real-time data rates in excess of 500 Msps, multiple data samples must be processed per clock cycle. This requires parallel FIR filters, which must be integrated into the architecture. With $R$ denoting the number of samples processed simultaneously per clock cycle, each of the subband analysis filters $H_k(z)$ and the adaptive filter $\hat{\mathbf{w}}(n)$ in Fig. 3 is replaced with its equivalent parallel FIR architecture, an example of which is shown in Fig. 4 for $R = 4$. In Fig. 4, each sub-filter $h_{s,r}$ for $r \in [0, R-1]$ includes every $R$th coefficient and is defined as
$$h_{s,r}(n) = h_s(nR + r)$$
where $h_s(n)$ represents the serial FIR filter coefficients to be implemented in parallel.
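The equivalence between a serial FIR filter and its $R$-lane parallel form can be checked numerically. The sketch below assumes the usual polyphase arrangement, in which output lane $p$ sums sub-filter $r$ applied to input lane $(p - r) \bmod R$, with a one-block delay whenever the lane index wraps; the function name and test signal are hypothetical.

```python
import numpy as np

def parallel_fir(x, h, R):
    """R-lane polyphase FIR; each loop over p mimics one output lane per clock."""
    assert len(x) % R == 0 and len(h) >= R
    sub = [h[r::R] for r in range(R)]         # sub-filter r holds h(nR + r)
    lanes = [x[q::R] for q in range(R)]       # input lane q holds x(mR + q)
    n_blk = len(x) // R
    y = np.zeros(len(x))
    for p in range(R):                        # output lane p produces y(mR + p)
        acc = np.zeros(n_blk)
        for r in range(R):
            q = (p - r) % R
            branch = np.convolve(lanes[q], sub[r])[:n_blk]
            if r > p:                         # wrapped lane: sample is from the previous block
                branch = np.concatenate(([0.0], branch[:-1]))
            acc += branch
        y[p::R] = acc
    return y

rng = np.random.default_rng(4)
x = rng.standard_normal(64)
h = rng.standard_normal(12)
y_parallel = parallel_fir(x, h, R=4)
y_serial = np.convolve(x, h)[:len(x)]
print("max difference:", np.max(np.abs(y_parallel - y_serial)))
```

Each lane runs at $1/R$ of the sample rate, which is what allows a 500 MHz fabric to keep pace with a 2 Gsps stream.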

C. HARDWARE IMPLEMENTATION
A real-time hardware implementation poses a number of challenges, most notably the very high data rate relative to the FPGA clock rate. Specifically, the data rate of 2 Gsps greatly exceeds the theoretical maximum FPGA clock rate, which in our case is 775 MHz [25]. Even if the data rate requirement were not as demanding, it is generally very difficult to achieve the theoretical maximum FPGA clock rate when the FPGA is highly utilized due to placement constraints, routing congestion, fan-out limitations, and other factors. However, the algorithm benefits from parallelization and downsampling via the subband process to meet the high throughput requirements. Specifically, by instantiating the adaptive filter only once as shown in Fig. 3, this optimized subband architecture will require $MR$ fewer multipliers compared with the naive approach, significantly saving FPGA area.
There are many feasible combinations of $R$ and clock rate to achieve the desired data rate. For example, selecting $R = 5$ with a clock rate of 400 MHz or selecting $R = 8$ with a clock rate of 250 MHz will both achieve a data rate of 2 Gsps. However, specific compromises and tradeoffs only relevant to a hardware implementation must be considered. For example, meeting FPGA timing constraints becomes more challenging as the clock rate increases. Reducing the clock rate to more easily meet timing constraints while still maintaining the same overall data rate may be accomplished by increasing $R$ to process more data samples per clock cycle. However, increasing $R$ will necessarily increase the FPGA area consumed. As FPGA area consumption increases, meeting timing constraints also becomes more challenging. We also recognize that a hardware implementation imposes delay in the signal processing loop, which has a deleterious effect. However, this may be mitigated by minimizing $L$ and $M$ as much as practically possible. Therefore, a successful hardware implementation must simultaneously balance three goals: reducing both the clock rate and $R$ to meet timing constraints, reducing $L$ and $M$ to meet convergence and stability requirements, and keeping $L$ and $M$ large enough to meet system performance requirements.
This balance was achieved with a clock rate of 500 MHz and R = 4 samples per clock cycle, resulting in a data rate of 2 Gsps. Additionally, if the number of subbands N is chosen such that N = R, then the downsampling step as part of the subbanding process simply discards R − 1 of the R samples being processed simultaneously. As a result, the entire FPGA design will run at a single clock rate, vastly simplifying the design effort by completely avoiding the implementation of multiple clock domains and clock domain crossing logic.
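The tradeoff between $R$ and clock rate reduces to simple arithmetic, sketched below. The 500 MHz threshold is the approximate practical limit quoted in Section II-B; the candidate values of $R$ and all variable names are illustrative assumptions.

```python
TARGET_RATE_MSPS = 2000        # 2 Gsps expressed in Msps
PRACTICAL_CLOCK_MHZ = 500      # approximate practical FPGA clock limit (Section II-B)

options = {}
for R in (1, 2, 4, 5, 8, 16):
    clock_mhz = TARGET_RATE_MSPS / R           # clock needed to sustain the data rate
    options[R] = (clock_mhz, clock_mhz <= PRACTICAL_CLOCK_MHZ)

for R, (clock_mhz, ok) in options.items():
    print(f"R = {R:2d}: clock = {clock_mhz:6.1f} MHz, within practical limit: {ok}")

# With N = R, decimation by N keeps one sample per clock cycle,
# so the subband update rate equals the FPGA clock rate.
N = 4
update_rate_mhz = TARGET_RATE_MSPS / N
print(f"subband update rate for N = {N}: {update_rate_mhz:.0f} MHz")
```

The smallest $R$ that stays within the practical clock limit is 4, which is consistent with the selected operating point and keeps the area cost of parallelization as low as possible.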
The aforementioned issues are only some of the factors to be considered where theoretical performance must be sacrificed in order to realize a practical implementation. Another factor that must be considered is the selection of a fixed-point implementation. The dynamic range of every data path within the FPGA must be considered to appropriately choose the number of bits and decimal point placement for each path, which will necessarily introduce quantization error and therefore degrade performance.
To help overcome the numerous aforementioned implementation challenges, the proposed algorithms were implemented using MathWorks HDL Coder to target the Xilinx Zynq UltraScale+ XCZU28DR. This commercial off-the-shelf software tool enables rapid prototyping of digital signal processing algorithms in a high-level language and is able to automatically generate the equivalent low-level hardware description language (HDL) code, significantly reducing development time. The resulting HDL code is then used by Xilinx tools to create an FPGA bitstream file. As a result, the entire design may be implemented in real-time hardware without manually writing any HDL code. Table 1 lists the FPGA resources consumed by the successful implementation. Despite the high data rate of 2 Gsps combined with the computational requirements of subbanding, the design consumes less than 70% of the resources available on a relatively small FPGA. To the best of our knowledge, there are no other real-time digital design implementations that realize any variant of an LMS adaptive filter at 2 Gsps to compare this table against.

III. EXPERIMENTAL RESULTS
To test the efficacy of the full-duplex communications architecture developed with consideration of a real-time FPGA implementation in a representative 5G environment, results were gathered in both simulation and a laboratory demonstration. In both cases, data was obtained using an over-the-air test in a representative 5G environment. The test setup consisted of an S-band radio operating in 1 GHz of instantaneous bandwidth placed in an anechoic chamber along with our real-time full-duplex apparatus. A reflector mounted on a rotating pedestal was placed 3 m away from the radio. The multipath delay spread due to this reflector, as well as the chamber's walls, was approximately 0.12 µs and thus required a cancellation filter with $M = 256$ coefficients at 2 Gsps.
Choosing $M$ such that $M \geq P$ is a necessary (but not sufficient) condition for the adaptive filter to converge correctly. Additionally, poor performance was empirically observed unless the condition $L \geq 8N$ was satisfied. As a result, the following adaptive filter architecture parameters have been selected: $N = 4$, $L = 32$, $M = 256$, $R = 4$, and $\mu = 0.01$.
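The parameter choices can be sanity-checked with a few lines of arithmetic. Rounding the filter length up to the next power of two is an assumption made here for illustration; the text states only that the measured delay spread required $M = 256$.

```python
import math

fs_hz = 2e9                    # sample rate
delay_spread_s = 0.12e-6       # measured multipath delay spread

# Number of samples spanned by the delay spread.
span_samples = round(delay_spread_s * fs_hz)

# Assumption: round the filter length up to the next power of two.
M = 2 ** math.ceil(math.log2(span_samples))

N, L = 4, 32
meets_length_rule = L >= 8 * N   # empirical condition on the prototype length

print(f"delay spread spans {span_samples} samples, M = {M}")
print(f"L >= 8N satisfied: {meets_length_rule}")
```

With these values the delay spread covers 240 samples, so 256 coefficients span it with a small margin, and $L = 32$ sits exactly at the empirical lower limit.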

A. METHODOLOGY COMPARISON
Simulations of the proposed SAF LMS architecture were conducted using the measured over-the-air data. The resulting mean squared error (MSE) is shown in Fig. 5. This result is compared against the MSE obtained using the normalized block LMS (NBLMS) algorithm with the block size equal to $N$. The NBLMS algorithm is defined in [26], with the sample index written as $n = kN + i$, where $N$ denotes the block size, $k$ the block index, $n$ the sample index, and $i$ the sample index within a block. We compared our technique with the NBLMS algorithm because it updates the adaptive filter coefficients $\hat{\mathbf{w}}$ once every $N$ samples, equivalent to the proposed SAF technique. In contrast, canonical LMS in Eq. (5) updates $\hat{\mathbf{w}}$ on every sample, which cannot be realized in real-time for 2 Gsps and is therefore an inappropriate comparison. By only comparing the proposed approach with another LMS approach that also could feasibly be implemented in real-time at 2 Gsps, we ensured a fair comparison by simultaneously considering the performance of the methodology and practical implementation limitations.
Also shown in Fig. 5 is the minimum achievable MSE. The minimum MSE, $J_{min}$, is achieved when using the optimal Wiener filter, $\mathbf{w}_{opt}$, and these are computed as [26]
$$J_{min} = E[d^2(n)] - \mathbf{P}^T \mathbf{w}_{opt}, \qquad \mathbf{w}_{opt} = \mathbf{R}^{-1}\mathbf{P}$$
where $\mathbf{R} = E[\mathbf{x}(n)\mathbf{x}^T(n)]$ denotes the autocorrelation matrix of $x$, $\mathbf{P} = E[\mathbf{x}(n)d(n)]$ denotes the cross-correlation vector between $x$ and $d$ of length $M$, and $E[\cdot]$ denotes the expectation operator. Two metrics that are often used to evaluate LMS algorithm performance are steady-state MSE and the time to converge to steady-state. While Fig. 5 shows that both approaches converge in steady-state nearly to $J_{min}$, the time required to do so is drastically different. Quantitative results including time to converge within 95% of $J_{min}$ and the final steady-state MSE are shown in Table 2. As shown, the proposed SAF approach isolated an external signal of interest within 0.1 dB of the optimal steady-state MSE and significantly improved convergence time compared to the NBLMS approach.
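A sample-based estimate of the Wiener solution and the minimum MSE can be computed as sketched below. The data are synthetic: a hypothetical 8-tap system with additive white noise standing in for the SOI and receiver noise, so the expected $J_{min}$ is simply the noise variance. Expectations are replaced by time averages, which is an assumption of this sketch.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(3)
M = 8
w_true = rng.standard_normal(M)                   # hypothetical system
x = rng.standard_normal(20000)
noise = 0.1 * rng.standard_normal(len(x))         # stands in for SOI plus receiver noise
d = np.convolve(x, w_true)[:len(x)] + noise

# Row i holds [x(n), x(n-1), ..., x(n-M+1)] for n = i + M - 1.
X = sliding_window_view(x, M)[:, ::-1]
d_al = d[M - 1:]                                  # desired samples aligned with the rows of X

R = X.T @ X / len(X)                              # sample autocorrelation matrix
P = X.T @ d_al / len(X)                           # sample cross-correlation vector
w_opt = np.linalg.solve(R, P)                     # Wiener solution without forming R^{-1}
J_min = np.mean(d_al ** 2) - P @ w_opt            # minimum MSE

print(f"J_min = {J_min:.4f} (noise variance is 0.01)")
```

Solving the linear system directly is generally better conditioned than inverting the autocorrelation matrix explicitly, which matters when $M$ grows to several hundred coefficients.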

B. LABORATORY DEMONSTRATION
In addition to the simulation results, a real-time hardware demonstration was conducted. The system consisted of a Xilinx Zynq UltraScale+ RFSoC ZCU111 Evaluation Kit, which is pictured in Fig. 6 and includes the Zynq UltraScale+ XCZU28DR FPGA discussed in Section II-C. The Zynq FPGA hosted the SAF self-interference cancellation algorithms that were optimized for a real-time hardware implementation and described in Section II. The RFSoC includes two 12-bit analog-to-digital converters (ADCs), which were both clocked to achieve an effective 2 Gsps sampling rate. The samples from each of the two ADCs were demultiplexed into four parallel data streams (as described in Section II-C, where $R = 4$), and each data stream was simultaneously presented to the FPGA at 500 MHz. With an FPGA clock rate of 500 MHz and four samples processed per clock cycle, a data rate of 2 Gsps is realized.
A custom two-channel receiver was developed as an RF front-end and placed before the ADCs of the RFSoC, and is pictured in Fig. 7. The receiver's specifications are given in Table 3. As described in [14], one of the two channels of the receiver was coupled to the output port of the transmitter and formed the reference (REF) channel as illustrated in Fig. 1, while the other channel of the receiver was directly connected to the receive antenna and formed the over-the-air (OA) channel. The RF receiver used two stages of downconversion on each channel to generate signals in the second Nyquist zone in the intermediate frequency (IF) bandwidth from 1-2 GHz that fed both of the ADCs on the RFSoC. Approximately −20 dBm of transmit power was coupled into the OA channel of the receiver from both direct-path leakage and backscatter.
With a 10 dB noise figure, the receiver's integrated noise floor across 1 GHz of instantaneous bandwidth was approximately −74 dBm. Given that the receiver was noise-limited rather than distortion-limited (cf. Table 3), this resulted in the receiver having a dynamic range of approximately −20 dBm − (−74 dBm), or 54 dB. Because the input signal $x(n)$ was derived from the reference (REF) channel and the desired signal $d(n)$ was derived from the over-the-air (OA) channel, each of which included receiver noise of approximately equal variance, the upper bound on cancellation performance was 3 dB lower than the dynamic range of the receiver, or 51 dB, which is directly related to the Cramér-Rao lower bound (CRLB) on adaptive filter coefficient ($\hat{\mathbf{w}}$) estimation accuracy [27].
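The link-budget figures above follow from standard arithmetic, sketched here for convenience. The thermal noise density of −174 dBm/Hz (at roughly 290 K) is a standard value that is assumed rather than stated in the text, and the 3 dB penalty is modeled as the sum of two equal-variance noise contributions.

```python
import math

THERMAL_NOISE_DBM_HZ = -174.0   # kT at about 290 K (standard value, assumed here)
bandwidth_hz = 1e9              # instantaneous bandwidth
noise_figure_db = 10.0          # receiver noise figure
coupled_power_dbm = -20.0       # transmit power coupled into the OA channel

# Integrated noise floor over the full bandwidth.
noise_floor_dbm = THERMAL_NOISE_DBM_HZ + 10 * math.log10(bandwidth_hz) + noise_figure_db

# Dynamic range of a noise-limited receiver.
dynamic_range_db = coupled_power_dbm - noise_floor_dbm

# Two channels with equal noise variance double the noise power in the residual.
cancellation_bound_db = dynamic_range_db - 10 * math.log10(2)

print(f"noise floor: {noise_floor_dbm:.1f} dBm")
print(f"dynamic range: {dynamic_range_db:.1f} dB")
print(f"cancellation upper bound: {cancellation_bound_db:.1f} dB")
```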
To emulate a 5G environment, we modeled a full-duplex base station with an over-the-air downlink-uplink connection and a mobile transceiver in the 5G n78 band (3.3-3.8 GHz S-band) [6]. We placed our base station antenna into the 10 m × 5 m anechoic chamber pictured in Fig. 8. At the far end of the chamber, we placed the mobile communication transceiver that was connected to one of the horn antennas (as pictured), as well as a corner reflector mounted on a rotating pedestal. The pedestal spun at a rate of approximately 30°/s and emulated the dynamic backscatter present in a multipath environment. At the near end of the chamber, we placed the base station antenna which was connected to the high-power amplifier (HPA) pictured in Fig. 9. This signal was coupled into the RF transceiver which in turn was connected to the ADCs of the RFSoC and digital baseband processor.
We emulated full-duplex base station transmission across all channels of the 5G n78 band with a single active mobile transmitter present. The output of the FPGA digital canceler was connected to one of the digital-to-analog converter (DAC) channels on the RFSoC and in turn was connected to one port of a spectrum analyzer after being converted to IF. Another DAC channel on the RFSoC served as a pass-through port prior to FPGA digital cancellation and was connected to a second port of the spectrum analyzer after IF conversion. The measured signal at the output of the RF receiver with and without digital cancellation as captured by the spectrum analyzer is pictured in Fig. 10. Without self-interference cancellation, all of the channels in the 5G n78 band are completely filled with self-interference due to the base station transmission coupling back into the RF front-end receiver through both direct-path leakage and near-in backscatter. However, after digital cancellation, the mobile radio's transmission is recovered without error, with digital cancellation suppressing self-interference by five orders of magnitude (in this case, 50.5 dB) to come within 0.5 dB of the CRLB.

IV. CONCLUSION
Wideband operation, a rapidly changing RF channel, and a relatively long delay spread all represent significant challenges for the real-time implementation of a baseband full-duplex communications system. In this paper, we presented a novel joint signal processing and FPGA architecture that is able to achieve near-optimal cancellation performance when operating in 1 GHz of instantaneous bandwidth in a representative 5G environment. Most other LMS approaches in the open literature consider either theoretical performance or real-time hardware implementation constraints in isolation. Instead, by jointly considering both in an interdisciplinary fashion, we developed and realized an LMS adaptive filter with an optimized architecture integrated with subbanding and parallelization that processes data at 2 Gsps in real-time.