An Efficient Filter-Bank Multi-Carrier System for High-Speed Wireline Applications

This paper proposes an efficient multi-carrier system that combines filter-bank multi-carrier signalling, decision-directed channel estimation, and frequency-domain timing recovery to eliminate the overhead associated with cyclic prefix, large side-lobes, and pilot carriers. Furthermore, a technique is proposed to halve the required number of FFTs (IFFTs), reducing their complexity by 29% for a 32-point resolution; a method is proposed to correct tilt and stretch distortion; and a gain controller with adaptive loop coefficients is adopted to achieve the same stability but 65% higher tracking bandwidth regardless of the FFT size. The concept is validated at the system level, where impairments are applied, enabling an in-depth comparison to conventional discrete multi-tone signalling. Assuming a 32-point FFT, a 35dB channel, and an overlap factor of 3, results show a 101% improvement in capacity, a 100% improvement in power efficiency, and a 101% improvement in area efficiency, all while maintaining comparable latency. This work enables very low-resolution multi-carrier schemes, which were previously impractical due to the significant overhead.


FIGURE 1. Generalized top-level block diagram of a multi-carrier system.
solution to conventional DMT signalling in terms of capacity, power efficiency, area efficiency, and latency, analyzes the reduction in complexity from the proposed simplified FBMC coding, and highlights the improvement in noise tracking bandwidth from the proposed adaptive gain controller. Section V concludes the paper.

II. BACKGROUND
This section provides a brief background of multi-carrier signalling, highlights the shortcomings of DMT and advantages of FBMC, discusses its implementation, and analyzes the challenges of channel estimation and timing recovery. For a comprehensive background of FBMC, we refer the readers to [7] and [8].
Throughout this manuscript, we reserve i to denote the packet (or frame) index, k to denote the frequency bin (or symbol) index, and n to denote the time (or sample) index.

A. MULTI-CARRIER SIGNALLING
Multi-carrier signalling is the concept of splitting a channel into multiple frequency bins and sending a modulated tone in each to convey information.
As shown in Fig. 1, the transmitter maps a sequence of bits D IN to a vector of N − 1 complex-valued symbols, each selected from pre-defined constellations of size 2^B[k] representing B[k] bits. These symbols are scaled by P[k] to produce symbols X i [k]. Then the vector is encoded, serialized, and sent through the DAC to form the continuous-time transmit signal x(t). After passing through the channel, the received signal y(t) is sampled, deserialized, and decoded to recover N − 1 symbols Y i [k]. Each is scaled by C[k] to produce the received symbol estimates. Finally, using constellations B[k], these estimates are converted back to bits to form the receiver binary output sequence D OUT .
The number of encoded bits (bit-loading, B[k]) and the separation between symbols (power-loading, P[k]) are optimized using the Water Pouring Algorithm [14]. B[k] is set to transmit as many bits as possible using the least amount of signal power, whereas P[k] is adjusted to ensure an equal error rate across all bins. This maximizes spectral efficiency [2].
Practical channels are band-limited, producing Inter-Symbol-Interference (ISI). This is a significant burden for 4-PAM. However, assuming multi-carrier frequency bins are sufficiently narrow, the frequency response remains approximately constant over each bin's bandwidth. As such, symbols only experience scaling and rotation errors, which are easily corrected using single-tap equalization. This promises a more precise and efficient method to overcome the impairment, leading to higher data rates compared to 4-PAM [4], [5], [6], [8].
High-speed wireline applications must adhere to strict latency and complexity requirements. In a multi-carrier system, both parameters scale proportionally with the coding resolution and thus the number of bins. As a result, these systems are bin-limited [15]. Moreover, whereas in wireless the number of bins equals the coding resolution, in wireline this number is half [16]. The reason is that wireless systems use bandpass communication, which enables the transmission of complex encoder outputs. However, wireline systems use baseband communication, which restricts the system to real-valued outputs. Given a 2N-point encoder, the first N bins may send unique information, but the remaining N bins must be reserved to send the complex conjugate of the data [10], [16]. Finally, the DC bin (bin 0) is generally not used as it cannot support complex modulation, and AC coupling would likely block it. As a result, a typical implementation may assume a resolution of 2N = 32, enabling only N − 1 = 15 usable bins. This resolution is proposed in [3] and achieves a compromise between system performance, latency, and complexity. From simulation, such a system has a latency of around 10ns, significantly less than the 100ns required for Forward-Error-Correction (FEC) [17], and exhibits a complexity similar to 4-PAM [3]. This limitation worsens the overhead associated with conventional multi-carrier techniques such as employing CP and reserving bins for channel estimation and timing recovery.
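The bin count quoted above can be checked numerically. The following is a minimal sketch, not the paper's implementation: the helper names `idft` and `hermitian_frame` and the 4-QAM test symbols are assumptions made for illustration, and a naive inverse DFT stands in for the IFFT. It fills bins 1 to N − 1 of a 2N = 32 point frame, mirrors the complex conjugates into the upper bins, and confirms that the time-domain output is real-valued.

```python
import cmath

def idft(X):
    """Naive inverse DFT; adequate for a 32-point illustration."""
    M = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / M) for k in range(M)) / M
            for n in range(M)]

def hermitian_frame(symbols):
    """Place N-1 symbols in bins 1..N-1 and their conjugates, mirrored, in bins N+1..2N-1."""
    N = len(symbols) + 1
    X = [0j] * (2 * N)              # bin 0 (DC) and bin N (Nyquist) are left empty
    for k, s in enumerate(symbols, start=1):
        X[k] = s
        X[2 * N - k] = s.conjugate()
    return X

# 15 arbitrary 4-QAM symbols for a 2N = 32 encoder (illustrative data only)
symbols = [complex(1 if k % 2 else -1, 1 if k % 3 else -1) for k in range(1, 16)]
x = idft(hermitian_frame(symbols))

# Largest imaginary residue of the time-domain samples; expected to be at rounding level
max_imag = max(abs(v.imag) for v in x)
```

Only 15 of the 32 frequency bins carry independent data here, which is why overhead that costs even a few bins is significant at this resolution.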

B. DISCRETE MULTI-TONE (DMT)
The most well-known coding scheme is Discrete Multi-Tone (DMT), also referred to in wireless applications as Orthogonal Frequency Division Multiplexing (OFDM). Here, the encoder (decoder) implements the Inverse Fast-Fourier Transform IFFT (FFT) following (1) and (2).
As shown in Fig. 2, DMT implements rectangular windowing [7], [8]. Consequently, ISI causes energy from one frame to leak into the next, producing interference among symbols and degrading performance. To mitigate this effect, a guard interval called a Cyclic Prefix (CP) is inserted between subsequent frames [8], [18]. The CP is created by prepending a copy of the end portion of each frame to its front, resulting in an overall length of 2N + N CP Unit Intervals (UI), where N CP is the length of the CP. As demonstrated in [10], when the CP is long enough, the overlap among frames is removed, eliminating interference among symbols and enabling single-tap equalization. This encoding process is depicted in Fig. 3.
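To illustrate why a sufficiently long CP enables single-tap equalization, the sketch below prepends a CP, passes the frame through a short FIR channel, and checks that the samples left after discarding the CP equal a circular convolution of the frame with the channel. The function names, the three-tap channel, and the test frame are assumptions chosen for illustration only.

```python
def add_cyclic_prefix(frame, n_cp):
    """Prepend a copy of the last n_cp samples of the frame."""
    return list(frame[len(frame) - n_cp:]) + list(frame)

def channel(tx, h):
    """Causal FIR channel: linear convolution truncated to the transmitted length."""
    return [sum(h[j] * tx[m - j] for j in range(len(h)) if m - j >= 0)
            for m in range(len(tx))]

def circular_conv(frame, h):
    """Circular convolution, which the DFT turns into one complex gain per bin."""
    M = len(frame)
    return [sum(h[j] * frame[(n - j) % M] for j in range(len(h))) for n in range(M)]

frame = [float((7 * n) % 11 - 5) for n in range(32)]   # arbitrary 2N = 32 sample frame
h = [1.0, 0.5, 0.25]                                   # hypothetical 3-tap channel
n_cp = 2                                               # CP covers the channel memory

tx = add_cyclic_prefix(frame, n_cp)                    # length 2N + N_CP = 34
rx = channel(tx, h)[n_cp:]                             # receiver discards the CP
max_err = max(abs(a - b) for a, b in zip(rx, circular_conv(frame, h)))
```

The cost is visible in the frame length: 34 UI are sent to convey 32 UI of payload, and the ratio worsens as frames get shorter.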
Although DMT is the most straightforward implementation of multi-carrier signalling, it suffers from two shortcomings. First, the CP adds an overhead that is most pronounced with short frame lengths in bin-limited systems [8], [19]. And second, as shown in Fig. 2, rectangular windowing produces large side-lobes in the frequency domain, with the first being attenuated by only 13dB [20], [21]. In an ideal system, frequency bins are orthogonal to one another, and this is not a concern. However, in a practical system, impairments such as time and frequency offset deteriorate orthogonality among bins and produce interference among symbols with a magnitude proportional to the side-lobe height [7], [8], [20]. As such, minimizing side-lobes is critical for achieving capacity.
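The 13dB figure can be reproduced from the magnitude response of a rectangular window. The sketch below is illustrative: the function name and the search grid are assumptions, and the window length of 32 samples corresponds to the 2N = 32 example used in this paper. It searches between the first and second spectral nulls for the peak of the first side-lobe.

```python
import math

def rect_window_response(u, M):
    """Normalized magnitude response of a length-M rectangular window,
    evaluated u bin spacings away from the bin centre."""
    return abs(math.sin(math.pi * u) / (M * math.sin(math.pi * u / M)))

M = 32
# Nulls fall at integer u, so the first side-lobe lies in the open interval (1, 2)
grid = [1 + i / 1000 for i in range(1, 1000)]
peak = max(rect_window_response(u, M) for u in grid)
first_sidelobe_db = 20 * math.log10(peak)   # roughly -13 dB relative to the main lobe
```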

C. FILTER-BANK MULTI-CARRIER (FBMC)
An alternative coding scheme that overcomes these pitfalls is Filter-Bank Multi-Carrier (FBMC) [22], [23]. Also referred to as offset-QAM or Staggered Multi-Tone (SMT) [8], it is considered one of the critical innovations enabling wireless 5G communication [13]; we propose to adopt this scheme for high-speed wireline communication.
As depicted in Fig. 4, whereas DMT employs a rectangular window, FBMC applies a shaped window p(t). This filter smooths the transition between frames, alleviating the need for CP while also improving side-lobe attenuation [8], [24]. However, unlike a rectangular window with a length of 2N, this filter has a length of 2NO where O is an integer larger than 1 denoted as the overlap factor. To maintain the same throughput, these lengthened frames overlap each other. As such, whereas DMT employs a CP to avoid frame overlap, FBMC takes advantage of it.
A popular FBMC filter is described by (3), where a r follows Table 1 [24]. Note that with O = 1, we obtain a conventional rectangular filter.
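Equation (3) and Table 1 are not reproduced in this text, so the following sketch rests on an assumption: it uses the frequency-sampling prototype filter and coefficient values commonly quoted in the FBMC literature, which may differ in form or normalization from those used in this paper. The names `ASSUMED_COEFFS` and `prototype_filter` are hypothetical. The sketch shows how a window of length 2NO is built and that O = 1 reduces to a rectangular window.

```python
import math

# Coefficients a_r per overlap factor O. ASSUMED values from the general FBMC
# literature, not copied from this paper's Table 1.
ASSUMED_COEFFS = {
    1: [],
    2: [math.sqrt(2) / 2],
    3: [0.911438, 0.411438],
    4: [0.971960, math.sqrt(2) / 2, 0.235147],
}

def prototype_filter(two_n, overlap):
    """Sampled window p[n] = 1 + 2 * sum_r (-1)^r * a_r * cos(2*pi*r*n / (2N*O))."""
    a = ASSUMED_COEFFS[overlap]
    length = two_n * overlap
    return [1 + 2 * sum((-1) ** r * a[r - 1] * math.cos(2 * math.pi * r * n / length)
                        for r in range(1, overlap))
            for n in range(length)]

p1 = prototype_filter(32, 1)   # O = 1: every tap equals 1 (rectangular window)
p4 = prototype_filter(32, 4)   # O = 4: 128 taps that taper towards zero at the edges
```

Under these assumed coefficients the O = 4 window starts near zero and peaks at its centre, which is the smooth frame transition described above.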
In DMT, symbols are independent of one another. As shown in Fig. 5a, by sending a symbol at bin k and frame i, the receiver output signal power is contained within a single bin and frame. However, in FBMC, frame shaping produces a deliberate symbol interference pattern [7], [8]. This pattern, depicted in Fig. 5b, is two-dimensional, spanning multiple bins and frames. As a result, whereas in DMT, one complex-valued symbol is transmitted per bin in every frame, in FBMC, the real and imaginary components of symbols are transmitted separately where the latter is delayed by half a frame period. This concept is depicted in Fig. 5c,d for DMT and FBMC, respectively. By doing so, we ensure symbols are independent of one another. At the receiver, both symbol components are combined, effectively achieving one complex QAM symbol per bin in every frame [8]. As such, in an ideal system, both DMT and FBMC achieve the same capacity. However, with a practical system having impairments, FBMC outperforms DMT as it does not suffer from the overhead associated with CP and reduced side-lobe attenuation [7], [8], [20], [24].
The encoding and decoding of FBMC signals can be expressed analytically. Equations (4) and (5) describe the synthesis and analysis functions, respectively. The transmission pattern and realignment process are expressed in (6) and (7).

D. FBMC IMPLEMENTATION
There are three methods to implement FBMC coding. These include adopting filter banks, Frequency Spreaders (FS), or Poly-Phase Networks (PPN) [7], [13], [25]. We will focus on the last as it is the most efficient in terms of hardware. This method is depicted in Fig. 6.
Here, a vector of N − 1 input symbols X i [k] is decomposed into in-phase and quadrature components. Then quadrature phase rotation is applied, ensuring a π/2 phase difference between neighbouring bins. Next, Hermitian symmetry is added as the symbols enter a pair of IFFTs, each generating 2N real-valued output samples. These are sent through a pair of PPNs, which concatenate frames O times, apply shaping, and overlap them with one another. For its implementation, see Fig. 9. Finally, the quadrature waveform is staggered by half a frame period, combined with the in-phase waveform, serialized, and transmitted through the DAC.

E. CHANNEL ESTIMATION AND TIMING RECOVERY
Channel estimation is the process of determining the required C[k] to correct linear distortion. Timing recovery deals with finding the ADC sampling frequency and phase that correctly positions the FFT window (coarse synchronization) and minimizes jitter (fine tracking) [19], [26]. Although the channel is assumed short-term stationary, impairments such as jitter and noise cause the overall link response to change continuously. Furthermore, in FBMC, frame shaping reduces the tolerable sampling phase error compared to DMT. For these reasons, the ability to accurately track changes with sub-UI accuracy is critical to improve link performance.
A popular approach is to send channel estimation and timing recovery information by transmitting known symbols called pilot carriers. These can either be sent as a periodic training sequence once every few hundred frames, or by reserving a fraction of the bins to send them continuously [19], [27], [28], [29], [30]. However, the former approach is blind most of the time and results in poor tracking performance [31], whereas the latter gathers information continuously but introduces overhead, particularly for bin-limited systems [30].
As an example, if we assume a 32-point DMT system and restrict pilot carriers to once every 256 frames, the overhead is negligible; however, from simulation, the tracking bandwidth is reduced by a factor of 18 and no longer meets the CEI-56G-LR-PAM4 requirement [32]. On the other hand, by reserving four of the 15 usable bins to send them continuously, the standard is met, but a 27% overhead is introduced [19]. When including the overhead from a three-UI CP, the total overhead worsens to 33%. Furthermore, this does not include the penalty resulting from large side-lobes. Evidently, an alternative is needed to overcome this trade-off.
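The two overhead figures follow from simple bookkeeping, sketched below. The function name is hypothetical, and the calculation is a first-order estimate that assumes every usable bin carries the same number of bits, which bit-loading does not guarantee in practice.

```python
def dmt_overhead(fft_size, pilot_bins, cp_len):
    """First-order fraction of capacity lost to pilot bins and cyclic prefix."""
    usable_bins = fft_size // 2 - 1                    # 15 bins for a 32-point FFT
    bin_efficiency = (usable_bins - pilot_bins) / usable_bins
    time_efficiency = fft_size / (fft_size + cp_len)   # payload UI over total UI
    return 1.0 - bin_efficiency * time_efficiency

pilots_only = dmt_overhead(32, pilot_bins=4, cp_len=0)     # 1 - 11/15, about 27%
pilots_and_cp = dmt_overhead(32, pilot_bins=4, cp_len=3)   # 1 - (11/15)(32/35), about 33%
```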

III. PROPOSED FBMC SYSTEM
This section proposes using FBMC signalling in wireline applications to eliminate the CP and lower side-lobes, and proposes an accurate FBMC channel estimation and timing recovery system to eliminate pilot carriers.
The proposed receiver is depicted in Fig. 7a. Initially, the received signal y(t) is sampled by the ADC and fed to the FBMC decoder to produce symbols Y i [k] where the in-phase and quadrature components remain separate. Next, the two sets of N − 1 symbols are sent to a set of N − 1 adaptive equalizers, one per bin, to correct linear distortion, producing final decisions, which are later converted to bits. Finally, rotation equalization ∠C i [k] is sent to the timing recovery to estimate and correct sampling phase error. We now provide more details on these three components.

A. SIMPLIFIED FBMC CODING
This work adopts PPN coding since it is the most efficient hardware alternative among the three schemes mentioned earlier. However, we propose a modification that differentiates it from previous works.
As shown in Fig. 6, conventional FBMC PPN coding requires a pair of IFFTs (FFTs) to enable separate computation of the in-phase and quadrature waveforms [7], [13]. Although the symbols are complex-valued, the transmitted waveforms are real-valued. As a result, the imaginary component from each IFFT output (FFT input) is unused. This inefficiency can be exploited to process both signals simultaneously yet independently, using a single IFFT (FFT).
An IFFT generates a real-valued output when the input symbols are mirrored around the Nyquist bin and complex conjugation is applied (Hermitian symmetry). To generate an imaginary-valued output, the input symbols are multiplied by j, then mirrored around the Nyquist bin where negative complex conjugation is applied. As such, given two streams of input symbols X I and X Q , by sending X I + jX Q in the lower N − 1 bins and the mirrored complex conjugate of X I − jX Q in the upper N − 1 bins, two independent signals are produced simultaneously using a single IFFT; the first is contained within the real component of the output, and the second within the imaginary. Evidently, both outputs are orthogonal and can be separated for further processing. At the receiver, the reverse operation is applied. Here, the FFT receives two orthogonal signals. The first set of symbols is recovered by mirroring the upper N − 1 bins around Nyquist, applying complex conjugation, adding to the lower N − 1 bins, and scaling by 1/2. The second set of symbols is recovered by mirroring the upper N − 1 bins around Nyquist, applying negative complex conjugation, adding to the lower N − 1 bins, and scaling by −j/2. This has been expressed analytically in (8). Section IV analyzes the IFFT (FFT) complexity, achieving a 29% reduction for a 32-point resolution.
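The single-IFFT encoding and its inverse can be verified with a short numerical sketch. This is an illustration under stated assumptions rather than the paper's implementation: equation (8) is not reproduced here, the function names and test symbols are hypothetical, and naive DFTs stand in for the IFFT and FFT. Note that the upper bins carry the mirrored complex conjugate of X I − jX Q, which is what the receiver-side recovery steps require.

```python
import cmath

def dft(x, sign):
    """Naive DFT kernel; sign=-1 gives the forward transform, sign=+1 the unscaled inverse."""
    M = len(x)
    return [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * k * n / M) for n in range(M))
            for k in range(M)]

def idft(X):
    return [v / len(X) for v in dft(X, +1)]

def hermitian(X):
    """Conventional Hermitian-symmetric frame that yields a real-valued IFFT output."""
    N = len(X) + 1
    Z = [0j] * (2 * N)
    for k in range(1, N):
        Z[k], Z[2 * N - k] = X[k - 1], X[k - 1].conjugate()
    return Z

def encode(XI, XQ):
    """Combine two symbol streams into one 2N-point IFFT input."""
    N = len(XI) + 1
    Z = [0j] * (2 * N)
    for k in range(1, N):
        Z[k] = XI[k - 1] + 1j * XQ[k - 1]                          # lower bins
        Z[2 * N - k] = (XI[k - 1] - 1j * XQ[k - 1]).conjugate()    # mirrored upper bins
    return Z

def decode(Z):
    """Separate the two symbol streams from one 2N-point FFT output."""
    N = len(Z) // 2
    YI, YQ = [], []
    for k in range(1, N):
        mirrored = Z[2 * N - k].conjugate()
        YI.append((Z[k] + mirrored) / 2)           # conj, add, scale by 1/2
        YQ.append((Z[k] - mirrored) * (-0.5j))     # negative conj, add, scale by -j/2
    return YI, YQ

XI = [complex(k, -k) for k in range(1, 16)]        # arbitrary test symbols, 2N = 32
XQ = [complex(-1, 2 * k) for k in range(1, 16)]

z = idft(encode(XI, XQ))                           # one IFFT, complex-valued output
a, b = idft(hermitian(XI)), idft(hermitian(XQ))    # reference: two separate IFFTs
tx_err = max(abs(z[n] - complex(a[n].real, b[n].real)) for n in range(32))

YI, YQ = decode(dft(z, -1))                        # one FFT at the receiver
rx_err = max(max(abs(u - v) for u, v in zip(YI, XI)),
             max(abs(u - v) for u, v in zip(YQ, XQ)))
```

The real part of `z` matches the waveform a dedicated IFFT would produce for X I, and the imaginary part matches the one for X Q, so the two PPNs can be fed from a single transform.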
Fig. 8 depicts the proposed FBMC encoder and decoder. Input symbols X i [k] are decomposed into in-phase and quadrature components. Then quadrature-phase rotation is applied. Next, instead of applying Hermitian symmetry, the proposed IFFT encoding is added before entering a single IFFT. The now complex-valued output is separated into real and imaginary streams and sent through a pair of PPNs. The block diagram of the transmit PPN is displayed in Fig. 9 assuming O = 4. Next, the quadrature path is staggered by half a frame period, combined with the in-phase path, serialized, and transmitted through the DAC.
At the receiver, samples are deserialized. A half frame period delay is applied to the in-phase path as both signals enter a pair of PPNs. The receiver PPN has the coefficients of each FIR filter set in the reversed order. The quadrature waveform is then combined with the in-phase waveform. After being sent through a single FFT, the proposed FFT decoding is applied. Finally, quadrature-phase rotation is removed, forming the decoder outputs Y I,i [k] and Y Q,i [k]. These complex-valued outputs contain deliberate interference and thus must remain separate until after equalization.
The proposed scheme does not introduce additional power, area, or latency. This is because the j, −j, 1/2, −j/2, conj, and flip operations can all be implemented trivially, and thus the proposed IFFT encoding and FFT decoding only require 2N − 2 complex additions each. However, these adders can be eliminated by incorporating the operations within the quadrature-phase rotation, which also requires 2N − 2 additions. As such, we remove an IFFT and an FFT from the link at no added cost.

B. ADAPTIVE EQUALIZER
Each equalizer, shown in Fig. 7b, takes concepts discussed in [19] and adapts them for FBMC signalling by applying four modifications, discussed shortly. The basic concept is as follows. Instead of using pilot carriers, the approach performs decision-directed channel estimation. It determines the channel response by directly observing the scaling and rotation error of data-filled constellations. This alleviates the need for pilot carriers and improves accuracy [19].
determined during the previous frame period. Then tilt and stretch correction, discussed shortly, is applied to produce the equalized symbol. The equalized symbol is then sliced to produce the final decision, i.e., rounded to its nearest constellation point following (9), where S is the set of constellation points [19], [26].
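Slicing can be sketched as a nearest-point search over the constellation. Equation (9) is not reproduced in this text, so the minimum-distance rule below is an assumption consistent with the description "rounded to its nearest constellation point"; the function names and the unit-spaced square QAM grid are likewise illustrative.

```python
def square_qam(bits):
    """Square QAM constellation with odd-integer levels; bits must be even."""
    m = 2 ** (bits // 2)
    levels = [2 * i - (m - 1) for i in range(m)]     # e.g. -3, -1, 1, 3 for 16-QAM
    return [complex(a, b) for a in levels for b in levels]

def slice_symbol(x, S):
    """Return the constellation point in S closest to the equalized symbol x."""
    return min(S, key=lambda s: abs(x - s))

S16 = square_qam(4)
d1 = slice_symbol(0.8 + 1.3j, S16)    # lands in the decision region of 1+1j
d2 = slice_symbol(2.2 - 3.9j, S16)    # lands in the decision region of 3-3j
```

A hardware slicer would round each axis independently rather than search all points; the search form is used here only because it mirrors the wording of the text.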
The feedback path separates the magnitude and phase of the equalized symbol and of the final decision. This is accomplished using two cartesian-to-polar (C2P) converters [33]. The two extracted magnitude values are sent to the gain controller and the two phase values to the rotation controller. Both implement proportional-integral control. Following (10) and (11), the two control loops correct the error by adjusting C i [k]. They eventually drive the gain and phase differences to 0, ensuring symbols land in the center of their decision boundaries. Finally, the resultant |C i+1 [k]| and ∠C i+1 [k] are converted back to cartesian form using a polar-to-cartesian (P2C) converter [34] to produce the next frame's equalization value C i+1 [k]. At startup, a training sequence synchronizes initial coefficients. Decisions are overwritten to match the transmit values, allowing the equalizers to converge.
The proposed equalizer implements four features that differentiate it from previous works. First, the in-phase and quadrature symbols are equalized separately. This is a requirement of FBMC; however, it does not double the complexity since each multiplier is only required to output either its real or imaginary component. Second, whereas in DMT, sampling phase error causes all constellations to rotate in the same direction following (12), where θ err is the phase error normalized to the sampling period, in FBMC, odd bins rotate in the opposite direction. Similarly, rotation equalization causes these bins to revolve in the opposite direction. As such, to ensure convergence, loop coefficients are made negative for every other bin. With this, when subject to sampling phase error, all equalizer coefficients C[k] rotate in the same direction. Third, a tilt and stretch correction function is added; this is discussed next. And fourth, the gain controller implements adaptive loop coefficients to achieve constant amplitude-independent stability; this is discussed shortly.
Timing recovery targets the average group delay of bins. However, with practical channels, certain bins will arrive earlier than others. As such, it is common for bins to experience sampling phase error, even with ideal timing. In DMT, this causes constellations to rotate, which is easily corrected using single-tap equalization. On the other hand, FBMC experiences additional forms of linear distortion. Fig. 10a,b depict the magnitude and phase response of a 4-QAM FBMC constellation in bin 1 as sampling phase error is swept across 2N = 32 samples. Whereas DMT would experience a flat magnitude response and a linearly descending phase response, FBMC does not. In fact, each symbol within the constellation experiences a different response. This behaviour results in constellation rotation, scaling, tilt, and stretch.
Tilt distorts constellations into a rhombus shape, whereas stretch distorts constellations into a rectangular shape. Fig. 10c depicts a 16-QAM constellation following single-tap equalization but without tilt and stretch compensation. Although magnitude and rotation distortion are removed, tilt and stretch are still present. This causes certain symbols to land close to decision boundaries, which degrades performance. The solution is to add cross-coupled equalization, as shown in Fig. 11. ψ[k] corrects tilt distortion, and λ[k] corrects stretch distortion. Both coefficients are set using lookup tables controlled by the current rotation equalization. Approximate arithmetic can be employed using programmable three-stage shift-and-add circuits to reduce complexity. Furthermore, it is sometimes possible to fix λ[k] = 1 if the relative group delay between bins is small. Fig. 10d shows the constellation following tilt and stretch correction. Both are removed, with all symbols landing near the center of their decision boundary.
Gain control is a non-linear system. With a linear controller, outer symbols experience a lower phase margin than inner ones. As a result, the loop must be over-damped to avoid stability concerns, but this reduces the noise tracking bandwidth. Alternatively, [19] implements a linear gain controller in the logarithmic domain, effectively linearizing the system. However, applying LOG and EXP conversions requires additional delays to meet timing requirements, lowering the bandwidth. Instead, this work adaptively updates loop coefficients to maintain stability without introducing additional delay.
Constant stability is achieved by ensuring a constant sensitivity function: ∂M err [k]/∂|C[k]| [35]. Such a result is possible by continuously dividing the loop coefficients K P and K I by the input amplitude |Y i [k]| as shown in (13) and (14). The operation is implemented using two lookup tables as depicted in Fig. 12. Assuming the size of constellations does not exceed 256-QAM, each lookup table only requires 32 entries; thus, the complexity is minimal. The noise tracking performance of the proposed system is analyzed in Section IV, showing the same stability with 65% higher bandwidth when compared to [19].
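The effect of the division can be seen in a simplified worked derivation. Equations (10), (13), and (14) are not reproduced in this text, so the following rests on assumptions: the magnitude error is taken as the decision magnitude minus the equalized magnitude, the input amplitude is held static, only the integral path is considered, and the hat notation for the decision is introduced here for illustration.

```latex
% Simplified sketch under the assumptions stated in the lead-in; not the paper's (13)-(14).
\begin{align}
M_{\mathrm{err},i}[k] &= \bigl|\hat{X}_i[k]\bigr| - \bigl|C_i[k]\bigr|\,\bigl|Y_i[k]\bigr| \\
\frac{\partial M_{\mathrm{err},i}[k]}{\partial \bigl|C_i[k]\bigr|} &= -\bigl|Y_i[k]\bigr| \\
\bigl|C_{i+1}[k]\bigr| = \bigl|C_i[k]\bigr| + K_I\,M_{\mathrm{err},i}[k]
  \;&\Rightarrow\;
  M_{\mathrm{err},i+1}[k] = \bigl(1 - K_I\,\bigl|Y_i[k]\bigr|\bigr)\,M_{\mathrm{err},i}[k] \\
K_I \to \frac{K_I}{\bigl|Y_i[k]\bigr|}
  \;&\Rightarrow\;
  M_{\mathrm{err},i+1}[k] = \bigl(1 - K_I\bigr)\,M_{\mathrm{err},i}[k]
\end{align}
```

With a fixed coefficient, the error decay rate depends on the symbol amplitude, so outer symbols see a higher loop gain than inner ones. After the division, the decay rate depends on K I alone, which is the amplitude-independent behaviour described above.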

C. TIMING RECOVERY
The timing recovery also adopts concepts discussed in [19] but applies one modification, discussed shortly. The basic concept is as follows. When the sampling phase deviates from its ideal position, constellations rotate according to (12) [19], [28]. As mentioned, a conventional approach uncovers this rotation by reserving bins to send known pilot carriers. However, this introduces overhead. Instead, our approach obtains timing information by monitoring the required rotation equalization of data-filled bins. This alternative alleviates the need for pilot carriers and improves accuracy [19]. As shown in Fig. 7c, the timing recovery block extracts N − 1 rotation equalization values ∠C i [k]. From these values, N − 1 offsets φ expect [k] are subtracted; the offsets are used to adjust the ideal sampling phase θ ofs following: φ expect [k] = 2πkθ ofs /2N. The error φ err,i [k] is sent through a linear regression block to estimate the sampling phase error θ err,i . The result is input into a proportional-integral controller, which controls a phase interpolator and adjusts the sampling phase of the ADC.
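The regression step can be sketched as a least-squares fit of a line through the origin. This is an assumption about the form of the regression block, which the text does not specify; it models the per-bin error as φ err [k] = 2πkθ err /2N, mirroring the expression for φ expect [k]. The function name and test values are hypothetical.

```python
import math

def estimate_phase_error(phi_err, two_n):
    """Fit phi_err[k] = 2*pi*k*theta/two_n over bins k = 1..N-1 and return theta in UI."""
    ks = range(1, len(phi_err) + 1)
    slope = sum(k * p for k, p in zip(ks, phi_err)) / sum(k * k for k in ks)
    return slope * two_n / (2 * math.pi)

two_n = 32
theta_true = 0.05   # hypothetical sampling phase error, in UI

# Ideal per-bin rotation errors for the 15 usable bins
phi = [2 * math.pi * k * theta_true / two_n for k in range(1, 16)]
theta_hat = estimate_phase_error(phi, two_n)

# The same data with a small alternating perturbation of 0.01 rad per bin
phi_noisy = [p + 0.01 * (-1) ** k for k, p in zip(range(1, 16), phi)]
theta_hat_noisy = estimate_phase_error(phi_noisy, two_n)
```

Because every data-filled bin contributes to the fit, per-bin disturbances are averaged out, which is the accuracy benefit claimed over a small number of pilot carriers.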
At startup, a training sequence is used to synchronize equalization. Then timing recovery enters synchronization mode, where it only observes the lowest bin. By rotating the phase interpolator, it drives φ err [1] → 0. Doing so achieves coarse frequency and phase synchronization. At run-time, data is sent, and the timing recovery enters normal operation mode where it observes all bins. If phase drift is encountered, constellations try to rotate. However, the adaptive equalizers compensate by applying appropriate equalization, ensuring they stay square. By observing the rotation equalization across all bins, an accurate estimate of the timing offset is uncovered and corrected. This procedure has been verified in simulation.
The proposed modification concerns the PI controller. A first-order controller is unable to remove residual sampling phase error when a frequency offset is present [19]. Although this error can be removed by adjusting φ expect [k], it requires a separate control circuit. Instead, this work implements a second-order transfer function by adding a second-order integration path. By keeping the coefficient small, the stability of the system is unaffected, yet the error is removed.

IV. SIMULATION RESULTS
This section compares the proposed FBMC system against conventional DMT signalling. We then analyze the reduction in complexity from simplified FBMC coding before concluding with the analysis of the proposed gain controller.
As shown in Fig. 13, a top-level model is created, whose coding can be set to either DMT or FBMC. Typical impairments are included, such as crosstalk from seven aggressor channels, a 7-bit 32GS/s DAC (ADC) with a 5.5-bit ENOB, 1.5mV² of input-referred noise from a Continuous-Time Linear Equalizer (CTLE), 0.01UI rms of random jitter, and 0.02UI P of Dual-Dirac jitter (set to meet the CEI 56Gb/s specifications for channel compliance testing using reference transmitter and receiver [32]). The victim and aggressor channels are publicly available through the IEEE 802.3 Ethernet Working Group [36]. For both modulations, a CTLE is used to partially equalize the received signal and help reduce DMT's CP lengths to 0UI, 1UI, 2UI, 2UI, and 3UI for the 0dB, 12dB, 22dB, 35dB, and 42dB channels, respectively. This optimization ensures the CP encompasses the majority of the pulse response without adding unnecessary overhead. Note that a CTLE is not required but improves the link's capacity. Furthermore, 2, 4, and 6 pilot carriers are employed for 16, 32, and 64-point DMT coding to enable channel estimation and timing recovery. This amount ensures sufficient averaging without introducing unnecessary overhead and is based on the analysis performed in [19]. The proposed FBMC system does not require CP or bin reservation. Bit and power-loading are optimized for a Bit-Error-Rate (BER) of 10^−4, a typical pre-FEC target.
Results are displayed in Fig. 14. Each column depicts one of the four comparison metrics, including communication link capacity, power consumption efficiency, silicon area efficiency, and signal processing latency. In contrast, the rows depict the swept simulation parameters, including channel attenuation at Nyquist, IFFT (FFT) resolution, and frame overlap factor. For each experiment, when not swept, the channel attenuation is fixed at 35dB, the IFFT (FFT) resolution at 32, and the overlap at 3. We now analyze these results.

A. CAPACITY
Column a in Fig. 14 highlights the relative improvement in capacity with respect to a conventional DMT system. Notably, the proposed FBMC system outperforms DMT in all scenarios. While sweeping channel attenuation, we observe a maximum of 2.01 with a 35dB channel. The trend is expected to increase with higher attenuation channels since this lengthens the CP and adds overhead. Apart from one anomaly, our results follow this trend. While sweeping FFT resolution, we observe a maximum of 2.01 at a resolution of 32. The trend is expected to decrease with resolution since CP and bin reservation overhead are most pronounced with short frame lengths. This trend can be observed from the results. Finally, while sweeping the overlap factor, results show a maximum of 2.08 for O = 4. The trend is expected to increase with larger overlap since increasing the filter length further attenuates side-lobes, reducing interference among symbols [20], [21]. This trend is also noticeable in the results. Note that with an overlap of 1, we are comparing two DMT systems, one of which employs the proposed pilot-carrier-less channel estimation and timing recovery; this matches results from [19]. From analysis, FBMC modulation improves capacity by 35%. The rest comes from eliminating the overhead.

B. POWER EFFICIENCY
Column b in Fig. 14 highlights the relative improvement in power efficiency with respect to a conventional DMT system, in terms of bits per unit energy. Power estimates are based on post-layout results from a synthesized 16nm FinFET DMT design. Results from this design are scaled based on a first-order approximation: P = μCV²f. This considers the transition activity percentage μ, relative number of gates C, supply voltage V, and clock frequency f. We assume the DAC and ADC power is constant no matter the design. Furthermore, although conventional DMT coding does not employ the proposed pilot-carrier-less channel estimation or timing recovery, it too requires similar mechanisms. Therefore, we assume the same power consumption for these similar functions.
As shown, the power efficiency is considerably improved in FBMC. The trend follows the increase in capacity with a maximum improvement of 2.00 experienced with a 35dB channel, 2.00 with a 32-point IFFT (FFT), and 2.02 with an overlap of 4. This result is expected since the power consumption of the proposed FBMC system closely matches that of DMT but at almost twice the performance. Two factors are responsible for this. First, simplified FBMC coding reduces complexity, as will be analyzed shortly. And second, FBMC serialization and deserialization are vastly more straightforward than in DMT. Let us assume a parallelized DAC and ADC architecture is adopted, which inherently serializes and deserializes data to and from a bus of width 32. In a DMT system, each frame requires gearboxing to a width of 32 + N CP . This operation requires a complex FIFO circular buffer since the gearing ratio is not a power of two. Post-layout results reveal that this module consumes more than 18% of the receiver's DSP power budget. In contrast, although FBMC frames are also stretched, they are overlapped such that the effective frame lengths remain unchanged, eliminating the need for gearboxing and enabling a more straightforward and less power-hungry design.

C. AREA EFFICIENCY
Column c in Fig. 14 highlights the relative improvement in area efficiency with respect to a conventional DMT system, in terms of bits per unit area. Silicon area estimates are based on the same 16nm synthesized design. In a similar manner, we assume the analog area is unchanged and scale the DSP area based on the relative number of gates.
Much like the previous metric, results closely follow the capacity and power efficiency trend with a maximum improvement of 2.01 with a 35dB channel, 2.01 with a 32-point IFFT (FFT), and 2.02 with an overlap of 4. This is because the proposed FBMC area is comparable to that of DMT but at almost twice the performance.

D. LATENCY
Column d in Fig. 14 highlights the relative change in system latency with respect to a conventional DMT system. Although transitioning from DMT to FBMC provides many benefits, it can experience a slight increase in latency. Results from this analysis are derived from the same 16nm design, which adopts a pipeline implementation set to meet post-layout timing requirements. As shown, while the relative processing latency is unaffected by channel attenuation and only slightly by IFFT (FFT) resolution, its increase with larger overlap factors is more pronounced. These results are expected as the decoder must wait to receive the whole frame before processing. With a first-order approximation, latency is directly proportional to the frame length. As the length of an FBMC frame is often multiple times longer than in DMT, it will experience more latency. However, the effect is less pronounced than one might anticipate. Results show a relative increase of only 13% (from 10.6ns to 12.0ns) even with a triple-length frame. The reason is that the coding of signals is not instantaneous. Practical implementations often take multiple clock cycles to complete. As a result, the increase in frame length contributes only a small percentage to the overall processing latency. Furthermore, simplifications made using the proposed FBMC coding and removing the gearbox reduce the computation, contributing to less delay.
Note that this analysis depicts the relative change in latency between DMT and FBMC. In absolute terms, however, the latency of an FBMC system increases in proportion to its coding resolution. For example, a 16-point FBMC system experiences a latency of 6.0ns, whereas a 64-point version experiences 24.0ns.
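The latency figures quoted in this subsection can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers from the text; the helper names are illustrative, and it assumes the 12.0ns figure corresponds to the 32-point FBMC case, which is consistent with the 6.0ns and 24.0ns figures under proportional scaling.

```python
# Latency figures quoted in the text (ns). Helper names are illustrative only.
DMT_LATENCY_NS = 10.6    # DMT reference
FBMC_LATENCY_NS = 12.0   # FBMC with a triple-length frame (assumed 32-point)


def relative_increase(reference: float, value: float) -> float:
    """Fractional increase of value over reference."""
    return (value - reference) / reference


def scaled_latency(base_latency_ns: float, base_points: int, points: int) -> float:
    """First-order model: latency proportional to coding resolution."""
    return base_latency_ns * points / base_points


increase = relative_increase(DMT_LATENCY_NS, FBMC_LATENCY_NS)
print(f"relative increase: {increase:.1%}")
for n in (16, 32, 64):
    print(f"{n}-point FBMC: {scaled_latency(FBMC_LATENCY_NS, 32, n):.1f} ns")
```

The proportional model is only a first-order approximation, as the text notes; fixed pipeline delays would make the real scaling slightly less than linear.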
In general, transitioning from DMT to FBMC results in improved capacity, power efficiency, and area efficiency at the cost of a slight increase in latency. That said, there are scenarios where transitioning from conventional DMT to the proposed FBMC improves all aspects with no compromise. One such example occurs with a 42dB channel when transitioning from 64-point DMT to 16-point FBMC with O = 2. Our simulations show a 30% improvement in capacity, a 6% improvement in power efficiency, a 29% improvement in area efficiency, and 3.8 times lower latency. This observation coincides with results from [20] and highlights the considerable performance benefits found in FBMC signalling.
Furthermore, our analysis does not consider the complexities associated with clocking. As mentioned, a DMT frame is rarely a power of two in length. Assuming the DAC and ADC are implemented using a power-of-two number of slices, the DSP clock will require a nontrivial division from $f_s/32$ to $f_s/(32 + N_{CP})$. This adds complexity and could further motivate the transition to FBMC signalling.
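A short sketch can make the clocking issue concrete. It assumes a 32-sample frame as in the text; the cyclic-prefix lengths and function names below are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

N_FFT = 32  # frame length without cyclic prefix, as in the text


def is_power_of_two(n: int) -> bool:
    """True when n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def dsp_clock_ratio(n_cp: int) -> Fraction:
    """DSP clock of a prefixed DMT frame relative to fs/32: 32 / (32 + N_CP)."""
    return Fraction(N_FFT, N_FFT + n_cp)


for n_cp in (2, 4, 8):  # hypothetical cyclic-prefix lengths
    frame = N_FFT + n_cp
    print(f"N_CP={n_cp}: divide fs by {frame}, "
          f"power of two: {is_power_of_two(frame)}, "
          f"ratio to fs/32: {dsp_clock_ratio(n_cp)}")
```

For any prefix shorter than the frame itself, the divider 32 + N_CP falls strictly between 32 and 64 and therefore cannot be a power of two, so a simple binary divider is not sufficient.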

E. SIMPLIFIED FBMC CODING
We now analyze the reduction in complexity achieved by the proposed FBMC coding, where only a single IFFT (FFT) is needed to encode both the real and imaginary waveforms. This analysis assumes multiplication dominates complexity and addition is comparatively negligible. Fig. 15 compares the number of real-valued multipliers required for the proposed single IFFT versus the conventional double IFFTs. Results show an improvement that is inversely proportional to resolution. For example, with a 32-point IFFT, the proposed approach reduces complexity by 29%. This improvement is also experienced in the receiver and is independent of the overlap factor.
To explain these results, note that a conventional pipelined IFFT (FFT) design requires approximately $N\log_2(2N)$ complex multipliers [37], where each complex multiplier requires three real-valued multipliers. However, assuming Hermitian symmetric symbols and thus a real-valued signal, redundant calculations can be removed; we denote this as a redundant IFFT (FFT) [37]. As a side note, the reference DMT system adopts this simplification. Nevertheless, whereas conventional FBMC coding requires a pair of redundant IFFTs (FFTs), the proposed approach requires only a single conventional IFFT (FFT). As shown, this reduces complexity, ensuring FBMC systems remain comparable to their DMT counterpart.
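As a worked instance of the multiplier estimate quoted above, the following sketch evaluates $3N\log_2(2N)$ for a few sizes. It assumes $N$ is the IFFT size, which is an interpretation rather than something stated in this section, and the function name is illustrative. The multiplier count of the redundant IFFT is not given here, so the sketch does not attempt to reproduce the 29% figure.

```python
REAL_PER_COMPLEX = 3  # real multipliers per complex multiplier, per the text


def conventional_real_multipliers(n: int) -> int:
    """Approximate real multipliers of a pipelined N-point IFFT: 3 * N * log2(2N)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    log2_2n = (2 * n).bit_length() - 1  # exact log2 for powers of two
    return REAL_PER_COMPLEX * n * log2_2n


for n in (16, 32, 64):
    print(f"{n}-point: {conventional_real_multipliers(n)} real multipliers")
```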

F. GAIN CONTROL WITH ADAPTIVE LOOP COEFFICIENT
This section compares the noise tracking performance of the conventional gain control system found in [19] to the proposed one. As shown in Fig. 16, the Noise Transfer (NTRAN) and Noise Tracking (NTRACK) functions are overlaid for both designs. NTRAN describes the amount of noise recovered by the system, whereas NTRACK describes the amount of noise remaining at the output. For this analysis, we assume a 1GHz DSP clock. The loop gain of each system is set to achieve the same amount of NTRACK peaking. From simulation results, the 3dB point of the proposed NTRACK curve, or the point at which noise is no longer tracked, is increased by 1.65 times from 2.3MHz to 3.8MHz. This is due to the reduction in loop latency from 8 clock periods to 5. The improvement is constant regardless of the coding resolution or overlap factor. As such, this approach outperforms the alternative while also reducing complexity.
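The link between loop latency and tracking bandwidth can be illustrated with a generic model. The sketch below is not the gain controller of this paper or of [19]: it assumes a plain integrating loop with a pure delay, $L(z) = K z^{-D} / (1 - z^{-1})$, takes NTRACK as $1/(1+L)$, and uses a hypothetical 1 dB peaking target. For each delay it sets the gain to hit the same peaking and then locates the 3dB point of NTRACK.

```python
import cmath
import math

F_CLK_HZ = 1e9            # DSP clock assumed in the text
TARGET_PEAKING_DB = 1.0   # hypothetical peaking target, not from the paper


def ntrack_mag(gain: float, delay: int, w: float) -> float:
    """|NTRACK| = |1 / (1 + L)| with L(z) = gain * z^-delay / (1 - z^-1)."""
    loop = gain * cmath.exp(-1j * w * delay) / (1 - cmath.exp(-1j * w))
    return abs(1 / (1 + loop))


def peaking_db(gain: float, delay: int, points: int = 2000) -> float:
    """Largest |NTRACK| on a frequency grid over (0, pi], in dB."""
    peak = max(ntrack_mag(gain, delay, math.pi * k / points)
               for k in range(1, points + 1))
    return 20 * math.log10(peak)


def bisect(func, lo: float, hi: float, target: float, steps: int = 40) -> float:
    """Solve func(x) = target, assuming func rises from below to above target."""
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def tracking_bandwidth_hz(delay: int) -> float:
    """3dB point of NTRACK once the gain is set for the target peaking."""
    w_crit = math.pi / (2 * delay - 1)      # phase crossover of this loop model
    gain_crit = 2 * math.sin(w_crit / 2)    # gain at the stability limit
    gain = bisect(lambda g: peaking_db(g, delay),
                  0.0, gain_crit, TARGET_PEAKING_DB)
    w_3db = bisect(lambda w: ntrack_mag(gain, delay, w),
                   0.0, w_crit, 1 / math.sqrt(2))
    return w_3db / (2 * math.pi) * F_CLK_HZ


bw_slow = tracking_bandwidth_hz(8)  # 8 clock periods of loop latency
bw_fast = tracking_bandwidth_hz(5)  # 5 clock periods of loop latency
print(f"bandwidth ratio: {bw_fast / bw_slow:.2f}")
```

In this simplified model the response shape depends mainly on the product of gain and effective delay, so at equal peaking the bandwidth scales roughly inversely with the loop delay. The absolute frequencies depend on the chosen peaking target and loop structure and should not be read as a reproduction of Fig. 16.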

V. CONCLUSION
This paper proposes an efficient multi-carrier system that combines filter-bank multi-carrier signalling, decision-directed channel estimation, and frequency-domain timing recovery to eliminate the overhead associated with cyclic prefix, large side-lobes, and pilot carriers. Furthermore, a technique is proposed to halve the required number of FFTs (IFFTs), reducing their complexity by 29% for a 32-point resolution; a method is proposed to correct tilt and stretch distortion; and a gain controller with adaptive loop coefficients is adopted to achieve the same stability but 65% higher tracking bandwidth regardless of the FFT size. A top-level model is created to enable an in-depth comparison between the proposed solution and a conventional discrete multi-tone system in terms of communication link capacity, power consumption efficiency, silicon area efficiency, and signal processing latency. Assuming a 32-point FFT, a 35dB channel, and an overlap factor of 3, results show 101% improvement in capacity, 100% improvement in power efficiency, and 101% improvement in area efficiency, and all while maintaining comparable latency. In general, we have demonstrated that transitioning from conventional DMT signalling to the proposed FBMC system improves the data rate without necessarily compromising power, area, or latency. The proposed system enables very low resolution multi-carrier schemes, which were previously impractical due to the detrimental overhead.
HOSSEIN SHAKIBA (Senior Member, IEEE) received his B.Sc. and M.Sc. degrees in Electrical Engineering from the Department of Electrical and Computer Engineering at the Isfahan University of Technology, Iran, in 1985 and 1989, respectively, and his Ph.D. degree in Electrical Engineering from the Department of Electrical and Computer Engineering at the University of Toronto, Canada, in 1997. He has over 35 years of teaching, research, design, and management experience in the area of analog circuit and system design for various applications with focus on wireline communication in both the industry and academia. He is currently working on system and circuit development for next generation serial links at Huawei Canada in collaboration with the wireline industry with emphasis on link design, modeling, and analysis including statistical and signal integrity. He is also actively involved in conducting research with various universities and co-supervises several graduate students.