A 130-nm Fusion-Based Deconvolution Kernel Generator IC for Real-Time mmWave Radar Motion Compensation

High-resolution mmWave radar represents an attractive sensing modality for precise, real-time sensing on resource-limited edge devices. Inherent parasitic platform motion such as vibration, however, significantly degrades perception resolution. Although existing methods of motion compensation have addressed this issue, such algorithms entail compute-heavy post-processing and thus would not be feasible for short-time-scale operation on a resource-constrained edge device. We present the first known IC-enabled solution for vibratory motion compensation, in the form of a mixed-signal, real-time deconvolution kernel generator chip. The custom IC, implemented in the open-source SkyWater 130-nm technology, functions as a sensor-fusion-based filter generation engine, operating in processing windows of less than 95 ms and fusing synchronized data from an auxiliary accelerometer to synthesize a kernel capable of correcting the received radar signals in real time. Measurement results demonstrate accurate kernel generation, while yielding an average spur suppression of 35 dB across a range of vibration frequencies and amplitudes and 26 dB across measured target velocity. Such performance, in conjunction with the demonstrated operation on short time windows of radar data, validates the chip’s potential for realistic deployment in real-time applications requiring high-resolution perception.


I. INTRODUCTION
MmWave radar sensors, operating in the 60-GHz and 77-GHz bands, have emerged as powerful vehicles for high-resolution object detection and motion tracking, particularly in automotive, consumer, and industrial applications. In contrast to alternative imaging modalities such as light detection and ranging (LiDAR) and camera-based vision, mmWave radar simultaneously provides sub-cm range resolution, high velocity resolution, and unparalleled maximum range and velocity, all while demonstrating robust operation under challenging environmental conditions [1]. Such advantages render mmWave radars indispensable in domains requiring precise perception; advanced driver assistance systems (ADAS), for instance, rely upon short wavelengths to achieve the required resolution for high-performance detection and maneuvering [2]. While many radar-based perception systems exploit the sheer computational efficiency and scalability of back-end processors and/or cloud computing to assess the incoming data, the reality is that many detection scenarios require real-time, resource-efficient information extraction right at the edge for immediate response capacity, without the latency and power consumption incurred by bulk post-processing [3]. Autonomous navigation, for instance, requires maximum response times on the order of tens to hundreds of milliseconds, which can be feasibly achieved only via localized sensing. Accurate perception, in other words, must be attainable via a self-contained edge device with fully integrated sensing and processing circuitry. CMOS ICs, boasting small, easily integrated areas, low power/resource consumption, and high-speed operation, represent the optimal platform for such edge sensing.

The associate editor coordinating the review of this manuscript and approving it for publication was Dusan Grujic.
Yet, despite the promise of integrated silicon mmWave radars for edge-based detection, classification, and real-time perception applications, the reality is that such devices are rarely stationary; indeed, inherent parasitic platform motion, particularly vibration, substantially impairs detection and tracking capacity [4], [5], [6], [7], [8]. As shown by [9] in the case of frequency modulated continuous wave (FMCW) radar, the presence of vibration yields a sideband-saturated range-Doppler profile exhibiting spectral spurs around the center lobe corresponding to the target velocity. These adjacent spectral peaks present notable challenges for unambiguous range and velocity estimation, a situation exacerbated for high-bandwidth mmWave systems requiring simultaneously high range and Doppler resolution [10]. Automotive vehicles and aerial drones, for example, often integrate such sensors on the hood (or drone body), rendering the device highly susceptible to engine vibration. Motion capture and wearable device applications (e.g., activity tracking) induce similar detrimental platform motion that corrupts the IF signal returns and reduces sensing capacity. Achieving high-resolution sensing in the presence of such inevitable external motion, therefore, requires that real-time platform motion compensation be integrated into the edge IC device itself.
Assessing the scope of published literature, there currently exist no implementations of real-time, computationally efficient vibration/motion correction on a dedicated radar IC. Indeed, most existing silicon implementations with integrated back-end DSP engines [11] only enable post-processing motion correction and are often simply generalized for wideband noise elimination, which does not sufficiently suppress vibration-induced sideband spurs. And while an existing DSP chip or CPU could be easily programmed to perform such motion-specific compensation, such modules are ultimately designed for maximum application flexibility, thus providing superfluous features that consume unnecessary power and area overhead. Although such additional hardware resources are undoubtedly advantageous for fast-runtime post-measurement analysis, they are superfluous for processing at real-time intervals. The Texas Instruments (TI) AWR1843 mmWave module used in this work, for instance, contains a general-purpose DSP back-end that consumes more than 420 mW during operation [12], orders of magnitude greater than that necessary for a motion compensation procedure. Even the industry state-of-the-art, ultra-low-power, fixed-point TI C55X DSP chips, designed in 45-nm process technology, consume a minimum of 15-30 mW during basic arithmetic operations [13]. Normalizing against process node and supply voltage for an appropriate comparison with the presented 130-nm chip yields an approximate scaled power consumption on the order of ∼100 mW, again orders of magnitude greater than that required. The additional package area overhead (>100 mm²) induced by a commercial DSP chip is a further disadvantage compared with a custom solution.
Meanwhile, even CMOS solutions such as Google's Soli sensor, while edge-based and real-time to some extent, nevertheless rely upon computationally complex back-end machine learning algorithms to achieve high-resolution perception [14]. Published non-integrated methods similarly address platform motion via post-processing correction algorithms that analyze the holistic data cube in their compute-heavy pipelines and thus would not be feasible for real-time/short-time-scale operation on a resource-constrained edge device. Such non-integrated pipelines include published synthetic aperture radar (SAR) processing algorithms (e.g., Doppler Keystone Transform [15], range/Doppler migration correction [16], [17]), oversampling smoothness (OSS)-based phase retrieval [18], and vibration parameter estimation via discrete fractional Fourier transforms [19], among others [20], [21], [22].
Ultimately, real-time vibratory motion correction on an integrated mmWave edge device can be achieved via multisensor fusion with an auxiliary sensor. In particular, measurements can be simultaneously collected from a peripheral accelerometer or inertial measurement unit (IMU), yielding information about the platform vibration itself, to correct the data from the primary radar unit. Existing radar sensor fusion works [23], [24], [25], however, share a common limitation: pipelines utilizing auxiliary IMU data are entirely post-processing-based and thus incur both latency and resource consumption/computational complexity that are infeasible for edge applications.
To address this shortcoming, we present the first custom IC, implemented in open-source SkyWater 130-nm CMOS [26], dedicated purely to enabling real-time mmWave FMCW radar platform motion compensation, via early sensor fusion. The proposed integrated fusion-based solution to single-mode vibration suppression takes the form of a deconvolution kernel generation engine, operating at the radar frame level (38-95 ms intervals) for minimal-latency, real-time operation, thereby eliminating the need for compute-intensive post-processing. In particular, the chip processes synchronized data from an auxiliary accelerometer and generates a series of frequency-domain deconvolution kernels in real time, with the generated filters applied to the simultaneously collected radar data off-chip to correct the vibration-impaired IF signal returns.
Fig. 1 depicts a functional block diagram for the designed analog/mixed-signal IC, which takes as an input the raw signal from an external analog accelerometer that senses the parasitic vibratory motion of the radar platform. The chip analog front-end detects both the frequency and amplitude of the incoming vibration signal, with the resulting digitized data processed by the custom back-end digital deconvolution kernel generator, which formulates and outputs a serialized vector representing the optimal filter for vibration suppression. Architecture-wise, the DSP back-end is designed to emulate the notch-filter-based kernel generation process proposed and verified in the authors' prior art [4]. In contrast to the reference algorithm, however, the IC implementation formulates a single-notch, rather than triple-notch, compensation filter spectrum to reduce on-chip computation and resource consumption. Such computational complexity is obviously not a concern in the purely software-based processing algorithm developed in [4], which also employs a digital IMU given the absence of any of the efficient analog vibration detection circuitry present in the custom chip. The resulting frequency-domain kernel is subsequently applied to the corresponding mmWave radar data spectrum off-chip to compensate for the signal-corrupting platform vibration. Measurement results demonstrate accurate generation of the appropriate motion deconvolution kernel/filter in response to varying input vibration signals, yielding an average of 35 dB vibration mode suppression across vibration frequency and amplitude and 26 dB across target velocity. Such gains, while slightly lower than those achieved in the software-based solution of [4] (29-39 dB), nevertheless come with a significant reduction in hardware and processing latency, and exceed those of existing ICs, as described in Section V.
This paper is structured as follows: Section II conducts a literature review of comparable related implementations; the architecture and design considerations of the IC, including the analog front-end and back-end DSP engine, are outlined in Section III; Section IV describes the setup used for chip measurements, with the achieved platform motion compensation results and state-of-the-art comparison table presented in Section V; finally, Section VI concludes the work with a brief summary and discussion of future directions.

II. RELATED WORK
Since the chip is unique in its capacity for integrated real-time and resource-efficient, low-computational-complexity vibratory motion compensation, no prior work exists for comprehensive comparison in all aspects. Perhaps the most related domains are the areas of 1) full-duplex (FD) radios and wireless communications, in which self-interference cancellation/equalization (SIC) is of utmost importance [34], [35], and 2) inertial sensor fusion for high-resolution imaging or platform motion compensation. Tables 1 and 2 present the closest comparisons in each field, respectively. Ultimately, the designed IC's combination of low computational load and minimal latency stands out across both domains.
Works [27], [28], [29], [30], [31], [32] in Table 1 present full-duplex radios for channel estimation with incorporated hardware for interference suppression. Such non-silicon implementations, which exhibit simulation performance superior to that of measured CMOS FD radio ICs, are modeled in hardware and provide an apt comparison for the degree of achievable SIC relative to integrated computational complexity. Note that the significant degree of multidomain suppression, attained at the obvious expense of power/resource consumption, is necessitated given the required signal integrity in FD radio communication applications. Considered collectively, of particular significance is the fact that all implementations, unlike the designed IC, are learning-based, requiring upwards of N = 2000 training samples and relying upon computationally complex least-squares-based estimation procedures to determine filter and channel parameters of interest. Furthermore, most works augment the computation even further via performance-enhancing techniques such as truncated singular value decomposition (TSVD), nonlinear estimation, principal component analysis (PCA), and widely linear least-squares estimation, rendering them infeasible for real-time deployment on a resource-limited edge device.
Whereas [27], [28], [29], [30], [31], and [32] provide a source of comparison with regard to spur suppression algorithmic complexity and achieved cancellation in full-duplex radios, [23], [24], [25], [33] in Table 2 present IMU-enabled motion compensation schemes, which yield an apt point of reference for the degree of fusion-enabled suppression achievable given both the compensation/processing time scale and computational complexity. While some of these prior works do indeed achieve noise/sideband suppression comparable to that of the designed IC, all entail significant processing time intervals, unrealistic degrees of computational complexity, or both, each of which ultimately precludes real-time, edge-based operation. Among sensor fusion techniques utilizing alternative modalities such as LiDAR and camera-based vision, several existing works addressing tracking in the presence of unpredictable target motion do indeed operate at real-time-compatible intervals on the order of 100 ms [36], [37]. However, such techniques entail some combination of ML-based detection procedures, extended/unscented Kalman filters, and the interacting multiple model algorithm, all demonstrated in back-end software only, with computational complexities significantly greater than that of the proposed approach.
Meanwhile, the presented chip, described in detail in Section III, achieves 26-35 dB vibration spur/noise suppression with minimal pre-processing, operating on the frequency-domain transformation of the digitized time-domain, complex-valued raw radar data (achieved via only a single applied fast Fourier transform (FFT)), with a computationally inexpensive parameter-lookup-based motion compensation procedure. The aggregate of this algorithmic simplicity and the IC's real-time processing interval of 38-95 ms is unparalleled among existing FD radio and fusion-based noise/interference correction implementations.

[...] magnitude data in an on-chip SRAM array for subsequent serial readout. Separate 8-bit successive approximation register (SAR) analog-to-digital converters (ADCs) for the frequency and amplitude processing chains form the interface between the analog and digital blocks of the chip. The following sections provide details on both the analog and digital internal circuitry.

A. ANALOG FRONT-END
The demonstrated IC takes as an input the high-pass-filtered single-channel acceleration voltage-domain signal from an analog accelerometer (Analog Devices ADXL335). While many sensor-fusion-based applications rely upon digital accelerometers/IMUs, given the tendency for purely DSP-based, fixed-resolution systems requiring higher bandwidth/SNR than that provided by their analog counterparts, use of an analog accelerometer provides several advantages for real-time, edge-based platform motion compensation. To begin with, the reliance upon a purely analog input removes the need for the synchronization that would otherwise be necessary in fusing data from the mmWave radar with outputs from a digital MEMS sensor. In other words, analog sampling can be entirely dictated by the radar ADC sampling rate and frame rate, thereby eliminating the need for digital interpolation and saving computation. Furthermore, there is no need to apply any of the higher-complexity Kalman/smoothing filters often prevalent in digital sensor fusion scenarios. Relatedly, since the analog front-end operates in continuous time, with constantly updating vibration parameter outputs, the chip does not need to store windows of accelerometer data in memory for back-end frequency and amplitude analysis, saving both compute and memory storage area. An additional advantage of analog MEMS sensors is the capacity for direct real-time adaptation of the sampling rate and resolution in a closed-loop manner. Though not currently implemented, analog-domain adaptive feedback would permit fast and continuous optimization of power consumption and signal integrity in accordance with previously acquired data, with significantly shorter closed-loop settling times unhindered by intermediate digital conversion interfaces, yielding low latency for real-time sensing. Finally, in contrast to digital interfaces, only a single I/O is needed with an analog input signal; hence, for multi-channel, resource-limited applications, an analog device is clearly the expedient choice.
An input amplifier chain, consisting of a single-ended-to-differential converter and programmable-gain amplifier (PGA), converts the single-channel, high-pass-filtered input into a differential signal to mitigate the effect of common-mode noise in the subsequent filter stage. To provide the capacity for input amplitude adaptation, the PGA gain is programmed via two digital input bits that switch in or out capacitors in the feedback network of the differential operational transconductance amplifier (OTA)-based amplifier. Fig. 3 depicts the top-level design of the two-stage input amplifier chain. The integrated differential OTA utilizes a folded cascode architecture to maximize the input common-mode range, with common-mode feedback setting the desired output common-mode voltage level and capacitive load compensation used to achieve a sufficient phase margin. Bias voltages for the OTA are generated via a reference network employing a pair of Sooch cascode current mirrors [38].
Filtering of the raw accelerometer input signals is performed to mitigate the impact of sensor noise and is achieved via an active low-pass $G_m$-C biquad filter [39], shown in Fig. 4, with the individual transconductor stages implemented via a simple single-stage differential OTA with common-mode feedback. Given the anticipated low-frequency nature of the input vibration signal, the optimal low-pass corner frequency $f_{3\mathrm{dB}}$ occurs in the sub-kHz band, requiring either excessively low bias currents for the $G_m$ stages, and hence, increased susceptibility to device mismatch/noise, or large filter capacitors. Opting for the latter, the $G_m$-C filter capacitors, selected to achieve $f_{3\mathrm{dB}} \approx 500$ Hz, are situated off-chip to minimize area. As designed, the filter's measured second-order roll-off yields sufficient attenuation of mid-frequency noise components.
As shown in Fig. 2, the filtered acceleration signal is then distributed to the independent frequency and amplitude processing chains to assess the nature of the vibration. Frequency detection is performed by first converting the vibration signal, resembling a sinusoid with a strong fundamental tone, to a square waveform with well-defined edges. This is accomplished via a differential-input, hysteresis-enabled continuous-time comparator, shown in Fig. 5. The buffered single-ended output is then passed to a low-frequency phase-locked loop (PLL), which serves to lock onto the vibration frequency, generating an analog voltage at its charge pump output proportional to the detected frequency.
The type-II PLL architecture [40] consists of a voltage-controlled ring oscillator (VCO), a D flip-flop (DFF)-based frequency divider, a phase-frequency detector (PFD), a charge pump (CP), and an off-chip RC loop filter, interconnected in a negative feedback control loop. The charge pump output RC filter and complete PLL loop are designed for a worst-case 1-percent settling time of less than 500 ms, given a maximum-magnitude input vibration frequency step from 0 Hz to 50 Hz. In a typical platform motion scenario, therefore, in which input frequency deviations are not nearly as large or abrupt, the PLL is expected to settle sufficiently fast. Such gradual input frequency variations also mitigate the phase noise induced by the VCO, while phase errors due to the reference continuous-time comparator output itself are suppressed by the low loop bandwidth.
Meanwhile, the amplitude processing chain consists of a simple differential-to-single-ended converter, followed by a peak detector circuit that generates at its output an analog voltage corresponding to the maximum signal level achieved during the given sampling interval. The peak detector circuit, depicted in Fig. 6, utilizes a current mirror as the rectifier element, followed by an output buffer, with $C_1$ functioning as the charge storage capacitor and $M_{N1}$ providing the necessary reset functionality within each sampling interval.
The analog voltage outputs from the PLL charge pump and peak detector are subsequently digitized via separate 8-bit SAR ADCs designed using an active track-and-hold circuit, MIM-based binary-weighted capacitive digital-to-analog converter (DAC), and latched comparator, as illustrated in Fig. 7. The track-and-hold circuit employs an open-loop design, with $M_{N1}$ and $C_1$ serving as the primary sampling switch and storage capacitor, respectively, and $M_{N2}$ and $C_2$ included on the negative amplifier input terminal for clock feedthrough and charge injection [41] cancellation. Meanwhile, the latched comparator, employing the double-tail architecture proposed in [42], is designed with minimal need for speed optimization given the relatively low sampling frequency. Furthermore, as described in Section III-B, only a select number of output bits are actually used, relaxing integral nonlinearity (INL) requirements at the designed 8-bit resolution. Note that while lower-bit ADCs would have yielded power and area savings, the utilized resolution was selected to allow for convenient open-source modular design reuse. Separate synchronous ADC controllers govern the circuit timing and are synthesized in the digital domain.
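The successive-approximation principle behind the 8-bit converters can be illustrated with a short behavioral sketch. This is a generic textbook SAR binary search, not the chip's synthesized controller; the reference voltage `v_ref = 1.8` and the helper `top_bits` are assumptions introduced only for illustration.

```python
def sar_convert(v_in, v_ref=1.8, bits=8):
    """Behavioral SAR conversion: test one bit per cycle, MSB first.

    v_ref is a hypothetical full-scale reference, not a value from the paper.
    """
    code = 0
    for bit in reversed(range(bits)):
        trial = code | (1 << bit)              # tentatively set this bit
        v_dac = v_ref * trial / (1 << bits)    # ideal capacitive-DAC output
        if v_in >= v_dac:                      # comparator decision
            code = trial                       # keep the bit
    return code


def top_bits(code, keep, bits=8):
    """Discard the lowest (bits - keep) bits of an ADC code."""
    return code >> (bits - keep)


if __name__ == "__main__":
    code = sar_convert(1.0)
    print(code, top_bits(code, 3), top_bits(code, 7))
```

Keeping only the upper bits, as `top_bits` does, is one simple way to see why errors confined to the least significant bits matter less when only a subset of the output bits is used.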

B. DIGITAL BACK-END DECONVOLUTION KERNEL GENERATOR
The primary engine of the chip is implemented in a custom DSP back-end, which takes as inputs the digitized frequency $f_v$ and amplitude $a_v$ and generates a frequency-domain second-order infinite impulse response (IIR) notch transfer function vector of length $N$ representing the deconvolution filter needed to correct the vibration-corrupted radar data, where $N$ corresponds to the processing interval. For instance, a maximum processing time window of 95 ms, a frame periodicity of $T_{\mathrm{frame}} = 19.0$ ms, and $N_{\mathrm{chirp}} = 255$ chirps constituting each frame yield five frames per window and hence $N = 5 \times 255 = 1275$. Use of a notch filter as the deconvolution kernel stems from the need to ''invert'' the $f_v$-separated sidebands located about the primary Doppler lobe frequency $f_d = 2v_x/\lambda$ in the spectrum of the downconverted IF signal, which may be represented as a function of the FMCW fast-time $t_f \in [0, T_c]$ and slow-time index $n \in [0, N-1]$:

for a target with reflectivity $\rho$, range $R_0$, and velocity $v_x$ along the radar line of sight, and a sequence of chirps, each with start frequency $f_c$, duration $T_c$, bandwidth $B$, and wavelength $\lambda$. Here, the second addend in the complex exponential argument of (1) captures the signal component due to the vibration trajectory of the radar platform, while the first and third addends capture the target range- and velocity-dependent signal components, respectively. Fig. 8 graphically represents the chirp sequence parameters and illustrates a typical vibration-impaired IF signal Doppler (slow-time) spectrum for various target velocities.
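The window-length arithmetic and the Doppler relation quoted above can be checked with a few lines of Python. This is a minimal sketch; the function names are hypothetical, and the 4-mm wavelength in the usage line is a round illustrative number rather than a parameter from the paper.

```python
T_FRAME_S = 19.0e-3   # frame periodicity from the text
N_CHIRP = 255         # chirps per frame from the text


def kernel_length(window_s, t_frame_s=T_FRAME_S, n_chirp=N_CHIRP):
    """Number of slow-time samples N in one processing window."""
    frames = round(window_s / t_frame_s)   # whole frames per window
    return frames * n_chirp


def doppler_frequency(v_x, wavelength):
    """Primary Doppler lobe frequency f_d = 2 * v_x / lambda, in Hz."""
    return 2.0 * v_x / wavelength


if __name__ == "__main__":
    print(kernel_length(95e-3))            # longest window (five frames)
    print(kernel_length(38e-3))            # shortest window (two frames)
    print(doppler_frequency(1.0, 0.004))   # illustrative 4-mm wavelength
```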
The top-level architecture of the deconvolution kernel generator engine, written in Verilog, is presented in Fig. 9, while Fig. 10 depicts the detailed internal architecture and critical signal paths of the DSP engine. Time- and energy-efficient kernel estimation is achieved via two primary techniques: a) lookup-based coefficient determination, and b) a carefully optimized evaluation pipeline. Together, these two features allow for kernel generation within the prescribed real-time processing interval, while avoiding the additional memory (and hence, area) overhead that would be incurred by alternative correction techniques such as template matching.

1) DSP MEMORY ORGANIZATION
The designed deconvolution kernel generator contains four 8-KB SRAMs, each laid out as four 2-KB arrays, as shown in Fig. 10. SRAM #1 contains a lookup table (LUT) for the transfer function coefficients of a second-order IIR notch filter, whose frequency response may be determined by evaluating the z-domain transfer function given by (2) at $z = e^{-j\omega}$.
As indicated by (2), the second-order IIR notch filter transfer function contains only four unique coefficients, with $b_0 = b_2$ and $b_1 = a_1$, and can therefore be indexed using only two bits ($C_0$, $C_1$). In reality, $a_0$ simply takes on a constant value of 1; however, two bits are still required to code the remaining coefficients. The constant $k$ is determined by the notch quality factor $Q$ and the π-normalized notch frequency $\omega_0$, as shown in (6), with $\omega_0$ determined by $f_v$. Note that greater vibration amplitude increases the maximum Doppler leakage/migration $\delta_{\mathrm{mig}} f_d$ about the primary lobe in the Doppler spectrum, requiring greater notch bandwidth $\Delta\omega$ (i.e., reduced $Q$) to compensate. This inverse relationship between $Q$ and $a_v$ is summarized by (3).
Thus, given a fixed set of quality factors and quantized frequencies, the transfer function coefficients are simply precomputed and stored in the coefficient SRAM for fast O(1)-time table lookup during the frequency response evaluation. As illustrated in the ADC interface block of Fig. 10, the 12-bit address used to index into the 8-KB lookup table consists of the top three bits of the digitized amplitude ($A_{7:5}$), followed by the top seven bits of the digitized frequency ($F_{7:1}$), and concluded by the two-bit code corresponding to the desired coefficient ($C_{1:0}$). Note that the precomputed coefficients are stored using a 16-bit signed fixed-point representation with $B_{\mathrm{frac}} = 13$ fractional bits. SRAM #2 is preloaded with the length-$N$ angular frequency vector at which to evaluate the frequency response for the second-order IIR notch filter, with the 16-bit unsigned values normalized to a $[0, 2\pi)$ scale. All necessary data preloading for SRAMs #1 and #2 takes place via a 2-to-4 decoder for bank selection and an input deserializer operating upon a single serial data input, a necessary design choice given technology I/O limitations. SRAM #3 is used to record the computed deconvolution kernel frequency response magnitude data, stored in 16-bit unsigned fixed-point format with $B_{\mathrm{frac}} = 13$ fractional bits. Finally, SRAM #4 stores the computed deconvolution kernel frequency response phase data.
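The address composition and the fixed-point storage format described above can be sketched in software as follows. This is an illustrative model only: the function names are hypothetical, the field order follows the textual description (amplitude bits in the most significant positions), and saturation on overflow is an assumption rather than documented chip behavior.

```python
B_FRAC = 13  # fractional bits, from the text


def lut_address(amp_code, freq_code, coeff_sel):
    """Build the 12-bit LUT address {A[7:5], F[7:1], C[1:0]} from 8-bit ADC codes."""
    a = (amp_code >> 5) & 0x7     # top three amplitude bits
    f = (freq_code >> 1) & 0x7F   # top seven frequency bits
    c = coeff_sel & 0x3           # two-bit coefficient selector
    return (a << 9) | (f << 2) | c


def to_fixed(x, frac_bits=B_FRAC, width=16):
    """Quantize to signed fixed point; saturating on overflow is an assumption."""
    raw = round(x * (1 << frac_bits))
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    return max(lo, min(hi, raw))


def from_fixed(raw, frac_bits=B_FRAC):
    """Convert a stored fixed-point integer back to a real value."""
    return raw / (1 << frac_bits)


if __name__ == "__main__":
    print(lut_address(0b10100000, 0b00000110, 0b01))
    print(to_fixed(0.95), from_fixed(to_fixed(0.95)))
    print((1 << 12) * 2)  # 4096 addresses x 2 bytes per entry
```

With 12 address bits and two bytes per 16-bit entry, the table occupies 4096 × 2 = 8192 bytes, consistent with the 8-KB size quoted for the coefficient SRAM.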

2) LOOKUP TABLE REQUIREMENTS AND ANALYSIS
As described above, determination of the filter coefficients in (2) is greatly expedited via use of a lookup table. Analyzing the second-order IIR notch filter transfer function in more depth, given a vibration frequency $f_v$ and sampling rate $f_s$, the π-normalized angular vibration frequency is simply:

after which the π-normalized notch bandwidth $\Delta\omega$ can easily be expressed in terms of $f_v$ and the notch quality factor $Q$.

VOLUME 11, 2023
This bandwidth determines the constant $k$ in (2), as follows:

Now, given $T_{\mathrm{frame}} = 19.0$ ms and $N_{\mathrm{chirp}} = 255$, the maximum detectable Doppler frequency $f_{d,\max}$, equivalent to $f_s$, is simply $f_{d,\max} = f_s = N_{\mathrm{chirp}}/(2T_{\mathrm{frame}}) \approx 6.710$ kHz. Hence, low vibration frequencies in the typical range $f_v \le 50$ Hz, combined with high $Q$, yield $\omega_0 \approx 0$ and $k \approx 1$, which requires sufficient coefficient fixed-point fractional resolution to accurately capture. With $B_{\mathrm{frac}}$ fractional bits, the maximum value attained by $k$ is simply $k_{\max} = (2^{B_{\mathrm{frac}}} - 1)/2^{B_{\mathrm{frac}}}$, from which a lower bound on the ratio $f_v/Q$ can be derived.
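To make the numbers above concrete, the following sketch computes $f_s$ and $k_{\max}$ as stated in the text, together with a notch constant $k$. The expression used for $k$ is the common textbook all-pass-based notch form, $k = (1 - \tan(\pi\Delta\omega/2))/(1 + \tan(\pi\Delta\omega/2))$ with $\omega_0 = 2f_v/f_s$ and $\Delta\omega = \omega_0/Q$; these three relations are assumptions made for illustration and are not necessarily identical to the paper's own expressions.

```python
import math


def doppler_sampling_rate(n_chirp=255, t_frame_s=19.0e-3):
    """f_s = N_chirp / (2 * T_frame), as given in the text (Hz)."""
    return n_chirp / (2.0 * t_frame_s)


def notch_k(f_v, q, f_s):
    """Notch constant k under an ASSUMED textbook notch parameterization."""
    w0 = 2.0 * f_v / f_s                  # pi-normalized notch frequency (assumed)
    dw = w0 / q                           # pi-normalized bandwidth (assumed)
    t = math.tan(math.pi * dw / 2.0)
    return (1.0 - t) / (1.0 + t)


def k_max(b_frac):
    """Largest value below 1 representable with b_frac fractional bits."""
    return ((1 << b_frac) - 1) / (1 << b_frac)


if __name__ == "__main__":
    f_s = doppler_sampling_rate()
    print(round(f_s, 1))                  # about 6710.5 Hz
    print(notch_k(50.0, 0.2, f_s))        # wide notch: k well below 1
    print(notch_k(4.0, 5.0, f_s))         # narrow notch: k close to 1
    print(k_max(13))
```

Under these assumptions, narrow notches at low vibration frequencies push $k$ toward 1, which illustrates why the number of fractional bits limits the usable region of the $f_v$-$Q$ space.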
Fig. 11 illustrates the valid frequency/quality factor parameter space $V$ given the bound in (7), along with the eventual selected lookup table subspace $L$ for various fixed-point fractional resolutions $B_{\mathrm{frac}}$. Selection of the optimal fixed-point fractional resolution $B_{\mathrm{frac}}$ inside the deconvolution kernel generator engine depends upon the expected range of vibration frequencies and the desired range of notch quality factors. The frequency-detection PLL lock range and the compatible input common-mode level of the subsequent SAR ADC latched comparator are designed for vibration frequencies in the approximate subspace $F = \{(f_v, Q) : f_v \in [4, 50]\ \mathrm{Hz}\}$, and hence, it is required that the lookup table values in this critical frequency subspace reside entirely within the valid parameter space, i.e.,

Furthermore, notch quality factors exceeding $Q = 5$, despite their optimal frequency selectivity, do not provide sufficient radar sideband suppression. This is evident in Fig. 12, which illustrates the maximum achievable notch suppression for a utilized real-time processing interval of $N = 1275$ samples, as well as for an interval of twice the length, across all possible notch frequency and quality factor pairs in the $f_v$-$Q$ space. In other words, a minimum notch bandwidth is necessary for effective vibratory motion compensation, yielding the additional requirement that

As can be observed from Fig. 11, $B_{\mathrm{frac}} = 12$ fails to satisfy stipulation (8), while $B_{\mathrm{frac}} = 14$ yields a $V$ upper boundary providing unnecessarily high $Q$ at the minimum $f_v$ associated with $F$, in light of (9). $B_{\mathrm{frac}} = 13$, however, satisfies all set requirements, and thus, the deconvolution kernel generator is designed using this fixed-point fractional resolution.
The lookup table itself is generated using a fixed set of eight quality factors, $Q \in \{0.2, 0.3, 0.4, 0.5, 0.7, 1, 2, 5\}$, and 128 quantized frequencies, $f_v \in [0, 64)$ Hz with uniform 0.5-Hz spacing, ensuring that $F \subseteq L$ and that (8) and (9) are satisfied. The 3:7 amplitude:frequency bit allocation in the SRAM address, shown in Fig. 10, is selected to favor notch location accuracy over quality factor, while permitting tolerance of ADC nonlinearity via disuse of the bottom bit(s).
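The lookup grid described above can be enumerated directly, which also confirms the entry count. This is a sketch under two labeled assumptions: the 7-bit frequency code maps linearly to 0.5-Hz steps, and amplitude codes are ordered so that larger amplitude selects lower $Q$ (the text states the inverse relationship, but the exact code-to-$Q$ assignment is not given). All names are hypothetical.

```python
Q_SET = [0.2, 0.3, 0.4, 0.5, 0.7, 1, 2, 5]   # quality factors from the text
FREQ_STEP_HZ = 0.5                            # grid spacing from the text
N_FREQ = 128                                  # quantized frequencies
N_COEFF = 4                                   # unique coefficients per filter


def freq_from_code(code7):
    """Assumed linear map from the 7-bit frequency code to Hz."""
    return code7 * FREQ_STEP_HZ


def q_from_code(code3):
    """Assumed ordering: larger amplitude code selects lower Q (wider notch)."""
    return sorted(Q_SET, reverse=True)[code3]


def lut_grid():
    """All (Q, f_v, coefficient index) triples held in the table."""
    return [(q_from_code(a), freq_from_code(f), c)
            for a in range(len(Q_SET))
            for f in range(N_FREQ)
            for c in range(N_COEFF)]


if __name__ == "__main__":
    grid = lut_grid()
    print(len(grid), len(grid) * 2)           # entries, bytes at 16 bits each
    print(freq_from_code(0), freq_from_code(N_FREQ - 1))
```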

3) DECONVOLUTION KERNEL EVALUATION PIPELINE
Formulation of the deconvolution kernel frequency response is controlled by a six-state Moore finite state machine and entails computation of the second-order IIR notch filter magnitude and phase data for each desired phase/frequency loaded in SRAM #2, via the three-stage pipeline labeled at the right of Fig. 10 and outlined in detail in Fig. 13. Because the transfer function must be evaluated at z = e^{−jω} across the target phase/frequency vector, which requires inherently nonlinear complex operations, the pipeline makes use of open-source coordinate rotation digital computer (CORDIC) rectangular-to-polar and polar-to-rectangular converters [43]. Stage 1, requiring 15 clock cycles, rotates the transfer function coefficients by the appropriate phase via sequential CORDIC polar-to-rectangular converters to generate real and imaginary components, with the transfer function numerator and denominator complex components summed in combinational logic. Stage 2, requiring 15 clock cycles, converts …

The length of the evaluation vector is determined by a simple counter depicted in the kernel evaluator block at the right of Fig. 10, which records how many phase/frequency points are preloaded into SRAM #2 to ensure that the minimum number of clock cycles is utilized during the full computation. The effect of evaluation vector size, equivalent to the processing interval length N, upon suppression performance is indicated in Fig. 12. From Fig. 12(a), it is evident that, as expected, notch attenuation improves with decreasing Q and increasing vibration frequency, with the periodic maximum-attenuation bands corresponding to the frequency locations representing integer multiples of the N-determined Doppler resolution. Doubling N, as shown by Fig. 12(b), doubles the Doppler resolution and hence provides twice as many instances of maximum attenuation at the lower vibration frequencies for improved performance. Such performance gains, however, come at the cost of increased on-chip computation and processing interval/latency. Thus, N is capped at 1275 samples.
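The evaluation described above can be mimicked in software. The sketch below rotates the numerator and denominator coefficients of a second-order notch by the evaluation phase, sums the real and imaginary parts, converts each sum to polar form, and then divides magnitudes and subtracts phases. Several points are assumptions rather than details taken from the chip: the coefficients follow a common textbook biquad notch form rather than the stored LUT contents, floating-point `math` calls stand in for the fixed-point CORDIC units, and the slow-time sampling rate is taken as N divided by the window length.

```python
import math


def notch_coefficients(f_v_hz, q, fs_hz):
    """Textbook biquad notch coefficients (assumed form, not the chip's LUT data)."""
    if f_v_hz <= 0 or q <= 0:
        raise ValueError("f_v_hz and q must be positive")
    w0 = 2.0 * math.pi * f_v_hz / fs_hz
    alpha = math.sin(w0) / (2.0 * q)
    b = (1.0, -2.0 * math.cos(w0), 1.0)
    a = (1.0 + alpha, -2.0 * math.cos(w0), 1.0 - alpha)
    return b, a


def rotate_and_sum(coeffs, omega):
    """Rotate coefficient k by -k*omega (polar to rectangular) and sum the parts."""
    re = sum(c * math.cos(-k * omega) for k, c in enumerate(coeffs))
    im = sum(c * math.sin(-k * omega) for k, c in enumerate(coeffs))
    return re, im


def to_polar(re, im):
    """Rectangular-to-polar conversion (the role played by CORDIC on chip)."""
    return math.hypot(re, im), math.atan2(im, re)


def notch_response(f_v_hz, q, fs_hz, freqs_hz):
    """Return (magnitude, phase) of the notch at each frequency in freqs_hz."""
    b, a = notch_coefficients(f_v_hz, q, fs_hz)
    response = []
    for f in freqs_hz:
        omega = 2.0 * math.pi * f / fs_hz
        num_mag, num_ph = to_polar(*rotate_and_sum(b, omega))
        den_mag, den_ph = to_polar(*rotate_and_sum(a, omega))
        phase = num_ph - den_ph
        phase = math.atan2(math.sin(phase), math.cos(phase))  # wrap to (-pi, pi]
        response.append((num_mag / den_mag, phase))
    return response


# Assumed slow-time sampling rate: N = 1275 samples per 95-ms window.
FS_HZ = 1275 / 0.095
kernel = notch_response(20.0, 0.5, FS_HZ, [k * FS_HZ / 1275 for k in range(1275)])
```

Under these assumptions the evaluation grid has a bin spacing of FS_HZ / 1275 = 1 / 0.095 s, roughly 10.5 Hz, which is the N-determined Doppler resolution referred to in the text.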
Throughout the evaluation, the final magnitude and phase data values are simultaneously stored in the corresponding output memory banks (SRAMs #3 and #4). Once the kernel frequency response evaluation procedure is complete for all phases/frequencies, the magnitude and phase outputs are read out through a single pin via an output serializer, again necessitated by technology I/O constraints, along with an active-high output-valid signal. As illustrated in Fig. 10, an output multiplexer selects between data from the four SRAM banks, with access to the outputs from SRAMs #1 and #2 simply included for debugging purposes.

IV. CHIP EVALUATION EXPERIMENTAL SETUP
Chip measurements are performed using the setup depicted in Fig. 14. All data from the accelerometer and radar are collected using a vibrating platform, with application of the vibration signal and off-chip compensation of the radar signal s(t) implemented after the fact on successive short windows of data, as if the correction were done in real time, in accordance with the approach detailed in [4]. Briefly summarizing this approach, the frequency-domain spectrum of the radar data (with high-pass-filtered phase to remove the velocity-induced component) is multiplied pointwise with the chip-generated deconvolution kernel H_deconv(jω) to suppress the vibration-induced sidebands. Reconversion to the time domain and restoration of the velocity-induced phase component yields the final corrected radar signal s′(t) in each processing window. Denoting the high-pass-filtered component of the input radar signal by s_hpf(t), the remaining residual component by s_res(t), and the corrected high-pass-filtered signal by s′_hpf(t), this simple computation may be summarized as follows:

A 34-mm eccentric rotating mass (ERM) vibration motor placed at the floor of the radar platform induces a controlled vibration ranging from 0 to 50 Hz, with the amplitude and frequency of the vibration adjusted via the applied DC voltage. A simple microcontroller interface serves to generate a variable pulse-width-modulated (PWM) signal input into the motor driver controlling the ERM operation. Radar data are obtained via a Texas Instruments (TI) AWR1843 77-GHz FMCW mmWave module, with the raw complex ADC data collected via a TI DCA1000EVM data capture evaluation board. Raw voltage measurements are recorded from a 1.8 V, ±3g Analog Devices ADXL335 analog accelerometer mounted directly behind the transmitter/receiver (TX/RX) antenna array and parallel to the ground, with the accelerometer x-axis oriented parallel to the radar line-of-sight. Rounding out the assembly, a simple box target is situated 3.5 m away from the radar platform at 0 degrees azimuth/elevation angles relative to the TX/RX antenna array.
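The frequency-domain correction step summarized earlier in this section (transform, pointwise multiplication by the kernel, inverse transform) can be sketched as follows. This is only an illustration under stated assumptions: it covers the high-pass-filtered component alone, it does not show how s_hpf(t) and s_res(t) are separated or recombined, and it uses a naive O(N²) DFT so that the example needs no external libraries. The function names are hypothetical.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (adequate for short illustrative vectors)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def apply_kernel(s_hpf, kernel):
    """Multiply the spectrum of s_hpf pointwise by the kernel and return to time domain."""
    if len(s_hpf) != len(kernel):
        raise ValueError("signal and kernel must have the same length N")
    spectrum = dft(s_hpf)
    corrected = [s_k * h_k for s_k, h_k in zip(spectrum, kernel)]
    return idft(corrected)


# Toy window: a main tone in bin 3 plus a weaker spur in bin 5.
N = 32
signal = [cmath.exp(2j * cmath.pi * 3 * t / N) + 0.5 * cmath.exp(2j * cmath.pi * 5 * t / N)
          for t in range(N)]
toy_kernel = [1.0] * N
toy_kernel[5] = 0.0  # idealized notch placed on the spur bin
corrected_signal = apply_kernel(signal, toy_kernel)
```

In the toy window the idealized kernel removes the bin-5 spur while leaving the main tone untouched; a chip-generated kernel would instead supply a finite-depth notch with an associated phase response.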
A Digilent AD2 arbitrary waveform generator is used to synthesize an input analog waveform corresponding to the recorded raw accelerometer data for an interval of time corresponding to the window of radar data. This single-channel signal synthesized by the AD2 module acts as the input to the analog front-end, with analog control and bias voltages generated via external bias circuitry. A Zynq-7000 Zybo Z7 all programmable system-on-chip/field-programmable gate array (APSoC/FPGA) serves as the primary output data interface and testbench control engine, providing digital control and timing inputs and monitoring the serial output data. Correct timing for data collection is ensured via an interrupt-based synchronization mechanism implemented on the FPGA, which indicates when each new cycle of data is available.
At this point, it is pertinent to document an unexpected non-functionality of the analog front-end frequency detection PLL. Likely the result of layout-induced parasitics, which could not be fully evaluated given the absence of RC-based extraction in the still-developing open-source tools, the charge pump output voltage was measured to be below the compatible level for the SAR ADC comparator, resulting in a ring oscillator frequency well under the minimum frequency needed to achieve the PLL lock state. The PLL was therefore bypassed by feeding its input, i.e., the rail-to-rail output from the continuous-time comparator, directly into the FPGA, which employs a simple counter to determine the vibration frequency.
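A counter-based frequency estimate of the kind used in the bypass path might look like the sketch below. The text does not specify how the FPGA counter operates, so the approach shown here (counting clock ticks between the first and last rising edges of the comparator output and averaging the period) is an assumption, as are the function name and the use of the 500-kHz clock as the sampling clock.

```python
def estimate_frequency(comparator_bits, clock_hz):
    """Estimate the input frequency from a sampled rail-to-rail comparator output.

    Counts clock ticks between the first and last rising edges and averages
    over the enclosed periods. Returns None if fewer than two rising edges
    are observed within the capture.
    """
    edges = [i for i in range(1, len(comparator_bits))
             if comparator_bits[i - 1] == 0 and comparator_bits[i] == 1]
    if len(edges) < 2:
        return None
    mean_period_ticks = (edges[-1] - edges[0]) / (len(edges) - 1)
    return clock_hz / mean_period_ticks


# Example: a 20-Hz square wave sampled at 500 kHz over a 95-ms window.
CLOCK_HZ = 500_000
window = [1 if (i % 25_000) >= 12_500 else 0 for i in range(47_500)]
f_est = estimate_frequency(window, CLOCK_HZ)
```

Measuring the period in clock ticks, rather than counting edges per window, keeps the estimate usable even when only about two vibration cycles fit inside a 95-ms window.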
The input clock frequency generated by the FPGA is minimized in order to optimize chip power consumption. Given a sampling interval of 95 ms, corresponding to the maximum on-chip resource consumption, the back-end DSP engine needs to complete only one kernel computation per window, i.e., to run at a rate of approximately 10-11 Hz; f_clk = 500 kHz satisfies this requirement.

V. MEASUREMENT RESULTS
The die micrograph of the fusion-enabled deconvolution kernel generator IC is presented in Fig. 15. As indicated, the chip measures 2.92 mm × 3.52 mm, occupying an active area of 10.28 mm². Layout of the mixed-signal IC is implemented using separate supply/ground domains for the analog and digital circuitry to attain optimal noise suppression, with corresponding I/O pads routed at the periphery of these blocks.
To evaluate the effectiveness of the designed motion compensation engine, it is necessary to first quantify the accuracy of the on-chip deconvolution kernel generation process. In other words, the phase and magnitude of the generated kernel across all phase/frequency bins are compared with the ideal transfer functions for a given vibration frequency and notch quality factor. Furthermore, kernel generation accuracy is computed across a range of vibration frequencies and notch quality factors to verify chip robustness.
Fig. 16 illustrates the ideal and chip-generated magnitude and phase transfer functions for particular vibration frequencies and notch quality factors. Note that these particular measurements were conducted via an ideal fixed-frequency/amplitude input sinusoidal waveform and indicate minimal apparent deviation between the generated and ideal transfer functions. Quantitatively, the precision of the on-chip kernel generation is illustrated in Fig. 17, which plots the mean absolute error (MAE) between the generated and ideal magnitude and phase transfer functions, normalized with respect to the maximum ranges for the magnitude and phase, respectively. The insignificant error reflects perfect vibration frequency and amplitude detection in the analog front-end, with the almost imperceptible deviation due merely to quantization error and the inherent nonlinearity of the complex transfer function evaluation process via linearized CORDIC hardware units. In other words, the measured results manifest the inherent advantage in the utilized LUT architecture: its error contribution is determined solely by the selected granularity. Optimizing this resolution leaves analog estimation imprecision as the primary source of error, which Fig. 17 confirms to be within the boundaries of the LUT step size.
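The normalized MAE metric can be sketched as below. The exact normalization used for Fig. 17 is not spelled out beyond "maximum ranges", so taking the range as the span (maximum minus minimum) of the ideal curve is an assumption, and `normalized_mae` is a hypothetical name.

```python
def normalized_mae(measured, ideal):
    """Mean absolute error between two curves, divided by the span of the ideal curve."""
    if len(measured) != len(ideal) or not ideal:
        raise ValueError("inputs must be non-empty and of equal length")
    span = max(ideal) - min(ideal)
    if span == 0:
        raise ValueError("ideal curve has zero range; cannot normalize")
    total_error = sum(abs(m - i) for m, i in zip(measured, ideal))
    return total_error / (len(ideal) * span)


# Example: one of three magnitude samples deviates by 0.1 on a unit-range curve.
example_error = normalized_mae([0.0, 0.5, 1.0], [0.0, 0.6, 1.0])
```

The same function applies to the phase curve, with the span then expressed in radians or degrees to match the data.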
Note that the chip's analog bias and common-mode offset voltages are optimized to yield accurate frequency and amplitude estimation of the input vibration signal, in accordance with the measured dynamic characteristics of the corresponding circuits. Under inevitable process and mismatch variation, these inputs may be tuned to adjust subcircuit output levels accordingly. The input gain adopted by the input PGA may also be coarsely adjusted via the two digital control bits to adapt to the observed variation. Furthermore, since the bijective mapping into the transfer function coefficient lookup table is dependent upon the LUT SRAM contents themselves, digital offsets may easily be introduced into the stored data vector arrays via user access of the serial input port. Hence, the designed analog and digital adjustment mechanisms provide flexibility and robustness against device mismatch and global variation.
As described in Section IV, the accelerometer and mmWave radar signals are recorded simultaneously and precisely synchronized, with subsequent length-N pointwise multiplication of the chip-generated deconvolution kernel and frequency-domain radar spectrum performed for each successive 95-ms time window of data to obtain the final motion-compensated Doppler profile. The degree of vibration spur suppression is quantified via the improvement in spurious-free dynamic range (SFDR) between the corrected and original, uncorrected spectra. Fig. 18 demonstrates the effectiveness of the vibratory motion compensation across a range of target velocities v_x, where v_x = (f_d λ)/2, given a corresponding Doppler shift frequency f_d. Comparing the original raw radar spectra with the final corrected spectra for each target velocity, it is evident that the vibration-induced spectral sidebands occurring at a spacing f_v about the main Doppler lobe are significantly suppressed. The resulting compensated spectra thus contain significant signal power in only the main lobe corresponding to the target velocity frequency bin.
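The two quantities used in this evaluation, the Doppler-to-velocity conversion and the SFDR improvement, can be sketched as follows. The sketch assumes a nominal 77-GHz carrier for the wavelength, defines SFDR as the ratio of the main-lobe peak power to the largest remaining bin outside a small guard region, and ignores circular wrap-around of the guard region; the guard width and function names are illustrative choices, not details from the measurements.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def doppler_to_velocity(f_d_hz, carrier_hz=77e9):
    """v_x = (f_d * wavelength) / 2, with the wavelength from an assumed carrier."""
    wavelength = SPEED_OF_LIGHT / carrier_hz
    return f_d_hz * wavelength / 2.0


def sfdr_db(power, guard=1):
    """Peak power over the largest spur, in dB, excluding bins within `guard` of the peak."""
    peak = max(range(len(power)), key=power.__getitem__)
    spurs = [p for i, p in enumerate(power) if abs(i - peak) > guard]
    return 10.0 * math.log10(power[peak] / max(spurs))


def sfdr_improvement_db(raw_power, corrected_power, guard=1):
    """SFDR of the corrected spectrum minus SFDR of the raw spectrum."""
    return sfdr_db(corrected_power, guard) - sfdr_db(raw_power, guard)


# Toy spectra: main lobe in bin 4, vibration spur in bin 7.
raw = [1e-6] * 16
raw[4], raw[7] = 1.0, 0.1
corrected = [1e-6] * 16
corrected[4], corrected[7] = 1.0, 1e-4
improvement = sfdr_improvement_db(raw, corrected)
```

In the toy spectra the spur drops from 10 dB to 40 dB below the main lobe, so the computed improvement is about 30 dB.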
Quantitative correction results are represented in Fig. 19, which plots the spur suppression across a range of vibration amplitudes and frequencies, as well as target velocities. Notably, the demonstrated motion compensation scheme exhibits an average SFDR improvement of 35 dB across vibration frequency and amplitude and an average of 26 dB across target velocity. Note that the suppression achieved via the real-time on-chip kernel generation is slightly inferior to that reported in [4] (39 dB and 29 dB, respectively, for the frequency/amplitude and target velocity sweeps). This 3-4 dB degradation is due primarily to the placement of only a single notch in the deconvolution filter, rather than notches at the first three harmonics, as is implemented in [4]. The advantage, however, is much lower computation cost, yielding maximal feasibility for real-time, edge-based sensing.
Chip measurements yield an average power consumption of 2.43 mW for a 500-kHz input clock frequency and real-time processing interval of 95 ms. This average power includes that consumed by the PLL and would be even less if considering the insignificant energy required by a digital cycle counter, as is implemented in the FPGA bypass path. To assess the demonstrated chip's performance relative to that of existing ICs, we again reference the closely related regime of full-duplex radios (FDX) for channel estimation and interference suppression. While direct, comprehensive comparison is difficult given that the chip's unique application differs slightly from that of FDX ICs, we may nonetheless obtain a holistic, streamlined measure of its relative performance via: 1) the qualitative algorithm computational complexity (a proxy for integrability on a resource-constrained edge device), and 2) the interference suppression efficiency η relative to the canceller power consumption, as defined in (13). Utilizing this efficiency metric allows for a fair evaluation of the achieved suppression G [dB] across ICs, normalizing against power consumption P [mW] (a function of workload), with greater suppression efficiency indicative of superior performance.
Note that the logarithmic relationship (i.e., dB/mW) between interference suppression and power consumption utilized by (13) may be justified in reference to standard filter performance. A simple digital FIR filter, such as that commonly used in FDX ICs, for instance, achieves a logarithmic roll-off rate that scales linearly with the filter order, or number of taps, where the latter is linearly proportional to the required power consumption. While not universally accurate given the variety of interference suppression hardware implementations, this ratio nevertheless provides a useful, approximate measure of performance efficiency across chips.
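As a worked example of the efficiency metric, the sketch below evaluates η for the measured chip. Since (13) is not reproduced here, the form η = G / P in dB/mW is inferred from the units quoted in the text and should be read as an assumption; the input values are the reported 35-dB and 26-dB average suppressions and the 2.43-mW average power.

```python
def suppression_efficiency(g_db, p_mw):
    """Suppression efficiency in dB/mW, assuming eta = G [dB] / P [mW]."""
    if p_mw <= 0:
        raise ValueError("power must be positive")
    return g_db / p_mw


CHIP_POWER_MW = 2.43
eta_vibration = suppression_efficiency(35.0, CHIP_POWER_MW)  # about 14.4 dB/mW
eta_velocity = suppression_efficiency(26.0, CHIP_POWER_MW)   # about 10.7 dB/mW
```

These values apply only to the presented IC; the corresponding figures for the FDX ICs would need the suppression and power entries of Table 3.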
Table 3 compares the metrics and qualitative computational complexity demonstrated by the presented IC with those of existing state-of-the-art FDX ICs. As tabulated, the achieved suppression efficiency of the measured chip surpasses that of existing implementations. With regard to the constituent metrics, the achieved 26-35-dB interference reduction of the measured chip either exceeds or is comparable to that of the FD transceiver ICs, where the reported SI suppression of the latter excludes RF/circulator isolation, considering only the on-chip analog/digital baseband (BB) suppression. Although the degree of on-chip computational complexity, and hence, edge device resource consumption and integration feasibility, is similar, particularly in comparison with [44], [45], and [46], the power consumption of the demonstrated IC is significantly less than that of any of the existing ICs. Note that, for a fair comparison, the cited power numbers include only the energy consumed by the SIC circuitry across all silicon designs, excluding the power dedicated to the FDX front-end RF circuitry. Furthermore, any additional energy required by the off-chip radar spectrum correction is negligible given its low-power digital pointwise multiplication implementation, while the ADXL335 average power of 630 µW amounts to a mere fraction of the IC's expenditure. The achieved metrics thus support the chip's classification as a resource-efficient edge device for real-time, fusion-based mmWave radar platform motion compensation.
One pertinent aspect to note is the chip's slightly larger area in comparison with that of the FDX ICs included in Table 3. However, as can be observed from the chip micrograph in Fig. 15, the majority of the active area is consumed by the 32 KB total of SRAM utilized for intermediate input and output data storage, a step taken to account for technology-imposed I/O constraints. By addressing this, as well as utilizing a more advanced technology node, we can expect an approximate 4× area reduction in future design iterations.

VI. CONCLUSION
This work has presented the complete design, verification, and measurement of an IC edge device permitting power- and computation-efficient real-time mmWave platform vibratory motion compensation. The analog front-end provides a means to fuse data from an auxiliary IMU/accelerometer, while the custom digital back-end efficiently generates the appropriate deconvolution kernel necessary for compensating the vibration-induced spurs witnessed in the noisy radar spectrum. Operating on short time scales of less than 95 ms for viable real-time motion compensation, the IC achieves an average 35 dB SFDR improvement across vibration frequency and amplitude and 26 dB suppression across target velocity.
In terms of limitations, although the open-source SkyWater 130-nm PDK provides an optimal platform for accessible and reproducible design, constraints incurred by still-developing design and verification tools preclude the capacity for a live demonstration. Enhanced performance can also be expected with use of a more advanced technology node. Furthermore, the degree of vibration suppression exhibited by the designed on-chip deconvolution generator is limited by the use of only a single-notch kernel, which, though advantageous with respect to computational complexity and power consumption, ultimately caps the maximum achievable SFDR. Future work involves extending the kernel to multiple notches, which can be implemented via either on-chip convolution, requiring an FFT engine to attain O(n log n) complexity, or simple notch replication at the appropriate harmonic frequency bins. And although measurements utilize only a single RX channel, incorporation of the angular dimension would be necessary to resolve and correct multiple targets at equal range with sufficiently close Doppler velocities. Relatedly, while the current architecture addresses only a single induced vibration tone, the methods presented can be generalized to a superposition of multiple tones in future implementations. Another significant improvement to the chip would be complete incorporation of the radar sensing RF/analog hardware, which would yield a fully self-contained mmWave IC with integrated motion compensation capabilities.
Despite such limitations, the designed IC nevertheless represents a notable step towards high-resolution, real-time sensing for resource-limited edge devices and autonomous systems.

FIGURE 1. Functional diagram for the IC-enabled vibratory motion compensation via real-time sensor fusion.

Fig. 2 depicts the overall architecture for the analog/mixed-signal deconvolution kernel generator chip. The 130-nm IC, fabricated in fully open-source SkyWater technology, consists of an analog front-end, used to process the frequency and amplitude of the accelerometer-detected platform vibration, and a custom on-chip DSP back-end, which formulates and stores the estimated motion deconvolution kernel phase and

FIGURE 2. High-level architecture for the designed chip.

FIGURE 3. Input amplifier chain, comprising a single-ended-to-differential converter, followed by a programmable-gain amplifier stage.

FIGURE 4. Active second-order low-pass G_m-C biquad filter circuit.

FIGURE 5. Continuous-time comparator circuit, used to generate the input to the frequency detection PLL.

FIGURE 6. Peak/envelope detector circuit in amplifier processing chain.

FIGURE 7. 8-bit SAR ADC circuit used to digitize the analog outputs from the frequency and amplitude processing chains.

FIGURE 8. (a) Diagram of a general FMCW RX chirp sequence, with labeled chirp parameters. (b) Typical IF signal spectra given a radar platform undergoing vibration at frequency f_v, for zero-, positive-, and negative-velocity targets.

FIGURE 9. Back-end digital deconvolution kernel generator top-level architecture.

FIGURE 10. Internal circuitry and signal paths of the custom DSP engine.

FIGURE 11. Viable f_v-Q parameter space and lookup table subspace for (a) B_frac = 12, (b) B_frac = 13 (design resolution), and (c) B_frac = 14 bits, as well as (d) a low-frequency close-up illustrating the critical subspace F.

FIGURE 12. Maximum achievable second-order IIR notch filter attenuation across f_v and Q for processing intervals of (a) N = 1275 samples (95 ms), and (b) N = 2550 samples (190 ms).

FIGURE 14. Measurement setup for chip evaluation and fusion-aided vibratory motion compensation.

FIGURE 15. Die micrograph of the 130-nm IC, highlighting key circuit components.

FIGURE 17. Mean absolute error (MAE) between the ideal and chip-generated kernel transfer functions: (a)/(b) magnitude MAE ϵ_m and phase MAE ϵ_φ plotted across f_v for fixed Q = 0.2, and (c)/(d) ϵ_m and ϵ_φ plotted across Q for fixed f_v = 20 Hz. MAE results indicate perfect analog-domain estimation, with errors within the bound imposed by LUT resolution.

FIGURE 18. Demonstrated vibration sideband/spur suppression for various target velocities, corresponding to various frequencies f_d for the primary Doppler lobe. (a)-(e) illustrate the original, uncorrected radar Doppler spectra, while (f)-(j) illustrate the spectral results of the IC-enabled vibratory motion compensation.

TABLE 1. Interference Suppression in Full-Duplex Radio Applications.

TABLE 2. Motion/Noise Suppression in Inertial Sensor Fusion Applications.

TABLE 3. Performance Summary and Comparison With Existing Interference Suppression FDX ICs.