A 1-ps Bin Size 4.87-ps Resolution FPGA Time-to-Digital Converter Based on Phase Wrapping Sorting and Selection

A field-programmable gate array (FPGA) high-resolution time-to-digital converter (TDC) based on phase wrapping, sorting, and selection is proposed in this paper to achieve an extremely fine bin size of 1 ps. Based on the Nutt interpolation method, a wide measurement range and a high resolution can be realized at the same time. The input signal is fed into tapped delay lines (TDL) with regularized and automated cell placements to generate a multitude of delayed signals with plenty of regularized phase shifts. Due to periodicity, those phase shifts are equivalently wrapped within one reference clock period, and phase sorting and ROM-based selection are then applied to construct a merged TDL with uniform phase division across the reference clock period. The FPGA TDC was implemented successfully on Altera Stratix IV to achieve an LSB resolution as fine as 1 ps with a measurement range of 1 s. The short-range integral nonlinearity (INL) errors are measured as −1.470 to 1.676 LSB for Stratix IV to demonstrate its excellent linearity.


I. INTRODUCTION
High-resolution time-to-digital converters (TDC) are indispensable as the time-measuring cores in many scientific and technical applications. They are widely used in measurement fields such as on-chip jitter measurement [1], time-of-flight (TOF) experiments and utilities [2], [3], [4], [5], [6], and positron emission tomography (PET) scanners [7]. With the continuous evolution of these fields, the required time-measurement precision keeps increasing and thus necessitates TDCs of ever higher resolution. In order to reduce the impact of process, voltage and temperature (PVT) variations caused by environmental factors [7], having complete control over the full characteristics of a high-end TDC is crucial. The following factors must be considered for TDC applications: low manufacturing cost, wide dynamic range, high resolution, good linearity, and low PVT sensitivity. Although most TDCs are implemented as application-specific integrated circuits (ASIC) with more design controllability and better performance, ASICs also have high implementation and manufacturing costs. These economic drawbacks make custom ASIC TDC design unsuitable for small-scale or fast-to-market production. In comparison, field-programmable gate arrays (FPGA) provide cost-effective circuit implementations with a much lower economic barrier for TDC design and more functionality for interoperation with other FPGA modules. Other advantages of FPGA realization lie in its flexibility and configurability, which shorten the development cycle, as demonstrated by the FPGA TDC IP cores in prior works [8], [9], [10].
The associate editor coordinating the review of this manuscript and approving it for publication was Ilaria De Munari.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The simplest TDC can be constructed as a high-frequency counter that counts the input time interval/pulse according to a reference clock [11]. However, achieving a sufficiently high resolution with this structure requires the FPGA to operate at a frequency that is not feasible with current state-of-the-art technologies. To make matters worse, a higher operating frequency usually increases power consumption further. In contrast, time interpolation can achieve a very high resolution at a much lower system frequency than a counter-based TDC of equal resolution, which enhances the area and power efficiency of the system. Fig. 1(a) is a conceptual timing diagram based on the classic Nutt method for time interpolation. The input pulse $T_{in}$ is segmented into $T_c$, $T_{f1}$ and $T_{f2}$. $T_c$ is synchronous with the reference clock CLK and can thus be readily measured by a coarse counter. The fractional periods $T_{f1}$ and $T_{f2}$, however, require fine TDCs or interpolators with a resolution significantly finer than $T_{CLK}$. To avoid circuit metastability associated with narrow pulse widths, both fractional periods are extended by one $T_{CLK}$ with the circuit shown in Fig. 1(b) [12], resulting in the timing diagram shown in Fig. 1(c). Once all segments of $T_{in}$ have been measured, the input pulse interval can be calculated as
$$T_{in} = T_c + T_{f1} - T_{f2} \tag{1}$$
Since the effective resolution of the TDC is dominated by the interpolator or fine TDC, many architectures focus on improving the effective resolution of the fine TDC. The rest of this section mainly discusses prior works on fine TDC development.
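As a worked illustration of the Nutt calculation, the Python sketch below takes the classic relation $T_{in} = T_c + T_{f1} - T_{f2}$ for (1) and checks that extending both fractional periods by one $T_{CLK}$ leaves the result unchanged. The function names and the numeric values are illustrative assumptions, not taken from the implemented circuit.

```python
def nutt_interval_ps(coarse_count, t_f1_ps, t_f2_ps, t_clk_ps):
    """Return T_in = T_c + T_f1 - T_f2 in picoseconds.

    coarse_count: whole reference clock periods seen by the coarse counter
    t_f1_ps, t_f2_ps: fractional periods measured by the fine TDCs
    """
    t_c_ps = coarse_count * t_clk_ps
    return t_c_ps + t_f1_ps - t_f2_ps


def extend_by_one_period(t_f_ps, t_clk_ps):
    """Model the metastability guard: widen a fractional pulse by one T_CLK."""
    return t_f_ps + t_clk_ps


T_CLK_PS = 1250  # 800 MHz reference clock

# Hypothetical measurement: 4 coarse periods, T_f1 = 300 ps, T_f2 = 850 ps.
raw = nutt_interval_ps(4, 300, 850, T_CLK_PS)

# The same measurement with both fractional pulses extended by one T_CLK.
extended = nutt_interval_ps(
    4,
    extend_by_one_period(300, T_CLK_PS),
    extend_by_one_period(850, T_CLK_PS),
    T_CLK_PS,
)

print(raw, extended)
```

Because the same extension is added to both fractional terms, it cancels in the difference, so the extension does not need to be corrected separately in this idealized model.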
Multiple techniques have been employed to implement fine FPGA TDCs, such as the large-scale phase matrix (LSPM) [12], [13], wave union [14], [15], Vernier delay lines [16], [17] and the tapped delay line (TDL). Several TDL approaches have been demonstrated in prior works, such as TDLs utilizing uniform taps [18] and non-uniform taps [19]. Within the uniform-tap TDL approach, several structural variants have also been reported, including the pseudo-segmented delay line (PSDL) TDL [20], multi-edge TDL [21], [22], [23], multi-channel TDL [13], [14], [24] and merged TDL [25].
LSPM uses delay taps on the pulse input as count-up enable signals for multiple counters connected to the taps, and then infers the time interval of the pulse input as the average of the counter outputs. As a consequence, the resolution can be enhanced by increasing the number of counters used in the circuit [13]. One of its benefits is an inherently long input range of up to 22 s, which results from its composition of multiple counters with delayed signals [13]. Since the counters have a long input range and behave effectively as both coarse and fine TDCs, no interpolation is required and LSPM is not subject to precision errors caused by measurement mismatch between coarse and fine TDCs. The main drawback is the need to utilize wire lines specific to Kintex-7 to reach a high resolution with low nonlinearity [13], which may rule out other FPGAs that lack such an unusual resource. It is feasible to generate the required delays via two-dimensional PLL arrays. However, this comes with a significant loss of TDC resolution due to the serious design constraints on the embedded FPGA PLLs [12].
TDL uses delay taps to sample the pulse input. The readings from the tap registers are then processed to yield the corresponding digital output. FPGA arithmetic carry-chain logic resources are usually used to generate the delay taps within the TDL module [18], [19], [20], [21], [24], [25]; however, TDL modules constructed from cascade chains [26] and DSP blocks [27] have also been reported. The benefits of TDL include economical resource utilization and extensive research progress and innovation, such as resolutions of up to 1.11 ps and 1.56 ps with the PSDL TDL TDC [20] and the non-uniform tap TDL TDC [19], respectively. In addition, many other TDL-type TDCs have been invented, including multi-edge TDLs (splitting one delay chain/line into multiple lines via ripples to be independently converted), multi-channel TDLs (employing multiple sub-TDL modules to improve the average of the result) and merged TDLs (employing multiple delay chains/lines and selecting taps to achieve close to equal-interval taps). One of the drawbacks, as noted in [13], is that the fine measurement range is constrained by the delay line length, which necessitates adding a coarse TDC in a two-stage TDC setup to enlarge the input range in a cost-effective manner. However, the integration of coarse and fine TDCs induces precision errors caused by the mismatch between them. Such a two-stage setup also inevitably increases the logic/area resources required for the TDC.
Wave union (WU) is a technique implemented on top of TDL. It modulates the pulse input to generate multiple pulses as inputs to the underlying TDL circuit; the number of pulses is finite for the finite-step response (FSR) wave union launcher and infinite for the infinite-step response (ISR) wave union launcher [15]. The converted results of the generated pulses can be averaged to enhance the TDC resolution. Accordingly, one of the main benefits is an extraordinarily high resolution, for example, the 0.7 ps LSB resolution with 3.29 ps RMS resolution achieved by an FSR-type wave union launcher [14]. While ''RMS resolution'' is used to describe a measure of precision in this paper and others [14], [19], [20], some prior works use ''RMS precision'' to describe the same measure [23]. Another benefit is bin size reduction by averaging large bins with smaller bins, for example, the 10 ps resolution accomplished on a Cyclone II FPGA device [15]. However, one of the drawbacks is high nonlinearity, with integral nonlinearity (INL) errors as high as −25.09 LSB to 54.92 LSB reported [14]. Another drawback is the additional dead time and complexity needed to correct WU-specific U-type and W-type edge cases, as shown by the 16-cycle, 45 ns dead time observed by Wu and Shi [15].
The Vernier delay line uses two controllable delay line loops with oscillation frequencies very close to each other for accuracy enhancement. One loop is controlled by the pulse input and is counted by the other loop, which is controlled by the clock-synchronized pulse input, to generate the output code. Unlike in TDL structures, the delay line loops used in Vernier structures do not generate ultra-wide bins. Consequently, a more accurate clock division can be achieved with lower nonlinearity errors, as shown by the DNL of −0.20 LSB to 0.25 LSB and INL of 0.03 LSB to 0.82 LSB reported in [16]. One of the drawbacks of the technique is the need to circulate for a large number of cycles for each measurement [16]. This leads to a long dead time, which can be as high as 602 ns [16] and 1042.1 ns [17].
To explore the extreme capability of FPGA TDCs, a brand-new linearization technology for the TDL structure based on phase wrapping, sorting and selection is proposed in this paper. It further narrows the performance gap between FPGA and full-custom TDCs without the need for any instrument-based bin-by-bin calibration, which saves time and cost. The rest of the paper is organized as follows. Section II presents the operation principle of the proposed TDC. Section III reveals the circuit structure and elaborates on its FPGA implementation. Section IV presents the experimental results. Finally, Section V concludes this work.

II. PROPOSED STRUCTURE AND OPERATION PRINCIPLES
In this paper, TDL is chosen as the interpolator of the proposed TDC due to its simplicity and its extensive coverage and analysis in prior works. Improvements to TDL shown in prior works include fine resolution with low usage of logic resources [13], [14], efficient delay lines [20], bubble elimination [21], multiple measurements with common TDL outputs [24], and elimination of ultra-bin related nonlinearities [25]. However, such improvements are not without drawbacks, which include custom routing lines for LSPM [13], bubble errors in multi-channel WU [14], limited to no multi-channel compatibility in PSDL [20], saturation of performance improvements in multi-edge TDL [21], multiple calibration units for each delay line in multi-channel TDL [24], and LSB resolution tied to ultra-bin size in merged TDL [25]. Fig. 2 shows the schematic diagram of a TDC with a traditional tapped delay line as the interpolator. The start signal propagates through the buffers, each with an intrinsic propagation delay of $\tau$. The state of the TDL is sampled on the rising edge of the stop signal, and the resolution is equivalent to $\tau$ in theory. The output of the traditional TDL TDC is a thermometer code, which can be converted into a binary code as the final output.
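The thermometer-to-binary step of a traditional TDL TDC can be sketched in Python as follows. This is a minimal illustration under assumed conventions (tap 0 is nearest the input, a sampled 1 means the start edge has passed that tap); the ones-counting variant is a commonly used bubble-tolerant alternative and is not claimed to be the encoder used in any of the cited works.

```python
def thermometer_to_binary(taps):
    """Return the position of the first 0 in an ideal thermometer code.

    taps: sequence of 0/1 tap samples, tap 0 nearest the delay line input.
    """
    code = 0
    for bit in taps:
        if bit != 1:
            break
        code += 1
    return code


def ones_count(taps):
    """Bubble-tolerant alternative: count every 1 regardless of position."""
    return sum(1 for bit in taps if bit == 1)


TAU_PS = 20  # hypothetical cell delay

clean = [1, 1, 1, 1, 0, 0, 0, 0]   # ideal thermometer code
bubbled = [1, 1, 1, 0, 1, 0, 0, 0]  # one bubble near the transition

print(thermometer_to_binary(clean) * TAU_PS)   # interval estimate in ps
print(thermometer_to_binary(bubbled), ones_count(bubbled))
```

With the bubbled sample, the first-zero encoder and the ones counter disagree by one code, which is the kind of error that bubble elimination schemes are meant to remove.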
The TDC resolution can be improved by employing the phase wrapping method [28]. The input signal $T_{in}$ is quantized by multiple delayed copies of the system clock as shown in Fig. 3, and the latch flip-flops are replaced with counters so that $T_{in}$ is quantized by all delayed clocks. The clock is delayed by $\tau$ (with $\tau < T_{CLK}$) for each tap of the delay line. The final output code is obtained simply by summing the results of all counters, which circumvents the usual thermometer-to-binary code conversion of traditional TDL TDCs. The resolution is equivalent to that of a counter-based TDC with a theoretical clock period of $\tau$, which can be made much smaller than $T_{CLK}$ as shown in Fig. 3(c) [28]. In practice, the equivalent resolution becomes the cell delay or the phase shift among clocks, which limits the achievable TDC resolution to the FPGA logic gate delay. However, when a specific delay happens to be larger than $T_{CLK}$, it is wrapped back into the reference clock period due to the periodicity of the clock signal. The wrapped phase $\tau'$ for $\tau > T_{CLK}$ can be calculated as [28], [29], [30]
$$\tau' = \tau \bmod T_{CLK} = \tau - \left\lfloor \frac{\tau}{T_{CLK}} \right\rfloor T_{CLK} \tag{2}$$
where mod and $\lfloor \cdot \rfloor$ are the modulo operation and the floor function, respectively. In general, $\tau$ of an individual delay cell in advanced technologies is much shorter than $T_{CLK}$, so phase wrapping may not occur on short delay lines. To achieve phase wrapping, the aggregated delay at a specific tap must satisfy $\tau \ge T_{CLK}$, which can be realized by inserting enough delay cells into the delay line. In this way, the effective resolution $\tau'$ can become much smaller than both the cell delay $\tau$ and the clock period $T_{CLK}$, giving a much better resolution than the traditional $\tau$-limited TDL TDC. Increasing the number of delay cells in the delay line increases the wrapping density of the delayed clocks and thus improves the effective TDC resolution.
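The wrapping operation, i.e. the aggregated tap delay taken modulo the clock period, can be illustrated with a short Python sketch. The uniform 410 ps cell delay and the eight-tap line are hypothetical values chosen only to make the effect visible; real cell delays are non-uniform.

```python
import math


def wrap_phase(tau_ps, t_clk_ps):
    """Wrapped delay: tau' = tau - floor(tau / T_CLK) * T_CLK."""
    return tau_ps - math.floor(tau_ps / t_clk_ps) * t_clk_ps


T_CLK_PS = 1250       # 800 MHz reference clock
CELL_DELAY_PS = 410   # hypothetical uniform cell delay

# Aggregated delay at each of eight taps, then the wrapped phase of each tap.
cumulative = [CELL_DELAY_PS * i for i in range(1, 9)]
wrapped = [wrap_phase(t, T_CLK_PS) for t in cumulative]

# After sorting, neighbouring wrapped phases can sit far closer together
# than one cell delay.
sorted_wrapped = sorted(wrapped)
gaps = [b - a for a, b in zip(sorted_wrapped, sorted_wrapped[1:])]

print(wrapped)
print(sorted_wrapped, min(gaps))
```

In this toy case the smallest spacing between sorted wrapped phases is 20 ps although every cell contributes 410 ps, and the order of the wrapped phases no longer follows the tap order, which is why sorting is needed before selection.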
The difficulty in practical implementation is that it is impossible to keep the same delay $\tau$ among delayed clocks, and even harder to achieve a homogeneous $\tau'$ after phase wrapping, due to device mismatch and the additional parasitic elements introduced by the APR (automatic placement and routing) procedure, as revealed in Fig. 4. To alleviate this problem, a TDC based on delay wrapping and output averaging was proposed in [28]. With multiple TDC cores performing parallel measurements, the outputs are averaged and rounded to get the final result. Averaging reduces the delay variation and improves the RMS resolution at the expense of extra FPGA logic utilization. Another solution is delay cell merging to generate nearly homogeneous delay among merged cells [25], [31]. Although the original delay cells have highly variant delay times, multiple cells can be placed and merged as a new cell in the delay line to generate a nearly uniform cell delay, composing the so-called merged delay line (MDL). The cost is poor TDC resolution, since a merged cell, as a combination of delay cells, inevitably has a longer delay. A similar strategy is adopted in this paper, but the wrapped cell delay $\tau'$ is merged instead of the cell delay $\tau$ itself, which not only substantially enhances the resolution but also produces a nearly uniform phase shift among the delayed clocks as shown in Fig. 4. In theory, a very long delay line needs to be used to generate more than enough phases for the target resolution. However, the order of the wrapped delays is substantially varied and largely uncorrelated to the original cell delays as shown in Fig. 4(a), e.g. $C_i$ and $C_j$ becoming $\tau'_3$ and $\tau'_5$, so the wrapped delays need to be sorted again before selection.
After the wrapped phases are sorted using accurate post-layout simulation, they are expected to be unevenly spaced. Therefore, appropriate wrapped phases are selected to form the target MDL according to the predefined bin size. This selection process is equivalent to the merged window operation [24] applied after wrapping and sorting, and it forms a much more uniform merged delay line. It suppresses the DNL of the final TDC and thus effectively improves the linearity of the system. The more wrapped phases are available for selection, the easier it is to achieve a uniform phase shift for the MDL. Assuming there are $n$ wrapped phases uniformly distributed in one reference clock period after selection, the bin size of the proposed TDC can be calculated as
$$\text{Bin size} = \frac{T_{CLK}}{n} = \frac{1}{n \cdot f} \tag{3}$$
where $f$ is the operating frequency of the reference clock. Better resolution can be achieved with a higher frequency or a larger number of selected wrapped phases. Equivalently, the required number of selected phases for a specific bin size is
$$n = \frac{1}{f \cdot \text{Bin size}} \tag{4}$$
and the number of phases/delays generated by the original delay line must be much larger than $n$ to ensure good enough linearity after wrapping, sorting and selection. It is worth noting that the pulse width of the clock signal is shrunk or stretched due to the mismatch among delay cells [32]. Even if the duty cycle of the reference clock is set to 50%, a high-frequency clock signal is likely to saturate or vanish before reaching the end of a long delay line [33], causing a large measurement error. This effectively puts a physical limit on the number of delay elements that can be placed in a single delay line. Although the problem can be solved by shortening the delay line or reducing the clock frequency, the effective resolution is then also reduced according to (3).
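The relation between bin size, clock frequency and number of selected phases can be checked numerically with the Python sketch below. It assumes the bin size equals $T_{CLK}/n$, as the surrounding text describes; the function names are illustrative, and the 240 × 32 cell count is the figure quoted in Section III.

```python
def bin_size_ps(n_phases, f_hz):
    """Bin size in ps for n uniformly spaced phases in one clock period."""
    return 1e12 / (n_phases * f_hz)


def required_phases(target_bin_ps, f_hz):
    """Number of selected phases needed to reach a target bin size."""
    return round(1e12 / (target_bin_ps * f_hz))


F_HZ = 800e6  # reference clock frequency

n = required_phases(1.0, F_HZ)
print(n, bin_size_ps(n, F_HZ))

# Redundancy of generated over selected phases for 240 lines of 32 cells.
generated = 240 * 32
print(generated, generated / n)
```

For an 800 MHz clock and a 1 ps bin this gives 1,250 selected phases, so a 7,680-cell design generates roughly six times more candidate phases than are finally selected.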
A possible solution to this problem is to utilize a 2-D delay matrix that provides the same number of delays while limiting the number of delay cells in one line [12]. However, multiple 1-D delay lines connected in parallel as depicted in Fig. 5(a) can provide the same number of phases with even fewer delay cells than a 2-D delay matrix, since no vertical delay line is required. As shown in Fig. 5(b), the wiring delays from the input to all delay lines spread over several nanoseconds, which is much larger than the target ps-level resolution and enough to diversify the phases generated by those delay lines. These multiple delay lines are equivalent to a single 1-D delay line of $k \times m$ cells, but with a maximum propagation path of merely $k$ instead of $k \times m$ delay cells, where $m$ is the number of delay lines and $k$ is the number of cells per line. The impact of pulse shrinking/stretching is thus substantially reduced. It is even better to feed $T_{in}$ instead of the clock to the multiple delay lines and then sample all delayed inputs by CLK, as revealed in Fig. 5(c), which provides exactly the same resolution but allows the circuit to operate at a much higher frequency [12]. Because fewer delay cells exist in each delay line, both the dead time and the offset of the system are dramatically reduced. With wrapping, sorting and selection applied to all the phases generated by the multiple delay lines, the final uniform MDL of the proposed TDC can be successfully composed from the uneven wrapped phases $\tau'_i$ as depicted in Fig. 5(d).

III. CIRCUIT IMPLEMENTATION
Among the most accurate prior works, three FPGA TDCs are reported to achieve 0.7 ps for the wave-union TDC [14], 1.11 ps for the PSDL TDL TDC [20] and 1.29 ps for the LSPM TDC [13]. However, TDC performance is not solely determined by resolution. The nonlinearity error is another key parameter for performance evaluation. The proposed TDC is expected to have a much lower nonlinearity error than the prior works while retaining the same level of resolution, avoiding the painful tradeoff between linearity and accuracy. The implementation is explained in depth as follows.

A. CONSTRUCTION OF MULTIPLE DELAY LINES
The reference clock frequency of the FPGA TDC implemented on Altera Stratix IV is set to 800 MHz by a PLL through overclocking the on-chip oscillator, which gives a reference clock period of 1,250 ps. In order to achieve the same level of accuracy as the best prior works, the resolution is set to 1 ps. To overcome the impact of the pulse shrinking/stretching mechanism on such a high-frequency clock, the number of cells in each delay line is set to 32 or fewer, as determined by experiments. For the successful selection of 1,250 wrapped delays under the constraint of |INL| ≤ 0.5 LSB, which ensures that all output bits are valid, 240 delay lines are required to compose a fine TDC. The total number of delay cells is 7,680, which is much larger than the 1,250 predicted by (4). The excess delay cells are necessary to counteract the non-uniformity of cell delays after wrapping, so that all required delay stages can be selected successfully at such a high resolution of 1 ps bins. Similarly, the stringent ±0.5 LSB INL requirement is necessary to confine the inaccuracy caused by TimeQuest, Intel Quartus II's timing analyzer, which adds up with equipment inaccuracy and other defects in the final measurement error. In this way, the proposed phase wrapping, sorting and selection eliminates the need for a very fine cell delay $\tau$ to achieve a much higher resolution $\tau'$, and the FPGA combinational logic can be readily used to construct the required virtual delay line. In theory, the selector connects the 1,250 selected cells that fit the INL requirement to their corresponding counters for final output accumulation. This is done by connecting/enabling or disconnecting/blocking each cell to its corresponding counter in the counter array. The conceptual block diagram of the fine TDC is plotted in Fig. 6 with m = 240, k = 32 and n = 1250.
The detailed structure of the fine TDC is shown in Fig. 7(a), and its corresponding logic layout after APR is shown in Fig. 8(a). As shown in the figure, the APR process yields an irregular layout, which induces more wiring delay and parasitic variation. This irregularity is expected to increase both the difficulty of phase selection and the cell redundancy. To resolve this problem, a C++ program was written to place all the delay cells regularly in the adaptive logic modules (ALM) inside the logic array blocks (LAB) by modifying the Quartus II settings file (.qsf).
To achieve the best uniformity among cell delays, the cells of each delay line are placed regularly along the horizontal or vertical direction so that the parasitics among cells are as uniform as possible, as depicted in the exemplified user-defined placement of Fig. 7(b). This placement of critical logic elements is applied by assigning each LAB to a predefined location in the .qsf file with the set_location_assignment command. It is realized by a composed C++ program with the syntax shown below:
    set_location_assignment <value> -to <destination>
    <value>: <Elem>, <X coord>, <Y coord>, <Z coord>
    <destination>: the element name after compilation
<Elem> can be LABCELL (logic array block cell), MLABCELL (memory LAB cell) or FF (flip-flop); <X coord> and <Y coord> indicate the location of the designated element, and <Z coord> (also labeled as N) sets the number of the assignable resource (such as an ALUT or register) inside the specified element [35]. Since the routing between each delay cell and its corresponding counter has a large impact on the counting result, the LSB of the corresponding counter is also placed in the same ALM to get the best timing accuracy. The logic structure of the Stratix IV ALM is illustrated in Fig. 7(c) [34]. Each ALM is only able to realize one delay cell and one counter bit due to the physical wiring constraint. Because they have an insignificant impact on timing, all the other counter bits are placed outside the regular placement region to ease the design and implementation. The resulting layout for the user-defined placement plotted in Fig. 8(b) shows much more regular and ordered wiring than the one shown in Fig. 8(a). To prevent further modification of the user-defined placement and preserve the timing among delay cells and counters, LogicLock, a Quartus floor-planning software utility [35], is used to reserve this region exclusively for the fine TDC submodules, which are separated from the rest of the circuit.
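The regular placement described above amounts to emitting one location assignment per delay cell. The authors used a C++ program for this; the Python sketch below only illustrates the idea. The location string layout `<Elem>_X<x>_Y<y>_N<n>`, the node names, the coordinates and the one-cell-per-row stepping are all assumptions made for illustration and should be checked against the Quartus documentation and the compiled netlist before use.

```python
def location_assignment(elem, x, y, n, destination):
    """Format one .qsf placement line.

    The <Elem>_X<x>_Y<y>_N<n> location string is an assumed layout.
    """
    return f"set_location_assignment {elem}_X{x}_Y{y}_N{n} -to {destination}"


def place_delay_line(line_index, x, y_start, n_cells):
    """Place the cells of one delay line along the vertical direction."""
    lines = []
    for cell in range(n_cells):
        # Hypothetical post-compilation node name of the delay cell.
        name = f"tdl_{line_index}_cell_{cell}"
        lines.append(location_assignment("LABCELL", x, y_start + cell, 0, name))
    return lines


# Example: a hypothetical 4-cell line placed in column X = 10 from Y = 5.
qsf_lines = place_delay_line(line_index=3, x=10, y_start=5, n_cells=4)
print("\n".join(qsf_lines))
```

Generating the assignments from a script rather than by hand makes it straightforward to regenerate the placement whenever the number or length of the delay lines changes.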

B. TIMING RETENTION AND PROCEDURE FOR PHASE WRAPPING, SORTING, AND SELECTION
The flow chart of delay line construction is shown in Fig. 9. First, the number (m) and length (k) of the delay lines are set at the beginning of the iteration so that the total number of delay cells is almost double the theoretically required number (1,250). The succeeding delay line placement is executed by one of our C++ programs. A new (m, k) value pair is chosen if the program fails to fit the physical layout restrictions of the FPGA. After successful user-defined delay line placement and circuit synthesis, TimeQuest is utilized to perform timing analysis with all gate delays and layout parasitics included. The analysis result contains detailed timings for all delay cells, which are saved in a timing report file (.rpt). Then, phase wrapping is executed by another C++ program, which calculates the remainders of all cell delays divided by the clock period (1,250 ps) according to (2); the remainders spread over the reference period from 0 ps to 1,249 ps. Next, phase sorting and selection are executed in accordance with the given resolution (1 ps) to construct the virtual delay line with a simulated |INL| ≤ 0.5 LSB. If this fails, a larger m or k needs to be set to generate more cell delays for wrapping, sorting and selection. After some iterations, the virtual delay line is finally constructed. The ROM bits of all delay cells selected to compose the virtual delay line are set to '1' to enable their corresponding counters. The ROM bits of all the other, non-selected delay cells are reset to '0' to disable their succeeding counters by clock gating, which will be explained in the next subsection.
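The sorting and selection steps of this flow can be sketched in Python as below. This is a simple greedy nearest-phase selection written for illustration and is not claimed to be the authors' C++ algorithm: for each ideal bin position it picks the closest unused wrapped phase, fails if the deviation exceeds the tolerance, and returns the ROM enable bits in the original cell order. The function name and the small example values are hypothetical.

```python
import bisect


def select_phases(wrapped_ps, t_clk_ps, bin_ps, tol_lsb=0.5):
    """Return ROM bits (1 = cell selected) or None if selection fails."""
    # Phase sorting: cell indices ordered by wrapped phase.
    order = sorted(range(len(wrapped_ps)), key=lambda i: wrapped_ps[i])
    phases = [wrapped_ps[i] for i in order]
    n_bins = round(t_clk_ps / bin_ps)
    rom = [0] * len(wrapped_ps)
    for k in range(n_bins):
        target = k * bin_ps
        pos = bisect.bisect_left(phases, target)
        candidates = [p for p in (pos - 1, pos) if 0 <= p < len(phases)]
        if not candidates:
            return None  # ran out of cells
        best = min(candidates, key=lambda p: abs(phases[p] - target))
        if abs(phases[best] - target) > tol_lsb * bin_ps:
            return None  # no cell close enough: more cells are needed
        rom[order[best]] = 1
        # Remove the chosen cell so it cannot be selected twice.
        del phases[best]
        del order[best]
    return rom


# Toy example: 100 ps period, 20 ps bins, eight candidate wrapped phases.
wrapped = [79, 3, 41, 26, 62, 18, 95, 1]
rom_bits = select_phases(wrapped, t_clk_ps=100, bin_ps=20)
print(rom_bits)

# Too few cells: the selection fails and (m, k) would have to be enlarged.
print(select_phases([1, 3], t_clk_ps=100, bin_ps=20))
```

A `None` result corresponds to the failure branch of the flow, where a larger m or k is chosen and placement, timing analysis and selection are repeated.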
The main challenge is to connect the counters of the selected delay cells to the accumulator shown in Fig. 6. After phase selection, some modifications to the whole FPGA module are still required to connect the selected counters to the accumulator by hard wires. Recompilation is required to actualize these changes in the module; however, it affects the timings of the selected cells and results in a big loss of accuracy. To address the recompilation problem, incremental compilation can be tried by creating design partitions, which allow each partition to be synthesized and placed separately and thus prevent recompilation across partition boundaries [36]. However, the selected counters inside the design partitions still need to be wired to the accumulator. Recompilation cannot be prohibited completely, and the cell delays are changed again. On the other hand, programmable logic can accommodate changes to a system late in the design cycle. These last-minute design changes, commonly called engineering change orders (ECOs), allow functionality changes of a circuit even after the design has been fully compiled [36]. The problem is that only approximately 20 changes can be made through Quartus II, whereas the wirings of multiple delay lines involve thousands of cells, which makes ECOs infeasible as well.
To entirely avoid recompilation and retain the accurate timing after selection, all hard-wired connections between the delay cell counters and the accumulator need to exist before compilation. Since there is no way to know before compilation which cells will be selected, one feasible solution, clock gating, is to connect all counters to the accumulator and enable only the selected ones to be accumulated according to the information stored in the on-chip FPGA ROM (read-only memory), which can be viewed as a realistic implementation of the selector in Fig. 6. The phase wrapping, sorting and selection after compilation and timing analysis are performed automatically by the composed C++ programs, and the corresponding enabling/gating information for the counters of all selected/unselected cells is downloaded to the ROM through the JTAG (Joint Test Action Group) interface.
The complete circuit of the proposed TDC is drawn in Fig. 10 according to the classic Nutt method for dynamic range expansion. The widths of $T_{f1}$ and $T_{f2}$ are both less than one reference clock period $T_{CLK}$, so 1-bit counters are adequate to quantize them in theory. However, the widths of $T_{f1}$ and $T_{f2}$ are occasionally larger than $T_{CLK}$ due to jitter, so 2-bit counters are used in practice. The contents of all enabled counters are accumulated and then combined with the coarse counter content to form the final output according to (1).
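The ROM-gated accumulation and its combination with the coarse counter can be modelled behaviourally as in the Python sketch below. It is an idealized model under stated assumptions, not the RTL: clock edges are taken at integer multiples of $T_{CLK}$, each delayed copy of a fractional pulse starts at its cell delay, the output is formed as coarse count × n plus the difference of the two fine codes (assuming the Nutt relation $T_{in} = T_c + T_{f1} - T_{f2}$), and all names and numbers are hypothetical.

```python
def counter_value(delay_ps, width_ps, t_clk_ps):
    """Clock edges seen by a pulse of width_ps delayed by delay_ps."""
    return (delay_ps + width_ps) // t_clk_ps - delay_ps // t_clk_ps


def fine_code(delays_ps, rom_bits, width_ps, t_clk_ps):
    """Accumulate only the counters whose ROM bit is 1 (clock gating)."""
    return sum(
        counter_value(d, width_ps, t_clk_ps)
        for d, enabled in zip(delays_ps, rom_bits)
        if enabled
    )


def tdc_output(coarse_count, code_f1, code_f2, n_phases):
    """Final code in LSB: coarse periods plus fine difference."""
    return coarse_count * n_phases + code_f1 - code_f2


T_CLK = 10                               # toy clock period
delays = [10, 13, 22, 7, 4, 36, 9, 18]   # raw cell delays (some exceed T_CLK)
rom = [1, 0, 1, 0, 1, 1, 0, 1]           # selected cells wrap to 0, 2, 4, 6, 8
n = sum(rom)                             # 5 phases, so one LSB is 2 time units

# Fractional pulses T_f1 = 3 and T_f2 = 7, each extended by one T_CLK.
code_f1 = fine_code(delays, rom, 3 + T_CLK, T_CLK)
code_f2 = fine_code(delays, rom, 7 + T_CLK, T_CLK)
code = tdc_output(4, code_f1, code_f2, n)

print(code_f1, code_f2, code, code * T_CLK // n)
```

In this toy case the reconstructed interval equals the true value 4·10 + 3 − 7 = 36 time units, and no enabled counter exceeds a count of 2, consistent with the use of 2-bit counters.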

IV. MEASUREMENT RESULTS
For Altera Stratix IV, the 800 MHz system clock is generated by an overclocked PLL driven by the on-chip 50 MHz crystal oscillator. The utilizations of combinational adaptive look-up tables (ALUT) and dedicated logic registers on Altera Stratix IV are 32% and 19%, respectively.
To evaluate the real performance of the proposed TDC, the measurement environment is set up as follows. Test pulses are generated by an Agilent 81134A pulse/pattern generator and measured both by the realized FPGA TDC and by a Tektronix DPO70404 Digital & Mixed Signal Oscilloscope, which measures the pulse width accurately by averaging tens of thousands of samples with the finest waveform interpolation setting to average out the impact of jitter and noise. Since such accurate measurements are quite time-consuming, SikuliX, a GUI automation software testing tool by Raimund Hocke, is used to coordinate the vendor development programs that control all equipment and the FPGA automatically, as illustrated in Fig. 11.
In order to demonstrate the short-range linearity of the proposed TDC, test pulses with widths ranging from 5 to 5.2 ns in 1 ps steps are measured by the FPGA TDC at an ambient temperature of around 25 °C, as shown in Fig. 12. The short-range INL is first measured for the Altera Stratix IV TDC without phase wrapping, sorting and selection and is as large as −10.382 to 10.530 LSB in Fig. 12(a), which is not comparable to the linearity of the best prior works. Then, the short-range DNL and INL are measured again for Altera Stratix IV with the proposed method and are −1.767 to 1.759 LSB and −1.470 to 1.676 LSB, respectively, in Fig. 12(b) and (c), which demonstrates the effectiveness of the new method and the exceptional linearity of the proposed TDC. To demonstrate the impact of temperature variation, the long-range INL is measured for pulse widths ranging from 100 ns to 1000 ns in 5 ns steps and is −1.542 to 1.838 LSB, −1.963 to 1.819 LSB and −1.156 to 0.986 LSB for Altera Stratix IV at 0 °C, 25 °C and 50 °C, respectively, as drawn in Fig. 13. This indicates that the proposed TDC architecture is rather robust and temperature-insensitive. The RMS resolution is measured to be 4.87 ps for Altera Stratix IV as demonstrated in Fig. 14.
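For readers reproducing such a sweep, the Python sketch below shows one common way to derive DNL and INL from a list of reference pulse widths and the corresponding TDC output codes. The text does not state how the DNL and INL were computed, so the conventions here (referencing both axes to the first sample, expressing errors in LSB, taking DNL as the increment of INL) are assumptions, and the data are made up.

```python
def dnl_inl(widths_ps, codes, lsb_ps):
    """Return (dnl, inl) in LSB for a monotonic width sweep.

    widths_ps: reference pulse widths, e.g. from an oscilloscope
    codes:     TDC output codes for the same pulses
    """
    ideal = [(w - widths_ps[0]) / lsb_ps for w in widths_ps]
    measured = [c - codes[0] for c in codes]
    inl = [m - i for m, i in zip(measured, ideal)]
    # DNL: deviation of each code step from the ideal step.
    dnl = [inl[k + 1] - inl[k] for k in range(len(inl) - 1)]
    return dnl, inl


# Made-up sweep: 1 ps steps, 1 ps LSB.
widths = [5000, 5001, 5002, 5003, 5004, 5005]
codes = [5000, 5001, 5003, 5003, 5004, 5006]

dnl, inl = dnl_inl(widths, codes, lsb_ps=1.0)
print(dnl)
print(inl, min(inl), max(inl))
```

The reported INL range then corresponds to the minimum and maximum of the `inl` list over the sweep.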
Table 1 summarizes the performance of the implemented FPGA TDC along with that of prior works for easy comparison. Even though it is realized with a less advanced process node, the resolution of the implemented TDC is comparable to those of the wave-union and PSDL TDCs [14], [20], with significantly lower INL errors.

V. CONCLUSION
With the help of the regularized and automated TDL LAB cell placement and the newly proposed phase wrapping, sorting and ROM-based selection process, the proposed TDC architecture achieves an LSB resolution as fine as 1 ps with a short-range INL of −1.470 to 1.676 LSB, an RMS resolution of 4.87 ps, and a dead time of 28 ns on Altera Stratix IV, giving the best linearity compared to the prior works. In brief, the proposed structure accomplishes the best linearity and is capable of delivering a more accurate and linear TDC design across different FPGA platforms.
More importantly, all technical details for the implementation of the proposed circuit are revealed to help interested readers speed up their own high-accuracy FPGA TDC designs. To further accelerate the design of time-domain FPGA applications, we will focus in the future on the development of a low-jitter, high-accuracy FPGA DTC evolved from the proposed architecture and on the integration of built-in self-test (BIST) technology for TDCs.
In 1996, he joined the School of Computer Science and Systems Engineering, Kyushu Institute of Technology, Japan, where he is currently the Vice President for Education/Student/Information and a Professor. His research interests include logic testing and dependable systems. He is a member of the IPSJ and fellow of the IEICE. He received the Young Engineer Award from IEICE, in 1997, the IEEE ITC 2005 Most Significant Paper Award, and several Best Paper Awards from IEICE, IEEE WRTLT, and IEEE ATS.
TRIO ADIONO (Senior Member, IEEE) received the B.Eng. degree in electrical engineering and the M.Eng. degree in microelectronics from the Institut Teknologi Bandung, Indonesia, in 1994 and 1996, respectively, and the Ph.D. degree in VLSI design from the Tokyo Institute of Technology, Japan, in 2002. He is currently a Professor with the School of Electrical Engineering and Informatics, and also works as the Head of the IC Design Laboratory, Microelectronics Center, Institut Teknologi Bandung. He holds a Japanese Patent on a high quality video compression system. His research interests include VLSI design, signal and image processing, VLC, smart cards, electronics solution design, and integration.
Taichung where he is currently a Researcher. His research interests include field test, delay testing, and temperature sensor design. He is a member of the Institute of Electronics and Information and Communication Engineers, and the IEEE.