Multichannel, Low Nonlinearity Time-to-Digital Converters Based on 20 and 28 nm FPGAs

This paper presents low nonlinearity, compact, and multichannel time-to-digital converters (TDC) in Xilinx 28 nm Virtex 7 and 20 nm UltraScale field-programmable gate arrays (FPGAs). The proposed TDCs integrate several innovative methods that we have developed: 1) the subtapped delay line averaging topology; 2) tap timing tests; 3) a direct compensation architecture; and 4) a mixed calibration method. The code density tests show that the proposed TDCs have much better linearity performances than previously reported ones. Our approach is cost-effective in terms of the consumption of logic resources. To demonstrate this, we implemented 96 channel TDCs in both FPGAs, using less than 25% of the logic resources. The achieved least significant bit (LSB) is 10.5 ps for Virtex 7 and 5.0 ps for UltraScale FPGAs. After the compensation and calibration, the differential nonlinearity (DNL) is within [–0.05, 0.08] LSB with σDNL = 0.01 LSB, and the integral nonlinearity (INL) is within [–0.09, 0.11] LSB with σINL = 0.04 LSB for the Virtex 7 FPGA. The DNL is within [–0.12, 0.11] LSB with σDNL = 0.03 LSB, and the INL is within [–0.15, 0.48] LSB with σINL = 0.20 LSB for the UltraScale FPGA.

TDCs can be realized through analog or all-digital methods [16]. Recently, all-digital topologies have become very popular, and they can be implemented in application-specific integrated circuits (ASICs) [17], [18] or field-programmable gate arrays (FPGAs) [14], [19]-[34] with subnanosecond resolution. The ASIC-based TDC is a mature solution because of its better precision and linearity [35]. However, ASIC approaches tend to be more suitable for large-scale application-specific or general-purpose commercial developments. Compared with ASIC approaches, FPGAs provide greater flexibility at a lower cost and with a shorter development cycle. With the rapid advances of FPGA technologies, powerful design environments, and a wide variety of applications, FPGAs are suitable for design verification, scientific experiments, and high-end instruments, and have become an ideal platform for integrated system design.
The timing resolution is a primary parameter for a TDC. For the simplest counter-based TDCs, the resolution is limited by the frequency of the driving clock [16]. To break this limitation and achieve picosecond resolution, the Vernier delay line [33] and tapped delay line (TDL) [20], [32], [34] architectures have been presented and widely applied. The TDL has recently become a mainstream method for implementing FPGA-TDCs [14], [19], [36], since it can be easily realized by using the carry chain modules in FPGAs. The resolution of TDL-TDCs is mainly determined by the manufacturing process of FPGAs, and it improved from 200 to 3.9 ps root mean square (rms) between 1997 and 2017 [34], [37].
The wave union method [19], multichain averaging method [26], [38], matrices of counters [32], and two-dimensional (2-D) Vernier structure [31] have also been presented to break this "process-related" limitation. These methods can achieve a better resolution than raw-TDL TDCs; however, they usually require many more logic resources and have higher system complexity or a much longer dead time.
The nonlinearity of a TDC is another vital parameter, since it directly influences the measurement precision. The nonlinearity can be characterized by the differential nonlinearity (DNL) and the integral nonlinearity (INL) based on the code density tests [16], [35], expressed as follows [35]:
$$\mathrm{DNL}[k] = \frac{W[k] - W_{\mathrm{ideal}}}{W_{\mathrm{ideal}}}, \qquad \mathrm{INL}[k] = \sum_{n=0}^{k} \mathrm{DNL}[n]$$
where $W[k]$ is the measured width of the kth bin and $W_{\mathrm{ideal}}$ is the ideal binwidth. Compared with ASIC-TDCs, FPGA-TDCs usually show worse linearity. For TDL-TDCs, the clock skews and the poor uniformity of carry chains [16], [24] are the main culprits for the nonlinearity, missing codes, and bubble problems. It is difficult to remove them completely. To reduce the nonlinearity, the dual-phase [28], downsampling [36], TDL reorganization [24], [25], and tuned-TDL [29] methods have been presented recently. A ones-counter encoder was reported to remove the bubbles in FPGA-TDCs [37]. However, these methods still cannot raise the linearity to a level comparable to ASIC-TDCs [17].
The demand for multichannel TDCs has been growing strongly, especially for applications that require real-time acquisition, such as ToF measurements for 2-D and 3-D ranging, LIDAR, time-resolved spectroscopy, and fast-FLIM [11], [12], [39]-[41]. ASIC-based multichannel designs are reliable and competitive in resolution, linearity, and power consumption. The latest FPGAs also have great potential for implementing multichannel TDCs, as they provide a massive amount of logic and IO ports with fast and flexible development tools. Many multichannel TDCs based on both ASIC and FPGA devices have been reported in the last few years. Most ASIC multichannel TDCs achieve tens of channels [8], [42], [43]. Several designs with hundreds or even a thousand channels were reported specifically for fast FLIM and 3-D ranging applications [6], [39], [44]. However, these TDCs do not target high linearity; their specifications are constrained by system requirements such as low power consumption and a high fill factor. FPGA-based multichannel TDCs [21], [25], [45]-[47] can implement more than ten or even hundreds of channels within a single FPGA; however, their linearity cannot compete with ASIC-TDCs. Most multichannel TDCs published earlier are not able to achieve a large channel number, a high resolution, and high linearity simultaneously.
Various procedures are required for calibrating process, voltage, and temperature (PVT) variations and nonlinearities [48], [49]. In an FPGA, the influence of voltage jitter can be neglected since the power noise is effectively suppressed [23]. Temperature variations change the propagation delay of a TDL, resulting in LSB variations. Several methods were reported [28], [38], [45] to compensate for them by using look-up tables or by correcting temperature coefficients. The static nonlinearities are mainly caused by the nonuniformity of TDLs and clock distributions. The bin-by-bin calibration techniques [14], [21], [45], [49] can be used for correcting nonlinearity offsets, whereas the binwidth calibration techniques [22], [30] reduce the nonlinearity of the binwidth.
Chen et al. [30] presented a missing-code-free FPGA-TDC by combining the direct-histogram architecture and the tuned-TDL method. Although this method improved the linearity greatly, it is not suitable for multichannel applications because implementing the histogram counters consumes a large amount of resources. To achieve a TDC with 1) high linearity, 2) a long measurement range, and 3) low consumption of digital resources, we present several new methods and implement multichannel TDCs, as listed below.
1) A sub-TDL averaging topology is presented to achieve fast removal of the bubbles and zero-width bins and preliminary suppression of the nonlinearity.
2) A unique tap timing test based on the sub-TDL topology is proposed to calculate the actual tap timings in a TDL.
3) A new direct histogram compensation architecture and a mixed calibration method are developed to boost linearity with minimum logic resource cost.
4) To demonstrate our approaches, we implemented 96-channel TDCs in both the Virtex 7 XC7V690T and the UltraScale XCKU040 FPGAs.

II. DESIGN AND ARCHITECTURE
Since the Virtex 7 and UltraScale FPGAs differ in the arrangement of their logic modules, we describe the proposed approaches for both devices, with different configurations and methods selected for implementing our multichannel TDCs. We demonstrate how the proposed methods improve the linearity by comparing them with traditional topologies.

A. TDL-TDC and Sub-TDL Averaging Topology
As shown in Figs. 1 and 2, the TDLs were implemented with cascaded carry chain modules, CARRY4 in Virtex 7 [Fig. 2(a)] and CARRY8 in UltraScale [Fig. 2(b)], with the carry output ports as the taps and a column of sampling D flip-flops (D-FFs). The input port of the TDL can be connected to photon sensors such as single-photon avalanche diodes (SPADs) and photomultiplier tubes. When a new photon event is detected, the hit signal with a 0-to-1 or 1-to-0 transition propagates along the TDL. The states of the hit signal at the taps are registered by the D-FFs on the sampling clock. The states are assembled and represented as thermometer codes (1111000... or 0000111...) before being converted to one-hot codes (Fig. 1). The one-hot codes are then converted to normal binary codes, the fine codes, by the OH2BIN converters. To construct the histogram, the fine codes are used as the addresses of the memory. With the coarse and fine code structure, the FPGA-TDC is able to achieve a longer measurement range.
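The encoding chain described above (thermometer code, one-hot code, binary fine code) can be illustrated with a minimal software sketch. The function names and the convention that the fine code counts the taps the hit has passed are assumptions made for illustration; the paper's OH2BIN converter is hardware logic whose exact encoding is not given here.

```python
def thermometer_to_one_hot(thermo):
    """Mark every 1->0 edge of a thermometer code such as 1110000."""
    padded = list(thermo) + [0]  # a virtual 0 after the last tap closes an all-ones code
    return [1 if padded[i] == 1 and padded[i + 1] == 0 else 0
            for i in range(len(thermo))]


def one_hot_to_binary(one_hot):
    """OH2BIN sketch: the fine code is the number of taps the hit has passed."""
    set_bits = [i for i, bit in enumerate(one_hot) if bit]
    if len(set_bits) > 1:
        # A bubble produces several 1->0 edges, so the code is no longer one-hot.
        raise ValueError("bubble: more than one 1->0 edge in the thermometer code")
    return set_bits[0] + 1 if set_bits else 0


def fine_code(thermo):
    """Full chain: thermometer code -> one-hot code -> binary fine code."""
    return one_hot_to_binary(thermometer_to_one_hot(thermo))
```

A code with a bubble, such as 110100, yields two edges and is rejected in this sketch, which is the failure mode that bubble-removal logic has to handle.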
The carry chain module contains a series of multiplexers (MUXs) as the basic delay elements of a TDL, as shown in Fig. 2. The CARRY4 module contains four MUXs in Xilinx 7 series (both Virtex 7 and Kintex 7) FPGAs, and the CARRY8 contains eight MUXs in the newer 20 nm UltraScale and 16 nm UltraScale+ FPGAs. Traditional TDL-TDCs splice all carry outputs into a single thermometer code directly. However, the dedicated fast lookahead carry logic in the CARRY modules causes significant nonlinearity, missing codes, and serious bubble problems due to the mismatch in the propagation delay along the delay lines [27], [37]. The shorter the tap interval, the more serious the bubble and missing code problems become. To solve this problem, we propose a sub-TDL averaging topology that rearranges and regroups the carry outputs into several subsections, each with a shorter thermometer code, as shown in Fig. 1. For a Virtex 7 FPGA, a TDL is separated into four sub-TDLs. The fine codes of the four sub-TDLs are then summed to form the averaged TDL. The method is applied similarly to an UltraScale FPGA, but a TDL is divided into eight sub-TDLs. Using the sub-TDL topology is equivalent to using a less advanced process: it elongates the tap interval and thereby removes the bubbles. The LSB, or bin size, of a TDL is equal to the full-scale range divided by the number of taps. From Fig. 1, for a raw TDL (4n taps) and a sub-TDL (n taps) built from n CARRY4s in a Virtex 7 FPGA, the LSB of the raw TDL is as follows [50]:
$$\mathrm{LSB}_{\mathrm{raw}} = \frac{1}{4n}\sum_{i=1}^{n}\sum_{j=1}^{4}\Delta t_{j,i} = \Delta t_{\mathrm{AVE}}$$
where $\Delta t_{j,i}$ is the propagation delay of the jth tap in the ith CARRY4 module and $\Delta t_{\mathrm{AVE}}$ is the average propagation delay of a tap (the exact delay model should include all delays on the routes and buffers to the D-FFs, but here we adopt only a simple model). Also from Fig. 1, the LSB of a sub-TDL (the total delay divided by n) is approximately
$$\mathrm{LSB}_{\mathrm{sub}} \approx \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{4}\Delta t_{j,i} = 4\,\Delta t_{\mathrm{AVE}}$$
(note that the sampling instants of the sub-TDLs are different, but the delays are similar).
The sub-TDL averaging method, which uses the four sub-TDLs together to obtain the averaged TDL, can be considered a new bubble-free version of the multichain-TDL technique [26]. The original multichain-TDL technique used multiple TDLs to obtain a TDL with a smaller binwidth, but it still requires extra logic circuits to remove bubbles. Similar to [26, Eq. (4)], the LSB of the averaged TDL satisfies $\mathrm{LSB}_{\mathrm{Ave}} \approx \mathrm{LSB}_{\mathrm{raw}}$ when $n \gg 1$. The code density test results of the individual sub-TDLs for both Virtex 7 and UltraScale are shown in Fig. 3. The advantage of the sub-TDL averaging approach is that, because the equivalent binwidth of the sub-TDLs is multiplied, bubbles no longer occur in the sub-TDLs. The averaged TDL has no zero-width bins (DNL = −1), and the number of missing codes (DNL ≤ −0.9) [50] is reduced effectively, as shown in Fig. 5. Since missing codes still exist in the averaged TDL, additional methods are required to improve linearity.
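The sub-TDL averaging idea can be sketched in a few lines, assuming that sub-TDL s collects the s-th carry output of every CARRY4 and that each sub-TDL's fine code is simply its count of ones. Both the grouping rule and the function names are illustrative assumptions, not the paper's exact hardware mapping.

```python
def split_sub_tdls(raw_taps, taps_per_carry=4):
    """Regroup raw taps so that sub-TDL s holds output s of every carry module."""
    return [raw_taps[s::taps_per_carry] for s in range(taps_per_carry)]


def is_bubble_free(thermo):
    """A clean thermometer code never shows a 1 after a 0."""
    return all(not (a == 0 and b == 1) for a, b in zip(thermo, thermo[1:]))


def sub_fine_code(thermo):
    """Fine code of one sub-TDL: the number of taps the hit has passed."""
    if not is_bubble_free(thermo):
        raise ValueError("bubble inside a sub-TDL")
    return sum(thermo)


def averaged_fine_code(raw_taps, taps_per_carry=4):
    """Sum the sub-TDL fine codes to obtain the averaged-TDL fine code."""
    return sum(sub_fine_code(sub) for sub in split_sub_tdls(raw_taps, taps_per_carry))


# Four CARRY4s (16 taps). Tap 6 lags tap 7, so the raw code contains a bubble.
raw = [1, 1, 1, 1,  1, 1, 0, 1,  0, 0, 0, 0,  0, 0, 0, 0]
```

In this example the raw 16-tap code is not a valid thermometer code, yet each of the four 4-tap sub-TDL codes is, because adjacent taps of a sub-TDL are a full CARRY4 apart.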

B. Tuned-TDL and Tap Timing Test
The TDLs in the Virtex 7 and UltraScale FPGAs are shown in Fig. 2(a) and (b). Each delay element "MUX" contains two types (CO and O) of outputs. These two types of output signals have different delays [29]. The tuned-TDL method selects one of the two output types to improve the linearity. In our work, the tuned-TDL method is used with the sub-TDL topology. For the CARRY4 module in a Virtex 7 FPGA, the CO and O output ports are mutually exclusive, whereas in the CARRY8 module in an UltraScale FPGA, the CO and O ports are all registered within the same CLB module. Therefore, each CARRY8 is able to generate 16 carry outputs. In 2016, a dual-sampling method using all 16 carry outputs was presented with a bin size of 2.25 ps [27]. However, the bubble problems and the nonlinearity are exacerbated with the reduced binwidth.
The actual timings of the TDL taps are needed for investigating the uniformity of the carry chains and the clock skews. Since the circuits of the CARRY chains are fixed in FPGAs, the binwidths and the locations of missing codes are static and therefore predictable. We therefore propose tap timing tests, based on the sub-TDL topology, to quantitatively analyze the time intervals between taps and to select the taps with better intervals. As in code density tests, a large number of random hit signals are fed into the TDC, and the 16 binary codes ($B_n$, n = 0, ..., 15) converted by the OH2BIN converters from all 16 sub-TDLs (CARRY8) are read out directly and collected. One such set of binary codes is generated for every measurement. The timing differences between taps, $D_n$, are calculated from these codes by (6), where L is the number of measurements and $B_{n,m}$ is the nth binary code for the mth measurement. From (6), a set of timing differences from $D_0$ to $D_{14}$ can be quantified. Fig. 4 illustrates the results of the tap timing tests, showing the ideal and the actual bin timings. The actual bin timings show that the widest bin is about 2.3 LSB (CO1 to CO5) and the narrowest bin is less than 0.1 LSB (CO7 to CO4). The highlighted sub-TDLs (in red) indicate how the mismatched timings of the bin boundaries contribute to the nonlinearity of FPGA-TDCs. The number of TDL taps used depends on the application requirements, and the taps should be selected to give relatively uniform time intervals. There is a tradeoff between the resolution and the linearity achieved. In this case, we selected 8 out of the 16 taps in a CARRY8, with an average bin size of 5 ps (LSB). Fig. 5(c) and (d) show that the zero-width bins (DNL = −1) are completely removed in both FPGAs and that the widest bins are well controlled such that DNL < 2 LSB. The root mean square (rms) binwidth is improved from 1.53 to 1.13 LSB for the Virtex 7 FPGA and from 1.85 to 1.12 LSB for the UltraScale FPGA.
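One plausible reading of the tap timing test is sketched below: for each pair of neighbouring sub-TDL codes, the difference is averaged over all L measurements. The exact form of (6) is not reproduced in this text, so the ordering of the codes, the sign convention, and the function name are assumptions.

```python
def tap_timing_differences(codes):
    """Mean code difference between neighbouring sub-TDLs.

    codes[m][n] plays the role of B_{n,m}: the n-th binary code of the
    m-th measurement. The result has one entry fewer than the number of
    codes per measurement (D_0 .. D_14 for 16 codes). The sign convention
    B_{n+1,m} - B_{n,m} is an assumption.
    """
    num_measurements = len(codes)
    num_codes = len(codes[0])
    return [
        sum(row[n + 1] - row[n] for row in codes) / num_measurements
        for n in range(num_codes - 1)
    ]


# Two measurements of three sub-TDL codes each (toy data).
example = [[3, 3, 2],
           [4, 3, 3]]
```

Under this reading, a mean difference close to zero indicates two taps that sample the hit at nearly the same instant, which corresponds to a very narrow bin between them.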

C. Compensated Histogram and Mixed Calibration Method
The high consumption of FPGA resources makes the previous design [30] unfriendly to multichannel implementations. To simultaneously achieve optimized linearity, fast calibration, and low resource consumption, we propose a direct histogram compensation and a mixed calibration method (see Fig. 6).
The measured events are expressed by fine codes and counted in the corresponding bins of the histogram. In a raw TDC, large quantization errors are generated since the time intervals between adjacent TDL taps are highly nonuniform and only one binary code is processed in each measurement. A compensation approach was introduced in 2016 to solve this problem, but it was only used for postprocessing, which adds considerable processing time, especially in multichannel applications [46]. To achieve fast and direct histogram compensation, we reassign the fine code to a main bin calibration factor ($\mathrm{BCF}_m$) and a compensation bin calibration factor ($\mathrm{BCF}_c$) when a hit signal is measured. These two factors are the fine code outputs of the compensated TDC. To calculate $\mathrm{BCF}_m$ and $\mathrm{BCF}_c$, the binwidths of the raw TDC first need to be estimated by performing code density tests. The kth code transition level $T[k]$, which is needed for calculating the main and compensation factors, is obtained from the binwidths, where $W[n]$ is the code binwidth of the nth bin. $\mathrm{BCF}_m$ and $\mathrm{BCF}_c$ are calculated accordingly. For a bin located within a single ideal normalized bin (highlighted in blue in Fig. 7), only $\mathrm{BCF}_m$ is valid for readdressing the measured result. For a bin that extends across different ideal bins (highlighted in red), both $\mathrm{BCF}_m$ and $\mathrm{BCF}_c$ are generated to address two bins at once. The histogram compensation method corrects the measurement bias by readdressing the fine codes, and the missing codes are compensated as well. As shown in Fig. 8(a) and (c) and Table II, the linearity of the compensated TDC is further improved. The DNL is reduced to [−0.73, 0.79] LSB, and σDNL is reduced to 0.29 LSB for the Virtex 7 FPGA. The DNL is reduced to [−0.75, 0.86] LSB, and σDNL is reduced to 0.35 LSB for the UltraScale FPGA. Fig. 8(b) and (d) show that the distributions of the binwidths (for both TDCs) are well shaped, with no missing codes.
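The readdressing step can be sketched as follows, under several assumptions that are not taken from the paper: binwidths are normalized so that the ideal bin is 1 LSB, the transition level of bin k is the cumulative width of all bins below it, the main bin is the ideal bin in which the raw bin starts, and a raw bin never needs more than two ideal bins. All function names are hypothetical.

```python
import math


def normalized_widths(counts):
    """Estimate binwidths in ideal-LSB units from code density counts."""
    total = sum(counts)
    return [c * len(counts) / total for c in counts]


def transition_levels(widths):
    """Assumed form: T[k] is the sum of the widths of all bins below k."""
    levels = [0.0]
    for w in widths:
        levels.append(levels[-1] + w)
    return levels


def bin_calibration_factors(widths):
    """Return one (BCF_m, BCF_c) pair per raw bin; BCF_c is None when unused."""
    levels = transition_levels(widths)
    top = len(widths) - 1
    factors = []
    for k in range(len(widths)):
        main = min(math.floor(levels[k]), top)
        last = min(math.ceil(levels[k + 1]) - 1, top)
        comp = main + 1 if last > main else None  # raw bin crosses an ideal boundary
        factors.append((main, comp))
    return factors


def record_hit(hist, factors, raw_code):
    """Readdress one hit: always the main bin, plus the compensation bin if any."""
    main, comp = factors[raw_code]
    hist[main] += 1
    if comp is not None:
        hist[comp] += 1
```

With code density counts of 50, 150, 100, and 100, raw bin 1 spans the ideal bins 0 and 1, so it is the only bin that receives a compensation address. How the two addressed bins are weighted is left to the binwidth calibration and is not modeled here.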
Since the minimum DNL is better than −0.8 LSB after the compensation, binwidth calibration becomes feasible for enhancing the linearity further. The code density test needs to be re-executed after $\mathrm{BCF}_m$ and $\mathrm{BCF}_c$ are loaded. The binwidth calibration factor set of the calibrated TDC contains two vectors, $\mathrm{WCF}_m$ and $\mathrm{WCF}_c$, for the main and compensation histograms, respectively. The results of the second code density test are used to calculate $\mathrm{WCF}_m$ and $\mathrm{WCF}_c$. $\mathrm{BCF}_m$, $\mathrm{BCF}_c$, $\mathrm{WCF}_m$, and $\mathrm{WCF}_c$ can be calculated either offline (MATLAB) or on the fly (on-chip processing). A probability profile of the code distribution is obtained through code density tests; the histogram compensation then reassigns codes based on this profile, and the binwidth calibration corrects the remaining distribution. To save resources, we splice the two BCFs and the two WCFs into a mixed-calibration factor set, which is stored in the calibration block random-access memory (BRAM). The calibration BRAM dispatches the stored calibration factors to the histogram BRAMs when the fine codes of the averaged TDL are valid at the address ports. The flow diagram of the proposed TDC in the Virtex 7 FPGA is shown in Fig. 9. Depending on the application, either one true dual-port BRAM or two BRAMs working in simple dual-port mode can be selected for histogramming.
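The mixed-calibration factor set can be pictured as one memory word per fine code that packs the two addresses and the two weights. The sketch below is only an illustration: the field widths, the fixed-point weight scale, the use of a zero weight to mark an unused compensation bin, and the additive histogram update are all assumptions rather than details given in the paper.

```python
BCF_BITS = 10  # hypothetical address width of each bin calibration factor
WCF_BITS = 8   # hypothetical width of each binwidth calibration factor


def pack_factors(bcf_m, bcf_c, wcf_m, wcf_c):
    """Splice two BCFs and two WCFs into one calibration word."""
    for value, bits in ((bcf_m, BCF_BITS), (bcf_c, BCF_BITS),
                        (wcf_m, WCF_BITS), (wcf_c, WCF_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field does not fit its bit width")
    word = (bcf_m << BCF_BITS) | bcf_c
    word = (word << WCF_BITS) | wcf_m
    return (word << WCF_BITS) | wcf_c


def unpack_factors(word):
    """Recover (BCF_m, BCF_c, WCF_m, WCF_c) from a calibration word."""
    wcf_c = word & ((1 << WCF_BITS) - 1)
    word >>= WCF_BITS
    wcf_m = word & ((1 << WCF_BITS) - 1)
    word >>= WCF_BITS
    bcf_c = word & ((1 << BCF_BITS) - 1)
    bcf_m = (word >> BCF_BITS) & ((1 << BCF_BITS) - 1)
    return bcf_m, bcf_c, wcf_m, wcf_c


def record_hit(hist, cal_mem, fine_code):
    """Look up the factors by fine code and add the weights to the histogram."""
    bcf_m, bcf_c, wcf_m, wcf_c = unpack_factors(cal_mem[fine_code])
    hist[bcf_m] += wcf_m
    if wcf_c:  # a zero weight marks "no compensation bin" in this sketch
        hist[bcf_c] += wcf_c
```

Storing all four fields in one word means a single read, addressed by the fine code, supplies everything the histogram update needs, which is the resource argument made above.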

III. EXPERIMENTS AND RESULTS
The results of the experiments and tests were used to evaluate the performances of the proposed calibrated TDCs. Two independent low-jitter crystal oscillators (DSC1103) were used as the signal sources for the code density tests. The temperature and operating voltage on the FPGA chip were maintained within a stable range.
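As an illustration of how a code density test is turned into linearity figures, the sketch below applies the conventional definitions: the DNL of a bin is its relative deviation from the ideal binwidth, and the INL is the running sum of the DNL. It assumes that hits are uniformly distributed in time, which is what uncorrelated signal sources provide, so that each bin's count is proportional to its width. The function name is hypothetical.

```python
def dnl_inl(counts):
    """Compute DNL and INL (in LSB) from code density counts.

    With uniformly distributed hits, a bin's count is proportional to its
    width, so count / mean_count estimates the width in ideal-LSB units.
    """
    ideal = sum(counts) / len(counts)
    dnl = [c / ideal - 1.0 for c in counts]
    inl = []
    running = 0.0
    for d in dnl:
        running += d
        inl.append(running)
    return dnl, inl


# Toy histogram: bin 1 is half as wide as ideal, bin 2 is 1.5 times as wide.
dnl, inl = dnl_inl([100, 50, 150, 100])
```

A bin that never collects a hit gets DNL = −1 under this definition, which matches the zero-width bins discussed earlier.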

A. Linearity Test Results of Calibrated TDCs
The DNL, the INL, and their standard deviations (σDNL and σINL) are the main parameters for evaluating the linearity. Compared with the compensated TDC, the calibrated TDC improves both the DNL and the INL significantly. The results are summarized in Table II and illustrated in Fig. 10. After the calibration, DNLpk-pk (the peak-to-peak DNL) and INLpk-pk (the peak-to-peak INL) are improved by more than 11-fold and 16-fold for the Virtex 7 FPGA, respectively. For the UltraScale FPGA, DNLpk-pk and INLpk-pk are improved by about 7-fold and 5-fold, respectively. The standard deviations σDNL and σINL are improved by about 29-fold and 15-fold for the Virtex 7 FPGA, respectively, and by 11-fold and 4-fold for the UltraScale FPGA. The equivalent binwidth $w_{\mathrm{eq}}$ and the equivalent standard deviation $\sigma_{\mathrm{eq}}$ were proposed by Wu for assessing the linearity performance of TDCs [51]; both are defined in [51].
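A commonly cited form of these two metrics weights the quantization variance of each bin, W²/12, by the probability W/ΣW that a hit falls into it. The sketch below uses that form as an assumption, since the defining equations are not reproduced in this text; the function names are hypothetical.

```python
import math


def equivalent_bin_width(widths):
    """w_eq = sqrt(sum(W^3) / sum(W)): a width-weighted rms binwidth."""
    total = sum(widths)
    return math.sqrt(sum(w ** 3 for w in widths) / total)


def equivalent_std(widths):
    """sigma_eq = w_eq / sqrt(12): rms quantization error of the whole TDC."""
    return equivalent_bin_width(widths) / math.sqrt(12)


uniform = [1.0, 1.0, 1.0, 1.0]  # ideal TDC, every bin is 1 LSB wide
uneven = [0.5, 1.5]             # same mean width, but nonuniform bins
```

For uniform bins the equivalent binwidth equals the LSB. Nonuniform bins with the same mean width give a larger value, because wide bins are hit more often and contribute a larger quantization error.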

B. Time Interval Measurements
To verify the measurement error and the rms resolution of the proposed TDC, programmable delay generators (IDELAYE2 and IDELAYE3) are used for generating known time intervals between an original signal and a delayed signal. The time intervals are measured by the presented calibrated TDCs and by an oscilloscope (Teledyne LeCroy WaveRunner 640Zi) at the same time. Both the original signal and the delayed signal are output via two SMA connectors. The external jitter is minimized, since the time intervals are generated in the FPGA chip and sent to the TDC directly. The IDELAYE2 and IDELAYE3 are continuously calibrated by an IDELAYCTRL module, based on a low-jitter reference clock, to suppress the effect of PVT variations. The delay can be dynamically controlled with a step of 39 ps in the IDELAYE2 module and 4.6 ps in the IDELAYE3 module. With this arrangement, different time intervals were generated to cover the entire TDLs of the TDC. Each experiment captured 80 000 samples, and the time intervals were calculated based on the histogram. The measurement results and the rms resolution are shown in Fig. 11. The average rms resolution is 14.59 ps with σ = 0.84 ps for the Virtex 7 FPGA and 7.80 ps with σ = 0.45 ps for the UltraScale FPGA. The standard deviations of the time intervals measured by the oscilloscope are 14.86 ps for the Virtex 7 FPGA and 8.55 ps for the UltraScale FPGA. The standard deviations of the differences between the results measured by the TDC and those measured by the oscilloscope are 4.04 and 5.37 ps for the Virtex 7 and the UltraScale FPGAs, respectively.
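Extracting a time interval and its rms spread from a measurement histogram can be sketched as below. The sketch assumes that bin k represents the time k × LSB and that the rms resolution is simply the standard deviation of the histogram; the paper's exact procedure is not described, and the function name is hypothetical.

```python
import math


def histogram_mean_rms(hist, lsb_ps):
    """Return (mean, rms) in picoseconds for a histogram of TDC codes."""
    total = sum(hist)
    mean_code = sum(k * c for k, c in enumerate(hist)) / total
    variance = sum(c * (k - mean_code) ** 2 for k, c in enumerate(hist)) / total
    return mean_code * lsb_ps, math.sqrt(variance) * lsb_ps


# Toy histogram of four hits spread over three neighbouring bins, LSB = 10 ps.
mean_ps, rms_ps = histogram_mean_rms([0, 1, 2, 1], lsb_ps=10.0)
```

The same routine could be applied with the LSB set to 10.5 ps or 5.0 ps, the values reported for the Virtex 7 and UltraScale TDCs.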

C. Configurations and Multichannel TDC Design
The multichannel TDC can be configured in various ways. In the Virtex 7 FPGA, each clock region contains 50 rows of CARRY4s. In the UltraScale FPGA, each clock region contains 60 rows of CARRY8s. To reduce the large nonlinearity contributed by the clock distribution, the TDLs are placed in the two center clock regions in the Virtex 7 FPGA and within a single clock region in the UltraScale FPGA. Single or dual sampling phases [28] can be selected according to the length of the TDL and the frequency of the sampling clock.
In this paper, we implemented 96-channel calibrated TDCs in both Virtex 7 and UltraScale FPGAs. According to the postimplementation utilization report shown in Table III, each channel costs around 700 LUT modules and 1200 registers. The BRAM usage depends on the configuration and the designated measurement range. The minimum BRAM usage is 1.5 BRAMs per channel in the dual-BRAM mode. However, the number of channels is not limited by the resource usage alone. The timing requirements, routing congestion level, and system expandability should also be considered. Therefore, sufficient spacing between adjacent TDC channels must be guaranteed. The previous work [30] presented a high-linearity, low-dead-time FPGA-TDC. However, its high logic consumption makes it unsuitable for a multichannel design. This work targets multichannel applications by achieving both high linearity and low resource consumption. Fig. 12 shows the placement and routing results for the 96-channel TDCs in the Virtex 7 (left) and UltraScale (right) FPGAs. To demonstrate the uniformity of the proposed multichannel TDC, we present the code density test results for 16 out of the 96 channels (in both FPGAs). These 16 channels are evenly distributed over the used chip area. According to the test results shown in Table IV, the linearity of the TDC channels in different locations shows good uniformity.

IV. CONCLUSION
In this paper, we proposed and evaluated the following: 1) a new sub-TDL averaging topology; 2) an innovative tap timing test; and 3) a new hardware-embedded histogram compensation method and a mixed calibration method. The sub-TDL averaging removes bubbles and zero-width bins without consuming additional resources or extra processing time. The novel tap timing test quantifies the actual timing of the TDL taps. The histogram compensation and mixed calibration methods correct the conversion bias and the binwidth deviation directly with limited resource consumption. By integrating these methods, high-linearity and low-cost FPGA-TDCs were implemented and tested in the Virtex 7 and UltraScale FPGAs. The bin size (LSB) achieved is 10.5 and 5.0 ps, with an rms resolution of 14.59 and 7.80 ps, for the Virtex 7 and the UltraScale FPGAs, respectively. Compared with the previously published works listed in Table V, the linearity has been significantly improved. The 96-channel TDCs were also implemented and tested in both FPGAs, and the test results show good uniformity. The proposed solutions also have potential in future applications of fast 3-D ranging or time-resolved imaging, such as Raman spectroscopy, agricultural research, and wind farms, that previously relied on other techniques.