A High-Speed FPGA-Based True Random Number Generator Using Metastability With Clock Managers

True random number generators (TRNGs) are fundamental to many important security applications. Although they exploit randomness sources that are typical of the analog domain, digital solutions are strongly required, especially when they have to be implemented on Field Programmable Gate Array (FPGA)-based digital systems. This brief describes a novel methodology to easily design a TRNG on FPGA devices. It exploits the runtime capability of the Digital Clock Manager (DCM) hardware primitives to tune the phase shift between two clock signals. The presented auto-tuning strategy automatically sets the phase difference of two clock signals in order to force one or more flip-flops (FFs) to enter the metastability region, which is used as a randomness source. Moreover, a novel use of the fast carry-chain hardware primitive is proposed to further increase the randomness of the generated bits. Finally, an effective on-chip post-processing scheme that does not reduce the TRNG throughput is described. The proposed TRNG architecture has been implemented on the Xilinx Zynq XC7Z020 System on Chip (SoC). It passed all the National Institute of Standards and Technology (NIST) SP 800-22 statistical tests with a maximum throughput of $300 \times 10^{6}$ bits per second, which is considerably higher than the throughput of previously published DCM-based TRNGs.


I. INTRODUCTION
Random numbers are used in many important security applications, such as the generation of secret and public keys, nonces and challenges in authentication protocols, and initialization vectors for cryptographic systems. For this reason, hardware-based True Random Number Generators (TRNGs) have become crucial in the last few years. A TRNG is a hardware system that generates a sequence of random numbers without the need for an initial seed; it is based on the unpredictability of physical phenomena such as thermal noise, phase noise and quantum phenomena, which are typically exploited in analog circuits [1]. On the other hand, much effort has recently been devoted to the design of fully digital TRNGs to be implemented on Field Programmable Gate Arrays (FPGAs), due to their wide adoption [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17]. Typically, the existing designs rely on metastability [2], [3], [4] and jitter [5], [6], [7], [8], [9], [10], [11], [12], [13], where the randomness is commonly extracted by Look-up Table (LUT)-based Programmable Delay Lines (PDLs) and free-running Ring Oscillators (ROs). The use of two PDLs driving the data and the clock inputs of a Data Flip-Flop (FF) is proposed in [2]. There, the PDLs are designed as series-connected LUTs, each one implementing an inverter. The delay of each LUT is modulated by setting its unused inputs to appropriate values, without interfering with the implemented logic. By carefully tuning the delay difference between the two PDLs, the setup or hold time of the FF can be violated, thus forcing the FF to enter its metastable region. Alternative approaches are used in [5] and [10], where the outputs of several ROs are fed to a XOR gate whose output is sampled by an FF. In the presence of phase jitter, the FF may sample glitches with a non-deterministic length, thus producing a random bit.
The architecture demonstrated in [9] exploits a LUT-based tunable delay chain in conjunction with a time-to-digital converter consisting of 32 8-bit series-connected fast carry chains. The physical implementation of such schemes requires a careful analysis of the appropriate number of ROs and of inverters within each RO, and needs specific tuning of the PDL delays, which can be significantly affected by the routing. Regardless of the technique used, most of the above-mentioned architectures require a post-processing block and a feedback control to achieve adequate randomness.
Recently, the use of clock managers has been investigated as an effective way to more easily design TRNGs on FPGA devices [14], [15]. These hardware primitives, which aim at producing one or more output clock signals with a programmable frequency, are typically available in modern FPGAs: the Xilinx devices provide the Digital Clock Manager (DCM) [18], whereas the Clock Manager (CM) is available within Intel [19] chips. In [14], [15], the Dynamic Reconfiguration Port (DRP) of DCMs is used to configure their parameters on the fly, thus tuning the frequency of the signals driving the data and the clock inputs of an FF. Unfortunately, such strategies lead to a relatively slow random bit production rate, since hundreds of clock cycles elapse between two consecutive samplings. This brief describes an alternative use of the clock manager hardware primitive to design a high-speed TRNG. Instead of tuning the frequency of the output clocks, the proposed design exploits its dynamic phase shifting (DPS) capability to force one or more FFs to enter the metastability region. The proposed solution automatically tunes the phase shift of the DCM, so that the random sequence generation starts as soon as this condition occurs. To enhance the randomness, we also propose an unconventional use of the carry-chain primitive included in the FPGA slice, with a configurable feedback scheme. Finally, a simple on-chip post-processing scheme is proposed that exploits only one Digital Signal Processing (DSP) slice without reducing the bit production rate. When implemented on the Xilinx Zynq XC7Z020 System on Chip (SoC), equipped with a 28 nm Artix-7 based programmable logic, the proposed TRNG passes all the National Institute of Standards and Technology (NIST) SP 800-22 [20] and AIS statistical tests [21] with a throughput up to three orders of magnitude higher than the DCM-based solutions [14], [15].

II. THE PROPOSED USE OF THE XILINX DCM
The Xilinx DCM offers the opportunity to dynamically tune the phase of the output clock at run time, without reconfiguring the FPGA device [18]. To this aim, the primitive uses two input signals, psen and psincdec, as described in Fig. 1a. When psen is asserted high, the dynamic configuration starts and the phase of the output clock is incremented (decremented) if the value of psincdec is high (low). The phase shift resolution $T_{shift}$ depends only on the frequency $F_{VCO}$ of the voltage-controlled oscillator (VCO) internal to the DCM and is equal to $\frac{1}{56 F_{VCO}}$ [18]. Fig. 1a also describes the proposed idea of using the DCM to induce metastability in an FF. The system clock, i.e., the DCM input clock clk_in, drives the clock input of the FF. The DCM produces the output clock clk_out, with the same frequency, which drives the data input of the FF. To force the FF to enter the metastable region, clk_in and clk_out should arrive almost simultaneously at the FF inputs, thus violating the FF setup/hold timing constraints. However, due to their different routing paths, the arrival times of the data and clock edges may differ significantly, thus preventing metastability from occurring. The DCM phase shift is exploited to compensate for this difference in routing delays.
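As a worked illustration of the resolution formula $T_{shift} = \frac{1}{56 F_{VCO}}$, the short Python sketch below evaluates it for a few VCO frequencies and counts how many phase steps span one period of the system clock. The VCO frequencies and the 100 MHz system clock used here are illustrative assumptions, not settings reported for the proposed design.

```python
def phase_shift_resolution_ps(f_vco_hz: float) -> float:
    """Return T_shift = 1 / (56 * F_VCO), expressed in picoseconds."""
    return 1e12 / (56.0 * f_vco_hz)


def steps_per_clock_period(f_vco_hz: float, f_clk_hz: float) -> float:
    """Number of T_shift steps needed to sweep one full period of the system clock."""
    return 56.0 * f_vco_hz / f_clk_hz


if __name__ == "__main__":
    f_clk_hz = 100e6  # assumed system clock, for illustration only
    for f_vco_mhz in (600, 1000, 1200):  # hypothetical VCO settings
        f_vco_hz = f_vco_mhz * 1e6
        t_shift = phase_shift_resolution_ps(f_vco_hz)
        steps = steps_per_clock_period(f_vco_hz, f_clk_hz)
        print(f"F_VCO = {f_vco_mhz} MHz: T_shift = {t_shift:.2f} ps, "
              f"{steps:.0f} steps per clock period")
```

Under these assumptions the step is in the range of tens of picoseconds, which gives a feeling for why a single step may still be too coarse with respect to the metastability window of an FF.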
It is worth noting that if $T_{shift}$ is too large, as depicted in Fig. 1b, the rising edge of clk_out arrives either too early or too late with respect to the rising edge of clk_in. This means that the phase shift resolution achieved by the DCM may not be as fine as required. For this reason, similarly to [9], we have also used a carry chain primitive and four FF replicas. This provides a much finer control of the phase shift at the destination and prevents the metastability region from being skipped. Fig. 1c depicts the application of this principle with the CARRY4 hardware primitive within the slice of the Xilinx devices. Four FFs receive as data inputs the sum outputs ($O_{[3:0]}$) of a CARRY4 chain. The signal clk_out drives the CARRY4 input signal CYINIT. The generic signal $O_{i+1}$ is delayed with respect to $O_i$ by the propagation delay $\tau_{MUX}$ of the multiplexer in-between the output positions $i$ and $i + 1$, with $i = 0, \ldots, 2$. In the event that $T_{shift}$ is too large to induce metastability, it is reasonable to assume that $\tau_{MUX} < T_{shift}$, hence a finer phase shifting of the FF data inputs is obtained. As a consequence, the probability that at least one FF enters the metastability region increases. The propagation of the signal clk_out through the four bit positions of the carry chain is ensured by setting the CARRY4 input signal $S_{[3:0]}$ to logic 1 (the signals $D_{[3:0]}$ do not affect the carry propagation and they can be set, for instance, to logic 1). Finally, it is worth noting that the CARRY4 primitive and the four FFs can be placed within the same slice, so that the differences between the propagation delays of the signals $O_{[3:0]}$ can be considered independent of the routing between the XOR gates and the FFs and affected only by the propagation delays of the multiplexers in the carry chain. Fig. 2 illustrates the complete TRNG core and its integration within a specialized testing architecture realized on a heterogeneous FPGA System on Chip (SoC).
From a top-level perspective, the random sequence generated by the TRNG core is transferred to the external DRAM memory by means of AXI-Stream transactions managed by a Direct Memory Access (DMA) module. The latter is configured by the ARM-based processing system via dedicated AXI-Lite interfaces.
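The carry propagation described in this section for the CARRY4 chain can be checked with a small behavioural model. The Python sketch below is only a static logic model, assuming that each position computes its sum output as the XOR of its select input and incoming carry, and that the multiplexer forwards the incoming carry when the select input is 1 and the D input otherwise. It says nothing about timing or metastability, and the value of `tau_mux_ps` is a hypothetical placeholder.

```python
def carry4(cyinit, s, d):
    """Static model of a 4-bit carry chain.

    Position i: O[i] = S[i] XOR C[i]; the carry passed to position i + 1
    is C[i] when S[i] = 1 and D[i] when S[i] = 0.
    Returns the sum outputs O[3:0] and the carries leaving each position.
    """
    o, carries = [], []
    c = cyinit
    for i in range(4):
        o.append(s[i] ^ c)
        c = c if s[i] else d[i]
        carries.append(c)
    return o, carries


if __name__ == "__main__":
    tau_mux_ps = 10.0  # hypothetical multiplexer delay, not a datasheet value
    for clk_out in (0, 1):
        o, carries = carry4(clk_out, s=[1, 1, 1, 1], d=[1, 1, 1, 1])
        # Nominal arrival time of each O[i] relative to O[0] in this toy model.
        arrivals = [i * tau_mux_ps for i in range(4)]
        print(clk_out, o, carries, arrivals)
```

With all select inputs at 1, every sum output in this model is the complement of clk_out and the carry reaches all four positions regardless of the D inputs, which matches the statement that $D_{[3:0]}$ does not affect the propagation.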

III. THE PROPOSED TRNG ARCHITECTURE
The DCM-based scheme discussed in Section II is completed with a XOR gate that merges the outputs of the four FFs. Thus, it does not matter which of them becomes metastable. The output of the XOR is then sampled by an FF ($FF_{XOR}$) clocked by the system clock (i.e., clk_in). If the phase difference between clk_in and clk_out is large enough to avoid metastability (due to either the initial DCM state or an unpredictable difference in the routing of the signals), the four FFs of the randomness generator sample the same stable value and the output T of $FF_{XOR}$ is 0. The signal T is the input of a Finite State Machine (FSM) that controls the configuration signals of the DCM in a feedback fashion. If T is 0, the FSM starts a phase shifting transaction with the DCM by asserting psen to 1 for one clock cycle (psincdec can be constantly set to 0). Then, the FSM waits for a fixed number N of clock cycles (here N = 20), while still monitoring the value of the signal T. If T remains 0 during the N clock cycles, a new phase shifting transaction is activated. This auto-calibration procedure of the phase difference between clk_in and clk_out ends when T becomes 1, which is a reasonable sign that one FF has entered its metastable region. In such a case, the FSM stops shifting the phase of clk_out and the signal En is set to 0, starting the acquisition of the random bit-stream. A preliminary analysis, on different placement sites, demonstrated that after the auto-calibration, which takes 160 clock cycles on average, at least one of the four FFs actually enters the metastable region. Indeed, over a 10 Mb sequence output by the XOR gate, the percentage of 0's and 1's is close to 50%. In the proposed scheme, the signal T represents the raw random bit, which is generated with a throughput equal to the system clock frequency.
With the goal of increasing the randomness of the signal T, a further technique is adopted here in conjunction with the use of the DCM. As visible in Fig. 2, the signals $O_{[3:0]}$ are in a feedback loop to drive the selectors $S_{[3:0]}$ of the multiplexers of the carry chain. When the signal En is 1, the auto-calibration phase is still running and $S_{[3:0]}$ is set to "1111", as explained above. Once metastability has been induced, En is set to 0 and $S_{[3:0]}$ is set to $O_{[3:0]}$. Such a selection is performed by four multiplexers, controlled by En, whose logic can be implemented by the four Look-up Tables (LUTs) within the same slice as the CARRY4. The purpose of the proposed scheme is to force the XOR gates of the CARRY4 into a race condition. The generic $XOR_i$ gate in the carry chain, with $i = 1, \ldots, 3$, receives as inputs the output signal $O_i$ and the carry $C_i$ propagating from the previous position, whose value depends on $O_{i-1}$: if $O_{i-1} = 0$ then $C_i = D_{i-1}$ (i.e., logic 1), otherwise $C_i = C_{i-1}$. The first XOR gate in the chain, i.e., the one having $O_0$ as output, has clk_out and the same $O_0$ as inputs. It is worth noting that, differently from conventional ring oscillators [5], each XOR gate oscillates with a different phase due to the delay path composed of the MUXCY multiplexers. Furthermore, the state of each oscillator depends on the state of the previous one.
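The auto-calibration loop driven by the FSM in this section can be sketched behaviourally as follows. This is a toy Python model under stated assumptions, not the authors' implementation: the clock-to-data offset is counted in whole DCM steps, T is assumed to stay at 0 outside a hypothetical metastability window and to behave like a fair coin inside it, and the monitoring window is placed before each shift for simplicity. Only N = 20 and psincdec = 0 (decrement) are taken from the text; `window_steps`, `seed` and `max_shifts` are illustrative parameters.

```python
import random

N_WAIT = 20  # monitoring cycles per phase setting (N in the text)


def sample_t(offset_steps, window_steps, rng):
    """Toy stand-in for the FF_XOR output T.

    Assumption: T is 0 while the edges are farther apart than the
    metastability window, and a fair coin flip inside the window.
    """
    if abs(offset_steps) <= window_steps:
        return rng.randint(0, 1)
    return 0


def auto_calibrate(initial_offset_steps, window_steps=1, seed=0, max_shifts=10_000):
    """Shift the phase one step at a time until T is observed at 1."""
    rng = random.Random(seed)
    offset = initial_offset_steps
    shifts = 0
    cycles = 0
    while True:
        # Monitor T for N_WAIT cycles at the current phase setting.
        for _ in range(N_WAIT):
            cycles += 1
            if sample_t(offset, window_steps, rng) == 1:
                # Metastability detected: stop shifting and drive En to 0.
                return {"en": 0, "shifts": shifts, "cycles": cycles, "offset": offset}
        if shifts == max_shifts:
            # Give up: calibration is still running (En stays at 1).
            return {"en": 1, "shifts": shifts, "cycles": cycles, "offset": offset}
        # T stayed at 0: pulse psen for one cycle with psincdec = 0 (decrement).
        offset -= 1
        shifts += 1
        cycles += 1


if __name__ == "__main__":
    print(auto_calibrate(initial_offset_steps=8))
```

The model only captures the control flow: a run of zeros on T triggers another phase step, and the first 1 freezes the phase and releases the acquisition.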
The random bit T is then fed to a shift register that aggregates 32 consecutive bits into a single 32-bit packet P. The latter is then processed by the on-chip post-processing module to eliminate any bias in the generated random sequence [2], [7], [14], [15]. To avoid the detrimental effects on the generation rate caused by the traditional Von Neumann corrector, we adopted the simple post-processing methodology detailed in Fig. 3. The packet P is fed to a 32-bit accumulator. With $P_i$ being the i-th 32-bit packet, the accumulation result is $P_{i,ACC} = (P_i + P_{i-1,ACC}) \bmod 2^{32}$. This logic can be easily accommodated in one of the Digital Signal Processing (DSP) slices also available in modern FPGAs. The FSM activates the accumulator register through the CE signal only once 32 consecutive random bits have been collected by the shift register. To reduce the number of repetitive patterns, a dynamic bit-flipping is performed. To this aim, the final post-processed 32-bit packet $P_{i,PP}$ is generated as given in (1). It is worth noting that the proposed bit-flipping mechanism does not introduce any periodic feature in the generated bit-stream, since it depends on the value of the random bit $P_{i,ACC}[31]$. The FSM asserts the signal valid, thus enabling the AXI-Stream-based transfer, only when a new 32-bit packet $P_{PP}$ is produced.
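The packing and accumulation steps of the post-processing can be sketched in a few lines of Python. This is a minimal software model under stated assumptions: the first bit shifted in is taken as the most significant bit of the packet and the accumulator starts from 0, neither of which is specified in the text. The dynamic bit-flipping of (1) is deliberately not modelled.

```python
MASK32 = 0xFFFFFFFF  # keeps every result modulo 2**32


def pack_bits(bits):
    """Group a bit stream into 32-bit packets (first bit in = MSB, an assumption).

    Trailing bits that do not fill a whole packet are ignored.
    """
    packets = []
    for k in range(0, len(bits) - 31, 32):
        word = 0
        for b in bits[k:k + 32]:
            word = ((word << 1) | (b & 1)) & MASK32
        packets.append(word)
    return packets


def accumulate(packets, initial=0):
    """Return P_{i,ACC} = (P_i + P_{i-1,ACC}) mod 2**32 for each packet."""
    acc = initial & MASK32
    out = []
    for p in packets:
        acc = (p + acc) & MASK32
        out.append(acc)
    return out


if __name__ == "__main__":
    raw_bits = [1] * 32 + [0] * 31 + [1]
    packets = pack_bits(raw_bits)
    print([hex(p) for p in packets])
    print([hex(a) for a in accumulate(packets)])
```

Each input packet yields exactly one output packet, so the accumulation keeps the bit rate unchanged, whereas a Von Neumann corrector discards part of its input.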

IV. PRELIMINARY ANALYSIS
All the physical experiments discussed in the following have been performed on the Xilinx Zynq XC7Z020 SoC of a ZedBoard board, equipped with a 28 nm Artix-7 based programmable logic and a general-purpose ARM-based processor. First of all, we separately analyze the impact of the various TRNG components (i.e., the DCM phase shifting, the race condition on CARRY4, the Post Processing) on the sequence randomness, by selectively enabling/disabling them and analyzing 10-Mbit random sequences with the AIS T8 entropy test. The results collected in Table I show that the effect of each component is additive: the bit entropy obtained by the joint application of two or more techniques is always higher than that obtained by each one when applied separately. The DCM shifting functionality has the highest impact on the sequence randomness compared to the race condition on CARRY4, which alone is not enough to achieve a significant entropy. As expected, the proposed Post Processing scheme notably increases the bit entropy; however, it lets the TRNG exceed the minimum pass threshold (i.e., 0.987 dictated by the T8 test) only when the bit entropy of the raw sequence is already significantly high (i.e., when the DCM and the race condition on CARRY4 are both enabled). Fig. 4 depicts the bit entropy of the raw bitstream collected at the node T for different numbers of phase shifts of the DCM. Interestingly, the entropy of such a sequence shows a maximum value corresponding exactly to the number of phase shifts performed by the DCM when the proposed auto-tuning procedure is controlled by the FSM, thus confirming the validity of our approach.
To investigate the start-up behavior of the proposed TRNG, the autocorrelation of 200 different 1-Mbit start-up sequences and the correlation between all the possible sequence pairs have been calculated. Fig. 5 shows three examples of autocorrelation histograms for different lags and the obtained correlation matrix. The maximum absolute value of all the autocorrelation coefficients (of the correlation matrix) has been found to be merely 0.014 (0.02), thus supporting the correct start-up behavior of the proposed TRNG.
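The kind of start-up analysis described above can be reproduced in software with standard estimators. The Python sketch below assumes the usual sample autocorrelation coefficient and the Pearson correlation coefficient; the exact estimators used for Fig. 5 are not stated in the text, so these are stand-ins. The function names and the short test sequences are illustrative only.

```python
def autocorrelation(bits, lag):
    """Sample autocorrelation coefficient of a 0/1 sequence at the given lag."""
    n = len(bits)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(bits) - 1")
    mean = sum(bits) / n
    var = sum((b - mean) ** 2 for b in bits)
    if var == 0:
        raise ValueError("constant sequence: autocorrelation is undefined")
    cov = sum((bits[i] - mean) * (bits[i + lag] - mean) for i in range(n - lag))
    return cov / var


def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("constant sequence: correlation is undefined")
    return cov / (var_a * var_b) ** 0.5


if __name__ == "__main__":
    alternating = [0, 1] * 500          # strongly structured, for contrast
    inverted = [1 - b for b in alternating]
    print(autocorrelation(alternating, 1), autocorrelation(alternating, 2))
    print(pearson(alternating, inverted))
```

A strongly periodic sequence gives coefficients close to ±1, while values such as those reported above, close to 0, indicate that neither lagged copies of a sequence nor different start-up sequences are linearly related.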

V. EXPERIMENTAL RESULTS
One thousand consecutive 1-Mbit random sequences have been generated and acquired, for a total of 1 Gbit of random data, at different temperatures by means of a temperature chamber. In the first experiments, the maximum clock frequency achieved by the DMA (i.e., 100 MHz) has been used for the entire architecture of Fig. 2. The sequences have then been analyzed with the NIST SP 800-22 and AIS statistical tests. All tests have been successfully passed under all the operating conditions. The most relevant results are collected in Fig. 6. It shows that the proposed TRNG design is not systematically influenced by the temperature. Tests performed with the NIST SP 800-90B suite also showed that our TRNG achieves a minimum (maximum) $H_{\infty}$ entropy of 0.937 for the t-tuple test (0.998 for the MultiMMC prediction test), while its mean $H_{\infty}$ entropy over all the tests is 0.979. Then, to prove the correct operation of the proposed TRNG also at its 300 MHz maximum frequency, the architecture of Fig. 2 has been provided with two separate clock domains to guarantee that, while the TRNG core is clocked at 300 MHz, the DMA of the testing infrastructure operates at 100 MHz. Also in this case, all the NIST SP 800-22 tests have been passed, with a minimum proportion of 98.1%. Table II collects the results obtained for several competitors in terms of the resource requirements, the maximum operating frequency (O.F.), the metric Thr·/(Slice·OF) introduced in [4], and the worst proportion value and the entropy obtained from the NIST SP 800-22 tests and AIS T8, respectively. First of all, it is worth noting that among the compared designs only [5], [7], [8] and the proposed solution pass all the NIST SP 800-22 tests. Moreover, our TRNG achieves the highest throughput and one of the highest Thr·/(Slice·OF) values, which is more than one order of magnitude higher than that shown by the other DCM-based technique [14].
Only [12] shows a higher value of the comparison metric (+13%), but its throughput is considerably lower than that of the proposed TRNG (−96%). The bit entropy of the proposed TRNG is comparable to that achieved by the competitors, and even better at the 100 MHz clock frequency. The TRNG described in [13] has a comparable maximum throughput but its proportion is only 70%. Table III reports a comparison in terms of power dissipation. A direct comparison is possible only with the architecture in [12], and the results show that the energy per generated bit consumed by the proposed TRNG is about 47% lower than that of [12].
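To put the minimum proportion of 98.1% reported for the 300 MHz experiment in context, the acceptance range that NIST SP 800-22 defines for the proportion of passing sequences can be evaluated directly. The Python sketch below assumes the default significance level α = 0.01 and a sample of 1000 sequences, as in the 100 MHz experiment; the text does not state either value explicitly for the 300 MHz run.

```python
import math


def min_pass_proportion(num_sequences, alpha=0.01):
    """Lower bound of the NIST SP 800-22 proportion range:
    p_hat - 3 * sqrt(p_hat * (1 - p_hat) / m), with p_hat = 1 - alpha.
    """
    p_hat = 1.0 - alpha
    return p_hat - 3.0 * math.sqrt(p_hat * (1.0 - p_hat) / num_sequences)


if __name__ == "__main__":
    bound = min_pass_proportion(1000)
    print(f"minimum acceptable proportion: {bound:.4f}")  # prints 0.9806
    print("98.1% is within the range:", 0.981 >= bound)   # prints True
```

Under these assumptions the lower bound is about 98.06%, so the reported worst-case proportion of 98.1% lies within the acceptance range.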
As a case study, we used our TRNG output for the generation of random one-time passwords (OTPs), each one composed of 10 ASCII symbols. We collected 40 sequences of 80 bits (i.e., 40 OTPs) to test their safety on the website [23]. All the OTPs were classified as "very strong password", with a claimed mean hacking time of 14M years with a home computer.
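One possible way to turn an 80-bit sequence into a 10-symbol OTP is sketched below in Python. The mapping is an assumption made for illustration, since the text does not describe how bits are converted into ASCII symbols: each group of 8 bits is read MSB first and reduced modulo the size of a 94-symbol printable alphabet. The names `ALPHABET` and `bits_to_otp` are hypothetical.

```python
import string

# 52 letters + 10 digits + 32 punctuation marks = 94 printable ASCII symbols.
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def bits_to_otp(bits):
    """Map 80 random bits to a 10-symbol password (8 bits per symbol)."""
    if len(bits) != 80:
        raise ValueError("exactly 80 bits are required")
    symbols = []
    for k in range(0, 80, 8):
        byte = 0
        for b in bits[k:k + 8]:
            byte = (byte << 1) | (b & 1)  # first bit is the MSB (assumption)
        # Modulo reduction is simple but slightly biased: 256 is not a multiple of 94.
        symbols.append(ALPHABET[byte % len(ALPHABET)])
    return "".join(symbols)


if __name__ == "__main__":
    print(bits_to_otp([1, 0] * 40))
```

The modulo step makes some symbols slightly more likely than others; rejection sampling would remove this bias at the cost of consuming more random bits per password.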

VI. CONCLUSION
A new design of a DCM-based TRNG for easy implementation on FPGA devices has been presented. It exploits the dynamic capability of the DCM hardware primitives to finely tune the phase difference between two clock signals. The metastability induced by these signals is used as a randomness source. The required phase difference is automatically set by a simple FSM. A smart use of the CARRY4 hardware primitive further increases the randomness of the generated bits. Finally, a low-latency on-chip post-processing scheme is also presented. The proposed TRNG architecture has been implemented on the Xilinx Zynq XC7Z020 System on Chip (SoC). It passed all the NIST SP 800-22 and AIS tests, showing a throughput that is considerably higher than those of previously published DCM-based TRNGs.