Exploring the Usage of Fast Carry Chains to Implement Multistage Ring Oscillators on FPGAs: Design and Characterization

Ring oscillators (ROs) serve as basic building blocks in many application scenarios, where they must ensure high reliability, flexibility, and a low area/energy footprint. With the recent advances in Internet-of-Things (IoT) technology, in particular, the need to endow interconnected devices with security facilities has increased as well. In this context, the efficient implementation of ROs on field-programmable gate arrays (FPGAs) is crucial, even though it hides some pitfalls. This article presents a new design strategy for multistage ROs relying on the carry chains (CCs) available in modern FPGA devices. Several configurations of ROs designed as proposed here have been characterized in terms of hardware costs, jitter, and temperature/voltage sensitivity. In all the evaluated cases, the proposed design achieves predictable routing schemes through automatic place and route (P&R), while reducing slice occupancy and energy consumption by up to 50% and 44%, respectively, in comparison with traditional lookup table (LUT)-based ROs. When realized on an Artix-7 device, the basic version of the proposed oscillator, built with 33 inverting stages, provides multiphase outputs oscillating at 29.7 MHz with a standard deviation of less than 10 kHz. The analysis also demonstrates the high flexibility of the novel circuits, such as the possibility of easily changing their behavior depending on the target application requirements. As an example, by exploiting additional pass-through elements, the proposed scheme achieves a sensitivity of 49 kHz/°C, which is more than 4 times higher than that shown by the corresponding traditional LUT-based competitor, thus making it more suitable for thermal monitoring applications.

With the rapid expansion of the Internet-of-Things (IoT) network, the need for miniaturized objects embedded with electronics, software, and sensors has increased as well [2]. In this context, field-programmable gate array (FPGA) technology has become one of the most popular implementation platforms because of its high flexibility and attractive computation capabilities, especially for the implementation of heterogeneous systems-on-chips (SoCs) that exploit both programmable logic fabric and microprocessors [14]. In the near future, it is expected that in several application fields, such as smart cities, connected vehicles, healthcare systems, and even data centers, more and more infrastructures will benefit from the synergy between the IoT approach and FPGA technologies. This increase in interconnected devices pushes the demand for protecting information down to the chip level. Indeed, despite the incessant progress in the fabrication of programmable logic devices, security is nowadays a crucial issue. As an example, preventing hard faults due to either external attacks or FPGA performance degradation caused by aging mechanisms is mandatory before enabling such infrastructures in our daily lives. Also, FPGA bitstream encryption is essential for anticounterfeiting purposes. However, the conventional nonvolatile memory (NVM)-based approach for storing the secret key is vulnerable to reverse engineering and side-channel attacks. In this regard, several approaches have gained popularity in the last decade. The most common solutions in the literature include physically unclonable functions (PUFs) for hardware authentication [1], [2], [3], true random number generators (TRNGs) for on-the-fly session key generation [4], [5], voltage sensors for the detection of remote side-channel attacks [7], and on-chip aging estimation circuits [9], [10]. All these methods rely on ROs that have to be efficiently implemented within the FPGA device.
Although the functionality of an RO is based on quite simple circuitry, its desired hardware characteristics, and consequently its design, strongly depend on the target application. Just as an example, ROs used to implement a PUF circuit must ideally be insensitive to different environmental conditions (e.g., voltage variations) [15], which is exactly the opposite of the behavior expected from a voltage sensor [7]. Often, such designs make use of multiple RO instances, for which ensuring nominally identical frequency behaviors is mandatory [16], [17]. Therefore, achieving a relatively good prediction of the nominal RO frequency as a function of its length, at the design stage, is crucial. However, as is well known, SRAM-based FPGAs rely on programmable logic and configurable routing resources: during the implementation phase, unless specific user constraints are given, the place and route (P&R) tool explores the available design space relative to the target chip and adopts the default floor-planning strategy to automatically select the most suitable resources, place them onto specific sites, and configure the interconnections between logic blocks accordingly. As a result, implementing ROs on such devices with easy-to-tune, predictable, and repeatable behaviors is not trivial.
The above considerations motivate the focus of this work. Furthermore, many previous research works confirm that FPGA-based ROs designed by exploiting lookup table (LUT) primitives require manual P&R in order to achieve reliable and effective implementations [7], [16], [17], [18], [19]. In addition, because its circuitry consists just of chained LUTs and routing resources, the conventional RO design demonstrates poor adaptability to different application requirements and relatively low flexibility in fine-tuning the oscillator behavior. With the aim of overcoming these limitations, this article presents a new design methodology to efficiently deploy ROs on FPGA devices. It uses, in an unconventional manner, the dedicated fast carry chain (CC) resources available within modern FPGA chip families [20], [21] in place of configurable LUTs. As a result, the automatic P&R is driven by a dedicated interconnection scheme, thus ensuring predictable and repeatable behaviors without any user constraint. To the best of our knowledge, this is the first proposal of using CCs to realize multistage and multiphase ROs. The proposed solution significantly simplifies the design and allows better frequency fine-tuning and higher flexibility to meet the specific application requirements. A comprehensive study, including hardware characterization, intra-/inter-die analysis, and evaluation of sensitivity to voltage/temperature variations, has been conducted to demonstrate the effectiveness of the proposed RO design methodology. Results highlight that, in addition to the advantages in terms of reduced design complexity, high flexibility, and independence of the performance from the output load, an RO designed as proposed here is cheaper and consumes less energy than the traditional LUT-based counterpart.
The rest of this article is organized as follows. Section II provides a brief background and overviews the state of the art. Section III introduces the proposed RO design methodology, whereas its mathematical model for frequency estimation is described in Section IV. Section V presents experimental results obtained for ROs of different lengths based on the conventional and new design methodologies. An application case study exploiting ROs based on the proposed design methodology is also discussed in Section VI. Finally, conclusions are drawn in Section VII.

II. BACKGROUND AND RELATED WORKS
According to the literature review from [22], LUTs are the most commonly adopted resources for implementing ROs on FPGAs. These hardware primitives can be configured at design time to map any m-input Boolean function, with m depending on the target FPGA technology. In order to implement an N-stage RO, N LUTs are connected in cascade, as shown in Fig. 1(a). In such a case, with the generic LUT_i configured as depicted in Fig. 1(b), not(I0) is output regardless of the values assumed by the I1 and I2 inputs. The frequency of the RO combinatorial loop depends on the propagation delay τ_p across the overall path, including the logic (τ_logic) and net (τ_net) contributions. The latter is, in turn, influenced by the FPGA sites selected for placement, the length of the routed interconnections, and the number of pass transistors (PTs) enabled through the programmable switch matrices.
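The LUT configuration described above can be illustrated with a short Python sketch that derives the truth-table word of a three-input LUT whose output is not(I0). The function names are hypothetical, and the bit ordering (bit `addr` of the word holds the output for inputs I2 I1 I0 read as a binary address) is an assumption based on the convention commonly used for Xilinx LUT INIT attributes; it is not taken from Fig. 1(b).

```python
def lut_init(func, m):
    """Return the truth-table word of an m-input LUT implementing func.

    Assumed convention: bit `addr` of the word holds the LUT output for the
    inputs (I[m-1], ..., I1, I0) read as the binary number `addr`.
    """
    init = 0
    for addr in range(2 ** m):
        inputs = [(addr >> i) & 1 for i in range(m)]  # inputs[0] is I0
        if func(*inputs):
            init |= 1 << addr
    return init


def inverter_on_i0(i0, i1, i2):
    # The output depends only on I0; I1 and I2 are don't-care inputs.
    return 1 - i0


init = lut_init(inverter_on_i0, 3)
print(f"INIT = 0x{init:02X}")  # prints "INIT = 0x55"
```

Every even address (I0 = 0) maps to 1 and every odd address maps to 0, which is why the word alternates between set and cleared bits.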
One of the major challenges in designing these architectures is that the layout obtained from the P&R phase is not known a priori, unless specific user constraints are given. The automatic P&R floor-planning strategy exploits complex heuristic searches aimed at identifying a limited design space with balanced characteristics in terms of delay and interconnection density [24]. Interfering with this process in order to drive the P&R toward a desired and predictable implementation is not easy to put into practice. The complexity of the problem significantly increases for those applications that involve the usage of multiple identical ROs [1], [4], [7], [8], [9], [10], [11], running at a specific frequency and in conjunction with surrounding logic circuitry [25]. In such a case, specific placement constraints that assign RO resources to a locked region within the FPGA chip could be used in order to avoid undesired packing of multiple LUTs. However, such a strategy may not be enough. Indeed, neither the position of the generic LUT_i nor the configured interconnections are locked within the constrained region, potentially leading to different RO frequencies at each new implementation run. As a result, the design of LUT-based ROs with predictable and repeatable frequency behaviors must necessarily pass through manual P&R [7], [16], [17], [18], [19]. Moreover, given that the RO output is usually retrieved by tapping the combinatorial loop, the propagation delay τ_p is also influenced by the load capacitance [26], [27], [28]. This aspect represents a further challenge for many applications. Just as an example, in the case of PUFs, multiple ROs must have identical nominal characteristics, as well as marginal and balanced contributions due to the load capacitance, so that the small frequency differences due to process variations can be detected adequately [26].
In the recent past, various alternatives to conventional LUT-based oscillators have been demonstrated [8], [22], [29], [30], [31]. Burgiel et al. [29] proposed using the input/output buffer (IOBUF) primitives available in Xilinx FPGAs inside the loop of a conventional LUT-based multistage RO. As illustrated in Fig. 2(a), this proposal uses the IOBUF element to drive the I/O pad pin, thus allowing the RO frequency to be tuned by changing the drive strength and slew rate. However, the usage of an IOBUF element makes the RO more sensitive to temperature variations, still requires manual P&R, and does not allow granular fine-tuning of the RO frequency. In fact, changing the slew rate of the IOBUF from SLOW to FAST introduces a frequency shift of approximately 3-10 MHz that cannot be modulated (similar considerations apply to changes of the drive strength).
All other alternative proposals discuss the realization of single-stage oscillators. Some of them [8], [30], [31] implement noncombinatorial loops by using sequential elements. Their main advantages are the compactness and simplicity of the circuitry that, in its most basic design, consists of just one feedback latch/flip-flop (FF). However, the effectiveness of this architecture has been successfully demonstrated only for TRNG applications [30], where the metastability produced by the feedback latch is exploited as an entropy source. In contrast, La et al. [22] evaluated the possibility of implementing combinatorial loop ROs by using the digital signal processing (DSP) and Mux resources available within modern FPGA devices, as schematized in Fig. 2(b) and (c). Both these schemes rely on dedicated resources that reduce the design effort mainly because of their compactness, but, as a drawback, they are suitable only for the implementation of single-stage ROs with a fixed frequency. The design methodology proposed in this article, instead, exploits CCs in a more effective manner and enables the realization of multistage ROs whose behavior can be easily configured to comply with the requirements of the target application.

III. PROPOSED RO DESIGN
Nowadays, CCs are available to various extents in most FPGA devices produced by many vendors [20], [21]. They consist of hard-wired resources specialized for the efficient on-chip implementation of arithmetic operations, such as additions and multiplications. A CC typically consists of k internal stages of cascaded multiplexers that, in combination with auxiliary XOR gates, implement as many full adders, each exploiting the basic 1-bit carry look-ahead logic. Most importantly, CCs are placed regularly within the chip, so that longer chains can be implemented by cascading multiple instances through dedicated routing. Our proposal aims at exploiting this unique property for the implementation of ROs. Indeed, during the automatic P&R step, the position of the first CC can be used as the "anchor point" for the whole design; then, both cascaded CCs and nets are placed and routed through a delay-deterministic scheme, thus avoiding the need for manual P&R.
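To make the multiplexer-plus-XOR structure concrete, the following Python sketch models one carry-chain stage and uses k = 4 of them as a ripple adder. It is a behavioral illustration only: the function names are hypothetical, and the stage behavior (the multiplexer forwards the carry-in when its select is 1 and the data input otherwise, while the XOR gate combines the select with the carry-in) is an assumption based on typical FPGA carry logic rather than a description taken from the figures of this article.

```python
def carry_stage(s, di, ci):
    """One carry-chain stage: a multiplexer plus an XOR gate (assumed model).

    s  : select/propagate signal, normally produced by a LUT
    di : data input forwarded to the carry line when s is 0
    ci : carry coming from the previous stage
    """
    co = ci if s else di  # multiplexer driving the carry line
    o = s ^ ci            # XOR gate producing the stage output
    return o, co


def add_with_carry_chain(a, b, k=4):
    """Add two k-bit integers using one carry-chain stage per bit."""
    carry, total = 0, 0
    for i in range(k):
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        s = a_i ^ b_i  # propagate signal of a full adder
        o, carry = carry_stage(s, a_i, carry)
        total |= o << i
    return total | (carry << k)  # final carry becomes the top bit


print(add_with_carry_chain(9, 7))  # prints 16
```

In this model, tying s to a constant 1 turns the XOR gate into an inverter of the carry signal, which is the kind of unconventional use that an oscillator built on these resources relies on.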
Based on the above considerations, the design illustrated in Fig. 3 (in the following, named CC_ROv1) is here proposed as the basic configuration that enables the CC unit to operate as an oscillator. Without loss of generality, the architecture includes x CCs, each composed of k = 4 internal stages. In order to enable the propagation of the oscillating signal along the circuit, all the internal stages must be used for each CC, except for the last one, where the designer can choose the convenient number of stages to be exploited. Accordingly, the selectors of the multiplexers (MXs), except for the first one, are set to constant values following the scheme shown in Fig. 3. Thus, while the multiplexers in the even positions propagate the signal coming from the previous multiplexer, those in the odd positions transfer onto the carry line the signal coming from the XOR gate in the previous stage. In this way, a chain of XOR gates, each acting as an inverter stage, is formed. Finally, to make the number of inverting stages odd, the first LUT, highlighted in gray, maps a NAND function: the low-to-high transition of the en signal triggers the first inversion through the LUT, which is then propagated to the XOR gate in order to produce O_0. It is important to note that the function mapped within the first LUT has to be chosen based on the configuration of the other components (e.g., selectors and inputs of the multiplexers). We verified that such changes in the content of the first LUT do not significantly affect the behavior of the RO architecture. In the subsequent chain positions, the proposed scheme allows the oscillating signal to be propagated to the next stage by alternating the outputs from the XOR gate and the multiplexer, respectively. As a result, the CC_ROv1 configuration actually includes y = 2x + 1 inverting stages, highlighted in gray in Fig. 3.
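The relation y = 2x + 1 and the need for an odd number of inversions can be checked with a few lines of Python. This is a minimal sketch with hypothetical function names; each inverting stage is modeled simply as an XOR with a constant 1, which is an illustrative simplification and not the exact wiring of Fig. 3.

```python
def inverting_stages(x):
    """Number of inverting stages y of a CC_ROv1 built from x carry chains."""
    return 2 * x + 1


def chain_positions(x, k=4):
    """Total number of carry-chain positions for x chains of k stages each."""
    return k * x


def propagate(value, stages):
    """Pass a logic value through `stages` inverters (XOR with constant 1)."""
    for _ in range(stages):
        value ^= 1
    return value


for x in (1, 2, 4, 8, 16):
    y = inverting_stages(x)
    # With an odd y the fed-back value is always the complement of the input,
    # so the loop has no stable state and keeps toggling.
    print(x, y, chain_positions(x), propagate(0, y))
```

For x = 16 the sketch gives y = 33 stages over 64 chain positions, while an even stage count would feed back the same value and latch instead of oscillating.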
From the schematic of Fig. 3, it can be observed that the remaining LUTs are not necessary to produce the oscillation, but they could be exploited as pass-through elements to fine-tune the RO period. In such a case, the external multiplexers aligned to the even CC positions can be used to select the corresponding LUT outputs, as depicted in Fig. 4. This choice expands the design space, leading to the new configurations CC_ROv2 and CC_ROv3, where one or two LUTs per CC are enabled, respectively. As analyzed in depth in the following, the proposed configurations exhibit different characteristics in terms of oscillation frequency and sensitivity to voltage/temperature variations. Besides the interesting properties mentioned above, as shown in Fig. 3, the proposed scheme can provide the oscillating output RO_out through one or more of the unused XOR gates, such as that at position kx − 1, thus avoiding interference with the interconnect loop. This allows the realization of efficient multiphase oscillators suitable for applications that need load-independent RO frequencies [1], [26], [28].

IV. MODELING THE PROPAGATION DELAY
In this section, we introduce the mathematical model for the frequency estimation of the proposed RO scheme. Let us consider the path involved in the combinatorial loop of the CC_ROv1 configuration, as highlighted in blue in Fig. 5. Then, the RO frequency f_RO can be computed as 1/(2τ_p) by applying (1) and (2) for modeling τ_p. Therein, τ_{i→j} represents the delay contribution associated with the generic segment i → j in the path, τ_LUT is the LUT access delay, and τ_loop is the interconnection delay due to the net named loop in Fig. 5. According to the adopted FPGA technology, the delay switching characteristics at the nominal conditions are provided by the vendor for most of the contributions highlighted in Fig. 5, except τ_{O_0→B_1}, τ_{O_2→B_3}, and τ_loop. Table I summarizes the values of each contribution τ_{i→j}, with reference to the FPGA chips belonging to the Xilinx Artix-7 (speed grade −1) family; a similar approach could be adopted with different devices. The four examined cases, named in the following fast-fast (FF), fast-slow (FS), slow-fast (SF), and slow-slow (SS), take into account, respectively, the process corner (first letter) and the standard delay format (SDF) adopted for delay modeling (second letter).
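The relation f_RO = 1/(2τ_p) can be exercised numerically with the short Python sketch below. The function names are hypothetical, the loop delay is assumed to be a plain sum of per-segment contributions, and the segment values are placeholders chosen only for illustration; they are not the entries of Table I.

```python
def ro_frequency(tau_p):
    """Oscillation frequency in Hz for a loop propagation delay tau_p in s."""
    return 1.0 / (2.0 * tau_p)


def loop_delay_for(frequency):
    """Inverse relation: loop propagation delay in s for a frequency in Hz."""
    return 1.0 / (2.0 * frequency)


# Placeholder per-segment delays in ns (assumed values, NOT from Table I).
segments_ns = [0.5, 0.3, 0.2, 0.4]
tau_p = sum(segments_ns) * 1e-9  # assumed: contributions simply add up
print(f"f_RO = {ro_frequency(tau_p) / 1e6:.1f} MHz")

# A 29.7 MHz oscillator corresponds to a loop delay of roughly 16.8 ns.
print(f"tau_p = {loop_delay_for(29.7e6) * 1e9:.2f} ns")
```

The factor of 2 reflects the fact that a full oscillation period requires the signal to travel around the loop twice, once for each output level.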
Table II, instead, reports the τ_{O_0→B_1} and τ_{O_2→B_3} delays related to the specific tracks illustrated in Fig. 5. Although related to programmable routing, such contributions can be considered as constants, regardless of the RO length, since the start point and the endpoint of the interconnection are fixed by the adopted architecture, as will be detailed later.
It is worth noting that the τ_loop delay depends not only on how many stages are involved in the oscillator, but also on the interconnection length and the number of pass transistors crossed along the loop path. Without loss of generality, the dependence between τ_loop and the number of inverting stages y is here derived for the proposed RO design by adopting a procedure similar to [32].
The above model has been validated by comparing its predictions with results obtained from postimplementation timing reports for several CC_ROv1 samples, with y ranging from 3 to 33. On average, the achieved error is lower than 0.25%. It is worth noting that the model discussed in this section relies on production-level device specifications furnished by the manufacturer. To provide high accuracy, such parameters are released once enough production silicon of a particular device family member has been characterized; thus, no significant risk of underestimating the delays exists.
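A validation of this kind can be sketched as a mean relative error between model predictions and timing-report values. The metric definition below is an assumption (the text does not state the exact formula used), the function name is hypothetical, and the sample values are placeholders rather than measured data.

```python
def mean_relative_error(predicted, reference):
    """Average of |predicted - reference| / reference, in percent."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - r) / r for p, r in zip(predicted, reference)]
    return 100.0 * sum(errors) / len(errors)


# Placeholder frequencies in MHz (illustrative values only).
model_mhz = [250.2, 160.1, 90.3]
report_mhz = [250.0, 160.0, 90.0]
print(f"mean error = {mean_relative_error(model_mhz, report_mhz):.3f}%")
```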

A. Hardware Implementation
The CC- and LUT-based designs with various lengths have been implemented by using the Vivado 2018.3 Development Tool. As an example, Fig. 7 illustrates the layout obtained for the proposed CC_ROv1 configuration with y = 5 when the Xilinx xc7a100tcsg324-1 FPGA is selected as the target device. In all analyzed cases, using the same hardware description language coding, the P&R tool operates without any additional manual constraints and achieves a predictable routing scheme. This property comes from the carry chain itself, which autoconstrains both the placement and the routing paths between the internal nodes of the oscillator architecture. In contrast, conventional LUT-based designs require the designer to manually place and route each stage of the oscillator, using a regular distribution over consecutive slices. In particular, from Fig. 7, it can be noted that, while the connection between adjacent CCs is based on hard-wired routing (blue line), the interconnections between the XOR gates and the multiplexers rely on fast neighborhood routing tracks and proceed through identical paths across multiple CC stages (purple lines). On the other hand, the interconnections external to the CCs, i.e., O_0→B_1, O_2→B_3, and the loop, rely on the programmable routing. However, as visible in Fig. 7, such nets connect neighboring elements through a substantially (and automatically) constrained path.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
[...] and ⑤ are actually the only programmable junction points. However, under default routing order rules, the P&R tool selects them as preferred waypoints. In very congested layouts, as is common practice, a physical block (PBlock) that prevents other nets from being routed through the switchboxes used by the oscillator could be easily exploited.
Table III summarizes the nominal frequency at the four corners, the area occupancy, and the energy per transition (EPT) of the proposed RO designs. The latter exhibit different energy/area characteristics at a given y, depending on the number of additional LUTs used in the various configurations. In general, the higher the number of stages of the RO, the higher the cost of fine-tuning the behavior of the oscillator. Just as an example, the energy overheads of the CC_ROv2 configuration with respect to the CC_ROv1 one are 26%, 34.8%, and 38.79% for y = 5, y = 9, and y = 17, respectively. Similarly, when moving from the CC_ROv2 to the CC_ROv3 design, energy overheads of 22.2%, 25.4%, and 28.2% must be paid for the y = 5, y = 9, and y = 17 lengths, respectively. Table III also compares the new ROs with the LUT-based counterparts at different lengths. At a glance, it can be observed that the proposed CC-based ROs reduce both the number of occupied slices and the energy consumption with respect to the LUT-based counterparts at similar frequencies. Just as an example, the CC_ROv2 (y = 5) design runs at a frequency close to that of the nine-stage LUT-based RO, but it uses 33.3% fewer slices and dissipates 13% less energy. Overall, the slice and energy savings exhibited by the proposed CC_ROv1 and CC_ROv2 over the LUT-based implementations span, respectively, from 11.1% to 50% and from 13% to 44%. On the other hand, the CC_ROv3 circuits expand the space of possible frequencies without requiring additional slices, in contrast to the conventional RO designs. To achieve

B. Test Setup
In the next subsections, in order to characterize the behavior of both CC- and LUT-based ROs and to assess the general validity of the proposed methodology, we report the results of experimental tests performed on 28-nm CMOS Artix-7 xc7a100tcsg324-1, 28-nm CMOS Zynq xc7z045ffg900-2, 40-nm CMOS Virtex-6 xc6vlx240tffg1156-1, and 16-nm FinFET Ultrascale+ xck26sfvc784-2lvc devices. For this purpose, a Tektronix MSO64 mixed signal oscilloscope (2.5 GHz) has been used to capture and analyze the output signals. A Keithley 2280S-60-3 precision measurement dc supply has been used to power the devices under test and to monitor the current drawn by the FPGA boards. For the purpose of a reliable analysis, frequency measurements were repeated for at least 10^5 cycles; then, the average values and the standard deviations were recorded, thus avoiding artifacts due to stochastic fluctuations. An oscilloscope screenshot is depicted in Fig. 8. In this case, the configuration under test is the CC_ROv1 with y = 33, obtained by cascading x = 16 CCs. Two phase-shifted output signals are retrieved from the XOR gates in the 31st and 63rd positions within the chain, as shown in Fig. 8(c). As expected, these signals have the same frequency, with a mean value µ of around 29.7 MHz and a standard deviation σ of ∼10 kHz. The histogram plot in Fig. 8(b) reports the Gaussian frequency distribution retrieved from 115 080 repetitive measurements, which achieves a σ/µ (%) of ∼0.03%. The latter is in line with the literature [33] for conventional LUT-based ROs. In the analyzed case, the time difference between the two output signals is ∼7.6 ns, and the histogram plot in Fig. 8(a) illustrates the

C. Temperature Sensitivity
As is well known, temperature variations have a twofold impact on MOSFET devices, with the carrier mobility and the threshold voltage decreasing as the temperature increases due to scattering and to the Fermi level and bandgap energy shift, respectively. At the device level, such mechanisms coexist, impacting the current in opposite directions. In static CMOS gates operating in the above-threshold region, the first one is dominant, thus leading to an increase of the delay as the temperature increases; in PT logic circuits, instead, they balance out quite differently. In such a case, the carrier mobility reduction is significantly counteracted by the lower threshold voltage, which, in turn, leads to a reduction of the propagation delay. This behavior is expected to be much more evident when ROs are realized on FPGA platforms that make massive use of PT logic circuits [34], [35]. Through transistor-level simulations performed by using a standard CMOS process technology, we investigated the effects of the coexistence of PT and static CMOS logic stages, which typically occurs in the target platform. Although based on some speculations about the FPGA internal architecture [36], the simulation results show that, in such a case, the temperature influences the RO frequency quite differently than in traditional static CMOS circuits. Fig. 9 illustrates the implementation of the single LUT [36] adopted in the simulations and the plots of the RO frequency as the temperature changes. It can be observed that, due to the massive usage of PT stages within the FPGA fabric, the effect of the threshold voltage initially prevails. Then, as the temperature increases, the two phenomena tend to balance each other out, until the carrier mobility reduction becomes the dominant effect. Moreover, from our simulations, we verified that increasing the number of PT stages in the RO path moves the local maximum toward higher temperatures.
In order to evaluate the impact of temperature on the analyzed ROs, we performed a measurement campaign using the ACS DY16-T climatic chamber, varying the temperature from 5 °C to 75 °C. The acquisition of the RO outputs was always performed after the thermal transient had concluded. For this purpose, the die temperature was monitored through the internal sensor and the precision measurement dc supply, which also allowed verifying that the standard deviation of the absorbed current is below 10 nA. Besides the LUT-based reference designs, the CC_ROv1, CC_ROv2, and CC_ROv3 configurations have been characterized for y = 3, 5, 9, and 17 at the temperatures [...] The latter is here defined as (4) and provides a synthetic metric that allows evaluating how, in the ranges of observation, the RO frequency changes as a consequence of the operating temperature with respect to the nominal condition. As evidence of the fact that more logic and routing resources contribute to the RO frequency, Fig. 10 highlights that VI_temp increases with the RO length for both the CC_ROv1 and LUT-based designs. Even more interesting are the results obtained for the proposed CC_ROv2 and CC_ROv3 configurations. As shown by the preliminary analysis discussed at the beginning of this section, the synergy between CCs and LUTs could be exploited to modulate the temperature behavior of the RO by using some LUTs as pass-through elements, as shown in Fig. 4. In such a case, the additional PT stages are responsible for the flipped trend within the range 25 °C-50 °C. As illustrated in Fig. 10, depending on the number of additional pass-through LUTs, the temperature behavior can be fine-tuned to increase or decrease the RO sensitivity with respect to the CC_ROv1 circuit. Indeed, the CC_ROv3 configuration exhibits a VI_temp of up to 0.4% and a sensitivity of 0.049 MHz/°C (y = 3 in the range 5 °C-50 °C), which is 4.45 times higher than that of the five-stage LUT-based counterpart. As a consequence, such a design is ideal for those applications that aim at monitoring performance degradation due to temperature [8], [9], [10], [11]. The CC_ROv2 configuration, instead, achieves the lowest VI_temp, regardless of the target length, thus becoming an effective candidate for the implementation of circuits requiring high resilience to temperature variations [1], [2], [3].
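The sensitivity figures quoted above can be related to one another with simple arithmetic, sketched here in Python. The function name is hypothetical, and the sensitivity is assumed to be the magnitude of the average frequency change per degree over the stated range; the text does not give the sign of the change, so only magnitudes are used.

```python
def sensitivity(f_start_mhz, f_end_mhz, t_start_c, t_end_c):
    """Average magnitude of the frequency change per degree, in MHz/°C."""
    return abs(f_end_mhz - f_start_mhz) / (t_end_c - t_start_c)


s_cc_rov3 = 0.049        # MHz/°C, CC_ROv3 with y = 3 (from the text)
t_low, t_high = 5, 50    # °C, range of observation (from the text)
ratio = 4.45             # CC_ROv3 vs. five-stage LUT-based RO (from the text)

# Total frequency excursion implied over the 45 °C span.
delta_f = s_cc_rov3 * (t_high - t_low)
# Sensitivity of the LUT-based counterpart implied by the ratio.
s_lut = s_cc_rov3 / ratio

print(f"excursion = {delta_f:.3f} MHz")
print(f"LUT-based sensitivity = {s_lut * 1e3:.1f} kHz/°C")
```

Under these assumptions, the CC_ROv3 frequency moves by about 2.2 MHz across the 5 °C-50 °C range, whereas the implied LUT-based sensitivity is about 11 kHz/°C.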
For a deeper analysis, we also investigated the impact of using different types of slices to implement the proposed ROs. The Xilinx 7 series FPGA devices, for example, are organized in configurable logic blocks (CLBs) that include two slices for each switch matrix, placed up and down, as shown in Fig. 7. Moreover, they include both logic-only slices (type L) and more complex slices that can also be used as distributed memory (type M). While the above characterization refers to the usage of L slices positioned at the bottom of the CLBs, Table IV reports the RO frequency measured at different temperature points when the placement schemes Up-L and Down-M are chosen on the xc7a100tcsg324-1 chip. It can be seen that, even though the three implementations are located within the same die region, the RO frequencies differ according to the slice type. This is more evident in the Up-L case, since the connections between each slice and the corresponding switch matrix follow different paths.
An intradie analysis has also been carried out by considering the Down-L placement for the proposed CC_ROv2 configuration within four different FPGA regions (i.e., 1: X0Y3, 2: X1Y2, 3: X1Y1, and 4: X0Y0). Fig. 11 plots the σ/µ (%) metric as a function of the temperature and points out two important observations. The first is that, as expected, different sites exhibit different absolute RO frequencies. Moreover, each site has its own characteristic under different temperature conditions, which suggests a uniqueness related to the process variations. The second is that, for a given site, the standard deviation σ remains almost stable, indicating that temporal fluctuation has little impact on the measured frequency. It is important to underline that this behavior has not been observed for the LUT-based RO, which showed σ values up to 36% higher under the same placement and temperature conditions.
In order to investigate the interdie behavior, the new CC_ROv2 design has been characterized on ten xc7a100tcsg324-1 chips at different temperatures. The resulting statistics are summarized in Fig. 12, where each point indicates the mean frequency value, computed by averaging the µ results from the ten chips, and the bars report the distance with respect to the highest/lowest observed frequency. At a glance, it can be noted that the mean frequency follows a trend similar to that observed so far. The range given by the bars, instead, demonstrates that the RO frequency may vary significantly from die to die as a result of global variations [33]. At the same time, when looking at different temperature conditions, the frequency range appears unchanged, thus suggesting that the temperature equally affects all the chips. Finally, with the aim of generalizing the applicability of the proposed strategy, and to show its thermal behavior on very different process technology nodes, Fig. 13 plots the temperature sensitivity of the 40-nm CMOS Virtex-6, 28-nm CMOS Artix-7, and 16-nm FinFET Ultrascale+ implementations of the CC_ROv1 design with y = 5. It can be clearly observed that the sensitivity to temperature variations is significantly reduced when advanced technologies, such as the Ultrascale+, are employed. This finding is in accordance with the conclusions reached in previous works [38], [39].
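The interdie statistics described for Fig. 12 (a mean over the per-chip µ values, with bars reaching the highest and lowest observed frequency) can be sketched as follows. The function name and the ten frequency values are hypothetical placeholders chosen for readability; they are not the measured data.

```python
def interdie_summary(mu_per_chip_mhz):
    """Return (mean, lower_bar, upper_bar) for per-chip mean frequencies.

    lower_bar / upper_bar are the distances from the mean to the lowest /
    highest per-chip value, i.e. the lengths of the error bars.
    """
    if not mu_per_chip_mhz:
        raise ValueError("at least one chip is required")
    mean = sum(mu_per_chip_mhz) / len(mu_per_chip_mhz)
    lower_bar = mean - min(mu_per_chip_mhz)
    upper_bar = max(mu_per_chip_mhz) - mean
    return mean, lower_bar, upper_bar


# Hypothetical per-chip mean frequencies (MHz) for ten dies at one temperature.
chips = [100.0, 101.0, 99.0, 102.0, 98.0, 100.5, 99.5, 101.5, 98.5, 100.0]
mean, lower_bar, upper_bar = interdie_summary(chips)
print(f"mean = {mean:.2f} MHz, bars = -{lower_bar:.2f} / +{upper_bar:.2f} MHz")
```

Repeating the computation at each temperature point and comparing the bar lengths is one way to check whether the die-to-die spread stays constant over temperature.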

D. Jitter Analysis
To evaluate the jitter behavior of the proposed ROs, we analyzed the variations of the running period within a certain time interval, i.e., the long-term jitter [37]. Table V reports the relative standard deviation of the RO period (σ_p/µ_p) measured for the proposed circuits at length y = 17 and for a 35-stage LUT-based sample, considering the xc7a100tcsg324-1 FPGA device. All the ROs under analysis have shown a Gaussian distribution of the oscillation period, which typically identifies random jitter. In general, it can be observed that the compared designs exhibit quite similar characteristics, with the CC_ROv2 and CC_ROv3 configurations slightly reducing σ_p/µ_p with respect to the LUT-based counterpart, regardless of the temperature.
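A minimal sketch of the σ_p/µ_p metric is given below. It assumes that the metric is the population standard deviation of a set of period samples divided by their mean, expressed as a percentage; whether the population or the sample estimator is used in the measurements is not stated here, so that choice is an assumption. The Gaussian period samples are synthetic and their parameters are arbitrary.

```python
import random
import statistics


def relative_period_jitter(periods_ns):
    """Return sigma_p / mu_p in percent for a list of period samples."""
    mu_p = statistics.mean(periods_ns)
    sigma_p = statistics.pstdev(periods_ns)  # population standard deviation
    return 100.0 * sigma_p / mu_p


# Synthetic (hypothetical) Gaussian periods: mean 10 ns, sigma 0.01 ns.
rng = random.Random(1)
samples = [rng.gauss(10.0, 0.01) for _ in range(10_000)]
jitter_pct = relative_period_jitter(samples)
print(f"sigma_p/mu_p = {jitter_pct:.3f} %")
```

For these synthetic samples the estimate should come out close to the 0.1% used to generate them.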

E. Voltage Sensitivity
To perform the voltage sensitivity analysis, we used the TI UCD90120A power controller to modify the internal core voltage within the range 0.8-1.12 V, with 1 V being the nominal condition. Because of the aggressive voltage scaling, the frequency variation measured for the new and LUT-based RO designs also spans a few tens of MHz. To better highlight the different behaviors exhibited by the referenced designs, the results are reported in Fig. 14 and Table VI. The latter also shows the VI_volt defined in (5). From this metric, it can be noted that the frequency of each implementation changes following a similar trend. The CC_ROv1 sensitivity to voltage variations appears to be slightly lower than that of the LUT-based counterparts at N = 5, 9, 17, and 35. The opposite trend is exhibited by the CC_ROv2 designs compared with the LUT-based ROs at N = 7, 11, 21, and 41. We speculate that this is due to the higher number of PT stages used in the path of the latter configuration.

VI. CASE STUDY: THE CC_ROV1 DESIGN FOR TRNGS
As an application example, and to verify the effectiveness of the proposed architecture, we used the CC-based RO in the realization of a TRNG. To this purpose, we slightly modified the design presented in [23] to accommodate a CC_ROv1 oscillator with y = 9. According to the results in Table V, such a configuration indeed exhibits a jitter quite similar to that of the oscillators, which can be exploited as an entropy source to generate random bit sequences. Fig. 15 illustrates the block diagram of the realized TRNG. The CC_ROv1 block is used to generate seven multiphase oscillating outputs at 108 MHz. These output signals are sampled by FFs receiving a clock signal generated by the DCM. The basic idea exploited in [23] is to tune the phase of this clock through the DCM in order to force one or more FFs to enter the metastability region. The input clock clk_in, running at 62.5 MHz, is sent to the DCM that, according to the control signal psen given by the FSM block, performs a phase shift every 20 cycles to produce the clk_out used as the clock signal for the sampling FFs. The phase shifting performed by the DCM continues until the signal T resulting from the XOR operation changes its value, meaning that a violation has occurred at the first stage of FFs. Random bits are, therefore, generated and stored within a 32-bit shift register that feeds the on-chip postprocessing circuit demonstrated in [23].
Table VII summarizes the results obtained by the NIST SP800-22 statistical tests for 64 M random bits acquired as 64 consecutive 1-Mbit sequences. All the statistical tests show a proportion and a P-value_T higher than the minimum pass thresholds set by the NIST SP800-22 standard, demonstrating that the proposed oscillator can be effectively employed in a TRNG able to produce a random bit sequence through a low-complexity design consisting of 37 LUTs, 124 FFs, 4 CCs, and 1 DSP. Finally, we performed the AIS-31 T8 test, which measures the randomness of the output bitstream in terms of byte entropy. The AIS test shows that the TRNG architecture of Fig. 15 achieves a byte entropy of 8.0, exceeding that of the recent LUT-based competitor scheme [6], which reaches a byte entropy of 7.996 with a somewhat similar energy efficiency.
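To make the pass criterion above more concrete, the sketch below implements two pieces of the NIST SP800-22 methodology: the frequency (monobit) test, which is only the first of the tests in the suite, and the minimum pass proportion for a given number of sequences at the default significance level α = 0.01. The function names are hypothetical; the sketch is not the test harness used to produce Table VII.

```python
import math


def monobit_p_value(bits):
    """P-value of the SP800-22 frequency (monobit) test for a 0/1 sequence."""
    n = len(bits)
    if n == 0:
        raise ValueError("empty sequence")
    s = sum(1 if b else -1 for b in bits)   # map 0 -> -1, 1 -> +1 and sum
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


def min_pass_proportion(num_sequences, alpha=0.01):
    """Lower bound of the acceptable proportion of passing sequences."""
    p_hat = 1.0 - alpha
    return p_hat - 3.0 * math.sqrt(p_hat * alpha / num_sequences)


# Short example sequence, used only to exercise the function.
example_bits = [int(c) for c in "1011010101"]
print(f"monobit P-value = {monobit_p_value(example_bits):.6f}")
print(f"min proportion for 64 sequences = {min_pass_proportion(64):.4f}")
```

For 64 sequences the lower bound evaluates to about 0.9527, so a test is considered passed in terms of proportion when at least roughly 95.3% of the 1-Mbit sequences yield a P-value of 0.01 or more.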

VII. CONCLUSION
This work presented a new design methodology to realize multistage ROs by means of the CC resources available within modern FPGA chip families. ROs realized as proposed here show the following benefits.
1) They enable automatic P&R while ensuring predictable and repeatable behaviors because of the dedicated interconnection scheme they rely on. Furthermore, their oscillation frequency is not affected by the load on the output node. These results are in contrast to those discussed in several prior works dealing with tedious manual layout actions of LUT-based ROs [7], [16], [17], [18], [19].
2) Because of the possibility to strategically add pass-through elements to the oscillating path, they can be easily configured to adjust their thermal/voltage behavior and/or their nominal frequency to comply with the specific application requirements. For instance, our experiments show that one of the proposed configurations is more suitable to work as a temperature sensor, whereas the other two solutions can better fit a PUF application because of their reduced temperature sensitivity.
3) The hardware description of the proposed ROs is straightforward, and its portability is ensured. To demonstrate this aspect, we characterized it on different chip technologies, including 40-nm CMOS Virtex-6, 28-nm CMOS Artix-7 and Zynq-7000, and finally 16-nm FinFET Ultrascale+.
4) As a final remark, the above benefits are achieved while reducing the LUT count and the energy consumption by up to 83% and 44%, respectively, with respect to the LUT-based solution.

Fig. 3. Proposed CC-based multistage RO design (CC_ROv1 configuration). Inverting stages are realized through the XOR gates and the LUT highlighted in gray. Shadowed LUTs are exploited to permanently set the selectors of the corresponding multiplexers. Forward propagation of the oscillation signal relies on the chain of MX elements.

Fig. 4. Proposed CC-based multistage RO design. (a) CC_ROv2. (b) CC_ROv3. Additional LUTs are used as pass-through elements to enable fine-tuning of the RO characteristics. O′_i signals are delayed copies of O_i.

Fig. 5. Propagation delay path for the CC_ROv1 design. Red labels identify the interconnection segments as reported in the technical documentation for delay estimation purposes.

Fig. 6. τ_loop as a function of y at different corners.

Fig. 14. Frequency plot obtained for the CC- and LUT-based ROs under different supply voltages (xc7z045ffg900-2 FPGA device). Subplots refer to different lengths and similar nominal frequencies: y = 3 (top left), y = 5 (top right), y = 9 (bottom left), and y = 17 (bottom right). For reference LUT-based designs, the number of inverting stages that best fits the nominal frequency of the corresponding CC_ROv1 and CC_ROv2 was chosen.
Exploring the Usage of Fast Carry Chains to Implement Multistage Ring Oscillators on FPGAs: Design and Characterization
Fanny Spagnolo, Member, IEEE, Stefania Perri, Senior Member, IEEE, Massimo Vatalaro, Member, IEEE, Fabio Frustaci, Senior Member, IEEE, Felice Crupi, Senior Member, IEEE, and Pasquale Corsonello, Senior Member, IEEE
FREQUENCY VALUES IN MHZ
the frequency around 65 MHz at the FF corner, for instance, a 43-stage LUT-based RO would occupy 11 slices and consume 5.57 pJ, which is 37.5% wider and 11.4% more energy consuming than the CC_ROv3 configuration with y = 17.

TABLE V σ_p/µ_p (%) MEASURED UNDER DIFFERENT TEMPERATURE CONDITIONS

TABLE VII NIST RANDOMNESS TEST RESULTS