10 Gigabit White Rabbit: sub-nanosecond timing and data distribution

Time synchronization is a critical feature for many scientific facilities and industrial infrastructures. The required performance is steadily increasing: for instance, a few tens of nanoseconds for Fifth Generation (5G) networks, or sub-nanosecond accuracy for the next generation of particle accelerators and astrophysics telescopes. Because of this demanding accuracy, many applications require dedicated timing networks, which increases system cost and complexity. In this context, the new IEEE 1588-2019 High Accuracy (HA) default profile is largely based on White Rabbit (WR), which can provide sub-nanosecond synchronization accuracy for Ethernet networks. However, current WR solutions have not been designed to work properly with high-bandwidth data delivery services, even on 1 Gigabit Ethernet (GbE) links. In this contribution, the authors propose a new architecture that enables WR and, consequently, the IEEE 1588-2019 HA profile to be deployed over 10 GbE links, solving the identified data bandwidth problem. Furthermore, this work describes the experiments needed to characterize the system performance in terms of time synchronization and data transfer. As a final result, this contribution presents, for the first time in the literature, a new WR system that allows high-bandwidth data exchange in 10 GbE networks while providing sub-nanosecond synchronization accuracy. The proposed solution maintains the time synchronization performance of existing WR 1 GbE devices with significant advantages in terms of latency and data bandwidth, enabling its deployment in applications that integrate data and synchronization information in the same network.


I. INTRODUCTION
TIME synchronization is a key factor for many applications in the Information Technology (IT) age. Some of them are related to scientific infrastructures in the framework of high energy physics and astrophysics. Synchronization is also the cornerstone for industrial infrastructures from strategic sectors such as telecommunication, smart grids, finance, data centers, Internet of Things (IoT), avionics and new generation networks [1], [2]. This variety of applications and requirements has motivated the development of standard protocols such as Network Time Protocol (NTP) [3] and IEEE 1588 [4]. In terms of synchronization accuracy, NTP provides millisecond-scale performance whilst IEEE 1588 usually ensures microsecond-scale performance [5], although this can be improved to hundreds or even tens of nanoseconds using specific configurations. However, these protocols do not fulfill the synchronization requirements of many new applications, which has motivated the development of more advanced synchronization technologies such as White Rabbit (WR) [6], [7]. For a more detailed review of candidate infrastructures requiring accurate timing, see [8]. For the sake of completeness, some relevant cases are briefly described here:
• High energy physics facilities. These infrastructures are composed of several kinds of elements, such as actuators, sensors and control units, normally interconnected via a communication network. These devices must be synchronized with high accuracy, often at the nanosecond scale, to accomplish the system goal. In this context, the time synchronization architecture is typically composed of a time generator, i.e. an atomic clock or a Global Navigation Satellite System (GNSS) receiver, and an optical fiber network where a packet-based protocol such as IEEE 1588 is deployed for timing distribution. Remarkable examples are the European Organization for Nuclear Research (CERN) [9], Helmholtzzentrum für Schwerionenforschung (GSI) [10] and the International Fusion Materials Irradiation Facility (IFMIF) [11], [12].
• Astrophysics applications. These demand high-precision timestamping mechanisms for capturing and correlating events under study in the nanosecond range with picosecond-level jitter. Furthermore, low-jitter frequency dissemination mechanisms are also mandatory for many distributed telescopes. The most notable applications are telescope arrays such as the Cherenkov Telescope Array (CTA) [13] or the Square Kilometer Array (SKA) [14], and neutrino telescopes such as KM3NeT [15].
• Smart Grid systems. In these infrastructures, timing is used as a mechanism for Data Acquisition (DACQ) in Wide Area Measurement Systems (WAMS) monitoring activities, which is very relevant for forensic analysis in case of power grid failures. In addition, timing is critical for the synchrophasor data handled by the Phasor Measurement Units (PMUs) [16], [17] according to the IEEE C37.118 standard, which requires an accuracy of 1 µs. In this context, it is feasible to use a combination of GNSS sources together with specific IEEE 1588 profiles as indicated in IEEE C37.238-2011. However, synchronization performance must be improved by a factor of 10 to deal with other errors [18]. These new conditions show that time synchronization requirements are becoming more demanding for this kind of application [19], [20].
• Telecommunication infrastructures. The Third Generation (3G) and Long Term Evolution (LTE) mobile systems [21] demand highly accurate time synchronization, at the microsecond scale, between neighboring base stations, as stated in [22]. Timing can be provided by wired interfaces, usually optical fiber [23], or by radio interfaces as in [24]. A simple solution consists of deploying a GNSS receiver connected to each base station. Nevertheless, this method is very expensive and is exposed to issues such as spoofing or jamming. Packet-based synchronization technologies can be used for timing dissemination to reduce the number of GNSS receivers. In this context, there are specific IEEE 1588 profiles such as ITU-T G.8265.1, G.8275.1 and G.8275.2 that have been designed to fulfill the requirements of existing telecommunication infrastructures. In addition to the needs of current mobile networks, Fifth Generation (5G) technologies demand stricter synchronization requirements, between 110 ns and 12.5 ns [25], and, at the same time, require advanced reliability and redundancy capabilities [22], [25]-[27].
• Data centers. Time synchronization is a key factor for important system functions in data centers such as data consistency, event ordering, task scheduling and resource sharing [28]. Moreover, nanosecond-level synchronization is required for some applications in the finance segment, where time and latency are crucial factors. In this context, multiple alternatives such as NTP, IEEE 1588, the Huygens algorithm [28], Datacenter Time Protocol (DTP) [29] or Google TrueTime [30] can be used for some data center applications. However, the most recent data center infrastructures demand deterministic solutions that can provide synchronization performance below 10 nanoseconds [31]. These new requirements rule out the previously described solutions due to their non-deterministic nature and accuracy limitations. Furthermore, the use of WR in financial applications is remarkably important for the development of visibility networks, as stated in [32], [33]. This makes it possible to develop trading networks that guarantee fair play between the different High Frequency Traders (HFTs) connected to each stock exchange.
In these applications, it is mandatory to use new time synchronization technologies that fulfill current demands. Moreover, they also require a high data bandwidth network, which forces the deployment of an independent timing network and significantly increases system costs. A possible solution is to integrate data and timing information in the same network infrastructure, as investigated by international initiatives such as [34] or [35].
In this scenario, WR technology [6], [7] has been designed as a synchronization protocol based on IEEE 1588-2008 that includes advanced synchronization mechanisms, such as frequency synchronization and phase measurement, in order to overcome timestamp resolution limitations and reach sub-nanosecond performance. However, current WR solutions have not been designed to take advantage of the full data bandwidth of the network link, and they present low performance in terms of data delivery. As a result, WR cannot be easily deployed in some scientific and industrial applications, such as telecommunication infrastructures and data centers, where 10 Gigabit Ethernet (GbE) networks or even faster ones are required.
This contribution provides a novel solution to the aforementioned needs, combining high accuracy time synchronization and high bandwidth 10 GbE capabilities. The paper is organized as follows: WR technology is presented and briefly described in section II. The proposed architecture for the 10 GbE compliant synchronization system is presented in section III. The system validation and the different experiments are explained in section IV. Finally, the conclusion and future work are presented in sections V and VI, respectively.

II. WHITE RABBIT PROTOCOL OVERVIEW
WR [6], [7] is a synchronization technology developed within the framework of an open source, international collaborative project started at CERN in 2009 [9], [36]. It has been adopted by scientific institutions such as GSI [10] and private companies such as Seven Solutions [37] because of its timing performance and integrability.
WR includes some extensions to IEEE 1588-2008 that have been accepted in a draft to become part of IEEE 1588-2019 as the High Accuracy (HA) profile [38], [39]. The WR specification states that this technology provides sub-nanosecond accuracy and a frequency distribution precision better than 50 picoseconds for point-to-point links up to 10 km. Nevertheless, the distance can be readily extended up to 120 km without using optical amplification.
WR synchronization is implemented through different kinds of mechanisms, described as follows:
• Frequency synchronization (syntonization): It is based on Synchronous Ethernet (SyncE) technology, which uses the received physical data stream to recover the transmission clock of the other endpoint. The main difference with respect to SyncE is that the local clock is adjusted to follow the reference clock coming from the network using a servo control loop based on a Phase Locked Loop (PLL), instead of directly using the recovered reference.
[...] based system is that the timestamp resolution is limited to the device clock frequency. To overcome this problem, WR implements a Digital Dual Mixer Time Difference (DDMTD) module that is able to measure the phase difference between two clocks at the picosecond scale [40]. This information can then be combined with the t_2 and t_4 timestamps to improve their accuracy, obtaining t_2p and t_4p.
Additional calibration [41] needs to be performed in order to achieve accuracy levels below one nanosecond. As can be seen in Fig. 1, the fixed delays in the transmission and reception paths of each WR device, Δ_rxs, Δ_txs, Δ_rxm and Δ_txm, are taken into account. These delays are constant for each product and firmware version and therefore need to be calibrated only once. On the receiver path, there are two variable terms known as bitslide, denoted ε_M and ε_S. They change every time the network link is established, due to an internal alignment process in the serializer/deserializer circuitry; hence, an initial measurement must be performed in order to compensate for them. In WR links, a single fiber strand is normally used with a different transmission wavelength in each device, resulting in distinct propagation speeds. Consequently, the master-to-slave transmission delay δ_MS differs from the slave-to-master delay δ_SM, with a relationship determined by the fiber asymmetry factor α (1).
The compute_RTT function (line 36 in Algorithm 1) follows (3) to extract the Round Trip Time (RTT) of the optical fiber link with enhanced precision thanks to the phase information (t_4p and t_2p).
The compute_delay function (line 37 in Algorithm 1) calculates the one-way delay using (4).
The compute_clock_offset function (line 38 in Algorithm 1) uses (4) to compute the clock offset correction to be applied.
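The three functions above can be sketched numerically. The equations referenced in the text are not reproduced here, so this sketch assumes the link model commonly given in the public White Rabbit documentation: δ_MS = (1 + α)·δ_SM, delay_MM = (t_4p − t_1) − (t_3 − t_2p), delay_MS = (1 + α)/(2 + α)·(delay_MM − Δ) + Δ_txm + Δ_rxs, and offset_MS = t_1 − t_2p + delay_MS. The function names mirror Algorithm 1, the bitslide terms are assumed to be already folded into the reception delays, and all numeric values are made up for illustration.

```python
def compute_rtt(t1, t2p, t3, t4p):
    """Round-trip time: total elapsed time minus the slave turnaround time."""
    return (t4p - t1) - (t3 - t2p)


def compute_delay(alpha, delay_mm, delta, d_txm, d_rxs):
    """One-way master-to-slave delay, assuming delta_MS = (1 + alpha) * delta_SM."""
    fiber_round_trip = delay_mm - delta  # remove the fixed hardware delays
    return (1.0 + alpha) / (2.0 + alpha) * fiber_round_trip + d_txm + d_rxs


def compute_clock_offset(t1, t2p, delay_ms):
    """Correction to add to the slave clock (master time minus slave time)."""
    return t1 - t2p + delay_ms


# Illustrative link parameters in nanoseconds (made-up values, not measurements).
ALPHA = 2.5e-4
D_TXM, D_RXM, D_TXS, D_RXS = 10.0, 12.0, 11.0, 13.0  # rx delays include bitslide
DELTA_SM = 5000.0
DELTA_MS = (1.0 + ALPHA) * DELTA_SM
SLAVE_AHEAD = 37.5  # simulated slave clock minus master clock

# Timestamps for one Sync / Delay_Req exchange.
t1 = 1_000_000.0                                       # master time base
t2p = t1 + D_TXM + DELTA_MS + D_RXS + SLAVE_AHEAD      # slave time base
t3 = t2p + 2_000.0                                     # slave time base
t4p = (t3 - SLAVE_AHEAD) + D_TXS + DELTA_SM + D_RXM    # master time base

delta = D_TXM + D_RXM + D_TXS + D_RXS
delay_mm = compute_rtt(t1, t2p, t3, t4p)
delay_ms = compute_delay(ALPHA, delay_mm, delta, D_TXM, D_RXS)
offset_ms = compute_clock_offset(t1, t2p, delay_ms)
print(f"delay_MS = {delay_ms:.3f} ns, offset_MS = {offset_ms:.3f} ns")
```

Under these assumptions, the computed correction equals the negative of the simulated slave offset up to floating-point rounding, which is the value the slave would apply to align with the master.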
The WR network is composed of different kinds of elements whose nomenclature follows the conventions of the IEEE 1588 standard. These elements are classified according to their roles: the Grandmaster (GM), intermediate devices known as Boundary Clocks (BCs), and Ordinary Clocks (OCs), i.e. end-nodes. The GM is responsible for obtaining the accurate timing reference from a stable source, e.g. an atomic clock or a GNSS receiver, and distributing it to the rest of the network. The BCs disseminate the received timing information to the elements of the next hierarchy level and are usually implemented as switch devices. The end-nodes behave as IEEE 1588 OC devices, recovering the reference clock from the network link and adjusting their local oscillator, either to be used by a specific application or to provide this reference through other synchronization mechanisms.

Algorithm 1 (lines 23 to 40)
23: (Δ_rxs, Δ_txs, ε_S) ← get_phy_calibration()
24: P_calibration ← handle_signaling_packets()
25: (Δ_rxm, Δ_txm, ε_M) ← recv_master_calib(P_calibration)
26: stop_phy_calibration()
27: Δ ← compute_fdelay(Δ_txm, Δ_rxm, Δ_txs, Δ_rxs, ε_M, ε_S)
28: α ← read_asymmetry_calib()
29: start_WR()
30: WR_packets_exchange:
31: (Sync, Fup, Delreq, Delresp) ← handle_packets()
32: (t_1, t_2, t_3, t_4) ← get_tstamps(Sync, Fup, Delreq, Delresp)
33: phase_M ← receive_master_phase(Delresp)
34: phase_S ← read_slave_phase()
35: (t_2p, t_4p) ← refine_Rx_tstamps(t_2, t_4, phase_M, phase_S)
36: delay_MM ← compute_RTT(t_1, t_2p, t_3, t_4p)
37: delay_MS ← compute_delay(α, delay_MM, Δ, Δ_txm, Δ_rxs)
38: offset_MS ← compute_clock_offset(t_1, t_2p, delay_MS)
39: adjust_clock(offset_MS)
40: goto WR_packets_exchange

However, one of the main issues of WR technology is its reliance on GbE links. WR is not a good candidate for timing subsystems in many applications (research infrastructures, data centers and telecommunication networks, among others) that require network interfaces with higher speeds, such as 10 GbE or even faster ones. This forces the use of separate networks for data delivery and time dissemination. In this context, the authors have implemented a new generation WR system for 10 GbE networks, including an enhanced datapath that is able to cover both data bandwidth and timing requirements. This novel design is described in the next section.

III. 10 GIGABIT ETHERNET SYSTEM IMPLEMENTATION
This section introduces the proposed system for the implementation of WR using 10 GbE technology. This development has been carried out at the University of Granada in collaboration with Seven Solutions engineers.

A. HARDWARE
The hardware platform chosen to implement the WR 10 GbE system is the WR Zynq 16 ports (WR-Z16) board (Fig. 2). It has been developed by Seven Solutions and includes a Xilinx Zynq XC7Z035 System on a Chip (SoC). This SoC contains a Field Programmable Gate Array (FPGA) device and a hard ARM microprocessor, enabling a complete standalone platform with hardware accelerators for critical functions and advanced software features. The WR-Z16 board has 16 [...]
The FPGA firmware block design implements a fully functional 10 GbE WR device with a single network port. Fig. 3a shows its architecture, and its main parts are briefly described as follows:
• CPU. It is the main processor and is responsible for executing the system software.
[...] (TSU) module to generate very precise packet timestamps.
The design presents important differences in comparison to the current 1 GbE WR architecture (Fig. 3b), highlighted as follows:
• The endpoint of the current 1 GbE WR solution contains the TSU inside the PCS block. Consequently, the TSU block is tightly coupled to the PCS one, preventing the reuse of third-party PCS modules for specific applications.
• The timing system in the current 1 GbE WR node solution implements the servo control loop together with the WR IEEE 1588 stack inside a soft-microprocessor. In contrast, the proposed solution uses the soft-microprocessor exclusively for the servo control loop, while the WR IEEE 1588 stack is executed by the main processor. In the future, this soft-microprocessor could also be replaced by specific control logic to reduce resources and increase the control loop bandwidth.
It is important to highlight that the proposed architecture uses the 10 GbE MAC core [42] and the 10 GbE PCS (Base-R) module [43] from Xilinx. However, a significant change in this implementation is that the proposed design is flexible enough to work with any other MAC and PCS IP cores. To achieve this, different elements have been fully reorganized to integrate the timing-related units, whilst the data communication modules have not been modified, as shown in Fig. 3a. This makes it possible to use specific PCS/MAC cores for deterministic behavior, higher bandwidth or low latency applications with minimal architecture changes, which gives the design considerable flexibility for customization.

C. SOFTWARE
The software ecosystem is based on a Linux system together with dedicated software for the servo control loop in the soft-microprocessor. The software components are shown in Fig. 4 and are classified into different categories: userspace daemons, kernel modules and phase-control servo loop firmware.
• The userspace daemons are the user applications that control the system using system calls. The Hardware Abstraction Layer (HAL) daemon is responsible for managing the hardware in the FPGA and for granting hardware access requests from other applications. The Timing daemon implements the WR IEEE 1588 stack, which is responsible for exchanging the timing packets of the WR protocol.

In this section, the system is characterized in terms of time synchronization, resource consumption and data bandwidth delivery. For this purpose, several experiments have been designed and performed to check that the system requirements are fulfilled. In the time synchronization experiments, standardized metrics such as phase noise and jitter, Maximum Time Interval Error (MTIE) and Time Deviation (TDEV) have been used to measure timing performance. Both phase noise and jitter indicate the stability of a signal and are interrelated. Specifically, phase noise is the instability of a frequency expressed in the frequency domain; it is relevant and widely used to evaluate timing performance in applications related to Radio Frequency (RF) systems, radars or telecommunication links. Jitter, on the other hand, is the fluctuation of the signal waveform in the time domain, measured with standard estimators [44] (ITU-T) such as MTIE and TDEV. TDEV is a highly averaged, Root Mean Square (RMS) type calculation showing values over a range of integration times. It provides information about the presence and contribution of the different types of noise (according to the noise power-law spectral-density models) and helps to identify the sources of the different contributions. MTIE is another jitter measure that provides information about the maximum value of the time error. It shows the largest time-phase swings for various observation time windows and is widely used for equipment testing and telecommunications measurement. The different tests provide evidence of performance based on these widely used performance metrics.
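As a minimal sketch of how these two estimators can be computed from a series of time-error samples, the code below follows the discrete MTIE and TDEV definitions of ITU-T G.810 as the editor understands them; it is not the processing chain used in the experiments. The sample values are invented, `x` holds time-error samples taken at a fixed interval τ0, and `n` is the observation interval expressed in samples.

```python
import math


def mtie(x, n):
    """Largest peak-to-peak time error over every window of n + 1 samples."""
    if not 1 <= n < len(x):
        raise ValueError("n must satisfy 1 <= n < len(x)")
    return max(max(x[i:i + n + 1]) - min(x[i:i + n + 1])
               for i in range(len(x) - n))


def tdev(x, n):
    """Time deviation for an observation interval of n sampling periods."""
    terms = len(x) - 3 * n + 1
    if n < 1 or terms < 1:
        raise ValueError("need n >= 1 and at least 3 * n samples")
    total = 0.0
    for j in range(terms):
        # Second difference of the time error, averaged over n samples.
        inner = sum(x[i + 2 * n] - 2.0 * x[i + n] + x[i]
                    for i in range(j, j + n))
        total += inner * inner
    return math.sqrt(total / (6.0 * n * n * terms))


# Invented time-error samples in picoseconds, one per second.
samples_ps = [0.0, 3.0, -2.0, 5.0, 1.0, -4.0, 2.0, 6.0, -1.0, 0.0, 4.0, -3.0]
for n in (1, 2, 3):
    print(f"tau = {n} s: MTIE = {mtie(samples_ps, n):.1f} ps, "
          f"TDEV = {tdev(samples_ps, n):.2f} ps")
```

A pure linear drift gives a TDEV of zero because the second difference cancels it, whereas MTIE grows with the window length; this is why the two metrics are reported together.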
The equipment and tools required to perform the experiments and retrieve the previously described metrics are briefly described.

A. FREQUENCY DISTRIBUTION PERFORMANCE TESTS
Phase noise is a critical issue in many applications where frequency dissemination is mandatory. It impacts the quality of the transmitted frequency and can constrain the system scalability, because the noise increases with the number of network hops. Consequently, a deeper study of the optimal parameters for frequency and time dissemination in terms of phase noise must be carried out in order to guarantee the best conditions for this application. To this end, the specific measures for phase noise analysis described in [44] should be used.
For the phase noise experiments presented in this section, the required equipment is composed of a Morion MV89 [45], which provides a stable reference, and two WR-Z16 boards, one acting as GM and the other as slave. The 10 MHz and PPS signals are provided to the SMA sockets of the GM WR-Z16, and an optical fiber connects it to the slave device, as shown in Fig. 5. The master device multiplies the reference 10 MHz signal up to its internal working frequency and uses the PPS input to get a reference point for the beginning of a second. After the internal frequencies are locked to the reference, the device outputs a 10 MHz signal on an SMA port, which is connected to a Microsemi 3120A [46] phase noise device. This instrument also uses the 10 MHz output from the Morion MV89 oscillator as a reference, measuring the frequency precision between the reference and the WR-Z16 in GM mode. After this measurement, a point-to-point link has been deployed between the GM WR-Z16 and a second WR-Z16 acting as a slave.
The phase noise results depend on the response of the phase control servo loop inside the Timing Controller, which is implemented by means of a simple PI controller, although more advanced techniques could be used [47]. Therefore, a tuning process has been performed to obtain the optimal values for the proportional and integral terms of the control loop (k_p and k_i). The metric selected to compare the performance of the loop is the RMS jitter between the 10 MHz output signals from the GM WR-Z16 and the slave WR-Z16. It is noteworthy that, because of differences between the 10 GbE and 1 GbE designs, the 10 GbE one uses negative values for the proportional and integral terms whilst the 1 GbE alternative requires positive ones. Therefore, absolute values are used for comparative purposes in this section.
Two sweeps of the k_p and k_i terms have been carried out for the tuning process of the phase control loop. First, a coarse sweep provides a rough estimation of the regions in which the phase control loop can lock to the reference 10 MHz signal, regardless of jitter. Afterwards, a finer tuning sweep is performed in the region with the lowest expected jitter. Finally, the optimal values for the control loop can be estimated from the fine-tuning sweep via interpolation. This process is repeated for both the 1 GbE and 10 GbE versions of the software.
Table 1 shows the values of the coarse sweep for 1 GbE and 10 GbE. These are the borderline cases in which the RMS jitter is in the range of tens or hundreds of picoseconds. Lower values of the proportional term lead to a region in which the RMS jitter always stays below ten picoseconds on the slave side, while the GM device approximately halves that amount in most of the regions. Slave devices can reach the 5 picosecond jitter milestone in the best cases, adding a few hundred femtoseconds in the step between master and slave. In both 10 GbE and 1 GbE, the GM jitter in the analyzed region can be modeled as a smooth concave surface with a single relative minimum. On the slave side, however, the topology of that surface can be more complex. The different minima found are within a few hundred femtoseconds of each other. The optimal values for the proportional and integral terms of the phase control loop can be seen in Table 2.
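The coarse sweep, fine sweep and interpolation steps can be sketched as a two-stage grid search. This is only an illustration of the procedure: `fake_rms_jitter` is a hypothetical stand-in for a real jitter measurement, the gain ranges and step sizes are invented, and the interpolation shown (a parabola through three neighboring points along each axis) is one possible choice, since the text does not specify which interpolation was used.

```python
def grid(lo, hi, steps):
    """Return `steps` evenly spaced values from lo to hi inclusive."""
    return [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]


def sweep(measure, kp_values, ki_values):
    """Return (jitter, kp, ki) with the lowest jitter, skipping unlocked points."""
    best = None
    for kp in kp_values:
        for ki in ki_values:
            jitter = measure(kp, ki)
            if jitter is None:  # the loop did not lock at this point
                continue
            if best is None or jitter < best[0]:
                best = (jitter, kp, ki)
    return best


def parabolic_vertex(center, step, f):
    """Vertex of the parabola through f(center - step), f(center), f(center + step)."""
    y0, y1, y2 = f(center - step), f(center), f(center + step)
    if None in (y0, y1, y2):
        return center
    curvature = y0 - 2.0 * y1 + y2
    if curvature <= 0.0:  # no interior minimum: keep the grid point
        return center
    return center - step * (y2 - y0) / (2.0 * curvature)


def fake_rms_jitter(kp, ki):
    """Hypothetical jitter surface in picoseconds (not a measurement)."""
    if kp > 3500.0:
        return None  # pretend the loop cannot lock in this region
    return 5.0 + 1e-5 * (kp - 1230.0) ** 2 + 1e-2 * (ki - 30.6) ** 2


# Coarse sweep over the whole range of gains.
_, kp0, ki0 = sweep(fake_rms_jitter, grid(0.0, 4000.0, 9), grid(0.0, 100.0, 11))

# Fine sweep: one coarse step on each side of the coarse optimum.
_, kp1, ki1 = sweep(fake_rms_jitter,
                    grid(kp0 - 500.0, kp0 + 500.0, 11),
                    grid(ki0 - 10.0, ki0 + 10.0, 11))

# Interpolate around the fine optimum along each axis (fine steps: 100 and 2).
kp_opt = parabolic_vertex(kp1, 100.0, lambda v: fake_rms_jitter(v, ki1))
ki_opt = parabolic_vertex(ki1, 2.0, lambda v: fake_rms_jitter(kp1, v))
print(f"estimated optimum: kp = {kp_opt:.1f}, ki = {ki_opt:.2f}")
```

On a measured surface with several nearby minima, as reported for the slave side, a single parabolic fit is only a local estimate and the fine sweep result should be checked against its neighbors.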
The k_p-k_i pairs with the best performance in Table 2 are selected for further analysis of their phase noise and frequency stability. For this scenario, the 10 MHz output signals from both the GM and slave devices have been connected to the phase noise device, maintaining the reference signal. Both scenarios have been sampled for 60 minutes and measured with 1 GbE and 10 GbE links for the sake of comparison.
Fig. 6 shows the main results of the optimization process performed on the PI control loop. The traces illustrate that the 1 GbE and 10 GbE versions have different PLL bandwidths. This is due to the use of specific PI constants in each version, which have been optimized to reduce the RMS jitter (see Table 3). Fig. 6 shows the phase noise traces for all four configurations in greater detail.
In order to compare the results with other devices in the WR ecosystem, the phase noise values of the standard WR Switch (WRS) and of its improved version with a low-jitter daughterboard have been taken from [48] and are also collected in Table 3. As shown in Table 3, the phase noise levels of the WR-Z16 are lower, at both low and high frequencies, than those of the standard implementation of the WRS. However, the WRS with the daughterboard presents better phase noise performance [48]. As future work, the WR-Z16 hardware could be updated to integrate the specific components of the daughterboard, providing a significant enhancement in phase noise behavior.

B. LONG TERM SYNCHRONIZATION STABILITY TESTS
In addition to the phase noise experiments, this section measures the capability of the new system to provide a stable long-term time transfer signal. For these experiments, a WR link is established using two WR-Z16 boards. One of them is configured to act as a WR GM device that is responsible for distributing the external time reference (Morion MV89) to the WR slave. After the synchronization procedure is completed, the PPS SMA output ports of both devices are connected to a Keysight 53230A universal frequency counter [49] to measure the synchronization accuracy between both devices for more than ninety hours (Fig. 7). Timing performance in this experimental setup is limited by the accuracy of the Morion MV89 and, consequently, better results can be obtained if an atomic clock or maser is used instead.
τ represents the observation time interval. It can be seen that it is close to the 1e-11 level at τ = 1 s and that it decreases to a minimum value of 6e-12 s at τ = 2000 s. These results show a slight improvement compared to previous evaluations of WR nodes [50]. In the TDEV, the effect of the temperature difference between night and day conditions can be observed. This effect is studied in more depth in projects such as Clock Network Services (CLONETS) [34], with the aim of providing additional mechanisms to actively compensate for it. The MTIE [44] is outlined in Fig. 9, showing the worst case analysis. The obtained results indicate that, for the analyzed period, the MTIE never exceeds 103 ps. This clearly fulfills the WR requirements. Considering all the obtained results, the 10 GbE WR link between WR-Z16 devices is comparable to the most evolved WR nodes in terms of time synchronization [50]. The time synchronization experiments (frequency dissemination and long term stability) validate the 10 GbE solution, showing slightly better performance than most current WR devices, together with the architectural improvements and newer features described in previous sections.

C. RESOURCE CHARACTERIZATION
In this section, the resource utilization of the 10 GbE design is characterized and compared with the utilization profile of the standard WR implementation on the WR-Z16 platform (Table 4). The resource utilization of the 1 GbE system is lower than that of the 10 GbE one. This indicates that the standard WR design is a very simple one that only includes a minimal set of functionalities. Meanwhile, the proposed solution includes a high bandwidth data transfer system, enabling the implementation of complex user applications thanks to the ARM processor. Although the FPGA resource consumption is higher for the 10 GbE solution, the percentage of FPGA utilization, especially on the new families, is minimal. Therefore, the small footprint guarantees a small impact for final users whilst providing all the new and improved features.
In addition to the resource utilization characterization, the effects of CPU load have been analyzed. In the 1 GbE WR architecture, the WR protocol was implemented using a soft-processor in the FPGA device, so the main CPU can be dedicated to user applications, separating the time synchronization tasks from user processes. In this context, and as shown in [51], the synchronization performance is not degraded when the main CPU executes high load tasks. In the 10 GbE solution, the soft-processor does not implement the WR IEEE 1588 stack and only takes care of frequency syntonization and phase measurement, as described in section II. The WR IEEE 1588 daemon runs on the main CPU of the device, the ARM processor. Due to this new configuration, it is important to demonstrate that the activity of the main CPU does not impact the synchronization quality. In the first experiment, the CPU utilization of the WR IEEE 1588 daemon was measured, obtaining a value below 1%. After this result, additional experiments were performed using the stress and cpulimit tools in order to evaluate the impact of CPU load on the synchronization. The former runs several processes to create a high load condition on the ARM processor. The latter limits the maximum CPU utilization of a specific process, making it possible to directly control the CPU load. Thanks to these tools, the system behavior under different CPU conditions was analyzed, as presented in Table 5. This table shows the PPS offset between the master and slave devices in picoseconds, using the mean and standard deviation as metrics.
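The two statistics used to summarize each CPU load condition can be computed as in the short sketch below. The sample lists are placeholders invented for illustration and are not the measured values of the experiment; the condition names are also assumptions.

```python
import statistics


def summarize_offsets(samples_ps):
    """Return (mean, sample standard deviation) of PPS offset samples in ps."""
    return statistics.mean(samples_ps), statistics.stdev(samples_ps)


# Placeholder PPS offset samples per CPU load condition (NOT measured data).
runs = {
    "idle": [10.0, 12.0, 8.0, 11.0, 9.0],
    "100% load": [11.0, 13.0, 9.0, 12.0, 10.0],
}

summary = {name: summarize_offsets(samples) for name, samples in runs.items()}
for name, (mean_ps, std_ps) in summary.items():
    print(f"{name}: mean = {mean_ps:.1f} ps, std = {std_ps:.2f} ps")
```

Comparing the mean shows whether load introduces a systematic offset, while the standard deviation shows whether it adds noise to the synchronization.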
From the CPU load experiments, it can be concluded that the system is able to run high load tasks without degrading the WR synchronization performance, even when CPU utilization reaches 100%. The reason for this behavior is that WR provides physical clock syntonization, fine phase clock adjustment and deterministic hardware timestamps. All these mechanisms ensure that the clock offset computation is not affected by the dynamic packet latency in the software domain, which only delays the application of the correction, performed once per second by default.

This section shows the data performance of the solution, taking into consideration two different metrics: data bandwidth and endpoint latency. As commented before, these are very important because they significantly impact several industrial and scientific applications. The data bandwidth experiments have been performed using two external computers with 10 GbE network interface cards, named computer A and computer B. The former is equipped with an Endace DAG 10X2-S 10 GbE card [52] whilst the latter includes a Solarflare Communications SFC9120 network interface card [53], as can be seen in Fig. 10. The two computers are connected to the WR-Z16 via optical fiber links. The WR-Z16 then forwards the incoming packets from one interface to the other. With this simple scenario, conventional software tools such as iperf, netperf or nload can be used to measure the bandwidth performance of the WR-Z16 system. For the experiments, nload has been used.
The data bandwidth results (Fig. 11) reveal that the 10 GbE Endpoint on the WR-Z16 board is able to cope with 91.2% of the total capacity of the link. In light of these results, the 10 GbE system can be used in high data bandwidth applications without presenting any drawbacks. Moreover, Fig. 11 shows that the maximum bandwidth provided by the 10 GbE network interface card is 9.17 Gbps. The 10 GbE system therefore reaches 99.45% of the bandwidth that the card can deliver, which suggests that the measured link utilization could increase if the network card were replaced by another with higher bandwidth performance.
For the latency test, a different approach has been implemented, as shown in Fig. 12. In this case, no external computers are needed to measure the endpoint latency. Instead, some additional IP cores are inserted in the FPGA design to generate and consume packets. A Time Base block provides a common time base, and two TSU modules recognize the start of frame in each endpoint and store a timestamp of that event. The reference clock is 156.25 MHz and, for this reason, the resolution of the timestamp is limited to 6.4 ns. The latency calculation is performed by subtracting the two timestamps and dividing the result by two for every single packet. This scheme does not consider any asymmetries between the transmission and reception data paths, nor the propagation delay on the optical fiber. However, the latter is negligible for short links (0.4 ns for 10 cm) compared to the endpoint latency. An important note is that the system latency can be significantly reduced using low latency alternatives for network components such as the MAC and PCS. Latency experiments have been performed considering different data utilization conditions (from no data to the full 10 GbE rate). The results demonstrate that the 10 GbE Endpoint latency is not affected by dynamic data bandwidth, with values between 198 and 205 ns and a deviation of 2.59 ns.
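The per-packet latency calculation described above can be illustrated with a minimal Python sketch. It assumes that each TSU timestamp is available as an integer count of 156.25 MHz reference clock cycles; the function names and the tick counts in the demo are hypothetical and are not taken from the WR-Z16 firmware or from the measured data.

```python
from statistics import mean, pstdev

# Reference clock stated in the text; one tick is the timestamp resolution.
REF_CLOCK_HZ = 156.25e6
TICK_NS = 1e9 / REF_CLOCK_HZ  # 6.4 ns per tick


def endpoint_latency_ns(start_ticks: int, end_ticks: int) -> float:
    """Latency for one packet: timestamp difference divided by two.

    Assumption: both timestamps are tick counts taken from the same time
    base, and the two halves of the measured path are symmetric.
    """
    return (end_ticks - start_ticks) * TICK_NS / 2.0


def summarize(timestamp_pairs):
    """Return (min, max, mean, standard deviation) of the latencies in ns."""
    latencies = [endpoint_latency_ns(a, b) for a, b in timestamp_pairs]
    return min(latencies), max(latencies), mean(latencies), pstdev(latencies)


if __name__ == "__main__":
    # Made-up tick counts, used only to exercise the functions.
    demo_pairs = [(1000, 1062), (2000, 2063), (3000, 3064)]
    print(summarize(demo_pairs))
```

Because the timestamps are quantized to 6.4 ns, each per-packet result in this sketch moves in steps of 3.2 ns, so a spread of a few nanoseconds across packets can come from quantization alone.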

V. CONCLUSION
In this contribution, and for the first time in the literature, the authors have designed and implemented a new WR-compliant design that can be deployed over 10 GbE networks. This has required a full redesign of the system modules in order to achieve the desired modularity, flexibility and performance goals.
The proposed solution has been implemented using the WR-Z16 board, which is equipped with a new Xilinx Zynq FPGA-SoC. Its FPGA firmware presents a fully modular design that provides separate paths for timing and data components. Moreover, advanced software has been developed to control the hardware (HAL and kernel drivers) and to perform the time synchronization (Timing daemon). A CPU load impact study has been performed in order to demonstrate that high load conditions do not affect the synchronization performance. As a result, the 10 GbE solution is able to run application tasks on the CPU without degrading the time synchronization quality.
Finally, several experiments have been performed focusing on the timing, latency and data bandwidth capabilities. For the timing characterization, tests related to time and frequency distribution and long-term stability have been carried out. The frequency distribution scenarios have required an initial analysis to find the best PI control loop constants. Once these values were obtained, frequency stability was measured, yielding better results (10 GbE design: 4.0e-12 for GM and 4.7e-12 for slave; 1 GbE design: 4.9e-12 for GM and 6.4e-12 for slave) than those of the standard WRS. The long-term stability experiments have been performed by measuring the PPS signals from two WR-Z16 devices. The outcomes show that the proposed system has a synchronization performance similar to that of the most evolved WR 1 GbE nodes, with a synchronization accuracy below 103 picoseconds according to the MTIE metric.
On the other hand, the data bandwidth and latency tests show that the proposed system is able to cope with 91.2% of the total capacity of the link, while the latency lies between 198 and 205 ns with a deviation of 2.59 ns and is not impacted by the data utilization. Despite the achieved performance, the proposed system has not been optimized for ultra-low latency applications. Nevertheless, the modularity of the solution allows integrating low latency components if needed.
As relevant outcomes of this contribution, the WR protocol can be fully exploited in many novel applications, including scientific facilities or high-end data centers that demand high data bandwidth together with highly accurate synchronization mechanisms. The advantages of the 10 GbE approach are clear. First, it allows the integration of data and timing information in a single network, easing the development of an integrated solution and reducing the cost of wiring, commissioning and maintenance. Second, the 10 GbE interface allows better interoperability of the technology with other network equipment, particularly in the telecommunications domain. Third and finally, a modular architecture has been designed in order to guarantee interoperability with the existing network IP block ecosystem, allowing easy customization for specific purposes such as low latency or deterministic applications. This also benefits the integrability of this solution and its interoperability with other alternatives. In summary, the new approach provides a path to a fully integrated data and timing network and represents a significant improvement on the state of the art of time transfer solutions.

VI. FUTURE WORK
The authors identify some interesting and promising lines of future work. Firstly, the current design can be extended to work with higher data bandwidth interfaces such as 25 GbE. As a second step, and thanks to the modularity of the 10 GbE system, the inclusion of more than one port and of switching mechanisms to create a WR switch for high data bandwidth applications is feasible. Finally, deeper research could be performed in order to improve noise performance using enhanced hardware and advanced servo control loop algorithms.

FIGURE 1: WR link delay model.
A simplified version of the WR synchronization process is depicted in the pseudo-code listing in Algorithm 1. The first steps are focused on establishing a WR link, comprising tasks such as frequency syntonization and Physical (PHY) layer calibration. Once the link has been properly initialized, the synchronization procedure removes the clock offset between master and slave. The compute_fdelay function (line 27 in Algorithm 1) uses (2) to calculate the delays associated with the Rx and Tx paths.
10 GbE SFP plus (SFP+) ports, a specialized clocking circuitry to generate and control the needed clocks and some SubMiniature version A (SMA) sockets for the Pulse Per Second (PPS), 10 MHz input clock and 10 MHz output clock.

• The kernel modules have been developed to run inside the Linux kernel. The most important element is the network driver, which is in charge of controlling the 10 GbE IP blocks and DMA components in the FPGA. The original driver has been adapted to access the highly accurate timestamps provided by the Timing IP for each packet.
• The phase-control servo loop firmware is embedded software that runs in a soft-processor inside the Timing IP core. Its main goal is to perform the phase control loop using a Proportional Integral (PI) algorithm that controls the local Digital-to-Analog Converter (DAC) to adjust the clock frequency.
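As an illustration of the kind of PI loop described in the last item, the following Python sketch maps a measured phase error to a DAC code. It is a generic textbook PI controller, not the WR-Z16 firmware: the class name, the gains, the 16-bit DAC range, the mid-scale bias and the simple anti-windup rule are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class PIServo:
    """Generic PI controller driving a DAC (all parameters are assumptions)."""
    kp: float                 # proportional gain (placeholder value)
    ki: float                 # integral gain (placeholder value)
    dac_min: int = 0          # assumed 16-bit DAC range
    dac_max: int = 65535
    dac_mid: int = 32768      # assumed bias: mid-scale means no correction
    integral: float = 0.0     # accumulated phase error

    def update(self, phase_error: float) -> int:
        """Return the DAC code for one servo iteration."""
        self.integral += phase_error
        raw = self.dac_mid + self.kp * phase_error + self.ki * self.integral
        dac = int(round(raw))
        if dac < self.dac_min or dac > self.dac_max:
            # Simple anti-windup: do not accumulate while the DAC saturates.
            self.integral -= phase_error
            dac = max(self.dac_min, min(self.dac_max, dac))
        return dac


if __name__ == "__main__":
    servo = PIServo(kp=2.0, ki=0.5)
    for error in (10.0, 10.0, 0.0):
        print(servo.update(error))
```

The proportional term reacts to the current phase error, while the integral term removes the residual frequency offset over time; the choice of the two gains trades lock time against output jitter.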

FIGURE 5: Experimental setup for phase noise experiment.

Fig. 8 shows the TDEV [44] that represents the phase difference stability between both devices. Under this context,

FIGURE 11: WR-Z16 bandwidth experiment results. Generated data traffic on computer A (X-axis) vs data traffic that reaches computer B (Y-axis).
• DDR. It is the main RAM memory in charge of storing data and program information needed by the processor.
• 10 GbE Endpoint. It contains needed blocks for the 10 GbE technology such as MAC and Physical Coding Sublayer (PCS) components.
• Timing Intellectual Property (IP). It is in charge of implementing the high accurate time protocol by means of a soft-processor IP block and embedded software. Additionally, it includes a generic TimeStamping Unit

TABLE 1: Clock jitter (s RMS) for different configurations and control loop parameters.

TABLE 2: Selected values for control loop parameters to obtain the best performance in terms of clock jitter.

TABLE 3: Integrated random jitter (s RMS) for WR-Z16 and WRS.

TABLE 4: Total resource utilization for the 10 GbE system and the WR standard design for 1 GbE.

TABLE 5: Synchronization performance under different CPU load conditions.