New software based readout driver for the ATLAS experiment

In order to maintain sensitivity to new physics in the coming years of LHC operations, the ATLAS experiment has been working on upgrading a portion of the front-end electronics and replacing some parts of the detector with new devices that can operate under the much harsher background conditions of the future LHC. The legacy front-end of the ATLAS detector sent data to the DAQ system via so-called Read Out Drivers (ROD), custom-made VMEbus boards devoted to data processing, configuration and control. The data were then received by the Read Out System (ROS), which was responsible for buffering them during High-Level Trigger (HLT) processing. From Run 3 onward, all new trigger and detector systems will be read out using new components, replacing the combination of the ROD and the ROS. This new path will feature an application called the Software Read Out Driver (SW ROD), which will run on a commodity server receiving front-end data via the Front-End Link eXchange (FELIX) system. The SW ROD will perform event fragment building and buffering as well as serving the data on request to the HLT. The SW ROD application has been designed as a highly customizable high-performance framework providing support for detector-specific event building and data processing algorithms. The implementation that will be used for Run 3 is capable of building event fragments at a rate of 100 kHz from an input stream consisting of up to 120 MHz of individual data packets. This document covers the design and the implementation of the SW ROD application and presents the results of performance measurements performed on the server models selected to host SW ROD applications in Run 3.


I. INTRODUCTION
As part of the preparation for LHC Run 3, which will begin at the end of 2021, the ATLAS experiment [1] has upgraded some parts of the detector with new components that can operate under the much harsher background conditions expected as the LHC reaches higher instantaneous luminosity. The new detector and trigger systems will use modern Front-End (FE) electronics that require an updated readout system. The legacy FE of the ATLAS detector sent data to the TDAQ system [2] via so-called Read Out Drivers (ROD) [3], custom-made VMEbus boards devoted to data processing, configuration and control. These data were then sent to the Read Out System (ROS) [2], which was responsible for buffering and serving them to the High-Level Trigger (HLT) [2]. From Run 3 onward, all new trigger and detector systems will be read out using new components, replacing the combination of the ROD and the ROS. This new path will feature a new facility called the Software Read Out Driver (SW ROD), which will receive FE data via the FE Link eXchange (FELIX) system [4] and perform event fragment building and buffering as well as serving data on request to the HLT.

II. FELIX SYSTEM OVERVIEW
FELIX is a new generic detector readout system that can receive data from detector FE electronics via (among others) the versatile radiation hard optical link architecture [5] developed at CERN. FELIX can be used to receive data either via the GigaBit Transceiver (GBT) protocol [6] or the in-house designed FULL mode protocol. FELIX uses a custom PCIe card that receives data via optical links and passes them to the memory of a commodity computer via the PCIe bus. FELIX also provides a software application that can forward these data to a number of subscribers via a commercial network using Remote Direct Memory Access (RDMA) to maximize performance. RDMA is a technology that makes it possible to put data directly into the main memory of another computer without involving the processor, cache or operating system of that computer. FELIX implements a custom network communication layer called NetIO [7] on top of the RDMA over Converged Ethernet (RoCE) protocol, which is supported by many modern network cards. NetIO provides a C API that can be used by a software application to receive data from the FELIX system. A FELIX card can be operated in two modes using the respective protocols:
1) GBT Mode: with the GBT protocol a physical input link can be subdivided into a number of logical sublinks (known as E-Links), which can pass information from separate pieces of FE electronics. For Run 3 the maximum number of E-Links for a single FELIX card is limited to 192, which in this case are equally spread over 24 GBT links.
2) FULL Mode: this mode has no logical subdivision of links and uses an in-house designed protocol for higher bandwidth. For Run 3 this mode can be used to send data either via 12 links at full occupancy with a speed of 9.6 Gbps or via 24 links with 50% occupancy (4.8 Gbps).

III. SW ROD SYSTEM ARCHITECTURE
The SW ROD facility is envisaged to be implemented as software running on a set of commodity computers. Given that a single computer can serve only a limited amount of input data, and in order to scale to the size of the new ATLAS readout system, the software had to be designed in a way that allowed it to be distributed over an arbitrary number of computers. In the current design this is achieved by splitting the input data channels between a number of software processes, which are referred to as SW ROD applications, as shown in Fig. 1. Each instance of the SW ROD application can run on a separate computer, but all instances originate from the same binary executable. This executable implements a highly customizable high-performance framework that supports detector-specific event building and data processing algorithms provided in the form of shared libraries (a.k.a. binary plugins). Different instances of the SW ROD application therefore differ only in their configurations, which define the set of plugins to be used as well as their configuration parameters.

IV. SW ROD APPLICATION FUNCTIONAL REQUIREMENTS
The Read Out Drivers used by the legacy readout system to receive and process data from the ATLAS detector FE were developed independently by every subdetector. As such, they perform subdetector-specific data processing and event building based on the signals received from the ATLAS Central Trigger Processor (CTP) [8]. As the FELIX system does not perform any data processing or event aggregation, but merely provides data routing between the detector FE and the DAQ system, the task of data aggregation and processing has to be fulfilled by the SW ROD application before transferring data to the HLT farm. It is also expected that the SW ROD application will be used not only for normal physics data taking but also for various auxiliary subdetector-specific activities, such as commissioning, calibration, monitoring, debugging etc., in which data would have to undergo specific processing and may need to be transferred to a destination other than the HLT farm. To meet such requirements the SW ROD application has been designed as a framework that supports a high degree of customization by making it possible to load subdetector-specific event building and data processing algorithms at run time, which can be further configured by subdetectors with respect to their specific needs.

V. SW ROD APPLICATION HIGH-LEVEL DESIGN
The SW ROD application is split internally into a number of independent components, with each of them providing a simple interface that defines how other components can interact with it. There are three main components defined by the SW ROD application architecture, which can be interacted with via the respective interfaces:
• DataInput interface: abstracts a source of input data to shield the other components of the SW ROD application from any changes in the network input protocol. In addition it also makes it possible to use another data source, for example internal data generators, for testing and debugging.
• ROBFragmentBuilder interface: abstracts the aggregation of input data into event fragments in accordance with a given configuration. Such a configuration defines the set of event fragments to be produced as well as a list of input links for each fragment.
• ROBFragmentConsumer interface: abstracts any kind of processing that can be applied to fully aggregated event fragments. Multiple implementations of this interface can be used simultaneously in the same SW ROD application, in which case they will be organized into a singly-linked list. Each consumer in this list will have to forward event fragments to the next one after finishing its specific processing step. For example, as shown in Fig. 2, one implementation of this interface can apply a custom subdetector-specific processing procedure to the event fragments before passing them to another consumer that is used to transfer these fragments to the HLT farm.
As shown in this diagram the data handling is done by the implementations of the SW ROD application interfaces, while the Application itself merely loads and instantiates the corresponding implementation classes in accordance with a given configuration and links the instantiated objects in the order defined by this configuration.
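The consumer chain described above can be illustrated with a minimal sketch, written in Python for brevity rather than in the language of the framework itself. Apart from the interface name ROBFragmentConsumer, every name here (set_next, forward, consume, ScalePayload, Collector, build_chain) and the dictionary used as a fragment are hypothetical and do not reflect the actual SW ROD API.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class ROBFragmentConsumer(ABC):
    """Hypothetical analogue of the consumer interface: one node of a singly-linked list."""

    def __init__(self) -> None:
        self._next: Optional["ROBFragmentConsumer"] = None

    def set_next(self, consumer: "ROBFragmentConsumer") -> None:
        self._next = consumer

    def forward(self, fragment: dict) -> None:
        # Pass the fragment on to the next consumer, if there is one.
        if self._next is not None:
            self._next.consume(fragment)

    @abstractmethod
    def consume(self, fragment: dict) -> None:
        ...


class ScalePayload(ROBFragmentConsumer):
    """Stand-in for a detector-specific processing step (here: doubles each payload word)."""

    def consume(self, fragment: dict) -> None:
        fragment["payload"] = [2 * word for word in fragment["payload"]]
        self.forward(fragment)


class Collector(ROBFragmentConsumer):
    """Stand-in for a terminal consumer such as one that ships fragments to the HLT."""

    def __init__(self) -> None:
        super().__init__()
        self.received: List[dict] = []

    def consume(self, fragment: dict) -> None:
        self.received.append(fragment)
        self.forward(fragment)


def build_chain(consumers: List[ROBFragmentConsumer]) -> ROBFragmentConsumer:
    """Link consumers in configuration order and return the head of the list."""
    for current, following in zip(consumers, consumers[1:]):
        current.set_next(following)
    return consumers[0]


# The application only instantiates and links the objects; the consumers do the work.
sink = Collector()
head = build_chain([ScalePayload(), sink])
head.consume({"l1id": 7, "payload": [1, 2, 3]})
```

In this sketch the order of the list passed to build_chain plays the role of the configuration that defines the order in which consumers are linked.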

VI. SW ROD DEFAULT COMPONENT IMPLEMENTATIONS
A shared library that contains default implementations for all three interfaces is supplied along with the SW ROD application. This library contains all classes shown in Fig. 3.

A. DataInput Interface Implementations
• The NetioInput class is responsible for receiving data from the FELIX system using the NetIO Socket interface for a given set of E-Links and passing these data to the fragment builder via the ROBFragmentBuilder interface.
• The InternalDataGenerator can generate FELIX-like data chunks of a given size for a configurable number of E-Links. This class is used for debugging as well as for unit test implementation.

B. ROBFragmentBuilder Interface Implementations
The library provides two implementations of the ROBFragmentBuilder interface, which can be used to receive data from the FELIX system in either GBT or FULL mode. The algorithms implement a specific data aggregation strategy in a generic way that is independent of the format of the incoming data chunks. As this format is detector specific, this feature was implemented by allowing detectors to supply two custom procedures as parameters for these algorithms:
• Trigger Information Extraction procedure: a function that extracts the Level 1 Trigger identifiers from a given data chunk. These identifiers are used to assign data chunks to a particular event fragment as well as to align data with the Trigger information received from the CTP.
• Data Integrity Checking procedure: a function intended to be used if there is a suspicion that input data chunks could be corrupted or that a sequence of data packets for a particular input link is broken. This function is assumed to know the location of the checksum value in a given data packet format as well as the Cyclic Redundancy Check (CRC) algorithm that was used to calculate that value.
In most cases detector developers only have to define these functions and reuse the data aggregation strategies provided by the carefully optimized and extensively tested default ROBFragmentBuilder interface implementations. On the other hand, if another event fragment aggregation strategy is required for a particular subdetector, a new algorithm can be implemented and plugged in to the SW ROD application as is done for the default implementations. This does not affect the existing components of the SW ROD application and is completely transparent for the application itself.
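As an illustration of how two such procedures could parameterize a generic aggregation routine, the following Python sketch assumes an invented chunk layout: a 4-byte little-endian Level 1 identifier, the payload, and a 4-byte CRC-32 trailer. The layout, the function names (make_chunk, extract_l1id, check_integrity, aggregate) and the use of CRC-32 are assumptions made for this example only; real detector formats and the actual SW ROD signatures differ.

```python
import struct
import zlib
from typing import Callable, Dict, Iterable, List


def make_chunk(l1id: int, payload: bytes) -> bytes:
    """Build a chunk in the hypothetical layout: L1ID | payload | CRC-32 of the preceding bytes."""
    body = struct.pack("<I", l1id) + payload
    return body + struct.pack("<I", zlib.crc32(body))


def extract_l1id(chunk: bytes) -> int:
    """Trigger information extraction: read the L1ID from the first four bytes."""
    return struct.unpack_from("<I", chunk, 0)[0]


def check_integrity(chunk: bytes) -> bool:
    """Data integrity check: recompute the CRC-32 and compare it with the trailer."""
    if len(chunk) < 8:
        return False
    body, trailer = chunk[:-4], chunk[-4:]
    return zlib.crc32(body) == struct.unpack("<I", trailer)[0]


def aggregate(
    chunks: Iterable[bytes],
    get_l1id: Callable[[bytes], int] = extract_l1id,
    is_valid: Callable[[bytes], bool] = check_integrity,
) -> Dict[int, List[bytes]]:
    """Format-agnostic aggregation: group valid chunks by the identifier the callback returns."""
    fragments: Dict[int, List[bytes]] = {}
    for chunk in chunks:
        if not is_valid(chunk):
            continue  # a real builder would flag the fragment rather than drop data silently
        fragments.setdefault(get_l1id(chunk), []).append(chunk)
    return fragments


good = make_chunk(42, b"abcd")
corrupted = good[:-1] + bytes([good[-1] ^ 0xFF])  # flip bits in the CRC trailer
fragments = aggregate([good, corrupted, make_chunk(43, b"x"), make_chunk(42, b"efgh")])
```

The aggregation function never inspects the chunk bytes itself, which is the property that lets one strategy serve several detector formats.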

C. ROBFragmentConsumer Interface Implementations
• The FragmentProcessor class was developed to simplify the implementation of a common task, required by many subdetectors, of applying custom detector-specific postprocessing to all event fragments produced by the given SW ROD application. This class provides a workbench to execute detector-specific code on every event fragment that is passed to this consumer. This code should perform the necessary modifications to the event fragment payload but should keep the structure of the fragment untouched. The code can be provided in the form of a function, which has to be implemented by the corresponding detector experts and be given to the SW ROD application in the form of a shared library that will be loaded at runtime.
• The HLTRequestHandler class is responsible for buffering event fragments and serving them to the HLT farm on request. It keeps the event fragments until informed by the HLT that they are no longer needed. Event fragments are indexed by their Level 1 Trigger identifier and stored in an internal buffer until a clear request has been received from the HLT. On receipt of a clear request, all the event fragments with the identifiers provided by this request will be removed from the index and their allocated memory freed.
• The FileWriter class implements a consumer that simply writes all received event fragments to a file on disk. Files created by the FileWriter will be in the standard ATLAS data file format [9] with all event fragments prepended by the ATLAS full event header, which makes such files compatible with standard ATLAS event processing and analysis applications. This functionality is useful for testing, commissioning, calibration and other auxiliary activities which are performed by detectors beyond normal data taking.
• The EventSampler implements event selection for online monitoring. An instance of this class can be optionally added to the list of a SW ROD application's consumers to select a subset of aggregated event fragments for the purpose of online monitoring. This class passes selected events to the TDAQ Event Monitoring service [10] that transfers them to the applications responsible for data quality assessment.
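The buffering behaviour described for the HLTRequestHandler (index by Level 1 identifier, serve on request, drop on clear) can be sketched as follows. This is an illustrative Python model only; the class name FragmentBuffer and its methods are hypothetical, and the real component additionally has to deal with concurrency and network requests, which are left out here.

```python
from typing import Dict, Iterable, Optional


class FragmentBuffer:
    """Toy model of a fragment store indexed by Level 1 Trigger identifier."""

    def __init__(self) -> None:
        self._index: Dict[int, bytes] = {}

    def insert(self, l1id: int, fragment: bytes) -> None:
        # Keep the fragment until the HLT says it is no longer needed.
        self._index[l1id] = fragment

    def request(self, l1id: int) -> Optional[bytes]:
        # Serve a fragment on request; it stays buffered afterwards.
        return self._index.get(l1id)

    def clear(self, l1ids: Iterable[int]) -> int:
        # Remove every listed identifier from the index; return how many were present.
        removed = 0
        for l1id in l1ids:
            if l1id in self._index:
                del self._index[l1id]
                removed += 1
        return removed

    def __len__(self) -> int:
        return len(self._index)


buffer = FragmentBuffer()
for l1id in (100, 101, 102):
    buffer.insert(l1id, b"fragment-%d" % l1id)
served = buffer.request(101)          # fragment is returned but remains buffered
removed = buffer.clear([100, 101, 999])  # 999 was never stored and is ignored
```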

VII. SW ROD APPLICATION PERFORMANCE REQUIREMENTS
In Run 3, the SW ROD has to be able to operate at an input rate of 100 kHz, matching the ATLAS Level 1 Trigger accept rate. The number of input links and the overall data rates are defined by the output produced by the FELIX system. Table I summarizes these numbers for a single FELIX card. An important goal of the SW ROD is to handle as many input data links as possible in order to reduce the total system cost by minimizing the number of computers to be used to run SW ROD applications. While in FULL mode this number is essentially limited by the input network bandwidth available for a SW ROD computer, in GBT mode the data rate produced by a single FELIX card is much lower and the number of input links that can be served by a single SW ROD computer is instead limited by the performance of the GBT data aggregation algorithm. A dedicated study has been performed to estimate the maximum number of input links that can be handled by the GBT event fragment aggregation algorithm executed on a single SW ROD computer. The results of this study will be presented in the next section.

VIII. GBT MODE EVENT FRAGMENT BUILDING ALGORITHM OPTIMIZATION
Due to power consumption and heat dissipation issues the clock frequency of a modern CPU is normally in inverse proportion to the number of cores for the given CPU. Thus the product of these parameters gives a similar value for any CPU in the same price range. This value can be used to make a rough estimate of the full computing power a particular CPU can offer. It should be noted that a modern CPU is capable of executing more than one operation per cycle, but in practice this is difficult to achieve for complex code and normally a one-to-one ratio between cycles and operations is considered satisfactory. Taking 2.5 GHz as an average CPU frequency for a CPU that has 10 cores, one can assess the total number of CPU operations per second provided by an averagely priced CPU to be on the order of 2.5 × 10^10.
Given that the rate of data chunks from a single FELIX card in GBT mode is about 20 MHz, a simple division shows that such a CPU can provide about 1200 operations for a single data chunk, which corresponds to 0.5 microseconds. It should be taken into account that every chunk has to be aligned, by means of the Level 1 Trigger identifier that every data chunk contains, with the other ones for the purpose of event fragment aggregation. From this point of view this amount of computing power does not look large, especially if one wants to maximize the number of FELIX cards that can be handled by the same SW ROD computer, in which case this budget has to be divided further accordingly. Moreover, the computational resources provided by a modern CPU are proportional to the number of CPU cores, which means that to utilize them in an efficient way the software has to be designed to use multiple threads with a high degree of parallelism. This essentially precludes the use of high-level design patterns, like producer-consumer, to pass data between threads as this would incur too much performance overhead for thread synchronization.
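The budget estimate above is simple enough to check numerically. The short Python sketch below redoes the division under the stated assumptions (2.5 GHz, 10 cores, one operation per cycle, 20 MHz of chunks per FELIX card); the function name ops_per_chunk and the choice of six cards for the second figure are illustrative only.

```python
def ops_per_chunk(cpu_hz: float, cores: int, chunk_rate_hz: float, felix_cards: int = 1) -> float:
    """Operations available per data chunk, assuming one operation per cycle per core."""
    total_ops_per_second = cpu_hz * cores
    return total_ops_per_second / (chunk_rate_hz * felix_cards)


CPU_HZ = 2.5e9        # average clock frequency taken in the text
CORES = 10            # number of cores taken in the text
CHUNK_RATE_HZ = 20e6  # approximate chunk rate of one FELIX card in GBT mode

# Exact division gives 1250, which the text rounds to "about 1200".
budget_one_card = ops_per_chunk(CPU_HZ, CORES, CHUNK_RATE_HZ)

# The same number of cycles on a single 2.5 GHz core lasts about 0.5 microseconds.
time_per_chunk_us = budget_one_card / CPU_HZ * 1e6

# Sharing the same CPU between six cards divides the budget accordingly.
budget_six_cards = ops_per_chunk(CPU_HZ, CORES, CHUNK_RATE_HZ, felix_cards=6)
```

With six cards the per-chunk budget falls to roughly 200 operations, which is why any per-chunk synchronization between threads is hard to afford.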
The solution that was implemented for the GBT event fragment aggregation algorithm to minimize the rate of interactions between threads was to combine both data reading and event fragment aggregation into the same threads. To achieve that, the total number of input E-Links is split among a configurable number of worker threads, with each thread reading data chunks from the given subset of E-Links and aggregating them into a subfragment of a given event. When all subfragments of a particular event are ready they are assembled together by a dedicated fragment building thread. This approach makes it possible to split the algorithm into two stages:
1) The processing of individual data chunks is done in parallel by multiple concurrent threads at the O(10) MHz rate.
2) The final event fragment assembly that requires synchronization between threads is done at the rate of 100 kHz only.
The degree of parallelism provided by this algorithm can be estimated using a formula that is based on Amdahl's law. Equation (1) defines how the speedup S(n) of an algorithm executed by a given number of threads n depends on the parallel fraction P of this algorithm. Given that we know the processing rates of the parallel and non-parallel fractions of the GBT event fragment aggregation algorithm, we can express P as (2).
Here CFA is the relative cost of the final subfragment assembly operation with respect to the cost of handling a single data chunk. This equation shows that if the relative cost of the final assembly operation is less than 100 then the algorithm should give some performance gain, but in order to scale well this number should be less than 50. Using this equation, (1) can be transformed into (3), which defines how the speedup of the GBT algorithm depends on the relative cost of the final assembly operation. Finally, inverting (3) yields (4), which will be used in the next section to assess CFA for the current algorithm implementation using the empirical values for S(n) obtained from performance measurements.
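For reference, relations (1) to (4) can be written out as below. Equation (1) is the standard form of Amdahl's law. The form of (2) is an assumption: it takes the serial fraction to be CFA multiplied by the ratio of the final assembly rate (100 kHz) to the chunk processing rate (taken as 10 MHz), which reproduces the thresholds of 100 and 50 quoted in the text. Equations (3) and (4) then follow from (1) and (2) by algebra.

```latex
\begin{align}
S(n) &= \frac{1}{(1 - P) + \dfrac{P}{n}} \tag{1} \\
P &= 1 - C_{FA}\,\frac{10^{5}}{10^{7}} = 1 - \frac{C_{FA}}{100} \tag{2} \\
S(n) &= \frac{1}{\dfrac{C_{FA}}{100} + \dfrac{1}{n}\left(1 - \dfrac{C_{FA}}{100}\right)}
      = \frac{100\,n}{100 + C_{FA}\,(n - 1)} \tag{3} \\
C_{FA} &= \frac{100\,\bigl(n - S(n)\bigr)}{S(n)\,(n - 1)} \tag{4}
\end{align}
```

Under this form, P is positive only for CFA below 100 and exceeds 0.5 only for CFA below 50, matching the two conditions stated above.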

A. Testbed Configuration
Event building algorithm performance measurements were performed on a testbed that replicates the hardware configuration that will be used by the readout system during Run 3:
• The SW ROD application ran on a computer with a dual-socket Supermicro motherboard with 2 Intel(R) Xeon(R) Gold 5218 CPUs and 96 GB of DDR4-2667 RAM. Each CPU has 16 physical cores with a base frequency of 2.3 GHz.
• Input data for the tests were generated by a FELIX card software emulation application running on another computer with an Intel Xeon E5-1660 v4 CPU with 3.2 GHz base frequency and equipped with 32 GB of DDR4 2667 MHz memory.
• Both computers were equipped with Mellanox ConnectX-5 100 GbE network adapters, which were connected via a 100 Gb network switch. Data were sent to the SW ROD application via the FELIX NetIO protocol.

B. Network Throughput Test
In order to assess the overhead of the RoCE protocol the network throughput was measured using a simple bandwidth test utility from the Mellanox OFED software package with a default packet size of 65K bytes. The receiving application was started on the SW ROD computer with the following command:
    # ib_send_bw -F -n 100000
The client (sending) application was started on the FELIX computer with the IP address of the SW ROD host:
    # ib_send_bw -F -n 100000 192.168.100.1
Both applications reported an average rate of 91.3 Gb/s that stayed almost constant throughout the test, with variations of only a fraction of 1 Gb/s.

C. GBT Mode Tests
The aim of these tests was to study how the GBT event fragment building algorithm scales with the number of input E-Links and the number of threads used to handle input data. To this end, three series of tests were performed with the SW ROD application using one, two and three threads respectively to receive and aggregate data chunks from every group of 192 input links, which corresponds to input from a single FELIX card. The total number of emulated FELIX cards for different test series varied from 1 to 6, which made for a total number of input channels increasing gradually from 192 to 1152. The size of the generated data chunks was set to 40 bytes. The results of these tests are shown in Fig. 4.
These results show that the GBT event fragment aggregation algorithm implementation scales well in both dimensions: with the number of worker threads aggregating data from a given number of E-Links as well as with the number of such aggregation operations running concurrently in the scope of the same SW ROD application.
The dotted line shows the maximum theoretical input rate that can be obtained with the given hardware configuration, which is limited by the available network bandwidth. This line represents (5), where L is the input rate, F is the number of simulated FELIX cards, 40 × 8 is the size of the data chunk in bits, 12 × 8 is the size of the NetIO protocol overhead per chunk in bits, and 91.3 × 10^9 is the maximum bandwidth in bits per second that can be achieved using the RoCE protocol. This line demonstrates that the last three results of the tests with two reading threads and all but the first two results for the test series with three reading threads were limited by the network bandwidth. The dashed line shows the maximum rate that could be achieved if the NetIO protocol overhead was equal to zero. It indicates that the input rate could potentially be improved by reducing the NetIO overhead.
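A rough numerical version of this bandwidth limit can be sketched in Python. The calculation assumes that every event carries one chunk per E-Link and that each emulated card contributes 192 E-Links, as in the test setup; the factor of 192 is an assumption of this sketch and is not listed among the terms of (5) in the text. The function name max_input_rate_hz is hypothetical.

```python
def max_input_rate_hz(
    felix_cards: int,
    links_per_card: int = 192,     # assumption: one chunk per E-Link per event
    chunk_bytes: int = 40,         # chunk size used in the GBT mode tests
    overhead_bytes: int = 12,      # NetIO protocol overhead per chunk
    bandwidth_bps: float = 91.3e9, # measured RoCE throughput in bits per second
) -> float:
    """Event rate at which the network link saturates for a given number of cards."""
    bits_per_event = felix_cards * links_per_card * (chunk_bytes + overhead_bytes) * 8
    return bandwidth_bps / bits_per_event


# Bandwidth ceiling with and without the per-chunk NetIO overhead, for 1 to 6 cards.
with_overhead = {f: max_input_rate_hz(f) for f in range(1, 7)}
without_overhead = {f: max_input_rate_hz(f, overhead_bytes=0) for f in range(1, 7)}
```

Under these assumptions six cards saturate the link at roughly 190 kHz, and removing the 12-byte overhead would raise every ceiling by the factor 52/40 = 1.3.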
Using the results which were not limited by network bandwidth one can calculate the speedup S(n) and parallel fraction P of the GBT event fragment assembly algorithm and then use these numbers with (4) to compute an estimate of the CFA coefficient.The results of these calculations are shown in Table II.
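The calculation chain from a measured speedup to the CFA estimate can be sketched as follows. The first function inverts Amdahl's law, which is exact; the second assumes that the serial fraction equals CFA divided by the ratio of the chunk rate to the assembly rate (taken as 100), which is an assumption of this sketch. The speedup values used are invented for illustration and are not the measurements of Table II.

```python
def parallel_fraction(speedup: float, threads: int) -> float:
    """Invert Amdahl's law S = 1 / ((1 - P) + P / n) to obtain P from a measured S(n)."""
    if threads < 2:
        raise ValueError("at least two threads are needed to estimate P")
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / threads)


def cfa_from_speedup(speedup: float, threads: int, rate_ratio: float = 100.0) -> float:
    """Relative cost of final assembly, assuming the serial fraction is CFA / rate_ratio."""
    return rate_ratio * (1.0 - parallel_fraction(speedup, threads))


# Hypothetical measurements: (number of threads, observed speedup).
examples = [(2, 1.9), (3, 2.7)]
estimates = [(n, parallel_fraction(s, n), cfa_from_speedup(s, n)) for n, s in examples]
```

A perfectly parallel algorithm (S(n) = n) gives P = 1 and CFA = 0, while no speedup at all (S(n) = 1) gives P = 0 and CFA = 100 under the assumed rate ratio.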

D. FULL Mode Tests
In FULL mode, larger event fragments are sent to FELIX over fewer (higher bandwidth) links, with no data aggregation required in the SW ROD. The number of links needing to be serviced at this increased bandwidth can vary from 1 to 24 in the extreme case. In FULL mode several links can be grouped together to send fragments corresponding to different Level 1 trigger events in a round-robin pattern from the same piece of detector FE electronics if more bandwidth than a single link can provide is required.
Tests have been performed to study the behavior of the SW ROD's FULL mode data handling algorithm relative to the number of input link groups, which need to be serviced independently. Each group of links was used to send an independent stream of data, and inside a group event fragments were sent over the given links using a round-robin pattern. For these tests the size of the generated packets was set to 5K bytes and the number of independent streams of data generated by the FELIX software simulator varied from 1 to 24. For each configuration the average input rate per data stream was measured. The results of these tests are shown in Fig. 5. The results demonstrate the excellent scalability of the FULL mode data handling algorithm with respect to the number of input links. In all test series except one, the input rate was limited by the network bandwidth. The only exception is the configuration with all 24 input links used for the same data stream. In this test the input rate went up to 1.76 MHz, which saturated the CPU cores used by the SW ROD application's reading threads. No further study has yet been done for this scenario as the rate that was achieved is far in excess of the readout performance requirements for Run 3.

E. Scalability Towards Run 4 Requirements
For Run 4, which is planned to start in 2027, the LHC will undergo the High Luminosity Upgrade [11] that will significantly increase instantaneous luminosity and the number of particle interactions per bunch crossing. This will bring new requirements for the ATLAS TDAQ system, which will have to receive data from trigger and detector electronics at an input rate of 1 MHz. As the readout system for Run 4 will be based on FELIX it is useful to study the limits of the current readout implementation, such that they can be addressed in the new TDAQ architecture. For this reason, a series of tests were performed with the GBT event fragment aggregation algorithm to reveal the maximum number of input E-Links and chunk size configurations at which 1 MHz input can be sustained. For these tests the SW ROD application used the default GBT event fragment aggregation algorithm that assembles data from all the given input E-Links into a single fragment. Two rounds of tests were performed. In the first one the number of input E-Links varied from 24 to 192 and the input data chunk sizes were set to 40 and 80 bytes for different test series. For the second round of tests the number of input E-Links was fixed at 48 and the data chunk size varied from 40 to 240 bytes. The algorithm used 6 reading threads for both rounds of tests. Fig. 6a shows two data series which were obtained with the same test configurations but using different versions of the SW ROD application. The first test series revealed a bottleneck in the SW ROD application that was caused by the standard new and delete memory management operations. This was not a problem for the previous tests, where these operations were taking place at a rate of about 100 kHz, but when the input rate was increased towards 1 MHz the memory management overhead became prominent. A quick solution was put in place by replacing the new and delete operations with a custom memory pool implementation that preallocates a large number of memory blocks and keeps a list of free blocks in a tbb::concurrent_queue container [12], which made the use of this memory pool by multiple concurrent threads possible. This improved the input rate of the SW ROD application by almost 50% and made it possible to reach a rate above 1 MHz with some configurations. The same implementation was used for the second round of tests, for which the results are shown in Fig. 6b.
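The memory pool idea (preallocate blocks once, then recycle them through a thread-safe queue of free blocks) can be illustrated with a short Python sketch. Python's queue.SimpleQueue stands in here for tbb::concurrent_queue; the class name BlockPool, its methods, and the fallback allocation on an empty pool are assumptions of this example, not a description of the SW ROD implementation.

```python
import queue


class BlockPool:
    """Fixed-size block pool whose free list is a thread-safe queue."""

    def __init__(self, block_size: int, block_count: int) -> None:
        self._block_size = block_size
        self._free: "queue.SimpleQueue[bytearray]" = queue.SimpleQueue()
        # Pay the allocation cost once, up front, instead of once per fragment.
        for _ in range(block_count):
            self._free.put(bytearray(block_size))

    def acquire(self) -> bytearray:
        # Take a recycled block if one is available; otherwise fall back to a
        # fresh allocation (an assumption of this sketch).
        try:
            return self._free.get_nowait()
        except queue.Empty:
            return bytearray(self._block_size)

    def release(self, block: bytearray) -> None:
        # Return the block to the free list so that another thread can reuse it.
        self._free.put(block)

    def free_blocks(self) -> int:
        return self._free.qsize()


pool = BlockPool(block_size=64, block_count=2)
first = pool.acquire()
second = pool.acquire()
extra = pool.acquire()   # pool is empty, so this block is freshly allocated
pool.release(first)
reused = pool.acquire()  # the block released above comes back from the free list
```

Because acquire and release only touch the queue, several reading threads can share one pool without an additional lock around it.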

X. CONCLUSION
A mixture of the legacy ROD-based and the new FELIX-based readout will be used by the ATLAS TDAQ system for LHC Run 3. The SW ROD is a new component of the ATLAS DAQ system that was developed to receive data from FELIX. The SW ROD implements a high-performance customizable framework that supports custom input data formats and different event fragment aggregation strategies as required by the new ATLAS detector and trigger components. The SW ROD fully satisfies the performance and functional requirements which have been defined by ATLAS for Run 3. The default GBT event fragment aggregation algorithm makes it possible to handle data input at or above the required rates from up to 6 FELIX cards working in GBT mode, thus minimizing the overall cost of the new readout system by reducing the number of computers required for the SW ROD system. Further optimization could be achieved by reducing the overhead of the FELIX communication protocol. A study of how the Run 4 performance requirements can be met is ongoing and has already revealed some very promising results.

Fig. 2. Typical interactions between SW ROD application components for a normal data taking activity. The DataInput component subscribes to FELIX and passes received data to the ROBFragmentBuilder, which aggregates data into event fragments and transfers them to the first ROBFragmentConsumer in the list.

Fig. 6. SW ROD input rate with varying data chunk size (a) and with varying number of E-Links (b).

TABLE II. ESTIMATE OF THE PARALLEL FRACTION OF GBT ALGORITHM