Performance Evaluation of Modern Time-Series Database Technologies for the ATLAS Operational Monitoring Data Archiving Service

The trigger and data acquisition (TDAQ) system of the ATLAS (a toroidal LHC apparatus) experiment at the Large Hadron Collider (LHC) at CERN is composed of a large number of distributed hardware and software components, which provide the data-taking functionality of the overall system. During data-taking, huge amounts of operational data are created in order to constantly monitor the system. The persistent back end for the ATLAS information system of TDAQ (P-BEAST) is a system based on a custom-built time-series database. It archives all the operational monitoring data published online, resulting in about 18 terabytes (TB) of highly compacted and compressed raw data per year. P-BEAST provides command line and programming interfaces for data insertion and retrieval, including integration with the Grafana platform. Since P-BEAST was developed, several promising database technologies for efficiently working with time-series data have become available. A study to evaluate the possible use of these recent database technologies in the P-BEAST system was performed. First, the most promising technologies were selected. Then, their performance was evaluated. The evaluation strategy was based both on synthetic read and write tests and on realistic read patterns (e.g., providing data to a set of Grafana dashboards currently used to monitor ATLAS). All the tests were executed using a subset of the ATLAS operational monitoring data archived during the LHC Run II. The details of the testing procedure and of the testing results, including a comparison with the current P-BEAST service, are presented.


I. INTRODUCTION
THE trigger and data acquisition (TDAQ) system of the ATLAS (a toroidal LHC apparatus) [1] experiment is a complex distributed system made up of a large number of hardware and software components: about 3000 machines and on the order of O(10^5) applications working to accomplish the data-gathering function of the detector.
During data-taking runs, large amounts of operational monitoring data are produced in order to monitor the functioning of the detector. Currently, these data are gathered and stored using a system called the persistent back end for the ATLAS information system of TDAQ (P-BEAST) [3]. P-BEAST is, essentially, a custom time-series database used for archiving operational monitoring data and retrieving the stored data for the applications that require it. It stores about 18 TB of highly compressed raw operational monitoring data per year. Since P-BEAST was commissioned in 2014, several new time-series database technologies have been released. Even though P-BEAST currently meets the ATLAS requirements in terms of sustained data insertion rate, a survey has been done in order to identify a short list of the most promising candidates to evaluate for the purpose of improving P-BEAST in terms of both functionality and performance. For instance, with the current architecture, P-BEAST can scale only vertically, and its performance is limited by the hardware resources of the single server it runs on. In that respect, the ability to scale horizontally would represent a huge step toward a better performing and more robust (via redundancy) service. In addition, P-BEAST uses a plain filesystem-based database to store the recorded information; the usage of a more advanced database technology would make the data storage more reliable and the querying more efficient.

II. BACKGROUND
The main requirements for any potential candidate were to support, or provide emulation for, all the data types currently used by the ATLAS operational monitoring (integers, floats, strings, and arrays) and to be capable of sustaining the data injection rate observed with P-BEAST during real data-taking runs. This means an insertion rate of approximately 200k metrics/s during LHC [2] Run II and about 500k metrics/s at the beginning of Run III. A preliminary survey was done among time-series database technologies, columnar database technologies, and key-value stores. As a result of the preliminary survey, two technologies were selected: InfluxDB and ClickHouse. In the first phase of this research [4], two ways of organizing the stored data have been tested: a single table data organization (see Fig. 1) and a multiple table data organization (see Fig. 2).
In the single table setup, all the data points for a given attribute are stored in a single table, which contains an object column used as an index. In the multiple table setup, all the data points belonging to an attribute.object pair are stored in their own table.
The idea behind experimenting with both these approaches to storing the data was to test which one results in a faster writing rate. At the end of the first round of testing, the results showed that both approaches have very similar performance, with a very slight advantage for the single table approach. This, combined with the fact that queries were easier to work with in a single table setup, led to the decision to use the single table setup for this second round of testing.
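The two layouts can be illustrated with the table definitions each one would generate. The sketch below builds hypothetical ClickHouse DDL for both: table layout, column names, and engine settings are illustrative assumptions, not the schemas used in the study.

```go
package main

import "fmt"

// singleTableDDL: one table per attribute, with an indexed "object"
// column distinguishing the individual time series (single table layout).
// Names and engine settings are illustrative, not those of the study.
func singleTableDDL(attribute, valueType string) string {
	return fmt.Sprintf(
		"CREATE TABLE %s (ts DateTime64(6), object String, value %s) "+
			"ENGINE = MergeTree() ORDER BY (object, ts)",
		attribute, valueType)
}

// multiTableDDL: one table per attribute.object pair; the object name is
// encoded in the table name instead of a column (multiple table layout).
func multiTableDDL(attribute, object, valueType string) string {
	return fmt.Sprintf(
		"CREATE TABLE %s_%s (ts DateTime64(6), value %s) "+
			"ENGINE = MergeTree() ORDER BY ts",
		attribute, object, valueType)
}

func main() {
	fmt.Println(singleTableDDL("cpu_load", "Float64"))
	fmt.Println(multiTableDDL("cpu_load", "host01", "Float64"))
}
```

The single table layout trades a wider primary key for far fewer tables, which is also why its queries were easier to work with.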

A. Test Data
For the write performance testing, the tests have been run using real ATLAS operational monitoring data archived during the ATLAS Run II operation. Out of the ATLAS operational monitoring data stored using P-BEAST, four data types have been selected as the most representative ones: 1) arrays of 12 float64s; 2) strings of approximately 5500 characters; 3) float64s; and 4) int64s. For the read performance testing, the databases created and populated with this ATLAS operational monitoring data have been used for all read performance tests.
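A data point covering these four representative types could be modeled as follows; this is a hypothetical sketch (the field layout and the object name are assumptions, not P-BEAST's actual record format).

```go
package main

import (
	"fmt"
	"time"
)

// Sample models one archived operational monitoring value; exactly one
// of the value fields is set, matching the four representative data
// types selected for the tests. Field and object names are illustrative.
type Sample struct {
	Object string
	TS     time.Time
	I64    *int64
	F64    *float64
	Str    *string   // ~5500-character strings in the tested dataset
	F64Arr []float64 // arrays of 12 float64s in the tested dataset
}

func main() {
	v := 3.14
	s := Sample{Object: "some.monitored.object", TS: time.Now(), F64: &v}
	fmt.Println(s.Object, *s.F64)
}
```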

B. Software
The implementation of all tests has been done in the Go programming language. InfluxDB is itself developed in Go and thus has native support for Go clients; for ClickHouse, Go support is available via third-party libraries.

C. Hardware
All the tests have been run on a dual-central processing unit (CPU) computer with the following specifications:
1) two Intel Xeon E5-2630 v2 @ 2.60-GHz CPUs (each with six cores and hyperthreading, for a total of 24 threads);
2) 32 GB of RAM.

IV. TESTING

A. Write Testing
The first batch of tests was the write tests. These have been developed to fetch batches of 10 000 data points from the existing ATLAS operational monitoring database using P-BEAST, and then write those batches into the prepared InfluxDB and ClickHouse databases.
P-BEAST stores data using an in-house format based on Google Protocol Buffers. Every data file stores time-series data for many objects of a single attribute. Inside a file, the data are indexed by object name. The data files are compacted and compressed weekly. The P-BEAST measurements have been performed to serve as a baseline for the performance measurements of the other technologies.
InfluxDB: Each object is stored in a measurement (an InfluxDB table); the timestamp and a tag (an InfluxDB indexed column) containing the object name make up the primary key in each measurement.
ClickHouse: Each object is stored in a single table; the columns containing the timestamp and the object name make up the primary key in each table.
Before importing historical operational monitoring data into the InfluxDB and ClickHouse databases, it was exported from P-BEAST and stored as text files.
Then, for both technologies, a separate test was developed to implement the same functionality:
1) initialize the database and create the necessary tables if this has not been done yet;
2) read the intermediate store of data and fetch records until a batch of 10 000 data points has been filled;
3) write the prepared batch of data to the database.
Initially, testing several batch sizes was planned, such as batches of 100, 1000, and 10 000 data points; 10 000 data points per batch is the batch size recommended in the InfluxDB documentation. The ClickHouse documentation makes no recommendation about batch sizes, so the InfluxDB recommendation was used as a reference. The write performance measurements took into account the total number of data points being written: for example, writing 100 000 data points using batches of 10 000 data points means writing ten batches, while using batches of 100 data points means writing 1000 batches. The idea was to test how the writing rate is influenced by the batch size being used. However, the execution time of the tests was already very long, on the order of months, so only the 10 000 data point batch tests were kept in the testing plan.
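The batching step described above can be sketched as follows. This is a minimal illustration of the grouping logic only (the element type stands in for a real data point, and the write callback stands in for the InfluxDB or ClickHouse client call).

```go
package main

import "fmt"

// batchSize follows the InfluxDB documentation recommendation that was
// also reused as a reference for ClickHouse.
const batchSize = 10000

// flushBatches groups points into fixed-size batches and hands each full
// (or final partial) batch to write, mirroring the test procedure: fetch
// records until a 10 000-point batch is filled, then write it out.
// It returns the number of write calls performed.
func flushBatches(points []int, write func(batch []int)) int {
	batches := 0
	for start := 0; start < len(points); start += batchSize {
		end := start + batchSize
		if end > len(points) {
			end = len(points)
		}
		write(points[start:end])
		batches++
	}
	return batches
}

func main() {
	// 100 000 data points in 10 000-point batches means ten writes.
	fmt.Println(flushBatches(make([]int, 100000), func([]int) {}))
}
```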

B. Read Testing (Synthetic)
The databases created and populated with data during write testing have been used for all read tests. They contain the same raw data for both InfluxDB and ClickHouse, because the same original P-BEAST data had been written into them during write testing.
The first, and simplest, types of queries that can be used are those in which all data points are fetched from a specified time interval. However, in a production setting where Grafana [7] is going to be used to display data, such simple queries would be meaningless in many, if not most, situations. Thus, more realistic queries needed to be tested.
In order to make use of Grafana's capabilities to display complex graphs with multiple data series, a more complex query that can fetch the data needed to display multiple time series on the same graph is needed.
The complexity of the query is caused by the need to fulfill one or both of the following two requirements.
1) Separate tables into data series by a tag (in the case of P-BEAST, the tag being the object name).
2) Aggregate measurements over a given time interval.
By combining these two requirements, we end up with four types of possible queries:
1) a simple query, such as those mentioned earlier, without time-series separation and without aggregation;
2) a query with time-series separation but without aggregation;
3) a query without time-series separation but with aggregation;
4) a query with both time-series separation and aggregation.
Out of these four query types, only the first and the last were kept, in order to prevent the analysis from becoming too verbose; these two query types were used in all subsequent tests. A couple of issues became apparent with the above queries.
1) Issue 1: Limitation of the ClickHouse Grafana Datasource Plugin: The groupArray ClickHouse function returns an associative array as a list of tuples. Each tuple contains a key and a value. In these queries, the key is a timestamp, although the timestamp and the object name could be flipped around and the object name used as the key, if needed. The value is whatever type of data is present in the table being queried (or some aggregation of it).
The problem encountered was that the ClickHouse Grafana datasource plugin used for testing was not able to handle tuples of any kind. As a result, the complex queries would have been impossible to test. After investigation, it turned out that only the C++ ClickHouse client could handle all the data types that ClickHouse can output. The testing setup had been using the Go client since the beginning, so moving over to C++ would have been problematic. The easier way was, therefore, to implement tuple support in the Go ClickHouse client library.
The only limitation of this implementation is that it works with all basic data types but not with arrays: no arrays can be returned as values in the key/value pairs of the tuples. This is a consequence of the architecture of the Go ClickHouse library, which would have required much more extensive modifications to handle array tuple values as well.
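To make the shape of these complex queries concrete, the sketch below builds a query of type 4) as a SQL string, the way the Go test code could assemble it. The exact SQL (table and column names, interval bucketing via intDiv) is an assumption based on the description above, not the query text used in the study.

```go
package main

import "fmt"

// complexQuery builds a query with time-series separation by the object
// column plus aggregation over fixed-length intervals. groupArray packs
// the per-object (timestamp, value) pairs into a single row, which is
// exactly the tuple output that required extending the Go ClickHouse
// client. Table and column names are illustrative.
func complexQuery(table string, from, to int64, intervalSec int) string {
	return fmt.Sprintf(
		"SELECT object, groupArray((t, v)) FROM ("+
			"SELECT object, intDiv(toUnixTimestamp(ts), %d)*%d AS t, avg(value) AS v "+
			"FROM %s WHERE ts BETWEEN %d AND %d GROUP BY object, t ORDER BY t"+
			") GROUP BY object",
		intervalSec, intervalSec, table, from, to)
}

func main() {
	fmt.Println(complexQuery("cpu_load", 1530000000, 1530086400, 60))
}
```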
2) Issue 2: Not All Queries Made Sense for All Attribute Data Types: The data types used in the tests, as mentioned earlier, are arrays, strings, floats, and integers. The complex queries, because of the time-series separation and of the aggregation, offer no easy solution for compound data types, such as arrays and strings. Complex queries can be run only on basic data types because of the following.
1) Time-series separation, although conceptually possible, was not implemented for ClickHouse because of the limitation of the Go ClickHouse library mentioned in Issue 1.
2) Aggregation, while theoretically possible for strings and arrays, is a more complicated topic that was considered outside the scope of this work.
For each test run, the server is started and checked to be up, the test is run, and then the server is stopped. This is done for two reasons.
1) To avoid any caching artifacts that could skew the results, the server is stopped after each run.
2) It was noticed in preliminary testing that some of the queries can be intensive enough to crash the servers, both for ClickHouse and especially for InfluxDB. Regardless of whether the server crashed or was stopped cleanly, by making sure it is not running at the end of a run, each test in the test suite has the same starting point: starting the server and making sure it has initialized properly before sending queries. The crashes happened as a consequence of limitations in the available hardware resources.

C. Read Testing (Grafana)
The standard interface of P-BEAST is based on Grafana, so any potential technology that could be used as a new database engine for P-BEAST would need to work as well as possible with Grafana. Thus, because even the complex query tests were still run in a synthetic environment, a final round of read testing using Grafana has been set up.
The Grafana testing started from an existing P-BEAST dashboard (see Fig. 3).
The first step was to replicate the functionality of this dashboard, but using first InfluxDB and then ClickHouse as a back end. A one-month-long time interval was selected, and the ATLAS operational monitoring data used in the chosen dashboard were imported both into an InfluxDB and into a ClickHouse database.
Then, once the data were imported into the databases, the original P-BEAST dashboard was recreated for InfluxDB and for ClickHouse. Care was taken to get the recreated graphs to be as close as possible to the original graphs. See Figs. 4 (from the InfluxDB dashboard) and 5 (from the ClickHouse dashboard) for an idea of how similar the two new dashboards are to each other. As can be seen in the figures, some small differences remain, caused by the slightly different functionality of the InfluxDB and ClickHouse plugins for Grafana.
Again, as in the case of the write performance testing, the P-BEAST version was tested as well, for the purpose of using these measurements as a reference. Time intervals from 3 h up to 21 days were selected. For each time interval, 30 measurements were taken while making sure that no caching was involved to skew the measurements.
The four setups that were tested are the following:

V. TEST RESULTS
The conclusions of this batch of testing vary across the different tests.

A. Write Testing
For write testing, P-BEAST demonstrated write performance superior to both InfluxDB and ClickHouse. Between the latter two, ClickHouse was the technology with better results across all the performed tests.
Furthermore, ClickHouse has the advantage of free built-in clustering support, which can be used to increase its writing rate even further; for InfluxDB, clustering support is a commercial offering. See Fig. 6.

B. Read Testing (Synthetic)
For synthetic read testing, the results differ strongly between query types and between the tested technologies.
In the case of simple queries, ClickHouse always shows better read performance than InfluxDB. See Fig. 7.
However, in the case of complex queries, the performance of InfluxDB overtakes that of ClickHouse from a certain query interval onward and stays above it for the remaining tested intervals, as shown in Fig. 8.
It needs to be noted that the graph line for the InfluxDB string (5500 char) dataset is missing. This is a consequence of the fact that the queries for that dataset, for that technology, failed in all instances.
Failing queries are also the reason why the InfluxDB graph lines do not extend all the way to the end of the 112-day interval: the missing measurements are caused by queries failing when large time intervals were queried for those respective datasets.
Finally, it needs to be clarified that these failing queries are the cause of the server crashes mentioned in Section IV, and they result from limitations in the available hardware resources.

C. Read Testing (Grafana)
During the Grafana read tests, several tens of queries are started simultaneously by the dashboard, which is quite different from the synthetic read tests, where every test was executed individually.
As suggested by the better complex-query performance of InfluxDB, it was not unexpected that, in the Grafana read testing, InfluxDB showed consistently better performance than ClickHouse, as shown in Fig. 9.
The read performance of raw P-BEAST without caching enabled is below both InfluxDB and ClickHouse, which is explained by its primitive data file format (no data indexing by time inside the weekly compacted and compressed data files).
When looking at the performance of P-BEAST with caching enabled, despite starting at a level similar to ClickHouse and worse than InfluxDB, for longer queried intervals it demonstrated better performance than both InfluxDB and ClickHouse. This suggests that if either InfluxDB or ClickHouse were used, with caching, as a back end for a modified P-BEAST, the performance of the upgraded P-BEAST would exceed what is currently available.

VI. CONCLUSION
The test results demonstrated much better write performance of the present P-BEAST implementation over both the InfluxDB and ClickHouse technologies. To sustain the Run II data insertion rates with either technology, several times more hardware resources would be necessary, assuming a linear increase of write speed with the number of machines in the cluster.
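The sizing implied by the linear-scaling assumption amounts to a simple rounded-up ratio, sketched below. The target rate of 500k metrics/s comes from Section II; the per-node rate used here is a placeholder, not a measured result from this study.

```go
package main

import "fmt"

// nodesNeeded estimates the cluster size required to sustain targetRate
// under the linear-scaling assumption (write throughput proportional to
// the number of machines).
func nodesNeeded(targetRate, perNodeRate float64) int {
	n := int(targetRate / perNodeRate)
	if float64(n)*perNodeRate < targetRate {
		n++ // round up: a partial node is still a whole machine
	}
	return n
}

func main() {
	// 500k metrics/s is the Run III rate from Section II; the per-node
	// rate of 80k metrics/s below is purely illustrative.
	fmt.Println(nodesNeeded(500_000, 80_000))
}
```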
The read performance of both the InfluxDB and ClickHouse technologies is better than that of the present P-BEAST without the caching option enabled. It is expected that the implementation of a caching layer would increase read speed for both technologies.

A. Further Research
The next natural step of this study would be to configure InfluxDB and ClickHouse in cluster mode and evaluate their capability to scale horizontally. Then, one of the two technologies could be used as a database back end for P-BEAST. The goal is to have a new architecture for P-BEAST ready before the start of LHC Run IV.