DPTC -- an FPGA-based trace compression

Recording of flash-ADC traces is challenging from both the transmission bandwidth and storage cost perspectives. This paper presents a configuration-free lossless compression algorithm which addresses both limitations, by compressing the data on-the-fly in the controlling field-programmable gate array (FPGA). Thus the difference predicted trace compression (DPTC) can easily be used directly in front-end electronics. The method first computes the differences between consecutive samples in the traces, thereby concentrating the most probable values around zero. The values are then stored as groups of four, with only the necessary least-significant bits in a variable-length code, packed in a stream of 32-bit words. To evaluate the efficiency, the storage cost of compressed traces is modeled as a baseline cost including the ADC noise, and a cost for pulses that depends on their amplitude and width. The free parameters and the validity of the model are determined by comparing it with the results of compressing a large set of artificial traces with varying characteristics. The compression method was also applied to actual data from different types of detectors, thereby demonstrating its general applicability. The compression efficiency is found to be comparable to popular general-purpose compression methods, while available for FPGA implementation using limited resources. A typical storage cost is around 4 to 5 bits per sample. Code for the FPGA implementation in VHDL and for the CPU decompression routine in C of DPTC are available as open source software, both operating at multi-100 Msamples/s speeds.


I. INTRODUCTION
T HIS work is motivated by developments in data handling in nuclear and particle physics. However, its applicability is not limited to those fields. Experiments in nuclear and particle physics are growing, which implies an increasing amount of data that needs to be handled. This is caused by an increase in the number of detectors employed, finer segmentation and higher event rates. Of particular interest for this work is the recording of signal traces, because this is associated with a dramatic increase of data that need to be transferred, compared with a simple digitization of pulse amplitudes.
To illustrate the development of experimental setups, we consider two front-line particle physics experiments almost 30 years apart. We compare the ATLAS (A Toroidal LHC ApparatuS) experiment at LHC, CERN, which started datataking in 2009, with the UA1 (Underground Area 1) experiment at SppS, CERN, which started data-taking in 1981. Concerning data production, UA1 was designed to deliver around 3 MB/s, mainly limited by the speed in writing to magnetic tape [1]. The data acquisition of ATLAS on the other hand stores around 320 MB/s [2], with much higher internal data rates. The increase of a factor 100 in recorded data rate over a time span of 30 years is compensated by the substantial improvement of commercial development in both communication and storage.
Considering the evolution of Ethernet between 1980 and 2010, we have witnessed an increase of about a factor 20 every 10 years in bandwidth [3], [4], with the major increase in the latter half of the timespan. After 2010 however, a lower rate of growth, a factor 4 every 10 years, starts to appear.
For data storage, between 1980 and 2010, the increase was on average a factor 30 every 10 years, with a peak between 1990 and 2005 where the area density doubled and prices per byte fell by half on a yearly basis [5]. Also this pace has slowed down since around 2010, with instead a factor 4 every 10 years [6], [7].
This slowdown in industry development poses new data acquisition challenges for both transmission speed and storage. A particular case when these are in high demand is when scientists are interested in storing entire traces, i.e. raw data directly from flash-ADCs, for example during testing or debugging of detectors and data processing procedures. In this case, the amount of data is much larger, easily by a factor 20-1000 [8].
One way to cope with these challenges is to increase capital expenditure to buy newer and better performing equipment. However the need to reduce costs leads to a different approach, where we aim to reduce the size of the data to be handled. This can be achieved through data compression.
If employed as software running on a PC, only data which has already been sent from the signal acquisition unit can be reduced. This gives no reduction in the transfer rate demands. To address both limitations, an implementation of the compression directly on the FPGA, where the initial signal processing takes place, is needed. This article presents a simple yet effective lossless compression method, that can be applied to sequences of continuously sampled data. The method allows a straightforward and fast implementation in FPGAs as well as CPUs.
This paper is structured in the following way: First, already available solutions are reviewed, followed by a description of the present routine. Optimisation possibilities, both regarding compression efficiency and resource utilisation are then discussed. This is followed by descriptions of the interfaces to the FPGA compression module and the CPU decompression code. The storage cost of both noise and pulses are then modeled, and verified using synthetic trace simulations. Finally, the achieved 0000-0000/00$00.00 arXiv:1903.10984v1 [physics.ins-det] 26 Mar 2019 storage cost reduction is benchmarked using traces from actual detectors.

II. OVERVIEW OF AVAILABLE SOLUTIONS
Ideas for data compression on front-end electronics are not new. Scientists working on large detectors have already faced the problem of how to efficiently compress data, albeit with different boundary conditions than in our case. Both lossy compressions, where a part of the initial information is lost to accomplish a reduction [9], [10]; and lossless compressions, where the initial information can be fully reconstructed, can be achieved following different approaches. One is to discard parts of the signal with no or little information (zerosuppression [11]). Another approach is to use a variable length coding [12], such as Huffman coding as shown in [13] or Golomb-Rice coding, which is used in [14]. The effectiveness of such algorithms is based on the knowledge of the probability distribution of the original data values. Usually this knowledge is gained from inspecting the whole or a representative pool of the data undergoing compression. This requires to store and to analyse a representative sample of the data during setup, in order to tune the compression configuration to the signal and ADC operation parameters. As the signal characteristics have a tendency to change between calibration and production data, such approaches are not suitable for our purpose as a generic configuration-free compression method. In some cases, through a pre-processing of the incoming data, a more advantageous probability distribution can be exploited. A common approach is the calculation of differences between values [15], [16], [17], [18], [19]. These differences may be between sampled data and a model [15] or between sampled data and a reference value (base) [16], [17] or between consecutive samples [15], [17], [18]. When dealing with signal traces, which are sampled at rates high enough that consecutive samples have values close to each other, i.e. are correlated, the latter approach delivers a distribution dominated by small values.

III. OPERATING PRINCIPLE
The compression presented in this paper is based on preserving only those least significant bits which hold the information necessary to recover each value. Although this does not correspond to a real Huffman coding, the result is to encode the more common smaller values, i.e. closer to zero, with shorter sequences. This approach is quite similar to the one presented in [17], where one sample works as base value and the following three samples undergo the differencing treatment. The base value can be chosen arbitrarily. We use the first value of each trace, with all following samples subject to the difference processing. The resulting differences are organised in groups of four, and all the samples in one group are stored using the same number of bits. A small header containing information about the encoding is placed at the beginning of each group.
Our implementation is organised in two steps, as shown in Fig. 1. First, the procedure calculating the differences is applied to the input data. The original samples consist of a sequence of n-bit data words, where n is given by the bit resolution of The first stage prepares the values to be encoded as differences of the input values. The second stage concerns the bit-packaging and determines the number of bits needed for each group, and prepares and shifts group headers and encoded values, merging them into the output words. One input value can be accepted each cycle, marked by the data valid (dv) signal. This predicate follows the data pipeline (PPL), and could in the future allow the circuit to operate even if new values are not provided every cycle. Full output words are signalled by dv_out. The end of a trace is to be marked by the flush signal, which, after passing the pipeline, ensures that the last data word is generated, followed by done.
the sampling ADC. The current design allows n to have any value in the range 5-16. The second stage is responsible for packing the differences into a stream of 32-bit words.
A. Differencing procedure The first stage treats each value according to the following rules: 1) Calculate the difference to the previous value.
2) The later storage is due to the binary encoding slightly asymmetric. With a certain number of bits, it can store one more negative value than positive. With e.g. 3 bits, the eight differences −4, −3, −2, . . . , 3 can be stored. For flat (noise-like) parts of a trace, any deviation from zero will generally be followed by a difference of the opposite sign. To make negative values more common than positive, a sign-changing scheme is applied: If a stored value is negative, the next non-zero value is stored with inverted sign; while, if positive, the next is stored as is. A value of zero does not change how to store following values.
Note that in all operations, only n bits are considered, i.e. the differences are allowed to wrap (arithmetic is modulo-2 n ). This does not introduce any ambiguity. Short encoding, ∆m = −1, 0, or 1.  Depending on the difference ∆m in the number of bits that are needed in this group compared to the previous, the group header is characterised by a short (upper) or long (lower) encoding. The parameter k is fixed by the maximum number of bits that are needed to represent the maximum difference not covered by the short header max(∆m) = n − 3.

B. Group creation
The values are stored in groups of four, using the same number of bits, m, for each value in a group. This is illustrated in Figs. 2 and 3. Since the stored values are differences, both positive and negative values must be representable (in two's complement representation). Since each value may require a different number of bits to be represented, the widest representation needed by any value in a group is used. The number of bits used for values in each group is stored in a group header, placed before the actual data. Considering consecutive groups, it is worth noticing that the number of bits needed will often not change much and therefore a short and long encoding of the number of bits is employed, see Fig. 2. The short header consists of two bits: if the encoded value is 1, 2, or 3, the number of bits to use for the group is the same as for the previous group with a change of −1, 0 or +1 bits, respectively. If the value of the two-bit short header is 0, the encoding is long and contains the full difference of bits stored per value. Since some values are already covered by the short encoding, an offset of 2 is applied to the full difference. This is encoded using k bits, which is chosen such that any needed difference, at most n − 3, can be stored; i.e. 1 bit for n ≤ 5, 2 bits for n ≤ 7, 3 bits for n ≤ 11, and 4 bits for n ≤ 19.
The number of bits per stored difference is interpreted with a bias of 1, meaning that storing 0 bits per value is not supported. This is a conscious choice: supporting 0-bit values (i.e. minimal encoding of groups with all value differences 0) would make the code for CPU decompression (and compression) more complicated. Since ADCs usually are operated with noise in the least significant bit, it is also expected to have limited practical use.
The data values are then stored with the necessary number of bits for the group. Each data value is stored with a bias relative to the most negative value that can be stored with the given number of bits. This simplifies decoding, as the stored value only has to be unmasked, and the bias subtracted. This  avoids a cumbersome sign extension operation by the CPU decoder.
As an exception to the above rules, the first data value is stored alone and fully, using n bits. This avoids storing the entire first group of data with many bits.

C. Output word formation
The resulting stream of bits is then packed in 32-bit words, being filled from the least significant bits. When a value to store cannot fit, the completed output word is emitted and the remaining bits are stored in the next 32-bit output word.
Information about the number of original data values, number of data words produced by the compression and n is needed by the decompression procedure. These values are not recorded by our routine, therefore it is the responsibility of the user to retain this information.

IV. OPTIMISATION
The algorithm described in the previous section can be optimised in different ways. However, the improvements obtained by applying additional procedures depend on many aspects, such as noise level, signal shape, and the distribution of signal amplitudes. Note that while improving for some characteristic, an optimisation will undermine other aspects. We present a few ideas together with a short analysis of each one, discussing advantages and disadvantages.  Table III. The results have been averaged (geometrically) within the different sample groups, and then again to provide a total average. The minimum locations are indicated. The flat test traces are not included in the total average. A. Compression factor optimisation 1) Linear predictor: With this additional pre-processing, the linear component of long sloped parts of a trace are removed by a second differencing of the data. This aims at a distribution of values more narrow around zero. However, for flat parts of a trace, which mainly contain noise, such a double difference leads to a wider distribution. Thus, in order to give an overall improvement, this procedure must only be applied for sufficiently long, sloped sequences. This is controlled by a heuristic using the observation that consecutive samples in unfavorable regions change sign often, or have 0 difference, and thus can be detected by a three-most-recent rule. The second differencing is switched off when at least one sign change or a zero has occurred for the previous three values.
While at first appearing to be promising for synthetic traces, from tests on actual traces, this optimisation does however not bring any improvement. This is connected to the fact that usually most of the pulses in the digitised traces only have small amplitudes, therefore the few improvements by this predictor are neutralised due to it activating spuriously in flat parts. The optimisation is implemented in the code, but deactivated by default.
2) Number of values in a group: The group size can also be varied to optimise the compression efficiency, see Fig. 4. Using smaller groups require more storage space due to the more frequent headers, while larger groups will encode unnecessary bits for more samples. The figure shows an optimum around five samples per group, with gradual losses at larger values, or steeply below three.
We have chosen to code four values in each group. The loss is about 0.6 % compared to five. Fixing the number as a power of two might be useful for a future parallelized unpack code.
B. Circuit optimisation 1) Additional pipeline stages: The achievable minimum clock cycle period in a digital circuit depends on the propaga-  tion delay of the longest combinational logic chain between register latches. In our case the circuit is described in VHDL, where the model and grade of the FPGA that is targeted will affect which logic expression becomes the longest. Adding pipeline stages to split the longest paths helps to lower the minimum clock cycle. At the same time however, introducing a pipeline stage causes more LUTs 1 to be used, as well as flip-flops; leading to a trade-off between resource-usage and speed. In order to allow flexibility when using the code, a few generic parameters control a number of optional pipeline stages.
Since the synthesized code uses more LUTs than flipflops, compared to the usually available ratio on FPGAs, we concentrate on the LUT usage for the circuit optimization comparisons.
By performing VHDL synthesis for all combinations of the optional pipeline stages, and directing the respective FPGA development toolchain to optimise for speed, the achievable performance as function of resource usage can be determined. It is shown in Fig. 5 and summarized in Table I. The VHDL code allows the minimum period of the clock to be below 10 ns (i.e. 100 MHz) even on 10-year old FPGAs, and easily be configured to reach below 5 ns with additional pipeline stages. On more modern FPGAs, going below 3 ns seems rather easy.
2) Barrel shifter vs. multiplier units: The single most expensive component of the circuit is the barrel shifter, which aligns the encoded data at the next position in the output word. For n = 16, the shifter input is 22 bits wide, with the additional 6 bits coming from the potentially long encoded header. The shift amount is in the range 0 to 37, inclusive. 0 to 31 depends on how many bits already are used in the output word. The additional 0, 4, or 6 positions depend on the header (long, short, or none). This gives a 60 bit output.
A barrel shifter on FPGAs is normally realised as one multiplexer for each output bit (sharing some parts of the first stage selectors of each multiplexer). Since it also can be expressed as a multiplication of the input value with 2 i , where i is the shift value, it can also be implemented using multiplier units in FPGAs. For the second factor, the input value is generated as 2 i , i.e. a one-hot encoding of the shift amount.
One could imagine this to be beneficial when generic LUT resources are scarce, however for the cases tested, it is not. The generation of the 2 i input value is rather expensive, as it requires 2 i individual selectors. Also the combination of the output values from the several multiplier units, often 9 × 9 or 18 × 18 wide, are rather expensive.
The resource usage for 22-bit, 38-position left-shifters implemented in the two ways are also compared in Fig. 5.

V. VHDL MODULE INTERFACE
The interface to the VHDL compression module is a single entity, with input and output signals as seen at the top and bottom of Fig. 1. Optional pipeline stages are configured using a generic map.
The circuit inputs are: • clk: clock signal; • reset: reset signal, given for at least as many cycles as the pipeline has stages; • input: n-bit data value to compress; • dv_in: data valid signal: set to '1' every clock cycle an input value is provided; • flush: flush signal: set to '1', after the last input value has been given. Held until done reported back. This forces the last output word to be emitted, especially when it is not fully occupied.
The output signals are: • output: 32-bit output data word; • dv_out: data valid signal: '1' every time the output word is filled, signaling the presence of a completed data word to be stored;  Table II shows the single-threaded decompression times per sample for the actual traces presented later in Table III. With 16-bit data samples, and a decompression time on modern hardware smaller than 2.5 ns/sample, the decompression rate is larger than 800 MB/s, i.e. well comparable to solid state drives (SSDs).
• done: informs that the last input value has been processed and the final output word was produced (possibly in the current cycle).

VI. DECOMPRESSION
The decompression is performed by one C function with the following parameters: • compr: pointer to the 32-bit words of the compressed input buffer; • ncompr: number of elements in the input buffer; • output: pointer to a buffer of 16-bit items for the decompressed values; • ndata: number of original/decompressed values; • bits: number of bits of each value that was stored (n).
This must be the same as the number configured during compression. On success, 0 is returned, otherwise a non-zero value.
The routine will report decompression failure on malformed compressed data, e.g. if there are non-zero bits left in the input buffer, or when entire words have not been used. The decompression routine will not read items beyond the end of the source buffer even if it runs out of data, e.g. due to a corrupted data stream. Table II shows the typical performance, which only has a small dependence on the actual data values.

VII. COMPRESSION EFFICIENCY-STORAGE COST
The contributions to the compressed data size can be divided in two parts: 1) The cost of storing traces with no pulses, i.e. only containing the digitization noise. This is described as a cost per sample.
2) The cost of storing a pulse, described as an additional cost for the entire pulse. There is a natural interplay between the two, as the noise affects the additional cost to store a pulse. This effect is also addressed below.
In the following, we use the variables c for cost and b for bits. To specify these, subscripts are used: N for noise, T for trace, S for sample, P for pulse, and B for a small pulse (bump). Gaussian noise is described by its amplitude σ N . The amplitude and width (std. dev.) of Gaussian-shaped pulses are given by A P and w P . Cost, Gaussian noise Uniform noise Model (uniform) Linear relation Figure 6. Average cost per sample depending on the number of noisy bits. For uniform noise this number represents the span of the random value distribution. For Gaussian noise instead the standard deviation of the distribution expresses the number of noisy bits. The minimum cost per sample is 1.5 bits, given by one bit per sample and a short header of two bits every group of four samples. For uniform noise, the range of differences is twice as large as the value distribution (due to also encoding negative entries), explaining the use of, on average, one more bit per sample in addition to the short header and b N . This is modeled by (3) (solid line). For Gaussian noise, the distribution of differences between consequtive samples is wider by a factor ∼ √ 2 than the original distribition, and any large value in a group of four leads to longer encodings. The further small fractional costs per sample are in both cases likely given by the occasional use of long group headers.

A. Bare trace cost
The cost of storing a trace without pulses has two parts: the size of the headers and the size of the encoded values, i.e. the differences.
The cost of storing the differences depends on the noise content, most easily expressed as the number of bits of noise b N = log 2 σ N .
Ignoring the pecularities of the first group, which may require a long header encoding, the estimated cost for a trace c T will be proportional to its length n T : The first sample has a fixed cost n. The constant 15.5 accounts for the average number of unused bits in the last output word at the end of a trace. A first approximation, denoted by the tilde, for the average cost per noise sample is The first half bit comes from the short group header, using two bits every four samples. The additional one comes from differences encoding both positive and negative entries, i.e. effectively a sign bit. The term o is an overhead, since the grouping of values causes some more bits than necessary to be used. To model the transition from very small noise levels to the proportional regime where the cost is 1.5 bits/sample, This is illustrated in Fig. 6.

B. Pulse cost
The cost of a pulse is best described as the total cost of the pulse, and not a cost in bits per sample. For the following discussion, pulses are assumed to have a Gaussian shape. Detector pulses can be considered as composed of two parts with different time constants (i.e. widths) for the rising and falling parts. Even if this may be a rather rough approximation of real pulses, especially for the trailing part, it is practical.
Since it is differences that are stored, the important parameter is not the amplitude A P of a pulse, but its steepest slope, which scales as A P w P . As a first approximation, denoted by the tilde, the cost is proportional to the number of bits needed to store these differences, as well as the width of the pulse, The scale is given by the proportionality constant a. It turns out that this formula works rather well, if modified to account for the facts that even for small pulses, costs are not negative (by adding 1 inside the logarithm and control parameter b), and that very narrow pulses still will affect the storage size of at least one entire group (d within the square root): The modification is thus adjusted by the control parameters b and d.

C. Pulse-noise interaction
The above description (5) works in the limit where the pulse is large compared to the background noise. When this is not the case, the additional cost of storing the pulse will be smaller, since the pulse-associated part of the differences to some extent will be covered by the noise storage cost. This can be modeled by The correction is the cost of storing the noise for a stretch of samples proportional to the pulse width: q is a proportionality constant.

D. Storage cost verification-synthetic traces
The storage cost described above and culminating in (6) has been verified by simulating a large number of traces with Gaussian pulses, where the parameters A P , w P , and σ N were varied. A global fit suggests the following values for the control parameters: a = 5.6, b = 1.3, q = 33, d = 2.6 and f = 7.7, with a parameter uncertainty of up to 10 %. Fig. 7 shows the σ N = 2.0 case.
Simulations were performed by building, for each set of parameters, a set of 15 × 10 6 traces, each made of 500 samples, with Gaussian noise σ N . In each, a Gaussian pulse (A P , w P ) was added to the trace. To average over discretisation effects, both the (noise) baseline and the center of the pulse were randomised, trace by trace, with fractional offsets. Although Fig. 7 shows a good agreement between Equation (6) and the data, larger differences emerge for small values of A P and w P . These correspond to the limits handled by the modifications between (4) and (5), which are thus seen to only partly address these edge effects. Table III shows the compression efficiencies for some different collections of actual data. They are compared to the common gzip [21] and xz [22] generic compression routines (at their normal setting). For the generic routines, all data of each file was stored in a binary file with 16-bit values. For a fair comparison, the overhead size of storing an empty compressed file was subtracted. In general, the DPTC results are quite similar to the LZMA results, and well below the gzip results. The main exception are the LaBr 3 collections marked a , where the data is very flat (virtually no noise) except for the pulses. Here the DPTC routine still uses its minimum of at least 1.5 bits/sample. This effect is also seen for the three synthetic traces marked b , which have constant values.

E. Storage cost verification-actual traces
Note how close the costs per sample are to the expectations for only storing the respective noise content, showing that the storage cost contributions from pulses are negligible.

F. Caveat emptor-how to ignore ADC noise
In case the original data contains an excessive number of least-significant bits with noise that shall not be stored, they must be shifted out of the original data before the values are given to the compression routines. Just masking them out will not improve the compression efficiency, as the compression routine is looking for the most significant bit of the differences that need to be stored. On the other hand, using a compressor with n larger than necessary causes little extra cost. Few, if any, extra bits will be used; since mainly k will potentially be affected, see Fig. 2.

VIII. CONCLUSION
A lossless compression routine which addresses both the transmission bandwidth and storage cost challenges associated with recording flash-ADC traces has been presented. The routine can be directly integrated in front-end electronics and handles data streams on-the-fly at rates above 300 Msamples/s in the controlling FPGA. Calculation of the differences between consecutive trace samples concentrated the most frequently occuring values around zero. The compression was concluded by storing the values in groups of four, yielding a simple yet effective variable-length code, by only storing the necessary least-significant bits, in a stream of 32-bit words.
A model for the storage cost was developed, by first considering the influence of the group headers as well as the retained ADC noise. The additional cost of storing a pulse was expressed in terms of its amplitude and width. By compressing a large set of artificial traces with varying characteristics, both the free parameters and the validity of the model were determined.
The method was then applied to actual data from different kinds of detectors. The compression efficiency was found to be comparable to popular general-purpose compression methods (gzip and xz). It was shown that the dominating cost of storing actual traces is generally given by the retained ADC noise, and not the pulses. It is therefore important for users to carefully assess how many least-significant bits shall be kept, in case they are noisy. Except for that, there are no parameters that need to be adapted, which is of particular interest for experiments employing hundreds or thousands of detector channels.
Computer code for the FPGA implementation in VHDL and for the CPU decompression routine in C are available for download [23] as open source software. The VHDL code can be synthesized to do one-the-fly compression above 300 Msamples/s and the decompression routine processes more than 400 Msamples/s on modern CPU cores. In Table III, the noise content of each trace collection is characterised by σ N , which is calculated from the distribution of the differences between each sample, and the average of its four closest neighbours on each side. The pulse amplitudes A P are represented as a geometric average of the difference between the largest and smallest sample value in each trace. This slightly overestimates the pulse amplitudes, due to the noise broadening, but since A P σ N , it is still clear that the traces contain pulses. The expected costs for storing the noise, as suggested by Fig. 6, are calculated as c S = log 2 σ N + 2.95. The LaBr 3 collections marked a are very flat (virtually no noise) except for the pulses, causing the DPTC costs to be dominated by its minimum of at least 1.5 bits/sample. This also applies to the synthetic constant-value traces marked b .