A Rolling Hash Algorithm and the Implementation to LZ4 Data Compression

LZ77 is a dictionary compression algorithm that replaces a repeating sequence with the address of the previously referenced data in the stream. To find these repetitions, the LZ77 encoder maintains a hash table, which requires hash values to be calculated frequently during the encoding process. In this paper, we present a class of rolling hash functions that can calculate multiple hash values via a single carry-less multiplication instruction. The proposed hash function is then implemented in LZ4, which is a derivative of LZ77. The simulation shows that the encoding throughput of LZ4 improves by 15.7% on average, and the compression ratio changes by at most ±1% in most cases.


I. INTRODUCTION
Data compression is the process of reducing the storage space of data, and it is currently used in various aspects of software engineering. There are two major categories of compression algorithms, termed lossy and lossless [1]. A lossy compression algorithm reduces the size of a multimedia file, such as video, voice, or image, by removing small details that require a large amount of space [2]. Thus, the original file cannot be restored exactly, because the removed data is lost. In contrast, lossless compression is used in cases where the information must be completely restored [3]. Lossless data compression is used for text files, executable files, and source code.
LZ77 (Lempel-Ziv-1977) [4] is a class of lossless compression algorithms. LZ77 is a very simple adaptive dictionary-based technique, which does not require prior knowledge of the statistical characteristics of the source [5]. Many variants of LZ77 have been proposed, such as the LZ-Markov chain algorithm (LZMA) [6], LZ4 [7], LZB [8], LZP [9], and LZSS [10]. Although their implementations differ slightly, the objective of all these algorithms is to find the repeating sequences, which is usually achieved with hash functions and hash tables.
A hash function is a function mapping a block of data to a fixed-size code, called the hash value. Hash functions can be used to detect repeating records in a large file. Nowadays, hash functions are widely used in many applications, such as secure encryption, data deduplication, Bloom filters, and load balancing.
The associate editor coordinating the review of this manuscript and approving it for publication was Jun Wang.
A good hash function satisfies two fundamental properties, termed simple calculation and uniform distribution. Simple calculation means that the computing time of the hash function should be less than that of other search and keyword comparison algorithms. Uniform distribution means that the hash values are evenly distributed, so that collisions are few.
When the encoder contains a hash function, these two properties each affect the compression. The former property affects the compression speed, and the latter affects the compression ratio, which will be discussed in detail later.
A variety of fast hash functions have been proposed for different requirements and applications. A rolling hash is used to prepare the calculation of piecewise hashes; it works by computing the hash values of all substrings of a fixed length in a normalized string representation of its input [11]. An obvious application of the rolling hash is the Rabin-Karp string search algorithm [12], which is a substring searching algorithm. In data compression, the LZ77 family uses a rolling hash to find the repeating sequences in the data stream. The Bentley-McIlroy algorithm [13] uses a rolling hash to detect long repetitions that may occur far apart in the input text.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
In this paper, a rolling hash function for LZ4 is presented. However, the proposed hash function can also be applied to other variants of LZ77. In the proposed hash function, the input is treated as a binary polynomial s(x), which is multiplied by a constant polynomial a(x). The hash value is then defined as the product with its high-degree and low-degree parts removed. An important property is that this hash function can obtain multiple hash values from a single multiplication by reading a longer input sequence. In contrast, with other hash functions, each hash value must be calculated separately. The contributions of this paper are enumerated as follows.
1) A hash function for rolling hash is proposed. In particular, the proposed hash function can calculate multiple hash values using a single carry-less multiplication instruction.
2) The proposed hash function is implemented in the LZ4 library. The simulation shows that the encoding speed improves by 15.7% on average, while the compression ratio is essentially unchanged.
The rest of this paper is organized as follows. In Section II, we review the rolling hash algorithms, the compression algorithms of the LZ77 family, the hash functions used in LZ4, and the SIMD instruction set. Section III presents the proposed hash function. In Section IV, we show the details of implementing the proposed hash function in the LZ4 library. Section V gives the simulation of the proposed LZ4 algorithm. Section VI discusses a number of issues related to the proposed hash function. Section VII concludes this work.

A. ROLLING HASH
A rolling hash sequentially calculates hash values, each of which depends only on the substring in the sliding window. A number of rolling hash functions have been proposed. These algorithms maintain a state: each byte is added to the state as it is processed and removed from the state after a set number of other bytes have been processed [14].
The rolling hash function used in the Rabin-Karp string search algorithm is defined as

$$H(c_1, c_2, \ldots, c_k) = c_1 a^{k-1} + c_2 a^{k-2} + \cdots + c_k a^{0}, \tag{1}$$

where $a$ is a prime number, and the input integers $c_1, \ldots, c_k$ are the characters in the sliding window. Characters can be interpreted as integers through a coding system (e.g. ASCII, Unicode). The next hash value can be calculated via

$$H(c_2, c_3, \ldots, c_{k+1}) = a \cdot H_{old} - c_1 a^{k} + c_{k+1}, \tag{2}$$

where $H_{old} = H(c_1, c_2, \ldots, c_k)$. Thus, the next hash value can be calculated with O(1) operations by utilizing the previous hash value. This is the major difference from a conventional hash function, which requires O(k) operations to calculate each hash value independently.
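The update rule can be illustrated with a minimal Python sketch. The function names, the base `a = 257`, and the reduction modulo `2**61 - 1` are assumptions made here for illustration (the modulus only keeps the integers bounded); they are not taken from the Rabin-Karp reference.

```python
def poly_hash(window, a, mod):
    """Direct O(k) evaluation of c_1*a^(k-1) + ... + c_k*a^0, reduced modulo mod."""
    h = 0
    for c in window:
        h = (h * a + c) % mod
    return h


def rolling_hashes(data, k, a=257, mod=(1 << 61) - 1):
    """Yield the hash of every length-k window of data, using O(1) work per step."""
    if len(data) < k:
        return
    top = pow(a, k, mod)              # a^k, cancels the character leaving the window
    h = poly_hash(data[:k], a, mod)   # only the first window costs O(k)
    yield h
    for i in range(k, len(data)):
        # H_new = a * H_old - c_out * a^k + c_in  (two multiplications, two additions)
        h = (h * a - data[i - k] * top + data[i]) % mod
        yield h


if __name__ == "__main__":
    text = b"abracadabra"
    hashes = list(rolling_hashes(text, 4))
    # Equal windows give equal hashes: "abra" occurs at offsets 0 and 7.
    print(hashes[0] == hashes[7])
```

Each step reuses the previous value, so hashing all windows of an n-byte input costs O(n) rather than O(nk).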
B. LZ77 FAMILY
LZ77 algorithms achieve compression by replacing a repeating sequence with the address of the previously referenced data in the stream. The algorithm searches the sliding window for the longest repetition of the sequence currently being processed. When a repetition is detected, it is encoded as a pair of integers (o, l), where o is the offset and l is the length of the repetition. If there is no repetition, the data is encoded as literals. By design, LZ77 achieves a better compression ratio on sequential data that contains many repetitions.
The process of finding repetitions requires sequence comparisons. The brute-force way of comparing sequences is to compare the letters of the two sequences, which has time complexity $O(\min(n_1, n_2))$, where $n_1$ and $n_2$ are the lengths of the two sequences. To accelerate the search, the LZ77 implementation maintains a hash table to find the repetitions. Precisely, the index of each sequence is saved in the hash table, so locating a candidate repetition requires O(1) operations: one hash computation and one table lookup.
In the hash table, different keywords may map to the same hash address. In this case, the LZ77 implementation directly replaces the old entry with the new entry. As a result, the encoder may fail to find the longest repetition. In addition, if the uniformity of the hash function is poor, most entries are concentrated in a few buckets, and collisions occur easily. This leaves many repetitions undetected, and the compression ratio tends to be unsatisfactory.
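The role of the hash table and of the overwrite-on-collision policy can be sketched as follows. This is a simplified greedy parser written for illustration, not the LZ4 encoder: the token format and the names `hash4` and `find_matches` are hypothetical, and only the 4-byte key, the 13-bit index, and the multiplier mirror the LZ4 parameters quoted in this paper.

```python
MIN_MATCH = 4      # bytes hashed per table key
HASH_BITS = 13     # table index width


def hash4(data, i):
    """Multiply-shift hash of the 4 bytes at data[i:i+4], read as little-endian."""
    x = int.from_bytes(data[i:i + 4], "little")
    return ((x * 2654435761) & 0xFFFFFFFF) >> (32 - HASH_BITS)


def find_matches(data):
    """Greedy parse into ('lit', byte) and ('match', offset, length) tokens."""
    table = {}                         # hash value -> most recent position
    tokens = []
    i = 0
    while i < len(data):
        if i + MIN_MATCH > len(data):  # too close to the end to hash 4 bytes
            tokens.append(("lit", data[i]))
            i += 1
            continue
        h = hash4(data, i)
        cand = table.get(h)
        table[h] = i                   # the new entry overwrites the old one
        # A hit in the table is only a candidate: colliding keys must be rejected.
        if cand is not None and data[cand:cand + MIN_MATCH] == data[i:i + MIN_MATCH]:
            length = MIN_MATCH
            while i + length < len(data) and data[cand + length] == data[i + length]:
                length += 1
            tokens.append(("match", i - cand, length))
            i += length
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens


if __name__ == "__main__":
    print(find_matches(b"a" * 16))     # [('lit', 97), ('match', 1, 15)]
```

Because each bucket holds a single position, a collision silently evicts an older candidate, which is how a poorly distributed hash turns into missed repetitions.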

C. HASH FUNCTION IN LZ4
LZ4 is a byte-oriented compression scheme in the LZ77 family that focuses on compression and decompression speed [15], [16]. The hash function used in LZ4 is the multiply-shift hash [17], which is defined as

$$H(x) = \left\lfloor \frac{(a x) \bmod 2^{m}}{2^{m-n}} \right\rfloor, \tag{3}$$

where $x$ is the input $m$-bit integer and $a$ is a uniformly random odd $m$-bit integer. That is, the hash function (3) converts an $m$-bit integer to an $n$-bit integer. The standard LZ4 implementation chooses $(m, n) = (32, 13)$, and $a = 2654435761$ is a golden ratio prime. Figure 1 gives a graphical representation: when the product $ax$ is written in binary, the hash value $H(x)$ is the segment of $ax$ from bit $m - n$ to bit $m - 1$. When (3) is implemented in C with a 32-bit integer variable H, the code is given by

    H = (a * x) >> (m - n);    (4)

That is, discarding the overflow of the 32-bit integer H is the same as reducing modulo $2^{32}$, and the division by $2^{m-n}$ can be replaced with a right shift operation. A standard hash function is the multiply-mod-prime scheme [18], which is defined as

$$H(x) = ((a x + b) \bmod p) \bmod 2^{n}, \tag{5}$$

where $p \geq 2^{m}$ is a prime and $a \neq 0$ and $b$ are random integers smaller than $p$. The article [18] reports that the hash function (3) is many times faster than the standard method (5).
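As an illustration, the multiply-shift hash can be written in a few lines of Python. Since Python integers do not overflow, the 32-bit wraparound that a C variable provides implicitly is made explicit with a mask; the function name `multiply_shift` is a placeholder chosen here.

```python
M, N = 32, 13
A = 2654435761                      # odd 32-bit multiplier quoted for LZ4


def multiply_shift(x, a=A, m=M, n=N):
    """Return bits m-n .. m-1 of a*x, i.e. floor((a*x mod 2^m) / 2^(m-n))."""
    product = (a * x) & ((1 << m) - 1)   # emulates the overflow of an m-bit register
    return product >> (m - n)            # the right shift replaces the division


if __name__ == "__main__":
    h = multiply_shift(0x64636261)       # the bytes "abcd" read as little-endian
    print(0 <= h < 2 ** N)               # True: the result always fits in 13 bits
```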
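The polynomial ring $\mathbb{F}_2[x]$ consists of the polynomials

$$a(x) = \sum_{i=0}^{d_1} a_i x^i,$$

where $d_1$ is a non-negative integer and each $a_i \in \{0, 1\}$. The ring $\mathbb{F}_2[x]$ also defines two arithmetic operations, termed addition and multiplication, shown as follows.
Given two polynomials $a(x) = \sum_{i=0}^{d_1} a_i x^i$ and $b(x) = \sum_{i=0}^{d_2} b_i x^i$, the addition is defined as

$$a(x) + b(x) = \sum_{i=0}^{\max(d_1, d_2)} (a_i \oplus b_i) x^i,$$

where $\oplus$ is the exclusive OR (XOR) operation and missing coefficients are treated as zero. In addition, the multiplication is defined as

$$a(x) \cdot b(x) = \sum_{i=0}^{d_1 + d_2} r_i x^i. \tag{6}$$

Each coefficient $r_i$ is given by

$$r_i = \bigoplus_{j=\max(0,\, i-d_2)}^{\min(i,\, d_1)} a_j \wedge b_{i-j}, \tag{7}$$

where $\wedge$ is the AND operation, and $\bigoplus$ is the summation modulo 2.
In implementations, we usually prefer to use binary representations to identify the polynomials in $\mathbb{F}_2[x]$. A polynomial $a(x)$ can be represented as the integer $a = (a_{d_1} a_{d_1-1} \ldots a_0)_2 = \sum_{i=0}^{d_1} a_i 2^i$. In this case, the addition is written as $a \oplus b$, where $\oplus$ is the bitwise XOR operation. The multiplication is written as $a \otimes b$, where $\otimes$ is the carry-less multiplication.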
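A software version of the carry-less product makes the integer view concrete. The sketch below is a plain shift-and-XOR loop standing in for the hardware instruction; the name `clmul` is chosen here for illustration.

```python
def clmul(a, b):
    """Carry-less product of two non-negative integers (multiplication in F_2[x])."""
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift    # add a(x) * x^shift; addition in F_2[x] is XOR
        b >>= 1
        shift += 1
    return result


if __name__ == "__main__":
    # (x + 1) * (x + 1) = x^2 + 1 over F_2, because the two middle terms cancel.
    print(bin(clmul(0b11, 0b11)))   # 0b101
    # Ordinary integer multiplication would give 3 * 3 = 9 = 0b1001 instead.
```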

III. PROPOSED HASH FUNCTION

A. DEFINITION
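The proposed hash function can be seen as the binary polynomial version of (3) in $\mathbb{F}_2[x]$. Precisely, given an input polynomial $s(x) = \sum_{i=0}^{d_2} s_i x^i$ and a constant polynomial $a(x) = \sum_{i=0}^{d_1} a_i x^i$ with $d_1 \leq d_2$, the product $r(x) = a(x) \cdot s(x)$ is computed as in (6), and each coefficient of $r(x)$ is given by

$$r_i = \bigoplus_{j=\max(0,\, i-d_2)}^{\min(i,\, d_1)} a_j \wedge s_{i-j}.$$

The hash value is a subset of the coefficients of $r(x)$. A good hash function has the important property that when even a single bit of the input is altered, the hash value changes accordingly. Thus, we prefer that the chosen coefficients are related to all coefficients of $s(x)$. From (7), we have

$$r_i = \bigoplus_{j=0}^{d_1} a_j \wedge s_{i-j}$$

for $d_1 \leq i \leq d_2$, and together these coefficients involve every coefficient $s_0, \ldots, s_{d_2}$. Thus, the hash value is defined as

$$(r_{d_2}, r_{d_2-1}, \ldots, r_{d_1}). \tag{8}$$

From above, the proposed hash function is defined as follows.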
Definition 1: Given an input polynomial $s(x) \in \mathbb{F}_2[x]$ with degree $\deg(s(x)) < m$, the proposed hash function is defined as

$$H(s(x)) = \left\lfloor \frac{a(x)\, s(x) \bmod x^{m}}{x^{m-n}} \right\rfloor, \tag{9}$$

where $a(x) \in \mathbb{F}_2[x]$ is a constant polynomial of degree $m - n$.
In (9), the operation $\bmod\ x^{m}$ removes the terms $r_i x^i$ for $i \geq m$. In addition, the operation $/x^{m-n}$ removes the terms $r_i x^i$ for $i < m - n$. That is, the hash value consists of the coefficients of $x^{m-n}$ through $x^{m-1}$.
Equivalently, if the input and output are treated as integers, the hash function (9) converts an $m$-bit integer to an $n$-bit integer. That is, (9) can be written as

$$H(s) = \left\lfloor \frac{(a \otimes s) \bmod 2^{m}}{2^{m-n}} \right\rfloor, \tag{10}$$

where the integer $s < 2^{m}$ is the binary representation of $s(x)$, and the integer $a < 2^{m-n+1}$ is the binary representation of $a(x)$. For example, when $(m, n) = (5, 2)$, $s(x) = x^4 + x^3 + 1$ and $a(x) = x^3 + 1$, we have $r(x) = a(x) s(x) = x^7 + x^6 + x^4 + 1$ and $H(s(x)) = x$. If the example is expressed as integers, then $s = (11001)_2 = 25$, $a = (1001)_2 = 9$, and $r = a \otimes s = (11010001)_2 = 209$. The hash value is $H(s) = (10)_2 = 2$.
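The integer form of the definition can be checked with a short Python sketch. The names `clmul` and `proposed_hash` are placeholders chosen here, and the carry-less product is computed in software rather than with a hardware instruction.

```python
def clmul(a, b):
    """Carry-less product of two non-negative integers."""
    result, shift = 0, 0
    while b:
        if b & 1:
            result ^= a << shift
        b >>= 1
        shift += 1
    return result


def proposed_hash(s, a, m, n):
    """Bits m-n .. m-1 of the carry-less product a (x) s."""
    r = clmul(a, s)
    r &= (1 << m) - 1        # mod x^m: drop the terms of degree m and above
    return r >> (m - n)      # / x^(m-n): drop the terms below degree m-n


if __name__ == "__main__":
    # The (m, n) = (5, 2) example: s = (11001)_2 and a = (1001)_2.
    print(bin(clmul(0b1001, 0b11001)))           # 0b11010001
    print(proposed_hash(0b11001, 0b1001, 5, 2))  # 2
```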

B. MULTIPLE-HASH COMPUTATION
This subsection gives the scheme to calculate multiple hashes in a sliding window when the hash function is that of Definition 1. Before presenting the approach, we give a simple example as follows.
We consider $(m, n) = (3, 2)$ and $a(x) = x + 1$. The input sequence is denoted as $(s_3, s_2, s_1, s_0)$. The size of the sliding window is three. To begin with, we take three symbols in the sliding window, and these symbols form a polynomial $s_0(x) = s_3 x^2 + s_2 x + s_1$. The hash value is given by

$$H(s_0(x)) = \left\lfloor \frac{(x+1)\, s_0(x) \bmod x^{3}}{x} \right\rfloor = (s_3 \oplus s_2) x + (s_2 \oplus s_1). \tag{11}$$

Next, we move the sliding window one step, and obtain the polynomial $s_1(x) = s_2 x^2 + s_1 x + s_0$. The hash value is given by

$$H(s_1(x)) = \left\lfloor \frac{(x+1)\, s_1(x) \bmod x^{3}}{x} \right\rfloor = (s_2 \oplus s_1) x + (s_1 \oplus s_0). \tag{12}$$

On the other hand, we construct a polynomial $s(x) = s_3 x^3 + s_2 x^2 + s_1 x + s_0$ by using all four symbols. Then we calculate

$$r(x) = (x+1)\, s(x) = s_3 x^4 + (s_3 \oplus s_2) x^3 + (s_2 \oplus s_1) x^2 + (s_1 \oplus s_0) x + s_0. \tag{13}$$

From (11), (12) and (13), we have the following observations. First, there is an overlap $s_2 \oplus s_1$ between $H(s_0(x))$ and $H(s_1(x))$. Second, the coefficients of degrees 2 and 3 of $r(x)$ are the coefficients of $H(s_0(x))$, and the coefficients of degrees 1 and 2 of $r(x)$ are the coefficients of $H(s_1(x))$. Therefore, instead of calculating $H(s_0(x))$ and $H(s_1(x))$ individually, we can calculate $r(x)$ once and take the corresponding portion of $r(x)$ to obtain $H(s_0(x))$ and $H(s_1(x))$, respectively. The following gives a formal theorem.
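Theorem 1: Given an input polynomial $s(x)$ with degree $\deg(s(x)) < \ell$, a hash function is defined as

$$L_k(s(x)) = \left\lfloor \frac{a(x)\, s(x) \bmod x^{\ell-k}}{x^{\ell-k-n}} \right\rfloor, \tag{14}$$

where $0 \leq k \leq \ell - m$. Let

$$s_k(x) = \left\lfloor \frac{s(x) \bmod x^{\ell-k}}{x^{\ell-m-k}} \right\rfloor$$

denote the $m$ coefficients of $s(x)$ in the $k$-th sliding window, so that $s(x) = u(x)\, x^{\ell-k} + s_k(x)\, x^{\ell-m-k} + v(x)$ with $\deg(v(x)) < \ell - m - k$. Thus, we can get

$$a(x)\, s(x) = a(x) u(x)\, x^{\ell-k} + a(x) s_k(x)\, x^{\ell-m-k} + a(x) v(x).$$

The first term has no coefficients of degree below $\ell - k$, and the last term has degree below $\ell - k - n$ because $\deg(a(x)) = m - n$. Hence the coefficients of $a(x) s(x)$ of degrees $\ell - k - n$ through $\ell - k - 1$ are exactly the coefficients of $a(x) s_k(x)$ of degrees $m - n$ through $m - 1$. Thus, $L_k(s(x)) = H(s_k(x))$.
Assume that the sliding window moves one byte forward after each hash value is calculated. The product then contains several hash values according to the rule above. Algorithm 1 gives the detailed steps, where the number of hash values is $k = (\ell - m)/8 + 1$ and $\text{MASK} = 2^{n} - 1$. Figure 2 gives an example for $(m, n) = (32, 13)$ and $\ell = 64$. It shows that there is an overlap between any two adjacent hash values.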
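The claim that one product contains the hash of every window can be checked numerically. In the sketch below, windows are indexed by their bit offset `t` from the low end of the input, which is a convention chosen here for illustration; the function names are likewise placeholders.

```python
def clmul(a, b):
    """Carry-less product of two non-negative integers."""
    result, shift = 0, 0
    while b:
        if b & 1:
            result ^= a << shift
        b >>= 1
        shift += 1
    return result


def proposed_hash(s, a, m, n):
    """Hash of one m-bit window: bits m-n .. m-1 of the carry-less product."""
    return (clmul(a, s) & ((1 << m) - 1)) >> (m - n)


def hashes_from_one_product(s, a, m, n, ell, step=1):
    """Hashes of the m-bit windows of the ell-bit input s, from a single product."""
    r = clmul(a, s)                   # one multiplication for all windows
    mask = (1 << n) - 1
    # The window starting at bit t has its hash in bits t+m-n .. t+m-1 of r.
    return [(r >> (t + m - n)) & mask for t in range(0, ell - m + 1, step)]


if __name__ == "__main__":
    # The (m, n) = (3, 2), a(x) = x + 1 example with s = (1011)_2.
    s = 0b1011
    individually = [proposed_hash((s >> t) & 0b111, 0b11, 3, 2) for t in (0, 1)]
    print(hashes_from_one_product(s, 0b11, 3, 2, 4) == individually)   # True
```

With `step=8` and `(m, n, ell) = (32, 13, 64)` the same function returns the five byte-aligned hashes of a 64-bit word.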

IV. IMPLEMENTATION TO LZ4
This section gives the details of implementing the proposed hash function in LZ4. First, a naive implementation is presented. Then an algorithm is given to calculate multiple hash values in a batch.

A. NAIVE IMPLEMENTATION
In this subsection, we present the naive implementation of the hash function in Definition 1. Following the specification of LZ4, we use $(m, n) = (32, 13)$. That is, the input is an integer of $m = 32$ bits, and the constant $a$ is chosen as an integer of $m - n + 1 = 20$ bits. In this way, the product $a \otimes s$ has $2m - n = 51$ bits, and the middle $n = 13$ bits form the hash value. When the product $a \otimes s$ is stored in a 32-bit integer variable, we do not need to perform the operation $\bmod\ x^{m}$ in (9), because the coefficients of degree 32 and higher overflow and are discarded. The operation $/x^{m-n}$ can be implemented by the right shift operation >>. Algorithm 2 gives the detailed steps.
Algorithm 2 needs to perform the carry-less multiplication. Fortunately, carry-less multiplication is implemented in certain SIMD instruction sets. For example, the intrinsic in ARMv8 is vmull_p64(), and the intrinsic in x86_64 is _mm_clmulepi64_si128(). In addition, when $a = 2^{m-n} + 1$, the carry-less product $a \otimes s$ can be implemented with a left shift operation and a bitwise XOR operation. That is, Line 1 of Algorithm 2 can be replaced with

$$ret \leftarrow (x \ll (m - n)) \oplus x, \tag{15}$$

which is usually faster than a carry-less multiplication on modern processors. Although the choice $a = 2^{m-n} + 1$ may increase the probability of hash collisions, the simulation shows that the reduction of the compression ratio is limited.

Algorithm 2 Naive implementation to LZ4 on 32-bit CPU
Input: A 32-bit integer x
Output: A 13-bit hash value
1: ret ← a ⊗ x
2: ret ← ret >> 19
3: return ret

B. BATCH PROCESSING
Based on the approach in Section III-B, this subsection presents the implementation that calculates five hashes at once in LZ4. The implementation uses $(m, n) = (32, 13)$. That is, the encoder reads an integer of $\ell = 64$ bits, which is then multiplied by the constant $a$ of $m - n + 1 = 20$ bits. Therefore, the product $a \otimes s$ has $\ell + m - n = 83$ bits, and we sequentially take five hash values from the product $a \otimes s$. Algorithm 3 gives the detailed steps, and Figure 2 gives the graphical representation.
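The batch computation can be sketched in Python as follows. This is an illustrative model, not the code of Algorithm 3: the function names are hypothetical, and the input bytes are assumed to be read in little-endian order so that the window at byte offset j occupies bits 8j to 8j+31 of the 64-bit word.

```python
MASK = (1 << 13) - 1


def clmul(a, b):
    """Carry-less product of two non-negative integers."""
    result, shift = 0, 0
    while b:
        if b & 1:
            result ^= a << shift
        b >>= 1
        shift += 1
    return result


def batch_hashes(word, a):
    """Five 13-bit hashes of the 4-byte windows at byte offsets 0..4 of a 64-bit word."""
    r = clmul(a, word)                       # up to 83 bits for a 20-bit constant
    return [(r >> (19 + 8 * j)) & MASK for j in range(5)]


def batch_hashes_shift_xor(word):
    """Same result as batch_hashes(word, 2**19 + 1), with one shift and one XOR."""
    r = (word << 19) ^ word
    return [(r >> (19 + 8 * j)) & MASK for j in range(5)]


if __name__ == "__main__":
    data = b"rolling hash"
    word = int.from_bytes(data[:8], "little")
    # Hashing each 4-byte window on its own gives the same five values.
    one_by_one = [
        (clmul(2**19 + 1, int.from_bytes(data[j:j + 4], "little")) & 0xFFFFFFFF) >> 19
        for j in range(5)
    ]
    print(batch_hashes_shift_xor(word) == one_by_one)   # True
```

After one such call the encoder can serve the next four positions from the returned list, provided none of them is skipped because of a detected repetition.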

V. EXPERIMENTS
As the proposed hash function can only be applied to encoders, we only test the encoding performance in the experiments. LZ4 v1.9.1 in its default mode (level 1) is chosen as the test program. We implement Algorithm 2 and Algorithm 3 in C, and the hash functions used in LZ4 are replaced with the proposed functions. All programs are compiled by GCC v7.4.0 with the optimization level -O3. All experiments are performed with a single thread on platforms with ARMv8 architecture processors and x86_64 architecture processors, respectively. Table 1 tabulates the configurations of the platforms.
The data sets used in the experiments are chosen from the Calgary corpus [19] and the Canterbury corpus [20]. The first experiment shows the compression ratios of LZ4 with various hash functions, where the column Con. is the compression ratio of the conventional LZ4. Two constants are tested, namely $a_0 = 2^{19} + 2^{6} + 2^{2} + 2^{1} + 1$ and $a_1 = 2^{19} + 1$. Table 2 lists the compression ratios for 13 input files. As shown in Table 2, the compression ratios of the proposed hash functions are similar to that of the conventional LZ4. For the hash function with $a_0$, the compression ratio of the proposed approach is about 0.0023% worse than that of the conventional hash function on average. For the hash function with $a_1$, the compression ratio of the proposed approach is about 0.554% worse than that of the conventional hash function on average. This shows that the uniformity of the proposed hash function is close to that of the hash function in LZ4.
In the second experiment, we consider the throughput of the encoders. The throughput is defined as the amount of data read per second (MB/s). Table 3 lists the throughput of the programs on the platforms with ARMv8 processors and x86_64 processors, respectively. In addition, we list the ratio between the batch method and the conventional LZ4, which is defined as

$$\text{Ratio} = \frac{\text{Throughput}_{\text{batch}} - \text{Throughput}_{\text{conventional}}}{\text{Throughput}_{\text{conventional}}} \times 100\%.$$

Although carry-less multiplication is supported by the SIMD instruction set, the throughput of our naive implementation is around 77.65% of that of the conventional LZ4 on x86_64 processors, and around 76.36% on ARMv8 processors. This is because the carry-less multiplication instruction takes more cycles than the integer multiplication on modern CPUs.
For our batch implementation, the proposed algorithm usually has higher throughput than the conventional LZ4 on x86_64 processors, with an average improvement of about Ratio_avg(x86_64) = 12.44%. Further, the proposed algorithm always has higher throughput than the conventional LZ4 on ARMv8 processors, with an average improvement of about Ratio_avg(ARMv8) = 18.89%. Table 3 thus shows that the proposed batch implementation can improve the encoding throughput. In Algorithm 3, the encoder calculates five hash values via (15), which requires a left shift operation and a bitwise XOR operation. Then, in the next four hashing rounds, the encoder does not need to call the hash function again, because the hash value is among the previously obtained values. In contrast, the conventional hash function (4) in LZ4 requires an integer multiplication, which takes more cycles than the bitwise operations used in the proposed implementation. Thus, the proposed hash function is faster than the conventional hash function (4) in LZ4.

A. PERFORMANCE OF THE PROPOSED HASH FUNCTION
As shown in Table 3, the improvement Ratio varies across the input files. One reason is that when the LZ4 encoder finds a repetition in the sequence, it skips most of the hash computations for the repeating sequence. Thus, when the input has many repeating sequences, most hash values calculated by Algorithm 3 may be discarded (except for the first one). In this case, the throughput of the proposed implementation is close to that of the conventional LZ4.

B. DRAWBACKS OF THE CONVENTIONAL ROLLING HASH
In this paper, we do not adopt the conventional rolling hash method (see Section II-A). The major reason is that calculating (2) requires two multiplications and two additions. In contrast, the hash function (4) used in LZ4 only requires one multiplication and one right shift operation. Thus, we conclude that calculating (2) is slower than calculating (4) on modern processors.

VII. CONCLUSION
In this paper, we present a new hash function that is suitable for applications that need to calculate hash values in a sliding window. In particular, the proposed hash function can calculate multiple hash values with a single carry-less multiplication instruction. The proposed hash function is implemented in the LZ4 compression algorithm. The experiments show that the proposed hash function achieves a compression ratio similar to that of the conventional hash function in LZ4. Thus, it meets the requirements of practical applications. Moreover, the proposed algorithm generally improves the encoding throughput on x86_64 and ARMv8 processors.