High-Throughput Variable-to-Fixed Entropy Codec Using Selective, Stochastic Code Forests

Efficient high-throughput (HT) compression algorithms are paramount to meet the stringent constraints of present and upcoming data storage, processing, and transmission systems, where latency, bandwidth, and energy requirements are critical. Most HT codecs are designed to maximize compression speed and, secondarily, to minimize compressed lengths. However, decompression speed is often equally or more critical than compression speed, especially in scenarios where decompression is performed multiple times and/or at critical parts of a system. In this work, an algorithm is proposed to design variable-to-fixed (VF) codes that prioritize decompression speed. Stationary Markov analysis is employed to generate multiple, jointly optimized codes (denoted code forests). Their average compression efficiency is on par with the state of the art in VF codes, e.g., within 1% of Yamamoto et al.'s algorithm. The proposed code forest structure enables the implementation of highly efficient codecs, with decompression speeds 3.8 times faster than other state-of-the-art HT entropy codecs with equal or better compression ratios for natural data sources. Compared to these HT codecs, the proposed forests yield similar compression efficiency and coding speeds.


I. INTRODUCTION
High-throughput (HT) data compression is widely employed to improve performance in many systems with strong time constraints. In communications applications, compression improves effective channel capacity [1], [2], and HT is essential to avoid undesired transmission delays. Large-scale data processing applications benefit from HT compression, e.g., to reduce network, disk and memory usage in computer clusters and distributed (including cloud) storage systems [3]–[10].
HT codecs produce compact representations of input data, prioritizing the rate at which samples are processed over the achieved compression ratios, i.e., the quotient between the original and compressed data sizes. Even though HT can be attained via massive parallelization [11]–[14], low-complexity compression pipelines are often employed for this purpose [15]–[25]. Lower complexity typically entails lower power consumption, and therefore HT codecs are also suitable in scenarios with energy consumption and delay constraints. These include remote sensing and earth observation onboard satellites [26], [27], real-time transmission of point clouds for search and rescue [28], cooperative robot coordination [29], and Internet of Things (IoT) scenarios [30]–[32]. Many other applications can benefit from time-efficient and energy-efficient data compression, in particular when involving interactive communication and mobile devices. The use of compression in hypertext transfer protocol data [33], [34] and in wireless sensor networks [35] are two notable examples thereof.
In many of the applications described above, decompression performance (in terms of time and energy consumption) is as important as, or more important than, compression performance. Decompression time is critical to minimize delay in high-performance network contexts [1], [2], [36], in particular when headers need to be obtained for routing and other purposes in intermediate nodes [1], [32], [35], and to allow fast access to data in cloud and distributed file systems [5], [37]–[39]. In the context of high-performance computing, it is paramount to quickly retrieve compressed data [3], [4], [9], especially in data-driven applications [6], [36], [40] using distributed databases and compressed columns. Furthermore, minimizing the energy spent at the decoder can extend the availability of battery-powered devices, e.g., for mobile networks and IoT [32], [35], robot coordination in rescue situations [28] and mobile web browsing [34], to name a few. Power efficiency in the decoding process also has a positive effect on the carbon footprint of data centers where it is intensively applied [1]–[5], [9], [37]–[39], notably when the same data are decoded at different points in time [6], [40].

Variable-to-fixed (VF) coding produces fixed-length words, each representing multiple input symbols [41]–[53]. In this work, we present an HT compressor-decompressor pair based on VF codes, designed to provide significantly higher decoding speed while yielding compression ratios and coding speeds comparable to state-of-the-art HT entropy codecs. A method is first proposed to create dictionaries (equivalent to code trees) stochastically optimized for any given symbol probability distribution. An extension using code forests allows adding a parametrizable number of coding contexts, enabling configurable time-compression ratio trade-offs. Further improvements are proposed based on selectively dedicating more effort to the more compressible parts of the input data. This enables the design of more efficient codes and minimizes the impact of low-probability symbols and of incompressible noise. The proposed codec extends upon our previous conference works [21], [54] with theoretical contributions, exhaustive experimental results to assess their performance, and a standalone code container released with digital object identifier (DOI) 10.24433/CO.2752092.v2 (also available at https://codeocean.com/capsule/9466901/) as supplementary materials to enable full experimental reproducibility.
The remainder of this paper is structured as follows: Section II discusses the closest related works in the state of the art. Section III provides definitions and a concise review of the VF coding literature. The proposed HT codec is described in Section IV, and exhaustive experimental evaluation is presented in Section V. Section VI concludes this work.

II. RELATED WORK
Some of the best-performing HT algorithms in the state of the art are based on a Lempel-Ziv (LZ) [55] decomposition and/or a fast entropy coding stage. LZ decomposition removes data redundancy by expressing the input as a series of position-length references to a dynamically built dictionary [55]. These references can be directly output, e.g., as in LZ4 [15] and Google's Snappy [16]; compressed with fast bit mangling techniques [17]; encoded with a fast version of Huffman's entropy coder, e.g., as in LZO [18] and Gipfeli [19]; or coded with a combination of fast Huffman and asymmetric numeral system (ANS) [56] algorithms, for example as in Zstandard [20]. Random data sources, e.g., with Laplacian distributions, appear naturally when dealing with image, audio and other sensor signals when using differential pulse-code modulation (DPCM) and other common prediction and/or transform-based methods. For these and other random sources, LZ decomposition does not yield good results, and especially low efficiency has been reported for medium and low source entropy rates [21]. Competitive codecs have been proposed that obviate any LZ decomposition and rely on a fast entropy codec. A modified version of Huffman coding is used in Huff0 [22]; Finite State Entropy (FSE) [23] provides a fast implementation of ANS; and Rice codes are used in fast JPEG-LS implementations such as CharLS [24]. FAPEC [25], a recent compression algorithm designed for space missions, includes an HT entropy codec that is adaptively applied to blocks of prediction errors.
In all the cases above, HT compression is attained by repeatedly coding one [22]–[25] or several [18]–[20], [56] input symbols into variable-length words. Even though these compression schemes enable fast encoding and reasonable compression ratios, one-pass decoders cannot know a priori the boundaries between coded words [15]–[25]. Therefore, they must continually execute conditional branches that hinder efficient pipelining of the decoding process, thus limiting the maximum decoding speed and increasing energy consumption.
By construction, VF codes can be decoded very efficiently without requiring conditional branches at the most critical parts of the decoding loop. Tunstall proposed an algorithm that produces codes optimal within a subclass of all possible VF codes called prefix-free codes [41]. Since then, more general classes of VF codes have been described that improve upon Tunstall's. Savari introduced the concept of plurally parsable dictionaries and proposed a generation algorithm for the binary case [44]. Yamamoto and Yokoo described the class of plurally parsable codes called almost-instantaneous variable-to-fixed (AIVF) codes, which can also be employed in the non-binary case [47]. Some improvements have been proposed to [47], reporting small gains in terms of compressed sizes [49], [51] and memory efficiency [52]. This work contributes a new method for constructing VF codes that provides an advantageous trade-off between compression efficiency and computational complexity, specifically for the decoder. It should be highlighted that other variable-to-fixed codecs not discussed in this work have been proposed in the literature, e.g., [42], [43], [45], [46], [48], [52]. However, these are based on arithmetic coding as opposed to dictionary-based coding, which enables higher (better) compression ratios but at a prohibitive complexity cost in the scope of HT codecs.

III. BACKGROUND

A. DEFINITIONS AND NOTATION

1) SOURCES, SEQUENCES AND CODECS
Let ℵ be a discrete source that produces symbols from a finite alphabet A = {a_1, . . . , a_{|A|}}. When the probability of emission of the symbols is known, A also denotes a sequence sorted in decreasing order of probability. Let A⁺ be the set of all finite, nonempty sequences of elements of A. Elements W ∈ A⁺ are referred to as words and denoted as W = w_1 w_2 · · · w_{|W|}, where w_i ∈ A denotes the i-th symbol in the word. Let B⁺ be the set of all finite, nonempty binary sequences. A lossless entropy codec can be defined as a pair of coder and decoder functions E : A⁺ → B⁺ and D : B⁺ → A⁺ such that D(E(A)) = A for every A ∈ A⁺. Hereafter, lossless entropy codecs as per this definition are considered exclusively.

2) VARIABLE-TO-FIXED DICTIONARIES
Variable-to-fixed coders split A ∈ A⁺ into a finite sequence of words whose concatenation yields A. Each word is then represented by a binary sequence of fixed length l > 0, and E(A) ∈ B⁺ is defined as the concatenation of all binary sequences. Each l-bit sequence can be considered an index to a dictionary W that contains at most 2^l words and unambiguously describes its associated variable-to-fixed code.
VF decoders can be implemented very efficiently by iteratively reading l bits from E(A) and outputting the corresponding word's symbols. Dictionaries associated with VF codes can be defined a priori, or dynamically constructed as input samples are processed, e.g., with the LZ algorithm [55]. A dictionary W fully defines a codec under the criterion of always using the longest possible word. This criterion is general in the literature and is embraced in this work as well. In what follows, only complete dictionaries are considered, i.e., for any valid input sequence A ∈ A⁺, there exists at least one concatenation of words in W such that A is a prefix of that concatenation.
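The core of such a decoder can be sketched as follows (an illustrative Python fragment, not the codec's actual implementation; truncation of the final flushed word is omitted):

    def vf_decode(payload: bytes, l: int, words: list[str]) -> str:
        """Decode a concatenation of fixed-length l-bit indices by table lookup."""
        bits = int.from_bytes(payload, "big")
        total_bits = 8 * len(payload)
        out = []
        for k in range(total_bits // l):
            shift = total_bits - (k + 1) * l
            index = (bits >> shift) & ((1 << l) - 1)  # next l-bit index
            out.append(words[index])                  # one lookup, no per-symbol branching
        return "".join(out)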
The statistical properties of ℵ, in particular the probability of emission of each a ∈ A, can be exploited to optimize dictionary performance. Sources are hereafter assumed to be ergodic (e.g., stationary) with symbol probability distribution P. As discussed in Section I, probability-driven design enables more efficient compression than LZ-based methods for sources well modeled by random variables, e.g., following a Laplacian distribution [21]. Therefore, this approach to dictionary design is considered exclusively hereinafter.

3) CODE TREES
VF dictionaries can always be expressed as a code tree, in which each edge is assigned a symbol a ∈ A. Each node is associated with a word W formed by the symbols on the edges of its path from the root node, and with a present flag determining whether that word is Included in or Excluded from W. The code tree T equivalent to a dictionary W is defined as the one that satisfies the following conditions:
1) For each word W in the dictionary, there exists exactly one node with associated word W and present flag set to Included.
2) All Included nodes have words included in W.
3) The present flag is Included for all leaf nodes. A node is a leaf when it has no children.
4) No two nodes can have the same associated word.

Algorithm 1 VF Coding of an Input Sequence A Using a Code Tree T
 1: function TreeCoding(T, A)
 2:   n ← T.root                 // n: current node
 3:   for each a ∈ A             // a: current input symbol
 4:     if n.HasEdge(a) then     // Follow edge associated with a
 5:       n ← n.Child(a)
 6:     else                     // Word defined by the path from T.root
 7:       EmitWord(n.W)
 8:       n ← T.root.Child(a)
 9:     end if
10:   end for
11:   if n ≠ T.root then
12:     FlushLastSymbols(n)
13:   end if
14: end function

The encoding process using a code tree is described in Algorithm 1. It starts at the root node, and each input symbol is processed sequentially. If the current node has an edge whose associated symbol equals the input, and that edge connects to a child node, the algorithm advances to that child. Otherwise, the algorithm emits the current node's word W, i.e., it appends that word's index binary representation to the current output. Any one-to-one mapping between the words and the available binary indices can be used, as long as it is known by both the coder and the decoder. A routine called EmitWord is assumed to be available in this and subsequent algorithms to invoke this functionality. After the emission of a word, the algorithm returns to T's root and the process is repeated. If the dictionary completeness assumption described above holds, nodes associated with emitted words are guaranteed to be Included in W. After the main coding loop, one or more symbols might remain uncoded. By construction, these symbols are a prefix of some Included word in W. The FlushLastSymbols function emits one such word and sends as side information the number of symbols coded this way.
A common decompression algorithm exists for all VF codes. Each emitted fixed-length index identifies a single Included word, which can be looked up in a table that contains that word's symbols. In case the FlushLastSymbols function emitted a word during compression, the decoder truncates that word to match the decompressed symbol sequence with the original one. The side information needed to perform this truncation is negligible for long enough sequences and is not considered hereafter.

B. TUNSTALL CODES

A dictionary W is prefix-free (or proper) if no word W ∈ W is a prefix of another word W′ ∈ W with W′ ≠ W, i.e., no word is the beginning of another. Under this constraint, no valid input A ∈ A⁺ can be expressed by more than one concatenation of words in W. Tunstall [41] proposed a method to design optimal prefix-free dictionaries assuming a memoryless source ℵ, i.e., one that produces symbols independently. Tunstall's method is listed in Algorithm 2. First, the dictionary is initialized with |A| words, each one representing a different input symbol (line 2). In successive iterations, each word W ∈ W is assigned a probability

    P_word(W) = ∏_{w ∈ W} P(w),    (1)

where w iterates over W's symbols. The most probable word W based on P_word is then substituted by |A| new words, each resulting from concatenating W and a different symbol from A (lines 4 to 10). The algorithm stops when no more words can be expanded without exceeding the specified maximum number of words. Using code tree notation, each iteration of Tunstall's algorithm can be regarded as adding |A| children to the leaf node with the highest estimated probability. To satisfy the prefix-free constraint, only words represented by leaf nodes are included in the output dictionary W. Fig. 1 shows an example of an iteration of Tunstall's algorithm using code tree notation for A = {a, b, c, d} and associated probabilities 0.8, 0.1, 0.07, and 0.03. In this example, the words included in W are {aa, ab, ac, ad, b, c, d}, and the node to be expanded in the next iteration (not shown in the figure) is that with W = aa and associated probability 0.64.
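As an informal illustration of the procedure above, the following Python sketch (heap-based; not part of the original works) builds a Tunstall dictionary for a memoryless source. For the example of Fig. 1 (probabilities 0.8, 0.1, 0.07 and 0.03, with at most 8 words), it performs a single expansion and returns the leaves {aa, ab, ac, ad, b, c, d}:

    import heapq

    def tunstall(P: dict[str, float], max_words: int) -> list[str]:
        """Repeatedly expand the most probable leaf word with every symbol,
        while the resulting dictionary still fits in max_words."""
        symbols = sorted(P, key=P.get, reverse=True)
        heap = [(-P[a], a) for a in symbols]   # max-heap of (probability, word)
        heapq.heapify(heap)
        n_leaves = len(symbols)
        while n_leaves + len(symbols) - 1 <= max_words:
            p, w = heapq.heappop(heap)         # most probable leaf
            for a in symbols:
                heapq.heappush(heap, (p * P[a], w + a))
            n_leaves += len(symbols) - 1
        return sorted(w for _, w in heap)

    print(tunstall({"a": 0.8, "b": 0.1, "c": 0.07, "d": 0.03}, 8))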

C. PLURALLY PARSABLE DICTIONARIES
The optimality of Tunstall codes is restricted to the prefix-free case. More general, plurally parsable (or non-proper) dictionaries that remove the prefix-free constraint were introduced in [44]. Using tree notation, words associated with non-leaf nodes can also be included in W when removing the prefix-free constraint. This has enabled more flexible and efficient designs [47], [49]–[51], [53].
In [44], Savari designs dictionaries that outperform Tunstall's for the case of predictable, memoryless binary sources, i.e., ℵ producing symbols from A = {0, 1} independently with large P(0) or large P(1). She considers the encoding process as a Markov chain and analyzes its steady state to appraise dictionary performance, although her analysis applies only to the binary case and is described as cumbersome [44] and difficult to apply to the non-binary case [47]. Rababa et al. improve upon Savari's method by removing the constraint of always emitting the longest possible word [51]. The input can then be expressed using different word sequences of the same length as the one obtained with the aforementioned constraint, and the choice made in the coding process is used as a side channel between the coder and the decoder.
In [47], a class of plurally parsable dictionaries called almost-instantaneous variable-to-fixed (AIVF) codes is proposed. A complete dictionary W is AIVF if and only if its associated code tree satisfies the following conditions:
1) All leaf nodes in the code tree represent words included in W.
2) Non-leaf nodes that have |A| children (complete nodes) represent words not included in W.
3) Words represented by non-leaf nodes with fewer than |A| children are included in W.

The algorithm in [47] differs from Tunstall's in the fact that nodes can be partially expanded, leading to more flexible designs. This is illustrated with an example in Fig. 2 for A = {a, b, c, d, e} and associated probabilities {1/3, 1/4, 1/6, 3/20, 1/10}. In [47], multiple plurally parsable dictionaries are also proposed to exploit some dependencies between consecutively emitted words. Further analysis and improvements to [47] based on dynamic programming are proposed in [49], [53]. The code tree produced by [53] for the same example is shown in Fig. 2b.

FIGURE 2. Example of plurally parsable code trees: (a) created using [47]; (b) created using [53].

IV. PROPOSED HT CODEC
A novel HT codec based on plurally parsable dictionaries is proposed in this section. An algorithm for creating individually optimized code trees, based on stochastic optimization, is presented in Section IV-A. In Section IV-B, multiple code trees are optimized jointly to generate code forests that improve upon individual code trees. Further improvements to the code forest generation algorithm are proposed in Section IV-C to prioritize resource allocation to the most compressible parts of the input.

A. STOCHASTICALLY OPTIMIZED CODE TREES
The description of the proposed algorithm for generating stochastically optimized code trees is presented in three parts. Section IV-A.1 defines the generation of basic trees. Section IV-A.2 models code trees as stochastic processes, enabling the creation of better adapted trees. Section IV-A.3 describes the iterative process used in the proposed codec to generate stochastically optimized code trees.

1) BASIC TREE GENERATION
Given a node n with associated word W, the probability of the encoder emitting W is lower if that node has a nonempty set of children, hereafter denoted children(n). This is because some occurrences of W in the sequence of input symbols will be coded by words associated with nodes contained in children(n). Since longer words are preferred to maximize compression efficiency, a child word will be used whenever the next input symbol a can be matched, i.e., when an edge associated with a is connected to n in the code tree. Based on this, for the first coded word, the probability of a node being emitted can be calculated as

    P_node(n) = P_word(W) · (1 − Σ_{m ∈ children(n)} P(symbol(m))),    (2)

where symbol(m) denotes the symbol of the edge connecting m and its parent n. By definition, when children(n) is empty, P_node(n) = P_word(W). As discussed in Section IV-A.2, P_node is not always accurate for the second and subsequent coded words. For the sake of simplicity, P_node is assumed to be exact for basic tree generation.

Algorithm 3 Basic Code Tree Generation
 1: function BasicTree(A, P, max_size)
 2:   // max_size: maximum number of Included words
 3:   T ← T_0    // Tree with the root and its |A| children
 4:   while |T.IncludedNodes()| ≤ max_size do
 5:     n ← arg max_{m ∈ T} P_node(m)
 6:     n.AddChild(a_{|children(n)|+1})    // A's symbols sorted by decreasing probability
 7:     if |children(n)| = |A| − 1 then    // Last possible symbol
 8:       n.AddChild(a_{|A|})
 9:       n.present ← Excluded
10:     end if
11:   end while
12:   return T
13: end function

The proposed method is given in Algorithm 3. It follows a generalization of Tunstall's iterative algorithm, with the following three main differences:
1) In each iteration, the selected node is the one with the highest P_node(n) instead of the highest P_word(W) (line 5). This is hereafter referred to as the node being expanded.
2) Only one node m is created and connected to the expanded node n (line 6). The expanded node can be selected again in subsequent iterations. This is as opposed to Tunstall's algorithm, which connects all possible children to the node being expanded in a single iteration. Note that nodes added with AddChild are flagged as Included by default.
3) If an expanded node n is connected to |A| − 1 children, a second node is connected to n, associated with the last symbol in A, a_{|A|} (line 8). Then n is flagged as Excluded, removing its associated word from the dictionary (line 9). This change can be safely applied after 2) by construction, because the emission of n's word W could only take place if symbol a_{|A|} were produced after W by the source, or if the end of the input sequence were reached after W; otherwise, by construction, another word would be emitted. The change benefits compression performance by increasing the average number of symbols per word included in the dictionary.

Algorithm 3 can be regarded as a fast approximation to the average-sense optimal almost-instantaneous VF code tree generation algorithm presented in [47]. In [47], sometimes all children of a node n are added instead of creating a single child in the node with the highest associated probability. This occurs when the average word length gain due to the automatic expansion of n's last symbol outweighs creating otherwise suboptimal children. Empirically, we observed that this has little impact in most cases.
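The following Python sketch (with hypothetical class and routine names, provided for illustration only) mirrors Eq. (2) and one greedy expansion step of Algorithm 3; initialization is assumed to attach all |A| children to the root, and the caller iterates until the desired number of Included nodes is reached:

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        word: str
        p_word: float                  # product of the word's symbol probabilities
        children: dict = field(default_factory=dict)
        included: bool = True

    def p_node(n: Node, P: dict) -> float:
        """Eq. (2): emission probability of n's word for the first coded word."""
        return n.p_word * (1.0 - sum(P[a] for a in n.children))

    def expand_once(nodes: list, P: dict, order: list) -> None:
        """One iteration of Algorithm 3: grow the highest-P_node node by its
        next unused symbol; auto-complete and Exclude it if one symbol remains."""
        n = max(nodes, key=lambda m: p_node(m, P))
        a = order[len(n.children)]             # symbols in decreasing probability
        n.children[a] = Node(n.word + a, n.p_word * P[a])
        nodes.append(n.children[a])
        if len(n.children) == len(order) - 1:  # only the last symbol is missing
            last = order[-1]
            n.children[last] = Node(n.word + last, n.p_word * P[last])
            nodes.append(n.children[last])
            n.included = False                 # word removed from the dictionary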
2) CODE TREES AND STOCHASTIC PROCESSES

Plurally parsable dictionaries introduce temporal dependencies between words. For instance, consider the code tree depicted in Fig. 2 for A = {a, b, c, d, e}. If word W = a is emitted, since longer words are always preferred for coding, the next emitted word must start with symbol d or e; otherwise, words aa, ab or ac would have been emitted instead of W. Therefore, a tree obtained by direct application of Algorithm 3 will generally not be optimally adapted for the next symbol, because P(a), P(b) and P(c) were not assumed to be 0 in its creation. In general, suboptimal probability estimation will occur after emitting the word of any non-leaf node. Algorithm 4, proposed below, uses stochastic analysis to address this issue and generate an improved tree that maximizes the expected emitted word length for a long enough sequence of symbols. This is as opposed to considering only the expected word length of a dictionary, which accurately measures the dictionary's efficiency only in the prefix-free case, due to the absence of the temporal dependencies between words discussed above.
The coding procedure using a tree T is modeled as a stochastic process X_T with |A| − 1 states, X_T = {σ_1, . . . , σ_{|A|−1}}, with a transition occurring whenever a word is emitted. When T is obvious from the context, X is used instead of X_T. Hereafter, X is defined to be in state σ_i when T's structure and the last emitted word determine that the next coded word cannot start with any of the i − 1 most probable symbols, i.e., the first i − 1 symbols in A. After emitting a leaf node's word, the process transitions to state σ_1 because there are no restrictions on the next input symbol. By design, Algorithm 3 adds children associated with symbols in decreasing order of probability, i.e., in the order given by A. Therefore, after emitting the word associated with a node n, X will be in state σ_{|children(n)|+1}. Using this definition, the probability of a node's word being emitted can now be expressed as

    P_X(n) = Σ_{σ ∈ X} π(σ) · P_X(n | σ),    (3)

where π(σ) is the probability of X being in state σ before emitting a word, and P_X(n | σ) is that node's conditional probability given that the process is in state σ before emission. (Note that the Greek letter π is hereinafter employed to denote state probability measures, as opposed to the Latin P used for symbols and words.) The new measure P_X can directly replace P_node in line 5 of Algorithm 3. The resulting function for producing trees given a finite discrete process X is hereafter denoted GetProcessTree(A, P, X, π, max_size), where X is X's state set and π is a function satisfying Σ_{σ ∈ X} π(σ) = 1.

3) STOCHASTIC OPTIMIZATION
To obtain P_X(n), both the conditional probabilities and the process state probabilities must be computed. A node's conditional probability in (3) can be calculated as

    P_X(n | σ) = δ(σ, n) · P_node(n) / Σ_{m ∈ NodesFrom(T,σ)} P_node(m),    (4)

where NodesFrom(T, σ) is the set of all nodes whose word can be emitted while in state σ, which satisfies

    Σ_{m ∈ NodesFrom(T,σ_i)} P_node(m) = Σ_{a ∈ A.FirstAt(σ_i)} P(a),    (5)

where δ is a function defined as

    δ(σ, n) = 1 if n ∈ NodesFrom(T, σ), and 0 otherwise,    (6)

and where A.FirstAt(σ) is the set of symbols valid in the first position of the next emitted word. Assuming symbols are sorted in descending order of emission probability, A.FirstAt(σ_i) = {a_j ∈ A : j ≥ i} is the set of all but the first i − 1 symbols in A. Note that equation (5) holds by construction, because all nodes in NodesFrom(T, σ_i) are associated with words that start with a ∈ A.FirstAt(σ_i). The P_node sum of all nodes in any branch remains constant with node expansions as defined in Algorithm 3; hence symbol probabilities are unaffected by X's state in the second and subsequent positions of a word.

Continuing with the calculation of P_X, X's state probabilities can be accurately computed assuming ergodicity of the source ℵ. Stationary Markov chain analysis is applied by considering the state transition probability matrix. The transition probability to σ_j given state σ_i can be expressed as

    T(σ_i, σ_j) = Σ_{n ∈ NodesTo(T,σ_j)} P_X(n | σ_i),    (7)

where

    NodesTo(T, σ_j) = {n ∈ T : |children(n)| = j − 1}    (8)

is the set of nodes in T whose associated word's emission entails a transition to state σ_j. It is trivial to verify that T satisfies all necessary conditions for X to be a stationary Markov chain. In the proposed codec, π is defined as X's stationary state probability measure, which can be computed as shown in Algorithm 5: state probabilities are iteratively updated by multiplication with T until convergence of π. A small tolerance ε is allowed, e.g., ε ≈ 10^−10, to avoid numerical instability. It was empirically found that Algorithm 5 converges within a few iterations.

Algorithm 4 Stochastically Optimized Code Tree Generation
 1: function MarkovOptimizedTree(A, P, max_size)
 2:   // A: alphabet; P: symbol probability distribution
 3:   T ← BasicTree(A, P, max_size)    // Initialization (Section IV-A.1)
 4:   forever
 5:     X ← X_T
 6:     π ← TreeStationaryProbability(T)    // Algorithm 5
 7:     T′ ← GetProcessTree(A, P, X, π, max_size)
 8:     if T′ = T then
 9:       return T
10:     end if
11:     T ← T′
12:   end forever
13: end function

The proposed method for generating individual, stochastically optimized code trees is given in Algorithm 4. The function MarkovOptimizedTree takes an alphabet A, a symbol probability distribution P, and a maximum number of nodes. A basic tree as described in Section IV-A.1 is generated for initialization purposes (line 3). The stationary state probabilities are computed using Algorithm 5 (line 6). These probabilities are then used to replace the current tree with a new one generated by the GetProcessTree routine defined in Section IV-A.2 (line 7). If the newly generated code tree is not identical to the previous one, probabilities and trees are updated again. Otherwise, convergence is achieved and the current tree is returned.
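A compact version of the power iteration described for Algorithm 5 can be sketched as follows (an illustrative Python fragment, not the codec's actual implementation; the matrix T is assumed to be built from Eq. (7), with one row and column per state):

    import numpy as np

    def stationary_probability(T: np.ndarray, eps: float = 1e-10) -> np.ndarray:
        """Iterate pi <- pi T until convergence within tolerance eps."""
        pi = np.full(T.shape[0], 1.0 / T.shape[0])  # start from a uniform measure
        while True:
            nxt = pi @ T
            if np.max(np.abs(nxt - pi)) < eps:      # eps avoids numerical instability
                return nxt
            pi = nxt

As noted above, a few iterations typically suffice in practice for matrices of this kind.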

B. OPTIMIZED CODE FORESTS
In Section IV-A, optimization is performed assuming that a single tree is used to code all input symbols. In this section, multiple trees are jointly optimized, assuming that different words can be emitted by different trees. Each emitted word determines the next tree to be used for coding. A code forest F is defined as a finite set of code trees, {T_1, . . . , T_{|F|}}, alongside the rules that dictate which tree within the set to use after emitting each word. Hereafter, |F| denotes the number of trees in the code forest. The general coding procedure for any code forest is provided in Algorithm 6.
The algorithm in [47] produces a code forest with exactly |A| − 1 trees. T_i's root has children for all but the i − 1 most probable symbols. Therefore, T_i is designed specifically for state σ_i as defined in Section IV-A.2, where none of the i − 1 most probable symbols may appear at the beginning of the next word. In [47], after emitting a node n's word, the next tree is the one designed for state σ_{|children(n)|+1}.

Algorithm 6 VF Coding of an Input Sequence A Using a Code Forest F
 1: function ForestCoding(F, A)
 2:   T ← F_0              // F_0: arbitrary initial tree
 3:   n ← T.root           // n: current node
 4:   for each a ∈ A       // a: current input symbol
 5:     if n.HasEdge(a) then
 6:       n ← n.Child(a)
 7:     else
 8:       EmitWord(n.W)    // n.W: n's word
 9:       T ← n.T_next     // n.T_next: defined transition
10:       n ← T.root.Child(a)
11:     end if
12:   end for
13:   if n ≠ T.root then
14:     FlushLastSymbols(n)
15:   end if
16: end function

In this section, we propose a more general forest design algorithm that produces a parametrizable number of trees. In contrast to [47], all tree roots of the proposed method have exactly |A| children, so that any input sequence can be coded with any of these trees. Code trees in the forest T ∈ F are optimized based on the transition rules of F, as described below in Section IV-B.1. The number of trees determines the number of coding contexts that are effectively used for coding. In Section IV-B.2, a method is proposed to arrange word indices that enables highly competitive decompression throughput.

1) STOCHASTIC FOREST OPTIMIZATION
In the proposed Algorithm 7, 2^λ trees are produced, F = {T_1, . . . , T_{2^λ}}. Each tree contains exactly |T_i| = 2^K Included nodes. Parameters λ and K are integers that must satisfy K ≥ λ ≥ 0 and K ≥ ⌈log_2 |A|⌉. For each tree in the forest, 2^{K−λ} of its words are defined so that T_1 is used for coding after emitting them, and the same number of words is devoted to each of the remaining trees in F. For example, for K = 8 and λ = 2, 4 trees are created. Each tree contains 256 Included words, 64 of which define a transition to the first tree (T_1), 64 to the second tree (T_2), and so forth. As detailed later in Section IV-B.2, this design structure is imposed so that implementations can attain competitive HT performance.
A stochastic process Y is used to model the proposed forest's coding procedure. For each tree T in the forest, |A| − 1 states are defined, {σ^T_1, . . . , σ^T_{|A|−1}}. The process Y is in state σ^T_i when tree T is selected to emit the next word W, and none of the i − 1 most probable symbols may appear at the beginning of W. Therefore, 2^λ · (|A| − 1) states are defined in total for Y, Y = {σ^T_i}_{T ∈ F, 1 ≤ i ≤ |A|−1}. The probability of a node n's word being emitted can then be expressed as a weighted sum of conditional probabilities,

    P_Y(n) = Σ_{T ∈ F} Σ_{1 ≤ i ≤ |A|−1} π(σ^T_i) · P_Y(n | σ^T_i),    (9)

where π(σ^T_i) is the probability of emitting a word while in state σ^T_i. The conditional probability of a node n given σ^T_i is given by

    P_Y(n | σ^T_i) = δ(σ^T_i, n) · P_node(n) / Σ_{m ∈ NodesFrom(T, σ^T_i)} P_node(m).    (10)

As in Section IV-A, NodesFrom(T, σ^T_i) denotes the set of nodes whose word can be emitted while in state σ^T_i, A.FirstAt(σ^T_i) is the set of all but the i − 1 most probable symbols in A, and δ(σ^T_i, n) is given by

    δ(σ^T_i, n) = 1 if n ∈ NodesFrom(T, σ^T_i), and 0 otherwise.    (11)

The state transition probability matrix of the forest's process Y can be expressed as

    T(σ^T_i, σ^{T′}_j) = Σ_{n ∈ NodesTo(T, σ^{T′}_j)} P_Y(n | σ^T_i),    (12)

where NodesTo(T, σ^{T′}_j) contains the nodes in T after which the process transitions to σ^{T′}_j, so that T′ is used for the next word. Equation (11) holds by construction. Similar to Section IV-A, T defines a stationary Markov chain, and stationary state probabilities can be computed using the same iterative approach as in Algorithm 5. The new routine is hereinafter denoted ForestStationaryProbability(F). In this case, transitions between any two states are considered, including transitions between different trees. Therefore, the obtained probabilities accurately describe the long-run behavior of the coding process given a code forest.
Algorithm 7 is proposed to generate an optimized code forest for an alphabet A, a symbol probability distribution P, and parameters K and λ.

Algorithm 7 Stochastically Optimized Code Forest Generation
 1: function MarkovOptimizedForest(A, P, K, λ)
 2:   // Initialization (lines 3 to 5)
 3:   T_ω ← BasicTree(A, P, 2^K) for each ω ∈ {1, . . . , 2^λ}
 4:   F ← {T_1, . . . , T_{2^λ}}
 5:   DefineTransitions(F)
 6:   forever
 7:     F′ ← ∅
 8:     π ← ForestStationaryProbability(F)
 9:     for each ω ∈ {1, . . . , 2^λ}
10:       T′_ω ← GetProcessTree(A, P, Y_ω, π, 2^K)
11:       F′ ← F′ ∪ {T′_ω}
12:     end for
13:     DefineTransitions(F′)
14:     // F′: candidate forest for the next iteration
15:     if F = F′ then
16:       return F
17:     end if
18:     F ← F′
19:   end forever
20: end function

During initialization, 2^λ trees are created, each with 2^K Included nodes, and tree transitions are set as defined in DefineTransitions (lines 3 to 5). This routine sets those transitions with the restriction that 2^{K−λ} words from each tree in the forest must transition to any given T ∈ F. Since there are 2^{K+λ} total Included words in F, (K + λ)-bit words are emitted. Therefore, any one-to-one mapping between the Included words and {0, . . . , 2^{K+λ} − 1} is valid. The proposed mapping used in DefineTransitions, which is optimized for highly efficient decompression, is detailed in Section IV-B.2. The remainder of Algorithm 7 alternates between stationary state probability updates (line 8) and the generation of 2^λ trees and their transition rules (lines 9 to 14); in line 10, Y_ω denotes the subset of Y's states associated with T_ω. Empirically, it was found that this iterative algorithm converges after a few minutes on a commercial desktop CPU for all tested parameters.

2) CONTEXT PARAMETRIZATION AND OVERLAPPED WORD INDICES
Each of the 2^λ trees produced by Algorithm 7 in a forest F can be considered a coding context, as they are selected based on the input symbols. As described in Section IV-B.1, trees are optimized based on their transitions as well as on the Included words' emission probability. DefineTransitions assigns which 2^{K−λ} words from each tree cause a transition to each other T ∈ F. Finding optimal word-to-tree transitions is not trivial, so the following heuristic is proposed (a minimal sketch is given after the list):
1) Nodes of each T ∈ F are assigned to one of |A| − 1 groups, denoted G_i. A node n with |children(n)| children is assigned to group G_{|children(n)|+1}. A list of Included nodes is produced.
2) Within each group G_i, Included nodes are sorted by P_node.
3) The sorted list of Included nodes is split into 2^λ consecutive blocks of size 2^{K−λ}. Words of the i-th block are defined to transition to T_i.
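This assignment can be sketched in Python as follows (an illustrative fragment reusing the hypothetical Node class and p_node function from the sketch in Section IV-A.1; tree.all_nodes() and the next_tree attribute are likewise assumed for illustration):

    def define_transitions(trees: list, K: int, lam: int, P: dict) -> None:
        """Heuristic word-to-tree assignment for one forest."""
        block_size = 2 ** (K - lam)
        for tree in trees:
            nodes = [n for n in tree.all_nodes() if n.included]
            # Steps 1 and 2: group G_{|children|+1}, sorted by P_node inside groups.
            nodes.sort(key=lambda n: (len(n.children), -p_node(n, P)))
            # Step 3: the i-th block of 2**(K - lam) words transitions to T_i.
            for rank, n in enumerate(nodes):
                n.next_tree = trees[rank // block_size]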
This heuristic is designed so that trees T_i ∈ F with lower indices are adapted to receiving transitions from more probable words. Moreover, the number of trees optimized based on the words in G_i depends on the number of words that transition to G_i. This results in more specialized trees being available for the most probable cases, while less probable scenarios are still represented proportionally to an estimation of their likelihood.

The chosen transition structure enables a very efficient decoder implementation that removes the need for an extra memory access per emitted word to find the next tree to be used. More specifically, we propose to employ the λ-bit suffix of each index's binary representation to identify which of the 2^λ possible trees to use next. In addition, words are proposed to have (K + λ)-bit binary indices, of which the first λ bits identify the tree to which a word's node belongs. Therefore, with a single read operation of K + λ bits, both the coded word and the next tree to be used are acquired. The next read starts K bits after the previous one. As a result, the last λ bits of each word (identifying the next tree to use) are the first λ bits of the next read of K + λ bits. An example of the proposed overlapped word scheme is shown in Fig. 3 for K = 4 and λ = 2. It should be highlighted that the K + λ bits read for each coded word contain the information needed to reconstruct the intended list of symbols, as well as to signal the next tree to be used. Note that such an optimization is not possible with methods such as Yamamoto et al.'s, which require at least one additional memory access to determine the next tree to be used.
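The resulting decoding loop can be sketched as follows (an illustrative Python fragment, not the codec's actual implementation; a single lookup table of 2^(K+λ) entries is assumed, the number of reads is assumed known from side information, and the payload is assumed padded so that the final read is complete):

    def forest_decode(payload: bytes, K: int, lam: int,
                      words: list[str], n_reads: int) -> str:
        """Branchless decoding with overlapped (K + lam)-bit reads."""
        bits = int.from_bytes(payload, "big")
        total_bits = 8 * len(payload)
        mask = (1 << (K + lam)) - 1
        out = []
        pos = 0
        for _ in range(n_reads):
            index = (bits >> (total_bits - pos - (K + lam))) & mask
            out.append(words[index])  # one lookup per word, no conditional branches
            pos += K                  # the trailing lam bits are re-read as a prefix
        return "".join(out)

Because each iteration performs the same fixed sequence of shifts, masks and one table lookup, the loop pipelines well and avoids the data-dependent branches of variable-length decoders.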

C. SELECTIVE INFORMATION CODING
The general motif of Sections IV-A and IV-B is to invest more resources (e.g., more words within a tree's branch) in more probable events, in order to increase the average word length and enhance the compression ratio in those cases, to the detriment of less probable events. This motif can be generalized to further skew codec resources towards design aspects that can improve overall performance. The following improvements are proposed:

1) MODIFIED GOLOMB-RICE CODES
Consider the index x ∈ {0, . . . , |A| − 1} of each input symbol in A. A quotient q and a remainder r can then be defined based on an integer parameter S ≥ 0 as

    q = ⌊x / 2^S⌋,    r = x mod 2^S.    (15)

Similar to Golomb-Rice codes, q and r are processed separately, and entropy coding is applied only to q. The binary expression of r corresponds to the S least significant bits of the input, which are output without further coding.
Hereafter, S is referred to as the shift parameter. The quotients obtained in (15) are then compressed using a code forest, created as described in Section IV-B. The main advantage is that the effective alphabet size is divided by 2^S, and therefore it is easier to obtain longer average word lengths for a fixed dictionary size. For sources whose r values are distributed uniformly enough, the gains due to longer word lengths are greater than the loss due to outputting the r values uncoded.
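In Python, the split of (15) and its inverse reduce to simple bit operations, e.g. (a minimal illustrative sketch):

    def gr_split(x: int, S: int) -> tuple[int, int]:
        """Eq. (15): quotient (entropy coded) and remainder (raw S bits)."""
        return x >> S, x & ((1 << S) - 1)

    def gr_merge(q: int, r: int, S: int) -> int:
        """Inverse of gr_split: reattach the S raw bits to the decoded quotient."""
        return (q << S) | r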

2) UNREPRESENTED SYMBOLS
Lossless codecs must be able to represent any valid input, including inputs that contain highly improbable symbols. This is a source of inefficiency, especially for large alphabets, because a non-negligible fraction of the words may be devoted to handling unlikely symbol events. Complementary to the modified Golomb-Rice method, another improvement to Algorithm 7 is proposed to address this issue. A symbol probability threshold τ is selected, and the largest subset of symbols with probability sum below τ is discarded. Algorithm 7 is used to produce an optimized code forest for the remaining symbols. The coding process is identical, except for the fact that when an unrepresented symbol is found, it is skipped. Instead, an auxiliary binary sequence is defined during initialization and updated every time an unrepresented symbol is found, appending the unrepresented symbol's position in the input sequence and its index in the original A. For sufficiently low values of τ, the number of such updates is small; hence it suffices to output those indices using fixed-length binary representations. Empirically, we found values of τ in the order of 10^−6 to yield performance improvements for low-entropy sources for which enough symbols can be discarded at that threshold.
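The selection and escape mechanism can be sketched as follows (an illustrative Python fragment; the fixed-length packing of the escaped (position, symbol) pairs is omitted):

    def split_unrepresented(samples: list, P: dict, tau: float):
        """Discard the largest symbol set with probability sum below tau;
        occurrences of discarded symbols are escaped as (position, symbol)."""
        dropped, total = set(), 0.0
        for a in sorted(P, key=P.get):        # least probable symbols first
            if total + P[a] >= tau:
                break
            dropped.add(a)
            total += P[a]
        kept, escapes = [], []
        for pos, a in enumerate(samples):
            if a in dropped:
                escapes.append((pos, a))      # stored with fixed-length binary codes
            else:
                kept.append(a)                # coded with the optimized forest
        return kept, escapes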

V. PERFORMANCE EVALUATION
Given a codec (E, D) and an input A ∈ A⁺, the compressed data rate is defined as

    R(E, A) = |E(A)| / |A|,    (16)

where |E(A)| is the length of the coded sequence in bits, |A| is the number of coded symbols, and R(E, A) is expressed in bits per sample (bps). In this work, codec performance is evaluated based on three measures: compressed data rate, coding time, and decoding time. Section V-A discusses compressed data rate efficiency for synthetic memoryless sources and provides a comparison with the state of the art in VF codecs. The compressed data rate for natural sources is evaluated jointly with coding and decoding time for the proposed codec and the best-performing publicly available HT codecs in Section V-B. In order to attain full reproducibility of the performed experiments (except for the obtained time measurements), a code container is provided as supplementary materials for this manuscript. The interested reader is referred to this container for implementations of the tested codecs, as well as of the benchmark employed for generating the figures displayed in this document.

A. PERFORMANCE ON SYNTHETIC MEMORYLESS SOURCES
For a memoryless ergodic source ℵ and long enough input sequences, Shannon proved that the lower bound for a codec's compressed data rate is the source's entropy [57],

    H(ℵ) = − Σ_{a ∈ A} P(a) · log_2 P(a).    (17)

In this work, the compression efficiency of a codec (E, D) for a given input A ∈ A⁺ produced by ℵ is defined as the ratio between that lower bound and the attained compressed data rate,

    η_ℵ(E, A) = H(ℵ) / R(E, A).    (18)

Synthetic pseudorandom sources are defined with |A| = 16 and symbol probabilities based on a Laplacian distribution, P(a_i) = γ · e^{−i/b}, where γ is a constant defined so that Σ_{a ∈ A} P(a) = 1, and b is a parameter that determines the resulting source entropy. For each source ℵ, an input sequence A_ℵ of length |A_ℵ| = 10^6 symbols is generated so that symbol a appears max(1, [P(a) · |A_ℵ|]) times, where [·] denotes rounding. The symbols are then reordered in 16 pseudorandom ways to simulate the ergodic and memoryless properties. The efficiency of a codec (E, D) for a source ℵ's generated input is computed as in (18), and average results for all trials are reported. Note that average differences below 0.3% are observed between the trials.
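For reference, the measures in (16)-(18) can be computed directly, e.g., with the following illustrative Python sketch:

    import math
    from collections import Counter

    def entropy_bps(samples) -> float:
        """Eq. (17): zero-order entropy of the empirical distribution, in bps."""
        n = len(samples)
        return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

    def efficiency(compressed_bits: int, samples) -> float:
        """Eqs. (16) and (18): eta = H / R, with R = |E(A)| / |A|."""
        rate = compressed_bits / len(samples)  # bits per sample, Eq. (16)
        return entropy_bps(samples) / rate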

1) CODE TREE EFFICIENCY
The stochastic optimization approach proposed in Algorithm 4 is compared to Algorithm 3 and to the single-tree generation algorithm described by Yamamoto and Yokoo (YY) in [47]. Code trees are created for each algorithm with exactly |T| = 2^8 nodes, and their compression efficiency is measured for ℵ as described above. Efficiency results for these algorithms are shown in Fig. 4. Significant efficiency gains are observed for Algorithm 4 when compared to both Algorithm 3 and YY's code tree. Efficiency improvements are more noticeable for low-entropy sources. This can be explained by the fact that the most probable symbol is rarely the first symbol of a word for this type of source and code tree. The proposed stochastic optimization correctly predicts this behavior and adapts code tree designs accordingly. Average compression efficiency gains due to stochastic optimization range from 0.08 to 0.11, depending on the parameter choice.

The effect of K on the proposed code trees is also studied. Fig. 5 shows efficiency results for the proposed method for K between 6 and 14, i.e., between 64 and 16384 Included words. As can be observed, better compression is obtained for larger values of K. Notwithstanding, improvements due to increasing K become small when K is already large. For |A| = 16, no significant efficiency gains are observed beyond K = 14. In practical applications, K should ideally be selected so that the code forest structure fits inside the L1 cache while providing competitive compression efficiency. Therefore, the optimal value of K depends on the alphabet size and the machine's cache size.

2) CODE FOREST EFFICIENCY
The code forest generation method proposed in Algorithm 7 is evaluated for different numbers of trees, |F| ∈ {2^0, . . . , 2^4}, each with exactly |T| = 2^8 Included nodes. Note that for |F| = 1, Algorithm 7 is equivalent to Algorithm 4. Compression efficiency for these forests, as well as for those obtained with YY's code tree and code forest generation algorithms for the same number of nodes per tree, is plotted in Fig. 6. Compression efficiency is consistently improved by using more than one tree in the code forest. In the proposed method, performance increases with λ, although increments beyond a certain point yield no significant gains. For λ = 4, the obtained compression efficiency curve is on average 0.95% worse than YY's code forest. It should be highlighted that YY's code forest structure admits neither the overlapped word structure described above nor an equally efficient HT implementation. Note that the oscillatory nature of all performance curves is related to the intrinsic ability of variable-to-fixed codes to efficiently represent some symbol probability distributions. The interested reader is referred to [58], [59] for a more complete explanation of this behavior.

3) MODIFIED GOLOMB-RICE SHIFT
To study the impact of the modified Golomb-Rice shift parameter S on the proposed code forests, compression efficiency is evaluated for S ∈ {0, 1, 3} and for Yamamoto et al.'s forest generation algorithm. Results are shown in Fig. 7. Parameter values S > 0 reduce compression efficiency for medium and low entropies. This is as expected, since low-entropy sources are more easily compressible, and therefore it is preferable to apply entropy coding to all bitplanes. On the other hand, for sources with high enough entropy, S > 0 enhances performance because the least significant bitplanes, which are hardly compressible, are output uncompressed. Therefore, more nodes can be devoted to encoding the more compressible parts of the input. Also as expected, the intersection point between the unshifted (S = 0) and shifted (S > 0) plot lines is translated towards higher entropies as S is increased. For the same reasons as before, performance efficiency at the maximum entropy depends on whether the total number of nodes, 2^K, is a power of the number of symbols in the source alphabet after the modified Golomb-Rice split, i.e., |A|/2^S.

B. PERFORMANCE ON NATURAL SOURCES
The proposed code forest generation algorithm is also tested on natural sources, using data produced by real sensors of different types, including image, scientific and textual data. For these data, neither ergodicity nor the memoryless property can be assumed; hence efficiencies higher than 1 are both theoretically possible and observed in practice. The following datasets containing visual contents have been included in the comparison: the RGB-encoded ISO 12640-2 [60] standard color image set, the Kodak PhotoCD set [61], and the Rawzor [62] set. Data produced by other types of sources have also been included in the comparison: namely, floating point data from chemical sensors [63], semi-structured text data describing medical product reviews and scores [64], and structured data from several physical dimensions taken at the Intel Berkeley Research lab [65] (grouped as Mixed in Tables 1 and 2). This choice is intended to provide a representative, albeit non-exhaustive, selection of binary and text sources relevant to HT entropy codecs and exhibiting diverse properties. For compression, all files in these datasets are interpreted as one-dimensional sequences of 8-bit samples, i.e., |A| = 256. In particular, image data are presented in band-sequential mode, with each component in raster order. The number of samples in each dataset, as well as their average zero-order entropy, is provided in Table 1. All input and output files are stored in an in-memory file system to decouple performance results from the type of long-term storage used.

TABLE 2. Average compression efficiency (η_ℵ), encoding speed (E_s, 10^9 samples/s) and decoding speed (D_s, 10^9 samples/s).
Code forests produced by Algorithm 7 for parameters K ∈ {8, 10} and λ ∈ {0, 2} are used to compress the aforementioned input sequences. Each sequence is divided into blocks of 4096 samples, and the zero-order entropy of each block determines which of 10 precomputed code forests to use for a given choice of K and λ. This additional stage is included in the reported execution times, although it would not be needed when the sources' entropy is known a priori or otherwise calculated. Note that the optimal shift parameter S for each code forest is determined as a part of the precomputation. The proposed algorithm is compared with the best-performing, publicly available HT entropy codecs. These include (in alphabetical order): FAPEC [25], FSE [23], Gipfeli [19], Huff0 [22], LZ4 [15], LZO [18], Snappy [16] and Zstd [20]. Rice and RLE codecs were also implemented for this work for the sake of a more complete comparison. Note that compression algorithms specific to only one data type (e.g., images), or with significantly higher computational complexity (e.g., JPEG-LS [66], JPEG 2000 [67], CALIC [68]), are out of the scope of this work and are not analyzed here. To the best of the authors' knowledge, the tested algorithms are representative of the state of the art in HT coding/decoding. All compression and decompression routines of the tested algorithms are invoked with default parameters, except for FAPEC. For this method, the -mt 1 -dtype 8 -np us parameters are specified to guarantee that only one thread is employed and that no decorrelation transformation is applied to the data before compression. This provides a fair comparison, since such transforms are out of the scope of this work and are not present in any of the other tested algorithms.
For each input sequence, the entropy is obtained as described in (17) and used to calculate each codec's efficiency η_ℵ as defined in (18). Compression and decompression speeds for each codec-sequence combination are obtained as the number of samples in that sequence divided, respectively, by the encoding and decoding times, defined as the sum of the measured CPU and system times on an Intel Xeon Platinum 8175M CPU at 2.50 GHz. To guarantee stability in the measurements, each codec-sequence combination is repeated until the accumulated computation time reaches a minimum of 1 s, and the average time is reported. All codecs are configured to use exactly one thread. Average efficiency results for each dataset are provided in Table 2. Average results for all input sequences are also included in Table 2 and plotted in Fig. 8.

1) COMPRESSION EFFICIENCY (η_ℵ)
The proposed code forests yield compression efficiency results comparable to the best codecs for all tested datasets, i.e., Huff0, FSE and Zstd. On average, and depending on the choice of K and λ, the proposed code forests yield compressed sizes between 4.27% and 5.98% larger than Huff0's. Consistent with the previous discussion, using more than one tree in the forest (λ > 0) typically yields higher compression performance, with efficiency gains of up to 0.03. On the other hand, increasing the number of Included nodes per tree, i.e., increasing the value of K beyond 8, enhances compression only if λ = 0. It should be highlighted that the observed efficiency difference between the proposed codec and the most efficient codecs is as expected, due to the additional design constraints imposed by VF codes as opposed to variable-to-variable codes such as Huff0, FSE and Zstd. Also note that for the ISO, Kodak and Rawzor image datasets, most codecs attain efficiencies higher than 1. This is due to the fact that zero-order entropy (used in the efficiency calculations) cannot detect any redundancy present among neighboring samples, whereas many of the tested codecs exploit it to obtain higher efficiency. In particular, the proposed algorithm exploits higher-order redundancy mainly via the shift parameter S, the use of several coding trees in the forest when λ > 0, and the per-block dictionary selection described above.

2) COMPRESSION SPEED (E_s)

The proposed code forests exhibit compression speeds comparable to those of the algorithms with the highest compression efficiency. On average, our implementation of Algorithm 7 is able to encode 2.0 · 10^8 samples per second for (K, λ) = (8, 0), i.e., 17.6% faster than Zstd and 16.7% slower than Huff0 and FSE. Golomb-Rice codes yield good compression efficiency and speed, although their decoding speed is significantly lower than that of the proposed code forests.
Other tested algorithms achieve a significantly higher sample throughput, most notably FAPEC, although their compression efficiency is not as high as that of the most efficient algorithms.
3) DECOMPRESSION SPEED (D_s)

In terms of decompression speed, the proposed code forests yield a very competitive throughput, higher than those of the methods with the highest compression efficiency, i.e., Huff0, FSE and Zstd. This is one of the main goals of the proposed algorithm, based on variable-to-fixed codes. On average, the slowest of the tested code forests is 3.77 times faster than FSE, which in turn exhibits faster decompression than Huff0 and Zstd. Entropy codecs with lower compression efficiency (i.e., η_ℵ < 1.0 for the tested sensor data) are able to decode faster than the proposed code forests. FAPEC dominates decompression speeds overall, in part due to the use of a heavily optimized, commercial implementation, which employs hardware-specific tuning such as single-instruction multiple-data (SIMD) assembly instructions. In contrast, the proposed codec implementation is hardware agnostic. Note that the development of an optimized, platform-specific version of the proposed codec is possible, but out of the scope of this paper.
For both compression and decompression, increasing K reduces attained speed. Even though the best value of K based on compression efficiency and execution speed depends on ℵ, empirically we found K = 8 to yield a favorable trade-off on the tested datasets. Increasing the shift parameter S also increases execution time, although very similar compression performance and execution times are observed for K ∈ {8, 10} and S > 0.

VI. CONCLUSION
The design of HT entropy codecs with competitive performance is paramount to meet the ever-increasing demands in terms of latency, bandwidth, storage space usage and energy consumption. In many scenarios, such as high-performance communication and computation systems, special importance is placed on decompression, since data are often decompressed multiple times or at critical points of the information pipeline. Notwithstanding, in most available HT entropy codecs, design decisions favor compression speed and efficiency, to the detriment of decompression speed. The code forests proposed in this work are based on variable-to-fixed codes, and prioritize decompression speed while maintaining competitive compression efficiency. These forests are optimized using stationary Markov chain analysis, and achieve compression efficiency results within 1% of the state of the art in variable-to-fixed codes when tested on synthetic sources. In addition, the structure of the proposed codes enables highly efficient implementations that are not possible with other state-of-the-art VF codes. When tested on natural sources against several paradigms of HT entropy coding, the proposed code forests attain compression efficiency competitive with the state of the art, with average decompression speeds 3.8 times faster than the tested codecs with similar compression efficiency.