Lower Bounds on the Redundancy of Huffman Codes With Known and Unknown Probabilities

In this paper, we provide a method to obtain tight lower bounds on the minimum redundancy achievable by a Huffman code when the probability distribution underlying an alphabet is only partially known. In particular, we address the case where the occurrence probabilities are unknown for some of the symbols in an alphabet. Bounds can be obtained for alphabets of a given size, for alphabets of up to a given size, and for alphabets of arbitrary size. The method operates on a computer algebra system, yielding closed-form expressions for all results. Finally, we show the potential of the proposed method to shed some light on the structure of the minimum redundancy achievable by the Huffman code.


I. INTRODUCTION
Let A be a discrete memoryless source of n ≥ 2 symbols, having an alphabet {a_1, . . . , a_n} with associated probability distribution {p(a_1), . . . , p(a_n)}. Let R_C(A) = L_C(A) − H(A) be the redundancy achieved by a code C, where L_C(A) = Σ_{i=1}^{n} p(a_i) l_C(a_i) is the average length of C when applied to source A, l_C(a_i) is the length of the codeword for a_i, and H(A) = −Σ_{i=1}^{n} p(a_i) log p(a_i) is the entropy of A. Without loss of generality, we assume that p(a_i) > 0 for all i and that all logarithms are base 2.
The Huffman Algorithm [1] can be employed to produce a Huffman code, H, for a source A. Huffman codes are prefix-free and are optimal in the sense that no other code can achieve a lower redundancy than R_H(A) when coding one symbol at a time.
Producing bounds on the redundancy of a Huffman code for which the underlying probability distribution is only partially known is a recurring theme in the literature. Multiple authors have described successively more accurate bounds on the redundancy of a Huffman code when only the probability of the most likely symbol is known. In [2], an upper bound is provided for this case. An improved upper bound and a lower bound are described in [3]. In particular, a tight lower bound is provided when the (known) probability of the most likely symbol is greater than 0.4. Bounds are further refined in [4], and [5] provides a lower bound that is tight for all values of the probability of the most likely symbol. Improvements on the upper bound are provided in [6]. In [7], the lower bound provided in [5] is improved for a known alphabet length. The concept of local redundancy, a useful tool to prove previous results through different means, is introduced in [8]. Further bound improvements are provided in [9], [10], [11], [12], [13]. Tight upper and lower bounds are provided for one known symbol (not necessarily the most probable symbol) in [14]. The lower bound is obtained through convex optimization over a small set of prefix-free codes sharing a singular structure. Interestingly, the tight lower bound for one known symbol is the same as the one obtained in [5] for the known most-likely-symbol case.
In this paper, we provide a method to obtain a tight bound on the minimum redundancy achievable by a Huffman code for a source in which the occurrence probability is only known for the symbols in a given (arbitrary) subset of A. We start by providing a method for alphabets of a given size n. We then extend it to any alphabet size less than or equal to a given value n, and then we extend it to the case of arbitrary alphabet sizes. Finally, we employ the proposed method to present novel redundancy maps, which provide a visual representation of minimum redundancy and insight into which code covers each minimum redundancy region.
The framework provided herein has potential for further interesting extensions, such as the imposition of more general constraints on probability distributions. For example, in addition to the equality constraints discussed in this work, the method can be extended to include inequality constraints, e.g., at least one probability is less than 0.1 and at least one probability is greater than 0.5.
We also hope that these bounds can help in understanding the structure of Huffman code redundancy, and in doing so shed some light on the still-challenging problem of finding optimal variable-to-variable codes (also referred to as Khodak's codes) [15], [16], [17], [18]. In the meantime, until stronger results are found, the work presented herein can be of use in pruning the search space whenever these codes are generated by brute force.

A. The Huffman Algorithm
For the purpose of establishing a common notation and being able to draw parallels, a description of the Huffman Algorithm follows for the (usual) case where all symbol probabilities of a source are known. For simplicity, and with some abuse of notation, we write A = {a_1, . . . , a_n} to describe a source A with alphabet {a_1, . . . , a_n} and associated occurrence probabilities {p(a_1), . . . , p(a_n)}. The number of symbols in the alphabet is denoted by |A|. The Huffman Algorithm is described here in terms of a state machine. In what follows, the current state Θ^(i) is successively updated by a state transition function h(·). At this point in the development, the state takes the form of a discrete source. Specifically, given a source A = {a_1, . . . , a_n}, the state transition function h(·) produces a new source by merging symbols a_j and a_k into a new symbol, denoted by [a_j, a_k], with associated probability p([a_j, a_k]) = p(a_j) + p(a_k), as follows:

Algorithm 1 (The Huffman Algorithm): Let A = {a_1, . . . , a_n} be a discrete source of n ≥ 2 symbols.

1) Set Θ^(0) = A and i = 0.
2) Find indices j and k, with 1 ≤ j < k ≤ n − i, of two elements in Θ^(i) such that no other element in Θ^(i) has a smaller occurrence probability. I.e., p(θ^(i)_j) ≤ p(θ^(i)_l) and p(θ^(i)_k) ≤ p(θ^(i)_l) for all l ∉ {j, k}.
3) Merge the two elements to obtain Θ^(i+1) = h(Θ^(i), j, k).
4) Set i = i + 1. If i < n − 1, go to step 2.
The result of this algorithm is a source Θ^(n−1) containing a single symbol θ^(n−1)_1 composed by the repeated merging of the original symbols. This symbol is equivalent to the well-known representation of a Huffman code in tree form. The codeword length l(a_i) of an original symbol a_i can be obtained from θ^(n−1)_1 by counting the number of "]" minus the number of "[" to the right of a_i. For example, for the case of A = {a_1, a_2, a_3, a_4, a_5} with p(a_1) = 0.10, p(a_2) = 0.21, p(a_3) = 0.15, p(a_4) = 0.30, and p(a_5) = 0.24, the Huffman Algorithm yields (within a permutation)

θ^(4)_1 = [[a_2, a_5], [a_4, [a_1, a_3]]].

The resulting codeword lengths are l(a_2) = l(a_5) = 4 − 2 = 2, l(a_4) = 3 − 1 = 2, and l(a_1) = l(a_3) = 3 − 0 = 3. By also taking "," into account, a similar scan can be applied to obtain the codewords themselves, yielding 110, 00, 111, 10, and 01 for a_1 to a_5, respectively.
It is worth noting that the condition j < k in step 2 avoids generating certain codes that are equivalent within a permutation to the codes produced by the algorithm. This facilitates complexity reductions as discussed in future sections.
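As an illustration, the following minimal Python sketch mirrors Algorithm 1, representing merged symbols as plain bracket strings; the function names and the string-based representation are illustrative choices rather than part of the original formulation. Run on the five-symbol source above, it reproduces the merged symbol and the codeword lengths.

```python
def huffman_merge_string(symbols, probs):
    """Algorithm 1 with merged symbols written as bracket strings; the merged
    symbol is appended at the end of the state, as in the example above."""
    state = list(zip(symbols, probs))                  # current state Theta^(i)
    while len(state) > 1:
        order = sorted(range(len(state)), key=lambda i: state[i][1])
        j, k = sorted(order[:2])                       # step 2: j < k, minimal probs
        merged = ("[%s,%s]" % (state[j][0], state[k][0]),
                  state[j][1] + state[k][1])           # step 3: merge
        state = [s for i, s in enumerate(state) if i not in (j, k)] + [merged]
    return state[0][0]

def codeword_lengths(theta, symbols):
    """l(a_i): count of ']' minus count of '[' to the right of a_i."""
    return {a: theta[theta.index(a) + len(a):].count("]")
               - theta[theta.index(a) + len(a):].count("[")
            for a in symbols}

syms = ["a1", "a2", "a3", "a4", "a5"]
theta = huffman_merge_string(syms, [0.10, 0.21, 0.15, 0.30, 0.24])
print(theta)                          # [[a2,a5],[a4,[a1,a3]]]
print(codeword_lengths(theta, syms))  # a1: 3, a2: 2, a3: 3, a4: 2, a5: 2
```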
To apply Algorithm 1 to a source A, the underlying probability distribution must be fully specified, which runs contrary to the objective of this work: operating with only partially known probability distributions. In Section II, we address the issue of partially known probability distributions for alphabets of given size n. In Section III, we consider the case of alphabets of size up to n, and provide a general bound for arbitrary alphabet sizes. In Section IV we provide an efficient method to calculate the general bound, and in Section V some examples are shown. Finally, conclusions are drawn in Section VI.

II. A BOUND ON HUFFMAN CODE REDUNDANCY
Suppose now that the underlying probability distribution of a discrete memoryless source A of n ≥ 2 symbols has m symbols with known probabilities, 0 ≤ m ≤ n, whereas the probability is unknown for the remaining n − m symbols. Notation and terminology for these partially known sources are established in the following definitions.
Definition 1 (Sub-source): Let A be a source with alphabet {a_1, . . . , a_n} and associated probabilities p(a_i). We define a sub-source of A as a subset of the symbols of the alphabet of A together with their associated probabilities p(a_i). If X = {x_1, . . . , x_m}, 0 ≤ m ≤ n, is a sub-source of A, we write X ⊑ A. The cardinality m of the sub-source is denoted by |X|. We note that the probabilities of a sub-source do not necessarily add up to 1.
Definition 2 (Complementary Sub-source): Given a sub-source X of a source A, the complementary sub-source Y of X with respect to A is the sub-source containing all the symbols of A not in X together with their associated probabilities.
A source A for which some symbol probabilities are known can be thought of as two complementary sub-sources: a sub-source X holding all symbols for which probabilities are known, and a complementary sub-source Y holding all symbols for which probabilities are unknown.
In the remainder of this section we describe how to obtain a bound on the lowest possible redundancy obtainable by a Huffman code for a source A of n ≥ 2 symbols, with m of the probabilities being known, 0 ≤ m ≤ n. We do this by considering a sub-source X of A, with |X| = m. We then find the lowest possible redundancy obtainable by a Huffman code over all sources B such that X ⊑ B and |B| = |A| = n. We formalize this bound in the definition below.
Definition 3 (Redundancy Bound for Sources of n Symbols): Let X be a sub-source of m symbols for a source A of n symbols, with 0 ≤ m ≤ n and n ≥ 2. Then

R^(n)_min(X) = min { R_H(B) : X ⊑ B, |B| = n }. (6)

Clearly, R^(n)_min(X) is a lower bound to the redundancy obtainable by a Huffman code for A. In what follows we describe a method to compute R^(n)_min(X).

Theorem 1:

R^(n)_min(X) = min_{C ∈ Φ^(n)} min { R_C(B) : X ⊑ B, |B| = n }, (7)

where Φ^(n) is the set of all possible Huffman codes that can be generated by Algorithm 1 for sources of n symbols.

Proof: As Algorithm 1 always yields an optimal code [1], a Huffman code is included in Φ^(n) for every B, |B| = n. Otherwise, there would be a source B for which Algorithm 1 does not yield the optimal code. Hence,

R_H(B) = min_{C ∈ Φ^(n)} R_C(B),

which substituted into (6) yields

R^(n)_min(X) = min { min_{C ∈ Φ^(n)} R_C(B) : X ⊑ B, |B| = n }.

The minimization operations can be permuted, which yields (7).
While its proof is simple, the implication of Theorem 1 is significant. In particular, it allows the problem of calculating R^(n)_min(X) to be split into two smaller problems: (a) obtaining the set Φ^(n), and (b) computing the inner minimization of (7) for a given C ∈ Φ^(n).
We address problem (a) in Subsection II-A, where we show how to obtain Φ^(n) through exhaustive enumeration. As for problem (b), we show in Subsection II-B that it can be posed as a constrained convex optimization problem and solved accordingly. In subsequent sections we show how to obtain a dramatically smaller, yet still sufficient, subset of Φ^(n).
Corollary 1: The bound provided by Theorem 1 is tight.
Proof: For the code C ∈ Φ^(n) that minimizes the outer minimization operation in (7), solving the convex optimization problem in (b) yields both the minimum redundancy value and the probabilities p(b_i) that achieve it. Thus there is an example of a source B = {b_1, . . . , b_n} for which R_C(B) lies on the bound.

A. All Huffman Codes of n Codewords
The following algorithm generates a set of prefix-free codes of n codewords which includes all possible Huffman codes that can be produced by Algorithm 1. The algorithm explores every possible state trajectory of Algorithm 1. It does so by iterating over sets Φ^(i), where Φ^(i) contains all states that can be reached after i − 1 merging operations.
Algorithm 2: Let A = {a_1, . . . , a_n} be an alphabet of n ≥ 2 symbols, and let Φ^(i) be a set of states.

1) Set Φ^(1) = {A} and i = 1.
2) For each state Θ ∈ Φ^(i), and for each pair of indices (j, k) with 1 ≤ j < k ≤ n − i + 1, add the state h(Θ, j, k) to Φ^(i+1).
3) Set i = i + 1.
4) If i < n, go to step 2.
5) Stop.
The result of this algorithm is Φ^(n), which is a set of states. Each such state is an alphabet containing a single symbol θ composed by the repeated merging of the original symbols. As described previously with respect to Algorithm 1, each such θ is equivalent to a prefix-free code in the form of a tree.
For example, for the case of A = {a_1, a_2, a_3}, the algorithm yields

Φ^(3) = { [a_3, [a_1, a_2]], [a_2, [a_1, a_3]], [a_1, [a_2, a_3]] }.

Examination of Φ^(3) yields three different prefix-free codes. Each code has two codewords of length 2 and one codeword of length 1.
With some abuse of notation, we employ Φ^(n) to denote the set of associated prefix-free codes employed in the outer minimization of (7) in Theorem 1. In the following subsection, we tackle the inner minimization.
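The enumeration can be sketched in a few lines of Python, reusing the string-based state representation of the earlier sketch; the final assertion checks the cardinality |Φ^(n)| = n!(n−1)!/2^(n−1) discussed in Section IV. The function name is an illustrative choice.

```python
from math import factorial

def enumerate_codes(symbols):
    """Algorithm 2: one bracket string per possible merging trajectory."""
    states = [tuple(symbols)]                  # Phi^(1) = {A}
    while len(states[0]) > 1:
        next_states = []
        for state in states:
            for j in range(len(state)):
                for k in range(j + 1, len(state)):
                    merged = "[%s,%s]" % (state[j], state[k])
                    rest = tuple(s for i, s in enumerate(state)
                                 if i not in (j, k))
                    next_states.append(rest + (merged,))  # append merged symbol
        states = next_states
    return [s[0] for s in states]

print(enumerate_codes(["a1", "a2", "a3"]))     # the three codes of Phi^(3)
n = 4
assert len(enumerate_codes(["a1", "a2", "a3", "a4"])) \
       == factorial(n) * factorial(n - 1) // 2 ** (n - 1)  # 18 trajectories
```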

B. Convex Optimization
Given a sub-source X and a code C ∈ Φ^(n), we denote the inner minimization of (7) by F(X, C). That is,

F(X, C) = min { R_C(B) : X ⊑ B, |B| = n }. (14)

Simply put, for a given prefix-free code C having known codeword lengths 0 < l_C(b_i) < ∞, and a sub-source X having known (strictly positive) probabilities, we seek a source B for which C achieves minimum redundancy, under the constraint that X is a sub-source of B.
Let X = {x_1, . . . , x_m} and let Y = {y_1, . . . , y_{n−m}} be the complementary sub-source of X with respect to B. For now, we assume that there is at least one unknown probability, i.e., m < n, and address the case m = n later.
Since the codeword lengths l_C(b_i) are known, and the probabilities p(b_i) are known for the symbols that lie in X ⊑ B, the redundancy R_C(B) can be rewritten as

R_C(B) = β_0 + Σ_{i=1}^{n−m} p(y_i) (β_i + log p(y_i)), (16)

where β_0 and β_i are constants, given by

β_0 = Σ_{i=1}^{m} p(x_i) (l_C(x_i) + log p(x_i)) (17)

and

β_i = l_C(y_i). (18)

Defining the additional constant

β_T = 1 − Σ_{i=1}^{m} p(x_i), (19)

the minimization problem can be formally posed as the following inequality constrained convex minimization problem:

minimize β_0 + Σ_{i=1}^{n−m} p(y_i) (β_i + log p(y_i))
subject to Σ_{i=1}^{n−m} p(y_i) = β_T and 0 ≤ p(y_i) ≤ 1 for 1 ≤ i ≤ n − m. (20)

It is worth noting that β_T represents the total probability of the symbols with unknown probabilities. Under our current assumption that m < n, β_T is strictly positive.
To solve the constrained minimization problem, we initially ignore the inequality constraints and proceed via the method of Lagrange multipliers. Setting

∂/∂p(y_i) [ Σ_{j=1}^{n−m} p(y_j) (β_j + log p(y_j)) − λ ( Σ_{j=1}^{n−m} p(y_j) − β_T ) ] = 0

and solving for the Lagrange multiplier, we get λ = β_i + log p(y_i) + log e, which can be rearranged into

p(y_i) = 2^{λ−β_i} / e.

Substituting this last expression into the equality constraint we obtain Σ_{i=1}^{n−m} 2^{λ−β_i} / e = β_T, or

2^λ / e = β_T / Σ_{j=1}^{n−m} 2^{−β_j},

and hence

p(y_i) = β_T 2^{−β_i} / Σ_{j=1}^{n−m} 2^{−β_j}. (24)

Since 0 < β_T ≤ 1, we can see that 0 < p(y_i) ≤ 1, so the ignored inequality constraints are satisfied. Hence, we have obtained a solution to the problem posed in (20).
Substituting (24) into (16) yields a closed-form expression for the achieved minimum:

F(X, C) = β_0 + β_T log ( β_T / Σ_{i=1}^{n−m} 2^{−β_i} ). (25)

It is easy to see that the analysis above remains correct for the case of m = n that was left untackled. In this case, (16) and (14) reveal that F(X, C) = β_0. Noting that (19) gives β_T = 0 and defining 0 log 0 to be 0 (as is often done in the literature), the convex optimization result of (25) also yields F(X, C) = β_0.
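The closed form (25) is straightforward to evaluate; the following Python sketch does so numerically (a CAS such as SymPy could evaluate the same expression symbolically). The function name and argument conventions are illustrative.

```python
from math import log2

def F(known, unknown_lengths):
    """Closed-form inner minimum (25). known: list of (p(x_i), l_C(x_i));
    unknown_lengths: codeword lengths l_C(y_i) of the unknown symbols."""
    beta_0 = sum(p * (l + log2(p)) for p, l in known)   # (17)
    beta_T = 1 - sum(p for p, _ in known)               # (19)
    if not unknown_lengths or beta_T <= 0:              # the m = n case
        return beta_0
    s = sum(2.0 ** -l for l in unknown_lengths)         # sum of 2^{-beta_i}
    return beta_0 + beta_T * log2(beta_T / s)           # (25)

# One known symbol of probability 1/2 and codeword length 1, plus one unknown
# symbol of length 1: the source {1/2, 1/2} is reachable, so the minimum is 0.
print(F([(0.5, 1)], [1]))  # 0.0
```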
At this point, we have a method for computing a tight lower bound on the redundancy of a Huffman code for a source having n ≥ 2 symbols, where the occurrence probability is known for only m of the symbols, 0 ≤ m ≤ n. The method consists of applying the convex optimization described above to each code in Φ^(n), and then taking the minimum over all such codes, as prescribed by (7).
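Combining the enumeration with the closed form gives a brute-force evaluation of (7); the sketch below reuses enumerate_codes(), codeword_lengths(), and F() from the previous sketches. Identifying the known symbols with a_1, . . . , a_m loses no generality, because Φ^(n) contains every labeling of every code shape.

```python
def R_n_min(known_probs, n):
    """Brute-force R^(n)_min(X) via (7), reusing the earlier sketches."""
    syms = ["a%d" % (i + 1) for i in range(n)]
    best = float("inf")
    for code in enumerate_codes(syms):
        lengths = codeword_lengths(code, syms)
        known = [(p, lengths[syms[i]]) for i, p in enumerate(known_probs)]
        unknown = [lengths[syms[i]] for i in range(len(known_probs), n)]
        best = min(best, F(known, unknown))
    return best

print(R_n_min([0.5], 2))  # 0.0: the source {1/2, 1/2} has zero redundancy
print(R_n_min([0.9], 2))  # about 0.531, forced by the 0.9/0.1 split
```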
In subsequent sections, we discuss some practical issues, including the reduction of complexity that may arise when the cardinality of Φ (n) is large. Before proceeding to these issues however, we generalize the method to the case when the size of the source is unknown.

III. GENERAL BOUND
In this section, we formalize a method to obtain a bound on the minimum redundancy achievable by a Huffman code for a source of arbitrary size. That is, the bound holds for all n ≥ 2 such that n ≥ m. As before, m is the number of symbols with known occurrence probabilities.
We begin with a simpler bound, R^(≤n)_min(X), on the lowest possible redundancy obtainable by a Huffman code for any source A having from 2 to n symbols, where X ⊑ A is the sub-source of A containing the m symbols with known probabilities.
Definition 4 (Redundancy Bound for Sources of up to n Symbols): Let X be a sub-source of size m for a source A, where 2 ≤ |A| ≤ n. Then

R^(≤n)_min(X) = min { R_H(B) : X ⊑ B, 2 ≤ |B| ≤ n }.

It is straightforward to see that R^(≤n)_min(X) can be obtained as

R^(≤n)_min(X) = min_{max(2,m) ≤ n′ ≤ n} R^(n′)_min(X),

and that it is in fact tight. This simple bound paves the way to the desired bound, R*_min(X), which is a tight bound on the lowest possible redundancy obtainable by a Huffman code regardless of the cardinality of A.
Definition 5 (General Redundancy Bound for Any Source): Let X be a sub-source for a source A. We define

R*_min(X) = min_{n ≥ max(2, |X|)} R^(n)_min(X). (28)

The calculation of R*_min(X) is, in fact, the main objective of this manuscript. In the following theorem, we show that R*_min(X) = R^(≤n)_min(X) for all values of n greater than or equal to a certain threshold T(X).

Theorem 2: Let X be a sub-source with |X| ≥ 1. For all n ≥ T(X),

R*_min(X) = R^(≤n)_min(X),

where

T(X) = |X| + ⌈ β_T / min_{1 ≤ i ≤ |X|} p(x_i) ⌉, (30)

with the particular case of R*_min(X) = 0 when |X| = 0. Before proceeding to the proof of Theorem 2, some necessary lemmas are presented.
Lemma 1: Given a sub-source X and its complementary sub-source Y with respect to a source A, with n = |A| > T(X), m = |X| ≥ 1, and n − m = |Y| ≥ 1, there are at least two elements in Y with probabilities strictly smaller than min{p(x_1), . . . , p(x_m)}.
Proof: Without loss of generality, we assume p(x_1) = min{p(x_1), . . . , p(x_m)}. Suppose first that p(x_1) ≤ p(y_j) for all j. Then

Σ_{j=1}^{n−m} p(y_j) ≥ (n − m) p(x_1),

and, since n > T(X), (30) yields n − m > β_T / p(x_1), and so

Σ_{j=1}^{n−m} p(y_j) > β_T.

But, by the definition of β_T in (19), Σ_{j=1}^{n−m} p(y_j) = β_T, which is impossible. Hence, p(x_1) > p(y_j) for at least some index j.

Suppose now that p(x_1) > p(y_j) for exactly one index j, i.e., that p(x_1) ≤ p(y_i) for all i ≠ j. Since n − m > β_T / p(x_1) implies n − m − 1 ≥ β_T / p(x_1), we have

Σ_{i ≠ j} p(y_i) ≥ (n − m − 1) p(x_1) ≥ β_T,

and so

p(y_j) = β_T − Σ_{i ≠ j} p(y_i) ≤ 0,

which is impossible because probabilities are strictly positive. Hence, p(x_1) > p(y_i) for at least two elements in Y.
Lemma 2: Given a source A of n ≥ 3 symbols for which symbols a j , a k ∈ A are merged in step 3 of Algorithm 1, the source B = (A \ {a j , a k })∪{b q }, with p(b q ) = p(a j )+p(a k ), has R H (B) ≤ R H (A).
Proof: Let Θ^(i)_A and Θ^(i)_B be the states of Algorithm 1 at iteration i when it is executed, respectively, for sources A and B. Since p(a_j) < p(b_q) and p(a_k) < p(b_q), Algorithm 1 performs the same merging steps for both A and B before the algorithm merges a_j and a_k from A into [a_j, a_k] at some iteration l. Once a_j and a_k have been merged by the algorithm for A, states Θ^(l+1)_A and Θ^(l)_B have an equal number of symbols with identical occurrence probabilities. Thus, both executions continue to produce identical outcomes, except that "[a_j, a_k]" is substituted by "b_q" in the Huffman code for B. This implies that

l_H(a_j) = l_H(a_k) = l_H(b_q) + 1, (42)

while the codeword lengths of all other symbols coincide. Consequently, R_H(B) ≤ R_H(A) can be written as

Σ_{b_i ∈ B} p(b_i) (l_H(b_i) + log p(b_i)) ≤ Σ_{a_i ∈ A} p(a_i) (l_H(a_i) + log p(a_i)). (43)

In what follows, a series of operations are carried out, each of which holds only if (43) holds. Canceling equal terms on both sides of (43) and using (42) yields

p(b_q) (l_H(b_q) + log p(b_q)) ≤ p(b_q) (l_H(b_q) + 1) + p(a_j) log p(a_j) + p(a_k) log p(a_k).

Dividing by p(b_q) on both sides and rearranging results in

− (p(a_j)/p(b_q)) log (p(a_j)/p(b_q)) − (p(a_k)/p(b_q)) log (p(a_k)/p(b_q)) ≤ 1. (46)

The left-hand side of (46) is the entropy H(D) of a binary source D = {d_1, d_2}, with p(d_1) = p(a_j)/p(b_q) and p(d_2) = p(a_k)/p(b_q), which can be no greater than 1. Hence (46) always holds, and so does (43).
We now proceed to prove Theorem 2.
Proof: Consider first the case of |X| = 0, when all probabilities are unknown. It is trivial to see that R*_min(X) = 0, as redundancy must be at least 0 by definition, and B = {b_1, b_2} with p(b_1) = p(b_2) = 0.5 is an example where R_H(B) = 0.
Suppose now that |X| > 0. We will show that for any source B such that X ⊑ B with |B| > T(X), there is a source C such that X ⊑ C, with |C| = T(X), having R_H(C) ≤ R_H(B). Thus, it is not necessary to consider any source with |B| > T(X) in (28).
Let Y be the complementary sub-source of X with respect to B, and assume without loss of generality that p(x_1) = min{p(x_1), . . . , p(x_m)}. By assumption, |B| > T(X), and so from (30), |X| < |B|. Thus, from Lemma 1, there exist i and j such that p(y_i) < p(x_1) and p(y_j) < p(x_1). That is, there exist two symbols in Y with probabilities smaller than that of any symbol in X. Thus, the first two symbols merged by the Huffman Algorithm, as applied to B, must come from Y. Without loss of generality, we choose y_i and y_j to be these two symbols.
Since B contains y_i and y_j, it follows that |B| ≥ |X| + 2 ≥ 3, and so from Lemma 2, we can conclude that the source C = (B \ {y_i, y_j}) ∪ {c}, with p(c) = p(y_i) + p(y_j), has R_H(C) ≤ R_H(B), with X ⊑ C and |C| = |B| − 1.
Simply put, the Huffman code for source C is no more redundant than the Huffman code for source B. Thus, it is not necessary to consider source B in the minimization of (28).
This procedure can be repeated until |C| = T (X).

IV. EFFICIENT ENUMERATION OF PREFIX-FREE CODES
The previously presented methods to obtain bounds on Huffman code redundancy rely on Algorithm 2 for the exhaustive enumeration of all possible Huffman codes Φ^(n) for an alphabet of size n. This enumeration rapidly becomes untenable, since |Φ^(n)| = n! · (n − 1)! / 2^(n−1), with the factorial functions dominating the growth rate. This is even more problematic for the R*_min bound, as all codes need to be enumerated for all alphabet sizes n up to T(X). While T(X) may be small when the elements in X have large probabilities, (30) implies that T(X) − |X| is inversely proportional to the smallest probability in X, and thus T(X) can be large for small probabilities. For example, a case as simple as X = {x_1} with p(x_1) = 0.01 yields T(X) = 100 and |Φ^(100)| ≈ 10^284. Clearly, Φ^(n) cannot be calculated by exhaustive means.
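The figures above are easy to verify; the following sketch assumes the threshold formula (30) as reconstructed in Section III and uses exact log-factorials (via lgamma) to size Φ^(100). The function names are illustrative.

```python
from math import ceil, lgamma, log, log10

def T(known_probs):
    """Threshold (30), as reconstructed: |X| + ceil(beta_T / min p(x_i))."""
    beta_T = 1 - sum(known_probs)
    return len(known_probs) + ceil(beta_T / min(known_probs))

def log10_num_codes(n):
    """log10 |Phi^(n)| = log10( n! (n-1)! / 2^(n-1) ), via log-factorials."""
    return (lgamma(n + 1) + lgamma(n)) / log(10) - (n - 1) * log10(2)

print(T([0.01]))                    # 100
print(round(log10_num_codes(100)))  # 284, i.e., |Phi^(100)| is about 10^284
```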
In this section we show how to reduce the complexity of Algorithm 2 by eliminating codes that are provably unnecessary for the purpose of obtaining R * min (X). This is accomplished by effectively managing the internal states of the algorithm via the following two strategies.
(i) We disregard states resulting from the merging of two symbols in the sub-source complementary to X, as it can be seen through Lemma 2 that there is a complementary sub-source to X with one less symbol having lower or equal redundancy. For this reason, we stop considering any state trajectory as soon as there is a merging of symbols which does not involve either a symbol resulting from a previous merge or one of the original symbols in X.
(ii) We keep track of constraints on probabilities that arise during Algorithm 2, which allows us to prune state trajectories that cannot yield any viable code.
For example, given sub-source X = {a_1, a_2} with p(a_1) = p(a_2) = 0.4, Algorithm 2 produces (among others) Φ^(3) over alphabet A = {a_1, a_2, a_3}. In this case, code [[a_1, a_2], a_3] is in Φ^(3). For this code, a_1 and a_2 are merged first, which implies that p(a_1) ≤ p(a_3) and p(a_2) ≤ p(a_3). These inequalities, together with 0 ≤ p(a_i) ≤ 1, Σ_i p(a_i) = 1, p(a_1) = 0.4, and p(a_2) = 0.4, yield an inconsistent system: p(a_3) must be 0.2, which is incompatible with p(a_1) ≤ p(a_3). Thus, the code cannot be a Huffman code for any source that has X as a sub-source, and it can be ignored in an efficient enumeration of such codes.
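The inconsistency in this example can be checked with exact rational arithmetic, as a CAS would do; the short sketch below is purely illustrative.

```python
from fractions import Fraction as Fr

# For code [[a1,a2],a3] with p(a1) = p(a2) = 2/5, the sum-to-one constraint
# forces p(a3) = 1/5, which violates the merge-order constraint p(a1) <= p(a3).
p1, p2 = Fr(2, 5), Fr(2, 5)
p3 = 1 - p1 - p2                    # forced value: 1/5
print(p3, p1 <= p3 and p2 <= p3)    # 1/5 False -> prune this code
```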
In the remainder of this section we describe a modified version of Algorithm 2 that applies these two strategies to dramatically reduce implementation complexity.

A. Extended State
First, we define two concepts that enable the application of the aforementioned strategies. The first definition is a partition of each state employed in Algorithm 2 into known symbols and unknown symbols.
Definition 6 (State partition): Let Θ be a state (alphabet) and let X be a sub-source. Based on X, we define a partition of Θ into two complementary subsets: -a set of known symbols K, containing all symbols in Θ that are either in X or that are the result of one or more merging operations, with at least one of the symbols involved being a symbol in X; and -a set of unknown symbols U = Θ \ K.
Let κ i denote an element in K and u i denote an element in U .
The second definition is that of an extended state.
Definition 7 (Extended State): A triplet Θ̈ = (K, s, Z) is an extended state, where K is an alphabet, s is an integer, and Z is a set of linear inequalities. Let functions K(·), S(·), and Z(·) denote each of the elements of the triplet. That is, K(Θ̈) = K, S(Θ̈) = s, and Z(Θ̈) = Z.

Extended states replace states (alphabets) in Algorithm 2 by providing an alternative representation of the alphabet and augmenting it with constraints. Specifically, an extended state Θ̈ contains the set of known symbols K of an equivalent non-extended state Θ, together with an integer s which indicates the number of symbols that have been drawn from U at any point in the merging process. In this formulation, symbols in U can be thought of as being created when needed. In addition, an extended state also contains a set, Z, containing all inequalities that have arisen through previous mergings of symbols, so that an extended state with an inconsistent Z can be discarded according to strategy (ii).
Following the idea that symbols in U are created when needed in a merging operation, we can consider three cases for merging symbols θ_i, θ_j ∈ Θ in step 2 of Algorithm 2: (a) both symbols belong to K; (b) one symbol belongs to K and the other to U; and (c) both symbols belong to U. Per strategy (i), it is not necessary to ever consider case (c).
One new symbol in U is created each time case (b) is applied; thus, the integer s records how many case-(b) merges have been performed. It is clear that even though the probabilities of the created symbols u_i are unknown, they must be non-decreasing (p(u_i) ≤ p(u_{i+1})), starting with the first symbol drawn from U, which we denote by u_0.
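One possible in-memory representation of an extended state, used by the sketches that follow, is given below; the field names are illustrative, and inequalities are stored as ordered pairs (a, b) meaning p(a) ≤ p(b).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtState:
    K: tuple                    # known symbols, as bracket strings
    s: int = 0                  # number of symbols drawn from U so far
    Z: frozenset = frozenset()  # accumulated inequalities p(a) <= p(b)

start = ExtState(K=("x1", "x2"))  # sub-source X = {x1, x2}, before any merge
```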

B. State Transition Functions for Extended States
The state transition function h(·) is now modified to operate over extended states, resulting in separate treatment for each of the three cases of symbol merging operations described in the previous subsection.
For case (a), the state transition function h_a(Θ̈, i, j) merges elements κ_i and κ_j in K(Θ̈) as follows:

h_a(Θ̈, i, j) = ( (K(Θ̈) \ {κ_i, κ_j}) ∪ {[κ_i, κ_j]}, S(Θ̈), Z(Θ̈) ∪ Ω_1 ∪ Ω_2 ∪ Ω_3 ∪ Ω_4 ),

with

Ω_1 = { p(κ_i) ≤ p(κ_l) : κ_l ∈ K(Θ̈), l ∉ {i, j} },
Ω_2 = { p(κ_j) ≤ p(κ_l) : κ_l ∈ K(Θ̈), l ∉ {i, j} },
Ω_3 = { p(κ_i) ≤ p(u_{S(Θ̈)}) },

and

Ω_4 = { p(κ_j) ≤ p(u_{S(Θ̈)}) }.

Sets Ω_1 to Ω_4 contain inequalities stating that no other symbol has a smaller occurrence probability than p(κ_i) and p(κ_j). Elements in K(Θ̈) are covered by Ω_1 and Ω_2, while elements (implicitly) in U are covered by Ω_3 and Ω_4. As elements in U are created in non-decreasing order of probability, the inequalities involving u_{S(Θ̈)}, the next symbol that would be drawn from U, suffice to cover all symbols of U not yet created.

For case (b), the state transition function h_b(Θ̈, i) merges element κ_i in K(Θ̈) with a newly created symbol u_{S(Θ̈)} as follows:

h_b(Θ̈, i) = ( (K(Θ̈) \ {κ_i}) ∪ {[κ_i, u_{S(Θ̈)}]}, S(Θ̈) + 1, Z(Θ̈) ∪ Ω_5 ∪ Ω_6 ),

with

Ω_5 = { p(κ_i) ≤ p(κ_l) : κ_l ∈ K(Θ̈), l ≠ i }

and

Ω_6 = { p(u_{S(Θ̈)}) ≤ p(κ_l) : κ_l ∈ K(Θ̈), l ≠ i }.

It is not necessary to give the state transition function for case (c), as per strategy (i) it only yields extended states not worth considering.
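Under the representation introduced above, the two transition functions can be sketched as follows; note that these definitions follow the reconstruction given in this subsection and are therefore illustrative rather than definitive.

```python
def h_a(t: ExtState, i: int, j: int) -> ExtState:
    """Case (a): merge known symbols kappa_i and kappa_j."""
    ki, kj, u = t.K[i], t.K[j], "u%d" % t.s        # u: next symbol from U
    rest = tuple(k for l, k in enumerate(t.K) if l not in (i, j))
    omega = {(ki, k) for k in rest} | {(kj, k) for k in rest} \
            | {(ki, u), (kj, u)}                    # Omega_1 .. Omega_4
    return ExtState(rest + ("[%s,%s]" % (ki, kj),), t.s, t.Z | omega)

def h_b(t: ExtState, i: int) -> ExtState:
    """Case (b): merge known symbol kappa_i with a new symbol from U."""
    ki, u = t.K[i], "u%d" % t.s
    rest = tuple(k for l, k in enumerate(t.K) if l != i)
    omega = {(ki, k) for k in rest} | {(u, k) for k in rest}  # Omega_5, Omega_6
    return ExtState(rest + ("[%s,%s]" % (ki, u),), t.s + 1, t.Z | omega)
```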

C. Algorithm
Following the aforementioned strategies and definitions, Algorithm 2 is modified into a new procedure, Algorithm 3, which operates on extended states.
Once a given extended state Θ̈ is reached with |K(Θ̈)| = 1, K(Θ̈) contains a single symbol, which is equivalent to a prefix-free code of |X| + S(Θ̈) codewords. The redundancy for this code is lower bounded as described in Subsection II-B. The state transition function h_a cannot be further applied to Θ̈, but h_b may still be applied, producing a new state representing a prefix-free code of |X| + S(Θ̈) + 1 codewords. Thus, finding an extended state equivalent to a prefix-free code does not terminate a state trajectory.
State trajectories are terminated when |X| + S(Θ̈) > T(X) (as per Theorem 2); thus, the algorithm is guaranteed to stop. In addition, as per strategy (ii), state trajectories are terminated when Z(Θ̈) becomes inconsistent, significantly limiting the number of extended states to examine.
These modifications yield a new algorithm capable of generating a set of codes sufficient to calculate R*_min(·) without having to fully enumerate Φ^(n).

Once Algorithm 3 stops, Ψ is a set of extended states containing sufficient codes to yield the least possible redundancy for every source B having sub-source X. R*_min is then computed by applying the convex optimization of Subsection II-B to each code in Ψ, and taking the minimum of all such minimization results. This is facilitated by the fact that Ψ is only a small subset of

⋃_{n=2}^{T(X)} Φ^(n), (58)

which is what would need to be examined in the exhaustive case.
The reduction in the number of elements in Ψ as compared to those in (58) is substantial, even if not easily given in closed form. Following the example at the beginning of this section, for the case of X = {x_1} with p(x_1) = 0.01, we obtain |Ψ| = 11, which is clearly smaller than |Φ^(100)| ≈ 10^284 and, by extension, the cardinality of ⋃_{i=2}^{100} Φ^(i).
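The pruned enumeration itself can be sketched as a simple depth-first search over extended states, reusing ExtState, h_a, h_b, and T from the earlier sketches. For brevity, the consistency test below is a stub that accepts every state; Subsection IV-D describes how to implement it soundly via linear programming.

```python
def is_consistent(t: ExtState) -> bool:
    return True  # placeholder: see Subsection IV-D for the LP-based test

def enumerate_pruned(X, T_X):
    """Collect extended states equivalent to prefix-free codes (the set Psi)."""
    psi, stack = [], [ExtState(K=tuple(X))]
    while stack:
        t = stack.pop()
        if len(X) + t.s > T_X or not is_consistent(t):
            continue                          # terminate this state trajectory
        if len(t.K) == 1 and t.K[0].startswith("["):
            psi.append(t)                     # a code of |X| + s codewords
        for i in range(len(t.K)):
            for j in range(i + 1, len(t.K)):
                stack.append(h_a(t, i, j))    # case (a): merge two known symbols
            stack.append(h_b(t, i))           # case (b): merge with a new u_s
    return psi

# Example: one known probability p(x1) = 1/2 gives T(X) = 2, so only the
# two-codeword code [x1, u0] survives the pruning.
print(len(enumerate_pruned(["x1"], T([0.5]))))  # 1
```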

D. Consistency Verification
The final consideration in this section is that of evaluating whether a system of inequality constraints is consistent or not. We have intentionally overlooked this issue up to this point, as it is a self-contained problem. In this subsection, we formalize the problem, and then describe how to solve it efficiently.
Given an extended state Θ̈, we want to know whether the following system of inequalities is consistent:

p(x_i) = P_i, 1 ≤ i ≤ |X|,
all inequalities in Z(Θ̈),
Σ_i p(x_i) + Σ_i p(u_i) ≤ 1,
0 ≤ p(u_i) ≤ p(u_{i+1}), 0 ≤ i < S(Θ̈). (59)

The system is composed of all known probabilities given by X, here denoted by P_i, of all constraints due to merging operations in Z(Θ̈), and of two additional constraints. The first additional constraint, Σ_i p(x_i) + Σ_i p(u_i) ≤ 1, ensures that the probabilities for a prefix-free code deriving from Θ̈ do not grow over 1. It is not a strict equality, as additional symbols from U may be added to Θ̈ by h_b. The second additional constraint, 0 ≤ p(u_i) ≤ p(u_{i+1}), enforces the non-decreasing order of probability in which symbols are drawn from U, up to and including the next symbol u_{S(Θ̈)} referenced by the merge constraints.
It is well known that the problem of determining the feasibility of the previous inequality system can be posed as a linear programming problem as follows:

minimize f(x) subject to Ax − b ⪯ 0, (60)

where f(x) is an arbitrary function of x, and Ax − b ⪯ 0 are inequalities equivalent to those in (59), with ⪯ denoting element-wise comparison. We employ bold symbols to denote vectors and matrices, and to distinguish them from those of previous sections.
Finding any solution, x*, for (60) yields a point satisfying all inequalities. Conversely, if there is no solution to (60), then the system in (59) is inconsistent. In our case, solving (60) is not straightforward. Up to this point, a Computer Algebra System (CAS) can perform all operations described in this manuscript, thus ensuring that closed-form solutions can be found for all redundancy bounds. Unfortunately, efficient linear program solvers are of a numerical nature, and as such the optimization can incorrectly fail to reach a solution due to numerical problems. This could incorrectly discard valid extended states in Algorithm 3, resulting in an incorrect calculation of R*_min. Note that the opposite could also occur, but does not represent a serious problem: an invalid solution could be reached, incorrectly preserving an inconsistent extended state, but this would only increase the size of Ψ and would not alter the validity of its use to find R*_min.
To address the issue of numerical solvers, we can reformulate the linear program, through duality [19], into a problem where finding a suitable solution (regardless of its optimality) proves that no solution is possible for (60). First, we reformulate (60) into the following so-called Phase I problem:

minimize γ subject to Ax − b ⪯ γ1, (61)

where 1 denotes the all-ones vector. If there exists an optimal solution (γ*, x*) with γ* > 0, then there is no γ less than γ* that satisfies the constraints, and thus Ax − b ⪯ 0 cannot be valid.
For (61), a numerical solver would yield only an approximate solution (γ̃*, x̃*), from which it would not be possible to infer whether γ* > 0 is true or not. To address this, we employ the dual problem of (61):

maximize g(λ) = −b^T λ subject to A^T λ = 0, 1^T λ = 1, λ ⪰ 0 (62)

(see [19, p. 225]).
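As a concrete illustration, the sketch below solves (62) with scipy.optimize.linprog for a deliberately inconsistent pair of constraints (x ≥ 2/5 and x ≤ 1/5) and then verifies the resulting certificate exactly with rational arithmetic. The rounding via limit_denominator is an illustrative shortcut; a failed exact verification would simply leave the consistency question open, as discussed next.

```python
import numpy as np
from fractions import Fraction
from scipy.optimize import linprog

def infeasibility_certificate(A, b):
    """Solve the dual (62): minimize b^T lam s.t. A^T lam = 0, sum(lam) = 1,
    lam >= 0. A strictly negative optimum is a candidate proof that
    A x <= b has no solution; None means no conclusion can be drawn."""
    m, n = A.shape
    res = linprog(c=b,
                  A_eq=np.vstack([A.T, np.ones((1, m))]),
                  b_eq=np.concatenate([np.zeros(n), [1.0]]),
                  bounds=[(0, None)] * m)
    if not res.success or res.fun >= 0:
        return None
    return res.x

# Deliberately inconsistent system: -x <= -2/5 and x <= 1/5.
A = np.array([[-1.0], [1.0]])
b = np.array([-0.4, 0.2])
lam = infeasibility_certificate(A, b)

# Exact verification with rationals (weak duality / Farkas): lam >= 0,
# A^T lam = 0, and b^T lam < 0 together prove that no x satisfies A x <= b.
lam_q = [Fraction(v).limit_denominator(10 ** 6) for v in lam]
A_q = [[Fraction(str(v)) for v in row] for row in A]
b_q = [Fraction(str(v)) for v in b]
assert all(l >= 0 for l in lam_q)
assert all(sum(A_q[i][j] * lam_q[i] for i in range(len(lam_q))) == 0
           for j in range(A.shape[1]))
assert sum(b_q[i] * lam_q[i] for i in range(len(b_q))) < 0
print("system proven inconsistent")
```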
Again, a numerical solver yields only an approximate solution λ̃* for (62). However, this is sufficient for our purposes. The solution given by the numerical solver is a vector of rational numbers, which can be safely manipulated with a CAS. Hence, if a solution yields g(λ̃*) > 0, we can conclude that (59) is inconsistent and terminate the corresponding state trajectory. Should a solution fail to satisfy g(λ̃*) > 0, or the numerical method fail to reach a solution, we cannot make claims regarding the consistency of (59); in this case, we do not terminate the state trajectory.

V. EXAMPLES

The bound R*_min is plotted in Fig. 1 for a few cases of one, two, or three known probabilities. For the case of one known probability, the best bounds available in the literature are given in [5, Fig. 1] and [14, Fig. 4]. The proposed method yields, as expected, the same results.
It can be seen that as the sum of the known probabilities approaches 1, redundancy quickly rises. Redundancy falls again once the sum reaches 1. The rise is caused by the prefix-free code having to account for one additional symbol with a very low associated probability.
For the cases of p(x_2) = 7/10 and p(x_2) = 1/2, the respective curves are identical to that of |X| = 1, except for a constant shift along the y-axis and a difference in scale. It can easily be seen that this is the result of x_2 being merged in the last step of the Huffman Algorithm in both cases. In fact, this can be seen to be true whenever the known probabilities can be arranged in a sequence where each successive probability is larger than the sum of all remaining probability, i.e., p(x_i) > 1 − Σ_{j ≤ i} p(x_j). For example, this holds for the case shown when p(x_2) = 7/10 and p(x_3) = 1/5, where 7/10 > 1 − 7/10 = 3/10 and 1/5 > 1 − 7/10 − 1/5 = 1/10. However, it does not hold for the case when p(x_2) = 1/5, as evidenced by the lack of smoothness at p(x_1) ≈ 0.4581, caused by a transition between optimal prefix-free codes.
We can use the examples in Fig. 1 to disprove that minimum redundancy can be estimated from the bounds for each known probability considered in isolation, e.g., that

R*_min({x_1, . . . , x_m}) = max_{1 ≤ i ≤ m} R*_min({x_i}).

This implies that considering known probabilities independently, one at a time, does not yield a good estimator of minimum redundancy. For example, given X = {x_1, x_2} with p(x_1) = 0.49 and p(x_2) = 0.5, we have R*_min(X) substantially larger than max(R*_min({x_1}), R*_min({x_2})), as the remaining probability mass of 0.01 forces a large redundancy (cf. the rise near 1 in Fig. 1).

In Fig. 2, four two-dimensional contour plots are presented, in which two known probabilities vary along the axes of the plot. These plots are colored to denote regions sharing the same optimal prefix-free code. As more than one prefix-free code may be optimal at some given coordinates, codes covering larger regions take precedence over smaller ones in the figure. The plots follow a "map coloring" scheme, where colors may repeat but adjacent codes are guaranteed to be of different colors. Consistent with the idea of Shannon coding, it can be seen in Fig. 2(a) that local minima are arranged at coordinates where p(x_1) and p(x_2) are negative powers of two (i.e., p(x_1) = 2^{−i}, p(x_2) = 2^{−j} with i, j ∈ N). In this case, a single prefix-free code covers the local neighborhood around a local minimum. Contours for Fig. 2(b) and (d) are similar in character to those in (a), except for scale. Fig. 2(c) exhibits more substantial differences in the regions covered by each prefix-free code, with changes both in shape and quantity. In Fig. 2(a), there is a barely noticeable diagonal line where p(x_1) + p(x_2) = 1 and p(x_3) = 0, for which the optimal code is different from those in the interior of the plot, where p(x_3) > 0. This discontinuity is akin to those in Fig. 1. Similar discontinuities appear in parts (b), (c), and (d) of Fig. 2.
One additional remark regarding Fig. 2 is that the optimal regions for prefix-free codes seem to be polygonal, and possibly convex. These shapes are given in part, but not exclusively, by the necessary conditions that must be met by each merge in Algorithm 3. The shape of these regions is an interesting topic for future research.

VI. CONCLUSIONS
In this paper, we present a tight bound on the minimum redundancy achievable by the Huffman code when source probabilities are only partially known.
First, we show how to calculate a bound for alphabets of a given size by generating all prefix-free codes that could become a Huffman code and then, given the known probabilities, calculating their redundancy through convex optimization. This process yields a closed-form expression for the minimum redundancy. The bound is then extended to alphabets of up to a given size, and finally generalized to alphabets of any size. This is accomplished by showing that the last two cases are equivalent under certain conditions. Moreover, all bounds are tight by construction, as examples lying on each bound are found for each case.
To enable the calculation of the general bound, which may otherwise be infeasible, an efficient method is provided to enumerate all prefix-free codes that could become a Huffman code while discarding cases not worth considering. This is achieved through early pruning of prefix-free codes that are either proven to be at least as redundant as other codes considered, or proven by a linear program never to yield a Huffman code for the given known probabilities. All obtained results are closed-form expressions obtainable by means of a CAS, thus preventing any numerical issues related to floating-point operations. A numerical solver for linear programs is employed using a strategy that guarantees that numerical issues cannot alter the final result.
In addition, we present examples where we show the potential of the general bound to aid in the visualization of the structure of the minimum redundancy achievable by the Huffman code.
Finally, we would like to remark that the work described here lays the foundation to more complex bounds which could incorporate additional restrictions on the probability distributions, such as relations between probabilities of multiple symbols, or constraints on their magnitude.