


THE classical setting of the universal lossless compression problem [6], [9], [10] assumes that a sequence Formula$x^{n}$ of length Formula$n$ that was generated by a source Formula$\mmb{\theta}$ is to be compressed without knowledge of the particular Formula$\mmb{\theta}$ that generated Formula$x^{n}$ but with knowledge of the class Formula$\Lambda$ of all possible sources Formula$\mmb{\theta}$. The average performance of any given code, that assigns a length function Formula$L(\cdot)$, is judged on the basis of the redundancy function Formula$R_{n}\left(L,\mmb{\theta}\right)$, which is defined as the difference between the expected code length of Formula$L\left(\cdot\right)$ with respect to (w.r.t.) the given source probability mass function Formula$P_{\theta}$ and the Formula$n$ th-order entropy of Formula$P_{\theta}$ normalized by the length Formula$n$ of the uncoded sequence. A class of sources is said to be universally compressible in some worst sense if the redundancy function diminishes for this worst setting. Another approach to universal coding [37] considers the individual sequence redundancy Formula${\hat{R}}_{n}\left(L, x^{n}\right)$, defined as the normalized difference between the code length obtained by Formula$L(\cdot)$ for Formula$x^{n}$ and the negative logarithm of the maximum likelihood (ML) probability of the sequence Formula$x^{n}$, where the ML probability is within the class Formula$\Lambda$. We thereafter refer to this negative logarithm as the ML description length of Formula$x^{n}$. The individual sequence redundancy is defined for each sequence that can be generated by a source Formula$\mmb{\theta}$ in the given class Formula$\Lambda$.

Classical literature on universal compression [6], [9], [10], [25], [37] considered compression of sequences generated by sources over finite alphabets. In fact, it was shown by Kieffer [17] (see also [15]) that there are no universal codes (in the sense of diminishing redundancy) for sources over infinite alphabets. Later work (see, e.g., [23], [30], [38]), however, bounded the achievable redundancies for i.i.d. sequences generated by sources over large and infinite alphabets. Specifically, while it was shown that the redundancy does not decay if the alphabet size is of the same order of magnitude as the sequence length $n$ or greater, it was also shown that the redundancy does decay for alphabets of size $o(n)$.

While there is no universal code for infinite alphabets, recent work [22] demonstrated that if one considers the pattern of a sequence instead of the sequence itself, universal codes do exist in the sense of diminishing redundancy. A pattern of a sequence, first considered, to the best of our knowledge, in [1], is a sequence of indices, where the index $\psi_{i}$ at time $i$ represents the order of first occurrence of letter $x_{i}$ in the sequence $x^{n}$. Further study of universal compression of patterns [13], [22], [23], [31], [35] (and, subsequent to the work in this paper, [2]) provided lower and upper bounds on various forms of redundancy in universal compression of patterns. A related line of study is the compression of data in which the order of the occurring symbols is not important, but their types and empirical counts are [39], [40].
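The definition of a pattern above can be illustrated with a minimal sketch. The function name `pattern` is chosen here for illustration and does not come from the cited works.

```python
def pattern(seq):
    """Replace each symbol by the order of its first occurrence (1-based)."""
    first_seen = {}  # symbol -> index assigned at its first occurrence
    indices = []
    for symbol in seq:
        if symbol not in first_seen:
            first_seen[symbol] = len(first_seen) + 1
        indices.append(first_seen[symbol])
    return indices


print(pattern("abracadabra"))  # [1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1]
```

Any relabeling of the alphabet leaves the output unchanged, which is why the pattern discards the identity of the letters and keeps only their order of appearance.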

This paper considers universal compression of data sequences generated by distributions that are known a priori to be monotonic. The order of probabilities of the source symbols is known in advance to both encoder and decoder and can be utilized as side information. Monotonic distributions, such as the Zipf (see, e.g., [42], [43]) and the geometric distribution over the integers, are common in applications such as language modeling, and image compression where residual signals are compressed (see, e.g., [20], [21]). One can also consider compression of the list of last or first names in a given city of a given population. Usually, there exists some monotonicity for such a distribution in the given population, which both encoder and decoder may be aware of a priori. For example, the last name “Smith” can be expected to be much more common than the last name “Shannon.” Another example is the compression of a sequence of observations of different species, where one has prior knowledge which species are more common, and which are rare. Finally, one can consider compressing data for which side information given to the decoder through a different channel gives the monotonicity order.

Monotonic distributions were studied by Elias [8], Rissanen [24], and Ryabko [26]. In [8] and [24], the study focused on relative redundancy, computing the ratio between average assigned code length and the source entropy. Ryabko in [26] studied codes for monotonic distributions and used the connection between redundancy and channel capacity (i.e., the redundancy-capacity theorem) to lower bound minimax redundancy. Much newer work by Foster et al. showed in [11] that (unlike the compression of patterns) there are no universal block codes in the standard sense for the complete class of monotonic distributions. The main reason is that there exist such distributions, for which much of the statistical weight lies in the long tail of the distribution in symbols that have very low probability, and most of which will not occur in a given sequence. Thus, in practice, even though one has the prior knowledge of the monotonicity of the distribution, this monotonicity is not necessarily retained in an observed sequence. Actual coding is, therefore, very similar to compressing with infinite alphabets, and the additional prior knowledge of the monotonicity is not very helpful in reducing redundancy. Despite that, Foster et al. demonstrated codes that obtained universal per-symbol redundancy of Formula$o(1)$ as long as the source entropy is fixed (i.e., neither increasing with Formula$n$ nor infinite).

The work in [11] studied coding sequences (or blocks) generated by i.i.d. monotonic distributions, and designed codes for which the relative block redundancy could be (upper) bounded. Unlike that work, the focus in [8], [24], and [26] was on designing codes that minimize the redundancy or relative redundancy for a single symbol generated by a monotonic distribution. Specifically, in [24], minimax codes, which minimize the relative redundancy for the worst possible monotonic distribution over a given alphabet size, were derived. In [26], it was shown that redundancy of Formula$O(\log \log k)$, where Formula$k$ is the alphabet size, can be obtained with minimax per-symbol codes. Very recent work [18] considered per-symbol codes that minimize an average redundancy over the class of monotonic distributions for a given alphabet size. Unlike [11], all these papers study per-symbol codes. Therefore, the codes designed always pay nondiminishing per-symbol redundancy.

A different line of work on monotonic distributions considered optimizing codes for a known monotonic distribution but with unknown parameters (see [20], [21] for design of codes for two-sided geometric distributions). In this line of work, the class of sources is very limited and consists of only the unknown parameters of a known distribution.

In this paper, we consider a general class of monotonic distributions that is not restricted to a specific type or a single parameter. We study standard block redundancy for coding sequences generated by i.i.d. monotonic distributions, i.e., a setting similar to the work in [11]. We do, however, restrict ourselves to smaller subsets of the complete class of monotonic distributions. First, we consider monotonic distributions over alphabets of size Formula$k$, where Formula$k$ is either small w.r.t. Formula$n$, or of Formula$O(n)$. Then, we extend the analysis to show that under minimal restrictions of the monotonic distribution class, there exist universal codes in the standard sense, i.e., with diminishing per-symbol redundancy. In fact, not only do universal codes exist, but under mild restrictions, they achieve the same redundancy as obtained for alphabets of size Formula$O(n)$. The restrictions on this subclass imply that some types of fast decaying monotonic distributions are included in it, and therefore, sequences generated by these distributions (without prior knowledge of either the distribution or of its parameters) can still be compressed universally in the class of monotonic distributions.

The main contributions of this paper are the development of codes and derivation of their upper bounds on the redundancies for coding i.i.d. sequences generated by monotonic distributions. Specifically, this paper gives complete characterization of the redundancy in coding with monotonic distributions over “small” alphabets Formula$(k=o(n^{1/3}))$ and “large” alphabets Formula$(k=O(n))$. Then, it shows that these redundancy bounds carry over (in first order) to fast decaying distributions. Next, a code that achieves good redundancy rates for even slower decaying monotonic distributions is derived, and is used to study achievable redundancy rates for such distributions. Finally, even tighter upper bounds relative to the ML description length are obtained for individual sequences for which the monotonic order of the probabilities is known. The codes derived are two part codes, based on a description of any sequence using a quantized distribution describing the ML distribution of a given sequence. The redundancy consists of the distribution description cost and quantization penalty.

Lower bounds are also presented (in both average and individual sequence cases) to complete the characterization, and are shown to meet the upper bounds in the first three cases (small alphabets, large alphabets, and fast decaying distributions). The lower bounds turn out to relate to those obtained for coding patterns. The relationship to patterns is demonstrated in the proofs of the lower bounds. The main components of the average case proofs are, in fact, identical to those in [31], and the reader is referred to [31] for more details. The main steps of the proofs are still presented in the appendixes for the sake of completeness.

The universal compression problem over monotonic distributions is closely related to that of patterns. For small and large alphabets, the redundancy rates attained appear to be the same. This is because in both problems the richness of the class (yielding the universal coding redundancy) is decreased by the same factor from that of the original i.i.d. class, although for different reasons. In the pattern case, sequences which are label permutations of one another are governed by the same pattern ML distribution. Here, such sequences are constrained to a distribution whose probabilities are ordered by the monotonicity constraint. However, a monotonic ML distribution requires given labels to appear in the required order, and may not equal the actual i.i.d. ML distribution. This restriction is not imposed when coding patterns, and makes this part of the analysis more difficult for monotonic distributions. Overall, in both cases, we observe a cost of essentially $0.5\log (n/k^{3})$ bits per unknown parameter for smaller alphabets and a cost of essentially $O(n^{1/3})$ bits overall for larger alphabets. The technique used to prove the upper bounds of the main theorems in this paper follows the original work in [35] for upper bounding the redundancy for coding patterns. Tight upper bounds on the redundancy for coding patterns had not been attained when the work presented in this paper, published originally in [36] and [34], was done. Several years after the work presented here, the general construction in [35] was followed in [2] to show an $O(n^{1/3+\varepsilon})$ upper bound for coding patterns. An upper bound for small alphabets of $(1+\varepsilon) 0.5\log (n/k^{3})$ bits per parameter has yet to be derived for patterns, to the best of our knowledge. The constructions used in this paper can be applied to the pattern problem. The description costs of these constructions apply to patterns, but the computation of quantization costs is much more difficult for patterns. Specifically, the construction used in the individual sequence case for monotonic distributions can be applied to patterns.

The outline of this paper is as follows. Section II describes the notation and basic definitions. Then, in Section III, lower bounds on the redundancy for monotonic distributions are derived. Next, in Section IV, we propose codes and upper bound their redundancy for coding monotonic distributions over small and large alphabets. These upper bounds match the rates of the lower bounds. They are then extended to fast decaying monotonic distributions in Section V, which also demonstrates the use of the bounds on some standard monotonic distributions. Finally, in Section VI, we consider individual sequences.



Let Formula$x^{n}\triangleq\left(x_{1}, x_{2},\ldots, x_{n}\right)$ denote a sequence of Formula$n$ symbols over the alphabet Formula$\Sigma$ of size Formula$k$, where Formula$k$ can go to infinity. Without loss of generality, we assume that Formula$\Sigma=\left\{1, 2,\ldots, k\right\}$, i.e., it is the set of positive integers from 1 to Formula$k$. The sequence Formula$x^{n}$ is generated by an i.i.d. distribution of some source, determined by the parameter vector Formula$\mmb{\theta}\triangleq\left(\theta_{1},\theta_{2},\ldots,\theta_{k}\right)$, where Formula$\theta_{i}$ is the probability of Formula$X$ taking value Formula$i$. The components of Formula$\mmb{\theta}$ are nonnegative and sum to 1. The distributions we consider in this paper are monotonic. Therefore, Formula$\theta_{1}\geq\theta_{2}\geq\ldots\geq\theta_{k}$. The class of all monotonic distributions will be denoted by Formula${\cal M}$. The class of monotonic distributions over an alphabet of size Formula$k$ is denoted by Formula${\cal M}_{k}$. It is assumed that prior to coding Formula$x^{n}$ both encoder and decoder know that Formula$\mmb{\theta}\in{\cal M}$ or Formula$\mmb{\theta}\in{\cal M}_{k}$, and also know the order of the probabilities in Formula$\mmb{\theta}$. In the more restrictive setting, Formula$k$ is known in advance and it is known that Formula$\mmb{\theta}\in{\cal M}_{k}$. We do not restrict ourselves to this setting. In general, boldface letters will denote vectors, whose components will be denoted by their indices in the vector. Capital letters will denote random variables. We will denote an estimator by the hat sign. In particular, Formula${\hat{\mmb{\theta}}}$ will denote the ML estimator of Formula$\mmb{\theta}$ which is obtained from Formula$x^{n}$.

The probability of $x^{n}$ generated by $\boldsymbol{\theta}$ is given by $P_{\theta}\left(x^{n}\right)\triangleq\Pr\left(x^{n}\mid\boldsymbol{\Theta}=\boldsymbol{\theta}\right)$. The average per-symbol $n$th-order redundancy obtained by a code that assigns length function $L(\cdot)$ for $\boldsymbol{\theta}$ is $$R_{n}\left(L,\boldsymbol{\theta}\right)\triangleq\frac{1}{n}E_{\theta}L\left[X^{n}\right]-H_{\theta}\left[X\right]\tag{1}$$ where $E_{\theta}$ denotes expectation w.r.t. $\boldsymbol{\theta}$, and $H_{\theta}\left[X\right]$ is the (per-symbol) entropy (rate) of the source ($H_{\theta}\left[X^{n}\right]$ is the $n$th-order sequence entropy of $\boldsymbol{\theta}$, and for i.i.d. sources, $H_{\theta}\left[X^{n}\right]=n H_{\theta}\left[X\right]$). With entropy coding techniques, assigning a universal probability $Q\left(x^{n}\right)$ is identical to designing a universal code for coding $x^{n}$ where, up to negligible integer length constraints that will be ignored, the negative base-2 logarithm of the assigned probability is considered as the code length.
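As a small numerical illustration of (1), the sketch below evaluates the per-symbol redundancy when the assigned probability is a product distribution, $Q(x^{n})=\prod_{i}q(x_{i})$. This product form is an assumption made only for this example; the codes studied in the paper are not of this form. Under that assumption, the redundancy is the same for every $n$ and equals the Kullback-Leibler divergence between $\boldsymbol{\theta}$ and $q$. The function names are illustrative.

```python
import math


def entropy_bits(theta):
    """Per-symbol entropy H_theta[X] in bits."""
    return -sum(p * math.log2(p) for p in theta if p > 0)


def per_symbol_redundancy(theta, q):
    """Redundancy (1) for code lengths -log2 prod_i q(x_i), X_i i.i.d. theta.

    Requires q[i] > 0 wherever theta[i] > 0. The value does not depend on n
    because both the expected length and the entropy scale linearly with n.
    """
    expected_length = -sum(p * math.log2(q[i]) for i, p in enumerate(theta) if p > 0)
    return expected_length - entropy_bits(theta)


theta = [0.5, 0.25, 0.125, 0.125]   # a monotonic source
uniform = [0.25, 0.25, 0.25, 0.25]  # assignment that ignores the source
print(per_symbol_redundancy(theta, uniform))  # 0.25
```

Here the uniform assignment spends 2 bits per symbol against an entropy of 1.75 bits, so the redundancy does not diminish with $n$; a universal code must adapt to the sequence to do better.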

The individual sequence redundancy (see, e.g., [37]) of a code with length function Formula$L\left(\cdot\right)$ per sequence Formula$x^{n}$ over class Formula$\Lambda$ is Formula TeX Source $${\hat{R}}_{n}\left(L, x^{n}\right)\triangleq{{1}\over{n}}\left\{L\left(x^{n}\right)+\log P_{ML}\left(x^{n}\right)\right\}\eqno{\hbox{(2)}}$$ where the logarithm function is taken to the base of 2, here and elsewhere, and Formula$P_{ML}\left(x^{n}\right)$ is the probability of Formula$x^{n}$ given by the ML estimator Formula${\hat{\mmb{\theta}}}_{\Lambda}\in\Lambda$ of the governing parameter vector Formula${\mmb{\Theta}}$. The negative logarithm of this probability is, up to integer length constraints, the shortest possible code length assigned to Formula$x^{n}$ in Formula$\Lambda$. It will be referred to as the ML description length of Formula$x^{n}$ in Formula$\Lambda$. In the general case, one considers the i.i.d. ML. However, since we only consider Formula$\mmb{\theta}\in{\cal M}$, i.e., restrict the sequence to one governed by a monotonic distribution, we define Formula${\hat{\mmb{\theta}}}_{\cal M}\in{\cal M}$ as the monotonic ML estimator. Its associated shortest code length will be referred to as the monotonic ML description length. The estimator Formula${\hat{\mmb{\theta}}}_{\cal M}$ may differ from the i.i.d. ML Formula${\hat{\mmb{\theta}}}$, in particular, if the empirical distribution of Formula$x^{n}$ is not monotonic. The individual sequence redundancy in Formula${\cal M}$ is thus defined w.r.t. the monotonic ML description length, which is the negative logarithm of Formula$P_{ML}\left(x^{n}\right)\triangleq P_{{\hat{\theta}}_{\cal M}}\left(x^{n}\right)\triangleq\Pr\left(x^{n}\;\vert\;{\mmb{\Theta}}={\hat{\mmb{\theta}}}_{\cal M}\in{\cal M}\right)$.
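To make the difference between the i.i.d. ML estimator and the monotonic ML estimator concrete, the following sketch computes the latter from symbol counts. It relies on a standard result from order-restricted inference that is not stated in this paper: the ML estimate over non-increasing probability vectors is the non-increasing least-squares fit to the empirical frequencies, which can be computed by the pool-adjacent-violators procedure. The function name `monotone_ml` is hypothetical.

```python
from fractions import Fraction


def monotone_ml(counts):
    """ML estimate over non-increasing pmfs, given counts n_x(1..k).

    Assumption (standard result, not taken from the paper): the constrained
    ML equals the pool-adjacent-violators fit to the empirical frequencies.
    """
    n = sum(counts)
    blocks = []  # each block is [total_count, width]; its value is total/(n*width)
    for c in counts:
        blocks.append([c, 1])
        # Merge while the newest block's average exceeds its predecessor's.
        while len(blocks) > 1 and blocks[-1][0] * blocks[-2][1] > blocks[-2][0] * blocks[-1][1]:
            total, width = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += width
    theta = []
    for total, width in blocks:
        theta.extend([Fraction(total, n * width)] * width)
    return theta


print(monotone_ml([3, 2, 1]))  # already monotonic: 1/2, 1/3, 1/6
print(monotone_ml([1, 3, 2]))  # violations are pooled: 1/3, 1/3, 1/3
```

When the empirical distribution is already monotonic, the two estimators coincide. Otherwise, symbols whose counts violate the order are pooled and share an averaged probability, so the monotonic ML probability of the sequence is smaller than the i.i.d. ML probability.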

The average minimax redundancy of some class Formula$\Lambda$ is defined as Formula TeX Source $$R_{n}^{+}\left(\Lambda\right)\triangleq\min_{L}\sup_{\mmb{\theta}\in\Lambda}R_{n}\left(L,\mmb{\theta}\right).\eqno{\hbox{(3)}}$$ Similarly, the individual minimax redundancy is that of the best code Formula$L\left(\cdot\right)$ for the worst sequence Formula$x^{n}$, Formula TeX Source $${\hat{R}}_{n}^{+}\left(\Lambda\right)\triangleq\min_{L}\sup_{\mmb{\theta}\in\Lambda}\max_{x^{n}}{{1}\over{n}}\left\{L\left(x^{n}\right)+\log P_{\theta}\left(x^{n}\right)\right\}.\eqno{\hbox{(4)}}$$ The maximin redundancy of Formula$\Lambda$ is Formula TeX Source $$R_{n}^{-}\left(\Lambda\right)\triangleq\sup_{w}\min_{L}\int_{\Lambda}w\left(d\mmb{\theta}\right) R_{n}\left(L,\mmb{\theta}\right)\eqno{\hbox{(5)}}$$ where Formula$w(\cdot)$ is a prior on Formula$\Lambda$. In [6], it was shown by Davisson that Formula$R_{n}^{+}\left(\Lambda\right)\geq R_{n}^{-}\left(\Lambda\right)$. Davisson also tied the maximin redundancy to the capacity of the channel induced by the conditional probability Formula$P_{\theta}$. It was then shown independently by Gallager [12] and Ryabko [26] first, and then by Davisson and Leon-Garcia [7], that the minimax and maximin redundancies are essentially equal, hence, making the connection between the minimax redundancy and the capacity of the channel induced by Formula$P_{\theta}$. Finally, Merhav and Feder [19] tied between the capacity of this channel and redundancy for almost all sources in a class proving a strong version of the theorem. The redundancy-capacity theorem is used to prove lower bounds in the minimax (maximin) and “almost all sources” senses for the monotonic distribution class.



Lower bounds on various forms of the redundancy for the class of monotonic distributions can be obtained with slight modifications of the proofs for the lower bounds on the redundancy of coding patterns in [16], [22], [23], and [31]. The bounds are presented in the following three theorems. For the sake of completeness, the main steps of the proofs of the first two theorems are presented in appendixes, and the proof of the third theorem is presented below. The reader is referred to [16], [22], [23], [30] and [31] for more details.

Theorem 1

Fix an arbitrarily small $\varepsilon>0$, and let $n\rightarrow\infty$. Then, the $n$th-order average maximin and minimax universal coding redundancies for i.i.d. sequences generated by a monotonic distribution with alphabet size $k$ are lower bounded by $$R^{-}_{n}\left(\mathcal{M}_{k}\right)\geq\begin{cases}\frac{k-1}{2n}\log\frac{n^{1-\varepsilon}}{k^{3}}+\frac{k-1}{2n}\log\frac{\pi e^{3}}{2}-O\left(\frac{\log k}{n}\right), & k\leq\mathcal{T}^{-}_{\varepsilon,n}\\ \left(\frac{\pi}{2}\right)^{1/3}(1.5\log e)\frac{n^{(1-\varepsilon)/3}}{n}-O\left(\frac{\log n}{n}\right), & k>\mathcal{T}^{-}_{\varepsilon,n}\end{cases}\tag{6}$$ where $$\mathcal{T}^{-}_{\varepsilon,n}\triangleq\left(\frac{\pi n^{1-\varepsilon}}{2}\right)^{1/3}.\tag{7}$$

Theorem 2

Fix an arbitrarily small $\varepsilon>0$, and let $n\rightarrow\infty$. Let $\mu_{n}(\cdot)$ be the uniform prior over points in $\mathcal{M}_{k}$. Then, the $n$th-order average universal coding redundancy for coding i.i.d. sequences generated by monotonic distributions with alphabet size $k$ is lower bounded by $$R_{n}\left(L,\boldsymbol{\theta}\right)\geq\begin{cases}\frac{k-1}{2n}\log\frac{n^{1-\varepsilon}}{k^{3}}-\frac{k-1}{2n}\log\frac{8\pi}{e^{3}}-O\left(\frac{\log k}{n}\right), & k\leq\mathcal{T}_{\varepsilon,n}\\ \frac{1.5\log e}{2\pi^{1/3}}\cdot\frac{n^{(1-\varepsilon)/3}}{n}-O\left(\frac{\log n}{n}\right), & k>\mathcal{T}_{\varepsilon,n}\end{cases}\tag{8}$$ where $$\mathcal{T}_{\varepsilon,n}\triangleq\frac{1}{2}\cdot\left(\frac{n^{1-\varepsilon}}{\pi}\right)^{1/3}\tag{9}$$ for every code $L(\cdot)$ and almost every i.i.d. source $\boldsymbol{\theta}\in\mathcal{M}_{k}$, except for a set of sources $A_{\varepsilon}\left(n\right)$ whose relative volume $\mu_{n}\left(A_{\varepsilon}(n)\right)$ in $\mathcal{M}_{k}$ goes to 0 as $n\rightarrow\infty$.

Theorems 1 and 2 give lower bounds on redundancies of coding over monotonic distributions for the class $\mathcal{M}_{k}$. However, the bounds are more general, and the second region applies to the whole class of monotonic distributions $\mathcal{M}$. By plugging the boundary values of $k$ into the first regions of both theorems, the bounds of the second regions are obtained, demonstrating the threshold phenomenon of the transition between the regions. Work in [2], subsequent to that presented in this paper, slightly tightened the second region of the bound of Theorem 1 for patterns. This was done by applying a general technique that uses bounds on error correcting codes, as described in earlier work in [27], [28], [29], to patterns on top of the bounding methods used in [31]. The tighter bound for that region can also be applied to monotonic distributions. As in the case of patterns [22], [31], the bounds in (6) and (8) show that each parameter costs at least $0.5\log (n/k^{3})$ bits for small alphabets, and the total universality cost is at least $\Theta (n^{1/3-\varepsilon})$ bits overall for larger alphabets. We show in Section IV that for $k=O(n)$ these bounds are asymptotically achievable for monotonic distributions. The bounds in (6) and (8) focus on large values of $k$ that can increase with $n$. For small fixed $k$, the second-order terms of existing bounds for coding unconstrained i.i.d. sources are tighter. However, as $k$ increases, the bounds above become tighter through their first dominant term, and second-order terms become negligible. The proofs of Theorems 1 and 2 are presented in Appendixes A and B, respectively.
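The statement that the first-region expressions reduce to the second-region ones at the thresholds can be checked numerically. The sketch below evaluates only the leading terms of (6) and (8), omitting the $O(\cdot)$ remainders, and treats $k$ as a real number at the thresholds (7) and (9). The function names and the chosen values of $n$ and $\varepsilon$ are illustrative assumptions.

```python
import math

LOG2E = math.log2(math.e)


def theorem1_terms(n, eps):
    """Threshold (7), first-region leading term of (6) as a function of k, second-region term."""
    m = n ** (1 - eps)
    threshold = (math.pi * m / 2) ** (1 / 3)

    def small(k):
        return (k - 1) / (2 * n) * (math.log2(m / k ** 3) + math.log2(math.pi * math.e ** 3 / 2))

    large = (math.pi / 2) ** (1 / 3) * 1.5 * LOG2E * m ** (1 / 3) / n
    return threshold, small, large


def theorem2_terms(n, eps):
    """Threshold (9), first-region leading term of (8) as a function of k, second-region term."""
    m = n ** (1 - eps)
    threshold = 0.5 * (m / math.pi) ** (1 / 3)

    def small(k):
        return (k - 1) / (2 * n) * (math.log2(m / k ** 3) - math.log2(8 * math.pi / math.e ** 3))

    large = 1.5 * LOG2E / (2 * math.pi ** (1 / 3)) * m ** (1 / 3) / n
    return threshold, small, large


for name, terms in (("Theorem 1", theorem1_terms), ("Theorem 2", theorem2_terms)):
    t, small, large = terms(10 ** 6, 0.01)
    print(name, "threshold:", t, "ratio at threshold:", small(t) / large)
```

At each threshold the logarithmic factor collapses to $\log e^{3}$, so the two regions agree up to the factor $(k-1)/k$, which tends to 1 as the threshold grows with $n$.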

Theorem 3

Let $n\rightarrow\infty$. Then, the $n$th-order individual minimax redundancy for i.i.d. sequences with maximal letter $k$ w.r.t. the monotonic ML description length with alphabet size $k$ is lower bounded by $${\hat{R}}^{+}_{n}\left(\mathcal{M}_{k}\right)\geq\begin{cases}\frac{k}{2n}\log\frac{n e^{3}}{k^{3}}-\frac{\log n}{2n}+O\left(\frac{k^{3/2}}{n^{3/2}}\right), & k\leq n^{1/3}\\ \frac{3}{2}(\log e)\cdot\frac{n^{1/3}}{n}-\frac{\log n}{2n}+O\left(\frac{1}{n}\right), & k>n^{1/3}.\end{cases}\tag{10}$$

Theorem 3 lower bounds the individual minimax redundancy for coding a sequence believed to have an empirical monotonic distribution. The alphabet size is determined by the maximal letter that occurs in the sequence, i.e., Formula$k=\max\left\{x_{1}, x_{2},\ldots, x_{n}\right\}$. (If Formula$k$ is unknown, one can use Elias' code for the integers [8] using Formula$O(\log k)$ bits to describe Formula$k$. However, this is not reflected in the lower bound.) The ML probability estimate is taken over the class of monotonic distributions. Namely, the empirical probability (standard ML) estimate Formula${\hat{\mmb{\theta}}}$ is not Formula${\hat{\mmb{\theta}}}_{\cal M}$ in case Formula${\hat{\mmb{\theta}}}$ does not satisfy the monotonicity that defines the class Formula${\cal M}$. While the average case maximin and minimax bounds of Theorem 1 also apply to Formula${\hat{R}}^{+}_{n}\left({\cal M}_{k}\right)$, the bounds of Theorem 3 are tighter for the individual redundancy and are obtained using individual sequence redundancy techniques.
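The $O(\log k)$ cost of describing an unknown $k$ can be illustrated with the Elias gamma code, which uses $2\lfloor\log k\rfloor+1$ bits. The gamma code is one standard universal code for the integers; the text above does not say which variant is intended, so choosing gamma here is an assumption made for illustration.

```python
def elias_gamma_encode(k):
    """Encode integer k >= 1 as floor(log2 k) zeros followed by k in binary."""
    if k < 1:
        raise ValueError("Elias gamma is defined for positive integers only")
    binary = bin(k)[2:]
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits):
    """Decode one codeword from the front of bits; return (value, remaining bits)."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    end = 2 * zeros + 1
    return int(bits[zeros:end], 2), bits[end:]


codeword = elias_gamma_encode(5)
print(codeword, len(codeword))             # 00101 5
print(elias_gamma_decode(codeword + "1"))  # (5, '1')
```

Because the codeword is self-delimiting, the decoder can read $k$ first and then interpret the rest of the bit stream as the description of the sequence.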

Proof [Theorem 3]

Using Shtarkov's normalized maximum likelihood (NML) approach [37], one can assign probability Formula TeX Source $$Q\left(x^{n}\right)\triangleq{{P_{{\hat{\theta}}_{\cal M}}\left(x^{n}\right)} \over{\sum\nolimits_{y^{n}}P_{{\hat{\theta}}^{\prime}_{\cal M}}\left(y^{n}\right)} }\triangleq{{\max_{\theta^{\prime}\in{\cal M}}P_{\theta^{\prime}}\left(x^{n}\right) }\over{\sum\nolimits_{y^{n}}\max_{\theta^{\prime\prime}\in{\cal M}} P_{\theta^{\prime\prime}}\left(y^{n}\right)}}\eqno{\hbox{(11)}}$$ to sequence Formula$x^{n}$. This approach minimizes the individual minimax redundancy, giving individual redundancy of Formula TeX Source $$\eqalignno{{\hat{R}}_{n}\left(Q, x^{n}\right)=&\,{{1}\over{n}}\log {{\max_{\theta^{\prime}\in {\cal M}}P_{\theta^{\prime}}\left(x^{n}\right)}\over{Q\left(x^{n}\right)}}\cr =&\,{{1}\over{n}}\log \left\{\sum_{y^{n}}\max_{\theta^{\prime}\in{\cal M}} P_{\theta^{\prime}}\left(y^{n}\right)\right\}&{\hbox{(12)}}}$$ to every Formula$x^{n}$, specifically achieving the individual minimax redundancy.

It is now left to bound the logarithm of the sum in (12). We follow the approach used in [23, Th. 2] for bounding the redundancy for standard compression of i.i.d. sequences over large alphabets and use the results in [38] (as well as the approximation in [1]) for a precise expression of this component. We then adjust the result to monotonic distributions. Let Formula${\bf n}_{x}^{\ell}\triangleq\left(n_{x}(1), n_{x}(2),\ldots, n_{x}(\ell)\right)$ denote the occurrence counts of the first Formula$\ell$ letters of the alphabet Formula$\Sigma$ in Formula$x^{n}$. Assuming Formula$k$ is the largest letter in Formula$x^{n}$, Formula$\sum_{i=1}^{k}n_{x}(i)=n$. Now, following (12), Formula TeX Source $$\eqalignno{& n{\hat{R}}^{+}_{n}\left({\cal M}_{k}\right)\cr &\enspace\buildrel{(a)}\over{\geq}\log \left\{\sum\limits_{y^{n}:{\hat{\theta}}(y^{n})\in{\cal M}}P_{\hat{\theta}}\left(y^{n}\right)\right\}\cr &\enspace\buildrel{(b)}\over{\geq}\log \left\{\sum\limits_{\ell=1}^{k}\sum \limits_{{\bf n}_{y}^{\ell}}{{1}\over{\ell!}}\left(\buildrel{n}\over{n_{y}(1),\ldots, n_{y}(\ell)}\right)\!\prod_{i=1}^{\ell}\!\left({{n_{y}(i)} \over{n}}\right)^{n_{y}(i)}\!\right\}\cr &\enspace\buildrel{(c)}\over{\geq}\log \left\{\sum\limits_{{\bf n}_{y}^{k}}{{1} \over{k !}}\left(\buildrel{n}\over{n_{y}(1),\ldots,n_{y}(k)}\right)\prod_{i=1}^{k} \left({{n_{y}(i)}\over{n}}\right)^{n_{y}(i)}\right\}\cr &\enspace\buildrel{(d)}\over{=}{{k-1}\over{2}}\log {{n}\over{k}}+{{k}\over{2}} \log e+O\left({{k^{3/2}}\over{n^{1/2}}}\right)-\log \left(k!\right)\cr &\enspace\buildrel{(e)}\over{\geq}{{k}\over{2}}\log {{n e^{3}}\over{k^{3}}}-{{1} \over{2}}\log n+O\left({{k^{3/2}}\over{n^{1/2}}}\right)&{\hbox{(13)}}}$$ where Formula$(a)$ follows from including only sequences Formula$y^{n}$ that have a monotonic empirical (i.i.d. ML) distribution in Shtarkov's sum. 
Inequality Formula$(b)$ follows from partitioning the sequences Formula$y^{n}$ into types as done in [23], first by the number of occurring symbols Formula$\ell$, and then by the empirical distribution. Unlike standard i.i.d. distributions though, monotonicity implies that only the first Formula$\ell$ symbols in Formula$\Sigma$ occur, and thus the choice of Formula$\ell$ out of Formula$k$ in the proof in [23] is replaced by 1. Like in coding patterns, we also divide by Formula$\ell !$ because each type with Formula$\ell$ occurring symbols can be ordered in at most Formula$\ell !$ ways, where only some retain the monotonicity. (Note that this step is the reason that step Formula$(b)$ produces an inequality, because more than one of the orderings may be monotonic if equal occurrence counts occur.) Retaining only the term Formula$\ell=k$ yields Formula$(c)$. Then, Formula$(d)$ follows from applying (15) in [38] (see also the approximation of (13) in [1]). Finally, Formula$(e)$ follows from Stirling's approximation Formula TeX Source $$\sqrt{2\pi m}\cdot\left({{m}\over{e}}\right)^{m}\leq m!\leq\sqrt{2\pi m}\cdot\left({{m}\over{e}}\right)^{m}\cdot\exp\left\{{{1}\over{12m}}\right\}.\eqno{\hbox{(14)}}$$ The first region in (10) results directly from (13). The value Formula$\ell^{\ast}=n^{1/3}$ that maximizes the summand can be retained in step Formula$(c)$ instead of Formula$k$, for every Formula$k\geq\ell^{\ast}$, yielding the second region of the bound. This concludes the proof of Theorem 3. Formula$\hfill \blacksquare$
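For very small $n$ and $k$, the sum inside the logarithm in (12) can be evaluated by brute force, which gives a concrete feel for the quantity being bounded in the proof. The sketch below enumerates all sequences over an alphabet of size $k$ and uses a pool-adjacent-violators routine for the monotonic ML probability. That routine rests on a standard result that is not stated in this paper, and all function names are hypothetical. The enumeration covers every sequence over the $k$ letters, not only those whose maximal letter is $k$.

```python
from fractions import Fraction
from itertools import product
from math import log2


def monotone_ml(counts):
    """ML over non-increasing pmfs via pool-adjacent-violators (assumed standard result)."""
    n = sum(counts)
    blocks = []  # each block is [total_count, width]
    for c in counts:
        blocks.append([c, 1])
        while len(blocks) > 1 and blocks[-1][0] * blocks[-2][1] > blocks[-2][0] * blocks[-1][1]:
            total, width = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += width
    theta = []
    for total, width in blocks:
        theta.extend([Fraction(total, n * width)] * width)
    return theta


def shtarkov_sum(n, k):
    """Sum over all y^n of the maximal monotonic-class probability of y^n."""
    total = Fraction(0)
    for seq in product(range(k), repeat=n):
        counts = [seq.count(a) for a in range(k)]
        theta = monotone_ml(counts)
        prob = Fraction(1)
        for c, t in zip(counts, theta):
            prob *= t ** c  # t ** 0 == 1 even when t == 0
        total += prob
    return total


s = shtarkov_sum(2, 2)
print(s, log2(s) / 2)  # sum 7/4 and the resulting per-symbol redundancy in bits
```

For $n=k=2$ the sum is $7/4$, compared with $5/2$ when the maximization is over all i.i.d. distributions, since the sequence consisting only of the second letter is assigned probability $1/4$ instead of 1 under the monotonicity constraint.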



In this section, we demonstrate codes that asymptotically achieve the lower bounds for $\boldsymbol{\theta}\in\mathcal{M}_{k}$ and $k=O(n)$. We begin with a theorem and a corollary that state the achievable redundancies. The theorem gives a simpler bound, and the corollary (which follows the proof of the theorem) gives a tighter, more complex bound. The remainder of the section is devoted to proving both the theorem and the corollary, by describing codes and upper bounding their redundancies. The theorem is stated assuming no initial knowledge of $k$. The proof first considers the setting where $k$ is known, and then shows how the same bounds are achieved even when $k$ is unknown in advance, as long as it satisfies the conditions.

Theorem 4

Fix an arbitrarily small $\varepsilon>0$, and let $n\rightarrow\infty$. Then, there exists a code with length function $L^{\ast}\left(\cdot\right)$ that achieves redundancy $$R_{n}\left(L^{\ast},\boldsymbol{\theta}\right)\leq\begin{cases}\left(1+\varepsilon\right)\frac{k-1}{2n}\log\frac{n}{k^{3}}, & k=o\left(n^{1/3}\right),\\ \left(1+\varepsilon\right)\frac{k-1}{2n}\log\frac{n\left(\log n\right)^{2}}{k^{3}}, & k\leq n^{1/3},\\ \left(1+\varepsilon\right)\left(\log n\right)\left(\log\frac{k}{n^{1/3-\varepsilon}}\right)\frac{n^{1/3}}{n}, & n^{1/3}<k=o(n),\\ \left(1+\varepsilon\right)\frac{2}{3}\left(\log n\right)^{2}\frac{n^{1/3}}{n}, & n^{1/3}<k=O(n)\end{cases}\tag{15}$$ for i.i.d. sequences generated by any source $\boldsymbol{\theta}\in\mathcal{M}_{k}$.

The bounds presented are asymptotic. Second-order terms are absorbed in Formula$\varepsilon$. The second region contains the first, and the last contains the third. The first and third regions, however, have tighter bounds for the smaller values of Formula$k$. The code designed to code a sequence Formula$x^{n}$ is a two part code [25]. First, a distribution is described, and then it is used to code Formula$x^{n}$. The redundancy consists of the cost of describing the distribution and a quantization cost. Quantization is performed to reduce description cost, but yields the quantization cost. To achieve the lower bound, the larger the probability parameter is, the coarser its quantization. This approach was used in [30] and [31] to obtain upper bounds on the redundancy for coding over large alphabets and for coding patterns, respectively. The method in [30] and [31], however, is insufficient here, because it still results in too many quantization points due to the polynomial growth in quantization spacing. Here, we use an exponential growth as the parameters increase. This general idea was used in [35] to improve an upper bound on the redundancy of coding patterns. Since both encoder and decoder know the order of the probabilities a priori, this order need not be coded. It is thus sufficient to encode the quantized probabilities of the monotonic distribution, and the decoder can identify which probability is associated with which symbol using the monotonicity of the distribution. This point, in fact, complicates the proof, because the actual ML distribution Formula${\hat{\mmb{\theta}}}$ of a given sequence may not be monotonic even if the sequence was generated by a monotonic distribution. Since the labels are not coded, we must quantize Formula${\hat{\mmb{\theta}}}_{\cal M}$ instead. There is no such complication when coding patterns or sequences that obey distribution monotonicity side information as in Section VI.

Branching several steps from the proof of Theorem 4 below leads to the following tighter bounds on the upper regions, which are proved following the proof of Theorem 4.

Corollary 1

Fix an arbitrarily small Formula$\varepsilon>0$, and let Formula$n\rightarrow\infty$. Then, for Formula$k>n^{1/3}$, there exists a code with length function Formula$L^{\ast}\left(\cdot\right)$ that achieves redundancy Formula TeX Source $$\eqalignno{&\displaystyle R_{n}\left(L^{\ast},\mmb{\theta}\right)\leq&{\hbox{(16)}}\cr &\qquad\cases{\left(2.3\log {{k}\over{n^{1/3-\varepsilon}}}+0.8\log n\right){{n^{1/3}\left(\log n\right)^{1/3}}\over{n}},&$k=o(n),$\cr {{2.3 n^{1/3}\left(\log n\right)^{4/3}}\over{n}}, &$k=O(n)$}}$$ for i.i.d. sequences generated by any source Formula$\mmb{\theta}\in{\cal M}_{k}$.

Proof [Theorem 4]

The proof treats the regions Formula$k\leq n^{1/3}$ and Formula$k>n^{1/3}$ separately. For each region, we construct a grid of points to which a two part code can quantize the probability parameters. The main idea is that spacing between adjacent grid points is “semi”-exponentially increasing. To achieve that, the probability space is partitioned into intervals, whose length increases exponentially, and within each interval a fixed number of equally separated grid points are generated. Next, the ML probability vector of each sequence is quantized into the points of the grid. In the lower Formula$k$ region, a differential code is used to describe the number of points in the grid between two adjacent probability parameters, starting with the smallest one. In the upper Formula$k$ region, the number of probability parameters quantized to that grid point is described for every grid point. Then, the description cost, and the quantization cost are upper bounded. The sum of these two costs constitutes the description length. The redundancy is computed by subtracting the description length with the true probability parameters from the description length used. The quantized version of the true probability vector is used as an auxiliary vector to aid in upper bounding this difference.

We start with Formula$k\leq n^{1/3}$ assuming Formula$k$ is known. Let Formula$\beta=1/(\log n)$ be a parameter (we can also choose other values). Partition the probability space into Formula$J_{1}=\left\lceil 1/\beta\right\rceil$ intervals Formula TeX Source $$I_{j}=\left[{{n^{(j-1)\beta}}\over{n}},{{n^{j\beta}}\over{n}}\right),\;1\leq j\leq J_{1}.\eqno{\hbox{(17)}}$$ Note that Formula$I_{1}=[1/n, 2/n),\;I_{2}=[2/n, 4/n),\ldots,\;I_{j}=[2^{j-1}/n, 2^{j}/n)$. Let Formula$k_{j}=\vert\theta_{i}\in I_{j}\vert$ denote the number of probabilities in Formula$\mmb{\theta}$ that are in interval Formula$I_{j}$. In interval Formula$j$, take a grid of points with spacing Formula TeX Source $$\Delta_{j}^{(1)}={{\sqrt{k}n^{j\beta}}\over{n^{1.5}}}.\eqno{\hbox{(18)}}$$ Note that to complete all points in an interval, the spacing between two points at the boundary of an interval may be smaller. There are Formula$\left\lceil\log n\right\rceil$ intervals. Ignoring negligible integer length constraints (here and elsewhere), in each interval, the number of points is bounded by Formula TeX Source $$\left\vert I_{j}\right\vert\leq{{1}\over{2}}\cdot\sqrt{{n}\over{k}},\;\forall j: j=1, 2,\ldots,J_{1},\eqno{\hbox{(19)}}$$ where Formula$\vert\cdot\vert$ denotes the cardinality of a set. Let the grid Formula TeX Source $$\eqalignno{{\mmb{\tau}}=&{\left(\tau_{1},\tau_{2},\ldots\right)=\left({{1}\over{n}},{{1}\over{n}}+{{2\sqrt{k}}\over{n^{1.5}}},\ldots,{{2}\over{n}},{{2}\over{n}}+{{4\sqrt{k}}\over{n^{1.5}}},\ldots\right)}&\cr && {\hbox{(20)}}}$$ be a vector that takes all the points from all intervals, with cardinality Formula TeX Source $$B_{1}\triangleq\vert\mmb{\tau}\vert\leq{{1}\over{2}}\cdot\sqrt{{n}\over{k}}\left\lceil\log n\right\rceil.\eqno{\hbox{(21)}}$$
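
The grid construction of (17)-(21) can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: logarithms are taken to base 2 (so that each interval doubles in length), the per-interval point count of (19) is rounded up to an integer, and the function name `build_grid` is hypothetical.

```python
import math

def build_grid(n: int, k: int) -> list[float]:
    """Illustrative grid tau of (20) for k <= n^(1/3), assuming base-2 logarithms."""
    num_intervals = (n - 1).bit_length()              # J_1 = ceil(log2 n) for n >= 2
    # Number of points per interval: the bound of (19), rounded up (an assumption).
    per_interval = math.ceil(0.5 * math.sqrt(n / k))
    grid = []
    for j in range(1, num_intervals + 1):
        lower = 2.0 ** (j - 1) / n                    # left end of I_j in (17)
        upper = min(2.0 ** j / n, 1.0)                # right end, capped at probability 1
        spacing = math.sqrt(k) * 2.0 ** j / (n * math.sqrt(n))   # Delta_j of (18)
        for t in range(per_interval):
            point = lower + t * spacing
            if point < upper:                         # keep points inside the interval
                grid.append(point)
    return grid

tau = build_grid(4096, 4)
print(len(tau), tau[0], tau[1] - tau[0])
```

For these hypothetical values of n and k, every interval holds 16 points, so the grid size equals the right-hand side of (21), and the first spacing equals the second entry of (20) minus the first.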

Now, let Formula$\mmb{\varphi}=\left(\varphi_{1},\varphi_{2},\ldots,\varphi_{k}\right)$ be a monotonic probability vector, such that Formula$\sum\varphi_{i}=1$, Formula$\varphi_{1}\geq\varphi_{2}\geq\cdots\geq\varphi_{k}\geq 0$, and also the smaller Formula$k-1$ components of Formula$\mmb{\varphi}$ are either 0 or from Formula$\mmb{\tau}$, i.e., Formula$\varphi_{i}\in (\mmb{\tau}\cup\left\{0\right\})$, Formula$i=2,3,\ldots, k$. One can code Formula$x^{n}$ using a two part code, assuming the distribution governing Formula$x^{n}$ is given by the parameter Formula$\mmb{\varphi}$. The code length required (up to integer length constraints) is Formula TeX Source $$L\left(x^{n}\vert\mmb{\varphi}\right)=\log k+L_{R}(\mmb{\varphi})-\log P_{\varphi}\left(x^{n}\right),\eqno{\hbox{(22)}}$$ where Formula$\log k$ bits are needed to describe how many letter probabilities are greater than 0 in Formula$\mmb{\varphi}$, Formula$L_{R}(\mmb{\varphi})$ is the number of bits required to describe the quantized points of Formula$\mmb{\varphi}$, and the last term is needed to encode Formula$x^{n}$ assuming it is governed by Formula$\mmb{\varphi}$.

The vector Formula$\mmb{\varphi}$ can be described by a code as follows. Let Formula${\hat{k}}_{\varphi}$ be the number of nonzero letter probabilities hypothesized by Formula$\mmb{\varphi}$. Let Formula$b_{i}$ denote the index of Formula$\varphi_{i}$ in Formula$\mmb{\tau}$, i.e., Formula$\varphi_{i}=\tau_{b_{i}}$. Then, we will use the following differential code. For Formula$\varphi_{{\hat{k}}_{\varphi}}$ we need at most Formula$1+\log b_{{\hat{k}}_{\varphi}}+2\log (1+\log b_{{\hat{k}}_{\varphi}})$ bits to code its index in Formula$\mmb{\tau}$ using Elias' coding for the integers [8]. For Formula$\varphi_{i-1}$, we need at most Formula$1+\log (b_{i-1}-b_{i}+1)+2\log [1+\log (b_{i-1}-b_{i}+1)]$ bits to code the index displacement from the index of the previous parameter, where an additional 1 is added to the difference in case the two parameters share the same index. Summing up all components of Formula$\mmb{\varphi}$, and taking Formula$b_{{\hat{k}}_{\varphi}+1}=0$, Formula TeX Source $$\eqalignno{L_{R}(\mmb{\varphi})\leq &\,{\hat{k}}_{\varphi}-1+\sum\limits_{i=2}^{{\hat{k}}_{\varphi}}\log \left(b_{i}-b_{i+1}+1\right)+\cr & 2\sum\limits_{i=2}^{{\hat{k}}_{\varphi}}\log \left[1+\log \left(b_{i}-b_{i+1}+1\right)\right]\cr \buildrel{(a)}\over{\leq}&\, (k-1)+(k-1)\log {{B_{1}+k-1}\over{k}}+\cr &2(k-1)\log \log {{B_{1}+k-1}\over{k}}+o(k)\cr \buildrel{(b)}\over{=}&\, (1+\varepsilon){{k-1}\over{2}}\log {{n\left(\log n\right)^{2}}\over{k^{3}}}.&{\hbox{(23)}}}$$ Inequality Formula$(a)$ is obtained by applying Jensen's inequality once on the first sum, twice on the second sum utilizing the monotonicity of the logarithm function, and by bounding Formula${\hat{k}}_{\varphi}$ by Formula$k$, and absorbing second-order terms in the resulting Formula$o(k)$ term. Then, second-order terms are absorbed in Formula$\varepsilon$, and (21) is used to obtain Formula$(b)$.
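
As an illustration of the differential code, the sketch below encodes index displacements with the Elias delta code, whose length matches the per-parameter bound used in (23). It assumes that the delta code is the intended member of Elias' family of integer codes in [8]; the function names are hypothetical.

```python
import math

def elias_delta(x: int) -> str:
    """Elias delta codeword of a positive integer, returned as a bit string."""
    if x < 1:
        raise ValueError("Elias codes are defined for positive integers")
    n_bits = x.bit_length() - 1                  # N = floor(log2 x)
    len_bits = (n_bits + 1).bit_length() - 1     # L = floor(log2 (N + 1))
    prefix = "0" * len_bits                      # L zeros announce the field width
    length_field = format(n_bits + 1, "b")       # N + 1 in binary, L + 1 bits
    payload = format(x, "b")[1:]                 # x without its leading 1 bit
    return prefix + length_field + payload

def length_bound(x: int) -> float:
    """Per-integer bound 1 + log x + 2 log(1 + log x) used in the text."""
    return 1 + math.log2(x) + 2 * math.log2(1 + math.log2(x))

def differential_cost(indices: list[int]) -> int:
    """Bits needed for grid indices b_2 >= ... >= b_khat, coded smallest first.
    Each displacement is increased by 1 so that equal indices map to the integer 1."""
    total = 0
    previous = 0                                 # b_{khat+1} = 0, as in (23)
    for b in sorted(indices):
        total += len(elias_delta(b - previous + 1))
        previous = b
    return total

print(elias_delta(10), differential_cost([3, 3, 1]))
```

The largest component is not coded, since it is determined by the constraint that the probabilities sum to 1; the list passed to `differential_cost` therefore holds only the indices of the remaining components.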

To code Formula$x^{n}$, we choose Formula$\mmb{\varphi}$ which minimizes the expression in (22) over all Formula$\mmb{\varphi}$, i.e., Formula TeX Source $$L^{\ast}\left(x^{n}\right)=\min_{\mmb{\varphi}:\varphi_{i}\in (\mmb{\tau}\cup\left\{0\right\}),\;i=2,3,\ldots, k}L\left(x^{n}\vert\mmb{\varphi}\right)\triangleq L\left(x^{n}\vert{\hat{\mmb{\varphi}}}\right).\eqno{\hbox{(24)}}$$ The pointwise redundancy for Formula$x^{n}$ is given by Formula TeX Source $$\eqalignno{nR_{n}\left(L^{\ast}, x^{n}\right)=&\, L^{\ast}\left(x^{n}\right)+\log P_{\theta}\left(x^{n}\right)\cr =&\,\log k+L^{\ast}_{R}\left({\hat{\mmb{\varphi}}}\right)+\log {{P_{\theta}\left(x^{n}\right)}\over{P_{\hat{\varphi}}\left(x^{n}\right)}}.&{\hbox{(25)}}}$$ Note that the pointwise redundancy differs from the individual one, since it is defined w.r.t. the true probability of Formula$x^{n}$. Thus, for a given Formula$x^{n}$ it may also be negative.

To bound the third term of (25), let Formula$\mmb{\theta}^{\prime}$ be a monotonic version of Formula$\mmb{\theta}$ quantized onto Formula$\mmb{\tau}$, i.e., Formula$\theta^{\prime}_{i}\in (\mmb{\tau}\cup\left\{0\right\})$, Formula$i=2,3,\ldots, k$, where Formula$\theta^{\prime}_{i}>0$ if and only if Formula$\theta_{i}>0$. This implies that all positive Formula$\theta_{i}<1/n$ are quantized to Formula$\theta^{\prime}_{i}=1/n$. Define the quantization error Formula TeX Source $$\delta_{i}=\theta_{i}-\theta^{\prime}_{i}.\eqno{\hbox{(26)}}$$ The quantization is performed from the smallest parameter Formula$\theta_{k}$ to the largest, maintaining monotonicity as well as a minimal absolute cumulative quantization error. Thus, unless there is cumulative error formed by many parameters Formula$\theta_{i}<1/n$, Formula$\theta_{i}$ will be quantized to one of the two nearest grid points (one smaller and one greater than it). This procedure also guarantees that Formula$\vert\delta_{1}\vert\leq\Delta_{j_{2}}^{(1)}\leq\Delta_{j_{1}}^{(1)}$, where Formula$j_{1}$ and Formula$j_{2}$ are the indices of the intervals in which Formula$\theta_{1}$ and Formula$\theta_{2}$ are contained, respectively, i.e., Formula$\theta_{1}\in I_{j_{1}}$ and Formula$\theta_{2}\in I_{j_{2}}$. However, if there exists a cumulative error Formula$\Delta_{offset}$ due to quantization of parameters Formula$\theta_{i}: 0<\theta_{i}<1/n$ to Formula$\theta^{\prime}_{i}=1/n$, this error is offset by decreasing every Formula$\theta^{\prime}_{i}$ for Formula$\theta_{i}>1/n$ by Formula$\alpha_{i}\cdot\Delta_{offset}\cdot\theta^{\prime}_{i}$, where Formula$\alpha_{i}>0$ is some constant, and quantizing the value to the nearest grid point maintaining monotonicity and minimal cumulative error. 
By construction, Formula$\Delta_{offset}\leq k/n$, and thus Formula TeX Source $$\left\vert\delta_{i}\right\vert\leq{{\sqrt{k}n^{j\beta}}\over{n^{1.5}}}+{{k}\over{n}}\alpha^{\prime}_{i}\theta^{\prime}_{i}\eqno{\hbox{(27)}}$$ where Formula$\alpha^{\prime}_{i}>0$ is a constant derived from Formula$\alpha_{i}$.

Now, since Formula$\mmb{\theta}^{\prime}$ is included in the minimization of (24), we have, for every Formula$x^{n}$, Formula TeX Source $$L^{\ast}\left(x^{n}\right)\leq L\left(x^{n}\vert\mmb{\theta}^{\prime}\right),\eqno{\hbox{(28)}}$$ and also Formula TeX Source $$nR_{n}\left(L^{\ast}, x^{n}\right)\leq\log k+L_{R}\left(\mmb{\theta}^{\prime}\right)+\log {{P_{\theta}\left(x^{n}\right)}\over{P_{\theta^{\prime}}\left(x^{n}\right)}}.\eqno{\hbox{(29)}}$$ Averaging over all possible Formula$x^{n}$, the average redundancy is bounded by Formula TeX Source $$\eqalignno{&\displaystyle nR_{n}\left(L^{\ast},\mmb{\theta}\right)\cr &\;\;=\log k+E_{\theta}L^{\ast}_{R}\left({\hat{\mmb{\varphi}}}\right)+E_{\theta}\log {{P_{\theta}\left(X^{n}\right)}\over{P_{\hat{\varphi}}\left(X^{n}\right)}}\cr &\;\;\leq\log k+E_{\theta}L_{R}\left(\mmb{\theta}^{\prime}\right)+E_{\theta}\log {{P_{\theta}\left(X^{n}\right)}\over{P_{\theta^{\prime}}\left(X^{n}\right)}}.&{\hbox{(30)}}}$$ The second term of (30) is bounded with the bound of (23), and we proceed with the third term Formula TeX Source $$\eqalignno{& E_{\theta}\log {{P_{\theta}\left(X^{n}\right)}\over{P_{\theta^{\prime}}\left(X^{n}\right)}}\cr &\quad\buildrel{(a)}\over{=}n\sum\limits_{i=1}^{k}\theta_{i}\log {{\theta_{i}} \over{\theta^{\prime}_{i}}}\;\buildrel{(b)}\over{=}\;n\sum\limits_{i=1}^{k}\left (\theta^{\prime}_{i}+\delta_{i}\right)\log \left(1+{{\delta_{i}}\over{\theta^{\prime}_{i}}}\right)\cr &\quad\buildrel{(c)}\over{\leq}n (\log e)\sum\limits_{i=1}^{k}\left (\theta^{\prime}_{i}+\delta_{i}\right){{\delta_{i}}\over{\theta^{\prime}_{i}}}\; \buildrel{(d)}\over{=}\;n(\log e)\sum\limits_{i=1}^{k}{{\delta_{i}^{2}} \over{\theta^{\prime}_{i}}}\cr &\quad\buildrel{(e)}\over{\leq}\left(1+o(1)\right) k\log e+\left(1+o(1)\right) {{2(\log e)k}\over{n}}\sum\limits_{j=1}^{J_{1}}k_{j}\cdot n^{j\beta}\cr &\quad\buildrel{(f)}\over{\leq}5\left(1+o(1)\right)(\log e) k.&{\hbox{(31)}}}$$ Equality Formula$(a)$ is since the expectation is performed on 
the number of occurrences of letter Formula$i$ for each letter. Representing Formula$\theta_{i}=\theta^{\prime}_{i}+\delta_{i}$ yields equation Formula$(b)$. We use Formula$\ln (1+x)\leq x$ to obtain Formula$(c)$. Equality Formula$(d)$ is obtained since all the quantization displacements must sum to 0. The first term of inequality Formula$(e)$ in (31) is obtained under a worst-case assumption that Formula$\theta_{i}\ll 1/n$ for Formula$i\geq 2$. Thus, it is quantized to Formula$\theta^{\prime}_{i}=1/n$, and the bound Formula$\vert\delta_{i}\vert\leq 1/n$ is used. In a different worst-case scenario, we have from (27) and since in interval Formula$j$, Formula$\theta^{\prime}_{i}\geq n^{(j-1)\beta}/n$, Formula TeX Source $${{\delta_{i}^{2}}\over{\theta^{\prime}_{i}}}\leq{{2k n^{j\beta}}\over{n^{2}}}+{{2 k^{1.5}n^{j\beta}}\over{n^{2.5}}}+{{k^{2}\alpha^{\prime^{2}}_{i}\theta^{\prime}_{i}}\over{n^{2}}}\eqno{\hbox{(32)}}$$ where Formula$n^{\beta}=2$ is used to derive the equation. Since Formula$k=o(n)$, the second term above is absorbed in the first, leading to the second term of inequality Formula$(e)$ of (31) after aggregating elements of the sum into intervals. The sum over Formula$i$ of the last term of (32) is Formula$o(1)$ since Formula$k=o(n)$. This sum is absorbed into the first term of inequality Formula$(e)$ of (31). Inequality Formula$(f)$ of (31) is obtained since Formula TeX Source $$\sum_{j=1}^{J_{1}}k_{j}n^{j\beta}=\sum_{j=1}^{J_{1}}k_{j}2^{j}\leq 2n.\eqno{\hbox{(33)}}$$ Inequality (33) follows since Formula$k_{1}\leq n$, Formula$k_{2}\leq (n-k_{1})/2$, Formula$k_{3}\leq (n-k_{1})/4-k_{2}/2$, and so on, until Formula TeX Source $$k_{J_{1}}\leq{{n}\over{2^{J_{1}-1}}}-\sum_{\ell=1}^{J_{1}-1}{{k_{\ell}}\over{2^{J_{1}-\ell}}}\Rightarrow\sum_{j=1}^{J_{1}}k_{j}2^{j}\leq2n.\eqno{\hbox{(34)}}$$ These relations hold because the lower limit of each of the Formula$J_{1}$ intervals restricts the number of parameters that can lie inside that interval. 
The restriction is applied in order of intervals, so that the probability mass already used is subtracted, leading to the series of inequalities.
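
Steps (a)-(d) of (31) can be checked numerically. The sketch below compares the exact expected excess code length with the quadratic bound obtained after step (d), for a small hypothetical source and a hypothetical quantized version of it whose displacements sum to 0. Logarithms are assumed to be base 2, and the function names are illustrative.

```python
import math

def quantization_cost_bits(theta, theta_q, n):
    """Exact cost n * sum_i theta_i * log2(theta_i / theta'_i), step (a) of (31)."""
    return n * sum(p * math.log2(p / q) for p, q in zip(theta, theta_q) if p > 0)

def quadratic_bound_bits(theta, theta_q, n):
    """Bound n * log2(e) * sum_i delta_i^2 / theta'_i reached after step (d) of (31)."""
    return n * math.log2(math.e) * sum((p - q) ** 2 / q for p, q in zip(theta, theta_q))

theta = [0.50, 0.30, 0.20]      # hypothetical monotonic source
theta_q = [0.52, 0.29, 0.19]    # hypothetical quantized version, displacements sum to 0
n = 1000

print(quantization_cost_bits(theta, theta_q, n))
print(quadratic_bound_bits(theta, theta_q, n))
```

For small displacements the exact cost is roughly half of the quadratic bound, which suggests that the constants in (31) are not tight, although the order of the bound is unaffected.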

Plugging the bounds of (23) and (31) into (30), we obtain Formula TeX Source $$\eqalignno{&\displaystyle nR_{n}\left(L^{\ast},\mmb{\theta}\right)\cr &\!\!\quad\leq\log k+\left(1+\varepsilon\right){{k-1}\over{2}}\log {{n\left(\log n\right)^{2}}\over{k^{3}}}+5 (\log e)k\cr &\!\!\quad\leq\left(1+\varepsilon^{\prime}\right){{k-1}\over{2}}\log {{n\left(\log n\right)^{2}}\over{k^{3}}}&{\hbox{(35)}}}$$ where we absorb second-order terms in Formula$\varepsilon^{\prime}$. Replacing Formula$\varepsilon^{\prime}$ by Formula$\varepsilon$ and normalizing the redundancy per symbol by Formula$n$, the bound of the second region of (15) is proved. Since Formula$\log \log n$ can also be absorbed in Formula$\varepsilon$, the first region is also proved. The code proposed, however, will lead to redundancy whose second-order term is larger than that obtained for standard i.i.d. optimal codes that do not exploit the distribution monotonicity for fixed Formula$k$. This is because the grid used here is too dense for the fixed Formula$k$ case. One can use the standard i.i.d. codes for tighter second-order bounds for fixed Formula$k$.

We now consider the larger values of Formula$k$, i.e., Formula$n^{1/3}<k=O(n)$. The idea of the proof is the same. However, we need to partition the probability space to different intervals, the spacing within an interval must be optimized, and the parameters' description cost must be bounded differently, because now there are more parameters quantized than points in the quantization grid. Define the Formula$j$th interval as Formula TeX Source $$I_{j}=\left[{{n^{(j-1)\beta}}\over{n^{2}}},{{n^{j\beta}}\over{n^{2}}}\right),\;1\leq j\leq J_{2},\eqno{\hbox{(36)}}$$ where Formula$J_{2}=\left\lceil 2/\beta\right\rceil=\left\lceil 2\log n\right\rceil$. Again, let Formula$k_{j}=\vert\theta_{i}\in I_{j}\vert$ denote the number of probabilities in Formula$\mmb{\theta}$ that are in interval Formula$I_{j}$. It could be possible to use the intervals as defined in (17), but this would not guarantee bounded redundancy in the rate we require if there are very small probabilities Formula$\theta_{i}\ll 1/n$. Note that the smallest nonzero component of Formula${\hat{\mmb{\theta}}}$ is Formula$1/n$. However, this is not necessarily the case for Formula${\hat{\mmb{\theta}}}_{\cal M}$. The latter may consist of smaller nonzero probabilities for sequences that do not obey the monotonicity of the distribution. Therefore, the interval definition in (17) can be used for larger alphabets only if the probabilities of the symbols are known to be bounded. Define the spacing in interval Formula$j$ as Formula TeX Source $$\Delta_{j}^{(2)}={{n^{j\beta}}\over{n^{2+\alpha}}},\eqno{\hbox{(37)}}$$ where Formula$\alpha>0$ is a parameter to be optimized. 
Similarly to (19), the interval cardinality here is Formula TeX Source $$\left\vert I_{j}\right\vert\leq 0.5\cdot n^{\alpha}\;\forall j: j=1, 2,\ldots, J_{2}.\eqno{\hbox{(38)}}$$ In a similar manner to the definition of Formula$\mmb{\tau}$ in (20), we define Formula TeX Source $$\eqalignno{\mmb{\eta}=&\,\left(\eta_{1},\eta_{2},\ldots\right)\cr =&\,\left({{1}\over{n^{2}}},{{1}\over{n^{2}}}+{{2}\over{n^{2+\alpha}}},\ldots,{{2}\over{n^{2}}},{{2}\over{n^{2}}}+{{4}\over{n^{2+\alpha}}},\ldots\right).&{\hbox{(39)}}}$$ The cardinality of Formula$\mmb{\eta}$ is Formula TeX Source $$B_{2}\triangleq\vert\mmb{\eta}\vert\leq 0.5\cdot n^{\alpha}\left\lceil 2\log n\right\rceil\leq n^{\alpha}\left\lceil\log n\right\rceil.\eqno{\hbox{(40)}}$$

We now perform the encoding similarly to the small Formula$k$ case, where we allow quantization to nonzero values to the components of Formula$\mmb{\varphi}$ up to Formula$i=n^{2}$. (This is more than needed but is possible since Formula$\eta_{1}=1/n^{2}$.) Encoding is performed similarly to the small Formula$k$ case. Thus, similarly to (30), we have Formula TeX Source $$nR_{n}\left(L^{\ast},\mmb{\theta}\right)\leq 2\log n+E_{\theta}L_{R}\left(\mmb{\theta}^{\prime}\right)+E_{\theta}\log {{P_{\theta}\left(X^{n}\right)}\over{P_{\theta^{\prime}}\left(X^{n}\right)}},\eqno{\hbox{(41)}}$$ where the first term is due to allowing up to Formula${\hat{k}}=n^{2}$. Since usually in this region Formula$k\geq B_{2}$ (except the low end), the description of vectors Formula$\mmb{\varphi}$ and Formula$\mmb{\theta}^{\prime}$ is done by coding the cardinality of Formula$\vert\varphi_{i}=\eta_{j}\vert$ and Formula$\vert\theta^{\prime}_{i}=\eta_{j}\vert$, respectively, i.e., for each grid point the code describes how many letters have probability quantized to this point. This idea resembles coding profiles of patterns, as done in [22]. However, unlike the method in [22], here, many probability parameters of symbols with different occurrences are mapped to the same grid point by quantization. The number of parameters mapped to a grid point of Formula$\mmb{\eta}$ is coded using Elias' representation of the integers. 
Hence, in a similar manner to (23), Formula TeX Source $$\eqalignno{& L_{R}(\mmb{\theta}^{\prime})&{\hbox{(42)}}\cr &\!\!\quad\buildrel{(a)}\over{\leq}\sum\limits_{j=1}^{B_{2}}\left\{1+\log (\vert\theta^{\prime}_{i}=\eta_{j}\vert+1)+\right.\cr &\qquad\left.2\log [1+\log (\vert\theta^{\prime}_{i}=\eta_{j}\vert+1)]\right\}\cr &\!\!\quad\buildrel{(b)}\over{\leq}B_{2}+B_{2}\log {{k+B_{2}}\over{B_{2}}}+2B_{2}\log \log {{k+B_{2}}\over{B_{2}}}+o(B_{2})\cr &\!\!\quad\buildrel{(c)}\over{\leq}\cases{(1+\varepsilon) (\log n)(\log {{k}\over{n^{\alpha-\varepsilon}}}) n^{\alpha}, &$n^{\alpha}<k=o(n),$\cr (1+\varepsilon) (1-\alpha)(\log n)^{2}n^{\alpha}, &$n^{\alpha}<k=O(n)$.}}$$ The additional 1 term in the logarithm in Formula$(a)$ is for 0 occurrences, Formula$(b)$ is obtained similarly to step Formula$(a)$ of (23), absorbing all second-order terms in the last term. To obtain Formula$(c)$, we first assume, for the first region, that Formula$k n^{\varepsilon}\gg B_{2}$ (an assumption that must be later validated with the choice of Formula$\alpha$). Then, second-order terms are absorbed in Formula$\varepsilon$. The extra Formula$n^{\varepsilon}$ factor is unnecessary if Formula$k\gg B_{2}$. The second region is obtained by upper bounding Formula$k$ without this factor. It is possible to separate the first region into two regions, eliminate this factor in the lower region, and obtain a more complicated, yet tighter, expression in the upper region, where Formula$k\sim\Theta (n^{1/3})$.
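
The count-based description of step (a) in (42) can be illustrated as follows. The sketch takes, for every quantized parameter, the index of the grid point it was mapped to, and sums the per-grid-point bound of (42). Base-2 logarithms are assumed, and the function name is hypothetical.

```python
import math
from collections import Counter

def count_description_bound(grid_indices, grid_size):
    """Right-hand side of step (a) in (42) for a given assignment to grid points."""
    counts = Counter(grid_indices)
    total = 0.0
    for j in range(grid_size):
        c = counts.get(j, 0) + 1      # the +1 makes a count of 0 representable
        total += 1 + math.log2(c) + 2 * math.log2(1 + math.log2(c))
    return total

# Three parameters on a hypothetical grid of four points: two share grid point 0.
print(count_description_bound([0, 0, 2], 4))
# An empty assignment costs exactly one unit per grid point.
print(count_description_bound([], 5))
```

Every grid point contributes to the cost, even an unused one, which is why this description is preferable only when the number of parameters is at least comparable to the grid size.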

Now, similarly to (31), we obtain Formula TeX Source $$\eqalignno{E_{\theta}\log {{P_{\theta}(X^{n})}\over{P_{\theta^{\prime}}(X^{n})}}\leq &\, n (\log e)\sum\limits_{i=1}^{k}{{\delta_{i}^{2}}\over{\theta^{\prime}_{i}}}\cr \buildrel{(a)}\over{\leq}&\, O(1)+{{2\log e}\over{n^{1+2\alpha}}}\sum\limits_{j=1}^{J_{2}}k_{j}n^{j\beta}\cr \buildrel{(b)}\over{\leq}&\, 4(\log e) n^{1-2\alpha}+O(1).&{\hbox{(43)}}}$$ The first term of inequality Formula$(a)$ is obtained under the assumption that Formula$k=O(n)$, Formula$\theta^{\prime}_{i}\geq 1/n^{2}$, and Formula$\vert\delta_{i}\vert\leq 1/n^{2}$. Similarly to the last two terms of (32), we obtain an additional Formula$O(1)$ term for extra offset costs of the larger probability symbols due to many small probability symbols if they exist. For the second term, we use Formula$\vert\delta_{i}\vert\leq n^{j\beta}/n^{2+\alpha}$ and Formula$\theta^{\prime}_{i}\geq n^{(j-1)\beta}/n^{2}$. Inequality Formula$(b)$ is obtained in a similar manner to inequality Formula$(f)$ of (31), where the sum is shown similarly to be at most Formula$2n^{2}$.

Summing up the contributions of (42) and (43) in (41), Formula$\alpha=1/3$ is shown to minimize the total cost (to first order). This choice of Formula$\alpha$ also satisfies the assumption of step Formula$(c)$ in (42). Using Formula$\alpha=1/3$, absorbing all second-order terms in Formula$\varepsilon$ and normalizing by Formula$n$, we obtain the remaining two regions of the bound in (15). It should be noted that the proof here would give a bound of Formula$O(n^{1/3+\varepsilon})$ up to Formula$k=O(n^{4/3})$. If the intervals in (17) were used for bounded distributions, the coefficients of the last two regions will be reduced by a factor of 2.
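
The first-order choice of the spacing parameter can be illustrated by balancing exponents of n only. The description cost in (42) grows as n to the power alpha, while the quantization cost in (43) grows as n to the power 1 - 2 alpha. The sketch below ignores all logarithmic factors and constants (an assumption made purely for illustration) and searches a grid of candidate values for the one with the smallest dominant exponent.

```python
def dominant_exponent(alpha: float) -> float:
    """Larger of the two exponents of n in (42) and (43), ignoring log factors."""
    return max(alpha, 1 - 2 * alpha)

candidates = [i / 300 for i in range(1, 150)]     # alpha on a grid inside (0, 1/2)
best_alpha = min(candidates, key=dominant_exponent)
print(best_alpha)
```

The search returns the candidate at which the two exponents meet. Taking logarithmic factors into account shifts the best choice slightly, which is the refinement pursued in the proof of Corollary 1.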

The proof up to this point assumes that Formula$k$ is known in advance. This is important for the code resulting in the bounds for the first two regions because the quantization grid depends on Formula$k$. Specifically, if in building the grid, Formula$k$ is underestimated, the description cost of Formula$\mmb{\varphi}$ increases. If Formula$k$ is overestimated, the quantization cost will increase. Also, if the code of larger Formula$k$'s is used for a smaller Formula$k$, a larger bound than necessary results. To solve this, the optimization that chooses Formula$L^{\ast}(x^{n})$ is done over all possible values of Formula$k$ (greater than or equal to the maximal symbol occurring in Formula$x^{n}$), i.e., every greater Formula$k$ in the first construction, and the construction of the code for the top regions. For fixed Formula$k$, a standard optimal code for nonmonotonic distributions can also be constructed. For every small Formula$k$, a different construction is done, using the appropriate Formula$k$ to determine the spacing in each interval. The value of Formula$k$ yielding the shortest code word is then used. Elias' coding for the integers can be used to designate Formula$k$ with Formula$O (\log k)$ prefix bits. The analysis continues as before. This does not change the redundancy to first order, giving all four regions of the bound in (15), even if Formula$k$ is unknown in advance. This concludes the proof of Theorem 4. Formula$\hfill \blacksquare$

Proof [Corollary 1]

The proof branches off the proof of Theorem 4 by improving on several steps, and mainly on the choice of Formula$\alpha$. First, like the partitioning of the probability space into three intervals in [35], we can partition the probability space into two intervals here, Formula$(0, 1/n^{\alpha}]$ and Formula$(1/n^{\alpha}, 1)$. (Since we can have probabilities smaller than Formula$1/n$, we cannot use a bottom interval of Formula$(0, n^{\alpha}/n]$ here.) In the top interval, we need at most Formula$(1+\varepsilon) n^{\alpha}\log n$ bits to describe the monotonic ML probabilities of at most Formula$n^{\alpha}$ symbols whose probabilities are in this interval. Quantizing all these probabilities with Formula$1/n$ resolution yields Formula$o(1)$ additional quantization cost. (This can be shown following similar steps as (43) with a choice of Formula$\alpha=(1+o(1))/3$.) Using Formula$1/n^{\alpha}$ as the upper limit on the total number of intervals in (36) instead of 1 now yields Formula$J^{\prime}_{2}=(2-\alpha)\log n$. It then follows, similarly to (40), that Formula$B^{\prime}_{2}\leq 0.5 (2-\alpha) n^{\alpha}\log n$. Next, the description costs in (42) reduce by the factor Formula$0.5 (2-\alpha)$. Combining the costs in (41), using the new description cost, the quantization cost of (43), and absorbing the cost of the probability parameter top interval and other second-order terms in Formula$\varepsilon$ yields Formula TeX Source $$\eqalignno{nR_{n}&\left(L^{\ast},\mmb{\theta}\right)&{\hbox{(44)}}\cr \leq&\, (1+\varepsilon) 0.5(2-\alpha) (1-\alpha) n^{\alpha}(\log n)^{2}+4 (\log e) n^{1-2\alpha}}$$ for Formula$k=O(n)$. (Similarly, a more complex expression can be written for Formula$k=o(n)$.) 
A choice of Formula TeX Source $$\alpha={{1}\over{3}}-{{2}\over{3}}{{\log \log n}\over{\log n}}+{{\log \nu}\over{\log n}}\eqno{\hbox{(45)}}$$ for some parameter Formula$\nu$ minimizes (44) yielding Formula TeX Source $$nR_{n}\left(L^{\ast},\mmb{\theta}\right)\leq(1+\varepsilon)\left({{5\nu}\over{9}}+{{4\log e}\over{\nu^{2}}}\right) n^{1/3}(\log n)^{4/3}.\eqno{\hbox{(46)}}$$ Taking Formula$\nu=(72 (\log e)/5)^{1/3}\approx 2.75$ minimizes (46), yielding a coefficient of less than 2.3 in (46). Letting Formula$n\rightarrow\infty$, absorbing all second-order terms in the gap of the coefficient to 2.3 proves the second region of (16). Using the same value of Formula$\nu$ for the resulting terms for the first region in a similar manner, proves the first region. A slightly tighter bound can be obtained for the first region if the value of Formula$\nu$ is optimized for the specific value of Formula$k$. Formula$\hfill \blacksquare$
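
The constants in (46) can be verified numerically. The sketch below evaluates the coefficient of (46) as a function of the parameter of (45) and its closed-form minimizer, assuming base-2 logarithms (so that log e is about 1.4427). The function name is illustrative.

```python
import math

LOG2_E = math.log2(math.e)

def coefficient(nu: float) -> float:
    """Coefficient of n^(1/3) (log n)^(4/3) in (46), without the (1 + eps) factor."""
    return 5 * nu / 9 + 4 * LOG2_E / nu ** 2

# Setting the derivative 5/9 - 8 log(e) / nu^3 to zero gives the minimizer.
nu_star = (72 * LOG2_E / 5) ** (1 / 3)

print(nu_star, coefficient(nu_star))
```

The minimizer is close to 2.75 and the resulting coefficient is below 2.3, in agreement with the values stated in the proof.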



This section shows that with some mild conditions on the source distribution, the same redundancy upper bounds achieved for finite monotonic distributions can be achieved even if the monotonic distribution is over an infinite alphabet. The key observation that leads to this result is that a distribution that decays fast enough will result in only a small number of occurrences of letters from its tail in Formula$x^{n}$. Occurrences of these letters will likely not retain the monotonicity. Since there are few such occurrences, they can be handled without increasing the asymptotic behavior of the coding cost. More precisely, fast decaying monotonic distributions can be viewed as if they have some effective bounded alphabet size. Occurrences of symbols outside this limited alphabet are rare. We present two theorems and a corollary that upper bound the redundancy when coding with such unknown monotonic distributions. The first theorem also provides a slightly stronger bound (with smaller coefficient) for Formula$k=O(n)$. For slower decays with more occurring symbols from the distribution tail, the redundancy order does increase due to the penalty of identifying these symbols in a sequence. However, we show, consistently with the results in [11], that as long as the entropy of the source is finite, a universal code, in the sense of diminishing redundancy per symbol, still exists. We begin with stating the two theorems and the corollary, then the proofs are presented. The section is concluded with three examples of typical monotonic distributions over the integers, that demonstrate both cases of fast and slow decays.

A. Upper Bounds

We begin with some notation. Fix an arbitrarily small Formula$\varepsilon>0$, and let Formula$n\rightarrow\infty$. Define Formula$m\triangleq m_{\rho}\triangleq n^{\rho}$ as the effective alphabet size, where Formula$\rho>\varepsilon$. (Note that Formula$\rho=(\log m)/(\log n)$.) Let Formula TeX Source $$\eqalignno{&{\cal R}_{n}(m)&{\hbox{(47)}}\cr &\!\!\quad\triangleq\cases{{{m-1}\over{2}}\log {{n}\over{m^{3}}}, &$m=o(n^{1/3}),$\cr {{1}\over{2}}\left(\rho+{{1}\over{3}}\right)\left(\rho+\varepsilon-{{1}\over{3}}\right)(\log n)^{2}n^{1/3}, & o.w.}}$$

Theorem 5

  1. Fix an arbitrarily small Formula$\varepsilon>0$, and let Formula$n\rightarrow\infty$. Let Formula$x^{n}$ be generated by an i.i.d. monotonic distribution Formula$\mmb{\theta}\in{\cal M}$. If there exists Formula$m^{\ast}$, such that Formula TeX Source $$\sum_{i>m^{\ast}}n\theta_{i}\log i=o\left[{\cal R}_{n}\left(m^{\ast}\right)\right],\eqno{\hbox{(48)}}$$ then, there exists a code with length function Formula$L^{\ast}(\cdot)$, such that Formula TeX Source $$R_{n}\left(L^{\ast},\mmb{\theta}\right)\leq{{\left(1+\varepsilon\right)}\over{n}}{\cal R}_{n}\left(m^{\ast}\right)\eqno{\hbox{(49)}}$$ for the monotonic distribution Formula$\mmb{\theta}$.
  2. If there exists Formula$m^{\ast}$ for which Formula$\rho^{\ast}=o\left(n^{1/3}/(\log n)\right)$, such that Formula TeX Source $$\sum_{i>m^{\ast}}\theta_{i}\log i=o(1),\eqno{\hbox{(50)}}$$ then, there exists a universal code with length function Formula$L^{\ast}(\cdot)$, such that Formula TeX Source $$R_{n}\left(L^{\ast},\mmb{\theta}\right)=o(1).\eqno{\hbox{(51)}}$$

Theorem 5 shows that redundancy bounds of the same order as those obtained for finite alphabets are achievable for monotonic distributions that decay fast enough (with effective alphabet that does not exceed Formula$O(n^{\rho})$ symbols for a fixed Formula$\rho$). Specifically, very fast decaying distributions, although over infinite alphabets, may even behave like monotonic distributions with Formula$o\left(n^{1/3}\right)$ symbols. The condition in (48) merely means that the cost that a code would incur in order to code very rare symbols, that are larger than the effective alphabet size, is negligible w.r.t. the total cost obtained from other, more likely, symbols. Note that for Formula$m=n$, the bound is tighter than that of the last region of Theorem 4, and a constant of 4/9 replaces 2/3. The second part of the theorem states that if the decay is slow, but the cost of coding rare symbols is still diminishing per symbol, a universal code still exists for such distributions. However, in this case the redundancy will be dominated by coding the rare (out of order) symbols.
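
The constant quoted for the case in which the effective alphabet size equals the sequence length follows directly from the second region of (47). The short check below uses exact rational arithmetic and lets the slack parameter tend to 0; the function name is hypothetical.

```python
from fractions import Fraction

def second_region_coefficient(rho, eps=Fraction(0)):
    """Coefficient of (log n)^2 n^(1/3) in the second region of (47)."""
    third = Fraction(1, 3)
    return Fraction(1, 2) * (rho + third) * (rho + eps - third)

# Effective alphabet size equal to n corresponds to rho = 1.
print(second_region_coefficient(Fraction(1)))
```

With rho equal to 1 the coefficient is 4/9, compared with 2/3 in the last region of (15).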

Applying the additional steps used to prove Corollary 1 to the proof of the first part of Theorem 5 yields a tighter expression for the second region of Formula${\cal R}_{n}(m)$ in (47), which for fixed Formula$\rho$ is Formula$\Theta\left(n^{1/3}(\log n)^{4/3}\right)$. While Theorem 5 bounds the redundancy decay rate for two extremes, a more general theorem can provide the redundancy rates for coding an unknown monotonic distribution whose decay rate is between these extremes. As the examples at the end of this section show, the next theorem is very useful for slower decaying distributions. It also encapsulates the derivation of a tighter bound, like that in Corollary 1, for the more general case.

Theorem 6

Fix an arbitrarily small Formula$\varepsilon>0$, and let Formula$n\rightarrow\infty$. Let Formula$x^{n}$ be generated by an i.i.d. monotonic distribution Formula$\mmb{\theta}\in{\cal M}$. Then, there exists a code with length function Formula$L^{\ast}(\cdot)$, that achieves redundancy Formula TeX Source $$\eqalignno{&\displaystyle n R_{n}(L^{\ast},\mmb{\theta})\leq(1+\varepsilon)\cdot\cr &\quad\min_{\alpha>0,\rho:\rho\geq\alpha+\varepsilon}\left\{{{(\rho+\alpha)(\rho-\alpha)(\log n)^{2}n^{\alpha}}\over{2}}+{{5 n^{1-2\alpha}}\over{\ln 2}}+\right.\cr &\qquad\qquad\qquad\quad\left.\left(1+{{1}\over{\rho}}\right)n\sum_{i>n^{\rho}}\theta_{i}\log i\right\}&\hbox{(52)}}$$ for coding sequences generated by the source Formula$\mmb{\theta}$.
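
The minimization in (52) can be explored numerically for a concrete source. The sketch below is a rough illustration, not part of the theorem: it drops the leading (1 + epsilon) factor, assumes base-2 logarithms, replaces the continuous minimization by a coarse grid search, truncates the tail sum after a fixed number of terms, and uses a hypothetical geometric source. All function names are illustrative.

```python
import math

def theorem6_objective(n, alpha, rho, tail_cost):
    """Bracketed expression of (52); tail_cost(m) approximates sum_{i>m} theta_i log2 i."""
    log_n = math.log2(n)
    description = (rho + alpha) * (rho - alpha) * log_n ** 2 * n ** alpha / 2
    quantization = 5 * n ** (1 - 2 * alpha) / math.log(2)
    rare = (1 + 1 / rho) * n * tail_cost(n ** rho)
    return description + quantization + rare

def geometric_tail_cost(q, terms=500):
    """Truncated tail cost for the hypothetical source theta_i = (1 - q) q^(i - 1)."""
    def tail_cost(m):
        start = math.floor(m) + 1
        return sum((1 - q) * math.exp((i - 1) * math.log(q)) * math.log2(i)
                   for i in range(start, start + terms))
    return tail_cost

def minimize_bound(n, tail_cost, eps=0.01, steps=20):
    """Coarse grid search over alpha in (0, 1] and rho >= alpha + eps."""
    best = None
    for a in range(1, steps + 1):
        alpha = a / steps
        for r in range(steps + 1):
            rho = alpha + eps + r / steps
            value = theorem6_objective(n, alpha, rho, tail_cost)
            if best is None or value < best[0]:
                best = (value, alpha, rho)
    return best

print(minimize_bound(10 ** 6, geometric_tail_cost(0.5)))
```

The truncation of the tail sum is adequate only for sources that decay quickly; a slowly decaying source would need a closed-form or better-approximated tail.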

The theorems above lead to the following corollary.

Corollary 2

As Formula$n\rightarrow\infty$, sequences generated by monotonic distributions with Formula$H_{\theta}(X)=O(1)$ are universally compressible in the average sense.

Corollary 2 shows that sequences generated by finite entropy monotonic distributions can be compressed in the average with diminishing per symbol redundancy. This result is consistent with the results shown in [11].

We continue with proving the two theorems and the corollary.


The proof of both theorems is constructive in a similar manner to the proof of Theorem 4. This time, however, the main idea is to first separate the more likely symbols from the unlikely ones. The code first determines the point of this separation Formula$m=n^{\rho}$. (Note that Formula$\rho$ can be greater than 1.) All symbols Formula$i\leq m$ are considered likely and are quantized and described in a similar manner as in the codes for smaller alphabets. Unlike bounded alphabets, though, a more robust grid is used here to allow larger values of Formula$m$. The unlikely symbols are coded hierarchically. They are first merged into a single innovation symbol. Then, they are encoded within this symbol by coding their actual values. As long as the decay is fast enough, the average cost of conveying these symbols becomes negligible w.r.t. the cost of coding the likely symbols. If the decay is slower, but still fast enough, as in the case described by condition (50), the coding cost of the rare symbols dominates the redundancy, which is still diminishing. The description length of likely symbols is bounded as in the proof of Theorem 4, consisting of the description of the probability grid points and the quantization cost. In order to determine the best value of Formula$m$ for a given sequence, all values are tried and the one yielding the shortest description is used for coding a specific Formula$x^{n}$. The steps described prove both Theorems 5 and 6.

Let Formula$m\geq 2$ determine the number of likely symbols in the alphabet. For a given Formula$m$, define Formula TeX Source $$S_{m}\triangleq\sum\limits_{i>m}\theta_{i},\eqno{\hbox{(53)}}$$ as the total probability of the remaining symbols. Given Formula$\mmb{\theta}$, Formula$m$ and Formula$S_{m}$, a probability Formula TeX Source $$\eqalignno{& P\left(x^{n}\vert m, S_{m},\mmb{\theta}\right)&{\hbox{(54)}}\cr &\quad\triangleq\left[\prod_{i=1}^{m}\theta_{i}^{n_{x}(i)}\right]\cdot S_{m}^{n_{x}(x>m)}\cdot\prod_{i>m}\left({{n_{x}(i)}\over{n_{x}(x>m)}}\right)^{n_{x}(i)}}$$ can be computed for Formula$x^{n}$, where Formula$n_{x}(i)$ counts occurrences of symbol Formula$i$ in Formula$x^{n}$, and Formula$n_{x}(x>m)$ is the count of all symbols greater than Formula$m$ in Formula$x^{n}$. This probability mass function clusters all symbols greater than Formula$m$ into one innovation symbol. Then, it uses the ML estimate of each to distinguish among them in the clustered symbol.
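
As a small illustration of (54), the following Python sketch evaluates the base-2 logarithm of this probability for a short sequence. The function name `clustered_log2_prob` and the toy parameter values are assumptions introduced here for illustration and do not appear in the paper.

```python
import math
from collections import Counter

def clustered_log2_prob(x, theta_head, s_m):
    """Base-2 log of the probability in (54).

    x          : sequence of positive integers (the symbols of x^n)
    theta_head : probabilities of symbols 1..m, so m = len(theta_head)
    s_m        : probability assigned to the innovation symbol (all i > m)
    """
    m = len(theta_head)
    counts = Counter(x)
    # n_x(x > m): number of occurrences of symbols greater than m
    n_tail = sum(c for i, c in counts.items() if i > m)
    log_p = 0.0
    for i, c in counts.items():
        if i <= m:
            # likely symbol: coded with its own probability
            log_p += c * math.log2(theta_head[i - 1])
        else:
            # unlikely symbol: ML estimate within the innovation symbol
            log_p += c * math.log2(c / n_tail)
    if n_tail > 0:
        log_p += n_tail * math.log2(s_m)
    return log_p

# Toy example (hypothetical values): m = 2, symbols 3 and 4 are clustered.
x = [1, 1, 2, 3, 4, 3]
log_p = clustered_log2_prob(x, theta_head=[0.5, 0.25], s_m=0.25)
```

In this toy case the two likely symbols contribute 4 bits, the three occurrences of the innovation symbol contribute 6 bits, and distinguishing symbols 3 and 4 inside the cluster adds the remaining cost.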

For every Formula$m$, we can define a quantization grid Formula$\mmb{\xi}_{m}$, in a similar manner to the proof of Theorem 4, for the first Formula$m$ components of Formula$\mmb{\theta}$. If Formula$m=o(n^{1/3})$, we use Formula$\mmb{\xi}_{m}=\mmb{\tau}_{m}$, where Formula$\mmb{\tau}_{m}$ is the grid defined in (20) with Formula$m$ replacing Formula$k$. Otherwise, we can use the definition of Formula$\mmb{\eta}$ in (39). However, to obtain tighter bounds for large Formula$m$, we define a different grid for the larger values of Formula$m$ following similar steps to those in (36)-(40). First, define the Formula$j$th interval as Formula TeX Source $$I_{j}=\left[{{n^{(j-1)\beta}}\over{n^{\rho+2\alpha}}},{{n^{j\beta}}\over{n^{\rho+2\alpha}}}\right),\;1\leq j\leq J_{\rho},\eqno{\hbox{(55)}}$$ where Formula$\rho=(\log m)/(\log n)$ as defined above, Formula$\alpha>0$ is a parameter, and Formula$\beta=1/(\log n)$ as before. Within the Formula$j$th interval, we define the spacing in the grid by Formula TeX Source $$\Delta_{j}^{(\rho)}={{n^{j\beta}}\over{n^{\rho+3\alpha}}}.\eqno{\hbox{(56)}}$$ As in (38), Formula TeX Source $$\left\vert I_{j}\right\vert\leq 0.5\cdot n^{\alpha}\;\forall j: j=1, 2,\ldots, J_{\rho},\eqno{\hbox{(57)}}$$ and the total number of intervals needed to describe probabilities up to Formula$1/n^{\alpha}$ is Formula TeX Source $$J_{\rho}=\left\lceil (\rho+\alpha)\log n\right\rceil.\eqno{\hbox{(58)}}$$ As in the proof of Corollary 1, we use Formula$O\left(n^{\alpha}\log n\right)$ bits to describe and quantize probabilities greater than Formula$1/n^{\alpha}$.
Similarly to (39), Formula$\mmb{\xi}_{m}$ is defined as Formula TeX Source $$\eqalignno{\mmb{\xi}_{m}=&\,\left(\xi_{1},\xi_{2},\ldots\right)\cr =&\,\left({{1}\over{n^{\rho+2\alpha}}},{{1}\over{n^{\rho+2\alpha}}}+{{2}\over{n^{\rho+3\alpha}}},\ldots,{{2}\over{n^{\rho+2\alpha}}},\right.\cr &\quad\left.{{2}\over{n^{\rho+2\alpha}}}+{{4}\over{n^{\rho+3\alpha}}},\ldots\right).&{\hbox{(59)}}}$$ The cardinality of Formula$\mmb{\xi}_{m}$ is thus Formula TeX Source $$B_{\rho}\triangleq\vert\mmb{\xi}_{m}\vert\leq0.5\cdot n^{\alpha}\left\lceil (\rho+\alpha)\log n\right\rceil.\eqno{\hbox{(60)}}$$
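
The grid of (55)-(60) can be tabulated numerically. The sketch below is only an illustration under stated assumptions: logarithms are taken to base 2 (so that Formula$n^{\beta}=2$), and the function name `grid_layout` and the sample values of Formula$n$, Formula$\rho$, and Formula$\alpha$ are hypothetical choices made here.

```python
import math

def grid_layout(n, rho, alpha):
    """Return (J_rho, intervals) for the grid described by (55)-(58).

    Each interval is a tuple (lower, upper, spacing). Logarithms are base 2,
    so n**beta = 2 for beta = 1 / log n.
    """
    j_rho = math.ceil((rho + alpha) * math.log2(n))    # (58)
    coarse = float(n) ** (rho + 2 * alpha)
    fine = float(n) ** (rho + 3 * alpha)
    intervals = []
    for j in range(1, j_rho + 1):
        lower = 2.0 ** (j - 1) / coarse                # left end of I_j in (55)
        upper = 2.0 ** j / coarse                      # right end of I_j in (55)
        spacing = 2.0 ** j / fine                      # (56)
        intervals.append((lower, upper, spacing))
    return j_rho, intervals

# Hypothetical sample values, chosen so that all powers are exact.
n, rho, alpha = 1024, 1.0, 0.5
j_rho, intervals = grid_layout(n, rho, alpha)
# Number of grid points in each interval: interval length over spacing.
points_per_interval = [(hi - lo) / d for lo, hi, d in intervals]
```

For these sample values every interval holds Formula$0.5\cdot n^{\alpha}=16$ points, in agreement with (57), and the last interval ends at Formula$1/n^{\alpha}$, so the product of the two counts reproduces the cardinality bound in (60).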

An Formula$m$th order quantized version Formula$\mmb{\theta}^{\prime}_{m}$ of Formula$\mmb{\theta}$ is obtained by quantizing Formula$\theta_{i}\leq 1/n^{\alpha}$, Formula$i=2,3,\ldots,m$ onto Formula$\mmb{\xi}_{m}$, such that Formula$\theta^{\prime}_{i}\in\mmb{\xi}_{m}$ for these values of Formula$i$. Then, the remaining cluster probability Formula$S_{m}$ is quantized into Formula$S^{\prime}_{m}\in\left[1/n, 2/n,\ldots, 1\right]$. The parameter Formula$\theta^{\prime}_{1}$ is constrained by the quantization of the other parameters. Quantization is performed again in a manner that minimizes the cumulative error but retains monotonicity, and probabilities smaller than Formula$\xi_{1}$ are offset by larger symbols as before.

Now, for any Formula$m\geq 2$, let Formula$\mmb{\varphi}_{m}$ be any monotonic probability vector of cardinality Formula$m$ whose last Formula$m-1$ components are quantized into Formula$\mmb{\xi}_{m}$ (or coded separately in the upper interval Formula$(1/n^{\alpha}, 1)$ if such values exist), and let Formula$\sigma_{m}\in\left[1/n, 2/n,\ldots, 1\right]$ be a quantized value of the innovation symbol, such that Formula$\sum_{i=1}^{m}\varphi_{i,m}+\sigma_{m}=1$, where Formula$\varphi_{i,m}$ is the Formula$i$th component of Formula$\mmb{\varphi}_{m}$. If Formula$m$, Formula$\sigma_{m}$, and Formula$\mmb{\varphi}_{m}$ are known, a given Formula$x^{n}$ can be coded using Formula$P\left(x^{n}\vert m,\sigma_{m},\mmb{\varphi}_{m}\right)$ as defined in (54), with Formula$\sigma_{m}$ replacing Formula$S_{m}$, and the Formula$m$ components of Formula$\mmb{\varphi}_{m}$ replacing the first Formula$m$ components of Formula$\mmb{\theta}$. However, in the universal setting, none of these parameters are known in advance. Furthermore, neither the symbols greater than Formula$m$ nor their conditional ML probabilities are known in advance. Therefore, the total cost of coding Formula$x^{n}$ using these parameters requires universality costs for describing them. The additional universality cost of coding Formula$x^{n}$ with probability Formula$P\left(x^{n}\vert m,\sigma_{m},\mmb{\varphi}_{m}\right)$ thus consists of the following five components: 1) Formula$m$ should be described using Elias' representation with at most Formula$1+\rho\log n+2\log (1+\rho\log n)$ bits. 2) The value of Formula$\sigma_{m}$ in its quantization grid should be coded using Formula$\log n$ bits. 3) The Formula$m$ components of Formula$\mmb{\varphi}_{m}$ require Formula$L_{R}\left(\mmb{\varphi}_{m}\right)$ bits. 4) The number Formula$c_{x}(x>m)$ of distinct letters in Formula$x^{n}$ greater than Formula$m$ is coded using Formula$\log n$ bits. 5) Each letter Formula$i>m$ in Formula$x^{n}$ is coded. 
Elias' coding for the integers using Formula$1+\log i+2\log (1+\log i)$ bits can be used, but to simplify the derivation we can also use the code, also presented in [8], that uses no more than Formula$1+2\log i$ bits to describe Formula$i$. In addition, at most Formula$\log n$ bits are required for describing Formula$n_{x}(i)$ in Formula$x^{n}$. For Formula$n\rightarrow\infty$, Formula$m\gg 1$, and Formula$\varepsilon>0$ arbitrarily small, this yields a total cost of Formula TeX Source $$\eqalignno{&\displaystyle L\left(x^{n}\vert m,\sigma_{m},\mmb{\varphi}_{m}\right)\cr &\!\!\quad\leq-\log P\left(x^{n}\vert m,\sigma_{m},\mmb{\varphi}_{m}\right)+L_{R}\left(\mmb{\varphi}_{m}\right)+\cr &\qquad [(1+\varepsilon)\rho+c_{x}(x>m)+2]\log n\cr &\qquad+c_{x}(x>m)+2\sum_{i>m, i\in x^{n}}\log i&{\hbox{(61)}}}$$ where we assume Formula$m$ is large enough to bound the cost of describing Formula$m$ by Formula$(1+\varepsilon)\rho\log n$.
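
The two integer code lengths quoted above can be checked against concrete code words. The sketch below assumes that the two codes are the ones commonly called the Elias gamma and Elias delta codes; the text only refers to "Elias' representation" and to [8], so this identification and the function names are assumptions made here.

```python
import math

def elias_gamma(i):
    """Elias gamma code word of an integer i >= 1, as a string of bits."""
    binary = bin(i)[2:]                   # floor(log2 i) + 1 bits
    # Prefix of floor(log2 i) zeros announces the length of the binary part.
    return "0" * (len(binary) - 1) + binary

def elias_delta(i):
    """Elias delta code word of an integer i >= 1, as a string of bits."""
    binary = bin(i)[2:]
    # Gamma-code the bit length, then append i without its leading 1.
    return elias_gamma(len(binary)) + binary[1:]

# Compare the actual code word lengths with the bounds quoted in the text.
gamma_ok = all(
    len(elias_gamma(i)) <= 1 + 2 * math.log2(i) for i in range(1, 5000)
)
delta_ok = all(
    len(elias_delta(i)) <= 1 + math.log2(i) + 2 * math.log2(1 + math.log2(i))
    for i in range(1, 5000)
)
```

Under this assumption, the gamma code is the simpler one bounded by Formula$1+2\log i$ bits, and the delta code is the one whose length grows as Formula$(1+\varepsilon)\log i$ for large Formula$i$.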

The description cost of Formula$\mmb{\varphi}_{m}$ for Formula$m=o(n^{1/3})$ is bounded by Formula TeX Source $$L_{R}\left(\mmb{\varphi}_{m}\right)\leq\left(1+\varepsilon\right){{m-1}\over{2}}\log {{n}\over{m^{3}}}\eqno{\hbox{(62)}}$$ using (23) with Formula$m$ replacing Formula$k$. The Formula$(\log n)^{2}$ factor in (23) can be absorbed in Formula$\varepsilon$ since we limit Formula$m$ to Formula$o(n^{1/3})$, unlike the derivation in (23). For larger values of Formula$m$, we describe symbol probabilities of Formula$\mmb{\varphi}_{m}$ in the grid Formula$\mmb{\xi}_{m}$ in a similar manner to the description of Formula$O(n)$ symbol probabilities in the grid Formula$\mmb{\eta}$ in the proof of Corollary 1. Similarly to (42), we have Formula TeX Source $$\eqalignno{& L_{R}(\mmb{\varphi}_{m})\cr &\!\!\quad\leq B_{\rho}+B_{\rho}\log {{n^{\rho}+B_{\rho}}\over{B_{\rho}}}+2B_{\rho}\log \log {{n^{\rho}+B_{\rho}}\over{B_{\rho}}}+O\left(B_{\rho}\right)\cr &\!\!\quad\buildrel{(a)}\over{\leq}{{\left(1+\varepsilon\right)}\over{2}}\left(\rho+\alpha\right)\left(\rho+\varepsilon-\alpha\right)(\log n)^{2}n^{\alpha}&{\hbox{(63)}}}$$ where the term Formula$O (B_{\rho})$ absorbs the cost of probabilities larger than Formula$1/n^{\alpha}$. To obtain inequality Formula$(a)$, we first multiply Formula$n^{\rho}$ by Formula$n^{\varepsilon}$ in the numerator of the argument of the logarithm. This is only necessary for Formula$\rho\rightarrow\alpha$ to guarantee that Formula$n^{\rho+\varepsilon}\gg B_{\rho}$. Substituting the bound on Formula$B_{\rho}$ from (60), absorbing second-order terms in the leading Formula$\varepsilon$ yields the bound.

A sequence Formula$x^{n}$ can now be coded using the universal parameters that minimize the sequence description length, i.e., Formula TeX Source $$\eqalignno{& L^{\ast}\left(x^{n}\right)\cr &\!\!\quad\triangleq\min_{m^{\prime}\geq 2}\min_{\sigma_{m^{\prime}}\in \left[{{1}\over{n}},{{2}\over{n}},\ldots, 1\right]}\;\min_{\mmb{\varphi}_{m^{\prime}} :\varphi_{i}\in\mmb{\xi}_{m^{\prime}}, i\geq2}L\left(x^{n}\vert m^{\prime}, \sigma_{m^{\prime}},\mmb{\varphi}_{m^{\prime}}\right)\cr &\!\!\quad\leq L\left(x^{n}\vert m, S^{\prime}_{m},\mmb{\theta}^{\prime}_{m}\right)&{\hbox{(64)}}}$$ where the minimization over Formula$\varphi_{i}$ also includes values larger than Formula$1/n^{\alpha}$, using their designated description. The values Formula$\mmb{\theta}^{\prime}_{m}$ and Formula$S^{\prime}_{m}$ are the true source parameters quantized in the manner described above, and the inequality holds for every Formula$m$. The minimization on Formula$m^{\prime}$ should be performed only up to the maximal symbol that occurs in Formula$x^{n}$.

Following (61)-(64), up to negligible integer length constraints, the average redundancy using Formula$L^{\ast}(\cdot)$ is bounded, for every Formula$m\geq 2$, by Formula TeX Source $$\eqalignno{& n R_{n}\left(L^{\ast},\mmb{\theta}\right)\cr &\quad=E_{\theta}\left[L^{\ast}\left(X^{n}\right)+\log P_{\theta}\left(X^{n}\right)\right]\cr &\quad\buildrel{(a)}\over{\leq}E_{\theta}\left[L\left(X^{n}\;\vert\;m, S^{\prime}_{m},\mmb{\theta}^{\prime}_{m}\right)+\log P_{\theta}\left(X^{n}\right)\right]\cr &\quad\buildrel{(b)}\over{\leq}E_{\theta}\log {{P_{\theta}\left(X^{n}\right)}\over{P\left(X^{n}\;\vert\;m, S^{\prime}_{m},\mmb{\theta}^{\prime}_{m}\right)}}+L_{R}\left(\mmb{\theta}^{\prime}_{m}\right)+\cr &\;\qquad2\sum\limits_{i>m}P_{\theta}\left(i\in X^{n}\right)\log i+\cr &\;\qquad\left(1+\varepsilon\right) [E_{\theta}C_{x}\left(X>m\right)+\rho+2]\log n&\hbox{(65)}}$$ where Formula$(a)$ follows from (64), and Formula$(b)$ follows from averaging on (61) with Formula$\sigma_{m}=S^{\prime}_{m}$, and Formula$\mmb{\varphi}_{m}=\mmb{\theta}^{\prime}_{m}$ with the average on Formula$c_{x}(x>m)$ absorbed in the leading Formula$\varepsilon$.

Expressing Formula$P_{\theta}\left(x^{n}\right)$ as Formula TeX Source $$P_{\theta}\left(x^{n}\right)=\left[\prod_{i\leq m}\theta_{i}^{n_{x}(i)}\right]\cdot S_{m}^{n_{x}(x>m)}\cdot\prod_{i>m}\left({{\theta_{i}}\over{S_{m}}}\right)^{n_{x}(i)}\eqno{\hbox{(66)}}$$ and defining Formula$\delta_{S}\triangleq S_{m}-S^{\prime}_{m}$, the first term of (65) is bounded, for the upper region of Formula$m$, by Formula TeX Source $$\eqalignno{& E_{\theta}\log {{P_{\theta}\left(X^{n}\right)}\over{P\left(X^{n}\;\vert\;m, S^{\prime}_{m},\mmb{\theta}^{\prime}_{m}\right)}}\cr &\quad\leq E_{\theta}\left[\sum\limits_{i=1}^{m}N_{x}(i)\log {{\theta_{i}}\over{\theta^{\prime}_{i,m}}}+N_{x}\left(X>m\right)\log {{S_{m}}\over{S^{\prime}_{m}}}+\right.\cr &\qquad\left.\sum\limits_{i>m}N_{x}(i)\log {{\theta_{i}/S_{m}}\over{N_{x}(i)/N_{x}(X>m)}}\right]\cr &\quad\buildrel{(a)}\over{\leq}n\cdot\sum\limits_{i=1}^{m}\theta_{i}\log {{\theta_{i}}\over{\theta^{\prime}_{i,m}}}+n S_{m}\log {{S_{m}}\over{S^{\prime}_{m}}}\cr &\quad\buildrel{(b)}\over{\leq}n (\log e)\left[\left(\sum\limits_{i=1}^{m}{{\delta_{i}^{2}}\over{\theta^{\prime}_{i,m}}}\right)+{{\delta_{S}^{2}}\over{S^{\prime}_{m}}}\right]\cr &\quad\buildrel{(c)}\over{\leq}(\log e)\cdot{{n\cdot n^{\rho}}\over{n^{\rho+2\alpha}}}+2 (\log e) n^{1-\rho-4\alpha}\cdot\sum\limits_{j=1}^{J_{\rho}}k_{j}n^{j\beta}+\log e\cr &\quad\buildrel{(d)}\over{\leq}5(\log e) n^{1-2\alpha}+\log e&{\hbox{(67)}}}$$ where Formula$(a)$ is since for the third term, the conditional ML probability used for coding is greater than the actual conditional probability assigned to all letters greater than Formula$m$ for every Formula$x^{n}$. Hence, the third term is bounded by 0. Expectation is performed for the other terms. Inequality Formula$(b)$ is obtained similarly to (31) where quantization includes the first Formula$m$ components of Formula$\mmb{\theta}$ and the parameter Formula$S_{m}$. Then, inequality Formula$(c)$ follows the same reasoning as step Formula$(a)$ of (43). 
The first term bounds the worst case in which all Formula$n^{\rho}$ symbols are quantized to Formula$1/n^{\rho+2\alpha}$ with Formula$\vert\delta_{i}\vert\leq 1/n^{\rho+2\alpha}$. The second term is obtained where Formula$\theta^{\prime}_{i,m}\geq n^{(j-1)\beta}/n^{\rho+2\alpha}$ and Formula$\vert\delta_{i}\vert\leq n^{j\beta}/n^{\rho+3\alpha}$ for Formula$\theta_{i}\in I_{j}$, and Formula$k_{j}=\vert\theta_{i}\in I_{j}\vert$ as before. Offsetting of probabilities smaller than Formula$\xi_{1}$, if required, results, similarly to (27), in Formula$\vert\delta_{i}\vert\leq n^{j\beta}/n^{\rho+3\alpha}+\gamma^{\prime}_{i}\theta^{\prime}_{i}/n^{2\alpha}$ where Formula$\gamma^{\prime}_{i}>0$ is some constant, and adds negligibly to both terms. The last term of Formula$(c)$ holds because Formula$S^{\prime}_{m}\geq 1/n$ and Formula$\vert\delta_{S}\vert\leq 1/n$. Finally, Formula$(d)$ is obtained similarly to step Formula$(b)$ of (43), where as in (33), Formula$\sum k_{j}n^{j\beta}\leq 2n^{\rho+2\alpha}$. For Formula$m=o(n^{1/3})$, the same initial steps up to step Formula$(b)$ in (67) are applied. The remaining steps in (31) are then applied with Formula$m$ replacing Formula$k$, yielding a total quantization cost of Formula$5(1+o(1))(\log e)m+\log e$.

To bound the third and fourth terms of (65), Formula TeX Source $$P_{\theta}\left(i\in X^{n}\right)=1-\left(1-\theta_{i}\right)^{n}\leq n\theta_{i}.\eqno{\hbox{(68)}}$$ Similarly, Formula TeX Source $$E_{\theta}C_{x}(X>m)=\sum_{i>m}P_{\theta}\left(i\in X^{n}\right)\leq n S_{m}.\eqno{\hbox{(69)}}$$ Combining the dominant terms of the third and fourth terms of (65), we have Formula TeX Source $$\eqalignno{& 2\sum\limits_{i>m}P_{\theta}\left(i\in X^{n}\right)\log i+(1+\varepsilon)E_{\theta}C_{x}(X>m)\log n\cr &\!\!\quad\buildrel{(a)}\over{=}\sum\limits_{i>m}P_{\theta}\left(i\in X^{n}\right)\left[2\log i+(1+\varepsilon)\log n\right]\cr &\!\!\quad\buildrel{(b)}\over{\leq}\left(2+{{1+\varepsilon}\over{\rho}}\right)\sum\limits_{i>m}P_{\theta}\left(i\in X^{n}\right)\log i\cr &\!\!\quad\buildrel{(c)}\over{\leq}\left(2+{{1+\varepsilon}\over{\rho}}\right)n\sum\limits_{i>m}\theta_{i}\log i&{\hbox{(70)}}}$$ where Formula$(a)$ is because Formula$E_{\theta}C_{x}(X>m)=\sum_{i>m}P_{\theta}\left(i\in X^{n}\right)$, Formula$(b)$ is because for Formula$i>m=n^{\rho}$, Formula$\log i>\rho\log n$, and Formula$(c)$ follows from (68). Given Formula$\rho>\varepsilon$ for an arbitrary fixed Formula$\varepsilon>0$, the resulting coefficient above is upper bounded by some constant Formula$\kappa$.
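
The bounds (68) and (69) can be checked numerically on a concrete distribution. The sketch below uses a geometric distribution as in (79) purely as an example; the function name, the parameter values, and the truncation point `cutoff` are assumptions made here. Truncating the infinite sum can only lower the left-hand side, so the comparison with Formula$nS_{m}$ remains valid.

```python
def expected_distinct_tail(theta, n, m, cutoff):
    """Sum of P(i in X^n) = 1 - (1 - theta(i))**n over m < i <= cutoff."""
    return sum(1.0 - (1.0 - theta(i)) ** n for i in range(m + 1, cutoff + 1))

# Hypothetical example: geometric distribution with p = 0.5.
p, n, m = 0.5, 100, 5
theta = lambda i: p * (1.0 - p) ** (i - 1)
s_m = (1.0 - p) ** m          # exact tail mass of the geometric distribution
expected = expected_distinct_tail(theta, n, m, cutoff=200)
bound = n * s_m               # right-hand side of (69)
```

The gap between `expected` and `bound` reflects that Formula$n\theta_{i}$ overestimates the probability of occurrence for symbols just above Formula$m$, whose probabilities are not yet much smaller than Formula$1/n$.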

Summing up the contributions of the terms of (65) from (31), (62), and (70), absorbing second-order terms in a leading Formula$\varepsilon^{\prime}$, we obtain that for Formula$m=o(n^{1/3})$, Formula TeX Source $$n R_{n}\left(L^{\ast},\mmb{\theta}\right)\leq\left(1+\varepsilon^{\prime}\right){{m-1}\over{2}}\log {{n}\over{m^{3}}}+\kappa n\sum_{i>m}\theta_{i}\log i.\eqno{\hbox{(71)}}$$ For the second region, substituting Formula$\alpha=1/3$, and summing up the contributions of (67), (63), and (70) to (65), absorbing second-order terms in Formula$\varepsilon^{\prime}$, we obtain Formula TeX Source $$\eqalignno{&\displaystyle n R_{n}\left(L^{\ast},\mmb{\theta}\right)\cr &\!\!\quad\leq (1+\varepsilon^{\prime}){{1}\over{2}}\left(\rho+{{1}\over{3}}\right)\left(\rho+\varepsilon^{\prime}-{{1}\over{3}}\right)\left(\log n\right)^{2}n^{1/3}+\cr &\qquad\kappa n\sum_{i>m}\theta_{i}\log i.&{\hbox{(72)}}}$$ Using the value of Formula$\alpha$ in (45) instead would yield a tighter expression of Formula$\Theta\left(n^{1/3}(\log n)^{4/3}\right)$ for the first term, and then the value of Formula$\nu$ can be optimized to minimize the leading coefficient. Since (71) and (72) hold for every Formula$m>n^{\varepsilon}$, there exists Formula$m^{\ast}$ for which the minimal bound is obtained. To bound the redundancy, we choose this Formula$m^{\ast}$. Now, if the condition in (48) holds, then the second term in (71) and (72) is negligible w.r.t. the first term. Absorbing it in a leading Formula$\varepsilon$ and normalizing by Formula$n$ yields the upper bound of (49), and concludes the proof of Part I of Theorem 5.

For Part II of Theorem 5, we consider the bound of the second region in (72). If there exists Formula$\rho^{\ast}=o\left(n^{1/3}/(\log n)\right)$ for which the condition in (50) holds, then both terms of (72) are of Formula$o(n)$, yielding a total redundancy per symbol of Formula$o(1)$. The proof of Theorem 5 is concluded. Formula$\hfill \square$

Now, consider the upper region in (65) with parameters Formula$\alpha$ and Formula$\rho$ taking any valid value. (The code leading to the bound of the upper region can be applied even if the actual effective alphabet size is in the lower region.) We can sum up the contributions of (67), (63), and (70) to (65), absorbing second-order terms in Formula$\varepsilon$. Equation (63) is valid without the middle Formula$\varepsilon$ term as long as Formula$\rho\geq\alpha+\varepsilon$. Since, in the upper region of Formula$m$, Formula$i\geq m$ is large enough, Elias' code for the integers can be used costing Formula$(1+\varepsilon)\log i$ to code Formula$i$, with Formula$\varepsilon>0$ which can be made arbitrarily small. Hence, the leading coefficient of the bound in (70) can be replaced by Formula$(1+\varepsilon)(1+1/\rho)$. This yields the expression bounding the redundancy in (52). This expression applies to every valid choice of Formula$\alpha$ and Formula$\rho$, including the choice that minimizes the expression. Thus, the proof of Theorem 6 is concluded. Formula$\hfill \square$

To prove Corollary 2, we use Wyner's inequality [41], which implies that for a finite entropy monotonic distribution Formula TeX Source $$\sum_{i\geq 1}\theta_{i}\log i=E_{\theta}\left[\log X\right]\leq H_{\theta}\left[X\right].\eqno{\hbox{(73)}}$$ Fix an arbitrarily small Formula$\varepsilon>0$. Since the sum on the left-hand side of (73) is finite if Formula$H_{\theta}[X]$ is finite, there must exist some Formula$n_{0}$ such that Formula$\sum_{i>n_{0}}\theta_{i}\log i<\varepsilon$. Let Formula$n>n_{0}$, then for Formula$m^{\ast}=n$ and Formula$\rho^{\ast}=1$, using Theorem 6 with any Formula$\alpha\in (0, 1)$, we obtain Formula$R_{n}\left(L^{\ast},\mmb{\theta}\right)<\kappa\varepsilon$ for some fixed constant Formula$\kappa>0$. The proof of Corollary 2 is concluded. Formula$\hfill \blacksquare$
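
The inequality (73) can be illustrated numerically. For a monotonic distribution, Formula$\theta_{i}\leq 1/i$, so Formula$\log i\leq-\log\theta_{i}$ term by term, which is the standard argument behind the inequality. The sketch below checks both facts for a geometric distribution with Formula$p=0.5$; the function name and the truncation point are assumptions made here.

```python
import math

def wyner_check(theta, cutoff):
    """Return (E[log2 X], H(X)), with both sums truncated at `cutoff`."""
    e_log = sum(theta(i) * math.log2(i) for i in range(1, cutoff + 1))
    entropy = -sum(
        theta(i) * math.log2(theta(i)) for i in range(1, cutoff + 1)
    )
    return e_log, entropy

# Hypothetical example: theta_i = 2**(-i), whose entropy is 2 bits.
geometric = lambda i: 0.5 ** i
e_log, entropy = wyner_check(geometric, cutoff=60)
# Monotonicity implies i * theta_i <= 1 for every i.
termwise_ok = all(i * geometric(i) <= 1.0 for i in range(1, 61))
```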

B. Examples

We demonstrate the use of the bounds of Theorems 5 and 6 with three typical distributions over the integers. We specifically show that the redundancy rate of Formula$O\left(n^{1/3+\varepsilon}\right)$ bits overall is achievable when coding sequences generated by many of the typical monotonic distributions, and, in fact, for many distributions faster convergence rates are achievable with the codes proposed. The examples render the assumption reflected in conditions (48) and (50), that very few large symbols appear in Formula$x^{n}$, very practical. Specifically, in the phone book example, there may be many rare names, but only very few of them may occur in a certain city. The more common names can constitute most of a possible phone book sequence.

1) Zipf Distribution

Consider the monotonic distributions over the integers [42], [43] of the form Formula TeX Source $$\theta_{i}={{a}\over{i^{1+\gamma}}},\;i=1,2,\ldots,\eqno{\hbox{(74)}}$$ where Formula$\gamma>0$, and Formula$a$ is a normalization coefficient that guarantees that the probabilities over all integers sum to 1. Approximating summation by integration, we can show that Formula TeX Source $$\eqalignno{S_{m}\leq &\,{{a}\over{\gamma m^{\gamma}}}&{\hbox{(75)}}\cr \sum_{i>m}\theta_{i}\log i\leq &\,{{a}\over{\ln 2}}\left[{{\ln m}\over{\gamma m^{\gamma}}}+{{1}\over{\gamma^{2}m^{\gamma}}}\right]\cr =&\,\left(1+\varepsilon\right){{a\log m}\over{\gamma m^{\gamma}}}&{\hbox{(76)}}}$$ where the last equality holds for Formula$m\rightarrow\infty$ with some fixed Formula$\varepsilon>0$. For Formula$m=n^{\rho}$ and fixed Formula$\rho$, the sum in (48) is thus Formula$O\left(n^{1-\rho\gamma}\log n\right)$, which is Formula$o\left(n^{1/3}(\log n)^{2}\right)$ (and even Formula$o\left(n^{1/3}(\log n)^{4/3}\right)$ if the tighter form of the bound is considered) for every Formula$\rho\geq 2/(3\gamma)$, thus satisfying the negligibility condition (48) at least relative to the second region of (47). As long as Formula$\gamma\leq 2$ (slow decay), the minimal value of Formula$\rho$ required to guarantee negligibility of the sum in (48) is greater than 1/3. Using Theorem 5, this implies that for Formula$\gamma\leq 2$, the second (upper) region of the upper bound in (49) holds with the minimal choice of Formula$\rho^{\ast}=2/(3\gamma)$. Plugging in this value in the second region of (47) [i.e., in (49)] yields the upper bound shown below for this region. For Formula$\gamma>2$, Formula$2/(3\gamma)<1/3$. Hence, (48) holds for Formula$m^{\ast}=o\left(n^{1/3}\right)$. This means that for the distribution in (74) with Formula$\gamma>2$, the effective alphabet size is Formula$o\left(n^{1/3}\right)$, and thus the achievable redundancy is in the first region of the bound of (49). 
Thus, even though the distribution is over an infinite alphabet, its compressibility behavior is similar to a distribution over a relatively small alphabet. To find the exact redundancy rate, we balance between the contributions of (62) and (70) in (65). As long as Formula$1-\rho\gamma<\rho$, condition (48) holds, and the contribution of rare letters in (70) is negligible w.r.t. the other terms of the redundancy. Equality, implying Formula$\rho^{\ast}=1/(1+\gamma)$, achieves the minimal redundancy rate. Thus, for Formula$\gamma>2$, Formula TeX Source $$\eqalignno{& n R_{n}\left(L^{\ast},\mmb{\theta}\right)\cr &\enspace\buildrel{(a)}\over{\leq}\!\left(1+\varepsilon\right)\left[{{a (2\rho^{\ast}\!+1)}\over{\gamma}}n^{1-\rho^{\ast}\gamma}\log n\!+{{n^{\rho^{\ast}}}\over{2}}\left(1-3\rho^{\ast}\right)\log n\right]\cr &\enspace\buildrel{(b)}\over{=}\left(1+\varepsilon\right)\left({{a{{3+\gamma}\over{1+\gamma}}}\over{\gamma}}+{{1-{{3}\over{1+\gamma}}}\over{2}}\right)n^{{1}\over{1+\gamma}}\log n&{\hbox{(77)}}}$$ where the first term in Formula$(a)$ follows from the bounds in (70) and (76), with Formula$m=n^{\rho^{\ast}}$, and the second term from that in (62), and Formula$(b)$ follows from Formula$\rho^{\ast}=1/(1+\gamma)$. Note that for a fixed Formula$\rho^{\ast}$, the factor 3 in the first term can be reduced to 2 with Elias' coding for the integers. The results described are summarized in the following corollary.

Corollary 3

Let Formula$\mmb{\theta}\in{\cal M}$ be defined in (74). Then, there exists a universal code with length function Formula$L^{\ast}(\cdot)$ that has only prior knowledge that Formula$\mmb{\theta}\in{\cal M}$, that can achieve universal coding redundancy Formula TeX Source $$\eqalignno{&\displaystyle R_{n}\left(L^{\ast},\mmb{\theta}\right)\leq&{\hbox{(78)}}\cr &\quad\cases{\left(1+\varepsilon\right){{1}\over{18}}\left(1+{{2}\over{\gamma}}\right)\left({{2}\over{\gamma}}+\varepsilon-1\right){{n^{1/3}(\log n)^{2}}\over{n}},&$\gamma\leq 2,$\cr \left(1+\varepsilon\right)\left({{a{{3+\gamma}\over{1+\gamma}}}\over{\gamma}}+{{1-{{3}\over{1+\gamma}}}\over{2}}\right){{n^{{1}\over{1+\gamma}}\log n}\over{n}}, &$\gamma>2.$}}$$ Corollary 3 gives the redundancy rates for all distributions defined in (74). With a tighter form of the bound (choosing Formula$\alpha$ as in (45) and applying to Theorem 6), a tighter bound of Formula$\Theta\left(n^{1/3}(\log n)^{4/3}/n\right)$ can be obtained for the first region. Using the looser bound of Corollary 3, if, for example, Formula$\gamma=1$, the redundancy is Formula$O\left(n^{1/3}(\log n)^{2}\right)$ bits overall with coefficient 1/6. For Formula$\gamma=3$, Formula$O(n^{1/4}\log n)$ bits are required. For faster decays (greater Formula$\gamma$) even smaller redundancy rates are achievable.
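
The rates in Corollary 3 can be tabulated for specific values of Formula$\gamma$. The following sketch evaluates the exponents and the upper-region coefficient of (78) with Formula$\varepsilon$ set to 0; the function names are assumptions introduced here, and exact rational arithmetic is used only to make the comparison with the values quoted above unambiguous.

```python
from fractions import Fraction

def zipf_rate(gamma):
    """Exponent of n and power of log n in the bound on n * R_n from (78)."""
    gamma = Fraction(gamma)
    if gamma <= 2:
        # Slow decay: upper region with rho* = 2 / (3 * gamma).
        return Fraction(1, 3), 2
    # Fast decay: balanced choice rho* = 1 / (1 + gamma).
    return 1 / (1 + gamma), 1

def zipf_upper_region_coefficient(gamma):
    """Leading coefficient of the gamma <= 2 branch of (78), epsilon = 0."""
    gamma = Fraction(gamma)
    return Fraction(1, 18) * (1 + 2 / gamma) * (2 / gamma - 1)

rate_slow = zipf_rate(1)      # gamma = 1
rate_fast = zipf_rate(3)      # gamma = 3
coeff_slow = zipf_upper_region_coefficient(1)
```

For Formula$\gamma=1$ this reproduces the coefficient 1/6, and for Formula$\gamma=3$ the exponent 1/4, as stated in the text.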

2) Geometric Distributions

Geometric distributions given by Formula TeX Source $$\theta_{i}=p\left(1-p\right)^{i-1};\;i=1,2,\ldots,\eqno{\hbox{(79)}}$$ where Formula$0<p<1$, decay even faster than the Zipf distribution in (74). Thus, their effective alphabet sizes are even smaller. This implies that a universal code can have even smaller redundancy than that presented in Corollary 3, when coding sequences generated by a geometric distribution (even if this is unknown in advance, and the only prior knowledge is that Formula$\mmb{\theta}\in{\cal M}$). Choosing Formula$m=\ell\cdot\log n$, the contribution of low probability symbols in (70) to (65) can be upper bounded by Formula TeX Source $$\eqalignno{& 2n\sum\limits_{i>m}\theta_{i}\left(\log i+\log n\right)&{\hbox{(80)}}\cr &\quad\buildrel{(a)}\over{\leq}2n (1-p)^{m}\log n+O\left(n (1-p)^{m}\log m\right)\cr &\quad\buildrel{(b)}\over{=}2 n^{1+\ell\log (1-p)}(\log n)+O\left(n^{1+\ell\log (1-p)}\log \log n\right)}$$ where Formula$(a)$ follows from computing Formula$S_{m}$ using geometric series, and bounding the second term, and Formula$(b)$ follows from substituting Formula$m=\ell\log n$ and representing Formula$(1-p)^{\ell\log n}$ as Formula$n^{\ell\log (1-p)}$. As long as Formula$\ell\geq 1/(-\log (1-p))$, the expression in (80) is Formula$O(\log n)$, thus negligible w.r.t. the redundancy upper bound of (49) with Formula$m^{\ast}=\ell^{\ast}\log n=(\log n)/(-\log (1-p))$. Substituting this Formula$m^{\ast}$ in (49), we obtain the following corollary.

Corollary 4

Let Formula$\mmb{\theta}\in{\cal M}$ be a geometric distribution defined in (79). Then, there exists a universal code with length function Formula$L^{\ast}(\cdot)$ that has only prior knowledge that Formula$\mmb{\theta}\in{\cal M}$, that can achieve universal coding redundancy Formula TeX Source $$R_{n}\left(L^{\ast},\mmb{\theta}\right)\leq{{1+\varepsilon}\over{-2\log (1-p)}}\cdot{{(\log n)^{2}}\over{n}}.\eqno{\hbox{(81)}}$$ Corollary 4 shows that if Formula$\mmb{\theta}$ parameterizes a geometric distribution, sequences governed by Formula$\mmb{\theta}$ can be coded with average universal coding redundancy of Formula$O\left((\log n)^{2}\right)$ bits. Their effective alphabet size is Formula$O(\log n)$, implying that larger symbols are very unlikely to occur. For example, for Formula$p=0.5$, the effective alphabet size is Formula$\log n$, and Formula$0.5 (\log n)^{2}$ bits are required for a universal code. For Formula$p=0.75$, the effective alphabet size is Formula$(\log n)/2$, and Formula$(\log n)^{2}/4$ bits are required by a universal code.
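
The two numerical examples above follow directly from (81). The sketch below evaluates the effective alphabet size and the leading redundancy term for a given Formula$p$ and Formula$n$, with Formula$\varepsilon$ set to 0 and logarithms to base 2; the function names and the choice Formula$n=1024$ are assumptions made here.

```python
import math

def geometric_effective_alphabet(p, n):
    """m* = (log n) / (-log(1 - p)), logarithms base 2."""
    return math.log2(n) / (-math.log2(1.0 - p))

def geometric_redundancy_bits(p, n):
    """Leading term of n * R_n from (81), with epsilon set to 0."""
    return math.log2(n) ** 2 / (-2.0 * math.log2(1.0 - p))

n = 1024  # hypothetical sequence length, log n = 10
m_half = geometric_effective_alphabet(0.5, n)
bits_half = geometric_redundancy_bits(0.5, n)
m_three_quarters = geometric_effective_alphabet(0.75, n)
bits_three_quarters = geometric_redundancy_bits(0.75, n)
```

With Formula$\log n=10$, the case Formula$p=0.5$ gives an effective alphabet of 10 symbols and 50 bits, and Formula$p=0.75$ gives 5 symbols and 25 bits, matching Formula$0.5(\log n)^{2}$ and Formula$(\log n)^{2}/4$.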

3) Slow Decaying Distributions Over the Integers

Up to now, we considered fast decaying distributions, which all achieved the Formula$O(n^{1/3+\varepsilon}/n)$ redundancy rate. We now consider a slowly decaying monotonic distribution over the integers, given by Formula TeX Source $$\theta_{i}={{a}\over{i\left(\log i\right)^{2+\gamma}}},\;i=2,3,\ldots\eqno{\hbox{(82)}}$$ where Formula$\gamma>0$ and Formula$a$ is a normalizing factor (see, e.g., [14], [32], [33]). This distribution has finite entropy only if Formula$\gamma>0$ (but is a valid infinite entropy distribution for Formula$\gamma>-1$). Unlike the previous distributions, we need to use Theorem 6 to bound the redundancy for coding sequences generated by this distribution. Approximating the sum with an integral, the order of the third term of (52) is Formula TeX Source $$n\sum_{i>m}\theta_{i}\log i=O\left({{n}\over{(\log m)^{\gamma}}}\right).\eqno{\hbox{(83)}}$$ In order to minimize the redundancy bound of (52), we define Formula$\rho=n^{\ell}$. For the minimum rate, all terms of (52) must be balanced. To achieve that, we must have Formula TeX Source $$\alpha+2\ell=1-2\alpha=1-\gamma\ell.\eqno{\hbox{(84)}}$$ The solution is Formula$\alpha=\gamma/(4+3\gamma)$ and Formula$\ell=2/(4+3\gamma)$. Substituting these values in the expression of (52), with Formula$\rho=n^{\ell}$, results in the first term in (52) dominating, and yields the following corollary.

Corollary 5

Let Formula$\mmb{\theta}\in{\cal M}$ be defined in (82) with Formula$\gamma>0$. Then, there exists a universal code with length function Formula$L^{\ast}(\cdot)$ that has only prior knowledge that Formula$\mmb{\theta}\in{\cal M}$, that can achieve universal coding redundancy Formula TeX Source $$R_{n}\left(L^{\ast},\mmb{\theta}\right)\leq\left(1+\varepsilon\right){{n^{{\gamma+4}\over{3\gamma+4}}(\log n)^{2}}\over{2n}}.\eqno{\hbox{(85)}}$$ In a similar manner to the Zipf distribution, the tighter form of the general upper bound can be used, reducing the Formula$(\log n)^{2}$ term to Formula$\Theta\left((\log n)^{4/3}\right)$ (with a different leading coefficient). Due to the slow decay rate of the distribution in (82), the effective alphabet size is much greater here. For Formula$\gamma=1$, for example, it is Formula$n^{n^{2/7}}$. This implies that very large symbols are likely to appear in Formula$x^{n}$. As Formula$\gamma$ increases though, the effective alphabet size decreases, and as Formula$\gamma\rightarrow\infty$, Formula$m\rightarrow n$. The redundancy rate increases due to the slow decay. For Formula$\gamma\geq 1$, it is Formula$O\left(n^{5/7}(\log n)^{2}/n\right)$. As Formula$\gamma\rightarrow\infty$, since the distribution tends to decay faster, the redundancy rate tends to the finite alphabet rate of Formula$O\left(n^{1/3}(\log n)^{2}/n\right)$. However, as the decay becomes slower (Formula$\gamma\rightarrow 0$), a nondiminishing redundancy rate is approached. Note that the proof of Theorem 6 does not limit the distribution to a finite entropy one. Therefore, the bound of (85) applies, in fact, also to Formula${-}{1}<\gamma\leq 0$. However, for Formula$\gamma\leq 0$, the per-symbol redundancy is no longer diminishing.
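
The balancing step in (84) that leads to (85) can be verified with exact arithmetic. The sketch below solves (84) for Formula$\alpha$ and Formula$\ell$ and returns the resulting exponent of Formula$n$ in Formula$nR_{n}$; the function name is an assumption introduced here.

```python
from fractions import Fraction

def balance_slow_decay(gamma):
    """Solve alpha + 2*ell = 1 - 2*alpha = 1 - gamma*ell, as in (84).

    Returns (alpha, ell, exponent), where exponent = 1 - 2*alpha is the
    power of n in the bound on n * R_n given by (85).
    """
    gamma = Fraction(gamma)
    ell = Fraction(2) / (4 + 3 * gamma)
    alpha = gamma / (4 + 3 * gamma)
    return alpha, ell, 1 - 2 * alpha

alpha, ell, exponent = balance_slow_decay(1)   # gamma = 1
```

For Formula$\gamma=1$ this gives Formula$\alpha=1/7$ and Formula$\ell=2/7$, hence the effective alphabet size Formula$n^{n^{2/7}}$ and the exponent 5/7 quoted above.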



In this section, we show that if we have side information of the monotonicity of the distribution governing an individual sequence (i.e., its ML distribution), we can universally compress the individual sequence as well as (and even better than) the average case. We next show that in this case the lower bound of Theorem 3 is asymptotically achieved. Moreover, the upper bound derived here for the upper region is tighter than the bounds obtained in Theorem 4 and Corollary 1 for the average case. The reason is that, with the additional side information that Formula${\hat{\mmb{\theta}}}\in{\cal M}$, we restrict the smallest nonzero symbol probability to Formula$1/n$. This is not the case in the average case, where symbols from a long tail of the distribution can have unordered occurrences in a given sequence. For a specific sequence, we can have Formula${\hat{\mmb{\theta}}}\not\in{\cal M}$, but we still need to describe Formula${\hat{\mmb{\theta}}}_{\cal M}\in{\cal M}$ for that sequence. The distributions Formula${\hat{\mmb{\theta}}}_{\cal M}$ may have probability parameters smaller than Formula$1/n$ for symbols Formula$i\not\in x^{n}$ and Formula$j\in x^{n}$, where Formula$j>i$ (recalling the assumption that for Formula$\mmb{\theta}\in{\cal M}$, we must have Formula$\theta_{i}\geq\theta_{j}$).

The side information assumed restricts the set of allowable sequences to those which obey the monotonicity, omitting all sequences for which Formula${\hat{\mmb{\theta}}}\ne{\hat{\mmb{\theta}}}_{\cal M}$ from the set considered. This means that the class considered is smaller than the class considered for the lower bound in Theorem 3. However, in proving Theorem 3, all sequences that do not obey the monotonicity are excluded from the Shtarkov sum [step Formula$(b)$ of (13)], essentially rendering the bound also as a bound on the class containing only sequences that obey the monotonicity requirement.

If one assumes some monotonicity on the symbol probabilities, but the observed sequence deviates from this assumption, the code proposed can still be used to describe the probabilities of the symbols that obey the monotonicity. An additional description is added as a prefix to the code to describe the number of symbols that do not obey the monotonicity, and then Formula$O(\log n)$ bits are used for each such symbol to describe its occurrence count. If the maximal symbol in Formula$x^{n}$ is Formula${\hat{k}}$, as long as Formula$o({\hat{k}})$ symbols are out of order for Formula${\hat{k}}\leq n^{1/3}$ or Formula$o (n^{1/3})$ symbols are out of order for greater Formula${\hat{k}}$, the additional cost of coding the symbols violating the monotonicity is negligible. The method described below can thus still be used. Moreover, it can be shown (see, e.g., [34]) that as long as the largest symbol is polynomial in Formula$n$ and there are not too many symbols larger than Formula$n$, diminishing redundancy w.r.t. the monotonic ML probability Formula${\hat{\mmb{\theta}}}_{\cal M}$ can be achieved in coding any such Formula$x^{n}$. However, this result does not imply a cheaper total description length than the one using the true ML Formula${\hat{\mmb{\theta}}}$ of Formula$x^{n}$, as the loss in using Formula${\hat{\mmb{\theta}}}_{\cal M}$ instead of Formula${\hat{\mmb{\theta}}}$ may dominate the redundancy savings.

Finally, the class of the distributions of all sequences with Formula${\hat{k}}$ symbols that obey the monotonicity is identical to the class of the distributions of all patterns with Formula${\hat{k}}$ indices for a given Formula${\hat{k}}$. The Shtarkov sum on the ML sequence probabilities is not equal in these cases because the pattern ML sequence probability is the sum over the probabilities of all permutations of these sequences. However, the method used for describing the ML i.i.d. distribution within this class can be used to derive tight bounds for coding patterns. (Bounding the quantization cost for patterns, on the other hand, is more complicated.) This was not the case when addressing the average case, as the description cost in the average case for monotonic distributions is more complicated due to the use of the monotonic ML over all sequences, including those not obeying the monotonicity. This section is concluded with the theorem that upper bounds the individual sequence redundancy and its proof.

Theorem 7

Fix an arbitrarily small Formula$\varepsilon>0$, and let Formula$n\rightarrow\infty$. Let Formula$x^{n}$ be a sequence for which Formula${\hat{\mmb{\theta}}}\in{\cal M}$, i.e., Formula${\hat{\theta}}_{1}\geq{\hat{\theta}}_{2}\geq\ldots$. Let Formula$k={\hat{k}}$ be the number of letters occurring in Formula$x^{n}$. Then, there exists a code Formula$L^{\ast}\left(\cdot\right)$ that achieves individual sequence redundancy w.r.t. Formula${\hat{\mmb{\theta}}}_{\cal M}={\hat{\mmb{\theta}}}$ for Formula$x^{n}$ which is upper bounded by Formula TeX Source $$\eqalignno{&\displaystyle{\hat{R}}_{n}\left(L^{\ast}, x^{n}\right)\leq&{\hbox{(86)}}\cr &\quad\cases{\left(1+\varepsilon\right){{k-1}\over{2n}}\log {{n}\over{k^{3}}}, &$k=o\left(n^{1/3}\right)$\cr \left(1+\varepsilon\right){{k-1}\over{2n}}\log {{n\left(\log n\right)^{2}}\over{k^{3}}}, &$k\leq n^{1/3}$\cr {{\left(0.79\log {{k}\over{n^{1/3-\varepsilon}}}+0.14\log n\right) (n\log n)^{1/3}}\over{n}}, &$n^{1/3}<k=o(n)$\cr {{0.4\left(\log n\right)^{4/3}n^{1/3}}\over{n}}, &$k=O(n)$.}}$$

Note that by the monotonicity constraint, the number of symbols Formula${\hat{k}}$ occurring in Formula$x^{n}$ also equals the maximal symbol in Formula$x^{n}$. Since, in the individual sequence case, this maximal symbol defines the class considered, and to be consistent with Theorem 3, we use Formula$k$ to characterize the alphabet size of a given sequence. Since Formula${\hat{\mmb{\theta}}}$ is monotonic, Formula${\hat{\mmb{\theta}}}_{\cal M}={\hat{\mmb{\theta}}}$.

Proof [Theorem 7]

The proof builds on that of Theorem 4 and Corollary 1. Both regions of the proof apply here, where instead of quantizing Formula$\mmb{\theta}$ to Formula$\mmb{\theta}^{\prime}$, we quantize Formula${\hat{\mmb{\theta}}}$ to Formula${\hat{\mmb{\theta}}}^{\prime}$ in a similar manner, and do not need to average over all sequences. Instead of using any general Formula${\hat{\mmb{\varphi}}}$ to code Formula$x^{n}$, we can use Formula${\hat{\mmb{\theta}}}^{\prime}$ without any additional optimizations, where Formula$\log n$ bits describe Formula$k$. The first two regions of (86) are then proved similarly to these regions in Theorem 4.

To prove the bounds of the upper regions, which are tighter than those of Corollary 1, we make several modifications based on now using three major intervals (as in [35]) instead of two. To describe Formula${\hat{\mmb{\theta}}}^{\prime}$, using parameter Formula$\alpha$, describe the components of Formula${\hat{\mmb{\theta}}}^{\prime}$ separately for three intervals Formula$(1/n, n^{\alpha}/n]$, Formula$(n^{\alpha}/n, 1/n^{\alpha}]$, and Formula$(1/n^{\alpha}, 1]$. For the bottom interval, use Formula$n^{\alpha}\log n$ bits to describe all probability parameters in this interval. For each of the Formula$n^{\alpha}$ points in this interval use at most Formula$\log n$ bits to describe the multiplicity of these values in Formula${\hat{\mmb{\theta}}}$. The top interval consists of at most Formula$n^{\alpha}$ probability parameters. Use at most Formula$\log n$ bits to describe the value of each. For both intervals, no quantization is necessary, and the components of Formula${\hat{\mmb{\theta}}}^{\prime}$ are identical to those of Formula${\hat{\mmb{\theta}}}$.

As in [35], the middle interval is the one in which the parameters need to be quantized. Partition this interval into Formula$J^{\prime}_{2}\triangleq J^{+}_{2}-J^{-}_{2}$ smaller intervals, in a similar manner to (17) Formula TeX Source $$I_{j}=\left[{{n^{(j-1)\beta}}\over{n}},{{n^{j\beta}}\over{n}}\right),\;J^{-}_{2}\leq j\leq J^{+}_{2}\eqno{\hbox{(87)}}$$ where Formula$J^{-}_{2}$ and Formula$J^{+}_{2}$ coincide with the end points of the large middle interval. This results in Formula$J^{\prime}_{2}\leq (1-2\alpha)\log n$. Partition each interval into grid points with the spacing Formula TeX Source $$\Delta^{\prime^{(2)}}_{j}={{n^{j\beta}}\over{n^{1+\alpha}}}.\eqno{\hbox{(88)}}$$ Similarly to (40), this yields Formula TeX Source $$B^{\prime}_{2}\leq 0.5 (1-2\alpha) n^{\alpha}\log n\eqno{\hbox{(89)}}$$ grid points. Following a similar derivation to that in (42), the description cost of Formula${\hat{\mmb{\theta}}}^{\prime}$ is bounded by Formula TeX Source $$\eqalignno{&\displaystyle L_{R}({\hat{\mmb{\theta}}}^{\prime})\leq&{\hbox{(90)}}\cr &\;\quad\cases{{{1+\varepsilon}\over{2}}(1-2\alpha) (\log n)\left(\log {{k}\over{n^{\alpha-\varepsilon}}}\right) n^{\alpha},&$n^{\alpha}<k\leq n^{1-\alpha}$\cr {{1+\varepsilon}\over{2}}(1-2\alpha)^{2}\left(\log n\right)^{2}n^{\alpha}, &$k>n^{1-\alpha}$}}$$ where the description cost of the upper and lower large intervals is absorbed in second-order terms, and the second Formula$1-2\alpha$ factor in the upper region results from Formula$k_{mid}\leq n^{1-\alpha}$ due to the lower limit Formula$n^{\alpha-1}$ of the middle large interval.
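The count of grid points in (89) can be reproduced for a concrete case. The sketch below makes several assumptions that are not stated in this section: that Formula$\beta=1/\log n$ with base-2 logarithms, so that Formula$n^{j\beta}=2^{j}$ (consistent with the powers Formula$2^{j}$ appearing in (91)), that Formula$n$ is a power of 2, and that Formula$\alpha\log n$ is an integer, so that the end points of the middle interval fall exactly on interval boundaries. The function name and the indexing of the first and last small interval are illustrative only.

```python
from fractions import Fraction

def middle_interval_grid(log2_n, alpha):
    """Count dyadic subintervals and grid points of the middle interval.

    Assumes n = 2**log2_n and that alpha*log2_n is an integer, so that
    the end points n**alpha/n and n**(-alpha) are exact powers of 2.
    """
    alpha = Fraction(alpha)
    a = alpha * log2_n
    if a.denominator != 1 or not 0 < alpha < Fraction(1, 2):
        raise ValueError("need 0 < alpha < 1/2 and integer alpha*log2_n")
    a = int(a)
    n = 2 ** log2_n
    n_alpha = 2 ** a
    j_minus = a + 1          # I_j starts at 2**(j-1)/n = n**alpha/n
    j_plus = log2_n - a      # I_j ends at 2**j/n = n**(-alpha)
    total = Fraction(0)
    for j in range(j_minus, j_plus + 1):
        length = Fraction(2 ** (j - 1), n)       # length of I_j in (87)
        spacing = Fraction(2 ** j, n * n_alpha)  # grid spacing in (88)
        total += length / spacing                # n**alpha / 2 points
    num_intervals = j_plus - j_minus + 1
    bound = Fraction(1, 2) * (1 - 2 * alpha) * n_alpha * log2_n  # (89)
    return num_intervals, total, bound

print(middle_interval_grid(30, Fraction(1, 3)))
```

Under these assumptions, each small interval contributes Formula$n^{\alpha}/2$ grid points regardless of Formula$j$, which is where the factor 0.5 in (89) comes from.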

The number of symbols with parameters in small interval Formula$J^{-}_{2}$ is upper bounded by Formula$k_{J^{-}_{2}}\leq n^{1-\alpha}$, then, Formula$k_{J^{-}_{2}+1}\leq (n^{1-\alpha}-k_{J^{-}_{2}})/2$, and so on. Similarly to (33), we have Formula TeX Source $$\sum_{j=J^{-}_{2}}^{J^{+}_{2}}k_{j}2^{j}\leq n^{1-\alpha}\cdot 2^{J^{-}_{2}-1}=n.\eqno{\hbox{(91)}}$$ Thus, following (43) and using (87) and (88), the quantization cost can be upper bounded by Formula TeX Source $$n (\log e)\sum_{i=1}^{k}{{\delta_{i}^{2}}\over{{\hat{\theta}}^{\prime}_{i}}}\leq{{2\log e}\over{n^{2\alpha}}}\sum_{j=J^{-}_{2}}^{J^{+}_{2}}k_{j}2^{j}=2 (\log e)n^{1-2\alpha}.\eqno{\hbox{(92)}}$$ There is thus a factor of 2 reduction over (43) because of the increased lower limit of the first point of quantized parameters.

Combining (90) and (92) for the second region of (90) Formula TeX Source $$n{\hat{R}}_{n}(L^{\ast}, x^{n})\!\leq\!{{1\!+\!\varepsilon}\over{2}}(1\!-\!2\alpha)^{2}n^{\alpha}(\log n)^{2}\!+\!2 (\log e) n^{1-2\alpha}\!.\eqno{\hbox{(93)}}$$ Choosing Formula$\alpha$ from (45) yields Formula TeX Source $$n{\hat{R}}_{n}\left(L^{\ast}, x^{n}\right)\leq\left(1+\varepsilon\right)\left({{\nu}\over{18}}+{{2\log e}\over{\nu^{2}}}\right) n^{1/3}(\log n)^{4/3}\eqno{\hbox{(94)}}$$ for this region. Taking Formula$\nu=(72\log e)^{1/3}\approx 4.7$ minimizes (94), resulting in a coefficient of less than 0.4 in (94). Letting Formula$n\rightarrow\infty$ and absorbing all second-order terms in the gap between this coefficient and 0.4 proves the last region of (86). Using the same value of Formula$\nu$ in the corresponding terms for the third region proves, in a similar manner, the bound for that region. A slightly tighter bound can be obtained for the third region if the value of Formula$\nu$ is optimized for the specific value of Formula$k$. Formula$\hfill \blacksquare$
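The choice of Formula$\nu$ in (94) is easy to verify numerically. The following sketch assumes that logarithms are taken to base 2, so that Formula$\log e\approx 1.4427$, which is consistent with redundancy measured in bits; the names `coefficient` and `nu_star` are illustrative and are not taken from the paper.

```python
import math

LOG2_E = math.log2(math.e)  # log e in bits (assumes base-2 logarithms)

def coefficient(nu):
    """Coefficient of n^(1/3) (log n)^(4/3) in (94), without (1 + eps)."""
    return nu / 18 + 2 * LOG2_E / nu ** 2

# Setting the derivative 1/18 - 4 log(e) / nu^3 to zero gives nu^3 = 72 log e.
nu_star = (72 * LOG2_E) ** (1 / 3)
print(f"nu* = {nu_star:.3f}, coefficient = {coefficient(nu_star):.4f}")
```

With the same Formula$\nu$, the two third-region constants in (86) are consistent with Formula$\nu/6$ and Formula$2(\log e)/\nu^{2}$, rounded up.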



Universal compression of sequences generated by monotonic distributions was studied. We showed that for finite alphabets, if one has the prior knowledge of the monotonicity of a distribution, one can reduce the cost of universality. For alphabets of Formula$o(n^{1/3})$ letters, this cost reduces from Formula$0.5\log (n/k)$ bits per unknown probability parameter to Formula$0.5\log (n/k^{3})$ bits per unknown probability parameter. Otherwise, for alphabets of Formula$O(n)$ letters, one can compress such sources with overall redundancy of Formula$O(n^{1/3+\varepsilon})$ bits. This is a significant decrease in redundancy from Formula$O(k\log n)$ or Formula$O(n)$ bits overall that can be achieved if no side information is available about the source distribution. Redundancy of Formula$O(n^{1/3+\varepsilon})$ bits overall can also be achieved for much larger alphabets, including infinite alphabets, for fast decaying monotonic distributions. Sequences generated by slower decaying distributions can also be compressed with diminishing per-symbol redundancy costs under some mild conditions and specifically if they have finite entropy rates. Examples of well-known monotonic distributions demonstrated how the diminishing redundancy decay rates can be computed by applying the bounds that were derived. The general results were shown to also apply to individual sequences whose empirical distributions obey the monotonicity. The techniques used for individual sequences can also be applied to bounding the redundancy of coding patterns.


A. Proof of Theorem 1

The proof follows the same steps used in [30] and [31] to lower bound the maximin redundancies for large alphabets and patterns, respectively, using the weak version of the redundancy-capacity theorem [6]. This version ties the maximin universal coding redundancy to the capacity of a channel defined by the conditional probability Formula$P_{\theta}\left(x^{n}\right)$. We define a set Formula${\mmb{\Omega}}_{{\cal M}_{k}}$ of points Formula$\mmb{\theta}\in{\cal M}_{k}$. Then, we show that these points are distinguishable by observing Formula$X^{n}$, i.e., the probability that Formula$X^{n}$ generated by Formula$\mmb{\theta}\in{\mmb{\Omega}}_{{\cal M}_{k}}$ appears to have been generated by another point Formula$\mmb{\theta}^{\prime}\in{\mmb{\Omega}}_{{\cal M}_{k}}$ diminishes with Formula$n$. Then, using Fano's inequality [4], the logarithm of the number of such distinguishable points, normalized by Formula$n$, is asymptotically a lower bound on Formula$R_{n}^{-}\left({\cal M}_{k}\right)$. Since Formula$R_{n}^{+}\left({\cal M}_{k}\right)\geq R_{n}^{-}\left({\cal M}_{k}\right)$, it is also a lower bound on the average minimax redundancy. The two regions in (6) result from a threshold phenomenon, where there exists a value Formula$k_{m}$ of Formula$k$ that maximizes the lower bound, and can be applied to all Formula${\cal M}_{k}$ for Formula$k\geq k_{m}$.

We begin with defining Formula${\mmb{\Omega}}_{{\cal M}_{k}}$. Let Formula$\mmb{\omega}$ be a vector of grid components, such that the last Formula$k-1$ components Formula$\theta_{i}$, Formula$i=2,\ldots, k$, of Formula$\mmb{\theta}\in{\mmb{\Omega}}_{{\cal M}_{k}}$ must satisfy Formula$\theta_{i}\in\mmb{\omega}$. Let Formula$\omega_{b}$ be the Formula$b$th point in Formula$\mmb{\omega}$, and define Formula$\omega_{0}=0$ and Formula TeX Source $$\omega_{b}\triangleq\sum\limits_{j=1}^{b}{{2 (j-{{1}\over{2}})}\over{n^{1-\varepsilon}}}={{b^{2}}\over{n^{1-\varepsilon}}},\;b=1, 2,\ldots.\eqno{\hbox{(A.1)}}$$ Then, for the Formula$b$th point in Formula$\mmb{\omega}$, Formula TeX Source $$b=\sqrt{\omega_{b}}\cdot\sqrt{n}^{1-\varepsilon}.\eqno{\hbox{(A.2)}}$$
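The grid in (A.1) is nonuniform: the gap between consecutive points grows linearly in Formula$b$, i.e., proportionally to Formula$\sqrt{\omega_{b}}$. A small sketch of the construction and of the inverse map (A.2) follows. The parameter `scale` stands for Formula$n^{1-\varepsilon}$, and the values of Formula$n$ and Formula$\varepsilon$ used below are arbitrary choices made only for illustration.

```python
import math

def omega(b, scale):
    """b-th grid point of (A.1); scale stands for n**(1 - eps)."""
    return sum(2 * (j - 0.5) for j in range(1, b + 1)) / scale

def index_of(w, scale):
    """Invert (A.1) as in (A.2): b = sqrt(w) * sqrt(scale)."""
    return math.sqrt(w) * math.sqrt(scale)

scale = 10_000.0 ** (1 - 0.1)  # hypothetical n = 10**4 and eps = 0.1
for b in (1, 2, 3, 10):
    w = omega(b, scale)
    gap = omega(b + 1, scale) - w  # equals (2b + 1) / scale
    print(b, w, gap, index_of(w, scale))
```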

To count the number of points in Formula${\mmb{\Omega}}_{{\cal M}_{k}}$, let us first consider the standard i.i.d. case, where there is no monotonicity requirement, and count the number of points in Formula${\mmb{\Omega}}$, which is defined similarly, but without the monotonicity requirement (i.e., Formula${\mmb{\Omega}}_{{\cal M}_{k}}\subseteq{\mmb{\Omega}}$). Let Formula$b_{i}$ be the index of Formula$\theta_{i}$ in Formula$\mmb{\omega}$, i.e., Formula$\theta_{i}=\omega_{b_{i}}$. Then, from (A.1) and (A.2) and since the components of Formula$\mmb{\theta}$ are probabilities Formula TeX Source $$\sum_{i=2}^{k}{{b_{i}^{2}}\over{n^{1-\varepsilon}}}=\sum_{i=2}^{k}\omega_{b_{i}}=\sum_{i=2}^{k}\theta_{i}\leq1.\eqno{\hbox{(A.3)}}$$ It follows that for Formula$\mmb{\theta}\in{\mmb{\Omega}}$, Formula TeX Source $$\sum_{i=2}^{k}b_{i}^{2}\leq n^{1-\varepsilon}.\eqno{\hbox{(A.4)}}$$ Hence, since the components Formula$b_{i}$ are nonnegative integers Formula TeX Source $$\eqalignno{M&\triangleq\left\vert{\mmb{\Omega}}\right\vert\cr &\geq\sum\limits_{b_{2}=0}^{\left\lfloor\sqrt{n^{1-\varepsilon}}\right\rfloor}\sum\limits_{b_{3}=0}^{\left\lfloor\sqrt{n^{1-\varepsilon}-b_{2}^{2}}\right\rfloor}\cdots\sum\limits_{b_{k}=0}^{\left\lfloor\sqrt{n^{1-\varepsilon}-\sum\nolimits_{i=2}^{k-1}b_{i}^{2}}\right\rfloor}1\cr &\buildrel{(a)}\over{\geq}\!\int_{0}^{\sqrt{n^{1-\varepsilon}}}\!\int_{0}^{\sqrt{n^{1-\varepsilon}-x_{2}^{2}}}\cdots\!\int_{0}^{\sqrt{n^{1-\varepsilon}-\sum\nolimits_{i=2}^{k-1}x_{i}^{2}}}dx_{k}\cdots dx_{2}\cr &\buildrel{(b)}\over{=}{{V_{k-1}\left(\sqrt{n}^{1-\varepsilon}\right)}\over{2^{k-1}}}&{\hbox{(A.5)}}}$$ where Formula$V_{k-1}\left(\sqrt{n}^{1-\varepsilon}\right)$ is the volume of a Formula$k-1$ dimensional sphere with radius Formula$\sqrt{n}^{1-\varepsilon}$, Formula$(a)$ follows from monotonic decrease of the function in the integrand for all integration arguments, and Formula$(b)$ follows since its left-hand side computes the volume of the positive quadrant of this sphere. Note that this is a different proof from that used in [30] and [31] for this step. Applying the monotonicity constraint, all permutations of Formula$\mmb{\theta}$ that are not monotonic must be taken out of the grid. Hence, Formula TeX Source $$M_{{\cal M}_{k}}\triangleq\left\vert{\mmb{\Omega}}_{{\cal M}_{k}}\right\vert\geq{{V_{k-1}\left(\sqrt{n}^{1-\varepsilon}\right)}\over{k!\cdot2^{k-1}}},\eqno{\hbox{(A.6)}}$$ where dividing by Formula$k!$ is a worst-case assumption, yielding a lower bound and not an equality. This leads to a lower bound equal to that obtained for patterns in [31] on the number of points in Formula${\mmb{\Omega}}_{{\cal M}_{k}}$. Specifically, the bound achieves a maximal value for Formula$k_{m}=\left(\pi n^{1-\varepsilon}/2\right)^{1/3}$ and then decreases to eventually become smaller than 1. However, for Formula$k>k_{m}$, one can consider a monotonic distribution for which all components Formula$\theta_{i}$, Formula$i>k_{m}$, of Formula$\mmb{\theta}$ are zero, and use the bound for Formula$k_{m}$.
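Both step Formula$(a)$ of (A.5) and the location of the maximum of (A.6) can be checked numerically. The sketch below is an illustration only. It uses the standard volume formula Formula$V_{d}(r)=\pi^{d/2}r^{d}/\Gamma(d/2+1)$ for a Formula$d$-dimensional ball, lets the integer `r_squared` play the role of Formula$n^{1-\varepsilon}$, and searches for the maximizing Formula$k$ by brute force. The function names are hypothetical.

```python
import math

def log2_grid_bound(k, r_squared):
    """log2 of V_{k-1}(r) / (k! * 2**(k-1)) with r**2 = r_squared, as in (A.6)."""
    d = k - 1
    log_volume = (0.5 * d * math.log(math.pi)
                  - math.lgamma(0.5 * d + 1)
                  + 0.5 * d * math.log(r_squared))
    log_bound = log_volume - math.lgamma(k + 1) - d * math.log(2)
    return log_bound / math.log(2)

def count_orthant_points(d, r_squared):
    """Count nonnegative integer vectors of length d with squared norm <= r_squared."""
    if d == 0:
        return 1
    top = math.isqrt(r_squared)
    return sum(count_orthant_points(d - 1, r_squared - b * b)
               for b in range(top + 1))

# Step (a): the lattice count dominates the volume of the positive orthant.
print(count_orthant_points(2, 100), math.pi * 100 / 4)

# Maximizer of (A.6) over k versus the closed form k_m.
r_squared = 10 ** 6
best_k = max(range(2, 2000), key=lambda k: log2_grid_bound(k, r_squared))
k_m = (math.pi * r_squared / 2) ** (1 / 3)
print(best_k, k_m)
```

The brute-force maximizer lands within about one unit of Formula$\left(\pi n^{1-\varepsilon}/2\right)^{1/3}$, since the closed form follows from setting the derivative of the logarithm of (A.6) w.r.t. Formula$k$ to zero under Stirling-type approximations.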

Distinguishability of Formula$\mmb{\theta}\in{\mmb{\Omega}}_{{\cal M}_{k}}$ is a direct result of distinguishability of Formula$\mmb{\theta}\in{\mmb{\Omega}}$, which is shown in Lemma 3.1 in [30]. The lemma states the following: there exists an estimator Formula${\hat{{\mmb{\Theta}}}}_{g}(X^{n})\in{\mmb{\Omega}}$ for which the estimate Formula${\hat{\mmb{\theta}}}_{g}$ satisfies Formula$\lim_{n\rightarrow\infty}P_{\theta}\left({\hat{\mmb{\theta}}}_{g}\ne\mmb{\theta}\right)=0$ for all Formula$\mmb{\theta}\in{\mmb{\Omega}}$. Since this is true for all points in Formula${\mmb{\Omega}}$, it is also true for all points in Formula${\mmb{\Omega}}_{{\cal M}_{k}}\subseteq{\mmb{\Omega}}$, where now, Formula${\hat{\mmb{\theta}}}_{g}\in{\mmb{\Omega}}_{{\cal M}_{k}}$. Assuming all points in Formula${\mmb{\Omega}}_{{\cal M}_{k}}$ are equally probable to generate Formula$X^{n}$, we can define an average error probability Formula$P_{e}\triangleq\Pr\left[{\hat{{\mmb{\Theta}}}}_{g}(X^{n})\ne{\mmb{\Theta}}\right]=\sum_{\mmb{\theta}\in{\mmb{\Omega}}_{{\cal M}_{k}}}P_{\theta}\left({\hat{\mmb{\theta}}}_{g}\ne\mmb{\theta}\right)/M_{{\cal M}_{k}}$.
Using the redundancy-capacity theorem, Formula TeX Source $$\eqalignno{nR^{-}_{n}\left[{\cal M}_{k}\right]\geq &\, C\left[{\cal M}_{k}\rightarrow X^{n}\right]\cr \buildrel{(a)}\over{\geq}&\, I [{\mmb{\Theta}}; X^{n}]=H\left[{\mmb{\Theta}}\right]-H\left[{\mmb{\Theta}}\vert X^{n}\right]\cr \buildrel{(b)}\over{=}&\,\log M_{{\cal M}_{k}}-H\left[{\mmb{\Theta}}\vert X^{n}\right]\cr \buildrel{(c)}\over{\geq}&\,\left(1-P_{e}\right)\left(\log M_{{\cal M}_{k}}\right)-1\cr \buildrel{(d)}\over{\geq}&\, (1-o(1))\log M_{{\cal M}_{k}}&{\hbox{(A.7)}}}$$ where Formula$C\left[{\cal M}_{k}\rightarrow X^{n}\right]$ denotes the capacity of the channel between Formula${\cal M}_{k}$ and the observation Formula$X^{n}$, and Formula$I [{\mmb{\Theta}}; X^{n}]$ is the mutual information induced by the joint distribution Formula$\Pr\left(\Theta=\theta\right)\cdot P_{\theta}\left(X^{n}\right)$. Inequality Formula$(a)$ follows from the definition of capacity, equality Formula$(b)$ from the uniform distribution of Formula${\mmb{\Theta}}$ in Formula${\mmb{\Omega}}_{{\cal M}_{k}}$, inequality Formula$(c)$ from Fano's inequality, and Formula$(d)$ follows since Formula$P_{e}\rightarrow 0$. Lower bounding the expression in (A.6) for the two regions (obtaining the same bounds as in [31]), then using (A.7), normalizing by Formula$n$, and absorbing second-order terms in Formula$\varepsilon$, yields the two regions of the bound in (6). The proof of Theorem 1 is concluded. Formula$\hfill \square$

B. Proof of Theorem 2

To prove Theorem 2, we use the random-coding strong version of the redundancy-capacity theorem [19]. The idea is similar to the weak version used in Appendix A. We assume that grids Formula${\mmb{\Omega}}_{{\cal M}_{k}}$ of points are uniformly distributed over Formula${\cal M}_{k}$, and one grid is selected randomly. Then, a point in the selected grid is randomly selected under a uniform prior to generate Formula$X^{n}$. The random choice of a grid and then of a source in the grid must uniformly cover the whole space Formula${\cal M}_{k}$. Showing distinguishability within a selected grid, for every possible random choice of Formula${\mmb{\Omega}}_{{\cal M}_{k}}$, implies that a lower bound on the cardinality of Formula${\mmb{\Omega}}_{{\cal M}_{k}}$ for every possible choice is essentially a lower bound on the overall sequence redundancy for most sources in Formula${\cal M}_{k}$.

The construction of Formula${\mmb{\Omega}}_{{\cal M}_{k}}$ is identical to that used in [31] to construct a grid of sources that generate patterns. We pack spheres of radius Formula$n^{-0.5(1-\varepsilon)}$ in the parameter space defining Formula${\cal M}_{k}$. The set Formula${\mmb{\Omega}}_{{\cal M}_{k}}$ consists of the center points of the spheres. To cover the space Formula${\cal M}_{k}$, we select a random shift of the whole lattice under a uniform distribution. The cardinality of Formula${\mmb{\Omega}}_{{\cal M}_{k}}$ is lower bounded by the ratio between the volume of Formula${\cal M}_{k}$, which equals (as shown in [31]) Formula$1/[(k-1)! k!]$, and the volume of a single sphere, also factoring in a packing density (see, e.g., [3]). This yields (55) in [31] Formula TeX Source $$M_{{\cal M}_{k}}\geq{{1}\over{(k-1)!\cdot k!\cdot V_{k-1}\left(n^{-0.5(1-\varepsilon)}\right)\cdot2^{k-1}}}\eqno{\hbox{(B.1)}}$$ where Formula$V_{k-1}\left(n^{-0.5(1-\varepsilon)}\right)$ is the volume of a Formula$k-1$ dimensional sphere with radius Formula$n^{-0.5(1-\varepsilon)}$ (see, e.g., [3] for computation of this volume).

For distinguishability, it is sufficient to show that there exists an estimator Formula${\hat{{\mmb{\Theta}}}}_{g}(X^{n})\in{\mmb{\Omega}}_{{\cal M}_{k}}$ such that Formula$\lim_{n\rightarrow\infty}P_{\Theta}\left[{\hat{{\mmb{\Theta}}}}_{g}(X^{n})\ne{\mmb{\Theta}}\right]=0$ for every choice of Formula${\mmb{\Omega}}_{{\cal M}_{k}}$ and for every choice of Formula${\mmb{\Theta}}\in{\mmb{\Omega}}_{{\cal M}_{k}}$. This is already shown in [30, Lemma 4.1] for a larger grid Formula${\mmb{\Omega}}$ of i.i.d. sources, which is constructed identically to Formula${\mmb{\Omega}}_{{\cal M}_{k}}$ over the complete Formula$k-1$ dimensional probability simplex. The lemma states the following: let Formula${\mmb{\Theta}}\in{\mmb{\Omega}}$ be a randomly selected point in a grid Formula${\mmb{\Omega}}$. Let a random sequence Formula$X^{n}$ be governed by Formula$P_{\Theta}\left(X^{n}\right)$. Then, there exists a decision rule that chooses a point Formula${\hat{\Theta}}_{g}\left(X^{n}\right)\in{\mmb{\Omega}}$, such that Formula$\lim_{n\rightarrow\infty}P_{\Theta}\left[{\hat{{\mmb{\Theta}}}}_{g}(X^{n})\ne{\mmb{\Theta}}\right]=0$. By the monotonicity requirement, for every Formula${\mmb{\Omega}}_{{\cal M}_{k}}$, there exists an i.i.d. Formula${\mmb{\Omega}}$, such that Formula${\mmb{\Omega}}_{{\cal M}_{k}}\subseteq{\mmb{\Omega}}$. Since [30, Lemma 4.1] holds for Formula${\mmb{\Omega}}$, it then must also hold for the smaller grid Formula${\mmb{\Omega}}_{{\cal M}_{k}}$. Now, since all the conditions of the strong random-coding version of the redundancy-capacity theorem hold, taking the logarithm of the bound in (B.1), absorbing second-order terms in Formula$\varepsilon$, and normalizing by Formula$n$, leads to the first region of the bound in (8). By [19, Th. 3], since for any fixed arbitrarily small Formula$\varepsilon>0$ we have Formula$P_{e}\triangleq P_{\Theta}\left[{\hat{{\mmb{\Theta}}}}_{g}(X^{n})\ne{\mmb{\Theta}}\right]\rightarrow 0$, then, Formula$\mu_{n}\left(A_{\varepsilon}(n)\right)\rightarrow 0$, thus completing the proof for the first region of the bound.

The second region of the bound is handled in a manner related to the second region of the bound of Theorem 1. However, here, we cannot simply set the probability of all symbols Formula$i>k_{m}$ to zero, because all possible valid sources must be included in one of the grids Formula${\mmb{\Omega}}_{{\cal M}_{k}}$ to achieve a complete covering of Formula${\cal M}_{k}$. As was done in [31], we include sources with Formula$\theta_{i}>0$ for Formula$i>k_{m}$ in the grids Formula${\mmb{\Omega}}_{{\cal M}_{k}}$, but do not include them in the lower bound on the number of grid points. Instead, for Formula$k>k_{m}$, we bound the number of points in a Formula$k_{m}$-dimensional cut of Formula${\cal M}_{k}$ for which the remaining Formula$k-k_{m}$ components of Formula$\mmb{\theta}$ are very small (and insignificant). This analysis is valid also for Formula$k>n$. In proving distinguishability, however, we must take into account the effect of the additional sources in the grid, and make sure that the existence of these sources in Formula${\mmb{\Omega}}_{{\cal M}_{k}}$ does not lead to nondiminishing error probability. Lemma 6.1 in [31] shows that Formula$\lim_{n\rightarrow\infty}P_{\Theta}\left[{\hat{{\mmb{\Theta}}}}_{g}(X^{n})\ne{\mmb{\Theta}}\right]=0$ for Formula$k>k_{m}$ for a grid Formula${\mmb{\Omega}}$ of i.i.d. sources that is not restricted to monotonic distributions. The proof is given in [31, Appendix D]. As before, it carries over to monotonic distributions, since for each Formula${\mmb{\Omega}}_{{\cal M}_{k}}$, there exists a corresponding unrestricted Formula${\mmb{\Omega}}$, such that Formula${\mmb{\Omega}}_{{\cal M}_{k}}\subseteq{\mmb{\Omega}}$. The choice of Formula$k_{m}=0.5 (n^{1-\varepsilon}/\pi)^{1/3}$ gives the maximal bound w.r.t. Formula$k$. Since, again, all conditions of the strong version of the redundancy-capacity theorem are satisfied, the second region of the bound is obtained. This concludes the proof of Theorem 2. Formula$\hfill \square$


The author would like to express gratitude to the associate editor, W. Szpankowski, for handling this paper and for very valuable comments that helped improve this paper, and also to an anonymous reviewer for providing valuable feedback.


This work was supported by the NSF under Grant CCF-0347969. This paper was presented at the 2007 IEEE International Symposium on Information Theory.

The author is with Google, Inc., Pittsburgh, PA 15206 USA.

Communicated by W. Szpankowski, Associate Editor for Source Coding.

1 For two functions Formula$f(n)$ and Formula$g(n)$, Formula$f(n)=o(g(n))$ if Formula$\forall c$, Formula$\exists n_{0}$, such that Formula$\forall n>n_{0}$, Formula$f(n)<cg(n)$; Formula$f(n)=O(g(n))$ if Formula$\exists c$, Formula$n_{0}$, such that Formula$\forall n>n_{0}$, Formula$0\leq f(n)\leq cg(n)$; Formula$f(n)=\Theta (g(n))$ if Formula$\exists c_{1}$, Formula$c_{2}$, Formula$n_{0}$, such that Formula$\forall n>n_{0}$, Formula$c_{1}g(n)\leq f(n)\leq c_{2}g(n)$.

2 In this paper, redundancy is defined per-symbol (normalized by the sequence length Formula$n$). However, when we refer to redundancy in overall bits, we address the block redundancy cost for a sequence.

3 The original submission of this paper derived a looser bound for the first region of (10). A tighter bound was obtained using results that appeared subsequently to the submission of this paper in [38].



Gil I. Shamir

Gil I. Shamir received the B.Sc. (Cum Laude), and M.Sc. degrees from the Technion, Israel Institute of Technology, Haifa, Israel in 1990 and 1997, respectively, and the Ph.D. degree from the University of Notre Dame, Notre Dame, IN, USA, in 2000, all in electrical engineering.

From 1990 to 1995 he participated in research and development of signal processing and communication systems. From 1995 to 1997 he was with the Electrical Engineering Department at the Technion—Israel Institute of Technology, as a graduate student and teaching assistant. From September 1997 to May 2000 he was a Ph.D. student and a research assistant in the Electrical Engineering Department at the University of Notre Dame, and then a post-doctoral fellow until July 2001. During his tenure at Notre Dame he was a fellow of the Center for Applied Mathematics of the university. Between 2001 and 2008 he was with the Electrical and Computer Engineering Department at the University of Utah, and between 2008 and 2009 he was with Seagate Research. He is currently with Google Inc. His main research interests include information theory, machine learning, coding, and communication theory. Dr. Shamir received an NSF CAREER award in 2003.
