SECTION I

## INTRODUCTION

THE classical setting of the universal lossless compression problem [6], [9], [10] assumes that a sequence $x^{n}$ of length $n$, generated by a source $\mmb{\theta}$, is to be compressed without knowledge of the particular $\mmb{\theta}$ that generated $x^{n}$, but with knowledge of the class $\Lambda$ of all possible sources $\mmb{\theta}$. The average performance of a code that assigns a length function $L(\cdot)$ is judged by the redundancy function $R_{n}\left(L,\mmb{\theta}\right)$, defined as the difference between the expected code length of $L\left(\cdot\right)$ with respect to (w.r.t.) the given source probability mass function $P_{\theta}$ and the $n$th-order entropy of $P_{\theta}$, normalized by the length $n$ of the uncoded sequence. A class of sources is said to be universally compressible in some worst-case sense if the redundancy function diminishes in that setting. Another approach to universal coding [37] considers the individual sequence redundancy ${\hat{R}}_{n}\left(L, x^{n}\right)$, defined as the normalized difference between the code length obtained by $L(\cdot)$ for $x^{n}$ and the negative logarithm of the maximum likelihood (ML) probability of the sequence $x^{n}$, where the ML probability is taken within the class $\Lambda$. We hereafter refer to this negative logarithm as the ML description length of $x^{n}$. The individual sequence redundancy is defined for each sequence that can be generated by a source $\mmb{\theta}$ in the given class $\Lambda$.

Classical literature on universal compression [6], [9], [10], [25], [37] considered compression of sequences generated by sources over finite alphabets. In fact, it was shown by Kieffer [17] (see also [15]) that there are no universal codes (in the sense of diminishing redundancy) for sources over infinite alphabets. Later work (see, e.g., [23], [30], [38]), however, bounded the achievable redundancies for i.i.d. sequences generated by sources over large and infinite alphabets. Specifically, while it was shown that the redundancy does not decay if the alphabet size is of the same order of magnitude as the sequence length $n$ or greater, it was also shown that the redundancy does decay for alphabets of size $o(n)$.

While there is no universal code for infinite alphabets, recent work [22] demonstrated that if one considers the pattern of a sequence instead of the sequence itself, universal codes do exist in the sense of diminishing redundancy. A pattern of a sequence, first considered, to the best of our knowledge, in [1], is a sequence of indices, where the index $\psi_{i}$ at time $i$ represents the order of first occurrence of letter $x_{i}$ in the sequence $x^{n}$. Further study of universal compression of patterns [13], [22], [23], [31], [35] (and, subsequent to the work in this paper, [2]) provided lower and upper bounds on various forms of redundancy in universal compression of patterns. Another related study is that of compression of data where the order of the occurring data symbols is not important, but their types and empirical counts are [39], [40].

This paper considers universal compression of data sequences generated by distributions that are known a priori to be monotonic. The order of probabilities of the source symbols is known in advance to both encoder and decoder and can be utilized as side information. Monotonic distributions, such as the Zipf distribution (see, e.g., [42], [43]) and the geometric distribution over the integers, are common in applications such as language modeling and image compression, where residual signals are compressed (see, e.g., [20], [21]). One can also consider compression of the list of last or first names in a given city's population. Usually, such a distribution exhibits some monotonicity in the given population, of which both encoder and decoder may be aware a priori. For example, the last name “Smith” can be expected to be much more common than the last name “Shannon.” Another example is the compression of a sequence of observations of different species, where one has prior knowledge of which species are more common and which are rare. Finally, one can consider compressing data for which side information, given to the decoder through a different channel, provides the monotonicity order.

Monotonic distributions were studied by Elias [8], Rissanen [24], and Ryabko [26]. In [8] and [24], the study focused on relative redundancy, computing the ratio between average assigned code length and the source entropy. Ryabko in [26] studied codes for monotonic distributions and used the connection between redundancy and channel capacity (i.e., the redundancy-capacity theorem) to lower bound minimax redundancy. More recent work by Foster et al. [11] showed that (unlike the compression of patterns) there are no universal block codes in the standard sense for the complete class of monotonic distributions. The main reason is that there exist monotonic distributions for which much of the statistical weight lies in the long tail, in symbols that have very low probability, most of which will not occur in a given sequence. Thus, in practice, even though one has prior knowledge of the monotonicity of the distribution, this monotonicity is not necessarily retained in an observed sequence. Actual coding is, therefore, very similar to compression over infinite alphabets, and the additional prior knowledge of the monotonicity is not very helpful in reducing redundancy. Despite that, Foster et al. demonstrated codes that obtain universal per-symbol redundancy of $o(1)$ as long as the source entropy is fixed (i.e., neither increasing with $n$ nor infinite).

The work in [11] studied coding sequences (or blocks) generated by i.i.d. monotonic distributions, and designed codes for which the relative block redundancy could be (upper) bounded. Unlike that work, the focus in [8], [24], and [26] was on designing codes that minimize the redundancy or relative redundancy for a single symbol generated by a monotonic distribution. Specifically, in [24], minimax codes, which minimize the relative redundancy for the worst possible monotonic distribution over a given alphabet size, were derived. In [26], it was shown that redundancy of $O(\log \log k)$, where $k$ is the alphabet size, can be obtained with minimax per-symbol codes. Very recent work [18] considered per-symbol codes that minimize an average redundancy over the class of monotonic distributions for a given alphabet size. Unlike [11], all these papers study per-symbol codes. Therefore, the codes designed always pay nondiminishing per-symbol redundancy.

A different line of work on monotonic distributions considered optimizing codes for a known monotonic distribution but with unknown parameters (see [20], [21] for design of codes for two-sided geometric distributions). In this line of work, the class of sources is very limited and consists of only the unknown parameters of a known distribution.

In this paper, we consider a general class of monotonic distributions that is not restricted to a specific type or a single parameter. We study standard block redundancy for coding sequences generated by i.i.d. monotonic distributions, i.e., a setting similar to the work in [11]. We do, however, restrict ourselves to smaller subsets of the complete class of monotonic distributions. First, we consider monotonic distributions over alphabets of size $k$, where $k$ is either small w.r.t. $n$, or of $O(n)$. Then, we extend the analysis to show that under minimal restrictions of the monotonic distribution class, there exist universal codes in the standard sense, i.e., with diminishing per-symbol redundancy. In fact, not only do universal codes exist, but under mild restrictions, they achieve the same redundancy as obtained for alphabets of size $O(n)$. The restrictions on this subclass imply that some types of fast decaying monotonic distributions are included in it, and therefore, sequences generated by these distributions (without prior knowledge of either the distribution or of its parameters) can still be compressed universally in the class of monotonic distributions.

The main contributions of this paper are the development of codes and the derivation of upper bounds on their redundancies for coding i.i.d. sequences generated by monotonic distributions. Specifically, this paper gives a complete characterization of the redundancy in coding with monotonic distributions over “small” alphabets $(k=o(n^{1/3}))$ and “large” alphabets $(k=O(n))$. Then, it shows that these redundancy bounds carry over (in first order) to fast decaying distributions. Next, a code that achieves good redundancy rates for even slower decaying monotonic distributions is derived and used to study achievable redundancy rates for such distributions. Finally, even tighter upper bounds relative to the ML description length are obtained for individual sequences for which the monotonic order of the probabilities is known. The codes derived are two-part codes, based on describing any sequence using a quantized version of its ML distribution. The redundancy consists of the distribution description cost and a quantization penalty.

Lower bounds are also presented (in both average and individual sequence cases) to complete the characterization, and are shown to meet the upper bounds in the first three cases (small alphabets, large alphabets, and fast decaying distributions). The lower bounds turn out to relate to those obtained for coding patterns. The relationship to patterns is demonstrated in the proofs of the lower bounds. The main components of the average case proofs are, in fact, identical to those in [31], and the reader is referred to [31] for more details. The main steps of the proofs are still presented in appendixes here for the sake of completeness.

The universal compression problem over monotonic distributions is closely related to that of patterns. For small and large alphabets, the redundancy rates attained appear to be the same. This is because in both problems the richness of the class (which yields the universal coding redundancy) is decreased by the same factor from that of the original i.i.d. class, although for different reasons. In the pattern case, sequences that are label permutations of one another are governed by the same pattern ML distribution. Here, such sequences are constrained to a distribution whose probabilities are ordered by the monotonicity constraint. However, a monotonic ML distribution requires given labels to appear in the required order, and may not equal the actual i.i.d. ML distribution. This restriction is not imposed when coding patterns, and it makes this part of the analysis more difficult for monotonic distributions. Overall, in both cases, we observe a cost of essentially $0.5\log (n/k^{3})$ bits per unknown parameter for smaller alphabets and a cost of essentially $O(n^{1/3})$ bits overall for larger alphabets. The technique used to prove the upper bounds of the main theorems in this paper follows the original work in [35] on upper bounding the redundancy for coding patterns. Tight upper bounds on the redundancy for coding patterns had not been attained when the work presented in this paper (published originally in [36] and [34]) was done. Several years after the work presented here, the general construction in [35] was followed in [2] to show an $O(n^{1/3+\varepsilon})$ upper bound for coding patterns. An upper bound for small alphabets of $(1+\varepsilon) 0.5\log (n/k^{3})$ bits per parameter has yet to be derived for patterns, to the best of our knowledge. The constructions used in this paper can be applied to the pattern problem. The description costs of these constructions apply to patterns, but the computation of quantization costs is much more difficult for patterns. Specifically, the construction used in the individual sequence case for monotonic distributions can be applied to patterns.

The outline of this paper is as follows. Section II describes the notation and basic definitions. Then, in Section III, lower bounds on the redundancy for monotonic distributions are derived. Next, in Section IV, we propose codes and upper bound their redundancy for coding monotonic distributions over small and large alphabets. These upper bounds match the rates of the lower bounds. They are then extended to fast decaying monotonic distributions in Section V, which also demonstrates the use of the bounds on some standard monotonic distributions. Finally, in Section VI, we consider individual sequences.

SECTION II

## NOTATION AND DEFINITIONS

Let $x^{n}\triangleq\left(x_{1}, x_{2},\ldots, x_{n}\right)$ denote a sequence of $n$ symbols over the alphabet $\Sigma$ of size $k$, where $k$ can go to infinity. Without loss of generality, we assume that $\Sigma=\left\{1, 2,\ldots, k\right\}$, i.e., it is the set of positive integers from 1 to $k$. The sequence $x^{n}$ is generated by an i.i.d. distribution of some source, determined by the parameter vector $\mmb{\theta}\triangleq\left(\theta_{1},\theta_{2},\ldots,\theta_{k}\right)$, where $\theta_{i}$ is the probability of $X$ taking value $i$. The components of $\mmb{\theta}$ are nonnegative and sum to 1. The distributions we consider in this paper are monotonic. Therefore, $\theta_{1}\geq\theta_{2}\geq\ldots\geq\theta_{k}$. The class of all monotonic distributions will be denoted by ${\cal M}$. The class of monotonic distributions over an alphabet of size $k$ is denoted by ${\cal M}_{k}$. It is assumed that prior to coding $x^{n}$ both encoder and decoder know that $\mmb{\theta}\in{\cal M}$ or $\mmb{\theta}\in{\cal M}_{k}$, and also know the order of the probabilities in $\mmb{\theta}$. In the more restrictive setting, $k$ is known in advance and it is known that $\mmb{\theta}\in{\cal M}_{k}$. We do not restrict ourselves to this setting. In general, boldface letters will denote vectors, whose components will be denoted by their indices in the vector. Capital letters will denote random variables. We will denote an estimator by the hat sign. In particular, ${\hat{\mmb{\theta}}}$ will denote the ML estimator of $\mmb{\theta}$ which is obtained from $x^{n}$.

The probability of $x^{n}$ generated by $\mmb{\theta}$ is given by $P_{\theta}\left(x^{n}\right)\triangleq\Pr\left(x^{n}\vert{\mmb{\Theta}}=\mmb{\theta}\right)$. The average per-symbol $n$th-order redundancy obtained by a code that assigns length function $L (\cdot)$ for $\mmb{\theta}$ is $$R_{n}\left(L,\mmb{\theta}\right)\triangleq{{1}\over{n}}E_{\theta}L\left[X^{n}\right]-H_{\theta}\left[X\right]\eqno{\hbox{(1)}}$$ where $E_{\theta}$ denotes expectation w.r.t. $\mmb{\theta}$, and $H_{\theta}\left[X\right]$ is the (per-symbol) entropy (rate) of the source ($H_{\theta}\left[X^{n}\right]$ is the $n$th-order sequence entropy of $\mmb{\theta}$, and for i.i.d. sources, $H_{\theta}\left[X^{n}\right]=n H_{\theta}\left[X\right]$). With entropy coding techniques, assigning a universal probability $Q\left(x^{n}\right)$ is equivalent to designing a universal code for $x^{n}$, where, up to negligible integer length constraints that will be ignored, the negative base-2 logarithm of the assigned probability is taken as the code length.
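For intuition, the redundancy in (1) can be evaluated numerically for an idealized code that assigns lengths $-\log_2 q(x)$ under a (possibly mismatched) i.i.d. coding distribution $q$; in that case the per-symbol redundancy reduces to the KL divergence $D(\theta\Vert q)$. A minimal sketch (the distributions below are illustrative, not taken from the paper):

```python
import math

def entropy(p):
    """Per-symbol entropy H_theta[X] in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def redundancy(theta, q):
    """Per-symbol redundancy of an idealized code with lengths -log2 q(x):
    E_theta[-log2 q(X)] - H_theta[X].  For i.i.d. coding this equals the
    KL divergence D(theta || q), hence it is zero iff q matches theta."""
    exp_len = -sum(t * math.log2(qi) for t, qi in zip(theta, q) if t > 0)
    return exp_len - entropy(theta)

theta = [0.5, 0.3, 0.2]    # a monotonic source
q = [0.4, 0.35, 0.25]      # a mismatched coding distribution
r = redundancy(theta, q)   # strictly positive since q != theta
```

Matching the coding distribution to the source drives the redundancy to zero; universal coding pays a positive redundancy because $q$ must be chosen without knowing $\mmb{\theta}$.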

The individual sequence redundancy (see, e.g., [37]) of a code with length function $L\left(\cdot\right)$ per sequence $x^{n}$ over class $\Lambda$ is $${\hat{R}}_{n}\left(L, x^{n}\right)\triangleq{{1}\over{n}}\left\{L\left(x^{n}\right)+\log P_{ML}\left(x^{n}\right)\right\}\eqno{\hbox{(2)}}$$ where the logarithm is taken to base 2, here and elsewhere, and $P_{ML}\left(x^{n}\right)$ is the probability of $x^{n}$ given by the ML estimator ${\hat{\mmb{\theta}}}_{\Lambda}\in\Lambda$ of the governing parameter vector ${\mmb{\Theta}}$. The negative logarithm of this probability is, up to integer length constraints, the shortest possible code length assigned to $x^{n}$ in $\Lambda$. It will be referred to as the ML description length of $x^{n}$ in $\Lambda$. In the general case, one considers the i.i.d. ML. However, since we only consider $\mmb{\theta}\in{\cal M}$, i.e., restrict the sequence to one governed by a monotonic distribution, we define ${\hat{\mmb{\theta}}}_{\cal M}\in{\cal M}$ as the monotonic ML estimator. Its associated shortest code length will be referred to as the monotonic ML description length. The estimator ${\hat{\mmb{\theta}}}_{\cal M}$ may differ from the i.i.d. ML estimator ${\hat{\mmb{\theta}}}$, in particular, if the empirical distribution of $x^{n}$ is not monotonic. The individual sequence redundancy in ${\cal M}$ is thus defined w.r.t. the monotonic ML description length, which is the negative logarithm of $P_{ML}\left(x^{n}\right)\triangleq P_{{\hat{\theta}}_{\cal M}}\left(x^{n}\right)\triangleq\Pr\left(x^{n}\;\vert\;{\mmb{\Theta}}={\hat{\mmb{\theta}}}_{\cal M}\in{\cal M}\right)$.
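The monotonic ML estimator ${\hat{\mmb{\theta}}}_{\cal M}$ can be computed from the occurrence counts by isotonic regression; a standard way to do so (an assumption here, not a construction given in the paper) is the pool-adjacent-violators algorithm, which merges adjacent blocks of symbols that violate the ordering and assigns each block its average empirical probability:

```python
def monotonic_ml(counts):
    """Monotonic (non-increasing) ML estimate from occurrence counts
    n_x(1..k): maximize prod theta_i^{n_i} subject to theta_1 >= ... >= theta_k,
    via pool-adjacent-violators.  Each block is a pair [total_count, width]."""
    n = sum(counts)
    blocks = []
    for c in counts:
        blocks.append([c, 1])
        # pool while a later block has a larger average than its predecessor
        while len(blocks) > 1 and \
                blocks[-1][0] * blocks[-2][1] > blocks[-2][0] * blocks[-1][1]:
            c2, w2 = blocks.pop()
            blocks[-1][0] += c2
            blocks[-1][1] += w2
    theta = []
    for c, w in blocks:
        theta.extend([c / (w * n)] * w)  # average probability within the block
    return theta
```

When the empirical distribution is already monotonic, the estimate coincides with the i.i.d. ML; otherwise, violating neighbors are pooled, e.g., counts $(3,5,2)$ yield $(0.4, 0.4, 0.2)$.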

The average minimax redundancy of some class $\Lambda$ is defined as $$R_{n}^{+}\left(\Lambda\right)\triangleq\min_{L}\sup_{\mmb{\theta}\in\Lambda}R_{n}\left(L,\mmb{\theta}\right).\eqno{\hbox{(3)}}$$ Similarly, the individual minimax redundancy is that of the best code $L\left(\cdot\right)$ for the worst sequence $x^{n}$, $${\hat{R}}_{n}^{+}\left(\Lambda\right)\triangleq\min_{L}\sup_{\mmb{\theta}\in\Lambda}\max_{x^{n}}{{1}\over{n}}\left\{L\left(x^{n}\right)+\log P_{\theta}\left(x^{n}\right)\right\}.\eqno{\hbox{(4)}}$$ The maximin redundancy of $\Lambda$ is $$R_{n}^{-}\left(\Lambda\right)\triangleq\sup_{w}\min_{L}\int_{\Lambda}w\left(d\mmb{\theta}\right) R_{n}\left(L,\mmb{\theta}\right)\eqno{\hbox{(5)}}$$ where $w(\cdot)$ is a prior on $\Lambda$. In [6], it was shown by Davisson that $R_{n}^{+}\left(\Lambda\right)\geq R_{n}^{-}\left(\Lambda\right)$. Davisson also tied the maximin redundancy to the capacity of the channel induced by the conditional probability $P_{\theta}$. It was then shown independently by Gallager [12] and Ryabko [26] first, and then by Davisson and Leon-Garcia [7], that the minimax and maximin redundancies are essentially equal, hence making the connection between the minimax redundancy and the capacity of the channel induced by $P_{\theta}$. Finally, Merhav and Feder [19] tied the capacity of this channel to the redundancy for almost all sources in a class, proving a strong version of the theorem. The redundancy-capacity theorem is used to prove lower bounds in the minimax (maximin) and “almost all sources” senses for the monotonic distribution class.

SECTION III

## LOWER BOUNDS

Lower bounds on various forms of the redundancy for the class of monotonic distributions can be obtained with slight modifications of the proofs for the lower bounds on the redundancy of coding patterns in [16], [22], [23], and [31]. The bounds are presented in the following three theorems. For the sake of completeness, the main steps of the proofs of the first two theorems are presented in appendixes, and the proof of the third theorem is presented below. The reader is referred to [16], [22], [23], [30] and [31] for more details.

#### Theorem 1

Fix an arbitrarily small $\varepsilon>0$, and let $n\rightarrow\infty$. Then, the $n$th-order average maximin and minimax universal coding redundancies for i.i.d. sequences generated by a monotonic distribution with alphabet size $k$ are lower bounded by \eqalignno{&\displaystyle R^{-}_{n}\left({\cal M}_{k}\right)\geq&{\hbox{(6)}}\cr&\qquad\cases{{{k-1}\over{2n}}\log {{n^{1-\varepsilon}}\over{k^{3}}}+{{k-1}\over{2n}}\log {{\pi e^{3}}\over{2}}-O\left({{\log k}\over{n}}\right), &k\leq{\cal T}^{-}_{\varepsilon, n}\cr\left({{\pi}\over{2}}\right)^{1/3}(1.5\log e){{n^{(1-\varepsilon)/3}}\over{n}}-O\left({{\log n}\over{n}}\right), &k>{\cal T}^{-}_{\varepsilon,n}}} where $${\cal T}^{-}_{\varepsilon, n}\triangleq\left({{\pi n^{1-\varepsilon}}\over{2}}\right)^{1/3}.\eqno{\hbox{(7)}}$$

#### Theorem 2

Fix an arbitrarily small $\varepsilon>0$, and let $n\rightarrow\infty$. Let $\mu_{n}(\cdot)$ be the uniform prior over points in ${\cal M}_{k}$. Then, the $n$th-order average universal coding redundancy for coding i.i.d. sequences generated by monotonic distributions with alphabet size $k$ is lower bounded by \eqalignno{&\displaystyle R_{n}\left(L,\mmb{\theta}\right)\geq\cr&\qquad\cases{{{k-1}\over{2n}}\log {{n^{1-\varepsilon}}\over{k^{3}}}-{{k-1}\over{2n}}\log {{8\pi}\over{e^{3}}}-O\left({{\log k}\over{n}}\right), &k\leq{\cal T}_{\varepsilon, n}\cr{{1.5\log e}\over{2\pi^{1/3}}}\cdot{{n^{(1-\varepsilon)/3}}\over{n}}-O\left({{\log n}\over{n}}\right),&k>{\cal T}_{\varepsilon, n}}\cr&&{\hbox{(8)}}} where $${\cal T}_{\varepsilon, n}\triangleq{{1}\over{2}}\cdot\left({{n^{1-\varepsilon}}\over{\pi}}\right)^{1/3}\eqno{\hbox{(9)}}$$ for every code $L(\cdot)$ and almost every i.i.d. source $\mmb{\theta}\in{\cal M}_{k}$, except for a set of sources $A_{\varepsilon}\left(n\right)$ whose relative volume $\mu_{n}\left(A_{\varepsilon}(n)\right)$ in ${\cal M}_{k}$ goes to 0 as $n\rightarrow\infty$.

Theorems 1 and 2 give lower bounds on redundancies of coding over monotonic distributions for the class ${\cal M}_{k}$. However, the bounds are more general, and the second region applies to the whole class of monotonic distributions ${\cal M}$. By plugging the boundary values of $k$ into the first regions of both theorems, the bounds of the second regions are obtained, demonstrating the threshold phenomenon of the transition between the regions. Work in [2], subsequent to that presented in this paper, slightly tightened the second region of the bound of Theorem 1 for patterns. This was done by applying a general technique that uses bounds on error correcting codes, as described in earlier work in [27], [28], [29], to patterns, on top of the bounding methods used in [31]. The tighter bound for that region can also be applied to monotonic distributions. As in the case of patterns [22], [31], the bounds in (6) and (8) show that each parameter costs at least $0.5\log (n/k^{3})$ bits for small alphabets, and the total universality cost is at least $\Theta (n^{1/3-\varepsilon})$ bits overall for larger alphabets. We show in Section IV that for $k=O(n)$ these bounds are asymptotically achievable for monotonic distributions. The bounds in (6) and (8) focus on large values of $k$ that can increase with $n$. For small fixed $k$, the second-order terms of existing bounds for coding unconstrained i.i.d. sources are tighter. However, as $k$ increases, the bounds above become tighter through their first dominant term, and second-order terms become negligible. The proofs of Theorems 1 and 2 are presented in Appendixes A and B, respectively.

#### Theorem 3

Let $n\rightarrow\infty$. Then, the $n$th-order individual minimax redundancy for i.i.d. sequences with maximal letter $k$ w.r.t. the monotonic ML description length with alphabet size $k$ is lower bounded by $${\hat{R}}^{+}_{n}\left({\cal M}_{k}\right)\geq\cases{{{k}\over{2n}}\log {{n e^{3}}\over{k^{3}}}-{{\log n}\over{2n}}+O\left({{k^{3/2}}\over{n^{3/2}}}\right), &k\leq n^{1/3}\cr{{3}\over{2}}(\log e)\cdot{{n^{1/3}}\over{n}}-{{\log n} \over{2n}}+O\left({{1}\over{n}}\right), &k>n^{1/3}.}\eqno{\hbox{(10)}}$$

Theorem 3 lower bounds the individual minimax redundancy for coding a sequence believed to have an empirical monotonic distribution. The alphabet size is determined by the maximal letter that occurs in the sequence, i.e., $k=\max\left\{x_{1}, x_{2},\ldots, x_{n}\right\}$. (If $k$ is unknown, one can use Elias' code for the integers [8] using $O(\log k)$ bits to describe $k$. However, this is not reflected in the lower bound.) The ML probability estimate is taken over the class of monotonic distributions. Namely, the empirical probability (standard ML) estimate ${\hat{\mmb{\theta}}}$ is not ${\hat{\mmb{\theta}}}_{\cal M}$ in case ${\hat{\mmb{\theta}}}$ does not satisfy the monotonicity that defines the class ${\cal M}$. While the average case maximin and minimax bounds of Theorem 1 also apply to ${\hat{R}}^{+}_{n}\left({\cal M}_{k}\right)$, the bounds of Theorem 3 are tighter for the individual redundancy and are obtained using individual sequence redundancy techniques.

##### Proof [Theorem 3]

Using Shtarkov's normalized maximum likelihood (NML) approach [37], one can assign probability $$Q\left(x^{n}\right)\triangleq{{P_{{\hat{\theta}}_{\cal M}}\left(x^{n}\right)} \over{\sum\nolimits_{y^{n}}P_{{\hat{\theta}}^{\prime}_{\cal M}}\left(y^{n}\right)} }\triangleq{{\max_{\theta^{\prime}\in{\cal M}}P_{\theta^{\prime}}\left(x^{n}\right) }\over{\sum\nolimits_{y^{n}}\max_{\theta^{\prime\prime}\in{\cal M}} P_{\theta^{\prime\prime}}\left(y^{n}\right)}}\eqno{\hbox{(11)}}$$ to sequence $x^{n}$. This approach minimizes the individual minimax redundancy, giving individual redundancy of \eqalignno{{\hat{R}}_{n}\left(Q, x^{n}\right)=&\,{{1}\over{n}}\log {{\max_{\theta^{\prime}\in {\cal M}}P_{\theta^{\prime}}\left(x^{n}\right)}\over{Q\left(x^{n}\right)}}\cr =&\,{{1}\over{n}}\log \left\{\sum_{y^{n}}\max_{\theta^{\prime}\in{\cal M}} P_{\theta^{\prime}}\left(y^{n}\right)\right\}&{\hbox{(12)}}} to every $x^{n}$, specifically achieving the individual minimax redundancy.

It is now left to bound the logarithm of the sum in (12). We follow the approach used in [23, Th. 2] for bounding the redundancy for standard compression of i.i.d. sequences over large alphabets, and use the results in [38] (as well as the approximation in [1]) for a precise expression of this component. We then adjust the result to monotonic distributions. Let ${\bf n}_{x}^{\ell}\triangleq\left(n_{x}(1), n_{x}(2),\ldots, n_{x}(\ell)\right)$ denote the occurrence counts of the first $\ell$ letters of the alphabet $\Sigma$ in $x^{n}$. Assuming $k$ is the largest letter in $x^{n}$, $\sum_{i=1}^{k}n_{x}(i)=n$. Now, following (12), \eqalignno{& n{\hat{R}}^{+}_{n}\left({\cal M}_{k}\right)\cr &\enspace\buildrel{(a)}\over{\geq}\log \left\{\sum\limits_{y^{n}:{\hat{\theta}}(y^{n})\in{\cal M}}P_{\hat{\theta}}\left(y^{n}\right)\right\}\cr &\enspace\buildrel{(b)}\over{\geq}\log \left\{\sum\limits_{\ell=1}^{k}\sum \limits_{{\bf n}_{y}^{\ell}}{{1}\over{\ell!}}\binom{n}{n_{y}(1),\ldots, n_{y}(\ell)}\prod_{i=1}^{\ell}\left({{n_{y}(i)} \over{n}}\right)^{n_{y}(i)}\right\}\cr &\enspace\buildrel{(c)}\over{\geq}\log \left\{\sum\limits_{{\bf n}_{y}^{k}}{{1} \over{k !}}\binom{n}{n_{y}(1),\ldots,n_{y}(k)}\prod_{i=1}^{k} \left({{n_{y}(i)}\over{n}}\right)^{n_{y}(i)}\right\}\cr &\enspace\buildrel{(d)}\over{=}{{k-1}\over{2}}\log {{n}\over{k}}+{{k}\over{2}} \log e+O\left({{k^{3/2}}\over{n^{1/2}}}\right)-\log \left(k!\right)\cr &\enspace\buildrel{(e)}\over{\geq}{{k}\over{2}}\log {{n e^{3}}\over{k^{3}}}-{{1} \over{2}}\log n+O\left({{k^{3/2}}\over{n^{1/2}}}\right)&{\hbox{(13)}}} where $(a)$ follows from including only sequences $y^{n}$ that have a monotonic empirical (i.i.d. ML) distribution in Shtarkov's sum. Inequality $(b)$ follows from partitioning the sequences $y^{n}$ into types as done in [23], first by the number of occurring symbols $\ell$, and then by the empirical distribution. Unlike standard i.i.d. distributions, though, monotonicity implies that only the first $\ell$ symbols in $\Sigma$ occur, and thus the factor counting the choice of $\ell$ symbols out of $k$ in the proof in [23] is replaced by 1. As in coding patterns, we also divide by $\ell !$ because each type with $\ell$ occurring symbols can be ordered in at most $\ell !$ ways, where only some retain the monotonicity. (Note that this division is the reason that step $(b)$ produces an inequality: more than one of the orderings may be monotonic if equal occurrence counts occur.) Retaining only the term $\ell=k$ yields $(c)$. Then, $(d)$ follows from applying (15) in [38] (see also the approximation of (13) in [1]). Finally, $(e)$ follows from Stirling's approximation $$\sqrt{2\pi m}\cdot\left({{m}\over{e}}\right)^{m}\leq m!\leq\sqrt{2\pi m}\cdot\left({{m}\over{e}}\right)^{m}\cdot\exp\left\{{{1}\over{12m}}\right\}.\eqno{\hbox{(14)}}$$ The first region in (10) results directly from (13). The value $\ell^{\ast}=n^{1/3}$ that maximizes the summand can be retained in step $(c)$ instead of $k$, for every $k\geq\ell^{\ast}$, yielding the second region of the bound. This concludes the proof of Theorem 3. $\hfill \blacksquare$
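Step $(a)$ above can be checked numerically for tiny $n$ and $k$ by brute force: enumerate all sequences, keep those whose empirical distribution is non-increasing, and sum their i.i.d. ML probabilities. A sketch (the parameter values are illustrative only):

```python
import itertools
import math

def log2_shtarkov_lower(n, k):
    """log2 of the sum, over sequences y^n with alphabet {0,...,k-1} whose
    empirical distribution is monotonic (non-increasing), of P_{hat theta}(y^n).
    This is a lower bound on the NML normalizer, as in step (a) of (13)."""
    total = 0.0
    for y in itertools.product(range(k), repeat=n):
        counts = [y.count(i) for i in range(k)]
        if all(counts[i] >= counts[i + 1] for i in range(k - 1)):
            p = 1.0
            for c in counts:
                if c:
                    p *= (c / n) ** c  # i.i.d. ML probability of the sequence
            total += p
    return math.log2(total)

lb = log2_shtarkov_lower(4, 3)  # n * individual redundancy lower bound, in bits
```

The enumeration is exponential in $n$, so this is only a sanity check of the counting argument; the theorem's bound is obtained analytically through steps $(b)$-$(e)$.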

SECTION IV

## UPPER BOUNDS FOR SMALL AND LARGE ALPHABETS

In this section, we demonstrate codes that asymptotically achieve the lower bounds for $\mmb{\theta}\in{\cal M}_{k}$ and $k=O(n)$. We begin with a theorem and a corollary that show the achievable redundancies. The theorem shows a simpler bound, and the corollary (which follows the proof of the theorem) shows a tighter, more complex bound. The remainder of the section is devoted to proving both the theorem and the corollary, by describing codes whose redundancy bounds serve as general upper bounds on the achievable redundancy, and by bounding their redundancies. The theorem is stated assuming no initial knowledge of $k$. The proof first considers the setting where $k$ is known, and then shows how the same bounds are achieved even when $k$ is unknown in advance, as long as it satisfies the conditions.

#### Theorem 4

Fix an arbitrarily small $\varepsilon>0$, and let $n\rightarrow\infty$. Then, there exists a code with length function $L^{\ast}\left(\cdot\right)$ that achieves redundancy \eqalignno{&\displaystyle R_{n}\left(L^{\ast},\mmb{\theta}\right)\leq&{\hbox{(15)}}\cr &\qquad\cases{\left(1+\varepsilon\right){{k-1}\over{2n}}\log {{n}\over{k^{3}}}, &k=o\left(n^{1/3}\right),\cr \left(1+\varepsilon\right){{k-1}\over{2n}}\log {{n\left(\log n\right)^{2}}\over{k^{3}}}, &k\leq n^{1/3},\cr \left(1+\varepsilon\right)\left(\log n\right)\left(\log {{k}\over{n^{1/3-\varepsilon}}}\right){{n^{1/3}}\over{n}}, &n^{1/3}<k=o(n),\cr \left(1+\varepsilon\right){{2}\over{3}}\left(\log n\right)^{2}{{n^{1/3}}\over{n}}, &n^{1/3}<k=O(n)}} for i.i.d. sequences generated by any source $\mmb{\theta}\in{\cal M}_{k}$.

The bounds presented are asymptotic. Second-order terms are absorbed in $\varepsilon$. The second region contains the first, and the last contains the third. The first and third regions, however, have tighter bounds for the smaller values of $k$. The code designed to code a sequence $x^{n}$ is a two-part code [25]. First, a distribution is described, and then it is used to code $x^{n}$. The redundancy consists of the cost of describing the distribution and a quantization cost. Quantization is performed to reduce the description cost, but it incurs the quantization cost. To achieve the lower bound, the larger a probability parameter is, the coarser its quantization. This approach was used in [30] and [31] to obtain upper bounds on the redundancy for coding over large alphabets and for coding patterns, respectively. The method in [30] and [31], however, is insufficient here, because it still results in too many quantization points due to the polynomial growth in quantization spacing. Here, we use an exponential growth as the parameters increase. This general idea was used in [35] to improve an upper bound on the redundancy of coding patterns. Since both encoder and decoder know the order of the probabilities a priori, this order need not be coded. It is thus sufficient to encode the quantized probabilities of the monotonic distribution, and the decoder can identify which probability is associated with which symbol using the monotonicity of the distribution. This point, in fact, complicates the proof, because the actual ML distribution ${\hat{\mmb{\theta}}}$ of a given sequence may not be monotonic even if the sequence was generated by a monotonic distribution. Since the labels are not coded, we must quantize ${\hat{\mmb{\theta}}}_{\cal M}$ instead. There is no such complication when coding patterns or sequences that obey distribution monotonicity side information as in Section VI.
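The exponentially spaced grid described above can be sketched as follows; the interval endpoints, growth factor, and number of points per interval below are illustrative placeholders, not the constants used in the proof. The point of the construction is that geometrically growing intervals, each holding a fixed number of points, yield only $O(\log n)$ intervals and hence a small grid:

```python
def build_grid(n, points_per_interval=8, growth=2.0):
    """Quantization grid on (0, 1]: interval widths grow geometrically by
    `growth` starting from 1/n, each interval holding `points_per_interval`
    equally spaced points.  All constants are illustrative assumptions."""
    grid = []
    lo, width = 0.0, 1.0 / n
    while lo < 1.0:
        hi = min(lo + width, 1.0)
        step = (hi - lo) / points_per_interval
        # place points at interval-cell midpoints
        grid.extend(lo + step * (j + 0.5) for j in range(points_per_interval))
        lo, width = hi, width * growth
    return [g for g in grid if g <= 1.0]

grid = build_grid(n=1024)  # 11 doubling intervals cover (0, 1]
```

Small probabilities are quantized finely and large ones coarsely, matching the intuition that the redundancy penalty of quantizing a parameter is roughly proportional to its relative, not absolute, error.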

Branching off several steps of the proof of Theorem 4 below leads to the following tighter bounds for the upper regions, which are proved following the proof of Theorem 4.

#### Corollary 1

Fix an arbitrarily small $\varepsilon>0$, and let $n\rightarrow\infty$. Then, for $k>n^{1/3}$, there exists a code with length function $L^{\ast}\left(\cdot\right)$ that achieves redundancy $$\eqalignno{&\displaystyle R_{n}\left(L^{\ast},\mmb{\theta}\right)\leq&{\hbox{(16)}}\cr &\qquad\cases{\left(2.3\log {{k}\over{n^{1/3-\varepsilon}}}+0.8\log n\right){{n^{1/3}\left(\log n\right)^{1/3}}\over{n}},&k=o(n),\cr {{2.3\, n^{1/3}\left(\log n\right)^{4/3}}\over{n}}, &k=O(n)}}$$ for i.i.d. sequences generated by any source $\mmb{\theta}\in{\cal M}_{k}$.

##### Proof [Theorem 4]

The proof treats the regions $k\leq n^{1/3}$ and $k>n^{1/3}$ separately. For each region, we construct a grid of points to which a two part code can quantize the probability parameters. The main idea is that the spacing between adjacent grid points is “semi”-exponentially increasing. To achieve that, the probability space is partitioned into intervals whose lengths increase exponentially, and within each interval a fixed number of equally separated grid points is generated. Next, the ML probability vector of each sequence is quantized onto the points of the grid. In the lower $k$ region, a differential code is used to describe the number of grid points between two adjacent probability parameters, starting with the smallest one. In the upper $k$ region, for every grid point, the number of probability parameters quantized to that point is described. Then, the description cost and the quantization cost are upper bounded. The sum of these two costs constitutes the description length. The redundancy is computed by subtracting the description length with the true probability parameters from the description length used. The quantized version of the true probability vector is used as an auxiliary vector to aid in upper bounding this difference.

We start with $k\leq n^{1/3}$, assuming $k$ is known. Let $\beta=1/(\log n)$ be a parameter (other values can also be chosen). Partition the probability space into $J_{1}=\left\lceil 1/\beta\right\rceil$ intervals $$I_{j}=\left[{{n^{(j-1)\beta}}\over{n}},{{n^{j\beta}}\over{n}}\right),\;1\leq j\leq J_{1}.\eqno{\hbox{(17)}}$$ Note that $I_{1}=[1/n, 2/n),\;I_{2}=[2/n, 4/n),\ldots,\;I_{j}=[2^{j-1}/n, 2^{j}/n)$. Let $k_{j}=\vert\{i:\theta_{i}\in I_{j}\}\vert$ denote the number of probabilities in $\mmb{\theta}$ that are in interval $I_{j}$. In interval $j$, take a grid of points with spacing $$\Delta_{j}^{(1)}={{\sqrt{k}n^{j\beta}}\over{n^{1.5}}}.\eqno{\hbox{(18)}}$$ Note that to complete all points in an interval, the spacing between two points at the boundary of an interval may be smaller. There are $\left\lceil\log n\right\rceil$ intervals. Ignoring negligible integer length constraints (here and elsewhere), in each interval, the number of points is bounded by $$\left\vert I_{j}\right\vert\leq{{1}\over{2}}\cdot\sqrt{{n}\over{k}},\;\forall j: j=1, 2,\ldots,J_{1},\eqno{\hbox{(19)}}$$ where $\vert\cdot\vert$ denotes the cardinality of a set. Let the grid $$\eqalignno{{\mmb{\tau}}=&{\left(\tau_{1},\tau_{2},\ldots\right)=\left({{1}\over{n}},{{1}\over{n}}+{{2\sqrt{k}}\over{n^{1.5}}},\ldots,{{2}\over{n}},{{2}\over{n}}+{{4\sqrt{k}}\over{n^{1.5}}},\ldots\right)}&\cr && {\hbox{(20)}}}$$ be a vector that takes all the points from all intervals, with cardinality $$B_{1}\triangleq\vert\mmb{\tau}\vert\leq{{1}\over{2}}\cdot\sqrt{{n}\over{k}}\left\lceil\log n\right\rceil.\eqno{\hbox{(21)}}$$
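As an illustration, the grid $\mmb{\tau}$ of (17)–(21) can be generated directly. The following is a minimal Python sketch (the function name and the sample values of $n$ and $k$ are ours); it uses $n^{\beta}=2$, as noted after (17):

```python
import math

def build_grid(n, k):
    """Sketch of the grid tau of (20): interval I_j = [2^{j-1}/n, 2^j/n)
    gets points spaced by Delta_j = sqrt(k) * 2^j / n**1.5 (since n^beta = 2)."""
    grid = []
    J1 = math.ceil(math.log2(n))  # J_1 = ceil(1/beta) = ceil(log n) intervals
    for j in range(1, J1 + 1):
        lo, hi = 2 ** (j - 1) / n, 2 ** j / n
        delta = math.sqrt(k) * 2 ** j / n ** 1.5
        p = lo
        while p < hi:             # equally spaced points inside interval j
            grid.append(p)
            p += delta
    return grid

tau = build_grid(n=4096, k=8)
```

Each interval then holds roughly $\frac{1}{2}\sqrt{n/k}$ points, matching (19), so the total grid size behaves as in (21).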

Now, let $\mmb{\varphi}=\left(\varphi_{1},\varphi_{2},\ldots,\varphi_{k}\right)$ be a monotonic probability vector, such that $\sum\varphi_{i}=1$, $\varphi_{1}\geq\varphi_{2}\geq\cdots\geq\varphi_{k}\geq 0$, and the smallest $k-1$ components of $\mmb{\varphi}$ are either 0 or taken from $\mmb{\tau}$, i.e., $\varphi_{i}\in (\mmb{\tau}\cup\left\{0\right\})$, $i=2,3,\ldots, k$. One can code $x^{n}$ using a two part code, assuming the distribution governing $x^{n}$ is given by the parameter $\mmb{\varphi}$. The code length required (up to integer length constraints) is $$L\left(x^{n}\vert\mmb{\varphi}\right)=\log k+L_{R}(\mmb{\varphi})-\log P_{\varphi}\left(x^{n}\right),\eqno{\hbox{(22)}}$$ where $\log k$ bits are needed to describe how many letter probabilities are greater than 0 in $\mmb{\varphi}$, $L_{R}(\mmb{\varphi})$ is the number of bits required to describe the quantized points of $\mmb{\varphi}$, and the last term is needed to encode $x^{n}$ assuming it is governed by $\mmb{\varphi}$.

The vector $\mmb{\varphi}$ can be described by a code as follows. Let ${\hat{k}}_{\varphi}$ be the number of nonzero letter probabilities hypothesized by $\mmb{\varphi}$. Let $b_{i}$ denote the index of $\varphi_{i}$ in $\mmb{\tau}$, i.e., $\varphi_{i}=\tau_{b_{i}}$. Then, we use the following differential code. For $\varphi_{{\hat{k}}_{\varphi}}$, we need at most $1+\log b_{{\hat{k}}_{\varphi}}+2\log (1+\log b_{{\hat{k}}_{\varphi}})$ bits to code its index in $\mmb{\tau}$ using Elias' coding for the integers [8]. For $\varphi_{i-1}$, we need at most $1+\log (b_{i-1}-b_{i}+1)+2\log [1+\log (b_{i-1}-b_{i}+1)]$ bits to code the index displacement from the index of the previous parameter, where an additional 1 is added to the difference in case the two parameters share the same index. Summing over all components of $\mmb{\varphi}$, and taking $b_{{\hat{k}}_{\varphi}+1}=0$, $$\eqalignno{L_{R}(\mmb{\varphi})\leq &\,{\hat{k}}_{\varphi}-1+\sum\limits_{i=2}^{{\hat{k}}_{\varphi}}\log \left(b_{i}-b_{i+1}+1\right)+\cr & 2\sum\limits_{i=2}^{{\hat{k}}_{\varphi}}\log \left[1+\log \left(b_{i}-b_{i+1}+1\right)\right]\cr \buildrel{(a)}\over{\leq}&\, (k-1)+(k-1)\log {{B_{1}+k-1}\over{k}}+\cr &2(k-1)\log \log {{B_{1}+k-1}\over{k}}+o(k)\cr \buildrel{(b)}\over{=}&\, (1+\varepsilon){{k-1}\over{2}}\log {{n\left(\log n\right)^{2}}\over{k^{3}}}.&{\hbox{(23)}}}$$ Inequality $(a)$ is obtained by applying Jensen's inequality once on the first sum and twice on the second sum, utilizing the monotonicity of the logarithm function, bounding ${\hat{k}}_{\varphi}$ by $k$, and absorbing second-order terms in the resulting $o(k)$ term. Then, second-order terms are absorbed in $\varepsilon$, and (21) is used to obtain $(b)$.
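The differential description above can be sketched in Python as follows; `elias_delta_len` implements the length bound $1+\log b+2\log(1+\log b)$ quoted from [8], and the function names are ours:

```python
import math

def elias_delta_len(b):
    """Length bound used in the text for Elias' integer code: an integer b >= 1
    costs at most 1 + log b + 2*log(1 + log b) bits (all logs base 2)."""
    return 1 + math.log2(b) + 2 * math.log2(1 + math.log2(b))

def description_cost(indices):
    """Differential cost of the grid indices of a monotonic vector: code the
    smallest index, then each successive gap + 1 (so equal indices are legal)."""
    b = sorted(indices)                 # smallest parameter first
    total = elias_delta_len(b[0])
    for prev, cur in zip(b, b[1:]):
        total += elias_delta_len(cur - prev + 1)
    return total
```

Bounding each gap via Jensen's inequality, as in step $(a)$ of (23), turns this sum into the $(k-1)\log\frac{B_{1}+k-1}{k}$ term.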

To code $x^{n}$, we choose the $\mmb{\varphi}$ that minimizes the expression in (22), i.e., $$L^{\ast}\left(x^{n}\right)=\min_{\mmb{\varphi}:\varphi_{i}\in (\mmb{\tau}\cup\left\{0\right\}),\;i=2,3,\ldots, k}L\left(x^{n}\vert\mmb{\varphi}\right)\triangleq L\left(x^{n}\vert{\hat{\mmb{\varphi}}}\right).\eqno{\hbox{(24)}}$$ The pointwise redundancy for $x^{n}$ is given by $$\eqalignno{nR_{n}\left(L^{\ast}, x^{n}\right)=&\, L^{\ast}\left(x^{n}\right)+\log P_{\theta}\left(x^{n}\right)\cr =&\,\log k+L^{\ast}_{R}\left({\hat{\mmb{\varphi}}}\right)+\log {{P_{\theta}\left(x^{n}\right)}\over{P_{\hat{\varphi}}\left(x^{n}\right)}}.&{\hbox{(25)}}}$$ Note that the pointwise redundancy differs from the individual one, since it is defined w.r.t. the true probability of $x^{n}$. Thus, for a given $x^{n}$ it may also be negative.

To bound the third term of (25), let $\mmb{\theta}^{\prime}$ be a monotonic version of $\mmb{\theta}$ quantized onto $\mmb{\tau}$, i.e., $\theta^{\prime}_{i}\in (\mmb{\tau}\cup\left\{0\right\})$, $i=2,3,\ldots, k$, where $\theta_{i}>0\Leftrightarrow\theta^{\prime}_{i}>0$. This implies that all positive $\theta_{i}<1/n$ are quantized to $\theta^{\prime}_{i}=1/n$. Define the quantization error $$\delta_{i}=\theta_{i}-\theta^{\prime}_{i}.\eqno{\hbox{(26)}}$$ The quantization is performed from the smallest parameter $\theta_{k}$ to the largest, where monotonicity is maintained, as well as a minimal absolute cumulative quantization error. Thus, unless there is a cumulative error formed by many parameters $\theta_{i}<1/n$, $\theta_{i}$ will be quantized to one of the two nearest grid points (one smaller and one greater than it). This also guarantees that $\vert\delta_{1}\vert\leq\Delta_{j_{2}}^{(1)}\leq\Delta_{j_{1}}^{(1)}$, where $j_{1}$ and $j_{2}$ are the indices of the intervals in which $\theta_{1}$ and $\theta_{2}$ are contained, respectively, i.e., $\theta_{1}\in I_{j_{1}}$ and $\theta_{2}\in I_{j_{2}}$. However, if there exists a cumulative error $\Delta_{offset}$ due to quantization of parameters $\theta_{i}: 0<\theta_{i}<1/n$ to $\theta^{\prime}_{i}=1/n$, this error is offset by decreasing every $\theta^{\prime}_{i}$ for which $\theta_{i}>1/n$ by $\alpha_{i}\cdot\Delta_{offset}\cdot\theta^{\prime}_{i}$, where $\alpha_{i}>0$ is some constant, and quantizing the resulting value to the nearest grid point maintaining monotonicity and minimal cumulative error. By construction, $\Delta_{offset}\leq k/n$, and thus $$\left\vert\delta_{i}\right\vert\leq{{\sqrt{k}n^{j\beta}}\over{n^{1.5}}}+{{k}\over{n}}\alpha^{\prime}_{i}\theta^{\prime}_{i},\eqno{\hbox{(27)}}$$ where $\alpha^{\prime}_{i}>0$ is a constant derived from $\alpha_{i}$.
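The quantization scan just described can be sketched as follows. This is a simplified Python illustration of scanning from $\theta_{k}$ upward while balancing the cumulative error; the multiplicative offset mechanism for sub-$1/n$ probabilities is omitted, and all names are ours:

```python
import bisect

def quantize_monotone(theta, grid):
    """Quantize a nonincreasing vector theta onto grid points, scanning from the
    smallest parameter up, keeping the result monotone and steering the running
    cumulative error sum(theta_i - theta'_i) toward zero."""
    out = [0.0] * len(theta)
    cum_err = 0.0
    prev = 0.0                                # monotonicity floor for the scan
    for i in range(len(theta) - 1, -1, -1):
        t = theta[i]
        if t == 0.0:
            continue                          # zero probabilities stay zero
        target = max(t + cum_err, prev)       # offset the accumulated error
        j = bisect.bisect_left(grid, target)
        cands = [grid[min(j, len(grid) - 1)]]
        if j > 0 and grid[j - 1] >= prev:     # lower neighbor, if still monotone
            cands.append(grid[j - 1])
        q = max(min(cands, key=lambda g: abs(g - target)), grid[0], prev)
        out[i] = q
        cum_err += t - q
        prev = q
    return out
```

For instance, with grid $(0.1,0.2,\ldots,0.5)$ and $\mmb{\theta}=(0.5,0.3,0.15,0.05)$, the scan returns $(0.5,0.3,0.1,0.1)$: the error $-0.05$ incurred on the smallest parameter is cancelled on the next one, so the displacements sum to zero.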

Now, since $\mmb{\theta}^{\prime}$ is included in the minimization of (24), we have, for every $x^{n}$, $$L^{\ast}\left(x^{n}\right)\leq L\left(x^{n}\vert\mmb{\theta}^{\prime}\right),\eqno{\hbox{(28)}}$$ and also $$nR_{n}\left(L^{\ast}, x^{n}\right)\leq\log k+L_{R}\left(\mmb{\theta}^{\prime}\right)+\log {{P_{\theta}\left(x^{n}\right)}\over{P_{\theta^{\prime}}\left(x^{n}\right)}}.\eqno{\hbox{(29)}}$$ Averaging over all possible $x^{n}$, the average redundancy is bounded by $$\eqalignno{&\displaystyle nR_{n}\left(L^{\ast},\mmb{\theta}\right)\cr &\;\;=\log k+E_{\theta}L^{\ast}_{R}\left({\hat{\mmb{\varphi}}}\right)+E_{\theta}\log {{P_{\theta}\left(X^{n}\right)}\over{P_{\hat{\varphi}}\left(X^{n}\right)}}\cr &\;\;\leq\log k+E_{\theta}L_{R}\left(\mmb{\theta}^{\prime}\right)+E_{\theta}\log {{P_{\theta}\left(X^{n}\right)}\over{P_{\theta^{\prime}}\left(X^{n}\right)}}.&{\hbox{(30)}}}$$ The second term of (30) is bounded with the bound of (23), and we proceed with the third term $$\eqalignno{& E_{\theta}\log {{P_{\theta}\left(X^{n}\right)}\over{P_{\theta^{\prime}}\left(X^{n}\right)}}\cr &\quad\buildrel{(a)}\over{=}n\sum\limits_{i=1}^{k}\theta_{i}\log {{\theta_{i}} \over{\theta^{\prime}_{i}}}\;\buildrel{(b)}\over{=}\;n\sum\limits_{i=1}^{k}\left (\theta^{\prime}_{i}+\delta_{i}\right)\log \left(1+{{\delta_{i}}\over{\theta^{\prime}_{i}}}\right)\cr &\quad\buildrel{(c)}\over{\leq}n (\log e)\sum\limits_{i=1}^{k}\left (\theta^{\prime}_{i}+\delta_{i}\right){{\delta_{i}}\over{\theta^{\prime}_{i}}}\; \buildrel{(d)}\over{=}\;n(\log e)\sum\limits_{i=1}^{k}{{\delta_{i}^{2}} \over{\theta^{\prime}_{i}}}\cr &\quad\buildrel{(e)}\over{\leq}\left(1+o(1)\right) k\log e+\left(1+o(1)\right) {{2(\log e)k}\over{n}}\sum\limits_{j=1}^{J_{1}}k_{j}\cdot n^{j\beta}\cr &\quad\buildrel{(f)}\over{\leq}5\left(1+o(1)\right)(\log e) k.&{\hbox{(31)}}}$$ Equality $(a)$ holds since the expectation is performed on the number of occurrences of each letter $i$.
Representing $\theta_{i}=\theta^{\prime}_{i}+\delta_{i}$ yields equality $(b)$. We use $\ln (1+x)\leq x$ to obtain $(c)$. Equality $(d)$ is obtained since all the quantization displacements must sum to 0. The first term of inequality $(e)$ in (31) is obtained under a worst case assumption that $\theta_{i}\ll 1/n$ for $i\geq 2$. In this case, $\theta_{i}$ is quantized to $\theta^{\prime}_{i}=1/n$, and the bound $\vert\delta_{i}\vert\leq 1/n$ is used. In a different worst case scenario, we have from (27), and since in interval $j$, $\theta^{\prime}_{i}\geq n^{(j-1)\beta}/n$, $${{\delta_{i}^{2}}\over{\theta^{\prime}_{i}}}\leq{{2k n^{j\beta}}\over{n^{2}}}+{{2 k^{1.5}n^{j\beta}}\over{n^{2.5}}}+{{k^{2}\alpha^{\prime^{2}}_{i}\theta^{\prime}_{i}}\over{n^{2}}},\eqno{\hbox{(32)}}$$ where $n^{\beta}=2$ is used to derive this bound. Since $k=o(n)$, the second term above is absorbed in the first, leading to the second term of inequality $(e)$ of (31) after aggregating the elements of the sum into intervals. The sum over $i$ of the last term of (32) is $o(1)$ since $k=o(n)$; this sum is absorbed into the first term of inequality $(e)$ of (31). Inequality $(f)$ of (31) is obtained since $$\sum_{j=1}^{J_{1}}k_{j}n^{j\beta}=\sum_{j=1}^{J_{1}}k_{j}2^{j}\leq 2n.\eqno{\hbox{(33)}}$$ Inequality (33) follows since $k_{1}\leq n$, $k_{2}\leq (n-k_{1})/2$, $k_{3}\leq (n-k_{1})/4-k_{2}/2$, and so on, until $$k_{J_{1}}\leq{{n}\over{2^{J_{1}-1}}}-\sum_{\ell=1}^{J_{1}-1}{{k_{\ell}}\over{2^{J_{1}-\ell}}}\Rightarrow\sum_{j=1}^{J_{1}}k_{j}2^{j}\leq2n.\eqno{\hbox{(34)}}$$ The reason for these relations is the lower limits of the $J_{1}$ intervals, which restrict the number of parameters inside each interval. The restriction is applied in order of intervals, so that the used probability mass is subtracted, leading to the series of inequalities.
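Inequality (33) holds for any probability vector, since each $\theta_{i}\in I_{j}$ satisfies $2^{j}\leq 2n\theta_{i}$, so the sum is at most $2n\sum_{i}\theta_{i}\leq 2n$. This can be checked numerically (a Python sketch; names and sample values are ours):

```python
import math
import random

def interval_weighted_count(theta, n):
    """Compute sum_j k_j * 2^j of (33), where k_j counts the probabilities
    falling in I_j = [2^{j-1}/n, 2^j/n); the top interval index is clamped."""
    J1 = math.ceil(math.log2(n))
    total = 0
    for t in theta:
        if t >= 1 / n:
            j = min(int(math.log2(t * n)) + 1, J1)  # t lies in I_j
            total += 2 ** j
    return total

random.seed(1)
n = 1024
w = sorted((random.random() for _ in range(50)), reverse=True)
theta = [x / sum(w) for x in w]
assert interval_weighted_count(theta, n) <= 2 * n   # inequality (33)
```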

Plugging the bounds of (23) and (31) into (30), we obtain $$\eqalignno{&\displaystyle nR_{n}\left(L^{\ast},\mmb{\theta}\right)\cr &\!\!\quad\leq\log k+\left(1+\varepsilon\right){{k-1}\over{2}}\log {{n\left(\log n\right)^{2}}\over{k^{3}}}+5 (\log e)k\cr &\!\!\quad\leq\left(1+\varepsilon^{\prime}\right){{k-1}\over{2}}\log {{n\left(\log n\right)^{2}}\over{k^{3}}},&{\hbox{(35)}}}$$ where we absorb second-order terms in $\varepsilon^{\prime}$. Replacing $\varepsilon^{\prime}$ by $\varepsilon$ and normalizing the redundancy per symbol by $n$, the bound of the second region of (15) is proved. Since $\log \log n$ can also be absorbed in $\varepsilon$, the first region is also proved. The code proposed, however, leads to a redundancy whose second order is larger than that obtained for standard i.i.d. optimal codes that do not exploit the distribution monotonicity for fixed $k$. This is because the grid used here is too dense for the fixed $k$ case. One can use the standard i.i.d. codes for tighter second-order bounds for fixed $k$.

We now consider the larger values of $k$, i.e., $n^{1/3}<k=O(n)$. The idea of the proof is the same. However, we need to partition the probability space into different intervals, the spacing within an interval must be optimized, and the parameters' description cost must be bounded differently, because now there are more parameters to quantize than points in the quantization grid. Define the $j$th interval as $$I_{j}=\left[{{n^{(j-1)\beta}}\over{n^{2}}},{{n^{j\beta}}\over{n^{2}}}\right),\;1\leq j\leq J_{2},\eqno{\hbox{(36)}}$$ where $J_{2}=\left\lceil 2/\beta\right\rceil=\left\lceil 2\log n\right\rceil$. Again, let $k_{j}=\vert\{i:\theta_{i}\in I_{j}\}\vert$ denote the number of probabilities in $\mmb{\theta}$ that are in interval $I_{j}$. It would be possible to use the intervals as defined in (17), but this would not guarantee bounded redundancy at the rate we require if there are very small probabilities $\theta_{i}\ll 1/n$. Note that the smallest nonzero component of ${\hat{\mmb{\theta}}}$ is $1/n$. However, this is not necessarily the case for ${\hat{\mmb{\theta}}}_{\cal M}$; the latter may contain smaller nonzero probabilities for sequences that do not obey the monotonicity of the distribution. Therefore, the interval definition in (17) can be used for larger alphabets only if the probabilities of the symbols are known to be bounded. Define the spacing in interval $j$ as $$\Delta_{j}^{(2)}={{n^{j\beta}}\over{n^{2+\alpha}}},\eqno{\hbox{(37)}}$$ where $\alpha>0$ is a parameter to be optimized.
Similarly to (19), the interval cardinality here is $$\left\vert I_{j}\right\vert\leq 0.5\cdot n^{\alpha},\;\forall j: j=1, 2,\ldots, J_{2}.\eqno{\hbox{(38)}}$$ In a similar manner to the definition of $\mmb{\tau}$ in (20), we define $$\eqalignno{\mmb{\eta}=&\,\left(\eta_{1},\eta_{2},\ldots\right)\cr =&\,\left({{1}\over{n^{2}}},{{1}\over{n^{2}}}+{{2}\over{n^{2+\alpha}}},\ldots,{{2}\over{n^{2}}},{{2}\over{n^{2}}}+{{4}\over{n^{2+\alpha}}},\ldots\right).&{\hbox{(39)}}}$$ The cardinality of $\mmb{\eta}$ is $$B_{2}\triangleq\vert\mmb{\eta}\vert\leq 0.5\cdot n^{\alpha}\left\lceil 2\log n\right\rceil\leq n^{\alpha}\left\lceil\log n\right\rceil.\eqno{\hbox{(40)}}$$

We now perform the encoding similarly to the small $k$ case, where we allow nonzero quantized values for the components of $\mmb{\varphi}$ up to $i=n^{2}$. (This is more than needed but is possible since $\eta_{1}=1/n^{2}$.) Thus, similarly to (30), we have $$nR_{n}\left(L^{\ast},\mmb{\theta}\right)\leq 2\log n+E_{\theta}L_{R}\left(\mmb{\theta}^{\prime}\right)+E_{\theta}\log {{P_{\theta}\left(X^{n}\right)}\over{P_{\theta^{\prime}}\left(X^{n}\right)}},\eqno{\hbox{(41)}}$$ where the first term is due to allowing up to ${\hat{k}}=n^{2}$ nonzero probabilities. Since in this region usually $k\geq B_{2}$ (except at the low end), the vectors $\mmb{\varphi}$ and $\mmb{\theta}^{\prime}$ are described by coding the cardinalities $\vert\{i:\varphi_{i}=\eta_{j}\}\vert$ and $\vert\{i:\theta^{\prime}_{i}=\eta_{j}\}\vert$, respectively, i.e., for each grid point the code describes how many letters have probability quantized to this point. This idea resembles coding profiles of patterns, as done in [22]. However, unlike the method in [22], here, many probability parameters of symbols with different occurrence counts are mapped to the same grid point by quantization. The number of parameters mapped to a grid point of $\mmb{\eta}$ is coded using Elias' representation of the integers.
Hence, in a similar manner to (23), $$\eqalignno{& L_{R}(\mmb{\theta}^{\prime})&{\hbox{(42)}}\cr &\!\!\quad\buildrel{(a)}\over{\leq}\sum\limits_{j=1}^{B_{2}}\left\{1+\log (\vert\{i:\theta^{\prime}_{i}=\eta_{j}\}\vert+1)+\right.\cr &\qquad\left.2\log [1+\log (\vert\{i:\theta^{\prime}_{i}=\eta_{j}\}\vert+1)]\right\}\cr &\!\!\quad\buildrel{(b)}\over{\leq}B_{2}+B_{2}\log {{k+B_{2}}\over{B_{2}}}+2B_{2}\log \log {{k+B_{2}}\over{B_{2}}}+o(B_{2})\cr &\!\!\quad\buildrel{(c)}\over{\leq}\cases{(1+\varepsilon) (\log n)(\log {{k}\over{n^{\alpha-\varepsilon}}}) n^{\alpha}, &n^{\alpha}<k=o(n),\cr (1+\varepsilon) (1-\alpha)(\log n)^{2}n^{\alpha}, &n^{\alpha}<k=O(n).}}$$ The additional 1 inside the logarithms in $(a)$ accounts for grid points with 0 assigned parameters; $(b)$ is obtained similarly to step $(a)$ of (23), absorbing all second-order terms in the last term. To obtain $(c)$, we first assume, for the first region, that $k n^{\varepsilon}\gg B_{2}$ (an assumption that must later be validated with the choice of $\alpha$). Then, second-order terms are absorbed in $\varepsilon$. The extra $n^{\varepsilon}$ factor is unnecessary if $k\gg B_{2}$. The second region is obtained by upper bounding $k$ without this factor. It is possible to separate the first region into two regions, eliminate this factor in the lower region, and obtain a more complicated, yet tighter, expression in the upper region, where $k=\Theta (n^{1/3})$.
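The per-grid-point description of step $(a)$ of (42) can be sketched as follows (Python; `elias_len` is the same length bound as in the small $k$ case, and the names are ours):

```python
import math
from collections import Counter

def elias_len(b):
    """Elias length bound: 1 + log b + 2*log(1 + log b) bits for b >= 1."""
    return 1 + math.log2(b) + 2 * math.log2(1 + math.log2(b))

def count_description_cost(quantized, grid):
    """For every grid point eta_j, code how many parameters were quantized to
    it; the +1 makes a zero count codable, as in step (a) of (42)."""
    counts = Counter(quantized)
    return sum(elias_len(counts.get(g, 0) + 1) for g in grid)
```

A grid point used by no parameter still costs one bit, which is why the bound in (42) scales with $B_{2}$ rather than with $k$ alone.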

Now, similarly to (31), we obtain $$\eqalignno{E_{\theta}\log {{P_{\theta}(X^{n})}\over{P_{\theta^{\prime}}(X^{n})}}\leq &\, n (\log e)\sum\limits_{i=1}^{k}{{\delta_{i}^{2}}\over{\theta^{\prime}_{i}}}\cr \buildrel{(a)}\over{\leq}&\, O(1)+{{2\log e}\over{n^{1+2\alpha}}}\sum\limits_{j=1}^{J_{2}}k_{j}n^{j\beta}\cr \buildrel{(b)}\over{\leq}&\, 4(\log e) n^{1-2\alpha}+O(1).&{\hbox{(43)}}}$$ The first term of inequality $(a)$ is obtained under the assumption that $k=O(n)$, $\theta^{\prime}_{i}\geq 1/n^{2}$, and $\vert\delta_{i}\vert\leq 1/n^{2}$. Similarly to the last two terms of (32), an additional $O(1)$ term accounts for the extra offset costs of the larger probability symbols due to many small probability symbols, if they exist. For the second term, $\vert\delta_{i}\vert\leq n^{j\beta}/n^{2+\alpha}$ and $\theta^{\prime}_{i}\geq n^{(j-1)\beta}/n^{2}$. Inequality $(b)$ is obtained in a similar manner to inequality $(f)$ of (31), where the sum is similarly shown to be at most $2n^{2}$.

Summing up the contributions of (42) and (43) in (41), $\alpha=1/3$ is shown to minimize the total cost (to first order). This choice of $\alpha$ also satisfies the assumption of step $(c)$ of (42). Using $\alpha=1/3$, absorbing all second-order terms in $\varepsilon$, and normalizing by $n$, we obtain the remaining two regions of the bound in (15). It should be noted that the proof here would give a bound of $O(n^{1/3+\varepsilon})$ up to $k=O(n^{4/3})$. If the intervals in (17) were used for bounded distributions, the coefficients of the last two regions would be reduced by a factor of 2.
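The trade-off that fixes $\alpha=1/3$ can be seen numerically: the description cost of (42) grows like $(\log n)^{2}n^{\alpha}$ while the quantization cost of (43) decays like $n^{1-2\alpha}$, so to first order the exponents balance at $\alpha=1-2\alpha$. A Python sketch (the constants and the sample $n$ are ours; at finite $n$ the $(\log n)^{2}$ factor pulls the minimizer somewhat below $1/3$, consistent with (45) below):

```python
import math

def total_cost(alpha, n):
    """First-order total of (42) + (43): description ~ (log n)^2 * n^alpha
    plus quantization ~ 4*log2(e) * n^(1-2*alpha)."""
    return (math.log2(n)) ** 2 * n ** alpha \
        + 4 * math.log2(math.e) * n ** (1 - 2 * alpha)

n = 10 ** 9
best = min((i / 1000 for i in range(1, 1000)), key=lambda a: total_cost(a, n))
# the exponent balance alpha = 1 - 2*alpha gives alpha = 1/3; the finite-n
# minimizer "best" sits slightly lower because of the (log n)^2 factor
```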

The proof up to this point assumes that $k$ is known in advance. This is important for the code yielding the bounds of the first two regions, because the quantization grid depends on $k$. Specifically, if $k$ is underestimated in building the grid, the description cost of $\mmb{\varphi}$ increases; if $k$ is overestimated, the quantization cost increases. Also, if the code for larger $k$'s is used for a smaller $k$, a larger bound than necessary results. To solve this, the optimization that chooses $L^{\ast}(x^{n})$ is performed over all possible values of $k$ (greater than or equal to the maximal symbol occurring in $x^{n}$): every such $k$ in the first construction, as well as the construction of the code for the top regions. For fixed $k$, a standard optimal code for nonmonotonic distributions can also be constructed. For every small $k$, a different construction is used, with the appropriate $k$ determining the spacing in each interval. The value of $k$ yielding the shortest code word is then used. Elias' coding for the integers can be used to designate $k$ with $O (\log k)$ prefix bits. The analysis continues as before. This does not change the redundancy to first order, giving all four regions of the bound in (15), even if $k$ is unknown in advance. This concludes the proof of Theorem 4. $\hfill \blacksquare$

##### Proof [Corollary 1]

The proof branches off the proof of Theorem 4 by improving several steps, mainly the choice of $\alpha$. First, like the partitioning of the probability space into three intervals in [35], we can partition the probability space into two intervals here, $(0, 1/n^{\alpha}]$ and $(1/n^{\alpha}, 1)$. (Since we can have probabilities smaller than $1/n$, we cannot use a bottom interval of $(0, n^{\alpha}/n]$ here.) In the top interval, we need at most $(1+\varepsilon) n^{\alpha}\log n$ bits to describe the monotonic ML probabilities of the at most $n^{\alpha}$ symbols whose probabilities are in this interval. Quantizing all these probabilities with $1/n$ resolution yields $o(1)$ additional quantization cost. (This can be shown following similar steps to (43) with a choice of $\alpha=(1+o(1))/3$.) Using $1/n^{\alpha}$ as the upper limit on the total number of intervals in (36) instead of 1 now yields $J^{\prime}_{2}=(2-\alpha)\log n$. It then follows, similarly to (40), that $B^{\prime}_{2}\leq 0.5 (2-\alpha) n^{\alpha}\log n$. Next, the description costs in (42) are reduced by the factor $0.5 (2-\alpha)$. Combining the costs in (41), using the new description cost and the quantization cost of (43), and absorbing the cost of the top probability interval and other second-order terms in $\varepsilon$ yields $$\eqalignno{nR_{n}&\left(L^{\ast},\mmb{\theta}\right)&{\hbox{(44)}}\cr \leq&\, (1+\varepsilon)\cdot 0.5(2-\alpha) (1-\alpha) n^{\alpha}(\log n)^{2}+4 (\log e) n^{1-2\alpha}}$$ for $k=O(n)$. (Similarly, a more complex expression can be written for $k=o(n)$.)
A choice of $$\alpha={{1}\over{3}}-{{2}\over{3}}{{\log \log n}\over{\log n}}+{{\log \nu}\over{\log n}}\eqno{\hbox{(45)}}$$ for some parameter $\nu$ minimizes (44), yielding $$nR_{n}\left(L^{\ast},\mmb{\theta}\right)\leq(1+\varepsilon)\left({{5\nu}\over{9}}+{{4\log e}\over{\nu^{2}}}\right) n^{1/3}(\log n)^{4/3}.\eqno{\hbox{(46)}}$$ Taking $\nu=(72 (\log e)/5)^{1/3}\approx 2.75$ minimizes (46), yielding a coefficient of less than 2.3 in (46). Letting $n\rightarrow\infty$ and absorbing all second-order terms in the gap between the coefficient and 2.3 proves the second region of (16). Using the same value of $\nu$ for the resulting terms of the first region in a similar manner proves the first region. A slightly tighter bound can be obtained for the first region if the value of $\nu$ is optimized for the specific value of $k$. $\hfill \blacksquare$
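The optimization over $\nu$ can be verified numerically (a short Python sketch):

```python
import math

log_e = math.log2(math.e)
coeff = lambda nu: 5 * nu / 9 + 4 * log_e / nu ** 2   # bracket in (46)

# stationary point of coeff: d/dnu = 5/9 - 8*log_e/nu^3 = 0
nu_star = (72 * log_e / 5) ** (1 / 3)
# nu_star is about 2.75, and coeff(nu_star) is about 2.29 < 2.3,
# the constant appearing in (16)
```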

SECTION V

## UPPER BOUNDS FOR FAST DECAYING DISTRIBUTIONS

This section shows that, with some mild conditions on the source distribution, the same redundancy upper bounds achieved for finite monotonic distributions can be achieved even if the monotonic distribution is over an infinite alphabet. The key observation that leads to this result is that a distribution that decays fast enough will result in only a small number of occurrences of letters from its tail in $x^{n}$. Occurrences of these letters will likely not retain the monotonicity. Since there are few such occurrences, they can be handled without increasing the asymptotic order of the coding cost. More precisely, fast decaying monotonic distributions can be viewed as if they have some effective bounded alphabet size; occurrences of symbols outside this limited alphabet are rare. We present two theorems and a corollary that upper bound the redundancy when coding with such unknown monotonic distributions. The first theorem also provides a slightly stronger bound (with a smaller coefficient) for $k=O(n)$. For slower decays with more occurring symbols from the distribution tail, the redundancy order does increase due to the penalty of identifying these symbols in a sequence. However, we show, consistently with the results in [11], that as long as the entropy of the source is finite, a universal code, in the sense of diminishing redundancy per symbol, still exists. We begin by stating the two theorems and the corollary; then the proofs are presented. The section concludes with three examples of typical monotonic distributions over the integers that demonstrate both cases of fast and slow decays.

### A. Upper Bounds

We begin with some notation. Fix an arbitrarily small $\varepsilon>0$, and let $n\rightarrow\infty$. Define $m\triangleq m_{\rho}\triangleq n^{\rho}$ as the effective alphabet size, where $\rho>\varepsilon$. (Note that $\rho=(\log m)/(\log n)$.) Let $$\eqalignno{&{\cal R}_{n}(m)&{\hbox{(47)}}\cr &\!\!\quad\triangleq\cases{{{m-1}\over{2}}\log {{n}\over{m^{3}}}, &m=o(n^{1/3}),\cr {{1}\over{2}}\left(\rho+{{1}\over{3}}\right)\left(\rho+\varepsilon-{{1}\over{3}}\right)(\log n)^{2}n^{1/3}, &\hbox{otherwise.}}}$$
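As a concrete (and deliberately heuristic) reading of (47), the asymptotic region conditions can be translated into finite-$n$ tests; the function name and the region threshold below are ours:

```python
import math

def R_target(n, m, eps=0.01):
    """Heuristic finite-n evaluation of R_n(m) in (47), with
    rho = log m / log n and the regions split at m = n^(1/3)."""
    rho = math.log2(m) / math.log2(n)
    if m < n ** (1 / 3):                      # stands in for m = o(n^{1/3})
        return (m - 1) / 2 * math.log2(n / m ** 3)
    return 0.5 * (rho + 1 / 3) * (rho + eps - 1 / 3) \
        * math.log2(n) ** 2 * n ** (1 / 3)
```

For $m=2^{5}$ and $n=2^{30}$, the first region applies and ${\cal R}_{n}(m)=\frac{31}{2}\log 2^{15}=232.5$ bits.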

#### Theorem 5

1. Fix an arbitrarily small $\varepsilon>0$, and let $n\rightarrow\infty$. Let $x^{n}$ be generated by an i.i.d. monotonic distribution $\mmb{\theta}\in{\cal M}$. If there exists $m^{\ast}$, such that $$\sum_{i>m^{\ast}}n\theta_{i}\log i=o\left[{\cal R}_{n}\left(m^{\ast}\right)\right],\eqno{\hbox{(48)}}$$ then there exists a code with length function $L^{\ast}(\cdot)$, such that $$R_{n}\left(L^{\ast},\mmb{\theta}\right)\leq{{\left(1+\varepsilon\right)}\over{n}}{\cal R}_{n}\left(m^{\ast}\right)\eqno{\hbox{(49)}}$$ for the monotonic distribution $\mmb{\theta}$.
2. If there exists $m^{\ast}$ for which $\rho^{\ast}=o\left(n^{1/3}/(\log n)\right)$, such that $$\sum_{i>m^{\ast}}\theta_{i}\log i=o(1),\eqno{\hbox{(50)}}$$ then there exists a universal code with length function $L^{\ast}(\cdot)$, such that $$R_{n}\left(L^{\ast},\mmb{\theta}\right)=o(1).\eqno{\hbox{(51)}}$$

Theorem 5 shows that redundancy bounds of the same order as those obtained for finite alphabets are achievable for monotonic distributions that decay fast enough (with an effective alphabet that does not exceed $O(n^{\rho})$ symbols for a fixed $\rho$). Specifically, very fast decaying distributions, although over infinite alphabets, may even behave like monotonic distributions with $o\left(n^{1/3}\right)$ symbols. The condition in (48) merely means that the cost a code would incur in coding very rare symbols, those beyond the effective alphabet size, is negligible w.r.t. the total cost obtained from the other, more likely, symbols. Note that for $m=n$, the bound is tighter than that of the last region of Theorem 4, and a constant of 4/9 replaces 2/3. The second part of the theorem states that if the decay is slow, but the cost of coding rare symbols is still diminishing per symbol, a universal code still exists for such distributions. However, in this case the redundancy will be dominated by coding the rare (out of order) symbols.

Applying the additional steps used to prove Corollary 1 to the proof of the first part of Theorem 5 yields a tighter expression for the second region of ${\cal R}_{n}(m)$ in (47), which for fixed $\rho$ is $\Theta\left(n^{1/3}(\log n)^{4/3}\right)$. While Theorem 5 bounds the redundancy decay rate for two extremes, a more general theorem can provide the redundancy rates for coding an unknown monotonic distribution whose decay rate is between these extremes. As the examples at the end of this section show, the next theorem is very useful for slower decaying distributions. It also encapsulates the derivation of a tighter bound, as in Corollary 1, for the more general case.

#### Theorem 6

Fix an arbitrarily small $\varepsilon>0$, and let $n\rightarrow\infty$. Let $x^{n}$ be generated by an i.i.d. monotonic distribution $\mmb{\theta}\in{\cal M}$. Then, there exists a code with length function $L^{\ast}(\cdot)$ that achieves redundancy $$\eqalignno{&\displaystyle n R_{n}(L^{\ast},\mmb{\theta})\leq(1+\varepsilon)\cdot\cr &\quad\min_{\alpha>0,\rho:\rho\geq\alpha+\varepsilon}\left\{{{(\rho+\alpha)(\rho-\alpha)(\log n)^{2}n^{\alpha}}\over{2}}+{{5 n^{1-2\alpha}}\over{\ln 2}}+\right.\cr &\qquad\qquad\qquad\quad\left.\left(1+{{1}\over{\rho}}\right)n\sum_{i>n^{\rho}}\theta_{i}\log i\right\}&{\hbox{(52)}}}$$ for coding sequences generated by the source $\mmb{\theta}$.

The theorems above lead to the following corollary.

#### Corollary 2

As $n\rightarrow\infty$, sequences generated by monotonic distributions with $H_{\theta}(X)=O(1)$ are universally compressible in the average sense.

Corollary 2 shows that sequences generated by finite entropy monotonic distributions can be compressed in the average with diminishing per symbol redundancy. This result is consistent with the results shown in [11].

We continue by proving the two theorems and the corollary.

##### Proof

The proofs of both theorems are constructive, in a similar manner to the proof of Theorem 4. This time, however, the main idea is to first separate the more likely symbols from the unlikely ones. The code first determines the point of this separation, $m=n^{\rho}$. (Note that $\rho$ can be greater than 1.) All symbols $i\leq m$ are considered likely and are quantized and described in a similar manner as in the codes for smaller alphabets. Unlike the bounded alphabet case, though, a more robust grid is used here to allow larger values of $m$. The unlikely symbols are coded hierarchically. They are first merged into a single innovation symbol. Then, they are encoded within this symbol by coding their actual values. As long as the decay is fast enough, the average cost of conveying these symbols becomes negligible w.r.t. the cost of coding the likely symbols. If the decay is slower, but still fast enough, as in the case described by condition (50), the coding cost of the rare symbols dominates the redundancy, which is still diminishing. The description length of likely symbols is bounded as in the proof of Theorem 4, consisting of the description of the probability grid points and the quantization cost. In order to determine the best value of $m$ for a given sequence, all values are tried, and the one yielding the shortest description is used for coding the specific $x^{n}$. The steps described prove both Theorems 5 and 6.

Let $m\geq 2$ determine the number of likely symbols in the alphabet. For a given $m$, define $$S_{m}\triangleq\sum\limits_{i>m}\theta_{i}\eqno{\hbox{(53)}}$$ as the total probability of the remaining symbols. Given $\mmb{\theta}$, $m$, and $S_{m}$, a probability $$\eqalignno{& P\left(x^{n}\vert m, S_{m},\mmb{\theta}\right)&{\hbox{(54)}}\cr &\quad\triangleq\left[\prod_{i=1}^{m}\theta_{i}^{n_{x}(i)}\right]\cdot S_{m}^{n_{x}(x>m)}\cdot\prod_{i>m}\left({{n_{x}(i)}\over{n_{x}(x>m)}}\right)^{n_{x}(i)}}$$ can be computed for $x^{n}$, where $n_{x}(i)$ counts the occurrences of symbol $i$ in $x^{n}$, and $n_{x}(x>m)$ counts all symbols greater than $m$ in $x^{n}$. This probability mass function clusters all symbols greater than $m$ into one innovation symbol, and then uses the ML estimate of each such symbol's probability to distinguish among them within the clustered symbol.
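The code length $-\log P\left(x^{n}\vert m, S_{m},\mmb{\theta}\right)$ of (54) can be computed directly. A Python sketch (`theta` holds at least the first $m$ probabilities; the function name is ours):

```python
import math
from collections import Counter

def clustered_code_length(x, m, theta):
    """-log2 of (54): symbols i <= m are coded with theta_i; all symbols > m
    share one innovation symbol of mass S_m and are told apart inside it
    by their empirical (ML) probabilities."""
    S_m = 1.0 - sum(theta[:m])
    counts = Counter(x)
    tail = sum(c for s, c in counts.items() if s > m)
    bits = 0.0
    for s, c in counts.items():
        if s <= m:
            bits -= c * math.log2(theta[s - 1])
        else:
            bits -= c * (math.log2(S_m) + math.log2(c / tail))
    return bits
```

For example, with $\mmb{\theta}=(1/2,1/4,\ldots)$, $m=2$, and $x^{4}=(1,1,2,3)$, the length is $2+2+2=6$ bits: the single tail symbol receives the full innovation mass $S_{2}=1/4$.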

For every $m$, we can define a quantization grid $\mmb{\xi}_{m}$, in a similar manner to the proof of Theorem 4, for the first $m$ components of $\mmb{\theta}$. If $m=o(n^{1/3})$, we use $\mmb{\xi}_{m}=\mmb{\tau}_{m}$, where $\mmb{\tau}_{m}$ is the grid defined in (20) with $m$ replacing $k$. Otherwise, we can use the definition of $\mmb{\eta}$ in (39). However, to obtain tighter bounds for large $m$, we define a different grid for the larger values of $m$, following similar steps to those in (36)–(40). First, define the $j$th interval as $$I_{j}=\left[{{n^{(j-1)\beta}}\over{n^{\rho+2\alpha}}},{{n^{j\beta}}\over{n^{\rho+2\alpha}}}\right),\;1\leq j\leq J_{\rho},\eqno{\hbox{(55)}}$$ where $\rho=(\log m)/(\log n)$ as defined above, $\alpha>0$ is a parameter, and $\beta=1/(\log n)$ as before. Within the $j$th interval, we define the spacing in the grid by $$\Delta_{j}^{(\rho)}={{n^{j\beta}}\over{n^{\rho+3\alpha}}}.\eqno{\hbox{(56)}}$$ As in (38), $$\left\vert I_{j}\right\vert\leq 0.5\cdot n^{\alpha}\;\forall j=1, 2,\ldots, J_{\rho},\eqno{\hbox{(57)}}$$ and the total number of intervals needed to describe probabilities up to $1/n^{\alpha}$ is $$J_{\rho}=\left\lceil (\rho+\alpha)\log n\right\rceil.\eqno{\hbox{(58)}}$$ As in the proof of Corollary 1, we use $O\left(n^{\alpha}\log n\right)$ bits to describe and quantize probabilities greater than $1/n^{\alpha}$. Similarly to (39), $\mmb{\xi}_{m}$ is defined as $$\eqalignno{\mmb{\xi}_{m}=&\,\left(\xi_{1},\xi_{2},\ldots\right)\cr =&\,\left({{1}\over{n^{\rho+2\alpha}}},{{1}\over{n^{\rho+2\alpha}}}+{{2}\over{n^{\rho+3\alpha}}},\ldots,{{2}\over{n^{\rho+2\alpha}}},\right.\cr &\quad\left.{{2}\over{n^{\rho+2\alpha}}}+{{4}\over{n^{\rho+3\alpha}}},\ldots\right).&{\hbox{(59)}}}$$ The cardinality of $\mmb{\xi}_{m}$ is thus $$B_{\rho}\triangleq\vert\mmb{\xi}_{m}\vert\leq0.5\cdot n^{\alpha}\left\lceil (\rho+\alpha)\log n\right\rceil.\eqno{\hbox{(60)}}$$
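The grid construction (55)–(59) is easy to instantiate numerically; note that $n^{\beta}=2$ for $\beta=1/\log_{2}n$, so the intervals are dyadic. A small sketch, with a function name of our own choosing:

```python
import math

def xi_grid(n, rho, alpha):
    """Build the grid xi_m of (59): intervals I_j = [2^(j-1), 2^j) / n^(rho+2a)
    per (55), spacing 2^j / n^(rho+3a) per (56), J_rho intervals per (58)."""
    J = math.ceil((rho + alpha) * math.log2(n))        # number of intervals (58)
    grid = []
    for j in range(1, J + 1):
        left = 2 ** (j - 1) / n ** (rho + 2 * alpha)   # start of I_j, (55)
        delta = 2 ** j / n ** (rho + 3 * alpha)        # spacing in I_j, (56)
        npts = round(left / delta)                     # = 0.5 * n**alpha points, (57)
        for t in range(npts):
            p = left + t * delta
            if p <= 1 / n ** alpha:   # larger probabilities are coded separately
                grid.append(p)
    return grid
```

For example, with $n=1024$, $\rho=1$, and $\alpha=1/2$, each interval carries $0.5\,n^{\alpha}=16$ points, matching (57), and the total grid size meets the bound (60).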

An $m$th order quantized version $\mmb{\theta}^{\prime}_{m}$ of $\mmb{\theta}$ is obtained by quantizing $\theta_{i}\leq 1/n^{\alpha}$, $i=2,3,\ldots,m$ onto $\mmb{\xi}_{m}$, such that $\theta^{\prime}_{i}\in\mmb{\xi}_{m}$ for these values of $i$. Then, the remaining cluster probability $S_{m}$ is quantized into $S^{\prime}_{m}\in\left[1/n, 2/n,\ldots, 1\right]$. The parameter $\theta^{\prime}_{1}$ is constrained by the quantization of the other parameters. Quantization is performed again in a manner that minimizes the cumulative error but retains monotonicity, and probabilities smaller than $\xi_{1}$ are offset by larger symbols as before.

Now, for any $m\geq 2$, let $\mmb{\varphi}_{m}$ be any monotonic probability vector of cardinality $m$ whose last $m-1$ components are quantized into $\mmb{\xi}_{m}$ (or coded separately in the upper interval $(1/n^{\alpha}, 1)$ if such values exist), and let $\sigma_{m}\in\left[1/n, 2/n,\ldots, 1\right]$ be a quantized value of the innovation symbol, such that $\sum_{i=1}^{m}\varphi_{i,m}+\sigma_{m}=1$, where $\varphi_{i,m}$ is the $i$th component of $\mmb{\varphi}_{m}$. If $m$, $\sigma_{m}$, and $\mmb{\varphi}_{m}$ are known, a given $x^{n}$ can be coded using $P\left(x^{n}\vert m,\sigma_{m},\mmb{\varphi}_{m}\right)$ as defined in (54), with $\sigma_{m}$ replacing $S_{m}$, and the $m$ components of $\mmb{\varphi}_{m}$ replacing the first $m$ components of $\mmb{\theta}$. However, in the universal setting, none of these parameters is known in advance. Furthermore, neither the symbols greater than $m$ nor their conditional ML probabilities are known in advance. Therefore, coding $x^{n}$ with probability $P\left(x^{n}\vert m,\sigma_{m},\mmb{\varphi}_{m}\right)$ incurs additional universality costs for describing these parameters, consisting of the following five components: 1) $m$ is described using Elias' representation with at most $1+\rho\log n+2\log (1+\rho\log n)$ bits. 2) The value of $\sigma_{m}$ in its quantization grid is coded using $\log n$ bits. 3) The $m$ components of $\mmb{\varphi}_{m}$ require $L_{R}\left(\mmb{\varphi}_{m}\right)$ bits. 4) The number $c_{x}(x>m)$ of distinct letters in $x^{n}$ greater than $m$ is coded using $\log n$ bits. 5) Each letter $i>m$ in $x^{n}$ is coded. Elias' coding for the integers, using $1+\log i+2\log (1+\log i)$ bits, can be used, but to simplify the derivation we can also use the code, also presented in [8], that uses no more than $1+2\log i$ bits to describe $i$.
In addition, at most $\log n$ bits are required for describing $n_{x}(i)$ in $x^{n}$. For $n\rightarrow\infty$, $m\gg 1$, and $\varepsilon>0$ arbitrarily small, this yields a total cost of $$\eqalignno{&\displaystyle L\left(x^{n}\vert m,\sigma_{m},\mmb{\varphi}_{m}\right)\cr &\!\!\quad\leq-\log P\left(x^{n}\vert m,\sigma_{m},\mmb{\varphi}_{m}\right)+L_{R}\left(\mmb{\varphi}_{m}\right)+\cr &\qquad [(1+\varepsilon)\rho+c_{x}(x>m)+2]\log n\cr &\qquad+c_{x}(x>m)+2\sum_{i>m, i\in x^{n}}\log i&{\hbox{(61)}}}$$ where we assume $m$ is large enough to bound the cost of describing $m$ by $(1+\varepsilon)\rho\log n$.
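For concreteness, the side-information components entering (61) can be tallied directly. The accounting helper below is our own sketch; it uses the $1+2\log i$ integer code mentioned above and omits the $L_{R}(\mmb{\varphi}_{m})$ term, which is bounded separately:

```python
import math

def universality_cost_bits(x, m, n):
    """Tally the side-information bits added on top of
    -log2 P(x^n | m, sigma_m, phi_m) in (61), excluding L_R(phi_m).
    Uses the simple 1 + 2*log2(i) integer code from [8] for rare letters."""
    logn = math.log2(n)
    rho = math.log2(m) / logn                              # m = n^rho
    bits = 1 + rho * logn + 2 * math.log2(1 + rho * logn)  # Elias code for m
    bits += logn                                           # sigma_m on its 1/n grid
    bits += logn                                           # c_x(x > m), distinct rare letters
    for i in set(sym for sym in x if sym > m):
        bits += 1 + 2 * math.log2(i)                       # value of rare letter i
        bits += logn                                       # its count n_x(i)
    return bits
```

Each additional distinct rare letter $i$ thus adds $1+2\log i+\log n$ bits, the per-letter terms collected in (61).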

The description cost of $\mmb{\varphi}_{m}$ for $m=o(n^{1/3})$ is bounded by $$L_{R}\left(\mmb{\varphi}_{m}\right)\leq\left(1+\varepsilon\right){{m-1}\over{2}}\log {{n}\over{m^{3}}}\eqno{\hbox{(62)}}$$ using (23) with $m$ replacing $k$. The $(\log n)^{2}$ factor in (23) can be absorbed in $\varepsilon$ since we limit $m$ to $o(n^{1/3})$, unlike the derivation in (23). For larger values of $m$, we describe the symbol probabilities of $\mmb{\varphi}_{m}$ in the grid $\mmb{\xi}_{m}$ in a similar manner to the description of $O(n)$ symbol probabilities in the grid $\mmb{\eta}$ in the proof of Corollary 1. Similarly to (42), we have $$\eqalignno{& L_{R}(\mmb{\varphi}_{m})\cr &\!\!\quad\leq B_{\rho}+B_{\rho}\log {{n^{\rho}+B_{\rho}}\over{B_{\rho}}}+2B_{\rho}\log \log {{n^{\rho}+B_{\rho}}\over{B_{\rho}}}+O\left(B_{\rho}\right)\cr &\!\!\quad\buildrel{(a)}\over{\leq}{{\left(1+\varepsilon\right)}\over{2}}\left(\rho+\alpha\right)\left(\rho+\varepsilon-\alpha\right)(\log n)^{2}n^{\alpha}&{\hbox{(63)}}}$$ where the term $O (B_{\rho})$ absorbs the cost of probabilities larger than $1/n^{\alpha}$. To obtain inequality $(a)$, we first multiply $n^{\rho}$ by $n^{\varepsilon}$ in the numerator of the argument of the logarithm. This is only necessary for $\rho\rightarrow\alpha$, to guarantee that $n^{\rho+\varepsilon}\gg B_{\rho}$. Substituting the bound on $B_{\rho}$ from (60) and absorbing second-order terms in the leading $\varepsilon$ then yields the bound.

A sequence $x^{n}$ can now be coded using the universal parameters that minimize the sequence description length, i.e., $$\eqalignno{& L^{\ast}\left(x^{n}\right)\cr &\!\!\quad\triangleq\min_{m^{\prime}\geq 2}\min_{\sigma_{m^{\prime}}\in \left[{{1}\over{n}},{{2}\over{n}},\ldots, 1\right]}\;\min_{\mmb{\varphi}_{m^{\prime}} :\varphi_{i}\in\mmb{\xi}_{m^{\prime}}, i\geq2}L\left(x^{n}\vert m^{\prime}, \sigma_{m^{\prime}},\mmb{\varphi}_{m^{\prime}}\right)\cr &\!\!\quad\leq L\left(x^{n}\vert m, S^{\prime}_{m},\mmb{\theta}^{\prime}_{m}\right)&{\hbox{(64)}}}$$ where the minimization over $\varphi_{i}$ also includes values larger than $1/n^{\alpha}$, using their designated description. The values $\mmb{\theta}^{\prime}_{m}$ and $S^{\prime}_{m}$ are the true source parameters quantized in the manner described above, and the inequality holds for every $m$. The minimization over $m^{\prime}$ need only be performed up to the maximal symbol that occurs in $x^{n}$.
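Schematically, the outer level of the minimization in (64) reduces to trying every cut-off $m'$ up to the maximal symbol in $x^{n}$ and keeping the shortest description. In the sketch below, `length_fn` is a placeholder of ours standing in for the inner minimization over $\sigma_{m'}$ and $\mmb{\varphi}_{m'}$:

```python
def best_description(x, length_fn):
    """Outer minimization of (64): try every candidate cut-off m' up to the
    maximal symbol occurring in x^n and keep the shortest description.
    length_fn(x, m) abstracts the inner minimization over sigma and phi."""
    best_m, best_len = None, float("inf")
    for m in range(2, max(x) + 1):
        L = length_fn(x, m)
        if L < best_len:
            best_m, best_len = m, L
    return best_m, best_len
```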

Following (61)–(64), up to negligible integer length constraints, the average redundancy using $L^{\ast}(\cdot)$ is bounded, for every $m\geq 2$, by $$\eqalignno{& n R_{n}\left(L^{\ast},\mmb{\theta}\right)\cr &\quad=E_{\theta}\left[L^{\ast}\left(X^{n}\right)+\log P_{\theta}\left(X^{n}\right)\right]\cr &\quad\buildrel{(a)}\over{\leq}E_{\theta}\left[L\left(X^{n}\;\vert\;m, S^{\prime}_{m},\mmb{\theta}^{\prime}_{m}\right)+\log P_{\theta}\left(X^{n}\right)\right]\cr &\quad\buildrel{(b)}\over{\leq}E_{\theta}\log {{P_{\theta}\left(X^{n}\right)}\over{P\left(X^{n}\;\vert\;m, S^{\prime}_{m},\mmb{\theta}^{\prime}_{m}\right)}}+L_{R}\left(\mmb{\theta}^{\prime}_{m}\right)+\cr &\;\qquad2\sum\limits_{i>m}P_{\theta}\left(i\in X^{n}\right)\log i+\cr &\;\qquad\left(1+\varepsilon\right) [E_{\theta}C_{x}\left(X>m\right)+\rho+2]\log n&{\hbox{(65)}}}$$ where $(a)$ follows from (64), and $(b)$ follows from averaging (61) with $\sigma_{m}=S^{\prime}_{m}$ and $\mmb{\varphi}_{m}=\mmb{\theta}^{\prime}_{m}$, with the average of $c_{x}(x>m)$ absorbed in the leading $\varepsilon$.

Expressing $P_{\theta}\left(x^{n}\right)$ as $$P_{\theta}\left(x^{n}\right)=\left[\prod_{i\leq m}\theta_{i}^{n_{x}(i)}\right]\cdot S_{m}^{n_{x}(x>m)}\cdot\prod_{i>m}\left({{\theta_{i}}\over{S_{m}}}\right)^{n_{x}(i)}\eqno{\hbox{(66)}}$$ and defining $\delta_{S}\triangleq S_{m}-S^{\prime}_{m}$, the first term of (65) is bounded, for the upper region of $m$, by $$\eqalignno{& E_{\theta}\log {{P_{\theta}\left(X^{n}\right)}\over{P\left(X^{n}\;\vert\;m, S^{\prime}_{m},\mmb{\theta}^{\prime}_{m}\right)}}\cr &\quad\leq E_{\theta}\left[\sum\limits_{i=1}^{m}N_{x}(i)\log {{\theta_{i}}\over{\theta^{\prime}_{i,m}}}+N_{x}\left(X>m\right)\log {{S_{m}}\over{S^{\prime}_{m}}}+\right.\cr &\qquad\left.\sum\limits_{i>m}N_{x}(i)\log {{\theta_{i}/S_{m}}\over{N_{x}(i)/N_{x}(X>m)}}\right]\cr &\quad\buildrel{(a)}\over{\leq}n\cdot\sum\limits_{i=1}^{m}\theta_{i}\log {{\theta_{i}}\over{\theta^{\prime}_{i,m}}}+n S_{m}\log {{S_{m}}\over{S^{\prime}_{m}}}\cr &\quad\buildrel{(b)}\over{\leq}n (\log e)\left[\left(\sum\limits_{i=1}^{m}{{\delta_{i}^{2}}\over{\theta^{\prime}_{i,m}}}\right)+{{\delta_{S}^{2}}\over{S^{\prime}_{m}}}\right]\cr &\quad\buildrel{(c)}\over{\leq}(\log e)\cdot{{n\cdot n^{\rho}}\over{n^{\rho+2\alpha}}}+2 (\log e) n^{1-\rho-4\alpha}\cdot\sum\limits_{j=1}^{J_{\rho}}k_{j}n^{j\beta}+\log e\cr &\quad\buildrel{(d)}\over{\leq}5(\log e) n^{1-2\alpha}+\log e&{\hbox{(67)}}}$$ where $(a)$ holds since, for every $x^{n}$, the conditional ML probability used for coding the letters greater than $m$ is at least the actual conditional probability assigned to them; hence, the third term is upper bounded by 0. Expectation is then taken over the remaining terms. Inequality $(b)$ is obtained similarly to (31), where quantization includes the first $m$ components of $\mmb{\theta}$ and the parameter $S_{m}$. Inequality $(c)$ then follows the same reasoning as step $(a)$ of (43).
The first term bounds the worst case, in which all $n^{\rho}$ symbols are quantized to $1/n^{\rho+2\alpha}$ with $\vert\delta_{i}\vert\leq 1/n^{\rho+2\alpha}$. The second term is obtained where $\theta^{\prime}_{i,m}\geq n^{(j-1)\beta}/n^{\rho+2\alpha}$ and $\vert\delta_{i}\vert\leq n^{j\beta}/n^{\rho+3\alpha}$ for $\theta_{i}\in I_{j}$, and $k_{j}=\vert\{i:\theta_{i}\in I_{j}\}\vert$ as before. Offsetting of probabilities smaller than $\xi_{1}$, if required, results, similarly to (27), in $\vert\delta_{i}\vert\leq n^{j\beta}/n^{\rho+3\alpha}+\gamma^{\prime}_{i}\theta^{\prime}_{i}/n^{2\alpha}$, where $\gamma^{\prime}_{i}>0$ is some constant, and adds negligibly to both terms. The last term of $(c)$ follows since $S^{\prime}_{m}\geq 1/n$ and $\vert\delta_{S}\vert\leq 1/n$. Finally, $(d)$ is obtained similarly to step $(b)$ of (43), where, as in (33), $\sum k_{j}n^{j\beta}\leq 2n^{\rho+2\alpha}$. For $m=o(n^{1/3})$, the same initial steps up to step $(b)$ in (67) are applied. The remaining steps in (31) are then applied with $m$ replacing $k$, yielding a total quantization cost of $5(1+o(1))(\log e)m+\log e$.

To bound the third and fourth terms of (65), note that $$P_{\theta}\left(i\in X^{n}\right)=1-\left(1-\theta_{i}\right)^{n}\leq n\theta_{i}.\eqno{\hbox{(68)}}$$ Similarly, $$E_{\theta}C_{x}(X>m)=\sum_{i>m}P_{\theta}\left(i\in X^{n}\right)\leq n S_{m}.\eqno{\hbox{(69)}}$$ Combining the dominant terms of the third and fourth terms of (65), we have $$\eqalignno{& 2\sum\limits_{i>m}P_{\theta}\left(i\in X^{n}\right)\log i+(1+\varepsilon)E_{\theta}C_{x}(X>m)\log n\cr &\!\!\quad\buildrel{(a)}\over{=}\sum\limits_{i>m}P_{\theta}\left(i\in X^{n}\right)\left[2\log i+(1+\varepsilon)\log n\right]\cr &\!\!\quad\buildrel{(b)}\over{\leq}\left(2+{{1+\varepsilon}\over{\rho}}\right)\sum\limits_{i>m}P_{\theta}\left(i\in X^{n}\right)\log i\cr &\!\!\quad\buildrel{(c)}\over{\leq}\left(2+{{1+\varepsilon}\over{\rho}}\right)n\sum\limits_{i>m}\theta_{i}\log i&{\hbox{(70)}}}$$ where $(a)$ is because $E_{\theta}C_{x}(X>m)=\sum_{i>m}P_{\theta}\left(i\in X^{n}\right)$, $(b)$ is because for $i>m=n^{\rho}$, $\log i>\rho\log n$, and $(c)$ follows from (68). Given $\rho>\varepsilon$ for an arbitrary fixed $\varepsilon>0$, the resulting coefficient above is upper bounded by some constant $\kappa$.
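The elementary bound (68) can be checked numerically; the helper below is illustrative only:

```python
def occurrence_prob(theta_i, n):
    """P_theta(i in X^n) = 1 - (1 - theta_i)^n and its union bound
    n * theta_i from (68)."""
    exact = 1 - (1 - theta_i) ** n
    return exact, n * theta_i
```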

Summing up the contributions of the terms of (65) from (31), (62), and (70), and absorbing second-order terms in a leading $\varepsilon^{\prime}$, we obtain that for $m=o(n^{1/3})$, $$n R_{n}\left(L^{\ast},\mmb{\theta}\right)\leq\left(1+\varepsilon^{\prime}\right){{m-1}\over{2}}\log {{n}\over{m^{3}}}+\kappa n\sum_{i>m}\theta_{i}\log i.\eqno{\hbox{(71)}}$$ For the second region, substituting $\alpha=1/3$ and summing up the contributions of (67), (63), and (70) to (65), absorbing second-order terms in $\varepsilon^{\prime}$, we obtain $$\eqalignno{&\displaystyle n R_{n}\left(L^{\ast},\mmb{\theta}\right)\cr &\!\!\quad\leq (1+\varepsilon^{\prime}){{1}\over{2}}\left(\rho+{{1}\over{3}}\right)\left(\rho+\varepsilon^{\prime}-{{1}\over{3}}\right)\left(\log n\right)^{2}n^{1/3}+\cr &\qquad\kappa n\sum_{i>m}\theta_{i}\log i.&{\hbox{(72)}}}$$ Using the value of $\alpha$ in (45) instead would yield a tighter expression of $\Theta\left(n^{1/3}(\log n)^{4/3}\right)$ for the first term, and then the value of $\nu$ can be optimized to minimize the leading coefficient. Since (71) and (72) hold for every $m>n^{\varepsilon}$, there exists $m^{\ast}$ for which the minimal bound is obtained. To bound the redundancy, we choose this $m^{\ast}$. Now, if the condition in (48) holds, then the second term in (71) and (72) is negligible w.r.t. the first term. Absorbing it in a leading $\varepsilon$ and normalizing by $n$ yields the upper bound of (49), concluding the proof of Part I of Theorem 5.

For Part II of Theorem 5, we consider the bound of the second region in (72). If there exists $\rho^{\ast}=o\left(n^{1/3}/(\log n)\right)$ for which the condition in (50) holds, then both terms of (72) are of $o(n)$, yielding a total redundancy per symbol of $o(1)$. The proof of Theorem 5 is concluded. $\hfill \square$

Now, consider the upper region in (65) with parameters $\alpha$ and $\rho$ taking any valid value. (The code leading to the bound of the upper region can be applied even if the actual effective alphabet size is in the lower region.) We can sum up the contributions of (67), (63), and (70) to (65), absorbing second-order terms in $\varepsilon$. Equation (63) is valid without the middle $\varepsilon$ term as long as $\rho\geq\alpha+\varepsilon$. Since, in the upper region of $m$, $i\geq m$ is large enough, Elias' code for the integers can be used, costing $(1+\varepsilon)\log i$ bits to code $i$, where $\varepsilon>0$ can be made arbitrarily small. Hence, the leading coefficient of the bound in (70) can be replaced by $(1+\varepsilon)(1+1/\rho)$. This yields the expression bounding the redundancy in (52). This expression applies to every valid choice of $\alpha$ and $\rho$, including the choice that minimizes the expression. Thus, the proof of Theorem 6 is concluded. $\hfill \square$

To prove Corollary 2, we use Wyner's inequality [41], which implies that, for a finite-entropy monotonic distribution, $$\sum_{i\geq 1}\theta_{i}\log i=E_{\theta}\left[\log X\right]\leq H_{\theta}\left[X\right].\eqno{\hbox{(73)}}$$ Fix an arbitrarily small $\varepsilon>0$. Since the sum on the left-hand side of (73) is finite if $H_{\theta}[X]$ is finite, there must exist some $n_{0}$ such that $\sum_{i>n_{0}}\theta_{i}\log i<\varepsilon$. Let $n>n_{0}$; then, for $m^{\ast}=n$ and $\rho^{\ast}=1$, using Theorem 6 with any $\alpha\in (0, 1)$, we obtain $R_{n}\left(L^{\ast},\mmb{\theta}\right)<\kappa\varepsilon$ for some fixed constant $\kappa>0$. The proof of Corollary 2 is concluded. $\hfill \blacksquare$
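Wyner's inequality (73) can be checked numerically for any finite-entropy monotonic pmf; the sketch below (truncated support, helper name of our choosing) is illustrative:

```python
import math

def expected_log_vs_entropy(theta):
    """Both sides of Wyner's inequality (73) for a monotonic pmf over
    {1, 2, ...}: E[log2 X] <= H(X).  theta[0] is P(X = 1)."""
    e_log = sum(p * math.log2(i + 1) for i, p in enumerate(theta))
    ent = -sum(p * math.log2(p) for p in theta if p > 0)
    return e_log, ent
```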

### B. Examples

We demonstrate the use of the bounds of Theorems 5 and 6 with three typical distributions over the integers. We specifically show that the redundancy rate of $O\left(n^{1/3+\varepsilon}\right)$ bits overall is achievable when coding sequences generated by many of the typical monotonic distributions and that, in fact, for many distributions, even faster convergence rates are achievable with the codes proposed. The examples show that the assumption reflected in conditions (48) and (50), namely that very few large symbols appear in $x^{n}$, is very practical. Specifically, in the phone book example, there may be many rare names, but only very few of them may occur in a certain city. The more common names can constitute most of a possible phone book sequence.

#### 1) Zipf Distribution

Consider the monotonic distributions over the integers [42], [43] of the form $$\theta_{i}={{a}\over{i^{1+\gamma}}},\;i=1,2,\ldots,\eqno{\hbox{(74)}}$$ where $\gamma>0$, and $a$ is a normalization coefficient that guarantees that the probabilities over all integers sum to 1. Approximating summation by integration, we can show that $$\eqalignno{S_{m}\leq &\,{{a}\over{\gamma m^{\gamma}}}&{\hbox{(75)}}\cr \sum_{i>m}\theta_{i}\log i\leq &\,{{a}\over{\ln 2}}\left[{{\ln m}\over{\gamma m^{\gamma}}}+{{1}\over{\gamma^{2}m^{\gamma}}}\right]\cr =&\,\left(1+\varepsilon\right){{a\log m}\over{\gamma m^{\gamma}}}&{\hbox{(76)}}}$$ where the last equality holds for $m\rightarrow\infty$ with some fixed $\varepsilon>0$. For $m=n^{\rho}$ and fixed $\rho$, the sum in (48) is thus $O\left(n^{1-\rho\gamma}\log n\right)$, which is $o\left(n^{1/3}(\log n)^{2}\right)$ (and even $o\left(n^{1/3}(\log n)^{4/3}\right)$ if the tighter form of the bound is considered) for every $\rho\geq 2/(3\gamma)$, thus satisfying the negligibility condition (48), at least relative to the second region of (47). As long as $\gamma\leq 2$ (slow decay), the minimal value of $\rho$ required to guarantee negligibility of the sum in (48) is greater than 1/3. Using Theorem 5, this implies that for $\gamma\leq 2$, the second (upper) region of the upper bound in (49) holds with the minimal choice of $\rho^{\ast}=2/(3\gamma)$. Plugging this value into the second region of (47) [i.e., into (49)] yields the upper bound shown below for this region. For $\gamma>2$, $2/(3\gamma)<1/3$. Hence, (48) holds for $m^{\ast}=o\left(n^{1/3}\right)$. This means that for the distribution in (74) with $\gamma>2$, the effective alphabet size is $o\left(n^{1/3}\right)$, and thus the achievable redundancy is in the first region of the bound of (49). Thus, even though the distribution is over an infinite alphabet, its compressibility behavior is similar to that of a distribution over a relatively small alphabet.
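The integral bounds (75) and (76) can be checked against truncated numeric sums; the helper below is our own sketch, with the truncation point `N` as an assumption:

```python
import math

def zipf_tail(a, gamma, m, N=10**5):
    """Truncated numeric tail sums for the Zipf-like pmf (74), compared
    with the integral bounds (75) and (76).  N truncates the infinite sums,
    which only lowers the numeric side of the inequalities."""
    s_m = sum(a / i ** (1 + gamma) for i in range(m + 1, N))
    tail_log = sum(a / i ** (1 + gamma) * math.log2(i) for i in range(m + 1, N))
    s_bound = a / (gamma * m ** gamma)                                   # (75)
    log_bound = (a / math.log(2)) * (math.log(m) / (gamma * m ** gamma)
                                     + 1 / (gamma ** 2 * m ** gamma))    # (76)
    return s_m, s_bound, tail_log, log_bound
```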
To find the exact redundancy rate, we balance the contributions of (62) and (70) in (65). As long as $1-\rho\gamma<\rho$, condition (48) holds, and the contribution of rare letters in (70) is negligible w.r.t. the other terms of the redundancy. Equality, implying $\rho^{\ast}=1/(1+\gamma)$, achieves the minimal redundancy rate. Thus, for $\gamma>2$, $$\eqalignno{& n R_{n}\left(L^{\ast},\mmb{\theta}\right)\cr &\enspace\buildrel{(a)}\over{\leq}\!\left(1+\varepsilon\right)\left[{{a (2\rho^{\ast}\!+1)}\over{\gamma}}n^{1-\rho^{\ast}\gamma}\log n\!+{{n^{\rho^{\ast}}}\over{2}}\left(1-3\rho^{\ast}\right)\log n\right]\cr &\enspace\buildrel{(b)}\over{=}\left(1+\varepsilon\right)\left({{a{{3+\gamma}\over{1+\gamma}}}\over{\gamma}}+{{1-{{3}\over{1+\gamma}}}\over{2}}\right)n^{{1}\over{1+\gamma}}\log n&{\hbox{(77)}}}$$ where the first term in $(a)$ follows from the bounds in (70) and (76), with $m=n^{\rho^{\ast}}$, and the second term from that in (62), and $(b)$ follows from $\rho^{\ast}=1/(1+\gamma)$. Note that for a fixed $\rho^{\ast}$, the factor 3 in the first term can be reduced to 2 with Elias' coding for the integers. The results described are summarized in the following corollary.

#### Corollary 3

Let $\mmb{\theta}\in{\cal M}$ be defined in (74). Then, there exists a universal code with length function $L^{\ast}(\cdot)$, having only the prior knowledge that $\mmb{\theta}\in{\cal M}$, that achieves universal coding redundancy $$\eqalignno{&\displaystyle R_{n}\left(L^{\ast},\mmb{\theta}\right)\leq&{\hbox{(78)}}\cr &\quad\cases{\left(1+\varepsilon\right){{1}\over{18}}\left(1+{{2}\over{\gamma}}\right)\left({{2}\over{\gamma}}+\varepsilon-1\right){{n^{1/3}(\log n)^{2}}\over{n}},&\gamma\leq 2,\cr \left(1+\varepsilon\right)\left({{a{{3+\gamma}\over{1+\gamma}}}\over{\gamma}}+{{1-{{3}\over{1+\gamma}}}\over{2}}\right){{n^{{1}\over{1+\gamma}}\log n}\over{n}}, &\gamma>2.}}$$ Corollary 3 gives the redundancy rates for all distributions defined in (74). With a tighter form of the bound (choosing $\alpha$ as in (45) and applying Theorem 6), a tighter bound of $\Theta\left(n^{1/3}(\log n)^{4/3}/n\right)$ can be obtained for the first region. Using the looser bound of Corollary 3, if, for example, $\gamma=1$, the redundancy is $O\left(n^{1/3}(\log n)^{2}\right)$ bits overall with coefficient 1/6. For $\gamma=3$, $O(n^{1/4}\log n)$ bits are required. For faster decays (greater $\gamma$), even smaller redundancy rates are achievable.

#### 2) Geometric Distributions

Geometric distributions, given by $$\theta_{i}=p\left(1-p\right)^{i-1};\;i=1,2,\ldots,\eqno{\hbox{(79)}}$$ where $0<p<1$, decay even faster than the Zipf distribution in (74). Thus, their effective alphabet sizes are even smaller. This implies that a universal code can have even smaller redundancy than that presented in Corollary 3 when coding sequences generated by a geometric distribution (even if this is unknown in advance, and the only prior knowledge is that $\mmb{\theta}\in{\cal M}$). Choosing $m=\ell\cdot\log n$, the contribution to (65) of the low probability symbols in (70) can be upper bounded by $$\eqalignno{& 2n\sum\limits_{i>m}\theta_{i}\left(\log i+\log n\right)&{\hbox{(80)}}\cr &\quad\buildrel{(a)}\over{\leq}2n (1-p)^{m}\log n+O\left(n (1-p)^{m}\log m\right)\cr &\quad\buildrel{(b)}\over{=}2 n^{1+\ell\log (1-p)}(\log n)+O\left(n^{1+\ell\log (1-p)}\log \log n\right)}$$ where $(a)$ follows from computing $S_{m}$ using a geometric series and bounding the second term, and $(b)$ follows from substituting $m=\ell\log n$ and representing $(1-p)^{\ell\log n}$ as $n^{\ell\log (1-p)}$. As long as $\ell\geq 1/(-\log (1-p))$, the expression in (80) is $O(\log n)$, and thus negligible w.r.t. the redundancy upper bound of (49) with $m^{\ast}=\ell^{\ast}\log n=(\log n)/(-\log (1-p))$. Substituting this $m^{\ast}$ in (49), we obtain the following corollary.
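For the geometric case, the effective alphabet cut-off and the tail mass it leaves behind can be computed directly; the helper below is an illustrative sketch:

```python
import math

def geometric_cutoff(p, n):
    """Effective alphabet size m* = log n / (-log(1-p)) for the geometric
    pmf (79), and the tail mass S_m = (1-p)^m that remains above it."""
    m = math.ceil(math.log2(n) / -math.log2(1 - p))
    s_m = (1 - p) ** m
    return m, s_m
```

With this choice, $nS_{m}=O(1)$, so the expected number of occurrences of symbols above the cut-off stays bounded, which is exactly why (80) is $O(\log n)$.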

#### Corollary 4

Let $\mmb{\theta}\in{\cal M}$ be a geometric distribution defined in (79). Then, there exists a universal code with length function $L^{\ast}(\cdot)$, having only the prior knowledge that $\mmb{\theta}\in{\cal M}$, that achieves universal coding redundancy $$R_{n}\left(L^{\ast},\mmb{\theta}\right)\leq{{1+\varepsilon}\over{-2\log (1-p)}}\cdot{{(\log n)^{2}}\over{n}}.\eqno{\hbox{(81)}}$$ Corollary 4 shows that if $\mmb{\theta}$ parameterizes a geometric distribution, sequences governed by $\mmb{\theta}$ can be coded with an average universal coding redundancy of $O\left((\log n)^{2}\right)$ bits. Their effective alphabet size is $O(\log n)$, implying that larger symbols are very unlikely to occur. For example, for $p=0.5$, the effective alphabet size is $\log n$, and $0.5 (\log n)^{2}$ bits are required for a universal code. For $p=0.75$, the effective alphabet size is $(\log n)/2$, and $(\log n)^{2}/4$ bits are required by a universal code.

#### 3) Slow Decaying Distributions Over the Integers

Up to now, we considered fast decaying distributions, which all achieved the $O(n^{1/3+\varepsilon}/n)$ redundancy rate. We now consider a slowly decaying monotonic distribution over the integers, given by $$\theta_{i}={{a}\over{i\left(\log i\right)^{2+\gamma}}},\;i=2,3,\ldots\eqno{\hbox{(82)}}$$ where $\gamma>0$ and $a$ is a normalizing factor (see, e.g., [14], [32], [33]). This distribution has finite entropy only if $\gamma>0$ (but is a valid infinite-entropy distribution for $\gamma>-1$). Unlike the previous distributions, we need to use Theorem 6 to bound the redundancy for coding sequences generated by this distribution. Approximating the sum with an integral, the order of the third term of (52) is $$n\sum_{i>m}\theta_{i}\log i=O\left({{n}\over{(\log m)^{\gamma}}}\right).\eqno{\hbox{(83)}}$$ In order to minimize the redundancy bound of (52), we define $\rho=n^{\ell}$. For the minimum rate, all terms of (52) must be balanced. To achieve that, we must have $$\alpha+2\ell=1-2\alpha=1-\gamma\ell.\eqno{\hbox{(84)}}$$ The solution is $\alpha=\gamma/(4+3\gamma)$ and $\ell=2/(4+3\gamma)$. Substituting these values in the expression of (52), with $\rho=n^{\ell}$, results in the first term of (52) dominating, and yields the following corollary.
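The closed-form solution of the balance equations (84) is easy to verify, along with the fact that the resulting redundancy exponent in (85) equals $1-2\alpha=(\gamma+4)/(3\gamma+4)$; a small sketch with a helper name of our own:

```python
def balance_parameters(gamma):
    """Solve the balance equations (84):
    alpha + 2*l = 1 - 2*alpha = 1 - gamma*l,
    giving alpha = gamma/(4 + 3*gamma) and l = 2/(4 + 3*gamma)."""
    alpha = gamma / (4 + 3 * gamma)
    ell = 2 / (4 + 3 * gamma)
    return alpha, ell
```

For $\gamma=1$ this gives $\alpha=1/7$ and $\ell=2/7$, matching the exponent $5/7$ quoted after Corollary 5.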

#### Corollary 5

Let $\mmb{\theta}\in{\cal M}$ be defined in (82) with $\gamma>0$. Then, there exists a universal code with length function $L^{\ast}(\cdot)$, having only the prior knowledge that $\mmb{\theta}\in{\cal M}$, that achieves universal coding redundancy $$R_{n}\left(L^{\ast},\mmb{\theta}\right)\leq\left(1+\varepsilon\right){{n^{{\gamma+4}\over{3\gamma+4}}(\log n)^{2}}\over{2n}}.\eqno{\hbox{(85)}}$$ In a similar manner to the Zipf distribution, the tighter form of the general upper bound can be used, reducing the $(\log n)^{2}$ term to $\Theta\left((\log n)^{4/3}\right)$ (with a different leading coefficient). Due to the slow decay rate of the distribution in (82), the effective alphabet size is much greater here. For $\gamma=1$, for example, it is $n^{n^{2/7}}$. This implies that very large symbols are likely to appear in $x^{n}$. As $\gamma$ increases, though, the effective alphabet size decreases, and as $\gamma\rightarrow\infty$, $m\rightarrow n$. The redundancy rate increases due to the slow decay. For $\gamma\geq 1$, it is $O\left(n^{5/7}(\log n)^{2}/n\right)$. As $\gamma\rightarrow\infty$, since the distribution tends to decay faster, the redundancy rate tends to the finite alphabet rate of $O\left(n^{1/3}(\log n)^{2}/n\right)$. However, as the decay rate becomes slower ($\gamma\rightarrow 0$), a nondiminishing redundancy rate is approached. Note that the proof of Theorem 6 does not limit the distribution to a finite-entropy one. Therefore, the bound of (85) applies, in fact, also to $-1<\gamma\leq 0$. However, for $\gamma\leq 0$, the per-symbol redundancy is no longer diminishing.

SECTION VI

## INDIVIDUAL SEQUENCES

In this section, we show that if we have side information of the monotonicity of the distribution governing an individual sequence (i.e., of its ML distribution), we can universally compress the individual sequence as well as (and even better than) in the average case. We next show that in this case the lower bound of Theorem 3 is asymptotically achieved. Moreover, the upper bound derived here for the upper region is tighter than the bounds obtained in Theorem 4 and Corollary 1 for the average case. The reason is that, with the additional side information that ${\hat{\mmb{\theta}}}\in{\cal M}$, we can restrict the smallest nonzero symbol probability to be at least $1/n$. This is not the case in the average setting, where symbols from a long tail of the distribution can have unordered occurrences in a given sequence. For a specific sequence, we can have ${\hat{\mmb{\theta}}}\not\in{\cal M}$, but we still need to describe ${\hat{\mmb{\theta}}}_{\cal M}\in{\cal M}$ for that sequence. The distribution ${\hat{\mmb{\theta}}}_{\cal M}$ may then have probability parameters smaller than $1/n$ for symbols $i\not\in x^{n}$ for which there exists $j\in x^{n}$ with $j>i$ (recalling the assumption that for $\mmb{\theta}\in{\cal M}$, we must have $\theta_{i}\geq\theta_{j}$).

The side information assumed restricts the set of allowable sequences to those which obey the monotonicity, omitting all sequences for which ${\hat{\mmb{\theta}}}\ne{\hat{\mmb{\theta}}}_{\cal M}$ from the set considered. This means that the class considered is smaller than the class considered for the lower bound in Theorem 3. However, in proving Theorem 3, all sequences that do not obey the monotonicity are excluded from the Shtarkov sum [step $(b)$ of (13)], essentially rendering the bound also as a bound on the class containing only sequences that obey the monotonicity requirement.

If one assumes some monotonicity on the symbol probabilities, but the observed sequence diverges from this assumption, the code proposed can still be used to describe the probabilities of the symbols that obey the monotonicity. An additional description is added as a prefix to the code to describe the number of symbols that do not obey the monotonicity, and then $O(\log n)$ bits are used for each such symbol to describe its occurrence count. If the maximal symbol in $x^{n}$ is ${\hat{k}}$, as long as $o({\hat{k}})$ symbols are out of order for ${\hat{k}}\leq n^{1/3}$ or $o (n^{1/3})$ symbols are out of order for greater ${\hat{k}}$, the additional cost of coding the symbols violating the monotonicity is negligible. The method described below can thus still be used. Moreover, it can be shown (see, e.g., [34]) that as long as the largest symbol is polynomial in $n$ and there are not too many symbols larger than $n$, diminishing redundancy w.r.t. the monotonic ML probability ${\hat{\mmb{\theta}}}_{\cal M}$ can be achieved coding any such $x^{n}$. However, this result does not imply cheaper total description length than the one using the true ML ${\hat{\mmb{\theta}}}$ of $x^{n}$, as the loss in using ${\hat{\mmb{\theta}}}_{\cal M}$ instead of ${\hat{\mmb{\theta}}}$ may dominate over the redundancy savings.

Finally, the class of the distributions of all sequences with ${\hat{k}}$ symbols that obey the monotonicity is identical to the class of the distributions of all patterns with ${\hat{k}}$ indices for a given ${\hat{k}}$. The Shtarkov sum on the ML sequence probabilities is not equal in these cases because the pattern ML sequence probability is the sum over the probabilities of all permutations of these sequences. However, the method used for describing the ML i.i.d. distribution within this class can be used to derive tight bounds for coding patterns. (Bounding the quantization cost for patterns, on the other hand, is more complicated.) This was not the case when addressing the average case, as the description cost in the average case for monotonic distributions is more complicated due to the use of the monotonic ML over all sequences, including those not obeying the monotonicity. This section is concluded with the theorem that upper bounds the individual sequence redundancy and its proof.

#### Theorem 7

Fix an arbitrarily small $\varepsilon>0$, and let $n\rightarrow\infty$. Let $x^{n}$ be a sequence for which ${\hat{\mmb{\theta}}}\in{\cal M}$, i.e., ${\hat{\theta}}_{1}\geq{\hat{\theta}}_{2}\geq\ldots$. Let $k={\hat{k}}$ be the number of letters occurring in $x^{n}$. Then, there exists a code $L^{\ast}\left(\cdot\right)$ that achieves individual sequence redundancy w.r.t. ${\hat{\mmb{\theta}}}_{\cal M}={\hat{\mmb{\theta}}}$ for $x^{n}$ which is upper bounded by $$\eqalignno{&\displaystyle{\hat{R}}_{n}\left(L^{\ast}, x^{n}\right)\leq&{\hbox{(86)}}\cr &\quad\cases{\left(1+\varepsilon\right){{k-1}\over{2n}}\log {{n}\over{k^{3}}}, &k=o\left(n^{1/3}\right)\cr \left(1+\varepsilon\right){{k-1}\over{2n}}\log {{n\left(\log n\right)^{2}}\over{k^{3}}}, &k\leq n^{1/3}\cr {{\left(0.79\log {{k}\over{n^{1/3-\varepsilon}}}+0.14\log n\right) (n\log n)^{1/3}}\over{n}}, &n^{1/3}<k=o(n)\cr {{0.4\left(\log n\right)^{4/3}n^{1/3}}\over{n}}, &k=O(n).}}$$ Note that by the monotonicity constraint, the number of symbols ${\hat{k}}$ occurring in $x^{n}$ also equals the maximal symbol in $x^{n}$. Since, in the individual sequence case, this maximal symbol defines the class considered, and to be consistent with Theorem 3, we use $k$ to characterize the alphabet size of a given sequence. Since ${\hat{\mmb{\theta}}}$ is monotonic, ${\hat{\mmb{\theta}}}_{\cal M}={\hat{\mmb{\theta}}}$.

##### Proof [Theorem 7]

The proof builds on those of Theorem 4 and Corollary 1. Both regions of those proofs apply here, except that instead of quantizing $\mmb{\theta}$ to $\mmb{\theta}^{\prime}$, we quantize ${\hat{\mmb{\theta}}}$ to ${\hat{\mmb{\theta}}}^{\prime}$ in a similar manner, and there is no need to average over all sequences. Instead of using an arbitrary ${\hat{\mmb{\varphi}}}$ to code $x^{n}$, we can use ${\hat{\mmb{\theta}}}^{\prime}$ without any additional optimization, where $\log n$ bits describe $k$. The first two regions of (86) are then proved as the corresponding regions in Theorem 4.

To prove the bounds of the upper regions, which are tighter than those of Corollary 1, we make several modifications, now using three major intervals (as in [35]) instead of two. To describe ${\hat{\mmb{\theta}}}^{\prime}$, fix a parameter $\alpha$, and describe the components of ${\hat{\mmb{\theta}}}^{\prime}$ separately over the three intervals $(1/n, n^{\alpha}/n]$, $(n^{\alpha}/n, 1/n^{\alpha}]$, and $(1/n^{\alpha}, 1]$. For the bottom interval, use $n^{\alpha}\log n$ bits to describe all probability parameters in this interval: for each of the $n^{\alpha}$ points in the interval, use at most $\log n$ bits to describe the multiplicity of that value in ${\hat{\mmb{\theta}}}$. The top interval contains at most $n^{\alpha}$ probability parameters; use at most $\log n$ bits to describe the value of each. For both of these intervals, no quantization is necessary, and the corresponding components of ${\hat{\mmb{\theta}}}^{\prime}$ are identical to those of ${\hat{\mmb{\theta}}}$.

As in [35], the middle interval is the one in which the parameters need to be quantized. Partition this interval into $J^{\prime}_{2}\triangleq J^{+}_{2}-J^{-}_{2}$ smaller intervals, in a similar manner to (17), $$I_{j}=\left[{{n^{(j-1)\beta}}\over{n}},{{n^{j\beta}}\over{n}}\right),\quad J^{-}_{2}\leq j\leq J^{+}_{2}\eqno{\hbox{(87)}}$$ where $J^{-}_{2}$ and $J^{+}_{2}$ coincide with the end points of the large middle interval. This results in $J^{\prime}_{2}\leq (1-2\alpha)\log n$. Partition each interval into grid points with spacing $$\Delta^{\prime(2)}_{j}={{n^{j\beta}}\over{n^{1+\alpha}}}.\eqno{\hbox{(88)}}$$ Similarly to (40), this yields $$B^{\prime}_{2}\leq 0.5 (1-2\alpha) n^{\alpha}\log n\eqno{\hbox{(89)}}$$ grid points. Following a derivation similar to that in (42), the description cost of ${\hat{\mmb{\theta}}}^{\prime}$ is bounded by $$L_{R}({\hat{\mmb{\theta}}}^{\prime})\leq\cases{{{1+\varepsilon}\over{2}}(1-2\alpha)\left(\log n\right)\left(\log {{k}\over{n^{\alpha-\varepsilon}}}\right) n^{\alpha},&$n^{\alpha}<k\leq n^{1-\alpha}$\cr {{1+\varepsilon}\over{2}}(1-2\alpha)^{2}\left(\log n\right)^{2}n^{\alpha}, &$k>n^{1-\alpha}$}\eqno{\hbox{(90)}}$$ where the description cost of the upper and lower large intervals is absorbed in second-order terms, and the second $1-2\alpha$ factor in the upper region results from $k_{mid}\leq n^{1-\alpha}$ due to the lower limit $n^{\alpha-1}$ of the middle large interval.
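
The count in (89) can be sanity-checked directly. Taking $J^{\prime}_{2}=(1-2\alpha)\log n$ with equality corresponds to $n^{\beta}=2$ (an assumption made here for the check), so by (87) each small interval's right end point is twice its left end point, and with the spacing (88) each interval contributes $n^{\alpha}(1-1/2)=n^{\alpha}/2$ grid points. The sketch below tallies the points interval by interval and compares against the closed form $0.5(1-2\alpha)n^{\alpha}\log n$.

```python
import math

def middle_grid_points(n, alpha):
    """Count grid points over the middle interval (n^{alpha-1}, n^{-alpha}],
    partitioned into J2 = (1 - 2*alpha) * log2(n) dyadic small intervals
    (assumes n^beta = 2), each with spacing Delta'_j of (88).

    Returns (direct count, closed-form bound of (89))."""
    J2 = int(round((1 - 2 * alpha) * math.log2(n)))
    total = 0.0
    right = n ** (alpha - 1)            # left end of the first small interval
    for _ in range(J2):
        left, right = right, 2 * right  # I_j = [left, right), right = n^{j beta}/n
        spacing = right / n ** alpha    # Delta'_j = n^{j beta} / n^{1+alpha}
        total += (right - left) / spacing
    return total, 0.5 * (1 - 2 * alpha) * n ** alpha * math.log2(n)

counted, bound = middle_grid_points(2 ** 30, 1.0 / 3.0)
print(counted, bound)
```

With $n=2^{30}$ and $\alpha=1/3$, both the direct count and the closed form give $10\cdot n^{\alpha}/2=5120$ grid points.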

The number of symbols with parameters in the small interval $J^{-}_{2}$ is upper bounded by $k_{J^{-}_{2}}\leq n^{1-\alpha}$; then $k_{J^{-}_{2}+1}\leq (n^{1-\alpha}-k_{J^{-}_{2}})/2$, and so on. Similarly to (33), we have $$\sum_{j=J^{-}_{2}}^{J^{+}_{2}}k_{j}2^{j}\leq n^{1-\alpha}\cdot 2^{J^{-}_{2}-1}=n.\eqno{\hbox{(91)}}$$ Thus, following (43) and using (87) and (88), the quantization cost can be upper bounded by $$n (\log e)\sum_{i=1}^{k}{{\delta_{i}^{2}}\over{{\hat{\theta}}^{\prime}_{i}}}\leq{{2\log e}\over{n^{2\alpha}}}\sum_{j=J^{-}_{2}}^{J^{+}_{2}}k_{j}2^{j}=2 (\log e)n^{1-2\alpha}.\eqno{\hbox{(92)}}$$ There is thus a factor of 2 reduction over (43), because of the increased lower limit of the first point of quantized parameters.

Combining (90) and (92) for the second region of (90) yields $$n{\hat{R}}_{n}(L^{\ast}, x^{n})\leq{{1+\varepsilon}\over{2}}(1-2\alpha)^{2}n^{\alpha}(\log n)^{2}+2 (\log e) n^{1-2\alpha}.\eqno{\hbox{(93)}}$$ Choosing $\alpha$ as in (45) yields $$n{\hat{R}}_{n}\left(L^{\ast}, x^{n}\right)\leq\left(1+\varepsilon\right)\left({{\nu}\over{18}}+{{2\log e}\over{\nu^{2}}}\right) n^{1/3}(\log n)^{4/3}\eqno{\hbox{(94)}}$$ for this region. Taking $\nu=(72\log e)^{1/3}\approx 4.7$ minimizes (94), resulting in a coefficient of less than 0.4 in (94). Letting $n\rightarrow\infty$ and absorbing all second-order terms in the gap of the coefficient to 0.4 proves the last region of (86). Using the same value of $\nu$ for the resulting terms of the third region in a similar manner proves the third region. A slightly tighter bound can be obtained for the third region if the value of $\nu$ is optimized for the specific value of $k$. $\hfill \blacksquare$
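
The choice of $\nu$ can be checked directly: minimizing $f(\nu)=\nu/18+2\log e/\nu^{2}$ gives $f^{\prime}(\nu)=1/18-4\log e/\nu^{3}=0$, i.e., $\nu^{3}=72\log e$. A small numeric check of this calculus step (illustrative only; logs base 2):

```python
import math

LOG_E = math.log2(math.e)  # log e in bits, ~1.4427

def f(nu):
    # Coefficient of n^{1/3} (log n)^{4/3} in (94)
    return nu / 18 + 2 * LOG_E / nu ** 2

nu_star = (72 * LOG_E) ** (1.0 / 3.0)  # stationary point: nu^3 = 72 log e
print(nu_star, f(nu_star))             # ~4.70, coefficient below 0.4
```

The stationary point is indeed a minimum (the two neighboring values $f(4)$ and $f(6)$ are both larger), and $f(\nu^{\ast})\approx 0.392<0.4$, matching the coefficient claimed for the last region of (86).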

SECTION VII

## SUMMARY AND CONCLUSION

Universal compression of sequences generated by monotonic distributions was studied. We showed that, for finite alphabets, prior knowledge of the monotonicity of a distribution reduces the cost of universality. For alphabets of $o(n^{1/3})$ letters, this cost decreases from $0.5\log (n/k)$ bits to $0.5\log (n/k^{3})$ bits per unknown probability parameter. Otherwise, for alphabets of $O(n)$ letters, such sources can be compressed with overall redundancy of $O(n^{1/3+\varepsilon})$ bits. This is a significant decrease from the $O(k\log n)$ or $O(n)$ overall bits that can be achieved when no side information about the source distribution is available. Overall redundancy of $O(n^{1/3+\varepsilon})$ bits can also be achieved for much larger alphabets, including infinite alphabets, for fast-decaying monotonic distributions. Sequences generated by slower-decaying distributions can also be compressed with diminishing per-symbol redundancy under some mild conditions, specifically if they have finite entropy rates. Examples of well-known monotonic distributions demonstrated how the diminishing redundancy decay rates can be computed by applying the derived bounds. The general results were shown to apply also to individual sequences whose empirical distributions obey the monotonicity. The techniques used for individual sequences can also be applied to bounding the redundancy of coding patterns.

## APPENDIX

### A. Proof of Theorem 1

The proof follows the same steps used in [30] and [31] to lower bound the maximin redundancies for large alphabets and patterns, respectively, using the weak version of the redundancy-capacity theorem [6]. This version relates the maximin universal coding redundancy to the capacity of a channel defined by the conditional probability $P_{\theta}\left(x^{n}\right)$. We define a set ${\mmb{\Omega}}_{{\cal M}_{k}}$ of points $\mmb{\theta}\in{\cal M}_{k}$, and then show that these points are distinguishable by observing $X^{n}$, i.e., the probability that $X^{n}$ generated by $\mmb{\theta}\in{\mmb{\Omega}}_{{\cal M}_{k}}$ appears to have been generated by another point $\mmb{\theta}^{\prime}\in{\mmb{\Omega}}_{{\cal M}_{k}}$ diminishes with $n$. Then, using Fano's inequality [4], the normalized logarithm of the number of such distinguishable points is a lower bound on $R_{n}^{-}\left({\cal M}_{k}\right)$. Since $R_{n}^{+}\left({\cal M}_{k}\right)\geq R_{n}^{-}\left({\cal M}_{k}\right)$, it is also a lower bound on the average minimax redundancy. The two regions in (6) result from a threshold phenomenon: there exists a value $k_{m}$ of $k$ that maximizes the lower bound, and this value can be applied to all ${\cal M}_{k}$ with $k\geq k_{m}$.

We begin by defining ${\mmb{\Omega}}_{{\cal M}_{k}}$. Let $\mmb{\omega}$ be a vector of grid components, such that the last $k-1$ components $\theta_{i}$, $i=2,\ldots, k$, of $\mmb{\theta}\in{\mmb{\Omega}}_{{\cal M}_{k}}$ must satisfy $\theta_{i}\in\mmb{\omega}$. Let $\omega_{b}$ be the $b$th point in $\mmb{\omega}$, and define $\omega_{0}=0$ and $$\omega_{b}\triangleq\sum_{j=1}^{b}{{2\left(j-{{1}\over{2}}\right)}\over{n^{1-\varepsilon}}}={{b^{2}}\over{n^{1-\varepsilon}}},\quad b=1, 2,\ldots.\eqno{\hbox{(A.1)}}$$ Then, for the $b$th point in $\mmb{\omega}$, $$b=\sqrt{\omega_{b}}\cdot\sqrt{n}^{1-\varepsilon}.\eqno{\hbox{(A.2)}}$$
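
The closed form in (A.1) is the telescoping identity $\sum_{j=1}^{b}2\left(j-{{1}\over{2}}\right)=b^{2}$, so consecutive grid points are spaced $\omega_{b}-\omega_{b-1}=(2b-1)/n^{1-\varepsilon}$ apart, growing linearly in $b$. A quick exact-arithmetic check of the identity (illustrative; the denominator `n_pow` stands for $n^{1-\varepsilon}$):

```python
from fractions import Fraction

def omega(b, n_pow):
    """omega_b of (A.1): sum_{j=1}^{b} 2(j - 1/2) / n_pow, computed exactly."""
    return sum(Fraction(2) * (Fraction(j) - Fraction(1, 2))
               for j in range(1, b + 1)) / n_pow

# The running sum matches the closed form b^2 / n_pow for every b:
for b in range(1, 50):
    assert omega(b, 1000) == Fraction(b * b, 1000)
print("telescoping identity (A.1) verified")
```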

To count the number of points in ${\mmb{\Omega}}_{{\cal M}_{k}}$, let us first consider the standard i.i.d. case, in which there is no monotonicity requirement, and count the number of points in ${\mmb{\Omega}}$, which is defined similarly but without the monotonicity requirement (i.e., ${\mmb{\Omega}}_{{\cal M}_{k}}\subseteq{\mmb{\Omega}}$). Let $b_{i}$ be the index of $\theta_{i}$ in $\mmb{\omega}$, i.e., $\theta_{i}=\omega_{b_{i}}$. Then, from (A.1) and (A.2), and since the components of $\mmb{\theta}$ are probabilities, $$\sum_{i=2}^{k}{{b_{i}^{2}}\over{n^{1-\varepsilon}}}=\sum_{i=2}^{k}\omega_{b_{i}}=\sum_{i=2}^{k}\theta_{i}\leq 1.\eqno{\hbox{(A.3)}}$$ It follows that for $\mmb{\theta}\in{\mmb{\Omega}}$, $$\sum_{i=2}^{k}b_{i}^{2}\leq n^{1-\varepsilon}.\eqno{\hbox{(A.4)}}$$ Hence, since the components $b_{i}$ are nonnegative integers, $$\eqalignno{M&\triangleq\left\vert{\mmb{\Omega}}\right\vert\geq\sum_{b_{2}=0}^{\left\lfloor\sqrt{n^{1-\varepsilon}}\right\rfloor}\;\sum_{b_{3}=0}^{\left\lfloor\sqrt{n^{1-\varepsilon}-b_{2}^{2}}\right\rfloor}\cdots\sum_{b_{k}=0}^{\left\lfloor\sqrt{n^{1-\varepsilon}-\sum_{i=2}^{k-1}b_{i}^{2}}\right\rfloor}1\cr &\buildrel{(a)}\over{\geq}\int_{0}^{\sqrt{n^{1-\varepsilon}}}\int_{0}^{\sqrt{n^{1-\varepsilon}-x_{2}^{2}}}\cdots\int_{0}^{\sqrt{n^{1-\varepsilon}-\sum_{i=2}^{k-1}x_{i}^{2}}}dx_{k}\cdots dx_{2}\cr &\buildrel{(b)}\over{=}{{V_{k-1}\left(\sqrt{n}^{1-\varepsilon}\right)}\over{2^{k-1}}}&{\hbox{(A.5)}}}$$ where $V_{k-1}\left(\sqrt{n}^{1-\varepsilon}\right)$ is the volume of a $(k-1)$-dimensional sphere with radius $\sqrt{n}^{1-\varepsilon}$, $(a)$ follows from the monotonic decrease of the integrand in all integration arguments, and $(b)$ follows since its left-hand side computes the volume of the positive quadrant of this sphere. Note that this is a different proof from that used in [30] and [31] for this step.
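
Step $(a)$ of (A.5) uses the fact that, for a function nonincreasing in each argument, a nested sum over nonnegative integers dominates the corresponding nested integral, so the lattice-point count dominates the positive-quadrant volume $V_{k-1}(R)/2^{k-1}$. This can be verified exhaustively in low dimension (illustrative check; $d$ plays the role of $k-1$ and $R$ the role of $\sqrt{n}^{1-\varepsilon}$):

```python
import math
from itertools import product

def lattice_count(d, R):
    """Number of nonnegative integer points b in Z^d with ||b||^2 <= R^2."""
    r = int(R)
    return sum(1 for b in product(range(r + 1), repeat=d)
               if sum(x * x for x in b) <= R * R)

def quadrant_volume(d, R):
    """V_d(R) / 2^d: positive-quadrant volume of the d-dimensional ball."""
    return (math.pi ** (d / 2) * R ** d / math.gamma(d / 2 + 1)) / 2 ** d

# The count dominates the quadrant volume in every dimension checked:
for d in (2, 3, 4):
    print(d, lattice_count(d, 10.0), quadrant_volume(d, 10.0))
```

For $d=2$ and $R=10$, for example, there are 90 lattice points against a quarter-disk area of $\pi\cdot 100/4\approx 78.5$.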
Applying the monotonicity constraint, all permutations of $\mmb{\theta}$ that are not monotonic must be taken out of the grid. Hence, $$M_{{\cal M}_{k}}\triangleq\left\vert{\mmb{\Omega}}_{{\cal M}_{k}}\right\vert\geq{{V_{k-1}\left(\sqrt{n}^{1-\varepsilon}\right)}\over{k!\cdot 2^{k-1}}}\eqno{\hbox{(A.6)}}$$ where dividing by $k!$ is a worst-case assumption, yielding a lower bound rather than an equality. This leads to a lower bound on the number of points in ${\mmb{\Omega}}_{{\cal M}_{k}}$ equal to that obtained for patterns in [31]. Specifically, the bound attains its maximal value at $k_{m}=\left(\pi n^{1-\varepsilon}/2\right)^{1/3}$ and then decreases, eventually becoming smaller than 1. However, for $k>k_{m}$, one can consider a monotonic distribution for which all components $\theta_{i}$, $i>k_{m}$, of $\mmb{\theta}$ are zero, and use the bound for $k_{m}$.
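
The location of the maximum can be seen numerically. Writing the logarithm of the right-hand side of (A.6) with $\varepsilon=0$ (an assumption for this check) and the exact ball-volume formula $V_{d}(R)=\pi^{d/2}R^{d}/\Gamma(d/2+1)$, the bound first grows with $k$ and then decays, peaking near $k_{m}=(\pi n/2)^{1/3}$; the sketch below confirms an interior maximum near $k_{m}$:

```python
import math

def log_bound(k, n):
    """ln of the lower bound (A.6) with eps = 0:
    V_{k-1}(sqrt(n)) / (k! 2^{k-1}), via lgamma for numerical stability."""
    d = k - 1
    ln_vol = (d / 2) * math.log(math.pi) + d * math.log(math.sqrt(n)) \
             - math.lgamma(d / 2 + 1)
    return ln_vol - math.lgamma(k + 1) - d * math.log(2)

n = 10 ** 6
km = (math.pi * n / 2) ** (1.0 / 3.0)   # ~116 for n = 10^6
best_k = max(range(2, 1000), key=lambda k: log_bound(k, n))
print(best_k, km)
```

The discrete maximizer lands within a few units of the continuous stationary point $(\pi n/2)^{1/3}$, consistent with the threshold phenomenon described above.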

Distinguishability of $\mmb{\theta}\in{\mmb{\Omega}}_{{\cal M}_{k}}$ is a direct result of distinguishability of $\mmb{\theta}\in{\mmb{\Omega}}$, which is shown in Lemma 3.1 in [30]. The lemma states the following: there exists an estimator ${\hat{{\mmb{\Theta}}}}_{g}(X^{n})\in{\mmb{\Omega}}$ for which the estimate ${\hat{\mmb{\theta}}}_{g}$ satisfies $\lim_{n\rightarrow\infty}P_{\theta}\left({\hat{\mmb{\theta}}}_{g}\ne\mmb{\theta}\right)=0$ for all $\mmb{\theta}\in{\mmb{\Omega}}$. Since this is true for all points in ${\mmb{\Omega}}$, it is also true for all points in ${\mmb{\Omega}}_{{\cal M}_{k}}\subseteq{\mmb{\Omega}}$, where now ${\hat{\mmb{\theta}}}_{g}\in{\mmb{\Omega}}_{{\cal M}_{k}}$. Assuming all points in ${\mmb{\Omega}}_{{\cal M}_{k}}$ are equally likely to generate $X^{n}$, we can define an average error probability $P_{e}\triangleq\Pr\left[{\hat{{\mmb{\Theta}}}}_{g}(X^{n})\ne{\mmb{\Theta}}\right]=\sum_{\mmb{\theta}\in{\mmb{\Omega}}_{{\cal M}_{k}}}P_{\theta}\left({\hat{\mmb{\theta}}}_{g}\ne\mmb{\theta}\right)/M_{{\cal M}_{k}}$. Using the redundancy-capacity theorem, $$\eqalignno{nR^{-}_{n}\left[{\cal M}_{k}\right]&\geq C\left[{\cal M}_{k}\rightarrow X^{n}\right]\cr &\buildrel{(a)}\over{\geq} I\left[{\mmb{\Theta}}; X^{n}\right]=H\left[{\mmb{\Theta}}\right]-H\left[{\mmb{\Theta}}\vert X^{n}\right]\cr &\buildrel{(b)}\over{=}\log M_{{\cal M}_{k}}-H\left[{\mmb{\Theta}}\vert X^{n}\right]\cr &\buildrel{(c)}\over{\geq}\left(1-P_{e}\right)\log M_{{\cal M}_{k}}-1\cr &\buildrel{(d)}\over{\geq} (1-o(1))\log M_{{\cal M}_{k}}&{\hbox{(A.7)}}}$$ where $C\left[{\cal M}_{k}\rightarrow X^{n}\right]$ denotes the capacity of the channel between ${\cal M}_{k}$ and the observation $X^{n}$, and $I\left[{\mmb{\Theta}}; X^{n}\right]$ is the mutual information induced by the joint distribution $\Pr\left(\Theta=\theta\right)\cdot P_{\theta}\left(X^{n}\right)$.
Inequality $(a)$ follows from the definition of capacity, equality $(b)$ from the uniform distribution of ${\mmb{\Theta}}$ over ${\mmb{\Omega}}_{{\cal M}_{k}}$, inequality $(c)$ from Fano's inequality, and $(d)$ follows since $P_{e}\rightarrow 0$. Lower bounding the expression in (A.6) for the two regions (obtaining the same bounds as in [31]), then using (A.7), normalizing by $n$, and absorbing second-order terms in $\varepsilon$ yields the two regions of the bound in (6). This concludes the proof of Theorem 1. $\hfill \square$

### B. Proof of Theorem 2

To prove Theorem 2, we use the random-coding strong version of the redundancy-capacity theorem [19]. The idea is similar to that of the weak version used in Appendix A. We assume that grids ${\mmb{\Omega}}_{{\cal M}_{k}}$ of points are uniformly distributed over ${\cal M}_{k}$, and one grid is selected randomly. Then, a point in the selected grid is randomly selected under a uniform prior to generate $X^{n}$. The random choice of a grid, and then of a source in the grid, must uniformly cover the whole space ${\cal M}_{k}$. Showing distinguishability within a selected grid, for every possible random choice of ${\mmb{\Omega}}_{{\cal M}_{k}}$, implies that a lower bound on the cardinality of ${\mmb{\Omega}}_{{\cal M}_{k}}$ that holds for every possible choice is essentially a lower bound on the overall sequence redundancy for most sources in ${\cal M}_{k}$.

The construction of ${\mmb{\Omega}}_{{\cal M}_{k}}$ is identical to that used in [31] to construct a grid of sources that generate patterns. We pack spheres of radius $n^{-0.5(1-\varepsilon)}$ in the parameter space defining ${\cal M}_{k}$. The set ${\mmb{\Omega}}_{{\cal M}_{k}}$ consists of the center points of the spheres. To cover the space ${\cal M}_{k}$, we select a random shift of the whole lattice under a uniform distribution. The cardinality of ${\mmb{\Omega}}_{{\cal M}_{k}}$ is lower bounded by the ratio between the volume of ${\cal M}_{k}$, which equals (as shown in [31]) $1/[(k-1)!\,k!]$, and the volume of a single sphere, also factoring in a packing density (see, e.g., [3]). This yields (55) in [31]: $$M_{{\cal M}_{k}}\geq{{1}\over{(k-1)!\cdot k!\cdot V_{k-1}\left(n^{-0.5(1-\varepsilon)}\right)\cdot 2^{k-1}}}\eqno{\hbox{(B.1)}}$$ where $V_{k-1}\left(n^{-0.5(1-\varepsilon)}\right)$ is the volume of a $(k-1)$-dimensional sphere with radius $n^{-0.5(1-\varepsilon)}$ (see, e.g., [3] for the computation of this volume).

For distinguishability, it is sufficient to show that there exists an estimator ${\hat{{\mmb{\Theta}}}}_{g}(X^{n})\in{\mmb{\Omega}}_{{\cal M}_{k}}$ such that $\lim_{n\rightarrow\infty}P_{\Theta}\left[{\hat{{\mmb{\Theta}}}}_{g}(X^{n})\ne{\mmb{\Theta}}\right]=0$ for every choice of ${\mmb{\Omega}}_{{\cal M}_{k}}$ and for every choice of ${\mmb{\Theta}}\in{\mmb{\Omega}}_{{\cal M}_{k}}$. This was already shown in [30, Lemma 4.1] for a larger grid ${\mmb{\Omega}}$ of i.i.d. sources, constructed identically to ${\mmb{\Omega}}_{{\cal M}_{k}}$ over the complete $(k-1)$-dimensional probability simplex. The lemma states the following: let ${\mmb{\Theta}}\in{\mmb{\Omega}}$ be a randomly selected point in a grid ${\mmb{\Omega}}$, and let a random sequence $X^{n}$ be governed by $P_{\Theta}\left(X^{n}\right)$. Then, there exists a decision rule that chooses a point ${\hat{{\mmb{\Theta}}}}_{g}\left(X^{n}\right)\in{\mmb{\Omega}}$ such that $\lim_{n\rightarrow\infty}P_{\Theta}\left[{\hat{{\mmb{\Theta}}}}_{g}(X^{n})\ne{\mmb{\Theta}}\right]=0$. By the monotonicity requirement, for every ${\mmb{\Omega}}_{{\cal M}_{k}}$ there exists an i.i.d. ${\mmb{\Omega}}$ such that ${\mmb{\Omega}}_{{\cal M}_{k}}\subseteq{\mmb{\Omega}}$. Since [30, Lemma 4.1] holds for ${\mmb{\Omega}}$, it must also hold for the smaller grid ${\mmb{\Omega}}_{{\cal M}_{k}}$. Now, since all the conditions of the strong random-coding version of the redundancy-capacity theorem hold, taking the logarithm of the bound in (B.1), absorbing second-order terms in $\varepsilon$, and normalizing by $n$ leads to the first region of the bound in (8). By [19, Th. 3], since for any fixed arbitrarily small $\varepsilon>0$ we have $P_{e}\triangleq P_{\Theta}\left[{\hat{{\mmb{\Theta}}}}_{g}(X^{n})\ne{\mmb{\Theta}}\right]\rightarrow 0$, it follows that $\mu_{n}\left(A_{\varepsilon}(n)\right)\rightarrow 0$, completing the proof for the first region of the bound.

The second region of the bound is handled in a manner related to the second region of the bound of Theorem 1. Here, however, we cannot simply set the probability of all symbols $i>k_{m}$ to zero, because all possible valid sources must be included in one of the grids ${\mmb{\Omega}}_{{\cal M}_{k}}$ to achieve a complete covering of ${\cal M}_{k}$. As was done in [31], we include sources with $\theta_{i}>0$ for $i>k_{m}$ in the grids ${\mmb{\Omega}}_{{\cal M}_{k}}$, but do not include them in the lower bound on the number of grid points. Instead, for $k>k_{m}$, we bound the number of points in a $k_{m}$-dimensional cut of ${\cal M}_{k}$ for which the remaining $k-k_{m}$ components of $\mmb{\theta}$ are very small (and insignificant). This analysis is valid also for $k>n$. In proving distinguishability, however, we must take into account the effect of the additional sources in the grid, and ensure that their presence in ${\mmb{\Omega}}_{{\cal M}_{k}}$ does not lead to a nondiminishing error probability. Lemma 6.1 in [31] shows that $\lim_{n\rightarrow\infty}P_{\Theta}\left[{\hat{{\mmb{\Theta}}}}_{g}(X^{n})\ne{\mmb{\Theta}}\right]=0$ for $k>k_{m}$ for an i.i.d. grid of sources ${\mmb{\Omega}}$ with no monotonicity restriction. The proof is given in [31, Appendix D]. As before, it carries over to monotonic distributions, since for each ${\mmb{\Omega}}_{{\cal M}_{k}}$ there exists an unrestricted corresponding ${\mmb{\Omega}}$ such that ${\mmb{\Omega}}_{{\cal M}_{k}}\subseteq{\mmb{\Omega}}$. The choice of $k_{m}=0.5\left(n^{1-\varepsilon}/\pi\right)^{1/3}$ gives the maximal bound w.r.t. $k$. Since, again, all conditions of the strong version of the redundancy-capacity theorem are satisfied, the second region of the bound is obtained. This concludes the proof of Theorem 2. $\hfill \square$

### ACKNOWLEDGMENT

The author would like to thank the associate editor, W. Szpankowski, for handling this paper and for very valuable comments that helped improve it, and an anonymous reviewer for providing valuable feedback.

## Footnotes

This work was supported by the NSF under Grant CCF-0347969. This paper was presented at the 2007 IEEE International Symposium on Information Theory.

The author is with Google, Inc., Pittsburgh, PA 15206 USA (e-mail: gshamir@ieee.org).

Communicated by W. Szpankowski, Associate Editor for Source Coding.

1For two functions $f(n)$ and $g(n)$, $f(n)=o(g(n))$ if $\forall c$, $\exists n_{0}$, such that $\forall n>n_{0}$, $f(n)<cg(n)$; $f(n)=O(g(n))$ if $\exists c$, $n_{0}$, such that $\forall n>n_{0}$, $0\leq f(n)\leq cg(n)$; $f(n)=\Theta (g(n))$ if $\exists c_{1}$, $c_{2}$, $n_{0}$, such that $\forall n>n_{0}$, $c_{1}g(n)\leq f(n)\leq c_{2}g(n)$.

2In this paper, redundancy is defined per-symbol (normalized by the sequence length $n$). However, when we refer to redundancy in overall bits, we address the block redundancy cost for a sequence.

3The original submission of this paper derived a looser bound for the first region of (10). A tighter bound was obtained using results that appeared subsequently to the submission of this paper in [38].
