Network Traffic Compression With Side Information

I. INTRODUCTION
Several studies have confirmed the presence of a considerable amount of correlation in network traffic data [1]-[5]. Specifically, we may broadly define two types of correlation in network traffic: 1) temporal correlation within content from an information source being delivered to a client; 2) spatial correlation across content from different information sources delivered to the same or different clients. Network traffic abounds with the first dimension, temporal correlation, which is well understood. For example, if traffic contains mostly English text, there is significant correlation within the content. The existence of the second dimension of correlation has also been confirmed in several real-data experiments [1]-[5].
This has motivated the employment of correlation elimination techniques for network traffic data.^1 The present correlation elimination techniques are mostly based on the content caching mechanisms used by solutions such as web caching [6], CDNs [7], and P2P networks [8]. However, caching approaches, which take place at the application layer, do not effectively leverage the spatial correlation, which exists mostly at the packet level [1]-[4]. To address these issues, a few studies have considered ad-hoc methods such as packet-level correlation elimination (deduplication), in which redundant transmissions of packet segments that have appeared in previously sent packets are avoided [3], [4]. However, these techniques are limited in scope and can only eliminate exact duplicates from the segments of the packets, leaving statistical correlations intact.

(The associate editor coordinating the review of this manuscript and approving it for publication was Honglong Chen.)

^1 Within the networking community, these techniques are known as redundancy elimination (RE), but since redundancy has a specific meaning within the universal compression community, we refer to these techniques as correlation elimination for clarity of discussion.
It is natural to consider universal compression algorithms for correlation elimination from network traffic data. While universal compression algorithms, e.g., the Lempel-Ziv algorithm [9], [10] and context tree weighting (CTW) [11], have been very successful in many domains, they do not perform well on limited amounts of data, as learning the unknown source statistics incurs an inevitable redundancy (compression overhead). This redundancy depends on the richness of the class of sources with respect to which the code is universal [12]-[16]. Further, traditional universal compression only attempts to deal with temporal correlation from a stationary source and lacks the structure to leverage the spatial correlation dimension.
In this paper, as an abstraction of correlation elimination from network traffic, we study universal compression with side information from a correlated source. The organization of the paper and our contributions are summarized below.
• In Section II, we motivate the problem by arguing that universal compression of finite-length, low-entropy strings, such as network packets, incurs a significant redundancy (compression) overhead. This motivates using side information for removing this redundancy.
• In Section III, we present the formal problem setup. We define a notion of correlation between two parametric information sources, and study strictly lossless and almost lossless compression when side information from a correlated source is available to the encoder and/or the decoder.
• In Section IV, we establish several nice properties of correlated information sources. We show that the degree of correlation is tuned by a single hyperparameter, which yields independent information sources at one extreme and duplicate sources at the other.
• In Section V, we characterize the average maximin redundancy with side information from a correlated source. We also show that if the permissible error is sufficiently small, the redundancy of almost lossless compression dominates the reduction in codeword length due to the permissible error.
• In Section VI, we define and characterize a notion of side information gain and establish a sufficient condition on the length of a side information string that would guarantee almost all of the benefits. We show that the side information gain can be considerable in many scenarios and derive a cutoff threshold on the size of memory needed to obtain all of the side information gain.
• In Section VII, we show that the side information gain is largely preserved even if the prefix constraint on the code is dropped.
• In Section VIII, we provide a case study that shows how these benefits would be extended in a network setting.
• Finally, the conclusions are summarized in Section IX.

II. MOTIVATION
We describe universal compression with side information from a correlated source in the most basic scenario. We use the notation x^n = (x_1, . . . , x_n) to denote a string of length n on the finite alphabet X. For example, for an 8-bit alphabet that has 256 characters, each x_i is a byte and x^n denotes a packet at the network layer. We assume that, as shown in Fig. 1, the network consists of content server nodes S_1 and S_2, an intermediate memory-enabled (relay or router) node M, and client nodes C_1 and C_2.
Assume that the content at S_1 is stationary and correlated with the content at S_2. Assume that y^m has already been routed through the S_2 → M → C_2 path. Also, assume that all nodes in the route, i.e., S_2, M, and C_2, have memorized the content y^m. Now, assume that x^n is to be routed through the S_1 → M → C_1 path. In this case, at the S_1 → M link the side information string is only available to the decoder, while at the M → C_1 link, the side information is only available to the encoder. If x^n were instead routed through the S_1 → M → C_2 path, the side information would be available to both the encoder and the decoder at the M → C_2 link. As such, in this paper we wish to study universal compression with side information that is available to the encoder and/or the decoder.
Given the side information gain, in [17], we analyzed the network-wide benefits of introducing memory-enabled nodes to the network and provided results on memory placement and routing for extending the gain to the entire network. However, [17] did not explain how to characterize the side information gain.
Let redundancy be the overhead in the number of bits used for describing a random string drawn from an unknown information source compared to the optimal codeword length given by the Shannon code. In the universal compression of a family of information sources that can be parametrized with d unknown parameters, Rissanen showed that the expected redundancy asymptotically scales as (d/2) log n + o(log n) for almost all sources in the family [13]. Clarke and Barron [18] derived the asymptotic average minimax redundancy for memoryless sources to be (d/2) log n + O_n(1). This was later generalized by Atteson to Markov information sources [19]. The average minimax redundancy is concerned with the redundancy of the worst parameter vector for the best code, and hence, does not provide much information about the rest of the source parameter values. However, in light of Rissanen's result, one would expect that asymptotically almost all information sources in the family behave similarly. The question remains as to how these would behave in the finite-length regime.
In [16, Theorem 1], using a probabilistic treatment, we derived sharp lower bounds on the probability of the event that the redundancy in the compression of a random string of length n from a parametric source is larger than a certain fraction of (d/2) log n. That is, [16, Theorem 1] provides, for any n, a lower bound on the probability measure of the information sources for which the average redundancy of the best universal compression scheme is larger than that fraction of (d/2) log n. To demonstrate the implications of this result in the finite-length regime of interest in this paper, we consider an example using a first-order Markov information source with alphabet size k = 256. This information source is represented using d = 256 × 255 = 65280 parameters. We further assume that the source entropy rate is 0.5 bits per byte. It then follows from [16, Theorem 1] that the compression overhead is more than 75% for strings of length 256kB. We conclude that redundancy is significant in the compression of finite-length low-entropy sequences, such as Internet traffic packets, which are much shorter than 256kB. It is this redundancy that we hope to suppress using side information from a correlated source. The compression overhead becomes negligible for very long sequences (e.g., it is less than 2% for strings of length 64MB and above), and hence, the side information gain vanishes as O(log n / n) when the sequence length grows large. It is also worth noting that the scope of the benefits expected from universal compression of network traffic with side information is significant, since file sharing and web data comprise more than 50% of network traffic [20], for which the correlation levels may be as high as 90% [1]. Further, universal compression with side information is applicable to storage reduction in cloud and distributed storage systems, traffic reduction for Internet service providers, and power and bandwidth reduction in wireless communication networks (e.g., wireless sensor networks, cellular mobile networks, hot spots). See [17], [21] for a more thorough investigation of such applications and also for practical coding schemes for network packet compression.
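To get a feel for these magnitudes, the leading redundancy term (d/2) log n can be compared against the entropy H_n(θ) of the string. The following is our own back-of-the-envelope sketch; it keeps only the leading term and drops all constants, so the ratios illustrate the trend rather than reproducing the finite-length bounds of [16] quoted above:

```python
import math

def redundancy_overhead(n_bytes, d, entropy_rate_bits_per_byte):
    """Leading-order redundancy estimate (d/2) log2(n) relative to the entropy.

    This keeps only the dominant term of the universal-compression redundancy;
    the Clarke-Barron constants and finite-length corrections are omitted,
    so the ratios are rough.
    """
    h = entropy_rate_bits_per_byte * n_bytes   # entropy H_n(theta) in bits
    r = 0.5 * d * math.log2(n_bytes)           # leading redundancy term in bits
    return r / h                               # overhead relative to entropy

d = 256 * 255  # first-order Markov source over a 256-character alphabet
for n in (32 * 1024, 256 * 1024, 64 * 1024 * 1024):
    print(n, redundancy_overhead(n, d, 0.5))
```

The overhead shrinks roughly like (log n)/n: substantial for packet-sized strings, negligible at tens of megabytes.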

III. PROBLEM SETUP
Let X be a finite alphabet. We assume that the server S comprises two parametric information sources with parameter vectors θ^(1), θ^(2) ∈ Λ_d, where θ^(1) := (θ_1, . . . , θ_d), and where Λ_d is a d-dimensional compact set. Denote µ^n_{θ^(1)} and µ^n_{θ^(2)} as the probability measures defined by the parameter vectors θ^(1) and θ^(2) on strings of length n. If the information sources are memoryless, we let θ^(1) be the stochastic vector associated with the categorical distribution over source characters, and Λ_d would be the d-dimensional probability simplex. In the memoryless case, µ^n_{θ^(1)} would be a product distribution. In this paper, we consider a Bayesian setting where we assume that θ^(1) is a priori unknown but its prior is known. Unless otherwise stated, we use the notation X^n ∈ X^n and Y^m ∈ X^m to denote random strings of lengths n and m drawn from µ^n_{θ^(1)} and µ^m_{θ^(2)}, respectively. See Assumption 1 (appendix) for a set of regularity conditions that we assume on the parametric family.
We put forth a notion of correlation between the parameter vectors θ^(1) and θ^(2), which are jointly drawn. As we shall see, the degree of correlation can be tuned using a hyperparameter t. We assume that the unknown (and unobserved) parameter vector θ^(1) follows a prior distribution q supported on Λ_d. Let Z^t be a random string of length t that is drawn from µ^t_{θ^(1)}. We assume that, given Z^t, the parameter vectors θ^(1) and θ^(2) are independent and identically distributed. This is shown in the Markov chain represented in Fig. 2. We will state several nice properties of this proposed model in Section IV.
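The hierarchical model is easy to simulate in the simplest case. Below is a sketch for a d = 1 Bernoulli source with a Beta prior (our own toy instance, not the general family): θ^(1) is drawn from the prior, Z^t from µ^t_{θ^(1)}, and θ^(2) from the posterior given Z^t, which makes θ^(1) and θ^(2) i.i.d. given Z^t by Bayes' rule:

```python
import random

def draw_correlated_params(t, rng, a=0.5):
    """Draw (theta1, theta2) from the hierarchical correlation model for a
    Bernoulli source with a Beta(a, a) prior (a = 0.5 is Jeffreys' prior),
    a d = 1 stand-in for the general parametric family.

    theta1 ~ Beta(a, a); Z^t is t Bernoulli(theta1) draws; theta2 is redrawn
    from the posterior given Z^t, so theta1 and theta2 are i.i.d. given Z^t.
    """
    theta1 = rng.betavariate(a, a)
    heads = sum(rng.random() < theta1 for _ in range(t))
    theta2 = rng.betavariate(a + heads, a + t - heads)
    return theta1, theta2

def empirical_corr(t, trials=20000, seed=0):
    """Empirical correlation coefficient between theta1 and theta2."""
    rng = random.Random(seed)
    pairs = [draw_correlated_params(t, rng) for _ in range(trials)]
    mx = sum(p[0] for p in pairs) / trials
    my = sum(p[1] for p in pairs) / trials
    cov = sum((p[0] - mx) * (p[1] - my) for p in pairs) / trials
    vx = sum((p[0] - mx) ** 2 for p in pairs) / trials
    vy = sum((p[1] - my) ** 2 for p in pairs) / trials
    return cov / (vx * vy) ** 0.5

print(empirical_corr(0), empirical_corr(100))
```

Raising t drives the empirical correlation between θ^(1) and θ^(2) from about 0 toward 1, matching the role of t as the correlation hyperparameter.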
Note that this framework is fundamentally different from Slepian-Wolf coding that also targets the spatial correlation between distributed information sources [22]- [25]. In Slepian-Wolf coding, the sequences from the distributed sources are assumed to have character-by-character correlation, which is also different from our correlation model that is due to the parameter vectors being unknown in a universal compression setup.
Let H_n(θ) denote the Shannon entropy of the source given θ, i.e.,

H_n(θ) := E log ( 1 / µ^n_θ(X^n) ) = − Σ_{x^n ∈ X^n} µ^n_θ(x^n) log µ^n_θ(x^n).  (1)

Throughout this paper, expectations are taken over functions of the random sequence X^n with respect to the (unknown) probability measure µ_θ, and log(·) denotes the logarithm in base 2, unless otherwise stated. We further use the notation H(θ) to denote the entropy rate, defined as H(θ) := lim_{n→∞} (1/n) H_n(θ). Let I(θ) be the Fisher information matrix, where each element is given by

I_{ij}(θ) := (1/n) E { (∂²/∂θ_i ∂θ_j) log ( 1 / µ^n_θ(X^n) ) }.  (2)

The Fisher information matrix quantifies the amount of information, on average, that a random string X^n from the source conveys about the source parameters. Let Jeffreys' prior on Λ_d be defined as

w_J(θ) := |I(θ)|^{1/2} / ∫_{Λ_d} |I(λ)|^{1/2} dλ.  (3)

Roughly speaking, Jeffreys' prior is optimal in the sense that the average minimax redundancy is asymptotically achieved when the parameter vector θ is assumed to follow Jeffreys' prior (see [18] for a formalized statement and proof). This prior distribution is particularly interesting because it corresponds to the worst-case compression performance for the best compression scheme.

We consider the family of block codes that map any n-string to a variable-length binary sequence and satisfy Kraft's inequality [26]. Let C := X^n and D := {0, 1}^*, and let C_E := X^n × X^m and D_D := {0, 1}^* × X^m denote the corresponding input sets when the side information string y^m is present; further, let ε denote an erasure, which an almost lossless decoder may output. We use c : C → {0, 1}^* and c_E : C_E → {0, 1}^* to denote the encoder without and with side information, respectively. Similarly, we use d : D → X^n and d_D : D_D → X^n to denote the decoder without and with side information, respectively. Next, we present the notions of strictly lossless and almost lossless source codes, which will be needed in the sequel. While the definitions are only given for the case with no side information at the encoder and the decoder, it is straightforward to extend them using the above definitions. Our main focus is on prefix-free codes, which ensure unique decodability of concatenated code blocks (see [27, Chapter 5.1]).
Definition 1: The code c : C → {0, 1}^* is called strictly lossless (also called zero-error) if there exists a reverse mapping d : D → X^n such that

E { 1_e(X^n) } = 0,

where 1_e(x^n) denotes the error indicator function, i.e.,

1_e(x^n) := 1 if d(c(x^n)) ≠ x^n, and 0 otherwise.
Most practical data compression schemes are examples of strictly lossless codes, namely, the arithmetic code [28], the Huffman code [29], Lempel-Ziv codes [9], [10], and the CTW code [11]. In almost lossless source coding, which is a weaker notion than the strictly lossless case, we allow a nonzero error probability δ(n) for any finite n, while if δ(n) = o_n(1) the code is almost surely asymptotically error free. Shannon's proofs [30] of the existence of entropy-achieving source codes are based on almost lossless random codes. The proof of the Slepian-Wolf theorem [22] also uses almost lossless codes. Further, all practical implementations of Slepian-Wolf source coding are based on almost lossless codes (see [24], [25]).
We consider four coding strategies according to the orientation of the switches s_e and s_d in Fig. 3 for the compression of x^n drawn from µ^n_{θ^(1)}, depending on whether or not the sequence y^m drawn from µ^m_{θ^(2)} is available to the encoder/decoder.

• Ucomp (Universal compression without side information), where the switches s_e and s_d in Fig. 3 are both open. This corresponds to C ∈ C and D ∈ D.
• UcompE (Universal compression with encoder side information), where the switch s_e in Fig. 3 is closed but the switch s_d is open. This corresponds to C_E ∈ C_E and D_E ∈ D.
• UcompD (Universal compression with decoder side information), where the switch s_e in Fig. 3 is open but the switch s_d is closed. This corresponds to C ∈ C and D_D ∈ D_D.
• UcompED (Universal compression with encoder-decoder side information), where the switches s_e and s_d in Fig. 3 are both closed. This corresponds to C_E ∈ C_E and D_ED ∈ D_D.

IV. IMPLICATIONS OF THE CORRELATION MODEL
In this section, we study some implications of the proposed correlation model. This section may be skipped by the reader and only referred to when a particular lemma is needed in the subsequent proofs.

Lemma 1: The joint distribution of (θ^(1), θ^(2)) for all t ≥ 0 is given by

p(θ^(1), θ^(2)) = q(θ^(1)) q(θ^(2)) f_t(θ^(1), θ^(2)),

where f_t(θ^(1), θ^(2)) is defined as

f_t(θ^(1), θ^(2)) := Σ_{z^t ∈ X^t} µ^t_{θ^(1)}(z^t) µ^t_{θ^(2)}(z^t) / p(z^t),

with p(z^t) := ∫_{Λ_d} q(θ) µ^t_θ(z^t) dθ.

Proof: We have

p(θ^(1), θ^(2)) = Σ_{z^t} p(z^t) p(θ^(1) | z^t) p(θ^(2) | z^t)  (9)
               = Σ_{z^t} p(z^t) ( q(θ^(1)) µ^t_{θ^(1)}(z^t) / p(z^t) ) ( q(θ^(2)) µ^t_{θ^(2)}(z^t) / p(z^t) ),  (10)

where (9) follows from the fact that θ^(2) and θ^(1) are independent and identically distributed given Z^t, and (10) follows from the Bayes rule. Hence, the result follows.
Lemma 3: If t = 0, then θ^(1) and θ^(2) are independent, each following the prior q.

Proof: By definition of f_t(·, ·), and the fact that f_0(θ^(1), θ^(2)) = 1 (the only string of length zero is the empty string, for which p(z^0) = 1), the claim follows by invoking Lemma 1.
Lemma 4: As t → ∞, θ^(2) converges to θ^(1) in mean square.

Proof: Let θ̂^(1)(Z^t) be the maximum likelihood estimator (MLE) of θ^(1) from the observation Z^t. By definition of the model, θ̂^(1)(Z^t) also serves as the MLE for θ^(2). Then,

E || θ^(1) − θ^(2) ||² ≤ 2 E || θ^(1) − θ̂^(1)(Z^t) ||² + 2 E || θ^(2) − θ̂^(1)(Z^t) ||²,

and the statement follows from the convergence of the MLE in mean square for the parametric information source, as assumed in the regularity conditions put forth in Assumption 1 (appendix).

Remark: The degree of correlation between the two parameter vectors θ^(1) and θ^(2) is determined by the hyperparameter t. This degree of correlation varies from independence of the two parameter vectors at t = 0 all the way to the vectors being equal (convergence in mean square) as t → ∞. Further, note that the covariance matrix of the difference θ^(1) − θ^(2) asymptotically behaves like (2/t) I^{-1}(θ^(1)) as t grows large.
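The (2/t) I^{-1}(θ^(1)) scaling in the remark can be sanity-checked in the same d = 1 Bernoulli toy instance used earlier (our own stand-in for the general family), where I(θ) = 1/(θ(1−θ)):

```python
import random

def mean_sq_gap(t, trials=5000, seed=1, a=0.5):
    """Estimate E[(theta1 - theta2)^2] in the Bernoulli/Beta(a, a) instance of
    the correlation model.

    The remark suggests the gap scales like (2/t) I^{-1}(theta); for a
    Bernoulli source I(theta) = 1/(theta(1-theta)), so averaging over
    Jeffreys' prior Beta(1/2, 1/2), where E[theta(1-theta)] = 1/8, predicts
    roughly 2 * (1/8) / t = 0.25 / t for large t.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        theta1 = rng.betavariate(a, a)
        heads = sum(rng.random() < theta1 for _ in range(t))
        theta2 = rng.betavariate(a + heads, a + t - heads)
        total += (theta1 - theta2) ** 2
    return total / trials

for t in (50, 200, 800):
    print(t, mean_sq_gap(t), 0.25 / t)
```

The empirical gap tracks the 0.25/t prediction and shrinks as t grows, in line with the mean-square convergence of Lemma 4.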

V. AVERAGE MAXIMIN REDUNDANCY
In this section, we investigate the average maximin redundancy in universal compression of correlated sources for different coding strategies put forth in Section III.

A. Ucomp CODING STRATEGY
Let l_n : X^n → R^+ denote the universal (strictly lossless) length function for Ucomp. This is the length associated with a strictly lossless code. A necessary and sufficient condition for the existence of a code that satisfies unique decodability is given by the Kraft inequality:

Σ_{x^n ∈ X^n} 2^{−l_n(x^n)} ≤ 1.  (19)

Denote L_n as the set of all strictly lossless universal length functions that satisfy the Kraft inequality. Denote R_n(l_n, θ) as the average redundancy of the code with length function l_n(·), defined as

R_n(l_n, θ) := E l_n(X^n) − H_n(θ).

Define R_n as the average maximin redundancy of Ucomp, i.e.,

R_n := max_q min_{l_n ∈ L_n} E_{θ ∼ q} R_n(l_n, θ).

It is well known that the maximum above is attained by Jeffreys' prior in the asymptotic limit as n grows large. Hence, in the rest of this paper we assume that θ^(1), θ^(2) ∼ w_J follow Jeffreys' prior given in (3). On the other hand, the length function that achieves the inner minimization is simply the information random variable of the mixture distribution induced by the prior.
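As a concrete check of the Kraft condition, here is a small self-contained sketch (our own toy example, not from the paper) that builds a Huffman code for a skewed four-symbol source and verifies that the Kraft sum of a complete prefix-free code is exactly one:

```python
import heapq
from fractions import Fraction

def huffman_lengths(probs):
    """Return codeword lengths of a Huffman (prefix-free) code for the given
    probabilities, built with the standard two-smallest-merge procedure."""
    # Heap entries: (probability, unique id, leaf indices in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    uid = len(probs)
    while len(heap) > 1:
        p1, _, leaves1 = heapq.heappop(heap)
        p2, _, leaves2 = heapq.heappop(heap)
        for leaf in leaves1 + leaves2:
            lengths[leaf] += 1  # every merge adds one bit to each leaf below it
        heapq.heappush(heap, (p1 + p2, uid, leaves1 + leaves2))
        uid += 1
    return lengths

probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
lengths = huffman_lengths(probs)
kraft_sum = sum(Fraction(1, 2) ** l for l in lengths)
print(lengths, kraft_sum)
```

For a dyadic source such as this one, the Huffman lengths match the ideal log(1/p) lengths and the Kraft sum is exactly 1, i.e., the code tree is complete.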
Putting it all together, we have

R_n = max_q I(X^n; θ),

where I(·; ·) denotes the mutual information; this is Gallager's redundancy-capacity theorem [32]. Clarke and Barron [18] showed that the average maximin redundancy for strictly lossless Ucomp is

R_n = (d/2) log ( n / (2πe) ) + log ∫_{Λ_d} |I(θ)|^{1/2} dθ + o_n(1).  (24)

This result states that the average maximin redundancy of the Ucomp coding strategy is O(log n) and is linearly proportional to the number of unknown source parameters, d.
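For a concrete instance of this formula, consider a memoryless Bernoulli source (d = 1), for which ∫ |I(θ)|^{1/2} dθ = ∫_0^1 dθ/√(θ(1−θ)) = π. A short sketch (our own worked example) evaluating the Clarke-Barron expression:

```python
import math

def clarke_barron_redundancy(n):
    """Clarke-Barron asymptotic maximin redundancy, in bits, for a memoryless
    Bernoulli source (d = 1):
        (1/2) log2(n / (2 pi e)) + log2(pi),
    since int_0^1 sqrt(I(theta)) dtheta = int_0^1 dtheta/sqrt(theta(1-theta)) = pi.
    """
    return 0.5 * math.log2(n / (2 * math.pi * math.e)) + math.log2(math.pi)

for n in (100, 10000, 1000000):
    print(n, clarke_barron_redundancy(n))
```

The redundancy grows by exactly (1/2) log2(100) ≈ 3.32 bits for every factor-of-100 increase in n, reflecting the (d/2) log n scaling with d = 1.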
It is straightforward to define R_n^δ as the average redundancy when θ^(1) follows Jeffreys' prior and we are restricted to almost lossless codes with permissible error δ(n). Note that it is clear that R_n^δ ≤ R_n. A natural question that arises is how much reduction is achievable by allowing a permissible error probability in decoding. Our main result on the Ucomp coding strategy with almost lossless codes is given in the following theorem.
Theorem 1: If the permissible error vanishes as δ(n) = o(log n / n), then

R_n^δ = R_n + o(log n).

Proof: The proof is completed by invoking Lemma 12 in the appendix and noting that R_n = O(log n).

The content of Theorem 1 is that if the permissible error δ(n) in almost lossless compression vanishes fast enough as n grows, then asymptotically the maximin risk imposed by the universality of compression dominates any savings obtained by allowing a δ(n) average error in decoding. Hence, in the rest of this paper we only focus on the family of strictly lossless codes.

B. UcompE CODING STRATEGY
Since the side information sequence y^m is not available to the decoder, the minimum average number of bits required at the decoder to describe the random sequence X^n is indeed H(X^n). On the other hand, it is straightforward to see that

H(X^n) = H(X^n | θ^(1)) + I(X^n; θ^(1)) = E H_n(θ^(1)) + R_n,

where I(X^n; θ^(1)) = R_n by the redundancy-capacity theorem. Hence, for the UcompE strategy, we establish that the side information provided by y^m only at the encoder does not provide any benefit for the strictly lossless universal compression of the sequence x^n.
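The decomposition H(X^n) = H(X^n | θ^(1)) + I(X^n; θ^(1)) can be checked numerically in a d = 1 Bernoulli instance under Jeffreys' prior Beta(1/2, 1/2) (our own toy stand-in for the paper's Markov source); the gap between the mixture entropy and the conditional entropy is exactly the redundancy term:

```python
import math

def h_bits(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mixture_entropy(n, a=0.5):
    """H(X^n) when theta ~ Beta(a, a) and X^n is i.i.d. Bernoulli(theta):
    every sequence with k ones has probability B(a+k, a+n-k) / B(a, a)."""
    def log_beta(x, y):
        return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
    total = 0.0
    for k in range(n + 1):
        logp = log_beta(a + k, a + n - k) - log_beta(a, a)  # natural log
        total -= math.comb(n, k) * math.exp(logp) * (logp / math.log(2))
    return total

def conditional_entropy(n, grid=100000):
    """E_theta[H_n(theta)] = n * E[h(theta)] under Beta(1/2, 1/2), using the
    exact substitution theta = sin^2(u), which makes u uniform on (0, pi/2)."""
    s = 0.0
    for i in range(grid):
        u = (i + 0.5) * (math.pi / 2) / grid
        s += h_bits(math.sin(u) ** 2)
    return n * s / grid

n = 20
hx = mixture_entropy(n)
hc = conditional_entropy(n)
print(hx, hc, hx - hc)  # the gap is I(X^n; theta), the redundancy term
```

Even at n = 20 the gap is on the order of (1/2) log2 n bits, which no encoder-only side information can remove, consistent with the claim above.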

C. UcompD CODING STRATEGY
Considering the UcompD strategy, by Assumption 1 (appendix), the two sources µ θ (1) and µ θ (2) are d-dimensional parametric ergodic sources. In other words, any pair (x n , y m ) ∈ X n × X m occurs with non-zero probability and the support set of (x n , y m ) is equal to the entire X n × X m . Therefore, the knowledge of the side information sequence y m at the decoder does not rule out any possibilities for x n at the decoder. Hence, we conclude that side information provides no reduction in average codeword length (see [33] and the references therein for a discussion on zero-error coding). However, this is not the case in almost lossless source coding. See [21] for an almost lossless code in this case.
The following intuitive inequality demonstrates that the redundancy decreases when side information is available.
Lemma 5: For all n, m, t ≥ 0, we have

R_{n,m,t} ≤ R_n,

with equality if and only if min{n, m, t} = 0.

Proof: First notice that R_{n,m,t} = I(X^n; θ^(1) | Y^m) and R_n = I(X^n; θ^(1)), and hence the inequality follows by applying Lemma 9 (appendix) and noticing the Markov chain X^n → θ^(1) → Y^m. Equality holds if and only if I(X^n; Y^m) = 0. We just need to show that I(X^n; Y^m) = 0 if and only if min{n, m, t} = 0. If n = 0 or m = 0, then I(X^n; Y^m) = 0. If t = 0, then θ^(1) and θ^(2) are independent by Lemma 3; hence, X^n and Y^m are also independent. Conversely, assume that n, m > 0; then, by Lemma 3, I(X^n; Y^m) = 0 only if t = 0, completing the proof.
According to Lemma 5, side information cannot hurt, which is intuitively expected. However, there is no benefit provided by the side information when the two parameter vectors of the sources S 1 and S 2 are independent. This is not surprising as when θ (1) and θ (2) are independent, then X n (produced by S 1 ) and Y m (produced by S 2 ) are also independent. Thus, the knowledge of y m does not affect the distribution of x n . Hence, y m cannot be used toward the reduction of the codeword length for x n .
Next, we present our main result on the average maximin redundancy for strictly lossless UcompED coding.

Theorem 2: For the UcompED coding strategy,

R_{n,m,t} = R(n, m, t) + O_n(1),

where R(n, m, t) is defined as

R(n, m, t) := (d/2) log ( 1 + n / m̂(m, t) ),  (31)

and m̂(·, ·) is given by the following:

m̂(m, t) := 1 / ( 1/m + 2/t ).  (32)

Proof: Recall that R_{n,m,t} = I(X^n; θ^(1) | Y^m). Further, note the following Markov chain:

X^n → θ^(1) → Z^t → θ^(2) → Y^m.

Assume first that min{m, t} = ω_n(1), i.e., both grow unbounded with n. Then, we can rely on the asymptotic normality of all of the variables above; noting that θ̂^(1)(X^n) is a sufficient statistic for X^n, given Y^m the parameter θ^(1) is Gaussian distributed with mean θ̂(Y^m) and covariance (1/m̂) I^{-1}(θ^(1)), where 1/m̂ = 2/t + 1/m. Hence, invoking Lemma 10 (appendix), we arrive at the desired result.

For min{m, t} = O_n(1), notice that from Lemma 9 we can deduce that

R_n − I(θ^(1); Y^m) ≤ R_{n,m,t} ≤ R_n.

Hence, the result is concluded by noting that I(θ^(1); Y^m) = O_n(1) in this case.
Theorem 2 characterizes the average maximin redundancy in the case of UcompED with side information from a correlated source. If the sources are not sufficiently correlated or the side information string is not long enough, then not much performance improvement is expected and the redundancy is close to that of the Ucomp strategy. On the other hand, for sufficiently correlated information sources with a sufficiently long side information string, one expects that the redundancy would be significantly reduced. In a sense, m̂(m, t) can be thought of as the effective length of the side information string. When t → ∞, we see that m̂(m, t) ≈ m, while for smaller t, we see that m̂(m, t) < m.
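The effective-length interpretation can be sketched numerically. The harmonic form m̂(m, t) = 1/(1/m + 2/t) below is our reading of the conditional variance 2/t + 1/m appearing in the proof of Theorem 2, so treat the constants as assumptions of this sketch:

```python
import math

def effective_length(m, t):
    """Effective side-information length m_hat(m, t) = 1 / (1/m + 2/t),
    the harmonic combination suggested by the conditional variance 2/t + 1/m.
    t = inf recovers m_hat = m (duplicate sources)."""
    if t == math.inf:
        return m
    return 1.0 / (1.0 / m + 2.0 / t)

def side_info_redundancy(n, m, t, d):
    """Leading term of R(n, m, t): (d/2) log2(1 + n / m_hat(m, t))."""
    return 0.5 * d * math.log2(1 + n / effective_length(m, t))

d = 256 * 255
n = 32 * 1024
for t in (10**4, 10**6, math.inf):
    print(t, effective_length(4 * 2**20, t), side_info_redundancy(n, 4 * 2**20, t, d))
```

As t grows, the effective length approaches the memory size m and the residual redundancy shrinks; for weakly correlated sources the memory is effectively capped at t/2 regardless of how large m is.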

VI. SIDE INFORMATION GAIN
In this section, we define and characterize the side information gain for the different coding strategies described in Section III. The side information gain is defined as the ratio of the expected codeword length of traditional universal compression (i.e., Ucomp) to that of universal compression with side information from a correlated source (i.e., UcompED):

g_{n,m,t}(θ) := E { l_n(X^n) } / E { l_{n,m}(X^n, Y^m) }.  (38)

In other words, g_{n,m,t}(θ) is the side information gain on a string of length n drawn from µ^n_θ and compressed using the UcompED coding strategy with a side information string of length m drawn from a correlated source with degree of correlation t.
The following is a trivial lower bound on the side information gain.
Lemma 6: For all n, m, t ≥ 0, and θ ∈ Λ_d:

g_{n,m,t}(θ) ≥ 1.

Proof: This is proved by invoking Lemma 5.

Next, we present our main result on the side information gain in the next theorem.
Theorem 3: For the UcompED coding strategy,

g_{n,m,t}(θ) ≥ ( H_n(θ) + R_n ) / ( H_n(θ) + R(n, m, t) ) + o_n(1),

where R(n, m, t) is defined in (31).

Proof: The theorem is proved by invoking Theorem 2 and light algebraic manipulations.
Consider the case where the string length n grows to infinity. Intuitively, we would expect the side information gain to vanish in this case.
Let us demonstrate the significance of the side information gain through an example. We let the information source be a first-order Markov source with alphabet size k = 256. We also assume that the source is such that H_n(θ)/n = 0.5 bits per source character (byte). In Fig. 4, the lower bound on the side information gain is shown as a function of the sequence length n for different values of the memory size m. As can be seen, significant improvement in compression may be achieved using memorization. For example, the lower bound on g_{32kB,m,∞}(θ) is equal to 1.39, 1.92, 2.22, and 2.32, for m equal to 128kB, 512kB, 2MB, and 8MB, respectively. Further, g_{512kB,∞,∞}(θ) = 2.35. Hence, more than a factor of two improvement is expected on top of traditional universal compression when network packets of lengths up to 32kB are compressed using side information. See [17, Section III] for practical compression methods that aim at achieving these improvements.

As demonstrated in Fig. 4, the side information gain for a memory of size 8MB is very close to g_{n,∞,∞}(θ), and hence, increasing the memory size beyond 8MB does not result in a substantial increase of the side information gain. On the other hand, we further observe that as n → ∞, the side information gain becomes negligible regardless of the length of the side information string. For example, at n = 32MB, even when m → ∞, we have g_{32MB,∞,∞} ≈ 1.01, which is a subtle improvement. This is not surprising, as the redundancy that is removed via the side information is O(log n), and hence the gain in (38) is 1 + O(log n / n), which approaches one as n grows. Thus far, we have shown that significant performance improvement is obtained from side information in the compression of finite-length strings from low-entropy sources. As was also evident in the previous example, as the size of the memory increases, the performance of universal compression with side information improves.
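The qualitative behavior of the gain is easy to reproduce in a toy d = 1 instance. The paper's high-dimensional Markov example relies on finite-length bounds from [16] that are not reproduced here, so this sketch (our own construction) only illustrates the trends: the gain exceeds one, grows with m, and fades as n → ∞:

```python
import math

def gain_toy(n, m, h_rate):
    """Side-information gain in a toy d = 1 (Bernoulli) instance, using the
    Clarke-Barron redundancy R_n ~ (1/2) log2(n / (2 pi e)) + log2(pi) and the
    with-side-information redundancy R(n, m) ~ (1/2) log2(1 + n/m) (t -> inf).
    h_rate is the entropy rate in bits per symbol."""
    h = h_rate * n
    r_no_side = 0.5 * math.log2(n / (2 * math.pi * math.e)) + math.log2(math.pi)
    r_side = 0.5 * math.log2(1 + n / m)
    return (h + r_no_side) / (h + r_side)

for m in (10**3, 10**4, 10**5):
    print(m, gain_toy(1000, m, 0.1))
print(gain_toy(10**7, 10**5, 0.1))  # gain approaches 1 for very long strings
```

With d = 1 the absolute gain is small; in the paper's setting the effect is magnified by the factor d ≈ 6.5 × 10^4, which is why packet-level gains above a factor of two are possible.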
However, there is a certain memory size beyond which increasing the side information length does not provide further compression improvement. Next, we quantify the size of memory required so that most of the benefits of memory-assisted compression are realized.
Let g_{n,t}(θ) be defined as

g_{n,t}(θ) := lim_{m→∞} g_{n,m,t}(θ).

It is straightforward to see that g_{n,t}(θ) is the limit of the side information gain as the effective side information string length m̂(m, t) grows, where m̂(·, ·) is defined in (32). Then, the following theorem determines the size of the memory required for achieving a (1 − δ) fraction of the gain for unlimited memory.

Theorem 4: Let m_δ^n(θ) be defined as

m_δ^n(θ) := (1 − δ) d n log e / ( 2 δ H_n(θ) ).

Then, for any m, t ≥ 0 such that m̂(m, t) ≥ m_δ^n(θ), we have g_{n,m,t}(θ) ≥ (1 − δ) g_{n,t}(θ).

Proof: We need to show that for m̂(m, t) ≥ m_δ^n(θ), we have

H_n(θ) / ( H_n(θ) + R(n, m, t) ) ≥ 1 − δ.

By the definition of m_δ^n(θ), for any m̂(m, t) ≥ m_δ^n(θ), we have

(d/2) ( n / m̂(m, t) ) log e ≤ ( δ / (1 − δ) ) H_n(θ).

By noting that log(1 + n/m̂) ≤ (n/m̂) log e, we obtain R(n, m, t) ≤ ( δ / (1 − δ) ) H_n(θ), and hence, the proof is completed by noting the definition of R(n, m, t) in (31) and light algebraic manipulations.

Theorem 4 determines the size of the memory that is sufficient for the gain to be at least a fraction (1 − δ) of the gain obtained as m → ∞. Considering our working example of the first-order Markov source in this section with H_n(θ)/n = 0.5, and δ = 0.01, we find that m_δ^n(θ) ≈ 8.9MB is sufficient for the gain to reach 99% of its maximum, confirming our previous observation. This also complements the practical observations reported in [17, Section IV.C].
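The memory threshold is a one-line computation. The expression below follows our reading of the proof above, i.e., requiring R(n, m, t) ≤ δ/(1−δ) · H_n(θ) under the bound R ≤ (d/2)(n/m̂) log2(e); with d = 256 × 255 and an entropy rate of 0.5 bits per byte it reproduces a threshold of roughly 8.9MB:

```python
import math

def memory_threshold_bytes(n_bytes, d, h_rate_bits_per_byte, delta):
    """Sufficient effective side-information length (in source characters) for
    reaching a (1 - delta) fraction of the unlimited-memory gain:
        m_hat >= (1 - delta) * d * n * log2(e) / (2 * delta * H_n).
    Since H_n scales linearly with n here, the result is independent of n."""
    h_n = h_rate_bits_per_byte * n_bytes
    return (1 - delta) * d * n_bytes * math.log2(math.e) / (2 * delta * h_n)

d = 256 * 255  # first-order Markov source, 256-character alphabet
m_bytes = memory_threshold_bytes(512 * 1024, d, 0.5, 0.01)
print(m_bytes / 2**20, "MB")
```

Note that the required memory scales linearly in d and 1/δ: richer source families and tighter gain targets both demand proportionally more memorized side information.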

VII. IMPACT OF PREFIX CONSTRAINT
Thus far, all of the results of this paper concern prefix-free codes that satisfy the Kraft inequality in (19). However, we remind the reader that our main application is network packet compression. In this case, the code need not be uniquely decodable (i.e., satisfy the Kraft inequality), as the beginning and the end of each block are already determined by the header of the packet. Thus, the unique decodability condition is too restrictive and can be relaxed. It is only necessary for the mapping (the code) to be injective so as to ensure that one block of length n can be uniquely decoded. Such codes are known as one-to-one codes. These are also called nonsingular codes in [27, Chapter 5.1]. An interesting fact about one-to-one codes is that while the average codeword length of prefix-free codes can never be smaller than the Shannon entropy, the average codeword length of one-to-one codes can go below the entropy (cf. [34]-[38] and the references therein).
Let l_n^*(·) denote a strictly lossless one-to-one length function. Further, denote L_n^* as the collection of all one-to-one codes (injective mappings to binary sequences) on sequences of length n. Let R_n^*(l_n^*, θ) denote the average redundancy of the one-to-one code, which is defined in the usual way as

R_n^*(l_n^*, θ) := E l_n^*(X^n) − H_n(θ).

Further, define

R_n^* := max_{q ∈ P(Λ_d)} min_{l_n^* ∈ L_n^*} E_{θ ∼ q} R_n^*(l_n^*, θ),

where P(Λ_d) denotes the set of probability measures on Λ_d.

Theorem 5: The following bound holds:

R_n^* ≥ R_n − log H̄_n − O_n(1),

where R_n is the average maximin redundancy for prefix-free codes given in (24) and H̄_n is given by

H̄_n := E_{θ ∼ w_J} H_n(θ).

Proof: Assume that θ follows Jeffreys' prior, and invoke the main theorem in [34] to obtain a lower bound on E{l_n^*(X^n)}. The proof is completed by observing that log H̄_n ≤ log n and noting that the average redundancy for the case where θ follows Jeffreys' prior provides a lower limit on the average maximin redundancy.
Theorem 5 shows that the compression overhead, as measured against the entropy, is ((d − 2)/2) log n + O_n(1). However, as discussed earlier, non-universal one-to-one codes achieve an average codeword length that can go below the entropy. In particular, for the family of parametric sources studied in this paper, for almost all θ ∈ Λ_d, it is shown that the average codeword length is given by H_n(θ) − (1/2) log n + O_n(1) [35], [36], [38]. Hence, the cost of universality is ((d − 1)/2) log n + O_n(1). See [39], [40] for a more complete study of the one-to-one universal compression problem. Additionally, see [41] for new insights on why the cost of universality scales with one less parameter in one-to-one compression, i.e., (d − 1)/2, as compared to d/2 for prefix-free codes.
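The below-entropy behavior of one-to-one codes is easy to exhibit: the optimal one-to-one code sorts outcomes by decreasing probability and assigns the binary strings '', '0', '1', '00', ... in order, so the i-th most probable outcome (1-indexed) gets length floor(log2 i). A sketch for i.i.d. Bernoulli(0.1) strings of length 8 (the length and bias are our example parameters):

```python
import math
from itertools import product

def optimal_one_to_one_length(probs):
    """Expected length of the optimal one-to-one (nonsingular) code:
    the i-th most probable outcome (1-indexed) gets length floor(log2(i))."""
    probs = sorted(probs, reverse=True)
    return sum(p * math.floor(math.log2(i)) for i, p in enumerate(probs, start=1))

n, p = 8, 0.1
seq_probs = [
    (p ** sum(x)) * ((1 - p) ** (n - sum(x))) for x in product((0, 1), repeat=n)
]
entropy = n * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))
avg_len = optimal_one_to_one_length(seq_probs)
print(avg_len, entropy)  # the one-to-one average length falls below the entropy
```

This is a non-universal code (it knows p exactly); the point of Theorem 5 is that once the source is unknown, this below-entropy advantage is swamped by the ((d − 1)/2) log n cost of universality.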
It is desirable to characterize how much reduction is offered by universal one-to-one compression compared with prefix-free universal compression. We compare the performance of universal one-to-one codes with that of universal prefix-free codes through the running numerical example from Section II. This example is based on a first-order Markov source with alphabet size |X| = 256, where the number of source parameters is d = 256 × 255 = 65280. Note that we have not provided an actual code for one-to-one universal compression; we compare the converse bound of Theorem 5 with the average maximin redundancy of universal prefix-free codes. Fig. 5 compares the minimum average number of bits per symbol required to compress the class of first-order Markov sources, normalized to the entropy of the sequence, for different values of the entropy rate in bits per source symbol (per byte). As can be seen, relaxing the prefix constraint at best does not offer a meaningful improvement in compression performance, as the curves for the prefix-free codes and the one-to-one codes almost coincide. This leads to the conclusion that universal one-to-one codes are not of much practical interest. On the other hand, if the source entropy rate is 1 bit per byte (H_n(θ)/n = 1), the compression rate on sequences of length 32kB (for both prefix-free and one-to-one codes) is around 2.25 times the entropy rate, which results in more than 100% overhead on top of the entropy rate for both prefix-free and one-to-one universal codes. Hence, we conclude that the average redundancy poses a significant overhead in the universal compression of finite-length low-entropy sequences, such as Internet traffic, which cannot be compensated for by dropping the prefix constraint. Hence, the side information gain provided by a correlated information source is essential even if the prefix constraint is dropped.

VIII. A NETWORK CASE STUDY
In this section, we demonstrate how the side information gain can be leveraged for the compression of network traffic. Assume that the source S is the CNN server and the packet size is n = 1kB. Further, assume that the memory size is 4MB. In Section II, we demonstrated that, for this source and packet size, the average compression rate of Ucomp is (1/n)E{l_n(X^n)} = 4.42 bits per byte. We further expect the side information gain for such a packet size to be at least g = 5. The rest of this discussion concerns how the side information gain impacts the overall performance in the network.
We define the network-wide gain of side information, measured in bit×hop (BH), for the sample network presented in Fig. 6, where M denotes the memory element. Assume that the server S serves the client C in the network, and that the intermediate nodes R_i are not capable of memorization. Recall that the side information gain g is only achievable on a link if both the encoder and the decoder have access to the side information string. Let d(S, C) denote the length of the shortest path from S to C, which is clearly d(S, C) = 3, e.g., using the path e_1, e_5, e_10. Let BH(S, C) denote the minimum bit×hop cost required to transmit the sequence (of length n) from S to C without any compression mechanism, which is BH(S, C) = 24 kbits (i.e., 1kB × 8 bits/byte × 3). In the case of end-to-end universal compression, i.e., using Ucomp, on average we need BH_Ucomp = E{l_n(X^n)} d(S, C) bit×hop for the transmission of a packet of length n to the client.
On the other hand, in the case of universal compression with side information, i.e., using UcompED, on every link of the path from the server to the memory element M we can leverage the side information, and hence we only require (1/n)E{l_{n,m}(X^n, Y^m)} = (1/(ng))E{l_n(X^n)} bit transmissions per source character delivered to the memory element. The memory element M then decodes the received codeword using the UcompED decoder and the side information string y^m, and re-encodes the result using the Ucomp encoder for the final destination (the client C). In this example, this implies that we transmit (2/n)E{l_{n,m}(X^n, Y^m)} bit×hop on average from S to M on the links e_1 and e_3 (where d(S, M) = 2) for each source character. Then, we transmit the message using (1/n)E{l_n(X^n)} bit×hop per source character from M to C on the link e_9. Let BH_UcompED be the minimum bit×hop cost for transmitting the string (of length n) using network compression that leverages the side information on the S → M path, i.e., BH_UcompED = 2E{l_{n,m}(X^n, Y^m)} + E{l_n(X^n)} = (2/g + 1)E{l_n(X^n)}. Further, let G_BH be the bit×hop gain of network compression, defined as G_BH = BH_Ucomp / BH_UcompED = 3g/(g + 2). Thus, G_BH = 15/7 ≈ 2.14 in this example by substituting g = 5. In other words, network compression (using UcompED on the S → M path) achieves more than a factor-of-2 saving in bit×hop over traditional universal compression of the packet (using Ucomp from S to C) in the sample network.
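The bit×hop accounting above reduces to simple arithmetic, sketched below with the numbers of the running example (E{l_n(X^n)}/n = 4.42 bits per byte, g = 5, d(S, C) = 3, d(S, M) = 2); the variable names are illustrative, not from the paper:

```python
# Bit-hop cost comparison for the sample network of Fig. 6.
n_bytes = 1024            # packet size: n = 1kB
rate_ucomp = 4.42         # Ucomp average rate, bits per byte
g = 5.0                   # side information gain
d_SC = 3                  # hops from server S to client C
d_SM = 2                  # hops from S to the memory element M (links e_1, e_3)
d_MC = 1                  # hops from M to C (link e_9)

# Uncompressed baseline: 1kB x 8 bits/byte x 3 hops = 24 kbits.
bh_raw = n_bytes * 8 * d_SC

# End-to-end universal compression (Ucomp on every hop).
bh_ucomp = rate_ucomp * n_bytes * d_SC

# Network compression: UcompED (rate reduced by g) on S -> M, Ucomp on M -> C.
bh_ucomped = (rate_ucomp / g) * n_bytes * d_SM + rate_ucomp * n_bytes * d_MC

g_bh = bh_ucomp / bh_ucomped  # = 3g / (g + 2), i.e., 15/7 for g = 5
print(f"BH(S,C) = {bh_raw} bits, G_BH = {g_bh:.2f}")
```

Note that G_BH is independent of the packet size and of the absolute Ucomp rate: both cancel in the ratio, leaving only the hop counts and the side information gain g.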
In [17], we fully characterize the scaling of the bit×hop gain G_BH for scale-free networks (random power-law graphs) as a function of the side information gain g. We show that G_BH ≈ g if the fraction of nodes in the network equipped with memorization capability is larger than a phase-transition cutoff. We refer the interested reader to [17] for more details.

IX. CONCLUSION
In this paper, we formulated and studied universal compression with side information from a correlated source. We showed that redundancy can impose a significant overhead in the universal compression of finite-length sequences, such as network packets. We put forth a notion of correlation between information sources in which the degree of correlation is controlled by a single hyperparameter. We showed that side information from a correlated source can significantly suppress the redundancy in universal compression. We defined the side information gain and showed that it can be large with a reasonable side information size for small strings, such as network packets. We showed that this gain is largely preserved even if the code is only required to be almost lossless, i.e., allowing a sufficiently small error probability that vanishes asymptotically. We also showed that dropping the prefix constraint does not remedy the universal compression problem. Finally, we showed how these benefits apply to network compression in a case study.

ASSUMPTIONS AND PROOFS
Assumption 1 (Regularity Conditions): The following regularity conditions are assumed to hold for the parametric model so that our results can be derived.
1) The parametric model is smooth, i.e., twice differentiable with respect to θ in the interior of the parameter space, so that the Fisher information matrix is well defined. Further, the limit in (2) exists.

2) The determinant of the Fisher information matrix is finite for all θ in the interior of the parameter space, and the normalization constant in the denominator of (3) is finite.
Proof: This is a well-known result on Markov chains. It can be proved by expanding I(X; Y, Z) via the chain rule in both orders and noting that I(X; Z|Y) = 0 due to the Markov chain.
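For completeness, the chain-rule expansion alluded to in the proof can be written out explicitly (the statement being proved is not shown in this excerpt; the identity below is the standard argument for Markov chains):

```latex
\begin{aligned}
I(X; Y, Z) &= I(X; Z) + I(X; Y \mid Z) && \text{(chain rule, conditioning on $Z$ first)} \\
           &= I(X; Y) + I(X; Z \mid Y) && \text{(chain rule, conditioning on $Y$ first).}
\end{aligned}
```

Since X → Y → Z implies I(X; Z|Y) = 0, equating the two expansions gives I(X; Z) = I(X; Y) − I(X; Y|Z) ≤ I(X; Y).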
Lemma 10: Let X → Y → Z form a Markov chain, where X, Y, Z are all Gaussian distributed and supported on R^d. Further, let X follow a non-informative improper uniform distribution on R^d. Let Y be a noisy observation of X with variance σ^2, i.e., Y = X + N_1, where N_1 ∼ N(0_d, σ^2 I_d) is independent of X, and 0_d and I_d denote the d-dimensional all-zero vector and identity matrix, respectively. In the same way, let Z be a noisy observation of Y with variance τ^2. Then,
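Although the statement of Lemma 10 is cut off at "Then," in this excerpt, one immediate consequence of the stated setup follows from the additivity of independent Gaussian noise (this is standard Gaussian algebra, not a reconstruction of the lemma's actual conclusion):

```latex
Z = Y + N_2 = X + \underbrace{(N_1 + N_2)}_{\sim\,\mathcal{N}\left(0_d,\,(\sigma^2 + \tau^2) I_d\right)},
```

so Z is itself a noisy observation of X with variance σ^2 + τ^2, and the two-hop chain X → Y → Z collapses to a single Gaussian channel from X to Z.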