Tabular Interpolation Approach Based on Stable Random Projection for Estimating Empirical Entropy of High-Speed Network Traffic

The empirical entropy of the network flow attributes is an essential measure for identifying anomalous network traffic. However, computing the exact entropy values for high-speed networks in real-time is computationally expensive. Accordingly, the present study replaces the complex computations of existing stable random projection methods for entropy estimation with a simple table lookup procedure. Notably, the size of the lookup table is reduced through a piece-wise linear interpolation heuristic in order to facilitate the implementation of the proposed scheme in resource-constrained pipeline environments. The proposed architecture enables entropy estimation to be performed using both the Log-Mean Estimator (LME) method and the New Estimator of Compressed Counting (NECC) algorithm reported in the literature. The feasibility of the proposed approach is verified empirically using both real-world network traffic traces and synthetic data streams. Moreover, the practical applicability is demonstrated via stream-based implementation in the programmable data planes of the NetFPGA-Plus framework and a Tofino P4 switch, respectively. The results indicate that the proposed tabulation-based entropy estimation scheme allows minimum-sized Ethernet frames to be processed with a wire speed of up to several hundred gigabits per second.


I. INTRODUCTION
Information-theoretic methods provide effective means of detecting anomalies in network traffic. One such measure is the empirical Shannon entropy, defined as where m is the total elements in the data stream, m i is the frequency of item i appears in the stream, and n is the total number of distinct items in the stream. As shown in Equation (1), the entropy H reaches its minimum value of zero The associate editor coordinating the review of this manuscript and approving it for publication was Cong Pu .
when all the items in the stream are the same, and reaches its maximum value when all of the items are different. Compared to the volume-based traffic anomaly detection methods [1], entropy-based approaches provide a high-sensitive detection capability [2] and finer-grained insights into the network behavior [3] without requiring continuous volumechange monitoring.
Entropy-based methods thus provide a more feasible approach for detecting both low-and high-rate [4] Distributed Denial of Service (DDoS) attacks [5], [6] and for differentiating between Flash Events (FE) and DDoS attacks [7]. However, developing effective techniques [8] for measuring the entropy of high-speed network traffic in real-time still remains a significant challenge [9].

A. MOTIVATIONS
Since 2010, as the IEEE 100 Gbps network standards (IEEE802.3ba) [10] were introduced, wire speeds in data center networks have increased tremendously in order to meet the emerging needs of "Anything as a Service" (XaaS) technologies [11], [13] for low-latency network functions with high-bandwidth and enhanced flexibility. However, as the deployment of high-speed networks has increased, the scale, severity, and number of malicious attacks also increased. These attacks are capable of causing substantial economic losses and disruption [12]. Hence, effective methods for real-time detection of anomalous traffic in large-scale, high-speed networks are urgently required [6], [13]. However, computing the exact entropy values of high-speed and high-volume packet streams for networking applications is extremely challenging, particularly in environments with limited system resources. Moreover, it is generally necessary to compute the entropy of one or multiple combinations of traffic features simultaneously, which further increases the cost and complexity of the detection process.
As shown in Equation (1), a software-based implementation of the entropy computation is straightforward for offline, postmortem analyses of data anomalies in low-speed networks. However, given the high cardinality nature of the network traffic in high-speed networks, sampling techniques [14] are required to overcome the time and space complexity of real-time packet processing. Furthermore, the sampling processes may create data losses, which introduce distortion and measurement bias [15] into the anomaly detection process.
Frequency moment estimation [16] is an essential building block for many stream-based applications and has been widely used to simplify and accelerate the entropy estimation process for high-speed network traffic [17], [18], [19]. In the year 2000, Indyk proposed a unified framework [20] to estimate the L p norm of a packet stream Φ where p ∈ {1, 2} based on an α stable distribution. This foundation was later used in [8], [21], [22], and [23] to perform entropy estimation. However, generating the random values required to compute the entropy from the stable distribution involves complex floating-point multiplication, division, logarithmic, and trigonometric operations, which impose a significant processing bottleneck on the estimation process.
Accordingly, motivated by the work of Cormode [24], this study presents a tabulation-based methodology for estimating the empirical Shannon entropy based on the stable random projection method [20] and a piece-wise linear interpolation technique. The overall goal of the proposed method is to obtain high-accuracy estimates of the network entropy with a rapid processing speed. Moreover, achieving low computational and memory cost is also essential. Thereby, supporting real-time anomaly detection in high-speed networks in even low-resource systems is feasible.

B. CONTRIBUTIONS
This paper presents a tabulation-based method, designated as the k-parallel lookup with m-hash, for estimating the empirical Shannon entropy using the stable random projection framework proposed by Indyk [20]. The key component of the proposed method is the use of an inverse transform sampling technique to construct an empirical distribution function in a read-only lookup table. To facilitate the implementation of the proposed scheme in resource-constrained environments, the size of the lookup table is reduced through the use of a piece-wise linear interpolation heuristic based on three adaptive parameters (Span, Exponential Head, and Exponential Tail).
The proposed architecture can support both the Log-Mean Estimator (LME) method proposed by Clifford and Cosma [23] and the New Estimator of Compressed Counting (NECC) [25] proposed by Li [26]. The feasibility of the proposed method is verified using both real-world network traffic traces and synthetic data streams. Moreover, the practical applicability is demonstrated via stream-based implementation in the programmable data planes of an Xilinx U200 FPGA and Tofino P4 switch, respectively. The PoC design is capable of processing minimum-sized Ethernet frames at 100 Gbps wire-speed.
The remainder of this paper is organized as follows. Section II presents the background and related work on stream-based entropy estimation. Section III briefly outlines the problem considered in the present study. Section IV introduces the proposed tabulation-based entropy estimation method and interpolation technique, and briefly describes the system implementation. Section V discusses the key parameters of the proposed scheme and explores the corresponding design space. Section VI evaluates the performance of the proposed architecture using both real-world and synthetic traffic traces. Section VII presents the system implementation on the FPGA platform and P4 switch. Section VIII discusses the comparisons of evaluated results, implementation flexibility, and limitation. Finally, Section IX provides some brief concluding remarks and indicates the intended direction of future research.

II. BACKGROUND AND RELATED WORKS
A. STREAM COMPUTATION Data stream computation [27] has been an active research topic ever since the Internet started to undergo exponential growth. Typically, a data stream Φ = (a 1 , a 2 , . . . , a m ) consists of m elements, where some are distinct and others are repeated. The elements arrive at the observation point sequentially at time t, where the t th element a t = (key t , d t ) consists of a key t ∈ [n] and an update d t ∈ R. The space of set [n] has a maximum value of 2 104 , if the packet stream measurement is based on the 5-tuple traffic flow consisting of the protocol number, source, destination of TCP/UDP ports and IPv4 addresses. Thus, researchers have proposed numerous algorithms for summarizing such massive data flows in a one-pass fashion [28]. Such stream-based algorithms, commonly known as (ε, δ)-approximation algorithm, are not only capable of estimating the measurement outcome accurately with an error of less than ε with a high probability of 1 − δ, but are also highly suited to implementation in embedded networking systems with only limited memory and computing resources.
Indyk [20] proposed a framework for estimate the L p norm of a packet stream Φ based on the fact that the magnitude of the product of the key of each data item in stream Φ and the corresponding random variable R drawn from an α-stable distribution is proportional to the L p norm of a packet stream Φ, where p ∈ {1, 2}. This framework was later taken as the foundation for many proposals for stream-based entropy estimation [8], [21], [22], [23], [25]. (Note that for a more in-depth introduction to stable distributions and their related applications are available in the discussion paper of Borak et al. [39] and the study of Cormode and Indyk [40] on high-speed data stream processing.)

C. STREAM-BASED ENTROPY ESTIMATION
As shown in Equation (1), the computation process for estimating the empirical Shannon entropy comprises two parts: (1) determining the frequency statistics of each arriving distinct item m i , and (2) performing logarithmic and summation operations. Many entropy estimation algorithms have been proposed over the years. These algorithms can be broadly categorized into three main groups: (1) Alon-Matias-Szegedy (AMS) sampling [16], (2) Hash tables with sketch data structures, and (3) Random projection based on stable distributions. Table 1 presents a qualitative comparison of these stream-based entropy estimation schemes with that proposed in the present study.

1) AMS SAMPLING
Lall et al. [41] presented a data streaming algorithm for estimating the entropy norm m i logm i of high-speed networks based on the second frequency moment estimation obtained via AMS sampling [16]. The authors also presented a sieving methodology for improving the estimation accuracy by separating the larger flows from the smaller flows. The experimental results obtained for a traffic trace consisting of 6 million distinct counts and 67 million packets showed that the proposed algorithm consumed approximately 1.4 Mbytes of memory space and enabled the entropy to be estimated with at most 25% relative error with a probability of 75%.
Chakrabarti et al. [18] proposed a similar AMS-based approach for estimating the entropy and entropy norm of data streams. However, while a comprehensive theoretical performance analysis was given, no implementation details were provided.
Bhuvanagiri and Ganguly [46] presented a Hierarchical Sampling over Sketches (HSS) methodology for estimating the entropy over data streams. In the proposed approach, O(log m) levels of data structures were created from the original data stream, and the Count-Min [47] and Count sketch [48] data structures were then used to estimate the top-k items and frequency of each item at each level. The total memory space requirement was shown to be O(( top−k ε log m+ log 1 ε )(log 3 m)(log 1 δ )). Chakrabarti et al. [49], [50] proposed a near-optimal stream-based algorithm for estimating the empirical entropy using the AMS algorithm [16], which requires O(ε −2 log 1 δ log m) words of memory space. The implementation described in the paper utilized a reservoir sampling approach and maintained the associate counters in a dictionary structure. Moreover, two heap structures were employed for the the primary and backup samples, respectively. The per-item processing time in the stream was shown to be O(log 1 ε + loglog 1 δ + loglog m). Harvey et al. [51] approximated the Shannon entropy by estimating the Renyi entropy under the boundary condition of α → 1and sufficiently large values of ε. It was shown that, based on the AMS algorithm [16], the proposed scheme consumed just consumed O(ε −4 log 4 m) words of memory space. In a later study [19], the same group proposed another near-optimal algorithm for estimating the Tsallis entropy and Shannon entropy with α → 1 using just O(ε −2 log m) words of memory space.
Lapolli et al. [42] proposed a pipeline scheme based on AMS sampling algorithm [16] and the Count sketch [48] data structure for performing entropy estimation in the programmable data plane for DDoS attack detection. In the proposed method, the complex logarithmic computations required to estimate the Shannon entropy were replaced with a simple lookup operation on a Longest Prefix Match (LPM) table using Ternary Content-Addressable Memory (TCAM). The experimental results showed that the memory space required to monitor a single 1Gbps link consisting of 2 18 packets in a 250-millisecond observation time was just 58.125 Kbytes. However, a total of 9 Mbytes [42] were required in a high-speed device to monitor 24 links of 10 Gbps.

2) HASH TABLE WITH SKETCH DATA STRUCTURE
To compute the empirical Shannon entropy in Equation (1), the frequency count must be updated for every packet. Moreover, the update process should be performed at wire speed. Accordingly, Bartos and Martin [52] proposed a simple hardware accelerator for performing the update process using a logarithmic table with the linear interpolation technique. It was reported that a total memory space of 4.5 Mbits was required to compute the Shannon entropy for ten features of networking traffic. However, no details were given of the traffic traces, measurement accuracy, or attainable throughput.
Soto et al. [44] presented a high-throughput hardware accelerator for estimating the entropy of network traffic, in which the estimation process focused mainly on the top-k flows since the least frequent flows had no significant effect on the entropy of the data stream. The core of the accelerator consisted of a priority queue (PQ) array for the top-k flow selection and utilized the Count-Min sketch with Conservative Updates (CM-CU) [47] to evaluate the frequency statistics of the traffic flows. The system consumed approximately 560 K bytes (CM-CU + PQ array) of on-chip Block RAM and achieved a relative estimation error of less than 3% and mean value of 1.67% when evaluated using several realworld networking traffic traces. Moreover, the minimum processing throughput was shown to be 204 Gbps with a 400 Mhz system clock.
Ding et al. [45] proposed a framework of P4DDoS for detecting DDoS attacks in the data plane of P4 switches by estimating the empirical entropy based on the Count-Min [47], Count sketch [48], and P4LogLog with low relative error. The algorithm, designated as P4LogLog, was implemented using P4-supported arithmetic, bit shift, and bitwise logical operations. The performance of the proposed algorithm was evaluated using passive traffic traces drawn from the CAIDA dataset (2018) with each 5-second observation period containing around 2 21 packets. The relative estimation error was shown to be close to 3% [45] when utilizing a Count-Sketch size of 40 Kbytes (5 × 2, 000 × 32 bits).

3) RANDOM PROJECTION WITH STABLE DISTRIBUTION
Zhao et al. [8] proposed a streaming algorithm for estimating the entropy of Origin-Destination (OD) flows using the framework proposed by Indyk [20] for estimating the L p norm using a symmetric (β = 0) stable distribution S(x : α, β, γ , µ). To avoid the complex computations usually required to generate random variables from a stable distribution, the authors utilized two pre-computed lookup tables implemented using the high-throughput Direct Rambus DRAM DIMM. Up to one million entries were allocated in each table, where each entry consisted of a total of 640 bits arranged in 20 blocks. In the proposed implementation, four tables were allocated, where each table contained 80,000 entries (buckets) of twenty 32-bit blocks. The tables consumed a total memory space of 51.2 Mbits between them. According to the simulation results, the relative error of the estimated entropy was less than 10% with a probability of approximately 0.85 [8].
Li [21] proposed efficient schemes of various estimators for estimating the Shannon entropy using the Compressed Counting (CC) [26] data structure based on a maximallyskewed (β = 1) stable random projection with α → 1 and a scale parameter of γ = cos( απ 2 ). In later study, Li and Zhang later presented the New Estimator (NE) [25] for estimating the Shannon entropy, in which up to k number of sketch registers were used to accomplish the random projection process for each packet in the traffic stream. Algorithm 1 presents the corresponding pseudo code, in which the update process based on stable random projection is shown in line 14 and the final entropy estimate is obtained in line 18. As shown in Algorithm 2, the random variables were generated using an observing key (e.g., IPv4 source address) as the seed.
Clifford and Cosma [23] developed a sketching algorithm for entropy estimation over streaming data based on a maximally-skewed α-stable random distribution (β = −1) with a scale parameter γ = π 2 . Algorithm 3 shows the pseudo code of the proposed Log-Mean Estimator (LME) for an assumed stability parameter of α = 1, in which the Shannon VOLUME 10, 2022 entropy is estimated by the LME shown in line 19 based on the k sketch registers updated in line 13.

III. PROBLEM STATEMENT
As shown in Algorithms 1 and 3, the primary process of the stream-based entropy estimation schemes proposed is to project each network traffic element, represented by a key and update pair (key, d), to a random variable, R(key), drawn from an alpha-stable distribution, S(x : α, β, γ , δ). However, given a key, generating the random value from the skewed alpha-stable distribution requires complex computations involving division, logarithmic, and trigonometric operations (as shown in Algorithms 2 and 4, respectively). Consequently, the entropy estimation time is inevitably prolonged, therefore imposing a performance bottleneck on the anomaly detection process. Accordingly, the present study proposes an efficient tabulation scheme for estimating the empirical Shannon entropyĤ i of high-speed network traffic, in which the complex computations used in Algorithms 2 and 4 to generate the required random values are replaced with a simple lookup table procedure. Moreover, a piece-wise linear interpolation heuristic based on three adaptive parameters, namely Span, Exponential Head, and Exponential Tail, is proposed to minimize the total size of the lookup table compared to that in previous studies [8]. [25] for Estimating the Entropy of Data Streams Using the New Estimator of Compressed Counting (NECC) 1: Input: α, key t 2: Output:Ĥ (Φ), the estimated Shannon Entropy 3: Initialization 4: data sketch (y 0 , . . . , y k−1 ) ← (0, . . . , 0) 5: counter Y ← 0 6: ∆ ← 1 − α 7: d t ← 1 //sketch update is based on packet count 8: Function R(key t ): 9: generate the random variable R with the maximally skewed alpha-stable distribution of S(x; α → 1, 1, 1, cos( απ 2 )), by using key t as the seed. 10: // For each incoming packet with key within the observation time T 11: Update the counter Y = Y + 1 12:

Algorithm 1 Algorithm Proposed by Li and Zhang
Update y j = y j + R j (key t ) × d t 15: end for 16: // At the end of the observation time T 17 Algorithm 2 Ping Li's Pseudo Codes [25] to Generate the Random Variable R(key t ) With Maximally Skewed Alpha-Stable Distribution of S(x; α → 1, 1, 1, cos( απ 2 )) 1: Input: α, key t as the seed for generating random numbers 2: Output: random variable R(key t ) 3: Generate two random numbers U 1 , The proposed architecture is compatible with both the Log-Mean Estimator (LME) proposed by Clifford and Cosma [23] (see Algorithm 3) and the New Estimator of Compressed Counting (NECC) proposed by Li and Zhang [25] (see Algorithm 1). Moreover, through the use of a tabulation approach and the piece-wise linear interpolation heuristic, the proposed method is suitable for implementation in the programmable data plane of limited-memory-space systems with a wire speed of multi-hundred gigabit per second. Table 2 summarizes the notations used in the present study.

IV. SYSTEM DESIGN A. OBSERVATION
In general, a random number r can be generated by seeding a pseudo-random number generator with a key key t , where 0 ≤ r ≤ RAND_MAX. The random numbers U 1 , U 2 ∼ Unif(0, 1), generated in Algorithms 2 and 4, can be calculated as U 1 = r 1 /2 b and U 2 = r 2 /2 b , where parameter b represents the computational resolution of the random values and RAND_MAX =2 b − 1. Having generated these random numbers, the corresponding values of R(key t ) can be obtained accordingly. Figure 1 presents a three-dimensional (3D) figure of the random value R(key t ) for a maximally skewed alpha-stable distribution with a resolution of 1/(2 16 ) in Algorithm 4.
Based on general logarithmic properties, and grouping the variables W 1 and W 2 separately, line 6 in Algorithm 4 can be represented in the form of R(key t ) = f 1 (W 1 ) + f 2 (W 2 ).
The values of f 1 (W 1 ) and f 2 (W 2 ) can be computed in advance and stored in two lookup tables T 1 [ ] and T 2 [ ] each of size E n . The random value R(key t ) can then be obtained simply as R(key t ) = T 1 [r 1 ] + T 2 [r 2 ]. A similar technique can be applied to line 7 in Algorithm 2 such that the random value R(key t ) can be obtained as the product of T 1 [r 1 ] and T [r 2 ]. It is noted that Zhao et al. [8] adopted a similar approach with a large entry size (i.e., E n approximately one million) and allocated the tables in an off-chip Rambus DRAM.

B. PIECE-WISE LINEAR INTERPOLATION HEURISTIC
Algorithm 5 presents the offline table construction process in the proposed present study. Briefly, an inverse transform sampling approach is employed to draw up to En mc pairs of random values (U 1, U 2) ∼ Unif (0, 1) with a resolution of 1/(2 b ) (lines 21 and 23) from the alpha-stable distribution. These values, shown as blue dots in Figure 2, are then sorted in ascending order and stored in a lookup table Table MC of size En mc (line 26). The purpose here is to re-construct the empirical distribution function of R(U 1, U 2) in Algorithms 2 and 4 in a tabular form. Note that, the sorting for Algorithm 2 is based on the descending order due to the property of the distribution.
Observing the empirical distribution represented by the blue dots in Figure 2, it is seen that, with more than 98.5% probability, the values are distributed within a small range of -5 to 5. Furthermore, the empirical distribution contains only a very small number (1.5%) of values in the wider range of -5 to -17,995. Therefore, by using the piece-wise linear interpolation technique illustrated in Figure 3, only a few selected points are sufficient to approximate most of the values in the original empirical distribution [53]. Consequently, the size of the lookup table used to reproduce the random values, R(key t ), in Algorithms 2 and 4 can be substantially reduced. Crucially, the proposed scheme enables the random values not stored in the table (i.e., approximately 98% of the random values stored in the original two parameters can be tuned adaptively as required to represent the particular skew characteristics of the alpha-stable distribution. The goal is to achieve an acceptable tradeoff between the total table size (i.e., the memory consumption, smaller is better) and the entropy estimation accuracy (larger is better).
The Span table is constructed by selecting values from  Table MC with a fixed span of 2 sp , starting from index zero and ending at index 2 th head . In other words, in the entropy estimation process, the Span table is accessed for all lookup values with hash indexes less than or equal to 2 th head .
By contrast, the Exp-Head table (see upper-right corner of Figure 2) is accessed for all lookup values with hash indexes 2 n in the interval between 2 th head and 2 th tail . Finally, for lookup values with hash indexes greater than 2 th tail , the hash indexes are inverted and the Exp-Tail As shown in Algorithms 1 and 3, the number of random values generated from the skewed alpha-stable distribution depends on the size of the data sketch k. To avoid the need to perform k sequential lookups over the same table, the present study proposes a k-parallel with m-hash lookup data structure consisting of kp read-only tables, as shown in Figure 4. For each table, mp hash functions are used to compute the indexes of a given key for table lookup purposes. Through the use of this k-parallel structure, the total lookup latency is significantly reduced and the maximum throughput increased correspondingly. Moreover, for values selected from Table MC , the hash indexes to the table have the form of two to the power of n. Thus, the complex division computation in the proposed linear interpolation scheme (see Figure 3) can be replaced by a simple logic shift operator.
For each incoming packet, up to mp × kp hash values are computed based on the key selected. As shown in Algorithm 6, these hash values are then used to index the relevant lookup tables following simple logical shift and invert operations (see lines 20 and 34).
In particular, as described above, three different address ranges are specified based on the threshold values assigned to th head and th tail . In accordance with these ranges, the Span (  9: sort the content in increasing order 10: Function invert(x): 11: inverting bits in the binary representation of x 12: // Construct the α-stable Table  13: for i = 0 to 2 b − 1 do 14: for j =0 to 2 b − 1 do 15:  21: for i = 0 to (En mc − 1) do 22: Generate two random numbers m, n where 0 ≤ m, n < 2 b

23:
j  25: for i = 0 to (En mc − 1) do 26: j  for i = 0 to ((2 th head /2 sp ))) do 30: j  for i = th head to th tail do 34: j  for i = 0 to th tail do 38: j  At the end of the observation time T , the host CPU collects this data structure from the fast data plane and computes the empirical Shannon entropy using the Log-Mean [23] and New estimator [25] described in Section II-C3. The corresponding pseudo codes for the two proposed estimation processes are shown in Algorithms 7 and 8, respectively.

V. DESIGN EXPLORATION
The memory space required by the proposed scheme is equal to the total size (kp×En) of the Span-Head-Tail tables shown  in Figure 4, where kp denotes the number of lookup tables deployed in parallel, and En is the size (i.e., total number of entries) of each table. It is noted that the two parameters (kp and En) play a key role in determining the accuracy and variance of the final entropy estimates.

A. TABLE SIZE REDUCTION
The total entry (En) of the Span-Head- Tail table constructed using Algorithm 6 can be summarized as where th tail = log 2 En mc − 1 and th head ≤ th tail . The proposed linear interpolation scheme substantially reduces the total lookup table size. As illustrated in Equation (2), th head and sp are the dominating parameters of the table size.
Example: As shown in Table 3, for th head = 11, sp = 1 and En mc = 64K , the Span  for j = 0 to kp − 1 do 19: indx ij ← h ij (key) 20: indx ij ← indx ij sp 21: if (indx ij < 2 th head ) then 22: a1 ij ← indx ij sp 23: a2 ij ← (indx ij + 1) sp 24: b1 ij ← j  27: else if (2 th head ≤ indx ij ≤ 2 th tail ) then 28: else 34: indx ij ← invert(indx) 35:   piece-wise linear interpolation process can be evaluated as Equation (3). Figure 5 illustrates the selection of cut-off thresholds (th head ) and the total entry consumed of the proposed interpolation scheme. For ease of comparison, as shown in Figure 6, two baselines of the average error distance are shown, namely (1) the dotted L1_quarter line, which represents the L1 distance of the quarter-sized interpolation table in which every 4 th entries of the original table Table MC is retained; and (2) the dashed L1_half line, which represents the average error distance of the half-sized interpolation table in which every 2 nd entries of the original one is retained. As expected, the average error distance of the L1_half scheme is much less  However, the resulting tradeoff between the interpolation error and the memory space saving may be sub-optimal. As shown in Figure 5, by using a cut-off threshold of thirteen (th head = 13) and storing all the values (sp = 0) in the original table, the L1 distance reaches almost the same level as that provided by the L1_half. Moreover, the consumed memory space of the interpolation table Table span is equal to just 12.5% of that of the original table Table MC . Furthermore, if every 2 nd entries (sp = 1) is selected, the memory space can be halved (4,114 entries) while the average error distance is still less than the level of the L1_quarter line.
Accordingly, simulations were performed using a realworld Internet traffic trace extracted from the MAWI dataset [54] to evaluate the effects of the cut-off threshold (th head ) and span (sp) on the cumulative probability of various error percentages. The corresponding results are shown in Figure 7. It is seen that for parameter settings of  th head = 13 and sp = 0, the cumulative probability reaches almost 0.9 within an error percentage of 3%. Furthermore, the table size can be halved using a span size of two (sp = 1) with no more than a minor reduction in the cumulative probability. Figure 8 demonstrates the box-and-whisker plot of the percentage errors for different cut-off thresholds. The distribution of relative errors declines as the cut-off threshold increases. For th head = 10, the upper quartile is less than 3%.

B. K-PARALLEL TABLE WITH M-HASH
In practice, the variance of the entropy estimates obtained using the proposed scheme can be further reduced through an averaging approach by using multiple tables in parallel and independent hash functions for each table lookup. Figure 9 shows the cumulative probability of the relative error of the entropy estimates given the use of different numbers of lookup tables. The relative error, defined as (| H −H | H ) × 100%, is the absolute error divided by the magnitude of the exact entropy. It can be seen that for a cumulative probability of 0.9, for example, the relative error reduces from 7% to just 2.9% as the number of tables increases from twenty (kp = 20) to forty (kp = 40) given the use of eight hash functions (mp = 8) in every case. By contrast, as shown in Figure 10, the cumulative probability increases from 0.71 to 0.98 as the number of hash functions increases from two (mp = 2) to sixteen (mp = 16) for a relative error of 5%.
Note that the results presented in Figures 9 and 10 are obtained using the interpolation-based scheme of Clifford and Cosma [23] and a synthetic traffic stream consisting of one million elements. In addition, the skew parameter [55] is set as Zipf=1.4 (i.e., the same as that of the real-world MAWI trace used in the previous simulations) and the span and cut-off threshold parameters have values of sp = 1, th head = 11, and th tail = 15, respectively.

C. RESOLUTION
Parameter b in the proposed table construction algorithm (Algorithm 5) defines the resolution, 1/2 b , of the values stored in the lookup table. For a fixed value of En mc (e.g., 65,536, see above), a higher resolution results in a more accurate estimation performance for a larger number of distinct keys in the packet stream. Figure 11 shows the mean absolute percentage error (MAPE), defined as 1 , of the estimated entropy for one thousand streams (s = 1, 000) of different distributions and resolutions.
It is seen that for a typical real-world network trace [54] with a moderate skew (Zipf=1.4) and a length of one million elements (approximately 18 K distinct elements), a resolution bit size of 12 is sufficient to yield an accurate entropy estimation performance. However, for synthetic data streams containing 100 K, 1 M, and 10 M different items [55], resolution bit sizes of at least 16, 18 and 22, respectively, are required to achieve an adequate estimation accuracy.

D. SIZE AND ERROR TRADEOFF
In general, the entry size (En mc ) of Table MC used to store the values selected from Table alpha through the inverse transform sampling process should not be too small. The reason is that if those random values are sampled coarse-grained, it is hard to represent the original stable distribution; hence, the interpolation error increases. Based on two distributions (Zipf=0.1 and 1.9), Figure 12 illustrates the mean absolute percentage error with different sizes of the Table MC for the interpolation-based entropy estimations of LME and NECC (kp = 20, mp = 1). The mean value is obtained based on simulations of 1, 000 times.
It is apparent that for a highly-skewed stream (Zipf=1.9), the mean absolute percentage error remains the same as the size increases. In contrast, for a uniform-distributed stream (Zipf=0.1), the larger the entry size, the lower the error. The LME and NECC interpolation schemes achieve less than 5% of the mean error with the 512 K entry of the MC table.
Please be noted that Table MC is a temporary data structure storing the random values generated by the inverse transform sampling. Based on this table, the proposed interpolation process further creates the Span-Head-Tail table with a much smaller size suitable for system implementation.
As indicated in Equation (2), the size of the Span-Head-Tail table is mainly affected by the parameters of cut-off (th head ) and span parameters (sp). Figures 13 and 14 demonstrate the mean absolute percentage error with different cut-off thresholds (th head ) and span parameters (sp) for the interpolation-based entropy estimations.
Obviously, using a lower span value, the larger the cutoff threshold, the lower the mean absolute percentage error  Table. The simulation is conducted based on two synthetic data streams of Zipf=0.1 and 1.9.
for the estimation. However, for a fixed span parameter, the table size (En) increases exponentially as the cut-off threshold increases. Therefore, for approximately 4 K-entry of the k-parallel table implementation, suitable selections of the span and cut-off threshold parameters can be (sp = 0, th head = 12) or (sp = 1, th head = 13).

FIGURE 13.
Mean absolute percentage error of the proposed interpolation-based LME methods with different cut-off thresholds (th head ) and three span parameters (sp = 0, 1, 2). The simulation is conducted based on a uniform synthetic data stream (Zipf=0.1) with the MC table size (En mc ) of 512K .

VI. SYSTEM EVALUATION
The feasibility of the proposed interpolation-based platform for estimating the Shannon entropy using either the algorithm of NECC Ping Li and Zhang [25] or LME schemes of Clifford and Cosma [23] was evaluated using both synthetic data stream and real-world traffic traces adopted from the MAWI dataset [54] and the CAIDA DDoS attack dataset [56].

A. BASELINE
In general, up to k numbers of R k values are used in the LME [23] and NECC [25] schemes to minimize the variance of the empirical Shannon entropy estimates. For example, Zhao et al. [8] used twenty blocks of sketch data (k = 20) Mean absolute percentage error of the proposed interpolation-based LME and NECC methods with different cut-off thresholds (th head ) and span parameters (sp = 0, 1). The simulation is conducted based on a skewed synthetic data stream (Zipf=1.9) with the MC table size (En mc ) of 512K . in the lookup table implementation. Figure 15 presents boxand-whisker plots of the relative errors obtained using the schemes of original LME of Clifford & Cosma (left section) and NECC of Ping Li (right section) for distributions with two extreme cases (Zipf = 0.1, 1.9) and three different table configurations (k = 20, 100, 180). The variance of the estimated entropy can be minimized effectively with a higher number of sketch data structures (k) in both schemes. Figure 16 compares the mean estimated entropy values obtained from the original LME [23] and NECC [25] schemes with those obtained from the correspondence interpolationbased schemes (kp = 20, mp = 1). For both schemes, the simulations were repeated 1,000 times with different hash parameters each time and the results were then derived as mean values with the standard deviation shown as error bars.
On average, as shown in Figure 17, the original LME and NECC schemes yield a relative error of less than 5% compared to the exact entropy solutions with a Zipf value less VOLUME 10, 2022 FIGURE 16. Estimated entropy obtained using original and proposed interpolation-based LME [23] and NECC [25] schemes for stable random distributions with different Zipf parameters. Note that the original LME and NECC schemes are implemented using a data sketch size of twenty (k = 20), while the proposed interpolation-based schemes are implemented using twenty tables (kp = 20) in parallel and one hash function per table (mp = 1). than 1.01. By contrast, the proposed schemes overestimate the entropy value due to the interpolation error introduced. In particular, the mean relative error of both schemes increases to approximately 18% as the stream elements become highly-skewed (Zipf = 1.9) distributed. Fortunately, as shown in Figure 17, the mean relative error for a traffic distribution with a Zipf value of 1.9 can be reduced to around 6% using the proposed interpolation-based schemes given the use of more hash functions per table (kp = 20, mp = 12).

FIGURE 17.
Mean absolute percentage error of estimated entropy obtained using original and proposed interpolation-based LME [23] and and NECC [25] schemes for stable random distributions with different Zipf parameters. Note that the original LME and NECC schemes are implemented using a data sketch size of twenty (k = 20), while the proposed interpolation-based schemes are using twenty tables (kp = 20) in parallel and one hash function per table (mp = 1) and twelve (mp = 12).
The simulations considered synthetic data streams, each consisting of 30 K items, generated using different Zipf parameters in the range of 0.1 to 1.9 [55]. For the interpolation-based scheme, the head threshold parameter was set as ten (th head = 10), the tail threshold parameter was set as fifteen (th tail = 15), and the span parameter was set as one (sp = 1). Finally, the resolution bit size was set as eighteen (b = 18).

B. MAWI TRACE
Six real-world network traffic traces from the MAWI Working Group Traffic Archive [54] are used to evaluate the performance of the proposed system. Those 15-minute long packet traces are originated from part of the 24h-long trace at MAWI's samplepoint B, F, and G. As shown in Table 4, the average number of distinct source IPv4 addresses ranged from 51.7 K to 5.07 M, while the total number of packets ranged from 3.5 M to 588.7 M approximately. Simulations were performed to compute the cumulative probabilities of the relative error of the estimated entropy of the six traces given the use of both estimation schemes [23], [25]. The estimation process was confined to a small portion (30 seconds) of the original trace (15 minutes) and the estimation procedure was repeated 500 times using different hash parameters and table contents. Table 5 shows the cumulative probability of 3% and 5% relative errors of the estimated entropy when using the interpolation-based approach for LME and NECC algorithms, respectively. It is seen that when the estimation process is implemented using forty lookup tables (kp = 40) in parallel, the cumulative probability reaches 0.89 for a 3% relative error when using the LME scheme and 0.9 when using the NECC method. Moreover, the total memory space consumption (sp = 1, th head = 13) is 640 K bytes, assuming that each table entry utilizes a 32-bit counter.
For each 15-minute-long packet trace, an estimated entropy valueĤ i and exact value H i were obtained using an observation time of 30 seconds. The corresponding MAPEs were then computed. The simulation process was repeated 20 times with different hashing parameters and table contents each time. Box-and-whisker plots were plotted for the MAPE values of the six traces given the use of the interpolation-based LME and NECC schemes, respectively. The corresponding results are presented in Figure 18 and Figure 19, respectively. It is seen that for all six traffic traces, the means of the box-and-whisker plots are less than 3% for both estimation schemes. Following the LME simulation results using only thirty lookup tables (kp = 30), the MAPE for the trace of 201501011400 and 201904091800 are 1.6% and 1.83%, respectively. The total memory space consumption (sp = 1, TABLE 5. Cumulative probabilities of relative error for entropy estimation of 3% and 5% given use of the LME and NECC schemes ( t = 30).  th head = 13) is 480 K bytes. Noted that, as illustrated in Table 4, the 201904091800 trace contains the highest number of distinct source IP addresses (197.9 K) in an average of 30-second observation time. Table 6 illustrates the mean absolute percentage errors of the LME estimation on MAWI traces in an observation time of 300 seconds. The errors are all less than 3% except for the trace of 202004081400. Besides, the standard deviation for 202004081400 trace is higher than those of the other traces. This is mainly due to the excess number of packets (196.2 M) processed.
Noted that, as shown in Table 4, the whole traces (900 seconds) of 201501011400 and 201904091800 contain approximately five million distinct source IP addresses with total packet counts of 58.1 M and 98.8 M, respectively. Therefore, according to the simulation results shown in Figure 11, a higher resolution bit (b) needs to be applied in the table construction phase. Thus, as shown in Figure 20, with resolution bits of 22, the interpolation-based LME and NECC can estimate the empirical entropy of these two 900-second traces with less than 3% of mean absolute percentage error.

C. CAIDA 2007 DDoS TRACE
The entropy estimation performance of the proposed architecture was further evaluated using the CAIDA 2007 DDoS dataset [56]. In particular, a one-hour-long packet trace was adopted from the MAWI Working Group Traffic Archive (MAWI 2019 DITL Trace) [54] and was merged as a background traffic with the CAIDA DDoS attack trace.
As shown in Figure 21, the synthetic trace contained two DDoS attacks, where these attacks were simulated simply by inserting the same DDoS attack records into the background traffic twice. The entropy values were estimated using the interpolation-based LME [23] and NECC [25] schemes based on the source IP address and an observation time of 30 seconds in both cases. For both estimators, the interpolation scheme was implemented using parameter settings of b = 18, sp = 0, th head = 11, and th tail = 15. Moreover, the interpolation process was performed using twenty tables in parallel (kp = 20) with four hash functions for each table (mp = 4). A detailed inspection of Figure 21 shows that the MAPE values for the interpolation scheme of LME and NECC (α=0.999) is 3.3% and 2.46%, respectively.
In order to compare the entropy values in different observation time slots, the entropy values are often normalized based on the distinct count or the total count [41] of the stream elements. Figure 21 presents the entropy normalized based on the exact distinct count for illustration purposes. As the cardinality estimation is outside the scope of this study, we refer the readers to the literature of Flajolet and Martin [57], [58] for the original algorithm. Furthermore, Kulkarni et al. [59] and Soto et al. [44] presented the estimation accelerator in FPGAs, and Ding et al. [45] introduced the practical implementations in the P4 programming language.

VII. PRACTICAL IMPLEMENTATION A. FPGA
The practical feasibility of the proposed architecture was demonstrated by implementing it in the data plane of the NetFPGA-Plus [60] project in an UltraScale+ XCU200 FPGA consisting of 2, 160 blocks of 36 K-bit BRAMs. The total processing latency of the hardware design comprised ten clock cycles for frame parsing and key hashing (2 cycles), table lookup (1 cycle), and interpolation (7 cycles). Figure 23 presents the FPGA system operation estimating the entropy values of five-tuple attributes by replaying the CAIDA DDoS 2007 network traffic trace [56] through a 100 Gbps network interface card. Each incoming packet was processed in a pipeline fashion using a 250 MHz AXI-Stream bus with 512-bit. A typical minimum-sized Ethernet frame consisted of a 12-byte inter-frame gap, a preamble of 8 bytes, a 14-byte frame header, a 46-byte payload, and a 4-byte  CRC checksum of the wire. Thus, two cycles were required to process the 84-byte frame on the 512-bit AXI-Stream bus. The interpolation architecture was implemented using two sets of fifty tables in parallel (kp = 100 in total), where each table utilized two dual-port BRAMs with a size of 2 K × 36 bits. Accordingly, four hash lookups (mp = 4) were performed for each table within two clock cycles enabling 148, 809, 524 frames to be processed per second at a 100 Gbps wire speed. The Verilog HDL implementation is synthesized and the resource utilization is presented in Table 7.
Given the UltraScale+ BlockRAM's maximum clock frequency of 825 MHz [61], the proposed design was capable of processing up to twelve hash-lookups (mp = 12) within two clock cycles at a 100 Gbps wire speed. The theoretical processing throughput of the proposed pipelined design was thus 422.4 Gbps for minimum-sized Ethernet frames.

B. P4
The proposed Entropy estimation scheme was also implemented in the data plane of a P4 switch using P4-16 programming language [62]. The interpolation parameters were set as sp = 1, th head = 10, and th tail = 15, and the estimation process was performed using twenty tables (kp = 20) in parallel and one hash function (CRC32) per table (mp = 1). Each entry stored a 32-bit fixed-point value consisting of a 12-bit fraction and a 20-bit integer. Figure 24 shows an excerpt of the P4 code for the interpolation operation in the behavioral model.  Most of the programmable data plane architecture [63] can not support complex mathematical operations such as multiplication and division. The solution is to adopt the approximation techniques [64] of bit-shifting with adders and table lookups. In addition, more complex operations such as exponential and logarithmic can also be realized [45] based on the approximation techniques with binomial series expansion.
Assuming that a total of m packets of key are observed within the range of a 1 and a 2 during the observation window t, and X j = hash(key j ), the estimation sketch data structure Y j obtained using the interpolation process shown in Fig. 3 can be expressed as Since the values of a 1 , b 1 , a 2 , and b 2 are all known constant values, the multiplication step in of the interpolation process can be conducted in batch mode by using the host CPU in the control plane. In order to implement the proposed scheme in  the Tofino Native Architecture (TNA) pipeline without the need for a multiplication operation, the implemented design utilizes additional register tables to accumulate the summation of the hash index X i and corresponding frequency count m i within the range of a i and a i+1 for each incoming packet in the data plane. The average resource consumption of the final implementation (kp = 20, mp = 1) using eleven pipeline stages is shown in Table 8. Figure 25 presents a photograph of the P4 testbed. A total of 100 K minimum-sized Ethernet frames were generated at a rate of 100 Gbps by the Thor-400G-7S-1P test module in the ValkyrieCompact chassis (XENA Networks). Three different distributions of the IPv4 source addresses were configured, namely random, linear-increasing, and fixed. The relative error of the estimated entropy was found to be less than 11% in all three cases.

A. PACKET COUNT AND DISTINCT ITEM
Typically, the network traffic analysis adopts observation epochs of 30∼900 seconds [8], [41], [44]. Thus, based on the memory size ranging from 480 K to 640 K bytes, the proposed scheme can process traces of up to 98.8 million packets and handle approximately up to five million distinct items with less than 3% of mean absolute percentage error.
The proposed schemes can handle a more different number of distinct items for a traffic stream as the resolution bit value (b) increases. Hence the accuracy of the entropy estimates improves. A high-resolution setting extends the depth range of the span region. Thus, the parameters of th head and sp need to be adjusted accordingly to meet the required estimation accuracy.

B. IMPLEMENTATION FLEXIBILITY
The number of lookup tables (kp) and hash functions (mp) in the proposed interpolation-based scheme provides valuable flexibility in the system implementation to minimize the variance of the estimated entropy. Figure 22 presents boxand-whisker plots of the relative errors obtained using the interpolation-based LME of Clifford & Cosma (upper panel) The designer can deploy the proposed schemes with a fair number of tables and hash functions based on the available memory space and processing throughput requirement. Thus, compared to the existing hardware solutions [8], [41], [44], [45] we can implement the proposed scheme in the programmable data plane with P4 on the Tofino Native Architecture (TNA) and FPGA easily. Moreover, instead of conducting read-modify-write operations (2∼3 clock cycles) on the entire sketch memory [44], the read-only lookup procedure (1 clock cycle) of the proposed scheme provides a faster processing speed for the packet updates. Since the table lookup procedure can be performed in parallel, the latency is reduced. Hence the estimation process is favorable for high-speed network traffic (providing that sufficient memory space is available).

C. LIMITATION
Increasing the number of tables can minimize the variance to a certain extent (e.g., kp > 80) since the variance is inherently dependent on the magnitude of the interpolation error. For a given number of tables (kp = 80) shown in Figure 22, the variance can also be reduced by increasing the number of hash functions from eight (mp = 8) to sixteen (mp = 16). However, most on-chip embedded memories have only a limited number of read-ports. As a result, the lookup process must be conducted sequentially, causing a long latency delay. Consequently, the number of hash functions must be balanced in such a way as to meet the wire-speed processing requirement. Figure 26 presents the mean absolute percentage error for the interpolation-based LME and NECC with different numbers of packet counts processed. The simulation is conducted based on three synthetic data streams of Zipf=0.1, 1.3, and 2.1 with parameters of kp = 20 and mp = 12. For a traffic distribution of Zipf=1.3, the interpolation-based LME and NECC schemes can process up to approximately 400 M packets with less than 5% of error. However, for an extreme case where the traffic is very uniformly distributed (Zipf=0.1), the error of the entropy estimation increases rapidly as the packet count grows. The reason is primarily due to the accumulation of interpolation errors originating from the table lookup process.

IX. CONCLUSION AND FUTURE WORK
This paper has proposed a tabular interpolation scheme for estimating the empirical Shannon entropy of network traffic based on the stable random projection method. The existing entropy estimation methods, such as the Log-Mean Estimator (LME) [23] and the New Estimator of Compressed Counting (NECC) [25], [26], required complex computations. In contrast, the present study derives the required data structures using a simple table lookup process and a piece-wise linear interpolation technique. The total size of the lookup table is reduced by separating the table into three smaller tables in accordance with parameters of sp, th head and th tail .
Notably, the parameters can be adjusted by the particular characteristics of the skewed alpha-stable distribution. The purpose is to correctly reproduce the distribution and achieve an acceptable tradeoff between the proposed scheme's memory consumption and the entropy estimates' accuracy.
The feasibility of the proposed architecture has been demonstrated using both real-world traffic traces and synthetic data streams. The scheme has additionally been evaluated, delivering the capability of processing network traffic at a 100 Gbps wire speed on a Xilinx U200 FPGA platform and a Tofino programmable P4 switch. In general, the results have shown that the proposed architecture is compatible with both the LME scheme [23] and the NECC method [25].
In addition, the simulation results have indicated that the proposed scheme can process traces of up to 98.8 million packets and handle up to five million distinct items with a mean relative error of less than 3%. The total memory space consumed is 480 K bytes (kp = 30) and 640 K bytes (kp = 40), respectively based on the configuration of (mp = 12, sp = 1, th head = 13).
Since the primary entropy estimation involves only the lookup of read-only tables and the update of some sketch registers, the process has very low latency. Thus, a theoretical processing throughput in excess of 400 Gbps can be achieved given the latest advances in FPGA technology with a Block RAM frequency of 825 MHz. In future studies, we plan to optimize the proposed design further and deploy it in real-world network environments for traffic monitoring and anomaly detection applications.
JIM HAO CHEN is currently the Associate Director of the International Center for Advanced Internet Research (iCAIR), Northwestern University, where he is also responsible for the center's research infrastructure design and engineering. Before joining iCAIR, he was a Coordinator for technology testbeds at Northwestern. He leads multiple projects focus on the design and development of high performance platforms for advanced network systems and advanced network applications. His research interests include include 100G network exchanges and data movement, high performance digital media networks, high resolution 2D/3D digital media streaming over networks, international collaboration virtual environments for science, programmable network testbeds, science cloud networks, and virtual science environments. VOLUME 10, 2022