Weaknesses of Popular and Recent Covert Channel Detection Methods and a Remedy

Network covert channels are applied for the secret exfiltration of confidential data, the stealthy operation of malware, and legitimate purposes, such as censorship circumvention. In recent decades, some major detection methods for network covert channels have been developed. In this article, we investigate two highly cited detection methods for covert timing channels, namely <inline-formula><tex-math notation="LaTeX">$\epsilon$</tex-math><alternatives><mml:math><mml:mi>ε</mml:mi></mml:math><inline-graphic xlink:href="zillien-ieq1-3241451.gif"/></alternatives></inline-formula>-similarity and compressibility score from Cabuk et al. (jointly cited by 949 articles and applied by several researchers). We additionally analyze two recent ML-based detection methods: <italic>GAS</italic> (2022) and <italic>SnapCatch</italic> (2021). While all these detection methods must be considered valuable for the analysis of typical covert timing channels, we show that these methods are not reliable when a covert channel's behavior is slightly modified. In particular, we demonstrate that when confronted with a simple covert channel that we call <inline-formula><tex-math notation="LaTeX">$\epsilon$</tex-math><alternatives><mml:math><mml:mi>ε</mml:mi></mml:math><inline-graphic xlink:href="zillien-ieq2-3241451.gif"/></alternatives></inline-formula>-<inline-formula><tex-math notation="LaTeX">$\kappa$</tex-math><alternatives><mml:math><mml:mi>κ</mml:mi></mml:math><inline-graphic xlink:href="zillien-ieq3-3241451.gif"/></alternatives></inline-formula>libur, all detection methods can be circumvented or their performance can be significantly reduced although the covert channel still provides a high bitrate. 
In comparison to existing timing channels that circumvent these methods, <inline-formula><tex-math notation="LaTeX">$\epsilon$</tex-math><alternatives><mml:math><mml:mi>ε</mml:mi></mml:math><inline-graphic xlink:href="zillien-ieq4-3241451.gif"/></alternatives></inline-formula>-<inline-formula><tex-math notation="LaTeX">$\kappa$</tex-math><alternatives><mml:math><mml:mi>κ</mml:mi></mml:math><inline-graphic xlink:href="zillien-ieq5-3241451.gif"/></alternatives></inline-formula>libur is much simpler and eliminates the need of altering previously recorded traffic. Moreover, we propose an enhanced <inline-formula><tex-math notation="LaTeX">$\epsilon$</tex-math><alternatives><mml:math><mml:mi>ε</mml:mi></mml:math><inline-graphic xlink:href="zillien-ieq6-3241451.gif"/></alternatives></inline-formula>-similarity that can detect the classical covert timing channel as well as <inline-formula><tex-math notation="LaTeX">$\epsilon$</tex-math><alternatives><mml:math><mml:mi>ε</mml:mi></mml:math><inline-graphic xlink:href="zillien-ieq7-3241451.gif"/></alternatives></inline-formula>-<inline-formula><tex-math notation="LaTeX">$\kappa$</tex-math><alternatives><mml:math><mml:mi>κ</mml:mi></mml:math><inline-graphic xlink:href="zillien-ieq8-3241451.gif"/></alternatives></inline-formula>libur.


I. INTRODUCTION
Covert channels are undesired and stealthy communication channels that aid multiple cybercriminal activities. For instance, botnets can use them to hide their command and control channels, and spyware can employ them to secretly exfiltrate stolen information such as credentials or database content [1], [2], [3]. Moreover, covert channels can be part of DDoS attacks [4]. Alternatively, covert channels can also be used for legitimate purposes. For example, Wustrow et al. show an approach that uses a covert channel to circumvent state-sponsored censorship by enabling an "end-to-middle proxy" [5].
One specific type of covert channel is based on the inter-arrival time (IAT, sometimes also called inter-packet delay, IPD). The IAT is the elapsed time between two succeeding network packets, and each IAT represents a secret symbol. For instance, an IAT of 100 ms might indicate a '0' bit while an IAT of 200 ms might indicate a '1' bit. Several improvements of the plain IAT channel have been proposed, cf. [6], [7], [8], [9].
For such IAT covert channels, multiple detection heuristics exist. Among these are two highly cited ones by Cabuk et al. that were published between 2004 and 2009 [10], [11], [12]. Research has improved on these algorithms during the succeeding decade, leading to mostly ML-based detection methods, such as the recent GAS [13] (RNN-based) and SnapCatch [14] (based on image processing and ML) methods, which reach almost perfect detection quality.
Often, publications that present new covert channels are accompanied by detection approaches and algorithms. In several cases, such detection approaches perform well in the test scenarios evaluated by the authors. However, it is usually not considered that there might be other ways to impair the effectiveness of these detection algorithms, even without decreasing a covert channel's bandwidth. We address this topic as follows:
1) We demonstrate that both the ε-similarity and the compressibility score provide unreliable results when confronted with a slight modification of the standard covert timing channel, which we call ε-κlibur. In comparison to previous attempts such as JitterBug, MB-CTC and TRCC, ε-κlibur is much simpler and eliminates the requirement of altering pre-defined (legitimate) network traffic or of using a complex traffic generation framework.
2) We show that ε-κlibur can moreover degrade the performance of two novel detection approaches, GAS and SnapCatch.
3) We propose an enhanced ε-similarity to replace the original heuristic. Our enhanced ε-similarity can detect the standard covert timing channel as well as ε-κlibur.
Please note that it is not our major goal to provide a covert channel that circumvents all known covert timing channel detection methods but to enhance the understanding of the ε-similarity and the compressibility score and of their limitations.
The remainder of our paper is structured as follows. Section II presents fundamentals and Section III covers related work. Our ε-κlibur covert channel is presented in Section IV. We evaluate the ε-similarity and the compressibility score against ε-κlibur in Section V. Afterwards, we suggest an enhanced detection heuristic and evaluate it against ε-κlibur and TRCC in Section VI. We further evaluate ε-κlibur against the ML-based detectors GAS and SnapCatch in Section VII. Section VIII concludes.

II. FUNDAMENTALS
In this section, we first cover the general concept of the related covert timing channels (Section II.A), followed by an explanation of the analyzed detection methods (Section II.B).

A. Timing Covert Channels
The considered timing covert channel of Cabuk et al. [10] modulates the IATs between consecutive network packets of a connection by intentionally delaying packets. The channel is a form of the so-called inter-packet times (or: inter-arrival times) hiding pattern [15] in network steganography or, more generally, the element positioning pattern in steganography [16], as packets are "positioned" in time.
To transmit data, the covert sender and covert receiver first agree on two or more IATs corresponding to two or more secret symbols. The covert sender then encodes the message into a list of symbols. Each of these symbols is transmitted by waiting for a corresponding time after a packet has been sent, and then sending out the next network packet to reach a certain IAT. This is repeated until all symbols have been transmitted. In the setup of Cabuk et al., the covert channel sends data every τ and 2τ units of time, e.g., every 5 ms and 10 ms, to encode the two secret bit values. Thus, the average IAT is 3τ/2, provided that both symbols occur equally often.
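The encoding and decoding steps described above can be sketched as follows. This is a minimal illustration rather than the implementation of Cabuk et al.: the function names, the mapping of '0' to τ and '1' to 2τ, the decoding threshold of 3τ/2, and the jitter values are assumptions chosen for the example.

```python
def encode_bits(bits, tau):
    """Map each bit to a target IAT: 0 -> tau, 1 -> 2 * tau (assumed mapping)."""
    return [tau if b == 0 else 2 * tau for b in bits]


def decode_iats(iats, tau):
    """Decode observed IATs with a threshold halfway between tau and 2 * tau."""
    threshold = 1.5 * tau
    return [0 if iat < threshold else 1 for iat in iats]


tau_ms = 5.0
message = [1, 0, 1, 1, 0, 0, 1, 0]
delays = encode_bits(message, tau_ms)

# Hypothetical network jitter (in ms) that stays well below tau / 2.
jitter = [0.3, -0.2, 0.1, 0.4, -0.4, 0.2, -0.1, 0.0]
observed = [d + j for d, j in zip(delays, jitter)]

decoded = decode_iats(observed, tau_ms)
# With equally many zeros and ones, the mean target IAT equals 3 * tau / 2.
mean_iat = sum(delays) / len(delays)
```

As long as the jitter stays below τ/2, every observed IAT remains on the correct side of the threshold and the message is recovered without errors.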

B. Covert Timing Channel Detection Methods
In two papers, published at ACM CCS in 2004 [10] and in ACM TISSEC/TOPS in 2009 [12], and in a related dissertation [11], Cabuk et al. introduced and evaluated the two detection heuristics for IAT-based covert channels that we selected for our investigation. These three publications have had a widespread influence in the field, as shown by their derivatives (see Sections III.A and III.B) and their citations: according to Google Scholar, the CCS'04 paper received 619 citations, the TISSEC paper 184 citations, and the dissertation 146 citations, summing up to 949 citations as of late January 2023.
These detection methods try to find patterns or structure in the IATs of the network packets. Regular network traffic has mostly random IATs that are influenced by the network hardware, used software, protocols, topology, and load. Traffic containing a covert channel will show a clear structure due to the artificial delays. Fig. 1 shows a scatter plot of the IATs of two network recordings. We can see that the IATs for the regular traffic form a single larger cluster, while the covert channel traffic forms two distinct clusters.
1) ε-Similarity: The first detection method by Cabuk et al. that we investigated is the ε-similarity. The goal of this heuristic is to quantify the "similarity" of the IATs in a network recording. The idea is that regular traffic will have random timings, which are not too similar to each other, while covert channel traffic will have groups of rather similar IATs. These differences in structure are clearly visible when comparing the plots of Fig. 1. The ε-similarity is a numerical score that is calculated as follows:
1) Calculate all IATs for a given flow with 2,000 packets (called window size).
2) Sort the IATs (illustrated in Fig. 2).
3) Calculate the relative differences of two consecutive IATs in the sorted list: $\lambda_i = \frac{|T_{i+1} - T_i|}{T_i}$, where $T_i$ denotes the $i$-th sorted IAT.
4) Calculate the percentage of λ values that are below the threshold ε, which is called the similarity score.
A threshold on the similarity score is then used to differentiate between legitimate and covert channel traffic.
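The steps above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the relative difference of two consecutive sorted IATs is taken as |T_{i+1} - T_i| / T_i, zero-valued IATs are skipped to avoid a division by zero, and the function name and the toy inputs are hypothetical.

```python
def epsilon_similarity(iats, epsilon):
    """Fraction of relative differences between consecutive sorted IATs below epsilon."""
    ordered = sorted(iats)
    lambdas = []
    for current, following in zip(ordered, ordered[1:]):
        if current > 0:  # assumption: skip zero IATs to avoid division by zero
            lambdas.append(abs(following - current) / current)
    if not lambdas:
        return 0.0
    return sum(1 for lam in lambdas if lam < epsilon) / len(lambdas)


# Toy covert channel: only two distinct IATs, so almost all differences are 0.
covert = [5.0] * 50 + [10.0] * 50
# Toy "spread out" traffic: each sorted IAT is 5% larger than its predecessor.
spread = [1.05 ** i for i in range(100)]

covert_score = epsilon_similarity(covert, 0.01)  # 98 of 99 differences are below 0.01
spread_score = epsilon_similarity(spread, 0.01)  # no difference is below 0.01
```

The two-valued channel yields a score close to 1, whereas the evenly spread IATs yield 0 for the same ε, which mirrors the separation that the heuristic relies on.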
2) Compressibility: The second detection approach introduced by Cabuk et al. is the compressibility score. The goal is again to quantify the structure of the IATs. This heuristic uses a compression algorithm to approximate the Kolmogorov complexity [11], [17] of a string representation of the IATs. The main functionality of the compression algorithm is to find patterns and structure in the data that can be exploited to compress the data efficiently. Highly structured data, like natural language texts or HTML files, can often be compressed at a high ratio, while pseudo-random data, like encrypted data, can be compressed only slightly. The idea behind the compressibility heuristic is that the string representation of legitimate IATs will be more random and therefore less compressible, while covert channel traffic will have more structure and therefore be more compressible. By comparing the compression ratio of the string representations of the IATs, we obtain a numerical measure of their "structure". Consequently, legitimate traffic should have lower compressibility scores than covert channel traffic. For a string representation S with compressed form C, the compressibility score is
$$\kappa = \frac{|S|}{|C|},$$
with |x| representing the length of string x.
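A minimal sketch of such a score is given below, assuming that the score is the ratio of the uncompressed to the compressed string length. The string representation (IATs rounded to two decimals and joined by ';'), the use of zlib as compressor, and all names are assumptions for illustration; they are not the encoding used by Cabuk et al.

```python
import random
import zlib


def iats_to_string(iats, decimals=2):
    """Hypothetical string representation: rounded IATs joined by ';'."""
    return ";".join(f"{iat:.{decimals}f}" for iat in iats)


def compressibility(iats):
    """kappa = |S| / |C|, with C being the zlib-compressed form of the string S."""
    raw = iats_to_string(iats).encode("ascii")
    return len(raw) / len(zlib.compress(raw, 9))


rng = random.Random(0)
# Toy covert channel with two delays and toy "legitimate" traffic with random IATs.
covert = [rng.choice([5.0, 10.0]) for _ in range(2000)]
legit_like = [rng.uniform(1.0, 100.0) for _ in range(2000)]

kappa_covert = compressibility(covert)
kappa_legit = compressibility(legit_like)
```

In this toy setup, the two-valued covert channel compresses considerably better than the random IATs, so its κ is higher.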

III. RELATED WORK
In this section, we discuss related improvements of the standard version of the covert timing channel and how ε-κlibur differs from these approaches (Section III.A). Afterwards, we cover related detection heuristics (Section III.B).

A. Improvements to the Timing Covert Channel
Cabuk et al. further proposed ideas on how to circumvent their detection methods by modifying the covert channel. However, their approaches reduce the bandwidth of the covert channel, either by mixing legitimate traffic into covert transmissions en bloc or by significantly increasing the delays that are used for the encoding scheme. Further, they proposed a more sophisticated timing channel called time-replay covert channel (TRCC) in [11]. The TRCC functions similarly to the basic IAT covert channel, but draws its delays from two (or more) sets of pre-recorded IATs of legitimate traffic. Each set corresponds to a secret symbol. However, Cabuk et al. did not evaluate the ε-similarity or the compressibility score with TRCC. Moreover, TRCC requires pre-recorded traffic, which ε-κlibur does not.
Related Timing Covert Channels: Groza et al. presented JitterBug [6], an optimized timing channel that takes legitimate Telnet traffic and adds random delays to it so that the traffic becomes undetectable by multiple covert channel detection metrics. The covert channel was evaluated using keyboard input, which limits the scenario of JitterBug; the detectability using the ε-similarity and the compressibility score was not evaluated. Gianvecchio et al. introduced a model-based covert timing channel (MB-CTC) [7], which aims at evading detection by modeling and mimicking statistical properties of legitimate traffic. To this end, the authors developed a framework that uses an appropriate distribution function in conjunction with a traffic library to build a covert channel. In comparison, we apply a much simpler approach to create our channel and show that the two popular detection methods, ε-similarity and compressibility score, can be circumvented much more easily than expected.
Zander et al. developed another sophisticated covert timing channel [18], which has shown low detectability in comparison to previous attempts. The detectability was evaluated using a Kolmogorov-Smirnov (KS) test and the C4.5 decision tree classifier.
Walls et al. proposed an improved covert timing channel called Liquid [8], which is based on JitterBug. Their goal was to further increase the covert channel's stealthiness when faced with entropy-based detection methods. The authors achieve this by splitting the channel into "transmitting" and "shaping" delays. The shaping delays carry no information but are used to manipulate the statistics of the transmission to more closely resemble the statistics of pre-recorded, legitimate traffic.
In 2015, Archibald and Ghosal presented a covert timing channel that is tailored to fit the behavior of Skype traffic [9]. To this end, and similarly to our approach, the authors applied multiple inter-packet delays. In comparison, our approach is generic (not tailored to a specific application, such as Skype) and targets different countermeasures, namely the ε-similarity and the compressibility score.
It must be noted that covert timing channels also appear in other forms and for other purposes. For instance, Lamshöft et al. recently utilized the timing of port knocking messages to influence syslog messages, so that they store secret information [19]. Moreover, the timing (as well as other metadata) of network traffic is actively manipulated for network flow watermarking [20], [21]. Further, there are timing-based covert channel patterns in addition to the inter-packet times or element positioning patterns [15], [16] analyzed in this paper, such as covert channels exploiting the timing of retransmissions [22], [23], covert channels influencing the throughput of a connection over time [24], and covert timing channels that drop selected network packets [25].

B. Improvements and Derivatives of the Detection Heuristics
Cabuk et al. proposed improvements of the compressibility score called compressibility-walk and CosR-walk to increase the detection performance when faced with mixed (covert and legitimate) traffic, which function as follows: The compressibility-walk uses a sliding window approach to evaluate a long flow of mixed covert channel traffic. For each window position, the compressibility score (κ) is calculated and then plotted. Windows that contain (only) covert channel traffic will be visible as peaks in this plot and could then be inspected more thoroughly. The CosR-walk goes one step further and investigates the relations between consecutive windows [11]. The CosR score is again a similarity score, and the CosR-walk combines this score with a sliding window approach to compare two consecutive windows. With this, it is possible to determine whether two legitimate windows or one legitimate and one covert window follow each other [11], [26]. Since our focus was detection based on a single window, and we only used pure covert channel recordings instead of mixed recordings, we did not investigate these methods further, so as to minimize false-positives.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Related Detection Methods: In 2014, JitterBug, the MB-CTC and the TRCC were evaluated by Archibald et al. [27] using the KS test, Welch's t-test, entropy evaluation, the regularity score and a shape test in conjunction with the WAND NZIX dataset [28] from July 2000, which can no longer be considered a valid example of Internet traffic characteristics in 2022.
Han et al. recently analyzed the detectability of several covert timing channel variants using SVM, kNN, Naive Bayes and Logistic Regression [29], where the highest detection rates were achieved with SVM and kNN. For the Receiver Operating Characteristic (ROC), the authors reached AUC (Area Under Curve) values of 0.9727 (kNN) and 0.9452 (SVM), respectively.
Li et al. proposed another machine learning-based pipeline called Generic and Sensitive (GAS) anomaly detection in [13], which employs a recurrent neural network (RNN), in particular an LSTM. GAS has shown good performance on timing channel detection. They compared GAS with several other detection methods, including statistical methods (KS and regularity), entropy-based methods, and SVM.
Al-Eidi et al. proposed a new detection method called SnapCatch [14]. For this approach, network traffic is first transformed into 16x16 pixel images, from which several features rooted in image processing are extracted, like mean grey value, center of mass and standard deviation of grey values. These features are then used to train several machine learning models. The authors evaluated their feature set against other approaches such as CCE, regularity and entropy. In their tests, SnapCatch outperforms the other evaluated methods, resulting in almost perfect accuracy when detecting simple covert timing channels.
Finally, Wu et al. tested the detectability of different covert timing channels using the ε-similarity, the KS test, entropy and corrected conditional entropy tests as well as the regularity metric in [30]. This is the only current work that evaluates the ε-similarity, and the authors reported that two of the tested channels were detectable with an accuracy of 98% and 100%, while JitterBug and TRCC were not detectable. In comparison to Wu et al., we show that a much simpler covert timing channel can already significantly decrease the performance of the ε-similarity, while our channel can moreover decrease the detection performance of the compressibility score, thus showing key weaknesses in both detection methods.
We conclude that none of the previous works analyzed the compressibility score against sophisticated covert timing channels and only one work analyzed the ε-similarity for such a scenario. Given the high number of citations, we decided to investigate the ε-similarity and the compressibility score in detail and show that they cannot handle ε-κlibur, which, in contrast to previous works, does not rely on the modulation of pre-recorded traffic or on complex traffic generation frameworks while providing a high bitrate.
Derived Heuristics for Alternative Covert Channels: Moreover, the two detection approaches have been adapted by other researchers in order to work with different covert channels. Zillien et al. modified both detection approaches in [23] to work with a covert channel that uses the timing of artificial TCP retransmissions to transmit hidden data, similarly to the IAT covert channel. To adapt the detection methods to this new covert channel, the authors involved the distance between succeeding TCP sequence numbers. This distance measure is then used as the input for the ε-similarity and the compressibility score. The authors found that the ε-similarity is a promising approach to detect the retransmission covert channel, while the compressibility score alone did not perform well enough but could be used as a feature for a more sophisticated detection approach.
Fu et al. created covert channels in IaaS environments, applied the ε-similarity for their detection, and reported satisfying results [31].
Wendzel et al. modified the compressibility score, the ε-similarity as well as the so-called regularity metric to detect covert channels that modulate the sizes of network packets to transfer a secret message [32]. Accuracy, precision, and recall depended on the covert channel's configuration, which implies that these heuristics are suitable to detect highly specific covert channels but remain fragile to disturbances. Further, Mileva et al. used a method based on the compressibility score to detect two covert channels using the MQTT 5.0 IoT protocol. The results have shown that for one of the covert channels, the applied coding influenced the detectability significantly, while the other covert channel was well-detectable with different configurations of their testbed [33].
As can be seen, the ε-similarity and the compressibility score are actively used to detect covert channels of different types, but tests of their functioning on sophisticated timing channels are lacking, leaving these popular methods not well understood.

IV. DESIGN OF ε-κLIBUR
The aim of our research is to find ways to manipulate the IATs of the covert channels in such a way that we minimize the detectability using the ε-similarity and the compressibility score. We further change the statistically observable behavior of the covert channel without compromising the bandwidth or the reliability of the transmission. This means that we try to create a high-bandwidth covert channel that both methods cannot detect. As both detection methods try to find some sort of regularity or pattern in the IATs of network traffic to detect the covert channels, we tried to break this regularity by changing the timing behavior of the covert channel. First, to have a numerical measure of how much our changes would impact the reliability of the covert channels, we developed an impact score that tells us how many symbols would be unintentionally changed by modifying the delay too much.
For this impact score (Algorithm 1), we compare the original delays d_i that the covert channel would normally use with the modified delays d'_i. We determine whether a modification would accidentally move a delay from one side of the decoding threshold t to the other (i.e., whether the symbol interpretation would be flipped).
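The comparison described above could be sketched as follows. Since Algorithm 1 is not reproduced here, this is only a minimal sketch under the assumption that the impact score counts the delays whose side of the decoding threshold changes; the function name and the example delays are hypothetical.

```python
def impact_score(original, modified, threshold):
    """Count delays that moved from one side of the decoding threshold to the other."""
    if len(original) != len(modified):
        raise ValueError("delay lists must have the same length")
    return sum(
        (d < threshold) != (d_mod < threshold)
        for d, d_mod in zip(original, modified)
    )


tau_ms = 5.0
threshold_ms = 1.5 * tau_ms
original = [5.0, 10.0, 5.0, 10.0]

# The third delay (5.0 -> 7.9) crosses the threshold of 7.5 ms: one flipped symbol.
one_flip = impact_score(original, [0.8, 11.2, 7.9, 9.0], threshold_ms)
# All modified delays stay on their original side of the threshold: no flips.
no_flip = impact_score(original, [0.8, 11.2, 1.3, 9.0], threshold_ms)
```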
With this measure, we can quantify how strongly we influence the covert channel's decoding quality. Ideally, the impact score I should remain at 0 to prevent introducing any decoding errors, but depending on the networking environment and whether an error-correcting code is being used, small nonzero values of I might still be tolerable. In our proof-of-concept implementation, we did not apply an error-correcting code.
Algorithm 2: Fuzziness Injection.

A. Injection of Fuzziness
Our approach to break up the temporal structure of the covert channels is the systematic injection of fuzziness. The basic idea of this approach is to make the timings of the covert channel less precise and therefore obscure the structure.
Normally, the covert sender would choose one of the delays, corresponding to the hidden symbol that is to be sent next, and then delay the next network packet by that amount. This results in the covert sender using the same delays over and over again. If the covert channel uses only two symbols, this effect is maximized, as there are only two possible delays. With more than two symbols, this effect is lessened to some extent, as more different delays are used while transmitting the hidden data. But even with several different symbols, a significant degree of structure will still remain.
With our new approach, the covert sender no longer just chooses a delay from a fixed list. Instead, the sender applies a post-processing function to the delay before using it in the sending process. The corresponding function is shown in Algorithm 2. Fig. 3 shows the sorted IATs of regular, covert and ε-κlibur traffic. As visible, the curve for ε-κlibur has two distinct parts, one defined by the normal distribution, the other by the stepwise function. We chose the normal distribution as it resembles the curvature of the legitimate traffic, only scaled down. For the second half of the graph, we chose a stepwise function. If we were to use another normal distribution with an offset (to ensure the separation between two symbols), we would have significantly lower λ values, as the delays are generally higher while their differences stay on the same level as before. To counter the offset, we would have to increase the scale of the distribution to infeasibly high levels. Therefore, we chose the stepwise function. Each of its sharp steps results in a large spike in the λ values, whereas a smooth curve with the same upper and lower bounds would only produce low λ values. This choice of functions allowed us to push down the ε-similarity score further.
The scale of threshold/7 for the normal distribution and the upper limit of 2.4τ in the else branch have shown good results for both detection methods in our empirical evaluation. Our goal was to optimize both detection values simultaneously. We therefore had to find a balance that would reduce the effectiveness of both algorithms at once without accidentally benefitting one or the other.
The threshold choice of (3τ)/2 stems from the configuration of the covert channel. Since the covert channel uses a timing configuration of τ and 2τ, (3τ)/2 gives us a threshold in the middle of the two values. Algorithm 2 accomplishes multiple things:
1) Different IATs are spread apart from each other.
2) IATs closer to 0 and with more zeros after the decimal point are introduced.
3) The "slope" of the sorted IATs becomes more similar to that of legitimate traffic.
4) Delays that were below the decoding threshold will still be below the threshold, and delays above will still be above.
In all our tests, we reached an impact score of I = 0, so we did not introduce any decoding errors.
This algorithm can also be easily adapted to covert channels with more than two symbols by adding more stages to the if-else block.
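To illustrate the idea of such a post-processing function, the following sketch shows one possible shape of it. It is not a reproduction of Algorithm 2, whose body is not given in this text. Only the decoding threshold of 3τ/2, the scale of threshold/7 and the upper limit of 2.4τ are taken from the description above; the half-normal draw for short delays, the number and placement of the steps for long delays, and all names are assumptions.

```python
import random


def fuzzy_delay(delay, tau, rng, n_steps=8):
    """Post-process one delay so that it stays on its side of the decoding threshold."""
    threshold = 1.5 * tau
    if delay < threshold:
        # Short symbol (assumption): half-normal draw with scale threshold / 7,
        # redrawn until it lies strictly between 0 and the threshold.
        while True:
            candidate = abs(rng.gauss(0.0, threshold / 7))
            if 0.0 < candidate < threshold:
                return candidate
    # Long symbol (assumption): one of n_steps evenly spaced levels
    # above the threshold, with the highest level at 2.4 * tau.
    step = (2.4 * tau - threshold) / n_steps
    return threshold + step * rng.randint(1, n_steps)


rng = random.Random(42)
tau_ms = 5.0
threshold_ms = 1.5 * tau_ms
original = [tau_ms if rng.random() < 0.5 else 2 * tau_ms for _ in range(2000)]
modified = [fuzzy_delay(d, tau_ms, rng) for d in original]

# Number of symbols whose interpretation changed (0 by construction).
flipped = sum(
    (d < threshold_ms) != (m < threshold_ms) for d, m in zip(original, modified)
)
long_levels = {m for m in modified if m > threshold_ms}
```

Because both branches keep the modified delay on the original side of the threshold, the receiver can keep decoding with the unchanged threshold of 3τ/2.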
Covert Channel Bitrate: Depending on the τ value, ε-κlibur provides a different bitrate. Our fastest configuration (τ = 5 ms) achieved ≈185 symbols per second. With our encoding scheme, this results in ≈185 bit/s. The original covert channel with the same configuration achieved ≈127 bit/s, so we even increased the bandwidth by introducing delays that are smaller on average.
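As a rough plausibility check of these numbers, the following back-of-the-envelope calculation relates the reported rates to mean IATs. It is not part of the measurements and assumes one bit per symbol and equally likely symbols; the symbols $\bar{t}_{\text{orig}}$ and $\bar{t}_{\varepsilon\kappa}$ for the mean IATs are introduced here for illustration only.

```latex
\begin{align*}
\bar{t}_{\text{orig}} &= \frac{3\tau}{2} = 7.5\,\text{ms}
  \quad\Rightarrow\quad \frac{1}{\bar{t}_{\text{orig}}} \approx 133\ \text{bit/s}
  \quad (\text{ideal; measured: } \approx 127\ \text{bit/s}),\\
\bar{t}_{\varepsilon\kappa} &\approx \frac{1}{185\ \text{s}^{-1}} \approx 5.4\,\text{ms},
  \qquad \frac{185}{127} \approx 1.46 .
\end{align*}
```

Under these assumptions, the measured rate of the original channel lies slightly below its ideal value, and the modified channel corresponds to a mean IAT of roughly 5.4 ms, i.e., an increase of the bitrate by about 46%.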

Implementation of Covert Channels for Evaluation:
In our implementation of the covert channels, we used two timings, τ and 2τ, a short and a long delay, resulting in a Morse-like encoding. The covert receiver monitors the delays of consecutive packets from the covert sender to then measure and decode the IATs. It is also possible to use more delays, resulting in a more complex encoding. We decided against this, as a two-symbol covert channel is the most difficult scenario for circumventing the detection heuristics: it is the easiest to detect, since the introduced structure will be the clearest (as we discussed before).
To create the recordings for the original covert channel, we used the tool CCEAP [34]. The tool offers a simple interface to create different covert channels, including IAT-based covert channels. For our tests, we used various timing intervals to get a broader overview of the performance.
Used Traffic Recordings: The original paper as well as several later works, such as [27] (2014) and [32] (2019), used reference recordings from the NZIX II dataset [28], which was recorded in the year 2000. Since this dataset is now more than 20 years old and networking hardware has changed a lot since then, we decided to create new reference recordings. We chose four different activities that have been most prevalent in recent years: video streaming, video conferences, online gaming and file downloads. All recordings were performed on the WAN interface of a home internet gateway. In total, we recorded 4.8 GByte of reference recordings from which we extracted roughly 1,790,000 packets. We believe that this traffic mix represents the average Internet usage today more closely than the original NZIX recordings.

A. ε-Similarity Evaluation
The first detection method that we evaluated is the ε-similarity. We used the settings and thresholds from the original paper when evaluating our approach. All scores were calculated with a window size of 2,000 packets, as in the original paper.
Our datasets include several different configurations for the covert channel. We used τ values of 5, 10, 20, 30, 40, 50 and 100 ms for our tests.
We evaluated three different splits of covert and legitimate traffic in order to simulate realistic, best- and worst-case situations for the detection (50/50, 1/99 and 99/1 mixture of legitimate/covert channel traffic).
To quantify the performance of the detection methods, we chose the AUC. The AUC generally shows us how well the two datasets (legitimate and covert channel) can be separated regarding the ε-similarity or the compressibility score, respectively. A steeper curve and a higher AUC signal a good separation, while a flatter curve with a lower AUC signals a worse separation. We do not want to minimize the AUC too far, as values below 0.5 can be flipped above 0.5 by inverting the detection labels [35]. Therefore, the goal for ε-κlibur was to bring the AUC as close to 0.5 as possible.
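The role of the AUC and of the label inversion can be illustrated with a small sketch. It uses the rank-based interpretation of the AUC (the probability that a randomly chosen covert score exceeds a randomly chosen legitimate score, with ties counted as one half), which equals the area under the ROC curve. The function name and the toy scores are hypothetical and not taken from our datasets.

```python
def auc(legit_scores, covert_scores):
    """Probability that a random covert score exceeds a random legitimate score."""
    wins = 0.0
    for covert_score in covert_scores:
        for legit_score in legit_scores:
            if covert_score > legit_score:
                wins += 1.0
            elif covert_score == legit_score:
                wins += 0.5  # ties count as one half
    return wins / (len(covert_scores) * len(legit_scores))


legit = [0.2, 0.4, 0.6, 0.8]

separable = auc(legit, [0.97, 0.98, 0.99])  # all covert scores above all legitimate ones
blended = auc(legit, [0.3, 0.5, 0.7])       # covert scores interleaved with legitimate ones
too_low = auc(legit, [0.1, 0.3])            # covert scores mostly below legitimate ones

# A detector can invert its labels, which turns an AUC of a into 1 - a.
after_inversion = 1.0 - too_low
```

In this toy example, pushing the covert scores below the legitimate ones yields an AUC of 0.125, which a detector could turn into 0.875 by inverting its labels, whereas the interleaved scores yield 0.5 and offer no such shortcut.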
In Fig. 4, we compare the box plots of the ε-similarity for legitimate, original covert channel and ε-κlibur traffic. We can see that the scores for the legitimate traffic are spread and their median rises with the ε-threshold. The ε-similarity scores for the original covert channel are above 0.95 for all ε-thresholds. Thus, it is easy to see that the original covert channel can be detected with this method, while ε-κlibur produces significantly lower values than the original covert channel. While the ε-similarity values for the ε-κlibur dataset are still closer to each other compared to the values of legitimate traffic, we can observe that the values blend in better with the legitimate traffic. This shows us that the basic principle of our approach works for the ε-similarity.
Table I provides the results for the different splits in the datasets. The ε-similarity can easily detect all the original covert channels in the various dataset splits, as all AUC values are 1.00 (rounded to 2 decimals). We can also see that the different ε-thresholds are all equally efficient in this situation.
Table II gives an overview of the AUC values in relation to different τ values. For the original covert channel, all values are again 1.00. With this and the results from Table I, we can observe that the delay configuration of the covert channel and the split of the dataset do not have an influence on the results of the ε-similarity.
The AUC values for ε-κlibur show a clear change in the detection performance. In Table I, we can see that our approach generally reduces the AUC across all ε-thresholds. We can also observe that the split of the dataset does not influence the effectiveness of our approach. Table II shows the effectiveness of the fuzziness in relation to the covert channel configuration and ε-threshold. We can again determine that the effectiveness of ε-κlibur is reduced with larger ε-thresholds, but even the best detection result is still only at AUC = 0.87. This would limit the usefulness of the detection heuristic in a real-life scenario, given that several GBit/s of flow data would need to be processed, where even a small false-positive rate would render the approach impractical. Moreover, the best value (AUC = 0.87) applies only to one type of covert channel, while realistic setups have to deal with different potential covert channel configurations, for which the detection does not perform well, as already shown in Table I.
Table II shows that the detectability of ε-κlibur changes depending on the τ values. Specifically for higher τ values, it is harder to produce higher relative differences, as the decoding threshold forces higher IATs. This explains the worse performance for higher ε-thresholds with higher τ values.
Fig. 5(a) shows that we obtain a perfect detection of the unmodified covert channels, while in Fig. 5(b), we can observe that our modifications not only reduced the AUC to 0.48 but also pushed the curve closer to the diagonal line. This means that the detection algorithm provided unfavorable scaling between true- and false-positives over large portions of the entire range, i.e., only after 60% false-positives could we significantly gain additional true-positives without also suffering more false-positives. For our approach, this is an almost optimal result.
Figure caption: ε-similarity score comparison (mixed traffic with all τ configurations).
In Fig. 5(c) and (d), we compare the ROC curves for two different covert channel configurations. Fig. 5(c) shows the curves for τ = 5 ms and Fig. 5(d) those for τ = 30 ms. Both plots show the results for ε-κlibur.
In Fig. 5(c), the AUC is larger than in Fig. 5(d), but both have the same distance from 0.5. Both figures show a rather steep ROC curve, so if we only look at a single covert channel configuration, there is a certain threshold beyond which we only gain true-positives without suffering many more false-positives.
While this technically denotes worse performance of our approach, we believe that the detection algorithm is still not usable in this state.
In a real-world scenario, a detector would have to base its detection thresholds on all possible covert channels, so comparing only a single covert channel configuration does not give a realistic picture. We therefore mostly focused our evaluations on datasets that include multiple covert channel configurations. However, the more detection thresholds are applied simultaneously, the higher the number of cumulative false-positives.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

B. Compressibility Score Evaluation
The second detection heuristic that we evaluated is the compressibility score. We again used the settings from the original paper, with a window size of 2,000 packets. As with the ε-similarity, we applied different configurations for the covert channel delays in order to obtain a broader view of the performance of our approach. We used the same configuration of τ and 2τ for short and long delays as well as the same list of τ values of 5, 10, 20, 30, 40, 50, and 100 ms.
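To make the metric concrete, the following is a minimal sketch of a compressibility-style score, assuming that a window of IATs is rounded to whole milliseconds, joined into a string, and compressed with zlib. The string encoding, the function name `compressibility`, and the uniformly spread stand-in traffic are illustrative assumptions; they are neither the exact encoding of Section II.B.2 nor the ε-κlibur algorithm.

```python
import random
import zlib


def compressibility(iats_ms, digits=0):
    """Ratio of uncompressed to compressed length of an IAT string.

    Assumption: each IAT is rounded to `digits` decimals and the values
    are joined with commas. The original heuristic uses its own encoding.
    """
    text = ",".join(f"{iat:.{digits}f}" for iat in iats_ms).encode("ascii")
    return len(text) / len(zlib.compress(text, 9))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
tau = 20.0              # short delay in ms; the long delay is 2 * tau
window = 2000           # window size used in the evaluation

# Two-symbol timing channel: every IAT is either tau or 2 * tau.
covert = [tau if rng.random() < 0.5 else 2 * tau for _ in range(window)]

# Hypothetical stand-in for widely spread timings (not real traffic).
spread = [rng.uniform(1.0, 200.0) for _ in range(window)]

kappa_covert = compressibility(covert)
kappa_spread = compressibility(spread)
print(f"two-symbol channel: {kappa_covert:.2f}")
print(f"spread timings:     {kappa_spread:.2f}")
```

A string built from only two distinct values is highly repetitive and compresses far better than one built from many distinct values, which is the property the heuristic relies on.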
As with the ε-similarity, our evaluation was based on the AUC of a ROC curve as a measure of how well the detection algorithm works on the original covert channel and ε-κlibur. Our goal was again an AUC of 0.5 with a slope as close to 1 as possible.
Fig. 6 compares the histograms of the compressibility scores of legitimate, covert, and ε-κlibur recordings. As with the ε-similarity, we observe a clear difference between the covert channel and ε-κlibur. The compressibility values are significantly lower for ε-κlibur and blend in well with the legitimate values. This shows that the basic idea of the fuzziness injection also works for the compressibility score. Fig. 6 moreover reveals different distributions of the compressibility scores for legitimate, covert, and ε-κlibur traffic. ε-κlibur's distribution of κ values overlaps significantly with that of legitimate traffic.
Table III lists the AUC values for the compressibility score for different splits of the dataset. We can see that the compressibility score can also detect all original covert channels for each tested split. All AUC values are again 1.00 (rounded to two decimals), so the algorithm can perfectly distinguish between covert channel and legitimate traffic.
Table IV presents the detection performance in relation to the different delay configurations. We can again observe that the compressibility score can detect all original covert channels, regardless of their configuration. Thus, as with the ε-similarity, delay configurations and dataset splits have no impact on the raw performance of the compressibility score.
If we look at the AUC values for ε-κlibur, we can again notice a clear difference. Table III shows that the AUC values are lower, at around 0.40. We can thus conclude that our approach also works for the compressibility score.
Table IV demonstrates that our approach works for all τ configurations of the covert channel. The effect of the fuzziness depends on the delay configuration. Some values are below 0.5 while others are above 0.5, and if we combine all recordings in a dataset, we obtain AUC values around 0.4.
This scaling behavior can be explained by the second part of the fuzziness injection algorithm. Since the upper bound for the random delays scales with the value of τ, we get a larger spread in possible values, which in turn results in more distinct elements in the string (see Section II.B.2) and thus in a lower compressibility. Even if we take the worst performance (from our viewpoint), which is an AUC of 0.16, and flip the labels, we would end up with an AUC of 0.84. But this result is limited to channels with τ = 100 ms and is not readily usable in a real-world scenario, as the detector could not optimize its thresholds for only a single covert channel configuration but would rather have to apply the detection to all possible covert channels (scenario of Table III). Even if we focus on only this single worst-case configuration, we would still suffer from a high false-positive rate (>13%) if we wanted to achieve a true-positive rate of over 82%.
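The label-flipping argument can be illustrated with a small sketch. It computes the AUC as the probability that a randomly chosen covert sample receives a higher score than a randomly chosen legitimate sample, and shows that inverting the decision direction turns an AUC of a into 1 - a. The scores below are made-up illustration values, not measurements from our datasets, and the helper names are hypothetical.

```python
def auc(scores_pos, scores_neg):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def distance_from_chance(auc_value):
    """How far a detector is from random guessing (AUC = 0.5)."""
    return abs(auc_value - 0.5)


# Hypothetical detector scores: the covert samples mostly score LOWER
# than the legitimate ones, so the plain AUC falls below 0.5.
covert_scores = [0.1, 0.2, 0.3, 0.6]
legit_scores = [0.4, 0.5, 0.7, 0.8]

plain = auc(covert_scores, legit_scores)
# Flipping the labels is equivalent to negating every score.
flipped = auc([-s for s in covert_scores], [-s for s in legit_scores])

print(plain, flipped)  # the two values sum to 1
```

This is why an AUC of 0.16 is as informative to a defender as an AUC of 0.84, and why the distance from 0.5 is the relevant quantity for the covert channel.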
In Fig. 7, we compare the ROC curves for the original covert channel and ε-κlibur. The figures underpin the perfect detection of the original covert channel in contrast to the results for ε-κlibur. Not only did we reduce the AUC to around 0.4, but we also achieved an almost diagonal line. This means that the detector scales poorly between true- and false-positives throughout the entire range.

VI. ENHANCED DETECTION APPROACH BASED ON ε-SIMILARITY
In this section, we discuss an adaptation of the ε-similarity that can be used to counter ε-κlibur. Our adaptation follows the original approach closely but has one significant modification. Instead of looking at an entire window at once, we first sort the IATs and then divide the window into three subwindows of (nearly) equal size. Thus, with an original window size of 2,000, the first and second subwindows will each contain 667 IATs and the third subwindow will contain 666 IATs.
As both symbols of the covert channel occur with an equal probability of p ≈ 0.5, the first subwindow contains only low IATs, the second a mix of low and high IATs, and the last only high IATs. With legitimate traffic, we see a steep incline in the last subwindow (see Fig. 3). The original covert channel is flat, and our ε-κlibur has several steps in this subwindow. ε-κlibur is subject to two constraints in the last subwindow. First, there is a hard lower bound, as no IATs below the decoding threshold can reside in this subwindow. Second, although we could theoretically use arbitrarily large IATs, this would not be feasible in a real-life scenario. The IATs in this third subwindow are therefore constrained between two bounds, which in turn leads to lower relative differences in this subwindow. Consequently, we can use this third subwindow for the new detection measure by applying the original ε-similarity to it. We evaluated our new detection approach against the original covert channel and ε-κlibur.
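The procedure described above can be sketched as follows. The sketch assumes the usual formulation of the ε-similarity, i.e., the fraction of relative differences |P_i - P_{i+1}| / P_i between consecutive sorted IATs that fall below ε. All function names are illustrative, and the code is not the implementation used in our evaluation.

```python
def epsilon_similarity(sorted_iats, eps):
    """Fraction of relative differences between neighbors that are below eps."""
    diffs = [
        abs(b - a) / a
        for a, b in zip(sorted_iats, sorted_iats[1:])
        if a > 0  # skip zero IATs to avoid division by zero
    ]
    if not diffs:
        return 0.0
    return sum(d < eps for d in diffs) / len(diffs)


def split_three(values):
    """Split a list into three consecutive parts of nearly equal size."""
    base, extra = divmod(len(values), 3)
    sizes = [base + (1 if i < extra else 0) for i in range(3)]
    parts, start = [], 0
    for size in sizes:
        parts.append(values[start:start + size])
        start += size
    return parts


def enhanced_epsilon_similarity(iats, eps):
    """Sort the window, keep the third subwindow, apply the epsilon-similarity."""
    third = split_three(sorted(iats))[2]
    return epsilon_similarity(third, eps)


# Two-symbol channel with tau = 20 ms: the third subwindow holds only 40 ms.
two_symbol = [20.0] * 1000 + [40.0] * 1000
print([len(part) for part in split_three(two_symbol)])  # subwindow sizes
print(enhanced_epsilon_similarity(two_symbol, 0.1))
```

For a window of 2,000 IATs, `split_three` yields subwindows of 667, 667, and 666 elements, matching the sizes given above. Timings whose largest third is bounded and flat score high, while timings with a steep incline in that third score low.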
Fig. 8 shows the detection results for the two different covert channels, the original one and our ε-κlibur. The results show that we were able to increase the detection performance for our ε-κlibur without significantly hurting the detection performance for the original covert channel.

TABLE V AUC - TRCC, ε-SIMILARITY VERSUS ENHANCED ε-SIMILARITY

TRCC Evaluation: We decided to perform an additional evaluation of our enhanced ε-similarity heuristic to analyze whether it can also detect the TRCC (see Section III.A) better than the original approach. TRCC depends on legitimate traffic. For this reason, we tested TRCC with two different reference recordings and found drastically different detection results for the two recordings.
Table V(a) shows the results for our first reference recording (a recording of an online gaming session). With this recording, the TRCC was hard to detect for the original ε-similarity (except for the last ε-threshold), but our enhanced detection approach delivered significantly improved results with AUC values between 0.80 and 0.88, although these are still not optimally suited for real-world scenarios.
With the second reference recording (a recording of a Microsoft Teams meeting), shown in Table V(b), the TRCC was well detectable by the original ε-similarity with the first three ε-thresholds (AUC > 90%). ε = 0.02 delivered unusable detection performance, while the last two ε-thresholds again delivered good performance, although with flipped labels (AUC < 10%). Our enhanced ε-similarity delivered worse performance in some cases and better performance in others compared to the original ε-similarity. Generally, the enhanced ε-similarity delivered a more consistent detection performance. The strong fluctuations in detection performance across the different ε-thresholds and reference recordings lead to inconclusive results. Therefore, we believe that the detection performance depends more on the reference recording than on anything else, and thus further research focusing solely on the parameters and statistics of the reference recordings would be needed to sufficiently evaluate the TRCC.

VII. EVALUATION OF ε-κLIBUR WITH SOPHISTICATED DETECTION METHODS
So far, we have shown that ε-κlibur can circumvent both the compressibility score and the ε-similarity, and that the enhanced ε-similarity method outperforms both classical heuristics. While the main focus of our work was to demonstrate weaknesses and exploitability of these classical heuristics, we additionally evaluate ε-κlibur against two recent machine learning-based detection methods: GAS [13] and SnapCatch [14], as introduced in Section III.B. Since ε-κlibur was solely tailored to circumvent the compressibility score and the ε-similarity, we slightly adjusted our method by adding an additional outlier timing to the high inter-arrival signal of ε-κlibur (the else-branch of Algorithm 2 is slightly altered for this purpose); we call this variant ε-κlibur-O. The idea here is to stretch the overall distribution of the timings further apart. Regular traffic showed a steep increase in delays (see Fig. 3), which we wanted to imitate with these outliers. We used an outlier timing of 10τ.

A. Performance of ε-κlibur-O Against Compressibility and ε-Similarity
Our evaluation has shown that ε-κlibur-O yields similar results to ε-κlibur when the compressibility score or the ε-similarity is used. For this evaluation, we again used the AUC value as a performance metric. As a high AUC value indicates good detection performance and a low AUC value (below 0.5) also leads to good detection by flipping the labels, we focused on the distance of the AUC score to 0.5. Fig. 9 shows the AUC scores for the compressibility of ε-κlibur and ε-κlibur-O based on a) dataset split (as explained in Section V.A) and b) covert channel configuration. We can see that for most configurations there is no significant deviation for ε-κlibur-O compared to ε-κlibur. For the configurations with τ = 5 ms and 10 ms, we can even observe a better performance for ε-κlibur-O, as the AUC value [...]
Similarly to the compressibility score, some ε-similarity configurations suffered slightly in performance while others slightly gained. For most configurations of the compressibility and ε-similarity, the changes in AUC are below 0.1. There is thus no clear trend across all configurations that points towards ε-κlibur-O being significantly easier or harder to detect than ε-κlibur when faced with compressibility or ε-similarity.

B. Performance of ε-κlibur-O Against the Enhanced ε-Similarity
We also evaluated ε-κlibur-O against our enhanced version of the ε-similarity. We ran the same evaluations as before and used the AUC as the performance metric. In our experiment, we found that the AUC remained nearly unchanged for most configurations and only showed a noticeable degradation at ε = 0.1 (0.1 AUC decrease). Therefore, we can conclude that our enhanced detection performs equally well on ε-κlibur-O as on ε-κlibur.

C. GAS
To evaluate the GAS detection approach, we used the code and models provided by the original authors [13] and conducted three tests. We first tested the original covert channel, then ε-κlibur and ε-κlibur-O. Fig. 12 shows the results of our tests. We could reproduce the results of the original paper, as GAS achieved good detection results with AUC values of 0.97 and above for the original covert channel. When faced with ε-κlibur and ε-κlibur-O, on the other hand, we observed a significant performance degradation. In line with the original paper, we evaluated the performance with the "Labnet" and "Bignet" setups and with varying window sizes. In the "Labnet" setup, ε-κlibur and ε-κlibur-O degraded the detection performance significantly and pushed the AUC for all window sizes down to values between 0.55 and 0.6. The performance in the "Bignet" setup was better, but ε-κlibur and ε-κlibur-O still showed a sizeable impact with AUC values ranging from 0.68 to 0.92. We thus conclude that both ε-κlibur and ε-κlibur-O significantly impact the performance of GAS, especially for smaller sample lengths.

D. SnapCatch
Similar to GAS, we evaluated SnapCatch [14] with three datasets: the original covert channel, ε-κlibur, and ε-κlibur-O. We re-implemented the feature extraction of SnapCatch and used the resulting features to train an SVM model. In each case, we trained the model with the original covert channel and tested the resulting model against the other dataset. Thus, we assume a defender with knowledge about the classical covert channel but without knowledge of ε-κlibur and ε-κlibur-O. Fig. 13 shows the results of these experiments. SnapCatch achieved an outstanding detection of the original covert channel with an AUC of 1.0. ε-κlibur likewise showed no significant degradation in detection performance, also resulting in an AUC of 1.0. ε-κlibur-O, on the other hand, succeeded in decreasing the performance of SnapCatch and resulted in an AUC of 0.59. The data processing pipeline of SnapCatch includes a step in which the timings of each window are normalized and mapped to the range between 0 and 255. The timing spread of regular traffic resulted in images that are generally dark with only a few bright pixels, whereas the original ε-κlibur resulted in images with around 50% bright and 50% dark pixels. The outlier of ε-κlibur-O influenced the normalization step, so the images again had a few bright pixels in a generally darker image, which helped to bring the pixel values closer to those of regular traffic.
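The effect of a single outlier on such a normalization step can be reproduced with a small sketch. It assumes a plain min-max normalization to the range 0 to 255 and uses simple two-level timings (τ and 2τ) as a stand-in for the covert channel; neither the function names nor the stand-in data are taken from SnapCatch or from Algorithm 2.

```python
def to_pixels(iats):
    """Min-max normalize one window of IATs to integer values in 0..255."""
    lo, hi = min(iats), max(iats)
    if hi == lo:
        return [0] * len(iats)
    return [round(255 * (x - lo) / (hi - lo)) for x in iats]


def bright_fraction(pixels, threshold=128):
    """Share of pixels at or above the (assumed) brightness threshold."""
    return sum(p >= threshold for p in pixels) / len(pixels)


tau = 20.0  # ms

# Stand-in window without outlier: IATs alternate between tau and 2 * tau.
plain = [tau, 2 * tau] * 50

# Same window with one outlier timing of 10 * tau appended.
with_outlier = plain + [10 * tau]

print(bright_fraction(to_pixels(plain)))
print(bright_fraction(to_pixels(with_outlier)))
```

Without the outlier, the long delays map to the maximum pixel value, so half of the image is bright. With the outlier, the maximum of the window is 10τ, the regular long delays are pushed towards the dark end of the scale, and only the outlier itself remains bright.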

VIII. CONCLUSION
First, we have shown that the two highly cited covert channel detection metrics, ε-similarity and compressibility score, can be defeated with a simple covert channel that we call ε-κlibur. In comparison to previous attempts, ε-κlibur is easier to construct and requires no pre-recorded traffic while providing a non-degraded bitrate. Second, we introduced an enhanced ε-similarity that is able to detect both the original covert timing channel and ε-κlibur. Third, we evaluated ε-κlibur against two more recent approaches: ε-κlibur can defeat GAS, and a slight variation of ε-κlibur called ε-κlibur-O can also significantly lower the performance of SnapCatch. We conclude that the enhanced ε-similarity heuristic can compete with and partially outperform the most recent ML-based methods. However, in comparison to the other methods, the enhanced ε-similarity was not evaluated against other sophisticated timing channels, such as MB-CTC or JitterBug.
In future work, we plan to extend our work to additional detection algorithms as well as covert storage channels. We also plan to evaluate the enhanced ε-similarity against other timing channels.

Fig. 2. Sorted IATs of the two-symbol covert channel and legitimate traffic.

TABLE I ε-SIMILARITY: AUC VALUES OF MIXED TRAFFIC WITH ALL τ CONFIGURATIONS (THRESHOLDS AS GIVEN BY CABUK ET AL.)

TABLE II ε-SIMILARITY: AUC VALUES FOR ISOLATED COVERT CHANNEL CONFIGURATIONS (THRESHOLDS AS GIVEN BY CABUK ET AL.)