Detecting Steganography in Inactive Voice-Over-IP Frames Based on Statistic Characteristics of Fundamental Frequency

Steganography in inactive Voice-over-IP frames is a new technique of information hiding, which can achieve large steganographic capacity while maintaining excellent imperceptibility. To prevent the illegitimate use of this technique, entropy-based and poker test-based steganalysis methods have been presented. However, the detection performance of these two methods degrades when only a small number of inactive frames is available or when the embedding rate is low. Thus, we present a new steganalysis method based on statistical characteristics of fundamental frequency. Specifically, we employ the statistics for zero-crossing count (ZCC), including the average ZCC of inactive frames, the ratio between the average ZCC of inactive frames and that of all frames, and the difference between the average ZCC of inactive frames and their calibrated versions, to characterize the frame-level dynamic characteristic of speech signals; we utilize the average values of Mel-frequency cepstral coefficients (MFCCs) to represent the invariant characteristic of inactive frames; further, using the feature set consisting of the zero-crossing statistics and average MFCCs, we propose a support-vector-machine based steganalysis for inactive speech frames. The proposed steganalysis method is evaluated with a large number of ITU-T G.723.1 encoded speech samples, and compared with the existing methods. The experimental results demonstrate that the proposed method significantly outperforms the previous ones in detection accuracy, false positive rate and false negative rate at any given embedding rate and for the same number of inactive frames. In particular, the proposed method can provide accurate detection results for the existing steganographic methods using only a very small number of inactive frames, and can thereby be employed to detect potential inactive-frame steganography in real-time speech streams.

many advantages of VoIP for information hiding, such as instantaneity, huge amounts of carrier data, high steganographic bandwidth, and alterable conversation length [6]- [9]. Therefore, VoIP-based steganography is popularly regarded as one of ideal solutions for secure communications. However, it might be also abused by lawbreakers, terrorists and hackers for cybercriminal activities, because unauthorized information flow with its help can covertly pass through firewalls and monitors without being noticed [12]. Therefore, in order to improve cybersecurity, it is indispensable to develop the corresponding countermeasure technique, i.e., steganalysis of VoIP, whose primary aim is to detect covert communications based on VoIP accurately [13]- [15]. In this paper, we focus on detecting steganography in inactive frames in VoIP streams, which is still an open problem.
In general, VoIP-based steganography can be divided into two categories [6]- [9]. One employs the network protocols as carriers [10], [16]- [18], while the other hides information by modifying payloads in speech streams [7]- [9], [19]- [28]. Due to its high steganographic capacity, the second category has been the mainstream of VoIP-based steganography. In VoIP, to obtain the required low data rates, speech signals are often encoded into digital frame streams using code excited linear prediction (CELP) codecs, such as ITU-T G.723.1, ITU-T G.729a, Speex, Internet Low Bitrate Codec (iLBC) and adaptive multi-rate (AMR) codec. Accordingly, most of the payload-based steganographic algorithms achieve information hiding by modifying some specific parameters in speech frames, including linear predictive coefficients (LPC) [19]- [21], fixed codebook (FCB) parameters [22]- [24] and adaptive codebook (ACB) parameters [25]- [27]. In addition, differing from the steganographic methods based on the modification of specific parameters, Huang et al. [11] presented a novel high-capacity steganographic algorithm that hides information in the inactive frames of VoIP streams. The work suggested that steganography in inactive frames can achieve much larger steganographic capacity than that in active frames, while maintaining the same imperceptibility. For speech streams encoded with the ITU-T G.723.1 codec at 6.3 kbps mode, the proposed method can obtain a steganographic bandwidth of up to 101 bits per frame. Further, Lin [28] extended the idea to speech streams encoded with the ITU-T G.723.1 codec at 5.3 kbps mode; the experimental results show that the presented method can achieve a steganographic bandwidth of up to 81 bits per frame without causing perceptible degradation of speech quality.
As for the steganalysis of VoIP, there have also been many fruitful studies [13]- [15], [29]- [35]. For example, Lin et al. [15] introduced a recurrent neural network to detect quantization index modulation-based steganography for LPC parameters in G.729a speech streams, which can achieve excellent detection performance, even for very short speech samples, and significantly outperforms the steganalysis based on quantization codeword correlation network [29]. To detect steganography for FCB parameters in AMR speech streams, Tian et al. [14] presented a support-vector-machine (SVM) based steganalysis method using three kinds of statistical features for pulse pairs, namely, long-term distribution features based on the probability distributions of pulse pairs, short-term invariant features based on Markov transition probabilities of pulse pairs, and track-to-track correlation features based on the joint probability matrices of pulse pairs. Moreover, they proposed a feature selection mechanism based on adaptive boosting to optimize the feature set as well as reduce its dimension. The experimental results demonstrate that their method can effectively detect the state-of-the-art steganography based on FCB parameters, and achieve much better detection performance than the steganalysis method of Miao et al. [31] and the one based on the probability of same pulse position [32]. As for detecting steganography for ACB parameters, Ren et al. [34] presented an SVM-based steganalysis method, which uses the matrix of the second-order difference of pitch delay (MSDPD) as the detection features. Moreover, they employed the calibration method to obtain the calibrated MSDPD features to further enhance the detection accuracy. The experimental results demonstrated that it is by far the best method for detecting steganography based on ACB parameters in AMR speech streams. By contrast, the steganalysis for inactive speech frames is largely unexplored.
Recently, two classical statistics, i.e., entropy and poker test statistic, were employed as the steganalysis features to detect steganography in inactive speech frames [35]. The experimental results show that these two methods are feasible, and that the latter is better than the former. However, the detection performance of these methods degrades for short sample lengths or low embedding rates. Moreover, our observations through research and experiments suggest that the embedding operations significantly affect the statistical characteristics of fundamental frequency for inactive speech frames. Thus, we present a new steganalysis method based on the statistical characteristics of fundamental frequency. Specifically, the statistics for zero-crossing count (ZCC) are employed as the steganalysis features, including the average ZCC of inactive frames, the ratio between the average ZCC of inactive frames and that of all frames, and the difference between the average ZCC of inactive frames and their calibrated versions. These statistics are used to characterize the frame-level dynamic characteristic of speech signals. Moreover, the average values of Mel-frequency cepstral coefficients (MFCCs) are employed to represent the invariant characteristic of inactive frames. Note that, differing from the previous Mel-frequency cepstrum-based steganalysis schemes [36], [37], we directly employ the original 12-dimensional MFCCs without calculating the first-order or second-order differences of MFCCs, because the MFCCs in inactive frames are independent, meaning that there is no correlation between any two MFCCs. Further, using the feature set consisting of the zero-crossing statistics and average values of MFCCs, an SVM-based steganalysis for inactive speech frames is presented. The proposed method is evaluated with a large number of ITU-T G.723.1 encoded speech samples, and compared with entropy-based [38] and poker test-based [35] methods.
The experimental results demonstrate that the proposed method significantly outperforms the previous ones in detection accuracy, false positive rate and false negative rate for any given embedding rates or using the same quantity of inactive frames.
The rest of this paper is organized as follows. Section 2 analyses how steganography in inactive frames affects the statistical characteristics of fundamental frequency, and presents two types of detection features, i.e., the zero-crossing statistics and average values of MFCCs. An SVM-based steganalysis scheme is proposed in Section 3. The performance evaluation through comprehensive experiments is described in Section 4. Finally, Section 5 offers the concluding remarks.

II. CHARACTERISTICS OF FUNDAMENTAL FREQUENCY FOR INACTIVE FRAMES
Assume that a speech signal contains $N_S$ inactive frames, each of which contains $n$ samples. Let the set of inactive frames be $S = \{s_i \mid i = 1, 2, \ldots, N_S\}$, and each inactive frame be $s_i = \{r_{i,j} \mid j = 1, 2, \ldots, n\}$, where $r_{i,j}$ is the $j$-th sample of $s_i$. For each inactive frame, the encoding process can be described as

$$s_i^* = E(s_i),$$

where $E(\cdot)$ denotes the encoding operation and $s_i^*$ is the $i$-th encoded inactive frame. Accordingly, the set of the encoded inactive frames can be denoted as $S^* = \{s_i^* \mid i = 1, 2, \ldots, N_S\}$. Further, the process for embedding secret information into an encoded inactive frame can be stated as

$$\tilde{s}_i = \psi(s_i^*),$$

where $\psi(\cdot)$ is the steganographic operation and $\tilde{s}_i$ is the steganographic version of the $i$-th encoded inactive frame. The decoding process of $s_i^*$ is

$$D(s_i^*) = s_i,$$

where $D(\cdot)$ denotes the decoding operation. According to the additive noise model for steganography [39], [40], the decoding process of $\tilde{s}_i$ is

$$D(\tilde{s}_i) = s_i + \varepsilon_i,$$

where $\varepsilon_i$ is the additive noise generated by the steganographic operation on the $i$-th inactive frame. This equation suggests that the steganographic operation inevitably affects the decoded signal of inactive frames. In addition, fundamental frequency estimation is widely applied in the field of speech signal processing [43]- [46], particularly in voice activity detection [45], [46]. Inspired by these successful applications, we study the impact of steganography in inactive frames on fundamental frequency characteristics, and find that the statistics for zero-crossing count and Mel-frequency cepstral coefficients are eminently suitable for discriminating between cover and steganographic speech samples. In the following text, we describe in detail how we exploit these fundamental frequency statistics as the steganalysis features.

A. STATISTICS FOR ZERO-CROSSING COUNTS
In the field of signal processing, zero-crossing counts (ZCC) are widely employed to characterize the frequency of a given signal [47]. In particular, the zero-crossing counts can help differentiate between active and silent speech. In general, for the $i$-th speech frame in a sample, denoted as $f_i = \{r_{i,j} \mid j = 1, 2, \ldots, n\}$, the ZCC $\lambda$ can be calculated as

$$\lambda = \sum_{j=2}^{n} \delta\big(\xi(r_{i,j})\,\xi(r_{i,j-1})\big),$$

where $\xi(x)$ is the sign function, namely,

$$\xi(x) = \begin{cases} 1, & x \ge 0, \\ -1, & x < 0, \end{cases}$$

and $\delta(x)$ is a discriminant function, which is given by

$$\delta(x) = \begin{cases} 1, & x < 0, \\ 0, & \text{otherwise.} \end{cases}$$

For a normal inactive frame, the ZCC should be 0 in theory, since the value of each sample in the inactive frame is equal to 0. However, as mentioned above, if secret information is embedded into the inactive frame, the values of some samples are no longer equal to 0, due to the noise induced by steganography. Accordingly, the ZCC for a steganographic inactive frame would no longer be equal to 0. In this sense, the ZCC can be used to distinguish between normal and steganographic inactive frames. Moreover, as the embedding rate increases, more sample values in the inactive frame are modified by the steganographic operation, which suggests that the change of the ZCC for a steganographic inactive frame at a high embedding rate is larger than that at a low embedding rate.
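The zero-crossing count described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes that the sign function maps non-negative samples to +1 and negative samples to -1, and that a crossing is counted whenever two consecutive samples differ in sign. The function names and the 240-sample frame length (one 30 ms G.723.1 frame at 8 kHz) are choices made for the example.

```python
def sign(x):
    """Sign function xi(x): +1 for non-negative samples, -1 for negative ones (assumed convention)."""
    return 1 if x >= 0 else -1


def zero_crossing_count(frame):
    """Count sign changes between consecutive samples of one decoded frame."""
    return sum(
        1
        for previous, current in zip(frame, frame[1:])
        if sign(previous) * sign(current) < 0
    )


# An all-zero (silent) frame has no sign changes.
silent_frame = [0] * 240
print(zero_crossing_count(silent_frame))   # 0

# A frame perturbed by small embedding-like noise does.
noisy_frame = [0, 3, -2, 0, -1, 4]
print(zero_crossing_count(noisy_frame))    # 4
```

Under this convention a zero sample is treated as non-negative, so an all-zero inactive frame yields a count of 0, which matches the theoretical value stated in the text.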
In addition, because there are different numbers of inactive frames in different speech samples, it is hard to use all the ZCCs of the inactive frames as the detection feature. Instead, the average ZCC of the inactive frames in each speech sample is utilized in practice. To verify the above deduction, we compare the distribution of the average ZCCs for randomly chosen 1000 cover speech samples with that for the corresponding steganographic samples at the embedding rate of 100%, as shown in Figure 1. The experimental results show there are obvious distinctions between the average ZCCs for cover speech samples and those for steganographic samples, meaning that it is feasible to employ the average ZCC as a steganalysis feature to detect the steganographic behavior in inactive frames.
In the steganography for inactive frames, the secret information is embedded into inactive frames. Thus, only the ZCCs of inactive frames would increase, while those of active frames are unaffected. For a given cover speech sample, the ratio between the average ZCC of inactive frames and that of all frames, denoted as $\omega$, can be calculated as

$$\omega = \frac{\bar{\lambda}_S}{\bar{\lambda}},$$

where $\bar{\lambda}_S$ is the average ZCC of inactive frames, and $\bar{\lambda}$ is the average ZCC of all frames. Assume that the number of inactive frames is $N_S$, the sum of the ZCCs of all inactive frames is $Z_S$, the number of active frames is $N_A$, and the sum of the ZCCs of all active frames is $Z_A$; then $\omega$ can be further written as

$$\omega = \frac{Z_S / N_S}{(Z_S + Z_A) / (N_S + N_A)} = \frac{Z_S \,(N_S + N_A)}{N_S \,(Z_S + Z_A)}.$$

Apparently, for the steganographic version of the given speech sample, the sum of the ZCCs of all inactive frames (denoted by $\tilde{Z}_S$) is larger than $Z_S$. Because $\omega$ increases monotonically with $Z_S$ whenever $Z_A > 0$, we have $\tilde{\omega} > \omega$, where $\tilde{\omega}$ is the ratio between the average ZCC of inactive frames and that of all frames for the steganographic sample. That is to say, the ratio between the average ZCC of inactive frames and that of all frames (simply called ZCC_Ratio) is changed by the steganographic operation, and can thereby be employed to detect steganography in inactive frames. Similarly, we compare the distribution of ZCC_Ratios for 1000 randomly chosen cover speech samples with that of ZCC_Ratios for the corresponding steganographic samples at the embedding rate of 100%, as shown in Figure 2. The experimental results show that there are obvious distinctions between the ZCC_Ratios for cover speech samples and those for steganographic samples, and thereby demonstrate the feasibility of using ZCC_Ratio as a steganalysis feature.
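As a small worked illustration of ZCC_Ratio, the following Python sketch computes the ratio from per-frame zero-crossing counts and checks it against the closed form in terms of the inactive and active ZCC sums and frame counts. The per-frame counts are made-up numbers chosen for the example, not measurements, and the function names are hypothetical.

```python
def average(values):
    """Arithmetic mean; returns 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def zcc_ratio(inactive_zccs, active_zccs):
    """Ratio between the average ZCC of inactive frames and that of all frames."""
    mean_all = average(inactive_zccs + active_zccs)
    if mean_all == 0:
        return 0.0  # no crossings anywhere; define the ratio as 0
    return average(inactive_zccs) / mean_all


# Illustrative per-frame ZCCs (invented values).
active = [20, 30]                  # Z_A = 50, N_A = 2
cover_inactive = [0, 0, 2, 2]      # Z_S = 4,  N_S = 4
stego_inactive = [10, 12, 8, 10]   # larger inactive sum after embedding

omega_cover = zcc_ratio(cover_inactive, active)
omega_stego = zcc_ratio(stego_inactive, active)

# Closed form: Z_S (N_S + N_A) / (N_S (Z_S + Z_A)) = 4 * 6 / (4 * 54) = 1/9
closed_form = 4 * (4 + 2) / (4 * (4 + 50))
print(omega_cover, closed_form, omega_stego)
```

In this toy case only the inactive counts change, and the ratio rises from 1/9 to 2/3, in line with the inequality derived in the text.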
In addition, like Ren et al.'s work [34], we employ the calibration technique to estimate the cover signal of a given speech signal. To obtain the calibrated speech sample, the given speech sample is first recompressed, namely, encoded and decoded again, whether it is a cover or a steganographic one. As shown in Figure 3, we can respectively extract the average ZCC from the given sample and the average calibrated ZCC from the calibrated version. Finally, we can obtain the third type of ZCC statistic, i.e., the difference between the average ZCC of inactive frames and their calibrated versions, called DIF-ZCC and denoted as $\nu$, namely,

$$\nu = \bar{\lambda}_O - \bar{\lambda}_C,$$

where $\bar{\lambda}_O$ is the average ZCC of inactive frames in the original speech sample, and $\bar{\lambda}_C$ is the average ZCC of inactive frames in the calibrated speech sample. Similarly, we compare the distribution of DIF-ZCC for 1000 randomly chosen cover speech samples with that of the corresponding steganographic samples at the embedding rate of 100%, as shown in Figure 4. The experimental results show that there are obvious distinctions between them. Thus, it is valid to employ DIF-ZCC as a steganalysis feature.
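The DIF-ZCC computation can be sketched as follows. This is only an illustration under strong assumptions: the real calibration step is a G.723.1 encode/decode round trip, which is not reproduced here. Instead, the recompression is passed in as a callable, and `toy_recompress` is a hypothetical stand-in that simply zeroes low-amplitude samples so that the example runs without a codec.

```python
def zero_crossing_count(frame):
    """Count sign changes between consecutive samples (zero treated as non-negative)."""
    signs = [1 if x >= 0 else -1 for x in frame]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


def average_zcc(frames):
    """Average ZCC over a list of frames; 0.0 if the list is empty."""
    if not frames:
        return 0.0
    return sum(zero_crossing_count(f) for f in frames) / len(frames)


def dif_zcc(inactive_frames, recompress):
    """Average ZCC of the given inactive frames minus that of their calibrated versions."""
    calibrated = [recompress(frame) for frame in inactive_frames]
    return average_zcc(inactive_frames) - average_zcc(calibrated)


def toy_recompress(frame, floor=4):
    """Hypothetical stand-in for a codec round trip: zero out samples below `floor`.

    This is NOT G.723.1; it only mimics the idea that recompression
    suppresses small embedding-induced perturbations.
    """
    return [x if abs(x) >= floor else 0 for x in frame]


cover = [[0, 0, 0, 0, 0, 0]]
stego = [[0, 3, -2, 0, -1, 4]]
print(dif_zcc(cover, toy_recompress))  # 0.0
print(dif_zcc(stego, toy_recompress))  # 4.0
```

Passing the recompression step as a parameter keeps the feature computation independent of any particular codec binding, which would have to be supplied separately in practice.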

B. MEL-FREQUENCY CEPSTRAL COEFFICIENTS FOR INACTIVE FRAMES
Mel-frequency cepstral coefficients (MFCCs) are often used to describe frequency characteristics in a way similar to the human auditory system's response, and are commonly applied in speech processing. In general, MFCCs are calculated by applying a Mel-scaled filter-bank to the short-term fast Fourier transform (FFT) magnitude spectrum to obtain a perceptually meaningful smoothed gross spectrum [48], [49]. Figure 5 shows the procedure of extracting the FFT-based MFCCs from a speech signal. As mentioned above, in accordance with the additive noise model for steganography, the FFT spectrum of a steganographic inactive frame, $\tilde{S}$, can be stated as

$$\tilde{S} = S + S_\varepsilon,$$

where $S$ is the FFT spectrum of the corresponding cover inactive frame and $S_\varepsilon$ is the FFT spectrum of the additive noise. Further, let $F_{L,i}$, $F_{C,i}$ and $F_{H,i}$ respectively denote the low limit frequency, center frequency and high limit frequency of the $i$-th ($i = 1, 2, \ldots, T$) triangular overlapping window of the Mel-scaled filter-bank, where $T$ is the number of triangular overlapping windows (filters), usually set to 24. The relationship among the adjacent triangular overlapping windows can be described as

$$F_{C,i} = F_{H,i-1} = F_{L,i+1}.$$

All the spectra of the frames are passed through their corresponding Mel-filters. For the given inactive frame, the output value of the $i$-th Mel-filter, denoted as $\theta_i$, can be obtained by calculating

$$\theta_i = \sum_{x = F_{L,i}}^{F_{H,i}} W_i(x)\, S(x),$$

where $W_i(x)$ is the frequency response function of the $i$-th Mel-filter, and can be determined as

$$W_i(x) = \begin{cases} \dfrac{x - F_{L,i}}{F_{C,i} - F_{L,i}}, & F_{L,i} \le x \le F_{C,i}, \\[2mm] \dfrac{F_{H,i} - x}{F_{H,i} - F_{C,i}}, & F_{C,i} < x \le F_{H,i}, \\[2mm] 0, & \text{otherwise.} \end{cases}$$

Correspondingly, the output of the $i$-th Mel-filter for the steganographic inactive frame, denoted as $\tilde{\theta}_i$, is given by

$$\tilde{\theta}_i = \sum_{x = F_{L,i}}^{F_{H,i}} W_i(x)\, \tilde{S}(x) = \sum_{x = F_{L,i}}^{F_{H,i}} W_i(x)\, \big(S(x) + S_\varepsilon(x)\big).$$

MFCCs are the result of a discrete cosine transform (DCT) operation on the logarithm of the Mel-filter outputs. There are 24 filters in the Mel bank, which leads to 24 DCT coefficients. However, due to the decorrelation property of DCT, only the first $L$ coefficients are chosen in practice. In this work, $L$ is equal to 12, following the convention of speech processing [50]- [52].
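For the given cover inactive speech frame, each MFCC, denoted as $\eta_j$ ($j = 1, 2, \ldots, L$, $L = 12$), can be written as

$$\eta_j = \sum_{i=1}^{T} \log(\theta_i)\, \cos\!\left(\frac{\pi j\,(i - 0.5)}{T}\right).$$

Correspondingly, each MFCC for the steganographic inactive speech frame, denoted as $\tilde{\eta}_j$ ($j = 1, 2, \ldots, L$), is

$$\tilde{\eta}_j = \sum_{i=1}^{T} \log(\tilde{\theta}_i)\, \cos\!\left(\frac{\pi j\,(i - 0.5)}{T}\right).$$

Apparently, $\exists j \in [1, L]$ such that $\tilde{\eta}_j \neq \eta_j$, since it is highly likely that $\tilde{\theta}_i \neq \theta_i$ for some $i \in [1, T]$, which suggests that the set of MFCCs for the inactive frames can be employed as the steganalysis feature.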
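The chain from magnitude spectrum to MFCCs (triangular Mel filters, logarithm, DCT) can be sketched in pure Python as below. Several details are assumptions made for the illustration and are not taken from the paper: the Mel-scale conversion 2595 log10(1 + f/700), an unnormalized DCT-II, a small floor added before the logarithm, and a 129-bin spectrum (as would come from a 256-point FFT at 8 kHz). The defaults of 24 filters and 12 coefficients follow the text.

```python
import math


def hz_to_mel(freq_hz):
    """Common Mel-scale conversion (assumed here, not stated in the paper)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_edges(num_filters, sample_rate):
    """num_filters + 2 edge frequencies equally spaced on the Mel scale.

    Filter i uses edges[i-1], edges[i], edges[i+1] as its low, center and
    high frequencies, so adjacent triangles overlap by half.
    """
    top = hz_to_mel(sample_rate / 2.0)
    return [mel_to_hz(top * k / (num_filters + 1)) for k in range(num_filters + 2)]


def triangular_weight(freq, low, center, high):
    """Triangular frequency response: 0 at the edges, 1 at the center."""
    if low <= freq <= center:
        return (freq - low) / (center - low)
    if center < freq <= high:
        return (high - freq) / (high - center)
    return 0.0


def mfcc_from_spectrum(magnitudes, sample_rate=8000, num_filters=24, num_coeffs=12):
    """MFCCs of one frame from its one-sided FFT magnitude spectrum."""
    edges = mel_filter_edges(num_filters, sample_rate)
    bin_width = (sample_rate / 2.0) / (len(magnitudes) - 1)
    log_outputs = []
    for i in range(1, num_filters + 1):
        theta = sum(
            triangular_weight(k * bin_width, edges[i - 1], edges[i], edges[i + 1]) * mag
            for k, mag in enumerate(magnitudes)
        )
        log_outputs.append(math.log(theta + 1e-10))  # floor avoids log(0) on silent frames
    # Unnormalized DCT-II; coefficient 0 is skipped, as in the 12-coefficient convention.
    return [
        sum(
            log_outputs[i] * math.cos(math.pi * j * (i + 0.5) / num_filters)
            for i in range(num_filters)
        )
        for j in range(1, num_coeffs + 1)
    ]


silent = mfcc_from_spectrum([0.0] * 129)     # all-zero spectrum of a silent frame
perturbed = mfcc_from_spectrum([1.0] * 129)  # flat non-zero spectrum, e.g. noise-like
print(max(abs(c) for c in silent), perturbed[0])
```

With these conventions an all-zero spectrum gives MFCCs that are numerically zero (the DCT of a constant vector vanishes for j >= 1), whereas any non-zero spectrum generally does not, which is the kind of shift the feature is meant to capture.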
To verify this deduction, we compare the distributions of MFCCs for the inactive frames in randomly chosen 1000 cover speech samples with those in the corresponding steganographic samples at embedding rate of 100%, as shown in Figure 6. All the MFCCs are calculated with a window of 256 samples and overlapping length of 80 sampling points. The experimental results show that the steganographic operation indeed induces effects on the distributions of MFCCs for the inactive frames, although the impacts caused on the different MFCCs vary. Therefore, we can safely conclude that the set of MFCCs for the inactive frames is very suitable for distinguishing between the cover and the steganographic samples.
As with the average ZCC, the average MFCCs (i.e., $\bar{\eta}_1, \bar{\eta}_2, \ldots, \bar{\eta}_{12}$) of the inactive frames in each speech sample are employed as the steganalysis feature, since there are different numbers of inactive frames in different speech samples.

III. PROPOSED STEGANALYSIS SCHEME
Combining the above two types of features, namely, the statistics for zero-crossing counts and the average MFCCs, we can obtain a 15-dimensional steganalysis feature $\varphi = \{\bar{\lambda}_S, \omega, \nu, \bar{\eta}_1, \bar{\eta}_2, \ldots, \bar{\eta}_{12}\}$. Further, incorporating the SVM [53], we present a steganalysis scheme, as shown in Figure 7, which includes two processes, namely training and detection. Specifically, the training process includes the following steps:
STEP 1: Sample preparation. Collect a large number of speech samples, encode them with ITU-T G.723.1 codec ...
Correspondingly, the detection process consists of two steps as follows.
STEP 1: Feature extraction. Extract the feature set $\varphi$ from each sample to be detected.
STEP 2: Detection. Input the feature set into the well-established SVM-based classifier, and decide whether the given test sample contains secret information in accordance with the output of the classifier.
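Assembling the 15-dimensional feature vector from the three ZCC statistics and the twelve averaged MFCCs can be sketched as follows. The function names and the input layout (one list of 12 MFCCs per inactive frame) are assumptions for the illustration; the classifier itself is not shown.

```python
def average_mfccs(per_frame_mfccs):
    """Average each MFCC dimension over all inactive frames of one speech sample."""
    if not per_frame_mfccs:
        raise ValueError("at least one inactive frame is required")
    count = len(per_frame_mfccs)
    return [sum(column) / count for column in zip(*per_frame_mfccs)]


def feature_vector(avg_zcc, zcc_ratio, dif_zcc, avg_mfccs):
    """Concatenate the three ZCC statistics with the 12 averaged MFCCs."""
    if len(avg_mfccs) != 12:
        raise ValueError("expected 12 averaged MFCCs")
    return [avg_zcc, zcc_ratio, dif_zcc, *avg_mfccs]


# Two inactive frames with made-up MFCC values.
frames = [[1.0] * 12, [3.0] * 12]
phi = feature_vector(avg_zcc=2.5, zcc_ratio=0.4, dif_zcc=1.5,
                     avg_mfccs=average_mfccs(frames))
print(len(phi))  # 15
```

Averaging over frames gives every speech sample a vector of the same length regardless of how many inactive frames it contains, so the vectors can be passed directly to an SVM library such as LibSVM.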

IV. PERFORMANCE EVALUATION

A. EXPERIMENTAL SETUP
To evaluate the performance of our proposed scheme, we collect a total of 2200 ten-second speech samples, which are mono PCM coded files with 8 kHz sampling rate and 16-bit quantization. The sample set consists of two categories, i.e., English and Chinese. In each category, there are male and female speech samples. All speech samples are encoded with the ITU-T G.723.1 codec at 6.3 kbps mode and at 5.3 kbps mode. Figure 8 shows the distribution of the numbers of inactive frames in all the samples, which indicates that each speech sample contains a certain number of inactive frames, and that the numbers of inactive frames vary across speech samples. Moreover, in the steganographic experiments, Huang et al.'s method [11] and Lin's method [28] are respectively carried out on the speech samples encoded at 6.3 kbps mode and those encoded at 5.3 kbps mode. In all the experiments, the embedded messages are randomly generated.
In this section, we evaluate the performance of the proposed method, and compare it with the entropy-based and poker test-based methods [35]. In the steganalysis experiments, all the SVM-based classifiers are implemented based on LibSVM [53] with the RBF kernel, where the default parameter setting is adopted. In each steganalysis experiment, three statistics, namely, accuracy (ACC), false positive rate (FPR), and false negative rate (FNR), are employed to evaluate the detection performance of the steganalysis schemes. ACC is the proportion of true results and is calculated by

$$\mathrm{ACC} = \frac{N_{TP} + N_{TN}}{N_{TP} + N_{TN} + N_{FP} + N_{FN}},$$

where $N_{TP}$ is the total of true positives, $N_{TN}$ is the total of true negatives, $N_{FP}$ is the total of false positives and $N_{FN}$ is the total of false negatives. FPR is calculated as the ratio between the number of negatives wrongly categorized as positives and the total number of actual negatives, which is given by

$$\mathrm{FPR} = \frac{N_{FP}}{N_{FP} + N_{TN}}.$$

FNR is calculated as the ratio between the number of positives wrongly categorized as negatives and the total number of actual positives, which is expressed as

$$\mathrm{FNR} = \frac{N_{FN}}{N_{FN} + N_{TP}}.$$
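The three evaluation statistics can be computed from true and predicted labels as in the sketch below. The label convention (1 for steganographic, i.e. positive; 0 for cover, i.e. negative) and the function name are assumptions for the example.

```python
def detection_metrics(true_labels, predicted_labels):
    """Return (ACC, FPR, FNR) for binary labels: 1 = steganographic, 0 = cover."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == 1 and guess == 1:
            tp += 1
        elif truth == 0 and guess == 1:
            fp += 1
        elif truth == 0 and guess == 0:
            tn += 1
        else:
            fn += 1
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return acc, fpr, fnr


truth = [1, 1, 1, 0, 0, 0, 0, 1]
guess = [1, 1, 0, 0, 0, 1, 0, 1]
print(detection_metrics(truth, guess))  # (0.75, 0.25, 0.25)
```

In the example there are three true positives, three true negatives, one false positive and one false negative, which gives the printed values.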

B. BASIC PERFORMANCE ANALYSIS
To verify the effectiveness of the proposed feature set, we define ten test modes by changing the numbers of samples in the training and test sets, as shown in TABLE 1. For each mode, we carry out the experiments with the ten-second speech samples respectively encoded at 5.3 kbps mode and 6.3 kbps mode. All the steganographic samples are produced at the embedding rate of 100%. Figure 9 shows the experimental results for the ten test modes. From the results, we can learn that the detection accuracies are larger than 99% in all cases, indicating that the presented steganalysis feature set is highly effective. Moreover, even with small numbers of training samples, we can obtain a good classifier model. In the following experiments, however, to obtain the best classifier model as well as achieve the most reliable detection results, we carry out each steganalysis experiment with mode 10.
In addition, to evaluate the performance of the presented scheme, we compare the detection performance between the training sets and test sets at various embedding rates (from 10% to 100%), as shown in Figures 10 and 11. From the experimental results, we can learn the following facts: First, there is almost no difference between the detection results for the training sets and test sets, meaning that there is no overfitting in any of the training processes. Second, in the cases of embedding rates not smaller than 30%, the detection accuracies are larger than 80%, indicating that the classifier models in these cases are relatively good. However, in the cases of very low embedding rates (e.g., smaller than 20%), there is much room to improve the detection accuracies, meaning that the classifier models in these cases are somewhat underfitting. That is to say, improving the detection performance at very low embedding rates is still a challenge and deserves further study.
To further evaluate the generalization ability of the proposed model, we conduct 5-fold cross-validation at various embedding rates (from 10% to 100%). Specifically, for each embedding rate, the training set, including 1100 cover samples and 1100 corresponding steganographic versions, is randomly divided into 5 subsets. In each experiment, four subsets are used to train the model, and the remaining one is employed for testing. The average ACC, FPR and FNR over the five test subsets are considered as the final results of the 5-fold cross-validation. The experimental results for various embedding rates are shown in Figures 12 and 13. From the charts, we can learn that there are only very slight differences between the results of 5-fold cross-validation and those for the test sets, indicating that the proposed model has a good generalization ability.
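The 5-fold partitioning described above can be sketched with the standard library alone. This is a generic illustration: the seed, the function name and the interleaved assignment of shuffled indices to folds are choices made for the example, and the training and scoring of the classifier on each split are left out.

```python
import random


def k_fold_indices(num_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)          # random division into subsets
    folds = [indices[i::k] for i in range(k)]     # k nearly equal-sized subsets
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for j, fold in enumerate(folds) if j != held_out for idx in fold]
        yield train, test


# 1100 cover samples plus 1100 steganographic versions = 2200 samples.
for train, test in k_fold_indices(2200, k=5):
    print(len(train), len(test))  # 1760 440 on every fold
```

The final cross-validation figures would then be the averages of ACC, FPR and FNR over the five held-out subsets.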

C. PERFORMANCE COMPARISON WITH PREVIOUS METHODS
We first compare the proposed method with the entropy-based and poker test-based methods using speech samples at various embedding rates (from 10% to 100%). Figures 14 and 15 show the experimental results for the speech samples respectively encoded at 5.3 kbps mode and those encoded at 6.3 kbps mode, from which we can learn the following facts: First, for all the three detection methods for steganography in inactive frames, the accuracy increases with the embedding rate of the steganographic samples, which means that the detection performance has a positive correlation with the adopted embedding rate of the steganographic methods. Second, for both Huang et al.'s method [11] performed on the speech samples encoded at 6.3 kbps mode and Lin's method [28] performed on the speech samples encoded at 5.3 kbps mode, the proposed steganalysis method can achieve better detection performance than the entropy-based and poker test-based methods, particularly in the cases of relatively low embedding rates. For example, for detecting Huang et al.'s method, the accuracy of the proposed method is higher than 85% when the embedding rate is only 40%, while the poker test-based method achieves a similar accuracy only when the embedding rate is larger than 70%, and the entropy-based method cannot achieve this accuracy even if the embedding rate is 100%; for detecting Lin's method, the accuracy of the proposed method is higher than 88% when the embedding rate is only 30%, while the poker test-based method achieves a similar accuracy only when the embedding rate is larger than 60%, and the entropy-based method cannot achieve this accuracy even if the embedding rate is 100%. To further evaluate the performance of the three steganalysis methods, the receiver-operating-characteristic (ROC) curves for detecting the existing two steganographic methods at typical embedding rates of 30%, 60% and 100% are shown in Figures 16 and 17.
The results demonstrate once again that the proposed method significantly outperforms the entropy-based and poker test-based methods in detection performance, particularly at the relatively low embedding rates.
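For readers who wish to reproduce ROC curves of this kind from classifier scores, a minimal sketch is given below. It assumes that each test sample has a real-valued decision score (higher meaning more likely steganographic) and a binary label (1 for steganographic, 0 for cover); the function names are hypothetical and the scores are invented.

```python
def roc_points(labels, scores):
    """Return the ROC curve as (FPR, TPR) points, sweeping the threshold downward."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples sharing one score are admitted together at that threshold.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
curve = roc_points(labels, scores)
print(curve, area_under_curve(curve))  # area is 0.75 for this toy ranking
```

The sketch requires at least one positive and one negative sample; a curve that hugs the top-left corner (area close to 1) corresponds to the better detection performance discussed in the text.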
In addition, we evaluate the performance of the three steganalysis methods for detecting the existing steganographic methods at the embedding rate of 100%, using various small quantities (from 1 to 10) of inactive frames. Figures 18 and 19 show the experimental results for the speech samples respectively encoded at 5.3 kbps mode and those encoded at 6.3 kbps mode, from which we can learn the following facts: First, for all the three steganalysis methods, the detection performance has a positive correlation with the number of inactive frames used. Overall, the more inactive frames are used, the better the detection performance. Second, the proposed steganalysis method can achieve much better detection performance than the entropy-based and poker test-based methods in all cases. For example, for detecting Huang et al.'s method, the accuracy of the proposed method is higher than 83% using only three inactive frames, while the entropy-based method cannot achieve this accuracy even using ten inactive frames, and the poker test-based method needs at least seven inactive frames to achieve a similar accuracy; for detecting Lin's method, the accuracy of the proposed method is higher than 86% using only two inactive frames, while the entropy-based method cannot achieve this accuracy even using ten inactive frames, and the poker test-based method needs at least nine inactive frames to achieve a similar accuracy. To sum up, the experimental results demonstrate again that, when detecting steganography in the same small quantity of inactive frames, the proposed steganalysis method outperforms the existing methods in detection performance. In particular, the proposed steganalysis method can effectively detect the existing steganographic methods even using very small quantities of inactive frames.

V. CONCLUSION
Steganography in inactive speech frames is a new and effective technique of covert communication based on VoIP, which can achieve large steganographic capacity while maintaining excellent embedding transparency. However, its illegitimate use by terrorists and lawbreakers would facilitate cybercrimes and pose a serious threat to cybersecurity. Thus, in this paper, we aim to develop an efficient steganalysis technique to detect this type of steganography. Differing from the existing entropy-based and poker test-based methods, we employ the statistics for ZCC, including the average ZCC of inactive frames, the ratio between the average ZCC of inactive frames and that of all frames, and the difference between the average ZCC of inactive frames and their calibrated versions, to characterize the frame-level dynamic characteristic of speech signals; moreover, we utilize the average values of MFCCs to represent the invariant characteristic of inactive frames. Further, an SVM-based steganalysis for inactive speech frames is presented. The proposed steganalysis method is evaluated with a large number of ITU-T G.723.1 encoded speech samples, and compared with the existing methods. The experimental results show that the proposed method significantly outperforms the previous ones in detection performance at any given embedding rate and for the same number of inactive frames. In particular, the proposed method can provide accurate results for detecting the existing steganographic methods using only a very small number of inactive frames, and can thereby be adopted to detect potential inactive-frame steganography in real-time speech streams.