An Encrypted Speech Retrieval Method Based on Deep Perceptual Hashing and CNN-BiLSTM

Since a convolutional neural network (CNN) can only extract local features, while a long short-term memory (LSTM) network requires a large amount of learning computation, has a long processing time, and suffers increasingly obvious information loss as the speech length grows, this paper combines CNN and a bidirectional long short-term memory (BiLSTM) network, utilizing the autonomous feature extraction capability of deep learning, to present an encrypted speech retrieval method based on deep perceptual hashing and CNN-BiLSTM. Firstly, the proposed method extracts the Log-Mel spectrogram/MFCC features of the original speech and feeds them into the CNN and BiLSTM networks in turn for model training. Secondly, we use the trained fusion network model to learn the deep perceptual features and generate deep perceptual hashing sequences. Finally, the normalized Hamming distance algorithm is used for matching retrieval. In order to protect speech security in the cloud, a speech encryption algorithm based on a 4D hyperchaotic system is proposed. The experimental results show that the proposed method has good discrimination, robustness, recall and precision compared with the existing methods, and it has good retrieval efficiency and retrieval accuracy for longer speech. Meanwhile, the proposed speech encryption algorithm has a large key space to resist exhaustive attacks.


I. INTRODUCTION
With the increasing popularity of multimedia acquisition equipment and the rapid development of cloud storage, the Internet, and other technologies, multimedia data stored in the cloud saves local space for users and facilitates data sharing between different clients, but it also brings difficulties in searching, privacy leakage, and data insecurity [1], [2]. Since speech contains a lot of confidential information, it is necessary to encrypt the speech before uploading it to the cloud. Due to the great changes in the features of encrypted speech and the continuous growth of speech data, it is difficult to retrieve encrypted speech. Therefore, research on encrypted speech retrieval technology has attracted the attention of many research institutions and scholars.
The traditional encrypted speech retrieval methods are based on speech perceptual hashing technology to extract the perceptual features of speech [3]-[7]. Speech feature extraction is the basis of the retrieval process, and the performance of feature expression directly affects subsequent retrieval results. Since these perceptual hashing-based encrypted speech retrieval methods rely on already designed speech features, redesigning a speech feature requires a large amount of prior knowledge and experimentation. CNN is the most developed network structure in deep learning. Due to its strong generalization ability and local data mining capability, CNN has achieved good results in various fields of artificial intelligence. Different from CNN, LSTM can process time sequences and model the changes within them. The BiLSTM neural network is a newer model based on the LSTM neural network, which can alleviate the problem of obvious information loss as the transmission time increases. Besides, in order to achieve privacy protection for speech in the cloud, a speech encryption method is an indispensable technology in an encrypted speech retrieval system. Traditional encryption algorithms such as the data encryption standard (DES), the advanced encryption standard (AES), and Rivest-Shamir-Adleman (RSA) are no longer suitable for multimedia data encryption, whereas hyperchaotic systems, with their sensitivity to initial parameters, randomness, and ergodicity, are widely used for multimedia data encryption.
CNN can extract the deep spatial features of speech and shorten the feature extraction time, while BiLSTM can extract the time-sequence features of speech. In this paper, the CNN and BiLSTM network models are combined to learn the deep perceptual features of speech, and an encrypted speech retrieval scheme based on deep perceptual hashing and CNN-BiLSTM is proposed. The main contributions of this paper can be summarized as follows: (1) A CNN-BiLSTM network fusion model is designed to extract the spatiotemporal features of speech data; (2) The proposed binary deep perceptual hashing construction scheme can achieve efficient speech retrieval with good discrimination and robustness; (3) A speech encryption algorithm based on a 4D hyperchaotic system with quadratic nonlinearity is designed, which can improve the security and privacy of speech data stored in the cloud; (4) The batch normalization algorithm is introduced, which can effectively improve the network fitting speed and reduce the training time.
The rest of the paper is organized as follows: Section II analyzes related research work. Section III details the related theoretical algorithms. Section IV gives the encrypted speech retrieval scheme and its processing flow. In Section V, the encrypted speech retrieval scheme is experimentally verified and compared with the existing methods. Section VI concludes the presented work and raises several problems for future work.

II. RELATED WORKS
At present, the existing content-based encrypted speech retrieval algorithms [3]-[7] are all realized by constructing speech perceptual hashing. For example, Wang et al. [3] proposed an encrypted speech perceptual hashing retrieval algorithm based on the zero-crossing rate and used Chua's chaotic system to encrypt speech. Wang et al. [4] proposed a speech perceptual hashing scheme based on time-frequency domain trend transformation and combined it with a logistic XOR encryption algorithm to form an encrypted speech retrieval scheme. Zhao et al. [5] proposed an encrypted speech retrieval algorithm based on the multi-fractal features of speech signals and piecewise aggregation approximation to generate perceptual hashing sequences. He et al. [6], observing that syllable-level perceptual hashing has better discrimination and robustness than time-domain and frequency-domain features, proposed a syllable-level perceptual hashing retrieval method based on the posterior probability features of a syllable segment model. Zhang et al. [7] proposed an encrypted speech retrieval algorithm based on short-term cross-correlation and perceptual hashing, which can directly extract the perceptual hashing sequence from the encrypted sample speech. From the above analysis, the existing content-based encrypted speech retrieval algorithms all rely on existing hand-crafted features to generate binary hashing codes for speech retrieval.
Inspired by deep learning technology, the deep hashing method takes the output of the deep network as a feature, which is more suitable for describing semantic information. For example, Wu et al. [8] proposed a deep incremental hash learning structure DIHN for large-scale image retrieval, which can be used to learn hashing codes incrementally. Ma et al. [9] applied a rough learning method to human re-recognition and proposed a human re-recognition method based on deep hash learning. Yan et al. [10] proposed a new deep linear discriminant analysis hashing algorithm (DLDAH), which can obtain deep semantic information without using a multi-layer network, simplifying the process of mapping new images into hashing codes. Singh et al. [11] proposed a deep multi-Cauchy hashing framework and its variants to achieve the fast search of clothing inventory. Cheng et al. [12] proposed an adaptive asymmetric residual hashing (AASH) algorithm based on residual hashing, integration learning, and asymmetric pairwise loss. Do et al. [13] proposed a deep network model and learning algorithm for learning binary hashing codes represented by a given image in unsupervised and supervised ways. Cui et al. [14] proposed a scalable deep hashing (SCADH) algorithm to learn enhanced hashing codes for social image retrieval, and also proposed a discrete hashing optimization method based on the augmented Lagrange multiplier.
In addition, deep learning methods are also applied in other fields to capture complex features from data. Shu et al. [15] proposed a novel Hierarchical Long Short-Term Concurrent Memory (H-LSTCM) to learn the dynamic inter-related representation among a group of persons for hierarchically recognizing human interactions. Shu et al. [16] proposed a novel graph LSTM-in-LSTM (GLIL) framework to address the problem of group activity recognition by modeling the person-level actions and the group-level activity simultaneously. Tang et al. [17] proposed a novel Coherence Constrained Graph LSTM (CCG-LSTM) for group activity recognition, by exploring the motion-level characteristics of group activity with several coherence constraints. Yan et al. [18] proposed a novel Participation-Contributed Temporal Dynamic Model (PC-TDM) for group activity recognition with attending to key actors (participants). Peng et al. [19] proposed a novel neural network Structured AutoEncoder (StructAE) for subspace clustering by simultaneously preserving the locality and globality of data sets. Peng et al. [20] proposed a novel clustering method by minimizing the discrepancy between pairwise sample assignments for each data point. Gao et al. [21] proposed a unified encoder-decoder framework hierarchical LSTM with adaptive attention (hLSTMat) for visual captioning.
Aiming at the problems that CNN can only extract local features and that LSTM develops long-term dependence issues as the input dimension increases, researchers have proposed fusing the two models. Zhao et al. [22] constructed one-dimensional and two-dimensional CNN-LSTM networks to learn local and global emotion-related features from speech and log-Mel spectrograms. Jung et al. [23] proposed a new method to improve the performance of polyphonic sound event detection, which combines a convolutional bidirectional recurrent neural network (CBRNN) with transfer learning. Passricha et al. [24] proposed a CNN-BiLSTM hybrid structure to extract the spatiotemporal features of speech, which can improve the performance of continuous speech recognition. Székely et al. [25] proposed a semi-supervised method in which partially coarsely annotated data can be used to train a breath detector for the speaker. Koller et al. [26] embedded powerful CNN-LSTM models in each hidden Markov model for weakly supervised learning in the video domain. Tang et al. [27] proposed a new deep neural network, CNN-LSTM-CRF, for entity recognition in Chinese clinical texts. Zhao et al. [28] proposed a text-independent speaker verification model with CNN-LSTM.
CNN can only extract local features and cannot process sequential data well. Although LSTM can model sequential data, its information loss becomes obvious as the speech length increases. BiLSTM passes each input sequence through an LSTM network in both the forward and backward directions, but the computation is heavy and the processing time is long. Therefore, we use CNN to extract the deep spatial features of speech and shorten the feature extraction time, while BiLSTM extracts the time-sequence features of speech. In this paper, the CNN and BiLSTM network models are combined to learn the deep perceptual features of speech, and an encrypted speech retrieval method based on deep perceptual hashing and CNN-BiLSTM is proposed.

III. RELATED THEORIES ANALYSIS

A. LOG-MEL SPECTROGRAM AND MFCC FEATURE
According to the human auditory mechanism, the human ear has different auditory sensitivity to sound waves of different frequencies, and humans can only perceive audio signals in the frequency range of 20 Hz-20 kHz. Based on these auditory characteristics, the Mel filter bank [29] was proposed; the Mel scale is shown in (1):

$Mel(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$    (1)

where f is the actual frequency value of the speech signal. Based on the theory of Mel filter banks, the MFCC features are extracted. The processing flow is as follows:

Step 1: Pre-processing, including pre-emphasis, framing, and windowing.
Step 2: Perform the fast Fourier transform (FFT) on each frame to obtain $X_n(k)$, where $k = 0, 1, 2, \ldots, N-1$ and N is the number of points of the FFT.

Step 3: Filter the energy spectrum with the M triangular Mel filters $H_m(k)$, $1 \le m \le M$.

Step 4: Calculate the logarithmic energy of each filter bank output, as shown in (2):

$s(i) = \ln\left(\sum_{k=0}^{N-1} |X_n(k)|^2 H_i(k)\right), \quad 1 \le i \le M$    (2)

where s(i) is the logarithmic energy of the i-th Mel filter, $X_n(k)$ is obtained by the FFT of Step 2, and $H_i(k)$ is the i-th Mel filter applied in Step 3.
Step 5: Perform the discrete cosine transform (DCT) on the logarithmic energies to obtain the MFCC features, as shown in (3):

$C_{MFCC}(l) = \sum_{i=1}^{M} s(i) \cos\left(\frac{\pi l (i - 0.5)}{M}\right), \quad l = 1, 2, \ldots, L$    (3)

where L is the dimension of the MFCC feature, M is the number of filters, and $C_{MFCC}(l)$ represents the MFCC feature of the l-th dimension. Compared with the extraction process of the MFCC, the Log-Mel spectrogram [30] only omits the DCT of Step 5, which converts the logarithmic Mel spectrogram into the cepstrum. Fig. 1 shows the speech feature extraction processes of the Log-Mel spectrogram and MFCC.
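To make the extraction pipeline concrete, the following is a minimal sketch using the Librosa library, which Section IV-D reports was used, together with the 16 kHz sampling rate, 25 ms frame length, 10 ms frame shift, and Hamming window given there; the numbers of Mel bands and MFCC dimensions are illustrative assumptions.

```python
import librosa
import numpy as np

def extract_features(path, sr=16000, n_mels=40, n_mfcc=13):
    """Extract the Log-Mel spectrogram and MFCC features of Section III-A.
    n_mels and n_mfcc are assumed values, not taken from the paper."""
    y, _ = librosa.load(path, sr=sr)
    n_fft = int(0.025 * sr)   # 25 ms frame length -> 400 samples at 16 kHz
    hop = int(0.010 * sr)     # 10 ms frame shift  -> 160 samples
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels,
                                         window='hamming')
    log_mel = librosa.power_to_db(mel)                      # Log-Mel spectrogram
    mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=n_mfcc)   # DCT of the log-Mel
    return log_mel, mfcc
```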

B. CONVOLUTIONAL NEURAL NETWORK
CNN [15] is a feedforward neural network with a deep structure that includes convolution calculations, and it is one of the representative algorithms of deep learning. With the advantage of the convolution operation, CNN can build higher-level and more abstract representations of the original data. Meanwhile, its weight-sharing network structure significantly reduces the complexity of the model and the number of weights. The basic structure of a convolutional neural network is shown in Fig. 2, including the input layer, the hidden layers, and the output layer.

1) Input layer. Since gradient descent is used for learning, the input features of the CNN need to be standardized. The standardization of input features helps improve operating efficiency and learning performance.
2) Hidden layers. The hidden layers of a CNN mainly include convolution layers, pooling layers, and fully connected layers. The convolution layer extracts features from the input data, and stacking multiple convolution layers yields deeper feature maps. The pooling layer is usually used immediately after the convolution layer to simplify its output; pooling layers both speed up the calculation and prevent overfitting. In the fully connected layer, all the neurons between the two layers have weight connections, and it is usually placed at the end of the hidden layers.
3) Output layer. The structure and working principle of the output layer are the same as in a traditional feedforward neural network, and Softmax is usually used as the activation function.

C. BIDIRECTIONAL LONG SHORT-TERM MEMORY NEURAL NETWORK
LSTM [31] was proposed to solve the gradient vanishing problem of the traditional recurrent neural network (RNN) model. The biggest change in the LSTM model is replacing the neural node with a neuron containing an input gate, an output gate, a forget gate, and a memory unit (Cell). The input gate, output gate, and forget gate are all logic units responsible for setting the weights on the edges connecting the rest of the neural network to the memory unit. The Cell provides internal storage and maintains data, called the Cell state. This Cell state runs through the entire LSTM architecture with only a small amount of linear interaction, which allows information to remain unchanged during transmission. Fig. 3 shows the structure of an LSTM neuron. Although LSTM can solve the problems of gradient vanishing and long-term dependence, the information loss becomes obvious as the speech length increases. The BiLSTM neural network is a newer model based on the LSTM neural network; its bi-directional structure provides complete past and future context information for each node in the output layer.
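For reference, the gating described above corresponds to the standard LSTM equations (the usual formulation, with generic notation rather than the exact symbols of [31]):

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(Cell state)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

A BiLSTM runs two such chains over the input, one forward and one backward, and the output at each step concatenates the two hidden states, $y_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$, which is what provides the complete past and future context.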

D. 4D HYPERCHAOTIC SYSTEM WITH QUADRATIC NONLINEARITY
The 4D hyperchaotic system with quadratic nonlinearity [32] is a new 11-term 4D hyperchaotic system with two quadratic nonlinearities derived from the classical Lorenz system. Experimental analysis proves that the new hyperchaotic system has three equilibrium points, and the system equation is shown in (4), where $x_1$, $x_2$, $x_3$, $x_4$ are state variables and a, b, c are positive real parameters of the system. When the initial value $K = (x_0, y_0, z_0, w_0)$ is used as the system key, the 4D chaotic sequences X, Y, Z, W are generated by iterating the system. Experience shows that when a = 10, b = 76, c = 3 and the initial value is (0.3, 0.3, 0.3, 0.3), the Lyapunov exponents are (1.5146, 0.2527, 0, -12.7626). Since the system has two positive Lyapunov exponents, it is clearly in a hyperchaotic state. In addition, the Kaplan-Yorke dimension of the hyperchaotic system is $D_{KY} = 3.1385$, which indicates the high complexity of the system.
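The sketch below illustrates how the four key-dependent chaotic sequences can be generated by numerically iterating a 4D system. Since (4) is not reproduced here, the right-hand side uses a hyperchaotic Lorenz-type system purely as a stand-in; for a faithful implementation, the 11-term system of [32] and its parameters a, b, c should be substituted.

```python
import numpy as np

def chaotic_sequences(key, n, dt=0.001, burn=3000):
    """Iterate a 4D chaotic system with forward Euler from the initial
    value (the system key) and return the four state sequences.
    NOTE: the derivative below is a hyperchaotic Lorenz-type STAND-IN,
    not the 11-term system of [32]."""
    a, r, b, k = 10.0, 28.0, 8.0 / 3.0, -1.0   # stand-in parameters
    x, y, z, w = key
    out = np.empty((4, n))
    for i in range(-burn, n):                  # burn-in discards the transient
        dx = a * (y - x) + w
        dy = r * x - y - x * z
        dz = x * y - b * z
        dw = k * w - x * z
        x, y, z, w = x + dt * dx, y + dt * dy, z + dt * dz, w + dt * dw
        if i >= 0:
            out[:, i] = (x, y, z, w)
    return out  # rows: X, Y, Z, W

X, Y, Z, W = chaotic_sequences((0.3, 0.3, 0.3, 0.3), n=160000)
```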

IV. THE PROPOSED SCHEME
In this paper, our goal is to learn compact binary codes to achieve efficient retrieval of massive encrypted speech. Utilizing the advantages of deep learning methods in various fields, an encrypted speech retrieval scheme based on deep perceptual hashing and CNN-BiLSTM is proposed. Fig. 4 shows the encrypted speech retrieval system model based on deep perceptual hashing and CNN-BiLSTM. As shown in Fig. 4, the model mainly includes three steps: deep perceptual hashing construction and generation of the system hashing index table, construction of the encrypted speech library, and user speech retrieval.

A. SYSTEM MODEL
Step 1: Deep perceptual hashing construction and generation of the system hashing index table. The Log-Mel spectrogram/MFCC of the original speech is first extracted as the training data for the CNN-BiLSTM model. Then the deep perceptual features of the speech are extracted by the trained CNN-BiLSTM model, and the deep perceptual hashing sequence is generated by combining them with the hash function and uploaded to the system hashing index table in the cloud.

Step 2: Construction of the encrypted speech library. In order to ensure the security and privacy of speech data in the cloud, the original speech is encrypted by the 4D hyperchaotic encryption algorithm with quadratic nonlinearity, and a one-to-one mapping relationship with the system hashing index is established.
Step 3: User speech retrieval. During speech retrieval, the same deep hashing construction method is used to generate the binary deep perceptual hashing code of the query speech. The normalized Hamming distance algorithm is used to search the system hashing index table, and the retrieval result is returned to the user.
In this system model, the encrypted speech library and the system hashing index table are constructed offline, and the user speech retrieval can generate a retrieval index online.

B. CONSTRUCTION OF ENCRYPTED SPEECH LIBRARY
Chaotic systems are widely used in the field of encryption due to their high sensitivity to initial conditions and control parameters, ergodicity, determinacy, pseudo-randomness, and aperiodicity. The 4D hyperchaotic encryption algorithm with quadratic nonlinearity described in Section III-D is used to encrypt the original speech. Fig. 5 shows the flow chart of the 4D hyperchaotic speech encryption algorithm with quadratic nonlinearity.
The speech encryption process is as follows:

Step 1: Pre-processing. Import the original speech S = {s(i), 1 ≤ i ≤ L}, where L = 160,000.
Step 2: Scrambling operation. The first-dimensional chaotic sequence X = {x(i), 1 ≤ i ≤ L} and the second-dimensional chaotic sequence Y = {y(i), 1 ≤ i ≤ L} are generated by the 4D hyperchaotic system in Section III-D. Firstly, the sequences X and Y are obtained from (5) and (6), respectively. Then, the sequence I obtained from (7) is used as the scrambling sequence for the positions of S, and the scrambled speech S_x = {s_x(i), 1 ≤ i ≤ L} is obtained.
Step 3: XOR diffusion. The third-dimensional chaotic sequence Z = {z(i), 1 ≤ i ≤ L} and the fourth-dimensional chaotic sequence W = {w(i), 1 ≤ i ≤ L} are generated by the 4D hyperchaotic system with quadratic nonlinearity, and the generated chaotic sequences are used to diffuse the scrambled one-dimensional speech S_x = {s_x(i), 1 ≤ i ≤ L} forward and backward through (8) and (9), respectively.
Step 4: Restore the speech. Finally, the diffused sequence is reconstructed into time-domain speech, and the encrypted speech signal S_x = {s_x(i), 1 ≤ i ≤ L} is obtained.
Step 5: Construction of the encrypted speech library. The above encryption process is performed on all the speeches in the original speech library, and the results are uploaded to the encrypted speech library in the cloud.
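The sketch below illustrates the scramble-then-diffuse structure of Steps 1-4 on 16-bit samples. Since (5)-(9) are not reproduced in the text, the permutation index and key-stream quantization used here are assumptions that only mirror the described structure.

```python
import numpy as np

def encrypt_speech(samples, X, Y, Z, W):
    """Scramble-then-diffuse encryption sketch of Section IV-B.
    The argsort-based index I and the 16-bit quantization of Z and W
    are ASSUMPTIONS standing in for (5)-(9)."""
    L = len(samples)
    s = samples.astype(np.int64) & 0xFFFF        # treat samples as 16-bit words
    I = np.argsort(X[:L] + Y[:L])                # scrambling index (stand-in for (7))
    sx = s[I]                                    # position scrambling
    kz = (np.abs(Z[:L]) * 1e6).astype(np.int64) & 0xFFFF   # key stream for (8)
    kw = (np.abs(W[:L]) * 1e6).astype(np.int64) & 0xFFFF   # key stream for (9)
    for i in range(L):                           # forward XOR diffusion
        sx[i] ^= kz[i] ^ (sx[i - 1] if i > 0 else 0)
    for i in range(L - 2, -1, -1):               # backward XOR diffusion
        sx[i] ^= kw[i] ^ sx[i + 1]
    return sx.astype(np.uint16)                  # encrypted speech S_x
```

Decryption reverses the two diffusion passes in the opposite order and then applies the inverse permutation of I, matching Section IV-F.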

C. CNN-BiLSTM MODEL CONSTRUCTION
In this paper, utilizing the characteristics of autonomous feature extraction in deep learning, CNN and BiLSTM are  combined to propose a fusion network model to learn the deep perceptual features of speech. Fig. 6 shows the proposed CNN-BiLSTM network learning framework. Table 1 shows the parameter settings of CNN-BiLSTM network.
As shown in Table 1, we can see the main structure and parameter settings of the model. TimeDistributed applies the same tensor operations to every time step of the sequence, which facilitates the connection between CNN and BiLSTM. Meanwhile, the batch normalization algorithm is introduced to improve the network fitting speed and reduce the training time. MaxPooling2D performs maximum pooling on the spatial features extracted by Conv2D. The flatten layer turns the data into a one-dimensional vector for the next layer. Finally, the first fully connected layer is used as the feature extraction layer, and Softmax is used as the activation function of the network output layer to classify the speech data.
The network model shown in Fig. 6 is implemented with Python's Keras library. The loss function for training is binary cross-entropy, and the optimization algorithm is stochastic gradient descent (SGD).
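A minimal Keras sketch of this architecture is given below. The layer order follows the description above, while the filter count, kernel size, LSTM units, and class count are placeholder assumptions for the exact values of Table 1.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (TimeDistributed, Conv2D, BatchNormalization,
                                     MaxPooling2D, Flatten, Bidirectional, LSTM,
                                     Dense)

def build_cnn_bilstm(input_shape, hash_bits=384, num_classes=170):
    """CNN-BiLSTM fusion model sketch (Fig. 6). Hyperparameters are
    assumptions, not the exact settings of Table 1."""
    model = Sequential([
        # The CNN stack is wrapped in TimeDistributed so that every time
        # step of the feature sequence passes through the same convolutions.
        TimeDistributed(Conv2D(32, (3, 3), activation='relu', padding='same'),
                        input_shape=input_shape),
        TimeDistributed(BatchNormalization()),    # improves fitting speed
        TimeDistributed(MaxPooling2D((2, 2))),    # pools the spatial features
        TimeDistributed(Flatten()),               # one vector per time step
        Bidirectional(LSTM(128)),                 # temporal features, both directions
        Dense(hash_bits, activation='relu'),      # feature extraction layer
        Dense(num_classes, activation='softmax'), # classification layer
    ])
    model.compile(optimizer='sgd', loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
```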

D. DEEP PERCEPTUAL HASHING CONSTRUCTION
In this paper, the fully connected layer before the classification layer is used as the feature extraction layer, and the rectified linear unit (ReLU) is used to provide range constraints. The number of neurons in this fully connected layer equals the code length of the target binary deep perceptual hashing. By training the network model, semantic information can be embedded in the output of the fully connected layer. The trained network maps the original high-dimensional feature space to the low-dimensional Hamming space to generate a compact binary deep perceptual hashing code, which greatly improves the retrieval efficiency of the system. The construction process of the binary deep perceptual hashing sequence is as follows:

Step 1: Speech feature extraction. The feature extraction method described in Section III-A is used to extract the Log-Mel spectrogram/MFCC features of speech. In the feature extraction stage, the Librosa library [33] is used. The sampling rate is 16 kHz, the frame length and frame shift are 25 ms and 10 ms respectively, the Hamming window function is used, and the input speech duration is fixed to 10 s.

Step 2: Deep perceptual feature extraction. The features are fed into the trained CNN-BiLSTM model, and the output H of the fully connected feature extraction layer is taken as the deep perceptual feature.

Step 3: Binarization. The binary deep perceptual hashing sequence h is generated by the hash function shown in (10):

$h(i) = \begin{cases} 1, & H(i) \ge H_{median} \\ 0, & H(i) < H_{median} \end{cases}, \quad 1 \le i \le M$    (10)

where $H_{median}$ is the median of the feature vector H, and M = 384 is the length of the binary deep perceptual hashing sequence.
Step 4: Construct the system hashing index table. According to the above steps, the deep perceptual hashing sequences $(h_1, h_2, \ldots, h_N)$ of all the original speeches $(S_1, S_2, \ldots, S_N)$ are obtained. Each generated deep perceptual hashing sequence establishes a one-to-one Key-Value mapping relationship with the corresponding encrypted speech and is uploaded to the system hashing index table in the cloud.
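A compact sketch of Steps 2-3 follows, assuming the feature extraction layer is the penultimate (Dense) layer of the trained model:

```python
import numpy as np
from tensorflow.keras.models import Model

def deep_perceptual_hash(model, feature):
    """Steps 2-3 of Section IV-D: take the fully connected layer's output
    as the deep perceptual feature H and binarize it around its median,
    as in (10). The layer index -2 assumes the model layout sketched above."""
    extractor = Model(inputs=model.input, outputs=model.layers[-2].output)
    H = extractor.predict(feature[np.newaxis, ...])[0]  # deep perceptual feature
    return (H >= np.median(H)).astype(np.uint8)         # binary hash, M = 384 bits
```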

E. USER SPEECH RETRIEVAL
After uploading the encrypted speech library and the system hashing index table to the cloud server, the user can submit the speech to be queried online, and the encrypted speech retrieval can be performed with "no download, no decryption". The speech retrieval process is as follows:

Step 1: Submit the query speech. Given the query speech q, the deep perceptual feature H_q is first extracted by the CNN-BiLSTM network model, and then the binary deep perceptual hashing sequence h_q is obtained by (10).
Step 2: Retrieval matching. The deep perceptual hashing sequence h_q generated in Step 1 and each hashing sequence h_x in the system hashing index table are matched by the normalized Hamming distance (also known as the bit error rate, BER) algorithm D(h_x, h_q). The BER is calculated as shown in (11):

$D(h_x, h_q) = \frac{1}{M}\sum_{i=1}^{M} |h_x(i) - h_q(i)|$    (11)

where M is the length of the binary deep perceptual hashing sequence. During retrieval, a threshold T (0 < T < 0.5) is set as the similarity threshold. If D(h_x, h_q) < T, the retrieval is successful and the system returns the decrypted speech to the user; otherwise, the retrieval fails.
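The matching step then reduces to a linear scan of the index table with the BER of (11); a minimal sketch, assuming the table is a dict mapping speech keys to hash arrays:

```python
import numpy as np

def ber(h_x, h_q):
    """Normalized Hamming distance of (11): the fraction of differing bits."""
    return np.count_nonzero(h_x != h_q) / h_x.size

def retrieve(h_q, index_table, T=0.33):
    """Return the keys of all entries whose BER to the query hash is below
    the similarity threshold T (T = 0.33 follows Section V-C)."""
    return [key for key, h_x in index_table.items() if ber(h_x, h_q) < T]
```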

F. SPEECH DECRYPTION
To decrypt a successfully retrieved speech, the decryption process is the reverse of the encryption process. The speech decryption process is as follows:

Step 1: Import the encrypted speech S_x = {s_x(i), 1 ≤ i ≤ L}, where L = 160,000, and generate the chaotic sequences using the same key as for encryption.

Step 2: Inverse XOR diffusion. The third-dimensional and fourth-dimensional chaotic sequences Z and W are used to reverse the backward and forward diffusion of (9) and (8), and the scrambled speech S_x is recovered.
Step 3: Inverse scrambling operation. The speech S_x = {s_x(i), 1 ≤ i ≤ L} obtained in Step 2 is descrambled using the first-dimensional chaotic sequence X = {x(i), 1 ≤ i ≤ L} and the second-dimensional chaotic sequence Y = {y(i), 1 ≤ i ≤ L} generated by the 4D hyperchaotic system in Section III-D. Firstly, the chaotic sequences X and Y are obtained from (5) and (6), respectively. Then, the sequence I obtained from (7) is used to restore the original positions of S_x, and the speech S = {s(i), 1 ≤ i ≤ L} is obtained.
Step 4: Restore the speech. Finally, the speech S = {s(i), 1 ≤ i ≤ L} obtained in Step 3 is reconstructed into time domain speech, and the decrypted speech is obtained.

V. EXPERIMENTAL RESULTS AND PERFORMANCE ANALYSIS
In the experiment, we use speech from THCHS-30 [34] as the experimental data; it is an open Chinese speech database published by the Center for Speech and Language Technology (CSLT) of Tsinghua University. The speech segments are in single-channel WAV format with a sampling frequency of 16 kHz and a sampling accuracy of 16 bits. In the network model training stage, according to the definition of perceptual hashing, multimedia digital representations with the same perceptual content are uniquely mapped into a digital digest. We select 10 speeches with the same content spoken by 17 people and perform 17 content preserving operations (CPOs), including amplitude adjustment, noise addition, re-quantization, re-sampling, and MP3 compression, obtaining a total of 3,060 speeches. In the performance analysis stage, 1,000 10-s speeches are randomly selected from the speech library for evaluation. To test the retrieval efficiency, 10,000 10-s speeches are randomly selected for evaluation.

A. PERFORMANCE ANALYSIS OF THE CNN-BiLSTM MODEL
Speech feature extraction is the key to encrypted speech retrieval, and the performance of feature expression directly affects the subsequent retrieval results. In this paper, utilizing the autonomous feature extraction capability of deep learning, CNN and BiLSTM are combined into a fusion network model to learn the deep perceptual features of speech. Fig. 7 shows the train/test loss curves of the Log-Mel spectrogram/MFCC features in the CNN, BiLSTM and CNN-BiLSTM models. As can be seen from Fig. 7, the CNN-BiLSTM model converges faster and reaches a lower loss. This is because CNN can only extract local features and cannot process time-series data well, while BiLSTM requires a large amount of learning computation and a long learning time as the speech length increases. In this paper, the network uses CNN to extract the deep spatial features of speech and shorten the feature extraction time, while BiLSTM extracts the time-sequence features of speech, achieving better results. Table 2 shows the test accuracy of the CNN, BiLSTM and CNN-BiLSTM models. As can be seen from Table 2, the test accuracy of the CNN-BiLSTM network model is significantly higher than that of the other network models. To further test the performance of the network model, the mean Average Precision (mAP) is introduced to evaluate the algorithm. The Average Precision (AP) after different CPOs is calculated by (14), and the mAP in Table 3 is obtained from (15):

$AP(q) = \frac{\sum_{k=1}^{n} P(k)\, rel(k)}{\sum_{k=1}^{n} rel(k)}$    (14)

$mAP = \frac{1}{Q}\sum_{q=1}^{Q} AP(q)$    (15)
where Q represents the number of queries, AP(q) represents the average precision of the q-th query, n represents the number of retrieved speeches, P(k) is the precision at cut-off k, and rel(k) indicates whether the k-th retrieved speech is relevant to the query speech (1 if relevant, 0 if irrelevant). The larger the mAP value, the better the retrieval algorithm. Table 3 shows that the fusion network model in this paper obtains better results. This is because the spatiotemporal features extracted by the CNN-BiLSTM model make full use of the representation capabilities of the two networks.
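For clarity, (14) and (15) can be computed as follows, where rel is the 0/1 relevance list of a ranked retrieval result (a standard implementation):

```python
import numpy as np

def average_precision(rel):
    """AP of (14): rel[k] is 1 if the k-th retrieved speech is relevant."""
    rel = np.asarray(rel, dtype=float)
    if rel.sum() == 0:
        return 0.0
    p_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)   # precision P(k)
    return float((p_at_k * rel).sum() / rel.sum())

def mean_average_precision(rel_lists):
    """mAP of (15): the mean AP over Q queries."""
    return float(np.mean([average_precision(r) for r in rel_lists]))
```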

B. PERFORMANCE COMPARISON WITH EXISTING PERCEPTUAL HASHING METHODS
Discrimination and robustness are the two most important indexes for evaluating a deep perceptual hashing sequence. By calculating the BER between deep perceptual hashing sequences, we can determine the similarity between speech segments. In order to better verify the performance of the algorithm, the false accept rate (FAR), as shown in (16), is introduced. Then, 1,000 speeches were randomly selected from THCHS-30 for analysis. The deep perceptual hashing algorithm was used to generate 1,000 deep perceptual hashing sequences for pairwise matching, and 1,000 × 999/2 = 499,500 BER values were obtained.
$FAR(\tau) = \int_{-\infty}^{\tau} \frac{1}{\delta\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\delta^2}\right) dx$    (16)

where τ is the hash matching threshold, µ is the BER mean, δ is the BER standard deviation, and x is the BER. As can be seen from Fig. 8(a) and Fig. 8(b), the probability distribution of the BER values almost overlaps with the probability density curve of the normal distribution, so the binary deep hashing sequences obtained by the proposed algorithm basically obey the normal distribution.
According to the De Moivre-Laplace central limit theorem, the Hamming distance approximately follows a normal distribution with $\mu = p$ and $\delta = \sqrt{p(1-p)/M}$, where M is the length of the hash sequence, µ is the BER mean, δ is the BER standard deviation, and p is the probability of a hash bit being 0 or 1.
The closer the BER distribution is to the normal curve, the better the randomness and anti-collision performance of the algorithm. In this paper, the deep perceptual hashing sequence length is M = 384, giving a theoretical normal distribution mean µ = 0.5 and standard deviation δ = 0.0255. In the experiment, the BER mean of the Log-Mel spectrogram is µ0 = 0.4972 with standard deviation δ0 = 0.0336, and the BER mean of the MFCC is µ1 = 0.4964 with standard deviation δ1 = 0.0322. Table 4 shows the comparison results of the proposed scheme with the existing perceptual hashing based encrypted speech retrieval methods [3], [5]-[7] under different thresholds. The lower the FAR of a perceptual hashing algorithm, the higher its anti-collision performance and the better its discrimination.
As can be seen from Table 4, the FARs of the Log-Mel spectrogram/MFCC are lower than those of the methods [3], [5]-[7] under different thresholds. Therefore, the method in this paper has strong anti-collision performance and discrimination, which can meet the needs of retrieval.
Robustness refers to the degree of change in the deep perceptual hashing generated from speech data after different content preserving operations (CPOs). In the experiment, the software GoldWave 6.38 and MATLAB R2017b were used to perform CPOs on the 1,000 test speeches, including MP3 compression (128 kbps, MP3), re-quantization (16→8→16 bit, R.Q), amplitude increase or decrease of 3 dB (+3 dB, -3 dB), and 30 dB narrowband Gaussian noise (G.N). Table 5 shows the average BER after the 5 operations.
As can be seen from Table 5, the Log-Mel spectrogram is more robust than the MFCC. The BER of the Log-Mel spectrogram is less than that of the methods [5]-[7] after a 3 dB decrease in amplitude, less than Zhao's [5] after a 3 dB increase in amplitude, and less than Wang's [3] after the G.N operation. For the MFCC, the BER is less than Zhao's [5] and close to the methods [6], [7] after the 3 dB decrease in amplitude. The cases of lower robustness are mainly due to the limited robustness of the Log-Mel spectrogram/MFCC features themselves and the lack of network model training data.

C. ANALYSIS OF RETRIEVAL PERFORMANCE
When evaluating the performance of speech retrieval algorithms, the recall R and the precision P are generally used. They are calculated as shown in (17) and (18), respectively:

$R = \frac{f_T}{f_T + f_L} \times 100\%$    (17)

$P = \frac{f_T}{f_T + f_F} \times 100\%$    (18)
where f_T is the number of retrieved relevant speeches, f_L is the number of relevant speeches that are not retrieved, and f_F is the number of retrieved irrelevant speeches.
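These two measures follow directly from the counts above (a trivial helper, named by us):

```python
def recall_precision(f_T, f_L, f_F):
    """Recall R of (17) and precision P of (18) from the retrieval counts:
    f_T retrieved relevant, f_L missed relevant, f_F retrieved irrelevant."""
    R = f_T / (f_T + f_L)
    P = f_T / (f_T + f_F)
    return R, P
```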
In retrieval, the similarity threshold T (0 < T < 0.5) is set, and the retrieval is successful if the normalized Hamming distance D(h_x, h_q) < T. The choice of the threshold directly affects the recall R and the precision P of the retrieval algorithm. In the discrimination experiment, the minimum BER values of the Log-Mel spectrogram/MFCC over the 1,000 speeches are 0.3385 and 0.3464, respectively. In the robustness experiment, the maximum BER values are 0.3203 and 0.2760, respectively. In order to avoid missed retrievals and achieve high performance, the Log-Mel spectrogram/MFCC similarity thresholds are set to T_0 = 0.33 and T_1 = 0.33, respectively. Table 6 shows the recall R and the precision P calculated by (17) and (18).
As shown in Table 6, except for the MFCC's recall after the G.N operation, high recall R and precision P can still be guaranteed after the other CPOs. In addition, compared with the methods [3], [5]-[7], we obtain similar or even better recall R and precision P under several CPOs. The exception arises because the robustness of the MFCC is poor and its performance degrades after noise is added.
The recall and precision influence each other, and ideally both are high; in general, however, the higher the recall, the lower the precision. Drawing the Precision-Recall (P-R) curve allows the interaction between recall and precision to be observed visually. Fig. 9 shows a comparison of the P-R curves of the proposed algorithm and the existing methods [3], [5]-[7].
As can be seen from Fig. 9, the area enclosed by the P-R curve and the coordinate axes for the proposed method is larger than that of the existing methods [3], [5]-[7], which indicates that the retrieval performance of this algorithm is optimal. In addition, since recall and precision influence each other, the precision of the method in this paper is affected most when the recall reaches 1.
For the speech retrieval experiment, all query speeches are processed through the 5 kinds of CPOs and then matched against the system hashing index table. Fig. 10 shows an example with the 500-th speech as the query speech: after the MP3 operation, the BER is calculated against the system hashing index table to obtain the matching result.
As shown in Fig. 10, except for the BER of the 500-th speech in the system hashing index table, all other BERs are greater than the thresholds T_0 = 0.33 and T_1 = 0.33, so the retrieval is successful.
To evaluate the retrieval efficiency of the algorithm, 10,000 speeches were randomly selected from THCHS-30. The average retrieval time of the algorithm is calculated and compared with the methods [3], [5]-[7]. Table 7 shows the experimental results.
As can be seen from Table 7, the retrieval efficiency of this method is higher than that of the methods [5], [6]. For 4-s speech, the retrieval efficiency is about 6 times higher than Zhao's [5] and 7 times higher than He's [6]; for 10-s speech, it is about 5 times higher than Zhao's [5] and 6 times higher than He's [6]. This is because our method uses CNN to shorten the feature extraction time and combines BiLSTM to extract the spatiotemporal features of speech, which improves the retrieval efficiency, whereas the methods [5], [6] only extract speech features and then construct perceptual hashing for retrieval. The methods [3], [7] use relatively simple feature extraction methods with low feature dimensions; since longer speech segments are used in this paper, the retrieval efficiency is slightly lower than that of the methods [3], [7]. Compared with the short-term cross-correlation used directly in method [7], our method uses deep learning to extract the Log-Mel spectrogram/MFCC in depth, which achieves better retrieval performance.

D. SECURITY ANALYSIS
In this paper, we encrypt the speech data using the speech encryption algorithm described in Section IV-B with the selected key K = (0.3, 0.3, 0.3, 0.3) and parameters a = 10, b = 76, c = 3. Fig. 11 shows the waveforms and spectrograms of the original and encrypted speech: Fig. 11(a) shows the original speech waveform, Fig. 11(b) the original speech spectrogram, Fig. 11(c) the encrypted speech waveform, and Fig. 11(d) the encrypted speech spectrogram.
As can be seen from Fig. 11(c), the encrypted speech waveform is evenly distributed, with almost no features that can be exploited. Fig. 11(d) shows that the pixels of the encrypted speech spectrogram are randomly distributed, and no speech features can be seen. The figures show that the algorithm has a good chaotic effect and high security. A good encryption system must have a large enough key space to defend against exhaustive attacks, and a key space greater than $2^{100} \approx 10^{30}$ is generally considered sufficient to meet the security requirements. In this paper, the key of the speech encryption algorithm adopts double-precision floating-point data accurate to 12 decimal places, and the key space can reach $2 \times 10^{16} \times 2 \times 10^{16} \times 2 \times 10^{16} \times 2 \times 10^{16} = 16 \times 10^{64} \approx 2^{217}$. If the system parameters a, b, c and the number of iterations are also taken into account, the key space is large enough to resist exhaustive attacks.
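A quick arithmetic check of this estimate:

```latex
\log_2\!\left(16 \times 10^{64}\right) = 4 + 64\log_2 10 \approx 4 + 64 \times 3.3219 \approx 216.6,
\qquad\text{so } 16 \times 10^{64} \approx 2^{217} \gg 2^{100}.
```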
To further verify the performance of the proposed speech encryption algorithm, we analyze the perceptual evaluation of speech quality (PESQ) of the encrypted and decrypted speech. PESQ [35], recommended by the Telecommunication Standardization Sector of ITU (ITU-T) as P.862, produces a mean opinion score (MOS) ranging from 1.0 (worst) to 4.5 (best). It is generally expected that the PESQ-MOS of encrypted speech decreases to 1.0 or lower, while that of decrypted speech increases to 2.5 or even higher. We randomly selected 15 speeches for the experiment; the average PESQ-MOS values obtained are shown in Table 8.
As can be seen from Table 8, the average PESQ-MOS of the encrypted speech is only 0.7619, which indicates that the encrypted speech has poor auditory quality, a good encryption effect, and will not reveal the speech content. Compared with Zhang's [7], the encryption effect is better. The PESQ-MOS of the decrypted speech is 4.4999, indicating that the decryption effect is very good and the decryption algorithm has little effect on the audio quality. Therefore, the speech encryption method can meet the security requirements of the system.

VI. CONCLUSION AND FUTURE WORK
In this paper, we have proposed an encrypted speech retrieval method based on deep perceptual hashing and CNN-BiLSTM. The proposed method uses CNN to extract the deep spatial features of speech and shorten the feature extraction time, while BiLSTM extracts the time-sequence features of speech. We propose a CNN-BiLSTM fusion network model to learn the deep perceptual features of speech, combine them with the hash function to generate a deep perceptual hashing code, and use the normalized Hamming distance to achieve retrieval. Meanwhile, a speech encryption algorithm based on a 4D hyperchaotic system with quadratic nonlinearity is proposed, which can effectively improve the security and privacy of speech in the cloud. The experimental results show that the proposed method has good discrimination, robustness, recall and precision compared with the existing methods, and it has good retrieval efficiency and retrieval accuracy for longer speech. Moreover, the proposed speech encryption algorithm has a large key space, which can effectively resist exhaustive attacks.
In addition, our scheme has not yet achieved efficient retrieval of even longer speech, and the research on robustness is still insufficient. In future work, we will try to address these problems.
QIUYU ZHANG (Member, IEEE) graduated from the Gansu University of Technology, in 1986. He is currently working as a Professor/Ph.D. Supervisor with the School of Computer and Communication, Lanzhou University of Technology. He is also the Vice Dean of the Gansu Manufacturing Information Engineering Research Center. His research interests include network and information security, information hiding and steganalysis, image understanding and recognition, and multimedia communication technology. He is also a member of ACM and a CCF Senior Member.
YUZHOU LI received the B.S. degree in communication engineering from the Lanzhou University of Technology, Gansu, China, in 2017, where he is currently pursuing the master's degree with the School of Computer and Communication. His research interests include audio signal processing and application, information security, multimedia authentication, and retrieval techniques.
YINGJIE HU received the M.S. degree in computer software and theory from Lanzhou University, Lanzhou, China, in 2011. She is currently working as a Lecturer with the School of Computer and Communication, Lanzhou University of Technology. Her research interests include multimedia information processing and application, information security, multimedia authentication, and retrieval techniques.
XUEJIAO ZHAO received the B.S. degree in digital media technology from the Lanzhou University of Arts and Science, Gansu, China, in 2018. She is currently pursuing the master's degree with the School of Computer and Communication, Lanzhou University of Technology. Her research interests include audio signal processing and application, information security, multimedia authentication, and retrieval techniques.