A Long Sequence Speech Perceptual Hashing Authentication Algorithm Based on Constant Q Transform and Tensor Decomposition

Most speech authentication algorithms are over-optimized for robustness and efficiency, resulting in poor discrimination. Hashing with short sequences makes it likely that different speech segments produce the same hashing sequence, which causes serious deviations in authentication. Little research has addressed the effect of hashing sequence length on discrimination, so this paper proposes a long sequence speech authentication algorithm based on constant Q transform (CQT) and tensor decomposition (TD). A long hashing sequence is used to solve the poor collision resistance of existing algorithms, so that fast and accurate authentication can be achieved for important speech fragments with large data volumes. The sub-bands in the frequency domain are first divided into different matrices, then the variance set of the sub-bands is obtained, and finally the feature values are obtained by CQT and TD. The obtained feature values have strong robustness and can cope with the interference of complex channel environments. The Texas Instruments and Massachusetts Institute of Technology (TIMIT) speech database and a Text to Speech (TTS) database are used to establish a database of 51600 speech clips to verify the performance of the algorithm. Experimental results show that, compared with existing speech authentication algorithms, the proposed algorithm offers high discrimination, strong robustness and high efficiency.


I. INTRODUCTION
With the development of multimedia technology, speech not only has a huge amount of data, but also has the characteristics of high redundancy and low confidentiality. Therefore, speech authentication, integrity verification and content recognition face great challenges. At present, speech authentication methods mainly include watermarking technology and digital signatures. The disadvantage of watermarking is that the original data are modified and the quality of the speech is degraded after the watermark is embedded [1], [2], [3], [36]. Digital signature technology is too sensitive to bit-level changes in speech data to be suitable for speech content [35]. The perceptual hash function converts the speech data into a short binary string. When speech data are the same or perceptually similar, they generate the same hash value; for different speech data, the hash function produces different hashing sequences [4]. Therefore, speech content authentication based on perceptual hashing overcomes the disadvantages of the above methods and is also suitable for speech authentication in the big data environment.
The associate editor coordinating the review of this manuscript and approving it for publication was Aniello Castiglione.
Speech perceptual hashing authentication mainly consists of two parts, hash construction and matching, of which hash construction has a very important impact on the performance of the algorithm. At present, the features extracted from speech signals include short-term energy, short-term correlation, Mel-frequency cepstral coefficients (MFCC) [7], [28], cochleagram [9], spectral entropy [11], short-term zero-crossing rate [12], discrete wavelet transform (DWT) [10], [13], linear prediction coefficients (LPC) [14], spectrogram [22], [27], formants [24], Bark-frequency cepstral coefficients [29] and multiple fusion features. Li et al. [8] proposed an audio hash scheme based on non-negative matrix factorization (NMF) of modified discrete cosine transform (MDCT) coefficients. The algorithm has good robustness, especially against compression such as MP3 and AAC, but its processing efficiency is relatively low. Zhang et al. [11] proposed an efficient perceptual hashing based on improved spectral entropy for speech authentication. The algorithm is more efficient, but its collision resistance and its robustness to MP3 compression are relatively poor. In Ref. [25], the speech authentication algorithm used a ternary hashing sequence instead of a binary one, and the hash construction proved to be flexible. The algorithm is not only robust to content preserving operations, but also highly efficient. Jiang et al. [26] proposed an audio fingerprint extraction algorithm based on lifting wavelet packets and improved optimal-basis selection. Although the algorithm is robust and efficient, it reflects fragmentary speech data and has certain limitations. Hammad and Wang [39] proposed a secure multimodal biometric system that fuses electrocardiogram (ECG) and fingerprint based on a convolutional neural network (CNN). The proposed algorithm is efficient, robust and reliable, and provides a new idea for speech authentication.
Although the length of the hashing sequence can affect the collision resistance of speech authentication, there is a lack of research on it. In Ref. [10], QR decomposition based on Givens rotation is used to extract speech feature parameters from the wavelet packet coefficient matrix, and the perceptual hashing sequence is then constructed. Although that work compares the effects of hashing sequences of different lengths on discrimination, it adopts a hashing sequence of only 250 bits, without in-depth discussion of the characteristics of long hashing sequences. Zhang et al. [13] proposed a high-performance speech perceptual hashing authentication algorithm based on DWT and a measurement matrix. The algorithm adopts a hashing sequence of 360 bits. Although its discrimination is improved, its comprehensive performance remains to be improved. These results suggest that increasing the hashing sequence length can improve discrimination.
To sum up, existing speech perceptual hashing algorithms adopt short hashing sequences, which easily map multimedia data with different perceptual content to the same perceptual hash value and thus lower the discrimination of the algorithm. Most authentication algorithms are optimized for robustness and authentication efficiency in isolation, without balancing the performance of the whole algorithm. To solve these problems, this paper studies a novel long sequence speech perceptual hashing algorithm based on tensor decomposition. A long hashing sequence improves the discrimination of the algorithm, and the uniform sub-band variance together with CQT enhances its robustness. The structure of the algorithm is also optimized, which improves the authentication efficiency.
The rest of this paper is organized as follows. Section II describes the related theory. Section III details the proposed long sequence speech perceptual hashing authentication algorithm based on CQT and TD. Section IV gives the experimental results and the performance analysis compared with other related methods. Finally, Section V concludes the paper and outlines future work. The major symbols used in this paper are summarized in Table 1 for easy reference.

II. RELATED THEORY INTRODUCTION
A. UNIFORM SUB-BAND VARIANCE
The features of speech and noise differ in the spectral domain. The energy of speech varies greatly across frequency bands: there is a large peak at each formant and little energy at other frequencies. Noise energy, by contrast, is much smaller than speech energy and is more evenly distributed across the frequency band. In this paper, the frequency band variance is used both to reduce noise interference and to enhance the robustness of the algorithm.
Let $x(n)$ be the time-domain waveform of the speech signal and $x_i(m)$ the $i$-th frame obtained after windowing and framing. Then
$$x_i(m) = \omega(m)\, x\big((i-1)T + m\big), \quad 1 \le m \le M, \tag{1}$$
where $\omega(m)$ is the window function, $M$ is the frame length, and $T$ is the frame shift. The spectrum is obtained by applying the discrete Fourier transform (DFT) to $x_i(m)$.
In the frequency domain, the data length of each frame is $M$, and there are $(M/2 + 1)$ spectral lines in the positive frequency domain after the DFT. The $(M/2 + 1)$ spectral lines $X_i = \{X_i(1), X_i(2), \cdots, X_i(M/2 + 1)\}$ are divided into $q$ sub-bands, and each sub-band contains $p = \mathrm{fix}[(M/2 + 1)/q]$ spectral lines (fix[·] denotes the integer part).
The sub-bands are then grouped into $r$ sub-band sets, and the variance of each sub-band set is obtained.
Since each sub-band has $p$ spectral lines after the original DFT, it is called a uniform sub-band; in other words, all sub-bands have equal bandwidth. Each sub-band set contains the same number of sub-bands. The variances of all sub-band sets in frame $i$ are $D_{r,i} = [D_i^{(1)}, D_i^{(2)}, \cdots, D_i^{(r)}]$.
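The framing, DFT and sub-band grouping described in this subsection can be sketched in Python as follows. This is a minimal illustration, not the authors' MATLAB implementation: the function names, the frame shift `hop`, the use of the summed spectral amplitude as the value of each sub-band, and the choice to take the variance across the sub-bands of each set are all assumptions, since the exact sub-band and variance formulas are not given here.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split x into frames of length frame_len (shift hop) and apply a Hamming window."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

def subband_set_variance(frames, q=25, r=5):
    """Return an (n_frames, r) array with one variance per sub-band set.

    ASSUMPTIONS: each sub-band is summarized by its summed amplitude, and the
    variance of a set is taken across the q // r sub-bands it contains.
    """
    mag = np.abs(np.fft.rfft(frames, axis=1))      # M/2 + 1 spectral lines per frame
    p = mag.shape[1] // q                          # lines per sub-band, fix[(M/2 + 1)/q]
    # Drop the leftover lines, then sum the amplitude inside each uniform sub-band.
    bands = mag[:, :p * q].reshape(-1, q, p).sum(axis=2)
    # Group the q sub-bands into r equal sets (q must be divisible by r).
    sets = bands.reshape(-1, r, q // r)
    return sets.var(axis=2)
```

With the paper's settings M = 178, q = 25 and r = 5, each frame has 90 spectral lines, each sub-band holds p = 3 lines, and each of the 5 sets groups 5 sub-bands.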

B. CONSTANT Q TRANSFORM
The essence of CQT is variable-resolution processing: the low-frequency part has high frequency resolution and the high-frequency part has high time resolution. CQT not only inherits the high resolution and high precision of the DFT, but also has good robustness [16], [17]. In CQT, the center frequencies $f_k$ of the frequency bands are related by Equation (7):
$$f_k = f_{\min} \cdot 2^{(k-1)/b}, \tag{7}$$
where $f_{\min}$ is the lowest frequency of the CQT spectrum and $b$ is a parameter that determines the trade-off between time and frequency resolution. It can be seen from Equation (7) that the bandwidth differs from band to band, unlike the DFT, where every band has equal bandwidth. In CQT, $Q$ is the ratio of center frequency to bandwidth, which is a constant independent of $k$:
$$Q = \frac{f_k}{\delta f} = \frac{1}{2^{1/b} - 1}, \tag{8}$$
where $\delta f$ is the bandwidth of band $k$. The CQT of a discrete signal $x(n)$ is given by Equation (9), where $k = 1, 2, \cdots, K$ is the frequency band number; $a_k^*(n)$ denotes the complex conjugate of $a_k(n)$; $N_k$ are the variable window lengths; and $\lfloor \cdot \rfloor$ denotes rounding towards negative infinity.
In the associated definitions, $f_s$ is the sampling rate; $\omega(t)$ is a window function (e.g. a Hamming window); $\varphi_k$ is the phase shift; and a given scaling factor is applied.
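The geometric spacing of the CQT bands can be illustrated numerically. The sketch below assumes the usual form $f_k = f_{\min} \cdot 2^{(k-1)/b}$ and $Q = 1/(2^{1/b} - 1)$, and takes the window length as $N_k = \lceil Q f_s / f_k \rceil$, a common convention that the text does not confirm. The function name `cqt_bands` and the value `f_min = 100.0` are hypothetical; the paper states only b = 12, K = 34 and a 16 kHz sampling rate.

```python
import numpy as np

def cqt_bands(f_min, b, K, fs):
    """Center frequencies, Q factor, bandwidths and window lengths of K CQT bands."""
    k = np.arange(1, K + 1)
    f_k = f_min * 2.0 ** ((k - 1) / b)        # geometric spacing, b bands per doubling
    Q = 1.0 / (2.0 ** (1.0 / b) - 1.0)        # constant ratio f_k / bandwidth
    bandwidth = f_k / Q                       # grows with k, unlike the DFT
    N_k = np.ceil(Q * fs / f_k).astype(int)   # ASSUMED window-length convention
    return f_k, Q, bandwidth, N_k

# f_min is a hypothetical value chosen only for illustration.
f_k, Q, bandwidth, N_k = cqt_bands(f_min=100.0, b=12, K=34, fs=16000)
print(round(Q, 3), f_k[:3], N_k[:3])
```

Low bands get long windows (fine frequency resolution) and high bands get short windows (fine time resolution), which is the variable-resolution behaviour described above.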

C. TENSOR DECOMPOSITION
TD is an efficient tool for data analysis and has been successfully applied in many areas, such as data mining, graph analysis, signal processing and computer vision [19], [21], but its use in speech perceptual hashing authentication is rarely discussed. In this paper, TD is used to derive the perceptual hash, and the Tucker decomposition is selected to realize TD. For a third-order feature tensor $V \in \mathbb{R}^{Q_1 \times Q_2 \times Q_3}$, the Tucker decomposition decomposes it into a core tensor $G \in \mathbb{R}^{I \times J \times K}$ and three orthogonal factor matrices $U_1 \in \mathbb{R}^{Q_1 \times I}$, $U_2 \in \mathbb{R}^{Q_2 \times J}$, and $U_3 \in \mathbb{R}^{Q_3 \times K}$. Mathematically, the Tucker decomposition is expressed in Equation (14):
$$V \approx \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} g_{i,j,k}\, u_i^{(1)} \circ u_j^{(2)} \circ u_k^{(3)} = [\![G;\, U_1, U_2, U_3]\!], \tag{14}$$
where $u_i^{(1)}$, $u_j^{(2)}$, and $u_k^{(3)}$ are the column vectors of the matrices $U_1$, $U_2$, and $U_3$ respectively; $g_{i,j,k}$ are the elements of the core tensor $G$; the symbol '$\circ$' denotes the outer product of vectors; and $[\![\cdot]\!]$ is a concise representation of the Tucker decomposition. Equation (14) can be rewritten element-wise as Equation (15):
$$v_{w,h,r} \approx \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} g_{i,j,k}\, u_{w,i}^{(1)}\, u_{h,j}^{(2)}\, u_{r,k}^{(3)}, \tag{15}$$
where $v_{w,h,r}$, $u_{w,i}^{(1)}$, $u_{h,j}^{(2)}$ and $u_{r,k}^{(3)}$ are the elements of $V$, $U_1$, $U_2$ and $U_3$ respectively. Computing the Tucker decomposition is equivalent to solving the following optimization problem:
$$\min_{G,\, U_1, U_2, U_3} \big\| V - [\![G;\, U_1, U_2, U_3]\!] \big\|_2, \tag{16}$$
where $\|\cdot\|_2$ is the Frobenius norm. In general, this optimization problem can be solved by alternating least squares (ALS). The Tucker decomposition is shown in Fig 1.
In this paper, the sub-band variance set matrix of the speech frequency domain is converted by CQT into a third-order tensor $V \in \mathbb{R}^{K \times r \times N}$. The tensor embodies the whole framework of the speech features in three-dimensional space, and the sub-band variance features of each frame of the speech signal are precisely positioned in that space. This paper adopts a target tensor, which is very similar to the original feature tensor but suppresses noise and enhances the speech features. To obtain the target tensor, the core tensor is recombined with the orthogonal factor matrices, as shown in Equation (17):
$$W = [\![G;\, U_1, U_2, U_3]\!], \tag{17}$$
where $W$ is the target tensor. By reducing the dimension of the target tensor $W$, a one-dimensional long matrix $P$ is obtained:
$$P = [p_1, p_2, \cdots, p_N], \quad p_n = \operatorname{mean}\big(W_{K,r}^{(n)}\big),$$
where $W_{K,r}^{(1)}, W_{K,r}^{(2)}, \cdots, W_{K,r}^{(N)}$ are the matrices for each frame of the target tensor $W$, and $p_1, p_2, \cdots, p_N$ are the mean values of the feature matrix of the target tensor $W$ for each frame.
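The decomposition, recombination and per-frame averaging described above can be sketched with NumPy alone. Note the substitution: instead of the ALS solver mentioned in the text, this sketch uses a truncated higher-order SVD (HOSVD), which also yields a core tensor and orthonormal factor matrices but is not guaranteed to match the authors' solution. The function names and the `ranks` argument are hypothetical, and the paper does not state the core tensor size.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: rows index the chosen axis, columns all the others."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_dot(T, M, mode):
    """Mode-n product: multiply matrix M onto axis `mode` of tensor T."""
    out = np.tensordot(M, T, axes=(1, mode))   # the new axis comes out first
    return np.moveaxis(out, 0, mode)

def hosvd(V, ranks):
    """Truncated HOSVD (a stand-in for ALS): returns core G and factor list."""
    factors = []
    for mode, rank in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(V, mode), full_matrices=False)
        factors.append(U[:, :rank])            # leading left singular vectors
    G = V
    for mode, U in enumerate(factors):
        G = mode_dot(G, U.T, mode)             # project onto each factor basis
    return G, factors

def reconstruct(G, factors):
    """Target tensor W: the core tensor recombined with the factor matrices."""
    W = G
    for mode, U in enumerate(factors):
        W = mode_dot(W, U, mode)
    return W

def frame_means(W):
    """Mean of each K x r frame slice of W (shape K x r x N), giving a length-N vector."""
    return W.mean(axis=(0, 1))
```

With full ranks the reconstruction reproduces the input exactly; choosing smaller ranks gives a smoothed tensor of the same shape, which is the role the target tensor plays here.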

III. THE PROPOSED ALGORITHM
The generic block diagram of the proposed long sequence speech perceptual hashing authentication algorithm based on CQT and TD is shown in Fig 1. Hash construction and matching of the speech signal are carried out, and the processing steps are as follows:
Step 1: Pre-processing. Pre-processing includes pre-emphasis, framing and windowing. The speech signal $x(n)$ is obtained by pre-emphasizing the input signal $s(n)$. Pre-emphasis boosts the high-frequency components of the speech signal, which benefits the subsequent spectrum analysis. The processed signal is then framed and windowed, where a Hamming window is selected to smooth the edges of each frame. The speech $x(n)$ is divided into $N$ frames, and the signal $x(m) = \{x_i(m) \mid i = 1, 2, \cdots, N,\ m = 1, 2, \cdots, M\}$ is obtained, where the subscript $i$ denotes the $i$-th frame after framing.
The variance of each sub-band set is then found, giving the sub-band variance set matrix.
Step 4: CQT. The sub-band variance set matrix is transformed by CQT to obtain the two-dimensional feature matrix $E^*_{K,N}$. The feature matrices of the variances of all sub-band sets are fused to obtain a feature tensor $V$.
Step 5: TD. The feature tensor is decomposed by Tucker decomposition, and then the low-dimensional core tensor $G$ and the three orthogonal matrices $U_{(1,2,3)}$ are recombined to obtain the target tensor $W$.
Step 6: Long hashing sequence construction. The mean value of each frame of the target tensor $W$ is calculated to obtain the target matrix $P$. A one-dimensional binary long hashing sequence $h$ is then generated from the target matrix.
Here $h(1) = 0$, and $h(i)$ is the perceptual hash value of the $i$-th frame of the speech signal.
Step 7: Hashing digital distance and matching. For two speech clips s1 and s2, the hashing digital distance BER(:, :) is calculated as
$$BER(h_{s1}, h_{s2}) = \frac{1}{N} \sum_{i=1}^{N} \big| h_{s1}(i) - h_{s2}(i) \big|,$$
where $h_{s1}$ and $h_{s2}$ are the long hashing sequences of s1 and s2, and $N$ is the length of the hashing sequence.
In this paper, a hypothesis test on the hashing digital distance BER(:, :) is used to describe hash matching.
$W_0$: the perceptual contents of the two speech clips s1 and s2 are the same:
$$BER(h(s1), h(s2)) \le \tau$$
$W_1$: the perceptual contents of the two speech clips s1 and s2 are not the same:
$$BER(h(s1), h(s2)) > \tau$$
where $\tau$ is the perceptual authentication threshold and $h(\cdot)$ is the perceptual hashing function. The matching threshold $\tau$ is set, and the digital distance between the perceptual hashing sequences of the speech clips s1 and s2 is calculated.
If the digital distance BER(:, :) $\le \tau$, the perceptual contents are treated as the same and the authentication passes; otherwise it fails.
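Steps 6 and 7 can be sketched in a few lines of Python. The binarization rule is an assumption: the text fixes only h(1) = 0, so the sketch sets a bit to 1 when the frame mean increases relative to the previous frame, a rule common in perceptual hashing but not confirmed here. The function names are hypothetical.

```python
import numpy as np

def build_hash(P):
    """Binary hashing sequence from the frame-mean vector P (ASSUMED rule)."""
    P = np.asarray(P, dtype=float)
    h = np.zeros(len(P), dtype=np.uint8)          # h(1) = 0, as stated in the text
    h[1:] = (P[1:] > P[:-1]).astype(np.uint8)     # assumed: 1 if the mean increases
    return h

def ber(h1, h2):
    """Normalized Hamming distance (bit error rate) between two hashes."""
    h1, h2 = np.asarray(h1), np.asarray(h2)
    if h1.shape != h2.shape:
        raise ValueError("hash sequences must have the same length")
    return float(np.mean(h1 != h2))

def same_content(h1, h2, tau):
    """Accept hypothesis W0 (same perceptual content) when BER <= tau."""
    return ber(h1, h2) <= tau

h_a = build_hash([1.0, 3.0, 2.0, 2.0, 5.0])
h_b = np.array([0, 1, 1, 0, 0], dtype=np.uint8)
print(h_a.tolist(), ber(h_a, h_b), same_content(h_a, h_b, tau=0.35))
```

For binary sequences, counting unequal positions is the same as summing the absolute differences, so `ber` matches the usual definition of the normalized Hamming distance.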
In order to evaluate the performance of the authentication algorithm, the False Accept Rate (FAR) and False Reject Rate (FRR) of the algorithm are calculated by Equations (25) and (26), where $\tau$ is the perceptual authentication threshold, $\mu$ is the expected value, and $\sigma$ is the standard deviation. FAR and FRR are used to evaluate the discrimination and the robustness of the authentication algorithm, respectively: a lower FAR denotes better discrimination, and a lower FRR denotes better robustness.
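Under the normal approximation used in this paper, FAR and FRR reduce to tail probabilities of a Gaussian. The sketch below assumes FAR(τ) is the probability that the BER of a different-content pair falls at or below τ, and FRR(τ) is the probability that the BER of a same-content pair exceeds τ, each evaluated with the µ and σ of its own BER distribution. These are assumed forms, since the equations themselves are not reproduced here. `math.erfc` is used instead of `1 + erf` to keep precision in the far tails.

```python
import math

def far(tau, mu, sigma):
    """P(BER <= tau) for different-content pairs, BER ~ N(mu, sigma^2)."""
    z = (tau - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * math.erfc(-z)      # normal CDF, accurate for very small tails

def frr(tau, mu, sigma):
    """P(BER > tau) for same-content pairs, BER ~ N(mu, sigma^2)."""
    z = (tau - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * math.erfc(z)       # complement of the normal CDF

# Theoretical parameters for a 1064-bit hash (mu = 0.5, sigma = 0.0153).
for tau in (0.30, 0.35, 0.40):
    print(tau, far(tau, 0.5, 0.0153))
```

Lowering τ shrinks FAR but raises FRR, which is why the two curves are examined together when choosing the threshold.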

IV. EXPERIMENTAL RESULTS AND ANALYSIS
The experimental hardware platform is an Intel(R) Core(TM) i5-7500 CPU at 3.40 GHz with 4 GB of memory. The software environment is MATLAB R2018b on Windows 7. After many experiments, we found that the following parameters give the best results for the proposed algorithm: M = 178; N = 1064; q = 25; r = 5; b = 12; K = 34. Here M is the length of a frame of the speech signal, N is the length of the hashing sequence, q is the number of sub-bands in the frequency domain, r is the number of sub-band variance sets, b is the CQT parameter in Equation (7), and K is the number of frequency bands.

A. DATASETS
The experimental speech data come from the TIMIT speech database and the TTS speech database. There are 1200 different speech clips in the original speech database. Each speech clip is a 4 s WAV file in 16-bit PCM format, mono, sampled at 16 kHz.
According to the environment of speech transmission, content preserving operations are performed on each clip in the speech database. A database of 14400 clips subjected to content preserving operations was established, covering 12 types of operation, such as echo, noise, low-pass filtering, resampling and MP3 compression. In order to simulate the mixed noises of a real environment, noise from the Noisex-92 database was added to the original speech database. A database of 36000 clips with real background noise was established, covering 6 types of noise: Gnoisegen noise, Pink noise, Factory floor noise 1, Factory floor noise 2, Babble noise and Volvo noise. The signal-to-noise ratios of the added noises are 0 dB, 5 dB, 10 dB, 15 dB and 20 dB.
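Mixing a noise recording into a clean clip at a prescribed SNR follows directly from the definition SNR = 10 log10(P_speech / P_noise). The sketch below is a generic illustration and not the authors' MATLAB procedure; the function name is hypothetical, and looping or trimming the noise to the clip length is an assumed convenience.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Return speech + scaled noise such that the mixture has the requested SNR (dB)."""
    speech = np.asarray(speech, dtype=float)
    # Loop or trim the noise so it covers the whole clip (assumed convenience).
    noise = np.resize(np.asarray(noise, dtype=float), speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose the gain so that p_speech / (gain^2 * p_noise) = 10^(snr_db / 10).
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# 6 noise types x 5 SNR levels x 1200 clips gives the 36000 noisy clips.
print(6 * 5 * 1200)
```

Together with the 1200 original clips and the 14400 clips from content preserving operations, this accounts for the 51600 clips used in the experiments.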

B. DISCRIMINATION TEST AND ANALYSIS
The BER of the perceptual hashing values of different speech contents basically obeys a normal distribution. 719400 BER values are obtained by pairwise comparison of the perceptual hashing values of the 1200 speech clips. The BER normal distributions for different hashing sequence lengths are shown in Fig 2. The closer the BER distribution is to the normal curve, the better the randomness and collision resistance of the perceptual hashing sequence. The experimental results show that the probability distribution of the BER values of different speech clips coincides closely with the standard normal probability curve, and that the BER range for the 1064-bit sequence selected in this paper is narrower than that for 532 bits, 639 bits and 798 bits. The 1064-bit hashing length therefore gives the best result.
According to the De Moivre-Laplace central limit theorem, the normalized Hamming distance approximately obeys a normal distribution with $\mu = p$ and $\sigma = \sqrt{p(1-p)/N}$, where $N$ is the number of bits in a hashing sequence and $p$ is the probability that a bit is 0 or 1. In this paper, the length of the perceptual hashing sequence is 1064 bits, so the theoretical mean and standard deviation are $\mu = 0.5000$ and $\sigma = 0.0153$. Table 2 lists the mean and standard deviation of the theoretical and experimental normal distributions for different hashing sequence lengths. It can be seen from Table 2 and Fig 3 that the values of $\mu$ and $\sigma$ measured in this paper are very close to the theoretical ones. As can be seen from Fig 3, the actual curve approaches the theoretical curve as the hashing sequence length increases, indicating that the hashing sequence generated by this algorithm has good randomness and collision resistance.
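The theoretical parameters and the number of pairwise comparisons quoted above can be checked with a short calculation. The function names are hypothetical; the formulas are the ones stated in the text, with p = 0.5.

```python
import math

def theoretical_ber_params(n_bits, p=0.5):
    """Mean and standard deviation of the BER for an n_bits random binary hash."""
    return p, math.sqrt(p * (1.0 - p) / n_bits)

def n_pairs(n_clips):
    """Number of distinct clip pairs compared in the discrimination test."""
    return n_clips * (n_clips - 1) // 2

for n_bits in (532, 639, 798, 1064):
    mu, sigma = theoretical_ber_params(n_bits)
    print(n_bits, mu, round(sigma, 4))

print(n_pairs(1200))
```

The standard deviation shrinks as the hash gets longer, which is why the BER distribution of a 1064-bit hash is more tightly concentrated around 0.5 than that of the shorter sequences.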
In order to evaluate the discrimination ability of the algorithm under different thresholds, the FAR is obtained from Equation (25). Table 3 compares the FAR for different hashing sequence lengths and for different algorithms.
As shown in Table 3, the smaller the matching threshold $\tau$, the smaller the FAR. When the hashing sequence length is 1064 bits and the threshold is set to $\tau = 0.35$, about 1.31 of every $10^{21}$ speech clips are falsely accepted. As the length of the hashing sequence increases, the FAR decreases, indicating that discrimination increases. Compared with the shorter hashing sequences tested with this algorithm, the long hashing sequence selected here gives the best FAR, which also shows that a long hashing sequence has high discrimination. When $\tau = 0.35$, 1.37 of every $10^{7}$ speech clips in Ref. [8] are falsely accepted, 6.10 of every $10^{5}$ speech clips in Ref. [11] are falsely accepted, and 4.28 of every $10^{8}$ speech clips in Ref. [12] are falsely accepted. Although Refs. [8], [11], [12] can also completely discriminate different speech clips, the algorithm in this paper has a much lower FAR. Compared with the short hashing sequences used in Refs. [8], [11], [12], the long hashing sequence in this paper has a great advantage in discrimination.
Entropy rate (ER) is a comprehensive index for evaluating the discrimination of a perceptual hashing algorithm, which mainly overcomes the shortcoming of being susceptible to the sequence size. The value of ER ranges from 0 to 1, and a larger value indicates stronger discrimination. It is calculated by Equations (27) and (28), where $\sigma$ and $\sigma_1$ are the theoretical and experimental standard deviations of the BER, respectively. According to Table 4 and Table 5, the ER of the algorithm in this paper increases with the hashing sequence length. When the hashing sequence length is 1064 bits, the ER is the highest, which shows that the hashing sequence has good discriminability. Compared with Refs. [5], [6], [8], [11], the ER of the algorithm in this paper is the highest, indicating that its discrimination is the best.
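A numerical sketch of the entropy rate follows. The formula is an assumption: it uses the form that commonly appears in the perceptual hashing literature, $q = \tfrac{1}{2}\big[\sqrt{(\sigma_1^2 - \sigma^2)/(\sigma_1^2 + \sigma^2)} + 1\big]$ and $ER = -[q \log_2 q + (1 - q)\log_2(1 - q)]$, which may differ from Equations (27) and (28). The function name and the absolute value guarding against $\sigma_1 < \sigma$ are also additions for illustration.

```python
import math

def entropy_rate(sigma_theory, sigma_exp):
    """Entropy rate from theoretical and experimental BER standard deviations.

    ASSUMED form, taken from common usage in the perceptual hashing literature.
    """
    s0, s1 = sigma_theory ** 2, sigma_exp ** 2
    # abs() is a safeguard added here in case the measured sigma is below theory.
    q = 0.5 * (math.sqrt(abs(s1 - s0) / (s1 + s0)) + 1.0)
    if q >= 1.0:
        return 0.0
    return -(q * math.log2(q) + (1.0 - q) * math.log2(1.0 - q))

# The closer the experimental sigma is to the theoretical one, the closer ER is to 1.
print(entropy_rate(0.0153, 0.0153), entropy_rate(0.0153, 0.0200))
```

Under this form, an experimental standard deviation equal to the theoretical one gives q = 0.5 and ER = 1, and ER falls towards 0 as the two diverge, matching the stated range of 0 to 1.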

C. ROBUSTNESS TEST AND ANALYSIS
In order to evaluate the robustness of the proposed algorithm, hashing sequences are generated for the 14400 speech segments subjected to content preserving operations. The mean BER between the hashing sequence of each original clip and that of the clip after the operation is then obtained. Table 6 lists the content preserving operations that simulate a real environment. The BER values for different hashing sequence lengths are shown in Table 7.
It can be seen from Table 7 that the mean BER of the algorithm in this paper does not exceed 0.1713, and the maximum BER does not exceed 0.2444. This shows that the proposed algorithm has good robustness to the various content preserving operations. As the length of the sequence increases, the robustness to all operations except echo decreases. The reduction is only slight and does not affect the overall robustness of the algorithm. At the same time, the average running time increases as the length of the hashing sequence increases. In this paper, 1064 bits are used to balance discriminability and robustness, and the overall effect is the best.
719400 BER values are obtained by pairwise comparison of the perceptual hashing values of the 1200 speech clips. With the hashing length set to 532 bits, 639 bits, 798 bits and 1064 bits, the FAR-FRR curves are obtained. The comparison results are shown in Fig 4. As shown in Fig 4, the FRR and FAR curves for the different hashing sequence lengths do not overlap, so speech subjected to content preserving operations can be accurately distinguished from speech of different content, indicating that the algorithm in this paper has good discrimination and robustness. The mean BER comparison of this algorithm with Refs. [6], [8], [11] is shown in Table 8.
As can be seen from Table 8, the proposed algorithm is superior to the other algorithms under volume adjustment, resampling, Gaussian noise and MP3 compression. Therefore, the proposed algorithm has better robustness, and its advantage is especially clear for MP3 compression. Compared with Ref. [11], the algorithm in this paper is better in resampling, noise, MP3 compression and other aspects, so it is suitable for complex communication environments. Because TD and hash construction take a long time in this paper, its running time is longer than that of Ref. [11], which is therefore more suitable for instant messaging. Compared with Ref. [8], although the robustness of the algorithm in this paper is slightly lower in terms of volume and echo, it is far better in other aspects. Since Ref. [8] uses NMF, which has a relatively complex structure, its running time is also much higher than that of the algorithm in this paper. Compared with Ref. [6], the algorithm in this paper has better overall performance.
Through pairwise comparison of the perceptual hashing values of the 1200 speech clips, 719400 BER values and the FRR-FAR curves are obtained. The comparison results of the different algorithms are shown in Fig 5. As shown in Fig 5(a), the length of the hashing sequence used in this paper is 1064 bits. The FRR-FAR curves obtained experimentally do not overlap, which indicates that the algorithm in this paper not only has good discrimination and robustness, but can also accurately distinguish content preserving operations from speech of different content. As shown in Fig 5(b), although the FRR-FAR curves obtained in Ref. [11] do not overlap, the two curves are close to each other, so the problems of discrimination and robustness are not well solved. The comparison also shows that the proposed algorithm is better than that in Ref. [11] in terms of discrimination and collision resistance. In Fig 5(c) and Fig 5(d), the FRR and FAR curves of the two algorithms intersect, showing that discrimination and robustness are not well handled. It can be seen from Table 8 that this algorithm is superior to Ref. [8] and Ref. [6] in terms of discrimination and robustness.
VOLUME 8, 2020

D. PASSING RATE TEST AND ANALYSIS IN REAL NOISE ENVIRONMENT
In order to evaluate the robustness of the proposed algorithm to noise, the passing rate $p_r$ is introduced, where $T_A$ is the number of speech clips correctly accepted by the system among clips with the same perceptual content; $T_R$ is the number of speech clips wrongly rejected by the system; and $F_A$ is the number of speech clips wrongly accepted by the system among clips with different perceptual content. The threshold $\tau$ is selected as the minimum BER of the FAR curve. Different algorithms select different thresholds: 0.4173 for the proposed algorithm, 0.3593 for Ref. [8], 0.3037 for Ref. [11], and 0.3677 for Ref. [12]. Fig 6 compares the passing rate of the proposed algorithm with that of Refs. [8], [11], [12] under six different noise environments. As shown in Fig 6, the algorithm in this paper has strong robustness to Gaussian noise, Factory1 noise and Volvo noise. For Volvo noise in particular, the passing rate reaches 100% at every SNR. For all noises, the passing rate of the algorithm in this paper reaches 100% when the SNR is greater than 30 dB, which Refs. [8], [11], [12] do not match. The stable feature values obtained by TD are robust to different noises. Compared with Ref. [11], the algorithm in this paper has a lower passing rate under Factory2 noise and Pink noise, indicating that the improved spectral entropy is strongly robust against these two kinds of noise. On the whole, however, the proposed algorithm has the best robustness. Compared with the algorithms in Refs. [8], [12], the passing rate of the algorithm in this paper is much higher for every kind of noise. Therefore, the proposed algorithm has better robustness to common noises and can meet the needs of speech matching in daily life.

E. EFFICIENCY TESTING AND ANALYSIS
Efficiency is a very important evaluation criterion in speech content authentication. To evaluate the efficiency of the algorithm in this paper, 200 speech clips are randomly selected from the speech database and the average running time is calculated. The same operating environment is used for all algorithms, and each speech clip is 4 s long. Table 9 compares the algorithm in this paper with the algorithms in Refs. [6], [8], [11], [13].
As shown in Table 9, the efficiency of the algorithm in this paper decreases as the hashing sequence length increases, but the difference is small and the requirements of efficient authentication are still met. The length of the hashing sequence in this paper is 1064 bits. Compared with the other hashing sequence lengths tested in this paper, it is relatively slower, but the discrimination is greatly improved. Compared with other algorithms, the efficiency of the algorithm in this paper is 1.1 times that in Ref. [13], 2.3 times that in Ref. [8], and 1.2 times that in Ref. [6]. However, the efficiency of Ref. [11] is 3.2 times that of the algorithm in this paper. Since this paper adopts a long hashing sequence and tensor decomposition, the complexity is much higher and the average running time is longer than that of Ref. [11]. Because Refs. [6], [8] use NMF, which requires heavy computation and a long running time, their efficiency is lower than that of the algorithm in this paper. Although the length of the hashing sequence in this paper is 4 times that in Ref. [11] and 3 times that in Refs. [6], [8], [13], the algorithm in this paper performs very well in efficiency and can meet the requirements of efficient authentication.
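The timing protocol described in this subsection (a random sample of 200 clips, average running time per clip) can be sketched as a small harness. The function name, the fixed seed and the `hash_fn` callable are hypothetical placeholders; the authors measured their MATLAB implementation, so absolute times from any Python port would not be comparable with Table 9.

```python
import random
import time

def average_runtime(hash_fn, clips, n_samples=200, seed=0):
    """Average wall-clock time of hash_fn over a random sample of clips."""
    rng = random.Random(seed)                       # fixed seed for repeatability
    sample = rng.sample(list(clips), min(n_samples, len(clips)))
    start = time.perf_counter()
    for clip in sample:
        hash_fn(clip)                               # hash construction under test
    return (time.perf_counter() - start) / len(sample)

# Toy usage with a stand-in for the hashing function.
clips = [[0.1 * i for i in range(1000)] for _ in range(20)]
print(average_runtime(sum, clips, n_samples=5))
```

Running every algorithm through the same harness on the same machine and the same sample of clips keeps the comparison fair, which is the point of using the same operating environment.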

F. DISCUSSION
We compared the authentication performance of the proposed algorithm with the perceptual hashing algorithm based on improved spectral entropy and with the perceptual hashing algorithm based on NMF and MDCT coefficients. The authentication performance of the different algorithms was evaluated in detail. The main highlights of the proposed algorithm are summarized below:
1. The algorithm not only increases the length of the hashing sequence and the recognition rate, but can also generate distinct hashing sequences from a large amount of speech data.
2. The features extracted in this paper have strong anti-interference performance, especially against various noises with low signal-to-noise ratio.
3. Compared with existing authentication algorithms, the algorithm in this paper achieves good efficiency.
Given these advantages, the proposed system can be deployed in real speech authentication.
The main disadvantages of the proposed algorithm are:
• The algorithm lacks security and is prone to information leakage.
• In the case of speech tampering, the proposed algorithm cannot perform tamper detection and localization, which is a major flaw.

V. CONCLUSION
This paper presents a long sequence perceptual hashing authentication algorithm based on CQT and TD. The algorithm has good comprehensive performance and addresses existing problems of speech authentication algorithms. The following conclusions can be drawn from the experimental analysis:
A. The algorithm in this paper adopts a long hashing sequence with high discriminability. Different hashing sequences are generated for different speech clips, which effectively reduces the probability that different speech clips are identified as the same clip and improves the authentication rate of the algorithm.
B. The algorithm in this paper has strong robustness to content preserving operations, especially resampling, low-pass filtering, noise and MP3 compression, which indicates that it is suitable for signal transmission in complex environments.
C. In terms of overall performance, when a hashing sequence length of 1064 bits is selected, the algorithm not only balances discrimination and robustness, but also has high efficiency, which meets the requirements of speech authentication in real-time communication environments.
Because an overly long hashing sequence wastes storage space and increases the running time, the hashing sequence length of the algorithm needs to be further optimized, and the security of the algorithm in an open environment needs to be further addressed. The proposed algorithm also needs to address tamper localization in the case of speech tampering.
YIBO HUANG received the Ph.D. degree from the Lanzhou University of Technology, in 2015. He is currently working as an Associate Professor with the College of Physics and Electronic Engineering, Northwest Normal University. His main research interests include multimedia information processing, information security, and speech recognition.
HEXIANG HOU received the B.S. degree in communication engineering from Dezhou University, Shandong, China, in 2018, where he is currently pursuing the M.S. degree in electronic and communications engineering. His research interests include audio signal processing and application, and multimedia authentication.
YONG WANG received the B.S. degree from the Henan Institute of Science and Technology, Henan, China, in 2017. His research interests include audio signal processing and application, and multimedia authentication techniques.
YUAN ZHANG received the B.S. degree in electronic information engineering from the Wuhan Institute of Technology, Hubei, China, in 2017, where he is currently pursuing the M.S. degree in electronic and communications engineering. His research interests include audio signal processing and application, and multimedia authentication.
MANHONG FAN received the M.Sc. degree in circuits and system from Northwest Normal University, Lanzhou, China, in 2012. His research interest includes computer measurement and control.