Introduction
Speech data used in current speech processing research are typically recorded with sophisticated digital recording equipment in carefully designed environments. However, many speech resources, such as those recorded with outdated or low-quality audio equipment [1], [2], or the voices of endangered languages [3], [4], [5], are affected by various acoustic distortions. Speech restoration, which aims to generate high-quality speech from such degraded resources, has been studied to exploit these resources. In particular, the restoration of historical audio resources [6], [7] is crucial for preserving cultural and linguistic materials in danger of being lost. A key challenge in this task is that it is nearly impossible to replicate the original speakers, recording equipment, and age-related degradation, which complicates the collection of training data for speech restoration models. A previous study proposed a general speech restoration model that generates high-quality speech from various types of degraded speech [6]. This model is trained in a supervised manner on artificial paired data generated by randomly applying different acoustic distortions to high-quality speech corpora. However, this simple data generation scheme does not adequately represent the acoustic distortions of real historical speech, so its performance on real historical speech is hampered by a domain-shift problem.
We propose a self-supervised speech restoration method without paired speech data. The model consists of analysis, synthesis, and channel modules, all of which are designed to emulate the recording process of degraded audio signals. The analysis module extracts time-variant undistorted speech features and time-invariant acoustic distortion features (hereafter, channel features) from degraded speech. The other modules implement the degraded-speech generation process: the synthesis module synthesizes undistorted speech (i.e., restored speech), and the channel module adds acoustic distortion to the restored speech. The model is trained in a self-supervised manner by minimizing the reconstruction loss between the degraded input and the reconstructed speech. Fig. 1 illustrates a comparison between our self-supervised learning approach and the previous supervised learning approach [6]. The previous supervised learning approach generates simulated paired data by randomly introducing a variety of acoustic distortions into high-quality speech corpora. In contrast, our self-supervised learning approach uses real historical speech resources to train our speech restoration model. As shown in Fig. 1, we also propose a dual-learning method using unpaired high-quality data, which helps disentangle undistorted speech signals from acoustic distortions.
Overview of this paper. Previous speech restoration approaches have used artificial paired training data created by introducing various acoustic distortions, which can suffer from domain-shift problems on real historical audio resources. Our model is trained on real historical speech data in a self-supervised manner by emulating the recording process of historical speech. We also propose a dual-learning method to stabilize the training process, a perceptual loss using an automatic speech quality assessment model, supervised pretraining with artificial data, and semi-supervised learning to scale up training. Our model also enables audio effect transfer, in which channel features are extracted from historical speech signals and added to arbitrary audio signals. Notations used in § IV are indicated in italics (e.g., “Self-*”).
We also introduce several methods to improve the performance of our speech restoration model. We extend our method with a semi-supervised learning framework, a hybrid method that combines our self-supervised learning approach with conventional supervised learning. This ensures high accuracy in speech feature prediction while enabling domain adaptation to historical speech. We also use a perceptual loss based on an external automatic speech quality assessment model to improve the perceptual quality of the restored speech. Finally, we use a supervised pretraining method, which pretrains the analysis and channel modules on artificial paired data to provide initial model parameters for self-supervised or semi-supervised learning. Table 1 compares the variants of our approach with previous supervised approaches. While the previous supervised approach [6] requires simulated paired data generated from clean speech corpora, our method can use real historical audio resources during training. Our model also allows for audio effect transfer, in which only acoustic distortions are extracted from degraded speech and applied to arbitrary high-quality audio. Experimental evaluations showed that our method achieved significantly higher-quality speech restoration than the previous supervised method. The implementation and audio samples are publicly available. The contributions of this paper are as follows:
We propose a self-supervised speech restoration approach that can learn various acoustic distortions without paired speech corpora and is applicable to real degraded speech data.
Our method achieves significantly higher-quality speech restoration than the previous supervised method, especially on real historical audio resources.
Our model enables audio effect transfer, in which only channel features are extracted from degraded speech and applied to other high-quality speech.
This paper is an extended full paper version of our earlier work [8], supplemented by several contributions:
We propose a perceptual training objective described in § III-C and a semi-supervised learning framework described in § III-D. We verified the effectiveness of both frameworks on real historical speech datasets.
We conducted additional evaluations using a mixture of different historical speech resources. The application of self-supervised learning to such a variety of existing historical speech resources is a crucial experiment in the context of practical applications.
This paper is organized as follows. § II provides the background and related work. § III outlines the basic framework of our proposed self-supervised speech restoration method, along with various strategies to improve its performance. § IV presents an empirical evaluation that demonstrated the effectiveness of the proposed method. Finally, § V presents conclusions and a summary of the paper.
Related Work
A. Speech Restoration
Speech restoration is the task of obtaining high-quality speech signals from degraded speech signals [6], [9], [10], [11], [12], [13], [14]; it is a joint task involving bandwidth extension [15], [16], dereverberation [17], [18], denoising [19], [20], and declipping [21], [22]. Several studies have investigated speech restoration methods using a neural vocoder [23], [24] in pursuit of robust and generalizable speech restoration. A previous study [6] separately trained one module for estimating clean acoustic features from degraded speech and another for synthesizing clean speech waveforms from these features. Other studies [25], [26] proposed a vocoder-based speech restoration method that incorporates semantic speech, textual, and speaker representations, which successfully restored a text-to-speech corpus [27] with high robustness and quality. However, all these methods require artificial training data generated by introducing various acoustic distortions into clean speech corpora. Textual features are also often unavailable for the restoration of historical speech, as transcriptions are not accessible and the writing systems may differ from those of modern languages.
Historical speech is typically characterized by a wide range of acoustic distortions about which information is usually unavailable. Thus, training with artificially generated paired data can lead to a domain-shift problem between rule-based distortions [6] and actual historical speech. To address this, we propose a method for learning speech restoration models from historical speech and unpaired data via self-supervised learning. This method does not use additional inputs related to transcripts, speaker characteristics, or acoustic distortions. It also enables audio effect transfer, in which distortion features extracted from historical audio signals are added to arbitrary audio signals. We also propose a method for improving speech quality using a perceptual loss, which is also applicable to previous supervised approaches with paired data.
B. Self-Supervised Representation Learning for Speech Processing
To exploit diverse untranscribed speech data, self-supervised speech representation learning methods [28], [29], [30] have been widely studied. With such methods, a pretrained model is applied to various downstream tasks [31], [32] and significantly improves their performance. Several studies have applied self-supervised representation learning to speech enhancement or restoration [25], [33], [34], [35]. Unlike these previous approaches, our approach simulates the generation process of historical audio resources to train the model only on unpaired data. Similar to our method, the DDSP autoencoder [36] learns disentangled features from the acoustic signal in a self-supervised manner and manipulates each feature. However, it uses a simple sinusoidal vocoder with filtered noise components and assumes only reverberation as the acoustic distortion, whereas our method uses a more expressive waveform synthesis model to achieve high-quality speech restoration and a channel module to capture various acoustic distortions.
C. Speech Coding
Analysis-by-synthesis approaches, such as the one used in this study, have been developed in the field of speech coding. These approaches achieve a compressed representation of speech by integrating encoders and decoders that mimic the human vocalization process, thus reducing redundancies in the speech signal. Prior to the advent of neural networks, low bit-rate speech coding [37], [38], [39] had been extensively studied, albeit with limited reconstruction capabilities. Advances in neural vocoders have led to numerous neural speech coding approaches [40], [41], enabling codecs with superior reconstruction quality for speech [43] and acoustic signals [44], [45]. These discrete codecs allow speech signals to be handled by language models [46], [47], heralding further advances. Our proposed method is designed to autonomously derive speech representations from degraded speech signals by simulating both the human vocalization process and the recording processes of degraded speech using neural networks.
Proposed Method
In this section, we describe the proposed self-supervised speech restoration method. As shown in Fig. 1, our method uses real historical audio resources. § III-A describes the basic framework, which simulates the generation process of historical speech data. § III-B presents our proposed dual-learning method that leverages unpaired high-quality speech data. § III-C presents the training objective based on perceptual speech quality. § III-D and § III-E describe the semi-supervised learning and supervised pretraining frameworks, respectively, both of which use simulated paired data. In § III-F, we present the audio effect transfer of historical audio characteristics using our method.
A. Basic Speech Restoration Framework
We begin with an overview of the recording process of degraded speech that underlies our method. High-quality speech (i.e., undistorted speech) is emitted from the mouth through the human speech production process, which can be parameterized by time-variant speech features such as traditional source-filter vocoder features or mel spectrograms. Acoustic distortions (e.g., the non-linear response of recording equipment and lossy audio coding) are then added to the high-quality speech, resulting in the final recorded audio. We assume that the distortions (i.e., channel features) are time-invariant, that is, the recording equipment and audio coding do not change within each audio sample.
The proposed speech restoration model consists of analysis, synthesis, and channel modules based on the above process, as shown in Fig. 2 (top). All of these modules are composed of neural networks, so the model can be constructed in an end-to-end manner. Let $Y_{\mathrm{low}}$ denote the input degraded speech, $\hat{Z}_{\mathrm{res}}$ the estimated time-variant speech features, $\hat{c}$ the estimated time-invariant channel features, $\hat{W}_{\mathrm{res}}$ the restored speech waveform, and $\hat{W}_{\mathrm{low}}$ the reconstructed degraded waveform. The three modules are formulated as \begin{equation*} \{\hat {Z}_{\mathrm {res}}, \hat c \}= \text {Analysis}(Y_{\mathrm {low}}; \theta _{\mathrm {ana}}), \tag{1}\end{equation*}
\begin{equation*} \hat {W}_{\mathrm {res}} = \text {Synthesis}(\hat {Z}_{\mathrm {res}}; \theta _{\mathrm {syn}}), \tag{2}\end{equation*}
\begin{equation*} \hat {W}_{\mathrm {low}} = \text {Channel}(\hat {W}_{\mathrm {res}}, \hat c; \theta _{\mathrm {chn}}), \tag{3}\end{equation*}
where $\theta_{\mathrm{ana}}$, $\theta_{\mathrm{syn}}$, and $\theta_{\mathrm{chn}}$ denote the parameters of the analysis, synthesis, and channel modules, respectively. The model is trained in a self-supervised manner by minimizing the spectral reconstruction loss between the input degraded speech and the reconstructed speech,
\begin{equation*} \mathcal {L}_{\mathrm {recons}} = \sum _{i} \{ ||S_{i} - \hat {S}_{i}||_{1} + \alpha ||\log S_{i} - \log \hat {S}_{i}||_{1} \}, \tag{4}\end{equation*}
where $S_{i}$ and $\hat{S}_{i}$ are the $i$-th spectral representations of the input degraded speech and the reconstructed speech, respectively, and $\alpha$ is a weighting coefficient. The analysis and channel modules are optimized as
\begin{equation*} \{ \hat {\theta }_{\mathrm {ana}}, \hat {\theta }_{\mathrm {chn}} \} = \mathop {\mathrm {arg min}}\limits _{\theta _{\mathrm {ana}}, \theta _{\mathrm {chn}}} \mathcal {L}_{\mathrm {recons}}, \tag{5}\end{equation*}
while the synthesis module is a pretrained neural vocoder (see § IV-B) whose parameters $\theta_{\mathrm{syn}}$ are not updated in Eq. (5).
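For concreteness, the following PyTorch-style sketch shows one training step of this basic framework. The module objects (analysis, synthesis, channel), the optimizer, and the STFT resolutions are placeholders and illustrative assumptions, not the configuration of our public implementation.

```python
import torch

def spectral_l1(w_ref, w_hat, n_ffts=(512, 1024, 2048), alpha=1.0, eps=1e-5):
    """Spectral reconstruction loss in the spirit of Eq. (4): L1 on linear and
    log magnitude spectra, summed over several STFT resolutions (a sketch)."""
    loss = 0.0
    for n_fft in n_ffts:
        win = torch.hann_window(n_fft, device=w_ref.device)
        s_ref = torch.stft(w_ref, n_fft, hop_length=n_fft // 4,
                           window=win, return_complex=True).abs() + eps
        s_hat = torch.stft(w_hat, n_fft, hop_length=n_fft // 4,
                           window=win, return_complex=True).abs() + eps
        loss = loss + (s_ref - s_hat).abs().mean() \
                    + alpha * (s_ref.log() - s_hat.log()).abs().mean()
    return loss

def basic_train_step(y_low, w_low, analysis, synthesis, channel, optimizer):
    """One self-supervised step (Eqs. (1)-(5)). `synthesis` is a pretrained
    vocoder whose parameters are frozen but kept differentiable, so gradients
    reach the analysis module; the optimizer holds theta_ana and theta_chn."""
    z_res, c = analysis(y_low)            # Eq. (1): speech and channel features
    w_res = synthesis(z_res)              # Eq. (2): restored waveform
    w_hat_low = channel(w_res, c)         # Eq. (3): reconstructed degraded waveform
    loss = spectral_l1(w_low, w_hat_low)  # Eq. (4)
    optimizer.zero_grad()
    loss.backward()                       # Eq. (5): update analysis and channel
    optimizer.step()
    return loss.item()
```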
Proposed self-supervised speech restoration method, consisting of analysis, synthesis, and channel modules that simulate recording process of degraded speech. Basic training process (top) minimizes reconstruction loss between degraded and reconstructed speech. We also propose dual learning (top & bottom) with arbitrary high-quality speech corpora.
As in Eq. (1), the analysis module jointly estimates the disentangled features: time-variant speech features and time-invariant channel features. In this paper, we also investigate the use of an additional reference encoder to extract the channel features. We use a reference encoder based on global style tokens [49], which were used for acoustic noise modeling in the original paper. In contrast to Eq. (1), the use of the reference encoder is formulated as \begin{align*} \hat {Z}_{\mathrm {res}} &= \text {Analysis}(Y_{\mathrm {low}}; \theta _{\mathrm {ana}}), \tag{6}\\ \hat c &= \text {RefEnc}(Y_{\mathrm {low}}; \theta _{\mathrm {enc}}), \tag{7}\end{align*}
where $\theta_{\mathrm{enc}}$ denotes the reference encoder parameters, which are optimized jointly with the analysis and channel modules:
\begin{equation*} \{ \hat {\theta }_{\mathrm {ana}}, \hat {\theta }_{\mathrm {chn}}, \hat {\theta }_{\mathrm {enc}} \} = \mathop {\mathrm {arg min}}\limits _{\theta _{\mathrm {ana}}, \theta _{\mathrm {chn}}, \theta _{\mathrm {enc}}} \mathcal {L}_{\mathrm {recons}}. \tag{8}\end{equation*}
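The reference encoder in Eq. (7) can be realized, for example, with a GST-style architecture. The sketch below is a minimal illustration; the channel widths, embedding size, and number of style tokens are assumptions and do not reflect the exact configuration in [49] or in our model.

```python
import torch
import torch.nn as nn

class RefEncoder(nn.Module):
    """GST-style reference encoder (a sketch): 2-D conv stack -> GRU ->
    attention over learned style tokens -> channel embedding c."""
    def __init__(self, n_mels=80, d_model=128, n_tokens=10):
        super().__init__()
        chans = [1, 32, 32, 64, 64]
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1),
                          nn.BatchNorm2d(chans[i + 1]), nn.ReLU())
            for i in range(len(chans) - 1)])
        freq_out = n_mels // (2 ** (len(chans) - 1))  # mel bins after striding
        self.gru = nn.GRU(chans[-1] * freq_out, d_model, batch_first=True)
        self.tokens = nn.Parameter(torch.randn(n_tokens, d_model))

    def forward(self, y_low):                   # y_low: (B, n_mels, frames)
        h = self.convs(y_low.unsqueeze(1))      # (B, C, n_mels', frames')
        B, C, F, T = h.shape
        h = h.permute(0, 3, 1, 2).reshape(B, T, C * F)
        _, last = self.gru(h)                   # last hidden state: (1, B, d_model)
        q = last.squeeze(0)                     # utterance-level query
        attn = torch.softmax(q @ self.tokens.t(), dim=-1)
        return attn @ self.tokens               # channel embedding c: (B, d_model)
```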
Fig. 3 shows the spectrograms obtained with the proposed method. The input low-quality speech is simulated on the basis of the
Spectrograms obtained with proposed method. While original degraded speech signal lacks high-frequency band and has quantization noise in low-frequency band, our model estimates spectrogram that is similar to ground-truth speech. Our model reconstructs original degraded speech through channel module.
The above basic self-supervised training process often fails to disentangle the undistorted speech features and the channel features, which makes the training unstable.
B. Dual Learning for Stable Self-Supervised Learning
As mentioned in § III-A, we propose a dual-learning method to address the training instability of the basic training process. Fig. 2 shows the proposed dual-learning method. In addition to the basic framework described in § III-A, we introduce a training task that propagates information in the backward direction. We denote the training tasks in the forward and backward directions as $\mathcal{T}_{\mathrm{forward}}$ and $\mathcal{T}_{\mathrm{backward}}$, respectively.
In $\mathcal{T}_{\mathrm{backward}}$, we use arbitrary high-quality speech, which is not aligned with the degraded speech and consists only of other speakers' utterances. Let $X^{\prime}_{\mathrm{high}}$ denote a high-quality speech waveform from this unpaired corpus and $Z^{\prime}_{\mathrm{high}}$ its ground-truth speech features. The channel module first generates pseudo-degraded speech by adding the acoustic distortion represented by the channel features $c$: \begin{equation*} \hat {X}^{\prime }_{\mathrm {low}} = \text {Channel}(X^{\prime }_{\mathrm {high}}, c; \theta _{\mathrm {chn}}), \tag{9}\end{equation*}
The analysis module then estimates the speech features from the pseudo-degraded speech:
\begin{equation*} \hat {Z}^{\prime }_{\mathrm {res}} = \text {Analysis}(\hat {Y}^{\prime }_{\mathrm {low}}), \tag{10}\end{equation*}
The feature loss between the estimated and ground-truth speech features is defined as
\begin{equation*} \mathcal {L}_{\mathrm {feats}} = ||\hat {Z}^{\prime }_{\mathrm {res}} - Z^{\prime }_{\mathrm {high}}||_{p}, \tag{11}\end{equation*}
where $||\cdot||_{p}$ denotes the $L_{p}$ norm, and the analysis module is optimized as
\begin{equation*} \hat {\theta }_{\mathrm {ana}} = \mathop {\mathrm {arg min}}\limits _{\theta _{\mathrm {ana}}} \mathcal {L}_{\mathrm {feats}}. \tag{12}\end{equation*}
The overall dual-learning objective combines the two tasks as
\begin{equation*} \mathcal {L}_{\mathrm {dual}} = \beta \cdot \mathcal {L}_{\mathrm {recons}} + (1 - \beta) \cdot \mathcal {L}_{\mathrm {feats}}, \tag{13}\end{equation*}
where $\beta$ is a weighting coefficient.
$\mathcal{T}_{\mathrm{forward}}$ obtains the analysis module that adapts to real historical speech and the channel module that expresses the acoustic distortion. $\mathcal{T}_{\mathrm{backward}}$ obtains the analysis module that estimates high-quality speech features.
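The backward task can be sketched as follows; `to_features` stands for whatever front-end produces the analysis-module input (e.g., a spectrogram extractor), and the function names are placeholders rather than the actual implementation.

```python
import torch

def backward_task_loss(x_high, z_high, c, analysis, channel, to_features, p=1):
    """T_backward (Eqs. (9)-(12)), a sketch. x_high: unpaired high-quality
    waveform; z_high: its ground-truth speech features; c: channel features."""
    with torch.no_grad():                      # channel module is not updated here
        x_low = channel(x_high, c)             # Eq. (9): pseudo-degraded speech
    z_hat, _ = analysis(to_features(x_low))    # Eq. (10)
    return torch.linalg.vector_norm(z_hat - z_high, ord=p)   # Eq. (11)

def dual_loss(l_recons, l_feats, beta=0.5):
    """Eq. (13): beta is a placeholder weight."""
    return beta * l_recons + (1.0 - beta) * l_feats
```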
C. Training Objective Based on Perceptual Speech Quality
To further improve the performance of our speech restoration model, we introduce a loss function that is based on perceptual speech quality. Traditionally, synthetic speech has been evaluated by human listeners using subjective measures such as mean opinion score (MOS) tests. Automatic speech quality evaluation methods [51], [52] that predict the MOS values of the input speech have actively been studied. Therefore, we incorporate a loss function designed to improve the pseudo-MOS, which should improve the perceptual quality of our speech restoration model.
Let $\text{SQAModel}(\cdot; \theta_{\mathrm{sqa}})$ denote a pretrained automatic speech quality assessment model with parameters $\theta_{\mathrm{sqa}}$ that predicts a pseudo-MOS from an input waveform. We define the perceptual loss as the negative predicted score of the restored speech $\hat{W}_{\mathrm{res}}$: \begin{equation*} \mathcal {L}_{\mathrm {percep}} = - \text {SQAModel}(\hat {W}_{\mathrm {res}}; \theta _{\mathrm {sqa}}), \tag{14}\end{equation*}
which is added to the training objective.
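A minimal sketch of this objective is shown below, assuming sqa_model is a differentiable, pretrained MOS predictor; the specific model and loss weight are not fixed by this sketch.

```python
def perceptual_loss(w_res, sqa_model):
    """Eq. (14), a sketch: negated pseudo-MOS of the restored waveform."""
    for p in sqa_model.parameters():
        p.requires_grad_(False)       # keep the quality predictor fixed
    return -sqa_model(w_res).mean()
```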
D. Extending to Semi-Supervised Learning
As shown in Fig. 1, the previous supervised learning approach [6] generates paired data by randomly applying various acoustic distortions to clean speech corpora. Despite the domain mismatch with real historical audio resources inherent in this approach, the use of clean speech as the target ensures more stable training, resulting in high-quality restored speech for in-domain data. Therefore, we introduce a hybrid framework that uses both the previous supervised method and the self-supervised learning method described in § III-B. It is a multi-task semi-supervised learning framework that integrates supervised learning on simulated paired data with self-supervised learning on real historical speech data.
We basically follow the algorithm presented in a previous study [6] to generate simulated paired data; Algorithm 1 shows the procedure used in our method. Let $W^{\prime\prime}_{\mathrm{high}}$ denote a high-quality speech waveform and $W^{\prime\prime}_{\mathrm{sim}}$ the simulated degraded speech waveform obtained by applying Algorithm 1 to $W^{\prime\prime}_{\mathrm{high}}$.
Algorithm 1 Generation of Simulated Paired Data
Input: high-quality speech waveform $W^{\prime\prime}_{\mathrm{high}}$
Output: simulated degraded speech waveform $W^{\prime\prime}_{\mathrm{sim}}$
Acoustic distortions are applied to the input at random, following [6].
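As an illustration of such a random degradation pipeline, the sketch below applies band limitation and clipping with torchaudio; the application probabilities and parameter ranges are illustrative assumptions and do not reproduce the exact branching of Algorithm 1 or of [6].

```python
import random
import torch
import torchaudio.functional as AF

def simulate_degradation(w_high, sr=22050, p_band=0.5, p_clip=0.5):
    """Random degradation in the spirit of Algorithm 1 (a sketch)."""
    w = w_high
    if random.random() < p_band:                      # band limitation
        cutoff = random.uniform(1000.0, 4000.0)       # assumed range (Hz)
        w = AF.lowpass_biquad(w, sr, cutoff_freq=cutoff, Q=1.0)
    if random.random() < p_clip:                      # clipping
        thr = random.uniform(0.1, 0.5)                # assumed range
        w = torch.clamp(w, min=-thr, max=thr)
    return w                                          # simulated degraded speech
```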
Let $Y^{\prime\prime}_{\mathrm{sim}}$ denote the simulated degraded speech input to the analysis module and $Z^{\prime\prime}_{\mathrm{high}}$ the ground-truth speech features extracted from $W^{\prime\prime}_{\mathrm{high}}$. The supervised loss is the feature loss between the estimated and ground-truth speech features: \begin{align*} \hat {Z}^{\prime \prime }_{\mathrm {res}} &= \text {Analysis}(Y^{\prime \prime }_{\mathrm {sim}}; \theta _{\mathrm {ana}}), \tag{15}\\ \mathcal {L}_{\mathrm {supervised}} &= \mathcal {L}_{\mathrm {feats}}(\hat {Z}^{\prime \prime }_{\mathrm {res}}, Z^{\prime \prime }_{\mathrm {high}}) \tag{16}\\ & = || \hat {Z}^{\prime \prime }_{\mathrm {res}} - Z^{\prime \prime }_{\mathrm {high}} ||_{p}, \tag{17}\end{align*}
The overall semi-supervised objective combines the dual-learning loss with the supervised loss:
\begin{equation*} \mathcal {L}_{\mathrm {semi}} = \mathcal {L}_{\mathrm {dual}} + \gamma \cdot \mathcal {L}_{\mathrm {supervised}}, \tag{18}\end{equation*}
where $\gamma$ is a weighting coefficient.
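The combined objective can be sketched as follows, reusing the notation above; analysis returns both speech and channel features as in Eq. (1), and gamma is a placeholder weight.

```python
import torch

def supervised_loss(y_sim, z_high, analysis, p=1):
    """Eqs. (15)-(17), a sketch: Lp feature loss on a simulated pair."""
    z_hat, _ = analysis(y_sim)
    return torch.linalg.vector_norm(z_hat - z_high, ord=p)

def semi_supervised_loss(l_dual, l_supervised, gamma=1.0):
    """Eq. (18): dual-learning loss on real data plus weighted supervised loss."""
    return l_dual + gamma * l_supervised
```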
E. Supervised Pretraining
Inspired by curriculum learning [55], we pretrain the analysis and channel modules in a supervised manner before training our model with the methods described in § III-A to § III-D. We train the modules on simulated paired data to obtain the initial parameters for self-supervised (§ III-B) or semi-supervised (§ III-D) learning. Using the notation of § III-D, the channel module is trained to reproduce the simulated degradation and the analysis module is trained to recover the ground-truth speech features: \begin{align*} \hat {W}^{\prime \prime }_{\mathrm {low}} &= \text {Channel}(W^{\prime \prime }_{\mathrm {high}}; \theta _{\mathrm {chn}}), \tag{19}\\ \mathcal {L}_{\mathrm {pre}} &= \kappa \cdot \mathcal {L}_{\mathrm {recons}}(\hat {W}^{\prime \prime }_{\mathrm {low}}, W^{\prime \prime }_{\mathrm {sim}}) \tag{20}\\ &+ (1-\kappa) \cdot \mathcal {L}_{\mathrm {feats}}(\hat {Z}^{\prime \prime }_{\mathrm {res}}, Z^{\prime \prime }_{\mathrm {high}}) \tag{21}\end{align*}
where $\kappa$ is a weighting coefficient.
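A sketch of one pretraining step is given below, reusing the spectral_l1 helper from the sketch in § III-A; kappa and the optimizer setup are placeholders. Following Eq. (19), the channel module is called without explicit channel features in this sketch.

```python
def pretraining_step(w_high, w_sim, y_sim, z_high, analysis, channel,
                     optimizer, kappa=0.5):
    """Supervised pretraining (Eqs. (19)-(21)), a sketch."""
    w_hat_low = channel(w_high)                    # Eq. (19): reproduce distortion
    l_recons = spectral_l1(w_sim, w_hat_low)       # Eq. (20), cf. Eq. (4)
    z_hat, _ = analysis(y_sim)
    l_feats = (z_hat - z_high).abs().mean()        # L1 feature loss
    loss = kappa * l_recons + (1.0 - kappa) * l_feats  # Eq. (21)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```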
F. Audio Effect Transfer with Learned Channel Representation
Our frameworks described in § III-A to § III-E are used to train the analysis, synthesis, and channel modules for speech restoration. In addition to the speech restoration task, the obtained model can also be used for audio effect transfer, which Fig. 4 illustrates. Let $W_{\mathrm{high}}$ denote an arbitrary high-quality speech waveform and $c$ the channel features extracted from a degraded (e.g., historical) speech signal. The transferred speech $\hat{W}_{\mathrm{trans}}$ is obtained as \begin{equation*} \hat {W}_{\mathrm {trans}} = \text {Channel}(W_{\mathrm {high}}, c; \theta _{\mathrm {chn}}), \tag{22}\end{equation*}
so that $\hat{W}_{\mathrm{trans}}$ has the acoustic characteristics of the degraded speech while retaining the content of $W_{\mathrm{high}}$.
Proposed audio effect transfer, in which only channel features are extracted from historical speech and applied to arbitrary input speech so that transferred speech has audio effects like historical speech.
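At inference time, audio effect transfer reduces to extracting channel features from a degraded recording and feeding them to the channel module together with arbitrary clean speech; a minimal sketch with placeholder module objects is shown below.

```python
import torch

@torch.no_grad()
def transfer_audio_effect(w_high, y_degraded, analysis, channel):
    """Eq. (22), a sketch: apply the channel characteristics of a degraded
    (e.g., historical) recording to an arbitrary high-quality waveform."""
    _, c = analysis(y_degraded)   # channel features of the degraded recording
    return channel(w_high, c)     # transferred speech W_trans
```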
Experimental Evaluations
A. Dataset
To evaluate our speech restoration framework, we used three types of datasets: simulated datasets, a single-domain real dataset, and a multi-domain real dataset. We used the simulated datasets to access ground-truth speech for the evaluation and to investigate each type of acoustic distortion. The single-domain real dataset, which contains a limited amount of data within a single domain, was used to investigate few-shot domain adaptation to real historical speech data. An advantage of our self-supervised approach is that it can be trained on a large amount of degraded speech resources. Therefore, we applied our method to the multi-domain real dataset to investigate its performance in modeling historical speech across multiple domains simultaneously.
We created the simulated datasets by applying different types of acoustic distortions to the JSUT corpus [56], which consists of approximately six hours of high-quality speech utterances from a Japanese female speaker. We randomly selected 25 sentences for each of the validation and test sets. We created the datasets by applying the following four types of distortions: (a) Band-Limit: we applied biquad lowpass filtering to the high-quality speech with a cutoff frequency of 1 kHz and a Q-factor of 1.0. (b) Clip: we clipped the high-quality speech waveform with an absolute-value threshold of 0.25. (c)
For the evaluation discussed in § IV-D, we created the single-domain real dataset from a Japanese historical speech resource [58]. This dataset was recorded in the 1960s–1970s and consists of nine speakers narrating folktales, with a total duration of approximately 22 minutes. The analog audio clips recorded on cassette tapes were digitized using a radio cassette player. As this data contained a large amount of additive noise, iZotope RX9 was used for preliminary noise reduction as preprocessing. If the additive noise were very high, it would have a dominant effect on the MOS values of the original recordings. Since our goal is to synthesize restored speech with higher quality and intelligibility, rather than to deal with additive noise that can be removed by simple denoising, we reduced the noise level to some extent by this preprocessing. Note that Original in § IV-D refers to the data after this preprocessing. We randomly selected approximately 2- and 4-minute audio clips for the validation and test sets, respectively.
For the multi-domain real dataset, we used a compilation of ten different domains. This dataset consists of Japanese historical speech resources, including the Tohoku Folktale Corpus, the CPJD Corpus [4], the tri-jek Corpus, and the single-domain real dataset [58]. We simply used the single-domain dataset described in the previous paragraph as one of the resources in the multi-domain real dataset. For the rest of the resources, however, we did not apply denoising and used the original speech data. By not using external noise reduction for most of the domains, we aimed to evaluate the robustness of our model to different types of real acoustic distortions. The cumulative duration of the training set was 8.9 hours. The validation and test sets contained 3.8 and 12.3 minutes of data, respectively.
We used the JVS corpus [56] for
B. Experimental Settings
a: Speech Feature Extraction
We set the sampling rate of the speech waveform to 22.05 kHz. As described in § III-A, we used mel spectrograms and source-filter features for the restored speech features $Z_{\mathrm{res}}$.
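As a concrete example of such a front-end, a mel-spectrogram extractor at 22.05 kHz can be set up with torchaudio as below; the FFT size, hop length, and number of mel bands are typical values chosen for illustration, not necessarily the analysis conditions used in our experiments.

```python
import torchaudio

mel_extractor = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80)

def extract_log_mel(waveform):
    """waveform: (batch, samples) at 22.05 kHz -> log-mel spectrogram."""
    return mel_extractor(waveform).clamp(min=1e-5).log()
```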
b: Model Details.
For the analysis and channel modules described in § III, we used U-Net architectures [61] based on 1-dimensional and 2-dimensional convolution layers, respectively. In each downsampling step of the U-Net, the temporal resolution is halved with four layers of residual convolution blocks followed by average pooling. Each residual convolution block consists of convolution layers and batch normalization layers [62] with a skip connection [63]. During upsampling, the temporal resolution is doubled by applying deconvolution, resulting in time-variant high-quality speech features with the same temporal resolution as the input features. As described in § III-A, we used HiFi-GAN [48] for the synthesis module. When using mel spectrograms for
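The residual convolution block and downsampling stage described above can be sketched as follows for the 1-dimensional case; the channel count and kernel size are illustrative assumptions, not the exact configuration of our model.

```python
import torch
import torch.nn as nn

class ResConvBlock(nn.Module):
    """Residual 1-D convolution block with batch normalization and a skip
    connection (a sketch)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels))
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.body(x) + x)        # skip connection

class DownStage(nn.Module):
    """One U-Net downsampling stage: four residual blocks followed by
    average pooling that halves the temporal resolution."""
    def __init__(self, channels):
        super().__init__()
        self.blocks = nn.Sequential(*[ResConvBlock(channels) for _ in range(4)])
        self.pool = nn.AvgPool1d(kernel_size=2)

    def forward(self, x):                        # x: (B, C, T)
        return self.pool(self.blocks(x))         # -> (B, C, T // 2)
```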
c: Training Settings.
The batch size was set to 4 for all training cases in our evaluations. We used
C. Evaluation on Speech Restoration with Simulated Data
We first evaluated our method using the simulated datasets described in § IV-A. Since the ground-truth speech was available in this evaluation, we used mel cepstral distortion (MCD) [65] as an objective metric of speech quality. We also used an automatic speech quality assessment model (referred to as NISQA) [66]. We conducted MOS tests with 40 native Japanese evaluators for each distortion setting.
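For reference, MCD between time-aligned mel-cepstral coefficient sequences is computed with the standard formula sketched below; the coefficient extraction itself (and any alignment) is assumed to be done beforehand.

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_hat):
    """Frame-averaged MCD in dB. mc_ref, mc_hat: (frames, dims) mel-cepstral
    coefficients with the 0th (energy) coefficient excluded."""
    diff = mc_ref - mc_hat
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```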
We compared our method described in § III with a previous supervised method [6], referred to as Supervised. It should be noted that there is another emerging speech restoration method [25], described in § II, which requires text transcripts; since transcriptions are generally unavailable for historical speech, we only compared our method with Supervised. We trained the supervised model in the same manner as the supervised pretraining (§ III-E) with the dataset described in § IV-A. We also compared different variants of our proposed method. Self denotes the self-supervised method described in § III-B, where Self-Mel and Self-SF use mel spectrograms and source-filter features for $Z_{\mathrm{res}}$, respectively.
We first observed that both the previous supervised approach and our proposed method achieved better speech quality than the original degraded speech, indicating that they successfully performed speech restoration. When comparing Self-Mel and Self-SF, Self-Mel showed better results for almost all metrics, suggesting that mel spectrograms are preferable to source-filter features for the hidden features $Z_{\mathrm{res}}$.
D. Evaluation on Speech Restoration with Single-Domain Real Data
As described in § IV-A, we conducted an evaluation using the single-domain real historical speech dataset. Unlike the evaluation using simulated data presented in § IV-C, we did not have access to the ground-truth speech. Therefore, we conducted the evaluations using NISQA and MOS. We also applied the supervised pretraining described in § III-E to both our self-supervised (Self-Pretrain) and semi-supervised (Semi-Pretrain) models. Table 3 lists the results. The notation of each method is also shown in Fig. 1 and described in § III.
As shown in Table 3(a), we observed that the proposed self-supervised and semi-supervised models outperformed Supervised, whereas Supervised often outperformed our methods on the simulated data as described in § IV-C. This suggests that Supervised suffers from a domain-shift problem on real historical speech resources, while our methods achieve better performance on real data. The supervised pretraining described in § III-E also improved the performance of both the self-supervised and semi-supervised models. As in the evaluation presented in § IV-C, the perceptual loss improved the performance. Overall, the self-supervised model incorporating the supervised pretraining and perceptual loss showed the best performance in terms of both NISQA and MOS.
To further validate the results, we also conducted preference AB tests. Forty native Japanese evaluators were recruited for each of the AB test evaluations. We compared Self-Mel-Pretrain-Ploss — the best of our methods in the objective metrics — with Original and Supervised. We can see that the proposed method significantly improved the speech quality of the original degraded speech and outperformed Supervised.
Fig. 5 shows the spectrograms of the original historical speech and the restored speech. As shown in Fig. 5(b), our self-supervised learning method restored the original speech, producing an undistorted spectrogram in the lower-frequency band and extending the bandwidth. However, it struggles to reconstruct the missing frequency band because it does not use paired data. As shown in Fig. 5(c), our semi-supervised learning method better handled the missing frequency band and achieved better bandwidth extension by using paired data. While Supervised, shown in Fig. 5(d), successfully extended the bandwidth, it produced slightly more distorted spectrograms than our methods shown in Figs. 5(b) and 5(c).
Visualizations of spectrograms. (b) While self-supervised method restored original speech, it failed to reconstruct the missing frequency band. (c) Semi-supervised learning method better handled the missing frequency band and achieved better bandwidth extension. (d) While supervised method successfully extended bandwidth, it showed more distorted spectrograms in lower-frequency band.
E. Evaluation on Speech Restoration with Multi-Domain Real Data
We conducted this evaluation using the multi-domain real historical speech dataset described in § IV-A. Table 4 lists the results. The notation of each method is also shown in Fig. 1 and described in § III.
Unlike in the evaluation with single-domain real data described in § IV-D, our self-supervised learning methods could not handle the multi-domain historical speech data, as shown in Table 4(a). This may be because they cannot accurately capture the channel features when trained on the multi-domain real historical dataset, resulting in poorer disentanglement of undistorted speech features and channel features. In contrast, our semi-supervised learning stabilized the training with the help of some paired data, resulting in better restored speech quality. By applying the supervised pretraining described in § III-E and the perceptual loss described in § III-C to our semi-supervised learning method, it performed the best among all compared methods. These results suggest the feasibility of future large-scale semi-supervised learning using various existing historical speech resources.
To further validate the effectiveness of the proposed method, we conducted preference AB tests with 40 native Japanese evaluators, as shown in Table 4(b). We compared Semi-Mel-Pretrain-Ploss — the best proposed method in the objective evaluation results — with Original and Supervised. We observed it significantly improved the speech quality of the original degraded speech and outperformed Supervised.
F. Evaluation on Audio Effect Transfer
We evaluated the audio effect transfer using our proposed method described in § III-F. We conducted the evaluation using both the simulated and real data described in § IV-A.
The results indicate that the proposed method is closer to the target channel features than the original high-quality speech on both simulated and real data. Regarding the results on the simulated data, the SMOS was higher than that of the case corrected by the mean amplitude spectral difference. Although the score of Proposed was around 0.8 lower than the reference, it showed a higher SMOS than that of High-quality.
Conclusion
We proposed a self-supervised speech restoration method for historical audio resources. Our framework, which consists of analysis, synthesis, and channel modules, can be trained in a self-supervised manner on real degraded speech data. We also proposed a dual-learning method for our self-supervised learning, which facilitates stable training by leveraging unpaired high-quality speech data. Our framework further includes a supervised pretraining method, a semi-supervised extension, and a perceptual loss, which can further improve the performance on real historical speech resources. Experimental evaluations showed that our approach performed significantly better than a previous supervised approach on real historical speech resources.
a: Limitation and Future Work
Our work still has some limitations. Our self-supervised method cannot handle multiple channel features (i.e., multi-domain data) and requires semi-supervised learning. We will need to improve the modeling of the channel features for better self-supervised learning. Our training data also contained only a small amount of real data (less than 10 hours). For future work, we plan to construct a general speech restoration model trained on large-scale existing historical speech resources.