
SelfRemaster: Self-Supervised Speech Restoration for Historical Audio Resources




Abstract:

Restoring high-quality speech from degraded historical recordings is crucial for the preservation of cultural and endangered linguistic resources. A key challenge in this task is the scarcity of paired training data that replicate the original acoustic conditions of the historical audio. While previous approaches have used pseudo paired data generated by applying various distortions to clean speech corpora, their limitations stem from the inability to authentically simulate the acoustic variations in historical recordings. We propose a self-supervised approach to speech restoration that does not require paired corpora. Our model has three main modules: analysis, synthesis, and channel modules, all of which are designed to emulate the recording process of degraded audio signals. The analysis module disentangles undistorted speech and distortion features, and the synthesis module generates the restored speech waveform. The channel module then introduces distortions into the speech waveform to compute the reconstruction loss between the input and output degraded speech signals. We further improve our model by introducing several methods including dual learning and semi-supervised learning. An additional feature of our model is the audio effect transfer, which allows acoustic distortions from degraded audio signals to be applied to arbitrary audio signals. Experimental evaluations demonstrated that our approach significantly outperforms the previous supervised approach for the restoration of real historical speech resources.
Published in: IEEE Access (Volume: 11)
Page(s): 144831 - 144843
Date of Publication: 19 December 2023
Electronic ISSN: 2169-3536

This work is licensed under a Creative Commons Attribution 4.0 License (CC BY); IEEE is not the copyright holder of this material. For more information, see https://creativecommons.org/licenses/by/4.0/
SECTION I.

Introduction

Speech data used in current speech processing research are typically recorded with sophisticated digital recording equipment in carefully designed environments. However, other speech resources, such as those recorded with outdated or low-quality audio equipment [1], [2] or the voices of endangered languages [3], [4], [5], are affected by various acoustic distortions. Speech restoration, which aims to generate high-quality speech from such degraded resources, has been studied to make these resources usable. In particular, the restoration of historical audio resources [6], [7] is crucial to preserve cultural and linguistic resources in danger of being lost. A key challenge in this task is that it is nearly impossible to replicate the original speakers, recording equipment, and age-related degradation, which complicates the collection of training data for speech restoration models. A previous study proposed a general speech restoration model that generates high-quality speech from various types of degraded speech [6]. This model is trained in a supervised manner on artificial paired data generated by randomly applying different acoustic distortions to high-quality speech corpora. However, this simple data generation scheme does not adequately represent the acoustic distortions of real historical speech, so its performance on real historical speech is hampered by a domain shift problem.

We propose a self-supervised speech restoration method that requires no paired speech data. The model consists of analysis, synthesis, and channel modules, all of which are designed to emulate the recording process of degraded audio signals. The analysis module extracts time-variant undistorted speech features and time-invariant acoustic distortion features (hereafter, channel features) from degraded speech. The other modules execute the distorted-speech-generation process: the synthesis module synthesizes undistorted speech (i.e., restored speech), and the channel module adds acoustic distortion to the restored speech. The model is trained in a self-supervised manner by minimizing the reconstruction loss between the degraded input and the reconstructed speech. Fig. 1 illustrates a comparison between our self-supervised learning approach and the previous supervised learning approach [6]. The previous approach generates simulated paired data by randomly introducing a variety of acoustic distortions into high-quality speech corpora. In contrast, our self-supervised approach uses real historical speech resources to train the speech restoration model. As shown in Fig. 1, we also propose a dual learning method using unpaired high-quality data, which helps with the disentanglement of undistorted speech signals and acoustic distortions.

FIGURE 1. Overview of this paper. The previous speech restoration approach used artificial paired training data created by introducing various acoustic distortions, which can suffer from domain-shift problems on real historical audio resources. Our model is trained on real historical speech data in a self-supervised manner by emulating the recording process of historical speech. We also propose a dual-learning method to stabilize the training process, a perceptual loss using an automatic speech quality assessment model, supervised pretraining with artificial data, and semi-supervised learning to scale up training. Our model also enables audio effect transfer, in which channel features are extracted from historical speech signals and added to arbitrary audio signals. Notations used in § IV are indicated in italics (e.g., "Self-*").

We also introduce several methods to improve the performance of our speech restoration model. We extend our method with a semi-supervised learning framework, a hybrid method that combines our self-supervised learning approach with the conventional supervised learning method. This ensures high accuracy in speech feature prediction while enabling domain adaptation to historical speech. We also use a perceptual loss with an external automatic speech quality assessment model to improve the perceptual quality of the restored speech. Finally, we introduce supervised pretraining, which uses artificial paired data to pretrain the analysis and channel modules and provides initial model parameters for self-supervised or semi-supervised learning. Table 1 compares variants of our approach with previous supervised approaches. While the previous supervised approach [6] requires simulated paired data generated from clean speech corpora, our method can use real historical audio resources during training. Our model also allows for audio effect transfer, in which only acoustic distortions are extracted from degraded speech and applied to arbitrary high-quality audio. Experimental evaluations showed that our method achieved significantly higher quality speech restoration than the previous supervised method. The implementation1 and audio samples2 are publicly available. The contributions of this paper are as follows:

  • We propose a self-supervised speech restoration approach that can learn various acoustic distortions without paired speech corpora and is applicable to real degraded speech data.

  • Our method achieves significantly higher quality speech restoration than the previous supervised method, especially on real historical audio resources.

  • Our model enables audio effect transfer, in which only channel features are extracted from degraded speech and applied to other high-quality speech.

TABLE 1 Comparison Between Our Proposed Methods and Previous Supervised Approaches for Speech Restoration

This paper is an extended full paper version of our earlier work [8], supplemented by several contributions:

  • We propose a perceptual training objective described in § III-C and a semi-supervised learning framework described in § III-D. We verified the effectiveness of both frameworks on real historical speech datasets.

  • We conducted additional evaluations using a mixture of different historical speech resources. The application of self-supervised learning to such a variety of existing historical speech resources is a crucial experiment in the context of practical applications.

This paper is organized as follows. § II provides the background and related work. § III outlines the basic framework of our proposed self-supervised speech restoration method, along with various strategies to improve its performance. § IV presents an empirical evaluation that demonstrated the effectiveness of the proposed method. Finally, § V presents conclusions and a summary of the paper.

SECTION II.

Related Work

A. Speech Restoration

Speech restoration is the task of obtaining high-quality speech signals from degraded speech signals [6], [9], [10], [11], [12], [13], [14]; it is a joint task involving bandwidth extension [15], [16], dereverberation [17], [18], denoising [19], [20], and declipping [21], [22]. Several studies have investigated speech restoration methods using a neural vocoder [23], [24] in pursuit of robust and generalizable speech restoration. A previous study [6] separately trained one module for estimating clean acoustic features from degraded speech and another for synthesizing clean speech waveforms from these clean speech features. Other studies [25], [26] proposed a vocoder-based speech restoration method that incorporates semantic speech, textual, and speaker representations, which successfully restored a text-to-speech corpus [27] with high robustness and quality. However, all these methods require artificial training data generated by introducing various acoustic distortions into clean speech corpora. Textual features are also often unavailable for the restoration of historical speech, as transcriptions are not accessible and the writing systems may differ from those of modern languages.

Historical speech is typically characterized by a wide range of acoustic distortions, information about which is usually unknown. Thus, training with artificially generated paired data can lead to a domain shift problem between rule-based distortions [6] and actual historical speech. To address this, we propose a method for learning speech restoration models from historical speech and unpaired data via self-supervised learning. This method does not use additional inputs related to transcripts, speaker characteristics, or acoustic distortions. It also enables audio effect transfer, which adds the distortion features extracted from historical audio signals to arbitrary audio signals. We also propose a method for improving speech quality using a perceptual loss, which is also applicable to previous supervised approaches with paired data.

B. Self-Supervised Representation Learning for Speech Processing

To use diverse untranscribed speech data, self-supervised speech representation learning methods [28], [29], [30] have been widely studied. With such methods, a pretrained model is applied to various downstream tasks [31], [32], significantly improving performance. Several studies have applied self-supervised representation learning to speech enhancement or restoration [25], [33], [34], [35]. Unlike these previous approaches, our approach simulates the generation process of historical audio resources to train the model only on unpaired data. Similar to our method, the DDSP autoencoder [36] learns disentangled features from the acoustic signal in a self-supervised manner and manipulates each feature. However, it uses a simple sinusoidal vocoder with filtered noise components and assumes only reverberation as the acoustic distortion, whereas our method uses a more expressive waveform synthesis model to achieve high-quality speech restoration and a channel module to capture various acoustic distortions.

C. Speech Coding

Analysis-by-synthesis approaches, such as the one used in this study, have been developed in the field of speech coding. These approaches achieve a compressed representation of speech by integrating encoders and decoders that mimic the human vocalization process, thus reducing redundancies in the speech signal. Prior to the advent of neural networks, low bit-rate speech coding [37], [38], [39] had been extensively studied, albeit with limited reconstruction capabilities. Advances in neural vocoders have led to numerous neural speech coding approaches [40], [41], enabling codecs with superior reconstruction quality for speech [43] and acoustic signals [44], [45]. These discrete codecs allow speech signals to be modeled with language models [46], [47], heralding further advances. Our proposed method is designed to autonomously derive speech representations from degraded speech signals by simulating both human vocalization and the recording process of degraded speech using neural networks.

SECTION III.

Proposed Method

In this section, we describe the proposed self-supervised speech restoration method.3 As shown in Fig. 1, our method uses real historical audio resources. § III-A describes the basic framework simulating the generation process of historical speech data. In § III-B, we present our proposed dual-learning method that leverages unpaired high-quality speech data. § III-C presents the training objective based on perceptual speech quality. § III-D and § III-E describe the semi-supervised learning and supervised pretraining frameworks, respectively, which use simulated paired data. In § III-F, we present the audio effect transfer of historical audio characteristics using our method.

A. Basic Speech Restoration Framework

We begin with an overview of the recording process of degraded speech underlying our method. High-quality speech (i.e., undistorted speech) is emitted from the mouth through the human speech production process, which can be parameterized by time-variant speech features such as traditional source-filter vocoder features or mel spectrograms. Acoustic distortions (e.g., the non-linear response of recording equipment and lossy audio coding) are then added to the high-quality speech, resulting in the final recorded audio. We assume that the distortions (i.e., channel features) are time-invariant, meaning that the recording equipment and audio coding do not change within each audio sample.

The proposed speech restoration model consists of analysis, synthesis, and channel modules that are based on the above process, as shown in Fig. 2 (top). All of these modules are composed of neural networks, so the model can be constructed in an end-to-end manner. Let W_{\mathrm {low}} = (w_{\mathrm {low}, t} \in \mathbb {R} | t=1, \cdots, T_{w}) and Y_{\mathrm {low}} = ({y}_{\mathrm {low}, t} \in \mathbb {R}^{D} | t=1, \cdots, T_{y}) denote the speech waveform and speech feature sequence of the input degraded speech, respectively. The feature extraction is written as Y_{\mathrm {low}} = f_{\mathrm {Feature}}^{Y}(W_{\mathrm {low}}) , as shown in Fig. 2. Note that we use a mel spectrogram for Y_{\mathrm {low}} . Let Z_{\mathrm {res}} = ({z}_{\mathrm {res}, t} \in \mathbb {R} | t=1, \cdots, T_{y}) and c denote the speech feature sequence of the restored speech and the time-invariant channel features, respectively. We use two types of features for Z_{\mathrm {res}} : mel spectrograms and source-filter features. Mel spectrograms contain rich acoustic information but are easily entangled with distortion features. For the source-filter features, we use F_{0} and the mel cepstrum, which aim to disentangle speech features from non-speech features but may suffer from lower resynthesis performance due to the limited acoustic information. We compare these features in the evaluation described in § IV, where the proposed methods using mel spectrograms and source-filter features are denoted as "*-Mel*" and "*-SF*", respectively. The operation of the analysis module can be written as \begin{equation*} \{\hat {Z}_{\mathrm {res}}, \hat c \}= \text {Analysis}(Y_{\mathrm {low}}; \theta _{\mathrm {ana}}), \tag{1}\end{equation*}

where \theta _{\mathrm {ana}} denotes the model parameters of the analysis module. The synthesis module simulates human speech production and generates the speech waveform. Let W_{\mathrm {res}} = (w_{\mathrm {res}, t} \in \mathbb {R} | t=1, \cdots, T_{w}^{\prime }) denote the speech waveform generated by the synthesis module. The operation can be written as \begin{equation*} \hat {W}_{\mathrm {res}} = \text {Synthesis}(\hat {Z}_{\mathrm {res}}; \theta _{\mathrm {syn}}), \tag{2}\end{equation*}
where \theta _{\mathrm {syn}} denotes the model parameters of the synthesis module. We use HiFi-GAN [48] for the synthesis module, which is trained on an arbitrary high-quality speech corpus, and its model parameters are frozen. The channel module simulates the acoustic distortion; it is conditioned on the channel features c and distorts the restored speech \hat {W}_{\mathrm {res}} to estimate the input degraded speech W_{\mathrm {low}} . The operation of the channel module can be written as \begin{equation*} \hat {W}_{\mathrm {low}} = \text {Channel}(\hat {W}_{\mathrm {res}}, c; \theta _{\mathrm {chn}}), \tag{3}\end{equation*}
where \theta _{\mathrm {chn}} denotes the model parameters of the channel module. The model is trained to minimize a frame-level reconstruction loss, defined as a multi-scale spectral loss [36]. Let S_{i} = ({s}_{i, t} \in \mathbb {R} | t=1, \cdots, T_{s, i}) denote the amplitude spectrogram sequence obtained from W_{\mathrm {low}} . We then define the reconstruction loss as \begin{equation*} \mathcal {L}_{\mathrm {recons}} = \sum _{i} \{ ||S_{i} - \hat {S}_{i}||_{1} + \alpha ||\log S_{i} - \log \hat {S}_{i}||_{1} \}, \tag{4}\end{equation*}
where \hat {S}_{i} denotes the amplitude spectrograms obtained from \hat {W}_{\mathrm {low}} . The subscript i indexes the window length of the short-time Fourier transform and \alpha is the weight of the log term; in our experiments we used i \in \{2048, 1024, 512, 256, 128, 64\} and \alpha =1.0 . The analysis and channel modules are then trained as \begin{equation*} \{ \hat {\theta }_{\mathrm {ana}}, \hat {\theta }_{\mathrm {chn}} \} = \mathop {\mathrm {arg min}}\limits _{\theta _{\mathrm {ana}}, \theta _{\mathrm {chn}}} \mathcal {L}_{\mathrm {recons}}. \tag{5}\end{equation*}
At inference time, we obtain the restored speech waveform \hat {W}_{\mathrm {res}} by running only the analysis and synthesis modules.
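As an illustration, the multi-scale spectral loss in Eq. (4) can be sketched as follows in PyTorch; this is a minimal sketch rather than the released implementation, and the hop lengths (one quarter of each window) are an assumption, since only the window lengths and \alpha are specified in the paper.

import torch

def multiscale_spectral_loss(w_hat, w_ref,
                             fft_sizes=(2048, 1024, 512, 256, 128, 64),
                             alpha=1.0, eps=1e-7):
    # Eq. (4): L1 distance between amplitude spectrograms plus a weighted
    # log-amplitude L1 term, summed over several STFT resolutions.
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=w_ref.device)
        s_ref = torch.stft(w_ref, n_fft, hop_length=n_fft // 4,
                           window=window, return_complex=True).abs()
        s_hat = torch.stft(w_hat, n_fft, hop_length=n_fft // 4,
                           window=window, return_complex=True).abs()
        loss = loss + (s_ref - s_hat).abs().mean() \
                    + alpha * (torch.log(s_ref + eps) - torch.log(s_hat + eps)).abs().mean()
    return loss

The mean over time-frequency bins is used here in place of the plain L1 sum for numerical convenience; either choice only rescales the loss.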

FIGURE 2. Proposed self-supervised speech restoration method, consisting of analysis, synthesis, and channel modules that simulate the recording process of degraded speech. The basic training process (top) minimizes the reconstruction loss between the degraded and reconstructed speech. We also propose dual learning (top & bottom) with arbitrary high-quality speech corpora.

As in Eq. (1), the analysis module jointly estimates the disentangled features: time-variant speech features and time-invariant channel features. In this paper, we also investigate the use of an additional reference encoder to extract the channel features. We use a reference encoder based on global style tokens [49], which was used for acoustic noise modeling in the original paper. In contrast to Eq. (1), the use of the reference encoder can be formulated as \begin{align*} \hat {Z}_{\mathrm {res}} &= \text {Analysis}(Y_{\mathrm {low}}; \theta _{\mathrm {ana}}), \tag{6}\\ \hat c &= \text {RefEnc}(Y_{\mathrm {low}}; \theta _{\mathrm {enc}}), \tag{7}\end{align*}
where \theta _{\mathrm {enc}} denotes the model parameters of the reference encoder. Instead of Eq. (5), the modules are then trained as \begin{equation*} \{ \hat {\theta }_{\mathrm {ana}}, \hat {\theta }_{\mathrm {chn}}, \hat {\theta }_{\mathrm {enc}} \} = \mathop {\mathrm {arg min}}\limits _{\theta _{\mathrm {ana}}, \theta _{\mathrm {chn}}, \theta _{\mathrm {enc}}} \mathcal {L}_{\mathrm {recons}}. \tag{8}\end{equation*}
In § IV-C, we compare both schemes, formulated as Eq. (5) and Eq. (8). Note that, in § IV-C, the proposed method with the reference encoder is indicated with the notation "*-RefEnc*".

Fig. 3 shows the spectrograms obtained with the proposed method. The input low-quality speech is simulated on the basis of the \mu -Law quantization and resampling described in § IV-A; it is missing the high-frequency bands present in the ground-truth high-quality speech and has distorted low-frequency bands. In the restored speech, the missing and distorted bands are recovered. Furthermore, the reconstructed speech output by the channel module faithfully reproduces the input speech.

FIGURE 3. Spectrograms obtained with the proposed method. While the original degraded speech signal lacks the high-frequency band and has quantization noise in the low-frequency band, our model estimates a spectrogram that is similar to the ground-truth speech. Our model reconstructs the original degraded speech through the channel module.

The above basic self-supervised training process often fails to disentangle Z_{\mathrm {res}} and c because each module has high expressive power and the modules are trained only with the reconstruction loss between the degraded and reconstructed waveforms. Specifically, the analysis module can represent the effect of the channel module, so there is no guarantee that the analysis module outputs the speech features of high-quality speech. To address this, we propose the dual-learning method described in § III-B.

B. Dual Learning for Stable Self-Supervised Learning

As mentioned in § III-A, we propose a dual-learning method to address the training instability of the basic training process. Fig. 2 shows the proposed dual-learning method. In addition to the basic framework described in § III-A, we introduce a training task that propagates information in the backward direction. We denote the two training tasks in the forward direction and backward direction as \mathcal {T}_{\mathrm {Forward}} and \mathcal {T}_{\mathrm {Backward}} , respectively. \mathcal {T}_{\mathrm {Forward}} is the basic learning framework described in § III-A; degraded speech is sent to the analysis module to reconstruct the degraded waveform. In \mathcal {T}_{\mathrm {Backward}} , high-quality speech is input to the channel module to estimate the features of high-quality speech. Our method is analogous to the forward and backward learning between two domains in machine translation [50].

We use arbitrary high-quality speech that is not aligned with the degraded speech and consists only of other speakers' utterances. Let X^{\prime }_{\mathrm {high}} denote this high-quality speech waveform, which is unrelated to the input degraded speech used for \mathcal {T}_{\mathrm {Forward}} . First, X^{\prime }_{\mathrm {high}} and the channel features c are passed through the channel module as \begin{equation*} \hat {X}^{\prime }_{\mathrm {low}} = \text {Channel}(X^{\prime }_{\mathrm {high}}, c; \theta _{\mathrm {chn}}), \tag{9}\end{equation*}
where \hat {X}^{\prime }_{\mathrm {low}} denotes the degraded speech waveform with the channel features added. The speech features \hat {Y}^{\prime }_{\mathrm {low}} extracted from \hat {X}^{\prime }_{\mathrm {low}} are then input to the analysis module as \begin{equation*} \hat {Z}^{\prime }_{\mathrm {res}} = \text {Analysis}(\hat {Y}^{\prime }_{\mathrm {low}}; \theta _{\mathrm {ana}}), \tag{10}\end{equation*}
where \hat {Z}^{\prime }_{\mathrm {res}} are the estimated speech features of the restored speech. A feature loss is defined as the L1 or L2 norm between \hat {Z}^{\prime }_{\mathrm {res}} and the speech features Z^{\prime }_{\mathrm {high}} obtained from the high-quality speech:\begin{equation*} \mathcal {L}_{\mathrm {feats}} = ||\hat {Z}^{\prime }_{\mathrm {res}} - Z^{\prime }_{\mathrm {high}}||_{p}, \tag{11}\end{equation*}
where p \in \{1, 2\} . The gradient from the feature loss is not propagated to the channel module. Therefore, the feature loss is used to train only the analysis module as \begin{equation*} \hat {\theta }_{\mathrm {ana}} = \mathop {\mathrm {arg min}}\limits _{\theta _{\mathrm {ana}}} \mathcal {L}_{\mathrm {feats}}. \tag{12}\end{equation*}
The overall loss function is given as the weighted sum of the reconstruction loss and the feature loss:\begin{equation*} \mathcal {L}_{\mathrm {dual}} = \beta \cdot \mathcal {L}_{\mathrm {recons}} + (1 - \beta) \cdot \mathcal {L}_{\mathrm {feats}}, \tag{13}\end{equation*}
where \beta balances the two terms. Intuitively, \mathcal {T}_{\mathrm {Forward}} and \mathcal {T}_{\mathrm {Backward}} can be interpreted as follows:

  • \mathcal {T}_{\mathrm {Forward}} trains the analysis module to adapt to real historical speech and the channel module to express the acoustic distortion.

  • \mathcal {T}_{\mathrm {Backward}} trains the analysis module to estimate high-quality speech features.

This dual learning method enables stable training of the analysis and channel modules without paired data. In § IV, the proposed method with the basic framework (§ III-A) and the dual-learning framework is indicated with the notation: “Self-*”.
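To make the interaction of the two tasks concrete, the following PyTorch-style sketch shows one dual-learning training step under the losses above. The objects analysis, synthesis, channel, feature_extract, and multiscale_spectral_loss are placeholders for the modules and functions introduced earlier, not the authors' exact interfaces; the detach() call reflects the statement that the feature-loss gradient is not propagated to the channel module.

import torch

def dual_learning_step(w_low, w_high_prime, z_high_prime, beta=0.001):
    # T_Forward: degraded speech -> (restored features, channel features)
    #            -> restored waveform -> reconstructed degraded waveform.
    y_low = feature_extract(w_low)                       # mel spectrogram of the degraded input
    z_res, c = analysis(y_low)                           # Eq. (1)
    w_res = synthesis(z_res)                             # Eq. (2), frozen HiFi-GAN
    w_low_hat = channel(w_res, c)                        # Eq. (3)
    loss_recons = multiscale_spectral_loss(w_low_hat, w_low)    # Eq. (4)

    # T_Backward: unpaired high-quality speech -> channel -> analysis -> feature loss.
    w_low_prime = channel(w_high_prime, c).detach()      # Eq. (9); stop gradients into the channel module
    z_res_prime, _ = analysis(feature_extract(w_low_prime))     # Eq. (10)
    loss_feats = torch.mean((z_res_prime - z_high_prime) ** 2)  # Eq. (11) with p = 2

    return beta * loss_recons + (1.0 - beta) * loss_feats       # Eq. (13)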

C. Training Objective Based on Perceptual Speech Quality

To further improve the performance of our speech restoration model, we introduce a loss function that is based on perceptual speech quality. Traditionally, synthetic speech has been evaluated by human listeners using subjective measures such as mean opinion score (MOS) tests. Automatic speech quality evaluation methods [51], [52] that predict the MOS values of the input speech have actively been studied. Therefore, we incorporate a loss function designed to improve the pseudo-MOS, which should improve the perceptual quality of our speech restoration model.

Let \mathcal {L}_{\mathrm {percep}} denote the perceptual loss defined with a pretrained speech quality assessment model. For the speech quality assessment model (referred to as SQAModel), we leverage UTMOS [53], a high-performance model trained on the VoiceMOS Challenge 2022 [54] dataset. Then \mathcal {L}_{\mathrm {percep}} is defined as \begin{equation*} \mathcal {L}_{\mathrm {percep}} = - \text {SQAModel}(\hat {W}_{\mathrm {res}}; \theta _{\mathrm {sqa}}), \tag{14}\end{equation*}
where \theta _{\mathrm {sqa}} denotes the pretrained model parameters of the speech quality assessment model. The loss term in Eq. (14) is added, with a weighting term, to the overall training loss in Eq. (13). As shown in Fig. 1, in § IV, the proposed method using the perceptual loss is indicated with the notation "*-PLoss*".
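A rough sketch of this objective in PyTorch is given below; sqa_model stands for a frozen UTMOS-like quality predictor that maps a waveform to a pseudo-MOS, and its exact interface is an assumption rather than the released UTMOS API.

import torch

def perceptual_loss(w_res, sqa_model):
    # Eq. (14): the negative pseudo-MOS of the restored waveform.
    # The assessment model itself is frozen; gradients still flow back to w_res.
    for p in sqa_model.parameters():
        p.requires_grad_(False)
    return -sqa_model(w_res).mean()

# The term is added to Eq. (13) or Eq. (18) with a small weight (0.1 in § IV-B), e.g.
# loss = loss_dual + 0.1 * perceptual_loss(w_res, sqa_model)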

D. Extending to Semi-Supervised Learning

As shown in Fig. 1, the previous supervised learning approach [6] generates paired data by randomly applying various acoustic distortions to clean speech corpora. Despite the domain mismatch with real historical audio resources inherent in this approach, the use of clean speech as the target ensures more stable training, resulting in high-quality restored speech for in-domain data. Therefore, we introduce a hybrid framework that uses both the previous supervised method and the self-supervised learning method described in § III-B. It is a multi-task semi-supervised learning framework that integrates supervised learning on simulated paired data with self-supervised learning on real historical speech data.

We basically follow the algorithm presented in a previous study [6] to generate simulated paired data; Algorithm 1 shows the entire procedure used in our method, and a Python sketch of it is given after the listing. Let W^{\prime \prime }_{\mathrm {high}} , W^{\prime \prime }_{\mathrm {noise}} , and W^{\prime \prime }_{\mathrm {sim}} denote the speech waveform sampled from the high-quality speech corpora, a randomly sampled noise waveform, and the simulated degraded speech waveform, respectively. Let \mathcal {U}(\cdot) denote the uniform distribution, which can return reals, integers, or categories depending on the context. We first apply clipping with probability P_{1} , where \text {Max}(\cdot) and \text {Min}(\cdot) denote the element-wise maximum and minimum operations over the speech samples t . The clipping is carried out on the basis of the threshold \eta , which is sampled between the lowest threshold H_{\mathrm {low}} and the highest threshold H_{\mathrm {high}} . Let P_{2} and P_{3} denote the probabilities of applying a low-pass filter to W^{\prime \prime }_{\mathrm {high}} and W^{\prime \prime }_{\mathrm {noise}} , respectively. Let \text {FilterCandidate}() and \text {BuildFilter}(\cdot) denote the functions that return the available types of low-pass filters and construct a filter with the randomly sampled parameters, respectively. Here, t_{\mathrm {f}} , c_{\mathrm {f}} , and o_{\mathrm {f}} denote the filter type, the cutoff frequency, and the order of the low-pass filter, respectively. Note that C_{\mathrm {low}} and C_{\mathrm {high}} denote the lowest and highest thresholds of the cutoff frequency, while O_{\mathrm {low}} and O_{\mathrm {high}} denote the lowest and highest thresholds of the order. Finally, we add noise with probability P_{4} . Let r_{\mathrm {SN}} denote a randomly sampled signal-to-noise ratio for the noise addition, where R_{\mathrm {low}} and R_{\mathrm {high}} are the lowest and highest thresholds. Let \text {Mean}(\cdot) and \text {Abs}(\cdot) denote the operations that take the mean and absolute values over the speech samples t . Note that we use \text {Mean}(\text {Abs}(\cdot)) to approximate the normalization of the noise power, as in the previous study [6]. We give the parameter values used in Algorithm 1 in § IV-B.

Algorithm 1 Generation of Simulated Paired Data

Input: High-quality speech waveform W^{\prime \prime }_{\mathrm {high}} and randomly sampled noise waveform W^{\prime \prime }_{\mathrm {noise}} .
Output: Simulated degraded speech waveform W^{\prime \prime }_{\mathrm {sim}} .

1: W^{\prime \prime }_{\mathrm {sim}} = W^{\prime \prime }_{\mathrm {high}}
2: p_{1} \sim \mathcal {U}(0, 1)
3: if p_{1} < P_{1} then
4:   \eta \sim \mathcal {U}(H_{\mathrm {low}}, H_{\mathrm {high}})
5:   W^{\prime \prime }_{\mathrm {sim}} = \text {Max}(\text {Min}(W^{\prime \prime }_{\mathrm {sim}}, \eta), -\eta)
6: end if
7: p_{2} \sim \mathcal {U}(0, 1)
8: p_{3} \sim \mathcal {U}(0, 1)
9: if p_{2} < P_{2} then
10:   t_{\mathrm {f}} \sim \mathcal {U}(\text {FilterCandidate}())
11:   c_{\mathrm {f}} \sim \mathcal {U}(C_{\mathrm {low}}, C_{\mathrm {high}})
12:   o_{\mathrm {f}} \sim \mathcal {U}(O_{\mathrm {low}}, O_{\mathrm {high}})
13:   W^{\prime \prime }_{\mathrm {sim}} = W^{\prime \prime }_{\mathrm {sim}} * \text {BuildFilter}(t_{\mathrm {f}}, c_{\mathrm {f}}, o_{\mathrm {f}})
14:   if p_{3} < P_{3} then
15:     W^{\prime \prime }_{\mathrm {noise}} = W^{\prime \prime }_{\mathrm {noise}} * \text {BuildFilter}(t_{\mathrm {f}}, c_{\mathrm {f}}, o_{\mathrm {f}})
16:   end if
17: end if
18: p_{4} \sim \mathcal {U}(0, 1)
19: if p_{4} < P_{4} then
20:   r_{\mathrm {SN}} \sim \mathcal {U}(R_{\mathrm {low}}, R_{\mathrm {high}})
21:   W^{\prime \prime }_{\mathrm {noise}} = W^{\prime \prime }_{\mathrm {noise}} \cdot \frac {\text {Mean}(\text {Abs}(W^{\prime \prime }_{\mathrm {sim}}))}{\text {Mean}(\text {Abs}(W^{\prime \prime }_{\mathrm {noise}}))}
22:   W^{\prime \prime }_{\mathrm {sim}} = W^{\prime \prime }_{\mathrm {sim}} + W^{\prime \prime }_{\mathrm {noise}} / 10^{r_{\mathrm {SN}}/20}
23: end if
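The NumPy/SciPy sketch below mirrors Algorithm 1 with the parameter ranges given later in § IV-B; it is illustrative rather than the authors' implementation, and only a Butterworth low-pass filter is shown, whereas the paper samples the filter type from a larger set.

import numpy as np
from scipy.signal import butter, lfilter

def simulate_degradation(w_high, w_noise, sr=22050,
                         P=(0.25, 0.50, 0.50, 0.50),
                         H=(0.06, 0.9), C=(850, 11025), O=(2, 10), R=(-5, 40)):
    # Generate a simulated degraded waveform from clean speech and noise (Algorithm 1).
    rng = np.random.default_rng()
    w_sim = w_high.copy()

    # Clipping with probability P1 at a random threshold eta (steps 2-6).
    if rng.uniform() < P[0]:
        eta = rng.uniform(*H)
        w_sim = np.clip(w_sim, -eta, eta)

    # Low-pass filtering with probability P2; the same filter is applied to the
    # noise with probability P3 (steps 7-17).
    if rng.uniform() < P[1]:
        cutoff = min(rng.uniform(*C), sr / 2 - 1)   # keep the cutoff below Nyquist
        order = int(rng.integers(O[0], O[1] + 1))
        b, a = butter(order, cutoff, btype="low", fs=sr)
        w_sim = lfilter(b, a, w_sim)
        if rng.uniform() < P[2]:
            w_noise = lfilter(b, a, w_noise)

    # Noise addition with probability P4 at a random SNR (steps 18-23).
    if rng.uniform() < P[3]:
        snr_db = rng.uniform(*R)
        w_noise = w_noise * (np.mean(np.abs(w_sim)) / (np.mean(np.abs(w_noise)) + 1e-8))
        w_sim = w_sim + w_noise / (10.0 ** (snr_db / 20.0))

    return w_sim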

Let f(\cdot) denote the function used for the data generation, which follows Algorithm 1, so that W^{\prime \prime }_{\mathrm {sim}} = f(W^{\prime \prime }_{\mathrm {high}}) . We then define the supervised feature loss \mathcal {L}_{\mathrm {supervised}} as \begin{align*} \hat {Z}^{\prime \prime }_{\mathrm {res}} &= \text {Analysis}(Y^{\prime \prime }_{\mathrm {sim}}; \theta _{\mathrm {ana}}), \tag{15}\\ \mathcal {L}_{\mathrm {supervised}} &= \mathcal {L}_{\mathrm {feats}}(\hat {Z}^{\prime \prime }_{\mathrm {res}}, Z^{\prime \prime }_{\mathrm {high}}) \tag{16}\\ &= || \hat {Z}^{\prime \prime }_{\mathrm {res}} - Z^{\prime \prime }_{\mathrm {high}} ||_{p}, \tag{17}\end{align*}
where Y^{\prime \prime }_{\mathrm {sim}} , \hat {Z}^{\prime \prime }_{\mathrm {res}} , and Z^{\prime \prime }_{\mathrm {high}} denote the degraded, restored, and high-quality speech features, respectively. Note that Y^{\prime \prime }_{\mathrm {sim}} and Z^{\prime \prime }_{\mathrm {high}} can be computed by applying feature extraction to W^{\prime \prime }_{\mathrm {sim}} and W^{\prime \prime }_{\mathrm {high}} , respectively. The training objective for semi-supervised learning can then be defined as \begin{equation*} \mathcal {L}_{\mathrm {semi}} = \mathcal {L}_{\mathrm {dual}} + \gamma \cdot \mathcal {L}_{\mathrm {supervised}}, \tag{18}\end{equation*}
where \gamma denotes the weighting term for the supervised loss. As shown in Fig. 1, in § IV, the proposed method with the semi-supervised framework is indicated with the notation "Semi-*".

E. Supervised Pretraining

Inspired by curriculum learning [55], we pretrain the analysis and channel modules in a supervised manner before training our model with the methods described in § III-A to § III-D. We train the modules using simulated paired data to obtain the initial parameters for the self-supervised (§ III-B) or semi-supervised (§ III-D) learning. Let \mathcal {L}_{\mathrm {pre}} denote the training objective used for the supervised pretraining. As mentioned in § III-D, we create the simulated paired data on the basis of W^{\prime \prime }_{\mathrm {sim}} = f(W^{\prime \prime }_{\mathrm {high}}) , where f(\cdot) is the function defined by Algorithm 1. We then define the training objective \mathcal {L}_{\mathrm {pre}} as \begin{align*} \hat {W}^{\prime \prime }_{\mathrm {low}} &= \text {Channel}(W^{\prime \prime }_{\mathrm {high}}; \theta _{\mathrm {chn}}), \tag{19}\\ \mathcal {L}_{\mathrm {pre}} &= \kappa \cdot \mathcal {L}_{\mathrm {recons}}(\hat {W}^{\prime \prime }_{\mathrm {low}}, W^{\prime \prime }_{\mathrm {sim}}) \tag{20}\\ &\quad + (1-\kappa) \cdot \mathcal {L}_{\mathrm {feats}}(\hat {Z}^{\prime \prime }_{\mathrm {res}}, Z^{\prime \prime }_{\mathrm {high}}), \tag{21}\end{align*}
where \hat {Z}^{\prime \prime }_{\mathrm {res}} is computed using Eq. (15). Unlike the reconstruction loss described in § III-A, the reconstruction loss used in Eq. (20) is defined without the synthesis module, so the waveforms compared in the loss are precisely time-aligned. We use the pretrained parameters of the analysis and channel modules, \hat {\theta }_{\mathrm {ana}} and \hat {\theta }_{\mathrm {chn}} , as the initial parameters for the self-supervised learning described in § III-B or the semi-supervised learning presented in § III-D. As shown in Fig. 1, in § IV, the proposed method with the supervised pretraining framework is indicated with the notation "*-Pretrain*".

F. Audio Effect Transfer with Learned Channel Representation

Our frameworks described in § III-A to § III-E are used to train the analysis, synthesis, and channel modules for speech restoration. In addition to the speech restoration task, the obtained model can also be used for audio effect transfer, illustrated in Fig. 4. Let W_{\mathrm {high}} denote an arbitrary high-quality audio waveform. We first extract the channel features c of the degraded speech used for training, as in Eq. (1). We then add the channel features to W_{\mathrm {high}} as \begin{equation*} \hat {W}_{\mathrm {trans}} = \text {Channel}(W_{\mathrm {high}}, c; \theta _{\mathrm {chn}}), \tag{22}\end{equation*}
where \hat {W}_{\mathrm {trans}} denotes the output waveform of the audio effect transfer. In this audio effect transfer, the trained channel module conditioned on c receives arbitrary high-quality audio and distorts it so that the resulting audio sounds like the input historical speech.
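A minimal inference-time sketch of the transfer is shown below, assuming the trained modules from § III-A; analysis, channel, and feature_extract are placeholders for the corresponding modules and feature extractor, not a published interface.

import torch

@torch.no_grad()
def audio_effect_transfer(w_degraded, w_high):
    # Extract channel features from a historical recording and apply them to clean audio (Eq. (22)).
    y_low = feature_extract(w_degraded)   # mel spectrogram of the historical recording
    _, c = analysis(y_low)                # keep only the time-invariant channel features (Eq. (1))
    return channel(w_high, c)             # distort the clean audio with the learned channel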

FIGURE 4. Proposed audio effect transfer, in which only channel features are extracted from historical speech and applied to arbitrary input speech so that the transferred speech has audio effects like the historical speech.

SECTION IV.

Experimental Evaluations

A. Dataset

To evaluate our speech restoration framework, we used three types of datasets: simulated datasets, a single-domain real dataset, and a multi-domain real dataset. We used the simulated datasets so that ground-truth speech was available for evaluation and so that we could investigate each type of acoustic distortion. The single-domain real dataset, which contained a limited amount of data within a single domain, was used to investigate few-shot domain adaptation to real historical speech data. An advantage of our self-supervised approach is that it can be trained on a large amount of degraded speech resources. Therefore, we applied our method to the multi-domain real dataset to investigate its performance in modeling historical speech across multiple domains simultaneously.

We created the simulated datasets by applying different types of acoustic distortions to the JSUT corpus [56], which consists of approximately six hours of high-quality speech utterances from a Japanese female speaker. We randomly selected 25 sentences for each of the validation and test sets. We created the datasets by applying the following four types of distortions. (a) Band-Limit: we applied biquad low-pass filtering to the high-quality speech with a cutoff frequency of 1 kHz and a Q-factor of 1.0. (b) Clip: we clipped the high-quality speech waveform with an absolute-value threshold of 0.25. (c) \mu -Law: we quantized the high-quality speech using \mu -law quantization with a quantization level of 128 and resampled it to 8 kHz. We used the \mu -law-quantized signals prior to decoding in order to significantly degrade the original speech with the non-linear \mu -law transformation. (d) Noise: we added additive noise randomly sampled from the TUT Acoustic Scenes 2017 dataset [57] to the high-quality speech waveform.

For the evaluation discussed in § IV-D, we created the single-domain real dataset from a Japanese historical speech resource [58]. This dataset was recorded in the 1960s–1970s and consists of nine speakers narrating folktales, with a total duration of approximately 22 minutes. The analog audio clips recorded on cassette tape were digitized using a radio cassette player. As this data contained a large amount of additive noise, iZotope RX9 was used for preliminary noise reduction as preprocessing. If the additive noise were very high, it would have a dominant effect on the original MOS values. Since our goal is to synthesize restored speech with higher quality and intelligibility, rather than to deal with additive noise that can be removed by simple denoising, we reduced the noise level to some extent by preprocessing. Note that Original in § IV-D refers to the data after this preprocessing. We randomly selected approximately 2- and 4-minute audio clips for the validation and test sets, respectively.

For the multi-domain real dataset, we used a compilation of ten different domains. This dataset consists of Japanese historical speech resources, including the Tohoku Folktale Corpus, 4 the CPJD Corpus [4], the tri-jek Corpus,5 and the single-domain real dataset [58]. We simply used the single-domain dataset in the last paragraph as one of the resources in the multi-domain real dataset. However, for the rest of the resources in the multi-domain real dataset, we did not apply denoising and chose to use the original speech data. By not using external noise reduction for most of the domains, we wanted to evaluate the robustness of our model to different types of real acoustic distortions. The cumulative duration of the training set was 8.9 hours. The validation and test sets contained 3.8 and 12.3 minutes of data, respectively.

We used the JVS corpus [56] for \mathcal {T}_{\mathrm {backward}} of the dual-learning described in § III-B, the supervised learning for the semi-supervised learning presented in § III-D, and the supervised pretraining described in § III-E. We used the WHAM noise dataset [59] for the random noise used to generate the simulated paired data in the semi-supervised learning (§ III-D) and the supervised pretraining (§ III-E).

B. Experimental Settings

a: Speech Feature Extraction

We set the sampling rate of the speech waveform to 22.05 kHz. As described in § III-A, we used mel spectrograms and source-filter features for the restored speech features \hat {Z}_{\mathrm {res}} . The 80-dimensional mel spectrogram was extracted with a frame size of 1024 and a frame shift of 256. For the source-filter features, we used F_{0} and 41-dimensional mel cepstrum coefficients with a 5-ms frame shift, which were extracted using the WORLD vocoder [60].
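For reference, the mel-spectrogram configuration above corresponds to a torchaudio transform such as the one below; the log compression is a common convention for HiFi-GAN-style vocoders and is an assumption here, and the WORLD-based source-filter extraction is only noted in a comment.

import torch
import torchaudio

SR = 22050

# 80-dimensional mel spectrogram with a frame size of 1024 and a frame shift of 256 (§ IV-B).
mel_extractor = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR, n_fft=1024, hop_length=256, n_mels=80)

def extract_mel(waveform: torch.Tensor) -> torch.Tensor:
    # waveform: (channels, samples) at 22.05 kHz -> (channels, 80, frames).
    return torch.log(mel_extractor(waveform) + 1e-5)

# The source-filter features (F0 and a 41-dimensional mel cepstrum with a 5-ms shift)
# are extracted with the WORLD vocoder; a package such as pyworld can be used for this,
# but the exact configuration is not reproduced here.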

b: Model Details.

For the analysis and channel modules described in § III, we used U-Net architectures [61] based on 1-dimensional and 2-dimensional convolution layers, respectively. In each downsampling step of the U-Net, the temporal resolution is halved using four residual convolution blocks followed by average pooling. Each residual convolution block consists of convolution layers and batch normalization layers [62] with a skip connection [63]. During upsampling, the temporal resolution is doubled by applying deconvolution, resulting in time-variant high-quality speech features with the same temporal resolution as the input features. As described in § III-A, we used HiFi-GAN [48] for the synthesis module. When using mel spectrograms for Z_{\mathrm {res}} , we used a pretrained multi-speaker model.6 When using source-filter features for Z_{\mathrm {res}} , we trained the model with the ground-truth speech in the JVS corpus.

c: Training Settings.

The batch size was set to 4 for all the training cases in our evaluations. We used p = 2 (i.e., L2 loss) for Eq. (11), Eq. (17), and Eq. (21). We set the weight to 0.1 when adding the perceptual loss (Eq. (14)) to the dual learning loss (Eq. (13)) and the semi-supervised learning loss (Eq. (18)). In Algorithm 1, we set (P_{1}, P_{2}, P_{3}, P_{4}) = (0.25, 0.50, 0.50, 0.50) . We used (H_{\mathrm {low}}, H_{\mathrm {high}}) = (0.06, 0.9) , (C_{\mathrm {low}}, C_{\mathrm {high}}) = (850, 11025) Hz, (O_{\mathrm {low}}, O_{\mathrm {high}}) = (2, 10) , and (R_{\mathrm {low}}, R_{\mathrm {high}}) = (-5, 40) . \text {FilterCandidate}() in Algorithm 1 returned one of the Butterworth, Chebyshev Type I, Chebyshev Type II, Bessel, and elliptic filters. We set \kappa = 0.1 for the supervised pretraining described in § III-E and \beta = 0.001 for the dual learning described in § III-B. The small value of \beta reflects the difference in scale between \mathcal {L}_{\mathrm {recons}} and \mathcal {L}_{\mathrm {feats}} in Eq. (13); in our empirical results, \mathcal {L}_{\mathrm {recons}} tended to be larger than \mathcal {L}_{\mathrm {feats}} by a factor of about 100, and our preliminary study suggested that this value of \beta stabilized training. We determined the number of training steps of the proposed methods on the basis of our preliminary studies. When using the simulated data described in § IV-A, we trained the self-supervised and semi-supervised methods for 10 and 20 epochs, respectively. When using the single-domain real dataset described in § IV-A, we trained them for 50 and 100 epochs, respectively. When using the multi-domain real dataset described in § IV-A, we trained them for 25 and 50 epochs, respectively. Adam [64] was used as the optimizer with the initial learning rate set to 0.001. We applied learning rate scheduling that multiplies the learning rate by 0.5 if the validation loss does not decrease over three epochs. In the supervised pretraining described in § III-E, we trained the model for 50 epochs with the initial learning rate set to 0.005.
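The optimizer and learning-rate scheduling described above map onto standard PyTorch components as in this sketch; model_parameters, train_one_epoch, validate, and num_epochs are placeholders for the trainable analysis/channel parameters and the surrounding training loop.

import torch

optimizer = torch.optim.Adam(model_parameters, lr=1e-3)   # initial learning rate 0.001
# Halve the learning rate when the validation loss has not improved for three epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3)

for epoch in range(num_epochs):
    train_one_epoch(optimizer)      # hypothetical training loop over the chosen dataset
    val_loss = validate()           # hypothetical validation pass
    scheduler.step(val_loss)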

C. Evaluation on Speech Restoration with Simulated Data

We first evaluated our method using the simulated datasets described in § IV-A. Since the ground-truth speech was accessible in this evaluation, we used mel cepstral distortion (MCD) [65] as an objective metric of speech quality. We also used an automatic speech quality assessment model (referred to as NISQA) [66]. In addition, we conducted MOS tests with 40 native Japanese evaluators for each distortion setting.
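As a reference for the objective metric, MCD between two aligned mel-cepstrum sequences is commonly computed as in the sketch below (the standard definition; the exact alignment and dimensionality choices used in the paper are not restated here).

import numpy as np

def mel_cepstral_distortion(mc_ref: np.ndarray, mc_syn: np.ndarray) -> float:
    # MCD in dB between aligned mel-cepstrum sequences of shape (frames, dims),
    # excluding the 0th (energy) coefficient.
    diff = mc_ref[:, 1:] - mc_syn[:, 1:]
    frame_mcd = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(frame_mcd))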

We compared our method described in § III with the previous supervised method [6], referred to as Supervised. Note that there is another emerging speech restoration method [25], described in § II, which requires text transcripts that are generally unavailable for historical speech; we therefore only compared our method with Supervised. We trained the supervised model in the same manner as the supervised pretraining (§ III-E) with the dataset described in § IV-A. We compared different variants of our proposed method. Self denotes the self-supervised method described in § III-B, where Self-Mel and Self-SF use the mel spectrograms and source-filter features for Z_{\mathrm {res}} , respectively. Self-Mel-RefEnc uses the reference encoder, which extracts the channel features as formulated in Eq. (6) and (7). Self-Mel-PLoss additionally uses the perceptual loss described in § III-C. We also explored the semi-supervised learning framework described in § III-D, referred to as Semi-Mel. Some of the notations are shown in Fig. 1. Note that none of the proposed methods in this evaluation (i.e., (4)–(8) in Table 2) used the supervised pretraining described in § III-E, so the self-supervised methods (Self) were trained from scratch on the unpaired datasets. Table 2 lists the results, with the best results shown in bold. The error bars in the MOS results indicate the 95% confidence intervals.

TABLE 2 Evaluation Results with Simulated Data. Bold Indicates Best Scores. Lower is Better for MCD, and Higher is Better for NISQA and MOS. Error Bars in MOS Indicate 95% Confidence Intervals

We first observed that both the previous supervised approach and our proposed method achieved better speech quality than the original degraded speech, indicating that they successfully performed speech restoration. When comparing Self-Mel and Self-SF, Self-Mel showed better results for almost all the metrics, suggesting that mel spectrograms are a better choice than source-filter features for the hidden feature Z_{\mathrm {res}} . When comparing Self-Mel and Self-Mel-RefEnc, Self-Mel showed better results for almost all the metrics. This could be because processing degraded speech with a single analysis module promotes the disentanglement of undistorted features and channel features. On the basis of these results, we decided to use the mel spectrogram for Z_{\mathrm {res}} and not to use the reference encoder in the following evaluations. Self-Mel-PLoss improved the speech quality for many acoustic distortion types and evaluation metrics, confirming the effectiveness of the perceptual loss described in § III-C. Although our self-supervised methods in this evaluation did not use any paired data, they performed better than the supervised learning approach for some acoustic distortions and evaluation metrics. In addition, Semi-Mel outperformed the self-supervised methods in terms of MCD but did not show clear superiority in terms of NISQA or MOS.

D. Evaluation on Speech Restoration with Single-Domain Real Data

As described in § IV-A, we conducted an evaluation using a single-domain real historical speech dataset. Unlike the evaluation using simulated data presented in § IV-C, we did not have access to the ground-truth speech. Therefore, we conducted the evaluations using NISQA and MOS. We also applied the supervised pretraining described in § III-E to both our self-supervised (Self-Pretrain) and semi-supervised (Semi-Pretrain) models. Table 3 lists the results. The notation of each method is also shown in Fig. 1 and described in § III.

TABLE 3 Evaluation Results with Single-Domain Real Dataset

As shown in Table 3(a), we observed that the proposed self-supervised and semi-supervised models outperformed Supervised, while Supervised often outperformed our methods on the simulated data as described in § IV-C. This suggests that Supervised is affected by a domain-shift problem for the real historical speech resources and our methods achieved better performance on real data. The supervised pretraining described in § III-E also improved the performance of both the self-supervised and semi-supervised models. As in the evaluation presented in § IV-C, the perceptual loss improved the performance. Overall, the self-supervised model incorporating the supervised pretraining and perceptual loss showed the best performance in terms of both the NISQA and MOS.

To further validate the results, we also conducted preference AB tests. Forty native Japanese evaluators were recruited for each AB test. We compared Self-Mel-Pretrain-PLoss, the best of our methods in the objective metrics, with Original and Supervised. The results show that the proposed method significantly improved the speech quality of the original degraded speech and outperformed Supervised.

Fig. 5 shows the spectrograms of the original historical speech and the restored speech. As shown in Fig. 5(b), our self-supervised learning method restored the original speech, producing an undistorted spectrogram in the lower-frequency band and extending the bandwidth. However, it struggles to reconstruct the missing frequency band because it does not use paired data. As shown in Fig. 5(c), our semi-supervised learning method better handled the missing frequency band and achieved better bandwidth extension by using the paired data. While Supervised, shown in Fig. 5(d), successfully extended the bandwidth, it produced slightly more distorted spectrograms compared with our methods shown in Figs. 5(b) and 5(c).

FIGURE 5. Visualizations of spectrograms. (b) While the self-supervised method restored the original speech, it failed to reconstruct a missing frequency band. (c) Our semi-supervised learning method better handled the missing frequency band and achieved better bandwidth extension. (d) While the supervised method successfully extended the bandwidth, it showed more distorted spectrograms in the lower-frequency band.

E. Evaluation on Speech Restoration with Multi-Domain Real Data

We conducted this evaluation using the multi-domain real historical speech data described in § IV-A. Table 4 lists the results. The notation of each method is also shown in Fig. 1 and described in § III.

TABLE 4 Evaluation Results with Multi-Domain Real Dataset

Unlike the evaluation with single-domain real data described in § IV-D, our self-supervised learning methods could not handle the multi-domain historical speech data, as shown in Table 4(a). This may be because they cannot accurately capture the channel features when trained on the multi-domain real historical dataset, resulting in lower performance in disentangling undistorted speech features and channel features. In contrast, our semi-supervised learning stabilized the training with the help of some paired data, resulting in better restored speech quality. By applying the supervised pretraining described in § III-E and the perceptual loss described in § III-C to our semi-supervised learning method, it performed the best among all the compared methods. These results suggest the feasibility of large-scale semi-supervised learning using various existing historical speech resources in future work.

To further validate the effectiveness of the proposed method, we conducted preference AB tests with 40 native Japanese evaluators, as shown in Table 4(b). We compared Semi-Mel-Pretrain-PLoss, the best proposed method in the objective evaluation, with Original and Supervised. We observed that it significantly improved the speech quality of the original degraded speech and outperformed Supervised.

F. Evaluation on Audio Effect Transfer

We evaluated the audio effect transfer described in § III-F. We conducted the evaluation using the simulated \mu -Law dataset and the single-domain real dataset described in § IV-A, denoted Simulated and Real, respectively. The high-quality speech waveform W_{\mathrm {high}} to be modified was sampled from the JVS corpus [56]. We conducted a five-level similarity MOS (SMOS) test with 40 listeners to assess the degree to which the output speech sounded distorted in the same way as the training data. Listeners were instructed that a rating of 1 meant the speech sample had completely different acoustic distortion characteristics from the speech samples in the training data and a rating of 5 meant very similar characteristics. Reference is a sample with the same channel features as the training data. For comparison, we prepared Mean spec. diff, a method that multiplies the amplitude spectrum of the high-quality speech by the time-averaged amplitude spectrum difference between the training data and the high-quality data.

TABLE 5 Evaluation Results of Audio Effect Transfer Based on Similarity MOS. Bold Indicates Best Scores. Error Bars Indicate 95% Confidence Intervals

The results indicate that the speech produced by the proposed method is closer to the target channel characteristics than the original high-quality speech on both simulated and real data. On the simulated data, the SMOS was also higher than that of the mean-spectral-difference baseline. Although the score of Proposed was around 0.8 lower than that of Reference, it showed a higher SMOS than High-quality.

SECTION V.

Conclusion

We proposed a self-supervised speech restoration method for historical audio resources. Our framework, which consists of analysis, synthesis, and channel modules, can be trained in a self-supervised manner using real degraded speech data. We also proposed a dual-learning method for our self-supervised approach, which facilitates stable training by leveraging unpaired high-quality speech data. Our framework further includes a supervised pretraining method, a semi-supervised extension, and a perceptual loss, which can improve the performance on real historical speech resources. Experimental evaluations showed that our approach performed significantly better than a previous supervised approach on real historical speech resources.

a: Limitation and Future Work

Our work still has some limitations. Our self-supervised method cannot handle multiple channel domains on its own and requires the semi-supervised extension; we will need to improve the modeling of the channel features for better self-supervised learning. Our training data also contained only a small amount of real data (less than 10 hours). For future work, we plan to construct a general speech restoration model that is trained on large-scale existing historical speech resources.
