CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application

This study presents a deep learning-based speech signal-processing mobile application called CITISEN. CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC), allowing it to serve as a platform for utilizing and evaluating SE models and for flexibly extending the models to various noise environments and users. For SE, a pretrained SE model downloaded from the cloud server is used to effectively reduce noise components in instant or saved recordings provided by users. When unseen noise or speaker environments are encountered, the MA function is applied to improve CITISEN: a few audio samples recorded in the noisy environment are uploaded and used to adapt the pretrained SE model on the server. Finally, for BNC, CITISEN first removes the background noise through an SE model and then mixes the processed speech with a new background noise. The novel BNC function can be used to evaluate SE performance under specific conditions, to cover people's tracks, and for entertainment. The experimental results confirmed the effectiveness of the SE, MA, and BNC functions. Compared with the noisy speech signals, the enhanced speech signals achieved improvements of about 6\% and 33\%, respectively, in terms of short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ). With MA, the STOI and PESQ scores were further improved by approximately 6\% and 11\%, respectively. Finally, the BNC experiment results indicated that speech signals converted from noisy and from silent backgrounds yielded similar scene-identification accuracy and similar embeddings in an acoustic scene classification model. Therefore, the proposed BNC can effectively convert the background noise of a speech signal and can serve as a data augmentation method when clean speech signals are unavailable.


I. INTRODUCTION
In recent years, a wide variety of speech-related applications have been developed. Most of these applications are highly convenient for human-human and human-machine communications. However, a long-existing and critical issue that may limit the achievable performance of these applications remains to be solved: speech distortion caused by additive/convolutional noises and channel/device effects [1]-[6]. Identifying an effective method of addressing this distortion issue is a critical and challenging task, and numerous approaches have been proposed to this end, among which speech enhancement (SE) is notable.
The goal of SE is to transform a noisy speech signal into an enhanced speech signal with improved quality and intelligibility [7], [8]. In the past several decades, SE has been widely used as a front-end unit in many voice-based applications, such as automatic speech recognition [9], speaker identification [10], speech coding [11], hearing aids [12], and cochlear implants [13], [14]. Existing SE methods can be divided into two classes. In the first class, SE methods design a filter or function to attenuate noise components. Examples of methods in this class include the Wiener filter and its extensions [15]-[18], the minimum mean square error spectral estimator (MMSE) [19]-[21], the Karhunen-Loeve transform [22], the maximum a posteriori spectral amplitude estimator [23]-[25], the maximum likelihood spectral amplitude estimator [26], [27], linear prediction models [28], an orthogonal polynomial-based method [29], super-Gaussian-based methods [30], [31], and a hybrid of the orthogonal polynomial and super-Gaussian approaches [32]. Most SE methods of the first class share a common limitation: the inability to effectively suppress non-stationary noise signals in real-world scenarios under unexpected acoustic conditions. SE methods in the second class are based on machine-learning algorithms; these methods typically learn a model for noisy-to-clean transformation in a data-driven manner. Notable SE methods belonging to this class include hidden Markov models [33], non-negative matrix factorization [34]-[36], compressive sensing [37], and robust principal component analysis [38]. In addition, artificial neural networks (ANNs), as a successful class of machine-learning models, have been used for SE because of their powerful nonlinear transformation capability. In [39]-[42], a shallow ANN was used to map noisy speech signals to clean ones. More recently, various types of ANNs with deep structures have been used for SE (e.g., deep neural networks (DNNs) [43]-[49], deep recurrent neural networks and long short-term memory (LSTM) networks [50]-[53], convolutional neural networks (CNNs) [54]-[56], and convolutional recurrent neural networks (CRNNs) [57], [58]). In addition, [59] proposed a hybrid architecture combining a CNN with a tensor-train layer and compared the performance of DNNs and CNNs.
To improve the performance of these ANN-based approaches, several SE studies have applied a generative adversarial network (GAN) model [60]-[63]. The GAN model generates enhanced samples, and a discriminator determines whether the input follows the distribution of real clean speech signals. In addition, some researchers have applied transformer techniques to SE, in which the attention mechanism is used to capture long-term temporal correlations and extract clean components from the noisy input [64]-[68]. Moreover, instead of using a large amount of training data, transfer learning has commonly been used to enhance the generalization of models in unseen environments. For example, [69] fine-tuned the generator of a pretrained GAN-based SE model with small amounts of data and confirmed the efficiency of transfer learning. In [70], the authors proposed a teacher-student learning strategy to adapt an SE model to unlabeled noisy speech signals. Furthermore, the FA-MK-MMD approach was proposed in [71] to train a neural network model on a labeled source domain to extract a shared representation for enhancing the unlabeled input. Although the effectiveness of these SE approaches has been verified, their performance in mobile applications is yet to be confirmed.
In this study, we present a speech signal processing mobile application called CITISEN. CITISEN is a standardized SE software with a user interface that can be used as a platform for utilizing and evaluating newly developed deep-learning SE models by simply replacing the default settings with the associated model. On top of SE, two extended functions, model adaptation (MA) and background noise conversion (BNC), were also implemented in CITISEN. The MA function was built to further improve the SE performance for a specific user or under certain noise environments. The adaptation data are prepared by the users to meet their requirements, making the framework a customized tool. The BNC function converts the original background noise into another one. BNC can be used to evaluate SE performance under practical conditions, in which the residual noises in an enhanced source speech signal are combined with different background interference and affect the quality and intelligibility of the target speech signal. In addition, BNC can be used to cover people's tracks by converting the original environment noises into noises from other places when a positioning system is unavailable or not in use, whether because of limited access to the technology or by choice. Furthermore, BNC can also be used for entertainment purposes, such as adding background music or sound effects.
The contributions of this study are summarized as follows:
• To the best of our knowledge, the proposed CITISEN is the first mobile application to integrate the BNC and MA functions with SE.
• CITISEN has a user interface for performing SE on a prerecording or an instant recording. The experimental results confirmed that the SE function improves short-time objective intelligibility (STOI) [72] and perceptual evaluation of speech quality (PESQ) [73] scores.
• CITISEN has an MA function that allows users to adapt the SE models to unseen background noises or speakers. The MA function is proven to provide notable STOI and PESQ improvements compared with the results without MA.
• CITISEN provides a novel BNC function that can evaluate SE performance under specific conditions, cover people's tracks, and provide entertainment. The listening test results indicated that the BNC function could convert the background noise while maintaining the clarity and intelligibility of the converted speech signals.
• An acoustic scene classification (ASC) model was used to evaluate the BNC performance. The results showed that the new background noise could be successfully recognized. Moreover, the ASC embeddings suggested that the conversion results from a silent background were close to those from a noisy background. Therefore, the BNC function can potentially serve as a data augmentation method for the ASC model when clean speech signals are unavailable.
• By simply replacing the settings with the associated model, CITISEN can utilize and evaluate other deep learning-based SE models not described in this study. Therefore, CITISEN can effectively reduce the development interval for converting SE models into industrial applications.

The remainder of this paper is organized as follows. Section II reviews related works. Section III elaborates on the functions and user interface of CITISEN. Section IV presents the experimental setup and results. Finally, Section V provides some concluding remarks regarding this research.

II. RELATED WORKS
In this section, we first review one traditional filter-based SE method and four neural-network-based SE models used for comparison in the experiments. Then, we introduce the concept of MA.

A. TRADITIONAL GAIN FUNCTION-BASED SE METHOD
In the SE task, we generally assume that the noisy speech signal $x$ contains a clean speech signal $s$ and a noise signal $v$:
$$x = s + v. \tag{1}$$
For the MMSE SE approach [19], [74], the time-domain signal $x$ is first converted to spectral features $X$ using the short-time Fourier transform (STFT). After the STFT, Eq. 1 can be expressed as
$$X_m = S_m + V_m,$$
where $m$ denotes the $m$th frequency bin in the entire set of spectral features. By estimating the a priori and a posteriori signal-to-noise ratio (SNR) statistics with a noise-estimation approach, we can estimate a gain function $G_m$, which is applied to each noisy spectral feature to obtain the enhanced feature $\hat{S}_m = G_m X_m$.
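As a concrete instance of such a gain function, consider the Wiener gain with the decision-directed a priori SNR estimate. This is a minimal illustration consistent with [19], not the full MMSE-STSA estimator, whose closed form is more involved:
$$G_m = \frac{\xi_m}{1 + \xi_m}, \qquad \hat{\xi}_m^{(n)} = \alpha \,\frac{|\hat{S}_m^{(n-1)}|^2}{\lambda_v(m)} + (1-\alpha)\,\max\!\big(\gamma_m^{(n)} - 1,\, 0\big),$$
where $n$ indexes the frame, $\gamma_m = |X_m|^2 / \lambda_v(m)$ is the a posteriori SNR, $\lambda_v(m)$ is the estimated noise power, and $\alpha$ (typically about 0.98) is a smoothing constant.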

B. NEURAL-NETWORK-BASED SE METHOD
In this work, we used one waveform-based SE model, the fully convolutional network (FCN) [54], and three spectral-based SE models, namely, the deep denoising autoencoder (DDAE) [45], LSTM [51], and CRNN [57]. Table 1 summarizes the neural-network models used in this study. Similar to traditional SE methods, the goal of neural-network-based SE is to find an enhanced speech signal $\hat{s}$ that is close to the clean speech signal $s$.

1) FCN-based SE model
Fig. 2 shows an FCN model, which is similar to a conventional CNN except that all the fully connected layers are removed. As reported in [55], the FCN model can address the high- and low-frequency components of the raw waveform simultaneously. The relation between the output sample $\hat{s}_t$ and the connected hidden nodes $R_t$ can be represented by
$$\hat{s}_t = Q^{\top} R_t,$$
where $Q$ denotes one of the learned filters and the subscript $t$ indexes the time step. The objective function of the FCN-based SE model is then defined as
$$\theta_F^* = \arg\min_{\theta_F} \sum_t \big( \hat{s}_t(\theta_F) - s_t \big)^2,$$
where $\theta_F$ denotes the model parameters of the FCN.

2) DDAE-based SE model
During the training of the DDAE, noisy-clean speech signal pairs are used to learn the mapping from noisy to clean spectral (logarithmic amplitude in this study) features. The DDAE model aims to transform the noisy speech signal into a clean speech signal by minimizing the reconstruction error between the predicted spectral features $\hat{S}$ and the reference clean spectral features $S$:
$$\theta_D^* = \arg\min_{\theta_D} \sum_n \big\| S_n - \hat{S}_n \big\|_2^2 + \rho\, C(\theta_D),$$
where $\theta_D$ denotes the model parameters of the DDAE, and $\rho$ is a constant that controls the trade-off between the reconstruction accuracy and the regularization term $C(\theta_D)$ [45]; $\rho$ is determined using the validation set during training. In this study, to simplify the model and to enable a fair comparison with the other methods, we set $\rho$ to 0. Given noisy spectral features $X$, the DDAE estimates the clean speech signal by
$$h_1 = \sigma(W_1 X + b_1),$$
$$h_l = \sigma(W_l h_{l-1} + b_l), \quad l = 2, \dots, L-1,$$
$$\hat{S} = W_L h_{L-1} + b_L,$$
where $W_1, \dots, W_L$ and $b_1, \dots, b_L$ are the weight matrices and bias vectors, respectively, and $L$ is the number of layers. In addition, $\sigma$ is the vector-wise sigmoid non-linear activation function.

3) LSTM-based SE model
Because LSTM can capture the temporal relations of speech signals, it has proven to deliver promising results in SE [51]. The objective function of the LSTM-based SE model is close to that of the DDAE model, which is to find the best LSTM model parameters $\theta_L$ that minimize
$$\sum_n \big\| S_n - \hat{S}_n \big\|_2^2.$$
In this study, we used the LSTM unit defined as follows:
$$f_n = \sigma(W_f [h_{n-1}, X_n] + b_f),$$
$$i_n = \sigma(W_i [h_{n-1}, X_n] + b_i),$$
$$o_n = \sigma(W_o [h_{n-1}, X_n] + b_o),$$
$$g_n = \tanh(W_g [h_{n-1}, X_n] + b_g),$$
$$c_n = f_n \odot c_{n-1} + i_n \odot g_n,$$
$$h_n = o_n \odot \tanh(c_n),$$
where $X_n$, $f_n$, $i_n$, $o_n$, $g_n$, $c_n$, and $h_n$ represent the input, forget gate, input gate, output gate, cell input activation, cell state, and hidden state vectors, respectively, and the subscript $n$ indexes the frame step. In addition, $W_q$ and $b_q$ denote the weights and biases, respectively, where the subscript $q$ can be the input gate $i$, output gate $o$, forget gate $f$, or cell input $g$, and $\odot$ represents element-wise multiplication.

4) CRNN-based SE model
The CRNN in this work combines convolutional, LSTM, and dense layers. Previous work indicated that a CRNN could achieve better objective intelligibility and perceptual quality than an LSTM model with fewer trainable parameters [57]. The architecture of the CRNN-based SE model is shown in Fig. 3.
FIGURE 3. Architecture of the CRNN-based SE model: a convolutional layer followed by two LSTM layers and a dense layer.

C. MODEL ADAPTATION
When operating SE in a real-world scenario, unknown noise types and new users are often encountered. Therefore, in many cases, the testing data may not be adequately covered by the trained SE model. MA addresses this mismatch by fine-tuning a pretrained SE model with a small amount of data recorded under the target condition, thereby adapting the model to the new noise types or speakers.

III. CITISEN
A. SE FUNCTION
SE is a major function of CITISEN. As shown by the blue block in Fig. 4, given a noisy speech signal, the SE function removes background noises and generates an enhanced speech signal with improved quality and intelligibility. The SE models were trained on a cloud server, and the trained models were loaded into mobile devices. Because the models are trained and saved on the cloud server, mobile devices do not need large computational resources. When connected to the Internet, mobile devices automatically download updated SE models. A third-party module, okhttp3, was used to save and manage the SE models. In addition, for SE, CITISEN has two recording modes: prerecording and instant recording. In the prerecording mode, CITISEN records the entire speech signal before processing, whereas in the instant recording mode, CITISEN records and processes the speech signal simultaneously. CITISEN is a standardized SE software with a user interface that can support pretrained SE models trained with various machine learning frameworks, including Keras, PyTorch, and TensorFlow. In addition, the SE models can have different architectures or input acoustic feature formats. Fig. 5 (a) shows the implementation of the SE function in CITISEN, which contains four steps: audio recording, pre-processing, speech enhancing, and post-processing. The details of these steps are described as follows.

1) Audio recording
In this step, the application interface is implemented using the Java/Android application programming interface (API) AudioRecord. AudioRecord saves the audio signal at a sampling rate of 16000 Hz in a single channel. In the instant recording mode, as AudioRecord processes and analyzes the audio data in every 5120 bytes, which corresponds to 320 sample points, the instant recording has a delay of approximately 20 ms. The configuration of AudioRecord in CITISEN is presented in Table III-A1.

2) Pre-processing
In this step, CITISEN converts the data format of the mobile input (byte) to the data format of the SE model input (float). For waveform-based SE models, such as the FCN, the pre-processing step converts the time-domain audio signal from bytes to floats. For spectral-based SE models, such as the DDAE, an additional STFT is required to transform the time-domain signal into the frequency domain. CITISEN performs the STFT by calling the Java/Android API DoubleFFT_1D in the JTransforms library. By calling this API, a one-dimensional time-domain signal is transformed into a complex matrix. The magnitude part of the complex matrix is presented as a spectrogram, which is used as the input to the spectral-based SE models. The phase part of the complex matrix is reserved and used later to convert the enhanced spectrogram back to a time-domain audio signal.
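For illustration, the following sketch shows how this STFT-based pre-processing could be realized with DoubleFFT_1D. It is a minimal sketch under the STFT settings given in Section IV-B (window length 512, hop length 256, Hann window) and the log1p feature mentioned there, not CITISEN's actual code; the class and method names are hypothetical, and the JTransforms package name may vary by library version.

```java
import org.jtransforms.fft.DoubleFFT_1D;

/** Minimal STFT sketch for the pre-processing step (hypothetical names). */
public final class StftSketch {
    private static final int WIN = 512;
    private static final int HOP = 256;
    private static final int BINS = WIN / 2 + 1;

    /** Computes log1p magnitude frames; phaseOut must be preallocated as [nFrames][BINS]. */
    public static double[][] stft(float[] signal, double[][] phaseOut) {
        DoubleFFT_1D fft = new DoubleFFT_1D(WIN);
        int nFrames = (signal.length - WIN) / HOP + 1;
        double[][] feat = new double[nFrames][BINS];
        for (int f = 0; f < nFrames; f++) {
            double[] frame = new double[WIN];
            for (int i = 0; i < WIN; i++) {
                double hann = 0.5 - 0.5 * Math.cos(2.0 * Math.PI * i / (WIN - 1));
                frame[i] = signal[f * HOP + i] * hann;  // windowed frame
            }
            fft.realForward(frame);  // in-place packed real FFT
            for (int k = 0; k < BINS; k++) {
                // Unpack JTransforms' layout: a[0]=Re[0], a[1]=Re[N/2],
                // a[2k]=Re[k], a[2k+1]=Im[k] for 0 < k < N/2.
                double re = (k == 0) ? frame[0] : (k == WIN / 2) ? frame[1] : frame[2 * k];
                double im = (k == 0 || k == WIN / 2) ? 0.0 : frame[2 * k + 1];
                feat[f][k] = Math.log1p(Math.hypot(re, im)); // log1p magnitude feature
                phaseOut[f][k] = Math.atan2(im, re);         // noisy phase, reused for iSTFT
            }
        }
        return feat;
    }
}
```

The iSTFT in the post-processing step would invert this pipeline, e.g., with DoubleFFT_1D's realInverse and overlap-add using the stored phase.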

3) Speech-enhancing
To operate the SE model on mobile devices, the pretrained SE model must be packaged into a .pb file. CITISEN then calls the Java API built into TensorFlow, TensorFlowInferenceInterface, and passes the assetManager (.pb file) and modelFilename (model name) to the API. Finally, CITISEN loads the SE model and computes the enhanced speech signal. This part requires the microprocessor of the mobile device to participate in the calculation, so different mobile phone models will exhibit different time delays. Currently, we have implemented FCN-based and DDAE-based SE in CITISEN; however, the set of available SE models can easily be extended by uploading additional SE models in the same manner.
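A minimal sketch of this step is shown below. TensorFlowInferenceInterface and its feed/run/fetch calls are part of the legacy TensorFlow Android support library; the tensor names ("input", "output") and the input shape are hypothetical placeholders that depend on how the SE model was exported.

```java
import android.content.res.AssetManager;
import org.tensorflow.contrib.android.TensorFlowInferenceInterface;

/** Sketch of running a frozen (.pb) SE model on-device (hypothetical tensor names). */
public final class SeInference {
    private final TensorFlowInferenceInterface tf;

    public SeInference(AssetManager assets, String modelFilename) {
        // modelFilename, e.g., "file:///android_asset/se_model.pb"
        tf = new TensorFlowInferenceInterface(assets, modelFilename);
    }

    /** Feeds a [1, nFrames, nBins] feature tensor and fetches the enhanced features. */
    public float[] enhance(float[] features, int nFrames, int nBins) {
        float[] out = new float[nFrames * nBins];
        tf.feed("input", features, 1, nFrames, nBins); // feed(name, src, dims...)
        tf.run(new String[] {"output"});               // run the named output ops
        tf.fetch("output", out);                       // copy the result out
        return out;
    }
}
```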

4) Post-processing
For spectral-based SE models, such as the DDAE, the output of the SE model is reconstructed into a time-domain signal. The waveform reconstruction method in CITISEN is the iSTFT, which is implemented with the DoubleFFT_1D function. For waveform-based SE models, such as the FCN, the output is already a time-domain signal and requires no additional conversion. Finally, the data type of the enhanced speech signal is converted into a playable form (from float to byte).
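The byte/float conversions in the pre- and post-processing steps amount to packing and unpacking 16-bit PCM samples. A minimal sketch, assuming little-endian 16-bit PCM as produced by AudioRecord (the class name is hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Sketch of the byte <-> float conversions around the SE model. */
public final class PcmConvert {
    /** Unpacks 16-bit little-endian PCM bytes into floats in [-1, 1). */
    public static float[] bytesToFloats(byte[] pcm) {
        short[] s = new short[pcm.length / 2];
        ByteBuffer.wrap(pcm).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer().get(s);
        float[] out = new float[s.length];
        for (int i = 0; i < s.length; i++) out[i] = s[i] / 32768f; // normalize
        return out;
    }

    /** Packs floats back into playable 16-bit little-endian PCM bytes. */
    public static byte[] floatsToBytes(float[] x) {
        ByteBuffer buf = ByteBuffer.allocate(x.length * 2).order(ByteOrder.LITTLE_ENDIAN);
        for (float v : x) {
            float c = Math.max(-1f, Math.min(1f, v)); // clip to avoid wrap-around
            buf.putShort((short) (c * 32767f));
        }
        return buf.array();
    }
}
```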

FIGURE 5. Implementation of the CITISEN functions. (a) SE function: audio recording; pre-processing (transfer the data type from byte to float; optionally convert time-domain signals to spectral features using the STFT); speech enhancing (load the pretrained model with TensorFlowInferenceInterface, feed the data to the model, and fetch the output); post-processing (transfer the data type from float to byte; optionally convert spectral features back to time-domain signals using the iSTFT). (b) MA function: audio recording (record background noises or clean speech signals); data uploading (upload the recorded data to the cloud server); model fine-tuning (fine-tune the model with the uploaded data on the cloud server); model downloading (download the fine-tuned model to mobile devices). (c) BNC function: audio recording (record and save the background noise of acoustic scenes); noise selection (choose the target background noise); audio mixing (play the prerecorded noise and the enhanced speech signal simultaneously).

B. MA FUNCTION
The MA function of CITISEN aims to adapt the SE model to unknown noises, new speakers, or both. CITISEN provides three different MA modes: noise only (N), speaker only (S), and noise and speaker (N+S). Users can upload a short audio clip of the environment noise or of their clean speech signal to the cloud server, and the parameters of the original SE model are fine-tuned using the uploaded data. Users can then download and use the adapted SE models in CITISEN. Currently, we suggest that users record their reference target speech signal in a noise-free environment. However, previous studies [79], [80] have shown that some level of noise in the reference target can still lead to an effective reconstruction of the clean waveform in an SE system. The implementation of the MA function is shown in Fig. 5 (b).

C. BNC FUNCTION
BNC is a new topic in the field of speech processing. The idea is similar to changing the background of an image or video [81]. With BNC, users can artificially convert the background noise of their speech signal into another specified noise. To use the BNC function, the noises of the target background must first be recorded and stored. Users can record background noises from different real-world environments, such as car engines and train stations. Then, users select the target background noise before running the BNC function. When running the BNC function, CITISEN first removes the original background noise using SE and then mixes the enhanced speech signal with the new background noise by playing them simultaneously. In addition to the SE steps, BNC has three additional steps: audio recording (of the background noise), noise selection, and audio mixing, as sketched below. Fig. 5 (c) illustrates the implementation of the BNC function.
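To make the mixing step concrete, the sketch below scales a stored noise clip so that the mixture reaches a target SNR before adding it to the enhanced speech. This is a minimal offline illustration, not CITISEN's actual code; in the app, the two signals are played simultaneously, and the volume bar on the BNC page effectively controls this SNR.

```java
/** Sketch of mixing enhanced speech with a new background noise at a target SNR. */
public final class BncMix {
    public static float[] mixAtSnr(float[] speech, float[] noise, double snrDb) {
        double ps = power(speech), pn = power(noise);
        // Gain g such that 10*log10(ps / (g^2 * pn)) == snrDb.
        double g = Math.sqrt(ps / (pn * Math.pow(10.0, snrDb / 10.0)));
        float[] mix = new float[speech.length];
        for (int i = 0; i < mix.length; i++) {
            mix[i] = speech[i] + (float) g * noise[i % noise.length]; // loop the noise clip
        }
        return mix;
    }

    /** Average signal power. */
    private static double power(float[] x) {
        double sum = 0;
        for (float v : x) sum += (double) v * v;
        return sum / x.length;
    }
}
```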

D. CITISEN USER INTERFACE AND USAGE
CITISEN has four pages: "speech enhancement," "background noise conversion," "uploading," and "recording," as shown in Fig. 6.

1) Speech enhancement page
The "speech enhancement" page of CITISEN is shown in Fig. 7. On this page, users can perform SE with the default SE models, which were trained using our own collected speech datasets. By pressing the "preview" button, users can hear their instant recording without SE. By pressing the "activate" button, the SE function is activated, and users hear their enhanced instant recording.

FIGURE 7. Speech enhancement page of CITISEN. The "gender" button in the upper-right corner is used to specify the user's gender. Pressing the "model switch" button pops up an SE model list from which users can change the SE model. After pressing the "preview" button, users hear their original instant recording; after pressing the "activate" button, users hear their enhanced instant recording.

2) Background noise conversion page
The "background noise conversion" page of CITISEN is shown in Fig. 8. On this page, CITISEN mixes the specified background noise with the enhanced speech signal to generate a new speech signal with the specified background noise. By pressing the "sound switch" button, users can choose the background noise they want to use on the pop-up background noise list. By pressing the "record noise" button, users can record and save a new background noise. In addition, by pressing the "activate" button, users will hear their enhanced instant recording with the specified background noise. Moreover, the "background noise conversion" page has a volume bar, which allows users to adjust the volume of background noise and specify the SNR level of the converted speech signal accordingly.

FIGURE 8. Background noise conversion page of CITISEN. Pressing the "sound switch" button pops up a background noise list. After pressing the "record noise" button, users can record and save a new noise signal. After pressing the "activate" button, users hear the enhanced instant recording contaminated with the specified background noise. Note that the "gender" and "model switch" buttons have the same functions as those on the "speech enhancement" page.

3) Uploading page
The "uploading" page is used for uploading the data for the MA function. As CITISEN provides both unknown noise adaptation and new speaker adaptation, there are two file upload buttons: "record speech" and "record noise." To start the recording, users can simply press one button. After finishing the recording by pressing the button again, CITISEN will pop up a submission window. Users can then name the audio file and upload the recorded audio to the server. After receiving the audio file, the server can adapt the SE model by fine-tuning the original SE model using the recorded audio data. The name of the audio file can also be used to call the adapted SE model, which is later sent from the server to the mobile device and appears on the SE model list on "speech enhancement" and "background noise conversion" pages. Accordingly, users can run the SE and BNC functions using the adapted SE model. The "uploading" page of CITISEN is shown in Fig. 9.

4) Recording page
The "recording" page supports prerecording and SE model evaluation. Specifically, on the "recording" page, users can save, playback, and run SE on a saved speech signal. First, users can record new audio by pressing the "record new" button, and CITISEN will redirect to a processing page. After finishing the recording by pressing the "stop" button, users can name and save the record. The workflow is shown in Fig. 10. Then, users can choose an audio file, a model mode, and an SE model with the "choose file," "gender," and "model switch" buttons, respectively. Finally, by pressing the "run" button, an enhanced speech signal is generated. Because CITISEN demonstrates both the noisy and enhanced VOLUME 3, 2020

Submission window
Record noise

Record speech
Cancel / Upload Gender FIGURE 9. Uploading page of CITISEN. After recording a noise or speech signal, CITISEN asks the user to name and save the audio file and upload it to the cloud server.
spectrograms, users can visually evaluate the SE results. In addition, users can aurally evaluate the results by pressing the "play" and "stop" buttons to listen to the original and the enhanced speech signals. An illustration showing more details about the "recording" page is shown in Fig. 11 and Fig. 12.

FIGURE 10. Workflow of recording a new file: record a new file, stop recording, and save the file.

IV. EXPERIMENTS
This section presents the setup, implementation details, and results of the experiments that tested the performance of the SE, MA, and BNC functions.

A. EXPERIMENTAL SETUP
In this study, TMHINT utterances [82] were used to prepare the training and testing sets; the utterances were recorded at a 16 kHz sampling rate in a 16-bit format. Notably, the experiments were conducted offline on the cloud platform instead of on the mobile platform for several reasons. First, the cloud platform provides a more stable communication and computation environment, which ensures that the listening tests can proceed smoothly. Second, because the performance of mobile phones varies widely, it is hard to choose one as a representative. Moreover, mobile phones progress so quickly that the current best model might be greatly outperformed by a new model next year. Finally, evaluating the results on the cloud platform provides an upper bound for these functions and makes the results comparable with those of other studies.

1) SE experiments
In the SE experiments, the training set was prepared using speech utterances from three male and three female speakers. Each speaker read 200 TMHINT utterances in a quiet room, yielding 1200 clean utterances in total. Each utterance lasted approximately 3 s and contained ten Chinese characters. Noisy utterances were generated by artificially contaminating these 1200 clean training utterances with five noise types randomly sampled from a 100-noise-type dataset [83] at eight different SNR levels (±1 dB, ±4 dB, ±7 dB, and ±10 dB). Consequently, 48,000 noisy-clean utterance pairs were obtained. As for the testing set, we used speech utterances from two other speakers (one male and one female, termed the testing speakers in the following discussion), with 120 utterances for each speaker. We generated noisy utterances by artificially contaminating these clean testing utterances with another set of five noise types (car, sea wave, take-off, train, and song) at two different SNR levels (0 dB and 5 dB). Notably, the speakers, speech content, and noise types differed between the training and testing sets. The performance of SE was tested using both subjective listening tests and objective evaluations.
For the listening tests, we recruited 20 participants with a male-to-female ratio of 2:3. Their ages ranged between 20 and 38 years, with a mean age of 21.50 (standard deviation (SD) = 3.97). All participants were native Mandarin speakers with normal hearing to perceive the stimuli during the test. Each participant listened to 80 testing speech signals (40 at 0 dB and 40 at 5 dB) spoken by the male and female testing speakers. These 80 speech signals had different contents and one of the five assigned background noises (car, sea wave, take-off, train, and song) under four conditions: the original noisy speech signals (without enhancement), and signals enhanced by an MMSE-based SE method, a DDAE-based SE model, and an FCN-based SE model. These four conditions are denoted as noisy, MMSE, DDAE, and FCN, respectively, in the following discussion. Each participant thus tested 40 lower- and 40 higher-SNR speech signals. In addition, the participants were instructed to verbally repeat what they had heard and were allowed to replay each stimulus once. The character correct rate (CCR), calculated by dividing the number of correctly identified characters by the total number of characters, was used to evaluate the intelligibility of the speech signals.
For the objective test, we evaluated the results of two more neural-network-based methods, including LSTM-based SE and CRNN-based SE. In the following discussion, the speech signals enhanced by these two methods are denoted as LSTM and CRNN, respectively. PESQ [73] and STOI [72] were used as objective evaluation metrics. PESQ was designed to evaluate the quality of the processed speech signal, and the score ranged from -0.5 to 4.5. A higher PESQ score indicates that an enhanced speech signal is closer to the clean speech signal. STOI was designed to compute speech intelligibility, and the scores ranged from 0 to 1. A higher STOI score indicates better speech intelligibility.

2) MA experiments
The performance of the MA function was evaluated under three modes: MA(N), MA(S), and MA(N+S). The training set of the MA experiments was prepared as follows. For MA(N), two new noises (machine beeping and air flowing) from a real hospital scenario were mixed with the same clean training utterances as in the SE experiments to form new noisy-clean speech signal pairs. For MA(S), we mixed 40 clean utterances of the testing speakers from the SE experiments (20 utterances for each speaker) with the same training noises as in the SE experiments to form new noisy-clean speech signal pairs. For MA(N+S), the testing speakers' clean utterances and the new noise signals were mixed to form new noisy-clean speech signal pairs. As in the SE experiments, the SNRs used to generate the noisy training utterances were ±1 dB, ±4 dB, ±7 dB, and ±10 dB. These training data were then used to fine-tune the pretrained SE model from the SE experiments until the model converged. The testing set of the MA experiments consisted of the same clean testing utterances as in the SE experiments, mixed with the machine beeping and air flowing noises at four different SNR levels (±2 dB, 0 dB, and 5 dB).
Specifically, for MA(N), the training and testing speakers were independent, but the noises came from the same source. For MA(S), the training and testing speakers overlapped, but the training and testing noises were independent. For MA(N+S), the training and testing speakers overlapped, and the noises came from the same source. Note that in every MA experiment, the contents of the training and testing speech signals were different. In addition, the training and testing noises in MA(N) and MA(N+S) came from the same sources but were recorded at different times.

3) BNC experiments
Based on our literature survey, there is no standard method for evaluating BNC. Because BNC aims to convert the original background noise into the target background noise, the accuracy (ACC), defined as the number of correctly identified background noise types divided by the total number of questions, was used to evaluate the BNC results. In addition, the CCR was used to evaluate how well the converted speech signals maintained clarity and intelligibility. The ACC and CCR scores were estimated by both humans and machines. Specifically, we invited human listeners to conduct listening tests, trained an ASC model to analyze the ACC, and used a pretrained automatic speech recognition (ASR) model to measure the CCR. For the human evaluation, the CCR was the ratio of characters that a participant could correctly recognize. For the machine evaluation, the CCR was calculated using the Levenshtein distance [84] between the predictions of a pretrained ASR system [85] and the ground truth, as sketched below. The details of the listening test and the ASC model are as follows.
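For reference, the machine-side CCR can be computed as sketched below, assuming CCR = 1 − d/N, where d is the Levenshtein distance between the ASR transcript and the reference and N is the reference length; the paper specifies the Levenshtein distance [84] but not the exact normalization, so this formula is an assumption.

```java
/** Sketch of a Levenshtein-based CCR between an ASR transcript and the reference. */
public final class CcrMetric {
    public static double ccr(String reference, String hypothesis) {
        int d = levenshtein(reference, hypothesis);
        return Math.max(0.0, 1.0 - (double) d / reference.length());
    }

    /** Classic dynamic-programming edit distance over characters. */
    private static int levenshtein(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(dp[i - 1][j - 1] + sub,
                           Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1));
            }
        }
        return dp[a.length()][b.length()];
    }
}
```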

a: Listening test
We asked the listeners to identify one out of five background noises (car, sea wave, take-off, train, and song) after listening to a converted speech signal. To avoid random guessing, listeners could choose "not clear" if they could not identify the background noise. During the test, participants were asked to repeat what they had heard, select the characters they had heard, and identify the background noise. Forty participants with a male-to-female ratio of 9 to 11 were recruited to participate in this set of listening tests. The group ages were between 14 and 43 years, with a mean age of 25.74 (SD = 8.68). All participants were native Mandarin speakers with normal hearing to perceive the stimuli during the test.
The stimuli were Mandarin sentences spoken by the male and female testing speakers. The testing speech signals were either processed using one of three SE methods (i.e., MMSE, DDAE, and FCN) or not processed (i.e., the clean speech signals). Notably, the enhanced speech signals from the SE experiments were used for the BNC experiments, which means the original background noise was car, sea wave, take-off, train, or song. The enhanced speech signals were then contaminated with the car, sea wave, take-off, train, or song noise, yielding 5 × 5 possible BNC conditions. To avoid fatigue effects, we only tested the 5 dB to 5 dB SNR condition; that is, the SNRs of the original and converted speech signals were both 5 dB. In total, each participant listened to 80 utterances.

b: ASC model
We used the same dataset as in the SE experiments to train the ASC model. Specifically, the training and testing utterances were the same as those in the SE experiments described in Section IV-A1. Thirteen noise types were used for the ASC model. Five of them were the test noises used in the SE experiments, namely, car, sea wave, take-off, train, and song. The remaining eight noises were selected from the training noises of the SE experiments. Each noise segment was cut into two segments with a ratio of one to four; the shorter segment was used for testing, whereas the longer segment was used for training. The training and testing SNR levels were the same as in the SE experiments. Fig. 13 shows the details of the ASC model, which is based on [86]. The input of the model is the log1p spectrogram [87]. The model was trained for 100 epochs with a batch size of 128, using the Adam optimizer with a learning rate of 0.0001 and the cross-entropy loss.

B. IMPLEMENTATION DETAILS OF SE MODELS
This section describes the structures and training details of the neural-network-based SE models. For the spectral-based models, including the DDAE, LSTM, and CRNN, the STFT parameter settings were as follows: the window length was 512, the hop length was 256, and the window type was the Hanning window. The log1p spectrograms [87] were then used as the input to the SE models. During inference, the noisy phase was retained and combined with the enhanced spectral features to reconstruct the time-domain signals.
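For reference, the log1p feature and the corresponding waveform reconstruction can be written as follows; this is a minimal formulation consistent with the settings above, and the exact normalization in [87] may differ:
$$F = \ln(1 + |X|), \qquad |\hat{S}| = \exp(\hat{F}) - 1, \qquad \hat{s} = \operatorname{iSTFT}\big(|\hat{S}|\, e^{j\angle X}\big),$$
where $|X|$ and $\angle X$ denote the magnitude and phase of the noisy STFT, and $\hat{F}$ is the enhanced log1p feature predicted by the SE model.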

1) FCN
The FCN consisted of eight convolutional layers, where the filter number and kernel size of each of the first seven layers were 128 and 55, respectively. Batch normalization and the LeakyReLU activation were used to regularize the output of each hidden layer. The filter number and kernel size of the last layer were 1 and 55, respectively, with the hyperbolic tangent activation function applied to the FCN output. The number of training epochs was set to 60. In addition, a batch size of 1, the Adam optimizer with a learning rate of 0.001, and the mean squared error (MSE) criterion were used.

2) DDAE
To incorporate contextual information, for each self-defined DDAE layer in this work, five adjacent frames of the input feature vector were concatenated to form the input of the next layer, whereas the output of each layer was a single frame. In addition, the ReLU was applied to the output layer. The DDAE was composed of three DDAE layers with 257 output units each, followed by a dense layer taking single frames as input with 825 output units, and another dense layer with 257 output units. Finally, the DDAE model had four more DDAE layers with 257 output units. The number of training epochs was 200. In addition, a batch size of 128, the Adam optimizer with a learning rate of 0.0001, and the MSE criterion were used.

3) LSTM
The LSTM model used in this evaluation was constructed as three stacked LSTM layers followed by a dense layer. Each LSTM layer contained 492 memory cells, and the size of the last dense layer was 257. The number of training epochs was set to 20. The Adam optimizer with a learning rate of 0.001 and the MSE criterion were used.

4) CRNN
The CRNN combines a CNN and LSTM to enhance the input. The CRNN first comprised four convolutional blocks, each composed of three two-dimensional convolution layers. The ReLU activation function was applied to the output of each layer. The kernel size of each convolutional layer was three, and the numbers of channels were arranged in the order of 16, 32, 64, and 64. In each block, the stride of the output convolutional layer along the speech feature dimension was three, and that of the remaining layers was one. The convolutional blocks were followed by four LSTM layers with 384 memory cells and 257-dimensional dense layers with the ReLU activation function. The input dimensions for the decoder were reshaped from the output of the encoder to 192 (3 × 64). In addition, the number of training epochs was 200, the batch size was 128, the optimizer was Adam with a learning rate of 0.0001, and the MSE criterion was used.

C. EXPERIMENTAL RESULTS
In this section, we compare the complexity of the neural-network-based models and then perform a numerical analysis of the SE, MA, and BNC functions. Finally, we present the visualization results of the processed speech signals.

1) Complexity analyses
First, we evaluated the complexity of the neural-network-based SE models in terms of floating-point operations (FLOPs) and the number of model parameters. From the results in Table 3, we can observe that models with convolutional layers, such as the FCN and CRNN, require a higher computational cost in terms of the FLOPs metric. Higher FLOPs imply that these models place a heavier computational load on hardware resources at similar parameter sizes.
Note that to avoid unstable communication and computation, we conducted the experiments offline on a computer. However, we also tested whether the model with the highest FLOPs, the FCN model, could run on CITISEN. The results showed that the FCN model ran successfully on CITISEN.

2) SE experiment
As shown in Table 4, all the neural-network-based SE models achieved higher scores than MMSE in terms of both STOI and PESQ, whereas FCN provided the highest PESQ and STOI scores among the evaluated methods. These results demonstrate the effectiveness of using a deep-learning model for the SE task. Table 5 presents the subjective listening test results for noisy and the three SE methods. From the table, it can be observed that MMSE yielded lower CCRs than noisy at both 0 dB and 5 dB SNRs, which is consistent with the findings of previous research and with the STOI results reported in Table 4. That is, although some SE methods effectively remove background noise, speech intelligibility might be affected. In addition, the SE function is more helpful in low-SNR situations, as noisy speech signals maintain high intelligibility at high SNRs. A one-way analysis of variance and Tukey post-hoc comparisons were applied to assess the significance of the improvements in the SNR-wise CCR results of noisy, MMSE, FCN, and DDAE. The analysis first revealed a significant difference across the four conditions, with p < 0.001 at both 0 dB and 5 dB SNRs. The Tukey post-hoc tests further verified significant differences for several SE condition pairs at 0 dB, including (FCN, DDAE). Notably, the comparison between the scores of FCN and noisy indicated no significant difference, with p > 0.05 at both 0 dB and 5 dB SNRs. To achieve a significant difference between noisy and enhanced speech signals, a more advanced SE method operating under lower SNR conditions might be required.
In addition to the averaged CCRs over all participants, Fig. 14 (a) and (b) illustrate the subject-wise CCRs at 0 dB and 5 dB, respectively. Each gray circle in the figure represents the CCR score of an individual participant. In both sub-figures, we can observe a larger CCR variance for MMSE and DDAE than for FCN and noisy. These results suggest that the FCN model enhances noisy speech signals with less content ambiguity than MMSE and DDAE.

3) MA experiment
For the MA experiment, we fine-tuned the FCN model used in the SE experiments and used the original SE results from the FCN model as our baseline. From Table 6, it can be seen that SE yielded higher STOI and PESQ scores than noisy, confirming that SE can improve speech quality and intelligibility over noisy speech signals even though the noise types are unknown and differ from those used in the training set. Next, compared with the baseline (the original SE model without MA), all three MA modes achieved higher PESQ and STOI scores. More specifically, MA(N), MA(S), and MA(N+S) yielded noticeable relative improvements of 5.06%, 2.94%, and 5.84% in terms of STOI, and relative improvements of 12.48%, 3.32%, and 11.24% in terms of PESQ, respectively, compared with the baseline. These results confirm the effectiveness of the MA function and indicate that intelligibility and quality improvements can be attained by adapting the SE model to both noise and speaker information. From the experimental results, we also observe that MA(N) achieved a higher PESQ than MA(S) and MA(N+S). One possible reason is that more data were available for MA(N) than for MA(S) and MA(N+S). Specifically, the numbers of fine-tuning speech signals were 2 × 1200 × 8 (new noises × clean training utterances of the original SE model × SNRs), 5 × 40 × 8 (training noises of the original SE model × clean utterances from the new speakers × SNRs), and 2 × 40 × 8 (new noises × clean utterances from the new speakers × SNRs) for MA(N), MA(S), and MA(N+S), respectively.

4) BNC experiment
We present human and machine evaluations of the BNC function. The human evaluation was performed by conducting a listening test, whereas the machine evaluation was performed using an ASC model and a pretrained ASR system [85]. We evaluated the BNC using machines for three major reasons. First, recruiting humans to perform the tests is expensive and time-consuming, whereas machine evaluation is relatively inexpensive and efficient. Second, the ASC model has potential in several applications, such as monitoring systems, context-aware mobile devices, and audio search. Third, machines can assist human judgment. Therefore, the performance of the ASC model is also important for the BNC function. The details of the ASC model used in this study are described in Section IV-A3b.

a: Results of human evaluation
Based on the three SE methods, namely, MMSE, FCN, and DDAE, three sets of converted speech signals were obtained, denoted as BNC(MMSE), BNC(FCN), and BNC(DDAE), respectively. In addition, we included the results of BNC(clean), a set of speech signals converted from a silent background. Notably, BNC(clean) represents the upper bound of the BNC results because the conversion started from a clean speech signal. Fig. 15 presents the background-noise identification results of BNC(FCN) and BNC(DDAE). We excluded the results of BNC(MMSE) because its ACC was considerably lower than that of BNC(FCN) and BNC(DDAE), and MMSE performed worse than the other SE methods (Table 4). From the "sea" and "take-off" columns in Fig. 15, we observe that participants were less able to identify the "sea" and "take-off" backgrounds. These backgrounds are less recognizable than the other noises because participants must hear a nearly complete wave or take-off sound to confirm them. Conversely, the "song" column shows that participants found it easier to identify the "song" background, possibly because it contains music with a human voice, which differs considerably from the other background noises. Evidently, the characteristics of the target background significantly affected the identification results. In addition, the original background noise affected the ACC because the noise type usually notably affects the SE performance. Because the BNC function focuses on the background noise, the SNR level also affected the performance. We make two assumptions about the effect of the SNR level of a speech signal on BNC performance. The first assumption is that a higher original SNR leads to a better ACC; that is, the target background noise is easier to identify if the converted speech signal is less affected by the original background noise. The second assumption is that a lower converted SNR results in a better ACC; that is, the target background noise is easier to recognize if it is louder than the speech signal.
To test these two assumptions, we conducted four pairs of experiments that converted speech signals with the original SNR level a dB to speech signals with converted SNR level b dB, where a ∈ {0, 5}, and b ∈ {0, 5}. Fig. 16 shows the average ACC of the BNC(DDAE) and BNC(FCN). The figures in the same row and column represent the speech signals with the same original levels and converted SNR levels, respectively. That is, the influences of the original SNR level could be obtained by comparing the figures in different rows, whereas the effects of the converted SNR level could be determined by comparing the figures in different columns. In Fig. 16, we find that speech signals with an original SNR level of 5 dB (bottom row) outperform speech signals with an original SNR of 0 dB (top row), which confirms our first assumption that a speech signal with a higher original SNR has a better BNC result. Subsequently, speech signals with the converted SNR level of 0 dB (left column) performed better than speech signals with the converted SNR level of 5 dB (right column). This result verified our second assumption that a speech signal with a lower converted SNR yields a better BNC result.
We also evaluated the ACC of speech signals converted from a silent background (i.e., from a clean speech signal instead of an enhanced speech signal), which represents the upper bound of BNC performance. Fig. 17 shows the results for different SNR levels. Unlike the results of the previous experiments, the converted SNR level did not affect the ACC of the BNC: none of the background noise conditions indicated that a lower converted SNR would lead to a better ACC, and the average scores remained stable across SNR levels. One possible reason is that, for an enhanced speech signal, a lower converted SNR can mask the noise that was not removed by the SE model and thus make the target background easier to identify. Conversely, a lower converted SNR makes no difference for a clean speech signal because it contains no other noise; the target background therefore remains easy to recognize even at a high converted SNR level.

Notably, the ASC model achieved a high ACC on the "sea" and "take-off" backgrounds, whereas the participants of the listening test had a lower identification rate for these two noises. This result suggests that the ASC model has the potential to assist human listeners in recognizing noises that they cannot distinguish correctly.

Finally, we present the CCR results obtained using a pretrained ASR system [85]. As can be seen in Fig. 18, the SNR level significantly affects the CCR: the lower the SNR, the lower the CCR. Table 8 summarizes the machine evaluations. For the ACC of the BNC, enhanced speech signals performed worse than clean speech signals but still achieved more than 90% accuracy. For the CCR, the performance decreased when enhanced speech signals were used instead of clean speech signals. One possible reason is that the ASR system was not trained on enhanced speech signals, so its predictions for enhanced speech are less accurate. In addition, the CCR is significantly affected by the language model of the pretrained ASR system: despite identical pronunciations, the ASR system may output the wrong word, leading to a decline in CCR.

Subsequently, we used principal component analysis (PCA) [88] to visualize the embeddings of the ASC model in Fig. 19, where clean and enhanced speech signals with background noise "n" are denoted as "c+n" and "en+n," respectively. We first found that different noise types were separated, indicating that the ASC model correctly recognized the background noise types. We then observed that the embeddings of clean speech signals with a new noise were close to those of enhanced speech signals with the same converted noise. Therefore, the BNC function can serve as a data augmentation method for the ASC model when clean speech signals are unavailable; specifically, the BNC function can generate arbitrary numbers of training speech signals with specific background noises at arbitrary SNR levels. In addition, the proposed BNC has the potential to open up new and interesting topics that have not yet received sufficient attention, such as converting to a new background noise naturally and developing an ASC model that can distinguish between artificially converted and naturally recorded background noise.

FIGURE 19. Visualization of the ASC embeddings. The original background noise of an enhanced speech signal was either "take-off" or "train." The embeddings of clean speech signals with a new noise were close to those of enhanced speech signals with the same converted noise, which indicates that the BNC function might be used as a data augmentation method for the ASC model when a clean speech signal is unavailable.

5) Visualization results
Finally, we present the visualization results in Fig. 20. Figs. 20 (a), (b), (c), and (d) depict the spectrogram and waveform plots of the clean, noisy, enhanced, and BNC speech signals, respectively. In each sub-figure of Fig. 20, the left column depicts the spectrogram, and the right column depicts the associated waveform. The noisy speech signal in (b) was produced by contaminating the clean speech signal with car noise. The BNC speech signal in (d), produced by mixing the enhanced speech signal in (c) with train noise, demonstrates the conversion from car noise to train noise.
The enhanced spectrogram shown in Fig. 20 (c) preserves several harmonic structures of the clean speech when compared with Fig. 20 (a). In addition, when comparing the waveforms in Figs. 20 (a), (b), and (c), the enhanced waveform in Fig. 20 (c) exhibits considerably smaller noise components. Both observations demonstrate the effectiveness of SE in reducing noise from the noisy input while preserving detailed speech structures. The spectrogram shown in Fig. 20 (d) clearly exhibits different noise patterns from those in Fig. 20 (b), confirming the effectiveness of BNC.

V. CONCLUSION
In this study, we presented a speech signal processing mobile application called CITISEN. The contributions of CITISEN are as follows: (1) CITISEN was developed as a standardized SE tool with a user interface for performing SE on a prerecording or an instant recording. The experimental results confirmed that the SE function provides improved STOI and PESQ scores. (2) CITISEN has an MA function that allows users to adapt the SE models to personalized testing conditions, and the MA function was proven to provide notable STOI and PESQ improvements compared with the results without MA. (3) CITISEN provides a BNC function that converts the background noise of a speech signal into another noise. Notably, the BNC function is a novel concept for SE techniques and was implemented on mobile devices for the first time. The listening test results indicated that the BNC function could convert the background noise while maintaining the clarity and intelligibility of the converted speech signals. In addition, the machine evaluation experiments showed that the ASC embeddings of clean speech signals with a new noise were close to those of enhanced speech signals with the same converted noise. Therefore, the BNC function can serve as a data augmentation method for the ASC model when clean speech signals are unavailable. (4) By simply replacing the settings with the associated model, CITISEN can run other SE models that were not tested in this study. Therefore, CITISEN provides a suitable platform for evaluating deep-learning-based SE models and effectively reduces the development interval for converting deep-learning models into industrial applications.
YA-HSIN LAI received the Ph.D. degree in Education from the University of Bath, Bath, U.K., in 2020. From 2020 to 2021, she was a Postdoctoral Research Fellow with the Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan, where she engaged in a variety of research in automatic speech recognition, speech enhancement, and musical intervention for dyslexic children. She is currently an Assistant Professor with the Master Program of Youth and Child Welfare, Chinese Culture University, Taipei, Taiwan. Her research interests include psychometric instrument development and testing, parenting education, attachment relationships, as well as child and youth wellness.