Deep Learning Enabled Semantic Communications With Speech Recognition and Synthesis

In this paper, we develop a deep learning based semantic communication system for speech transmission, named DeepSC-ST. We take speech recognition and speech synthesis as the transmission tasks of the communication system. First, the speech recognition-related semantic features are extracted for transmission by a joint semantic-channel encoder and the text is recovered at the receiver based on the received semantic features, which significantly reduces the required amount of data transmission without performance degradation. Then, we perform speech synthesis at the receiver, which regenerates the speech signals by feeding the recognized text and the speaker information into a neural network module. To make DeepSC-ST adaptive to dynamic channel environments, we identify a robust model to cope with different channel conditions. According to the simulation results, the proposed DeepSC-ST significantly outperforms conventional communication systems and existing DL-enabled communication systems, especially in the low signal-to-noise ratio (SNR) regime. A software demonstration is further developed as a proof-of-concept of DeepSC-ST.


I. INTRODUCTION
With the boom of artificial intelligence (AI) in recent years, the unprecedented demands on intelligent applications require extremely high transmission efficiency and impose enormous challenges on conventional communication systems. According to Shannon and Weaver [2], the ultimate goal of communications is to exchange semantic information, which is referred to as semantic communications. The transmission efficiency can be significantly improved with semantic communications.
Part of this work was presented in [1], which has been published at IEEE.
Semantic information represents the meaning and veracity of source information [3]. However, it is challenging to quantify semantic information due to the lack of a mathematical model. The breakthroughs of deep learning (DL) make it possible to tackle many challenges without requiring a mathematical model. Recently, DL-enabled semantic communications have shown great potential to break the bottlenecks in conventional communication systems [4] and facilitate semantic information exchange. Moreover, in some applications, users only request some critical semantic information from the source, which motivates us to transmit the application-related semantic information. Therefore, task-oriented semantic communications have been regarded as a promising solution for the sixth generation (6G) and beyond [5], [6], especially when communication resources are limited.
According to the state of the art in DL-enabled semantic communications, the transmission goal can be categorized into two types: source data reconstruction and intelligent task execution. To achieve data reconstruction, the global semantic information is extracted for transmission. To serve intelligent tasks, however, the extracted semantic information only consists of the task-related semantic features, and the other irrelevant features can be ignored to minimize the data to be transmitted. Besides, the semantic information can be significantly compressed by employing a lossless compression method [7], which makes it feasible to transmit the task-related semantic information instead of the global semantic information to achieve very high bandwidth efficiency.
Semantic communications have attracted intensive research for text [8]-[12], speech/audio [13]-[15], and image/video transmission [16], [17]. For semantic-aware speech/audio transmission, Weng et al. [13] first proposed a DL-enabled semantic communication system, named DeepSC-S, to extract the global semantic information by leveraging an attention mechanism-powered module and reconstruct the speech signals at the receiver. Tong et al. [14] developed a multi-user audio semantic communication system to collaboratively train the convolutional neural network (CNN)-based autoencoder by implementing federated learning over multiple devices and a server. Moreover, Shi et al. [15] designed an understanding and transmission architecture for semantic communications and verified its effectiveness by deploying the architecture in a speech transmission system, which converts speech signals into semantic symbols to ensure high semantic fidelity and decodes the received semantic symbols into a speech waveform. Although semantic communications have been designed for intelligent speech transmission, an investigation of semantic communications that execute speech-centric intelligent tasks at the receiver is still missing.
Due to the intensive deployment of intelligent devices in the post-Shannon communication era, a large amount of data transmission is inevitable to support massive connectivity amongst those devices. However, the available spectrum resources are scarce, which creates a bottleneck for conventional communication systems. Inspired by this, we propose a DL-enabled semantic communication system, named DeepSC-ST, for speech transmission and serving users with different requests. Particularly, DeepSC-ST first compresses the input speech sequence into low-dimensional text-related semantic features that are transmitted over physical channels. At the receiver, the text sequence is estimated based on the received semantic features. By doing so, the characteristics of speech signals, e.g., the voice of the speaker, speech delay, and background noise, are not transmitted, which lowers the network traffic significantly and serves users requesting the text information only. Besides, to grant users access to speech signals at the receiver, the recognized text is passed through a speech synthesis module to restore the speech signals efficiently according to the user identity (ID). Note that the user ID is pre-registered and the corresponding speaker information is available at the receiver to reconstruct the speech sequence as close to the input speech sequence as possible. The main contributions of this paper are summarized as follows:
• A novel semantic communication system, named DeepSC-ST, is proposed for communication scenarios with speech input, in which a joint semantic-channel coding scheme is developed.
• The text-related semantic features are extracted from the input speech by leveraging a CNN and recurrent neural network (RNN)-based semantic transmitter, which significantly reduces the transmission data and the required communication resources without performance degradation.
• We develop speech recognition and speech synthesis tasks to achieve diverse system output. Particularly, the received text-related semantic features are converted to the text information by a feature decoder. Besides, the speech sequence is reconstructed according to the recovered text and the speaker information by leveraging a CNN and RNN-based neural network.
• A demonstration of DeepSC-ST with an operable user interface is built to produce the recognized text and the synthesized speech based on real human speech input.
The rest of this article is structured as follows. Section II presents the related work. The model of the semantic communication system for speech recognition and speech synthesis, together with the performance metrics, is introduced in Section III. In Section IV, the proposed DeepSC-ST is detailed. Simulation results are discussed in Section V and Section VI draws conclusions.
Notation: Single boldface letters are used to represent vectors or matrices and single plain capital letters denote integers. $x_i$ indicates the $i$-th component of vector $\mathbf{x}$, and $\|\mathbf{x}\|$ denotes the Euclidean norm of $\mathbf{x}$. $\mathbf{Y} \in \mathbb{R}^{M \times N}$ indicates that $\mathbf{Y}$ is an $M \times N$ real matrix. Superscript swash letters refer to the blocks in the system, e.g., $\mathcal{T}$ in $\boldsymbol{\theta}^{\mathcal{T}}$ represents the parameters at the transmitter. $\mathcal{CN}(\mathbf{m}, \mathbf{V})$ denotes the multivariate circular complex Gaussian distribution with mean vector $\mathbf{m}$ and covariance matrix $\mathbf{V}$. Moreover, $\mathbf{a} * \mathbf{b}$ represents the convolution of vectors $\mathbf{a}$ and $\mathbf{b}$.

II. RELATED WORK
In this section, we introduce the related work on DL-enabled semantic communication systems and present the state-of-the-art models for speech recognition and speech synthesis.

A. Semantic Communication Systems
Semantic communications have attracted extensive research interest very recently. Particularly, the transformer-powered system for text transmission, named DeepSC, has been proposed in [9] to measure the semantic error at the word level instead of the sentence level. Besides, a variant of DeepSC has been developed in [10] to further reduce the transmission error by leveraging hybrid automatic repeat request (HARQ) to improve the reliability of semantic transmission. Furthermore, the reasoning-based semantic communicator (R-SC) architecture in [12] automatically infers the hidden information by an inference function-based approach and introduces a life-long method to learn from previously received messages. In addition to text transmission, semantic communications for serving text-based intelligent tasks have been investigated. Particularly, the transformer-enabled model, named DeepSC-MT, in [18] can achieve the machine translation task by learning the word distribution of the target language to map the meaning of source sentences to the target language. Besides, a visual question-answering (VQA) task has been investigated in [18] by designing a multi-modal system to compress the correlated text-image semantic features, and then to perform the information query at the receiver before fusing the text-image information to infer an accurate answer.
In semantic communications for image and video transmission, the DL-enabled semantic communication system for image transmission in [16] utilizes a generative adversarial network (GAN)-based semantic coding scheme to interpret the meaning of images and reconstruct images with high semantic fidelity. In [17], a basal semantic video conference network has been established, which considers the impact of channel feedback and designs a semantic detector to detect semantic errors by leveraging the ID classifier and fluency detection. Inspired by the boom of computer vision, semantic communications for serving intelligent vision tasks have shown great potential to tackle many challenges beyond human limits. Particularly, the image classification task at the edge server has been investigated in [19], in which IoT devices compress images with low computational complexity to reduce the required transmission bandwidth. The semantic communication system for image classification in [20] adopts the information bottleneck [21] framework to identify the optimal tradeoff between compression rate and classification accuracy. In [22], a deep reinforcement learning-enabled semantic communication system has been developed for joint image transmission and scene classification. Furthermore, a robust semantic communication system to combat semantic errors has been proposed in [23], which incorporates the image with the generated semantic noise and utilizes a masked autoencoder to mitigate the effect of semantic noise with the aid of a discrete codebook shared by the transmitter and the receiver.

B. Speech Recognition
The exploration of speech recognition can be traced back several decades to systems exploiting the hidden Markov model [24]. Afterwards, neural network-powered speech recognition has experienced remarkable improvements owing to the thriving of natural language processing (NLP). About a decade ago, the deep neural network (DNN) was utilized for hybrid modeling of speech recognition systems [25]-[27], in which the DNN replaces the traditional Gaussian mixture model that is utilized to estimate the distribution of the hidden Markov model. However, such a hybrid modeling architecture keeps all other models in the speech recognition system, i.e., the acoustic model, lexicon model, and language model. Recently, a revolutionary transformation from hybrid modeling to end-to-end (E2E) modeling has been witnessed, which directly recognizes the token sequence from an input speech by leveraging a single integrated neural network, simplifying the speech recognition pipeline and bringing significant performance gains. Particularly, speech recognition systems combining a frequency-domain CNN with long short-term memory (LSTM) have been developed [28], [29], which analyze the temporal dependencies of excessively long speech sequences and greatly increase the recognition accuracy compared to the hybrid modeling systems. Deep LSTM-based speech recognition systems have been proposed in [30]-[32] to perform character-level transcription and remove specific phonetic representation, which boosts the self-supervised learning technologies [33]-[35]. Besides, the RNN Transducer (RNN-T) has been utilized in E2E speech recognition systems due to its natural streaming capability and has been widely investigated in academia and industry [36]-[38]. Furthermore, attention-enabled speech recognition systems have been developed due to their capability to model interactions within long sentences and their superior training efficiency [39], [40].

C. Speech Synthesis
The ultimate goal of speech synthesis is to generate intelligible and natural human speech corresponding to a text input, which implies that the utterance of each word is correct, the intonation of the synthesized speech is similar to that of a native speaker, and the speech quality is good and free of any background noise or speech artifacts. The early speech synthesis systems involve a huge database of small sound units, and generate the speech waveform by concatenating many small sound units and arranging the order of these units by an appropriate algorithm [41], [42]. Inspired by the breakthroughs of DL, speech synthesis systems employing different types of neural networks have been developed to generate very intelligible and clear speech [43]-[45]. However, the speech produced by such neural network-based systems still sounds mechanical and unnatural. Since the advent of WaveNet [46], the synthesized speech quality can be comparable to the real human voice by generating the waveform in the time domain from the input linguistic features. A convolutional attention-based speech synthesis system, Deep Voice 3 [47], has been proposed, which is trained on audio collected from over two thousand speakers. By taking text as input, Char2Wav [48] generates the speech waveform by utilizing an RNN-enabled reader and a neural vocoder. The sequence-to-sequence architecture, named Tacotron [49], maps the text into a magnitude spectrogram instead of linguistic and acoustic features, and simplifies the conventional synthesis procedure by a single neural network trained separately. The improved system, Tacotron 2 [50], produces the speech waveform from normalized character sequences and synthesizes a realistic human voice. Despite the tremendous success of the above autoregressive speech synthesis systems, the time to generate speech for long sentences could be several seconds.
More recently, some non-autoregressive speech synthesis pipelines have been developed [51]- [53] to eliminate the time dependency of the produced waveform and reduce the latency of the synthesis process.

III. SYSTEM MODEL
The semantic communication system for speech recognition and speech synthesis first extracts and transmits the low-dimensional text-related semantic features from the input speech, then recognizes the text sequence based on the received semantic features. Finally, the speech waveform is reconstructed at the receiver according to the recognized text and the user ID. In this section, we introduce the considered system model and performance metrics for speech recognition and speech synthesis tasks.

A. Input Spectrum and Text Information
The input speech sample sequence is converted into a spectrum before being fed into the transmitter. First, the input speech sample sequence, $\mathbf{m} = [m_1, m_2, \ldots, m_Q]$, is divided into $N$ frames. These frames are then converted into the spectrum through a Hamming window, fast Fourier transform (FFT), logarithm operation, and normalization. By doing so, the spectrum, $\mathbf{s} = [\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_N]$, contains the characteristics of the sample sequence, $\mathbf{m}$.
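The preprocessing chain described above (framing, Hamming window, FFT, logarithm, normalization) can be sketched in NumPy as follows. This is only an illustrative sketch: the function name `log_spectrum`, the frame length of 400 samples, the hop of 160 samples, the FFT size of 512, and the utterance-level mean/variance normalization are assumptions that are not specified in the text.

```python
import numpy as np

def log_spectrum(m, frame_len=400, hop=160, n_fft=512, eps=1e-10):
    """Convert a 1-D sample sequence m into a normalized log-magnitude spectrum.

    frame_len, hop, and n_fft are assumed values (25 ms / 10 ms at 16 kHz).
    """
    m = np.asarray(m, dtype=np.float64)
    if len(m) < frame_len:
        m = np.pad(m, (0, frame_len - len(m)))
    n_frames = 1 + (len(m) - frame_len) // hop
    # Divide the sequence into N overlapping frames.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = m[idx] * np.hamming(frame_len)      # Hamming window
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))  # one-sided FFT magnitude
    log_mag = np.log(mag + eps)                  # logarithm operation
    # Assumed normalization: zero mean and unit variance over the utterance.
    return (log_mag - log_mag.mean()) / (log_mag.std() + eps)
```

With these assumed settings, one second of audio at 16,000 Hz yields 98 frames of 257 frequency bins each.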
Moreover, denote $\mathbf{t}$ as the text corresponding to the speech sample sequence, $\mathbf{m}$. The ultimate goal of the speech recognition task is to recover the final text transcription, $\hat{\mathbf{t}}$, as close to $\mathbf{t}$ as possible. Denote $\mathbf{t} = [t_1, t_2, \ldots, t_K]$, where $t_k$ is a token from the token set, $\mathfrak{t}$, that could be a character in the alphabet or a word boundary. For English, there are 26 characters in the alphabet. Including the apostrophe, space, and blank as word boundaries gives 29 tokens, that is, $\mathfrak{t} = [a, b, c, \ldots, z, \text{apostrophe}, \text{space}, \text{blank}]$.

B. Transmitter
Based on the input spectrum and text sequence of the speech sample sequence, the proposed system model is shown in Fig. 1. In the figure, the transmitter consists of the semantic encoder and the channel encoder, implemented by two neural networks. At the transmitter, the input spectrum, $\mathbf{s}$, is converted into the text-related semantic features, $\mathbf{p}$, by the semantic encoder, and these features are mapped into symbols, $\mathbf{x}$, by the channel encoder to be transmitted over physical channels. Denote the neural network parameters of the semantic encoder and the channel encoder as $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, respectively; then the neural network parameters at the transmitter can be expressed as $\boldsymbol{\theta}^{\mathcal{T}} = (\boldsymbol{\alpha}, \boldsymbol{\beta})$. Hence, the encoded symbols, $\mathbf{x}$, can be expressed as
$$\mathbf{x} = T^{\mathcal{C}}_{\boldsymbol{\beta}}\left(T^{\mathcal{S}}_{\boldsymbol{\alpha}}(\mathbf{s})\right), \tag{1}$$
where $T^{\mathcal{S}}_{\boldsymbol{\alpha}}(\cdot)$ and $T^{\mathcal{C}}_{\boldsymbol{\beta}}(\cdot)$ indicate the semantic encoder and the channel encoder with respect to (w.r.t.) parameters $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, respectively.
The encoded symbols, $\mathbf{x}$, are transmitted over a physical channel. $\mathbf{x}$ is assumed to be normalized, i.e., $\mathbb{E}\|\mathbf{x}\|^2 = 1$.
In Fig. 1, the wireless channel, represented by $p_{\mathbf{h}}(\mathbf{y}|\mathbf{x})$, takes $\mathbf{x}$ as the input and produces the output, $\mathbf{y}$, as the received symbols. Denote the coefficients of a linear channel as $\mathbf{h}$; then the transmission process from the transmitter to the receiver can be modeled as
$$\mathbf{y} = \mathbf{h} * \mathbf{x} + \mathbf{w}, \tag{2}$$
where $\mathbf{w} \sim \mathcal{CN}(\mathbf{0}, \sigma^2 \mathbf{I})$ denotes the independent and identically distributed (i.i.d.) Gaussian noise vector with variance $\sigma^2$ for each channel.
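As an illustration of the channel model above, the following NumPy sketch normalizes the transmitted symbols and passes them through a linear channel with additive complex Gaussian noise. The function names `normalize_power` and `transmit` are hypothetical, and two points are assumptions rather than statements from the text: the power constraint is interpreted as unit average power per symbol, and the noise variance is derived from an SNR given in dB.

```python
import numpy as np

def normalize_power(x):
    """Scale complex symbols so that the average symbol power is one (assumed reading of the constraint)."""
    return x / np.sqrt(np.mean(np.abs(x) ** 2))

def transmit(x, snr_db, h=None, rng=None):
    """Simulate y = h * x + w, where * is convolution and w is circular complex Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    # Default to a single unit tap, i.e. an AWGN channel.
    h = np.array([1.0 + 0j]) if h is None else np.asarray(h, dtype=complex)
    sigma2 = 10 ** (-snr_db / 10)            # noise variance for unit signal power
    faded = np.convolve(h, x)[: len(x)]      # linear channel, truncated to the input length
    w = np.sqrt(sigma2 / 2) * (rng.standard_normal(len(x))
                               + 1j * rng.standard_normal(len(x)))
    return faded + w
```

A fading channel can be emulated by passing randomly drawn complex taps as `h`; how the taps are drawn is left to the caller.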

C. Receiver
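As in Fig. 1, the receiver includes the channel decoder and the feature decoder to recover the text-related semantic features and recognize the final text transcription as close to the raw text sequence as possible. First, the received symbols, $\mathbf{y}$, are mapped into the text-related semantic features, $\hat{\mathbf{p}}$, by the channel decoder, where $\hat{\mathbf{p}} = [\hat{\mathbf{p}}_1, \hat{\mathbf{p}}_2, \ldots, \hat{\mathbf{p}}_L]$ denotes a probability matrix and each probability vector $\hat{\mathbf{p}}_l = [\hat{p}^{1}_l, \hat{p}^{2}_l, \ldots, \hat{p}^{29}_l]$ comprises 29 probabilities corresponding to the 29 tokens in $\mathfrak{t}$. Denote the neural network parameters of the channel decoder as $\boldsymbol{\theta}^{\mathcal{R}}$; then the recovered features, $\hat{\mathbf{p}}$, can be obtained from the received symbols, $\mathbf{y}$, by
$$\hat{\mathbf{p}} = R^{\mathcal{C}}_{\boldsymbol{\theta}^{\mathcal{R}}}(\mathbf{y}), \tag{3}$$
where $R^{\mathcal{C}}_{\boldsymbol{\theta}^{\mathcal{R}}}(\cdot)$ indicates the channel decoder w.r.t. parameters $\boldsymbol{\theta}^{\mathcal{R}}$.
Then, the text-related semantic features, $\hat{\mathbf{p}}$, are decoded into the text transcription, $\hat{\mathbf{t}}$, by the feature decoder, denoted as
$$\hat{\mathbf{t}} = R^{\mathcal{F}}(\hat{\mathbf{p}}), \tag{4}$$
where $R^{\mathcal{F}}(\cdot)$ represents the feature decoder.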
The objective of the semantic communication system for the speech recognition task is to recover the text information of the input speech signals, which is equivalent to maximizing the posterior probability $p(\mathbf{t}|\mathbf{s})$. By introducing connectionist temporal classification (CTC) [54], the posterior probability $p(\mathbf{t}|\mathbf{s})$ can be expressed as
$$p(\mathbf{t}|\mathbf{s}) = \sum_{A \in \mathcal{A}(\mathbf{s}, \mathbf{t})} \prod_{l=1}^{L} p(a_l|\mathbf{s}), \tag{5}$$
where $\mathcal{A}(\mathbf{s}, \mathbf{t})$ represents the set of all possible valid alignments of text sequence $\mathbf{t}$ to spectrum $\mathbf{s}$, and $a_l$ is the $l$-th token of a valid alignment. For example, if the text sequence is $\mathbf{t} = [t, a, s, t, e]$, the valid alignments could be [blank, t, blank, a, s, blank, t, e], or [t, blank, a, s, blank, blank, t, e], etc., because the blank token is removed when obtaining the final text transcription, $\hat{\mathbf{t}}$. Note that the number of tokens in every valid alignment is $L$. If the valid alignment is [blank, t, blank, a, s, blank, t, e], the first token is blank, i.e., $a_1 = \text{blank}$, and we have
$$p(a_1|\mathbf{s}) = \hat{p}^{29}_1, \tag{6}$$
where $\hat{p}^{29}_1$ is one of the probabilities in probability vector $\hat{\mathbf{p}}_1$ and the superscript 29 indicates that the blank token is the 29th token in $\mathfrak{t}$.
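To make the sum over valid alignments concrete, the following sketch evaluates the CTC posterior by brute-force enumeration for a very small token set. The helper names `collapse` and `ctc_posterior` are hypothetical. The collapse rule used here is the standard CTC rule (merge repeated tokens, then remove blanks); the text above only mentions blank removal, so the merging step is an assumption taken from standard CTC practice. Practical implementations use a dynamic-programming forward algorithm instead of enumeration.

```python
from itertools import product

def collapse(alignment, blank):
    """Standard CTC rule (assumed): merge repeated tokens, then drop blanks."""
    out, prev = [], None
    for a in alignment:
        if a != prev and a != blank:
            out.append(a)
        prev = a
    return out

def ctc_posterior(probs, target, tokens, blank):
    """Sum the probabilities of all length-L alignments that collapse to target.

    probs[l][k] is the probability of tokens[k] at step l.
    Exponential in L, so only suitable for toy examples.
    """
    total = 0.0
    for idx in product(range(len(tokens)), repeat=len(probs)):
        alignment = [tokens[i] for i in idx]
        if collapse(alignment, blank) == list(target):
            p = 1.0
            for l, i in enumerate(idx):
                p *= probs[l][i]   # product over steps of p(a_l | s)
            total += p
    return total
```

For instance, with the two tokens `["a", "_"]` (where `"_"` stands for blank) and uniform probabilities over two steps, three of the four alignments collapse to `"a"`, so `ctc_posterior` returns 0.75 for the target `"a"`.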
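To maximize the posterior probability $p(\mathbf{t}|\mathbf{s})$, the CTC loss is adopted as the loss function for the speech recognition task in our system, denoted as
$$\mathcal{L}_{\mathrm{CTC}}(\boldsymbol{\theta}) = -\ln p(\mathbf{t}|\mathbf{s}) = -\ln\left(\sum_{A \in \mathcal{A}(\mathbf{s}, \mathbf{t})} \prod_{l=1}^{L} p(a_l|\mathbf{s})\right), \tag{7}$$
where $\boldsymbol{\theta}$ denotes the neural network parameters of the transmitter and the receiver, $\boldsymbol{\theta} = (\boldsymbol{\theta}^{\mathcal{T}}, \boldsymbol{\theta}^{\mathcal{R}})$. Moreover, for given prior channel state information (CSI), the neural network parameters, $\boldsymbol{\theta}$, can be updated by the stochastic gradient descent (SGD) algorithm as follows,
$$\boldsymbol{\theta}^{(i+1)} \leftarrow \boldsymbol{\theta}^{(i)} - \eta \nabla_{\boldsymbol{\theta}^{(i)}} \mathcal{L}_{\mathrm{CTC}}(\boldsymbol{\theta}), \tag{8}$$
where $\eta > 0$ is the learning rate and $\nabla$ indicates the nabla operator.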

D. Speech Synthesis Module
By introducing a flexible task mechanism, the information produced at the receiver is not limited to the text transcription but also includes the speech information, which grants users the privilege to check the speech characteristics and expands the system diversity. As shown in Fig. 1, the input to the speech synthesis module is the output of the feature decoder, $\hat{\mathbf{t}}$, and the speech synthesis module, represented by a neural network, converts $\hat{\mathbf{t}}$ into the speech sample sequence, $\hat{\mathbf{m}}$, aiming to reconstruct the speech waveform as close to the original sample sequence, $\mathbf{m}$, as possible. Denote the neural network parameters of the speech synthesis module as $\boldsymbol{\chi}$; then the reconstructed speech sample sequence, $\hat{\mathbf{m}}$, can be denoted as
$$\hat{\mathbf{m}} = R^{\mathcal{SS}}_{\boldsymbol{\chi}}(\hat{\mathbf{t}}), \tag{9}$$
where $R^{\mathcal{SS}}_{\boldsymbol{\chi}}(\cdot)$ indicates the speech synthesis module w.r.t. parameters $\boldsymbol{\chi}$.

E. Performance Metrics
1) Speech Recognition Task: In this task, measuring the semantic similarity between the recovered text and the original text is equivalent to measuring whether the recovered text is as readable and understandable as the original text. Intuitively, a text sequence is easier to read and understand if it contains fewer incorrect characters/words, which motivates metrics that count the incorrect characters/words in the recovered text sequence. Besides, character error-rate (CER) and word error-rate (WER) are well-established metrics in the speech recognition field to indicate the accuracy of the recognized text transcription. Therefore, we adopt CER and WER as two performance metrics for the speech recognition task. Starting from the text transcription, $\hat{\mathbf{t}}$, character-level substitution, deletion, and insertion operations are applied to restore the raw text sequence, $\mathbf{t}$. The calculation of CER can be denoted as
$$\mathrm{CER} = \frac{S_C + D_C + I_C}{N_C}, \tag{10}$$
where $S_C$, $D_C$, and $I_C$ represent the numbers of character substitutions, deletions, and insertions, respectively, and $N_C$ is the number of characters in $\mathbf{t}$.
Similarly, word-level substitution, deletion, and insertion operations are employed to calculate WER, which can be expressed as
$$\mathrm{WER} = \frac{S_W + D_W + I_W}{N_W}, \tag{11}$$
where $S_W$, $D_W$, and $I_W$ denote the numbers of word substitutions, deletions, and insertions, respectively, and $N_W$ is the number of words in $\mathbf{t}$.
Note that CER and WER may exceed one owing to a large number of insertions, since the numbers of substitutions and deletions together cannot exceed the length of the reference text. Moreover, for the same sentence, CER is typically lower than WER. In practice, the recognized sentence is usually readable when CER is lower than around 0.15.
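The two error rates can be computed with a standard Levenshtein (edit) distance, which returns the minimum total number of substitutions, deletions, and insertions. The sketch below is a minimal pure-Python illustration; the function names `edit_distance`, `cer`, and `wer` are hypothetical, and splitting words on whitespace is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions + deletions + insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error-rate of hypothesis hyp against reference ref."""
    return edit_distance(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word error-rate; words are split on whitespace (assumed tokenization)."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)
```

For the reference "the cat" and the hypothesis "the bat", a single character substitution gives a CER of 1/7 and a WER of 1/2. A hypothesis much longer than the reference, such as "a b c" against "a", gives a WER of 2.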
2) Speech Synthesis Task: The ultimate goal of the speech synthesis task is to reconstruct clear speech. Because audio cannot be attached to the paper for listening, a reasonable approach for assessing the quality of the synthesized speech is to compare it with the real speech. In this task, we adopt the unconditional Fréchet deep speech distance (FDSD) and the unconditional kernel deep speech distance (KDSD) [55] as two quantitative metrics to evaluate the distribution similarity between the synthesized speech and the real speech by measuring the Fréchet distance and the maximum mean discrepancy (MMD) between them, respectively. The lower the FDSD or KDSD values, the higher the similarity between the synthesized speech and the real speech, i.e., the higher the quality of the synthesized speech.
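Given the real speech sample sequence, $\mathbf{m}$, the synthesized speech sample sequence, $\hat{\mathbf{m}}$, and a publicly available deep speech recognition model, the features of the real and synthesized speech sample sequences, denoted as $\mathbf{D} \in \mathbb{R}^{U \times V}$ and $\hat{\mathbf{D}} \in \mathbb{R}^{U \times V}$, respectively, can be extracted by passing $\mathbf{m}$ and $\hat{\mathbf{m}}$ through the speech recognition model. Therefore, their FDSD can be calculated as
$$\mathrm{FDSD}^2 = \left\|\boldsymbol{\mu}_{\mathbf{D}} - \boldsymbol{\mu}_{\hat{\mathbf{D}}}\right\|^2 + \mathrm{Tr}\left(\boldsymbol{\Sigma}_{\mathbf{D}} + \boldsymbol{\Sigma}_{\hat{\mathbf{D}}} - 2\left(\boldsymbol{\Sigma}_{\mathbf{D}} \boldsymbol{\Sigma}_{\hat{\mathbf{D}}}\right)^{1/2}\right), \tag{12}$$
where $\boldsymbol{\mu}_{\mathbf{D}}$ and $\boldsymbol{\mu}_{\hat{\mathbf{D}}}$ represent the means of $\mathbf{D}$ and $\hat{\mathbf{D}}$, respectively, $\boldsymbol{\Sigma}_{\mathbf{D}}$ and $\boldsymbol{\Sigma}_{\hat{\mathbf{D}}}$ denote their covariance matrices, and $\mathrm{Tr}(\cdot)$ indicates the trace of a square matrix.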
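The Fréchet distance between two feature sets can be sketched in NumPy as below, using the standard squared Fréchet distance between Gaussians fitted to the features. The function name `frechet_distance_sq` is hypothetical, and the rows of each feature matrix are assumed to be the individual feature vectors. The trace of the matrix square root is computed from the eigenvalues of the covariance product, which avoids a dependency on a dedicated matrix square-root routine.

```python
import numpy as np

def frechet_distance_sq(d_real, d_synth):
    """Squared Fréchet distance between Gaussians fitted to two feature sets.

    d_real, d_synth: arrays of shape (U, V), one feature vector per row (assumed layout).
    """
    mu_r, mu_s = d_real.mean(axis=0), d_synth.mean(axis=0)
    cov_r = np.cov(d_real, rowvar=False)
    cov_s = np.cov(d_synth, rowvar=False)
    # Tr((cov_r cov_s)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of the product, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_r @ cov_s)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(np.sum((mu_r - mu_s) ** 2)
                 + np.trace(cov_r) + np.trace(cov_s) - 2.0 * tr_sqrt)
```

Identical feature sets give a distance of zero, and shifting every feature by a constant changes only the mean term.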
On the other hand, KDSD can be obtained by (13), where $k_f(\cdot)$ is a kernel function defined in (14).

IV. SEMANTIC COMMUNICATIONS FOR SPEECH RECOGNITION AND SYNTHESIS
In this section, we present the details of DeepSC-ST. Specifically, in the speech recognition task, CNN and RNN are adopted for the semantic encoding, and dense layers are employed for the channel encoding and decoding. In the speech synthesis task, Tacotron 2 [50] is utilized to reconstruct the speech waveform from the recognized text according to the user ID.

A. Model Description
The proposed DeepSC-ST is shown in Fig. 2. In the figure, $\mathbf{M}$ is the set of speech sample sequences drawn from the speech dataset and is converted into a set of spectra, $\mathbf{S} = [\mathbf{S}_1, \mathbf{S}_2, \ldots, \mathbf{S}_B]$, where $B$ is the batch size. In addition, $\mathbf{T}$ is the set of correct text sequences, $\mathbf{t}$, corresponding to $\mathbf{M}$. The spectra, $\mathbf{S}$, are fed into the semantic encoder to learn and extract the text-related semantic features and to output the features, $\mathbf{P}$. The details of the semantic encoder are presented in part B of this section. Afterwards, the channel encoder, implemented by two dense layers, converts $\mathbf{P}$ into $\mathbf{U}$. To transmit $\mathbf{U}$ over a physical channel, it is reshaped into symbols, $\mathbf{X}$, via a reshape layer.
The received symbols, $\mathbf{Y}$, are reshaped into $\mathbf{V}$ before being fed into the channel decoder, represented by three dense layers. The output of the channel decoder is the recovered text-related semantic features, $\hat{\mathbf{P}}$. Next, the greedy decoder, i.e., the feature decoder, decodes $\hat{\mathbf{P}}$ into the text transcriptions, $\hat{\mathbf{T}}$. The details of the greedy decoder are presented in part C of this section.
Furthermore, from Fig. 2, the speech sample sequences, $\hat{\mathbf{M}}$, can be reconstructed by the speech synthesis module, i.e., Tacotron 2, by combining the recognized text with the corresponding user ID. It is worth mentioning that Tacotron 2 is trained separately from the speech recognition task and can be omitted in communication scenarios where users only request the text information. Some details of Tacotron 2 are introduced in part E of this section.

B. Semantic Encoder
The semantic encoder is constructed from CNN modules and gated recurrent unit (GRU)-based bidirectional RNN (BRNN) [56] modules. As shown in Fig. 2, the input spectra, $\mathbf{S}$, are first converted into intermediate features via several CNN modules. Particularly, the number of filters in each CNN module is $E_p$, $p \in [1, 2, \ldots, P]$, and the output of the last CNN module is $\mathbf{b} \in \mathbb{R}^{B \times C_P \times D_P \times E_P}$. Then, $\mathbf{b}$ is fed into $Q$ BRNN modules successively to produce $\mathbf{d} \in \mathbb{R}^{B \times G_Q \times H_Q}$, where the number of GRU units in each BRNN module, $H_q$, $q \in [1, 2, \ldots, Q]$, is consistent. Finally, the text-related semantic features, $\mathbf{P}$, are obtained from $\mathbf{d}$ by passing it through multiple cascaded dense layers and a softmax layer.

C. Greedy Decoder
As mentioned above, the recovered features, $\hat{\mathbf{P}}$, are decoded into the text transcriptions, $\hat{\mathbf{T}}$, via the greedy decoder. An example of obtaining the text transcription by the greedy decoder is shown in Fig. 3. During the decoding process, in each step $l$, the maximum probability in the vector, $\hat{\mathbf{p}}_l$, is selected and mapped to the corresponding token in $\mathfrak{t}$. The final text transcription, $\hat{\mathbf{t}}$, is composed by concatenating all mapped tokens.
It is worth mentioning that the processes of selecting the maximum probability in $\hat{\mathbf{p}}_l$ and mapping the maximum probability to the corresponding token in $\mathfrak{t}$ are non-differentiable, which conflicts with the requirement of a differentiable loss function for training a neural network. Therefore, the greedy decoder cannot be implemented as a trainable neural network.
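The greedy decoding procedure can be illustrated with the short sketch below. The token ordering in `TOKENS` (letters, apostrophe, space, blank, with blank as the 29th token) follows the token set given earlier, while the function name `greedy_decode` and the `"<blank>"` marker are hypothetical. After the per-step argmax, the sketch merges repeated tokens and removes blanks, which is the standard CTC post-processing and is assumed here rather than stated in the text.

```python
import string

# 26 letters + apostrophe + space + blank = 29 tokens; blank is the last token.
TOKENS = list(string.ascii_lowercase) + ["'", " ", "<blank>"]
BLANK = len(TOKENS) - 1

def greedy_decode(p_hat):
    """Decode a probability matrix (L rows of 29 probabilities) into text."""
    # Step 1: pick the most probable token index at every step l.
    best = [max(range(len(row)), key=row.__getitem__) for row in p_hat]
    # Step 2 (assumed CTC post-processing): merge repeats, then drop blanks.
    chars, prev = [], None
    for k in best:
        if k != prev and k != BLANK:
            chars.append(TOKENS[k])
        prev = k
    return "".join(chars)
```

Because the argmax and the token lookup have no useful gradient, this function is applied only at inference time, after the trainable part of the receiver.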

D. Training and Testing for Speech Recognition Task
According to the prior knowledge of CSI, the training and testing algorithms for speech recognition task are described in Algorithm 1 and Algorithm 2, respectively. During the training stage, the greedy decoder and the speech synthesis module are omitted.

E. Speech Synthesis Module
As shown in Fig. 2, Tacotron 2 is adopted to reconstruct the speech sample sequences, $\hat{\mathbf{M}}$, as close to the input sample sequences, $\mathbf{M}$, as possible. Tacotron 2 is composed of a spectrogram prediction network and a variant of the WaveNet vocoder [57], which convert the input token sequence into the mel-frequency spectrogram and produce the time-domain waveform according to the predicted spectrogram, respectively. The mel-frequency spectrogram is obtained by applying a nonlinear transform to the frequency axis of the short-time Fourier transform (STFT) magnitude, which makes it straightforward for the WaveNet model to generate audio owing to the simple and low-level acoustic characteristic. Particularly, an encoder maps the input token sequence into the internal feature representation after a 512-dimensional token embedding, which is achieved by a stack of three convolutional layers with 512 filters followed by batch normalization and ReLU activation, as well as a bidirectional LSTM layer containing 512 units.
Subsequently, the encoded features are processed by a decoder enabled by a neural network to predict the mel-frequency spectrogram frame by frame. In detail, the previously synthesized frame is fed into a pre-net with two fully connected layers of 256 ReLU units, then the output is combined with the encoded features and the user ID to predict the current frame after passing through a stack of two LSTM layers containing 1024 units and a post-net with a stack of five convolutional layers with 512 filters followed by batch normalization and ReLU activation. The spectrogram prediction network is trained independently by minimizing the sum of the mean-squared error (MSE) before and after the post-net. Furthermore, the mel-frequency spectrogram frames are fed into the WaveNet vocoder to predict the parameters, e.g., mean and log scale, of the synthesized waveform after a ReLU activation followed by a linear projection. Since the log-likelihood is tractable, the WaveNet vocoder is trained by maximizing the log-likelihood of the speech waveform w.r.t. the trainable parameters.

Algorithm 1 Training algorithm for speech recognition task.
Initialization: initialize parameters $\boldsymbol{\theta}^{(0)}$, $i = 0$.
...

Algorithm 2 Testing algorithm for speech recognition task.
1: Input: Speech sample sequences $\mathbf{M}$ from the test set, trained networks $T^{\mathcal{S}}_{\boldsymbol{\alpha}}(\cdot)$, $T^{\mathcal{C}}_{\boldsymbol{\beta}}(\cdot)$, and $R^{\mathcal{C}}_{\boldsymbol{\delta}}(\cdot)$, testing channel set $\mathcal{H}$, a wide range of SNR regime.
...
7: Transmit $\mathbf{X}$ and receive $\mathbf{Y}$ via (2).
8: Decoding $\hat{\mathbf{P}}$ into $\hat{\mathbf{T}}$ via (4).
...
10: end for
11: end for
12: Output: Recovered text transcriptions, $\hat{\mathbf{T}}$.

V. NUMERICAL RESULTS
In this section, we compare the proposed DeepSC-ST with conventional communication systems and existing semantic communication systems under different channels, where accurate CSI is assumed at the receiver. The experiment is conducted on the LJSpeech dataset [58] for both speech recognition and speech synthesis tasks, which is a corpus of English speech with a sampling rate of 22,050 Hz. The speech is downsampled to 16,000 Hz. The adopted simulation environment is TensorFlow 2.4.

A. Simulation Setting and Benchmarks
In the proposed DeepSC-ST, the numbers of CNN modules and BRNN modules in the semantic encoder are two and six, respectively. The number of filters for each CNN module is 32 and the number of GRU units for each BRNN module is 800. Moreover, two dense layers are utilized in the channel encoder with 40 units and three dense layers are utilized in the channel decoder with 40, 40, and 29 units, respectively. The batch size is B = 24 and the learning rate is η = 0.0001.
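To visualize the layer sizes of the channel encoder and channel decoder listed above, the following NumPy sketch pushes a random stand-in for the semantic features through untrained dense layers and checks the resulting shapes. Everything beyond the unit counts and the batch size is an assumption made for illustration: the helper names, the ReLU activations on hidden layers, the softmax on the 29-unit output, the sequence length `L = 50`, the 29-dimensional input features, and the noiseless link between encoder and decoder.

```python
import numpy as np

def dense(x, weights, bias, activation=None):
    """Fully connected layer applied to the last axis of x."""
    z = x @ weights + bias
    if activation == "relu":
        return np.maximum(z, 0.0)
    if activation == "softmax":
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    return z

def init_stack(sizes, rng):
    """Random (untrained) weights for a stack of dense layers."""
    return [(rng.standard_normal((n_in, n_out)) / np.sqrt(n_in), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def run_stack(x, layers, activations):
    for (w, b), act in zip(layers, activations):
        x = dense(x, w, b, act)
    return x

rng = np.random.default_rng(0)
B, L, F = 24, 50, 29                          # B from the text; L and F are assumed
encoder = init_stack([F, 40, 40], rng)        # two dense layers with 40 units each
decoder = init_stack([40, 40, 40, 29], rng)   # three dense layers: 40, 40, 29 units

P = rng.random((B, L, F))                     # stand-in for the semantic features
U = run_stack(P, encoder, ["relu", None])     # channel encoder output
P_hat = run_stack(U, decoder, ["relu", "relu", "softmax"])  # noiseless link assumed
```

Under these assumptions, `U` has shape (24, 50, 40) and `P_hat` has shape (24, 50, 29), with each 29-dimensional row of `P_hat` summing to one so that it can be read as a probability vector over the tokens.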
The parameter settings of the proposed DeepSC-ST for the speech recognition task are summarized in Table I. For performance comparison, we provide the following four benchmarks:
1) Benchmark 1: The first benchmark is a conventional system that transmits speech signals, named speech transceiver. The input of the system is the speech signals, which are restored at the receiver. The text transcription is then obtained by passing the recovered speech signals through a speech recognition model. The adaptive multirate wideband (AMR-WB) system [59] is used for speech source coding and 64-QAM is utilized for modulation. Polar code with the successive cancellation list (SCL) decoding algorithm [60] is employed for channel coding, with a block length of 512 bits and a list size of four. Moreover, the speech recognition task, which aims to recover the text transcription accurately, is realized by the Deep Speech 2 model [29].
2) Benchmark 2: The second benchmark is a conventional system that transmits text, named text transceiver. Particularly, the input speech signals are converted into a text sequence before being fed into the conventional system, and the text sequence is recovered at the receiver. Huffman coding [61] is employed for text source coding, and the settings of channel coding and modulation are the same as those in benchmark 1. In addition, the Deep Speech 2 model is utilized to implement speech recognition at the transmitter. Furthermore, the recovered text sequence is passed through the Tacotron 2 model [50] to reconstruct the speech signals in the speech synthesis task.
3) Benchmark 3: The third benchmark is a conventional system that transmits the extracted text-related semantic features produced by the semantic encoder of DeepSC-ST, named feature transceiver. These features are floating-point vectors, which are encoded into a bit sequence for transmission by passing through the IEEE 754 floating-point arithmetic [62] module, the polar code module, and the 64-QAM modulator. At the receiver, the text transcription is estimated from the recovered features by a greedy decoder, and the speech signals are reconstructed by feeding the recognized text into the Tacotron 2 model.

4) Benchmark 4: The fourth benchmark is a hybrid modeling system that incorporates the speech recognition model and the DeepSC [9] model, named SR+DeepSC. Particularly, the input speech is first converted to text by the Deep Speech 2 model at the transmitter. Then the text is fed into the DeepSC transmitter before transmission and restored by the DeepSC receiver. Moreover, the recovered text is utilized to synthesize the speech sequence by the Tacotron 2 model.
Note that the conventional communication paradigm is adopted in the speech transceiver, the text transceiver, and the feature transceiver, while the DL-enabled semantic communication paradigm is leveraged in the SR+DeepSC and the proposed DeepSC-ST.
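To illustrate why the feature transceiver in benchmark 3 is costly, the sketch below serializes floating-point features into IEEE 754 bit strings and counts the 64-QAM symbols needed to carry them. It assumes single-precision (32-bit) encoding, which the paper does not specify, and it ignores the redundancy added by the polar code. All function names are hypothetical.

```python
import math
import struct


def float32_to_bits(x):
    """Big-endian IEEE 754 single-precision bit string (assumed precision)."""
    (as_int,) = struct.unpack(">I", struct.pack(">f", x))
    return format(as_int, "032b")


def bits_to_float32(bits):
    """Inverse of float32_to_bits for a 32-character bit string."""
    (x,) = struct.unpack(">f", struct.pack(">I", int(bits, 2)))
    return x


def qam_symbols_needed(n_bits, bits_per_symbol=6):
    """Symbols needed for n_bits; 64-QAM carries log2(64) = 6 bits per symbol."""
    return math.ceil(n_bits / bits_per_symbol)
```

For example, 100 features at 32 bits each amount to 3,200 bits, so `qam_symbols_needed(3200)` gives the uncoded symbol count before any channel-coding overhead.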

B. Complexity Analysis
The system complexity of the different transmission schemes for the speech recognition task is compared as follows. The proposed DeepSC-ST includes 85,796,042 trainable parameters, whereas the SR+DeepSC has 92,212,016 trainable parameters, which is a 7.48% increase in system complexity over the DeepSC-ST. The system complexity of the feature transceiver is at the same level as that of the DeepSC-ST. Moreover, the Deep Speech 2 model adopted in the speech transceiver and the text transceiver has 85,788,733 trainable parameters, which offers almost no reduction in computational burden compared with the DeepSC-ST. Besides, extra communication resources are required because of the conventional encoding/decoding mechanism. Therefore, in terms of system complexity, the proposed DeepSC-ST is at the same level as the speech transceiver, the text transceiver, and the feature transceiver, and slightly lower than the SR+DeepSC.
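The relative complexity figures above follow directly from the reported parameter counts, as the short check below shows. The dictionary and the function name are illustrative only.

```python
# Trainable-parameter counts reported in the text.
PARAMS = {
    "DeepSC-ST": 85_796_042,
    "SR+DeepSC": 92_212_016,
    "Deep Speech 2": 85_788_733,
}


def relative_increase(value, baseline):
    """Percentage by which `value` exceeds `baseline`."""
    return 100.0 * (value - baseline) / baseline


sr_deepsc_overhead = relative_increase(PARAMS["SR+DeepSC"], PARAMS["DeepSC-ST"])
deep_speech_gap = relative_increase(PARAMS["Deep Speech 2"], PARAMS["DeepSC-ST"])
```

Here `sr_deepsc_overhead` rounds to 7.48, while `deep_speech_gap` is below 0.01% in magnitude, which is why the two conventional benchmarks are described as being at the same complexity level.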
Furthermore, as mentioned above, the semantic communication system transmits much less data than the source information. We compute the average number of encoded symbols required to transmit 16,000 original speech samples in the proposed DeepSC-ST and the benchmarks, as summarized in Table II. From the table, the proposed DeepSC-ST requires nearly ten times fewer symbols than the speech transceiver and the feature transceiver, while the text transceiver and the SR+DeepSC need 60 encoded symbols on average because the text comprises only a few tokens for the long speech sample sequence.

C. Experiments for Speech Recognition Task
The relationship between the CTC loss and the number of epochs is shown in Fig. 4. From the figure, when the learning rate is 0.0005, the CTC loss converges most slowly and has the highest value among the three learning rates, and it fluctuates after 30 epochs. When the learning rate is 0.0001 or 0.00005, the CTC loss decreases to around 10 after 20 epochs and converges after about 40 epochs.
According to (2), the fading channel, h, and the Gaussian noise, w, are specified during the training process, and the trained DeepSC-ST is utilized to test the system performance under various channel environments. Intuitively, when testing the system performance under a certain channel condition, the DeepSC-ST model trained under the same channel condition performs better than the DeepSC-ST models trained under other channel conditions. However, it is impractical to deploy numerous DeepSC-ST models corresponding to dynamic channel environments because of the scarce computation resources, which inspires us to investigate a DeepSC-ST model that is capable of dealing with channel variations. By comparing the testing performance of different DeepSC-ST models trained under three fading channels, i.e., the AWGN channels, the Rayleigh channels, and the Rician channels, the DeepSC-ST trained under the Rician channels and the Gaussian noise with SNR=8 dB is identified as a robust model to cope with diverse channel environments.
The proposed DeepSC-ST has lower CER scores than the benchmarks under all tested channel environments. Particularly, the recognized text transcription in the proposed DeepSC-ST is readable when the SNR is higher than -2 dB under the AWGN channels, while the required SNR is 4 dB in the speech transceiver, the feature transceiver, and the text transceiver. SR+DeepSC obtains no readable text under the adopted fading channels and SNRs. In addition, the DeepSC-ST performs steadily across dynamic channels when SNR>0 dB, whereas the performance of the benchmarks is quite poor under different channel conditions. Moreover, the DeepSC-ST significantly outperforms the benchmarks when the SNR ranges from -12 dB to 4 dB for the AWGN channels, and from -12 dB to 8 dB for the Rayleigh channels and the Rician channels.
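The channel conditions used for training and testing can be sketched as below, assuming that (2) takes the common form y = hx + w with unit average signal power. The Rician K-factor, the block-fading assumption (one coefficient h per call), and all function names are assumptions made for illustration, since they are not given in this section. Perfect CSI is modeled by dividing the received symbols by h.

```python
import math
import random


def noise_std_from_snr(snr_db, signal_power=1.0):
    """Per-dimension std of complex Gaussian noise for a target SNR in dB."""
    noise_power = signal_power / (10 ** (snr_db / 10))
    return math.sqrt(noise_power / 2)


def rician_coefficient(rng, k_factor):
    """Fading coefficient with E[|h|^2] = 1; k_factor = 0 reduces to Rayleigh."""
    los = math.sqrt(k_factor / (k_factor + 1))
    scatter_std = math.sqrt(1 / (2 * (k_factor + 1)))
    return complex(los + rng.gauss(0, scatter_std), rng.gauss(0, scatter_std))


def transmit(symbols, snr_db, rng, k_factor=None):
    """Apply y = h*x + w, then equalize with perfect CSI.

    k_factor=None models AWGN (h = 1), 0 models Rayleigh, > 0 models Rician.
    """
    h = complex(1, 0) if k_factor is None else rician_coefficient(rng, k_factor)
    std = noise_std_from_snr(snr_db)
    received = [h * x + complex(rng.gauss(0, std), rng.gauss(0, std)) for x in symbols]
    return [y / h for y in received], h
```

For instance, `transmit(symbols, snr_db=8, rng=random.Random(0), k_factor=2.0)` would mimic one Rician block at the SNR identified for the robust model, with the K-factor value chosen arbitrarily here.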
The fluctuation in the speech transceiver and the text transceiver for SNR>14 dB occurs because the Deep Speech 2 model is trained on clean speech signals, which results in slight uncertainty when processing speech signals at similar noise levels. Fig. 6 compares the WER of different approaches. From the figure, the proposed DeepSC-ST provides lower WER and outperforms the speech transceiver under various channel conditions, as well as the text transceiver and the feature transceiver when SNR<8 dB. Moreover, similar to the CER results, the DeepSC-ST has low WER on average when coping with channel variations, while the conventional systems and SR+DeepSC provide poor WER scores when the SNR is low. According to the simulation results, the DeepSC-ST recovers the text transcription at the receiver from the input speech signals at the transmitter more accurately than the conventional communication systems and the existing semantic communication systems in complicated communication scenarios, especially in the low SNR regime.
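The CER and WER metrics discussed above are commonly computed from the Levenshtein edit distance between the reference and the recognized text, normalized by the reference length. The sketch below follows this common definition, which is assumed to match the one used in the paper; the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions, deletions, insertions) between sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

A single wrong character therefore costs one full word in WER but only one character in CER, which is why WER is typically the harsher of the two scores.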

D. Experiments for Speech Synthesis Task
In this experiment, we test the FDSD and KDSD scores of the received speech in the speech transceiver and of the synthesized speech in the feature transceiver, the text transceiver, the SR+DeepSC, and the proposed DeepSC-ST. In practice, audio is generally treated as unacceptable when it is unintelligible or full of background noise. Therefore, FDSD and KDSD thresholds are necessary to determine the validity of the audio. Inspired by this, a survey is developed to collect the opinions of 100 evaluators, who select the satisfaction degree of the speech signals corresponding to different FDSD and KDSD scores. Particularly, the Deep Speech 2 model is utilized to extract the aforementioned features, D and D, to compute the FDSD and KDSD scores. The survey result is summarized in Table III. From the table, the speech signals can be considered acceptable when their FDSD and KDSD scores are less than 10 and 0.012, respectively.
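The acceptability thresholds from the survey can be applied to a score-versus-SNR curve to find the lowest SNR from which the speech remains valid. The helper below is a hypothetical sketch of that check; the curve passed to it would come from measured FDSD or KDSD scores, and no measured values are assumed here.

```python
FDSD_THRESHOLD = 10.0    # acceptable when FDSD is below this value (Table III)
KDSD_THRESHOLD = 0.012   # acceptable when KDSD is below this value (Table III)


def min_acceptable_snr(scores_by_snr, threshold):
    """Lowest SNR from which the score stays below `threshold` at all higher SNRs.

    `scores_by_snr` maps SNR in dB to a score. Returns None if the score at
    the highest SNR is already at or above the threshold.
    """
    result = None
    for snr in sorted(scores_by_snr, reverse=True):
        if scores_by_snr[snr] < threshold:
            result = snr
        else:
            break
    return result
```

Scanning from the highest SNR downward avoids reporting an isolated low-SNR point that happens to fall below the threshold while higher SNRs do not.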
The FDSD and KDSD results of the DeepSC-ST and the four benchmarks are shown in Fig. 7 and Fig. 8, respectively, where the ground truth is the FDSD and KDSD scores computed by passing the plain text sequence through the Tacotron 2 model directly. From the figures, the DeepSC-ST obtains lower FDSD and KDSD scores than the SR+DeepSC under all tested channel conditions. Besides, it achieves better speech recovery, i.e., lower FDSD and KDSD scores, than the speech transceiver, the feature transceiver, and the text transceiver from -8 dB to 2 dB under the AWGN channels, as well as from nearly -10 dB to 10 dB under the Rayleigh channels and the Rician channels. Particularly, when the SNR is lower than around 4 dB under the AWGN channels, the received speech signals in the speech transceiver are invalid, while the synthesized waveform in the proposed DeepSC-ST is acceptable for SNR>-10 dB, which demonstrates the adaptability of DeepSC-ST in the low SNR regime. In addition, the proposed DeepSC-ST outperforms the adopted benchmarks under the Rayleigh channels and the Rician channels across all the tested SNRs. Furthermore, the DeepSC-ST is robust to diverse channel environments because the SNR thresholds yielding a valid speech waveform are nearly the same in different fading channels.

E. DeepSC-ST Demonstration
Based on the proposed DeepSC-ST, we design a software demonstration with a user interface, as shown in Fig. 9. In the figure, users can choose a local .wav file or record the speech input. Besides, the software demonstration allows users to specify   Table IV and Fig. 10, respectively.

VI. CONCLUSIONS
In this paper, we investigated a DL-enabled semantic communication system for speech recognition and speech synthesis tasks, named DeepSC-ST, which recovers the text transcription by utilizing the text-related semantic features and reconstructs the speech sample sequence at the receiver. Particularly, we designed a joint semantic-channel coding scheme to learn and extract semantic features and to mitigate the channel effects for speech recognition. Simulation results verified that the DeepSC-ST outperforms the conventional communication systems and the existing semantic communication systems, especially in the low SNR regime. Moreover, we built a software demonstration that accepts real human speech input as a proof-of-concept.
Our proposed DeepSC-ST is envisioned to be a promising candidate for semantic communication systems for speech recognition and speech synthesis tasks.