LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading

The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose LipSound2 which consists of an encoder-decoder architecture and location-aware attention mechanism to map face image sequences to mel-scale spectrograms directly without requiring any human annotations. The proposed LipSound2 model is firstly pre-trained on $\sim$2400h multi-lingual (e.g. English and German) audio-visual data (VoxCeleb2). To verify the generalizability of the proposed method, we then fine-tune the pre-trained model on domain-specific datasets (GRID, TCD-TIMIT) for English speech reconstruction and achieve a significant improvement on speech quality and intelligibility compared to previous approaches in speaker-dependent and -independent settings. In addition to English, we conduct Chinese speech reconstruction on the CMLR dataset to verify the impact on transferability. Lastly, we train the cascaded lip reading (video-to-text) system by fine-tuning the generated audios on a pre-trained speech recognition system and achieve state-of-the-art performance on both English and Chinese benchmark datasets.


I. INTRODUCTION
INSPIRED by human bimodal perception [1], in which both sight and sound are used to improve the comprehension of speech, considerable effort has been spent on speech processing tasks that leverage visual information, for example, integrating simultaneous lip movement sequences into speech recognition [2], [3], guiding neural networks in isolating target speech signals with a static face image for speech separation [4], [5], and grounding speech recognition with visual objects and scene information [6], [7]. Multi-modal audio-visual methods achieve significant improvements over single-modality models, since the visual signals are invariant to acoustic noise and complementary to auditory representations [8]. Moreover, the visual contribution becomes more important as the acoustic signal-to-noise ratio decreases [9].
In most approaches, the visual information is mainly used as auxiliary input to complement audio signals. However, in some circumstances, the auditory information may be absent or extremely noisy, which motivates speech reconstruction. Speech reconstruction aims to generate intelligible, high-quality speech conditioned only on image sequences of talking mouths or faces. Generating intelligible speech from silent videos enables many applications, e.g. a silent visual input method on mobile phones for privacy protection in public areas [10]; communication assistance for patients who have undergone laryngectomy [11]; surveillance video understanding when only visual signals are available [12]; enhancement of video conferences or far-field human-robot interaction in noisy environments [13]; and non-disruptive user intervention for autonomous vehicles [14].
It is challenging to reconstruct high-quality, intelligible speech from only mouth or face movements, since human speech is produced not only by externally observable organs, such as the lips and tongue, but also by internal ones that are difficult to capture in most cases [15], for instance, the vocal cords and pharynx. Consequently, it is hard to infer the fundamental frequency or voicing information controlled by these organs. Moreover, some phonemes are acoustically distinct but not easy to distinguish visually, since they share the same places of articulation but differ in manner of articulation [16]. For example, /v/ and /f/ in English are both fricatives and look the same in lip and teeth movements, but differ in the vibration of the vocal cords (voiced vs. unvoiced) and in aspiration (unaspirated vs. aspirated), neither of which is visible in most video recordings. Hence, predicting human voices from appearance is still a challenging task [17].
In recent years, there has been growing interest in speech reconstruction and various methods have been proposed. One possible technique is to run lip reading (video-to-text) and text-to-speech (TTS) systems in cascade, but lip reading performance is still unsatisfactory and its errors propagate to the TTS stage. Alternatively, other researchers directly estimate speech representations from videos, for example, linear predictive coding [18], bottleneck features [19], and mel-scale spectrograms [20], followed by a vocoder that transforms the intermediate representations to audio, for instance, STRAIGHT [21] and WORLD [22]. In contrast to the cascaded approach, speaker identity and speaking style can be largely preserved. However, most existing work only focuses on speaker-dependent settings with a small vocabulary or artificial grammar dataset, or even builds one model for each individual speaker, which does not meet the requirements of realistic scenarios.
In our previous work, we proposed LipSound [20] to directly map visual sequences to a low-level speech representation, i.e. the mel-spectrogram, which is inspired by audio-visual self-supervised representation learning. By leveraging the natural co-occurrence of audio and visual streams in videos without requiring any human annotations, or by treating one modality as the supervision of the other, self-supervised representation learning has received substantial interest, for example, learning representations by matching the temporal synchronization [23] or spatial alignment [24] of audio and video clips for action recognition.
In comparison to our previous work LipSound, which only focuses on speaker-dependent settings for the GRID artificial grammar dataset, in this paper we further explore to what extent large-scale crossmodal self-supervised pre-training can benefit speech reconstruction in terms of generalizability (speaker-independent) and transferability (non-Chinese to Chinese), including on the large-vocabulary continuous speech corpus TCD-TIMIT. In addition, we changed the LipSound architecture substantially by replacing 1D CNN with 3D CNN blocks (Conv 3D + Batch Norm + ReLU + Max Pooling + Dropout), which should enable the model to learn stable representations directly from raw pixels, and by using a location-aware attention mechanism to make the alignments between encoder and decoder more robust to non-verbal areas. Moreover, we replace the Griffin-Lim algorithm [25] with a neural vocoder to generate waveforms and voices smoothly.
As shown in Fig. 1 (a), our approach first pre-trains the LipSound2 model on a large-scale multi-lingual audio-visual corpus (VoxCeleb2) to map silent videos to mel-spectrograms, then fine-tunes the pre-trained model on domain-specific datasets (GRID, TCD-TIMIT and CMLR), followed by a neural vocoder (WaveGlow [26]) that converts the estimated mel-spectrograms to waveforms. Lip reading (video-to-text) experiments are performed by fine-tuning a pre-trained acoustic model (Jasper [27]) on the generated audio, as shown in Fig. 1 (b).
The main contributions of this paper are:
1) We propose an auto-regressive encoder-decoder with attention architecture, LipSound2, to directly map silent facial movement sequences to mel-scale spectrograms for speech reconstruction, which does not require any human annotations.
2) We explore the model generalizability on speaker-independent and large-scale vocabulary datasets, which few studies have focused on, and we achieve better performance on speech quality and intelligibility in the speech reconstruction task.
3) To the best of our knowledge, no previous research has investigated Chinese speech reconstruction in speaker-dependent and -independent cases.
4) By leveraging large-scale self-supervised pre-training of LipSound2 and the advanced Jasper speech recognition model, our cascaded lip reading system outperforms existing models by a margin on both English and Chinese corpora.
The paper is organised as follows. Section II reviews related work on lip-to-speech reconstruction, lip reading and self-supervised learning. Section III provides the model details, followed by the description of datasets and evaluation metrics in Section IV. Experimental results and discussion are presented in Section V and Section VI, respectively. We conclude this paper in Section VII.

II. RELATED WORK
A. Lip to Speech Reconstruction
In recent years, researchers have investigated a variety of approaches to speech reconstruction from silent videos. We only review the neural network methods in this paper.
Le Cornu et al. [28] propose to use fully connected neural networks to estimate spectral envelope representations, for instance linear predictive coding (LPC) coefficients and mel-filterbank amplitudes, from visual feature inputs, such as the two-dimensional discrete cosine transform, followed by a STRAIGHT vocoder [21], which is used to synthesize time-domain speech signals from the estimated representations. Follow-up work [29] predicts speech-related codebook entries with a classification framework to further improve speech intelligibility. Instead of using handcrafted visual features, Ephrat et al. [18] utilize convolutional neural networks (CNNs) to automatically learn optimal features from raw pixels and show promising results on out-of-vocabulary experiments. Subsequently, improved results are reported by Ephrat et al. [30] by combining a ResNet backbone and a post-processing network on a large-scale vocabulary dataset, TCD-TIMIT [31]. Akbari et al. [19] treat the intermediate bottleneck features learned by a speech auto-encoder as training targets by conditioning on lip reading network outputs. Kumar et al. [32] validate the effectiveness of using multiple views of faces on both speaker-dependent and -independent speech reconstruction. Vougioukas et al. [33] utilize generative adversarial networks (GANs) to directly predict raw waveforms from visual inputs in an end-to-end fashion without generating an intermediate audio representation. Inspired by the speech synthesis model Tacotron2 [34], Qu et al. [20] propose to directly map video inputs to a low-level speech representation, the mel-spectrogram, with an encoder-decoder architecture and achieve better results on lip reading experiments. Afterwards, Prajwal et al. [35] improve the model performance with 3D CNNs and skip connections. Recently, Michelsanti et al. [36] have presented a multi-task architecture to learn spectral envelope, aperiodic parameters and fundamental frequency separately, which are then fed into a vocoder for waveform synthesis. They integrate a connectionist temporal classification (CTC) [37] loss to jointly perform lip reading, which is capable of further enhancing and constraining the video encoder.
In addition to sequences of lip or face images, further signals can be used for temporal self-supervision. For instance, Gonzalez et al. [38] generate speech from articulatory sensor data and Akbari et al. [39] reconstruct speech from invasive electrocorticography. However, most existing work only focuses on a speaker-dependent setting and small vocabulary or artificial grammar datasets. In this paper, we evaluate our method not only on speaker-dependent experiments but also on speaker-independent and large-scale vocabulary setups.

B. Lip Reading
Lip reading, also known as visual speech recognition, is the task of predicting text transcriptions from silent videos, such as mouth or face movement sequences. Research on lip reading has a long tradition. Approaches to lip reading generally fall into two categories at the feature level: a) handcrafted visual feature extraction, such as the Discrete Cosine Transform [40], Discrete Wavelet Transform [41] or Active Appearance Models [42]; b) representations learned by neural networks, which have become the dominant technique for this task, for example, using convolutional auto-encoders [43], spatio-temporal convolutional neural networks [44], long short-term memory [45], or residual networks [46].
Alternatively, lip reading methods can be divided by modeling unit into word-level and character-level approaches. a) In the case of word-level units, lip reading is simplified to a classification task. Word-level lip reading datasets and benchmarks have been built, for instance LRW [47] for English and LRW-1000 [48] for Chinese. Stafylakis et al. [46] adopt spatiotemporal convolutional networks and a 2D ResNet as the front end to extract visual features and bidirectional Long Short-Term Memory networks as the back end to capture temporal information, and attain significant improvement. Weng et al. [49] present two separate deep 3D CNN front ends to learn features from grayscale video and optical flow inputs, respectively. Martinez et al. [50] replace the recurrent neural networks widely used in past work with Temporal Convolutional Networks to simplify the training procedure. Word-level methods are usually able to achieve high accuracy; however, the models disregard the interaction or co-articulation between phonemes or words. A predefined lexicon with a closed-set vocabulary is used and words are usually treated as isolated units in speech. Thereby, long-term context information and assimilation or dissimilation effects are completely neglected. Moreover, it is hard to recognize out-of-vocabulary words.
b) Lip reading models with character- or phoneme-level units mainly use methods proposed in speech recognition. Assael et al. [44] conduct end-to-end sentence-level lip reading experiments with CTC loss. Subsequently, sequence discriminative training [51] and domain-adversarial training [52] are introduced to lip reading. Chung et al. [2] collected the 'Lip Reading Sentences' (LRS) dataset, which consists of hundreds of thousands of videos from BBC television and has significantly promoted research on sentence-level lip reading. Shillingford et al. [53] verify the effectiveness of large-scale data (3,886 hours of video) for training continuous visual speech recognition. Afouras et al. [54] compare the performance of recurrent neural networks, fully convolutional neural networks and Transformers on character-level lip reading.
Different from the mainstream methods which directly transform videos to text, we perform lip reading experiments in a cascaded manner, in which the silent videos are firstly mapped to audios with our LipSound2 model, then text transcriptions are predicted by fine-tuning on a pretrained speech recognition system.

C. Self-supervised Learning
As a form of unsupervised learning, self-supervised learning leverages massive unlabelled data and aims to learn effective intermediate representations with the supervision of self-generated labels. Training on unlabelled data in a supervised manner relies on pretext tasks that determine which labels and loss functions are used. In computer vision, the pretext tasks can be predicting angles of rotated images [55], learning the relative position of segmented regions in an image [56], placing shuffled patches back [57] or colorizing grayscale input images [58]. The video-based pretext tasks can be tracking moving objects in videos [59], validating temporal frame orders [60], video colorization [61], and so on.
Self-supervised learning is also widely used in natural language processing. It has made substantial progress recently, with diverse pretext tasks proposed, for instance, predicting center words using surrounding ones or vice versa [62], generating the next word by conditioning on previous words in an auto-regressive fashion [63], completing masked tokens or consecutive utterances [64], recovering the order of shuffled words [65] or the permutation of rotated sentences [66].
Inspired by the strong correlation between different modalities where, for example, the audio and visual modalities are consistent semantically or happen synchronously, more and more researchers work on multi-modality or cross-modality self-supervised learning. Multi-modality self-supervised learning aims to learn joint or shared latent spaces or representations, while cross-modality self-supervised learning treats one modality as the supervision of the other. Here we only review the audio-visual modalities since this is the main focus of our paper. Different pretext tasks are designed according to the correspondence and synchronization of audio and visual modalities, for instance, predicting whether image and audio clips correspond to enable neural networks to classify sounds [67], learning cross-modal retrieval [68], locating the sound source in an image [69], learning representations by matching the temporal synchronization [23] or spatial alignment [24] of audio and video clips for action recognition, and combining a contrastive loss and a clustering loss to learn high-level semantic representations for understanding visual events and concepts [70]. In this paper, we focus on cross-modal self-supervised learning where the corresponding audio signals are treated as the supervision of face sequence inputs.

III. MODEL ARCHITECTURE
Fig. 2 shows the LipSound2 model architecture. We split the video clips into an audio stream used as the training target and a visual stream used as model input. The system consumes the visual part to predict the audio counterpart in a self-supervised fashion. The proposed architecture is composed of an encoder-decoder and an attention model to map the soundless visual sequences to a low-level acoustic representation, the mel-scale spectrogram. Compared with directly predicting raw waveforms, working with mel-spectrograms not only reduces computational complexity but also makes long-distance dependencies easier to learn. Model details are listed in Table I. A pre-trained neural vocoder, WaveGlow, then reconstructs the raw waveform from the generated mel-spectrogram.

A. Encoder
Multi-task Cascaded Convolutional Networks (MTCNN) [71] are used to detect face landmarks from raw videos. We crop only the face region (112 × 112 pixels) and smooth the landmarks across frames, since low-resolution videos or profile faces sometimes lead to detection failures, and landmark smoothing can eliminate frame skips between adjacent images. The cropped face sequences are then fed into 3D CNN blocks, each consisting of a 3D convolution, Batch Normalization, ReLU activation, Max Pooling and Dropout, as shown in Fig. 2. Two bidirectional LSTM layers follow, which capture long-distance dependencies from the left and right context.

B. Location-sensitive Attention
We use location-aware attention [72] to bridge the encoder and the decoder. The image sequence input $i = (i_0, \ldots, i_n)$ is first embedded into the latent space representation vector $h = (h_1, \ldots, h_n)$ by the encoder with the same dimension $n$ in time, then the intermediate vector $h$ is decoded into the mel-spectrogram $o = (o_0, \ldots, o_m)$. At time step $t$ ($0 \le t \le m$), the attention weight $a_t$ is computed using the matrices $W$, $M$, $Q$ and $L$, which are learned by Weight FC (fully connected), Memory FC, Query FC and Location FC, respectively. In Eq. (3), the sum of attention weights of all previous steps is integrated, which enables the current step attention to be aware of the global location and move forward monotonically. Fig. 3 visualizes the computational flow of the attention mechanism. The attention content vector $v_t$ can be obtained by multiplying the encoder output by the normalized attention weights (Eq. (4)).
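The attention step described above can be sketched in NumPy as follows. This is only an illustration in the style of standard location-sensitive attention, not the exact equations of this paper: the layer sizes, the averaging kernel `loc_kernel` (a hypothetical stand-in for a learned location convolution), the additive tanh scoring form and the random parameters are all assumptions. Only the roles of W, M, Q and L, the use of the accumulated previous attention weights, and the weighted sum over encoder outputs are taken from the text.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_step(h, query, cum_weights, params):
    """One decoding step of location-sensitive attention (illustrative).

    h:           (n, d_enc) encoder outputs (the "memory")
    query:       (d_q,) decoder-side query at the current step
    cum_weights: (n,) sum of the attention weights of all previous steps
    """
    # Location features: smooth the cumulative weights with a 1-D kernel
    # (assumed stand-in for a learned location convolution).
    loc = np.convolve(cum_weights, params["loc_kernel"], mode="same")[:, None]
    energies = np.tanh(
        h @ params["M"]          # Memory FC
        + query @ params["Q"]    # Query FC, broadcast over encoder time
        + loc @ params["L"]      # Location FC
    ) @ params["W"]              # Weight FC, gives one score per encoder step
    a_t = softmax(energies)      # normalized attention weights
    v_t = a_t @ h                # attention content vector
    return a_t, v_t

# Toy dimensions and random parameters (all hypothetical).
rng = np.random.default_rng(0)
n, d_enc, d_q, d_att = 6, 8, 5, 4
params = {
    "M": rng.normal(size=(d_enc, d_att)),
    "Q": rng.normal(size=(d_q, d_att)),
    "L": rng.normal(size=(1, d_att)),
    "W": rng.normal(size=(d_att,)),
    "loc_kernel": np.ones(3) / 3.0,
}
h = rng.normal(size=(n, d_enc))
cum = np.zeros(n)
for _ in range(3):
    a_t, v_t = attention_step(h, rng.normal(size=d_q), cum, params)
    cum += a_t  # accumulate weights so later steps see where attention has been
```

Feeding the accumulated weights back into the score is what lets each step know which encoder positions have already been attended to.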

C. Decoder
The decoder module consists of one unidirectional LSTM layer and one linear projection layer. The decoder LSTM consumes the attention content vector and the output of the attention LSTM to generate one frame at a time. Subsequently, the linear projection layer maps the decoder LSTM outputs to the dimension of the mel-scale filter bank. During training, we use ground-truth mel-spectrogram frames as PreNet inputs, and during inference, the predicted frames from previous time steps are used. Since the decoder only receives past information at every time step, five Conv1D layers (postnet) are applied after decoding to further improve the model performance by smoothing the transitions between adjacent frames and by using future information that is not available during decoding.

D. Training Objective
The loss function is the sum of two mean squared errors (MSE), as shown in Eq. (5), i.e. the MSE between the decoder output $O_{dec}$ and the target mel-spectrogram $M_{tar}$ and the MSE between the postnet output $O_{post}$ and the target mel-spectrogram:

$$\mathcal{L} = \mathrm{MSE}(O_{dec}, M_{tar}) + \mathrm{MSE}(O_{post}, M_{tar}) \quad (5)$$
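As a small worked example of this objective, the sketch below computes the two MSE terms on toy 2 × 2 "spectrograms" in plain Python. The function name `lipsound2_loss` and the toy values are hypothetical and serve only to illustrate the sum of the decoder and postnet errors.

```python
def mse(pred, target):
    """Mean squared error over two equally shaped 2-D lists (frames x mel bins)."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            total += (p - t) ** 2
            count += 1
    return total / count

def lipsound2_loss(o_dec, o_post, m_tar):
    """Sum of the decoder-output MSE and the postnet-output MSE."""
    return mse(o_dec, m_tar) + mse(o_post, m_tar)

m_tar = [[0.0, 1.0], [2.0, 3.0]]    # toy target mel-spectrogram
o_dec = [[0.0, 1.0], [2.0, 5.0]]    # one bin off by 2: MSE = 4 / 4 = 1.0
o_post = [[0.0, 1.0], [2.0, 4.0]]   # one bin off by 1: MSE = 1 / 4 = 0.25
loss = lipsound2_loss(o_dec, o_post, m_tar)  # 1.0 + 0.25 = 1.25
```

Both terms compare against the same target, so the decoder is pushed towards a usable output on its own while the postnet refines it.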

E. WaveGlow
We use WaveGlow [26], which combines the glow-based generative model [73] with architectural insights from WaveNet [74], to transform the estimated mel-spectrogram back to audio. WaveGlow abandons auto-regression [74] and speeds up waveform synthesis while retaining high quality and resolution. We train WaveGlow from scratch on the LJSpeech dataset [75] using the same settings as the original work [26], but at a 16 kHz sampling rate to meet the requirements of the downstream ASR models. To our surprise, the WaveGlow model, although trained on only one female voice, generalizes effectively to unseen voices and performs stable waveform reconstruction.

[Table I: LipSound2 model details. Columns: Layer, Kernel, Stride, Padding, Channels/Nodes.]

F. Acoustic Model and Language Model
The Jasper [27] speech recognition system, a fully convolutional architecture trained with skip connections and CTC loss, is adopted to directly predict characters from speech signals. We pre-train the Jasper DR 10x5 model on the 960h LibriSpeech and 1000h AISHELL-2 corpora, which achieves 3.61% WER (word error rate) and 10.05% CER (character error rate) on the development sets for English and Chinese, respectively.
Beam search with a 6-gram KenLM [76] language model is used to decode the character probabilities output by Jasper into grammatically and semantically plausible words at the sentence level [77].

A. Dataset
All datasets used in this paper are summarized in Table II and random frames from the audio-visual ones are presented in Fig. 4. VoxCeleb2 is a large-scale audio-visual corpus extracted from YouTube videos, containing over one million utterances and more than 6k different speakers from around 145 nationalities and languages. It includes noisy and unconstrained conditions: the audio stream may be recorded with background noise, such as laughter and room reverberation, and the visual stream may contain variable head poses (e.g. frontal and profile faces), variable lighting conditions and low image quality. In contrast, the GRID and TCD-TIMIT datasets are recorded in controlled experimental environments with a fixed frontal face angle and clean audio and visual backgrounds. It is worth mentioning that the GRID dataset is designed to contain only a fixed 6-word structure and all sentences are generated by a restricted artificial grammar: command + color + preposition + letter + digit + adverb, for example, 'set blue in Z three now'. CMLR (Chinese Mandarin Lip Reading) is collected from videos of 11 hosts of the Chinese national news program News Broadcast; it contains frontal faces and covers a large Chinese vocabulary. We first pre-train LipSound2 on VoxCeleb2, then fine-tune the model on GRID, TCD-TIMIT and CMLR, respectively, for video to mel-spectrogram reconstruction.
LibriSpeech and AISHELL-2 are currently the largest open-source speech corpora and widely used speech recognition benchmarks for English and Chinese, respectively. LibriSpeech is derived from audiobooks, containing 460h of clean speech and 500h of noisy speech. AISHELL-2 consists of 1000h of speech from different domains, for instance, voice commands and smart home scenarios, and includes various accents from different areas of China. We use LibriSpeech and AISHELL-2 to pre-train the Jasper acoustic model to boost the performance of waveform-to-text transformation. The generated speech on GRID, TCD-TIMIT and CMLR is used for further fine-tuning to perform lip reading (video-to-text) experiments.
The LJ Speech dataset, which contains only one female voice and is designed for speech synthesis tasks, is used in this paper to train WaveGlow to transform mel-spectrograms back to waveforms.

B. Evaluation Metrics
We evaluate the generated speech quality and intelligibility with Perceptual Evaluation of Speech Quality (PESQ) [83] and Extended Short-Time Objective Intelligibility (ESTOI) [84] respectively.The speech-to-text results are measured with Word Error Rate (WER) and Character Error Rate (CER), the ratio of error terms, i.e., substitutions, deletions and insertions, to the total number of words/characters in the ground truth sequences.
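The error-rate definition above can be made concrete with a short sketch based on the Levenshtein (edit) distance. The helper names are hypothetical, and removing spaces before computing CER is an assumption made here for illustration; the paper does not state how whitespace is handled.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: word-level edit distance over the reference length."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    ref_chars = list(ref.replace(" ", ""))
    return edit_distance(ref_chars, list(hyp.replace(" ", ""))) / len(ref_chars)

reference = "set blue in z three now"
hypothesis = "set blue at z three now"
word_error = wer(reference, hypothesis)   # 1 substituted word out of 6
char_error = cer(reference, hypothesis)   # 2 substituted characters out of 18
```

Because insertions count as errors, both rates can exceed 100% when the hypothesis is much longer than the reference.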

C. Training
We only describe the training settings of LipSound2 pre-training, LipSound2 fine-tuning and Jasper acoustic model fine-tuning. More details about Jasper acoustic model pre-training, the KenLM language model and WaveGlow can be found on the respective open-source websites.
1) Vision Stream: face landmarks are detected using MTCNN [71] in all video frames, and only the face area is cropped and resized to 112 × 112 as input. We also add one 'visual period' (an empty frame with all values set to 255) at the end of every visual stream to help the decoder stop decoding at the right time. A maximum decoder step threshold of 1000 terminates decoding when the decoder fails to capture the 'visual period'.
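The two stopping safeguards described above can be illustrated with a minimal NumPy sketch. The single-channel (T, H, W) frame layout, the function names, and the `step_fn` interface that returns a stop flag are assumptions for illustration; the text does not specify how the decoder signals that it has reached the 'visual period'.

```python
import numpy as np

MAX_DECODER_STEPS = 1000  # maximum decoder step threshold stated in the text

def append_visual_period(frames):
    """Append one all-255 'visual period' frame to a (T, H, W) clip."""
    period = np.full((1,) + frames.shape[1:], 255, dtype=frames.dtype)
    return np.concatenate([frames, period], axis=0)

def decode(step_fn, max_steps=MAX_DECODER_STEPS):
    """Call step_fn(t) until it signals stop or the step limit is reached.

    step_fn is a hypothetical callable returning (output_frame, stop_flag).
    """
    outputs = []
    for t in range(max_steps):
        frame, stop = step_fn(t)
        outputs.append(frame)
        if stop:
            break
    return outputs

clip = np.zeros((75, 112, 112), dtype=np.uint8)   # toy clip of 75 cropped face frames
clip_with_period = append_visual_period(clip)     # 76 frames, last one all 255
never_stops = decode(lambda t: (t, False))        # cut off by the step limit
stops_early = decode(lambda t: (t, t == 9))       # stop flag raised at step 9
```

The step limit only matters as a fallback: a decoder that reliably recognises the end marker terminates long before it.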
2) Audio Stream: we first divide the raw waveforms by the max value to normalize all audio to [-1, 1], then extract the magnitude using the Short-Time Fourier Transform (STFT) with 1024 frequency bins and a 64 ms window size with a 16 ms stride. The mel-scale spectrograms are obtained by applying an 80-channel mel filter bank to the magnitude, followed by dynamic range clipping with a minimum value of 1e-5 and log dynamic range compression.
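The audio pipeline above can be sketched in NumPy as follows. Several choices here are assumptions rather than the authors' implementation: a 16 kHz sampling rate (so that 64 ms and 16 ms correspond to 1024 and 256 samples), peak normalization by the maximum absolute value, a Hann window without padding, and a simple triangular mel filter bank based on the common 2595·log10(1 + f/700) mel formula.

```python
import numpy as np

SR = 16000               # assumed sampling rate
N_FFT = 1024             # FFT size
WIN = SR * 64 // 1000    # 64 ms window = 1024 samples
HOP = SR * 16 // 1000    # 16 ms stride = 256 samples
N_MELS = 80

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=SR, n_fft=N_FFT, n_mels=N_MELS):
    """Triangular mel filters with shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel(wave):
    """Return log mel-spectrogram features with shape (n_frames, N_MELS)."""
    wave = wave / np.max(np.abs(wave))                    # peak normalization
    n_frames = 1 + (len(wave) - WIN) // HOP
    window = np.hanning(WIN)
    frames = np.stack([wave[i * HOP:i * HOP + WIN] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=N_FFT, axis=1))    # STFT magnitude
    mel = mag @ mel_filterbank().T                        # 80-channel mel filter bank
    return np.log(np.clip(mel, 1e-5, None))               # clipping, then log compression

t = np.arange(SR) / SR                                    # one second of audio
feats = log_mel(0.5 * np.sin(2 * np.pi * 440.0 * t))      # toy 440 Hz tone
```

With these settings one second of audio yields 59 frames of 80 values each, and the clipping floor bounds every feature below by log(1e-5).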
3) LipSound2 Pre-Training: horizontal image flipping, gradient clipping with a threshold of 1.0, early stopping and scheduled sampling [85] are adopted to avoid overfitting. Linear and convolutional layers are initialized with Xavier [86] and tanh functions respectively. We use the cosine learning rate decay strategy with an initial value of 0.001. Our LipSound2 model has around 100M parameters. The audio and visual sequences are both high-dimensional data, so we conduct all experiments on 4 NVIDIA Quadro RTX 6000 GPUs with 24 GB memory in parallel to enable a large batch size. The entire pre-training procedure took around 25 days.
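A cosine learning rate decay of the kind mentioned above can be sketched as follows. Only the initial value of 0.001 comes from the text; the total number of steps, the final learning rate of zero and the absence of warm-up or restarts are assumptions of this example.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 100  # hypothetical schedule length
schedule = [cosine_lr(s, TOTAL_STEPS) for s in range(TOTAL_STEPS + 1)]
```

The rate falls slowly at first, fastest around the midpoint (where it equals half the initial value), and flattens out again towards the end of training.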
4) Fine-tuning: the pre-trained LipSound2 is fine-tuned on GRID, TCD-TIMIT and CMLR videos, respectively, to conduct speech reconstruction experiments. Afterwards, the pre-trained English (LibriSpeech) and Chinese (AISHELL-2) acoustic models are fine-tuned on the produced English (GRID and TCD-TIMIT) and Chinese (CMLR) speech, with a 10 times smaller learning rate, to perform lip reading tasks.

V. EXPERIMENTAL RESULTS
A. Lip to Speech Reconstruction
1) Speaker-dependent Result: we report the generated speech results from two perspectives, i.e. speech quality (PESQ) and speech intelligibility (ESTOI). For a fair comparison, we keep the same settings as previous works. For speaker-dependent tasks, the data of GRID (speakers S1-S4) and TCD-TIMIT (lipspeakers 1-3) are randomly split 90:5:5 into training, validation and test sets. Different from previous works that build one model for each individual speaker, we train only one model on all speakers.
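A random 90:5:5 split of the kind described above can be sketched as follows. The utterance identifiers, the fixed seed and the choice to assign any rounding remainder to the test set are assumptions for illustration and do not reproduce the actual splits used in previous works.

```python
import random

def split_90_5_5(items, seed=0):
    """Randomly split items into training, validation and test sets (90:5:5)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 90 // 100
    n_val = len(items) * 5 // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]   # remainder, so no item is dropped
    return train, val, test

utterances = [f"utt_{i:04d}" for i in range(1000)]  # hypothetical utterance IDs
train, val, test = split_90_5_5(utterances)
```

Fixing the seed keeps the three subsets identical across runs, which matters when several models are compared on the same test set.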
As shown in Table III, our LipSound2 system, which is first pre-trained on the VoxCeleb2 dataset and then fine-tuned on the specific dataset, achieves the highest scores on both PESQ and ESTOI, which demonstrates the effectiveness of our proposed method. The last column in Table III compares the number of LipSound2 model parameters against those of the baseline systems, showing that its best performance is obtained while staying well within the existing range of parameter counts.
2) Speaker-independent Result: for speaker-independent cases, we follow the same setups for GRID [33] and TCD-TIMIT [31]. LipSound2 achieves the best results on both metrics on the GRID dataset. Moreover, by listening to the reconstructed audio, we find that our model is capable of producing voices similar to those of the ground truth speakers, instead of generating an unnatural voice or one of the voices in the training set, as occurs in previous works. The model has implicitly learnt the mapping between voices and faces. We highly recommend that readers listen to the produced samples on our demo website.
Furthermore, we find substitution errors occurring on segment-level (vowels and consonants) because the context information is still not sufficient to disambiguate the phonemes that share the same visible organs, like lips and tongue, but are different in the invisible ones.
To the best of our knowledge, we are the first to tackle the speaker-independent case on the TCD-TIMIT dataset, since TCD-TIMIT consists of limited samples (∼370) for each speaker but has a large-scale vocabulary (∼5.9K), which makes the tasks on TCD-TIMIT quite challenging. The speaker-independent results reported in Table IV show considerable performance; for example, the PESQ result is even better than some results reported in speaker-dependent settings (as shown in Table III), which suggests that large-scale self-supervised pre-training enables the model to generalize successfully to unseen speakers.
3) Speech Reconstruction for Chinese: to explore the effectiveness of our proposed architecture, we further perform speech reconstruction in Chinese. For the speaker-dependent case, we keep the same training and test splits used in CSSMCM [81] for lip reading; for the speaker-independent case, S1 (male) and S6 (female) are used for testing and the remaining speakers are used for training and validation.
In Table V, only LipSound2 results are reported, since this is a first attempt at tackling speech reconstruction in Chinese. After checking the generated audio samples, we find that, besides the confusion on segments, there are some tone errors. One reason is that Chinese is a tonal language in which lexical tones play an important role in semantic discrimination. The fundamental frequency (F0), which is produced by the vibration of the vocal cords, is not visible in the input videos (face area), and it is reported that visual features have a weak correlation with F0 [28]. Another reason is that the VoxCeleb2 dataset mainly consists of non-tonal languages, e.g. British English, American English and German, so the pre-training pays little attention to tone production.
4) Attention Alignment: we compare the attention alignments learned by LipSound [20], which is only trained on the GRID dataset, and LipSound2 (this paper). As shown in Fig. 6, the LipSound attention weights are fuzzy at non-verbal areas and at short pauses between words, which may mislead the decoder into focusing on irrelevant encoder timesteps, whereas the attention weights learned by LipSound2 are concentrated and more robust to silence or short pauses.

B. Lip Reading Results
Different from conventional methods which directly transform videos into text, we perform lip reading experiments in two steps, i.e. video-to-wav and wav-to-text.
1) Lip Reading Results for English: we follow the same splits as previous works for training and testing on the GRID [44] and TCD-TIMIT [87] datasets. The comparison with related results is listed in Table VI. We report the WER of the GRID and TCD-TIMIT audio test sets on the pre-trained acoustic models (Audio Gold Standard) and the results after fine-tuning on the training audio samples (+Fine-Tuning), which are treated as the upper bound of lip reading.
Our LipSound2 model achieves state-of-the-art performance on both the GRID and TCD-TIMIT datasets. Fine-tuning the acoustic model pre-trained on 960h LibriSpeech with generated audio not only significantly boosts the model performance but also shortens training time.
Further improvement can be achieved when an external language model is integrated. The benefit from the language model on the GRID dataset is not as large as on TCD-TIMIT, since the sentence structure in GRID is designed by an artificial grammar. There, the language model can only help to correct misspelled words but cannot contribute grammatically or semantically.

VI. DISCUSSION
Although the proposed LipSound2 model pre-trained on a large-scale dataset achieves considerable performance on both speech reconstruction and lip reading tasks, it still generates erroneous speech due to the visual similarity of some pronunciations; for example, 'pill' is easily misrecognized as 'bill' in English, and 'ji zhi' is mistaken for 'qi zhi' in Chinese. In addition, our model can generate voices quite similar to the ground truth in speaker-dependent settings, while in speaker-independent cases it is sometimes inclined to predict a voice that exists in the training set. For details and demonstrations, we also refer to the demo video on the project website 5. How to stop the fine-tuning procedure at the appropriate time and avoid overfitting on downstream tasks is an important direction for future research, since the MSE loss always declines when using teacher forcing during training and therefore hardly indicates whether the model is overfitting or not. Besides, a possible solution to the speaker identity issue could be using voice embeddings as additional inputs, which can efficiently help models learn speaker identity information, as we found in our previous work [4].
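One way to see why a teacher-forced loss can be a weak stopping signal is to compare it with the error measured when the decoder consumes its own outputs, as it does at inference. The toy sketch below is not the LipSound2 decoder: the one-parameter predictor, the synthetic single-channel target and all names are assumptions made purely for illustration.

```python
import math

# Hypothetical single-channel "spectrogram" trajectory (synthetic data).
target = [math.cos(0.3 * t) for t in range(100)]
prev_frames, next_frames = target[:-1], target[1:]

# Hypothetical one-parameter decoder: next_frame = w * previous_frame,
# fitted by least squares on ground-truth (teacher-forced) frame pairs.
w = (sum(p * n for p, n in zip(prev_frames, next_frames))
     / sum(p * p for p in prev_frames))


def mse(pred, ref):
    """Mean squared error between two equally long sequences."""
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref)


# Teacher forcing: each step is conditioned on the ground-truth previous frame.
teacher_forced = [w * p for p in prev_frames]
tf_mse = mse(teacher_forced, next_frames)

# Free running: each step is conditioned on the model's own previous output.
free_running = []
frame = target[0]
for _ in next_frames:
    frame = w * frame
    free_running.append(frame)
fr_mse = mse(free_running, next_frames)

print(f"teacher-forced MSE: {tf_mse:.4f}")
print(f"free-running MSE:   {fr_mse:.4f}")
```

In this toy setting the teacher-forced error stays small while the free-running error is much larger, because prediction errors accumulate once ground-truth frames are no longer fed back. Monitoring a free-running or downstream metric on held-out data is therefore one candidate stopping criterion, offered here as a suggestion rather than as the procedure used in this work.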

VII. CONCLUSION
In this paper, we have proposed LipSound2, which directly predicts speech representations from raw pixels. We investigated the effectiveness of self-supervised pre-training for speech reconstruction on large-scale vocabulary datasets, particularly for speaker-independent settings. Moreover, state-of-the-art results are achieved by fine-tuning a well pre-trained speech recognition model on the produced audio for both English and Chinese lip reading experiments. Our two-step method benefits not only from the large-scale crossmodal supervision, which enables the model to learn more robust representations and more diverse content information, but also from the advanced speech recognition architecture (acoustic and language models), which is pre-trained on abundant labeled data.
Although we have made great progress on speech reconstruction in controlled environments, there is still a significant gap to the requirements of real-world scenarios. Future work will focus on more realistic configurations, such as varying light conditions, moving head poses and different background environments. Moreover, the current lip reading experiments are conducted separately in two steps, so errors generated in the first step (video-to-wav) are propagated to the second step (wav-to-text). How to jointly train the two tasks in an end-to-end fashion could be another direction. Besides, we are also interested in integrating our LipSound2 model into active speaker detection, speech enhancement and speech separation tasks to boost the performance of speech recognition systems in human-robot interaction.

Fig. 3. The computational flow of location-aware attention at time step t.

Fig. 2. The architecture of LipSound2. The video is split into visual and acoustic streams. The face region which is cropped from the silent visual stream is used as the model input. The acoustic spectrogram features extracted from the counterpart audio stream are used as the training target. During training, the ground truth spectrogram frames are utilized to accelerate convergence, while, during inference, the outputs from previous steps are used.

Fig. 4. Random face samples from audio-visual corpora. Only the face region is cropped during training and test.

Fig. 5. The comparison between generated mel-spectrogram and ground truth in speaker-dependent and -independent settings for English and Chinese.

TABLE I. CONFIGURATION OF LIPSOUND2 ENCODER, DECODER, ATTENTION AND POSTNET.

TABLE II. OVERVIEW OF ALL CORPORA USED IN THIS PAPER. SPK: SPEAKERS. UTT: UTTERANCES. VOCAB: VOCABULARY.

TABLE III. SPEAKER-DEPENDENT SPEECH RECONSTRUCTION RESULTS ON GRID AND TCD-TIMIT DATASETS.

TABLE IV. SPEAKER-INDEPENDENT SPEECH RECONSTRUCTION RESULTS ON GRID AND TCD-TIMIT DATASETS.

TABLE V. SPEECH RECONSTRUCTION RESULTS FOR CHINESE ON CMLR DATASETS.

2) Lip Reading Results for Chinese: We also explore lip reading performance in Chinese, as shown in Table VII. Audio Gold Standard denotes directly evaluating the CMLR test set on an acoustic model pre-trained on the 1000h AISHELL2 dataset. After fine-tuning with the CMLR training audio, we obtain 3.88% CER and 4.89% CER for speaker-dependent and -independent cases respectively. In comparison to other work, our LipSound2 model achieves better results. CER drops further when decoding with an external language model. Besides, we establish a new baseline for CMLR in speaker-independent settings.
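For Chinese, the error rate is computed over characters rather than words, since written Chinese has no whitespace word boundaries. A minimal sketch of the conventional CER definition follows; it is not the scoring script used in this work, and the function name and example sentences are illustrative assumptions.

```python
def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    # Whitespace is dropped so that spacing differences are not counted as errors.
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    if not ref:
        raise ValueError("reference must contain at least one character")
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of six characters is replaced by a homophone.
rate = cer("今天天气很好", "今天天汽很好")
print(f"CER = {rate:.3f}")
```

Note that a wrong character with the correct syllable and a wrong tone both count as a single substitution, so CER alone does not separate segment confusions from tone errors.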

TABLE VII. LIP READING RESULTS FOR CHINESE ON CMLR DATASETS. CER: