Lip Reading Sentences Using Deep Learning With Only Visual Cues

In this paper, a neural network-based lip reading system is proposed. The system is lexicon-free and uses purely visual cues. With only a limited number of visemes as classes to recognise, the system is designed to lip read sentences covering a wide range of vocabulary and to recognise words that may not be included in system training. The system has been evaluated on the challenging BBC Lip Reading Sentences 2 (LRS2) benchmark dataset. Compared with state-of-the-art works in lip reading sentences, the system achieves a significantly improved performance, with a word error rate 15% lower. In addition, experiments with videos of varying illumination have shown that the proposed model is robust to varying levels of lighting. The main contributions of this paper are: 1) The classification of visemes in continuous speech using a specially designed transformer with a unique topology; 2) The use of visemes as a classification schema for lip reading sentences; and 3) The conversion of visemes to words using perplexity analysis. All the contributions serve to enhance the accuracy of lip reading sentences. The paper also provides an essential survey of the research area.


I. INTRODUCTION
The task of automated lip reading has attracted a lot of research attention in recent years, and many breakthroughs have been made in the area with a variety of machine learning-based approaches [1], [2]. Automated lip reading can be performed both with and without the assistance of audio [3]; when performed without audio, it is often referred to as visual speech recognition [4].
The most recent approaches to automated lip reading are deep learning-based and largely focus on decoding long speech segments in the form of words and sentences, using either words or ASCII characters as the classes to recognise [5], [6], [7], [8], [9], [10]. Lip reading systems designed to classify words often use individual words as the classification schema, where every word is treated as a class. In recent years, very good accuracies have been achieved for word-based classification on some of the most challenging audio-visual datasets for words, such as LRW [7] and LRW-1000 [47].
Contrastingly, however, lip reading sentences has not attained accuracies as good as word-based approaches. It remains an ongoing challenge to automatically lip read people uttering sentences that cover a wide range of vocabulary and contain words that may not have appeared in the training phase, while using the fewest classes possible. The main obstacles to lip reading sentences are:
• Lip reading systems that use words or ASCII characters as classes can only predict words that the systems have been trained to predict: in the case of words as classes, a word needs to be encoded as a class and presented in the training phase, while in the case of ASCII characters, the prediction of words relies on combinations of characters having been presented as patterns in the training phase.
• The models must be trained to cover a wide range of vocabulary, which requires a significant number of model parameters to be optimised and a significant volume of training data.
• They often require curriculum learning-based strategies [27], [28], which involve further pre-processing, whereby the videos of individuals speaking in the training data have to be clipped so that the models can initially be trained on single-word examples, with the length of the sentences gradually incremented.
This paper focuses on improving the accuracy of lip reading sentences. This is achieved by using visemes as a very limited set of classes, a specially designed deep learning model with its own network topology for classifying visemes, and a conversion of recognised visemes to possible words using perplexity analysis.
Using visemes for lip reading sentences has some unique advantages. Compared with using either words or ASCII characters as classes, visemes require an overall smaller number of classes, which alleviates the computational bottleneck. In addition, using visemes does not require pre-trained lexicons, meaning that a viseme-based lip reading system can classify words that have not been presented in the training phase, and it can be generalised to different languages because many languages share the same visemes.
On the other hand, there are some specific issues to consider when designing a viseme-based lip reading system for sentences. The classification performance for individual segmented visemes has generally been less satisfactory than the classification of words because visemes tend to have a shorter duration than words. This results in less temporal information being available to distinguish between classes, as well as more visual ambiguity in class recognition [24]. One way to address this problem is to significantly increase the available training data to enhance the system's ability to distinguish between classes, which is why a high volume of training videos has been utilised here. Moreover, recognised ASCII characters convert directly to possible words in a one-to-one mapping, whereas this one-to-one mapping does not exist for visemes, because one set of visemes can map to multiple different sounds or phonemes. This means that once visemes have been classified, a viseme-to-word conversion still needs to be performed. This conversion also helps to distinguish between homophemes, i.e., words that look the same when spoken but sound different [11], a phenomenon that exists because of the one-to-many mapping relationship between visemes and phonemes.
The proposed automated lip reading system contains a component to classify spoken visemes from people speaking in silent videos, and a component to perform viseme-to-word conversions using perplexity analysis [12]. The proposed model also has a good robustness to varying levels of lighting.
The rest of the paper is organised as follows: First, in Section II, the different classification schemas for automated lip reading are discussed along with their advantages and limitations. Then, in Section III, details of all the components that make up the whole lip reading system, including pre-processing, visual feature extraction, viseme classification and word detection, are given. In Section IV, the classification results for the overall lip reading system are discussed and compared, followed by concluding remarks in Section V along with suggestions for further research.

II. LITERATURE REVIEW
Automated lip reading systems initially focused on classifying isolated speech segments in the form of digits and letters [13], [14], [15], [16], [17], and then eventually moved on to longer speech segments in the form of words. The success of automated lip reading was previously constrained by the available training data, as initially, the only audio-visual datasets available were those with isolated speech segments, i.e., digits, alphabet and words [17], [18], [19]. Subsequently every speech segment was treated as a class to recognise.
Thanks in part to the availability of larger audio-visual datasets with continuous speech, later lip reading systems have focused on classifying entire sentences utilising a wider range of vocabulary and so have opted for ASCII-based class systems [5], [6], [7], [8], [9], [10]. Sentences are spelt using ASCII characters as opposed to including a class for every single word, which allows for the use of fewer classes and avoids the creation of a computational bottleneck [30]. ASCII characters also allow for the modelling of natural language, due to the conditional probability relationships that exist between ASCII characters, making it easier to predict characters and words.
Very good accuracies have been attained by some of the most recent neural network-based lip reading systems trained to classify individual words on word-based lip reading datasets like LRW [7] and LRW-1000 [47]. LRW is a very demanding dataset, since it consists of more than 1000 speakers with large variations in head pose and illumination. LRW-1000 is an even more challenging Mandarin lip reading dataset, due to its large variations in scale, resolution and background clutter.
Notable performances have been recorded for lip reading systems that predict entire sentences, such as those predicting phrases from the GRID [48] and OuluVS [49] datasets. However, sentences in datasets like GRID and OuluVS are simple, repetitive and follow standard sequences unlike those contained within the LRS2 corpus which are more random and varied. A summary of the most recent state-of-the-art lip reading models and their performances is given in Table 1.
Other alternative classification schemas for neural network-based lip reading include phonemes, which have been used in audio and acoustic speech recognition systems [20]. Shillingford et al. [10] used a neural network architecture consisting of a spatial-temporal convolutional neural network (CNN) and a Long Short-Term Memory (LSTM) network to classify sequences of phonemes from silent videos, where phonemes were then mapped to words using a finite-state transducer [30]. However, with phonemes there is still a mapping problem: different phonemes map to the same viseme and thus produce identical lip movements.
To the best of our knowledge, no sentence-level lip reading system has decoded entire sequences of visemes, although there has been much work on classifying individual segmented visemes in the form of images or groups of image frames [22], [23], [24], [25]. If visemes are to be classified, they should be classified in the context of continuous speech so that viseme classification can be performed in real time. One paper describes an LSTM that takes visemes as input and predicts the words spoken by individuals from a limited dataset, with some satisfactory results [26], though the individual visemes were already known.
In addition to being treated as individual segments, visemes can also be modelled in the form of clusters like ''visual words'' where groups of visemes that make up a word can be segmented. Whilst approximately 50% of the words in the English language share identical viseme clusters, there are words that have unique visemes and can be classified when performing automated lip reading using solely visual information. For words that share visemes, clusters of visemes in combination would need to be analysed to determine which combination is most linguistically probable. This is the basis for the lip reading sentence system proposed in this paper based entirely on visual cues.
No official standard convention exists for defining precise visemes, or even the precise total number of visemes, and different approaches to viseme classification have used varying numbers of visemes as part of their conventions, with different phoneme-to-viseme mappings [29], [30], [31], [32], [33], [34]. All the different conventions consist of consonant visemes, vowel visemes and one silent viseme, but Lee and Yook's mapping convention [29] appears to be the most favoured for speech classification and is the one utilised in this paper. However, it is accepted that there are multiple phonemes that are visually identical on any given speaker [35], [36].
The different automated lip reading approaches summarised in Table 1 indicate that many challenges still hinder the success of automated lip reading systems. One of these is the lack of temporal information required to distinguish between segments of speech, which is why some approaches tasked with classifying shorter segments, such as visemes and digits, have not attained accuracies as good as those tasked with classifying words. This problem can, however, be compensated for by increasing the available training data: when a small, limited number of speech segments is to be classified, as in the case of digits or visemes, the performance of such systems can be enhanced by generating as much training data as possible.
Such an approach is not feasible for words, where the number of possible words that can be spoken is unlimited, so it is necessary to use a discrete class system that covers general speech, such as ASCII characters. However, the use of ASCII characters in lip reading relies on the conditional dependence relationship between characters, and ASCII symbols are not always phonetic because of silent letters and digraphs, so training a network to decode speech in real time requires training on an extensive range of vocabulary.
Lip reading systems tasked with predicting sentences from datasets such as GRID and OuluVS have been more fruitful in terms of accuracy than those tasked with recognising sentences from more challenging datasets like LRS2. One of the main reasons a dataset like LRS2 is so difficult is that it contains sentences that randomly cover a vocabulary of over 40,000 words, very different from GRID and OuluVS, which contain repetitive sentences following a standard sequence and covering only a small range of vocabulary. Lip reading systems that use ASCII characters as classes are designed to predict words as combinations of ASCII characters, so to recognise any set of words, such words need to have appeared in the training phase. At present, ASCII-based lip reading systems are not able to decode words that have not been presented in training. The low accuracy of systems designed for lip reading sentences can be explained by their inability to generalise to a wide range of vocabulary whilst using a limited number of classes.
Training ASCII-based lip reading systems to generalise to a wide range of vocabulary remains an ongoing obstacle. One alternative to a lip reading system designed for decoding speech covering a given vocabulary range is to recognise lip movements and map them to possible words, because there is a distinct number of visemes that can be uttered by a speaker. However, because of the one-to-many mapping relationship between visemes and phonemes, one would still need to determine which combination of words has been uttered.

III. METHODOLOGY
Given a silent video of a talking face, the objective here is to predict the sentences being spoken by extracting the speaker's lip movements. In this Section, an overall architecture for decoding visual speech is proposed, as illustrated in Figure 1.
The entire process consists of different stages, starting with a Data Preprocessing stage where the region of interest is extracted from the videos using facial landmark detection to provide the input to the Visual Frontend. The components of the overall architecture include: a spatial-temporal visual frontend that inputs a sequence of images of loosely cropped lip regions and outputs one feature vector per frame; a sequence processing module known as the viseme classifier that inputs the sequence of per-frame feature vectors and outputs a sequence of visemes; and finally a module that matches visemes to words and predicts the uttered sentence using perplexity analysis. The performance of the system is evaluated by comparing the sentences predicted by the lip reading system to the ground truth of the spoken sentences and measuring the edit distance. In the following Sections, details of the system's components are discussed.

A. ARCHITECTURE
The overall system used for decoding speech consists of two separate neural network architectures used to perform two different tasks. The first architecture is used for viseme classification and consists of a spatial-temporal visual frontend in tandem with an attention-based transformer; the predicted visemes provide the input to the next architecture. The second architecture, also an attention-based transformer, is used to predict the spoken words given the uttered visemes using a calculated metric called perplexity. As illustrated in Figure 2, each of these modules is briefly described along with the overall framework for the lip reading system. Both the viseme classifier and the word detector consist of common blocks, including fully connected layers, self-attention layers and feed-forward layers; the breakdown of these three blocks is given in Figure 3.
The attention-transformer structure used in [39] has been changed to fit visemes, as will be discussed in Section III-E. Unlike [39], there is no embedding layer, and the decoder has been altered, with the final softmax layer trained on visemes instead of ASCII characters.

B. DATA
The dataset used in this research is the BBC LRS2 dataset [6]. It consists of approximately 46,000 videos covering over 2 million word instances and a vocabulary of over 40,000 words. The longest video has a length of 180 frames, and every video has a frame rate of 25 frames per second. The dataset contains sentences of up to 100 ASCII characters from BBC videos, with a range of facial poses from frontal to profile, and is extremely challenging. Table 2 gives a breakdown of the different sections of the BBC LRS2 data with statistics on the number of sentences, the number of word instances, the vocabulary range and the ratio of profile to frontal videos in each section of the corpus.

C. DATA PRE-PROCESSING
All the videos are pre-processed according to the stages given in Figure 4. Videos consist of images with red, green and blue pixel values, a resolution of 160 × 160 pixels, and a frame rate of 25 frames/second. Videos are first sampled into image frames; then facial landmarks need to be located, as the speaking person's lips are the region of interest and the feature input to the visual frontend. The Single Shot MultiBox Detector (SSD) [45], a CNN-based detector, is used to detect faces within individual frames and to recognise facial landmarks according to the iBug [46] convention of 68 landmarks; it can be used on faces posed at different angles. Landmarks are applied according to the stages shown in Figure 5, with the face detected shown on the left, the face being tracked in the middle, and the facial landmarks detected on the right.
The video frames are then converted to greyscale, scaled, and centrally cropped around the boundary of the facial landmarks, resulting in reduced dimensions of 112 × 112 × T (where T corresponds to the number of image frames). Data augmentation is also applied in the form of horizontal flipping, removal of random frames [37], [38], and random shifts of up to ±5 pixels in the spatial dimension and ±2 frames in the temporal dimension. Finally, pixels are normalised with respect to the overall mean and variance of every pixel in each frame.
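For illustration, the following is a minimal NumPy sketch of this cropping, augmentation and normalisation pipeline. The function name and the handling of the colour channels are our own assumptions, and the ±2-frame temporal shift is omitted for brevity; it is a sketch of the stages described above, not the authors' implementation.

```python
import numpy as np

def preprocess_clip(frames, rng, train=True):
    """Greyscale, crop, augment and normalise a video clip.

    frames: uint8 array of shape (T, 160, 160, 3), as in LRS2.
    Returns a float32 array of shape (T', 112, 112).
    """
    # Greyscale conversion using the standard luminance weights.
    grey = frames.astype(np.float32) @ np.array([0.299, 0.587, 0.114], dtype=np.float32)

    # Central 112x112 crop, jittered by up to +/-5 pixels during training.
    dy, dx = rng.integers(-5, 6, size=2) if train else (0, 0)
    y0 = (160 - 112) // 2 + dy
    x0 = (160 - 112) // 2 + dx
    crop = grey[:, y0:y0 + 112, x0:x0 + 112]

    if train:
        # Random horizontal flip.
        if rng.random() < 0.5:
            crop = crop[:, :, ::-1]
        # Removal of a random frame (temporal augmentation).
        if crop.shape[0] > 1 and rng.random() < 0.5:
            crop = np.delete(crop, rng.integers(crop.shape[0]), axis=0)

    # Z-score normalisation over the clip's mean and variance.
    return (crop - crop.mean()) / (crop.std() + 1e-8)

rng = np.random.default_rng(0)
clip = rng.integers(0, 256, size=(75, 160, 160, 3), dtype=np.uint8)
print(preprocess_clip(clip, rng).shape)  # (75, 112, 112), or (74, ...) if a frame was dropped
```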
Pre-processing is needed to ensure that the appropriate region of interest (ROI) containing the lips can be extracted, at a resolution of 112 × 112 pixels, as the input to the neural network. The ROI must also undergo greyscale conversion and z-score normalisation. The facial landmark detection described earlier has already been performed on every video in the BBC LRS2 corpus, so some of the pre-processing steps described in Figure 4 may not be necessary for this corpus: the 112 × 112 region can be extracted by central cropping of the original 160 × 160 image frames. The entire pre-processing pipeline would, however, be a necessity for a lip reading system generalised to other real-time applications.

D. VISUAL FRONTEND
The spatial-temporal visual front-end is based on [38]. The network applies a spatial-temporal (3D) convolution to the input image sequence, with a filter width of five frames, followed by a 2D ResNet that gradually decreases the spatial dimensions with depth. For an input sequence of $T \times H \times W$ frames, the output is a $T \times \frac{H}{32} \times \frac{W}{32} \times 512$ tensor (i.e., the temporal resolution is preserved), which is then average-pooled over the spatial dimensions, yielding a 512-dimensional feature vector for every input video frame. Details of the architecture of the Visual Frontend are given in Table 3. The trained network from [8] has been applied in this work.
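The sketch below illustrates the shape of this frontend in TensorFlow: a 3D convolution with a 5-frame temporal filter and unit temporal stride, followed by per-frame 2D stages that shrink the spatial grid, and spatial average pooling to one 512-dimensional vector per frame. The simple strided convolutions are a stand-in for the full 2D ResNet of Table 3; layer sizes are illustrative assumptions.

```python
import tensorflow as tf

def visual_frontend(T=75, H=112, W=112):
    """Sketch of the spatial-temporal frontend: temporal resolution is
    preserved, spatial dimensions are reduced in stages, and spatial
    average pooling yields one 512-d feature vector per input frame."""
    inp = tf.keras.Input(shape=(T, H, W, 1))
    # 3D convolution with a filter width of five frames; temporal stride 1
    # so the number of time steps is unchanged.
    x = tf.keras.layers.Conv3D(64, (5, 7, 7), strides=(1, 2, 2),
                               padding="same", activation="relu")(inp)
    x = tf.keras.layers.MaxPool3D((1, 2, 2))(x)
    # Stand-in for the 2D ResNet: per-frame convolutions halving H and W.
    for filters in (128, 256, 512):
        x = tf.keras.layers.TimeDistributed(
            tf.keras.layers.Conv2D(filters, 3, strides=2,
                                   padding="same", activation="relu"))(x)
    # Average-pool the remaining spatial grid: (T, h, w, 512) -> (T, 512).
    x = tf.keras.layers.TimeDistributed(
        tf.keras.layers.GlobalAveragePooling2D())(x)
    return tf.keras.Model(inp, x)

model = visual_frontend()
print(model.output_shape)  # (None, 75, 512): one feature vector per frame
```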

E. VISEME CLASSIFIER
Lip reading datasets consist of labels in the form of subtitles. These subtitles are strings of words that need to be converted to sequences of visemes to provide labels for the viseme classifier. The conversion is performed in two stages: first, the words are mapped to phonemes using the Carnegie Mellon Pronouncing Dictionary [40], and then the phonemes are mapped to visemes according to Lee and Yook's approach [29]. Table 4 shows the mapping. The attention transformer, which predicts the spoken visemes of a person speaking in a silent video, uses 17 classes in total: the 13 visemes, a space character, start of sentence (SoS), end of sentence (EoS) and a padding character. All the defined classes are listed in Table 5. All label sequences are padded to a length of 180 characters.
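As an illustration of this two-stage conversion, the sketch below maps a subtitle string to a viseme label sequence. The dictionary fragments and viseme symbols are illustrative stand-ins; the full tables come from the CMU dictionary [40] and Lee and Yook's mapping [29].

```python
# Illustrative fragments only; the complete tables are given by [40] and [29].
CMU_DICT = {
    "HELLO": ["HH", "AH", "L", "OW"],
    "WORLD": ["W", "ER", "L", "D"],
}
PHONEME_TO_VISEME = {
    "HH": "K", "AH": "AH", "L": "T", "OW": "AO",
    "W": "W", "ER": "ER", "D": "T",
}

def word_to_visemes(word):
    phonemes = CMU_DICT[word.upper()]                 # stage 1: word -> phonemes
    return [PHONEME_TO_VISEME[p] for p in phonemes]   # stage 2: phonemes -> visemes

def sentence_to_labels(sentence):
    """Convert a subtitle string to viseme labels, with a space token
    separating the viseme cluster of each word."""
    clusters = [word_to_visemes(w) for w in sentence.split()]
    labels = []
    for i, cluster in enumerate(clusters):
        labels.extend(cluster)
        if i < len(clusters) - 1:
            labels.append("<sp>")
    return labels

print(sentence_to_labels("hello world"))
# ['K', 'AH', 'T', 'AO', '<sp>', 'W', 'ER', 'T', 'T']
```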
The Transformer [39] model has an encoder-decoder structure with multi-head attention layers as building blocks. The encoder is a stack of self-attention layers in which the input tensor serves as the attention queries, keys and values at the same time. The decoder here consists of three fully connected layer blocks structured as shown in Figure 6; each fully connected layer block consists of a dense layer, batch normalisation, a rectified linear unit (ReLU) activation and a dropout layer with probability 0.1. The dense layer within the middle fully connected block has 2048 nodes, while the dense layers within the first and last fully connected blocks contain 1024 nodes. The decoder produces character probabilities which are directly matched to the ground truth labels and trained with a cross-entropy loss. The encoder follows the base model of [39], with 6 layers, model size 512, 8 attention heads and dropout with probability 0.1.
However, it should be noted that the decoder utilised in this work follows a completely different structure from that of [8] for the following reasons: 1) There are no embeddings; 2) The predicted labels from the previous timestep are not fed into the decoder, as it is assumed that visemes do not have the conditional probability relationship that ASCII characters have; this means that no teacher forcing is used, whereby the ground truth of the previous decoding step would be supplied as the input to the decoder; and 3) Only the decoder and dense layers differ, so the trained weights from [8] have been applied to both the visual frontend and the encoder, and only the decoder layers and dense layers are trained. Because the encoder has a topology identical to that used in [8], the trained weights from their model have been applied here, and it is only the decoder and the final softmax layer in Figure 6 that are trained. During the training phase, the Adam optimiser [43] is used with default parameters and an initial learning rate of $10^{-3}$, reduced on plateau down to $10^{-4}$. All operations are implemented in TensorFlow and trained on a single GeForce GTX 1080 Ti GPU with 11 GB of memory.
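A minimal sketch of this decoder head in TensorFlow is given below, assuming 512-dimensional encoder outputs as input; the fully connected blocks follow the dense, batch normalisation, ReLU and dropout(0.1) structure of Figure 6, with 1024, 2048 and 1024 nodes, and a final softmax over the 17 viseme classes. Function names are our own.

```python
import tensorflow as tf

def fc_block(units):
    """One fully connected layer block of Figure 6:
    dense -> batch normalisation -> ReLU -> dropout(0.1)."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(units),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Dropout(0.1),
    ])

def viseme_decoder(num_classes=17, d_model=512):
    """Decoder head: three FC blocks (1024, 2048, 1024 nodes) applied per
    time step, then a softmax over the viseme classes."""
    inp = tf.keras.Input(shape=(None, d_model))  # encoder output, one vector per frame
    x = fc_block(1024)(inp)
    x = fc_block(2048)(x)
    x = fc_block(1024)(x)
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inp, out)

decoder = viseme_decoder()
decoder.compile(optimizer=tf.keras.optimizers.Adam(1e-3),  # initial learning rate as above
                loss="sparse_categorical_crossentropy")
decoder.summary()
```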

F. WORD DETECTOR
The visemes output by the viseme classifier need to be further converted to meaningful sentences, or strings of words. Every word in a sentence contains a set of visemes and can therefore be mapped to a cluster of visemes, such that a cluster of visemes is the set of visemes which make up a word. Once visemes have been classified, the viseme-to-word conversion needs to be performed. Because a cluster of visemes can map to several different words, the combination of words uttered by the speaker still needs to be deciphered; the solution is to select the most likely combination of words. The general procedure for converting visemes to words, with its different stages, is given in Figure 7. The first stage of Word Detection is the Word Lookup stage. Every cluster of visemes needs to be mapped to a set of words containing those visemes according to the mapping given by the Carnegie Mellon Pronouncing (CMU) Dictionary. However, if there is a cluster for which no match is found, the cluster in the dictionary that most closely resembles it is used instead, and the words mapping to that cluster are used. Resemblance is determined using the Levenshtein distance [21], and the cluster in the CMU dictionary with the smallest distance is chosen.
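The sketch below illustrates this lookup stage, including the Levenshtein fallback when a cluster has no exact dictionary match. The dictionary entries and viseme symbols are toy assumptions; the real table is derived from the CMU dictionary.

```python
def levenshtein(a, b):
    """Edit distance between two viseme sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def lookup_words(cluster, viseme_dict):
    """Return the candidate words for a viseme cluster; if no exact match
    exists, fall back to the closest cluster by Levenshtein distance."""
    key = tuple(cluster)
    if key in viseme_dict:
        return viseme_dict[key]
    nearest = min(viseme_dict, key=lambda k: levenshtein(cluster, k))
    return viseme_dict[nearest]

# Toy dictionary: viseme cluster -> candidate words (illustrative entries).
viseme_dict = {("K", "AH", "T"): ["cat", "cut"], ("T", "AO"): ["dough", "toe"]}
print(lookup_words(["K", "AH", "P"], viseme_dict))  # nearest cluster -> ['cat', 'cut']
```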
Once the word lookup stage is performed, the next stage of Word Detection is the Perplexity Calculations. The different possible choices of words that map to the visemes are combined, and perplexity iterations are performed to determine which combination of words is most likely to correspond to the uttered sentence, given the visemes recognised. Naturally, the sentence that is most grammatically correct will have the highest likelihood [44], and perplexity is one metric that can be used to compare sentences and determine which is most grammatically sound. The rationale behind perplexity is discussed later, along with a more detailed description of how perplexity analysis is used to convert visemes to words. In this paper the following rules, based on determining which combination of words has the greatest likelihood according to probabilistic information theory, are used when predicting sentences (a sketch of the resulting search procedure is given after this list):
1) If a viseme sequence has only 1 cluster matching one word, that word is selected as the output.
2) If a viseme sequence has only 1 cluster matching several words, the word with the largest expectation is selected as the output.
3) If a viseme sequence has more than 1 cluster, the words matching the first two clusters are combined in every possible combination for the first iteration.
a) The combinations with the lowest 50 perplexity scores are kept.
b) These combinations are in turn combined with the words matching the next viseme cluster.
c) The combinations with the lowest 50 perplexity scores are kept, and the iterations continue for the remaining clusters until the end of the sequence is reached.
The selection of the lowest 50 perplexity scores at each iteration is an implementation of a local beam search with width 50. In practice, an exhaustive search would be computationally expensive, so a beam search has been implemented to reduce the computational overhead; the beam width is an arbitrary figure chosen as a compromise between accuracy and computational efficiency.
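The following is a compact sketch of this beam search, under the assumption that `viseme_dict` maps viseme-cluster tuples to candidate words and that `perplexity` is a callable scoring a word sequence (in the full system, the GPT language model described below); both names are illustrative.

```python
def beam_search_sentences(clusters, viseme_dict, perplexity, beam_width=50):
    """Expand candidate sentences cluster by cluster, scoring each partial
    sentence with `perplexity` and keeping the `beam_width` lowest-scoring
    combinations at every iteration (a local beam search)."""
    beams = [[]]  # each beam is a partial sentence as a list of words
    for cluster in clusters:
        candidates = viseme_dict[tuple(cluster)]  # word lookup stage
        expanded = [beam + [word] for beam in beams for word in candidates]
        expanded.sort(key=perplexity)             # lowest perplexity first
        beams = expanded[:beam_width]
    return " ".join(beams[0])

# Toy example with a stand-in scorer (here: total sentence length).
toy_dict = {("T", "AO"): ["toe", "dough"], ("K", "AH", "T"): ["cat", "cut"]}
score = lambda words: len(" ".join(words))  # placeholder for the GPT scorer
print(beam_search_sentences([("T", "AO"), ("K", "AH", "T")], toy_dict, score))
```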
Eqs. 1 to 4 below describe the probabilistic relationship between the observed visemes and the words spoken, where $V$ is the spoken sequence of viseme clusters, $v_i$ corresponds to every $i$th cluster, $W_C$ represents any given combination of words and $w_i$ corresponds to every $i$th word within the string of words. The string of words $\hat{W}$ that is to be selected will be the combination with the maximum likelihood given the identity of the viseme clusters, over every combination $C$ within the set of combinations $C^*$. The sequence of viseme clusters given in Eq. 1 maps to any possible combination of words as given in Eq. 2, and the solution to predicting the spoken sentence is the combination of words, given the recognised visemes, with the greatest probability, as expressed in Eqs. 3 and 4:
$$V = (v_1, v_2, \ldots, v_M) \quad (1)$$
$$W_C = (w_1, w_2, \ldots, w_N) \quad (2)$$
$$\hat{W} = \arg\max_{C \in C^*} P(W_C \mid V) \quad (3)$$
$$\hat{W} = \arg\max_{C \in C^*} P(w_1, w_2, \ldots, w_N \mid v_1, v_2, \ldots, v_M) \quad (4)$$
If the identity of the observed visemes is known, the probability of the viseme sequence in Eq. 1 is equal to 1, resulting in the expression in Eq. 5:
$$P(v_1, v_2, \ldots, v_M) = 1 \quad (5)$$
The choice of words predicted according to Eq. 4 then reduces to the expression given in Eq. 6.
$$\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_N = \arg\max_{C \in C^*} \left[ P(w_1, w_2, \ldots, w_N) \right]_C \quad (6)$$
Eqs. 7 to 10 below describe the relationship between the perplexity $PP$, the entropy $H$ and the probability $P(w_1, w_2, \ldots, w_N)$ of a particular sequence of $N$ words $(w_1, w_2, \ldots, w_N)$. The word detector consists of a trained attention-based transformer for calculating $PP$, expressed as the exponentiation of $H$ in Eq. 7. The per-word entropy $\hat{H}$ is related to the probability $P(w_1, w_2, \ldots, w_N)$ of words $(w_1, w_2, \ldots, w_N)$ belonging to a vocabulary set $W$, and is calculated as a summation over all possible sequences of words. If the source is ergodic, the expression for $\hat{H}$ in Eq. 8 reduces to that in Eq. 9. The value of $P(w_1, w_2, \ldots, w_N)$ resulting in the choice of words selected as the output of Eq. 6 also minimises the entropy in Eq. 9, which in turn minimises the perplexity given in Eq. 10:
$$PP = 2^{\hat{H}} \quad (7)$$
$$\hat{H} = -\lim_{N \to \infty} \frac{1}{N} \sum_{w_1, \ldots, w_N \in W} P(w_1, \ldots, w_N) \log_2 P(w_1, \ldots, w_N) \quad (8)$$
$$\hat{H} = -\frac{1}{N} \log_2 P(w_1, w_2, \ldots, w_N) \quad (9)$$
$$PP = P(w_1, w_2, \ldots, w_N)^{-\frac{1}{N}} \quad (10)$$
A language model, i.e., a probability distribution over sequences of words, can be measured on the basis of the entropy of its output from the field of information theory [42]. Perplexity is a measure of the quality of a language model, because a good language model will generate sequences of words with a larger probability of occurrence resulting in a smaller perplexity.
The Transformer model used for the word detector is the pre-trained Generative Pre-Training (GPT) Transformer [41], a multi-layer decoder and a variant of the transformer used in [39]. It consists of repeated blocks of multi-headed self-attention followed by position-wise feedforward layers. The architecture is typically used for sentence prediction; here, however, it is not used for direct classification but for the perplexity calculations required for word selection, where visemes are converted to words. Visemes from the previous step are sequentially matched to words, and the most probable sentence is chosen as the one with the minimum perplexity score. The perplexity score is calculated by taking the exponentiation of the cross-entropy loss when the GPT is evaluated on a sentence, and as in [27], a beam width of 50 has been used.
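As a sketch of this scoring step, the snippet below computes a sentence's perplexity as the exponentiation of the language model's cross-entropy loss. The paper's implementation is in TensorFlow; here, purely for illustration, the Hugging Face transformers PyTorch port of GPT is assumed rather than the authors' own model.

```python
import torch
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt").eval()

def sentence_perplexity(sentence):
    """PP = exp(H), where H is the mean per-token cross-entropy of the
    language model evaluated on the sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()

# A fluent sentence should score lower than a homopheme-style garbling.
print(sentence_perplexity("nice to meet you"))
print(sentence_perplexity("gneiss two meat ewe"))
```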

G. SYSTEM PERFORMANCE MEASURES
The measures used to evaluate the lip reading sentence system are edit distance-based metrics, computed by calculating the normalised edit distance between the ground truth and a predicted sentence. Metrics reported in this paper include the Viseme Error Rate (VER), Character Error Rate (CER), Word Error Rate (WER) and Sentence Accuracy Rate (SAR).
Error rate metrics used for evaluating accuracy are calculated from the overall edit distance, comparing the decoded speech to the actual speech to determine misclassifications. The equation for calculating the Error Rate (ER) is given in Eq. 11, with $N$ being the total number of characters in the ground truth, $S$ the number of characters substituted for wrong classifications, $I$ the number of characters inserted for those not picked up, and $D$ the number of deletions made for decoded characters that should not be present:
$$ER = \frac{S + D + I}{N} \quad (11)$$
CER, WER and VER are all calculated this way, with the expressions given in Eqs. 12, 13 and 14, where $C$, $W$ and $V$ correspond to characters, words and visemes respectively.
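Since the minimum total of substitutions, deletions and insertions is exactly the Levenshtein distance, Eq. 11 can be computed directly with the `levenshtein` helper from the word lookup sketch above; the same function yields WER, CER or VER depending on whether the sequences are words, characters or visemes.

```python
def error_rate(reference, hypothesis):
    """ER = (S + D + I) / N (Eq. 11): the Levenshtein distance between the
    ground truth and the prediction, normalised by the reference length N.
    Reuses levenshtein() from the word lookup sketch."""
    return levenshtein(reference, hypothesis) / len(reference)

ref, hyp = "nice to meet you".split(), "nice to eat you".split()
print(f"WER = {error_rate(ref, hyp):.2f}")                  # 1 sub / 4 words = 0.25
print(f"CER = {error_rate(list('cat'), list('cut')):.2f}")  # 1 sub / 3 chars = 0.33
```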
SAR is a binary metric as expressed in Eq. 15, where the value is 1 if the predicted sentence $P_P$ is equal to the ground truth $P_T$, and 0 otherwise:
$$SAR = \begin{cases} 1, & P_P = P_T \\ 0, & P_P \neq P_T \end{cases} \quad (15)$$

H. ILLUMINATION
To test the proposed lip reading system's robustness to changes in lighting, the overall architecture, once trained, has been evaluated on videos from the testing set under varying levels of illumination. Illumination has been varied by adjusting the pixel brightness; it is applied to the image frames after the video sampling stage of the pre-processing described in Section III-C. The overall process is described in Figure 8. Image frames of videos from the dataset consist of red, green and blue pixel components with numerical values ranging from a minimum intensity of 0 to a maximum intensity of 255. Pixel normalisation is the first stage of the procedure and involves minimum-maximum normalisation, whereby pixel values are mapped from the range [0, 255] to [0, 1]. Once this is done, a gamma correction is applied, where pixel values are corrected according to Eq. 16, with $I$ being a matrix of pixels, $\gamma$ a scalar value and $O$ the resulting matrix of pixels after the gamma correction has been applied:
$$O = I^{\frac{1}{\gamma}} \quad (16)$$
Values of $\gamma$ less than 1.0 cause images to darken, whereas values of $\gamma$ greater than 1.0 cause images to brighten. Figure 9 gives examples, with the standard image ($\gamma = 1.0$) on the left, the darkened image ($\gamma = 0.5$) in the middle and the brightened image ($\gamma = 1.5$) on the right. The gamma corrections applied in this paper have utilised $\gamma$ values ranging from 0.5 to 1.5.
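The gamma correction procedure amounts to a few lines of NumPy; the sketch below normalises a frame, applies Eq. 16 and rescales, with the function name being our own.

```python
import numpy as np

def apply_gamma(frame, gamma):
    """Gamma correction per Eq. 16: min-max normalise the uint8 frame to
    [0, 1], raise to the power 1/gamma, then rescale to [0, 255].
    gamma < 1.0 darkens the image; gamma > 1.0 brightens it."""
    norm = frame.astype(np.float32) / 255.0    # [0, 255] -> [0, 1]
    corrected = np.power(norm, 1.0 / gamma)    # Eq. 16
    return (corrected * 255.0).astype(np.uint8)

frame = np.random.default_rng(0).integers(0, 256, size=(160, 160, 3), dtype=np.uint8)
dark, bright = apply_gamma(frame, 0.5), apply_gamma(frame, 1.5)
print(frame.mean(), dark.mean(), bright.mean())  # dark < original < bright
```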

IV. EXPERIMENTS AND RESULTS
For training and evaluation of the viseme classifier, the BBC LRS2 dataset described in III-B has been used with 45839 sentences for training and 1243 sentences for testing. All components of the model are evaluated on the LRS2 test set. The metrics reported include VER, CER, WER, SAR and the total overall training time.
The viseme classifier was trained for a total of 2000 epochs; the model was evaluated at the point where the validation loss became saturated and no further convergence was recorded. Plots of the loss and VER for both training and validation are given in Figures 10 and 11. The results are summarised in Table 6. As shown in the Table, the overall WER of 35.4% is a reduction of almost 15% compared to the 50% achieved by a previous state-of-the-art model trained and evaluated on the same dataset, improving the overall word accuracy to 64.6%. The accuracy by visemes was also very high, with a VER of only 4.6%. The confusion matrices by visemes and by ASCII characters are given in Figures 12 and 13, respectively. Table 7 gives the performance metrics for the proposed lip reading system and Afouras et al.'s model [8] when videos in the validation set were subjected to different levels of illumination, applied in accordance with Section III-H. It can be seen that the proposed lip reading system is generally robust to varying levels of illumination, like that of Afouras et al. [8]; this is expected, given that videos in the BBC LRS2 corpus were recorded in varying lighting conditions.
In order to attain a good overall accuracy for the classification of words, both the viseme classification performance and the viseme-to-word conversion performance need to be good. The VER is very low, and the misclassifications that occurred during the validation phase appear to be influenced by the class imbalance of visemes in the training data. When visemes are misclassified, they are most likely to be decoded as one of ''AH'', ''K'' or ''T'', because these visemes appear most frequently in the training data, while rarer classes such as ''AA'' and ''CH'' are the most likely to be misclassified. Table 8 gives examples of sentences from the BBC LRS2 dataset along with the decoded visemes, the word combinations output at each iteration of the perplexity calculations, and the viseme clusters corresponding to each predicted word. Table 9 gives the full details of how those sentences were decoded by listing their corresponding visemes, the predicted visemes, the decoded sentences and the corresponding metric results.
A stratified sampling strategy was used to select the 154 most frequently appearing words in the BBC LRS2 training set beginning with each letter of the alphabet. For these 154 words, the accuracy, in terms of the ratio of how many times a word was correctly decoded to how many times it appeared in the testing phase, is presented in Figures 14 and 15. Figure 14 shows the word accuracy for Afouras et al.'s model and Figure 15 shows the accuracy for this lip reading system; a better word precision is noticeable in Figure 15.
It should be noted that, whilst the VER was low, the WER was still high, although significantly improved compared to other existing works. To further reduce the error rate, the viseme-to-word conversion would need to be optimised. Many misclassifications have been caused by the presence of local optima during the local beam search, whereby at some iterations over the viseme sequence during the perplexity calculation stage, the words that make up the ground truth are not included within the top 50 results. A larger beam width would invariably result in a greater conversion rate, but at the expense of more computational overhead, and an exhaustive search would not be viable. Further work needs to be done to ensure that the globally optimal combination is selected more frequently during the Perplexity Calculation stage, to further improve word accuracy.

V. CONCLUSION
A neural network-based lip reading system has been developed to predict sentences covering a wide range of vocabulary from silent videos of people speaking. The system is lexicon-free, uses only visual cues represented by visemes as a limited number of distinct lip movements, and is robust to different levels of lighting. Verified on the BBC LRS2 dataset, the system has demonstrated a significant improvement in the classification accuracy of words compared to state-of-the-art works.
Future research includes investigating a more suitable neural network architecture in order to enable the system to have a good generalisation capability with a higher ratio of the number of training samples to the number of test samples.
In addition, an efficient conversion of visemes to words is crucial when using visemes as the classification schema for lip reading sentences. As shown in the experiments, although the classification accuracy of visemes achieved by the proposed system was very high (over 95%), the classification accuracy of words dropped significantly after the conversion (65.5%). As such, it is important to explore other possible approaches for the conversion. For perplexity analysis-based conversion, different global optimisation methods need to be considered while also limiting the required computational overhead.