Lip Reading in Cantonese

Lip reading aims at recognizing text from a talking face without audio information. Due to the rapid development of deep learning techniques, researchers have made great breakthroughs in both word-level and sentence-level English lip reading in recent years. Unlike English, Chinese is a tonal language, which makes it difficult to distinguish lexical meanings from visual information alone. In addition, most existing Chinese lip reading datasets are designed for Mandarin; there are few for Cantonese. In this paper, we propose a word-level Cantonese lip reading dataset called CLRW, which contains 800 word classes with 400,000 samples. For better practical applicability, we do not restrict gender, age, pose, lighting conditions, or speech speed, so that CLRW is closer to the real scene distribution. We first give a detailed description of the data collection process. Next, we propose a novel two-branch network, named TBGL, which consists of a global branch and a local branch. The global branch models the whole lip, and the local branch divides the feature into three parts to focus on subtle local lip motion. We jointly train the two branches and achieve comparable performance on LRW, CAS-VSR-W1K, and CLRW, respectively. Finally, we benchmark our dataset and comprehensively analyze the results, which demonstrate that CLRW is challenging and will have a positive impact on further Cantonese lip reading research.

... hearing-impaired people, analysis of silent movies, biometric authentication in video authentication systems [1], and so on.

... Unlike English, Chinese is a tonal language, and tone is used to distinguish lexical and grammatical meanings [2]. In addition, there are many homophones in Chinese, and the lip shapes of homophones are similar, which significantly increases the ambiguity of lip reading. Although Chinese is the most spoken language in the world, there are very few works on Chinese lip reading. Yang et al. [3] present the only publicly available large-scale benchmark for word-level Chinese Mandarin lip reading, named CAS-VSR-W1K, which contains 1,000 classes with 718,018 samples from more than 2,000 individual speakers on CCTV programs.

Meanwhile, we should not focus only on Chinese Mandarin lip reading; Chinese dialects are also of great significance. As the only Chinese dialect with a complete writing system that has been publicly studied at home and abroad, Cantonese can be expressed entirely in Chinese characters. It is widely used in Guangdong province, Guangxi province, Hong Kong, Macao, and overseas Chinese communities, and is spoken by 120 million people all over the world, which makes the research of Cantonese lip reading of great significance [4]. As a branch of Chinese, Cantonese is quite different from Chinese Mandarin in terms of grammar, phonetics, ...

In the next step, we propose a two-branch model, named TBGL, which contains a global branch and a local branch. We introduce a bidirectional knowledge distillation loss for jointly training the two branches. Finally, we evaluate our methods on LRW, CAS-VSR-W1K, and CLRW, respectively, and demonstrate their effectiveness.

In this section, we first introduce the pronunciation rules of Cantonese and then give an overview of current mainstream lip reading datasets and some popular lip reading methods.

Cantonese is widely used in Guangdong province, Guangxi province, Hong Kong, Macao, and overseas Chinese communities. With 120 million speakers all over the world, research on Cantonese lip reading is of great significance.

Pinyin is a tool used to assist the pronunciation of Chinese characters, consisting of initials, finals, and tones. Initials are also called consonants and are used before finals. Different initials and finals are combined to form different syllables. The main feature of consonants is that the airflow in the mouth is hindered in various ways during pronunciation, so the degree of facial muscle force and the shape of the mouth differ when different initials are pronounced. Therefore, we can infer the content of speech from visual information such as mouth shape changes, facial muscle changes, and jaw movements.

Chinese is a monomorphic language, and its words do not have strict morphological changes. The syllables of Chinese are composed of initials, finals, and tones according to the rules of Pinyin. At the same time, Cantonese has a relatively ... Pinyin are shown in Tab. 1.

As a data-driven task, lip reading systems are inevitably influenced by the available data, which should contain rich vocabulary, multiple viewpoints, and complex backgrounds to closely approximate the real scene distribution. Tab. 2 lists some of the major lip reading datasets that have contributed significantly to lip reading.

Early lip reading datasets were designed for classifying isolated speech segments in the form of digits and letters. AVICAR [8], proposed in 2004, was recorded in a moving car; its 100 speakers were each asked to speak 10 digits, captured from four different views. AVLetters [9] was released in 2002 and consisted of 5 male and 5 female speakers, each of whom was asked to utter the isolated letters A to Z three times. These two corpora contributed significantly to the early development of lip reading.

However, the main reason early research focused on digit- and letter-level datasets is not the ease of relevant feature extraction, but the simplicity of such data collection. With the rapid development of deep learning technology, researchers have made breakthroughs in face detection: all the faces in a video can now be extracted within a few seconds, which greatly improves the efficiency of lip reading data collection. Researchers have therefore started to construct large-scale datasets, and some well-known large-scale lip reading datasets are summarized below.

The OuluVS1 [10] dataset was released in 2009, with 10 phrases spoken by 17 males and 3 females. Each utterance is repeated 9 times. It has been widely used in previous works. However, the average number of samples per class is merely 81.7, which is not enough to cover the varied conditions of realistic scenes. OuluVS2 [11] is an extension of OuluVS1. It also contains 10 phrases, but the number of speakers is increased to 52. The major highlight of OuluVS2 is that it contains 5 different viewpoints: frontal, profile, 30°, 45°, and 60°, which makes OuluVS2 a challenging dataset.

... have been used in lip reading, and more details can be found in Zhou's work [18]. With the rapid development of deep learning and the release of some well-known large-scale lip reading databases, researchers have started to pay more attention to applying deep learning methods to lip reading tasks. Noda et al. [19] were the first to use CNNs to extract features for lip reading, achieving better performance than traditional methods. However, 2D CNNs are limited for feature extraction over sequential inputs, even if dynamic frames are used instead of static features. To solve this problem, Stafylakis et al. [20] combined 2D and 3D CNNs by changing the first 2D convolution layer of ResNet-18 to a 3D convolution layer to extract robust features, with a Bi-LSTM based back-end network to explore temporal information. The combination of 3D-ResNet-18 and Bi-LSTM has since been preferred as a backbone network due to its considerable performance. Martinez et al. later replaced the recurrent back-end with a multi-scale temporal convolutional network (MS-TCN).

Some impressive lip reading methods design specific modules to address shortcomings of existing networks for efficient lip reading. For example, Sheng et al. [5] used 38 lip-reading-related points as the centers of patch sequences to model lip contour deformation and capture the motion of the mouth contour. They use an embedding layer to model semantic information that contains local motion and coordinate information, but they also inevitably lose spatial information for parts of the lip. Hao et al. [23] were the first to use the Temporal Shift Module in the residual branch of each residual block, which can effectively extract temporal information between adjacent frames without reducing spatial feature extraction. Xiao et al. [24] proposed a Deformation Flow Network that generates deformation flow to capture the motion information of faces. To provide complementary cues for lip reading, they use a bidirectional knowledge distillation loss to help the two branches learn from each other. Li et al. [25] proposed a new dual-stream lip reading model called Lip Slow-Fast (LSF) based on SlowFast networks. To obtain subtle lip motion features, two streams with different channel capacities are used to extract dynamic and static features, respectively.

Early deep learning models relied on massive amounts of data for training to prevent serious overfitting problems when ...

... one speaker facing the camera, and remove the other invalid scenes. We use the global histogram of the image to judge the switching between a single speaker and other scenes in the video and obtain rough single-speaker video clips. The global histogram counts the number of pixels in a frame at each gray level, and the difference between adjacent frames is calculated according to Formula 1.
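A standard form of this histogram-difference rate, consistent with the definitions that follow, is

$$d_i = \frac{\sum_{j=1}^{M} \left| H_{i+1}(j) - H_i(j) \right|}{\sum_{j=1}^{M} H_i(j)} \qquad (1)$$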
where $H_i(j)$ represents the value of the histogram at level $j$ in the $i$-th frame, and $M$ is the total number of histogram levels. We mark locations where the rate of change is greater than 0.5 as shot boundaries to obtain rough single-speaker video clips. Finally, we manually clip the video to generate valid video samples, each of which should contain a complete sentence and a single speaker.
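The shot-boundary step above can be sketched with OpenCV as follows; this is a minimal illustration in which the 0.5 threshold comes from the text, while the helper name and frame handling are our own:

```python
import cv2
import numpy as np

def shot_boundaries(video_path, levels=256, threshold=0.5):
    """Flag frames where the gray-level histogram change rate exceeds threshold."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # H_i(j): number of pixels at each of the M gray levels
        hist = cv2.calcHist([gray], [0], None, [levels], [0, 256]).ravel()
        if prev_hist is not None:
            # Change rate d_i between adjacent frames (Formula 1)
            rate = np.abs(hist - prev_hist).sum() / prev_hist.sum()
            if rate > threshold:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```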

We download videos from Bilibili, YouTube, Guangzhou Radio and Television, TVB, and many other websites. The video and audio streams inevitably fall out of sync in the process of repeated encoding. We first manually filter out video samples whose audio and video are obviously out of sync. For slightly out-of-sync videos, we introduce the SyncNet model [26] to solve this problem, as shown in Fig. 3. We take 5 frames of the video sequence and the corresponding 0.2 s of MFCC features as input, then use a 3D VGG-M and a 2D VGG-M to extract visual and audio features, respectively. Within a range of ±15 frames, the model searches for the offset that minimizes the L2 distance between the fc6-layer features of the two streams (see the sketch at the end of this subsection). We calculate the offset of each video sample and average the distances among these video samples as a basis for synchronization. If the offset is greater than ±7 frames, we discard the video sample.

There are few Cantonese programs with precise subtitles, so we use the audio stream to generate annotations. In this paper, the iFLYTEK commercial-grade Cantonese speech transcription service is used to obtain the text content, word segmentation results, and timestamps of the valid video samples. We transcribed the audio-video-synchronized samples to obtain audio information. However, the transcription process is not word-for-word and may have pauses ...

The sample lengths are distributed between 0.01 s and 2 s, with an average of 0.25 s. This is because many words and modal particles are inherently short in pronunciation. Moreover, different speakers have different speech rates, so the lengths of different samples in the same word class usually differ. The ratio of samples in the training, test, and validation sets is 8:1:1. To ensure the validity of the dataset, no duplicate data or speakers are shared between subsets. The sample length distribution is shown in Fig. 5.
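A minimal sketch of the offset search used in the synchronization step above, assuming per-frame visual and audio embeddings have already been extracted by the two SyncNet streams; the function and array names here are our own illustration:

```python
import numpy as np

def av_offset(visual_feats, audio_feats, max_shift=15, reject_at=7):
    """Search offsets in [-max_shift, max_shift] for the shift that minimizes
    the mean L2 distance between visual and audio feature streams (each (T, D)).
    Returns None when the best offset exceeds the rejection threshold."""
    best_offset, best_dist = 0, np.inf
    n = min(len(visual_feats), len(audio_feats))
    for shift in range(-max_shift, max_shift + 1):
        # Align the two streams under the candidate shift
        v_lo, a_lo = max(0, shift), max(0, -shift)
        length = n - abs(shift)
        if length <= 0:
            continue
        v = visual_feats[v_lo:v_lo + length]
        a = audio_feats[a_lo:a_lo + length]
        dist = np.linalg.norm(v - a, axis=1).mean()
        if dist < best_dist:
            best_offset, best_dist = shift, dist
    # Discard samples whose offset exceeds +/- reject_at frames
    return None if abs(best_offset) > reject_at else best_offset
```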

In this section, we first describe the overall pipeline of lip reading models. Then we illustrate the design motivation of our two-branch network. Finally, we introduce a bidirectional knowledge distillation loss for jointly training the two branches.

An overview of the pipeline is shown in Fig. 6. The pipeline consists of a front-end network and a back-end network. The front-end network is designed to extract visual spatiotemporal features representing lip dynamics. The back-end decodes the feature sequences and predicts the probability of each word class.

For our front-end network, we first feed the lip image sequences into a 3D CNN layer to perform an initial spatiotemporal alignment of the sequence. Then we compact the features in the spatial domain with spatial max pooling. Next, we employ a ResNet-18 module to extract discriminative features. The features obtained from the ResNet-18 module are fed into a global average pooling layer for further compression. To fully exploit the global spatial information and the subtle local lip motion, we propose a two-branch back-end network, which consists of a global branch and a local branch.
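A minimal PyTorch sketch of this front-end, assuming the common lip reading configuration (grayscale input, a 5×7×7 3D convolution stem, and a per-frame ResNet-18 trunk); the exact kernel sizes and strides are our assumption, not taken from the paper:

```python
import torch.nn as nn
from torchvision.models import resnet18

class FrontEnd(nn.Module):
    def __init__(self):
        super().__init__()
        # 3D stem: initial spatiotemporal alignment of the lip sequence
        self.stem = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Per-frame ResNet-18 trunk (its own stem and classifier are dropped)
        trunk = resnet18()
        self.resnet = nn.Sequential(trunk.layer1, trunk.layer2, trunk.layer3, trunk.layer4)
        self.gap = nn.AdaptiveAvgPool2d(1)  # global average pooling

    def forward(self, x):           # x: (B, 1, T, H, W)
        x = self.stem(x)            # (B, 64, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)
        x = self.resnet(x)          # (B*T, 512, h'', w'')
        x = self.gap(x).flatten(1)  # (B*T, 512)
        return x.reshape(b, t, -1)  # (B, T, 512) feature sequence
```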

Global branch: we take the whole feature as input, aiming at modeling the global motion information of the lip. The global branch back-end contains an MS-TCN, which increases the receptive field to mix short-term and long-term information during feature encoding.

Local branch: we want to enhance the robustness of the system and optimize the fitting of the model by learning ...

FIGURE 6. The overview of the proposed model. Given an input image sequence, we first send it into the front-end to extract visual spatiotemporal features. Then, we use a two-branch back-end: the global branch of TBGL models the whole motion of the lip, and the local branch divides the feature into three parts to focus on subtle local lip motion. Finally, we employ a bidirectional knowledge distillation loss to provide additional supervision for jointly training the two branches of TBGL.

The final loss of the local branch averages the cross-entropy over the lip parts:

$$L_{ce\_l} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{d=1}^{D} y_d \log p_d^n$$

where $L_{ce\_l}$ stands for the final loss of the local branch, $N$ is the number of lip parts ($N = 3$), $D$ is the total number of classes, $p_d^n$ is the predicted probability for the $d$-th class from the $n$-th local branch, and $y_d$ is the target label.
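A minimal PyTorch sketch of this averaged per-part cross-entropy; the tensor shapes and names here are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def local_branch_loss(part_logits, target):
    """part_logits: list of N=3 tensors of shape (B, D), one per lip part.
    target: (B,) class indices. Averages the cross-entropy over the parts."""
    losses = [F.cross_entropy(logits, target) for logits in part_logits]
    return torch.stack(losses).mean()
```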

B. THE BIDIRECTIONAL KNOWLEDGE DISTILLATION LOSS
Fusion strategies for two-branch networks have been widely used in video analysis tasks for many years. In this section, we employ a bidirectional knowledge distillation loss to provide additional supervision for jointly training the two branches of TBGL. The outputs of the fully connected layers of the global branch are denoted as $z_g$. We set the mean of the outputs of the fully connected layers of the three parts of the local branch as $z_l$. We then obtain the predicted probability distributions over all classes, $q_g$ and $q_l$, as

$$q_g = \operatorname{softmax}(z_g / T), \qquad q_l = \operatorname{softmax}(z_l / T)$$

where $T$ is a parameter known as temperature; we set $T$ to 20 in this work. The knowledge distillation loss can be defined as

$$L_{kd}(q_t, q_s) = \sum_{d=1}^{D} q_t^d \log \frac{q_t^d}{q_s^d}$$

where $q_t$ and $q_s$ denote the soft probability distributions ...

... is put at the center to make the data provide more context and improve the performance of lip reading. The sample length of LRW has already been fixed at 29 frames, with the target word in the center.
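A minimal PyTorch sketch of the bidirectional distillation described in Sec. B, under the reconstruction above; the function and variable names are illustrative:

```python
import torch.nn.functional as F

def bidirectional_kd_loss(z_g, z_l, T=20.0):
    """Bidirectional knowledge distillation between the global branch logits z_g
    and the (averaged) local branch logits z_l, each of shape (B, D)."""
    q_g = F.softmax(z_g / T, dim=1)
    q_l = F.softmax(z_l / T, dim=1)
    log_q_g = F.log_softmax(z_g / T, dim=1)
    log_q_l = F.log_softmax(z_l / T, dim=1)
    # KL(q_g || q_l): the local branch learns from the global branch, and vice versa
    kd_g_to_l = F.kl_div(log_q_l, q_g, reduction="batchmean")
    kd_l_to_g = F.kl_div(log_q_g, q_l, reduction="batchmean")
    return kd_g_to_l + kd_l_to_g
```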

Our experiments are based on PyTorch, and the model is trained on a single NVIDIA 3090 GPU with 24 GB of memory. We use the Adam optimizer with default hyper-parameters, the initial learning rate η = 3e-4, a weight decay of 1e-4, and a batch size of 32. As for the weight of the bidirectional knowledge distillation loss, we set it to 100 at first and halve it every time the validation loss stagnates. We train for 80 epochs using a cosine scheduler, and the learning rate $\eta_t$ at epoch $t$ is calculated as follows:

$$\eta_t = \frac{1}{2}\,\eta\left(1 + \cos\frac{\pi t}{T_{\max}}\right)$$

where $T_{\max} = 80$ is the total number of epochs.

... TBGL models the whole motion of the lip, and the local branch divides the feature into three parts to focus on subtle local lip motion. We jointly train these two branches and achieve comparable performance on the three challenging datasets. Finally, we benchmark our dataset and divide the dataset by sample lengths.