An Efficient Two-Stream Network for Isolated Sign Language Recognition Using Accumulative Video Motion

Sign language is the primary communication medium for persons with hearing impairments. This language depends mainly on hand articulations accompanied by nonmanual gestures. Recently, there has been a growing interest in sign language recognition. In this paper, we propose a trainable deep learning network for isolated sign language recognition that can effectively capture the spatiotemporal information using a small number of sign frames. We propose a hierarchical sign learning module that comprises three networks: a dynamic motion network (DMN), an accumulative motion network (AMN), and a sign recognition network (SRN). Additionally, we propose a technique to extract key postures for handling the variations in sign samples performed by different signers. The DMN stream uses these key postures to learn the spatiotemporal information pertaining to the signs. We also propose a novel technique to represent the static and dynamic information of sign gestures in a single frame. This approach preserves the spatial and temporal information of the sign by fusing the sign's key postures in the forward and backward directions to generate an accumulative video motion frame. This frame is used as an input to the AMN stream, and the extracted features are fused with the DMN features and fed into the SRN for sign learning and classification. The proposed approach is efficient for isolated sign language recognition, especially for recognizing static signs. We evaluated this approach on the KArSL-190 and KArSL-502 Arabic sign language datasets, and the results obtained on KArSL-190 outperformed other techniques by 15% in the signer-independent mode. Additionally, the proposed approach outperformed the state-of-the-art techniques on the Argentinian sign language dataset LSA64. The code is available at https://github.com/Hamzah-Luqman/SLR_AMN.

These signs represent a majority of the sign words used in the sign language vocabulary [13]. Hence, a video stream is required to represent signs in which the motion component is essential. [...] The output of this stage can be isolated words or sentences, depending on the input provided. Isolated sign recognition systems accept a sign and output an equivalent word in a spoken language. Continuous sign language recognition systems identify a sequence of signs performed continuously and output a set of words in the form of sentences. These sentences have the structure and grammar of the source sign language, which are usually different from the structure and grammar of natural languages. Thus, a machine translation system is used to translate these sentences into the target natural language.

Several approaches have been proposed recently for sign language recognition. However, some limitations still need to be addressed. First, most sign recognition systems consider all of a sign's frames for sign learning and classification. This can degrade the recognition accuracy because of the variations between the signs performed by different signers. Therefore, an approach is needed to extract the main postures of the sign gesture and ignore less important postures. Second, most temporal learning techniques for dynamic sign gesture recognition cannot learn non-manual gestures efficiently. Third, few techniques have been proposed for ArSL recognition compared with other sign languages, which can be attributed mainly to the lack of a benchmark dataset. In this work, we aim to address these limitations. The main contributions of this research are as follows:

• We propose a trainable deep learning network for sign language recognition that can effectively capture the spatiotemporal information with few frames of the signs.

• We design a hierarchical sign learning model, which learns the spatial and temporal information of the sign gesture in three networks: dynamic motion network (DMN), accumulative motion network (AMN), and sign recognition network (SRN).

• We propose a technique to extract the dominant and important sign postures. This approach helps tackle the variations of the sign samples.

• We propose an accumulative video motion (AVM) technique to encode the sign motions in the video stream into a single image.

• We evaluated the proposed approach on the KArSL and LSA64 datasets and found that the proposed method outperformed other methods.

The rest of the paper is organized as follows. Section II reviews the related work dealing with sign language recognition. Section III presents the architecture of the proposed system, and Section IV describes the experimental work and the obtained results. Section V highlights the contributions of this research and concludes the paper.

Several techniques have been proposed in the last two decades for automatically recognizing various sign languages. Based on the acquisition devices, these techniques can be classified into two types: sensor-based and vision-based techniques [14]. Most [...] [20], [21], [22], [23]. Ritchings et al. [22] proposed a sign language recognition system using two bend sensors with push-button switches for motion tracking. This system was evaluated on 65 signs, and an accuracy of 93% was reported. The Dempster-Shafer theory of evidence was used by Mohandes and Deriche [23] to integrate the data obtained from a hand-tracking system and a glove sensor. A dataset consisting of 100 signs was collected using these sensors, and these signs were used to evaluate the proposed system.

[...] on extracting statistical and geometric features from sign gestures and feeding them into a classifier. Nai et al. [25] proposed a system to recognize ASL digits on depth images.

A set of statistical features was extracted and classified using the random forest classifier, and an accuracy of 89.6% was reported. Depth images were also used by Almeida et al. [26] for Brazilian sign language recognition. A set of seven structural features was extracted from these images and fed into support vector machines (SVM) for classification, obtaining an accuracy above 80% with 34 signs. Joshi et al. [27] applied a multilevel histogram of oriented gradients (HOG) for recognizing Indian sign language letters in complex backgrounds. An accuracy of 92% was reported on a dataset that consists of 26 signs. Nevertheless, this accuracy is low given the dataset size and the hand segmentation.

A hidden Markov model (HMM) was used by Zaki and Shaheen [28] for recognizing 50 signs of ASL. Principal component analysis (PCA) was used for feature reduction, and an accuracy of 89.1% was reported. PCA with linear discriminant analysis was also used by Pan et al. [29]. The extracted features were classified using SVM, and accuracies of 94% and 99.8% were reported on the 26 signs of ASL and CSL, respectively. Nguyen [...]

The key postures and AVM frame are fed into a novel two-stream network for sign language recognition. The key postures are fed into the DMN to learn the spatiotemporal information in the sign gesture. The AVM frame is used as an input to the AMN that learns the motion in the AVM image. The extracted features from the two streams are concatenated and fed into the SRN for learning the fused features and performing the classification. In this section, we start by describing the key frame and AVM extraction techniques. Then, we discuss the proposed two-stream sign learning architecture and the fusion technique.

Based on body motion, sign gestures can be classified into two types: static and dynamic. Static signs are motionless gestures, and they depend mainly on the shape, orientation, and articulation of the hands and fingers to convey meaning. By contrast, dynamic signs employ body movements during signing. Dynamic gestures represent a majority of the signs used in sign languages, whereas static gestures are used mainly for letters, digits, and a few sign words.

Dynamic gestures are more challenging to recognize than static gestures. The recognition of static gestures depends only on spatial information, whereas the recognition of dynamic gestures requires spatial and temporal information. An additional challenge for recognizing such signs is the gesture variations among the different signers of the sign. These variations are obvious with signs that consist of more than one posture. Another challenge with the recognition of dynamic gestures is the large number of generated frames, especially when sign gestures are recorded at high frame rates. Some of these frames are often redundant, which increases the recognition time of systems that process sign video frames for recognizing sign gestures. To address these problems, we extracted the key frames from each sign and used these frames as the input to the recognition system.

A key posture technique is used to extract the prominent sign postures in the sign video stream by selecting the corresponding frames in the sign's video stream. Inspired by [43], we extracted the key frames by employing the hand trajectories captured by tracking the hand joint points, which were returned by Kinect as part of the skeleton data. The points for the hand joints can have outliers that significantly impact the extraction of key postures. To address this problem, we preprocessed these joint points by smoothing the hand locations using a median filter to remove the outliers. For occluded hands or lost joints, Kinect V2 is efficient in joint estimation and provides skeletal tracking that is more robust to occlusions [48]. However, if this estimation is noisy or inaccurate, our median filter smooths it in the preprocessing stage. Then, we extracted the key frames by connecting the hand locations during signing to form a polygon, as shown in Figure 2. [...] We applied a polygon approximation algorithm. This algorithm measures the importance of each vertex by taking the product of its edge lengths and the angle between the edges at this vertex. As shown in Figure 2, the importance of the vertex V is computed as follows:

I(V) = L_AV × L_VB × θ,

where L_AV and L_VB are the lengths from the vertex V to the vertices A and B, respectively, whereas θ is the angle between the vertex V and the two adjacent segments. The process is applied to all polygonal vertices, and the least important vertex is removed. This reduction algorithm is iteratively repeated to recompute the importance of the remaining vertices until N vertices remain, as shown in Figure 3; this figure shows the raw hand trajectory and the trajectory obtained after applying the algorithm. This algorithm was applied to all the color videos to extract N key postures.

[...]

Sign language recognition is a time-series problem that depends on two sources of information for each sign gesture: spatial and temporal. The spatial information represents the sign using finger, hand, and body shapes and rotations. The temporal information represents the motion used by all dynamic signs. Motion is a primary component in sign language, and it involves changing the position and/or shape of the hands during gesturing.
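To make the key-posture extraction described above concrete, the following minimal sketch smooths the Kinect hand trajectory with a median filter and then iteratively removes the least important polygon vertex until N key frames remain. It assumes 2-D hand coordinates, keeps the trajectory endpoints, and uses illustrative names (e.g., extract_key_postures, n_keep); it is a sketch of the idea rather than the authors' implementation.

import numpy as np
from scipy.signal import medfilt

def vertex_importance(prev_pt, pt, next_pt):
    # Importance of a vertex: product of its two edge lengths and the angle at the vertex.
    a = prev_pt - pt
    b = next_pt - pt
    l1, l2 = np.linalg.norm(a), np.linalg.norm(b)
    cos_theta = np.clip(np.dot(a, b) / (l1 * l2 + 1e-8), -1.0, 1.0)
    return l1 * l2 * np.arccos(cos_theta)

def extract_key_postures(hand_xy, n_keep=18, kernel_size=5):
    # hand_xy: (T, 2) hand-joint trajectory from the skeleton data.
    # Returns the indices of the key frames that remain after the reduction.
    # Median-filter each coordinate to suppress outliers in the joint estimates.
    smoothed = np.stack([medfilt(hand_xy[:, d], kernel_size) for d in range(2)], axis=1)
    idx = list(range(len(smoothed)))
    # Iteratively drop the least important interior vertex until n_keep vertices remain.
    while len(idx) > n_keep:
        importances = [
            vertex_importance(smoothed[idx[i - 1]], smoothed[idx[i]], smoothed[idx[i + 1]])
            for i in range(1, len(idx) - 1)
        ]
        idx.pop(int(np.argmin(importances)) + 1)  # +1 because endpoints are kept
    return idx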

To learn and extract the spatial and temporal information from the key frames of the sign gesture, a combination of CNN and LSTM is applied. Figure 5 shows the architecture of the proposed network. CNNs have been extensively employed for several pattern recognition problems, and their efficiency in extracting spatial features is well established. We fine-tuned four pre-trained models (viz., VGG16 [50], Xception [51], ResNet152V2 [52], and MobileNet [53]) for extracting the spatial information from each key frame. These four models have been trained on ImageNet for large-scale image classification with 14,197,122 images and 21,841 subcategories. Although these models have been trained on the same dataset, the specifications and structure of each model made them fit well for different pattern recognition problems in the literature.
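As a hedged illustration of the DMN stream, the following tf.keras sketch combines a pre-trained MobileNet frame-level feature extractor with the stacked-LSTM head detailed in the next paragraph. The input resolution, the number of key frames, and the fine-tuning details are assumptions for illustration; VGG16, Xception, or ResNet152V2 can be substituted for the backbone.

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_KEY_FRAMES = 18          # key postures per sign, as used in the experiments
FRAME_SHAPE = (224, 224, 3)  # assumed input resolution
NUM_CLASSES = 502            # e.g., KArSL-502

# Pre-trained backbone; other ImageNet models are drop-in alternatives.
backbone = tf.keras.applications.MobileNet(
    weights="imagenet", include_top=False, pooling="avg", input_shape=FRAME_SHAPE)

frames = layers.Input(shape=(NUM_KEY_FRAMES,) + FRAME_SHAPE)
# Apply the CNN to every key frame to obtain one spatial feature vector per frame.
features = layers.TimeDistributed(backbone)(frames)
# Stacked LSTM learns the temporal dependencies across the key postures.
x = layers.LSTM(2048, return_sequences=True)(features)
x = layers.LSTM(2048)(x)
x = layers.Dense(1024, activation="relu")(x)
x = layers.Dropout(0.6)(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

dmn = models.Model(frames, outputs, name="DMN")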

As shown in Figure 5, the features extracted by the pre-trained models are fed into a stacked LSTM. The LSTM consists of two LSTM layers with 2048 neurons each. The output of these layers is fed into a fully connected layer with 1024 neurons followed by a rectified linear unit (ReLU) activation function. This function handles the nonlinearity by zeroing the negative values; it is computationally efficient and helps to reduce the possibility of vanishing gradients [54]. To reduce overfitting, a dropout layer with a rate of 60% is used after the activation function. For classification, a Softmax layer is added as the last layer of the DMN stream. [...]

As discussed in Section III-B, we represented the motion of the sign in a single image using the AVM approach. This image encodes spatial and temporal information. It also helps to recognize static sign gestures that do not involve motion. These signs are a challenge for the DMN because the variations between some static signs are at the level of finger shapes, which cannot be captured easily by the DMN networks. [...]

Figure 6(a) shows samples from the KArSL dataset. We used two sets of the KArSL dataset: KArSL-190 and KArSL-502. KArSL-190 is the pilot version of the KArSL dataset, and it consists of 190 signs that comprise 30 digit signs, 39 letter signs, and 121 word signs. We used this set to evaluate the proposed techniques and compare our work with other studies that used this set. We also evaluated our approach on more signs using KArSL-502, which includes all 502 signs of the KArSL dataset. The results reported for KArSL-502 can also be used to benchmark the KArSL dataset because this is the first study to use the whole KArSL dataset.
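To complement the DMN sketch above, the following hedged sketch illustrates the AMN stream and the SRN fusion described in Section III. Because the exact AVM fusion operator (Section III-B) and the SRN layer sizes are not reproduced in this excerpt, the weighted pixel-wise accumulation and the SRN head below are assumptions; the use of a pre-trained MobileNet on the single AVM image and the concatenation of the DMN and AMN features follow the description above.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def avm_frame(key_frames, direction="forward"):
    # Fuse a sign's key postures into a single accumulative video motion (AVM) image.
    # key_frames: (N, H, W, 3) float array in [0, 1].
    # NOTE: the paper defines the exact fusion operator in Section III-B; the weighted
    # pixel-wise accumulation below is only an assumption used for illustration.
    frames = key_frames if direction == "forward" else key_frames[::-1]
    weights = np.linspace(0.2, 1.0, len(frames))[:, None, None, None]
    return ((frames * weights).sum(axis=0) / weights.sum()).astype(np.float32)

FRAME_SHAPE = (224, 224, 3)
NUM_CLASSES = 502

# AMN stream: a pre-trained MobileNet extracts features from the single AVM image.
amn_backbone = tf.keras.applications.MobileNet(
    weights="imagenet", include_top=False, pooling="avg", input_shape=FRAME_SHAPE)
avm_input = layers.Input(shape=FRAME_SHAPE)
amn_features = amn_backbone(avm_input)

# DMN features arrive as a vector (e.g., the 1024-d fully connected layer of the DMN).
dmn_features = layers.Input(shape=(1024,))

# SRN: concatenate the two streams and classify the fused representation.
# The SRN's internal layer sizes are not given in this excerpt, so they are assumptions.
fused = layers.Concatenate()([dmn_features, amn_features])
x = layers.Dense(1024, activation="relu")(fused)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

srn = models.Model([dmn_features, avm_input], outputs, name="AMN_SRN")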

LSA64 is an Argentinian sign language dataset that contains 3200 videos of 64 signs performed by ten signers. Each sign is repeated five times by each signer. The dataset was collected using an RGB color camera. The signers who performed the dataset signs wore colored gloves to ease the detection and tracking of their hands. However, we used the signs without performing any segmentation. Figure 6(b) shows samples from the LSA64 dataset.

Several experiments have been conducted with different configurations to evaluate the efficiency of the proposed sign language recognition systems. Experiments were conducted in two modes: signer dependent and signer independent. In the signer-dependent mode, we tested the model on samples of the signers who were involved in the training of the model. By contrast, in the signer-independent mode, we tested the system on signs performed by signers who were not present for the model training. For the signer-dependent mode, four sets of experiments were performed on the KArSL dataset: three sets corresponded to each of the three signers in the KArSL dataset, and one set contained the signs of all the signers. The signer-independent experiments were conducted using three sets corresponding to each signer tested for the dataset. For example, in the set used for Signer 01 in the signer-independent mode, two signers (Signer 02 and Signer 03) were used for training, and one signer (Signer 01) was used for testing.
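As a small illustration of the signer-independent protocol described above (two signers for training, the held-out signer for testing), the following sketch enumerates the leave-one-signer-out splits; the sample-tuple layout and signer identifiers are illustrative assumptions.

from typing import Iterable, Tuple

Sample = Tuple[str, str, str]  # (video_path, sign_label, signer_id) -- illustrative layout

def signer_independent_splits(samples: Iterable[Sample],
                              signers=("Signer01", "Signer02", "Signer03")):
    # Yield (test_signer, train_set, test_set) for each leave-one-signer-out split:
    # two signers are used for training and the held-out signer for testing.
    samples = list(samples)
    for test_signer in signers:
        train = [s for s in samples if s[2] != test_signer]
        test = [s for s in samples if s[2] == test_signer]
        yield test_signer, train, test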

In these experiments, we started by evaluating each component of the proposed system independently. We evaluated the DMN stream using different pre-trained networks on 18 key postures selected empirically. The CNN component of this network was fine-tuned using four pre-trained models for sign recognition, namely, VGG16, Xception, ResNet152V2, and MobileNet. The feature vectors resulting from these networks were fed into the stacked LSTM, as discussed in Section III-C1. Then, we evaluated the AMN stream using three configurations: forward (FWD-AMN), backward (BWD-AMN), and bidirectional (Bi-AMN). This stream accepts the AVM image as an input and employs a pre-trained MobileNet network for feature extraction, as discussed in Section III-C2. Finally, we evaluated the SRN network, which accepts the dynamic and accumulative motion features [...]. Table 3 shows the obtained results for the proposed models [...] Table 2 as FWD-SRN, BWD-SRN, and Bi-SRN, respectively. Table 3 also shows that the SRN network outperformed the DMN stream on the KArSL-190 and KArSL-502 datasets. By contrast, there was no noticeable improvement over the AMN stream except for Signer 01 of KArSL-502. However, the results obtained with the SRN were high in the signer-dependent mode.

Although the results obtained by the proposed networks in the signer-dependent mode can be considered satisfactory, the more challenging type of sign language recognition is the signer-independent mode. This type of recognition is related to real-life systems that are tested on signers who are different from the signers involved in system training. To this end, we used two signers from the KArSL dataset for training and a third signer for testing. We followed the same experimental settings used for the signer-dependent experiments. Comparing Tables 3 and 4 shows that signer-independent recognition is more challenging than signer-dependent recognition. It is clear from Table 4 [...]

To evaluate the performance of the proposed networks on each sign category, we show in Table 5 [...] rates were obtained with sign words, which can be attributed to the variation between sign words and the use of motion with these signs. It is also noticeable in the confusion matrix that the AMN stream can recognize the static signs more efficiently than the DMN stream due to its ability to capture the spatial features encoded by the AVM technique. The fusion of the DMN and AMN streams through the SRN stream improved the accuracies of all sign types for all signers except Signer 01 of KArSL-190. Furthermore, the SRN stream outperformed the DMN with all sign types of KArSL-502. To better investigate the misclassifications, we used a pie chart (Figure 7) of the misclassified signs of KArSL-502 for each network stream, organized by sign chapter (the KArSL dataset contains signs from 11 chapters of the ArSL dictionary). The signs involved in this analysis are those that could not be recognized by the network streams in the signer-independent mode for the three signers. As shown in the figure, most of the signs that could not be recognized by all the network streams belong to the characteristics chapter.

[...] techniques for ArSL recognition. Three types of features were extracted from the skeleton joint points provided by the Kinect sensor and fed into the HMM: (i) the joint points of the signers' hands, (ii) the hand shape represented using HOG, and (iii) a combination of the joint points and the shapes of the signers' hands. Additionally, they formed a single image from all the frames of the signs and used a CNN model with VGG-19 for classification. Table 6 compares the results of these techniques with our results on KArSL-190. As shown in the table, the proposed AMN and SRN streams outperformed the other techniques in both the signer-dependent and signer-independent modes. In addition, the improvements in accuracy over the results of Sidig and Mahmoud [57] with Bi-SRN were approximately 11% and 15% in the signer-dependent and signer-independent modes, respectively. These results confirm the efficiency of the proposed networks for sign recognition.

The LSA64 dataset, which is an Argentinian dataset consisting of 64 signs performed by ten signers, was also used to evaluate the generalization of our approach to other sign languages. We evaluated the proposed approach in the signer-dependent and signer-independent modes. For the signer-dependent mode, we split the data randomly into training (80%) and test (20%) sets; we repeated each experiment [...]. The obtained results are compared with other approaches in Table 7. Clearly, our approach outperformed other approaches in the signer-dependent and signer-independent experiments. The highest accuracy in the signer-independent mode was obtained using Bi-AMN. In this experiment, the lowest accuracies were obtained with Signer 02, Signer 03, and Signer 08 (see Table 8). These signers were nonexpert signers, and they introduced certain movements that were not part of the sign language, such as head motions and returning the hands to their resting positions before signing. These observations align with the challenges reported for the LSA64 dataset in [64].

In the last decade, sign language recognition has gained popularity and attracted the interest of researchers worldwide. Several approaches that differ in the sign acquisition method, recognition technique, target language, and number of recognized signs have been proposed for isolated sign