Attention-Based Sign Language Recognition Network Utilizing Keyframe Sampling and Skeletal Features

Sign language recognition (SLR) is a multidisciplinary research topic in pattern recognition and computer vision. Because the continuous frames of sign language videos contain a large amount of data, selecting representative data and eliminating irrelevant information has always been a challenging problem in the preprocessing of sign language samples. In recent years, skeletal data has emerged as a new type of data but has received insufficient attention. Meanwhile, as sign language features become increasingly diverse, making full use of them has also become an important research topic. In this paper, we improve keyframe-centered clips (KCC) sampling to obtain a new sampling method, optimized keyframe-centered clips (OptimKCC) sampling, which selects key actions from sign language videos. In addition, we design a new skeletal feature called Multi-Plane Vector Relation (MPVR) to describe the video samples. Finally, we use Attention-Based networks to distribute weights to the temporal and spatial features extracted from skeletal data. We conduct comparison experiments on our own dataset and on a public sign language dataset under the Signer-Independent and Signer-Dependent circumstances to show the advantages of our methods.


I. INTRODUCTION
Sign language is an effective way for hearing-impaired people to convey their ideas to others. Sign language recognition (SLR) provides a communication medium between deaf and hearing people, which has important application value [1]. Research on SLR is mainly divided into isolated SLR and continuous SLR. The former aims at recognizing signs word by word, while the latter focuses on translating sentences from a sequence of actions. We discuss the methodology of isolated SLR in this paper.
The traditional equipment for acquiring sign language data can be divided into data gloves and visual image systems [2]. The former use gesture motion sensors to get sequential trajectory data [3], [4]. This method can achieve a high identification rate but incurs high labor and financial costs. In contrast, a visual image system uses cameras to collect information [5]. However, its identification rate is low and its real-time performance is poor; in particular, it is not able to collect large sign language datasets. Since Microsoft launched Kinect-2.0, we can get the RGB, depth, and skeletal data simultaneously from a frame sampled from the video [6]. Using Kinect-2.0, we recorded a large-vocabulary Chinese Sign Language (CSL) dataset with 200 sign language words. Each word contains 100 samples from 10 signers, each of whom repeated the same sign language word 10 times. The CSL dataset is used for the experiments in our paper.
The associate editor coordinating the review of this manuscript and approving it for publication was Thomas Canhao Xu.
Sign language videos are composed of continuous frames sampled at a specific sampling rate by cameras. Due to the high sampling rate, the number of frames in a sign language video is large, which requires much storage and introduces redundant information between adjacent frames. Therefore, we need to select keyframes from the whole sign language video as its descriptor. In this way, the redundant information is eliminated and the storage requirement is greatly reduced, making feature extraction more convenient without negatively affecting recognition performance. In this paper, we design a method called optimized keyframe-centered clips (OptimKCC) sampling and achieve better results than the state-of-the-art method in [7].
In recent years, skeletal coordinate data has emerged as a new type of data in SLR [8], [9]. It can eliminate the influence of the background and the illumination of the signing environment, and it can describe the three-dimensional spatial trajectories of finger joints. Besides, skeletal data requires much less storage than RGB images and depth images. However, to our knowledge, few studies have investigated the extraction of skeletal features. Thus, to make full use of skeletal data, we design a feature called Multi-Plane Vector Relation (MPVR) in this paper to get better descriptions of sign language videos.
With the development of the attention mechanism, researchers have proposed Attention-Based networks for SLR, which distribute weights to the features of different keyframes to achieve more efficient feature extraction. However, current research mainly focuses on the temporal attention mechanism. In this paper, we also consider the spatial attention mechanism. In the 3D skeletal data provided by Kinect-2.0, the XY plane represents the screen of the camera facing the signers, the YZ plane represents the ground, and the XZ plane represents the sidewall orthogonal to the previous two subplanes. The projections of the skeletal joint trajectories onto the three subplanes differ in importance for feature representation. Therefore, referring to the Attention-Based network proposed in [10], we design a spatial Attention-Based BLSTM to weight the features of different subplanes.
The contributions of this paper can be summarized as follows:
• Based on the keyframe-centered clips (KCC) sampling proposed in [7], we improve it for better data preprocessing and feature descriptions.
• In this paper, a new kind of skeletal feature called Multi-Plane Vector Relation(MPVR) is proposed. We project each skeletal joint's 3D coordinate data to 3 subplanes to get 3 2D vectors. And then, we explore the vector relation in different subplanes, which is the principal component of the MPVR feature.
• Based on the Attention-Based network proposed in [10], we design a spatial Attention-Based BLSTM to distribute weights to corresponding subplanes' features in MPVR.
• According to the ideas proposed above, we implement the comparison experiments under the Signer-Independent and the Signer-Dependent circumstances distinguished by whether there are signers who appear in both training sets and test sets. The recognition accuracy under two cases can validate the adaptiveness of our networks to the practical Signer-Independent situation.

II. RELATED WORKS
In this section, we will review the work related to our research in this paper.

A. KEY FRAME SAMPLING
Huang et al. [7] proposed keyframe-centered clips (KCC) sampling, which aims at selecting a certain number of frames to describe the whole sign language video. They achieved better recognition performance than other sampling methods. The keyframe extraction algorithm in this paper is based on [7].

B. FEATURE EXTRACTION
In the field of SLR, early research on feature extraction focused on extracting features such as HOG, LBP, optical flow, or SIFT [11] from RGB images and depth images using traditional image-processing algorithms [12]-[14]. With the development of deep learning, CNNs [15]-[17] and RNNs [18]-[21] can directly extract temporal or spatial features from image data, and they have gradually become the mainstream methods in SLR.
After Microsoft launched Kinect-2.0, skeletal data gradually received attention. In recent years, a few studies have begun to investigate skeletal feature extraction and have made some progress. Kumar et al. [22] proposed the joint distance and angular coded Color topographical descriptor (JTDT) and got 84.12% accuracy on an Indian sign language dataset. Rastgoo et al. [23] proposed the multi-view hand skeleton, which obtains skeletal coordinate information from multiple perspectives, and achieved 99.6% accuracy on their own laboratory's dataset. The above works use only rectangular coordinate data. In view of this, we design MPVR, which uses polar coordinate data to describe the vector relation in different subplanes.

C. NETWORK LEARNING
The types of SLR networks are closely related to the development of computer vision and pattern recognition. HMM is one of the classical models [24], [25]. HMM can model continuous frames in the time domain and extract temporal features. Based on traditional HMM, Zhang et al. [26] and Guo et al. [27] proposed adaptive HMM. Pu et al. [28] applied HMM to trajectory modeling. In addition, SVM [12], [18], [29], CRF [30], [31], and some of their variants have also been used in SLR.
With the development of deep learning, SLR has gradually come to rely on neural networks [19]. Zamora-Mora and Chacón-Rivas [32] used CNN for real-time hand detection as a tool for SLR. Al-Hammadi et al. [33] used 3DCNN to extract temporal and spatial features simultaneously and got better recognition performance. Besides, RNNs have also been widely adopted. For example, Xiao et al. [34] used LSTM to realize multimedia fusion for Chinese SLR. Li et al. [35] proposed an encoder-decoder model using LSTM to model different features of hand shapes.
VOLUME 8, 2020
In recent years, feature fusion has gradually received attention. To make full use of different types of features, Su and Zhu [36] combined CNN and LSTM to form a fusion network. H. Zhou and W. Zhou also designed the spatial-temporal Multi-Cue network [37] to fully explore the features from different cues.
Due to the advantages of the attention mechanism, researchers also began to transfer it to SLR. For example, Huang et al. [10] used Attention-Based 3DCNN to distribute different weights to different frames in video sequences. According to this idea, we not only use a temporal Attention-Based BLSTM to weight keyframes' representation but also add a spatial Attention-Based BLSTM to weight the subplanes' features. Then we fuse them to obtain the fusion network.

III. OUR METHOD
A. KEY FRAME SAMPLING
1) SAMPLING METHOD
Before feature extraction, we need to select keyframes from sign language videos. Since each sign language video consists of continuous frames, keyframe sampling downsamples the frames of the video and selects some representative frames as the descriptor of the whole video. Currently, the standard method of keyframe sampling is uniform sampling: to select $N$ keyframes from a sign language video with $L$ frames, the indices $K_i$ ($1 \le i \le N$) of the keyframes are spaced evenly over the $L$ frames. This method does not consider the importance of different frames. Therefore, we refer to KCC sampling in [7] and propose OptimKCC sampling to extract the key actions.
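As an illustration of uniform sampling, the following Python sketch places each keyframe at the centre of one of $N$ equal-length segments. The rounding rule and the function name `uniform_keyframe_indices` are assumptions made for this example; the exact index formula used in the paper may round differently.

```python
def uniform_keyframe_indices(num_frames, num_keyframes):
    """Return 1-based indices of evenly spaced keyframes.

    Assumption: keyframe i sits at the centre of the i-th of N
    equal-length segments of the L frames.
    """
    if not 1 <= num_keyframes <= num_frames:
        raise ValueError("need 1 <= N <= L")
    segment = num_frames / num_keyframes  # segment length in frames
    return [int((i + 0.5) * segment) + 1 for i in range(num_keyframes)]


# Example: 8 keyframes from a 100-frame video.
indices = uniform_keyframe_indices(100, 8)
```

Every index depends only on $L$ and $N$, never on the frame content, which is why uniform sampling cannot favour the frames that carry the key actions.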

2) KCC SAMPLING
Firstly, we take the first frame as the reference frame and search for the keyframe among the subsequent $n$ frames ($n$ is a hyperparameter). We denote by $D_i$ ($1 \le i \le n$) the Euclidean distance between the pixels of the $i$th frame and the reference frame. A long distance means low similarity.
Secondly, we sort the sequence $\{D_1, D_2, \ldots, D_n\}$ in ascending order to obtain $\{D_{s_1}, D_{s_2}, \ldots, D_{s_n}\}$, where $\{s_1, s_2, \ldots, s_n\} = \{1, 2, \ldots, n\}$. Then, we classify the $n$ frames into two categories by threshold segmentation: one is similar to the reference frame, and the other is dissimilar to it. We assume that the first $k$ frames, corresponding to $D_{s_i}$ ($1 \le i \le k$), are the similar frames. Then, we design the criterion function (2), in which $m_1$ and $m_2$ represent the means of the first $k$ and the subsequent $(n - k)$ similarity values, and $\sigma_1$ and $\sigma_2$ represent the standard deviations of the first $k$ and the subsequent $(n - k)$ similarity values.
According to the principle of optimal classification, the result should yield the largest mean square error (MSE) between classes and the least MSE within each class, which is the condition (3) that the optimal solution $k^*$ should satisfy. After finding $k^*$ according to (3), we select the frame that appears earliest in the video among the $(n - k^*)$ dissimilar frames as the keyframe. We then set it as the next reference frame and find the next keyframe in the same way, until the number of remaining frames is less than $n$.
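The split into similar and dissimilar frames can be sketched as follows. This is a minimal sketch under a stated assumption: the criterion is taken to be the Fisher-style ratio $(m_1 - m_2)^2 / (\sigma_1^2 + \sigma_2^2)$, which matches the description of large between-class and small within-class error but is not confirmed as the exact form of (2). The function names are hypothetical.

```python
from statistics import mean, pstdev


def split_similar_dissimilar(distances):
    """Choose k* for one search window.

    distances[i] is the distance between candidate frame i (0-based offset
    after the reference frame) and the reference frame.
    Returns (k_star, order), where order lists the offsets sorted by
    ascending distance, so order[:k_star] are the similar frames.
    Assumption: Fisher-style criterion (m1 - m2)^2 / (s1^2 + s2^2).
    """
    n = len(distances)
    if n < 2:
        raise ValueError("need at least two candidate frames")
    order = sorted(range(n), key=lambda i: distances[i])
    ranked = [distances[i] for i in order]
    best_k, best_score = 1, float("-inf")
    for k in range(1, n):
        left, right = ranked[:k], ranked[k:]
        between = (mean(left) - mean(right)) ** 2
        within = pstdev(left) ** 2 + pstdev(right) ** 2
        score = between / (within + 1e-12)  # small constant avoids division by zero
        if score > best_score:
            best_k, best_score = k, score
    return best_k, order


def next_keyframe_offset(distances):
    """Earliest frame (smallest offset) among the dissimilar frames."""
    k_star, order = split_similar_dissimilar(distances)
    return min(order[k_star:])


window = [1.0, 1.2, 0.9, 5.0, 5.2, 1.1]
offset = next_keyframe_offset(window)  # offset of the next keyframe in the window
```

In this example the frames at offsets 3 and 4 are far from the reference frame, so the earlier of the two becomes the next keyframe and the next reference frame.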

3) OPTIMIZATION
According to KCC, for each sample, we gradually change $n$ so that a fixed number $N$ of keyframes is selected. Taking a video sample with $L$ frames as an example, we use $X = (x_1, x_2, \ldots, x_L)$ to denote the frame sequence and $Y = (y_1, y_2, \ldots, y_N)$ to denote the keyframe sequence selected from $X$ using KCC sampling. We assume that:
$$y_i = x_{s_i}, \quad 1 \le s_1 < s_2 < \cdots < s_N \le L \tag{4}$$
Because $L > N$, we use the DTW distance to measure the similarity between $X$ and $Y$. Firstly, we construct a matrix $M \in \mathbb{R}^{L \times N}$ with entries
$$M_{ij} = D(x_i, y_j) = \lVert x_i - y_j \rVert_2 \tag{5}$$
where $D(x_i, y_j)$ represents the Euclidean distance between the pixels of $x_i$ and $y_j$. A long distance means low similarity.
We use a path $P$ in the matrix $M$, which starts from the coordinate $(1, 1)$ and ends at $(L, N)$, to match the sequences $X$ and $Y$. For each point $(i, j)$, the next point along the path can only be one of the following points:
$$(i + 1, j), \quad (i, j + 1), \quad (i + 1, j + 1) \tag{6}$$
Each point along the path can be regarded as a matched point between the two sequences. The summation of all the elements along the path, shown in (7), is defined as the accumulative distance between $X$ and $Y$:
$$D_P(X, Y) = \sum_{(i, j) \in P} M_{ij} \tag{7}$$
Our objective is to find a path $P^*$ generating the least accumulative distance, which is defined as the DTW distance:
$$\mathrm{DTW}(X, Y) = D_{P^*}(X, Y) = \min_{P} D_P(X, Y) \tag{8}$$
The DTW distance can be calculated by the DTW algorithm, in which $\gamma(1, 1) = M_{11}$ and $\gamma(L, N)$ is the final result:
$$\gamma(i, j) = M_{ij} + \min\{\gamma(i - 1, j),\ \gamma(i, j - 1),\ \gamma(i - 1, j - 1)\} \tag{9}$$
We attempt to optimize the result of KCC sampling by using the concept of DTW distance. We set $Y$ as the initial sequence and gradually approach the optimal result using a greedy algorithm. The flowchart of the algorithm is shown as follows:
for $s_{i-1} \le j \le s_{i+1}$: (search one by one)
We use $Y^*$ to denote the final keyframe sequence obtained from the above algorithm, which will be used for data processing and feature extraction.
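The DTW distance between the frame sequence and a keyframe sequence can be computed with the standard dynamic-programming recursion. The sketch below is illustrative only: frames are represented as flat sequences of pixel values, and the function names are hypothetical.

```python
import math


def frame_cost_matrix(xs, ys):
    """M[i][j] is the Euclidean distance between frame xs[i] and keyframe ys[j]."""
    return [[math.dist(x, y) for y in ys] for x in xs]


def dtw_distance(cost):
    """Accumulate gamma over the cost matrix and return gamma(L, N)."""
    rows, cols = len(cost), len(cost[0])
    inf = float("inf")
    gamma = [[inf] * cols for _ in range(rows)]
    gamma[0][0] = cost[0][0]
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                gamma[i - 1][j] if i > 0 else inf,                # step from (i-1, j)
                gamma[i][j - 1] if j > 0 else inf,                # step from (i, j-1)
                gamma[i - 1][j - 1] if i > 0 and j > 0 else inf,  # diagonal step
            )
            gamma[i][j] = cost[i][j] + best_prev
    return gamma[-1][-1]


# Toy example with one-pixel "frames".
frames = [(0.0,), (1.0,), (2.0,), (3.0,)]
keyframes = [(0.0,), (3.0,)]
distance = dtw_distance(frame_cost_matrix(frames, keyframes))
```

Each frame is matched to at least one keyframe along the path, so a small DTW distance means that every part of the video lies close to some selected keyframe.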
Because the sequence $Y^*$ is based on $Y$, it preserves the property of accounting for the different weights of different frames. Besides, $Y^*$ is more similar to $X$ than $Y$ is. So, we can conclude that $Y^*$ better captures the visual tempo of the video and describes the sign language video more fully.
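A greedy refinement in the spirit of OptimKCC can be sketched as follows. Since only a fragment of the original flowchart is available, the details here are assumptions: each keyframe index is moved to any position strictly between its neighbours whenever that lowers the DTW distance to the full sequence, and passes repeat until no move helps. The names `refine_keyframes` and `max_passes` are hypothetical.

```python
import math


def dtw_distance(xs, ys):
    """DTW distance between two sequences of flat frames (rolling-row version)."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(ys)  # virtual border row; prev[0] seeds gamma(1, 1)
    for x in xs:
        curr = [inf] * (len(ys) + 1)
        for j, y in enumerate(ys, start=1):
            curr[j] = math.dist(x, y) + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[-1]


def refine_keyframes(frames, indices, max_passes=10):
    """Greedily adjust 0-based, strictly increasing keyframe indices.

    Assumption: index i may move anywhere strictly between its neighbours,
    which keeps the keyframes distinct and in temporal order.
    """
    indices = list(indices)
    best = dtw_distance(frames, [frames[s] for s in indices])
    for _ in range(max_passes):
        improved = False
        for i in range(len(indices)):
            low = indices[i - 1] + 1 if i > 0 else 0
            high = indices[i + 1] - 1 if i + 1 < len(indices) else len(frames) - 1
            for j in range(low, high + 1):  # search one by one
                trial = indices[:i] + [j] + indices[i + 1:]
                score = dtw_distance(frames, [frames[s] for s in trial])
                if score < best:
                    best, indices, improved = score, trial, True
        if not improved:
            break
    return indices, best


video = [(0.0,), (0.0,), (0.0,), (10.0,), (10.0,), (10.0,)]
refined, refined_distance = refine_keyframes(video, [0, 1])
```

Because a move is accepted only when it strictly reduces the DTW distance, the refined sequence is never farther from the video than the initial KCC result.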

B. MPVR (MULTI-PLANE VECTOR RELATION) FEATURE
1) SKELETAL DATA
Kinect-2.0 can capture the 3D coordinate data of 25 skeletal joints. We take the spine joint, which keeps still during almost the whole process of sign language demonstration, as the new coordinate origin to normalize the 3D skeletal coordinate data to eliminate the influence from the heights and the body shapes of signers. In the new 3D coordinate space, the lines connecting the spine joint with other joints can be viewed as 3D vectors.
We select 10 joints closely related to sign language demonstration: thumb, wrist, elbow, index fingertip, and palm center on the left and right sides to get 10 3D vectors. The extraction of MPVR is based on these 3D vectors.
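The change of origin can be illustrated with a short Python sketch. The joint names and the dictionary layout are hypothetical and do not follow the Kinect SDK naming; the sketch only shows the translation that turns joint coordinates into vectors from the spine joint.

```python
def to_spine_vectors(joints, origin="spine"):
    """Express every joint as a 3D vector from the origin joint.

    joints maps a joint name to its (x, y, z) camera coordinates.
    The origin joint itself is dropped from the result.
    """
    ox, oy, oz = joints[origin]
    return {
        name: (x - ox, y - oy, z - oz)
        for name, (x, y, z) in joints.items()
        if name != origin
    }


# Hypothetical frame with two of the ten selected joints.
frame = {
    "spine": (0.5, 1.0, 2.0),
    "right_wrist": (0.75, 1.5, 1.5),
    "left_elbow": (0.25, 1.0, 2.0),
}
vectors = to_spine_vectors(frame)
```

After this step, moving the signer within the camera's field of view shifts all joints and the spine equally, so the resulting vectors are unchanged.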

2) MPVR FEATURE EXTRACTION
Multi-Plane(MP): The meaning of multi-plane is that we project the 3D vector (x, y, z) onto the three orthogonal 2D planes, which are the screen of the camera, the ground, and the sidewall, to obtain three 2D vectors (as shown in Fig. 1).
With the same operation on each joint's coordinate, we can get 10 vectors in each plane.
Vector Relation (VR): Taking the XY plane as an example, we use $V_i$ ($1 \le i \le 10$) to represent the 10 vectors in this subplane. For two vectors $V_i = (x_i, y_i)$ and $V_j = (x_j, y_j)$, we can use the coordinate transformation formula to get the polar coordinates $(P_i, \theta_i)$ and $(P_j, \theta_j)$ ($0 \le \theta_i, \theta_j < 2\pi$) from the rectangular coordinates. We use $\theta_{ij}$ to represent the counterclockwise rotation angle from $V_i$ to $V_j$ (as shown in Fig. 2). According to the definition of the counterclockwise rotation angle, we get:
$$\theta_{ij} = \begin{cases} \theta_j - \theta_i, & \theta_j \ge \theta_i \\ \theta_j - \theta_i + 2\pi, & \theta_j < \theta_i \end{cases} \tag{10}$$
MPVR feature extraction: For one of the three subplanes, we use the vector $M_P \in \mathbb{R}^{10}$ to represent $P_i$ ($1 \le i \le 10$). Meanwhile, we use the matrix $M_\theta \in \mathbb{R}^{10 \times 10}$ to represent the argument of each vector and the counterclockwise rotation angle between every two vectors. In this way, we get the matrix $M = \mathrm{concat}(M_P, M_\theta) \in \mathbb{R}^{10 \times 11}$ as the vector relation feature of one subplane. Assume that the vector relation features of the three subplanes are $M_{xy}$, $M_{xz}$, and $M_{yz}$. We stack them to form the 3D tensor $\mathrm{stack}(M_{xy}, M_{xz}, M_{yz}) \in \mathbb{R}^{3 \times 10 \times 11}$ as the MPVR of one keyframe. Assuming that $N$ keyframes are selected, we stack their MPVRs to form a tensor of size $N \times 3 \times 10 \times 11$ as the MPVR of the whole sign language video.
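The construction of the MPVR of one keyframe can be sketched in plain Python. Two layout choices below are assumptions, since the text does not fix them: the diagonal of the angle matrix stores the argument of each vector while the off-diagonal entries store the rotation angles, and the length column is placed first in the concatenation. The function names are hypothetical.

```python
import math

TWO_PI = 2 * math.pi


def polar(u, v):
    """Length and argument of a 2D vector, with the argument in [0, 2*pi)."""
    return math.hypot(u, v), math.atan2(v, u) % TWO_PI


def ccw_angle(theta_i, theta_j):
    """Counterclockwise rotation angle from V_i to V_j."""
    return (theta_j - theta_i) % TWO_PI


def plane_feature(vectors_2d):
    """J x (J + 1) vector-relation matrix of one subplane (10 x 11 for J = 10)."""
    polars = [polar(u, v) for u, v in vectors_2d]
    rows = []
    for i, (length_i, theta_i) in enumerate(polars):
        angles = [
            theta_i if i == j else ccw_angle(theta_i, theta_j)  # diagonal: argument (assumed)
            for j, (_, theta_j) in enumerate(polars)
        ]
        rows.append([length_i] + angles)  # length column first (assumed order)
    return rows


def mpvr_frame(vectors_3d):
    """Stack the XY, XZ and YZ subplane features into a 3 x J x (J + 1) list."""
    xy = [(x, y) for x, y, z in vectors_3d]
    xz = [(x, z) for x, y, z in vectors_3d]
    yz = [(y, z) for x, y, z in vectors_3d]
    return [plane_feature(plane) for plane in (xy, xz, yz)]


feature = mpvr_frame([(1.0, 2.0, 3.0)] * 10)  # ten identical toy vectors
```

The modulo operation reproduces the two cases of the rotation-angle definition in one expression: a negative difference is wrapped by adding $2\pi$.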
The main advantages of MPVR are as follows:
• Scale invariance: Due to the normalization of the skeletal data, the length of the skeletal joints' vectors can be robust to the diversity of the signers' heights and body shapes.
• Equivalent reconstructability: According to the given feature matrices $M_{xy}$, $M_{xz}$, and $M_{yz}$, we can reconstruct the original spatial distribution of skeletal joints.
• Rotation invariance: In the process of data acquisition, due to the shake of the camera, the spatial coordinates can change suddenly, resulting in discontinuity and instability of the rectangular coordinate data. However, the length of the skeletal vectors and the counterclockwise rotation angles between them do not change with translation and rotation of the plane facing the signers. Therefore, the features can eliminate the error caused by camera shake.
• Multidirectional: We take the skeletal trajectories' projection onto the three orthogonal subplanes into consideration, which fully explores the trajectory features during the sign language demonstration.

C. NETWORK
1) ATTENTION-BASED BLSTM
After obtaining the features from the skeletal data, we need to feed them into networks for training. Currently, BLSTM is widely used for extracting features from sequential data [34], [36]. However, this network is not sensitive to the fact that different keyframes have different importance. Besides, the relative weights of the three orthogonal subplanes in describing the sign language video have also not been considered. To solve this problem, we adopt the Attention-Based BLSTM proposed in [10] to weight the features of the keyframes and the subplanes in MPVR. The general structure of the Attention-Based network is shown in Fig. 3.
As shown in Fig. 3, $F = (f_1, f_2, \ldots, f_N)^T \in \mathbb{R}^{N \times L}$ represents the feature sequence. $N$ is the number of feature vectors, and $L$ is the length of each feature vector. We set the number of hidden units in the BLSTM to 128 and feed $F$ into it. In this way, we get the attention signal from the BLSTM. The hidden layer $\mathcal{H}$ outputs the hidden representation $H$, where $\mathcal{H} = A(FC)$; FC denotes a fully-connected layer and $A$ denotes an activation function. Then we calculate the weight vector $W = (w_1, w_2, \ldots, w_N) \in \mathbb{R}^N$. Finally, we weight the feature vectors by the weight vector $W$ to get the final feature vector. We use $B$ to denote the whole Attention-Based BLSTM, which maps the feature sequence $F$ to the final feature vector.
As for $B_T$, we set the size of $f$ to be $8 \times 3 \times 10 \times 11$. Thus, we get the temporal feature vector. Similarly, as for $B_S$, we set the size of $f$ to be $3 \times 8 \times 10 \times 11$ and get the spatial feature vector (18). To make full use of the temporal and spatial characteristics, we concatenate them to form the fusion feature. Finally, we feed the fusion feature into the fully-connected layer and the softmax layer to get the probability distribution vector $p \in \mathbb{R}^C$, where $C$ is the number of classes.
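The weighting and fusion steps can be illustrated without a deep-learning framework. The sketch below is not the network itself: the per-vector scores stand in for the output of the BLSTM and hidden layer, the weights are assumed to be a softmax over those scores, and the final feature is assumed to be the weighted sum of the feature vectors. All names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, scores):
    """Weight N feature vectors by softmax(scores) and sum them.

    Returns (pooled_vector, weights). In the real network the scores would
    come from the BLSTM and hidden layer; here they are plain inputs.
    """
    weights = softmax(scores)
    width = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(width)]
    return pooled, weights


# Temporal branch: 8 keyframes, each flattened from 3 x 10 x 11 = 330 values.
temporal_in = [[0.1] * 330 for _ in range(8)]
# Spatial branch: 3 subplanes, each flattened from 8 x 10 x 11 = 880 values.
spatial_in = [[0.1] * 880 for _ in range(3)]

temporal_vec, temporal_w = attention_pool(temporal_in, [0.0] * 8)
spatial_vec, spatial_w = attention_pool(spatial_in, [0.0] * 3)
fused = temporal_vec + spatial_vec  # concatenation of the two feature vectors
```

With equal scores the weights are uniform; a trained network would instead raise the scores of the keyframes or subplanes that are most informative for the sign.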

E. LOSS FUNCTION
We use cross entropy as the loss function. For a probability distribution vector $p = (p_1, p_2, \ldots, p_C)$, if the ground truth label is $i$ ($1 \le i \le C$), the loss function is:
$$\mathrm{Loss} = -\log p_i$$
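For a single sample, the cross-entropy loss reduces to the negative log-probability assigned to the true class. A minimal Python sketch, with a hypothetical function name and a 1-based label to match the notation above:

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy loss for one sample; label is the 1-based true class i."""
    return -math.log(probs[label - 1])


loss = cross_entropy([0.25, 0.5, 0.25], 2)
```

The loss is zero when the network assigns probability 1 to the true class and grows without bound as that probability approaches 0.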

IV. EXPERIMENT
A. IMPLEMENTATION DETAILS
1) DATASET
Our experiments were conducted on the DEVISIGN sign language dataset released by the Chinese Academy of Sciences and the Chinese Sign Language (CSL) dataset recorded by us.
The DEVISIGN dataset includes 500 sign language words and was captured with Kinect-1.0, which provides RGB, depth, and skeletal data. The 500 words cover signs ranging from fundamental postures to complex posture variations. The data covers 8 different signers. The vocabulary is recorded twice for 4 signers (2 males and 2 females) and once for the other 4 signers (2 males and 2 females).
The CSL dataset contains 200 sign language words, which were collected with Kinect-2.0. All 200 words are from the Chinese Sign Language Textbooks. Each word in the CSL dataset contains 100 video samples obtained from 10 signers, each of whom repeated the same sign language word 10 times. The CSL dataset can provide more detailed skeletal information than the DEVISIGN dataset because of the superiority of Kinect-2.0 over Kinect-1.0.

2) CONTENT OF THE EXPERIMENTS
We conducted self-comparison experiments on CSL to validate the effect of keyframe sampling and the attention mechanism. After that, we implemented different methods proposed in other literature on the DEVISIGN and CSL datasets to validate the advantages of our methods.
Besides, we conducted experiments under two cases: Signer-Independent and Signer-Dependent. The former means that the signers in the training set are completely different from those in the test set. The latter means that some signers appear in both sets.
Experiments under the Signer-Independent circumstance are more challenging but have more practical application value. By comparing the recognition results under the two cases, the networks' robustness to the Signer-Independent circumstance can be observed. Many recent works on SLR include both cases in their experiments [14], [18].
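The difference between the two cases can be made concrete with a small Python sketch. The sample layout (signer, word, repetition) and the choice of which signers or repetitions to hold out are assumptions for illustration and are not the splits used in the experiments.

```python
def make_samples(num_signers=10, num_words=2, num_reps=10):
    """Enumerate hypothetical (signer, word, repetition) sample identifiers."""
    return [
        (signer, word, rep)
        for signer in range(num_signers)
        for word in range(num_words)
        for rep in range(num_reps)
    ]


def signer_independent_split(samples, test_signers):
    """No signer appears in both the training set and the test set."""
    held_out = set(test_signers)
    train = [s for s in samples if s[0] not in held_out]
    test = [s for s in samples if s[0] in held_out]
    return train, test


def signer_dependent_split(samples, test_reps):
    """Hold out repetitions instead, so signers appear in both sets."""
    held_out = set(test_reps)
    train = [s for s in samples if s[2] not in held_out]
    test = [s for s in samples if s[2] in held_out]
    return train, test


samples = make_samples()
si_train, si_test = signer_independent_split(samples, test_signers=[8, 9])
sd_train, sd_test = signer_dependent_split(samples, test_reps=[8, 9])
```

Both splits hold out the same number of samples here, so any accuracy gap between them reflects the difficulty of generalizing to unseen signers rather than a difference in test-set size.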
The experiments were conducted on a 1080Ti GPU with the stochastic gradient descent (SGD) optimizer and the CrossEntropyLoss criterion. We set batch size = 8, learning rate = 0.01, learning rate decay = 0.99, and momentum = 0.9. 80% of the samples were used for training, 5% for validation, and the remaining 15% for testing.
Table 1.
As the number of keyframes $N$ increases from 2 to 16, the recognition accuracy also gradually increases. However, beyond $N = 8$, the rate of increase drops sharply, and the accuracy saturates. So, we can conclude that when $N = 8$, the keyframes can fully describe sign language videos. Considering that most sign language videos contain 80 to 120 frames, keyframe sampling can significantly reduce the storage requirement without significantly degrading recognition performance. Subsequent comparison experiments are based on the results when $N = 8$.

3) COMPARATIVE EXPERIMENTS WITH OTHER METHODS
To validate the advantages of the Fused Attention-Based BLSTM with MPVR, we conducted experiments on the DEVISIGN and CSL datasets with our methods and with some state-of-the-art methods that extract skeletal features for SLR. The experimental results are shown in Tables 2 and 3.

C. ANALYSIS OF EXPERIMENTAL RESULTS
From the experimental results, we can observe that:
• Compared with KCC sampling in [7], OptimKCC sampling slightly improves the recognition accuracy. OptimKCC sampling preserves the property of accounting for the different weights of different frames. Besides, its results are more similar to the original samples than the results of KCC sampling are. The experiments confirm that OptimKCC sampling better captures the representation of the sign language videos.
• Compared with the original coordinate data, the new feature MPVR significantly enhances recognition accuracy. Besides, the networks with the attention mechanism perform better, and the Spatial Attention-Based BLSTM performs even better than the Temporal Attention-Based BLSTM, from which we can conclude that weighting the features of different subplanes better describes the sign language videos.
• As expected, the recognition accuracy under the Signer-Dependent circumstance is higher than that under the Signer-Independent circumstance. Nevertheless, the gap between the two cases is small, which indicates that our networks avoid overfitting to the training set under the Signer-Independent case and shows that they are robust to this circumstance and have practical application value.
• As shown in Fig. 10-18, when combined with the attention mechanism, the performance of the networks is significantly improved in several aspects, including higher accuracy, faster convergence, and lower loss with smaller fluctuations. The Fused Temporal-Spatial Attention-Based BLSTM has the highest recognition accuracy, the fastest convergence, and the lowest loss, that is, the best overall performance.
• As shown in Tables 2 and 3, our methods perform better than other state-of-the-art and classical methods using skeletal features for SLR. This shows that MPVR and the attention mechanism can fully explore the temporal and spatial features of sign language videos and account for the importance of different kinds of features and different components of a feature, which gives a better ability to capture the representation of the whole sign language video.

V. CONCLUSION
In this paper, we proposed a kind of Attention-Based network utilizing the OptimKCC sampling and the MPVR skeletal feature to improve the accuracy of SLR. First of all, we designed OptimKCC sampling based on [7] to get the keyframes from sign language videos. Secondly, we projected the skeletal joints' coordinate data to 3 orthogonal subplanes to get several 2D vectors and extracted vector relation from different subplanes as the MPVR skeletal feature.
Afterward, based on the attention mechanism, we adopted a temporal Attention-Based BLSTM and a spatial Attention-Based BLSTM for distributing weights to the features of different keyframes and different subplanes in MPVR.

ZHONGFU YE received the B.Eng. and M.S. degrees in electronic and information engineering from the Hefei University of Technology, Hefei, China, in 1982 and 1986, respectively, and the Ph.D. degree from the University of Science and Technology of China, Hefei, in 1995. He is currently a Professor with the University of Science and Technology of China. His current research interests include statistical and array signal processing, speech processing, sign language recognition, and hand pose estimation.