Multi-Information Spatial–Temporal LSTM Fusion Continuous Sign Language Neural Machine Translation

There are two basic problems in sign language recognition (SLR): (a) isolated word SLR and (b) continuous SLR. Most of the existing continuous SLR methods are extensions of the isolated word SLR methods. These methods use the isolated word SLR results as the basic module and obtain the sentence recognition results through sentence segmentation and word alignment. However, sentence segmentation and word alignment are often not accurate, resulting in a low sentence recognition accuracy. At the same time, continuous SLR usually requires strict sample labels, leading to the difficult task of manual labeling and limited training data availability. To address these challenges, this paper proposes a bidirectional spatial–temporal LSTM fusion attention network (Bi-ST-LSTM-A) for continuous SLR. This approach avoids problems such as sentence segmentation, word alignment, and tedious manual labeling. Our contributions are summarized as follows: (1) we proposed a sign language video feature representation method using a convolutional neural network (CNN) and spatial–temporal LSTM (ST-LSTM) information fusion technology; and (2) we constructed a uniform neural machine translation framework that can be used for complex continuous SLR and gesture recognition of nonspecific people in nonspecific environments. Experiments were carried out on some large continuous sign language datasets. The sign language recognition accuracy reached 81.22% on the 500 CSL dataset, 76.12% on the RWTH-PHOENIX-Weather dataset and 75.32% on the RWTH-PHOENIX-Weather-2014T dataset, thereby illustrating the effectiveness of the proposed framework.


I. INTRODUCTION
The goal of video-based SLR is to convert a video sequence into a sign language text representation [1]-[4]. SLR, and particularly continuous SLR [1], [4], is a relatively new field of human-computer interaction (HCI). Although many researchers have explored this area [8], [9], many challenges remain.
A key challenge of SLR is the design of visual descriptors to capture SL semantics, such as facial expressions and the shape, direction, and position of hands [1], [3]. As a result, previous studies relied heavily on RGB-D data in the posture/gesture models [6], [7]. However, most of the existing sign language video sequences are recorded using normal cameras that lack depth sensors, thus limiting the practical application of the existing SLR methods.
The associate editor coordinating the review of this manuscript and approving it for publication was Kok-Lim Alvin Yau.
Another challenge of continuous SLR involves time segmentation [8], [9] because sign language actions are diverse and difficult to detect. Accurate sign language video segmentation is a difficult task, and inaccurate segmentation during preprocessing can lead to errors in the subsequent steps. In addition, labeling each isolated sign language word in video is time-consuming, and extending isolated word SLR to continuous SLR is difficult and involves many technical challenges. Most existing SLR studies [6]- [8] are isolated SLR methods; that is, they focus on the identification of a word or phrase. Some methods [8], [9] have explored the extension of the isolated word SLR methods to continuous SLR, which involves the reconstruction of sentence structures.
Most existing methods of continuous SLR divide the problem of sentence recognition into three stages: video segmentation, isolated word/phrase recognition, and sentence synthesis. For example, DTW-HMM [8] proposed a coarse segmentation step based on a threshold matrix, followed by a dynamic time warping (DTW) algorithm and a bigram model. Dan et al. [9] integrated a new HMM-based language model. Recently, transition action modeling [10] has attracted much attention because this approach can be used for video segmentation. However, despite the popularity of this approach, the sign language video segmentation task is still challenging, and transition actions can be subtle and vague. Ultimately, inaccurate segmentation can lead to significant performance loss in the subsequent steps [11].
Video description generation [12]- [15] involves the generation of a short sentence that describes the scene/object/motion of a given video sequence. A popular approach is based on the video-to-text method [13], which connects two layers of LSTM over a CNN. Hierarchical attention networks can be incorporated into LSTM [14], which is characterized by the automatic selection of the most relevant video frames for the task. The LSTM algorithm also has some extensions, such as bidirectional LSTM (Bi-LSTM), layered LSTM [15], and layered attention GRU [16].
In view of some problems in continuous SLR, and inspired by sequence-to-sequence methods [12]- [15], we propose an RGB video continuous SLR method that identifies sign language through data fusion and attention-based machine translation. Our contributions in this paper are two-fold.
(1) To achieve the SLR task based on RGB video data, and inspired by the significant results achieved by deep learning technology in object detection, we propose a sign language detection and representation framework for RGB video. The method includes a faster Region-CNN (R-CNN) module, a compression tracking module, and a newly proposed CNN-LSTM data fusion model. In previous sign language video feature fusion methods [15], the spatial-temporal information was not fully considered. To fuse the spatial-temporal information of sign language in RGB video sequences, we use spatial-temporal LSTM (ST-LSTM) to fuse hand information and body information in the LSTM units.
(2) Inspired by the LSTM video description methods [12], [14] and video-to-text approaches [13], we combine structural information and attention mechanisms and propose the use of recurrent attention networks, which are an extension of LSTM, to bypass time segmentation. The scheme involves encoding the entire video and then outputting the complete sentence word by word. However, the attention network only partially optimizes the probability of generating the next word given the input video and the previous word, ignoring the relationship between the video and the sentence. Therefore, it may experience robustness issues. To compensate for this, we introduce an attention-based bidirectional LSTM encoder-decoder model to clearly establish the correlation between the video frame sequences and the text sentences.
The remainder of the paper is organized as follows. Section II reviews related work, Section III presents a description of the recognition method, Section IV describes our experiments, and Section V draws some conclusions.

II. RELATED WORK
Video sign language recognition systems are often composed of a feature extraction module and a sequence signal model. The feature extraction module is used to represent gesture sequences, while the sequence signal model maps sequence representations to labels. For better gesture recognition, researchers have designed a variety of handcrafted features, which generally use image gradients [11], [14], [28] and motion trajectories [11], [18], [25] to represent hand shape and skeleton structure.
In recent years, the application of deep neural networks to automatic feature representation learning has become widespread. Wu and Shao [29] used a deep belief network to extract high-level skeletal joint features for gesture recognition. Some researchers used convolutional neural networks [30], [31] and three-dimensional convolutional neural networks [19], [20] to collect hand visual cues. For example, Molchanov et al. [17] applied a 3D CNN to extract the spatiotemporal features of the color, depth, and optical flow data from a video stream. Meanwhile, Neverova et al. [19] used color, depth data, and custom-made gesture descriptors to represent sign language features and then established a multiscale deep structure for sign language recognition.
The time series model is a powerful tool for learning the correspondence between a sequence representation and its labels. HMM is the most widely used time series model in SL recognition [11], [20]. Dynamic time warping (DTW) [24] and support vector machines [32] are also used to measure the similarity between gestures. In recent years, RNNs have been successfully applied to sequence problems such as speech recognition [33] and machine translation [34], [35]. Pigou et al. [26] proposed an end-to-end neural model based on temporal convolution and bidirectional recurrence for sign language recognition. However, due to the weak supervision ability of the recurrent neural network at the sentence level, it is difficult to match the extra-long input sequence with the ordered labels frame by frame. Unlike the aforementioned models, we use attention mechanisms to integrate temporal dynamics before implementing bidirectional recurrence.
Compared with existing models [21], [22], [27], [30], the recurrent neural network sequence learning model with end-to-end training presented in this paper has better learning ability and performance with regard to dynamic dependence. First, we do not use noisy frame markers as the training target of the neural network; rather, we use the symbol graphics method of human detection results to train our feature extraction module, which considers more local time dynamics, such as the location of the human face and hands, and other key information. Furthermore, by introducing the spatiotemporal sequence signal model based on the attention mechanism, namely, ST-A-LSTM, we not only integrate hand information and human body structure information organically through data fusion but also effectively solve the problem that the samples and labels do not show one-to-one correspondence in time. Moreover, our method does not introduce additional supervision information, such as hand annotation, which requires expert knowledge and tedious annotation.
We can summarize our contributions in this paper as follows: (1) a sign language video feature representation method is proposed using a CNN and ST-LSTM information fusion technology; and (2) a uniform neural machine translation framework is constructed that can be used for complex continuous SLR and gesture recognition of nonspecific people in nonspecific environments.

III. RECOGNITION METHODOLOGY
Our method is described in Fig. 1. The method can be broken down into four steps. (1) The RGB video is input. For each frame, we detect face and hand patches in the frame image using a faster R-CNN [13]. The frame information is then divided into two parts: local hand information and spatial-temporal information of sign language. (2) These two types of information are then fed into the basic unit of the ST-LSTM and are fused using the proposed fusion method. (3) The fused feature sequence is then translated into text sentences by a bidirectional ST-LSTM encoder-decoder framework combined with the attention mechanism. (4) The loss function is defined using the differences between the text sentence labels and the output text sentence. The details are discussed below.

A. FEATURE EXTRACTION USING A CNN
In our research, we used only the RGB image to calculate the sign language feature. In many existing SLR methods [36], [37], both the depth image and RGB image are used. In fact, many sign language videos are recorded only by RGB cameras; therefore, research on RGB-based SLR is more important.
Generally, normal-size images are used in SLR to reduce the computational cost of training. Since the hand patch is often small, it is difficult to obtain complete hand information. In contrast to previous research, we used a high-resolution image as the input image. Based on this high-resolution image (250 × 250), we used a faster R-CNN [13] and compressive tracking [14] to detect the face and hands, as shown in Fig. 2. The hand patches (resized to 50 × 50) are sufficiently large to show subtle details. We use $I_1$ to denote the hand patch image, and this image is then fed into the CNN to obtain the feature $x_1$: $x_1 = \mathrm{CNN}(I_1)$.
In this paper, we build a CNN model to extract features of RGB images, as shown in Fig. 3. There are 15 layers in the CNN, namely, an input layer, 3 convolution layers, 3 batch normalization layers, 3 ReLU layers, 2 max pooling layers, 1 fully connected layer, 1 softmax layer and 1 classification output layer. The first convolutional layer is computed using 8 different 3 × 3 kernels, and 8 feature maps are obtained; the second convolutional layer is computed with 16 different 3 × 3 kernels, and 16 feature maps are obtained; and the third convolutional layer is computed with 32 different 3 × 3 kernels, and 32 feature maps are obtained. The convolutional layers are used to find the local relationships in the input layer. Each pooling layer performs max pooling over a 2 × 2 neighborhood, and the original size of the feature map is obtained. The ReLU layer is connected to the fully connected layer. From the fully connected layer of the CNN, the image feature $x$ is output, which is a 288-dimensional vector, as shown in Fig. 3.
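The layer inventory above can be checked with a small shape trace. The following is a minimal sketch, not the authors' implementation: it assumes 'same' padding for the 3 × 3 convolutions, a stride of 2 for the 2 × 2 max pooling, a three-channel 50 × 50 input patch, and that the two pooling layers follow the first two convolution blocks. None of these details are stated in the text, and the function names are hypothetical.

```python
# Shape trace for the 15-layer CNN described in the text.
# ASSUMPTIONS (not stated in the paper): 'same' padding for 3x3 convolutions,
# stride-2 pooling, and pooling placed after the first two convolution blocks.

def conv_same(shape, out_channels):
    """3x3 convolution with 'same' padding: spatial size is unchanged."""
    height, width, _ = shape
    return (height, width, out_channels)


def max_pool_2x2(shape):
    """2x2 max pooling with stride 2: spatial size is halved (floor)."""
    height, width, channels = shape
    return (height // 2, width // 2, channels)


def trace_shapes(input_shape=(50, 50, 3), filters=(8, 16, 32)):
    """Return (layer_name, output_shape) pairs for the convolutional stack."""
    trace = [("input", input_shape)]
    shape = input_shape
    for index, out_channels in enumerate(filters, start=1):
        shape = conv_same(shape, out_channels)
        trace.append((f"conv{index}", shape))
        trace.append((f"batchnorm{index}", shape))  # shape-preserving
        trace.append((f"relu{index}", shape))       # shape-preserving
        if index < len(filters):  # 2 pooling layers for 3 convolution layers
            shape = max_pool_2x2(shape)
            trace.append((f"maxpool{index}", shape))
    return trace


# Layer counts exactly as listed in the text; they should sum to 15.
LAYER_COUNTS = {
    "input": 1, "convolution": 3, "batch normalization": 3, "ReLU": 3,
    "max pooling": 2, "fully connected": 1, "softmax": 1,
    "classification output": 1,
}

for name, shape in trace_shapes():
    print(name, shape)
print("total layers:", sum(LAYER_COUNTS.values()))
```

Under these assumptions the last convolution block produces a 12 × 12 × 32 volume, which the fully connected layer would then map to the 288-dimensional feature reported in the text.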
In addition, we used cartoon pictures to simplify the face and hand spatial-temporal information, as shown in Fig. 2. We used three symbols and a thick, solid line of uniform size to draw each cartoon picture: "•" denotes the face location, and two other distinct symbols denote the left hand's location and the right hand's location. The cartoons remove complex environmental factors, so they can be used for recognition tasks involving nonspecific people and nonspecific environments. The different positions of the three components and their logical connections determine the spatial configuration of different gestures. Therefore, cartoon pictures can directly reflect the complete spatial information of gestures. We used $I_2$ to denote the cartoon picture and fed it into the CNN to obtain the feature $x_2$: $x_2 = \mathrm{CNN}(I_2)$.

B. MULTI-INFORMATION FUSION
To effectively combine hand information with spatial-temporal information, and inspired by Huang et al. [38], we fused two channels of information together for better SLR performance. However, Huang et al. [38] only considered temporal factors of sign language and fused two data streams in a fully connected layer of the CNN. The sign language representation should actually consider spatial factors and temporal factors together. Inspired by Jun et al. [39], we used the ST-LSTM model to represent the spatial-temporal dependencies and relationships among different frames and different body joints. We fused the features $x_1$ and $x_2$ in the ST-LSTM units, where $h_{t-1}$ is the hidden state of the joint at time $t-1$.
In this paper, we use $E_M$ and $\zeta$ to denote the maximum number of epochs and the initial learn rate, respectively, and we also set a gradient threshold; the learn rate schedule is 'piecewise', and we use $\eta$, $\xi$, $\delta$, $n_h$, and $b_m$ to denote the learn rate drop period, learn rate drop factor, embedding dimension, number of hidden units, and minibatch size, respectively. To obtain better fusion results, according to the actual samples in the sign language dataset, we train the LSTM parameters with the adaptive moment estimation (ADAM) method, and we set $E_M = 250$, the gradient threshold to 3, $\zeta = 0.001$, and $\eta = 125$.
The $x_2$-related LSTM unit has two forget gates, $f^{T,2}_t$ and $f^{S,2}_t$, to process the time-related information and spatial-related information, respectively. The $x_2$-related LSTM unit is calculated according to LSTM-based time series processing theory [39], where $x_t \in \mathbb{R}^D$ is the input signal of joint $j$ at time $t$, and $i$, $f$, and $o$ are the input gate, forget gate, and output gate, respectively. $c$ is the $d$-dimensional cell state, $u$ is the modulated input, $\odot$ denotes the element-wise product, and the gate pre-activations are given by an affine transformation $\mathbb{R}^{D+d} \to \mathbb{R}^{4d}$.
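To make the role of the two forget gates concrete, the following is a minimal sketch of one step of an LSTM unit with a spatial and a temporal forget gate. It follows the general ST-LSTM form and is not the authors' exact fusion unit: the function name `st_lstm_step`, the gate ordering, and the use of a single affine map from $D + 2d$ inputs to $5d$ pre-activations are assumptions made for illustration (the text quotes a $D + d \to 4d$ map for the single-forget-gate case).

```python
import math


def sigmoid(value):
    """Logistic function used for the gates."""
    return 1.0 / (1.0 + math.exp(-value))


def affine(weights, bias, vector):
    """Compute W @ vector + bias with plain Python lists."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def st_lstm_step(x, h_spatial, c_spatial, h_temporal, c_temporal,
                 weights, bias):
    """One step of an LSTM unit with two forget gates (illustrative).

    x           : input feature at the current position, length D
    h_spatial   : hidden state of the spatial neighbor, length d
    c_spatial   : cell state of the spatial neighbor, length d
    h_temporal  : hidden state at time t - 1, length d
    c_temporal  : cell state at time t - 1, length d
    weights     : 5d rows, each of length D + 2d (ASSUMED layout)
    bias        : length 5d
    """
    d = len(c_temporal)
    z = affine(weights, bias, list(x) + list(h_spatial) + list(h_temporal))
    input_gate = [sigmoid(v) for v in z[0:d]]
    forget_spatial = [sigmoid(v) for v in z[d:2 * d]]
    forget_temporal = [sigmoid(v) for v in z[2 * d:3 * d]]
    output_gate = [sigmoid(v) for v in z[3 * d:4 * d]]
    modulated = [math.tanh(v) for v in z[4 * d:5 * d]]
    # The cell state mixes new input with both the spatial and temporal memory.
    cell = [i * u + fs * cs + ft * ct
            for i, u, fs, cs, ft, ct in zip(input_gate, modulated,
                                            forget_spatial, c_spatial,
                                            forget_temporal, c_temporal)]
    hidden = [o * math.tanh(c) for o, c in zip(output_gate, cell)]
    return hidden, cell
```

With all weights and biases set to zero, every gate equals 0.5 and the modulated input is 0, so the new cell state is simply the average of the spatial and temporal cell states, which is a convenient sanity check.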
The $x_1$-related LSTM unit is calculated in the same way as the $x_2$-related unit, where $h$ is the fusion hidden state. Finally, the fusion output of the ST-LSTM is the resulting sequence of fusion hidden states.

C. ATTENTION-BASED BIDIRECTIONAL LSTM TRANSLATION SYSTEM
According to the above, we input two feature sequences, one for $x_1$ and one for $x_2$, and through ST-LSTM feature fusion, we obtain the hidden state sequence $h = (h_1, \ldots, h_T)$, which can be considered an encoded sequence. In this paper, since we use a bidirectional ST-LSTM as the encoder, we have $V = q(h_1, \ldots, h_T)$, where $q(\cdot)$ is an encode function and $V$ is an encoded vector that includes all information of $h$. Each $h_t$ is formed from $\overrightarrow{h}_t$, the hidden state computed over the sign language sequence from beginning to end at time $t$, and $\overleftarrow{h}_t$, the hidden state computed over the sign language sequence from end to beginning at time $t$.
VOLUME 8, 2020
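The forward and backward passes of a bidirectional encoder can be sketched as follows. This is an illustration under stated assumptions rather than the paper's encoder: the recurrent step is passed in as an arbitrary function, and the forward and backward states are combined by concatenation, which is a common choice that the text does not confirm. The names `run_direction` and `bidirectional_encode` are hypothetical.

```python
def run_direction(inputs, step, initial_state):
    """Apply a recurrent step over the inputs in order, collecting states."""
    states = []
    state = initial_state
    for x in inputs:
        state = step(x, state)
        states.append(state)
    return states


def bidirectional_encode(inputs, step_forward, step_backward, initial_state):
    """Return one combined state per frame.

    Forward states read the sequence from beginning to end; backward states
    read it from end to beginning and are then realigned to frame order.
    Combining by list concatenation is an ASSUMPTION.
    """
    forward = run_direction(inputs, step_forward, initial_state)
    backward = run_direction(list(reversed(inputs)), step_backward,
                             initial_state)
    backward.reverse()  # backward[t] now corresponds to frame t
    return [f + b for f, b in zip(forward, backward)]


# Toy recurrent step (running sum) used only to show the data flow.
def running_sum(x, state):
    return [state[0] + x[0]]


frames = [[1], [2], [3]]
print(bidirectional_encode(frames, running_sum, running_sum, [0]))
```

In this toy run each combined state holds the sum of the frames seen so far from the left and from the right, which shows that every $h_t$ carries context from both directions.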
In the decode stage, let $s$ be the hidden state vector in the decode function and let $y$ be the decode output vector. Based on the softmax function, we obtain the optimal decode output: $y_t = \arg\max_{y \in D} \mathrm{softmax}(W s_t + b)$, where $W$ is the weight matrix, $b$ is the bias, and $D$ is the dictionary of sign language words. Generally, the LSTM decode calculation using the attention mechanism depends on a context vector $V_t$, which can be computed by $V_t = \sum_{j=1}^{T} \alpha_{tj} h_j$, where $h_j$ is the encode vector and $\alpha_{tj}$ is the weight coefficient whose value corresponds to the decoding output. A higher value of $\alpha_{tj}$ corresponds to higher correlation with the current decoding output.
To obtain the weight of each encoding vector at time $t$, we designed an alignment model $f(s_{t-1}, h_j)$. Since we aim to calculate the weight distribution of each of the frame encoding features, we compare the hidden vector $s_{t-1}$ in the decoding function with the encoding vector $h_j$. We used the model $f(s_{t-1}, h_j)$ to align the decode output with the encode input and then normalized the weight of each encoding vector by the softmax function: $\alpha_{tj} = \exp(f(s_{t-1}, h_j)) / \sum_{k=1}^{T} \exp(f(s_{t-1}, h_k))$. We used a simple perceptron layer to perform the alignment calculations. In the decoding calculation, $f^d_t$, $i^d_t$, and $o^d_t$ are the forget gate, input gate, and output gate, respectively, and $\varphi_g$ is an affine transformation.
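The attention step described above (score each encoding vector against the previous decoder state, normalize with softmax, and take the weighted sum) can be sketched as follows. The alignment model is assumed here to be an additive single-layer perceptron of the form $v^\top \tanh(W_s s_{t-1} + W_h h_j)$; the paper's exact perceptron layer is not recoverable from the text, and the parameter names `w_s`, `w_h`, and `v` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_score(s_prev, h_j, w_s, w_h, v):
    """ASSUMED additive alignment: v . tanh(W_s s_prev + W_h h_j)."""
    hidden = [
        math.tanh(sum(a * b for a, b in zip(row_s, s_prev))
                  + sum(a * b for a, b in zip(row_h, h_j)))
        for row_s, row_h in zip(w_s, w_h)
    ]
    return sum(a * b for a, b in zip(v, hidden))


def context_vector(s_prev, encoder_states, w_s, w_h, v):
    """Return (alphas, context): attention weights and their weighted sum."""
    scores = [alignment_score(s_prev, h_j, w_s, w_h, v)
              for h_j in encoder_states]
    alphas = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(a * h[k] for a, h in zip(alphas, encoder_states))
               for k in range(dim)]
    return alphas, context
```

When the scoring vector `v` is all zeros, every score is zero, the weights are uniform, and the context vector reduces to the mean of the encoder states; trained parameters shift the weights toward the frames most relevant to the current word.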
Finally, we obtain the sign language sentence translation results. In summary, the input of the Bi-ST-LSTM-A system is the image feature sequence $X_1, \cdots, X_T$, where $T$ is the length of the sign language image sequence, $X = (x_1, x_2)$, and $x_1$ and $x_2$ are the image features corresponding to images $I_1$ and $I_2$, respectively. The system output is the recognition result of a continuous sign language sentence, $(y_1, \cdots, y_n)$, where $n$ is the length of the sentence. In training, the loss function is defined using the differences between the text sentence labels and the output sentence, and the parameters are trained with the gradient descent method. Finally, the optimized parameters are obtained: $\theta^* = \arg\max_{\theta} p(y_1, \cdots, y_n \mid X_1, \cdots, X_T; \theta)$.
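One loss that is consistent with the stated objective $\theta^*$ is the negative log-likelihood of the labeled sentence, since minimizing it is equivalent to maximizing $p(y_1, \cdots, y_n \mid X_1, \cdots, X_T; \theta)$. The form below is an assumption offered for orientation, not a reconstruction of the paper's own equations; the factorization over words follows the usual encoder-decoder convention, and $\zeta$ is used for the learn rate as in Section III-B.

```latex
\begin{align}
  \mathcal{L}(\theta)
    &= -\log p(y_1, \dots, y_n \mid X_1, \dots, X_T; \theta) \\
    &= -\sum_{t=1}^{n} \log p(y_t \mid y_1, \dots, y_{t-1}, X_1, \dots, X_T; \theta), \\
  \theta &\leftarrow \theta - \zeta \, \frac{\partial \mathcal{L}(\theta)}{\partial \theta}.
\end{align}
```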

IV. EXPERIMENTS
A. DATASET
We tested the performance of our method using 4 sign language datasets: one is a continuous dataset collected by us, and the other 3 are open-source continuous sign language datasets, namely, the Chinese sign language (CSL) dataset [38] and two German sign language datasets, the RWTH-PHOENIX-Weather dataset [38] and the RWTH-PHOENIX-Weather-2014T dataset.
In our experiments, only RGB video streams were used. Our CSL dataset was collected in-house and can be used for daily communication. This dataset consisted of 50 continuous common CSL sentences, such as "What's your name?" and "Hello, everyone." Each sentence consisted of 3-5 signs, with a total of 150 different isolated signs in the dataset, including "you," "ID card," and "home." There were 500 instances of each isolated sign, for a total of 150 × 500 = 75,000 instances. We used 70% of instances for training, 15% of instances for validation, and the remaining 15% for testing. Some CSL examples in our dataset are given in Fig. 6.
The 500 CSL dataset [38] contains 25,000 video instances. The total video clip is 100+ hours long and was recorded by 50 actors. Each video instance was labeled semantically by a professional sign language teacher. We used 70% of the instances for training, 15% for validation, and the remaining 15% for testing.
The RWTH-PHOENIX-Weather dataset contains 7,000 weather forecast sentences from nine sign language speakers. All of the videos were 25 frames per second (FPS), with a resolution of 210 × 260. Overall, 80% of the instances were used for training, 10% for validation, and 10% for testing. The RWTH-PHOENIX-Weather-2014T dataset is an expansion of RWTH-PHOENIX-Weather and contains a total of 8257 videos.
The evaluation of sentence recognition is different from that of isolated word recognition. In sentence recognition, the length of the output sentence may not be consistent with the length of the annotation sentence, which means that there may be insertion, deletion, and replacement errors in the output. To consider these errors and describe the accuracy of sentence recognition, we use a metric [43] based on $S$, $I$, and $D$, which represent the minimum number of replace, insert, and delete operations, respectively, required to convert the hypothesized sentence into the true annotation.
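The operation counts $S$, $I$, and $D$ can be obtained with a standard edit-distance dynamic program over words. The sketch below is illustrative: it assumes the commonly used form Accuracy $= 1 - (S + I + D)/N$, with $N$ the number of words in the annotation, which may differ from the exact formula in [43]. The function names are hypothetical, and when several minimal edit scripts exist, only the total $S + I + D$ is unique.

```python
def edit_operations(hypothesis, reference):
    """Return (S, I, D) for a minimal edit script hypothesis -> reference.

    S: replacements, I: insertions, D: deletions, counted over words.
    """
    rows, cols = len(hypothesis) + 1, len(reference) + 1
    # Each cell stores (total_cost, S, I, D) for the two prefixes.
    table = [[None] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = (i, 0, 0, i)  # delete all i hypothesis words
    for j in range(cols):
        table[0][j] = (j, 0, j, 0)  # insert all j reference words
    for i in range(1, rows):
        for j in range(1, cols):
            cost, s, ins, dele = table[i - 1][j - 1]
            if hypothesis[i - 1] == reference[j - 1]:
                diagonal = (cost, s, ins, dele)          # match, no cost
            else:
                diagonal = (cost + 1, s + 1, ins, dele)  # replacement
            cost, s, ins, dele = table[i][j - 1]
            insert = (cost + 1, s, ins + 1, dele)
            cost, s, ins, dele = table[i - 1][j]
            delete = (cost + 1, s, ins, dele + 1)
            table[i][j] = min(diagonal, insert, delete)  # lowest cost wins
    _, s, ins, dele = table[-1][-1]
    return s, ins, dele


def sentence_accuracy(hypothesis, reference):
    """ASSUMED metric: 1 - (S + I + D) / N, N = number of reference words."""
    s, ins, dele = edit_operations(hypothesis, reference)
    return 1.0 - (s + ins + dele) / len(reference)


print(sentence_accuracy("what your name".split(),
                        "what is your name".split()))
```

In this example one insertion is needed to turn the three-word hypothesis into the four-word annotation, so the assumed accuracy is 1 - 1/4.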

B. PERFORMANCE EVALUATION ON OUR DATASET
1) ACCURACY
In our dataset, 50 sentences were established to test the continuous CSL recognition framework proposed in this paper. These sentences were composed of 150 isolated sign words and included common phrases such as "Do not forget to bring an umbrella," "This is my business card," and "Is there a room?" The visual receptive field and time window length are highly important for spatial feature fusion with a large time span. The fusion effects of the proposed method under different combinations are compared on our database. As shown in Tab. 1 and Tab. 2, we obtain some recognition results based on different parameter settings, and from the results, we observe that our proposed method obtains the best performance regardless of the parameter settings.
We also find that the combination of the time series model and sequential neural network (LSTM + HMM model) can effectively improve the accuracy of sign language recognition. The combination of the attention model and sequential neural network (LSTM + attention) can clearly improve the recognition accuracy of continuous sign language. Hence, it is best to use a sequence neural network that models both temporal and spatial information. One possible explanation for these findings is that the time fusion layer usually focuses on capturing information integration at different times, while visual input involves more spatial attributes of a gesture but is closely related to the dynamic correlation of hand shape. Therefore, a suitable spatiotemporal fusion structure can be applied to the visual input with a close temporal structure.
The proposed technique was tested on a PC with an Intel Core i5 CPU with 8 GB of RAM and the Microsoft Windows 10 operating system. We used two Titan graphics cards for training acceleration. The training process with our dataset is shown in Fig. 6 for 75,000 video samples: 70% of the samples were selected as training samples, there was a total of 600 epochs, and the training time was 13 hours 54 min. The value of the loss function decreased over time.
We compared continuous CSL recognition performance for different data sizes and various frameworks. As shown in Tab. 1 and Tab. 2, the recognition accuracy gradually decreased with increasing quantity of data, making large-scale sign language recognition difficult. For example, as shown in Tab. 1, the LSTM recognition accuracy for 10 samples was 91.13%, but it decreased to 72.22% with 50 samples; this result may have been caused by a decline in the isolated word recognition or increased sentence complexity. Regardless, the LSTM + attention and LSTM + HMM approach outperformed the LSTM method, with Bi-ST-LSTM-A producing the highest recognition accuracy (78.87%) across all 50 sentences. These results indicate that multi-information fusion is beneficial for automated segmentation of sign language sentences, independent of the data sampling environment.
Additional details are shown in Figs. 7 and 8. In Fig. 7, some examples for sign language feature presentation are shown. We can see that our proposed sign language feature representation is effective and that different gestures correspond to different feature representations.
An attention matrix is also given in Fig. 8. We can see that the sign language sentence can be translated well by the attention model. Four attention score matrices are presented in Fig. 8, and the attention scores highlight the relationship between the source and translated sequences. Each attention score is calculated with softmax(·) as the activation function: the attention score at each time step is the dot product of the hidden state $h_j$ and the learnable attention weights $\alpha_{tj}$ multiplied by the encoder output $Z^{enc}_j$. The encoder output is obtained by $Z = \mathrm{embedding}(X, W_{input})$, where the embedding function maps numeric indices to the corresponding vectors given by the input weights $W_{input}$.

2) TIME COST
We also tested the time cost of the proposed method, with the time cost comparison results given in Tab. 3. From the results, we can see that the LSTM method uses the shortest time for training, and its testing time is 0.22 s, while our method uses the longest time for training. Overall, the testing time of all LSTM-based methods is almost the same. The disadvantage of our method is the longer training time, while its advantages are the highest recognition accuracy and a testing time that is almost the same as those of the other methods.

C. PERFORMANCE EVALUATION ON THE OPEN-SOURCE DATASET
We also compared our method with existing approaches on 3 open-source databases. For fair comparison, we set several LSTM-based methods as baselines.

1) BASELINES AND CRITERIA
To fully evaluate our model, we compare the proposed method with several baseline methods. The first baseline is long short-term memory (LSTM). The LSTM-based method [5] directly translates the video into natural language. In [19], the "fc7" layer feature was extracted from each frame to feed the feature sequences into the LSTM network at each moment, and the LSTM outputs a corresponding word at each moment. The second baseline is video-to-text (S2VT) [13], which is a stacked two-layer LSTM where the first layer encodes video frames. S2VT uses the CNN output as an input feature, and once all the frames are read, the model generates a sentence word by word. The third baseline is LSTM combined with an attention mechanism (LSTM-A) [16]. This model takes advantage of the global time structure and focuses on the most relevant time frames. The fourth baseline is LSTM-E [41], a model based on visual semantic embedding. Here, the CNN is used to extract the visual features of the selected video frames for a given video. Video representations are generated by pooling the values of these visual features. Then, the visual semantic embedding model of LSTM is used to simultaneously generate the video sentences and measure the distance between the video and the sentences.
We tested the sentence recognition accuracy by compensating for the missing alignment during the identification process. The proposed method achieved the highest accuracy. While this alignment is not a true alignment result, we still believe that our proposed scheme is feasible. It is important to note that although all of the LSTM-based models are similar to our model, one of the key differences is that many LSTM-based methods ignore or simplify temporal or spatial information in order to simplify the calculation during the embedding process. We chose to retain both temporal and spatial information and to optimize the alignment of video sentences. In addition, we compared our model with traditional continuous SLR algorithms, such as CRF [45] and DTW-HMM [8]. These models require presegmentation of the video when they are identified, possibly leading to segmentation errors. From the comparison results, we can observe that our method presents improved SLR accuracy due to the avoidance of time segmentation. In Tab. 2, we compare recognition performances based on different parameter settings, and it is observed that regardless of the parameter values, our method always obtains the best test results.
Table note [38]: The ADAM method is selected for data training, and for the LSTM-based method, we set $E_M = 300$.
Tab. 6 shows the comparison results of continuous SLR on the RWTH-PHOENIX-Weather dataset. Some CNN-based methods always obtain good recognition results, such as deep hand and recurrent CNN. Both deep hand [21] and the recurrent CNN [44] are extensions of the CNN: the former combines the CNN with EM algorithms, while the latter combines the RNN and the CNN. These methods not only exploit the feature learning ability of the CNN but also utilize the time-series modeling ability of the iterative EM and RNN. Our approach uses a similar idea but additionally employs an attention model to strengthen the key content of translation. The approach also uses high-resolution image detection to obtain subtle local hand information and improve the recognition accuracy and robustness of identification. Then, our approach uses the ST-LSTM attention network to generate sign language sentences. The comparison results show that the proposed Bi-ST-LSTM-A network is superior to the other state-of-the-art methods.
Table note [38]: The ADAM method is selected for data training, and we set $E_M = 250$, the gradient threshold to 5, $\zeta = 0.005$, $\eta = 125$, $\xi = 0.05$, $\delta = 256$, $n_h = 200$, and $b_m = 128$. The other methods select optimal parameters for training.

4) EVALUATION ON THE RWTH-PHOENIX-WEATHER-2014T DATASET
We also use the RWTH-Phoenix-Weather-2014T [4] continuous sign language dataset, which is an extended database of RWTH-Phoenix-Weather-2014, to evaluate the performance of the proposed method. This dataset provides spoken language translations and gloss-level annotations for German sign language videos of weather broadcasts. The dataset was recorded with 9 different signers and contains a total of 8257 videos. The dataset includes 2887 different isolated sign language words. The video size is 210 × 260. We also divide the dataset into three sets for training, validation and testing, and there is no overlap with the previous version of the dataset in any subset. As shown in Tab. 7, the Bi-ST-LSTM-A achieves 75.32% accuracy on the test set. Comparison to some state-of-the-art methods, such as the CNN-LSTM-HMM method [49], Re-Sign [27] and CNN-Temp-CNN [46], shows that our method has the highest recognition accuracy.
TABLE 7. Performance comparison of different methods on the RWTH-PHOENIX-Weather dataset. (For the LSTM-based method, we select the ADAM method to train the data and set the LSTM parameters as $E_M = 300$, gradient threshold = 3, $\zeta = 0.001$, $\eta = 100$, $\xi = 0.01$, $\delta = 200$, $n_h = 100$, and $b_m = 50$, while other methods select optimal parameters for training.)
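The 'piecewise' learn rate schedule named in the parameter settings can be illustrated with the values from the Table 7 caption ($\zeta = 0.001$, $\eta = 100$, $\xi = 0.01$, $E_M = 300$). The sketch assumes the usual meaning of such a schedule, namely that the learn rate is multiplied by the drop factor once every drop period; the function name and the 0-based epoch index are choices made for this example.

```python
def piecewise_learn_rate(epoch, initial_rate, drop_period, drop_factor):
    """Learn rate at a 0-based epoch under a piecewise schedule.

    ASSUMPTION: the rate is multiplied by drop_factor after every
    drop_period completed epochs.
    """
    return initial_rate * drop_factor ** (epoch // drop_period)


# Values quoted in the Table 7 caption.
MAX_EPOCHS = 300      # E_M
INITIAL_RATE = 0.001  # zeta
DROP_PERIOD = 100     # eta
DROP_FACTOR = 0.01    # xi

for epoch in (0, 99, 100, 199, 200, MAX_EPOCHS - 1):
    rate = piecewise_learn_rate(epoch, INITIAL_RATE, DROP_PERIOD, DROP_FACTOR)
    print(epoch, rate)
```

Under this reading, training uses three learn rate levels over the 300 epochs, with a drop by a factor of 100 at epochs 100 and 200.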

V. CONCLUSION
In this paper, a continuous SLR framework based on an ST-LSTM fusion attention network is proposed. We call it Bi-ST-LSTM-A, and it bypasses the sequence segmentation steps. The SL video features are produced by a dual-stream CNN model: one stream analyzes global motion information, while the other focuses on local gesture representation. The ST-LSTM is used for spatial-temporal information fusion, and then, an attention-based Bi-LSTM framework is introduced to measure the correlation between the video and the sentence. Finally, the transformation between the video and the sentence is established by the Bi-ST-LSTM-A network, and the sentence recognition is realized through encoding and decoding operations.

XIN CHANG was born in 1996. She is currently pursuing the degree with Xi'an Technological University. Her research interests include motion recognition and video information processing.
XUE ZHANG was born in 1996. She is currently pursuing the degree with Xi'an Technological University. Her research interests include motion recognition and video information processing.
XING LIU was born in 1975. He received the master's degree from Northwestern Polytechnical University in 2008 and the Ph.D. degree in aeronautical and astronautical manufacturing engineering from Northwestern Polytechnical University in 2014. In 2015, he joined the School of Electronic Information Engineering, Xi'an Technological University, where he is currently a Professor. His research interests include ammunition design, guidance, and control.