Dynamic Hand Gesture Recognition Using Multi-Branch Attention Based Graph and General Deep Learning Model

Dynamic hand skeleton data, which contain the 3D coordinates of hand joints, have become increasingly attractive and widely studied for hand gesture recognition. Many researchers have worked to develop skeleton-based hand gesture recognition systems using various discriminative spatial-temporal attention features computed from the dependencies between joints. However, these methods may face difficulties in achieving high performance and generalizability because their features are inefficient. To overcome these challenges, we propose a multi-branch attention-based graph and general deep learning model that recognizes hand gestures by extracting all feasible types of skeleton-based features. Our multi-branch architecture uses two graph-based neural network channels and one general neural network channel. Among the graph-based channels, the first applies the spatial attention module followed by the temporal attention module to produce spatial-temporal features, whereas the second produces temporal-spatial features by reversing the sequence of the first branch. The last branch extracts general deep learning features using a general deep neural network module. The final feature vector is constructed by concatenating the spatial-temporal, temporal-spatial, and general features and feeding them into a fully connected layer. We include position embedding and a mask operation in both the spatial and temporal attention modules to track the node sequence and reduce the system's computational complexity. Our model achieved 94.12%, 92.00%, and 97.01% accuracy on the MSRA, DHG, and SHREC'17 benchmark datasets, respectively. The high accuracy and low computational cost show that the proposed method outperforms existing state-of-the-art methods.


I. INTRODUCTION
Research on hand gesture recognition has been growing steadily because of its many real-life applications, such as human-computer interaction, nonverbal communication, wheelchair control, abnormal behaviour monitoring, and sign language recognition [1], [2], [3], [4], [5], [6]. Previous work on hand gesture recognition can be divided into two categories based on the data collection procedure: vision-based and sensor-based systems. Since sensor-based systems are inconvenient because the user must carry sensors, researchers focus on vision-based systems, which require only a camera and are easy to deploy. Based on the input data modality, vision-based research can be divided into two categories: image-based methods, which use full image pixels, and skeleton-based methods, which use only joint information. RGB or RGB-D images are common inputs for image-based methods for extracting recognition features. In comparison, skeleton-based methods predict hand gestures from the 2D or 3D coordinates of hand joints. The skeleton sequence is not affected by the limitations of RGB video and does not contain colour information. Moreover, Johansson et al. proved that the key joints of a gesture carry highly informative cues about human motion [7]. Furthermore, each skeleton joint represents a point of the human body in three-dimensional coordinates. Among the significant reasons this kind of dataset is valuable to researchers is that it contains higher-level semantic information in a small amount of memory and adapts easily to dynamic systems [8], [9], [10]. Currently, many low-cost depth cameras, such as the Intel RealSense and OAK-D, are available on the market; they are easy to use for collecting skeleton gesture information and have driven great progress in gesture recognition research [11], [12]. Based on skeleton datasets, many researchers have proposed conventional methods that design powerful feature descriptors for recognizing hand gestures [13], [14], [15]. The main problems of the conventional approach are low accuracy and limited generalization capability. Researchers have applied deep learning techniques to overcome these challenges and improve accuracy by directly converting joint coordinates into tensors that feed neural networks [16], [17], [18]. They first produce features with a neural network, which are then learned by the deep learning network during training. Other researchers transformed their input skeletons into a meaningful format such as a graph, a point sequence, or a pseudo-image using graph topologies or traversal rules. This data format is then fed directly into a deep learning method such as a CNN, GCN, RNN, or LSTM to extract effective features and improve the network's performance [8], [19], [20]. Moreover, it remains uncertain whether hand-crafted features and rules are the optimal choice for modelling the global dependencies between joints. However, the transformer, which learns global dependencies and is built mainly on the self-attention mechanism, has been highly successful in natural language processing (NLP) [21], [22], [23].
They reported that better parallelizability and global dependencies among elements can be learned with minimal computational complexity. In addition, an attention-based model does not require prior information about the intrinsic relationships among the joints. Another advantage of the attention model for this task is that a hand gesture dataset contains a limited number of joints, so useful patterns can be discovered from the hand skeleton data at minimal computational cost. The main drawback of these models, nevertheless, is that they do not fully consider the spatial and temporal structure of the sequential hand skeleton data. Recently, many researchers have applied graph-based spatial-temporal attention models to recognize skeleton-based hand gestures [9], [24], [25], [26], [27], [28]. Although they achieved good performance, the main drawbacks are a lack of flexibility and sub-optimal performance caused by the fixed graph structure, which makes it difficult to capture the variance and dynamics across different actions. To overcome these challenges, researchers have more recently worked on dynamic graph-based spatial-temporal models for hand skeletons to recognize hand gestures [24], [29], [30].
Although these models overcome the optimality problem, their accuracy is still not satisfactory. Moreover, they may struggle to achieve consistently satisfactory performance because of the inefficiency of the extracted features. In addition, they extract only spatial features followed by temporal features; they offer no analysis of the reverse ordering or of combining these features with other general deep learning features. These challenges motivated us to extract all feasible kinds of features from the hand skeleton dataset with a dynamic graph-based attention model, including spatial attention, temporal attention, and general deep learning information. To this end, we propose a Multi-Branch Attention Based Graph and General Deep Learning model for skeleton-based hand gesture recognition. We developed the architecture following a dynamic graph-based attention mechanism that includes spatial, temporal, and general deep learning information. To convert the original non-sequential skeleton information into sequential information, we used a general neural network whose output is considered the initial feature. We then employed three branches to extract all feasible features: a spatial-temporal branch, a temporal-spatial branch, and a general deep neural network branch. The spatial-temporal and temporal-spatial branches are graph-based deep neural network branches that use a position embedding technique to generate unique markers for every point before each attention block, which helps the attention model process the data sequentially. We utilized a masking operation in each attention block to reduce the computational complexity, because mixing the two kinds of information would decrease the efficiency of the system. The main purpose of the third branch is to recover missing feature values and to mitigate inefficient signal propagation into the fully connected layer. Because we fuse three kinds of features, covering all feasible kinds of hand skeleton features, the proposed model is more efficient and faster than existing systems. The significant contributions of this study are as follows: • We propose a Multi-Branch Attention Based Graph and General Deep Learning Model to recognize dynamic skeleton-based hand gestures.
• We used several principles in designing the spatial-temporal, temporal-spatial, and general deep neural network branches. The first branch produces spatial-temporal features by passing spatial attention output through a temporal attention block, and the second produces temporal-spatial features by passing temporal attention output through a spatial attention block. The third branch carries the general deep learning network features; finally, we fuse the three feature vectors to generate an effective final feature vector.
• Finally, we conducted a comprehensive validation of our system on three dynamic skeleton-based hand gesture datasets and achieved considerably superior performance over the state-of-the-art methods within minimal time. The models and code of the proposed method were uploaded to GitHub and are publicly available at https://www.github.com/musaru/Graphand-General-DNN. This paper is organized as follows: Section II provides a review of the relevant literature. Section III describes the benchmark hand skeleton datasets used in this work. The proposed multi-branch spatial-temporal attention model is described in Section IV. Section V describes the experimental results and different evaluation scenarios. Section VI concludes the paper and discusses future work.

II. RELATED WORK
Hand gesture recognition based on hand joint skeleton information has recently been widely used in the computer vision domain but is still considered a challenging task. Traditional approaches, such as classical machine learning with hand-crafted feature extraction, mainly focus on developing effective feature descriptors [15], [31], [32], [33], [34]. Ohn-Bar et al. proposed a set of feature generators for skeleton data that combine a histogram of oriented gradients (HOG) algorithm with their descriptor and employed a linear SVM after converting the features to a 2D array using HOG again [15]. Many other feature extractors have also been proposed, such as the covariance matrix of skeleton joint locations [34]; joint locations, joint angles, and 3D geometric relationships between joints [35]; and intra-class variance [36]. A hand geometric configuration for capturing hand shape variation was proposed by De Smedt et al., which extracts spatial-temporal motion features of hand parts from the whole Euclidean space [37]. They achieved 82.50% and 80.11% accuracy for the 14 and 28 gestures of the DHG dataset after applying an SVM to the Riemannian-based trajectory features. De Smedt et al. also extracted features based on Fisher vectors and a skeleton-based geometric technique, then applied an SVM to the concatenated features, achieving 83.00% and 80.00% accuracy for the DHG 14-gesture and 28-gesture settings, respectively [13]. They extracted three features, namely the shape of connected joints (SoCJ), the histogram of hand directions (HoHD), and the histogram of wrist rotations (HoWR), and combined them into the final feature vector. De Smedt et al. further applied the Fisher vector and shape-of-connected-joints features to the SHREC'17 dataset with an SVM classifier and achieved 88.24% and 81.90% for the 14 and 28 gestures, respectively [14]. The advantage of this work is that it demonstrated the superiority of 3D skeleton information over depth-based approaches, but a drawback is that it did not consider the amplitude of the gesture, and the temporal pyramid representation may lose some information. Chen et al. proposed a motion feature extractor that combines articulated finger movement with global hand motion to extract bone angles and applied an RNN for classification. They evaluated their model on the DHG dataset and achieved 84.68% and 80.32% accuracy for the 14 and 28 classes, respectively [16]. Other researchers employed deep neural networks such as CNNs on hand joint skeleton data for recognizing hand gestures and achieved significant improvements [14], [16], [18], [23], [27], [32]. Many researchers combined other networks with CNNs, such as RNN-based approaches that transform the skeleton data into sequential data using traversal rules and feed it into an LSTM for training and prediction [9], [17], [18], [38], [39]. Lin et al. developed a fusion model combining a skeleton LSTM and a Res-C3D network for recognizing abnormal hand gestures [39]. Lai et al. extracted features from finger and global motion using a variational auto-encoder and then fed them into three RNNs [26]. Ma et al. proposed a modified memory-augmented neural network, namely gesture recognition using an enhanced network (GREN), with an LSTM architecture to recognize hand gestures as a one-shot learning algorithm that aims to improve the system's efficiency [23]. They achieved 82.29% and 82.03% on the DHG dataset and 79.17% on the MSRA dataset. Handwriting-inspired features (HIF3D) were proposed by Boulahia et al.
for 3D skeleton-based gesture classification, achieving 90.48% accuracy for the 14 gestures and 80.48% for the 28 gestures of the DHG dataset [28]. Recently, researchers have focused on utilizing self-attention mechanisms to increase the efficiency and accuracy of vision-based hand gesture recognition by capturing long-range dependencies [41], [42]. Vaswani et al. first applied a self-attention network to establish semantic relationships among words [21].
The mechanism computes Query, Key, and Value vectors; it first multiplies the Query with the Key, divides by the square root of the key's dimension, and finally applies the SoftMax function to produce the weight vector [22], [30]. Self-attention has since been employed for detection, semantic segmentation, and relational modelling [43], [44], [45]. Currently, many researchers combine spatial-temporal attention with various architectures, such as CNNs [39], [46], [47], [48], [49], RNNs with soft attention instead of hidden RNN states [50], and memory attention networks (MANs) [31]. Song et al. applied a spatial-temporal attention mechanism through an RNN and LSTM, using individual joints as the main information [51]. Hou et al. employed spatial-temporal attention combined with residual connections and a temporal convolutional neural network (STA-Res-TCN) to recognize skeleton-based hand gestures [27]. They extracted features from different levels of the attention mechanism, applied a CNN to individual time steps, and achieved 89.20% and 85.00% accuracy for the 14 and 28 gestures of the DHG dataset, respectively. They also evaluated the model on the SHREC'17 dataset and achieved 93.60% and 90.70% accuracy for the 14 and 28 gestures, respectively. Recently, graph convolutional neural networks (GCNNs) have been used by many researchers for gesture recognition [8], [9], [29], [32]. Although existing systems produce good performance in some cases, they still face generalization problems and sometimes have difficulty achieving high performance across more datasets. To overcome these challenges, we employ a Multi-Branch Attention Based Graph and General Deep Learning model to recognize hand gestures. We first employ a deep neural network and then spatial-temporal and temporal-spatial branches to produce node and edge features in the spatial and temporal domains. To increase the system's generalization, we extract general deep learning features and concatenate the three extracted features to produce the final feature vector. To reduce the computational cost, we use spatial-temporal masks; we achieved 94.12% accuracy on the MSRA dataset and 92.00% and 88.78% accuracy on the DHG dataset for the 14 and 28 gestures, respectively. Similarly, the model achieved 97.01% and 92.78% accuracy for the 14 and 28 gestures of the SHREC'17 dataset. Our study is more efficient in general, as it does not require hand-crafted transformation rules, and it outperforms existing methods by a significant margin.
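To make the mechanism described above concrete, the following is a minimal sketch of scaled dot-product self-attention in PyTorch. It illustrates the generic mechanism from [21], not the authors' released code; the tensor shapes and variable names are our assumptions.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Generic scaled dot-product attention (Vaswani et al. [21]).

    Q, K, V: tensors of shape (num_nodes, d); here each row could be
    one hand-skeleton joint's query/key/value vector.
    """
    d = K.size(-1)
    # Dot product of every query with every key, scaled by sqrt(d).
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # (num_nodes, num_nodes)
    weights = F.softmax(scores, dim=-1)           # normalized edge weights
    return weights @ V                            # weighted sum of values

# Example: 22 joints, 64-dimensional projections (illustrative sizes).
Q = torch.randn(22, 64); K = torch.randn(22, 64); V = torch.randn(22, 64)
out = scaled_dot_product_attention(Q, K, V)       # shape (22, 64)
```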

III. DATASET DESCRIPTION
We studied ten publicly available skeleton-based datasets while selecting benchmarks for the proposed model, namely: MSRA [52], DHG [13], SHREC'17 [14], Florence 3-D Action [53], UTKinect [54], UCF-Kinect [55], NTU [56], NYU [57], ICVL [58], and NVGesture [59]. Among them, Florence 3-D Action, UTKinect, UCF-Kinect, and NTU are human action datasets. The NYU dataset was collected only for binary data, and NVGesture and ICVL contain only RGB and depth information. Since the objective of the proposed model is to recognize skeleton-based hand gestures, we selected the most recently used skeleton-based hand gesture datasets, namely MSRA, DHG, and SHREC'17, which have almost similar characteristics in terms of hand skeleton key points and the number of samples. The details of the used skeleton datasets are given below [5].
A 3D skeleton data sequence can be defined as a vector following Equation (1):

$$S = [P_1, P_2, \ldots, P_N]^T \tag{1}$$

Here, $S$ represents the skeleton data sequence, $P_j$ represents a multivariate time sequence, and $T$ denotes the transpose of a matrix. Each component of the sequence can be written as $P_j = (P_j(t))_{t \in \mathbb{N}}$, which contains three univariate sequence components, as in Equation (2):

$$P_j(t) = [X_j(t), Y_j(t), Z_j(t)] \tag{2}$$

Here, the x, y, and z coordinates of the j-th joint are represented by $X_j$, $Y_j$, and $Z_j$, respectively, and $P_j(t)$ represents the position of the j-th skeletal joint at frame $t$. Every joint corresponds to a precise, distinct articulation of the physical hand. For each time frame $t$, 21 joints for the MSRA dataset and 22 joints for the DHG and SHREC'17 datasets are collected in 3D space by the Intel Creative Interactive Camera with their positions $P_j(t) \in \mathbb{R}^3$, where $N = 21$ or $22$.
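As a concrete illustration of this representation, the following sketch builds the tensor form of a skeleton sequence under the paper's definitions; the array layout (frames × joints × coordinates) is our assumption.

```python
import numpy as np

T, N = 8, 22          # frames per sample and joints (22 for DHG/SHREC'17)
# S[t, j] = P_j(t) = (X_j(t), Y_j(t), Z_j(t)), the 3D position of joint j
# at frame t, matching Equations (1) and (2).
S = np.random.randn(T, N, 3).astype(np.float32)  # placeholder coordinates

P_5 = S[:, 5, :]      # the multivariate time sequence P_j for joint j = 5
X_5, Y_5, Z_5 = P_5[:, 0], P_5[:, 1], P_5[:, 2]  # its univariate components
```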

A. MSRA DATASET
The MSRA dataset is one of the most challenging publicly available hand joint skeleton-based gesture datasets [52]. It was recorded from 9 participants performing 17 right-hand gestures captured with the Intel Creative Interactive Camera. Each gesture was manually chosen following American Sign Language, covering the span of finger articulations as much as possible. The dataset contains 490 to 500 frames for each gesture sequence and, across the 17 gestures, is composed of 76,500 frames. Each frame provides 21 joints as 3D world-space skeleton information; 2D depth images were also collected. Among the 21 joints, each finger consists of four joints, with one additional joint in the palm. The names of the 21 joints are shown in Figure 1. This dataset is considered challenging because of its viewpoint variation.

B. DHG DATASET
DHG is a publicly available dynamic hand gesture dataset and one of the most challenging; it contains sequences of 14 right-hand gestures with various finger configurations [13]. The dataset was collected using the Intel RealSense SDK; each gesture was performed five times by 20 participants in two finger configurations. Following this procedure, a total of 2,800 video sequences were collected, each containing 20 to 70 frames. Each frame provides a full hand skeleton in 3D world space formed by 22 skeleton joints. Figure 2 shows the names and positions of the 22 hand skeleton joints. Some gestures consist mainly of hand movement and are called coarse gestures, while others are characterized by the shape of the hand and are called fine gestures. Among the 14 gestures, nine coarse and five fine gestures are reported. The DHG dataset also contains depth images alongside the skeleton information, but in our experiments we used only the skeleton information for gesture recognition. Table 1 shows the names and types of the gestures.

C. SHREC'17 DATASET
Another challenging skeleton-based hand gesture dataset is the SHREC'17 dataset [14]. This dataset is similar to the DHG dataset; it was also collected with the Intel RealSense SDK, from 27 participants. Data were collected 1 to 10 times from each participant in two finger configurations, giving a total of 2,800 video sequences.
Depending on the number of fingers used, the labels of this dataset are categorized into 14 or 28 classes. In addition, some gestures are performed with only one finger, while others use the whole hand. For each gesture, 2D and 3D hand representations were also collected along with the depth image for each scene and each time step. Although this dataset contains 2D depth images and 3D hand skeleton information, we used only the 3D hand skeleton information in this study. The names of the 22 hand skeleton points of this dataset are shown in Figure 2 and listed in Table 1.

IV. PROPOSED METHODOLOGY
The main goals of the originally demonstrated MSRA, DHG, and SHREC'17 datasets were (1) dynamic hand gesture recognition based on the full hand skeleton and depth information, and (2) evaluating the efficiency of a hand gesture recognizer based on the number of fingers used in the gesture [29]. However, the main objective of our study differs: we aim to achieve high hand gesture recognition performance with minimal time and cost using only 3D hand skeleton information, in contrast to still-image and video-based hand gesture recognizers. Another objective is to extract all feasible features, including those from a small deep neural network acting as a skip connection. The purpose of NN2 is to resolve the missing-value problem and improve the performance and efficiency of the model by combining general features with the others. Our proposed architecture is demonstrated in Figure 3(a). We designed a Multi-Branch Attention Based Graph and General Deep Learning Model to recognize dynamic skeleton-based hand gestures. We used two graph-based neural network channels and one general neural network channel in our multi-branch architecture. Among the graph-based channels, one first applies the spatial attention module and then the temporal attention module; the other first applies the temporal attention module and then the spatial attention module. We therefore refer to the graph-based branches as the spatial-temporal and temporal-spatial branches. All three branches take their input from the output of NN1, shown in Figure 3(b). First, NN1 takes the skeleton data points as input for each node and projects the 3D coordinates of the hand joints into a 128-dimensional initial node feature $F_1$. All three branches take $F_1$ as input, where the first and second branches embed the output of NN1 with the corresponding spatial and temporal positions to track the sequence correctly. The first branch produces 256-dimensional spatial node features, projects them to 128 dimensions using NN1, embeds them with the temporal positions, and feeds them into the temporal attention model, which produces a 256-dimensional spatial-temporal feature. We then project this 256-dimensional node feature to 128 dimensions using NN1 and denote it by $F_{ST}$. Figure 4(a) shows the spatial-temporal $F_{ST}$ feature extraction mechanism. In the second branch, we follow the reverse sequence of the first branch: we first feed the initial feature $F_1$ into the temporal attention model, then feed the temporal feature into the spatial attention model, and produce the temporal-spatial feature vector after projection by NN1, denoted by $F_{TS}$. Figure 4(b) shows the temporal-spatial $F_{TS}$ feature extraction mechanism. The third branch also takes $F_1$ as input and, after applying the general deep neural network NN2 shown in Figure 3(c), produces a general feature $F_G$. We then concatenate the spatial-temporal, temporal-spatial, and general features according to Equation (3) to produce the final feature vector of the proposed architecture, $F_{Final}$:

$$F_{Final} = \text{Concat}(F_{ST}, F_{TS}, F_G) \tag{3}$$

Lastly, we feed the average-pooled vector of the concatenated node features into the fully connected layer for classification.
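To summarize the data flow just described, here is a minimal PyTorch sketch of the three-branch fusion, assuming the dimensions stated above (128-d initial features, 256-d attention outputs). The module names (`NN1`, attention blocks, etc.) mirror the paper's notation, but the bodies are illustrative stand-ins (plain linear layers where the paper uses masked multi-head attention), not the released code.

```python
import torch
import torch.nn as nn

class MultiBranchModel(nn.Module):
    """Illustrative three-branch fusion: spatial-temporal, temporal-spatial,
    and general features concatenated as in Equation (3)."""
    def __init__(self, num_classes, d_init=128, d_attn=256):
        super().__init__()
        self.nn1 = nn.Linear(3, d_init)             # 3D coords -> initial feature F1
        # Placeholders for the paper's masked attention blocks:
        self.spatial_attn = nn.Linear(d_init, d_attn)
        self.temporal_attn = nn.Linear(d_init, d_attn)
        self.proj = nn.Linear(d_attn, d_init)       # 256 -> 128 reprojection
        self.nn2 = nn.Linear(d_init, d_init)        # general branch (NN2)
        self.fc = nn.Linear(3 * d_init, num_classes)

    def forward(self, x):                           # x: (batch, T, N, 3)
        f1 = self.nn1(x)                            # initial node features F1
        # Branch 1: spatial then temporal attention -> F_ST
        f_st = self.proj(self.temporal_attn(self.proj(self.spatial_attn(f1))))
        # Branch 2: temporal then spatial attention -> F_TS
        f_ts = self.proj(self.spatial_attn(self.proj(self.temporal_attn(f1))))
        f_g = self.nn2(f1)                          # general feature F_G
        f_final = torch.cat([f_st, f_ts, f_g], dim=-1)   # Equation (3)
        pooled = f_final.mean(dim=(1, 2))           # average pool over all nodes
        return self.fc(pooled)                      # classification logits

model = MultiBranchModel(num_classes=14)
logits = model(torch.randn(4, 8, 22, 3))            # 4 samples, 8 frames, 22 joints
```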
A. GRAPH-BASED DEEP NEURAL NETWORK BRANCH
We consider two of the three branches as graph-based deep neural network branches because we use an attention-based mechanism to compute the representation of every hand skeleton joint as a graph node based on its neighbours. The self-attention approach learns an adaptive, dynamic local summary of the neighbouring nodes to improve prediction, and is extended to multi-head attention by repeating itself. The primary purpose of these two branches is to extract spatial-temporal [29] and temporal-spatial domain features that build a long sequence for learning the most important parts of the hand skeleton. To adapt the unified graph, we extract spatial and temporal domain features that are dynamically optimized for the different actions. Both graph-based branches take input from the output of NN1 and produce the spatial-temporal and temporal-spatial features after encoding with the spatial and temporal attention models. In both branches, we employ position embedding and masking operations in each attention block, in both the spatial and temporal domains, to improve accuracy and efficiency.

1) SKELETON-BASED GRAPH INITIALIZATION
The structure of hand skeleton data naturally forms a graph. A hand gesture video sequence contains T frames representing the hand skeleton, and N 3D hand skeleton joints can be recorded from each frame. We assume a fully connected graph G = (V, E) is constructed from the sequence of hand skeleton joints, where V is the set of joint nodes. The main idea of the feature extraction procedure from the 3D coordinates is that each node connects with the other nodes and with itself, and we consider three kinds of edges: spatial, temporal, and self-connected [27]. The set of edges E is defined as follows (a sketch of the corresponding masks follows this list):
• The connection of two different nodes at the same time step is a spatial edge, defined by $E_S = \{(v_{(t,i)}, v_{(t,j)}) \mid i \neq j\}$.
• The connection of two different nodes at different time steps is a temporal edge, defined by $E_T = \{(v_{(t,i)}, v_{(k,j)}) \mid t \neq k\}$.
• The connection of a node with itself is a self-connected edge, defined by $E_{self} = \{(v_{(t,i)}, v_{(t,i)})\}$.
Here, the frame indices are represented by t and k, and the joint indices by i and j.
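The following sketch builds boolean masks for these three edge types over the T×N node grid. This construction is our illustration of the definitions above, not code from the paper; the node indexing convention is assumed.

```python
import torch

T, N = 8, 22                         # frames and joints (illustrative sizes)
n_nodes = T * N                      # nodes indexed as (t, i) -> t * N + i

frame = torch.arange(n_nodes) // N   # frame index t of each node
same_frame = frame[:, None] == frame[None, :]
same_node = torch.eye(n_nodes, dtype=torch.bool)

self_edges = same_node                           # E_self
spatial_edges = same_frame & ~same_node          # E_S: same frame, i != j
temporal_edges = ~same_frame                     # E_T: different frames

# Masks used by the masked attention blocks later in this section:
# self-connected edges are kept in both the spatial and the temporal mask.
M_S = (spatial_edges | self_edges).float()
M_T = (temporal_edges | self_edges).float()
```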

2) POSITION EMBEDDING
Recurrent networks such as GRUs and LSTMs process their input sequentially, whereas our architecture is a kind of transformer that does not process the skeleton joints sequentially. We therefore use positional embedding to maintain the sequence of the joint information, since there is no built-in notion of sequence in the transformer. Each skeleton joint of the hand gesture is encoded as a tensor for feeding the deep neural networks. Without pre-defined structures or orders indicating a node's identity, it is impossible to identify which part of the hand gesture a node belongs to; we need unique markers or identifiers for every node. We propose a spatial and temporal position encoding technique to generate this information according to the joint order. Following [30], [31], [63], and [64], we use sine and cosine functions with various frequencies to encode the position number for each node:

$$PE(p, 2i) = \sin\!\left(p / 10000^{2i/C_{in}}\right), \qquad PE(p, 2i+1) = \cos\!\left(p / 10000^{2i/C_{in}}\right) \tag{4}$$

Here, $PE(p, 2i)$ represents the sine position encoding for even indices, $PE(p, 2i+1)$ represents the cosine position encoding for odd indices, $i$ indexes the dimensions of the position encoding vector, $C_{in}$ is its dimensionality, and $p$ denotes the position of each element. According to [63] and [64], the input hand skeleton contains both space and time information, and one important strategy of position embedding is to unify the spatial and temporal information and encode them sequentially. The spatial position embedding comprises N vectors, each corresponding to a hand joint; we apply spatial position encoding by encoding all joints within a single frame sequentially. The temporal position embedding is likewise composed of individual vectors, each representing the hand skeleton graph of the corresponding frame; we encode the same joint across different frames sequentially. Lastly, we add the position information to the output of the NN1 network, which is the initial feature of a specific node, before feeding it into the proposed architecture. The combination of the feature vector with the position embedding is shown in Equations (5) and (6):

$$\hat{f}_{(t,i)}^{ST} = A_T\!\left(A_S\!\left(f_{(t,i)} + P^S_{(t,i)}\right) + P^T_{(t,i)}\right) \tag{5}$$

$$\hat{f}_{(t,i)}^{TS} = A_S\!\left(A_T\!\left(f_{(t,i)} + P^T_{(t,i)}\right) + P^S_{(t,i)}\right) \tag{6}$$

Here, the spatial-temporal and temporal-spatial features for a specific node $v_{(t,i)}$ are represented by $\hat{f}_{(t,i)}^{ST}$ and $\hat{f}_{(t,i)}^{TS}$, respectively. In the equations, $f_{(t,i)}$ represents the initial feature, $A_T$ represents the output of temporal attention, and $A_S$ represents the output of spatial attention. The spatial and temporal position embeddings of the i-th hand joint of frame t are represented by $P^S_{(t,i)}$ and $P^T_{(t,i)}$, whose embedding dimension is the same as the input $f_{(t,i)}$ dimension.
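Below is a minimal sketch of the sinusoidal encoding of Equation (4), assuming the standard base of 10000 from [21]; the function and argument names are ours.

```python
import torch

def sinusoidal_position_encoding(num_positions, c_in):
    """Equation (4): sine on even dimensions, cosine on odd dimensions."""
    p = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # positions
    i = torch.arange(0, c_in, 2, dtype=torch.float32)                  # even dims
    angle = p / (10000 ** (i / c_in))
    pe = torch.zeros(num_positions, c_in)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

# Spatial embedding: one vector per joint within a frame (N positions).
P_S = sinusoidal_position_encoding(22, 128)
# Temporal embedding: one vector per frame for the same joint (T positions).
P_T = sinusoidal_position_encoding(8, 128)
```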

3) SPATIAL-TEMPORAL ATTENTION MODULE
The proposed approach consists of spatial-temporal and temporal-spatial attention branches and a general deep neural network branch. The attention-based branches comprise two attention models with spatial embedding and two attention models with temporal embedding. In the first branch, the spatial attention block takes its input from the output nodes of NN1 and updates them with the encoded spatial information; the result is then fed into the temporal attention block for updating with the temporal information, producing the spatial-temporal feature. In the same way, the second branch produces the temporal-spatial feature by the reverse procedure of the first branch. In all cases, we apply a multi-head attention mechanism [21], [29], [30], which is visualized in Figure 5. Consider $f_{(t,i)}$, the initial feature of a node $v_{(t,i)}$ of a hand skeleton, used as the input of an attention layer. There are multiple heads in the attention mechanism; the m-th attention head first applies fully connected layers to map the input feature $f_{(t,i)}$ to query, key, and value vectors:

$$Q^m_{(t,i)} = f_{(t,i)} W^m_Q, \qquad K^m_{(t,i)} = f_{(t,i)} W^m_K, \qquad V^m_{(t,i)} = f_{(t,i)} W^m_V \tag{7}$$

Here, the query, key, and value vectors are represented by $Q^m_{(t,i)}$, $K^m_{(t,i)}$, and $V^m_{(t,i)}$, respectively. The weight matrices of the fully connected layers for the m-th spatial or temporal attention model are denoted by $W^m_Q$, $W^m_K$, and $W^m_V$ for query, key, and value, respectively. The spatial, temporal, and self-connected edge weights are calculated in two stages. The first stage computes the scaled dot product between the query and key vectors [21], [29], [30]; the second stage normalizes the output of the dot product using a SoftMax activation function. Equation (8) executes these two steps:

$$u^m_{(t,i)\to(t,j)} = \frac{\left\langle Q^m_{(t,i)}, K^m_{(t,j)} \right\rangle}{\sqrt{d}}, \qquad \alpha^m_{(t,i)\to(t,j)} = \text{softmax}_j\!\left(u^m_{(t,i)\to(t,j)}\right) \tag{8}$$

where $d$ represents the dimension of the key vectors, the scaled dot product between nodes $v_{(t,i)}$ and $v_{(t,j)}$ is represented by $u^m_{(t,i)\to(t,j)}$, the inner product operation is represented by $\langle \cdot,\cdot \rangle$, and the attention operation is represented by $\alpha^m_{(t,i)\to(t,j)}$, which extracts effective information passing from node $v_{(t,i)}$ to node $v_{(t,j)}$. At this stage, we can determine whether the attention is spatial or temporal using masking operations that assign values to the edges: we block information passing in the temporal domain by assigning 0 weight to all temporal edges when computing spatial attention, and vice versa. Consequently, the spatial attention block produces a weighted skeleton graph over the hand joints of the same time frame, and the attention head output for node $v_{(t,i)}$ is calculated using Equation (9):

$$\hat{f}^m_{(t,i)} = \sum_{j} \alpha^m_{(t,i)\to(t,j)} V^m_{(t,j)} \tag{9}$$

Here, $\alpha^m_{(t,i)\to(t,j)}$ and $\hat{f}^m_{(t,i)}$ represent the attention operation and the output of the attention head. The attention operation $\alpha^m_{(t,i)\to(t,j)}$ works as either spatial or temporal attention for the node $v_{(t,i)}$ depending on the masking operation. The main idea of spatial attention is to calculate the relationships between pairs of nodes and pass information among the nodes within the same time step; each node then aggregates the received information according to the learned edge weights. Equation (9) is repeated M times to produce the multi-head attention of the spatial or temporal domain as multiple feature vectors.
Finally, all the attention head outputs are concatenated according to Equation (10) to make a single feature vector $\hat{f}_{(t,i)}$, which is considered the feature vector for the node $v_{(t,i)}$; in the spatial domain, we consider it the spatial attention feature $A_S$:

$$\hat{f}_{(t,i)} = \text{Concat}\!\left(\hat{f}^1_{(t,i)}, \hat{f}^2_{(t,i)}, \ldots, \hat{f}^M_{(t,i)}\right) \tag{10}$$

Here, the single-head and multi-head spatial or temporal attention features are represented by $\hat{f}^m_{(t,i)}$ and $\hat{f}_{(t,i)}$, respectively, and M is the total number of heads in the multi-head attention, which is 8 in our study. In the first branch, the spatial attention model $A_S$ learns the weighted skeleton graphs and produces node features by encoding multiple types of structural information. The spatial attention feature is then taken as the input feature for the temporal attention $A_T$, which applies the described multi-head attention procedure in the temporal domain and produces the spatial-temporal feature information. In the same way, in the second branch, the temporal attention model $A_T$ learns weighted skeleton graphs and produces node features by encoding multiple types of structural information, which are then fed into the spatial attention $A_S$, applying the described multi-head attention procedure with the corresponding position embedding.
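The following is a compact sketch of the masked multi-head attention described by Equations (7)-(10), written for a flattened set of T×N nodes and assuming 8 heads as stated. It is our illustrative reading of the module (all heads computed through one projection per role), not the authors' implementation.

```python
import torch
import torch.nn as nn

class MaskedMultiHeadAttention(nn.Module):
    """Equations (7)-(10): per-head Q/K/V projections, masked scaled
    dot-product weights, and concatenation of the M head outputs."""
    def __init__(self, d_in=128, d_head=32, num_heads=8):
        super().__init__()
        self.h, self.d = num_heads, d_head
        self.q = nn.Linear(d_in, num_heads * d_head)   # W_Q for all heads
        self.k = nn.Linear(d_in, num_heads * d_head)   # W_K
        self.v = nn.Linear(d_in, num_heads * d_head)   # W_V

    def forward(self, f, mask):
        # f: (n_nodes, d_in); mask: (n_nodes, n_nodes), 1 = allowed edge.
        n = f.size(0)
        Q = self.q(f).view(n, self.h, self.d).transpose(0, 1)  # (h, n, d)
        K = self.k(f).view(n, self.h, self.d).transpose(0, 1)
        V = self.v(f).view(n, self.h, self.d).transpose(0, 1)
        W = Q @ K.transpose(-2, -1) / self.d ** 0.5            # Eq. (8)/(11)
        W = W.masked_fill(mask == 0, -9e5)                     # eta for blocked edges
        alpha = W.softmax(dim=-1)                              # edge weights
        out = alpha @ V                                        # Eq. (9)
        return out.transpose(0, 1).reshape(n, self.h * self.d) # Eq. (10): concat
```

With `M_S` from the graph-initialization sketch, `MaskedMultiHeadAttention()(features, M_S)` yields the 256-dimensional (8 heads × 32) spatial attention output $A_S$; passing `M_T` yields temporal attention.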

4) SPATIAL-TEMPORAL MASK OPERATION
In the proposed architecture, we employ spatial and temporal masking operations in the attention blocks to cut down the computational cost. In spatial attention, the mask operator assigns 1 to spatial positions and 0 to the others; likewise, the temporal attention mask contains 1 for temporal positions and 0 elsewhere. Performing the mask operation reduces the effective size of the data block and cuts down the system's computational cost. The attention block first computes three fully connected layers for the query, key, and value vectors; it then calculates the dot product between the query and key vectors and divides it by the square root of the key dimension. Before the SoftMax activation function, we apply the mask operation in both the spatial and temporal domains to block the edges of the unnecessary domain. Figure 6 illustrates our masking operation [29], [30]. In the previous section, we discussed the attention mechanism, in which we compute a query matrix Q and a key matrix K: each row of Q contains the query vector of one node, and each row of K contains the key vector of one node. We then compute the edge weight matrix W using scaled dot products, as in Equation (11):

$$W = \frac{Q K^T}{\sqrt{d}} \tag{11}$$

Here, $W$ and $T$ denote the weight matrix and the transpose of the key matrix, and the product is the matrix multiplication between the query matrix and the transposed key matrix. The edge weights in W can correspond to spatial or temporal edges depending on the values set in each element of the masking matrix. In the first stage, the spatial mask operation sets the entries of W that correspond to temporal edges to η, a large negative number whose SoftMax output is near zero, and keeps the other values unchanged. After applying the spatial mask, we obtain the output $\overline{W}_S$ containing the spatial edges, while $\overline{W}_T$ contains the temporal edges. Equation (12) calculates the spatial edges:

$$\overline{W}_S = \phi\!\left(W \odot M_S + \eta \times (1 - M_S)\right) \tag{12}$$

Here, $\overline{W}_S$, $\odot$, $\phi$, and $\times$ represent the spatial attention edges, the element-wise product, the SoftMax function, and the multiplication operation, respectively. In addition, $W$, $M_S$, and $\eta$ represent the weight matrix, the spatial mask, and a number close to negative infinity. The spatial mask matrix contains 1 if an edge is self-connected or spatial, and 0 otherwise. In this work, we assign $\eta = -9 \times 10^5$. The SoftMax activation normalizes the weights over the spatial edges because the SoftMax output at η is near zero; consequently, all temporal edges are set to 0 in $\overline{W}_S$. Equations (11) and (12) thus implement the spatial-domain edge weight calculation of Equation (8). The masked output matrix $\overline{W}_S$ can then be used to compute the node features of Equation (9) through matrix multiplication with the value matrix. In the same way, we apply the temporal mask operation according to Equation (13), using $M_T$ instead of $M_S$ to compute the weight matrix $\overline{W}_T$ in the temporal domain:

$$\overline{W}_T = \phi\!\left(W \odot M_T + \eta \times (1 - M_T)\right) \tag{13}$$
Here, $\overline{W}_T$, $\odot$, $\phi$, and $\times$ represent the temporal attention edges, the element-wise product, the SoftMax function, and the multiplication operation, respectively. In addition, $W$, $M_T$, and $\eta$ represent the weight matrix, the temporal mask, and a number close to negative infinity. As before, the temporal mask matrix contains 1 if an edge is temporal or self-connected, and 0 otherwise. The main goal of the mask operation is to increase the efficiency of the system by reducing computational complexity.
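A minimal sketch of Equations (11)-(13) follows, reusing the masks M_S and M_T built in the graph-initialization sketch and the stated value η = −9 × 10^5; the function name is ours.

```python
import torch

def masked_edge_weights(Q, K, M, eta=-9e5):
    """Equations (11)-(13): scaled dot-product weights with a domain mask.
    Q, K: (n_nodes, d); M: (n_nodes, n_nodes) with 1 for kept edges."""
    d = K.size(-1)
    W = Q @ K.T / d ** 0.5                 # Equation (11)
    W_masked = W * M + eta * (1 - M)       # blocked edges pushed toward -inf
    return W_masked.softmax(dim=-1)        # phi: rows normalized over kept edges

# Spatial weights use M_S (Equation (12)); temporal weights use M_T
# (Equation (13)). Node features then follow as W_S @ V, per Equation (9).
```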

B. GENERAL DEEP NEURAL NETWORK BRANCH
In our study, the general deep neural network branch is used as an alternative path: the output of NN2 is concatenated with the spatial-temporal and temporal-spatial features. In NN1, we first employ a fully connected layer with a ReLU activation, followed by layer normalization; a dropout layer is used to reduce overfitting, producing the initial feature $F_1$. NN2 takes the output of NN1 as a three-dimensional tensor input, where a fully connected layer produces 256 dimensions after layer normalization; we then employ an average pooling layer to produce an averaged vector, and finally a padding layer is used to maintain the output dimension of the general feature vector from NN2. This branch effectively mitigates missing-data problems and the convergence problems of exploding and vanishing gradients faced by the other branches [62], [63].
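Based on this description, the following sketch shows plausible NN1 and NN2 modules. The exact layer sizes and the padding behaviour (here a broadcast of the pooled vector back to the node dimension) are our assumptions where the text does not fix them.

```python
import torch
import torch.nn as nn

class NN1(nn.Module):
    """Initial feature extractor: FC + ReLU + LayerNorm + Dropout -> F1."""
    def __init__(self, d_in=3, d_out=128, p_drop=0.1):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)
        self.norm = nn.LayerNorm(d_out)
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):                       # x: (batch, nodes, d_in)
        return self.drop(self.norm(torch.relu(self.fc(x))))

class NN2(nn.Module):
    """General branch: FC to 256-d + LayerNorm, average pooling over nodes,
    then padding back to the node count to keep the output dimension."""
    def __init__(self, d_in=128, d_hidden=256):
        super().__init__()
        self.fc = nn.Linear(d_in, d_hidden)
        self.norm = nn.LayerNorm(d_hidden)

    def forward(self, f1):                      # f1: (batch, nodes, d_in)
        h = self.norm(self.fc(f1))
        pooled = h.mean(dim=1, keepdim=True)    # average pooling over nodes
        return pooled.expand_as(h)              # assumed padding behaviour
```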

V. EXPERIMENTS
We conducted a comprehensive validation of our system on the three dynamic skeleton-based hand gesture datasets.
Our proposed system has three channels: two are graph-based neural network channels, and one is a general neural network channel. Among the graph-based channels, one first uses the spatial attention module and then the temporal attention module, while the other first uses the temporal attention module and then the spatial attention module. Finally, we fuse their outputs and, after average pooling, apply a fully connected layer as the final classification layer.

A. EXPERIMENTAL CONFIGURATION OF TRAINING AND TESTING
We implemented our architecture on the PyTorch platform using an NVIDIA GPU machine with 8 GB of memory. We randomly selected eight frames from each video as input. Following previous work, we first subtracted the palm position of the first frame from every input frame; we then applied data augmentation techniques from previous work, namely shifting, scaling, time interpolation, and noise addition. For training, we used the Adam optimizer with a learning rate of 0.001, a batch size of 32, and dropout rates of 0.1 and 0.2 [64].
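A sketch of the stated training configuration follows, assuming the MultiBranchModel stub from Section IV and a standard cross-entropy objective; the loss function and the `train_loader` name are our assumptions, as the text does not specify them.

```python
import torch
import torch.nn as nn

# Hyperparameters follow the text: Adam, lr = 0.001, batch size 32.
model = MultiBranchModel(num_classes=14)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()   # assumed; not specified in the paper

for skeletons, labels in train_loader:   # skeletons: (32, 8, N, 3) batches
    optimizer.zero_grad()
    loss = criterion(model(skeletons), labels)
    loss.backward()
    optimizer.step()
```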

B. EXPERIMENTAL SETUP AND IMPLEMENTATION PROTOCOLS
We selected the three most recently used and well-known skeleton-based hand gesture datasets, MSRA [52], DHG [13], and SHREC'17 [14], to evaluate the proposed model. DHG and SHREC'17 each contain 2,800 video sequences covering 14 and 28 gesture classes, with the 3D coordinates of 22 joints extracted from each frame. The MSRA dataset covers 17 gestures with about 500 frames per gesture sequence, 76,500 frames in total, where 21 joints are extracted from each frame. There were 9, 20, and 27 subjects for the MSRA, DHG, and SHREC'17 datasets, respectively. On all three datasets, we evaluated our model with a cross-validation procedure: leave-one-out cross-validation (LOOCV). Under this procedure, for each experiment we selected n−1 subjects' data for training and the remaining subject's data for testing. There are nine subjects in the MSRA dataset, so keeping one subject's data for testing, we trained the model on the remaining eight. There are 20 subjects in the DHG dataset; we took one subject for testing and the remaining 19 for training. In the same way, among the 27 subjects of the SHREC'17 dataset, we used 26 subjects' data for training, and the remaining one was used for testing. The overall accuracy over all gestures is reported here.
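A short sketch of this leave-one-subject-out protocol follows; the subject counts match the text, while the data structures and the `train_and_eval` callback are illustrative placeholders.

```python
# Leave-one-subject-out cross-validation, as described above.
# `samples_by_subject` is a hypothetical dict: subject id -> list of samples.
def loocv(samples_by_subject, train_and_eval):
    accuracies = []
    for held_out in samples_by_subject:
        train = [s for subj, data in samples_by_subject.items()
                 if subj != held_out for s in data]
        test = samples_by_subject[held_out]
        accuracies.append(train_and_eval(train, test))
    return sum(accuracies) / len(accuracies)   # average over held-out subjects

# MSRA: 9 subjects (train on 8); DHG: 20 (train on 19); SHREC'17: 27 (train on 26).
```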

C. EXPERIMENTAL RESULT
This section demonstrates the accuracy of the proposed model on the three benchmark datasets. Section V-C1 presents the performance on the MSRA dataset; Section V-C2 presents the performance on the DHG and SHREC'17 datasets.

1) EVALUATION WITH MSRA DATASET
In the first stage, we evaluated our proposed system on the MSRA dataset, where eight subjects' data were used for training and the remaining subject's data for evaluation. Table 2 shows the accuracy on the MSRA dataset, reporting the accuracy for each of the nine held-out subjects as well as the average over the nine subjects. We obtained a maximum accuracy of 100% for subjects 1, 5, and 8, a minimum accuracy of 82.35% for subject 9, and an average accuracy of 94.12% over the nine subjects.

2) EVALUATION WITH DHG DATASET AND SHREC'17 DATASET
Secondly, to evaluate the proposed model on another dataset, namely DHG-14/28, we trained the model using 19 subjects' data and tested it on the remaining subject for each experiment. Accordingly, we repeated this 20 times with a different held-out subject for both DHG-14 and DHG-28.
In the same way, for the SHREC'17-14/28 dataset, we trained the model on 26 subjects' data and tested it on the remaining subject. The results are reported in Table 5 and Table 6 for comparison.

D. COMPARISON WITH STATE-OF-THE-ART METHOD
We compared our performance with the state-of-the-art models on all datasets to prove the superiority of the proposed system. Since we use graph-based and general neural network modules to extract features and fuse them before feeding them to the classification module, we obtain better accuracy than the existing state-of-the-art models. Sections V-D1, V-D2, and V-D3 present the comparisons for the MSRA, DHG, and SHREC'17 datasets, respectively.

1) COMPARISON OF MSRA DATASET
Our model produced good accuracy on the MSRA dataset compared with the state-of-the-art models shown in Table 4. The state-of-the-art model proposed by Ma et al. employed an enhanced neural network, the GREN, with an LSTM architecture to recognize hand gestures from skeleton data, based on a memory-augmented neural network with one-shot learning [23]. The main goals of their approach are to improve accuracy, minimize prediction error, and remove unnecessary hyperparameter updates. Their model aims to design a network that can effectively combine and share features between dissimilar classes, and they experimented with the model in different configurations. Based on the skeleton information, their LSTM network achieved 72.92% accuracy, and the GREN network achieved 79.17%. In comparison, our proposed method achieved 94.12% accuracy, exceeding the existing method by more than 10.00%.

2) COMPARISON OF DHG DATASET
In Table 5, the proposed study is compared with various state-of-the-art methods on the DHG dataset for both 14 and 28 gestures. It demonstrates that the proposed study outperforms most state-of-the-art techniques and achieves comparable accuracy to DG-STA [29] and STA-GCN [8]. Although some existing methods used both depth and skeleton information, such as the joint angles and HOG2 (JAHOG) approach [15], ASJT [37], SoCJ + HoHD + HoWR [13], NIUKF-LSTM [25], and CNN+RNN [39], our study relies only on the skeleton. Our method achieved an average accuracy over 20 subjects of 92.00% for the 14 gestures, which is higher than the advanced algorithms.
In the case of 28 gestures, it achieved 88.78% average accuracy over 20 subjects, which is also higher than the accuracy of existing methods such as JAHOG [15] and DG-STA [29]. Unlike existing work, our proposed architecture focuses on multiple branches producing multiple feature vectors through a parallel architecture, which also preserves the properties of dynamic hand gestures. Moreover, by replacing some of its branches, the proposed architecture can easily be made compatible with existing state-of-the-art systems such as DG-STA [29]. Our study's broader aim is to fully explore compositions of prior and future work. The table's contents demonstrate that our proposed method's performance is higher than that of the existing methods in this respect.

3) COMPARISON OF SHREC'17 DATASET
The comparison in Table 6 demonstrates that our model outperforms most state-of-the-art methods on the SHREC'17 dataset for both the 14- and 28-gesture cases, with performance comparable to DG-STA [29] and STA-GCN [8]. As shown in Table 6, our study achieved 97.01% accuracy for the 14 gestures and 92.78% for the 28 gestures, averaged over the 27 subjects, outperforming all existing methods in both experimental settings. Specifically, our method improved the accuracy by 3.40% for the 14 gestures and 2.78% for the 28 gestures compared with the best-performing existing method, DG-STA [29], and by more than 5.40% compared with the more recent STA-GCN [8]. Although some existing methods used both depth and skeleton information, among them a histogram-based method on depth sequences (HON4D) [31], shape analysis of motion trajectories on a Riemannian manifold (SMTRM) [32], and SoCJ + HoHD + HoWR [14], our study relies only on the skeleton. On the SHREC'17 dataset, MFA-Net produced 91.31% and 86.55% accuracy for the 14 and 28 gestures, respectively [23], [27]. Res-TCN, STA-Res-TCN [27], STA-GCN [8], and DG-STA [29] applied attention-based architectures to recognize hand gestures from skeleton information. Among the attention-based models, STA-Res-TCN achieved 93.60% and 90.70% accuracy for the 14 and 28 gestures [27], whereas the DG-STA [29] approach, designed to improve accuracy and reduce the computational cost of hand gesture recognition, achieved 94.40% and 90.00% accuracy for the 14 and 28 gestures, respectively. Our proposed method mainly focuses on producing multiple features in parallel from the multiple branches of the parallel architecture, which preserves the properties of dynamic hand gestures. In addition, the proposed study can be made compatible with existing attention-based methods by discarding some branches and modules [27], [28], [29]. Moreover, our study's broader aim is to fully explore compositions of prior and future work. The table's contents demonstrate that our proposed method's performance is higher than that of the existing methods in this respect.

VI. CONCLUSION
We employed a Multi-Branch Attention Based Graph and General Deep Learning approach for recognizing hand gestures based on 3D hand skeleton data points. Our method combines a multi-branch graph-based deep neural network and a general deep neural network model with masking operations to learn spatial and temporal domain information and produce a strong feature vector for classification. We employed two graph-based neural network branches, where the first branch takes input from the output of the neural network NN1 and, after encoding with spatial and then temporal attention, produces the spatial-temporal feature. In the same way, the second branch produces a temporal-spatial feature by following the reverse sequence of the first branch; these are concatenated with the output of the general deep neural network branch and passed through the average pooling layer. Finally, a fully connected layer is applied to learn node and edge weights for classification.
Since we use graph-based and general neural network modules to extract features and fuse them before feeding them to the classification module, our proposed model achieves better accuracy than the existing state-of-the-art models on all three datasets. The reported tables demonstrate the experimental results on the three datasets and the effectiveness of our proposed architecture. In the future, we plan to collect our own 3D hand skeleton information for more gestures to develop a sign language-based communication system.