American Sign Language Word Recognition Using Spatio-Temporal Prosodic and Angle Features: A Sequential Learning Approach

Most of the available American Sign Language (ASL) words share similar characteristics, typically along the sign trajectory, which creates similarity issues and hinders ubiquitous application. Recognition of similar ASL words confuses translation algorithms and leads to misclassification. In this paper, a dynamic sign word recognition algorithm for a large database, called fast fisher vector bi-directional long short-term memory (FFV-Bi-LSTM), is designed on the basis of the fast fisher vector (FFV) and the bi-directional long short-term memory (Bi-LSTM) method. The algorithm is trained on 3D hand skeletal motion and orientation angle features captured by the leap motion controller (LMC). The features of each 3D video frame are concatenated and represented as a high-dimensional vector using FFV encoding. Evaluation results demonstrate that the FFV-Bi-LSTM algorithm accurately recognizes dynamic ASL words on the basis of prosodic and angle cues. Furthermore, comparison results demonstrate that FFV-Bi-LSTM achieves recognition accuracies of 98% and 91.002% for a randomly selected ASL dictionary and 10 pairs of similar ASL words, respectively, under leave-one-subject-out cross-validation on the constructed dataset. The performance of FFV-Bi-LSTM is further evaluated on the ASL data set, the leap motion dynamic hand gestures (LMDHG) data set, and the semaphoric hand gestures contained in the Shape Retrieval Contest (SHREC) dataset, improving accuracy by 2%, 2%, and 3.19%, respectively.


I. INTRODUCTION
The growing attention to human-computer interaction (HCI) makes human hands the most natural and efficient medium to express intentions in daily interaction activities [1]. This has led to the development of numerous HCI systems such as sign language recognition, robotics, and medical diagnostics, among others. The deaf generally depend on sign language to participate in the real world. The World Federation of the Deaf puts the figure at around three hundred active natural sign languages across the globe [2]. American Sign Language (ASL) is one of the most widely used sign languages, with an unwritten grammar characterized by hand motions and sometimes facial/body signs [3]. The language constructs very complex grammatical structures using dynamic word gestures. Dynamic word gestures are the most crucial building blocks in ASL sentence development and facilitate expressive communication. ASL comprises over ten thousand dynamic word gestures, approximately 65% and 35% of which are sign words and finger-spelled words, respectively [4]. Sign words remain the common means for the deaf to express themselves; therefore, these words are indispensable for daily deaf communication. It is imperative to mention that the majority of the available ASL words comprise similar gestures. This similarity usually confuses sensing devices and hinders the application of most sensors, leading to misclassification. To address this, Fang et al. [5] proposed DeepASL, using a leap motion controller (LMC) sensor from the backhand view with a bi-directional long short-term memory (Bi-LSTM). Deep Bi-LSTM architectures should therefore have strong potential for dynamic sign language recognition (SLR) [6], [7].
Avola et al. [6] followed a similar recent approach, in which the LMC with a stacked Deep Bi-LSTM network is used as a prediction model on temporal feature descriptors representing the internal hand joint angles and the palm displacement. However, stacking a large number of Deep Bi-LSTM units resulted in unsatisfactory recognition accuracy. Motivated by [5], [6], we present 3D spatio-temporal skeletal hand joint features based on the prosodic model and the orientation angle to address misclassification of highly correlated ASL words. These words are difficult to recognize by learning internal hand joint angles and palm displacement alone; instead, similar ASL words can be treated as being composed of many small orientation variations and prosodic cues. The major difference between the Deep Bi-LSTM in [6] and ours is that we train the Deep Bi-LSTM from encoded fast fisher vector (FFV) information to improve learning and reduce large abstraction. Our contributions are supported by several sign language models [8]-[11]. We make the following contributions: (i) We introduce orientation angle Q_n and prosodic µ features to discriminate similarity between ASL words from 3D skeletal hand characteristics. (ii) We develop a robust fast fisher vector (FFV) for feature selection and encoding in the Deep Bi-LSTM, which requires no large abstraction. (iii) Hyper-parameter tuning of the FFV-Bi-LSTM sequential learning algorithm is conducted using a validation data-driven approach. (iv) We classify complex gestures with FFV-Bi-LSTM that are difficult to recognize with conventional Deep Bi-LSTM algorithms. (v) Our method conforms with existing results in numerous examples, even with a limited amount of data and with both static and dynamic hand gestures.
The remainder of this article is organized as follows: Section II introduces related works; Section III presents the problem analysis, mathematical hand gesture models, spatio-temporal feature extraction, data correction and normalization, FFV encoding, and the FFV-Bi-LSTM algorithm. The recognition phase is presented in Section IV-A2. Section IV provides details of the experimental analysis and evaluation. The discussion is given in Section V. Finally, conclusions are drawn in Section VI.

II. RELATED WORK
From the existing works, available SLR systems can be further subgrouped into four groups, as shown in Table 1. The first group addresses SLR sensing using contact-based systems, which are further subdivided into two classes: wearable systems [12]-[16], which are very unnatural and prone to misclassification, and radio frequency (RF) systems [17]-[19], which are more natural and non-intrusive but are restricted by the need for high-speed internet access and by interference. The emergence of digital cameras and stereo cameras gave birth to vision-based SLR, forming the second group [20], [21]-[30]; these systems are natural, but the cameras suffer from complex segmentation. The third group uses sensors such as optical sensors, flex sensors, and accelerometers [16], [31]-[34], which require no segmentation and achieve good accuracy; however, they are very expensive, invasive, unnatural, and need a calibration set, as shown in Table 1. Therefore, recent papers track dynamic sign words using active imaging devices such as the LMC [1], [5], [35], MS Kinect [36], and Orbbec Astra, which are portable, require no complex segmentation or calibration, are inexpensive and mobile, and provide 3D information. These form the fourth, active image sensor-based group. A summary of some of the available recognition methods is given in Table 2.

III. MATERIALS AND METHODS
In this section, our approach for addressing the misclassification problem consists of the following steps: problem analysis, mathematical hand gesture models, spatial and temporal feature extraction, data correction and normalization, FFV encoding, and, lastly, the FFV-Bi-LSTM algorithm. This procedure is illustrated in Fig. 1.

A. PROBLEM ANALYSIS
To solve the misclassification problem, the authors in [6] utilize skeletal joint sequences of hand displacements and internal angles as their feature vector. However, these features are insufficient to recognize most ASL words, especially the similar ASL words in Figs. 2-3. It is found that the differences among these ASL words occur mainly in hand orientation, as shown in Figs. 2(a), (c), (f) and 3(a) and (d). Moreover, a small motion at the wrist generates large variation angles (∆φ). To analyze hand orientation, the prosodic model described in [10] needs to be investigated. The prosodic model is built from inherent and prosodic cues to form a lexeme at the root node. Inherent cues comprise handshape, location, and orientation; prosodic cues are motion (movement) features. This is why motion features are known as prosodic features, as shown in Fig. 5. Thus, prosodic cues are mathematically represented to mimic hand joint motion.

Table 2: Summary of some of the available recognition methods.

Sensor-based SLR methods
Jitcharoenport [32] | fLT + LDA + k-NN | sensor-based capturing | calibration, cumbersome, low accuracy
Chu et al. [33] | Residual PairNets + MAP | accelerometers + gyroscopes | cumbersome
Stretchable e-skin [34] | backhand-view based capturing; motion camera; pervasive, trial and error; mutual information + LDA; multiple sensors | unnatural

Active imaging device SLR methods
Kumar et al. [36] | Kinect skeleton coordinates + HMM | real-time position-invariant system | hard learning, limited FoV
Liu and Huai [1] | LMC + HMM-PSO | dynamic hand gestures | hard learning
Aurelijus et al. [35] | LMC + HMC | microservice recognition via the internet | limited representation ability
DeepASL [5] | backhand-view based capturing

B. MATHEMATICAL HAND GESTURE MODELS
The hand is modeled by distance ratios between skeletal key points, where t_s/p_o, t_j/t_k, and t_s/j_s denote the fingertip-to-palm, fingertip-to-fingertip, and fingertip-to-joint ratios, respectively. The prosodic features µ of finger joint motion M_f(n) per frame f can then be written as υ_f(n), where n denotes the number of sequences per frame. Similarly, the chosen mathematical representation for the hand orientation angle about the motion axis Y_R follows the right-hand rule and is obtained from the cross product Y_R = Z_R × X_L. The angle between Z_R and X_L is denoted as a, and the wrist flexion and extension angle is denoted as φ. The hand internal angles b [6] are obtained from the finger joint angles, as shown in Fig. 4. Finally, the hand orientation angles are put together as the angular feature vector Q = [a, φ, b]. The features extracted according to these formulations form the input to the subsequent stages.
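As a rough illustration of these hand gesture models, the sketch below computes normalized distance ratios and the right-hand-rule orientation angle from one frame of 3D skeletal data. The array names (fingertips, joints, palm_center, z_r, x_l) and the normalization by the largest fingertip-to-palm distance are assumptions made for this sketch, not the paper's exact formulation.

```python
import numpy as np

def hand_frame_features(fingertips, joints, palm_center, z_r, x_l):
    """Illustrative distance ratios and orientation angle for a single frame.

    fingertips: (5, 3) fingertip coordinates, joints: (N, 3) joint coordinates,
    palm_center: (3,) palm position, z_r and x_l: axis vectors of the hand frame.
    """
    # Fingertip-to-palm, fingertip-to-fingertip and fingertip-to-joint distances,
    # expressed as ratios so the descriptor is scale invariant across signers.
    tip_palm = np.linalg.norm(fingertips - palm_center, axis=1)
    tip_tip = np.linalg.norm(fingertips[:, None, :] - fingertips[None, :, :], axis=2)
    tip_joint = np.linalg.norm(fingertips[:, None, :] - joints[None, :, :], axis=2)
    scale = tip_palm.max() + 1e-8
    ratios = np.concatenate([tip_palm / scale,
                             tip_tip[np.triu_indices(5, 1)] / scale,
                             tip_joint.ravel() / scale])

    # Right-hand-rule motion axis Y_R and the angle a between Z_R and X_L.
    y_r = np.cross(z_r, x_l)
    cos_a = np.dot(z_r, x_l) / (np.linalg.norm(z_r) * np.linalg.norm(x_l) + 1e-8)
    a = np.arccos(np.clip(cos_a, -1.0, 1.0))
    return ratios, y_r, a
```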

C. SPATIO-TEMPORAL FEATURE EXTRACTION
Spatio-temporal features are defined over a given frame length F of the sequence matrix L = {M_1, ..., M_F}. Each matrix M_t ∈ L consists of the skeletal measurements at time-step t.
Thus, spatial information is obtained by setting a threshold value among successive video frames, as given in Eq. (9). This value is derived from the hand motion velocity (≥ 45% of the peak velocity). Temporal features are the hand coordinates of all finger joints, fingertips, palm center, and wrist center, which generate approximately 22 3D pose coordinates. A pose is distinguished by a velocity that is ≥ 45% of the maximum (peak) velocity, as illustrated in the corresponding frames.
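A minimal sketch of this velocity-based frame selection is given below, assuming the 22 poses per frame arrive as a (frames, 22, 3) array; the mean-speed criterion and the helper name segment_by_velocity are illustrative choices, not the paper's implementation.

```python
import numpy as np

def segment_by_velocity(poses, threshold_ratio=0.45):
    """Keep frames whose hand speed is at least 45% of the peak speed.

    poses: (F, 22, 3) array of the 22 tracked hand poses per frame.
    """
    # Frame-to-frame displacement of every joint, averaged into one speed per frame.
    speed = np.linalg.norm(np.diff(poses, axis=0), axis=2).mean(axis=1)
    peak = speed.max() + 1e-8
    keep = np.where(speed >= threshold_ratio * peak)[0] + 1  # +1: diff shifts indices
    return poses[keep], keep
```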

D. DATA CORRECTION AND NORMALIZATION
The output obtained from the setup illustrated in Fig. 10 contains noise, which is handled by a Savitzky-Golay smoothing filter. The smoothed information B_{h,s,f} = {b^k_{p,q,r,f}, ..., b^k_{p,q,r,n}} is then processed with a locally weighted linear regression (WLR) algorithm to handle missing values and nonlinearity [57]. Thus, a weight function is added to the linear regression, where w denotes the prediction time, w_f denotes the data progression time, and λ denotes the wavelength parameter. The parameter update is given in Eq. (8), and the results of the corrected video information are illustrated in Fig. 6.
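The sketch below outlines one way to realize this correction step for a single coordinate track, using scipy's Savitzky-Golay filter and a locally weighted linear fit with a Gaussian weight exp(-(t - t_f)^2 / (2λ^2)); the kernel choice, window length, and the fill-then-smooth ordering are assumptions for illustration rather than the paper's exact procedure.

```python
import numpy as np
from scipy.signal import savgol_filter

def correct_track(raw, window=9, poly=3, lam=5.0):
    """Patch missing samples of a coordinate track, then smooth it.

    raw: 1-D array of one skeletal coordinate over time, with NaN at dropped frames.
    """
    t = np.arange(len(raw))
    valid = ~np.isnan(raw)
    filled = raw.copy()

    # Locally weighted linear regression: fit a weighted line around each gap.
    for ti in t[~valid]:
        w = np.exp(-((t[valid] - ti) ** 2) / (2.0 * lam ** 2))
        A = np.vstack([t[valid], np.ones(valid.sum())]).T
        coef, *_ = np.linalg.lstsq(A * w[:, None], raw[valid] * w, rcond=None)
        filled[ti] = coef[0] * ti + coef[1]

    # Savitzky-Golay smoothing of the completed track.
    return savgol_filter(filled, window_length=window, polyorder=poly)
```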

E. FAST FISHER VECTOR ENCODING (FFV)
FFV transforms the features by their deviation from a generative model, a Gaussian mixture model (GMM), using a sparse matrix representation (sparse filtering [58]). The GMM is used as the probability density function with mixture weights (w), mean vectors (µ), parameters θ, and diagonal covariance matrices (diag(cov)); K denotes the number of Gaussian distributions in the mixture model, which is learned together with the feature vectors, so that θ = {w_k, µ_k, diag(cov_k) : k = 1, ..., K}. To apply FFV (κ) to our features, let λ = {λ_t : t = 1, ..., T} be the set of T local descriptors in Eq. (8); the generative procedure over λ for the whole feature vector is formulated accordingly, and the FFV matrix is obtained from it. Finally, κ is obtained by fusing the partial derivatives with respect to the GMM parameters, where H_θ, 1/v, and ∇_θ log(·) denote the generative model parameters, the normalized values, and the log-likelihood gradient, respectively. The parameters θ are discovered from the training features via the expectation-maximization (EM) strategy. Gradients are computed with respect to the mean vector µ_k and the standard deviation s_k of the k-th Gaussian in Eq. (12). A combination of Fisher vectors and deep neural networks has already been considered [59], but FFV (a GMM with diagonal covariances) has not been considered in a Deep Bi-LSTM for SLR [4], [5], [51], [60]-[62]. Features encoded by FFV are concatenated numerically using three stacked Bi-LSTM layers, as shown in Fig. 8. Each Bi-LSTM layer evaluates the FFV encoding, dimension reduction, spatial stacking, and L2 normalization across the Gaussians and λ.
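To make the encoding step concrete, the snippet below sketches a Fisher-vector style encoding of one frame's local descriptors against a diagonal-covariance GMM, keeping only the mean and standard-deviation gradients followed by power and L2 normalization; scikit-learn's GaussianMixture, the component count, and the omission of the weight gradient are assumptions for this sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def ffv_encode(local_feats, gmm):
    """Fisher-vector style encoding of (T, D) local descriptors with a diagonal GMM."""
    T = local_feats.shape[0]
    q = gmm.predict_proba(local_feats)                 # (T, K) posteriors
    mu, sigma, w = gmm.means_, np.sqrt(gmm.covariances_), gmm.weights_

    diff = (local_feats[:, None, :] - mu[None]) / sigma[None]             # (T, K, D)
    g_mu = (q[:, :, None] * diff).sum(0) / (T * np.sqrt(w)[:, None])      # mean grads
    g_sd = (q[:, :, None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])

    fv = np.concatenate([g_mu.ravel(), g_sd.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))             # power normalization
    return fv / (np.linalg.norm(fv) + 1e-8)            # L2 normalization

# Hypothetical usage: fit the diagonal-covariance GMM on training descriptors first.
# gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(train_feats)
```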

IV. EXPERIMENTAL ANALYSIS AND EVALUATION

A. EXPERIMENT
We evaluate the FFV-Bi-LSTM recognition algorithm using spatio-temporal prosodic and angle features in three cases.
The first, second, and third cases adopt skeletal video sequence recognition with FFV-Bi-LSTM on our proposed dataset, the ASL dataset in [6], and the public data sets [6], [63], [64], respectively. The proposed setup is illustrated in Fig. 10, where a leap motion controller (LMC) is placed at the signer's chest to capture 3D skeletal hand joint information from the backhand view. This enables the natural mobility of the signer. The testing environment is shown in Fig. 12 and the setup values are given in Table 3.

1) Data sets
For our new dataset, we recruited and trained 10 volunteers with normal hearing to perform 57 randomly selected ASL words. Each signer performed all 57 ASL words ten (10) times. We collected 10 pairs of similar ASL words out of the 57 ASL words in the dictionary. The selected words belong to the first 100 most frequently used ASL words in daily life. Some examples from our dataset are given in Fig. 5. The data set is partitioned into training and testing sets using different signers (signer independence). The selected features have undergone various tests to ensure their effectiveness. We further evaluate our method on the semaphoric hand gestures contained in the Shape Retrieval Contest (SHREC) data set [64], the ASL data set [6], and the leap motion dynamic hand gestures (LMDHG) data set [63].
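A signer-independent (leave-one-subject-out) split of such a dataset can be expressed with scikit-learn's LeaveOneGroupOut, as in the sketch below; the array shapes and the signer-id layout are placeholders for illustration, not the recorded data.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Placeholder data: 57 words x 10 repetitions, each a 40-frame FFV sequence of size 256.
X = np.zeros((570, 40, 256))
y = np.repeat(np.arange(57), 10)          # word label of every sequence
groups = np.tile(np.arange(10), 57)       # signer id of every sequence (10 signers)

# Each fold trains on nine signers and tests on the one signer left out.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    pass  # train FFV-Bi-LSTM on X[train_idx], evaluate on X[test_idx]
```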

2) Recognition Phase
Our algorithm calls the InitialTransformWeights name-value pair. The sparse filtering algorithm is implemented in MATLAB using the "sparsefilt" function from the yael package. The algorithm minimizes the sparse filtering objective function [65]-[67]. We selected an average number of GMM components and a small number of iterations for effective video feature encoding, as provided in Algorithm 1. FFV encoding generates synthetic local information for a particular frame, which does not capture possible temporal correlation between two different encoded frames of a sequence. To fully exploit this information, three Bi-LSTM units are chosen; each unit accommodates seven layers connected with dropout layers of 20% (0.2) deactivation and is validated with careful selection of the parameters in Table 4. The total output of this layer is summed and normalized by the softmax layer, as shown in Fig. 8.
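The stacked recurrent part of this design can be sketched as below; PyTorch, the hidden size, and the use of the last time step for classification are assumptions for illustration, since the paper's model is implemented in MATLAB with its own layer configuration.

```python
import torch
import torch.nn as nn

class FFVBiLSTM(nn.Module):
    """Sketch of a three-unit bidirectional LSTM classifier over FFV-encoded frames."""

    def __init__(self, ffv_dim, hidden=128, num_classes=57):
        super().__init__()
        # Three stacked bidirectional LSTM layers with 20% dropout between them.
        self.bilstm = nn.LSTM(input_size=ffv_dim, hidden_size=hidden, num_layers=3,
                              bidirectional=True, batch_first=True, dropout=0.2)
        self.dropout = nn.Dropout(0.2)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                               # x: (batch, frames, ffv_dim)
        out, _ = self.bilstm(x)
        logits = self.fc(self.dropout(out[:, -1, :]))   # last time-step summary
        return torch.softmax(logits, dim=1)             # class probabilities
```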
The output O_ff from Eq. (15) is considered as the probability over a given number of ASL words L. For a given O^E_t containing the L-th sequence from class E_L, the predicted ASL word G is obtained from the normalized O_ff at the softmax layer. ASL word classification is achieved by computing the highest probability score p from Eq. (14). The final layer is obtained from the formulations in Eqs. (14)-(15). The steps of sequential gesture recognition are summarized in detail in Algorithm 2.
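Continuing the sketch above, a hypothetical inference call then picks the ASL word with the highest softmax probability; the shapes are illustrative only.

```python
model = FFVBiLSTM(ffv_dim=256, num_classes=57)
probs = model(torch.randn(1, 40, 256))      # one sequence of 40 FFV-encoded frames
predicted_word = torch.argmax(probs, dim=1)
```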

B. RESULTS
We report the performance results of the FFV-Bi-LSTM algorithm. Overall comparison results between FFV-Bi-LSTM and the method of Avola et al. [6] are shown in Table 5. The average recognition of FFV-Bi-LSTM is illustrated in Table 9 for the 10 pairs of highly correlated ASL words and the 57 randomly selected ASL words. The computational performance of FFV-Bi-LSTM on the proposed data set is depicted in Table 10. To show the effectiveness of the FFV optimization, we extend the tests on spatio-temporal features with and without the FFV optimizations mentioned in subsection III-F, detailed in Tables 11 and 12 for "Bi-LSTM without FFV optimization" and "FFV-Bi-LSTM", respectively. This demonstrates that our adopted algorithm is feasible for ubiquitous applications. We compare the performance accuracy of FFV-Bi-LSTM with some existing state-of-the-art methods; the average recognition accuracy for each is plotted in Figs. 11 and 13 and listed in Tables 5, 6, 7, and 8.

V. DISCUSSION
A Deep Bi-LSTM with 3 units is hard to train because of high abstraction, which leads to low accuracy. In contrast, the Deep FFV-Bi-LSTM offers flexible computing, which leads to an increase of 5% in accuracy. Thus, the Deep FFV-Bi-LSTM outperforms the conventional Deep Bi-LSTM in [6]. The superior model is model number three with four feature vectors, which is chosen for further analysis. The performance evaluation of model 3 using Deep Bi-LSTM and FFV-Bi-LSTM is shown in Tables 11-12. Each word takes about 2 seconds to be trained, while the generalization of the model takes approximately 1 second to test each word per sequence. A standard deviation of 7.091 from the mean is achieved, meaning that each score deviates from the mean by 0.0738 points on average. The accuracy of the algorithm and the proposed data set is further evaluated using leave-one-subject-out cross-validation. The per-class accuracy is 91.002%, with less than 9.0% error, which demonstrates that our algorithm has a high probability of recognizing ASL words with similar characteristics, as detailed in Table 10. Table 9 depicts the recognition performance of leave-one-subject-out cross-validation for the 57 randomly selected ASL words. Therefore, the chosen mathematical model has proven to be a good choice for our idea. It is also shown that the adopted algorithm has relatively poor generalization when recognizing positive results for "Happy", "Cheap", and "Jump". Research findings show that these similar ASL words have similar spatial information and minimal orientation angle variations. One of the major ...
Figure 11: Confusion matrix of the skeletal ASL dataset [6] using the adopted method.
Figure 12: Confusion matrix of correlated ASL words using the adopted method.

VI. CONCLUSION
In this work, we adopted an approach to recognize highly correlated American Sign Language words. We optimize the quality of the recorded 3D video skeletal hand joint information using a WLR algorithm and a smoothing filter. The final information is encoded using FFV for fine-grained recognition, which depends on a few discriminative features. These features prove promising for Deep Bi-LSTM recognition. The second contribution of this article is the design of a new large 3D dynamic hand skeletal ASL data set. We also systematically compare the radius of convergence of our method with that of the method in [6].
Figure 13: Confusion matrix of the entire dataset.
The FFV-Bi-LSTM algorithm fails to learn the small changes in the hand motion trajectory of some similar ASL words, which introduces biases responsible for misclassification. Since several features influence the recognition of similar ASL words, we suggest that similar ASL words be treated as a multi-feature problem in future research.