A Survey on Intelligent Gesture Recognition Techniques

Gesture recognition is an attractive means of interaction because users do not have to touch any surface, which makes it safe and hygienic, especially during the pandemic that is occurring worldwide. However, gesture recognition is not a new discipline: it has been researched for many years, yet this type of interaction has not succeeded in replacing the keyboard and mouse. It is therefore useful to know the advances that artificial intelligence is bringing to gesture recognition in order to achieve more robust and reliable recognition with a low response time. Deep learning is being integrated into various areas to improve performance, and artificial intelligence is one such area. In this way, gesture recognition may in the future become a viable means of daily interaction for the user, and the main objective of this paper is to contribute to that process. For this reason, this study has analyzed 571 papers related to gesture recognition and artificial intelligence. The analysis extracted relevant information on scientific production, such as the most productive authors and journals and the most pertinent articles on the subject. Furthermore, we have developed our own model, which shows the relationship between the types of gesture recognition and the artificial intelligence techniques that have been applied to this task.


I. INTRODUCTION
Gesture recognition is an active area of research in which the user interacts with computerized systems, for example to control a computer [1], a home automation system [2] or an automobile [3]. The field can be classified according to the characteristics or objectives of the recognition. Depending on the part of the body that is recognized, one can differentiate between hand [4], body [5] and face [6] gestures. If the recognition involves a movement, the gesture is dynamic, whereas if it is a pose, the gesture is static [7]. Finally, depending on the method used to recognize the gesture, recognition can be classified as vision-based [8] or sensor-based [9].
The associate editor coordinating the review of this manuscript and approving it for publication was Shovan Barma.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In recent years, a multitude of devices have been created for gesture recognition, such as Microsoft Kinect [10], Leap Motion [11], Intel RealSense [12], Myo [12], Camboard Pico [13] and so on. For instance, Tolentino et al. [14] developed a system to detect emergency situations through gesture recognition. Microsoft Kinect was integrated to obtain the location of the joints in the user's body and geometric information such as the distance or the angle between joints. With these data it is possible to identify 14 gestures, which are the input of a decision tree that decides whether the user is in one of the emergency categories: medical, life-threatening condition, or disaster. In [15], hand gestures (instead of the body gestures of the previous work) are recognized to identify American Sign Language. In this approach, the combination of Leap Motion and Hidden Markov Models (HMM) in a Virtual Reality environment is applied to classify 24 gestures. The Leap Motion device extracts the 3D data of the 11 joints that this sensor is able to recognize, and the HMM classifies the gestures from the sequence of feature vectors that contain the distances and angles between the joints. However, gesture recognition is neither strange nor novel for users, owing to the day-to-day use of mobile phones. Mobile phones are used to check mail, surf the internet, interact with social networks, take photos and even play games.
These mobile devices have become a fundamental part of users' lives because they can execute all the tasks that can be carried out with a computer while being small enough to stay with users 24 hours a day, regardless of location or circumstances. Most importantly, these devices are operated through gestures. From tapping the screen to open an application to dragging a finger to resize an image or change the screen, all these movements are gestures that people make several times a day. Furthermore, these devices are not only used by adults: their presence in the daily routine has also enabled children to use them at an early age and to perform the required gestures in a natural and intuitive way [16].
Therefore, using gestures to replace traditional forms of interaction is not a far-fetched idea, but outside mobile devices it has not been implemented well, mainly because precision and response time are not yet good enough for users to feel confident and comfortable with this type of interaction. Hence, fields of research such as Artificial Intelligence (AI) have been used to improve these parameters and make gesture recognition the predominant means of interaction.
AI consists of the creation of systems or machines whose objective is to perform tasks in the same way that a human would. Its subfields Machine Learning (ML) and Deep Learning (DL) are applied to numerous scientific fields such as Medicine [17], Natural Language Processing [18], Information Retrieval [19], Cybersecurity [20], Computer Vision [21], Internet-of-Things (IoT) [22], and so forth. They have also been applied to gesture recognition to achieve better results, since the performance necessary for general use has not yet been achieved by the robust models that have been created in vision-based gesture recognition. In [23], a new Convolutional Neural Network (CNN) is designed to recognize 3D hand gestures from the location of the hand joints without using the depth distance. The novelty of this network lies in its parallel processing and the use of residual connections for each signal. This approach reaches 91.28% accuracy on the DHG dataset, which contains a collection of dynamic gestures. deepGesture [24] is a deep learning framework that combines convolutional and gated recurrent unit (GRU) layers, feeding the recurrent neural network (RNN) with data from a gyroscope and an accelerometer. The methodology consists of four parts: the input layer that obtains the data from the sensors, the convolutional layers that extract the features, the GRU layers that process the sequential information, and the fully connected layer that calculates the result. This method is used because it is faster and needs less data than other methods such as long short-term memory (LSTM). In [25], a 3D ResNet extracts spatiotemporal features from the dataset and is combined with a memory module to perform one-shot learning gesture recognition.
Furthermore, another contribution of this work is the creation of a dataset with 3045 videos of hand gestures, since it is sometimes difficult to find a dataset that fits the needs of the problem under study. In [26], a model to recognize hand gestures with forearm electromyography (EMG) is described. The EMG signals produced by the Myo armband are the input of the system, and the k-nearest neighbor and dynamic time warping (DTW) algorithms are involved in the classification process. The goal of the k-nearest neighbor algorithm is to estimate the conditional probabilities in the feature matrix, while DTW computes the distance function using the Manhattan distance. The particularity of this work is that the authors affirm that the model is capable of learning any hand gesture through training.
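As an illustration of how DTW and k-nearest neighbor can be combined to classify sequences, the following minimal Python sketch aligns two sequences with DTW using the Manhattan distance and labels a query by majority vote among its k nearest training sequences. It is not the implementation of [26]: the function names, the toy two-channel sequences and the majority-vote rule (used here in place of a conditional probability estimate) are assumptions made for the example.

```python
from collections import Counter


def manhattan(a, b):
    """L1 distance between two feature vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dtw_distance(seq_a, seq_b):
    """Dynamic time warping cost between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] holds the best alignment cost of seq_a[:i] with seq_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = manhattan(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def knn_classify(query, training_set, k=3):
    """Majority vote among the k training sequences closest to the query under DTW."""
    ranked = sorted(training_set, key=lambda item: dtw_distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy two-channel sequences; the values and labels are invented for illustration.
training_set = [
    ([(0, 0), (1, 0), (2, 0)], "fist"),
    ([(0, 0), (1, 1), (2, 0), (2, 0)], "fist"),
    ([(0, 0), (0, 1), (0, 2)], "wave"),
    ([(0, 0), (0, 1), (0, 1), (0, 2)], "wave"),
]
query = [(0, 0), (1, 0), (1, 0), (2, 0)]
predicted = knn_classify(query, training_set, k=3)
```

Because DTW allows one sample to be matched with several samples of the other sequence, the query above aligns with the first training sequence at zero cost even though the two have different lengths.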
Making this form of interaction more robust and reliable will become a necessity as cities turn into Smart Cities, which open a world of possibilities and facilities to the user together with an immense volume of data to handle. Among these possibilities, users can interact with sensors on the street, in a smart building [27] or in a smart classroom [28]. In fact, Ma et al. [29] have developed an algorithm that recognizes the gestures of police officers controlling traffic using a spatiotemporal convolutional neural network (ST-CNN). This proposal has been created in a virtual geographic interactive environment and could later be applied to a real environment in a smart city.
Finally, the motivation for this study is that we wanted to develop a method based on natural interaction to control a computer system, since it would be a more natural and faster means of interaction for the user, with the additional advantage that it could also be used by people with special needs [30]. Among all the natural interaction modes that exist (gesture recognition, speech recognition, brain-computer interface, and so on), gesture recognition was preferred because it is intuitive, it does not interrupt the user's activity [31] and it does not require any additional devices, since the command is performed only through the movement of a part of the body [32]. Moreover, the participants of the study in [33] systematically preferred natural interaction input over mouse input and performed better on parameters such as speed and accuracy. AI techniques are known to have obtained satisfactory results in the area of gesture recognition [34]. For this reason, there was a need to conduct a study on AI techniques applied to this type of interaction.
The contributions of the present study can be summarized as follows:
• The analysis of relevant papers from 2000 to 2020 to extract the most significant indicators.
• Numerous data related to scientific production such as the number of authors, journals, and institutions, among others, have been analyzed.
• The elaboration of our own model that expresses the relationship between ML and DL techniques and types of gesture recognition.
• The authors are not aware of the existence of a paper with the same characteristics or that analyzes the same parameters that address this issue.
The remainder of the paper is organized as follows: Section II describes the fundamental concepts of gesture recognition and AI and the most recent works related to these topics. Section III presents the methodology of Gesture Recognition. Section IV shows the results obtained through carrying out an evaluation and analyzing these results. Section V summarizes the conclusions and discusses future work.

II. BACKGROUND
In this section, the main topics of this study will be described, which are gesture recognition and AI.

A. GESTURE RECOGNITION
Gestures are a necessary means of communication between people and are defined in the category of non-verbal communication. This fact has led to this type of communication being transferred to interaction with computers, also because it is a more natural and faster form of communication. However, gestures can be performed with different parts of the body and these are mainly categorized into facial gestures, hand gestures, and body gestures.

1) FACIAL EXPRESSIONS
Facial expressions are capable of conveying a person's emotions during the communication process. The facial expression recognition process consists of three stages: preprocessing, feature extraction, and classification [35].
• Preprocessing: In this stage, different transformations are applied to the frames to improve the quality of the image. The most used techniques are cropping and scaling of the face image, normalization and histogram equalization to correct the illumination, and the Viola-Jones algorithm to determine the size and location of the face within the image.
• Feature extraction: The objective of this stage is to identify the most relevant characteristics/descriptors that allow defining a face. Among the techniques to be highlighted are:
- Supervised Descent Method, which extracts the main positions of the face and estimates the distance between its different elements.
- Line Edge Map, which is used to extract useful information from the edge representation of the image and is widely applied in face recognition because it detects lines as features from a face edge map.
- Principal Component Analysis, which is applied to extract fundamentally global and low-dimensional features.
• Classification: The most demanded classification techniques for facial expressions, as can be seen later in this work, are Support Vector Machine and Convolutional Neural Networks.
The first step is the identification of the central focus of this study (NUI -Natural User Interaction), in order to show the information relating to the scientific output (continent analysis) and the analysis of the co-occurrence of keywords in this field of research (content analysis). The second step is the selection of the database. Bearing in mind that the results of the analysis may vary dep
In this study, facial expressions were detected in 64% of the analyzed papers. Of the papers that recognize facial expressions, 89% apply DL techniques, of which CNN accounts for 85.1%, LSTM 4.1%, RNN 3.4% and other techniques 7.4%. In addition, 42% of the papers use ML techniques (the percentages of DL and ML techniques add up to more than 100% because some investigations did not use one type of technique exclusively; this also applies to the remaining recognition types), of which SVM accounts for 49.6%, K-Means and K-Nearest Neighbors 14.4%, Artificial Neural Networks 11.2% and other techniques 24.8%.
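The illumination correction mentioned in the preprocessing stage can be illustrated with histogram equalization. The following Python sketch is a minimal version that assumes a non-empty 8-bit grayscale image stored as a list of rows; the function name and the toy image are hypothetical and are not taken from any of the surveyed works.

```python
def equalize_histogram(image, levels=256):
    """Spread the gray levels of a 2D image (list of rows of ints) over the full range."""
    pixels = [p for row in image for p in row]

    # Histogram of gray levels.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution of the histogram.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = min(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:
        # Single gray level: nothing to equalize.
        return [list(row) for row in image]

    # Lookup table that maps each original level to its equalized level.
    scale = (levels - 1) / (total - cdf_min)
    lut = [round(max(c - cdf_min, 0) * scale) for c in cdf]
    return [[lut[p] for p in row] for row in image]


# Toy low-contrast "face" crop whose values lie between 100 and 120.
dark_face = [[100, 100, 110], [110, 120, 120]]
equalized = equalize_histogram(dark_face)
```

After equalization the darkest pixels map to 0 and the brightest to 255, so a classifier trained on such images depends less on the lighting of each capture.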

2) HAND GESTURES
The hands are the part of the body that is most frequently related to the recognition of gestures since we use them, for example, to greet another person. In the recognition of hand gestures, there are two general perspectives: recognition based on vision and recognition based on sensors [36].
• Recognition based on vision: This type of recognition is the most used because it is the least invasive for the user, since it relies exclusively on one or more cameras. Two types of vision-based recognition are usually distinguished: 3D model-based and appearance-based. 3D model-based approaches work with depth to obtain the 3D position of different points of the hand and track those points to detect gestures from their position relative to other points taken as reference. Appearance-based approaches, on the other hand, use various properties of the hand to recognize gestures. These properties can be the color of the hand, to detect movement through markers, or geometric characteristics such as perimeter, convexity, or elongation.
• Recognition based on sensors: This approach makes use of sensors that are usually integrated into a glove and detect the position of the hand at all times. These sensors can be of different types: mechanical, magnetic, and ultrasonic, among others. The most used technique to recognize hand gestures based on sensors is HMM.
In this study, it has been determined that 18% of the analyzed papers are related to hand gesture recognition. Of the papers that recognize hand gestures, 95% apply DL techniques, of which CNN accounts for 75%, RNN for 12.5%, LSTM for 7.5% and other techniques for 5%. In addition, 44% of the papers use ML techniques, of which SVM accounts for 45.9%, PCA for 13.5%, Artificial Neural Networks and Decision Trees for 10.8% each, and other techniques for 29.8%.
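The geometric characteristics used by appearance-based approaches (perimeter, convexity, elongation) can be computed from the contour of the hand. The sketch below is a minimal Python illustration that assumes the contour is already available as an ordered list of (x, y) points. The definitions chosen here are assumptions, since the literature uses several variants: convexity is taken as the ratio between the convex hull perimeter and the contour perimeter, and elongation as one minus the aspect ratio of the axis-aligned bounding box.

```python
import math


def perimeter(points):
    """Length of the closed polygon defined by the ordered points."""
    n = len(points)
    return sum(math.dist(points[i], points[(i + 1) % n]) for i in range(n))


def convex_hull(points):
    """Convex hull in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def contour_features(points):
    """Perimeter, convexity and elongation of a contour (definitions assumed above)."""
    contour_len = perimeter(points)
    hull_len = perimeter(convex_hull(points))
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    longest = max(width, height)
    return {
        "perimeter": contour_len,
        "convexity": hull_len / contour_len if contour_len else 1.0,
        "elongation": 1.0 - min(width, height) / longest if longest else 0.0,
    }


# Toy contours: a closed fist (square) and a shape with a gap between two "fingers".
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
notched = [(0, 0), (4, 0), (4, 4), (2, 1), (0, 4)]
fist_features = contour_features(square)
open_features = contour_features(notched)
```

A convex contour yields a convexity of 1, whereas gaps between fingers lengthen the contour without changing the hull and therefore push the value below 1.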

3) BODY GESTURES
Facial gestures and hand gestures are usually the most used, but there is another type of gesture that should also be highlighted: body gestures. Body gestures refer to movements made with large muscle groups and are called gross motor skills. Unlike hand gestures, which are more precise and short-range movements, body gestures tend to be more indeterminate and long-distance gestures.
In the process of recognizing these gestures, it is important to take into account the type of model chosen to make the abstraction of the body. In general, two types of models are used: part-based models and kinematic models [37].
Body pose recognition involves three steps: background subtraction, body pose detection, and tracking. Body pose detection and body pose tracking are the most relevant steps for the topic of this study. In body pose detection, the posture of the human being is analyzed and the orientation of the different parts of the body is identified. Currently, ML and DL techniques are mainly applied to this task: SVMs are used to classify body positions from features extracted from a dataset [38], CNNs rely on 2D features to detect the body [39], and LSTMs are used especially for detection on videos, since this type of network takes the time variable into account [40]. In body pose tracking, on the other hand, a series of parameters such as the position or the shape is estimated in each frame. Techniques such as Expectation Maximization and optical flow [41] are used to detect the human body in a frame from the pixels located in different parts of the body or on its contour. In addition, tools like MediaPipe [42] make body pose tracking easier to implement.
In this study, it has been determined that 13% of the papers analyzed are related to body gesture recognition, while the remaining 5% correspond to other gestures.
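Once a pose estimator returns the coordinates of the body joints, simple geometric features such as the distance and the angle between joints can describe the orientation of each body part. The following Python sketch shows one way to compute them. The joint names and coordinates are hypothetical, and the code does not call any pose estimation library.

```python
import math


def joint_distance(a, b):
    """Euclidean distance between two joints given as coordinate tuples."""
    return math.dist(a, b)


def joint_angle(a, b, c):
    """Angle in degrees at joint b, formed by the segments b->a and b->c."""
    ba = [x - y for x, y in zip(a, b)]
    bc = [x - y for x, y in zip(c, b)]
    norm = math.hypot(*ba) * math.hypot(*bc)
    if norm == 0:
        raise ValueError("joints must not coincide")
    cosine = sum(x * y for x, y in zip(ba, bc)) / norm
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


# Hypothetical 3D keypoints of one arm, in arbitrary units.
shoulder = (0.0, 0.0, 0.0)
elbow = (1.0, 0.0, 0.0)
wrist = (1.0, 1.0, 0.0)

upper_arm_length = joint_distance(shoulder, elbow)
elbow_angle = joint_angle(shoulder, elbow, wrist)
```

A vector of such distances and angles, computed per frame, is a compact input for classifiers such as the SVMs or LSTMs mentioned in this section.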

4) APPLICATIONS
The main applications associated with gesture recognition are shown in Figure 1. The identified applications were classified as combined, those that are present in most of the main types of recognition (facial, hands, or other body parts), and specific, those exclusive to each type of recognition. According to the analysis of the documents reviewed, facial recognition is, together with hand recognition, the technique with the most applications. The recognition of sign language is the most complex application identified, since it is a visual language that draws on all the main types of recognition through fingerspelling and facial and body gestures in different languages, such as Arabic [43], American [44], Chinese [45], Korean [46] and Indian [39]. Due to the difficulty of sign language recognition, device-based studies have been conducted to improve accuracy [47], [48], [49]. Speech recognition [50], [51], however, is needed for both language and speaker identification. Finally, recognition of Indian dance performance needs both body pose recognition and hand gesture recognition [52], [53].
Within the specific applications, identification and estimation stand out; they are based on the three types of recognition studied but are applied individually. Identification through facial recognition is used to identify expressions [54], micro-expressions [55], emotions [56] and negative expressions [57]. Hand recognition is used to identify weapons [58], [59] and people [60]. Finally, recognition focused on the rest of the body parts can identify skeletal posture [61], neuronal illness or social actions [62], such as social touch [63]. The estimation of age and/or gender can be done through facial recognition [64], [65] or through hand gestures [66], [67]. Applications estimating ethnicity [68], beauty [69] and body constitution [70] were also detected, exclusively through facial recognition.
Finally, applications that exclusively use a single type of recognition were found. Facial recognition is used to identify capabilities such as attendance [71] or engagement [72], or to detect alterations such as palsy grading [73] or plastic surgery [74]. In hand gesture recognition, handwriting applications (scripts, characters, and signatures) stand out above the rest [75], [76], [77]. Man-machine interaction and control applications also use hand gesture recognition, for example interaction in video games and virtual worlds [78] and control of electronic devices [79], specifically prostheses [80].

B. ARTIFICIAL INTELLIGENCE
AI is defined as a discipline that seeks to imitate human behavior, such as making decisions or differentiating between objects. The most generic types of AI that are usually distinguished are weak AI and Artificial General Intelligence (AGI) [81]. Weak AI refers to the resolution of specific problems, as when a machine beat the chess expert Kasparov [82]; such an algorithm is dedicated only to solving that task and cannot solve other tasks. AGI, in contrast, would be more similar to how humans act: this type of intelligence would be more adaptable and flexible, so it could face situations different from those for which it was created.
AI is used in education, where robots have been created to enhance student learning [83]; in agriculture, to warn about adverse environmental situations that could ruin the harvest and to improve agricultural yields [84]; and in health, with virtual assistants that can make a diagnosis based on the patient's symptoms and thus serve a greater number of patients [85].

1) MACHINE LEARNING
ML is a branch of AI that allows machines to learn through statistical calculations. This discipline can be found today in everyday applications for users such as Netflix, Siri, or Gmail automated responses. The types of ML are supervised learning, unsupervised learning, and reinforcement learning.
• Supervised learning: in this type of learning, previous information is provided that helps in the predictions made by the system. An example would be if you want to classify animals, the input images of the algorithm will be labeled. In this way, the image can be associated in advance with the name of the animal.
• Unsupervised learning: in this case, no prior information is available to help associate the images, so the algorithm has to find patterns that allow it to organize them. An example would be to provide the algorithm with a series of flower images and have it group them into categories. The algorithm identifies the characteristics and places elements with similar characteristics in the same category; this technique is known as clustering.
• Reinforcement learning: This model learns through trial and error, where correct actions will be reinforced to successfully complete the task. An example of this type of learning is autonomous car navigation systems.
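The clustering idea described for unsupervised learning can be made concrete with k-means, one of the simplest clustering algorithms. The sketch below is a minimal Python version under stated assumptions: each flower is reduced to two invented measurements, and the initial centroids are picked by hand instead of at random so that the result is reproducible.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, initial_centroids, iterations=10):
    """Alternate between assigning points to centroids and recomputing the centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        updated = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                # New centroid is the mean of the points assigned to it.
                updated.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                updated.append(old)  # keep an empty cluster where it was
        if updated == centroids:
            break  # assignments no longer change
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Invented (petal length, petal width) measurements for six unlabeled flowers.
flowers = [(1.0, 0.2), (1.2, 0.3), (1.1, 0.2), (5.0, 1.8), (5.2, 2.0), (4.9, 1.9)]
centroids, labels = kmeans(flowers, initial_centroids=[flowers[0], flowers[3]])
```

No label is given to the algorithm; the two groups emerge only from the similarity of the measurements, which is the difference from the supervised case.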

2) DEEP LEARNING
Deep Learning has the objective of developing algorithms that resemble the functioning of the brain. The discipline is so called because it uses deep neural networks, which are made up of many layers and have provided better results than other techniques, although they also consume a lot of resources. These layers act as filters, going from the most general elements to the most specific, and can iterate over the data as many times as necessary to analyze it and draw conclusions. Therefore, these neural networks can identify patterns and classify different types of information.
The types of neural networks that are usually used in DL are:
• Convolutional Neural Network (CNN): These neural networks were designed for the processing of structured matrices such as images. This means that they can classify images based on patterns that appear in them, such as lines or circles. These neural networks use several convolutional layers that can be stacked on top of each other and thus recognize more complex shapes.
• Recurrent Neural Networks (RNN): This architecture uses sequential data or time series data. The difference between this neural network and the others is that it has an ''artificial memory'' generated by the fact that the neurons that make up its architecture receive the input from the previous layer which, together with its own output from the previous time instant, generates the output of this neuron. This operation allows you to make future predictions from historical data.
• Generative Adversarial Network (GAN): The philosophy of this technique is to pit two artificial neural networks against each other in order to create new content or synthetic data. One of the networks works as a generator and the other as a discriminator. The discriminating network is trained to recognize real content and rejects the other network's output, which pushes the generator to produce content that looks real.
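Two of the mechanisms described in the list above can be shown in a few lines of Python: a convolution that responds to a simple pattern (a vertical edge) and a recurrent unit whose state depends on both the current input and its own previous output. This is a didactic sketch with hand-picked weights and no training; the kernel, the scalar weights and the function names are assumptions chosen for the example.

```python
import math


def conv2d(image, kernel):
    """Valid 2D cross-correlation, the operation computed by a convolutional layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def rnn_step(x, h_prev, w_in, w_rec):
    """Scalar recurrent unit: combines the current input with the previous state."""
    return math.tanh(w_in * x + w_rec * h_prev)


def run_rnn(sequence, w_in=1.0, w_rec=0.5):
    """Feed a sequence through the unit and return the state after every step."""
    h, states = 0.0, []
    for x in sequence:
        h = rnn_step(x, h, w_in, w_rec)
        states.append(h)
    return states


# 4x4 image with a dark left half and a bright right half.
image = [[0, 0, 1, 1] for _ in range(4)]
vertical_edge = [[-1, 1], [-1, 1]]
feature_map = conv2d(image, vertical_edge)

# A single impulse followed by silence: the state keeps a trace of the impulse.
states = run_rnn([1.0, 0.0, 0.0])
```

The feature map is non-zero only in the column where the brightness changes, and the recurrent state stays positive after the input has returned to zero, which is the "artificial memory" mentioned for RNNs.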

3) TRANSFER LEARNING
TL is a technique that has provided numerous advantages in general and also to gesture recognition. The main ones are that a task can be performed with much less data than with the traditional procedure; training is faster, because only a part of the model has to be trained rather than the complete model; and the model usually learns faster because it has previously been trained on a similar task [86], [87], [88]. In this study, the occurrences of TL in the analyzed papers have been determined: the technique has been used mostly in DL, especially with CNN, which accounts for 89% of all papers that have applied TL, while RNN reaches 4%. Among the most relevant studies of TL applied to gesture recognition is the framework in [89], which identifies a series of gestures through a siamese architecture based on recurrent convolutional networks. In this architecture, the spatial characteristics are first learned by a CNN and the temporal characteristics are then learned by a recurrent neural network. The results of the experiments support the efficacy of this method, which obtains 89.5% precision. Another relevant work is WiADG [90], a gesture recognition system that works independently of any device, since it uses the WiFi signal to perform the recognition. This system has the particularity that it can identify gestures in a dynamic environment by using supervised adversarial domain adaptation, through the IoT devices found in that environment. These devices are the source of the channel state information that is needed to analyze and detect the gestures made by the user. The results show that this system achieves an accuracy of 98% in the gesture recognition process.
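The idea of training only a part of the model can be sketched as follows: a feature extractor whose weights are kept frozen, standing in for a network pre-trained on a related task, and a small logistic-regression head that is the only part updated on the new data. This Python example is a toy under stated assumptions; the frozen weights, the four training samples and the hyperparameters are invented and do not reproduce any of the cited systems.

```python
import math

# Stand-in for layers pre-trained on a related task; these weights are never updated.
FROZEN_WEIGHTS = ((1.0, -1.0), (0.5, 0.5))


def frozen_features(x):
    """Fixed linear layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, epochs=200, lr=0.5):
    """Train only the classification head with stochastic gradient descent."""
    feats = [frozen_features(x) for x in samples]  # computed once, extractor is fixed
    w, b = [0.0] * len(feats[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b


def predict(x, w, b):
    f = frozen_features(x)
    return 1 if sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) >= 0.5 else 0


# Four invented samples of the new task (two per gesture class).
samples = [(2.0, 0.0), (3.0, 1.0), (0.0, 2.0), (1.0, 3.0)]
labels = [1, 1, 0, 0]
head_w, head_b = train_head(samples, labels)
predictions = [predict(x, head_w, head_b) for x in samples]
```

Only three parameters (two weights and a bias) are learned here, which illustrates why TL can work with few samples and short training times when the frozen features already separate the classes.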

4) APPLICATIONS
In this section, the most recent studies on ML and DL will be described. Regarding ML, a framework is presented in [91] that detects whether a person has COVID-19 from the answers to a series of questions. In this way, hospitals are less saturated and a larger part of the population can be treated. The ML part of this framework is a gradient-boosting model, which has been trained with a total of 51,831 records of patients with and without COVID-19. The work in [92] aims to predict the concentration of air pollution in smart cities. To carry out this task, the Multi-Layer Perceptron and Random Forest algorithms have been used to analyze the air data from several smart cities. The experiments determined that Random Forest obtained a higher precision than Multi-Layer Perceptron. A topic that is currently booming is Quantum Computing, and this field has also entered AI under the name of Quantum Machine Learning. For example, in [93] the authors propose a methodology to evaluate the advantages offered by quantum computing in learning tasks. A set of experiments is carried out with classical and quantum models to determine in which situations Quantum Machine Learning can offer an advantage over traditional models.
As for DL, the goal in [94] is IoT image identification. To this end, a series of images is extracted from IoT sensors and passed to the Principal Component Analysis (PCA) technique, which is in charge of extracting features. A CNN is then trained on these features to recognize images in the context of IoT. In women, breast cancer is the second most common type of cancer, and for this reason a DL-based methodology has been developed in [95] to improve its diagnosis. The diagnosis is made with a CNN, which extracts characteristics from multiparametric magnetic resonance imaging data; a support vector machine (SVM) is then trained with these characteristics to detect whether the tumor is malignant or benign. In [96], a DL algorithm has been integrated to recognize human emotions from electroencephalogram (EEG) signals. The process is divided into several stages. The EEG signal is first broken down into rhythms. Next, higher-order statistics are used to analyze the signal in a higher-dimensional space. This, however, generates redundant information, which requires a dimensionality reduction technique. Once the redundant information has been eliminated, the normalized attributes are fed to a long short-term memory recurrent neural network, which discriminates the signals and thus classifies the emotions.

C. GESTURE RECOGNITION AND ARTIFICIAL INTELLIGENCE
In this section, unlike the previous ones, works that have combined gesture recognition with AI will be presented. The aspects considered most outstanding will be described: the most significant approaches, the original and innovative architectures that have been created recently, and the cases that have stood out for their reliability.

1) RELEVANT APPROACHES
In terms of approaches, 24% correspond to hybrid models, which are widely used because they aim to make the most of the advantages of the methods involved and thus achieve greater performance and accuracy; they are also more flexible and robust than non-hybrid approaches [97]. The most developed hybrid approach is the combination of SVM and CNN, which accounts for 48% of these hybrid models. An example is HandSense [98], a system that recognizes dynamic hand gestures and relies on several 3D CNNs that extract spatio-temporal characteristics so that an SVM can recognize the different gestures from these characteristics. In [99], dynamic gestures are recognized with a 3D separable convolutional neural network, a compact model that can be used in augmented reality glasses. However, the separation made during the 3D convolution process caused gradient problems, which the authors solved with skip connections and a layer-wise learning rate. This approach has proven more effective than other 3D CNN models with similar characteristics trained for the classification of dynamic hand gestures.
However, CNN is the most widely used technique, with a percentage of 56%, and it has been applied mainly to recognize facial expressions, as in [100], where images are pre-processed before being fed to a CNN that learns and detects facial expressions. The lack of enough data to train the CNN was an issue in this process, which the authors solved through data augmentation. RNN is another of the most used techniques, although its percentage (11%) is lower than that of CNN; an example is [101], whose purpose is the recognition of certain actions by applying an RNN. The RNN receives a series of time-of-flight measurements and obtains a prediction by analyzing the complete temporal sequence. This system recognizes the following actions with high precision: walking forward, walking in reverse, sitting down, standing up, and waving a hand.
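Data augmentation enlarges a small training set by adding transformed copies of each image that keep the same label. The Python sketch below shows two common transformations, a horizontal flip and a one-pixel shift, on images stored as lists of rows. The choice of transformations and the toy image are assumptions for illustration; the specific augmentations used in [100] are not described here.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def shift_right(image, pixels, fill=0):
    """Shift the image to the right, padding with `fill` (assumes 0 <= pixels <= width)."""
    return [[fill] * pixels + row[: len(row) - pixels] for row in image]


def augment(dataset):
    """Return every original sample plus a flipped and a shifted copy with the same label."""
    augmented = []
    for image, label in dataset:
        augmented.append((image, label))
        augmented.append((flip_horizontal(image), label))
        augmented.append((shift_right(image, 1), label))
    return augmented


# One toy 2x3 "image" labeled with a facial expression.
dataset = [([[1, 2, 3], [4, 5, 6]], "happy")]
augmented = augment(dataset)
```

Each original sample produces three training samples, so the network sees the same expression under slightly different conditions without any new recordings being collected.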

2) NOVEL PROPOSALS
Although these are the most popular approaches, new proposals such as the pyramidal 3DCNN [102] or the capsule network [103] should also be highlighted. The pyramidal 3DCNN architecture is composed of a pyramid input, pyramid fusion, multi-modality fusion, and two 3DCNNs. Because the videos received as input have different lengths, the pyramid input stage segments each of them and uses uniform sampling with temporal jitter to build the pyramid input. This input then feeds the pyramid fusion phase, which joins the features of the pyramid input. Finally, because the networks were trained on RGB and depth data independently, the multi-modality fusion merges both modalities to enable gesture recognition. The capsule network is made up of 5 convolutional layers, the primary capsule layers, which are made up of 3872 capsules of 8 dimensions, and a layer named GestureCaps that contains 5 capsules of 16 dimensions. Each capsule is a set of neurons and has an activation probability and a pose matrix. The 5 convolutional layers are used to reduce the dimension, and the capsule layers then apply a squashing function to obtain the multidimensional output of each capsule. This output feeds the GestureCaps layer, whose output in turn determines the decision of the output layer, where softmax is applied to obtain the corresponding label in the gesture classification.
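To make the role of the squashing function more concrete, the following minimal sketch implements the form commonly used in capsule networks, which shrinks short vectors towards zero and scales long vectors to a length just below 1 while keeping their direction. It is an assumption that [103] uses exactly this form; the function names and the small constant `eps` are illustrative only.

```python
import math


def squash(s, eps=1e-9):
    """Rescale vector s so that its length lies in [0, 1), keeping its direction.

    Commonly used form: v = (|s|^2 / (1 + |s|^2)) * (s / |s|).
    eps is a hypothetical guard against division by zero.
    """
    norm_sq = sum(x * x for x in s)
    norm = math.sqrt(norm_sq)
    scale = norm_sq / (1.0 + norm_sq) / (norm + eps)
    return [scale * x for x in s]


def length(v):
    """Euclidean length, read as the activation probability of a capsule."""
    return math.sqrt(sum(x * x for x in v))


weak = squash([0.1, 0.0, 0.0])     # short input vector: output length near 0
strong = squash([10.0, 0.0, 0.0])  # long input vector: output length near 1
print(length(weak), length(strong))
```

Under this reading, the length of each squashed vector can be interpreted as the probability that the entity represented by the capsule is present.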

3) RELIABLE CASES
It is important to learn about the most novel approaches, but they are not useful if they are not reliable. For this reason, a few of the most dependable cases are described here. In [104], facial expressions and body gestures are recognized using a model that combines CNN, long short-term memory, and principal component analysis (PCA) to extract spatio-temporal features. This proposal reached an accuracy of 99.57% on the face and body dataset (FABO) [105]. In [106], a hybrid model combines the features extracted by the scale-invariant feature transform (SIFT) with CNN features to recognize facial expressions. This combination achieved a precision of 94.82% on the Cohn-Kanade (CK+) database [107]. In [108], a total of 27 hand gestures were recorded with a linear optical sensor. Three types of gesture representation were used as input data for the RNN that classifies the hand gestures: raw data, simple features, and high-level features. Of these three types of representation, raw data obtained the highest hit rate, and the average accuracy was 96.89%. The work in [109] addresses static gesture recognition applied to sign language. An improved convolutional neural network (ICNN) was developed, and dropout and L2 regularization were applied to avoid overfitting. The effectiveness of this method for sign language recognition is reflected in the classification accuracy obtained, 99.96%.

III. MATERIALS AND METHODS
The identification of the key elements of the research field of gesture recognition and AI through machine learning and deep learning tools follows the bibliometric analysis technique that shows relevant information about authors, institutions, documents and keywords [110], [111]. This bibliometric study is a five-step process:
1) Definition of the research field.
2) Selection of the database.
3) Setting of research criteria.
4) Coding of the retrieved material.
5) Examination of the information.
In this way, the process becomes clearer and could be reproducible (see Figure 2).

VOLUME 10, 2022
The first step is the identification of the central focus of this study (NUI, Natural User Interaction), in order to show the information relating to the scientific output (continent analysis) and the analysis of the co-occurrence of keywords in this field of research (content analysis). The second step is the selection of the database. Bearing in mind that the results of the analysis may vary depending on the database selected, and in line with [112], this study uses the two most widely used bibliometric data sources, namely Web of Science (WoS, produced by Clarivate Analytics) and Scopus (created by Elsevier). Although the Google search engine could offer additional coverage beyond WoS and Scopus, it has certain problems. First, it lists a large number of non-academic sources, including grey literature that is not peer-reviewed [113]. Second, the search algorithm is not reproducible, as the results are displayed on the basis of previous searches and interactions [114]. Third, it is difficult to use for large-scale analysis [115]. These limitations have deterred us from including it in our analysis.
Once the databases have been selected, the next step is the adjustment of the research criteria. At this stage, the research criteria are set with Boolean operators to obtain an accurate search and to facilitate the capture of a large amount of data. The parameters used to retrieve the search were: TITLE-ABS-KEY ("gestur* recognition" OR "hand* recognition" OR "bod* recognition" OR "fac* recognition" OR "hand* gestur*" OR "bod* gestur*" OR "fac* gestur*" OR "gestur* interact*" OR "gestur* based interact*" OR "hand* interact" OR "bod* interact*" OR "fac* interact*" OR "gestur* detect*" OR "hand* detect*" OR "bod* detect*" OR "fac* detect*" OR "gestur* model*" OR "gestur* classif*") AND ("machine learning" OR "support vector machine" OR "decision tree" OR "k-nearest neighbor" OR "naive bayes" OR "random forest" OR "hidden markov model" OR "dynamic time warping" OR "bayesian network" OR "k-means" OR "artificial neural network" OR "deep learning" OR "convolutional neural network" OR "recurrent neural network" OR "long short-term memory network" OR "deep belief Network") from the title, abstract and keywords. The search was limited to the period 2000-2020. The first paper on this topic is the article by Ng C.W. and Ranganath S. entitled "Gesture recognition via pose classification", published in 2000. This paper describes how a gesture recognition system can be trained by estimating hand poses [116]. This strategy, which includes hand poses in the training of the system, reduces training and recognition times, making it possible to apply the system in real time. These authors were not only the pioneers in combining both fields of research in a single article; in 2002 they also published "Hand pose training in real time" and, in the same year, "Real-time gesture recognition system and application", in which they used hand poses to create a vision-based system that allowed objects in an interface to be controlled by gestures with a 91.9% hit rate [117].
The search in both databases (Scopus and Web of Science) was carried out at the end of September 2020. In terms of inclusion and exclusion criteria, only articles, books and book chapters were considered, including open access documents [118]. Table 1 summarizes and quantifies the content variables that will be analyzed in more depth later on. The number of documents found exclusively in WoS was 213, and 221 were found exclusively in Scopus, while 137 were common to both. Therefore, the final sample consisted of 571 documents (Figure 3).
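The Boolean search string given above has a regular structure: one OR-group of gesture-related terms combined with one OR-group of AI techniques. As an illustration only, the sketch below assembles such a string from two term lists. The helper `or_group` is hypothetical, only a subset of the terms is reproduced, and the exact field syntax accepted by each database may differ from what is shown here.

```python
def or_group(terms):
    """Quote each phrase and join the phrases with OR inside parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


# Subset of the terms listed in the text, for illustration only.
gesture_terms = ["gestur* recognition", "hand* gestur*", "fac* detect*"]
technique_terms = ["machine learning", "deep learning", "convolutional neural network"]

# Field code followed by the two OR-groups joined with AND, as in the text.
query = "TITLE-ABS-KEY " + or_group(gesture_terms) + " AND " + or_group(technique_terms)
print(query)
```

Keeping the term lists as data makes it easier to rerun or extend the search, which supports the reproducibility goal of the five-step process.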
The fourth step is the coding of the retrieved material, which was downloaded in CSV format and coded using Excel (version 2013) and VOSviewer (version 1.6.9). The data were pre-processed for further analysis. First, documents duplicated across the two databases were removed. Second, the abstract and title of each document were checked to ensure that they met the search criteria. Third, documents with missing information were corrected.
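The duplicate-removal step can be sketched as follows. This is not the procedure actually used (the coding was done with Excel and VOSviewer); it is a minimal illustration that assumes records are matched by a normalized title, and the record layout and synthetic titles are hypothetical. The synthetic sizes reproduce the reported counts: 213 documents only in WoS, 221 only in Scopus, and 137 in both.

```python
def normalise(title):
    """Lowercase and keep only letters and digits, so that small
    formatting differences do not hide duplicates."""
    return "".join(ch for ch in title.lower() if ch.isalnum())


def merge_records(wos, scopus):
    """Merge two exports, keeping one record per normalized title."""
    merged = {}
    for record in wos + scopus:
        merged.setdefault(normalise(record["title"]), record)
    return list(merged.values())


# Synthetic exports sized like the counts reported in the text.
wos_only = [{"title": f"WoS paper {i}"} for i in range(213)]
scopus_only = [{"title": f"Scopus paper {i}"} for i in range(221)]
shared = [{"title": f"Shared paper {i}"} for i in range(137)]

wos_export = wos_only + shared
# The same shared papers, with different capitalization in the second export.
scopus_export = scopus_only + [{"title": r["title"].upper()} for r in shared]

merged = merge_records(wos_export, scopus_export)
print(len(wos_export), len(scopus_export), len(merged))  # 350 358 571
```

The arithmetic matches the final sample: 213 + 221 + 137 = 571 unique documents.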
Finally, the last step is the examination of the information. This phase is carried out using two bibliometric analysis techniques: performance analysis and scientific mapping [110]. First, following previous studies [119], [120], the performance analysis is based on productivity, taking the number of publications as the main indicator. In addition, the number of citations and the h-index are used to enrich the performance analysis of authors, journals, institutions, and countries. The main objective is to provide an up-to-date overview of the research field by identifying the works that constitute its intellectual base [121]. Second, scientific mapping aims to reveal the structure and dynamics of scientific fields [122]. It is a spatial representation of how disciplines, fields, authors, or works relate to each other [121]. This methodological approach suits the purposes of this study. Therefore, in order to examine various aspects of the research field, we conducted a scientific mapping based on co-authorship and co-word analyses. On the one hand, co-authorship analysis allows us to identify the social network of a research field through the links between its most relevant authors and the subgroups that emerge from collaborations at the level of institutions and countries [123]. This technique captures stronger social links than other measures of relatedness, making it ideal for examining social networks [122]. On the other hand, co-word analysis, that is, the analysis of keyword co-occurrence, makes it possible to establish the intellectual structure of a scientific field by dividing the words into different groups [122]. Specifically, this method quantitatively relates and ranks the conceptual content of publications based on the occurrence of similar word pairs [124].
In other words, keyword co-occurrence helps to identify a research domain through the specific connections made between its keywords [125], [126]. In fact, keyword co-occurrence analysis is particularly appropriate for this purpose compared to other bibliometric methods, such as co-citation analysis and bibliographic coupling. The latter are not optimal for mapping research fronts because, among other reasons, citations take time to accumulate, which makes it more difficult to directly connect the most recent publications across clusters of knowledge bases [122]. In addition, the keywords of an article reflect its main content, and the frequency of occurrence and co-occurrence represent the most important topics addressed by papers in a research area and how they link to each other [121]. The sum of all joint co-occurrences between keywords allows a network map to be built from the co-occurrence matrix, in which several thematic groups or clusters can be recognized. Finally, maps are produced for different time periods in order to trace changes in the conceptual space of the field [127] and to identify future avenues of research.
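The counting behind a keyword co-occurrence matrix can be illustrated with a short sketch. It only counts how often two keywords appear on the same paper; the normalization, layout and clustering performed by a mapping tool such as VOSviewer are not reproduced. The function name and the three example keyword lists are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(papers):
    """Count how often each unordered keyword pair appears on the same paper."""
    pairs = Counter()
    for keywords in papers:
        # Normalize case and whitespace, and count each keyword once per paper.
        unique = sorted({k.strip().lower() for k in keywords})
        pairs.update(combinations(unique, 2))
    return pairs


# Hypothetical author keywords for three papers.
papers = [
    ["Gesture recognition", "Deep learning", "CNN"],
    ["Face recognition", "Deep learning", "CNN"],
    ["Gesture recognition", "SVM"],
]

links = cooccurrence(papers)
print(links[("cnn", "deep learning")])  # 2
```

Each non-zero count corresponds to an edge of the network map, and groups of keywords that are strongly linked to each other form the thematic clusters.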
Although this field of research emerged at the beginning of the current millennium, its significance in scientific production can be traced back to 2015, as shown in Figures 4 and 5. Table 2 lists more than 22 review articles found in this scientific production. It should be noted that none of them conducted a bibliometric study.

IV. RESULTS AND DISCUSSION
This section describes the results obtained by applying the methodology described in the previous section. Several important aspects are analyzed, such as the authors who have written the most publications and the institutions they belong to, or which country has the highest scientific production on the subject studied. This information is intended to represent the current state of AI techniques used in gesture recognition and to give an idea of how this research is likely to continue.

A. DATA ANALYSIS
This analysis starts with the main productivity indicators related to the documents published per year (see Table 3): the number of papers published per year (A), number of citations per year (C), the average number of citations per paper (C/A), number of authors per year (AU), number of authors that published at least one paper in a specific year (AUA), number of journals that published at least one paper in a specific year (JA) and number of countries that published at least one paper in a specific year (COA). The rest of the tables use the following parameters: number of total papers (A), number of citations for all papers (C), average citation per paper (C/A), year of first published paper (1st A), year of last published paper (Last A), population (million inhabitants) (P) and number of papers per one million inhabitants (AP).
Regarding the number of articles, Figure 4 shows that the scientific community has been interested in this area of knowledge since 2014, with a massive increase in documents in recent years: fewer than 50 publications were written in 2018, whereas 180 papers were published in 2020. Furthermore, the analysis of the number of citations revealed that 2014 is the year with the highest number of citations (180). In addition, Figure 5 shows that the number of authors who have published articles on the subject (AUA) has increased exponentially since 2017, with 200 authors involved in 2020, which shows the growing interest and an increasing number of collaborations between authors trying to fill the research gap in this field. Moreover, this broad field of research has been accompanied by steady growth in the number of journals and countries that publish relevant papers.
Regarding the distribution of the documents included in this bibliometric analysis, 93.52% are papers, 3.85% are reviews and the rest is made up of papers published in special editions of presentations, books, and book chapters (Figure 6).
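The derived indicators used in the tables are simple ratios, and the h-index used in the performance analysis is also straightforward to compute. The sketch below shows one possible implementation; the function names are illustrative, and the input values are hypothetical examples rather than rows taken from the tables.

```python
def citations_per_paper(citations, papers):
    """C/A: average number of citations per paper."""
    return citations / papers if papers else 0.0


def papers_per_million(papers, population_millions):
    """AP: number of papers per one million inhabitants."""
    return papers / population_millions


def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical inputs chosen only to illustrate the arithmetic.
print(citations_per_paper(248, 10))   # 24.8 citations per paper
print(papers_per_million(26, 25.0))   # 1.04 papers per million inhabitants
print(h_index([10, 8, 5, 4, 3]))      # 4
```

The h-index rewards sustained impact: in the last example, four papers have at least four citations each, but there are not five papers with at least five.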
Regarding journals, Table 4 shows additional bibliometric indicators, such as citations, the average number of citations per article, the year of the first publication, the year of the last publication, and the h-index. The most relevant journal in terms of number of articles is IEEE Access.
As for authors, Table 5 displays the most productive authors in the area. These authors come mainly from South Korea, in particular from Dongguk University. It is worth mentioning that the affiliation indicated in Table 5 is the one given at the time of publication of the author's last document. All the authors in this table have published the same number of articles, 4 each, with Zhan, Shu being the most cited author and the one with the highest number of citations per article between 2016 and 2020. He was affiliated with the Hefei University of Technology.
Figure 7 shows the international networks of academic institutions that have more than 5 scientific studies in common among their researchers, with two groups identified. The main group (red) is the most productive. It includes three Chinese universities: the most productive academic institution (Table 6), the Chinese Academy of Science, along with Xidian University and the University of Chinese Academy of Science. The other group includes Tsinghua University (China), Shenzhen University (China), Carnegie Mellon University (USA), and Nanyang Technological University (Singapore).
The analysis of the institutions is shown in Table 6, which lists the ten most productive institutions in this area from 2000 to 2020. These institutions are all located in China, which is not surprising given that most of the authors who publish in this field, as well as the most relevant scientific production, come from China. According to the number of articles, the most productive is the Chinese Academy of Science, which has had 20 articles since 2016. The second most productive institution is the University of Chinese Academy of Science with 10 articles since 2017. The third academic institution is Tsinghua University, which also has 10 articles. According to the number of citations per article (C/A), the first place is occupied by the University of Chinese Academy of Science with a ratio of 24.80 citations per article. In this regard, Tsinghua University ranks second. As can be deduced from the data shown in the table, the institution that pioneered this field of research was Sun Yat Sen University.
Table 7 presents the countries that have published the greatest number of articles on gesture recognition using techniques from the field of AI, including information related to citations, articles, the h-index, and the year of the first and last publication. It should be noted that an article may be attributed to more than one country, as the countries are established by the affiliated institutions of the researchers involved. The most influential countries in terms of the number of articles are China and the United States with 252 and 74 documents respectively, followed by India (56) and South Korea (47), as can be seen in Figure 8. In addition, taking into account the number of articles per million inhabitants (AP), Australia ranks first with 1.04.
Considering the number of citations (C), the documents from China (6,325) stand out, although the highest ratio of citations per article is obtained by the United States, with 45.12 citations per document. Considering the h-index, China leads the ranking with an index of 26.
Table 8 shows the ten most relevant papers associated with the field of research analyzed in this work. These papers have been ranked according to the number of citations and the year of publication, to avoid the bias that may arise from considering citations alone. The most relevant data from each of these papers are presented below:
1) In [128], an architecture based on 3D convolutional neural networks for human activity recognition is presented. This architecture extracts the spatial and temporal features of the videos it analyzes and has the particularity that it generates several information channels from different videos, finally combining all that information to obtain the features.
2) The novelty of [129] is to train a deep network with a spatial pyramid pooling layer, which shows good results in detection and classification tasks, as well as greater speed. This approach is faster than R-CNN, which makes it suitable for real-world applications.
3) In [130], the authors have developed a Deep Learning-based framework called DeepConvLSTM that combines convolutional layers and recurrent LSTM units in order to recognize human activity from data extracted by wearable sensors. Among the main characteristics of this combination are that it extracts features automatically and models the temporal dependencies of the activations of its neurons.
4) In [131], a hybrid model formed by convolutional neural network (CNN) and Support Vector Machine (SVM) classifiers for handwritten digit recognition is shown. This model has been created by replacing the last layer of the CNN with an SVM classifier in order to recognize unknown patterns. This architecture has presented promising results mainly for 3 reasons: the automatic extraction of salient features, the combination of the advantages of the CNN and SVM techniques, and the reduction of complexity in the decision process compared to other models.
5) In [132], three new CNN architectures for sEMG-based gesture classification are presented. Furthermore, the paper describes a novel Transfer Learning (TL) paradigm that improves the performance of these convolutional networks. This scheme is based on the learning of two CNNs at the same time. The network that the authors refer to as the "source" shares information with the second network through elementwise summation, which has the advantage of reducing the issues of connecting both networks. Moreover, it implements a method that fosters residual learning, so the second network only has to learn a limited number of weights. Tests on two datasets showed that this TL design improved each of the proposed CNNs, one of which achieved an accuracy of 98.31%.
6) In [100], a facial expression recognition system has been created using traditional Deep Learning techniques such as convolutional neural networks, with the aim of keeping the method simple. Despite its simplicity, this methodology has managed to obtain an accuracy rate of 96.76%.
7) In [133], a review focused on Deep Learning is presented, since this area of knowledge has been in high demand due to the promising results that it has had in tasks such as object recognition and face recognition.
8) In [134], a description of Machine Learning models as a replacement for the traditional physical models in Earth sciences is given, since more and more data is being extracted from this field of knowledge and data science is becoming necessary to deal with and analyze this huge volume of data.
9) In [21] [...] as depth and RGB images. The role of the HMM is to take into account time dependencies in the learning process.
Tables 9 and 10 show the twenty most representative keywords in different time intervals. The common factor across these periods is face recognition, which is among the highest positions in each of the defined intervals. This trend arises because face recognition is a persistent problem in Computer Vision, and because face identification and verification have become very popular for access control and security, as well as for biometrics [136]. However, special mention should be made of the keywords Convolutional Neural Network and Deep Learning, which are also in the first positions along with Face recognition, since Deep Learning has made important contributions to Computer Vision in recent years. Finally, gesture recognition, which is one of the most representative keywords for this study, began to arouse more interest from 2016, when it entered the 10 most cited keywords in the papers of that time. This is confirmed in Figure 9, where it can be seen that the frequency of this keyword is higher in the period 2016-2020.
FIGURE 11. Relationship between the DL techniques and the gesture recognition.

B. GESTURE RECOGNITION MODEL
The models in Figures 11 and 12 represent the relationship between AI techniques and the various types of gesture recognition. These models were created by reviewing each paper and extracting the type of gesture recognition and the AI techniques that were used to identify the gestures. In the resulting graphs, the nodes in the center represent the types of gestures while the peripheral nodes display the different AI techniques that have been identified. Figure 11 shows Deep Learning techniques while Figure 12 displays Machine Learning techniques. The relationship between these two concepts is expressed by the edges that connect the central and peripheral nodes, as well as by the size of the nodes and the color of the edges and nodes. On the one hand, there are three different node sizes: small, medium and large. Large nodes are those with more than 80 occurrences, small ones are in the (0-30] range, and medium-sized ones fall in between. On the other hand, there are 3 different colors: green, blue and red. Nodes are colored according to size: large nodes are green, medium-sized nodes blue and small nodes red. The thickness of the edges is the same for all, and their occurrences are reflected by their color. Green edges have more than 30 occurrences, red edges are in the (0-10] range, and blue edges fall in between. However, these edges not only indicate the number of occurrences but also show which AI techniques have been used for each type of recognition, since an edge is not drawn in the figure if there are no occurrences.
From the point of view of success cases, this model indicates which techniques have been most successful, in the sense that more published papers have used a given AI technique for a given type of recognition. Some of the techniques have been included in studies of every type of gesture recognition, such as Convolutional Neural Networks, Recurrent Neural Networks and Support Vector Machine, while others have only been used in a couple of them. Furthermore, Convolutional Neural Networks, Recurrent Neural Networks, and Support Vector Machine have featured the most in scientific publications. This could be interpreted as meaning that these methods have obtained good results in gesture recognition and that researchers have therefore chosen to use them to a greater extent. On the other hand, the techniques that have not been so successful are Hidden Markov Models, Deep Boltzmann Machine and Regressions.
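The visual encoding described above amounts to two threshold rules, which can be sketched as follows. The function names are hypothetical. The intermediate classes are assumed to cover the values between the stated bounds, that is, more than 30 and up to 80 occurrences for medium nodes and more than 10 and up to 30 for blue edges.

```python
def node_style(occurrences):
    """Return (size, color) for a node, or None if it has no occurrences."""
    if occurrences > 80:
        return ("large", "green")
    if occurrences > 30:   # assumed intermediate class: more than 30, up to 80
        return ("medium", "blue")
    if occurrences > 0:    # range (0-30]
        return ("small", "red")
    return None


def edge_color(occurrences):
    """Return the edge color, or None when the edge is not drawn."""
    if occurrences > 30:
        return "green"
    if occurrences > 10:   # assumed intermediate class: more than 10, up to 30
        return "blue"
    if occurrences > 0:    # range (0-10]
        return "red"
    return None            # no occurrences: the edge is omitted


print(node_style(95), node_style(30), edge_color(12), edge_color(0))
```

Returning `None` for zero occurrences mirrors the fact that a technique never used for a given type of recognition has no edge in the figure.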

V. CONCLUSION
Gesture recognition techniques have aroused the interest of the scientific community in recent years and have been used in various fields such as education, vehicles, virtual reality and sign language translation. The application of AI techniques to this field of research has been a determining factor in its progress. This motivated the present study, which has collected the most relevant indicators of research on AI techniques applied to gesture recognition over the last twenty years, from 2000 to 2020.
This study has found a substantial increase, as of 2016, in the number of articles and citations on the subject of this bibliometric analysis, which had previously gone largely unnoticed. A noteworthy fact is that the academic institutions that have shown the greatest interest in the subject and have contributed the most to research on gesture recognition and AI come mainly from China. This is probably because Deep Learning methods are being applied to numerous areas of knowledge, including Computer Vision. This combination is producing great advances in the field of Computer Vision, which is closely related to gesture recognition, since vision-based gesture recognition is one of its categories. Therefore, it is not surprising that interest in this subject has grown in recent years, with the aim of obtaining promising results in this field through the use of AI techniques.
In this bibliometric analysis, we have worked with 571 documents extracted from the WoS and Scopus scientific databases. The research criteria combine different types of gesture recognition with ML and DL techniques, in order to select the papers that have specifically used an AI technique for gesture recognition. Based on this information, the most relevant aspects of scientific production on the subject of this study have been described: Section III explains the process followed to prepare this bibliometric study, and Section IV presents the results obtained from that process. Section IV shows the indicators that are relevant for this study: the evolution of the articles published during the years covered by this study, the journals with the greatest number of articles, the most cited authors, and the countries with the largest number of articles and the highest h-index. Regarding these indicators, the results show that the journals with the highest number of published papers are IEEE Access, Neurocomputing and Multimedia Tools and Applications, although the journal with the most citations is IEEE Transactions on Pattern Analysis and Machine Intelligence. The countries that have been the most productive are China and the United States, with the institutions from China being the ones that have produced the greatest number of articles on the topic of interest. Nevertheless, the majority of the authors who have published the greatest number of articles are from South Korea. Furthermore, a brief description of each of the ten most cited articles has been given, together with the keywords that are most meaningful when searching on the terms concerning this work.
From our models, it can be concluded that facial gestures have received the most attention in research, followed by hand gestures. Regarding DL, CNN and recurrent networks are the most used, with CNN being the preferred technique for both facial expressions and hand gestures; it should be noted that this method is one of the few that has been applied to all the types of gestures presented. In the model referring to ML, the SVM technique stands out, obtaining the highest number of occurrences in the works that involve facial expressions. Moreover, the combination of CNN and SVM has been widely used by researchers.
Despite the research that is being carried out, the interest in including AI algorithms in gesture recognition is recent and there are still quite a few limitations to overcome. However, the advances are promising, as can be seen in the description of the ten most cited articles that have been included in this analysis.