A General Pipeline for Online Gesture Recognition in Human–Robot Interaction

Recent advances in robotics have allowed the introduction of robots that assist and work together with human subjects. To promote their use and diffusion, intuitive and user-friendly means of interaction should be adopted. In particular, gestures have become an established way to interact with robots since they allow users to command them in an intuitive manner. In this article, we focus on the problem of gesture recognition in human–robot interaction (HRI). While this problem has been widely studied in the literature, it poses specific constraints when applied to HRI. We propose a framework consisting of a pipeline devised to take these specific constraints into account. We implement the proposed pipeline on an example evaluation use case. To this end, we consider standard machine learning algorithms for the classification stage and evaluate them with several performance metrics for a thorough assessment.

learn commands that are specific for the robot. In this regard, gestures represent a valid candidate for intuitive communication with robots.
Gesture-based interaction with machines or robots has been proposed by a large body of the literature, in both everyday and industrial applications [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]. The problem of gesture-based interaction can be split into two specific problems: recognizing gestures and mapping gestures into commands to the interaction system. The focus of our article is on the first specific problem applied to the domain of robotics, that is, gesture recognition in the context of HRI. While gesture recognition has long been an active area of research in computer vision and machine learning [11], the use of gestures for interaction with robots poses specific constraints. These constraints define the prerequisites for successful gesture-based HRI. To improve the use of gestures for HRI, it is therefore beneficial that approaches to gesture recognition comply by design with such constraints. The second specific problem refers to designing how gestures can be translated into commands to the robot, possibly in an intuitive manner [2]. While these two specific problems are independent, they jointly address the general problem of gesture-based interaction with robots. Building upon these lines, in this article, we propose a pipeline for online gesture recognition in HRI. The approach has been designed to be general and can be applied to any set of gestures for interacting with any robot. To the best of our knowledge, this is the first attempt in this direction, since existing approaches have been designed to address specific use case scenarios. Conversely, we design the proposed pipeline starting from the understanding of how gesture-based interaction, in general, can be applied to HRI; no additional constraints or requirements derived from specific use cases are included in the design of the pipeline. As a result, the proposed pipeline is application agnostic.
To achieve this, in Section III-A we first discuss the specific constraints of gesture recognition for HRI, which have to be taken into account to use gesture-based interaction in daily life. Then, with these constraints in mind, in Section III-B, we propose an algorithmic pipeline to implement gesture recognition in HRI. To show and discuss the application of the proposed pipeline, we consider the experimental scenario presented in [2], [12], and [13]. In particular, we assess the capability of the pipeline to generalize across multiple subjects having different levels of acquaintance with the use of gestures. Indeed, prospectively, the use of gestures for interacting with robots should be designed in such an intuitive way that it is available to any user, and does not require specific prior experience with gestures.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The rest of this article is organized as follows. Section II reports the state of the art on gesture recognition in HRI. In Section III, we discuss how HRI constrains the use of gesture-based interaction and, building upon such discussion, we present the proposed pipeline. Then, Section IV focuses on the classification stage, which is the core of the proposed approach. In Section V, the experimental setting considered to validate the proposed architecture is presented. Then, in Section VI, we report the results of applying the proposed pipeline to the considered use case. Finally, Section VII concludes this article.

II. GESTURE RECOGNITION FOR HRI
A large part of existing approaches to gesture recognition in the domain of HRI relies on the use of vision systems [14], [15]. In [5], a stereo camera is mounted on the head of a personal service robot for elderly assistance. Whole-body gestures, such as walking or raising a hand, are then recognized from estimated 3-D human body components. Burke and Lasenby [6] proposed the use of a Kinect sensor to detect pantomimic hand gestures that control an unmanned aerial vehicle (UAV). Xu et al. [8] focused on the problem of background subtraction when using RGB-D cameras for hand gesture detection in home-like dynamic environments. Kim et al. [7] proposed vision-based gesture recognition to address the problem of HRI at a long distance, approximately 5 m from the camera. Sigalas et al. [9] classified arm trajectories, seen as sequences of motor primitives. To this end, RGB video sequences are used, with the subject standing in front of the robot in a constrained setup. The same requirement, namely that the subject stands in front of the robot, is set by Cicirelli et al. [10], despite the use of multiple Kinect cameras to monitor the surrounding environment. Chandarana et al. [16] used the infrared-based Leap Motion controller to detect hand gestures for teleoperating UAVs. UAV trajectories are built by combining gestures that define portions of flight paths. Generally speaking, vision-based approaches require proper lighting conditions and camera angles, and the user has to be in the field of view of the camera. Hence, these characteristics limit the use of such approaches in real-world HRI applications.
To overcome the limitations linked to the use of vision systems and given the recent advances in pervasive computing, wearable devices have been used to detect and recognize gestures [17]. Indeed, unobtrusive, wireless, and inexpensive body-worn sensors, such as accelerometers and gyroscopes, possibly integrated in inertial measurement units (IMUs), are available on everyday mobile and ubiquitous systems, such as smartwatches, wrist bands, and mobile phones. They provide information about body movements and, hence, can be used to track user activity [18], [19]. In addition, they have been used for gesture recognition for human-machine interaction in several works [3], [20], [21]. As regards gesture-based interaction with robots, Neto et al. [4] proposed the use of five IMUs and an ultra-wideband positioning system to capture the human upper body shape and the relative position between the human and the robot. Villani et al. [2], [12] used inertial data recorded with a smartwatch to control both wheeled and aerial robots. Gestures are used to provide high-level commands, such as take off, land, or stop, whereas robot velocity is determined by mapping the user's wrist movements. A similar setting was considered by Carfì et al. [22]. However, their framework was not designed for HRI. Indeed, gestures were executed in steady conditions, which means that the user was in a fixed predetermined pose between consecutive gestures. This condition limits the possibility of using this means of interaction while the user is performing everyday activities. In addition, intersubject robustness of the classification approach was assessed only to a limited extent, since it was tested with gestures performed by the same subjects involved in the training phase.
Furthermore, wearable sensors based on surface electromyography (sEMG) are being used for gesture recognition [23], [24], [25]. Quite often, IMU and EMG data are combined to improve gesture recognition. Indeed, while inertial data provide information about hand position, EMG sensors make it possible to capture complex finger or hand gestures. This is the case, for example, of the work by Jiang et al. [26], where the fusion of sEMG and IMU sensing makes it possible to recognize several air and surface gestures with two distinct force levels. Georgi et al. [27] proposed the simultaneous use of IMU and EMG sensors for gesture-based interfaces, with Hidden Markov Models used as classifiers to discriminate between the defined gesture classes.
From an algorithmic point of view, most approaches to gesture recognition resort to machine learning techniques to deal with high-dimensional, multimodal streams of data that are characterized by a large variability. Different machine learning algorithms applied to gesture recognition have been compared in [24], [28], and [29]. Comparisons by Trigueiros et al. [28] and Wahid et al. [24] focused on hand gestures, as most of the approaches proposed in the literature (e.g., [20], [21], [23], and [27]), whereas realistic daily life activities are considered by Sagha et al. [29].
The abovementioned approaches consider the use of gestures for HRI in specific case studies and cannot be easily scaled to other applications. Our aim in this article is to address the problem of gesture-based HRI from a general perspective, identifying the specific constraints of this application domain and proposing a pipeline that can be applied to any case study, with some fine tuning.

III. PROPOSED ARCHITECTURE
We hereby describe the proposed pipeline for online gesture recognition in HRI. It addresses several constraints specific for gesture-based interaction with robots. A diagram representing the proposed architecture is depicted in Fig. 1.

A. Specific Constraints of Gesture Recognition in HRI
As introduced in Section I, when dealing with gesture recognition in HRI, it is important that using gestures does not limit the user interacting with the robot, so that gestures are not perceived as slowing down the interaction. As a result, the use of gestures in the context of HRI poses the following specific constraints for the problem of gesture recognition, referred to in the following as C1 to C5.
1) Gesture-based interaction should rely on a lean infrastructure that requires little or no installation and does not limit the user's freedom to move around and/or with the robot.
2) Gesture recognition should be performed online and not introduce any perceivable delay between the execution of a gesture and its effect, meant as a command to the robot. In other words, the algorithmic pipeline in charge of gesture recognition should be fast enough, with a limited computational burden. Moreover, the architecture should guarantee immediate communication of a detected gesture to the robot.
3) It should be possible to detect gestures online and in dynamic conditions, while users are performing other ordinary activities and movements.
4) The system should be easily adapted to different users, thus requiring robust generalization capabilities.
5) While seeking to improve gesture recognition performance, it is particularly important to minimize the rate of false positives since they would initiate an unintended communication towards the robot.
It is noteworthy that some of these constraints are general for gesture-based interaction and apply also to other domains, as discussed, for example, in [14] and [30]. Nevertheless, as regards HRI, they define the prerequisites for successful gesture-based interaction. These constraints are summarized in the left column of Table I, whereas the right column describes how the proposed system tackles them. More details are given in the following sections.

B. Overview of the Proposed Pipeline
To track the movements of the user, we consider inertial data recorded by a wrist-worn device, and thus focus on forearm gestures. This choice complies with constraint C1 since wearable devices recording inertial data are not cumbersome to wear and are easily available on the market. Inertial data are then analyzed by considering sliding windows of fixed length, with one sample shift. Recorded data are continuously processed and the occurrence of a gesture is continuously verified, thus complying with C3. In particular, to search for gestures, for each sliding window a set of statistical features is computed for each of the measured signals. The extracted features represent the input for the classification stage. In this article, we start by considering different classical classification algorithms, such as K-nearest neighbors (KNN), support vector machines (SVMs), random forests (RF), and neural networks (NNs). Among them, we then select the algorithm that best complies with the specific constraints for gesture-based HRI, in a validation use case. To select the most appropriate algorithm for the considered application domain, different performance metrics are considered that account for the severity of false positives in HRI, so as to comply with C5. Moreover, starting from the results achieved in our case study, the choice of the classification stage is driven by considerations related to the need for intersubject robustness and limited computational burden, thus complying with C4 and C2, respectively. The right column of Table I summarizes how the proposed pipeline complies with the specific constraints for gesture-based HRI.
With reference to Fig. 1, all the considered algorithms share the same general structure. They require previous training and validation with respect to the specific gestures of the considered use case. As output, each algorithm provides a score related to the probability that one of the considered gestures occurred in the current sliding window. A threshold is set on such scores; the threshold can be tuned in order to balance false positives against false negatives. If the confidence of the classifier is below such threshold, then the process moves to the following sliding window. Otherwise, if the confidence of the classifier is above the threshold for at least one gesture, a decision is not taken immediately, but the system checks for the same condition also in the subsequent sliding windows. If the condition is met, then a decision is taken and the corresponding command is sent to the robot. The rationale behind this behavior is to improve the robustness of the detection phase. Since consecutive sliding windows are shifted by one sample, it is likely that a gesture occupies more than one window, depending on the sampling frequency of the recording device. Hence, it is likely that when a gesture is performed, it is recognized by the classification algorithm in several consecutive windows, whose number depends on the duration of the gesture and the device sampling frequency. On the other hand, if a gesture is detected in a single window only, it is likely to be a false positive and should be discarded.
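The decision logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `confirm_gestures`, the per-window score dictionaries, the example threshold, and the choice to restart counting after a detection are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple


def confirm_gestures(
    window_scores: Iterable[Dict[str, float]],
    threshold: float,
    n_consecutive: int,
) -> Iterator[Tuple[int, str]]:
    """Yield (window_index, gesture) once a gesture wins n_consecutive windows in a row.

    window_scores holds one dict per sliding window, mapping gesture name to
    classifier score. Both threshold and n_consecutive are tunable.
    """
    candidate: Optional[str] = None  # gesture currently being confirmed
    streak = 0                       # consecutive windows agreeing on the candidate

    for index, scores in enumerate(window_scores):
        best = max(scores, key=scores.get)
        if scores[best] <= threshold:
            # Confidence too low: drop any pending candidate, go to next window.
            candidate, streak = None, 0
            continue
        if best == candidate:
            streak += 1
        else:
            candidate, streak = best, 1
        if streak == n_consecutive:
            yield index, best
            # Assumption: counting restarts after a command is issued.
            candidate, streak = None, 0


# Illustrative stream of per-window scores (made-up numbers).
stream = [
    {"up": 0.2, "down": 0.1},   # below threshold
    {"up": 0.9, "down": 0.1},   # isolated detection, later discarded
    {"up": 0.3, "down": 0.2},   # below threshold, resets the streak
    {"up": 0.1, "down": 0.8},
    {"up": 0.1, "down": 0.9},
    {"up": 0.2, "down": 0.95},  # third consecutive "down": confirmed
]
detections = list(confirm_gestures(stream, threshold=0.5, n_consecutive=3))
```

In this toy stream, the single high-confidence "up" window is discarded, whereas "down" is reported only at the third consecutive window in which it exceeds the threshold.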

IV. CLASSIFICATION STAGE
The classification stage of the proposed architecture consists in a machine learning module that receives as input the statistical features computed on the current sliding window and provides as output a decision whether a gesture is recognized or not. In this section, we overview the general setting of supervised machine learning, and we briefly describe four classic algorithms that have been tested in our experimental evaluation.
Broadly speaking, in a classification task, the goal is to predict the category $y \in Y$, sometimes also named class, label, or output, of a given observed example $x \in X$, which instead represents the input of the system. In a supervised setting, we are given a collection $D$ of $m$ input/output pairs, $D = \{(x_i, y_i)\}_{i=1}^{m}$, and we aim to fit a function $f : X \to Y$ so that we can then predict the category $\hat{y}$ of a given, novel (not previously seen) example $\hat{x}$. Clearly, different machine learning approaches exploit different definitions for function $f$, as well as different algorithms and techniques to learn such a function. To identify the best approach for the proposed HRI pipeline, we have compared classic machine learning approaches, namely KNN, SVMs, RF, and NNs. However, other classification algorithms can be considered. In all these cases, we have considered a setup where each input instance $x$ is described by a vector of real numbers, which are the characteristics, or features, of that instance.
The parameters chosen to implement each approach are discussed in Section V-C, with reference to the considered datasets.

A. K-Nearest Neighbors
Based on the concept of distance between examples, the KNN classifier is not properly a learning algorithm. In fact, given a test example $\hat{x}$ to be classified, the algorithm looks for the $K$ examples in the training set that are the closest ones to $\hat{x}$ according to a chosen metric (e.g., the Euclidean distance). The prediction is then performed via a majority voting procedure among the classes of the $K$ nearest neighbors. Although very simple, this algorithm can work well in practice, when the distance computed on the feature vectors is highly discriminative of the target class. The parameter $K$ defines the size of the neighborhood to consider for classification.
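The nearest-neighbor vote can be illustrated with a few lines of plain Python. This is only a sketch: the function name `knn_predict`, the toy two-dimensional feature vectors, and the tie-breaking rule are assumptions made for the example, and a library implementation would normally be used in practice.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def knn_predict(
    train: Sequence[Tuple[Sequence[float], str]],
    query: Sequence[float],
    k: int,
) -> str:
    """Return the majority label among the k training examples closest to query."""
    # Rank training examples by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    # Majority vote; on a tie, the label encountered first (the closest) wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy training set with two well-separated classes (made-up numbers).
train = [
    ([0.0, 0.0], "up"),
    ([0.1, 0.2], "up"),
    ([0.2, 0.1], "up"),
    ([1.0, 1.0], "down"),
    ([0.9, 1.1], "down"),
]
near_up = knn_predict(train, [0.0, 0.1], k=3)
near_down = knn_predict(train, [1.0, 0.9], k=3)
```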

B. Support Vector Machines
SVMs are another classic method for supervised classification in machine learning. Considering binary classification, SVMs are trained to learn a discriminative function that best separates positive and negative examples, with the maximum possible margin [31]. Given a collection of $m$ training samples, such a discriminative function is computed as

$$f(x) = \sum_{i=1}^{m} \alpha_i K(x_i, x)$$

where $\alpha_i$ are the learnable model parameters, and the kernel function $K(\cdot, \cdot)$ aims to measure the similarity between examples. Function $K$ is, in general, a nonlinear function so that the classifier can model nonlinear dependencies between features and class. Decision function $f$ only depends on those training examples whose corresponding coefficient $\alpha_i$ is different from zero: these are called support vectors. In general terms, the decision function $f$ defines a hyperplane, which constitutes the decision boundary that is used to classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. This idea can be extended to the case of multiclass classification by breaking down the multiclass problem into multiple binary classification problems. Two approaches can be selected to this end: in the one-versus-one approach, a binary classifier is set for each pair of classes, regardless of the other classes; in the one-versus-rest approach, a binary classifier is set for each class, to distinguish it from the rest of the data.
A commonly used kernel function for SVM classifiers is the radial basis function (RBF). The RBF kernel function for two points $x_1$ and $x_2$ computes their similarity, that is, how close they are to each other, as a function of their Euclidean distance. An SVM with RBF kernel is specified by means of two hyperparameters: $C$ and $\gamma$. While $C$ sets a tradeoff between misclassification of training examples and simplicity of the decision surface, $\gamma$ defines the width of the radial functions.
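As an illustration, the sketch below evaluates a kernel-based decision function with an RBF kernel. It assumes the common parameterization K(x1, x2) = exp(-γ ||x1 - x2||²), in which a larger γ gives narrower radial functions; the support vectors, coefficients, and function names are hypothetical, each coefficient is assumed to already carry the sign of its class, and no bias term is included to keep the example short.

```python
import math
from typing import Sequence


def rbf_kernel(x1: Sequence[float], x2: Sequence[float], gamma: float) -> float:
    """RBF similarity exp(-gamma * ||x1 - x2||^2): 1 for identical points, decaying towards 0."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * squared_distance)


def decision_function(
    x: Sequence[float],
    support_vectors: Sequence[Sequence[float]],
    alphas: Sequence[float],
    gamma: float,
) -> float:
    """Weighted sum of kernel similarities between x and the support vectors."""
    return sum(a * rbf_kernel(sv, x, gamma) for sv, a in zip(support_vectors, alphas))


# Two hypothetical support vectors, one per class, with signed coefficients.
support_vectors = [[0.0, 0.0], [2.0, 0.0]]
alphas = [1.0, -1.0]

score_near_first = decision_function([0.0, 0.0], support_vectors, alphas, gamma=1.0)
score_near_second = decision_function([2.0, 0.0], support_vectors, alphas, gamma=1.0)
score_midpoint = decision_function([1.0, 0.0], support_vectors, alphas, gamma=1.0)
```

The sign of the score indicates the side of the decision boundary: it is positive near the first support vector, negative near the second, and zero at the midpoint, which lies on the boundary in this symmetric toy case.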

C. Random Forests
An RF [32] consists of a collection of individual Decision Trees (DTs) [33], whose predictions are combined, typically through a voting process. A DT inductively learns a set of explainable classification rules by imposing conditions on the values of the features describing the examples. When creating an RF, each DT is trained from a distinct set of $n$ examples randomly sampled from the original training set, and by testing only $m$ out of $M$ features at each node in the tree. The process of combining the outcome of individual classifiers into a single prediction is usually called an ensemble approach, and it is known to typically improve the performance of the overall system, as well as to reduce overfitting. Each individual DT in the RF produces a class prediction and the class with the most votes becomes the prediction of the model, as in classic ensemble approaches.

D. Neural Networks
An NN [34] is a nonlinear function transforming a set of input variables into a set of output variables via a set of adjustable parameters. In particular, an NN is a combination of nonlinear basis functions. Each basis function is itself a nonlinear function (called activation function) of a linear combination of the inputs, and the coefficients in such combination are adaptive weights that can be learned to fit the training data. Several layers of adaptive weights can be stacked to form a deep network.

V. EXPERIMENTAL IMPLEMENTATION
In this section, we present the experimental setting that was considered to implement the pipeline shown in Fig. 1. We first introduce the validation use case and describe different sets of features extracted from the raw signals; then, we illustrate the adopted training and evaluation procedure, and finally the performance metrics employed at prediction time.

A. Use Case
To validate the proposed pipeline, we considered the gesture-based HRI approach introduced in [2]. In particular, the scenario consists in recognizing the following $N_{\text{gest}} = 5$ gestures, depicted in Fig. 2: 1) up: sharp movement upward in a plane parallel to the sagittal one; 2) down: sharp movement downward in a plane parallel to the sagittal one; 3) circle: movement in a circular shape in a plane parallel to the frontal one; 4) left: sharp movement to the left, from sagittal plane to frontal one, in a plane parallel to the transverse one; 5) right: sharp movement to the right, from sagittal plane to frontal one, in a plane parallel to the transverse one. Examples can be seen in the multimedia attachment to [2]. These gestures are meant to be performed with the right arm, with the subject wearing an IMU on the right wrist. As our input device, we considered a commercial multipurpose smartwatch, namely the Samsung Gear S device. Data used in the analysis presented in this article consist of triaxial inertial measurements recorded by the smartwatch, namely $x \in \mathbb{R}^{10}$, and include timestamp, angular velocities, and linear accelerations (raw and with automatically compensated gravity). Data are accessed by means of a Tizen interface and are provided on an uneven sampling grid. On average, approximately 25 samples per second are provided.
As regards the overall software architecture, data are recorded via Tizen and sent via Wi-Fi to an external computer, for ease of implementation. The classification stage is implemented in Python, using the scikit-learn library [35] and TensorFlow 2.0.0 [36].
The Robot Operating System (ROS) can then be used for implementing the remainder of the pipeline [37]. It is an open-source hardware-independent middleware widely used in robotics and consists of a set of software libraries and tools that allow communication with robots, actuators, sensors, and other devices commonly used in robotic applications. Since it supports Python, it can deal with data acquisition, data processing, gesture recognition, and communication with the robotic platform. To this end, a publish-subscribe pattern can be used that can efficiently handle communication events, such as arriving messages, and inform the robot about the detection of a gesture.

B. Feature Extraction
Starting from the data provided by the wrist-worn device, synthetic features are extracted and passed to the classification stage as input data. We considered three sets of features, which were compared as possible different inputs for the classification algorithms. The first two feature sets, denoted in the following as $F_1$ and $F_2$, were defined as follows. Set $F_1$ includes standard statistics computed in the time domain, for each inertial quantity: mean value, standard deviation, maximum value, and minimum value. These are customarily used in many gesture recognition applications (for example, see work in [38]). Following [24] and [26], set $F_2$ includes mean absolute value, count of slope sign changes, count of zero crossings, and waveform length, given as follows for the samples $x_1, \dots, x_L$ of a sliding window: 1) mean absolute value: $\frac{1}{L} \sum_{k=1}^{L} |x_k|$; 2) slope sign change: number of samples at which the slope of the signal changes sign; 3) zero crossing: number of times the signal changes sign; 4) waveform length: cumulative length of the waveform over the window, where $L$ is the number of samples in a sliding window. Set $F_3$ is obtained from the union of $F_1$ and $F_2$ and, hence, consists of eight features.
As a result, each input sequence for classification (both for training and test datasets) is represented by 36 features for $F_1$ and $F_2$ and 72 features for $F_3$, since the abovementioned features were computed for each of the nine inertial quantities measured by the smartwatch.
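A possible way to compute the two feature sets on one window is sketched below. The formulas used for slope sign changes, zero crossings, and waveform length are the commonly used textbook definitions (without any noise threshold) and are an assumption here, as is the use of the population standard deviation; the function names are hypothetical.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def features_f1(x: Sequence[float]) -> List[float]:
    """Time-domain statistics: mean, standard deviation, maximum, minimum."""
    return [fmean(x), pstdev(x), max(x), min(x)]


def features_f2(x: Sequence[float]) -> List[float]:
    """Mean absolute value, slope sign changes, zero crossings, waveform length."""
    n = len(x)
    mav = sum(abs(v) for v in x) / n
    # Slope sign change: sample k is a strict local maximum or minimum.
    ssc = sum(1 for k in range(1, n - 1) if (x[k] - x[k - 1]) * (x[k] - x[k + 1]) > 0)
    # Zero crossing: two consecutive samples have opposite signs.
    zc = sum(1 for k in range(n - 1) if x[k] * x[k + 1] < 0)
    # Waveform length: sum of absolute differences between consecutive samples.
    wl = sum(abs(x[k + 1] - x[k]) for k in range(n - 1))
    return [mav, float(ssc), float(zc), wl]


def window_features(window: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate both feature sets for every channel of one sliding window.

    window holds one sequence of samples per inertial channel.
    """
    out: List[float] = []
    for channel in window:
        out.extend(features_f1(channel))
        out.extend(features_f2(channel))
    return out


# Toy channel (made-up numbers) replicated over nine inertial channels.
signal = [1.0, -1.0, 2.0, -2.0, 0.0]
f2 = features_f2(signal)
feature_vector = window_features([signal] * 9)
```

With nine channels and eight features per channel, the resulting vector has 72 entries, which matches the size of the combined feature set.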

C. Training and Evaluation
All the considered classification algorithms were trained by considering a labelled dataset of 300 gestures, which included 60 trials per gesture. Such gestures were all performed by the same subject, denoted in the following as $S_0$, standing in steady state. The dataset contains only inertial samples that refer to the execution of the gestures: samples between two consecutive gestures were manually excluded. The composition of the training dataset is detailed in Table II. In order to tune the hyperparameters of the different classifiers, an inner k-fold cross-validation on the training set was conducted. Based on preliminary experiments, in the analyses reported in this article, for KNN we tested $K = 5, 7, 10, 15$. For SVM, since the two hyperparameters $C$ and $\gamma$ have to be jointly chosen, we exploited a grid search with an internal cross-validation, as customary in this kind of application. We finally selected $C = 0.1$ and $\gamma = 10^{-6}$. As for RF, we considered a total of 100 trees in our model. For NN, we considered three different architectures, summarized in Table III. Each network was trained using the Adam optimizer [39] with batch normalization, and the hyperparameters (learning rate, momentum, and number of training epochs) were set with Bayesian optimization.
To test the algorithms under analysis, we considered streams of inertial data containing gestures. In other words, consecutive gestures were separated by a nonconstant number of samples in which the user moved in a natural manner, as described in the following. The goal was to replicate real-life scenarios in which the subject is free to move while interacting with a robot. Occurrences of gestures were manually annotated to serve as ground truth. To deal with online classification, and following [2], input data were provided to the algorithms by considering sliding windows of length $L = 35$ samples, with one sample shift.
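The windowing of a data stream with one sample shift can be sketched as below. The generator name `sliding_windows` and the synthetic single-channel stream are assumptions for illustration; a real stream would carry one such sequence per inertial channel.

```python
from typing import Iterator, List, Sequence


def sliding_windows(
    samples: Sequence[float], length: int, shift: int = 1
) -> Iterator[List[float]]:
    """Yield consecutive windows of fixed length, each shifted by `shift` samples."""
    for start in range(0, len(samples) - length + 1, shift):
        yield list(samples[start:start + length])


# Synthetic stream of 40 samples, windowed with length 35 and one sample shift.
stream = [float(i) for i in range(40)]
windows = list(sliding_windows(stream, length=35))
```

A stream of n samples yields n - L + 1 windows of length L, so consecutive windows overlap in all but one sample; this is why a single gesture is typically seen by the classifier in many consecutive windows.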
As a result, the test set includes $N_{\text{test}} = 200$ gestures performed with the subjects moving in a natural manner between consecutive gestures (e.g., walking, waving, and drinking). These gestures were performed by 12 subjects, namely $S_1, S_2, \dots, S_{12}$, different from the one contributing to the training set. They were not told which movements to execute between gestures, but were left free to choose. An equal number of the five gestures under consideration was included. The composition of the test set is detailed in Table II.
All the algorithms were implemented in a multiclass configuration. The detection of a gesture is accepted only if the confidence of the classifier is above a certain threshold that can be tuned in order to balance false positives against false negatives. To this end, different thresholds have been considered. In the case of RF, performance has been computed considering three different thresholds, namely $\sigma_1$, $\sigma_2$, and $\sigma_3$, on the predicted probabilities for each class, to be used as a confidence level for gesture recognition. Specifically, for each input sequence, $N_{\text{gest}}$ predicted probabilities $P_i$, $i = 1, \dots, N_{\text{gest}}$, are given as output, each representing the likelihood for that sequence to contain one of the considered gestures. For each threshold $\sigma_j$, with $j = 1, 2, 3$, the gesture $i$ with the largest predicted probability is then detected if $P_i > \sigma_j$. In a similar manner, for SVM, a one-versus-one approach for multiclass classification is selected and five thresholds are considered, denoted as $\theta_k$, $k = 1, \dots, 5$. A score for each sample in relation to each gesture is provided and the threshold $\theta_k$ is set with respect to such score. The results achieved for RF and SVM with the different thresholds $\sigma_j$ and $\theta_k$ ($j = 1, 2, 3$ and $k = 1, \dots, 5$) are reported in Section VI. A threshold is considered also in the case of KNN: it was set to $\tau = 1$ for any value considered for $K$, and, hence, is not reported in Section VI. Alternatively, when dealing with a specific case study, the optimal threshold can be set with a cross-validation step.
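The thresholding of per-class probabilities for a single window can be sketched as follows. The threshold values and the probability vector are made up for illustration, and the function name `detect` and the ordering of the gesture labels are assumptions.

```python
from typing import Optional, Sequence

# Gesture labels of the use case; the ordering is an assumption for this sketch.
GESTURES = ("up", "down", "circle", "left", "right")


def detect(probabilities: Sequence[float], threshold: float) -> Optional[str]:
    """Return the most likely gesture if its probability exceeds threshold, else None."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best] > threshold:
        return GESTURES[best]
    return None


# Hypothetical classifier output for one sliding window.
probabilities = [0.05, 0.10, 0.70, 0.10, 0.05]
accepted = detect(probabilities, threshold=0.5)   # "circle" passes a permissive threshold
rejected = detect(probabilities, threshold=0.8)   # same window, stricter threshold
```

Raising the threshold turns some detections into rejections, which trades false positives for false negatives.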
Finally, as introduced in Section III-B, a gesture was detected only if the same outcome was predicted by a classifier throughout several consecutive sliding windows. Given the average duration of the considered gestures (reported in Table II) and the approximate sampling frequency of the smartwatch of about 25 Hz, we set such number of consecutive sliding windows to 25.

D. Performance Metrics
A set of established metrics has been computed to compare the performance of the considered algorithms [40], [41]. The selection of these metrics was guided by the need to guarantee high recognition performance in general, while taking into account specific HRI constraints as well. We name Recall the percentage of gestures in the test set that have been correctly detected by the classifier, whereas Precision is the percentage of predicted gestures that are correct. These two measures account for different kinds of error: namely, Recall considers false negatives since it takes into account those gestures that are not recognized by the system, whereas Precision considers false positives, which are gestures that the system wrongly recognizes. The latter are much more dangerous in HRI, where a gesture starts a command to the robot, as highlighted by constraint C5 in Section III-A. Hence, when a compromise between Precision and Recall is needed, it is important that Precision is as high as possible. Low Precision means that the risk of providing unintended commands to the robot is high, which is clearly unacceptable. Conversely, a false negative in HRI means that the user has performed a gesture to command the robot, but this command was not received by the robot. Although annoying for the user, this circumstance is less dangerous than the previous one, since it does not have direct consequences in terms of unintended behavior of the robot.
Precision and Recall are usually combined in the $F_\beta$ score, defined as

$$F_\beta = (1 + \beta^2) \, \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \, \text{Precision} + \text{Recall}}$$

The parameter $\beta$ quantifies the importance of Recall over Precision and is typically set to 1. Since, in our context, it is more important that Precision, rather than Recall, is large, we set $\beta = 0.5$, to attribute more importance to Precision over Recall [40], [41]. Thus, in our analysis, we consider the $F_{0.5}$ score, following constraint C5.
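The effect of β can be checked with a short computation based on the standard definition F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall). The Precision and Recall values below are made up for illustration and are not results from the experiments.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta score; beta < 1 weighs Precision more, beta > 1 weighs Recall more."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # convention used here to avoid a division by zero
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two hypothetical classifiers with swapped Precision and Recall.
high_precision = f_beta(precision=0.9, recall=0.5, beta=0.5)
high_recall = f_beta(precision=0.5, recall=0.9, beta=0.5)

# With beta = 1 the score is symmetric and cannot tell the two apart.
f1_a = f_beta(precision=0.9, recall=0.5, beta=1.0)
f1_b = f_beta(precision=0.5, recall=0.9, beta=1.0)
```

With β = 0.5, the classifier with higher Precision obtains the higher score, whereas the two are indistinguishable under the symmetric case β = 1.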
In order to better highlight the difference between the two main error categories (i.e., missed or misclassified gestures) and comply with constraint C5, we introduce two additional metrics. The misclassification gesture rate (MGR) is the percentage of real gestures that are assigned to the wrong category, whereas the undetected gesture rate (UGR) is the percentage of real gestures that are not detected. We hereby remark that Recall + MGR + UGR = 1. Following the line of abovementioned reasoning, it is more important to minimize MGR than UGR, since the former accounts for unintended swaps among commands to the robot, whereas the latter refers to the need, for the user, to repeat a gesture.
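These three rates can be computed from the outcome of each annotated gesture, as in the sketch below. The function name `gesture_rates` and the toy outcomes are assumptions; note that Precision is not covered here, since it also depends on detections that occur when no gesture was performed.

```python
from typing import Dict, Optional, Sequence


def gesture_rates(
    truth: Sequence[str], predicted: Sequence[Optional[str]]
) -> Dict[str, float]:
    """Recall, MGR and UGR over the real gestures.

    truth[i] is the performed gesture; predicted[i] is the gesture reported by
    the system for it, or None if nothing was detected.
    """
    total = len(truth)
    correct = sum(1 for t, p in zip(truth, predicted) if p == t)
    undetected = sum(1 for p in predicted if p is None)
    misclassified = total - correct - undetected
    return {
        "recall": correct / total,
        "mgr": misclassified / total,
        "ugr": undetected / total,
    }


# Ten hypothetical gestures: one swapped with another class, two missed.
truth = ["up", "down", "circle", "left", "right"] * 2
predicted = ["up", "down", "circle", "right", None, "up", None, "circle", "left", "right"]
rates = gesture_rates(truth, predicted)
```

Each real gesture falls into exactly one of the three outcomes, which is why the three rates sum to one.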
Finally, to guarantee real-time gesture recognition (see constraint C2 in Section III-A), algorithms have been compared also in terms of computational burden. To this end, the classification time has been considered. With respect to Fig. 1, this amounts to considering the time required from the selection of a sliding window (dashed rectangles on the left-hand side) to the output of classification algorithms for that window (boxes "Gesture recognized?" in the figure).
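A simple way to measure such a per-window classification time is sketched below. The helper name `mean_classification_time` and the stand-in classifier are hypothetical; in a real measurement, the `classify` callable would include both feature extraction and the trained classifier.

```python
import time
from statistics import fmean
from typing import Callable, List, Sequence


def mean_classification_time(
    classify: Callable[[Sequence[float]], object],
    windows: Sequence[Sequence[float]],
) -> float:
    """Average wall-clock time, in seconds, from window selection to classifier output."""
    durations: List[float] = []
    for window in windows:
        start = time.perf_counter()
        classify(window)  # feature extraction and prediction would happen here
        durations.append(time.perf_counter() - start)
    return fmean(durations)


# Stand-in classifier and synthetic windows of 35 samples, for illustration only.
windows = [[float(i + k) for k in range(35)] for i in range(100)]
mean_time = mean_classification_time(lambda w: sum(w) > 0.0, windows)
```

As a rough budget inferred from the sampling rate, rather than a figure reported in this article, a one-sample shift at about 25 samples per second produces a new window roughly every 40 ms, so the per-window time would need to stay below that value for the pipeline to keep up with the stream.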

VI. ANALYSIS OF THE EXPERIMENTAL DATA
In this section, we analyze and discuss the experimental data collected by implementing the proposed pipeline in the use case introduced in Section V-A. In particular, besides comparing the performance of the four considered classical machine learning approaches with different feature sets, we focus on the generalization capabilities of the classifiers across different subjects, with varying levels of familiarity with the selected gestures and of inclusion in the training set. The idea behind this analysis is to assess the amount of user contribution, in terms of participation in the training set or acquaintance gained with the gestures, that is needed to achieve satisfactory gesture recognition performance. To this end, we consider three different conditions:
a) subjects with different familiarity with the considered gestures, whose data were not included in the training set;
b) a subject with previous experience in the use of the selected gestures, but whose data were not included in the training set;
c) the same subject as in the previous condition, whose data contributed, in small part, to the training set.

a) Multiple subjects, expert and novel, not included in the training set: Table IV reports the performance of the classification stage, for the different algorithms considered, in the case of subjects with and without prior familiarity with the considered set of gestures. With reference to Table II, numbers in Table IV refer to 200 gestures performed by all the subjects in the test set (from $S_1$ to $S_{12}$). We remark that these subjects were not represented in the training set and had received different amounts of instruction on how to execute the gestures. Specifically, subject $S_1$ had previous experience with the considered set of gestures, whereas all the others had none and received little training shortly before recording the test set.
The table shows that satisfactory results can be achieved with RF (with features $F_3$) for Precision and MGR, which are the most relevant metrics for HRI applications, as discussed in Section V-D. Nevertheless, performance achieved for the other metrics is extremely low. The performance achieved with the other algorithms is quite poor, too.
b) Expert subject, not included in the training set: To improve the performance of gesture recognition, we limited the test set to gestures performed by an expert subject with prior experience in using them, namely $S_1$. Indeed, although the considered gestures consist of simple and natural movements, they are meant as sharp movements, as described in Section V-A, and require some experience to get familiar with movement speed and initial and final positions. Table V reports the results achieved in this condition. As for the previous analysis, the subject contributing to the training set was not included in the test set. The table shows that a notable increase in performance is achieved. However, classification performance is still quite low for all the considered algorithms. In particular, while very good performance can be achieved in terms of Precision and MGR, Recall is still quite low. Satisfactory Recall is achieved with NNs (Model 1 and features $F_1$), at the expense of low Precision and $F_{0.5}$ score. None of the considered algorithms returns satisfactory results for both Recall and Precision.

c) Expert subject, included in the training set: As a further attempt to improve the performance of gesture recognition, we enlarged the training set by adding some gestures performed by the same subject as in the test set. In particular, with reference to Table II, the training set was augmented with 100 new gestures (20 per type) by subject $S_1$. Regarding the test set, we considered 90 gestures performed by $S_1$ [i.e., the same as in condition b)]. Table VI reports the performance achieved in this condition. The best classification performance is achieved with SVM with features $F_1$, which proves successful for all the considered metrics. All the thresholds $\theta_k$, with $k = 1, \ldots, 4$, return similar results, whereas $\theta_5$ performs worse.
Considering the $F_{0.5}$ score, which is the weighted combination of Recall and Precision, threshold $\theta_1$ returns the highest value. RF provides slightly better performance for Precision and MGR than SVM, but the $F_{0.5}$ score for RF is quite low. Good performance, in terms of Recall, UGR, and MGR, is also achieved with NNs with features $F_1$, although Precision and $F_{0.5}$ score are quite low. In summary, the best overall performance is achieved with SVM with threshold $\theta_1$ and features $F_1$.

TABLE IV. CLASSIFICATION PERFORMANCE FOR GESTURES PERFORMED BY MULTIPLE SUBJECTS, EXPERT AND NOVEL, NOT INCLUDED IN THE TRAINING SET

TABLE V. CLASSIFICATION PERFORMANCE FOR GESTURES PERFORMED BY AN EXPERT SUBJECT, NOT INCLUDED IN THE TRAINING SET
Building upon this result, we tested this classifier considering gestures performed during intense motion conditions or moderate physical activity. To this end, subject $S_1$ was asked to perform $N_{\text{test}}^{\text{IM}} = 20$ gestures during running and leg-skipping sessions and arm circumductions. These movements were intended as spurious noise for the classification algorithms. Nevertheless, SVM, with threshold $\theta_1$ and features $F_1$, returned quite good performance also in this extreme motion condition: R = 100%, P = 57.1%, UGR = 0.0%, MGR = 0.0%, and $F_{0.5}$ score = 87.0%. While Recall is very high, the effect of intense motion can be seen in the lower Precision, due to an increased number of false positive gestures.

A. Discussion
Tables IV-VI report the classification performance achieved considering a possible implementation of the proposed pipeline. To this end, we considered four standard classification algorithms and implemented them with different parameters. The results show that, to achieve satisfactory classification performance, the user needs to receive some sort of training on how to execute the selected gestures. In particular, even if natural and easy movements are considered, the user should practice them, in order to get familiar with range and speed of motion and other specific features. To this end, including some amount of user-specific training in the classification algorithm is beneficial. It is noteworthy that this is not in contrast with constraint C4 of Table I. Indeed, in Table VI, we considered the case in which the algorithms were trained in large part with data from a subject not in the test set and, in small part, with data from the same subject as in the test set. This is customary in devices running, for example, algorithms for voice or handwritten text recognition, which usually require some fine-tuning by the user in charge.

TABLE VI. CLASSIFICATION PERFORMANCE FOR GESTURES PERFORMED BY AN EXPERT SUBJECT, INCLUDED IN THE TRAINING SET

TABLE VII. AVERAGE COMPUTATIONAL TIME FOR GESTURE CLASSIFICATION ON A SINGLE TIME WINDOW
As Table VI shows, if the user is familiar with the selected set of gestures and the algorithms are also trained with their data, classification performance becomes highly satisfactory and allows an efficient use of gestures in HRI. This is also confirmed by the classification performance achieved in the intense motion condition: although this represents a quite unusual condition in HRI, these results show that gestures can be used regardless of any other task the user might be carrying out while interacting with a robot.

B. Analysis of Computational Burden
The computational burden of the proposed pipeline was assessed to verify its applicability to real-time gesture recognition (constraint C2 in Table I). To this end, we computed the average classification time required by each of the considered algorithms. The measured times, averaged over 5000 sliding windows, are reported in Table VII. We make two observations. First, according to the decision rule presented in Sections III-B and V-C, a gesture is detected only if it is recognized in 25 consecutive sliding windows. In other words, the recognition of a gesture requires that 25 windows have elapsed. Hence, since the times in Table VII refer to classification for a single time window, the recognition of a gesture implies a delay, from the beginning of its execution, that is 25 times the one reported in the table. Second, for ease of computation, we implemented the classification stage on an external computer running Python. Computation times reported in Table VII refer to this setting. In a real-world operational setting, a hardware-driven software implementation should be considered, possibly relying on the computational capacity of the robot. Table VII shows that all the algorithms are quite fast, with SVM being the fastest among those considered. Even considering the need to process 25 consecutive windows, the delay introduced by the classification operation does not affect the fluency of gesture-based interaction. As a result, the selection of the most suitable algorithm for the proposed classification can be guided by classification performance only.
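The consecutive-window decision rule recalled above can be sketched as follows. This is a simplified illustration that assumes each sliding window has already been assigned a single label (or `None` when no gesture is recognized) and that a detection is raised once, when a run of identical labels reaches the required length. The function name `detect_gestures` and the gesture labels are hypothetical.

```python
def detect_gestures(window_labels, n_consecutive=25, no_gesture=None):
    """Return (window_index, label) pairs for each confirmed gesture.

    A gesture is confirmed when the same label has been assigned to
    n_consecutive windows in a row; the index is that of the last window
    of the confirming run.
    """
    detections = []
    current, streak = no_gesture, 0
    for index, label in enumerate(window_labels):
        if label == current:
            streak += 1
        else:
            current, streak = label, 1
        # Fire exactly once per run, when it reaches the required length
        if current != no_gesture and streak == n_consecutive:
            detections.append((index, current))
    return detections


# Hypothetical stream of per-window labels
stream = [None] * 10 + ["up"] * 24 + [None] * 5 + ["down"] * 30
print(detect_gestures(stream))
```

In this hypothetical stream, the run of 24 "up" windows is one window short and is discarded, whereas the run of "down" windows is confirmed at its 25th window. Short bursts of spurious labels are therefore filtered out, at the cost of the recognition delay discussed above.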

VII. CONCLUSION
In this article, we considered the problem of forearm gesture recognition for HRI. The ultimate goal is to provide commands to robots by means of intuitive gestures. To this end, we proposed a pipeline for gesture recognition specifically designed for HRI applications. As input data, we consider wrist inertial movements, which can be recorded with any commercial device mounting an IMU. Gesture detection and classification are performed by a classification stage that relies on machine learning algorithms. To this end, we compared the performance of several machine learning algorithms used in classification problems: RF, KNN, SVM, and NN. An extensive evaluation was performed, including an analysis of computational burden for real-time gesture recognition. Different performance metrics were introduced to provide a thorough assessment of the algorithms and to highlight the specific needs of gesture recognition in the HRI context. In particular, we highlighted that the cost of false positives and misclassifications (wrongly recognized gestures) is much higher than that of false negatives (gestures not detected at all), thus motivating the need for models with high Precision, even at the cost of lower Recall.
An evaluation use case was selected to show an implementation of the proposed pipeline. It consisted of five gestures, recorded with a smartwatch. In total, 12 subjects were included in the validation, with different levels of familiarity with the selected gestures. First, we analyzed whether general training of the classification algorithms, as opposed to user-tailored training, could be sufficient to achieve satisfactory recognition performance. Then, to improve recognition performance, we considered the need to train the algorithms also with a small amount of data from the subject in charge.
As a concluding remark, it is noteworthy that the proposed pipeline is general. Indeed, it can be implemented with other classification algorithms or with other parameters. In addition, it can be used to interact with any kind of robot and with other sets of gestures.