Gaze and Environmental Context-Guided Deep Neural Network and Sequential Decision Fusion for Grasp Intention Recognition

Grasp intention recognition plays a crucial role in controlling assistive robots to aid older people and individuals with limited mobility in restoring arm and hand function. Among the various modalities used for intention recognition, eye-gaze movement has emerged as a promising approach due to its simplicity, intuitiveness, and effectiveness. Existing gaze-based approaches insufficiently integrate gaze data with environmental context and underuse temporal information, leading to inadequate intention recognition performance. The objective of this study is to address these deficiencies and establish a gaze-based framework for object detection and its associated intention recognition. A novel gaze-based grasp intention recognition and sequential decision fusion framework (GIRSDF) is proposed. The GIRSDF comprises three main components: gaze attention map generation, the Gaze-YOLO grasp intention recognition model, and sequential decision fusion models (HMM, LSTM, and GRU). To evaluate the performance of GIRSDF, a dataset named Invisible, containing data from healthy individuals and hemiplegic patients, is established. GIRSDF is validated by trial-based and subject-based experiments on Invisible and outperforms previous gaze-based grasp intention recognition methods. In terms of running efficiency, the proposed framework runs at about 22 Hz, which ensures real-time grasp intention recognition. This study is expected to inspire additional gaze-related grasp intention recognition works.


Grasp Intention Related Contents
The gaze map is generated by the aligned gaze point set $\bar{G}_n$.
$l_s$: Sliding window size for gaze map generation.
$l_z$: Number of gaze points in the sliding window $l_s$.
$l_w$: Sliding window size for sequential fusion.
$\lambda_j$: HMM model, $\lambda_j = (A_j, B_j, \pi_j)$, $j = 1, 2$. $\lambda_1$ is the HMM model for IT and $\lambda_2$ is the HMM model for IA.
$A_j$: The transition probability matrix.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

$B_j(n)$: The emission probability matrix given observation $o_n$. $B_j(n) = [p_j(o_n \mid i_{n,j} = 0), \ldots, p_j(o_n \mid i_{n,j} = m_j)]$.
$p_j(o_n \mid i_{n,j})$: The emission probability of observing sample $o_n$ given the latent state $i_{n,j}$.
$O$: The set of possible observations of the two HMMs. $O = \{o_0, \ldots, o_n\}$.
$o_n$: The observation of the HMM, i.e., the input sample of Gaze-YOLO. $o_n$ is composed of the scene image and the gaze map concatenated in the channel dimension, which can be expressed as $o_n = f_n \oplus gm_n$.

I. INTRODUCTION
Upper limb assistive robots, such as prostheses [1], supernumerary robotic limbs [2], and exoskeletons [3], can help elderly and infirm people with upper limb disabilities restore arm and hand functions. However, a barrier to using these assistive technologies is the lack of appropriate human-robot interaction (HRI) that allows people to express their grasp intentions intuitively and naturally. Many studies have addressed upper limb intention recognition and prediction using electromyography (EMG) signals [4], electroencephalogram (EEG) signals [5], etc. Although these biosignals can be used effectively for intention recognition, they cannot be applied to all populations, such as stroke patients [6].
Eye-tracking is an emerging technology for estimating users' gaze points [7]. Even in severe hemiplegia and other motor disorders, the human oculomotor system typically remains intact [8], [9], making eye-tracking accessible to users with disabilities. It has been shown that gaze is related to intention and that a person's gaze can express intention and anticipate actions [10], [11]. The natural and intuitive link between intention and gaze makes gaze a promising basis for grasp intention recognition. In grasp intention recognition, two attributes require attention. One is the intentional target (IT), which indicates the object that the user is interested in and viewing. The other is the intentional action (IA), which indicates whether the user has a grasp intention on this object [12], [13], [14], [15], [16], [17]. With IT and IA, the assistive robot can help the user with the grasping task.
Typically, studies on gaze-based upper limb assistive robots focused on two aspects: 1) gaze point estimation and trajectory planning and 2) intention recognition. The first category of studies explored gaze points to determine target position and plan robots' movement trajectories [9], [14], [15], [17], [18], [19], [20]. Faisal et al. proposed a 3D gaze calibration method utilizing continuous robotic arm trajectories [18]. Furthermore, Faisal et al. implemented a grasp assist task by leveraging the estimated 3D gaze points [14], [18]. Wang et al. proposed a method that combines a depth camera and an eye-tracker to estimate 3D gaze points [15]. However, the estimation of 3D gaze in real environments has certain limitations, such as being sensitive to the user's head motion or requiring additional optical devices to track head motion. Chen et al. developed a lightweight multi-model network for appearance-based eye gaze tracking. Their method fused eye and head features to improve gaze estimation accuracy and achieved a 27× speedup [20]. Yang et al. introduced a set-membership filter based on eye-movement modality, which effectively improves the gaze signal quality [21]. Most of these studies primarily utilized machine learning techniques to estimate gaze points based on eye features or incorporated additional depth cameras to provide depth information. However, these methods do not resolve the problem of recognizing the user's grasp intention, including IT and IA.
The second category of studies mainly focused on intention recognition. Li et al. constructed a naive Bayesian model for intention inference based on the correlation between objects and intentions [22]. Koochaki et al. used density-based spatial clustering of applications with noise (DBSCAN) to extract gaze features and infer intention [23]. Such studies are only suitable for inferring intentions in activities of daily living (ADL), not for grasp intention recognition. Other studies focused on the foundation of intention recognition: grasp intention recognition. In [15] and [16], fixations with dwell times longer than two seconds were utilized to determine the grasp intention. In [24], a network based on the egocentric view, termed VIDEO-Net, was introduced to recognize the IA, but it did not determine the IT. In [12], a weakly-supervised network was used to recognize IT, and then an extra long short-term memory (LSTM) network was used to identify IA. Shi et al. exploited the Earth Mover's Distance (GazeEMD) to evaluate the similarity between gaze points and target saliency to determine the IT [25]. In [26], a gaze point motion model, TAGMM, was used to process the gaze data, and then multiple features were proposed for identifying IA and IT.
While these studies have produced positive outcomes, they still have some issues. The first issue is that gaze and environmental context (scene image) interact minimally and are not efficiently integrated. Typically, these algorithms require target detection techniques to detect objects in the scene, followed by the construction of a set of features for intention recognition based on the object coordinates and the gaze points. As a result, intention recognition is influenced by object detection performance and gaze data quality. For example, gaze points that fall outside the bounding box due to noise but are close to its boundaries may still indicate that the user's IT is the object. However, a method relying on the bounding box would misclassify this object as not being the IT. The ideal approach is to fuse gaze data and scene images to extract effective features for identifying intention. Convolutional neural networks have proven to be a powerful technique for extracting features in image tasks. However, due to dimensional inconsistencies, discrete gaze point coordinates are difficult to feed into a 2D convolutional network. Moreover, a gaze point only provides information about a single pixel, which does not reflect the actual characteristics of human vision: human eyesight covers an area rather than a single point. Consequently, an effective approach to extract gaze and scene features and perform intention recognition is lacking.
Another issue is the underuse of temporal information, which may result in a lack of accuracy in gaze-based grasp intention recognition. As previously reported, existing gaze-based grasp intention recognition methods had an accuracy of less than 76% [12], [25]. Human upper limb intentions are continuous in a grasping task: the user's gaze usually stays on the object until the grasping action is completed. Therefore, a sequence model could be established to fuse temporal information and increase grasp intention recognition performance, which has been proven feasible in human behavior prediction [27], gesture recognition [28], and intention recognition [29]. Neural networks, such as LSTM and gated recurrent units (GRU), are effective approaches for fusing temporal information. While these models improve the accuracy of human intention recognition, they are data-intensive and require training. Consider that the grasp intention is classified into several classes, each of which can be described in probabilistic terms. Bayesian models provide an alternative method to fuse temporal information and optimize sequential decisions in scenarios involving probabilities. Additionally, the probabilistic model is interpretable.
From the previous analysis, the challenges in achieving gaze-based grasp intention recognition are as follows: 1) How to develop a framework for integrating gaze data and environmental context (scene images) to simultaneously detect IT and recognize the corresponding IA. 2) How to fuse sequential decisions to improve the accuracy of intention recognition.
Our objective is to establish a gaze-based framework for object detection and its associated grasp intention recognition based on multimodal information (gaze data and environmental context). To achieve this objective, we designed a gaze-based grasp intention recognition and sequential decision fusion framework (GIRSDF). This framework is composed of a gaze attention map generation method, a Gaze-YOLO network, and sequential decision models. The main contributions of the present paper include the following:
1) In terms of grasp intention recognition, a novel end-to-end deep neural network, Gaze-YOLO, is designed. This network employs gaze data and scene images as the inputs for scene object detection and corresponding intention detection.
2) In terms of Gaze-YOLO inputs, a gaze attention map generation method based on human visual properties is proposed to align the representation of gaze data with the scene image.
3) In terms of sequential decision optimization, a model that does not require training (HMM) and models that need training (LSTM and GRU) are constructed to fuse sequential decisions and improve intention recognition accuracy.
4) A dataset named Invisible is established. The dataset contains data from seven healthy individuals and two hemiplegic patients. The proposed framework's performance is evaluated on the dataset.
The rest of the paper is organized as follows: Section II describes GIRSDF. Section III introduces the experimental results of the proposed framework. Section IV presents the discussion. Section V concludes the paper.

II. THE METHODOLOGY
The proposed GIRSDF is shown in Fig. 1, including a gaze attention map generation approach, a Gaze-YOLO intention recognition network, and a sequential decision fusion model. The following discussion introduces each component of the grasp intention recognition framework.

A. Eye-Tracker Output
The eye-tracker (Pupil-Invisible: Pupil Labs, Berlin, Germany), which comprises eye cameras and a scene camera, delivers the user's two-dimensional (2D) gaze point coordinates on the scene image. The 2D gaze point coordinates are denoted as $g_n(k)$, where $n$ is the scene image index and $k$ is the gaze point index in one scene image. The scene camera samples at 30 Hz and the eye camera samples at approximately 200 Hz. Thus, a scene image contains multiple gaze points. A snapshot of the outputs of the eye-tracker is displayed in Fig. S1 in the Supplementary Document.

B. Gaze Attention Map Generation
During the use of the eye-tracker, the subject's head movement causes the gaze points to shift across different scene frames. The gaze points should be aligned to the same scene image to eliminate the effects of the subject's head movement. Consider a video clip $F = \{f_n, n = 1, \ldots, N\}$ and the gaze points $G_n = \{g_n(k), k = 1, \ldots, K_n\}$ associated with $f_n$, where $f_n$ is a scene image frame. We adopt Diaz's approach [12] to align the gaze points.
The utilization of gaze points encounters two difficulties. The first is that the gaze point coordinates cannot be fed into a 2D convolutional neural network for image processing. The second is that a gaze point only provides information about a single point, which is inconsistent with visual properties. As demonstrated in Fig. 2, when a person stares at a target, the eyesight is focused on a region of interest rather than a single location. The region's center is the area of greatest concern and interest as determined by the brain, and the attention progressively attenuates toward the surroundings [30]. To solve these two problems, we propose a method for generating gaze attention maps from gaze points.
Fig. 1. GIRSDF framework. This framework includes a gaze attention map generation approach, a Gaze-YOLO intention recognition network, and sequential models. Gaze-YOLO's input is the gaze attention map and the corresponding scene image; Gaze-YOLO completes the object detection in the scene and recognizes the intention for each object. The grasp intention results are converted into probabilities of different types of IT and IA, respectively. Then the sequential models fuse the probabilities of the current sample and previous samples to estimate the ultimate intention decision.

We utilize Gaussian functions for generating gaze attention maps to model the decay of human visual attention from the gaze point coordinates to the surrounding environment. Each gaze point $g_n(k)$ yields a gaze attention map $img_{n,k}$, a grayscale image whose pixel value in the $x$th row and $y$th column is $img_{n,k}(x, y)$. $\sigma^2$ is the Gaussian function's variance, which represents the attention decay rate, and $w_1$ is the gain factor. $L(\cdot)$ denotes the Euclidean distance. Assume an aligned gaze point set $\bar{G}_{buf} = \{\bar{G}_{n+1-l_s}, \ldots, \bar{G}_n\}$ contains $l_z$ gaze points, where $l_s$ is the size of the sliding window. The gaze attention map of $\bar{G}_{buf}$, denoted $gm_n$, is synthesized from these multiple gaze points. It corresponds to the reference scene image $f_n$. Fig. 2 depicts the procedure for generating the gaze attention map. The gaze attention maps mimic human vision and furnish a more detailed distribution of visual attention than a single gaze point.
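The gaze attention map generation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the unnormalized Gaussian form, the pixel-wise maximum used to combine the per-point maps, the (row, column) coordinate convention, and the function name `gaze_attention_map` are all assumptions made for illustration.

```python
import numpy as np

def gaze_attention_map(gaze_points, height, width, sigma=25.0, w1=255.0):
    """Synthesize one grayscale gaze attention map from aligned gaze points.

    gaze_points: iterable of (row, col) pixel coordinates (assumed convention).
    sigma: standard deviation of the Gaussian (attention decay rate).
    w1: gain factor, chosen here so that the peak value is 255.
    """
    rows = np.arange(height, dtype=np.float64)[:, None]  # shape (H, 1)
    cols = np.arange(width, dtype=np.float64)[None, :]   # shape (1, W)
    gm = np.zeros((height, width), dtype=np.float64)
    for r, c in gaze_points:
        # Squared Euclidean distance from every pixel to this gaze point.
        dist_sq = (rows - r) ** 2 + (cols - c) ** 2
        single = w1 * np.exp(-dist_sq / (2.0 * sigma ** 2))
        # Assumed fusion rule: pixel-wise maximum over the per-point maps.
        gm = np.maximum(gm, single)
    return gm

# Three nearby gaze points inside one sliding window (hypothetical values).
gm = gaze_attention_map([(40, 60), (42, 63), (41, 61)], height=120, width=160)
```

The resulting array has the same height and width as the scene image, which is what allows it to be stacked with the image as an extra channel.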
C. Gaze-YOLO
1) Network Architecture: In this subsection, we design an intention recognition network, Gaze-YOLO, inspired by YOLO [31], which is a high-efficiency network but is only applicable to object detection. A spatial pyramid pooling (SPP) module is integrated in Gaze-YOLO. The gaze attention map $gm_n$ and the corresponding scene image $f_n$ are concatenated in the channel dimension as the input of Gaze-YOLO. Gaze-YOLO predicts object boxes on three different scales. The input image is divided into grids on each scale. Then three prediction boxes are generated for each grid. For object detection, each box is responsible for detecting an object and is composed of bbox coordinates (i.e., $t_x$, $t_y$, $t_w$, $t_h$), an objectness score $p_{obj}$ (i.e., whether the box contains an object), and object class scores $p_{o,0}, \ldots, p_{o,m_1-1}$ (i.e., probabilities that the object belongs to different classes). Besides the object detection properties, four dimensions are added to the prediction box of each grid and denoted as $p_{vo}$, $p_{nv}$, $p_{go}$, and $p_{ng}$, respectively, as shown in Fig. 3. $p_{vo}$ determines the probability that the object is the IT, while $p_{nv}$ determines the probability that the object is not the IT. The probability of grasp intention on the object is determined by $p_{go}$, while the probability of no grasp intention on the object is determined by $p_{ng}$. vo and nv mean viewing object and not viewing object, respectively. go and ng mean grasping object and not grasping object, respectively. It is worth noting that if the subject has a grasp intention for the object, the object must be the IT.
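As a minimal sketch of the input and output layout described in this subsection, the snippet below stacks a scene image and a gaze attention map into a four-channel input and splits one prediction-box vector into its fields. The channel-last array layout, the ordering of the fields inside the vector, and the helper names `build_input` and `split_prediction` are assumptions for illustration; only the field list and $m_1 = 10$ come from the text.

```python
import numpy as np

M1 = 10  # number of object classes, as stated in the paper

def build_input(scene_rgb, gaze_map):
    """Concatenate an H x W x 3 scene image and an H x W gaze map along channels."""
    if scene_rgb.shape[:2] != gaze_map.shape:
        raise ValueError("scene image and gaze map must share height and width")
    return np.concatenate([scene_rgb, gaze_map[..., None]], axis=-1)

def split_prediction(box):
    """Split one prediction-box vector into named fields (assumed ordering)."""
    return {
        "bbox": box[0:4],          # t_x, t_y, t_w, t_h
        "p_obj": box[4],           # objectness score
        "p_cls": box[5:5 + M1],    # object class scores
        "p_vo": box[5 + M1],       # viewing object
        "p_nv": box[6 + M1],       # not viewing object
        "p_go": box[7 + M1],       # grasping object
        "p_ng": box[8 + M1],       # not grasping object
    }

scene = np.zeros((120, 160, 3), dtype=np.float32)
gaze = np.zeros((120, 160), dtype=np.float32)
x = build_input(scene, gaze)
# 4 bbox values + 1 objectness + M1 class scores + 4 intention scores = 19.
fields = split_prediction(np.arange(4 + 1 + M1 + 4, dtype=np.float32))
```

Under these assumptions each prediction box carries 19 values, four more than a detection-only box with the same number of classes.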
2) Loss Function: The proposed Gaze-YOLO model generalizes the loss function of YOLO by introducing the losses of grasp intention recognition. The entire loss function is shown in Eq. (4), where $L_{coord}$ is the localization loss, $L_{obj}$ is the object loss, and $L_{cls}$ is the class loss. $\alpha$ represents the gain factor of the losses. The object detection losses are inherited from YOLO.
In addition to the object detection losses, we design intention losses, including the IT loss $L_{IT}$ and the IA loss $L_{IA}$. Both use the binary cross-entropy loss. $S^2$ denotes the number of grids and $B$ denotes the number of prediction boxes generated by each grid. $I_{ij}^{obj}$ indicates whether the $j$th prediction box of the $i$th grid contains an object; its value is 1 if it does and 0 if it does not. $\hat{p}_i(\cdot)$ represents the intention label and $p_i(\cdot)$ represents the intention prediction.

Multiple prediction boxes are processed by Non-Maximum Suppression (NMS) to obtain the Gaze-YOLO output. In our work, there are a total of $m_1$ classes of objects numbered $0 \sim m_1 - 1$ ($m_1 = 10$). The prediction box with the highest score in each class is selected and output in the NMS process. The VO score vector $P_{vo} = [p_{vo,0}, \ldots, p_{vo,m_1-1}]$ and the NV score vector $P_{nv} = [p_{nv,0}, \ldots, p_{nv,m_1-1}]$ of all $m_1$ objects are utilized to determine the IT (i.e., whether or not the subject is looking at this object). When an object is missing from a scene image, the $p_{vo}$ corresponding to the missing object is set to 0 and the $p_{nv}$ is set to 1. The GO score vector $P_{go} = [p_{go,0}, \ldots, p_{go,m_1-1}]$ and the NG score vector $P_{ng} = [p_{ng,0}, \ldots, p_{ng,m_1-1}]$ are utilized to determine the IA (i.e., whether or not the subject wants to grasp this object). As shown in the lower-left part of Fig. 3, "VO" denotes IT and "NV" denotes non-IT. "GO" means the subject wants to grasp this object and "NG" means the subject has no grasp intention for this object.

The user's IT recognized by Gaze-YOLO is denoted as $i_{n,1} \in Q_1 = \{0, \ldots, m_1\}$, where $0 \sim m_1 - 1$ indicate different IT types and $m_1$ indicates "no target". $i_{n,1}$ is computed from $P_{vo}$ and $P_{nv}$. The user's IA recognized by Gaze-YOLO is denoted as $i_{n,2} \in Q_2 = \{0, \ldots, m_2\}$, where 0 indicates the intention of grasping and $m_2 = 1$ indicates the intention of no grasping. $i_{n,2}$ is computed from $P_{go}$ and $P_{ng}$.
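The intention losses described in this subsection can be sketched as a binary cross-entropy summed over the prediction boxes that contain an object. This is only an illustration under stated assumptions: the exact form and weighting of the paper's loss are not reproduced here, and the dictionary-based data layout and the names `bce` and `intention_loss` are hypothetical.

```python
import math

def bce(label, pred, eps=1e-7):
    """Binary cross-entropy of one prediction against one label in [0, 1]."""
    pred = min(max(pred, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(label * math.log(pred) + (1.0 - label) * math.log(1.0 - pred))

def intention_loss(obj_mask, labels, preds):
    """Sum the BCE over boxes that contain an object.

    obj_mask[i][j] is 1 if box j of grid i contains an object, else 0.
    labels[i][j] and preds[i][j] are dicts keyed by score name,
    e.g. {"vo": ..., "nv": ...} for the IT loss (assumed layout).
    """
    total = 0.0
    for i, row in enumerate(obj_mask):
        for j, has_obj in enumerate(row):
            if not has_obj:
                continue  # boxes without an object do not contribute
            for key in labels[i][j]:
                total += bce(labels[i][j][key], preds[i][j][key])
    return total

# One grid with two boxes; only the first box contains an object.
mask = [[1, 0]]
labels = [[{"vo": 1.0, "nv": 0.0}, {"vo": 0.0, "nv": 1.0}]]
preds = [[{"vo": 0.9, "nv": 0.1}, {"vo": 0.2, "nv": 0.8}]]
loss_it = intention_loss(mask, labels, preds)
```

The same function would serve for the IA loss by passing dictionaries keyed by "go" and "ng" instead.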

D. Sequential Decision Fusion of Intention Recognition
Using only short-term information to recognize human intentions is not robust. For instance, blinks cause a sudden change in the gaze points, and head movement may cause the camera to capture erroneous images, both of which lead to errors in intention recognition. An intuitive approach is to fuse temporal information to improve the accuracy of intention recognition. Two HMMs, $\lambda_1 = (A_1, B_1, \pi_1)$ and $\lambda_2 = (A_2, B_2, \pi_2)$, are constructed to describe the transition relationships of IT and IA, respectively. An HMM $\lambda = (A, B, \pi)$ consists of a state transition probability matrix $A$, an emission probability matrix $B$, and an initial probability $\pi$. The initial state is not taken into account for sequential decision fusion in this work. The subject's IT is regarded as the latent state $i_{n,1}$. The subject's IA is regarded as the latent state $i_{n,2}$. The set of possible observations for both HMMs is denoted as $O = \{o_0, \ldots, o_n\}$, where $o_n$ is the observation, i.e., the input sample of Gaze-YOLO. $o_n = f_n \oplus gm_n$, where $\oplus$ represents the concatenation operation in the channel dimension. Since the two HMMs have a similar form, only the details of $\lambda_1$ are presented in the following paragraphs; they apply equally to $\lambda_2$.
The emission probability $p_1(o_n \mid i_{n,1})$ is calculated from the output of Gaze-YOLO with Eq. (8).
Here, $p_{vo,m_1}$ denotes the VO score of "no target". The emission probability matrix of observing sample $o_n$ is thus $B_1(n) = [p_1(o_n \mid i_{n,1} = 0), \ldots, p_1(o_n \mid i_{n,1} = m_1)]$. Since the input samples are determined at each instant, the emission probability matrix is determined and calculated based on the VO score. The estimated category of $i_{n,1}$ may not be robust due to blinks or blurred scene images. To make the system tolerant of errors, we introduce a smoothed state $s_{n,1} \in Q_1$ to substitute $i_{n,1}$, obtained by averaging the probability distribution over the sliding window.

In an HMM, the transfer between two adjacent latent states is characterized by a transition probability. The transition probabilities between different states constitute the transition probability matrix, which can be constructed from everyday experience. Empirically, we make the following assumptions on the transition probability matrices $A_1$ and $A_2$, whose elements $a_{ij}$ represent the transition probabilities from the previous state $i$ to the next state $j$.
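The sliding-window averaging used to obtain a smoothed state can be sketched as follows. This is a minimal illustration, assuming each frame supplies a probability distribution over the states and that the smoothed state is the arg max of the window mean; the class name `SlidingWindowSmoother` and the example values are hypothetical.

```python
from collections import deque

class SlidingWindowSmoother:
    """Average per-frame state distributions over the most recent frames."""

    def __init__(self, window_size=5):
        # deque with maxlen drops the oldest frame automatically.
        self.buffer = deque(maxlen=window_size)

    def update(self, probs):
        """Add one per-frame distribution; return (mean distribution, arg max state)."""
        self.buffer.append(list(probs))
        n = len(self.buffer)
        mean = [sum(frame[s] for frame in self.buffer) / n
                for s in range(len(probs))]
        state = max(range(len(mean)), key=mean.__getitem__)
        return mean, state

smoother = SlidingWindowSmoother(window_size=3)
# The middle frame is an outlier, e.g. caused by a blink.
frames = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.7, 0.2, 0.1]]
states = [smoother.update(f)[1] for f in frames]
```

In this example the per-frame arg max would flip to state 1 on the second frame, while the windowed average keeps state 0 throughout.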
For the transition probability matrix $A_1$ of IT, we have the following empirical rules:
• The probabilities of IT remaining at the same object are higher than the probabilities of switching to other objects ($a_{ii} > a_{ij}$, $i \neq j$).
• The probabilities of IT remaining at the same object are almost equal, and the probability of IT staying at no target is lower than the probabilities of staying at an object ($a_{ii} > a_{m_1 m_1}$, $i = 0, \ldots, m_1 - 1$).
• The probabilities of IT switching from no target to objects are almost the same, and they are higher than the probabilities of IT switching between different objects ($a_{m_1 j} > a_{ij}$, $i \neq j$, $i = 0, \ldots, m_1 - 1$, $j = 0, \ldots, m_1 - 1$).
• The probabilities of IT switching from objects to no target are the same, and they are higher than the probabilities of IT switching between different objects ($a_{i m_1} > a_{ij}$, $i \neq j$, $i = 0, \ldots, m_1 - 1$, $j = 0, \ldots, m_1 - 1$).
• The probabilities of IT switching between different objects are almost the same and are the lowest ($a_{ij}$, $i \neq j$, $i = 0, \ldots, m_1 - 1$, $j = 0, \ldots, m_1 - 1$).
For the transition probability matrix $A_2$ of IA, we have the following empirical rule:
• The probabilities of IA remaining at the same action are higher than the probabilities of switching to the other action ($a_{ii} > a_{ij}$, $i \neq j$).
The two transition probability matrices are then initialized as shown in Supplementary Document Fig. S3.
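One way to build a matrix that satisfies the empirical rules for $A_1$ is sketched below. The numeric values (0.90, 0.05, 0.50) are illustrative assumptions and are not the values used in the paper, which are given in Supplementary Document Fig. S3; the function name `build_it_transition_matrix` is also hypothetical.

```python
def build_it_transition_matrix(m1=10, stay_obj=0.90, obj_to_none=0.05,
                               stay_none=0.50):
    """Build an (m1 + 1) x (m1 + 1) row-stochastic matrix; index m1 is 'no target'."""
    # Remaining mass of an object row is shared equally by the other objects.
    switch_obj = (1.0 - stay_obj - obj_to_none) / (m1 - 1)
    # Remaining mass of the 'no target' row is shared equally by all objects.
    none_to_obj = (1.0 - stay_none) / m1
    A = []
    for i in range(m1):
        row = [switch_obj] * m1 + [obj_to_none]
        row[i] = stay_obj
        A.append(row)
    A.append([none_to_obj] * m1 + [stay_none])
    return A

A1 = build_it_transition_matrix()
```

With these illustrative values the ordering is: staying at an object (0.90), staying at no target (0.50), moving between no target and an object (0.05 in both directions), and switching directly between objects (about 0.0056).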
We used the modified Viterbi algorithm [27] to implement the sequential decision and estimate the smoothed state $s_n$. Because IT and IA are estimated in the same way, we only discuss the sequential decision for IT. The procedure has four steps. First, the posterior probability distribution of the last smoothed state $s_{n-1,1}$ is calculated with Eq. (11). Second, the smoothed state with the maximum probability is chosen as the latest smoothed state. Third, the posterior probability distribution of the current smoothed state is updated. Finally, the current smoothed state probability is normalized. The sequential decision fusion process is shown in Fig. 4 and the whole framework is summarized in Algorithm 1.
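To illustrate how a transition matrix and per-frame emission probabilities can be fused recursively, the sketch below uses the standard HMM forward (filtering) update: predict with the transition matrix, weight by the emission, then normalize. It is a generic stand-in and not the modified Viterbi algorithm of [27]; the function name `fuse_step`, the two-state matrix, and the emission values are assumptions.

```python
def fuse_step(prev_posterior, A, emission):
    """One recursive update: predict with A, weight by the emission, normalize."""
    n = len(prev_posterior)
    # Prediction: probability of reaching state j from any previous state i.
    predicted = [sum(prev_posterior[i] * A[i][j] for i in range(n))
                 for j in range(n)]
    # Correction: weight each predicted state by its emission probability.
    unnorm = [predicted[j] * emission[j] for j in range(n)]
    total = sum(unnorm)
    if total == 0.0:
        return list(prev_posterior)  # keep the previous belief if all evidence is zero
    return [u / total for u in unnorm]

A = [[0.9, 0.1], [0.1, 0.9]]   # hypothetical two-state transition matrix
posterior = [0.5, 0.5]         # uniform start (the paper ignores the initial state)
# The third frame is a glitch that favors the wrong state.
emissions = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.9, 0.1]]
decisions = []
for e in emissions:
    posterior = fuse_step(posterior, A, e)
    decisions.append(max(range(len(posterior)), key=posterior.__getitem__))
```

Because the belief accumulated over the first two frames outweighs the single glitch, the fused decision stays at state 0 on the third frame even though the raw emission favors state 1.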
To verify the effectiveness of sequential decision fusion, two neural networks, LSTM and GRU, were designed as comparisons. The outputs of Gaze-YOLO were converted into the probabilities of 10 (objects) × 2 (intentions) + 1 (intention-free) = 21 classes of intentions. These probabilities were used to train the LSTM and GRU. Both the LSTM and GRU consisted of an input layer, a hidden layer, and a fully-connected output layer with a softmax activation function. The input feature size and sequence size were set to 21 and $l_w$, respectively. The hidden layer size was set to 128, and the output layer size was set to 21 to output the probability of each type of intention.

III. EXPERIMENTS AND RESULTS
This study aims to discover the mechanism underpinning grasp intention recognition from gaze. Therefore, we collected data from healthy and hemiplegic subjects and conducted trial-based experiments and subject-based experiments.

A. Dataset and Experiment Setup
We conducted visually guided natural grasping and viewing experiments to establish datasets for grasp intention recognition. Seven healthy subjects and two hemiplegic subjects were recruited and instructed to wear an eye-tracker and perform tasks. Their gaze and actions were recorded to build the dataset. Each subject was asked to perform two categories of tasks: grasping and viewing. Gaze-based human-robot interaction often encounters the Midas touch problem [22], [25]. When a user attempts to interact with a target using gaze, two possibilities arise: either the user is "just looking at the target," or the user is "intending to interact with the target." To address the Midas touch problem in the gaze interface, subjects were also instructed to perform an intention-free viewing task. This task merely requires the subjects to look at the object without engaging in any intentional interaction. The data gathered during this task enabled GIRSDF to overcome the Midas touch problem. There were ten objects in our experiments. All participants signed an informed consent form that was approved by the ethical committee of UHCT (UHCT-IEC-SOP-016-03-01). Details of the experiments and information of datasets are shown in …

Two different experiments were conducted to verify the effectiveness of GIRSDF.
1) Trial-based experiments: Each subject completed repeated trials of each task. One repetition was selected as the test set, and the rest were used as the training set. To obtain statistically reliable results, we utilized a five-fold cross-validation procedure.
2) Subject-based experiments: One subject's data were used as the test set, and the remaining subjects' data were utilized as the training set. We conducted experiments with each subject left out in turn to verify the intention recognition framework.
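The subject-based protocol is a leave-one-subject-out split, which can be sketched as follows. The subject identifiers, the trial names, and the function name `leave_one_subject_out` are placeholders for illustration and do not reflect the actual structure of the Invisible dataset.

```python
def leave_one_subject_out(samples_by_subject):
    """Yield (test_subject, train_samples, test_samples) for each subject in turn."""
    for test_subject in sorted(samples_by_subject):
        train = [s for subj, items in samples_by_subject.items()
                 if subj != test_subject for s in items]
        test = list(samples_by_subject[test_subject])
        yield test_subject, train, test

# Nine hypothetical subjects with three hypothetical trials each.
data = {f"S{i}": [f"S{i}_trial{t}" for t in range(3)] for i in range(1, 10)}
folds = list(leave_one_subject_out(data))
```

Each of the nine folds trains on the eight remaining subjects, so no sample from the test subject is seen during training.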
We further investigated the impact of data size and diversity on sequential decision fusion. For data size, training and test sets were divided according to the trial-based experiments. Specifically, we utilized p% (p = 5, 15, . . ., 100) of the data from the training set to train the sequential model. For data diversity, training and test sets were divided according to the subject-based experiments. The data collected from different numbers (1 to 8) of subjects were used to train the sequential model. In addition, to verify the validity of the gaze attention map, we conducted both comparison and ablation experiments, the details of which are provided in Supplementary Document Section V.
The sliding window size $l_s$ was set to 18 for generating gaze attention maps. Furthermore, $w_1$ was set to a suitable value so that the maximum value of the gaze attention map pixels was about 255. $\sigma$ was set to 25 to reduce the pixel value to 0 at a diameter of 90 pixels centered on the gaze point, similar to the clear region of human vision. The sequential fusion sliding window $l_w$ was set to 5, taking about 0.5 seconds to initialize.

B. Object Detection Results of Gaze-YOLO
First, we evaluated the performance of Gaze-YOLO on object detection; the results are shown in Table I. The results show that the object detection performance of Gaze-YOLO is close to that of YOLO, with only a slight degradation (no statistical difference), while the network can additionally perform intention recognition for each object.

C. Statistical Analysis
Statistical tests were performed in groups. Metrics (accuracy, F1 score, and success rate) corresponding to different parameters (factors) are divided into groups, e.g., accuracy values for different numbers of neurons (e.g., 32 and 64). The trial-based experiment group size was 5 (5 folds), while the subject-based experiment group size was 9 (9 subjects). The Shapiro-Wilk test was initially conducted on each group of data to test whether it followed a normal distribution. For data following normal distributions, the analysis of variance (ANOVA) was applied to detect whether there was an overall significant difference. If an overall significant difference was found, the t-test was then conducted to perform pairwise comparisons (parameters versus reference parameters). For data that did not follow a normal distribution, a non-parametric test (Kruskal-Wallis H test) was performed to check for overall significant differences among groups. If an overall significant difference was found, the Wilcoxon signed-rank test was subsequently performed for pairwise comparisons. The differences were considered significant if p < 0.05.

TABLE II GRASP INTENTION RECOGNITION RESULTS IN THE TRIAL-BASED AND SUBJECT-BASED EXPERIMENTS. THE BOLDED DATA DENOTES THE OPTIMAL RESULTS, AND THE UNDERLINED DATA DENOTES THE SUBOPTIMAL RESULTS. ASTERISKS INDICATE SIGNIFICANT DIFFERENCES COMPARED WITH GIRSDF (HMM)

D. Trial-Based Grasp Intention Recognition
Three evaluation metrics are used: success rate, accuracy, and F1 score. The success rate quantifies the proportion of successful trials out of the total number of trials. A successful trial means that no errors occur from the moment when the correct intention is identified to the end of the trial. The performance of our GIRSDF framework was compared with other approaches. Additionally, three sequential decision fusion strategies were compared.
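The success rate defined above can be computed with a short sketch. It assumes that each trial is a sequence of per-frame predicted intention labels with a single ground-truth label, and that a trial in which the correct intention is never identified counts as a failure; the function names and the label values are hypothetical.

```python
def trial_succeeded(predictions, true_label):
    """True if no error occurs after the correct intention is first identified."""
    if true_label not in predictions:
        return False  # assumption: never identifying the intention is a failure
    first_hit = predictions.index(true_label)
    return all(p == true_label for p in predictions[first_hit:])

def success_rate(trials):
    """Proportion of successful trials; trials is a list of (predictions, label)."""
    outcomes = [trial_succeeded(preds, label) for preds, label in trials]
    return sum(outcomes) / len(outcomes)

trials = [
    ([20, 3, 3, 3], 3),   # correct from identification to the end: success
    ([20, 3, 5, 3], 3),   # an error after identification: failure
    ([20, 20, 20], 3),    # never identified: failure
]
rate = success_rate(trials)
```

Because a single wrong frame after identification fails the whole trial, this metric is stricter than frame-level accuracy, which is consistent with the emphasis the paper places on it when comparing fusion methods.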
The grasp intention recognition results of the trial-based experiments are shown in Table II and Fig. 5. The proposed GIRSDF outperforms other gaze-based grasp intention recognition methods in the trial-based experiments. The best accuracy achieved by GIRSDF is 94.15% (LSTM), and the best success rate of GIRSDF is 89.34% (HMM). The confusion matrix depicting the intention recognition results is displayed in Fig. 6. The utilization of various decision fusion methods eliminated some of the errors and increased the accuracy of most classes. Consequently, the overall success rate of the framework is improved. Here, only a comparison of two methods is presented. For additional methods and their respective confusion matrices, please refer to Fig. S5 and Fig. S6 in the Supplementary Document. The results indicate that all three sequential decision fusion methods significantly improve the performance of GIRSDF compared with Gaze-YOLO (all p < 0.01), especially the success rate. This improvement can be attributed to the ability of sequential decision fusion to correct unexpected intention recognition errors, which may arise from blurred scene images or outliers in the gaze points. Moreover, there is no significant difference between the results of the three sequential decision methods (accuracy: p = 0.93; F1 score: p = 0.93; success rate: p = 0.39). This result suggests that both the training-free HMM and the trained LSTM and GRU can optimize intention recognition and achieve comparable performance. Although the performance advantage is not obvious, GIRSDF with the HMM has a practical advantage over the LSTM and GRU variants: the HMM is simpler in design and has a lower computational burden.

E. Subject-Based Grasp Intention Recognition
The subject-based experiment results are provided in Table II. The accuracy and success rate of each subject are presented in Fig. 7. The proposed GIRSDF achieves the best accuracy of 88.12% (HMM) and the best success rate of 79.87% (GRU), indicating that the relationship between grasp intentions and gaze is similar across subjects, including the hemiplegic patients (H8 and H9). The results for hemiplegic patients H8 (acc: 86.77%; success rate: 90.63%) and H9 (acc: 81.72%; success rate: 80.00%) demonstrate the feasibility of using gaze to recognize grasp intentions for people who retain eye-movement control ability.
Although the performance of GIRSDF is degraded compared with the trial-based experiments, the proposed GIRSDF still outperforms other gaze-based grasp intention recognition methods. The degradation of grasp intention recognition performance in the cross-subject case is caused by a lack of data from the test subject when tuning the model.
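The cross-subject setting can be illustrated with a leave-one-subject-out split, in which no data from the test subject are available for training. This is a minimal sketch under the assumption that each subject in turn is held out while all remaining subjects are used for training. The function name and the dictionary layout are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_subject_out(
    samples_by_subject: Dict[str, List[object]],
) -> Iterator[Tuple[str, List[object], List[object]]]:
    """Yield (test_subject, train_samples, test_samples) for each subject."""
    for test_subject in sorted(samples_by_subject):
        # Training data come only from the other subjects.
        train = [
            sample
            for subject, samples in samples_by_subject.items()
            if subject != test_subject
            for sample in samples
        ]
        yield test_subject, train, list(samples_by_subject[test_subject])
```

Each fold keeps the held-out subject's samples entirely out of the training set, so the reported score reflects generalization to an unseen user.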

F. Intention Recognition Results in Healthy and Hemiplegic Subjects
In this subsection, the variability in intention recognition between healthy and hemiplegic subjects is analyzed. In the trial-based experiments, the results on healthy subjects are combined with those on the hemiplegic subjects H8 and H9, respectively, for statistical analysis. The results are presented in Table III.
In the trial-based experiments, no significant differences are found between the hemiplegic and healthy subjects for most metrics. Interestingly, the hemiplegic subjects exhibited higher accuracy, F1 score, and success rate than the healthy subjects, indicating greater similarity in their behavior. During our experiments, we observed that the hemiplegic patients demonstrated similar behaviors when performing the same task. In the subject-based experiments, the accuracy and F1 scores on H8 and H9 are lower than those on the healthy subjects, which can be explained in two ways. First, the hemiplegic patients may exhibit different visual behaviors during grasping and viewing tasks compared with the healthy individuals. Second, the calibration accuracy of the eye tracker varies between subjects, which results in differences in the recorded data even when the subjects' visual behaviors are similar. The closer a subject's calibration accuracy is to that of the healthy subjects, the more likely the recognition accuracy is to be high. However, only two hemiplegic patients participated in our experiment. To gain a deeper understanding of the dissimilarity between hemiplegic and healthy subjects, further research with a larger number of hemiplegic patients is needed.

G. Comparison of Data Size and Data Diversity
The average accuracy and success rates with different data sizes and diversities are shown in Fig. 8. When the percentage of training data is increased from 5% to 20%, the accuracy and success rate improve significantly (LSTM accuracy: p < 0.01, success rate: p < 0.01; GRU accuracy: p < 0.01, success rate: p < 0.01), as shown in Fig. 8(a) and (b). This result demonstrates that LSTM and GRU are sensitive to training data size. When the data size exceeds 20%, the performance improvement is not significant (LSTM accuracy: p = 0.08, success rate: p = 0.22; GRU accuracy: p = 0.07, success rate: p = 0.12). There is no overall significant difference (accuracy: p = 0.93; F1 score: p = 0.93; success rate: p = 0.39) between the performance of the three sequential decision fusion methods when 100% of the training data is utilized.
As shown in Fig. 8(c) and (d), the performance of LSTM and GRU is significantly improved when the training data of more than one subject is utilized (LSTM accuracy: p = 0.03, success rate: p < 0.01; GRU accuracy: p = 0.03, success rate: p = 0.02). The poor performance when the models are trained on only one subject's data may be due to the insufficient size of the training data. When the training data include more than two subjects, the performance of LSTM and GRU improves further, but the improvement is not statistically significant (LSTM accuracy: p = 0.75, success rate: p = 0.34; GRU accuracy: p = 0.70, success rate: p = 0.36). This indicates that further increasing data diversity (data from more subjects) has no appreciable effect on the intention recognition results. There is no significant difference (accuracy: p = 0.96; F1 score: p = 0.95; success rate: p = 0.98) between the performance of the three sequential decision fusion methods when all the remaining subjects are used as training subjects.
The experimental results demonstrate that LSTM and GRU require training and are data-intensive. When the training data is sufficient, LSTM and GRU perform well. Additionally, LSTM and GRU are not user-specific, and the models trained on various training subjects can be applied to the test subjects. HMMs utilize pre-built models that do not require explicit training and are suitable for situations with insufficient training data. A suitable sequential decision fusion method can be selected according to the training data size for good grasp intention recognition performance.

H. Time Consumption
The time consumption of the intention recognition framework is quantified and summarized in Table S1 in the Supplementary Document. The training time of Gaze-YOLO is approximately 14 min per epoch, while that of LSTM and GRU is approximately 11.7 s. HMMs do not require training. After training, the inference times of LSTM, GRU, and HMM are 3.4 ms, 3.39 ms, and 0.02 ms per frame, respectively. The results reveal that the HMM has the lowest computational cost among the three fusion methods. Including the gaze attention map generation, the proposed GIRSDF runs at a frequency of about 22 Hz, which can satisfy the real-time requirements.
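To put these numbers in proportion, the short sketch below converts the reported 22 Hz rate into a per-frame time budget and expresses each fusion model's inference time as a share of that budget. It only uses the figures quoted above; the function names are illustrative.

```python
FRAME_RATE_HZ = 22.0  # approximate overall rate reported above

# Per-frame inference times of the fusion models, in milliseconds.
FUSION_MS = {"LSTM": 3.4, "GRU": 3.39, "HMM": 0.02}


def frame_budget_ms(rate_hz: float) -> float:
    """Time available per frame, in milliseconds, at the given rate."""
    return 1000.0 / rate_hz


def budget_share(inference_ms: float, rate_hz: float = FRAME_RATE_HZ) -> float:
    """Fraction of the per-frame budget consumed by one inference step."""
    return inference_ms / frame_budget_ms(rate_hz)


if __name__ == "__main__":
    print(f"budget per frame: {frame_budget_ms(FRAME_RATE_HZ):.1f} ms")
    for name, ms in FUSION_MS.items():
        print(f"{name}: {100.0 * budget_share(ms):.2f}% of the frame budget")
```

At 22 Hz, the budget is about 45 ms per frame. LSTM and GRU inference each take roughly 7.5% of it, while the HMM takes well under 0.1%, so the fusion stage is a small part of the per-frame cost in all three cases.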

A. Datasets
In the Invisible dataset, ten representative kinds of objects were chosen. Different subjects, including healthy individuals and hemiplegic patients, participated in the experiment. Notably, our dataset contains only one object for each class. If multiple similar objects exist, it is difficult for the annotator to identify which of them is to be grasped during the annotation process. Therefore, different kinds of objects are selected to speed up the labeling process. All samples of a trial were assigned the same intention label. In practice, subjects may not gaze at the target object for a short period of time after the start or before the end of the trial (e.g., the user's gaze is not on the target object at first but moves quickly to it from elsewhere after locating it, and the user then performs the task). It is possible to reduce the weights of these samples to mitigate their potential effects on the training process.
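The sample down-weighting idea mentioned above could be realized, for example, by assigning a lower weight to frames near the trial boundaries. The following is only a sketch: the function name, the `margin` of five frames, and the `edge_weight` of 0.2 are hypothetical values chosen for illustration and do not come from the study.

```python
from typing import List


def trial_sample_weights(
    n_frames: int, margin: int = 5, edge_weight: float = 0.2
) -> List[float]:
    """Weights for the frames of one trial.

    Frames within `margin` frames of the start or end of the trial get
    `edge_weight`; all other frames get 1.0. Both values are hypothetical.
    """
    if n_frames < 0 or margin < 0:
        raise ValueError("n_frames and margin must be non-negative")
    weights = []
    for index in range(n_frames):
        near_start = index < margin
        near_end = index >= n_frames - margin
        weights.append(edge_weight if near_start or near_end else 1.0)
    return weights
```

Such weights could then be passed to a weighted loss during training, so that frames whose trial-level label may not match the actual gaze behavior contribute less.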

B. GIRSDF
Considering the possible connection between gaze and grasp intentions, we designed GIRSDF for grasp intention recognition. Trial-based and subject-based experiments on the Invisible dataset were organized to demonstrate the generalization performance and effectiveness of the framework. The proposed framework relies only on gaze to recognize grasp intentions and does not require the user to learn specific behaviors, which makes it promising for hemiplegic patients and the elderly.
The gaze maps generated by all the gaze points are combined to create the final gaze map, which effectively reduces the influence of abnormal gaze points. When outliers occur, the pixel values they produce in the generated gaze map are extremely low, with a maximum of roughly 2. These low values have a much smaller effect than the values generated from the normal gaze points, which makes Gaze-YOLO robust to outliers.
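Why combining per-point maps suppresses outliers can be seen in a small sketch. The combination rule, the Gaussian width, and the intensity scaling used in the study are not stated in this section, so the version below assumes an average of isotropic Gaussians scaled to a peak of 255. The function name and all parameter values are assumptions, and the sketch does not reproduce the reported outlier value of roughly 2.

```python
import math
from typing import List, Sequence, Tuple


def gaze_attention_map(
    gaze_points: Sequence[Tuple[float, float]],
    width: int,
    height: int,
    sigma: float = 20.0,
    peak: float = 255.0,
) -> List[List[float]]:
    """Average of isotropic Gaussians centered on each gaze point (x, y)."""
    if not gaze_points:
        raise ValueError("at least one gaze point is required")
    gaze_map = [[0.0] * width for _ in range(height)]
    # Each point contributes an equal share of the peak intensity.
    scale = peak / len(gaze_points)
    for gx, gy in gaze_points:
        for y in range(height):
            for x in range(width):
                dist_sq = (x - gx) ** 2 + (y - gy) ** 2
                gaze_map[y][x] += scale * math.exp(-dist_sq / (2.0 * sigma**2))
    return gaze_map
```

With nine gaze points clustered on the target and one distant outlier, the cluster accumulates most of the intensity, while the outlier location receives only the share of a single point. The more points the sliding window contains, the weaker an isolated outlier becomes.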
GIRSDF can be easily extended to handle multiple objects. By adding other object categories to the dataset (the non-intention samples are labeled as NV and NG), Gaze-YOLO can adapt to scenes containing more kinds of objects. The transition probability matrix of the HMM presented in this study is constructed using a fixed number of objects. It is possible to develop an adaptive HMM construction approach by combining the number of detected objects in the scene with the rules. The transition probability matrix is built on empirical rules that are interpretable and valid. Compared with LSTM and GRU models, HMMs impose a low computational burden and do not require training. With sufficient training data, LSTM and GRU may achieve better performance. A suitable sequential decision fusion method can be selected for optimal grasp intention recognition performance according to the training data size.
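An adaptive construction of the transition matrix could, for instance, take the number of states as an argument and apply one rule to every row. The empirical rules used in the study are not given in this section, so the sketch below substitutes a simple placeholder rule: stay in the current state with probability `stay_prob` and move to any other state with equal probability. The function name and the value 0.9 are hypothetical.

```python
from typing import List


def sticky_transition_matrix(
    n_states: int, stay_prob: float = 0.9
) -> List[List[float]]:
    """Row-stochastic matrix: stay with stay_prob, else move uniformly."""
    if n_states < 1:
        raise ValueError("n_states must be positive")
    if not 0.0 <= stay_prob <= 1.0:
        raise ValueError("stay_prob must lie in [0, 1]")
    if n_states == 1:
        return [[1.0]]
    # The remaining probability mass is shared among the other states.
    move_prob = (1.0 - stay_prob) / (n_states - 1)
    return [
        [stay_prob if i == j else move_prob for j in range(n_states)]
        for i in range(n_states)
    ]
```

In an adaptive setting, `n_states` would be derived from the objects detected in the current scene, and the matrix would be rebuilt whenever that number changes.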
It is notable that there is an apparent variation in the intention recognition accuracy in the trial-based experiments. This occurs because the visual behavior of subjects may vary across replicate trials, which makes the trials with similar behavior highly accurate and the rest less accurate. The results of grasp intention recognition validated the effectiveness of GIRSDF and demonstrated its applicability for assistive robot control. As reported in [14] and [15], the gaze-based grasp assistive robot will execute the grasping action after detecting successive identical intentions. With this premise, the success rate can reach 100%. The subject-based experiments further verify the generalization ability of GIRSDF and the existence of subject-to-subject similarity in gaze behavior. Even on hemiplegic subjects, satisfactory results are obtained. The generalization ability minimizes the need for new users' data and offers the possibility of recognizing the grasp intentions of new users, whose data are often difficult to obtain. In addition, there is no significant difference in the eye movements between healthy people and hemiplegic patients [34]. Therefore, it is possible to apply the trained model to patients.
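The triggering premise described above, executing the grasp only after successive identical intentions, can be sketched as a simple run-length check. The function name and the default of three repeats are assumptions; the number of repeats used in [14] and [15] is not stated in this section.

```python
from typing import Iterable, Optional


def first_trigger(
    intentions: Iterable[str], required_repeats: int = 3
) -> Optional[int]:
    """Index of the frame at which the same intention has been seen
    `required_repeats` times in a row, or None if that never happens."""
    if required_repeats < 1:
        raise ValueError("required_repeats must be at least 1")
    previous = None
    run_length = 0
    for index, label in enumerate(intentions):
        # Extend the current run or start a new one.
        run_length = run_length + 1 if label == previous else 1
        previous = label
        if run_length >= required_repeats:
            return index
    return None
```

A practical controller would additionally ignore runs of non-grasp labels. Requiring consecutive agreement trades a short delay for robustness, because isolated misclassifications reset the run instead of triggering an action.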

C. Limitations and Future Works
Although the proposed GIRSDF achieves the best grasp intention recognition results and good generalization, there are some limitations. First, as shown in the confusion matrices of Fig. S5 and Fig. S6 in the Supplementary Document, most grasp intention recognition errors are confusions between different IAs of the same IT (e.g., grasp cup and view cup are misidentified). This is because vision is typically capable of reliably recognizing the IT, but variations in the gaze signal can lead to IA recognition errors. Inspired by the use of EEG signals in intention detection [5], we plan to incorporate EEG signals to improve IA identification ability. Second, GIRSDF has not been applied to control the assistive robot. In practical applications, a depth camera or a pose detection network [35] will be utilized to determine the position of the intended target to accomplish the assistive grasping tasks.

V. CONCLUSION
In this work, a generic gaze-based framework, GIRSDF, is proposed for grasp intention recognition and sequential decision fusion. This framework consists of a gaze attention map generation approach, a Gaze-YOLO grasp intention recognition model, and sequential decision fusion models. A dataset, Invisible, containing healthy and hemiplegic subjects' data is established to validate the performance of GIRSDF. Trial-based and subject-based experiments demonstrate the framework's effectiveness and generalization ability for grasp intention recognition. The experimental results further reveal the similarity of different subjects' gaze behavior and grasp intention. Experiments on data size and data diversity illustrate the sensitivity of LSTM and GRU to data size. HMM employs pre-designed models that do not require training. The proposed framework can run at a frequency of about 22 Hz, which can satisfy the need for real-time intention recognition. Future work includes fusing EEG signals to improve intention recognition performance and applying GIRSDF to control the assistive robot for validation and evaluation.

Fig. 2. Gaze attention map generating process. The upper part represents the human visual attributes, and a Gaussian function is used to model this process; the lower part represents the gaze attention map generation process based on the Gaussian function. The gaze attention maps are represented as grayscale images (the AprilTags pasted on the table are not relevant to this paper).

Fig. 3. The left part represents the components in a Gaze-YOLO prediction box. Multiple prediction boxes in the left part are processed by NMS, and then the Gaze-YOLO output in the right part is obtained.

Fig. 4. The sequential decision fusion process. The observed samples are first used to compute the smoothed states, and the optimized state at the latest moment is then output.

with the max probability as the latest smoothed state (12);

Fig. 5. The accuracy and success rate in the trial-based experiments with the Invisible dataset. Error bars represent mean ± one standard deviation in five repetitions. Asterisks indicate significant differences compared with Gaze-YOLO. Seven healthy subjects (S1-S7) and two hemiplegic subjects (H8-H9) participated in the experiments.

Fig. 6. The confusion matrices for Gaze-YOLO and GIRSDF (HMM) in the trial-based experiments. G and V are the abbreviations of grasp and view. ITF is the abbreviation of intention-free.

Fig. 7. The accuracy and success rate in the subject-based experiments. Error bars for the overall results represent mean ± one standard deviation across subjects. Asterisks indicate significant differences compared with Gaze-YOLO.

Fig. 8. Accuracy and success rates for different data sizes and diversities. The shaded area indicates one standard deviation.

Index Terms-Grasp intention recognition, gaze, environmental context, object detection, hidden Markov model, deep neural networks.
The gaze map is generated by the gaze point $g_n(k)$.
$gm_n$
$n$  The $n$th scene image frame.
$g_n(k)$  $k$th unaligned gaze point on the $n$th frame. $g_n(k) = [g_{n,x}(k), g_{n,y}(k)]$.
$G_n$  Set of unaligned gaze points on the $n$th frame. $G_n = \{g_n(k) = [g_{n,x}(k), g_{n,y}(k)],\ k = 1, \ldots, K\}$.
$\bar{g}_n(k)$  $k$th aligned gaze point on the $n$th frame. $\bar{g}_n(k) = [\bar{g}_{n,x}(k), \bar{g}_{n,y}(k)]$.
$\bar{G}_n$  Set of aligned gaze points on the $n$th frame. $\bar{G}_n = \{\bar{g}_n(k) = [\bar{g}_{n,x}(k), \bar{g}_{n,y}(k)],\ k = 1, \ldots, K\}$.
$K$, IT transition probability: $a_{ij,1}$, IA transition probability: $a_{ij,2}$;
2: Initialize: Sequential fusion sliding window size $l_w$ and gaze map generation sliding window size $l_s$;
3: Output: Optimized smoothed state of the last time $s$
$P_{vo}, P_{nv}, P_{go}, P_{ng}$ = Gaze-YOLO($f_n$, $gm_n$);

TABLE III. RESULTS OF GRASP INTENTION RECOGNITION ON THE HEMIPLEGIC PATIENTS AND HEALTHY SUBJECTS. ASTERISKS INDICATE SIGNIFICANT DIFFERENCES (p < 0.05) FROM THE HEALTHY SUBJECTS. NO STATISTICAL ANALYSIS RESULTS ARE AVAILABLE FOR THE SUBJECT-BASED EXPERIMENTS.