Gaze-Based Shared Autonomy Framework With Real-Time Action Primitive Recognition for Robot Manipulators

Robots capable of robust, real-time recognition of human intent during manipulation tasks could be used to enhance human-robot collaboration for activities of daily living. Eye gaze-based control interfaces offer a non-invasive way to infer intent and reduce the cognitive burden on operators of complex robots. Eye gaze is traditionally used for “gaze triggering” (GT) in which staring at an object, or sequence of objects, triggers pre-programmed robotic movements. We propose an alternative approach: a neural network-based “action prediction” (AP) mode that extracts gaze-related features to recognize, and often predict, an operator’s intended action primitives. We integrated the AP mode into a shared autonomy framework capable of 3D gaze reconstruction, real-time intent inference, object localization, obstacle avoidance, and dynamic trajectory planning. Using this framework, we conducted a user study to directly compare the performance of the GT and AP modes using traditional subjective performance metrics, such as Likert scales, as well as novel objective performance metrics, such as the delay of recognition. Statistical analyses suggested that the AP mode resulted in more seamless robotic movement than the state-of-the-art GT mode, and that participants generally preferred the AP mode.


I. INTRODUCTION
Activities of daily living (ADLs) can be challenging for individuals with upper limb impairment. Assistive robotic arms can significantly increase one's functional independence by easing the performance of ADLs [1]. However, the direct control of robotic arms with numerous degrees-of-freedom (DOFs) via low-dimensional input devices, such as a joystick, imposes a high cognitive burden on operators. Operators must frequently switch between several modes for commanding gripper position, orientation, and open/close, and do so using an unintuitive 3D Cartesian space perspective. To make the control process more intuitive and seamless, we pursued a "shared autonomy" approach in which operator inputs and semi-autonomous control are integrated in order to achieve shared goals [2].
In prior studies, eye gaze was simply used as a "cursor" to select a target object from several candidate objects [10], [11], [12], [13], [14], [15]. These conventional methods did not attempt to infer or predict intent and required operators to stare at a target object for a fixed duration in order to trigger a pre-programmed robotic trajectory. We refer to such a control approach as the "gaze trigger" (GT) method. In addition, prior studies mainly focused on pick-and-place tasks [10], [12], [13], [15]. In our prior work [16], we introduced a recurrent neural network model that incorporated novel three-dimensional gaze-related features to recognize participants' intent. In this study, we leveraged the intent inference model trained in our prior work, integrated the model into a shared autonomy system with a real robot, and evaluated the resulting gaze-based shared autonomy framework.
The objective of this work is to enhance gaze-based shared autonomy systems by introducing a neural network-based "action prediction" (AP) algorithm that leverages spatiotemporal gaze-related features. We propose a number of objective and subjective performance metrics to evaluate and compare the performance of two control modes: the state-of-the-art GT mode and our proposed AP mode. This study makes two main contributions. (1) We developed and implemented an "action prediction" control mode for a gaze-based shared autonomy framework that can be used to perform everyday tasks comprised of a sequence of actions. The system features capabilities for intent inference, object localization, obstacle avoidance, and dynamic trajectory planning. (2) We demonstrated that the AP control mode results in more seamless robotic movements than the state-of-the-art GT mode, and that participants often preferred the proposed AP control mode over the GT mode.
This article is organized as follows. Section II outlines related work concerning gaze-based action recognition and gaze-based shared autonomy. Section III introduces our proposed gaze-based shared autonomy framework and action prediction control mode, and Section IV describes the experimental evaluation of the framework. Section V presents a comparison of the performance of the state-of-the-art gaze trigger control mode and the proposed action prediction control mode. Section VI concludes with a summary of contributions.

II. RELATED WORK

A. Gaze-Based Action Recognition
Numerous computer vision-based studies have leveraged egocentric videos taken by head-mounted cameras or eye trackers to recognize actions during everyday tasks [17], [18], [19], [20]. These studies first segmented the foreground and then detected human hands and activity-relevant objects. Features related to hands, objects, gaze, and their relative spatial relations were then used as inputs for action recognition using approaches such as hidden Markov models (HMMs), neural networks, and support vector machines (SVMs). Actions could not be successfully recognized until key visual features related to hand motions and object states (e.g., whether the lid is on a cup) were available to the classification algorithm. In this work, we aim to predict an operator's intended actions using gaze-related features. Thus, computer vision-based action recognition algorithms that rely on the visual consequences of actions cannot be directly applied for intent inference prior to the initiation of actions.
Li and Zhang proposed a gaze-based intention communication framework for human-robot interaction that was designed for eventual use with an assistive robot [21]. A simulated kitchen image was displayed to subjects who were instructed to express their intent by looking at task-relevant objects in the image. Subjects were required to press a physical button before and after they expressed their intention using visual attention in order to identify the sequence of gazed objects to be used for SVM classification of intent. While the system enabled recognition of intended tasks, such as "prepare a cup of coffee," a number of steps were required of the operator, thereby reducing the intuitive nature of control and seamlessness of the shared autonomy system.
Fuchs and Belardinelli recorded gaze signals as operators used a gaming controller to control the 3D position of a virtual robot end-effector in order to perform a pick-and-place task [22]. The gaze point was fed into a Gaussian Hidden Markov Model to classify a verb ("pick" or "place") and target (cylinders to be grasped or locations for setting down grasped cylinders). Although a recognition accuracy of approximately 80% was achieved, the eye gaze signal was interpreted as a gaze point rather than a 3D gaze vector, and the action recognition was not tested with a real robot or for tasks other than pick-and-place.
In a gaze-based intent inference study conducted by Huang et al., a "customer" selected one ingredient at a time for the preparation of a sandwich by a "server" [23]. Using gaze-based features, an SVM-based method correctly predicted the selected ingredient approximately 1.8 sec before a verbal request was given. While intent inference was successfully implemented for a target object (ingredient), the study did not incorporate the prediction of any verbs, as it was assumed that each ingredient was to be added to the sandwich.
In a prior study, we interpreted human intent as a triplet of a verb, target object, and hand object [16]. In that study, we recruited subjects to perform several everyday activities, such as preparing a powdered drink, and trained a recurrent neural network (RNN) to simultaneously recognize verbs and target objects using gaze-based features. As detailed in Section III-D, we leverage our prior RNN-based action recognition algorithm in this work.

B. Shared Autonomy Systems for Gaze-Based Robot Control
While the works cited in Section II-A addressed the challenge of action recognition using eye gaze, the recognition algorithms were not implemented in shared autonomy systems with real robots. In this subsection, we provide an overview of works that implemented gaze-based control of real robots.
Previous studies on gaze-based shared autonomy have focused on pick-and-place tasks [13], [15]. Gaze was used to select a target object for pick-up or a target position for setting down a grasped object. A robotic action would be triggered once gaze fixation on a target exceeded a preset time threshold (e.g., 2 sec in [13]). Zeng et al. used a hybrid gaze-brain machine interface in order to trigger robotic actions [10], [11], [12]. Gaze was used to select target objects while an EEG brain-machine interface triggered an action using "motor imagery" data. In the aforementioned studies, gaze was used to identify objects for pre-programmed movements.
Shafti et al. expanded the repertoire of gaze-triggered actions by adding pouring to pick-and-place capabilities [14]. A finite state machine was used to select the next action based on the identities and affordances [24] of the grasped and gazed objects. For instance, when the grasped object is a mug and the gazed object is a bowl, the next action to be triggered is "pour." Aronson and Admoni proposed an intent inference method for gaze-based shared autonomy systems [25]. A Partially Observable Markov Decision Process model used joystick and eye tracker signals in order to update probability distributions for candidate target objects. Huang and Mutlu demonstrated a gaze-based intent inference method for human-robot interaction [26]. A "customer" ordered a drink by verbally requesting one ingredient at a time while a robotic "server" picked up the corresponding ingredient and placed it into a blender. The robotic system monitored the customer's gaze, predicted the intended ingredient using SVM-based classification, and acted proactively. With the intent inference algorithm and the proactive control method, the system could respond to a customer's request and complete the task 2.5 sec earlier on average. In this work, we aim to predict a verb in addition to a target object, and to develop a larger repertoire of verbs and robotic actions.

III. GAZE-BASED SHARED AUTONOMY FRAMEWORK
Our proposed gaze-based shared autonomy framework consists of three threads: 1) 3D reconstruction, 2) intent inference, and 3) robotic manipulation (Figure 1). The 3D reconstruction thread tracks the 3D gaze vector as well as the location and orientation of task-relevant objects. The intent inference thread extracts input features from 3D gaze-object spatiotemporal data and feeds the features into a recurrent neural network (RNN) in order to perform real-time recognition of the intended action primitive. The robotic manipulation thread executes the intended action primitive while also implementing collision avoidance. This section presents how each thread was designed and integrated into a system.

A. Intent Representation
Before we introduce the control logic and three parallel threads in the gaze-based shared autonomy framework, we first define operator intent. Leveraging our prior work [16], we represent operator intent as an action primitive triplet comprised of a verb, target object (TO), and hand object (HO).
We selected four verb classes that are ubiquitous and can be observed in most instrumental activities of daily living (iADLs): Reach, Set down, Move, and Manipulate. The verb classes Reach, Set down, and Move correspond to gross 3D movement of the robotic gripper. In contrast, the verb class Manipulate includes a list of manipulate-type verbs that are related to object-specific affordances [24] and require more dexterous robotic motion. For instance, the verb "stir" is closely associated with the object spoon, and the verb "pour" is closely associated with the object mug.
The target object (TO) refers to the object or support surface that will be directly affected by the verb. The hand object (HO) refers to the object grasped by the robotic gripper. For instance, in the action primitive "move the spoon to the mug," the verb, HO, and TO are "move," "spoon," and "mug," respectively. The action primitive candidates used in this work are detailed in Section IV-A.

B. Control Logic
The flow chart in Figure 2 shows the integration of the three parallel threads that comprise the shared autonomy framework: 3D reconstruction, intent inference, and robotic manipulation.
The 3D reconstruction thread tracks eye gaze as well as the location and orientation of task-relevant objects (Figure 2(a)) so that the robot always has access to the real-time gaze vector and object point clouds in 3D space.
The intent inference thread extracts input features from 3D gaze-object spatiotemporal data and feeds the features into a recurrent neural network (RNN) in order to perform real-time recognition of the intended action primitive (Figure 2(b)).
The robotic manipulation thread executes the action primitive with the highest recognition probability. Considering that the strict implementation of the real-time action primitive recognition output can lead to unsmooth robot behavior, the system also implements a locking mechanism to enable the smooth completion of a given action primitive (Figure 2(c)).
The three threads operate in a real-time manner at 50 Hz. The threads collectively update the gaze vector and object poses, provide an updated action primitive recognition result, and send a velocity command to the robot every 20 ms. A detailed explanation of each thread can be found in the following subsections.
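For illustration, a minimal sketch of how the 50 Hz coordination of the three threads could be structured is shown below. The helper callables (update_gaze_and_objects, recognize_action_primitive, compute_velocity_command) and the robot interface are hypothetical stand-ins for the three threads described above, not the actual implementation.

```python
import time

CONTROL_PERIOD = 0.02  # 50 Hz, i.e., one update every 20 ms


def control_loop(robot, stop_event, update_gaze_and_objects,
                 recognize_action_primitive, compute_velocity_command):
    """Sketch of the 50 Hz coordination of the three threads (hypothetical interfaces)."""
    while not stop_event.is_set():
        t_start = time.time()

        # 1) 3D reconstruction: update the gaze vector and object point clouds.
        gaze_vector, object_poses = update_gaze_and_objects()

        # 2) Intent inference: update the recognized action primitive.
        a_recog = recognize_action_primitive(gaze_vector, object_poses)

        # 3) Robotic manipulation: send a Cartesian velocity command.
        velocity_cmd = compute_velocity_command(a_recog, object_poses)
        robot.send_cartesian_velocity(velocity_cmd)

        # Sleep for the remainder of the 20 ms period.
        elapsed = time.time() - t_start
        time.sleep(max(0.0, CONTROL_PERIOD - elapsed))
```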

C. 3D Reconstruction of Gaze Vector and Objects
A motion capture system and eye tracker described in Section IV were used to reconstruct the participant's 3D gaze vector. The eye tracker provided a set of 2D pixel coordinates, which represent the perspective projection of the participant's gaze point onto the image plane of the eye tracker's egocentric scene camera. We used a traditional chessboard calibration procedure [27] and the MATLAB Camera Calibration Toolbox [28] to obtain the intrinsic and extrinsic parameters of the egocentric scene camera. Using the intrinsic parameters, we reduced the effect of lens distortion and obtained a distortion-free image plane. Using the extrinsic parameters and retro-reflective markers attached to the scene camera, we determined the pose of the scene camera frame with respect to the 3D global reference frame. We constructed the 3D gaze vector by connecting the origin of the egocentric scene camera frame with the gaze point's perspective projection in the image plane, now expressed in the global frame. In addition to the gaze vector, each task-relevant object's pose and point cloud were tracked in real-time via a set of retro-reflective markers attached to the object's surface. The marker set can be considered a subset of the object's point cloud. Object pose was calculated using the Iterative Closest Point (ICP) algorithm [29].
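As a rough illustration of the back-projection step, the sketch below assumes an ideal, distortion-free pinhole model and that the camera pose in the global frame is already known from the retro-reflective markers; function and variable names are illustrative, not the exact implementation.

```python
import numpy as np


def gaze_vector_in_global_frame(gaze_px, K, T_world_cam):
    """Back-project a 2D gaze point (pixels) into a 3D gaze ray in the global frame.

    gaze_px:      (u, v) gaze point on the undistorted image plane of the scene camera.
    K:            3x3 intrinsic matrix of the egocentric scene camera.
    T_world_cam:  4x4 pose of the scene camera frame in the global (motion capture) frame,
                  assumed to be provided by the markers attached to the camera.
    Returns the ray origin and unit direction in the global frame.
    """
    u, v = gaze_px
    # Direction of the gaze ray in the camera frame (pinhole model, z = 1 plane).
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    d_cam /= np.linalg.norm(d_cam)

    R = T_world_cam[:3, :3]
    origin_world = T_world_cam[:3, 3]   # camera optical center in the global frame
    d_world = R @ d_cam                 # rotate the ray direction into the global frame
    return origin_world, d_world / np.linalg.norm(d_world)
```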

D. Intent Inference
We developed a long short-term memory recurrent neural network (LSTM RNN) model to recognize participants' intended action primitives. The training dataset was drawn from experiments in which 10 participants performed three everyday tasks [16]. The tasks included making a powdered drink, making instant coffee, and preparing a cleaning sponge. In this study, the model from [16] was directly deployed in the gaze-based shared autonomy scheme. The trained population-level model was used for all participants (i.e., there was no customization or hand-tuning of parameters for each participant).
The intent inference thread first extracted gaze-related attributes (Figure 2(b)). The identities of the task-relevant objects were then de-identified via a generic sorting and indexing method to improve cross-task generalizability, so that the intent inference model trained on one task could be deployed in another task. The attributes of the indexed objects were concatenated and sent to the RNN for the recognition of the intended verb and target object. Lastly, the generic, indexed objects were converted back to objects having specific identities for the implementation of real-time robotic control.
1) Gaze-Related Attributes: For each of the task-relevant objects and support surfaces, four types of attributes were extracted: gaze object ($g$), hand object ($h$), gaze object angle ($\theta$), and gaze object angular speed ($\dot{\theta}$). As binary variables, the gaze object attribute $g$ and hand object attribute $h$ represent whether an object is intersected by the 3D gaze vector and held by the robotic gripper, respectively. The gaze object angle $\theta$ is defined as the angle between the gaze vector and the eye-object vector [16]. The eye-object vector emanates from the origin of the gaze vector and ends at an object's geometric center (Figure 3). Finally, the gaze object angular speed $\dot{\theta}$ is the time derivative of the gaze object angle.
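A minimal sketch of how the gaze object angle and its angular speed could be computed from the 3D gaze vector and an object's geometric center is shown below; the 50 Hz finite-difference step and the function names are assumptions.

```python
import numpy as np


def gaze_object_angle(gaze_origin, gaze_dir, object_center):
    """Angle between the gaze vector and the eye-object vector (both anchored at the gaze origin)."""
    eye_obj = object_center - gaze_origin
    cos_angle = np.dot(gaze_dir, eye_obj) / (np.linalg.norm(gaze_dir) * np.linalg.norm(eye_obj))
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))


def gaze_object_angular_speed(theta_curr, theta_prev, dt=0.02):
    """Finite-difference estimate of the gaze object angular speed (dt = 20 ms at 50 Hz, assumed)."""
    return (theta_curr - theta_prev) / dt
```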
2) Converting Objects With Specific Identities to Generic, Indexed Objects: Given that specific gaze objects vary across tasks, the intent inference thread de-identified the task-relevant objects via a generic sorting and indexing method to improve cross-task generalizability such that the RNN would not depend on object identity or the specific task. For instance, without the generic sorting and indexing method, a model trained on tasks with a mug, spoon, and pitcher could not be deployed for similar tasks with different objects (e.g., cup, fork, and jug).
In our prior work [16], to evaluate the model's cross-task generalizability, the RNN trained on Task 2 (make instant coffee) was tested on Task 1 (make a powdered drink), and vice versa. RNNs trained on Task 1 and Task 2 were additionally tested on Task 3 (prepare a cleaning sponge). We observed a modest level of generalizability of the action primitive classifier across tasks. However, cross-task generalizability was not explicitly tested in this work with a robot in the loop.

To recognize the intended action primitive for time $t$, the gaze-related attributes of each object, $\{g^{\tau}_{obj}, h^{\tau}_{obj}, \theta^{\tau}_{obj}, \dot{\theta}^{\tau}_{obj} \mid \tau \in W_{recog}\}$, are extracted from a sliding temporal window $W_{recog}$ of size $w_{recog}$ for RNN input feature preparation. The variable $\tau$ represents any time step within $W_{recog}$. The subscript "obj" is a placeholder for a specific object identity. The symbols $g$, $h$, $\theta$, and $\dot{\theta}$ represent the attributes gaze object, hand object, gaze object angle, and gaze object angular speed, respectively.
Figure 4 shows how objects within the time window $W_{recog}$ are de-identified and indexed for use by the RNN. Figure 4(a) illustrates a gaze object sequence (the temporal sequence of objects that are visually regarded [30]) and the gaze-related attributes drawn from the time window $W_{recog}$.
Figure 4(b) shows how the task-relevant objects and support surfaces were sorted, de-identified, and indexed in descending order according to their frequency of occurrence in the gaze object sequence. The attributes of specific objects were thereby converted to attributes of generic, indexed objects and support surfaces, where $M$ and $N$ were the total number of task-relevant objects and support surfaces, respectively. Lastly, the attributes of the indexed objects were concatenated into the input feature vector for time $t$ and sent to the RNN to recognize the intended action primitive for time $t$.
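The sketch below illustrates one way the sorting and indexing could be realized; the tie-breaking rule and the separate handling of support surfaces are not specified in the text and are assumptions.

```python
from collections import Counter


def index_objects_by_gaze_frequency(gaze_object_sequence, all_objects):
    """Map specific object identities to generic indices (obj_1, obj_2, ...) by how often
    each object appears in the gaze object sequence within the recognition window.

    gaze_object_sequence: list of object names gazed at within W_recog (one entry per time step).
    all_objects:          list of all task-relevant object / support-surface names.
    Returns dicts mapping identity -> generic index and back.
    """
    counts = Counter(gaze_object_sequence)
    # Sort in descending order of occurrence; objects never gazed at receive count 0.
    ranked = sorted(all_objects, key=lambda name: counts.get(name, 0), reverse=True)
    identity_to_index = {name: f"obj_{k + 1}" for k, name in enumerate(ranked)}
    index_to_identity = {v: k for k, v in identity_to_index.items()}
    return identity_to_index, index_to_identity
```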
3) Action Primitive Recognition Model: The intent inference thread leveraged two parallel LSTM RNNs to recognize the intended verb and target object. To reduce overfitting, we selected a dropout rate of 0.3 and designed the LSTM RNN architecture to be comprised of one LSTM layer, three dense layers, and one softmax layer. The LSTM layer contained 64 neurons and each of the dense layers contained 30 neurons. For each time step $t$, the LSTM RNNs provided the probability distributions over the four verb classes and over the $M$ task-relevant objects and $N$ support surfaces, respectively. The symbols $f_{verb}$ and $f_{TO}$ represent the RNN models that recognize the verb and target object, respectively:

$$f_{verb}: \; P_{reach}(t), P_{move}(t), P_{setdown}(t), P_{manip}(t) \quad (1)$$

$$f_{TO}: \; P_{obj_1}(t), \ldots, P_{obj_M}(t), P_{ss_1}(t), \ldots, P_{ss_N}(t) \quad (2)$$

Given the output from $f_{verb}$ and $f_{TO}$, the generic, indexed objects had their specific identities restored in order to implement real-time robotic control. All verb and target object classes were combined to form $n_{cand}$ action primitive candidates. The probability of each candidate, $P(a_c)$, $c \in \{1, \ldots, n_{cand}\}$, was smoothed through the use of a moving average filter $W_{f,recog}$ of length $w_{f,recog}$. The real-time action primitive recognition output $a_{recog}$ was set as the action primitive candidate with the highest average probability within $W_{f,recog}$:

$$a_{recog}(t) = \arg\max_{c \in \{1, \ldots, n_{cand}\}} \bar{P}(a_c) \quad (3)$$

where $\bar{P}(a_c)$ denotes the average probability of candidate $a_c$ within $W_{f,recog}$.
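For concreteness, a sketch of one of the two parallel LSTM RNNs is given below using Keras. The layer sizes and dropout rate follow the description above; the activations, dropout placement, optimizer, and loss are assumptions rather than the authors' exact configuration.

```python
from tensorflow.keras import layers, models


def build_recognition_rnn(window_len, n_features, n_classes, dropout_rate=0.3):
    """Sketch of one of the two parallel LSTM RNNs (f_verb with n_classes = 4,
    f_TO with n_classes = M + N). Layer sizes follow the text; other details are assumed.
    """
    model = models.Sequential([
        layers.Input(shape=(window_len, n_features)),  # W_recog time steps of concatenated attributes
        layers.LSTM(64),
        layers.Dropout(dropout_rate),
        layers.Dense(30, activation="relu"),
        layers.Dense(30, activation="relu"),
        layers.Dense(30, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model


# Example: the verb model with 4 classes (reach, move, set down, manipulate);
# window_len and n_features are placeholders for the actual input dimensions.
# f_verb = build_recognition_rnn(window_len=50, n_features=24, n_classes=4)
```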

E. Robotic Manipulation
In this section, we detail our methods for the planning and implementation of recognized action primitives on real robots, as indicated in Figure 2(c).
1) Logic Flow of Robotic Manipulation Thread: Despite the use of a moving average filter on the RNN output, a strict implementation of the real-time action primitive recognition output can still result in unsmooth robot behavior. Thus, we implemented a "locking" mechanism to enable the smooth completion of a given action primitive.
In order for an action primitive to be locked, two criteria must be simultaneously satisfied: (i) the Euclidean distance between the end-effector and the target location associated with $a_{recog}$ must be less than a user-defined distance threshold $d$, and (ii) the average probability of the recognized action primitive, $\bar{P}(a_{recog})$, must exceed a user-defined probability threshold $p_{lock}$. A moving average window $W_{lock}$ with a fixed window size of $w_{lock}$ is used to calculate the average probability.
When no action primitive is locked, the robotic manipulation thread continuously sends Cartesian velocity commands to the robot to move the end-effector toward the target pose corresponding to $a_{recog}$, as seen in the right branch of Figure 2(c). The velocity command is calculated based on an artificial potential field (APF) so that the end-effector can approach the target while avoiding collisions with obstacles. A detailed explanation of the APF algorithm can be found in Section III-E.2. The robotic manipulation thread then checks the locking criteria and locks the action primitive $a_{recog}$ when the criteria are satisfied.
When an action primitive is locked, the robot ignores the real-time recognition result $a_{recog}$ and prioritizes movement of the end-effector toward the target pose corresponding to $a_{lock}$, as seen in the left branch of Figure 2(c). After the end-effector arrives at the target pose, depending on the verb of $a_{lock}$, it opens or closes the gripper, or plans and executes a human-like trajectory for a manipulate-type verb ("pour" or "stir") patterned after a human demonstration, as seen in Table I. $TP(a_{lock})$ represents the target pose corresponding to $a_{lock}$. For instance, $TP(\text{"reach mug"})$ is the end-effector pose with which the mug can be successfully grasped. The action primitive $a_{lock}$ can be unlocked in two ways: (1) $a_{lock}$ is completed, or (2) its average probability during the locking window, $\bar{P}(a_{lock})$, falls below a user-defined probability threshold $p_{unlock}$. Once a locked action primitive is unlocked, the robotic manipulation thread steps out of $a_{lock}$ and refocuses on the implementation of $a_{recog}$.
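A compact sketch of the locking and unlocking logic for the AP mode is shown below; the completion flag, data structures, and function signature are illustrative, while the probability thresholds follow the values reported in Section IV-C.

```python
import numpy as np


def update_lock_state(a_lock, a_recog, probs_window, ee_pos, target_pos,
                      completed, d_thresh, p_lock=0.7, p_unlock=0.3):
    """Sketch of the AP-mode locking/unlocking logic.

    probs_window: recent probabilities of the relevant action primitive within W_lock.
    ee_pos, target_pos: 3D end-effector position and the target position of a_recog / a_lock.
    completed: whether the currently locked action primitive has finished executing.
    Returns the (possibly updated) locked action primitive, or None if nothing is locked.
    """
    if a_lock is None:
        close_enough = np.linalg.norm(ee_pos - target_pos) < d_thresh
        confident = np.mean(probs_window) > p_lock
        if close_enough and confident:
            return a_recog                      # lock the currently recognized action primitive
        return None

    # An action primitive is locked: unlock if completed or if confidence collapses.
    if completed or np.mean(probs_window) < p_unlock:
        return None
    return a_lock
```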
2) Collision Avoidance via Artificial Potential Field: For collision avoidance, we adopted a path planning framework based on the artificial potential field (APF) algorithm. Although MoveIt! [31] and the Open Motion Planning Library (OMPL) [32] provide numerous outstanding offline planners, such as RRT (rapidly exploring random tree) and PRM (probabilistic roadmap), few of them are capable of quickly providing collision-free paths in cluttered environments in real-time.

TABLE I. ROBOT OPERATIONS AFTER $a_{recog}$ WAS LOCKED
The APF algorithm has the advantages of convenient calculation, simple implementation, and outstanding real-time performance [33]. Attractive potential fields around goal locations attract robot end-effectors while repulsive potential fields around obstacles push end-effectors away.
Considering the irregular geometries of robot arms and end-effectors, grasped objects, and obstacles, we cannot simply represent each body as a particle, as is typically done for mobile robots. Leveraging the work of Khatib [33], we selected a set of "points subjected to potentials" (PSPs) for each body. For the end-effector, we assigned a PSP to the tip of each digit in order to protect the non-backdrivable gripper from collision damage. Attractive and repulsive potential fields were generated based on the 3D positions of the PSPs.
One limitation of the original APF algorithm [33] is that interactions between attractive and repulsive potential fields may make some goals non-reachable. An end-effector could get trapped in a local minimum near the goal, but never reach the goal. This problem is known as "goals non-reachable with obstacles nearby" (GNRON). We adopted a modified repulsive potential function proposed by Zhu et al. in 2006 that directly addresses the GNRON problem and enables the end-effector to reach its goal while also avoiding nearby obstacles [34]. As can be seen in Figure 1, for example, the robot might collide with the mug after grasping the spoon. We tested the APF algorithm in simulation.
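The sketch below illustrates the flavor of the APF velocity computation for a single PSP; the gains, influence distance, and the simple distance-to-goal attenuation (a stand-in for the modified repulsive function of [34]) are illustrative assumptions, not the exact formulation used in this work.

```python
import numpy as np


def apf_velocity(psp_pos, goal_pos, obstacle_points, k_att=1.0, k_rep=0.05,
                 rho0=0.15, v_max=0.15):
    """Simplified APF velocity for one point subjected to potentials (PSP).

    The attractive term pulls the PSP toward its goal; repulsive terms push it away
    from obstacle points closer than the influence distance rho0. Attenuating the
    repulsive term by the distance to the goal mitigates the GNRON problem in the
    spirit of [34]. All gains and distances here are placeholder values.
    """
    to_goal = goal_pos - psp_pos
    v = k_att * to_goal                                    # attractive component
    d_goal = np.linalg.norm(to_goal)

    for obs in obstacle_points:
        away = psp_pos - obs
        rho = np.linalg.norm(away)
        if 1e-6 < rho < rho0:
            # Repulsive component, attenuated near the goal so the goal stays reachable.
            v += k_rep * (1.0 / rho - 1.0 / rho0) / rho**2 * (away / rho) * d_goal

    speed = np.linalg.norm(v)
    if speed > v_max:
        v = v / speed * v_max                              # saturate the commanded speed
    return v
```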
3) Planning and Execution of Trajectories for the Verb Class "Manipulate": For manipulate-type verbs, such as "pour" and "stir," we defined smooth trajectories that mimicked human demonstrations drawn from [16]. First, a time series of end-effector poses is designed in Cartesian space such that the spatiotemporal relation between the target object and hand object remains the same as observed in the human demonstrations. After using inverse kinematics to convert the target poses from Cartesian space to joint angles, we used the iterative parabolic time parameterization method provided by MoveIt! to plan the joint velocities for execution on the robot [31].

4) Planning the Target Position for the Verb Class "Set Down": For the verb "set down," an operator's intended target position for setting down a grasped object on a support surface can be hidden within unfocused gaze signals, blinks, saccades, and involuntary eye movements. Unlike the verbs "reach" and "move," whose associated target positions can be determined by the target objects' locations and affordances, the intended target position on a support surface for the verb "set down" needs to be extracted from noisy eye gaze signals.
We adopted Li et al.'s "fuzzy interpretation" method to filter out noise in eye gaze signals and extract valid points of visual attention [35]. Consider the point of intersection between the 3D gaze vector and the support surface as a raw, unfiltered gaze point. The variables $x_i$ and $\bar{x}_i$ represent the $i$th raw gaze point and the filtered (average) gaze point, respectively, at time step $i$. We calculate the distance between the gaze point $x_i$ and the geometric center of the cluster of "influential" gaze points in a moving filter window $W_{f,gaze}$ of length $w_{f,gaze}$. Per [35], if the distance is less than a user-defined threshold $d_r$, then the "influence coefficient" $e_i$ is set equal to 1 and the gaze point $x_i$ is added to the cluster of influential gaze points in $W_{f,gaze}$. Otherwise, the "influence coefficient" $e_i$ is set equal to zero, and the gaze point $x_i$ is discarded.
The influential gaze points are used to calculate the average gaze point $\bar{x}_i$ within the moving filter window at time step $i$:

$$\bar{x}_i = \frac{\sum_{k = i - w_{f,gaze}}^{i-1} e_k \, x_k}{\sum_{k = i - w_{f,gaze}}^{i-1} e_k} \quad (5)$$

The moving filter window includes all time steps between time steps $i - w_{f,gaze}$ and $i - 1$.
When at least 80% of the gaze points in $W_{f,gaze}$ are influential, we consider $\bar{x}_i$ to be the participant's intended target position for setting down the grasped object.
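A sketch of the fuzzy interpretation filter, as we understand it from [35] and the description above, is given below; the window length and thresholds follow Section IV-C (0.5 sec at 50 Hz, d_r = 5 cm, 80% influential), while the exact bookkeeping is an assumption.

```python
from collections import deque

import numpy as np


class FuzzyGazeFilter:
    """Sketch of the fuzzy interpretation filter used to extract the intended set-down
    position from noisy gaze points on the support surface (illustrative bookkeeping).
    """

    def __init__(self, window_len=25, d_r=0.05, min_fraction=0.8):
        self.window = deque(maxlen=window_len)   # entries: (gaze_point, influence_coefficient)
        self.d_r = d_r
        self.min_fraction = min_fraction

    def update(self, gaze_point):
        gaze_point = np.asarray(gaze_point, dtype=float)
        influential = [p for p, e in self.window if e == 1]
        if influential:
            center = np.mean(influential, axis=0)
            e_i = 1 if np.linalg.norm(gaze_point - center) < self.d_r else 0
        else:
            e_i = 1                              # the first point seeds the cluster
        self.window.append((gaze_point, e_i))    # non-influential points kept only to count the fraction

        points = [p for p, e in self.window if e == 1]
        fraction = len(points) / self.window.maxlen
        if len(self.window) == self.window.maxlen and fraction >= self.min_fraction:
            return np.mean(points, axis=0)       # stable set-down target found
        return None
```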

F. Control Modes
In a conventional gaze-based shared autonomy system, a robotic action is not triggered until gaze fixation on a target object exceeds a user-defined duration threshold [13], [14], [15]. We refer to this conventional control mode as the "gaze trigger" (GT) mode and use it as a benchmark for comparison with our proposed "action prediction" (AP) mode. Our intent inference model was integrated into the control scheme of the AP mode only.
We implemented the GT and AP modes under the same algorithmic framework having three parallel threads, as described in Section III-B. However, there were three key differences in the practical implementation of the GT and AP modes due to the inclusion of the intent inference model in the AP mode. First, the intent inference thread of the GT mode does not recognize intent using the RNN-based method of the AP mode (as described in Section III-D).
Second, when no action primitive is locked, the robotic manipulation thread of the GT mode does not send any velocity command to the robot while the AP mode does (solid block in the right branch of Figure 2(c)).
Third, the locking and unlocking criteria are different. For the AP mode, the locking and unlocking criteria rely on the average probability of $a_{recog}$ and the Euclidean distance between the end-effector and the target pose, as described in Section III-E.1. For the GT mode, since the RNN-based method was not leveraged, the locking and unlocking criteria depend solely on gaze fixation.
The same locking window size $w_{lock}$ was used for both the AP and GT modes. For the GT mode, an action primitive is locked if gaze fixation on a target object exceeds 70% of the $W_{lock}$ duration. An action primitive is unlocked if gaze fixation on a different target object exceeds 70% of the $W_{lock}$ duration.
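For comparison with the AP-mode logic above, the sketch below shows the GT mode's fixation-based locking rule; mapping a fixated object to an action primitive (e.g., via affordances and the currently grasped object, as in [14]) is outside the scope of this sketch.

```python
def gt_lock(a_lock, fixation_history, w_lock_steps, fixation_fraction=0.7):
    """Sketch of the GT mode's gaze-fixation locking rule.

    fixation_history: object fixated at each of the recent time steps
                      (None when the gaze vector does not intersect any object).
    w_lock_steps:     number of time steps spanned by the W_lock window.
    Returns the fixated target whose dwell time exceeds 70% of W_lock, or the
    current lock if no new target qualifies.
    """
    recent = fixation_history[-w_lock_steps:]
    for candidate in set(obj for obj in recent if obj is not None):
        fraction = recent.count(candidate) / w_lock_steps
        if fraction > fixation_fraction and candidate != a_lock:
            return candidate          # lock (or switch the lock to) this target
    return a_lock
```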

IV. EXPERIMENTAL EVALUATION

A. Experimental Protocol
We hypothesized that the action prediction (AP) mode would result in more seamless robotic movements than the state-of-the-art gaze trigger (GT) mode, and that participants would prefer the AP mode. In order to test these hypotheses, we conducted a study approved by the UCLA Institutional Review Board. All 16 participants (13 male, 3 female; aged 18-35 years) gave written informed consent in conformity with the Declaration of Helsinki. Three out of the 16 participants reported prior experience in interacting with robots.
We used a retro-reflective marker-based motion capture system (T-Series, Vicon, Culver City, CA, USA) with a sampling rate of 100 Hz and an eye tracker (ETL-500, ISCAN, Inc., Woburn, MA, USA) with a sampling rate of 60 Hz to reconstruct the 3D gaze vector and to identify and locate task-relevant objects.
As shown in Figure 1, we used a 7 degree-of-freedom (DOF) robot arm (JACO2 7-DOF spherical, Kinova Robotics, Quebec, Canada) with a three-fingered end-effector (Kinova Robotics). For simplicity, we controlled only the grip aperture of the end-effector, effectively reducing the end-effector to a 1-DOF gripper. The experiment was conducted with a single computer having an Intel Core i7-9700K processor running at 3.6 GHz and an NVIDIA GeForce RTX 2070 GPU to accelerate the RNN calculations. With the assistance of the GPU, the RNN was able to provide an updated recognition result within 5 ms.
We selected everyday objects and actions common in activities of daily living for the assessment of the GT and AP modes within our gaze-based shared autonomy framework. We used three objects (mug A, mug B, spoon) from the benchmark Yale-CMU-Berkeley (YCB) Object Set [36] and defined one support surface (table).
This experiment involved 14 action primitive candidates in total: reach mug A, reach mug B, reach spoon, move spoon to mug A, move spoon to mug B, move mug A to mug B, move mug B to mug A, stir within mug A, stir within mug B, pour from mug A to mug B, pour from mug B to mug A, set down mug A, set down mug B, and set down spoon. Each participant was instructed to perform 10 action primitives in the sequence shown in Table II. These action primitives involved 8 distinct action primitive candidates, two of which were repeated. Thus, 6 of the 14 candidates were not performed in the experiment. For brevity, we did not instruct participants to perform the 6 action primitives in which mug B was the primary object of interest (e.g., stir within mug B). However, these 6 action primitives were part of the action primitive library and were potential candidates for recognition as intended action primitives.
Unlike most studies that focus solely on pick-and-place actions, we included actions that involve the verbs "move," "pour," and "stir." We indexed the sequentially performed action primitives as $a_j$, where $j \in \{1, \ldots, 10\}$.
For a consistent comparison of the GT and AP modes across subjects and trials, the sequence of 10 actions was prescribed through verbal instructions and objects were placed at preset locations before each new trial.However, there is nothing about the system implementation described in Section III-B that would prevent participants from improvising and changing the sequence of actions, or that relies upon prescribed locations for the task-relevant objects.
Each experimental session consisted of two blocks of trials; each block used one control mode and included three consecutive trials with that same control mode. To account for the possibility that the order of the blocks could bias results, half of the participants (selected at random) experienced the GT mode first while the remaining half experienced the AP mode first.
Each participant was instructed on how to control the robot for each mode with a script: "You can let the robot know your intent by looking at the target object." Each participant was allowed to familiarize themselves with each control mode for up to two practice trials. Between the blocks, the participant was informed that the control mode would be switched. However, each control mode was referred to only as "Control Mode #1" or "Control Mode #2." As described in Section IV-B.3, participants were instructed to complete a brief questionnaire after each trial and were interviewed upon completion of the entire experimental session.

B. Performance Metrics
Here, we describe the objective and subjective performance metrics that were used to compare the performance of our proposed AP mode with the conventional GT mode.
1) Preliminaries: Before we define metrics for the seamlessness of the shared autonomy system, we introduce several key temporal variables. First, consider an action primitive $a_j$, which is one of the instructed, sequentially performed action primitives, where $j \in \{1, \ldots, 10\}$. We defined $t_{end}(a_j)$ as the time at which $a_j$ ends.
Taking $j = 3$ as an example, we illustrate the recognition process of $a_3$ ("reach spoon") in Figure 5(b). The curves in Figure 5 range from $t_{end}(a_2) - 0.5$ sec to $t_{end}(a_2) + 2$ sec, during which the robot completed $a_2$ ("set down mug") and started to execute $a_3$ ("reach spoon"). The curves in Figure 5, drawn from one representative trial, represent the two action primitive candidates with the highest probability. For clarity, other candidates with lower probability values are not shown. We defined $t_{recog}$ as the time at which $a_3$ was first identified as $a_{recog}$ according to Eq. 3.
Importantly, our gaze-based shared autonomy framework allows for recognition of $a_j$ prior to $t_{end}(a_{j-1})$, the time at which the prior action primitive ends. However, any recognition of $a_j$ earlier than a predefined time window $W_{end}$ that immediately precedes $t_{end}(a_{j-1})$ was treated as a possible misclassification and was ignored. The value of $W_{end}$ determined the earliest time at which an action primitive might be predicted.
From an implementation perspective, it could be premature to take $t_{recog}$ as the moment when the robot has correctly identified an intended action primitive, especially if the identity of $a_{recog}$ changes from one time step to the next. Rapid changes in the identity of $a_{recog}$ could occur due to noisy inputs to the RNN, despite the moving average filter applied to the RNN outputs. Thus, we conservatively define $t_{stable}$ as the time of "stable" recognition.
Occurring after $t_{recog}$ for the AP mode, $t_{stable}$ was the first time at which the following conditions were simultaneously satisfied: (i) at $t_{stable}$, $a_j$ was identified as $a_{recog}$ according to Eq. 3, and (ii) more than 70% of the time steps within a user-defined time window $W_{stable}$ (gray shaded area in Figure 5(b)) after $t_{stable}$ were recognized as $a_j$. Note that for the GT mode, $t_{stable}$ was the same as $t_{recog}$ since both times corresponded to the instant at which $a_j$ was locked.
2) Objective Measures: We used the following objective measures to evaluate the seamlessness of the shared autonomy system: delay of recognition, delay of stable recognition, and recognition accuracy. The delay of recognition is defined as the time difference between $t_{recog}$ and $t_{end}(a_{j-1})$. A negative value for the delay of recognition indicates that the recognition of an action primitive has occurred prior to the completion of the preceding action primitive. In this case, the RNN has successfully predicted an action primitive. Prediction of action primitives can enhance the seamlessness of the shared autonomy system.
The delay of stable recognition is defined as the duration between $t_{stable}$ and $t_{end}(a_{j-1})$. As with the delay of recognition, it is possible for the delay of stable recognition to be negative. For positive values of the delay of recognition and delay of stable recognition, the seamlessness of the shared autonomy system increases as the delay magnitudes decrease.
Recognition accuracy is defined as the proportion of time steps from $t_{end}(a_{j-1})$ to $t_{end}(a_j)$ that are correctly identified as $a_j$. Ground truth for each action primitive was known since all participants followed instructions to perform 10 specific action primitives in a given sequence. For the GT mode, recognition is deemed correct when $a_j$ matches $a_{lock}$ as determined by the locking mechanism described in Section III-F. For the AP mode, recognition is deemed correct when $a_j$ matches $a_{recog}$ as determined by Eq. 3, which relies upon the action primitive recognition RNN.
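The three objective measures can be computed directly from the per-time-step recognition outputs; the sketch below assumes $t_{recog}$ and $t_{stable}$ have already been located as described above, and the variable names are illustrative.

```python
import numpy as np


def objective_metrics(times, recog_labels, a_j, t_end_prev, t_end_curr, t_recog, t_stable):
    """Sketch of the three objective measures for one instructed action primitive a_j.

    times, recog_labels: per-time-step timestamps and recognized action primitives.
    t_end_prev, t_end_curr: end times of a_{j-1} and a_j.
    t_recog, t_stable: first and stable recognition times of a_j (found as in Section IV-B.1).
    """
    delay_of_recognition = t_recog - t_end_prev            # negative value => prediction occurred
    delay_of_stable_recognition = t_stable - t_end_prev
    span = (np.asarray(times) >= t_end_prev) & (np.asarray(times) <= t_end_curr)
    recognition_accuracy = np.mean(np.asarray(recog_labels)[span] == a_j)
    return delay_of_recognition, delay_of_stable_recognition, recognition_accuracy
```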
3) Subjective Measures: After each trial, participants completed, verbatim, the questionnaire reported in [2]. Using a Likert scale ranging from 1 to 7, where 1 and 7 corresponded to "strongly disagree" and "strongly agree," respectively, we asked participants to respond to the following statements: 1) "I felt in control." 2) "The robot did what I wanted." 3) "I was able to accomplish the task quickly." At the end of the experimental session, we asked participants two open-ended questions: 1) "Which control mode do you prefer and why?" 2) "Do you have any general comments for either the first or the second control mode?"

C. Specification of User-Defined Parameters
The implementation of the gaze-based shared autonomy framework and the performance assessment involve a number of user-defined parameters. The following values were determined from preliminary studies in order to balance speed with robustness of performance.
In Section III-E.4, to extract an operator's intended target position for setting down a grasped object, we set the moving filter window $w_{f,gaze}$ to 0.5 sec and the distance threshold $d_r$ to 5 cm. In Section III-D.3, to filter out noise in the real-time action primitive recognition RNN outputs, we set the moving average filter window $w_{f,recog}$ to 0.5 sec. For the "locking" mechanism, the probability thresholds $p_{lock}$ and $p_{unlock}$ were set to 0.7 and 0.3, respectively. Considering the 2 sec and 1.5 sec windows used for gaze triggering in [13] and [14], respectively, we set our "locking" window $w_{lock}$ to 1.5 sec to enable a fair comparison of the GT and AP modes. In Section V-A, to calculate objective measures of performance, we set the windows $w_{end}$ and $w_{stable}$ to 1 sec and 0.5 sec, respectively.
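For reference, the user-defined parameters above can be gathered into a single configuration; collecting them in a dictionary as below is purely illustrative.

```python
# User-defined parameters from this section (values as reported; units noted per entry).
PARAMS = {
    "w_f_gaze": 0.5,   # moving filter window for set-down gaze filtering (sec)
    "d_r": 0.05,       # influence distance threshold for gaze clustering (m)
    "w_f_recog": 0.5,  # moving average window over RNN outputs (sec)
    "p_lock": 0.7,     # probability threshold to lock an action primitive
    "p_unlock": 0.3,   # probability threshold to unlock an action primitive
    "w_lock": 1.5,     # locking window (sec), matched to the GT fixation windows in [13], [14]
    "w_end": 1.0,      # earliest-prediction window preceding t_end(a_{j-1}) (sec)
    "w_stable": 0.5,   # stability window used to determine t_stable (sec)
}
```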

V. RESULTS AND DISCUSSION

A. Objective Measures of Performance
For each action primitive and control mode, we report the population averages for the delay of recognition, delay of stable recognition, and recognition accuracy (Table III). Since all 16 participants operated the robot using both control modes, we conducted paired t-tests with a significance level of α = 0.05.
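The statistical comparison amounts to a paired t-test across the 16 participants for each metric and action primitive; a minimal sketch using SciPy is shown below, with the measured per-participant values supplied by the experiment rather than shown here.

```python
from scipy import stats


def compare_modes(values_gt, values_ap, alpha=0.05):
    """Paired t-test across participants for one performance metric and action primitive.

    values_gt, values_ap: per-participant values (same participant order) for the GT and
    AP modes; the actual measured values come from the experiment and are not shown here.
    """
    t_stat, p_value = stats.ttest_rel(values_gt, values_ap)
    return t_stat, p_value, p_value < alpha
```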
In general, the AP mode outperformed the benchmark GT mode for all three objective measures of performance. First, we address the delay of recognition. Of the nine action primitives that yielded statistically significant results, five had a mean delay of recognition that was negative, indicating that prediction of action primitives had occurred. Prediction occurred for action primitives involving the verbs "set down" and "move." For these two verbs, eye gaze moves toward the target object of the subsequent action primitive well in advance, which may enable the predictive capabilities of the AP mode. For the action primitive "move mug A to mug B," prediction occurred as much as 0.86 sec prior to the completion of the prior action primitive.
All delay of recognition values for the GT mode were positive, indicating that prediction of action primitives was not possible with the GT mode. Furthermore, all positive delay of recognition values were smaller for the AP mode than for the GT mode. Notably, the AP mode outperformed the GT mode by 1 sec or more for eight of the 10 action primitives. For the two manipulate-type action primitives that were the exceptions ("stir" and "pour"), the target objects were already being gazed at during the preceding action primitives involving the verb "move." As a result, the delay of recognition was less than 1 sec for both control modes.
By design, the delay of stable recognition metric is more strict and conservative than the delay of recognition metric. Thus, the delay of stable recognition values were either equal to or slightly worse than those for the delay of recognition for all action primitives and control modes. Predictive abilities were degraded by less than 0.2 sec, as in the "move spoon to mug A" case. The AP mode outperformed the GT mode by the largest margin (1.96 sec) for the second "reach mug A" action primitive in the instructed sequence (Table III).
Multiple factors could potentially affect the performance metrics (e.g., delay of stable recognition) of an action primitive, including but not restricted to its identity. As can be seen in Table III, although action primitives 1 and 7, and action primitives 2 and 10, are identical, their corresponding delays of stable recognition differ. Factors such as the identity of the preceding action primitive and/or the location of the target object can affect the performance metrics. For instance, before action primitive 7 was initiated, participants tended to keep their visual attention on objects related to the preceding action primitive (i.e., the spoon and table), possibly due to concerns about the robot's capabilities. In contrast, for action primitive 1, participants immediately moved their gaze toward mug A without hesitation once the experimenter confirmed that the participant could begin the next trial. Thus, the delay of stable recognition for action primitive 7 was larger than that for action primitive 1.
Regarding recognition accuracy, the mean recognition accuracy was statistically significantly higher for the AP mode than for the GT mode for all 10 action primitives. For the AP mode, the average recognition accuracy exceeded 95% for five of the 10 action primitives and exceeded 85% for all 10 action primitives. The lowest recognition accuracy value was 71.4% for the GT mode, as compared with 86.3% for the AP mode.
Of the three objective metrics of performance, the delay of stable recognition appeared to be the most reliable metric of seamlessness of the shared autonomy system. We highlight the delay of stable recognition results for all 10 action primitives in Figure 6. The mean delay of stable recognition was lower for the AP mode than the GT mode for all action primitives except "pour," which had a p-value of 0.11 (Table III).
Predictive capabilities of the AP mode were observed for the five action primitives involving "set down" or "move" (Figure 6). The use of the RNN classifier for action primitive recognition resulted in earlier recognition and execution of users' intended action primitives. By enhancing the responsiveness of the robot to gaze behaviors without sacrificing recognition accuracy, the AP mode resulted in a more seamless gaze-based shared autonomy system than the GT mode.
The AP control mode uses an RNN model that leverages the gaze object angle and gaze object angular speed, which are not considered by the GT mode at all. As reported in our prior study [16], the use of gaze object angle and gaze object angular speed as input features to the RNN can decrease the observational latency for recognizing action primitives. These two additional input features may encode the tendency of the gaze vector to approach an object once the eyes start to move, thereby providing intent-relevant information even before the gaze vector intersects with the target object.

B. Subjective Measures of Performance
1) Post-Trial Survey: Figure 7 summarizes average participant responses to the post-trial surveys described in Section IV-B for each of the control modes. A paired t-test (α = 0.05) was conducted in order to compare participants' views of the benchmark GT and proposed AP control modes. For all three survey statements, there was a statistically significant difference between the Likert scale responses (p < 0.01). In each case, the AP control mode outperformed the GT mode. For the statement "I felt in control," the mean (standard deviation) Likert scale response was 6.2 (0.5) for the AP mode and 5.8 (0.8) for the GT mode. For the statement "The robot did what I wanted," the mean (standard deviation) Likert scale response was 6.5 (0.5) for the AP mode and 5.9 (0.8) for the GT mode. The largest difference in mean values was observed for the statement "I was able to accomplish the task quickly." In this case, the mean (standard deviation) Likert scale response was 6.1 (0.6) for the AP mode and 5.1 (1.0) for the GT mode.

2) Post-Experiment Interview: As described in Section IV-A, half of the participants experienced the GT mode first and half experienced the AP mode first. Although each participant referred to the "first" and "second" control modes in their interview responses, for clarity we substitute the mode names [GT] and [AP] in brackets in the quotations below.
In the post-experiment interview, 14 out of 16 participants expressed a preference for the AP mode (see the Supplemental Video for 1st and 3rd person perspectives of a representative trial). Representative comments about the AP mode are listed below:
• "I don't have to look at one object or position for a long time like a few seconds, and the movement [of the AP mode] is smoother."
• "It just seems there's smooth tracking. The [AP] process is pretty fast and seamless. It is pretty obvious that the [GT] control mode is slower to respond than the [AP]."
• "When I was using [AP], it was more responsive, and the action is pretty smooth. There's no pause in the middle."
• "The [AP] control mode was more fluent. In the [GT] control mode, it didn't feel like when I looked at it, the robot is following, so I felt less control."

Two participants preferred the GT mode over the AP mode:
• "Although the [GT] control mode was slower, there was not any confusion. You looked at the cup, and after a couple of seconds, it picked up the cup. It took some time, but it would do it, so you don't have to worry about correcting. The other one [AP] feels a bit twitchy like it's very responsive."
• "For the [AP] control mode, I felt like the control was more on the robot side instead of the human side. I enjoyed it initially because I felt like I can rely on the robot to accomplish each of these tasks very accurately in a laboratory environment. Still, if I use the system in real life, there will be unpredictable variables, and I will appreciate it if the robot can pause and wait for my confirmation through eyes like what the [GT] control mode did."

According to the Likert scale survey and the post-experiment interviews, most participants reported that the AP mode was more seamless than the GT mode. For the GT mode, participants perceived pauses between the robot's execution of their intended action primitives. For the AP mode, participants reported that the robot was more responsive to their eye movements.

C. Limitations and Future Work
Despite a general satisfaction with the AP mode, two participants expressed skepticism about the ability of the gaze-based shared autonomy system to accurately recognize their intent in a more visually cluttered environment. Such concerns might be assuaged by enhancing the transparency of the system and conveying to participants what the robot has inferred and plans to do next. Alonso and Puente highlighted the critical nature of transparency for shared autonomy systems in a review paper [37]. They described how transparency could improve system performance, reduce human errors, and build human trust in human-robot systems. Our results might be further improved through the use of screen-based or audio confirmation to increase the system's transparency.
In addition, participants' gaze behaviors can be affected by environmental stimuli or mental states [38]. The proposed shared autonomy scheme does not consider distracted or idle states, during which gaze does not correspond with user intent. This problem can be addressed by incorporating additional modalities of user input. For instance, EMG [5] or EEG [7] signals, or simply a joystick, can be leveraged to complement the current action primitive recognition algorithm by differentiating task-driven gaze behaviors from distracted or idle states. Also, the gaze-based shared autonomy framework might not work well for individuals with certain impairments. For instance, the gaze vectors of individuals who exhibit involuntary body and/or eye movements may be too noisy for accurate action primitive recognition. Additional filtering of gaze signals may be necessary. The gaze-based shared autonomy framework could be tested on individuals with different types of impairments and the control scheme could be improved accordingly.

Fig. 6. The delay of stable recognition is shown for the GT and AP modes. Each boxplot indicates the 25th, 50th (green), and 75th percentiles. The whiskers extend to the most extreme data points that are not considered outliers (red "+"), which have values that exceed 1.5 times the interquartile range from the top or bottom of the box. A negative value indicates that an action primitive has been predicted before the end of the preceding action primitive. The AP mode (blue) outperformed the GT mode (black) for all action primitives (p < α = 0.05) except for "pour." Asterisks indicate p < α = 0.05.

Fig. 7. Likert scale survey results are shown, where 1 and 7 indicate "strongly disagree" and "strongly agree," respectively. Each boxplot indicates the 25th, 50th (green), and 75th percentiles. The whiskers extend to the most extreme data points that are not considered outliers (red "+"), which have values that exceed 1.5 times the interquartile range from the top or bottom of the box. The AP mode (blue) outperformed the GT mode (black) for all three statements (p < 0.01).
Furthermore, eye-tracking technology can be expensive and inaccessible to the general public. Even the most lightweight eye trackers on the market can be uncomfortable to wear for long periods. Recently, consumer sensors, such as RGB-D cameras, have been adopted for real-time gaze estimation. These remote sensors, typically positioned at a distance from the subject, are capable of estimating gaze direction with acceptable accuracy by locating the head position and iris center and handling low-quality eye images [39]. As gaze estimation techniques advance to yield improved estimation accuracy, remote cameras might be able to substitute for head-mounted eye trackers, thereby enhancing user comfort and reducing the overall cost of the system.
Our gaze-based shared autonomy framework could also be improved by developing a library with a greater variety of action primitive candidates, as would be needed for the diverse set of activities of daily living required by individuals with upper limb impairment. Currently, the action primitive recognition thread labels each time step as one of the four verbs (reach, move, set down, manipulate). Although these four verb classes are generic enough to serve as building blocks for complex actions, important verbs such as "feed" [40] are not included. Modifications to the current action primitive recognition framework would be required, as a participant's mouth could not be identified as a "target object" during self-feeding. Instead, action primitives related to feeding and drinking might require the triggering of pre-planned trajectories once the utensil or cup is ready to be brought to one's mouth. Alternatively, information from a 3rd person camera perspective could be used to supplement the 1st person view of the user [40], [41] to improve the accuracy and safety of feeding and drinking trajectories.

VI. CONCLUSION
We developed a novel gaze-based shared autonomy framework to assist with activities of daily living. Utilizing a pre-trained recurrent neural network [16], the system can recognize, and often predict, an operator's intended action primitives using 3D gaze-related features. The system can localize objects in real-time and dynamically plan collision-free trajectories to reach, move, manipulate, and set down everyday objects. Through both objective and subjective metrics of performance, we demonstrated that our AP control mode, which leverages a gaze-based action primitive recognition model, can outperform the conventional gaze-triggered control mode. As borne out by statistical analyses as well as participant surveys and interviews, the AP mode enabled a more seamless gaze-based shared autonomy system than the GT mode. The system can serve as a foundation for further enhancements to system transparency through augmented reality and to system adaptability through the expansion of the verb library used for action primitive recognition.

Fig. 1. The gaze-based shared autonomy framework consists of three threads: 3D reconstruction, intent inference, and robotic manipulation. Gaze-related features are used to recognize action primitives to enable seamless robotic movements during the assistance of activities of daily living.

Fig. 2. Three parallel threads comprise the shared autonomy framework. (a) The 3D reconstruction thread tracks eye gaze and object poses. (b) The intent inference thread recognizes the intended action primitives in real-time. (c) The robotic manipulation thread implements the execution of intended action primitives.

Fig. 3. As in [16], the gaze object angle is defined as the angle between the gaze vector and the eye-object vector (ending at the object's geometric center).

Fig. 4. The intent inference thread converted objects with specific identities (e.g., spoon and mug shown in (a)) to generic, indexed objects (e.g., obj 1 and obj 2 shown in (b)) according to their frequency of occurrence in the gaze object sequence. Attributes of the indexed objects, for every time step τ within $W_{recog}$ (from $t - w_{recog}$ to $t - 1$), were then concatenated as input features and sent to the RNN.

Fig. 5. Key variables and performance metrics described in Section IV-B are defined for the (a) GT mode and (b) AP mode. Since the GT mode does not utilize an intent inference model, the delay of recognition equals the delay of stable recognition. For the AP mode, the delay of stable recognition depends on the intent inference model and user-defined time windows such as $W_{end}$ and $W_{stable}$. Prediction of action primitives is only possible with the AP mode.

TABLE III. PAIRED T-TESTS WERE CONDUCTED FOR THE GT AND AP MODES FOR THREE OBJECTIVE METRICS OF PERFORMANCE. POPULATION MEANS ARE REPORTED, WITH THE BEST RESULT FOR EACH ACTION PRIMITIVE SHADED IN GRAY AND THE BEST OVERALL RESULT FOR EACH PERFORMANCE METRIC INDICATED IN BOLD.