3D Gaze Estimation for Head-Mounted Eye Tracking System with Auto-Calibration Method



I. INTRODUCTION
Gaze estimation is the process of predicting where someone is looking, expressed either as gaze directions or as points of regard (PoR) in space. Vision is one of the primary means of collecting information about our surroundings; our eyes turn toward a person or an object we are looking at. According to an analysis by researchers in the United States, the degree to which we rely on each sense in everyday activities is approximately: 1% for taste, 1.5% for touch, 3.5% for smell, 11% for hearing, and 83% for sight [1]. Estimating users' gaze vectors or gaze points is therefore of great help in understanding human activities. Nowadays, gaze estimation systems are applied in many fields, such as human-machine interaction [2], assisted driving [3], and surgery assistance [4].
Gaze estimation systems can be broadly classified into remote devices and head-mounted devices (HMD) [5]. A remote device is a screen-based interaction system that works at a distance from the subject [6]. It contains at least one user-facing camera to capture images of the subject's face so that the PoR can be estimated from features extracted from the face images. Because of this working principle, a remote device requires the participant's head to stay in the camera's field of view at all times, which limits head-body mobility. In contrast, an HMD is head-worn equipment that can acquire clear eye images and allows users to move their heads freely. An HMD usually consists of a scene camera and two eye cameras. The scene camera captures the scene the user sees, and the eye cameras record eye movement while the user looks at the scene. In this way, the HMD estimates the user's PoR in the scene camera's coordinate system on the basis of eye images. Recently, lightweight HMDs have become popular eye tracking systems among gaze researchers due to their flexibility and mobility. Such devices extend gaze estimation from the desktop or computer screen to other scenes, which greatly enriches the collection of human gaze data.
After several studies attempted to predict the PoR on a scene image plane or a screen, interest in estimating human gaze locations in 3D coordinates has been increasing. 3D gaze estimation not only predicts the PoR in the field of view with depth information, but also demonstrates the connection between a scene's saliency and human-related motion [7]. To our knowledge, one of the earliest systems capable of estimating the 3D PoR requires a fixed head-to-camera setting [8]. Kwon et al. introduce a novel binocular technique in which the gaze direction is computed using glints on the cornea and the depth is then inferred from the interpupillary distance. Though 3D gaze estimation has been widely studied for remote gaze estimation systems [9], research on HMDs remains limited. For both types of systems, the most common approach is to estimate the 3D gaze point as the midpoint of the shortest segment between the two eyes' visual vectors [10]. However, deviations in the calculation of the eye gaze vectors are likely to cause considerable variance in the PoR's depth direction.
To address this problem, Lee et al. use a multi-layer perceptron to obtain the depth gaze position [11]. However, the method employs dual Purkinje images as input, which are hard to detect in practice.
Another challenge for 3D gaze estimation is that eye tracking systems need to be calibrated for each user before estimation. During calibration, the subject needs to stare at specific reference markers, and such active personal calibration sometimes interrupts user-scene interactions. Although the calibration procedure has become much simpler and the number of calibration markers has been reduced to one in some works [12], it still requires the user to participate actively in the calibration task.
Several approaches have been proposed to avoid the person-specific calibration process. Alnajar et al. claim that the gaze patterns of several viewers provide important cues for auto-calibration [13]. By making use of the topology of pre-recorded gaze patterns, a transformation is computed that maps the initial gaze points to match the gaze pattern. Sugano treats mouse clicks on a computer screen as gaze points to train the mapping function between eye features and the PoR [14]. Similar to [14], the algorithm in [15] detects the user's hand and fingertip, which indicate the user's point of interest. This method can easily collect calibration samples in different environments, and it achieves accuracy comparable to standard marker-based calibration. Moreover, Lander and his co-workers combine the pupil center position and the scene reflection on the corneal surface to predict the actual PoR in the scene [16]. All these approaches avoid calibration markers. Nonetheless, they all rely on observations of a specific person and environment, which limits their applicability.
Visual saliency has also been considered as a cue for predicting PoRs, as Parkhurst et al. claim that there is a connection between visual saliency and gaze positions [17]. Biologically, a human is more likely to focus on a part of the scene with high saliency, i.e., a region with distinctive characteristics compared with its surroundings. After Koch and Ullman proposed the concept of saliency [18], various saliency algorithms have been developed [19], [20]. In contrast to the methods mentioned above, visual saliency can be applied to a variety of scenes and does not impose any restrictions on the user, which gives it high flexibility and wide adaptability.
In this paper, we present a head-mounted eye tracking system to predict users' PoR in a 3D environment (as shown in Fig. 1). The system has a simple setup and achieves good estimation accuracy. We also propose a novel auto-calibrating 3D gaze estimation method named 3DGAC. This method uses a saliency-based algorithm to calculate possible 3D calibration targets in the scene instead of asking the subject to fixate on a certain point, thereby achieving a fully automatic calibration procedure without any restrictions or human intervention imposed on the subject. For the system's architecture, we replace the regular RGB camera with an RGBD camera, which provides accurate 3D data of the scene. During usage, participants can change their location and head pose without constraint. In the calibration step, we integrate saliency maps with gaze vectors to search for the extrinsic parameters between the scene camera and the two eye cameras. In the gaze estimation step, we first rotate gaze vectors from both eye camera coordinate systems into the scene camera coordinate system. Then, we use the point cloud generated by the RGBD camera to find the optimal PoR in the 3D environment. The main contributions of this paper are threefold: 1) We combine RGBD images with a saliency-based auto-calibration method to achieve 3D gaze point estimation for our head-mounted eye tracking system; to the best of our knowledge, this is the first work to do so. 2) We propose a novel method to determine the rotation matrix that transfers gaze vectors into the scene camera coordinate system. 3) We perform experiments indoors and outdoors; the performance of PoR estimation is encouraging, and we achieve a remarkable improvement in the accuracy of the PoR's depth estimation.
The rest of this paper is organized as follows. Section II presents the details of 3DGAC. Section III shows the experimental results and discussion. Section IV concludes this study.

II. METHODOLOGY
A. ARCHITECTURE FOR HMD
The head-mounted eye tracking system consists of two parts. An Intel RealSense D435 RGBD camera is used as the scene camera to provide scene images with a depth measurement range of over 10m. Two IR cameras are used as eye cameras to capture clear IR eye streams [21], [22]. In addition, an IR light illuminates the eye regions. All the capture devices are connected to a laptop and are triggered at the same time, so that two IR eye images, one RGB scene image, and one depth image are captured simultaneously.
As the scene camera, the Intel RealSense D435 RGBD camera contains two modules. The RGB module captures RGB scene images; the depth module has two IR cameras and an IR projector for obtaining depth images of the scene. With the software module in the librealsense libraries, the depth image can be easily aligned to the RGB image. Thus, the 3D positions of all scene image pixels can be obtained in the scene camera coordinates.
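Since the depth image is aligned to the RGB image, each pixel can be back-projected to a 3D point in the scene camera coordinates with the standard pinhole model. A minimal sketch (the intrinsic values FX, FY, CX, CY below are illustrative placeholders, not the D435's actual calibration):

```python
import numpy as np

# Hypothetical pinhole intrinsics for the aligned RGB/depth stream
# (fx, fy: focal lengths in pixels; cx, cy: principal point).
FX, FY, CX, CY = 615.0, 615.0, 320.0, 240.0

def deproject(u, v, depth_m):
    """Back-project pixel (u, v) with its aligned depth (in meters)
    into a 3D point in the scene camera coordinate system."""
    x = (u - CX) / FX * depth_m
    y = (v - CY) / FY * depth_m
    return np.array([x, y, depth_m])
```

In practice, librealsense exposes equivalent deprojection helpers; the sketch only shows the geometry involved.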

B. CALIBRATION PROCEDURE
The framework of 3DGAC is shown in Fig. 2.During calibration, the participant can scan the surrounding environment randomly.
For the RGB scene images, we apply the algorithm in [23] to generate saliency maps, which represent the distinctive features of the scene images. We also considered using an RGB-D saliency detection algorithm to implement our self-calibration; however, such methods generally compute the saliency map with a trained neural network [24], [25]. In contrast, traditional saliency algorithms have better universality and can run on low-cost equipment. Therefore, we use a traditional RGB saliency detection algorithm to calculate the saliency maps of the scene images. Fig. 3 presents examples of scene images and their corresponding saliency maps. After generating each saliency map, we pick out the pixels whose saliency value exceeds a preset threshold. Then a clustering method based on [26] is employed to remove noise and improve the reliability of the saliency map. When there are multiple salient regions in the scene, we preserve all the pixels of the salient parts, as shown in Fig. 4. Consequently, the target data set {t_i}_j can be collected, containing the 3D coordinates of the space points that correspond to the chosen pixels, where {t_i} stands for the selected 3D targets of the saliency map and j is the image's index. The details of the data acquisition process are outlined in Fig. 5.
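The thresholding and noise-removal steps can be sketched as follows; connected-component filtering stands in for the clustering method of [26], and the threshold and minimum cluster size are illustrative values:

```python
import numpy as np
from scipy import ndimage

def select_salient_pixels(saliency, threshold=0.7, min_size=20):
    """Keep pixels whose saliency exceeds a preset threshold, then drop
    small connected components as noise (a simple stand-in for the
    paper's clustering step)."""
    mask = saliency > threshold
    labels, n = ndimage.label(mask)                    # connected components
    sizes = ndimage.sum(mask, labels, range(1, n + 1)) # pixels per component
    keep = np.isin(labels, 1 + np.flatnonzero(sizes >= min_size))
    return np.argwhere(keep)                           # (row, col) of kept pixels
```

The returned pixel coordinates would then be looked up in the aligned depth image to form the 3D target set {t_i}_j.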
There are generally two types of methods for establishing associations between eye features and targets in the scene: regression-based and model-based approaches. Unlike regression-based approaches, which directly create a mapping between eye features and gaze points, model-based approaches first use the extracted eye features to build an eyeball model. Once the model is built, initial gaze vectors are obtained and then used to determine the real gaze vectors in the scene camera coordinates. In this paper, we adopt the 3D eyeball reconstruction method of [27]. Modeling the camera imaging sensor as a pinhole camera, the pupil contours on the eye images can be back-projected into 3D space as circles, each described by its center m (the pupil position) and a normal l passing through m. After the back-projection, a 3D eye model can be built on the basis of multiple circles.
We assume the 3D eye model to be a sphere, so that every back-projected circle is tangent to the sphere at the circle's center. In this case, l is treated as a normal that passes through the eye model center and is parallel to the gaze vector. Once enough eye images have been collected, all the pupil contours on those images are extracted and back-projected into 3D space. The eyeball center c is determined as the point closest to all the normals, obtained by minimizing

c = argmin_c Σ_j ||(c − m_j) − ((c − m_j) · l_j) l_j||²,

where j is the index of the eye image. Once the eyeball center c is recovered, the 3D gaze vector n can be calculated as the vector originating from c and pointing to m in the eye camera coordinate system. Given that each user has specific gaze habits in a scene, a salient region alone is not sufficient to determine an exact position the way a calibration marker does. To combat this problem, we propose a novel method to find the relationship between the saliency map and the two gaze vectors.
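The minimization for the eyeball center c has a standard closed-form least-squares solution (the point closest to a set of 3D lines); a sketch under the assumption that each l_j is a unit direction:

```python
import numpy as np

def eyeball_center(ms, ls):
    """Least-squares point closest to a set of 3D lines, each passing
    through pupil position m_j with direction l_j -- a sketch of
    recovering the eyeball center c from back-projected pupil normals."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for m, l in zip(ms, ls):
        l = l / np.linalg.norm(l)
        P = np.eye(3) - np.outer(l, l)  # projector orthogonal to the line
        A += P
        b += P @ m
    return np.linalg.solve(A, b)        # solves (sum P_j) c = sum P_j m_j
```

The normal equations above follow directly from differentiating the summed squared point-to-line distances with respect to c.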
Considering that the distance between the user's eyes and the scene camera is within a few centimeters, whereas the user's fixations are usually meters away, it is reasonable to assume that the user's eyeball center coincides with the origin of the scene camera coordinate system (we perform a simulation in Section III-A to evaluate the angle deviation introduced by ignoring this eye-camera distance). Thus, the eyes and the scene camera observe an object from approximately the same viewing directions.
To compute the extrinsic parameters between the left eye camera coordinate system and the scene camera coordinate system, we acquire the corresponding gaze vector n_lj, together with the target data set {t_i}_j for the j-th saliency map. We then delete targets whose depth is less than 1m (based on the simulation result) to improve the reliability of the proposed calibration method. Each target vector can be represented as t_i − e_l, where e_l stands for the user's left eyeball center location (assumed to coincide with the scene camera origin). Let α_l and β_l denote the vertical and horizontal angles between the gaze vector and the target vector, respectively. The corresponding rotation matrix R_li can then be written as the composition of a rotation by α_l about the horizontal axis and a rotation by β_l about the vertical axis, and the extrinsic rotation is recovered by solving the overdetermined system R_l · n_lj = (t_i − e_l)/||t_i − e_l|| over all gaze-target pairs. In this way we acquire a set of rotation matrices {R_li}_j corresponding to all the gaze vectors and all selected pixels in all the saliency maps. During the calculation, a two-dimensional cumulative array A(α_l, β_l) is created to count the frequency of angle pairs (as shown in Fig. 6). These frequencies keep growing as more saliency maps are involved. Multiple angle pairs may show high frequency early in the calculation, as in Fig. 6(a), Fig. 6(b), and Fig. 6(c). Since there is only one correct extrinsic parameter set between the scene camera and the left eye camera, the frequency of one angle pair eventually becomes significantly higher than the others, as in Fig. 6(f). After the calculation, R_li is determined from the most frequent angle pair and used to recover the extrinsic rotation R_l between the left eye camera and scene camera coordinate systems. We accept the rotation matrix R_l once the frequency of the most frequent angle pair is two times greater than the second largest frequency.
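The accumulator-based search for the calibration angle pair can be sketched as follows. The axis convention (x right, y down, z forward) and the bin size are illustrative assumptions; in practice each frame contributes many spurious gaze-target pairs, and the voting averages them out:

```python
import numpy as np

def vector_angles(v):
    """Horizontal and vertical viewing angles of a 3D vector, in degrees
    (axis convention is an assumption: x right, y down, z forward)."""
    x, y, z = v
    return np.degrees(np.arctan2(x, z)), np.degrees(np.arctan2(y, z))

def vote_angle_pair(gazes, target_sets, bin_deg=1.0, span=60):
    """Accumulate the (alpha, beta) offsets between every gaze vector and
    every candidate salient target into a 2D histogram, and return the
    most frequent pair -- a sketch of the calibration voting scheme."""
    n = int(2 * span / bin_deg)
    acc = np.zeros((n, n))
    for g, targets in zip(gazes, target_sets):
        gh, gv = vector_angles(g)
        for t in targets:
            th, tv = vector_angles(t)
            a, b = tv - gv, th - gh        # vertical, horizontal offsets
            i = int((a + span) / bin_deg)
            j = int((b + span) / bin_deg)
            if 0 <= i < n and 0 <= j < n:
                acc[i, j] += 1
    i, j = np.unravel_index(np.argmax(acc), acc.shape)
    return i * bin_deg - span, j * bin_deg - span
```

The winning (alpha, beta) pair would then define the rotation taking eye-camera gaze vectors into the scene camera frame.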
Using the same method, we recover the extrinsic parameters between the right eye camera and scene camera coordinate systems, represented by R_r.
The two rotated gaze vectors are then given by n'_l = R_l · n_l and n'_r = R_r · n_r.

C. GAZE ESTIMATION
In the traditional model-based gaze estimation method, the 3D gaze point is computed as the midpoint of the shortest segment between the two rotated gaze vectors. However, due to the short baseline between the human eyes, small angular errors in the rotated gaze vectors can cause large deviations in the Z direction of the gaze point estimate (as shown in Fig. 7). To refine the raw 3D gaze estimate, we generate a point cloud of the environment (as shown in Fig. 8). For a 3D vector N passing through the origin O = (0, 0, 0), the distance of a space point p from N can be calculated as

d(p, N) = ||p × N|| / ||N||.

Given the two rotated gaze vectors, we define the gaze point by minimizing

PoR = argmin_{p_i} [d(p_i, n'_l) + d(p_i, n'_r)],

where p_i is a point in the point cloud and the objective is the summed distance from the point to the two rotated gaze vectors.
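The point-cloud refinement reduces to a nearest-point-to-two-rays search; a sketch assuming both rotated gaze vectors originate at the scene camera origin:

```python
import numpy as np

def point_to_ray_distance(points, n):
    """Distance from each point to the line through the origin with
    direction n: ||p x n_hat||."""
    n_hat = n / np.linalg.norm(n)
    return np.linalg.norm(np.cross(points, n_hat), axis=1)

def gaze_point(cloud, n_left, n_right):
    """Pick the point-cloud point minimizing the summed distance to both
    rotated gaze vectors -- a sketch of the refined 3D PoR search."""
    d = point_to_ray_distance(cloud, n_left) + point_to_ray_distance(cloud, n_right)
    return cloud[np.argmin(d)]
```

Because the selected point is constrained to lie on a physical surface, the large depth uncertainty of the raw midpoint estimate is avoided.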

III. EXPERIMENTS
A. SIMULATION RESULTS
We use MATLAB to evaluate the effect of ignoring the eye-camera distance on the accuracy of 3D gaze point estimation at 10 different depths from 0.5m to 5m. Based on the architecture of our gaze tracker, the distances from the scene camera to the user's left and right eyeball centers are approximately 70mm and 90mm, respectively. At each depth, 100 points are randomly generated as 3D gaze targets, covering a field of view of 60°×55°. For each target, the vector from the eyeball center to the 3D target is treated as the real gaze vector, while the approximated gaze vector starts from the scene camera's origin. We calculate the angle between each pair of vectors, and the result is shown in Fig. 9.
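This simulation can be reproduced in a few lines; the 70mm lateral offset mirrors the left-eye geometry described above, while the target distribution is an illustrative stand-in for the paper's 60°×55° field of view:

```python
import numpy as np

def angle_deviation(eye_offset, target):
    """Angle (in degrees) between the true gaze vector, which starts at
    the eyeball center, and the approximated one starting at the scene
    camera origin."""
    true_v = target - eye_offset
    cosang = np.dot(true_v, target) / (np.linalg.norm(true_v) * np.linalg.norm(target))
    return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

def mean_deviation(eye_offset, depth, n=100, seed=1):
    """Mean deviation over n random targets on a plane at the given
    depth; the lateral spread scales with depth to mimic a fixed field
    of view."""
    rng = np.random.default_rng(seed)
    targets = rng.uniform(-depth * 0.5, depth * 0.5, (n, 3))
    targets[:, 2] = depth
    return float(np.mean([angle_deviation(eye_offset, t) for t in targets]))
```

For a straight-ahead target at 1m and a 70mm offset, the deviation is about arctan(0.07/1) ≈ 4°, and it shrinks roughly inversely with depth.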
Fig. 9 shows that the eye-camera distance can cause a considerable angle deviation when the depth is small, but the mean error drops to an acceptable level as the depth increases to 1m. Moreover, when the depth exceeds 2m, the mean angle errors and standard deviations for both eyes reach an encouraging precision. Considering that the subject's gaze distance is normally beyond 1m in a mobile situation, the assumption that the user's eyeball center coincides with the origin of the scene camera coordinate system is reasonable.

B. EXPERIMENT RESULTS AND DISCUSSION
Our experimental system is based on a low-cost HMD developed in our laboratory, which uses an Intel RealSense D435 RGBD camera as the scene camera and two IR cameras as eye cameras. All the cameras run at approximately 25fps during the experiments, with resolutions of 640×480 pixels. Before calibration and estimation, the three cameras are calibrated so that the lens distortion can be corrected with the MATLAB toolbox. The proposed gaze tracking algorithm runs on a laptop PC with an Intel i7 quad-core CPU (2.70GHz, boosting to 2.90GHz) and 8GB of RAM. The computational costs of calculating a gaze vector from an eye image and of generating a saliency map from a scene image are about 15ms and 18ms, respectively. To ensure real-time gaze calculation, the proposed method is executed in a multi-threaded manner. One thread runs at 30fps to calculate gaze vectors from eye images. Meanwhile, the other thread runs at 10fps, generating a saliency map from one of every three consecutive scene images and discarding the other two.
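The frame-decimation logic of the saliency thread can be sketched with a simple queue-based worker that keeps one of every three scene frames (the saliency computation itself is replaced by a placeholder):

```python
import queue
import threading

def saliency_worker(frames_q, results_q, keep_every=3):
    """Consume scene frames but process only one of every `keep_every`
    consecutive frames, discarding the rest, so the 10fps saliency
    thread keeps pace with the 30fps eye-image thread."""
    count = 0
    while True:
        frame = frames_q.get()
        if frame is None:                        # sentinel: end of stream
            results_q.put(None)
            return
        if count % keep_every == 0:
            results_q.put(("saliency", frame))   # stand-in for real work
        count += 1
```

The gaze-vector thread would run unthrottled on its own queue; only the saliency path is decimated.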
First, we conduct the 3D gaze estimation experiment in an office. 10 subjects are invited to validate the proposed method. Fig. 10 shows the setup of our indoor experiment. During calibration, subjects are encouraged to look at their surroundings freely, and there is no limitation on their head pose or standing position. The entire calibration process for each subject usually takes 20 seconds, as the algorithm needs time to establish the eyeball models. In the gaze estimation step, we use a board with 5 concentric circle targets to test the angular and depth accuracy of 3DGAC. Taking the RGBD camera's depth measurement range and the room size into consideration, four tests are carried out by placing the board at 4 different depths from the subject: 1m, 2m, 3m, and 4m. For each depth, the subjects fixate on the 5 targets in the same order: top-left, top-right, bottom-right, bottom-left, and center, with each target lasting 2 to 3 seconds. Fig. 11(a) illustrates the PoR estimation errors in degrees for each subject at the 4 distances, and the depth estimation errors are shown in Fig. 11(c). The average result at each depth is shown in Table 1; we use the average depth error (ADE) and the average angular error (AAE) to measure the accuracy. The overall ADE of the indoor experiment is 55.9mm and the overall AAE is 3.7 degrees.
Table 2 compares the proposed method with several state-of-the-art remote-based gaze estimation methods; the four models are as follows.
a) 3DDEFM: a 3D eye-face model enabling 3D gaze estimation, proposed by Wang and Ji in [28].
b) RGBD-GT: the method designed by Xiong et al. in [29], which tracks the iris center and a set of 2D facial landmarks whose 3D locations are provided by the RGBD camera. c) RTGTK: real-time gaze tracking with Kinect, presented by Wang and Ji in [30]. d) RGBD-EMGE: an eye-model-based gaze estimation method proposed by Jianfeng and Shigang in [31].
According to Table 2, 3DGAC performs relatively well in estimating the gaze vector's direction compared with the four remote-based methods that also adopt RGBD sensors. Although the angular error of 3DDEFM is smaller than that of the proposed method, it is evaluated at a hand-reaching distance. Moreover, our system demonstrates greater flexibility: the experiment is conducted over a larger range (from 1m to 4m); the participant can move and change his or her head pose freely during the entire process; and no calibration markers are needed to assist the calibration process.
In Table 3, we compare the proposed method with four head-mounted gaze estimation methods. a) 3DGE-GPR: a 3D gaze estimation method with a Gaussian process regressor, proposed by Elmadjian et al. in [32]. b) 3DGE-3DEP: 3D gaze estimation on a 3D environment map, designed by Takemura et al. in [33]. c) GEMFR: a method using multiple feature regression to predict 3D gaze, presented by Weier et al. in [34]. d) SCGE: a self-calibrating gaze estimation method proposed by Sugano and Bulling in [35].
From Table 3, the average angular error of our method is 3.7°, which outperforms the accuracy of 3DGE-GPR, 3DGE-3DEP, and SCGE (4.9°, 5.27°, and 6°, respectively). GEMFR reports an accuracy of 1°; however, that method requires a complex hardware setup and a time-consuming calibration period. For estimating the 3D PoR's depth, the average depth error of 3DGAC from 1m to 4m is 55.9mm, a significant improvement over the other four head-mounted gaze estimation methods.
From Table 1, it can be observed that our gaze estimation method performs better at longer distances. This may be due to the assumption of ignoring the distance between the eyeball center and the scene camera when calculating the rotation matrix for each eye: if the estimation distance is relatively short, the eye-camera distance may affect the computation of the rotation matrices, so the system's result becomes more accurate when the user gazes at more distant locations. Fig. 12 shows the setup of our outdoor experiment. The calibration and gaze estimation procedures are the same as in the first experiment. The PoR estimation errors in degrees for each subject at the 4 distances are shown in Fig. 11(b), and the depth estimation errors are shown in Fig. 11(d). Table 4 lists the average result at each depth; the overall average depth error is 68.4mm and the average angular error is 4.0 degrees.
Comparing the results of the two experiments, the accuracy of PoR estimation and depth measurement decreases outdoors when the subject stands at the same distance from the board. We believe the main reason is the impact of sunlight. Given that we utilize IR light sources to capture dark-pupil images and extract the pupil contour, sunlight interferes with the illumination of the IR source, which degrades pupil detection. Moreover, the RGBD camera acquires depth data via the IR projector and two IR cameras; hence, sunlight also degrades the PoR's depth measurement.

IV. CONCLUSION
In this paper, we presented a head-mounted eye tracking system to estimate users' 3D PoR. A novel auto-calibrating 3D gaze estimation method named 3DGAC is proposed to simplify the calibration procedure and improve the system's flexibility. For the hardware design, we replace the regular RGB camera of a head-mounted eye tracking system with an RGBD camera to obtain the depth information of the scene. During the calibration procedure, a saliency algorithm is used to generate saliency maps from the scene images. We align the salient pixels to the depth image and use them as possible 3D calibration targets. Through aggregation, we obtain the rotation matrix between the eye camera coordinate system and the scene camera coordinate system. The entire calibration procedure is completed automatically, without pre-set markers or any external assistance. To improve the accuracy of depth estimation, environmental point cloud data is applied in the PoR estimation: once the final gaze vectors are identified, the PoR is calculated as the point cloud point closest to them. On the basis of the experimental results and comparisons with other state-of-the-art approaches, the proposed method achieves a relatively accurate depth measurement with encouraging angular estimation precision.

FIGURE 1. Our head-mounted eye tracking system to predict the user's PoR in a 3D environment.

FIGURE 2. Our proposed framework applies the saliency-based method to the scene images to obtain the 3D targets' dataset, combines them with gaze vectors to automatically calibrate the head-mounted gaze tracker, and achieves 3D gaze point estimation.

FIGURE 3. Examples of RGB scene images and the corresponding saliency maps generated by the saliency algorithm in [23].

FIGURE 4. The clustering algorithm in [26] is employed to remove noise in the saliency map and improve its reliability. The three images from left to right are: scene image, saliency map, and selected pixels after clustering.

FIGURE 5. Process of obtaining the calibration targets' dataset. The upper row shows the operation on a single scene image and its corresponding depth image. The lower row shows the operation on a set of scene images and their corresponding depth images.

FIGURE 7. Angular error in gaze estimation barely affects positioning on the facing plane, but it can cause considerable variance in the PoR's depth direction.

FIGURE 8. Point cloud of the environment. The gaze point is defined as the point with the minimal distance to both rotated gaze vectors.

FIGURE 9. Angle deviation between the true gaze vector and the gaze vector under our assumption, for both eyes at 10 different depths.

FIGURE 10. Indoor experiment setup. The distance L between the user wearing our head-mounted gaze tracker and the calibration board can be 1m, 2m, 3m, or 4m. The target board has five concentric circle targets.

FIGURE 11. PoR estimation errors on different points over different depths. (a) The PoR estimation angular errors in degrees, indoors. (b) The PoR estimation angular errors in degrees, outdoors. (c) The PoR estimation depth errors indoors. (d) The PoR estimation depth errors outdoors.

FIGURE 12. Outdoor experiment setup. (a) Gaze estimation accuracy test outdoors while the user is looking at pt5. (b) Images of the left and right eyes. (c) Gaze estimation result: purple arrow lines are the current rotated gaze vectors, red dots are past PoRs, and the green dot is the current PoR.

TABLE 2. Comparison with state-of-the-art model-based methods.

TABLE 3. Comparison with state-of-the-art head-mounted-based methods.
