Accurate Regression-Based 3D Gaze Estimation Using Multiple Mapping Surfaces

Accurate 3D gaze estimation using a simple setup remains a challenging issue for head-mounted eye tracking. Current regression-based gaze direction estimation methods implicitly assume that all gaze directions intersect at one point called the eyeball pseudo-center. The effect of this implicit assumption on gaze estimation is unknown. In this paper, we find that this assumption is approximate based on a simulation of all intersections of gaze directions, and it is conditional based on a sensitivity analysis of the assumption in gaze estimation. Hence, we propose a gaze direction estimation method with one mapping surface that satisfies conditions of the assumption by configuring one mapping surface and achieving a high-quality calibration of the eyeball pseudo-center. This method only adds two additional calibration points outside the mapping surface. Furthermore, replacing the eyeball pseudo-center with an additional calibrated surface, we propose a gaze direction estimation method with two mapping surfaces that further improves the accuracy of gaze estimation. This method improves accuracy on the state-of-the-art method by 20 percent (from a mean error of 1.84 degrees to 1.48 degrees) on a public dataset with a usage range of 1 meter and by 17 percent (from a mean error of 2.22 degrees to 1.85 degrees) on a public dataset with a usage range of 2 meters.


I. INTRODUCTION
Eye gaze is an essential clue for understanding human intentions. Gaze tracking is mainly used for attention analysis [1], [2], human-computer interaction [3], and human-robot interaction (HRI) [4], [5]. Li et al. [5] recently reported the first attempt to achieve intuitive HRI using only gaze signals, which is helpful for disabled people with upper limb motor impairment, such as amputees and paralyzed patients, to reconstruct their upper limb motion abilities. In mobile applications, such as gaze-based intuitive HRI, head-mounted 3D gaze trackers are preferred to table-mounted gaze trackers [6].
The associate editor coordinating the review of this manuscript and approving it for publication was Pinjia Zhang.
In head-mounted eye tracking, 3D gaze estimation faces three significant challenges, namely the accuracy of gaze estimation, calibration complexity (i.e., the number of calibration points and the burden of calibration procedures), and device complexity (such as the number of cameras and infrared LEDs). Many gaze estimation methods [6] have been proposed to address these challenges; they are generally divided into model-based and regression-based methods. Based on geometric eye models, model-based methods [7]-[9] calculate visual axes to represent gaze directions. Their main advantages are rapid one-point calibration and robustness against the relative drift between heads and eye trackers. However, these methods require sophisticated, fully calibrated devices and are limited in accuracy by geometric eye models, which differ from one individual to another.
In contrast, regression-based 3D gaze estimation methods employ simple devices to estimate gaze points. Most existing paradigms [10]-[13] are based on the implicit assumption that all gaze directions intersect at one point called the eyeball center. Note that this term, as used in these methods, differs from the traditional definition of the eyeball center, which is the intersection of all optical axes. To avoid confusion, we adopt the term eyeball pseudo-center to denote the intersection of all gaze directions. Gaze directions are represented by the eyeball pseudo-center together with gaze vectors [13], [14] or 2D mapping points on the imaginary mapping plane [10]. The existing regression-based methods have two limitations that affect the accuracy of 3D gaze estimation. First, the eyeball pseudo-center is difficult to calibrate accurately, because it is inside the eyeball and outside the field of view of the scene camera. Second, it is still unknown how the implicit assumption of the eyeball pseudo-center affects the accuracy of gaze direction estimation.
To overcome these two limitations, we present a gaze direction estimation model based on mapping surfaces. By simulating the implicit assumption of the eyeball pseudo-center, we find that all gaze directions intersect within a small region rather than at one point, which implies that this assumption is approximate. The sensitivity analysis of the eyeball pseudo-center on gaze estimation reveals the conditions of the assumption, i.e., configuring the mapping surface and reducing the error of the eyeball pseudo-center. Accordingly, we propose a gaze direction estimation method with one mapping surface that exploits and satisfies the assumption. The eyeball pseudo-center is calibrated by two additional calibration points outside the mapping surface. Furthermore, we propose a gaze direction estimation method with two mapping surfaces that adds another mapping surface to replace the eyeball pseudo-center under the same framework. The gaze direction is thus represented as the line connecting the two points on the two mapping surfaces. Experimental results reveal that the second method is more accurate than the first method, but requires more calibration points. The main contributions of this paper are summarized as follows:
1) The simulation and sensitivity analysis reveal that the current implicit assumption of the eyeball pseudo-center is approximate and conditional.
2) A high-quality calibration technique for the eyeball pseudo-center is proposed, which requires only two additional calibration points outside the calibration plane.
3) An accurate gaze direction estimation method with two mapping surfaces is presented to overcome the limitation of the implicit assumption.
The remainder of this paper is organized as follows. Section II introduces the related works. Section III formulates the basic model of gaze direction estimation. Section IV analyzes the implicit assumption of the eyeball pseudo-center. The gaze estimation method with two mapping surfaces is described in Section V. Experiments are given in Section VI, followed by the conclusion drawn in Section VII.

II. RELATED WORKS
Regression-based 2D gaze estimation methods regress from eye features (i.e., pupil centers or pupil-glint vectors) [15], [16] to gaze positions on a plane, without assuming a geometric eye model [8], [17]. Regression functions are calibrated by simultaneously capturing eye features and calibration points on a calibration plane. Similar to these methods, appearance-based methods use eye appearance [18], [19] as eye features. They leverage large training data with convolutional neural networks to achieve unconstrained gaze estimation. They do not require calibration, but are less accurate than the regression-based methods and are only suitable for table-mounted eye tracking.
When target points are outside the calibration plane, 2D gaze estimation methods result in an error of gaze points, i.e., the parallel error [20], due to the offset between scene cameras and eyes. To this end, some methods use additional input features related to depth coordinates of gaze points, such as pupil distances [5] or Purkinje images [21]. It is highlighted that the depth error of gaze estimation is generally significant, as these additional features have an indirect and weak correlation with depth coordinates of gaze points. Instead of additional eye features, other methods add scene information. Takemura et al. [22] used the correspondence between triangles containing 2D gaze points in the scene camera image and triangles containing 3D gaze points in the real world to estimate 3D gaze points.

A. GAZE DIRECTION ESTIMATION USING THE EYEBALL PSEUDO-CENTER
Unlike the above methods of direct mapping from features, most gaze direction estimation methods [10]- [12] implicitly assume that all gaze directions intersect at one point, the eyeball pseudo-center. Then gaze directions are represented as the eyeball pseudo-center and gaze vectors. Based on a simplified eye model of a perfect sphere where optical axes coincide with visual axes, back-projection methods [14], [23] estimate the eyeball center and gaze vectors in the coordinate system of the eye camera by back-projecting pupil image ellipses. Transforming the coordinate system into the scene camera introduces six parameters of homogeneous transformation. To solve these parameters, the 3D-to-3D mapping method in [14] forms a nonlinear optimization problem that minimizes the angular disparity between computed gaze vectors in the eye camera and targets in the scene camera. To reduce the complexity of the search space, Elmadjian et al. [24] proposed a two-step iterative optimization algorithm. Each iteration first computes the translation and then the rotation that places gaze vectors into the scene camera space.
Regression-based gaze direction estimation methods regress from eye features to gaze vectors. To solve the mapping parameters and the eyeball pseudo-center, the 2D-to-3D mapping method in [14] formulates a nonlinear optimization that minimizes the angular error between estimated and ground-truth gaze vectors using at least two calibration planes. To address locally optimal solutions caused by nonlinear optimization, Su et al. [13] set sensible initialization and constraints on eye positions to improve accuracy. Our work is inspired by the method proposed by Abbott et al. [10], which regresses from eye features to mapping points on a calibration plane instead of gaze vectors. In this way, the eyeball pseudo-center and mapping parameters can be computed separately to obtain analytical solutions, without nonlinear optimization.

III. MODEL FORMULATION
Gaze directions are strongly correlated with rotations of the eyeball, so they can be inferred from image features of the eye. Fig. 1 illustrates the regression-based gaze direction estimation. The gaze direction is represented as the line connecting the eyeball pseudo-center $O \in \mathbb{R}^{3\times 1}$ to the corresponding point $P \in \mathbb{R}^{3\times 1}$ on the imaginary mapping surface $\mathcal{P}$, which has a one-to-one mapping with the eye feature $(u, v)$, as follows:
$$G = O + w\,(P - O), \tag{1}$$
where $G \in \mathbb{R}^{3\times 1}$ denotes an arbitrary point on the gaze direction, and $w$ is a proportional coefficient. The gaze direction of one eye intersects the gaze direction of the other eye or the reconstructed object surface at the gaze point. Therefore, the problem reduces to calculating the eyeball pseudo-center $O$ and the imaginary mapping surface $\mathcal{P}$. For each input eye feature $(u, v)$, the corresponding point on the mapping surface $\mathcal{P}$ can be calculated. Note that the eyeball pseudo-center is assumed to be the intersection of all gaze directions in the eyeball coordinate system, and it does not coincide with the eyeball center.
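The ray model above can be illustrated with a short sketch. It assumes the parameterization $G = O + w\,(P - O)$ and, purely for illustration, finds the gaze point by intersecting the ray with a plane of known depth; the function names and the numerical values (in metres) are hypothetical and not taken from the paper.

```python
def point_on_gaze_ray(O, P, w):
    """Return G = O + w * (P - O), a point on the ray from O through P."""
    return tuple(o + w * (p - o) for o, p in zip(O, P))


def gaze_point_at_depth(O, P, z):
    """Intersect the gaze ray with the plane Z = z (assumed known depth)."""
    if P[2] == O[2]:
        raise ValueError("gaze ray is parallel to the depth plane")
    w = (z - O[2]) / (P[2] - O[2])
    return point_on_gaze_ray(O, P, w)


# Hypothetical values: pseudo-center 5 cm behind the scene-camera origin,
# mapping point on a surface about 1 m away.
O = (0.0, 0.0, -0.05)
P = (0.10, 0.0, 1.0)
G = gaze_point_at_depth(O, P, 2.05)
print(G)  # close to (0.2, 0.0, 2.05), up to floating-point rounding
```

In a binocular setup the depth would instead come from the intersection with the other eye's gaze ray or with a reconstructed surface, as described in the text.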

A. IMAGINARY MAPPING SURFACE
The choice of the mapping function depends on extrapolation errors [25], which emerge when gaze points are outside the calibration area. Most regression-based methods employ polynomial mapping functions [3], [6], and it is found in [26], [27] that polynomials of order higher than two do not significantly reduce extrapolation errors. The typical polynomial mapping function is the second-order binary polynomial function that calculates the target point $P(x_P, y_P, z_P)$ on the mapping plane at a constant distance of $z_P$ from the input eye feature $(u, v)$, as follows:
$$\begin{aligned} x_P &= a_1 + a_2 u + a_3 v + a_4 uv + a_5 u^2 + a_6 v^2,\\ y_P &= b_1 + b_2 u + b_3 v + b_4 uv + b_5 u^2 + b_6 v^2, \end{aligned} \tag{2}$$
where $a_1, a_2, \ldots, a_6$ and $b_1, b_2, \ldots, b_6$ are the unknown mapping parameters. However, this mapping function requires users' heads to remain stationary during calibration and restricts all calibration points to a plane parallel to the user's face.
To overcome this limitation, we replace Equation (2) with the second-order ternary polynomial function whose input is the same but whose output is the target point $P(x_P, y_P, z_P)$ on the mapping surface, as follows:
$$\begin{aligned} x_P &= a_1 + a_2 u + a_3 v + a_4 uv + a_5 u^2 + a_6 v^2,\\ y_P &= b_1 + b_2 u + b_3 v + b_4 uv + b_5 u^2 + b_6 v^2,\\ z_P &= c_1 + c_2 u + c_3 v + c_4 uv + c_5 u^2 + c_6 v^2, \end{aligned} \tag{3}$$
where $c_1, c_2, \ldots, c_6$ are the mapping parameters of the depth coordinate. This function is calibrated with the least-squares method by simultaneously capturing eye features and calibration points on a calibration plane. The advantage of this function is that users' heads are free during calibration, since $z_P$ of the calibration points can be obtained by scene cameras. Since the basic coordinate system is located in the head-mounted eye tracker and moves with the head, the coordinates of calibration points lie on a surface rather than a plane. After calibration, the mapping relationship remains unchanged, even though users look at different distances or move their faces. Therefore, the mapping surface is imaginary and moves consistently with the human face.
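The least-squares calibration of the mapping surface can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the monomial basis $[1, u, v, uv, u^2, v^2]$ for each of the three output coordinates (the ordering of the terms is an assumption) and solves the normal equations with a small Gauss-Jordan routine so that only the standard library is needed. All function names are hypothetical.

```python
def monomials(u, v):
    """Second-order basis in (u, v); the term ordering is an assumption."""
    return [1.0, u, v, u * v, u * u, v * v]


def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= factor * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_mapping_surface(eye_features, targets):
    """Fit one coefficient list per output coordinate (x, y, z)."""
    Phi = [monomials(u, v) for u, v in eye_features]
    k = len(Phi[0])
    # Normal equations: (Phi^T Phi) c = Phi^T t, solved once per coordinate.
    AtA = [[sum(row[i] * row[j] for row in Phi) for j in range(k)]
           for i in range(k)]
    coeffs = []
    for axis in range(3):
        Atb = [sum(row[i] * t[axis] for row, t in zip(Phi, targets))
               for i in range(k)]
        coeffs.append(solve_linear(AtA, Atb))
    return coeffs


def map_to_surface(coeffs, u, v):
    """Evaluate the fitted mapping: eye feature (u, v) -> point (x, y, z)."""
    phi = monomials(u, v)
    return tuple(sum(c * f for c, f in zip(cs, phi)) for cs in coeffs)
```

At least six calibration points in general position are needed for the six coefficients per coordinate; a library least-squares solver would normally be preferred over normal equations for numerical robustness.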

B. THE EYEBALL PSEUDO-CENTER
The term eyeball pseudo-center has a two-fold meaning. First, it refers to the intersection of all gaze directions in the eye coordinate system, as opposed to the eyeball center, which is the intersection of all optical axes. Second, it refers to an approximate assumption rather than an actual point. Existing papers [10]-[13] make this assumption implicitly and refer to this point as the eyeball center. To clarify the difference, we use the term eyeball pseudo-center rather than eyeball center. Since the eyeball pseudo-center is inside the eye, its position is difficult to measure. More importantly, it is still unknown how this assumption affects the accuracy of gaze direction estimation.

IV. ERROR ANALYSIS AND CALIBRATION OF THE EYEBALL PSEUDO-CENTER
To address the limitations of the assumption of the eyeball pseudo-center, we theoretically analyze the assumption and calibrate the eyeball pseudo-center.

A. INTERSECTIONS OF GAZE DIRECTIONS
According to [28], the assumption of the eyeball pseudo-center does not hold, i.e., gaze directions do not all intersect at one point. Thus, we simulate all intersections of gaze directions based on the geometric eye model [28], which consists of two spheres, namely the cornea and the retina. The optical axis is the line connecting the eyeball center $E$ to the cornea center through the cornea vertex $V$ (see Fig. 2). The gaze direction refers to the visual axis, which is the line connecting the fovea to the gaze point by way of the nodal points $N$ and $N'$. Model-based gaze estimation methods [7], [17] always assume that $N$ and $N'$ are coincident with the cornea center. In contrast, we abandon this assumption so that the results are more reliable and meaningful. When two visual axes are non-coplanar, the intersection refers to the midpoint of the common perpendicular of these two visual axes. Thus, we simulate all intersections of coplanar visual axes and all midpoints of common perpendiculars of non-coplanar visual axes. First, we set the parameters and initial state of the eye model. Due to individual differences, we set typical values for the parameters according to the accommodated schematic eyes of [17], [28]: $VN = 6.6$ mm, $VE = 13.1$ mm. Then $EN = VE - VN = 6.5$ mm. The angle between the optical axis and the visual axis is commonly denoted as $\kappa$, which is individual-dependent and constant for each individual, as it is determined by the fovea location. Considering the mean value of $\kappa$, we set $\kappa = 5^\circ$. In the coordinate system, the origin is the eyeball center $E$, and the Z-axis points horizontally forward. The X-axis points horizontally toward the right eye, and the Y-axis points vertically downward (see Fig. 2). In the initial state, the optical axis is parallel to the Z-axis, and the visual axis is in the X-Z plane (see the red points in Fig. 3 (a)).
Thus, the initial position of $N$ is $N_0 = [0, 0, 6.5]^T$, and the initial vector of the visual axis is $v_0 = R_y(\kappa)\,z = [\sin\kappa, 0, \cos\kappa]^T$. Different visual axes are generated by eye rotations. After rotating by a pan angle $\alpha$ (left-right rotation) and a tilt angle $\beta$ (up-down rotation) from the initial eyeball state, the nodal point is $N(\alpha, \beta) = R_y(\alpha) R_x(\beta) N_0$, and the vector of the visual axis is $v(\alpha, \beta) = R_y(\alpha) R_x(\beta) v_0$, where $R_y(\alpha)$ and $R_x(\beta)$ are rotation matrices around the Y-axis and X-axis, respectively. According to [29], the horizontal field of view is $100^\circ$ on the temporal side, but $60^\circ$ on the nasal side due to occlusion by the nose. Hence, $\alpha \in [-100^\circ, 60^\circ]$. As the vertical visual field is $60^\circ$ toward the brow side and $75^\circ$ toward the cheek side, $\beta \in [-75^\circ, 60^\circ]$. The sampling interval for both $\alpha$ and $\beta$ is $2^\circ$. Note that since it is uncommon for the eye to rotate around the optical axis, torsion is ignored.
The simulation of the intersections of visual axes is shown in Fig. 3. For the left-right rotation (about the Y-axis), the intersections of visual axes lie within an approximately triangular area, whose extent is 3.19 mm × 1.35 mm (X × Z) (see Fig. 3 (a)). For the up-down rotation (about the X-axis), visual axes form part of a conical surface, such that they intersect at the apex of the cone, i.e., $[-0.57, 0, 0]^T$. Besides these two particular rotations, most eye rotations combine both pan and tilt. The extent of all intersections is 3.03 mm × 6.32 mm × 1.34 mm (X × Y × Z) (see Fig. 3 (b)). The distribution of all intersections is [−0.52 ± 0.22, 0.01 ± 0.25, −0.15 ± 0.31] mm (mean ± standard deviation). Therefore, all visual axes intersect in a small region, and the assumption of the eyeball pseudo-center is approximate. Although the above simulation is based on the left eye model with typical parameters and the practical rotation range of the visual axis, the resulting region of intersections is similar when individual differences and the right eye are considered.
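The simulation procedure described above can be sketched in a few lines. The sketch assumes the stated typical parameters ($EN = 6.5$ mm, $\kappa = 5^\circ$) and the rotation order $R_y(\alpha) R_x(\beta)$, and it treats the midpoint of the common perpendicular as the intersection of two visual axes; the function names are hypothetical, and the code makes no attempt to reproduce the reported statistics.

```python
import math


def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rot_x(b):
    c, s = math.cos(b), math.sin(b)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]


def visual_axis(alpha_deg, beta_deg, kappa_deg=5.0, en_mm=6.5):
    """Nodal point N and direction v of the visual axis after pan/tilt."""
    a, b, k = (math.radians(x) for x in (alpha_deg, beta_deg, kappa_deg))
    n0 = [0.0, 0.0, en_mm]                 # N_0 = [0, 0, EN]^T
    v0 = [math.sin(k), 0.0, math.cos(k)]   # v_0 = R_y(kappa) z
    rotate = lambda x: matvec(rot_y(a), matvec(rot_x(b), x))
    return rotate(n0), rotate(v0)


def pseudo_intersection(axis1, axis2):
    """Midpoint of the common perpendicular of two lines (None if parallel)."""
    (p1, d1), (p2, d2) = axis1, axis2
    dot = lambda x, y: sum(xi * yi for xi, yi in zip(x, y))
    w0 = [x - y for x, y in zip(p1, p2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None
    s = (b * e - c * d) / denom   # parameter of the closest point on line 1
    t = (a * e - b * d) / denom   # parameter of the closest point on line 2
    q1 = [p + s * v for p, v in zip(p1, d1)]
    q2 = [p + t * v for p, v in zip(p2, d2)]
    return [(x + y) / 2.0 for x, y in zip(q1, q2)]


# Two pure tilt rotations: both axes pass through the cone apex on the X-axis,
# at x = -6.5 * tan(5 deg), i.e. about -0.57 mm.
print(pseudo_intersection(visual_axis(0, 10), visual_axis(0, -20)))
```

Looping `visual_axis` over the sampled $(\alpha, \beta)$ grid and collecting `pseudo_intersection` for all pairs would give the point cloud whose extent and spread are summarized in the text.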

B. SENSITIVITY ANALYSIS OF THE EYEBALL PSEUDO-CENTER IN GAZE ESTIMATION
Although all visual axes intersect in a small region, this is still insufficient to accept the assumption of the eyeball pseudo-center, unless the assumption has little effect on gaze estimation. To this end, we analyze the impact of the eyeball pseudo-center on gaze estimation. Note that the following analysis ignores the extrapolation error of the mapping surface and focuses on the parallel error. Fig. 4 shows the sensitivity analysis of the eyeball pseudo-center in gaze estimation. When a user is looking at a ground-truth point $G_g$, there are two kinds of visual axes through the mapping point $P$, namely the ground-truth visual axis $PG_g$ and the estimated visual axis $PG_e$ (see Fig. 4 (a)). According to [15], [16], the angular error of gaze points is calculated approximately by Equation (4), where $G_g G_e$ is the planar error of gaze estimation, and $Z_G$ denotes the depth distance of $G_g$. Based on triangular similarity, $G_g G_e$ can be computed by Equation (5). Substituting Equation (5) into Equation (4) yields Equation (6).
We first investigate the effect of the ideal eyeball pseudo-center on gaze estimation. The ideal eyeball pseudo-center $O$ is the average position of the intersecting region in Section IV-A. Then, $OP$ is in the interval $[-3\sigma, 3\sigma]$ with a high probability of about 99.7% (assuming a Gaussian distribution), where $\sigma$ is the standard deviation on the vertical plane in the intersecting region. Hence, $OP \le 3 \times \sqrt{0.22^2 + 0.25^2} \approx 1$ mm. Since $E$ is quite small, less than $0.1^\circ$ (see Fig. 4 (b)), it is reasonable to use the ideal eyeball pseudo-center to approximately represent the intersection of all real visual axes. Therefore, the ideal estimated visual axis can represent the ground-truth visual axis, as shown in Fig. 5 (a).
We then study the influence of the estimated eyeball pseudo-center on gaze estimation. In the spherical coordinate system, $P = (l_P, \phi_P, \gamma_P)$ and $O = (l_O, \phi_O, \gamma_O)$. Here, $\phi_P$ and $\phi_O$ determine the phase of $f$, and $\gamma_P$ and $\gamma_O$ determine the amplitude of $f$.
The minimum of $f$ is 0 when $O$, $P$, and $O_g$ are collinear, i.e., $\phi_P = \phi_O$ and $\gamma_P = \gamma_O$. The maximum is $f = |\cos\gamma_O|(\tan\gamma_P + \tan\gamma_O)$ when $\phi_P = 180^\circ + \phi_O$ and $\gamma_P$ is set to its maximum. Note that $f = 1$ when $\gamma_O = 90^\circ$. Fig. 5 (b) shows the distribution of $f$ with different $\gamma_P$ and $\phi_P$. Besides, Fig. 5 (c) shows that the mean values of $f$ change slightly when $\gamma_O$ increases. Therefore, $f(\gamma_P, \phi_P; \gamma_O, \phi_O)$ can be regarded as the phase term of $E$. According to Equation (5), Equation (7) is obtained, since $Z_O \ll Z_P$. Substituting Equation (7) into Equation (4) gives an expression in which $m(Z_P, Z_G, \gamma_O)$ and $f(\gamma_P, \phi_P; \gamma_O, \phi_O)$ approximately represent the amplitude term and phase term of $E$, respectively. $m(Z_P, Z_G, \gamma_O)$ is similar to Equation (6), except that the amplitude is magnified by a factor of $l_O$. If $l_O$ is larger than 10 mm, $E$ may exceed $1^\circ$. Therefore, the error of the estimated eyeball pseudo-center has a significant effect on gaze estimation. In summary, the assumption of the eyeball pseudo-center is conditional. To be specific, it is acceptable only if
• the distance between $Z_P$ and $Z_G$ is small and the product $Z_P Z_G$ is large;
• the error of the estimated eyeball pseudo-center, $l_O$, is small.
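The two conditions can be checked numerically with a small geometric sketch. It assumes that the angular error is approximated by the arctangent of the planar error at the target depth divided by that depth, and it compares the ray from the true pseudo-center through the mapping point with the ray from a displaced, estimated pseudo-center through the same mapping point. The function names and the values (in millimetres) are hypothetical.

```python
import math


def point_at_depth(origin, through, z):
    """Point where the line from `origin` through `through` reaches depth z."""
    w = (z - origin[2]) / (through[2] - origin[2])
    return tuple(o + w * (t - o) for o, t in zip(origin, through))


def pseudo_center_error_deg(O_true, O_est, P, z_gaze):
    """Approximate angular error caused by a misplaced pseudo-center."""
    G_true = point_at_depth(O_true, P, z_gaze)
    G_est = point_at_depth(O_est, P, z_gaze)
    planar = math.sqrt(sum((a - b) ** 2 for a, b in zip(G_true, G_est)))
    return math.degrees(math.atan2(planar, z_gaze))


# Hypothetical setup: pseudo-center estimate off by 10 mm laterally,
# mapping surface at 1 m, targets at 0.3 m, 0.5 m, 1 m and 2 m.
O_true, O_est, P = (0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (100.0, 0.0, 1000.0)
for z in (300.0, 500.0, 1000.0, 2000.0):
    print(z, round(pseudo_center_error_deg(O_true, O_est, P, z), 3))
```

In this lateral-offset case the planar error reduces to $l_O\,|Z_G - Z_P| / Z_P$, so the error vanishes on the mapping surface, grows with the depth gap, and is larger on the near side than on the far side for the same gap, in line with the two conditions listed above.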

C. CALIBRATION OF THE EYEBALL PSEUDO-CENTER
To reduce the error of the estimated eyeball pseudo-center, we build a calibration model of the eyeball pseudo-center. After calibrating the mapping surface, at least two additional calibration points outside the mapping surface are required to calibrate the eyeball pseudo-center $O$. With $n$ additional calibration points $\{C_i\}_{i=1}^{n}$, $n \ge 2$, $C_i \notin \mathcal{P}$, we can use the corresponding eye features to calculate the corresponding mapping points $\{P_i\}_{i=1}^{n}$, which have extrapolation errors [25], as mapping points outside the calibration area are extrapolated. According to [15], extrapolation errors are not uniform along the mapping surface but increase towards the edges of the surface. The main reason is that detecting the limbus is more difficult at the edges than in the middle. Hence, it is assumed that the maximum radius of the extrapolation error is $\Delta P \approx \varepsilon\,|OP|$, where $\varepsilon$ is a small constant, such as $\varepsilon = \tan 0.5^\circ \approx 0.0087$. Thus, each line $C_i P_i$ forms an error cone. Increasing the accuracy of the eyeball pseudo-center is equivalent to decreasing the intersection volume of all error cones.
Assuming that there are only two additional calibration points with the same depths and lengths, $OC_1 = OC_2$, the error analysis of eyeball pseudo-center calibration with two additional calibration points is illustrated in Fig. 6. There are two auxiliary axes (see Fig. 6 (a)): the Z-axis refers to the horizontal forward direction; the d-axis refers to the line connecting $O$ to the midpoint of $P_1 P_2$. The angle between the d-axis and the Z-axis is $\gamma_d$. The angle between the d-axis and $OP_1$ is $\gamma_P$. Then $OP_1 = Z_P / (\cos\gamma_P \cos\gamma_d)$. The two error cones formed by $C_1 P_1$ and $C_2 P_2$ intersect to form a diamond-like region (see Fig. 6 (b)). Based on triangular similarity, expressions for the region width $h_O$ and then for $d_O$ are obtained. Reducing $|h_O|$ and $|d_O|$ is equivalent to increasing the denominators of these expressions. As $\varepsilon$ is small, these expressions can be simplified. The error region of $O$ can be decreased by
• increasing the distance between $Z_P$ and $Z_C$ and decreasing $Z_P Z_C$ (see Fig. 6 (c));
• decreasing the angle between the d-axis and the Z-axis, i.e., $\gamma_d$, whose optimal value is 0, which implies that the two calibration points are centrally symmetric about the Z-axis;
• setting the angle between the two additional calibration points to $\gamma_P = 45^\circ$, considering the opposing dependence of $|h_O|$ and $|d_O|$ on $\gamma_P$.
When the two additional calibration points are not in the above situation, i.e., when their depths or lengths are different, the intersection region of the two error cones increases. In addition, Section VI-A demonstrates that gaze estimation is not significantly different using different numbers of calibration points.
Based on these conclusions and the conclusions in Section IV-B, the optimal configuration of gaze estimation with one mapping surface can be derived, as shown in Fig. 7.
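One way to turn the lines $C_i P_i$ into an estimate of the eyeball pseudo-center is a least-squares intersection of lines. The sketch below uses this standard technique as a stand-in; it is an assumption and not necessarily the analytical solution used in this paper. For each line with unit direction $d_i$ through $C_i$, it accumulates $(I - d_i d_i^T)$ and solves the resulting 3 × 3 system with Cramer's rule. All names are hypothetical.

```python
import math


def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))


def solve3(A, b):
    """Solve a 3x3 linear system with Cramer's rule."""
    D = det3(A)
    if abs(D) < 1e-15:
        raise ValueError("lines are parallel; pseudo-center is not determined")
    return [det3([[b[i] if j == k else A[i][j] for j in range(3)]
                  for i in range(3)]) / D for k in range(3)]


def closest_point_to_lines(lines):
    """Point minimizing the summed squared distance to lines (C_i, P_i)."""
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for C, P in lines:
        d = [p - c for p, c in zip(P, C)]
        norm = math.sqrt(sum(x * x for x in d))
        d = [x / norm for x in d]
        for i in range(3):
            for j in range(3):
                m = (1.0 if i == j else 0.0) - d[i] * d[j]  # I - d d^T
                A[i][j] += m
                b[i] += m * C[j]
    return solve3(A, b)
```

With exactly two lines this returns the midpoint of their common perpendicular, and with noise-free lines it returns their common intersection; with noisy mapping points, the spread of the result reflects the error region discussed above.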

V. GAZE ESTIMATION WITH TWO MAPPING SURFACES
Since the implicit assumption of the eyeball pseudo-center is conditional, an alternative method that replaces the assumption with another mapping surface is proposed to further improve accuracy under the same framework (same devices and similar calibration procedures). Thus, two points on the two mapping surfaces are calculated from an input eye feature, and both points lie on the same visual axis. The gaze direction equation is changed from Equation (1) as follows:
$$G = P_1 + w\,(P_2 - P_1),$$
where $P_1$ and $P_2$ are the corresponding points on the two mapping surfaces, respectively. To employ this method, we analyze the errors outside the mapping surfaces and consider how to configure the two calibration surfaces, as shown in Fig. 8 (a). According to the extrapolation error in Section IV-C, the maximum error radius on the mapping surface is $r = \varepsilon Z / \cos\gamma$, where $\gamma$ is the angle between the gaze direction and the Z-axis. Then, the error range in the two mapping surfaces forms an error cone whose vertex is $O$. When the target is outside the two mapping surfaces, $E$ has a maximum value when the error directions of $P_1$ and $P_2$ are opposite (see red lines in Fig. 8 (a)). According to Equation (4), $E$ is calculated piecewise, with one expression for this case and another otherwise.
FIGURE 8. (a) Illustration of the errors outside two mapping surfaces whose depths are $Z_1$, $Z_2$. $G$ is the gaze point. $P_1$, $P_2$ are ideal mapping points. $\gamma$ is the angle between the gaze direction and the Z-axis. $r_1$ and $r_2$ are the maximum error radii on the mapping surfaces. The red lines correspond to the maximum of $E$. (b) The relationship between $E$ and $Z_G$ with different $Z_1$ and $Z_2$ when $\gamma = 0$.
Since the parallel error outside the two surfaces is larger than that between the two surfaces, it is concluded that the fixation target should be located between the two mapping surfaces. Besides, when the target object is outside the two surfaces, the angular error on the near side is larger than that on the far side (see Fig. 8 (b)).
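The behaviour described above can be reproduced qualitatively with a small worst-case sketch. It assumes $\gamma = 0$, error radii $r_i = \varepsilon Z_i$ with $\varepsilon = \tan 0.5^\circ$, a gaze line through the two (perturbed) mapping points, and an angular error approximated by the arctangent of lateral error over depth. It enumerates the sign combinations of the two perturbations rather than using the closed-form expression, which is not reproduced here; all names are hypothetical.

```python
import math

EPS = math.tan(math.radians(0.5))  # small constant from the text, about 0.0087


def gaze_point_two_surfaces(P1, P2, z):
    """Point at depth z on the line through the two mapping points."""
    w = (z - P1[2]) / (P2[2] - P1[2])
    return tuple(a + w * (b - a) for a, b in zip(P1, P2))


def worst_case_error_deg(z1, z2, z_gaze, eps=EPS):
    """Largest angular error over same-side and opposite-side perturbations."""
    r1, r2 = eps * z1, eps * z2  # maximum error radii for gamma = 0
    worst = 0.0
    for s1 in (+1.0, -1.0):
        for s2 in (+1.0, -1.0):
            P1 = (s1 * r1, 0.0, z1)
            P2 = (s2 * r2, 0.0, z2)
            G = gaze_point_two_surfaces(P1, P2, z_gaze)
            worst = max(worst, abs(G[0]))  # true gaze point lies on the Z-axis
    return math.degrees(math.atan2(worst, z_gaze))


# Surfaces at 1 m and 2 m; targets before, between and after them.
for z in (0.5, 1.5, 3.0):
    print(z, round(worst_case_error_deg(1.0, 2.0, z), 3))
```

Under these assumptions the worst-case error stays at about $0.5^\circ$ between the surfaces and grows outside them, faster on the near side than on the far side, which matches the qualitative conclusion drawn from Fig. 8 (b).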

VI. EXPERIMENTAL RESULTS
Since the proposed model with mapping surfaces requires calibration points on at least two planes, the datasets for table-mounted gaze estimation, such as MPIIGaze [30], are unsuitable. In this work, experiments are conducted on two publicly available datasets for head-mounted gaze estimation, namely the Mansouryar dataset [14] and the Elmadjian dataset [24]. The two datasets have small and large usage ranges, respectively. We first validate and evaluate the proposed methods on the Mansouryar dataset [14]. We then compared the proposed methods with state-of-the-art methods using multiple calibration planes on the two datasets. Note that methods are evaluated by the angular error calculated by Equation (4), which is a better indicator for gaze direction estimation than the depth error. The Mansouryar dataset [14] contains monocular pupil features and corresponding ground-truth 3D gaze coordinates recorded from 14 subjects and five planes of small depth range, including 1.0 m, 1.25 m, 1.5 m, 1.75 m, and 2.0 m. Each plane has 25 calibration points and 16 test points. After calibrating each mapping surface with Equation (3), the corresponding point on the mapping surface can be calculated for each pupil feature. The calibration error of the calibration plane is calculated as the average of angular errors between the ground-truth calibration points and the mapping points. The extrapolation error on the calibration plane is calculated as the average of angular errors between the ground-truth test points and the mapping points. Fig. 9 (a) shows that this dataset has small calibration and extrapolation errors. The Elmadjian dataset [24] contains binocular pupil features and 3D gaze coordinates recorded from 11 subjects and five planes of large depth range, including 0.75 m, 1.25 m, 1.75 m, 2.25 m, and 2.75 m. Each plane has 20 calibration points and 12 test points. 
The size of the plane increases with the depth to ensure that different planes have the same field of view in the scene camera. As each calibration or test point contains 30 continuous frames, we filtered pupil features to reduce noise. Fig. 9 (b) shows that five subjects (numbers 4, 5, 6, 8, and 9) have large calibration and extrapolation errors ($> 2^\circ$). Thus, we discarded these data in the following experiments.
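The calibration and extrapolation errors described above are averages of angular errors between ground-truth points and mapping points. A minimal sketch of such a metric follows, assuming the angle is measured at the origin of the scene-camera coordinate system between the rays to the two points; the choice of vantage point and the function names are assumptions.

```python
import math


def angular_error_deg(target, estimate, origin=(0.0, 0.0, 0.0)):
    """Angle in degrees between the rays origin->target and origin->estimate."""
    a = [t - o for t, o in zip(target, origin)]
    b = [e - o for e, o in zip(estimate, origin)]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cos_angle = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


def mean_angular_error_deg(targets, estimates):
    """Average angular error over paired ground-truth and estimated points."""
    errors = [angular_error_deg(t, e) for t, e in zip(targets, estimates)]
    return sum(errors) / len(errors)
```

Because the metric depends only on directions, an estimate that lies on the correct ray but at the wrong depth contributes zero error, which is why the angular error is the indicator used for gaze direction estimation.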

A. EVALUATION OF THE METHOD WITH ONE MAPPING SURFACE
After calibrating the mapping surface using a calibration plane, the eyeball pseudo-center is calibrated by two extra calibration points that have the largest distance on another plane. Fig. 10 shows gaze estimation errors for the method using one mapping surface with different calibration combinations, and Table 1 shows average errors and standard deviations. With the same third plane ($Z_P = 1.5$ m) as the calibration plane, Fig. 10 (a) shows that increasing the calibration distance $|Z_P - Z_C|$ can reduce angular errors, and $Z_C$ seems to be better on the near side ($Z_C < Z_P$) than on the far side ($Z_C > Z_P$). Besides, decreasing $|Z_G - Z_P|$ yields smaller angular errors, and $Z_G$ is better on the far side ($Z_G > Z_P$) than on the near side ($Z_G < Z_P$). These results are consistent with the theoretical analysis in Section IV-B. When $|Z_P - Z_C|$ is large, Fig. 10 (b) shows that the average errors of the four combinations are approximately equal, but the standard deviation when $Z_P > Z_C$ is larger than that when $Z_P < Z_C$, because the angular error before the mapping surface is larger than that after the mapping surface. Overall, the optimal combination is $Z_P = 1$ m and $Z_C = 2$ m, whose error is $1.64^\circ \pm 0.75^\circ$. To investigate the effect of the number of additional calibration points on gaze estimation, we used all 25 calibration points on a calibration plane instead of only two points. Fig. 11 shows that there is no significant difference between these two situations (paired t-test: $t = -2.075$, $p = 0.058$).
Fig. 12 shows that the angular error is smallest between the two mapping surfaces ($Z_1 < Z_G < Z_2$), followed by after the second mapping surface ($Z_G > Z_2$), and then before the first mapping surface ($Z_G < Z_1$). These results are consistent with the theoretical analysis in Section V. Overall, the optimal combination is $Z_1 = 1$ m and $Z_2 = 2$ m, whose error is $1.48^\circ \pm 0.83^\circ$ (see Table 2). Compared to the method with one mapping surface, this method has better accuracy and similar precision. We can conclude that the assumption of the eyeball pseudo-center is conditional and affects gaze estimation.
FIGURE 12. Gaze estimation errors for the method using two mapping surfaces on the Mansouryar dataset [14] with different combinations of two calibration planes. Note that angular errors larger than $5^\circ$ are not displayed.

C. COMPARISON WITH OTHER METHODS
We implemented and evaluated the following baseline methods of 3D gaze estimation using multiple calibration planes:
• 2D-to-2D Mapping: As a basic baseline method, the method of 2D-to-2D mapping [14] employs multiple calibration planes to calibrate the regression relationship between pupil features and 2D mapping points. To ensure a comparison under the same conditions, we calibrated Equation (3) using the first and fifth calibration planes.
• Fixed Eyeball Center (FEC): Abbott et al. [10] used a fixed eyeball pseudo-center and a mapping surface to estimate the gaze direction. We used the fifth plane for surface calibration, and the fixed eyeball center was set to (0, 0, −0.05 m) empirically.
• Nonlinear Optimization: The method recently proposed by Su et al. [13] used constrained nonlinear optimization to calculate multiple parameters, including the eyeball pseudo-center and the mapping parameters between pupil features and gaze vectors. We used the first and the fifth planes for calibration. The initial 3D eyeball center was computed from the 2D eyeball center and the intrinsic matrix of the scene camera. The constrained search range of the eyeball center was ±[0.05 m, 0.05 m, 0.02 m]. Due to the limited calibration points, cross-validated input-space partition was not used.
Gaze estimation results and comparison on two datasets are shown in Table 3 and Table 4. The proposed method with two mapping surfaces achieves the best accuracy, followed by the proposed method with one mapping surface. The method of 2D-to-2D mapping [14] has the largest error since it cannot deal with the parallel error. The method with nonlinear optimization [13] has the smallest standard deviation of the error. Compared to the FEC method [10], the proposed method with one mapping surface adds two additional calibration points, and the effect on reducing errors is significant. Compared to the method with nonlinear optimization [13], the proposed method with two mapping surfaces has the same calibration procedure but better accuracy. Besides, as targets may lie within different depth ranges, we evaluated all methods in different usage ranges, including 0.5 m, 1 m, and 2 m. Fig. 13 shows that all methods have better accuracy in a small range than in a large range, and the method with two mapping surfaces has better accuracy across usage ranges than other methods. We can conclude that the proposed model has improved accuracy in both small and large usage ranges.
FIGURE 13. Top: The usage ranges of the Elmadjian dataset [24] and the Mansouryar dataset [14] are 2 m and 1 m, respectively. Bottom: The usage range of 0.5 m on the Mansouryar dataset [14], which means that only three planes are used for calibration and testing.
Fig. 14 shows correlation coefficients between extrapolation errors and gaze estimation methods on the two datasets. Gaze direction estimation methods with multiple calibration planes are correlated, implying their performances are limited by the same underlying factors. For a small usage range, the extrapolation error is the primary error source due to the strong correlation between extrapolation errors and the proposed methods (see Fig. 14 (a)). In contrast, for a large usage range, the parallel error seems to be the primary error source due to the weak correlation between extrapolation errors and the proposed methods (see Fig. 14 (b)).

TABLE 4. Comparison with other methods on the Elmadjian dataset [24]. For the six valid subjects, the two eyes are evaluated separately. "L" and "R" represent the left and right eye.

VII. CONCLUSION
In this paper, we propose a regression-based 3D gaze estimation model with high accuracy using a simple, easily calibrated setup in the head-mounted eye tracker. The key idea is to analyze and leverage the implicit assumption of the eyeball pseudo-center. The simulation of visual axes shows that visual axes intersect in a small region rather than at a point, implying the assumption is approximate. The sensitivity analysis of the eyeball pseudo-center reveals that the assumption is acceptable only if the error of the estimated eyeball pseudo-center is small, and the distance between the fixation target and the mapping surface is small. Hence, we propose a gaze direction estimation method with one mapping surface that satisfies the assumption by a high-quality calibration of the eyeball pseudo-center, which only requires two additional calibration points outside the mapping surface. Furthermore, we propose a gaze direction estimation method with two mapping surfaces that replaces the assumption with another mapping surface. Experimental results indicate that although the assumption of the eyeball pseudo-center is met, the accuracy of gaze estimation is still affected, compared to the gaze estimation method that replaces the assumption. Besides, the proposed method with two mapping surfaces has higher accuracy than state-of-the-art methods. Since extrapolation errors are the basis of error analysis of the proposed model, we should reduce extrapolation errors in further studies. Besides, the robustness of the proposed model should be discussed and improved.