Axial Constraints for Global Matching-Based Optical Flow Estimation

Optical flow estimation is a fundamental task that aims to find the 2-dimensional motion field by identifying correspondences between two input images. For a long time, the predominant pipeline computed a correlation volume and fed it to convolutional neural networks (CNNs) to directly estimate the optical flow. However, several pioneering methods recently proposed global matching, pointing out that CNN-based methods struggle to handle large displacements due to their locality. Global matching identifies global correspondences at the pixel level by processing the entire correlation volume at once with simple operations such as softmax. However, when global matching with softmax is combined with the regression loss commonly used in optical flow estimation, a vast number of possible correlation volumes can minimize the regression loss and correctly estimate correspondences. In other words, the training objective induces a one-to-many solution problem, resulting in noisy gradients. In this paper, the necessity of additional constraints on the correlation volume to mitigate this ill-posed problem is discussed. To impose such constraints, an axial cross-entropy loss (i.e., axial constraints) is proposed that restricts the correlation volume to have low variance with a designed pseudo ground truth. Experimental results show that axial constraints are easily applicable to off-the-shelf global matching-based optical flow estimation frameworks and lead to both quantitative and qualitative performance improvements without any architectural changes.

Apart from such efforts, GMFlow [29] and GMFlowNet [30] suggested a new type of method, namely global matching. While previous CNN-based techniques used implicit methods to resolve the locality of CNNs, global matching tries to capture large displacements explicitly. To do so, it determines pixel-level dense correspondences by exploiting the full-resolution correlation volume at once, using computationally cheap operations.
There are two straightforward ways to determine correspondences with minor computational overhead. First, the argmax function can be used. By applying the argmax function to the correlation volume, which represents the similarity between every pixel pair, the target frame pixel with the highest similarity to the source frame pixel is determined as the correspondence. Although using argmax seems simple and effective, it is not differentiable and cannot handle continuous optical flow because it decides correspondences on discrete pixel grids. Second, the softmax function can be used to determine correspondences from the correlation. After applying the softmax operation to the correlation volume, it can be interpreted as a discrete probability distribution over target pixel grids, where each bin indicates the matching probability between the source pixel and the corresponding target pixel. The correspondence can then be determined as the expected value of this probability distribution. Since the expected value is continuous, it can represent the continuous nature of optical flow. Moreover, it is differentiable, since the operations involved in calculating the expected value are differentiable.
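As a concrete illustration of the softmax-based alternative described above, the following NumPy sketch computes the expected target coordinate from the correlation map of a single source pixel (function and variable names are illustrative, not from any released implementation):

```python
import numpy as np

def soft_correspondence(corr_map, tau=1.0):
    """Differentiable correspondence from a 2D correlation map (sketch).

    corr_map: (H, W) similarities between one source pixel and every
    target pixel. Returns the expected (x, y) target coordinate."""
    h, w = corr_map.shape
    logits = corr_map.reshape(-1) / tau
    logits -= logits.max()                       # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()    # softmax -> probability map
    p = p.reshape(h, w)
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    return (p * xs).sum(), (p * ys).sum()        # expected value (soft-argmax)
```

Unlike argmax, the result is sub-pixel continuous and every operation involved is differentiable.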
Despite the advantages of using the softmax operation in global matching, there are drawbacks when it is combined with the regression loss commonly used for optical flow estimation [26], [27], [29], [30], [31], [32]. As shown in Figure 1, when using the softmax operation for global matching, various correlation volumes can result in the same correspondence and hence the same estimated flow. Due to this one-to-many relationship between the correspondence and the correlation volume, minimizing the regression loss between the estimated flow and the ground truth flow also admits a vast number of possible solutions. This large number of possible solutions can induce noisy gradients and unstable training [33].
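The one-to-many issue can be made concrete with a toy 1D example (illustrative numbers only): two very different probability distributions share the same expected value, so a regression loss on the expected value cannot tell them apart.

```python
import numpy as np

grid = np.arange(5, dtype=float)                # 1D target pixel grid
p_sharp = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # low-variance distribution
p_flat = np.full(5, 0.2)                        # high-variance distribution

e_sharp = (p_sharp * grid).sum()                # expected correspondence
e_flat = (p_flat * grid).sum()                  # same expected correspondence
var_sharp = (p_sharp * (grid - e_sharp) ** 2).sum()
var_flat = (p_flat * (grid - e_flat) ** 2).sum()
```

Both distributions minimize a regression loss targeting the coordinate 2.0, yet only the sharp one reflects a confident match; this ambiguity is what the proposed constraints remove.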
To alleviate such training instability, this paper argues for incorporating additional constraints on the correlation volume. In this regard, the axial cross-entropy loss (i.e., axial constraints) is proposed to accomplish such constraints. The axial cross-entropy loss can reduce the number of possible solutions by constraining the probability distribution over each axis, in addition to constraining the expected value of the probability distribution. The pseudo ground truth of the axial constraints is designed so that each axis of the correlation volume has low variance while still being able to represent a continuous optical flow. The axial constraints with pseudo ground truth are expected to fully exploit the benefits of using softmax in global matching by weakening the underconstrained problem. The effectiveness of the proposed axial cross-entropy loss is demonstrated by incorporating it into previous optical flow estimation methods in Section V.
FIGURE 1. One-to-many relationship between correspondence and correlation volumes. To determine the correspondence in the target frame (red) for the source frame pixel (green) by global matching, the softmax function is applied to the 2D spatial map (green rectangle) from the correlation volume. This converts the 2D spatial map into a discrete probability map (P_map) over the target pixel grid, where P_map(u, v) represents the probability of the source pixel moving to position (u, v) in the target frame. The expected value of this probability distribution is taken as the correspondence and should be the coordinates of the red pixel in the figure. However, numerous P_map, i.e., correlation volumes, satisfy this condition as shown in (b). This underconstrained formulation can produce noisy gradients, unstable training, and convergence to local minima.

The key contributions of this paper are summarized as follows:
• Pointed out that, despite the effectiveness of the softmax operation in the global matching step, it is not well compatible with regression losses, since it formulates an underconstrained problem due to the one-to-many relationship between the correspondence and the correlation volume.
• Proposed axial constraints that impose a low-variance restriction on the correlation volume to alleviate the underconstrained formulation. Specifically, a 2-bin pseudo ground truth for the axial cross-entropy loss is designed to achieve this low-variance restriction.
• To validate the effectiveness of axial constraints, applied the axial cross-entropy loss to the off-the-shelf optical flow estimation model GMFlow, which includes a global matching step with the softmax operation. Axial constraints are simple and highly applicable, and they improve both qualitative and quantitative results.
This paper consists of several sections that present a method for mitigating the ill-posed problem of global matching-based optical flow estimation. First, previous techniques in optical flow estimation and stereo matching that exploit correlation volumes are reviewed in Section II. Section III then elaborates the concept of the axial constraints and the detailed methods to construct the pseudo ground truth of the axial cross-entropy loss. The design of the pseudo ground truth is further discussed in Section IV, and experimental results with axial constraints are presented in Section V. Finally, a summary of the contributions and potential future research directions are provided in Section VI.

II. RELATED WORK
A. OPTICAL FLOW ESTIMATION WITH CORRELATION VOLUME FOLLOWED BY CNN
With the advent of deep learning-based methods, optical flow estimation has achieved enormous performance improvements. After it was revealed that the use of a correlation volume is helpful for optical flow estimation [20], most optical flow estimation methods [21], [22], [23], [24], [25], [26], [27], [28], [31] adopt the following pipeline: 1) extract features from each frame, 2) compute the correlation volume, 3) estimate the optical flow using a CNN taking the correlation volume as input, and 4) supervise the estimated flow with a regression loss. Although this pipeline improved performance, such methods suffer from performance drops for large motions, since CNNs have difficulty capturing the global view. To mitigate the large-displacement issue, some prior techniques proposed coarse-to-fine approaches [23], [24], [25], [34], [35], [36] and demonstrated their effectiveness. Nevertheless, coarse-to-fine approaches are still not robust to small and fast-moving objects and have difficulty resolving errors that occur at the coarse scale. RAFT [26] is a notable work that introduced an iterative refinement step to overcome these limitations. As the iterative step progresses, the search space is enlarged to handle large movements implicitly. In the past few years, many techniques [27], [28], [31], [37], [38] have enjoyed the effectiveness of such iterative refinements. Specifically, FlowFormer [38], GMFlowNet [30], and CRAFT [31] combine a transformer encoder, which extracts features with global information, with a correlation volume followed by iterative refinements, and have demonstrated state-of-the-art performance.

B. GLOBAL MATCHING-BASED OPTICAL FLOW ESTIMATION
Aside from the aforementioned efforts, two pioneering methods recently proposed a global matching step [29], [30] to handle large displacements explicitly and have shown state-of-the-art performance. GMFlow [29] pointed out that iterative refinements require linearly increasing inference time due to their sequential computations. To achieve both high accuracy and efficiency, it introduced a global matching step to estimate optical flow and completely removed the CNN layers that take the correlation volume as input, showing that the global matching step can effectively handle large displacements. GMFlow uses softmax to determine correspondences in global matching and supervises the estimated flow with the commonly used regression loss. Similar to GMFlow, GMFlowNet [30] argues that directly regressing the optical flow from the correlation volume with a CNN cannot explicitly capture large displacements, so it introduces a global matching step before the regression with the CNN. It estimates the initial flow with global matching using argmax and iteratively refines the estimate using a CNN. GMFlowNet introduced a matching loss based on binary cross-entropy supervision because the initial flow from argmax is not differentiable, so regression supervision cannot be applied to the initialized flow.
Both methods successfully perform optical flow estimation with global matching, but each has drawbacks in the global matching step. First, the argmax used in GMFlowNet cannot represent the continuous nature of the displacement and is not differentiable. Second, the combination of global matching with softmax and the regression loss used in GMFlow admits a vast number of possible solutions, which formulates an underconstrained problem. To fully exploit the ability of global matching with softmax, which can express continuous flow, this paper suggests adding more constraints on the correlation volume to narrow down the potential solution space and thereby alleviate the underconstrained problem.

C. CROSS ENTROPY LOSS ON CORRELATION VOLUME
In stereo matching, several approaches apply cross-entropy losses on correlation volumes [33], [39], [40], [41]. For the cross-entropy supervision, they created various forms of pseudo ground truth, such as Gaussian and Laplacian distributions, and some of them even used the focal loss [42]. Their fundamental idea of constraining the correlation volume is similar to axial constraints. However, there are task domain differences between stereo matching and optical flow estimation, since the former is a 1-dimensional matching task while the latter has to predict much denser correspondences in 2-dimensional space. Thus, directly applying the same loss is difficult due to the different dimensionality and would not lead to the same consequences. Nuanes et al. [33] proposed a stereo depth prediction method that uses ground truth mathematically equivalent to the axial constraint; however, besides the domain differences, they also conduct algorithmic post-processing on the learned correlation volume at inference time. They locally normalize the neighborhood of the largest correlation value for inference, while axial constraints do not require any algorithmic post-processing.
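As a hedged sketch of the stereo-matching style of supervision mentioned above (an illustration in the spirit of those works, not any cited paper's actual code), a soft 1D ground truth over disparity bins can be built as a discretized Gaussian centered at the true disparity:

```python
import numpy as np

def gaussian_disparity_gt(d_gt, max_disp, sigma=1.0):
    """Soft 1D pseudo ground truth over disparity bins (sketch).

    Discretizes a Gaussian centered at the ground truth disparity d_gt
    and renormalizes so the bins form a valid distribution."""
    bins = np.arange(max_disp)
    w = np.exp(-0.5 * ((bins - d_gt) / sigma) ** 2)
    return w / w.sum()
```

This 1D construction does not transfer directly to optical flow, where the matching space is 2-dimensional, which is the motivation for the axial design below.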

III. METHODOLOGY
Using softmax in global matching has benefits such as differentiability and the ability to express continuous displacements. On account of these advantages, the proposed axial cross-entropy loss focuses on fully exploiting the benefits of global matching with softmax by alleviating the underconstrained problem. To mitigate it, the axial cross-entropy loss adds more constraints on correlation volumes to narrow the solution space and, eventually, reduce the ambiguity in the training process. The axial cross-entropy loss achieves such constraints by enforcing a low-variance restriction on correlation volumes based on a 2-bin axial pseudo ground truth. This section first clarifies the overall global matching process and terminology in Section III-A, then elaborates the ways to define the soft pseudo ground truth in Section III-B and the application of the axial cross-entropy loss to impose restrictions on the correlation volume in Section III-C and Section III-D. The final supervision to train the optical flow estimation model is stated in Section III-E.

[Figure caption: The axial cross-entropy loss is computed by performing summation along each axis to convert the correlation volume into an axial form (①) and producing the cross-entropy loss between P_x(1, 3) and Φ^x(1, 3), and between P_y(1, 3) and Φ^y(1, 3).]

A. FLOW ESTIMATION WITH GLOBAL MATCHING
Numerous prior techniques [23], [26], [29], [30] extract features F_1, F_2 ∈ R^{H×W×C} from two consecutive images, the source image I_1 and the target image I_2, then formulate a correlation volume C ∈ R^{H×W×H×W} as follows:

C(i, j, u, v) = F_1(i, j) · F_2(u, v) / √C,   (1)

where C in the denominator denotes the feature dimension. For global matching with softmax, the softmax function is applied to C to convert the correlation volume into discrete probability distributions over target frame pixels as follows:

P(i, j)(u, v) = exp(C(i, j, u, v) / τ) / Σ_{u′,v′} exp(C(i, j, u′, v′) / τ),   (2)

where P(i, j) is a probability map representing the degree to which each target frame pixel matches the source frame pixel at (i, j). A temperature scaling parameter [43] τ is incorporated to further suppress the small values in the correlation volume. Note that the variance of the probability distribution becomes lower if τ is smaller than 1.0.
With the probability map computed by Eqn. 2, the expected value over a 2-dimensional discrete target frame pixel grid G ∈ R^{H×W×2} is predicted as the correspondence:

M_pred(i, j) = Σ_{u,v} P(i, j)(u, v) G(u, v),   (3)

where G(u, v) = (u, v), with u ∈ [0, W − 1] and v ∈ [0, H − 1] for the horizontal and vertical dimensions, respectively. Finally, by calculating the difference between the estimated correspondence and the source pixel coordinates, the predicted optical flow f_pred is acquired as follows:

f_pred(i, j) = M_pred(i, j) − (i, j).   (4)
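Putting the steps of this subsection together, a minimal NumPy sketch of the global matching step (correlation volume, temperature-scaled softmax, expected coordinates, flow) might look as follows; all names are illustrative, not from the released GMFlow code:

```python
import numpy as np

def global_matching_flow(f1, f2, tau=1.0):
    """Sketch of global matching: correlation, softmax, expectation, flow.

    f1, f2: (H, W, C) feature maps of the source and target frames.
    Returns a dense (H, W, 2) flow field."""
    h, w, c = f1.shape
    # Correlation volume: dot product of every source/target feature pair.
    corr = np.einsum('ijc,uvc->ijuv', f1, f2) / np.sqrt(c)
    logits = corr.reshape(h, w, -1) / tau
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=-1, keepdims=True)             # P(i, j) over target pixels
    p = p.reshape(h, w, h, w)
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    m_pred_x = (p * xs).sum(axis=(2, 3))           # expected target x
    m_pred_y = (p * ys).sum(axis=(2, 3))           # expected target y
    # f_pred = M_pred - source pixel coordinates.
    return np.stack([m_pred_x - xs, m_pred_y - ys], axis=-1)
```

With identical, well-separated features for both frames and a small temperature, the predicted flow collapses to zero, as expected for a static scene.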

B. PSEUDO GROUND TRUTH
The axial cross-entropy loss requires a pseudo ground truth to narrow the number of possible correlation volumes that result in the ground truth flow. Note that the overall process to define the pseudo ground truth is illustrated in Figure 2. Before constructing the pseudo ground truth, the coordinates of the desired correspondences M_gt with ground truth flow f_gt are calculated as follows:

M_gt(i, j) = (i, j) + f_gt(i, j).

When using the regression loss alone, any correlation volume that makes M_pred(i, j) equal to M_gt(i, j) can minimize the objective. However, an enormous number of correlation volumes meet this condition, leading to the underconstrained problem. The proposed axial constraint addresses this problem by adding a low-variance condition to restrict the range of possible optimal correlation volumes. Specifically, the pseudo ground truth for the axial constraints is designed as a 2-bin distribution to encourage the correlation volume to have low variance while ensuring that the expected value over the softmax-applied correlation volume equals M_gt, which aligns with the original regression objective.

The designed 2-bin soft pseudo ground truth for the x-axis, Φ^x(i, j) ∈ R^W, places probability 1 − ρ_x at bin ⌊M^x_gt(i, j)⌋ and probability ρ_x at bin ⌊M^x_gt(i, j)⌋ + 1, where M^x_gt(i, j) is the x-axis coordinate of M_gt(i, j) and ρ_x indicates the fractional part of M^x_gt(i, j). Note that the expected value over Φ^x(i, j) is aligned with the regression supervision M^x_gt(i, j), since (1 − ρ_x)⌊M^x_gt(i, j)⌋ + ρ_x(⌊M^x_gt(i, j)⌋ + 1) = M^x_gt(i, j). Similarly, the pseudo ground truth for the y-axis, Φ^y(i, j) ∈ R^H, is defined, and the expected value over Φ^y(i, j) is likewise identical to M^y_gt(i, j).

[Algorithm 1 (fragments): compute P_x and P_y by Eqn. 15 and Eqn. 16 (axial summations along the x-axis and y-axis, respectively); λ is a loss-balancing hyperparameter; the loss is averaged over the batch; the parameters θ of the model are updated by gradient descent with learning rate α, θ ← θ − α∇_θ L_B.]
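The 2-bin construction above can be sketched as follows (a hypothetical helper, not the authors' code); the two bins split the probability mass according to the fractional part, so the expectation recovers M^x_gt exactly:

```python
import numpy as np

def two_bin_pseudo_gt(m_gt_x, width):
    """2-bin soft pseudo ground truth for the x-axis (sketch).

    Puts mass 1 - rho at floor(M_gt^x) and rho at floor(M_gt^x) + 1,
    where rho is the fractional part, so the expected value over the
    bins equals M_gt^x (assumes 0 <= m_gt_x <= width - 1)."""
    phi = np.zeros(width)
    lo = int(np.floor(m_gt_x))
    rho = m_gt_x - lo
    phi[lo] = 1.0 - rho
    if rho > 0.0:                 # avoid indexing past the grid edge
        phi[lo + 1] = rho
    return phi
```

For instance, `two_bin_pseudo_gt(3.25, 8)` puts 0.75 at bin 3 and 0.25 at bin 4, and `(phi * np.arange(8)).sum()` gives back 3.25.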

C. OCCLUDED REGION FILTERING
In the previous section, the pseudo ground truth is defined using the matching coordinates M_gt. However, regions where occlusion occurs have no matching coordinates in the target frame. Therefore, it is necessary to filter out the occluded regions when giving supervision. First, a set of pixels O_out moving outside the target frame is collected as follows:

O_out = {(i, j) | M_gt(i, j) ∉ [0, W − 1] × [0, H − 1]}.

To find the parts that are obscured inside the target frame, I_2 is warped with the flow f_gt into I_{2→1} by the backward warping operation ψ_B [44] as follows:

I_{2→1} = ψ_B(I_2, f_gt),

where I_{2→1} is a pseudo I_1 produced from the pixels of I_2. If I_{2→1}(i, j) greatly differs from I_1(i, j), the pixel at I_1(i, j) is expected to be obscured in I_2. Therefore, a set of pixels O_obs of obscured parts can be found and excluded with a threshold parameter T as follows:

O_obs = {(i, j) | ‖I_{2→1}(i, j) − I_1(i, j)‖_1 > T}.

Finally, the union of these two types of occluded regions, O_mask = O_out ∪ O_obs, is excluded when providing supervision with the axial cross-entropy loss.
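The two filtering conditions can be sketched in a few lines of NumPy (hypothetical helper; the backward-warped image I_{2→1} is assumed to be computed beforehand):

```python
import numpy as np

def occlusion_mask(i1, i2_warped, m_gt, thresh=20.0):
    """Combine out-of-frame and photometric occlusion masks (sketch).

    i1, i2_warped: (H, W, 3) images, where i2_warped is I_2 backward-
    warped by f_gt (i.e., I_{2->1}); m_gt: (H, W, 2) target coordinates.
    Returns a boolean (H, W) mask of pixels excluded from supervision."""
    h, w = i1.shape[:2]
    out = (m_gt[..., 0] < 0) | (m_gt[..., 0] > w - 1) | \
          (m_gt[..., 1] < 0) | (m_gt[..., 1] > h - 1)        # O_out
    diff = np.abs(i2_warped.astype(float) - i1.astype(float)).sum(-1)
    obs = diff > thresh                                       # O_obs
    return out | obs                                          # O_mask
```

The threshold `thresh` plays the role of T above (T = 20 in the experiments of Section V).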

D. AXIAL CROSS-ENTROPY LOSS
To provide axial cross-entropy supervision for non-occluded regions, the summation along each axis of P(i, j) ∈ R^{H×W} is first computed. Specifically, P_x(i, j) ∈ R^W is the sum of each column of P(i, j) and P_y(i, j) ∈ R^H is the sum of each row of P(i, j) as follows:

P_x(i, j)(u) = Σ_v P(i, j)(u, v),   (15)

P_y(i, j)(v) = Σ_u P(i, j)(u, v),   (16)

where P(i, j)(u, v) denotes the probability that the pixel at (i, j) moved to (u, v).
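The axial summations of Eqns. 15 and 16 are simple marginalizations; a toy NumPy sketch (illustrative values) is:

```python
import numpy as np

# Axial marginals of one probability map P(i, j) (sketch): P_x sums out
# the vertical axis (rows), P_y sums out the horizontal axis (columns).
p_map = np.arange(1.0, 25.0).reshape(4, 6)   # toy (H=4, W=6) map
p_map /= p_map.sum()                         # normalize to a distribution
p_x = p_map.sum(axis=0)                      # (W,) distribution over x
p_y = p_map.sum(axis=1)                      # (H,) distribution over y
```

Because `p_map` sums to 1, both marginals are themselves valid 1D probability distributions.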
The cross-entropy loss over the summation along each axis is defined as follows:

L_axial = − Σ_{(i,j) ∉ O_mask} [ Σ_u Φ^x(i, j)(u) log P_x(i, j)(u) + Σ_v Φ^y(i, j)(v) log P_y(i, j)(v) ].   (19)

By applying the axial cross-entropy loss, the correlation volume is enforced to have a low-variance distribution with the same expected value as when using the regression loss, while narrowing the feasible solution space of correlation volumes, as illustrated in Figure 4. As a result, it is expected to alleviate the underconstrained problem.

FIGURE 5. [caption excerpt] The 2D ground truth is too strict because it forces the probability map to learn an exact value for each coordinate. Since the optimal value for P(i, j) is not available, such strict constraints could be rather poor supervision. On the other hand, our axial pseudo ground truth constrains the sum of each axis for 4 bins (a, b, c, d in the figure), so we can constrain the correlation volume with a balanced degree of freedom and an appropriate regularization.
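A minimal sketch of the per-pixel term of Eqn. 19 (hypothetical helper names; the full loss additionally sums over all pixels outside O_mask):

```python
import numpy as np

def axial_ce_loss(p_map, phi_x, phi_y, eps=1e-8):
    """Axial cross-entropy for one non-occluded source pixel (sketch).

    p_map: (H, W) softmax-ed correlation slice P(i, j);
    phi_x: (W,) and phi_y: (H,) 2-bin pseudo ground truths."""
    p_x = p_map.sum(axis=0)                     # marginal over columns
    p_y = p_map.sum(axis=1)                     # marginal over rows
    ce_x = -(phi_x * np.log(p_x + eps)).sum()   # x-axis cross-entropy
    ce_y = -(phi_y * np.log(p_y + eps)).sum()   # y-axis cross-entropy
    return ce_x + ce_y
```

A probability map whose marginals match the 2-bin targets incurs a lower loss than a high-variance (e.g., uniform) map with the same support, which is exactly the low-variance pressure described above.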

E. FINAL SUPERVISION
Prior techniques typically supervise all output flows {f_1, . . . , f_N} with the L_1 distance, using exponentially increasing weights γ for later predictions, as follows:

L_reg = Σ_{i=1}^{N} γ^{N−i} ‖f_gt − f_i‖_1.   (20)

In addition to Eqn. 20, the axial cross-entropy loss (Eqn. 19) is utilized in the final objective as follows:

L = L_reg + λ L_axial,   (21)

where λ is a loss-balancing hyperparameter. Note that the overall procedure to train a model with the axial cross-entropy loss is described in Algorithm 1.
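The combined objective can be sketched as follows (hypothetical helper; γ and λ values here are illustrative defaults, not the paper's tuned settings):

```python
import numpy as np

def final_loss(flow_preds, flow_gt, axial_losses, gamma=0.9, lam=0.1):
    """Combined training objective (sketch): exponentially weighted L1
    regression over N sequential predictions plus lambda-weighted axial
    cross-entropy terms."""
    n = len(flow_preds)
    reg = sum(gamma ** (n - 1 - i) * np.abs(flow_gt - f).mean()
              for i, f in enumerate(flow_preds))   # Eqn. 20
    return reg + lam * sum(axial_losses)           # Eqn. 21
```

Later predictions receive weight closer to 1, so refinement steps dominate the regression term.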

IV. DISCUSSION ON PSEUDO GROUND TRUTH WITH PRIOR KNOWLEDGE
A. 2D PSEUDO GROUND TRUTH
In the previous section, prior knowledge is incorporated to design the pseudo ground truth, as previous techniques do [45], [46]. Specifically, the 2-bin 1D pseudo ground truth for axial constraints employs a low-variance prior over each axis of the correlation volume to reduce the number of possible solutions that minimize the supervision. One may ask whether designing a 4-bin 2D pseudo ground truth that also employs a low-variance prior for the correlation volume might be better than an axial one, because it would admit only one possible P(i, j) that minimizes the supervision. However, since it is impossible to obtain the exact optimal P(i, j), such a strictly assigned 2D pseudo ground truth would be harmful. Using the 2D cross-entropy loss with the 2D pseudo ground truth allows only one optimal solution for P(i, j), exactly matched with the 2D pseudo ground truth, as shown on the left of Figure 5. On the other hand, the axial cross-entropy loss constrains the sum along each axis of the correlation volume, while how to fill the 4 bins (a, b, c, d in Figure 5) is left to the capability of the network. Although more constraints on the correlation volume are necessary to reduce the number of possible solutions, defining the exact value of each cell of the 2D distribution is too strict and even prone to producing wrong supervision. In other words, it is important to balance the degree of constraint and the degree of freedom, and axial constraints achieve this better than 2D constraints. Experimental comparisons between 2D constraints and axial constraints are presented in Section V-E. Apart from axial and 2D pseudo ground truths, advanced designs of pseudo ground truth that incorporate priors other than low variance are left for future work.

V. EXPERIMENTS
This section elaborates on the experimental details and results to validate the effectiveness of axial constraints. The axial cross-entropy loss is applied to the off-the-shelf optical flow estimation model GMFlow [29], which incorporates a global matching step using the softmax operation. Note that the axial cross-entropy loss can be easily applied to the baseline without any architectural changes.

A. DATASETS AND EVALUATION METRIC
Following the standard training steps of previous techniques [23], [26], [27], models are trained on FlyingChairs [20] followed by FlyingThings3D [47], and cross-domain evaluation performance is reported on the MPI-Sintel [48] and KITTI [49] datasets. FlyingChairs and FlyingThings3D have 22,232 and 80,604 training images, respectively. In this section, training on the FlyingChairs dataset is referred to as step 1, and training on the FlyingThings3D dataset resumed from the weights learned in step 1 is referred to as step 2. For evaluation, the average pixel-wise L_2 and L_1 distances between the predicted flow and the ground truth are adopted to measure the accuracy of the estimated flow. The L_2 distance is denoted as end-point error (EPE) and the L_1 distance is denoted as Abs. For the KITTI dataset, the F1-all metric, which denotes the percentage of outliers, is also utilized for evaluation.
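The two main metrics can be sketched as follows (illustrative helpers; the outlier rule follows the commonly stated KITTI criterion of an error above 3 px and above 5% of the ground truth magnitude, which is an assumption here since the text does not spell it out):

```python
import numpy as np

def epe(pred, gt):
    """End-point error (sketch): mean per-pixel L2 distance between
    predicted and ground truth flow, each of shape (H, W, 2)."""
    return np.sqrt(((pred - gt) ** 2).sum(-1)).mean()

def f1_all(pred, gt):
    """KITTI-style F1-all outlier percentage (sketch): a pixel is an
    outlier if its end-point error exceeds both 3 px and 5% of the
    ground truth flow magnitude."""
    err = np.sqrt(((pred - gt) ** 2).sum(-1))
    mag = np.sqrt((gt ** 2).sum(-1))
    return ((err > 3.0) & (err > 0.05 * mag)).mean() * 100.0
```

Lower is better for both: EPE is in pixels, F1-all in percent.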

B. IMPLEMENTATION DETAILS OF AXIAL CONSTRAINTS FOR GMFlow
GMFlow already uses a softmax layer to determine global matches, so the axial cross-entropy loss is simply added, with λ = 0.1 for training on the FlyingChairs dataset and λ = 0.01 for training on the FlyingThings3D dataset. In the implementation, T = 20 is used to filter out occluded regions, and τ = 1.0 is used for training and τ = 0.6 for inference. Experiments are conducted on the GMFlow model without refinement, and four RTX A6000 GPUs are used for training. More detailed training settings, such as the learning rate, the number of training iterations, and data augmentation, follow the same settings as the original model. Specifically, color jittering and spatial transformations are performed as data augmentation. In learning step 1, randomly cropped patches of size 384×512 from the FlyingChairs dataset are utilized, and 384×768 patches from the FlyingThings3D dataset are used in learning step 2. For step 1, training on the FlyingChairs dataset is conducted for 100,000 iterations with a batch size of 16 and a learning rate of 0.0004. For step 2, training on the FlyingThings3D dataset is performed for 800,000 iterations with a batch size of 8 and a learning rate of 0.0002. For comparison, GMFlow is reproduced to obtain model weights trained only on the FlyingChairs dataset, and the officially released weights trained on both FlyingChairs and FlyingThings3D are used. The implementation of GMFlow [29] with axial constraints is built upon the officially released code of GMFlow and its training strategy.

C. CROSS DOMAIN EVALUATIONS 1) EVALUATION WITH RESPECT TO MAGNITUDE OF DISPLACEMENTS
In order to investigate the performance improvements from applying the axial cross-entropy loss to GMFlow, the EPE with respect to the magnitude of the displacements is measured. Specifically, s_{0−10}, s_{10−40}, and s_{40+} refer to regions whose displacement magnitude falls into [0, 10], [10, 40], and more than 40 pixels, respectively. As reported in Table 1, the performance is improved by the axial cross-entropy loss in almost all groups. In particular, the performance improves much more in step 1, which has a small number of training data. This shows that utilizing the axial constraint on the correlation volume becomes more effective with insufficient training data. Additional discussion and experiments on the effectiveness of the axial cross-entropy loss with respect to the amount of training data are provided in Section V-D.

2) VISUALIZATION OF CORRELATION VOLUMES
In Figure 6, P(i, j) of the correlation volumes after training steps 1 and 2 of GMFlow and GMFlow with axial constraints are visualized. It is observed that P(i, j) with axial constraints has higher values near the actual correspondence (red pixel) and lower variance at the same time, as expected. The average variances of P(i, j) along each axis are also reported in Table 2. Note that occluded pixels are excluded when calculating the average variance.

[Table caption: Step 1 refers to training on the FlyingChairs dataset, and Step 2 refers to training on the FlyingThings3D dataset resumed from the weights learned in Step 1. EPE is reported over the Sintel train set, and improved results are highlighted in bold.]
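The per-axis variance reported in Table 2 can be computed from an axial marginal as follows (illustrative helper):

```python
import numpy as np

def axial_variance(p_axis):
    """Variance of a 1D axial distribution over a pixel grid (sketch),
    as used for the average-variance comparison of the marginals."""
    grid = np.arange(len(p_axis), dtype=float)
    mean = (p_axis * grid).sum()
    return (p_axis * (grid - mean) ** 2).sum()
```

A sharply peaked marginal yields a variance near zero, while a uniform marginal over the same grid yields the maximum-entropy variance, which is the quantity the axial constraints drive down.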

4) GENERALIZATION OF AXIAL CONSTRAINTS
To further evaluate the generalization capability of axial constraints, they are applied to other optical flow estimation methods: RAFT [26], GMA [27], and GMFlowNet [30]. As reported in Table 3, with axial constraints, the EPE scores on the Sintel training dataset are improved in most cases. For these experiments, the official implementations of RAFT, GMA, and GMFlowNet without warm-start initialization are used.

D. ABLATION ON THE NUMBER OF TRAINING DATA
To further investigate the effectiveness of axial constraints, an ablation study with respect to the number of training data is conducted. In this experiment, randomly sampled 25% and 50% subsets of the FlyingChairs dataset are used for training, and the EPE on the Sintel train dataset is evaluated. As reported in Table 4, the rate of performance drop is smaller when using the axial cross-entropy loss, and the model can even match the performance of the original GMFlow trained on the entire dataset with only half of the data. This is consistent with the motivation of the axial cross-entropy loss to overcome the underconstrained formulation, since a shortage of training data also induces underconstrained problems.

E. COMPARISON BETWEEN 2D LOSS AND AXIAL LOSS
The experimental comparison between the axial cross-entropy loss and the 2D cross-entropy loss discussed in Section IV-A is presented in this section. For the experiment, a 4-bin 2D pseudo ground truth that has low variance and the same expected value as the axial pseudo ground truth is used. The detailed formulation of the 2D pseudo ground truth can be found in the Appendix. The GMFlow model is used as the baseline, and the EPE with respect to the magnitude of the displacements is reported. As shown in Table 5, there is performance degradation, especially for small motions (s_{0−10}), when using the 2D cross-entropy loss. This is likely because the 2D cross-entropy loss imposes overly strict constraints on the possible correlation volumes. However, there is a performance gain for large motions (s_{40+}) on the Sintel final dataset. This suggests that the impact of the constraints may vary depending on the magnitude of the displacement.

For an additional comparison, a Gaussian pseudo ground truth with a mean of M_gt(i, j) and a standard deviation of 1 is used. Note that the Gaussian pseudo ground truth is carefully designed so that its expected value aligns with M_gt(i, j) and it sums to 1. As reported in Table 6, the performance with 4-bin and 8-bin pseudo ground truths is even lower than the original GMFlow and becomes worse with more bins. It is probable that the optimal P(i, j) has a peak near M_gt(i, j) (lower variance); thus, a pseudo ground truth with higher variance (more bins) forces P(i, j) to deviate from the optimal one.

G. DISCUSSION
Experimental results validate the consistent benefits of axial constraints regardless of the magnitude of the displacements. However, as presented in Section V-E, the effect of the 2D cross-entropy loss differs significantly depending on the speed of motion. This suggests that the influence of the constraint may vary with the magnitude of the displacements, and that there is room for improvement through adaptive constraints according to the displacement magnitude.

VI. CONCLUSION
In this paper, the underconstrained problem of recent global matching-based optical flow estimation methods that utilize the softmax function to determine correspondences is intensively discussed. The problem arises because a vast number of possible correlation volumes can produce the correct correspondence, resulting in unstable training due to the one-to-many solution. To address it, an axial constraint on the correlation volume is proposed, based on a low-variance 2-bin pseudo ground truth, to narrow the feasible solution space of correlation volumes producing the correct correspondences. Axial constraints also have high applicability to existing global matching-based optical flow estimators, since no architectural changes are required to adopt them. Experimental results validate the merits of the axial constraints by applying them to prior techniques, and it is also shown that axial constraints lead to larger performance improvements under data scarcity.

APPENDIX 2D PSEUDO GROUND TRUTH
In the experiments in Section V-E, a low-variance 2-dimensional pseudo ground truth is designed to compare the 2D constraint against the axial constraints. To give the 2D pseudo ground truth a low variance, it is defined as a 4-bin discrete probability distribution whose expected value is equal to M_gt. To define the 2D 4-bin pseudo ground truth, the axial ground truths for the x-axis, Φ^x(i, j) ∈ R^{1×W}, and for the y-axis, Φ^y(i, j) ∈ R^{H×1}, are first constructed as in Section III-B. Then Φ^y(i, j) and Φ^x(i, j) are multiplied to produce the final 2D pseudo ground truth Φ_2D(i, j) ∈ R^{H×W} as follows:

Φ_2D(i, j) = Φ^y(i, j) Φ^x(i, j).

With the designed ground truth Φ_2D(i, j), the cross-entropy loss is defined as follows:

L_2D = − Σ_{(i,j) ∉ O_mask} Σ_{u,v} [Φ_2D(i, j) ⊙ log P(i, j)](u, v),

where ⊙ denotes the element-wise multiplication operation.
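The outer-product construction can be sketched as follows (hypothetical helper; Φ^x and Φ^y follow the 2-bin axial definitions of Section III-B):

```python
import numpy as np

def two_d_pseudo_gt(phi_x, phi_y):
    """4-bin 2D pseudo ground truth (sketch): the outer product of the
    two axial 2-bin distributions, giving an (H, W) map that sums to 1
    and whose marginals are exactly phi_x and phi_y."""
    return np.outer(phi_y, phi_x)
```

Because each axial distribution has at most 2 non-zero bins, the outer product has at most 4 non-zero cells, and its marginal expectations reproduce M_gt along each axis.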