DT-SLAM: Dynamic Thresholding Based Corner Point Extraction in SLAM System

Visual localization estimation is highly dependent on the quality of video frames or captured images. Estimation quality may be affected by poor visibility, low background texture and overexposure. Low-quality frames with blurred edges and poor contrast pose tremendous difficulties for corner point detection in SLAM, impacting the overall accuracy of estimation. This paper introduces DT-SLAM, a dynamic self-adaptive threshold (DSAT) approach for ORB corner point extraction in FAST that improves SLAM's localization performance. The proposed method replaces the existing static threshold-based ORB extraction approach, enabling improved performance in complex real-world scenes. In addition, this study introduces a threshold switching mechanism (TSM) to replace the existing SLAM frame-level and cell-level thresholds for corner point extraction. The proposed DT-SLAM approach is validated on the TUM RGB-D and EuRoC benchmark datasets for location tracking performance. The results indicate that the proposed DT-SLAM outperforms the current state-of-the-art ORB-SLAM3, especially in challenging environments.


I. INTRODUCTION
The ability of an agent to localize and track its movement through an otherwise unmapped environment is an important capability in application areas such as autonomous robotics, self-driving cars and Unmanned Aerial Vehicles (UAVs). Often, agents operate in environments that are either not mapped or cannot be mapped practically using conventional means, e.g., home environments. Additionally, agents typically operate on constrained hardware which may lack network connectivity, requiring localization to be computed on the agent itself. Simultaneous Localization and Mapping (SLAM) is a widely used technique that maps an unknown environment whilst tracking the position of the agent within it. The ubiquity of high-performance processors and high-resolution color cameras has seen the popularization of visual SLAM (vSLAM). vSLAM applies visual odometry (VO) [1], [2] to determine the position and orientation of an observer from camera images.
The associate editor coordinating the review of this manuscript and approving it for publication was Jinjia Zhou.
Since the introduction of Parallel Tracking and Mapping (PTAM) [3] in 2007, many real-time vSLAM implementations have followed a multi-threaded strategy, whereby tracking and mapping are split into two separate tasks. Examples include the corner point-based ORB-SLAM [4] and RGBD-SLAM [5], the direct method-based LSD-SLAM [6], and SVO [7], which optimizes pose directly on image intensities. Among these, the third incarnation of ORB-SLAM (ORB-SLAM3) [8], which represents the current state of the art, uses Oriented FAST and Rotated BRIEF (ORB) to detect and describe corner points as the key references for its VO. ORB itself uses the Features from Accelerated Segment Test (FAST) corner point detector [9] and the BRIEF descriptor [10] to identify reference points. The FAST corner point detector is an image point segmentation operator which identifies pixels whose intensity values differ significantly from those of their neighboring pixels. Should the intensity difference surpass a predetermined threshold value, the pixel is considered by FAST to be a corner point. The value of the predetermined threshold is therefore an influential determinant of the quantity and quality of the identified corner points. In real-world situations, properties such as image clarity, contrast and sharpness are affected by the dynamics of the environment, camera hardware and terrain. Motion blur, low-texture scenes and the camera's auto-white balance (AWB) setting can also negatively influence FAST's ability to accurately track corner points between consecutive frames [11], [12].
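For readers unfamiliar with the segment test, a minimal sketch of FAST's corner criterion is given below, assuming a grayscale image stored as a 2-D list of intensities. The contiguity requirement `n=9` follows the common FAST-9 variant; the exact variant used by ORB-SLAM3 is not specified here.

```python
# Offsets of the 16 pixels on a radius-3 Bresenham circle around (r, c).
CIRCLE = [(-3, 0), (-3, 1), (-2, 2), (-1, 3), (0, 3), (1, 3), (2, 2), (3, 1),
          (3, 0), (3, -1), (2, -2), (1, -3), (0, -3), (-1, -3), (-2, -2), (-3, -1)]

def is_fast_corner(image, r, c, t, n=9):
    """Segment test: (r, c) is a corner if n contiguous circle pixels are all
    brighter than I_p + t or all darker than I_p - t."""
    p = image[r][c]
    # Label each circle pixel: +1 brighter, -1 darker, 0 similar.
    labels = []
    for dr, dc in CIRCLE:
        q = image[r + dr][c + dc]
        labels.append(1 if q > p + t else (-1 if q < p - t else 0))
    # Look for a run of n equal non-zero labels on the wrap-around circle.
    doubled = labels + labels
    for sign in (1, -1):
        run = 0
        for v in doubled:
            run = run + 1 if v == sign else 0
            if run >= n:
                return True
    return False
```

Because the decision depends directly on the threshold `t`, lowering `t` admits weaker corners and raising it suppresses them, which is exactly the knob that DT-SLAM adjusts dynamically.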
To solve this problem, we propose a novel, adaptive dynamic threshold SLAM (DT-SLAM) which utilizes a lightweight processing technique for improving corner point extraction performance in FAST. This work contributes a Laplacian-based threshold computation method for high-performance and robust corner point detection. Additionally, a novel threshold switching approach for ORB is presented, enabling the dynamic recalculation of threshold values.

II. RELATED WORK
The performance of VO is determined by the quantity and quality of reference features, and many feature extraction methods have been applied in this field. Harris et al. [13] proposed a combined corner and edge detector that tracks extracted features across consecutive frames via an auto-correlation function. Bay et al. [14] proposed Speeded Up Robust Features (SURF), a scale- and rotation-invariant method that utilizes a Hessian matrix to identify feature points. Similarly, Lowe [15] proposed a three-stage, Scale-Invariant Feature Transform (SIFT)-based approach, whereby nearest-neighbour, clustering and least-squares algorithms are combined to determine feature points. The results of Lowe's approach, however, are not stable and cannot be applied practically in time-critical applications [16].
One approach to improving vSLAM's tracking performance is to improve the feature point extraction method. Pumarola et al. [11] proposed PL-SLAM, which simultaneously extracts both lines and ORB points for localization and mapping. However, PL-SLAM, which is built upon ORB-SLAM2, performed poorly in scenes with few objects. To improve tracking performance, Li et al. [17] proposed R-ORBSLAM, an evolution of ORB-SLAM2 which considers photometric error prior to feature point extraction. Fan [18] proposed a self-adaptive thresholding approach for feature point extraction based on the local variance of pixel intensities. Ma et al. [19] proposed an image entropy method using K-means clustering to compute the threshold, whereas Yang et al. [20] adjusted the threshold based on image brightness. Despite improving feature tracking across different conditions, these approaches were not practical for real-time applications due to hardware constraints. Ding et al. [21] used a KSW entropy method to calculate the FAST threshold, but it was susceptible to scenes with substantial movement. Crucially, this approach still relied on an initial static threshold, which reduced the number of detected corner points [18], [22].
Sun et al. [23] proposed a pre-processing stage to remove moving objects from the tracked scene (MR-SLAM). This approach was prone to tracking failure when previously moving objects stopped moving. Huang et al. [24] presented a cross-modality method to associate local points with an existing map's 3D structure components for better mapping efficiency, but the model's performance relied on the accuracy of existing maps. Similarly, Cho et al. [25] introduced a de-blurring and de-hazing technique to improve the quality of frames in SLAM. Jin [26] proposed a lightweight convolutional neural network-based corner point extractor to replace FAST, which improved localization accuracy. Similarly, Huang et al. [27] adopted the SuperPoint method for feature extraction, which requires the repeated estimation of camera pose via deep learning (DL-VO). However, these approaches are not suitable for use in constrained hardware applications.

III. SYSTEM OVERVIEW
In this section we provide a technical overview (see Fig. 1) of our proposed DT-SLAM implementation. DT-SLAM introduces a Dynamic Self-Adaptive Threshold (DSAT) mechanism to FAST, with the aim of making feature point detection more robust in challenging scenes. DSAT consists of a Dynamic Threshold Calculator (DTC) and a Threshold Switching Mechanism (TSM), as shown in Fig. 2a. DSAT accepts an input frame which is then divided into 31 × 32 uniform cells, as shown in Fig. 2b [8]. ORB extraction is then performed on each cell using FAST via a two-step process.
First, the centre point is identified using a 16-pixel Bresenham circle [28]. This selection process compares the pixel's intensity to those of its neighbouring pixels on the circle. The intensity difference values are then compared to an initial ''hard'' threshold to determine whether the centre point is a suitable ORB candidate. If no ORB candidate is detected, the process is repeated using an alternative ''soft'' threshold value. ORBs are then selected from the candidate pool through the application of a non-maximal suppression (NMS) algorithm. This method replaces the static threshold value used in ORB-SLAM3, which identifies candidate ORBs through repeated trial and error using a RANSAC approach [29].
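The NMS step above can be sketched as a greedy suppression over scored candidates. This is an illustrative sketch, not the exact routine used in DT-SLAM: the corner score function and the suppression radius are assumptions.

```python
def nms_corners(candidates, radius=3):
    """Greedy non-maximal suppression: keep the strongest candidate first,
    then drop any candidate within `radius` (Chebyshev distance) of one
    already kept. `candidates` is a list of (score, row, col) tuples."""
    kept = []
    for score, r, c in sorted(candidates, reverse=True):
        if all(max(abs(r - kr), abs(c - kc)) > radius for _, kr, kc in kept):
            kept.append((score, r, c))
    return kept
```

Greedy suppression guarantees that no two retained ORBs are closer than the radius, which spreads reference points across the cell rather than clustering them on one strong edge.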

A. DYNAMIC THRESHOLD CALCULATOR (DTC)
SLAM relies on matching ORB points between two consecutive frames to model the pose of the camera. In DSAT, we introduce edge detection to improve ORB point tracking between frames. Edges comprise consecutive points whose intensities differ significantly from those of their neighbouring pixels. Popular edge detection methods such as the Sobel operator [30], Canny [31] and the Laplacian of Gaussian (LoG) [32] have been applied in many fields, including medical imaging [33], computer vision for object detection [34], and self-driving vehicles for pedestrian monitoring [35].
Existing works indicate that LoG [32], [36]-[38] yields better edge detection performance due to its high sensitivity to fine edges whilst introducing a low computational overhead. Figure 3 illustrates the step-by-step operation of DTC upon receiving a gray-scaled frame as the input source (I). A Gaussian smoothing filter G_σ(x, y) is applied to reduce noise interference, as shown in (1) and (2) [37]:

I_f = G_σ(x, y) * I,  (1)

G_σ(x, y) = (1 / 2πσ²) exp(−(x² + y²) / 2σ²),  (2)

where *, I_f, and x and y denote the convolution operator, the resulting filtered frame, and the x- and y-axes of the image pixel, respectively. Edges are further refined using a Laplace operator. The original Laplace operator is defined as the divergence (∇·) of the gradient of the pixel intensity (P) in an n-dimensional Euclidean space, as shown in (3) and (4):

∇²P = ∇ · (∇P),  (3)

∇²P = Σ_{i=1}^{n} ∂²P / ∂x_i²,  (4)

where, for an image, P is a function of the (x, y)-coordinates, as shown in (5):

∇²P = ∂²P/∂x² + ∂²P/∂y².  (5)

Here, ∇²P is computed by an approximate discrete convolution kernel of size 3 × 3 (see (6)), which generates the edge image I_f containing the edge information [38]:

∇² ≈ [ 0  1  0 ; 1  −4  1 ; 0  1  0 ].  (6)

Finally, the threshold t is computed as the standard deviation of the pixel intensities of the edge image, as shown in (7):

t = sqrt( (1 / (W · H)) Σ_{x,y} (P(x, y) − P̄)² ),  (7)

where P̄, W and H denote the mean pixel intensity, and the width and height of I_f (frame or cell), respectively.
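The DTC pipeline (Laplacian filtering followed by the standard-deviation threshold) can be sketched as below. The input is assumed to be already Gaussian-smoothed; the standard 3 × 3 Laplacian kernel used here is the textbook approximation, which may differ from the paper's exact kernel.

```python
import math

# Standard discrete 3x3 Laplacian approximation (an assumption; the paper's
# exact kernel is not reproduced in the text).
LAPLACIAN_3X3 = [[0, 1, 0],
                 [1, -4, 1],
                 [0, 1, 0]]

def convolve3x3(image, kernel):
    """Valid-mode 3x3 convolution over a 2-D list of intensities."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(1, h - 1):
        row = []
        for c in range(1, w - 1):
            acc = 0.0
            for i in range(3):
                for j in range(3):
                    acc += kernel[i][j] * image[r + i - 1][c + j - 1]
            row.append(acc)
        out.append(row)
    return out

def dtc_threshold(smoothed):
    """DTC: threshold t = standard deviation of the Laplacian-filtered
    (edge) image, computed over all pixels of the frame or cell."""
    edges = convolve3x3(smoothed, LAPLACIAN_3X3)
    pixels = [p for row in edges for p in row]
    mean = sum(pixels) / len(pixels)
    return math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels))
```

A flat cell produces a zero-valued edge image and thus a near-zero threshold, so weak corners are still admitted, while a busy, high-contrast cell yields a large standard deviation and a correspondingly stricter threshold.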

B. THRESHOLD SWITCHING MECHANISM (TSM)
In TSM, two levels of threshold are introduced, namely the hard threshold (HT) and the soft threshold (ST): extraction is first attempted with HT, and ST is used only when HT yields no suitable candidates.

IV. EXPERIMENTAL SETUP
The TUM RGB-D dataset [39] contains sequences of indoor videos under different environmental conditions, e.g., illuminance and varied scene settings, which include both static and moving objects. The raw data was captured by a Microsoft Kinect RGB-D sensor and the ground-truth locations were generated by an external motion capture system. The EuRoC dataset [40] contains 11 stereo sequences recorded from a micro aerial vehicle (MAV). The dataset comprises two office-like environments and a large industrial factory. The ground truth is measured by a Leica laser tracker and a Vicon motion capture system. In this study, only RGB frames from the left stereo camera are considered during the performance analysis. To avoid differences caused by testing hardware and RANSAC [29], the benchmarking of ORB-SLAM3 and DT-SLAM is performed three times under Ubuntu 16.04 on an Intel Xeon Silver 4108 CPU @ 1.80 GHz with 16 GB RAM. Figure 4 shows a comparison of the corner points detected by ORB-SLAM3 and DT-SLAM on two consecutive input frames from the EuRoC V202 sequence. Table 1 shows that 686 and 795 corner points were extracted by ORB-SLAM3 with a static threshold value of 20. DT-SLAM identifies the maximum permitted number of corner points (1000) with dynamically calculated HT values of 14 and 12 on the respective frames. DT-SLAM also demonstrates a 30% improvement in matching points between frames, reducing the trajectory loss rate of the overall evaluation run. Additionally, in situations where the ST calculation is required, ORB point extraction increases more than five-fold in both frames, as illustrated in Table 2 and Fig. 5.
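The HT-then-ST switching described above can be sketched per cell as follows. The relation ST = 0.5 × HT is an illustrative assumption, as is the `detect` callback; the text only specifies that ST is a lower fallback used when HT yields nothing.

```python
def extract_with_tsm(cells, compute_threshold, detect, soft_scale=0.5):
    """Two-step per-cell extraction: try the dynamically computed hard
    threshold first; cells that yield no corners are retried with a softer
    threshold. `detect(cell, t)` returns the corners found at threshold t;
    soft_scale (ST = soft_scale * HT) is an illustrative assumption."""
    all_corners = []
    for cell in cells:
        ht = compute_threshold(cell)       # e.g. the DTC value for this cell
        corners = detect(cell, ht)         # step 1: hard threshold
        if not corners:
            corners = detect(cell, max(1, soft_scale * ht))  # step 2: soft
        all_corners.extend(corners)
    return all_corners
```

The fallback matters in low-texture cells: rather than returning nothing (and starving VO of matches), the softer threshold recovers weaker but still usable corner points.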

V. RESULTS AND DISCUSSION
The performance of the proposed DT-SLAM is evaluated on the benchmark datasets using Root Mean Squared Error (RMSE) analysis, as shown in (8):

RMSE = sqrt( (1/m) Σ_{i=1}^{m} ||α_i − h(β_i)||² ),  (8)

where m is the number of keyframes in SLAM, α_i and β_i are the generated keyframe trajectory and its associated ground truth at the i-th keyframe, respectively, and h(·) is the trajectory alignment method based on the scale and rotation consistency between α_i and β_i [41].
Figure 5. Corner points extracted (see Tables 1 and 2) from cells (yellow boxes) in two consecutive frames (top and bottom images) by ORB-SLAM3 (blue dots) and DT-SLAM (red dots) with ST (EuRoC V202).
Table 3 shows the RMSE for indoor localization, as tested with the TUM RGB-D and EuRoC datasets. The ''Fr1/desk2'' sequence is a video recording of an office desk. Two frames are extracted from the sequence, where ''Frame A'' contains multiple objects on the desk and ''Frame B'' contains fewer objects (Fig. 6). The number of corner points extracted by ORB-SLAM3 is reduced when the camera moves from Frame A (see Fig. 6a) to Frame B (see Fig. 6b). The results indicate that localization performance improves by 44.3% in ''Fr1/desk2'' with DT-SLAM (RMSE = 0.0571) compared to ORB-SLAM3 (RMSE = 0.1026).
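The RMSE metric of (8) can be computed as below. The full scale/rotation alignment h(·) [41] (typically a Horn/Umeyama-style fit) is abstracted behind an `align` callback; the identity default is a simplifying assumption for illustration only.

```python
import math

def trajectory_rmse(est, gt, align=lambda p: p):
    """RMSE between estimated keyframe positions and ground truth, as in (8):
    sqrt((1/m) * sum_i ||alpha_i - h(beta_i)||^2). `align` stands in for the
    alignment h(.); the identity default is a simplifying assumption."""
    assert len(est) == len(gt), "one ground-truth pose per keyframe"
    m = len(est)
    total = 0.0
    for a, b in zip(est, gt):
        bh = align(b)  # aligned ground-truth position for this keyframe
        total += sum((ai - bi) ** 2 for ai, bi in zip(a, bh))
    return math.sqrt(total / m)
```

Because the error is averaged over keyframes before the square root, a handful of badly localized keyframes dominates the score, which is why lost trajectory segments show up so strongly in the tables that follow.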
The trajectory prediction performance of GMMLoc, SVO and DL-VO was also evaluated using the EuRoC dataset. GMMLoc (mean RMSE = 0.0323) and DL-VO (mean RMSE = 0.0761) outperform ORB-SLAM3 (mean RMSE = 0.1899) and DT-SLAM (mean RMSE = 0.0807). However, GMMLoc and DL-VO require an existing map of the environment in order to estimate trajectory.
Table 3. RMSE tested with the TUM RGB-D [39] and EuRoC [40] datasets.
DT-SLAM shows great stability in challenging environments. Multiple trajectories are lost in sequence ''V203'' when tested with DL-VO and SVO, and the same is observed in sequence ''V103'' with SVO. The majority of the lost trajectories are due to insufficient matching points between consecutive frames, caused by a lack of informative ORBs being extracted. Improvements are found in sequences ''MH04'', ''V102'' and ''V103'' when tested with DT-SLAM compared to ORB-SLAM3. A significant improvement can also be observed in sequence ''V203'' for ORB-SLAM3 (RMSE = 0.2116) versus DT-SLAM (RMSE = 0.0807), as illustrated in Fig. 7, where the trajectory predicted by ORB-SLAM3 has higher variation than that of the proposed DT-SLAM between the time frames 1:41:45 and 1:41:50.
Similar improvements are seen with ''Fr2/360-kidnap'', where the mapped trajectory generated by DT-SLAM (RMSE = 0.0382) once again outperforms ORB-SLAM3 (RMSE = 0.0805) (Fig. 8). Three sequences in the TUM dataset contain moving objects: ''Fr3/sitting_halfsph'', ''Fr3/walking_xyz'' and ''Fr3/walking_halfsph''. Table 4 shows that the total number of map points and the respective mean numbers of matching points are higher in DT-SLAM when compared to ORB-SLAM3. The results indicate that DT-SLAM outperforms ORB-SLAM3, MR-SLAM and (largely) PL-SLAM in the tested sequences. PL-SLAM demonstrates the best performance (RMSE = 0.0066) for sequence ''Fr3/sitting_halfsph''. A possible explanation for this result is the numerous contiguous edges present in this sequence, a property that is exploited well by PL-SLAM's line detection-based approach. Overall, DT-SLAM demonstrates the best performance relative to all other methods tested with the TUM dataset. Table 5 shows the mean computation time of localization and mapping for the methods tested in this evaluation. DT-SLAM's computational efficiency is comparable to that of other SLAM methods, introducing minimal overhead compared with ORB-SLAM3 and resulting in only a 1 FPS difference between the two implementations.
We have identified two limitations in the current implementation of DT-SLAM. First, DT-SLAM's tracking performance in blurry or low-contrast test sequences does not improve upon that of existing methods. Second, to ensure that DT-SLAM achieves real-time operation on constrained hardware, we were unable to implement semantic-recognition image processing algorithms; as a consequence, DT-SLAM is unable to outperform such methods in specific testing sequences.

VI. CONCLUSION
This study proposes DT-SLAM, a novel self-adaptive dynamic threshold-based approach for ORB feature point extraction. The main contributions of this system are a dynamic threshold algorithm and a threshold switching mechanism. We have shown that DT-SLAM outperforms current state-of-the-art SLAM methods on standard benchmark datasets, and that it is especially effective in challenging environments and dynamic scenes. Future work will evaluate DT-SLAM in increasingly challenging environments, including multi-room and multi-floor localization.