A Robust Visual SLAM Method for Additive Manufacturing of Vehicular Parts Under Dynamic Scenes

Additive manufacturing has significant advantages in complex parts of the vehicle manufacturing. As additive manufacturing is a kind of precise production activity, different components of manufacturing instruments need to be located in appropriate positions to ensure accuracy. The visual Simultaneous Localization and Mapping (SLAM) can be considered to be a practical means for this purpose. Considering dynamic characteristics of additive manufacturing scenarios, this paper constructs a deep learning-enhanced robust SLAM approach for production monitoring of additive manufacturing. The proposed method combines the semantic segmentation technique with the motion-consistency detection algorithm together. Firstly, the Transformer-based backbone network is used to segment the images to establish the a prior semantic information of dynamic objects. Next, the feature points of dynamic objects are projected by the motion-consistency detection algorithm. Then, the static feature points are adopted for feature matching and position estimation. In addition, we conducted a couple of experiments to test function of the proposed method. The obtained results show that the proposal can have excellent performance to promote realistic additive manufacturing process. As for numerical results, the proposal can improve image segmentation effect about 10% to 15% in terms of scenarios of visual SLAM-based additive manufacturing.


I. INTRODUCTION
Additive Manufacturing is an emerging processing technology based on the principle of discrete stacking [1], [2], [3], which breaks the traditional reduced material manufacturing and equal material manufacturing production methods [4], [5]. It is a new manufacturing technology that does not require the collaboration of jigs and fixtures and is not processed The associate editor coordinating the review of this manuscript and approving it for publication was Wei Quan. by machine tools and equipment [6], [7], [8]. With the unification of production standards and the maturity of raw material technology, additive manufacturing technology is well boosted by intelligent technology and the intersection of basic disciplines like automotive industry, aerospace, bioengineering and other fields [9], [10]. At the same time, its gradual prevalence can also breed some latent technical breakthroughs in many cross-discipline applications [11], [12]. Therefore, it is believed to have unlimited market potential in terms of smart manufacturing [13], [14], [15]. Additive manufacturing technology firstly uses computeraid design softwares to conduct 3D modeling for mechanical parts. Then, slicing software is utilized to slice the 3D models according to the parameters of the parts. On this basis, the computer algorithms are used to precisely connect each layer to form a layer stack to quickly realize the additive manufacturing of parts [16], [17]. Contemporarily, the application of additive manufacturing technology has bee more and more general in the automotive industry [18], [19],liu2022human. In this context, some brand-name automotive companies currently choose to use additive manufacturing technology in the automotive development stage to achieve the purpose of rapid verification and optimization of components [20], [21]. In the small batch production of automotive parts often involve some complex parts with thin walls and internal abdominal cavities, the traditional forging and casting processes have limitations in processing and cannot meet the production requirements [22], [23]. Due to the pointby-point, line-by-line and domain-by-domain local forming characteristics of additive manufacturing technology, it is possible to achieve highly flexible near-net-shape additive manufacturing in the manufacture of complex parts [24], [25]. Therefore, additive manufacturing technology has significant advantages in the manufacturing of complex parts, and the application prospect is very promising [26], [27].

A. MOTIVATION
In additive manufacturing, how to use vision sensors for accurate positioning and mapping of parts is the key to realize autonomation. With the continuous development of research, robots are equipped with more diverse sensors, including vision, laser, radar, and multi-sensor fusion methods. Robots are able to perceive their environment and have the ability to estimate the state of their systems using the sensors they carry, they can sense their surroundings and make decisions autonomously. These digital technologies require accurate and robust localization with the ability to progressively build and maintain models of the world scenes. In this work, localization refers to the ability to obtain the internal system state of the robot's motion, including position, orientation, and velocity. While mapping refers to the ability to sense the state of the external environment and capture information about the surroundings, including the geometry, appearance, and semantic information of a 2D or 3D scene [28]. These components can perceive internal or external states individually or, like simultaneous localization and mapping (SLAM) [29], so as to facilitate control decision of robot' poses.
The localization and mapping problem has been studied for decades and various sophisticated hand-designed (handdesigned) models and algorithms are being developed, such as odometer estimation, image-based localization, position recognition, SLAM, motion reconstruction (SfM) [29], [30]. Under ideal conditions, these sensors and models are able to estimate the system state accurately regardless of the time, environment constraints. However, in reality, sensor measurement errors, system modeling errors, complex environmental dynamics and unrealistic constraints (conditions) affect the accuracy and reliability of manually designed systems [31]. Although modern vision SLAM systems are quite mature and have satisfactory performance [32], the aforementioned classical SLAM systems are with the assumption that the objects for SLAM are static, and the detection and processing of dynamic objects are very limited.
However, in actual indoor and outdoor scenes, it is impossible to circumvent moving objects [33]. In this case, unexpected changes in the surrounding environment may seriously affect the camera pose estimation, increase the trajectory error or even lead to system failure. Thus, the detection of moving objects and the correct segmentation of dynamic regions become important research aspects of vision SLAM in dynamic scenes. Because of the limitations of model-based solutions and the rapid development of machine learning, especially deep learning, researchers have been prompted to consider data-driven learning methods as an alternative approach to solve this issue. The class of relationships between sensor data input values (e.g., vision, inertial guidance, LiDAR data or other sensors) and target output values (e.g., position, orientation, scene geometry or semantics) as a mapping function [34], [35].

B. CONTRIBUTIONS
While traditional model-based solutions are implemented by manually designing algorithms, learning-based approaches construct this mapping function by learning large amounts of data. The learning-based approach has three advantages. Firstly, the learning approach can automatically discover task-relevant features using a highly expressive deep neural network as a general-purpose approximator. This feature enables trained models to adapt to various scenarios (e.g., featureless scenes, dynamic high-speed scenes, dynamic blur, accurate camera calibration) [36], [37]. Secondly, the learning approach allows learning from past experiences and actively developing new information. By building a general data-driven model, researchers can solve domainspecific problems without having to go through the trouble of specifying the entire knowledge about mathematical and physical rules when building the model. Thirdly, deep neural networks have the ability to be scaled to large-scale problems. Trained on large data sets through back propagation and gradient descent algorithms, a large number of parameters in the DNN framework can be automatically optimized by minimizing the loss function. Thus, harnessing the power of data and computation to solve localization and mapping is potentially achievable.
The multi-sensor fusion scheme needs to add information from different sensor sources, facing multiple difficulties such as data correlation, signal synchronization, and fusion processing, which greatly increases the complexity of the system, and the dense scene flow approach is computationally intensive and a great challenge for real-time computing [38], [39]. The multi-sensor fusion scheme can construct semantic maps to enrich the robot's understanding of the environment and thus obtain advanced perception, but there are problems of misjudgment for movable objects. To address it, this paper proposes a robust SLAM algorithm for dynamic scenes, which uses deep learning to quickly identify dynamic object frames combined with sparse feature optical flow calculation to make further dynamic judgments, the scenario of the proposed method is shown in Figure 1. Edge detection algorithms are used to effectively segment the edges of dynamic objects to ensure that no too many static feature points are mistakenly removed. And a static environment 3D point cloud map without dynamic objects are constructed to truly realize the powerful sensing capability of autonomous robots.
To sum up, the main contributions of this paper can be stated as the following three aspects: • This work aims at the additive manufacturing of automobile parts, and explores to employ deep learningbased vision sensing to enhance the manufacturing process. • This work proposes a robust visual SLAM method for additive manufacturing of vehicular parts under dynamic scenes.
• This work conducts simulation experiments on realworld scenarios to evaluate performance of the proposal, and corresponding discussions are also made for it.

II. METHODOLOGY
In our study, the transformer real-time target detection algorithm is used to quickly obtain the rough rectangular range of potential semantic dynamic objects in the threechannel image of input, the ORB feature points and the optical flow field are extracted and calculated, respectively, which largely reduce the time to calculate the optical flow field of all pixel points. Then by combining the semantic data with the dynamic feature points filtered by the optical flow field calculation, the true motion of the object can be obtained. Then, the canny operator is adopted to detect the edges of the dynamic objects to extract the edge data of the dynamic objects, and to do position estimation of camera by minimizing re-projection error of static feature points other than dynamic objects. Finally, the map is constructed using the key frames with the dynamic objects removed. The overall flow is shown in Figure 2.

A. REAL-TIME TARGET DETECTION BASED ON TRANSFORMER
The Detection Transformer (DEtection TRansformer, DETR) [40], [41] with an ensemble global loss that makes predictions through bilateral match and a classical encoderdecoder architecture, which containing three components: a CNN based backbone to extracte feature representations, a Transformer pretraining model to enhance features, and a simple feedforward network (FFN) for performing the object detection prediction.The detail structure is shown as Figure 3. Starting from an initial image x img ∈ R 3×H 0 ×W 0 (3 color channels, To batch the input images together with sufficient 0 padding to have the same dimension (H 0 ,W 0 ) as the largest image in same batch), a convolutional network then to generate a activation map f ∈ R C×H ×W with lower resolution. First, the high-level activation map of the channel dimension f is reduced from C to dimension d using a 1 × 1 convolution to generate a new feature map, which is written as z 0 ∈ R d×H ×W . Since a sequence is expected for the encoder as input, so the spatial dimension of the feature map z 0 is collapsed to generate a new feature map with dimension d ×HW . For an encoder, it is constituted by a head part, an attention mechanism part and a FFN part. Since the architecture of transformer is alignment-independent (orderindependent), a fixed position encoding [8], [9] is provided and the processed results are added to the input of each attention layer.
For a decoder, it transforms N embeddings with size d by multi-head attention mechanism. The authors in [40] adopted an auto-regressive model to predict one element of the output sequence at once. Because the decoder is also permutationindependent (order-independent), hence thedifferent results will be produced according to N input embeddings. And the input embeddings are learned through the positional encodings, and the results is defined as the queries of object, and then are added to the input of each attention layer. The decoder can process N queries of objects into output feature map. The elements of the output embedding corresponding to queries of object are decoded independently into bounding box coordinates and are assigned the category labels through a FFN, which get the N final predictions. A self attention mechanism is applied on these embeddings, the model utilizes pairwise relationships between all objects to perform global inference on all objects.
The final prediction results is calculated by a three-layer foward propagation network. Also, there is a hidden layer with dimension d and a projection layer before the final results. The normalized center coordinates, height and width of the bounding box with respect to the input image are predicted by the FFN, and the category labels is predicted through a softmax function in the last layer. Thus, a fixed size of N bounding boxes is predicted, and N is typically bigger than the number of targets of interest in the original input. In addition, a category label is appended to indicate that no targets are detected within the slots (e.g., no targets of interest in the image or targets of interest do not fill the N slots). This category is similar to the ''background'' category in standard target detection methods.

B. ORB FEATURE EXTRACTION
In order to carry out the static and dynamic analysis of the object while saving the computational cost and ensuring the real time performance, this study calculates the optical flow field to estimate the motion state of the extracted ORB feature points, which are mainly divided into two parts, FAST corner point extraction and BRIEF descriptor calculation [33], which is given in the followin: 1) Construct the image pyramid, at the same time extract the FAST corner points for each pyamid layer using a uniform extraction strategy based on quadtree [41], the specific calculation process is described as follows: Step 1: Select pixel p in the image and obtain its luminance, assumed to be Ip; Step 2: set the threshold T = I p × 0.2; Step 3: traverse a circle with radius 3 centered on pixel p. The 16 pixel points on the circle with radius 3; Step 4: Let the brightness of each traversal point be I c p. If there are N consecutive points with I cp > I p + T or I cp < I p − T , the point is considered to be a featured point, and N is 12 in this study; Step 5: performs the above operation for each pixel in the image. VOLUME 11, 2023 2) Calculate the rotation angle of the FAST corner point by using the gray scale center of mass method. Define the moments of the image as: where I (p, q) is the gray value of the FAST corner point (p, q), a, b are the order of the moments, and the image center of mass coordinates are: The rotation angle is: 3) Calculate the rotated BRIEF descriptor, choose the window W of S × S, and define: where:I (p) is the grayscale value at p. Randomly selected n pairs of feature points, the Generate an n-dimensional BRIEF description sub-vector:

C. EDGE DETECTION
Directly removing the rectangular area of dynamic objects removes too much static scene, which is not conducive to accurate camera positioning and map construction. In order to extract edge of dynamic objects more accurately, this study uses canny operator to detect the edges of the filtered dynamic objects. Canny is a second-order differential operator [42], which extracts the edges of the image by the zero point of the second-order derivative at the edge of the given image. The strong edges and weak edges are detected separately, and the real weak edges can be detected. The detail steps are described as follows: 1) Eliminate image noise. Firstly, the image is smoothed by using Gaussian function. Define f (p, q) as the input image, O(x, y) as the output image, and g(p, q) as the Gaussian function, where the Gaussian function is defined as: 2) The gradient magnitude and direction calculation. Using the image processed by Gaussian filtering, a suitable gradient operator is adopted for the gradient magnitude and direction calculation of each pixel by calculating the difference of the first-order bias between adjacent pixels. Where, A p ,A q are the Sobel gradient operator, E p ,E q is the difference between horizontal and vertical direction, respectively. The gradient E(p, q) and direction θ(p, q) are written as following: 3) Filtering non-extreme values. In the Gaussian filtering process, the edges may be amplified, and the Non-Maximum Suppression (NMS) is adopted to filter the points those are not edges. If the current calculated gradient amplitude in the field of the point is greater than along the gradient direction of the point. If the current calculated gradient amplitude of the other 2 neighboring points is greater than the gradient amplitude along the direction of the point, the point belongs to the possible edge point, otherwise it is not, and the suppression means is taken to set the gray value to 0. 4) Double threshold detection and connected edges. After the above steps of processing only get the candidate edge points, and then use the upper and lower threshold detection process to eliminate the pseudo-edge points. Points larger than. Points with upper threshold are detected as edge points, points smaller than lower threshold are detected as nonedge points, points between the two values are detected as weak edge points, and if they are adjacent to the pixel point identified as an edge point, they are judged as edge points; otherwise, they are non-edge points.

D. LOCATION ESTIMATION AND POINT CLOUD OVERLAY
After determining the exact contour of the dynamic objects, the dynamic points distributed within the objects are excluded, and only the stable feature points in the nondynamic region are used for a more accurate camera pose solution. (u i c , v i c ) is set to be the pixel coordinates of the static points in the current frame c, and the depth value z i c is used to obtain the 3D spatial point coordinates P i c (p i c , q i c , z i c ).
where (f p ,f q ), is the focal length of camera, (c p , c q , ) is the principal point coordinates of camera. Building 3D point cloud maps of the environment can provide better visualization of the environment. The semantic information carried by the point cloud can provide the basis for robot navigation and obstacle avoidance [22], [23]. When constructing point clouds, if there are large errors in the poses, the maps will be overlapped with obvious interlocks, which is not good for navigation. This problem can be effectively solved by overlaying the point clouds with the dynamic objects removed. The ORB_SLAM2 algorithm is used to obtain key-frames, and the point clouds of all key-frames are superimposed, which is too complicated and redundant [35], [43], [44]. In the process of key-frame screening, the following two strategies are considered: 1) key-frame validity judgment. If the area of the rejected point cloud is more than half of the current key-frame area, the key-frame is considered to contain insufficient valid information and is not involved in the overlay. 2) key-frame redundancy judgment.  The feature points that can be observed by multiple keyframes are called co-visual landmark points of multiple keyframes. To detect the co-viewing landmark points observed in the current key-frame, assume that the set of identified drawing key-frames is F, the set of observed landmark points is L, and the set of landmark points observed in the current key-frame is L c , If the number of L ∩ L c exceeds half of L c , the current key-frame is considered to contain too many coviewing landmarks and the information is redundant, so it does not participate in the superposition. If the above two conditions are satisfied, F and L c are updated, which ensures that new point cloud information is introduced and there is enough static environment information.

III. EXPERIMENTS AND ANALYSIS
In order to evaluate the actual performance and effectiveness of the proposed ORB_SLAM2_transformer system in this paper, the system is tested in three aspects: the performance segmentation performance of the transformer network, the performance of dynamic feature point rejection, and the performance of localization in dynamic scenes.

A. EXPERIMENTAL DATA AND SETTINGS
The ''Freiburg2_desk_with _person'' dataset from the Vision Group of the Technical University of Munich (TUM), Germany, was selected as the open-source dataset, which contains a total of 4067 frames with static tables and chairs and multiple slow moving human targets [45]. This dataset is designed to check the robustness of the SLAM system to dynamic objects and people, to distinguish the map and to check the changes in the scene, which meets the requirements of the experiments in this paper.In order to test the robustness of the proposed method, an additional ''DataSet_Factory'' data set based on real scenes is constructed [36]. The data set is obtained by fixing a camera on a mobile experimental platform equipped with LIDAR, which has a static industrial assembly line and several slow-moving human targets in a total of 1715 frames, in exactly the same format as the TUM data set.
For the analysis of semantic segmentation results, this paper selects the mainstream statistical pixel accuracy (Pixel accuracy), class Mean accuracy, Mean IoU and Frequency weight IoU are the four mainstream semantic segmentation evaluation criteria used to evaluate pixel accuracy and region overlap [31]. The specific definitions are as follows: MeanIoU = n ii t i + j n ji − n ii n cl (15) FreqweightIoU = i t i n ii t i + j n ji − n ii k t k (16) In this paper, we calculate the relative pose error (RPE) and Absolute Pose Eror (APE) to evaluate the difference VOLUME 11, 2023  in SLAM performance between the ORB_SLAM2 and the ORB_SLAM2-transformer in a dynamic environment. The relative pose error is calculated based the difference between the estimated SLAM pose and the truth value of the camera pose at the same time, and mainly describes the accuracy of the pose difference between two key frames between a fixed time t. The RPE for the i-th key frame is defined as: where Q i is the real trajectory pose; P i is the key frame pose estimated by the system. Root Mean Squared Error (RMSE) is used to evaluate the error and is defined as follows: In addition, one typical image segmentation method is introduced as the baseline. As this work manages to explore image segmentation-based method to enhance SLAM manufacturing process. The fully convolutional networks [46], named as FCN for short, is a most typical image segmentation method in this area. Thus, the proposal in this paper is compared with the FCN-based backbone network to measure performance appearance.

B. NUMERICAL RESULTS AND ANALYSIS
First, the transformer network based segmentation model used in this paper was analyzed. The detail performance is shown as in Table 1, and its pixel accuracy, category average accuracy and average region overlap reached 71.101%, 89.512% and 58. 157%, respectively. The performance comparison of Figure 4 indicating that this model is significantly better than other network models, and the feature map can retain more detailed features. This is evaluated because the introduction of the transformer network semantic segmentation model increases the complexity of the system, and the consequent problem is that the feature point extraction takes longer computation time, which affects the real-time performance. In Table 2, the average feature point extraction time per image reaches 0.21757 seconds due to the addition of semantic segmentation and dynamic feature point projection algorithms to the system. Although the extraction time is significantly increased compared to the ORB_SLAM2 system, the system is still able to achieve an average rate of about 5 frames per second, which basically ensures the real-time performance. Figure 5 shows the comparison of ORB feature point extraction results of the two systems under the TUM dataset: The feature point area contains the dynamic portrait target area; the points in the dynamic portrait target area are   completely eliminated, and the number of eliminated feature points increases gradually as all the portrait targets enter the picture, which achieves the expected goal. Figure 6 shows the comparison of the ORB feature point extraction results of the two systems in the real scene, the feature points in the dynamic target region are completely eliminated, and the same target is achieved as the dataset, which shows that the method is still applicable in the real scene. Table 3 shows the RMSE, maximum error(MAX), mean absolute error(MAE), and standard deviation of the relative and absolute positional errors, respectively. Absolute error, and standard deviation of the relative and absolute positional errors are presented in Table 4: As shown in Figure 7, compared with the ORB_SLAM2, the proposed ORB_ SLAM2_Transformer has a higher maximum error than the ORB_SLAM2, but the RMSE, the MAE and the standard deviation are reduced by 11.038%, 15.257% and 2. 309%, respectively; From the Figure 8, for the absolute trajectory error, the ORB_SLAM2_Transformer has a higher maximum error than the ORB_SLAM2. The ORB_SLAM2_Transformer has a smaller error compared to the ORB_SLAM2, and the four error parameters are reduced by 18.450% 27.%, 18.177%, and 19.492%, respectively. Therefore, it is proved that the ORB_SLAM2_Transformer has an overall smaller positioning error and better relative and absolute positional errors. We believe that the ORB_SLAM2_Transformer achieves the goal of removing the dynamic feature point fraction to reduce the camera tracking localization error, thus optimizing the problem of camera tracking drift under dynamic targets.

C. DISCUSSION
In this work, the deep learning-based image segmentation is employed to enhance the SLAM manufacturing process in terms of automobile parts. In our proposal, the Transformer is employed as backbone network for use. The performance of image segmentation methods directly determines efficiency of following additive manufacturing operations. Hence, the proposal is compared with a typical image segmentation method FCN for performance evaluation.
To better verify performance of the proposal, four aspects of evaluation metrics are introduced to visualize algorithm performance in the format of numerical values. The four aspects of metrics include: segmentation effect, time complexity, RPE performance, and APE performance. After simulative experiments on real-world scenes of SLAM-based additive manufacturing, the obtained results show that the proposal can have proper performance in terms of segmentation effect. The good image segmentation performance can well promote the following manufacturing operations.
Although the proposal can have proper performance in SLAM-based additive manufacturing process, there is still some distance to practical industrial application. The deep learning has received great development in recent years and brought much insight into many computer vision tasks. However, deep learning algorithms are mostly facing the problem of computational complexity, which requires relatively high hardware conditions [47]. Generalized into our VOLUME 11, 2023 proposal, how to improve the running efficiency and reduce computational complexity is the future direction of our work.

IV. CONCLUSION
In this paper, in order to achieve robust SLAM in dynamic scenes, a transformer based visual SLAM method is proposed. The method combines the segmentation technique with the motion-consistency detection algorithm. First, the transformer network is used to semantically segment the image to establish the a prior semantic information of dynamic objects, then the feature points belonging to dynamic objects are rejected by the motion-consistency detection algorithm. Finally, the static feature points are utilized for pose estimation and point cloud overlay. Simulation experiments are conducted to test function of the proposed method. The obtained results show that the absolute trajectory error and relative estimation error can be reduced additive manufacturing of vehicular parts compared with the traditional ORB_SLAM2 system.