A High-Precision Recognition Method of Circular Marks Based on CMNet Within Complex Scenes

Accurate recognition of circular marks is crucial for calibration, object tracking, and three-dimensional reconstruction in videogrammetry. However, most existing methods were designed for single or relatively simple scenes. When they are applied to more complex scenarios, the false detection and missed detection rates increase. In this article, we present a high-precision recognition method based on a novel deep learning model, circular-MarkNet (CMNet), to solve this problem. The proposed method consists of three main steps: first, circular marks are detected using the improved YOLOv4 model to narrow the search region of the circular contour; the contour of the circular marks is then extracted based on the saliency object detection model BASNet; and finally, least squares fitting is used to calculate the central pixel coordinate of the identified contour on the saliency map. The proposed method was tested under three complex scenarios with different characteristics and disturbances. The experimental results demonstrated that: the proposed CMNet can effectively recognize circular marks within complex scenes, which reveals the superiority and generalization ability of the proposed method; the improved YOLOv4 can significantly enhance the detection accuracy of circular marks, which is crucial to the subsequent saliency contour detection and circle center identification; and CMNet achieved the best performance, with an RMSE of 0.0713 pixel, compared to the state-of-the-art methods.


I. INTRODUCTION
HIGH-SPEED videogrammetry is an efficient and low-cost engineering method that provides spatial information of objects by image acquisition and processing. Owing to its high-precision, noncontact, and nondamaging nature, it has been widely applied in civil engineering [1]-[3], environmental science [4], and industrial inspection [5]. For these applications, artificial marks posted on the measured object help obtain spatial trajectory variation information and have yielded satisfactory precision. The most crucial factor is the automatic and high-precision recognition of the artificial marks used to indicate points of interest. This is a significant preprocessing step for subsequent high-precision camera calibration [6], displacement monitoring, and three-dimensional (3-D) reconstruction [7]. Several artificial marks have been widely used in videogrammetry. Owing to its scale, translation, and rotation invariance, the circular mark is more popular than others [8], [9]. Circular marks are divided into coded [10] and noncoded [11], [12]. Irrespective of type, traditional recognition methods consist of two steps: mark detection, which narrows the search area of the circular contour; and center identification, which corresponds to the centroid positioning of the circle. Localization methods are used to identify the center of the circle in the detected area. In the past few decades, digital image processing has been used for localization, including Hough transform (HT)-based methods [13], centroid methods [14]-[17], and point-fitting methods [18]-[21].
The crucial issue is the detection of the circular mark, which affects the precision of subsequent centroid positioning. Because coded information is readily identified, the coded mark is easily and automatically detected in the image. For noncoded marks, detection has relied on semiautomated methods and ellipse detection. Semiautomated methods may require the operator to box a search region for the locations of targets [2], [22], [23]. The ellipse detection methods [24], [25] use the geometric features of circle edges to complete the related detection tasks. In addition, some scholars have researched the extraction of marked regions. Guo et al. [26] used template matching to detect regions of interest containing artificial targets, while Ok [27] applied a region of interest (RoI)-based method to focus on finding a specific circle in an area. However, traditional methods use shallow image information, including texture, edge, and grayscale, which is easily affected by the background and lighting, and they indiscriminately misrecognize circular objects as circular marks in the image.
Deep learning methods, which have the advantage of detecting both shallow and deep features, have been widely used for image classification [28]-[30], target detection [31], [32], and image segmentation [33], [34]. For example, CSPDarknet53 combined the cross stage partial structure with a residual structure to extract different levels of features [35]. Path aggregation network (PANet) [36] added a bottom-up path to the feature pyramid network (FPN) [37] to better integrate shallow and deep features. Deep learning methods have been shown to be more effective than traditional methods. Therefore, they are the research focus for object detection tasks. In general, deep learning-based object detection can be divided into two categories: two-stage and one-stage methods. Region-based convolutional neural network (R-CNN) [38], fast R-CNN [39], and faster R-CNN [40] are the commonly used two-stage methods. They first generate candidate object proposals and then extract features from the proposals using a CNN. One-stage methods include SSD [41], the YOLO series [42]-[45], and CenterNet [46]. They do not require candidate object proposals but directly regress the class scores and location coordinates of the object, so they have obvious advantages in efficiency. Zhou et al. [47] proposed a motion-blurred vision object recognition model based on a CNN. Shi and Zhang [48] used a faster R-CNN to locate and recognize the motion of a specially designed coded target in blurred images. Kinaz et al. [49] proposed a new deep CNN for the automatic detection and recognition of the coded target. However, these studies tested their detection algorithms only under laboratory conditions with a stable lighting environment. The performance of these methods in complex backgrounds remains to be explored.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Detected rectangular regions may accidentally contain interference information from the background, or noise, which will affect the center positioning of the circular mark. Salient object detection (SOD) methods can highlight the most prominent objects in an image and filter irrelevant interference information. Many studies have used SOD for auxiliary detection tasks. Han and Fu [50] proposed a saliency-based method to extract circular array objects from remote sensing images with high spatial resolution. Zhang et al. [51] used a saliency-guided sampling strategy to extract a representative set of patches from very high-resolution (VHR) images. Li et al. [52] fused a heat map with a saliency map to improve object detection performance.
From the above analysis, most existing methods have been tested under synthetic and simple background environments. However, in actual high-speed videogrammetry, various complex environments exist, including indoor scenes, outdoor scenes, and scenes with circle-like interference. Precise recognition of circular marks within complex scenes is still a challenging task for the following reasons. First, the low-light indoor environment leads to blurred imaging and low contrast of the circular marks, which therefore cannot be detected and located well. Second, overexposure leads to unclear outlines of some marks in outdoor scenes, making positioning more difficult. Third, the background in actual applications is more complex, and there are many classes of circle-like interference, increasing the probability of false detection. These factors increase the difficulty of mark detection and result in bias and false detections when existing algorithms are used. Therefore, it is necessary to develop a more robust algorithm. The objective of this article is to propose a high-precision circular mark recognition method based on circular-MarkNet (CMNet) for complex scenes. The proposed method adopts a coarse-to-fine strategy. In the coarse stage, an object detection model is used to detect the circular marks in the image to narrow the search region of the circular contour. In the fine stage, we use a visual attention mechanism (VAM) to generate a saliency map of the detected rectangular region, focusing on the extraction of the circular contour. The central pixel coordinates of the identified contour are then calculated using the least squares fitting (LSF) method [53] on the saliency map. The main contributions of this article are as follows.
1) We propose a deep learning-based framework, CMNet, for recognition of circular marks within complex scenes. The network adopts a coarse-to-fine strategy and introduces a VAM based on an SOD network, BASNet. The false identification caused by background noise can be greatly reduced and sub-pixel level accuracy can be achieved.
2) We propose an improved YOLOv4 model for circular mark detection. The modified model uses a large-scale feature map optimization structure and attention mechanism blocks (AMBs) between the neck and head to improve the accuracy of mark detection in complex environments.
3) We generate a circular mark recognition (CMR) dataset. The dataset contains three complex scenarios: indoor, outdoor, and circle-like interference. The experimental results reveal the superiority and generalization ability of the proposed method.
The rest of the article is organized as follows: Section II introduces the dataset. The details of the circular mark recognition methodology are described in Section III. Section IV presents and analyzes a series of comparative experiments, and Section V concludes the article.

II. DATASET
The nonretroreflective targets used in this article were simple white circular marks on a black backing, which were placed at critical points on the measured object. The circular mark images were collected at Tongji University from experiments such as the collapse of civil structures, butt joints, and experimental models of frame shakers. Images were acquired using a CamRecord CL600×2 high-speed camera (Optronis, Germany; 1280 × 1024 pixel image resolution) and a Basler ACA 2040-180 KM (Basler, Germany; 2048 × 2048 pixel image resolution). To verify the effectiveness of the proposed method under various conditions, images were captured in three different scenarios: indoor, outdoor, and circle-like interference scenes. The camera was located 1.0-8.5 m from the object, and images were acquired under different illumination conditions, including high and low light conditions. In this article, 1095 images of circular marks, covering different angles and various scales, were collected as the experimental dataset. Fig. 1 shows three scenarios of the CMR dataset. Indoor scenes usually have low light intensities, and image quality is significantly affected, making the edges of some circular marks unclear [see Fig. 1(a)]. In outdoor scenes, owing to strong illumination, some marks are overexposed, resulting in fuzzy edges of the circular marks [see Fig. 1(b)]. The image in Fig. 1(c) was taken under dark indoor conditions, and the background was complex with many circle-like interferences, including holes, bolts, and light bulbs. Table I gives a detailed description of the scenes. For the object detection model training, the CMR dataset was split into training and test sets in a ratio of 8:2. Subsequently, LabelImg software was used to label the circular marks in the images in the PASCAL VOC format. For saliency object detection model training, 2020 images of the circular mark regions were collected as the dataset. Labelme software was used to segment the circle contour.

III. METHODOLOGY
The flowchart of the proposed high-precision recognition method is shown in Fig. 2, which was constructed around the neural network CMNet. The method consists principally of three components, including circular mark detection, saliency contour extraction and circle center identification. The circular mark detection model was used to extract the region of the circular mark from the images based on the improved YOLOv4 (I-YOLOv4). The boundary-aware saliency detection network, BASNet [54], was used to generate the saliency map to focus on circular contour, while avoiding background noise interference, and the subpixel center coordinate was calculated on the saliency map of the circular contour by LSF.
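The three components can be chained as in the following minimal sketch. It is only an illustration of the coarse-to-fine data flow, not the authors' implementation: the function `recognize_marks` and the callables `detect`, `saliency`, and `fit_center` are hypothetical placeholders for the detector, the saliency network, and the LSF step, and the image is assumed to be a nested list of rows. The point it shows is that a center fitted inside a cropped region must be shifted by the box origin to return to full-image coordinates.

```python
def recognize_marks(image, detect, saliency, fit_center):
    """Coarse-to-fine recognition: detect boxes, build a saliency map per box, fit the center.

    All three callables are hypothetical stand-ins (assumptions, not the paper's code):
      detect(image)      -> list of (x0, y0, x1, y1) integer boxes
      saliency(crop)     -> saliency map with the same size as the crop
      fit_center(sal)    -> (cx, cy) sub-pixel center in crop coordinates
    """
    centers = []
    for x0, y0, x1, y1 in detect(image):
        crop = [row[x0:x1] for row in image[y0:y1]]  # region of one detected mark
        sal = saliency(crop)                          # suppress background inside the box
        cx, cy = fit_center(sal)                      # center relative to the crop
        centers.append((x0 + cx, y0 + cy))            # shift back to full-image coordinates
    return centers


# Usage with trivial stand-ins: one box, identity saliency, geometric crop center.
image = [[0] * 10 for _ in range(10)]
result = recognize_marks(
    image,
    detect=lambda img: [(2, 3, 6, 7)],
    saliency=lambda crop: crop,
    fit_center=lambda sal: (len(sal[0]) / 2, len(sal) / 2),
)
print(result)  # [(4.0, 5.0)]
```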

A. Detection of Circular Mark Based on the Improved YOLOv4
The CMR dataset used in this article has more small-sized objects; thus, YOLOv4's [45] high-level detector head was unable to predict small-sized circular marks. In addition, interference from the complex background, including circle-like objects (light bulbs, bolts, holes, etc.), low light, and overexposed circle marks, caused errors using the original YOLOv4 model. Therefore, we made two adjustments to the original network structure. First, a larger-scale feature map optimization structure was employed on YOLOv4's neck and head to make the model robust with small marks. Second, AMBs were embedded to pay more attention to the channel and spatial feature information to enhance the detection capability of circular marks in complex environments.
1) Feature Map Optimization: The original YOLOv4 has three output layers with down-sampling of 32, 16, and 8 times. The receptive field of a feature map element is the region of the input image that maps to it, so deeper network layers have larger receptive fields. Feature maps with larger receptive fields are used to detect large marks. Low-level shallow feature maps retain more spatial information; therefore, they are more suitable for detecting small circular marks. Consequently, it is necessary to design a new network structure with finer feature maps to detect small targets effectively. Fig. 4 shows the original structure and three newly designed structures with different redirected necks and heads. We tested different redirected necks and heads and found that the 4-in and 3-out structure worked best. Therefore, a larger-scale feature map optimization structure, with 4-in and 3-out, was adopted in this article.
The 16 × 16 feature map of the original YOLOv4 is responsible for detecting large objects, and feature maps with resolutions of 32 × 32 and 64 × 64 are responsible for detecting mid-sized and small targets, respectively. Because there are more small targets in the CMR dataset, we made corresponding improvements. In the neck network, we added a 4 times down-sampled feature map to the original three-scale feature maps from the backbone for feature fusion. For the detector head, the 32 times down-sampled feature map has too large a receptive field to regress targets of circular-mark size; therefore, it was deleted. Similar to the neck, we added a 4 times down-sampled feature map to detect small circular marks. This redirected structure is referred to as the 4-in and 3-out neck network. The sizes of the three-scale output layer of the detection network were changed from the original 16 × 16, 32 × 32, and 64 × 64 to 32 × 32, 64 × 64, and 128 × 128 to improve the detection accuracy of circular marks (see Fig. 3).
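The relation between down-sampling factor and output size can be checked with a short calculation. The sketch below assumes a 512 × 512 network input, which is inferred from the 16 × 16 map at 32 times down-sampling and is not stated explicitly in the text; the function name is illustrative.

```python
def feature_map_sizes(input_size, strides):
    """Return the side length of the square feature map for each down-sampling factor."""
    return [input_size // stride for stride in strides]


# Assumption: 512 x 512 input, inferred from a 16 x 16 map at 32 times down-sampling.
INPUT_SIZE = 512

original_heads = feature_map_sizes(INPUT_SIZE, [32, 16, 8])   # original YOLOv4 heads
redirected_heads = feature_map_sizes(INPUT_SIZE, [16, 8, 4])  # 4-in and 3-out heads

print(original_heads)    # [16, 32, 64]
print(redirected_heads)  # [32, 64, 128]
```

Dropping the 32 times head and adding a 4 times head doubles the resolution of every output layer, which is why the redirected structure is better suited to small marks.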
2) Attention Mechanism Block: The AMB consists of some particular convolution layers. It does not change the size of the feature map but can enhance target feature expression to increase detection ability. Therefore, an AMB can be easily inserted into the current object detection model. Fig. 3 shows the structure of the I-YOLOv4 model, where three AMBs are embedded after the three-scale output layer.
Given an intermediate feature map $F \in \mathbb{R}^{H \times W \times C}$ as input, the AMB sequentially infers a channel attention map $M_C \in \mathbb{R}^{1 \times 1 \times C}$ and a spatial attention map $M_S \in \mathbb{R}^{H \times W \times C}$. The channel and spatial attention mechanism process can be summarized as follows:
$$F' = M_C(F) \otimes F, \qquad F'' = M_S(F') \otimes F' \tag{1}$$
where $\otimes$ represents element-wise multiplication, $F'$ is the intermediate output of the channel attention step, and $F''$ denotes the final adjusted output. Fig. 3 shows the specific calculation process of the AMB attention map. The channel attention module (CAM) in the AMB is arranged before the spatial attention module (SAM). The squeeze and excitation (SE) [55] block is a CAM that applies attention to objects from the perspective of channel features. It can suppress background information and highlight foreground characteristics by adaptively re-weighting channel-wise features. In this article, the SE block was used as a CAM to decrease the error detection of circular marks. Specifically, the feature map of each channel was reduced by average pooling to give a tensor of size $1 \times 1 \times C$, and the channel attention $M_C$ was obtained after two $1 \times 1$ convolutional layers. The calculation process for $M_C$ can be expressed as follows:
$$M_C(F) = \sigma\left(C_r^{1 \times 1}\left(CM_r^{1 \times 1}\left(\mathrm{AvgPool}(F)\right)\right)\right) \tag{2}$$
where $CM_r^{1 \times 1}$ denotes the convolution operation before the Mish activation function, with $1 \times 1$ representing the size of the convolution kernel and $r$ representing the reduction ratio. $C_r^{1 \times 1}$ is the convolution operation, which has the same superscript and subscript as $CM_r^{1 \times 1}$. $\sigma$ denotes the sigmoid function. The SAM pays attention to objects at the spatial scale. Generally, the foreground occupies far fewer pixels than the background. Therefore, more attention should be paid to the foreground region. This article used a mask of the same size and depth as the input feature map to generate the spatial attention map $M_S$. Specifically, the mask used to generate spatial attention is produced by a $1 \times 1$ convolution layer. The calculation process for $M_S$ is given by (3), where $F'$ denotes the intermediate feature map output by the CAM, and $CB_r^{1 \times 1}$ represents the convolution operation before the batch normalization operation, with the same superscript and subscript as $CM_r^{1 \times 1}$.
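The sequential channel-then-spatial weighting described above can be sketched in plain Python on a small nested-list tensor. This is an illustrative approximation rather than the authors' Darknet implementation: the weights are random, batch normalization is omitted, and the sigmoid applied to the spatial mask is an assumption, since the text only states that the mask comes from a 1 × 1 convolution. All function names are hypothetical.

```python
import math
import random


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(math.log1p(math.exp(x)))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv1x1(weights, pixel):
    """A 1 x 1 convolution at one pixel is a matrix-vector product over channels."""
    return [sum(w * v for w, v in zip(row, pixel)) for row in weights]


def channel_attention(f, w_reduce, w_expand):
    """SE-style CAM: global average pool -> 1x1 conv + Mish -> 1x1 conv -> sigmoid."""
    h, w, c = len(f), len(f[0]), len(f[0][0])
    pooled = [sum(f[i][j][k] for i in range(h) for j in range(w)) / (h * w)
              for k in range(c)]
    hidden = [mish(x) for x in conv1x1(w_reduce, pooled)]   # C -> C / r
    return [sigmoid(x) for x in conv1x1(w_expand, hidden)]  # C / r -> C


def spatial_attention(f, w_mask):
    """SAM: per-pixel mask with the same H x W x C shape as the input.

    Assumption: 1x1 conv followed by a sigmoid; batch normalization is omitted.
    """
    return [[[sigmoid(x) for x in conv1x1(w_mask, px)] for px in row] for row in f]


def amb(f, w_reduce, w_expand, w_mask):
    """Apply channel attention first, then spatial attention; the shape is unchanged."""
    m_c = channel_attention(f, w_reduce, w_expand)
    f1 = [[[x * a for x, a in zip(px, m_c)] for px in row] for row in f]
    m_s = spatial_attention(f1, w_mask)
    return [[[x * a for x, a in zip(px, mask_px)] for px, mask_px in zip(row, mask_row)]
            for row, mask_row in zip(f1, m_s)]


def random_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
H, W, C, R = 4, 4, 8, 4  # toy sizes; R is the reduction ratio
feature = [[[random.uniform(-1.0, 1.0) for _ in range(C)] for _ in range(W)]
           for _ in range(H)]
out = amb(feature, random_matrix(C // R, C), random_matrix(C, C // R), random_matrix(C, C))
print(len(out), len(out[0]), len(out[0][0]))  # 4 4 8
```

Because both attention maps take values between 0 and 1, the block can only attenuate activations, which matches the stated goal of suppressing background responses while leaving the feature map size unchanged.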

B. Extraction of Saliency Circular Mark Contour Based on BASNet
For identification of the circular mark center, the preprocessing consists of two steps, namely, image binarization and edge detection on the binary image. Binarization is performed by converting grayscale image pixels to zero or one using the adaptive local threshold method [20], [56]. However, as shown in Fig. 5(b), the detected rectangular region usually contains background noise in addition to the circular mark, which subsequently affects the localization of the mark center. In this article, we adopt an SOD model, BASNet, as a VAM to filter out the background noise while retaining the sub-pixel contour information of the circle.
The network structure of BASNet, shown in Fig. 6, consists of a prediction module (PM) and a residual refinement module (RRM). The U-Net [57] structure is employed in the PM. The encoder extracts the feature map through the basic resblocks adopted from ResNet-34 [58]. Both the encoder and decoder have six levels. The input to each decoder level is the concatenation of the up-sampled output from the previous level and the feature map of the corresponding level in the encoder. The output from the PM is a coarse map, in which the boundary of the mark is rough. The RRM then refines the saliency map of the PM by learning the residuals between the predicted saliency map and the ground truth. Like the PM, it has encoder and decoder phases; unlike the PM, both its encoder and decoder have four levels. The final output is a refined saliency map that preserves the sub-pixel contour of the circular mark and removes background noise [see Fig. 5(c)].

C. Identification of Circular Mark Center Based on Least Squares
The circle edge points can be recognized quickly and accurately on the saliency map using the Canny operator. In this article, the LSF is used to fit an ellipse to the edge points and obtain its center to achieve sub-pixel positioning. The general expression for the ellipse equation is as follows:
$$f(\alpha, X) = \alpha X = Ax^2 + Bxy + Cy^2 + Dx + Ey + F = 0 \tag{4}$$
where $A$, $B$, $C$, $D$, $E$, and $F$ are the ellipse parameters (six coefficients defined up to a common scale factor), $\alpha = (A, B, C, D, E, F)$, and $X_i = (x_i^2, x_i y_i, y_i^2, x_i, y_i, 1)^{T}$ for the $i$th edge point. According to the principle of least squares, the curve-fitting problem can be solved by minimizing the sum of the squared algebraic distances over the $n$ edge points
$$\min_{\alpha} \sum_{i=1}^{n} f(\alpha, X_i)^2 = \min_{\alpha} \sum_{i=1}^{n} \left(Ax_i^2 + Bx_i y_i + Cy_i^2 + Dx_i + Ey_i + F\right)^2 \tag{5}$$
The elliptic parameters $A$, $B$, $C$, $D$, $E$, and $F$ in (5) can be obtained by calculating the first-order partial derivatives and setting them to 0. If the center coordinate of the ellipse is $P(x_0, y_0)$, the calculation formula can be expressed as
$$x_0 = \frac{BE - 2CD}{4AC - B^2}, \qquad y_0 = \frac{BD - 2AE}{4AC - B^2}$$
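The fitting step can be illustrated with a small standard-library sketch. Several choices here are assumptions and not taken from the article: the trivial all-zero solution is excluded with the normalization A + C = 1, the points are mean-centered for numerical stability, and the normal equations are solved with a hand-written Gaussian elimination. The function names are hypothetical. The center is computed as the point where both partial derivatives of the fitted conic vanish.

```python
import math


def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x


def fit_ellipse_center(points):
    """Least squares conic fit under the assumed normalization A + C = 1; returns (x0, y0)."""
    count = len(points)
    mean_x = sum(x for x, _ in points) / count
    mean_y = sum(y for _, y in points) / count
    shifted = [(x - mean_x, y - mean_y) for x, y in points]
    # With C = 1 - A the conic reads A(x^2 - y^2) + Bxy + Dx + Ey + F = -y^2.
    rows = [[x * x - y * y, x * y, x, y, 1.0] for x, y in shifted]
    target = [-y * y for _, y in shifted]
    # Normal equations (J^T J) p = J^T t for the unknowns p = (A, B, D, E, F).
    jtj = [[sum(r[i] * r[j] for r in rows) for j in range(5)] for i in range(5)]
    jtt = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(5)]
    a, b, d, e, _f = solve_linear(jtj, jtt)
    c = 1.0 - a
    denominator = 4.0 * a * c - b * b  # positive for an ellipse
    x0 = (b * e - 2.0 * c * d) / denominator
    y0 = (b * d - 2.0 * a * e) / denominator
    return x0 + mean_x, y0 + mean_y


# Synthetic edge points on a rotated ellipse centered at (12.3, 45.6).
edge_points = []
for k in range(40):
    t = 2.0 * math.pi * k / 40
    u, v = 5.0 * math.cos(t), 3.0 * math.sin(t)
    edge_points.append((12.3 + u * math.cos(0.4) - v * math.sin(0.4),
                        45.6 + u * math.sin(0.4) + v * math.cos(0.4)))

center = fit_ellipse_center(edge_points)
print(center)  # expected to be close to (12.3, 45.6)
```

In practice the edge points would come from the Canny edges of the saliency map; the synthetic ellipse is used here only so that the recovered center can be compared with a known value.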

A. Experimental Design and Environment
The training process was carried out on an NVIDIA GeForce GTX 1080 Ti GPU with 12 GB of memory, an AMD Ryzen 7 2700 eight-core processor, and 48 GB of system memory. The I-YOLOv4 algorithm was implemented in Darknet. During the training process, the batch size was set to 64. The initial learning rate was set to 0.013, momentum was set to 0.949, and decay was set to 0.0005. BASNet was trained using the derived target binary images. During the training phase, images in the training set were resized to 256 × 256 pixels. The initial learning rate was set to 0.01, and the batch size was set to 8. The training converged after 60 000 iterations.

B. Comparison of Circular Mark Detection Network
First, comparison experiments of the four structures based on the YOLOv4 network are introduced in this section. We then compared the performance of each strategy of the best structure with that of YOLOv4. Finally, the improved network model was compared with various detection models to verify the effectiveness of object detection in circular mark images. AP, precision, recall and detection efficiency (detection time / number of images) are evaluation metrics.
1) Structure: We compared four structures of the neck and the head. Structure 1 was the original YOLOv4 model, called 3-in and 3-out [see Fig. 4(a)]. In structure 2 [see Fig. 4(b)], we removed the 32 times down-sampling feature maps and used only the 16 and 8 times feature maps to detect circular marks. To use a high-resolution feature map to detect small targets, structure 3 [see Fig. 4(c)] feeds a 4 times down-sampling feature map into the neck. Correspondingly, we added a fourth detector (detector 4) to receive the feature map of the same size from the neck. Structure 4 [see Fig. 4(d)] was similar to structure 3 in being 4-in, but used only the three higher-resolution feature maps. Comparison experiments were conducted in the same environment. The detection accuracies of these four structures are given in Table II. The four structures obtained feature maps of different scales from the same backbone network (CSPDarknet53) for feature fusion. As given in Table II, structure 4 yielded the best result, with an AP of 97.22%. The accuracies of structures 2, 3, and 4 were improved by 2.26%, 0.57%, and 2.67%, respectively, compared to structure 1. The reason for this improvement was that the high-resolution feature map added by structure 4 contained richer spatial information. After fusion with the feature maps of the other three scales, circular marks were detected under various conditions. Structure 1 and structure 3 both had a low-resolution detection layer. Owing to the lack of sufficient spatial information, they did not perform well on the detection of small circular marks, leading to precisions of only 94.96% and 94.55%, respectively. Because structure 2 had only two detection layers and lacked the ability to detect small targets, its recall was the lowest (92.89%) of the four structures (see Table II).
2) Ablation: Table III compares each strategy with the best neck-head structure (YOLOv4 with structure 4) before the attention modules were added. Compared with YOLOv4 with structure 4, the accuracy was improved by 0.45% and 1.03% by adding SAM and CAM, respectively. After inserting the AMB (CAM + SAM), the detection accuracy was improved by 1.38%. This indicated that the channel attention and spatial attention mechanisms introduced more semantic and spatial information. In summary, the I-YOLOv4 improved accuracy by 3.64% compared to the original YOLOv4.
3) Performance Comparison With Other Detection Models: Qualitative and quantitative comparisons were made among Faster R-CNN [40], SSD [41], YOLOv3 [44], CenterNet [46], and the I-YOLOv4. Fig. 7 shows the mark detection results of the different deep learning detection models under complex scenes, including indoor scenes, outdoor scenes, and circle-like background scenes. Faster R-CNN and SSD did not reliably detect multiscale marks in the images, missing large targets in the vicinity. Faster R-CNN, SSD, CenterNet, and YOLOv3 missed occluded and oblique marks. Faster R-CNN and the I-YOLOv4 incorrectly detected the background as a mark. For the outdoor scenes in Fig. 7, the detection results of most models were good, except for SSD. Some problems were apparent with YOLOv3 and CenterNet. YOLOv3 missed the overexposed and occluded marks, and CenterNet incorrectly detected circle-like cables as marks. Faster R-CNN, SSD, CenterNet, and YOLOv3 detected complex backgrounds as marks [see Fig. 7(c)]. In contrast, the I-YOLOv4 exhibited good detection performance under our experimental conditions. Table IV compares the AP, precision, recall, and detection efficiency of the five models employed in this article. Our improved method achieved the highest AP of 98.60%. The I-YOLOv4, which used a new neck-head structure (4-in and 3-out), exhibited excellent feature abstraction and feature fusion capabilities in the detection of small circular marks (see Table IV). The modified YOLOv4 with AMB accurately detected circular marks in complex scenarios. CenterNet exhibited good detection accuracy (AP of 96.64%), but detection required 54.9 ms per image, and recall was 3.96% lower than that of the I-YOLOv4. CenterNet employs keypoint estimation to find the center point and regresses to other attributes.
The centers of dense circular marks overlap after feature map down-sampling, resulting in CenterNet failing to detect dense and occluded circular marks in images. YOLOv3 uses FPN for feature fusion and the backbone network for feature extraction, whose feature abstraction ability is weaker than that of PANet [36] and CSPDarknet53. In addition, the structure of YOLOv3 is similar to that of the original YOLOv4. Thus, the AP was 4.16% lower than that of the I-YOLOv4. Faster R-CNN and SSD use only a single feature layer for object prediction, which cannot cope with the detection of multiscale scenes in the CMR dataset. Furthermore, because there is no feature fusion of multiple hierarchical feature maps, low-level features lack sufficient semantic information, and high-level features lack sufficient spatial information, making it difficult to accurately detect and locate circular marks in some complex backgrounds.
In addition, we compared the receiver operating characteristic (ROC) curves of the different models. The ROC curve is an important metric for evaluating the detection effect at the same false positive rate [59], [60]: the higher the true positive rate (TPR), the better the detection effect. The ROC curves were acquired using Monte Carlo simulations [61], [62] on the CMR dataset. As given in Table IV and Fig. 8, the recall of SSD is only 46.23%, although its precision is 92.46%. SSD has both few FPs and few TPs, causing its ROC curve to lie close to the x-axis. CenterNet and YOLOv3 exhibit similar performance on the ROC curves. The I-YOLOv4 works best, with fewer FPs and a higher TPR.

1) Evaluation Metrics: We used a high-precision total station (SOKKIA NET05AX) to obtain the 3-D coordinates of the mark centers. According to the camera calibration parameters and exterior orientation elements, the collinearity equations are used to obtain the image pixel coordinates of the circular mark centers as the ground truth values. On the one hand, we use the mean absolute error (MAE) and root-mean-square error (RMSE) to measure the accuracy of mark center localization. On the other hand, precision, recall, and F-measure are used to verify the algorithms' ability to recognize circular marks in real images, and they are defined as

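$$\text{Precision} = \frac{TPs}{TPs + FPs}, \qquad \text{Recall} = \frac{TPs}{TPs + FNs}, \qquad \text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{8}$$
where TPs, FPs, and FNs are the numbers of true positives, false positives, and false negatives, respectively. We define points with RMSE less than 0.5 pixel as TP.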
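The metrics above can be computed as in the following sketch. Two points are assumptions rather than statements from the article: the per-point localization error is taken as the Euclidean pixel distance between predicted and ground-truth centers, and the F-measure is taken as the harmonic mean of precision and recall. The counts in the usage lines are made-up illustration values, not results from the experiments.

```python
import math


def localization_errors(predicted, truth):
    """Euclidean pixel distance between each predicted and ground-truth center (assumed error)."""
    return [math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(predicted, truth)]


def mae(errors):
    """Mean absolute error of the per-point errors."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root-mean-square error of the per-point errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-measure (harmonic mean) from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Illustrative values only.
errors = localization_errors([(1.0, 1.0), (4.0, 5.0)], [(1.0, 1.0), (1.0, 1.0)])
print(mae(errors), rmse(errors))        # 2.5 and about 3.54
print(precision_recall_f(90, 10, 5))    # precision 0.9, recall about 0.947, F about 0.923

# Consistency check with the reported precision (98.77%) and recall (99.37%).
f_reported = 2.0 * 98.77 * 99.37 / (98.77 + 99.37)
print(round(f_reported, 2))             # 99.07
```

The last lines show that the harmonic mean of the reported precision and recall reproduces the reported F-measure of 99.07%, which supports reading the F-measure in this way.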
2) Compared Methods: We compared the performance of the proposed method with three state-of-the-art methods, namely, the centroid search algorithm [17], Arc-Support [24], and the arc adjacency matrix-based ellipse detector (AAMED) [25]. The centroid search algorithm integrated into the PhotoModeler Scanner software can achieve high-precision positioning of marks and is currently widely used in videogrammetry [2], [22], [23]. Arc-Support utilizes rich geometric features and arc-support line segments to complete ellipse detection tasks. AAMED detects ellipses robustly by constructing a digraph-based arc adjacency matrix (AAM) for arc pairing. In addition, the role of the VAM is explored by comparison with our method without the VAM.
3) Accuracy Comparison of Center Identification: First, we compare the mark recognition rates of the aforementioned methods. Since the centroid search algorithm requires manual selection of marks, we did not include it in the recognition comparison experiment. Then, we compare the accuracy of the methods in locating the center of the circle.
As given in Table V, the recalls of Arc-Support and AAMED are 93.21% and 95.86%, respectively. Both methods misrecognize many circle-like objects, such as light bulbs, pipes, and auto wheels, and their precisions are only 83.30% and 86.28%, respectively. The ellipse detection methods cannot distinguish between circular marks and general circle-like objects. Due to the background noise interference after binarization, the precision of our method without VAM is only 79.69%. From point 10 of Fig. 10(a), point 12 of Fig. 10(b), and point 14 of Fig. 10(c), it can be seen that the proposed method without VAM has deviations and false alarms in the positioning of the circular mark. After introducing the VAM, our proposed method can effectively recognize most of the circular marks in the image, and the precision and recall reach 98.77% and 99.37%, respectively. This shows that the VAM improved the recognition rate of the circular marks.
Both the centroid search and our proposed method can identify the center of the circular mark very well (see Fig. 10). In scene 1, Arc-Support, AAMED, and the proposed method without VAM cannot identify the overexposed circular marks at the location of the yellow triangle. In scene 2, at the position of the blue circle, the white background was recognized as the center of the circle without VAM. In scene 3, the proposed method was not influenced by the light bulb. However, owing to the absence of VAM and the small size of the circular mark, there was a large deviation in the identification of the method without VAM, as revealed by the location of the blue circle (see Fig. 10). Arc-Support and AAMED missed some points in scene 3 of Fig. 10 due to the smaller size of the circular marks and noise. Fig. 9 and Table VI show that the proposed method achieved the best result, with a mean RMSE of 0.0796 pixel. Without VAM, there were many false and erroneous identifications at some points [such as point 7 and point 10 in Fig. 10(a) and point 5 and point 14 in Fig. 10(c)]. After adding the VAM, the RMSE of the proposed method was reduced by 0.3942 pixel. In particular, in a complex background scene (scene 3), the RMSE was reduced from 1.0188 pixel to 0.0713 pixel. The RMSEs of Arc-Support and AAMED are 0.2790 pixel and 0.2676 pixel, respectively. However, these methods cannot directly identify circular marks from the entire image and need to use a sliding window to traverse the image, which reduces the efficiency of detection. Our proposed method was comparable to the centroid search method, for which manual selection is required. In summary, the proposed method achieved the best performance under all three complex scenarios.
4) Efficiency Comparison of Different Methods: As given in Table VII, the centroid search algorithm takes an average of 125.20 s per image to identify all circular marks. This is because the centroid search algorithm requires manual indication of the search area, leading to low recognition efficiency. The recognition times of Arc-Support and AAMED are 2.29 and 3.91 s, respectively. I-YOLOv4 is a one-stage detector that can quickly extract and fuse deep and shallow features, making the recognition time of CMNet without VAM only 0.89 s. VAM performs boundary-aware detection on each circular mark region in the image, making the salient contour extraction stage more time-consuming. Thus, adding the VAM increased the recognition time per image by 3.36 s, to 4.25 s.

V. CONCLUSION
In this article, a high-precision recognition method based on a novel deep learning model, CMNet, is presented. First, circular marks were detected based on the I-YOLOv4 model to narrow the search region of the circular contour. The contours of the circular marks were then extracted based on the saliency object detection model BASNet, and LSF was used to calculate the central pixel coordinate of the identified contour on the saliency map. Two major improvements were made to the original YOLOv4 model. The first improvement was a large-scale feature map optimization structure, realized as a 4-in and 3-out structure. The second modification was to insert AMBs between the neck and head to improve the accuracy of mark detection in a complex environment.
Three complex scenes with different characteristics and disturbances were used to evaluate the effectiveness and robustness of the method. These three scenes were an indoor scene with low light, an outdoor scene with extremely strong light and overexposure, and a dense scene with multiple circle-like objects. The experimental results demonstrated the following.
1) The improved YOLOv4 significantly enhanced the detection accuracy of circular marks, which is crucial to the subsequent saliency contour detection and circle center positioning.
2) CMNet achieved the best performance with an RMSE of 0.0713 pixel, comparable to the precision of the commercial software PhotoModeler.
3) The precision, recall, and F-measure of center identification of our proposed method are 98.77%, 99.37%, and 99.07%, respectively.
The results indicate that our method exhibits a good ability for circular mark recognition in complex environments. In the future, we will try to propose an end-to-end circular mark recognition network with shared weights to realize real-time circle center identification.