An Anchor-Free Convolutional Neural Network for Real-Time Surgical Tool Detection in Robot-Assisted Surgery

Robot-assisted surgery (RAS), a type of minimally invasive surgery, is used in a variety of clinical procedures because it offers faster recovery and causes less pain. Automatic video analysis of RAS is an active research area, in which precise real-time surgical tool detection is an important step. However, most deep learning methods currently employed for surgical tool detection are based on anchor boxes, which results in low detection speeds. In this paper, we propose an anchor-free convolutional neural network (CNN) architecture, a novel frame-by-frame method using a compact stacked hourglass network, which models the surgical tool as a single point: the center point of its bounding box. Our detector eliminates the need to design a set of anchor boxes, and is end-to-end differentiable, simpler, more accurate, and more efficient than anchor-box-based detectors. We believe our method is the first to incorporate the anchor-free idea for surgical tool detection in RAS videos. Experimental results show that our method achieves 98.5% mAP and 100% mAP at 37.0 fps on the ATLAS Dione and EndoVis Challenge datasets, respectively, and truly realizes real-time surgical tool detection in RAS videos.


I. INTRODUCTION
Robot-assisted surgery (RAS) is the latest development in minimally invasive surgical technology. Robotic surgical tools make it easy to perform complex motion tasks during surgery by transforming the surgeon's real-time hand movements and forces acting on the tissue into small-scale movements [1]. Despite its advantages in minimally invasive surgery, the RAS system still has problems, such as a narrow field of view, narrow operating space, and insufficient tactile feedback, which may lead to perforation of organs and tissues during an operation [2]. Surgical tool detection can help solve these problems by providing the trajectory of a tool to realize surgical navigation. In addition, real-time information on the motion of a surgical tool can help model poses for real-time automated surgical video analysis [3]-[6], which assists surgeons with automatic report generation, optimized scheduling, and offline video indexing for educational purposes [7]. Hence, in this study, we focus on real-time surgical tool detection in videos.
The associate editor coordinating the review of this manuscript and approving it for publication was Ting Li.
Many methods have been proposed for surgical tool detection. Image-based methods are becoming more popular, as they rely purely on equipment already in the operating theatre [8]. Deep convolutional neural networks (CNNs) have been applied to various RAS image-based tasks, such as surgical tool detection [9]-[13], tracking [14]-[18], pose estimation [19]-[21], and segmentation [22]-[25]. Single- and two-stage detectors are generally used to detect surgical tools. Two-stage detectors [1], [3] apply a region proposal network (RPN) to generate region proposals, which are then passed to a final classification and bounding box refinement network; single-stage detectors [13], [26] place anchor boxes densely over an image and generate final box predictions by scoring anchor boxes and refining their coordinates through regression. Both single- and two-stage detectors use anchor boxes extensively, but single-stage detectors are more competitive and efficient than two-stage detectors. However, anchor boxes have two drawbacks [27]. They introduce many hyperparameters that require careful design, and they create a huge imbalance between positive and negative anchor boxes that slows down training. Methods using anchor boxes usually detect surgical tools with high accuracy, but cannot detect them in real time (they run at less than 20 fps). The method proposed by Jin et al. [5] and the surgical tool detection method proposed by Twinanda et al. [9] only detect the presence of a tool and cannot output its location. Zhao et al. [13] presented a CNN-cascaded surgical tool detection method, which cannot be trained end to end and requires the output heatmaps to be designed carefully. Compared with the work of Zhao et al. [13], our method not only supports end-to-end training but also uses a more efficient and compact CNN backbone, and its accuracy exceeds theirs at comparable speeds.
In view of the deficiencies of the methods mentioned above, and inspired by CenterNet [28], we propose a single-stage approach to detect surgical tools without anchor boxes. We introduce a compact stacked hourglass network [29] to detect the surgical tool as the center point of its bounding box. We evaluated the performance of the proposed method on the publicly available ATLAS Dione dataset [1] and the EndoVis Challenge dataset [21], and our approach performed better than three state-of-the-art detection methods with regard to detection accuracy and speed.
Our main contributions are summarized as follows:
(1) We propose an anchor-free CNN architecture for real-time surgical tool detection in RAS. We integrate lightweight design ideas (the fire module and depthwise separable convolution [30]-[32]) into our architecture, so that surgical tools are detected faster with almost no loss of accuracy.
(2) Our approach assigns the "anchor" based only on location rather than box overlap [33]. Each object has only one positive "anchor," so no NMS is needed, and only local peaks in the keypoint heatmap must be extracted to convert points to bounding boxes.
(3) We extensively evaluate our proposed surgical tool detection approach on the ATLAS Dione and EndoVis Challenge datasets. For greater accuracy, we manually relabeled the EndoVis Challenge dataset. Our approach demonstrates superior performance over state-of-the-art approaches.
The rest of this paper is organized as follows. Section II introduces our approach, including the network architecture and the loss function for learning. Section III elaborates on the experiments and results. We discuss the effectiveness of our approach and future directions for improvement in Section IV. Finally, our conclusions are drawn in Section V.

A. NETWORK ARCHITECTURE
Inspired by [30]-[32], we designed a lightweight hourglass backbone that works better than the backbone of CenterNet [28]. The new network consists of two hourglass modules, and the residual modules in the traditional hourglass backbone are replaced with the more effective fire modules [30]-[32] to predict the heatmap at the center point of all instances of the surgical tools. Additional details can be found in Figure 1. The fire module first uses a 1 × 1 kernel to squeeze the input channels, which reduces the number of parameters and accelerates our network. The squeezed features then pass through a mixture of 1 × 1 and 3 × 3 kernels to produce the output. To accelerate the training of the network, we replace the original 3 × 3 standard convolution with a 3 × 3 depthwise separable convolution, as shown in the orange block (Dwise) in Figure 1. In general, a depthwise separable convolution splits an ordinary convolution into a depthwise convolution and a pointwise convolution. Its advantage is that the number of parameters and the computational complexity can be greatly reduced with little loss of precision.
Peaks in the heatmap correspond to tool centers [34]. Image features at each peak predict the height and width of the surgical tool bounding box (Figure 2). Inference is performed by a single network forward pass, without non-maximal suppression (NMS) [35] for post-processing.
In our architecture (Figure 3), we use a 7 × 7 convolution module and a residual module to reduce the input image size (512 × 512) by a factor of four, followed by two hourglass modules. We modified the architecture of the hourglass modules. Each is a symmetric two-layer downsampling and upsampling CNN with skip connections, each of which consists of a fire module. A fire module followed by nearest-neighbor upsampling is applied to upsample the features. There is a fire module in the middle of each hourglass module.
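To make the parameter savings concrete, the following minimal sketch counts convolution weights for a fire module with and without a depthwise separable 3 × 3 branch. It is an illustration only, not the authors' implementation: the channel counts (256 input, 128 squeezed, 256 output), the even split of the expand output between the 1 × 1 and 3 × 3 branches, and all function names are assumptions, and biases and batch-normalization parameters are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a k x k standard convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def fire_module_params(c_in, c_squeeze, c_out, separable=True):
    """Squeeze with a 1 x 1 conv, then expand with parallel 1 x 1 and 3 x 3 branches.

    Assumption: the expand output is split evenly between the two branches.
    """
    half = c_out // 2
    squeeze = standard_conv_params(c_in, c_squeeze, 1)
    expand_1x1 = standard_conv_params(c_squeeze, half, 1)
    if separable:
        expand_3x3 = depthwise_separable_params(c_squeeze, half, 3)
    else:
        expand_3x3 = standard_conv_params(c_squeeze, half, 3)
    return squeeze + expand_1x1 + expand_3x3


# Hypothetical channel sizes, chosen only for illustration.
plain_3x3 = standard_conv_params(256, 256, 3)
fire_standard = fire_module_params(256, 128, 256, separable=False)
fire_separable = fire_module_params(256, 128, 256, separable=True)
print(plain_3x3, fire_standard, fire_separable)
```

Under these assumed sizes, squeezing the channels first and then making the 3 × 3 branch separable each cut the weight count substantially relative to a plain 3 × 3 convolution, which is the effect the fire module relies on.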
We do not use max pooling, but simply use a stride of 2 to reduce the feature resolution. We reduce the feature resolution two times and increase the number of feature channels along the way (384, 512). We also apply a 1 × 1 Conv-BN module to both the input and output of the first hourglass module as intermediate supervision. The features of the stacked lightweight hourglass backbone are then passed through a separate 3 × 3 convolution, ReLU, and another 1 × 1 convolution.

B. LOSS FUNCTION FOR LEARNING
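We denote an input video frame of width $W$ and height $H$ by $I \in \mathbb{R}^{W \times H \times 3}$. We then use the lightweight stacked hourglass network to predict the keypoint heatmap $\hat{Y} \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times C}$, where $R$ is the output stride and $C$ is the number of surgical tool classes. We predict the heatmap at the center point of all instances of the surgical tools. Peaks in the heatmap correspond to object centers, and image features at each peak predict the height and width of the surgical tool bounding box. We train our network following Zhou et al. [28]. Focal loss [36] mainly addresses the severe imbalance between positive and negative samples in single-stage surgical tool detection. It reduces the weight of the large number of easy negative samples during training, which can also be interpreted as a kind of hard example mining. The training objective is a penalty-reduced pixel-wise logistic regression with modified focal loss:

$$L_k = -\frac{1}{N} \sum_{xyc} \begin{cases} (1 - \hat{Y}_{xyc})^{\alpha} \log(\hat{Y}_{xyc}) & \text{if } Y_{xyc} = 1 \\ (1 - Y_{xyc})^{\beta} (\hat{Y}_{xyc})^{\alpha} \log(1 - \hat{Y}_{xyc}) & \text{otherwise} \end{cases} \tag{1}$$

where $\alpha = 2$ and $\beta = 4$ [27] are hyperparameters of the focal loss, $N$ is the number of keypoints in image $I$, and $Y_{xyc}$ is the ground-truth heatmap, obtained by placing a Gaussian kernel $Y_{xyc} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right)$ at each keypoint, where $\sigma_p$ is a surgical tool size-adaptive standard deviation [27]. At the center point $Y_{xyc} = 1$, and $Y_{xyc}$ decreases slowly from 1 to 0 away from the center point. $\hat{Y}_{xyc} = 1$ corresponds to a detected keypoint, while $\hat{Y}_{xyc} = 0$ is the background. The offset is trained with an L1 loss:

$$L_{off} = \frac{1}{N} \sum_{p} \left| \hat{O}_{\tilde{p}} - \left( \frac{p}{R} - \tilde{p} \right) \right| \tag{2}$$

where $p$ is the ground-truth keypoint, $\tilde{p} = \lfloor p / R \rfloor$ is its low-resolution equivalent, and $\hat{O}_{\tilde{p}}$ is the offset predicted at $\tilde{p}$. Let $(x_1^{(k)}, y_1^{(k)}, x_2^{(k)}, y_2^{(k)})$ be the bounding box of object $k$ with category $c_k$. We use an L1 loss at the center point for the size:

$$L_{size} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{S}_{p_k} - s_k \right| \tag{3}$$

where $p_k = \left( \frac{x_1^{(k)} + x_2^{(k)}}{2}, \frac{y_1^{(k)} + y_2^{(k)}}{2} \right)$ is the center point location, $s_k = (x_2^{(k)} - x_1^{(k)}, y_2^{(k)} - y_1^{(k)})$ is the ground-truth size, and $\hat{S}_{p_k}$ is a single size prediction for all tools. The overall training objective is

$$L_{det} = L_k + \lambda_{size} L_{size} + \lambda_{off} L_{off} \tag{4}$$

We set $\lambda_{size} = 0.1$ and $\lambda_{off} = 1$ in our experiments.

From points to bounding boxes: the bounding box of the $i$-th detected center point $(\hat{x}_i, \hat{y}_i)$ is

$$\left( \hat{x}_i + \delta\hat{x}_i - \frac{\hat{w}_i}{2},\; \hat{y}_i + \delta\hat{y}_i - \frac{\hat{h}_i}{2},\; \hat{x}_i + \delta\hat{x}_i + \frac{\hat{w}_i}{2},\; \hat{y}_i + \delta\hat{y}_i + \frac{\hat{h}_i}{2} \right) \tag{5}$$

where $(\delta\hat{x}_i, \delta\hat{y}_i) = \hat{O}_{\hat{x}_i, \hat{y}_i}$ is the offset prediction and $(\hat{w}_i, \hat{h}_i) = \hat{S}_{\hat{x}_i, \hat{y}_i}$ is the size prediction. From the peaks on the feature map, the 100 peaks whose values are greater than or equal to those of their 8-connected neighbors are selected as the central keypoints for preliminary prediction.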
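As an illustration of the heatmap target and the penalty-reduced focal loss described above, the following sketch builds a Gaussian ground-truth heatmap for one tool center and evaluates the loss on a small grid. It is a simplified, single-class, pure-Python version written under stated assumptions, not the training code used in the paper: the function names, the 5 × 5 grid, the fixed `sigma`, and the clamping constant `eps` are all hypothetical.

```python
import math


def gaussian_heatmap(width, height, center, sigma):
    """Ground-truth heatmap for one keypoint: 1 at the center, decaying toward 0."""
    cx, cy = center
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]


def focal_loss(pred, target, alpha=2, beta=4, eps=1e-12):
    """Penalty-reduced pixel-wise focal loss, normalized by the number of keypoints."""
    num_keypoints = 0
    loss = 0.0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1 - eps)  # clamp to keep log() finite (assumption)
            if t == 1.0:
                # Positive location: penalize low confidence at a true center.
                num_keypoints += 1
                loss -= (1 - p) ** alpha * math.log(p)
            else:
                # Negative location: the (1 - t)^beta factor reduces the penalty
                # for pixels close to a true center.
                loss -= (1 - t) ** beta * p ** alpha * math.log(1 - p)
    return loss / max(num_keypoints, 1)


target = gaussian_heatmap(5, 5, center=(2, 2), sigma=1.0)
perfect = [[1.0 if (x, y) == (2, 2) else 0.0 for x in range(5)] for y in range(5)]
uniform = [[0.5] * 5 for _ in range(5)]
print(focal_loss(perfect, target), focal_loss(uniform, target))
```

A prediction that is confident only at the true center yields a loss close to zero, whereas an uninformative prediction of 0.5 everywhere is penalized at both the center and the background pixels.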
Once the central keypoints are selected, the offset of each center keypoint, $\hat{O}_{\hat{x}_i, \hat{y}_i}$, must be predicted, because downscaling to the output feature resolution introduces a deviation. Next, we predict the size of the bounding box, $\hat{S}_{\hat{x}_i, \hat{y}_i}$. Finally, we obtain the coordinates of the bounding box from Equation 5. In summary, the training procedure of our network follows the steps in Algorithm 1.
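The decoding steps above (peak selection, offset correction, and size prediction) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions rather than the authors' implementation: the function names, the `min_score` filter, the toy heatmap, and the final multiplication by the output stride to map boxes back to input-image coordinates are all hypothetical additions.

```python
def find_peaks(heatmap, top_k=100, min_score=0.1):
    """Return (score, x, y) for cells >= all 8-connected neighbors, best first.

    min_score is an assumed filter that discards flat background regions.
    """
    height, width = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            value = heatmap[y][x]
            if value < min_score:
                continue
            neighbors = [heatmap[ny][nx]
                         for ny in range(max(0, y - 1), min(height, y + 2))
                         for nx in range(max(0, x - 1), min(width, x + 2))
                         if (ny, nx) != (y, x)]
            if all(value >= n for n in neighbors):
                peaks.append((value, x, y))
    peaks.sort(reverse=True)
    return peaks[:top_k]


def decode_box(x, y, offset, size, stride=4):
    """Turn a center keypoint into (x1, y1, x2, y2) in input-image pixels."""
    dx, dy = offset            # sub-pixel correction for the downscaling error
    w, h = size                # predicted box width and height (output resolution)
    cx, cy = x + dx, y + dy
    return ((cx - w / 2) * stride, (cy - h / 2) * stride,
            (cx + w / 2) * stride, (cy + h / 2) * stride)


# Toy 4 x 4 heatmap with one clear peak at (x=2, y=1).
heat = [[0.0, 0.0, 0.0, 0.0],
        [0.0, 0.2, 0.9, 0.0],
        [0.0, 0.0, 0.3, 0.0],
        [0.0, 0.0, 0.0, 0.0]]
peaks = find_peaks(heat)
score, px, py = peaks[0]
box = decode_box(px, py, offset=(0.5, 0.25), size=(2.0, 1.0))
print(peaks, box)
```

Because each tool contributes a single local maximum, the weaker responses next to the peak are discarded by the neighbor comparison alone, which is why no NMS step is required.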

A. DATASET
We used the ATLAS Dione dataset [1], consisting of 99 action video clips of ten surgeons from the Roswell Park Cancer Institute (RPCI) (Buffalo, NY) performing six surgical tasks (subject study) on the da Vinci Surgical System (dVSS). The resolution of each frame is 854 × 480, and each frame has surgical tool annotations. Despite being a phantom setting, the ATLAS Dione dataset is challenging, as it has camera movement and zoom, free movement of surgeons, a wide range of expertise levels, background objects with high deformation, and annotations including tools with occlusion, change in pose, and articulation. Figure 4 shows some of the interfering factors in the ATLAS Dione dataset. To train our model, we divided the entire set of video clips into two subsets: 90 video clips (20491 frames) for training and the remaining nine video clips (1976 frames) for testing. To validate the extensibility of our architecture, we evaluated our approach on the MICCAI 15 EndoVis Challenge dataset [21], which includes 1083 frames from ex-vivo video sequences of interventions. The resolution of each frame is 720 × 576, and each frame has surgical tool annotations. For greater accuracy, we relabeled the dataset manually. This dataset was separated into a training set (984 frames) and a test set (109 frames). The ATLAS Dione dataset is more challenging than the EndoVis Challenge dataset because it contains more interfering factors, such as motion blur, fast movement, and background changes.

B. EXPERIMENTAL SETTINGS
We implemented the lightweight hourglass networks on the Ubuntu 18.04 LTS operating system using the PyTorch 1.0 framework based on Python 3.6, CUDA 10.1, and cuDNN 7.4. A Titan Xp GPU was used as an accelerator for training. We fixed the input and output resolutions to 512 × 512 and 128 × 128, respectively. During training, we used random scaling, flipping, cropping, and color jittering as data augmentation. The learning rate was initialized at 3.125e-5 for all layers and decreased by a factor of 10 at epochs 90 and 120. We trained the networks for 140 epochs. To guarantee a fair comparison, we downloaded the code and pre-trained models of each method and measured the run time of every model on the same machine. On the ATLAS Dione dataset, training our method on a Titan GPU takes half the time required by CenterNet.
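The step-wise learning-rate schedule described above can be written out explicitly. The sketch below is illustrative only and assumes that the two decay points (90 and 120) are epoch indices counted from zero; the function name and its parameters are hypothetical and do not come from the paper's code.

```python
def learning_rate(epoch, base_lr=3.125e-5, milestones=(90, 120), gamma=0.1):
    """Step decay: multiply the base rate by gamma at each milestone reached."""
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** drops


# Print the rate at the start of each phase of the 140-epoch schedule.
for epoch in (0, 89, 90, 119, 120, 139):
    print(epoch, learning_rate(epoch))
```

In PyTorch, the same behavior is usually obtained with `torch.optim.lr_scheduler.MultiStepLR` using `milestones=[90, 120]` and `gamma=0.1`; whether the authors used that scheduler is not stated in the text.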

C. RESULTS
Figure 5 shows the surgical tool detection outputs of our method in the video frames, together with those of three other state-of-the-art methods for comparison. Our method is in purple (the probabilities are indicated in the top-left corners of bounding boxes), Faster R-CNN is in green, YOLOv3 is in blue, CenterNet is in white, and the ground truth is in red. To eliminate the need for multiple anchor boxes [37], our surgical tool detector uses a larger output resolution (output stride of 4) compared to many object detectors (output stride of 16) [38], [39]. To demonstrate the effective generalization capability of our backbone, we performed extensive experiments with five backbones: ResNet-18, ResNet-101 [39], DLA-34 [40], Hourglass-104 [29], and ours (lightweight Hourglass). We also modified both ResNets and DLA-34 with deformable convolution layers and used the Hourglass network [28], [41].
For the DLA-34 and ResNet backbones, the learning rate, learning rate schedule, and number of training epochs were set the same as for our backbone in Section III-B. For Hourglass-104, we followed ExtremeNet [42] and used a batch size of 8 and a learning rate of 3.125e-5 for 50 epochs, with the learning rate dropped by a factor of 10 at the 40th epoch. After training for 140 epochs, all backbones had converged. Speed and accuracy tradeoffs for the different backbones on the ATLAS Dione and EndoVis Challenge datasets are displayed in Table 1, from which we can observe the performance of these backbones. We report the mean average precision (mAP) at an intersection over union (IoU) threshold of 0.5. This threshold follows the Pascal VOC dataset [43]: if the IoU of the predicted bounding box and the ground truth is greater than 0.5, we consider the surgical tool to be successfully detected in a frame. IoU is the ratio of the intersection to the union of the predicted bounding box and the ground truth, and is also referred to as the Jaccard index. We set different thresholds (0.5, 0.75, and 0.95), comprehensively compared the experimental results of the different backbones, and found that our backbone offers the best balance of speed and accuracy. We also noticed that the performance gain slows as the ResNet depth increases, and that our lightweight hourglass backbone works better than the Hourglass-104 backbone. The superior performance on both the ATLAS Dione and EndoVis Challenge datasets verifies the extensibility of our approach.
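The IoU criterion described above can be illustrated with a short sketch. The code assumes axis-aligned boxes given as `(x1, y1, x2, y2)` tuples; the function names and example boxes are hypothetical and serve only to show how the 0.5 threshold decides whether a tool counts as detected.

```python
def iou(box_a, box_b):
    """Intersection over union (Jaccard index) of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def is_detected(pred_box, gt_box, threshold=0.5):
    """A tool counts as detected when the IoU exceeds the threshold."""
    return iou(pred_box, gt_box) > threshold


ground_truth = (0, 0, 10, 10)
close_prediction = (2, 0, 12, 10)   # shifted by 2 pixels: IoU = 80 / 120
loose_prediction = (5, 0, 15, 10)   # shifted by 5 pixels: IoU = 50 / 150
print(is_detected(close_prediction, ground_truth),
      is_detected(loose_prediction, ground_truth))
```

Raising the threshold to 0.75 or 0.95 makes the same test stricter, so the first prediction above would no longer count as a detection.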
To demonstrate the value of our tool detection method, we compared it with three state-of-the-art detection methods on the ATLAS Dione and EndoVis Challenge datasets. We selected two anchor-based methods, Faster R-CNN [44] and YOLOv3 (Darknet-53) [45], and one anchor-free method, CenterNet (Hourglass-104) [28]. In Table 2, mAP1 and mAP2 denote the detection mAP on the ATLAS Dione and EndoVis Challenge datasets, respectively. Our method achieved an mAP of 98.5% on the ATLAS Dione dataset and an mAP of 100% on the EndoVis Challenge dataset. We also compared the speed of our method with those of the other three state-of-the-art detection methods on the two datasets, as shown in Table 2. Our method ran in real time at 0.027 seconds per frame (about 37 fps, well above 20 fps), which demonstrates its potential for online surgical tool detection.
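As a quick check of the reported speed, the per-frame latency can be converted to frames per second. The snippet below is a trivial illustration; the function name is hypothetical, and the 20 fps real-time bound is the one used informally in this paper.

```python
def frames_per_second(seconds_per_frame):
    """Convert per-frame latency into throughput."""
    return 1.0 / seconds_per_frame


fps = frames_per_second(0.027)
print(round(fps, 1))
```

A latency of 0.027 seconds per frame corresponds to roughly 37 fps, consistent with the speed quoted in the abstract.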
To reveal the advantages of our method more comprehensively, we also evaluated it with a distance-based criterion. If the distance between the center of the predicted bounding box and the center of the ground-truth bounding box is less than a threshold in image coordinates, then the surgical tool is considered to have been correctly detected. The experimental results are shown in Figures 7 and 8. CenterNet achieved performance competitive with our method on the ATLAS Dione dataset, but at the cost of a 2× lower detection speed. Our method shows the best performance on the EndoVis Challenge dataset.
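The distance-based criterion can be sketched in a few lines. The code below assumes a Euclidean distance between box centers in pixels, which the text does not state explicitly; the function names, the example boxes, and the threshold values are hypothetical.

```python
import math


def box_center(box):
    """Center (cx, cy) of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def center_distance(pred_box, gt_box):
    """Euclidean distance between the two box centers, in pixels (assumption)."""
    (px, py), (gx, gy) = box_center(pred_box), box_center(gt_box)
    return math.hypot(px - gx, py - gy)


def detection_rate(pairs, threshold):
    """Fraction of (prediction, ground truth) pairs closer than the threshold."""
    if not pairs:
        return 0.0
    hits = sum(1 for pred, gt in pairs if center_distance(pred, gt) < threshold)
    return hits / len(pairs)


pairs = [((0, 0, 10, 10), (3, 4, 13, 14)),    # centers 5 pixels apart
         ((0, 0, 10, 10), (0, 0, 10, 10)),    # identical boxes
         ((0, 0, 10, 10), (30, 40, 40, 50))]  # centers 50 pixels apart
for threshold in (5, 10, 60):
    print(threshold, detection_rate(pairs, threshold))
```

Sweeping the threshold in this way yields an accuracy-versus-distance curve of the kind typically plotted for this metric.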

IV. DISCUSSION
Automatically detecting tool locations from videos plays an important role in the development of RAS. Based on Table 2, we can see that our method is more accurate than the other three methods. In particular, experiments on the ATLAS Dione dataset demonstrate the superior performance of our method, which exceeds Faster R-CNN and YOLOv3 by a significant margin. In terms of speed, our approach is faster than Faster R-CNN, competitive with YOLOv3, and two times faster than CenterNet. The considerable improvement over CenterNet (based on Hourglass-104) is largely attributed to the replacement of the residual modules in the hourglass backbone with the more effective fire modules and the use of depthwise separable convolution.
Our method achieved good results, but it has potential limitations. For example, if the center points of two surgical tools exactly overlap, our method can predict only one of them. The lack of large datasets (with tool annotations), the need to improve the speed, and the high training costs are other limitations of our study. Based on these considerations, the following ideas should be investigated. With regard to the lack of datasets, our future work will pay more attention to weakly supervised surgical tool detection.
To increase the speed, we will try to leverage temporal information (using a long short-term memory network to extract it) for the surgical tool detection task. We hope to use temporal information to detect surgical tools with greater speed and accuracy.

V. CONCLUSION
We introduced an anchor-free CNN architecture and a frame-by-frame method using a lightweight stacked hourglass network to predict the heatmap at the center point of a surgical tool for real-time surgical tool detection in robot-assisted surgery. Peaks in the heatmap correspond to tool centers. Image features at each peak predict a tool's bounding box size. Our detector eliminates the need to design a set of anchor boxes, and is end-to-end differentiable, simpler, more accurate, and more efficient than corresponding anchor-box-based detectors. We believe our method is the first to incorporate the anchor-free idea for surgical tool detection in RAS videos. Our method achieves good accuracy and speed and realizes real-time surgical tool detection in RAS videos.