Sensorless Real-time Force Estimation in Microsurgery Robots using a Time Series Convolutional Neural Network

Robotic-assisted microsurgeries provide several benefits to both patients and surgeons. Nevertheless, there are still some limitations and challenges associated with their outcome, one of which is a lack of force feedback. Without force information, the risk of damaging delicate tissue through excessive force applied by the surgeon increases. Since it is difficult to install force sensors on microsurgical tools, a novel approach for estimating a force vector from the deformation of the surgical tool is proposed in this paper. In the proposed approach, a surgical instrument that deforms according to the magnitude of the tool-to-tissue force is designed, and a time series convolutional neural network is used to model the nonlinear relationship between the visual information of the deformation of the surgical tool and the applied forces, so that the tool-to-tissue force can be estimated from the deformation of the surgical instrument in real time. The experimental results show that the applied force can be successfully estimated with high accuracy in three dimensions using the proposed method.


I. INTRODUCTION
Microsurgery is an operation for suturing vessels or nerves under a microscope and is regarded as one of the most technically demanding surgical disciplines because it requires precise motion to manipulate delicate tissue in a small and constrained workspace [1] [2]. Microsurgeries have also been widely used in the areas of ophthalmology, otolaryngology, and neurosurgery. However, the accuracy of the surgeries is limited by the surgeons' capabilities. The robotic-assisted surgery system with a master-slave configuration brings several potential advantages in the field of microsurgery, such as enhancing the surgeons' dexterity and allowing precise manipulation by providing motion scaling and tremor filtering [3]- [5].
Although microsurgical robotic systems have several benefits to both surgeons and patients, there are still some limitations and challenges associated with their outcome, one of which is a lack of force feedback. Force feedback is a major feature that can improve the microsurgical performance since it enables surgeons to control the interaction forces [6], and thus, it helps in the proper execution of surgical procedures. In addition, force feedback can also enable surgeons to feel extremely small forces; as a result, surgeons can avoid exertion of excessive force and reduce the risk of damage to delicate tissues. Without force information, it is difficult for surgeons to feel how much force is applied to delicate tissues, and excessive force might result in irreversible damage [7]- [9]. To address this problem, a method to measure the interaction force between a surgical instrument and a tissue on the slave side and a method to feed the force back to the surgeon on the master side should be appropriately considered in robotic-assisted microsurgery. This paper proposes a method to estimate the interaction force vector on the slave side.
To date, many researchers have studied force feedback for robotic-assisted minimally invasive surgery (widely used in endoscopic surgery such as laparoscopic surgery). Existing force estimation methods in surgical robots can be roughly divided into two categories: direct force sensing and indirect force estimation [10]. In the direct force sensing method, the tool-to-tissue force is measured by integrated sensors that are installed on or close to the surgical tool. For example, several sensorized tools have been developed with force sensors or strain gauges mounted on the shaft or the base of the surgical tool [11] [12]. Other researchers [13] [14] have measured the force using force sensors installed on the tip of the surgical tool. Furthermore, a surgical forceps tool with the capability of force sensing, including pull and grasp forces, has been developed using strain gauges attached to the compliant hinge [15]. The main advantage of this method is that the forces can be measured directly from the tool-to-tissue interaction. Nonetheless, this approach still suffers from many constraints, such as biocompatibility, sterilization, size, and cost [16] [17]. For these reasons, indirect force estimation has been proposed as an alternative solution to estimate the interaction forces. In contrast to the direct force sensing method, the indirect force estimation method involves estimating forces through driven system information, such as the current, torque and displacement, or vision, instead of direct force measuring. For example, force sensing was conducted based on changes in driving cable tension for a 3-DOF manipulator [18]. In [19], a dynamic model-based estimator was proposed to estimate cable pretension in cable-driven robots. However, the accuracy of force detection based on driven system information can be degraded by the nonlinear characteristics of the driven system.
The vision-based approach, which is called vision-based force measurement (VBFM) [20], was proposed to estimate the interaction forces from the observable displacements of the deformable object. Researchers have devoted considerable effort to demonstrating the feasibility and effectiveness of VBFM. In [21], a method based on image processing techniques and a continuum mechanics model was proposed to estimate the force, which can be used in both micro- and macroscale environments. A virtual-template-based approach that estimates the tool-organ force interaction from monocular camera images was presented in [22]. The authors in [23] built a biomechanical model from the organ shape using 3D reconstruction and meshing techniques to estimate the contact forces. These studies showed that the vision-based approach is stable, robust, and effective in estimating the applied forces according to the object deformation, whether in macro or microsystems. Thus, although the accuracy provided by the VBFM method is limited, it is still a potential solution to estimate the force.
In addition, some researchers have used artificial neural networks to improve the accuracy of vision-based force estimation. In [24], two different neuro-fuzzy inference systems were proposed to identify the tool-tissue force and the maximum local stress in laparoscopic surgery. A recurrent neural network (RNN) approach for 3D vision-based force estimation was proposed in [25]. In their work, the force is estimated through the RNN, which uses kinematic variables and deformation mapping as the input. To increase the accuracy of the learning system, a modification method was presented by the same authors in [26]. The solution extracts the geometry of motion of the tissue's surface first, and then, a deep network based on an LSTM-RNN architecture is used to find the accurate mapping between the extracted visual-geometric information and the applied force. This vision-based neural network method mainly relies on tissue deformation observed by monocular or stereo vision systems as the input of the deep learning models.
As described in the aforementioned literature, most existing methods have been proposed for laparoscopic surgical robots. Since microsurgery involves suturing blood vessels or nerves under an optical microscope, the size of the surgical instruments is smaller than that in laparoscopic surgery, which makes installing force sensors on microsurgical tools difficult. Moreover, the view of the region of interest in microsurgery is limited, which makes it difficult to observe the deformation of delicate tissues. Therefore, to date, there have been few studies on force feedback in robotic-assisted microsurgery. A 3-DOF force-sensing micro-forceps for robot-assisted vitreoretinal surgery has been developed based on fiber Bragg grating (FBG) strain sensors [27]- [29]. However, this direct detection method also has some shortcomings, such as temperature sensitivity, sterilization, and high cost.
For the reasons mentioned above, a novel vision-based sensorless force estimation approach for robotic-assisted microsurgery with the capability of multiaxis force sensing is proposed in this paper. In the proposed approach, a surgical instrument that easily deforms according to the magnitude of the applied force is designed, and a time series convolutional neural network (CNN) model is proposed to determine the relationship between the deformation of the surgical tool and the applied forces, so that the tool-to-tissue forces can be estimated from the deformation of the surgical instrument in real time.
Deep neural networks, such as CNNs, have achieved dramatic progress on image recognition tasks, and a number of extensions to process video have also been proposed [30]- [32]. Some of the most popular CNN architectures used today are AlexNet [33], VGG16 [34], ResNet [35] and GoogLeNet [36]. Among them, VGG16 is currently a widely used choice for extracting features from images and has excellent performance. In addition, transfer learning using CNNs has been proven to be an effective method that decreases the training time and can result in lower generalization error [37]. In transfer learning, a neural network model is first trained on a base dataset and task, and the learned features are then repurposed, or transferred, to a second target network to be trained on a target dataset and task [38]. Shin et al. [39] addressed two specific computer-aided detection problems in medical images by fine-tuning CNN models pretrained on a natural image dataset. Esteva et al. [40] demonstrated that a CNN model pretrained on the ImageNet dataset can be used to classify images of skin cancer lesions. Therefore, in the current work, the VGG16 model is used as an example of the pretrained model, and the transfer learning technique is exploited.
The remainder of this paper is organized as follows: Section II details the proposed approaches. Sections III and IV present experimental evaluations to demonstrate the feasibility and effectiveness of the proposed approach. Section V presents the conclusions of this paper.

II. VISION-BASED SENSORLESS FORCE ESTIMATION APPROACH WITH A TIME SERIES CNN MODEL
In the proposed approach, deep neural networks are applied together with time series visual information from a deformable surgical tool to estimate the applied force. The force estimation approach is a part of the robot-assisted microsurgical system based on a master-slave configuration, as shown in Fig. 1. The surgeon watches the motion of the surgical tool in the slave hand on a display and controls the master hand to perform microsurgery. The operation of the master hand can be scaled down with a motion-scaling ratio by the controller and transmitted to the slave hand; then, the slave hand directly acts on the patient with the deformable surgical tool. In this way, the slave hand can replicate the surgeon's motion with greater precision. A high-definition (HD) microscope is used to provide stereoscopic visualization of the surgical area, allowing the surgeon to perform operations with three-dimensional (3D) perception. To provide the surgeon with a sense of touch, the system integrates a force feedback function using the proposed approach, i.e., the interaction forces between the surgical tool and the tissue on the slave hand can be fed back to the surgeon.
In the force estimation approach, visual information on the displacements of the deformable surgical tool is provided to the proposed deep neural networks. In the proposed deep neural networks, a time series CNN model is trained to analyze the visual information and generate an accurate force estimation. Then, the estimated forces are transmitted to the surgeon's hand by the master hand.

A. THE DEFORMABLE SURGICAL TOOL
The proposed method in this paper enables the system to estimate the tool-to-tissue forces based on the deformation of the surgical tool. The first part of our approach is to design the deformable surgical tool. Figure 2 shows a 3D model of the surgical tool that is deformable according to applied forces.
The five parts (V0-V4) in the middle of the surgical tool are marked with different colors, and adjacent parts are designed to be perpendicular to each other so that the deformation is easier to observe. During the operation, interaction forces (Fx, Fy and Fz) are generated between the tip of the surgical tool and the delicate tissue; as a result, the positions of the five parts change according to the deformation. Then, the relationship between the applied forces on the tip of the surgical tool and the deformation of the surgical instrument is established through the time series CNN model, which is detailed in Section II-B. This method makes it possible to estimate the tool-to-tissue forces from the deformation of the surgical tool while in actual use.

B. THE TIME SERIES CNN MODEL
The approach presented in this paper predicts interaction forces from the image information of the deformable tool. Because the surgical images form a long sequence, the neural network model for force estimation should be established as a function that maps an input image sequence to an output force sequence. Moreover, due to the dynamic characteristics of the sequence, the speed and acceleration of the surgical tool deformation must be accounted for. Therefore, a time series CNN model is proposed in this paper, i.e., a sequence of 3 time series frames is used as input to the model, rather than only a single frame. It can be represented by the following equation:
$$\hat{F}_t = f(I_t, I_{t-1}, I_{t-2}; \theta)$$
where $I_t$, $I_{t-1}$ and $I_{t-2} \in \mathbb{R}^{H \times W \times C}$ ($H$ and $W$ represent the height and width of the images in pixels, and $C$ is the number of RGB channels) are the current frame and the previous 2 timestep frames. The nonlinear model $f(\cdot)$ with weight parameters $\theta$ maps the sequence to the estimated force vector $\hat{F}_t = [\hat{F}_x, \hat{F}_y, \hat{F}_z]$ at the current timestep.
The proposed time series CNN model is depicted in Fig. 3(a). First, a sequence of the current and previous 2 timestep frames ($I_t$, $I_{t-1}$ and $I_{t-2}$) is used as the input data of the time series CNN model. Subsequently, each frame of the sequence is passed through a fine-tuned VGG16 network model. Afterward, features of size 4096 are extracted from the penultimate layer (FC7 shown in Fig. 3(b)) of the VGG16 architecture. Finally, the feature vectors extracted from the 3 timestep frames are concatenated into a new feature vector, which is followed by a fully connected layer with linear activation to estimate the force vector ($\hat{F}_t$) at the current timestep.
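The construction of 3-frame input samples described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `make_windows`, the tuple layout, and the use of strings as stand-ins for image arrays are assumptions made only for this example. Each sample pairs the current frame and its 2 predecessors with the force measured at the current timestep.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")
Force = Tuple[float, float, float]  # (Fx, Fy, Fz)


def make_windows(
    frames: Sequence[Frame],
    forces: Sequence[Force],
    window: int = 3,
) -> List[Tuple[Tuple[Frame, ...], Force]]:
    """Pair each run of `window` consecutive frames with the force at the newest frame."""
    if len(frames) != len(forces):
        raise ValueError("frames and forces must have the same length")
    samples = []
    # The first valid sample ends at index window - 1, because it needs
    # window - 1 earlier frames.
    for t in range(window - 1, len(frames)):
        clip = tuple(frames[t - window + 1 : t + 1])  # ordered oldest -> newest
        samples.append((clip, forces[t]))
    return samples


# Toy usage: strings stand in for image arrays (hypothetical data).
frames = ["I0", "I1", "I2", "I3", "I4"]
forces = [(0.0, 0.0, 0.1 * i) for i in range(5)]
samples = make_windows(frames, forces)
```

With a window of 3, a sequence of N frames yields N - 2 samples, since the first two frames lack enough history to form a full input.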
The VGG16 network is fine-tuned on our task to extract feature vectors from image information that can be used in the time series model (3 timesteps). After fine-tuning, the feature vectors are extracted from the penultimate fully connected layer FC7. After the features are concatenated, the nonlinear activation function tanh is used in the FC layers; tanh bounds the layer outputs between -1 and +1. A linear activation is used in the output layer so that the network can estimate interaction forces of arbitrary magnitude.
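The data flow after feature extraction (concatenation, a tanh layer, then a linear output layer) can be illustrated with a small pure-Python forward pass. This is a sketch under stated assumptions, not the trained network: the hidden width `HIDDEN`, the random weights, and the toy feature size of 8 (instead of the 4096 FC7 features) are chosen only to keep the example fast and self-contained.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def dense(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Fully connected layer: one output per weight row, y_j = w_j . x + b_j."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def force_head(features: List[Vector], w_hidden: Matrix, b_hidden: Vector,
               w_out: Matrix, b_out: Vector) -> Vector:
    """Concatenate per-frame features, apply a tanh layer, then a linear output layer."""
    concat = [v for frame_features in features for v in frame_features]
    hidden = [math.tanh(h) for h in dense(concat, w_hidden, b_hidden)]
    return dense(hidden, w_out, b_out)  # linear activation: forces are not squashed


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights, standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes: the paper extracts 4096 features per frame; 8 keeps this fast.
# HIDDEN is an assumption, the paper does not state the hidden layer width here.
FEATURES_PER_FRAME, N_FRAMES, HIDDEN, N_FORCES = 8, 3, 16, 3
rng = random.Random(0)
features = [[rng.random() for _ in range(FEATURES_PER_FRAME)] for _ in range(N_FRAMES)]
w_hidden = random_matrix(HIDDEN, FEATURES_PER_FRAME * N_FRAMES, rng)
w_out = random_matrix(N_FORCES, HIDDEN, rng)
force = force_head(features, w_hidden, [0.0] * HIDDEN, w_out, [0.0] * N_FORCES)
```

The linear output layer matters for regression: a tanh output would cap every predicted force component at magnitude 1, whatever the unit of the labels.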
The loss function plays an important role in machine learning to evaluate how well a learning system can predict the expected results. In this study, the mean square error (MSE) loss function is utilized for the regression task that predicts force vectors. The MSE loss function is given as:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left\lVert F_i - \hat{F}_i \right\rVert^2$$
where $F_i$ represents the actual force vector, $\hat{F}_i$ stands for the predicted force vector, and the index $i$ (1, 2, …, n) represents the samples in the dataset.
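As a worked illustration of the MSE loss for force vectors, the following sketch averages the squared Euclidean distance between actual and predicted vectors over the samples. The function name `mse_loss` and the toy values are assumptions for this example; some frameworks additionally average over the three force components, which only rescales the loss by a constant factor.

```python
from typing import Sequence


def mse_loss(actual: Sequence[Sequence[float]],
             predicted: Sequence[Sequence[float]]) -> float:
    """Mean over samples of the squared distance between force vectors."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for f_true, f_pred in zip(actual, predicted):
        # Squared Euclidean norm of the per-sample error vector.
        total += sum((a - p) ** 2 for a, p in zip(f_true, f_pred))
    return total / len(actual)


# Hypothetical force vectors (Fx, Fy, Fz) for two samples.
actual = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
predicted = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
loss = mse_loss(actual, predicted)  # (1 + 4) / 2 = 2.5
```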

III. EXPERIMENTAL EVALUATION
In this section, experiments to verify the accuracy of the proposed approach are described in detail. In the proposed solution, the established time series CNN model must first be trained using the image information and real forces. Once the model is trained, it can be used to estimate the applied interaction forces. Then, the estimated forces are validated against the real forces to verify the feasibility of the proposed approach.

A. DATASET ACQUISITION
The experimental setup is composed of a camera, a 3D printed model of the surgical tool and a 3-axis force sensor (NITTA PD-3-32-05), as shown in Fig. 4. This setup was designed to acquire a dataset of image information of the proposed deformable surgical tool and the actual applied interaction forces. The dataset samples are used as the training and test sets of the model. The video sequence of the deformation of the surgical tool is obtained with a resolution of 1920 × 1080 pixels at 120 frames per second in RGB color space. At the same time, the applied force on the tip of the surgical tool was measured with the 3-axis force sensor to prepare the training dataset and validation dataset. The surgical tool deforms according to the magnitude of the applied forces, as shown in Fig. 5.
The raw video frames were preprocessed to effectively train and validate the proposed deep learning network. In the preprocessing operations, the video is converted into individual image frames, and then, a resize method is used to extract a region of interest with a size of 256 × 256 pixels from each 1920 × 1080 pixel frame. From this setup, a dataset with 5400 images and the corresponding force information labels was established, and this dataset was used to train and evaluate our proposed time series CNN model.
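One possible reading of the region-of-interest step is a fixed 256 × 256 crop from each 1920 × 1080 frame. The sketch below assumes that reading; the crop center, the clamping behavior, and the helper names `roi_box` and `crop` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def roi_box(center_x: int, center_y: int, size: int = 256,
            width: int = 1920, height: int = 1080) -> Box:
    """Return a size x size box around the center, shifted to stay inside the frame."""
    half = size // 2
    left = min(max(center_x - half, 0), width - size)
    top = min(max(center_y - half, 0), height - size)
    return left, top, left + size, top + size


def crop(image: Sequence[Sequence], box: Box) -> List[list]:
    """Crop a row-major image (list of rows) to the given box."""
    left, top, right, bottom = box
    return [list(row[left:right]) for row in image[top:bottom]]


# Hypothetical ROI centered on the middle of a full-HD frame.
box = roi_box(960, 540)  # (832, 412, 1088, 668)

# Tiny stand-in image whose pixels record their own (row, col) position.
tiny = [[(r, c) for c in range(8)] for r in range(6)]
patch = crop(tiny, (2, 1, 5, 4))
```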

B. TRAINING APPROACH
To fit and evaluate the learning model, the stages of machine learning are divided into two phases: a training phase and a test phase. Here, 67% of the dataset was used for training, and 33% was used for testing, which is a common practice in data science. In the training phase, the 3 timestep CNN learning model is trained using the training dataset to obtain an accurate mapping function between the given image information and the applied forces. In the test phase, the trained model uses the image information in the test dataset to estimate the applied force, and the estimated force is then compared with the actual force in the test dataset to validate the accuracy of the network model. Hyperparameters are the variables that determine how the network is trained, and a proper choice of hyperparameters can enable neural networks to learn faster and achieve higher performance. The hyperparameters used for the training of the proposed time series CNN model, such as the batch size, number of epochs and learning rate, are listed in Table I. The number of epochs in this paper is set to 30 to maintain the generalization capacity of the neural network. During the training stage, the error backpropagation method is used to update the weights of the model. In addition, the root mean square propagation (RMSProp) optimizer [41] is used as the learning rate optimization method. After the training process, the test dataset is used to assess the performance of the proposed model by evaluating the difference between the estimated forces and the real forces.
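The 67%/33% split can be made concrete with a short sketch. The paper does not state whether the samples were shuffled before splitting, so the chronological split below is an assumption, as are the helper names `split_counts` and `chronological_split`.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_counts(n_samples: int, train_fraction: float = 0.67) -> Tuple[int, int]:
    """Return (n_train, n_test) for the given training fraction."""
    n_train = round(n_samples * train_fraction)
    return n_train, n_samples - n_train


def chronological_split(samples: Sequence[T],
                        train_fraction: float = 0.67) -> Tuple[List[T], List[T]]:
    """Keep the earliest samples for training and the remainder for testing."""
    n_train, _ = split_counts(len(samples), train_fraction)
    return list(samples[:n_train]), list(samples[n_train:])


# For the 5400-image dataset: 3618 training and 1782 test samples.
n_train, n_test = split_counts(5400)

# Toy usage with integer stand-ins for (image sequence, force) samples.
train, test = chronological_split(list(range(100)))
```

A chronological split keeps overlapping 3-frame windows from straddling the train/test boundary more than once, whereas a shuffled split would place near-identical neighboring frames in both sets; which of the two the authors used is not specified.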

IV. RESULTS AND DISCUSSION
In this section, the experimental results and discussion are presented. To demonstrate the benefits of the proposed approach, the learning results of the proposed time series CNN network were compared with those of two comparison networks (described later in subsection B).

A. RESULTS OF THE PROPOSED METHOD
The results of the experiment are shown in Fig. 6. The accuracy of the proposed approach is verified by the plots shown in the figure in which the estimated force against the real force is compared in the X, Y, and Z directions. The figure shows that the estimated values in the three directions (X, Y and Z) are very close to the real values.
The $R^2$ score accuracy function is used to evaluate the consistency between the estimated value and the actual value in the proposed model. The accuracy function is described as follows:
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n} (F_i - \hat{F}_i)^2}{\sum_{i=1}^{n} (F_i - \bar{F})^2}$$
where $F_i$ represents the actual force, $\hat{F}_i$ stands for the predicted force, $\bar{F}$ is the mean value of the actual force, $SS_{res}$ is the sum of squares of the residual errors, and $SS_{tot}$ is the total sum of squares. The value of $R^2$ is at most 1 and lies between 0 and 1 for any model that performs at least as well as predicting the mean (it becomes negative otherwise). A higher value indicates a better fit of the model.
The $R^2$ values for the three directions (X, Y and Z) are also shown in Fig. 6: $R^2_x$ is 0.9740, $R^2_y$ is 0.9635 and $R^2_z$ is 0.9259. The experimental results show that the proposed method can successfully estimate the applied force with high accuracy in three dimensions.
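The score used above is the coefficient of determination, computed separately for each force axis. A minimal sketch, with the function name `r2_score` and the toy values chosen only for this example:

```python
from typing import Sequence


def r2_score(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot for one force axis."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the actual values are constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical single-axis forces.
actual = [1.0, 2.0, 3.0]
good = r2_score(actual, [1.0, 2.0, 4.0])      # SS_res = 1, SS_tot = 2 -> 0.5
baseline = r2_score(actual, [2.0, 2.0, 2.0])  # predicting the mean -> 0.0
worse = r2_score(actual, [3.0, 2.0, 1.0])     # worse than the mean -> -3.0
```

The last case shows why the 0 to 1 range holds only for models that do at least as well as predicting the mean force.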

B. COMPARISON RESULTS
To verify the effectiveness of the proposed approach, its results were compared with those of two other models, as shown in Fig. 7. Comparison Model 1 is based on the CNN architecture and uses only a single image as input to predict the force vectors, while comparison Model 2 is composed of CNN and LSTM networks and estimates the force vectors from a sequence of images. The comparison models were trained using the same dataset. Figure 8 shows the comparison graphs between the actual forces and the estimated forces with each method, and Table II shows the results of the comparison in terms of $R^2$ values. Among the three architectures, the proposed time series CNN model showed the best result. Although the $R^2$ value of comparison Model 2 is close to that of the proposed method, the model structure of the proposed method is simpler. Furthermore, using the same training data, the same learning completion condition (as detailed in Section III) and the same computer for training, the training time of the proposed method is approximately 30% less than that of comparison Model 2. This outcome occurs because the tool dynamics are effectively trained with the time series input in the proposed CNN, while LSTMs are trained by looping over the timesteps of the sequence.
The movement of the surgical tool during the operation is dynamic and continuous, and thus, a major challenge in force estimation is to understand the dynamic information of the surgical tool. The time series CNN network proposed in this work considers the dynamic characteristics of real surgical scenarios, using a sequence of 3 timestep images as data input instead of a single frame. A conventional CNN processes only the spatial information in a single image, whereas the time series CNN can handle dynamic information such as velocity and acceleration using 3 time series images.
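Why 3 frames suffice for velocity and acceleration can be seen from backward finite differences: two positions give a first difference, and three give a second difference. The sketch below is an illustration of that arithmetic only; the network learns its own features and is not claimed to compute these quantities explicitly. The scalar marker position and the function name `finite_differences` are assumptions.

```python
from typing import Tuple


def finite_differences(x_t: float, x_t1: float, x_t2: float,
                       dt: float) -> Tuple[float, float]:
    """Backward-difference velocity and acceleration from 3 consecutive positions.

    x_t, x_t1 and x_t2 are the positions at timesteps t, t-1 and t-2.
    """
    velocity = (x_t - x_t1) / dt
    acceleration = (x_t - 2.0 * x_t1 + x_t2) / dt ** 2
    return velocity, acceleration


# Positions sampled from x(t) = t^2 at t = 0, 1, 2 with dt = 1:
# the second difference recovers the constant acceleration of 2.
v, a = finite_differences(4.0, 1.0, 0.0, dt=1.0)  # (3.0, 2.0)

# At the 120 frames per second used in the experiments, dt = 1/120 s.
DT_120FPS = 1.0 / 120.0
v_fast, a_fast = finite_differences(0.30, 0.20, 0.10, dt=DT_120FPS)
```

A single frame carries no such information, and two frames give velocity only, which is consistent with choosing a window of 3.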
By applying this proposed model, a dynamic mapping of a nonlinear system can be effectively constructed, and the output can be calculated from the past and current frames to provide better force estimation performance. Other studies have also demonstrated the effectiveness of neural networks for the learning of uncertain dynamics using dynamic information. Time series input data were also applied with damping neurons in the adaptive neural network controller for force control of robotic manipulators in unknown environments to consider the velocity and acceleration terms and effectively compensate for the unknown dynamics of the environment [42]. In addition, a time-delay feature matrix was used in [43] to provide inputs for neural network and support vector machine-based classifiers, capturing the dynamic characteristics of EEG signals for motion intention prediction. Furthermore, the authors in [44] confirmed that a convolutional architecture with dynamic information as input is more accurate and robust than recurrent networks in longitudinal vehicle dynamics modeling. In summary, the time series CNN model in this study is effective for constructing dynamic mappings of nonlinear systems in practical dynamic systems.
In robotic-assisted microsurgeries, the surgeons perform microsurgery through a master-slave configuration instead of directly contacting the patients. Consequently, in the absence of force feedback, the accuracy of the procedure can suffer and irreversible tissue damage can occur. Therefore, this paper considers a method of estimating the interaction force vector between a surgical instrument and tissue on the slave side.
Because surgical tools in microsurgery are smaller than those in laparoscopic surgery, installing force sensors on them is difficult. For this reason, a novel vision-based sensorless force estimation approach for robotic-assisted microsurgery with the capability of multiaxis force sensing is proposed.

V. CONCLUSIONS
In this paper, a method was proposed that uses a time series CNN model to estimate the applied force from the deformation of a surgical tool designed to deform easily under that force. The approach presented in this work offers a feasible alternative that overcomes the limitations of integrating sensors into surgical tools. The accuracy of the approach is validated by the experimental results. In particular, the comparison with two other models further supports the effectiveness of this method. Future work could extend our approach to estimating the force vectors in real surgical scenarios.