Deep Learning Based Image Processing for Robot Assisted Surgery: A Systematic Literature Survey

The recent advancements in the surging field of Deep Learning (DL) have revolutionized every sphere of life, and the healthcare domain is no exception. The enormous success of DL models, particularly with image data, has led to the development of image-guided Robot Assisted Surgery (RAS) systems. By and large, the number of studies concerning image-driven computer-assisted surgical systems using DL has increased exponentially. Additionally, the contemporary availability of surgical datasets has boosted DL applications in RAS. Inspired by the latest trends and contributions in surgery, this literature survey presents a summarized analysis of recent innovations of DL in image-guided RAS systems. After a thorough review, a total of 184 articles are selected and grouped into four categories, based on the literature and the relevancy of the task in the articles, comprising 1) Surgical Tools, 2) Surgical Processes, 3) Surgical Surveillance, and 4) Surgical Performance. The survey also discusses publicly available surgical datasets and highlights the basics of the DL models. Furthermore, the legal, ethical, and technological challenges, together with intuitive predictions and recommendations related to autonomous RAS systems, are also presented. The study reveals that the Convolutional Neural Network (CNN) is the most widely adopted architecture, whereas JIGSAWS is the most employed dataset in RAS. The study suggests fusing kinematic data with image data, which produces better accuracy and precision, particularly in gesture and trajectory segmentation tasks. Additionally, CNN and Long Short-Term Memory (LSTM) networks have shown remarkable performance; however, the authors recommend employing these gigantic architectures only when simpler models have failed to produce satisfactory results. The simpler models, despite their limitations, are time- and cost-effective and yield considerable outcomes even on smaller datasets.


I. INTRODUCTION
The introduction of Artificial Intelligence (AI) into the healthcare realm has drawn a tremendous amount of attention in recent years [1], [2], [3]. The subsequent rise of Deep Learning (DL) has assisted surgeons in the operating room in several different ways [4], [5]. This successful incorporation has paved the way for Robot Assisted Surgery (RAS) [6].
The associate editor coordinating the review of this manuscript and approving it for publication was Chao Zuo.
Unlike traditional surgery, a RAS system includes a camera arm and a few other mechanical arms with surgical instruments attached. The surgeon controls the arms while seated at a computer console near the operating table. The console gives the surgeon a high-definition, magnified, 3D view of the surgical site. The purpose of RAS, as the name suggests, is not to replace surgeons and physicians but to assist them, in order to achieve higher proficiency in the safety and security of patients undergoing preoperative, intraoperative, and postoperative surgical procedures [7], [8].
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Image-driven DL methods for robotic surgery have already addressed instrument detection and segmentation [9], [10], gesture recognition [11], workflow analysis [12], skill assessment [13], and many more tasks [14], [15], [16], [17] to facilitate semi-autonomous RAS.
Moreover, the development of a fully autonomous image-guided surgical system, where the direct involvement of the surgeon is seldom required, is a foreseeable task for DL models. Surgical procedures go through several complicated scenes and contain artefacts and performance variances [13]. Additionally, the blurred images and videos generated by the camera are often misinterpreted and mislabelled by physicians and AI systems because of the presence of smoke, shadows of tools, plasma stains, and vessels [18], [19], [20], [21].
Before the advent of modern image-modality capturing systems, surgeons mostly relied on simple cameras and the naked eye to study the internal behaviour of the organs. Today, the most relied-upon imaging modalities include X-rays, Computed Tomography (CT), Magnetic Resonance Imaging (MRI), Ultrasound (US), and Positron Emission Tomography (PET) [22], [23].
However, even the modern imaging modalities require intensive preprocessing and feature engineering [24]. Thanks to DL, this laborious, time-consuming, and cost-intensive task is no longer as tedious as before. The basic underlying principle of DL mimics the functionality of the biological neuron: it connects neurons in a complex layered structure, learns by generalization, and keeps the neuron-associated weights updated. One of the most powerful DL models is believed to be the Convolutional Neural Network (CNN). The introduction of the CNN can be traced back to the early 1960s [25], and it has led to the development of several highly efficient diagnostic systems [26].
DL has ultimately proven enormously successful in Minimally Invasive Surgery (MIS) systems. The very first RAS system, i.e., the da Vinci surgical system, introduced in the year 2000, successfully performed around 1,594,000 surgical procedures in 2021 [27], an increase of 28% from the previous year (1,243,000 in 2020), and is expected to perform 12-15% more in the following year.
Furthermore, research into DL-based image-driven RAS systems is expanding, and the availability of recent datasets, i.e., the Johns Hopkins University and Intuitive Surgical Inc. Gesture and Skill Assessment Working Set (JIGSAWS), Medical Image Computing and Computer-Assisted Intervention (MICCAI) challenge datasets, Cholec80, and ATLAS Dione [36], [37], [38], [39], has boosted the interdisciplinary synergies of biomedical engineers and physicians. Several recent survey articles span the medical domain [40], [41], [42] but lack the DL part in the technical aspects. All the reviewed technical surveys consider a specific application of deep learning and image processing in robotic-assisted surgery, such as surgical phase recognition [43], skill assessment [44], registration [45], or tool tracking, segmentation, and detection [46], [47]. For instance, the study by Rivas et al. [48], published in 2021, considered merely one article published after 2020 and mostly emphasized the available surgical datasets and the future of robotics. Another article, by Unberath et al. [49], surveyed 2D/3D image registration in workflow analysis; the study was limited to CNNs, did not incorporate the robotic part, and covered only one particular subdomain of RAS. In the authors' opinion, an updated survey that encircles all the possible applications and aspects is required by the research and medical communities, especially so that new researchers in the field can see the big picture. Another aspect that encouraged the authors to perform a new survey study concerns the exponentially growing number of publications in the field, as depicted in Figure 1, since the existing survey articles are relatively old. The authors in [50] included merely three articles published after 2019, and in the last couple of years, a great number of worthy articles have contributed to the domain; additionally, the main focus of that survey remained limited to tool tracking.
From recent studies, it can clearly be observed that the existing surveys of image- and video-guided DL-based robotic surgery date back quite a few years, and in the meantime, a huge number of studies have been published on the topic. A comprehensive, updated survey that accommodates both the DL and the clinical aspects of image- and video-driven robotic surgery, in light of the recent advancements, is therefore missing.
After a comprehensive analysis and thorough survey, the selected papers are classified into four classes, i.e., Surgical Tools, Surgical Processes, Surgical Surveillance, and Skills/Performance Assessment. Each of these classes is further subdivided, and the details can be found in Sections II and IV of the article. The full-text analysis revealed that the majority of the articles included in the survey were published in 2020 and 2021, as shown in Figure 1. The most frequently used DL method and dataset are the CNN and JIGSAWS, whereas tool segmentation and detection are the most studied subcategories within RAS.
The remainder of the article is organised as follows. Section II describes the search and review methodology adopted in this survey. A formal introduction to the DL methods is provided in Section III. Image-guided interventions are covered in Section IV, followed by the frequently used datasets in Section V. Section VI presents the legal, ethical, and technological challenges towards autonomous RAS. Lastly, Section VII spans the discussion and lists the prominent findings of the survey, followed by the concluding remarks in Section VIII.

II. THE LITERATURE SEARCH AND SURVEY METHODOLOGY
The following section describes the literature survey methodology adopted in this study. First, the literature search and the inclusion and exclusion protocols are described, followed by the article selection process. The objectives and results of the conducted review are then presented. Finally, the survey classification layout is given.

A. LITERATURE SEARCH
A thorough literature search is performed on the Scopus database to select the relevant articles for review and analysis. The search is confined to literature published in English. To retrieve optimal results, combinations of keywords are used interchangeably, with slight modifications, over repeated iterations of the web search. The specific query used for the final search is: ("deep learning*" OR "deep-learning*" OR convolution* OR "deep networks*" OR "neural network*") AND (surg*) AND (robot*). The survey covers articles published (including those accepted and available online) until August 31, 2022.

B. INCLUSION AND EXCLUSION
The inclusion criteria span the image-driven DL models used in any type of robotic surgery. The search query is constrained to the computer science, engineering, biomedical engineering, and medical disciplines. Only published articles are included; books, seminars, doctoral symposiums, and talks are not considered. Any article that goes beyond the aforementioned limits, any study not tackling surgery or a part of surgery, articles addressing only the engineering side of the robot, and articles addressing only the medical side of the surgery (i.e., no DL intervention) fall under the exclusion criteria. Figure 2 illustrates the stages of the inclusion and exclusion process flow, with the number of studies included and excluded at each phase.

C. ARTICLE SELECTION
Initially, the titles of the articles and the venues of the publications (i.e., publishing authority and domain) are used to decide relevancy on a general scale. In the further stages, the abstracts are reviewed, and the contents of each study are skimmed to limit the number of articles to the realm decided for the survey. Finally, the full-text review is performed, and the appropriate articles are selected for further proceedings. Figure 2 illustrates the selection stages of the survey, which is performed under the recommendations of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA).

D. TARGETED OBJECTIVES
The primary objective of the review is to systematically analyse and summarize the recent contributions in the field of image-guided robotic surgery, accounting for the advancements of DL. Generally, the review covers RAS systems and, specifically, image-based RAS systems.
Additionally, the study aims to comprehensively present the DL methods, the future of surgical robotics, and the challenges to achieving autonomous surgery. Finally, the secondary objectives include an introduction to the currently available surgical datasets, the legal and ethical issues, and the limitations.

E. RESULTS
The aforementioned query resulted in a total of 879 articles, and minor changes to the query (an up-to-date search) yielded 75 additional results. After the first check, i.e., title, relevancy, and venue, a total of 482 articles are found appropriate. Another 211 articles are discarded owing to irrelevance to the scope of the survey. At each stage, a considerable number of articles is rejected; at the final stage, 184 articles are deemed eligible within the purview of the study. Hence, 184 of the total 954 articles are included in this survey, as shown in Figures 1 and 2.

F. SURVEY CLASSIFICATION
After thorough analysis, the articles are found to overlap greatly and could be organised in numerous different taxonomies. However, careful inspection resulted in four groups, each of which is further classified into several subgroups, as depicted in Figure 3. This classification includes: a) Surgical Tools, b) Surgical Processes, c) Surgical Surveillance, and d) Surgical Performance/Assessment.
The Surgical Tools section is further subdivided into Tool Detection and Tool Segmentation; the Surgical Processes section includes the Gesture Segmentation, Trajectory Segmentation, and Tissue Segmentation categories. Surgical Surveillance is segregated into Surgical Planning, Phase & State Estimation, and Activity Recognition; finally, Surgical Skill Assessment and Surgical Workflow Recognition come under the Surgical Performance/Assessment group.

III. DEEP LEARNING: A BROADER PICTURE
The section below covers the fundamental concepts of DL in the context of medical image analysis and provides a formal introduction to the methods and techniques that appear in the surveyed articles.
A. LEARNING PARADIGMS
Machine Learning (ML) algorithms are broadly categorized into four major subgroups: supervised, unsupervised, semi-supervised, and reinforcement learning algorithms [51], [52]. Although the literature suggests further subgroups, two groups are most widely cited, namely supervised and unsupervised learning algorithms.
As the name suggests, supervised learning algorithms are fed with complete information under managed supervision, and the model makes decisions based on the inputs: a supervised model is given input data along with the relevant labels and learns by finding the relevant parameters. In unsupervised learning, by contrast, only the input data is provided, without any corresponding additional information; the model by itself finds the patterns in the data, learns the meaningful concepts, and makes the predictions [53].
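As a toy illustration of the two paradigms (with made-up numbers, not drawn from any surveyed system), the sketch below contrasts a supervised nearest-neighbour prediction, which requires labels, with an unsupervised two-means clustering, which discovers the two groups on its own:

```python
def nn_classify(labelled, query):
    """Supervised: predict the label of the closest labelled example."""
    return min(labelled, key=lambda pair: abs(pair[0] - query))[1]

def two_means(values, iters=10):
    """Unsupervised: split 1-D values into two clusters, no labels needed."""
    c1, c2 = min(values), max(values)                 # initial centroids
    for _ in range(iters):
        g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
        g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return sorted([c1, c2])

# Hypothetical 1-D "lesion size" feature, purely for illustration.
labelled = [(1.0, "benign"), (2.0, "benign"), (9.0, "malignant")]
print(nn_classify(labelled, 1.4))       # supervised -> 'benign'
print(two_means([1.0, 2.0, 8.0, 9.0]))  # unsupervised -> [1.5, 8.5]
```

Both routines see the same kind of numeric data; only the supervised one is told what the correct answers are.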

B. NEURAL NETWORKS
Most DL models are based on the neural network architecture, which is inspired by the complex structure of the human brain. A deep neural network is composed of neurons, each with inputs and a bias, which undertake a series of processing steps to understand, extract, and learn the meaningful information from the data.
DL has revolutionized traditional ML by eliminating the manual feature extraction process. Furthermore, the introduction of the back propagation algorithm has enabled researchers to compute the impact each parameter imposes on the objective function [54]. Back propagation has further enriched neural networks and made their computation faster, easier, and better.
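The idea can be shown on a single sigmoid neuron with a squared-error objective: back propagation is simply the chain rule, and the analytic gradient can be checked against a numerical finite difference. This is a minimal sketch with arbitrary values, not code from any surveyed system:

```python
import math

def forward(w, b, x):
    """A single sigmoid neuron: y = sigmoid(w*x + b)."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def grad_w(w, b, x, t):
    """dL/dw for L = (y - t)^2, obtained purely by the chain rule:
    dL/dy = 2(y - t), dy/dz = y(1 - y), dz/dw = x."""
    y = forward(w, b, x)
    return 2.0 * (y - t) * y * (1.0 - y) * x

# Sanity check against a numerical finite difference.
w, b, x, t = 0.5, 0.1, 2.0, 1.0
eps = 1e-6
numeric = ((forward(w + eps, b, x) - t) ** 2
           - (forward(w - eps, b, x) - t) ** 2) / (2 * eps)
print(abs(grad_w(w, b, x, t) - numeric) < 1e-6)   # True
```

In a deep network the same chain-rule bookkeeping is repeated layer by layer, which is exactly what the frameworks described later automate.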
Until the introduction of layer-by-layer training of deep neural networks, training a neural network was widely believed to be quite tedious and ineffective. However, Bengio et al. [55] proposed a mix of unsupervised layer-by-layer pretraining and supervised fine-tuning at two different stages, which showed considerable results. The algorithms trained in this manner include autoencoders and belief networks, which are still considered complex because of the laborious process required to reach substantial results.

C. CONVOLUTIONAL NEURAL NETWORKS
The most widely used DL model is the CNN, which has proven its applicability in image processing applications [9], [26], [50]. Commonly accepted CNN-based models include VGG16-19, ResNet, Inception, Xception, MobileNet, EfficientNet, and many more. The key differences between a Multilayer Perceptron (MLP) and a CNN model are the inclusion of a pooling layer in the CNN, sparsely connected layers instead of fully connected ones, and the small weights associated with the layers, which particularly help in dealing with image data.
The main building blocks of the CNN are the convolutional layer, pooling layer, normalization layer, dense layer, dropout layer, and activation layer, which, along with their nature and responsibilities, are described in Table 1. The optimal number of layers in a network depends upon the nature of the problem the network has to deal with; in general, there is no fixed number of layers, and finding the optimal number for a given problem is a matter of search. To balance the trade-off between computational complexity and performance, different numbers of layers and neurons can be tried over repeated iterations.
Networks with fewer layers and trainable parameters take less time, but at the expense of lower accuracy: such models may not reach their full potential by modeling all the required parameters. On the other side, an overly populated network will provide better accuracy but can also learn unnecessary features, resulting in overfitting; this type of model will perform poorly on unforeseen data. Pretrained networks provide a solution to both problems [56].
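To make the two defining CNN operations concrete, the following sketch hand-rolls a valid 2D convolution and a 2x2 max pooling on a tiny made-up image; no framework is involved, and the kernel is an arbitrary illustrative edge detector:

```python
def conv2d(image, kernel):
    """Valid convolution (strictly, cross-correlation, as in most DL libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keep the strongest response per window."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

image = [[1, 0, 2, 1, 0],
         [0, 1, 3, 0, 1],
         [1, 2, 0, 1, 0],
         [0, 1, 1, 2, 1],
         [1, 0, 2, 0, 1]]
kernel = [[1, -1],
          [1, -1]]              # a crude vertical-edge detector

fmap = conv2d(image, kernel)    # 4x4 feature map
print(max_pool(fmap))           # [[0, 4], [2, 2]]
```

Stacking such convolution and pooling stages, with learned rather than hand-picked kernels, is what gives a CNN its hierarchy of image features.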

D. MULTILAYER PERCEPTRON
The word perceptron comes from the human-like functionality of perception and was originally intended for image recognition tasks. The most fundamental category of artificial neural network is the MLP, which incorporates feed-forward neurons. The MLP comprises several Fully Connected (FC) layers following the input layer. In an MLP, each connected layer contains several neurons, the associated weights, and the bias, fed by the activations of the previous layer.
The term MLP has largely been misunderstood and is therefore used with arguably different definitions. In the literature, numerous authors consider any feed-forward neural network to be an MLP in a general context, whereas some authors stick to the precise definition, i.e., a network composed of multiple layers of perceptrons.
The concept of the MLP arose to tackle the limitation of non-linear mapping of the input to the output. Apart from the input and output layers, an MLP consists of hidden layers, each with several neurons. An activation function is used to control the neurons by imposing a threshold.
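A forward pass through such a network can be sketched in a few lines; the weights below are arbitrary illustrative values, not a trained model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights, biases):
    """One fully connected layer: weighted sum plus bias, then activation."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Two inputs -> two hidden neurons -> one output neuron.
hidden_w = [[0.5, -0.4], [0.3, 0.8]]
hidden_b = [0.1, -0.2]
output_w = [[1.0, -1.0]]
output_b = [0.0]

x = [0.6, 0.9]
hidden = dense_layer(x, hidden_w, hidden_b)
output = dense_layer(hidden, output_w, output_b)
print(0.0 < output[0] < 1.0)   # True: the sigmoid bounds the output
```

The non-linear activation between the layers is what lets the stacked layers represent mappings a single linear layer cannot.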

E. RECURRENT NEURAL NETWORK
Another class of neural network models is the Recurrent Neural Network (RNN). An RNN feeds the input data sequentially. RNN models are a great improvement for time-series and sequential input data problems. The internal state of the RNN, also called the memory of the neuron, preserves information from the previous computations.
The implementation of RNNs in image-driven computer-assisted methods has not been widely adopted in the literature; however, their successful adoption in natural language processing tasks makes them stand out. Another noteworthy property of the RNN is its ability to work on variable-length input data. The use of RNNs in robotic systems driven by images and kinematic data has been increasing over time.
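The memory mechanism can be illustrated with a single recurrent cell: the same weights are reused at every time step, the hidden state carries earlier inputs forward, and sequences of any length are handled naturally. The weights below are arbitrary illustrative values:

```python
import math

def rnn_step(h, x, w_h=0.5, w_x=1.0, b=0.0):
    """One recurrent step: the new state mixes the old state and the input."""
    return math.tanh(w_h * h + w_x * x + b)

def run_sequence(xs):
    h = 0.0                      # initial hidden state ("empty memory")
    for x in xs:                 # the same weights handle any sequence length
        h = rnn_step(h, x)
    return h

short = run_sequence([1.0, 0.0])
long_ = run_sequence([1.0, 0.0, 0.0, 0.0])
print(short != 0.0 and abs(long_) < abs(short))   # True: the early input
                                                  # persists but decays
```

This decay of early information is exactly the vanishing-memory problem that gated variants such as the LSTM were designed to mitigate.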

F. MAJOR BUILDING BLOCKS
The major building blocks of a DL architecture are presented hereunder.

1) LAYER
As the name suggests, the principal operation of a CNN is convolution; however, a number of additional layers, namely dense, dropout, pooling, and a few more, are added to the model to improve performance. The types of layers, their relevant hyperparameters, and their roles are summarized in Table 1.

2) COST FUNCTION
The cost function, or loss function, describes how well a model performs with respect to the ground truth. A number of loss functions have been used in the literature, depending upon the operations to be performed by the model on specific data. Cross-entropy is the most widely used loss function in classification problems [57].
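For a single multi-class prediction, cross-entropy reduces to -sum(t_i * log(p_i)) over the classes. The sketch below, with made-up probabilities, shows how a confident wrong prediction is penalised far more than a confident correct one:

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """-sum(t_i * log(p_i)); eps guards against log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))

one_hot = [0.0, 1.0, 0.0]                       # ground truth: class 1
confident_right = cross_entropy(one_hot, [0.1, 0.8, 0.1])
confident_wrong = cross_entropy(one_hot, [0.8, 0.1, 0.1])
print(confident_right < confident_wrong)        # True: misplaced confidence
                                                # costs far more
```

With a one-hot target, the loss collapses to -log of the probability assigned to the true class, which is why it pairs naturally with a softmax output.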

3) PERFORMANCE MEASURING METRICS
Performance measuring metrics are an important part of model evaluation: they are the measuring scales that quantify the performance of the model.
The most commonly used metrics for the classification task include accuracy, precision, recall, and F1-score. Segmentation and object detection tasks may have additional measuring parameters depending upon the nature and the definition of the problem.
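The four classification metrics follow directly from the confusion-matrix counts; a minimal sketch on made-up binary labels:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from raw confusion-matrix counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up binary labels, e.g. "tool present in frame".
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
acc, prec, rec, f1 = classification_metrics(y_true, y_pred)
print(acc)   # 0.75
```

On imbalanced surgical data, accuracy alone can be misleading, which is why precision, recall, and F1 are usually reported alongside it.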

G. COMMON FRAMEWORKS AND LIBRARIES
There exist numerous DL frameworks that facilitate the design, training, and validation of neural networks through several interfaces. These frameworks and libraries include, but are not limited to, TensorFlow, PyTorch, MATLAB, NVIDIA Caffe, Chainer, Theano, and Keras.
These high-level interfaces help researchers, mathematicians, scientists, and developers implement complex deep neural network architectures to solve various real-world problems. A few of the most commonly used frameworks and libraries are listed below.

1) TensorFlow
TensorFlow is an end-to-end open-source library developed by the Google Brain team. It supports numerical computation and analysis, is extensively used, and works with both CPUs and GPUs. The programming interface of TensorFlow is limited to Python and C++ [58].

2) PyTorch
PyTorch has received a tremendous amount of attention from researchers and developers because of its ability to easily implement complex DL architectures [59]. Additionally, it also supports tensor manipulations, e.g., NumPy-like computations.

3) MATLAB
MATLAB is a well-known mathematical framework that is highly regarded in the scientific community. It offers great visualization tools and is not limited to DL and neural networks [60]. Its high-level features do not require a high level of expertise, and CUDA code is automatically generated by MATLAB from simple code.

4) Keras
Keras is another product by Google engineers and is deemed fruitful for beginners. Four basic principles were considered during the development of Keras: modularity, minimalism, extensibility, and being Python-based [62].

IV. IMAGE GUIDED ROBOT ASSISTED SURGERY
In the previous decade, DL methods brought a tremendous amount of success, especially in image-guided Computer Aided Diagnosis (CAD) systems, as illustrated in Figure 1. This enormous triumph gave birth to the idea of image-driven DL models in the field of robotic surgery. The availability of large amounts of image data, the less complicated operational facilities, and the performance of DL algorithms on image datasets are major steps towards autonomous surgery.
Based on the literature surveyed, the articles are classified into four categories, each of which is subdivided into further groups. These categories are Surgical Tools, Surgical Processes, Surgical Surveillance, and Surgical Performance. As the names suggest, the divisions are fundamental and encircle the most relevant parts of surgical scenarios in computer-assisted autonomous and semi-autonomous surgical systems.
An additional fifth category, named Others, is provided for applications that either do not fall into any of the aforementioned categories or for which few articles were found. The section below describes all categories in detail.

A. SURGICAL TOOLS
Surgical tools are the most important actuators in surgery because they are responsible for performing interventions; however, keeping track of surgical instruments requires real-time knowledge of the pose and movement of the tool. The literature suggests numerous tool localisation techniques, embracing electromagnetic tracking [63], kinematic tracking [64], optical tracking [65], and image-guided detection [66], among others [67]. Unlike other approaches, image-driven surgical instrument localisation offers attractive benefits, including knowledge of pose and motion, and does not require instrument design modification [47].
The section below covers the literature surveyed in this study on image-based surgical tool detection and segmentation, which are the most studied areas in robotic surgery, accounting for around 40% of the total publications encompassed in this study.
The reason the CNN is the most applied architecture lies in its ability to detect and localise multiple tools, at which traditional ML models have not been sufficiently successful [67]. Among the tool detection studies, the articles employing public datasets revealed an accuracy range of 89-100%, whereas the precision and Dice values vary greatly. In-house datasets are incorporated by 18 studies, achieving overall accuracies well above 90%, except for one study [29] that managed to reach around 85%.
The Endoscopic Vision (EndoVis) challenge [37] and m2cai16-tool [39] datasets are the most widely used for the tool detection task, followed by ATLAS Dione. Furthermore, accuracy is the most used performance metric, with precision and Area Under the Curve (AUC) being the other most important evaluation parameters. More details about the year of publication, objectives, data description, and performance outcomes can be found in Table 2.
Likewise, semantic segmentation using unsupervised DL methods is another growing concept [137], [138]. The data provided by the EndoVis robotic instrument segmentation challenge [39] is the most frequently used dataset for segmentation, whereas the Dice score is the most common performance metric. Of the 38 studies in total, merely 4 incorporated in-house datasets and 2 assimilated both in-house and public datasets. A description of the inputs, results, and other relevant information is provided in Table 3.
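For reference, the Dice score (and the closely related IoU) can be computed directly from two binary masks; the sketch below uses tiny made-up masks rather than data from any of the surveyed datasets:

```python
def dice_and_iou(pred, truth):
    """Overlap metrics for two binary masks given as nested 0/1 lists."""
    inter = sum(p & t for row_p, row_t in zip(pred, truth)
                for p, t in zip(row_p, row_t))
    pred_area = sum(sum(row) for row in pred)
    truth_area = sum(sum(row) for row in truth)
    union = pred_area + truth_area - inter
    dice = 2 * inter / (pred_area + truth_area)
    iou = inter / union
    return dice, iou

pred  = [[1, 1, 0],
         [1, 0, 0],
         [0, 0, 0]]
truth = [[1, 1, 0],
         [0, 0, 0],
         [0, 0, 1]]
d, i = dice_and_iou(pred, truth)
print(round(d, 3), round(i, 3))   # Dice is never smaller than IoU
```

Because Dice weighs the intersection twice, it always scores at least as high as IoU on the same masks, which is worth remembering when comparing results across the segmentation tables.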

B. SURGICAL PROCESSES
Surgical process is a nontechnical parent term introduced to cover those sub-tasks of MIS which are not directly related to the incision but lead to understanding and developing the next generation of autonomous medical robotic systems. The surgical processes section is subdivided into three parts, i.e., Gesture Segmentation, Trajectory Segmentation, and Tissue Segmentation, based on the contributions of the authors towards the field.
In addition, numerous studies employ these terminologies interchangeably; however, on in-depth analysis, the suturing task should not be confused with a unique movement of the surgical instrument. The former can be perceived as a series of analogous gestures, whilst the latter can be described as movement in and around a particular region.

1) GESTURE SEGMENTATION
Surgical gesture recognition has been perceived in several different overlapping contexts, i.e., path planning, needle positioning, continuous tip detection, etc., and therefore numerous DL methods have been applied from the respective perspectives. Gesture segmentation is generally implemented for suturing tasks; it is therefore a cumbersome job because of the similarity and repetition of analogous suturing steps.
The Long Short-Term Memory (LSTM) model is adopted by five of the thirteen studies in total, whereas the CNN and RCN are the other most applied models. Moreover, all thirteen articles used accuracy, along with other metrics, to measure the performance of the DL models; nine of the thirteen studies incorporated the JIGSAWS dataset [36] and three employed in-house datasets (two of these used both JIGSAWS and other datasets).
The outcome of these research studies makes it evident that the fusion of different types of data (i.e., video and kinematic data) yields better accuracy than image/video data alone. Further technical details extracted from the gesture segmentation works are listed in Table 4.
2) TRAJECTORY SEGMENTATION
Not only thread detection but also knot tying and path planning are largely associated with trajectory segmentation tasks [151], [152], [154], [157], [158], [161], [162]. It is worth mentioning that instead of using one single architecture, authors used combinations of DL architectures (e.g., CNN and LSTM) to produce better performance results. Table 5 highlights the salient features of the trajectory segmentation task and the employed DL models.
3) TISSUE SEGMENTATION
The aforementioned divisions of the segmentation phase are placed under the same category because of the overlapping interests and main tasks involved in the articles. An individual category can be assumed for a study that encircles only the tissue segmentation task, regardless of the input type. The basic reason for including the smaller categories (even where fewer published papers were found) is the significant contribution they make to the robotic surgery field.
The U-Net architecture is the most applied network among all studies, adopted by eleven of seventeen articles, as evident in Table 6. The U-Net is often applied in conjunction with other networks, including LSTM, GAN [180], and other variants of the CNN [164], [166], [169], [170], [172], [177], [179]. Because of the unavailability of large-scale public datasets, a significant majority of the studies (13 out of 18) incorporated in-house datasets. Given the proprietary datasets, authors have used varying performance metrics, including accuracy, Intersection over Union (IoU), Dice, and AUC. Further insights into the datasets, performance, and DL techniques are provided in Table 6.

C. SURGICAL SURVEILLANCE
The increasing availability of biomedical images facilitates surgical surveillance and navigation during the surgical process. This section surveys the articles that monitor the surgical situation with respect to the patient and the ongoing procedure.

1) SURGICAL PLANNING
Surgical planning is largely considered preoperative planning, where steps are rehearsed in advance in order to pre-visualise the intervention. The application has a large benefit in emergency situations and battlefield areas where reaching a hospital takes time. Vessel detection and needle insertion are prominent and attractive applications, and surgical robots are built to deliver timely aid [183]. Recent trends in surgical planning have shown great interest in 2D, 3D, and 4D model construction for interventional guidance [183], [184], [185], [186], [187], [188], [189], [190], [191]. Due to the unavailability of large amounts of data for preoperative planning, the majority of the articles rely upon in-house data [184], [185], [186], [187], [189], [190], [191], which includes CT and MRI scans [184], [185], [186], [188], [189]. As can be seen in Table 7, the majority of the authors use the Dice performance measure. Along with surgical planning, the articles [184], [190], [191] also deal with depth estimation, motion detection, and path planning, which are also crucial parts of RAS. The articles are listed under the surgical planning section considering their major contribution towards the category.

2) PHASE AND STATE ESTIMATION
Surgical phase and state estimation divides a particular surgical process into several chunks, or phases, and establishes a system that recognises the phase or state of the surgical process from a given set of inputs. The task is largely applied in several domains, including cholecystectomy, endovascular, and esophagectomy procedures [87], [192], [193], [194], [195], [196], [197], [198], [199]; however, it appears to generalize less well to other types of surgical procedures.
Unlike surgical planning, data for phase recognition is readily available, so the majority of articles rely upon public datasets [87], [192], [194], [195], [196], [199] that include kinematic data together with images and videos. CNN and LSTM are the two most frequently applied methods, with accuracy as the top performance metric. Because phase and state estimation amounts to identifying the status of the process at a given time, accuracy is used by almost all studies to evaluate the employed DL models. Table 8 provides further details on the phase and state estimation studies.

3) ACTIVITY RECOGNITION
Automatic surgical activity recognition, both before the procedure and in the operating room during the intervention, has gained considerable attention in the recent past [201], [202]. Surgical activity recognition involves the real-time follow-up of the procedure under consideration. The integration of CNN and LSTM networks has helped surgeons draw reasonable conclusions; however, the unavailability of large datasets has led to the use of pretrained DL models [203], [204], [205], [206], typically trained on the ImageNet dataset [207]. Recently, activity recognition has come to be considered an essential part of surgical planning and state estimation; therefore, as the two preceding subsections show, it falls under the umbrella of surgical procedure planning. Apart from video and image data, kinematic data is integrated to generate better results, with accuracy as the standard performance measure. Table 9 highlights the main concepts, methods, datasets, and other relevant information about the articles addressing surgical activity recognition tasks.

D. SURGICAL PERFORMANCE/ASSESSMENT
The performance and skill assessment of the surgeon during a surgical procedure is one of the most crucial aspects of autonomous robotic surgery because it lays the stepping stone for computer-aided autonomous systems. Evaluating surgeons' skill, together with workflow recognition from surgical video, makes it possible to quantify dexterity and precision. Therefore, building a robotic surgical system requires a prior understanding of skill assessment and procedural flow.

1) SKILL ASSESSMENT
Manual skill assessment and monitoring of skill development for doctors, surgeons, and trainees is a burdensome task that requires a great deal of time and expertise. It usually falls to expert doctors, which is not only arduous but also error-prone. Automatic surgical skill evaluation for RAS is therefore indispensable.
Surgical skill assessment and skill level classification are widely realised using CNN models driven by video and kinematic data [208], [209], [210], [211], [212], [213], [214]. JIGSAWS [36] is the most common dataset for skill evaluation since it provides both video frames and kinematic data (Table 10). All studies included in this section evaluated their models on accuracy and/or AUC. Skill assessment also involves instrument tracking in cases where the surgeon's dexterity is of extreme importance. Other performance measures, dataset descriptions, and DL models are provided in Table 10.
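The video–kinematic combination used in these studies is often a form of late fusion: modality-specific features are extracted separately and concatenated before classification. The NumPy sketch below is our own schematic illustration of that data flow (the "feature extractors" are simple stand-ins, not the pipeline of any cited paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_video_features(frames):
    """Stand-in for a CNN backbone: average the flattened frames over time."""
    return frames.reshape(frames.shape[0], -1).mean(axis=0)

def extract_kinematic_features(kin):
    """Summary statistics over the kinematic time series (e.g. joint positions)."""
    return np.concatenate([kin.mean(axis=0), kin.std(axis=0)])

def fuse(frames, kin):
    """Late fusion: concatenate the modality-specific feature vectors."""
    return np.concatenate([extract_video_features(frames),
                           extract_kinematic_features(kin)])

frames = rng.random((30, 8, 8))   # 30 synthetic frames of an 8x8 "video"
kin = rng.random((30, 6))         # 30 timesteps of 6 kinematic channels
features = fuse(frames, kin)
print(features.shape)             # (64 + 6 + 6,) = (76,)
```

A downstream classifier (e.g. for novice/intermediate/expert labels) would then operate on the fused vector.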

2) WORKFLOW RECOGNITION
Surgical workflow analysis describes the steps involved in surgical interventions. Automating the surgical workflow is of great importance in the modern operating room: autonomous workflow recognition is vital to developing computer-aided autonomous and semi-autonomous surgical frameworks. Such systems can supervise the surgery within the operating room by scheduling tasks and resources and providing seamless assistance to clinicians.
Interpreting recorded videos of surgical procedures requires expertise, focus, and a huge amount of time; technological advances allow the valuable information to be extracted automatically by analysing the videos. Cholecystectomy videos are most frequently used in the literature to study workflow and surgery-type recognition [109], [215], [216], [217], [218], followed by nephrectomy [202], [219], [220], [221]. CNN and LSTM are the common methods used to study workflow in videos [109], [202], [215], [216], [217], [218], [219], [220], [221], [222], [223] because of their high performance. Eight out of ten studies used accuracy as a performance metric, alongside others. Workflow recognition also overlaps with future state prediction and phase recognition [221], [224]. Table 11 contains the other necessary and relevant details about this part of the study.

E. OTHERS
This section describes articles that do not fall under the major categories but whose contribution to image-guided RAS is not negligible. The other reason for forming this distinct but heterogeneous section is the small number of publications found for each subclass. Therefore, this section highlights the objective, contribution, adopted DL methods, and other significant details of each study.
In [225], the authors proposed a dual-neural-network model for organ recognition and for detecting the presence or absence of internal organs in endoscopic image data. The second network verifies the presence of the organ on a series of images on the live screen. Despite the small in-house dataset, an accuracy of 92% was achieved with only 200 randomly selected test images.
Similarly, the segmentation of organs [127], [226], [227] and tissues [170] is also a well-studied task in RAS. Mask R-CNN and CNN-based architectures such as YOLO, U-Net, TernausNet, LinkNet, and SegNet are applied to the well-known EndoVis Challenge and in-house datasets. These articles overlap with the tool detection and segmentation category because the tool is segmented during organ detection.
In another similar study, organ volume segmentation for intraoperative guidance is addressed using U-Net and V-Net architectures [228]. Two publicly available datasets, VISCERAL and SLiver07, are used, and 12.6% and 6.2% IoU are reported for aortic and liver segmentation, respectively.
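The IoU (Jaccard index) reported above relates intersection to union of the two masks, and so penalises disagreement more harshly than Dice. A minimal NumPy sketch with made-up masks (our illustration, not from the cited study):

```python
import numpy as np

def iou(pred, target, eps=1e-7):
    """Intersection over Union (Jaccard index) between two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (inter + eps) / (union + eps)

pred = np.zeros((8, 8)); pred[2:6, 2:6] = 1   # 16-pixel predicted square
gt   = np.zeros((8, 8)); gt[3:7, 3:7] = 1     # shifted 16-pixel ground truth
print(round(iou(pred, gt), 3))  # overlap 3x3 = 9, union 16+16-9 = 23 -> 0.391
```

For the same pair of masks, IoU is always less than or equal to Dice, which is worth keeping in mind when comparing numbers across papers.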
Gaze estimation plays an important role in tele-robotic surgery, yet few studies have incorporated image data to support autonomous surgical systems. The authors of [229] proposed a dense CNN architecture to control a surgical robot via gaze estimation. A local dataset generated from camera images was manually labelled into nine gaze directions, and the direction-based gaze estimation reached 90% accuracy.
Smoke detection is another less common but growing application within the robotic surgery domain. The authors in [230], [231], and [232] use intraoperative images to detect and remove smoke using U-Net, GoogLeNet, and Cycle-GAN. The results obtained are considerable and are detailed in Table 12 for further reading.
An image-based gauze detection and segmentation approach is proposed by Sanchez et al. [233], where pretrained models are tested on a novel dataset of 4003 videos provided by the authors. InceptionV3, MobileNetV2, and ResNet-50 reached accuracies of 77.68%, 75.67%, and 90.16%, respectively.

V. PUBLICLY AVAILABLE DATASETS
The impressive outcomes of DL are underpinned by the large amount of available data, owing to the intrinsic nature of neural networks: their training follows the principle that the more data is available, the better the results. During this survey, a decent number of articles were found to use publicly available datasets, although numerous studies were conducted on proprietary data. Publicly available datasets not only provide a benchmark for DL model development but also support the evaluation and comparison of models.
This section illustrates the datasets employed by the articles surveyed in this study. Around 35 datasets are used by the surveyed articles; for concision, only 10 are described here. Table 13 lists the publicly available datasets, including name, year of publication, modalities, and a short description. According to our study, JIGSAWS, which encompasses video and kinematic data, is the most employed dataset in RAS (Figure 5).

A. JIGSAWS
The well-known gesture and skill assessment dataset JIGSAWS comprises kinematic and video data for suturing, needle-passing, and knot-tying tasks. It was recorded by eight surgeons on the da Vinci surgical system, each performing five repetitions of the suturing, knot-tying, and needle-passing procedures [36]. Figure 5 shows that JIGSAWS is the most extensively applied dataset among the studies included in this survey, although not all authors have utilised the complete dataset. JIGSAWS contains 163 videos together with the corresponding kinematic data; a brief description of the dataset is provided in Table 13.

B. MICCAI DATASETS
The Medical Image Computing and Computer Assisted Intervention (MICCAI) Society was established in July 2004 [37] as a non-profit organisation. It holds annual competitions with the purpose of bringing researchers, clinicians, and engineers together to advance the field of biomedical engineering. MICCAI offers a wide range of datasets each year, each with different objectives, and organises the corresponding challenges. A considerable number of articles take advantage of the datasets provided by these biomedical analysis challenges.
The best-known MICCAI challenge is the EndoVis challenge, organised every year since 2015. Its sub-challenges provide datasets for a wide variety of tasks, including instrument detection, segmentation, boundary detection, workflow analysis, and skill assessment. Further information about MICCAI and its EndoVis challenge can be found in Table 13.

C. Cholec80
Cholec80 is another famous cholecystectomy dataset. It comprises 80 cholecystectomy videos recorded at 25 frames per second (FPS) during interventions performed by 13 surgeons [38]. Cholec80 is widely used for tool presence detection and phase recognition, as evident from Figure 5 and Table 13. In the original videos the tools are often hard to distinguish with the naked eye, which makes detection and segmentation challenging.
The Cholec80 dataset is split into two groups of 40 videos each, for training and testing, respectively. Several different versions of Cholec80 are also available in the literature.

D. ATLAS DIONE
The ATLAS Dione dataset contains video data of ten surgeons of differing expertise levels performing six different surgical tasks on the da Vinci surgical system at Roswell Park Cancer Institute, Buffalo, New York, USA [39]. The recorded procedures range from basic robotic surgery tasks to high-level surgical processes. The motivation behind this data generation was the lack of annotations in JIGSAWS (Table 13).

E. UCL dVRK DATASET
The UCL da Vinci Research Kit (dVRK) dataset comprises 14 videos of 300 frames each for the segmentation task, plus six videos of robot kinematics, also recorded at 300 frames each. The recorded frames are 720 × 576 pixels; they contained camera artefacts, which were later removed by cropping. The dataset was recorded using four consecutive steps, repeated to capture the video and kinematic data. For further information on the setup and acquisition methods, interested readers are referred to [132].

F. M2CAI16 DATASET
The M2CAI16 challenge is a satellite event of MICCAI offering two different datasets, for tool detection and workflow recognition, which are listed hereunder as well as in Table 13.

1) M2CAI16-TOOLS DATASET
This dataset was generated in collaboration between the University Hospital of Strasbourg and the Hospital Klinikum Rechts der Isar in Munich, Germany [39]. It contains 15 laparoscopic cholecystectomy videos annotated with tool presence; ten videos are for training and five for testing.

2) M2CAI16-WORKFLOW DATASET
The M2CAI16-Workflow dataset was also generated at the University Hospital of Strasbourg [234]. Like the previous dataset, it contains laparoscopic cholecystectomy videos, annotated here with eight distinct surgical phases. A total of 41 procedures were recorded, 27 for training and 14 for testing.

3) M2CAI16-LOCATION DATASET
The M2CAI16-Location dataset is an extension of the aforementioned M2CAI16-Tools dataset [235]: it additionally provides spatial (location) annotations for the tools.

G. Laparo425
The Laparo425 dataset contains laparoscopy videos of nine distinct classes, recorded at the University Hospital of Strasbourg. As its name suggests, it contains 425 procedures, recorded at 25 FPS and down-sampled to 1 FPS for the experiments.
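Down-sampling from 25 FPS to 1 FPS, as done here, amounts to keeping every 25th frame. A small sketch of the index selection (our illustration; the clip length is hypothetical):

```python
def downsample_frames(n_frames, src_fps=25, dst_fps=1):
    """Return the indices of frames to keep when reducing src_fps to dst_fps.

    Assumes src_fps is an integer multiple of dst_fps.
    """
    step = src_fps // dst_fps
    return list(range(0, n_frames, step))

# A hypothetical 10-second clip recorded at 25 FPS (250 frames) -> 10 frames.
kept = downsample_frames(250)
print(len(kept), kept[:3])  # 10 [0, 25, 50]
```

Temporal down-sampling like this drastically reduces training cost while preserving the slow phase-level dynamics of a procedure.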

H. CholecT40
The CholecT40 dataset is derived from Cholec80: it takes 40 of the Cholec80 videos and annotates them with action triplets. A total of 128 different action triplets are introduced and recognised from the video data [204].

I. SINUS-SURGERY-C DATASET
This dataset contains endoscopic sinus surgery images annotated for the surgical tool segmentation task. It originates from the BioRobotics Lab at the University of Washington, USA [238]. Segmentation of this dataset is not easy, since the images contain smoke and instrument shadows; moreover, instrument movement produces blurred images, which makes the segmentation task even more challenging.

J. DeepLesion
DeepLesion is a large dataset from the National Institutes of Health of the USA containing 32,000 lesions in CT images from 4,400 unique patients [237]. The dataset offers great diversity thanks to its collection of images of the liver, lungs, lymph nodes, and many other organs. Further details about all datasets are also provided in Table 13.

VI. CHALLENGES AND RISKS
This section describes the potential pitfalls that hinder the development of autonomous robotic surgical systems. The first part illustrates the technical and technological challenges, both present and imminent; the second part discusses the legal and ethical concerns of DL-guided computer-aided interventions.
Thanks to advancements in AI, the healthcare industry has not only improved its diagnostic methods but also moved towards surgical robots. However, the other side of DL in the medical domain remains dark. The technical risks include the design of the robotic components, precision in complex scenarios, and the unpredictable extent of surgical tasks that can be assigned to the robot [239], among numerous others. The success rate of surgical systems is increasing day by day [239]; nevertheless, the high associated costs are not negligible [240]. These costs range from component design through the operating room to maintenance expenditure.
Additionally, surgical systems are designed with limited permissible movement and dexterity, which has both a positive and a negative side. Controlled movement will fail if any unscheduled event arises during the surgery, because the system has little or no capacity to adapt [241]; in such cases, a robotic system can physically harm the patient and damage its own components. The safety of the patient under surgery is at risk not only from the hardware of the robotic system but also from the software [242].
In a semi-autonomous surgical system, a minute human error can endanger the patient's life [243]. Human training and testing capabilities are not as precise as a machine's; however, relying completely on machines with present-day technology could, in general, produce less efficient outcomes [244], [245]. Therefore, all these concerns require detailed study and consideration.
Similarly, all the perks DL offers are appreciated; however, a few issues go beyond technology. Legal and ethical issues concern the privacy of patients' medical data [246], [247], [248]. Another potential threat is cybercrime involving patients' genetic information and the hacking of the robotic system [249]. Prospective cyber attacks that seize control of a surgical system can lead to devastating results, including lethal physical harm to patients, as discussed by Bonaci et al. [250]. Software designers and developers should come forward and certify that delivered products are not vulnerable to attack, and an automatic rescue service should activate in emergencies, e.g. power cuts, jamming, or transmission failures. Unlike autonomous driving, there exists no legal standard to define the degree to which a robotic system may be autonomous [251]. The National Health Service (NHS) of the United Kingdom came under such an attack in 2017, which affected medical devices nationwide [252]. Moreover, transparency is another worrisome issue because of the black-box decision making of DL models; the recent advent of eXplainable AI addresses the transparency and interpretability problems, but the field is still growing and requires further exploration. Finally, all the aforementioned issues are fuelling the dread of physical harm by AI systems that progressively reduce human involvement, and the additional intervention of governments and giant companies spreads fear among patients; this demands a proper code of conduct.
The General Data Protection Regulation (GDPR) of the European Union also mandates concise and transparent information provision and privacy protection for users [253]. This ensures maximal control; however, full enforcement of the law across the board is yet to be seen.
Furthermore, a fully autonomous robotic surgical system should first pass several security checks to answer the concerns of physicians, engineers, patients, and the general public. Who will be responsible in case of harm to the patient or the patient's data: the robot, the engineer who developed it, or someone else? Successfully building autonomous robotic systems will leave nothing to humans but the responsibility, which brings culpability and liability. The authors are in no way against the development of fully autonomous robots; rather, they demand proper regulation before the train is missed.
In conclusion, the cyber-security of robotics, and particularly of surgical robotics, is a large market that researchers need to explore, because a medical robotic system cannot be judged on average results: the best possible outcomes, better than human, are indispensable; otherwise a medical robot loses its entire purpose for surgical applications.

VII. DISCUSSION
Over the past decade, DL models have made a substantial impact in the healthcare domain (e.g. CAD systems). The large-scale availability of surgical datasets and straightforward data acquisition protocols encourage cross-domain research synergies aimed at fully autonomous robotic surgery. Numerous DL models have recently been applied to medical images to capture relevant information and feed it to RAS systems during surgical procedures (Figure 1). This survey analyses and summarises the contribution of DL architectures to image-driven surgical robots. The analysis reveals that researchers' interest in image-guided, DL-based RAS is increasing over time. Within RAS, the majority of articles contribute to tool detection and segmentation tasks. One reason for this is the easy access to large amounts of surgical tool data; another is the working mechanism of DL models. Even so, since DL models learn from patterns, tool segmentation remains arduous because of the smoke, instrument reflections, and vessels present in the surgical data.
During this study, the authors found that CNN is the most commonly employed DL method because of its successful history with image data, as evidenced by Figure 4 and Tables 2 through 11. Not only is CNN the most applied method, it also outperforms other techniques. However, as Table 2 depicts, combining several different models yields improved results. Most articles dealing with action recognition applications (e.g. surgical gesture recognition and segmentation, trajectory segmentation, phase and state estimation) combine traditional static CNN architectures with RNN/LSTM models: the CNN automatically extracts spatial features within images, whereas the RNN is designed to capture temporal information. Not all articles related to action segmentation employ an RNN, though; some authors choose 3D CNNs, which simultaneously capture the spatial and temporal features of the action/motion by extracting information from multiple video frames through 3D convolutions. Comparing the two approaches, it is not easy to determine which is best, and future work might investigate this aspect more deeply. It can surely be stated, however, that the main limitation of 3D CNNs is the increased number of parameters due to 3D convolutions, so a larger amount of data and more training time are needed for such models.
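The CNN-plus-RNN pipeline described above can be sketched as a per-frame feature extractor feeding a recurrent update over time. The NumPy sketch below is our own minimal illustration of that data flow under stated assumptions (random weights stand in for a trained CNN and LSTM; it is not a model from any surveyed article):

```python
import numpy as np

rng = np.random.default_rng(42)

def cnn_features(frame, W):
    """Stand-in for a CNN backbone: one linear map plus ReLU per frame."""
    return np.maximum(W @ frame.ravel(), 0.0)

def rnn_rollout(features, Wh, Wx):
    """Simple tanh recurrence over the per-frame feature sequence."""
    h = np.zeros(Wh.shape[0])
    for x in features:
        h = np.tanh(Wh @ h + Wx @ x)
    return h  # the final hidden state summarises the whole clip

T, H, Wd, F, D = 16, 8, 8, 32, 10        # frames, height, width, feat dim, hidden dim
W  = rng.normal(size=(F, H * Wd)) * 0.1  # "CNN" weights
Wh = rng.normal(size=(D, D)) * 0.1       # recurrent weights
Wx = rng.normal(size=(D, F)) * 0.1       # input-to-hidden weights

clip = rng.random((T, H, Wd))            # a synthetic 16-frame video clip
feats = [cnn_features(f, W) for f in clip]
h = rnn_rollout(feats, Wh, Wx)
print(h.shape)  # (10,)
```

A 3D CNN would instead convolve over the stacked `(T, H, Wd)` volume directly, trading this two-stage structure for a larger parameter count.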
Similarly, the most widely used dataset is JIGSAWS, which contains video and kinematic data (Figure 5 and Table 13). Authors have shown that fusing video data with kinematic data leads to better performance. The annual MICCAI competitions for various imaging tasks also bring new datasets. Although a large number of studies employ in-house datasets, a total of 35 different publicly available datasets are used, either partially or fully.
This reveals that the amount of available data is significant, which has attracted scientists and boosted research. Another benefit of extensive data availability is the possibility of comparison between newly conducted studies.
A considerable number of papers have taken advantage of pretrained networks, as can be seen in Tables 2 through 11. These models are heavily trained on the ImageNet dataset [207]. The biggest reason for using pretrained networks is the lack of sufficient data to train a DL architecture from scratch: in the surgical workflow analysis task, numerous authors adopted small in-house datasets, which led them to use pretrained models. Pretraining also comes with the added benefit of lower training time. The most commonly used pretrained network remains the VGG architecture, followed by ResNet and MobileNet.
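The transfer-learning recipe implied here is to freeze the pretrained backbone and train only a small head on the scarce surgical data. A schematic NumPy sketch under these assumptions (the "backbone" is a fixed random projection standing in for an ImageNet-trained VGG or ResNet, and the data and labels are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)

# "Pretrained" backbone: frozen weights mapping raw inputs to features.
W_backbone = rng.normal(size=(16, 64))

def backbone(x):
    return np.maximum(W_backbone @ x, 0.0)  # frozen, never updated

# Scarce "surgical" data: 200 samples, with synthetic labels that are
# linearly separable in the feature space so the small head can learn them.
X = rng.random((200, 64))
feats = np.array([backbone(x) for x in X])
scores = feats @ rng.normal(size=16)
y = (scores > np.median(scores)).astype(float)

# Train only the head (logistic regression) on the frozen features.
w, b = np.zeros(16), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    w -= 0.1 * (feats.T @ (p - y)) / len(y)
    b -= 0.1 * (p - y).mean()

acc = ((1.0 / (1.0 + np.exp(-(feats @ w + b))) > 0.5) == y).mean()
print(f"head-only training accuracy: {acc:.2f}")
```

Because only the 17 head parameters are trained, the approach remains feasible on the small in-house datasets mentioned above.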
With the continued development of DL models and their massive success on image modalities, research in the field of image-guided robotic surgery is inevitably growing. First, image acquisition methods are smooth, cause minimal (tolerable) harm, and do not incur colossal cost. Second, image data contains a wealth of information about the patient and helps clinicians understand the patient's internal state without incision or physical injection.
Moreover, analysis and preprocessing of image modalities is relatively easy, and DL methods work better with image data. Therefore, the future of MIS, RAS, and computer-aided interventions relies mainly on image and video data, and of course on DL.
Finally, based on this exhaustive analysis, the authors point out two major concerns, one for medical experts and one for DL engineers. First, the recorded images and videos contain smoke, instrument reflections, and surgeon movements (e.g. hands and their reflections); DL algorithms therefore learn from these unnecessary features, which may bias the results. Second, DL methods not only require a huge amount of training time, which needs to be optimised, but are also black boxes in their decision making, which concerns clinicians and patients.

VIII. CONCLUSION
In this study, technical articles concerning image-driven computer-assisted interventions incorporating DL models were selected for survey from the Scopus database. The intensive text assessment indicated that the 184 selected articles can be grouped into four categories: 1) Surgical Tools, 2) Surgical Processes, 3) Surgical Surveillance, and 4) Surgical Performance/Assessment. The key findings include: a) Surgical Tools is the most studied topic, comprising Surgical Tool Detection and Surgical Tool Segmentation (45% of all articles); b) CNN is the most widely applied DL topology (roughly 54% of all articles); c) the gesture recognition articles favour the JIGSAWS dataset (around 77% of articles in the relevant subcategory), whereas MICCAI datasets are the top choice for detection and segmentation tasks (around 60% of articles in the relevant subcategory); d) VGG remains the most widely adopted pretrained network, especially when the available dataset is not large enough; e) the most studied applications appear to be cholecystectomy and prostatectomy; f) for gesture and trajectory applications, suturing is the most frequently studied task; g) the fusion of kinematic data with image data yields better results. Considering the characteristics of the proposed survey, the authors believe that its main limitation is the lack of deep detail on the pre-processing and processing approaches proposed in the reviewed papers for each application category. However, this was not the original scope of the review, since providing a general overview of the topics under discussion already required a gigantic effort. As for future directions, the development of fully autonomous RAS systems appears to be a highly promising and fascinating research topic. Additionally, self-supervised learning based models can greatly improve the environment in the operating room.
Finally, a steady walk from weak AI to strong AI, and eventually to super AI, can lead to notable breakthroughs.
SARDAR MEHBOOB HUSSAIN received the bachelor's degree from The University of AJK, in 2015, and the master's degree from COMSATS University Islamabad, Pakistan, in 2017, all in computer science. He is currently pursuing the Ph.D. degree with the Electrical and Information Engineering Department, Polytechnic University of Bari, Italy. He has authored and coauthored over ten international publications. His research interests include deep learning, explainable AI, biomedical engineering, and robot-assisted surgery.
ANTONIO BRUNETTI received the bachelor's degree in computer science and automation engineering and the master's degree (cum laude) in computer science engineering from the Polytechnic University of Bari, and the Ph.D. degree in electrical and information engineering from the Doctoral School of the Polytechnic University of Bari, in February 2020, under the supervision of Prof. Vitoantonio Bevilacqua. He is an Assistant Professor with the Department of Electrical and Information Engineering, Polytechnic University of Bari. His final dissertation in human-computer interaction concerned the design and implementation of an intelligent system able to adapt the living environment of subjects with mild cognitive impairment, based on cognitive analyses of event-related potentials measured on the EEG signal stimulated via an innovative protocol based on virtual reality. From December 2015 to October 2016, he was a Research Fellow at the Polytechnic University of Bari on the GCESYS (Green Community Efficiency Systems) Project, designing and developing optimization algorithms for energy consumption scheduling in smart buildings with systems fuelled by mixed traditional and renewable sources. His Ph.D. thesis dealt with the design and implementation of intelligent frameworks, based on image processing, machine learning, and deep learning algorithms, for supporting clinical diagnosis in the precision medicine era. From April 2020 to December 2021, he was a Postdoctoral Research Fellow working on research activities in the fields of electronic and informatics bioengineering and bioinformatics. In March 2019, he founded Apulian Bioengineering s.r.l., a spin-off of the Polytechnic University of Bari, as a shareholder.

GIUSEPPE LUCARELLI is currently an Associate Professor of urology with the Department of Emergency and Organ Transplantation, University of Bari, Italy. He is a clinician-scientist, a urologist, and a transplant surgeon. His primary research interests include urologic oncology and kidney transplantation. In particular, his research investigates the biology of prostate and renal cancer using high-throughput omics technologies to obtain a detailed molecular characterisation of cancer and to identify novel biomarkers that allow a more accurate prognostic stratification.
RICCARDO MEMEO is the Chief of the Hepatobiliary and Pancreatic Unit at Miulli Hospital, Bari, Italy. He attended medical school at the Università degli Studi di Bari and spent the last two years of his residency at Paul Brousse Hospital, Paris, and Henri Mondor Hospital, deepening his knowledge of hepatobiliary and pancreatic surgery and liver transplantation. He then spent another five years in France, improving his expertise in minimally invasive and robotic surgery in Strasbourg. His clinical interests include oncologic surgery of the liver and pancreas, with a focus on minimally invasive access and enhanced recovery after surgery.
VITOANTONIO BEVILACQUA received the Laurea degree in electronic engineering, the Ph.D. degree in electrical engineering, and the postdoctoral degree in industrial informatics from the Polytechnic University of Bari. He is a Full Professor of electronic and information bioengineering with the Electrical and Information Engineering Department, Polytechnic University of Bari. In March 2019, he founded Apulian Bioengineering s.r.l., a start-up and spin-off company of the Polytechnic University of Bari, where he is currently the Chief Executive Officer. In September 2019, he joined the BioRobotics Institute, Scuola Superiore Sant'Anna di Pisa, as an Affiliate Professor. Currently, he is the Head of Industrial