Weakly Supervised Temporal Convolutional Networks for Fine-grained Surgical Activity Recognition

Automatic recognition of fine-grained surgical activities, called steps, is a challenging but crucial task for intelligent intra-operative computer assistance. The development of current vision-based activity recognition methods relies heavily on a high volume of manually annotated data. This data is difficult and time-consuming to generate and requires domain-specific knowledge. In this work, we propose to use coarser and easier-to-annotate activity labels, namely phases, as weak supervision to learn step recognition with fewer step annotated videos. We introduce a step-phase dependency loss to exploit the weak supervision signal. We then employ a Single-Stage Temporal Convolutional Network (SS-TCN) with a ResNet-50 backbone, trained in an end-to-end fashion from weakly annotated videos, for temporal activity segmentation and recognition. We extensively evaluate and show the effectiveness of the proposed method on a large video dataset consisting of 40 laparoscopic gastric bypass procedures and the public benchmark CATARACTS containing 50 cataract surgeries.

and robot-assisted surgeries (RAS) for the demanding situations of a modern Operating Room (OR) [1]-[3] has seen significant progress in the last decade. One of the primary functions of these advanced systems is automatic surgical workflow analysis, i.e., reliable recognition of the current surgical activities. Surgical activity recognition could play a key role in assisting clinical decisions, report generation, and data annotation by providing valuable semantic information.
Depending on the level of granularity, a surgical procedure can be decomposed into activities, such as the whole procedure, phases, stages, steps, and actions [4], [5]. Surgical phases are defined as a set of fundamental surgical aims to accomplish in order to successfully complete the surgical procedure. Similarly, steps are defined as a set of surgical actions to perform in order to accomplish a surgical phase. These definitions help clinicians define an ontology for each procedure, e.g. [6], [7] define ontologies for cataract and gastric bypass procedures. Although the ontologies are well defined, automatically recognizing these activities from available endoscopic videos is a topic of high interest.
Phase recognition has received a lot of attention and is a very active area of research in the medical computer vision community [8]-[12]. Alongside phases, there has been substantial research focusing on fine-grained activities such as robotic gestures [13]-[19], action triplets [20], and instrument detection and tracking [11], [21], [22]. Recently, there has been a surge of research works focusing particularly on step recognition [6], [7], [23].
While steps define a surgical workflow at a more fine-grained level than phases, annotating a dataset with steps takes significantly more time than annotating it with phases. For example, in Laparoscopic Roux-en-Y gastric bypass (LRYGB) procedures, the workflow consists of 44 steps and 11 phases (Table II). Precisely defining and annotating all the steps demands considerably more expert time because of the number of steps and, more importantly, the lower inter-class variance between steps. Since recent works in surgical phase/step recognition employ deep learning models, they rely on the availability of large-scale annotated datasets. Curating these annotated datasets is difficult and time-consuming because the tasks require domain-specific medical knowledge.
To address this issue, a few works [24]-[27] have proposed methods based on semi-supervision. These approaches involve either pre-training the model on proxy tasks or training on synthetic labels generated by a teacher model trained on a small subset for phase recognition. Unlike these works, inspired by [22] and [28], we address the annotation scarcity issue by proposing a weakly supervised learning approach utilizing relatively economical annotations.
The main contributions of our work are summarized as follows:
1) We propose a weakly supervised learning method for surgical workflow analysis to tackle the problem of fine-grained surgical activity (step) recognition. We exploit the hierarchical step-phase relationships and utilize easier-to-annotate weak phase annotations on videos with missing step annotations.
2) We introduce a novel dependency loss to enforce the weak supervision and encode the step-phase hierarchical relationship as a matrix. Optimizing this loss encourages the model to learn possible step sequences and transitions from videos with only phase annotations.
3) We present an end-to-end model consisting of ResNet-50 and a Single-Stage Temporal Convolutional Network (SS-TCN) to learn both visual and temporal cues jointly.
4) We extend the CATARACTS dataset (containing step annotations) with phase annotations. These annotations will be released upon acceptance of this manuscript.
5) We extensively evaluate our approach on two surgical video datasets, namely Bypass40 [7] and CATARACTS [29], demonstrating the effectiveness and generalizability of our method.

II. RELATED WORK

A. Surgical Activity Recognition
Research on developing deep learning methods for surgical phase recognition has seen significant progress with initial works of EndoNet [8] and DeepPhase [9] on cholecystectomy and cataract surgeries, respectively. EndoNet proposed a Convolutional Neural Network (CNN) followed by a hierarchical Hidden Markov Model (HMM) to perform both phase and tool detection. Similarly, DeepPhase introduced an architecture with ResNet [30] and Recurrent Neural Network (RNN), instead of HMMs, for temporal modeling, for both phase recognition and tool detection. EndoLSTM [31], [32] extended EndoNet by utilizing a Long Short-Term Memory (LSTM) for temporal refinement of spatial features. Similarly, SV-RCNet [10] trained a ResNet and LSTM model end-to-end and proposed a prior knowledge inference scheme for surgical phase recognition. MTRCNet-CL [11] presented a multi-task model to detect tool presence and perform phase recognition along with a novel correlation loss to capture the relationship between tool presence and phase identification. Recently, TeCNO [12] adapted the multi-stage Temporal Convolutional Network (MS-TCN) [33] architecture for online surgical phase prediction by implementing causal convolutions [34].
Step recognition, on the other hand, has attracted growing research interest since the initial work of [23]. A Content-Based Video Retrieval (CBVR) system for real-time step recognition was proposed, utilizing a novel pupil center and scale tracking method to pre-process motion features. In [6], the CBVR system along with surgical tool presence information was used as input to statistical models consisting of a Bayesian Network and HMMs for multi-level online recognition of steps and phases. Recently, MTMS-TCN [7] adapted TeCNO, utilizing TCNs for multi-level online recognition of steps and phases. In this work, we build upon the architectures of TeCNO and MTMS-TCN by utilizing a variant of MS-TCN in an end-to-end fashion for online step recognition.

B. Weak Supervision
Weak supervision has attracted great interest in the medical computer vision community as a way to tackle the need for high-volume annotated datasets that are difficult to generate. Some of the interesting applications of weak supervision are seen in surgical tool localization [22], tool segmentation [28], cancerous tissue segmentation [35], and detection of the region of interest in chest X-rays and mammograms [36]. To reduce the number of labeled videos, most of the recent research works in phase recognition have proposed approaches based on semi-supervised learning. These approaches follow a similar strategy of pre-training the models on different proxy tasks: frame-sorting [24], predicting the temporal distance between multiple frames [25], and predicting the remaining surgery duration [26]. The most closely related work to this paper in terms of objectives is [27], which proposed a teacher/student approach for phase recognition in scenarios of extreme manual annotation scarcity (≤ 25% of the training set). The teacher model (trained on a small set) generated synthetic phase annotations for a large number of videos on which the student model was then trained.
Weakly supervised coarse-to-fine methods have received considerable interest in the computer vision community [37]-[39] for image classification. [37] proposed an image-based weakly supervised end-to-end model for object classification consisting of a CNN followed by two self-expressive layers. One self-expressive layer captures the global structures through coarse labels and the other captures the local structures for fine-grained classification. [38] tackled the problem of learning finer representations from coarser labels without any fine-grained labels. Their proposed method consists of a CNN-based trunk-target network that learns coarse representations from labels and finer representations with a nearest-neighbor classifier objective. Recently, [39] tackled the problem of Coarse-to-Fine Few-Shot (C2FS) learning and proposed a novel 'angular normalization' module that effectively combines supervised and self-supervised contrastive pre-training for C2FS.
Although these previous works in the vision community propose weakly supervised learning methods exploiting hierarchical structures, the focus solely lies on object recognition in natural images containing a single object in each image. In this work, we focus on weakly supervised learning from videos instead of images. We aim to recognize fine-grained activity, as opposed to object, exploiting the temporal information available in videos. In particular, we target fine-grained surgical activity recognition on videos from endoscopic procedures on two different types of surgeries, i.e., gastric bypass and cataract.

III. METHODOLOGY
The overview of our proposed method is presented in Fig. 2. In this section, we first present our end-to-end Spatio-temporal (ResNet-50 + SS-TCN) model for the task of fine-grained activity, i.e., step, recognition. Then we introduce the step-phase dependency loss for weak supervision of step recognition using phase annotations.

A. Spatio-temporal Model
Our weakly supervised step recognition network consists of a ResNet-50 model for visual feature extraction followed by an SS-TCN for modeling the recognition problem temporally. The complete model is trained in an end-to-end fashion. The overview of the model setup is depicted in Fig. 2.
For phase segmentation, ResNet-50 [40] has been successfully employed as the backbone in many previous works [10]-[12], [27]. In this work, we utilize the same architecture for visual feature extraction. We use a single-stage TCN (SS-TCN), a single-stage variant of MS-TCN, to learn the spatial coherence across video frames. The choice of SS-TCN was motivated by the work of [7], where MS-TCN did not provide a significant improvement over SS-TCN for either step or phase recognition. Following the design of MS-TCN, the SS-TCN contains neither pooling layers nor fully connected layers and is constructed with only temporal convolutional layers, specifically dilated residual layers performing dilated convolutions. With the aim of online activity segmentation, we perform at each layer causal convolutions [7], [12], [34] that depend only on the current frame and n previous frames.
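The causal, dilated convolutions described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the scalar per-frame input, the kernel values, and the helper names `causal_dilated_conv1d` and `receptive_field` are assumptions made for illustration, as is the doubling of the dilation factor at every layer (the schedule used by MS-TCN).

```python
def causal_dilated_conv1d(x, kernel, dilation):
    """Causal dilated 1D convolution over a list of per-frame scalars.

    The output at time t combines x[t], x[t - dilation], x[t - 2 * dilation], ...
    Frames before the start of the video are treated as zeros, so no future
    frame ever contributes to the output at time t.
    """
    taps = len(kernel)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k in range(taps):
            idx = t - (taps - 1 - k) * dilation  # the last tap sits on frame t
            if idx >= 0:
                acc += kernel[k] * x[idx]
        out.append(acc)
    return out


def receptive_field(num_layers, kernel_size=3):
    """Frames (current + past) seen by a stack of layers whose dilation
    doubles per layer: 1, 2, 4, ... (assumed schedule)."""
    return 1 + (kernel_size - 1) * sum(2 ** layer for layer in range(num_layers))


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
print(causal_dilated_conv1d(signal, kernel=[1.0, 1.0, 1.0], dilation=2))
print(receptive_field(num_layers=3))
```

Because every tap looks backwards in time, changing a later frame never alters an earlier output, which is the property that makes online (frame-by-frame) recognition possible.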
The complete model takes an input video consisting of $T$ frames $x_{1:T}$. The ResNet-50 maps 224 × 224 × 3 RGB images to a feature space of size $N_f = 2048$. These frame-wise features are collected over time and are inputs to the TCN model that predicts $\hat{y}^s_{1:T}$, where $\hat{y}^s_t$ is the class label for the current timestamp $t$, $t \in [1, T]$. Since step recognition is a multi-class classification problem that exhibits an imbalance in the class distribution, softmax activation and class-weighted cross-entropy loss are utilized. Additionally, the dependency loss used when step labels are not available also relies on softmax activation and weighted cross-entropy loss, utilizing phase labels instead. The class weights for both steps and phases are calculated using median frequency balancing [41] on the training set. The total loss is given by:

$$L_{total} = \delta_{step} \cdot L_{step} + (1 - \delta_{step}) \cdot L_{dep},$$

where $L_{step}$ represents the weighted cross-entropy loss for steps, $L_{dep}$ is the step-phase dependency loss (subsection III-B), and $\delta_{step}$ is a binary variable that indicates if the video contains step labels.
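As a rough illustration of the class weighting and loss selection described above, the following sketch computes class weights by a simplified form of median frequency balancing (median class frequency divided by each class frequency, counted over frames) and switches between the two losses depending on whether a video has step labels. The function names, the frame-level frequency definition, and the exact form of the switch are assumptions for illustration, not the authors' code.

```python
import math
from collections import Counter
from statistics import median


def median_frequency_weights(labels, num_classes):
    """Weight of class c = median(class frequencies) / frequency of c.

    Frequencies are counted over all training frames (simplifying assumption).
    Classes that never occur receive weight 0.
    """
    counts = Counter(labels)
    total = len(labels)
    freqs = {c: counts[c] / total for c in range(num_classes) if counts[c] > 0}
    med = median(freqs.values())
    return [med / freqs[c] if c in freqs else 0.0 for c in range(num_classes)]


def weighted_cross_entropy(probs, target, weights):
    """Class-weighted cross-entropy for one frame, given softmax outputs."""
    return -weights[target] * math.log(probs[target])


def total_loss(step_loss, dep_loss, has_step_labels):
    """Use the step loss on step-annotated videos, else the dependency loss."""
    delta_step = 1.0 if has_step_labels else 0.0
    return delta_step * step_loss + (1.0 - delta_step) * dep_loss


frame_labels = [0, 0, 0, 0, 1, 1, 2]  # toy, imbalanced step labels
weights = median_frequency_weights(frame_labels, num_classes=3)
print(weights)  # rare classes receive larger weights
print(weighted_cross_entropy([0.2, 0.3, 0.5], target=2, weights=weights))
```

Rare classes end up with weights above 1 and frequent classes below 1, so errors on under-represented steps contribute more to the loss.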

B. Weak Supervision: Step-Phase Dependency Loss
Steps and phases are two types of activities describing the surgical workflow that are defined at different levels of granularity and possess an inherent hierarchical relationship [4], [7].
Steps are defined at a higher level of detail compared to phases. This brings about lower inter-class variances between steps, compared to phases, making it a more complex task to clearly define and distinguish between them. The challenges can be seen in the sample images presented in Fig. 1. For instance, in the Bypass40 dataset, similar actions are performed across different steps belonging to different phases. Dissection is performed in at least 7 steps spread across 3 different phases. Similarly, Stapling is performed in 5 steps across 4 different phases. Designing and training a deep learning model to distinguish between these similar steps poses a great challenge. Even the state-of-the-art method, MTMS-TCN [7], trained on a fully annotated dataset achieves an accuracy of ∼76% with a precision of ∼56%, accentuating the difficulty of the problem. The class imbalance further creates a challenge for training deep learning models that require large datasets with plenty of samples for each class.
In the scenario presented in this paper where the number of annotations is scarce, the recognition difficulties increase drastically. To overcome some of the challenges, this work proposes a weakly supervised approach that utilizes labels of less granular activities, i.e., phases. Phase information alone could help the model in two ways. Firstly, phase information could help the model reduce errors related to recognizing similar looking steps, e.g., 'S6: horizontal stapling' and 'S18: gastrojejunal stapling', belonging to two different phases. Secondly, we can gather a smaller subset of probable steps that could occur in a given phase eliminating the rest. For example, given the phase to be 'Phacoemulsification' of cataract surgery, only 5 out of 19 steps are likely to occur (Table I). Similarly, a phase such as 'P5: anastomosis test' in the Bypass40 dataset, reduces the possible steps to 7 out of 44 (Table II). Here, the phase information provides cues to the model to learn to distinguish between steps belonging to the subset rather than the whole set. Thus we hypothesize that the additional available weak phase information could be very beneficial for step recognition in the low data regime.
We propose to represent the relationship as a step-phase mapping matrix $M_{s \to p}$, where the elements $m_{ij}$ of the matrix are binary indicator variables which are 1 if step $s_i$ occurs in phase $p_j$ and 0 otherwise. The matrix encodes the weak information about which steps can occur in a particular phase and does not provide details of their occurrence, duration, and/or order. To enforce this weak link between steps and phases, the step predictions $\hat{y}^s_t$ of our Spatio-temporal model (as described earlier) are linearly transformed by $M_{s \to p}$ into the phase space. Then a weighted cross-entropy loss ($L_{CE}$) captures the similarity between the phase labels ($y^p_t$) and the transformed predictions ($M_{s \to p} \times \hat{y}^s_t$) of the model. The dependency loss ($L_{dep}$) is given by:

$$L_{dep} = L_{CE}\left(y^p_t,\; M_{s \to p} \times \hat{y}^s_t\right).$$
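The idea behind the dependency loss can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the toy mapping matrix, the helper names, and the choice to take the logarithm of the mapped probability mass directly (clamped to a valid range) are all hypothetical.

```python
import math


def steps_to_phase_scores(step_probs, mapping):
    """Project step probabilities into phase space.

    mapping[i][j] is 1 if step i can occur in phase j, else 0, so the score of
    phase j is the total probability of the steps allowed in that phase.
    """
    num_phases = len(mapping[0])
    return [
        sum(mapping[i][j] * step_probs[i] for i in range(len(step_probs)))
        for j in range(num_phases)
    ]


def dependency_loss(step_probs, phase_label, mapping, phase_weights, eps=1e-8):
    """Weighted cross-entropy between the phase label and the mapped step scores."""
    score = steps_to_phase_scores(step_probs, mapping)[phase_label]
    score = min(max(score, eps), 1.0)  # guard against log(0) and shared steps
    return -phase_weights[phase_label] * math.log(score)


# Toy ontology (hypothetical): steps 0-1 belong to phase 0, steps 2-3 to phase 1.
M = [[1, 0], [1, 0], [0, 1], [0, 1]]
weights = [1.0, 1.0]

confused = [0.4, 0.3, 0.2, 0.1]      # most mass on steps of the wrong phase
consistent = [0.05, 0.05, 0.5, 0.4]  # most mass on steps of the labeled phase
print(dependency_loss(confused, 1, M, weights))
print(dependency_loss(consistent, 1, M, weights))
```

The loss only penalizes probability mass placed on steps that cannot occur in the labeled phase. It is indifferent to how the mass is shared among the steps allowed in that phase, which is why the supervision signal is weak.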

IV. EXPERIMENTAL SETUP
In this section, we discuss the experimental setup of our method. First, we present the datasets used for evaluation. Next, we discuss the experimental study followed by the training setup and evaluation metrics.
A. Datasets
1) Bypass40: The Bypass40 dataset [7] consists of 40 videos of LRYGB procedures with a resolution of 854 × 480 or 1920 × 1080 pixels recorded at 25 frames per second (fps). Each frame is manually assigned to one of the 11 phases and one of the 44 steps [7]. For example, steps such as gastric opening, gastric tube placement, horizontal stapling, and vertical stapling occur in the gastric pouch creation phase. A detailed list of phases and steps along with their hierarchical relationship is presented in Table II. For more information, we refer the readers to [7]. We split the 40 videos into 24, 6, and 10 videos for training, validation, and test sets, respectively, and subsampled them at 1 fps. This amounts to 150k, 40k, and 65k images in the respective sets. The images are resized to ResNet-50's input dimension of 224 × 224, and the training dataset is augmented by applying horizontal flip, saturation, and rotation.
2) CATARACTS: The CATARACTS dataset, proposed in [29], contains 50 videos of cataract surgery. With the recent CATARACTS2020 challenge, the dataset has been released with step annotations. Similar to [6], we define a phase ontology for the available step labels. Cataract surgery consists of 5 phases and 19 steps, which are summarized in Table I. We extend the dataset with phase labels that are automatically generated from the available step annotations and the ontology presented in Table I. For each frame in a video, the phase label is obtained by a simple lookup of the step label in Table I. The only complication arises for steps that can occur in several phases. In this case, the phase of the immediately preceding frame is assigned to the current frame. Since the only steps that occur in more than one phase are Idle, Incision, and Viscodilatation, and they do not occur at the beginning or at the end of a phase, it is always possible to identify the correct phase by checking the phase of the previous step. Since very few steps occur in multiple phases, the phase labels generated automatically by table lookup are accurate and do not require expert knowledge or verification from a clinical expert.
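The lookup rule described above can be sketched as follows. The ontology in the example is a toy stand-in, not the content of Table I, and the function name `phases_from_steps` and the error raised for an ambiguous first frame are assumptions made for illustration.

```python
def phases_from_steps(step_labels, step_to_phases):
    """Derive one phase label per frame from per-frame step labels.

    step_to_phases maps each step to the list of phases it can occur in.
    Unambiguous steps are resolved by lookup; a step that can occur in several
    phases inherits the phase assigned to the immediately preceding frame.
    """
    phase_labels = []
    for step in step_labels:
        candidates = step_to_phases[step]
        if len(candidates) == 1:
            phase = candidates[0]
        elif phase_labels:
            phase = phase_labels[-1]  # carry over the previous frame's phase
        else:
            raise ValueError("ambiguous step on the first frame: " + str(step))
        phase_labels.append(phase)
    return phase_labels


# Hypothetical toy ontology: 'Idle' may occur in either phase.
toy_ontology = {"StepA": ["P1"], "StepB": ["P2"], "Idle": ["P1", "P2"]}
toy_steps = ["StepA", "Idle", "Idle", "StepB", "Idle"]
print(phases_from_steps(toy_steps, toy_ontology))
```

The carry-over rule relies on the property stated in the text that ambiguous steps never open a phase. If a video could start with such a step, the sketch raises an error rather than guessing.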
We split the 50 videos (following the challenge; see https://www.synapse.org/#!Synapse:syn21680292/wiki/601563) into 25, 5, and 20 videos for training, validation, and test sets, respectively. Each set consists of 66k, 3.5k, and 11.8k frames extracted at 1 fps from the videos. The frames are resized from 1920 × 1080 to 224 × 224, and the training set is augmented with horizontal flip, saturation, and rotation.

B. Study
To demonstrate the effectiveness of our approach, we train and evaluate different configurations of the model. Given n videos, of which k are annotated with steps and the rest (n − k) are weakly annotated with phases, the Spatio-temporal model is trained in the proposed weakly supervised setting utilizing the dependency loss, presented as 'DEP'. To analyze the efficacy of 'DEP', we compare it against the Spatio-temporal model trained only on the k videos in a fully supervised approach for the task of step recognition, which we refer to as 'FSA'. Additionally, we add a state-of-the-art semi-supervised learning method proposed by Yu et al. [42] to our results. Yu et al. [42] proposed a teacher/student semi-supervised learning method where both the teacher and student models consist of spatial and temporal components, CNN-biLSTM-CRF and CNN-LSTM respectively. As noted in Section II-B, [42] is a closely related work in the literature to the work presented in this paper. Hence, we have implemented and adapted the method of Yu et al. [42] for the task of step recognition. We repeat all the experiments for different values of k ∈ {3, 6, 12, 18}.
Furthermore, to analyze the influence of the number of additional videos with phase labels on the model performance, we conduct experiments where we fix k videos with step annotations and vary the number of videos with phase annotations from 0 to n − k (i.e., 3, 6, 12, etc.).

C. Training
The ResNet-50 model is initialized with weights pre-trained on ImageNet. The complete ResNet-50 + SS-TCN model is then trained end-to-end for the task of step recognition. Since SS-TCN models the temporal information in an online setup, features from all the past frames in the video need to be cached. To achieve this, a feature buffer is maintained to store the features that the spatial model extracts from the past frames. The feature buffer is reset at the end of the video. In all the experiments, the model is trained for 50 epochs with a learning rate of 1e-5, weight regularization of 5e-4, and a batch size of 64. The test results presented are from the best performing model on the validation set. The models were implemented in PyTorch and trained on NVIDIA RTX 2080 Ti.
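The role of the feature buffer can be illustrated with a small framework-free sketch. The class `FeatureBuffer`, the function `run_online`, and the stand-in feature extractor and temporal model are hypothetical names introduced only for this example; the actual PyTorch implementation is not described in the text and may differ.

```python
class FeatureBuffer:
    """Caches the spatial features of all past frames of the current video."""

    def __init__(self):
        self._features = []

    def push(self, feature):
        """Store the newest feature and return the full history so far."""
        self._features.append(feature)
        return tuple(self._features)

    def reset(self):
        """Forget everything, e.g. when a video ends."""
        self._features.clear()

    def __len__(self):
        return len(self._features)


def run_online(video, extract_feature, temporal_model, buffer):
    """Predict frame by frame using only the current and past features."""
    predictions = []
    for frame in video:
        history = buffer.push(extract_feature(frame))
        predictions.append(temporal_model(history))
    buffer.reset()  # the next video must not see features from this one
    return predictions


# Stand-ins: the 'feature' doubles the frame value, the 'model' sums the history.
buf = FeatureBuffer()
print(run_online([1, 2, 3], lambda f: 2 * f, lambda h: sum(h), buf))
print(len(buf))
```

Caching the features avoids re-running the spatial model on past frames each time a new frame arrives, while the reset keeps temporal context from leaking across videos.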

V. RESULTS AND DISCUSSIONS
A. Bypass40
1) Effect of weak supervision: To quantitatively evaluate our method, the results of step recognition on the test set are presented in Table III. The table contains [...] Moreover, the results of the semi-supervised method of Yu et al. [42] are also presented in Table III for different numbers of step annotated videos (3, 6, 12, and 18) used to train both the teacher and student models. The student model's performance increases by 3-8% over 'FSA' in all the metrics for 6 videos with step annotations. Furthermore, an increase of 6% and 2% is noticed in recall and F1-score above 'FSA' with 12 step annotated videos. However, the method falls short of our proposed 'DEP' method. We notice a 10-15%, 2-6%, and 1-6% increase in performance in all the metrics of the 'DEP' model over Yu [...] that the proposed method effectively utilizes this information in the lower data settings.
2) Effect of the amount of phase annotated videos: In Table IV, we present the results of our model with a varying number of phase annotated videos. Utilizing 6 videos containing step annotations, the addition of phase labeled videos as weak supervision improves all metrics: accuracy, F1, precision, and recall. With 6 videos annotated with phases, the model performance increases by 7-8% in all metrics over the baseline 'FSA' model. The addition of more videos does not affect the accuracy but further improves both precision and recall by 4%. This is due to our weakly-supervised method, which only provides supervision information if a step can occur in the given phase. This information helps to distinguish steps belonging to different phases, as opposed to steps belonging to the same phase. Therefore, the precision and recall of the model improve with more phase annotated videos, and no significant improvement in accuracy is seen. We see a similar trend when using 12 videos annotated with steps and increasing the number of videos annotated with phase labels. Thus, ultimately it is beneficial to train our method utilizing all additional videos in the dataset with phase annotations for weak supervision.
B. Cataracts
1) Effect of weak supervision: We quantitatively evaluate our method and present the results of step recognition in Table V. The table contains the results of our model, on a similar set of experiments as with Bypass40, obtained by varying the number of videos in the training set labeled with steps (3, 6, 12, and 18), with the rest of the training set containing phase annotations. We see a similar trend as with Bypass40, where the 'DEP' model outperforms 'FSA'. We notice a 13-22% improvement with the 'DEP' model when considering only 3 step annotated videos. Furthermore, we see a 6-13% and 1-3% increase in performance in all the metrics of the 'DEP' model in experiments corresponding to 6 and 12 step annotated videos, respectively. We see that our method achieves a similar performance improvement on a relatively easier surgical workflow, such as cataracts, consistently surpassing the 'FSA' in all labeled ratios. The semi-supervised method of Yu et al. achieves a performance improvement of 16%, 8%, and 1.5% over 'FSA' in F1-score for experiments corresponding to 3, 6, and 12 videos, respectively. However, as seen earlier, it falls short of 'DEP' by 5%, 0.5%, and 0.5% in the F1-score for experiments corresponding to 3, 6, and 12 videos. Interestingly, Yu et al. achieves high recall on both datasets (Tables III & V). On CATARACTS, it even outperforms the 'DEP' model in recall in all the experiments but falls short significantly in precision. This could be attributed to the student model, which learns from imperfect pseudo labels generated by the teacher model. Since our proposed 'DEP' model learns from true phase labels on additional videos, its performance increases in both precision and recall. This validates the applicability of our approach to different surgical workflows.
2) Effect of the amount of phase annotated videos: We present the results of our experiments, with a varying number of phase annotated videos, on CATARACTS in Table VI. We notice that utilizing 6 step annotated videos with additional phase labeled videos improves all the metrics by 6-13%. In particular, with 6 videos annotated with phases, we see a performance increase of 5% in accuracy and F1-score and 8% in recall of the 'DEP' model over the baseline 'FSA'. The addition of more videos provides a fractional improvement in accuracy but further improves both recall and F1-score by 1-4%. We see a similar trend when using 12 videos with step annotations, reaffirming our hypothesis that it is beneficial to train our method utilizing all additional videos in the dataset with phase annotations for weak supervision.

C. Weak supervision on step predictions
To illustrate the effectiveness of our method, we visualize its step predictions on the CATARACTS dataset, which contains fewer phases and steps and thereby allows a simpler and clearer graphical diagram. We compare the step predictions of our 'DEP' model against 'FSA' for the 2 best and 2 worst videos in CATARACTS in Fig. 3 for different labeled ratios (3, 6, and 12 videos with step annotations). Along with the step predictions, we present the errors in the phase predictions for both models. The phase prediction error plot is computed as the errors in phase predictions derived from step predictions, using the step-phase mapping matrix, against the ground truth phase labels. Fig. 3 clearly depicts the effectiveness of our method for different labeled ratios. By correcting for the phase labels through the dependency loss, our 'DEP' model is able to correct the corresponding step labels without explicit supervision for step recognition (e.g., S10, S15, S18). The top row of Fig. 3a shows this effect, where we see a marked improvement in the recognition of steps S18 (first video) and S10 (second video) by correcting for phase errors.
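One plausible way to compute such a phase prediction error track is sketched below: a frame counts as an error when the predicted step cannot occur in the ground truth phase according to the mapping matrix. This reading of the text, the function name, and the toy mapping are assumptions for illustration.

```python
def phase_prediction_errors(step_predictions, phase_labels, mapping):
    """Return 1 for frames whose predicted step is not allowed in the true phase.

    mapping[i][j] is 1 if step i can occur in phase j, else 0.
    """
    errors = []
    for step, phase in zip(step_predictions, phase_labels):
        errors.append(0 if mapping[step][phase] == 1 else 1)
    return errors


# Toy mapping (hypothetical): step 0 -> phase 0, step 1 -> phase 1,
# step 2 can occur in both phases.
toy_mapping = [[1, 0], [0, 1], [1, 1]]
predicted_steps = [0, 1, 2, 1]
true_phases = [0, 0, 1, 1]
print(phase_prediction_errors(predicted_steps, true_phases, toy_mapping))
```

A frame with no phase error can still carry a wrong step label, since several steps share a phase; the error track therefore only exposes cross-phase confusions.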

D. Limitations
In some cases, for example, S16 (Fig. 3a, 3b, 3c), correcting for phase errors does not improve step recognition. The step is misrecognized with another step that occurs in the same phase. This is an expected outcome due to the intrinsic limitations of our weakly supervised method using coarser phase labels. Given the phase to be 'P2: gastric pouch creation' (Table II), it is impossible for a model to differentiate between 'crura dissection' and 'his angle dissection' or between 'horizontal stapling' and 'vertical stapling'. As can be seen in Fig. 1, the steps are quite similar in appearance and perform similar actions on the same anatomy (i.e., stomach or small intestine). This makes it challenging for a model to learn even when all the annotations are available. Furthermore, the phase information is too weak and does not provide any cues to better distinguish between the steps because both are valid steps in the current phase. Another limitation of our method is that adding more videos with phase annotations is not always beneficial. This limitation also stems from weak phase signals. If the fully supervised 'FSA' model learns to separate steps belonging to different phases, i.e., it has no or few phase-step correspondence errors, then additional videos with phase labels add no significant value as the model, during training, makes no/few errors in phase-step correspondence that helps improve feature learning. The significant errors by the model would be the inter-class separation of steps belonging to the same phase. Learning good representations to reduce these errors without supervision is a challenging task that needs to be tackled in future works.

Fig. 3. Step predictions on two best and two worst videos on the CATARACTS dataset for different labeled ratios. For each video, we visualize the step prediction of ground truth, DEP model predictions, DEP model phase prediction errors, FSA model predictions, and phase prediction errors of FSA model.
Meanwhile, the effect of utilizing more phase annotated videos as weak supervision for improving the model performance on step recognition is presented in Tables IV & VI. As observed in Sections V-A.2 & V-B.2, it is beneficial to train the 'DEP' model utilizing all the additional phase annotated videos in the dataset for weak supervision. We also observe that in the lower data setting (6 videos with step annotations), model performance improves even when the number of phase annotated videos is increased from 12 to 18 (19 for CATARACTS). However, our study does not provide insights as to how many phase annotated videos are truly required to achieve the best performance with our proposed 'DEP' model. This is another limitation of our study, irrespective of the complexity of the procedure, and it stems from the size of the available labeled training sets (24 videos in Bypass40 & 25 in CATARACTS). Understanding the full extent of the 'DEP' model's benefits would require extending these datasets, which is an important direction that needs to be pursued in future studies.

VI. CONCLUSION
In this paper, we introduce a weakly-supervised learning method for surgical step recognition utilizing less demanding phase annotations. To model the weak supervision between steps and phases, we introduce a step-phase dependency loss and train a ResNet-50 + SS-TCN model end-to-end. The proposed method is extensively evaluated on the Bypass40 dataset consisting of 40 LRYGB procedures and on the CATARACTS dataset containing 50 cataract surgeries. The proposed 'DEP' model significantly improves the step recognition metrics over the baseline 'FSA' model for all the amounts of step annotations available. We hope that this work will inspire and foster future research in weak supervision for surgical workflow analysis utilizing multi-level descriptions of the workflow.

Ethical Approval
The surgical videos were recorded and anonymized following the informed consent of patients in compliance with the local Institutional Review Board (IRB) requirements.

Informed Consent
The patients consented to data recording.