I. Introduction
Video-driven fine-grained surgical action recognition aims to recognize detailed surgical activities in each video frame [1]. It can foster safety in the operating room by providing surgeons with intra-operative context-aware support [2]. As a key technology for automatically extracting information from surgical videos, it is also essential for surgical archiving, postoperative recovery, and surgical education [3], [4], [5]. Among fine-grained surgical action recognition tasks, recognizing triplets of surgical activity is an emerging topic that delivers the finest level of granularity in surgical activity understanding. Specifically, a surgical activity is formalized as a triplet of ⟨instrument, verb, target⟩, and the task is commonly referred to as triplet recognition. Triplet recognition is a multi-label image classification problem, as multiple activities may occur in one frame. An example of triplet recognition in CholecT45 [6] is shown in Fig. 1 (a). Two triplets, ⟨hook, dissect, cystic plate⟩ and ⟨grasper, retract, gallbladder⟩, appear in one frame, representing the cystic plate dissection with the hook and the gallbladder retraction using the grasper, respectively.
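The multi-label formulation above can be made concrete with a minimal sketch: each frame maps to a multi-hot target vector over the set of valid ⟨instrument, verb, target⟩ combinations, so several triplets can be active in the same frame. The class list and helper below are hypothetical illustrations, not the actual CholecT45 vocabulary or any published implementation.

```python
# Hypothetical subset of triplet classes; CholecT45 defines a much
# larger fixed vocabulary of <instrument, verb, target> combinations.
TRIPLET_CLASSES = [
    ("grasper", "retract", "gallbladder"),
    ("hook", "dissect", "cystic_plate"),
    ("hook", "dissect", "gallbladder"),
    ("clipper", "clip", "cystic_duct"),
]

def encode_frame_labels(active_triplets):
    """Return a multi-hot target vector for one frame.

    active_triplets: set of (instrument, verb, target) tuples
    observed in the frame.
    """
    return [1 if t in active_triplets else 0 for t in TRIPLET_CLASSES]

# The Fig. 1 (a) example: two triplets occur in the same frame,
# so two entries of the target vector are set simultaneously.
frame_labels = encode_frame_labels({
    ("hook", "dissect", "cystic_plate"),
    ("grasper", "retract", "gallbladder"),
})
print(frame_labels)  # -> [1, 1, 0, 0]
```

Because targets are multi-hot rather than one-hot, such a model is typically trained with a per-class binary loss (e.g. binary cross-entropy) instead of a softmax over mutually exclusive classes.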
Fig. 1. (a) An illustration of triplet recognition. (b) Attention maps of different backbones. The CNN-based model has a limited local attention field, while the Transformer-based model presents a more extensive one. MT4MTL-KD offers a favorable attention field that facilitates both local and global context modeling. (c) The class imbalance ratios of triplet recognition and its sub-tasks; higher values indicate a more severe class imbalance. (d) Loss convergence on a shared backbone versus individual backbones. A shared backbone yields inferior performance, as it cannot converge to an optimal point for each sub-task.