
MT4MTL-KD: A Multi-Teacher Knowledge Distillation Framework for Triplet Recognition


Abstract:

The recognition of surgical triplets plays a critical role in the practical application of surgical videos. It involves the sub-tasks of recognizing instruments, verbs, and targets, while establishing precise associations between them. Existing methods face two significant challenges in triplet recognition: 1) the imbalanced class distribution of surgical triplets may lead to spurious task association learning, and 2) the feature extractors cannot reconcile local and global context modeling. To overcome these challenges, this paper presents a novel multi-teacher knowledge distillation framework for multi-task triplet learning, known as MT4MTL-KD. MT4MTL-KD leverages teacher models trained on less imbalanced sub-tasks to assist multi-task student learning for triplet recognition. Moreover, we adopt different categories of backbones for the teacher and student models, facilitating the integration of local and global context modeling. To further align the semantic knowledge between the triplet task and its sub-tasks, we propose a novel feature attention module (FAM). This module utilizes attention mechanisms to assign multi-task features to specific sub-tasks. We evaluate the performance of MT4MTL-KD on both the 5-fold cross-validation and the CholecTriplet challenge splits of the CholecT45 dataset. The experimental results consistently demonstrate the superiority of our framework over state-of-the-art methods, achieving significant improvements of up to 6.4% on the cross-validation split.
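The abstract's two central ideas, distilling teachers trained on sub-tasks into a multi-task student and routing shared features to sub-task heads with attention, can be outlined with a brief PyTorch sketch. This is an illustrative sketch only, not the paper's architecture: the names FAMHead and multi_label_kd_loss, the sigmoid-gating design, and the loss weight alpha are assumptions introduced here; only the sub-task class counts (6 instruments, 10 verbs, 15 targets) follow CholecT45.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FAMHead(nn.Module):
    """Hypothetical feature-attention head: gates shared multi-task features
    before a sub-task classifier. Illustrative only; not the paper's FAM."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.Sigmoid())
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, shared_feat: torch.Tensor) -> torch.Tensor:
        gated = shared_feat * self.attn(shared_feat)  # soft feature selection
        return self.classifier(gated)


def multi_label_kd_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Generic multi-label knowledge distillation: the student matches a frozen
    teacher's sigmoid outputs (soft term) and the ground truth (hard term)."""
    soft_targets = torch.sigmoid(teacher_logits).detach()
    soft = F.binary_cross_entropy_with_logits(student_logits, soft_targets)
    hard = F.binary_cross_entropy_with_logits(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard


# Toy usage: one teacher per sub-task (instrument / verb / target) guides the
# corresponding student head reading a shared backbone feature.
feat_dim, batch = 256, 4
shared_feat = torch.randn(batch, feat_dim)  # stand-in for backbone output
heads = {"instrument": FAMHead(feat_dim, 6),
         "verb": FAMHead(feat_dim, 10),
         "target": FAMHead(feat_dim, 15)}
loss = 0.0
for task, head in heads.items():
    student_logits = head(shared_feat)
    teacher_logits = torch.randn_like(student_logits)  # placeholder teacher output
    labels = torch.randint(0, 2, student_logits.shape).float()
    loss = loss + multi_label_kd_loss(student_logits, teacher_logits, labels)
```

In this sketch the teachers are plain placeholders; the point is only that each sub-task head receives its own soft supervision while all heads share one feature, which is the setting the framework addresses.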
Published in: IEEE Transactions on Medical Imaging (Volume: 43, Issue: 4, April 2024)
Page(s): 1628-1639
Date of Publication: 21 December 2023

PubMed ID: 38127608

I. Introduction

Video-driven fine-grained surgical action recognition aims to recognize detailed surgical activities in each video frame [1]. It can foster safety in the operating room by providing surgeons with intra-operative context-aware support [2]. As a key technology for automatically extracting information from surgical videos, it is also essential for surgical archives, postoperative recovery, and surgical education [3], [4], [5]. Among all fine-grained surgical action recognition tasks, recognizing surgical activity triplets is an emerging topic that delivers the finest level of granularity in surgical activity understanding. Specifically, the surgical activity is formalized as a triplet of ⟨instrument, verb, target⟩, which is commonly referred to as triplet recognition. Triplet recognition is a multi-label image classification problem, as multiple activities may occur in one frame. An example of triplet recognition in CholecT45 [6] is shown in Fig. 1 (a). Two triplets, ⟨hook, dissect, cystic plate⟩ and ⟨grasper, retract, gallbladder⟩, appear in one frame, representing the cystic plate dissection with the hook and the gallbladder retraction using the grasper, respectively.
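Because several triplets can be active in the same frame, the task reduces to multi-label classification over the triplet vocabulary. A minimal sketch of the label encoding is given below, assuming the 100 triplet classes of CholecT45; the specific class indices used in the example are placeholders, not the dataset's actual mapping.

```python
import torch

NUM_TRIPLET_CLASSES = 100  # CholecT45 defines 100 <instrument, verb, target> classes


def encode_frame_labels(active_triplet_ids, num_classes=NUM_TRIPLET_CLASSES):
    """Multi-hot target for one frame: several triplets may co-occur,
    so the label is a binary vector rather than a single class index."""
    target = torch.zeros(num_classes)
    target[list(active_triplet_ids)] = 1.0
    return target


# e.g. a frame containing <hook, dissect, cystic plate> and <grasper, retract, gallbladder>;
# class ids 17 and 42 are hypothetical placeholders.
frame_target = encode_frame_labels([17, 42])
print(frame_target.sum())  # tensor(2.)
```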

Fig. 1. (a) An illustration of triplet recognition. (b) Attention maps of different backbones. The CNN-based model possesses a limited local attention field, while the Transformer-based model presents a more extensive one. MT4MTL-KD offers a favorable attention field that facilitates both local and global context modeling. (c) The class imbalance ratios of triplet recognition and its sub-tasks. Higher values indicate a more severe class imbalance. (d) Loss convergence on a shared backbone and individual backbones. A shared backbone results in inferior performance, as it is unable to converge to an optimal point for each sub-task.

