Abstract:
This letter proposes a novel video-based contrastive regression architecture, Contra-Sformer, for automated surgical skill assessment in robot-assisted surgery. The proposed framework is structured to capture the differences in surgical performance between a test video and a reference video that represents optimal surgical execution. A feature extractor combining a spatial component (ResNet-18), supervised at frame level with gesture labels, and a temporal component (TCN) generates spatio-temporal feature matrices of the test and reference videos. These are then fed into an action-aware Transformer with multi-head attention that produces inter-video contrastive features at frame level, representative of the skill similarity or deviation between the two videos. Moments of sub-optimal performance can be identified and temporally localized in the obtained feature vectors, which are ultimately used to regress the manually assigned skill scores. Validated on the JIGSAWS dataset, Contra-Sformer achieves competitive performance (Spearman correlation of 0.65–0.89), with a normalized mean absolute error between 5.8% and 13.4% on all tasks and across validation setups.
Published in: IEEE Robotics and Automation Letters (Volume: 8, Issue: 3, March 2023)
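
The two evaluation metrics named in the abstract, Spearman correlation and normalized mean absolute error, can be illustrated with a minimal sketch. This is not the authors' evaluation code. The function names, the example scores, and the choice to normalize the error by the width of the score range are all assumptions made for illustration; the abstract does not state how the normalization is defined.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(predicted, target):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rp, rt = average_ranks(predicted), average_ranks(target)
    n = len(rp)
    mp, mt = sum(rp) / n, sum(rt) / n
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    sp = sqrt(sum((a - mp) ** 2 for a in rp))
    st = sqrt(sum((b - mt) ** 2 for b in rt))
    return cov / (sp * st)


def normalized_mae(predicted, target, score_min, score_max):
    """MAE as a percentage of the score range (assumed normalization)."""
    mae = sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)
    return 100.0 * mae / (score_max - score_min)


# Hypothetical skill scores for five trials; not taken from the paper.
target = [10, 14, 19, 23, 27]
predicted = [12.0, 13.0, 21.0, 20.0, 28.0]

rho = spearman(predicted, target)
# Score bounds of 6 and 30 are an assumed range for the manual skill score.
nmae = normalized_mae(predicted, target, score_min=6, score_max=30)
print(f"Spearman: {rho:.2f}, normalized MAE: {nmae:.1f}%")
# prints "Spearman: 0.90, normalized MAE: 7.5%"
```

Spearman correlation measures whether the predicted scores order the trials the same way the manual scores do, while the normalized error measures how far the predicted values are from the manual ones, so the two numbers reported in the abstract capture complementary aspects of regression quality.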