Self-Supervised Pretraining Improves Performance and Inference Efficiency in Multiple Lung Ultrasound Interpretation Tasks

In this study, we investigated whether self-supervised pretraining could produce a neural network feature extractor applicable to multiple classification tasks in B-mode lung ultrasound analysis. When fine-tuning on three lung ultrasound tasks, pretrained models resulted in an improvement of the average across-task area under the receiver operating characteristic curve (AUC) by 0.032 and 0.061 on local and external test sets respectively. Compact nonlinear classifiers trained on features outputted by a single pretrained model did not improve performance across all tasks; however, they reduced inference time by 49% compared to the serial execution of separate fine-tuned models. When training using 1% of the available labels, pretrained models consistently outperformed fully supervised models, with a maximum observed test AUC increase of 0.396 for the task of view classification. Overall, the results indicate that self-supervised pretraining is a useful strategy for producing initial weights for lung ultrasound classifiers.


Introduction
Recent years have witnessed a surge of interest in self-supervised learning (SSL) as a strategy for representation learning in computer vision. Hailed as a means to productively leverage unlabelled data when labels are scarce, self-supervised pretraining has been shown to improve performance on several supervised learning tasks in multiple domains of medical imaging, such as radiography [1,2], computed tomography [2,3], magnetic resonance imaging [2,3], ultrasound [4,5], and dermatology [1]. Self-supervised pretraining produces a feature extractor that may be used to initialize the weights of a model in a supervised learning setting. Studies have indicated that models pretrained with SSL perform comparably to fully supervised models even when fine-tuned with significantly less labelled data [1,3]. Given the widespread paucity and expense of labelled medical images, it is therefore unsurprising that SSL has risen as a reasonable strategy to leverage unlabelled data.
Interpretation of medical images consists of completing several recognition tasks, occasionally in a hierarchical manner. In the hierarchical setting, interpreters engage in the predictive process of a decision tree, beginning with the root node and traversing down a single path, guided by decisions at each node. Examples include the distinction of malignant pulmonary nodules on CT [6] and the identification of lipomas and liposarcomas on MRI [7]. In this study, we focus on lung ultrasound (LUS), an examination involving the recognition of multiple artefacts that narrow differential diagnoses in emergency and critical care scenarios, hereafter referred to as multi-task LUS interpretation.
Past work in machine learning-based hierarchical medical imaging classification has resorted to training entirely separate classifiers for each node in the tree [8]. This study sought to determine whether a single feature extractor can produce meaningful representations for multi-task LUS interpretation. We hypothesized that self-supervised pretraining is suited to developing a feature extractor that is useful for multiple classification tasks. Self-supervised pretraining produces feature representations that may be adapted for training multiple supervised learners, while making use of unlabelled examples. The feature extractor can be fine-tuned for individual subsequent tasks (Figure 1, 3a). Alternatively, the weights of the feature extractor may be held constant, facilitating the addition of new tasks to the multi-task LUS interpretation system by training nonlinear classifiers on top of the features (Figure 1, 3b). The contributions of this work are thus as follows: (1) an investigation of the suitability of self-supervised feature extractors for multi-task interpretation of B-mode LUS, and (2) a tree-based classification strategy in which the inputs to the root node are obtained from a feature extractor pretrained with SSL. The evaluation provides performance and runtime metrics for each task, comparing the fine-tuning of end-to-end models for each task against training a multilayer perceptron (MLP) on features yielded by one pretrained extractor.
Related Works

Joint Embedding Self-Supervised Learning
Broadly, self-supervised learning (SSL) is a form of unsupervised representation learning that is employed to pretrain a feature extractor for transfer learning. It consists of learning to solve a pretext task, which is a supervised learning problem for which labels are computed from unlabelled data. The weights of the feature extractor are used to initialize a new model trained to solve a supervised learning task for which labels are available. In the joint embedding framework of SSL, the pretext task is designed to reduce the differences between features of semantically related images that satisfy a pairwise relationship. Semantically related pairs of images are customarily passed through the feature extractor, with the output being sent through a projection head (typically an MLP), producing embeddings. Contrastive learning methods seek to reduce the distance between embeddings of paired images (positive pairs) and increase the distance between embeddings of images that do not satisfy the pairwise relationship (negative pairs) [9]. Non-contrastive learning methods dispense with negative pairs, focusing only on minimizing distances between the embeddings of positive pairs [10,11].

Table 1: A summary of the LUS tasks addressed in this study.
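To make the contrastive objective concrete, the following is a minimal sketch of a normalized temperature-scaled cross-entropy (NT-Xent) loss of the kind used by SimCLR [9]. It is not the implementation used in this study: the function names, the pure-Python list representation of embeddings, and the toy batch in the usage note are assumptions made for illustration only.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def nt_xent_loss(z_a, z_b, temperature=0.1):
    """Average NT-Xent loss over a batch of positive pairs (z_a[i], z_b[i]).

    Every other embedding in the batch acts as a negative for a given anchor.
    """
    z = [_normalize(v) for v in z_a + z_b]
    n = len(z_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same image
        sims = [
            sum(p * q for p, q in zip(z[i], z[j])) / temperature
            for j in range(2 * n)
        ]
        others = [sims[j] for j in range(2 * n) if j != i]
        # Numerically stable log-sum-exp over all embeddings except the anchor.
        peak = max(others)
        log_denominator = peak + math.log(sum(math.exp(s - peak) for s in others))
        total += log_denominator - sims[positive]
    return total / (2 * n)
```

As a sanity check under these assumptions, a batch whose positive pairs point in the same direction (for example `[[1, 0], [0, 1]]` paired with itself) yields a loss near zero, whereas swapping the pairing yields a much larger loss.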

Joint Embedding Methods in B-Mode Ultrasound
Multiple studies have assessed the impact of joint embedding approaches to self-supervised pretraining on the performance of machine learning solutions in diagnostic B-mode US tasks, particularly when labels are scarce. Contrastive and non-contrastive methods have been applied to breast tumour classification and left ventricle segmentation, with mixed results [4,12,13]. Chen et al. [14] proposed a custom contrastive learning objective with interpolated intra-video positive pairs, outperforming both fully supervised and SimCLR-pretrained models on the public POCOVID-Net dataset [15]. Adopting a curriculum learning approach, Basu et al. [5] achieved even better performance on POCOVID-Net with their contrastive learning method that employed progressively harder intra-video positive pairs.

Multi-Task Medical Image Interpretation
Several studies have addressed multi-task learning for medical imaging interpretation. For instance, Zhang et al. [16] trained a single neural network with dedicated output layers for the classification of carotid plaques and estimation of the degree of stenosis on CT angiography imaging. Xu et al. [17] proposed a single convolutional neural network (CNN) architecture for abdominal US view classification and landmark localization, using features from intermediate residual blocks as input for both tasks. Focusing instead on hierarchical interpretation, Fu et al. [18] proposed a system for medical image classification consisting of a CNN followed by a decision tree in which each node is a linear classifier. Decision trees with neural network nodes have also been proposed [8].
In our study, we show that a single CNN pretrained with self-supervision provides sufficient features for multiple tasks, including tasks arranged hierarchically. Note that the present work is distinct from multi-task learning in that it explores the utility of reusing a single self-supervised pretrained feature extractor for the development of multiple LUS classifiers.

LUS Classification Tasks
The LUS interpretative workflow addressed in this work has been described as a decision tree [19]. After determining the view, the interpreter traverses down the tree to look for increasingly specific artefacts that narrow the differential diagnosis. We focus on three binary classification tasks for LUS image interpretation: view classification (View), A-line versus B-line classification (AB), and pleural effusion detection (PE). The AB task is applicable to parenchymal LUS views, and the PE task to pleural LUS views. Table 1 summarizes these tasks, and Figure 2 displays representative examples for each class.
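The hierarchical workflow described above can be sketched as a small routing function. This is an illustrative sketch rather than the system evaluated in this work: the function name, the 0.5 decision threshold, and the choice of which class each classifier scores as positive (pleural view, B-lines, effusion) are all assumptions.

```python
from typing import Callable, Dict, Sequence

# A classifier maps a feature vector to the probability of its positive class.
Classifier = Callable[[Sequence[float]], float]


def interpret_lus_features(
    features: Sequence[float],
    view_clf: Classifier,
    ab_clf: Classifier,
    pe_clf: Classifier,
    threshold: float = 0.5,
) -> Dict[str, str]:
    """Traverse the View -> (AB | PE) decision tree for one image's features."""
    # Assumed convention: the View classifier scores the pleural view.
    if view_clf(features) >= threshold:
        # Pleural views are routed to pleural effusion detection only.
        finding = "pleural effusion" if pe_clf(features) >= threshold else "no pleural effusion"
        return {"view": "pleural", "finding": finding}
    # Parenchymal views are routed to A-line versus B-line classification only.
    finding = "B-lines" if ab_clf(features) >= threshold else "A-lines"
    return {"view": "parenchymal", "finding": finding}
```

Only two of the three classifiers are evaluated for any given image, which is why the tree is traversed along a single path.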

Data
Datasets from one local and one external healthcare institution were extracted from a private repository of LUS videos, access to which is permitted via ethics approval granted by Western University (REB 116838). The datasets were previously labelled for the View, AB, and PE tasks by clinicians competent in LUS as a part of prior work [20,21]. The labelled portion of the local dataset was split by patient identifier into a training set (70%), validation set (15%), and test set (15%), and the external dataset was reserved for testing only. Local videos with no labels were used only during self-supervised pretraining. Table 2 details the cardinalities and class distribution of these datasets. Regions peripheral to the US beam were expunged of extraneous visual artefacts, and the images were cropped to the boundaries of the beam. All images were downsampled to 128 × 128 pixels.
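Splitting by patient identifier keeps all videos from one patient in the same partition, which avoids leakage between training and evaluation data. A minimal sketch of such a split follows; the function name, the fixed seed, and the decision to apply the 70/15/15 proportions to patients rather than to videos are assumptions, as the exact procedure is not specified here.

```python
import random
from collections import defaultdict


def split_by_patient(video_to_patient, fractions=(0.70, 0.15, 0.15), seed=0):
    """Assign every video of a patient to exactly one of train/val/test.

    `video_to_patient` maps a video identifier to its patient identifier.
    The fractions are applied to the number of patients (an assumption).
    """
    patients = sorted(set(video_to_patient.values()))
    random.Random(seed).shuffle(patients)
    n_train = round(fractions[0] * len(patients))
    n_val = round(fractions[1] * len(patients))

    assignment = {}
    for index, patient in enumerate(patients):
        if index < n_train:
            assignment[patient] = "train"
        elif index < n_train + n_val:
            assignment[patient] = "val"
        else:
            assignment[patient] = "test"

    splits = defaultdict(list)
    for video, patient in video_to_patient.items():
        splits[assignment[patient]].append(video)
    return dict(splits)
```

Because the unit of assignment is the patient, the resulting video-level proportions only approximate 70/15/15 when patients contribute different numbers of videos.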
We also evaluated the effectiveness of SimCLR-pretrained weights on the public COVIDxUS dataset [22], splitting it by video identifier into a training, validation, and test set. Although patient identifiers were not available for every video, care was taken to ensure that multiple videos from the same patient identifier were contained in the same set. COVIDxUS contains 243 LUS videos from a variety of manufacturers and clinical sources with labels for 4 classes: normal lung, COVID-19, other pneumonia, and other pathologies.

Self-Supervised Pretraining
Three joint embedding SSL methods were trialled to produce pretrained models for the LUS tasks: SimCLR (with τ = 0.1) [9], Barlow Twins (with γ = 0.005) [10], and VICReg (with γ = 25, µ = 25, ν = 1) [11]. As was done in the original studies, positive pairs were produced by distorting images with stochastic data augmentations sampled from a family of transformations. Figure 3 provides examples of augmented views of B-mode images from the local dataset. Below is the list of transformations, where P indicates the probability of that transformation being applied:
1. Random crop of c ∼ U(0.5, 1.0) of the image's area (P = 0.8).
Feature extractors were pretrained for 15 epochs using the union of the unlabelled and training images. The MobileNetV3 [23] architecture, initialized with ImageNet-pretrained weights, was employed for all pretraining. The same pretrained feature extractors were used to initialize all downstream LUS tasks.
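As an illustration of the random crop transformation, the sketch below crops a window covering a fraction c ∼ U(0.5, 1.0) of the image's area with probability P = 0.8. It rests on several assumptions not stated above: the crop preserves the original aspect ratio, the image is a nested list of pixel rows, and the subsequent resize back to the network's input resolution is omitted. The function name is hypothetical.

```python
import random


def random_area_crop(image, rng, p=0.8, min_frac=0.5, max_frac=1.0):
    """With probability p, return a random crop covering a fraction
    c ~ U(min_frac, max_frac) of the image's area; otherwise return the image.
    """
    height, width = len(image), len(image[0])
    if rng.random() >= p:
        return image  # transformation not applied
    c = rng.uniform(min_frac, max_frac)
    side_scale = c ** 0.5  # assumed: aspect ratio preserved, so each side scales by sqrt(c)
    crop_h = max(1, round(height * side_scale))
    crop_w = max(1, round(width * side_scale))
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]
```

Calling the function twice on the same image with the same random generator yields two differently distorted views, which together would form one positive pair.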

Evaluation Protocol
We compared pretrained models with fully supervised models initialized with ImageNet-pretrained weights. The following experiments were conducted to determine pretrained models' effectiveness at learning the LUS tasks.
• Linear classification (LC): The weights of the feature extractor were held constant, and a linear classifier was trained using its outputted features.
• Fine-tuning (FT): The weights of the feature extractor and a linear head were both trained.
• Nonlinear classification (NC): The weights of the feature extractor were held constant and a nonlinear head was trained on the features. The head consisted of a multilayer perceptron with a single hidden layer of 32 nodes with ReLU activation.
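The nonlinear head used in the NC experiment can be sketched as a forward pass over frozen features. The hidden layer of 32 ReLU units follows the description above; the sigmoid output unit, the uniform weight initialization, and the function names are assumptions made for this illustration.

```python
import math
import random


def init_head(n_features, n_hidden=32, seed=0):
    """Create parameters for a one-hidden-layer MLP head (illustrative initialization)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_features)
    return {
        "w1": [[rng.uniform(-scale, scale) for _ in range(n_features)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [rng.uniform(-scale, scale) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def head_forward(features, params):
    """Map a frozen feature vector to the probability of the positive class."""
    # Hidden layer: affine transform followed by ReLU.
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(params["w1"], params["b1"])
    ]
    # Output layer: a single logit passed through a sigmoid (assumed output activation).
    logit = sum(w * h for w, h in zip(params["w2"], hidden)) + params["b2"]
    return 1.0 / (1.0 + math.exp(-logit))
```

In the NC setting only these head parameters would receive gradient updates; the feature extractor that produces `features` stays fixed, so one forward pass through it can serve several such heads.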
Figure 1 (3a & 3b) illustrates how FT and NC each implement hierarchical LUS interpretation for the tasks of interest. In all trials, the initial learning rates for the feature extractor and head were 1 × 10^−5 and 1 × 10^−4 respectively; they were multiplied by a factor of e^−0.02 each epoch. Models were trained for 10 epochs to minimize binary cross-entropy loss, and the weights resulting in the lowest validation loss were retained. We assessed model performance by determining the area under the receiver operating characteristic curve (AUC) on the local and external test sets. All experiments were conducted using a system with an Intel i9-10900K CPU at 3.7 GHz and an Nvidia GeForce RTX 3090 GPU.
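The exponential learning rate decay can be written out directly. The sketch below assumes that epoch 0 uses the initial learning rate and that the factor is applied once after each completed epoch; the function name is hypothetical.

```python
import math


def learning_rate(initial_lr, epoch, decay=0.02):
    """Learning rate at a zero-indexed epoch, multiplied by e^-decay after every epoch."""
    return initial_lr * math.exp(-decay * epoch)


# Learning rates for the feature extractor and the head over 10 epochs.
schedule = [(learning_rate(1e-5, e), learning_rate(1e-4, e)) for e in range(10)]
```

Under these assumptions, the learning rates in the final epoch are about 84% of their initial values, since e^(−0.02 × 9) ≈ 0.835.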

Test Performance
Feature extractors were pretrained using SimCLR [9], Barlow Twins [10], and VICReg [11]. Pretrained models were then trained for each of the three experiments outlined in Section 3.4. In the case of linear evaluation, self-supervised pretraining resulted in greater performance on local unseen data. Fine-tuned models and MLPs generally achieved greater local test performance, with a notable exception occurring in NC for the PE task on the local test set.
Seeking to better understand these results, we visualized two-dimensional t-SNE [24] projections of the features outputted by ImageNet-pretrained and SimCLR-pretrained feature extractors. As can be seen in Figure 4, the projections for PE were not well-separated, even after pretraining with SimCLR. In contrast, the projections suggest that self-supervised pretraining improved the separability of the data for the AB task, which is reflected in the consistently stronger performance of the SimCLR-pretrained model. The difference in performance after self-supervised pretraining was less clear for View, which may be because there were significantly more labelled examples available for View (see Table 2). Moreover, the t-SNE projections for View exhibited separability before and after SimCLR pretraining. As conveyed in Table 3, similar performance trends emerged when evaluating the fine-tuned models on the external test set.
To promote experimental replicability, we investigated the effect of self-supervised pretraining on COVIDxUS, a public LUS dataset. As shown in Table 4, pretraining with SimCLR on the COVIDxUS training set resulted in better mean class-wise test AUC than initialization with ImageNet-pretrained weights. To explore transferability of pretrained weights, we conducted a separate training run using weights pretrained using SimCLR on the local LUS dataset. Note that, although COVIDxUS contains fewer than a tenth of the number of videos in the local training set alone, it was amalgamated from a variety of institutions and device manufacturers. The results highlight the importance of pretraining on a similar distribution, as weights pretrained on the local LUS dataset performed comparably to supervised ImageNet pretraining.
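For a multi-class dataset such as COVIDxUS, a mean class-wise AUC can be computed by treating each class in turn as the positive class. The sketch below is one plausible formulation, assuming a one-vs-rest scheme with an unweighted mean over classes; the function names are hypothetical, and the exact computation used for Table 4 is not specified here.

```python
def binary_auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative
    (ties count as one half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC is undefined without both positive and negative examples")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def mean_classwise_auc(probabilities, labels, n_classes):
    """Unweighted mean of one-vs-rest AUCs; probabilities[i][k] scores class k for example i."""
    aucs = []
    for k in range(n_classes):
        class_scores = [row[k] for row in probabilities]
        one_vs_rest = [1 if y == k else 0 for y in labels]
        aucs.append(binary_auc(class_scores, one_vs_rest))
    return sum(aucs) / n_classes
```

The pairwise formulation is quadratic in the number of examples, which is acceptable for small test sets; a rank-based computation would be preferable at scale.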

Inference Efficiency
Real-time device inference could be done by reusing the output of a single feature extractor as input to multiple lightweight classifiers. We compared the inference time of two serial fine-tuned CNNs (Figure 1, 3a) against one feature extractor and two subsequent MLP classifiers (Figure 1, 3b), reflecting the decision tree that results from connecting the View, AB, and PE tasks in the LUS interpretation workflow. After serially conducting 1000 predictions, the former took an average of 0.116 s (SD 0.003 s), while the latter took an average of 0.059 s (SD 0.001 s), underlining the runtime advantage of multi-task inference with a shared feature extractor. With each feature extractor and MLP requiring 3.7 × 10^7 and 3.7 × 10^4 floating point operations respectively, reusing the output of a single feature extractor as input to multiple task-specific MLPs would save considerable computational resources. The LUS diagnostic tree depicted in Figure 1 would require approximately half the floating point operations if each node were a lightweight MLP instead of an entire CNN. Future work should focus on improving the applicability of representations from frozen self-supervised pretrained models for multiple ultrasound classification tasks.
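The reported savings follow from simple arithmetic on the figures above. The sketch below assumes that interpreting one image traverses two nodes of the tree (View, then either AB or PE), matching the two-model comparison in the timing experiment; the variable names are illustrative.

```python
CNN_FLOPS = 3.7e7  # floating point operations per feature extractor forward pass
MLP_FLOPS = 3.7e4  # floating point operations per MLP head forward pass

# Mean inference times reported for the two configurations (seconds).
serial_time, shared_time = 0.116, 0.059
time_reduction = (serial_time - shared_time) / serial_time

# Two tree nodes are visited per image: two full CNNs versus one CNN plus two MLP heads.
separate_flops = 2 * CNN_FLOPS
shared_flops = CNN_FLOPS + 2 * MLP_FLOPS
flop_ratio = shared_flops / separate_flops

print(f"inference time reduction: {time_reduction:.1%}")
print(f"FLOPs relative to separate CNNs: {flop_ratio:.3f}")
```

The time reduction works out to roughly 49%, and the shared configuration needs about 0.501 times the floating point operations, because each MLP head costs three orders of magnitude less than a feature extractor pass.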

Conclusion
In this study, joint embedding SSL methods were observed to improve the performance of classifiers on a variety of LUS tasks, particularly when a small fraction of labels was employed. Fine-tuning self-supervised pretrained models consistently yielded the greatest performance gains for each task, with SimCLR-pretrained models improving across-task average AUC by 0.032 and 0.061 on local and external test sets respectively. When holding the weights of pretrained feature extractors constant, linear classifiers trained on representations from self-supervised models consistently achieved greater across-task average AUC on local and external test data. MLPs trained on features outputted by self-supervised pretrained models did not outperform fully supervised models on all tasks; however, low-dimensional projections of features provided qualitative evidence that the features were well-separated with respect to two of the three tasks studied. Given the greatly reduced inference time for multi-task LUS interpretation when reusing features from a single pretrained feature extractor, there would be great merit in future work that improves the quality of pretrained feature extractors and the separability of their outputs with respect to multiple tasks. As such, future studies could systematically ascertain the effect of US-specific data augmentations in joint embedding methods and explore sample weights for SSL objectives that exploit temporal proximity in B-mode videos.

Figure 1: An overview of the methods described in this work. (1) Three tasks were identified for lung ultrasound (LUS) image classification: parenchymal versus pleural views, A-lines versus B-lines (applicable to parenchymal views), and pleural effusion (PE) versus no pleural effusion (applicable to pleural views). (2) A convolutional feature extractor f was pretrained to minimize a self-supervised objective, using unlabelled and labelled LUS images as input and trainable projector g. (3a) Task-specific models were defined by appending a linear classifier or multilayer perceptron h_i to copies of pretrained f. The models were trained end-to-end for each task using labelled data. (3b) An alternative framework in which f's weights were not fine-tuned. Instead, task-specific models h_i were trained that each received f's features as input.

Figure 2: Examples of each class for each LUS binary classification task: View (a), AB (b), and PE (c).

Figure 3: Augmented views of B-mode images, comprising positive pairs for self-supervised pretraining.

Figure 4: A comparison of t-SNE projections of features for the examples in the local test set outputted by a feature extractor initialized with ImageNet-pretrained weights before SimCLR pretraining (top) and after SimCLR pretraining (bottom). Bold typeface indicates the best-performing pretraining strategy and dataset combination.

Figure 5: Local test AUC for supervised models initialized with ImageNet-pretrained weights and SimCLR-pretrained weights. Results are provided for the fine-tuning (FT) and nonlinear classification (NC) experiments training on various fractions of the labelled dataset.

Table 2: Breakdown of the institutional US datasets used in this study. For each LUS binary classification task, x / y indicates the number of negative and positive examples respectively.

Table 3 details the results on the local and external test sets for each of the experiments described in Section 3.4. Area under the receiver operating characteristic curve (AUC) was designated as the primary evaluation metric, but other classification metrics are reported in Appendix A.

Table 3: AUC evaluated on the local and external test sets for the linear classification (LC), fine-tuning (FT), and nonlinear classification (NC) experiments. Results are presented for each of the View, AB, and PE tasks. The bottom row gives the geometric mean across tasks, with bold typeface indicating the best-performing pretraining strategy.

Table 4: Mean class-wise AUC on the COVIDxUS test set for FT and NC.