Applications of Self-Supervised Learning to Biomedical Signals: A Survey

Over the last decade, deep learning applications in biomedical research have exploded, often outperforming previous machine learning approaches in various tasks. However, training deep learning models for biomedical applications requires large amounts of data annotated by experts, whose collection is often time- and cost-prohibitive. Self-Supervised Learning (SSL) has emerged as a prominent solution for such problems, as it allows learning powerful representations from vast unlabeled data by producing supervisory signals directly from the data. The high number of recent works employing the self-supervised learning paradigm for the analysis of biomedical signals (biosignals) can make it difficult for researchers to have a complete picture of the current research state. Therefore, this paper aims at outlining and clarifying the state of the art in the domain. The article: briefly summarizes the nature and acquisition modality of the main biosignals; introduces the self-supervised learning method, focusing on the different pretraining strategies; provides a concise overview of the works employing SSL for the analysis of different types of biosignals; provides an overall analysis of critical aspects to consider when applying SSL to biosignals, also highlighting current open challenges. The analysis of the scientific literature highlights the importance of SSL, confirming its potential to improve models’ performance and robustness, and to promote the integration of deep learning into clinical tasks.


I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Hasan S. Mir.
In the last decade, deep learning has emerged as a powerful and versatile tool capable of achieving state-of-the-art performance in various fields. Starting from AlexNet [1], winner of the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [2], many of the biggest companies have invested considerable resources to promote and introduce deep learning applications in their products and software. Notable examples are Google DeepMind's AlphaZero [3], a reinforcement learning algorithm capable of winning against the strongest humans and computer engines on various board games (e.g., chess, Go, shogi), Google DeepMind's AlphaFold [4], winner of the 13th and 14th Critical Assessment of Techniques for Protein Structure Prediction (CASP), and OpenAI's recent ChatGPT, considered a fundamental step in Natural Language Processing (NLP). The mentioned examples demonstrate how deep learning can be successfully applied in various research areas; unsurprisingly, medicine has not been left out of the ''gold rush'' of Artificial Intelligence (AI). Looking at PubMed, one of the most used search engines for biomedical literature [5], it is possible to see that the number of yearly published works involving deep learning has increased from fewer than 300 in 2016 to approximately 17 000 in 2022 (a remarkable increase of approximately 5 700%). However, despite the rocketing number of applications, the use of deep learning is still limited in common clinical practice [6]. Deep neural networks are usually trained in a supervised way, where a manually labeled dataset is used to train, optimize, and test the model. At first, given the novelty of the field, this approach often outperformed previous state-of-the-art algorithms based on more naive approaches [7]. More recently, considering the increasing complexity of the models and tasks in which deep learning is involved, limitations have started to be highlighted [8]. Training large neural networks that can generalize well in the biomedical domain requires a huge amount of highly heterogeneous annotated data, which is difficult to collect in medical research [9].
In fact, manually labeling medical data is a time-consuming task that only experts in the field can perform. Furthermore, data collection is often hindered by ethical (e.g., trial approval, anonymization) and economic aspects, which make data provision and annotation extremely challenging.
In contrast, thanks to the digitization of the healthcare sector, a large amount of unlabeled data is generated every day, with volumes already reaching the exascale [10]. Exploiting these data could greatly improve the performance and robustness of deep learning models, which is why the research community has started to propose novel unsupervised solutions. Self-Supervised Learning (SSL) has emerged as one of the most prominent paradigms in this context. Its goal is to learn robust general-purpose representations from the data by exploiting an auxiliary task (pretext task), and then to transfer the acquired knowledge to a new model designed to solve the target (medical) task. Self-supervised learning has been successfully applied in many fields, such as natural language processing [11], computer vision [12], speech recognition [13], and robotics [14]. In medical research, computer vision is the most investigated area [15]. Here, self-supervised learning is employed for classification, segmentation, registration, and reconstruction of different types of images, from 2D microscopy for digital pathology [16] to 3D MRI (magnetic resonance imaging) [17]. The interested reader can consult the work of Saeed et al. [18] and that of Xu [19], who have already reviewed SSL implementations in the medical imaging domain.
Biomedical signals (biosignals) represent a fundamental resource in the medical domain, including many modalities such as electroencephalography (EEG), electromyography (EMG), and electrocardiography (ECG). Moreover, with the progress of the IoT (Internet of Things) and the spread of wearable devices, their role is becoming increasingly relevant, especially in telehealth and precision medicine [20]. As a matter of fact, several researchers have already proposed SSL strategies for the analysis of biosignals. However, considering the large and constantly growing number of publications, it is difficult to keep up with the progress of the state of the art. To the best of our knowledge, a review targeting SSL applications to biosignals is not yet available. In fact, previously cited works focus on different types of data (medical imaging) [19], [20], specific biomedical signals (EEG) [21], or specific self-supervised learning paradigms (contrastive learning) [15]. Moreover, they often describe SSL pretraining strategies and the surveyed works extensively but do not put the same effort into discussing the special aspects to consider when employing existing SSL techniques for a specific biosignal analysis task (the work of Rafiei et al. [21] for EEG data is an exception), even though these aspects are crucial for effectively designing novel strategies. Therefore, this paper aims at addressing these limitations by providing a resource where readers can find an outline of the main principles behind the most commonly used SSL frameworks for the analysis of biomedical signals and an overview of the current state of the art of the domain, regardless of the nature of the signal or of the investigated self-supervised paradigm.
The rest of the work is organized as follows. Section II provides a brief description of the most important types of biosignals, with a focus on the ones encountered during the survey. Sections III and IV introduce the self-supervised learning paradigm, describing its main concepts and different pretext task strategies. Section V briefly describes the survey methodology. Section VI reports and analyzes SSL applications for the analysis of different types of biosignals (e.g., ECG, EEG, and EMG), also considering multimodal approaches. Section VII aims to answer different questions related to the application of SSL for biosignal analysis, while also providing a description of critical issues and open challenges. Finally, Section VIII summarizes the most important outcomes of the work.

II. BIOSIGNALS
As per Bansal's Real-Time Data Acquisition in Human Physiology [25]: ''Biological signals, or Biosignals, are space, time, or space-time records of a biological event such as a beating heart or a contracting muscle. The electrical, chemical, and mechanical activity that occurs during these biological events often produces signals that can be measured and analyzed. Biosignals, therefore, contain useful information that can be used to understand the underlying physiological mechanisms of a specific biological event or system, and which may be useful for medical diagnosis''.
Most biosignals are electrical, collected by electrodes placed on specific parts of the body (e.g., head for electroencephalography, chest and limbs for electrocardiography, eye region for electrooculography), generally in a noninvasive way. Moreover, with the spread of wearable devices, the acquisition and collection of various types of biological signals has become much easier, which favors their exploitation for clinical tasks [26]. For example, Continuous Glucose Monitoring (CGM) devices can help diabetic people manage their disease by detecting real-time variations of the blood glucose concentration at intervals of usually one, three, or five minutes [27]. Moreover, other devices like smart wristbands (e.g., Empatica © E4) can simultaneously record different types of biosignals by means of multiple sensors, introducing the possibility to combine their acquisitions with other types of data to improve the diagnosis and prognosis of several pathologies. Biosignals collected by wearable devices may include blood volume pulse, electrodermal activity (i.e., variation in the electrical properties of the skin), temperature readings, motion-based activity data, and many more.
Figure 1. Left: ECGs [22]; the healthy subject is subject 18, and the pathological subject (atrial flutter) is subject 33. Middle: single-channel EEGs selected from the BONN EEG dataset [23]; the healthy subject is subject 2 from set B, while the pathological (epileptic) subject is subject 6 from set E. Right: single-channel EMGs selected from the NinaPro dataset [24]; the intact right-handed subject is subject 16 from dataset 2, while the amputated right-handed subject is subject 4 from dataset 3. Note how, regardless of their type, it is possible to spot some differences in amplitude and/or waveforms between normal and abnormal biosignals.
Giving a complete description of all the biosignals is beyond the scope of this work. Nevertheless, it is important to at least introduce the main ones encountered during the survey:
• Electrocardiography (ECG): this type of signal records the electrical activity of the heart. ECGs are generally recorded with the 12-lead method, which consists of placing ten electrodes, six on the chest and the remaining four on the limbs, to calculate a set of electric potentials. Each combination of electrode measurements gives a distinct quantitative and spatial view of the heart's electrical activity, called a lead. An ECG machine processes the information coming from all 12 leads to produce a graphical representation. ECGs possess a particular structure (P wave, QRS complex, ST segment, T wave, and U wave) given by the sequential depolarization and repolarization of the heart's atria and ventricles. Variations in the amplitude, timing, or frequency of these structures provide information about the normal or abnormal activity of the heart, thus supporting the diagnosis of a particular pathology [28] (see exemplary ECGs provided in the left part of Figure 1);
• Electroencephalography (EEG): this type of signal records the electrical activity of the brain cells generated by the exchange of ions between the inside and the outside of the neurons. EEGs are usually recorded by placing several electrodes on the subject's scalp in specific configurations, which can vary depending on the number of electrodes and the study objective. EEG signals are highly complex and are usually analyzed both in the time and frequency domains. In fact, clinically relevant information for diagnostic and prognostic purposes can be retrieved by looking at specific frequency bands of the signal, namely: delta (0.3-4 Hz), theta (4-8 Hz), alpha (8-14 Hz), beta (14-30 Hz), and gamma (>30 Hz). EEGs are widely adopted by neuroscientists for cognitive tasks as well as for the study of several neurological disorders such as epilepsy (see exemplary EEGs provided in the middle part of Figure 1), dyslexia, and mental diseases [29];
• Electromyography (EMG): this type of signal records the electric currents that are generated during muscle contraction. EMGs are usually recorded by surface electrodes, but more invasive types like needle electrodes can be adopted to improve the signal-to-noise ratio and to get access to single motor unit action potentials (MUAP). EMGs are generally used to detect anomalies in the activity of the muscles (e.g., myopathy, neuropathy) as well as in biomechanics for the development of body prosthetics [30] (see exemplary EMGs provided in the right part of Figure 1);
• Other types of biosignal: other biosignals used for various clinical tasks and therefore worth mentioning are magnetoencephalography (MEG), which measures the magnetic field generated by the activity of brain cells and has many applications such as brain connectivity, cognitive studies on newborns, and epilepsy research; phonocardiography (PCG), which measures the sound produced by the heartbeat and is used for the detection of heart diseases; electroretinography (ERG), which measures the electrical activity of various cell types in the retina and is mainly used for diagnostic purposes; electrooculography (EOG), which measures the electric potential that is generated by the cornea and the retinal activity during eye movement; and eye tracking data, which measure the orientation of the eye in space or the position of the eye with respect to the subject's head [31]. Unfortunately, researchers have not yet used most of these signals to train deep learning models in a self-supervised way. However, future works may include them, especially in multimodal approaches.
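As a small illustration of the EEG frequency bands listed above, the following sketch maps a frequency to its band name. The band edges are the ones given in the text; treating each band as a half-open interval (so that a shared edge such as 4 Hz belongs to exactly one band) and the names `EEG_BANDS` and `eeg_band` are assumptions made here for illustration only.

```python
# Band edges in Hz, as listed in the text; gamma has no upper bound.
EEG_BANDS = [
    ("delta", 0.3, 4.0),
    ("theta", 4.0, 8.0),
    ("alpha", 8.0, 14.0),
    ("beta", 14.0, 30.0),
    ("gamma", 30.0, float("inf")),
]


def eeg_band(freq_hz):
    """Return the band containing freq_hz, or None below the delta band.

    Assumption: every band is the half-open interval [low, high), so a
    frequency on a shared edge is assigned to the higher band.
    """
    for name, low, high in EEG_BANDS:
        if low <= freq_hz < high:
            return name
    return None


dominant_rhythm = eeg_band(10.0)  # a 10 Hz rhythm falls in the alpha band
```

In practice, band-specific analysis would be carried out on a spectral estimate of the recorded signal; this lookup only makes the band boundaries explicit.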

III. SELF-SUPERVISED LEARNING
Training deep neural networks with fully supervised methods requires large amounts of data. In medical research, however, it is usually difficult to assemble very large datasets. The acquisition of medical data is in fact expensive in terms of time, costs, and administrative procedures (e.g., ethics). It also requires specific instrumentation and human volunteers. Moreover, data annotation can be performed only by medical experts in a laborious and time-consuming process. Finally, medical data are highly heterogeneous (e.g., instrumentation, acquisition protocols and settings, subject variability), which inherently affects the model's robustness and generalization capability [32].
In contrast, the amount of unlabeled data is enormous. For this reason, researchers have started to investigate new methodologies to exploit unlabeled data [33] such as semi-supervised learning [34], weakly-supervised learning [35], or self-supervised learning, as described in this section.
Self-supervised learning attempts to address the issue of having limited annotated data by extracting general-purpose features from vast unlabeled data [36]; hence, it is usually referred to as an unsupervised technique. Despite that, self-supervised learning differs from common unsupervised methods like clustering [37] or Principal Component Analysis (PCA) [38]. In particular, clustering techniques aim at finding groups of similar objects by agglomerating or separating samples based on specific metrics (distance functions) designed to evaluate the degree of dissimilarity between the investigated data. PCA is instead used to reduce the dimensionality of a dataset by finding new variables that are linear functions of the original ones, that successively maximize variance, and that are uncorrelated with each other [39]. Both techniques are mainly used as exploratory tools for data analysis to infer statistical properties of the investigated feature set. Moreover, they do not include any type of label, nor are they used to predict some outcome from unobserved data, as in supervised approaches. On the contrary, SSL aims at predicting part of its input from other parts of its input, converting the unsupervised problem into a supervised one (hence its name). As will be clarified in the next section, self-supervised learning can generate its own form of supervision directly from the data; hence, it can use far more supervisory signals than standard fully supervised approaches. For this reason, it is more proper and less misleading to allocate SSL algorithms to a separate category rather than trying to associate them with other unsupervised methods.
Figure 2 summarizes how the self-supervised learning paradigm works. First, a deep neural network is trained to solve an auxiliary task, also called pretext task, upstream task, or simply pretask, whose primary goal is to learn general-purpose features of the given data without having access to any sort of external supervision. During this phase, no information about the target (medical) task or the real (physiological) meaning of the given data is explicitly used. Moreover, little interest is given to the model's performance on the pretext task itself, as the pretext task (often) has no connection to the target one, and it is designed with the assumption that solving it requires the network to learn useful information intrinsic to the data; in other words, to model them. Although pretraining strategies can differ greatly from each other, this phase usually includes the generation of artificially created pseudo-labels from the unlabeled dataset, here used as the target variable. Training samples are then fed to the model to predict the constructed target. Finally, model predictions are used to calculate the value of a given objective function, which is then used to update the model weights with backpropagation. Once the model is pretrained, the weights of its feature extractor (encoder) are transferred to a new model, which will be trained to solve the target task, usually called downstream task. The new model shares the same backbone structure, while its head (final set of hidden layers) is slightly modified to make it compatible with the downstream task, for example by adding a softmax or a regression layer in case of classification or regression problems, respectively. Model transfer is performed by applying transfer learning, a method that consists of employing the knowledge that has been learned in a source task (here upstream task) to another target task (here downstream task) to improve the performance and generalization capability of the new model [40].
The final step, which is performed after the encoder's weights are transferred to the new downstream model, consists of learning more task-specific features using the limited amount of labeled data in a process called fine-tuning. The fine-tuning phase shares many similarities with a standard fully supervised training procedure; the main difference resides in the fact that the model weights, instead of being randomly initialized, originate from the solved pretext task. Another important difference is that, as described in [21], it is common practice to divide the fine-tuning process into two steps. The first step consists of freezing all the backbone weights and updating only the modified final hidden layers; the second concludes the training process with the whole network unfrozen.
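The workflow described above (pretext training, encoder transfer, and two-step fine-tuning) can be sketched with a toy model. This is a minimal illustration under stated assumptions, not a real training loop: the model is a plain dictionary of parameter groups, the gradients are hard-coded placeholders standing in for a backward pass, and all names (`new_model`, `sgd_step`, `transfer`, `pretext_head`, `task_head`) are hypothetical.

```python
import copy


def new_model(head_name):
    """Toy model: two parameter groups, each with weights and a frozen flag."""
    return {
        "encoder": {"w": [0.1, -0.2], "frozen": False},
        head_name: {"w": [0.0], "frozen": False},
    }


def sgd_step(model, grads, lr=0.1):
    """Apply one gradient step to every parameter group that is not frozen."""
    for name, group in model.items():
        if group["frozen"]:
            continue
        group["w"] = [w - lr * g for w, g in zip(group["w"], grads[name])]


def transfer(pretrained, head_name):
    """Build the downstream model: copy the encoder, attach a fresh head."""
    model = new_model(head_name)
    model["encoder"]["w"] = copy.deepcopy(pretrained["encoder"]["w"])
    return model


# 1) Pretext phase (placeholder gradients instead of a real backward pass).
pretext = new_model("pretext_head")
sgd_step(pretext, {"encoder": [1.0, 1.0], "pretext_head": [1.0]})

# 2) Transfer: the pretext head is discarded, the encoder weights are kept.
downstream = transfer(pretext, "task_head")

# 3a) Fine-tuning, first step: backbone frozen, only the new head is updated.
downstream["encoder"]["frozen"] = True
sgd_step(downstream, {"encoder": [1.0, 1.0], "task_head": [1.0]})
encoder_after_step_one = list(downstream["encoder"]["w"])

# 3b) Fine-tuning, second step: the whole network is unfrozen and updated.
downstream["encoder"]["frozen"] = False
sgd_step(downstream, {"encoder": [1.0, 1.0], "task_head": [1.0]})
```

After the first fine-tuning step the encoder still holds the pretext weights, while the second step changes both the encoder and the head. In a deep learning framework, the frozen flag would typically correspond to disabling gradient computation for the backbone parameters.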
In conclusion, self-supervised learning, although more complex than a standard fully supervised approach, is expected to help improve accuracy and mitigate overfitting in contexts where the amount of labeled data is limited or where multiple heterogeneous datasets can be aggregated, as is often the case in the biomedical context.

IV. PRETEXT TASKS
Pretext tasks are the core of the self-supervised learning paradigm. Although they are designed for the same purpose, which is to learn general-purpose features without having access to manually annotated data, pretext tasks can accomplish their goal in different ways. Some of them have been developed for specific types of data, such as the Rubik's cube method for 3D images (e.g., magnetic resonance imaging) [41]. Others are more versatile and allow researchers to work with different types of data. This section focuses on approaches that are compatible with biosignals and that were encountered during the survey. Various classification schemes have been proposed to organize the pretext tasks, depending on the domain of application [42].
Here, methodologies will be grouped into the following three categories: predictive, generative, and contrastive learning pretext tasks.

A. PREDICTIVE PRETEXT
Predictive pretext is a family of supervised pretraining methods characterized by the construction of classification or regression problems as an auxiliary task. This approach makes use of artificially created pseudo-labels, which are assigned to the unlabeled data, to pretrain the model in a supervised way. The generation of pseudo-labels, which needs to be automatic and knowledge-free, is what really differentiates one approach from the other. For example, one can simply construct a transformation recognition problem, where single or multiple transformations (e.g., scaling, permutation, time shift, noise addition) are applied to the original sample with the goal of predicting or classifying them. Other approaches exploit specific (biological) properties of the signal and construct more complex targets to predict [43]. An example of such a strategy can be found in [44]. Here, the authors applied two different sets of transformations to EEG data to produce abnormal samples. The first transformation amplifies portions of the signal in the time domain, while the second alters the original sample in the frequency domain. Both original and transformed samples were fed to the pretraining model to predict the type of transformation, thus building a 3-class classification problem. The model was pretrained to optimize the cross-entropy loss, and its head (the final softmax layer) was discarded during model transfer. Similar protocols can be applied to other biosignals or to build regression predictive pretext tasks. For example, the authors in [45] built a regression task based on the prediction of features extracted directly from the ECG signal (characteristic intervals and amplitudes). Predictive pretext tasks are fairly easy to implement and do not require many computational resources compared to other methods. However, the specificity of the task has a strong impact on the quality of representations. Therefore, careful consideration must be given to its design, as wrong choices could degrade model performance.
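The transformation recognition idea can be sketched as follows. This is a hedged illustration rather than the protocol of any surveyed work: the chosen transformations, their parameters (`factor`, `shift`, `sigma`), and the helper names are assumptions made here. Each unlabeled signal receives a randomly chosen transformation, and the index of that transformation becomes the pseudo-label a classifier would be pretrained to predict.

```python
import random


def identity(x, rng):
    """Class 0: the signal is left unchanged."""
    return list(x)


def scale(x, rng, factor=2.0):
    """Class 1: amplitude scaling."""
    return [factor * v for v in x]


def time_shift(x, rng, shift=3):
    """Class 2: circular shift along the time axis."""
    shift %= len(x)
    return x[-shift:] + x[:-shift]


def add_noise(x, rng, sigma=0.1):
    """Class 3: additive Gaussian noise."""
    return [v + rng.gauss(0.0, sigma) for v in x]


TRANSFORMS = [identity, scale, time_shift, add_noise]


def make_pretext_dataset(signals, seed=0):
    """Pair each transformed signal with the index of its transformation."""
    rng = random.Random(seed)
    dataset = []
    for x in signals:
        label = rng.randrange(len(TRANSFORMS))  # automatic pseudo-label
        dataset.append((TRANSFORMS[label](x, rng), label))
    return dataset


# Five unlabeled toy signals of eight samples each.
signals = [[float(i) for i in range(8)] for _ in range(5)]
pretext_dataset = make_pretext_dataset(signals)
```

No expert annotation is involved: the labels are produced by the data pipeline itself, which is what makes the pretext task automatic and knowledge-free.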

B. GENERATIVE PRETEXT
Generative pretext [46] is a family of unsupervised methods widely used in natural language processing (e.g., BERT [47]) that is experiencing a revival in other domains such as computer vision and signal processing [46]. Its goal is to learn general-purpose features by learning either to regenerate an augmented version of the input data or to generate new samples from the same distribution as the training repository. Since the pretext task is treated as a generative problem, architectures like auto-encoders [48] or Generative Adversarial Networks (GAN) [49] are utilized in this category. In the signal domain, the most adopted generative pretext task is masked modeling, whose goal is to learn robust representations by reconstructing a portion of the signal that was previously cropped or masked. Masked modeling is widely adopted for other types of data as well. Two examples are the masked autoencoders for imaging data [50] and the work presented in [51] for audio data. Another example of such a strategy can be found in [52]. Here, the authors applied a set of transformations to EEG data to generate new corrupted samples. Such transformations include not only the cited masking operation but also others typically employed in predictive pretext tasks, such as noise addition, moving average filtering, or EEG channel dropout. However, in contrast to predictive pretraining strategies, no artificial pseudo-labels were generated, and the original samples were used directly as the predictive target. In this setting, the Mean Square Error (MSE) between the output of the model and the original sample was used as the objective function to evaluate the quality of the reconstructed samples and update the model weights. Practical challenges associated with generative pretexts, such as the higher computational costs and repository size required to efficiently pretrain the model, make this approach less frequently adopted than other pretext tasks. In fact, GAN-based approaches like the one proposed in [53] require learning two different neural blocks: a generator, responsible for creating new samples, and a discriminator, responsible for distinguishing between the original and the generated sample, which is the only block that is kept after pretraining. The presence of two different neural blocks, usually with numerous parameters, inevitably increases the computational demand and, consequently, the training time and the needed GPU memory.
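The masked modeling objective can be illustrated with a short sketch. It assumes a zero-filled mask over a contiguous segment and uses the corrupted input itself as a stand-in for the output of an untrained model; the function names `mask_segment` and `mse` are hypothetical and do not come from any surveyed work.

```python
def mask_segment(x, start, length, fill=0.0):
    """Return a copy of x whose samples in [start, start + length) are masked."""
    return [fill if start <= i < start + length else v for i, v in enumerate(x)]


def mse(prediction, target):
    """Mean square error between a reconstruction and the original sample."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


# The original sample is the reconstruction target; no pseudo-label is needed.
original = [float(i) for i in range(8)]
corrupted = mask_segment(original, start=2, length=3)

# Stand-in for a model output: an untrained "identity" model returns its input.
untrained_loss = mse(corrupted, original)

# A perfect reconstruction drives the objective to zero.
perfect_loss = mse(original, original)
```

In an actual pretraining run, the first argument of `mse` would be the output of the encoder-decoder network fed with the corrupted sample, and the loss would drive the weight updates.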

C. CONTRASTIVE LEARNING PRETEXT
Contrastive Learning (CL) is a family of methods that aims at learning robust general-purpose representations from the data by embedding augmented versions of the same sample close to each other while trying to push away representations from different samples [54]. This goal is achieved either by learning to discriminate between similar (positive) and dissimilar (negative) samples, or by maximizing only the agreement between pairs of similar views. Data augmentation is the core of contrastive learning. Positive and negative samples are generated by applying a set of transformations to the original sample (e.g., noise addition, scaling, permutation, horizontal or vertical flip), which aim at introducing some differences while at the same time preserving the data's global features. Contrastive learning has gained enormous attention due to its simplicity and effectiveness in training general-purpose encoders. For this reason, a large variety of approaches can be found in the literature, usually employing siamese architectures (weight-sharing neural networks applied on two or more inputs) [55] to compare the augmented samples. Only the baseline methodologies applied in the works selected during the survey are reported here; their schematic views are collected in Figure 3:
(a) CPC (2019): Contrastive Predictive Coding (CPC) is a modality-agnostic framework designed to suit any type of data (e.g., images, text, speech, signals) [56]. Its goal is to predict high-level information of future time steps of a sample given a series of past ones. However, instead of simply trying to predict future observations, CPC aims to learn the underlying shared information between different parts of the (high-dimensional) signal.
(b) SimCLR: [...] the loss for a positive pair of projections (z_i, z_j) takes the form
$$\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k \neq i} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}$$
with sim(z_i, z_j) the cosine similarity between two projections, and τ the temperature parameter.
(c) MoCo (2020): Momentum Contrast (MoCo) [59] is a method that, in its updated version (MoCo V2 [60]), outperformed end-to-end frameworks like SimCLR. MoCo approaches the problem of learning good representations by performing look-up operations on a large dictionary rich in negative examples, which is continuously updated to keep it consistent during training. The dictionary, which can be considered an improvement of the memory bank introduced in [61], lessens the memory burden while keeping the number of negative pairs sufficiently high. In fact, the dictionary size can be much larger than a typical batch size and is treated as a queue, where newer keys from the current mini-batch are enqueued and the oldest ones are dequeued. MoCo relies on two networks. The first is the online network (parametrized by θ), which is responsible for generating a set of projections z (as in SimCLR). The second is the momentum network (parametrized by ξ), which is responsible for encoding the new dictionary keys k to be enqueued. The online network is trained to optimize the InfoNCE loss and is updated through stochastic gradient descent. On the contrary, since the dictionary does not allow back-propagation on the momentum network, the latter is updated with an exponential moving average of the online network weights, defined as:
$$\xi \leftarrow m\,\xi + (1 - m)\,\theta$$
with m ∈ [0, 1) the momentum coefficient, usually larger than 0.995.
(d) BYOL (2020): Bootstrap Your Own Latent (BYOL) is a method that, unlike SimCLR or MoCo, uses neither negative pairs nor contrastive losses [62]. In particular, BYOL sets up a regression task as the learning problem, where the embedding of one augmented version of a sample is used to predict the embedding of another augmented version of the same data. It is important to note that using only positive pairs could potentially lead to a collapsing solution [63], i.e., the tendency of siamese architectures to ''collapse'' to a constant output. However, the authors empirically demonstrated that an asymmetrical architecture and the momentum encoder could avoid this problem. As can be seen in Figure 3(d), BYOL adopts two different networks to learn. The first is the online network (parametrized by θ), which adds a predictor block q_θ after the usual encoder and projector blocks, and is used to make the predictions. The second is the target network (parametrized by ξ), which is used to provide the regression target to be predicted by the online network. During training, two augmented versions of a sample are fed into the networks. Then, the outputs are ℓ2-normalized and the mean square error (MSE) is calculated. Finally, the online network is updated through stochastic gradient descent, while momentum updates are used to change the target network weights.
(e) SimSiam (2020): Simple Siamese Representation Learning (SimSiam) can be considered a simplified version of BYOL without the momentum encoder [64]. The key element of this minimalist approach is the stop-grad operation. The authors empirically showed that this operation is sufficient to avoid the collapsing solution and that no momentum encoder, as in BYOL, is necessary. However, the gain in simplicity is counterbalanced by a slight drop in performance.
(f) SwAV (2020): Swapping Assignments Between Views (SwAV) is a cluster-discrimination-based framework [65]. SwAV does not directly compare features extracted from different transformations of the same sample, as in previous approaches. Instead, it combines online clustering with a swapped prediction mechanism to enforce consistency mapping between augmentations of the same original data. Figure 3(f) illustrates the structure of SwAV. In particular, each augmented sample is fed into an encoder to produce a vector representation. Representations are then ℓ2-normalized and mapped to a set of trainable normal vectors, i.e., prototypes, thus computing a ''code'' Q_i. In other words, prototypes can be considered as the clusters into which the data are being partitioned, and codes as the results of the online clustering. Finally, with the assumption that different views of the same image should maintain similar information, the model is trained to predict the cluster assignment of a view from the representation of another view (swapping prediction).
Aside from the ones already listed, other contrastive learning methods can be employed with time-series data. Such methods mainly differ in the structure of the network, the formulation of the contrastive loss, and the way negative samples are exploited. A few examples are PIRL [66], Barlow Twins [67], VICReg [68], W-MSE [69], TNC [70], MoCo V3 [71], and DINO [72]. The interested reader can consult the work of Balestriero et al. [73], which provides a more detailed analysis of the self-supervised learning paradigm, with many insights on critical aspects of its implementation.
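Two ingredients mentioned in the list above, a contrastive loss built on the cosine similarity with temperature τ and the exponential moving average used to update a momentum network, can be sketched in a few lines. The loss is written here in the normalized temperature-scaled cross-entropy form commonly associated with SimCLR, which is an assumption of this sketch; the toy projections, the default values of `tau` and `m`, and the function names are likewise illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two projection vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_pair_loss(z, i, j, tau=0.5):
    """Loss for the positive pair (i, j); all other projections act as negatives."""
    numerator = math.exp(cosine_similarity(z[i], z[j]) / tau)
    denominator = sum(
        math.exp(cosine_similarity(z[i], z[k]) / tau)
        for k in range(len(z))
        if k != i
    )
    return -math.log(numerator / denominator)


def ema_update(xi, theta, m=0.999):
    """Momentum update: xi <- m * xi + (1 - m) * theta, element-wise."""
    return [m * x + (1.0 - m) * t for x, t in zip(xi, theta)]


# Toy projections: z[0] and z[1] are two views of the same sample.
z = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, -1.0]]
loss_true_pair = contrastive_pair_loss(z, 0, 1)
loss_wrong_pair = contrastive_pair_loss(z, 0, 2)

# With m = 0.5 the momentum weights move halfway toward the online weights.
momentum_weights = ema_update([1.0, 1.0], [0.0, 2.0], m=0.5)
```

The loss is lower when the two views of the same sample are treated as the positive pair than when an unrelated projection is, which is the behavior the objective rewards. Values of `m` close to 1 make the momentum network evolve slowly relative to the online network.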

D. NOVEL METHODS FOR TIME-SERIES DATA
The previously cited baseline methods were used as a reference for the development of novel approaches for time-series data that were not specifically designed for medical applications but were still tested on medical repositories. For example, Cheng et al. [74] proposed a subject-aware contrastive learning method for biosignals whose core element was the addition of an adversarial subject identifier module to promote subject-invariance during pretraining and mitigate the negative effects of inter-subject variability. Gorade et al. [75] proposed a BYOL-based approach that combines two different sets of projector plus predictor designed to extract, respectively, low- and high-frequency characteristic features from the embedding. Zhang et al. [76] developed a contrastive pretraining method that promoted the alignment of time- and frequency-based representations projected in a shared latent space. Finally, Wickstrøm et al. [77] proposed a novel contrastive learning approach that combined a custom contrastive loss with a new data augmentation scheme designed to generate new data by mixing two training samples. All methods listed in this subsection demonstrate that ideas from other research areas can be successfully imported into the medical domain. However, as will be discussed in Section VII, special considerations about the physiological nature of the signal and the target clinical task must be taken into account in order to avoid failures in the application of SSL strategies.
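To make the idea of generating new data by mixing two training samples more concrete, the snippet below shows a generic mixup-style convex combination of two time series. It is a sketch under assumptions: the function name, the signal shapes, and the Beta(0.2, 0.2) distribution for the mixing coefficient are illustrative choices and are not claimed to match the exact scheme of [77].

```python
import numpy as np

def mix_samples(x_a, x_b, lam):
    """Convex combination of two equally shaped time series."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * x_a + (1.0 - lam) * x_b

# Illustrative usage on two random single-channel signals of 256 samples.
rng = np.random.default_rng(0)
x_a = rng.standard_normal((1, 256))
x_b = rng.standard_normal((1, 256))
lam = rng.beta(0.2, 0.2)          # assumed mixing distribution
x_mixed = mix_samples(x_a, x_b, lam)
```

A contrastive loss can then use the mixing coefficient as a soft target, asking the model to recover how much of each original sample is present in the mixed one.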

V. SURVEY METHODOLOGY
This section summarizes the methodology followed to search and identify relevant literature on self-supervised learning for the analysis of biosignals. In brief, it consists of a first selection of papers from various literature sources, followed by multiple exclusions, if necessary, using specific criteria. For the literature search, the following bibliographic databases were used as primary references: 6 In addition, the research was extended to other sources of literature, namely: • Google Scholar and use Google Scholar only to double-check and refine the research in case of possible missing works. Moreover, we carefully checked arXiv preprints before considering their inclusion, as they did not undergo a full peer-review process.
Each source was queried by combining a set of selected keywords, using at first only general terms (e.g., self-supervised learning, contrastive learning, time-series, biosignal); then, the search was refined by adding more specific terms related to each type of biosignal (e.g., ECG, EEG, EMG, EOG). An example of such an approach, using the Google Scholar query format for simplicity, is reported below:
1) "Self-Supervised Learning" AND "time-series";
2) "Self-Supervised Learning" AND "biomedical";
3) "Self-Supervised Learning" AND "biosignals";
4) "Self-Supervised Learning" AND "wearable sensors";
5) "Self-Supervised|Contrastive Learning" AND "Electrocardiogram|ECG";
6) "Self-Supervised|Contrastive Learning" AND "Electroencephalography|EEG";
7) "Self-Supervised|Contrastive Learning" AND "Electromyography|EMG";
8) "Self-Supervised|Contrastive Learning" AND "Electrooculogram|EOG".
Self-supervised learning is a novel technique that has only recently made its way into medical research. In addition, this field changes quickly and the state of the art can rapidly evolve. For this reason, only works published no earlier than 2016 were considered, focusing on the period 2019-2023, when their number increased considerably. In particular, we included only those papers that adopted self-supervised learning on biosignals to solve medical tasks. We also considered publications that present novel SSL methods not specifically designed for medical tasks but still tested on biomedical datasets, gathered for organizational reasons in subsection IV-D. From the selected list, we excluded works that adopted the same SSL methodology to solve a particular task or works that have been updated by another one. In those cases, we kept the one that we considered the most relevant by weighing several factors such as the number of citations, the impact factor of the journal or conference, and the type of work. Finally, we considered research works cited in the bibliography or in the related works sections of the selected papers.
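The eight example queries listed in this section follow a regular pattern (one method term combined with one topic term), so they can be generated programmatically. The helper below is a hypothetical convenience sketch, not a tool used by the authors; it only reproduces the query strings given in the text.

```python
from itertools import product

def build_queries(method_terms, topic_terms):
    """Return one quoted 'A AND B' query per (method, topic) pair."""
    return [f'"{m}" AND "{t}"' for m, t in product(method_terms, topic_terms)]

# General terms first, then biosignal-specific refinements.
general = build_queries(
    ["Self-Supervised Learning"],
    ["time-series", "biomedical", "biosignals", "wearable sensors"],
)
specific = build_queries(
    ["Self-Supervised|Contrastive Learning"],
    ["Electrocardiogram|ECG", "Electroencephalography|EEG",
     "Electromyography|EMG", "Electrooculogram|EOG"],
)
queries = general + specific
```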

VI. SELF-SUPERVISED LEARNING ON BIOSIGNALS
The survey resulted in a selection of 61 works describing SSL applications for the analysis of biosignals. As can be seen in Figure 4, there is a high imbalance between applications on ECG or EEG signals and other types of data, probably associated with the higher availability of public datasets. Taking this into account, the results were grouped into four categories, namely: SSL on ECG, SSL on EEG, SSL on other types of biosignals, and multimodal SSL with biosignals. Each category will present the investigated medical tasks and the adopted pretraining strategies, delving into those works that present novel SSL approaches. At the end of each section, a summary table reports a synthesis of the main information for each of the presented works, namely: upstream task, downstream task, datasets used, year of release, and, if necessary, the type of data. Tables were sorted by year and author names. Moreover, works sharing the same downstream task were grouped to improve the organization and consultation of large tables.

A. SELF-SUPERVISED LEARNING ON ECG
ECG is one of the two major categories of biosignals where self-supervised learning has been adopted until now. Out of all the investigated tasks, classification of cardiac pathologies (e.g., arrhythmia) plays a central role in SSL ECG-based analysis, with 19 out of 23 works evaluating such a medical task. This aspect reflects the high demand for integrating deep learning models into decision support systems to be used in real-world scenarios, which still suffer from the great variability associated with such data. In fact, most of the datasets listed in Table 1 are open repositories released by large hospitals or collected for highly competitive challenges (CinC datasets) organized to improve ECG medical analysis.
In contrast to the identified trend in the investigated downstream tasks, the choice of the pretraining strategy is highly variable but reveals an overall preference for contrastive learning pretext tasks (see Table 1). Moreover, the survey reveals that works often attempted to exploit biological properties of the ECG signal during pretraining, for example its peculiar waveform [45], its periodicity [80], [83], or its associated variability [89].

1) CARDIAC PATHOLOGY
Concerning single- or multi-class pathology classification tasks, contrastive learning was the primary choice in most of the studies. Nakamoto et al. [109] adapted the baseline MoCo for left ventricular systolic dysfunction detection, while Lai et al. [114] improved the loss of the same algorithm to recognize 60 diagnostic terms on a large-scale private dataset. Mehari and Strodthoff [108] compared baseline contrastive learning approaches (e.g., SimCLR, BYOL, SwAV, CPC) to assess their ability to extract good representations from the ECG signal, while Soltanieh et al. [110] provided an extensive analysis of the efficacy of different data augmentations. Lee et al. [107] proposed a variant of the contrastive learning algorithm VICReg (VIbCReg) which slightly modified the loss function and included, after the projector of the siamese network, an additional iterative normalization layer. Gopal et al. [106] leveraged the unique spatiotemporal properties of the ECG signal by adopting a physiologically inspired 3D augmentation technique to generate the positive pairs for the contrastive learning pretraining phase. Liu et al. [112] proposed a joint cross-dimensional contrastive learning method that consists of pretraining the model to maximize the similarity between positive pairs of ECG signals as well as between the ECG and its 2-D image representation. It is important to note that contrastive learning was not the only choice, and other pretext tasks were investigated for this type of problem. In particular, Yang et al. [111] and Gedon et al. [104] adopted masked modeling for the representation learning part of the model, both achieving results comparable or slightly superior to fully supervised training.
Given the large amount of supervision that can be provided by some of the available free open datasets (e.g., PTB-XL [22]), results were not always superior to fully supervised state-of-the-art methods but still comparable. For example, Liu et al. [112] reported an absolute drop in accuracy and F1-macro score of, respectively, 0.012 and 0.033 on the PTB-XL dataset, with similar results on the CPSC2018 dataset [78]. However, comparable performances were achieved using only half the available labels, showing the robustness of SSL strategies against drops in the label ratio. This demonstrates that self-supervision has the potential to improve the learning process but requires further advances before becoming the new gold standard for those challenging problems.

2) ARRHYTHMIA CLASSIFICATION
Among the pathologies investigated, arrhythmia seems to be the most common use case. In contrast to the previous subsection, the investigated pretraining strategies were heterogeneous, with at least one work for each category (predictive, generative, and contrastive) selected during the survey. Kiyasseh et al. [83] proposed CLOCS, a family of patient-specific contrastive learning methods that they showed were able to outperform other baseline contrastive learning techniques, thus becoming the comparison element for other works. In particular, they presented three different approaches: the first, Contrastive Multi-Segment Coding (CMSC), exploits the temporal invariances in the ECG; the second, Contrastive Multi-Lead Coding (CMLC), exploits the spatial invariances in the ECG; and the third, Contrastive Multi-Segment Multi-Lead Coding (CMSMLC), combines the two previous methods. Oh et al. [92] combined Kiyasseh's CMSC with a transformer-based self-supervised method (based on Wav2Vec 2.0 [115]), achieving performances superior to CLOCS. Moreover, they added a random lead masking module, which improves the model's robustness in the case of downstream tasks that accept an arbitrary number of ECG leads. CLOCS was also considered as a comparison element by Chen et al. [82], who ported MoCo V2 to the ECG domain and used a combination of wavelet transform and random crop to generate the positive and negative pairs for the pretraining. Finally, Phan et al. [90] combined representations coming from both time and time-frequency modalities managed by two different backbone encoders pretrained with DINO [72]. In contrast with previous works, Lan et al. [89] designed an Intra-Intersubject Self-supervised Learning (ISL) method for multivariate cardiac signals that tries to learn good representations of the ECG signal by learning distinct representations both at the heartbeat level (intra-subject) and at the subject level (inter-subject). The last selected contrastive learning proposal for arrhythmia classification is that of Wei et al. [87] with their "Contrastive HeartBeat", a novel method designed to learn patient-specific representations at the heartbeat level by considering as positive pairs all heartbeats of the same subject and as negative the others.
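The patient-specific pairing logic shared by several of the methods above can be illustrated with a short sketch. It assumes, for illustration only, that each recording is split into two adjacent non-overlapping segments (in the spirit of CMSC) and that two views form a positive pair whenever they come from the same patient; the function names and array layout are hypothetical and do not reproduce the CLOCS implementation.

```python
import numpy as np

def split_adjacent_segments(recording):
    """Split a (leads, time) recording into two non-overlapping halves."""
    half = recording.shape[-1] // 2
    return recording[..., :half], recording[..., half:2 * half]

def positive_mask(patient_ids):
    """mask[i, j] is True when view-1 item i and view-2 item j share a patient."""
    ids = np.asarray(patient_ids)
    return ids[:, None] == ids[None, :]

# Illustrative usage: a 2-lead recording with 10 time samples.
recording = np.arange(20).reshape(2, 10)
first_half, second_half = split_adjacent_segments(recording)
mask = positive_mask(["p1", "p2", "p1"])
```

The Boolean mask can then select which entries of a batch similarity matrix are pulled together by the contrastive loss, with all remaining entries acting as negatives.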
Again, contrastive learning was not the only approach investigated. Grabowski et al. [91] tested masked modeling on both classification and regression downstream problems. Furthermore, Zhang et al. [94] applied spatial and temporal signal manipulation to generate pseudo-labels for their predictive pretext task (transformation prediction). A predictive pretext task was also chosen by Lee et al. [45] and Luo et al. [80] for arrhythmia detection and classification. The first constructed a pretraining based on the prediction of specific critical features extracted from the heartbeat with ECG delineation algorithms, while the second pretrained the model to assess whether randomly selected pairs of ECG segments were adjacent or not.
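A transformation-prediction pretext task of the kind mentioned above can be sketched as follows: each unlabeled signal is modified by a randomly chosen transformation, and the index of that transformation becomes the pseudo-label the model must predict. The specific set of transformations and the noise level are assumptions chosen for illustration and are not taken from the cited works.

```python
import numpy as np

# Hypothetical transformation set; each entry is (name, function).
TRANSFORMS = [
    ("identity", lambda x, rng: x),
    ("negate", lambda x, rng: -x),
    ("time_reverse", lambda x, rng: x[..., ::-1]),
    ("add_noise", lambda x, rng: x + 0.05 * rng.standard_normal(x.shape)),
]

def make_pretext_batch(signals, rng):
    """Apply one random transform per signal; return inputs and pseudo-labels."""
    labels = rng.integers(0, len(TRANSFORMS), size=len(signals))
    inputs = np.stack([TRANSFORMS[k][1](x, rng)
                       for x, k in zip(signals, labels)])
    return inputs, labels
```

The pseudo-labels come for free from the data manipulation itself, which is what allows pretraining without expert annotations.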
As in the previous subsection, the large amount of supervision that can be provided by some of the available free open datasets resulted in model performances that were not always superior to their fully supervised counterparts. For example, Kiyasseh et al. [83] reported an absolute drop in AUC ranging from 0.02 to 0.04 on different fine-tuning datasets, while Chen et al. [82] achieved an AUC improvement of 0.03 with a similar experimental setting. However, as reported in [83], [89], self-supervised learning was able to produce similar results even when the fraction of labeled data used for fine-tuning was halved, thus mitigating the performance drop compared to other strategies. Finally, it is worth noting that it is possible to identify a positive progression in the achieved results (see the use of CLOCS in many other works as a baseline comparison), guided by the proposal of novel methods able to include both physiological and subject-related information during pretraining.

3) EMOTION CLASSIFICATION
Stress detection and emotion classification (e.g., prediction of the affective score) were the other two investigated downstream tasks. In particular, stress detection was studied by Rabbani et al. [103] and Sarkar et al. [98]. The first adopted the baseline SimCLR, while the second tried to assess maternal and fetal stress during pregnancy using a predictive multitask pretraining (e.g., prediction of different signal transformations). The same authors extended this method to the emotion recognition problem [95], which was also studied by Rodriguez et al. [101], who chose masked modeling as the pretraining strategy.
Overall, the selected works achieved performances superior to their fully supervised baselines. For example, Rodriguez et al. [101] achieved a mean absolute improvement of 0.03 over both accuracy and F1-score calculated on the AMIGOS dataset [97]. However, given the differences in the datasets used for the evaluation, it is impossible to compare results overall and extract a possible hierarchy for the pretraining strategies.

B. SELF-SUPERVISED LEARNING ON EEG
EEG is the other major type of biosignal where self-supervised learning has been applied. Here, SSL was employed for different downstream tasks such as sleep staging, seizure analysis, emotion classification, and motor imagery classification. The high number of downstream tasks, which were investigated using datasets provided by medical facilities, large clinical studies, or specific competitions, demonstrates how self-supervised learning could impact many real-world applications. For example, SSL-based sleep analysis can promote the development of novel deep learning-based automatic sleep scoring algorithms, which can eliminate some drawbacks of manual protocols [116], while SSL-based seizure analysis can improve the performance of automated detection systems, which allow an objective assessment of seizure frequency and a treatment tailored to the individual patient [117].
Unlike the results presented on ECG data, the choice of the pretext task depends on the study objective, with contrastive learning being slightly preferred overall. However, regardless of the chosen pretext task, works often attempt to include domain knowledge about the EEG signal during pretraining, for example by considering the importance of frequency-based EEG analysis [118] or the similarity between resting state brain hemisphere activity [119].

1) SLEEP STAGING
Sleep staging, i.e., the problem of determining the patients' status (wake, light sleep, deep sleep, REM) during their sleep, was highly investigated, with 6 out of 23 EEG works selected
during the survey. It is interesting to note that contrastive learning is predominant, with all works employing it to pretrain their models. Ren et al. [120] applied a modified version of contrastive predictive coding, while Jiang et al. [121] chose SimCLR. Yang et al. [122] proposed ContraWR, a novel approach that aims at solving the problem of negative sampling by using the average representation over the dataset as the only contrastive information. Another novel approach called SleepDPC was presented in Xiao et al. [123]. SleepDPC combines two different learning objectives during the pretraining: one, called predictive contrastive learning, uses the CPC-based Dense Predictive Coding (DPC) [124] as a reference; the other, called discriminative contrastive learning, tries to discern between temporally nearer or farther portions of the signal. Dense Predictive Coding was also part of the CoSleep method described in Ye et al. [118], which exploits multiple views of the EEG signal. In particular, DPC was first used to train from scratch two encoders, one for the time view and the other for the frequency view; then, contrastive multiview [125] was used to refine the weights of the two encoders. Finally, Lee et al. [126] presented SSLAPP, a hybrid approach based on the combination of a GAN-based generative pretext task and contrastive learning, achieving performances superior to CoSleep, SleepDPC, and other fully supervised strategies.
Based on the results presented on the SleepEDF dataset [127] which, as can be seen in Table 2, was used for the evaluation of all the proposed methods, most of the works achieved an accuracy and an F1-score superior to fully supervised baselines, with only CoSleep and SleepDPC being left behind. For example, SSLAPP reported an absolute improvement in F1-score of 0.03 and, more importantly, achieved similar results using only 10% of the labeled data. However, despite the overall improvement, no real superiority can be found among the various selected strategies.

2) SEIZURE ANALYSIS
Unlike sleep staging, where contrastive learning was the primary choice, studies dealing with seizure analysis usually adopted predictive pretext tasks. Xu et al. [128] generated the pseudo-labels by applying a set of scaling transformations to only the EEGs of healthy subjects and pretrained the model to detect them. Tang et al. [129] used forecasting (predicting the future 12 seconds from a clip of 12 or 60 seconds) as the predictive pretext task, combining SSL and graph neural networks for seizure analysis (detection and classification) for the first time. Das et al. [52] pretrained the model to reconstruct the original signal from its own corrupted version using different modification protocols, including masked modeling, thus exploiting a generative pretext task. Finally, Yang et al. [130] combined self-supervision with online learning and weak supervision for patient-specific seizure forecasting.
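The forecasting pretext task described above amounts to slicing an unlabeled recording into (input clip, future target) pairs. The sketch below shows one way to build such pairs; the function name, the non-overlapping stride, and the sampling rate passed by the caller are assumptions for illustration, with only the 12-second clip and horizon lengths taken from the text.

```python
import numpy as np

def forecasting_pairs(signal, fs, clip_seconds=12, horizon_seconds=12):
    """Slice a (channels, time) array into (input clip, future target) pairs.

    fs is the sampling rate in Hz; consecutive input clips do not overlap.
    """
    clip = int(clip_seconds * fs)
    horizon = int(horizon_seconds * fs)
    pairs = []
    last_start = signal.shape[-1] - clip - horizon
    for start in range(0, last_start + 1, clip):
        past = signal[..., start:start + clip]
        future = signal[..., start + clip:start + clip + horizon]
        pairs.append((past, future))
    return pairs
```

During pretraining the model receives `past` and is trained to regress `future`, so the supervisory signal is again obtained directly from the recording.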
Seizure analysis is highly heterogeneous in the investigated types of learning problems and in the performance achieved. When it comes to seizure detection, for example, all the proposed methods were able to surpass fully supervised baselines. On the contrary, seizure classification (identification of seizure type) and forecasting pose more challenges. Hopefully, advancements in the research will reveal the potential of SSL for those problems as well.

3) MOTOR IMAGERY
Thanks to the BCI Competition IV [156], self-supervised learning was also extended to motor imagery, i.e., the mental execution of a movement without any overt movement or peripheral (muscle) activation [157]. Out of the three selected works, two used predictive pretext tasks. In particular, He et al. [146] pretrained their model to forecast a slice of the EEG signal given a set of past ones, while Ou et al. [148] randomly shuffled portions of the EEG signal and defined a binary classification task (signal segments in order or not). In contrast, Lotey et al. [144] assessed the impact of contrastive learning for cross-session motor imagery using the baseline SimCLR. However, they achieved a lower overall accuracy on the BCI Dataset 2a [149] compared to the forecasting proposal in He's work.
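The segment-order pretext task mentioned above (are the signal segments in their original order or not?) can be sketched with the standard library alone. The function name, the 50% shuffling probability, and the equal-length segmentation are assumptions made for this illustration and are not claimed to match the setup of [148].

```python
import random

def make_order_example(signal, n_segments, rng):
    """Return (segments, label): label 1 if temporal order is kept, else 0."""
    if n_segments < 2:
        raise ValueError("need at least two segments to shuffle")
    seg_len = len(signal) // n_segments
    segments = [signal[i * seg_len:(i + 1) * seg_len]
                for i in range(n_segments)]
    if rng.random() < 0.5:
        return segments, 1            # keep the original order
    order = list(range(n_segments))
    while order == sorted(order):     # make sure the order really changes
        rng.shuffle(order)
    return [segments[i] for i in order], 0
```

A binary classifier trained on such examples has to capture the temporal structure of the signal in order to tell ordered from shuffled inputs.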
BCI competition datasets remain an important source of open datasets in this domain. They offer a common place to share and compare results achieved with different learning strategies, facilitating the advancement of research in this prominent field. Although SSL strategies were not able to achieve state-of-the-art results, which are still based on fully supervised methods [158], it is likely that their role will increase in future years, especially when pretraining is performed on multiple datasets to enhance the quality of the representations.

4) EMOTION RECOGNITION
Works employing self-supervised learning for emotion recognition varied in the choice of the pretext task. Xie et al. [139] applied six different transformations to EEG data and pretrained a multi-branch neural network to predict them. Zhang et al. [53] proposed GANSER, a generative self-supervised framework based on adversarial training. In particular, adversarial training is promoted through a masking operation and regulated by an augmentation factor designed to restrict the feature distribution difference between real EEG samples and the generated ones. Finally, Shen et al. [142] and Kan et al. [140] proposed two novel contrastive learning approaches. The first, Contrastive Learning for Inter-Subject Alignment (CLISA), tries to maximize the similarity in EEG signal representations across subjects who received the same emotional stimuli, hence without resorting to standard data augmentation procedures. The other, Self-supervised Group Meiosis Contrastive learning (SGMC), adopted a genetically inspired data augmentation technique where positive and negative pairs are generated by grouping EEG samples sharing the same stimuli and then cross-exchanging (mixing) parts of their signal. Overall, results on the widely used SEED [132] and DEAP [141] open datasets were superior to their fully supervised counterparts but comparable with each other. However, it is worth noting that, although by a small margin, Xie's predictive pretext task and Zhang's GANSER achieved state-of-the-art performances on SEED and DEAP, respectively. This aspect highlights how, in emotion recognition, there is no clear superiority of one pretraining strategy over the others.

5) OTHER OR MULTIPLE CLASSIFICATION TASKS
Six other works adopted self-supervised learning on other downstream tasks or simply provided results on multiple applications. Mohsenvand et al. [131] provided an extensive analysis of the SimCLR contrastive learning framework on several downstream tasks. In particular, their analysis of the influence of the EEG sequence length, the applied data augmentation, and the number of latent dimensions, as well as the role of the aggregation of heterogeneous datasets, is of great interest and provides good insight into those aspects that are crucial for the efficient development of SSL strategies in the EEG domain. Banville et al. [134] evaluated three different pretext tasks (CPC and two predictive) on sleep staging and pathology classification. Wagh et al. [119] highlighted the importance of exploiting domain knowledge about the EEG signal during pretraining and proposed an SSL method based on the combination of three different domain-guided pretext tasks (hemispheric symmetry, behavioral state estimation, and age contrastive). Zheng et al. [44] investigated the efficacy of SSL for anomaly detection on EEG data by designing a predictive pretext task (3-class classification) where pseudo-labels were generated by locally increasing/decreasing the amplitude of the signal in the time domain or of specific components in the frequency domain. Instead, Kostas et al. [135] designed BENDR, a novel method that combines a transformer-based framework with contrastive learning. Finally, Zygierewicz et al. [150] applied MoCo to memory-related neurofeedback data with the goal of identifying brain regions and frequency bands consistent with current neurophysiological knowledge of the processes critical to attention and working memory.

C. SELF-SUPERVISED LEARNING ON OTHER TYPES OF BIOSIGNAL
This section presents self-supervised learning applications on other types of biosignals such as EMG, eye tracking, and other sensor data. Given the low number of selected works and the various biosignals included, it is difficult to identify a trend in the choice of the pretext task (see Table 3).
However, it is worth noting that works presented here are no less important than others from the previous sections, as the analysis of the biosignals included in this subsection is essential for many real-world applications, from the development of myoelectric prostheses to the support of older people's daily lives.
EMG certainly deserves a proper category because of its wide range of applications. However, only two studies adopting self-supervised learning on such data were found. In particular, Liu et al. [159] used contrastive learning (NeuroPose) to predict finger joint angles for 3D hand pose estimation from wearable EMG sensor data (8-channel armband), achieving good performances and demonstrating robustness to natural variation in sensor mounting positions or changes in the wrist position. Wu et al. [160] designed a novel self-supervised learning approach (Neuro2vec) for neurophysiological data based on a masking pretext task applied to both the spatiotemporal and the frequency domains. They tested their approach on classification and regression tasks using EEG data and the NinaPro dataset [24], which is one of the biggest collections of open-source datasets with EMG data. On the NinaPro dataset 5, they were able to achieve an absolute improvement of 0.03 in both accuracy and F1-score on the investigated classification task and a relative drop of 10% in the mean squared error on the regression task.
Regarding other modalities, Saeed et al. [161] applied self-supervised learning to accelerometer data for human activity recognition [162], a promising assistive field that can support older people's daily lives. In their work, they designed a multitask predictive approach based on the recognition of eight different signal transformations.
Considering eye tracking data, Mengoudi et al. [163] presented a predictive pretext task for their study.In particular, they tried to classify subjects with dementia, transferring the features learned during the pretraining to a support vector machine majority voting scheme.
Finally, Ballas et al. [164] designed Listen2YourHeart, a contrastive learning approach for heart murmur detection based on the baseline method SimCLR using phonocardiography (PCG) data.
Overall, the investigated works demonstrated that SSL can be successfully applied to other types of biosignals, even when the amount of data available is not extremely high.

D. MULTIMODAL SELF-SUPERVISED LEARNING WITH BIOSIGNALS
Multimodal self-supervised learning with biosignals is the final category presented in this section. The number of works in this context is still limited (see Table 4), which highlights how difficult it is to efficiently combine information from different types of data. The modalities mainly analyzed with self-supervised learning include combinations of EEG, ECG, EMG, and other data coming from wearable devices. Unlike the trend of single-modality SSL, most of the works chose predictive pretext tasks instead of contrastive learning. Furthermore, multimodal data are often treated simultaneously via multichannel architectures, with each modality having its specific encoder and representations combined only at the network head.
SSL applications that employ data from wearable devices for medical tasks are still limited in number (even though many studies have been released for more industrial applications). Spathis et al. [173] investigated health and lifestyle monitoring with multimodal wearable data, designing a particular pretext task whose goal was to assess the subject's heart rate from other wearable data. Deldari et al. [174] presented COCOA, a contrastive learning approach designed to learn quality representations from multisensor data by computing the cross-correlation between different data modalities and minimizing the similarity between irrelevant instances. Their approach was tested on several downstream tasks (e.g., emotion recognition, sleep staging, human activity recognition) combining different biosignals such as EEG, ECG, EMG, EOG, and activity data from wearable devices, achieving overall results always superior to fully supervised strategies (absolute accuracy improvements range from 0.03 to more than 0.1). Saeed et al. [175] presented "sense and learn", a novel framework designed to learn general-purpose representations from multisensor data produced by omnipresent sensing systems. In their work, they compared several pretext tasks on multiple downstream tasks such as activity recognition, sleep scoring, and stress detection.
Three of the identified works applied self-supervised learning strategies to Intensive Care Unit (ICU) data. In particular, Chen et al. [176] proposed a novel method for the prediction of adverse surgical events. To accomplish that, they combined a set of static (e.g., covariates) and dynamic (e.g., biosignals) variables, pretraining the backbone module of the latter with a forecasting predictive pretext task. Weatherhead et al. [190] improved the baseline contrastive learning method TNC [70], pretraining the model on high-time-resolution ICU data and evaluating it on several tasks such as the prediction of 12-hour in-hospital mortality, circulatory failure, and cardiopulmonary arrest. Finally, Tipirneni et al. [192] pretrained the model with a forecasting pretext task.
Considering works employing other combinations of biosignals, Lemkhenter et al. [187] investigated self-supervised learning for sleep scoring with polysomnography data (collection of EEG, EOG, EMG, and ECG acquired during sleep), adopting a predictive pretext task built on top of the model-agnostic meta-learning framework [195]. The learning problem was to detect whether or not a training sample came from PhaseSwap [43], an operator that takes two signals as input and then combines the amplitude of the first with the phase of the second.
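The PhaseSwap operator described above, which combines the amplitude of one signal with the phase of another, can be sketched with the real-valued FFT routines in NumPy. This is a minimal illustration under the assumption that both inputs are real-valued arrays of equal length along the last axis; the function name is hypothetical and the code is not the reference implementation of [43].

```python
import numpy as np

def phase_swap(x_amp, x_phase):
    """Combine the amplitude spectrum of x_amp with the phase of x_phase."""
    spec_a = np.fft.rfft(x_amp, axis=-1)
    spec_p = np.fft.rfft(x_phase, axis=-1)
    # Keep magnitudes from the first signal and angles from the second.
    mixed = np.abs(spec_a) * np.exp(1j * np.angle(spec_p))
    return np.fft.irfft(mixed, n=x_amp.shape[-1], axis=-1)
```

Swapping a signal with itself returns the original signal, while swapping two different signals yields a sample whose magnitude spectrum matches the first input; the pretext classifier is then asked to tell such swapped samples from untouched ones.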
Thiam et al. [181] were the only ones to propose a generative pretext task on multimodal data. Using a multimodal deep denoising convolutional auto-encoder, they tested the pretrained model on pain intensity classification, achieving state-of-the-art performances on the BioVid Heat Pain Database [182].
The last two selected multimodal approaches combine biosignals with video recordings. Leveraging a combination of EEG and facial activity data extracted from video, Das et al. [184] trained an explainable AI model to predict upcoming speech stuttering, while Martini et al. [177] showed the potential of multimodal self-supervised learning by combining stereoelectroencephalography (SEEG) and video data to forecast seizure events in drug-resistant epileptic subjects.
Overall, the listed works achieved performance comparable or superior to fully supervised baselines. Moreover, works like [187] (sleep staging) show how some downstream tasks can be treated both with single and multimodal approaches. In this regard, multimodality seems to help extract complementary representations, enhancing the quality of representations compared to single-modality SSL strategies.

VII. DISCUSSION AND OPEN CHALLENGES
This section aims to answer important questions that may arise from the analysis of the selected works: when self-supervised learning might be preferred to a standard fully supervised strategy; how data aggregation can improve the model's robustness; what the role of the fine-tuning phase is; what the best pretext task to choose is; what the role of data augmentation during the pretraining is; and how multimodality can benefit from this paradigm. Although some of these topics can be presented in general terms, particular focus will be given to the analysis of special aspects to consider when applying existing SSL techniques (which have succeeded in other time-series analysis tasks) to a specific biosignal analysis task. To make the narrative clearer and easier to follow, each topic will be presented concisely in a separate subsection, providing examples from the previously listed works whenever possible.

A. SUPERVISED VS SELF-SUPERVISED LEARNING
Overall, the analysis of the selected works has shown that self-supervised learning may improve the performance of the trained model and mitigate overfitting in most of the listed downstream tasks. Hence, it seems likely that this strategy can be useful when performing deep learning-based biosignal analysis. However, one must be cautious and consider some important aspects that may guide the researcher towards the most suitable training strategy.
First, it is important to address the amount of supervision that can be provided for the downstream task. If the amount of labeled data is sufficiently high, it is unlikely that SSL will boost performance in a statistically significant manner, especially when pretraining and fine-tuning are performed on the same single repository. However, it is difficult to find such datasets in the domain of biosignals. A few exceptions worth mentioning are the Temple University Hospital (TUH) dataset for EEG and the Computing in Cardiology (CinC) datasets for ECG. Works employing such datasets (for example [112], [134]) were not always able to improve on fully supervised baselines. Nevertheless, even when results are comparable, SSL has proven to generalize better: a drastic decrease in the amount of supervision translates into only a slight drop in performance, unlike with fully supervised methods.
The second aspect to consider is the amount of external data that can be exploited during pretraining. Self-supervised learning's main goal is to provide a way to learn general-purpose features by exploiting large amounts of unlabeled data. The more data that can be fed into the network during pretraining, the more robust the learned features will be, as they come from a larger and more heterogeneous pool of data. This has the potential to boost the performance on the downstream task, as reported in many of the selected works. A more in-depth analysis of the role of data aggregation in self-supervision is provided in the next subsection.

B. THE POWER OF DATA AGGREGATION
Regardless of the specific type of biosignal or the investigated clinical task, self-supervised pretraining has demonstrated that it can reach state-of-the-art performance when multiple datasets are employed simultaneously. However, although SSL approaches facilitate the aggregation of multiple repositories not acquired in the same experimental setting, this procedure is still not common practice in biosignal analysis. There are indeed many studies that combine more than one dataset during pretraining, but their number is generally limited to two or three repositories, usually acquired for the same medical purpose. In contrast, works like [131] demonstrated how data aggregation might improve model performance even when records come from completely different experimental settings.
Practical limitations, such as the inability to standardize multiple datasets automatically and easily, certainly play a key role in hindering this practice. In fact, biosignals are not only complex to interpret but also suffer from great variability, which may come from experimental settings, acquisition protocols, storage modalities, and intra- and inter-subject variability. For example, EEG preprocessing includes not only data imputation, resampling, and filtering, as for any other biosignal, but also re-referencing to a common (or average) channel, alignment to a unique template, and interpolation of missing channels. Manually performing all these steps is an extremely time-consuming and discouraging task. However, as of now, no tools are designed to simultaneously preprocess and align multiple datasets of the same modality. Therefore, it could be of great interest for the research community to develop novel tools that can both perform consistent preprocessing on multiple datasets and integrate their functionalities with preexisting ones, such as EEGLAB [196] for EEG or ECG-kit [197] for ECG, making it possible to aggregate heterogeneous datasets for SSL applications.
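The EEG harmonization steps listed above (alignment to a common channel template, filling of missing channels, resampling, and average re-referencing) can be sketched as follows. This is a minimal illustration under stated assumptions, not a tool described in the surveyed works: the function names, the channel template, and the target sampling rate are hypothetical; missing channels are filled with the mean of the available ones as a crude stand-in for spatial interpolation; and no anti-aliasing filter is applied before resampling.

```python
import numpy as np

def resample_linear(x, fs_in, fs_out):
    """Resample a (channels, samples) array by linear interpolation.

    Assumption: a real pipeline would low-pass filter before downsampling
    to avoid aliasing; that step is omitted to keep the sketch short.
    """
    n_in = x.shape[1]
    n_out = int(round(n_in / fs_in * fs_out))
    t_in = np.arange(n_in) / fs_in
    t_out = np.arange(n_out) / fs_out
    return np.stack([np.interp(t_out, t_in, ch) for ch in x])

def average_reference(x):
    """Re-reference every channel to the instantaneous channel average."""
    return x - x.mean(axis=0, keepdims=True)

def harmonize(x, ch_names, fs, template, fs_target):
    """Map one recording onto a shared channel template and sampling rate.

    Assumes at least one template channel is present in the recording.
    Channels missing from the recording are filled with the mean of the
    available template channels (hypothetical placeholder for proper
    spatial interpolation).
    """
    index = {name: i for i, name in enumerate(ch_names)}
    available = [index[c] for c in template if c in index]
    fill = x[available].mean(axis=0)
    aligned = np.stack([x[index[c]] if c in index else fill for c in template])
    aligned = resample_linear(aligned, fs, fs_target)
    return average_reference(aligned)
```

After this step, recordings from repositories with different montages and sampling rates share the same array layout and can be pooled into one pretraining set.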
Moreover, although there is already evidence that the aggregation of multiple datasets can improve the accuracy of downstream models [131], [187], it could be useful to further investigate the effect of massive data aggregation during pretraining and how it affects the quality of the learned general-purpose features. It could also be of great interest to understand whether this strategy could help advance the problem of domain adaptation [198], i.e., the problem of avoiding significant performance degradation due to changes in the marginal distribution of the feature space (domain shift), which remains a critical aspect in the biomedical domain.

C. THE CHOICE OF THE FINE-TUNING DATASET
When defining a self-supervised experimental pipeline, it is important to select not only the right pretraining datasets to aggregate but also the right fine-tuning one. Unfortunately, given the way self-supervised learning strategies are usually presented, fine-tuning often seems to take a back seat. However, this phase is no less important than pretraining, since model evaluation will be based on the performance metrics estimated from the test set of the fine-tuning dataset. Moreover, choosing the right fine-tuning dataset is important not only for model evaluation but also to promote the replicability of results and facilitate the comparison between different approaches.
Regarding replicability, one should opt as much as possible for free open repositories, or at least ones accessible after filling out a request form. While the use of private datasets is certainly not forbidden, especially during pretraining, the community could benefit more from novel strategies tested only with open datasets. The use of open data can, in fact, make results not only reproducible but also more reliable, since nothing is hidden from the reader. Moreover, it encourages the use of the same dataset as well as the production and release of tools designed to preprocess and split it in a standardized way, which is a crucial step in the creation of useful benchmarks.
Regarding the comparison between different approaches, in other fields such as computer vision the research community has adopted well-defined protocols (e.g., use of datasets with predefined test sets, use of the same combination of data augmentations, use of standard model architectures) to promote fair and robust comparison between the proposed strategies, whereas in the biosignal domain this aspect remains an open challenge. In fact, given a specific downstream task, several factors, such as the choice of different fine-tuning datasets, the use of a different splitting strategy (subject-, session-, or trial-based), or the way performance variability was assessed (repeated fine-tuning, leave-one-subject-out cross-validation, pretraining with different subsets of data), often make it impossible to compare the presented results. While the splitting strategy and the performance variability assessment can change according to the experimental study, the choice of the specific fine-tuning dataset can at least be aligned based on the investigated downstream task. To help readers choose the right fine-tuning repository, the following list of datasets often used for different downstream tasks is provided:
• PTB-XL: a large open dataset comprising 21799 clinical 12-lead ECG records of 10 seconds length from 18869 patients. Each ECG was assigned a diagnostic label based on the evaluation of expert cardiologists. The number of labels can vary depending on the chosen experimental setting. Data can be directly downloaded.
• TUH: the largest EEG repository to date. It includes EEG records from 10874 subjects recorded at a minimum of 250 Hz with a 24- to 36-channel system. The dataset was also divided into several subsets annotated for specific case studies (e.g., TUAB for normal/abnormal classification, TUEP for epilepsy). Data can be accessed only after filling out a request form.
• DEAP: an EEG dataset for emotion studies. It comprises EEG records from 32 subjects, with the possibility to download already processed samples. Data can be accessed only after filling out a request form, which must come from researchers with a permanent position at an academic or research institute.
• BCI Competition: a set of datasets released for BCI Competition IV. Widely used datasets include datasets 2a and 2b for motor imagery with EEG data. Data can be directly downloaded.
• NinaPro: a large multimodal database aimed at fostering machine learning research on human, robotic, and prosthetic hands. It comprises 10 datasets with EMG and other kinematic or inertial data acquired from subjects with intact or amputated hands. Data can be directly downloaded.
• MIMIC: a large multimodal dataset that includes recordings from ICU patients. The most frequently employed versions are MIMIC-II and MIMIC-III, which can be accessed only after filling out a request form.
• WESAD: a multimodal dataset for wearable stress and affect detection comprising physiological and motion data recorded from 15 subjects. Data can be directly downloaded.
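Whichever repository is chosen, the splitting strategy mentioned above has to be fixed as well. As a minimal sketch of a subject-based split, assuming each sample carries a subject identifier (the function name `subject_split` and its parameters are hypothetical, not taken from any surveyed work), all records of a subject can be kept on the same side of the partition:

```python
import random
from collections import defaultdict

def subject_split(subject_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) such that no subject appears in both sets.

    subject_ids: one identifier per sample, in dataset order.
    Assumption: identifiers are mutually comparable (e.g., all strings).
    """
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)

    # Shuffle subjects, not samples, so a subject is never split in two.
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)

    n_test = max(1, round(len(subjects) * test_fraction))
    test_idx = [i for s in subjects[:n_test] for i in by_subject[s]]
    train_idx = [i for s in subjects[n_test:] for i in by_subject[s]]
    return sorted(train_idx), sorted(test_idx)
```

Releasing the seed and the resulting subject lists together with the results is one simple way to make such a split reusable by other groups.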

D. THE CHOICE OF THE PRETEXT TASK
Looking at the raw numbers, contrastive learning was the most frequently chosen pretext task, outnumbering all works adopting other methodologies combined. This reflects not only the ability of contrastive pretext tasks to learn better general-purpose representations from the data compared to other approaches, but also their ease of adaptation to the medical domain. In fact, baseline self-supervised contrastive approaches are fairly easy to implement and have many variants that, although similar, can fit specific experimental needs. Moreover, they can be easily modified without changing their core parts. Many of the presented works, rather than designing completely novel approaches, slightly changed baseline methods to incorporate specific medical domain knowledge. For example, some works proposed more biologically inspired data augmentation techniques [140], while others focused on the way similarity between pairs is evaluated, for example by modifying the objective function or the structure of the siamese network [107]. Specific examples of the incorporation of medical domain knowledge during pretraining can be found in the surveyed works. In particular, the authors of [119] presented an EEG-based multitask pretraining strategy that takes into account both similarities and dissimilarities in the activity of the left and right brain hemispheres, as well as the known effect of the subject's age and behavioral state on EEG dynamics. In addition, although not classified as a contrastive learning pretext task, the method presented in [45] represents another example of domain knowledge incorporation, since it is based on the prediction of characteristic features automatically extracted from the ECG signal (with standard procedures) and typically used by cardiologists for diagnostic purposes.
Although contrastive learning seems to perform well in general, discarding other pretraining strategies can be counterproductive. For example, properly designed predictive pretext tasks can lead to better results on some downstream tasks, such as motor imagery classification [146]. Moreover, they are still largely employed in multimodal approaches, where finding effective ways to assess similarities and dissimilarities between representations of different modalities for contrastive approaches is still an open challenge. Masked modeling was also successfully applied to several downstream tasks, although its paradigm is less open to novel implementations. However, when combined with other SSL strategies, especially when transformer architectures are involved [92], it could improve the model's performance and robustness.
Each pretraining technique has its own peculiarity; hence, it is reasonable to assume that the quality of the representations will be affected as well. In this context, it could be more valuable to investigate ''hybrid'' approaches, which incorporate the qualities of different methods, rather than trying to assess the best strategy among the categories. The combination of multiple pretext tasks might lead to more robust features, as they can instill in the feature extractor knowledge learned from very different tasks. An example of such a strategy can be found in [126], where contrastive (SimSiam) and generative (GAN-based) pretext tasks were combined to improve the quality of the representations.
In conclusion, it is probably still too early to determine the best SSL pretext task category for the analysis of biosignals, especially considering that the field is evolving quickly and some methods (e.g., generative pretext tasks) have so far been applied in only a limited number of works. Future works and advancements in this domain may reveal which directions are more effective.

E. THE ROLE OF DATA AUGMENTATION
Data augmentation plays a central role in determining the quality of the representations learned during pretraining. It guides the network during the general-purpose feature learning process, consequently influencing its performance on the target task after fine-tuning. This is true not only in contrastive pretext tasks, where data augmentation is an essential part of the general workflow, but also in generative (e.g., reconstructive) and predictive strategies. Therefore, particular attention must be given to the design of augmentation methods, as wrong choices could severely degrade the model's performance. Given the field of application, it is extremely important to consider both the physiological nature of the signal and the prior medical knowledge about the target clinical task.
As for the signal's physiological nature, a data augmentation must generate a new version of the same data that not only preserves its physiological information but also does not diverge too much from the original dataset distribution. For example, a commonly employed data augmentation is the addition of generated noise or artifacts. In the biosignal domain, there are many known sources of artifacts that could be exploited, such as line noise, the drift artifact caused by changes in the electrodes' impedance, or the ocular and muscle artifacts typical of EEG data.
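Two of the artifact sources named above, line noise and electrode drift, can be imitated with simple sinusoidal models. The sketch below is only illustrative: the function names, the 50 Hz default (60 Hz in some countries), the amplitudes, and the drift frequency range are assumptions chosen for the example, and real artifacts are less regular than a pure sinusoid.

```python
import numpy as np

def add_line_noise(x, fs, freq=50.0, amplitude=0.1, rng=None):
    """Add a power-line sinusoid with random phase to a (channels, samples) array."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(x.shape[-1]) / fs
    phase = rng.uniform(0.0, 2.0 * np.pi)
    return x + amplitude * np.sin(2.0 * np.pi * freq * t + phase)

def add_baseline_drift(x, fs, max_freq=0.5, amplitude=0.2, rng=None):
    """Add a slow sinusoidal drift, a rough model of impedance-related wander."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(x.shape[-1]) / fs
    freq = rng.uniform(0.05, max_freq)      # assumed drift band, in Hz
    phase = rng.uniform(0.0, 2.0 * np.pi)
    return x + amplitude * np.sin(2.0 * np.pi * freq * t + phase)
```

Amplitudes should be scaled to the units and typical range of the target signal, so that the augmented sample stays close to the original data distribution.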
As for the medical knowledge of the target clinical task, while it is true that pretext tasks can produce robust features without any knowledge about the subsequent clinical task, it is also true that indirectly including such information in the model could be beneficial, even at the cost of reaching a worse loss minimum during pretraining. For example, if the medical literature has already identified specific patterns that can be exploited to distinguish between normal (healthy) and abnormal (pathological) signals, it is important and reasonable to design data augmentations that force the network to focus on such aspects. This applies, for example, to variations of the PQRST complex in ECG analysis or variations of the signal spectrum in specific bands in EEG applications.
Another key point to assess is how data augmentations are combined. While a single data augmentation chosen at random from a wide list could be a good initial strategy, compositions of multiple transformations can increase the samples' heterogeneity and produce more complex patterns, enhancing the learning process. In fact, as reported in [58], composition makes the pretext task harder, but the quality of the representations improves dramatically. In the same work, the authors proposed a pipeline to systematically study the impact of data augmentation, which was also used in [121] on EEG data. The results of both works demonstrated the superiority of data augmentation composition. However, it is also important not to stack too many augmentations, as the transformed sample may become too noisy; hence, the trade-off between task complexity and quality of the representations will probably be lost. A good compromise could be to apply a sequence of two augmentations, preceded by another physiologically invariant transformation designed to increase the number of training samples without changing the biological information of the original data. Examples of such a transformation are EEG re-referencing to another channel, as suggested in [18], or signal polarity inversion.
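The compromise described above (one physiologically invariant transformation followed by a sequence of two augmentations) could look like the sketch below. The pool of augmentations, their parameter values, and the name `make_view` are hypothetical choices for illustration and are not taken from [58] or [121]; the time shift is circular, which is itself a simplification.

```python
import numpy as np

def polarity_inversion(x):
    """Invariant transformation named in the text: flip the signal sign."""
    return -x

def gaussian_noise(x, rng, scale=0.05):
    return x + rng.normal(0.0, scale * x.std(), size=x.shape)

def amplitude_scaling(x, rng, low=0.8, high=1.2):
    return x * rng.uniform(low, high)

def circular_shift(x, rng, max_shift=25):
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(x, shift, axis=-1)

def time_masking(x, rng, max_len=50):
    out = x.copy()
    n = x.shape[-1]
    length = int(rng.integers(1, min(max_len, n) + 1))
    start = int(rng.integers(0, n - length + 1))
    out[..., start:start + length] = 0.0
    return out

# Hypothetical augmentation pool; a real study would tune it per biosignal.
AUGMENTATIONS = [gaussian_noise, amplitude_scaling, circular_shift, time_masking]

def make_view(x, rng, n_aug=2, p_invert=0.5):
    """Build one augmented view of a (channels, samples) float array."""
    view = x.copy()
    if rng.random() < p_invert:
        view = polarity_inversion(view)
    # Draw n_aug distinct augmentations and apply them in sequence.
    for i in rng.choice(len(AUGMENTATIONS), size=n_aug, replace=False):
        view = AUGMENTATIONS[i](view, rng)
    return view
```

Calling `make_view` twice on the same sample yields the two views that a contrastive framework would treat as a positive pair.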
Despite the central role of data augmentation, the literature still lacks an extensive analysis of its role in SSL-based biosignal analysis. Aside from the previously mentioned work on EEG data, a similar analysis on ECG data is provided in [110]. However, no extensive study of the composition of augmentations was performed.

F. THE CHALLENGE OF MULTIMODALITY
In the biomedical domain, multimodal data are often complementary, meaning that each type of data (e.g., signals, images, text reports) can be used to extract unique latent representations that allow a better understanding of a pathology, even at the subject level. However, the analysis of the methods presented in the selected works reveals how difficult it is to exploit the SSL paradigm in a multimodal environment. Two main reasons can explain this difficulty: the limited availability of multimodal datasets acquired for a specific task, and the challenging problem of effectively combining different modalities during pretraining.
As for the availability of multimodal datasets, there is no doubt that their collection within a single experimental setting is extremely hard and costly. However, as reported in Section VI-D, a commonly adopted strategy is to train a specific feature extractor for each data modality. This makes it possible to overcome the problem of data availability by performing a two-step pretraining strategy. In the first step, each encoder can be trained separately by aggregating several unimodal repositories; then, multimodal data are used to simultaneously optimize and align the representations of all the feature extractors.
While this strategy lessens the need for multimodal repositories, the second problem, namely how to effectively combine multiple data types, remains open. As of now, predictive pretext tasks are the most frequently chosen approach, given their lower computational requirements and ease of implementation. However, predictive pretext tasks rely on the concatenation of the different embeddings only at the network head level (usually discarded during the model transfer phase), without actually promoting the alignment of different modalities at the backbone level. In contrast, contrastive learning could be the most suitable type of pretext task in this context, as it improves the agreement between representations of different modalities by projecting them into a common latent space used to calculate the contrastive loss (see COCOA [174]). The alignment of different modalities in a single space could have great potential in knowledge discovery scenarios, for example by connecting the aligned embeddings to a common ontology. It could also open new possibilities in deep phenotyping and precision medicine [199], [200].
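To make the idea of aligning modalities in a common latent space concrete, the sketch below computes a symmetric InfoNCE-style loss between the projections of two modalities. It is a generic illustration and not the objective used by COCOA [174]; the function names, the temperature value, and the assumption that row i of both inputs comes from the same recording segment are all choices made for this example.

```python
import numpy as np

def l2_normalize(z, eps=1e-12):
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)

def log_softmax(logits):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_modal_info_nce(z_a, z_b, temperature=0.1):
    """Symmetric contrastive loss between two modalities.

    z_a, z_b: (batch, dim) projections of modality A and B in the shared
    latent space; row i of z_a and row i of z_b form the positive pair,
    every other row in the batch acts as a negative.
    """
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature           # cosine similarities
    diag = np.arange(len(z_a))
    loss_ab = -log_softmax(logits)[diag, diag].mean()    # A -> B
    loss_ba = -log_softmax(logits.T)[diag, diag].mean()  # B -> A
    return 0.5 * (loss_ab + loss_ba)
```

The loss is low when matching segments of the two modalities are closer to each other than to the other samples in the batch, which is the backbone-level alignment that head-level concatenation does not enforce.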
One medical area that could benefit most from multimodal applications is neuroscience. Neuroscience is extremely multimodal, with biosignals like EEG or EMG collected together with different types of images (e.g., positron emission tomography, optical coherence tomography, structural and functional magnetic resonance imaging) and tabular data. However, limited effort has been made to align images, signals, and clinical data, a procedure that could greatly improve the study of different neurological disorders and the understanding of the mechanisms behind their onset and progression.
Another application that could benefit from multimodal self-supervised strategies is the management of chronic diseases through multimodal wearable data. As previously stated in Section I, the role of wearable devices is constantly growing, and nowadays people affected by chronic diseases like diabetes, coronary heart disease, or chronic obstructive pulmonary disease can rely heavily on them [201]. However, while different wearable devices can facilitate the monitoring of several types of physiological information, the introduction of deep learning-based decision support systems that can exploit them in real-world scenarios is still hindered by the many sources of variability (e.g., subject variability, sensor variability) associated with such data. In this context, the ability of self-supervised learning to improve model generalizability, as reported in other surveyed works, could help solve this problem. However, further investigations are needed, as the number of SSL-based works in this area is still limited.

VIII. CONCLUSION
Self-supervised learning represents a relatively recent and extremely powerful resource in the context of deep learning and, more generally, machine learning applications to different data modalities. In particular, the potential impact of self-supervised learning in biomedical sciences, where it is difficult to obtain large amounts of annotated data, is extremely high. While previous works reviewed SSL applications on biomedical images, this is the first review paper targeting SSL applications for the analysis of biosignals. The survey highlights how self-supervised learning has been widely adopted for various types of biosignals, including multimodal approaches. It also highlights how, despite its relatively young age, SSL can potentially solve the problem of learning robust representations from biosignals in situations where there is a limited amount of labeled data. However, several factors remain unclear and require further investigation, such as the choice of the pretext task, the data aggregation procedure, and the exploitation of biological information from biosignals during the pretraining phase. Despite these limitations, self-supervised learning has opened the path to more robust and better-performing deep learning, which could finally bridge the gap between research and clinical applications. It also has the potential to make applications of deep learning in the biomedical domain (where it is more difficult to obtain data and expert annotations) more substantial and to help face some open challenges (e.g., accountability, distribution shifts, robustness) that still hinder the reliability of AI for healthcare [202], [203].

FIGURE 1. Example of four seconds of three different biosignals (ECG, EEG, and EMG) with normal (top) and abnormal (bottom) conditions. Left: lead II ECGs selected from the PTB-XL dataset [22]. Healthy subject is subject 18, and pathological subject (atrial flutter) is subject 33. Middle: single-channel EEGs selected from the BONN EEG dataset [23]. Healthy subject is subject 2 from set B, while pathological (epileptic) subject is subject 6 from set E. Right: single-channel EMGs selected from the NinaPro dataset [24]. Intact right-handed subject is subject 16 from dataset 2, while amputated right-handed subject is subject 4 from dataset 3. Note how, regardless of their type, it is possible to spot some differences in amplitude and/or waveforms between normal and abnormal biosignals.

FIGURE 2. A simple schematic representation of the self-supervised learning paradigm. First, a model is pretrained with only unlabeled data to solve an auxiliary task (pretext task). Then, the backbone's weights are transferred to the downstream model, which is then fine-tuned with the limited amount of labeled data.

Figure 3(a) summarizes how CPC works. First, sequences of observations $x_{t+k}$, $k \in \mathbb{Z}$, are passed to a nonlinear encoder to produce a set of latent representations $z_{t+k}$; then, latent representations of the past portion of the signal are fed into an autoregressive model, which is used to summarize all the encoded information and produce a context latent representation $c_t$. Finally, the context latent representation is used to predict the latent representation of future portions of the signal (target). The encoder and the autoregressive model are trained to jointly optimize a loss based on noise-contrastive estimation (NCE) [57], which is called InfoNCE loss.
(b) SimCLR (2020): A Simple Framework for Contrastive Learning of Visual Representations (SimCLR) is an end-to-end framework designed to learn high-quality representations by maximizing the agreement between differently augmented views of the same data example via a contrastive loss in the latent space [58]. SimCLR relies on two simple key ideas. The first is to use heavy random data augmentation; the second is to adopt large batch sizes rich in negative examples.

Figure 3(b) illustrates how SimCLR works. Each sample x is augmented twice with randomly selected transformation functions. Then, each of the augmented samples is fed to a backbone encoder to produce a set of representations h; after that, representations are passed to a small neural block called the projection head, which outputs a set of projections z in a new latent space. Finally, projections are used to maximize the agreement between positive pairs, i.e., pairs of augmented samples of the same original data. The encoder and projection head are trained to jointly optimize the normalized temperature-scaled cross-entropy loss (NT-Xent), defined as:
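In the notation of [58], for a positive pair (i, j) drawn from a batch of N original samples (hence 2N augmented views), the loss takes the form below. The symbols sim, τ, and N follow [58] and are not defined in the text above.

```latex
\ell_{i,j} = -\log
  \frac{\exp\!\left(\mathrm{sim}(z_i, z_j)/\tau\right)}
       {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]}\,
        \exp\!\left(\mathrm{sim}(z_i, z_k)/\tau\right)},
\qquad
\mathrm{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert\, \lVert v \rVert}
```

Here sim is the cosine similarity between two projections, τ is the temperature, and the indicator excludes the similarity of a view with itself. The final loss is averaged over all positive pairs in the batch, counting both (i, j) and (j, i).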

FIGURE 3. Schematic view of some contrastive learning frameworks. (a) Contrastive Predictive Coding (CPC); (b) Simple Contrastive Learning (SimCLR); (c) Momentum Contrast (MoCo); (d) Bootstrap Your Own Latent (BYOL); (e) Simple Siamese (SimSiam); (f) Swapping Assignments between Views (SwAV). x denotes the original sample, x its augmented version, h the encoder output, z the latent representation, k the new enqueued keys in MoCo, p the prediction of the online network in BYOL, and $c_t$ the context latent representation in CPC. Momentum network modules are drawn in a lighter color than their online counterparts to highlight the small difference in their weight values. Also, note that all methods are similar to each other but have clearly distinct peculiarities.

FIGURE 4. Number of works per SSL strategy grouped by the type of biosignal (ECG: electrocardiography, EEG: electroencephalography, EMG: electromyography, PCG: phonocardiography) adopted and the type of upstream task. The ''other'' category refers to those works that have tested multiple SSL pretraining strategies or have proposed hybrid approaches, i.e., a combination of the previous three.

• PTB-XL: a large open dataset comprising 21799 clinical 12-lead ECG records of 10 seconds length from 18869 patients. Each ECG was assigned a diagnostic label based on the evaluation of expert cardiologists. The number of labels can vary depending on the chosen experimental setting. Data can be directly downloaded.

TABLE 2. Self-supervised learning works on EEG signals.

TABLE 3. Self-supervised learning works on other types of biosignals.

• TUH: the largest EEG repository to date. It includes EEG records from 10874 subjects recorded at a minimum of 250 Hz with a 24- to 36-channel system.