Transfer Learning on Electromyography (EMG) Tasks: Approaches and Beyond

Machine learning on electromyography (EMG) has recently achieved remarkable success on various tasks, while such success relies heavily on the assumption that the training and future data must be of the same data distribution. However, this assumption may not hold in many real-world applications. Model calibration is required via data re-collection and label annotation, which is generally very expensive and time-consuming. To address this issue, transfer learning (TL), which aims to improve target learners’ performance by transferring knowledge from related source domains, is emerging as a new paradigm to reduce the amount of calibration effort. This survey assesses the eligibility of more than fifty published peer-reviewed representative transfer learning approaches for EMG applications. Unlike previous surveys on purely transfer learning or EMG-based machine learning, this survey aims to provide insight into the biological foundations of existing transfer learning methods on EMG-related analysis. Specifically, we first introduce the muscles’ physiological structure, the EMG generating mechanism, and the recording of EMG to provide biological insights behind existing transfer learning approaches. Further, we categorize existing research endeavors into data based, model based, training scheme based, and adversarial based. This survey systematically summarizes and categorizes existing transfer learning approaches for EMG related machine learning applications. In addition, we discuss possible drawbacks of existing works and point out the future direction of better EMG transfer learning algorithms to enhance practicality for real-world applications.


Introduction
The human motor control system is a complex neural system that is crucial for daily human activities. One way to study it is to record the signal produced by the muscle fiber contractions associated with human motor activities, either by inserting needle electrodes into the muscles or by attaching electrodes onto the surface of the skin. The signal obtained is referred to as electromyography (EMG). Given the location of the electrodes, EMG is further divided into surface EMG (sEMG) and intramuscular EMG (iEMG). Advances in EMG analysis and machine learning have recently enabled a wide variety of applications, including but not limited to rehabilitation with prostheses [1], hand gesture recognition [2] and human-machine interfaces (HMIs) [3].
The current success of applying deep learning to EMG-related tasks largely rests on the following two assumptions, which usually do not hold in real-world EMG scenarios:
1) A sufficient amount of annotated training data. The growing capability and capacity of deep neural network (DNN) architectures are associated with million-scale labeled data [4,5]. Such abundant, high-quality labeled data are often limited, expensive, and inaccessible in the domain of EMG analysis. On the one hand, EMG data annotation requires expert knowledge. On the other hand, EMG data acquisition is a highly physical and time-consuming process that requires several days of collaboration from multiple parties [6].
2) Training data and testing data are independent and identically distributed (i.i.d.). The performance of the model is largely affected by the distribution gap between the training and testing datasets. The testing data may also refer to the data generated during actual use after model deployment. Take hand gesture recognition as an example: the model can only give accurate predictions when the forearm positioning of the test subject and the placement of the electrodes exactly match those used during training.
As the distribution of data changes, models based on statistics need to be reconstructed with newly collected training data. In many real-world applications, it is expensive and impractical to recollect a large amount of training data and rebuild the models each time a distribution change is observed. Transfer learning (TL), which emphasizes the transfer of knowledge across domains, emerges as a promising machine learning solution to the above problems. The notion of transfer learning is not new: Thorndike et al. [7] suggested that improvement in one task benefits the efficiency of learning other tasks, provided that the two tasks are similar. In practice, a person who knows how to ride a bicycle can learn to ride a motorcycle faster than others, since both tasks require keeping balance. However, transfer learning for EMG-related tasks has only been gaining attention with the recent development of both DNNs and HMIs. Existing surveys provide an overview of DNNs for EMG-based human-machine interfaces [8] and of transfer learning in general for various machine learning tasks [9]. This survey focuses on the intersection of machine learning for EMG and transfer learning via EMG biological foundations, providing insights into a novel and growing area of research. Besides the analysis of recent deep learning works, we attempt to explain the relationships and differences between non-deep-learning and deep models, since these works usually share similar intuitions and observations. Some of the previous non-deep-learning works carry more biological significance and can inspire further DNN-based research in this field. To consolidate these recent advances, we propose a new taxonomy for transfer learning on EMG tasks and provide a collection of predominant benchmark datasets following our taxonomy.
The main contributions of this paper are:
• We summarize over fifty representative, up-to-date transfer learning approaches for EMG analysis with an organized categorization, presenting a comprehensive overview to the readers.
• We delve into the generating mechanisms of EMG and bridge transfer learning practices with the underlying biological foundation.
• We point out the technical limitations of current research and discuss promising directions for transfer learning on EMG analysis to motivate further studies.
The remainder of this paper is organized as follows. Section 2 introduces the basics of transfer learning, the generation and acquisition of EMG, and EMG transfer learning scenarios. In Section 3, we first provide a categorization of EMG transfer learning based on existing works and then introduce each category in detail. We also give a summary of commonly used datasets in Section 4. Lastly, we discuss existing methods and future research directions of EMG transfer learning.

Preliminaries
This section introduces the definitions of transfer learning and related concepts, as well as the basics of EMG, from how the EMG signal is generated to how it is recorded. We also summarize possible transfer scenarios in this section.

Transfer Learning
We first give the definitions of a "domain" and a "task", respectively. Define $\mathcal{D}$ to be a domain which consists of a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$, where $X = [x_i]_{i=1}^{n}$ is a set of data samples. In particular, if two domains have different feature spaces or marginal probability distributions, they differ from each other. Given a domain $\mathcal{D} = \{\mathcal{X}, P(X)\}$, a task is then represented by $\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}$, where $f(\cdot)$ denotes the objective prediction function and $\mathcal{Y}$ is the label space associated with $\mathcal{X}$. From a probabilistic point of view, $f(x)$ can also be regarded as the conditional probability distribution $P(y|x)$. Two tasks are considered different if they have different label spaces or different conditional probability distributions. Then, transfer learning can be formally defined as follows:
Definition 1 (Transfer Learning): Given a source learning task $\mathcal{T}_S$ based on a source domain $\mathcal{D}_S$, transfer learning aims to help improve the learning of the target objective prediction function $f_T(x)$ of the target task $\mathcal{T}_T$ based on the target domain $\mathcal{D}_T$, given that $\mathcal{D}_S \neq \mathcal{D}_T$ or $\mathcal{T}_S \neq \mathcal{T}_T$.
The above definition could be extended to multiple domains and tasks for both source and target. In this survey, we only consider the case where there is one source domain $\mathcal{D}_S$ and one target domain $\mathcal{D}_T$, as this is by far the most intensively studied transfer setup in the literature. Based on different setups of the source and target domains and tasks, transfer learning could be roughly categorized into inductive transfer learning, transductive transfer learning and unsupervised transfer learning [10].
Definition 2 (Inductive Transfer Learning): Given a transfer learning task $(\mathcal{D}_S, \mathcal{T}_S, \mathcal{D}_T, \mathcal{T}_T, f_T(x))$, it is an inductive transfer learning task where the knowledge of $\mathcal{D}_S$ and $\mathcal{T}_S$ is used to improve the learning of the target objective prediction function $f_T(x)$ when $\mathcal{T}_S \neq \mathcal{T}_T$.
The target objective predictive function can be induced by using a few labeled data in the target domain as the training data.
Definition 3 (Transductive Transfer Learning): Given a transfer learning task $(\mathcal{D}_S, \mathcal{T}_S, \mathcal{D}_T, \mathcal{T}_T, f_T(x))$, it is a transductive transfer learning task where the knowledge of $\mathcal{D}_S$ and $\mathcal{T}_S$ is used to improve the learning of the target objective prediction function $f_T(x)$ when $\mathcal{D}_S \neq \mathcal{D}_T$ and $\mathcal{T}_S = \mathcal{T}_T$.
For transductive transfer learning, the source and target tasks are the same, while the source and target domains differ. Similar to the setting of transductive learning in traditional machine learning [11], transductive transfer learning aims to make the best use of the given unlabeled data in the target domain to adapt the objective predictive function learned in the source domain, minimizing the expected error on the target domain. It is worth noting that domain adaptation is a special case where $\mathcal{X}_S = \mathcal{X}_T$, $\mathcal{Y}_S = \mathcal{Y}_T$, and $P_S(y|X) \neq P_T(y|X)$ and/or $P_S(X) \neq P_T(X)$.
Definition 4 (Unsupervised Transfer Learning): Given a transfer learning task $(\mathcal{D}_S, \mathcal{T}_S, \mathcal{D}_T, \mathcal{T}_T, f_T(x))$, it is an unsupervised transfer learning task where the knowledge of $\mathcal{D}_S$ and $\mathcal{T}_S$ is used to improve the learning of the target objective prediction function $f_T(x)$ with $\mathcal{Y}_S$ and $\mathcal{Y}_T$ not observed.
Based on the above definition, no data annotation is accessible in either the source or the target domain during training. There has been little research conducted on this setting to date, given its fully unsupervised nature in both domains.

EMG Basics
Motor Unit Action Potential. A motor unit (MU) is defined as one motor neuron and the muscle fibers that it innervates. During the contraction of a normal muscle, the muscle fibers of a motor unit are activated by its associated motor neuron. The membrane depolarization of the muscle fiber is accompanied by ion movement and thus generates an electromagnetic field in the vicinity of the muscle fiber. The potential or voltage detected within the electromagnetic field is referred to as the fiber action potential. The amplitude of the fiber action potential is related to the diameter of the corresponding muscle fiber and the distance to the recording electrode. It is worth noting that MU, by definition, refers to the anatomical motor unit, whereas the functional motor unit is of more research interest when it comes to real-world applications. The functional motor unit can be defined as a group of muscle fibers whose action potentials occur within a very short time (two milliseconds). Intuitively, one could consider a functional motor unit as a group of muscle fibers that contract for one unified functionality. From this point on, MU refers to a functional motor unit unless otherwise specified. A Motor Unit Action Potential (MUAP) is defined as the waveform consisting of the superimposed (both temporally and spatially) action potentials from each individual muscle fiber of the motor unit. The amplitude and shape of the MUAP are a unique indicator of the properties of the MU (functionality, fiber arrangement, fiber diameter, etc.). MUs are repeatedly activated so that muscle contraction is sustained for stable motor movement. The repeated activation of MUs generates a sequence of MUAPs forming a Motor Unit Action Potential Train (MUAPT).
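The superposition described above can be illustrated with a toy simulation. The following is a minimal sketch and not a physiological model: the sampling rate, the biphasic MUAP shape, the firing rates, and the helper names `muap_waveform` and `muapt` are all illustrative assumptions.

```python
import numpy as np

FS = 2000  # assumed sampling rate in Hz

def muap_waveform(amplitude, width_ms, fs=FS):
    """Toy biphasic MUAP (derivative of a Gaussian); the shape is an assumption."""
    sigma = width_ms / 1000.0
    half = int(3 * sigma * fs)
    t = np.arange(-half, half + 1) / fs
    wave = -t * np.exp(-t ** 2 / (2 * sigma ** 2))
    return amplitude * wave / np.max(np.abs(wave))

def muapt(waveform, firing_rate_hz, duration_s, rng, fs=FS):
    """MUAP train: one MUAP placed at every (jittered) firing instant of the MU."""
    n = int(duration_s * fs)
    impulses = np.zeros(n)
    mean_isi = fs / firing_rate_hz  # mean inter-spike interval in samples
    t = rng.uniform(0, mean_isi)
    while t < n:
        impulses[int(t)] = 1.0
        t += max(1.0, rng.normal(mean_isi, 0.1 * mean_isi))
    return np.convolve(impulses, waveform, mode="same")

rng = np.random.default_rng(0)
# (amplitude, width in ms, firing rate in Hz) for three hypothetical MUs
units = [(1.0, 1.0, 10.0), (0.5, 2.0, 15.0), (0.3, 1.5, 20.0)]
trains = [muapt(muap_waveform(a, w), r, 1.0, rng) for a, w, r in units]
emg = np.sum(trains, axis=0)  # the electrode records the sum of nearby MUAPTs
```

Each MU contributes its own train, and the recorded signal is simply their sum, which is why individual MUAPTs become hard to separate once several MUs lie near the electrode.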
Signal Recording. Based on the number of electrodes used during the recording of MUAPT, the recording techniques could be divided into mono-polar and bi-polar configurations. As shown in Figure 1, based on whether the electrodes are inserted directly into the muscles or placed on the surface of the skin, the collected signal is referred to as intramuscular EMG (iEMG) or surface EMG (sEMG), respectively. If muscle fibers belonging to multiple MUs are within the vicinity of the electrode, all MUAPTs from different MUs will be detected by the electrode. A thin and sharp needle-shaped electrode is quickly and smoothly inserted into the targeted muscle during iEMG acquisition [12]. iEMG is considered to have good spatial resolution due to the small diameter (around 0.5 mm) of the needle electrode. Individual MUAPTs could be identified by visualization. However, the effectiveness of iEMG acquisition is highly dependent on the skill of the electrodiagnostic physician. Moreover, the puncture procedure bears risks such as skin infection, severe bleeding, and muscle irritation. sEMG, on the other hand, is a non-invasive analysis tool for the human motor system that places electrodes on the surface of the skin [13]. Depending on the electrode diameter, sEMG is composed of MUAPTs from MUs in the same layer or in deeper layers, leading to a poor spatial resolution as compared to iEMG. sEMG is widely adopted for Human-Computer Interfaces (HCI) due to its ease of use and noninvasive nature.

Transfer Scenarios of EMG
Based on various factors in real usage scenarios that cause a difference between the source domain and the target domain, we summarize transfer settings in EMG-based applications as follows:
1) Electrode Variation. Electrode variation could be categorized into electrode placement shift and channel variation. Channel variation refers to the situation where some channels are missing during actual use as compared to the number of channels used while recording EMG for model training. The placement of electrodes plays a crucial role in EMG applications. However, electrode shift is inevitable when putting on and taking off EMG acquisition devices, whether in the form of an armband [8] or sockets [14].
Figure 2 provides a visualization of electrode variation in the case of an eight-channel EMG armband acquisition device. Consider the task of hand gesture recognition and a source domain associated with data collected with the electrode placement shown in Figure 2(a). A transfer learning setting is formed with a target domain consisting of the same task and data collected with the electrode placement shown in Figure 2(b) or with missing channels as in Figure 2(c).
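To make the two kinds of electrode variation concrete, the sketch below builds a shifted and a channel-deficient target domain from placeholder source data. It rests on simplifying assumptions: the armband is rotated by a whole number of electrode positions, missing channels are zero-filled, and the function names `shift_electrodes` and `drop_channels` are hypothetical.

```python
import numpy as np

def shift_electrodes(emg, n_positions):
    """Rotate the armband by whole electrode positions (circular channel shift)."""
    return np.roll(emg, n_positions, axis=1)

def drop_channels(emg, missing):
    """Zero out channels that are unavailable during actual use (an assumption)."""
    out = emg.copy()
    out[:, list(missing)] = 0.0
    return out

rng = np.random.default_rng(0)
source = rng.standard_normal((1000, 8))   # placeholder data: (samples, channels)
target_shift = shift_electrodes(source, 1)            # placement shift scenario
target_missing = drop_channels(source, missing=[2, 5])  # missing-channel scenario
```

In practice a shift is rarely an exact multiple of the electrode spacing, so real target data would fall between neighboring channels rather than on them.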

Transfer Learning in EMG Analysis
In the previous section, we introduced basic concepts of transfer learning in general and the EMG generating mechanisms along with recording techniques. These preliminaries shed light on the underlying principles of recent progress in transfer learning on EMG. In this section, we construct a categorization that best summarizes existing research endeavors of transfer learning in EMG analysis. As shown in Figure 3, we categorize existing works in EMG-related transfer learning into four lines, i.e., data-based approaches, model-based approaches, training scheme based approaches, and adversarial-based approaches. Depending on whether the approach weights the data instances or applies a feature transformation, we further divide data-based approaches into instance weighting approaches and feature-based methods. Similarly, we divide model-based approaches into parameter-based and structure-based methods. We further divide parameter-based methods into parameter sharing and fine-tuning, and structure-based methods into model ensemble and model calibration. Besides the model-based and data-based interpretations, some transfer strategies are based on specially designed training schemes or adversarial training.

Data-based Perspective
Data-based transfer learning approaches aim to reduce the data distribution difference between the source domain and the target domain via data transformation and adjustment. From a data perspective, two approaches are generally employed to accomplish the knowledge transfer objective, namely instance weighting and feature-based transformation. Following the strategies illustrated in Figure 3, we present the most relevant approaches.
3.1.1. Instance Weighting
Consider a special case of domain adaptation where $P_S(y|X) = P_T(y|X)$ and $P_S(X) \neq P_T(X)$, which is referred to as covariate shift [16]. In the transfer scenarios introduced in Section 2.3, collecting abundant data in the target domain is often prohibitive, and thus target domain instances are limited. A natural solution is to assign weights to a subset of instances from the source domain so that these source domain instances can be used along with the limited target domain data. Huang et al. proposed Kernel Mean Matching (KMM) [17] to estimate the instance weights by matching the means of the target and source domains in a Reproducing Kernel Hilbert Space (RKHS). The weighted instances from the source domain are combined with labeled target domain instances to train the target objective prediction function. Li et al. [18] proposed to use TrAdaBoost [19] along with a Support Vector Machine (SVM) to improve motion recognition performance in the inter-session scenario. Specifically, they first apply TrAdaBoost to weight the EMG data of day one and then train a target classifier with the weighted EMG from day one and EMG collected on another day. TrAdaBoost iteratively adjusts the weights of instances to decrease the negative effect of those instances on the target learner. TrAdaBoost is largely inspired by a boosting algorithm called AdaBoost [20]. AdaBoost iteratively trains weak classifiers with updated weights. The weighting mechanism of AdaBoost is that misclassified instances are given more attention during the training of the next weak learner in the following iteration. The weighting mechanism of TrAdaBoost, in contrast, is designed to reduce the distribution difference between the source domain and the target domain.
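The contrast between the two weighting mechanisms can be sketched with one re-weighting step in the spirit of TrAdaBoost [19] for binary labels. This is an illustrative sketch, not the implementation of Li et al. [18]: the function name `tradaboost_update`, the 0/1 misclassification arrays, and the clipping of the error rate are assumptions. Misclassified target instances gain weight (as in AdaBoost), while misclassified source instances lose weight because they are assumed to fit the target distribution poorly.

```python
import numpy as np

def tradaboost_update(w_src, w_tgt, miss_src, miss_tgt, n_rounds):
    """One re-weighting step; miss_* are 0/1 arrays marking misclassified instances."""
    w_src, w_tgt = np.asarray(w_src, float), np.asarray(w_tgt, float)
    miss_src, miss_tgt = np.asarray(miss_src, float), np.asarray(miss_tgt, float)
    # Weighted error of the current weak learner, measured on target data only.
    eps = np.sum(w_tgt * miss_tgt) / np.sum(w_tgt)
    eps = np.clip(eps, 1e-10, 0.499)  # assumption: clip so that 0 < beta_tgt < 1
    beta_tgt = eps / (1.0 - eps)
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(len(w_src)) / n_rounds))
    new_src = w_src * beta_src ** miss_src      # shrink misclassified source weights
    new_tgt = w_tgt * beta_tgt ** (-miss_tgt)   # grow misclassified target weights
    return new_src, new_tgt

new_src, new_tgt = tradaboost_update(
    w_src=np.ones(4), w_tgt=np.ones(4),
    miss_src=[1, 0, 0, 1], miss_tgt=[0, 1, 0, 0], n_rounds=10)
```

In a full training loop the weights would be normalized and a new weak learner (for example an SVM accepting sample weights) would be trained on the combined, weighted data in every round.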

Feature Based Strategy
Feature-based approaches map each original feature into a new feature representation either by linearly transforming the original feature or non-linearly transforming the original feature to enable knowledge transfer.
Linear Transformation. Lin et al. [21] proposed a normalization-based approach called Referencing Normalisation to reduce the distribution difference among domains for inter-subject sEMG-based hand gesture classification. Specifically, data from the source domain are mapped to the range of the target domain data:
$$\tilde{X}_S = \frac{X_S - \min(X_S)}{\max(X_S) - \min(X_S)} \cdot \left(\max(X_T) - \min(X_T)\right) + \min(X_T),$$
where $\tilde{X}_S$ is the transformed source domain data.
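The range mapping described above can be sketched as follows. The sketch assumes a per-feature min-max mapping, which is one plausible reading of the method rather than the exact procedure of Lin et al. [21]; the function name `referencing_normalisation` and the guard against constant features are assumptions.

```python
import numpy as np

def referencing_normalisation(x_src, x_tgt):
    """Min-max map every source feature onto the range of the same target feature."""
    src_min, src_max = x_src.min(axis=0), x_src.max(axis=0)
    tgt_min, tgt_max = x_tgt.min(axis=0), x_tgt.max(axis=0)
    # Guard against constant source features (assumption: leave them at tgt_min).
    span = np.where(src_max > src_min, src_max - src_min, 1.0)
    return (x_src - src_min) / span * (tgt_max - tgt_min) + tgt_min

rng = np.random.default_rng(0)
x_src = rng.normal(0.0, 1.0, size=(200, 8))  # placeholder source features
x_tgt = rng.normal(2.0, 3.0, size=(50, 8))   # placeholder target features
x_src_mapped = referencing_normalisation(x_src, x_tgt)
```

Because the mapping only uses minima and maxima, no target labels are needed, but a single outlier in either domain can distort the mapped range.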
In addition to directly applying a linear transformation to normalize the data to the target domain range, the authors of [22][23][24][25] attempted to reduce the distribution gap based on statistical features such as covariance and mean. Conventional classifiers such as Linear Discriminant Analysis (LDA) [26], Quadratic Discriminant Analysis (QDA) [27] and the Polynomial Classifier (PC) [28] are commonly adopted for sEMG classification tasks. The covariance matrix, the mean vector, and the prior are the discriminant variables of LDA and QDA classifiers. Define $\Sigma_S$, $\Sigma_T$, $\mu_S$, $\mu_T$ to be the covariance matrices and mean vectors of data from the source domain and target domain, respectively. The transfer learning process of LDA- and QDA-based classifiers could be defined with a convex interpolation:
$$\tilde{\Sigma} = (1 - \alpha)\,\Sigma_S + \alpha\,\Sigma_T, \qquad \tilde{\mu} = (1 - \beta)\,\mu_S + \beta\,\mu_T,$$
where $\alpha, \beta \in [0, 1]$ are the trade-off parameters that balance the knowledge from the source and target domains, and $\tilde{\Sigma}$ and $\tilde{\mu}$ represent the adapted covariance matrix and mean vector. The optimal values for $\alpha$ and $\beta$ are set empirically or via grid search with a fixed step size.
Liu et al. [23] also proposed to use transfer learning on PC for the inter-session transfer scenario on both intact-limbed and amputee subjects. Let $M$ be the polynomial expansion matrix of the training data, from which an optimal weight matrix $W$ is estimated. The transfer learning process based on PC then combines the weight matrices of individual sessions, where $W_i$ and $\beta_i$ are the optimal weight matrix for the $i$-th session and the corresponding weight ratio, $\bar{W}$ represents the optimal weight matrix on the new session, and $\tilde{W}$ represents the adapted weight matrix. It is worth noting that distance measurements such as the Kullback-Leibler divergence [29] could be used to select the source domain that is most similar to the target domain, in order to avoid negative transfer when multiple source domains are available [30].
Next, we review the main bio-inspired research endeavors under the linear assumption. As discussed in Section 2.2, EMG signals are composed of superimposed MUAPTs generated from different MUs in both temporal and spatial domains. Muscle Synergy Modeling (MSM) [31][32][33][34] has shown great success in modeling the linear relationship between the MUAPTs of muscles and the collected EMG signal. Let $x_m(t)$ be the generated MUAPTs from the $m$-th muscle and define $act_i(t) \in \mathbb{R}$ to be the activation signals. Then $x_m(t)$ could be expressed as:
$$x_m(t) = \sum_{i=1}^{N} g_{mi}\, act_i(t),$$
where $g_{mi}$ is the gain factor of muscle $m$ transferred to the $i$-th activation signal, with $N < M$. Assuming that only attenuation exists with distance but no filtering effect, the observed EMG signal at the $k$-th electrode ($k$-th channel) is written as:
$$y_k(t) = \sum_{m=1}^{M} l_{km}\, x_m(t) = \sum_{m=1}^{M} \sum_{i=1}^{N} l_{km}\, g_{mi}\, act_i(t) = \sum_{i=1}^{N} a_{ki}\, act_i(t),$$
where $l_{km}$ is the factor that reflects the attenuation level from the $m$-th muscle on the $k$-th electrode and $a_{ki}$ is the combined weight factor that models both $l_{km}$ and $g_{mi}$. The above mixture could be written in matrix form:
$$Y(t) = W \cdot F(t),$$
where $W \in \mathbb{R}^{K \times N}$ is the weighting matrix with entries $a_{ki}$ and $F$ is the synergy matrix. In EMG analysis, only $Y$ is observed, so solving for $W$ and $F$ becomes a linear blind source separation (BSS) problem [35]. Non-negative matrix factorization (NMF) [36] finds an approximate solution to this matrix equation under the constraint that all elements are non-negative.
Jiang et al. [37] proposed correlation-based data weighting (COR-W) for the inter-subject transfer scenario of elbow torque modeling. Specifically, they assume that the target domain data is a linear transformation of the source domain data, $X_T \approx \tilde{X}_S = A X_S$, where $\tilde{X}_S$ is the transformed source domain data. The underlying assumption is that the synergy matrix remains the same for both domains while the weighting matrix varies. A derived assumption of Jiang et al. is that the covariance matrix of the transformed source domain data should also be similar to the covariance matrix of the target domain data. The optimal matrix $A^*$ is estimated by minimizing the discrepancy between $\tilde{\Sigma}_S$ and $\Sigma_T$. The transformed source data is then used to re-train the model. Although Jiang et al. proposed COR-W for the inter-subject transfer scenario, we argue that the linear assumption might not hold there due to variation across subjects. Electrode shift, on the other hand, is more consistent with the linear assumption in practice. Günay et al. [38] adopted MSM with NMF for knowledge transfer across different tasks. The weighting matrix $W$ calculated on the source domain is kept constant while the synergy matrix is re-estimated on the target domain data using the non-negative least squares (NNLS) algorithm.
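The synergy-based transfer described above (factorize on the source domain, keep the weighting matrix, re-estimate the synergy matrix on the target domain) can be sketched with NumPy only. Two substitutions are made and should be read as assumptions: NMF is implemented with the classic multiplicative update rules, and the target-domain re-estimation uses the same multiplicative rule with the weighting matrix held fixed instead of a dedicated NNLS solver. The data are synthetic and the names `nmf` and `reestimate_synergy` are hypothetical.

```python
import numpy as np

EPS = 1e-9  # avoids division by zero in the multiplicative updates

def nmf(Y, n_syn, n_iter=500, seed=0):
    """Approximate Y (channels x time) by W @ F with non-negative W and F."""
    rng = np.random.default_rng(seed)
    W = rng.random((Y.shape[0], n_syn)) + EPS
    F = rng.random((n_syn, Y.shape[1])) + EPS
    for _ in range(n_iter):
        F *= (W.T @ Y) / (W.T @ W @ F + EPS)
        W *= (Y @ F.T) / (W @ F @ F.T + EPS)
    return W, F

def reestimate_synergy(Y_tgt, W_fixed, n_iter=500, seed=0):
    """Keep the source-domain W fixed and fit only F to the target-domain data."""
    rng = np.random.default_rng(seed)
    F = rng.random((W_fixed.shape[1], Y_tgt.shape[1])) + EPS
    for _ in range(n_iter):
        F *= (W_fixed.T @ Y_tgt) / (W_fixed.T @ W_fixed @ F + EPS)
    return F

rng = np.random.default_rng(1)
W_true = rng.random((8, 3))                 # 8 channels, 3 synthetic components
Y_src = W_true @ rng.random((3, 200))       # synthetic source-domain envelopes
Y_tgt = W_true @ rng.random((3, 200))       # synthetic target-domain envelopes
W_hat, F_src_hat = nmf(Y_src, n_syn=3)
F_tgt_hat = reestimate_synergy(Y_tgt, W_hat)
src_err = np.linalg.norm(Y_src - W_hat @ F_src_hat) / np.linalg.norm(Y_src)
```

Since NMF solutions are not unique, the recovered `W_hat` generally differs from `W_true` by more than a permutation and scaling, which is one reason the quality of the transfer depends on how well the linear assumption holds.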
In contrast to the works that map the source domain data to a new space, another line of work [39][40][41] transforms the target domain data so that the source domain objective prediction function is applicable again. Prahm et al. [39] viewed the target domain data as a disturbed version of the source domain data. The disturbance can be expressed as a linear transformation matrix $A$. The main aim is then to learn and apply an inverse disturbance matrix $A^{-1}$ to the target data such that the disturbance is removed. Prahm et al. [39] adopted Generalized Matrix Learning Vector Quantization (GMLVQ) [42] as the classifier and estimated the optimal $A^{-1}$ using gradient descent on the GMLVQ cost function. The linear transformation that maximizes the likelihood of the disturbed data based on the undisturbed data could also be estimated by the Expectation-Maximization (EM) algorithm [41,43]. Following their previous work [39,41], Prahm et al. [40] proposed that the linear transformation matrix could be further exploited based on the prior knowledge that the underlying EMG device is an armband with eight uniformly distributed channels. For the electrode shift scenario, Prahm et al. assumed that the disturbed feature from channel $j$ could be linearly interpolated from the neighboring channels in both directions with a mixing ratio $r$. The approximation of the linear transformation matrix is then reduced to finding an optimal mixing ratio $r$.
Non-linear Transformation. The principal objective of feature transformation is to reduce the data distribution difference between the source and target domains. Thus, the metrics for measuring distribution difference are essential. Maximum Mean Discrepancy (MMD) [44] is widely adopted in the field of transfer learning:
$$MMD(X_S, X_T) = \left\| \frac{1}{N_S} \sum_{i=1}^{N_S} \Phi(x_i^S) - \frac{1}{N_T} \sum_{j=1}^{N_T} \Phi(x_j^T) \right\|_{\mathcal{H}}^{2},$$
where $\Phi$ indicates a non-linear mapping to the Reproducing Kernel Hilbert Space (RKHS) [45], and $N_S$ and $N_T$ indicate the number of instances in the source and target domains, respectively. Essentially, MMD quantifies the distribution difference by calculating the distance between the mean vectors of the features in an RKHS. In addition to MMD, the Kullback-Leibler divergence, the Jensen-Shannon (JS) divergence [46] and the Wasserstein distance [47] are also common distance measurement criteria.
The Siamese architecture [48,49] is one commonly adopted architecture for DNN-related transfer learning, as illustrated in Figure 4. One such work applied a fast Fourier transform (FFT) to each data segment and used the spectrum as input to a purpose-designed CNN-based network. Similar to [50], the MMD loss is applied to the output of the second fully connected layer. A Regression Contrastive Loss is proposed to minimize the distance in the feature space between source domain and target domain instances of the same category. Normalization tricks are adopted to modify the loss for regression tasks.
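As a rough illustration of the MMD criterion discussed above, the squared MMD can be estimated with the kernel trick, so that the mapping $\Phi$ never has to be computed explicitly. The sketch below uses the biased estimator with an RBF kernel; the kernel choice, the bandwidth `gamma`, and the synthetic feature matrices are assumptions.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Pairwise RBF kernel values k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(x_src, x_tgt, gamma=0.25):
    """Biased estimate of the squared MMD between two sets of feature vectors."""
    return (rbf_kernel(x_src, x_src, gamma).mean()
            + rbf_kernel(x_tgt, x_tgt, gamma).mean()
            - 2.0 * rbf_kernel(x_src, x_tgt, gamma).mean())

rng = np.random.default_rng(0)
feat_src = rng.normal(0.0, 1.0, size=(200, 4))       # placeholder source features
feat_same = rng.normal(0.0, 1.0, size=(200, 4))      # same distribution as source
feat_shifted = rng.normal(1.0, 1.0, size=(200, 4))   # mean-shifted "target" features
mmd_same = mmd2(feat_src, feat_same)
mmd_shifted = mmd2(feat_src, feat_shifted)
```

When MMD is used as a training loss, the same quantity is computed on mini-batches of intermediate network features and added to the task loss with a weighting factor.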
Côté-Allard et al. [52,53] proposed to use the Progressive Neural Network (PNN) [54] to alleviate the catastrophic forgetting caused by directly fine-tuning the network parameters with data from the target domain. As shown in Figure 5, a source domain network is first trained with data from the source domain. The model parameters of the source domain network are then fixed while the parameters of the target domain network are randomly initialized. Note that the structures of both networks are exactly the same; only the model parameters differ. During the transfer learning process, target domain instances are fed to both networks. The intermediate features of each module of the source domain network are then merged with the corresponding features of the target domain network and fed forward to the next module of the target domain network. The underlying hypothesis is that although distribution variation exists between the source and target domains, generic and robust features could be extracted for more effective representation learning.
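The two-column forward pass described above can be sketched in NumPy. This is a simplified illustration under stated assumptions and not the architecture of Côté-Allard et al.: each "module" is a single dense layer, the features are merged by adding a linearly mapped copy of the source features (`lateral_w`), and all layer sizes and names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def init_column(sizes, rng):
    """Random dense layers; stands in for a trained (source) or new (target) column."""
    return [rng.standard_normal((i, o)) * 0.1 for i, o in zip(sizes[:-1], sizes[1:])]

def pnn_forward(x, src_w, tgt_w, lateral_w):
    """Feed x through both columns; source features feed the next target layer."""
    h_src, h_tgt = x, x
    last = len(tgt_w) - 1
    for layer, (ws, wt) in enumerate(zip(src_w, tgt_w)):
        z_tgt = h_tgt @ wt
        if layer > 0:
            # Merge: add source-column features from the previous module.
            z_tgt = z_tgt + h_src @ lateral_w[layer - 1]
        h_src = relu(h_src @ ws)          # frozen source column, never updated
        h_tgt = z_tgt if layer == last else relu(z_tgt)
    return h_tgt

rng = np.random.default_rng(0)
sizes = [16, 32, 32, 7]                   # input dim, two hidden layers, 7 classes
src_w = init_column(sizes, rng)           # assumed pre-trained and frozen
tgt_w = init_column(sizes, rng)           # trainable, randomly initialized
lateral_w = init_column(sizes[1:], rng)   # trainable lateral connections
x_tgt = rng.standard_normal((5, 16))      # a batch of target-domain instances
logits = pnn_forward(x_tgt, src_w, tgt_w, lateral_w)
```

During transfer only `tgt_w` and `lateral_w` would receive gradient updates, so the source knowledge stored in `src_w` cannot be overwritten.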
Du et al. [55] proposed to adopt Adaptive Batch Normalization (AdaBN) [56] for inter-session transfer learning. AdaBN is a lightweight transfer learning approach for DNNs based on Batch Normalization (BN) [57]. BN was initially proposed to accelerate the convergence of DNN training. Formally, define $Z = [z_i]_{i=1}^{B}$ to be a batch of intermediate features of instances with batch size $B$; the BN layer transforms $Z$ as follows:
$$\hat{z}_i = \gamma \cdot \frac{z_i - \mathbb{E}[Z]}{\sqrt{Var[Z]}} + \beta,$$
where $\gamma$ and $\beta$ are learnable parameters and $Var$ stands for variance. The underlying hypothesis is that label-related knowledge is stored in the network parameters of each layer, while domain-related knowledge is portrayed by the statistics of the BN layers. The transformation ensures that the distribution of each layer remains the same over mini-batches, so that each layer of the network receives input of a similar distribution regardless of whether it comes from the source or the target domain. Unlike fine-tuning, AdaBN does not require target domain labels for knowledge transfer, and only a small fraction of the network needs to be updated.
In particular, the network is first pre-trained on source domain data. During the training process, the statistics of the BN layers are calculated by applying a moving average over all data batches. During transfer learning, all network parameters are fixed except for the BN layer statistics. Updating the BN statistics to the target domain data can easily be done by a forward pass.
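The adaptation step can be sketched for a single BN layer as follows. It is a minimal illustration under assumptions: the features are synthetic, `gamma` and `beta` are left at their initial values, a small `eps` is added for numerical stability as is common in BN implementations, and the function names are hypothetical. In a real network the statistics of every BN layer would be recomputed in the same way during one forward pass over unlabeled target data.

```python
import numpy as np

def bn_forward(z, mean, var, gamma, beta, eps=1e-5):
    """Batch normalization with given statistics and learned scale/shift."""
    return gamma * (z - mean) / np.sqrt(var + eps) + beta

def adabn_statistics(target_features):
    """AdaBN step: replace the BN statistics with those of the target domain."""
    return target_features.mean(axis=0), target_features.var(axis=0)

rng = np.random.default_rng(0)
z_src = rng.normal(0.0, 1.0, size=(512, 16))   # intermediate features, source domain
z_tgt = rng.normal(2.0, 3.0, size=(512, 16))   # same layer, shifted target domain
gamma, beta = np.ones(16), np.zeros(16)        # learned on source, kept fixed

src_mean, src_var = z_src.mean(axis=0), z_src.var(axis=0)
before = bn_forward(z_tgt, src_mean, src_var, gamma, beta)   # mismatched statistics
tgt_mean, tgt_var = adabn_statistics(z_tgt)                  # no labels required
after = bn_forward(z_tgt, tgt_mean, tgt_var, gamma, beta)    # re-normalized features
```

With the source statistics the target features reach the next layer with a shifted mean and inflated variance, whereas after the update they are again approximately zero-mean and unit-variance.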

Model Based Perspective
From the model perspective, transfer learning approaches can also be interpreted in terms of model parameters and model structures.
3.2.1.Parameter Fine-tuning One intuitive way of transferring knowledge of DNN is to tune the network parameters of the source learner using data from the target domain.Finetuning [58] refers to the training process where the network is first trained on one dataset (large-scale) and use the network parameters as initialization to further train on another dataset (small scale).Fine-tuning is a common strategy in the Computer Vision (CV) community where the neural networks are first pre-trained on ImageNet (IN) either in a supervised manner or self-supervised manner and later fine-tuned for various downstream tasks such as classification [59] and object detection [60].IN  The weights of the backbone modules are first copied to the target domain network and frozen.The term 'module' refers to a combination of layers that might contain convolution, normalization, or residual connection.FC stands for the fully connected layer.The weights of the prediction head are randomly initialized and trained from scratch.objects, animals, and humans.Since the gap between the source domain (natural scenes) and the target domain (spectrum image) is tremendous, it is questionable as to what knowledge is transferable.Phoo et al. 
[64] compared the transfer performance of using miniIN (a small subset of IN) as source domain and using IN as source domain to ChestX (X-ray images for chest) [65] as target domain.Experimental results show that pre-training on IN yields no better performance than on miniIN and both yields poor diagnosis accuracy.This suggests that more data does not help improve the generalization ability, given that no more informative knowledge can be extracted from the source domain to benefit the target domain learner.Pretraining the network on the source domain and then using the pre-trained weights to initialize the neural network for further training using the target domain data is another popular finetuning strategy for EMG transfer learning [24,[66][67][68][69].There would be little constraint nor assumption on the transfer scenarios since this transfer process is simple and can be viewed as sequentially train the network with two datasets.When there are EMG data recorded from multiple subjects or sessions, it is possible to combine the data and treat the combined data as the source domain [70,71].Or it is also a solution to train a unique model for each subject or session and to select a certain number of models that give the best performance on the target domain [72,73], the selected models are then fine-tuned on the target dataset to provide final prediction based on majority voting [74].However, fine-tuning suffers from the catastrophic forgetting, meaning that knowledge from the source domain will be forgotten by the neural network rapidly upon the introduction of target domain data [75].Besides the parameters fine-tuning of DNNs, the parameters of Decision Trees [76] (DTs) could also be fine-tuned for EMG transfer learning [77].The motivation is that the structure of decision trees for similar tasks should be similar and the domain difference is reflected from different decision threshold values associated with the features.Structure Transfer (STRUT) [78] first 
discards all the numeric threshold values of the trees learned on the source domain data and then, working top-down, selects a new threshold value τ(ν) for each node ν using the subset of target examples that reach ν. Any node ν that no target domain data reach is considered unreachable and is pruned. Define τ to be the threshold value of feature φ at node ν that splits any set of labeled data S_ν into two subsets, denoted S_L and S_R. P_L and P_R denote the label distributions of S_L and S_R, respectively. STRUT aims to find a new threshold τ with maximum Divergence Gain (DG), subject to the condition that the new thresholds are local maxima of Information Gain (IG) [76]:
$$DG = 1 - \frac{\|S_L^T\|}{\|S_\nu^T\|}\,\mathrm{JSD}\left(P_L^T, P_L^S\right) - \frac{\|S_R^T\|}{\|S_\nu^T\|}\,\mathrm{JSD}\left(P_R^T, P_R^S\right),$$
where ‖·‖ stands for the cardinality, JSD denotes the Jensen-Shannon divergence between two label distributions, and the superscripts S and T stand for the source and target, respectively.
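The threshold search can be illustrated with a minimal sketch. It assumes that DG takes the form of one minus the size-weighted Jensen-Shannon divergence between the target and source label distributions on each side of the split, and it omits the Information Gain condition for brevity. The function names, the candidate thresholds, and the toy samples are hypothetical and only serve the illustration.

```python
import math
from collections import Counter


def label_distribution(labels, classes):
    """Empirical label distribution over a fixed class order."""
    if not labels:
        # Empty side of the split: fall back to a uniform distribution.
        return [1.0 / len(classes)] * len(classes)
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]


def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logarithms (range [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def divergence_gain(target, threshold, p_left_src, p_right_src, classes):
    """DG of splitting target samples (feature value, label) at `threshold`."""
    left = [y for x, y in target if x <= threshold]
    right = [y for x, y in target if x > threshold]
    p_left_tgt = label_distribution(left, classes)
    p_right_tgt = label_distribution(right, classes)
    return (1.0
            - len(left) / len(target) * jsd(p_left_tgt, p_left_src)
            - len(right) / len(target) * jsd(p_right_tgt, p_right_src))


classes = ["rest", "grasp"]
# Hypothetical label distributions at the two children of a source-tree node.
p_left_src, p_right_src = [1.0, 0.0], [0.0, 1.0]
# Hypothetical target samples that reach the node: (feature value, label).
target = [(0.1, "rest"), (0.2, "rest"), (0.6, "grasp"), (0.7, "grasp")]
candidates = [0.15, 0.4, 0.65]
best = max(candidates,
           key=lambda t: divergence_gain(target, t, p_left_src, p_right_src, classes))
```

In this toy case the candidate 0.4 reproduces the source label distributions on both sides of the split, so it attains the maximum DG of 1.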

Parameter Sharing
The neural network architectures are not specified in Section 3.2.1, since parameter fine-tuning tunes all parameters of the network regardless of the network design. As stated there, fine-tuning the whole network suffers from catastrophic forgetting, and knowledge learned from the source domain is quickly forgotten. In most of the works [24,66-69] that adopt fine-tuning, the target domain dataset is of the same size as the source domain dataset. When the target domain dataset is small compared to the source domain dataset and the knowledge from the source domain has been forgotten, the neural network is prone to over-fitting [79]. A possible solution is to freeze part of the network parameters and to update only the remaining parameters during the fine-tuning process. An illustration of knowledge transfer via parameter sharing is provided in Figure 6.
A neural network design could be roughly divided into the backbone and the prediction head.The backbone serves as the feature extractor and is usually CNN based or Recurrent Neural Networks (RNN) based.The prediction head is usually composed of fully connected layers and predicts the desired labels based on the deep features extracted by the backbone.
Assuming that the extracted deep features are generic across transfer scenarios, the weights of the backbone can be frozen once pre-trained on the source domain dataset to prevent catastrophic forgetting [80-86]. Only the fully connected layers of the prediction head need to be updated, which reduces the transfer training time and speeds up convergence.
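The frozen-backbone idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the setup of any cited work: the "backbone" is a single linear layer with tanh, its "pre-trained" weights are hypothetical fixed numbers, and the prediction head is a logistic regression trained from a random initialization.

```python
import math
import random


def backbone(x, w):
    """Frozen feature extractor: one linear layer followed by tanh."""
    return [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]


def head(f, v, b):
    """Trainable prediction head: logistic regression on the deep features."""
    z = sum(vi * fi for vi, fi in zip(v, f)) + b
    return 1.0 / (1.0 + math.exp(-z))


def bce(p, y):
    """Binary cross-entropy with a small epsilon for numerical safety."""
    eps = 1e-12
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def transfer(w_pretrained, data, epochs=200, lr=0.5, seed=0):
    rng = random.Random(seed)
    w = [row[:] for row in w_pretrained]       # copied from the source model, never updated
    v = [rng.uniform(-0.1, 0.1) for _ in w]    # head weights: random initialization
    b = 0.0
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, y in data:
            f = backbone(x, w)
            p = head(f, v, b)
            total += bce(p, y)
            g = p - y                          # gradient of BCE w.r.t. the head's pre-activation
            v = [vi - lr * g * fi for vi, fi in zip(v, f)]
            b -= lr * g
        losses.append(total / len(data))
    return w, v, b, losses


# Hypothetical pre-trained backbone weights and a tiny labeled target dataset.
w_pre = [[1.0, -1.0], [0.5, 0.5]]
target_data = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([0.8, 0.2], 1), ([0.1, 0.9], 0)]
w, v, b, losses = transfer(w_pre, target_data)
```

Only `v` and `b` change during training; unfreezing `w` as well would turn the same loop into plain fine-tuning of the whole network.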

Model Structure Calibration
Besides knowledge transfer via trained parameters, we next explore the possibility of EMG transfer learning from the model structure perspective. Since labeled data in the target domain are often scarce, they might not be sufficient to construct a reliable high-performance model solely from the target domain data; optimizing the structure of a pre-trained model to fit the target domain data is therefore desired. As mentioned in the previous section, DNNs are believed to be able to extract generic features; it is therefore impractical and time-consuming to alter, or even search for, neural network structures using Neural Architecture Search (NAS) [87] for each domain. Random Forest (RF) [88], on the other hand, is more suitable for structure calibration, since knowledge transfer can be done by pruning or growing the source tree model. Marano et al. [77] proposed to use structure expansion/reduction (SER) [78] for EMG based hand prosthesis control. As the name suggests, the SER algorithm contains two phases: expansion and reduction. Consider an initial random forest that is induced using the source domain data. In the expansion phase, SER first finds all labeled data points in the target domain dataset that reach node ν and then extends ν into a full tree. The reduction phase then reduces the model structure in a bottom-up fashion. Define E_sub to be the empirical error of the subtree with root node ν, and let E_leaf denote the empirical error on node ν if ν were to be pruned to a leaf node. The subtree is pruned into a leaf node if
$$E_{leaf} < E_{sub}.$$
SER is performed on each decision tree separately, and the resulting random forest is the adapted model for the target domain data.
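The reduction phase can be sketched as a bottom-up recursion over one tree. The sketch assumes that a subtree is pruned when the leaf error is strictly smaller than the subtree error on the target samples reaching the node; the dictionary-based tree layout, the function names, and the toy data are hypothetical, and the expansion phase is not shown.

```python
from collections import Counter


def predict(node, x):
    """Route a sample down the tree until a leaf is reached."""
    while "label" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


def reduce_tree(node, data):
    """Bottom-up reduction using the target samples `data` that reach `node`."""
    if "label" in node or not data:
        # Leaves, and nodes that no target sample reaches, are left unchanged here.
        return node
    left = [(x, y) for x, y in data if x[node["feature"]] <= node["threshold"]]
    right = [(x, y) for x, y in data if x[node["feature"]] > node["threshold"]]
    # Children are reduced first, so pruning decisions propagate upwards.
    node = dict(node,
                left=reduce_tree(node["left"], left),
                right=reduce_tree(node["right"], right))
    majority = Counter(y for _, y in data).most_common(1)[0][0]
    e_leaf = sum(y != majority for _, y in data)         # error if pruned to a leaf
    e_sub = sum(predict(node, x) != y for x, y in data)  # error of the subtree
    return {"label": majority} if e_leaf < e_sub else node


# Hypothetical source tree over two features.
source_tree = {
    "feature": 0, "threshold": 0.5,
    "left": {"label": "rest"},
    "right": {"feature": 1, "threshold": 0.5,
              "left": {"label": "fist"}, "right": {"label": "open"}},
}
# Hypothetical target data: the second feature no longer separates any classes.
target = [([0.2, 0.1], "rest"), ([0.3, 0.9], "rest"),
          ([0.8, 0.2], "fist"), ([0.9, 0.7], "fist"), ([0.7, 0.8], "fist")]
adapted = reduce_tree(source_tree, target)
```

Here the right subtree misclassifies two target samples while a single "fist" leaf misclassifies none, so the subtree collapses into a leaf and the root split is kept.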

Model Ensemble
Combining data from various sources into a single source domain may not yield satisfactory results, since the distributions of these domains might vary greatly from each other. Another commonly adopted strategy for EMG transfer learning is model ensemble, which aims to combine a set of weak learners to make the final prediction. Some previously reviewed EMG transfer learning approaches already adopt this strategy. For instance, Kim et al. [72] proposed to train a unique classifier for each subject and further fine-tune the ten best-performing classifiers on a new target subject. The final prediction is the class most commonly predicted by the ensemble of the ten fine-tuned classifiers.
Decision trees are another popular choice of weak learner. Zhang et al. [89] proposed a feature incremental and decremental learning method (FIDE) based on Stratified Random Forest (SRF) for knowledge transfer with missing or added electrodes. Specifically, define S_i and S_j to be the electrode sketch scores [90] of electrodes e_i and e_j, respectively. The distribution difference (DD) between electrodes e_i and e_j is defined in terms of ρ and ψ, where ρ(·) stands for the Pearson Correlation Coefficient (PCC) and ψ denotes the inverse of the Euclidean distance between e_i and e_j. K-means [91] is then utilized to cluster the electrodes into K clusters based on the DD. Let M denote the number of weak learners in the ensemble model. The SRF is built on the source domain data such that M/K trees are induced using data collected with the electrodes in each cluster. If electrode i is missing in the target domain data, the missing features can be recovered from the most similar electrode j. If there are additional electrodes in the target domain dataset, FIDE first selects the set of weak learners to be updated based on a performance score computed for each tree, where h_m stands for the m-th decision tree, #feature_m denotes the number of features used by h_m, and #feature denotes the total number of features. The top M·δ weak learners are then selected to be updated, where δ ∈ [0, 1]. The SER and STRUT algorithms [78] introduced in the previous sections are again used for transfer learning on the decision trees. Compared to the majority voting form of ensemble, FIDE updates the source domain model to extract new knowledge from the target domain data without abandoning the knowledge already learned.
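The select-then-vote ensemble described for Kim et al. can be sketched as follows. The per-subject classifiers are stood in for by hypothetical one-dimensional threshold rules, the fine-tuning step is omitted, and all names and numbers are illustrative assumptions.

```python
from collections import Counter


def make_threshold_classifier(t):
    """Hypothetical per-subject classifier on a single scalar feature."""
    return lambda x: "grasp" if x > t else "rest"


def accuracy(clf, data):
    return sum(clf(x) == y for x, y in data) / len(data)


def select_top_k(classifiers, target_data, k):
    """Keep the k source classifiers that perform best on the target data."""
    return sorted(classifiers, key=lambda c: accuracy(c, target_data), reverse=True)[:k]


def majority_vote(classifiers, x):
    """Final prediction: the class predicted by most ensemble members."""
    votes = Counter(c(x) for c in classifiers)
    return votes.most_common(1)[0][0]


# One classifier per (hypothetical) source subject.
source_classifiers = [make_threshold_classifier(t) for t in (0.1, 0.4, 0.5, 0.6, 0.9)]
# A few labeled calibration samples from the new target subject.
target_data = [(0.2, "rest"), (0.45, "rest"), (0.55, "grasp"), (0.8, "grasp")]

ensemble = select_top_k(source_classifiers, target_data, k=3)
prediction_a = majority_vote(ensemble, 0.58)
prediction_b = majority_vote(ensemble, 0.42)
```

With these numbers the classifiers with thresholds 0.5, 0.4 and 0.6 are kept, and each of the two queries is decided by a two-to-one vote. Using an odd ensemble size avoids ties in the binary case.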

Training-scheme Based Perspective
In addition to the previously mentioned approaches that can be subsumed into pre-defined paradigms, we also review works that design special training schemes for EMG transfer learning. Zhai et al. [92] proposed a self re-calibration approach for inter-session hand prosthesis control. In particular, a source domain classifier is first trained with EMG data of existing sessions. Given the target domain data, each EMG data segment x_i is assigned a predicted label y_i by a forward pass through the source domain classifier. Based on the assumption that temporally adjacent EMG segments are likely to be generated by the same hand movement, the assigned labels are re-calibrated with majority voting:
$$y_i \leftarrow \mathrm{MajorityVote}\big(f_S(x_{i-k}), \ldots, f_S(x_i), \ldots, f_S(x_{i+k})\big),$$
where f_S is the source domain classifier and k indicates the number of neighboring segments, in each direction in time before and after x_i, that are used to re-calibrate the label. The target domain data with re-calibrated labels are then used to update the source domain classifier. It is worth noting that such a transfer scheme does not require labeled target domain data and can be easily adopted for day-to-day re-calibration.
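The label re-calibration step amounts to a sliding-window majority vote over the predicted label sequence. A minimal sketch, assuming the window is simply truncated at the start and end of the recording (a detail the text does not specify); the function name and the label sequence are hypothetical.

```python
from collections import Counter


def recalibrate(predictions, k):
    """Replace each predicted label by the majority label of its 2k+1 neighborhood."""
    n = len(predictions)
    calibrated = []
    for i in range(n):
        # Window of up to k segments before and after segment i (truncated at the edges).
        window = predictions[max(0, i - k): min(n, i + k + 1)]
        # Ties, possible in even-sized edge windows, go to the label seen first.
        calibrated.append(Counter(window).most_common(1)[0][0])
    return calibrated


# Hypothetical source-classifier outputs for ten consecutive EMG segments.
predicted = ["fist", "fist", "open", "fist", "fist",
             "open", "open", "fist", "open", "open"]
calibrated = recalibrate(predicted, k=1)
```

The two isolated predictions (index 2 and index 7) are overwritten by their neighbors, giving five "fist" labels followed by five "open" labels. The calibrated pairs can then be used to update the classifier.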
Meta-learning [93] is another training paradigm that can be used for EMG transfer learning. Meta-learning is commonly known as learning to learn [94]. In contrast to conventional machine learning algorithms that optimize the model over one learning episode, meta-learning improves the model over multiple learning episodes. The meta-learning goal of generalizing the model to the new task of an incoming learning episode with limited samples aligns well with the notion of transfer learning. Intuitively speaking, meta-learning divides the source domain data into multiple learning episodes, each containing a few samples, and mimics the transfer process during training so that the trained model transfers well to the true target domain. Rahimian et al. [95] proposed a meta-learning based training scheme called Few-Shot Hand Gesture Recognition (FHGR) for the transfer case where only a minimal amount of target domain data is available for re-calibration. Define an N-way K-shot few-shot learning problem, and let T_j = {D^train_j, D^test_j, L} denote a task associated with the source domain dataset, where D^train_j = {(x_i, y_i)}, i = 1, ..., N×K, is the training set of the task, D^test_j is its test set, and L is a loss function that measures the error between the prediction and the ground-truth label. Please be aware that the task T here follows a naming convention of the meta-learning area and has a different meaning from the task that we define for a domain. FHGR aims to predict the labels of D^test_j based on the samples seen in D^train_j, which consists of K samples from each of the N classes, over a set of tasks sampled from p(T). Pseudocode in the MAML style [96] is provided in Algorithm 1.
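The inner and outer loops of a MAML-style scheme can be made concrete with a deliberately small stand-in problem. This sketch is not FHGR: it assumes a scalar regression model f(x) = Θ·x, tasks that differ only in their slope, and an outer update averaged over the task batch. All names and hyperparameter values are hypothetical.

```python
import random


def loss_and_grad(theta, data):
    """MSE loss and its gradient for the scalar model f(x) = theta * x."""
    n = len(data)
    loss = sum((theta * x - y) ** 2 for x, y in data) / n
    grad = sum(2 * (theta * x - y) * x for x, y in data) / n
    return loss, grad


def sample_task(rng, k_shot=5):
    """A toy task: regress y = a * x for a task-specific slope a."""
    a = rng.uniform(2.0, 4.0)

    def draw():
        xs = [rng.uniform(-1.0, 1.0) for _ in range(k_shot)]
        return [(x, a * x) for x in xs]

    return draw(), draw()  # (D_train, D_test)


def maml(rng, alpha=0.1, beta=0.05, steps=500, batch=4):
    theta = rng.uniform(-1.0, 1.0)  # random initialization
    for _ in range(steps):
        meta_grad = 0.0
        for _ in range(batch):
            d_train, d_test = sample_task(rng)
            _, g_train = loss_and_grad(theta, d_train)
            adapted = theta - alpha * g_train          # inner-loop update on D_train
            _, g_test = loss_and_grad(adapted, d_test)
            # Chain rule: d(adapted)/d(theta) = 1 - alpha * (second derivative of train loss).
            hess = sum(2 * x * x for x, _ in d_train) / len(d_train)
            meta_grad += g_test * (1 - alpha * hess)
        theta -= beta * meta_grad / batch              # outer-loop update (batch average)
    return theta


theta = maml(random.Random(0))

# One-step adaptation to a new, unseen task.
new_train, new_test = sample_task(random.Random(1))
loss_before, _ = loss_and_grad(theta, new_test)
adapted = theta - 0.1 * loss_and_grad(theta, new_train)[1]
loss_after, _ = loss_and_grad(adapted, new_test)
```

The meta-learned Θ settles near the middle of the slope range, from which a single gradient step on a few samples of a new task already lowers the test loss.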
EMG transfer learning could also benefit from data augmentation, in which synthetic data are generated to serve as data from other sessions or subjects (target domain data). Generative Adversarial Networks (GANs) are a well-known type of network for data generation without explicitly modeling the data probability distribution. A typical GAN contains a generator G and a discriminator D, which are two neural networks. A random noise vector sampled from a Gaussian or uniform distribution is input to the generator network to produce a sample x_g that should be similar to a real data sample x_r drawn from the true data distribution P_r. Either x_r or x_g is input to the discriminator to obtain a classification result of whether the input is real or fake. Intuitively, the generator aims to generate fake samples that confuse the discriminator as much as possible, while the task of the discriminator is to best distinguish fake samples from real ones. The training objective of a GAN can be defined as:
$$\min_G \max_D \; \mathbb{E}_{x_r \sim P_r}\left[\log D(x_r)\right] + \mathbb{E}_{x_g \sim P_g}\left[\log\left(1 - D(x_g)\right)\right],$$
where P_g denotes the distribution of the generated samples. Zanini et al. [97] adopted DCGAN [98], a convolution-based extension of the original GAN, and style transfer for Parkinson's Disease EMG data augmentation.
Besides GANs, style transfer has also been utilized to augment EMG data. Given a piece of fine art, a painting for example, humans have the ability to appreciate the interaction of content and style. "The Starry Night" by Van Gogh is an appealing painting that has attracted many re-drawings that follow Van Gogh's drawing style but have different content. Gatys et al.
[99] proposed an algorithm for artistic style transfer that combines the content of one painting with the style of another painting. A similar idea can be extended to EMG signals for transfer learning. An EMG signal can also be regarded as the interaction of content and style. The style might refer to the biological characteristics of the subject, such as muscle condition, to the filtering effect of a recording device, or simply to a session. The content depicts the spikes carrying the movement intention from the neural system to the corresponding muscles. Assuming that the content of a given muscle movement is the same regardless of any other conditions, the style component then turns the control signals for the movement into subject-, device-, or session-specific data. Zanini et al. [97] adopted style transfer [99] to augment Parkinson's Disease EMG data of different patterns. Specifically, given a content EMG signal e_c and a style EMG signal e_s, the algorithm aims to find an EMG signal e that has the same content as e_c and the same style as e_s. Mathematically, the transfer process minimizes the following loss function:
$$\mathcal{L}(e) = \alpha \sum_l \left\| F_l(e) - F_l(e_c) \right\|^2 + \beta \sum_l \left\| G\left(F_l(e)\right) - G\left(F_l(e_s)\right) \right\|^2,$$
where F_l(·) is the output feature of the l-th layer of the neural network and G stands for the Gram matrix [100]. The content component and the style component are weighted by the two hyperparameters α and β, respectively.
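The role of the Gram matrix in the style term can be seen in a small sketch. It assumes the usual content-plus-style form of the loss with weights `alpha` and `beta`, and it takes the layer features as given lists of (channels × time) values instead of computing them with a network. The function names and all feature values are hypothetical.

```python
def gram(features):
    """Gram matrix of one feature map given as channels x time samples."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features] for fi in features]


def sq_dist(a, b):
    """Sum of squared element-wise differences between two matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def style_transfer_loss(f_e, f_c, f_s, alpha=1.0, beta=1.0):
    """Weighted content + style loss; each argument is a list of per-layer feature maps."""
    content = sum(sq_dist(fe, fc) for fe, fc in zip(f_e, f_c))
    style = sum(sq_dist(gram(fe), gram(fs)) for fe, fs in zip(f_e, f_s))
    return alpha * content + beta * style


# Hypothetical single-layer features: two channels, two time samples.
f = [[[1, 0], [0, 1]]]
f_time_reversed = [[[0, 1], [1, 0]]]   # same channel statistics, different timing
f_rescaled = [[[2, 0], [0, 1]]]        # first channel has a larger amplitude

loss_same_style = style_transfer_loss(f, f, f_time_reversed)
loss_new_style = style_transfer_loss(f, f, f_rescaled, beta=0.5)
```

The Gram matrix discards timing and keeps only channel co-activation statistics, which is why the time-reversed features incur no style penalty while the rescaled ones do.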
Besides directly generating EMG data, Suri et al. [101] proposed to synthesize extracted features of EMG signals with an LSTM network [102] to mimic EMG data from other subjects or different sessions. Unlike GAN and style transfer based EMG augmentation, which are directed by loss functions that measure either authenticity or similarity, the method proposed by Suri et al. simply relies on the assumption that the extracted features are robust and that EMG signals generated by altering the features are correlated with the recorded real data.

Adversarial Based Perspective
Recall that in Section 3.1.2, we introduced non-linear feature based approaches that reduce the data distribution discrepancy by explicit deep feature transformation. In this section, we review a set of methods that force the neural network to learn hidden EMG representations that contain no discriminative information about the origin of the data, for domain-generic feature extraction. With this objective, the Domain-Adversarial Neural Network (DANN) [103] is a type of neural network that contains a backbone F(·; θ_F) parameterized by θ_F for feature extraction and two prediction heads: one for predicting the task label and another for predicting the origin of the data (source or target domain). We refer to the prediction head for the source domain task as the task prediction head P_t(·; θ_t) and to the prediction head for domain classification as the domain prediction head P_d(·; θ_d). The parameters of the network are optimized such that the learned deep features minimize the loss of the task prediction head while maximizing the loss of the domain prediction head. The domain prediction head works adversarially to the task prediction head, hence the name DANN. Formally, the overall loss function for optimizing θ_F, θ_t and θ_d is defined as:
$$E(\theta_F, \theta_t, \theta_d) = \frac{1}{n}\sum_{i=1}^{n} L_t^i(\theta_F, \theta_t) - \lambda\left(\frac{1}{n}\sum_{i=1}^{n} L_d^i(\theta_F, \theta_d) + \frac{1}{m}\sum_{j=1}^{m} L_d^j(\theta_F, \theta_d)\right),$$
where L_t denotes the loss function for the source domain prediction task, L_d denotes the loss function for the domain classification, the superscripts i and j index the source and target samples on which the losses are evaluated, λ is a balance factor, and n and m indicate the number of source domain and target domain samples, respectively. The parameters θ_F, θ_t and θ_d are then updated using gradient descent:
$$\theta_F \leftarrow \theta_F - \beta\left(\frac{\partial L_t}{\partial \theta_F} - \lambda\frac{\partial L_d}{\partial \theta_F}\right), \qquad \theta_t \leftarrow \theta_t - \beta\frac{\partial L_t}{\partial \theta_t}, \qquad \theta_d \leftarrow \theta_d - \beta\lambda\frac{\partial L_d}{\partial \theta_d},$$
where β is the learning rate. We provide an illustration of the data and gradient flow of DANN in Figure 7.
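The opposing gradient flows can be traced by hand on a scalar toy model. The sketch assumes the update rules of the original DANN formulation (the backbone descends the task loss and ascends the domain loss, each head descends its own loss) and uses a one-parameter backbone, a linear task head with squared error, and a sigmoid domain head with cross-entropy. All of these modeling choices and names are hypothetical simplifications.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def domain_loss(w, c, x, d):
    """Cross-entropy of the domain head for a sample with domain label d in {0, 1}."""
    d_hat = sigmoid(c * w * x)
    return -(d * math.log(d_hat) + (1 - d) * math.log(1 - d_hat))


def dann_step(params, x, y, d, beta=0.1, lam=0.5):
    """One update of (backbone w, task head a, domain head c) on a single sample.

    y is the task target (None for unlabeled target domain samples).
    """
    w, a, c = params
    f = w * x                                   # deep feature from the backbone
    if y is not None:
        err = a * f - y                         # task loss L_t = err ** 2
        g_t_w, g_t_a = 2 * err * a * x, 2 * err * f
    else:
        g_t_w = g_t_a = 0.0                     # no task loss without a label
    d_hat = sigmoid(c * f)
    g_d_w, g_d_c = (d_hat - d) * c * x, (d_hat - d) * f
    w_new = w - beta * (g_t_w - lam * g_d_w)    # domain gradient enters with reversed sign
    a_new = a - beta * g_t_a
    c_new = c - beta * lam * g_d_c              # the domain head itself descends L_d
    return w_new, a_new, c_new


# Unlabeled target domain sample (domain label 1).
w, a, c = 1.0, 1.0, 1.0
w_new, a_new, c_new = dann_step((w, a, c), x=1.0, y=None, d=1)

# Labeled source domain sample (domain label 0).
_, a_src, _ = dann_step((1.0, 0.5, 1.0), x=1.0, y=1.0, d=0)
```

After the step on the target sample, the new backbone weight makes the sample harder for the old domain head to classify, while the new domain head classifies it better. This is the adversarial interplay that a gradient reversal layer implements in practice.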
Côté-Allard et al. [104] proposed to use DANN for multi-domain, inter-session EMG transfer learning. During training, each mini-batch contains randomly sampled EMG segments from one session. Each mini-batch is assigned a class index indicating its session, which serves as the domain label. A gradient reversal layer [103] is adopted for easy implementation of the negative gradient flow from the domain prediction loss to the backbone. Note that the task prediction head is only updated with the loss from the source domain data.
The overall loss used to train the Disentangled Adversarial Autoencoder (DAA) [110] is formulated in terms of likelihoods. As illustrated in Figure 8, the decoder, the adversarial prediction head, and the nuisance prediction head are discarded after the disentangled feature learning process of DAA. The weights of the encoder are then frozen for feature extraction, and a task prediction head with random weight initialization is placed on top of the encoder for specific downstream tasks. Based on their previous work [110], Han et al. later proposed a soft version of the latent representation disentanglement [112].

Summary of Common Datasets
We summarize common EMG datasets [6,52,55,104,113-117] that could be used for transfer learning and provide dataset statistics in Table 1, including task category, number of subjects, number of recording device channels, sampling frequency, number of gesture classes, and corresponding citations.

Discussion and Future Directions
In this section, we revisit EMG transfer learning approaches based on our categorization and discuss the advantages and drawbacks of each category.Given our discussion, we further point out future directions.
Instance Weighting: By applying weights to the data samples from the source domain, instance weighting makes use of existing source domain data to augment the target domain data and thereby enlarges the dataset used to train the model. This line of methods alleviates data shortage when the target domain data are limited. One potential drawback of such methods is that the overall performance is highly dependent on the weighting mechanism, and the target model could suffer from poorly selected and weighted samples from the source domain.
Linear Feature Transformation: Linear feature transformation based approaches are the most bio-inspired transfer learning approaches of all categories, in the sense that both the generation and the recording of EMG can be abstracted under a linear assumption. This line of work is simple and computationally light, since the transfer process is done by applying a linear transformation to either the data or the features, which amounts to a matrix multiplication. We argue that the linear assumption holds for transfer scenarios that are related to electrode shift. We mentioned in Section 2.2 that certain non-linear factors, such as the filtering effect of muscle and fat tissues and muscle fiber recruitment patterns, vary across subjects. These non-linear factors cannot be modeled with a linear transformation. However, if the underlying subject and recording devices remain the same, electrode shift can be captured to some extent by such approaches.
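As an idealized illustration of why electrode shift suits a linear model, consider an armband that is rotated by exactly one electrode position. Under this assumption the shifted recording is the original recording multiplied by a permutation matrix, and the shift is undone by multiplying with its transpose. Real shifts are rarely whole-channel, so in practice the matrix would have to be estimated from calibration data. The channel count, names, and values below are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def cyclic_shift_matrix(n_channels, shift):
    """Permutation matrix P with (P @ X)[i] = X[(i - shift) % n_channels]."""
    return [[1.0 if j == (i - shift) % n_channels else 0.0
             for j in range(n_channels)] for i in range(n_channels)]


# Hypothetical recording: 4 channels x 2 time samples.
recording = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

P = cyclic_shift_matrix(4, shift=1)
shifted = matmul(P, recording)             # what the rotated armband records
realigned = matmul(transpose(P), shifted)  # linear correction of the shift
```

Because a permutation matrix is orthogonal, its transpose is its inverse, so `realigned` equals the original recording.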
Non-linear Feature Transformation: The non-linearity of this line of work mainly comes from the non-linear activation functions of DNNs. Consequently, non-linear factors such as subject variation can be modeled in a black-box fashion. Meanwhile, such methods also share the common advantages of DNNs, such as robust feature extraction ability. One main drawback is that DNN based non-linear transformation lacks interpretability, in that it is not clear exactly what features are extracted to reduce the data distribution discrepancy. Therefore, it is hard to further improve the algorithm, since no biologically sound clue resides behind the design of the architecture.
Parameter Fine-tuning: Fine-tuning as transfer learning is simple in practice, since the only operation is to run the training process again on the target domain dataset. However, if the size of the target domain dataset is limited, the resulting model might suffer from over-fitting. Moreover, fine-tuning in general suffers from catastrophic forgetting, where the knowledge learned from the source domain is quickly forgotten with the introduction of new target domain data.
Parameter Sharing: Parameter sharing based approaches are quite similar to fine-tuning; however, part of the network parameters is shared between the source and the target model. By doing so, the aforementioned catastrophic forgetting can be alleviated, since certain knowledge is retained by sharing the associated network parameters. The common practice is to share the parameters of the feature extractor and to train a task-relevant prediction head from scratch. Freezing the backbone is common when the source domain dataset is believed to be large and of similar distribution to the target dataset. Otherwise, there is no guarantee that training only a small fraction of the parameters would yield good transfer performance.
Model Ensemble: Directly combining data of multiple domains might prevent the neural network from converging smoothly due to data distribution differences. Building individual models for individual domains and then ensembling them best preserves the information of each domain. Since we assume that data distributions from different sessions or subjects vary greatly for EMG applications, model ensemble gains the most performance improvement by promoting the diversity of the models. However, the model ensemble is computationally and memory expensive, given that multiple models are stored in memory and each data point is processed multiple times for the final prediction.
Model Structure Calibration: Existing model structure calibration based models are mainly based on random forest, which is in essence already a model ensemble. Thus, this line of work shares the advantages of model ensemble based methods. The structure calibration refers to the growing or pruning operations on individual decision trees. One drawback is that features need to be extracted manually, which is also a drawback of the decision tree itself. It would also be interesting to explore the possibility of calibrating the model structure of DNNs using neural network structure searching tools such as Neural Architecture Search (NAS).
Label Calibration: This line of work uses the source model to label unseen target domain data. The calibrated target domain labels are then used to update the model. One advantage is that the transfer mechanism of these methods is well suited to real-world applications: such methods do not require an expert for target domain data labeling, and the transfer process could be deployed on end devices and applied automatically to new incoming data with a simple user interface. However, since the source domain model labels data using knowledge learned from the source domain and will assign a label even to data points from previously unseen categories, the label calibration procedure may introduce label noise.
Data Generation: Generating synthetic EMG data could avoid the tedious workload of data collection and annotation.Given that EMG collection and labeling is very time consuming and requires expertise, generated data of good quality could enhance practicality.However, unlike the data generation in the vision or language community, where the quality of the generated images or texts could easily be verified by human observation, it is hard to evaluate the quality of EMG signals generated.As a consequence, using poorly generated data as data from another domain may bring a negative impact.
Meta/Adversarial Learning Based: Adversarial learning learns features that are domain-invariant. Meta learning mimics consecutive transfer learning during training so that the model can be adapted to a new domain with limited data. Such methods are expected to perform well across a series of transfer tasks with many new target domains. However, the training process of these approaches is complex and/or introduces additional network components during transfer, which makes fast transfer learning on an end device almost impossible.
The essence of EMG transfer learning is to boost the viability of existing machine learning based EMG applications. Consequently, the transfer learning algorithm should bear the following characteristics: 1) Bio-Inspired. The working mechanism of muscles is relatively well studied and straightforward compared to that of the brain. We point out that the activation patterns of the muscles, the relative location between muscles and electrodes, and individual biological characteristics should be explicitly modeled into the neural network to embed a priori knowledge in the network. AlphaFold [119] is a successful example: its network structure design for protein structure prediction is guided by a priori knowledge about proteins.
2) Hardware-friendly.Ideally, the re-calibration should be done on end devices rather than on cloud servers.With wearable or even implantable devices, the memory and computation resources are highly restricted.Most current DNN based transfer learning approaches fail to take the hardware constraints into consideration.Future works should incorporate a hardware resource perspective into algorithm design (hardware-software co-design).
3) User-friendly. The transfer process should be fast and light, in the sense that there should be no heavy data collection procedure that requires user participation. Future works should thus pay more attention to transfer learning algorithms that work with limited target domain data and annotation. For instance, given a hand gesture classification task with more than 20 classes, the algorithm is considered user-friendly if the user is only required to perform the simplest gesture once for system re-calibration.

Acknowledgement

Figure 1. Demonstration of EMG acquisition. The sEMG acquisition configuration is shown above the dotted line, with the iEMG acquisition configuration shown below the dotted line. The triangle represents an amplifier. For the bi-polar setup as in (a) and (c), two electrodes are placed on the skin surface or inserted into muscle fibers penetrating the skin surface. (b) and (d) show the case of a mono-polar setup with one electrode attached to the skin or muscle fiber and the other electrode connected to the ground or a reference point with no EMG (bones).

Figure 2. Illustration of electrode variation. The left-hand side shows an EMG acquisition armband put on the forearm of a subject. (a), (b) and (c) are the net of the armband and the corresponding skin underneath. Colored circles represent electrodes, with two vertically placed electrodes forming one bi-polar channel. (a) demonstrates the original placement of an eight-channel bi-polar EMG collecting armband on the surface of the skin. (b) shows a shifted placement of the electrodes on the skin compared to (a). (c) is the case where the electrode placement is the same as in (a), but some channels are missing for some reason.

Figure 3. Overview of categorization of transfer learning on EMG analysis.

Figure 4.
Figure 5. Illustration of the architecture of the progressive neural network. Frozen indicates that the parameters of the network are fixed, while trainable suggests that the network parameters will be updated during training. The same input is fed to both networks, and the intermediate features from each module of the pre-trained network are merged with the corresponding intermediate features of the target domain network.

Figure 6.Illustration of transferring knowledge by sharing the weights of the neural network.The weights of the backbone modules are first copied to the target domain network and frozen.The term 'module' refers to a combination of layers that might contain convolution, normalization, or residual connection.FC stands for the fully connected layer.The weights of the prediction head are randomly initialized and trained from scratch.

Figure 7. Illustration of a typical DANN.A backbone of any arbitrary design for feature extraction is marked in green while the task prediction head and domain prediction head are marked in blue and purple, respectively.The output deep feature from the backbone is fed to both heads for loss calculation with respect to the ground truth label.The gradient of L t is backpropagated through the task prediction head and the backbone for parameter update.The domain prediction head is updated by the gradient of L d .The negative gradient from L d also flows back to the backbone for parameter update.

[Figure 3 content: taxonomy tree of EMG transfer learning. Data Based: Instance Weighting; Feature Based (Linear Transformation, Non-linear Transformation). Model Based: Parameter Based (Fine-tuning, Parameter Sharing); Structure Based (Model Ensemble, Model Calibration). Training-scheme Based: Label Calibration, Data Generation, Meta Learning. Adversarial Based. Legend: root task, mid-level task, leaf task.]
Algorithm 1: MAML Style Meta-learning for Transfer Learning
Input: task distribution p(T), loss function L, learning rate for the inner loop α, learning rate for the outer loop β
Output: prediction model f_Θ
Initialization: randomly initialize Θ
while not done do
    Sample a batch of tasks T_j from p(T)
    for all tasks T_j do
        Evaluate the error L_{T_j}(f_Θ) with respect to D^train_j
        Update Θ with gradient descent: Θ'_j = Θ − α ∇_Θ L_{T_j}(f_Θ)
    end
    Update Θ ← Θ − β ∇_Θ Σ_j L_{T_j}(f_{Θ'_j}) with respect to D^test_j
end

Figure 8. Illustration of the Disentangled Adversarial Autoencoder (DAA). The disentangled feature learning phase is demonstrated above the dotted line, while the task prediction phase is shown below the dotted line. In the disentangled feature learning phase, the input data is mapped into the disentangled feature representations z_a and z_n, each of which is passed to the corresponding prediction head. The overall latent representation is passed to the decoder for signal reconstruction. After feature learning, the two prediction heads and the decoder are discarded. A new task prediction head with random weights is introduced on top of the encoder with frozen weights for task prediction.

In a contemporaneous work, Côté-Allard et al. [105] also explored using Virtual Adversarial Domain Adaptation (VADA) [106] together with Decision-boundary Iterative Refinement Training with a Teacher (DIRT-T) [106] for adversarial based EMG transfer learning. VADA is an extension of DANN that incorporates a locally-Lipschitz constraint via Virtual Adversarial Training (VAT) [107] to punish violations of the cluster assumption during training. On top of the model trained by VADA, DIRT-T aims to optimize the decision boundary on the target domain data by fine-tuning the model. Specifically, the model parameters from the previous iteration are treated as the teacher model, and the optimization goal is to seek a student model that is close to the teacher model while minimizing the cluster assumption violation.
Following the work of Côté-Allard et al., other DANN related EMG transfer learning research endeavors [108,109] were made for various transfer scenarios. Han et al. [110] further proposed the Disentangled Adversarial Autoencoder (DAA), which disentangles the learned latent representation into adversary and nuisance blocks to model task-related features and domain-related features disjointly.
Based on the autoencoder (AE) [111] structure, the encoder F(·; θ) maps the input signal x into a latent representation z = [z_a, z_n], where z_a and z_n stand for the adversary and the nuisance sub-representation, respectively. z_a is expected to contain only the task-relevant features but no domain-specific information i_d. On the other hand, the encoder embeds sufficient domain-specific data into z_n. The decoder G(·; η) reconstructs the original input signal based on the latent representation z. Similar to DANN, DAA also adopts two prediction heads: the adversarial prediction head P_a(·; φ) and the nuisance prediction head P_n(·; ψ).

Table 1. Summary and statistics of common EMG datasets for transfer learning.
The authors would like to acknowledge start-up funds from Westlake University to the Center of Excellence in Biomedical Research on Advanced Integrated-on-chips Neurotechnologies (CenBRAIN Neurotech) for supporting this project. The Zhejiang Key R&D Program Project No. 2021C03002 and the Zhejiang Leading Innovative and Entrepreneur Team Introduction Program No. 2020R01005 both provided funding for this work.