Mordo2: A Personalization Framework for Silent Command Recognition

Wearable human-computer interactions in daily life are increasingly encouraged by the prevalence of intelligent wearables, which poses a demanding requirement for micro-interaction and for minimizing social awkwardness. Our previous work demonstrated the feasibility of recognizing silent commands through around-ear biosensors, with user adaptation as a remaining limitation. In this work, we ease the limitation with a personalization framework that integrates spectral factorization of signals, temporal confidence rejection and commonly used transfer learning algorithms. Specifically, we first empirically formulate the user adaptation issue by presenting the accuracies of applying transfer learning algorithms to our previous method. Second, we improve the signal-to-noise ratio by proposing a supervised spectral factorization method that learns the amplitude and phase mappings between around-ear signals and the signals of articulated facial muscles. Third, we leverage the time continuity of commands and introduce a time decay into confidence rejection. Finally, extensive experiments are conducted to evaluate the feasibility and improvements. The results indicate an average accuracy of 92.38%, significantly higher than solely using transfer learning algorithms, and a comparable accuracy can be achieved with significantly less data from new users. The overall performance shows that the framework can significantly improve the accuracy of user adaptation. This work takes a further step toward commercial products for silent command recognition and may inspire solutions to the user adaptation challenge of wearable human-computer interactions.


I. INTRODUCTION
THE prevalence of intelligent wearables increasingly enables convenient daily human-computer interactions (HCIs) for remote health monitoring, external device control and metaverse applications [1], [2], [3]. Such wearable HCIs raise the requirement of natural interactions that avoid social awkwardness and do not interrupt main tasks, i.e. the "micro-interaction" requirement [4]. In our previous work, we proposed recognizing silent commands using around-ear biosensors, such that the sensors can be easily integrated with earbuds and interacting through silent commands can meet the "micro-interaction" requirement.
Silent command recognition (SCR) benefits from inducing little awkwardness, offering contextual commands and having good wearability. Among the modalities, wearable cameras [5], [6] recognize facial landmarks and can achieve considerable performance without suffering from occlusion issues. However, the energy cost of cameras hinders their wearability and thus impedes their application, especially under field conditions [7]. Audio-based methods [8], [9] recognize commands through wearable microphones but may suffer from social awkwardness when speaking vocally. Other wearable sensors, such as magnetic [10], capacitive [11], EOG [12] or surface myoelectric sensors [13], have been used to track facial gestures or even generate face avatars. However, the facial gestures lack a contextual link with the corresponding manipulations. Moreover, the algorithms did not fit the task of silent command recognition [14]. Also, continuously tracking silent speech using biosensors on the face or even under the tongue can hardly be accepted by healthy users. Our previous work [14] took an initial step toward recognizing silent commands through around-ear sensors and demonstrated good performance under non-ideal working conditions. However, it also presented an obvious issue that might significantly impede the system's usage and user acceptance in real-world applications: the basic transfer learning experiment showed that the pre-trained model cannot significantly decrease the data requirement and sometimes even induced a negative transfer phenomenon. As a result, when adapting to new users, the system still needs a considerable amount of labeled data to re-train the classifier, which poses a great challenge for user adaptation and user acceptance.
The user adaptation issue is a common challenge for such wearable biosensor-based HCIs. The transfer learning performance presented by [15], which recognized silent speech using on-face biosensors, still suffered from a significant accuracy degradation even when most of the labeled data were used for fine-tuning. Deng et al. [16] utilized high-density facial biosensors and achieved a promising user adaptation performance. The issue is worsened when using around-ear biosensors, considering that the signals collected by around-ear sensors are only implicitly related to the articulated muscles and only a relatively small number of channels can be placed. Wu et al. [13] tracked facial gestures through around-ear biosensors using user-specific models that still required considerable training data; the study also revealed the highly personalized characteristics of around-ear biosignals. Considering the application scenarios of such wearable biosensor-based HCIs and the challenge induced by around-ear sensors, user adaptation requires acceptable performance with minimal effort when adapting to new users.
In this work, we aim to decrease the amount of labeled data, improve the user adaptation performance and thus increase user acceptance, based on the testbed of our previous work. We first formulate the issue by presenting the performance of several transfer learning algorithms under different sizes of labeled data. In this way, the performance deterioration under relatively small amounts of data depicts the user-specific and non-stationary characteristics of the around-ear biosignals. Second, we propose to improve the signal-to-noise ratio with a supervised matrix factorization method that maps the around-ear signals into facial muscle signals, considering that the facial muscles are directly activated during silent commands. In this way, the silent command recognition model can be constrained to learn a more generalized mapping between the input signals and commands. Then, we adapt a confidence rejection method to reject false positives and temporally smooth the recognized commands. Finally, we evaluate the feasibility of our method by ablation experiments, performance comparisons and combination with commonly used transfer learning techniques. Overall, compared with our previous work [14], we develop an algorithmic framework and improve its performance when adapting to new users. Our contributions can be summarized as follows.
• To the best of our knowledge, we for the first time realize a user adaptation framework for around-ear biosensor-enabled HCIs that significantly decreases the requirement of labeled data.
• We propose a novel spectral factorization method that aims to map the around-ear signals into signals of facial muscles, such that the signals of the directly articulated muscles can be used for SCR.
• We adapt a confidence rejection method that combines false-positive rejection with the time continuity of a command.
• We evaluate our method by extensive comparisons, ablation experiments and data efficiency tests.
II. RELATED WORK
As stated above, most facial or around-ear wearable devices that process biosignals for HCI utilize subject-specific models and train the model nearly from scratch when adapting to a new user [13], [15], whether for SSR [17] or facial gesture recognition [13]. Currently used subject-independent models rely on high-density collection of facial sEMG or on the combination of facial muscles and the tongue [18], [19], [20]. Such methods may not suit our around-ear sensor configuration due to the relatively implicit correspondence between around-ear signals and the muscles articulated during commands. And if the sensors are placed on the cheeks, chin or even the tongue [15], [21], [22], they might induce social awkwardness and decrease user acceptance. In this section, we review related work on user adaptation of such around-ear HCIs that may ease the challenge under the limitation of the sensor configuration.
Transfer learning techniques that utilize knowledge of a source domain are a promising solution. Fine-tuning the pre-trained model with labeled data from the new user provides a baseline of the source-domain knowledge's utility and a basic measure of the difficulty of the task. That is, the difficulty of the task can be depicted by the amount of labeled data needed to achieve acceptable performance. As shown in our comparison below, over 80% of a new user's data is needed to achieve an accuracy of 82%. That is, under the basic fine-tuning paradigm, little knowledge can be obtained from the source domain, and we still need to collect a considerable amount of labeled data to retrain the model. In this work, we further depict the user adaptation issue by presenting the performance of some more advanced transfer learning algorithms based on the adaptation and alignment of the marginal distributions [23], conditional distributions [24] and joint distributions [25] of the source and target domains.
Another potential solution derives from the mapping between the around-ear signals and the facial signals of articulated muscles. Considering the stochastic and considerably subject-specific characteristics of the around-ear signals, an intuitive idea is to build such a mapping and implicitly use the signals of the articulated muscles. Such signal transformation and factorization techniques have been demonstrated with good performance in bio-signal processing. Nguyen et al. [26] proposed a supervised separation algorithm to factorize the spectrum of in-ear bio-signals into EEG, EMG and EOG and demonstrated its feasibility in sleep stage classification. However, their linear summation assumption does not consider the potential amplitude and phase decay during the transmission from different sources. This limitation may induce a confused factorization that violates the signal characteristics of the actual sources. In this work, we treat the around-ear sEMG signals as the crosstalk from articulated muscles and further consider the crosstalk-induced signal decay in both the amplitude and phase domains.
In order to further reduce false alarms, confidence estimation and consequent rejection are considered. Confidence estimation provides the confidence of a prediction; the consequent rejection can deny low-confidence predictions and improve reliability. Previous studies utilized the maximum posterior probability of linear discriminant analysis (LDA) [27] or support vector machines (SVM) [28] to estimate machine learning models' confidence in their predictions. With the development of deep learning (DL), Ranjan et al. [29] proposed estimating deep learning models' confidence using the soft labels of a convolutional neural network (CNN). Wang et al. [30] used the entropy of CNN soft labels to estimate confidence. Wan et al. [31] further proposed using an artificial neural network (ANN) to process the soft labels of each class. Bao et al. [32] adopted the idea, proposed a trainable linear layer to estimate the confidence of a CNN and for the first time realized confidence estimation and rejection for DL-based myoelectric control. However, they ignored the time continuity of sEMG signals and thus did not incorporate temporal information into the confidence estimation. Wu et al. [13] used a Kalman filter to temporally smooth the output landmark series for more stable avatar generation, without considering either soft labels or confidence. In our work, we propose to extend the confidence estimation proposed for sEMG-based gesture classification [32] and include the time continuity of commands.

III. METHODOLOGY
In this section, we introduce how we ease the user adaptation issue. Specifically, we first formulate the issue by presenting the performance of several representative transfer learning algorithms. Then, we factorize the spectral representations of the around-ear signals in a supervised manner to constrain the input space of the silent command recognition model and reduce the need for labeled data. Finally, we describe how we introduce temporal continuity into confidence rejection and suppress false alarms.

A. Problem Formulation
We use data from 33 subjects, whose collection paradigm is described in section D. Each subject performed each of the 10 silent commands 40 times, plus a 2-min trial of resting state and a 2-min trial of speaking vocally. The paradigm follows that of [14]. Herein, we adopt the basic fine-tuning paradigm and some typical transfer learning methods that have been demonstrated in sEMG-based motion recognition, i.e. MCD [23], which uses the marginal distribution of the target domain and has demonstrated performance on human activity recognition, DSAN [24], which uses the conditional distribution of the target domain, and DDAN, which uses the target domain's joint distribution [25]. A leave-5-subject-out paradigm is adopted. That is, we randomly sample 5 subjects from the dataset and use the data of the remaining 28 subjects to train and validate the algorithm. We then use 20%, 40%, 60% and 80% of the 5-subject data for transfer learning. For all the transfer learning algorithms except MCD, such training data are labeled; for MCD, which solely uses the marginal distribution, no labels are needed. For all the transfer learning algorithms, we keep the same hyper-parameters: the batch size is 128, 50 epochs are used, and the learning rate is 0.00005.
For the fully connected (FC) and LSTM layers, dropout with a probability of 0.5 is applied to avoid over-fitting. For the basic fine-tuning paradigm, 10 epochs are used to tune the parameters of all layers. For DSAN, which measures the distances between source- and target-domain distributions at different layers, we follow its original setting and only measure the distance at the final FC layer. For a clear presentation, we compare the transfer learning algorithms with the algorithm trained from scratch using the same amount of data. ANOVA is used to analyse the significant differences (significance level α = 0.05).
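For concreteness, a minimal PyTorch sketch of the basic fine-tuning baseline is shown below. The CNNLSTM module is a hypothetical stand-in for the CNN-LSTM classifier of [14] (its exact architecture is not restated here); the learning rate, dropout and the 10 fine-tuning epochs follow the settings above, while the class count of 12 is our assumption (10 commands plus the resting and vocal-speaking states).

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Hypothetical stand-in for the CNN-LSTM classifier of [14]."""
    def __init__(self, n_channels=4, n_classes=12, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.fc = nn.Sequential(nn.Dropout(0.5),       # dropout of 0.5 before FC
                                nn.Linear(hidden, n_classes))

    def forward(self, x):                    # x: (batch, channels, time)
        h = self.conv(x).transpose(1, 2)     # -> (batch, time, features)
        out, _ = self.lstm(h)
        return self.fc(out[:, -1])           # logits from the last time step

def finetune(model, loader, epochs=10, lr=5e-5):
    """Tune all layers for 10 epochs on the new user's labeled windows,
    matching the learning rate of 0.00005 reported above."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:                  # loader yields batches of size 128
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```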
As shown in Fig. 2, as the amount of labeled data from new users increases, the classification accuracies of all the methods increase. It should be noted that the performance of retraining from scratch is the same as the performance reported in our previous work [14]. Compared with retraining the model from scratch, the basic fine-tuning paradigm does not present a significant performance improvement; for reduced training data, a slight negative transfer phenomenon is even induced. That is, the performance of transfer learning is lower than the performance of retraining from scratch. For the other typical transfer learning methods, performance is slightly but not significantly improved when the training data size reaches 80%. Overall, under common transfer learning algorithms, the around-ear signals present relatively strong subject-specific characteristics that deteriorate the transfer learning performance. In this way, we formulate the personalization issue as follows: the original algorithm we proposed suffers from the subject-specific characteristics of the around-ear signals such that the amount of data collected for adapting to new users is equivalent to that of training the algorithm from scratch. In the following, we propose a framework for reducing the amount of data needed when adapting to new users. As presented in Fig. 1, the framework consists of three components, each of which is trained individually and used in a cascaded manner.

B. Supervised Spectral Factorization
In the following, we first present our assumption and then propose the spectral factorization method that maps the around-ear signals into the signals of the articulated facial muscles. There are few muscles around the ears. Moreover, as discussed in our previous work [14], the collected signals rarely include EEG or EOG components due to the sensor locations and the data preprocessing procedures we adopt. It is therefore reasonable to treat the around-ear sEMG signals as the crosstalk of the articulated facial muscles. That is, the collected around-ear signals are a mixture of facial sEMG signals with frequency attenuations and some noise. We assume that mapping the around-ear signals into facial signals can constrain the input space of the silent command recognition model such that it can learn more generalized representations across subjects. Herein, we formulate the mixture as a matrix factorization problem in the spectral domain and learn the mapping in a supervised manner. First, we transform the signals into the frequency domain,

$$E_{im} = \sum_{n=1}^{N_E} e_{in}\, e^{-j 2\pi mn/N_E}, \tag{1}$$

$$F_{im} = \sum_{n=1}^{N_F} f_{in}\, e^{-j 2\pi mn/N_F}, \tag{2}$$

where $E_{im}$ and $F_{im}$ denote the $m$th frequency component of the around-ear and facial signals of the $i$th channel, respectively, $e_{in}$ and $f_{in}$ denote the $n$th sample point of the around-ear and facial signals of the $i$th channel, respectively, $E_{im}$ and $F_{im}$ ($m = 1, \cdots, M_E$) are the frequency-domain transformations of $e_{in}$ and $f_{in}$, respectively, $N_E$ and $N_F$ denote the numbers of sample points of the around-ear and facial signals, respectively, and $C_E$ and $C_F$ denote the numbers of channels of the around-ear and facial signals on each side, respectively. Second, under our assumption of the relationship between facial and around-ear signals, we can treat the around-ear signal of each channel as the result of a mixture of facial signals after frequency attenuation,

$$E_{im} = \sum_{j=1}^{C_F} \sum_{k=1}^{N_F} U_{jk} F_{jk}\, e^{-j\varphi_{km}}, \tag{3}$$

where $U_{jk}$ denotes the amplitude attenuation of each frequency component and channel, $U_{jk} \in (0, 1)$, and $\varphi_{km}$ denotes the phase attenuation from the $k$th facial frequency component to the $m$th around-ear component. Since the transmission through tissue behaves like a low-pass filter, the $m$th frequency component of the around-ear signals would be the mixture of the facial components with equivalent and higher frequencies. We set $\varphi_{km} > 0$ when $k \geq m$ and $e^{-j\varphi_{km}} = 0$ when $k < m$. In this way, the matrix form of Eq. (3) can be formulated as

$$\mathbf{E} = (\mathbf{U} \circ \mathbf{F})\,\boldsymbol{\Phi}, \tag{4}$$

where $\mathbf{E} \in \mathbb{C}^{C_E \times N_E}$ denotes the matrix of the frequency domain of the around-ear signals, $\mathbf{F} \in \mathbb{C}^{C_F \times N_F}$ denotes the matrix of the frequency domain of the facial signals, $\Phi_{km} = e^{-j\varphi_{km}}$ denotes the element of the matrix of phase attenuations, and $\circ$ denotes the element-wise product. Considering Euler's formula $e^{j\varphi} = \cos(\varphi) + j\sin(\varphi)$, we treat the right-hand side of Eq. (4) as a vector summation and treat each element of $\mathbf{E}$ as a vector in the complex plane, as shown in Fig. 3.
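To make the crosstalk model concrete, the toy simulation below synthesizes mixed spectra from facial spectra under the reconstruction of Eq. (4) above. All shapes and values are illustrative assumptions, not the paper's data, and the per-facial-channel mixing is a simplification of the model.

```python
import numpy as np

# Toy forward simulation of the crosstalk model in Eq. (4).
C_F, N_F, N_E = 3, 300, 300
rng = np.random.default_rng(0)

# Complex facial spectra F with random amplitudes and phases.
F = rng.random((C_F, N_F)) * np.exp(1j * rng.uniform(-np.pi, np.pi, (C_F, N_F)))
U = rng.uniform(0.1, 0.9, (C_F, N_F))      # amplitude attenuations in (0, 1)
phi = rng.uniform(0.0, np.pi, (N_F, N_E))  # phase attenuations phi_km

# Enforce e^{-j phi_km} = 0 for k < m: only facial components of equal or
# higher frequency contribute to the m-th around-ear component.
Phi = np.tril(np.ones((N_F, N_E))) * np.exp(-1j * phi)

E = (U * F) @ Phi   # Eq. (4): mixed spectra, one per (facial) channel
```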
Based on this vector view, we denote the amplitude of $F_{jk}$ as $A_{F,jk}$ and the amplitude of $E_{im}$ as $A_{E,im}$, and rewrite Eq. (3) as

$$A_{E,im}\cos\varphi_{E,im} = \sum_{j=1}^{C_F}\sum_{k=1}^{N_F} U_{jk} A_{F,jk} \cos(\varphi_{F,jk} - \varphi_{km}),$$
$$A_{E,im}\sin\varphi_{E,im} = \sum_{j=1}^{C_F}\sum_{k=1}^{N_F} U_{jk} A_{F,jk} \sin(\varphi_{F,jk} - \varphi_{km}), \tag{5}$$

where $A_{E,im}$ and $A_{F,jk}$ denote the amplitudes of $E_{im}$ and $F_{jk}$, respectively, and $\varphi_{E,im}$ and $\varphi_{F,jk}$ denote the phases of $E_{im}$ and $F_{jk}$, respectively. We denote $\mathbf{U} \circ \mathbf{A}_F$ as $\mathbf{A}$, the amplitude matrix of $\mathbf{E}$ as $\mathbf{A}_E$, and the matrices of cosines and sines as $\boldsymbol{\Theta}_{\cos}$ and $\boldsymbol{\Theta}_{\sin}$, so that Eq. (5) can be written in the matrix form $\mathbf{A}_E \circ \cos\boldsymbol{\varphi}_E = \mathbf{A}\boldsymbol{\Theta}_{\cos}$ and $\mathbf{A}_E \circ \sin\boldsymbol{\varphi}_E = \mathbf{A}\boldsymbol{\Theta}_{\sin}$.
The problem is thus formulated as a non-negative matrix factorization with a regularization and can be solved in a gradient-based manner [33] (Supplementary Note 3). The cost function is

$$J = \frac{1}{2}\left\| \mathbf{A}_E \circ \cos\boldsymbol{\varphi}_E - \mathbf{A}\boldsymbol{\Theta}_{\cos} \right\|_F^2 + \frac{\beta}{2}\left\| \mathbf{A}_E \circ \sin\boldsymbol{\varphi}_E - \mathbf{A}\boldsymbol{\Theta}_{\sin} \right\|_F^2, \tag{6}$$

and the element-wise multiplicative updating rule for the amplitude dictionary can be formulated as

$$\mathbf{A}_{ih} \leftarrow \mathbf{A}_{ih}\, \frac{\left[(\mathbf{A}_E \circ \cos\boldsymbol{\varphi}_E)\boldsymbol{\Theta}_{\cos}^{\top} + \beta(\mathbf{A}_E \circ \sin\boldsymbol{\varphi}_E)\boldsymbol{\Theta}_{\sin}^{\top}\right]_{ih}}{\left[\mathbf{A}\boldsymbol{\Theta}_{\cos}\boldsymbol{\Theta}_{\cos}^{\top} + \beta\mathbf{A}\boldsymbol{\Theta}_{\sin}\boldsymbol{\Theta}_{\sin}^{\top}\right]_{ih}}, \tag{7}$$

where the subscript $ih$ denotes the element of the $i$th row and the $h$th column; the phase attenuations are updated by gradient descent on Eq. (6), where $(\boldsymbol{\Theta}'_{\cos})_{ih}$ and $(\boldsymbol{\Theta}'_{\sin})_{ih}$ denote $\partial\boldsymbol{\Theta}_{\cos}/\partial\varphi_{ih}$ and $\partial\boldsymbol{\Theta}_{\sin}/\partial\varphi_{ih}$, respectively. Considering the equal importance of $\boldsymbol{\Theta}_{\sin}$ and $\boldsymbol{\Theta}_{\cos}$, we set $\beta$ to 1. After training, the amplitude and phase attenuations $U_{jk}$ and $e^{-j\varphi_{km}}$ can be obtained from $\mathbf{A}$, $\boldsymbol{\Theta}_{\sin}$ and $\boldsymbol{\Theta}_{\cos}$ (with $\mathbf{U} = \mathbf{A} \oslash \mathbf{A}_F$) and used as dictionaries for factorizing around-ear signals. It should be noted that the signals of the facial channels are only used in the training phase; when the system is applied in real-scenario usage, only the around-ear channels are used.
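As a sketch of how the amplitude dictionary can be fitted, the snippet below minimizes the reconstructed cost of Eq. (6) over a non-negative A. Two simplifications relative to the text are assumed: the phase matrices Θ_cos and Θ_sin are held fixed (the paper also updates the phase attenuations), and a projected-gradient step stands in for the element-wise multiplicative rule; the stopping criteria follow the values given in Section III-E.

```python
import numpy as np

def factorize(AE_cos, AE_sin, Theta_cos, Theta_sin, beta=1.0,
              n_iter=1000, tol=0.01, lr=1e-3):
    """Fit the non-negative amplitude dictionary A so that
    AE_cos ~ A @ Theta_cos and AE_sin ~ A @ Theta_sin (cost of Eq. (6))."""
    rng = np.random.default_rng(0)
    A = rng.random((AE_cos.shape[0], Theta_cos.shape[0]))
    prev_cost = np.inf
    for _ in range(n_iter):
        R_cos = A @ Theta_cos - AE_cos            # residuals of both terms
        R_sin = A @ Theta_sin - AE_sin
        cost = 0.5 * (R_cos ** 2).sum() + 0.5 * beta * (R_sin ** 2).sum()
        if prev_cost - cost < tol:                # stop on small residual change
            break
        prev_cost = cost
        grad = R_cos @ Theta_cos.T + beta * (R_sin @ Theta_sin.T)
        A = np.maximum(A - lr * grad, 0.0)        # gradient step, project to A >= 0
    return A
```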

C. Confidence Rejection Method
Compared to the Kalman filter adopted by [13] and the learnable linear layer adopted by [32], we adopt a learnable weight that further considers the temporal continuity of commands. Specifically, we first predict the soft labels using the CNN-LSTM network and calculate the confidence of the prediction accordingly. Then, we treat the most confident prediction of a label as the start point and adapt the learnable weights by temporally smoothing the confidence of the subsequent labels. In so doing, the confidence among adjacent windows can be utilized to respect the temporal continuity of the muscle activation during the resting state, continuous speaking or a command.
The output of the softmax layer is a probabilistic vector, the $i$th element of which denotes the probability of being classified as the $i$th label (rest state, a command or speaking vocally), given by

$$\hat{C} = \operatorname{onehot}\left(\arg\max_i P(C_i \mid X)\right), \tag{8}$$

where $\hat{C}$ denotes the output classification, which is a one-hot vector, and $P(C_i \mid X)$ denotes the probability of each class. It is intuitive to set a threshold on the probabilistic soft labels to suppress false alarms by rejecting the predictions that the classifier is not sufficiently confident about. Specifically, for classifying silent commands, we denote $t_1$ as the index of the window at which the signals first present a class probability over a threshold $\gamma$, given by

$$t_1 = \min\left\{ t : \max_i P(C_i \mid X_t) > \gamma \right\}, \tag{9}$$

where $\max_i P(C_i \mid X_t)$ denotes the confidence of classifying the input $X_t$ as the class with the largest probability, i.e. the confidence of the classifier. If the threshold $\gamma$ is constant and independent of time, this reduces to the intuitive and commonly used constant-threshold confidence rejection. However, muscle activations are temporally continuous; that is, the time continuity may lead to a transferable confidence among adjacent windows. Herein, we set $\gamma$ to 0.95 so that $t_1$ has high confidence, and regard $t_1$, the time index with large confidence, as the starting point of a class. In this way, we formulate the temporally transferable confidence as

$$\tilde{P}_t = P_t + e^{-k(t - t_1)}\, P_{t_1}, \quad t_1 \le t \le t_1 + D, \tag{10}$$

where $\tilde{P}_t$ is a vector, each element of which denotes the confidence of each class after temporal smoothing, $P_t$ is the vector of soft labels output by the classifier, $P_{t_1}$ denotes the confidence vector of window $t_1$, $D$ denotes the time duration of the temporal confidence transfer, and $k$ denotes the time decay. Then, we normalize $\tilde{P}_t$ so that $\sum_i \tilde{P}_t(i) = 1$. In this way, the temporal confidence of adjacent windows can be considered. We adopt the learnable linear layer to further tune the confidence,
$$\operatorname{Conf}(\tilde{P}_t, \beta) = \min\left(\max\left(\beta^{\top}\tilde{P}_t,\ \gamma_1\right),\ \gamma_2\right), \tag{11}$$

where $\beta$ denotes the learnable parameter that provides a hyperplane in the confidence space to further distinguish the confidence and compress the confidence distribution into a confidence score $\operatorname{Conf}(\tilde{P}_t, \beta)$, and $\gamma_1$ and $\gamma_2$ denote the bounds for rejection and acceptance. Following [32], we set $\gamma_1$ and $\gamma_2$ to 0 and 1. When the trained parameter is used for rejection, we again utilize the temporal continuity, given by

$$\hat{C}_t = \begin{cases} \arg\max_i \tilde{P}_t(i), & \operatorname{Conf}(\tilde{P}_t, \beta) \ge \alpha, \\ \hat{C}_{t-1}, & \text{otherwise}, \end{cases} \tag{12}$$

where $\hat{C}_{t-1}$ denotes the classification of the last window and $\alpha$ denotes the threshold for confidence acceptance. In this way, the classification of a window whose confidence score $\operatorname{Conf}(\tilde{P}_t, \beta)$ exceeds the threshold is accepted, while the classification of a window with a lower confidence score is rejected and replaced with the classification of its last window.
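A compact sketch of the resulting decision rule is given below, assuming the exponential-decay form reconstructed in Eq. (10); γ = 0.95 and the bounds γ1 = 0, γ2 = 1 follow the text, whereas k, D (here in windows rather than seconds) and α are placeholders for the values found by the searches described in Section III-E.

```python
import numpy as np

def reject(P, beta, gamma=0.95, k=3.0, D=15, alpha=0.5):
    """P: (T, n_classes) soft labels per window; returns per-window labels,
    where low-confidence windows inherit the previous window's label."""
    T = P.shape[0]
    labels = np.zeros(T, dtype=int)
    t1, P1 = None, None
    for t in range(T):
        if t1 is None and P[t].max() > gamma:
            t1, P1 = t, P[t].copy()                  # Eq. (9): start point t_1
        Pt = P[t].copy()
        if t1 is not None and t1 <= t <= t1 + D:
            Pt = Pt + np.exp(-k * (t - t1)) * P1     # Eq. (10): decayed transfer
            Pt = Pt / Pt.sum()                       # renormalize to sum to 1
        score = float(np.clip(beta @ Pt, 0.0, 1.0))  # Eq. (11): bounds [0, 1]
        if score >= alpha or t == 0:
            labels[t] = int(np.argmax(Pt))           # accept this window
        else:
            labels[t] = labels[t - 1]                # Eq. (12): inherit last label
    return labels
```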

D. Dataset Construction
We recruit 33 subjects to construct the dataset; 27 of them were also recruited for our previous dataset [14], and the rest are newly recruited. We denote our previously constructed dataset as EAR and the dataset constructed in this study as E&F. In this dataset, we collect around-ear signals from the 4 channels around an ear and facial signals from the facial muscles mainly articulated during silent commands. We select 3 muscles on each side, following the paradigm of [20] and [34], which used articulated facial muscles to perform silent speech recognition. Constrained by the biosensing system's limited number of channels, we select channels on the zygomaticus major, orbicularis oris and depressor anguli oris, and still use the channel behind the ear as the ground channel (E1). In this way, we attach 7 channels on each side. We adopt the same data collection parameters and preprocessing procedures as our previous study [14].

E. Training
We perform cross-subject validation on two architectures. Specifically, the architecture Mordo is trained using the around-ear signals as input, following the same manner as our previous work [14]. We then map the around-ear signals using the dictionaries obtained from our spectral factorization method and train the architecture Mordo_F. As for confidence rejection, we adopt the paradigm of [32] and leverage a genetic algorithm to maximize the balanced mean effective confidence (BMEC).
1) Factorization: We first perform the DFT on the sEMG signals from both the facial and around-ear channels to obtain their frequency-domain representations. Then, we calculate the amplitudes and phases of the frequency spectra of each channel. The amplitudes and phases are stacked across channels to form the matrices: the amplitude matrices are denoted by A_E and A_F for the around-ear and facial signals, respectively, and the phases are denoted by φ_E and φ_F. Then, A_E, A_F, φ_E and φ_F are plugged into Eq. (5) to calculate the ground truth and the inputs for the factorization (i.e. A, Θ_sin and Θ_cos). The cost function, Eq. (6), is then used to estimate the amplitude and phase attenuations by the gradient-based method.
The iterations stop when the residual of the cost function between two successive iterations is lower than 0.01 or the number of iterations exceeds 1000. We set the number of frequency components to N_E = N_F = 300.
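The spectral preprocessing of this step can be sketched as follows, assuming windowed multi-channel sEMG arrays; the 300 retained frequency components follow the setting above, while the windowing itself is omitted.

```python
import numpy as np

def spectral_matrices(ear, face, n_components=300):
    """Return amplitude and phase matrices (A_E, phi_E, A_F, phi_F) from the
    per-channel DFT of around-ear (C_E, N) and facial (C_F, N) windows,
    following Eqs. (1)-(2) and the channel stacking described above."""
    E = np.fft.rfft(ear, axis=1)[:, :n_components]   # around-ear spectra
    F = np.fft.rfft(face, axis=1)[:, :n_components]  # facial spectra
    return np.abs(E), np.angle(E), np.abs(F), np.angle(F)
```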
2) The CNN-LSTM Algorithm: We follow the same training paradigm as our previous study [14]. The input in this study is the around-ear signals factorized by the matrices U and Θ_cos, which are obtained in the factorization step.
3) Confidence Rejection: We train the learnable parameter β by the genetic algorithm using the following cost function.
$$\text{BMEC} = \frac{1}{|S_C|}\sum_{i \in S_C} l_i \operatorname{Conf}(\tilde{P}_i, \beta) + \frac{1}{|S_W|}\sum_{j \in S_W} l_j \operatorname{Conf}(\tilde{P}_j, \beta), \tag{13}$$

where $S_C$ denotes the set of samples whose $\operatorname{Conf}(\tilde{P}_t, \beta)$ corresponds correctly to their labels, $S_W$ denotes the set of samples whose $\operatorname{Conf}(\tilde{P}_t, \beta)$ corresponds incorrectly to their labels, $l_i$ is set to 1, denoting the label for correct acceptance, and $l_j$ is set to $-1$, denoting the label for incorrect acceptance. By maximizing BMEC through the genetic algorithm, the learnable parameter $\beta$ can be obtained. We perform a grid search for $D \in [0.8\,\text{s}, 2.0\,\text{s}]$ with a step of 0.1 s and for $k \in [1.0, 2.0]$ with a step of 0.1. According to the performance on the validation sets measured by BMEC, we set $D$ to 1.5 s and $k$ to 3. We perform a greedy search for $\alpha$ to retain the largest true acceptance rate. It should be noted that $\beta$ and $\alpha$ are estimated using the data of new subjects, i.e. labeled data in the target domain rather than the source-domain data used for pretraining the model. We randomly select data from 27 subjects to train the algorithm and test it on the remaining 6 subjects, and perform 6 such rounds. For the confidence rejection method, we select data from 21 subjects as the training set, data from 3 subjects as the validation set and data from the remaining 6 subjects as the test set. The training time for the spectral factorization is 653.3 s averaged across cross-validations, and the training time for the confidence rejection is 195 s averaged across cross-validations. Note that both methods are trained at the group level; retraining is not required when adapting to new users. The experiments run on a PC with an 11th Gen Intel(R) Core(TM) i7-11700 CPU @ 2.50 GHz and 32 GB of RAM; the GPU is an NVIDIA GeForce RTX 3070. After training, the learnable parameters of the spectral factorization are fixed, and the parameters of the confidence rejection and the CNN-LSTM network are tuned when adapting to new users. Compared with our previous study [14], only a matrix multiplication and a confidence calculation (Eqs. (10) and (11)) are introduced during user adaptation. Thus, this study does not introduce new procedures for users.
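A small sketch of the BMEC objective, as reconstructed in Eq. (13), is given below; the scores and labels are toy values, and the genetic algorithm and grid search that maximize this objective are omitted.

```python
import numpy as np

def bmec(conf, correct):
    """Balanced mean effective confidence (Eq. (13)): mean score over
    correctly accepted samples (l = +1) minus mean score over wrongly
    accepted ones (l = -1)."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    return conf[correct].mean() - conf[~correct].mean()

# Toy check: confident correct windows and unconfident wrong ones score high.
scores  = np.array([0.9, 0.8, 0.2, 0.1])
correct = np.array([True, True, False, False])
print(bmec(scores, correct))   # 0.85 - 0.15 = 0.70
```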

IV. EVALUATION
We propose a spectral factorization method and a temporal confidence estimation method to pre-process the signals and post-process the output confidence, aiming to suppress the subject-specific characteristics of around-ear signals and improve the output confidence. In this way, our proposed method can be further integrated with transfer learning algorithms to mitigate negative transfer phenomena and improve the user experience of user adaptation. In this section, we examine the feasibility of the spectral factorization and the temporal confidence estimation method through performance comparisons and ablation experiments. We further demonstrate our method by integrating it with transfer learning methods to show its benefit.

A. Performance Comparison
In the following, we present the performance of our framework by integrating it with basic fine-tuning and some representative transfer learning algorithms that use the marginal (MCD), conditional (DSAN) and joint (DDAN) distributions of the target domain. We depict the benefits of our framework by comparing its performance with that presented in Fig. 2, i.e. solely using such methods without our framework. Specifically, following the workflow, the around-ear signals are first factorized and then fed into the classifier and the temporal confidence rejection component; the classifier is used for transfer learning. We denote the combinations of our framework with fine-tuning, MCD, DSAN, DDAN and retraining from scratch by Finetune_F, MCD_F, DSAN_F, DDAN_F and Retrain_F, respectively. We adopt the same evaluation paradigm as section III-A and evaluate the performance under different sizes of training data in the target domain. ANOVA is used to analyse the significant differences (significance level α = 0.05).
Fig. 6 shows that the performance of all the transfer learning algorithms under all sizes of training data improves compared with the performance without our framework shown in Fig. 2. Compared with the performance of our previous work [14] (73.39% for 60% and 83.50% for 80% of the training data, see also Fig. 2, Retrain), the performance of Retrain_F, i.e. retraining with our framework, is comparable with the performance of Retrain using 80% of the training data. Moreover, by integrating with our framework, the representative transfer learning algorithms also present improved performance. Importantly, our framework's performance under 60% training data is comparable with that of not using our framework under 80% training data. Consistent with our assumption, the comparison shows our framework's improvement in data efficiency.
We further quantify the average improvements over using the same transfer learning algorithms without our framework in Tab. I (i.e. Fig. 6 over Fig. 2). It can be seen that a generally significant improvement is achieved when using our proposed framework. It also shows that some transfer learning algorithms achieve better performance than retraining from scratch, demonstrating that the transfer learning algorithms start to utilize the knowledge learnt from pretraining.

B. Ablation Experiments
In order to evaluate the necessity of each part of our framework, we ablate the spectral factorization and confidence rejection in turn. We evaluate the performance under different sizes of training data and average the performance across the different transfer learning techniques. The classifier and the transfer learning techniques remain the same (i.e. the CNN-LSTM network [14] as the classifier and MCD, DSAN, DDAN, Finetune and Retrain as the transfer learning techniques). ANOVA is used to analyse the significant differences (significance level α = 0.05).
It is shown in Fig. 7 that compared with Baseline, which solely utilizes around-ear signals, Full significantly increases the performance under different sizes of training data. The comparisons between NoConf and Baseline and between Full and NoSpec indicate the effectiveness of solely introducing the spectral factorization. Similarly, the comparisons between NoSpec and Baseline and between Full and NoConf indicate the effectiveness of solely introducing the confidence rejection. Introducing either spectral factorization or confidence rejection improves the performance under the same amount of training data. And the comparison between Full and Baseline further indicates that introducing both components can achieve the same performance with a smaller training set (40% for Full versus 80% for Baseline).

C. Temporal Stability
We test the temporal stability by using training and testing data collected on separate days. The aim of the study is to improve the algorithm's ability to adapt to new subjects. Herein, we follow the cross-subject validation paradigm above, pre-train the algorithm on the 33-subject training set and test it using data, collected on separate days, from 3 subjects outside the training set. We recruit 3 new subjects and collect their data 4 times on separate days. We denote the first data collection as Day_0 and use 80% of the data collected on Day_0 for subject adaptation using the above-mentioned transfer learning techniques. The second collection, denoted by Day_1, is 3 days after the first; the third, Day_2, is 5 days after the second; and the fourth, Day_3, is 2 days after the third. We use the remaining 20% of the data collected on Day_0 and all the data collected on Day_1 to Day_3 as the test sets to evaluate the temporal stability of the framework. For comparison purposes, we apply the same procedure to the transfer learning techniques without our framework and denote it as Baseline. ANOVA is used to analyse the significant differences (significance level α = 0.05).
It is shown in Fig. 8 that the overall performance decreases when the test set is collected on a different day from the training set, and the performance of our framework on each day is better than that of the baseline method. The maximum decrease of our framework is about 7%, while the maximum decrease of the baseline method is about 10%. The performance on Day_2 is, interestingly, higher than that of Day_1 and Day_3. This may be attributed to potentially similar sensor placements between Day_0 and Day_2, since collections on different days involve not only the factor of time but also different sensor placements.

D. The Performance Comparison Between Facial Signals and Around-Ear Signals
In order to examine whether the around-ear signals derive from the facial sEMG signals, we perform a comparison study evaluating the performance of using facial signals, around-ear signals and their fusion. We train the classifier using 80% of the data from all 33 subjects, and validate and test the classifier on 10% of the data each. We use the around-ear signals, the factorized/mapped around-ear signals, the facial signals and the fusion of the facial signals and the factorized/mapped around-ear signals as inputs. The fusion is performed by stacking the signals across channels and applying PCA to keep the same input dimension. Then, we train and test the classifiers corresponding to each input separately. We use the same classifier and confidence rejection method and keep the same hyperparameters during training. We denote the classifier using the around-ear signals as Ear, the classifier using the factorized around-ear signals as EarMap, the classifier using the facial sEMG signals as Face, and the fusion of the latter two signals as EarMap + Face. ANOVA is used to analyse the significant differences (significance level α = 0.05).
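A minimal sketch of this fusion step is given below, assuming windowed inputs of shape (windows, channels, samples); applying scikit-learn's PCA across the stacked channel axis is our reading of the dimension-matching step, as the paper does not name a library.

```python
import numpy as np
from sklearn.decomposition import PCA

def fuse(ear_mapped, face, out_channels):
    """Stack EarMap and Face windows along the channel axis, then reduce
    the channel dimension back to out_channels with PCA."""
    stacked = np.concatenate([ear_mapped, face], axis=1)  # (W, C_e + C_f, N)
    W, C, N = stacked.shape
    flat = stacked.transpose(0, 2, 1).reshape(-1, C)      # samples x channels
    reduced = PCA(n_components=out_channels).fit_transform(flat)
    return reduced.reshape(W, N, out_channels).transpose(0, 2, 1)
```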
As shown in Fig. 9, directly using the around-ear signals as the input yields significantly lower performance than the other inputs. EarMap, i.e. the factorized around-ear signals, presents slightly lower but not significantly different performance compared with Face and EarMap + Face. Similarly, EarMap + Face presents lower but not significantly different performance compared with Face.

V. DISCUSSION
In this study, we propose a personalization framework for silent command recognition that aims to address the user-adaptation disadvantage of our previous work. The framework factorizes around-ear signals with a supervised factorization method to improve the signal-to-noise ratio, estimates temporally decayed confidence and performs the consequent acceptance and rejection. The framework can be easily integrated with transfer learning algorithms. Extensive experiments demonstrate the effectiveness of our framework.
Consistent with our assumption, the spectral factorization that learns dictionaries to map the around-ear signals into facial signals significantly improves the ability to adapt to new users. Such mappings present an SNR of around 3 dB, which improves the signal quality to a relatively large extent compared with the 0.2 to 2 dB SNR of around-ear signals reported in our previous study (Supplementary Note 1, Supplementary figure 1). In addition, the ablation experiments (NoConf versus Baseline, Full versus NoSpec) also demonstrate the effectiveness of the spectral factorization method under different sizes of training data and in combination with different transfer learning algorithms (Tab. I). The spectral factorization method constrains the input space of the classifier and thus improves the learned feature representation. This enables the pre-trained model to learn generalizable knowledge, which consequently improves the data efficiency when adapting to new users. It can be concluded that the spectral factorization method can effectively improve the signal quality and the separability of commands. These phenomena also implicitly support our assumption about the around-ear signals.
The confidence rejection method utilizes the time continuity of signals and introduces a temporal decay of the confidence of previous windows. It presents improved performance compared with previously used confidence rejection methods for sEMG-based classification across different sizes of training data (Supplementary Note 2). The better performance compared with ConfScore especially demonstrates the feasibility of further introducing the temporal decay of confidence (Supplementary figure 3). The ablation experiments (NoSpec versus Baseline, Full versus NoConf) also indicate that our confidence rejection method contributes to the improved performance of our framework. Combined with the comparison of ablating the spectral factorization, we demonstrate the validity and necessity of including both the spectral factorization and confidence rejection components.
The overall framework presents an improvement when combined with transfer learning algorithms. Integrating our framework constrains the input space, improves the SNR and thus improves the accuracy to a relatively large extent compared with solely using the around-ear signals. The accuracies of our framework under 60% training data are comparable with the accuracies of our original algorithm under 80% training data. Similar performance is achieved for 20% versus 40%, and for some algorithms 40% versus 60%, corresponding to our framework and the baseline, respectively. And the accuracies of our framework under 80% training data are beyond 90% for most transfer learning algorithms. Moreover, several transfer learning algorithms present better performance than Retrain, even under 20% of the training data. This indicates that our framework may ease the negative transfer phenomenon and improve transferability among subjects. Moreover, the temporal stability analysis shows that although our framework suffers from sEMG's common bottleneck of accuracy degradation over time, the maximum accuracy degradation (about 8%) is smaller than that of the baseline (about 10%). The performance demonstrates the validity of our framework.
The comparison among the factorized around-ear signals, the facial signals and their fusion presents similar performance with no significant difference in silent command recognition. This suggests that our hypothesis is valid for the SCR task, i.e. the around-ear signals contain similar information to the facial signals. The comparison between the around-ear signals (Ear) and the factorized around-ear signals (EarMap) demonstrates the effectiveness of our factorization method. Moreover, the user study (Supplementary Note 4) indicates a significant difference in user feedback between using facial and around-ear channels, which corresponds to the micro-interaction requirement.
The present study has several limitations. First, the dataset used in this study incorporates only 33 subjects, which might not represent a much larger user population, especially considering the application scenario of smart wearables. Second, the formulation of the spectral factorization is limited: it only considers the main articulated facial muscles and the sensor placement constrained by the sensor volume. Other signal sources, such as neck muscles, periocular muscles that often account for blinks, facial muscles of smaller size or deeper position, and noise, are not considered. Following studies should include such factors and formulate them as an additional component during the matrix factorization. Furthermore, although the threshold and weighting parameters of the confidence rejection method are estimated separately for each user, the spectral factorization is still performed at the group level. Further efforts should be devoted to formulating a personalized estimation of the mapping dictionaries.

VI. CONCLUSION
In this study, we propose Mordo2 to address the user adaptation issue by incorporating spectral factorization, temporal confidence rejection and commonly used transfer learning algorithms. The spectral factorization method learns the amplitude and phase attenuations in the frequency domain and maps the around-ear signals into facial signals. The temporal confidence rejection utilizes the time continuity of commands and introduces a temporal confidence decay into the confidence estimation. We collect a 33-subject dataset to train and evaluate the framework. Extensive experiments demonstrate that Mordo2 can achieve an accuracy of over 90%, and that similar performance can be achieved using our proposed framework with an obvious reduction of the training data size; for some specific conditions, the reduction can be approximately 20%. The extensive experiments demonstrate the effectiveness, necessity, temporal stability and benefits of using the around-ear signals. The proof-of-concept realization of the framework improves the ability to adapt to new users and can thus aid the system's further step toward a commercial product.

Fig. 1. The workflow of our algorithm. The red frames denote the delta and the algorithmic contributions of the study. The blue dotted frames denote each step of training the framework. In step 1, around-ear signals are factorized in the spectral domain; the factorization is supervised by the frequency- and phase-domain transformation of the facial sEMG signals. After training the spectral factorization and fixing its parameters, the factorized around-ear signals are fed into the CNN-LSTM network (step 2), and the labeled commands are used to calculate Loss 2. In step 3, the soft labels from the trained CNN-LSTM network are fed into the temporal confidence rejection algorithm and Loss 3 is calculated from the actual labels. After the whole training phase, the around-ear sEMG signals are first factorized by the learned dictionary. The factorized signals are then used to classify the silent commands, during which transfer learning techniques can optionally be used to adapt the model. Finally, the soft labels are used to determine whether the classification should be rejected, and the final classification is output.

Fig. 2. The performance of different transfer learning algorithms under different sizes of training data. The percentages on each figure denote the size of the labeled training data. MCD, DSAN and DDAN denote the three transfer learning algorithms we use for comparison purposes. Finetune denotes finetuning the classifier using the labeled data. Retrain denotes retraining the classifier from scratch using the labeled data. * denotes a significant difference compared with Retrain.

Fig. 3. The schematic diagram of a vector and vector summation in the complex plane.

Fig. 4. The schematic diagram of selected muscles and attached channels.

Fig. 5. The diagram of the experimental setup.

Fig. 6. The performance of our framework combined with different transfer learning algorithms under different sizes of training data. MCD_F, DSAN_F and DDAN_F denote the three transfer learning algorithms combined with our framework. Finetune_F denotes finetuning the classifier using the labeled data combined with our framework. Retrain_F denotes retraining the classifier under our framework from scratch using the labeled data. The percentages on each figure denote the size of the training data.

Fig. 7. The performance of the ablation experiments under different sizes of labeled training data. Baseline denotes the performance of solely using around-ear signals without spectral factorization or confidence rejection. NoSpec denotes the performance of our framework without spectral factorization. NoConf denotes the performance of our framework without confidence rejection. Full denotes the performance of our framework. * denotes a significant difference. *** denotes a significant difference compared with all the rest of the methods.

Fig. 8. The temporal stability of our framework and the baseline on different days. Day_0 to Day_3 denote the test days from the first to the fourth. * denotes a significant difference. *** denotes a significant difference compared with all the rest of the conditions.

Fig. 9. The performance comparison among different inputs. *** denotes a significant difference compared with all the rest of the methods.