GMMPLS: A Novel Automatic State-Based Algorithm for Continuous Decoding in BMIs

In this paper, a novel fully automated state-based decoding method, called Gaussian mixture model (GMM)-assisted PLS (GMMPLS), is proposed for continuous decoding problems in brain-machine interface (BMI) systems. In contrast to other state-based and hierarchical decoders, the proposed method does not demand any prior information about the structure of the desired output. Instead, GMMPLS uses the GMM algorithm to divide the desired output into a specific number of states (clusters). Next, a logistic regression model is trained to predict the membership probability of each time sample for each state. Finally, using the concept of the partial least squares (PLS) algorithm, GMMPLS constructs a model for decoding the desired output from the input data and the obtained membership probabilities. The performance of GMMPLS has been evaluated and compared to PLS, the nonlinear quadratic PLS (QPLS), and the Bayesian PLS (BPLS) methods on a simulated dataset and two different real-world BMI datasets. The results demonstrate that GMMPLS significantly outperformed PLS, QPLS, and BPLS over all datasets.


I. INTRODUCTION
Brain-machine interfaces (BMIs) are technologies for constructing an external pathway between the brain and a machine [1]. These systems capture neural activity and translate it into understandable commands for prostheses and exoskeleton robots [2]-[5], quadcopters [6], or any other external device.
Translating neural activity consists of extracting informative features and decoding them into the desired commands [7]. A BMI application may require discrete or continuous commands. Many researchers focus on decoding continuous parameters such as limb movement [8], applied force, and grasp trajectories [9]. Therefore, much research has been conducted on the machine learning aspect of BMIs to improve decoding.
Usually, for feature extraction in BMI, each channel of the recorded brain signals is decomposed into specific frequency ranges, e.g., delta 0.5-4 Hz, theta 4-8 Hz, alpha 8-13 Hz, beta 12-30 Hz, low-gamma 30-60 Hz, etc. (the exact frequency division depends on the sampling frequency of the recorded signals).
Afterward, the envelope of each band per channel is derived. Sometimes, specific previous lags of each envelope are included as well. Therefore, the feature space dimension equals the product of three factors: the number of channels, frequency bands, and lags [10]-[12]. From the machine learning perspective, this procedure usually leads to the high-dimensionality problem and overfitting.
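This pipeline can be summarized in code. The following is an illustrative sketch, not any particular paper's implementation: the sampling rate, band edges, filter order, and lag settings are assumptions chosen for the example.

```python
# Illustrative band-power feature pipeline: band-pass filtering,
# envelope extraction, and lag stacking. All settings are assumed.
import numpy as np
from scipy.signal import butter, sosfiltfilt, savgol_filter

fs = 250                                            # sampling rate (Hz), assumed
bands = [(0.5, 4), (4, 8), (8, 13), (13, 30), (30, 60)]  # example rhythms
n_lags, lag_step = 10, int(0.1 * fs)                # sample itself + 9 previous lags

def extract_features(eeg):                          # eeg: (n_samples, n_channels)
    feats = []
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        filtered = sosfiltfilt(sos, eeg, axis=0)
        # Envelope: rectify, then smooth with a Savitzky-Golay filter
        env = savgol_filter(np.abs(filtered), window_length=int(0.3 * fs) | 1,
                            polyorder=3, axis=0)
        for lag in range(n_lags):                   # stack current sample and past lags
            # (np.roll wraps at the start; ignored in this sketch)
            feats.append(np.roll(env, lag * lag_step, axis=0))
    return np.concatenate(feats, axis=1)            # (n_samples, channels*bands*lags)
```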
Beyond the high-dimensionality problem, another concern is choosing between linear and nonlinear models for continuous decoding. In many cases, a simple linear regression cannot express the relation between the input and the desired output. However, nonlinear regression methods increase the risk of overfitting. Besides that, choosing a proper nonlinear function for the decoder is a critical issue.
To deal with overfitting, many researchers utilize feature reduction and feature selection techniques. These techniques reduce the dimensionality of the input space by generating reduced-rank combinations of the features (recorded channels) or by selecting an optimal, relevant subset of them via specific criteria, in order to improve decoder performance. For example, Jin et al. employed bispectrum analysis for channel selection in an electroencephalogram (EEG)-based BMI system [13]. In another study, they proposed using the $\ell_1$-norm and Dempster-Shafer theory for feature selection in the same application. Nazari et al. demonstrated that selecting informative features using a feature-ranking approach based on the Wilcoxon criterion improved performance in a local field potential (LFP)-based cognitive BMI application [14].
Partial least squares (PLS) and its extensions have been widely used in BMI systems for continuous decoding [10]-[12], [15]-[25]. PLS is a linear regression technique that reduces the risk of overfitting by reducing collinearity in the input data [26]. To this end, PLS builds a low-rank linear relation between the extracted features (input data) and the desired output [25]. However, PLS still tends to overfit, and for that reason, some studies introduced sparsity and regularization into the PLS cost function [18], [25], [26].
A simple linear regression model in BMI applications may not always explain the relationship between the extracted features and the desired output [27]. Several nonlinear kernel-based PLS variants have been proposed and utilized in BMI applications to deal with this difficulty. Kernel-based PLS methods establish a linear relation between a nonlinear map of the input and the desired output. However, these methods impose high computational and storage costs when the number of training samples is large, since they require building a square kernel matrix whose dimension equals the number of samples [28]-[30]. On the other hand, other nonlinear PLS methods, such as quadratic PLS (QPLS), depend on choosing a suitable function for the nonlinear mapping of the input data.
Recently, some researchers have introduced state-based and hierarchical algorithms for decoding the desired output in BMI applications. These studies assert that the desired output may be composed of different processes. Therefore, the desired output samples are divided into a certain number of states (groups), and based on this segmentation, the input data samples are assigned to the related states. Afterward, a specific decoder is trained to learn the relation between the input and output samples in each group. Moreover, switching between the different continuous decoding models is performed by a classifier [31].
For example, a hierarchical PLS regression model was proposed by Bundy et al. in [32] for decoding hand speed, velocity, and position from human electrocorticogram (ECoG) signals. The authors knew that the designed task in their application contained rest and movement periods. Therefore, these periods were separated, and a logistic regression classifier with elastic-net regularization was trained to recognize whether the subject's hand was moving in every 300 ms time window. Next, a PLS model was trained for each of these states separately. In other words, the output of the classifier was used to switch between the two PLS models.
Ahmadi et al. proposed a state-based decoder for estimating applied force from LFP signals recorded in rats [33]. This study asserted that the applied force time series includes resting and active phases. Active phases comprised time segments wherein the rats were applying force, while in the resting phases they did not perform the task. In this research, the filter bank common spatial patterns (FBCSP) algorithm was employed in conjunction with linear discriminant analysis (LDA) to classify the recorded LFP signals into rest and active segments. Two different PLS models were then trained to decode each state of the desired output. This method led to higher performance compared with conventional PLS. It is worth mentioning that a 500 ms window of LFP signals was required to classify the input data into the rest/active groups in this study. Thus, employing this method causes at least a 500 ms delay in online decoding applications, as does the work of Bundy et al. in [32].
Farrokhi et al. proposed a state-based probabilistic decoding method for estimating the 3D hand position trajectories of monkeys from recorded electrocorticogram (ECoG) signals [24]. The authors stated that the desired output traces consisted of three states in their task: idle, right-hand movement, and left-hand movement. Therefore, the authors introduced a feature extraction scheme based on linear discriminant analysis (LDA) and PLS. Three LDA models were trained to discriminate each state from the others. The LDA filter outputs were fed to seven different PLS models: three models for predicting left-hand movement, three models for right-hand movement, and one model for estimating the movement state (left, right, or idle). Afterward, a probabilistic model based on the conditional expectation operator was trained to decode the desired output from the outputs of the PLS models. The results of this research indicated that their proposed method achieved better performance than naïve PLS and the Kalman filter.
To the best of our knowledge, all the hierarchical and state-based decoding algorithms proposed in the BMI area rely on prior information about the structure of the desired output. For instance, we must know that the desired output contains resting/non-resting periods or different movement (task) types, and we must be able to separate them in the first step of the state-based decoding algorithm. Therefore, these algorithms lack the generalization capability to be utilized in different types of applications.
In this paper, a generalized state-based decoding algorithm, called Gaussian mixture model (GMM)-assisted PLS (GMMPLS), is proposed. First, GMMPLS partitions the desired output into a specific number of clusters using the GMM algorithm, and it calculates the membership probability of each sample for each cluster. Next, a regularized logistic regression model is trained for each cluster to estimate the probability that an input sample belongs to that cluster. Finally, a novel extended PLS algorithm is developed to decode each extracted feature sample with respect to its probabilities. Unlike other state-based decoding algorithms, the proposed GMMPLS is fully automated, does not rely on prior knowledge about the desired output, and can be employed in different BMI applications (and even other types of continuous decoding paradigms). To illustrate this advantage, two different BMI datasets were used to validate the efficiency of the proposed method compared to PLS, the nonlinear QPLS [34], and the Bayesian PLS (BPLS) [35] methods. In addition to these real-world datasets, a simulation study was performed for a more rigorous analysis. In all scenarios, the results indicated that the proposed method outperformed the other regression techniques in terms of the decoding correlation coefficient, coefficient of determination, and mean absolute error metrics.
The paper is structured as follows: In Section II, a detailed description of conventional PLS is given, and then the framework of the proposed GMMPLS is described. In Section III, the employed synthetic and real-world BMI datasets, performance metrics, evaluation, and hyper-parameter tuning procedures are introduced. Finally, Section IV describes the obtained evaluation results, and Sections V and VI conclude the paper.

II. METHODS
First, a detailed description of conventional PLS is given. Next, the structure of the proposed method is presented.

A. NOTATIONS
Throughout this manuscript, let $\mathbb{R}$ and $T$ denote the set of real numbers and the transpose operator, respectively. Matrices are denoted by boldface capital letters ($\mathbf{X}$ and $\mathbf{Y}$), vectors by boldface lower-case letters ($\mathbf{x}$ and $\mathbf{y}$), row and column dimensions by italic upper-case letters ($M$ and $L$) except for $T$, and scalars by italic lower-case letters ($m$ and $l$). A matrix can be presented via its elements, i.e., $\mathbf{X} = \{x_{l,m}\}$, where $l$ and $m$ are the row and column indices, respectively.

B. PLS ALGORITHM
Let $\mathbf{X} = [\mathbf{x}_1, \cdots, \mathbf{x}_M] \in \mathbb{R}^{L \times M}$ and $\mathbf{Y} = [\mathbf{y}_1, \cdots, \mathbf{y}_N] \in \mathbb{R}^{L \times N}$ be the mean-centered extracted features (input) and desired output matrices, where $L$ is the number of samples, and $M$ and $N$ are the input and output dimensions, respectively. PLS seeks a reduced-rank linear relation between $\mathbf{X}$ and $\mathbf{Y}$ to decrease the chance of overfitting. Suppose $\mathbf{X}$ and $\mathbf{Y}$ can be decomposed as follows:
$$\mathbf{X} = \mathbf{T}\mathbf{P}^T + \mathbf{E}, \qquad \mathbf{Y} = \mathbf{U}\mathbf{Q}^T + \mathbf{F}, \tag{1}$$
where $\mathbf{T} = [\mathbf{t}_1, \cdots, \mathbf{t}_R] \in \mathbb{R}^{L \times R}$ and $\mathbf{U} = [\mathbf{u}_1, \cdots, \mathbf{u}_R] \in \mathbb{R}^{L \times R}$ contain the input's and output's latent vectors, respectively, $\mathbf{P} = [\mathbf{p}_1, \cdots, \mathbf{p}_R] \in \mathbb{R}^{M \times R}$ and $\mathbf{Q} = [\mathbf{q}_1, \cdots, \mathbf{q}_R] \in \mathbb{R}^{N \times R}$ consist of the loading vectors, $\mathbf{E}$ and $\mathbf{F}$ are the residual matrices, and $R$ is the decomposition rank. For this purpose, PLS finds projections of the input $\mathbf{t} = \mathbf{X}\mathbf{w} \in \mathbb{R}^L$ and the output $\mathbf{u} = \mathbf{Y}\mathbf{q} \in \mathbb{R}^L$ such that the cross-covariance between them is maximized:
$$\max_{\|\mathbf{w}\|_2 = \|\mathbf{q}\|_2 = 1} \mathbf{t}^T\mathbf{u} = \mathbf{w}^T \mathbf{X}^T \mathbf{Y} \mathbf{q}, \tag{2}$$
where $\mathbf{w} \in \mathbb{R}^M$ is the normalized basis vector and $\|\cdot\|_2$ is the $\ell_2$-norm operator. PLS usually solves this problem via the nonlinear iterative partial least squares (NIPALS) algorithm [36]. After finding maximally correlated latent vectors of $\mathbf{X}$ and $\mathbf{Y}$, the loading vector $\mathbf{p} \in \mathbb{R}^M$ can be calculated as follows:
$$\mathbf{p} = \frac{\mathbf{X}^T \mathbf{t}}{\mathbf{t}^T \mathbf{t}}. \tag{3}$$
A linear inner relation between the latent vectors $\mathbf{t}$ and $\mathbf{u}$ is assumed in PLS, i.e., $\mathbf{u} = d\,\mathbf{t} + \mathbf{e}$, where $d = \frac{\mathbf{u}^T \mathbf{t}}{\mathbf{t}^T \mathbf{t}}$ is a scalar and $\mathbf{e}$ represents the residual vector. After extracting the $r$-th latent and loading vectors, the input and desired output matrices are deflated by their rank-one estimates before seeking the subsequent latent vectors:
$$\mathbf{X}_{r+1} = \mathbf{X}_r - \mathbf{t}_r \mathbf{p}_r^T, \qquad \mathbf{Y}_{r+1} = \mathbf{Y}_r - d_r \mathbf{t}_r \mathbf{q}_r^T, \tag{4}$$
where $\mathbf{X}_1 = \mathbf{X}$ and $\mathbf{Y}_1 = \mathbf{Y}$. Finally, after extracting $R$ components, the linear relation between $\mathbf{X}$ and $\mathbf{Y}$ can be expressed as:
$$\mathbf{Y} = \mathbf{T}\mathbf{D}\mathbf{Q}^T + \mathbf{F}^*, \tag{5}$$
where $\mathbf{D} = \mathrm{diag}(d_1, \cdots, d_R)$ is a diagonal matrix formed from its arguments. Furthermore, the following relation can be proved:
$$\mathbf{T} = \mathbf{X}\mathbf{G}, \qquad \mathbf{G} = \mathbf{W}\left(\mathbf{P}^T\mathbf{W}\right)^{-1} \in \mathbb{R}^{M \times R}. \tag{6}$$
By substituting (6) into (5), the PLS regression relation can be derived as follows [37], [38]:
$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{F}^*, \tag{7}$$
where $\mathbf{B} = \mathbf{W}\left(\mathbf{P}^T\mathbf{W}\right)^{-1}\mathbf{D}\mathbf{Q}^T \in \mathbb{R}^{M \times N}$ is the PLS regression coefficient matrix. It is worth mentioning that $R$ is a hyper-parameter that can be tuned using cross-validation (CV) techniques. The PLS algorithm is presented in the appendix.
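For concreteness, a minimal NIPALS-style sketch of (1)-(7) is given below; variable names mirror the text and Algorithm 3 in the appendix, and the input matrices are assumed to be mean-centered.

```python
# Minimal NIPALS PLS sketch following (1)-(7) and Algorithm 3.
import numpy as np

def pls_nipals(X, Y, R, n_iter=100, tol=1e-8):
    X, Y = X.copy(), Y.copy()             # assumed already mean-centered
    M, N = X.shape[1], Y.shape[1]
    W, P, Q, D = (np.zeros((M, R)), np.zeros((M, R)),
                  np.zeros((N, R)), np.zeros(R))
    for r in range(R):
        q = np.random.randn(N); q /= np.linalg.norm(q)
        for _ in range(n_iter):
            u = Y @ q
            w = X.T @ u / (u @ u); w /= np.linalg.norm(w)
            t = X @ w
            q_new = Y.T @ t / (t @ t); q_new /= np.linalg.norm(q_new)
            if np.linalg.norm(q_new - q) < tol:
                q = q_new; break
            q = q_new
        t, u = X @ w, Y @ q
        p = X.T @ t / (t @ t)                 # loading vector, eq. (3)
        d = (u @ t) / (t @ t)                 # inner-relation scalar
        X -= np.outer(t, p)                   # rank-one deflation, eq. (4)
        Y -= d * np.outer(t, q)
        W[:, r], P[:, r], Q[:, r], D[r] = w, p, q, d
    # Regression coefficients: B = W (P^T W)^{-1} D Q^T, eq. (7)
    return W @ np.linalg.solve(P.T @ W, np.diag(D)) @ Q.T
```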

C. MAIN IDEA OF GMMPLS
The main idea behind GMMPLS is that the desired output may be composed of several different processes, and the contribution of each process in forming the desired output may vary over time. Suppose that the desired output is formed from $K$ different components:
$$\mathbf{y}(l) = \sum_{k=1}^{K} \gamma_k(l)\, \mathbf{y}_k(l), \tag{8}$$
where $\mathbf{y}_k(l) \in \mathbb{R}^N$ is the $k$-th constituent component of the desired output and $0 \le \gamma_k(l) \le 1$ is a scalar representing the participation probability of the $k$-th component at the $l$-th sample. With this assumption, (8) can be written as follows for decoding the desired output:
$$\hat{\mathbf{y}}(l) = \sum_{k=1}^{K} \gamma_k(l)\, f_k\!\left(\mathbf{x}(l)\right), \tag{9}$$
where $\mathbf{x}(l) \in \mathbb{R}^M$ is the $l$-th sample of the extracted features and $f_k(\cdot)$ is the $k$-th decoder. In other words, instead of switching between decoders in a binary manner for the final decoding, we suggest a probabilistic framework for this purpose. We propose utilizing the GMM algorithm to estimate $\gamma_k(l)$ in (9). GMM is a model-based clustering technique that fits $K$ independent Gaussian distributions to the data using the expectation-maximization (EM) algorithm. In summary, starting from an initialized mean vector and covariance matrix for each cluster, GMM estimates the membership probability of each sample for each cluster in the expectation step, and then, in the maximization step, GMM updates the mean vectors and covariance matrices based on the obtained probabilities [39]. The GMM algorithm is presented in the appendix.
It should be noted that GMM initialization is performed via the K-means algorithm in this study (see [40]-[42]). In addition, we used the z-scored version of $\mathbf{Y}$ in the GMM algorithm to prevent data magnitudes from affecting the learned GMM model.
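As a sketch of this clustering step, scikit-learn's GaussianMixture (which uses k-means initialization and EM fitting, matching the description above) can produce the per-sample membership probabilities; the actual implementation used here may differ.

```python
# Sketch of the clustering step: fit a K-component GMM to the z-scored
# desired output and read off per-sample membership probabilities.
import numpy as np
from sklearn.mixture import GaussianMixture

def output_membership_probs(Y, K):
    Yz = (Y - Y.mean(axis=0)) / Y.std(axis=0)        # z-score, as in the text
    gmm = GaussianMixture(n_components=K, init_params="kmeans").fit(Yz)
    return gmm.predict_proba(Yz)                     # Gamma in R^{L x K}
```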
After applying the GMM algorithm to the desired output $\mathbf{Y}$, the membership probability matrix for all clusters, $\mathbf{\Gamma} = [\boldsymbol{\gamma}_1, \cdots, \boldsymbol{\gamma}_K] = \{\gamma_k(l)\} \in \mathbb{R}^{L \times K}$, is calculated. Fig. 1 illustrates the main idea of the proposed method.

D. MEMBERSHIP PROBABILITY PREDICTION
The membership probabilities can be obtained from the available desired output using the GMM algorithm in the training phase. Nevertheless, this approach is not executable in the testing phase or in real-time applications, since the desired output is absent. Hence, we propose using a logistic regression model to decode the membership probabilities from the extracted features.
Consider a logistic regression for decoding the $k$-th membership probabilities:
$$\hat{\gamma}_k(l) = \sigma\!\left(\mathbf{h}_k^T \mathbf{x}(l) + h_{0k}\right), \tag{10}$$
where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function, $\mathbf{h}_k \in \mathbb{R}^M$ is the $k$-th linear regression vector, and $h_{0k}$ is the intercept term. Typically, the cross-entropy error is adopted as the logistic regression's cost function since, unlike the mean square error, it is convex [39], [43], [44]. This cost function can be written for estimating $\mathbf{h}_k$ in (10) as follows:
$$J(\mathbf{h}_k) = -\sum_{l=1}^{L} \left[\gamma_k(l) \ln \sigma_k(l) + \left(1 - \gamma_k(l)\right) \ln\left(1 - \sigma_k(l)\right)\right], \tag{11}$$
where $\gamma_k(l)$ and $\sigma_k(l)$ are the actual and predicted membership probabilities of the $k$-th cluster at the $l$-th sample. The gradient of (11) with respect to $\mathbf{h}_k$ can be written as:
$$\nabla J(\mathbf{h}_k) = \sum_{l=1}^{L} \left(\sigma_k(l) - \gamma_k(l)\right) \mathbf{x}(l). \tag{12}$$
To solve (11), adaptive moment estimation (ADAM) optimization is employed in this study. In short, ADAM is an optimization algorithm that uses the first and second moments of the gradient vector to estimate an adaptive learning rate for each optimization parameter [45]. In addition, to prevent overfitting during the learning procedure, the weight-decay scheme in ADAM is utilized in this study, which is an $\ell_2$-norm regularization procedure in ADAM optimization [46]. The details of this method are given in the appendix. It is worth mentioning that ADAM converges faster than traditional gradient descent algorithms.
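A hedged sketch of this membership-probability decoder is given below: the logistic regression in (10) trained on the gradient (12) with an AdamW-style update. The default hyper-parameter values follow Section III-B, but the function itself is illustrative rather than the authors' code.

```python
# Sketch: logistic regression (10) trained via cross-entropy gradient (12)
# with an AdamW-style update (ADAM + decoupled l2 weight decay).
import numpy as np

def fit_membership_logreg(X, gamma_k, alpha=0.001, lam=10.0,
                          beta1=0.9, beta2=0.999, eps=1e-8, n_iter=500):
    L, M = X.shape
    Xb = np.hstack([X, np.ones((L, 1))])          # absorb intercept h_{0k}
    h = np.zeros(M + 1)
    m, v = np.zeros(M + 1), np.zeros(M + 1)       # first/second moment estimates
    for t in range(1, n_iter + 1):
        sigma = 1.0 / (1.0 + np.exp(-(Xb @ h)))   # predicted probabilities, eq. (10)
        g = Xb.T @ (sigma - gamma_k)              # gradient of (11), eq. (12)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat, v_hat = m / (1 - beta1**t), v / (1 - beta2**t)   # bias correction
        # Weight decay: decoupled l2 regularization on h (AdamW)
        h -= alpha * (m_hat / (np.sqrt(v_hat) + eps) + lam * h)
    return h
```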

E. GMMPLS ALGORITHM
After estimating the membership probabilities $\hat{\mathbf{\Gamma}} = [\hat{\boldsymbol{\gamma}}_1, \cdots, \hat{\boldsymbol{\gamma}}_K]$ in the previous section, we should build a model for decoding the desired output with the aid of these probabilities. Therefore, we propose a novel method based on the concept of the PLS algorithm, called GMMPLS.
Assume the input and the desired output are zero-centered. Suppose each component of the desired output in (8) can be decomposed into latent and loading vectors in the form of equation (1):
$$\mathbf{y}_k(l) = \sum_{r=1}^{R} u_{r,k}(l)\, \mathbf{q}_{r,k} + \mathbf{f}_k(l), \tag{13}$$
where $\|\mathbf{q}_{r,k}\|_2 = 1$ and $\mathbf{f}_k(l)$ is the residual term. By substituting $\mathbf{y}_k(l)$ from (13) into (8), we have:
$$\mathbf{y}(l) = \sum_{k=1}^{K} \gamma_k(l) \left[\sum_{r=1}^{R} u_{r,k}(l)\, \mathbf{q}_{r,k} + \mathbf{f}_k(l)\right]. \tag{14}$$
For simplification, we assume that the loading vectors of all clusters are equal for each $r$, i.e., $\mathbf{q}_r = \mathbf{q}_{r,1} = \cdots = \mathbf{q}_{r,K}$. This assumption means that each latent variable of the desired output can be expressed and decomposed in the form of (8). Therefore, (14) can be rewritten as:
$$\mathbf{y}(l) = \sum_{r=1}^{R} \mathbf{q}_r \sum_{k=1}^{K} \gamma_k(l)\, u_{r,k}(l) + \mathbf{f}(l). \tag{15}$$
Finally, we assume a linear relation between $u_{r,k}(l)$ and a linear projection of the input data:
$$u_{r,k}(l) = d_{r,k,1}\, \mathbf{x}^T(l)\, \mathbf{w}_{r,k} + d_{r,k,0} + \psi_{r,k}(l), \tag{16}$$
where $\mathbf{w}_{r,k}$ is a normalized basis vector, $d_{r,k,1}$ and $d_{r,k,0}$ are scale and intercept scalars, respectively, and $\psi_{r,k}(l)$ is the error term. Equations (15) and (16) constitute the main framework of the proposed GMMPLS.
To explain the parameter-learning strategy in GMMPLS, suppose we want to extract the first latent and loading variables, i.e., $r = 1$. Therefore, we drop the $r$ subscripts to improve the readability of the equations. In addition, assume we possess initial values for $\mathbf{w}_k$, $\mathbf{q}$, $d_{k,1}$, and $d_{k,0}$; subsequently, $t_k(l)$ and $u_k(l)$ can be initialized for $k = 1, \cdots, K$.
Altogether, we have:
$$t_k(l) = \mathbf{x}^T(l)\, \mathbf{w}_k, \qquad u_k(l) = d_{k,1}\, t_k(l) + d_{k,0} + \psi_k(l), \quad k = 1, \cdots, K. \tag{17}$$
Since $\|\mathbf{q}\|_2 = 1$, projecting the desired output onto the loading vector gives $z(l) \approx \mathbf{y}^T(l)\, \mathbf{q}$. Substituting (15)-(17), we can claim:
$$z(l) \approx \sum_{k=1}^{K} \gamma_k(l) \left(d_{k,1}\, \mathbf{x}^T(l)\, \mathbf{w}_k + d_{k,0}\right). \tag{18}$$
Now, by defining the augmented matrices
$$\mathbf{\Phi} = \left[\boldsymbol{\gamma}_1 \mathbf{1}^T \otimes \mathbf{X}, \cdots, \boldsymbol{\gamma}_K \mathbf{1}^T \otimes \mathbf{X}\right] \in \mathbb{R}^{L \times KM}, \qquad \tilde{\mathbf{w}} = \left[d_{1,1}\mathbf{w}_1^T, \cdots, d_{K,1}\mathbf{w}_K^T\right]^T, \tag{19}$$
equation (18) can be reformulated as:
$$\mathbf{z} \approx \mathbf{\Phi}\tilde{\mathbf{w}} + \mathbf{\Gamma}\mathbf{d}_0, \tag{20}$$
where $\otimes$ is the element-wise multiplication operator and $\mathbf{d}_0 = [d_{1,0}, \cdots, d_{K,0}]^T$.
To avoid overfitting, we can solve (20) for $\tilde{\mathbf{w}}$ by a rank-1 PLS:
$$\tilde{\mathbf{w}} = \frac{\mathbf{\Phi}^T \mathbf{z}}{\left\|\mathbf{\Phi}^T \mathbf{z}\right\|_2}, \tag{21}$$
where each extracted $\mathbf{w}_k$ should have a unit norm:
$$\mathbf{w}_k \leftarrow \frac{\mathbf{w}_k}{\|\mathbf{w}_k\|_2}, \quad k = 1, \cdots, K. \tag{22}$$
After obtaining $\mathbf{w}_k$ for each $k = 1, \cdots, K$, the latent vectors $\mathbf{t}_k = \mathbf{X}\mathbf{w}_k$ are calculated, and the scalars $d_{k,1}$ and $d_{k,0}$ are updated by least squares so that (18) best fits $\mathbf{z}$. Now, by constructing the augmented matrix
$$\mathbf{T} = \left[\boldsymbol{\gamma}_1 \otimes \mathbf{t}_1, \cdots, \boldsymbol{\gamma}_K \otimes \mathbf{t}_K\right], \tag{23}$$
we have:
$$\hat{\mathbf{z}} = \sum_{k=1}^{K} \boldsymbol{\gamma}_k \otimes \left(d_{k,1}\, \mathbf{t}_k + d_{k,0}\mathbf{1}\right). \tag{24}$$
After deriving $\hat{\mathbf{z}}$ using (24), we can estimate $\mathbf{q}$ by the least-squares method:
$$\mathbf{q} = \frac{\mathbf{Y}^T \hat{\mathbf{z}}}{\hat{\mathbf{z}}^T \hat{\mathbf{z}}}, \qquad \mathbf{q} \leftarrow \frac{\mathbf{q}}{\|\mathbf{q}\|_2}. \tag{25}$$
The procedure in (17)-(25) is iterated until the parameters converge.
After finding the $r$-th components of GMMPLS, the desired output is deflated before seeking the subsequent components:
$$\mathbf{Y}_{r+1} = \mathbf{Y}_r - \hat{\mathbf{z}}_r \mathbf{q}_r^T. \tag{26}$$
After the deflation process in (26), GMMPLS returns to the iterative process in (17)-(25) to seek the next, $(r+1)$-th, components. Algorithm 1 demonstrates the GMMPLS procedure in detail.
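The following sketch puts (17)-(26) together for a single component, under our reading of the augmented matrices in (19), namely that Φ stacks probability-weighted copies of X and that the scalars d_{k,1}, d_{k,0} are refitted by least squares inside the loop; both points are assumptions, so treat this as illustrative rather than a faithful reimplementation.

```python
# Rough sketch of one GMMPLS component extraction, under assumed readings
# of (19)-(26); not a verified reimplementation of the original method.
import numpy as np

def gmmpls_component(X, Y, Gamma, n_iter=50):
    L, M = X.shape
    K = Gamma.shape[1]
    q = np.random.randn(Y.shape[1]); q /= np.linalg.norm(q)
    for _ in range(n_iter):
        z = Y @ q                                      # z(l) ~ y(l)^T q
        Phi = np.hstack([Gamma[:, [k]] * X for k in range(K)])   # eq. (19), assumed
        w_aug = Phi.T @ z                              # rank-1 PLS direction, eq. (21)
        Wk = w_aug.reshape(K, M)
        Wk = Wk / np.linalg.norm(Wk, axis=1, keepdims=True)      # unit-norm w_k, eq. (22)
        T = X @ Wk.T                                   # latent vectors t_k
        A = np.hstack([Gamma * T, Gamma])              # regressors for d_{k,1}, d_{k,0}
        d, *_ = np.linalg.lstsq(A, z, rcond=None)      # least-squares d update (assumed)
        z_hat = A @ d                                  # eq. (24)
        q = Y.T @ z_hat / (z_hat @ z_hat)              # loading update, eq. (25)
        q /= np.linalg.norm(q)
    Y_deflated = Y - np.outer(z_hat, q)                # deflation, eq. (26)
    return Wk, d, q, Y_deflated
```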
It is worth mentioning that the number of clusters K and GMMPLS rank R are hyper-parameters, and the process for choosing them will be explained in section III.

F. ESTIMATING THE DESIRED OUTPUT IN GMMPLS
After extracting the GMMPLS parameters in Algorithm 1, the desired output could be estimated via Algorithm 2.
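A sketch of this estimation step, reusing the hypothetical parameter layout from the component-extraction sketch in Section II-E and the logistic models of Section II-D, could read as follows; the renormalization of the predicted probabilities is also an assumption.

```python
# Sketch of test-time estimation (cf. Algorithm 2) under the same assumed
# model; `params` holds one (Wk, d, q) tuple per extracted component r.
import numpy as np

def gmmpls_predict(X, H, params):
    # H: list of K logistic weight vectors (intercept appended, Section II-D)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    Gamma = 1.0 / (1.0 + np.exp(-(Xb @ np.column_stack(H))))
    Gamma /= Gamma.sum(axis=1, keepdims=True)    # renormalize across clusters (assumed)
    Y_hat = 0.0
    for Wk, d, q in params:                      # accumulate component contributions
        T = X @ Wk.T
        z_hat = np.hstack([Gamma * T, Gamma]) @ d
        Y_hat = Y_hat + np.outer(z_hat, q)
    return Y_hat
```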

III. EXPERIMENTS
In this section, we first describe the employed datasets. Next, we discuss the hyper-parameter selection procedure, and finally we introduce the evaluation metrics used in this manuscript.
A. DATA DESCRIPTION
In this paper, three different datasets were used to evaluate the proposed method and compare it with PLS, QPLS, and BPLS. The first dataset is synthetic, the second is a publicly available ECoG dataset for decoding hand trajectories, and the last is our LFP dataset for decoding applied force.

1) SYNTHETIC DATASET
To build this dataset, we generated each column of $\mathbf{X} \in \mathbb{R}^{L \times M}$ randomly, with $L = 10000$ and $M = 500$, from Gaussian distributions with random means and variances; the means and variances were themselves drawn from the standard Gaussian distribution. To increase collinearity, we performed singular value decomposition (SVD) on $\mathbf{X}$, eliminated 40% of the singular vectors, and then reconstructed $\mathbf{X}$. We assumed that the desired output was generated as follows:
$$\mathbf{y}(l) = \sum_{k=1}^{K} \gamma_k(l)\, \mathbf{W}_k^T \mathbf{x}(l), \tag{27}$$
where:
$$\gamma_k(l) = \frac{\exp\left(\mathbf{x}^T(l)\, \mathbf{v}_k\right)}{\sum_{j=1}^{K} \exp\left(\mathbf{x}^T(l)\, \mathbf{v}_j\right)}. \tag{28}$$
In these relations, $\mathbf{W}_k \in \mathbb{R}^{M \times N}$ and $\mathbf{v}_k \in \mathbb{R}^M$ were generated randomly from Gaussian distributions with random means and covariances. It is worth mentioning that the samples of $\gamma_k$ are bounded between 0 and 1, and $\sum_{k=1}^{K} \gamma_k(l) = 1$. Twelve different cases were considered in this simulation by assuming $N \in \{1, 3, 5\}$ and $K \in \{1, 2, 3, 4\}$. In addition, we ran the simulation 100 times for performance analysis. In each run, the first 90% of the synthetic samples were used to train the decoders, and the remaining 10% were used for performance evaluation.
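The generator above can be sketched in a few lines; the softmax form of (28) and the random-number conventions are our assumptions.

```python
# Sketch of the synthetic-data generator (27)-(28); parameter draws and
# the softmax normalization of gamma_k are assumptions.
import numpy as np

def make_synthetic(L=10000, M=500, N=3, K=2, svd_keep=0.6, seed=0):
    rng = np.random.default_rng(seed)
    # Columns of X: Gaussian with random means/variances
    X = rng.normal(rng.normal(size=M), np.abs(rng.normal(size=M)), size=(L, M))
    # Increase collinearity: drop 40% of the singular vectors, reconstruct
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int(svd_keep * len(s))
    X = U[:, :r] * s[:r] @ Vt[:r]
    Ws = rng.normal(size=(K, M, N))          # per-cluster linear maps W_k
    V = rng.normal(size=(M, K))              # projections defining gamma_k
    logits = X @ V
    Gamma = np.exp(logits - logits.max(axis=1, keepdims=True))
    Gamma /= Gamma.sum(axis=1, keepdims=True)    # in [0, 1], rows sum to one
    Y = sum(Gamma[:, [k]] * (X @ Ws[k]) for k in range(K))   # eq. (27)
    return X, Y, Gamma
```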

2) PUBLIC ECoG DATASET
This public dataset was introduced in 2012 by Shimoda et al. for decoding hand movement trajectories from monkeys' recorded ECoG signals [10]. ECoG electrodes with 64 channels were implanted in the epidural space over the prefrontal cortex (PFC) and the primary somatosensory cortex (S1) in the left hemisphere. Each animal performed ten 15-minute sessions of a food-tracking task while ECoG signals were recorded at a 1000 Hz sampling rate and hand motions were recorded at a sampling rate of 120 Hz.

For feature extraction, similar to [18], the ECoG signals were down-sampled to 250 Hz, and a common average reference (CAR) was applied to the recorded channels. Next, the ECoG signals were divided into six frequency bands (1-4, 4-8, 8-12, 12-30, 30-60, and 60-120 Hz) using an eighth-order Butterworth filter to extract the delta, theta, alpha, beta, low-gamma, and high-gamma1 brain rhythms. It is worth mentioning that the upper bound of the high-gamma1 band was set to 120 Hz due to the frequency range of epidural ECoG signals (<120 Hz). Each frequency band of each channel was then rectified and smoothed using a third-order Savitzky-Golay moving average with 0.3 s width to obtain its envelope. The extracted envelopes represent the variations within each frequency band. Eventually, for each time sample, the sample itself and its nine previous samples at 0.1 s intervals were collected as the features in each frequency band of each channel. This yielded feature vectors of dimension 3840 = 64 (channels) × 6 (bands) × 10 (lags).

Algorithm 1 GMMPLS Algorithm for Estimating the Desired Y From X
Input: $\mathbf{X} \in \mathbb{R}^{L \times M}$, $\mathbf{Y} \in \mathbb{R}^{L \times N}$, predicted membership probabilities $\hat{\mathbf{\Gamma}} = [\hat{\boldsymbol{\gamma}}_1, \cdots, \hat{\boldsymbol{\gamma}}_K] \in \mathbb{R}^{L \times K}$, number of PLS components $R$.
Assign $\mathbf{Y}_1 = \mathbf{Y}$
for $r = 1$ to $R$ do
  Initialize the loading vector $\mathbf{q}_r$ and the scalars $d_{r,k,1}$, $d_{r,k,0}$
  while not convergence do
    $\mathbf{z}_r = \mathbf{Y}_r \mathbf{q}_r$
    Calculate $\mathbf{\Phi}$ using (19)
    Solve (20) via rank-1 PLS (21) and normalize each $\mathbf{w}_{r,k}$ (22)
    Compute $\mathbf{t}_{r,k} = \mathbf{X}\mathbf{w}_{r,k}$ and update $d_{r,k,1}$, $d_{r,k,0}$
    Compute $\hat{\mathbf{z}}_r$ using (23)-(24) and update $\mathbf{q}_r$ using (25)
  end while
  $\mathbf{Y}_{r+1} \leftarrow \mathbf{Y}_r - \hat{\mathbf{z}}_r \mathbf{q}_r^T$
end for

Algorithm 2 Estimation of the Desired Output in GMMPLS
Input: $\mathbf{X} \in \mathbb{R}^{L \times M}$, predicted membership probabilities $\hat{\mathbf{\Gamma}}$, and the learned GMMPLS parameters $\{\mathbf{w}_{r,k}, d_{r,k,1}, d_{r,k,0}, \mathbf{q}_r\}$.
for $r = 1$ to $R$ do
  $\mathbf{t}_{r,k} = \mathbf{X}\mathbf{w}_{r,k}$ and $\hat{\mathbf{u}}_{r,k} = d_{r,k,1}\mathbf{t}_{r,k} + d_{r,k,0}\mathbf{1}$ for $k = 1, \cdots, K$
  $\hat{\mathbf{z}}_r = \sum_{k=1}^{K} \hat{\boldsymbol{\gamma}}_k \otimes \hat{\mathbf{u}}_{r,k}$
end for
$\hat{\mathbf{Y}} = \sum_{r=1}^{R} \hat{\mathbf{z}}_r \mathbf{q}_r^T$
The objective on this dataset was to decode the three-dimensional right-arm trajectories of the monkeys from the extracted features using PLS, QPLS, BPLS, and the proposed GMMPLS.

3) LFP DATASET
This dataset was collected in our neuroscience lab in 2016 by Khorasani et al. for decoding rats' applied force from 16 channels of LFP signals [11]. Thirsty rats were trained to apply a certain amount of force on a load cell to receive a drop of water as a reward.
A 16-channel microwire array was implanted in the forelimb sensorimotor cortex of the left hemisphere of three Wistar rats. The neural signals were recorded at a 10 kHz sampling rate during the task. First, a low-pass filter with a 500 Hz cutoff frequency was applied to the recorded signals, and then the signals were down-sampled to 1000 Hz. Simultaneously, the output of the load cell was recorded at a 30 Hz sampling rate.
The envelope of each channel's frequency bands was extracted by rectifying and smoothing with a third-order Savitzky-Golay moving average of 0.3 s width. Finally, analogously to the previous sub-section, each time sample and its lags were collected as the features in each channel's frequency bands. Therefore, the dimension of the extracted features was 1280 = 16 (channels) × 8 (bands) × 10 (lags) for each time sample.
The goal on this dataset was to continuously decode the one-dimensional applied force from the extracted features for each rat.

B. HYPER-PARAMETER SELECTION
In the PLS, QPLS, and BPLS algorithms, the only hyper-parameter is the decomposition rank $R$. We used the BPLS toolbox available at https://github.com/vidaurre/bpls. GMMPLS, on the other hand, involves additional hyper-parameters: the decomposition rank $R$, the number of clusters $K$, and the regularization parameter $\lambda$ for membership probability estimation (see Algorithm 5 in the appendix). In addition, $\beta_1$, $\beta_2$, $\varepsilon$, and the step size $\alpha$ in ADAM are also hyper-parameters that have to be tuned (see Algorithm 5 in the appendix).
Usually, the fixed values $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\varepsilon = 10^{-8}$ are used with ADAM in the literature. Therefore, we adopted these values and did not tune these hyper-parameters. On the other hand, the step size $\alpha$ can be tuned within the ADAM optimization. Therefore, we considered the list $\alpha \in [0, 0.0001, 0.001, 0.01, 0.1]$ and performed a line search over these values when updating $\mathbf{h}_k(t)$, selecting at each iteration the one that yielded the lowest cost value in equation (11).
For tuning $R$, $K$, and $\lambda$, we performed an internal 10-fold CV procedure within the GMMPLS algorithm. More precisely, we divided the training data into new training and validation sets using 10-fold CV and chose the hyper-parameter values that minimized the mean square error (MSE) of the desired output estimation on the validation set. The optimal values were selected from the predefined sets $R \in [1, \cdots, 50]$, $K \in [1, \cdots, 5]$, and $\lambda \in [0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100]$. In the PLS, QPLS, and BPLS algorithms, the same procedure was performed to optimize $R$.
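A sketch of this inner-CV grid search is shown below; `train_gmmpls` and `predict_gmmpls` are hypothetical placeholders for the training and prediction routines sketched earlier. An exhaustive sweep of this grid is expensive; it is written out here only to make the procedure explicit.

```python
# Illustrative inner 10-fold CV grid search for (R, K, lambda).
import itertools
import numpy as np
from sklearn.model_selection import KFold

def tune_hyperparams(X, Y, train_gmmpls, predict_gmmpls):
    grid = itertools.product(
        range(1, 51),                                           # R
        range(1, 6),                                            # K
        [0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100])              # lambda
    best, best_mse = None, np.inf
    for R, K, lam in grid:
        fold_mse = []
        for tr, va in KFold(n_splits=10, shuffle=True).split(X):
            model = train_gmmpls(X[tr], Y[tr], R=R, K=K, lam=lam)
            fold_mse.append(np.mean((Y[va] - predict_gmmpls(model, X[va])) ** 2))
        if np.mean(fold_mse) < best_mse:
            best, best_mse = (R, K, lam), np.mean(fold_mse)
    return best
```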

C. PERFORMANCE EVALUATION
To evaluate the proposed method and compare it with the other regression techniques, we performed a 10 × 10-fold CV procedure on the ECoG and LFP datasets to divide the data into training and test sets. Briefly, the recorded brain signals (ECoG and LFP) and the desired outputs (hand trajectories and applied force) were randomly split into ten non-overlapping folds. At each stage, nine folds were used to train the decoders, and the remaining fold was used to test the decoding performance. Each fold was used as the test set exactly once. Finally, this procedure was repeated ten times, and all 100 obtained performance values were averaged for each decoding method.
In this manuscript, we employed three different metrics to analyze and compare the PLS, QPLS, BPLS, and GMMPLS performances. The first is the correlation coefficient, which ranges between −1 and 1:
$$\rho = \frac{\mathrm{cov}\left(\mathbf{y}, \hat{\mathbf{y}}\right)}{\mathrm{std}\left(\mathbf{y}\right)\, \mathrm{std}\left(\hat{\mathbf{y}}\right)},$$
where $\mathrm{cov}(\cdot)$ and $\mathrm{std}(\cdot)$ are the cross-covariance and standard deviation operators, respectively. The next metric is the coefficient of determination, denoted by $R^2$, which represents the ''goodness of fit'' and ranges between $-\infty$ and 1. $R^2 = 1$ means a perfect fit, and $R^2 = 0$ means that the estimated desired output is equivalent to predicting 0. Negative values of $R^2$ indicate a severe failure of the decoding procedure. $R^2$ is defined as:
$$R^2 = 1 - \frac{\mathrm{var}\left(\mathbf{y} - \hat{\mathbf{y}}\right)}{\mathrm{var}\left(\mathbf{y}\right)},$$
where $\mathrm{var}(\cdot)$ is the variance operator. It is worth mentioning that $\mathrm{var}\left(\mathbf{y} - \hat{\mathbf{y}}\right)/\mathrm{var}\left(\mathbf{y}\right)$ is known as the normalized mean square error (NMSE) in the literature. The last metric used in this paper is the mean absolute error:
$$\mathrm{MAE} = \mathrm{mean}\left(\left|\mathbf{y} - \hat{\mathbf{y}}\right|\right),$$
where $\mathrm{mean}(\cdot)$ and $|\cdot|$ are the average and absolute value operators, respectively. Compared to MSE-based metrics (like $R^2$ and NMSE), MAE is less sensitive to outliers and large error values since it does not involve squaring [47].

TABLE 1. Decoding correlation coefficients obtained in different scenarios for PLS, QPLS, BPLS, and the proposed GMMPLS. N represents the dimension of the desired output, and K represents the number of clusters used in generating the dataset. GMMPLS-K stands for training GMMPLS with K clusters. Bold values indicate the highest performance in each column. Higher values are better.
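For one output dimension, the three metrics can be computed directly:

```python
# The three evaluation metrics, written out for a single output dimension.
import numpy as np

def correlation(y, y_hat):
    return np.corrcoef(y, y_hat)[0, 1]           # rho in [-1, 1]

def r_squared(y, y_hat):
    return 1.0 - np.var(y - y_hat) / np.var(y)   # 1 - NMSE

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))
```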

D. RUN-TIME COMPARISON
For complexity comparison, we performed a run-time analysis for each decoding method. To this end, we used three minutes of data from the ECoG and LFP datasets to train PLS, QPLS, BPLS, and the proposed GMMPLS, and we measured the training run-times separately. This process was repeated 100 times, and the obtained values were averaged.
To exclude the effects of hyper-parameter tuning and the CV procedure from this analysis, we used fixed hyper-parameter values in all decoding methods. We set $R = 20$ for all methods and used a fixed regularization parameter $\lambda = 10$ in training GMMPLS. In addition, we trained the proposed method with different numbers of clusters $K$ to evaluate the effect of this hyper-parameter on the computational cost.

IV. RESULTS

A. SYNTHETIC DATASET
As mentioned before, we generated the synthetic dataset randomly 100 times. Each time, 90% of the generated data was used to train PLS, QPLS, BPLS, and GMMPLS, and the rest was used for performance evaluation. Tables 1, 2, and 3 report the correlation coefficient, $R^2$, and MAE decoding performances achieved by each method in this simulation study. Different cases were considered: $N \in \{1, 3, 5\}$ and $K \in \{1, 2, 3, 4\}$ led to 12 different scenarios. In addition, we performed the decoding using different numbers of clusters in GMMPLS; e.g., GMMPLS-3 means that we set $K = 3$ in training GMMPLS. These results indicate that when the desired output is composed of more than one process, i.e., $K > 1$, our proposed algorithm with $K > 1$ achieved higher performance. On the other hand, GMMPLS with $K = 1$ resulted in approximately the same performance as PLS. In other words, GMMPLS simplifies to ordinary PLS when we set $K = 1$ in training.
It is evident that increasing the number of states in the generation of the desired output decreased decoding performance. However, the results illustrate that the proposed GMMPLS withstands this effect better than the other methods. GMMPLS with $K = 2$ achieved the highest performance in this simulation study.

B. ECoG DATASET
In this sub-section, an example of the estimated hand trajectories of the first monkey using PLS and GMMPLS is first depicted in Fig. 2. To keep this figure clear, the QPLS and BPLS predicted traces are not drawn. In this figure, a 70 s time window of the observed and predicted hand trajectories is shown for the three dimensions. In this example, all hyper-parameters were optimized through the CV procedure, and the optimized number of clusters in GMMPLS was $K = 4$.

TABLE 2. Decoding coefficient of determination ($R^2$) obtained in different scenarios for PLS, QPLS, BPLS, and the proposed GMMPLS. N represents the dimension of the desired output, and K represents the number of clusters used in generating the dataset. GMMPLS-K stands for training GMMPLS with K clusters. Bold values indicate the highest performance in each column. Higher values are better.

The achieved performances on this dataset are shown in Fig. 3. The correlation coefficients, coefficients of determination, and mean absolute errors are given for monkeys 1 and 2 in three dimensions for the PLS, QPLS, BPLS, and GMMPLS methods (mean ± standard error). All the hyper-parameters in each method were tuned using the described CV technique to obtain these results.
This figure demonstrates that, except in the Z dimension, the proposed method outperformed the other methods in all directions for both animals. The Friedman non-parametric test with post-hoc Bonferroni correction was applied to the achieved performances for a more rigorous analysis. This statistical test revealed that the superiority of GMMPLS over PLS, QPLS, and BPLS is significant (p < 0.001 for all metrics). These results indicate that the proposed method improved the performance metrics by about 15%, 12%, and 13% in $\rho$; 64%, 52%, and 52% in $R^2$; and 22%, 20%, and 20% in MAE compared to PLS, QPLS, and BPLS, respectively.
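As a sketch of this statistical comparison, the Friedman test can be applied to the paired per-fold scores; `scores` below is a hypothetical placeholder for the 100 measured CV performances of the four methods.

```python
# Illustrative use of the Friedman test on paired per-fold scores.
import numpy as np
from scipy.stats import friedmanchisquare

scores = np.random.rand(100, 4)    # placeholder: (folds x [PLS, QPLS, BPLS, GMMPLS])
stat, p = friedmanchisquare(*(scores[:, j] for j in range(4)))
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")
# Post-hoc pairwise comparisons would follow, with Bonferroni-corrected
# significance thresholds.
```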
An example explaining the effect of the number of components ($R$) and the number of clusters ($K$) on decoding performance in the ECoG dataset is given in Fig. 4. The rank $R$ was varied from 0 to 20 for decoding hand trajectories, and the achieved performances were averaged over the three dimensions for PLS, QPLS, BPLS, and GMMPLS. In addition, GMMPLS was trained using different numbers of clusters, i.e., from $K = 1$ to $K = 5$, denoted as GMMPLS-K in this example. Fig. 4 demonstrates that the performance of GMMPLS with $K = 1$ was approximately equal to PLS at lower $R$ values. GMMPLS-2 behaved slightly better than PLS, QPLS, BPLS, and GMMPLS-1. However, the best results were achieved when the decoding was performed using $K = 3$ to $K = 5$ in GMMPLS. Finally, this example showed that all the decoders' performances became steady for $R \ge 4$, except BPLS.
C. LFP DATASET
Fig. 5 shows an example of the decoded applied force trace using PLS and GMMPLS on the LFP dataset. To keep this figure clear, the QPLS and BPLS predicted traces are not drawn. A time window (≈30 s) of the observed and predicted desired output for the first rat is depicted in this example. All hyper-parameters were optimized through the CV procedure. The optimized number of clusters in GMMPLS was $K = 3$, and the numbers of components used were $R = 5$ and $R = 10$ for PLS and GMMPLS, respectively. The performances obtained in this example were $\rho_{\mathrm{PLS}} = 0.82$, $R^2_{\mathrm{PLS}} = 0.62$, and $\mathrm{MAE}_{\mathrm{PLS}} = 4.84$ for PLS, and $\rho_{\mathrm{GMMPLS}} = 0.89$, $R^2_{\mathrm{GMMPLS}} = 0.79$, and $\mathrm{MAE}_{\mathrm{GMMPLS}} = 2.85$ for GMMPLS. Again, similar to the ECoG dataset, it can be noticed that the proposed method outperformed PLS when decoding steady intervals of the desired output. Fig. 6 demonstrates the performances obtained on the LFP dataset via the correlation coefficient, coefficient of determination, and mean absolute error metrics for the three rats using all decoding methods. Again, the hyper-parameters were optimized through the CV procedure. From this figure, it appears that the proposed method slightly outperformed PLS, QPLS, and BPLS in all metrics. The Friedman non-parametric test with post-hoc Bonferroni correction was employed to investigate this observation. This test revealed that the proposed method's results were significantly higher than those of PLS, QPLS, and BPLS (p < 0.001 for all metrics).
It is worth mentioning that the improvement in MAE was more remarkable than in the other metrics, since the desired output in this dataset includes intervals with high-value peaks. Prediction errors in these intervals strongly affect $\ell_2$-norm-based performance metrics (like $\rho$ and $R^2$). On the other hand, since MAE is an $\ell_1$-norm-based metric, it is less sensitive to large error values. In addition, the proposed method produced smaller prediction errors in steady intervals compared to PLS, QPLS, and BPLS, and this property led to a lower MAE.
Finally, the effect of the decomposition rank ($R$) and the number of clusters ($K$) on the decoding performance for the second rat in the LFP dataset is illustrated via an example in Fig. 7. The rank $R$ was varied from 0 to 20 for all decoding methods. Different numbers of clusters were used for training the proposed GMMPLS, denoted as GMMPLS-K. This figure demonstrates that the best performance in this example was achieved using GMMPLS-2.
As can be deduced from this figure, PLS, QPLS, and GMMPLS-1 behaved similarly for $R \le 5$. However, unlike GMMPLS, the $R^2$ and MAE metrics achieved by PLS and QPLS degraded when more than $R = 5$ components were used for training. On the other hand, the performance of BPLS was lower than the others for $R \le 5$. Unlike PLS and QPLS, BPLS performance did not decrease for $R > 5$, and it behaved almost the same as GMMPLS-1. However, the best performances belonged to GMMPLS with $K = 2$ to $K = 5$. Therefore, we can claim that the proposed method is more robust to overfitting than PLS and QPLS when the number of components is overestimated.

D. RUN-TIME COMPARISON
In this sub-section, we present the run-time analysis results as a measure of computational complexity. Table 4 presents each decoder's training time using three minutes of the LFP and ECoG datasets. Again, different numbers of clusters were used for training GMMPLS, denoted as GMMPLS-K.
It can be seen that the computational complexity of GMMPLS-1 is almost equal to that of PLS and QPLS. However, the run-time increases with the number of clusters ($K$) used in the GMMPLS structure, which is not unexpected.

V. DISCUSSION
The simulation and experimental results demonstrated the efficacy of the proposed method for continuous decoding problems. This method led to statistically significantly higher performances in terms of the correlation coefficient, coefficient of determination, and mean absolute error compared to ordinary PLS, QPLS, and BPLS. GMMPLS achieved more success than the other employed methods in decoding steady intervals. This may occur because of the state-based and hierarchical nature of GMMPLS. The proposed method divides the desired output into a specific number of clusters. Then, the membership probabilities of each sample are estimated for each cluster, and these probabilities are employed in the GMMPLS algorithm. Therefore, the aforementioned steady intervals are recognized as a specific cluster, which helps GMMPLS perform more accurately on those samples.
As mentioned before, all previous state-based algorithms were introduced for specific applications. Thus, they depend on the existence of prior information about the structure of the desired output. On the contrary, the proposed method is more generalizable for such a purpose, since the clustering and decoding procedures are fully automated.
The proposed method includes three hyper-parameters that have to be tuned using the CV procedure: the regularization parameter ($\lambda$), the decomposition rank ($R$), and the number of clusters ($K$). The regularization parameter ($\lambda$) is used to avoid overfitting in the ADAM algorithm. We tuned this parameter using the CV procedure throughout this manuscript. However, our investigation revealed that setting $\lambda = 10$ or $\lambda = 20$ may be a good choice in most scenarios. The decomposition rank is a common hyper-parameter in all variants of PLS algorithms. We showed in Fig. 7 that GMMPLS is more robust to increasing $R$ than PLS and QPLS. We also illustrated that the proposed method attains higher performances at lower $R$ than the other employed methods.
GMMPLS with one cluster ($K = 1$) reduces to ordinary PLS. On the other hand, our simulation study illustrated that increasing $K$ does not necessarily improve performance. In other words, the optimal $K$ depends on the structure of the desired output, and it should be tuned using a CV procedure.
In the end, we used a simple logistic regression with the regularized ADAM optimizer to decode the membership probabilities. However, based on our experiments, we found that the membership probability estimation in GMMPLS significantly influences decoding precision. Therefore, improving this estimation will be the goal of our future studies.
It should be noted that the MATLAB code of the proposed GMMPLS is available upon request from the authors.

VI. CONCLUSION
This paper presents a novel fully automated state-based decoder called GMMPLS for continuous decoding in various BMI applications. Unlike other state-based decoders, GMMPLS does not rely on prior information about the structure of the desired output. We illustrated and evaluated the generalization of GMMPLS on two different real-world datasets and a synthetic dataset. In all cases, the proposed method outperformed ordinary PLS, quadratic PLS (QPLS), and Bayesian PLS (BPLS) in terms of the decoding correlation coefficient, coefficient of determination, and mean absolute error metrics.

APPENDIX
The NIPALS algorithm for PLS is outlined in Algorithm 3. Algorithm 4 explains the details of the GMM method implementation.

Algorithm 3 PLS Algorithm for Estimating the Desired Y From X
Input: $\mathbf{X} \in \mathbb{R}^{L \times M}$, $\mathbf{Y} \in \mathbb{R}^{L \times N}$, number of PLS components $R$.
Output: $\mathbf{T} = [\mathbf{t}_1, \cdots, \mathbf{t}_R] \in \mathbb{R}^{L \times R}$, $\mathbf{P} = [\mathbf{p}_1, \cdots, \mathbf{p}_R] \in \mathbb{R}^{M \times R}$, $\mathbf{Q} = [\mathbf{q}_1, \cdots, \mathbf{q}_R] \in \mathbb{R}^{N \times R}$, $\mathbf{W} = [\mathbf{w}_1, \cdots, \mathbf{w}_R] \in \mathbb{R}^{M \times R}$, $\mathbf{D} = \mathrm{diag}(d_1, \cdots, d_R) \in \mathbb{R}^{R \times R}$, and $\mathbf{B} \in \mathbb{R}^{M \times N}$.
Assign $\mathbf{X}_1 = \mathbf{X}$ and $\mathbf{Y}_1 = \mathbf{Y}$
for $r = 1$ to $R$ do
  Initialize $\mathbf{q}_r$
  while not convergence do
    $\mathbf{u}_r = \mathbf{Y}_r \mathbf{q}_r$
    $\mathbf{w}_r = \frac{\mathbf{X}_r^T \mathbf{u}_r}{\mathbf{u}_r^T \mathbf{u}_r}$, $\quad \mathbf{w}_r \leftarrow \frac{\mathbf{w}_r}{\|\mathbf{w}_r\|_2}$
    $\mathbf{t}_r = \mathbf{X}_r \mathbf{w}_r$
    $\mathbf{q}_r = \frac{\mathbf{Y}_r^T \mathbf{t}_r}{\mathbf{t}_r^T \mathbf{t}_r}$, $\quad \mathbf{q}_r \leftarrow \frac{\mathbf{q}_r}{\|\mathbf{q}_r\|_2}$
  end while
  $\mathbf{p}_r = \frac{\mathbf{X}_r^T \mathbf{t}_r}{\mathbf{t}_r^T \mathbf{t}_r}$, $\quad d_r = \frac{\mathbf{u}_r^T \mathbf{t}_r}{\mathbf{t}_r^T \mathbf{t}_r}$
  $\mathbf{X}_{r+1} \leftarrow \mathbf{X}_r - \mathbf{t}_r \mathbf{p}_r^T$
  $\mathbf{Y}_{r+1} \leftarrow \mathbf{Y}_r - d_r \mathbf{t}_r \mathbf{q}_r^T$
end for
$\mathbf{B} = \mathbf{W}\left(\mathbf{P}^T\mathbf{W}\right)^{-1}\mathbf{D}\mathbf{Q}^T \in \mathbb{R}^{M \times N}$

Algorithm 4 GMM Algorithm for Estimating Membership Probability
Input: $\mathbf{Y} = [\mathbf{y}(1), \cdots, \mathbf{y}(L)]^T \in \mathbb{R}^{L \times N}$, number of clusters $K$.
Output: $\mathbf{\Gamma} = \{\gamma_k(l)\} \in \mathbb{R}^{L \times K}$.
Initialize the means $\boldsymbol{\mu}_k \in \mathbb{R}^N$ and covariances $\mathbf{\Sigma}_k \in \mathbb{R}^{N \times N}$ using the K-means algorithm, and the mixing coefficients $\pi_k = \frac{1}{K}$ for $k = 1, \cdots, K$
while not convergence do
  E-step:
  $$\gamma_k(l) = \frac{\pi_k\, \mathcal{N}\left(\mathbf{y}(l) \mid \boldsymbol{\mu}_k, \mathbf{\Sigma}_k\right)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}\left(\mathbf{y}(l) \mid \boldsymbol{\mu}_j, \mathbf{\Sigma}_j\right)},$$
  where $\mathcal{N}(\cdot)$ is the Gaussian distribution operator.
  M-step:
  $$N_k = \sum_{l=1}^{L} \gamma_k(l), \quad \boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{l=1}^{L} \gamma_k(l)\, \mathbf{y}(l), \quad \mathbf{\Sigma}_k = \frac{1}{N_k}\sum_{l=1}^{L} \gamma_k(l)\left(\mathbf{y}(l) - \boldsymbol{\mu}_k\right)\left(\mathbf{y}(l) - \boldsymbol{\mu}_k\right)^T, \quad \pi_k = \frac{N_k}{L}.$$
end while

Note that we used K-means to initialize the GMM algorithm. In addition, the z-scored version of the desired output is used as the input of this algorithm to prevent data magnitudes from affecting the learned GMM model. Algorithm 5 demonstrates the ADAM algorithm with weight decay. In this algorithm, $\beta_1$ and $\beta_2$ are exponential decay rates for the moment estimates, $\varepsilon$ is a small constant, $\alpha$ is a step size parameter, and $\lambda$ is the regularization hyper-parameter. The choice of these parameters is explained in Section III.