EEG Mental Recognition Based on RKHS Learning and Source Dictionary Regularized RKHS Subspace Learning

This article studies Electroencephalogram (EEG) mental recognition. Because the human brain is very complex and EEG signals are strongly affected by the environment, EEG mental recognition can be treated as a domain adaptation problem. Our main work is as follows: (1) At present, most domain adaptation methods learn only a linear subspace of a Reproducing Kernel Hilbert Space (RKHS), not the RKHS itself. Given the complexity and nonlinearity of EEG mental recognition, we propose an EEG mental recognition algorithm based on two-stage learning: RKHS learning followed by RKHS subspace learning. The source dictionary regularized RKHS subspace learning we proposed applies to EEG mental recognition and outperforms pure RKHS subspace learning, but not by enough. To obtain satisfactory results, we learn the RKHS itself before learning the RKHS subspace. (2) According to the Moore-Aronszajn theorem, an RKHS is uniquely generated by a kernel function. Existing RKHSs are rarely learnable, because it is difficult to find a kernel function that can be learned and optimized. In RKHS learning, this paper uses a learnable kernel function that we previously published, and this kernel function is easy to optimize. (3) In RKHS subspace learning, most existing methods adopt the Maximum Mean Discrepancy (MMD) criterion, but MMD cannot make the spatial distributions of the same category of source and target domain data overlap as much as possible, and the labels of the target domain data are unknown. To solve this problem, this paper proposes an RKHS subspace learning framework based on source domain dictionary regularization. Experimental results on the brain-computer interface international competition dataset (BCI Competition IV 2a) show that the proposed algorithm outperforms five other advanced domain adaptation algorithms.


I. INTRODUCTION
Transfer learning is a machine learning methodology that uses existing knowledge to solve problems in different but related domains or tasks [2], [3]. Domain and task [1] are its two most critical concepts. Data samples with the same feature space and marginal distribution can be treated as belonging to the same domain, and two tasks are the same when their label spaces and posterior conditional probability distributions are consistent. Domain adaptation is a kind of transfer learning that transfers data features. It studies how to use labeled source domain data and prior knowledge of the target domain to reliably learn and complete tasks in the target domain when the probability distributions of the source and target domains are different but related. (The associate editor coordinating the review of this manuscript and approving it for publication was Sung Chan Jun.)
Domain adaptation methods can be divided into semi-supervised [7]-[9] and unsupervised [10]-[14] approaches. To identify the subspace structure of noisy data, Liu et al. [16], [17] proposed low-rank representation (LRR), and Jhuo et al. [18] proposed robust domain adaptation based on LRR. Shekhar et al. [23], [24] represent the source domain and target domain data through a shared dictionary in a latent subspace. Domain-specific dictionary learning [25], [26] learns a dictionary for each domain and then represents each domain with domain-specific or domain-common representation coefficients. However, none of the above methods brings the source domain and target domain data close to each other by category, which distinguishes them from the method proposed in this article. MMD is one of the most commonly used criteria for domain adaptation. MMD essentially uses the first-order origin moments of two random variables to measure their similarity, which brings great computational convenience. But MMD also has some flaws, so many researchers add various regularizations on top of MMD to meet the needs of different application scenarios, as in [15], [33], and [34]. In addition, other domain adaptation criteria have been proposed, but such algorithms are more complex. Li and Zhang [9] proposed a covariance criterion, which minimizes the difference between the variances of the source and target domains so that the two domains approach the same distribution. Wang et al. [5] proposed that the distribution of the target domain projected to the subspace can be made to approximate the source domain through a transformation matrix. BCI is a direct communication and control channel established between the human brain and a computer. (VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
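The empirical MMD at the heart of these methods can be computed from kernel evaluations alone. Below is a minimal NumPy sketch (our illustration, not code from any of the cited works), using an RBF kernel and synthetic Gaussian data; a distribution shift should enlarge the statistic:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # Pairwise RBF kernel matrix between the rows of A and the rows of B.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd2(Xs, Xt, gamma=0.5):
    # Biased estimate of squared MMD: ||mean phi(Xs) - mean phi(Xt)||^2 in RKHS.
    return (rbf_kernel(Xs, Xs, gamma).mean()
            + rbf_kernel(Xt, Xt, gamma).mean()
            - 2.0 * rbf_kernel(Xs, Xt, gamma).mean())

rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(100, 4))       # source domain samples
Xt_near = rng.normal(0.0, 1.0, size=(100, 4))  # target: same distribution
Xt_far = rng.normal(3.0, 1.0, size=(100, 4))   # target: shifted distribution
d_near = mmd2(Xs, Xt_near)
d_far = mmd2(Xs, Xt_far)
```

The shifted target yields a much larger MMD value, which is precisely the quantity MMD-based domain adaptation methods try to reduce.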
Through a BCI, users can directly express intentions or control equipment with the brain, without language or body movements [19], [20]. Non-invasive BCI mainly records brain electrical activity through an EEG acquisition device [27]; the purpose is to detect and classify specific EEG patterns. The EEG recognition task is a classification task. Basic machine learning classifiers include SVM, KNN, etc., but these algorithms do not consider the structure and distribution of the data. To use the category information of the data labels, one can apply the LDA algorithm; to use the structural information of the data, one can apply the LE algorithm. But these are not the most suitable for EEG signal data. To solve the problem of EEG signal classification, the common spatial pattern (CSP) algorithm was proposed [28]. However, Kam et al. argued that CSP does not consider the temporal information of the EEG signal, and to solve this problem they proposed the Time-Dependent Common Spatial Patterns (TDCSP) algorithm [29]. The algorithm assigns weights on each time band to account for the distribution of ERD/ERS in the time and frequency domains, and uses the selected features to construct a separate classifier on each time band. Jiao et al. [21] proposed multi-scale optimization of spatial patterns (MSO), which optimizes the filter band through multi-view learning in CSP; through L2,1-norm regularization, sparse optimization based on multi-view learning captures the shared information of multiple related spatial patterns. Zhang et al. [22] added Laplacian regularization on top of MSO and proposed subclass relation regularized multi-task learning (srMTL). The algorithm effectively reveals and exploits the internal distribution structure of the data to obtain more accurate MI-related EEG classification.
Later, researchers discovered that the probability distributions of EEG signal data are different but related. If the BCI recognition problem is approached from the domain adaptation perspective, the classification accuracy can be improved. To solve the transfer learning problem in EEG-based brain-computer interface classification, Zanini et al. [30] proposed a Riemannian geometry framework and a classifier for BCI. They represent the data by the spatial covariance matrix of the EEG signal, and use Riemannian geometry on symmetric positive definite (SPD) manifolds to perform an affine transformation on the covariance matrix of each session/subject, which makes data from different sessions/subjects comparable. Then, to improve classification accuracy, the authors also proposed a probabilistic classifier to model the class probability distribution, using the mixture of Riemannian Gaussian distributions proposed by Said et al. [31]. Transfer Component Analysis (TCA), proposed by Pan et al. [32], also learns a low-dimensional RKHS subspace that reduces the distribution differences while maintaining the internal data structure. Semi-Supervised Transfer Component Analysis (SSTCA) [32] builds on TCA by maximizing the correlation between data and label information and preserving locality. Integration of Global and Local Metrics for Domain Adaptation Learning (IGLDA) [33] introduces category information to minimize the intra-class distance of the projected data. The Transfer Independently Together (TIT) algorithm [34] requires the target domain data in RKHS to stay as close as possible to its reconstruction from the subspace orthogonal basis after dimensionality reduction. The Guide Subspace Learning (GSL) algorithm [35] uses two projection matrices to map the two domains' data into different subspaces so that the projected data approach each other.
The main contributions of this paper are as follows: 1) At present, most domain adaptation methods learn only a linear subspace of RKHS, not the RKHS itself. Given the complexity and nonlinearity of EEG mental recognition, this paper proposes an EEG mental recognition algorithm based on two-stage learning: RKHS learning and RKHS subspace learning. The source dictionary regularized RKHS subspace learning proposed in our previous paper [43] applies to EEG mental recognition and outperforms pure RKHS subspace learning, but not by enough. To obtain satisfactory results, we learn the RKHS itself before learning the RKHS subspace. Compared with single-stage learning, two-stage learning brings the marginal distributions of the source and target domain data closer. 2) An RKHS can be uniquely generated by a kernel function.
Existing RKHSs are rarely learnable, because it is difficult to find a kernel function that is both learnable and easy to optimize. In RKHS learning, this paper uses a learnable kernel function that we previously published; this kernel function is easy to optimize and can be used to generate a learnable RKHS. The subspace projection then transforms the data into a subspace of the RKHS to minimize the distribution difference between domains, and the source domain dictionary regularization distributes the target domain data around the most linearly correlated source domain data, which indirectly expresses the requirement that the source and target domain data coincide in spatial distribution by category.
3) In RKHS subspace learning, most existing methods use the MMD criterion, but it cannot make the spatial distributions of the same category of source and target domain data overlap, and the labels of the target domain data are unknown. We therefore adopt the source domain dictionary regularization proposed in our previous work [43] and propose an RKHS subspace learning algorithm based on it. On top of MMD, the algorithm requires the target domain data to be distributed around the source domain data with the strongest linear correlation. To our knowledge, this algorithm has not been reported in similar literature. RKHS learning minimizes the difference between the marginal distributions of the source and target domains, while RKHS subspace learning brings them still closer on that basis. Moreover, in RKHS subspace learning, the source and target domains approach each other according to the spatial distribution of each category, rather than the overall spatial distribution.
The rest of the article is organized as follows. Section II briefly introduces the basic mathematical theories related to this article, including Reproducing Kernel Hilbert Spaces (RKHS) and dictionary learning (DL). Section III introduces in detail the EEG mental recognition algorithm based on RKHS learning and source domain dictionary regularized RKHS subspace learning. Section IV briefly introduces the five comparison algorithms and the theoretical differences between the proposed algorithm and these algorithms. Section V reports a series of experiments on the BCI Competition IV 2a dataset and comparisons with five state-of-the-art algorithms to verify the effectiveness and practicability of our algorithm. Finally, Section VI summarizes our work.

II. BASIC MATHEMATICAL THEORIES

A. NOTATIONS
In this paper, X_s and X_t denote the set of source domain data and the set of target domain data, respectively. X = [X_s, X_t] denotes the combined set of source and target domain data, and y_s and y_t denote the source domain data and the target domain data projected to the subspace, respectively (see Table 1).

B. RKHS AND KERNEL FUNCTIONS
A Hilbert space is a complete inner product space, and a Reproducing Kernel Hilbert Space (RKHS) is a special Hilbert space. Let H be a Hilbert space composed of functions defined on a set Ω and satisfying certain conditions (such as square integrability); that is, each f ∈ H is a function f: Ω → R. If there exists a function k: Ω × Ω → R such that (1) for any x ∈ Ω, k(·, x) ∈ H; and (2) for any x ∈ Ω and any f ∈ H, f(x) = ⟨f, k(·, x)⟩, where ⟨·, ·⟩ is the inner product of H; then H is called an RKHS and k is the reproducing kernel of H [36].
The reproducing kernel has properties such as symmetry, positive semi-definiteness, and uniqueness. Using the reproducing kernel k, a transformation can be defined: ϕ: Ω → H, where for any x ∈ Ω, ϕ(x) = k(·, x) ∈ H. Using the properties of the reproducing kernel, it can be proved that for any x, y ∈ Ω, k(x, y) = ⟨ϕ(x), ϕ(y)⟩. According to the Moore-Aronszajn theorem, an RKHS can be uniquely generated by a kernel function. The kernel function is defined as follows [37]: k: Ω × Ω → R such that (1) symmetry: for any x, y ∈ Ω, k(x, y) = k(y, x); and (2) positive definiteness: for any finite set of elements {x_1, ..., x_N} ⊆ Ω, the matrix K = (k(x_i, x_j))_{N×N} is positive definite. The kernel function is not the same concept as the reproducing kernel: the reproducing kernel relies on the definition of a Hilbert space, whereas the kernel function is defined independently.
The process by which a kernel function generates an RKHS is as follows: 1) Generate a linear space from the kernel function: H_k = span{k(·, x): x ∈ Ω}, i.e., the set of finite linear combinations Σ_i a_i k(·, x_i). 2) Define an inner product on H_k: for f = Σ_i a_i k(·, x_i) and g = Σ_j b_j k(·, y_j), let ⟨f, g⟩ = Σ_i Σ_j a_i b_j k(x_i, y_j). 3) Complete H_k to obtain its completion, denoted H̄_k; then H̄_k is an RKHS and the kernel function k is exactly the reproducing kernel of H̄_k. Since a kernel function generates only one RKHS, learning an RKHS amounts to learning a kernel function.
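These Mercer conditions can be checked numerically. The sketch below is our own illustration (assuming a Gaussian kernel, which is not mandated by the text): it builds a Gram matrix from random points and verifies symmetry and positive definiteness.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # A standard kernel function k: Omega x Omega -> R.
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))                    # 8 sample points in Omega
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

is_symmetric = np.allclose(K, K.T)
min_eig = np.linalg.eigvalsh(K).min()          # > 0 for distinct points
```

The smallest eigenvalue being positive confirms the positive definiteness property required of the matrix K in the definition above.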

C. DICTIONARY LEARNING
Let X = {x_1, ..., x_N} be a set of samples, from which a suitable dictionary {d_1, ..., d_L} is to be found. The objective function of dictionary learning can be expressed as minimizing, over the dictionary and the coefficients, Σ_{i=1}^N ||x_i − Σ_{l=1}^L α_il d_l||² plus a sparsity penalty, where the matrix α = (α_il)_{N×L} is called the sparse coding matrix.
After the dictionary {d_1, ..., d_L} has been learned, each data point x can be approximately represented by this dictionary. The function of sparse coding is to drive as many as possible of the coefficients representing x over {d_1, ..., d_L} toward 0. The objective function of sparse coding can be expressed as min_α ||x − Σ_{l=1}^L α_l d_l||² + λ sparse(α), where α = [α_1, ..., α_L]^T and sparse(α) is the sparse regularization term that pushes the components toward zero; the L1 norm is usually used for this term, and the resulting α is the feature vector after sparse coding. Without the sparse(α) term, the problem reduces to a subspace problem. If {d_1, ..., d_L} are mutually orthogonal unit vectors, the coefficients reduce to inner products, α_l = ⟨x, d_l⟩.
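As a concrete illustration of L1-regularized sparse coding, the following sketch solves the Lasso problem with ISTA (iterative soft thresholding). The dictionary and signal are synthetic; this is a generic illustration, not the paper's solver.

```python
import numpy as np

def sparse_code(D, x, lam=0.05, n_iter=200):
    # ISTA: minimize 0.5 * ||x - D a||^2 + lam * ||a||_1 by gradient steps
    # followed by soft thresholding.
    L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - D.T @ (D @ a - x) / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return a

rng = np.random.default_rng(2)
D = rng.normal(size=(10, 30))
D /= np.linalg.norm(D, axis=0)                 # unit-norm dictionary atoms
a_true = np.zeros(30)
a_true[[3, 17]] = [1.5, -2.0]                  # a 2-sparse ground truth
x = D @ a_true
a_hat = sparse_code(D, x)
```

The recovered coefficient vector reconstructs x closely while most of its components are driven to zero, which is exactly the behavior the sparse(α) term is meant to induce.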

D. BCI DATA
A typical brain-computer interface (BCI) experiment includes several trials, in which EEG signals are used to infer which task the subject is performing. A typical BCI data set is composed of a set of coupled pairs {(X_i, l_i)}: the recordings come from d electrodes, and each of the N trials consists of T time samples.
where X_i ∈ R^{d×T} is the i-th recorded EEG trial and l_i is its label (there are C classes). Different BCI paradigms involve different cognitive tasks and apply different filters to the recorded signals. Generally speaking, the distributions of the collected experimental data differ greatly due to differing experimental conditions, such as the experimental session and the subject. Therefore, a classifier trained on labeled training data performs poorly on the test data.

III. EEG MENTAL RECOGNITION BASED ON RKHS LEARNING AND SOURCE DICTIONARY REGULARIZED RKHS SUBSPACE LEARNING
When subjects perform the corresponding motor imagery tasks, they may be affected by the environment or by their own state, so the same motor imagery task can produce different EEG signals on different days. Due to this instability of EEG signals, the problem of EEG mental recognition can be treated as a domain adaptation problem. At present, most domain adaptation methods learn only a linear subspace of RKHS, not the RKHS itself. In view of the complexity and nonlinearity of EEG mental recognition, this paper proposes an EEG mental recognition algorithm based on a two-stage learning framework, called EEG Mental Recognition Based on RKHS Learning and Source Dictionary Regularized RKHS Subspace Learning (EEG-KLSD).

A. DOMAIN ADAPTATION BASED ON RKHS LEARNING
1) SAMPLE-DEPENDENT AND LEARNABLE KERNEL FUNCTIONS
As mentioned in Section II.B, an RKHS is uniquely generated by its kernel function, so RKHS learning can be transformed into kernel function learning. This paper uses a learnable kernel function framework based on sample dependence. Let k(x, y) be a symmetric positive definite kernel function defined on Ω × Ω, and let k_0(x, y) be an arbitrary binary function. Given a set of learning samples Z = {z_1, ..., z_P} ⊆ Ω, let k_Z(x) = (k_0(x, z_1), ..., k_0(x, z_P))^T and define a new kernel function as follows [39]: for any x, y ∈ Ω, k̃(x, y) = k(x, y) + k_Z(x)^T M k_Z(y), where M is a symmetric positive semi-definite parameter matrix. It can be proved that the kernel function k̃ is symmetric and positive semi-definite. Let {u_1, ..., u_Q} ⊆ Ω be a set of Q samples. The kernel matrix of this sample set with respect to k̃ is K̃ = K + K_UZ M K_UZ^T, where K = (k(u_i, u_j))_{Q×Q} and K_UZ = (k_0(u_i, z_j))_{Q×P}. Since k(x, y) is a symmetric positive definite kernel function, the kernel matrix K is symmetric positive definite, and M is a symmetric positive semi-definite matrix.
Therefore K̃^T = K^T + K_UZ M^T K_UZ^T = K̃, that is, k̃ is symmetric. For any nonzero vector y, y^T K̃ y = y^T K y + (K_UZ^T y)^T M (K_UZ^T y). By the positive definiteness of K and the positive semi-definiteness of M, y^T K y > 0 and (K_UZ^T y)^T M (K_UZ^T y) ≥ 0, so y^T K̃ y > 0; that is, K̃ is positive definite. In summary, k̃ satisfies Mercer's condition, so it is a valid kernel function.
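The construction of k̃ and the positive definiteness of its kernel matrix can be illustrated numerically. In the sketch below, the base kernel, the binary function k_0 (reused Gaussian), the learning samples Z, and the PSD matrix M are all synthetic stand-ins, not the paper's learned quantities:

```python
import numpy as np

def base_kernel(x, y, sigma=1.0):
    # Base symmetric positive definite kernel k (here Gaussian).
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def k_vec(x, Z):
    # k_Z(x): evaluations of the binary function k_0 at the learning samples
    # (we simply reuse the Gaussian as k_0 for this illustration).
    return np.array([base_kernel(x, z) for z in Z])

def learned_kernel(x, y, Z, M):
    # k~(x, y) = k(x, y) + k_Z(x)^T M k_Z(y), with M symmetric PSD.
    return base_kernel(x, y) + k_vec(x, Z) @ M @ k_vec(y, Z)

rng = np.random.default_rng(3)
Z = rng.normal(size=(5, 3))                    # learning samples
A = rng.normal(size=(5, 5))
M = A @ A.T                                    # symmetric positive semi-definite
U = rng.normal(size=(8, 3))                    # sample set {u_1, ..., u_Q}
Kt = np.array([[learned_kernel(ui, uj, Z, M) for uj in U] for ui in U])

is_symmetric = np.allclose(Kt, Kt.T)
min_eig = np.linalg.eigvalsh(Kt).min()
```

Because the added term is a PSD quadratic form, the smallest eigenvalue of K̃ cannot fall below that of K, so the learned kernel matrix stays positive definite for any PSD choice of M.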

2) RKHS LEARNING BASED ON SAMPLE-DEPENDENT AND LEARNABLE KERNEL FUNCTIONS
Suppose a source domain data set X_s and a target domain data set X_t are given in the original data space Ω, where X_s is labeled and the labels of the unlabeled target domain data X_t need to be identified using X_s. However, the distributions of X_s and X_t are different, and classifying directly on Ω will inevitably cause large errors. According to Section II.B, the reproducing kernel k defines a transformation ϕ: Ω → H with ϕ(x) = k(·, x) ∈ H for any x ∈ Ω. After mapping the source and target domain data to the RKHS, the source domain data can be expressed as ϕ(X_s) = {ϕ(x_1^s), ..., ϕ(x_{N_s}^s)} ⊆ H and the target domain data as ϕ(X_t) = {ϕ(x_1^t), ..., ϕ(x_{N_t}^t)} ⊆ H. We use the MMD criterion to measure the squared distance between the means of the two data sets in the RKHS: MMD² = ||(1/N_s) Σ_{i=1}^{N_s} ϕ(x_i^s) − (1/N_t) Σ_{j=1}^{N_t} ϕ(x_j^t)||²_H. Substituting the learnable kernel function framework into this formula yields the RKHS learning model obj(M) based on the MMD criterion, in which the MMD term is expressed through the kernel matrix of k̃ and a regularization term on M weighted by β is added. To satisfy the constraint that the parameter M be symmetric positive semi-definite and to facilitate the solution, this paper requires the solution set to be symmetric positive definite; symmetric positive definite matrices form a Riemannian manifold, and functions on a Riemannian manifold can be optimized by the Riemannian conjugate gradient method, which is used here to optimize M. Let E be a function on the Riemannian manifold, and denote the Euclidean gradient of E at x by ∂_x E.
According to [38], for the SPD manifold the Riemannian gradient of E at x is obtained from the Euclidean gradient as grad E(x) = x sym(∂_x E) x, where sym(·) denotes symmetrization. Using the objective function obj(M) and its Riemannian gradient on the SPD manifold, this paper iteratively optimizes M with the conjugate gradient method to select the optimal kernel function k̃.
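A single Riemannian descent step on the SPD manifold can be sketched as follows, assuming the affine-invariant metric and a standard second-order retraction that provably keeps iterates SPD. The toy objective here (squared distance to a target matrix) is our own stand-in, not the paper's obj(M):

```python
import numpy as np

def riemannian_grad(M, egrad):
    # Affine-invariant metric on SPD matrices: grad E(M) = M sym(dE) M.
    sym = 0.5 * (egrad + egrad.T)
    return M @ sym @ M

def retract_spd(M, xi):
    # Second-order retraction M + xi + 0.5 * xi M^{-1} xi; stays SPD for
    # any symmetric tangent direction xi.
    return M + xi + 0.5 * xi @ np.linalg.solve(M, xi)

rng = np.random.default_rng(4)
B = rng.normal(size=(4, 4))
T = B @ B.T + 4.0 * np.eye(4)      # target SPD matrix for a toy objective

def E(M):
    # Toy objective: squared Frobenius distance to T (not the paper's obj(M)).
    return np.linalg.norm(M - T, 'fro') ** 2

M = np.eye(4)
E0 = E(M)
for _ in range(5):                 # a few small descent steps
    g = riemannian_grad(M, 2.0 * (M - T))
    M = retract_spd(M, -0.001 * g)
min_eig = np.linalg.eigvalsh(M).min()
```

The retraction is what enforces the SPD constraint mentioned in the text: every iterate remains a valid symmetric positive definite parameter, while the objective decreases.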

B. DOMAIN ADAPTATION BASED ON SOURCE DICTIONARY REGULARIZED RKHS SUBSPACE LEARNING
This section presents the structure of the RKHS subspace, the constraints it must satisfy, and the representation of the data in Euclidean space induced by the subspace. The RKHS subspace can be chosen according to the specific machine learning application; that is, the specific choice is open.

1) THE FRAMEWORK OF RKHS SUBSPACE LEARNING
In this section, we will introduce the construction and constraints of the RKHS subspace, and the data representation of the RKHS subspace.
Let (H, ⟨·, ·⟩) be the RKHS on the data space Ω, and use the reproducing kernel k of H to define the transformation ϕ: Ω → H, ϕ(x) = k(·, x) for any x ∈ Ω. Given a dataset X = {x_1, ..., x_N} in Ω, use ϕ to transform X into H: ϕ(X) = {ϕ(x_1), ..., ϕ(x_N)}. We record the kernel matrix K = (k(x_i, x_j))_{N×N}, where K_iCol denotes the i-th column of K, i = 1, ..., N. We now use ϕ(X) to construct a subspace of H. Let θ_j = Σ_{i=1}^N w_ij ϕ(x_i), j = 1, ..., d, and denote W = (w_ij)_{N×d}. We want {θ_1, ..., θ_d} to constitute an orthonormal basis of span θ; then for all i, j we need ⟨θ_i, θ_j⟩ = δ_ij, which is equivalent to W^T K W = I_d. Obviously, the subspace span θ is d-dimensional and completely determined by the coefficient matrix W, and W must satisfy the above constraint.
Based on the orthonormal basis given above, this section gives a data representation based on the RKHS subspace.
According to the projection theorem, if {θ_1, ..., θ_d} is an orthonormal basis of the subspace span θ, the coordinate vector of ϕ(x_i) projected onto span θ is y_i = (⟨ϕ(x_i), θ_1⟩, ..., ⟨ϕ(x_i), θ_d⟩)^T = W^T K_iCol, for i = 1, ..., N.
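The orthonormality constraint and the projection formula can be checked on a toy Gram matrix. In this sketch (our illustration), W is built from eigenvectors of K purely so that the constraint W^T K W = I_d holds by construction:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(12, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)                          # Gram matrix K of the data

# Build a W whose columns define an orthonormal basis of a d-dim subspace:
# with K = V diag(s) V^T, the columns v_j / sqrt(s_j) satisfy W^T K W = I_d.
s, V = np.linalg.eigh(K)                       # ascending eigenvalues
d = 4
W = V[:, -d:] / np.sqrt(s[-d:])                # use the top-d directions

Y = W.T @ K                                    # y_i = W^T K_iCol for every i
orthonormality = W.T @ K @ W                   # should equal I_d
```

Each column of Y is the d-dimensional coordinate vector of one mapped data point ϕ(x_i) in the subspace.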

2) DOMAIN ADAPTATION BASED ON RKHS SUBSPACE LEARNING
We aim to study the domain adaptation problem. Suppose a source domain dataset X_s and a target domain dataset X_t are given in the original data space, where X_s is labeled. As before, the labels of the unlabeled target domain data X_t need to be identified using X_s, and the distributions of X_s and X_t are different.
This section focuses on the domain adaptation algorithm based on RKHS subspace learning. First, the data of both domains are transformed from the original space to the RKHS subspace; then, subspace learning makes the distributions of the source and target domain data converge in the RKHS subspace. Let Y_s = W^T K_s and Y_t = W^T K_t be the projected source and target domain data, where the unknown matrix W represents the RKHS subspace and K_s and K_t collect the kernel columns of the source and target domain data. Y_s and Y_t are made to converge in distribution in the Euclidean space R^d by learning W. We measure the convergence of the distributions by the Maximum Mean Discrepancy (MMD) criterion: MMD(Y_s, Y_t) = ||(1/N_s) Σ_{i=1}^{N_s} y_i^s − (1/N_t) Σ_{j=1}^{N_t} y_j^t||².
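This subspace MMD can be rewritten as a trace expression tr(W^T K L K W), where L is the usual MMD coefficient matrix built from 1/N_s and −1/N_t entries. The identity can be verified numerically (synthetic data and an arbitrary W, purely for illustration):

```python
import numpy as np

Ns, Nt = 6, 5
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (Ns, 3)), rng.normal(1, 1, (Nt, 3))])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)                          # kernel matrix of all N points

# MMD coefficient matrix L = e e^T with e_i = 1/Ns (source), -1/Nt (target).
e = np.concatenate([np.full(Ns, 1.0 / Ns), np.full(Nt, -1.0 / Nt)])
L = np.outer(e, e)

W = rng.normal(size=(Ns + Nt, 2))              # an arbitrary subspace matrix
Y = W.T @ K                                    # projected data, one column per point
mean_gap = Y[:, :Ns].mean(axis=1) - Y[:, Ns:].mean(axis=1)
direct = np.linalg.norm(mean_gap) ** 2
trace_form = np.trace(W.T @ K @ L @ K @ W)
```

The trace form is what makes the W-update tractable: it turns the mean-difference criterion into a quadratic form in W.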

3) DICTIONARY LEARNING IN RKHS SUBSPACES
We hope that the spatial distributions of the same category of data in the source and target domains coincide in the subspace.
Since the categories of the target domain data are unknown, this requirement is difficult to model directly. We adopt a source domain dictionary regularization term so that each target domain data point is distributed around the source domain data most linearly correlated with it, indirectly meeting the requirement. We use the source domain data as the dictionary and the target domain data as the training samples; the target domain data are linearly represented over the source domain dictionary with coefficients α.
In dictionary learning, every target domain data point can be linearly represented by appropriate source domain data by learning the coding coefficients α. The coding coefficients should be sparse, so L1 regularization is adopted for them. The dictionary learning model is as follows: min_α Σ_j ||y_j^t − Y_s α_j||² + λ ||α_j||_1, where α_j is the coding vector of the j-th projected target point over the projected source dictionary Y_s.

4) DOMAIN ADAPTATION BASED ON SOURCE DICTIONARY REGULARIZED RKHS SUBSPACE LEARNING (SDRRKHS-DA)
Domain adaptation based on RKHS subspace learning alone only makes the overall spatial distributions of the source and target domain data overlap as much as possible in the subspace, whereas we want the distributions to overlap category by category. Compared with other domain adaptation tasks, mental recognition from EEG is particularly challenging. Therefore, in this section we combine MMD and dictionary learning and propose a subspace learning algorithm based on source domain dictionary regularization. The algorithm requires the target domain data to be distributed around the source domain data with the strongest linear correlation, which indirectly expresses the requirement that the spatial distributions of the source and target domains coincide within each category. The data are transformed from the original space to the RKHS by the reproducing kernel, and further to an RKHS subspace by the subspace projection. For selecting the subspace, MMD minimizes the mean difference to reduce the distribution difference between domains, and together with the source domain dictionary regularization this constitutes a new RKHS subspace learning method. The model learns the relationship between the source and target domains so that the target domain is represented by the most relevant source domain data. In our model, Y_s represents the projected source domain data, Y_t the projected target domain data, W is the subspace projection matrix, and α the coding coefficients of the linear representation mentioned above. The term ||α||_1 controls the sparsity of α, and ||W||² controls the complexity of W. The objective function of the subspace learning part is: arg min over W and α of MMD(Y_s, Y_t) + µ Σ_j ||y_j^t − Y_s α_j||² + λ||α||_1 + η||W||², subject to the subspace constraint W^T K W = I_d.

5) SOLUTION TO SDRRKHS-DA
a: SOLUTION TO W
To update the subspace projection matrix W, we fix the coding coefficients α. The problem then becomes a standard subspace projection problem: a trace minimization in W under the subspace constraint. It can be solved via the Rayleigh quotient: perform the eigenvalue decomposition and take the d eigenvectors corresponding to the first d smallest eigenvalues as the columns of the subspace projection matrix W.
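Trace minimization under a quadratic constraint of this kind is a generalized eigenproblem. The sketch below (our illustration, with generic symmetric matrices A as "cost" and B as "constraint", standing in for the paper's kernel-dependent matrices) whitens by B^{-1/2} and takes the d smallest eigenvectors:

```python
import numpy as np

def smallest_generalized_eigvecs(A, B, d):
    # Minimize tr(W^T A W) subject to W^T B W = I_d: whiten with B^{-1/2},
    # take the d eigenvectors with smallest eigenvalues, and map back.
    s, U = np.linalg.eigh(B)
    B_inv_half = U @ np.diag(1.0 / np.sqrt(s)) @ U.T
    w, V = np.linalg.eigh(B_inv_half @ A @ B_inv_half)   # ascending order
    return B_inv_half @ V[:, :d]

rng = np.random.default_rng(7)
P = rng.normal(size=(6, 6))
A = P @ P.T                                    # symmetric PSD "cost" matrix
Q = rng.normal(size=(6, 6))
B = Q @ Q.T + 6.0 * np.eye(6)                  # SPD "constraint" matrix
W = smallest_generalized_eigvecs(A, B, d=2)
constraint = W.T @ B @ W                       # should equal I_2
```

The returned W satisfies the constraint exactly, and its columns span the directions of smallest generalized Rayleigh quotient, matching the "first d smallest eigenvalues" rule stated above.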

b: SOLUTION TO α
To update the coding coefficients α, we first fix the projection matrix W, and the problem becomes the classical sparse coding problem: for each target point, minimize the reconstruction error over the projected source dictionary plus the L1 penalty. This is a typical Lasso optimization problem, which the SPAMS [40] and CVX [41] toolboxes can solve quickly.

Algorithm 1 EEG-KLSD Algorithm
Input: source domain data X_s and target domain data X_t, kernel k and binary function k_0, parameters µ, λ, η, β
Output: the projection matrix W
1. Optimize M with the Riemannian conjugate gradient method, using the gradient of obj(M) (which involves the kernel matrices and the term 2βM);
2. From M, form the kernel function k̃(x, y) = k(x, y) + k_Z(x)^T M k_Z(y);
3. Compute the matrices K̃ and L, and randomly initialize the coding coefficients α;
4. Compute the matrix Φ from the coding coefficients;
5. Perform eigenvalue decomposition and take the d eigenvectors corresponding to the first d smallest eigenvalues to form W;
6. Solve for the coding coefficients α by mexLasso;
7. Repeat steps 5 and 6 until the loss value is less than the set threshold;
8. Output the projection matrix W.
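Steps 5 and 6 of Algorithm 1 alternate between an eigen-solve for W and a Lasso solve for α. The sketch below mirrors that alternation on a toy Gaussian-kernel problem. It is a simplified stand-in, not the authors' code: ISTA replaces mexLasso, the objective uses only an MMD term, a reconstruction term weighted by µ, and an L1 term weighted by λ, and the matrix Φ is formed from the current coefficients as C C^T with Y_t − Y_s α = W^T K C:

```python
import numpy as np

def soft_threshold(Z, t):
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def update_alpha(Ys, Yt, mu, lam, n_iter=300):
    # Step 6 stand-in: sparse-code projected target points over the projected
    # source dictionary via ISTA on mu*||Yt - Ys A||^2 + lam*||A||_1.
    step = 1.0 / (2.0 * mu * np.linalg.norm(Ys, 2) ** 2 + 1e-12)
    A = np.zeros((Ys.shape[1], Yt.shape[1]))
    for _ in range(n_iter):
        A = soft_threshold(A - step * 2.0 * mu * Ys.T @ (Ys @ A - Yt), lam * step)
    return A

def update_W(K, L, Phi, mu, d):
    # Step 5: minimize tr(W^T K (L + mu*Phi) K W) s.t. W^T K W = I_d via the
    # d smallest generalized eigenvectors (small jitter for stability).
    s, U = np.linalg.eigh(K + 1e-8 * np.eye(K.shape[0]))
    Kih = U @ np.diag(1.0 / np.sqrt(s)) @ U.T          # ~ K^{-1/2}
    _, V = np.linalg.eigh(Kih @ (K @ (L + mu * Phi) @ K) @ Kih)
    return Kih @ V[:, :d]

rng = np.random.default_rng(8)
Ns, Nt, d, mu, lam = 8, 6, 2, 1.0, 0.05
X = np.vstack([rng.normal(0, 1, (Ns, 3)), rng.normal(0.5, 1, (Nt, 3))])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)                                  # toy kernel matrix
e = np.concatenate([np.full(Ns, 1 / Ns), np.full(Nt, -1 / Nt)])
L = np.outer(e, e)                                     # MMD coefficient matrix
A = np.zeros((Ns, Nt))                                 # coding coefficients

losses = []
for _ in range(5):
    C = np.vstack([-A, np.eye(Nt)])                    # Yt - Ys A = W^T K C
    W = update_W(K, L, C @ C.T, mu, d)
    Y = W.T @ K
    Ys, Yt = Y[:, :Ns], Y[:, Ns:]
    A = update_alpha(Ys, Yt, mu, lam)
    losses.append(np.trace(W.T @ K @ L @ K @ W)
                  + mu * np.linalg.norm(Yt - Ys @ A) ** 2
                  + lam * np.abs(A).sum())
```

Because each half-step minimizes the joint objective with the other variable fixed, the recorded loss is non-increasing across iterations, which is the behavior the stopping rule in step 7 relies on.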

C. COMPLEXITY ANALYSIS
We use O_1 and O_2 to denote time complexity and space complexity, where N_s is the number of source domain samples, N_t the number of target domain samples, and N = N_s + N_t. In RKHS learning, we first construct the kernel matrix, with time complexity O_1(N² + N_t N + N_s N) and space complexity O_2(N² + N_t N + N_s N); optimizing the kernel function by Riemannian conjugate gradient costs O_1((4/3)N³) time and O_2(2N² + N) space. If the optimization runs l_1 iterations, the total time complexity is O_1((4l_1/3)N³) and the space complexity is O_2(2l_1 N² + l_1 N). In RKHS subspace learning based on the source domain dictionary, W is updated through SVD decomposition; each SVD costs O_1(N³) time and O_2(N²) space. The sparse matrix α is then updated by mexLasso, with time complexity O_1(N⁴) and space complexity O_2(N²). Φ must also be updated, with time complexity O_1(N_s N_t) and space complexity O_2(N + N_t N). If there are l_2 update iterations, the time complexity is O_1(l_2(N⁴ + N³ + N_s N_t)) and the space complexity is O_2(l_2(2N² + N + N_t N)). Combining RKHS learning and RKHS subspace learning, the overall time complexity is O_1(N² + N_t N + N_s N + ((4l_1 + 3l_2)/3)N³ + l_2(N⁴ + N_s N_t)) and the overall space complexity is O_2((2l_1 + 2l_2 + 1)N² + (l_2 + 1)N_t N + (l_1 + l_2)N + N_s N).

IV. COMPARISON WITH OTHER RELATED STATE-OF-THE-ART ALGORITHMS
In this chapter, we introduce five state-of-the-art algorithms, all of which adopt subspace learning to solve domain adaptation. TCA and SSTCA are among the earliest subspace learning algorithms and achieve good results in domain adaptation. IGLDA, TIT, and GSL were proposed in recent years, and each has its advantages in domain adaptation. This chapter mainly presents the theoretical comparison between our algorithm and the comparison algorithms; the experimental comparison is elaborated in the next chapter.

A. COMPARISON TO TCA
TCA [32] adopts MMD as the domain adaptation criterion: it maps the data into an RKHS by a kernel function and then into an RKHS subspace through the subspace projection matrix W, minimizing the distance between the source and target domain data in the subspace. TCA uses a norm penalty on W to control its complexity, and the constraint W^T K H K W = I_d maximizes the variance of the mapped data, preserving properties useful for classification tasks. The difference between EEG-KLSD and TCA is that EEG-KLSD adopts the two-stage learning framework and the source domain dictionary regularization, so that the spatial distributions of source and target data of the same category are as close as possible in the subspace. In addition, the TCA subspace constraints are derived only from mathematical formulas, while our subspace constraints are derived from the RKHS subspace orthonormal basis.

B. COMPARISON TO SSTCA
Based on TCA, SSTCA [32] adds a manifold regularization term. Its constraint W^T K H K_yy H K W = I uses HSIC to enhance the correlation between labels and data, and the manifold regularization term preserves the geometric structure of the data. The difference between EEG-KLSD and SSTCA is that EEG-KLSD adopts the two-stage learning framework. In subspace learning, SSTCA uses manifold regularization and wants all data mapped into the subspace to be as close as possible, whereas the source domain dictionary regularization incorporates linear discrimination and aims for the spatial distributions of source and target data of the same category to coincide as much as possible in the subspace.

C. COMPARISON TO IGLDA
The IGLDA [33] model is similar to TCA: both adopt MMD as the domain adaptation criterion and µ||W||² to control the complexity of W, and W^T K H K W = I_d likewise maximizes the variance of the mapped data; in the model, c denotes the number of categories and N_l the number of data points in each category. The difference between EEG-KLSD and IGLDA lies not only in the two-stage learning framework but also in the regularization terms. The regularization adopted by IGLDA is similar to the inter-class divergence of the source domain data, but it does not require source domain data of different categories to be as far apart as possible, which can lead to misjudgment of target domain labels. The source domain dictionary regularization incorporates linear discrimination and encourages the target domain data to be distributed around the source domain data linearly correlated with it. In general, the data with the strongest linear correlation belong to the same class, so in theory source domain dictionary regularization can improve classification accuracy.

D. COMPARISON TO TIT
TIT [34] obtains pseudo-labels through multiple iterations to improve the final accuracy. In its model, W = [W_s, W_t], K_t represents the target domain data in the RKHS, and W_t is the dimension-reduction matrix for K_t. Like the models above, TIT also adopts MMD, but it uses the ‖W‖_{2,1} norm in constructing the subspace, which makes the rows of W as sparse as possible. Based on SSTCA, TIT adds a regularization term ‖K_t − W_t W_t^T K_t‖_F^2, which is similar to PCA and makes the projected target domain data as close as possible. If KNN is used as the classifier, this regularization term plays no role. Compared with TIT, EEG-KLSD adopts the two-learning framework and maps both source and target domain data into the same subspace, so that the target domain data are distributed around the linearly correlated source domain data, which is better than TIT in theory.
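The ‖W‖_{2,1} norm mentioned above sums the Euclidean norms of the rows of W; penalizing it drives whole rows to zero, which is why it produces a row-sparse projection. A small numerical illustration (our own toy matrices):

```python
import numpy as np

def l21_norm(W):
    """l2,1 norm: the sum of the l2 norms of W's rows."""
    return np.sum(np.linalg.norm(W, axis=1))

W_dense = np.ones((4, 3))                  # every row active
W_rowsparse = np.zeros((4, 3))
W_rowsparse[0] = [2.0, 2.0, 2.0]           # a single active row
# both matrices have the same Frobenius norm ...
assert np.isclose(np.linalg.norm(W_dense), np.linalg.norm(W_rowsparse))
# ... but the row-sparse one has a smaller l2,1 norm, so minimizing
# the l2,1 norm prefers row-sparse solutions
assert l21_norm(W_rowsparse) < l21_norm(W_dense)
```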

E. COMPARISON TO GSL
In the GSL [35] model, the source and target domain data are mapped to two different subspaces by two different projection matrices; the matrix Z resolves the mismatch between the numbers of source and target samples, so that the projected source and target domain data are closer to each other. GSL expects the learning of the source domain subspace W_s to guide the learning of the target domain subspace W_t by minimizing the Bregman divergence between the two subspaces, and then trains a classifier. Unlike EEG-KLSD, GSL is not trained in the RKHS. The reason for comparing with GSL is that the term ‖W_t^T X_t − W_s^T X_s Z‖_F^2 adopted in GSL is somewhat similar to our source domain dictionary regularization, and GSL has also achieved strong results in domain adaptation.
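To make the role of Z concrete, the sketch below computes only the alignment term ‖W_t^T X_t − W_s^T X_s Z‖_F^2 in isolation, with a least-squares Z; the projection matrices here are random placeholders, and this is not the full GSL objective:

```python
import numpy as np

rng = np.random.default_rng(3)
Xs = rng.normal(size=(10, 80))   # source data: 10 features, 80 samples
Xt = rng.normal(size=(10, 60))   # target data: 60 samples (unequal count)
Ws = rng.normal(size=(10, 5))    # source projection matrix (placeholder)
Wt = rng.normal(size=(10, 5))    # target projection matrix (placeholder)

A = Wt.T @ Xt                    # projected target data, 5 x 60
B = Ws.T @ Xs                    # projected source data, 5 x 80
# Z re-expresses each projected target sample in terms of the projected
# source samples, which sidesteps the unequal sample counts; here we use
# the least-squares optimum for fixed projections
Z = np.linalg.pinv(B) @ A        # 80 x 60
residual = np.linalg.norm(A - B @ Z)
# with full-row-rank B the projected target data are matched exactly
assert residual < 1e-6
```

In GSL, Z and the two projections are optimized jointly; this isolated computation only shows why the term can drive the projected domains together.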

V. EXPERIMENTS
This section performs classification tasks on the BCI Competition IV 2a dataset, compares our algorithm with the TCA, SSTCA, IGLDA, TIT, and GSL algorithms discussed above, and verifies the effectiveness of our algorithm.

A. THE INTRODUCTION OF BCI IV 2a DATASET
The BCI Competition IV 2a dataset contains EEG signals collected from 9 healthy subjects performing four different motor imagery tasks: imagining movements of the left hand, right hand, feet, and tongue. For each subject, two sessions were recorded on two different days; each session contains 288 EEG trials, 72 per category. In each trial, a cue in the form of an arrow pointing left, right, down, or up, corresponding to one of the four categories, prompted the subject to perform the corresponding motor imagery task. EEG signals were recorded over the subject's sensorimotor area through 22 electrodes at a sampling rate of 250 Hz. From the appearance of the cue to the end of the task, the motor imagery lasted 4 seconds. Starting from the cue that prompts the subject to perform the mental task, the processed data are limited to the interval between 0.5 and 2.5 seconds. The EEG signal of each trial is band-pass filtered to the 10-30 Hz band with a fifth-order Butterworth filter. The subjects are identified as A01-A09; the first experiment day is denoted E and the second T, e.g. the data of Subject 1 on the first experiment day is identified as A01E.
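The preprocessing described above (fifth-order 10-30 Hz Butterworth filter, 0.5-2.5 s window after the cue at 250 Hz) can be sketched as follows. This is our own reconstruction: the function name is hypothetical, and the zero-phase `filtfilt` is an assumption; the original pipeline may use causal filtering instead:

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 250  # sampling rate of the 2a dataset (Hz)
# fifth-order Butterworth band-pass, 10-30 Hz
b, a = butter(N=5, Wn=[10, 30], btype="bandpass", fs=FS)

def preprocess_trial(trial, cue_sample=0):
    """Band-pass one trial (channels x samples) to 10-30 Hz and keep
    the 0.5-2.5 s window after the cue."""
    filtered = filtfilt(b, a, trial, axis=1)   # zero-phase filtering (assumed)
    start = cue_sample + int(0.5 * FS)
    stop = cue_sample + int(2.5 * FS)
    return filtered[:, start:stop]

# a synthetic trial: 22 channels, 4 s of motor imagery
trial = np.random.default_rng(4).normal(size=(22, 4 * FS))
seg = preprocess_trial(trial)
assert seg.shape == (22, 500)  # 2 s window at 250 Hz
```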

B. VALIDITY EXPERIMENT
In this section, we verify the validity of the data and the effectiveness of the two innovations in our algorithm. We first verify that the data meet the basic conditions of domain adaptation, and then verify that the probability distributions of the source and target domain data become consistent after processing by our algorithm. For algorithm effectiveness, we split the proposed algorithm into three experiments, namely RKHS learning + MMD, MMD + regularization term, and RKHS learning + regularization term, and compare them with the MMD algorithm and a baseline. The baseline uses PCA for dimensionality reduction in the original space and then SVM for classification. PCA dimensionality reduction is included in the baseline for experimental consistency, because both the proposed algorithm and MMD perform subspace dimensionality reduction.
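The dimensionality-reduction step of the baseline can be sketched with a plain numpy PCA; the function name and dimensions are our own, and the subsequent SVM stage is omitted:

```python
import numpy as np

def pca_reduce(X_train, X_test, d=50):
    """PCA baseline: fit the projection on the training (source) data
    only, then apply the same projection to both domains."""
    mu = X_train.mean(axis=0)
    # principal directions = top right singular vectors of the centered data
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    P = Vt[:d].T                       # features x d projection matrix
    return (X_train - mu) @ P, (X_test - mu) @ P

rng = np.random.default_rng(5)
Xtr = rng.normal(size=(288, 120))      # e.g. 288 source trials, 120 features
Xte = rng.normal(size=(288, 120))      # 288 target trials
Ztr, Zte = pca_reduce(Xtr, Xte, d=50)  # both domains reduced to 50 dimensions
assert Ztr.shape == (288, 50) and Zte.shape == (288, 50)
```

Fitting the projection on the source data alone mirrors the experimental protocol, in which target labels are never used during training.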
In terms of data, we divided each subject's two sessions, recorded on two different days, into a source domain and a target domain, and used t-SNE [42] to visualize the data. We found that the source and target domain data meet the basic requirements of domain adaptation: the two domains are geometrically separated, which shows that their probability distributions are different but related. The distribution of source and target domain data is shown in Figure 2.
We used A01T to A09T as the source domains and A01E to A09E as the target domains, giving 9 experiments in total. In each experiment, all the source domain data serve as the training set and the target domain data as the test set. We run RKHS learning + MMD (RL+MMD), MMD + source domain dictionary regularization (MMD+SDR), EEG-KLSD (the proposed method), MMD, and the baseline. The experiment uses SVM as the classifier: the SVM model is trained on the low-dimensional representation of the training set and its labels, and then predicts on the low-dimensional representation of the test set. The same experiment is repeated 10 times. The parameter settings of EEG-KLSD are µ = 1, λ = 0.1, η = 1, β = 1. The kernel function is uniformly the RBF kernel, and the dimension of all data is reduced to 50.
The bolded entries in Table 2 are the highest classification accuracy in each experiment. From the table, RL+MMD and MMD+SDR achieve higher classification accuracy than the baseline and MMD algorithms. The average classification accuracy of RL+MMD is 11.63% and 3.02% higher than baseline and MMD, respectively. MMD+SDR improves classification accuracy even more markedly: its average classification accuracy is 35.53% and 26.91% higher than baseline and MMD, respectively, which shows that both of our innovations are effective. In addition, the EEG-KLSD algorithm, which combines RL and SDR, achieves higher classification accuracy than RL+MMD and MMD+SDR, each of which uses only one of the innovations: EEG-KLSD is 24.9% and 1.02% higher than RL+MMD and MMD+SDR, respectively, which again shows that the EEG-KLSD algorithm is effective.
Beyond the classification accuracy, the data distribution also shows that our algorithm makes the probability distributions of the source and target domain data consistent. The points in Figure 3 are the data saved after processing by our algorithm and then visualized with t-SNE. The red dots represent the source domain data and the cyan dots the target domain data. From Figure 3 we can see that the distributions of the source and target domain data now overlap, completely unlike Figure 2, where the two domains are fully separated with a clear boundary. This indicates that the probability distributions of the source and target domains are now almost the same; in this case, using the source domain data to recognize the target domain data yields higher accuracy. Figure 3 therefore shows that our algorithm is very effective.

C. EEG CLASSIFICATION
We compare our algorithm with five state-of-the-art algorithms to demonstrate its superiority. We used A01T to A09T as the source domains and A01E to A09E as the target domains, giving 9 experiments in total. In each experiment, all the source domain data serve as the training set and the target domain data as the test set. EEG-KLSD and the comparison algorithms are trained on the training set to obtain the subspace projection; the training and test sets are then projected to obtain the reduced-dimension training and test data. The classifier is an SVM trained on the low-dimensional representation of the training set and its labels, and it predicts on the low-dimensional representation of the test set. The parameter settings are as follows: EEG-KLSD: µ = 1, λ = 0.1, η = 1, β = 1; TCA: µ = 1; SSTCA: µ = 1, λ = 10^-9; IGLDA: µ = 1, λ = 10^-7; TIT: µ = 10^-5, λ = 10^-7, β = 1; GSL: λ = 0.1, β = 4 × 10^9. The kernel function and binary function of EEG-KLSD uniformly use the RBF kernel, as do the kernel functions in the comparison algorithms. The dimension of all data is reduced to 50. The same experiment was repeated 10 times.
In Table 3, the numbers identify the different subjects, and the bolded entries are the highest classification accuracy in each experiment. In addition to our algorithm and the five state-of-the-art algorithms, experiments on the baseline and MMD algorithms are also included; the baseline here is the same as in the previous section. From Table 3, the five state-of-the-art algorithms clearly improve on the baseline and MMD algorithms: compared with the baseline, they gain 22.48%, 22.88%, 23.08%, 28.67%, and 29.26%, respectively, and compared with MMD they gain 13.87%, 14.27%, 14.47%, 20.06%, and 20.65%, respectively. Our algorithm in turn improves clearly on the five state-of-the-art algorithms, which shows its superiority: it raises their average accuracy by 14.08%, 13.69%, 13.48%, 7.89%, and 7.3%, respectively, an improvement rate of 12.21% to 26.57%, which shows that the gain in classification accuracy is very significant.
The bolded entries in Tables 4 and 5 are the highest classification accuracy in each experiment. Tables 4 and 5 show that our algorithm achieves higher classification accuracy across different dimensions than the TCA, SSTCA, IGLDA, TIT, and GSL algorithms. On the A01 data set, the average classification accuracy of EEG-KLSD is 5.19%, 4.91%, 5.06%, 3.22%, and 4.07% higher than the five comparison algorithms, respectively, with increase rates of 6.96%, 6.56%, 6.77%, 4.21%, and 5.38%. On the A02 data set, the average classification accuracy of EEG-KLSD is 29.41%, 28.92%, 29.68%, 8.26%, and 8.40% higher, respectively, with increase rates of 100.77%, 97.43%, 102.67%, 16.42%, and 16.74%. EEG-KLSD thus brings very large improvements on some data sets, reaching twice the classification accuracy of the comparison algorithms.
Figures 4 and 5 show that the classification accuracy of the five comparison algorithms at dimensions 10 and 20 is lower than at dimension 30 or above. The reduced dimension may be too low, weakening the effect of the comparison algorithms' regularization terms. The TCA algorithm, for example, maximizes the variance during subspace dimensionality reduction so that the data can be separated, but when the dimension is too low, too much information is lost, which may leave the data inseparable and degrade classification. The source domain dictionary regularization used by our algorithm makes good use of the information in the source domain data, so even with a small reduced dimension the classification accuracy is not seriously affected. Figures 4 and 5 also show that the classification accuracy of EEG-KLSD is very stable across dimensions, without large fluctuations, whereas that of the comparison algorithms fluctuates considerably. For researchers who need to reduce the dimensionality below 30, our algorithm is therefore the better choice.
Next, we explore the influence of the parameters µ and λ on the EEG-KLSD algorithm. This experiment uses the source and target domain data of A02, with the subspace dimension set to 50. We first fix η = 1 and set µ and λ to the five values 10, 1, 0.1, 0.01, and 0.001, testing the classification accuracy of EEG-KLSD for each combination. The parameter µ controls the complexity of the subspace projection matrix W, and the parameter λ controls the sparsity of the coding coefficients α: generally, the larger µ, the lower the complexity of W, and the larger λ, the sparser α.
The effect of µ and λ on the classification accuracy of EEG-KLSD is shown in Table 6. For every value of µ, EEG-KLSD performs best at λ = 0.1. With µ fixed, as λ decreases from 10 to 0.001, the classification accuracy of EEG-KLSD first increases and then decreases; a sparser coding coefficient α therefore does not necessarily yield higher classification accuracy. The reason is that when λ is large, the coding coefficients are too sparse, a large amount of useful information is lost when the source domain dictionary linearly represents the target domain data, and the target domain data cannot be faithfully represented by the dictionary. When λ is small, the coding coefficients are too redundant, and source domain atoms unrelated to the target domain data are also used in the representation, causing the model to learn wrong information and reducing the classification accuracy. With λ fixed, as µ decreases from 10 to 0.001, the change in classification accuracy shows no particularly obvious regularity, but in general the accuracy of EEG-KLSD is higher when µ is between 0.01 and 1, making this range a good choice for µ.

VI. CONCLUSION
In order to solve the mismatch between the marginal probability distributions of the source and target domains in EEG mental recognition, we propose an EEG mental recognition algorithm based on RKHS learning and source domain dictionary regularized subspace learning. Our main contributions are as follows: (1) At present, most domain adaptation learning only learns a linear subspace of the RKHS, and the RKHS itself is not learned. In view of the complexity and nonlinearity of EEG mental recognition, this paper proposes an EEG mental recognition algorithm based on two learning: RKHS learning and RKHS subspace learning. Compared with single learning, two learning brings the marginal distributions of the source and target domain data closer. (2) According to the Moore-Aronszajn theorem, an RKHS is uniquely generated by its kernel function. Existing RKHSs are rarely learnable, because it is difficult to find a kernel function that is both learnable and easy to optimize. For RKHS learning, this paper uses a learnable, easily optimized kernel function that we previously published, which can be used to generate a learnable RKHS. (3) In RKHS subspace learning, most existing methods use MMD, but MMD cannot make the spatial distributions of same-category source and target domain data overlap as much as possible, and the labels of the target domain data are unknown. We therefore propose an RKHS subspace learning algorithm based on source domain dictionary regularization. On top of MMD, the algorithm requires the target domain data to be distributed around the source domain data with the strongest linear correlation, which indirectly enforces that the spatial distributions of same-category source and target domain data be as consistent as possible.
Extensive experiments show that the source and target domain data processed by our algorithm overlap geometrically. Moreover, our algorithm far exceeds several state-of-the-art algorithms in classification accuracy, which fully demonstrates its effectiveness.
MMD is a commonly used domain adaptation criterion, but it has blind spots: under a linear kernel, two zero-mean random variables can have different distributions and still yield an MMD of zero. Therefore, finding better criteria and adding various regularizations are important research directions for domain adaptation. In our proposed algorithm, a dictionary learning regularization term is exploited for the BCI application; other regularization methods can be explored in theory and practice in the future.
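The caveat above can be checked numerically: with a linear kernel, the squared MMD reduces to the squared distance between sample means, so two clearly different zero-mean distributions are indistinguishable. A minimal sketch with our own toy distributions:

```python
import numpy as np

def linear_mmd2(Xs, Xt):
    """Squared MMD under a linear kernel: just the squared distance
    between the two sample means."""
    return np.sum((Xs.mean(axis=0) - Xt.mean(axis=0)) ** 2)

rng = np.random.default_rng(6)
Xs = rng.normal(0.0, 1.0, (5000, 1))      # Gaussian with mean 0
Xt = rng.uniform(-3.0, 3.0, (5000, 1))    # uniform with mean 0
# clearly different distributions, yet the linear-kernel MMD is ~0
assert linear_mmd2(Xs, Xt) < 0.01
```

A characteristic kernel (e.g. RBF) would separate these two distributions, which is one reason kernel choice matters for MMD-based domain adaptation.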