Electromagnetic Source Imaging With a Combination of Sparse Bayesian Learning and Deep Neural Network

Accurate reconstruction of the brain activities from electroencephalography and magnetoencephalography (E/MEG) remains a long-standing challenge for the intrinsic ill-posedness in the inverse problem. In this study, to address this issue, we propose a novel data-driven source imaging framework based on sparse Bayesian learning and deep neural network (SI-SBLNN). Within this framework, the variational inference in conventional algorithm, which is built upon sparse Bayesian learning, is compressed via constructing a straightforward mapping from measurements to latent sparseness encoding parameters using deep neural network. The network is trained with synthesized data derived from the probabilistic graphical model embedded in the conventional algorithm. We achieved a realization of this framework with the algorithm, source imaging based on spatio-temporal basis function (SI-STBF), as backbone. In numerical simulations, the proposed algorithm validated its availability for different head models and robustness against distinct intensities of the noise. Meanwhile, it acquired superior performance compared to SI-STBF and several benchmarks in a variety of source configurations. Additionally, in real data experiments, it obtained the concordant results with the prior studies.


I. INTRODUCTION
E LECTROENCEPHALOGRAPHY (EEG) and magnetoencephalography (MEG) are widely used noninvasive techniques to explore human brain function with millisecond temporal resolution. Reconstruction of the cortical activities from the E/MEG recordings plays an important role in the field of neuroscience research and decease diagnose [1], [2]. The aforementioned problem can be addressed by solving the forward and inverse models sequentially. In forward model, the lead field matrix, which represents the linear relation between sources and measurements, is obtained depending on the configuration of sensors as well as the geometric and electric attributes of individual brain anatomy [3]. The inverse problem aims at estimating brain current sources from E/MEG recordings based on the known lead field matrix. In distributed source imaging models, current sources are located on voxels discretized over the entire cortical surface. Unfortunately, it involves solving an inherently ill-posed linear inverse problem since the source space dimensionality (∼O(10 4 )) always enormously exceeds the sensor space's (∼O(10 2 )).
In order to acquire a unique meaningful estimate in the neurophysiology context, prior assumptions are used to narrow the solution space. In the minimum-norm estimate (MNE) [4], L 2 -norm of sources was introduced and interpreted as the underlying energy should be minimized. As a variant of MNE, low resolution brain electromagnetic tomography (LORETA) [5] and standard LORETA [6] employed the discrete Laplacian operator to reflect the spatial correlation between discretized voxels. Notably, the aforementioned algorithms can be framed into the Bayesian schema with L 2 -norm interpreted as the ingredient of negative log of likelihood function. Under empirical Bayesian framework [7], also referred to as sparse Bayesian learning (SBL), the hyper parameters introduced in regularization terms can be released and optimized with other variables simultaneously, which induce the estimates with high sparsities. Champagne [8] was a representative SBL algorithm and outperformed several benchmarks in various source configurations. In addition, the usage of constraints in temporal domain also benefits to source localization. Temporal basis functions (TBFs) are widely used in the inverse problem, with which we are allowed to formulate the model in latent space with dimensionality tremendously smaller than the number of time points. TBFs can be predefined [9] or extracted in data-driven manner, such as singular value decomposition (SVD) [10] and stimulus-evoked factor analysis (SEFA) [11]. Furthermore, in source imaging based on spatio-temporal basis function (SI-STBF) [12], TBFs were alternately updated with other coefficients in the variational inference.
With the rapid development of deep learning in last decade, more and more attempts have been made to solve the inverse problem using deep neural networks (DNNs), attracted by whose effectiveness in building up the mapping between variables via extracting the underlying correlation from large-scale training dataset [13]. For example, a DNN model composed of long short-term memory (LSTM) layers was presented in [14] to estimate the location and time-course of singledipole source. It formulated a sequence-to-sequence (seq2seq) network with the same length between spatio-temporal input and output. Hecker et al. proposed a tailored convolutional neural network (CNN) called ConvDip [15], which converted an interpolated 2D image of EEG data on single time point into the corresponding estimated source vector. In source imaging framework network (SIFNet) [16], the inverse problem was reinterpreted as a supervised multi-class classification task and the discriminative model was constructed based on residual convolutional layers [17]. In data-synthesized spatio-temporal denoising autoencoder (DST-DAE) [18], a denoising autoencoder [19] based model was trained with samples generated via a data synthesis strategy with the usage of TBFs.
Different from conventional imaging algorithms, where the prior knowledge is imposed through constructing the probabilistic generative model with explicit mathematical expressions, the emerging DNN-based methods implicitly integrate the prior assumptions in the derivation of training data. Nevertheless, most of the presented approaches generate the data in heuristic fashion with little mathematical foundation. On the contrary, probabilistic generative model has the advantage in deriving the explicit and hierarchical correlation between variables, but it often remains intractable to acquire the true posterior distributions of latent variables. Even though we consider the approximation using mean field assumption, it takes much time on the alternately update for groups of parameters in every complete variational inference process.
In this paper, to solve the inverse problem, we propose a novel source imaging framework with a combination of sparse Bayesian learning and deep neural network (SI-SBLNN). Specifically, the network is employed to construct a mapping from observed measurements to latent sparseness encoding parameters in the probabilistic graphical model of specific SBL based algorithm. The network is trained with the synthesized data derived from the above probabilistic model. Once the training is accomplished, the variational inference in original algorithm can be significantly compressed because the critical parameters are directly output from the trained network and the convergence can be readily fulfilled. This proposed framework is flexible to achieve different realizations based on distinct SBL algorithms or probabilistic graphical models. In this work, we selected SI-STBF as the backbone to illustrate the implementation details and corresponding imaging results. The main contributions of this paper are presented as follows: 1) A novel framework is proposed to solve the inverse problem in source imaging, where sparse Bayesian learning is incorporated with deep neural network. 2) Gamma distribution is utilized for sampling the sparseness encoding parameters in training data. 3) Adaptive weighted mean square error (MSE) is presented as the loss function to measure the discrepancy between the estimated parameters and ground truth. The rest of this paper is organized as follows. In Section II, we derive the novel framework for inverse problem and illustrate the implementation in detail. Meanwhile, we also expound the experimental protocol and performance metrics. Section III demonstrates the source imaging results on both simulated and real E/MEG data. Some discussion and conclusions are illustrated in Section IV and V respectively.
For notation, upper and lower case bold face letters denote matrices and vectors respectively. Scalars are represented by roman lowercase variables. ∥·∥ F denotes the Frobenius norm of a matrix and N (x|µ, ) denotes a Gaussian distribution of x with the mean µ and covariance .

A. Probabilistic Generative Model
With the usage of Maxwell's equation [3], E/MEG recordings can be approximately expressed as the linear mapping of brain current sources as: where B = [b 1 , b 2 , . . . , b T ] ∈ R P×T is the E/MEG measurements with P sensors and T time points, b t represents a snapshot of measurements at time point t. S ∈ R D×T indicates the sources and D is the number of discretized voxels over the cortical mesh. L ∈ R P×D is the lead field matrix whose columns reflect the sensors response induced by the unit current sources. The orientation of each source is restricted perpendicular to the cortical surface. ε ∈ R P×T denotes the measurement noise with independent and identical Gaussian distribution N (ε t |0, ε ) at each time point. Because D always enormously exceeds P, it is inherently ill-posed to estimate brain sources from E/MEG measurements with the known lead field matrix. Therefore, prior knowledge is of great importance to acquire unique meaningful estimate in the context of neurophysiology. Because of the convenience to integrate prior assumptions, probabilistic graphical model is widely used for the inverse problem. For instance, in SI-STBF, hierarchical structure is constructed between multiple latent variables, which are respectively imposed different prior distributions. As shown in Fig. 1, sources S are decomposed into the product of three ingredients including spatial basis functions A, temporal basis functions , and coefficients : indicates the kth TBF with T time points. Spatial basis functions A are constructed based on a data driven parcelization [20] of the cortical mesh with each A [i] ∈ R D×r i derived from one parcel. The parcelization is implemented based on Multivariate Source Pre-localization (MSP) [21], which estimates the contribution of each dipole to the measurements. Coefficients build up the connection between spatial and temporal basis functions. Additionally, [i] ∈ R r i ×K is imposed the prior distribution with precision parameter γ i as: Hence, the prior probability of entire can be expressed as: where θ k represents the kth column in . Meanwhile, is imposed the prior distribution with parameters α as: With the application of mean field assumption, the variational posterior distribution is decomposed as: In VB-EM process, q (θ k ) , q ( ) , γ , and α are alternately updated until the convergence of variational lower bound F, also called free energy, is fulfilled. Specifically, free energy has the definition as: . The specific expression of free energy in SI-STBF is presented in [12]. Additionally, q (θ k ) is updated as follows: where F = L A.

B. A Combination of Sparse Bayesian Learning and Neural Network
It is worth noting that precision parameters γ have the dominant influence on the reconstructed sources. Optimized values of γ always achieve enormous difference in magnitude between ingredients and hence it leads to sparseness in coefficients owing to the formulation in (3). Moreover, if γ are settled down during the optimization in VB-EM, iterative update process is tremendously compressed and the computational cost can be greatly reduced.
In this study, we introduce a mapping based on deep neural network represented as the red dashed arrow in Fig. 1, which directly reconstructs precision parameters γ from measurements B. Using this mapping, we are allowed to acquire the estimated sources with much fewer epochs of update in variational inference. Based on the hierarchical structure of the graphical model, through repeatedly conducting the top-down sampling procedure, we can obtain the pairs for network training. Specifically, above all, spatial and temporal basis functions A and are predefined before the sampling procedure. Precision parameters γ are drawn from a specific distribution with the same dimensions as the number of components in A. Afterwards, , S, and B are sequentially generated based on (4), (2), and (1).

1) Gamma Distribution for Precision Parameters Sampling:
As mentioned above, precision parameters γ need to be sampled before other variables. Nevertheless, how to derive the training samples {γ (n) } N n=1 , which maintain the innate attributes in SI-STBF, is not straightforward. Intuitively, we have the preference for γ in which outliers exist with small values, because it implies the sparseness in sources S.
Here, we declare gamma distribution is an appropriate option to derive {γ (n) } N n=1 . In SI-STBF, two update rules, including EM based manner and convex approximation, are tractable for γ optimization. The latter one leads to much higher convergence rate but they have the same stationary points theoretically [22]. Suppose γ are regarded as variables whose components are imposed gamma distributions as the prior. Simultaneously, mean field assumption is adopted and parameters in variational posterior are iteratively updated to promote the free energy. These update formulas have slight difference from the EM update rule in SI-STBF. Therefore, gamma distribution is satisfactory for γ sampling.
Gamma distribution is defined in terms of two parameters, called shape a > 0 and rate b > 0, which respectively control the sparsity and scale. To concretely illustrate how gamma distribution is influenced by the shape parameter, in the supplementary material, we exhibit the histograms of samples drawn from gamma distributions with distinct a and fixed b. As demonstrated, shape a dominates the sparsity of gamma distribution and outliers with small scale can be readily sampled with tiny a. Hence, relatively small values of a were used to derive {γ (n) } N n=1 in subsequent experiments. 2) Network Architectures: In this work, we chose a multilayer perceptron (MLP) and two different CNN-type models for validation, including EEGNet [23] and residual convolutional network used in SIFNet [16], which are both widely used in temporal signal processing. EEGNet is a compact CNN composed of three convolution blocks and one fullyconnected layer. The former one contains depthwise and separable convolution modules which extracts different patterns of information from the signal. The residual network used in SIFNet incorporates a sequential of residual blocks where each one consists of two CNN layers with batch normalization to alleviate the covariance shift [24]. For above models, softmax functions in last layers for classification were eliminated to fulfill the demand in our proposed framework. In the rest of this paper, EEGNet maintains its notation and residual network is denoted as ResNet.
3) Loss Function: Even though MSE is popular in the field of machine learning for its effectiveness, direct use of MSE between the estimatesγ and ground truth cannot meet our requirement. Because of the inherent attribute of gamma distribution, γ are readily drawn with components which cover a wide range of values. To avoid reconstructing extremely enormous values, logγ is output from the neural network and the discrepancy between log γ and logγ is adopted in the loss function. More critically, only the information of components in γ with small scale can be completely propagated to the sensor space through the hierarchical structure in SI-STBF. From measurements B, it is infeasible to explicitly estimate how small [i] , is when whose information is covered by other components with huge scale. Owing to the coupling between γ i and [i] in (3), it remains unrealistic to build up a precise mapping from B to great γ i .
In this work, we proposed an adaptive weighted MSE to release the restriction of explicit reconstruction of γ among all components, where the weighting coefficients are related to the scale of corresponding ingredients. Specifically, the large weighting coefficient is equipped for small γ i because precise estimation is desired. In addition, we need to obviate small estimate for γ i with great scale, otherwise it leads to redundant or interfered imaging result. Weighting coefficients are set to different values for distinct scenarios as shown in Table I, where the threshold γ * is defined as the minimum of γ multiplied by 10 4 and w ≫ 1. Eventually, the loss function has the expression as: where W is a diagonal matrix with the weighting coefficients as the diagonal elements.

C. Full Algorithm Implementation
The entire algorithm is composed of the data synthesis phase, model training phase, and source estimation phase. 1) Data Synthesis Phase: First of all, spatial basis functions A is constructed based on the partition set P over cortical mesh. To achieve flexible source reconstruction, P consists of parcelizations P s with different clustering scales s. Each P s is derived based on random MSP scores. Temporal basis functions (n) are composed of specific number of Gaussian-damped sinusoidal time courses with the definition as and ω denote the frequency, phase, and damping rate respectively, which are randomly sampled from specific intervals to promote the diversity of training data. Afterwards, precision parameters γ (n) , with the same dimensionality as the number of clusters in P, are drawn from gamma distribution with random shape a. According to (4) and (5), θ (n) k is sampled from the Gaussian distribution with precision matrix (n) derived from γ (n) . Negative values in (n) are clamped and complemented with zeros. Sources S (n) are the product of aforementioned components as in (2). Finally, depending on (1), we obtain measurements B (n) corrupted by the noise ε with specific signal-to-noise ratio (SNR). Additionally, B (n) are re-scaled by division of their absolute maximum and γ (n) should be normalized compatible with B (n) . After repeatedly conducting the above top-down sampling procedure, we acquire abundant data {B (n) , γ (n) } N n=1 for the model training phase. The detailed procedure is encapsulated in Procedure 1.
2) Model Training Phase: The neural network is trained through minimizing the loss function in (10) among the training dataset {B (n) , γ (n) } N n=1 . Stochastic gradient descent is used for parameters update within the neural network.
3) Source Estimation Phase: Once the training phase is accomplished, using the trained neural network, the unseen realistic E/MEG signals B are converted into the estimated precision parametersγ . With the fixedγ , variational inference is substantially compressed, where the convergence of the free energy is always fulfilled around five epochs of update with neglectable computational cost compared to SI-STBF. Estimated sourcesŜ can then be obtained based onŜ = A¯ , where¯ is the variational mean and is extracted TBFs. Detailed procedure is exhibited in Procedure 2.

4) Parameters Setting and Computation Requirements:
The partition set P consisted of three parcels P s with s = 1, 2, 3 unless otherwise specified. Shape parameters in gamma distributions were drawn from uniform distribution between 0.15 and 0.25. For Gaussian-damped sinusoidal time courses, f was sampled from log-uniformly distribution between 1 and 5, which implied the preference for low frequency. In addition, τ and ω were respectively drawn from uniform distribution over intervals (0.2, 0.5) and (0.05, 0.1). We set K = 3. In Monte-Carlo simulations, time courses were with sampling rate of 50 Hz and partitioned into pre-and Procedure 1 Data Synthesis Phase Input: Lead field matrix L, spatial basis functions A, dataset size N , dimensionality of precision parameters d γ , number of TBFs K , and specific SNR.
post-stimulus periods with 50 time points per phase. Sources were regarded as active during the post-stimulus period. Noise was integrated with distinct values of SNR including −5, 0, 5, 10 dB. The whole neural networks were implemented in Pytorch and trained on single 16 GB NVIDIA Tesla T4 GPU. Detailed configurations of the neural networks, including the architectures and parameters, were presented in the supplementary document. For the loss function in (10), the weighting coefficient was set w = 100. Adam [25] was chosen as the optimizer where learning rate was 10 −3 and other parameters for gradient rectification was set as recommended with β 1 = 0.9, β 2 = 0.999, and ϵ = 10 −8 . Batch size was 256 and 100 epochs of training were adequate for convergence. Tolerance δ was set to be 10 −6 .
D. Evaluation 1) Simulation Protocol: To validate the availability of SI-SBLNN across head models, three different cortical meshes distributed in Brainstorm [26] were used in the numerical simulations, which were derived from the default MR images of ICBM152, FsAverage, and Colin27. They were respectively downsampled with 1000, 1500, and 2000 voxels while the orientation of located dipoles was imposed perpendicular to the cortical surface. Lead field matrix was acquired via OpenMEEG [27] software package embedded in Brainstorm, based on the sensor configuration in 64-channel Neuroscan Quik-cap system with two reference electrodes removed and three layers head surface obtained from Boundary Element Method (BEM).
In order to obtain spatial extended sources for Monte-Carlo experiments, we randomly selected the seed dipoles on cortical mesh and then iteratively absorb neighbor elements into the patches until the desired area of activated region is fulfilled. To reduce the variance of performance metrics caused by diverse positions of the patches, 100 Monte-Carlo experiments were conducted for each experimental scenario such that the majority of cortical surface is covered by the generated extended patches. For the evaluation data, dipoles were activated with the time courses randomly generated with the same manner as illustrated in Procedure 1. Simulated sources were projected onto the sensor space with lead field matrix and Gaussian noise was integrated with specific SNR based on (1). Simulated measurements were normalized by division of their absolute maximum and the estimated sources required to be rescaled.
2) Performance Metrics: To acquire the quantitatively and comprehensively evaluation of the source imaging performance, we utilized four validation metrics for assessment. The first metric is the area under the receiver operating characteristic curve (AUC) [28], which is an effective metric for sensitivity and specificity of the detection. The large AUC value implies the algorithm is capable to detect sensitively. The next metric is the spatial dispersion (SD), which measures the spatial blurriness of the estimated sources [29], [30]. To evaluate the correctness of the localization, the distance of localization error (DLE) [29], [30] is introduced. Finally, a mean square error (MSE) between the simulated and estimated sources is utilized to evaluated the accuracy of time courses reconstruction over the whole brain [31]. For good source imaging performance, the small values of SD, DLE and MSE are desired. The detailed calculations of AUC, SD, DLE, and MSE are presented in the supplementary material.
3) Comparison With Conventional Algorithms: In this paper, we derive a realization of SI-SBLNN with SI-SBTF as the backbone, which plays the essential role in data synthesis and source estimation phases. Hence, it is necessary to have the proposed algorithm compared to SI-STBF in order to explore whether it maintains or even promotes the imaging performance. In addition, the algorithm was tested against the benchmark methods including SBL [7], LORETA [5], and VB-GLM [9]. In this comparison, both LORETA and VB-GLM were transformed within the empirical Bayesian framework

E. Application to Real Data
To further evaluate the practical imaging performance of the proposed algorithm, two real datasets were used for validation. The first data was obtained from the OpenNeuro database. Its accession number is ds000117. The dataset is composed of simultaneous MEG/EEG recordings from 16 participants performing a simple visual recognition task from presentations of famous, unfamiliar and scrambled faces. The experimental design and multiple modalities data acquirement are expounded in [32]. Only the MEG measurements were used in our subsequent analysis, which were recorded at 1100Hz with an Elekta-Neuromag VectorView 306 system. The averaged response obtained from the subject 01 during famous face presentation epoch was downsampled at 100Hz and used for source estimation. The time window was chosen to be −500 ms to 500 ms (0 ms means stimulus onset). The noise covariance was estimated from the empty room recordings. The cortical surface was discretized with 1024 voxels. The second data is a public epilepsy dataset, which is available from the Brainstorm download page. The EEG data was recorded with sampling rate of 256 Hz from a patient suffering from focal fronto-parietal epilepsy, who was performed a left frontal tailored resection and was seizure-free with a followup of 5 years. 1 The average recordings, head model, and the lead field matrix can be acquired by following the procedures outlined in the Brainstorm tutorial. The time window was selected −200 ms to 200 ms (0 ms means the spike). Noise covariance was estimated from the recordings between 110 ms and 160 ms. The cortical surface was discretized with 1024 voxels.

III. RESULTS
For clear visualization of the source imaging results at specific time point, the thresholded absolute value of the current 1 http://neuroimage.usc.edu/brainstorm/Tutorials/Epilepsy sources, which is normalized with unit maxima, is shown on the cortical surface in this paper. The thresholds are determined upon the level of background activity based on Otsu's method [20], [31], [33].

A. Different Network Architectures and Different Sizes of Training Datasets
As a first step, we evaluated the performance of different network architectures and how they were influenced by the size of training dataset. Four networks were trained via three synthesized training datasets with different sizes. Concretely, we utilized two distinct configurations for ResNet and EEGNet respectively, which are indicated as ResNet-Shallow, ResNet-Deep, EEGNet-Narrow, and EEGNet-Broad. ResNet-Deep consists of more residual blocks than ResNet-Shallow. EEGNet-Broad shares the same architecture with EEGNet-Narrow, but the former one have more feature maps in each module. The detailed configurations including architectures and parameters are presented in supplementary document. Two MLP models with 90 and 1000 hidden units, respectively denoted as MLP-90 and MLP-1000, are also used for comparison. Three training datasets with distinct sizes are denoted as D.1, D.3, and D.6, which are composed of 1×10 4 , 3×10 4 , and 6×10 4 samples for each level of SNR. The training dataset with smaller scale is included in the one with larger scale and hence we have D.1 ⊂ D.3 ⊂ D.6. In this section, simulated sources were activated with single patch and different extents of 2, 6, 10, 18 cm 2 . Only the head model derived from ICBM152 was utilized. SNR was 5 dB.
First, we focus on the distinction between the MLP and CNN-type models. As demonstrated in Table II, for three training datasets, the CNN-type models consistently acquired the better results among all the validation metrics. Besides, it is easy to validate that there are enormously more learnable parameters in MLP-1000 than other compared models. Hence, compared to MLP, EEGNet and ResNet are more efficient in capturing the meaning information. Second, we investigate how the imaging performance of CNN-type models is

B. Different Head Models
To verify the availability of the proposed algorithm for different head models, in this section, we conducted the comparison between SI-SBLNN and SI-STBF across distinct head models derived from cortical surfaces including ICBM152, FsAverage, and Colin27. The experiments retained the settings in Section III-A. For Colin27, the parcelizations P s were generated with clustering scales of s = 2, 3, 4 to avoid too many clusters over the cortical mesh, which lead to the difficulty in estimating precision parameters with huge dimensionality. Table III depicts that, for all extents across different head models, SI-SBLNN acquired the superior performance compared to SI-STBF with larger AUCs and smaller SDs, DLEs, and MSEs. Moreover, as the spatial extents of activated region enlarged, SI-SBLNN obtained relatively stable values of AUC and decreased values in other metrics. It validated that the novel framework is available for different cortices and distinct numbers of discretized voxels over the cortical surface.

C. Influence of SNR
In this section, we tested the influence of SNR for distinct algorithms. We used the head model derived from FsAverage. Two patches of simulated sources were activated with spatial extents of approximately 6 cm 2 and white noise was integrated with different SNRs including −5 dB, 0 dB, 5 dB, and 10 dB. Fig. 2 depicts the performance metrics for different algorithms with varying SNRs. As the SNR increased, all the algorithms produced increasing values of AUC and descending values of SD, DLE, and MSE implying the higher detection accuracy and smaller spatial dispersion, localization error, and time courses reconstruction error. SI-SBLNN was relatively robust to intensity of the Gaussian white noise and acquired more stable imaging result compared to conventional algorithms. Additionally, except AUCs, SI-SBLNN produced significantly better results than the benchmarks under small SNR. With increasing values of SNR, the discrepancy in multiple metrics between SI-SBLNN and benchmarks gradually descended. Specifically, with SNR of 10 dB, SI-STBF acquired the comparable performance with SI-SBLNN among AUC, DLE, and SD and even better results of MSE. To intuitively demonstrate the source imaging performance of different algorithms influenced by varying SNRs, one realization of randomized simulations is presented in Fig. 3. As depicted, under the Gaussian noise with SNR of −5 dB, SI-SBLNN was capable to reconstruct the sources with spatial extents and locations basically in line with the ground truth. In the same circumstance, other algorithms obtained interfered imaging results in irrelevant regions over the cortical surface. As the noise intensity decreased, most methods acquired more plausible imaging results. With SNR = 5 and 10 dB, SI-SBLNN, SI-STBF, and SBL successfully localized the sources in correct region without spurious estimations. Notably, thanks to the usage of spatial basis functions, SI-SBLNN and SI-STBF reconstructed the sources with most accurate extents. In contrast, SBL acquired overly focal imaging results while LORETA and VB-GLM reconstructed diffuse sources covered many irrelevant regions.

D. Influence of the Number of Patches
In this section, we investigated how the imaging results of SI-SBLNN were influenced by the number of activated sources. We used the head model derived from FsAverage. We generated the simulated sources with various patches ranging from 1 to 4. The extent of each patch was fixed approximately 6 cm 2 and SNR was 5 dB. The performance metrics for all compared methods with various numbers of patches are presented in Fig. 4. As the number of patches increased, SI-SBLNN produced degraded imaging performance with decreasing AUCs and growing DLEs, SDs and MSEs. However, it still outperformed the other algorithms among all the metrics except SBL in AUC. It is worth noting that, SI-SBLNN acquired the inverse trends in SD and DLE compared to most benchmarks, including its backbone SI-STBF. It is most probably caused by the relatively small values of shape a in the data synthesized phase. It induced the preference for source estimation with high sparsity which was empirically not appropriate for the cases where sources were activated simultaneously among so many cortical regions. To further exhibit the capability of different algorithms to reconstruct multiple sources, Fig. 5 presents the spatial maps for one realization of randomized simulations with three sources. The imaging results are depicted at the time points at which the peak of individual time courses were attained.  SI-SBLNN produced the imaging results with consistent locations and extents with the ground truth at entire time points. SI-STBF and SBL were capable to localize the sources at correct regions. However, the former obtained slightly larger extents than the ground truth while the latter acquired smaller ones. Additionally, SI-STBF produced the irrelevant estimate at inferior temporal area. LORETA and VB-GLM acquired overly extended estimations with too many redundant sources.

E. Result of Real Data 1) Results
With Face Processing Experiment MEG Data: In this section, we investigated the imaging result of SI-SBLNN for real MEG data, whose description is presented in Section II-E. Both the data synthesis and model training phases maintained the same configurations as expounded in Section II-C. Fig. 6 presents the visual responses of distinct algorithms at 100 ms and 170 ms after the presentation of famous faces stimulus. As shown in the plots, SI-SBLNN successfully localized the bilateral neural activation at the occipital area at 100 ms, which is regarded for early visual processing. At 170 ms, SI-SBLNN reconstructed the sources in the vicinity of right fusiform area concordant with the previous reports [9], [34]. SBL and SI-STBF estimated the neural activation at the similar regions over the cortex. Meanwhile, the former reconstructed sources with smaller extents than SI-SBLNN while the latter acquired spurious reconstruction at left temporal pole. Besides, LORETA and VB-GLM failed to locating the neural activation at right occipital area and obtained too diffuse estimates which covered many functionally irrelevant regions.
2) Results With Public Epilepsy EEG Data: In this section, we validated the imaging performance of proposed algorithm for real EEG recordings with its acquisition declaration presented in Section II-E. Both the data synthesis and model training phases maintained the same configurations as presented in Section II-C.
As shown in Fig. 7, all the methods except SBL can successfully localized the sources on left frontal region over the cortical surface. Nevertheless, both LORETA and VB-GLM reconstructed too widespread estimates including many spurious sources at functionally irrelevant areas. On the contrary, SI-SBLNN and SI-STBF produced the estimations with good spatial contiguity, which are consistent with the previous clinical report in [35].

IV. DISCUSSION
Because of the inherent ill-poseness of E/MEG source imaging problem, prior constraints are of great importance to acquire meaningful solution in the context of neurophysiology. It has been widely accepted by previous studies [36], [37] that cortical activations enjoy spatial coherence and local homogeneity. With the employment of Bayesian modeling, the sources are factorized into multiple ingredients which can be imposed prior knowledge with distinct attributes. Many conventional source imaging methods can be framed into this schema, such as MNE, LORETA, and VB-GLM. When hyper parameters introduced in the prior distribution are released to be updated, the algorithm is transformed within empirical Bayesian framework, which has the preference for sparse estimations, like Champagne and SI-STBF. The Bayesian modeling based algorithm constructs the probabilistic model with explicit mathematical formulas. However, the more sophisticate model we derive, the more difficultly we can obtain the posterior distribution of latent variables. In most cases, it is even intractable to acquire the true posterior and we can merely achieve the approximate estimation via variational inference, which always has high computational cost and slight use for real-time applications. On the contrary, DNN-based methods, such as ConDvid, SIFNet, and DST-DAE, avoid the iteration and hence reduce the computation time in each single inference process. Additionally, prior assumptions are implicitly imposed through the synthesized training data, instead of constructing the probabilistic graphical model.
In this paper, we proposed a novel framework termed SI-SBLNN, which combines sparse Bayesian learning and deep neural network, and implemented an algorithm with SI-STBF as the backbone for training data synthesis and source estimation. As demonstrated in Table III, the proposed algorithm had the availability for different cortices and corresponding head models. Meanwhile, the plots in Fig. 2 depict that SI-SBLNN was robust against SNRs and acquired superior imaging performance than the benchmarks under low SNRs (≤ 5 dB). Furthermore, as shown in Section III-D, SI-SBLNN was capable to duel with the cases where multiple sources were activated. On the contrary, the DNN-based method proposed in [14] was limited in single dipole reconstruction. In [18], individual models are required to be trained to acquired promising results for different scenarios. In Section III-E, we demonstrated that SI-SBLNN was capable to produce plausible imaging results for realistic E/MEG recordings, even though it was merely trained with synthesized data.
For the Bayesian learning algorithms with complicated graphical structures where latent variables are heavily correlated, variational inference is widely used to acquire the approximation of the posterior distribution, such as mean field assumption. However, the decomposition over variables can induce severe deviation from the true correlation. Besides, the error is readily accumulated and amplified through the alternate updates in VB-EM process. On the contrary, the proposed method directly extracts the relationship between measurements and essential latent variables from the synthesized training data, which in principle reduces the bias caused by the user-specified approximate posterior and the iterative update process. Therefore, compared to the backbone algorithm, it makes sense for SI-SBLNN to obtain the more accurate estimation of precision parameters and acquire the superior imaging performance in multiple scenarios as presented in Fig. 2 and 4. Moreover, in SI-SBLNN, the computational cost of variational inference are tremendously compressed using the trained neural network. As shown in Table S5 of the supplementary material, across Monte-Carlo simulations, the average inference time for SI-SBLNN was around one-thirtieth that of SI-STBF.
Compared to the purely DNN-based methods, the proposed algorithm has the main distinctions in following aspects. First, rather than generating the training data in heuristic manner, we employed SI-STBF as the backbone, which consists of probabilistic graphical model with explicit mathematical expressions, to derive the training data by conducting the top-down sampling procedure. Second, rather than making estimations in the source space with huge dimensionality like SIFNet and DST-DAE, the proposed algorithm reconstructs the precision parameters with enormously less dimensions thanks to the matrix factorization in SI-STBF, which implies the spatial coherence of cortical activations. It substantially alleviates the difficulty in network training. In the supplementary document, we compared the performance for different head models between SI-SBLNN and SIFNet. As shown in Table S4, SIFNet acquired the superior results for ICBM152 and FsAverage, which consist of relatively small number of dipoles. For the head model with most discretized voxels over the cortex, the source detection sensitivity and localization accuracy of SIFNet significantly degraded. In contrast, SI-SBLNN was capable to obtain consistently satisfactory results among different head models with tolerable additional computational cost. As depicted in Table S5, the inference time of SI-SBLNN and SIFNet are essentially of comparable magnitude.
In this paper, rather than deriving a specific source imaging algorithm which outperforms the benchmarks, the main contribution is that we proposed a flexible framework, which combines the sparse Bayesian learning and deep neural network and is allowed to achieve different realizations with slight modification in many aspects. Specifically, the proposed strategy can be applied to other algorithms within empirical Bayesian framework in addition to SI-STBF. For instance, we can choose BESTIES [38] as backbone and the mappings are formulated from measurements to both precision parameters and the learnable scale in spatial smoothing kernel. Additionally, a variety of network architectures can be used to construct the mapping. Graphical convolutional neural network [39] is a reasonable option because of its inherent compatibility to face connectivity over the cortical mesh. Furthermore, we can build up the network by unrolling the iteration in variational inference [40]. The network is composed of stacked modules with the same architectures but different learnable parameters, where the architecture is elaborately designed to match the update procedure in single epoch. LSTM layer can be employed to get rid of the constraint of fixed-size temporal dimensions [14].
To the best of our knowledge, current DNN-based algorithms, including the proposed algorithm in this study, have to train individual neural networks for distinct head models. It restricts the clinical applications owing to the long training time for subject-specific models. In the future, we will concentrate on the design of network architecture and collaborative algorithm towards the generalizability across different head models, which eliminate the network training phase for new subjects and sensor configurations. It will significantly improve the practicality of DNN-based algorithm for source imaging.

V. CONCLUSION
The current study presented a novel E/MEG source imaging framework termed SI-SBLNN, in which the neural network is incorporated into a source imaging algorithm based on sparse Bayesian learning through constructing the straightforward mapping from measurements to latent sparseness encoding parameters. We achieved an implementation with SI-STBF as the backbone and employed MLP, EEGNet and ResNet to formulate the mappings. In numerical simulations, MLP performed poorly among the comparison and ResNet obtained better imaging performance than EEGNet with abundant training data. The proposed algorithm validated its availability for different head models and robustness against distinct intensities of noise. Meanwhile, it acquired superior performance compared to SI-STBF and several benchmarks in a variety of source configurations. Additionally, in real data experiments, SI-SBLNN produced the concordant results with the prior studies when it was trained with synthesized data merely.