An Interpretable Deep Learning Model for Speech Activity Detection Using Electrocorticographic Signals

Numerous state-of-the-art solutions for neural speech decoding and synthesis incorporate deep learning into the processing pipeline. These models are typically opaque and can require significant computational resources for training and execution. A deep learning architecture is presented that learns input bandpass filters that capture task-relevant spectral features directly from data. Incorporating such explainable feature extraction into the model furthers the goal of creating end-to-end architectures that enable automated subject-specific parameter tuning while yielding an interpretable result. The model is implemented using intracranial brain data collected during a speech task. Using raw, unprocessed timesamples, the model detects the presence of speech at every timesample in a causal manner, suitable for online application. Model performance is comparable or superior to existing approaches that require substantial signal preprocessing and the learned frequency bands were found to converge to ranges that are supported by previous studies.

neural control applications. Electrocorticography (ECoG) is an invasive measurement of the electrical potentials generated from the neocortex of the brain [2]. ECoG signals have been shown to successfully control the movement of an upper-limb neuroprosthetic [3] or typing interface [4], as well as to decode speech processes [5].
Deep learning has been demonstrated to be an effective method for decoding speech from ECoG signals, and its inclusion in the decoding and synthesis pipeline has increased in recent years [12], [16], [20], [21]. Although an end-to-end architecture may eventually be wholly effective with sufficient training data, some current approaches have adopted a modular scheme with several sequential component models, each configured for a specific aspect of the speech decoding process [15], [16], [22].
Regardless of the specific approach, the overarching goal is to decode imagined or attempted speech directly from brain signals to provide an alternate communication channel for those who have lost the ability to speak. Here, the goal is not to maximize a metric for the quality of speech decoding. Instead, the approach is conceived from the perspective of identifying brain activity associated with intervals of intended speech output, with the ultimate objective of reliably detecting activity associated with imagined speech.
The present work introduces a component model, SincIEEG, based on a convolutional neural network (CNN) architecture developed for the task of speech activity detection [23]. The model is designed as a gateway, constantly monitoring brain activity to identify the segments pertinent to speech production. These detected segments can then be sent to downstream models for subsequent speech decoding and synthesis. SincIEEG, unlike a traditional CNN, learns a set of bandpass filter coefficients at its input layer. This significantly reduces the number of required model parameters relative to a traditional CNN, making the model computationally efficient in terms of training and implementation. This compactness allows for flexibility without enlarging the optimization problem. Moreover, unlike most traditional CNNs, the SincIEEG model has the distinct advantage of yielding interpretable parameters. The bandpass filters learned by SincIEEG can be visualized and equated to conventional spectral brain features.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The results demonstrate that SincIEEG is capable of detecting the presence or absence of speech during each time interval with a high level of accuracy. The model's performance is compared to that of a traditional CNN model, as well as to non-deep-learning methods. In addition, the generalizability of the model architecture is highlighted in terms of providing empirical, interpretable insights about the discriminable bandpass spectral features for any physiological data that can be represented as an aggregate of bandpass activity.

II. MATERIALS AND METHODS

A. Participants
ECoG data were recorded from 5 participants with pharmacoresistant epilepsy undergoing clinical monitoring for surgical planning. No participants reported hearing deficits. In all cases, a tumor was not the source for the seizures and no lesions were indicated by any electrode used for analysis. All participants gave written informed consent and the study protocol was approved by the institutional review boards of Virginia Commonwealth University; University of California, San Diego; Old Dominion University; and Mayo Clinic, Florida.
Participants were implanted with subdural electrode grids or strips (Ad-Tech Medical Instrument Corporation, 1-cm spacing) based purely on their clinical need. Electrode locations were verified by co-registering preoperative MRI and postoperative computerized tomography scans. For combined visualization, electrode locations were projected to common Talairach space. Electrode locations were rendered using NeuralAct [24], as shown in Figure 1. While brain areas associated with speech are predominantly found on the dominant hemisphere, which is the left hemisphere in the majority of right-hand dominant people, the neural correlates of speech production are not exclusively localized in the left hemisphere [25], [26]. For this reason, both left and right hemisphere cases are evaluated. In total, ECoG activity was recorded from 416 (96 left hemisphere, 320 right hemisphere) subdural electrodes. Of these, electrodes that exhibited unnatural signal anomalies based on visual inspection were excluded from the analysis, leaving 364 electrodes (96 left hemisphere, 268 right hemisphere). For each participant, the numbers of electrodes implanted, analyzed, and identified as not located over the auditory cortex (non-auditory) are provided in Table I.

B. Task
Participants were instructed to read aloud single words presented in sequence on a computer screen while their brain activity and voice were simultaneously recorded. The words were selected from a bank of 431 unique words, split into 4 sets of 115-116 words. The word bank is primarily monosyllabic and comprises the Modified Rhyme Test [27], supplemented with additional words to better reflect the phoneme distribution of American English [28]. While this experimental paradigm was originally designed to examine neural correlates of American English phonemes [7], the data are used in the present analysis exclusively for speech activity detection without consideration of phonetic aspects.
The experiment begins with a fixation cross at the center of the screen. The cross is then replaced by a word that stays on the screen for 2.5 seconds. The word is then replaced with the cross for 0.5 seconds, before the next word is presented. Words are chosen randomly from the set of 115 words for each session and each session contained a different subset of words. Participants completed between 2 and 4 sessions, depending on willingness and ability to complete the sessions.

C. Data Acquisition
ECoG and audio data were concurrently recorded during the task. ECoG data were bandpass filtered between 0.5 and 500 Hz, notch filtered at 60 Hz, and recorded using g.USB amplifiers (g.tec Medical Engineering). The data were recorded at a sampling rate of 1200 Hz and subsequently decimated to 600 Hz.
[Figure caption fragment: The filtered signals are normalized with respect to the band dimension using spatial normalization before convolutional layers learn kernels across time and passbands. All hidden layers use batch normalization for regularization and Leaky Rectified Linear Units for activation. The model predicts the likelihood of speaking using a Sigmoid activation at its output layer.]
The time series and its frequency spectra were visually inspected for anomalies. Channels having uncharacteristic frequency spectra, substantial artifacts, and/or saturated amplitudes were excluded from the analysis. In total, 364 (96 left hemisphere, 268 right hemisphere) electrodes were used for analysis.
This basic preprocessing is standard for ECoG acquisition and the data decimation can be equivalently achieved by using a lower sampling rate at the time of data acquisition. Thus, the data used as input to the SincIEEG network effectively represent the raw ECoG timesamples.
Audio data were recorded in parallel using a Blue Microphones Snowball iCE USB microphone connected to the research computer, sampled at 48 kHz. All data recording and stimulus presentation were facilitated by BCI2000 software [29].

D. Speech Labeling
Speech labels used for training the model were defined relative to the stimulus cue marking the presentation of each word in the experiment. Every time sample from 0.5 seconds to 1.5 seconds after the word presentation cue was labeled as 'speaking'. Every time sample from 2.0 seconds to 3.0 seconds after the cue was labeled as 'not-speaking'. The other segments, from the cue to 0.5 seconds after, and from 1.5 to 2.0 seconds after, were purposefully left unlabeled.
This labeling scheme was chosen based on the stimulus presentation cue, as opposed to direct energy detection in the audio signal, to develop a more robust model that does not directly rely upon the acoustic signal. This was done to emulate the scenario where the user is unable to speak, and thus precise labels for the presence or absence of speech would be unavailable. Instead, the proposed labeling indicates the time segments where speech is most expected, which can be generalized to imagined speech.
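The cue-relative labeling rule above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `label_trial`, the use of -1 for unlabeled samples, and the half-open interval boundaries are assumptions, while the 600 Hz rate comes from Section II-C.

```python
import numpy as np

FS = 600  # Hz, decimated sampling rate from Section II-C


def label_trial(n_samples, cue_sample, fs=FS):
    """Per-sample labels for one trial: 1 = speaking, 0 = not-speaking, -1 = unlabeled.

    Interval boundaries are treated as half-open [start, stop); that choice is
    an assumption, since the text does not say which endpoints are included.
    """
    labels = np.full(n_samples, -1, dtype=np.int8)
    rel = np.arange(n_samples) - cue_sample  # sample offset from the cue
    labels[(rel >= int(0.5 * fs)) & (rel < int(1.5 * fs))] = 1
    labels[(rel >= int(2.0 * fs)) & (rel < int(3.0 * fs))] = 0
    return labels


# One 3-second trial whose cue falls on the first sample.
labels = label_trial(n_samples=3 * FS, cue_sample=0)
```

Under these assumptions each trial yields equal numbers of 'speaking' and 'not-speaking' samples, with the remaining third left unlabeled.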

III. MODEL DESIGN AND OPTIMIZATION
The SincIEEG model is a Multi-SincNet based convolutional deep learning architecture adapted for real-time detection of human speech from ECoG input signals. Proposed in [30] for hand-pose classification from myoelectric sensor readings, and based on the work in [23], the Multi-SincNet architecture learns the coefficients of a set of parallel finite impulse response (FIR) bandpass filters, applied across the input channels. Subsequent convolutional layers learn kernels that aggregate across time and bandpass frequency dimensions. A final global view, established by a fully connected layer and sigmoid activation, classifies each input as either 'speaking' or 'not-speaking' from labeled data. Figure 2 illustrates the SincIEEG model and its layer configurations. This section details the architecture and training strategy used to produce the models whose validation is described in Section IV.
In overview, the inputs to the model are 500 ms windows of raw IEEG data (300 time samples) with a stride of one time sample (approximately 1.7 ms at 600 Hz). Each 500 ms window represents one training sample for the model, labeled as described in Section II-D. A model was trained for each participant, using all of the quality electrodes available. Electrodes over the auditory cortex were excluded for a model validation check, detailed in Section IV-C.2. A K-fold training methodology was used and is detailed further in Section III-E.
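The windowing described above can be illustrated with NumPy. This is a sketch under stated assumptions: the function name `causal_windows` and the (channels, samples) layout are hypothetical, and pairing each window with the label of its last sample is one plausible reading of the causal formulation rather than a detail given in the text.

```python
import numpy as np


def causal_windows(ecog, win=300, stride=1):
    """Slice (n_channels, n_samples) data into overlapping windows.

    Returns an array of shape (n_windows, n_channels, win). Window i covers
    samples i*stride .. i*stride + win - 1, so it only uses data up to and
    including its final sample.
    """
    view = np.lib.stride_tricks.sliding_window_view(ecog, win, axis=1)
    # view has shape (n_channels, n_samples - win + 1, win)
    return view[:, ::stride].transpose(1, 0, 2)


# Synthetic stand-in for 4 channels of decimated ECoG.
ecog = np.random.default_rng(0).standard_normal((4, 1000))
windows = causal_windows(ecog)
```

Because `sliding_window_view` returns a view, the overlapping windows do not duplicate the underlying samples in memory until they are copied into a batch.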

A. Multi-SincNet Input Convolution
The first layer in the SincIEEG model is a Multi-SincNet layer, an extension to the Kaldi speech framework's [38] SincNet, which applies a SincNet to each of the incoming sensor channels. A SincNet layer learns a configurable number of bandpass filters, each parameterized through two cutoff frequencies, $f_L$ and $f_H$. The Multi-SincNet layer can therefore be used to decompose a collection of input signals into a fixed set of learned bands.
In equations 1 and 2, multiple filters are conceptualized as vectors of low and high cutoffs, $F_L$ and $F_H$ respectively, identifying regions of the input's spectrum that the model uses for classification. These vectors are a parameterization of a SincNet layer, which is shared in the experiments across all sensors s ∈ S.
Sharing bandpass filters across each sensor reduces parameters, improves model latency, and regularizes the treatment of sensor data.
Each FIR filter, k, is implemented as a set of kernel coefficients and applied through convolution with the input signal, $X * k_{f_L, f_H}$,
where X is the input signal and $k_{f_L, f_H}$ is the vector of kernel coefficients that allows frequencies in [$f_L$, $f_H$] to remain in the signal. Additional details on the calculation of k coefficients and how they compare to learned kernels can be found in [23].
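As an illustration of how two cutoff frequencies determine a full FIR kernel, the sketch below builds a windowed-sinc bandpass filter as the difference of two ideal low-pass responses, which is the general construction used by SincNet-style layers. The kernel length of 129 taps, the Hamming window, and the function name are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def sinc_bandpass_kernel(f_low, f_high, fs=600.0, length=129):
    """Windowed-sinc FIR bandpass kernel passing roughly [f_low, f_high] Hz.

    Built as the difference of two ideal low-pass impulse responses and
    tapered with a Hamming window. `length` (odd) is a hypothetical choice.
    """
    n = np.arange(length) - (length - 1) / 2  # symmetric sample offsets

    def lowpass(fc):
        # Ideal low-pass with cutoff fc; np.sinc(x) = sin(pi x) / (pi x)
        return 2.0 * fc / fs * np.sinc(2.0 * fc * n / fs)

    return (lowpass(f_high) - lowpass(f_low)) * np.hamming(length)


k = sinc_bandpass_kernel(30.0, 70.0)

# Apply to a synthetic signal; "valid" keeps only fully overlapped outputs.
x = np.random.default_rng(0).standard_normal(300)
y = np.convolve(x, k, mode="valid")
```

Only `f_low` and `f_high` would be trainable in such a layer; every tap of `k` follows from them, which is why the parameter count stays small regardless of kernel length.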
Filters are initialized to uniformly sub-divide the majority of the available spectrum (i.e., 0-300 Hz) with a 3 Hz region of overlap between adjacent bands. The original Kaldi implementation initializes bands starting at a low-cutoff of 30 Hz, but this minimum starting frequency is reduced to 10 Hz for the present analysis to help encourage use of lower frequencies that may be relevant for this application [19]. The Kaldi SincNet implementation also includes a minimum frequency and minimum bandwidth constraint, which are configured to be 1 Hz and 3 Hz, respectively. Kaldi enforces these minimums by increasing the absolute value of the learned low-cutoffs and bandwidths by their respective minimums. Future work should explore the impact of different potential initialization schemes.
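The initialization and constraint handling described above might look like the following. This is a hedged sketch: how the 3 Hz overlap is split between neighboring bands, and the names `init_bands` and `constrained_cutoffs`, are assumptions. Only the 10 Hz starting cutoff, the 300 Hz upper limit, and the 1 Hz and 3 Hz minimums come from the text.

```python
import numpy as np


def init_bands(n_bands, f_start=10.0, f_stop=300.0, overlap=3.0):
    """Uniformly subdivide [f_start, f_stop] into bands sharing `overlap` Hz.

    Splitting the overlap evenly between neighbors is an assumption.
    """
    edges = np.linspace(f_start, f_stop, n_bands + 1)
    low, high = edges[:-1].copy(), edges[1:].copy()
    low[1:] -= overlap / 2    # pull interior lower edges down
    high[:-1] += overlap / 2  # push interior upper edges up
    return low, high


def constrained_cutoffs(p_low, p_band, min_low=1.0, min_band=3.0):
    """Map unconstrained learned parameters to valid cutoffs.

    The absolute value keeps both quantities non-negative, and the minimums
    are then added, so f_low >= min_low and f_high - f_low >= min_band.
    """
    f_low = np.abs(p_low) + min_low
    f_high = f_low + np.abs(p_band) + min_band
    return f_low, f_high


low, high = init_bands(3)
```

With this parameterization the optimizer can move the raw parameters freely while the resulting filter always has a positive low cutoff and a non-degenerate bandwidth.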

B. Activation
Rectified linear units (ReLU), defined as y = max(0, x), provide a constant unit gradient for all inputs $x \in \mathbb{R}^+$ and zero gradient for x ≤ 0. With zero-centered bandpass outputs, a large portion of values will not have a gradient with ReLU activation. Instead, the Leaky ReLU activation (LReLU) provides a small gradient for x ≤ 0, while still being non-linear and computationally simple. The LReLU activation is defined in equation 7, where the default α = 0.01 is used for all experiments.
Using LReLU on zero-centered data still greatly diminishes negative inputs. However, the learned affine parameters within the batch normalization layers can learn to offset any inputs into regions with higher variance.
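For reference, the standard Leaky ReLU and its gradient can be written as below. This follows the usual textbook definition (x for x > 0, αx otherwise) with α = 0.01 as stated above. The function names are illustrative.

```python
import numpy as np


def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: identity for positive inputs, slope `alpha` otherwise."""
    return np.where(x > 0, x, alpha * x)


def leaky_relu_grad(x, alpha=0.01):
    """Derivative of Leaky ReLU with respect to its input."""
    return np.where(x > 0, 1.0, alpha)


x = np.array([-2.0, 0.0, 3.0])
y = leaky_relu(x)
g = leaky_relu_grad(x)
```

Negative inputs are scaled by a factor of 100 rather than removed, which is the attenuation that the paragraph above refers to.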

C. Batch Normalization
The amplitude of the output from the Multi-SincNet filters scales directly with the amplitude of the input signal. Between-sensor relative magnitudes are important to maintain, so scaling at the sensor dimension of intermediate data is avoided in the early layers.
Brain dynamics are not evenly distributed in the frequency domain, however, and will tend to have higher amplitudes at lower frequencies. This means the additional bandpass dimensions may be distributed at different scales, making it difficult to learn shared kernels in subsequent convolution layers. Furthermore, the scale of the intermediate values may shift as the cutoff frequencies of the learned bandpass filters are optimized.
Therefore, in order to balance influence when learning kernels applied across bands, and to scale hidden outputs to activation regions, a spatial batch normalization [39] is applied at the band dimension in the three hidden outputs following the Multi-SincNet input layer. Re-scaling each band independently maintains within-band relative dynamics that can be learned using shared weights.
In computing the per-band mean $\mu_f$ and standard deviation $\sigma_f$, B is the batch size, S is the set of sensors, F is the set of bandpass regions, and T is the number of input samples. Learned affine parameters β and γ allow the model to adjust the center and scale away from the origin and unit variance. Following cross-band convolution, spatial normalization is applied across sensors, computing $\mu_s$ and $\sigma_s$ analogously to $\mu_f$ and $\sigma_f$. At this point in the architecture, distributions across sensors are well-normalized and suitable for batch normalization's regularizing effect, reducing internal covariate shift.
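A minimal sketch of normalization at the band dimension follows, assuming intermediate activations are laid out as (B, S, F, T). That layout, the function name, and the `eps` value are assumptions. Only training-time batch statistics are shown, and running averages for inference are omitted.

```python
import numpy as np


def band_batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each band over batch, sensor, and time dimensions.

    x: array of shape (B, S, F, T); gamma, beta: learned vectors of shape (F,).
    One mean and variance is computed per band f, so relative magnitudes
    between sensors within a band are preserved.
    """
    mu = x.mean(axis=(0, 1, 3), keepdims=True)
    var = x.var(axis=(0, 1, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma[None, None, :, None] * x_hat + beta[None, None, :, None]


# Three bands with very different amplitude scales, as expected for
# low- versus high-frequency brain activity.
rng = np.random.default_rng(0)
scales = np.array([10.0, 1.0, 0.1])[None, None, :, None]
x = rng.standard_normal((8, 4, 3, 50)) * scales
y = band_batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
```

After this step all three bands have comparable scale, which is what allows later convolution kernels to be shared across bands.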

D. Monte Carlo Dropout
Sensor systems with many highly responsive input channels may have spurious errors or drift, and sometimes must be removed in pre-processing. Additionally, for general tasks such as speech activity detection from an ECoG array, some important brain regions may have multiple sensors covering them, resulting in high co-linearity across channels. To regularize co-linearity across sensors, channel dropout [40] is applied on the input to the model during training. Channel dropout on the sensors zeros all signal values for a sensor with an independent Bernoulli random number parameterized by probability p. It is common to avoid using dropout when using batch normalization since the noise caused by the dropout will skew the mean and variance statistics used in normalization towards zero.
However, for SincIEEG, the data modality is already centered at zero, and the practical application motivates robustness to sensor dropout.
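Channel dropout on the input can be sketched as follows, assuming inputs of shape (batch, sensors, time). The function name is hypothetical. The text does not say whether surviving channels are rescaled by 1/(1 - p), so no rescaling is applied here.

```python
import numpy as np


def channel_dropout(x, p, rng):
    """Zero entire sensor channels independently with probability p.

    x: array of shape (B, S, T). Each (sample, sensor) pair draws one
    Bernoulli variable, so a dropped sensor is zero for the whole window.
    """
    keep = rng.random((x.shape[0], x.shape[1], 1)) >= p
    return x * keep


rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 300))
x_dropped = channel_dropout(x, p=0.5, rng=rng)
```

Dropping whole channels rather than individual samples mimics a failed or removed electrode, which is the robustness scenario that motivates its use here.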

E. Optimization Procedure
All deep learning models in this work, both the SincIEEG described above and the CNN model described in Section IV-C.4, are trained with stochastic gradient descent using gradients produced by error back-propagation. The Adam optimizer [41] is employed with the learning rate fixed to α = 0.001 for all experiments. Binary cross-entropy loss between the target label and the model's output is used as the objective criterion.
Models are evaluated through multiple refits using a K-Fold procedure across a participant's sessions. A single holdout session is used for evaluation in each fold and the remaining sessions are used for training. Some participants had three sessions, providing two training sessions per fold, while others had only two sessions overall and provided one session per training fold. A random 25% of the training data is set aside as a cross-validation portion for monitoring model performance during training. After each epoch of training, the model under optimization is applied to the cross-validation data and scored. For the SincIEEG and CNN experiments, the model that scored best on the cross-validation data is stored after 100 epochs of training.
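The session-wise fold construction can be sketched as a generator. This is an illustration only: the function name, the seed handling, and the representation of data as one session identifier per training window are assumptions.

```python
import numpy as np


def session_folds(session_ids, val_fraction=0.25, seed=0):
    """Yield (train_idx, val_idx, test_idx) with one held-out session per fold.

    The held-out session forms the test set. The remaining windows are
    shuffled and `val_fraction` of them is set aside for monitoring.
    """
    session_ids = np.asarray(session_ids)
    rng = np.random.default_rng(seed)
    for held_out in np.unique(session_ids):
        test_idx = np.flatnonzero(session_ids == held_out)
        rest = rng.permutation(np.flatnonzero(session_ids != held_out))
        n_val = int(round(val_fraction * rest.size))
        yield rest[n_val:], rest[:n_val], test_idx


# Three sessions of 100 windows each gives three folds.
session_ids = np.repeat([0, 1, 2], 100)
folds = list(session_folds(session_ids))
```

A participant with two sessions would produce two folds, each trained on a single session, matching the description above.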
Experiments without auditory sensors and other supplementary architecture exploration used early stopping. For these experiments, if the cross-validation performance did not improve for 10 epochs during training, then the best model at that point was stored and the training procedure ended. The early stopping procedure generally produced models with similar performance to their 100 epoch counterparts. Other configurations that were explored using this truncated procedure include variations of activation function, batch normalization, number of learned kernels, and other modifications to convolution configuration. Performance was robust for most configurations and these preliminary experiments focused on reducing model complexity.
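The early stopping rule with a patience of 10 epochs can be expressed compactly. In this sketch a list of validation scores stands in for the per-epoch train-and-evaluate step, and the function name is hypothetical. In a real loop the model weights would be checkpointed whenever the best score improves.

```python
def early_stopping_epoch(val_scores, patience=10):
    """Return (best_epoch, best_score, last_epoch_run) for a score sequence.

    Training stops once `patience` consecutive epochs pass without the
    validation score exceeding the best value seen so far.
    """
    best_score, best_epoch, last_epoch = float("-inf"), -1, -1
    for epoch, score in enumerate(val_scores):
        last_epoch = epoch
        if score > best_score:
            best_score, best_epoch = score, epoch  # checkpoint would go here
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_score, last_epoch


# Scores improve for three epochs and then plateau below the best value.
scores = [0.5, 0.6, 0.7] + [0.65] * 20
result = early_stopping_epoch(scores)
```

In this example the best model comes from epoch 2 and training halts at epoch 12, after ten epochs without improvement.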

IV. MODEL VALIDATION
ECoG data acquired from participants performing the speech task were used to further validate the model. The models are validated both quantitatively for predictive performance, as well as qualitatively for convergence of the spectral band filters to physiologically plausible ranges.

A. Prediction Accuracy
The prediction accuracy is computed as the proportion of windows correctly classified as 'speaking' or 'not-speaking'. Visualizations that overlay the stimulus cue, curated labels, speech audio signal, and the model's predicted likelihood of speech are presented. Aligning recorded speech with model predictions across multiple training windows enables an examination of the model's predictions with both the labeled regions and recorded speech data. The model's ability to predict speech occurring outside the labeled region helps to validate the model's generalization capabilities. Ultimately, this visualization provides an indication as to how the model would perform in practice. For instance, a predicted likelihood that oscillates frequently may achieve reasonable accuracy but ultimately be unreliable for use in a classification pipeline.

B. Spectral Band Convergence
A key aspect of this model's utility is its ability to learn spectral bands that minimize the loss function of the network. When the band parameters are combined with the loss and cross validation loss for each training batch, a visualization of the band convergence over time can be obtained. This visualization can serve several purposes. For the present analysis it serves as an additional method of model vetting and interpretation, to establish the frequency bands the model identified as empirically predictive. For other analyses, it could serve as an exploratory tool to investigate whether frequency information is central to the phenomenon.

C. Comparison Models and Benchmarks
1) Randomization Tests:
In order to compare the model performance to random chance, model prediction was assessed when trained on randomly labeled segments. The labeling scheme maintained a proportional amount of speaking/not-speaking labels, and thus the chance accuracy should be 50%. To confirm this, the train and test paradigms were kept identical, except that before training, each labeled segment was randomly assigned a 'speaking' or 'not-speaking' label. The hyperparameters chosen for model configuration were 1-Band with a dropout of P = 0.5.
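One way to randomize segment labels while keeping the class proportions fixed is to permute them, as sketched below. Whether the authors permuted existing labels or drew new ones is not stated, so the permutation approach and the function name are assumptions.

```python
import numpy as np


def shuffle_segment_labels(segment_labels, seed=0):
    """Randomly reassign labels across segments, preserving class counts."""
    rng = np.random.default_rng(seed)
    return rng.permutation(np.asarray(segment_labels))


# 115 'speaking' (1) and 115 'not-speaking' (0) segments, as in one session.
true_labels = np.array([1, 0] * 115)
random_labels = shuffle_segment_labels(true_labels)
```

Because the shuffled labels carry no relationship to the neural data, a model trained on them should score near 50% on a balanced test set.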
2) Auditory Cortex Electrode Removal: To verify that classification performance was not merely being driven by auditory feedback, electrodes in the auditory cortex region were manually identified based on anatomical landmarks and removed from the analysis (see Figure 1). An abbreviated evaluation of SincIEEG was performed to confirm that the classification performance was not significantly degraded by the exclusion of the auditory electrodes. Optimization time of these additional models was reduced by using early stopping as described in Section III-E. Additional testing verified that early stopping does not unfavorably bias the resulting model performance.
3) LDA and SVM Benchmarks: To explore whether the frequency bands that the SincIEEG model identified would confer some benefit over using the entire broadband spectrum, the performance using the bands that 3-band SincIEEG learned for each participant was compared to the performance using broadband activity from 0.5-170 Hz frequencies. The 3-band version was chosen to compare because it is more distinct from broadband than the 5-band version which generally occupies a greater proportion of the spectrum. A Linear Discriminant Analysis (LDA) and a linear Support Vector Machine (SVM) were implemented as performance benchmarks. Because these comparatively simple classifiers are not capable of attaining reasonable performance using raw ECoG timesamples, a preprocessing method derived from [13] was implemented that generates a band power aggregate measure over a 500 ms window that updates every 50 ms. The labels were accordingly downsampled to 20 Hz. For each label, the preceding 500 ms of the corresponding preprocessed ECoG signals were used to compute the input features. The resulting feature array was flattened into a vector for training the LDA and SVM models. This process was performed for both the broadband and 3-band SincIEEG versions.
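The windowed feature extraction for the LDA and SVM benchmarks can be sketched as below. The exact band power measure from [13] is not given in the text, so log mean squared amplitude is used here purely as a stand-in. The function name and the (channels, bands, samples) layout are also assumptions. The 500 ms window and 50 ms update come from the description above.

```python
import numpy as np


def band_power_features(band_signals, fs=600, win_s=0.5, step_s=0.05):
    """Causal band power features at 1/step_s Hz.

    band_signals: (n_channels, n_bands, n_samples), already bandpass filtered.
    Each output row uses only the `win_s` seconds preceding its time point
    and is flattened to length n_channels * n_bands.
    """
    win, step = int(round(win_s * fs)), int(round(step_s * fs))
    n_samples = band_signals.shape[-1]
    ends = np.arange(win, n_samples + 1, step)  # exclusive window ends
    rows = []
    for end in ends:
        power = np.mean(band_signals[..., end - win:end] ** 2, axis=-1)
        rows.append(np.log(power + 1e-12).ravel())  # log power is an assumption
    return np.stack(rows)


# Two seconds of synthetic data: 4 channels, 3 bands.
signals = np.random.default_rng(0).standard_normal((4, 3, 1200))
features = band_power_features(signals)
```

Each row of `features` would then be paired with the label at the corresponding 20 Hz time point before fitting the linear classifiers.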

4) Standard CNN:
To establish how SincIEEG performs compared to a traditional deep learning method, a standard CNN based on [42] was implemented and evaluated. For this CNN, the first convolutional layer aggregates across time with a kernel size and stride of five samples, and a dilation of two samples to further downsample. The next layer maintains the kernel size and stride, but returns to the default dilation of one. The remaining two convolutional layers learn 3x3 kernels with unit stride and dilation until a final dense layer outputs to a sigmoid activation. A total of 16 filters were learned in each convolutional layer. The standard convolutional network model is an important alternative to SincIEEG as it uses the same convolution operation but is not directly interpretable. The training and testing paradigms remained unchanged; only the model architecture was exchanged.
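To see how quickly the first two layers reduce the time dimension, the standard convolution output-length formula can be applied to a 300-sample window. Zero padding is assumed here because the text does not specify any. The helper name is illustrative.

```python
def conv_out_len(n_in, kernel, stride=1, dilation=1, padding=0):
    """Output length of a 1-D convolution along one dimension."""
    return (n_in + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


# Layer 1: kernel 5, stride 5, dilation 2 on a 300-sample window.
n1 = conv_out_len(300, kernel=5, stride=5, dilation=2)
# Layer 2: kernel 5, stride 5, default dilation of 1.
n2 = conv_out_len(n1, kernel=5, stride=5, dilation=1)
```

Under the no-padding assumption the time dimension shrinks from 300 to 59 and then to 11 samples before the 3x3 layers are applied.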

V. RESULTS

A. Prediction Accuracy
The average SincIEEG model accuracy across all participants was 94.1% (s.e. 3.5%), and all but one participant achieved an accuracy above 90%. Figure 3 shows the accuracy of all model configurations per participant with each configuration repeated three times. Results from Participants 1 and 2 were very consistent regardless of hyperparameter, while Participant 3 showed significant variability in the 3- and 5-band versions, and Participant 5 performed better without dropout. These differences are most likely mediated by electrode number and placement. However, the ability of the model to achieve good performance on such a variety of electrode locations is a testament to its robustness, and to the advantages of a participant-specific feature set.
As described in Section II-D, target labels were created from the timings of experiment cues, rather than the participant's speech. Therefore, to better gauge performance for practical speech detection applications, predictions were qualitatively assessed by visual inspection and assigned to one of three categories: Full Success, Partial Success, and Failure.
A word trial was considered a Full Success if the prediction captured the entirety of the spoken word prior to onset and [...]
For each participant's best model configuration, the model with the best cross-validation performance was selected and its test-set predictions were assessed using the criteria described above. Table II shows the proportion of words assigned to each category for a 115 word test set for each participant for the respective best model configuration. Participant 1 and 2 models were able to very consistently predict speech before speech onset, suggesting that the model and electrode location combination may capture aspects of speech planning. Participant 3 and 4 models had a majority of partial successes. These trials largely exhibited clipping of the beginning portion of words, suggesting that the model may be capturing aspects of speech production rather than speech planning.
[...] band evolutions during training when dropout is included in the model. With dropout, bands tended to converge more smoothly, rather than exhibiting large jumps in value as observed without dropout. With shared parameters, zeroing a sensor channel eliminates its influence and subsequently allows other sensors of varying magnitudes to drive parameter updates. Furthermore, zeroed sensors bias downstream normalization layer statistics towards zero. It is posited that these aspects result in the higher variance stochastic search of frequencies illustrated in Figure 5.

B. Spectral Band Convergence
The final bands learned for each participant, aggregated across sessions and model configurations, are shown in Figure 6, with the bands aggregated across participants shown in Figure 7. For better visualization, only SincIEEG models with performance in the top 50% for each participant are included in the figures. The bands are superimposed on a single frequency spectrum as a density plot at high transparency. Each band is plotted in a different color, with more saturated hues representing frequencies common across more participants and model configurations than less saturated hues. This provides a compact conceptualization of the final converged frequencies across models.
For the 1-band case, the general tendency is for the band to be broad. However, the aggregated data shows that the bands commonly overlapped around 25-75 Hz, implying the lower frequency band may be more predictive than high gamma for the task, as supported by [19].
The 3-band case indicates one lower-frequency band in a narrow range from 20-40 Hz, a broader middle band roughly spanning 120-200 Hz, and a high frequency band converging above 250 Hz. The 5-band case shows similar bands at the low and high ends of the spectrum, with intermediate [...]
A benefit of the interpretability of learning frequency bands is that the results can be directly compared to known physiologically-relevant bands. Kanas et al. examined 8 Hz wide frequency bands from 0 to 248 Hz, and produced a histogram ranking bins by contribution to speech detection [43]. The histogram is multi-modal, with two larger peaks, one spanning 0-40 Hz and one 180-200 Hz, and two smaller, broader peaks in the intermediate frequencies.

C. Comparison and Benchmarks
Table III shows the performance of all validation measures in comparison to SincIEEG. The SincIEEG and SincIEEG Non-Auditory results are the mean test fold accuracy for each participant's best performing model configuration, effectively the highest bar for each participant in Figure 3. Excluding the auditory cortex electrodes did not significantly impact model performance. The causal formulation of the model, and accurate capture of speech onset within the predicted speech window, provide a strong indication that perception of speech was not a driver of the model classification accuracy. The CNN architecture performance is overall on par with SincIEEG. This shows that the interpretable and parsimonious architecture of the SincNet does not compromise model performance.
The bands identified by the 3-band SincIEEG for each participant were compared to a broadband approach and classified with LDA and SVM. For both classifiers across participants, using learned bands instead of the broadband showed an improvement in classification accuracy. This implies that SincIEEG provides unique and relevant features due to the participant-specific, empirical, and/or parsimonious nature of the learned SincIEEG bands.
It should be noted that, regardless of whether using learned bands or broadband, the LDA and SVM classifiers with the preprocessed ECoG signals did not achieve better results than SincIEEG. Additionally, SincIEEG was able to achieve better results with greater time-domain resolution than the methods using the preprocessed features.

VI. DISCUSSION
This work introduces SincIEEG, a deep learning model with an interpretable architecture. SincIEEG is capable of detecting overt speech using unprocessed ECoG recordings based on a diversity of electrode coverage. SincIEEG meets or exceeds the performance of other ECoG speech detectors, with several additional advantages.
In prior work on using ECoG for speech activity detection, Kanas et al. achieved maximum accuracies of 92% [22], and 98.8% with non-deep-learning classifiers [43]. Other studies used the detection model as part of a larger speech decoding analysis and so did not report specific results on speech detection performance [15], [16]. In comparison to SincIEEG, which uses unprocessed ECoG recordings, these approaches require appreciable signal preprocessing prior to speech detection. Since the feature extraction is inherent in SincIEEG, any latency introduced via explicit, potentially suboptimal, data-independent preprocessing is mitigated in the processing pipeline, which is critical for real-time implementation.
The architecture of SincIEEG is CNN-based, like that of the foundational work of EEGNet, which showed the viability of CNNs for several tasks using non-invasive EEG signals [44]. The EEGNet architecture was subsequently extended for application in a movement task to intracranial signals, including the addition of a spatial component [45]. This approach is also capable of determining data-driven frequency features, albeit in a manner distinct from SincIEEG. While it is demonstrated that SincIEEG is capable of speech activity detection from ECoG signals, the original implementation was used for acoustic speech detection [23], and it has also been applied to EMG signals [30]. Using a related approach for seizure detection using non-invasive EEG, Fukumori et al. showed that a data-driven approach was superior to static filter banks [46]. Such models that learn the task-relevant spectral bands can be applied to other domains where frequency analysis is central. This is mainly due to the utility of learning bandpass filters, and the flexibility of the scope on which different filters can be learned.
In terms of interpretability, visualization of the learned bands provides a unique modality for studying the relevant spectral features. One consistent observation is that, across all 1-, 3-, or 5-band models and all participants, a low frequency component was always included. This supports prior work that suggests lower frequency features can play a key role in speech detection in addition to broadband gamma [7], [43]. While the present analysis did not attempt to specifically identify the subset of electrodes related to speech production processes, due to the consistent performance results regardless of the hemisphere of the implant, it is expected that the contributions are largely from the ventral primary motor cortex as shown in prior work [6], [11], [13], [47].
Beyond interpretability, the SincNet architecture's flexibility in learning different combinations of relevant frequency bands makes it promising for implementing transfer learning to leverage existing data for development and training of generalizable models. Gathering sufficient data and learning robust models for new participants is challenging, particularly for intracranial recordings where available data is limited and the electrode locations are generally sparse and not consistent across participants. In this context, transfer learning can be used to refine the model on a new participant's data after having learned its initial parameters from other participants' data, which can significantly reduce training time and improve model robustness and performance.
Because SincIEEG is capable of learning task-relevant spectral bands across multiple participants independent of precise electrode locations, it has the potential to learn generalized bands for brain regions sampled by the population of electrodes across participants. Furthermore, specific bands can be learned for channel context labels, such as in which brain region an electrode resides. This allows for encoding a spatial component to the transfer learning, initializing different bands dependent on electrode location.
Ultimately, toward the development of a practical speech neuroprosthetic, future work must examine the efficacy of SincIEEG on transfer learning and, moreover, on imagined speech and integration with the subsequent speech decoding pipeline.