Rethinking Delayed Hemodynamic Responses for fNIRS Classification

Functional near-infrared spectroscopy (fNIRS) is a non-invasive neuroimaging technology for monitoring cerebral hemodynamic responses. Enhancing fNIRS classification can improve the performance of brain-computer interfaces (BCIs). Currently, deep neural networks (DNNs) do not consider the inherent delayed hemodynamic responses of fNIRS signals, which causes many optimization and application problems. Considering the kernel size and receptive field of convolutions, we introduce delayed hemodynamic responses as domain knowledge into fNIRS classification and propose a concise and efficient model named fNIRSNet. We empirically summarize three design guidelines for fNIRSNet. In subject-specific and subject-independent experiments, fNIRSNet outperforms other DNNs on open-access datasets. Specifically, fNIRSNet with only 498 parameters achieves 6.58% higher accuracy than a convolutional neural network (CNN) with millions of parameters on mental arithmetic tasks, and the floating-point operations (FLOPs) of fNIRSNet are much lower than those of the CNN. Therefore, fNIRSNet is friendly to practical applications and reduces the hardware cost of BCI systems. It may inspire more research on knowledge-driven models for fNIRS BCIs. Code is available at https://github.com/wzhlearning/fNIRSNet.


I. INTRODUCTION
FUNCTIONAL near-infrared spectroscopy (fNIRS) is a non-invasive neuroimaging technology that records changes in the concentration of oxygenated hemoglobin (HbO) and deoxygenated hemoglobin (HbR) by measuring the absorption of near-infrared light between 650 and 950 nm [1]. Brain-computer interfaces (BCIs) decode signals from patients suffering from movement disorders to establish non-muscle communication with the external environment [2]. Owing to its non-invasiveness, user-friendliness, and portability [3], fNIRS has attracted attention in the BCI community.
Methods of classifying fNIRS signals include traditional machine learning and emerging deep learning. Statistical features (mean, variance, peak, kurtosis, skewness, and slope) are extracted from fNIRS signals to train support vector machines (SVM), linear discriminant analysis (LDA), and k-nearest neighbors (KNN) [4], [5]. Vector-based phase analysis, including change in cerebral blood volume (ΔCBV), change in cerebral oxygen exchange (ΔCOE), vector magnitude, and angle, is also commonly used to train these classifiers [6], [7]. However, traditional classifiers rely heavily on manual feature engineering and prior knowledge. In recent years, deep learning has become the mainstream of fNIRS classification research. Convolutional neural networks (CNNs), long short-term memory (LSTM), and Transformers have been developed for fNIRS classification [8], [9], [10], [11]. Deep learning is notoriously data-hungry, but limited fNIRS data severely hinders its application. Unfortunately, the scarcity of fNIRS data is difficult to address in the short term: the high cost of fNIRS equipment limits the acquisition scale, and burdensome signal acquisition procedures limit the number of participants. Although some complicated models have been developed, the insufficiency of fNIRS data still limits improvements in classification performance. More importantly, the domain knowledge of fNIRS signals is not exploited. Changes in HbO and HbR reflect a slow metabolic process manifested as delayed hemodynamic responses, which are inherent properties of fNIRS signals. Hence, the number of sampling points per unit time is lower than for high temporal resolution signals such as electroencephalography (EEG) [12]. Delayed hemodynamic responses occur at both the onset and cessation of neuronal activity [13], [14], [15]. Nambu et al. [16] found a 4 s hemodynamic delay when measuring human motor-cortical activation. Shin et al. [4] found that fNIRS classification accuracies reach their maximum after a delay of several seconds. HbO and HbR do not change significantly in the first few seconds of experimental stimulation, while they still exhibit strong hemodynamic responses after the stimulation is over.
Unlike computer vision and natural language processing, which are supported by large-scale data, the fNIRS field may not suit some general design principles, such as deeper architectures and small convolutions. To improve classification performance, researchers tend to design more complex network architectures by increasing the number of kernels and the network depth. However, these operations may lead to over-parameterization and overfitting on limited fNIRS data. Researchers then have to adopt more regularization methods, such as dropout [17] and flooding [18], to solve these tricky problems. In addition, some studies [19], [20], [21] use small convolutional kernels, e.g., 3 × 3 and 4 × 4, to extract fNIRS signal features. He et al. [22] used even smaller kernel sizes of 2 × 1 and 1 × 4 to extract temporal and spatial features, respectively. One-dimensional (1D) CNNs are also popular for processing fNIRS signals, and their kernel size is usually at least three [23], [24]. The biggest issue is that fNIRS signals are fed directly into deep neural networks (DNNs) without considering domain knowledge. It is challenging for a small kernel with a limited receptive field to extract features of delayed hemodynamic responses because there is no significant change in HbO and HbR within small neighborhoods. However, stacking more convolutional layers to obtain larger receptive fields may cause overfitting on limited fNIRS data. Therefore, we rethink delayed hemodynamic responses and systematically explore a simple but efficient design philosophy for a deep learning-based fNIRS classification model.
In this study, two core ideas are presented: 1) delayed hemodynamic responses, as domain knowledge, should be introduced into fNIRS classification models; 2) a simple and efficient model is beneficial for practical applications on limited fNIRS data. We propose a compact fNIRS classification network named fNIRSNet, which consists of three convolutional layers and one fully connected (FC) layer without pooling, dropout, or other complicated structures. Three design guidelines are empirically summarized for fNIRSNet: 1) the size of convolutional kernels is critical for extracting features of delayed hemodynamic responses and decoupling network depth and receptive fields; 2) concatenating standard convolutions and depthwise separable convolutions can balance the stability, speed, and efficiency of fNIRSNet; 3) activation functions with saturated negative values can alleviate information loss in the first layer. fNIRSNet achieves superior performance on open-access datasets while requiring extremely few parameters and little computation. To the best of our knowledge, fNIRSNet is the least resource-consuming deep learning-based fNIRS classification model. Our study illustrates that a compact model infused with domain knowledge outperforms big models in the fNIRS field. These advantages make fNIRSNet more valuable for applications on mobile and embedded devices. Code is available at https://github.com/wzhlearning/fNIRSNet.
The rest of this article is organized as follows. Section II describes the design ideas of fNIRSNet. Section III introduces open-access datasets, signal preprocessing, and evaluation protocols. In Section IV, comprehensive experiments demonstrate the superiority of fNIRSNet. Discussion is provided in Section V. Finally, Section VI concludes this article.

A. Hemodynamic Response
Neurovascular coupling, which links changes in neural activity to cerebral blood flow (CBF), is the cornerstone of many functional neuroimaging techniques based on hemodynamic responses [25], such as functional magnetic resonance imaging (fMRI) [26] and fNIRS. fMRI measures blood oxygenation level dependent (BOLD) signals that are modeled as a convolution of the hemodynamic response function and the stimulus function. The hemodynamic response function can be generated by three gamma functions Γ(·) [27]:

h(t) = Σ_{i=1}^{3} A_i β_i^{α_i} t^{α_i − 1} e^{−β_i t} / Γ(α_i),

where A, α, and β control the height and direction, the shape, and the scale of hemodynamic responses, respectively. Fig. 1 illustrates the canonical hemodynamic response function. It is divided into three phases: initial dip, positive response, and post-stimulus undershoot [28]. In the fNIRS field, the initial dip manifests as an initial increase/decrease in HbR/HbO, which is associated with neural activity consuming oxygen in nearby local regions. The positive/negative response for HbO/HbR is caused by a large increase in CBF, usually manifested as an increase in HbO and a decrease in HbR. The post-stimulus period is characterized by an undershoot of HbO and an overshoot of HbR; this period typically starts between 10 and 20 s after stimulus cessation and lasts up to 60 s [29]. The main reasons for the post-stimulus undershoot are a continued elevation of the metabolic rate of oxygen and delayed vascular compliance [30].
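As an illustration of this gamma-mixture construction, the following sketch evaluates such a response curve. The function form and all parameter values are our own illustrative assumptions (chosen so the three phases are visible), not the paper's exact constants.

```python
import math

def gamma_pdf(t, alpha, beta):
    # Gamma density: beta^alpha * t^(alpha-1) * exp(-beta*t) / Gamma(alpha)
    if t <= 0:
        return 0.0
    return (beta ** alpha) * t ** (alpha - 1) * math.exp(-beta * t) / math.gamma(alpha)

def hrf(t, terms):
    # Sum of scaled gamma densities; A controls height and direction,
    # alpha the shape, beta the scale (matching the roles in the text).
    return sum(A * gamma_pdf(t, alpha, beta) for A, alpha, beta in terms)

# Illustrative parameters (not from the paper): a small initial dip,
# a dominant positive response, and a post-stimulus undershoot.
TERMS = [(-0.20, 2.0, 1.5),   # initial dip
         (1.00, 6.0, 1.0),    # positive response
         (-0.25, 16.0, 1.0)]  # post-stimulus undershoot

curve = [hrf(0.1 * i, TERMS) for i in range(400)]  # 0-40 s sampled at 10 Hz
peak_time = 0.1 * max(range(400), key=lambda i: curve[i])
```

With these placeholder parameters the curve dips slightly below baseline at first, peaks a few seconds in, and undershoots afterwards, mirroring the three phases of Fig. 1.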
The delayed hemodynamic response is an inherent property and a major limitation of fNIRS signals. Limited by local receptive fields, small convolutions struggle to model the long-term dependency of hemodynamic responses. Therefore, we hypothesize that convolutions with fNIRS channel-level receptive fields can extract delayed response features, and convolutions with global receptive fields can explore activation patterns of different brain regions.

B. fNIRSNet
1) Notation:
The fNIRS tensor is defined to facilitate the following description. In Fig. 2(a), HbO and HbR are arranged to form an fNIRS tensor X ∈ R^(C×S×D), where C is twice (two chromophores: HbO and HbR) the number of fNIRS channels, S = f × T is the number of sampling points, f is the sampling frequency, T is the sampling time, and the depth D is 1.
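The tensor construction above can be sketched in a few lines. The dataset dimensions here (20 channels, 10 Hz, 3 s window) are illustrative placeholders, not fixed properties of the model.

```python
import numpy as np

# Hypothetical dimensions (for illustration only): 20 fNIRS channels,
# sampled at f = 10 Hz for a T = 3 s window.
n_channels, f, T = 20, 10, 3
S = f * T                             # number of sampling points
hbo = np.random.randn(n_channels, S)  # HbO concentration changes
hbr = np.random.randn(n_channels, S)  # HbR concentration changes

# Stack the two chromophores along the channel axis, so C = 2 * n_channels,
# then add a trailing depth axis D = 1 to form X in R^(C x S x 1).
X = np.concatenate([hbo, hbr], axis=0)[:, :, np.newaxis]
```

The resulting shape is (40, 30, 1) for this configuration, i.e., C × S × D.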
2) Overview: The architecture of fNIRSNet is illustrated in Fig. 2(a). fNIRSNet consists of a delayed hemodynamic response module (DHR Module) and a global module. Specifically, fNIRSNet contains three convolutional layers with different kernel sizes and a fully connected (FC) layer. Delayed hemodynamic response convolutions (DHRConv) extract channel-level features of delayed hemodynamic responses. Depthwise separable convolutions, consisting of depthwise convolutions (DWConv) and pointwise convolutions (PWConv), are used to reduce model parameters [31]. Batch normalization (BN) accelerates network training and improves classification performance [32]. The sigmoid activation function is used for nonlinear activation and alleviates information loss in the first layer. The feature maps are flattened into 1D vectors and then fed into an FC layer. Finally, a softmax function calculates the conditional probabilities of the K classes. The proposed fNIRSNet does not use pooling, dropout, or other complicated structures. Overall, fNIRSNet is concise and efficient, and comprehensive experiments demonstrate its superiority. Three design guidelines are empirically summarized below.
3) Guideline 1: The size of convolutional kernels is critical for extracting features of delayed hemodynamic responses and decoupling network depth and receptive fields. Table I shows the configurations of fNIRSNet. For an fNIRS tensor X ∈ R^(C×S×1), the kernel size of DHRConv is 1 × S, which means that the width of this kernel equals the number of input sampling points.
4) Guideline 2: Concatenating standard convolutions and depthwise separable convolutions can balance the stability, speed, and efficiency of fNIRSNet. In Fig. 2(c), a depthwise separable convolution is a factorized convolution that splits a standard convolution into DWConv and PWConv [31]. DWConv applies a filter to each input channel, and PWConv projects the output of DWConv into a new channel space. Compared with standard convolutions, depthwise separable convolutions reduce computational cost significantly. The computational cost of a standard convolution is

K_h × K_w × F_in × F_out × M_h × M_w,

where K_h × K_w is the kernel size, F_in is the number of input channels, F_out is the number of output channels, and M_h × M_w is the size of the feature map. A depthwise separable convolution has a computational cost of

K_h × K_w × F_in × M_h × M_w + F_in × F_out × M_h × M_w.

Note that DHRConv is a standard convolution in the first layer because depthwise separable convolutions perform poorly in low-dimensional space (the first layer) [33]. Applying a standard convolution at the first layer trades stability against speed. The computational cost of DHRConv is S × F_1 × C. The computational cost of the global module is C × F_1 × F_2 with standard convolutions and C × F_1 + F_1 × F_2 with depthwise separable convolutions. Owing to the small number of flattened neurons, an FC layer without dropout is used for classification. Except for the numbers of filters F_1 and F_2, fNIRSNet has almost no other hyperparameters. Therefore, fNIRSNet is friendly to BCI devices because it has few parameters and low computational cost.
5) Guideline 3: Activation functions with saturated negative values can alleviate information loss in the first layer. We found that activation functions with saturated negative values (e.g., sigmoid, hyperbolic tangent (tanh), and exponential linear unit (ELU) [34]) work better than other mainstream activation functions (e.g., ReLU and leaky ReLU (LReLU) [35]). ELU, ReLU, and LReLU alleviate vanishing gradients caused by increasing model depth via the identity for positive values. However, we can ignore vanishing gradients because fNIRSNet is a shallow model. Since fNIRSNet has very few trainable parameters, inappropriate activation functions lead to information loss in the first convolutional layer (i.e., DHRConv), which is a real concern for our study. A negative input to ReLU is not activated, so backpropagation cannot update the corresponding weights; this is called the dead neuron problem. Although LReLU avoids dead neurons through a small non-zero gradient, it cannot ensure a noise-robust deactivation state [34]. Sigmoid and tanh are bilateral saturation activation functions, while ELU has a one-sided negative saturation that reduces forward-propagated variation and information [34]. In Section IV-C, ablation experiments validate Guideline 3.
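The cost comparison in Guideline 2 can be checked numerically. The helper functions below and the feature-map sizes we assume for each stage (a C × 1 map after DHRConv, a 1 × 1 output after the global module) are our own sketch of the multiply-accumulate counts, evaluated for the subject-specific setting (C = 40, S = 40, F_1 = 4, F_2 = 8).

```python
def conv_cost(kh, kw, f_in, f_out, mh, mw):
    """Multiply-accumulate cost of a standard convolution."""
    return kh * kw * f_in * f_out * mh * mw

def dws_cost(kh, kw, f_in, f_out, mh, mw):
    """Cost of a depthwise separable convolution (DWConv + PWConv)."""
    return kh * kw * f_in * mh * mw + f_in * f_out * mh * mw

C, S, F1, F2 = 40, 40, 4, 8
# DHRConv: 1 x S standard convolution over the C x S input -> C x 1 map.
dhr = conv_cost(1, S, 1, F1, C, 1)          # S * F1 * C
# Global module: C x 1 kernel applied to the C x 1 map -> 1 x 1 output.
global_std = conv_cost(C, 1, F1, F2, 1, 1)  # C * F1 * F2
global_dws = dws_cost(C, 1, F1, F2, 1, 1)   # C * F1 + F1 * F2
```

Under these assumptions the depthwise separable global module costs 192 multiply-accumulates versus 1280 for its standard counterpart, which is the saving Guideline 2 exploits.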

C. Label Smoothing
Label smoothing is commonly used to prevent DNNs from over-confidence by taking a weighted average of the hard targets and a uniform distribution over labels [36]. A network predicts the probability of each class label k ∈ {1, ..., K}:

p_k = exp(z_k) / Σ_{i=1}^{K} exp(z_i),

where z_i is the logit. The cross-entropy loss function is defined as

L = − Σ_{k=1}^{K} y_k log p_k,

where y_k = 1 for the ground truth and y_k = 0 for the rest. Label smoothing is defined as

y'_k = (1 − ε) y_k + ε u_k,

where ε is the smoothing parameter (set to 0.1 by default) and u_k = 1/K is the uniform distribution. Finally, the cross-entropy loss function with label smoothing is written as

L_LS = − Σ_{k=1}^{K} y'_k log p_k.
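The three equations above translate directly into code. This is a minimal sketch (the function names are ours, and the logits are made up for the demonstration):

```python
import math

def softmax(logits):
    # Numerically stable softmax: p_k = exp(z_k) / sum_i exp(z_i)
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def smoothed_cross_entropy(logits, target, eps=0.1):
    # Smoothed targets y'_k = (1 - eps) * y_k + eps / K, then
    # L = -sum_k y'_k * log p_k.
    K = len(logits)
    p = softmax(logits)
    y = [(1 - eps) * (1.0 if k == target else 0.0) + eps / K for k in range(K)]
    return -sum(y_k * math.log(p_k) for y_k, p_k in zip(y, p))

# A confident prediction incurs a slightly larger loss with smoothing on,
# which is exactly the over-confidence penalty described above.
loss_hard = smoothed_cross_entropy([5.0, 0.0, 0.0], target=0, eps=0.0)
loss_soft = smoothed_cross_entropy([5.0, 0.0, 0.0], target=0, eps=0.1)
```

With eps = 0.0 the function reduces to plain cross-entropy.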
1) MA: This dataset consists of 29 healthy subjects (14 males, average age 28.5 ± 3.7 years) [4]. For the mental arithmetic (MA) task, the subjects were instructed to perform subtraction, such as "three-digit number minus one-digit number," according to on-screen and short beep instructions. For the baseline (BL) task, they were instructed to relax while gazing at a black fixation cross on the screen. Each subject performed 30 trials of each task. This is a hybrid EEG-fNIRS dataset, but only the fNIRS signals are used in our experiments.
2) UFFT: This dataset contains fNIRS signals from 30 subjects (17 males, 23.4 ± 2.5 years old) for a ternary classification task [5]. During the task period, the subjects were required to randomly perform three types of overt movements according to instructions on the screen: right-hand finger-tapping (RHT), left-hand finger-tapping (LHT), and foot-tapping (FT). Each movement was performed randomly for 25 trials. The subjects were instructed to relax during the rest period.
The MA dataset contains MA and BL categories, and the UFFT dataset includes RHT, LHT, and FT categories.

B. Signal Preprocessing
Following the original studies [4], [5], the fNIRS signals of MA and UFFT are downsampled to 10 Hz and 13.3 Hz, respectively. Signal preprocessing usually includes the modified Beer-Lambert law [37], filtering, segmentation, and baseline correction. The modified Beer-Lambert law converts optical density changes ΔOD into concentration changes of HbO and HbR from the absorption of near-infrared light. At time t and wavelength λ, it is described as

ΔOD(t, λ) = (ε_HbO(λ) ΔHbO(t) + ε_HbR(λ) ΔHbR(t)) × d × l,

where ε_HbO(λ) and ε_HbR(λ) are the extinction coefficients of HbO and HbR at wavelength λ, d is the differential path-length factor, and l is the distance between source and detector.
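With measurements at two wavelengths, the law above becomes a 2 × 2 linear system in (ΔHbO, ΔHbR) that can be inverted directly. The sketch below demonstrates the round trip; the extinction coefficients, d, and l are illustrative placeholders, not calibrated constants.

```python
import numpy as np

# Placeholder extinction coefficients (rows: wavelengths; columns: HbO, HbR).
eps = np.array([[1.5, 3.8],    # [eps_HbO, eps_HbR] at lambda_1
                [2.5, 1.8]])   # [eps_HbO, eps_HbR] at lambda_2
d, l = 6.0, 3.0                # path-length factor, source-detector distance

def mbll_invert(delta_od):
    """Solve delta_od[i] = (eps[i,0]*dHbO + eps[i,1]*dHbR) * d * l
    for the concentration changes (dHbO, dHbR)."""
    return np.linalg.solve(eps * d * l, np.asarray(delta_od))

# Round trip: forward-model known concentration changes, then recover them.
true_conc = np.array([0.02, -0.01])       # (dHbO, dHbR)
delta_od = (eps @ true_conc) * d * l
recovered = mbll_invert(delta_od)
```

The inversion is well-posed as long as the two wavelengths give linearly independent extinction rows, which is why real systems pair one wavelength below and one above the HbO/HbR isosbestic point.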

C. Evaluation Protocols
Currently, evaluation protocols for fNIRS classification are confusing, and some experimental details are inadequately described. In addition, few researchers release their source code for the fNIRS community. It is therefore difficult to reproduce these studies and make fair comparisons. We discuss this issue in Section V. As shown in Fig. 4, we adopt more general and transparent protocols: subject-specific and subject-independent [11], [38].
1) Subject-Specific: DNNs are trained for each subject using 5-fold cross-validation (KFold-CV) that splits the training and test sets by trial to avoid information leakage. For example, each subject in MA includes 60 trials, of which 48 trials are used as the training set and 12 trials as the test set. Thus, the training set contains 480 samples (48 trials × 10 segments) and the test set includes 120 samples (12 trials × 10 segments). The final experimental results are the average over all subjects' test sets.
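A minimal sketch of this trial-wise split (the helper function and its defaults are illustrative, not the paper's code). The key point is that folds are drawn over trial indices, so segments from one trial can never land on both sides of the split:

```python
import random

def trialwise_kfold(n_trials=60, n_folds=5, segs_per_trial=10, seed=0):
    """Yield (train, test) lists of (trial, segment) pairs, splitting by
    trial rather than by segment to avoid information leakage."""
    trials = list(range(n_trials))
    random.Random(seed).shuffle(trials)
    fold_size = n_trials // n_folds
    for k in range(n_folds):
        test_trials = set(trials[k * fold_size:(k + 1) * fold_size])
        train = [(t, s) for t in trials if t not in test_trials
                 for s in range(segs_per_trial)]
        test = [(t, s) for t in test_trials for s in range(segs_per_trial)]
        yield train, test

train, test = next(trialwise_kfold())
```

For the MA numbers above (60 trials, 10 segments each), each fold yields 480 training and 120 test samples.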
2) Subject-Independent: Leave-one-subject-out cross-validation (LOSO-CV) rigorously validates inter-individual differences and model generalization. One subject's data is used as the test set and the remaining subjects' data as the training set. The process is repeated until every subject's data has been tested. The reported results are the average over all subjects.

D. Experimental Settings
In the subject-specific experiments, F_1 and F_2 of fNIRSNet are 4 and 8, respectively. Considering the increase in training samples, F_1 and F_2 are set to 16 and 32 in the subject-independent experiments, respectively. For the baseline models, the hyperparameters of the Transformer-based fNIRS-T [11] are adjusted to fit the size of the input fNIRS signals. The kernel sizes of Conv S and Conv C are 5 × 10 and 1 × 10, respectively. The number of Transformer layers of fNIRS-T is set to 4, and the dimension of the linear projection and multi-layer perceptron (MLP) layers is set to 32. The other baselines, CNN [8], LSTM [8], and 1D-CNN [23], follow the original references. The CNN contains three convolutional layers, whereas the 1D-CNN consists of six 1D convolutional layers; both use BN and ReLU. The LSTM has three LSTM layers, each with 20 LSTM cells.
All models are optimized by AdamW [39] with an initial learning rate of 0.001. Label smoothing is used to improve model generalization. For the subject-specific experiments, all models are trained with a batch size of 64 for 120 epochs, and the initial learning rate is decayed by a factor of 10 at epochs 60 and 90. For the subject-independent experiments, we apply a cosine learning rate schedule [40] for 30 epochs, with the maximum number of iterations set to 30.
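The two schedules can be sketched as plain functions of the epoch index (the helper names are ours; the constants follow the settings above):

```python
import math

def step_lr(epoch, base_lr=0.001, milestones=(60, 90), factor=0.1):
    # Subject-specific schedule: decay by 10x at epochs 60 and 90.
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= factor
    return lr

def cosine_lr(epoch, base_lr=0.001, t_max=30):
    # Subject-independent schedule: cosine annealing over t_max epochs.
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / t_max))
```

The step schedule holds 1e-3 until epoch 60, then 1e-4, then 1e-5 from epoch 90; the cosine schedule decays smoothly from 1e-3 to 0 over 30 epochs.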

A. Comparison With DNNs
Comparison experiments demonstrate that fNIRSNet has clear advantages. The experimental results (mean ± standard deviation) are shown in Table II. In the subject-specific experiments, fNIRSNet achieves the highest average accuracy on the test sets (Wilcoxon signed-rank test, p < 0.001). The average accuracy and F1-score of fNIRSNet are 5% higher than those of fNIRS-T on MA. All metrics of fNIRSNet are significantly higher than those of the other models (Wilcoxon signed-rank test, p < 0.001). The performance of LSTM is lower than that of the other models because individual differences and the scale of fNIRS data prevent LSTM from capturing context information.
The subject-independent experiments assess model generalization because the target subject is not involved in parameter tuning or model training. In Table II, all performance metrics deteriorate overall. However, fNIRSNet still outperforms the other models. Owing to individual differences and limited data, it is important to collect data from target subjects to customize a model. He et al. [22] reported that the accuracy of motor imagery classification decreases by 20% in subject-independent experiments. Although deep models perform better in subject-specific experiments than in subject-independent ones, the subject-independent setting is more suitable for practical applications because it significantly reduces training cost and calibration time.
These results demonstrate the importance of domain knowledge. DHRConv has a channel-level receptive field to extract features of delayed hemodynamic responses, and DWConv with a global receptive field aggregates spatial information.

B. Comparison of Efficiency
fNIRSNet is a lightweight model with extremely few parameters and low computational cost. Table III reports the efficiency metrics for each model. Inference time and FPS are only indicative because these metrics depend on the hardware platform; the tests are conducted on an NVIDIA GTX 1080 GPU with 8 GB of memory. All metrics of fNIRSNet significantly outperform those of the other models. In the subject-specific experiments, fNIRSNet with 498 parameters achieves 6.58% higher accuracy than CNN with 4.54 M parameters on MA, and the inference time of fNIRSNet is six times lower than that of CNN. fNIRS-T has the highest FLOPs because the complexity of the self-attention mechanism is quadratic in the input dimension [41]; it therefore has a high inference time and low FPS. We observe a similar situation on UFFT. Therefore, fNIRSNet is friendly to practical applications and could be deployed on embedded devices.

C. Ablation Study
Subject-specific experiments are conducted on the UFFT dataset to ablate the three design guidelines.
1) Guideline 1: In Table IV, the accuracy and F1-score keep improving as the width of DHRConv increases. For a signal tensor X ∈ R^(40×40×1) (i.e., 20 channels × 2 chromophores = 40 and 13.3 Hz × 3 s = 40), the 1 × 40 DHRConv can extract the complete features of delayed hemodynamic responses instead of relying on small convolutions. The 1 × 40 DHRConv also reduces FLOPs significantly. Furthermore, fNIRSNet exhibits poorer classification performance when the height of DWConv is reduced from 40 to 10. Therefore, convolutions with global receptive fields are more beneficial for fNIRS classification than local receptive fields, and they avoid the over-parameterization caused by stacking many convolutional layers to enlarge receptive fields.
2) Guideline 2: In the hybrid pattern, the first module uses standard convolutions and the second uses depthwise separable convolutions, mainly to balance the stability, speed, and efficiency of fNIRSNet. The experimental results are reported in Table V. The pure depthwise separable configuration (i.e., DWS/DWS) has the lowest parameters and FLOPs, while its inference time and FPS deteriorate. In practice, the arithmetic intensity (the ratio of FLOPs to memory accesses) of depthwise separable convolutions is too low and hardware usage is inefficient [42]. In contrast, the hybrid pattern (i.e., STD/DWS) yields lower and more stable standard deviations and the highest running efficiency.
3) Guideline 3: The type and position of activation functions significantly affect fNIRSNet performance. The results are summarized in Table VI. Saturating activation functions, such as sigmoid and tanh, outperform the more popular ReLU and LReLU by about 6%. The hybrid activations (i.e., Sigmoid/ReLU and ReLU/Sigmoid) further reveal the effect of activation position on performance. Results using sigmoid and ReLU as the first and second activation functions, respectively, are significantly higher than those using ReLU and then sigmoid. Following the first convolutional layer with a saturating activation function helps fNIRSNet preserve information. The results of ELU, which has saturated negative values, and LReLU further indicate that this benefit comes from negative saturation, which decreases the forward-propagated variation and information [34].

D. Visualization
In this subsection, visualization techniques are used to explain how fNIRSNet works on UFFT. Grad-CAM [43] is adopted to study the effect of each convolutional layer. Grad-CAM uses the gradient of the target flowing into convolutional layers to generate a coarse heat map that highlights important regions. As shown in Figs. 5(a) and 5(b), the heat-map activation pattern of fNIRSNet differs from that of CNN. The heat map of DHRConv in fNIRSNet covers the entire set of fNIRS channels: the 1 × S DHRConv has channel-level receptive fields, and the C × 1 DWConv has global receptive fields. The heat map of CNN covers only part of the fNIRS channels, especially in the first two layers. This phenomenon is related to the local receptive fields of convolutions. As the depth of CNN increases, the receptive fields gradually grow; thus, the heat map of the third layer tends to extend to all channels.
Fig. 6 illustrates the grand average of all subjects, and Fig. 3(c) shows the fNIRS channel locations. The motor cortex regions in the contralateral hemispheres are well-activated when subjects perform finger-tapping tasks [5]; for RHT and LHT, the HbO responses concentrate in the contralateral motor cortex. The t-SNE [44] is used to visualize the features learned by the FC layer in a two-dimensional space. In Fig. 5(c), the t-SNE of fNIRSNet shows a distinct feature distribution: intra-cluster samples are tightly grouped and inter-cluster samples are highly separated into a triangular structure. However, Fig. 5(d) shows that the features learned by CNN exhibit non-separability in the middle region. Therefore, fNIRSNet has excellent feature learning capabilities.

E. Pooling and Dropout
The architecture of fNIRSNet does not use pooling or dropout. We are interested in whether these components can further improve performance. A 2 × 1 average pooling layer is inserted between the DHR and global modules, and a dropout layer with a dropout rate of 0.5 is added after the global module. In Table VII, average pooling and dropout do not enhance performance, even though average pooling reduces model parameters and dropout prevents overfitting. In addition, average pooling degrades performance more than dropout. These components may lead to underfitting and decreased learning capability because fNIRSNet has few model parameters.

F. Parameter Sensitivity
Parameter sensitivity is analyzed for F_1 and F_2 of fNIRSNet. Fig. 7 shows the average accuracy on the UFFT dataset in subject-specific experiments. We found that increasing F_2 improves the classification performance of fNIRSNet when F_1 is fixed, because a larger F_2 helps the global module capture the contextual dependencies of fNIRS channels. When F_1 equals 4, the classification performance gradually saturates as F_2 increases to 32. Therefore, we recommend setting F_2 to at least twice F_1 when applying fNIRSNet to other data.

V. DISCUSSION
Currently, evaluation protocols for fNIRS classification are confusing. Deep learning-based fNIRS classification research has become popular in recent years, while some early protocols for open-access datasets are based on traditional machine learning classifiers. We found that these protocols are not suitable for evaluating the performance of DNNs. Although studies following them may achieve higher performance, the experimental results need further investigation. For example, Shin et al. [4] classified signal segments from the same time interval for each subject on the MA dataset, which does not validate classifier generalization across different time segments. In addition, Shin et al. state that their study was not dedicated to benchmarking machine learning classifiers [4], yet some studies still follow the protocols. Sun et al. [23] used DNNs to perform 5-fold cross-validation on 60 segments from the same time intervals.
Subject-specific experiments are conducted using various sliding window sizes on the UFFT dataset.
However, it is difficult for DNNs to learn generalized feature representations from a small amount of data. Kwak et al. [45] reported the average and maximum accuracy among the 10 segments in a trial. Bak et al. [5] published the UFFT dataset and used leave-one-out cross-validation (LOO-CV) to evaluate classifier performance; however, LOO-CV is rarely used to evaluate DNNs given the high cost of training. More importantly, the above studies do not conduct subject-independent experiments. Our previous work [11], [46] used KFold-CV and LOSO-CV for trial-wise fNIRS classification. Moreover, we segment signals with sliding windows to increase the number of samples and alleviate overfitting.
The design philosophy of fNIRSNet differs from that of other models. Our motivation is to introduce domain knowledge into the model design: 1) convolutions with channel-level receptive fields extract the features of slow, delayed hemodynamic responses, rather than small convolutions with local receptive fields sliding over fNIRS signals; 2) convolutions with global receptive fields help discover the activation patterns of different brain regions. Deep models that do not incorporate domain knowledge struggle to extract meaningful and discriminative features. Moreover, such over-parameterized models bring more optimization problems on a limited dataset. fNIRSNet, with fewer parameters and FLOPs, achieves higher classification performance and significantly reduces BCI hardware resource consumption. In subject-specific experiments, fNIRSNet with only 498 trainable parameters yielded better results than CNNs with millions of parameters on MA. In addition, its inference time (see Table III) is much lower than that of the other baseline models. Therefore, our study may inspire more knowledge-driven models.
Our study still has limitations. The size of the sliding window limits the long-range dependency of hemodynamic responses, potentially affecting classification performance. This is illustrated by subject-specific experiments on UFFT, with the step size set to 1 s and the signals of each trial split into 8 segments to maintain a fixed total number of data samples. As shown in Fig. 8, the average accuracy of fNIRSNet improves from 68.67% to 71.79% when the sliding window increases from 3 s to 7 s, i.e., the size of DHRConv increases from 1 × 40 to 1 × 93 (13.3 Hz × 7 s). After that, the average accuracy starts to decrease. Enlarging the window also blurs the boundary between the task and rest periods because some signals from the rest period are treated as a continuation of the task period. For example, the eighth segment covers the time interval [7, 17] s when the window size is 10 s, which includes a 7-second rest period. Furthermore, this continuation may interfere with real-time classification for hybrid EEG-fNIRS BCIs: EEG has returned to the resting state, whereas fNIRS still shows a delayed response because of its lower temporal resolution. The primary aim of this study is to investigate fNIRS classification during the task state. In fact, fNIRS classification studies have rarely discussed this continuation, which could be related to the specific tasks.

VI. CONCLUSION
In this study, we rethink delayed hemodynamic responses for fNIRS-based BCIs and propose fNIRSNet, a concise and efficient model for fNIRS classification. We summarize three design guidelines for fNIRSNet. The proposed model, with fewer parameters and FLOPs, achieves better classification results on open-access datasets. Furthermore, Grad-CAM and t-SNE explain the role of each convolutional layer and the model's feature learning capabilities. fNIRSNet is well suited for real-world applications and lowers the hardware requirements of BCI systems.
Index Terms—Functional near-infrared spectroscopy (fNIRS), brain-computer interface (BCI), deep neural network (DNN), delayed hemodynamic response, domain knowledge.

Fig. 1. A canonical hemodynamic response function generated by three gamma functions.

Fig. 2. (a) Overall architecture of fNIRSNet. (b) Schema of receptive fields. The green and yellow feature maps are the outputs of DHRConv and DWConv, respectively. The solid lines indicate that the input is directly obtained from the previous layer, and the dotted lines indicate the corresponding receptive fields in the fNIRS tensor. (c) Depthwise separable convolution.
input sampling points. DHRConv with channel-level receptive fields can directly extract single-channel hemodynamic response features. The kernel size of DWConv is C × 1, where C is twice the number of fNIRS channels (two chromophores, HbO and HbR). In general, model designers stack many convolutional layers to obtain larger receptive fields in deeper layers, but this stacking brings more computational cost. The C × 1 DWConv directly obtains a global receptive field without stacking layers. Fig. 2(b) shows the schema of receptive fields. In addition, DWConv compensates for spatial information, because DHRConv alone cannot aggregate spatial information from multi-channel fNIRS; this helps DWConv focus on activation patterns in different brain regions. Therefore, the 1 × S DHRConv and C × 1 DWConv resolve the contradiction between network depth and receptive fields. Finally, the 1 × 1 PWConv projects the output of DWConv into a new channel space.
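How the two kernels collapse the fNIRS map without a deep stack can be seen from the valid-convolution output sizes. A minimal sketch, assuming hypothetical UFFT dimensions (C = 40 rows for 20 channels × 2 chromophores, S = 40 samples for a 3 s window at 13.3 Hz):

```python
# Output size of a 'valid' convolution: out = in - kernel + 1 (stride 1, no padding).

def valid_out(in_hw, k_hw):
    return (in_hw[0] - k_hw[0] + 1, in_hw[1] - k_hw[1] + 1)

C, S = 40, 40                # hypothetical UFFT map: 2 x 20 channels, 40 samples
x = (C, S)
x = valid_out(x, (1, S))     # 1 x S DHRConv: collapses the time axis -> (40, 1)
x = valid_out(x, (C, 1))     # C x 1 DWConv: global channel receptive field -> (1, 1)
x = valid_out(x, (1, 1))     # 1 x 1 PWConv: shape unchanged -> (1, 1)
print(x)                     # (1, 1)
```

Two layers thus already cover the whole C × S map, which is why no deeper stacking is needed to obtain a global receptive field.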
ε_HbO(λ) and ε_HbR(λ) are the extinction coefficients of HbO and HbR at wavelength λ, d is the differential path-length factor, and l is the distance between source and detector. Raw fNIRS signals contain instrument noise, physiological noise, and motion artifacts [3]. Therefore, a band-pass filter with a passband of 0.01-0.1 Hz is applied to MA and UFFT. Baseline correction addresses baseline drift by subtracting the average value of a reference interval from the fNIRS signals. The reference intervals for MA and UFFT are [−5, −2] s and [−1, 0] s, respectively. The fNIRS signals are divided into segments by a sliding window (window size = 3 s, step size = 1 s) [4], [23]. The segmented signal intervals for MA and UFFT are [−2, 10] s and [0, 10] s, respectively. Thus, a trial of MA is split into 10 segments and a trial of UFFT into 8 segments. Each subject in MA and UFFT includes 600 samples (30 trials × 10 segments × 2 categories and 25 trials × 8 segments × 3 categories, respectively). Finally, the segments are normalized by z-score standardization to accelerate convergence.
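The baseline correction, segmentation, and normalization steps can be sketched as follows. This is a minimal illustration, not the released pipeline: the band-pass filter is omitted, and the array shapes and 13.3 Hz rate are assumptions chosen to mimic a UFFT trial.

```python
import numpy as np

def preprocess_trial(trial, fs=13.3, baseline=(-1.0, 0.0), t0=-1.0,
                     window_s=3.0, step_s=1.0, task=(0.0, 10.0)):
    """Baseline-correct, segment, and z-score one trial.

    trial: (channels, time) array whose first sample is at t0 seconds.
    Returns a list of (channels, window) z-scored segments.
    """
    # Baseline correction: subtract the mean over the reference interval.
    b0 = int(round((baseline[0] - t0) * fs))
    b1 = int(round((baseline[1] - t0) * fs))
    trial = trial - trial[:, b0:b1].mean(axis=1, keepdims=True)

    # Sliding-window segmentation over the task interval.
    win = int(round(window_s * fs))
    segments = []
    start_s = task[0]
    while start_s + window_s <= task[1] + 1e-9:
        s = int(round((start_s - t0) * fs))
        seg = trial[:, s:s + win]
        # z-score standardization per segment.
        seg = (seg - seg.mean()) / seg.std()
        segments.append(seg)
        start_s += step_s
    return segments

# Hypothetical UFFT-like trial: 40 rows (HbO + HbR), samples from -1 s to 10 s.
trial = np.random.randn(40, int(round(11 * 13.3)))
segs = preprocess_trial(trial)
print(len(segs))        # 8 segments
print(segs[0].shape)    # (40, 40)
```

A 3 s window with a 1 s step over the 10 s task period reproduces the 8 segments per trial stated above.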

Fig. 3. (a) Experimental paradigms for MA and UFFT. A trial consists of an introduction period, a task period, and a rest period. (b) Sensor location layout for MA [4]. The red and green squares are fNIRS sources and detectors, respectively. Solid black lines indicate fNIRS channels. The blue and black (ground) circles are EEG electrodes. (c) fNIRS channel locations for UFFT [5]. Ch 1-10 and Ch 11-20 are located around C3 (Ch 9) and C4 (Ch 18), respectively.

Fig. 4. Schematic diagrams of subject-specific and subject-independent evaluation. A dataset contains multiple subjects, and each subject has N trials.
where TP is true positive, FP is false positive, FN is false negative, and n represents the number of pairwise combinations. The final performance metrics are the averages of all cross-validation results. Efficiency metrics include the number of model parameters, floating-point operations (FLOPs), inference time, and frames per second (FPS).
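For reference, the standard definitions behind these counts can be sketched as follows; the example values are illustrative only and do not come from the experiments:

```python
# Precision, recall, and F1-score from TP/FP/FN counts (illustrative values).

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

print(precision(80, 20))  # 0.8
print(recall(80, 20))     # 0.8
print(f1(80, 20, 20))     # 0.8
```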

Fig. 5. (a) and (b) are the Grad-CAM visualizations of fNIRSNet and CNN for Subject 1, respectively. (c) and (d) are the t-SNE visualizations of fNIRSNet and CNN for Subject 1, respectively.

Fig. 6. Grand average of the fNIRS signals over subjects. The X-axis interval is [−2, 24] s and the Y-axis interval is [−0.0038, 0.0085] mM·cm. The red vertical dotted lines indicate the end of the task period (0-10 s). The solid and dotted curves represent HbO and HbR, respectively. The red, blue, and green curves correspond to RHT, LHT, and FT, respectively.

Fig. 8. Subject-specific experiments conducted with various sliding window sizes on the UFFT dataset.

TABLE I: CONFIGURATIONS OF THE PROPOSED MODEL

TABLE II: EXPERIMENTAL RESULTS FOR SUBJECT-SPECIFIC AND SUBJECT-INDEPENDENT EVALUATION. BOLD INDICATES THE BEST RESULT

TABLE IV: SUBJECT-SPECIFIC EXPERIMENTAL RESULTS OF DIFFERENT CONVOLUTIONAL KERNEL SIZES ON THE UFFT DATASET

TABLE V: SUBJECT-SPECIFIC EXPERIMENTAL RESULTS FOR DIFFERENT TYPES OF CONVOLUTIONS ON THE UFFT DATASET. STD MEANS STANDARD CONVOLUTION, AND DWS DENOTES DEPTHWISE SEPARABLE CONVOLUTION

TABLE VI: SUBJECT-SPECIFIC EXPERIMENTAL RESULTS OF DIFFERENT ACTIVATION FUNCTIONS ON THE UFFT DATASET

TABLE VII: SUBJECT-SPECIFIC EXPERIMENTAL RESULTS ON THE UFFT DATASET