Machine Learning Approach for Classifying College Scholastic Ability Test Levels With Unsupervised Features From Prefrontal Functional Near-Infrared Spectroscopy Signals

Learning ability evaluation has been critical in educational and medical fields to investigate learning achievement or cognitive impairment. Previous researchers utilized biosignal data such as functional near-infrared spectroscopy and an electroencephalogram to reflect neural variation in factors related to learning ability. Additionally, machine learning algorithms have been used to identify the inherent associations between learning ability and related factors. Herein, we propose a classification framework for college scholastic ability test levels using unsupervised features extracted from a functional near-infrared spectroscopy signal dataset based on machine learning models. To extract unsupervised features from functional near-infrared spectroscopy signals, we constructed a one-dimensional convolutional autoencoder with an electroencephalogram dataset as a transfer learning approach. Eight handcrafted features (signal mean, slope, minimum, peak, skewness, kurtosis, variance, and standard deviation) with various window length conditions were calculated to compare influences on classification performance. Five evaluation metrics (accuracy, precision, recall, F1-score, and area under the curve) were applied to evaluate the proposed framework’s performance. Among the five classification algorithms (XGBoost classifier, support vector classifier, naive Bayes classifier, decision tree classifier, and logistic regression), the XGBoost classifier was the best at classifying college scholastic ability test levels. We found that unsupervised features extracted from deep learning algorithms are more usable for classification than handcrafted features. Furthermore, the applicability of transfer learning between two different neural modals was validated using the experimental results. The results of this study provide new insights into the relationships between hemodynamics in functional near-infrared spectroscopy signals and college scholastic ability test levels.


I. INTRODUCTION
Evaluation or validation of learning ability has been widely investigated by researchers in the educational and medical domains. In particular, many researchers in the educational field have tried to identify education or study levels in student groups [1]-[3]. Kellaghan and Greaney [4] suggested an assessment of students' learning achievement to improve education quality. They focused on the models and standards of a national assessment. Pereira et al. [5] introduced learner-centered assessment methods for higher education through a review of previous studies. In addition, the influences of examinations and written tests in relation to learning evaluation were validated through an analysis.
In the medical domain, Prigoff et al. [6] assessed the effectiveness of medical education achievement in the increased virtual learning environment amid the coronavirus disease 2019 outbreak. Researchers have proposed the need for adjustments in curricula based on exam scores from student groups. Further, Barulli et al. [7] proposed test tools for memory capacity to detect cognitive impairment in a clinical setting. This memory test showed strength for neuropsychological evaluation.
Many methodologies (e.g., surveying, testing, and counseling) have been used in previous allied studies to measure the learning ability of study participants. Dermo [8] applied an online questionnaire to evaluate the use of e-assessment as a means to improve the quality of student learning. Le and Tam [9] validated eight assessment methods, including seminars and open-book tests, to compare the effectiveness of these methods with regard to students' understanding and attitudes. Additionally, Burnett [10] used counseling to assess the learning outcomes of participants. The author suggested that strategies and techniques are needed to maximize learning outcomes in counseling. Further, scholastic assessment tests have been widely used to evaluate overall academic achievement or level in specific subjects such as mathematics [11]-[14].
To evaluate diverse conditions in learning ability, previous studies have applied neural-related measurements (e.g., functional magnetic resonance imaging, electroencephalography, and functional near-infrared spectroscopy). Denervaud et al. [15] evaluated errors in learning in Montessori and traditionally schooled children. They used brain functional magnetic resonance imaging (fMRI) data to identify patterns associated with errors in children. Kim et al. [16] collected electroencephalogram (EEG) signals from college student groups to investigate influences of indoor thermal conditions on college students' learning performance. Relationships between the detailed ability of students, including working memory and executive ability, and thermal conditions, were examined. Firooz and Setarehdan [17] recorded functional near-infrared spectroscopy (fNIRS) and EEG signals from graduate students to estimate intelligence quotient (IQ) test scores. The researchers validated the usability of fNIRS and EEG as evaluation modalities in their research.
Various analysis methodologies have been utilized to analyze potential relations from the neural variations of participants. Howard et al. [18] examined associated brain regions with cognitive load for several tasks. Each region-related task was identified through a t-test and partial least squares analysis. Daly et al. [19] validated motivations for learning mathematics using EEG signals. Significant differences in prefrontal signals were verified using a t-test. Sugiura et al. [20] utilized fNIRS signals to evaluate performance in second-language-learning among young adolescents. Signals from the regions of interest were compared using a generalized linear modeling method.
Based on the aforementioned studies, recent studies have utilized machine learning and deep learning to identify latent patterns of neural data. Mao et al. [21] proposed a deep learning classification algorithm to classify the fMRI of attention deficit/hyperactivity disorder patients. The proposed framework of the authors showed a state-of-the-art performance compared with previously proposed algorithms. Evgin et al. [22] attempted to classify bipolar disorder using fNIRS signals on the basis of convolutional neural network models. They demonstrated the possibility of fNIRS analysis using feed-forward neural network algorithms.
Based on previous studies, we developed a classification framework using machine learning algorithms for learning ability levels based on fNIRS signals. To prescribe the operational definition of "learning ability," we collected and utilized college scholastic ability test (CSAT) scores from 73 participants. Additionally, the collected scores were set as a dependent variable of machine learning algorithms. Further, we hypothesized that unsupervised features extracted from deep learning algorithms can show better performance than the handcrafted features used in previous studies for fNIRS classification tasks. To validate this hypothesis, we included in our research design the construction of deep learning models as a feature extractor and comparisons between extracted features and calculated handcrafted features.
To construct deep learning models for feature extraction, the collected fNIRS dataset was insufficient to train and evaluate algorithms from scratch. We utilized EEG signals with characteristics similar to those of fNIRS signals for algorithm training in terms of transfer learning. Zhang et al. [23] adopted a transfer learning approach to evaluate deep convolutional neural networks using an EEG dataset. Moreover, EEG and fNIRS signals showed several common advantages, such as high temporal resolution, over other neural modals. Trambaiolli et al. [24] determined signal properties between neuro-electric (i.e., EEG) and neuro-hemodynamic (i.e., fNIRS) on the basis of their analysis results. They focused on not only the characteristics of time-series data, but also the capabilities of describing hemodynamic alterations in the occipital/visual cortex using EEG signals. As a result, we concluded that the application of deep learning algorithms trained by EEG signals was reasonable for feature extraction.
Furthermore, we attempted to examine the possibility of transfer learning without additional fine-tuning in deep learning algorithms between datasets collected from the same domain. Peng et al. [25] compared the model's performance based on transfer learning between five similar image datasets. They verified the potential of the transfer learning approach based on their experimental results. In addition, unsupervised algorithms trained with a single dataset were evaluated using four datasets, excluding the fine-tuning steps. Zhong et al. [26] used three classification algorithms with datasets collected from different domains in the transfer learning approach. Each trained algorithm was verified and compared without additional fine-tuning. Referring to these previous studies, deep learning algorithms trained with EEG signals were utilized without additional training with fNIRS signals as feature extractors.
In this study, we developed a five-step research scheme. First, suitable participants with regard to age and CSAT scores were recruited, and experiments using eight-session task materials were conducted to collect fNIRS signals. Second, the collected fNIRS signals were preprocessed and converted from raw signals to HbO (oxyhemoglobin) and HbR (deoxyhemoglobin) concentration signals. Third, one-dimensional convolutional autoencoder models were developed using the EEG dataset as a feature extractor for unsupervised feature extraction. Handcrafted and unsupervised features were extracted from the preprocessed HbO and HbR concentration signals. Fourth, five machine learning classifiers were trained with extracted features (handcrafted and unsupervised features) to classify CSAT levels. Finally, the classification performance of each classifier was compared to identify the optimized algorithms and conditions for our research topics. The overall research scheme is shown in Figure 1.
This work provides three main contributions to the field:
• We propose a novel classification framework based on machine learning algorithms for CSAT levels using fNIRS signals.
• The applicability of a one-dimensional convolutional autoencoder model trained with EEG signals as a feature extractor was validated in terms of transfer learning.
• We checked the usability of unsupervised features for classification through comparisons with handcrafted features.
The remainder of this paper is organized as follows. In Section II, we present the detailed procedures and methods for developing our proposed framework for CSAT-level classification. In Section III, we present the experimental results to evaluate our machine learning-based framework. In Section IV, we explain the significance and implications of our research. Finally, we conclude the paper in Section V.

A. PARTICIPANTS FOR fNIRS DATASET COLLECTION
To collect fNIRS signals, we recruited participants from undergraduate freshman groups at three different universities (Yonsei University, Honam University, and Gyeongsang National University). Seventy-three healthy undergraduate students participated (mean age: 19.20; female: 41, male: 32).
We used the NIRSIT Lite device of OBELAB Inc. (Seoul) to collect information on the hemodynamic activities of participants. The sensor array of this device consisted of five dual-wavelength laser diodes (780 and 850 nm) and seven photodetectors separated by an 8 mm unit distance. The optical signal collected from each channel was sampled at 8.138 Hz. The laser and detector pairs were separated by a distance of 3 cm. The position schemes of optodes in the NIRSIT Lite device are depicted in Figure 2 [27].
Prior to the experiment, we explained the fNIRS signal collection procedure to the participants, and all experiments were conducted after obtaining their consent. The experiments were designed and conducted in accordance with the guidelines of the Declaration of Helsinki and institutional review board approval at Yonsei University (7001988-202104-HR-659-06).

B. EEG DATASET
In this study, we compared unsupervised features extracted using deep learning models and handcrafted features. To extract features from fNIRS signals, one-dimensional convolutional autoencoder models were trained and evaluated using EEG signals. An open-source brain-computer interface (BCI) IV competition EEG dataset was utilized for our feature extractor [28]. This dataset was provided by the Berlin BCI group (Berlin Institute of Technology, Fraunhofer FIRST, and University Medicine Berlin). Researchers from the BCI group devised six experimental sessions to evaluate motor imagery in participants. Three options for the motor imagery task (the left hand, right hand, and foot) were sequentially assigned visual cues. EEG signals included in the BCI competition IV dataset were collected from seven healthy subjects (all male; aged 26 to 46 years) using 59-channel devices.

C. EXPERIMENTAL PROCEDURE
To examine hemodynamic variation across various tasks, we assembled consecutive sessions of cognitive tasks widely used in related previous studies. Eight sessions (a resting state and seven cognitive tasks) were finally selected and provided to participants. First (resting state step), brain activity in the resting state was measured to investigate hemodynamic changes before conducting the seven cognitive tasks [29]. Second (Corsi block-tapping task step), participants were instructed to select consecutive positions of colored boxes in the Corsi block-tapping task (CBT) [30]. The positions of the yellow box were changed sequentially to evaluate the participants' memory of the stimulus sequence in the task. Third (emotion task step), human-face pictures with specific expressions were presented to the participants for the selection of emotions. The participants were shown the pictures for a few seconds; thereafter, the pictures disappeared and the participants were offered a choice of emotions [31]. Fourth (recognition task step), to evaluate the recognition capacities of participants, facial pictures without expressions or emotions were shown on the monitor [32]. Recognition was assessed by having participants indicate, by selecting the appropriate button, whether they had seen the picture in the previous task. Fifth (Stroop task step), participants were provided with colored words and selected the color of the words in the Stroop task [33]. In addition, participants were asked to verify the meaning of the words regardless of their color. Sixth (Tower of London task step), participants performed the Tower of London tasks and were provided with three beads and sticks to assess their planning and problem-solving capabilities [34]. Seventh (N-back task step), participants took part in the N-back task, which involved the sequential positions of stars in a grid [35]. Participants selected the previous positions of stars in three types of tasks (1-back, 2-back, and 3-back). Finally (verbal fluency task step), participants' verbal fluency was evaluated by having them speak related words within a limited period [36]. In addition, the time at which each word was spoken was recorded to identify the continuity of the answers. The procedures for the cognitive tasks and their time periods are listed in Table 1.

D. fNIRS SIGNAL PREPROCESSING
After data were collected from participants via the cognitive tasks, fNIRS data were preprocessed to remove artifacts or noise in the signal. To exclude physiological and environmental noise, band-pass filtering with a range of 0.005 to 0.1 Hz was used. In addition, a 30 dB signal-to-noise ratio was applied to qualify the noise of detected channels. After applying two processes (band-pass filter and signal-to-noise ratio), we calculated relative changes in oxy-Hb (i.e., HbO concentration signal) and deoxy-Hb (i.e., HbR concentration signal) using the modified Beer-Lambert law (MBLL) [37], [38].
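For reference, the MBLL conversion mentioned above can be written in its generic two-wavelength form. This is a textbook statement rather than the exact parameterization used in this study: the extinction coefficients $\varepsilon$, the differential pathlength factor $\mathrm{DPF}$, and the use of the 3 cm source-detector distance as $d$ are assumptions made for illustration.

```latex
% Change in optical density at each wavelength (lambda_1 = 780 nm, lambda_2 = 850 nm)
\Delta OD(\lambda_j) =
  \left[ \varepsilon_{HbO}(\lambda_j)\,\Delta c_{HbO}
       + \varepsilon_{HbR}(\lambda_j)\,\Delta c_{HbR} \right]
  d \,\mathrm{DPF}(\lambda_j), \qquad j = 1, 2

% Solving the resulting 2 x 2 linear system for the concentration changes
\begin{pmatrix} \Delta c_{HbO} \\ \Delta c_{HbR} \end{pmatrix}
= \frac{1}{d}
\begin{pmatrix}
  \varepsilon_{HbO}(\lambda_1) & \varepsilon_{HbR}(\lambda_1) \\
  \varepsilon_{HbO}(\lambda_2) & \varepsilon_{HbR}(\lambda_2)
\end{pmatrix}^{-1}
\begin{pmatrix}
  \Delta OD(\lambda_1) / \mathrm{DPF}(\lambda_1) \\
  \Delta OD(\lambda_2) / \mathrm{DPF}(\lambda_2)
\end{pmatrix}
```

Two wavelengths are needed because each measurement mixes the contributions of both chromophores; the matrix inverse separates them.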

E. EEG SIGNAL PREPROCESSING
To construct a feature extractor (i.e., a one-dimensional convolutional autoencoder), we utilized the BCI IV EEG dataset. Unlike previous studies that analyzed EEG signals in detail (e.g., power spectrum analysis or spectral analysis), we preprocessed only basic characteristics of the EEG signals (e.g., sampling frequency and scale of signal values). The EEG signals included in the BCI IV dataset were sampled at 1000 Hz. Based on previous studies [39], [40], we downsampled the data to 100 Hz and normalized the scale of values to range from −1 to +1 to achieve faster training of algorithms.
After the aforementioned preprocessing steps, we composed three datasets for training and evaluation of deep learning algorithms. Signals collected from seven participants were divided into training (five participants), validation (single participant), and test (single participant) datasets. The five participants in the training dataset were randomly assigned, and the two remaining participants were randomly assigned to the validation and test datasets.
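The EEG preprocessing and subject-wise split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: downsampling is done by plain decimation (no anti-aliasing filter, since the text does not specify one), normalization is a per-signal min-max rescaling, and the function names, subject IDs, and random seed are hypothetical.

```python
import random


def downsample(signal, factor=10):
    """Keep every `factor`-th sample (1000 Hz -> 100 Hz when factor is 10)."""
    return signal[::factor]


def scale_to_unit_range(signal):
    """Linearly rescale values so that the minimum maps to -1 and the maximum to +1."""
    lo, hi = min(signal), max(signal)
    if hi == lo:  # constant signal: avoid division by zero
        return [0.0 for _ in signal]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in signal]


def split_subjects(subject_ids, n_train=5, seed=0):
    """Randomly assign whole subjects to train / validation / test (5 / 1 / 1)."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    return ids[:n_train], ids[n_train:n_train + 1], ids[n_train + 1:n_train + 2]


# One second of synthetic 1000 Hz data standing in for a single EEG channel.
raw = [float(i % 200) for i in range(1000)]
eeg_100hz = scale_to_unit_range(downsample(raw))

# Seven hypothetical subject IDs, split by subject rather than by sample.
train_ids, val_ids, test_ids = split_subjects(range(1, 8))
```

Splitting by subject keeps all signals from one person in a single partition, so the validation and test errors reflect reconstruction of unseen participants.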
To identify optimal hyperparameters such as the length of layers and size of latent vectors in deep learning models, we compared datasets and the different lengths of the EEG dataset. Referring to previous studies using similar algorithms, we selected five conditions for the length of the input layer (i.e., length of input signal data) and the length of latent vectors (i.e., length of extracted vectors) [41]. Accordingly, five datasets with different signal lengths were composed for the construction of feature extractors. The training, validation, and test datasets were included in each dataset. Further, we checked that the dimensions of the datasets were similar for each condition. The detailed dimensions of each dataset are listed in Table 2.

F. CONSTRUCTION OF FEATURE EXTRACTOR
To extract unsupervised features from fNIRS signals, we utilized one-dimensional convolutional autoencoder algorithms. Additionally, as mentioned in the previous paragraph, we compared five conditions to determine the optimal hyperparameters for the algorithms. The detailed conditions for the comparison are listed in Table 3.
The reconstruction performance of each condition was evaluated using three evaluation indices (root mean squared error, mean relative error, and mean absolute error) and a comparison of figures. Among the five conditions for hyperparameter setting, the third condition (720 length of input layer and 60 length of latent vector) showed lower index values than the other conditions (the detailed results are explained in Section III.). Based on these results, we trained and evaluated algorithms with a 720-length of input layers and 60 latent vector conditions. The detailed structures of the one-dimensional convolutional autoencoder are listed in Table 4. We utilized the encoder module (layers 1-4) as a feature extractor.
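The three reconstruction indices named above can be computed as in the following sketch. The definitions of root mean squared error and mean absolute error are standard; the mean relative error is assumed here to be the mean of the absolute error divided by the absolute original value, which may differ from the exact definition used in this study. The sample values are made up for illustration.

```python
import math


def rmse(x, x_hat):
    """Root mean squared error between original and reconstructed signals."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x))


def mae(x, x_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)


def mre(x, x_hat, eps=1e-12):
    """Mean relative error (assumed definition): |x - x_hat| / |x|, averaged."""
    return sum(abs(a - b) / max(abs(a), eps) for a, b in zip(x, x_hat)) / len(x)


# Illustrative values only; signals are assumed to lie in [-1, +1].
original = [0.5, -0.25, 1.0, -1.0]
reconstructed = [0.4, -0.25, 0.8, -0.9]

errors = {
    "rmse": rmse(original, reconstructed),
    "mae": mae(original, reconstructed),
    "mre": mre(original, reconstructed),
}
```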

G. EXTRACTION OF UNSUPERVISED FEATURES
We obtained HbO and HbR signals following the previously mentioned preprocessing steps. To extract unsupervised features from the feature extractor, preprocessed HbO and HbR signals were divided into 720-length units. Single signals of 720-length were applied to the feature extractor. From the feature extractor (encoder module), one-dimensional 60-length vectors with 64 output channels were obtained. Extracted feature vectors were converted to single vectors by averaging channels without changes in length.
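The segmentation and channel-averaging steps described above can be sketched as follows. The trained encoder is not reproduced here: `dummy_encoder` is a hypothetical placeholder that only mimics the output shape (64 channels of length 60), and dropping the incomplete final unit is an assumption, as the text does not say how leftover samples were handled.

```python
def segment(signal, length=720):
    """Split a signal into non-overlapping units of `length` samples (remainder dropped)."""
    n_units = len(signal) // length
    return [signal[k * length:(k + 1) * length] for k in range(n_units)]


def average_channels(feature_map):
    """Collapse a (channels x length) map into one vector of the same length."""
    n_channels = len(feature_map)
    return [sum(column) / n_channels for column in zip(*feature_map)]


def dummy_encoder(unit, n_channels=64, latent_len=60):
    """Placeholder for the trained encoder: returns a 64 x 60 map of block averages."""
    step = len(unit) // latent_len
    pooled = [sum(unit[j * step:(j + 1) * step]) / step for j in range(latent_len)]
    return [[(c + 1) * v for v in pooled] for c in range(n_channels)]


# Synthetic stand-in for one preprocessed HbO channel.
hbo = [float(i % 7) for i in range(1500)]
units = segment(hbo)
features = [average_channels(dummy_encoder(u)) for u in units]
```

Each 720-sample unit therefore yields one 60-value feature vector, which matches the 60 feature columns reported for the unsupervised feature condition.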

H. CALCULATION OF HANDCRAFTED FEATURES
To compare the influence of unsupervised features and handcrafted features on classification performance, we calculated eight handcrafted features used in previous studies.

1) SIGNAL MEAN
The signal means of the HbO and HbR concentration signals were calculated as follows:

$$\mu_w = \frac{1}{N_w} \sum_{i=i_1}^{i_2} HbX(i)$$

In this formula, $\mu_w$ is the mean value for a given window. Subscript $w$ indicates the window for the calculation. $i_1$ and $i_2$ denote the start and end points of the window, respectively. $N_w$ is the number of signal values in the window, and $HbX$ refers to the HbO or HbR concentration signal data. In many previous studies, signal mean values were utilized for classification in BCI research [42], [43].

2) SIGNAL SLOPE
To extract the signal slope features, we referred to the calculation methods used in previous studies [44]. The highest and lowest signal values in the window were compared to obtain the signal slope feature $Slope_w$, where subscript $w$ indicates the window, and $H_w$ and $L_w$ denote the highest and lowest values in the window, respectively.

3) SIGNAL PEAK
The signal peak feature is the peak value of the signal values in the window. Some previous studies have shown that peak value features worked best in fNIRS research [45], [46].

4) SIGNAL MINIMUM
The signal minimum feature is the minimum value of the signal in a given window. In associated studies on fNIRS-BCI, authors validated the usability of these features [47]-[49].

5) SIGNAL SKEWNESS AND KURTOSIS
The signal skewness feature was calculated as follows:

$$Skewness_w = \frac{E_x\left[(HbX - \mu_w)^3\right]}{\sigma^3}$$

where $Skewness_w$ indicates the skewness feature value calculated from the signal values in the window. $\sigma$ in the denominator represents the standard deviation of the HbO or HbR concentration signal value for the given window. In the numerator, $\mu_w$ denotes the mean value in the window, and $E_x$ denotes the expectation of the HbO or HbR signal. The signal kurtosis was computed as follows:

$$Kurtosis_w = \frac{E_x\left[(HbX - \mu_w)^4\right]}{\sigma^4}$$

where $Kurtosis_w$ indicates the calculated kurtosis feature value. These features (skewness and kurtosis) have been utilized in related fNIRS research [50], [51].

6) SIGNAL VARIANCE AND STANDARD DEVIATION
We calculated the variance and standard deviation values for a given window. These features have also been reported as being effective in fNIRS research [52], [53].
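The windowed handcrafted features described in this subsection can be sketched as below. This is an illustration under assumptions rather than the study's implementation: windows are taken as non-overlapping, variance is the population variance, kurtosis is not offset by 3, and the slope feature is omitted because its formula is not recoverable from the text. The function names are hypothetical.

```python
import math


def windows(signal, window_len):
    """Split a signal into non-overlapping windows of `window_len` samples."""
    n = len(signal) // window_len
    return [signal[k * window_len:(k + 1) * window_len] for k in range(n)]


def window_features(w):
    """Mean, peak, minimum, variance, standard deviation, skewness, and kurtosis."""
    n = len(w)
    mean = sum(w) / n
    var = sum((v - mean) ** 2 for v in w) / n  # population variance (assumption)
    std = math.sqrt(var)
    if std > 0:
        skew = sum((v - mean) ** 3 for v in w) / n / std ** 3
        kurt = sum((v - mean) ** 4 for v in w) / n / std ** 4
    else:  # flat window: higher moments are undefined, report zero
        skew = kurt = 0.0
    return {
        "mean": mean, "peak": max(w), "minimum": min(w),
        "variance": var, "std": std, "skewness": skew, "kurtosis": kurt,
    }


# A 2 s window at the 8.138 Hz sampling rate corresponds to about 16 samples.
sampling_rate_hz = 8.138
window_len = round(2 * sampling_rate_hz)

example = window_features([1.0, 2.0, 3.0, 4.0, 5.0])
per_window = [window_features(w) for w in windows([float(i) for i in range(40)], window_len)]
```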

I. CLASSIFICATION ALGORITHMS
We applied five machine learning classifiers to our research topic (CSAT level classification using fNIRS features). The first classification algorithm was a decision tree classifier [54]. This algorithm has a flowchart-like tree structure composed of nodes and branches. The tree was built in two phases. First, in the build (growth) phase, the training dataset was split recursively based on locally optimal criteria until the samples in each partition shared the same class label. Second, in the pruning phase, which was conducted on the fully grown tree, noise and outliers were removed to prevent overfitting of the models. In terms of the model structure, the tree consists of three sub-structures (internal nodes, branches, and leaf nodes). We utilized decision tree classifiers with the iterative dichotomiser 3 (ID3) algorithm. The ID3 algorithm uses information gain to select the splitting attribute. Information gain represents the variation of entropy values; that is, it was calculated as the difference in entropy before and after splitting.
The second classification algorithm was logistic regression [55]. A maximum likelihood estimation method was used to estimate the coefficients of the regression models. Subsequently, the regression model calculated a likelihood value L(x), where 0 ≤ L(x) ≤ 1. The association between the class label and input vectors was indicated by the likelihood values. If the likelihood value was higher than the threshold (0.5), the sample was classified as having a high CSAT level in the binary case. In the three-class condition, we considered the class variable Y as taking one of the values "low," "middle," or "high." As a result, the logistic regression model calculated the probability values to categorize each class under diverse class conditions.
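The information gain criterion used by the ID3 algorithm, described above as the difference in entropy before and after a split, can be illustrated with a short sketch. The labels and candidate splits are toy values chosen for illustration, not data from this study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy before the split minus the size-weighted entropy after it."""
    n = len(labels)
    after = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - after


labels = ["low", "low", "high", "high"]

# A split that separates the classes perfectly removes all uncertainty.
perfect = information_gain(labels, [["low", "low"], ["high", "high"]])

# A split that leaves both groups mixed provides no information.
useless = information_gain(labels, [["low", "high"], ["low", "high"]])
```

At each node, ID3 evaluates candidate attributes this way and splits on the one with the largest gain.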
The third classifier was a naive Bayes algorithm [56]. This probabilistic classifier utilizes the Bayes theorem. All attributes in the dataset are assumed to be independent. Support vector classifiers (SVC) were used as the fourth classification algorithm [47]. In our study, this classifier was applied using a nonlinear kernel (radial basis kernel). The feature space of the dataset was classified using hyperplanes separated by class labels. In the research by Bhavsar & Panchal [58], the authors compared classification performance via SVC models under linear, polynomial, and radial basis kernel conditions. They showed advantages of radial basis kernels for high dimensional classification tasks. Therefore, to evaluate the classification performance of the different algorithms under various class conditions, we selected a radial basis kernel with non-linear characteristics. Additionally, the participants in the dataset were completely separated to prevent overfitting of the models.
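The radial basis kernel mentioned above measures similarity between two feature vectors as a decaying function of their squared distance. The sketch below shows the standard form; the value of `gamma` is illustrative, since the kernel width used in this study is not stated.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2); gamma here is an illustrative value."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical vectors have similarity 1; similarity decays toward 0 with distance.
same = rbf_kernel([0.2, -0.1], [0.2, -0.1])
far = rbf_kernel([0.0, 0.0], [1.0, 1.0])
```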
Finally, the XGBoost classifier was utilized to compare the classification performance with the aforementioned algorithms [59]. This classifier is an ensemble of several decision-tree models based on gradient boosting with a regularized objective. We minimized the regularized objective function to optimize the algorithms. The difference between the predicted $\hat{y}_i$ and the target $y_i$ was measured with a differentiable convex loss function. A penalization term was added to adjust the complexity of the models. This additional regularization term smooths the final learned weights to avoid overfitting. In our study, we assigned categories of CSAT levels (e.g., "low," "middle," and "high" in the three-class case) to $y_i$.
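For reference, the regularized objective described above is commonly written as follows. The notation follows the general XGBoost formulation and is not taken from this manuscript: $f_k$ denotes the $k$-th tree, $T$ its number of leaves, $w$ its vector of leaf weights, and $\gamma$ and $\lambda$ are regularization hyperparameters whose values in this study are not stated.

```latex
\mathcal{L}(\phi) = \sum_{i} l\left(\hat{y}_i, y_i\right) + \sum_{k} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}
```

The first sum is the loss between predictions and targets; the second penalizes both the number of leaves and the magnitude of the leaf weights in each tree.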

J. EVALUATION METRICS
To compare the classification performances between algorithms, we applied five evaluation metrics. True positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts were calculated from a confusion matrix so that performance could be evaluated with indices other than accuracy alone. TP and TN values indicate the number of correctly classified samples. In contrast, FP and FN values represent incorrectly classified samples. Using the four basic values from the confusion matrix, we obtained four indicators (precision, recall, F1-score, and accuracy). Additionally, the true positive rate and false positive rate were checked to draw the receiver operating characteristic (ROC) curve. Further, the performance of the algorithms was evaluated using area under the curve values from the ROC curve.
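The four indicators derived from the confusion matrix can be computed as in the sketch below. The counts are hypothetical and describe a single positive class; for the multi-class conditions, applying the same formulas per class (one versus the rest) and averaging is an assumption about how the metrics were aggregated.

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, F1-score, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts for one class out of 100 samples.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```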

K. TRAINING AND EVALUATION OF MACHINE LEARNING CLASSIFIERS
We utilized datasets consisting of features (both handcrafted and unsupervised) and class labels to train and evaluate five machine learning algorithms. To validate the classification performance of classifiers in various class conditions, we set three class conditions in our experiments (three, four, and five classes). Further, detailed experimental conditions based on the characteristics of fNIRS signals (HbO or HbR, channels of signal) and window length for handcrafted feature extraction were applied to compare the effects of the conditions on the classification performance. For example, in the case of the characteristics of fNIRS signals, the HbO and HbR signals were separately applied for feature extraction. Additionally, the fNIRS dataset used in our study consisted of distinct signals collected from each of the 15 channels. Individual signals were separately applied for comparison. Further, nine different window length conditions (from 2 s to 10 s) were used for handcrafted features to compare and validate the influence of features on performance in our research settings. As a result, we conducted experiments with 32,400 conditions for the handcrafted features (8 features × 9 window lengths × 15 channels × 5 models × 3 class conditions × 2 signal types [HbO and HbR] = 32,400) and 450 conditions for the unsupervised features (15 channels × 3 class conditions × 5 models × 2 signal types [HbO and HbR] = 450).
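The condition counts stated above can be checked by enumerating the experimental grid. The identifiers below are illustrative labels for the factors named in the text, not names used in the study's code.

```python
from itertools import product

features = ["mean", "slope", "minimum", "peak",
            "skewness", "kurtosis", "variance", "std"]
window_lengths_s = range(2, 11)          # 2 s to 10 s, nine conditions
channels = range(1, 16)                  # 15 fNIRS channels
models = ["xgboost", "svc", "naive_bayes", "decision_tree", "logistic_regression"]
class_conditions = [3, 4, 5]             # number of CSAT level classes
signal_types = ["HbO", "HbR"]

# Handcrafted features vary over all six factors.
handcrafted = list(product(features, window_lengths_s, channels,
                           models, class_conditions, signal_types))

# Unsupervised features have no feature-type or window-length factor.
unsupervised = list(product(channels, class_conditions, models, signal_types))

n_handcrafted = len(handcrafted)
n_unsupervised = len(unsupervised)
```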
To train and evaluate the algorithms, we utilized 10-fold cross-validation to prevent overfitting. The number of rows in the dataset was the same (1,530 rows) for both handcrafted and unsupervised feature conditions. Additionally, the number of columns (i.e., features) differed according to the feature conditions. For example, in the case of unsupervised feature conditions, the dimension of the dataset was (1530, 61). Unlike unsupervised feature conditions, the number of columns in the handcrafted feature condition differed on the basis of length of the windows. The average number of columns in a handcrafted feature was

A. CONSTRUCTION OF THE FEATURE EXTRACTOR
To extract unsupervised features from fNIRS signals, we constructed a one-dimensional convolutional autoencoder as a feature extractor. Five hyperparameter conditions were compared to confirm the model structure and training parameters. Among the five conditions, the third condition (720 length of input layer and 60 length of latent vector) showed the lowest error values among the three error indices. The detailed error values are listed in Table 5. Additionally, we checked the similarity between the original and reconstructed signals in terms of visualization. The visualization of EEG signals is depicted in Figure 4.

B. CLASSIFICATION PERFORMANCE OF ML CLASSIFIERS
We evaluated the classification performances of five ML classifiers based on various experimental conditions for feature extraction and class labels. Among the five algorithms (decision tree classifier, logistic regression, naive Bayes classifier, support vector classifier, and XGBoost classifier) for classification, the XGBoost classifier showed the best classification performance. Detailed experimental results from XGBoost and other classifiers are presented in Appendices A and B. The performance of the XGBoost classifier in classifying CSAT levels under unsupervised feature conditions is shown in Tables 6-8.
In the case of handcrafted features, XGBoost classifiers showed that the averaged values of the evaluation metrics were approximately 79%. In contrast, the classification performance was relatively higher using the unsupervised features extracted from the deep learning algorithms under all experimental conditions (averaged evaluation metric value was 87%).

IV. DISCUSSION
In our study, we attempted to classify CSAT levels based on machine learning classifiers with several features extracted from fNIRS signals. Unsupervised features and handcrafted features were extracted from fNIRS signals in different processes to compare the effects for CSAT-level classification.
To propose reasonable evidence for our research topics (classifying CSAT levels through machine learning algorithms with fNIRS signals), we identified several associated previous studies on two aspects (learning ability evaluation with neuro-related dataset and analysis with machine or deep learning algorithms).
First, considering the relationship between neuro-related datasets (e.g., EEG or fNIRS) and learning ability, Kaewkamnerdpong [60] evaluated human learning ability using neuroimaging (EEG and fNIRS signals). Based on the experimental results, the author suggested that utilizing the real-time brain state for evaluation of the target learning ability was valuable. Artemenko et al. [61] collected fNIRS signals in event-related potential (ERP) measurements to investigate an individual's math ability. The authors compared the variations of fNIRS waves with arithmetic materials between high and low performers. Soltanlou et al. [62] examined cognitive development related to mathematics and language using fNIRS signals. Brain activation changes were measured during language- and mathematics-skill-related experiments in groups of schoolchildren. Based on previous research, including the aforementioned studies, we concluded that the application of neuroimaging techniques (especially fNIRS) was suitable for the classification of CSAT levels as learning ability measurement.
Second, in terms of analysis through machine learning or deep learning algorithms, Benerradi et al. [63] applied machine learning and deep learning classifiers to classify mental workload status in continuous human-computer interaction (HCI) research with an fNIRS dataset. They examined the promise of machine learning models for fNIRS analysis. Hosseini et al. [64] discovered discriminative characteristics within fNIRS data collected from children with language disorders. Five machine learning classifiers were used to detect hemodynamic differences between healthy and disordered groups. Rojas et al. [65] suggested a classification framework for pain assessment using fNIRS signals collected from nonverbal patients. K-nearest neighbor algorithms were used for pain assessment. The authors focused on the advantages of machine learning models for investigating functional biomarkers of pain using fNIRS signals. Based on the previously mentioned studies, we verified that machine learning models have the potential to analyze fNIRS datasets for CSAT level classification. As a result, we confirmed that our research topic, CSAT level classification with fNIRS signals based on machine learning algorithms, was well founded.
To reflect variations in fNIRS signals for CSAT level classification, we utilized several features used in previous studies. Khan and Hong [66] extracted eight features (mean oxyhemoglobin, mean deoxyhemoglobin, skewness, kurtosis, signal slope, number of peaks, sum of peaks, and signal peak) from prefrontal fNIRS signals. The extracted features were applied to classify neural states as alert or drowsy. A total of 15 window conditions were used to extract the features. Yoo et al. [67] extracted mean, slope, kurtosis, and skewness features to decode multiple sound categories from fNIRS signals collected from the auditory cortex. Yang et al. [68] considered seven features (HbO mean, HbR mean, HbO slope, HbR slope, time to peak in hemodynamic response, skewness, and kurtosis) extracted from fNIRS signals as digital biomarkers to identify mild cognitive impairment (MCI). Among the diverse features utilized in previous studies, we identified and extracted eight common features to compare their influence on classification performance with that of the unsupervised features in our experimental settings. In addition, differences in window length conditions for feature extraction have been examined in previous studies [69]-[71]. In this regard, we extracted and applied handcrafted features under nine window length conditions (from 2 s to 10 s).
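The windowed feature extraction described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the function name `window_features`, the 10 Hz sampling rate, the use of non-overlapping windows, and the choice of excess kurtosis are all assumptions made for the example, and the input is a synthetic ramp instead of a real HbO or HbR trace.

```python
import numpy as np

FEATURE_NAMES = ["mean", "slope", "minimum", "peak",
                 "skewness", "kurtosis", "variance", "std"]

def window_features(signal, fs, window_s):
    """Return one row of eight features per non-overlapping window.

    Assumptions (not taken from the paper): windows do not overlap,
    the slope is a least-squares linear fit, and kurtosis is excess kurtosis.
    """
    signal = np.asarray(signal, dtype=float)
    n = int(round(fs * window_s))        # samples per window
    n_windows = signal.size // n         # incomplete trailing window is dropped
    t = np.arange(n) / fs                # time axis within one window (s)
    rows = []
    for k in range(n_windows):
        w = signal[k * n:(k + 1) * n]
        mean = w.mean()
        std = w.std()                    # population standard deviation
        centered = w - mean
        slope = np.polyfit(t, w, 1)[0]   # first coefficient is the slope
        # Standardized moments; flat windows (std == 0) get 0 by convention.
        skew = np.mean(centered ** 3) / std ** 3 if std > 0 else 0.0
        kurt = np.mean(centered ** 4) / std ** 4 - 3.0 if std > 0 else 0.0
        rows.append([mean, slope, w.min(), w.max(), skew, kurt, w.var(), std])
    return np.array(rows)

# Synthetic 20 s ramp sampled at a hypothetical 10 Hz.
fs = 10.0
ramp = 0.5 * np.arange(200) / fs
features = window_features(ramp, fs, window_s=4)

# Number of windows obtained for each of the nine window lengths (2 s to 10 s).
by_window = {w: window_features(ramp, fs, w).shape[0] for w in range(2, 11)}
```

Shorter windows yield more rows per recording but fewer samples per estimate, which is one reason the window length can affect classification performance.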
In many previous studies that analyzed datasets using machine learning or deep learning algorithms, datasets collected from a sufficient number of participants were used. For example, Jang et al. [72] used an electrocardiogram (ECG) signal dataset measured from 1,278 patients to train deep learning models. They showed that the amount of data was sufficient to train and evaluate algorithms. Additionally, Jang et al. [41] utilized an actigraphy dataset gathered from 14,482 healthy individuals for analysis. However, in our study, we obtained fNIRS signals from 73 participants. We considered that this number of participants could be insufficient for direct analysis with deep learning algorithms. To overcome the shortage of data, we selected a transfer learning approach based on previous studies [73], [74]. To extract unsupervised features from deep learning algorithms, one-dimensional convolutional autoencoder models were first trained using the BCI IV EEG dataset. Whether the characteristics of the EEG data were well reflected in the model parameters was verified by comparing the reconstructed and actual signals under five hyperparameter conditions. As a result, an encoder module trained on EEG signals was used as a feature extractor to extract unsupervised fNIRS features. Moreover, because the EEG and fNIRS datasets were collected from the same domain, we examined whether the pretrained algorithm was suitable for feature extraction without additional fine-tuning.
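The idea of reusing a pretrained encoder as a fixed feature extractor can be illustrated with a toy forward pass. The sketch below is not the autoencoder architecture of this study: `FrozenConvEncoder` is a hypothetical single-layer encoder, its kernels are random placeholders standing in for weights learned on EEG data, and the input is random noise standing in for an fNIRS window. It only shows that the encoder weights are applied as they are, with no update step.

```python
import numpy as np

class FrozenConvEncoder:
    """Toy one-layer 1D convolutional encoder with fixed weights (hypothetical)."""

    def __init__(self, kernels, stride=2):
        # kernels: array of shape (n_filters, kernel_size), assumed to have
        # been learned beforehand (e.g., by autoencoder training on EEG).
        self.kernels = np.asarray(kernels, dtype=float)
        self.stride = stride

    def encode(self, x):
        """Strided cross-correlation followed by ReLU, then flattening."""
        x = np.asarray(x, dtype=float)
        k = self.kernels.shape[1]
        starts = np.arange(0, x.size - k + 1, self.stride)
        patches = x[starts[:, None] + np.arange(k)]   # (n_positions, k)
        activations = np.maximum(patches @ self.kernels.T, 0.0)
        return activations.T.ravel()                  # (n_filters * n_positions,)

rng = np.random.default_rng(0)
placeholder_kernels = rng.standard_normal((4, 5))  # stand-in for pretrained weights
encoder = FrozenConvEncoder(placeholder_kernels)

fnirs_window = rng.standard_normal(64)             # stand-in for one fNIRS window
unsupervised_features = encoder.encode(fnirs_window)
```

In such a setup, the flattened activations would be passed to a downstream classifier, while the encoder itself stays unchanged (no fine-tuning).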
From our experimental results, the overall classification performances were compared to find optimized algorithms for our research topics. Among the five classifiers, the XGBoost classifier showed the highest evaluation metric values under all experimental conditions. Similar results have been reported in previous studies. Zhu et al. [75] classified major depressive disorder groups using machine learning algorithms based on collected fNIRS signals. In this case, the performance of the XGBoost classifier was higher than that of the random forest classifier. Additionally, Khan et al. [76] verified the suitability of XGBoost classifiers in a finger movement classification task using an fNIRS dataset. As a result, we confirmed that the XGBoost algorithms were most suitable for fNIRS signal analysis for CSAT-level classification in our research scheme.
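Comparing classifiers by an averaged metric value can be sketched as below. The example makes several assumptions that the text does not confirm: precision, recall, and F1-score are macro-averaged over classes, the area under the curve is omitted because it requires predicted scores rather than labels, and the label vectors are made up for illustration.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Hypothetical three-class labels (e.g., three CSAT level groups).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
metrics = macro_metrics(y_true, y_pred)

# Single summary value per classifier: the mean of the individual metrics.
averaged_metric = sum(metrics.values()) / len(metrics)
```

Running the same function on the predictions of each classifier gives one averaged value per model, which can then be ranked.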
In experimental conditions with handcrafted features, averaged classification performances of XGBoost classifiers in eight features were compared to locate the common window length and signal conditions (i.e., HbO and HbR) for each feature extraction. First, in the case of signal slope features, a 4 s length window and HbO concentration signal condition were commonly found in three class conditions (three, four, and five classes). Noori et al. [77] utilized signal slope features with a relatively short window length for calculations. They verified that the experimental conditions using slope features showed the best classification performance.
Second, for signal peak features, the HbR concentration signal and a 3 s window were found in all class conditions. Third, for signal standard deviation features, the HbO signal and a 6 or 7 s window (three-class condition: 6 s window; four-class condition: 7 s window; five-class condition: 6 s window) were found.
Finally, under the HbR signal condition for signal standard deviation features, we found that 4 or 5 s length windows (three class conditions: 4 s window length condition, four class conditions: 5 s window condition, and five class conditions: 4 s window condition) commonly showed the best performance. Ghaffar et al. [78] used signal standard deviation features with a 5 s window for classification in their fNIRS-BCI research. The authors used the KNN and LDA algorithms with standard deviation features and verified higher accuracy than for other frameworks proposed in benchmark studies. By comparing the classification performances between our research and previous studies, we identified similar tendencies in our experimental results with regard to handcrafted features.
Further, to find suitable channel conditions for classification utilizing handcrafted features, we compared the frequency of channels based on the best classification performance in each condition. Fifteen channels were divided into three groups based on their position in the prefrontal regions. Channels 1, 4, 7, 10, 13, and 15 were included in the orbitofrontal cortex (OFC) group. Channels 2, 5, 8, 11, and 14 were sorted into the frontopolar prefrontal cortex group. The remaining channels (channels 3, 6, 9, and 12) were included in the dorsal prefrontal cortex group. Among the three region groups, we found that the frequency of channels belonging to the orbitofrontal cortex group was the largest. Based on these results, the signals collected from the orbitofrontal cortex groups (channels 1, 4, 7, 10, and 13) were found to be relatively more suitable for classifying CSAT levels than other channels in terms of handcrafted features. Spinella and Miley [79] examined the relationships between educational attainment and OFC regions. They found that reinforcing goal-directed behaviors and impulse control from the education process can influence the OFC regions.
To compare the influence of each feature on classification performance, we applied unsupervised features from a feature extractor trained by EEG signals. With regard to the classification performances of the XGBoost classifier, the five evaluation metric values of the unsupervised feature condition were higher than those of the handcrafted features. We confirmed that unsupervised features extracted using deep learning algorithms were more appropriate for classification than the eight handcrafted features. Based on the aforementioned results, we verified that a deep learning algorithm can work well as a feature extractor without fine-tuning when the transfer learning approach is applied between similar domain datasets.
In addition, we sorted the experimental results for each channel in ascending order based on the classification performance to compare the appropriateness of signals collected from channels for classification. After sorting the results, we selected the channel with the highest number of cases for each metric. The frequency of channels in the OFC group was higher than that in the other groups. In the case of channel importance for classification, we found a similar trend with handcrafted features (i.e., signals of the OFC group were relatively appropriate for our research topic).
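The channel-frequency comparison used for both feature types can be sketched as a simple count per region group. The channel-to-group mapping follows the grouping stated in this section; the list of best-performing channels is hypothetical and serves only to show the counting step.

```python
from collections import Counter

# Grouping of the fifteen prefrontal channels as described in the text.
CHANNEL_GROUPS = {
    "OFC": [1, 4, 7, 10, 13, 15],
    "frontopolar": [2, 5, 8, 11, 14],
    "dorsal": [3, 6, 9, 12],
}
CHANNEL_TO_GROUP = {ch: group
                    for group, channels in CHANNEL_GROUPS.items()
                    for ch in channels}

def group_frequency(best_channels):
    """Count how often each region group contains a best-performing channel."""
    counts = Counter(CHANNEL_TO_GROUP[ch] for ch in best_channels)
    return {group: counts.get(group, 0) for group in CHANNEL_GROUPS}

# Hypothetical best channel per experimental condition (not real results).
best_channels = [1, 4, 2, 13, 9, 7]
frequency = group_frequency(best_channels)
most_frequent_group = max(frequency, key=frequency.get)
```

Because the groups contain different numbers of channels (six, five, and four), raw counts favor larger groups; dividing each count by the group size would be one way to account for this.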
In summary, we compared the classification performances of five machine learning classifiers between handcrafted and unsupervised features to verify the usability of the features for CSAT level classification. As a result, the XGBoost classifier was found to be the most suitable for classification. Based on the experimental results, we concluded that unsupervised features were more usable for classifying CSAT levels. In addition, the applicability of the transfer learning approach without fine-tuning was verified for deep learning models across datasets from the same domain (EEG and fNIRS). Furthermore, fNIRS signals measured from the OFC group were more adequate for our research topic than those measured from the other groups.

V. CONCLUSION
In this work, we proposed a machine learning-based framework for classifying CSAT levels using unsupervised features extracted from deep learning algorithms. Based on previous studies on the relationship between learning ability and neural activities, hemodynamics in fNIRS signals were measured with the NIRSIT Lite device to extract handcrafted and unsupervised features. To evaluate our framework from various perspectives, we designed experiments using various class labels and feature extraction conditions. We found that the XGBoost classifier exhibited the best classification performance and that unsupervised features extracted by the feature extractor trained with EEG signals were suitable for classifying the CSAT levels.
The first strength of this study was the application of fNIRS signals, which are not widely used to classify CSAT levels. Second, we determined the ideal conditions for fNIRS signals for feature extraction. Third, in terms of transfer learning, we checked the usability of one-dimensional convolutional autoencoder algorithms as feature extractors across different neural modalities (i.e., EEG and fNIRS). Fourth, an fNIRS dataset collected from undergraduate students at three different universities with eight cognitive task sessions was used to reflect variations in CSAT levels and neural activities.
Our study has some limitations. First, fNIRS signals can include detailed differences between diverse cognitive tasks. These differences can affect the learning ability evaluation results and CSAT levels. However, we considered overall characteristics instead of specific changes to classify the CSAT levels. Second, deep learning algorithms can be used to detect latent patterns in fNIRS signals for CSAT level classification. An additional fNIRS dataset needs to be collected for applying deep learning models in further studies. Finally, to generalize our framework, we need to consider external validation through fNIRS signals collected from other participant groups (e.g., other countries or societies) in further studies.

APPENDIX A
Experimental results with handcrafted features from three classification algorithms (XGBoost, logistic regression, and support vector classifier).
YONGWAN PARK received the Ph.D. degree in marketing from Virginia Tech. He is currently an Assistant Professor of marketing with the College of Business, Gyeongsang National University. His research interests include consumer judgment and decision based on behavioral decision theory, and consumer perception about IT products.
JIHYUN CHA received the Ph.D. degree in cognitive and brain sciences from Washington University, St. Louis. She is a Researcher at OBELAB, Inc. Her research interests include investigating cognitive and neural biomarkers of individual differences and clinical symptoms through academic, clinical, and commercial applications of fNIRS.
JONGKWAN CHOI received the Ph.D. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST). His research interests include portable functional near-infrared spectroscopy (fNIRS), biomedical integrated circuits, and optical communication systems. He received the IEEE International Symposium on Circuits and Systems Second Best Paper Award of the Biomedical and Life Science Circuits, in 2012. Since 2016, he has been with OBELAB, Inc., Seoul, South Korea, a bio start-up that manufactures portable functional brain imaging systems, where he works on developing a system architecture.
SANGHOON HAN was born in 1977. He is currently a Professor with the Department of Psychology, Yonsei University. His research interests include decision making and cognitive science. VOLUME 10, 2022