Deep Learning Approach to Generate a Synthetic Cognitive Psychology Behavioral Dataset

Synthetic data generation is critical in machine and deep learning research to overcome the shortage of samples or dataset sizes. Various algorithms, including the generative adversarial network and autoencoder models, have been applied to generate artificial datasets in previous studies. In this study, we propose a synthetic data generation framework for a tabular dataset collected from cognitive psychology behavioral experiments based on deep learning algorithms. Tabular datasets for the Stroop task were used to develop our framework. On account of the relatively small sample size (N=102) of the dataset used in our study, we used a pre-trained generative adversarial network model to complement the size of the dataset. Furthermore, we proposed and applied five evaluation methods with statistical tests (overlapped sample test, constraint reflection test, correlation reflection test, distribution distance test, and feature distance test) to validate generation performance based on internal levels of table structure (instance level, feature level, and whole-set level evaluations). The proposed framework with a fine-tuned generative adversarial network algorithm was compared with a random generation method to verify generation performance, including the representation of the statistical characteristics of the original datasets. We found that the generated datasets from the proposed framework exhibited more similar statistical characteristics with the original dataset than the randomly generated datasets based on five evaluation methods. The results of this study provide not only generation algorithms for cognitive psychological datasets with tabular type but also a solution to the sample size issue for researchers.


I. INTRODUCTION
Sample or dataset size is considered a critical factor for various data analysis methodologies, including statistical and machine learning methods [1]-[4]. In terms of statistical analysis, many statistical tests require an appropriate sample size to verify the power or reliability of the results [5], [6]. For example, Lachin suggested the importance of sample size determination and power analysis in clinical trials [7]. Additionally, Maccallum et al. introduced a framework to determine the minimum sample size for power in empirical behavioral research [8]. In previous studies, many researchers applied formulas for sample size calculation to support the verification of their research questions or hypotheses [9]-[11]. Moreover, an adequate dataset size is essential for machine and deep learning methodologies. Ajiboye et al. emphasized the size of a dataset to construct supervised learning algorithms [12]. Among the three different sizes of datasets, large datasets showed lower performance errors (mean absolute errors) than other datasets. Furthermore, Sun et al. suggested a relationship between dataset size and model performance in visual deep learning models [13].
The associate editor coordinating the review of this manuscript and approving it for publication was Wentao Fan.
However, there are several reasons for the shortage of datasets in research practice. First, in the case of structured data, including survey results, lack of follow-up or nonresponse by participants can result in missing data [14], [15]. Second, in terms of unstructured data (e.g., actigraphy or electrocardiogram as a time-series), issues with the devices collecting data or participants' mistakes can cause missing or blank data [16], [17]. For instance, Brick et al. attempted to handle missing data due to nonresponses in a survey dataset [18]. Further, Schlomer et al. suggested three handling methods for data missing from participants in counseling psychology research [19]. Angerer et al. replaced missing data owing to the removal of devices from the body with a median value in the analysis of circadian rhythms in patients with brain injuries [20].
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Researchers have attempted a statistical approach to overcome the causes of data insufficiency [21]- [24]. Imputation methods, including both single statistics (e.g., median or mean value) and integration of multiple candidate datasets (i.e., multiple imputation) were applied to treat missing data [25]- [28]. Similarly, bootstrap methods have been utilized to reduce estimation errors in the imputation process [29]. In previous studies that used machine learning or deep learning methodologies, various methods were applied to manage missing datasets. Saqib et al. resampled variables, including missing values, in the analysis of electronic medical records (EMRs) to predict sepsis [30]. Furthermore, Perez and Jason suggested the effectiveness of data augmentation methods in image classification using a deep learning model [31].
In behavioral science fields, such as experimental psychology, sample size is also considered an important factor for data analysis [32]- [35]. Schweizer et al. focused on sample size in sports psychology research [36]. They emphasized the disadvantage small sample sizes had in improving confidence in analysis results. Furthermore, Sassenberg et al. compared trends in social psychology research from 2011 to 2016 [37]. In Sassenberg's research, the sample size used in the study gradually increased to complement the statistical power.
In particular, large-scale datasets are considered a promising factor in the field of cognitive psychology. Peterson et al. suggested that large-scale datasets and machine learning algorithms can be used to identify new cognitive or behavioral phenomena [38]. They focused on risky choices and extensively studied issues in decision theory [39], [40]. In addition, Agrawal et al. proposed methodologies for building models and identifying novel phenomena in large datasets [41]. To overcome noise artifacts included in the datasets, they utilized sufficiently large datasets with data-driven models.
However, some challenges remain regarding the collection of large datasets through experimental research. First, in the case of repetitive and difficult tasks, participants can select extreme or incompatible answers. Inconsistent responses or outliers in the experimental results affect the overall sample size and analysis results [42], [43]. Second, negative changes in the environment, including the Covid-19 pandemic, can affect the recruitment of study participants. Suspension of follow-up for specific groups or experiments also influences the overall study [44]. Although researchers can use alternative tools, such as Amazon's Mechanical Turk (MTurk), these have potential limitations regarding research materials and conditions [45]. Consequently, several methodologies for data augmentation or resampling need to be considered to increase sample size.
In the case of tabular datasets collected in behavioral experiments in cognitive psychology studies, several characteristics can limit the application of data augmentation methodologies proposed in previous studies. First, individual variables within a dataset are deeply associated with each other [46], [47]. For example, in the popular Stroop test, the reaction time of participants refers to the reaction of participants through cognition about the proposed material, including words and colors. It indicates that the variables of reaction time and material (e.g., words, objects, and colors) are not independent. Second, different types of variables are included in the datasets [48]- [50] that consist of categorical and countable variables, not just continuous variables. For example, datasets can include age or reaction time variables as continuous variables and specific groups and levels as categorical variables. Consequently, many characteristics of the behavioral experiment dataset face challenges in applying the proposed augmentation methods (e.g., extracting, transforming, and random sampling).
In many studies, machine and deep learning methods have been applied to propose imputation or augmentation methodologies. Lashgari et al. introduced a data augmentation method based on a deep learning model for electroencephalography (EEG) [51]. Jang et al. suggested a deep-learning-based imputation methodology for missing intervals in actigraphy data [52]. In addition, Rizos et al. proposed a deep learning method for short-text data augmentation in speech classification [53].
Based on diverse methods with machine and deep learning models, we attempted to suggest a data generation framework for synthetic behavioral datasets with deep learning models. Various algorithms have been used to propose data generation frameworks in previous studies. Semeniuta et al. utilized convolutional variational autoencoder algorithms to generate text datasets [54]. Similarly, Guan et al. suggested a generation method for electronic medical record datasets using generative adversarial network (GAN) algorithms [55].
In our study, we applied GAN algorithms to generate a behavioral experiment dataset with a tabular structure. To improve the performance of the framework, pre-trained GAN algorithms for tabular datasets were applied [56]. We fine-tuned a pre-trained algorithm using an open-source psychology behavioral experiment dataset with a Stroop task collected from 102 participants [57], [58]. We generated 1000 datasets by applying both deep learning algorithms and random generation and compared the results to evaluate the performance of our framework. Moreover, we proposed five evaluation tests using an internal dataset level. First, an overlapped sample test was applied to evaluate whether the deep learning model simply copied the data. Second, the constraint reflection test evaluated the reflection of the range of individual variables (minimum, median, and maximum values) in the generated dataset with the original range. The first and second methods checked differences with the original dataset at the instance and row levels (i.e., instance level evaluation). Third, the correlation between the variables in the generated dataset was examined using the correlation reflection test. Fourth, the distances of variables between the original and generated datasets were evaluated using the distribution distance test. In the third and fourth tests, we investigated statistical characteristics in terms of variable and feature levels (i.e., feature-level evaluation). Finally, in the feature distance test, we compared the feature distance with the extracted latent features using a pre-trained AlexNet model. In the last test, latent features inherent in the dataset were compared using Euclidean and Manhattan distances (i.e., whole-set level evaluation). Based on the five aforementioned tests, we examined whether the generated dataset had statistical characteristics similar to those of the original dataset.
The objective of this study was to develop a synthetic behavioral experiment dataset generation framework based on open-source Stroop task data using GAN algorithms. The major contributions of this study are as follows:
(1) We proposed a GAN-based data generation framework for a behavioral experiment dataset in the field of cognitive psychology based on an open-source Stroop task dataset. In addition, we applied a relatively large dataset (N=102) to reflect the statistical characteristics of Stroop tasks. We also evaluated the generation performance of the framework compared to a randomly generated dataset.
(2) Advancing from generating a synthetic tabular dataset of behavioral experiment data, we proposed five individual tests based on statistical tests (overlapped sample test, constraint reflection test, correlation reflection test, distribution distance test, and feature distance test) to examine various characteristics in the generated dataset. Furthermore, we compared the generation performances at three levels for the tabular dataset (instance level, feature level, and whole-set level evaluation) based on the five tests.
(3) Based on the synthetic dataset with similar statistical characteristics, our framework can help overcome a shortage in sample size. In addition, environmental restrictions on conducting experimental studies, including the Covid-19 pandemic, can be overcome with artificial datasets. Furthermore, the fatigue or physical burden of participants can be reduced by complementing collected data with generated datasets.
The remainder of the paper is structured as follows: Section II includes a detailed description of the methodologies and Stroop task dataset used in the study. In Section III, the generation performance of the proposed deep learning-based framework is described. In Section IV, we discuss the results of the experiments and their implementation. Finally, conclusions and a summary of our study are presented in Section V.

A. OVERVIEW
This study consisted of four phases. First, we collected behavioral experimental datasets from cognitive psychology research. Second, the pre-trained GAN algorithm was fine-tuned using the datasets collected in the first phase. Third, we generated 1000 datasets with the same sample size using a fine-tuned GAN model and random generation methods. Finally, five evaluation tests were conducted to examine the generated datasets. The detailed procedure is shown in Figure 1.

B. DATA SOURCES
In this study, we used the open-source cognitive psychology dataset released by the Leibniz Institute for Psychology (ZPID) in Germany [57]. Among several psychological datasets, we selected one based on a Stroop task, a well-known experimental design in cognitive psychology. The selected dataset was collected from 102 participants (54 females and 48 males) to examine associative and affective congruency effects. Two words (priming and target words) were successively shown to evaluate the priming effect of words. After being shown the priming words, the participants were instructed to choose the terms associated with the words shown earlier. Their responses were collected vocally using a microphone. The reaction time of the participants regarding selection was recorded to evaluate the priming effects. To precisely measure their responses, several variables related to the response were stored in the dataset files. This dataset consists of 21 columns, and the descriptions of each column are listed in Table 1. Additionally, the dataset consists of two sub-datasets of experiments with different objectives. In the first sub-dataset, category-specific priming effects were examined using related words and reaction times. The priming effect is a phenomenon that affects reaction through exposure to certain stimuli (e.g., words or colored objects) [59].
In the case of category-specific priming effects, researchers wanted to confirm the effect by showing words belonging to a similar semantic category and examining participants' responses to them.
In the second sub-dataset, the color condition was added to the task design of the first sub-dataset. The dimensions of the first sub-dataset were (5184, 21) (number of rows and columns), and the second sub-dataset were (9216, 21).

C. DATA PREPROCESSING
1) SELECT VARIABLES FROM DATASET FOR EXPERIMENTS
To apply the appropriate characteristics of the Stroop task dataset, we extracted only eight variables (VPID, MAT_GR, M_W, ALTER, TARG, PRIM, RT1, and RT2) from the 21 columns. We selected the variables needed for data analysis so that the generated datasets would be practically applicable. First, demographic information (M_W and ALTER) was needed to consider differences in age and sex in the behavioral results. Second, three variables (MAT_GR, TARG, and PRIM) for the experimental material were selected to check the words shown to the participants. Third, reaction time variables (RT1 and RT2) were selected to evaluate the effects of priming and target words. After selecting these variables, the dimensions of the dataset were changed from (5184, 21) and (9216, 21) to (5184, 8) and (9216, 8), respectively.

2) REMOVE MISSING OR EXTREME SAMPLES IN DATASET
In the Stroop task dataset, missing or extreme values in the reaction time of participants were coded with '9999' values. We examined the distributions of the three continuous variables (ALTER, RT1, and RT2). Based on this examination, we established that the RT1 and RT2 variables in the Stroop 1 sub-dataset included extreme samples. After removing rows with extreme RT1 and RT2 values, the Stroop 1 sub-dataset had dimensions of (5180, 8) without missing or extreme samples. In the case of the Stroop 2 sub-dataset, extreme samples were not included in the dataset. Therefore, the dimensions of the Stroop 2 sub-dataset did not change. Histograms of the distribution of variables are shown in Figure 2.
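The two preprocessing steps described above can be sketched in plain Python. This is a minimal illustration and not the authors' code: rows are assumed to be dictionaries keyed by the column names given in the text, the sentinel value 9999 is taken from the text, and the helper names, the toy rows, and the "EXTRA" column are hypothetical.

```python
# Columns kept for the experiments, as listed in the text.
SELECTED = ["VPID", "MAT_GR", "M_W", "ALTER", "TARG", "PRIM", "RT1", "RT2"]
MISSING_CODE = 9999  # sentinel used for missing or extreme reaction times


def select_columns(rows, columns=SELECTED):
    """Keep only the selected columns of each row (one dict per row)."""
    return [{name: row[name] for name in columns} for row in rows]


def drop_coded_rows(rows, rt_columns=("RT1", "RT2"), code=MISSING_CODE):
    """Remove rows whose reaction-time columns hold the sentinel code."""
    return [row for row in rows if all(row[c] != code for c in rt_columns)]


# Toy rows with hypothetical values; "EXTRA" stands in for an unselected column.
raw = [
    {"VPID": 1, "MAT_GR": 1, "M_W": 1, "ALTER": 23, "TARG": "a",
     "PRIM": "b", "RT1": 512, "RT2": 640, "EXTRA": 0},
    {"VPID": 2, "MAT_GR": 2, "M_W": 2, "ALTER": 31, "TARG": "c",
     "PRIM": "d", "RT1": 9999, "RT2": 700, "EXTRA": 0},
]
clean = drop_coded_rows(select_columns(raw))
```

In this toy input, the second row is dropped because its RT1 value carries the sentinel code, and the unselected column is removed from the remaining row.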

D. PRE-TRAINED GAN MODEL FOR TABULAR DATASET
In this study, we attempted to generate a behavioral experimental dataset. The Stroop task dataset was too small to train and evaluate deep learning algorithms from scratch with high performance. To complement the sample size, transfer learning and fine-tuning of a pre-trained algorithm were applied. To improve the performance of our framework, pre-trained conditional GAN models for tabular datasets were used in our study [56]. This model, similar to other general GAN algorithms, consists of two sub-modules (i.e., generator and discriminator). To generate tabular datasets with data distributions, conditional vectors were concatenated in the calculation process of the generators. In addition, normalization was applied to each feature to deal with complicated distributions. The authors named the generator module containing the conditional vector the "conditional generator." The conditional generator was constructed from a total of ten fully connected layers. Apart from the input and output layers, the first three hidden layers were composed of 256 neurons each, and the last five hidden layers were composed of 512 neurons each. ReLU activation functions and batch normalization were applied to each layer.
The discriminator module (i.e., the critic) in this model consisted of five fully connected layers, with 256 neurons in each hidden layer. Leaky ReLU activation functions and dropout with a ratio of 0.2 were used for the four hidden layers, excluding the input and output layers. For model training, the Adam optimizer and a learning rate of 2 × 10⁻⁴ were used. The detailed parameters of the model are listed in Table 2.
These algorithms were validated with the Census Income, KDD Cup 1999 Data, and Online News Popularity datasets, which have tabular structures. We selected and applied pre-trained conditional GAN algorithms because the size of our dataset was relatively insufficient for building a model; the pre-trained GAN algorithms were also trained with similar types of tabular datasets. To increase the usability of the proposed framework in terms of reproducibility, we used the same parameters (e.g., model architecture) and hyperparameters (e.g., optimizer or learning rate) of the pre-trained model for fine-tuning. Consequently, for fine-tuning the pre-trained algorithm, the Adam optimizer, a learning rate of 2 × 10⁻⁴, and 300 epochs were used as training hyperparameters.

E. GENERATED TABULAR DATASET BY PRE-TRAINED GAN MODEL
After fine-tuning the pre-trained conditional GAN model, we generated 1000 datasets with the same sample size as the original datasets to evaluate the generation performance of the framework. For example, in the case of the Stroop 1 sub-dataset, 1000 different datasets with dimensions (5180, 8) were generated. The Stroop 2 sub-dataset was applied to generate 1000 datasets with dimensions (9216, 8).

F. RANDOMIZE GENERATED DATASET
To evaluate the generation performance with fine-tuned GAN algorithms with a Stroop task dataset, we randomly generated 1000 datasets with the same sample size. Instances in the generated datasets were selected from the column values of the original dataset. For example, ALTER column values in the generated dataset were randomly selected within values of the same variable from the original datasets.
Based on this process, we generated 1000 datasets for the Stroop 1 and Stroop 2 sub-datasets. The dimensions of the randomly generated datasets were the same as those of the original dataset.
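The random generation baseline described above can be sketched as independent per-column sampling. This is a minimal illustration under stated assumptions: the data are stored column-wise in a dictionary, values are drawn with replacement (the text does not say whether sampling was with or without replacement), and the function name, seed, and toy values are hypothetical.

```python
import random


def random_generation(columns, n_rows, rng):
    """Draw each column independently from the values observed in the original."""
    return {
        name: [rng.choice(values) for _ in range(n_rows)]
        for name, values in columns.items()
    }


# Toy original data stored column-wise (hypothetical values).
original = {
    "ALTER": [21, 25, 30, 42],
    "RT1": [480, 515, 602, 733],
    "M_W": [1, 2, 2, 1],
}
rng = random.Random(0)  # fixed seed so that the sketch is reproducible
synthetic = random_generation(original, n_rows=4, rng=rng)
```

Because each column is drawn independently, a baseline of this kind keeps every value inside the observed set for its column but would not be expected to preserve relationships between columns.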

G. EVALUATION METHODS BY LEVEL OF DATASET
In our study, we proposed a deep-learning-based generation framework for a synthetic behavior experiment dataset. For a detailed evaluation of the framework, we considered the inherent levels of the tabular dataset. A total of three standards were applied. First, instance and row-level evaluations were considered. An overlapped sample was checked for an instance and row in the dataset. Second, variable and feature levels were applied. The distribution and characteristics of the variables were confirmed. Finally, a whole-set level evaluation was performed. In the case of the whole-set level, the overall characteristics and latent features of the dataset were compared. Detailed descriptions of each evaluation method are provided in the following subsections.

1) INSTANCE LEVEL EVALUATION
a: OVERLAPPED SAMPLE TEST
In this test, we attempted to confirm the overlapped samples using the original dataset. If there is an overlapped sample in the generated datasets, it is considered a copy rather than a generation. To verify the duplicated samples, we organized a test in four steps. First, both RT1 and RT2 values for the same word pair were extracted from the generated datasets based on the TARG and PRIM words in the original dataset. Second, a one-sample t-test was applied to examine the difference between the RT1 and RT2 values in the original and the generated datasets.
The null hypothesis of the test was that the RT1 and RT2 values in the generated dataset were the same as the original values. Third, the number of word pairs (TARG and PRIM) with statistically significant differences was counted. Finally, the test index was calculated as the ratio of the number of statistically significant results to the total number of results.
An example of the calculation in the test is depicted in Figure 3.
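The four steps of the overlapped sample test can be sketched as follows. This is an illustration under stated assumptions and not the authors' implementation: the original value of each word pair is treated as a single reference number, significance is decided by comparing the t statistic with a caller-supplied critical value (2.776 is the two-sided 5% critical value for 4 degrees of freedom), and all names and toy values are hypothetical. In practice, a statistics library would normally be used to obtain exact p-values.

```python
import math
import statistics


def one_sample_t(values, reference):
    """t statistic for the null hypothesis that the mean of `values` equals `reference`.

    Requires at least two values that are not all identical.
    """
    n = len(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return (statistics.mean(values) - reference) / standard_error


def overlap_index(generated_by_pair, original_by_pair, critical_t):
    """Share of word pairs whose generated values differ significantly from the original."""
    significant = 0
    for pair, values in generated_by_pair.items():
        t = one_sample_t(values, original_by_pair[pair])
        if abs(t) > critical_t:
            significant += 1
    return significant / len(generated_by_pair)


# Toy RT1 values per (TARG, PRIM) word pair (hypothetical).
original_rt1 = {("sun", "moon"): 500.0, ("cat", "dog"): 600.0}
generated_rt1 = {
    ("sun", "moon"): [498.0, 502.0, 500.0, 501.0, 499.0],
    ("cat", "dog"): [650.0, 655.0, 645.0, 652.0, 648.0],
}
index = overlap_index(generated_rt1, original_rt1, critical_t=2.776)
```

In this toy input, the first word pair reproduces the original value on average and the second does not, so half of the pairs are counted as significantly different.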

b: CONSTRAINT REFLECTION TEST
Each variable has a range of values, so we verified whether the generated values in each row fell within the range of the original variable. We attempted to confirm the reflection of ranges from the minimum, median, and maximum values. This test consisted of three steps. First, we calculated the minimum, median, and maximum values of the continuous variables (ALTER, RT1, and RT2) in the original dataset. Second, the same values were calculated from the generated datasets. Finally, the absolute differences between the original and generated datasets were calculated to evaluate the reflection status of the generated datasets. An example of the test application is shown in Figure 4.
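The three steps of the constraint reflection test can be sketched for a single continuous variable. This is a minimal illustration; the function names and the toy reaction times are hypothetical.

```python
import statistics


def range_summary(values):
    """Minimum, median, and maximum of one continuous variable."""
    return {
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }


def constraint_differences(original, generated):
    """Absolute differences between the range summaries of two samples."""
    reference = range_summary(original)
    candidate = range_summary(generated)
    return {key: abs(reference[key] - candidate[key]) for key in reference}


# Toy RT1 values (hypothetical).
original_rt1 = [450, 500, 520, 610, 700]
generated_rt1 = [440, 495, 530, 600, 720]
diffs = constraint_differences(original_rt1, generated_rt1)
```

Smaller differences indicate that the generated variable stays closer to the range of the original variable.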

2) FEATURE LEVEL EVALUATION
a: CORRELATION REFLECTION TEST
In the Stroop task dataset used in this study, the dataset showed a correlation between each variable. We examined the correlation reflections in the generated dataset. This test was conducted in three steps. First, we calculated the correlation coefficients for both the original and generated datasets. Second, the averaged coefficients of the variables from the generation methods were compared with the coefficients from the original dataset using absolute differences. Finally, we evaluated the reflection status of the generation methodologies by comparing the differences. An example of this test is presented in Figure 5.
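The correlation reflection test can be sketched as follows. Several details are assumptions made only for illustration: the text does not name the correlation coefficient, so Pearson's coefficient is used here; the per-pair absolute differences are summarized by their mean; and the function names and toy values are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def correlation_table(columns):
    """Coefficient for every unordered pair of columns."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(sorted(columns), 2)
    }


def correlation_gap(original, generated_sets):
    """Mean absolute difference between the original coefficients and the
    coefficients averaged over all generated datasets."""
    reference = correlation_table(original)
    tables = [correlation_table(dataset) for dataset in generated_sets]
    gaps = []
    for pair, coefficient in reference.items():
        averaged = sum(table[pair] for table in tables) / len(tables)
        gaps.append(abs(averaged - coefficient))
    return sum(gaps) / len(gaps)


# Toy data (hypothetical): the second dataset reverses the RT1 column.
original = {
    "ALTER": [20, 25, 30, 35],
    "RT1": [500, 520, 540, 560],
    "RT2": [600, 590, 620, 610],
}
reversed_rt1 = dict(original, RT1=[560, 540, 520, 500])
gap_same = correlation_gap(original, [original])
gap_reversed = correlation_gap(original, [reversed_rt1])
```

A gap close to zero means that the generated datasets reproduce the correlation structure of the original dataset on average.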

b: DISTRIBUTION DISTANCE TEST
Through the preprocessing step, we confirmed that each variable in the dataset has its own distribution (Figure 2). In this test, we compared the distributions of the original and generated values of the variables. In the Stroop task dataset, five categorical variables (MAT_GR, M_W, TARG, PRIM, and ERR) and three continuous variables (ALTER, RT1, and RT2) were included. We applied the Hamming distance metric to compare the categorical variables. The Hamming distance indicates the quantified differences between two data vectors consisting of categorical data [60]. A 2-sample Kolmogorov-Smirnov (KS) test was used to compare the continuous variables.
We checked whether the two compared samples were drawn from the same distribution [61]. A statistically significant test result (p < 0.05) indicates that the two samples are unlikely to have been drawn from the same distribution. Furthermore, the KS statistic quantifies the distance between the empirical cumulative distribution functions of the two variables. After applying the metrics to the variables, we compared the average distance values between the generated datasets. Figure 6 presents an outline of the distribution distance test.
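The two distance measures used in the distribution distance test can be sketched in plain Python. This is an illustration under stated assumptions: the Hamming distance is normalized by the vector length (the text does not say whether a count or a proportion was used), only the KS statistic is computed (no p-value), and the function names and toy values are hypothetical.

```python
from bisect import bisect_right


def hamming_distance(a, b):
    """Share of positions at which two equal-length categorical vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    largest_gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for value in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, value) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, value) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Toy values (hypothetical).
categorical_gap = hamming_distance(["a", "b", "c", "d"], ["a", "x", "c", "y"])
identical_gap = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
shifted_gap = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
disjoint_gap = ks_statistic([1, 2, 3, 4], [5, 6, 7, 8])
```

Both measures lie between 0 and 1 in this form, with 0 meaning identical vectors or identical empirical distributions.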

3) WHOLE-SET LEVEL EVALUATION
a: FEATURE DISTANCE TEST
In the previous four evaluation tests, we verified the differences in fragmentary characteristics (instance and feature levels) in the dataset. Furthermore, we attempted to evaluate the inherent characteristics of the datasets using latent features. A pre-trained AlexNet model was applied to extract the latent features. Before applying the dataset to a pre-trained model, the TARG and PRIM variables were converted from word to categorical dummy values. Three conditions of the features (3, 5, and 7 feature lengths) were extracted to evaluate them.
After extracting the features, we applied the Minkowski distance metric, which is a generalized version of the Euclidean and Manhattan distances. The Minkowski distance was calculated using (1) [62]:
$$D(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^{p} \right)^{1/p} \tag{1}$$
where X = (x_1, ..., x_n) and Y = (y_1, ..., y_n) are the two feature vectors being compared and n is the feature length. In (1), if the value of p (power) is 1, it is the same as the Manhattan distance, which is the L1 norm. In addition, when the value of p is 2, the distance is the same as the Euclidean distance, which is the L2 norm. In our study, both cases, where p was 1 and 2, were evaluated. An outline of the feature distance test is shown in Figure 7.
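The Minkowski distance described above, together with its two special cases, can be sketched as a short function. The function name and the toy feature vectors are hypothetical and serve only to illustrate the calculation.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Toy latent feature vectors of length 3 (hypothetical values).
features_original = [0.0, 3.0, 1.0]
features_generated = [4.0, 0.0, 1.0]
manhattan = minkowski_distance(features_original, features_generated, p=1)
euclidean = minkowski_distance(features_original, features_generated, p=2)
```

With p = 1 the absolute coordinate differences are summed (4 + 3 + 0), and with p = 2 the result is the straight-line distance between the two vectors.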

H. STATISTICAL VERIFICATION
After we received the results of applying the five evaluation methods, we compared the characteristics and distance of the generated dataset between the fine-tuned GAN algorithm and randomized generation. To identify the differences more clearly, we used statistical tests for the evaluation results. For example, we found differences in the averaged distance values between datasets from the GAN and random generation.
To confirm the differences between the two values, we applied a two-sample t-test to the distances. Because different methodologies were used for data generation, we assumed that the two sets of distances were independent. The null hypothesis of the test was that the difference in the average distance values between the GAN and random generation was 0.
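The comparison of average distances between the two generation methods can be sketched as follows. Two choices here are assumptions and are not stated in the text: the unequal-variance (Welch) form of the two-sample t statistic is used, and the p-value is approximated with the standard normal distribution, which is reasonable only for large samples such as the 1000 generated datasets per method. The function names and toy values are hypothetical.

```python
import math
import statistics


def welch_t(a, b):
    """Two-sample t statistic for independent samples with unequal variances."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def approx_two_sided_p(t):
    """Two-sided p-value from a normal approximation (large samples only)."""
    return 2.0 * (1.0 - statistics.NormalDist().cdf(abs(t)))


# Toy distance values (hypothetical); real use would pass one value per dataset.
gan_distances = [10.0, 12.0, 11.0, 13.0]
random_distances = [20.0, 22.0, 21.0, 23.0]
t_value = welch_t(gan_distances, random_distances)
p_value = approx_two_sided_p(t_value)
```

A negative t value in this sketch means that the first sample (here the GAN distances) has the smaller average distance.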

I. TOOLS
All code for the deep learning model and data preprocessing was written using Python (version 3.6.0) and the PyTorch framework (version 10.0.1). Statistical figures were drawn using R (version 4.0.3).

III. RESULTS
We evaluated the generation performance of the proposed deep-learning-based framework using five test methods divided by the internal levels of the dataset (instance level, feature level, and whole-set level evaluation). In the case of the overlapped sample test at the instance level evaluation, we evaluated different samples in the generated dataset based on one-sample t-test results. Then, the number of samples with a significant p-value was calculated as a ratio of the total number of results. In the Stroop 1 sub-dataset, the GAN-based model showed 68.17% and random generation showed 57.92% of significantly different samples for the RT1 variable. Additionally, for the RT2 variable, 67.47% and 65.15% were confirmed for the GAN-based model and random generation, respectively. In the Stroop 2 sub-dataset, we found 90.62% for the GAN-based model and 79.86% for the random generation in the RT1 variable.
For the RT2 variable, 80.38% and 84.72% were found in the GAN-based and random generation models, respectively. Table 3 lists the detailed results of the overlapped sample tests.
Additionally, in the constraint reflection test in instance level evaluation, the reflection of the range of each variable with minimum, median, and maximum values was compared. The differences between the three range values (minimum, median, and maximum) were compared with the original dataset values to evaluate the reflection status. In the Stroop 1 sub-dataset, the GAN-based model … Tables 4 and 5.
Second, for feature level evaluation, we compared the absolute differences of the averaged correlation coefficients between the original and generated datasets using the correlation reflection test.
In the Stroop 1 sub-dataset, the GAN-based model condition showed a difference of 0.101 from the correlation coefficients of the original to the coefficients of the generated datasets by the GAN-based model.
In addition, 0.138 was observed under the random generation conditions. In the Stroop 2 sub-dataset, we observed differences of 0.071 in the GAN-based model condition and 0.108 in the random generation condition. The results of the correlation reflection test, statistical test, and absolute correlation coefficient value list are shown in Tables 6, 7, Tables 9 and 10, respectively.
Finally, we examined the feature distance test to compare the distances of latent features between the original and generated datasets to evaluate the generated dataset at the whole-set level. Three, five, and seven length feature conditions were evaluated. In the Stroop 1 sub-dataset, in the case of seven length features, the GAN-based model showed an average distance of 34901.9; 83229.6 and 102138.0 were found for five and three length features.
Furthermore, 73728.9, 115097.1, and 123034.9 were confirmed as averaged distance values for seven, five, and three length features, respectively, for random generation.
In the Stroop 2 sub-dataset, 230547.1, 194275.8, and 190009.5 were found for seven, five, and three length feature conditions, respectively, from the GAN-based model. In addition, 348281.5, 293877.7, and 250526.9 were checked for random generation. The detailed results and statistical test results are listed in Tables 11-14.

IV. DISCUSSION
In our study, we attempted to generate a behavior experiment dataset with a tabular structure collected from cognitive psychology research and based on fine-tuned GAN algorithms. The Stroop tasks dataset was applied to verify our research agenda: artificial dataset generation for the behavioral experiment dataset using a deep learning algorithm.
To provide reasonable evidence, we reviewed several studies using "data generation" and "deep learning methods" as keywords. First, in relation to synthetic data generation, Pargas et al. [63] proposed data generation methods based on genetic algorithms with a population dataset. They suggested the advantages of test data generation in related studies. In addition, Tracey et al. [64] suggested an automatic data generation framework for structural datasets; thus, they applied dynamic optimization-based search methods to the framework. Furthermore, they demonstrated the efficiency and effectiveness of the test data generation by comparing various experimental conditions. Brissette et al. [65] attempted to generate synthetic weather datasets based on stochastic methods. Methodologies using the Wilks approach were used in their study to generate weather information from multiple sites. Advantages, including simplicity and complements to climate research, have been emphasized by the authors. Jones et al. [66] attempted to generate mutation data from protein sequences using raw mutation frequency matrices to generate and evaluate datasets. In addition, the generated datasets were validated using the SWISS-PROT database. Their study proposed the benefits of dataset generation for associated research. Second, as mentioned above, methodologies with statistical or mathematical approaches have been applied to various datasets (e.g., protein structure, climate dataset, and population dataset) for synthetic data generation.
Similarly, in terms of machine and deep learning algorithms, diverse algorithms have been utilized for data generation. Guo and Herna [67] suggested boosting and generation methodologies to complement imbalanced data using boosting and ensemble-based learning algorithms. The authors evaluated improvements in data generation with respect to the prediction power of classification algorithms using synthetic datasets. Bloice et al. [68] proposed augmentation methods based on machine-learning models for image datasets. In their methodologies, various traditional augmentation methods (e.g., rotation and resize) and machine learning models were used to generate augmented image datasets. Ekbatani et al. [69] generated synthetic images, including pedestrians and objects on a road, using deep learning algorithms. Among various deep learning models, convolutional neural networks (CNNs) were applied for image generation. Norgaard et al. [70] applied supervised deep learning algorithms to generate synthetic sensor datasets. GAN algorithms with supervised learning characteristics were used in their research. The effectiveness of the proposed framework was validated using a human activity dataset with similar time-series characteristics. Chen et al. [71] proposed a deep learning framework for artificial CT image generation. U-net, which is constructed using a symmetric convolutional neural network, was applied to generate image datasets. The proposed framework was developed and evaluated using a cone-beam computed tomography (CBCT) image dataset.
Based on studies including those mentioned above, we concluded that our research topic was well founded. Among the various algorithms used to generate synthetic datasets, we applied GAN models to generate a behavior experiment dataset with a tabular shape. Many researchers have used GAN algorithms to generate structured datasets, including tabular datasets, and to complement insufficiency within datasets.
Zhou et al. [72] used GAN algorithms to efficiently deal with an imbalanced dataset. To complement the imbalance in the dataset, they improved the framework using two methods. First, generated artificial samples were added to the minority class to optimize the loss function of the algorithms. Second, a fully connected network module was utilized to improve the performance of the framework. After developing the two methods, they evaluated the framework using two open-source structured datasets. First, the Alibaba-MIFD dataset was tested using trained models. This dataset was composed of 69 variables, including medical information about people (e.g., cost of medicine and time of hospital stay). Second, the JD-RPLI dataset was used to evaluate the proposed framework. This dataset contained information about users (e.g., login time and user account). Both datasets are tabular, and the proposed framework performed better than the other algorithms on them. From this previous study, we confirmed that GAN algorithms have the potential to handle tabular datasets, in addition to their usefulness in the analysis of an imbalanced dataset with a shortage of class samples.
Yan et al. [73] used GAN algorithms to generate synthetic electronic health record (EHR) datasets for sharing datasets between research groups. To address the challenges that the complexity of EHR datasets poses for GAN-based generative models, several modules were added to the framework. First, a penalization module was applied to the learning process of the algorithms. With this module, generated values that violated the values in the original dataset were removed from the generated datasets. Second, the modified generator and discriminator models in the framework improved both the efficiency of model training and the generation of artificial instances. Third, several methods were proposed to evaluate the performance of dataset generation. The EHR dataset used in their study was composed of several variables (e.g., ICD code, BMI, and blood pressure) with a tabular structure. The authors evaluated generation performance by comparing the distribution and statistical characteristics (e.g., correlation coefficient and Bernoulli success probabilities) of the original and generated datasets.
Xu and Veeramachaneni [74] suggested a GAN model to generate tabular datasets based on medical and educational datasets. In their framework, both discrete and continuous variables are considered in the training algorithms. The authors used two methods to handle variables to improve generation performance. First, in the case of numerical variables, multimodal distributions were normalized to improve dataset processing. Normalized numerical variables took values in the range of −1 to +1. Second, the softmax function was used to smooth the distributions of the categorical variables. In addition, one-hot encoding was applied to categorical variables. A total of three open-source datasets with tabular structures were used to evaluate the performance of the framework. Moreover, the synthetic datasets were evaluated against the original dataset using machine learning classifiers.
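The variable handling summarized above can be illustrated with a minimal Python sketch. It is not the procedure of [74]: plain min-max scaling is swapped in for their treatment of multimodal distributions, and the function names (`scale_to_unit_range`, `one_hot`, `softmax`) and sample values are hypothetical.

```python
import math


def scale_to_unit_range(values):
    """Min-max scale numbers into [-1, 1] (assumption: simple scaling,
    not the multimodal normalization used in the cited work)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map it to the midpoint.
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def one_hot(labels):
    """One-hot encode labels; categories are ordered alphabetically."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    rows = [
        [1.0 if index[label] == i else 0.0 for i in range(len(categories))]
        for label in labels
    ]
    return categories, rows


def softmax(logits):
    """Turn raw scores into a smooth probability vector that sums to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical reaction times (ms) and word labels.
scaled = scale_to_unit_range([400, 650, 900])
categories, encoded = one_hot(["red", "blue", "red"])
probabilities = softmax([2.0, 1.0, 0.5])
```

In this sketch the softmax output plays the role of a smoothed categorical distribution, whereas the one-hot rows are the hard encoding of the observed categories.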
Based on the aforementioned studies, we adopted similar steps in the experimental design of our study. First, the statistical characteristics of the dataset were confirmed before developing the generation framework to handle the challenges related to training algorithms. Second, the GAN algorithm was trained using a tabular dataset. Third, evaluation methods were proposed for generation performance using statistics from the generated dataset and the application of machine learning tasks.
However, many researchers using the GAN model in their work have pointed out the ambiguity of evaluation methods for generated datasets [75]-[77]. Because GAN models consist of generator and discriminator models, no single evaluation method can be considered the gold standard, unlike for other algorithms [75]. For this reason, in the case of image generation tasks, the generated dataset has been evaluated based on human decisions and latent features extracted by pre-trained deep learning models [78]-[80]. In addition, researchers have evaluated performance with relative comparisons between methodologies [81], [82].
To evaluate the generated tabular datasets more precisely, we conducted an evaluation using the internal levels of the table structure. First, we evaluated the generated dataset at the instance (row) level. The existence of overlaps between the generated dataset and the original dataset was verified. To identify important elements in the Stroop task dataset, the values of word-related variable (PRIM and TARG) and reaction time variable (RT1 and RT2) pairs were compared. Furthermore, a one-sample t-test was used to verify the differences in variable values between the original and generated datasets. In this test, we considered that higher ratios of statistically significant results indicated fewer overlaps with values from the original dataset. Most of the significant result ratios in the dataset generated by the GAN-based model were higher than those of the randomly generated dataset.
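As a rough illustration of the instance-level check, the following Python sketch counts exact row overlaps on the four variables named above and computes a one-sample t statistic. How the test was configured in our experiments is not restated here, so the helper names (`overlap_ratio`, `one_sample_t`), the exact-match criterion, and the sample rows are assumptions made for illustration only.

```python
import math
import statistics


def overlap_ratio(original_rows, generated_rows, keys):
    """Fraction of generated rows whose values on `keys` exactly match
    at least one original row (assumption: exact matching)."""
    seen = {tuple(row[k] for k in keys) for row in original_rows}
    hits = sum(1 for row in generated_rows if tuple(row[k] for k in keys) in seen)
    return hits / len(generated_rows)


def one_sample_t(sample, reference_value):
    """t statistic for the null hypothesis mean(sample) == reference_value."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - reference_value) / standard_error


# Hypothetical rows using the variable names from the Stroop task dataset.
KEYS = ("PRIM", "TARG", "RT1", "RT2")
original = [
    {"PRIM": "red", "TARG": "blue", "RT1": 512, "RT2": 640},
    {"PRIM": "green", "TARG": "green", "RT1": 480, "RT2": 455},
]
generated = [
    {"PRIM": "red", "TARG": "blue", "RT1": 512, "RT2": 640},  # duplicate of an original row
    {"PRIM": "blue", "TARG": "red", "RT1": 530, "RT2": 610},
]

ratio = overlap_ratio(original, generated, KEYS)
t_value = one_sample_t([2.0, 4.0, 6.0], 2.0)
```

The sketch returns only the t statistic; converting it to a p-value requires the t distribution with n - 1 degrees of freedom, which is left to a statistics library.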
In addition, the ranges of values for the variables were compared with those of the original dataset. We calculated the minimum, median, and maximum values as constraints of the continuous variables and compared the absolute differences between the original and generated constraints. The differences in the GAN-based model conditions were generally lower than those in the random generation condition. From these results, we confirmed that the ranges for the GAN-based model conditions were closer to the range of the actual data than those for the random generation conditions.
Second, the feature (variable) level characteristics of the dataset were evaluated. Correlations between the variables were calculated from the generated datasets, and we computed the absolute differences between the original and generated correlation coefficients. We found that the absolute differences in the GAN-based model conditions were lower than those in the random generation conditions. In addition, the rank of the correlation coefficients for the GAN-based model condition was compared with that of the original dataset. We found that several common elements were included in the coefficient list.
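The constraint and correlation comparisons can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the Pearson coefficient is assumed because the type of correlation is not restated here, and the function names and sample values are hypothetical.

```python
import math
import statistics


def constraints(values):
    """Minimum, median, and maximum of a continuous variable."""
    return {"min": min(values), "median": statistics.median(values), "max": max(values)}


def constraint_gaps(original, generated):
    """Absolute difference of each constraint between two columns."""
    o, g = constraints(original), constraints(generated)
    return {name: abs(o[name] - g[name]) for name in o}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mx) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - my) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def correlation_gap(orig_x, orig_y, gen_x, gen_y):
    """Absolute difference between original and generated correlations."""
    return abs(pearson(orig_x, orig_y) - pearson(gen_x, gen_y))


# Hypothetical reaction time columns (ms).
gaps = constraint_gaps([400, 500, 900], [420, 510, 850])
gap_r = correlation_gap([1, 2, 3], [2, 4, 6], [1, 2, 3], [6, 4, 2])
```

Smaller gaps indicate that the generated column reproduces the original range or correlation more closely, which is the direction of comparison used in the evaluation above.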
Additionally, the distances between the distributions were evaluated for the datasets. For categorical variables, the Hamming distances were calculated. Two-sample KS tests were conducted for the continuous variables. In the GAN-based model conditions, the absolute differences in distance values were lower than those in the random generation conditions. From these results, we confirmed that the distributions of variables generated by the GAN-based models were more similar to the original distributions than those of the randomly generated datasets.
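Both distance measures are simple to express in Python. The sketch below computes only the two-sample KS statistic (the largest gap between the empirical cumulative distribution functions), not its p-value, and it assumes that categorical columns are compared position by position; the function names and sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest absolute gap between the
    empirical CDFs of samples `a` and `b`."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    points = sorted(set(a_sorted) | set(b_sorted))
    return max(
        abs(bisect_right(a_sorted, p) / len(a_sorted) - bisect_right(b_sorted, p) / len(b_sorted))
        for p in points
    )


def hamming_distance(u, v):
    """Number of positions at which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(u, v) if x != y)


# Hypothetical continuous and categorical columns.
d_shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
d_words = hamming_distance(["red", "blue", "green"], ["red", "green", "green"])
```

A KS statistic of 0 means the two empirical distributions coincide, and 1 means the samples do not overlap at all.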
Finally, the generated datasets were evaluated at the whole-set level. Latent features of three lengths (seven, five, and three) were extracted using a pre-trained AlexNet model. To compare the distances between the extracted latent features, the Minkowski distance, which includes two distance metrics (Euclidean and Manhattan distances), was applied. Overall, the distance values in the GAN-based model conditions were lower than those in the random generation conditions.
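The relationship between the Minkowski distance and its two special cases can be shown with a short Python sketch. Feature extraction is outside its scope: the vectors below are hypothetical stand-ins for latent features, and the function name `minkowski_distance` is an assumption for illustration.

```python
def minkowski_distance(u, v, p):
    """Minkowski distance of order p between two equal-length vectors.
    p = 1 gives the Manhattan distance; p = 2 gives the Euclidean distance."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


# Hypothetical three-length latent feature vectors.
original_features = [0.0, 0.0, 0.0]
generated_features = [3.0, 4.0, 0.0]

manhattan = minkowski_distance(original_features, generated_features, 1)
euclidean = minkowski_distance(original_features, generated_features, 2)
```

For these vectors the Manhattan distance is 7 and the Euclidean distance is 5, so the choice of order changes the scale of the values but not the principle that smaller values indicate closer feature sets.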
Based on the aforementioned experimental results, we concluded that the statistical characteristics (e.g., similarity or distances) of the generated datasets from the proposed frameworks are closer to the original datasets than the characteristics of randomly generated datasets.

V. CONCLUSION
In this study, we proposed a data generation framework based on a deep learning model for a behavior experiment dataset collected from cognitive psychology research. Based on previous studies associated with tabular data generation, we designed experiments covering the development of algorithms and evaluation methods. To complement the relatively small sample size of the dataset used in our study, we used a pre-trained GAN model for the framework. Furthermore, five evaluation methods based on internal tabular structure levels were applied for a more detailed evaluation. In addition, a random generation method was compared with the proposed framework to evaluate its generation performance. Based on the experimental results, we confirmed that the proposed framework with GAN algorithms can generate synthetic datasets whose statistical characteristics are similar to those of the original dataset.
The first strength of this study is the application of a behavior experiment dataset with a tabular structure from cognitive psychology to a deep learning generation algorithm. Second, we propose novel evaluation methods based on the tabular structure levels. Third, we consider not only the generation of structural characteristics, but also the reflection of the statistical characteristics of the original dataset.
Furthermore, the proposed framework has advantages in terms of data analysis and related research. First, the generated datasets can help reduce the sample size that must be collected in related experimental studies. Second, the synthetic dataset generation framework can overcome environmental restrictions (e.g., the COVID-19 pandemic) in conducting experimental research. Finally, complementing a dataset with artificial data that have similar statistical characteristics can reduce the burden on participants.
Our study has some limitations. First, we compared our framework only with a random generation method. However, we evaluated diverse aspects of the generated dataset, which is an advance over previous studies. Second, among the various task designs in cognitive psychology research, we applied only a Stroop task dataset. Although the Stroop task is one of the most established task designs used in related studies, future studies should consider other tasks to generalize this framework. In addition, validation studies are needed in the future to establish the utility of the synthetic datasets.