EEG Datasets for Healthcare: A Scoping Review

The rapidly evolving landscape of artificial intelligence (AI) and machine learning has placed data at the forefront of healthcare innovation. Electroencephalography (EEG) has gained significant attention for its potential to revolutionize healthcare applications. However, the effective utilization of EEG data in advancing medical diagnoses and treatment hinges on the availability and quality of relevant datasets. In this context, we conducted a scoping review to explore the wealth of EEG datasets designed for healthcare applications. This review serves as a critical exploration of the current landscape, aiming to identify datasets related to healthcare conditions while assessing their reusability. Our findings highlight both the opportunities and limitations in the wealth of open access EEG datasets available. As AI increasingly relies on high-quality, well-labelled data, barriers impeding the sharing and utilization of EEG data for healthcare (such as lack of comprehensive documentation or adherence to FAIR principles) must be addressed so as to leverage the potential of advanced deep learning models to unlock new possibilities for diagnosis and analysis of a wide array of medical conditions.


I. INTRODUCTION
Electroencephalography (EEG) is a method of recording the electric field generated by postsynaptic activity of large groups of neurons in the cerebral cortex. This is generally accomplished by placing electrodes on the subject's scalp. The recorded EEG signal is a continuous oscillating signal whose characteristics change depending on the consciousness and general neural activity of the subject [1].
In general, there are two types of EEG recording: rest and task. EEG recordings carried out during rest usually involve participants being seated on a chair and instructed to relax and keep their eyes open or closed. Task recordings, instead, are those where the participants are instructed to perform a specific task (e.g., cognitive or motor). Both rest and task recordings are direct measurements of EEG without the need for stimulation or markers.
The associate editor coordinating the review of this manuscript and approving it for publication was Berdakh Abibullaev.
On the other hand, Evoked Potentials (EP), also known as event-related potentials (ERP), are the category of EEG recordings where a stimulus, usually auditory or visual, is presented to the subject and a marker is sent to the EEG measurement device to record when the stimulus presentation started. These stimuli cause a specific known reaction in the recorded EEG, such as the P300 wave for visual recognition [1] or a driving response for oscillating stimuli where the EEG wave synchronizes with the stimulus frequency [2]. The EEG signals are time-locked to the stimuli, and appear as somatosensory, visual, or auditory potentials. For example, the P300 wave is an ERP that indicates that the subject has detected the target item in a sequence of non-target stimulus items.
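The time-locking described above can be illustrated with a minimal sketch: windows (epochs) are cut around each stimulus marker and averaged, so that activity unrelated to the stimulus tends to cancel while the evoked response remains. The signal below is synthetic, and the sampling frequency, amplitudes, and function names are assumptions chosen for illustration only; they are not taken from any dataset in this review.

```python
import math
import random

def extract_epochs(signal, marker_samples, pre, post):
    """Cut a window of `pre` samples before and `post` samples after each marker."""
    epochs = []
    for m in marker_samples:
        if m - pre >= 0 and m + post <= len(signal):
            epochs.append(signal[m - pre:m + post])
    return epochs

def average_epochs(epochs):
    """Average the epochs sample by sample to estimate the evoked response."""
    n = len(epochs)
    return [sum(column) / n for column in zip(*epochs)]

# Synthetic single-channel recording: 60 s of Gaussian noise (assumed values).
FS = 250                                   # sampling frequency in Hz (assumed)
rng = random.Random(0)
signal = [rng.gauss(0.0, 5.0) for _ in range(FS * 60)]

# One stimulus marker per second; add a positive deflection about 300 ms
# after each marker to mimic a P300-like response.
markers = list(range(FS, FS * 59, FS))
peak = int(0.3 * FS)
for m in markers:
    for k in range(-20, 21):
        signal[m + peak + k] += 10.0 * math.exp(-(k / 8.0) ** 2)

pre, post = int(0.2 * FS), int(0.8 * FS)   # 200 ms before, 800 ms after the marker
erp = average_epochs(extract_epochs(signal, markers, pre, post))

# Latency of the largest positive value in the averaged response.
peak_index = max(range(len(erp)), key=lambda i: erp[i])
peak_latency_ms = 1000.0 * (peak_index - pre) / FS
```

In a single epoch the deflection is partly hidden by noise; after averaging, the peak latency should fall close to the 300 ms at which the synthetic response was placed.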
From a medical point of view, EEG has a long history of being used to study and detect abnormalities in participants. One of the first documented experiments of EEG recording in humans was detecting EEG changes during sleep [3], [4], pointing to the existence of different phases of sleep. Today, it is used in a clinical setting for epilepsy diagnosis [5], Alzheimer's Disease diagnosis [6], [7], sleep studies [8], [9], [10], and monitoring depth of anaesthesia, coma, and other brain states [11]. It is also used in research for other neurological and psychiatric conditions [12], [13], [14]. Brain-Computer Interface (BCI) is an active area of research, and it uses EEG to allow people to control a computer even when paralyzed or with neuromotor difficulties [15]. BCI research also studies how to identify emotional states in human participants [16]. A comprehensive overview of BCI methods of EEG analysis is given in [17].
With the development of mobile wearable devices, EEG headsets are now more available and accessible for both consumers and researchers. At the same time, the development of novel deep learning (DL) models has made it possible to automatically classify and diagnose diseases and conditions. For example, DL models have achieved a lower error rate in cancer diagnosis than humans [18]. Specifically for EEG analyses, DL models perform better than traditional non-DL methods [19].
High-quality datasets are the cornerstone of successful DL algorithms [20]. Although many scientists are in favour of data sharing, there are still barriers to making the practice standard [21], [22]. The main reasons identified include lack of time, funding, and skills. Researchers are also wary of the quality of the data collected by others, as there is no way to be certain of the skill level of the data collectors, or that the correct methodology was followed during the data capture. Concerns regarding data documentation also arise when discussing the reusability of datasets. Contextual information on the data is often necessary to fully understand and replicate analysis, but this creates a higher burden for researchers when publishing data [21].
However, sharing data used in research studies is one of the fundamental mechanisms for combating the reproducibility crisis, which impacts the use of AI in the health domain [23], [24], [25]. This is particularly important in the field of machine learning, as the models developed depend on the data used to train them. Even if one has access to the original data, it is not always possible to reproduce ML results [26], [27]. Indeed, when training models, there are various settings that can be passed to the software packages that alter how the model is trained. In addition, deep learning models are trained via stochastic, non-deterministic methods. As Summers and Dinneen identified [28], if the initialization parameters are even slightly different, the same model with the same data produces vastly different results. Therefore, to fully reproduce a DL model one needs not only the data, but also the training methodology used to develop the model, which includes the software code, initializations, and all parameters involved.
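As a minimal illustration of why initializations and parameters need to be shared alongside the data, the sketch below trains a tiny logistic-regression model with stochastic gradient descent. The model, the synthetic data, and the hyperparameter values are hypothetical and far simpler than a deep learning model; the point is only that the random seed controls both the initial weights and the sample order.

```python
import json
import math
import random

def train(data, seed, epochs=20, lr=0.1):
    """Train a two-feature logistic regression with SGD; all randomness comes from `seed`."""
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1), rng.uniform(-1, 1)]    # random initialization
    b = 0.0
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)                          # random sample order per epoch
        for i in order:
            (x1, x2), y = data[i]
            z = w[0] * x1 + w[1] * x2 + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y                               # gradient of the log-loss w.r.t. z
            w[0] -= lr * g * x1
            w[1] -= lr * g * x2
            b -= lr * g
    return w, b

# Synthetic, illustrative training set (not EEG data).
data_rng = random.Random(123)
data = []
for _ in range(100):
    x1, x2 = data_rng.gauss(0, 1), data_rng.gauss(0, 1)
    data.append(((x1, x2), 1 if x1 + x2 > 0 else 0))

run_a = train(data, seed=1)
run_b = train(data, seed=1)   # same seed and settings: identical weights
run_c = train(data, seed=2)   # different seed: different weights

# Recording the settings next to the results makes the run repeatable.
run_record = json.dumps({"seed": 1, "epochs": 20, "lr": 0.1}, sort_keys=True)
```

Here `run_a` and `run_b` are identical, while `run_c` differs although the data and code are the same, which is why a record such as `run_record` is worth publishing with the model.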
Availability of research data decreases with publication age [29], which means that older publications are less likely to have their data shared. The costs of hosting and maintaining data over time are barriers to making data available for years after publication. Also, Miyakawa [30] suggests that some results are not supported by raw data: 97% of 41 manuscripts that had been marked for review and asked by the editor to provide raw data were retracted or rejected. In some cases, the sharing of raw data might be restricted by the terms of the ethics approval obtained for the study.
There has been an increased emphasis in recent years on making science more open and reproducible. Efforts like the FAIR data principles [31] and the increased requirement of data management plans by funding organizations have increased the visibility of best practices in data sharing.
Considering this, the authors carried out the following scoping review to determine which datasets containing EEG signals are available, which medical conditions are covered in these datasets, and how reusable these datasets are.
The research questions being addressed in this scoping review are as follows:
RQ1 - Which datasets containing EEG signals focused on healthcare applications (non-epilepsy related) are available?
RQ2 - Which health conditions are covered in these datasets?
RQ3 - How reusable are these datasets?
Epilepsy diagnosis and seizure detection are thoroughly discussed in [32] and [33], thus datasets containing epileptic patients or seizure signals were excluded from this review. Also, one of the largest EEG datasets is the TUH-EEG dataset [34], which contains 26,846 clinical EEG recordings and annotations. Since it contains seizure events, it has been excluded from this review. It is available at https://isip.piconepress.com/projects/tuh_eeg/.
A significant portion of the available EEG datasets are not healthcare related; instead, those datasets are the result of work in BCI and affective/emotion detection. Interested readers are referred to [19] for information regarding general EEG datasets, [16] and [35] for data on emotion recognition, [36] for motor imagery, and [37] for depression, while [19] and [38] provide an in-depth review of deep learning and EEG.

II. METHODS
This review follows the PRISMA guidelines for Scoping Reviews [39]. The methodology for the review is detailed below.

A. SEARCH STRATEGY
The search was conducted using the following publication databases: PubMed, Web of Science, and Scopus. On these databases, the search focused on published articles that describe and/or analyse an EEG dataset. The search was also conducted on the following data portals: DataCite (which includes results from IEEE DataPort, FigShare, and Zenodo) and Mendeley Data. Further datasets were screened from known EEG dataset repositories.
Publications were deemed eligible if they described a dataset. Datasets were deemed eligible if they had a valid and unique DOI reference. The inclusion criteria for datasets were:
• dataset has a valid and unique DOI;
• the dataset description (in the case of a publication) has a link or direct reference to the dataset;
• dataset can have any type of access (open, restricted, request);
• the dataset must include EEG signals;
• dataset may include other types of data, but it is not required;
• dataset must have been collected to study a health condition, disease, or diagnostic;
• dataset must include information about each recording, such as the group (i.e., diagnosis, or patients versus healthy controls) to which the recording belongs.
The exclusion criteria for datasets were:
• in the case of publications, if the publication uses the dataset but does not describe it;
• data collected for interventional studies, i.e., the study is analysing the results of an intervention (medication or treatment) on a group of participants;
• datasets investigating epilepsy and/or seizures;
• non-health-related datasets, such as general BCI datasets, motor BCI, emotion recognition, meditation, etc.;
• datasets with no reference to ethical approval for the study (either in the dataset description or in its related publication).
The datasets were found via three methods: (i) searching publication databases for articles that described a dataset; (ii) searching for datasets through data citation and data aggregation portals (e.g., DataCite); and (iii) searching published datasets in repositories dedicated to neurological and physiological datasets. These specific repositories (such as OpenNeuro, PhysioNet, etc.) were identified by searching for EEG repositories in a general search engine (Google).
The results were first screened by title, keywords, and description, if these were available for the datasets, or alternatively from the abstract in the case of publications. In this step, datasets containing no mention of EEG signals were excluded. From the keywords and description, datasets were excluded if there was no mention of a medical condition or state, or if there was mention of epilepsy and/or seizures, according to the exclusion criteria. After this first screening, the full text of articles describing datasets was retrieved and surveyed. If no reference to a dataset was found (either a link or citation with DOI), the article was rejected. The other inclusion and exclusion criteria discussed above were also assessed. For datasets, the repository page with metadata was accessed and screened for suitability. Datasets were also rejected if their filenames were not in English, or not descriptive enough to allow understanding of the contents of the file and to which class (patients or controls) the file belonged. One of the datasets identified in the first screening was not yet published, which made it impossible to retrieve its full metadata, and therefore it was also excluded. This whole process is shown in Figure 1, which gives the number of records retrieved, screened, and rejected or included.
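The screening logic described above can be summarised as an ordered list of checks. The sketch below is illustrative only: the screening in this review was carried out manually, and the `Record` fields and reason strings are hypothetical names introduced here to mirror the inclusion and exclusion criteria.

```python
from dataclasses import dataclass

@dataclass
class Record:
    """Hypothetical metadata for one candidate record; field names are illustrative."""
    describes_dataset: bool = True
    has_eeg: bool = True
    condition: str = ""            # health condition of focus, empty if none is stated
    has_doi: bool = True
    has_dataset_link: bool = True
    has_group_labels: bool = True
    interventional: bool = False
    general_bci: bool = False
    ethics_reported: bool = True

def screen(record):
    """Return (included, reason); the first failed criterion is reported."""
    condition = record.condition.lower()
    checks = [
        (record.describes_dataset, "does not describe a dataset"),
        (record.has_eeg, "no EEG signals"),
        (condition != "", "no health condition stated"),
        (not any(k in condition for k in ("epilep", "seizure")),
         "epilepsy or seizure dataset"),
        (not record.general_bci, "non-health-related dataset"),
        (not record.interventional, "interventional study"),
        (record.has_doi, "no valid DOI"),
        (record.has_dataset_link, "no link or reference to the dataset"),
        (record.ethics_reported, "no reference to ethical approval"),
        (record.has_group_labels, "no per-recording group information"),
    ]
    for passed, reason in checks:
        if not passed:
            return False, reason
    return True, "included"

example_included = screen(Record(condition="insomnia"))
example_excluded = screen(Record(condition="epilepsy"))
```

Reporting the first failed criterion gives each rejected record a single reason for exclusion, which is the form in which exclusions are typically tallied in a PRISMA flow diagram.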
For each dataset included in this review, the related publication was usually found linked in the dataset description or indicated as the preferred method of citation for the dataset. Where the publication was not linked, the corresponding articles were searched for via the authors listed in the dataset metadata. This was carried out because, for most of the datasets, the data collection protocol was not described in the dataset description or files. Instead, it was usually described in the methodology section of the related publication. For all datasets included in this review, the location where the data collection protocol is described is included in the extracted data.

C. EXTRACTED INFORMATION
For each included dataset, the following information was extracted (either from the dataset link or its related publication):
• dataset reference (DOI and full citation);
• related publication reference(s);
• dataset URL (unique);
• access type;
• country and data collection site;
• year of publication;
• health condition or diagnosis of focus;
• data modalities contained in the dataset (such as subject demographics, clinical history or Electronic Health Record (EHR), psychometrical or psychological test, physiological signals, etc.);
• population included and number of participants for each group;
• number of classes in the dataset (e.g., patients versus controls, sleep stages, patient subgroups);
• type of EEG data collection: rest, task, evoked potentials;
• number of recording sessions, recording time, number of trials/tasks, and duration;
• mode of EEG data annotation: manual, task/stimulation/event markers, automatic/algorithm/machine;
• EEG device used and form factor (i.e., cap, free electrodes, headset/headband, polysomnography (PSG) setup), number of channels, electrode placement and references, sampling frequency;
• EEG data format (.eeg, .bdf, BIDS, etc.);
• if the EEG data provided is raw or is pre-processed (and in this case, which pre-processing has been implemented);
• whether the recording protocol is available in the dataset description/files, the associated publication, or not available;
• whether the dataset includes an explanation of how to analyse it, files describing the data/variables/etc., or code used by the researchers to analyse the dataset.
39188 VOLUME 12, 2024 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

D. ASSESSMENT OF RECORDS
For each dataset included in this review, we assessed the reusability of the dataset to answer research question 3. This was carried out using two methods: the first was an automated tool developed by the FAIRsFAIR Project [40], and the second was a questionnaire completed by the authors when extracting metadata from the datasets.
The FAIR guiding principles [31] (Findability, Accessibility, Interoperability, and Reusability) inform researchers, data producers, and publishers on the best approaches to publish data and other research objects. The FAIRsFAIR project [41] is a European project that aims to develop global standards for FAIR certification of repositories. Developed by the FAIRsFAIR project, the "F-UJI Automated FAIR Data Assessment Tool" [42] is a REST web service for programmatically assessing a data object according to the FAIRsFAIR Data Object Assessment Metrics [43]. These metrics were developed by the FAIRsFAIR project to provide a method to systematically assess a data object. For this review, we used version v2.0.2 of the F-UJI tool, available at https://github.com/pangaea-datapublisher/fuji/releases/tag/v.2.0.2 . The F-UJI web server was run on a computer and queried programmatically for each dataset contained in this review. The code used for this is available in the Supplementary Materials, and at xxxxx (permanent link to repo).
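As a rough illustration of what querying a locally running F-UJI server might look like, the sketch below builds and sends one request per dataset identifier. It is not the code used for this review. The endpoint URL, port, JSON field names, and credentials are assumptions about a default local installation and should be checked against the documentation of the F-UJI version in use; the DOI in the usage line is a placeholder.

```python
import base64
import json
import urllib.request

# Assumed endpoint of a locally running F-UJI server (verify for your installation).
FUJI_URL = "http://localhost:1071/fuji/api/v1/evaluate"
# Placeholder credentials; a real server uses whatever is set in its configuration.
USERNAME, PASSWORD = "username", "password"

def build_request(identifier):
    """Build a POST request asking the server to assess one data object."""
    payload = {
        "object_identifier": identifier,   # DOI or landing-page URL (assumed field name)
        "test_debug": True,                # assumed optional flag
        "use_datacite": True,              # assumed optional flag
    }
    token = base64.b64encode(f"{USERNAME}:{PASSWORD}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        FUJI_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )

def evaluate(identifier, timeout=120):
    """Send the request and return the parsed JSON report (requires a running server)."""
    with urllib.request.urlopen(build_request(identifier), timeout=timeout) as response:
        return json.load(response)

# Building the request does not contact the server; the DOI is a placeholder.
example_request = build_request("https://doi.org/10.0000/example")
```

Calling `evaluate` in a loop over the dataset DOIs would yield one assessment report per dataset, from which the per-principle scores can be read.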
The F-UJI Automated tool reports a score for each of the FAIR assessment metrics [43], grouped by the principles Findable, Accessible, Interoperable, and Reusable. In this review, we report it as a percentage: the highest score possible in one of these metrics is shown as 100%, and the lowest score (zero) is shown as 0%. Considering the FAIR data requirements, information on the documentation level of the dataset and the file format provided was also collected. For this, the file format of the data files was recorded during the data extraction. A scale of 0 to 3 points was created to determine the documentation level of the dataset: this represents how well the context of the data provided is explained, i.e., whether the dataset contains the data collection protocol, the methodology, and the code used for analysis. For each type of documentation present in the dataset, its score gains one point. In this way, a score of zero means the dataset has no contextual information (i.e., only the data files and a brief abstract are provided), and a score of 3 means the dataset includes the collection protocol, an explanation of the file structure and internal variables, and the code used for the analysis. The datasets that provide the collection protocol only in the associated publication have been noted and a 0.5 score deduction was applied.
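The two scores described above can be written as a short sketch. The function and argument names are hypothetical, and one interpretation is assumed: a protocol that appears only in the associated publication still earns its point and then receives the 0.5 deduction, giving a net 0.5 for that item.

```python
def fair_percentage(earned, total):
    """Express a FAIR metric score as a percentage of the maximum possible score."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * earned / total

def documentation_score(protocol, file_description, analysis_code):
    """Documentation level on the 0 to 3 scale.

    protocol: "dataset" if the collection protocol is in the dataset itself,
              "publication" if it is only in the associated publication,
              "none" if it is not available.
    file_description, analysis_code: booleans.
    """
    if protocol not in ("dataset", "publication", "none"):
        raise ValueError("unknown protocol location")
    score = 0.0
    if protocol != "none":
        score += 1.0
    if protocol == "publication":
        score -= 0.5          # assumed handling of the 0.5 deduction
    if file_description:
        score += 1.0
    if analysis_code:
        score += 1.0
    return score

full_documentation = documentation_score("dataset", True, True)
protocol_in_paper = documentation_score("publication", False, True)
```

Under this interpretation, a dataset with the protocol only in the publication and with analysis code, but without a file description, scores 1.5.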

A. OVERVIEW
From the 141 records (datasets) sought for retrieval, one dataset was not yet published, so it was not retrieved. Of the 140 records assessed for eligibility, 37 were journal articles that did not describe a dataset, 23 records did not have EEG data, 13 records were not exploring a specific diagnosis/condition, 9 did not have a DOI, 8 were general BCI datasets (not healthcare related), 7 datasets had no ethics approval disclosed either in the dataset metadata or the associated publication, 5 had no link or reference to where to find the data, 3 were repeated (same dataset but different DOIs), 2 were seizure related, 2 had no explanation about what each file represented, and 1 was not in English. In total, 110 records were excluded and 30 records included.
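The exclusion counts above can be checked with a few lines of arithmetic. The dictionary keys are shorthand labels introduced here; the numbers are those reported in the text.

```python
# Number of records excluded for each reason, as reported above.
excluded = {
    "journal article not describing a dataset": 37,
    "no EEG data": 23,
    "no specific diagnosis or condition": 13,
    "no DOI": 9,
    "general BCI dataset": 8,
    "no ethics disclosed": 7,
    "no link or reference to the data": 5,
    "repeated dataset": 3,
    "seizure related": 2,
    "no explanation of files": 2,
    "not in English": 1,
}

sought = 141
not_retrieved = 1
assessed = sought - not_retrieved           # records assessed for eligibility
total_excluded = sum(excluded.values())     # records excluded at this stage
included = assessed - total_excluded        # records included in the review
```

The sum of the individual reasons matches the reported total of 110 exclusions, leaving 30 included records out of the 140 assessed.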
In general, the datasets included fall into three broad categories of condition of focus: sleep studies and related disorders, psychiatric conditions, and neurological conditions. Sleep and related issues have the largest number of datasets found for a condition, with seven datasets. The number of datasets found for each condition is shown in Figure 2a. Two datasets were found related to phobias, each focusing on a different phobia (claustrophobia and arachnophobia), but they were placed under the same condition. One dataset contains multiple conditions for the paediatric population (developmental disorders), thus it was not included in any single condition.
The number of datasets published in each country is shown in Figure 2b. Some datasets were collected by researchers collaborating in multiple locations, so the same dataset was counted for each country in the collaboration. For example, [73] was collected at multiple sites in Italy, Germany, and Switzerland, and [47] was collected in the USA and Norway.
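The counting rule for multi-country collaborations can be sketched as follows, using only the two examples named above. The function name is hypothetical, and the citation labels are used as dictionary keys purely for illustration.

```python
from collections import Counter

def count_by_country(dataset_countries):
    """Count each dataset once for every country involved in its collection."""
    counts = Counter()
    for countries in dataset_countries.values():
        counts.update(set(countries))   # set() guards against duplicate entries
    return counts

# The two multi-country examples given in the text.
dataset_countries = {
    "[73]": ["Italy", "Germany", "Switzerland"],
    "[47]": ["USA", "Norway"],
}
counts = count_by_country(dataset_countries)
```

A consequence of this rule is that the per-country totals can add up to more than the number of datasets: here two datasets produce five country counts.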
The year of publication for each dataset was also recorded and is shown in Figure 2c. The number of published datasets has increased in recent years, which could be due to recent efforts on open science and data sharing.
The main data modalities other than EEG included in the datasets are shown in Figure 2d. The most common data modalities included are physiology related: electromyography (EMG), electrocardiography (ECG), and electrooculography (EOG). Due to the considerable number of sleep datasets in this review, many of the included data types pertain to PSG recordings, such as EMG to detect muscle activity, ECG to capture heart rate metrics, EOG to record eye movements, respiratory rate, airflow, audio recording for snoring, body position, oxygen saturation (SaO2), and CO2 measurement. Structural Magnetic Resonance Imaging (MRI) and functional MRI (fMRI) are also included in some datasets for further brain analysis. Other than the modalities listed above, no other physiological data collection was found in the datasets included in this review.
The following sections list and discuss the datasets included in this review. The datasets are organized according to the health condition of focus (as per research question RQ2), grouped into three broad categories: sleep studies, psychiatric conditions (such as depression and schizophrenia, among others), and neurological conditions (such as Parkinson's and Alzheimer's Disease).

B. CONDITIONS
1) SLEEP
The CAP Sleep Database [66] is a dataset of polysomnographic recordings from 108 participants: 16 healthy controls, and participants with sleep-related pathologies such as insomnia, narcolepsy, and sleep-disordered breathing, among others. Each recording contains three or more EEG channels, EOG, EMG, airflow, respiration, SaO2, and ECG. The EEG recordings are annotated with sleep stage and cyclic alternating pattern (CAP) events, which are correlated with sleep instability and related pathologies. The associated publication [74] describes how to classify CAP events, and how they present in an EEG recording.
The ISRUC-Sleep Dataset [67] consists of polysomnography recordings of 108 participants with sleep disorders, of which 8 have two different recording sessions, and 10 healthy participants. Each recording contains 6 EEG channels, EOG channels for each eye, EMG (measured on chin and legs), ECG, airflow, respiration, SaO2, and body position. For each subject, the dataset includes details such as demographic data (age and gender), medication taken, and sleep scoring. The associated publication describes the dataset in detail, together with a method of automatic sleep stage classification and its performance evaluation compared with manual expert annotation. The automatic sleep stage classification method comprises preprocessing, feature extraction based on the maximal overlap discrete wavelet transform (MODWT), and a supervised learning step which uses a Support Vector Machine classifier.
The dataset described in [68] is composed of 64-channel EEG recordings from 22 participants during short periods of sleep, with manually annotated sleep stages and spindles (a specific EEG pattern that is characteristic of stage 2 of non-REM sleep). Recordings were carried out on two separate days, after the participants undertook a high- or low-load visual working memory task. In addition to the raw EEG data, the authors also provide Python code for signal processing and artifact correction using Independent Component Analysis (ICA). This is especially useful for reusing the dataset, as it enables researchers to replicate the authors' data processing and verify their findings. It also enables other researchers to build solutions based on the data collected.
The NCH Sleep DataBank [69] is a large dataset containing 3,984 paediatric sleep studies on 3,673 unique patients, associated with EHR data, which includes medications, measurements, diagnoses, etc. The data modalities included are EEG for sleep stage identification, EMG channels on chin and leg, EOG, ECG, airflow and respiratory effort, blood oxygen saturation, and carbon dioxide measurement of exhaled air. The data is annotated in real time by technicians and reviewed later by another expert. This dataset has restricted access by means of credential assignment. Python code for data analysis is also provided in a separate GitHub repository [75]. The publication [76] specifies the transformations carried out on the raw data (filename, EDF header, random date shift, etc.) to anonymize patients. The publication is a full description of the dataset and what is included in it, going into detail such as file naming conventions, variables included in the files, and how annotation was carried out. The description of the files included is helpful to potential users, as they can identify what is included and whether it is relevant for their purposes. Also, due to the large amount of annotated raw EEG data, it is very well suited for the development of deep learning models, which the authors highlight in the publication. They also suggest potential applications of the data and provide starter code for analysis.
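To make the idea of a random date shift concrete, the sketch below applies one random offset per patient to every date belonging to that patient. It is a generic illustration and not the procedure used for the NCH Sleep DataBank, whose details are given in [76]; the function names, the maximum offset, and the example dates are all hypothetical.

```python
import datetime as dt
import random

def make_shifts(patient_ids, max_days=365, seed=None):
    """Draw one random offset (in whole days) per patient."""
    rng = random.Random(seed)
    return {pid: dt.timedelta(days=rng.randint(1, max_days)) for pid in patient_ids}

def shift_date(date, patient_id, shifts):
    """Move a date back by the offset assigned to this patient."""
    return date - shifts[patient_id]

# Hypothetical patient identifiers and visit dates.
shifts = make_shifts(["p1", "p2"], seed=0)
visit_1 = dt.date(2020, 1, 10)
visit_2 = dt.date(2020, 3, 1)
shifted_1 = shift_date(visit_1, "p1", shifts)
shifted_2 = shift_date(visit_2, "p1", shifts)
```

Because the same offset is applied to all dates of a given patient, the interval between two visits is preserved while the true calendar dates are hidden.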
The dataset in [70] is another dataset dealing with paediatric sleep, which contains sleep recordings of infants (between one week and seven months of age), with a high-density EEG cap and video to measure movements. Access to the files in this dataset is restricted to credentialed authorized users. Analysis of the data is carried out in the associated publication [77]. The publication reports that the MATLAB code used for the analysis is shared in a GitHub repository, but as of April 2023, the repository does not exist. The data collection protocol is described in the publication.
The Haaglanden Medisch Centrum sleep staging database [71] comprises 151 whole-night polysomnography recordings, containing EEG, EOG, chin EMG, and ECG, including the manual annotations of technicians. The dataset webpage describes the data collection protocol, what each file contains and its format, and usage notes on how to open and interact with the files provided. The associated publication [78] presents a deep learning model for automatic sleep staging using this dataset and other open sleep datasets.
The dataset in [72] includes EEG and fMRI signals collected from 33 healthy participants during resting state and sleep sessions, before and after a visual-motor adaptation task. The data collection protocol is briefly described in the dataset description, and is expanded and detailed in the associated publication [79].

2) PSYCHIATRIC CONDITIONS
a: DEPRESSION
The dataset in [50] and [52] contains EEG recordings of 122 university students aged 18-25 years during rest [50], with eyes open and eyes closed, and during cognitive tasks [52]. The participants were screened for major depressive disorder (MDD) using the Beck Depression Inventory, and classified into one of 4 groups: MDD, past MDD diagnosis (but not current), not meeting the diagnosis criteria for MDD, and not tested for MDD. The article associated with this dataset is [80], which explores how depression and anxiety relate to reward systems in the brain. The dataset is provided in the BIDS format, which is a standard in the field [81], [82]. The data collection protocol for this dataset is not described fully, and the dataset description notes that some of the data might be mislabelled, some EEG channels have been interpolated, and no raw data is available, which raises some questions as to its usability in future work. The cognitive task part of the dataset contains the associated code to produce the stimuli presented to participants, and the code developed in MATLAB to analyse the data and replicate the study.
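For readers unfamiliar with BIDS, the sketch below shows how the standard's naming convention places an EEG recording in a predictable location. The helper function and the subject, session, and task labels are hypothetical and are not taken from the dataset above; the entity order (sub, ses, task, run) and the `eeg` folder follow the BIDS convention for EEG files.

```python
from pathlib import PurePosixPath

def bids_eeg_path(subject, task, session=None, run=None, extension=".edf"):
    """Build the relative path of an EEG recording following BIDS naming."""
    entities = [f"sub-{subject}"]
    if session is not None:
        entities.append(f"ses-{session}")
    entities.append(f"task-{task}")
    if run is not None:
        entities.append(f"run-{run}")
    filename = "_".join(entities) + "_eeg" + extension

    folder = PurePosixPath(f"sub-{subject}")
    if session is not None:
        folder = folder / f"ses-{session}"
    return str(folder / "eeg" / filename)

# Hypothetical labels, for illustration only.
rest_path = bids_eeg_path("01", "rest")
task_path = bids_eeg_path("01", "rest", session="1", run=2)
```

With these labels, `rest_path` is `sub-01/eeg/sub-01_task-rest_eeg.edf`. A fixed layout of this kind lets generic tools locate recordings without dataset-specific code, which is part of what makes the format attractive for reuse.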
Another dataset developed in the study of depression is [51], which provides EEG and audio data from 24 patients clinically diagnosed with depression, and 29 matched healthy controls. The EEG recordings were made with a standard medical 128-channel device and a novel 3-channel wearable device, during rest and activities such as responding to questions, reading, and picture description. The dataset is provided in the BIDS format, and contains a README file, a file describing the methodology used for the data collection, and the information sheet given to patients for consent to participate in the study. The publication associated with this dataset is [83], which describes the dataset in detail.
The dataset in [44] contains the raw data from a study [84] on working memory and response inhibition. This study was carried out on participants aged 9-16 years, 34 of whom were diagnosed with attention deficit hyperactivity disorder (ADHD) and 25 of whom were typically developing. The protocol for the data collection is in the publication, and no other demographic data is given. In the study, the participants had to perform working memory tasks and response inhibition tasks. The EEG data is from 21 channels sampled at 500 Hz, provided in .cnt format.
Another dataset [46] has EEG data acquired from adult individuals aged 18-68 years, 28 of whom had a diagnosis of autism spectrum disorder (ASD), and 28 of whom were neurotypical controls. The study investigated how brain aging differs for adults with ASD compared to healthy controls. The data was collected during 150 seconds of eyes-closed rest for each subject. The EEG data is provided for each subject as a pair of files: .fdt for the raw EEG, and .set with details on the recording parameters. However, no information is provided on which participants have ASD and which are the controls.
The dataset "Healthy Brain Network (HBN) Biobank" [55] includes data on over 4,000 participants aged 5 to 21 years. The data modalities are EEG, MRI, behavioural and cognitive phenotyping, actigraphy, eye tracking, genetics, audio, and video. The publication details the complete methodology of data collection: subject screening, inclusion and exclusion criteria, assessment tests, and recording protocols for each data modality. For each diagnosis included in the dataset, the relevant assessment tests are mentioned. The assessments included cover anxiety, ADHD, ASD, cognitive and executive functioning, depression and mood, obsessive-compulsive disorder, physical tests, sleep, family structure and trauma, substance abuse or addictive behaviour, verbal learning, and others. In addition, the publication also contains the lessons learned from implementing a large-scale data collection, which is of great benefit for other researchers collecting large amounts of data. Access to the full dataset is restricted by request / project submission for validation.

c: SCHIZOPHRENIA
The dataset in [63] consists of EEG recordings of 14 patients with paranoid schizophrenia and 14 healthy controls. In the dataset repository, there is only information on the sampling frequency and electrodes used. The groups are identified via the filename, starting with h or s. The publication that analyses this dataset [85] describes the protocol used for the EEG recordings and the analysis implemented comparing the controls to patients using connectivity measurements extracted from the EEG data.
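When the group label is encoded only in the filename, as above, a small helper can recover it. The sketch below assumes that "h" denotes healthy controls and "s" denotes patients with schizophrenia; the function name and the example filenames are hypothetical.

```python
from pathlib import PurePath

def group_from_filename(filename):
    """Infer the participant group from the first letter of the file name."""
    first = PurePath(filename).name[:1].lower()
    if first == "h":
        return "healthy control"      # assumed meaning of the "h" prefix
    if first == "s":
        return "schizophrenia"        # assumed meaning of the "s" prefix
    raise ValueError(f"cannot infer group from {filename!r}")

# Hypothetical file names, for illustration only.
labels = [group_from_filename(name) for name in ("h01.edf", "s14.edf")]
```

Raising an error for unexpected prefixes avoids silently assigning a recording to the wrong class, which matters when labels are not documented elsewhere in the dataset.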
The dataset in [64] is an EEG dataset of 19 patients and 24 healthy controls recorded during an ultimatum game task. The protocol for the data collection is described in the associated paper [86]. EEG was recorded with 128 channels at 2048 Hz. No other information about patients or about data pre-processing is available.
Another dataset for schizophrenia is divided into two DOIs: [65] and [87]. In this dataset, EEG and MEG (magnetoencephalography) data were recorded for two independent samples of healthy controls and First-Episode Psychosis (FEP) individuals. For each subject, the dataset includes 5 minutes of resting EEG collected using a 60-channel cap setup. The associated paper [88] describes the test protocol used and the analyses run on the collected data.

d: PHOBIAS
For the study of reactions of people suffering from phobias, two datasets were found.
The first dataset [61] consists of 9 participants with self-identified claustrophobia and 13 healthy controls. The EEG was recorded under 3 different conditions: in a well-lit spacious room, in a chamber with moderate light, and in a moderately lit smaller room. Information about each participant, such as age, gender, and group, is present in the dataset. The associated publication [89] describes in detail the dataset organization and files, and the data collection protocol.
The second dataset [62] consists of EEG recordings of 40 patients with arachnophobia and 53 controls. No demographic information is given for the participants. The data collection protocol is detailed in the associated publication [90]. The dataset does, however, include the code used for the analysis in the publication, which allows other researchers to review the methodology and replicate the results.

3) NEUROLOGICAL CONDITIONS
a: PARKINSON'S DISEASE
The dataset in [56] studies freezing of gait in Parkinson's Disease. The test cohort consists of 14 participants with Parkinson's Disease experiencing freezing of gait, 14 patients with Parkinson's Disease but without freezing of gait, and 13 healthy controls. The dataset presents no participant data other than group membership. The associated paper [91] describes the data collection protocol, i.e., how the participants were screened and the motor task carried out during the EEG recordings. The participants were recorded performing ankle dorsiflexion while sitting in a chair with their gaze fixed on a specified point.
The dataset in [57] is a curated dataset derived from [92]. It contains EEG recordings of 15 patients with Parkinson's Disease and 16 healthy controls. The dataset provides other information regarding participants, such as age, gender, handedness, and clinical history. The data collection protocol, which consisted of rest and a stop-signal task, is explained in [92]. The dataset curators have published the dataset with open access, but ask that researchers approach them via email for permission and guidance before using it.
The three datasets in [58], [59], and [60] were all collected at the same institute by the same research group. The datasets consist of EEG recordings of 28 participants with Parkinson's Disease and 28 healthy controls (the last dataset [60] contains recordings of just 25 participants in each group). The EEG was collected during a cognitive task [58], reinforcement learning [59], and rest and auditory stimuli [60]. A brief description of the data collection protocol is present in the dataset descriptions, and the author cites an article that explains the task and supplies the Matlab code necessary for replicating it. Also present in the datasets are other code files for further data analysis. Results from the analyses were published in [93] for the first dataset, [94] for the second dataset, and [95] for the third.

b: ALZHEIMER'S DISEASE
For Alzheimer's Disease, the dataset [45] includes 230 participants, with 5 subgroups: 33 healthy control elders, 34 with subjective cognitive decline, 79 with mild cognitive impairment, 48 participants with Alzheimer's Disease, and 36 healthy young controls. The full dataset has restricted access, but 4 samples are included in the open version. The dataset description is very informative on the data collection protocol, EEG device and setup, and preprocessing applied to the data. The analysis of this dataset was published in [96].

c: BRAIN INJURY
The dataset in [47] contains EEG recordings of 14 patients with unilateral prefrontal cortex lesions and 20 matched healthy controls. Raw and pre-processed data are included, as well as analysis code to replicate results. The data collection protocol and results are published in [97]. The access to the dataset is credentialed, i.e., the user needs to create an account to download the dataset.
The dataset in [48] consists of EEG recordings of patients with traumatic brain injury (TBI). The population is composed of 23 chronic TBI patients, 38 sub-acute mild TBI patients, and 24 healthy controls. The dataset also contains demographic data and neurophysiological assessments. Some of the data collection protocol is explained in the dataset description, but a full detailed explanation is present in the associated article [98]. The task performed in this dataset generated auditory evoked potentials using a 3-stimulus oddball paradigm (where presentations of the different stimuli have different probabilities of occurring), and EEG was also collected during rest.

d: OTHERS - COGNITIVE IMPAIRMENT, HEARING IMPAIRMENT, STROKE, MIGRAINE, ETC.
The dataset in [73] is a multi-center (Italy, Germany, and Switzerland) dataset of various data modalities collected to study upper-limb movements. The dataset contains 65 post-stroke participants and 91 healthy participants. The data modalities included are EEG, ECG, EMG, actigraphy and kinematic data, and fMRI. The publication [99] has all of the details of the data collection procedures and of what is contained in the dataset. The dataset also contains a Matlab script to plot data.
The dataset in [54] contains high-density (128-channel) EEG recordings of 17 migraineurs and 18 healthy controls. The recordings were obtained during rest and for evoked potentials, both visual and auditory. Included in the dataset files are README files with a description of the files, the Matlab code used to generate the visual and auditory stimuli, and a document describing the data collection protocol in detail. The analysis of the data is found in the publication [100].
EEG and MEG recordings to study hearing impairment are present in the dataset [53]. It contains recordings of 17 normal-hearing younger adults (18-30 years old), 14 normal-hearing older adults (over 60 years old), and 17 hearing-impaired older adults (over 60 years old). The dataset description contains an overview of the data collection protocol and file naming convention. More details and the analysis of the data are found in the publication [101].
The dataset in [49] investigates brain functionality involved in chronic post-burn itch. It contains EEG data from 15 patients and 14 healthy controls recorded during rest and stimulation of the skin. The dataset also contains documents describing the collection protocol, SPSS scripts for statistical analysis, and the literature review completed by the authors. The access to the dataset is restricted, i.e., the user needs to request access from the dataset owner to download the files.

C. EEG CONFIGURATION
Table 4 shows the configuration of the EEG system used for the data collection of each dataset in this review, as well as the recording type and annotation method.

D. ASSESSMENT: HOW REUSABLE ARE THE DATASETS?
Table 5 shows the result of using the ''F-UJI Automated FAIR Data Assessment Tool'' [42] to programmatically evaluate the datasets. For each category (Findability, Accessibility, Interoperability, and Reusability), the value shown represents the percentage of the FAIRsFAIR Data Object Assessment Metrics [43] that the dataset fulfils.
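To make the per-category percentages concrete, the sketch below aggregates per-metric results into F, A, I, R and overall scores. It is only an illustration under stated assumptions: the input dictionary is a hypothetical structure (not the output format of the F-UJI tool), the earned and maximum points are invented, and each metric is assumed to contribute its earned points out of its maximum to the category named by the letter following ''FsF-'' in its identifier.

```python
from collections import defaultdict


def fair_percentages(metric_scores):
    """Aggregate (earned, maximum) points per metric into category percentages.

    metric_scores maps a metric identifier such as 'FsF-R1.3-01M' to a tuple
    (earned, maximum). The category is the letter after 'FsF-'.
    """
    earned = defaultdict(float)
    maximum = defaultdict(float)
    for metric_id, (points, max_points) in metric_scores.items():
        category = metric_id.split("-")[1][0]  # 'FsF-R1.3-01M' -> 'R'
        for key in (category, "FAIR"):  # 'FAIR' accumulates the combined score
            earned[key] += points
            maximum[key] += max_points
    return {key: round(100 * earned[key] / maximum[key], 1) for key in maximum}


# Hypothetical scores for a single dataset (values invented for illustration).
example = {
    "FsF-F1-01D": (1, 1),
    "FsF-F2-01M": (1, 2),
    "FsF-A1-01M": (1, 1),
    "FsF-I1-01M": (0, 2),
    "FsF-R1-01MD": (1, 4),
    "FsF-R1.3-01M": (0, 1),
}
print(fair_percentages(example))
```

With these invented values the dataset would score well on Findability and Accessibility while scoring poorly on Interoperability and Reusability, which is the general pattern discussed later in this review.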
Since the automated tool is not able to test whether the dataset uses the standard format for EEG data (BIDS), or to determine the level of documentation of the dataset, we collected the information shown in Table 6. It indicates whether a dataset has a comprehensive explanation, a README file, and some indication of the file structure used; whether the filenames are descriptive enough to enable another researcher to identify what each file is about; where the data collection protocol is located (in a file inside the dataset files, in the associated publication, or not disclosed); and whether the authors provide any code to create the environment and stimuli used and/or to reproduce the data analysis.

IV. DISCUSSION
Electroencephalography (EEG) is a reliable clinical diagnostic method which can be used for different health conditions, such as epilepsy, Alzheimer's Disease, sleep disorders, and anaesthesia monitoring. On the research side, EEG is widely used for psychiatric, psychological, and neurophysiological studies of varied cognitive and psychological conditions. It is also used in Brain-Computer Interfaces (BCI), where the EEG signals can be used to control computers, speech devices, and other accessibility tools. Recent research in BCI also uses EEG to identify emotional states in participants. With these varied use cases, combined with its relatively low cost (when compared to other neurological diagnostic tools), EEG is a crucial tool for research in neurology and related areas.
Recent advances in Machine Learning and Deep Learning have opened up various new opportunities in research. However, these methodologies require large volumes of data. Open datasets greatly advance research on these methods.
A number of considerations (e.g., ethics and data availability) need to be taken into account when analysing datasets such as those selected in this review; these are discussed below.

A. ETHICS
Seven datasets were excluded from this review as they had not indicated whether or not they had ethics approval for the studies undertaken. In general, with some exceptions, the included datasets disclosed ethics approval only in the associated publication. Also, some of the datasets did not include a copyright disclaimer, whereas some of the repositories (e.g., PhysioNet) included copyright by default. These disclosures are important from a legal perspective, as they inform potential users of who owns the data and what can be done with it. The most common license among the datasets included in this review is ''CC0'', in which the researcher gives up all copyright, placing the dataset in the public domain, where it can be used with no conditions. The second most common is ''CC BY 4.0'', which is similar to CC0 but requires the user to attribute credit to the publisher.
TABLE 4. EEG data collection characteristics: device, number of channels, and sampling frequency used, type of recording and annotations, file format, and where the recording protocol is available.

B. HEALTH CONDITIONS PREVALENT IN THE DATASETS
Of the different conditions covered by the datasets, the most common is sleep-related problems. This was not unexpected, as it is standard clinical practice to record EEG together with other sensors (e.g., PSG) for the diagnosis of sleep problems. In addition, the use of EEG in research and clinical settings originated from the first use of EEG in humans by Berger [3], [4] monitoring sleep.
Although some diagnoses are very prevalent in the general population, such as migraine, depression or anxiety, very few datasets found in this review study those conditions: 1 for migraine, 3 for depression, and none for anxiety. In the case of migraine, for example, the prevalence is estimated at 1 billion people worldwide, i.e., 15% of the general population is estimated to have a migraine attack in a given year [102]. However, there is only one relevant dataset available (from [100]) which, while very valuable in terms of amount of data, still does not cover the condition as a whole, since only EEG recordings from migraine patients in the interictal phase (between attacks, i.e., at least 3 days before or 3 days after a migraine attack) are available. No open EEG dataset with recordings of patients in different migraine phases is available for use at the time of publication.

C. DATA AVAILABILITY
Various research studies published in recent years have a data availability statement, where the authors indicate that ''data is available upon request''. However, as Tedersoo et al. [103] have found, only approximately 40% of data requests, in general, are fulfilled. As the authors of [29] have discovered, availability of research data decreases with publication age.
TABLE 5. F-UJI FAIR evaluation of the datasets. F is the Findability category, A is the Accessibility category, I is Interoperability, and R is Reusability. The last column shows all of the criteria combined. The percentage values indicate the extent to which the dataset complies with the FAIRsFAIR metrics [43]. A score of 100% indicates that the dataset was identified by the automated tool to be compliant with all of the metrics for that category.
The availability of data is important not only from a replication perspective; it also creates opportunities for further analysis with new methodologies and tools created in the future (e.g., developing deep learning approaches to the diagnosis and treatment of diseases). A benefit of using AI models in medical applications is the ability of the model to deal with noisy data and uncertainties, and to detect patterns in data that are not found using more standard statistical methods. The increase in available data also allows researchers to compare and validate results, which further strengthens the derived conclusions.
However, data sharing increases the burden on researchers, as the dataset needs to be prepared for reuse. As Perrier et al. [21] identified, researchers may also be wary of sharing data due to possible misuse or misinterpretation of the data. Indeed, one of the datasets in this review has a note by the authors asking potential users to correspond with them to make sure the data is being interpreted correctly. Also identified by [21] is the lack of incentives for researchers to share data, as sharing does not carry the same weight as published articles with regard to academic appointment and promotion.
One of the datasets in this review indicates that the associated code is available and provides a link, but at the time of writing this review the code was no longer available at the link provided (dataset [70]; the associated paper [77] links to https://github.com/gsokoloff/Infant-Sleep-Study-I, which does not exist). This could be solved by treating source code for analysis and models as datasets, hosting it in a repository and providing a DOI and metadata. For example, the F-UJI automated tool provides a DOI associated with a Zenodo record containing a version of the automated tool (DOI 10.5281/zenodo.4063720) [40].

D. DATA COLLECTION PROTOCOL
Table 6 shows which datasets include the recording protocol in their description or files. Of the 30 datasets included in this review, 16 did not have a data collection protocol within their files or description. Most of these have the protocol explained in the associated publication, but not all of them had a direct citation or link to the publication, i.e., it was necessary to search the publication databases with author names and general keywords. In some cases, the publication was hard to find, or not found. The protocol details the assumptions, EEG recording task description and method, and other particularities of the methodology used.
In the case of sleep-related datasets, the recording protocol is standardized, i.e., the polysomnography study protocol is generally the same across datasets, so recordings from different datasets can be compared. The same might not be the case for other tasks. Even for tasks observing the same phenomenon (for example, visually evoked potentials), the recording protocols might be different, which makes it harder to compare different datasets and their results.
Also, without the protocol, data analysis is either very hard or impossible, as the assumptions made when collecting the data are not clear, and therefore the data analyst can ''go fishing'' for significant results (also called p-hacking) [104].
A number of datasets in this review, however, have included the code used to generate the stimuli used in the recordings. This enables researchers to use the same recording protocol, down to the same stimuli, which makes the comparison between different recordings more reliable.

E. DATASETS FILE TYPE AND STANDARDS
Although there is no enforced standard for the file type of EEG recordings, recent advancements have been made with the publication of the BIDS standard by Gorgolewski and colleagues [81], and its extension for EEG by Pernet and colleagues [82]. BIDS aims to standardize the file format, organization, metadata, and distribution of neuroimaging data. This makes it easier for researchers to access and reuse data. In addition, the BIDS standard allows analyses, software packages, and processes to be interoperable with any dataset that follows the standard. This removes the need to develop analyses and software for each dataset, which also reduces possible errors.
TABLE 6. Dataset reusability: does it have a description? Is a README file included? Is there a document or text that indicates the file organization / structure of the dataset? Are the filenames descriptive? Where is the protocol located? Is the code used for recording and data analysis available?
For example, datasets that distribute the files in .mat format require specific software (Matlab) to open the files, while BIDS is an open standard whose files can be opened by any software framework or tool that can deal with EEG data. This allows greater flexibility and accessibility to the data. Nine of the 30 datasets are published using BIDS due to the requirement enforced by the repository OpenNeuro, where they are hosted.
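As an illustration of why a shared naming convention helps tooling, the sketch below builds the relative path of an EEG recording following the BIDS pattern sub-<label>[_ses-<label>]_task-<label>[_run-<index>]_eeg.<extension>. It is a simplified sketch, not a validator: the function name and the example labels are hypothetical, and optional entities such as acq are omitted.

```python
from pathlib import PurePosixPath


def _check_label(label: str) -> str:
    """BIDS labels are alphanumeric; reject anything else."""
    if not label.isalnum():
        raise ValueError(f"label must be alphanumeric: {label!r}")
    return label


def bids_eeg_path(subject, task, session=None, run=None, extension=".edf"):
    """Return the relative BIDS path of one EEG recording (simplified)."""
    entities = [f"sub-{_check_label(subject)}"]
    folder = PurePosixPath(entities[0])
    if session is not None:
        entities.append(f"ses-{_check_label(session)}")
        folder = folder / entities[-1]
    entities.append(f"task-{_check_label(task)}")
    if run is not None:
        entities.append(f"run-{int(run)}")
    filename = "_".join(entities) + f"_eeg{extension}"
    return str(folder / "eeg" / filename)


# Hypothetical participant and task labels.
print(bids_eeg_path("HC01", "rest"))
print(bids_eeg_path("PD07", "oddball", session="01", run=2))
```

Because participant, session, task, and run are all recoverable from the path alone, analysis software can iterate over any compliant dataset without dataset-specific parsing code.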
Other than the modalities listed in Section III, no other physiological data collection was found in the datasets included in this review. More physiological data would be beneficial in multiple studies; for example, in the case of Parkinson's Disease, an activity tracker (actigraphy) placed on the wrist could help measure the intensity of tremors [105].

F. FAIR GUIDING PRINCIPLES AND DATASET REUSABILITY
The FAIR assessment for the datasets indicates that they are easily findable, as the F score was high in general. However, this may be due to survivorship bias, as the inclusion criteria for this review required a DOI, which in effect ensures that the datasets score at least 70% on Findability. The only exceptions to this high value are two datasets [67], [68] for which the DOI corresponds to an article describing the dataset and does not link to where the dataset is hosted.
The Accessibility criterion analyses the metadata related to the level of access, and whether the data can be accessed using standard networking protocols (such as HTTP, HTTPS, FTP, etc.). Since most of the datasets are hosted in repositories which provide the infrastructure for data access, the A score is generally high.
The datasets hosted on data repositories such as Figshare, Mendeley Data, and PhysioNet have a high score on the Interoperable principle. The metrics in this principle measure how easy it is for machines to access and read the metadata of the dataset. Most of the data repositories provide the metadata by default, since they are built for this purpose. The datasets that score low in this principle are those hosted elsewhere (other types of repositories, custom servers, etc.) or for which the DOI refers to a published article.
Finally, the Reusable principle metrics assess whether the dataset provides a license, whether there is any information about data provenance, and whether the dataset follows a metadata standard and provides the files in this standard. Most of the datasets in this review scored 50% or lower, which reflects the lack of information about data provenance and the fact that most of the datasets do not provide files in a standard format.
However, the FAIRsFAIR data object assessment metrics have limitations in the assessment of reusability, which is also highlighted by the metrics' authors [43]. Specifically, the metric FsF-R1-01MD, related to the Reusability principle requiring rich metadata, can only verify whether the repository of the dataset contains the information described in general metadata standards. For observational clinical and/or behavioural data, the 'relevant attributes' should contain the data collection protocol, as that directly affects how the data is analysed and interpreted. However, there is no way of checking this automatically or programmatically using the F-UJI tool. In addition, the metric FsF-R1.3-01M, which verifies whether the dataset follows the standard recommended by the target research community, does not correctly identify the BIDS format. Therefore, a low score in the Reusability category in Table 5 does not necessarily represent truly poor reusability for EEG datasets, as the tool cannot identify the community standard BIDS format. Dataset publishers should aim for total compliance with the FAIRsFAIR metrics, but the metrics and the automated tool also need to reflect the requirements and standards of the EEG research community.
The F-UJI tool is capable of checking repository-level information, i.e., the metadata available such as author names, identifiers, keywords, description, and machine-readable information (file names, formats, file structure, dates, and identifiers). However, it is unable to check whether the dataset contains documentation such as file descriptions, variable descriptions and codification (if any), the data collection protocol, and other information needed for analysis and reuse.
For the assessment of documentation level, we created a scale of 0 to 3 points to represent how well the context of the data provided is explained. We looked specifically for data collection protocols, descriptions of the dataset and its file contents, and software or code used for analysis.
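One possible reading of this scale is sketched below. It assumes that each of the three items (protocol, description, code) contributes exactly one point; this equal weighting is an assumption made for illustration, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DatasetDocumentation:
    """Which documentation items a dataset provides (hypothetical structure)."""
    has_protocol: bool     # data collection protocol included in the dataset
    has_description: bool  # description of the dataset and its file contents
    has_code: bool         # software or code used for analysis


def documentation_score(docs: DatasetDocumentation) -> int:
    """Return a 0-3 score, assuming one point per documentation item."""
    return int(docs.has_protocol) + int(docs.has_description) + int(docs.has_code)


# Hypothetical dataset that has a description and code, but no protocol.
print(documentation_score(DatasetDocumentation(False, True, True)))
```

A dataset scoring 3 under this reading would give a new user the protocol, the file layout, and the analysis code without needing to consult the associated publication.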
The review by Roy et al. [19], which looks at published deep learning models and results, identified that 53% of the articles used public data (i.e., shared data) and 42% used private data that is not shared, which means that these models are not reproducible. The review also highlights that 21% of the studies mention that more public data is needed to support research on deep learning models of EEG data. The other problem raised by the review is the lack of labelled clinical data, as labelling requires time and expertise. As for the source code for the models, Roy et al. found that only 13% of the studies included in their review provided the code, and that just 12 out of 154 studies are reproducible (with both code and data shared) [19].
In addition, it is hard to combine multiple datasets for a deep learning model due to the different recording protocols and electrode montages (electrode placement and references). As Yao et al. [106] and Hu et al. [107] have shown, the reference method (monopolar or bipolar) creates systematic changes in the distribution of the signal frequency power.
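To illustrate why the reference matters when pooling datasets, the sketch below derives two differently referenced versions of the same toy recording: a common average reference and a bipolar derivation. This is a generic illustration and not the method of [106] or [107]; the sample values are invented, and only the channel names follow the usual 10-20 convention.

```python
def common_average_reference(recording):
    """Subtract the across-channel mean from every channel, sample by sample.

    recording maps channel name -> list of samples (all of equal length).
    """
    channels = list(recording.values())
    n_channels = len(channels)
    mean = [sum(samples) / n_channels for samples in zip(*channels)]
    return {
        name: [value - m for value, m in zip(samples, mean)]
        for name, samples in recording.items()
    }


def bipolar(recording, first, second):
    """Return the bipolar derivation first - second."""
    return [a - b for a, b in zip(recording[first], recording[second])]


# Toy recording with invented values (two samples per channel).
toy = {"Fz": [1.0, 2.0], "Cz": [3.0, 4.0], "Pz": [5.0, 9.0]}
print(common_average_reference(toy))
print(bipolar(toy, "Fz", "Cz"))
```

The same underlying recording yields different waveforms under each scheme, so recordings that use different references need to be brought to a common one before they are combined for training.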

G. SUMMARY AND FINAL CONSIDERATIONS
EEG is widely used in clinical and research settings. In recent years, EEG data has been used pervasively in Machine Learning algorithms developed to diagnose and further study different health conditions. However, deep learning models require more data for training and validation. To identify what open datasets have been published with EEG data related to healthcare, and to ascertain how reusable these datasets are, a scoping review was carried out.
A large number of potential datasets were found, but only 30 datasets of the 140 records retrieved and assessed actually fulfilled the inclusion criteria. The main reasons for rejection were that the record did not point to a dataset, was not related to healthcare or a diagnosis, or did not disclose ethics approval for the data collection. Of the 30 included datasets, 7 were related to sleep, which was expected as EEG is widely used in clinical settings during a polysomnographic assessment of sleep. The other diagnoses or conditions covered by the datasets found were either psychiatric or neurological, such as Parkinson's Disease, Depression, Schizophrenia, Brain Injury, and so on. We found a lack of datasets for some prevalent conditions in the general population. For example, migraine is a neurological condition that afflicts 15% of the population, but we found only one dataset for migraine.
The majority of the datasets found were open access, while the others required either a credential (by registering with the dataset repository) or a request to the dataset publishers. In general, the datasets were licensed as entirely placed in the public domain, or as requiring only attribution to the dataset authors.
The biggest challenge found was that more than half of the datasets in this review did not have any description of their data collection protocol in the dataset. Of these, the majority had the protocol detailed in a different publication (generally a published study analysing the dataset) that in some cases was not linked to from the dataset page. In those cases, to find the protocol we had to search the dataset authors' publications and try to identify which one was based on that dataset. This makes it harder for other researchers to reuse the dataset. However, some datasets had not only the data collection protocol, but also the code used by the authors to process and analyse the data, which provides great documentation and resources for other researchers. By providing the data and the code, researchers make their results more easily reproducible and verifiable, and also provide resources for the research community in general.
In addition, the FAIR principles are important guidelines created by the research community to improve data sharing and reuse. They specify requirements for Findability (e.g., DOI registration, metadata, registered in an index), Accessibility (metadata has to be accessible and retrievable using standard protocols), Interoperability (metadata follows a certain standard, easily readable by machines and humans), and Reusability (data provenance, attributes, and other aspects relevant to the domain are clear and detailed). The FAIRsFAIR Project created an automated method of scoring datasets by the FAIR principles [42]. However, the Reusability criteria do not easily translate into scoring criteria that can be verified programmatically. The automated method does not assess whether data collection protocols and data explanations are included in the datasets. We therefore created a scale to measure reusability in terms of the documentation level of the dataset.
In general, there is a need to: (i) share more data, as there are no datasets for certain health conditions; and (ii) better document the data shared, which makes it easier for other researchers to verify and use the data.

H. RECOMMENDATIONS FOR DATASET PUBLICATIONS
In line with the discussion above, and with the reusability metrics created for the dataset assessment in Table 6, we recommend that authors follow these guidelines when publishing research data:
• Publish the data in a data repository that assigns a permanent identifier (DOI) to it. Some examples of free data repositories are: Zenodo (https://zenodo.org/) and Figshare (https://figshare.com/) for general data; and OpenNeuro (https://openneuro.org/) and PhysioNet (https://physionet.org/) for EEG and physiological data.
• Include a data availability statement with the link to the data repository (or DOI) when publishing analyses of the dataset. Also include a link/reference to the published article in the dataset.
• Explicitly assign a license to the dataset. If the dataset is open to be reused by other researchers, good licenses are CC0 (''No Rights Reserved'') or CC-BY 4.0 (attribution required); see https://creativecommons.org/share-your-work/cclicenses/ for more options and information.
• Populate the metadata to make the dataset easier to discover, by giving it a meaningful name, adding relevant keywords, and linking to and from the published article.
• Make sure the description clearly identifies the study objectives, the participants, devices and recording protocols used, and any other relevant information. This should be treated as the ''abstract'' for the dataset. It should also state where the participants' data (such as age, gender, which group the participant belongs to, etc.) is located.
• If possible, include a separate file with the detailed data collection protocol.The code (or a link to its repository) used for stimuli generation, if any, should also be included.
• The filenames should be descriptive. Include the participant ID and the task done in the recording, and any other relevant information (e.g., ''HC1 Rest'' for the recording of Healthy Control ID 1 during rest).
• Make sure the dataset follows a standard format. In the case of EEG recordings, the standard format is BIDS [81], [82].
• Any code used for analysis should also be shared. It can be included with the dataset or hosted in another repository such as GitHub (https://github.com/).

V. CONCLUSION
EEG data holds immense potential for advancing healthcare through machine learning and deep learning. While our scoping review identified numerous datasets on this topic, it is apparent that there is room for improvement in terms of data sharing and documentation. Many datasets lacked essential data collection protocols and explanations, hindering their reusability and the replication of results. As we move forward, the research community must prioritize sharing data for health conditions currently underrepresented and focus on comprehensive documentation by adhering to the FAIR principles, with the ultimate goal of enhancing dataset transparency.

FIGURE 1. Dataset selection for inclusion using the PRISMA framework.

FIGURE 2. Distribution of datasets found for each health condition, country, year of publication, and secondary data modalities included.

• Include a README file with:
- The authors or persons responsible for collecting data, including email addresses and affiliations.
- Dates of data collection.
- Geographic information of where the data was collected.
- Ethics approval information.
- Licensing or other restrictions placed on the data.
- Links to publications that cite or use the data.
- Data collection protocol: which devices were used, how many electrodes (and the specific montage used), and what tasks were performed during the EEG recording. This should have enough information so that another researcher could replicate the data collection protocol.
- Whether any preprocessing has been applied to the data and, if so, exactly what has been done.
- The file organization / structure of the dataset. Are the recordings separated into folders by participant or by task? Are raw and preprocessed data separated into folders?

Table
. The search was conducted on the week of 14/Nov/2022.

TABLE 1. List of EEG repositories screened for suitable datasets.

TABLE 2. List of EEG datasets included in this review.

TABLE 3. Data modalities included in each dataset.