BERT Learns From Electroencephalograms About Parkinson’s Disease: Transformer-Based Models for Aid Diagnosis

Medicine is a complex field with highly trained specialists whose extensive knowledge continuously needs updating. Among them, those who study the brain face especially complex tasks due to the structure of this organ. There are neurological diseases, such as degenerative ones, for which diagnosis in very early stages is essential. Parkinson’s disease is one of them, and its diagnosis is usually confirmed when the disease is already very developed. Some physicians have proposed using electroencephalograms as a non-invasive method for a prompt diagnosis. The problem with these tests is that data analysis relies on the clinical eye of a very experienced professional, and some patterns escape human perception. This research proposes the use of deep learning techniques in combination with electroencephalograms to develop a non-invasive method for Parkinson’s disease diagnosis. These models have demonstrated good performance in managing massive amounts of data. Our main contribution is to apply models from the field of Natural Language Processing, particularly an adaptation of BERT models, as they are the latest milestone in the area. This model choice is due to the similarity between texts and electroencephalograms, both of which can be processed as data sequences. Results show that the best model uses 64-channel electroencephalograms recorded during finger-tapping tasks, without resting states. In terms of metrics, the model has values around 86%.

The associate editor coordinating the review of this manuscript and approving it for publication was Md Kafiul Islam.

and cognitive responses or efferences. Between afference and efference, there is an analysis of information that can be altered in pathological states. The interactions between neurons are mediated by molecules called neurotransmitters that allow them to reach different membrane states that produce their activation and depolarization, thus creating an electric stimulus that neurophysiological techniques can register. Electroencephalography, invented by Hans Berger, is a method for recording superficial brain waves as electroencephalograms (EEGs) [21]. EEGs are a particular type of data called time series. We define them as sets of repeated observations of a single unit or individual at regular intervals over many instances [55]. EEGs can be recorded using

explaining the functional changes in this disease, previous works demonstrate the changes produced by this medication [48], so the precise moment when the EEG is registered may be a crucial factor for the analysis. So, the primary motivation of this paper is to find a non-invasive diagnostic method that will prevent patients from taking large amounts of medicine that could be harmful to their health. In this respect, the most useful data would be raw EEGs (without transformations that could lead to information loss), as they are the default physiological brain signal. Also, visual inspection of PD recordings does not allow physicians to make diagnoses, so we need to apply Artificial Intelligence techniques. These techniques nowadays allow the processing of large amounts of data. In particular, deep learning techniques are models that comprise multiple processing layers to learn data representations with various abstraction levels [12]. Such analysis applied to EEGs could serve as an early diagnosis method with an impact on disease characterization and management.

In this paper, deep learning techniques from the Natural Language Processing (NLP) research area have been applied to build a model for characterizing Parkinson's EEG changes in different states of dopaminergic stimulation. These changes are compared with those from controls, and the differences obtained could give hints for early diagnosis of the disease. Considering the nature of the data, both texts and EEGs can be processed as data sequences. Among the NLP models, we use Bidirectional Encoder Representations from Transformers (BERT), as it is the latest breakthrough in the area [16]. The paper's main contribution is the application of this model, which will lead to a non-invasive method to help clinicians diagnose PD. As far as we know, this is the first time BERT has been adapted to an EEG classification task for diagnosis. This fact is endorsed by Maitin et al. [34], a review of machine learning techniques for PD classification.

The main benefit of applying deep learning techniques for diagnosing PD using EEGs is that there are no evident brain structural alterations, as may be the case in epilepsy, and the functional changes such as motor performance depend on the dopaminergic stimulation. Thus, the cortical activity may vary depending on the degree of degeneration. The external dopamine administration makes it quite challenging to differentiate patients from healthy subjects depending on the patient's functional state.

The rest of the paper is structured as follows. Section 2 summarizes the state of the art related to computer science models and PD diagnosis. Section 3 describes the dataset used in the research and defines the methods used. Section 4 discusses the results obtained during the study. Finally, Section 5 gives some conclusions and suggests future work.

There are many studies of EEGs with classical machine learning techniques. A work that uses EEGs from Alzheimer's patients can be found in Podgorolec. It applies subspace methods and its version of decision trees. In another case,

[49], and [57], respectively. As far as we know, there are no papers where EEGs of PD have directly been used with BERT models as described in our paper.

In this paper, inspired by the language representation model BERT, we developed a neural model to process and classify EEGs, diagnosing whether a patient suffers from PD or not. The main novelty of this work is the direct use of EEGs (a non-invasive technique) to diagnose PD with BERT models.

The following subsections describe the resources used in this work and the techniques applied. First, we give a brief description of the EEGs and their collected data. Second, we give a formal definition of the deep learning models that have been applied.

The data in this research corresponds to some EEG tests on

Every channel of an EEG is a sequence of values measuring potential differences at each point of the process. NLP state-of-the-art neural models can process sequences efficiently to generate different outputs. These models can even attend to other parts of an input sequence to produce the desired result ([6], [16], and [54]).

This paper considers a parallelism between EEG and texts. An EEG channel is a sequence of measurements like a

VOLUME 10, 2022

As has been said before, each encoder has two sublayers: self-attention and feed-forward, which also receive information from a residual layer. This layer aims to introduce information from previous states that could be lost during the data processing [16]. BERT's last Transformer connects to a simple neural network classifier with several hidden layers and a bicategorical output layer using a SoftMax activation function. SoftMax will let BERT discriminate between Parkinson's patients and healthy people [16].

As with texts, EEGs have the particularity that a value at a specific moment can only be understood by considering the previous values. In our case, EEGs must be evaluated using what is happening in all the channels at given moments. This approach determines how EEGs get into the neural models. A sliding window mechanism uses all the channels at the same time and splits each EEG into different small pieces. Using small pieces has the advantage of reducing the size of each input and yielding a more populated dataset of small instances. This sliding window has two parameters that decide how to create the instances. The first parameter is called the step and controls how much the start of a window is shifted with respect to the start of the previous window. The second parameter is the width and controls the number of values between the window's start and end. In the present work, these parameters have the following values: the step comprises 95% of the window and the width is 256 instances. In this way, we go through the EEG employing windows with an overlap of 5% to maintain its continuity. Fig. 3 illustrates this mechanism. C1 to C6 denote six channels, and t1 to t9 are nine timestamps corresponding to the win-

representation. Notice that the lack of the embedding layer reduces the size of the classification models and thus saves training time. Moreover, a large amount of data is needed to obtain good embeddings, on the order of billions of words for good word embeddings [32].
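The sliding-window mechanism described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sliding_windows`, the list-of-lists data layout (channels by samples), and the choice to drop trailing samples that do not fill a whole window are assumptions. The step is interpreted here as 95% of the window width, which gives the 5% overlap stated in the text.

```python
from typing import List

Window = List[List[float]]  # one window: channels x samples


def sliding_windows(eeg: List[List[float]], width: int = 256,
                    step_fraction: float = 0.95) -> List[Window]:
    """Split a multichannel EEG (channels x samples) into overlapping windows.

    Every window spans all channels at once. The step between window starts
    is step_fraction * width samples, so consecutive windows overlap by
    roughly (1 - step_fraction) of the width.
    """
    if not eeg or width <= 0 or not 0 < step_fraction <= 1:
        raise ValueError("need a non-empty EEG, width > 0, 0 < step_fraction <= 1")
    n_samples = len(eeg[0])
    if any(len(channel) != n_samples for channel in eeg):
        raise ValueError("all channels must have the same number of samples")
    step = max(1, int(width * step_fraction))  # 243 samples when width is 256
    windows: List[Window] = []
    # Assumption: trailing samples that do not fill a whole window are dropped.
    for start in range(0, n_samples - width + 1, step):
        windows.append([channel[start:start + width] for channel in eeg])
    return windows


# Synthetic six-channel recording with 1000 samples per channel.
eeg = [[float(c * 1000 + t) for t in range(1000)] for c in range(6)]
windows = sliding_windows(eeg)
print(len(windows), len(windows[0]), len(windows[0][0]))
```

With a width of 256 and a step of 243 samples, consecutive windows share 13 samples, which is approximately the 5% overlap mentioned in the text.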

To compare the results, we develop two experiments based on BERT models with the same architecture. First, a model is trained on the 28 innermost channels, assuming that the peripheral channels add noise, to predict whether an EEG comes from a person with PD or not. Then, a model is implemented that processes the 64-channel EEG, which means using all the information collected in the EEGs.

Both experiments are trained including all the EEGs (both tapping and resting state) for each individual (training strategy 1) and then removing those corresponding to a resting state (training strategy 2). The motor task chosen is finger-tapping, which consists of a self-cued repetitive opposition of the thumb and index finger of each hand and is one of the most informative tasks included in clinical evaluations such as the UPDRS. The reason is that the hand has a predominant somatotopic representation in the basal ganglia and is one of the earliest locations of motor alterations identified in the disease [30]. On the other hand, the resting state has been extensively used in functional magnetic resonance imaging (fMRI) to study functional connectivity among specific brain regions organized into networks [26]. These networks' dynamics and disruption may be associated with various diseases. The resting state has also been extensively used to study EEG microstates [29], which are altered in PD depending on the dopamine administration [48].

Then, we have to split the dataset into training, validation, and test subsets. Training and validation comprise the training stage, and then the test stage is used to classify new EEGs. In this case, 80% of the cases were used for training and validation, applying 5-fold cross-validation, and 20% of the cases were used for the test (examples never seen by the model during training). The different subsets were chosen randomly in terms of individuals but always maintained the percentage of patients and controls. Although BERT-based models can work with unbalanced classes [37], this double validation allows us to eliminate the bias produced by the choice of data, to identify failures during the training process through cross-validation, and to verify the generalization capacity of the model by means of a blind test set. The split into train/validation/test sets was carried out guaranteeing patient independence, and then the division into windows was performed. The classification models give a result for a particular time window of an EEG and a specific class. The final classification probability is the average of the probabilities for each EEG fragment.
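The splitting and aggregation steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names, the subject identifiers, and the round-robin fold assignment are hypothetical. It splits at the subject level first (so that windows cut afterwards never mix one person across subsets), keeps the class proportions, builds five folds from the training and validation subjects, and averages per-window probabilities into one score per EEG.

```python
import random
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def split_subjects(labels: Dict[str, int], test_fraction: float = 0.2,
                   seed: int = 0) -> Tuple[List[str], List[str]]:
    """Split subject IDs into (train_val, test), keeping the class ratio."""
    rng = random.Random(seed)
    by_class: Dict[int, List[str]] = defaultdict(list)
    for subject in sorted(labels):
        by_class[labels[subject]].append(subject)
    train_val: List[str] = []
    test: List[str] = []
    for subjects in by_class.values():
        rng.shuffle(subjects)
        n_test = max(1, round(len(subjects) * test_fraction))
        test.extend(subjects[:n_test])
        train_val.extend(subjects[n_test:])
    return train_val, test


def stratified_folds(subjects: Sequence[str], labels: Dict[str, int],
                     k: int = 5) -> List[Tuple[List[str], List[str]]]:
    """Return k (train, validation) subject lists with classes spread over folds."""
    folds: List[List[str]] = [[] for _ in range(k)]
    for label in sorted({labels[s] for s in subjects}):
        members = [s for s in subjects if labels[s] == label]
        for i, subject in enumerate(members):
            folds[i % k].append(subject)  # round-robin within each class
    pairs = []
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        pairs.append((train, folds[i]))
    return pairs


def eeg_probability(window_probabilities: Sequence[float]) -> float:
    """Average the per-window PD probabilities of one EEG into a final score."""
    return mean(window_probabilities)


# Hypothetical cohort: 10 patients (label 1) and 10 controls (label 0).
labels = {f"pd{i:02d}": 1 for i in range(10)}
labels.update({f"hc{i:02d}": 0 for i in range(10)})
train_val, test = split_subjects(labels)
folds = stratified_folds(train_val, labels)
print(len(train_val), len(test), len(folds))
print(eeg_probability([0.9, 0.7, 0.8]))
```

Windowing is deliberately left out of this sketch: it would be applied to each subject's recordings only after the subject has been assigned to a subset, as the text requires.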

Goncharova et al. [22] claim that electrodes situated on peripheral areas are more prone to collecting noise. Considering that, the 64-channel EEG baseline model is replicated but using only the 28 innermost channels.

The elimination of peripheral electrodes does not affect the central electrodes, which collect the information from the primary motor and sensory areas. These areas are expected to reflect most of the changes produced by the dopaminergic stimulation in the disease. The reason for selecting these particular channels is two-fold: first, to confirm the previous hypothesis, and second, because they allow us to maintain the four attention heads in the model's architecture.
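The remark about keeping four attention heads can be illustrated with a short check. In standard multi-head attention, the feature dimension is divided evenly among the heads, so it must be a multiple of the number of heads. The sketch below assumes that, in the absence of an embedding layer, the number of channels acts as that feature dimension; this is an assumption of the example, as are the function names and the channel indices, which are placeholders rather than the actual 28 electrodes.

```python
from typing import List, Sequence


def select_channels(window: List[List[float]],
                    keep: Sequence[int]) -> List[List[float]]:
    """Keep only the listed channel rows of a window (channels x samples)."""
    return [window[i] for i in keep]


def head_dimension(n_features: int, n_heads: int = 4) -> int:
    """Per-head size in multi-head attention; features must split evenly."""
    if n_features % n_heads != 0:
        raise ValueError(f"{n_features} features cannot be split into {n_heads} heads")
    return n_features // n_heads


# Placeholder indices: the real set of 28 interior electrodes is not given here.
interior = list(range(28))
window = [[0.0] * 256 for _ in range(64)]
reduced = select_channels(window, interior)

print(len(reduced), head_dimension(len(reduced)))  # 28 channels, 7 per head
print(len(window), head_dimension(len(window)))    # 64 channels, 16 per head
```

Both 64 and 28 are multiples of four, so the same four-head architecture fits either input under this assumption.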

The trained approaches are evaluated with accuracy, specificity, sensitivity, and precision as metrics. Since we are dealing with a medical use case, the metrics should consider false positives and false negatives [33]. In this work, a false positive is a healthy person misdiagnosed with PD. A false negative is a person with PD diagnosed as healthy.
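The four metrics follow directly from the counts of true and false positives and negatives. The sketch below is a minimal illustration, assuming labels are encoded as 1 for PD and 0 for healthy; the function name and the example labels are hypothetical and are not results from this work.

```python
from typing import Dict, Sequence


def diagnostic_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, sensitivity, specificity, and precision (1 = PD, 0 = healthy)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # PD correctly detected
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy misdiagnosed with PD
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # PD diagnosed as healthy

    def ratio(numerator: int, denominator: int) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "precision": ratio(tp, tp + fp),
    }


# Made-up labels for illustration only.
print(diagnostic_metrics([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0]))
```

Reporting sensitivity and specificity separately matters here because a model that labels almost everyone as a patient can reach high sensitivity while its specificity collapses.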

After training the 28-channel models with both training sets (with and without resting-state EEGs) for five epochs, we obtained the results in Table 2. It contains the four metrics for both trainings, separated by split (training and validation), with their standard deviations.

TABLE 2. Evaluation of the 28 channels models with both trainings.

As seen in Table 2, we can interpret the results by considering the bias-variance trade-off [9]. First, bias seems acceptable in some metrics, as diagnostic accuracy is slightly over 80% [44]. In terms of variance, the model trained without resting tests has good results in all metrics except specificity, due to its differences between stages. Similar results were obtained when including resting states.

If we analyze the results in depth, we can see that the variability of the true positive and true negative rates (sensitivity and specificity) without resting states is lower than when using these tests, but still significantly high. In this experimentation (28 channels), we have less data than in the other case, as we dispense with one of the EEG tests. However, percentage-wise, the difference between the classes is maintained. This affects the specificity metric, as we can see in the results. In addition to having a significantly low value, its standard deviation exhibits high values, around 30%. When the resting test was not used, we did not observe significant differences between the precision and accuracy results.

Analyzing the results of all the tests, we found the following. On the one hand, sensitivity, the metric that gives the rate of true positives, has values above 90% in all stages of experimentation (that is, train, validation, and test). Still, it exhibits very high deviation values, around 44% in all cases. On the other hand, specificity, the metric that gives the rate of true negatives, has very low values, around 20%. When performing a 5-fold strategy, we find high

imbalance. We can corroborate this result with the increased precision and accuracy metrics with respect to the 28-channel model using all data.

As a final way to check the performance of our model, we compare it with two classical deep learning models widely used with EEGs: CNNs and RNNs with Gated Recurrent Units (GRUs). Both models are inspired by Shi et al. [50] but have been adapted to our data. We trained both models under the same conditions as our BERT model and obtained underfitting results. So, we decided to increase the number of epochs to obtain well-trained models. The results of this comparison are in Table 4.

TABLE 4. Evaluation of the 64 channels models with both trainings.

As seen above, our model improves on the results of the RNNs but is slightly worse than the CNNs. However, it should be considered that we needed more epochs to obtain baseline models that do not underfit. We also want to recall that this work aims to demonstrate that powerful NLP techniques like BERT can be used in biosignal processing. In fact, there is a tendency to use these models in other fields. For example, He et al. [24] use BERT for image classification. In this way, the next step would be to test the performance of BERT and EEGs in a more complex problem that could be difficult to solve with CNNs.

The main aim of this work has been to develop a neural model that could differentiate between Parkinson's patients and healthy subjects using EEGs as time series and taking advantage of NLP techniques. For this purpose, first, we have collected a set of EEGs from PD subjects and controls. Parkinson's EEGs have been recorded in several conditions, considering that there may be significant changes according to the degree of the disease or even with motor activation. Then, we retrained different versions of the BERT model to prove our hypothesis. Also, additional training strategies have been developed to achieve the results.

We obtain two main conclusions. First, EEGs without resting states (that is, only finger-tapping EEGs) help the models discriminate better between Parkinson's patients and healthy controls than the full set that includes resting states. Second, the 64-channel model best differentiates between PD and healthy subjects. To summarize, our main conclusion is that the 64-channel model without resting EEGs was the best option in this case. Results across the different metrics are around 86% when classifying EEGs as belonging to a patient with Parkinson's or a healthy subject.

This value may occur because a BERT model requires more data to perform training, and removing part of the electrodes does not contribute to improving the results of the classification problem. We could also think that, in the case of PD, the affected area extends to peripheral regions; therefore, these electrodes also contain information about the disease.

artificial intelligence and machine learning disciplines, cognitive science, and brain-computer interfaces. Application fields of her research have evolved from automating industrial process and generating robots' behavior, passing by knowledge discovery in exhaustive data volumes to more recently computational linguistics, and cognitive modeling and technological platforms for human-machine interfacing.

JUAN PABLO ROMERO received the master's degree in neurobiochemistry, biotechnology, and neuropsychology, and the Ph.D. degree in neurodegenerative diseases mortality from the Complutense University of Madrid. He is a Neurologist specializing in movement disorders and brain damage rehabilitation. He is currently a Professor of neurology and neuroanatomy at the Francisco de Vitoria University, Madrid. He is the main Researcher and the Chief of the Neurorehabilitation of Movement Disorders and Brain Damage Research Group with several research lines funded by national and international grants. His research interests include the non-invasive neuromodulation applied to the rehabilitation of cognitive and motor functions on Parkinson's Disease and brain damage and biosignal processing for identifying disease progression markers.