Classification of Motor Imagery EEG Signals Based on Deep Autoencoder and Convolutional Neural Network Approach

Brain-computer interface (BCI) technology uses electroencephalogram (EEG) signals to establish direct interaction between the human body and its surroundings, with promising applications in medical rehabilitation and cognitive science. In recent years, deep learning approaches, particularly the detection and analysis of motor imagery signals using convolutional neural network (CNN) frameworks, have produced outstanding results in BCI systems. However, the complex process of data representation limits practical applications, and the end-to-end approach reduces recognition accuracy. Moreover, since noise and other signal sources can interfere with the brain's electrical signals, EEG classifiers are difficult to improve and have limited generalisation ability. To address these issues, this paper proposes a new approach for EEG motor imagery signal classification. A variational autoencoder (VAE) first removes noise from the signals. A deep autoencoder (DAE), which trains a deep neural network to replicate its input at its output through encoding and decoding operations, is then combined with a CNN architecture to classify the EEG motor imagery signals. Experimental results show that the proposed approach is feasible and that it outperforms current CNN-based approaches and several traditional machine learning approaches.


I. INTRODUCTION
Brain-computer interface (BCI), also known as the brain-machine interface (BMI), creates a non-muscular channel that allows the human body to communicate directly with external devices [1] [2]. There are three types of BCI paradigms: invasive, partially invasive, and non-invasive. The most powerful procedures, such as visual or motor implants [3] [4], are the most invasive, but they come with many risks associated with surgery, such as scar tissue and infections. On the other hand, non-invasive procedures such as electroencephalography (EEG) [5] can help with medical diagnosis and research while also addressing real-world issues. Electrocorticography (ECoG) [6] is a semi-invasive method that only requires surgery to implant devices on the brain's surface.
The most extensively studied non-invasive BCI is EEG, which is relatively inexpensive and simple to carry out, while providing fine temporal resolution. As a result, it is a common method for analyzing and monitoring changes in brain electrical activity.
The frequency range of brain oscillations is typically between 0.5 and 40 Hz. Based on these measurements, EEG signals are divided into five rhythms: Delta δ (0.5-4 Hz), Theta θ (4-7 Hz), Alpha α (8-13 Hz), Beta β (14-30 Hz), and Gamma γ (> 30 Hz), all of which are present in different parts of the brain [7]. EEG aids in the detection of a variety of brain conditions, such as sleep disorders, emotional variation, and seizures, and researchers have recently proposed EEG-based user identification systems [8], neuromarketing [9] and rating prediction systems [10].
EEG-based BCI has been studied under various paradigms, including exogenous and endogenous ones. One of the most common is motor imagery (MI). In MI, subjects imagine moving a specific part of their body (such as the left hand, the right hand, or the feet) instead of actually moving it, and this imagination is accompanied by specific oscillatory activity in the brain's sensorimotor cortex [11].
Machine learning technology is frequently used to categorize and identify these MI tasks. The MI-BCI system can be used to control external devices such as wheelchair controls or neural prostheses for people with disabilities, as well as to assist healthy people with difficult tasks, such as controlling devices, automatic driving, and epilepsy diagnosis. Because of the low cost, high availability, and lack of any manual support with EEG signals, motor imagery BCI systems have become increasingly powerful intelligent machines.
The MI-BCI system has traditionally been divided into five phases: acquisition of signal data, preprocessing of data, feature extraction, classification, and device control interface [12]. MI-EEG signal collection, signal digitalisation, and data storage are all part of the data acquisition phase. Filtering, cleaning, and transformation of data are all part of the data preprocessing phase. Discriminative features are extracted from EEG signal data that contain useful information during the feature extraction phase. The extracted features are used as input to train machine learning models in the classification phase. Different signals and MI tasks can be classified using the trained models. Finally, in the device control interface phase, the categorized signals are converted into commands to control devices like robots and home appliances [13]-[15].
In the field of MI-EEG research, a wide range of feature extraction and classification techniques have recently been used. The Common Spatial Patterns (CSP) and its variants, such as the filter-bank CSP (FBCSP), have progressed considerably in the classification of MI. Deep learning techniques, on the other hand, do not rely on handcrafted features.
In this paper, a novel deep learning architecture for MI EEG signal classification based on DAE and CNN is presented. The CNN convolution kernel is trained using an unsupervised deep autoencoder. The main benefit of combining a deep autoencoder with a CNN is that the DAE is a nonlinear feature extraction technique that is used before classification of a high-dimensional dataset to remove redundant information that exists in the EEG signal. It also has a significant advantage over other competitors because it is capable of learning a deep neural network that is trained to replicate its input to its output through encoding and decoding operations.
The deep autoencoder is a fully connected three-layer neural network that is trained without supervision, since no labels are required during training. Encoder training aims to learn better representations of the input data in order to train better classifiers.
The remainder of the paper is organised as follows: Section II provides a comprehensive review of the main approaches developed for MI-EEG. Section III describes the proposed DAE and CNN approach. The experimental results and discussion of the approach are presented in Section IV, followed by conclusions in Section V.

II. RELATED WORK
A variety of machine learning algorithms and feature extraction methods have been studied in the field of MI-EEG in order to overcome the limitations of small datasets and low signal-to-noise ratios. The research background can be divided into two groups: traditional machine learning approaches and modern deep learning-based ones. These methods are discussed in more depth below.

A. TRADITIONAL MACHINE LEARNING APPROACHES
Many techniques for MI-EEG classification have been proposed, including the traditional method of hand-crafted feature extraction in the time and frequency domains. One of the most popular and powerful feature extraction methods is the Common Spatial Patterns (CSP) algorithm [16] [17], which underlies a variety of other approaches, for example Filter-Bank CSP (FBCSP) [18] [19], Sub-band CSP (SBCSP) [20], sparse CSP [21], and discriminative filter bank CSP (DFBCSP) [22], which have achieved good accuracy on public datasets such as BCI Competition IV datasets 2a and 2b.
Many researchers use time-frequency signal processing methods for feature extraction. Examples include the short-time Fourier transform (STFT), which Tabar and Halici [23] used to extract time, frequency, and location information from raw EEG signals while also converting them to images; the continuous wavelet transform (CWT), which Lee and Choi [24] used to convert EEG signals into time-frequency spectra; and empirical mode decomposition (EMD) [25]. Furthermore, dimension reduction techniques such as principal component analysis (PCA) and independent component analysis (ICA) are used to improve MI task recognition performance [26]-[28].
In addition, traditional classifiers such as Linear Discriminant Analysis (LDA) and Support Vector Machines (SVM) have been used to discriminate hand-crafted feature vectors. A transfer function for dimensionality reduction was proposed by Ji and Ye [29] as a consolidated framework for generalised Linear Discriminant Analysis (LDA). The framework describes the characteristics of various algorithms as well as their relationships.
The SVM can also produce promising results when combined with the right features [30]. Kumar et al. [31] proposed a mutual information-based frequency band selection method based on CSP features, in which LDA is used to further reduce dimensionality before an SVM classifies the patterns. In a follow-up study, the same authors used a Long Short-Term Memory (LSTM) network for temporal filtering and LDA for spatial filtering before classifying with an SVM [32].
Park and Chung [33] developed a method for extracting more discriminative CSP features for an SVM classifier using an optimal channel selection method, resulting in 88.62% accuracy.
Islam et al. [34] proposed a multiband tangent space mapping with sub-band selection (MTSMS) approach, in which a multichannel EEG signal is decomposed into multiple sub-bands and the tangent features are estimated on each sub-band. The sub-bands whose features can improve motor imagery classification accuracy are selected using an effective algorithm based on mutual information analysis. The feature space is created by combining the features of the selected sub-bands, a principal component analysis-based approach is used to reduce the feature dimension, and the data is then classified using an SVM.
Despite the fact that these traditional approaches have been successful in recognising MI, designing accurate BCIs remains a challenge, leaving room for further research in this area.

B. DEEP LEARNING APPROACHES
In the field of EEG signals, the traditional machine learning approach is commonly used. However, its EEG signal processing performance and accuracy are insufficient. To address this issue, researchers have begun to investigate the feasibility of employing many deep learning approaches in the analysis of EEG signals.
Li et al. [35] used a CNN to classify MI signals. They began by extracting primary features such as EEG channel dependency as well as temporal features, after which a CNN was used to extract high-level features. Chaudhary et al. [36] developed a deep CNN (DCNN) method for recognising MI tasks in a BCI system. The DCNN is specifically used to classify EEG signals from right hand and foot MI tasks. The method uses two common time-frequency (T-F) approaches, the short-time Fourier transform (STFT) and the continuous wavelet transform (CWT), to convert the input EEG signals into images. The images obtained after the T-F transformation are fed to the DCNN stage, where AlexNet, a pre-trained DCNN model, is investigated for classification. Lee and Choi [37] used the CWT to construct 2D images that were then used to train a CNN model.
Zhang et al. [38] developed a novel approach that combines deep learning and data augmentation. To improve classification accuracy on a small training dataset, the empirical mode decomposition method was applied to EEG frames to generate new artificial frames. Two neural networks were then used to train the weights and classify two classes of motor imagery signals: a CNN and a wavelet neural network (WNN), a type of neural network that replaces convolutional layers with wavelets. For MI classification, Wu et al. [39] proposed the parallel multiscale filter bank convolutional neural network (MSFBCNN). In a layered end-to-end network structure, a feature-extraction network extracts temporal and spatial features. To train an individual model for inter-subject classification on small datasets, a network setup and fine-tuning strategy is used to improve the transfer learning ability.
Amin et al. [40] proposed a multi-layer CNN (MCNN) method for improving EEG MI classification accuracy by combining CNNs with different characteristics and architectures (CCNN). Different convolutional features are used to capture spatial and temporal features from raw EEG data. The authors reported that the MCNN and CCNN outperformed the state-of-the-art machine learning and deep learning techniques they were compared against.
Lu et al. [41] developed a new deep learning method based on restricted Boltzmann machines (RBM). For comparison, the frequency representation of EEG recordings was obtained using the Fast Fourier transform (FFT) and wavelet package decomposition (WPD) separately. The frequency domain data is then used to train a deep belief network (DBN) made up of RBM.
Xu et al. [42] employed transfer learning to train their CNN, which was designed to classify MI signals. By transferring the parameters of the VGG-16 (visual geometry group) pre-trained network to their proposed CNN model, they were able to take advantage of it.
Alazrai et al. [43] employed the Choi-Williams distribution (CWD) transformation on EEG signals to create 2D images for CNN input. The energy distribution in both the time and frequency domains is reflected in these images.
Zhao et al. [44] proposed a multi-branch 3D CNN for MI signal classification. A 3D representation is created by converting EEG signals into a sequence of 2D arrays that preserves the spatial distribution of the sampling electrodes, and the multi-branch 3D CNN and the classification strategy are designed specifically for this representation. Tayeb et al. [45] proposed three deep learning models for decoding motor imagery movements from raw EEG signals: (1) a long short-term memory (LSTM) network; (2) a spectrogram-based CNN; and (3) a recurrent convolutional neural network (RCNN). The deep learning models outperformed state-of-the-art machine learning techniques in terms of classification performance, which could pave the way for the development of new robust EEG signal decoding. Using the CNN-based BCI, they successfully controlled a robotic arm in real time.
Kwon et al. [46] developed a large MI-based EEG database and proposed a subject-independent framework based on a deep CNN. The large-scale MI database is used to represent the overall brain signal patterns through spectral-spatial input generation, and the resulting feature representation, combined with a deep neural network, shows superior performance.

III. METHODOLOGY
Fig 1 depicts the proposed system framework. The data of each subject was downloaded separately, making a total of 10 datasets, whose trials were mixed randomly and then divided into ten parts: one part for the test set and the remaining nine for the training set. The test and training sets therefore combined data from several subjects. Each trial yielded the raw MI-EEG signals from nine pairs of symmetrical electrodes spread across the motor cortex, with each pair's signals forming a sample. Each extracted signal contains a significant amount of noise, so a variational autoencoder (VAE) was used to filter the noise.
The proposed CNN model was then trained in conjunction with a deep autoencoder, which reproduces the input at the output layer so that both are as similar as possible. The EEG features were learned using a 5-layer CNN, and the dimensions were reduced using four max pooling layers. The FC layer divided MI into four categories: left fist, right fist, both fists, and both feet. The best training model can then be found by comparing its predictions with the four types of labels. Finally, the test set was used to assess the model's validity.

A. DATASET DESCRIPTION
The PhysioNet dataset [47] used in this paper was recorded by the developers of the BCI2000 system [48]. It consists of over 1,500 one- and two-minute EEG recordings, sampled at 160 Hz, from 109 different subjects. Each subject completed four MI tasks: left fist (T1), right fist (T2), both fists (T3), and both feet (T4), with 21 trials per MI task. Fig 2 illustrates the trial's timing diagram.
The trial begins at t = -2 seconds, and the subject relaxes for 2 seconds. The target appears on the screen at t = 0, and the subject performs the MI task for four seconds. At t = 4 s, the target vanishes, indicating that the trial is over. A new trial begins after a 2-second break [49]. Each motor imagery period therefore lasts about 4 seconds at a sampling frequency of 160 Hz, so each electrode's effective data size per trial is 4 × 160 = 640 points. Each sample consists of a pair of symmetric electrodes whose data are concatenated in series, giving a sample size of 2 × 640 = 1,280.
Each subject completed a total of 84 trials, 21 for each of the four MI tasks. The dataset in this study was subjected to 10-fold cross validation, with each subject's trials divided into ten parts. For each task class, two trials were used for testing and the remainder for training. As a result, each subject's test set contains 8 trials, while the training set contains 76 trials. The datasets of the 10 subjects (S1∼S10) therefore contain 840 trials, 760 for training and 80 for testing. Furthermore, each trial yielded 9 samples. For model training and the validation of generalisation performance, the 10-subject dataset with 7,560 samples was used in this experiment.
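The sample-size arithmetic above can be illustrated with a short Python sketch. The function name `build_pair_sample` and the use of plain lists are illustrative assumptions, not the authors' code; the dummy signals only stand in for real recordings.

```python
SAMPLING_RATE_HZ = 160   # sampling frequency stated in the text
MI_DURATION_S = 4        # length of the motor imagery period in seconds


def build_pair_sample(left_signal, right_signal):
    """Concatenate the MI windows of two symmetric electrodes in series.

    Assumption: each input starts at cue onset (t = 0) and holds at
    least MI_DURATION_S seconds of data.
    """
    n = SAMPLING_RATE_HZ * MI_DURATION_S   # points per electrode
    if len(left_signal) < n or len(right_signal) < n:
        raise ValueError("each signal needs at least %d points" % n)
    return list(left_signal[:n]) + list(right_signal[:n])


# Dummy signals stand in for real recordings of one electrode pair.
left = [0.0] * 700
right = [1.0] * 700
sample = build_pair_sample(left, right)
print(len(sample))  # 1280
```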
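As a quick consistency check, the trial and sample counts quoted above can be reproduced from the per-subject figures. The constant names below are hypothetical and chosen only for this sketch.

```python
TASKS = 4                  # T1..T4
TRIALS_PER_TASK = 21
TEST_TRIALS_PER_TASK = 2   # trials held out per task class
SUBJECTS = 10              # S1..S10
SAMPLES_PER_TRIAL = 9      # one sample per symmetric electrode pair

trials_per_subject = TASKS * TRIALS_PER_TASK
test_per_subject = TASKS * TEST_TRIALS_PER_TASK
train_per_subject = trials_per_subject - test_per_subject

total_trials = SUBJECTS * trials_per_subject
train_trials = SUBJECTS * train_per_subject
test_trials = SUBJECTS * test_per_subject

total_samples = total_trials * SAMPLES_PER_TRIAL
train_samples = train_trials * SAMPLES_PER_TRIAL
test_samples = test_trials * SAMPLES_PER_TRIAL

print(trials_per_subject, train_per_subject, test_per_subject)  # 84 76 8
print(total_trials, train_trials, test_trials)                  # 840 760 80
print(total_samples, train_samples, test_samples)               # 7560 6840 720
```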

B. PREPROCESSING
The data is subjected to signal amplification and filtering at the time of acquisition. Data segmentation was then used to split the data stream into segments. The following steps were used to preprocess the EEG dataset:
1) The data was sampled at a rate of 128 Hz.
2) The Variational Autoencoder (VAE) was used to remove EEG noise using a blind source separation technique, based on prior EEG noise reduction research [50].
3) The discrete wavelet transform (DWT) was used to split EEG signals into bands; beta and gamma waves were considered in this experiment.
4) A common reference was used to average the data, and the three-second pre-trial segments were removed because the data was divided into 60-second trials.
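The band-splitting step can be sketched with a two-level DWT. The text does not name the wavelet, so the Haar wavelet below is an assumption chosen for simplicity, and the function names are hypothetical. At a 128 Hz sampling rate, the level-1 detail coefficients nominally cover 32-64 Hz (gamma) and the level-2 detail coefficients 16-32 Hz (roughly beta); these band edges are approximate because the Haar filters are not sharp.

```python
import math


def haar_dwt_step(signal):
    """One level of the Haar DWT: returns (approximation, detail)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def beta_gamma_bands(signal):
    """Return (beta, gamma) detail coefficients for a 128 Hz signal."""
    approx1, gamma = haar_dwt_step(signal)   # level 1 detail: ~32-64 Hz
    _, beta = haar_dwt_step(approx1)         # level 2 detail: ~16-32 Hz
    return beta, gamma


def energy(values):
    return sum(v * v for v in values)


# Synthetic 40 Hz tone (inside the nominal gamma band), one second long.
fs = 128
one_second = [math.sin(2 * math.pi * 40 * n / fs) for n in range(fs)]
beta, gamma = beta_gamma_bands(one_second)
print(len(beta), len(gamma))  # 32 64
```

For the synthetic 40 Hz tone, most of the energy ends up in the level-1 (gamma) coefficients, which is the behaviour the band split relies on. A smoother wavelet would separate the bands more cleanly.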

C. DEEP AUTOENCODERS
As shown in Fig 3, an autoencoder is a fully connected artificial neural network system with a bottleneck layer [51]. In this paper, a deep autoencoder is used to extract ERP and morphological features from EEG signals in order to help capture neural activities related to both sensory and cognitive processes.
The encoder and the decoder are the two blocks that make up an autoencoder. The encoder converts a higher-dimensional input feature vector into a lower-dimensional representation at the bottleneck layer. The decoder transforms the bottleneck features back into a higher-dimensional representation. Learning is driven by the input and output features, which ensures that the bottleneck layer presents a lower-dimensional representation of the input features. The encoding operation can be described as follows:

$$z = f(\theta; x) = l(Wx + n) \tag{1}$$

where the input feature vector $x$, propagated through the hidden layers, is represented by the bottleneck feature vector $z$, and $\theta = \{W, n\}$. $W$ denotes the network weights, $n$ the network biases, and $l$ a linear or non-linear activation function.
At the output stage, the bottleneck feature vector $z$ propagates through the hidden layers of the decoder and is mapped to the higher-dimensional representation $y$ as

$$y = j(\theta'; z) = l(W'z + n') \tag{2}$$

where $\theta' = \{W', n'\}$. As a result, the DAE output can be expressed as a function of the weights and biases of the encoder and decoder stages, namely $\{\theta, \theta'\}$, and written as $y = j(\theta'; f(\theta; x))$. The DAE parameters $\theta$ and $\theta'$ are optimised to bring $y$ as close to the input/target $x$ as possible while also maximising $p(x \mid y)$. The autoencoder parameters are optimised by backpropagating the mean square error (MSE) between the target $x$ and the network output $y$.

The deep autoencoder was chosen because it learns a compressed representation of the input in order to reconstruct it later, making it suitable for dimensionality reduction. Its encoder and decoder mitigate the effects of the curse of dimensionality, and the learned features are fed to a standard CNN classifier.

The features are derived from data pertaining to the stimulus onset of self-face images by averaging voltages in 20 non-overlapping time windows with a width of 10 sample points, yielding a 5 × 20 matrix per trial, which only uses the selected channels and time intervals. Each 5 × 20 matrix was then expanded to a 20 × 20 matrix by zero-padding, i.e., filling the blank space with zeros, so that a 20 × 20 feature map was generated for each trial to organise the features for CNN training.
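The encoding and decoding operations and the MSE reconstruction objective described above can be sketched as a single forward pass. This is a minimal illustration under stated assumptions, not the authors' network: the layer sizes, the sigmoid encoder activation, the linear decoder output (a linear choice of $l$), and all function names are hypothetical, and no training loop is shown.

```python
import math
import random


def affine(weights, biases, vector):
    """Compute W v + n for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def encode(x, W, n):
    """z = l(W x + n): map the input to the bottleneck (sigmoid assumed)."""
    return sigmoid(affine(W, n, x))


def decode(z, W_dec, n_dec):
    """y = W' z + n': map the bottleneck back (linear output assumed)."""
    return affine(W_dec, n_dec, z)


def mse(x, y):
    """Mean square error between target x and reconstruction y."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
input_dim, bottleneck_dim = 8, 3           # illustrative sizes only
W = random_matrix(bottleneck_dim, input_dim, rng)
n = [0.0] * bottleneck_dim
W_dec = random_matrix(input_dim, bottleneck_dim, rng)
n_dec = [0.0] * input_dim

x = [rng.uniform(-1.0, 1.0) for _ in range(input_dim)]
z = encode(x, W, n)          # lower-dimensional bottleneck features
y = decode(z, W_dec, n_dec)  # reconstruction with the input's dimension
loss = mse(x, y)             # the quantity backpropagation would minimise
```

In training, the gradient of `loss` with respect to the encoder and decoder parameters would be used to update them, and the bottleneck vector `z` would serve as the feature vector passed on to the classifier.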

D. CNN
The dataset consisted of 10 subjects and 7,560 samples, with 6,840 samples in the training set and 720 samples in the test set. The number of layers in the network and their parameters were determined through a series of experiments, which led to five convolutional layers and four max pooling layers. This paper used the deep convolutional neural network shown in Fig 4, which is useful for identifying key feature vectors. Table 1 shows the CNN architecture that was chosen: the input layer is the first layer; the convolutional layers are the second, third, fifth, seventh, and ninth layers; the max pooling layers are the fourth, sixth, eighth, and tenth layers; and the fully connected layer is the eleventh layer.
The CNN input data format is T × N, where T is the number of electrodes used and N is the number of sampling points of each channel. In this paper, the output of the DAE was used as the input, with T = 488 and N = 64.
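The layer ordering described above can be written down as a simple list, which makes it easy to check the counts of each layer type. Only the order is taken from the text; kernel sizes, channel counts, and strides belong to Table 1 and are not reproduced here. The names `LAYERS` and `positions` are hypothetical.

```python
# Layer order as described in the text (positions are 1-indexed).
LAYERS = [
    "input",    # 1
    "conv",     # 2
    "conv",     # 3
    "maxpool",  # 4
    "conv",     # 5
    "maxpool",  # 6
    "conv",     # 7
    "maxpool",  # 8
    "conv",     # 9
    "maxpool",  # 10
    "fc",       # 11
]


def positions(kind):
    """Return the 1-indexed positions of all layers of the given kind."""
    return [i for i, name in enumerate(LAYERS, start=1) if name == kind]


print(positions("conv"))     # [2, 3, 5, 7, 9]
print(positions("maxpool"))  # [4, 6, 8, 10]
```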
Downsampling is the core of the pooling process. Max pooling was chosen, which takes the largest value within each neighbourhood. It extracts feature information better by suppressing the shift in the estimated mean caused by network parameter errors. The max-pooling operation [52] is given in equation (3):

$$z^{k}_{i,j} = \max_{(x,w) \in y_{i,j}} l^{k}_{x,w} \tag{3}$$

where $z^{k}_{i,j}$ is the result of the pooling operator on the $k$th feature map, $y_{i,j}$ is a pooling region that represents a local neighbourhood around the position $(i, j)$, and $l^{k}_{x,w}$ is the element of the $k$th feature map at position $(x, w)$ within that region.

The FC layer is deployed after feature extraction to improve the network's nonlinear mapping capacity. It interprets global data and combines the local features learned by the convolutional layers to create global classification features. There are no connections between neurons within this layer, and every neuron in this layer is linked to all neurons in the preceding layer. The formula is as follows:

$$y^{(l)}_{j} = a\left(\sum_{i=1}^{n} w^{(l)}_{ji}\, y^{(l-1)}_{i} + b^{(l)}_{j}\right)$$

where $y^{(l)}_{j}$ is the output of neuron $j$ in layer $l$, $n$ denotes the number of neurons in the preceding layer, $w^{(l)}_{ji}$ is the weight of the connection between neuron $j$ in this layer and neuron $i$ in the preceding layer, $b^{(l)}_{j}$ is the bias of neuron $j$, and $a$ is the activation function.
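The pooling and fully connected operations described above can be sketched in a few lines. The 2 × 2 non-overlapping pooling window is an assumption made for illustration (the text does not state the window size), and the function names, weights, and example feature map are hypothetical.

```python
def max_pool_2d(feature_map, size=2):
    """Non-overlapping max pooling over size x size regions (size assumed)."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, rows - size + 1, size):
        pooled_row = []
        for j in range(0, cols - size + 1, size):
            region = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            pooled_row.append(max(region))   # largest value in the region
        pooled.append(pooled_row)
    return pooled


def fully_connected(inputs, weights, biases, activation):
    """Each output neuron sees every input: a(sum_i w_ji * x_i + b_j)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 7, 2],
               [3, 6, 2, 8]]
pooled = max_pool_2d(feature_map)
print(pooled)  # [[4, 5], [6, 8]]

# Flatten the pooled map and pass it through a tiny FC layer with ReLU.
flat = [v for row in pooled for v in row]
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.5, 0.5, 0.0, 0.0]]
biases = [0.0, -10.0]
outputs = fully_connected(flat, weights, biases, lambda v: max(0.0, v))
print(outputs)  # [4.0, 0.0]
```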
The output of the FC layer is passed to a softmax layer with four neurons $[t_1, t_2, t_3, t_4]$ that represent the four categories. It maps the outputs of the neurons to the interval (0, 1), so that they can be interpreted as class probabilities in multi-class classification. The formula is as follows:

$$\mathrm{softmax}(t_i) = \frac{e^{t_i}}{\sum_{j=1}^{4} e^{t_j}}, \quad i = 1, \dots, 4$$

Rectified linear units (ReLU) [53] were chosen as the activation functions, since they make the vanishing gradient problem less likely during training and speed up training convergence:

$$\mathrm{ReLU}(x) = \max(0, x)$$

Furthermore, the proposed CNN architecture uses the batch normalisation (BN) algorithm to improve classification accuracy. Batch normalisation normalises the input of each network layer (to mean 0 and standard deviation 1) for each batch of data. That is, for the $k$th dimension of the input of any neuron in the layer, the following formula is used:

$$\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$

where $x^{(k)}$ is the $k$th neuron's original input data, $E[x^{(k)}]$ is the mean of the $k$th neuron's input data, and $\sqrt{\mathrm{Var}[x^{(k)}]}$ is the standard deviation of the data in the $k$th neuron. Batch normalisation adds more constraints to the data distribution, improving the model's generalisation ability. After normalisation, the mean and standard deviation of the input distribution are forced to be 0 and 1, respectively. To allow the original distribution to be restored, learnable parameters $\gamma$ and $\beta$ are introduced in practical implementations:

$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$

where $\gamma^{(k)}$ and $\beta^{(k)}$ are the learnable scale and shift parameters. Setting $\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]}$ and $\beta^{(k)} = E[x^{(k)}]$ recovers the original data distribution. The complete forward normalisation process of the batch-normalised network layer over a mini-batch of $m$ inputs is as follows:

$$\mu = \frac{1}{m} \sum_{i=1}^{m} X'_i, \qquad \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (X'_i - \mu)^2, \qquad \hat{X}'_i = \frac{X'_i - \mu}{\sqrt{\sigma^2}}, \qquad Y_i = \gamma \hat{X}'_i + \beta$$

where $X'_i$ is the input, $\mu$ is the mean, $\sigma^2$ is the variance, and $m$ is the number of inputs in the mini-batch.

The training method used was mini-batch training, which divides the training samples into smaller chunks. Instead of updating after every single sample, all parameters are updated after learning one mini-batch. The batch size used was 64.
The Adam optimisation algorithm was used to minimise the loss function, with a constant learning rate of $1 \times 10^{-5}$.
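The softmax, ReLU, and batch normalisation operations described above can be sketched as follows. The function names are hypothetical, and the small constant `eps` is an assumption added for numerical stability, as is common in practical implementations; it is not part of the description in the text.

```python
import math


def relu(values):
    """Rectified linear unit: negative inputs are clipped to zero."""
    return [max(0.0, v) for v in values]


def softmax(logits):
    """Map logits to probabilities that are positive and sum to 1."""
    shift = max(logits)   # subtracting the max avoids overflow in exp
    exps = [math.exp(t - shift) for t in logits]
    total = sum(exps)
    return [e / total for e in exps]


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one neuron's inputs over a mini-batch, then scale and shift.

    eps is an assumed stabiliser that prevents division by zero.
    """
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


# Four logits stand in for the outputs [t1, t2, t3, t4] of the FC layer.
probs = softmax([2.0, 1.0, 0.5, -1.0])
predicted_class = probs.index(max(probs))   # index of the most likely class

normalised = batch_norm([1.0, 2.0, 3.0, 4.0])
```

With `gamma=1.0` and `beta=0.0`, the normalised values have a mean of approximately 0 and a standard deviation of approximately 1. Other values of `gamma` and `beta` rescale and shift that distribution.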
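For readers unfamiliar with Adam, a single-parameter sketch of its update rule is given below. Only the learning rate of 1e-5 comes from the text; the values of `beta1`, `beta2`, and `eps` are the commonly used defaults and are assumptions here, as are the function name and the toy quadratic objective.

```python
import math


def adam_step(param, grad, state, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter.

    state holds the step count t and the running moment estimates m and v.
    beta1, beta2 and eps are assumed defaults, not values from the paper.
    """
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad          # 1st moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad   # 2nd moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])   # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3).
w = 0.0
state = {"t": 0, "m": 0.0, "v": 0.0}
for _ in range(1000):
    grad = 2.0 * (w - 3.0)
    w = adam_step(w, grad, state)
```

Because Adam normalises the gradient by its running magnitude, each step moves the parameter by roughly the learning rate, so with a rate of 1e-5 the parameter travels only about 0.01 in 1,000 steps. This illustrates why such a small constant rate leads to slow but stable convergence.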

IV. EXPERIMENTAL RESULTS AND DISCUSSION
In this paper, the DAE was used to extract the features, and the performance of these features was evaluated using the CNN classifier.
In the classification step, the dataset was subjected to 10-fold cross validation, with 90% of the data used to train the CNN model and the remaining 10% used as a test set to validate the model and assess its robustness to changes in the data. The training and test sets were normalised before being passed to the CNN for processing.

A. EVALUATION METRICS
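The following metrics are used to compare the performance of the proposed method with that of other approaches: accuracy, F1 score, and the receiver operating characteristic (ROC) curve. The metrics are calculated as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. The area under the ROC curve (AUC) is the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance, and it ranges from 0.5 (chance level) to 1. The method is more reliable if the AUC is close to 1.0.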
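The metrics above can be computed directly from confusion-matrix counts. The sketch below uses made-up counts and scores purely for illustration; they are not results from this paper, and the function names are hypothetical. The AUC is computed from its ranking interpretation, counting ties as half a win, which is a common convention assumed here.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def auc_by_ranking(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Illustrative counts and scores only.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
auc = auc_by_ranking([0.9, 0.8, 0.4], [0.7, 0.3])
```

For a four-class problem such as the one in this paper, these quantities are typically computed per class in a one-vs-rest fashion and then averaged.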

B. CLASSIFICATION ACCURACY
To obtain valid results, all trials of a specific subject were divided into 10 parts, nine for training and one for testing, so that no data block was split between the training and test sets.
The model was then trained and tested for accuracy. Data segmentation, training, and testing were all part of the process. Moreover, for each subject, ten cycles were created. Their average was used to get the global averaged accuracy of each subject. The overall accuracy (ACC) and area under curve (AUC) for the 10 subjects are shown in Table 2.
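The per-subject splitting into ten parts can be sketched with a generic k-fold index generator. This is a plain, unstratified split written for illustration; it does not reproduce the per-class selection of test trials described in the dataset section, and the function name and seed are hypothetical.

```python
import random


def k_fold_indices(n_trials, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation.

    Trials are shuffled once, then dealt into k folds; each fold serves
    as the test set exactly once.
    """
    indices = list(range(n_trials))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 84 trials per subject, as in the dataset description.
splits = list(k_fold_indices(84))
print(len(splits))  # 10
```

Training and evaluating a model once per `(train, test)` pair and averaging the ten accuracies gives the per-subject averaged accuracy described above.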
The global average accuracy of the ten subjects is 95.91%, as shown in Fig 5(a). S8 had the best classification result, with MI accuracies of 98.47% (T1), 96.31% (T2), 99.56% (T3), and 98.15% (T4). S10 has the lowest average accuracy (94.41%), with MI accuracies of 91.11% (T1), 89.17% (T2), 97.56% (T3), and 90.12% (T4), as shown in Fig 5(b). T1 has the lowest accuracy, indicating that S9 has the worst classification effect on T1. Across the four MI tasks, the average accuracy of the ten subjects is 94.71% (T1), 95.52% (T2), 95.09% (T3), and 95.35% (T4). Both fists and both feet are the best and worst of the four types of MI tasks respectively.

Fig 6 illustrates the loss and accuracy curves on the PhysioNet dataset for the 10 subjects. The convergence of the models can be seen under a variety of conditions. The number of iterations is represented by the abscissa, while the loss and accuracy values are represented by the ordinate. As shown in Fig 6(a) and (b), the loss on the training set decreases rapidly during the first two epochs. The loss on the test set does not decrease at the same rate as on the training set, but instead remains nearly flat over several epochs. This suggests that our model generalises well to new data. Moreover, the accuracy increases rapidly in the first two epochs, indicating that the network is learning quickly. After that, the curve flattens, indicating that the model can be trained with fewer epochs, as shown in Fig 6(c) and (d).
The mean confusion matrix for all subjects, shown in Fig 8, gives the group-level classification results. The numbers on the diagonal represent the percentage of correct classifications, while the off-diagonal numbers represent the percentage of misclassifications. The left and right hands had the best MI discrimination.

C. PERFORMANCE COMPARISON
To verify the effectiveness of the proposed approach, it was compared with the work of Ma et al. [54], Pinheiro et al. [55], Dose et al. [49], Karácsony et al. [56], Hou et al. [57] and Lun et al. [58], who performed the same MI tasks on the same dataset, as shown in Table 3.
The proposed method outperforms the most effective of these models, suggesting that the DAE improves the model's decoding performance and that the CNN improves its generalisation performance in the classification of MI.
EEG signals are frequently low-amplitude and noisy, and their patterns are not consistent across subjects. Nevertheless, the cross-subject model achieves a high level of accuracy. In comparison to most previous works, the VAE performs well in removing noise, after which the DAE generates the features that are fed to the CNN for classification.
The accuracy and stability of high-dimensional BCI classification are improved with this approach while also improving their generalisation ability. As a result, a BCI may become more widely used and not just suitable for one individual.

V. CONCLUSION
This paper presents a novel DAE and CNN classifier for motor imagery classification using EEG signals. EEG signals from ten subjects, recorded while they performed four MI tasks, were used to assess the performance of the proposed framework. This study's findings demonstrate that the proposed framework is capable of distinguishing between MI tasks within subjects. Moreover, this approach can be further tested with a clinical-grade EEG system, and it should be investigated whether the number of electrodes can be reduced even further.