Ensemble Convolutional Neural Networks With Support Vector Machine for Epilepsy Classification Based on Multi-Sequence of Magnetic Resonance Images

Classification of brain abnormalities as a pathological cue of epilepsy based on magnetic resonance (MR) images is essential for diagnosis. Several types of structural brain abnormalities can serve as such cues, and to identify them a neurologist may examine several sequences of MR images at a time. Existing algorithms for abnormality classification usually involve only one or two sequences of MR images. In this paper, we propose an ensemble of convolutional neural networks with a support vector machine (SVM) scheme to classify brain abnormalities (epilepsy) vs. non-epilepsy based on axial multi-sequence MR images. The convolutional neural network (CNN) models in the proposed method are base-learner models with different architectures and few parameters. The proposed method improves performance by combining the outputs of the base-learner models with combinations of their predictions, obtained by majority voting, weighted majority voting, and weighted average. The combined output then becomes the input to the meta-learning process, in which an SVM performs the final classification. The evaluation dataset consists of axial multi-sequence MR images that include abnormal brain structures causing epilepsy and non-epilepsy cases with various subject histories. The experimental results show that the proposed method obtains an average accuracy and F1-score of 86.37% and 90.75%, respectively, improving accuracy by 6.7%-18.19% over the base-learner CNN models and by 2.54%-2.65% over the combinations of predictions. The proposed architecture also performs better than two existing CNN architectures.


I. INTRODUCTION
Epilepsy is a chronic disease of the brain characterized by repeated seizures, which are involuntary movements involving part of the body or the whole body [1]. Efforts to detect the disease early will help determine the cause of epilepsy. EEG (electroencephalogram) is generally used to check whether a patient is having an epileptic seizure and to determine the type of seizure or even a trigger factor for epilepsy. However, EEG-based diagnosis cannot establish the etiology and has too low a spatial resolution to detect the brain abnormality that causes epilepsy [2]. Magnetic resonance imaging (MRI) can detect changes in the microstructure of the source of epilepsy because it has a relatively high spatial resolution. Therefore, the study in [3] recommended structural MRI as the standard of investigation in epilepsy patients. Identification involving several sequences of MR images is advantageous in detecting the brain abnormalities that are a source of epilepsy (e.g., hippocampal sclerosis, cortical dysgenesis, brain tumor, cerebral vascular abnormalities, and others). The HARNESS-MRI protocol shows the advantage of each sequence of MR images in identifying the brain's structural abnormalities (microstructural changes) [4]. However, each sequence of MR images provides different benefits in identifying any given structural brain abnormality, as reported in [5]. Therefore, increasing the performance of automatic methods for processing MR images will help improve the sensitivity of epilepsy identification.
The associate editor coordinating the review of this manuscript and approving it for publication was Xinyu Du.
Several researchers have previously reported detection/classification results for epilepsy based on brain structure abnormalities (e.g., temporal lobe epilepsy, focal cortical dysplasia). Most of this research addresses the detection or classification of only one abnormality type, e.g., abnormalities in temporal lobe epilepsy in [2], [6]-[9] and focal cortical dysplasia (FCD) in [10]-[12]. The study in [6] used one sequence of MR images to classify microstructural abnormalities in temporal lobe epilepsy (TLE) against non-TLE. Visual assessment of two sequences, T1 and T2, has also been used for the diagnosis of hippocampal sclerosis (HS) in patients with mesial temporal lobe epilepsy (MTLE) [7]. For FCD lesion detection, the studies in [10] and [11] used the T1-weighted sequence as input, while the use of two sequences (T1-MPRAGE and T2-FLAIR) is discussed in [12]. These two abnormalities account for the largest percentage of epilepsy patients, as reported by Wellmer et al. [13]. The diagnosis of other types of brain abnormalities also uses a specific sequence of MR images to get the best results. Therefore, specific imaging protocols are required to identify a structural abnormality [13]. The initial diagnosis of whether a person has structural abnormalities of the brain must therefore involve several sequences of MR images. Involving this many sequences in manual diagnosis is thorough, but it is complicated and time-consuming. Hence, automated detection or classification with reliable methods is essential. However, automated detection/classification of epilepsy that simultaneously involves multiple sequences of MR images and multiple types of abnormalities has not been investigated. Fig. 1 shows that most previous studies used only one or two sequences of MR images for the identification/detection/classification of one type of brain structural abnormality. Consequently, such studies have drawbacks: they cannot identify/detect/classify epilepsy caused by other types of abnormalities at initial diagnosis, and their sensitivity can decrease.
Based on the weaknesses of the previous studies and the diagnostic protocols for each type of brain abnormality in [4], [5], and [13], the initial diagnosis needs many sequences of MR images to reveal the various possible abnormalities in each sequence. Therefore, we propose a method for the two-class classification of brain structures (epilepsy, non-epilepsy) that involves several sequences (multi-sequence) in the training process. Fig. 1 illustrates the focus of our study, which uses multi-sequence MR images with several types of brain abnormalities that cause epilepsy in training.

FIGURE 1.
Most of the previous studies and our research focus on classifying brain structural abnormalities that cause epilepsy.
The multi-sequence MR images introduce high data variability, which greatly affects the classifier's performance in identifying/classifying brain structural abnormalities. We use a convolutional neural network (CNN), a classification method that has proven powerful for image data [14], together with a CNN model ensemble technique to improve classification performance. The CNN models in this study are built with few parameters to suit the limited training data while maintaining performance. These CNN models serve as base-learner models in the ensemble technique. We use the ensemble technique to improve classification accuracy and reduce the variability of the results [15], [16]. A meta-learner stage using machine learning is beneficial in improving classification performance. The support vector machine (SVM) is one machine learning method that has proven reliable in classifying brain abnormalities that cause epilepsy [2], [6]. Therefore, we propose an ensemble scheme for these CNN models using SVM at the meta-learner stage based on axial multi-sequence MR images (emsCNN-SVM) to improve classification performance. We have conducted several experiments to evaluate the proposed emsCNN-SVM. The main contributions of this research are as follows:
• We propose an axial multi-sequence MR image approach to classify brain structural abnormalities causing epilepsy against non-epilepsy brain structures. The axial multi-sequence MR images involved in the learning process contain several types of brain structural abnormalities for epilepsy patients and several types of brain structures for non-epilepsy patients.
• We build CNN models based on the multi-sequence MR images as base-learner models, keeping the number of parameters low and guarding against overfitting on the limited dataset, to classify brain structural abnormalities that cause epilepsy vs. non-epilepsy brain structures.
• We propose an ensemble scheme that combines CNN models at the base-learner stage with SVM at the meta-learner stage. It uses both the outputs of the base-learner models and the combinations of their predictions, which improves performance and reduces variability in the classification of brain structural abnormalities that cause epilepsy vs. non-epilepsy brain structures.
The remainder of this paper is structured as follows: Section II discusses a survey of relevant previous research work on the classification of brain structural abnormalities that cause epilepsy. Section III describes the dataset of the experiment and the proposed method. The experimental scenarios and results are in Section IV. Section V discusses the experimental results. Finally, Section VI states the conclusions and suggestions for future research.

II. RELATED WORK
In this study, we classify brain structural abnormalities as cues that cause epilepsy vs. non-epilepsy subjects based on axial sequences of MR images. Therefore, this section explores the relevant current research in the literature from two perspectives: first, the classification of brain structural abnormalities using machine learning, and second, the classification using CNN.
Classification of brain structural abnormalities that cause epilepsy using machine learning is reported in [6], [10]-[12], and [17]. Del Gaizo et al. [6] used a diffusion MRI sequence to classify temporal lobe epilepsy (TLE) vs. non-TLE. They determined scalar diffusion measures from diffusion kurtosis imaging (DKI). Then, they used a weighted average of support vector machine models to classify TLE vs. non-TLE based on the scalar diffusion input. Their method yielded an accuracy of 68% (fractional anisotropy), 51% (mean diffusivity), and 82% (mean kurtosis). The use of SVM was also reported by Wang et al. [17] to detect mesial temporal sclerosis (MTS) based on the T1-weighted sequence. The detection begins with the segmentation of tissue (grey matter, white matter, and cerebrospinal fluid (CSF)) and the hippocampus, followed by feature extraction of volume, shape, and ratio of CSF. The experimental results showed that their proposed technique provides promising performance for MTS. Studies on the detection of abnormalities in TLE are also reported in [7]-[9], although they do not use machine learning for detection. Another abnormality classification, FCD, was performed by Qu et al. [10] using multiple classifier fusion and optimization (MCFO) based on voxel-based morphometry (VBM) features from the T1-weighted MRI sequence. Their proposed MCFO involved several classifiers and minimized false positives using F-scores. The testing results with this method showed a decrease in false positives. A similar study was conducted by Jin et al. [11] using T1-weighted sequences produced by three different magnetic resonance imaging scanners. They determined morphological and intensity features as inputs for a non-linear neural network classifier. Their experiments at a threshold of 0.9 obtained an optimal sensitivity of 73.7% and a specificity of 90% in FCD detection. Mo et al. [12] also performed FCD lesion detection by combining quantitative multimodal surface features with an artificial neural network (ANN) to assess its clinical value. The testing results showed that the method's accuracy, sensitivity, and specificity were 70.5%, 70%, and 69.9%, respectively, which outperformed the unimodal classifier.
Classification of brain structural abnormalities (epilepsy) using deep learning is reported in [2], [18], and [19]. Huang et al. [2] identified epilepsy using DKI images. They segmented the hippocampus and used transfer learning with VGG16 to extract DKI image features. These features were the input to a support vector machine (SVM) that classified epilepsy (hippocampus) vs. normal control. Their proposed method obtained a best classification accuracy of 90.8%. Torres-Velazquez et al. [18] used multimodal MRI to classify TLE. They introduced the Multi-Channel Deep Neural Network (mDNN) for TLE classification. Their experiments showed the potential of the mDNN approach to combine multiple data sets for TLE classification. Another abnormality classification (juvenile myoclonic epilepsy/JME) was conducted by Si et al. [19] using CNN-based transfer learning. They used a diffusion MRI sequence to detect subtle changes in white matter. Using three CNN models, the experimental results showed that transfer learning based on inception_resnet_v2 is better than Inception_v3 and Inception_v4 in classifying JME, with a classification accuracy of 75.2%.
Consider the results of previous studies that combined MR image feature extraction and machine learning to classify brain structural abnormalities as epilepsy cues. Most of these studies proposed methods to obtain representative MR image features or to combine several features [10]-[12], [17]. The researchers usually focused on one or two sequences of MR images to get these features. In addition, they typically used one classifier [6], [11], [12] or several classifiers [10] to get the best classification performance. These efforts are reasonable, but the best classification performance is not necessarily obtained by using the features that are considered representative. This approach can be ineffective and time-consuming, especially in studies involving multiple sequences of MR images and several types of abnormalities. Therefore, a reliable classifier is needed to solve this problem, such as the CNN classifier [2], [19]. The study in [2] showed that CNN is a robust classifier whose convolution process performs feature extraction optimized against the loss function of the classification results. The main problem is that datasets of MR images for epilepsy cases are relatively limited, so researchers rarely use CNNs because they tend to overfit. Several techniques can be used to reduce overfitting, such as data augmentation [20], [21], architectural design with few parameters [16], and validation techniques in learning.
In a previous study in [22], we have reported the CNN model with low parameters for epilepsy classification based on EEG signals. To overcome the limitations of the dataset in training, we divided the EEG signal into many segments (multi-segment) and converted it into a spectrogram image. This study used a CNN model and decided on the final classification results using majority voting based on the model predictions in each segment. Although the method in this study yielded good performance, it did not necessarily obtain good performance for the epilepsy classification based on MR images. These results occurred because the signal pattern was different from MR images.
In this study, we included multi-sequence of MR images for the brain abnormalities classification (epilepsy) against non-epilepsy to increase the performance (accuracy, sensitivity) and to overcome the limitations of the dataset. Involving multi-sequence of MR images on CNN will have high variability in results [23] so that the ensemble technique of some CNN models is a solution to improve accuracy and can reduce variability [15], [16]. Therefore, in this study, we propose ensemble CNN that differs from the existing methods in some aspects: (i) involving multi-sequence MR images and some types of brain abnormalities causing epilepsy, (ii) using some CNN models with low parameters as base-learner models, (iii) involving the output of the base-learner models and combinations of predictions as input to the meta-learner.

A. DATASET ACQUISITION
We investigated several T1 and T2 sequences of 37 epilepsy patients. The patients consisted of 17 males and 20 females, including 48.6% with an additional history of epilepsy and seizures and 51.4% with an additional history of stroke, tumor, trauma, temporal lobe, left focal epilepsy, syncope, cerebral edema, and hemianopia. The MR image sequences were obtained from Universitas Airlangga Hospital (Rumah Sakit Universitas Airlangga-RSUA), Surabaya, Indonesia, using a 1.5 T MRI scanner from 2018 to 2020. We obtained ethical clearance to use this retrospective dataset for research from the hospital's ethics committee. For the non-epilepsy dataset, we used nine healthy subjects free of neurological disease, seven tumor patients, six stroke patients, and five meningioma patients.
In this study, MRI sequences were acquired from each subject in the axial plane, including T1, T2-FLAIR, T2-FSE, DWI, T2-FLAIR PROPELLER, and T2 PROPELLER. All MRI sequences were obtained with a 2D acquisition type, a slice thickness of 5 mm, a matrix of 512×512 (except for DWI, 256×256), and a flip angle of 90 degrees (except for T2-FLAIR PROPELLER and T2 PROPELLER, 160 degrees). The repetition time differed between sequences: T1 with a repetition time of 500 ms, T2-FLAIR 8800 ms, T2-FSE 4212 ms, DWI, T2-FLAIR PROPELLER 8000 ms, and T2 PROPELLER 4780 ms.
Each slice of each sequence from the epilepsy and non-epilepsy subjects was then converted into an MR image. Each image (frame) was selected and collected in an image dataset for experimental purposes. A total of 4231 MR images were used for the experiment, including 2515 epilepsy MR images and 1716 non-epilepsy MR images, as shown in Table 1.

B. DATA PRE-PROCESSING
The input images for the CNN model must all be the same size. Therefore, resizing the image of each slice is an essential pre-processing step. In this work, we decided to use a fixed size of 512 × 512 pixels because most of the MRI scanner acquisitions were 2D with a size of 512 × 512, except for the DWI sequence at 256 × 256. This choice was made to avoid a negative impact on the performance of the classification model [24]. The DWI image sequence was resized from 256 × 256 to the predetermined target of 512 × 512 using MicroDicom, as shown in Fig. 2. The next pre-processing step, which is also essential, is the normalization of each image pixel. The normalization is done to maintain process stability and convergence in the network. In this study, we normalized each image by changing each pixel value from the range [0,255] to [0,1]. The normalized value was obtained by multiplying each image pixel by a scale factor of 1/255.
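The normalization step described above can be illustrated with a minimal sketch. The function names, the `TARGET_SIZE` constant, and the toy image are assumptions introduced here for illustration; the text does not say which tooling was used for normalization.

```python
# Minimal sketch of the pixel normalization described in the text.
# All names here are hypothetical; images are plain nested lists of 8-bit values.

TARGET_SIZE = (512, 512)  # fixed input size stated in the text


def normalize_image(image):
    """Map pixel values from [0, 255] to [0, 1] by multiplying by 1/255."""
    scale = 1.0 / 255.0
    return [[pixel * scale for pixel in row] for row in image]


def has_target_size(image, target=TARGET_SIZE):
    """Check that an image already has the fixed (rows, columns) size."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    return (rows, cols) == target


# Toy 2 x 2 image, used only to show the scaling.
toy = [[0, 255], [51, 102]]
normalized = normalize_image(toy)
```

A size check of this kind could run before training so that any frame that is not 512 × 512 (for example, an unresized DWI frame) is caught early.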

C. CONVOLUTIONAL NEURAL NETWORKS (CNNs)
A convolutional neural network is a deep learning model often applied to visual images and has been shown to achieve high accuracy [14], [25]. Five CNN architectures are proposed in this study, each of which has several layers, namely an input layer, convolutional layers, activation layers, pooling layers, fully-connected layers, and an output layer. We name the five CNN architectures msCNN$_1$, msCNN$_2$, msCNN$_3$, msCNN$_4$, and msCNN$_5$, as shown in Fig. 3. The CNN architectures are built to classify brain structural abnormalities causing epilepsy vs. non-epilepsy brain structures based on axial multi-sequence MR images.
In this study, we built the architectures of CNN with different structures for the epilepsy classification. In 2D/3D, areas of structural abnormalities in the brain have different sizes between subjects (patients). In addition, the involvement of some brain abnormalities types in this study also causes higher variability in the shape and size of brain structural abnormalities. Therefore, we decided to build some CNN models with different structures to strengthen the classification of brain structural abnormalities that cause epilepsy.

1) INPUT LAYER
In this study, the input layer is the layer to enter the normalized sequence of MR images in the pre-processing stage into the convolution process. The input image size for each proposed CNN architecture is 512 × 512. These sizes are made equal to most of the original dimensions of each sequence of MR images to obtain complete feature information.

2) CONVOLUTIONAL LAYER
In this layer, the convolution process is carried out on the input image of each MR sequence, or on the input from the previous layer, by shifting a filter. This process produces feature maps, or image sequence patterns, from a low to a high level [22]. Therefore, this convolution process uses many feature maps to obtain the characteristics of an image [26], [27]. In this study, the convolution operation in the five proposed CNN models can be written as follows:

$$Z_i = f(W_i * X + b_i) \qquad (1)$$

where $Z_i$ is the output of the convolution process of the msCNN$_i$ model, $X$ is the input sequence of MR images, $f(\cdot)$ is the activation function, $W_i$ is the weight of the convolution process of the msCNN$_i$ model, and $b_i$ is the bias of the convolution process of the msCNN$_i$ model. These weights are updated during the training process to improve the classification results [28]. In this study, the number of filters used in each model is not the same. The architecture of msCNN$_1$ [29].

3) ACTIVATION LAYER
In this layer, a non-saturating activation function is applied to improve the nonlinearity of the decision function. In this study, the activation function used is the rectified linear unit (ReLU) [26], and for each model it is given by the following equation:

$$\hat{Z}_i = \max(0, Z_i) \qquad (2)$$

where $\hat{Z}_i$ is the ReLU output of the msCNN$_i$ model.
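As an illustration of the convolution and ReLU steps, the following sketch applies a single filter to a small single-channel image. It assumes stride 1 and no padding, neither of which is stated in the text, and, as is common in deep learning frameworks, it computes a cross-correlation (the filter is not flipped). The function names and the toy values are hypothetical.

```python
# Minimal sketch of one convolution filter followed by ReLU.
# Assumptions (not stated in the text): stride 1, no padding, one channel.


def conv2d_valid(image, kernel, bias=0.0):
    """Slide `kernel` over `image` and add `bias` to every response."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = bias
            for i in range(kh):
                for j in range(kw):
                    acc += kernel[i][j] * image[r + i][c + j]
            row.append(acc)
        output.append(row)
    return output


def relu(feature_map):
    """Replace negative responses with zero."""
    return [[max(0.0, value) for value in row] for row in feature_map]


# Toy input and filter, for illustration only.
image = [[5, 1, 0],
         [2, 3, 1],
         [0, 4, 9]]
kernel = [[1, 0],
          [0, -1]]
activated = relu(conv2d_valid(image, kernel, bias=0.5))
```

In a trained model the filter values and bias would be learned; here they are fixed by hand only to make the arithmetic easy to follow.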

4) POOLING LAYER
The pooling process at the layer aims to reduce the spatial size of the representation, reduce computations, and prevent overfitting. In this study, the pooling used is max-pooling [30], with the filter size of each proposed model being 2 × 2.
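A 2 × 2 max-pooling step of the kind described above can be sketched as follows. The stride of 2 is an assumption (the text gives only the filter size), and the function name and toy feature map are hypothetical.

```python
# Minimal sketch of 2 x 2 max-pooling on a single feature map.
# Assumption: stride 2 (non-overlapping windows); a trailing odd row/column is dropped.


def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2 x 2 window."""
    pooled = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[0]) - 1, 2):
            window = (
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            row.append(max(window))
        pooled.append(row)
    return pooled


# Toy 4 x 4 feature map, for illustration only.
feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]]
pooled = max_pool_2x2(feature_map)
```

Under this assumption each pooling layer halves both spatial dimensions, so a 512 × 512 map would become 256 × 256.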

5) FULLY-CONNECTED LAYER
The fully-connected layer follows the convolutional and max-pooling layers. In this layer, the weights and biases connecting to the previous layer are updated during training, which reduces the loss of feature information.
The feature matrix from the preceding layer is converted into a feature vector (flatten) before the classification process. In this study, the proposed CNN architectures have different fully-connected layers. msCNN$_1$ and msCNN$_2$ have fully-connected layers in which all feature vectors (flatten) are connected to the output layer, with a dropout of 0.5 (50%). Meanwhile, msCNN$_3$, msCNN$_4$, and msCNN$_5$ all have a fully-connected layer 1 with a dropout of 0.5 and a fully-connected layer 2, which is fully connected to the output layer. The number of neurons in the hidden layer of the msCNN$_3$ architecture is 32 with the ReLU activation function, while msCNN$_4$ and msCNN$_5$ have 64 neurons with the same activation function. In this study, dropout is added to the fully-connected layers to prevent overfitting.

6) OUTPUT (CLASSIFICATION) LAYER
After the fully-connected layer, the results are forwarded to the output (classification) layer, which produces the classification results, accuracy, and loss. The loss function used in each proposed model is binary cross-entropy, while the activation function for classification is softmax. The softmax function of each proposed model can be written as in the following equation:

$$y_{ik} = \frac{\exp(Z_{ik})}{\sum_{c=1}^{C} \exp(Z_{ic})} \qquad (3)$$

where $y_{ik}$ is the softmax output of the msCNN$_i$ model for the $k$th class, $Z_i$ is the output of the fully-connected layer of the msCNN$_i$ model (with $Z_{ik}$ its component for the $k$th class), and $C$ is the number of classes (labels).
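The softmax output for one model can be sketched in a few lines. Subtracting the largest score before exponentiating is a standard numerical-stability step that leaves the result unchanged; it is an implementation choice made here, not something taken from the text. The function name and the example scores are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # stability trick; does not change the result
    exps = [math.exp(z - shift) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the two classes (epilepsy, non-epilepsy).
probabilities = softmax([2.0, 0.0])
```

With two classes, the two outputs are complementary, so the larger raw score always receives a probability above 0.5.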
In this study, the number of classes in training and testing is two (epilepsy, non-epilepsy).
In addition to the CNN architectures proposed in this study, we used three CNN architectures from the literature. The three architectures were the CNN in [22], VGG16 [31], and ResNet50 [32], which we used as a comparison against the architectures proposed in this study. We transferred these architectures and trained them with the dataset used in the study. The CNN in [22] has a simple architecture consisting of three convolution layers with an output layer of size 2 (epilepsy and non-epilepsy). The VGG16 model has 16 weight layers arranged sequentially, consisting of 13 convolutional layers and three fully-connected layers. The input image dimensions of the original VGG16 architecture are 224 × 224 × 3, with a fully-connected output layer of size 1000. ResNet50 consists of 50 layers with five stages of the convolution process. The input and output layer dimensions of this architecture are the same as those of VGG16. In this study, we made some modifications to the two architectures. We kept the image input of these architectures the same as in the original architectures. Because we classified two classes (epilepsy and non-epilepsy), the size of the output layer was adjusted to two labels in both architectures. For the ResNet50 architecture, besides the modification of the output layer, a GlobalAveragePooling layer was also added before that layer.

D. ENSEMBLE CONVOLUTIONAL NEURAL NETWORKS
In this study, we used ensemble learning on the classification results of each proposed CNN model to improve performance and reduce the variability of the classification results. One type of ensemble learning is stacking, or stacked generalization, which includes two main parts, namely the base-learner and the meta-learner [15], [33]. In this study, the msCNN$_1$, msCNN$_2$, msCNN$_3$, msCNN$_4$, and msCNN$_5$ models are the base-learner models, while the support vector machine (SVM) is the meta-learner model. In our proposed scheme, between the base-learner and the meta-learner, there is an ensemble process that combines the predictions of the base-learner models. The process combines the prediction results of the base-learner models using majority voting, weighted average, and weighted majority voting [33]. The proposed scheme involving the combination of predictions is shown in Fig. 4.
The process of our proposed scheme begins with training each base-learner model to get $y_1$, $y_2$, $y_3$, $y_4$, and $y_5$ using (3). For classifying brain abnormalities causing epilepsy vs. non-epilepsy (binary classification), the output has two probability values. The predicted class of each base-learner model is the one with the largest probability value, which can be written as follows:

$$g_i = \arg\max_{k} y_{ik} \qquad (4)$$

where $g_i$ is the predicted result of the msCNN$_i$ model. In our proposed scheme, we combine the results of $g_1$, $g_2$, $g_3$, $g_4$, and $g_5$ with majority voting (MV), weighted majority voting (WMV), and weighted average (WA).
This study uses majority voting to obtain the prediction supported by most models. If the msCNN$_1$, msCNN$_2$, msCNN$_3$, msCNN$_4$, and msCNN$_5$ models are regarded as neurologists (experts), the final decision is the one supported by more than 50% of the votes. Let $v_{ik}$ be the vote of the $i$th model for the $k$th class, with $v_{ik} = 1$ if the prediction $g_i$ equals the $k$th class and $v_{ik} = 0$ otherwise. The total vote for each class is $V_k = \sum_{i=1}^{5} v_{ik}$, $k = 1, 2$, and the ensemble result is the class with the largest total vote, which can be written as follows:

$$h = \arg\max_{k} V_k \qquad (5)$$

The combination of predictions with weighted majority voting is obtained by multiplying each model's prediction by a weight. In this study, the weights are proportional to the validation accuracy of each base-learner model in its last epoch. If $a_i$ is the validation accuracy of the $i$th model in the last epoch, then the weight of each model is $\beta_i = a_i / \sum_{i=1}^{5} a_i$. For binary classification with five base-learner models, if $g_i = 0$ the weights used are $\delta_{1i} = \beta_i$ and $\delta_{2i} = 0$; otherwise $\delta_{1i} = 0$ and $\delta_{2i} = \beta_i$. The ensemble with weighted majority voting can then be written as follows:

$$\tilde{h} = \arg\max_{k} \sum_{i=1}^{5} \delta_{ki} \qquad (6)$$

The combination of predictions with the weighted average is obtained by averaging the softmax values $y_{ik}$. The prediction is the class with the largest average softmax value. Mathematically, the prediction result is written as follows:

$$\hat{h} = \arg\max_{k} \frac{1}{5} \sum_{i=1}^{5} y_{ik} \qquad (7)$$

The outputs of the base-learner CNN models and the combinations of predictions are the input to the training process in the meta-learner. We used SVM in the meta-learner stage for training and final classification.
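The classifier was chosen because it requires few assumptions about the input data and offers flexibility in the choice of kernel functions [34], [35]. Let $\hat{X} = \{g, y, h, \tilde{h}, \hat{h}\}$ be the input data of the SVM. For binary classification, the SVM uses a linear model as follows:

$$f(\hat{X}) = \omega^{T} \hat{X} + \alpha \qquad (8)$$

where $\omega$ and $\alpha$ are parameters. To use kernel functions in the training process, $\hat{X}$ is transformed by a function $\phi(\hat{X})$, called the feature-space mapping, so that the classification function becomes

$$f(\hat{X}) = \omega^{T} \phi(\hat{X}) + \alpha \qquad (9)$$

The geometric distance of a point $\hat{X}$ from the hyperplane is $|\omega^{T} \phi(\hat{X}) + \alpha| / \|\omega\|$. We want all data points to be correctly classified, so that $t_n(\omega^{T} \phi(\hat{X}_n) + \alpha) > 0$ for all $n$, where $t_n \in \{-1, 1\}$ is the target. Accordingly, the distance of point $\hat{X}_n$ to the decision surface is $t_n(\omega^{T} \phi(\hat{X}_n) + \alpha) / \|\omega\|$. Maximizing the minimum geometric distance is equivalent to solving

$$\arg\max_{\omega, \alpha} \left\{ \frac{1}{\|\omega\|} \min_{n} \left[ t_n \left( \omega^{T} \phi(\hat{X}_n) + \alpha \right) \right] \right\} \qquad (10)$$

With $\omega$ and $\alpha$ rescaled so that the closest point satisfies $t_n(\omega^{T} \phi(\hat{X}_n) + \alpha) = 1$, the optimization problem requires that we maximize $1/\|\omega\| = \|\omega\|^{-1}$, which is equivalent to minimizing $\|\omega\|^{2}$, and mathematically it can be written as follows:

$$\arg\min_{\omega, \alpha} \|\omega\|^{2} \quad \text{subject to} \quad t_n \left( \omega^{T} \phi(\hat{X}_n) + \alpha \right) \geq 1 \ \text{for all } n \qquad (11)$$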
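The three prediction combinations and the assembly of the meta-learner input can be sketched as below. Several points are assumptions made for illustration: classes are indexed 0 and 1, the weighted average is implemented as a plain mean of the softmax outputs (as the text describes it), the order in which the elements are concatenated into the feature vector is arbitrary, and all function names, probabilities, and validation accuracies are hypothetical rather than taken from the experiments.

```python
# Minimal sketch of MV, WMV, and WA over five base-learner outputs,
# followed by assembly of a meta-learner feature vector.


def predict_class(probabilities):
    """Index of the class with the largest probability."""
    return max(range(len(probabilities)), key=lambda k: probabilities[k])


def majority_vote(predictions, num_classes=2):
    votes = [0] * num_classes
    for g in predictions:
        votes[g] += 1
    return max(range(num_classes), key=lambda k: votes[k])


def validation_weights(val_accuracies):
    """Weights proportional to each model's validation accuracy."""
    total = sum(val_accuracies)
    return [a / total for a in val_accuracies]


def weighted_majority_vote(predictions, weights, num_classes=2):
    totals = [0.0] * num_classes
    for g, beta in zip(predictions, weights):
        totals[g] += beta  # a model's weight counts only for its predicted class
    return max(range(num_classes), key=lambda k: totals[k])


def averaged_softmax(all_probabilities, num_classes=2):
    n = len(all_probabilities)
    means = [sum(y[k] for y in all_probabilities) / n for k in range(num_classes)]
    return max(range(num_classes), key=lambda k: means[k])


def meta_features(all_probabilities, val_accuracies):
    """Concatenate predictions, softmax outputs, and the three combinations."""
    g = [predict_class(y) for y in all_probabilities]
    beta = validation_weights(val_accuracies)
    combos = [majority_vote(g), weighted_majority_vote(g, beta),
              averaged_softmax(all_probabilities)]
    flat_y = [p for y in all_probabilities for p in y]
    return g + flat_y + combos


# Hypothetical softmax outputs of five models for one image.
outputs = [[0.3, 0.7], [0.6, 0.4], [0.2, 0.8], [0.55, 0.45], [0.1, 0.9]]
accuracies = [0.75, 0.72, 0.77, 0.74, 0.76]  # hypothetical validation accuracies
features = meta_features(outputs, accuracies)
```

A vector like `features`, computed for every training image, could then be passed to an SVM implementation (for example, scikit-learn's `SVC` with a polynomial kernel) for the final classification.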

E. CLASSIFICATION RESULT EVALUATION
To evaluate the classification results, we adopted several measurement indicators: accuracy (AC), precision (PR), sensitivity (SE), and F1-score (F1) [36]. The measurements are determined from the values of true positive (TP), false positive (FP), true negative (TN), and false negative (FN). In this study, TP is the number of times epilepsy data are labeled as epilepsy by the classifier. FP is the number of times non-epilepsy data are labeled as epilepsy. TN is the number of times non-epilepsy data are labeled as non-epilepsy. FN is the number of times epilepsy data are labeled as non-epilepsy. The AC, PR, SE, and F1 values calculated from these parameters are defined mathematically in (12)-(15).
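The four indicators can be computed from the confusion counts as in the sketch below, which uses the standard definitions of accuracy, precision, sensitivity (recall), and F1-score. The function name and the counts in the example are hypothetical and are not results from this study.

```python
# Minimal sketch of the evaluation indicators using their standard definitions.


def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, sensitivity, and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + sensitivity
    f1 = 2 * precision * sensitivity / denominator if denominator else 0.0
    return {"AC": accuracy, "PR": precision, "SE": sensitivity, "F1": f1}


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=80, fp=10, tn=30, fn=20)
```

Because the epilepsy class is treated as positive, sensitivity here measures how many epilepsy frames are recovered, which is the quantity the study prioritizes.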

A. EXPERIMENTS
This study involved 64 subjects in total (37 epilepsy and 27 non-epilepsy). We divided them into 45 subjects (25 epilepsy and 20 non-epilepsy) for training and the remaining 19 (12 epilepsy and seven non-epilepsy) for testing, as shown in Table 1. We used stratified 5-fold cross-validation [37] to evaluate each method in the classification of epilepsy, with the number of frames for training and validation in each fold and for testing shown in Table 2. In this study, correct classification of the "epilepsy" class takes precedence because of its urgency. Therefore, the number of epilepsy frames for training and testing is larger than that of non-epilepsy frames. Based on this consideration, each method was evaluated using (12)-(15). So that the evaluation results of the methods are comparable, the training process uses the same index file for every method. All evaluations in each experimental scenario were implemented in Google Colaboratory.
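Stratified k-fold splitting keeps the class proportions roughly equal in every fold. The sketch below shows one simple way to do this. It is not the splitting code used in the study: the function name and the seed are hypothetical, and the text does not specify whether folds were formed over subjects or over frames, so the 25/20 training-subject counts serve only as a toy input.

```python
import random


def stratified_k_fold(labels, k=5, seed=0):
    """Assign sample indices to k folds, spreading each class evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal indices out round-robin
    return folds


# Toy labels: 25 epilepsy (1) and 20 non-epilepsy (0) training items.
labels = [1] * 25 + [0] * 20
folds = stratified_k_fold(labels, k=5)

# Each fold serves once as the validation set; the other four form the training set.
splits = [
    (sorted(i for j, fold in enumerate(folds) if j != v for i in fold), sorted(folds[v]))
    for v in range(5)
]
```

With these toy counts every fold holds five epilepsy and four non-epilepsy items, so each validation set mirrors the overall class ratio.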
The main stages of the proposed method in training refer to the proposed scheme as shown in Fig. 4, while the process steps at the meta-learner refer to Algorithms 1. Training of the base-learner model is carried out on each CNN model with the same input axial multi-sequence of MR images. The input shape in each scenario for the base-learner model is 512 × 512 × 1, as shown in Table 3. The training process for the base-learner model refers to the CNN architecture in Fig. 3. All training in each fold used the Adam optimizer because it is relatively consistent [38]. The default learning rate for training in each base-learner model is 0.001, while the batch size and epoch are 16 and {50, 100, 150}. Algorithm 1 shows the steps of training with SVM at the meta-learner stage. We used SVM to train the dataset on each fold with several kernel functions, including linear, RBF, and polynomial (several degrees). Furthermore, we selected the best results from these experiments. Besides the training using the proposed model, we also conducted a training using the existing CNN models, including CNN in [22], VGG16, and Resnet50. Training with this model was also carried out in each scenario with an input shape of 224×224×3, except for CNN in [22] with an input shape of 512×512×3. In this case, we used the stochastic gradient descent (SGD) optimizer for these models in the training process, while the learning rate used was 0.0001 (VGG16 and Resnet50) and 0.001 (CNN in [22]). We found that the optimizer and learning rate were suitable for these models in the pre-testing. y pred(ijk) ← the probabilities of each class (k) and X test using M ij 8 V test(jk) ← the total of voting of each class (k) and X test based on ε test(j) 9 h pred(j) ← argmax (V test(jk) ) 10 δ test(ijk) ← the weight of voting of each class (k) and X test based on β ij and g pred(ij) In this study, we saved each MR image in Portable Network Graphics (PNG) type with a resolution of 512 × 512 pixels. 
The saved PNG images have four channels (RGBA). To obtain the three-channel input shape of 512 × 512 × 3, we converted RGBA to RGB; to obtain the one-channel input shape of 512 × 512 × 1, we further converted RGB to grayscale. To obtain an input shape with a different resolution (e.g., from 512 × 512 × 3 to 224 × 224 × 3), we used nearest-neighbor interpolation. We applied this resizing to each MR image and sequence of MR images used for training and testing.
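The channel conversion and resizing described above can be sketched in plain Python. This is only an illustration: the paper does not state which image library was used, so the helper names, the fully opaque alpha assumption, and the BT.601 grayscale weights are all assumptions made for this example.

```python
def rgba_to_rgb(pixel):
    """Drop the alpha channel (assumes fully opaque pixels)."""
    r, g, b, _alpha = pixel
    return (r, g, b)


def rgb_to_gray(pixel):
    """Luma-weighted grayscale (ITU-R BT.601 weights, an assumed choice)."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def resize_nearest(image, new_h, new_w):
    """Nearest-neighbor resize of an image stored as rows of pixels."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[(y * old_h) // new_h][(x * old_w) // new_w] for x in range(new_w)]
        for y in range(new_h)
    ]


# Tiny 4 x 4 RGBA image standing in for a 512 x 512 MR frame.
rgba = [[(x * 60, y * 60, 0, 255) for x in range(4)] for y in range(4)]
rgb = [[rgba_to_rgb(p) for p in row] for row in rgba]    # H x W x 3
gray = [[rgb_to_gray(p) for p in row] for row in rgb]    # H x W x 1
small = resize_nearest(rgb, 2, 2)                        # e.g. 512 -> 224
```

Nearest-neighbor interpolation copies existing pixel values instead of blending them, so no new intensity values are introduced by the resize.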
In the testing phase, we tested all methods on the test samples with the same dataset treatment. The test steps are shown in Algorithm 2. The model parameters trained in each fold were then used to classify all frames (images) of the same test dataset. Classification performance was obtained by averaging the classification results over all folds. Testing was carried out to compare the average performance of the proposed method with that of the other methods.
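The per-fold averaging can be sketched as follows. The fold scores are hypothetical placeholders, and the use of the population standard deviation is an assumption, since the paper does not state which estimator was used.

```python
from statistics import mean, pstdev


def summarize_folds(fold_scores):
    """Mean and standard deviation of one metric over the cross-validation folds.

    Assumption: population standard deviation (pstdev); the sample estimator
    (statistics.stdev) would give a slightly larger value.
    """
    return mean(fold_scores), pstdev(fold_scores)


# Hypothetical test accuracies of the five fold models on the same test set.
fold_accuracy = [0.80, 0.82, 0.84, 0.86, 0.88]
avg, sd = summarize_folds(fold_accuracy)
print(f"accuracy = {avg:.4f} +/- {sd:.4f}")
```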

B. EXPERIMENTAL RESULTS
In this section, we report the experimental results of our proposed method, including its constituent methods. The reported results cover the performance of the methods at the base-learner stage, the combination of predictions, and the meta-learner. All methods were tested in each testing scenario, as shown in Tables 4-8.
In the first scenario (epoch = 50), the CNN models on the base-learner yielded an average classification accuracy of 71.64%-77.43% with a standard deviation of 2.10-5.94. The ensemble of the base-learner CNN models using the combination of predictions (MV, WA, and WMV) obtained an average classification accuracy of 80.38%-80.53% with a standard deviation of 1.88-2.10. The combination of predictions in this scenario thus achieved better classification accuracy than all base-learner models and reduced the variability of the classification accuracy. However, the proposed emsCNN-SVM yielded a still higher average classification accuracy. The SVM with a polynomial kernel of degree 50 on the meta-learner improved accuracy by 5.33%-11.11% over the CNN models on the base-learner and by 2.23%-2.37% over the combination of predictions. In general, the proposed emsCNN-SVM showed a relatively smaller deviation of the classification accuracy across folds than the other methods, as shown in Table 4. The proposed method also yielded a better average sensitivity and F1-score than the others, even though its classification precision was lower than that of the combination of predictions.
In the scenario with epoch = 100, the base-learner models yielded an average accuracy of 68.18%-79.67% with a standard deviation of 2.92-7.28. The combination of predictions obtained an average accuracy of 83.72%-83.83% with a standard deviation of 1.94-2.10. These results show that combining predictions using MV, WA, and WMV gave better results than the base-learner models, but the proposed emsCNN-SVM yielded the best results. The SVM with the polynomial kernel (degree = 25) in the proposed emsCNN-SVM improved accuracy by 6.70%-18.19% over the base-learner models and by 2.54%-2.65% over the combination of predictions. Based on the standard deviation of the classification accuracy, the proposed emsCNN-SVM also showed relatively lower variability than the other methods.
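The improvement ranges at epoch = 100 are consistent with absolute differences between average accuracies. The short check below assumes that the 86.37% average accuracy reported in the abstract is the emsCNN-SVM result for this scenario (the abstract quotes the same improvement ranges); the function name is hypothetical.

```python
proposed = 86.37                     # emsCNN-SVM average accuracy (%) from the abstract
base_learner_range = (68.18, 79.67)  # base-learner averages at epoch = 100 (%)
combination_range = (83.72, 83.83)   # MV / WA / WMV averages at epoch = 100 (%)


def improvement_range(proposed_acc, low, high):
    """Smallest and largest absolute gain over a range of average accuracies."""
    return round(proposed_acc - high, 2), round(proposed_acc - low, 2)


print(improvement_range(proposed, *base_learner_range))  # (6.7, 18.19)
print(improvement_range(proposed, *combination_range))   # (2.54, 2.65)
```

Read this way, the reported improvements are differences in percentage points rather than relative gains.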
In terms of classification sensitivity, the ensemble using our proposed emsCNN-SVM obtained the best average. The ensemble of base-learner models using the combination of predictions, in turn, yielded a better average sensitivity than the individual base-learner models. The proposed emsCNN-SVM improved the average classification sensitivity by 9.68%-28.45% over the base-learner models and by 7.24%-7.41% over all combinations of predictions. In general, this method also showed lower variability in classification sensitivity than the others.
In terms of precision in the epilepsy classification in this scenario, emsCNN-SVM with the polynomial kernel (degree = 25) obtained a lower precision than the combination of predictions (MV, WA, and WMV). MV, WA, and WMV yielded the highest average classification precision with the lowest variability. Nevertheless, the proposed emsCNN-SVM in general yielded a better average classification precision than the base-learner models, with lower variability. This method also obtained the highest F1-score and improved the average F1-score by 5.35%-16.75% over the base-learner models and by 2.35%-2.45% over the combination of predictions.
In the experimental scenario with epoch = 150, the proposed emsCNN-SVM in general still presented a better average classification performance than the CNN models on the base-learner and the combination of predictions. Although the average classification performance of the proposed emsCNN-SVM was better at epoch = 100 than at epoch = 150, at epoch = 150 it produced lower variability than in the other scenarios. In this scenario, the CNN models on the base-learner also showed lower variability in classification accuracy than in the other scenarios, with a standard deviation of 1.63-4.20. The same results are also shown for sensitivity and F1-score.
The testing results with the CNN in [22], VGG16, and ResNet50 for each scenario can be seen in Table 8. The results of testing at epoch = 50, 100, and 150 with 5-fold cross-validation showed that VGG16 obtained a better average accuracy and precision than ResNet50, but still lower than the CNN in [22]. In contrast, our proposed emsCNN-SVM and emsCNN-SVM* yielded a better average accuracy than the others. At epoch = 50, emsCNN-SVM improved the average classification accuracy by 7.67% over the CNN in [22], 10.61% over VGG16, and 14.48% over ResNet50. At epoch = (100, 150), our proposed emsCNN-SVM improved accuracy by (12.41%, 8.82%) over the CNN in [22], (12.66%, 10.52%) over VGG16, and (16.97%, 14.66%) over ResNet50. Our proposed emsCNN-SVM also obtained the best average sensitivity and F1-score in classification.

V. DISCUSSION
In this section, we investigate the performance of the CNN models on the base-learner, the ensemble of the base-learner models with a combination of predictions (MV, WA, and WMV), and the meta-learner. At the meta-learner stage, we investigate the ensemble of the CNN models on the base-learner by meta-training with SVM for the dichotomous classification of axial multi-sequence brain MR images as epilepsy vs. non-epilepsy. In addition, we compare some existing CNN models with our proposed emsCNN-SVM.
The testing results showed that the CNN models on the base-learner obtained classification performance with high variation. They yielded an average classification accuracy in testing in the range of 68.18%-79.67% with a standard deviation of 1.63-7.28. In terms of the number of parameters, msCNN 2 has more than the other base-learner models, as shown in Table 3. However, a larger number of model parameters does not guarantee proportionally better classification performance, especially on the axial multi-sequence of MR images. The average classification accuracy of each CNN model on the base-learner is still below 80%. The variability of the training and testing data is relatively high because the data involve a multi-sequence of MR images, which affects the performance.
If each CNN model on the base-learner is likened to a neurologist who reads the axial multi-sequence of MR images, then the reading of each neurologist may give a different result. Using the combination of predictions with majority voting, weighted majority voting, and weighted average can increase the accuracy of epilepsy classification and reduce the variability of the classification results. However, the increase saturates at a certain level and is difficult to raise further because it depends entirely on the class predictions of the base-learner models. The meta-learner stage in the proposed emsCNN-SVM is one solution for improving classification accuracy beyond that of the combination of predictions. The improvement arises because the meta-learner stage does not only depend on the results of the base-learner models but also involves meta-learning using SVM. This learning involves both the prediction results of the base-learner models and the results of the combination of predictions. Because the proposed emsCNN-SVM accommodates the output of the CNN models on the base-learner and the combination of predictions (MV, WA, and WMV), it yielded better and more stable performance in each scenario.
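The three combination rules can be sketched for a single MR frame as follows. This is a simplified illustration, not the paper's exact formulation: the model weights, the tie-breaking rule, and the example softmax outputs are hypothetical.

```python
def majority_vote(labels):
    """MV: class with the most votes (ties go to the lowest class index)."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return max(sorted(counts), key=lambda c: counts[c])


def weighted_majority_vote(labels, weights):
    """WMV: each model's vote counts with that model's weight."""
    scores = {}
    for label, w in zip(labels, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(sorted(scores), key=lambda c: scores[c])


def weighted_average(probabilities, weights):
    """WA: class with the highest weight-averaged softmax probability."""
    n_classes = len(probabilities[0])
    total = sum(weights)
    avg = [
        sum(w * p[k] for p, w in zip(probabilities, weights)) / total
        for k in range(n_classes)
    ]
    return max(range(n_classes), key=lambda k: avg[k])


# Hypothetical softmax outputs [non-epilepsy, epilepsy] of five base-learners.
probs = [[0.60, 0.40], [0.55, 0.45], [0.20, 0.80], [0.52, 0.48], [0.10, 0.90]]
labels = [max(range(2), key=lambda k: p[k]) for p in probs]  # per-model class
weights = [0.5, 0.5, 0.9, 0.5, 0.9]                          # assumed model weights

mv = majority_vote(labels)
wmv = weighted_majority_vote(labels, weights)
wa = weighted_average(probs, weights)
print(mv, wmv, wa)
```

In this example the unweighted vote and the weighted rules disagree, which illustrates why the combined outputs carry information beyond any single rule and can be useful as additional meta-learner inputs.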
We realize that the best results of our proposed scheme with SVM do not hold for all kernels used in training. At epoch = 50, 100, and 150, the kernel functions that give better results than the others (e.g., RBF and linear) are polynomials of degree d = 50, 25, and 10, respectively, as shown in Fig. 8. In this study, the degree of the polynomial function is chosen according to the best average classification accuracy, as shown in Fig. 5. In general, the greater the degree of the polynomial function, the higher the sensitivity but the lower the precision. Selecting the polynomial kernel degree by maximum sensitivity would therefore result in low precision. For this reason, the best choice in the proposed emsCNN-SVM is to select the polynomial kernel degree with the highest accuracy. This choice indirectly takes precision, sensitivity, and F1-score into account.
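The selection criterion described above (choose the degree with the best average accuracy over the folds) can be sketched as follows. The per-fold accuracies are placeholder numbers for illustration only and are not results from this study; the function name is hypothetical.

```python
from statistics import mean


def select_degree(fold_accuracy_by_degree):
    """Return the polynomial degree with the highest mean accuracy over folds.

    Ties are resolved in favor of the smaller degree.
    """
    return max(
        sorted(fold_accuracy_by_degree),
        key=lambda d: mean(fold_accuracy_by_degree[d]),
    )


# Placeholder accuracies per fold; NOT values from the paper.
scores = {
    2: [0.60, 0.62, 0.61],
    3: [0.70, 0.68, 0.72],
    5: [0.66, 0.69, 0.65],
}
best_degree = select_degree(scores)
print(best_degree)
```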
The number of models in the ensemble also influences the performance of the proposed emsCNN-SVM in classifying epilepsy vs. non-epilepsy. Involving five CNN models on the base-learner gives better classification performance than using only three. Fig. 6 shows that the accuracy and F1-score for each fold are better with five base-learner models than with only three. The inputs used in meta-learning also affect classification performance. The proposed emsCNN-SVM with three kinds of input (the predictions of the base-learner models, the combination of predictions, and the softmax output of the base-learner models) provides better classification accuracy than using only one or two kinds of input, as shown in Fig. 7.
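One way to assemble the three kinds of meta-learner input into a single feature vector per MR frame is sketched below. The ordering of the features and the function name are assumptions made for illustration; the paper's exact feature layout is defined by its algorithms.

```python
def build_meta_features(softmax_outputs, combined_predictions):
    """Concatenate the three kinds of meta-learner input for one MR frame.

    softmax_outputs: one [p_non_epilepsy, p_epilepsy] list per base-learner.
    combined_predictions: labels from the combination rules (e.g. MV, WMV, WA).
    """
    # 1) class predicted by each base-learner (argmax of its softmax output)
    base_predictions = [
        max(range(len(p)), key=lambda k: p[k]) for p in softmax_outputs
    ]
    # 2) labels produced by the combination of predictions
    combined = list(combined_predictions)
    # 3) raw softmax outputs of all base-learners, flattened
    flat_softmax = [value for p in softmax_outputs for value in p]
    return base_predictions + combined + flat_softmax


# Hypothetical outputs of five base-learners and three combination rules.
frame_probs = [[0.60, 0.40], [0.55, 0.45], [0.20, 0.80], [0.52, 0.48], [0.10, 0.90]]
features = build_meta_features(frame_probs, combined_predictions=[0, 1, 1])
print(len(features))  # 5 predictions + 3 combined labels + 10 probabilities = 18
```

With five base-learners and two classes, this layout gives 18 features per frame, which would then be passed to the SVM at the meta-learner stage.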
To assess the performance and stability of our proposed method, we also compared its results with those of the existing models: the CNN in [22], VGG16, and ResNet50. Testing with the same dataset treatment showed that our proposed method improved the classification performance, as shown in Table 8. We realize that the input image dimensions differ in these tests, which affects the performance [39]. The proposed scheme has an input image dimension of 512 × 512 × 1, while VGG16 and ResNet50 both use 224 × 224 × 3 [31], [40]. We consider the comparison of these methods to be fair, even though our proposed method has a different input shape, because we tried to keep the original architecture of VGG16 and ResNet50. However, to make the comparison fairer, we added tests with an input resolution of 224 × 224 × 3 for each proposed CNN model. In this study, we adjusted the input to the existing architecture of each model. Although the conditions of the comparison are still far from ideal, our proposed emsCNN-SVM can at least reasonably be compared with these models, especially in the classification of brain structural abnormalities that cause epilepsy vs. non-epilepsy.
Our study has several limitations, including the relatively small number of MR image sequences used in training and testing. For clinical use, validation must be carried out on more data from multiple institutions. In addition, involving a multi-sequence of MR images and different types of brain abnormalities within the epilepsy class has the potential to reduce the classifier's performance. Finally, using only the axial plane may yield lower performance than also involving the other planes (sagittal and coronal).
This study uses only five CNN models on the base-learner. We understand that more CNN models on the base-learner would enrich the decisions and strengthen the results of the combination of predictions and of the meta-learner. However, more models on the base-learner also increase the total number of model parameters. Therefore, we decided to use five base-learner models for the ensemble, which gave better results than three base-learner models.

VI. CONCLUSION
In this study, we proposed a method to improve the classification of epilepsy based on axial multi-sequence MR images with an ensemble of several CNN models. The ensemble applies the principle of stacked generalization. The outputs of the CNN models on the base-learner and of the combination of predictions (majority voting, weighted average, and weighted majority voting) are forwarded to the SVM in the meta-learner stage. The proposed scheme generally improves performance in classifying brain structural abnormalities causing epilepsy vs. non-epilepsy. The testing results show that the proposed scheme has high potential to assist neurologists (clinicians) in identifying epilepsy patients based on multi-sequence MR images.
For clinical purposes, in the future, there is still potential to improve the performance of epilepsy classification based on multi-sequence of MR images by increasing the amount of training or testing data and involving all planes of MR images.