G2-ResNeXt: A Novel Model for ECG Signal Classification

Electrocardiograms (ECGs) are the primary basis for the diagnosis of cardiovascular diseases. However, due to the large volume of patients’ ECG data, manual diagnosis is time-consuming and laborious. Therefore, intelligent automatic ECG signal classification is an important technique for overcoming the shortage of medical resources. This paper proposes a novel model for inter-patient heartbeat classification, named G2-ResNeXt, which adds a two-fold grouping convolution (G2) to the original ResNeXt structure, so as to achieve better automatic feature extraction and classification of ECG signals. Experiments conducted on the MIT-BIH arrhythmia database confirm that the proposed model outperforms all state-of-the-art models considered (except the GRNN model for one of the heartbeat classes), achieving an overall accuracy of 96.16%, a sensitivity and precision of 97.09% and 95.90%, respectively, for the ventricular ectopic heartbeats (VEB), and of 80.59% and 82.26%, respectively, for the supraventricular ectopic heartbeats (SVEB).


I. INTRODUCTION
Cardiovascular disease (CVD) is a chronic disease of aging with a high mortality rate. As reported in the "Top Ten Causes of Death" issued by the World Health Organization (WHO) [1], CVD is the No. 1 killer in the world, as more people die from it annually than from any other cause. By 2030, 23.6 million people are expected to die from CVD. The electrocardiogram (ECG) is a non-invasive diagnostic tool for cardiac pathology, which plays an important role in the classification of CVDs. However, the timely and accurate detection of abnormal heartbeat signals in patients' ECGs has become a major difficulty in the medical field.
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Wei.
As a main method for diagnosing arrhythmia, the ECG can objectively reflect the physiological and working conditions of all parts of the heart, which is of great significance for the detection of CVDs. The typical basic waveform of an ECG mainly includes a P wave, a QRS wave, and a T wave, as shown in Figure 1. The P wave represents the depolarization process of the two atria, the QRS wave represents the depolarization process of the two ventricles, and the T wave represents the repolarization process of the two ventricles [2].
Usually, arrhythmia is divided into two types, i.e., fatal arrhythmia, and non-fatal arrhythmia. The fatal arrhythmia needs to be immediately treated, otherwise it can be life-threatening. For the non-fatal arrhythmia, relevant examination and treatment are also required, and ECGs should be regularly conducted. Basically, early detection and treatment of arrhythmia can tackle the problem of sudden death [3]. Arrhythmia is usually caused by an irregular heartbeat, which can be found from the interval and amplitude of ECG signals, meaning that the shapes and other morphological characteristics of ECG signals determine the type of arrhythmia [4]. In the early clinical practice, doctors analyzed ECGs by visual evaluation and, based on their experience, identified the characteristics of ECG signals and provided diagnostic results. However, the strong non-linearity, non-stationarity, and randomness of ECG signals make the classification of arrhythmia in ECG signals a very difficult task.
This paper proposes an improved version of the ResNeXt model [5] for ECG signal classification, called G2-ResNeXt, which adds a two-fold grouping convolution (G2) to the original ResNeXt structure, along with a modified convolutional block attention module (CBAM), allowing the model to focus more on the changes in the characteristics of the ECG signals and, as a result, to achieve better automatic feature extraction and classification of ECG signals. According to the results obtained by experiments, conducted on the MIT-BIH arrhythmia database, the proposed model outperforms all state-of-the-art models considered, according to all evaluation metrics used, except the GRNN model which achieves better sensitivity and precision in classifying heartbeats of class S of the Association for the Advancement of Medical Instrumentation (AAMI) standard [6].
The remainder of the paper is structured as follows. Sections II and III present the necessary background of ECG signals and relevant neural networks, respectively. Section IV presents the related work done in the field of applying artificial intelligence to ECG signal detection and classification. Section V describes the proposed G2-ResNeXt model. Section VI presents the experiments conducted to evaluate the performance of the proposed model against other state-of-the-art models, together with the results obtained. Finally, Section VII concludes the paper.

II. BACKGROUND OF ECG SIGNALS
A. MIT-BIH ARRHYTHMIA DATABASE
The MIT-BIH arrhythmia database [7], provided by the Massachusetts Institute of Technology and Boston's Beth Israel Hospital, is among the most internationally renowned and commonly used sources of clinical ECG signals, along with the AHA database [8], provided by the American Heart Association, and the European ST-T ECG database [9]. The MIT-BIH arrhythmia database contains 48 half-hour two-channel ambulatory ECG recordings obtained from 47 subjects with a resolution of 11 bits and a range of 10 mV. The ECG recordings of 25 male subjects (aged 32-89) and 22 female subjects (aged 23-89) are included in the database, and 60% of the recordings were obtained from inpatients [7]. Each record is composed of three files: a header file, a data file, and an annotation file. The annotations were produced by two experienced cardiologists, who made around 110,000 computer-readable reference annotations. In most recordings, the first lead is of type MLII (obtained by placing the electrode on the chest) and the second lead is usually V1 (occasionally V2 or V5, and in one instance V4). The MLII leads were chosen for use in the experiments presented further in this paper, because these were the most numerous. Fifteen heartbeat types are mapped to the five main classes of the AAMI standard, as shown in Table 1. As unknown beats (class Q) are difficult to identify, the existing mainstream models do not distinguish class-Q data. Similarly, in this paper, only data of classes N, S, V, and F are distinguished, with class-Q data neglected.
VOLUME 11, 2023 34809 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

B. INTRA-PATIENT VS. INTER-PATIENT CLASSIFICATION
The classification of heartbeat signals has been the subject of many studies. Generally, two classification methods can be used, namely the intra-patient method and the inter-patient method [10], as depicted in Figure 2.
The inter-patient classification method treats all ECG data of each patient as a whole. When splitting the dataset into a training set, a validation set, and a test set, it ensures that the records of the same patient do not appear in more than one set. The resulting model has a strong generalization ability and can be applied to real-life situations. The intra-patient classification method mixes all the data in the dataset together and then splits these into a training set, a validation set, and a test set [11], [12]. As a result, the model is trained on part of the ECG data of a patient and then tested on the rest of the ECG data of the same patient. This yields a higher measured performance, but the evaluation is biased: if the classification model has already seen ECG data of a patient during training, then it is likely to have acquired the characteristics of that patient. Therefore, being more realistic and meaningful, the inter-patient method was used in the experiments described further in this paper. More specifically, the MIT-BIH records of 44 patients were used as follows: the records of 17 patients for the training set, the records of 5 patients for the validation set, and the records of 22 patients for the test set (Table 2).
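As an illustration of the record-level split described above, the following minimal sketch partitions a list of record identifiers into three disjoint sets of 17, 5, and 22 records. The identifiers are hypothetical placeholders and the seeded shuffle is an assumption made only for illustration; the actual assignment of records to sets is the one given in Table 2.

```python
import random


def inter_patient_split(record_ids, n_train, n_val, n_test, seed=0):
    """Split record IDs so that no record appears in more than one set."""
    if n_train + n_val + n_test != len(record_ids):
        raise ValueError("set sizes must add up to the number of records")
    ids = sorted(record_ids)
    # Assumption: a seeded shuffle stands in for the fixed split of Table 2.
    random.Random(seed).shuffle(ids)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


# Hypothetical placeholders, one per patient record (not real MIT-BIH IDs).
records = [f"rec{i:02d}" for i in range(44)]
train, val, test = inter_patient_split(records, 17, 5, 22)
print(len(train), len(val), len(test))
```

Because the split is done on records before any slicing, every slice derived from a given record ends up in exactly one of the three sets.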

C. ECG SIGNAL SLICING
At present, most research on heartbeat classification depends on the accuracy of QRS wave detection. As the characteristics of the human body and the form and parameters of a single signal are variable, the positioning results of different QRS wave detection algorithms may differ even at the same accuracy [13]. For example, some QRS algorithms place the position slightly to the left, while others place it slightly to the right, resulting in an obvious deviation in the final positioning results even if the same data are processed. As a result, the waveforms of the acquired heartbeats can vary, so that the later learning process of the neural network depends on the QRS localization algorithm, making the network somewhat coupled to the QRS wave detection algorithm.
Instead of using QRS wave detection, an ECG signal slicing method is used in the research presented in this paper, whereby the location and meaning of useful information are learned by the neural network itself. This frees the network from coupling with the QRS wave detection algorithm and makes the signal diagnosis a simpler and more general process that can be used more widely in practice. In the experiments reported further in this paper, 1080-length slices were obtained by capturing three seconds of an ECG signal at a sampling rate of 360 Hz. For the classification of slices, the following rules (illustrated in Table 3) were applied:
1) When only normal heartbeats are present in a slice, the slice is defined as belonging to the normal class (N).
2) When only one abnormal heartbeat is present in a slice, and all other heartbeats are normal, the slice is defined as belonging to the class of the abnormal heartbeat (i.e., V, S, or F).
3) When abnormal heartbeats of several different classes exist in a slice, the most represented abnormal class (i.e., the one with the largest quantity) defines the class of the slice.
4) When several abnormal classes share the same highest quantity in a slice, the class of the first appearing abnormal heartbeat defines the class of the slice.
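The four labelling rules can be sketched as a small function. This is an illustrative reading of the rules, not the authors' implementation: the function name and the input convention (a list of AAMI class letters in order of appearance, with class-Q beats already removed) are assumptions, and a slice containing several abnormal beats of a single class is assigned that class.

```python
from collections import Counter


def label_slice(beat_classes):
    """Return the slice class from the AAMI classes of its beats.

    beat_classes: class letters ("N", "S", "V", "F") in order of appearance.
    Assumption: class-Q beats have been removed beforehand.
    """
    abnormal = [c for c in beat_classes if c != "N"]
    if not abnormal:
        return "N"  # rule 1: only normal beats
    counts = Counter(abnormal)
    top = max(counts.values())
    # Rules 2-4: the most frequent abnormal class wins, and a tie is
    # resolved in favour of the class whose beat appears first.
    for c in abnormal:
        if counts[c] == top:
            return c


print(label_slice(["N", "V", "N"]))       # rule 2
print(label_slice(["S", "V", "V", "N"]))  # rule 3
print(label_slice(["S", "V", "N"]))       # rule 4
```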
In addition, a slice overlapping method is used in the research presented in this paper to alleviate the class imbalance of the AAMI heartbeat data contained in the MIT-BIH arrhythmia database. This imbalance may have a severe effect on the model training process and is likely to invalidate the neural network learning. To alleviate the problem, a stacking of slices was utilized in the experiments described further in this paper, which collects more samples for a class with a small sample size by increasing the overlap between its adjacent slices. The new samples obtained in this way differ (to some degree) from the original samples and are (theoretically) of better quality. By contrast, simple oversampling also generates new samples for the minority classes to participate in training, but these may be identical copies of existing samples, which makes the model prone to overfitting. The slice-and-stack approach is therefore more effective.
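The effect of overlapping can be illustrated by counting slice start positions for two strides. This is a minimal sketch: the 30-second segment and the stride of 270 samples (75% overlap) are hypothetical values chosen for illustration, as the overlap actually used per class is not restated here.

```python
def slice_starts(n_samples, slice_len=1080, stride=1080):
    """Start indices of fixed-length slices taken every `stride` samples."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    return list(range(0, n_samples - slice_len + 1, stride))


fs = 360                 # sampling rate in Hz, as stated in the text
segment = 30 * fs        # hypothetical 30-second segment (10800 samples)

plain = slice_starts(segment)              # adjacent slices, no overlap
dense = slice_starts(segment, stride=270)  # assumed stride, 75% overlap
print(len(plain), len(dense))
```

A smaller stride yields more slices from the same stretch of signal, and each new slice is shifted relative to its neighbours rather than being an identical copy.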

D. DATA DENOISING
The ECG signal is a low-frequency and high-impedance weak signal [14], which is easily affected by the in vivo and in vitro environments during its acquisition. In vivo effects refer to reactions of the patient's body during the ECG measurement that cause errors in the test results, such as muscle contractions due to the patient's mental stress during the measurement, resulting in inotropic interference, or excessive breathing of the patient, resulting in an unstable ECG amplitude.
In vitro effects, such as electromagnetic interference in the surrounding environment, have a relatively small impact but should nonetheless be taken into account. As a result, the acquired ECG signal is usually accompanied by a lot of noise, such as baseline drift, power frequency interference, EMG interference, and motion artifacts, which may cause wrong classification. To improve the classification precision, in this paper, the wavelet transform is used to eliminate the noise, as follows [15]:
$$W(\alpha, \tau) = \frac{1}{\sqrt{\alpha}} \int_{-\infty}^{+\infty} s(t)\, \phi\!\left(\frac{t - \tau}{\alpha}\right) dt$$
where s(t) denotes the ECG signal, α denotes the scale factor, used to stretch the basic wavelet function ϕ(t), and τ reflects the displacement. Figure 3 shows the overall denoising process of the ECG signal. It is very important for signal denoising to select an appropriate wavelet and decomposition level, because each wavelet in the wavelet transform has its own characteristics. In the research presented in this paper, the DB8 wavelet was used because it could reduce the noise in the ECG signals more effectively than other wavelets [16]. In response to the large fluctuation of ECG signals, a 9-layer wavelet decomposition was adopted, whereby the wavelet coefficients of each layer were kept for wavelet reconstruction. A soft threshold process was used, as per the following equation:
$$w_{new} = f(w_{old}) = \begin{cases} \operatorname{sign}(w_{old})\,\left(|w_{old}| - T\right), & |w_{old}| \ge T \\ 0, & |w_{old}| < T \end{cases}$$
where f(·) denotes the shrinkage function, w_old denotes the input wavelet values, w_new denotes the output wavelet values, and T denotes the threshold of the wavelet transform. The denoising effect is shown in Figure 4.
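Soft thresholding in its standard form shrinks every wavelet coefficient toward zero by the threshold T and sets coefficients whose magnitude does not exceed T to zero. The sketch below shows only this shrinkage step on a plain list of numbers; the function name and the sample values are illustrative, and the DB8 decomposition and reconstruction themselves would be delegated to a wavelet library, which is not shown.

```python
def soft_threshold(coeffs, T):
    """Apply standard soft thresholding to a sequence of coefficients."""
    out = []
    for w in coeffs:
        shrunk = abs(w) - T
        if shrunk <= 0:
            out.append(0.0)  # small coefficients are treated as noise
        else:
            out.append(shrunk if w > 0 else -shrunk)  # keep the sign
    return out


# Illustrative coefficient values, not taken from any ECG recording.
print(soft_threshold([-3.0, -0.5, 0.0, 0.5, 3.0], T=1.0))
```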

A. CNN
Compared with traditional neural networks, convolutional neural networks (CNNs) are characterized by weight sharing and local connection, which can greatly improve their feature extraction capability and training efficiency. CNNs can be applied not only in the field of 2D images but also in the field of 1D data, such as natural language processing (NLP) and some physiological signals (e.g., blood pressure signals [17], respiratory signals [18], ECG signals [19], etc.). The basic CNN structure consists of an input layer, a convolutional layer, a pooling layer, a fully-connected layer, and an output layer. Generally, several convolutional layers and pooling layers are used alternately, that is, a convolutional layer is connected to a pooling layer, which in turn is connected to another convolutional layer and so on, as shown in Figure 5. In some studies, Long Short-Term Memory (LSTM)-based models are used for classification. Compared with a CNN, the LSTM structure is relatively complex. Each LSTM cell includes four fully-connected layers (MLPs). If an LSTM has a large time span and the network is very deep, the computational complexity is significant and the training is slow. The parameter sharing nature of the convolutional operations in a CNN allows for a significant reduction in the number of parameters to be optimized, increasing the speed of model training. Moreover, since convolutional operations mainly deal with grid-like data, they have significant advantages for the analysis and recognition of time series and image data. In addition, although LSTMs alleviate the long-term dependency problem of RNNs to a certain extent, they are still difficult to use with longer sequence data.

B. ResNet
In general, a deeper CNN can achieve better performance. However, with the deepening of the network, problems related to vanishing gradients and exploding gradients occur. In order to solve these problems, He et al. proposed a deep residual network (ResNet) [20], which is more easily optimized than traditional CNNs and can provide a higher precision while increasing the depth. ResNet became the winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015 [21] in the image classification, single-object localization, and object detection tasks, as well as the winner of the Microsoft Common Objects in COntext (MS COCO) 2015 [22] detection and segmentation tasks (https://towardsdatascience.com/review-resnet-winner-of-ilsvrc-2015-image-classification-localization-detection-e39402bfa5d8).
In order to solve the degradation problem in deep neural networks, a residual structure was proposed in [20], as shown in Figure 6, producing the following output vector:
$$y = F(x, \{W_i\}) + x$$
where x denotes the input vector and the function F(x, {W_i}) expresses the residual mapping to be learned. The shortcut connections in Figure 6 neither introduce additional parameters nor increase the computational complexity, but x and F must have the same dimensions. If the input/output dimensions need to be changed (e.g., changing the number of channels), the following formula can be used:
$$y = F(x, \{W_i\}) + W_s x$$
where W_i denotes the weight matrices of the two-layer network in the ResNet block shown in Figure 6 and W_s is used to adjust the input dimension to the output dimension so that the summation can be performed. For the residual structure, the feature to be learned is denoted as H(x), and the shortcut connection allows the module to directly learn the residual F(x) = H(x) − x, so that the output is F(x) + x. When the residual is 0, the stacked layers only perform an identity mapping, which ensures at least that the network performance will not be degraded. In practice, the residual will not be 0, which enables the stacked layers to learn new features based on the input features, so as to achieve better performance.
ResNeXt is much simpler than all Inception models, as it requires considerably fewer hyperparameters to be set manually. ResNeXt is essentially a grouped convolution, where the number of groups is controlled by the cardinality (the number of parallel branches). Figure 7 shows the three equivalent building blocks of ResNeXt.
Figure 8 shows a simplified view of the ResNeXt structure, where the input information passes through a convolution, a batch normalization (BN), and an activation layer before entering a group convolution module, consisting of 32 branches with the same topology, whose outputs are combined and finally normalized, activated, and summed with the shortcut connection to obtain the output information.

IV. RELATED WORK
In 2012, AlexNet [23] reached a milestone by making the following contributions: 1) Using a graphics processing unit (GPU) for network training for the first time, which greatly accelerated the model training. 2) Using a Rectified Linear Unit (ReLU) non-linear activation function [24] instead of the traditional Sigmoid and Tanh activation functions. 3) Using a local response normalization (LRN). 4) Using dropout random deactivation neuron operations in the first two layers of the fully-connected layer to reduce overfitting. This was the first use of deep learning (DL) in the form of a CNN, which paved the way for a new generation of machine learning (ML) models and techniques. Since then, ML has experienced revolutionary changes. DL models, such as CNNs, LSTMs [25] and their variants, almost fully dominate over the other ML models in all application fields. In the medical field, for instance, researchers use DL or ML models for classification of arrhythmia, in order to make its detection and prediction an easier and more reliable process [26], [27].
In terms of data processing, a method, where the ECG signals are treated as one-dimensional data, is increasingly well accepted. In [28], a one-dimensional adaptive CNN model is proposed, in which two ECG modules are integrated into a single learning system for extraction and classification of features. In [29], a 9-layer deep CNN is proposed to classify one-dimensional ECG signals. In [30], following the stage of feature learning, a SoftMax layer is added to the top of the hidden layer obtained to form a deep neural network (DNN). After each iteration, the most relevant and uncertain heartbeats in the test records are marked and used to update the DNN weight, which significantly improves the detection and classification accuracy. A model, based on a one-dimensional CNN, is proposed in [31], where the data of the MIT-BIH database is divided into normal and abnormal heart activity, and a grid search method is used to find the hyperparameters of the CNN.
Recently, many powerful DL models have been produced to effectively classify ECG signals. In [32], a single-lead ECG signal extraction method based on the wavelet transform is proposed, especially for positioning the end point of the T wave. In [33], a new automatic heartbeat classification method is proposed to effectively diagnose arrhythmia in an unsupervised medical environment. In [34], a 34-layer DNN-based algorithm is developed to detect various cardiac rhythm disturbances with the single-lead ECG data generated by sensing/monitoring equipment. The algorithm provides a higher accuracy rate than that of ordinary cardiologists, possibly due to the powerful feature learning ability of the DNN used. In [35], a new multi-scale wavelet CNN, combining a one-dimensional CNN with the stationary wavelet transform, is proposed to automatically identify various cardiac rhythm disturbances. Distinguishable features are extracted from the ECG signals, along with their wavelet subbands, which greatly improves the feature learning process of the model at different scales. In [36], a parallel general regression neural network (GRNN) is used to classify heartbeats, and an online learning module is added to the GRNN to construct a personalized automatic classification model for patients. In [37], a multi-model DL integration model is proposed that can extract useful information from the input of multiple ECG waves. In [38], a symbolization model is proposed, specifically designed for ECG signals, based on a multi-perspective CNN (MPCNN). In [39], a new model for deep bidirectional LSTM network-based wavelet sequences, called DBLSTM-WS, is proposed for ECG signal classification. In [40], a 16-layer 1D CNN is designed for efficient and automatic classification of ECG signals, including AF. In [41], an end-to-end ECG signal classification algorithm, based on a novel segmentation strategy and a 1D CNN, is proposed.
In [42], a novel method is proposed, based on genetic algorithm back propagation neural network (GA-BPNN) for ECG signals classification with feature extraction using wavelet packet decomposition (WPD).
Many non-DL models have achieved good results as well. In [43], a wavelet packet entropy (WPE) and random forest (RF) based ECG signal classification method is proposed. In [44], features are selected using a filter method based on a mutual information ranking criterion, which reduces the number of features while improving the performance. An automatic ECG signal classification model, based on the combination of multiple support vector machines (SVM), is proposed in [45], showing a very good performance. In [46], a novel ensemble multi-label classification model is proposed, which combines multiple multi-label classification methods to build a high-performance classifier.

V. PROPOSED MODEL: G2-ResNeXt
The classic ResNeXt structure is usually used for the classification of two-dimensional images. As such, it is not well suited to the ECG signal classification considered in this paper, so it has to be improved. The improvement proposed in this paper, called G2-ResNeXt, adopts a one-dimensional convolution kernel of size 32, while the original ResNeXt uses small 3 × 3 convolution kernels. This difference in size is mainly due to the different nature of ECG signals and image data. The resolution of the image input to the network is generally low, and a 3 × 3 area may contain significant information even with a small receptive field. However, for low-frequency and low-sampling-rate ECG signals, an area of 3 sampling points at any position can hardly constitute a waveform change with specific significance and is easily disturbed by noise, thus adversely affecting the feature learning. To solve this problem, the proposed G2-ResNeXt model uses a large convolution kernel.
In the conducted experiments, the ECG signals were sliced at the sampling rate of 360 Hz, whereby the length of each slice was 1080 sampling points, resulting in a 3-second slice. However, as the slice length of 1080 is reduced to 135 after three downsampling operations and cannot be downsampled again, all slices were re-sampled to 1024 sampling points.
The structure of the proposed G2-ResNeXt model is depicted in Figure 9, whereas Figure 10 shows the specific dimensions and values used in each of its layers.
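The arithmetic behind the re-sampling can be checked with a short sketch, assuming that each downsampling operation halves the slice length (which matches 1080 becoming 135 after three operations). The helper name is hypothetical.

```python
def halvings(length):
    """Return how many times `length` halves evenly, and what remains."""
    n = 0
    while length > 1 and length % 2 == 0:
        length //= 2
        n += 1
    return n, length


print(halvings(1080))  # 1080 -> 540 -> 270 -> 135, then 135 is odd
print(halvings(1024))  # a power of two keeps halving down to 1
```

A length of 1024 is a power of two, so it can be halved at every stage of the network without leaving a remainder.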
For measuring the computational complexity of the proposed model, the number of parameters (Params) and floating-point operations (FLOPs) can be used, calculated as follows: where C_in denotes the number of input channels, C_out denotes the number of output channels, K denotes the size of the convolution kernel, H denotes the height of the input features, and W denotes the width of the input features. The number of FLOPs used by different G2-ResNeXt nodes is shown in
The data are first convolved once and then passed through a BN layer for batch normalization, in order to solve the problem of the changing data distribution in the middle layers during training, prevent the gradient from vanishing or exploding, and speed up the model training. The normalized data are then fed into an activation layer that uses a Mish activation function, which proved to have a more stable performance and higher precision than ReLU [47] and, in addition, can keep small negative values to stabilize the gradient flow of the network. The data are then passed through a grouped convolution, which increases the accuracy without increasing the complexity, while also reducing the number of hyperparameters.
Unlike ResNeXt, the proposed G2-ResNeXt model utilizes a novel two-group structure, consisting of Group-A (the red box in Figure 9) and Group-B (the green box in Figure 9). Group-A consists of eight branches (with identical topologies). In each branch, the data pass through a convolution first and then through a Group-B and a parallel shortcut, which is downsampled using a combination of a convolutional module and ReLU. Then, the output of the shortcut is summed with the output of Group-B, before undergoing normalization, activation, and another convolution at the end of the branch. The outputs of all eight branches of Group-A are fused at the end with the features according to the following concat operation: where X_i and Y_i denote the two input channels, K_i represents the parameters of the first set of input convolution kernels, and K_{i+z} represents the parameters of the second set of input convolution kernels. Given C_0 input channels, C_1 output channels, and a convolution kernel of size K, for a conventional convolution, the computational effort (measured in FLOPs) can be calculated as: For Group-A, the computational effort (measured in FLOPs) is: where G denotes the number of branches in Group-A. From (9), it is clear that the computational effort in Group-A is reduced G times compared to the conventional convolution. This speeds up the model training.
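The G-fold reduction can be checked numerically under a simple cost model. The sketch below assumes that the cost is counted as multiply-accumulate operations per output position and that the input and output channels are divided evenly among the G branches, so that a conventional convolution costs C_0 · C_1 · K and the grouped one costs G · (C_0/G) · (C_1/G) · K = C_0 · C_1 · K / G. The function names and the channel counts are hypothetical; only the kernel size of 32 and the eight branches come from the text.

```python
def conv_cost(c_in, c_out, k):
    """Multiply-accumulates per output position of a 1D convolution."""
    return c_in * c_out * k


def grouped_conv_cost(c_in, c_out, k, groups):
    """Same cost model when the channels are split evenly into groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by the number of groups")
    return groups * conv_cost(c_in // groups, c_out // groups, k)


c0, c1 = 64, 64   # hypothetical channel counts
k, g = 32, 8      # kernel size and number of Group-A branches from the text

full = conv_cost(c0, c1, k)
grouped = grouped_conv_cost(c0, c1, k, g)
print(full, grouped, full // grouped)
```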
Group-B is a group convolution with four identical topologies, with only one convolution operation in each branch. The outputs of all four branches are fused with the features.
During the model training, if there are too many parameters, the trained model is easily overfitted. In order to reduce the number of parameters in the network and to lower the model training time, a dropout is added in the convolutional layer.
To further improve the model performance, a modified convolutional block attention module (CBAM) [48] is added at the end. A typical CBAM combines a channel attention module (CAM) and a spatial attention module (SAM). CAM utilizes different feature information, whereby each channel of the features represents a special detector. After adding the features together, an activation function gets their weights and figures out what features are meaningful. SAM is spliced with the channel descriptions, obtained through average pooling and maximum pooling, and processed by the convolutional layer to obtain the weight coefficients. CBAM is mostly applied to two-dimensional images, whereas the ECG data is processed as a one-dimensional time-series signal. Therefore, some modifications were made to the typical CBAM structure. More specifically, in CAM, both global mean pooling (GAP) and global maximum pooling (GMP) are performed as one-dimensional pooling. In SAM, the two-dimensional convolution kernel of size 7 × 7 is replaced by a one-dimensional convolution kernel of size 7. The resultant modified CBAM (shown in Figure 11) is used in the proposed G2-ResNeXt model.
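A one-dimensional channel attention step of this kind can be sketched in plain Python. This is a simplified illustration rather than the authors' module: it follows the typical CBAM design [48], in which the average-pooled and max-pooled channel descriptors pass through a shared two-layer perceptron, are summed, and are squashed by a Sigmoid. The function name, the ReLU in the hidden layer, the omission of biases, and all weight values are assumptions.

```python
import math


def channel_attention_1d(fmap, w0, w1):
    """Rescale each channel of a 1D feature map by an attention weight.

    fmap: list of C channels, each a list of L samples.
    w0:   hidden-layer weights of the shared MLP, shape (C/r) x C.
    w1:   output-layer weights of the shared MLP, shape C x (C/r).
    """
    avg = [sum(ch) / len(ch) for ch in fmap]  # 1D global average pooling
    mx = [max(ch) for ch in fmap]             # 1D global maximum pooling

    def mlp(v):
        # Assumption: ReLU hidden layer without biases.
        hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in w0]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w1]

    weights = [1.0 / (1.0 + math.exp(-(a + m)))
               for a, m in zip(mlp(avg), mlp(mx))]
    scaled = [[wt * x for x in ch] for wt, ch in zip(weights, fmap)]
    return scaled, weights


# Hypothetical two-channel feature map and MLP weights.
fmap = [[0.0, 1.0, 2.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
w0 = [[0.5, -0.5]]
w1 = [[1.0], [-1.0]]
scaled, weights = channel_attention_1d(fmap, w0, w1)
print(weights)
```

Each channel receives a single weight between 0 and 1, so the output keeps the shape of the input while informative channels are emphasized.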
After the feature map is input into the modified CBAM, channel attention is performed first, applying GAP and GMP along the width of the feature map. Then, the attention weight of each channel is obtained through a multi-layer perceptron (MLP), followed by a Sigmoid function that normalizes the attention weights. Finally, the channel attention is applied to the original features by multiplying the original input feature map, channel by channel, by these weights, as follows:
$$M_c(F) = \sigma\left(\mathrm{MLP}(F^{c}_{avg}) + \mathrm{MLP}(F^{c}_{max})\right) \tag{10}$$
where F^c_avg and F^c_max denote the average pooling feature and the maximum pooling feature, respectively, and σ denotes the Sigmoid function. In order to obtain the attentional features in the spatial dimension, the feature map output by the channel attention is also globally max-pooled and globally average-pooled along the width of the feature map. Its dimension is then reduced by a convolution with a kernel of size 7 and a ReLU activation function, and raised back to the original dimension by another convolution. Finally, the feature map normalized by a Sigmoid activation function is combined with the feature map output by the channel attention, which completes the rescaling of the feature map in both the spatial and channel dimensions, as follows:
The proposed G2-ResNeXt model consists of four main parts, namely a convolutional layer, an improved ResNet layer, a G2-ResNeXt layer, and a fully-connected layer. The convolutional layer is mainly used to extract the features of the input data prepared for the next layer.
In the second layer, the improved ResNet, the data are convolved twice, whereby a Mish activation function and a dropout are added between the two convolutions. In addition, just before the convolution, the data are input into an average-pooling module: the samples are divided into regions and the mean value of each region is used as its representative, which simplifies the computation and reduces the number of parameters. Finally, the data output by the average pooling are added to the data (with the same dimension) produced by the Mish activation function, aiming to inherit the optimization effect of the previous step, so that the model can converge further.
The next layer, G2-ResNeXt, is used to increase the speed of the model convergence and maintain stability. After passing through the G2-ResNeXt structure six times, the data are inputted to the fully-connected layer, where these are mapped to a one-dimensional vector, regressed by a SoftMax function [49]. Finally, all outputs are added together and normalized in order to show the multi-classification results in the form of probability.
The loss function used is the focal loss function, which makes it possible to handle various data imbalances present in the dataset [50], as per the following formula: where γ is a parameter with a value within the range of [0, 5] (when γ = 0, FL is the same as the common cross-entropy loss function), α and y denote the input data and labels, respectively, and p_n represents the probability value of the n-th class of the SoftMax output. The (1 − p_n)^γ factor reduces the loss contribution of the easy samples and increases the loss proportion of the hard samples. Because the label is in the form of one-hot encoding, the value in the label of a certain sample type is 1 in the corresponding position, and the rest are 0. The optimized formula is shown below: where α_c denotes the weight of the class-c sample and p_c denotes the probability value of the class-c output produced by SoftMax.
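For a one-hot label, the class-weighted focal loss in its commonly used form reduces to FL = −α_c (1 − p_c)^γ log(p_c) for the true class c. The sketch below implements this common form as an illustration; it is an assumption that this matches the optimized formula referred to above term by term, and the default γ = 2, the class weights, and the probability vectors are hypothetical values.

```python
import math


def focal_loss(probs, target, alpha, gamma=2.0):
    """Focal loss of one sample with a one-hot label.

    probs:  SoftMax output, one probability per class.
    target: index c of the true class.
    alpha:  per-class weights (alpha[c] is the weight of class c).
    gamma:  focusing parameter; gamma = 0 gives weighted cross entropy.
    """
    p_c = probs[target]
    return -alpha[target] * (1.0 - p_c) ** gamma * math.log(p_c)


alpha = [1.0, 1.0, 1.0, 1.0]      # hypothetical uniform class weights
easy = [0.90, 0.05, 0.03, 0.02]   # confidently correct prediction
hard = [0.30, 0.40, 0.20, 0.10]   # true class 0 receives low probability

print(focal_loss(easy, 0, alpha), focal_loss(hard, 0, alpha))
print(focal_loss(easy, 0, alpha, gamma=0.0), -math.log(0.90))
```

The easy sample contributes far less than the hard one, and setting γ = 0 recovers the cross-entropy value.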
In order to speed up the convergence and limit overfitting, an L2 regularization term is added to the convolutional layer, as follows [51]: where α denotes the regular term coefficient, W denotes the network weight, p denotes the predicted value of the heartbeat category, x denotes the heartbeat feature, and T denotes the number of weighted items.

A. EVALUATION METRICS
Typical evaluation metrics used in multi-class classification problems were utilized for comparing the performance of different models, namely sensitivity (Se), precision (P+), and overall accuracy (Acc). Sensitivity (also called recall) measures the ratio of the positive samples that are correctly identified as such by a classification model to the total number of positive samples in a dataset, as follows:

$$Se = \frac{TP}{TP + FN}$$

where TP (true positive count) denotes the number of positive samples that are correctly identified as positive, and FN (false negative count) denotes the number of positive samples that are incorrectly identified as negative. In our case, sensitivity expresses the percentage of the actual heartbeats of a class that are correctly classified by a model, which reflects its ability to discover heartbeats of the classes considered.
Precision measures the ratio of the positive samples that are correctly identified as such by a classification model to the total number of samples identified as positive, as follows:

$$P^{+} = \frac{TP}{TP + FP}$$

where FP (false positive count) denotes the number of negative samples that are incorrectly identified as positive. Generally, precision expresses the positive predictive value (PPV) of a model. In our case, it is more important to improve (as much as possible) the sensitivity than the precision of the classification model, because flagging a possible disease is preferable to failing to detect it at all. In addition to sensitivity and precision, another evaluation metric used in the conducted experiments is the overall accuracy (Acc), which indicates the overall classification accuracy of a model. It is calculated by dividing the number of all correctly classified heartbeats by the total number of heartbeats present, as follows:

$$Acc = \frac{TP + TN}{TP + TN + FP + FN}$$

where TN (true negative count) denotes the number of negative samples that are correctly identified as negative.
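The three metrics can be computed directly from a multi-class confusion matrix, as in the following minimal sketch. The function names and the 3×3 matrix are hypothetical and are not taken from the paper's tables; for the multi-class case, the overall accuracy is taken here as the sum of the diagonal divided by the total number of heartbeats.

```python
def per_class_metrics(confusion, cls):
    """Sensitivity and precision of class `cls`.

    confusion[i][j] = number of heartbeats of true class i predicted as class j.
    """
    n = len(confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp                         # true cls, predicted otherwise
    fp = sum(confusion[r][cls] for r in range(n)) - tp    # other classes predicted as cls
    se = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return se, ppv

def overall_accuracy(confusion):
    """Correctly classified heartbeats divided by all heartbeats."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Hypothetical confusion matrix for three classes.
toy = [[50, 2, 3],
       [5, 40, 5],
       [0, 10, 85]]
se_1, ppv_1 = per_class_metrics(toy, 1)
acc = overall_accuracy(toy)
```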

B. EXPERIMENTS
In the experiments, the initial learning rate was set to 0.1 and then adjusted dynamically during training: whenever the loss stopped decreasing, the learning rate was halved, which helped the network to converge faster in the right direction. As for the batch size, we first chose the largest batch size that would fit in memory, then observed the convergence of the loss, and reduced the batch size if the loss did not converge or converged poorly. A batch size of 128 was finally chosen. Weight initialization is also an important design choice, since a suitable initialization promotes fast convergence of the model. As the combination of 'ReLU + Conv' is used in many places in the proposed G2-ResNeXt model, the He initialization method was utilized, as it works well with ReLU [52].
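He initialization draws each weight from a zero-mean normal distribution with standard deviation sqrt(2 / fan_in), where fan_in is the number of inputs feeding a unit. The sketch below is a generic pure-Python illustration; the function name `he_normal` and the layer sizes are assumptions, not values from the paper.

```python
import math
import random

def he_normal(fan_in, fan_out, rng):
    """Return a fan_out x fan_in weight matrix drawn from N(0, 2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)                        # fixed seed for reproducibility
weights = he_normal(fan_in=64, fan_out=200, rng=rng)

flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
sample_std = math.sqrt(sum((w - mean) ** 2 for w in flat) / len(flat))
```

Scaling the variance by 2 / fan_in compensates for ReLU zeroing roughly half of the activations, which keeps the signal variance approximately constant from layer to layer.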
The conducted experiments showed that using the stochastic gradient descent (SGD) optimizer for tuning the parameters of the G2-ResNeXt model produces the best results among the optimizers tested, i.e., compared with Adagrad, Adam, Adadelta, and Adamax, as reported in Table 6.
In the experiments, the proposed G2-ResNeXt model was trained according to Algorithm 1. After the learning rate, number of epochs, and batch size are set, the model weights are initialized and the epoch loop starts: in each epoch, the weights are updated based on the input signal data (signalsData) and signal labels (signalsTags), the loss value is calculated, and the learning rate is updated whenever the loss of the current round is greater than that of the previous round. To assess the stability of the proposed model, it was trained five separate times, one after another, with 80 iterations in each training session. The overall accuracy, sensitivity, and precision achieved by the proposed model in each training session (Training_1-Training_5) for each AAMI heartbeat class are shown in Table 5. As can be seen from the table, the overall accuracy fluctuates only slightly (within 1%) between different training sessions, and the mean square error (MSE) is equal to 0.2434. The model with the most balanced performance (corresponding to the median value of the overall accuracy achieved in the five training sessions), named Training_1, was chosen for the performance comparison with the state-of-the-art models, presented in the next subsection.
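The control flow of the training procedure described above can be sketched as follows. This is not Algorithm 1 itself: a toy linear model with a squared-error loss stands in for G2-ResNeXt and the focal loss, and the names `train`, `signals_data`, and `signals_tags` are adapted from the text for illustration only. The defaults (learning rate 0.1, 80 epochs, batch size 128) follow the values reported in this section.

```python
def train(signals_data, signals_tags, lr=0.1, epochs=80, batch_size=128):
    """Mini-batch SGD that halves the learning rate when the epoch loss rises."""
    w, b = 0.0, 0.0                    # toy linear model standing in for the network
    prev_loss = float("inf")
    history = []                       # (epoch loss, learning rate) per epoch
    for _ in range(epochs):
        epoch_loss = 0.0
        for start in range(0, len(signals_data), batch_size):
            xs = signals_data[start:start + batch_size]
            ys = signals_tags[start:start + batch_size]
            errs = [w * x + b - y for x, y in zip(xs, ys)]
            epoch_loss += sum(e * e for e in errs)
            # Gradients of the mean squared error with respect to w and b.
            grad_w = 2.0 * sum(e * x for e, x in zip(errs, xs)) / len(xs)
            grad_b = 2.0 * sum(errs) / len(xs)
            w -= lr * grad_w
            b -= lr * grad_b
        epoch_loss /= len(signals_data)
        if epoch_loss > prev_loss:     # loss got worse: halve the learning rate
            lr *= 0.5
        prev_loss = epoch_loss
        history.append((epoch_loss, lr))
    return (w, b), history

xs = [i / 10 for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
params, history = train(xs, ys)
```

Starting from a deliberately large learning rate (for example `train(xs, ys, lr=1.0)`) makes the first epochs diverge, after which the halving rule brings the learning rate back into a stable range.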

C. RESULTS
Table 7 shows the multi-class confusion matrix of the proposed G2-ResNeXt model (Training_1) applied to the MIT-BIH arrhythmia database. Table 8 presents the results of the performance comparison of the proposed model with the state-of-the-art models considered, as regards the overall accuracy, sensitivity, and precision achieved in classifying AAMI heartbeats (only the results for classes V and S are presented, as these contain most arrhythmias). The presented results demonstrate that the proposed G2-ResNeXt model outperforms the other models on all evaluation metrics used, with the exception of the GRNN model, which achieves better sensitivity and precision in classifying AAMI heartbeats of class S. More specifically, the margin of G2-ResNeXt over the other models compared ranges from 0.02% to 1.69% in overall accuracy, from 2.34% to 9.73% in sensitivity and from 1.28% to 12.27% in precision for AAMI heartbeats of class V, and from 2.04% to 60.59% in sensitivity and from 6.96% to 82.10% in precision for AAMI heartbeats of class S.

VII. CONCLUSION
This paper has proposed a novel model, called G2-ResNeXt, for the classification of inter-patient ECG signals, which achieves an overall accuracy of 96.16%. A slice-and-stack method is used to process the MIT-BIH data set, providing effective data preprocessing. The presented experimental results have demonstrated that the proposed G2-ResNeXt model can effectively identify arrhythmia, achieving sensitivity and precision values of 97.09% and 95.90%, respectively, for the ventricular ectopic heartbeats (VEB), and of 80.59% and 82.26%, respectively, for the supraventricular ectopic heartbeats (SVEB). It thus surpasses the state-of-the-art models used for performance comparison on all evaluation metrics used, with the exception of the GRNN model for the SVEB class.
Therefore, the proposed model shows promising prospects for clinical application.
Although the overall accuracy of the model is high, there is still room for improvement, in particular by raising the recognition rate of class-F heartbeats and by reducing the additional computational complexity and execution time (caused by the introduction of a modified CBAM module), so that the model becomes competitive in that respect with the original ResNeXt model.