Detection of Obstructive Sleep Apnoea Using Features Extracted From Segmented Time-Series ECG Signals With a One Dimensional Convolutional Neural Network

This paper reports on ongoing research, which aims to show that features of Obstructive Sleep Apnoea (OSA) can be automatically identified from single-lead electrocardiogram (ECG) signals using a One-Dimensional Convolutional Neural Network (1DCNN) model. The 1DCNN is also compared against other machine learning (ML) classifier models, namely Support Vector Machine (SVM) and Random Forest Classifier (RFC). The 1DCNN architecture consists of 4 major parts: a Convolutional Layer, a Flattened Dense Layer, a Max Pooling Layer and a Fully Connected Multilayer Perceptron (MLP) with 1 Hidden Layer and a SoftMax output. The model repeatedly learns how to better extract prominent features from one-dimensional data and map them to the MLP for improved prediction. Training and validation are achieved using pre-processed time-series ECG signals captured from 35 ECG recordings. Using our unique windowing strategy, the data is shaped into 5 datasets of different window sizes. A total of 15 models (5 for each group: 1DCNNs, RFCs, SVMs) were evaluated using various metrics, with each being run over numerous experiments. Results show that the 1DCNN-500 model delivered the greatest accuracy and speed in comparison to the best-performing RFC and SVM classifiers: 1DCNN-500 (Sensitivity 0.9743, Specificity 0.9708, Accuracy 0.9699); RFC-500 (Sensitivity/Recall (0) 0.90 / (1) 0.94, Precision (0) 0.94 / (1) 0.90, Accuracy 0.91); SVM-500 (Sensitivity (0) 0.94 / (1) 0.50, Precision (0) 0.65 / (1) 0.90, Accuracy 0.72). The model presents a novel approach that could provide support mechanisms in clinical practice to promptly diagnose patients suffering from OSA.


The associate editor coordinating the review of this manuscript and approving it for publication was Wai-Keung Fung.

I. INTRODUCTION
Obstructive Sleep Apnoea (OSA) is a sleep disorder that affects breathing during sleep. Severe apnoea sufferers can have up to 600 episodes of apnoea per night, with each episode lasting up to 40 seconds [1]. There is a range of symptoms that can indicate the presence of OSA, which include chronic snoring, insomnia, gasping and breath holding, unrefreshing sleep and daytime sleepiness [2]. OSA is a common condition, with many estimates showing it currently affects approximately 1.5 billion people worldwide [3], although it is reported to be more prevalent amongst the age group 30 to 60 years. The Apnoea-Hypopnoea Index (AHI) is used to indicate the severity of OSA: an AHI < 5 per hour is classed as Normal, 5 ≤ AHI < 15 as Mild, 15 ≤ AHI < 30 as Moderate, and AHI ≥ 30 as Severe [3]. Estimates have shown that OSA affects 20% of the general population, where AHI is ≥ 5 [4], [5].
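The AHI severity bands quoted above can be expressed as a simple lookup. The following is a minimal sketch only; the function name `classify_ahi` is illustrative and is not part of the study's software.

```python
def classify_ahi(ahi: float) -> str:
    """Map an AHI value (events per hour) to the severity bands quoted in the text."""
    if ahi < 0:
        raise ValueError("AHI cannot be negative")
    if ahi < 5:
        return "Normal"
    if ahi < 15:
        return "Mild"
    if ahi < 30:
        return "Moderate"
    return "Severe"


for value in (3.2, 5, 22.5, 41):
    print(value, classify_ahi(value))
```

The lower bound of each band is inclusive, matching the "≥" thresholds given in the text.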
The effects of OSA can range from minor issues, such as daytime fatigue and tiredness, to more life-threatening issues that include heart failure and strokes. This not only puts a strain on health services, but also on the global economy, and it is estimated that direct and indirect costs of OSA, such as health care costs, accidents, decreased productivity and sickness, reach into the billions annually [6]. One of the biggest challenges with OSA is the correct diagnosis of the condition.
The diagnosis of OSA dates back over a century, to 1913, when French scientist Henri Pieron examined the physiological impact of the sleep disorder. Since then, major advances have been made in the diagnosis of OSA: sleep societies, organisations and bodies have been formed, and there are now thousands of accredited sleep specialists and hundreds of clinical sleep laboratories worldwide. However, there is still much evidence that the diagnosis of OSA is not speedy or precise enough to keep up with demand. It is suggested that many OSA sufferers go undiagnosed, with estimates showing that over 80% of OSA patients also remain incorrectly diagnosed [5], [6]. Consequently, OSA represents a major public health concern and, left untreated, can lead to numerous negative health-related consequences and in some cases mortality [7], [8].
A major factor in this problem of diagnosis is the many drawbacks and limitations of the traditional and existing diagnostic techniques and systems, which range from simple form filling and information gathering, such as the Epworth Sleepiness Scale (ESS) [9], Berlin [10] and STOP-Bang Questionnaires [11], to physical examinations, overnight clinical stays and high-tech monitoring systems, such as Polysomnography (PSG). All of these can be time-consuming, expensive, complex or intrusive, often meaning OSA sufferers do not get adequate treatment in good time, and sometimes never.
To tackle the issues of complexity, inconvenience and expense, a variety of portable OSA diagnostic systems were proposed. A well-established example of this is the Home Sleep Apnoea Testing (HSAT) system, known in Europe as polygraphy kits. These systems are lightweight, portable and in some cases wearable, which means they require fewer physiological sensors and clinicians than a standard PSG [12]. Since these systems are more accessible, they are now used for first-line diagnosis of OSA, which has reduced waiting lists and lowered overall costs. However, some studies suggest their use as stand-alone diagnostics in routine clinical practice is yet to yield any convincing results [13], primarily because HSATs find it difficult to compute the Respiratory Event Index (REI) [14].
In more recent years, machine learning systems began to emerge, and these intelligent machines brought a whole new approach to the diagnosis of OSA. Many studies show this approach not only improved diagnosis but, just as importantly, brought a reduction in the required equipment, time and costs [15], [16]. Nevertheless, akin to the more traditional diagnosis systems, this approach also has its drawbacks, chiefly the required domain expertise and the time consumed.
This study presents a novel system that builds on recent advancements in the field of machine learning. Using deep learning neural networks for the automatic and early detection of OSA could provide mechanisms in clinical practice to help diagnose patients suffering from OSA. This study also presents the results and findings from alternative ML classifier models, namely RFC and SVM, when compared against the 1DCNN model.
Unique contributions of the paper include:
• Deliver a support apparatus to diagnose patients suffering from OSA using a 1DCNN deep learning model to overcome the requirement to manually extract features.
• Introduce a unique windowing strategy of time-series ECG data to better train the model. This is beneficial for the following reasons:
  • Allows the signal to be reduced into segments that capture more OSA events
  • Enables better training of more observations using smaller time-series windows
  • Addresses the dataset class imbalance using real data, thus avoiding the use of synthetic data
• Comparison of different classifiers (1DCNN, SVM, RFC) when utilising the PhysioNet (Apnoea-ECG) dataset. The 1DCNN and the automated feature extraction associated with deep learning models prove significantly better than traditional machine learning models.
The remainder of this paper is structured as follows. Section II describes related work. Section III presents the methodology, which includes details of the data, test subject trial and system components. Section IV presents the experimental results. Section V provides a discussion of the results and Section VI concludes this paper, including any future work.

II. RELATED WORK
The following section looks at previous related studies, describing how some supervised ML algorithms have performed when used to classify a condition through the use of single-lead electrocardiogram (ECG) time-series data.

A. MACHINE LEARNING METHODS USING ECG TIME-SERIES DATA
In [23], the authors' endeavour was to automatically diagnose Obstructive Sleep Apnoea using a novel approach based on a transformation into the Cepstral domain, combined with the statistical Hidden Markov Model (HMM) and SVM classifiers. Results suggested that the Cepstrum and HMM classifier were not enough; therefore, their next approach was to use the HMM kernel, whilst also introducing an SVM model to classify the data. This approach showed great improvements with excellent results for accuracy.
The approach in [24] was to develop a stacked sparse auto-encoder (SAE) based deep neural network (DNN) combined with a Hidden Markov Model (HMM), using SVM and ANN classifiers. The authors found that combining HMM and DNN with a Confidence Score-based Decision Fusion method improved both classification accuracy and overall classification performance, along with a better balance of sensitivity and specificity. They also found that further classification accuracy was achieved with the addition of an extra hidden layer, where 2 hidden layers empirically provided their best results.
In [25] the authors used the signal processing technique Tunable-Q Factor Wavelet Transform (TQWT) to extract specific apnoea features from sample ECG signals. OSA classification was then performed using Random Under Sampling Boosting (RUSBoost), which provided well-balanced results. Further to this, an evaluation of RUSBoost showed its superiority when compared to eight commonly used classifiers: extreme learning machine (ELM), Parzen's probabilistic neural network (Parzen PNN), bootstrap aggregating (Bagging), k-Nearest Neighbors (kNN), support vector machine (SVM), least-squares SVM (LS-SVM), random forest (RF), and adaptive boosting (AdaBoost).
Reference [26] considered a more state-of-the-art approach. Here, the authors developed a 1DCNN model with an architecture that included rectified linear unit (ReLU) activation, max pooling and dropout layers. All experiments were conducted in a fully supervised manner. Optimal performance of the model was achieved through the fine-tuning of extensive and complex hyperparameters across a varied depth of convolutional layers. Excellent results were achieved using architectures from three layers up to nine layers; anything over nine layers caused both overfitting and underfitting. The 1DCNN architecture was also compared to models from previous studies, and it outperformed each of them: SVM, Fuzzy reasoning module, LDA, QDA, AdaBoost, Bagging REPTree and Kernel density classifier.
In [27] detection of sleep apnoea (SA) was achieved using a multimodal approach that included the combined-channel feature analysis of ECG and SpO2. Classification was performed using RFC with excellent results: sensitivity 95.9, specificity 98.4 and accuracy 97.5. To better understand these results, other traditional classifiers were used, but with less success: SVM, KNN, and Linear Regression (LR). The authors found interesting results when testing the stand-alone ECG and SpO2 signals, with the SpO2 feature set providing better accuracy, sensitivity, and specificity. However, SA results using SpO2 alone have been shown to be imitated by other breathing conditions, namely chronic obstructive pulmonary disease or alveolar hypoventilation.
The authors in [28] used IHR (Instantaneous Heart Rate) signals for detection of SA. The experiment was performed using two network topologies, Single and Stacked LSTM (long short-term memory), with varying parameters. Training and testing were performed using various LSTM-RNN (recurrent neural network) models. Results ranged from very good to excellent; however, the authors did acknowledge that training and testing were performed on small portions of the dataset.
Reference [29] proposed a snoring-based obstruction site detection model to identify the site of a collapse in the upper airway. They first processed the audio signal using VAD (voice activity detection) and a mixed Gaussian distribution model. Features were then obtained from the audio signal using the popular method of MFCC (Mel-Frequency Cepstrum Coefficient), also known as Meyer cepstrum coefficient characteristic. They then ran 24 classification experiments, using different feature vector dimensions each time, across 3 separate classifiers (KNN, SVM and Gaussian NB), producing satisfactory results. The KNN was seen to outperform the SVM, since the data set is much larger than the number of features, and the Naive Bayes algorithm is often used in smaller feature sets with fewer outliers. The authors also claim that their model performed better than similar previous studies, since their data included the variables age, gender and BMI.
Reference [30] presents our previous study, a 1DCNN model designed for the automated detection of OSA from single-lead ECG signals. The dataset was acquired from PhysioNet and is also used in this current study. The data was pre-processed into 5 exclusive datasets before training. The model consists of a Convolutional Layer, a Flattened Dense Layer, a Max Pooling Layer and a Fully Connected Multilayer Perceptron (MLP) with a Hidden Layer and SoftMax output. The model was evaluated using various metrics. Results showed the model produced high classification performance (Sensitivity 0.9705%, Specificity 0.9725%, F1_Score 0.9717%, Accuracy 0.9377%, ROCAUC 0.9945%).
Using digitised ECG signals, [31] compared thirteen classic ML models and four DL models for automatic detection of OSA. Preprocessing involved removing unwanted frequency noise using a digital IIR notch filter. Feature extraction was then applied to capture nine specific features from the ECG signals, which helps reduce the data's high dimensionality and improve overall performance. The results showed the 4 DL models outperformed the 13 classical ML models, with the hybrid CNNLSTM network producing the best performance of accuracy 86.25%, sensitivity 88.8% and AUC 95.1%, when compared to other previous studies.
In [32] a CNNLSTM hybrid model was developed to automatically detect OSA using ECG signals. A 1D deep CNN model for the automatic detection of OSA using single-lead ECG signals was developed in [33]. The 1DCNN used 10 identical convolutional layers, 5 fully connected layers and 4 identical classification layers. Preprocessing was achieved using Butterworth bandpass filtering and z-score normalization. Compared to several studies, this model had the best accuracy 87.9%, specificity 92.0%, sensitivity 81.1% and AUC of 94 for per-minute apnoea detection.
A final study in [34] looked to address the limitations of feature extraction using traditional ML models. A 1D squeeze-and-excitation residual group network (1D-SEResGNet) using a multi-feature (RII+RA+QA) fusion method was proposed to extract the complementary information of HVR and EDR, using a bandpass filter to find R-peaks in 2-minute ECG signal segments to detect OSA. Results of segment detection showed a sensitivity of 87.6%, specificity of 91.9% and accuracy of 90.3%.
All but two of the studies ([26], [29]) in this section used the same Apnoea-ECG database, but with varying formats, techniques and methods. Further to this, some of the approaches discussed depend on third-party signal processing applications to prepare their data, whilst others use traditional ML methods with hand-crafted features, all of which can be time-consuming and expensive. Some use LSTM and hybrid approaches, based on predictive measures, that rely on both accurate data and how well missing values can be guessed. Our model is based on classification, which simply determines an observation's class. Comparing our model to the other classification models discussed, our results are more than comparable, with a much simpler workflow.

III. DATA ACQUISITION, SUBJECT INFORMATION, AND PRE-PROCESSING
The first part of this section describes the dataset (Apnoea-ECG Database) used to train and test the 3 models: the proposed 1DCNN model, the RFC model and the SVM model. It shows the value of the dataset, the information held within it and how it was carefully pre-processed and feature engineered. The latter part of this section looks at the construction of the three machine learning algorithms (1DCNN, RFC, SVM), including their architectural makeup, how they were evaluated, the metrics used to gauge their performance and the results they produced.

A. APNOEA-ECG DATABASE
The Apnoea-ECG database (Table 1) was acquired from the well-known public online repository PhysioNet. This database has been actively used and extensively reported in previous high-quality publications and journals. The Apnoea-ECG database was constructed through the observation and merger of data taken from two separate studies, in 1993 and 1999, that involved the recording of ECG signals from patients suffering with obstructive sleep apnoea [1].
A total of 70 night-time ECG/EEG recordings are held in the database; the 35 annotated recordings were used for this study. Breaking down the data as presented in Table 2: Group A (Apnoea-Set) contained 20 subject recordings with 6250 mins of Apnoea and 3811 mins of Normal (Non-Apnoea). Group B (Borderline-Set) had 5 subject recordings, with 252 mins of Apnoea and 2060 mins of Non-Apnoea. Group C (Normal-Set) had 10 subject recordings with 12 mins of Apnoea and 4740 mins of Non-Apnoea.
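The minute counts quoted above imply a class imbalance before any balancing is applied. The following minimal sketch adds up the figures given in this paragraph; the dictionary layout and names are illustrative and are not part of the original pipeline.

```python
# Minutes of Apnoea / Non-Apnoea per group, as quoted in the text (Table 2).
GROUP_MINUTES = {
    "A (Apnoea-Set)": {"apnoea": 6250, "non_apnoea": 3811},
    "B (Borderline-Set)": {"apnoea": 252, "non_apnoea": 2060},
    "C (Normal-Set)": {"apnoea": 12, "non_apnoea": 4740},
}


def class_totals(groups):
    """Sum the annotated minutes of each class across all groups."""
    apnoea = sum(g["apnoea"] for g in groups.values())
    non_apnoea = sum(g["non_apnoea"] for g in groups.values())
    return apnoea, non_apnoea


apnoea_total, non_apnoea_total = class_totals(GROUP_MINUTES)
apnoea_share = apnoea_total / (apnoea_total + non_apnoea_total)
print(apnoea_total, non_apnoea_total, round(apnoea_share, 3))
```

Across the three groups this gives 6,514 minutes of Apnoea against 10,611 minutes of Non-Apnoea, which is the imbalance that the later balancing step addresses.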

1) BUILDING THE DATASETS
Each Apnoea observation was scored and annotated by an expert sleep clinician. Using a feature selection process, all unwanted characteristics were identified and removed, leaving only the required variables and features for the dataset. The next step was to cross-reference and separate all the annotated files that resided in groups A, B & C into two sample recording groups (Apnoea and Non-Apnoea). This resulted in 650 segmented files: 314 files contained Apnoea and 336 files contained Non-Apnoea, as shown in Table 3.

2) MERGING OF SEGMENTED SAMPLE FILES STAGE I
With the segmentation of Apnoea and Non-Apnoea files completed (Table 3), the construction of the dataset could begin. This process involved merging the individual segmented sample files within their own specific code and group. Completing this task resulted in 35 newly formed files for each of the two groups (35 Apnoea files and 35 Non-Apnoea files), as shown in Table 4.

3) MERGING OF SEGMENTED SAMPLE FILES STAGE II
The process of building the dataset was firstly to merge each of the 35 newly formed sample files within their own specific section within their specific group, i.e. the 20 newly formed files in the 'A' section under the 'Apnoea' group were merged together to form one file; the same process was applied to the files in the B and C sections of their specific groups, as shown in Table 5.
Using the same technique and keeping the separate groups (Apnoea and Non-Apnoea), the process of merging the newly formed files together continued, leaving two files: 1 file for Apnoea and 1 file for Non-Apnoea (Table 6). The final step before moving onto the windowing strategy involved two parts: firstly, removing any overhang (in this case, ''Non-apnoea'' overhang rows were removed to match the ''Apnoea'' row size), and secondly, merging these two files together to create a balanced dataset.
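The overhang removal and merge described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name is hypothetical, the overhang is assumed to be trimmed from the end of the longer class (the text does not say which rows were dropped), and the label values 1 for Apnoea and 0 for Non-Apnoea follow the convention given in the windowing section.

```python
def balance_by_truncation(apnoea_rows, non_apnoea_rows):
    """Trim the longer class to the length of the shorter one, then stack
    Apnoea rows (label 1) on top of Non-Apnoea rows (label 0)."""
    n = min(len(apnoea_rows), len(non_apnoea_rows))
    # Assumption: overhang rows are dropped from the end of the longer class.
    rows = list(apnoea_rows[:n]) + list(non_apnoea_rows[:n])
    labels = [1] * n + [0] * n
    return rows, labels


rows, labels = balance_by_truncation([10, 11, 12], [20, 21, 22, 23, 24])
print(rows, labels)
```

Because only real rows are kept, the classes are balanced without generating synthetic samples.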

1080 VOLUME 12, 2024
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

4) DATASET WINDOWING STRATEGY
At this stage, the newly built dataset was reshaped into 5 separately balanced datasets of specific window sizes: 500, 1,000, 1,500, 2,000 and 2,500. Each newly formed dataset contained the same amount of Apnoea and Non-Apnoea, approx. 37 million samples per group (approx. 75 million samples in total). The Non-Apnoea rows were labelled as '0' and filled the bottom half of the dataset, and the Apnoea rows, labelled as '1', filled the top half of the dataset. Table 7 provides a view of the structure of each of the 5 newly constructed balanced datasets. It shows the number of columns (window size), n/rows (samples), n/Apnoea samples and n/Non-Apnoea samples.
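The windowing strategy described above can be sketched as a reshape of a one-dimensional signal into fixed-width rows. This is a minimal illustration under assumptions that the text does not state: windows are taken as consecutive and non-overlapping, and any trailing samples that do not fill a whole window are dropped. The function names are hypothetical.

```python
WINDOW_SIZES = (500, 1000, 1500, 2000, 2500)  # window sizes named in the text


def window_signal(samples, window_size):
    """Split a 1-D sequence into consecutive, non-overlapping rows of
    `window_size` samples (assumption: the trailing remainder is dropped)."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    n_rows = len(samples) // window_size
    return [list(samples[i * window_size:(i + 1) * window_size])
            for i in range(n_rows)]


def build_windowed_dataset(apnoea, non_apnoea, window_size):
    """Apnoea rows (label 1) fill the top half, Non-Apnoea rows (label 0) the bottom."""
    apnoea_rows = window_signal(apnoea, window_size)
    normal_rows = window_signal(non_apnoea, window_size)
    rows = apnoea_rows + normal_rows
    labels = [1] * len(apnoea_rows) + [0] * len(normal_rows)
    return rows, labels


# Toy usage with a window of 3 instead of 500 to keep the output readable.
rows, labels = build_windowed_dataset(list(range(6)), list(range(100, 106)), 3)
print(rows, labels)
```

A smaller window yields more rows from the same signal, which is consistent with the text's point that smaller windows provide more observations for training.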

B. MACHINE LEARNING ALGORITHMS
This section presents the three machine learning algorithms used in this study. It first describes the proposed algorithm, namely the 1DCNN classification model, in terms of how it was constructed, its architecture and how it works. To further evaluate the 1DCNN model, the section then describes two alternative classification models used for comparison experiments, namely Random Forest Classifier (RFC) and Support Vector Machine (SVM). The same datasets, training and validation were applied to all the models.

1) 1DCNN ARCHITECTURE AND EVALUATION
Over the last decade Convolutional Neural Networks (CNNs) have become a highly recognised and very popular means of performing machine learning tasks. This is mainly because CNNs are often more powerful and accurate than traditional machine learning algorithms. Similar to Artificial Neural Networks (ANNs), CNNs use the feed-forward technique. The most common types of CNN are the 2- and 3-dimensional models, which achieve very high accuracy when dealing with complex image processing tasks. However, in recent years the state-of-the-art 1DCNN has become a desirable choice for classification tasks, particularly where time-series data is used. 1DCNNs have also been shown to work well with one-dimensional arrays, providing excellent feature extraction capabilities and thus avoiding the need for domain expertise.
Figure 1 shows the architecture of the 1DCNN. The model was developed using the Python programming language combined with the high-level Keras API, with the open-source platform TensorFlow as its backend. The model is constructed using a number of key layers and important functions, including an Input layer, a Convolutional layer, a Max Pooling layer and a Fully Connected Multilayer Perceptron (MLP) consisting of 1 Hidden layer and a Softmax output layer, together with a ReLU activation function and the ADAM optimiser with Back Propagation. At the input layer a set of neurons feeds in the pre-processed time-series single-lead ECG signal data, in batches dictated by the batch size. The convolutional layer, whose kernel size is pre-set as a hyperparameter, then slides across the input data extracting the most prominent features. These features are then built into feature maps, the number of which is set by the filter hyperparameter. To further assist the convolutional process at this stage, a Max Pooling layer is incorporated. This layer summarises the captured features, thus reducing overfitting and computation whilst increasing the overall performance of the model.
The penultimate section of the 1DCNN is the Fully Connected Multilayer Perceptron (MLP). Here the newly formed output is first received by an input layer before being propagated forward to the hidden layer. The hidden layer contains the activation function (ReLU), which transforms the input before passing it through to the final layer, the softmax output. The purpose of the softmax activation function is to convert the outputs into class probabilities that sum to 1, which are then used for classification. Training within the MLP is controlled by the method 'back propagation of error' using the ADAM optimiser algorithm; this method iterates over the layers, continuously training the network by updating the neuron weights, thus minimising the difference between the predicted output and the actual output.
The following describes the mathematical functions of many of the essential components within the 1DCNN and how their specific equations can be defined.
In (1) the convolutional function of the 1DCNN layers is represented by y = conv1d(x, w, b). The input to this function is denoted by x, the filters by w, the bias by b and the final output of the convolutional layers by y [35].
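To make the notation y = conv1d(x, w, b) concrete, the following pure-Python sketch applies a single filter to a single-channel signal. It is an illustration only: it assumes stride 1, no padding and the cross-correlation form commonly used in deep learning frameworks, none of which is specified in the text, and it is not the Keras implementation used in the study.

```python
def conv1d(x, w, b=0.0):
    """Slide filter `w` across signal `x` (stride 1, no padding) and add bias `b`.

    Each output value is the weighted sum of one window of the input.
    """
    k = len(w)
    if k == 0 or k > len(x):
        raise ValueError("filter must be non-empty and no longer than the input")
    return [sum(w[j] * x[i + j] for j in range(k)) + b
            for i in range(len(x) - k + 1)]


# A difference-like filter responds to changes in the signal.
print(conv1d([1, 2, 3, 4], [1, 0, -1]))
```

With an input of length N and a kernel of length K, this form produces N − K + 1 output values per filter.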
The Max Pooling layer function is a primary process within the CNN. In (2) the Max Pooling formula uses stride values sx, sy and a pooling window defined by filter sizes fx, fy, applied to each channel k. It operates by moving across the input X, capturing the highest-valued feature within each window and outputting it at position i, j. By reducing overfitting and computation, this increases the overall performance of the model, and it can be defined as in (2) [36].
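A one-dimensional simplification of the pooling operation described above is sketched below. The text gives the general formula with two spatial strides and window sizes; this illustration keeps a single axis, and the function name and the default of stride equal to pool size are assumptions.

```python
def max_pool1d(x, pool_size, stride=None):
    """Keep the largest value in each window of `pool_size` samples.

    Assumption: when no stride is given, windows do not overlap
    (stride equals pool_size).
    """
    if stride is None:
        stride = pool_size
    if pool_size <= 0 or stride <= 0:
        raise ValueError("pool_size and stride must be positive")
    return [max(x[i:i + pool_size])
            for i in range(0, len(x) - pool_size + 1, stride)]


print(max_pool1d([1, 3, 2, 5, 4, 0], pool_size=2))
```

Pooling with a window of 2 halves the length of the feature map, which is where the saving in computation comes from.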
The aim of a Fully Connected Multilayer Perceptron is to continuously recalculate and adjust the weight parameters through each layer and at each convolution. In (3) the operation of this function is shown as y = fullyCon(x, w, b), where x denotes the input, w the weights, b the bias and y the outputs [35].
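The forward pass of a fully connected layer, y = fullyCon(x, w, b), can be sketched as a weighted sum per output neuron. This is a minimal illustration in plain Python; the row-per-neuron layout of the weight matrix is an assumption made for readability.

```python
def fully_connected(x, weights, biases):
    """Compute y[j] = sum_i weights[j][i] * x[i] + biases[j] for every output neuron j."""
    if len(weights) != len(biases):
        raise ValueError("one bias is needed per output neuron")
    outputs = []
    for row, bias in zip(weights, biases):
        if len(row) != len(x):
            raise ValueError("each weight row must match the input length")
        outputs.append(sum(w * xi for w, xi in zip(row, x)) + bias)
    return outputs


print(fully_connected([1.0, 2.0], [[1.0, 0.0], [0.5, 0.5]], [0.0, 1.0]))
```

During training it is these weights and biases that the optimiser adjusts.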
Equation (4) shows a mathematical representation of how the ReLU (Rectifier) activation function can be defined. This function is integral to the training and performance of the 1DCNN. Its role is to transform the weighted inputs x from each node and pass the outputted results ReLU(x) to the final layer [37].
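The rectifier is commonly defined as ReLU(x) = max(0, x). A minimal sketch applied element-wise to a list of weighted inputs:

```python
def relu(values):
    """Replace every negative input with 0 and leave non-negative inputs unchanged."""
    return [max(0.0, v) for v in values]


print(relu([-2.0, 0.0, 3.5]))
```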
In (5) the softmax function S is the final activation function of the 1DCNN. Its purpose is to improve classification output for the number of classes n. This is achieved by taking an input vector of real numbers y_i and applying an exponential function followed by normalisation, which converts these real numbers into probabilities that each lie between 0 and 1 and sum to 1 [35].
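The softmax described above is commonly written as S(y_i) = exp(y_i) / Σ_j exp(y_j). The sketch below is illustrative; subtracting the maximum before exponentiating is a standard numerical-stability step added here and does not change the result.

```python
import math


def softmax(logits):
    """Convert a list of real numbers into probabilities that sum to 1."""
    m = max(logits)  # shift for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

For the two-class Apnoea / Non-Apnoea problem, the predicted class is the one with the larger probability.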

2) RANDOM FOREST CLASSIFIER ARCHITECTURE AND EVALUATION
The Random Forest Classifier first came into prominence approximately 20 years ago. It is a supervised machine learning algorithm, primarily used with non-linear classification tasks. Random Forests are constructed using an ensemble of decision tree classifiers, given in (6) as $\{h(x, \theta_k),\ k = 1, \ldots\}$, where each tree depends on the values of a random vector $\theta_k$ sampled independently and with the same distribution for all trees in the forest, and each tree casts a vote for the most popular class at input x. The final decision is then made by aggregating the outputs of the individual trees [38].
The RFC built to train on the datasets was constructed using the hyperparameters n_estimators, Random_State and Gini Index. n_estimators determines the number of decision trees used within a forest to predict an outcome; for the RFC, this is set to 500. Random_State controls the randomness of the data for training and testing; setting this hyperparameter to 42 ensures stable, repeatable results. The Gini Index (7) is a measure of the impurity of a sample set S. It is the probability of incorrectly labelling a randomly selected sample, where Pi is the probability of class i among the k classes [39]. The Gini Index improves classification by decreasing the numerical value of feature importance at each node within a decision tree. Further to this, the Gini Index helps to provide quicker computations [40].
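The Gini Index is commonly written as Gini(S) = 1 − Σ_i P_i², where P_i is the fraction of samples in S belonging to class i. A minimal sketch of that calculation follows; the function name is illustrative and this is not the library routine used by the RFC.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the class proportions p_i in `labels`."""
    n = len(labels)
    if n == 0:
        raise ValueError("labels must not be empty")
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini_impurity([0, 0, 1, 1]))  # evenly mixed node
print(gini_impurity([1, 1, 1]))     # pure node
```

A pure node scores 0 and an evenly mixed two-class node scores 0.5, so a decision tree prefers splits that lower this value.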

3) SUPPORT VECTOR MACHINE ARCHITECTURE AND EVALUATION
The Support Vector Machine has developed into a very robust and well-established supervised machine learning algorithm, which is primarily associated with classification tasks. Through the mathematical functionality of kernels and the calculation of margins between plotted datapoints in a high-dimensional space, the SVM finds the most meaningful hyperplane that enables it to separate one class from another [41].
The SVM built to compare against the 1DCNN was constructed to train on the datasets using the hyperparameters Random_State and the RBF_Kernel (Radial Basis Function). The RBF_Kernel helps to make better classification decisions when training on non-linear data. It is based on the Gaussian distribution kernel, which calculates the similarity or closeness of two fixed points. In (8), the similarity of the fixed points {X1, X2} is calculated using the decision boundary parameter γ; the RBF_Kernel K maps the input data into a high-dimensional space, thus enabling the SVM to find the best position of the hyperplane for classification [42].
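The RBF kernel is commonly written as K(X1, X2) = exp(−γ‖X1 − X2‖²). The sketch below uses that form as an assumption, since the exact parameterisation of (8) is not reproduced in the text, and the function name is illustrative.

```python
import math


def rbf_kernel(x1, x2, gamma):
    """Similarity of two points: 1.0 when identical, approaching 0 as they move apart."""
    if len(x1) != len(x2):
        raise ValueError("points must have the same number of dimensions")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * squared_distance)


print(rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5))
print(rbf_kernel([0.0, 0.0], [1.0, 0.0], gamma=1.0))
```

A larger γ makes the similarity fall off more quickly with distance, which produces a more tightly fitted decision boundary.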

C. TRAINING THE 1DCNN MODEL
The model was trained using the 5 uniquely designed datasets with different window sizes (W=500 through to W=2500). This included running large volumes of detailed experiments using various numbers of layers and hyperparameters (n_Filters, k_Size, Batch_Size, Epochs) to find the optimum performance of each model. These are discussed in more detail in the ''Experiments and Results'' section.

D. PERFORMANCE METRICS
Performance metrics are critical gauges for evaluating how well an ML model is working. This section briefly describes all of the metrics used in the evaluation of the three ML models (1DCNN, RFC & SVM).

1) CONFUSION MATRIX
The Confusion Matrix is a visualisation tool used to measure the performance of a classification model. Represented as a table of predicted and true classes, it summarises the performance and facilitates the calculation of other metrics, including Recall, Precision, Accuracy, F1 score and the AUC-ROC curve.
Its four cells are True Positives (TP), where the actual value is Positive and the prediction is also Positive; True Negatives (TN), where the actual value is Negative and the prediction is also Negative; False Positives (FP), where the actual value is Negative but the prediction is Positive; and False Negatives (FN), where the actual value is Positive but the prediction is Negative.
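The four cells defined above can be counted directly from paired true and predicted labels. This is a minimal sketch; the function name is illustrative and the positive class is assumed to be label 1 (Apnoea), in line with the labelling used for the datasets.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classification task."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1
        elif actual != positive and predicted != positive:
            tn += 1
        elif actual != positive and predicted == positive:
            fp += 1
        else:
            fn += 1
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


print(confusion_counts([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```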

2) VALIDATION LOSS AND VALIDATION ACCURACY METRICS
The Validation Loss and Validation Accuracy metrics function in a similar way to the loss and accuracy metrics by evaluating the quality and performance of the model. However, the validation loss metric is measured after each epoch. Furthermore, the validation loss metric does not signal the model to update the weights at each passing.

3) SENSITIVITY AND SPECIFICITY METRICS
The function of the Sensitivity and Specificity metrics is to demonstrate the accuracy of a classification test. They are calculated from how well the presence (sensitivity) or absence (specificity) of an instance is identified. The Area under the ROC Curve (AUC) is a visual representation of a model's performance and accuracy. The ROC curve plots sensitivity (true positive rate) against 1 − specificity (false positive rate), and the AUC measures the ability of a model to distinguish between the two classes. This measurement is achieved by using a ranking system, which scores the separate classes on a scale of 0 to 1. The higher the AUC, the better the model is at prediction and class separability.
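The metrics above can be sketched from the confusion-matrix counts, using the common definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The AUC function below uses the ranking interpretation: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. All function names are illustrative.

```python
def sensitivity(tp, fn):
    """True positive rate: share of actual positives that were detected."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: share of actual negatives that were correctly rejected."""
    return tn / (tn + fp)


def auc_by_ranking(scores, labels):
    """AUC as the probability that a positive outranks a negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(sensitivity(tp=45, fn=5), specificity(tn=40, fp=10))
print(auc_by_ranking([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

The pairwise form is quadratic in the number of samples, so it suits small illustrations rather than datasets of the size used in this study.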

7) LOSS & ACCURACY METRICS
These two metrics are calculated very differently; however, they both indicate how well the model is learning as training progresses. On each batch iteration of the training set, the loss metric calculates the sum of errors (bad predictions) and thereby indicates how well or badly the model is performing. Through calculation of these sums the model continually attempts to improve its performance by altering the neuron weights to reduce the cost function at each passing. The lower the loss, the better the model. The function of the Accuracy metric is to evaluate the model's performance in an interpretable way. It calculates the number of correctly classified predictions against the total number of predictions, and can be defined as in (16) [44].
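Accuracy is commonly computed from the confusion-matrix counts as (TP + TN) / (TP + TN + FP + FN). The sketch below uses that form; the counts in the usage line are hypothetical and are not results from this study.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Hypothetical counts, for illustration only.
print(accuracy(tp=45, tn=40, fp=10, fn=5))
```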

8) KAPPA_SCORE
Another accuracy indicator is the Kappa score. This metric demonstrates the level of agreement between two raters on a classification problem. The closer the score is to 1, the better the agreement between the raters and the better the model is at classification. The Kappa score is calculated from Po (accuracy) and Pe (expected accuracy) as (Po − Pe) / (1 − Pe), where Po is the amount of observed agreement in relation to the total number, Pe is the probability of chance agreement and 1 − Pe normalises the score to the kappa value range, from −1 (no agreement) to +1 (complete agreement) [45].
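Cohen's kappa, κ = (Po − Pe) / (1 − Pe), can be worked through from binary confusion-matrix counts as sketched below. Here the two "raters" are taken to be the true labels and the model's predictions; the counts in the usage line are hypothetical and the function name is illustrative.

```python
def cohen_kappa(tp, tn, fp, fn):
    """Kappa from binary confusion counts: (Po - Pe) / (1 - Pe)."""
    n = tp + tn + fp + fn
    if n == 0:
        raise ValueError("at least one prediction is required")
    p_observed = (tp + tn) / n
    # Chance agreement: both say positive, or both say negative, by coincidence.
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical counts, for illustration only.
print(cohen_kappa(tp=45, tn=40, fp=10, fn=5))
```

With these counts Po is 0.85 and Pe is 0.50, so the score is 0.35 / 0.50 = 0.70.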

9) LOG_LOSS
A further accuracy indicator is Log Loss, or Binary Cross-Entropy Loss. Based on probabilities, it measures the accuracy of a classification model whose output is a value between 0 and 1. It achieves this by comparing the predicted probability to the actual result. The closer these two values are, the smaller the log loss becomes and the more accurate the model is at classification. In this formula p̂ is the probability of class 1, and (1 - p̂) is the probability of class 0 [46].
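A minimal sketch of the standard binary cross-entropy calculation is given below. The clipping constant eps is an assumption added here to avoid taking the logarithm of zero, and the example probabilities are hypothetical.

```python
import math


def log_loss(y_true, p_hat, eps=1e-15):
    """Mean binary cross-entropy: -(1/N) * sum(y*log(p) + (1-y)*log(1-p))."""
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Confident, correct predictions give a smaller loss than hesitant ones.
labels = [1, 0, 1, 0]
print(log_loss(labels, [0.9, 0.1, 0.8, 0.2]))
print(log_loss(labels, [0.6, 0.4, 0.6, 0.4]))
```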

10) MACRO AVERAGE
Macro averaging reduces the multiclass predictions down to multiple sets of binary predictions. It then calculates the corresponding metric for each of the binary cases before averaging the results together.

11) WEIGHTED AVERAGE
Weighted average is a calculation that takes into account the varying degrees of importance of the numbers in a dataset.
In calculating a weighted average, each number in the data set is multiplied by a predetermined weight before the final calculation is made.
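The difference between the two averages can be sketched as follows. The per-class recall values are those quoted for RFC-500 in the abstract; the class supports (sample counts per class) used as weights are hypothetical and chosen only to make the two averages differ, since the datasets in this study are balanced.

```python
def macro_average(values):
    """Unweighted mean of the per-class metric values."""
    return sum(values) / len(values)


def weighted_average(values, supports):
    """Mean of the per-class values, each weighted by its class support."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


recall_per_class = [0.90, 0.94]   # class 0, class 1
supports = [300, 100]             # hypothetical number of samples per class

print(macro_average(recall_per_class))
print(weighted_average(recall_per_class, supports))
```

With equal supports the two averages coincide, which is the situation for a balanced dataset.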

IV. EXPERIMENTS AND RESULTS
This section evaluates the effectiveness of all models by presenting their results for training and validation. The section looks at three main areas. Firstly, Subsections A, B and C present the best performing model from each group (1DCNN, RFC and SVM). Following this, Subsection D (Tables 11-13) shows the results for all 15 models. Finally, Subsection E (Table 14) presents a comparison of the classification results of the proposed model against our previous study and other OSA studies discussed earlier in Section II, Related Works. Each experiment was run on the same computer with the same specification: Intel i7 processor, Nvidia GTX 1080 and 16 GB RAM. The main objective of these experiments is to find the model that most consistently produces the best performance, using the least computational power and in the quickest time. A total of 15 models, 5 for each group (1DCNNs, RFCs, SVMs), were part of this experiment. Each model was run numerous times through training and validation. The first models to be assessed were the 1DCNNs. This was conducted by running separate experiments using the 5 pre-built balanced datasets (W=500 through to W=2500). For each of these experiments the data was split into 72% training, 20% testing and 8% validation. These sizes are calculated based on the amount of data contained within each dataset. The same experiments, using the same datasets, were then performed on the RFC and SVM models. Performance of each experiment was measured using a variety of common metrics, presented earlier in Performance Metrics. Tables 8 through to X and Figures 2-6 show the best performing model of each group (1DCNN, RFC, SVM), its optimum configuration (inputs) and results (outputs) and, where available, a confusion matrix.
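A 72% / 20% / 8% split of the rows of a windowed dataset can be sketched as below. This is an illustrative assumption about how such a split might be drawn (shuffled row indices with a fixed seed); the function name and the seed value are hypothetical and the paper does not specify the exact procedure.

```python
import random


def split_indices(n_rows, test_frac=0.20, val_frac=0.08, seed=42):
    """Shuffle row indices and cut them into train / validation / test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # reproducible shuffle (seed is an assumption)
    n_test = round(test_frac * n_rows)
    n_val = round(val_frac * n_rows)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]       # the remainder, about 72% of the rows
    return train, val, test


train, val, test = split_indices(1000)
print(len(train), len(val), len(test))     # 720 80 200
```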

A. ONE-DIMENSIONAL CONVOLUTIONAL NEURAL NETWORK CONFIGURATION AND RESULTS
This section assesses the best performing 1DCNN model (1DCNN-500) after training and validation. It shows the model's hyperparameter configuration (Table 8), along with graphical representations of accuracy, loss and ROC AUC results (Figures 2-4).

1) 1DCNN CONFIGURATION AND RESULTS
Table 8 presents both the inputs and outputs for the 1DCNN-500 model when running the W=500 dataset. Applying inputs of 150 filters (n_Filters), a kernel size (k_Size) of 150 and a peak threshold Batch_size of 8192, run over 50 epochs, was empirically found to return the best results. These results are listed in the 'Results' column, which shows

$$\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{True Positives} + \text{True Negatives} + \text{False Positives} + \text{False Negatives}} \tag{16}$$

B. RANDOM FOREST CONFIGURATION AND RESULTS
This section assesses the best performing RFC model (RFC-500) after training and validation. It shows the model's hyperparameter configuration and Confusion Matrix (Table 9), along with a graphical representation of the ROC AUC results (Figure 5).

1) RFC CONFIGURATION AND RESULTS
In Table 9, the Configuration column presents the optimal configuration for the RFC-500 model. This model was the best performing of the RFC models. Using an n_estimator size (number of decision trees) of 500 and a random_state of 42 was found to return the best results. Examining the Confusion Matrix values, which represent the number of correct classification predictions over the total number of classification predictions, as well as the calculated scores for Precision, Recall (Sensitivity) and Accuracy, gives a good measure of the classification model's performance by providing a measure of misclassified instances. Misclassifications are typically the result of noise in the dataset. The results are good, with FP at about 3% and FN at about 5%. Overall this model has performed to a very good standard. However, producing this level of performance incurred drawbacks, notably time consumption. The higher the number of decision trees (n_estimator), the higher the accuracy; however, increasing this value also increased the time the experiment took to complete. Moreover, values above 500 decision trees did not show any further improvement.
Figure 5 presents the AUC graph for the RFC-500 model results in Table 9. Although not as tight to the top-left corner as the 1DCNN AUC in Figure 4, this is still a very good score and demonstrates that this model is good at distinguishing between the two classes.
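The relationship between a binary confusion matrix and the per-class Precision, Recall and Accuracy scores can be sketched as follows. The counts used here are hypothetical, chosen so that FP is 3% and FN is 5% of all predictions; they are not the values reported in Table 9.

```python
def class_report(tn, fp, fn, tp):
    """Per-class precision/recall and overall accuracy from a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "precision_0": tn / (tn + fn),   # of all predicted 0, how many are truly 0
        "recall_0": tn / (tn + fp),      # specificity
        "precision_1": tp / (tp + fp),   # of all predicted 1, how many are truly 1
        "recall_1": tp / (tp + fn),      # sensitivity
        "accuracy": (tp + tn) / total,
        "fp_share": fp / total,          # misclassified as apnoea
        "fn_share": fn / total,          # missed apnoea
    }


# Hypothetical counts per 100 predictions.
report = class_report(tn=45, fp=3, fn=5, tp=47)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```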

C. SUPPORT VECTOR MACHINE CONFIGURATION AND RESULTS
This section assesses the best performing SVM model (SVM-500) after training and validation. It shows the model's hyperparameter configuration and Confusion Matrix (Table 10), along with a graphical representation of the ROC AUC results (Figure 6).

1) SVM CONFIGURATION AND RESULTS
In Table 10, the Configuration column presents the optimal configuration for model SVM-500. For this model, using the Radial Basis Function (RBF) kernel and a Random State of 42 was empirically found to return the best results using this dataset. The overall performance of this model (Results column) is moderate. Classification within the confusion matrix is unbalanced, particularly given the large number of False Negatives produced.
Figure 6 presents the AUC graph for the SVM-500 model results in Table 10. When compared to the 1DCNN and RFC, this model performed quite poorly, both in classification of the two classes and in the time taken to complete the task.

D. COMPLETE LIST OF RESULTS
The three tables (Table 11, Table 12 and Table 13) present the training and validation results for the execution of the 15 models, averaged over 50 runs. They show the optimal architecture for each model for the automatic detection of OSA using single-lead ECG signals. Table 11 shows the 5 1DCNN configuration hyperparameters and results, Table 12 shows the 5 RFC configurations and results, and Table 13 shows the 5 SVM configuration hyperparameters and results.
When evaluating the whole set of results across the three tables, it is clear that the 1DCNN models outperform both the RFC and SVM models. Particularly notable is the high Sensitivity and Specificity across the 1DCNN models, where results range from very good to excellent with well-balanced classification.
Further comparison shows the 1DCNN models produce results in significantly less time and using less computational power. Experiment times differ significantly between the 1DCNN models (3 to 4 minutes) and the RFC models (1+hrs, up to 1.40hrs), and even more so for the SVM models (10+hrs, up to 17hrs). Moreover, looking at the 1DCNN results from the bottom (No.5) to the top (No.1), performance increases slightly each time. This pattern coincides with the novel dataset windowing strategy: windows with more rows and fewer columns show a gradual increase in performance. The same pattern is also evident in both the RFC and SVM experiments. Additional window dimension testing showed the minimum threshold was at around 500 columns; beyond this point results did not improve significantly.
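The trade-off between rows and columns can be sketched by reshaping a one-dimensional signal into fixed-width windows. This is an illustrative assumption about the windowing step, not the authors' implementation: the function name is hypothetical, the signal is a placeholder, and dropping the trailing incomplete window is a choice made only for this sketch.

```python
def window_signal(samples, width):
    """Cut a 1-D signal into consecutive rows of `width` samples (columns)."""
    n_rows = len(samples) // width          # trailing partial window is dropped
    return [samples[r * width:(r + 1) * width] for r in range(n_rows)]


signal = list(range(10_000))                # placeholder for an ECG sample sequence
for width in (500, 1000, 1500, 2000, 2500):
    rows = window_signal(signal, width)
    print(f"W={width}: {len(rows)} rows x {width} columns")
```

For a fixed number of samples, the smallest window (W=500) yields the most rows, that is, the most training examples.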
Training and learning of the 1DCNN models were shaped by the hyperparameters. Their importance is evident in the wide variation of configurations between models. For the best performing model, 1DCNN-500, reducing and balancing both the k_size and n_filter dimensions for convolving and output, scaling up the Batch_size for training, and minimizing epoch iterations for updating learning values was empirically found to return the best results. However, for the 1DCNN-2500 model, which still performed very well but with slight signs of overfitting, using a small kernel and a large filter output with a scaled-up batch size was empirically found to return the best results.
Training and configuration of the RFC models were more straightforward. The focal hyperparameter was n_estimators, which sets the number of decision trees used within the random forest when running the model. The number of decision trees dictates both the performance and the duration of an experiment. The Random_state hyperparameter controls the randomness of the data for training and testing, and hence the stability of the results. Experiments ranged from 100 estimators, with fairly short durations, through to 2000 estimators, which took many hours. To reduce lengthy testing durations without dropping model performance, the Gini Index was chosen over Entropy, since Gini produces results more quickly and with less computational power. Further attempts to improve results included setting the max_depth hyperparameter; however, this only succeeded in increasing overfitting. The sweet spot for this model was 500 estimators; anything higher only increased the testing duration, not the results.
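The two split criteria mentioned above can be compared with a short sketch of the standard impurity formulas. Gini needs only squares of the class proportions, whereas Entropy requires a logarithm per class, which is the usual explanation for Gini being cheaper to evaluate. The function names and node counts are hypothetical.

```python
import math


def gini(counts):
    """Gini impurity of a node: 1 - sum(p_c^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy of a node in bits: -sum(p_c * log2(p_c))."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Class counts (normal, apnoea) in three hypothetical tree nodes.
for node in ([50, 50], [90, 10], [100, 0]):
    print(node, round(gini(node), 4), round(entropy(node), 4))
```

Both measures are zero for a pure node and maximal for an evenly mixed one, so they tend to rank candidate splits similarly.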
Of the 3 groups, the tables show the SVMs were the worst performing models in both results and duration of testing, taking up to 17hrs to complete. Finding the optimum performance of the SVM involved the RBF kernel hyperparameter for classification and Random_State to control the randomness of the data for training and testing, along with the stability of the results. Furthermore, attempts were made to improve the results of some SVM experiments by adjusting the Gamma hyperparameter; however, these had no positive effect.

E. COMPARISON STUDY AGAINST OTHER OSA DETECTION METHODS
Table 14 provides a comparison of the results of the proposed model against other state-of-the-art classification methods, discussed earlier in Section II, Related Works. These studies present a range of different ML and DL approaches. All 1DCNN models, including our current and previous models, performed better than the others, suggesting this is an excellent approach for the detection of OSA. The LSTM-RNN model of R. Pathinarupothi et al. does show slightly higher scores; however, as previously mentioned, the authors acknowledge that training and testing were performed on small portions of the Apnoea-ECG database.

V. DISCUSSION
This study set out with two main objectives: firstly, to evaluate the 1DCNN model and, secondly, to compare it against other classification models (RFC and SVM). The 1DCNN model was constructed using state-of-the-art techniques in 1DCNNs, consisting of a Convolutional Layer, a Max Pooling Layer and a fully connected Multilayer Perceptron (MLP), which included a hidden layer and a SoftMax output for classification. It was found that using multiple convolutional layers gave no improvement to the 1DCNN model. Using one layer decreased both complexity and load and empirically provided the best classification performance.
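To make the data flow through these layers concrete, the following NumPy sketch runs a single W=500 window through one convolutional layer, max pooling, flattening, one hidden layer and a SoftMax output. It is a shape illustration with random, untrained weights, not the authors' implementation: the kernel size and filter count follow the 1DCNN-500 configuration quoted above, while the ReLU activations, pool size of 2 and hidden width of 64 are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def init_params(width=500, k_size=150, n_filters=150, hidden=64, pool=2, seed=0):
    """Random (untrained) weights, only to make the layer shapes concrete."""
    rng = np.random.default_rng(seed)
    steps = (width - k_size + 1) // pool       # time steps left after pooling
    flat = steps * n_filters                   # length of the flattened vector
    return {
        "conv_w": rng.normal(0.0, 0.01, (k_size, n_filters)),
        "conv_b": np.zeros(n_filters),
        "hid_w": rng.normal(0.0, 0.01, (flat, hidden)),
        "hid_b": np.zeros(hidden),
        "out_w": rng.normal(0.0, 0.01, (hidden, 2)),
        "out_b": np.zeros(2),
    }


def forward(window, params, pool=2):
    """Conv1D -> max pooling -> flatten -> hidden layer -> SoftMax for one window."""
    k_size = params["conv_w"].shape[0]
    patches = sliding_window_view(window, k_size)             # (W - k + 1, k)
    conv = np.maximum(patches @ params["conv_w"] + params["conv_b"], 0.0)
    steps = conv.shape[0] // pool
    pooled = conv[:steps * pool].reshape(steps, pool, -1).max(axis=1)
    flat = pooled.reshape(-1)
    hidden = np.maximum(flat @ params["hid_w"] + params["hid_b"], 0.0)
    logits = hidden @ params["out_w"] + params["out_b"]
    exp = np.exp(logits - logits.max())                       # stable SoftMax
    return exp / exp.sum()


params = init_params()
window = np.random.default_rng(1).normal(size=500)            # placeholder ECG window
print(forward(window, params))                                # two class probabilities
```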
It was decided to perform comparison experiments against more traditional ML algorithms, RFC and SVM, since they are well known for solving binary classification problems. The main objective of this comparison was to find the model that most consistently produces the best performance, in the quickest time and using the least computational power.
All the models were evaluated using a well-received dataset containing approx. 216hrs of segmented single-lead ECG time-series signals obtained from 35 subjects. Over 70hrs of non-apnoea segments were initially removed to balance the dataset at approx. 108hrs for each group (apnoea/non-apnoea). To ensure fairness of testing, the segments were grouped into a single balanced dataset containing approx. 35 million samples of Apnoea and 35 million samples of Non-apnoea (Normal).
Changes and limitations to the acquired dataset. In the data of the original evaluation study, the scoring of apnoeas and hypopnoeas was done according to standard criteria, where the numbers of apnoeas and hypopnoeas were marked and scored separately using the Apnoea Index (AI) and Hypopnoea Index (HI). For the dataset used in this study, all marking and scoring was done by an expert sleep specialist in a different way. This new marking and scoring method did not differentiate between apnoea and hypopnoea. The result of the scoring was a set of markings for the beginning and end of episodes of disordered breathing. An episode of disordered breathing may contain a single apnoea or hypopnoea, or a longer sequence of apnoeas and hypopnoeas. The markings were mapped to time with a resolution of one minute. Therefore, it is unknown exactly how much of each scored minute is occupied by apnoea and/or hypopnoea, whether fully or partially. The final result of the scoring was a binary outcome for each minute of the recording, coded as either "normal breathing" (N) or "disordered breathing" (A). The total number of minutes spent in apnoeas or hypopnoeas was determined for each recording. All scoring was assessed against the Apnoea-Hypopnoea Index (AHI).
The novel idea of reshaping the dataset into 5 different window sizes provided the opportunity to improve training and evaluation of the models. The results showed that using different sizes affected the performance of each model. Results appear to track window size: more rows with fewer columns generally produced better performance; however, this increase seemed to plateau once the window was reduced to approx. 500 columns.
Figure 7 presents comparable ROC AUC curves for the best performing model from each group (1DCNN, RFC, SVM) using the W=500 windowed dataset. 1DCNN-500 produces the best performance. The two best performing models overall (1DCNN-500 and 1DCNN-1000) produced excellent classification results. Interestingly, the other 3 CNN models (1DCNN-1500, 2000 and 2500), which also performed to a very good standard with only slight signs of overfitting, found their optimum performance using almost polar-opposite hyperparameter configurations to the best performing models. Adding dropout layers to each of these 3 models could improve performance.
The RFC models produced very good results overall, with some overfitting. However, the main drawback of this model was the duration of experiments, sometimes taking over 1hr to complete. Attempts to speed up this process by adjusting hyperparameters and reducing estimator values had very limited scope before results started to decrease dramatically.
The overall showing of the SVMs was poor, in terms of both performance and results. Classification was very unbalanced, and the experiments were extremely slow, with some taking almost 17hrs to complete.
The limitations and drawbacks shown by some RFC and SVM results could be associated with the type of data used for the experiments. RFCs and SVMs are often better suited to text analysis and, in the case of SVMs, to small datasets. Another point is the number of variables used in these experiments. RFC and SVM respond better to higher numbers of variables.
All the results presented in this study have demonstrated the complexity and value of the hyperparameter selection required to achieve optimal performance in the automated detection of OSA.

VI. CONCLUSION
Obstructed Sleep Apnoea is a worldwide problem that affects 1 in every 5 people at any one time and will affect 1 in every 2 people at some stage in their lifetime. It is a condition that can develop into serious health complications, both physical and mental, and can lead to mortality. The global economic impact of OSA runs to billions of pounds per annum and is forecast to continue growing year on year. Traditional diagnosis techniques are not enough: over 80% of patients remain incorrectly diagnosed. In more recent years, newer OSA diagnostic solutions have emerged with some success, particularly in the area of ML algorithms; however, these innovative methods require extensive domain experience and time. Over the past decade, various supervised machine learning algorithms have been developed to better diagnose certain human conditions and illnesses.
The 1DCNN approach looked to address many of these issues, and it has demonstrated the capability to automatically detect instances of OSA from captured single-lead ECG signals. The study has also shown that the 1DCNN model provides greater classification accuracy, rapidity and robustness when compared to the other traditional ML algorithms.
This study has provided a view where the design and implementation of the 1DCNN system could deliver a support mechanism in clinical practice for the diagnosis of patients suffering with OSA. However, whilst the approach gives us confidence to perform such tasks, it will first require some important steps, to be published in future papers:
• Further evaluation and testing using an alternative ECG signal dataset (University College Dublin (UCD) dataset)
• Attain clinical approval to better evaluate this study in a clinical setting
• Tests using real-world data captured from test subjects
• Development of a frontend system to host and interact with the 1DCNN model when classifying uploaded ECG signals

FIGURE 1. Architecture of the one-dimensional convolutional neural network.

FIGURE 2. Graphical output results from the 1DCNN-500 model using dataset W=500, showing training and validation accuracy.

FIGURE 3. Graphical output results from the 1DCNN-500 model using dataset W=500, showing training and validation loss.

FIGURE 5. Graphical output results from the RFC-500 model using dataset W=500, showing AUC plot.

FIGURE 6. Graphical output results from the SVM-500 model using dataset W=500, showing AUC plot.

FIGURE 7. Comparable ROC AUC plot results for the best performing model from each group, using dataset size W=500.

Their model is split into 4 blocks. Blocks 1 and 2 each consist of a 1DCNN for feature extraction, block 3 consists of two LSTM networks to address the vanishing gradient problem and long-term dependency, and block 4 consists of two separate classifiers (Sigmoid and SVM). The model using the SVM classifier produced the best results. The model also achieves excellent scores of ACC 90.92%, SE 91.24%, SP 90.36% and F1 92.76% when compared to other studies, where they mainly used feature engineering techniques.

TABLE 2. Breakdown of the 35 subjects' recordings.

TABLE 3. Breakdown of the annotated files into two groups (Apnoea and Non-Apnoea).

TABLE 4. Merging of the segmented sample files into the two groups.

TABLE 5. Merging of the segmented sample files into six files.

TABLE 6. Merging of the segmented sample files into two files.

TABLE 7. 5 separately balanced datasets of specific window sizes.

Sensitivity measures the true-positive rate, what the model has correctly predicted, and Specificity measures the true-negative rate, again what the model has correctly predicted [43].

TABLE
RFC experiments -configuration and results.

TABLE 13. SVM experiments - configuration and results.

TABLE 14. Comparison of proposed model vs. other OSA ML and DL models.