A Comparative Performance Analysis of Data Resampling Methods on Imbalance Medical Data

Medical datasets are usually imbalanced, with negative cases severely outnumbering positive cases. It is therefore essential to deal with this data skew problem when training machine learning algorithms. This study uses two representative lung cancer datasets, PLCO and NLST, with imbalance ratios (the ratio of the number of samples in the majority class to that in the minority class) of 24.7 and 25.0, respectively, to predict lung cancer incidence. It compares the performance of 23 class imbalance methods (resampling and hybrid systems) with three classical classifiers (logistic regression, random forest, and LinearSVC) to identify the imbalance techniques best suited to medical datasets. Resampling includes ten under-sampling methods (RUS, etc.), seven over-sampling methods (SMOTE, etc.), and two integrated sampling methods (SMOTEENN, SMOTE-Tomek). Hybrid systems include ensemble methods (Balanced Bagging, etc.). The results show that class imbalance learning can improve the classification ability of the model. Compared with other imbalance techniques, under-sampling techniques have the highest standard deviation (SD), and over-sampling techniques have the lowest SD. Over-sampling is a stable approach, and the AUC it yields is generally higher than that of the other approaches. Using ROS, the random forest shows the best predictive ability and is the most suitable model for the lung cancer datasets used in this study. The code is available at https://mkhushi.github.io/


I. INTRODUCTION
In a class-imbalanced dataset, one of the classes has significantly fewer samples than the other [1]. Learning from such data poses inherent challenges. The skewed distribution of the training examples biases standard classifiers towards the majority class, so that they fail to detect rare instances [2], [3]. Rare minority samples may be treated as noise, and noise may be incorrectly identified as minority samples [4], [5]. This type of imbalance problem is common in the medical field: the number of normal samples in a dataset is often larger than that of abnormal samples, and the gap between the two is relatively large [6]. Researchers have developed various class imbalance methods and performance evaluation metrics to address these challenges, briefly discussed in Section II-A and Section II-B, respectively. The most commonly used abbreviations are presented in Table 1. To investigate class imbalance methods, we implemented them on two real-world class-imbalanced datasets: (i) the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial dataset and (ii) the National Lung Screening Trial (NLST) dataset. PLCO and NLST are high-profile datasets in the field of lung cancer, and many researchers have studied them [7], [8]. Both datasets contain anonymised clinical information from trial participants, including whether they have confirmed lung cancer. In these lung cancer datasets, the ratio of majority samples (participants without confirmed lung cancer) to minority samples (lung cancer patients) is around 25. Both are therefore class-imbalanced datasets on which class imbalance methods can be explored.
The associate editor coordinating the review of this manuscript and approving it for publication was Wentao Fan.

II. CLASS SKEWNESS IN DATA
Class skewness is a well-known problem in machine learning [9]. If the class distribution in the data is imbalanced, the machine learning model tilts towards the samples in the majority class and cannot give enough attention to the samples in the minority class. As a result, the model's output is biased towards the majority class [10], [11], and the accuracy of the classifier becomes unreliable because minority classes are not adequately considered. In the current field of machine learning, class skewness in data has led many scholars to pay attention to class-imbalanced learning [12], [13].

A. TYPES OF IMBALANCED METHODS
In the Biomedical Sciences, class imbalance methods have already been used in many applications, such as gene expression [14], medical diagnosis [15] and medical side effects [16]. Class imbalance methods can be classified into three categories: (i) data-level methods, (ii) algorithm-level methods and (iii) hybrid methods [17].

1) DATA-LEVEL METHODS
Data-level methods involve procedures applied to the training data to make the class distribution more balanced by reducing the number of samples in majority classes or increasing the number of samples in minority classes [18]. At present, data-level methods operate mainly in the data pre-processing stage, using resampling to redistribute the training data of different classes in the data space [19], [20]. This kind of method changes the structure of the dataset in order to balance the imbalanced classes. Some studies have shown that resampling can improve the model's ability to a certain extent by adjusting the class distribution of the samples [21], [22]. In the data-level method, resampling and the work of the classifier do not affect each other, which is also one of its advantages [23]. Resampling procedures can be further organised into (i) under-sampling, (ii) over-sampling and (iii) hybrid methods [24]. In the following, we briefly describe these methods.
In under-sampling methods, samples from the majority class are discarded until the number of samples in each class is nearly equal while preserving valuable information for learning [25], [26]. However, it is inevitable that when under-sampling the dataset, some samples that are meaningful to the training model may be ignored [27], [28]. After [38].
11) Condensed Nearest Neighbour (CNN): This method iterates with the nearest neighbour algorithm. One majority class sample and all the minority class samples are put together into a set C. Each remaining sample is classified with 1-NN on C, and wrongly classified samples are added to C. The process is repeated to determine which majority class samples are retained [39].
In over-sampling methods, new samples are created based on samples from the minority class to reach a more balanced class distribution while strengthening class boundaries [40], [41]. However, over-sampling may lead to overfitting because it duplicates or synthesises minority samples [42]. As the number of samples increases, the training time also increases [43]. Over-sampling methods include:
1) Random Over-Sampling (ROS): ROS is the earliest over-sampling technique developed. It copies random minority class samples to achieve a more balanced class distribution [44].
2) Adaptive Synthetic (ADASYN): This method uses a weighted distribution of the minority class samples based on their difficulty of learning. More synthetic samples are generated for minority samples that are harder to learn than for those that are easier to learn [45].
3) Synthetic Minority Over-Sampling Technique (SMOTE): Synthetic samples are generated by interpolating between each minority sample and its k Nearest Neighbours (kNN) [46].
4) Synthetic Minority Over-Sampling Technique - Nominal Continuous (SMOTE-NC): This is a generalised version of SMOTE that accommodates both continuous and nominal data [46].
5) Borderline SMOTE: This method performs SMOTE on borderline samples, which are instances that are often misclassified by their nearest neighbours [47].
6) Support Vector Machine (SVM) SMOTE: This method over-samples minority samples along the borderline and uses an SVM classifier for predicting new instances [48].
7) KMeans SMOTE: This method combines KMeans clustering with SMOTE. It forms K clusters, retains the clusters that contain many minority samples, and allocates more synthetic samples to the clusters in which minority samples are insufficient. Finally, SMOTE balances the proportion of categories in each cluster [49], [50].
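The difference between duplicating samples (ROS) and interpolating between neighbours (SMOTE) can be illustrated with a minimal sketch. It is written in plain Python rather than with the Imblearn library; the function names, the toy points, and the choice of k are assumptions made for illustration only and are not taken from the study's code.

```python
import random

def random_over_sample(minority, n_target, rng):
    """ROS: duplicate randomly chosen minority samples until n_target is reached."""
    extra = [rng.choice(minority) for _ in range(n_target - len(minority))]
    return minority + extra

def smote_sample(minority, k, rng):
    """SMOTE-style step: one synthetic point on the segment between a minority
    sample and one of its k nearest minority neighbours."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    # Sort the other minority samples by squared Euclidean distance to the base.
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))

rng = random.Random(0)
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]  # hypothetical toy data
balanced = random_over_sample(minority, 10, rng)
synthetic = [smote_sample(minority, 2, rng) for _ in range(6)]
```

ROS only ever returns points that already exist, whereas every SMOTE-style point lies between two existing minority samples, which is why the former risks exact duplicates and the latter changes the minority distribution.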
The hybrid method is a combination of under-sampling and over-sampling. Under-sampling and over-sampling have unavoidable disadvantages: under-sampling may discard useful information, while over-sampling may lead to overfitting. To break through these limitations, a technique combining under-sampling and over-sampling has been proposed. These include (i) SMOTE-ENN [44], which combines SMOTE for over-sampling and ENN for under-sampling, and (ii) SMOTE-Tomek [44], which uses SMOTE for oversampling and Tomek links for under-sampling. The purpose of using these two methods is to balance the training dataset and remove the noisy points at the wrong side of the decision boundary, to find better clusters and create models with good generalisation ability.
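The cleaning step of SMOTE-Tomek relies on Tomek links, that is, pairs of samples from opposite classes that are each other's nearest neighbour. The sketch below is a plain-Python illustration on hypothetical toy points; it drops only the majority-class member of each link, which is one possible variant and an assumption here, not necessarily the setting used in this study.

```python
def nearest(i, points):
    """Index of the point closest to points[i] (squared Euclidean distance)."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(points[i], points[j])),
    )

def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and nearest(j, points) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links

points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0)]
labels = [0, 0, 0, 1, 1]  # 0 = majority, 1 = minority (hypothetical)
links = tomek_links(points, labels)

# Cleaning step: remove the majority-class member of every link.
drop = {i if labels[i] == 0 else j for i, j in links}
cleaned = [(p, y) for idx, (p, y) in enumerate(zip(points, labels)) if idx not in drop]
```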

2) ALGORITHM-LEVEL METHODS
Algorithm-level methods are techniques wherein (i) standard machine learning classifiers are modified and associated with a weight or cost variable, or (ii) the classifier itself is unaffected by the skewed distribution [51]. Many scholars have published relevant research results discussing the class-imbalanced problem at the algorithm level [52]-[54].

3) HYBRID SYSTEMS
Hybrid systems involve a combination of sampling techniques and algorithmic methods [55]. They use data-level methods to process data externally and adjust the distribution of categories in the sample. Then algorithms are used internally to modify the learning process [56]. In this way, the model will not skew too much towards the majority class during classification [9]. The common ensemble methods are as follows:
1) Balanced Bagging: This method implements bagging and uses RUS to make the dataset balanced. It resamples each subset of the data before training the corresponding estimator of the ensemble. Compared with bagging in scikit-learn, it therefore takes two additional parameters that control the behaviour of the random sampler: sampling_strategy and replacement [57].
2) Balanced Random Forest: This method first draws bootstrap samples from the minority class, then randomly draws with replacement the same number of instances from the majority class, creating a balanced sample from which each tree is grown. This method pays more attention to samples that are easily misclassified [59].
Easy Ensemble and Balance Cascade are both referred to as exploratory under-sampling: each time, they extract a subset from the majority class to learn a classifier. Because both mainly use AdaBoost to train each bag, they are classified as ensemble methods [9], [28].
Various ensemble-based resampling techniques, i.e., Balanced Bagging [57], Balanced Random Forest [58], [61], Easy Ensemble [59], RUSBoost [60], and Balance Cascade [59], are widely known. Random balance, SMOTEBoost, and RUSBoost are identical due to random balance. The randomness and repetition of ensemble methods rely on random balance because each classifier uses a random ratio during training, with different class proportions in each sample. SMOTE and RUS balance the samples with respect to the minority as well as the majority class. The hybrid method of SMOTE and RUS provided better performance than other state-of-the-art combined ensemble methods such as SMOTEBoost and RUSBoost [62], [63]. The combination of UnderBagging and OverBagging, termed Under-OverBagging and based on a resampling bagging algorithm, was proposed by Qian et al. [64]; it over-samples the minority class and under-samples the majority class. The resampling ratio is calculated as the ratio of the minimum class size to the maximum class size.
KNN, naïve Bayes, and neural networks are widely employed as base learners in both homogeneous and heterogeneous ensembles. Previous research shows that heterogeneous ensembles perform well. Another method, developed by Liu and Zhou and named Easy Ensemble [59], resamples data using ensemble methods. Easy Ensemble retains the efficiency of under-sampling while reducing the risk of ignoring potentially useful information in majority class samples. It has been observed that using an ensemble as a base classifier is more effective for imbalanced classification than using a single classifier. Balance Cascade tries to use guided rather than random deletion of majority class samples. In contrast to Easy Ensemble, it works in a supervised manner. Since Balance Cascade removes correctly classified majority class examples in each iteration, it should be more efficient on highly imbalanced datasets.
Marcelino et al. [65] demonstrated that ensemble learners might be affected by the dataset size, an important result since collecting additional data may be costly or infeasible in some cases. Because dataset size may affect classification performance, it is important to examine novel approaches to this problem. Johnson and Khoshgoftaar [17] examined the effects of dataset size and balance level on the classification performance of various ensemble methods. They concluded that the average AUC value increases within each level of class imbalance as the dataset size increases. Similarly, within each dataset size, the average AUC value increases as the proportion of the minority class increases. In general, ensemble learning methods perform better than any single base learner, tend to be less susceptible to overfitting, and can reduce the bias during data resampling.
RUS [29] is a computationally cheap baseline method that naturally extends to the multi-class case and brings no distortion to class distribution. It is risky because it deletes random samples without checking their potential significance or relevance. TL [38] is a border- and noise-cleaning method. The algorithm is easily extendable to the multi-class case, but its computational complexity is higher because it must find the nearest neighbours of each point in the dataset. Also, the number of links found is limited because nearest neighbours from the same class break many candidate pairs. CNN [39] uses the one-nearest-neighbour algorithm to choose which majority samples can be removed. The issue with this method is that it is sensitive to noise and tends to preserve noisy samples. OSS [36] adds the use of Tomek links to CNN to remove links that are considered noisy. NCR [35] combines CNN and OSS to remove more noisy samples. NM [34] is a binary under-sampling algorithm that uses average distances between a given point and the nearest or farthest points of the opposite class. It under-samples only the largest major class because of intrinsic constraints of the binary NearMiss algorithm. The NearMiss technique highly distorts the distribution of the major class, and NearMiss-4 has no meaning in the multi-class case.
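As a minimal illustration of why RUS is cheap but risky, the following plain-Python sketch keeps a random subset of the majority class without inspecting the samples it discards. The sample identifiers and the fully balanced target size are assumptions for illustration.

```python
import random

def random_under_sample(majority, minority, rng):
    """RUS: keep a random majority subset of the same size as the minority class."""
    kept = rng.sample(majority, len(minority))  # sampling without replacement
    return kept, minority

rng = random.Random(42)
majority = list(range(100))        # 100 hypothetical majority sample ids
minority = list(range(100, 104))   # 4 hypothetical minority sample ids
kept, _ = random_under_sample(majority, minority, rng)
```

Here 96 of the 100 majority samples are discarded purely by chance, regardless of how informative they are.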
ROS [44] is a baseline method in which all minor classes are over-sampled by randomly selecting points until they reach the number of points in the largest major class. However, it can produce many instances of the same point, which may not be good for some learning algorithms. The ADASYN [45] over-sampling algorithm for the multi-class case creates points adaptively to the distributions of the minor classes. The algorithm is not computationally efficient because it computes the nearest neighbours twice, first over the whole data to find how many points to generate. SMOTE [46], [66] is a widely used algorithm for the multi-class case. SMOTE has some drawbacks. First, its computational complexity is quadratic in the size of the minor class because of the k-nearest-neighbour search. Second, selecting target points from the nearest neighbours creates a serious distortion of the minor class distribution: some points will never be selected as targets, and new points are generated as edges of a graph but not in the middle of the distribution. The Borderline-SMOTE algorithm [47], [67] creates new points as linear combinations of the borderline minor class points. We have found some drawbacks of the algorithm: 1) low computational efficiency because of the search for the k nearest neighbours of the minor class in the whole dataset, and 2) a substantial distortion of the minor class distributions, even more than with pure SMOTE. SMOTE-SVM [48], [68] instead focuses on creating samples on the decision borders between the minority and majority classes created by the SVM classifier.

B. PERFORMANCE EVALUATION METRIC FOR IMBALANCED DATA
The performance of a classifier is commonly determined through a confusion matrix, shown in Table 2, where True Positive (TP) is the number of correctly classified positive instances and False Negative (FN) is the number of positive instances incorrectly classified as negative. False Positive (FP) is the number of negative instances incorrectly predicted as positive, whereas True Negative (TN) is the number of correctly predicted negative instances [69]-[71]. From the confusion matrix, many standard evaluation metrics can be derived [72], [73]. The most commonly used metric is accuracy, given by Eq. 1.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
VOLUME 9, 2021
However, most studies on imbalanced class data point out that accuracy may not be an appropriate metric in imbalanced datasets [74]. This is because, in most applications, the minority class is often more important, requiring methods with improved recognition rates [75], [76], and errors (FN and FP) have varying degrees of consequence. For instance, in cancer diagnosis, one is more interested in correctly detecting the minority (i.e., positive) cases to diagnose and treat the patient effectively. Incorrectly diagnosing a person as cancer-positive could entail additional, unnecessary costs for further medical tests. On the other hand, incorrectly classifying a person as cancer-negative could delay necessary treatment and cost the person's life.
We describe an alternative performance evaluation metric, the area under the Receiver Operating Characteristic curve. The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR = TP/(TP + FN)) on the y-axis against the False Positive Rate (FPR = FP/(TN + FP)) on the x-axis at various threshold values [77]-[79]. The area under the ROC curve (AUC) summarises the classifier's ability to distinguish between classes and allows ROC curves to be compared [80], [81].
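The following plain-Python sketch illustrates these definitions on a small toy example. It computes TPR and FPR at a single threshold and obtains the AUC through its equivalent pairwise interpretation: the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The labels, scores, and function names are hypothetical.

```python
def tpr_fpr(y_true, scores, threshold):
    """TPR and FPR when instances with score >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)

def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
rates = tpr_fpr(y_true, scores, 0.5)
auc = roc_auc(y_true, scores)
```

Because the AUC depends only on how positives are ranked relative to negatives, it does not reward a classifier for simply predicting the majority class.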

C. APPLICATION OF CLASS IMBALANCE METHODS TO CANCER DATASETS
Concerning cancer, a comprehensive review of data-level methods for diagnosing various types of cancer was performed by Sara et al. [13]. Compared with other types of cancer, there are fewer studies on class imbalance methods for lung cancer. A few studies have also addressed the classification of lung nodules [82], [83] and chest-related diseases [84], [85], the identification of thoracic diseases [86], and the forecasting of COVID-19 [83], [87], [88].

III. DATA DESCRIPTION
In this study, we utilise two different lung cancer datasets: (i) the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial and (ii) the National Lung Screening Trial (NLST). As shown in Figure 1, both datasets are class-imbalanced; they are described below.

A. PLCO DATASET
The PLCO dataset collects anonymised information on men and women aged 55 to 74 years, including their responses to baseline and supplementary questionnaires, smoking status, screening test results, and diagnostic and treatment procedures [89]. The initial data consists of 154,897 participants, and after the data cleaning discussed in Section IV-A.1 and Section IV-A.3, the number of participants was reduced to 80,672. Among them, 3,137, or about 3.89%, have confirmed lung cancer, while the rest have no confirmed lung cancer. We took a subset containing age, Body Mass Index (BMI) value and category, x-ray history, education, smoking status, number of years smoking, pack-years, number of years since quitting smoking, family history of lung cancer, history of bronchitis and emphysema, and confirmed lung cancer. These variables were identified in the PLCO model developed to predict lung cancer risk [90].

B. NLST DATASET
The NLST dataset collects participant information to compare Low Dose Computed Tomography (LDCT) with chest radiography in lung cancer screening. The data contains information from 53,452 participants. Of these, 2,058 participants, or about 3.85%, have confirmed lung cancer.
In this dataset, we created a subset containing variables similar to the first PLCO subset, namely, age, weight, height, x-ray history, education, smoking status, number of years smoking, pack-years, age when participant quit smoking, history of lung cancer of brother, child, father, mother and sister, history of bronchitis and emphysema and confirmed lung cancer.

IV. METHOD
This research explores methods for handling class-imbalanced datasets in biomedical data. The confirmed lung cancer cases in the PLCO and NLST datasets make up 3.89% and 3.85% of the respective populations. This low proportion of positive cases indicates that the class distribution is imbalanced. Therefore, class imbalance techniques are applicable to predicting the presence of lung cancer. This research uses three classifiers as baseline models and, depending on the type of class-imbalanced method to be explored, follows one of two procedures: (i) perform sampling techniques and build classification models, or (ii) perform ensemble methods. The specific workflow is shown in Figure 2, and this section explains the methods used in the research.

A. DATA PRE-PROCESSING
Data pre-processing includes addressing the issue of missing values and adjusting the features of the datasets. The scaling of numerical data and the one-hot encoding of categorical features are discussed later.

1) HANDLING MISSING VALUES FOR THE PLCO DATASET
Initial data from the PLCO Lung dataset consists of 154,897 participants. We excluded 4,953 participants with no indicated cigarette smoking status cig stat. Whenever this information was unknown, other variables such as the number of years smoking, pack-years and years since quitting smoking were also unknown. One could not reasonably clean these data without information on whether a participant is a current, former or never smoker. Variables containing a mixture of categorical and numerical data were cleaned. For instance, the number of years since quitting smoking variable cig stop contained the number of years for some former smokers and zero for current smokers, but had no response for some former smokers and non-smokers. The latter is reflected as NaN and had to be cleaned up. For non-smokers, we set this to be equal to the individual's age (i.e., we assume non-smokers to have ceased smoking since their birth). For those with unknown X-ray history, we set the value to 3, corresponding to the category value that indicates the participant ''does not know'' the answer. For current smokers with an unknown number of years smoking and pack-years, we set their respective cig years and pack-years to the median values for current smokers. Likewise, for former smokers with an unknown number of years smoking, pack-years and years since quitting, we set their respective cig years, pack-years and cig stop to the median values for former smokers. For those with an unknown family history of lung cancer, we set the value to 8, indicating a new category value. For the rest of the variables, where we could not reasonably assume values for the cleanup, we used SimpleImputer from scikit-learn [91], [92]. To handle missing values of the numerical BMI variable, we used the median strategy. In contrast, for categorical variables represented by numbers, namely, education, history of bronchitis and emphysema, we used the most frequent strategy.
We also wrote a function to map the BMI values BMI curr that we had just imputed to their corresponding categories BMI cure, as per the World Health Organization (WHO) standard categorisation of BMI. Further, we created a new subset of the cleaned dataset containing our desired features (age, BMI category, x-ray history, education, smoking status, number of years smoking, pack-years, number of years since quitting smoking, family history of lung cancer, history of bronchitis and emphysema), and the target variable (confirmed lung cancer).
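The two imputation strategies can be illustrated with a minimal plain-Python sketch. It mimics the idea behind the median and most frequent strategies rather than calling SimpleImputer itself; the column values are hypothetical, and the handling of ties between equally frequent values may differ from that of scikit-learn.

```python
from statistics import median, mode

def impute(values, strategy):
    """Replace None entries with the median or the most frequent observed value."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if strategy == "median" else mode(observed)
    return [fill if v is None else v for v in values]

bmi = [22.5, None, 31.0, 27.4, None]   # hypothetical numerical column
education = [3, 3, None, 5, 2]         # hypothetical categorical codes
bmi_filled = impute(bmi, "median")
education_filled = impute(education, "most_frequent")
```

The median is preferred for the numerical column because it is robust to extreme values, while a category code can only sensibly be replaced by another observed code.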
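A mapping of this kind might look like the following sketch. The cut-off values follow the standard WHO adult BMI classes (underweight below 18.5, normal weight from 18.5 to below 25, overweight from 25 to below 30, obese from 30); the integer codes 1 to 4 and the function name are assumptions and may differ from the category codes used in the datasets.

```python
def bmi_category(bmi):
    """Map a BMI value (kg/m^2) to a WHO weight class; codes 1-4 are hypothetical."""
    if bmi < 18.5:
        return 1  # underweight
    if bmi < 25.0:
        return 2  # normal weight
    if bmi < 30.0:
        return 3  # overweight
    return 4      # obese

categories = [bmi_category(b) for b in (17.0, 18.5, 24.9, 25.0, 29.9, 30.0)]
```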

2) HANDLING MISSING VALUES FOR THE NLST DATASET
We converted the columns' data type to numeric since they were all initially cast as strings. For the missing height and weight values, we used imputation with the median strategy. We computed the BMI value from the height and weight values and mapped the result to the BMI category using the same mapper we used in PLCO. Current smokers have missing entries for their age when they quit smoking, so we set them to their age. For former smokers with missing entries for their age when they quit smoking, we imputed the median value. We then computed the corresponding cig stop value by taking the difference between the participant's age and age quit to align it with the definition in the PLCO data. Lung cancer history of family members in NLST is indicated in separate fields for brother, child, father, mother and sister. For the missing entries in these fields, we used imputation with the most frequent strategy. We then collapsed these features into a single column, lung FH, by taking their logical OR. For the missing history of bronchitis and emphysema, we used imputation with the most frequent strategy. We also introduced the binary target variable confirmed lung with a value of 1 if the participant has confirmed lung cancer and 0 otherwise, based on the variable conflict. This simplifies our study to a binary classification problem.
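The two derived NLST features described above can be sketched as follows. The dictionary keys (age, age_quit, fam_brother, and so on) are hypothetical names chosen for illustration and are not the actual NLST column names.

```python
def derive_nlst_fields(row):
    """Return (family history flag, years since quitting) for one participant."""
    relatives = ("brother", "child", "father", "mother", "sister")
    # Logical OR over the per-relative lung cancer history flags.
    lung_fh = int(any(row["fam_" + r] == 1 for r in relatives))
    # Years since quitting = current age minus age at quitting.
    cig_stop = row["age"] - row["age_quit"]
    return lung_fh, cig_stop

current_smoker = {"age": 62, "age_quit": 62, "fam_brother": 0, "fam_child": 0,
                  "fam_father": 1, "fam_mother": 0, "fam_sister": 0}
former_smoker = {"age": 70, "age_quit": 55, "fam_brother": 0, "fam_child": 0,
                 "fam_father": 0, "fam_mother": 0, "fam_sister": 0}
current_result = derive_nlst_fields(current_smoker)
former_result = derive_nlst_fields(former_smoker)
```

Setting the quitting age of current smokers to their age yields zero years since quitting, which matches the PLCO convention of zero for current smokers.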
Further, we created a new subset of the cleaned NLST dataset containing our desired features (age, BMI category, x-ray history, education, smoking status, number of years smoking, pack-years, number of years since quitting smoking, family history of lung cancer, history of bronchitis and emphysema), and the target variable (confirmed lung cancer), using the same order and exact column names as the PLCO dataset.

3) MAKE PLCO AND NLST DATASET CONSISTENT
In this section, the two datasets obtained after preliminary cleaning are further processed so that their features are consistent. We removed the PLCO non-smokers from the dataset because the NLST excludes non-smokers through its screening selection criteria. We also changed the PLCO former smokers' cig stat value from 2 to 0 to align with the NLST former smokers' cigsmok value of 0. For the education feature, NLST's categories 8, 95, 98 and 99 did not correspond to PLCO's education categories. We calculated the mode of NLST's education variable EDUCAT, which was 3, and used this value instead for the mentioned categories. Family history of lung cancer in PLCO had categories 8 and 9, which did not correspond to NLST's categories. For these categories, we used PLCO's mode for lung fh, which was 0. For x-ray history, to align with NLST's binary 0-1 values, we collapsed PLCO's ''Yes, Once'' and ''Yes, More Than Once'' (with values 1 and 2, respectively) into the same value of 1. Also, for those who answered ''Do not Know'' (with the value of 3), we assumed that if they were not sure of their x-ray history, the results would not have been available, so we set those to 0. Finally, we renamed NLST's features to follow the PLCO names for easier reference. We identified the following variables as categorical: BMI curr, bronchit f, cig stat, EDUCAT, emphys f, lung FH, Xray history, while the following variables are numerical: age, cig stop, cig years and pack years.

B. SPLIT DATASET
We used Stratified KFold (K = 5) to split the dataset, dividing the entire development set into five disjoint subsets while maintaining the class ratio of the samples. For each split, four-fifths of the dataset is used as the training set, and the remaining one-fifth is used as the test set. Each split can be regarded as the ith split (i = 1, . . . , 5), and the AUC is calculated on the ith test set [93]. It is worth noting that the test set obtained each time is set aside and does not participate in any stage of scaling, recoding or model building. Since over-sampling copies or synthesises minority samples, the data obtained in this way cannot represent the original dataset, so the test set must be kept separate from the training process.
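The idea of stratified splitting can be sketched in plain Python as follows: the indices of each class are shuffled and dealt round-robin into K folds, so that every fold keeps roughly the original class ratio. This is an illustrative sketch on hypothetical labels, not the scikit-learn implementation used in the study.

```python
import random

def stratified_kfold(labels, k, rng):
    """Yield (train_idx, test_idx) for k disjoint folds that preserve class proportions."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class round-robin across folds
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j in range(k) if j != i for idx in folds[j])
        yield train, test

labels = [1] * 10 + [0] * 240  # hypothetical data with an imbalance ratio of 24
splits = list(stratified_kfold(labels, 5, random.Random(0)))
```

Any resampling would then be applied to the training indices of a split only, leaving the corresponding test indices untouched.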

C. SCALING AND ENCODING DATA
Scaling and re-encoding should be applied before sampling because some sampling methods depend on the distance between data points. For example, All-KNN is based on the Euclidean distance between samples, and large differences in feature magnitude will affect the sampling result. The scaling and encoding methods are explained below.

1) FEATURE SCALING FOR NUMERIC DATA
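As part of data pre-processing, we transformed the numeric data to a range of [0,1] using Eq. 2.
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{2}$$
where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the feature.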
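A minimal sketch of this min-max transformation is given below. It assumes, in line with the statement that the test set does not take part in scaling, that the minimum and maximum are taken from the training fold and then reused for the test fold; the age values and function names are hypothetical.

```python
def fit_min_max(train_column):
    """Return the minimum and maximum of a training column."""
    return min(train_column), max(train_column)

def apply_min_max(column, lo, hi):
    """Scale values with (x - lo) / (hi - lo); a constant column maps to 0.0."""
    span = hi - lo
    return [(x - lo) / span if span else 0.0 for x in column]

train_age = [55, 60, 74, 68]   # hypothetical training-fold ages
test_age = [58, 74]            # hypothetical test-fold ages
lo, hi = fit_min_max(train_age)
train_scaled = apply_min_max(train_age, lo, hi)
test_scaled = apply_min_max(test_age, lo, hi)
```

Under this assumption, test values outside the training range would fall slightly outside [0,1], which is the price of not letting the test fold influence the scaling parameters.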

2) ONE-HOT ENCODING FOR CATEGORICAL DATA
We performed one-hot encoding for categorical data. Each categorical feature with n categories is converted to n binary (0-1) features [94], [95].
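As an illustration, the sketch below converts a categorical feature with n = 3 categories into three binary features. The category codes are hypothetical, and the column order (sorted category values) is a choice made for this example.

```python
def one_hot(values):
    """Return the sorted categories and one binary indicator row per value."""
    categories = sorted(set(values))
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows

categories, encoded = one_hot([1, 3, 2, 3])  # hypothetical category codes
```

Each row contains exactly one 1, so distance-based sampling methods treat all categories as equally far apart instead of interpreting the numeric codes as magnitudes.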

D. CLASS-IMBALANCED METHODS
The class-imbalanced learning methods used in this research mainly include data-level methods and hybrid systems (this research mainly explores the imbalance techniques in the Imblearn library Under-Sampling Boost (RUSBoost). The Balance Cascade algorithm was repeatedly adjusted in the Imblearn library in recent years and was finally abandoned in version 0.6.0, so this article does not discuss this method.

E. BUILDING CLASSIFIERS
This study uses three classic classifiers as baseline models against which to identify the most suitable class-imbalanced technique for the datasets: (i) Logistic Regression (LR), (ii) Random Forest (RF), and (iii) Linear Support Vector Classification (Linear SVC).

F. EVALUATION

1) EVALUATE SAMPLING - IMBALANCE RATIO
The imbalance ratio (IR) is an essential parameter in imbalanced learning. It measures the proportional relationship between the majority and minority classes in the experiment [96]. The formula is given by Eq. 3:
$$\mathrm{IR} = \frac{N_{maj}}{N_{min}} \tag{3}$$
where $N_{maj}$ and $N_{min}$ are the numbers of samples in the majority and minority classes, respectively. Most of the data-level methods used in this research resample the majority class or the minority class of the original dataset, thereby increasing the number of minority class samples or reducing the number of majority class samples. Sampling therefore changes the imbalance ratio of the dataset. As IR becomes larger, the disparity in sample size between the majority class and the minority class becomes more significant [97], [98], and the dataset is more imbalanced. When the IR value is closer to 1, the dataset tends to be more balanced. Therefore, this paper uses IR to evaluate sampling techniques.
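As a quick check of this definition (majority class size divided by minority class size), the following sketch recomputes the imbalance ratios of the two datasets from the participant counts reported in Section III. Only the function name is an assumption.

```python
def imbalance_ratio(n_majority, n_minority):
    """IR: number of majority class samples divided by number of minority class samples."""
    return n_majority / n_minority

# PLCO: 80,672 participants after cleaning, 3,137 with confirmed lung cancer.
plco_ir = imbalance_ratio(80_672 - 3_137, 3_137)
# NLST: 53,452 participants, 2,058 with confirmed lung cancer.
nlst_ir = imbalance_ratio(53_452 - 2_058, 2_058)
```

Rounded to one decimal place, these values agree with the imbalance ratios of 24.7 and 25.0 stated for PLCO and NLST.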

2) EVALUATE MODEL - AUC
This study selected the widely used AUC as the metric to evaluate the ability of each classifier to distinguish between confirmed and non-confirmed lung cancer cases. After all five splits, we obtain the mean of the AUC values calculated on the five test sets. To make the experimental results more accurate and reliable, this study repeated the above process five times and calculated the final mean AUC to measure the model's predictive ability. In addition, this study compares the experimental results on the PLCO and NLST datasets and discusses the methods of dealing with class-imbalanced data.

V. RESULTS
This section lists the imbalance ratios produced by the resampling techniques and then shows the prediction results of the models built with the imbalance techniques, which helps to analyse the effect of these techniques comprehensively. We used the area under the curve (AUC) to evaluate the methods. The AUC is well suited to imbalanced datasets [10], [69]. Our study had 16 imbalance datasets, and various studies [57], [99], [100] have employed the AUC as a performance evaluation measure.

A. RESULTS FOR PLCO DATASET
The class-imbalanced PLCO dataset has an imbalance ratio of 24.7. Resampling changes the class proportions of the dataset. Table 3 lists the class distribution in the training set after each sampling method. Since sampling is applied to the training set, the baseline is the number of samples in the training set (four-fifths of the whole dataset, which is 64537.6). The results show that under-sampling changes the number of majority samples, over-sampling changes only the minority samples, and the hybrid methods change both classes.
All sampling methods reduce the IR value, and the IR values of over-sampling and hybrid sampling are close to 1, which means that they bring the dataset as close to class balance as possible.
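To make the effect of sampling on the class distribution concrete, the sketch below implements the two simplest methods, random over-sampling (ROS) and random under-sampling (RUS), with the standard library only. It is a simplified illustration on hypothetical toy labels: the function names are assumptions, both samplers balance the classes fully, and the other methods compared in this study (Near Miss, Repeated ENN, SMOTE variants, and so on) select or create samples in more elaborate ways.

```python
import random
from collections import Counter

def indices_by_class(labels):
    """Group sample indices by class label."""
    groups = {}
    for i, y in enumerate(labels):
        groups.setdefault(y, []).append(i)
    return groups

def random_over_sample(labels, rng):
    """Indices after duplicating smaller classes (with replacement) up to the largest class size."""
    groups = indices_by_class(labels)
    target = max(len(idx) for idx in groups.values())
    selected = []
    for idx in groups.values():
        selected.extend(idx)  # keep every original sample
        selected.extend(rng.choices(idx, k=target - len(idx)))  # add random duplicates
    return selected

def random_under_sample(labels, rng):
    """Indices after randomly discarding samples (without replacement) down to the smallest class size."""
    groups = indices_by_class(labels)
    target = min(len(idx) for idx in groups.values())
    selected = []
    for idx in groups.values():
        selected.extend(rng.sample(idx, target))
    return selected

def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical toy training labels with IR = 25 (not the PLCO or NLST data).
labels = [0] * 250 + [1] * 10
rng = random.Random(0)
for name, sampler in [("ROS", random_over_sample), ("RUS", random_under_sample)]:
    resampled = [labels[i] for i in sampler(labels, rng)]
    print(name, dict(Counter(resampled)), "IR =", imbalance_ratio(resampled))
```

On the toy labels both samplers reach an IR of 1, but ROS ends with 500 samples while RUS keeps only 20, which illustrates why under-sampling can discard a large share of the majority-class information.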
Table 4 shows the resulting AUCs of the three classifiers after applying the various under-sampling methods to the PLCO dataset. The best under-sampling method was not the same for every classifier. Logistic regression and Linear SVC scored highest with RUS, at 0.7124 and 0.7126, respectively. However, the random forest model using Repeated ENN achieved the highest mean AUC of all under-sampling models, 0.8968. Among the over-sampling methods, shown in Table 5, ROS performed best across the three classifiers, and the random forest had the highest mean AUC of 0.8994. For the hybrid methods shown in Table 6, SMOTEENN achieved a higher mean AUC with logistic regression and Linear SVC, whereas SMOTETomek with random forest had a higher mean AUC of 0.8684. For the ensemble methods shown in Table 7, balanced bagging achieved the highest mean AUC, followed by balanced random forest.
Because the random forest had the highest baseline value, all resampling methods were compared using the random forest model. In Figure 3, yellow represents the baseline, green the under-sampling methods, orange the over-sampling methods, and blue the hybrid methods. The baseline AUC in PLCO is 0.8532; the lowest value, 0.5035, occurs with Near Miss, and the highest, 0.8994, with ROS. The bar chart shows that the AUCs of the under-sampling methods fluctuate more than those of the other methods. The standard deviation (SD) of under-sampling in PLCO is 0.1251, and that of over-sampling is 0.0123. There are only two hybrid methods, so their SD is not calculated. We separately calculated the standard deviation of the ensemble methods (because each of these is a separate classifier) as 0.0643, which lies between over-sampling and under-sampling.
This shows that over-sampling is more stable than the other imbalanced learning techniques and that under-sampling is the least stable. Among all the class imbalance techniques tested on the PLCO dataset, random forest using ROS performs best.
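The comparison against the baseline can be checked with a few lines of arithmetic. The sketch below uses only the random forest AUCs on PLCO that are quoted in the text above (baseline 0.8532, Near Miss 0.5035, Repeated ENN 0.8968, ROS 0.8994); the helper name `change_vs_baseline` is an illustrative assumption.

```python
def change_vs_baseline(auc_by_method, baseline_auc):
    """Return each method's AUC minus the baseline AUC, rounded to four decimals."""
    return {m: round(auc - baseline_auc, 4) for m, auc in auc_by_method.items()}

# Random forest AUCs on PLCO as quoted in the text above.
baseline_auc = 0.8532
quoted = {"Near Miss": 0.5035, "Repeated ENN": 0.8968, "ROS": 0.8994}

changes = change_vs_baseline(quoted, baseline_auc)
for method in sorted(changes, key=changes.get, reverse=True):
    print(f"{method:<13}{quoted[method]:.4f}  {changes[method]:+.4f}")
```

ROS and Repeated ENN raise the AUC by roughly 0.046 and 0.044, whereas Near Miss lowers it by about 0.35, which is the gap that drives the large spread among the under-sampling methods.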

B. RESULTS FOR NLST DATASET
The NLST dataset is also extremely imbalanced, with an imbalance ratio of 25.2. The imbalance ratios obtained after each sampling method are shown in Table 8. The results are similar to those for the PLCO dataset: over-sampling and hybrid sampling bring the IR of the dataset to approximately 1. The training set sizes show that the number of samples is reduced after under-sampling, whereas the total number of samples is higher than in the original dataset after the other methods. Table 9 shows the AUCs obtained by applying the under-sampling methods with the three classifiers on the NLST dataset. The best under-sampling method was not the same for every classifier. For logistic regression and Linear SVC, the best AUCs differ very little, and both are obtained with RUS. Random forest using Repeated ENN performs much better than the other under-sampling models. The AUC results for the over-sampling methods are shown in Table 10. The best over-sampling method for logistic regression is similar to that for Linear SVC. Random forest with ROS achieved the highest mean AUC of 0.8960.
For hybrid methods shown in Table 11, SMOTETomek achieved a higher mean AUC than SMOTEENN for all three classifiers in the NLST dataset. AUCs of ensemble methods performed in the NLST dataset are shown in Table 12. Balanced bagging achieved the highest mean AUC, followed by balanced random forest.
As with the PLCO dataset, we measured the performance of the sampling methods in the random forest, as shown in the figure. The AUC of the under-sampling method Near Miss is the lowest, and the AUC of the over-sampling method ROS is the highest. For the NLST dataset, the SD of the AUC is 0.1140 for under-sampling and 0.0089 for over-sampling. In addition, the standard deviation of hybrid systems is 0.1124, which is between over-sampling and under-sampling. Considering both the standard deviation and the AUC of each method, under-sampling fluctuates far more than over-sampling, which is more stable.
In general, AUCs obtained in the NLST dataset have been lower than the AUCs obtained in the PLCO dataset, indicating an inherent difference in the data.

VI. DISCUSSION
In this section, we discuss the application of class-imbalanced techniques in this study from two aspects. One is to discuss the different class-imbalanced techniques, and the other is to combine the performance on the two datasets to analyse the results.

A. THE EFFECTS OF IMBALANCED LEARNING
Each classifier is combined with different imbalance techniques in this study, including data-level over-sampling, under-sampling, and hybrid methods, as well as ensemble methods. Among the three baseline classifiers, the mean AUC of the random forest is much higher than that of logistic regression and Linear SVC, and the random forest models provide the highest mean AUC with the different sampling techniques. This shows that the random forest classifier is suitable for the imbalanced medical data used in this study. It is worth noting that although the baseline AUC values of logistic regression and Linear SVC are as low as 0.5, the AUC values of most models improved significantly with class imbalance techniques, which shows that these techniques help to enhance the classification ability of the model. In addition, the average AUC of most over-sampling methods is higher than that of the other sampling methods, so over-sampling is suitable for the imbalanced medical data used in this study. The following discusses class imbalance learning from two aspects: the class ratio (IR value) of the samples generated by resampling and the stability of the class imbalance techniques.
To explore the relationship between the imbalance method and the model's AUC, we use IR to measure the ability of each resampling technique to adjust the class distribution. In the sampling results, under-sampling discards part of the majority samples, over-sampling duplicates or synthesises minority samples, and the hybrid methods resample all classes. However, on the extremely imbalanced datasets in this study, under-sampling does not perform well, and the IR of most under-sampling methods remains very high. Because under-sampling needs to discard many majority class samples to balance them with the minority class, it is likely to lose valuable information. For the over-sampling and hybrid methods that perform well when combined with a classifier, the minority class samples increased significantly, the total number of samples exceeded that of the original dataset, and the IR values were all around 1. Therefore, a resampling method that adjusts the class distribution so that the IR of the dataset is close to 1 appears beneficial to the model's predictive ability.
We also used the standard deviation to assess the stability of the imbalanced learning techniques. Since the random forest classifier performs better than the other baseline classifiers, we took the AUC values of the resampling models used with the random forest as the example. The standard deviation (SD) was calculated within each type of resampling method, and the SD of the ensemble methods (hybrid systems) was calculated separately. Under-sampling has the highest SD (PLCO: 0.1251; NLST: 0.1140), and over-sampling has the lowest (PLCO: 0.0123; NLST: 0.0089).
This shows that different under-sampling methods may give very different results, whereas different over-sampling methods give relatively similar results. The standard deviation of over-sampling is much smaller than that of under-sampling, indicating that over-sampling is stable. Therefore, if a resampling method is used to process extremely imbalanced datasets like those in this research, over-sampling is recommended: because it is relatively stable, the results will not differ significantly depending on which over-sampling method is selected.
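The stability measure discussed above, the standard deviation of the mean AUCs within each family of methods, can be sketched as follows. The AUC values in the example are hypothetical placeholders chosen only to show the calculation; they are not the values reported in the tables. The sketch also assumes the sample standard deviation (`statistics.stdev`), since the text does not state whether the sample or population form was used, and the function name `family_spread` is illustrative.

```python
from statistics import stdev

def family_spread(auc_by_family):
    """Map each family of methods to the sample SD of its mean AUCs.

    Families with fewer than two methods are skipped, because a sample SD
    is undefined for a single value.
    """
    return {
        family: stdev(aucs)
        for family, aucs in auc_by_family.items()
        if len(aucs) >= 2
    }

# HYPOTHETICAL AUCs for illustration only; NOT the values reported in this study.
illustrative = {
    "under-sampling": [0.50, 0.70, 0.85, 0.90],
    "over-sampling": [0.88, 0.89, 0.90],
}

for family, sd in family_spread(illustrative).items():
    print(f"{family}: SD = {sd:.4f}")
```

A family whose methods produce similar AUCs yields a small SD, which is the sense in which over-sampling is described as stable; a single poorly performing method, such as Near Miss among the under-sampling methods, inflates the SD of its family.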

B. EVALUATION OF IMBALANCED LEARNING TECHNIQUES APPLIED TO THE TWO DATASETS
Comparing the performance of the different imbalance methods on the two datasets gives similar results. When the two datasets are pre-processed with under-sampling, RUS performs well with logistic regression and Linear SVC, and the combination of Repeated ENN and random forest achieves the highest average AUC among the under-sampling methods on both datasets. Among the over-sampling techniques, the random forest combined with ROS performed best among all models on both datasets. Among the ensemble methods, the balanced bagging classifier performed well on both datasets.
In Figure 5 and Figure 6, we summarise the best performing sampling methods for each classifier on the two datasets and compare them with the baseline AUC (i.e., no sampling performed). After each classifier is combined with the sampling methods shown, the AUC of the model increases significantly. Except for Linear SVC, the best sampling method for the classifiers is ROS, and the performance of ROS with Linear SVC is close to the best result. Therefore, the random forest model using ROS is more suitable for processing such imbalanced medical datasets and achieves the highest AUC. The under-sampling method Near Miss obtained AUC values lower than those of the corresponding baseline classifier on both datasets and performed the worst among all resampling techniques. The AUC values obtained by Near Miss on the three classifiers are all the lowest, so it can be considered unsuitable for the datasets with an imbalance ratio of about 25 used in this study.
Conversely, the random forest model that uses ROS is, as a whole, more suitable for the highly imbalanced lung cancer datasets used in this research and can achieve the highest AUC. One difference is that, among the hybrid methods, SMOTETomek performs very well on the NLST dataset, while the average performance of SMOTEENN on the PLCO dataset is slightly higher than on the NLST dataset. This shows that there are still some potential differences between the two datasets.
It may be worthwhile to include algorithm-level methods to complete the suite of class imbalance techniques and evaluate their predictive performance. However, the costs and weights assigned to the algorithm-level methods must be as close as possible to realistic values.

VII. CONCLUSION
In this study, we have investigated class imbalance techniques, including data-level methods and hybrid systems, to predict the presence of lung cancer. Two medical datasets related to lung cancer (PLCO and NLST), with imbalance ratios of 24.7 and 25.2, are used in this research. Imbalanced learning methods are used to counter the skew toward the majority class in prediction. This research discusses 23 imbalanced learning methods, including ten under-sampling techniques, seven over-sampling techniques, two hybrid resampling methods, and four hybrid systems. The class imbalance techniques balance the classes in the dataset by discarding majority samples or by copying or synthesising minority samples. In addition, three classic classifiers (logistic regression, random forest, Linear SVC) combined with the resampling techniques were trained on the datasets. The prediction results obtained by training the classifiers on the pre-processed data (excluding null values, etc.) are used as a baseline for comparison with the models built using imbalance techniques. The imbalance ratio is used to evaluate the sampling techniques, and AUC is used to assess the classification ability of the models.
Further, the standard deviation was used to measure the stability of the class imbalance techniques. This study shows that models using class imbalance techniques perform better than the baseline models; that is, class imbalance techniques help to improve the prediction performance of the model. The data-level techniques adjust the IR of the dataset to be close to 1 through resampling. Among the imbalanced learning methods studied in this paper, the over-sampling techniques performed best, and the IR value of the over-sampled datasets was about 1. Most of the models that use over-sampling have higher AUC values than the other models. The over-sampling methods are more stable than the other methods, and the under-sampling methods are the least stable. The random forest with random over-sampling is the best predictive model and is more suitable for the PLCO and NLST lung cancer datasets: using ROS to process these two datasets in the random forest model achieves the highest AUC value.
Conversely, the AUC of the random forest using Near Miss is far below even the baseline value. Therefore, the combination of ROS and the random forest is worth promoting. However, there are still some small differences between datasets, and hybrid systems and over-sampling can be suggested to deal with extremely imbalanced biomedical datasets similar to those in this research. The contribution of this research is to show that class imbalance techniques can be used to help diagnose lung cancer and that the over-sampling technique is better than the other imbalanced learning methods. Finally, the researchers proposed a model combining ROS and random forest to screen for lung cancer so that more people can receive timely treatment and the loss caused by misdiagnosis can be reduced. In future research, new class imbalance techniques are worth applying and exploring. Combining more diverse classifiers and imbalance techniques to achieve higher model prediction capabilities is also worth looking forward to. Furthermore, deep learning-based models, e.g., GNN, AlexNet, and ResNet, can also be deployed for the imbalanced dataset problem.
KAMRAN SHAUKAT received the M.Sc. degree in computer science from Mohammad Ali Jinnah University, Pakistan. He is currently pursuing the Ph.D. degree with The University of Newcastle, Callaghan, NSW, Australia. He has served the University of the Punjab, Pakistan, for seven years, as a Lecturer. He is the author of many articles in machine learning, databases, and cyber security. He has served as a Reviewer to many journals, including IEEE ACCESS. He has attended several international conferences, including the USA, U.K., Thailand, Turkey, and Pakistan. He received the Gold Medal for his M.Sc. degree.
TALHA MAHBOOB ALAM received the bachelor's degree in software engineering from the University of Management and Technology (UMT), Lahore, Pakistan, in 2017, and the master's degree in computer science from the University of Engineering and Technology (UET), Lahore, in 2020. He is currently serving for the Virtual University of Pakistan. He has published more than 25 journal and conference papers of international repute. His research interests include big data, machine learning, deep learning, and knowledge discovery in databases.
SUHUAI LUO received the bachelor's and master's degrees from Nanjing University of Posts and Telecommunications, and the Ph.D. degree from The University of Sydney, all in electrical engineering. He is currently an Associate Professor in information technology with The University of Newcastle. His main research interests include image processing, computer vision, machine learning, cybersecurity, and media data mining. His diverse research focus has led him to conduct studies in areas ranging from medical imaging for computer-aided diagnoses to computer vision for intelligent driving systems and machine learning for enhancing cybersecurity.
XIAOYAN YANG (Member, IEEE) received the Bachelor of Engineering degree in mechatronic engineering from Beijing Jiaotong University, in 2019, the Bachelor of Engineering degree (Hons.) from the University of Wollongong, Australia, and the master's degree in data science from The University of Sydney, Australia, in 2021.
MARANATHA CONSUELO REYES received the B.S. degree in mathematics and the M.S. degree in applied mathematics from the University of the Philippines, and the Master of Data Science degree from The University of Sydney, Australia, in 2020. VOLUME 9, 2021