Introduction
Recent advances in several fields have led to large amounts of collected data [1]. Since manually analyzing such volumes of data to extract useful information is impractical, data mining techniques can be used to discover valuable and significant knowledge from the data [2]. It is well known that universities operate in a very complex and highly competitive environment [3], [4]. The main challenge for universities is to examine their performance in depth, identify their uniqueness, and build strategies for further development and future achievement [5]. The educational sector has recognized the potential of data mining to improve its performance dramatically.
Educational Data Mining (EDM) is the application of data mining methods to the data available at educational institutions [6]. Data mining leads to knowledge discovery, and machine learning algorithms provide the tools needed for this purpose. Accurate prediction of students' performance is useful because it helps to identify students with low academic achievement at an early stage of their studies [7], [8]. Educational data mining helps educational organizations deepen their understanding of the learning process by analyzing the related educational data [9], [10]. In fact, predicting student academic performance is indispensable for student academic progression, and it is also challenging because of the many factors that affect students' performance [11], [12]. In recent years, researchers have introduced new strategies for educational data mining.
There have been numerous studies in the education field. In 2008, [13] introduced an Artificial Neural Network (ANN) model using a sample of 1,407 students' profiles to predict their performance. The proposed algorithm was trained and tested with the hold-out method, which is one of the most popular cross-validation techniques. It should be noted that other studies have also implemented the Artificial Neural Network algorithm as a predictive model. In 2015, [14] developed two different Artificial Neural Network models. The results of this research indicated that the Artificial Neural Network model could predict 95% of students' performance accurately, which shows the effectiveness of this model in prediction. Furthermore, [15] tested the Artificial Neural Network model with an overall accuracy of 84.6%, which demonstrates the potential of this model in predicting students' performance. Other machine learning models have also been developed. [16] built a Naive Bayes model using data from 700 students to predict their performance. Also, [17] used Decision Tree models with an overall correct classification percentage of 60.5%. In addition, this research identified the essential features using a feature importance method.
Some studies have introduced different machine learning and data mining models and compared them with one another. In 2014, [18] applied various machine learning models to predict students' performance; the results show that the Decision Tree obtained the best performance among the models. Also, [19] assessed the performance of different classifiers such as Logistic Regression, Support Vector Machine, Decision Tree, Artificial Neural Network, Naive Bayes, and K-Nearest Neighbor; moreover, a feature selection method was used to increase the models' accuracy. Furthermore, [20] compared the performance of the Support Vector Machine, Logistic Regression, Naive Bayes, Random Forest, and XG-Boost data mining methods. Similarly, [21] studied the differences among the performance of Artificial Neural Network, XG-Boost, Random Forest, and Logistic Regression; the results of this research show that XG-Boost demonstrated excellent predictive accuracy. Notably, all of these studies used the hold-out method, which is the most straightforward cross-validation technique.
K-fold cross-validation is a reliable validation method that is used less often than the hold-out method in the field of educational data mining. In 2013, [22] implemented and compared the Decision Tree, Naive Bayes, and K-Nearest Neighbor models using 10-fold cross-validation. Furthermore, other studies, such as [23], [24], used k-fold cross-validation to compare different data mining models for the purpose of predicting students' performance.
Improving a model's predictive accuracy can be challenging, and different factors influence it; using feature selection and handling the imbalanced class distribution problem are among the essential ones. Imbalanced class distribution is a common problem in educational data and can severely affect models' performance. Therefore, [25] developed Decision Tree and Logistic Regression models to predict students' performance while handling the imbalanced class problem. [26] concentrated on developing different algorithms while using a feature importance method and the SMOTE oversampling method as a way to solve the imbalanced data problem. Moreover, [27] compared the random oversampling and SMOTE balancing methods along with four popular data mining models to assess students' performance. Choosing a way to solve the imbalanced data problem can be challenging, and many resampling methods are available; however, no study has yet compared these methods with each other.
A summarized list of research works on educational data mining and predicting students’ performance is presented in Table 1.
In summary, despite the importance of the imbalanced data problem, a comprehensive comparison of the popular resampling methods used to handle it is lacking. This paper studies the impact of the imbalanced data problem on the performance of machine learning models. It applies different resampling methods to solve the problem and compares them across various machine learning classifiers to fill this gap in the literature. The main contributions and key steps of this research, compared to similar works, include:
Applying feature scaling to normalize the variety of independent data features.
Implementing and comparing different resampling methods, namely Borderline SMOTE, Random Over Sampler, SMOTE, SMOTE-ENN, SVM-SMOTE, and SMOTE-Tomek.
Applying different model validation methods, namely Random Hold-Out and Shuffle K-fold cross-validation methods, to perform the validation step.
Comparing the performance of the resampling methods using various machine learning models, namely Logistic Regression, K-Nearest Neighbor, Support Vector Machine, Naive Bayes, Artificial Neural Network, Decision Tree, and XG-Boost.
Measuring the performance of the implemented models using different evaluation measures such as Accuracy, Recall, Precision, and F1-Score.
Showing the effect of the resampling methods on the classifiers’ performance.
Analyzing and examining the differences between resampling methods and indicating the best method among others using the Friedman test as a statistical significance test.
Investigating the difference between multiclass and binary classification and the importance of the features’ structure.
The remainder of this paper is organized as follows: the next section explains the methodology, the datasets, and all the preprocessing operations, including the different methods for solving the imbalanced data problem. In Section 3, the implemented predictive models are introduced. Section 4 describes the validation methods used to evaluate how well the statistical analysis results generalize. In Section 5, the employed evaluation measures are described. Section 6 presents the results and a complete analysis of the performance of the different resampling methods across the various machine learning classifiers. Finally, Section 7 presents the conclusions and recommends some directions for future research.
Material & Methods
This paper compares different resampling methods for handling the imbalanced data problem in order to find the best approach and classifier for predicting students' performance. Examining the difference between multiclass and binary classification and the importance of the features' structure is another goal of this research. The steps of the applied methodology are as follows:
Data Collection
Data Preprocessing
Handling Imbalanced dataset
Implementing Predictive Models
Analyzing the Results
A. Dataset Information
This research uses two different educational datasets, from educational institutions in Iran and Portugal [23]. In the Iran dataset, all available information about postgraduate students was collected and registered manually at Iran University of Science and Technology between the 1992-93 and 2014-15 academic years. This dataset consists of a set of factors that can affect students' performance and includes information on 650 students with 19 different attributes. In the Portugal dataset, the information relates to student achievement in secondary education at two Portuguese schools and covers 394 students with 19 different attributes. The output variable in this study is the final GPA. For both datasets, the output attribute is divided into four categories based on the students' grade point average: Poor, Medium, Good, and Excellent; this paper therefore faces a multiclass classification problem. Table 2 presents the main features of these datasets. Using these two datasets helps to illustrate the imbalanced data problem at different educational levels, to compare the resampling methods more thoroughly, and to obtain more trustworthy results. Moreover, the different structures of the two datasets allow a more comprehensive analysis of the effect of dataset composition.
B. Data Preprocessing
One of the most significant steps in machine learning is data preprocessing. This step transforms raw data into a proper and understandable format. Real-world datasets often contain errors; this step resolves them and makes the datasets easier to handle [28]. Fortunately, handling missing data as a preprocessing step is not needed, because the datasets used in this research contain no missing values.
1) Imbalanced Data Problem
The imbalanced data problem occurs in many real-world datasets where the class distribution is highly skewed. It is important to note that most machine learning models work best when the number of instances of each class is approximately equal [29]. The imbalanced data problem causes the majority class to dominate the minority class; hence, the classifiers become biased towards the majority class, and their performance cannot be considered reliable [30].
Analyzing the introduced datasets reveals that they are highly imbalanced, and the four categories of students based on grade point average are not equally represented. The Iran dataset contains more samples from the Medium (40% of samples) and Good (40% of samples) classes, while the other two classes have fewer samples (the Poor class with 11% of samples and the Excellent class with only 9% of samples). The Portugal dataset contains 15% of samples in the Poor class, 44% in the Medium class, 35% in the Good class, and only 6% in the Excellent class. Accordingly, it is necessary to solve the imbalanced data problem, because it may lead to unreliable outcomes. Figure 1 shows the distribution of the students' performance across the classes of both datasets.
Many strategies have been developed to handle the imbalanced data problem. The sampling-based approach is one of the most effective, and it can be classified into three categories, namely Over-Sampling [31], Under-Sampling [32], and Hybrid-Sampling [33].
a: Over-Sampling Method
Over-sampling raises the weight of the minority class by replicating existing minority class samples or creating new ones. There are several over-sampling methods; it is worth noting that the over-sampling approach is generally applied more frequently than the other approaches.
Random Over Sampler
This method increases the size of the dataset by repeating original samples. Note that the random over sampler does not create new samples, so the variety of samples does not change [34].
SMOTE
This method is a statistical technique that increases the number of minority samples in the dataset by generating new instances. This algorithm takes samples of the feature space for each target class and its nearest neighbors, and then creates new samples that combine features of the target case with features of its neighbors. The new instances are not copies of existing minority samples [35].
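As a sketch of the interpolation step, written in our own notation rather than the exact formulation of [35], each synthetic sample can be expressed as \begin{equation*} x_{\text{new}} = x_{i} + \lambda \left(\hat {x}_{i} - x_{i}\right),\qquad \lambda \sim U(0,1),\end{equation*} where \(x_{i}\) is a minority class sample, \(\hat{x}_{i}\) is one of its k nearest minority class neighbors, and \(\lambda\) is a uniform random number, so the new sample lies on the line segment between the two.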
Borderline SMOTE
This method is based on the observation that the samples on and near the borderline are more likely to be misclassified than those far from the borderline. It uses the number of majority class neighbors of each minority sample to divide the minority samples into three groups, namely Safe, Danger, and Noise. It should be noted that only the Danger group is used to generate new instances [36].
SVM-SMOTE
This method generates new minority class samples along directions from existing minority class instances towards their nearest neighbors. SVM-SMOTE focuses on creating new minority class samples near the class borderline, using an SVM model to help estimate the boundary between classes [37].
b: Under-Sampling Method
Under-sampling is one of the most straightforward strategies for handling the imbalanced data problem. It removes samples from the majority class to balance it with the minority class and is typically applied when the amount of collected data is sufficient. There are different under-sampling models; Edited Nearest Neighbours (ENN) [38] and Tomek links [39] are among the most popular.
c: Hybrid Methods
Over-sampling and under-sampling have different advantages and disadvantages. Combining the two can provide the benefits of both approaches while mitigating their drawbacks.
SMOTE-ENN
This well-known hybrid method combines SMOTE as the over-sampling step with ENN as the under-sampling step to improve the results [40].
SMOTE-Tomek
This common hybrid method combines SMOTE as the over-sampling step with Tomek links as the under-sampling step to enhance the results [40].
All resampling methods used in this paper are listed in Table 3, together with their most important parameter settings; the best results were achieved with these settings.
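As a minimal sketch of how these resampling methods could be applied in Python, the snippet below uses the imbalanced-learn library with its default parameters rather than the tuned settings of Table 3; the training arrays X_train and y_train are assumed to exist.

```python
# Sketch: applying the six resampling methods with imbalanced-learn.
# Default parameters are used here, not the tuned settings of Table 3.
from imblearn.over_sampling import (RandomOverSampler, SMOTE,
                                    BorderlineSMOTE, SVMSMOTE)
from imblearn.combine import SMOTEENN, SMOTETomek

resamplers = {
    "Random Over Sampler": RandomOverSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "Borderline SMOTE": BorderlineSMOTE(random_state=42),
    "SVM-SMOTE": SVMSMOTE(random_state=42),
    "SMOTE-ENN": SMOTEENN(random_state=42),
    "SMOTE-Tomek": SMOTETomek(random_state=42),
}

# X_train, y_train: training features and labels (assumed already preprocessed).
balanced_sets = {name: method.fit_resample(X_train, y_train)
                 for name, method in resamplers.items()}
```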
2) Feature Scaling
Feature scaling, or data normalization, is a technique that normalizes the range of the independent variables or features of a dataset. Many machine learning models rely on the Euclidean distance between data points, so they may not work well without feature scaling [41]. There are four popular ways to implement feature scaling, namely Standardization, Mean Normalization, Min-Max Scaling, and Unit Vector scaling. The value ranges of the features in the students' performance datasets used in this paper vary widely, so this paper uses the Standardization method to rescale the features. As a result, every feature is rescaled to have the characteristics of a standard normal distribution (zero mean and unit variance): \begin{equation*} \mathcal {Z}=\frac {x-\mu }{\sigma }\tag{1}\end{equation*} where x is the original feature value, \mu is the mean of that feature, and \sigma is its standard deviation.
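As a minimal sketch (assuming scikit-learn is used for preprocessing, which the paper does not state explicitly), standardization according to Eq. (1) can be applied as follows, fitting the scaler on the training split only so that no test-set information leaks into the scaling parameters.

```python
# Sketch: z-score standardization (Eq. 1) with scikit-learn's StandardScaler.
# The scaler learns the per-feature mean and standard deviation from the
# training data only and then applies the same transformation to the test data.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training mu and sigma
```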
Machine Learning Models
There are various machine learning classification models. This paper applies different classifiers, including Random Forest [43], [44], K-Nearest Neighbor [45], [46], Artificial Neural Network [47], [48], XG-Boost [49], [50], Support Vector Machine (Radial Basis Function kernel) [51], [52], Decision Tree [53], [54], Logistic Regression [55], [56], and Naïve Bayes [57]. Most machine learning classifiers support multiclass classification inherently, such as Artificial Neural Network (ANN), K-Nearest Neighbor (KNN), Random Forest (RF), Logistic Regression (LR), Decision Tree (DT), and Naïve Bayes (NB). Since Support Vector Machine (SVM) and XG-Boost do not support multiclass classification inherently, the one-vs-one method is used for the Support Vector Machine model, and the one-vs-all method is used for the XG-Boost model.
All machine learning models used in this paper are listed in Table 4, together with their specific parameter settings.
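As a hedged sketch of the one-vs-one and one-vs-all decompositions mentioned above (assuming scikit-learn's meta-estimators and the xgboost package; hyperparameters here are library defaults, not the tuned values of Table 4):

```python
# Sketch: explicit multiclass decompositions for SVM (one-vs-one) and
# XG-Boost (one-vs-all) using scikit-learn meta-estimators.
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

svm_ovo = OneVsOneClassifier(SVC(kernel="rbf"))   # one binary SVM per pair of classes
xgb_ova = OneVsRestClassifier(XGBClassifier())    # one binary booster per class

# X_train_scaled, y_train: standardized training features and encoded labels (assumed).
svm_ovo.fit(X_train_scaled, y_train)
xgb_ova.fit(X_train_scaled, y_train)
```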
Model Validation
Cross-validation is a model validation technique applied to evaluate how well the results of a statistical analysis generalize to an independent dataset. This paper uses two popular cross-validation approaches: random hold-out (which randomly assigns 80% of the data to the training set and 20% to the test set) and shuffle 5-fold cross-validation. It should be noted that resampling may only be applied to the training set; the test set classes must not be balanced. Therefore, all the resampling methods are applied only to the training data under both validation schemes.
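A minimal sketch of the two validation schemes is shown below, assuming scikit-learn and imbalanced-learn; the resampler and classifier (SVM-SMOTE and Random Forest) are chosen only for illustration, and the imblearn Pipeline ensures that resampling is performed on the training folds only.

```python
# Sketch: random hold-out and shuffle 5-fold validation, with resampling
# restricted to the training data.
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SVMSMOTE
from imblearn.pipeline import Pipeline   # applies the sampler during fit only

# X, y: full feature matrix and class labels (assumed).
# Random hold-out: 80% training, 20% test; only the training part is balanced.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr_bal, y_tr_bal = SVMSMOTE(random_state=42).fit_resample(X_tr, y_tr)

# Shuffle 5-fold cross-validation: each training fold is resampled inside the pipeline.
pipe = Pipeline([("balance", SVMSMOTE(random_state=42)),
                 ("clf", RandomForestClassifier(random_state=42))])
cv = KFold(n_splits=5, shuffle=True, random_state=42)
fold_accuracies = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
```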
Evaluation Methods
Evaluating the performance of classifiers is an essential part of comparing models and finding the best one. There are many ways to measure the performance of machine learning algorithms. This paper uses several evaluation measures, namely prediction Accuracy, Recall (Sensitivity), Precision, and F1-score; moreover, a statistical evaluation strategy is used for a more trustworthy and rigorous analysis and comparison.
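A hedged sketch of computing these measures with scikit-learn is given below; the 'weighted' averaging scheme is our assumption (it is consistent with the reported recall values matching accuracy), and y_te and y_pred are assumed to be the true and predicted labels from the hold-out step.

```python
# Sketch: the evaluation measures used in this paper, computed with scikit-learn.
# The 'weighted' average is an assumption; per-class F1-scores (as in Table 9)
# are available from classification_report.
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, classification_report)

print("Accuracy :", accuracy_score(y_te, y_pred))
print("Recall   :", recall_score(y_te, y_pred, average="weighted"))
print("Precision:", precision_score(y_te, y_pred, average="weighted"))
print("F1-score :", f1_score(y_te, y_pred, average="weighted"))
print(classification_report(y_te, y_pred))   # per-class precision, recall, F1
```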
Analyzing and comparing the classifiers' performance is a significant procedure. Although evaluation measures are simple to use, the results obtained from them alone may be misleading; therefore, finding the best model or method is a critical challenge. Statistical significance tests are designed to address this problem [58]. The repeated-measures ANOVA is the typical statistical test used to determine differences between more than two related sample means. The null hypothesis examined by the ANOVA test is that all resampling methods perform the same and the observed differences are merely random [59]. It should be noted that the ANOVA test relies on three assumptions, which are as follows:
The samples should be normally distributed.
The sample cases should be independent of each other.
The variance between the groups (the methods being compared) should be approximately equal.
This paper uses the Anderson-Darling normality test [59] to evaluate the normality of the data. This test is a modification of the Kolmogorov-Smirnov test [60]. The null hypothesis of this normality test is that the data follow a normal distribution; accordingly, if the p-value of the test is less than the significance level (0.05), the null hypothesis is rejected and the data cannot be assumed to be normally distributed.
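As a hedged sketch (the paper does not name its implementation), the normality of the cross-validated accuracy scores could be checked with the Anderson-Darling test from statsmodels, which returns a p-value directly (SciPy's anderson function returns critical values instead):

```python
# Sketch: Anderson-Darling normality test on cross-validated accuracy scores.
# fold_accuracies: accuracies collected by shuffle 5-fold cross-validation (assumed).
import numpy as np
from statsmodels.stats.diagnostic import normal_ad

stat, p_value = normal_ad(np.asarray(fold_accuracies))
is_normal = p_value >= 0.05   # reject normality when p < 0.05
```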
It is well established that the ANOVA assumptions can be violated. In that case, the Friedman test, a non-parametric alternative to the ANOVA test, can be applied to examine the differences between models and methods [61]. The null hypothesis of the Friedman test is that all resampling methods perform the same; rejecting this null hypothesis implies that one or more of the resampling methods perform differently. This paper uses the accuracy data gathered by shuffle 5-fold cross-validation for each resampling method.
The Friedman test ranks the results of each classifier for each resampling method and then analyzes the rank values [62]. Accordingly, the Friedman test produces a sum of ranks for each resampling method, which helps to identify the most effective resampling method among all of them.
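A minimal sketch of this test with SciPy is shown below; the acc_* arrays are hypothetical placeholders for the cross-validated accuracies of each resampling method, measured on the same set of classifiers (the blocks of the test).

```python
# Sketch: Friedman test across the six resampling methods.
# Each acc_* array holds one method's accuracies over the same classifiers/folds.
from scipy.stats import friedmanchisquare

stat, p_value = friedmanchisquare(acc_random_over, acc_smote, acc_borderline_smote,
                                  acc_svm_smote, acc_smote_enn, acc_smote_tomek)
if p_value < 0.05:
    print("At least one resampling method performs differently from the others.")
```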
Results & Discussion
This paper aims to show the effect of the imbalanced data problem and to handle it using various resampling methods; it also aims to determine the best resampling method and the best classifier, and to examine the difference between multiclass and binary classification as well as the importance of the features' structure. All models and methods were coded in Python, an interpreted, general-purpose, high-level programming language, and all experiments were run on a 2 GHz Intel Core i7 MacBook Pro with 4 GB of RAM. It should be pointed out that all the classifiers are first executed on the imbalanced data to show the effect of the imbalanced data problem on the models' performance. Next, all the classifiers are run on the balanced data generated by the resampling methods to give a better perception of the effectiveness of the resampling methods as ways of solving the imbalanced data problem.
A. Random Hold-Out Method Results
Table 5 shows the performance of the different classifiers on the imbalanced datasets using the random hold-out approach. Various evaluation measure methods such as Accuracy, Recall, Precision, and F1-score are used to provide a better understanding of the performance of the models.
One of the most popular evaluation techniques for measuring a classifier's performance is accuracy, the proportion of correct predictions among the total number of samples examined. Although accuracy is easy to understand, it ignores many essential factors that should be considered in assessing the performance of a classifier. In the Iran dataset, all of the accuracy results are below 60%, which reveals that none of the classifiers achieved satisfactory accuracy. The Artificial Neural Network classifier obtained 58.46% accuracy, the best result among all the models, while the worst accuracy belongs to Naïve Bayes with 46.15%. In the Portugal dataset, Random Forest, with an accuracy of 76.83%, has the best performance, and Naïve Bayes, with an accuracy of 47.72%, has the worst.
Recall is the probability of detection, indicating the proportion of items identified correctly; for example, the ANN correctly identifies 58.46% of all students in the Iran dataset. All the classifiers' recall results are similar to their accuracy results. Precision is the proportion of predicted results that are relevant. The precision results on the Iran dataset are not remarkable: the highest precision belongs to the ANN with 52.98%, and the lowest to Naïve Bayes with 27.76%. In the Portugal dataset, the results are considerably better; the highest precision goes to Random Forest with 78.44%, and the lowest to Decision Tree with 58.23%.
The F1-score, the harmonic mean of Precision and Recall, provides critical information about the classifiers' performance on each class. As stated, the class distribution is not balanced, and the majority of the data belongs to only two of the classes. Considering both datasets, the per-class F1-score results reveal that the predictive classifiers do not perform well on some of the classes. For example, the ANN and Random Forest models, which have the best accuracy results in the two datasets, fail to predict one of the classes, and the Support Vector Machine (RBF kernel) fails to predict two of the classes in the Iran dataset. Accordingly, the classifiers' performance is not acceptable.
One of the essential observations from Table 5 is the overall low performance of all the classifiers. For example, in the Iran dataset, the highest accuracy among the classifiers belongs to the ANN with 58.46%, and in the Portugal dataset, Random Forest, with 76.83%, has the best performance. Several factors can reduce the initial classification accuracy. One is the structure of the features: the initial results for the Portugal dataset are better than those for the Iran dataset, which may be because the Portugal dataset has more numeric features that help the models find patterns in the data. Another factor is the number of classes. As mentioned, this paper deals with two datasets that have four classes each. To analyze the effect of the number of classes on the initial performance, we reduced the number of classes to two. The initial accuracy results of the ML models on the new datasets are shown in Table 7. The highest accuracy in binary classification belongs to Logistic Regression with 77.69% for the Iran dataset and XG-Boost with 93.29% for the Portugal dataset. The results reveal that the ML models perform much better in binary classification, and the accuracies increase significantly. In fact, decreasing the number of classes has a great effect on the performance of the models: a high number of classes increases complexity, so models need many more samples to find the patterns in multiclass problems. This means that, given a fixed number of samples, a greater number of classes will lead to poorer results.
As mentioned, this paper works with two different datasets in a multiclass setting and tries to determine the effect of the imbalanced data problem and discover the best resampling method and classifier. The results obtained from the imbalanced data in the multiclass problem indicate that the machine learning algorithms do not give accurate results on imbalanced datasets; moreover, most of the classifiers cannot predict all the target classes. Therefore, solving the imbalanced data problem is necessary. Table 6 presents the accuracy obtained by each machine learning technique on both balanced datasets using the six resampling methods.
The accuracy results achieved on the imbalanced datasets are not acceptable. Accuracy is a useful measure when the data contain roughly the same number of samples per class; with an imbalanced set of samples, however, it is not informative, because a model can simply predict the majority classes for all samples. The accuracy results on the balanced datasets are not significantly improved. It is logical that most of the classifiers predict with lower accuracy on the balanced data, because they now consider all the classes. Since the imbalanced data problem has been handled by the resampling methods, the accuracy results can now be trusted.
Table 8 presents the results of the Recall and Precision tests together. The recall results follow the same pattern as the accuracy results, but some of the machine learning models show notable improvements in precision. For instance, in the Iran dataset, the Support Vector Machine achieved a precision of 44.80% on the imbalanced data, which increases to 57.31% on the data balanced with the SVM-SMOTE method. Moreover, in the Portugal dataset, XG-Boost obtained a precision of 64.85% on the imbalanced data, which increases to 76.32% on the data balanced with the Borderline SMOTE method. As mentioned, the F1-score is more informative for analyzing recall and precision together.
It should be noted that the classifiers do not achieve excellent F1-scores on the imbalanced data and do not perform well on all the classes. This is an essential problem that is addressed by handling the imbalanced data problem. After applying the different resampling methods, the results show that the classifiers no longer ignore any class, and all four classes are predicted and analyzed in both datasets. This is one of the most significant reasons for using balanced data. For example, the Artificial Neural Network model ignores one of the classes on the imbalanced datasets, but after the imbalance problem is solved, it considers all the classes. Table 9 presents the F1-score results for all applied machine learning models.
B. Shuffle 5-Fold Cross-Validation Results
This paper utilizes shuffle 5-fold cross-validation, which splits the dataset into five subsets, uses each subset in turn as the test set with the remaining four as the training set, and thus repeats the hold-out procedure five times. Table 10 shows the average accuracy and variance obtained by the machine learning models under this validation method.
The results of shuffle 5-fold cross-validation are more trustworthy because of the way this strategy works. They show that, after solving the imbalanced data problem, there is a slight improvement in some of the models' accuracies. The results obtained on the data balanced with SVM-SMOTE are significantly better than those on the other balanced datasets. In the Iran dataset, the Random Forest classifier achieved an accuracy of 73%, which is acceptable and better than the other models. In the Portugal dataset, Random Forest reached an impressive accuracy of 81.27%, the best performance among all the models in all situations. Regarding the performance of the classifiers with the other resampling methods, Random Forest achieved excellent results on almost all the balanced datasets.
C. Statistical Test Results
The various resampling methods produce different balanced datasets, and the classifiers perform differently on each of them. Therefore, it is difficult to identify the resampling method that yields the best results from the machine learning models.
Statistical significance tests assist in choosing the best resampling method. As stated, this paper uses the accuracy data collected by shuffle 5-fold cross-validation for each resampling method and each machine learning model. The normality assumption should be checked before applying the ANOVA test. The Anderson-Darling normality test on the shuffle 5-fold cross-validation results indicates that the p-value is less than 0.05 for both datasets, so the data cannot be assumed to be normally distributed.
Since the ANOVA normality assumption is violated, the Friedman test is applied for comparing the resampling methods instead of the ANOVA test in both datasets. Table 12 displays the results of the Friedman test.
These results show that the p-value for both datasets is less than the significance level (0.05); therefore, the null hypothesis is rejected, and at least one of the resampling methods performs differently from the others.
Table 13 presents the median and the sum of ranks derived from the Friedman test for both datasets. The median is the midpoint of the data.
The data points of each resampling method are split equally above and below the midpoint value, and the overall median is the midpoint of all data points. The median response for the SVM-SMOTE method is considerably higher than the overall median in both datasets. Moreover, the sum of ranks for the SVM-SMOTE method is better than for the other resampling methods in both datasets. These results confirm that the SVM-SMOTE method might be more effective than the other methods.
Conclusion
The recent improvements in numerous areas have led to the collection of a considerable amount of data. Today, educational institutions collect information about students. One of the main challenges for these institutions is analyzing and predicting their students’ performance. Educational Data mining is a robust analytical method that can be used to discover significant and meaningful knowledge from educational data; however, it can face some difficulties such as imbalanced educational data problems in predicting students’ performance.
This study aims to show the effect of the imbalanced data problem and to find the best among the different resampling methods for handling it, namely Borderline SMOTE, Random Over Sampler, SMOTE, SVM-SMOTE, SMOTE-ENN, and SMOTE-Tomek. Two different datasets related to students' performance are used, and the difference between multiclass and binary classification as well as the structure of the features are considered. Several classifiers are applied in order to draw a better conclusion about the resampling methods. All the classifiers are first run with the random hold-out method on the imbalanced datasets. The results show that the classifiers do not make acceptable predictions on imbalanced data and cannot predict some of the classes at all. Moreover, the results obtained with different evaluation metrics indicate that a smaller number of classes leads to better performance of the machine learning models, and that more numeric features also help the models perform better. Using the random hold-out method on the balanced data generated by the various resampling methods shows that the performance of some classifiers improves and that all the classes are predicted, so the classifiers' performance becomes satisfactory. Moreover, shuffle 5-fold cross-validation is used to obtain more reliable accuracy results. The results of this validation method indicate that the classifiers perform differently on the different balanced datasets; therefore, selecting the best resampling method is not easy. However, the classifiers work better on the data balanced by the SVM-SMOTE method in both the Iran and Portugal datasets. This paper used the Friedman test to choose the best resampling method, and its results confirm that SVM-SMOTE performs better than the other resampling methods. Also, the Random Forest model achieved the best results among the classifiers while using the SVM-SMOTE resampling method.
This study can be extended in several directions. New ensemble and hybrid classifiers can be introduced to allow a broader comparison and to achieve higher performance. Additionally, feature selection methods can be applied as a way of improving the models' results and gaining a better perspective on the most significant features.