An Integrated Framework Based on Latent Variational Autoencoder for Providing Early Warning of At-Risk Students

The rapid development of learning technologies has made online learning popular in both higher education and K-12, and predicting student performance has become one of the most active research topics in education. However, traditional prediction algorithms were designed for balanced datasets, whereas educational datasets are typically highly imbalanced, which makes it difficult to identify at-risk students accurately. To address this problem, this study proposes an integrated framework (LVAEPre) that combines a latent variational autoencoder (LVAE) with a deep neural network (DNN) to alleviate the imbalanced distribution of educational datasets and to provide early warning of at-risk students. Specifically, with the characteristics of educational data in mind, LVAE learns the latent distribution of at-risk students and generates at-risk samples in order to obtain a balanced dataset, and the DNN performs the final performance prediction. Extensive experiments on a collected K-12 dataset show that LVAEPre can effectively handle the imbalanced educational dataset and provides better and more stable prediction results than the baseline methods in terms of accuracy and $F_{1.5}$ score. A comparison of t-SNE visualizations further confirms the advantage of LVAE in dealing with the imbalance issue in educational datasets. Finally, based on the significant predictors of LVAEPre identified in the experimental dataset, some suggestions for designing pedagogical interventions are put forward.


I. INTRODUCTION
Technology innovation is reshaping the world at an astonishing speed, and education is no exception. The rapid development of learning technologies has enabled online education to gain great popularity. According to a report about enrollments in higher education in 2018, more than 6.3 million students in the United States took at least one online course [1]. In addition, the NMC Horizon Report also stated that online learning had experienced a significant growth surge and that more than 2.7 million American K-12 students chose to take online courses [2]. Although enrollments in online learning have continued to increase in recent years, online learning institutions and platforms also face a more serious challenge of high dropout rates than their traditional counterparts [3]. Therefore, the major concern of administrators and instructors in online education is how to provide timely interventions to address this problem.
The associate editor coordinating the review of this manuscript and approving it for publication was Xiao-Sheng Si.
Thanks to the capability of online learning systems to track and store students' online activities, a feasible way to tackle the above issues is to gain deep insights by analyzing the logs of online courses and then constructing models that support instruction-related decision-making, which has attracted many research efforts [4]-[6]. Performance prediction, or early warning prediction, is one of the most important and interesting research topics among them [7], [8]. However, some research gaps in performance prediction remain unaddressed.
Learning performance prediction is fundamentally a classification problem. It usually applies classification algorithms to online learning activities (i.e. online behaviors and/or discussions) [6], [9]-[16]. However, traditional classification algorithms, which are normally designed to maximize overall accuracy, are suitable for balanced datasets rather than imbalanced ones [17], [18], while educational datasets are often highly imbalanced [18], [19]. In an imbalanced dataset, the target cases make up only a very small portion of the population compared with the non-target cases, which usually results in poor prediction because the minority category is almost inundated [17], [18]. For example, for an imbalanced dataset with a 5% at-risk rate, a model that simply predicts all students as successful still reaches an accuracy of 0.95, yet it fails to identify any at-risk student. Therefore, it is critical to address the imbalance issue in order to improve prediction performance on the minority class.
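The arithmetic behind this example can be checked with a short sketch. The cohort size of 1000 students and the function name are illustrative assumptions, not values taken from the study.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall) with label 1 = at-risk and 0 = successful."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_pos = sum(y_true)
    accuracy = correct / len(y_true)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall

# Hypothetical cohort: 1000 students with a 5% at-risk rate.
y_true = [1] * 50 + [0] * 950
# A trivial model that labels every student as successful.
y_pred = [0] * 1000

accuracy, recall = accuracy_and_recall(y_true, y_pred)
```

Accuracy alone looks strong here, while the recall on the at-risk class shows that no at-risk student is flagged.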
To avoid the minority class being inundated, specific strategies have been proposed to deal with such imbalanced classification tasks [20]. The typical resampling approaches include Random Under Sampling (RUS), Random Over Sampling (ROS), and Synthetic Minority Oversampling Technique (SMOTE) [22], [23]. However, these methods have several shortcomings in dealing with imbalanced classification problems [22], [23]. RUS suffers from information loss due to the random deletion of majority-class samples, ROS may result in overfitting on the minority class, and SMOTE may increase the overlap between classes [23]. These resampling methods may increase the number of false positive cases, which is a serious issue in education and should be avoided for two reasons: (1) inaccurate predictions place a heavy burden on instructors' interventions and lead to poor intervention outcomes; (2) a high false positive ratio means that on-track students are labeled as at-risk, which may raise many unnecessary concerns among educators and parents.
In recent years, with the rapid development of deep learning techniques, several generative models, including variational autoencoder (VAE) and generative adversarial nets (GAN), have been developed in order to generate new data that are similar to those in the original dataset in the field of computer science [24], [25]. These generative approaches may provide new perspectives for solving the imbalanced issue in many fields. However, such efforts are still at the early adoption stage in educational research.
It is reported that students with different characteristics (such as prior knowledge, demographics, personality traits, engagement or effort in online education) can achieve different learning performance [6], [14], [15]. This indicates that each student can be characterized by a set of features, and that different combinations of these features represent different types of students. Therefore, learning the latent feature distributions of at-risk students is a promising approach to identifying them. Both VAE and GAN can be used to address the imbalance issue by capturing the latent distribution of data. GAN was originally designed around the concepts of a zero-sum game and adversarial training between the generator G and the discriminator D. Many enhanced variations, such as Conditional GAN (CGAN) and Wasserstein GAN (WGAN), emerged soon afterwards [26], [27]. However, GAN raises several concerns, including unstable convergence, the collapse problem and limited controllability of the model. Therefore, this study is mainly based on the idea of VAE rather than that of GAN.
As stated earlier, different combinations of features represent different types of students. Therefore, it is possible to identify at-risk students by learning the latent feature distributions of this kind of student. The distribution of at-risk students can be represented as a probability distribution. According to the Gaussian Mixture model (GMM) [28], any distribution can be represented by an infinite-dimensional Gaussian distribution [29]. Accordingly, the probability distribution that denotes the latent features of at-risk students can be further represented by a GMM. The analysis of the feature distribution of at-risk students is thus transformed into the estimation of the Gaussian components (i.e. a set of mean and variance vectors). The components computed based on the theories of GMM and Bayesian probability should be as close as possible to the actual distribution. Sampling mean and variance vectors from this approximate distribution can then generate valid at-risk samples. The above description is similar to the variational autoencoder (VAE) [24], but this study considers the latent relationships among student characteristics in order to learn stable latent Gaussian distributions and further generate valid at-risk samples. Therefore, the sampling component in this study is called the latent variational autoencoder (LVAE).
Finally, an integrated student performance prediction framework (LVAEPre) is proposed based on LVAE in this study. This framework takes advantage of LVAE and deep neural network (DNN) in order to alleviate the imbalanced distributions of educational dataset and further provide early warning prediction of at-risk students. Specifically, LVAE component mainly aims to learn the latent feature distribution of at-risk students and generate some at-risk samples for the purpose of obtaining a balanced dataset. Due to the outstanding prediction performance of DNN [16], [30], it has been applied to perform final prediction in order to explore whether the latent feature distribution of at-risk students learnt by LVAE component is helpful for accurately identifying and capturing at-risk students. Finally, the effectiveness and robustness of the proposed LVAEPre framework are verified through multiple sets of experiments.
The main contributions of this study are threefold:
1. This study shows that estimating the latent feature distribution of at-risk students is the most important step in generating valid at-risk samples. Visualizations of the resampling results based on t-distributed stochastic neighbor embedding (t-SNE) show that LVAE is an efficient approach to dealing with imbalanced education data.
2. This study proposes an integrated framework (LVAEPre) for dealing with imbalanced classification in education. An imbalanced education dataset was collected and analyzed. The experimental results indicate that the proposed framework has good generalization ability and robustness for capturing at-risk students.
3. Four significant predictors of LVAEPre in this specific K-12 dataset have been identified via the surrogate modelling approach. These predictors could provide meaningful insights for instructors designing pedagogical interventions.
The remainder of this paper is organized as follows: Section 2 reviews the most closely related literature on early warning prediction and imbalanced classification problems. Section 3 describes the proposed framework (LVAEPre) in detail. Experimental results based on the collected dataset are presented and discussed in Section 4. Finally, Section 5 outlines the conclusion and future work.

A. EARLY WARNING PREDICTION
Early warning prediction studies should provide accurate prediction outcomes at an early stage. However, a previous study [16] pointed out that most performance prediction studies used behaviors aggregated at the end of the semester for predictive modelling. Because the accumulation levels during a course differ from those at its end, models built on behavioral frequencies accumulated at the end of a course cannot deliver genuine ''early warning'' predictions. Such studies are better suited to identifying key factors than to predicting performance. Therefore, the following subsections focus on analyzing and reporting the results of early warning studies only.

1) ADOPTED INPUT VARIABLES
In terms of input variables, some early warning studies adopted static variables to predict performance [31]-[34]. Static data usually include student demographics, self-report data and historical educational records, whose values are not updated frequently. Because static data can be gathered before a course or semester starts, constructing an early warning model based on static data is a popular way to provide the instructor with a list of potentially at-risk students before a course starts. These studies identified at-risk factors related to socioeconomic status [31], [32], historical academic records [33], [34], and gender [31]. However, static variables ignore students' actual efforts in the course. Therefore, prediction models based on static data cannot provide accurate predictions.
In recent years, with the popularity of online learning, many researchers adopted online learning activities (i.e. online behaviors or discussions) for early warning prediction [6], [9]- [16]. These studies extracted variables from online learning activities for modelling, including total frequency or time spent in the Learning Management System, frequency of the content accessed, frequency of the discussions posted, frequency of the grade checked, numbers of files received and viewed, the number of assignments completed and textual features [6], [9]- [16]. Some studies also reported significant predictors, including total time spent on LMS [14], total frequencies [15], number of postings [6], and discussion board visit frequency [6], [14]. This indicates that there are significant differences between successful and at-risk students in these significant activities. Therefore, collecting online learning logs is a feasible solution to reflect students' learning process and efforts. Furthermore, the distribution of significant features of at-risk students is different from that of successful students.

2) EVALUATION METRICS
Many early warning studies adopted indicators of overall performance, such as accuracy, Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Akaike Information Criterion (AIC) [34]-[36]. However, since the goal is to identify potentially at-risk students, indicators such as recall, F-measure, and ROC (Receiver Operating Characteristic) are more appropriate. The study in [32] collected students' personal and social factors to predict academic performance, but the experimental results showed that four different prediction models all achieved less than 40% overall accuracy. The study in [37] reported that its prediction models overfitted seriously and achieved less than 70% accuracy on the testing data. These models might need to be further improved, as they produced many false early warning signals and missed a large portion of the actual at-risk students. Without considering the imbalanced characteristic of educational datasets, it may be challenging to achieve satisfactory prediction performance. Therefore, it is necessary to select appropriate indicators to evaluate a model's prediction performance on imbalanced education datasets.

3) PREDICTION METHODS
It is found that the majority of early warning studies have applied traditional machine learning algorithms, such as Regression, Decision Tree, Naïve Bayes, Support Vector Machine, Neural Network, K-nearest neighbor and Random Forest, for constructing performance prediction models [6], [9]- [15].
On the other hand, deep learning, as a promising branch of machine learning, has been widely used in audio recognition [38], image classification [39], and e-commerce recommendations [40]. However, deep learning is relatively new to educational research. For example, RNN was adopted for knowledge tracing in intelligent tutoring systems [41]-[43]; CNN was applied to extract textual features of learning resources for content-based recommendation [44]; DNN was employed to predict performance, and the results indicated that DNN models outperformed traditional machine learning algorithms in identifying at-risk students [16]. These studies have shown the great potential of deep learning to outperform other machine learning algorithms. Therefore, it is promising to construct prediction models based on DNN in order to improve prediction performance.

B. IMBALANCED CLASSIFICATION PROBLEMS
When predicting academic performance or dropout, the collected education data are often imbalanced, which results in an imbalanced classification problem. Consider an imbalanced education dataset consisting of $N$ samples (i.e. students): $D = \{(x_i, C_i),\ i \in [1, N]\}$, where $x_i$ denotes the input features of the $i$th student and $C_i$ is the corresponding class label. In this study, we consider the early warning prediction problem as a binary classification issue (i.e. $C = \{0, 1\}$), and assume the negative (positive) class to be the majority (minority) class. In many educational cases, the number of negative samples is much larger than the number of positive samples. If a prediction model simply classifies all the samples as the majority class, it may still obtain high overall classification accuracy [22]. However, such a model has no practical application value.
To date, many research efforts have focused on addressing imbalanced classification problems at the data level [20], an approach called ''resampling''. The basic idea is to resample either the majority class or the minority class in order to obtain relatively balanced distributions among classes. RUS and ROS are two widely used resampling methods [20]. RUS randomly deletes majority samples in order to balance the distributions of the two classes. In contrast, ROS randomly selects samples from the minority class and duplicates them to achieve a balance. Many researchers have pointed out that these two random resampling methods are not good solutions for imbalanced classification problems, because RUS suffers from information loss due to the random deletion of majority-class samples, while ROS may result in overfitting on the minority class [22], [23]. SMOTE, an improvement on ROS, creates artificial samples at random positions along the line joining a minority sample and a selected nearest neighbor [45]. However, SMOTE can significantly increase the overlap between classes, which makes classification more difficult [22].
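The three resampling schemes described above can be sketched in a few lines. This is a minimal illustration on toy two-dimensional points, not the implementation used in the study; the function names, the toy data and the choice of `k` are assumptions made for the example.

```python
import random

def random_under_sample(majority, minority, rng):
    """RUS: keep a random subset of the majority class, equal in size to the minority."""
    return rng.sample(majority, len(minority)), list(minority)

def random_over_sample(majority, minority, rng):
    """ROS: duplicate randomly chosen minority samples until the classes match."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra

def smote_sample(minority, k, rng):
    """Create one synthetic point on the segment between a minority sample
    and one of its k nearest minority neighbors."""
    base = rng.choice(minority)
    neighbors = sorted(
        (p for p in minority if p is not base),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
    )[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbor)]

# Toy data (hypothetical): 20 majority points and 4 minority points.
rng = random.Random(0)
majority = [[rng.random(), rng.random()] for _ in range(20)]
minority = [[0.9, 0.9], [0.8, 0.95], [0.95, 0.85], [0.85, 0.8]]

rus_major, rus_minor = random_under_sample(majority, minority, rng)
ros_major, ros_minor = random_over_sample(majority, minority, rng)
synthetic = smote_sample(minority, k=2, rng=rng)
```

RUS discards 16 of the 20 majority points, ROS repeats existing minority points, and the SMOTE point always lies between two existing minority points, which is why it can drift into regions shared with the majority class.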
Considering that ensemble techniques can improve the classification performance of weak classifiers, several advanced ensemble-based methods (such as SMOTEBoost and RUSBoost) have been proposed to address the above issues of random-based methods [23], [45]. SMOTEBoost combines SMOTE with the standard boosting procedure to improve the classification performance on minority samples by increasing the weights of misclassified minority samples [45]. RUSBoost is very similar to SMOTEBoost; the only difference is the method used to rebalance the dataset. RUSBoost applies RUS, which randomly removes samples from the majority class [23]. Therefore, RUSBoost decreases the size of the training set, while SMOTEBoost increases it [23], [45]. The experimental results indicated that SMOTEBoost and RUSBoost performed better than AdaBoost, RUS and SMOTE in handling imbalanced data [23]. Although these ensemble methods (i.e. SMOTEBoost and RUSBoost) were proposed long ago, there are few applications and discussions of them in the field of education, where imbalanced datasets are quite common.

C. SUMMARY
The literature reveals that: (1) Most performance prediction studies are better suited to identifying important factors than to providing early warning. (2) Without considering the imbalanced characteristic of educational datasets, it may be challenging for a predictive model to achieve satisfactory prediction performance. (3) Many early warning studies have adopted biased indicators to evaluate model performance. (4) Deep learning has shown great potential for improving prediction ability. (5) Although there are many traditional resampling methods for adjusting the distributions of imbalanced data, few studies have examined whether these methods work well in the field of education.
In this study, an integrated framework (LVAEPre) is proposed in order to address the above research gaps. This framework aims at adjusting the distributions of imbalanced education data, constructing early warning prediction model and providing early warning predictions.

III. THE PROPOSED LVAEPre FRAMEWORK
This section begins with an overview of the proposed framework (LVAEPre) and then focuses on introducing each component in detail.

A. THE OVERVIEW OF LVAEPre
The architecture of LVAEPre is shown in Figure 1. It consists of three components: data preprocessing, the LVAE component and the prediction method. The data preprocessing component is responsible for transforming the raw logs into appropriate data forms for subsequent modelling and analysis. Then the LVAE component generates at-risk samples based on the latent feature distribution of at-risk students. Finally, a DNN is employed to construct the prediction model for providing early warning of at-risk students.
Ideally, the prediction results would be fed back to the LVAE module to adjust its parameters for better prediction. However, this paper focuses only on the prediction performance based on LVAE, so the feedback part is left for future work.

B. DATA PREPROCESSING
Firstly, a unique ID that combines student ID and course ID is used to link all types of data sources (such as grade data, behavioral data and discussions) together to complete the cleaning of the data logs. Secondly, it is critical to find a way to generate behavioral features from the log data. Considering that different courses, and even the same course designed by different instructors, may have different learning activity designs and requirements, extracting candidate learning features based on the statistics of learning activity categories in the raw logs is recommended in order to avoid extremely sparse data. In addition, it is difficult to find a generalized threshold for a specific learning activity. For example, one student in class A accessed the learning system 100 times, which was a very high level of engagement in that class. Another student in class B also accessed it 100 times, but did not meet the requirement of his or her class. Therefore, this study adopts a modified normalization method that normalizes the values of students' learning behaviors into the range 0-1 within each course. Because it takes the characteristics of educational data into account and makes the participation levels of different courses comparable, the proposed transformation is more appropriate for educational data.
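The within-course normalization can be illustrated with a minimal sketch. The text does not give the exact formula of the modified normalization, so plain min-max scaling per course is assumed here; the record layout, the `logins` feature and all values are hypothetical.

```python
from collections import defaultdict

def normalize_within_course(records, feature):
    """Min-max scale `feature` to [0, 1] separately for each course."""
    by_course = defaultdict(list)
    for rec in records:
        by_course[rec["course"]].append(rec[feature])
    scaled = []
    for rec in records:
        values = by_course[rec["course"]]
        low, high = min(values), max(values)
        # If every student in a course has the same value, map it to 0.
        value = (rec[feature] - low) / (high - low) if high > low else 0.0
        scaled.append({**rec, feature: value})
    return scaled

# Hypothetical logs: 100 accesses is the top value in course A
# but the lowest value in course B.
records = [
    {"student": "a1", "course": "A", "logins": 100},
    {"student": "a2", "course": "A", "logins": 20},
    {"student": "a3", "course": "A", "logins": 60},
    {"student": "b1", "course": "B", "logins": 100},
    {"student": "b2", "course": "B", "logins": 400},
    {"student": "b3", "course": "B", "logins": 250},
]
scaled = normalize_within_course(records, "logins")
```

After scaling, the same raw count of 100 accesses maps to the top of the range in course A and to the bottom in course B, which mirrors the class A and class B example in the text.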

C. RESAMPLING MECHANISM OF LVAE COMPONENT
Assume the dataset consists of $N$ i.i.d. samples of at-risk students $(x_1, x_2, \ldots, x_N)$, and that student $n$ is represented by a $K$-dimensional vector $x_n = (x_n^1, x_n^2, \ldots, x_n^K)$, which includes prior knowledge, demographics, personality traits, online learning behaviors, textual information extracted from online discussions and other information. The latent features behind the distribution of the variables in each vector reflect the student's intrinsic characteristics as well as his/her learning process and status.
Considering that a given distribution can be synthesized by an infinite-dimensional Gaussian distribution [29], we assume that the feature distribution of the at-risk students $p(x)$ is represented by a Gaussian Mixture model shown as
$$p(x) = \int_z p(z)\, p(x \mid z)\, dz \tag{1}$$
where $z$ is a vector sampled from a latent space following a standard normal distribution. The conditional distribution $p(x \mid z)$ is also a Gaussian distribution with mean $\mu(z)$ and variance $\sigma(z)$. Accordingly, $p(z)$ represents the weight of the distribution $p(x \mid z)$. We hope the generated cases have the same latent features as the samples in the dataset; in other words, the distribution of the generated cases should be the same as that of the original at-risk samples. Therefore, $p(x \mid z)$ should maximize the probability $p(x)$ of each sample in the dataset. Considering the encoder, we assume that $q(z \mid x)$ can be any distribution and is independent of $p(x)$, so $\log p(x)$ can be rewritten as equation (2).
$$\log p(x) = \int_z q(z \mid x) \log p(x)\, dz = \int_z q(z \mid x) \log\left(\frac{p(z, x)}{q(z \mid x)}\right) dz + \int_z q(z \mid x) \log\left(\frac{q(z \mid x)}{p(z \mid x)}\right) dz \tag{2}$$
The second term in equation (2) denotes the KL divergence between $q(z \mid x)$ and $p(z \mid x)$, which is always greater than or equal to 0. The first term in equation (2) is the variational lower bound $L_b$. Therefore, $\log p(x)$ can be rewritten as equation (3).
$$\log p(x) = L_b + KL\big(q(z \mid x)\,\|\,p(z \mid x)\big) \tag{3}$$
When the approximate distribution $q(z \mid x)$ is close to the real distribution $p(z \mid x)$, $\log p(x)$ is also close to $L_b$. Meanwhile, $L_b$ can be rewritten as equation (4).
$$L_b = \int_z q(z \mid x) \log\left(\frac{p(x \mid z)\, p(z)}{q(z \mid x)}\right) dz = -KL\big(q(z \mid x)\,\|\,p(z)\big) + E_{q(z \mid x)}\big[\log p(x \mid z)\big] \tag{4}$$
When the distribution $q(z \mid x)$ is also a standard normal distribution (i.e. the KL divergence between $q(z \mid x)$ and $p(z)$ is equal to 0), $L_b$ attains its maximum $E_{q(z \mid x)}[\log p(x \mid z)]$. This means that, given an at-risk sample $x$, we need to sample $z$ from the distribution $q(z \mid x)$ such that the reconstructed $x$ is similar to the original $x$ (i.e. maximizing the probability $p(x \mid z)$).
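The KL term in the lower bound has a well-known closed form when q(z|x) is a Gaussian with a diagonal covariance and the prior p(z) is a standard normal. The sketch below assumes exactly that setting; the function names are hypothetical, and the reconstruction term is passed in as a plain number rather than computed by a decoder.

```python
import math

def kl_to_standard_normal(mean, var):
    """KL( N(mean, diag(var)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mean, var))

def lower_bound(recon_log_likelihood, mean, var):
    """Single-sample estimate of L_b: reconstruction term minus the KL term."""
    return recon_log_likelihood - kl_to_standard_normal(mean, var)

# The KL term vanishes when q(z|x) is itself a standard normal ...
kl_zero = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
# ... and grows as the mean moves away from zero.
kl_shifted = kl_to_standard_normal([1.0, 0.0], [1.0, 1.0])
```

With a zero KL term the bound reduces to the reconstruction term alone, which matches the maximum described above.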
In summary, the LVAE component aims to learn the optimal latent Gaussian distribution $q(z \mid x)$ from the given at-risk samples, which is the task of the encoder network. Assuming $z_{mean}$ and $z_{var}$ denote the mean vector and the variance vector of the distribution $q(z \mid x)$, a latent vector $z$ sampled from $q(z \mid x)$ can be represented as equation (5).
$$z = z_{mean} + \sqrt{z_{var}} \odot \varepsilon \tag{5}$$
where $\varepsilon$ is sampled from a standard normal distribution. Then $p(x \mid z)$ is maximized based on the latent vectors $z$ in order to make the reconstructed at-risk sample as similar as possible to the original at-risk sample, which is the task of the decoder network in LVAE. The architecture of LVAE is shown in Figure 2. Compared with the high dimensionality of images in the field of computer science, educational data have relatively low-dimensional input features. In this study, the encoder and decoder networks therefore each have only one hidden layer.
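The sampling step can be sketched as follows, assuming that the variance vector holds per-dimension variances so that the noise is scaled by their square roots. The function name and the numeric values are illustrative only.

```python
import math
import random

def sample_latent(z_mean, z_var, rng):
    """Draw z = z_mean + sqrt(z_var) * eps with eps ~ N(0, 1) per dimension."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(z_mean, z_var)]

rng = random.Random(42)
z_mean, z_var = [0.5, -1.0], [0.04, 0.25]  # hypothetical encoder outputs
draws = [sample_latent(z_mean, z_var, rng) for _ in range(20000)]

# Empirical mean and variance of the first latent dimension.
first = [d[0] for d in draws]
emp_mean = sum(first) / len(first)
emp_var = sum((v - emp_mean) ** 2 for v in first) / len(first)
```

Because the randomness enters only through the noise term, the draw stays a deterministic function of the mean and variance vectors, which is what lets the encoder be trained by gradient descent in a standard VAE.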

D. PREDICTION METHOD
The literature review indicates that few studies have employed deep learning algorithms for early warning prediction, although deep learning shows greater potential than its traditional counterparts. This study adopts a fully connected deep neural network (DNN) as the prediction method in the LVAEPre framework. The DNN has three hidden layers with dropout and L2 regularization, and the optimal parameters of the DNN architecture are determined during the training process.
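A forward pass through such a network can be sketched without any framework. The hidden-layer widths, the ReLU and sigmoid activations, the dropout rate and the regularization strength below are all assumptions for illustration, since the text leaves them to be tuned; training itself is omitted.

```python
import math
import random

def relu(v):
    return [max(0.0, a) for a in v]

def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * a for w, a in zip(row, x)) + b for row, b in zip(weights, bias)]

def init_layer(n_in, n_out, rng):
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(x, layers, drop_rate=0.0, rng=None):
    """Three hidden ReLU layers with inverted dropout, then a sigmoid output."""
    h = x
    for weights, bias in layers[:-1]:
        h = relu(dense(h, weights, bias))
        if drop_rate > 0.0 and rng is not None:  # training mode only
            h = [0.0 if rng.random() < drop_rate else a / (1.0 - drop_rate) for a in h]
    logit = dense(h, *layers[-1])[0]
    return 1.0 / (1.0 + math.exp(-logit))

def l2_penalty(layers, lam):
    """L2 regularization term that would be added to the training loss."""
    return lam * sum(w * w for weights, _ in layers for row in weights for w in row)

rng = random.Random(0)
sizes = [10, 16, 16, 8, 1]  # 10 input features; hidden widths are assumptions
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
x = [rng.random() for _ in range(10)]  # one hypothetical normalized student vector
p_at_risk = forward(x, layers)
penalty = l2_penalty(layers, lam=1e-3)
```

Dropout is applied only when a rate and a random generator are passed in, so the same function serves for training-time and validation-time passes.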
Therefore, the proposed LVAEPre framework uses the data processed by LVAE to train the prediction model based on DNN in order to achieve more accurate identification for at-risk students. The trained prediction model needs to be verified on the validation dataset with original imbalanced ratio to demonstrate the generalization ability of the LVAEPre framework.

IV. EXPERIMENTS AND RESULTS
In this section, several experiments have been carried out based on a collected education dataset with 8.7% at-risk ratio to verify the effectiveness and robustness of the proposed LVAEPre framework. The visualizations of different resampling results based on t-SNE have also been compared to further demonstrate the advantage of LVAE in dealing with imbalanced education data. Therefore, the baseline methods, evaluation metrics and data description are introduced first. Then the experimental results are reported and discussed in detail.

A. BASELINE METHODS
LVAEPre mainly consists of LVAE and DNN. The LVAE component fulfills the resampling task based on the latent feature distribution of at-risk students, and the DNN performs the binary classification task. Therefore, the benchmark methods need to be selected from both the resampling and the prediction aspects to verify the effectiveness of the LVAEPre framework from multiple viewpoints. SMOTE and RUS are commonly used resampling methods for imbalanced classification problems. SMOTEBoost and RUSBoost combine boosting with these resampling methods in order to address the issues of SMOTE and RUS [23], [45]. Although there are many variants of boosting, the most influential one is AdaBoost [46]. The basic idea of AdaBoost is to correct the mistakes of previous weak learners [46]. Suppose the first weak learner $h_1$ is trained on the training dataset $D_1$; then the error of $h_1$ can be calculated. The error of $h_1$ is used to calculate the weight of $h_1$, and the distribution of the training dataset is updated to $D_2$, which focuses on the mistakes of $h_1$. Then a weak learner $h_2$ is trained on $D_2$. AdaBoost continues to generate weak learners and their corresponding weights until a termination condition is satisfied (for example, the error falls below a preset threshold or the number of weak learners reaches a preset limit). On the other hand, Decision Tree (DT) is often selected as the best prediction model in performance prediction studies [6], [10]. Therefore, DT is chosen to train the weak learners in AdaBoost in this study.
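One round of the reweighting described above can be sketched as follows. The update shown is the classic discrete AdaBoost rule; the text does not state which variant was used, so this choice, the function name and the five-student toy data are assumptions.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost step: weighted error of the learner, its vote weight,
    and the renormalized sample distribution for the next learner.
    Assumes the weighted error lies strictly between 0 and 1."""
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1.0 - error) / error)
    updated = [
        w * math.exp(alpha if t != p else -alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(updated)
    return error, alpha, [w / total for w in updated]

y_true = [1, 1, 0, 0, 0]
h1_pred = [1, 0, 0, 0, 0]  # h1 misses the second student
d1 = [0.2] * 5             # D1: uniform distribution over the training samples
error, alpha, d2 = adaboost_round(d1, y_true, h1_pred)
```

In the resulting distribution the misclassified student carries half of the total weight, so the next weak learner is pushed to get that student right.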
To provide a reference point for the results of the LVAEPre framework, multiple combinations of the above commonly used methods are employed as baselines. First, DT and DNN are trained without any resampling to generate the initial baseline results, which makes it possible to verify whether the resampling methods and LVAE improve prediction performance. Then the performance of the proposed LVAEPre is compared with that of the baseline methods. The baseline methods include:
• SMOTE-DT: The resampled dataset that is generated based on SMOTE is classified by DT classifier.
• SMOTEBoost: The resampled dataset that is generated based on SMOTE is classified by AdaBoost.
• SMOTE-DNN: The resampled dataset that is generated based on SMOTE is classified by DNN.
• RUS-DT: The resampled dataset that is generated based on RUS is classified by DT classifier.
• RUSBoost: The resampled dataset that is generated based on RUS is classified by AdaBoost.
• RUS-DNN: The resampled dataset that is generated based on RUS is classified by DNN.
• LVAE-DT: The resampled dataset that is generated based on LVAE is classified by DT.
• LVAE-AdaBoost: The resampled dataset that is generated based on LVAE is classified by AdaBoost.
All the experimental results will be reported later based on the original validation dataset for the purpose of facilitating comparison and analysis. VOLUME 8, 2020

B. METRICS FOR PERFORMANCE EVALUATION
Overall prediction accuracy is commonly used to measure performance prediction. However, datasets for early warning prediction are typically imbalanced or highly imbalanced, so it is not appropriate to measure a model's prediction performance by overall accuracy alone. If a model simply predicts all students as successful, it obtains a high accuracy rate but provides no useful early warning signals.
In addition, it is crucial to accurately identify the minority class without sacrificing the benefits of the majority class in an imbalanced education dataset. Precision is the proportion of predicted positive cases whose actual values are also positive, and recall is the proportion of actual positive students captured by the model. In many classification tasks, high precision and high recall cannot be achieved at the same time. Therefore, the $F_{1.5}$ score, a weighted harmonic mean of precision and recall that gives recall more weight than precision, is selected [7]. In general, the higher the $F_{1.5}$ score, the better the prediction performance of the model.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F_{\beta} = \frac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}, \qquad \beta = 1.5$$
where ''positive'' denotes an at-risk student and ''negative'' denotes a successful student. True positive (TP) is the number of at-risk students whom the model correctly predicts as at-risk. True negative (TN) is the number of successful students whom the model correctly predicts as successful. False positive (FP) is the number of successful students misjudged by the model (false early warnings), and false negative (FN) is the number of at-risk students misjudged by the model (missed at-risk students). In this study, both accuracy and $F_{1.5}$ values are used to evaluate and compare the models' overall performance. Finally, all models are optimized by the validation results to avoid overfitting.
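The two metrics can be computed directly from the four confusion counts. The sketch below uses the standard F-beta definition with beta = 1.5; the counts are hypothetical and are not results from the study.

```python
def f_beta(tp, fp, fn, beta=1.5):
    """F-beta score from confusion counts; beta > 1 weights recall above precision."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def accuracy(tp, tn, fp, fn):
    """Share of all students whose status is predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical validation counts (not results from the study).
tp, tn, fp, fn = 60, 880, 40, 20
score = f_beta(tp, fp, fn)
acc = accuracy(tp, tn, fp, fn)
```

With these counts precision is 0.6 and recall is 0.75, and the score sits closer to recall than the plain F1 would, which suits a setting where a missed at-risk student costs more than a false alarm.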

C. DATA DESCRIPTION
Data was collected from more than 600 fully online courses offered through a K-12 virtual school located in the United States. These courses were hosted on the Blackboard learning management system (LMS) in the 2014-2015 and 2015-2016 academic years and lasted for 16 weeks. The major data sources included: (1) student behavioral data, (2) student discussion posts in the discussion forums, and (3) student final grades. First, because the early warning prediction in this study was made in the middle of the semester, LMS logs recorded after the 8th week were removed. Then a unique ID combining student ID and course ID was used to link all three data sources together, which preserved 11,688 students with 10,329,074 behavioral logs and 164,745 discussion posts for analysis and modelling. After data preprocessing, ten learning features were extracted from the raw logs, so that each student could be represented by a 10-dimensional input vector. The generated features are shown in Table 1.
For early warning modelling, students' final grades, which were originally stored in numeric format, need to be transformed into a binary format. This study selects 60 as the passing score to distinguish at-risk and successful students. At-risk students are labeled as ''1'' (positive), and successful students are labeled as ''0'' (negative). This threshold yields 8.72% at-risk students, which means the imbalance ratio of the dataset is higher than 9. Based on the criteria for imbalanced datasets [47], the collected dataset is highly imbalanced, which makes it very difficult to correctly identify at-risk students. Therefore, the proposed LVAEPre framework is expected to address this imbalanced classification task.
A stratified sampling approach is employed to split the dataset into the original training and validation datasets. It is generally recommended to use 70% of the data for model training and the remainder for validation [48], [49]. The original training dataset is used to train the LVAEPre framework, while the original validation dataset, with an 8.7% at-risk ratio, is used to verify the effectiveness and robustness of the proposed framework.
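The labelling step and the resulting imbalance can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, a grade of exactly 60 is assumed to count as passing, and the class counts (1,020 at-risk and 10,668 successful students) are the ones reported later in this section.

```python
PASSING_SCORE = 60  # passing score stated in the text


def label_at_risk(final_grade, passing_score=PASSING_SCORE):
    """Return 1 (at-risk, positive) below the passing score, else 0.

    Assumption: a grade equal to the passing score counts as successful.
    """
    return 1 if final_grade < passing_score else 0


def imbalance_ratio(labels):
    """Ratio of majority (0) to minority (1) samples."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos


# Class counts reported for the collected dataset.
labels = [1] * 1020 + [0] * 10668
at_risk_share = sum(labels) / len(labels)
ratio = imbalance_ratio(labels)
print(f"at-risk share: {at_risk_share:.4f}, imbalance ratio: {ratio:.2f}")
```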
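A stratified split can be sketched with the standard library alone. The helper below is a hypothetical illustration, not the implementation used in the study: it shuffles each class separately and cuts each at the same fraction, so both subsets keep roughly the same at-risk ratio.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, valid_idx) that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        valid_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(valid_idx)


# Class counts reported for the collected dataset.
labels = [1] * 1020 + [0] * 10668
train_idx, valid_idx = stratified_split(labels, train_fraction=0.7)
valid_rate = sum(labels[i] for i in valid_idx) / len(valid_idx)
print(len(train_idx), len(valid_idx), round(valid_rate, 4))
```

Changing `train_fraction` to 0.6 or 0.8 gives the alternative splitting rules used for the robustness check.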

1) PREDICTION PERFORMANCE OF LVAEPre
After data preprocessing, the training dataset with 10-dimensional input features is fed into the LVAEPre framework and the baseline methods to train the prediction models. The validation results of the baseline methods and LVAEPre are shown in Table 2. The splitting criterion of the DT classifier is the Gini index, and the base classifier of AdaBoost is also a DT with the Gini index. The number of weak learners and the learning rate of AdaBoost are optimized by grid search. The search range for the number of weak learners is from 50 to 100 with a step size of 10, and the search range for the learning rate is from 0.5 to 1.5 with a step size of 0.1. The optimal number of weak learners and learning rate are 70 and 0.9 respectively. Similarly, the parameters of the DNN are determined by grid search. The final DNN architecture has three hidden layers with 100, 100 and 10 neurons respectively, and three dropout layers among the hidden layers and the output layer, with dropout rates of 0.5, 0.5 and 0.7 respectively, to avoid overfitting.
Firstly, the first two rows of Table 2 show that DT and DNN have poor prediction performance when LVAE, SMOTE or RUS is not used to adjust the imbalanced distribution of the original training dataset. This is especially true for DNN, which cannot capture any at-risk students at the 8th week. The last nine rows of Table 2 are the validation results of the baseline models and LVAEPre that employ LVAE, SMOTE or RUS to rebalance the distributions of the two classes. The $F_{1.5}$ scores clearly indicate that both the traditional resampling methods and the LVAE component help to improve prediction performance on the imbalanced education dataset. Therefore, adjusting the distribution of imbalanced education data can improve the prediction performance for the minority class. This finding is consistent with a previous study [9].
Then the prediction performance of different classifiers under the same resampling method is compared. AdaBoost performs significantly better than DT when the same resampling method is used. Given that AdaBoost combines many weak learners, this result is not surprising. Many studies have also claimed that ensemble methods, such as AdaBoost and Random Forest, usually perform better and are more robust than a single classifier [16]. Furthermore, Table 2 shows that deep learning models perform slightly better than AdaBoost under the same data condition, which is in line with previous findings [16], [30], [50]. It is concluded that DNN is a more promising method for building prediction models in education than traditional machine learning algorithms.
Finally, a comparison of the last nine rows shows that LVAEPre achieves the best prediction performance in terms of overall accuracy and $F_{1.5}$ score. This means the proposed LVAEPre framework can make a good tradeoff between the predictions of the two classes, so LVAEPre obtains the lowest misclassification rate together with a relatively high recall rate, as shown in Table 2.
Other methods that are based on SMOTE or RUS have high recall rates but extremely low precision rates. This indicates that these methods misclassify a high percentage of successful students in order to achieve a high capability of capturing at-risk students.
A closer look at the collected dataset shows 1,020 at-risk students and 10,668 successful students. Taking the SMOTE-DNN method as an example, the recall rate is 0.7876 and the precision rate is 0.3685. That means about 803 (1020 * 0.7876) at-risk students are captured by the model, while about 2179 (803/0.3685) students are predicted as at-risk. In other words, about 1376 (2179-803) successful students are misclassified as at-risk. Increasing false positive cases might not be a big issue in other fields, such as telephone marketing or mail marketing. Within the cost limit, a marketing campaign can focus on the population with the highest response rates to maximize profits. However, it could be a big issue in the field of education, since no one likes to be labeled as ''at-risk'', especially when he or she is on the right learning track. Furthermore, misclassifying too many successful students can also place a very heavy burden on instructors' interventions. These concerns make models based on SMOTE or RUS difficult to implement in educational practice. Therefore, LVAEPre can not only effectively handle imbalanced education data, but also provide better early warning predictions than the other baseline methods.
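The AdaBoost search space described above can be enumerated as in the sketch below. Only the two ranges and step sizes come from the text. The `grid_search` helper and the `toy_score` function are hypothetical stand-ins: in practice the score would be a validation metric of a trained model, whereas here a toy function is used so that the example runs without any machine learning library.

```python
from itertools import product

# Search ranges stated in the text.
n_estimators_grid = list(range(50, 101, 10))                        # 50, 60, ..., 100
learning_rate_grid = [round(0.5 + 0.1 * i, 1) for i in range(11)]   # 0.5, 0.6, ..., 1.5


def grid_search(score_fn):
    """Return the (n_estimators, learning_rate) pair with the highest score."""
    candidates = product(n_estimators_grid, learning_rate_grid)
    return max(candidates, key=lambda pair: score_fn(*pair))


def toy_score(n_estimators, learning_rate):
    """Placeholder scorer (assumption): peaks at the optimum reported in the text.

    A real scorer would train AdaBoost with these settings and return a
    validation metric such as the F-score.
    """
    return -abs(n_estimators - 70) / 10 - abs(learning_rate - 0.9)


best = grid_search(toy_score)
print(len(n_estimators_grid) * len(learning_rate_grid), "candidates, best:", best)
```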
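The SMOTE-DNN arithmetic above can be reproduced in a few lines. The recall, precision and class counts are those quoted in the text; rounding to whole students is an assumption made here to match the reported figures.

```python
n_at_risk = 1020        # at-risk students in the collected dataset
n_successful = 10668    # successful students in the collected dataset
recall = 0.7876         # SMOTE-DNN recall quoted in the text
precision = 0.3685      # SMOTE-DNN precision quoted in the text

# recall = TP / (TP + FN)  ->  TP = recall * number of at-risk students
true_positives = round(n_at_risk * recall)

# precision = TP / (TP + FP)  ->  TP + FP = TP / precision
predicted_at_risk = round(true_positives / precision)

# Successful students flagged by mistake (false early warnings).
false_positives = predicted_at_risk - true_positives
false_alarm_share = false_positives / n_successful

print(true_positives, predicted_at_risk, false_positives)  # 803 2179 1376
print(f"share of successful students flagged: {false_alarm_share:.3f}")
```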

2) COMPARISON OF VISUALIZATION RESULTS BASED ON DIFFERENT RESAMPLING APPROACHES
The above experiments show that the proposed LVAEPre framework performs better than the other baseline methods. Since LVAEPre consists of LVAE and DNN, its outstanding performance stems from these two components. Given that the above experimental results have already indicated that DNN performs better than traditional machine learning algorithms under the same data condition, the three resampling methods (i.e. LVAE, SMOTE and RUS) are compared here in order to reveal potential reasons for the outstanding performance of LVAEPre. These three resampling methods all aim to construct a new balanced dataset from the original imbalanced data. Therefore, observing the data distribution before and after resampling is perhaps the most intuitive way to compare them.
Because t-SNE is capable of capturing the local structure of high-dimensional data very well while also revealing global structure, such as the presence of clusters at several scales [51], it has been considered a powerful visualization approach that preserves both the global and local structures of data in a low-dimensional space [52]. In this study, t-SNE is employed to visualize the training datasets under the different resampling approaches. The visualization of the original imbalanced training dataset, with an 8.7% at-risk rate, is also presented as the benchmark. Figure 3 shows the comparison results. The markers in Figure 3 denote, respectively, the successful students in the original training dataset, the at-risk students in the original training dataset, the at-risk samples generated by LVAE, the at-risk samples generated by SMOTE, and the successful students remaining after resampling by RUS. Figure 3(a) not only shows the highly imbalanced character of the original training dataset, but also indicates that there is no significant distribution difference between the successful students and the small number of at-risk students. This means that it is extremely difficult to accurately capture at-risk students in the original imbalanced dataset. Figure 3(b-d) present the data distributions of the balanced training datasets generated by LVAE, SMOTE and RUS respectively. Firstly, Figure 3(b-c) show that both LVAE and SMOTE increase the overall sample size of the training dataset by generating at-risk samples, whereas RUS significantly decreases the sample size, as shown in Figure 3(d). This is consistent with the previous view in [23], [45]. Then, Figure 3(b) shows that LVAE has learnt the latent feature distribution of at-risk students very well and generated valid at-risk samples, so that the two types of students have relatively clear boundaries.
However, Figure 3(c) shows that SMOTE results in serious overlap between the two types of students, which makes a great number of successful students very similar to at-risk students. Similarly, RUS randomly deletes a large number of successful samples in order to obtain a balanced dataset, but it also makes the difference between the two types of students less obvious. Therefore, the visualization results could explain (1) why prediction models based on SMOTE and RUS misclassify many successful students as at-risk, and (2) why prediction models based on SMOTE and RUS achieve high capability (i.e. high recall rates) in capturing at-risk students. In addition, because LVAE can generate a relatively clear boundary between the two types of students, LVAEPre can achieve a high recall rate without increasing false positive cases. In general, LVAE is more promising than the other resampling methods in education. This also reveals why LVAEPre outperforms the other baseline models.
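To make the two traditional resampling baselines concrete, the sketch below shows simplified versions of both ideas using only the standard library. It is an illustration, not the implementation used in the experiments: `smote_like` interpolates between a minority sample and one of its nearest minority neighbours, which is the core idea of SMOTE, and `random_undersample` keeps a random subset of the majority class, as RUS does. The function names and the toy 2-dimensional data are hypothetical.

```python
import random


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by linear interpolation
    between a random minority sample and one of its k nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` (squared Euclidean distance)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(base, other)])
    return synthetic


def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of n_keep majority samples (no replacement)."""
    return random.Random(seed).sample(majority, n_keep)


# Toy data: 4 minority and 20 majority points.
minority = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.30], [0.05, 0.25]]
majority = [[0.5 + 0.01 * i, 0.6 + 0.01 * i] for i in range(20)]

synthetic = smote_like(minority, n_new=16, k=3)            # oversample to 20 vs 20
kept = random_undersample(majority, n_keep=len(minority))  # undersample to 4 vs 4
print(len(minority) + len(synthetic), len(kept))
```

Because every synthetic point lies on a segment between two existing minority samples, this kind of oversampling cannot move beyond the region the minority class already occupies, whereas undersampling discards majority information.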

3) ROBUSTNESS OF LVAEPre
The results in Table 2 are based on the selected stratified splitting rule (i.e. 70% for training and 30% for validation). In order to verify whether the proposed LVAEPre framework is robust and provides stable prediction results under different training samples, two additional sets of experiments based on different splitting rules (i.e. 60%/40% and 80%/20%) were also carried out. Figure 4 visualizes the validation results of the LVAEPre framework and the baseline methods in terms of four evaluation metrics under the different splitting rules (i.e. different samples for training and validating LVAEPre). Figure 4 shows that the proposed LVAEPre framework is robust across the different validation datasets, and that LVAEPre outperforms the other baseline methods under the different splitting rules in terms of overall accuracy and $F_{1.5}$ score. In addition, Figure 4 also indicates that more training samples (i.e. 70% or 80% of the dataset for training) make the advantages of the LVAEPre framework more obvious, because more training samples contain more at-risk students, which makes the latent probability distribution of at-risk students learnt by LVAE more accurate. The other findings in Figure 4 are consistent with those in Table 2. For example, using traditional resampling methods or LVAE improves a model's ability to capture at-risk students, and DNN performs better than traditional machine learning algorithms under the same resampling method.

4) IDENTIFICATION OF SIGNIFICANT PREDICTORS
The above experimental results demonstrate the effectiveness and robustness of the proposed LVAEPre framework. However, the prediction results of the LVAEPre framework are like a ''black box'', which cannot provide instructors with meaningful insights on how to design effective interventions. Therefore, this subsection seeks to open the ''black box'' via the surrogate modelling method, which is a commonly used approach to extract the significant predictors of a complex model [53]. Because it makes the decision process visible, a Decision Tree is often selected as the surrogate method to ''simulate'' the rules learned by complex models. In this study, the DT surrogate model kept the same input variables as LVAEPre, but its target variable was the prediction of LVAEPre. The DT surrogate model is able to simulate the LVAEPre results with 100% accuracy. Because the surrogate tree is very deep and complex, only the top five layers are presented in Figure 5. Figure 5 shows that the most significant factors include 'Total_Frequency', 'Discussion_word_counts', 'Check_grade' and 'Hit_count'. To enhance readability, the major at-risk paths are reported in the following.
• Rule 4-1: 1-1 + 2-1 + 3-1 + Hit <= 0.359 (0/1: 0.0694/0.9306)
Rule 1-1 means that if a student's total frequency is in the lowest 6.1% of the class, the at-risk probability increases from 7.47% to 52.40%. When Rule 1-1 is satisfied and the student's discussion word count is in the lowest 19.7% of the class, the at-risk probability further increases to 75.93% (Rule 2-1). When both Rules 1-1 and 2-1 are satisfied and the student's check grade frequency is in the lowest 17.3%, the at-risk probability increases to 89.50% (Rule 3-1). Finally, if Rules 1-1, 2-1 and 3-1 are satisfied and the student's hit frequency is in the lowest 35.9% of the class, the at-risk probability further increases to 93.06% (Rule 4-1).
Because total behavior frequency has often been used to represent behavioral engagement level [54]-[56], and many researchers have claimed that a high behavioral engagement level is positively correlated with high learning performance [54], [57], it is not surprising that a student who seldom accesses the online learning system is unlikely to perform well. In addition, the 'Discussion_word_counts' variable is a general signal of a student's discussion engagement level [58]. Long postings usually require students to invest a considerable amount of time in constructing and presenting their ideas or thoughts, which involves a high level of critical thinking to support their arguments with sufficient evidence [59], [60]. Improving students' critical thinking is helpful for understanding the relationships between concepts, using concepts to explain phenomena, and restructuring knowledge in a more coherent way [61], [62]. Researchers have also found a significant positive relationship between students' critical thinking level and academic performance [63]. In addition, the variables 'Check_grade' and 'Hit' can reflect students' learning strategies. For example, a student who frequently checks grades is likely to have high self-regulated learning skills and to often perform self-monitoring, self-reflection and self-evaluation [64], which could result in a high ability to plan, manage and control the learning process and learning performance [65]. Therefore, the identification of these significant variables could provide meaningful guidance and assistance for designing instructional activities.
In summary, through the analysis of the significant predictors of LVAEPre in this specific K-12 dataset, instructors could design intervention programs to help at-risk students in the second half of the semester, such as requiring students to frequently access the online learning system, encouraging students to share and express their opinions and thoughts during the learning process, and employing learning dashboards or other learning widgets to drive high learning engagement.
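The at-risk path described above can be written as a chain of threshold checks. The sketch below rests on assumptions: features are taken to be class-relative ranks in [0, 1], only the Rule 4-1 threshold (0.359) is printed in the rule itself, and the other three thresholds are inferred from the percentages quoted in the text. The function and constant names are hypothetical.

```python
# (rule name, feature, threshold, at-risk probability at that node)
AT_RISK_PATH = [
    ("Rule 1-1", "Total_Frequency", 0.061, 0.5240),         # threshold inferred
    ("Rule 2-1", "Discussion_word_counts", 0.197, 0.7593),  # threshold inferred
    ("Rule 3-1", "Check_grade", 0.173, 0.8950),             # threshold inferred
    ("Rule 4-1", "Hit_count", 0.359, 0.9306),               # threshold as printed
]


def deepest_at_risk_rule(student):
    """Follow the major at-risk path and return (rule, probability) for the
    deepest node the student reaches, or None if Rule 1-1 is not satisfied.

    The probability is the node-level at-risk share reported in the text;
    branches that leave this path are not modelled here.
    """
    reached = None
    for rule, feature, threshold, probability in AT_RISK_PATH:
        if student[feature] > threshold:
            break
        reached = (rule, probability)
    return reached


# Hypothetical student who is low on all four indicators.
student = {
    "Total_Frequency": 0.05,
    "Discussion_word_counts": 0.10,
    "Check_grade": 0.10,
    "Hit_count": 0.20,
}
print(deepest_at_risk_rule(student))
```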

V. CONCLUSION AND FUTURE RESEARCH
This study has proposed an integrated prediction framework (LVAEPre) to alleviate the imbalance issue of educational datasets and to provide accurate early warning prediction of at-risk students. The effectiveness and robustness of the proposed framework have been demonstrated by comparing its prediction performance with ten baseline methods. The comparison of t-SNE visualization results further confirms the advantage of LVAE in dealing with imbalanced education data. LVAEPre also has several benefits, including a higher sensitivity rate, a lower false positive error, and a lower misclassification rate (i.e. a higher overall accuracy rate). In addition, four significant predictors are identified via the surrogate modelling approach, which could provide meaningful insights for instructors to design appropriate interventions. However, due to the limitation of available datasets, more educational data from different learning contexts are needed to further verify this framework in future work. Furthermore, future research might also focus on the following directions: (1) how to generate other textual features to further improve prediction ability, and (2) how to adopt more complex target variables than the final grades used here to reflect students' learning status, such as increasing or decreasing trends or changes in prediction probability throughout the semester.