Prediction of Students’ Academic Performance in the Programming Fundamentals Course Using Long Short-Term Memory Neural Networks

In recent years, universities have shown growing interest in predicting the academic performance of their students in advance, which allows them to establish timely strategies to prevent dropout and failure. One of the biggest challenges in predicting student performance arises in the "Programming Fundamentals" course of Computer Science, Software Engineering, and Information Systems Engineering programs at Peruvian universities, owing to high student dropout rates. The objective of this research was to explore the efficiency of Long Short-Term Memory (LSTM) networks in the field of Educational Data Mining (EDM) to predict students' academic performance during the seventh, eighth, twelfth, and sixteenth weeks of the academic semester, which allowed us to identify students at risk of failing the course. This research compares several predictive models: Deep Neural Network (DNN), Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), Support Vector Classifier (SVC), and K-Nearest Neighbor (KNN). A major challenge for machine learning algorithms is class imbalance in a dataset, which leads to over-fitting on the available data and, consequently, low accuracy. We use Generative Adversarial Networks (GAN) and the Synthetic Minority Over-sampling Technique (SMOTE) to balance the data in our proposal. Experimental results based on accuracy, precision, recall, and F1-Score verify the superiority of our model, which achieves 98.3% accuracy in week 8 using LSTM-GAN, followed by DNN-GAN with 98.1% accuracy.


I. INTRODUCTION
The Peruvian university educational process faces the challenge of generating strategies that improve the quality of teaching in order to form individuals with cognitive, creative, and innovative capacities. In this sense, the need arises to analyze student dropout and failure rates due to economic, social, and cognitive factors. According to [1], the dropout rate of university students by geographical area in 2018 was: Lima-Center (13.4%), mountain (18.2%), jungle (24.6%), and coastal (24%). However, the COVID-19 pandemic increased the dropout rate, reaching 42.6% in coastal, mountain, and jungle areas and 18.1% in Lima-Center. In Engineering programs, which include Software Engineering (SE), Computer Science (CS), and Information Systems Engineering (IS), the dropout rate is between 15% and 20% [2], and failure rates range from 25% to 30% [3]. Moreover, the failure rate increases in the first course of the specialty, ranging from 25% to 35% [4], [5]. Therefore, early identification of students at risk of dropping out or failing a course involves analyzing the attributes, characteristics, or factors that influence academic performance during the academic process.
In recent years, Educational Data Mining (EDM) has gained importance due to its results in predicting academic performance, dropout, course approval, or failure [6].
EDM is a sub-area of data mining that applies statistics and machine learning to extract, process, interpret, and evaluate hidden patterns in educational datasets [7]. Education experts utilize EDM to support academic decisions that benefit students and the academic community [8].
EDM is combined with machine learning techniques such as Random Forest, Decision Tree, Support Vector Classifier, Logistic Regression, K-Nearest Neighbor, Deep Artificial Neural Network, and Convolutional Neural Network (CNN), whose aim is to generate predictive models based on the extraction of patterns from educational data [6], [9], [10], [11] and to predict students' academic performance, dropout, approval, or failure at an early stage. Based on these results, universities can apply strategies that reinforce student knowledge and reduce a course's dropout or failure rate [12].
One of the challenges of EDM is the limited amount of available data and the imbalance of the data used as input to the proposed models, which causes instability in the accuracy of the results. Unbalanced class datasets cause machine learning algorithms to converge slowly during training, overfit the available data, and generalize poorly to unseen data [13], [14].
In this research, we compare the Generative Adversarial Network [15] and the Synthetic Minority Over-sampling Technique [16] as resampling techniques to address the problem of unbalanced data and generate reliability in the results.
The data for this study were collected over three years from students in the Software Engineering, Computer Science, and Information Systems Engineering programs of two Peruvian universities. A predictive model based on Long Short-Term Memory networks was developed and compared with six models: Deep Neural Network, Decision Tree, Random Forest, Logistic Regression, Support Vector Classifier, and K-Nearest Neighbor. The proposal's framework is divided into six phases: data collection, data preprocessing, data balancing, training data, testing data, and model evaluation. This study makes the following contributions:
• Collects and preprocesses open academic data, making it available for future research.
• Evaluates two data oversampling techniques, GAN and SMOTE, for tackling the unbalanced data problem.
• Performs experimental evaluation and analysis of machine learning techniques such as Long short-term memory, Random Forest, Decision Tree, Support Vector Classifier, Logistic Regression, K-Nearest Neighbor, and Deep Artificial Neural Network.
• Evaluates quantitative performance based on accuracy, precision, recall, F1-Score, classification error, sensitivity, specificity, and the confusion matrix.
This document is organized as follows: Section II discusses related work. Section III introduces the proposed method. Section IV provides an analysis of the experimental results and discusses their implications. Finally, Section V concludes the paper and outlines future work based on the results obtained.

II. RELATED WORK
In this section, we explore and analyze EDM research, focusing on the application of various machine learning techniques. Our analysis includes performance measures, feature patterns, objectives, and algorithms used in these studies.
Recent years have seen a surge in Educational Data Mining and Machine Learning research. In their survey, the authors of [10] specifically examine the application of Artificial Neural Network techniques in EDM for predicting students' academic performance. The survey identified 21 articles, categorizing them based on objectives, education levels, predictor and output variables, algorithms, model accuracy, and key findings. They conclude that ANNs obtain accuracies above 84%. On the other hand, in [9], a systematic mapping of machine learning techniques applied to EDM was carried out. They analyzed 39 articles and concluded that ANNs are the most used, followed by SVM, LR, and KNN.
Moreover, we can observe the limited use of data-balancing algorithms in this research area. In [29], four data balancing algorithms (SMOTE, ADASYN, ROS, and SMOTE-ENN) were compared to handle unbalanced datasets and improve GB, LR, SVC, and KNN. The authors determined that the Synthetic Minority Over-sampling Technique (SMOTE) yields superior results in managing unbalanced datasets. In this context, Deep Neural Networks (DNNs) achieved an accuracy of 89%, closely followed by Random Forests at 88%. We infer that data balancing not only facilitates class equilibrium but also mitigates the bias associated with class disproportionality. Furthermore, it provides a more substantial dataset for training, thereby positively influencing performance metrics. Likewise, the amount of data and the attributes used to train predictive models vary among studies. In [24] and [25], the authors used 32,593 student records and considered 206 attributes corresponding to demographic and academic data. In [32], they used 25,541 student records collected over nine months and analyzed 20 attributes from an online platform, notably grades on activities such as forums, quizzes, and tests. In [29], they used 4,266 records and considered 12 attributes corresponding to grades from different courses. In [26], they analyzed 1,854 records and only three attributes (previous exams, school data, and faculty data). In [20], they analyzed 1,308 records with five attributes (score, vulnerability index, regime, gender, and population segment). In [33], they used 900 records with ten attributes obtained from students' interaction with an online platform. In [27], they used 842 records with ten attributes distributed between personal and academic attributes. In [17], they used 649 records with 33 attributes extracted from three categories (personal attributes, academic background, and economic background). In [28], they used 537 records; the attributes actually employed were not clearly specified. In [21] and [22], they used 480 records with sixteen attributes distributed in three categories (demographic, academic, and behavioral). In [34], they used 400 records with thirteen attributes corresponding to academic and personal data. In [35], they used 284 records with five attributes extracted from students' interaction with a virtual platform (view, post, forum view, forum post, and successful submission). In [18], they used 145 records with four attributes (repetition concept, selection concept, repetition skills, and method skills). A sufficient amount of data and an adequate number of attributes are necessary for predictive models to learn from the interaction between the data and its attributes. Predictive models such as DNN, LSTM, and CNN require large amounts of data to learn and predict correctly.
On the other hand, accuracy varies according to the predictive model used, the dataset, and the trained attributes. In [34], the authors achieved an accuracy of 98.5% by combining Naive Bayes (NB) with AdaBoost-J48. They constructed an ensemble meta-based tree model that combines a boosting method with Naive Bayes trees to predict student performance. They used the Pearson correlation method to find highly correlated attributes, with the visited-resources attribute having a high impact on the final result.
Meanwhile, in [33], the authors achieve an accuracy of 97.4% using an ANN, which they use to predict whether a student requires academic assistance in their course.
According to the proposed architecture, the ANN was designed with 4 input neurons, 12 hidden layers, and three outputs. Likewise, in [32], the authors used Long Short-Term Memory (LSTM) to predict course withdrawal. LSTM achieved an accuracy of 97.25% in predicting a student's withdrawal in week 25 and 84.15% in week 10. In [23], they used Decision Trees and Artificial Neural Networks to predict student dropout in an undergraduate program. The classification comprised two classes (promoted or not promoted) and three classes (promotion, repetition, dropout). They conclude that ANN and DT achieve an accuracy of 96.71% using all variables for both classification schemes.
Meanwhile, in [25], they propose using LSTM with RF and gradient boosting in a 4-layer architecture to predict student performance. The model was compared with eight predictive models and achieved the best accuracy, 95.40%. In [30], they propose a three-layer LSTM to predict, week by week, whether a student passes or fails a course. They used data from virtual learning environments and compared against ANN, SVM, and LR. LSTM achieved an accuracy of 93.46% in predicting a student's performance in week 8, while ANN achieved 85%, SVM 75%, and LR 80% in the same week. In [31], the researchers utilized CNN, LSTM, and SVC. The CNN was employed for feature extraction, the LSTM model retained historical data, and SVC addressed the issue of data imbalance. The authors compared their predictive model against several established models, including CNN-LSTM, CNN-RNN, RF, DT, LR, and SVC. The proposed model achieves 91.55% accuracy.
On the other hand, in [22], they propose using a Convolutional Neural Network to predict whether a student will complete their course. The authors demonstrate that CNN achieves an accuracy of 90%. However, they conclude that the small amount of data and variables affects the model's accuracy. In [29], they used Deep Neural Networks to predict undergraduate students' pass and dropout rates, comparing data balancing algorithms to handle unbalanced datasets. They concluded that the Deep Neural Network achieved 89% accuracy, followed by Random Forest with 88%. In [17], they trained an Artificial Neural Network to predict student performance using economic environment data. The authors compared three predictive models: ANN, Boosting, and Bagging. They conclude that Bagging classifiers achieve a better accuracy of 88% and consider economic data important for predicting student performance. In [21], they used a Genetic Algorithm (GA) to select features and Random Forest to classify students according to their performance. They compared six predictive models: DT, ANN, RF, Voting, Bagging, and Boosting. The authors concluded that GA+RF achieves an accuracy of 85%. However, they propose, as future work, obtaining more training data and comparing other feature selection algorithms. In [18], they predicted the concepts/skills needed to write computer programs using DT and LR. The results show that DT achieves 84% accuracy. The authors conclude that the concept of selective logic is an essential prerequisite for writing computer programs, and they also suggest evaluating the models with a larger amount of training data. In [19], the authors used Transfer Learning with a DNN in response to the limited quantity of available data to predict student performance. An accuracy rate of 86% was attained; the prediction models were not evaluated by comparative analysis. In [26], they compared six models (RF, NB, NN, SVC, LR, and KNN) to predict the final grade of undergraduate students. They found that Random Forest and Nearest Neighbors achieve 74.6% accuracy, while KNN achieves 69.9%. The authors suggest that other training variables should be used to improve the accuracy of the models. In [20], they trained an ANN with academic and socioeconomic data to predict students who fail in undergraduate programs. They used two predictive models: the first has 48 input neurons, 39 hidden layers, and one output, while the second has seven input neurons, four hidden layers, and one output. The authors achieved 74.5% accuracy with the second ANN model. However, they consider the number of variables limited for training.
In [24], they used ANN-LSTM for multi-class classification (distinction, pass, fail, and withdrawn). The architecture of the LSTM network comprises i) an input layer for 200 attributes, ii) a dense hidden layer with an output of 100 units and a ReLU activation function, and iii) an output layer with a SoftMax activation function and four output units representing the four categories: Distinction, Pass, Fail, and Withdrawn.
In [35], the authors seek to predict academic performance in online learning systems using a linear regression model and a k-means classifier. Both algorithms achieved an accuracy of 50%. The authors recommend using a dataset with more records and attributes, evaluating other algorithms, and testing additional performance measures. In [28], the results of the k-means evaluation were not reported.
Based on the related works discussed, the research gap addressed in this study is as follows. Firstly, most of the proposed models [20], [26], [27], [29], [30], [32], [33] aim to predict student performance using imbalanced data. Secondly, while many studies propose predictive models for assessing student performance at the end of a course, instructors often require weekly or even daily predictions.
In this research, seven machine learning algorithms (DNN, LSTM, DT, RF, LR, SVC, and KNN) were employed to evaluate results in terms of student performance prediction.Studies [24], [25], [28], [30], [31], [32] utilized LSTM.However, in our research, we applied and compared the prediction results using original data and synthetic data generated by two data balancing methods (SMOTE and GAN).We used 5-fold stratified cross-validation to stabilize the resulting evaluation measures.

III. RESEARCH METHOD
In this section, Fig. 1 shows the flowchart of the proposed approach, involving i) data collection, ii) data preprocessing, iii) data balancing, iv) training data, v) testing data, and vi) model validation.

A. DATA COLLECTION
The data were collected from university students in the Software Engineering, Information Systems Engineering, and Computer Science programs of two Peruvian universities from 2020 to 2022. We obtained 677 records with 13 academic attributes related to grades in programming fundamentals, linguistic comprehension, and mathematics. Table 2 presents the academic attributes of the dataset used to predict whether a student passes or fails the programming fundamentals course. Demographic and family data were not considered due to the universities' data restrictions and privacy policies.

B. DATA PRE-PROCESSING
We carried out data cleaning, data discretization, and feature encoding to obtain a unified, error-free dataset, which allows for better results.

1) DATA CLEANING
The data cleaning process was carried out to eliminate records with missing data. We identified six records with one or more missing attributes.

2) DATA DISCRETIZATION
The discretization process transformed the numerical values of student grades into nominal values. Since the goal is to classify whether a student passes or fails the programming fundamentals course, the label "Passed" covers the grade range from 12.5 to 20, while the label "Failed" covers grades from 0 to 12.49, as illustrated in Table 3.
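The discretization rule can be sketched as a small helper function (an illustrative sketch, not the authors' code; the 12.5 threshold and the 0-20 grading scale are taken from the text and Table 3):

```python
def discretize_grade(grade: float) -> str:
    """Map a numerical grade on the 0-20 scale to a nominal class label.

    Grades from 12.5 to 20 become "Passed"; grades from 0 to 12.49
    become "Failed", mirroring Table 3.
    """
    if not 0 <= grade <= 20:
        raise ValueError("grades must lie in the 0-20 range")
    return "Passed" if grade >= 12.5 else "Failed"
```

Applying this function to the grade attributes yields the binary target used throughout the rest of the pipeline.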

3) FEATURE ENCODING
In the feature encoding stage, nominal data were converted into numerical labels. Table 4 shows the attributes, their values, and their encoded labels.
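The encoding step can be illustrated with a minimal mapping of nominal values to integer codes (a sketch only; the actual encodings used in the study are those listed in Table 4):

```python
def encode_feature(values):
    """Assign each distinct nominal value an integer code (in sorted
    order) and return the encoded sequence plus the mapping used."""
    mapping = {v: code for code, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping
```

For the binary target, this yields codes consistent with the model output described later (0 for fail, 1 for pass).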

C. DATA BALANCING
To solve the problem of unbalanced data, which leads to the domination of majority classes when training and testing machine learning models, we used two techniques: i) The Synthetic Minority Over-Sampling Technique (SMOTE) [16] and ii) Generative Adversarial Networks (GAN) [15], [36].
Fig. 2 presents the imbalance of the class label, where the majority class (Pass) represents 68% of the data and the minority class (Fail) accounts for 32%.
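SMOTE's core idea, interpolating a synthetic minority sample on the segment between a real minority sample and one of its k nearest minority neighbors, can be sketched as follows (illustrative only; in practice a library implementation such as imbalanced-learn would be used):

```python
import random

def smote_interpolate(sample, neighbor, rng=random.random):
    """Create one synthetic minority-class record.

    For each feature, the synthetic value lies between the original
    minority sample and one of its k nearest minority neighbors,
    at a random fraction lam along the segment, as SMOTE prescribes.
    """
    lam = rng()  # uniform in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]
```

Repeating this for randomly chosen minority samples and neighbors produces the additional Fail-class records needed to balance the two classes.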
Fig. 3 presents the analysis of attribute correlation. There is a 51% correlation between graded practice one and the midterm exam, a 51% correlation between graded practice one and the final exam, and a 51% correlation between graded practice one and the target. Similarly, the correlation between the midterm and final exams is 64%, and between the midterm exam and the target it is 68%. Moreover, graded practice two correlates with the final exam at 55% and with the target at 62%. The correlation between the final project and class participation is 67%. Finally, a correlation of 71% is observed between the final exam and the target.

D. TRAINING DATA
For the training process, we configured seven machine learning techniques: Long Short-Term Memory, Deep Neural Network, Decision Tree, Random Forest, Logistic Regression, Support Vector Machine, and K-Nearest Neighbor. Table 5 presents the configuration parameters of the machine learning models used in this research.
Long Short-Term Memory is a Recurrent Neural Network created by Hochreiter and Schmidhuber in 1997 [37] to address the exploding and vanishing gradient problems of traditional RNN models. LSTM has been used in time series problems [38], [39], [40].
In this study, we use LSTM to predict student performance. The configuration parameters of the proposed model are presented in Table 6.
An LSTM network contains four main components: i) cell state, ii) input gate, iii) output gate, and iv) forget gate. The input gate, cell state, and output gate work together with the forget gate to update, maintain, and delete information. The architecture of an LSTM cell and its components is shown in Fig. 4.
The forget gate at time t (f_t) is designed using a neural network and a sigmoid function. It receives as input the current state at time t (X_t) and the hidden state of the previous time step (h_(t-1)), concatenates them, and applies the sigmoid function, yielding a value between 0 and 1. A value of 1 signifies that the information will be retained, while 0 indicates that it will be discarded. This process is described in Equation 1.
The input gate facilitates updating the cell's current state and comprises two steps. The first step obtains the information (i_t) by concatenating the input (X_t) and the previous hidden state (h_(t-1)) and applying the sigmoid function; the resulting value determines whether the information is retained or rejected, as described in Equation 2. The second step calculates the candidate state (p_t) by concatenating the same current input (X_t) and hidden state (h_(t-1)) and applying a tanh function, as expressed in Equation 3. The cell-state update uses information from both the forget gate and the input gate to decide and store the new cell state. The previous cell state (C_(t-1)) is multiplied by the vector (f_t): if the result is 0, the values are discarded; if the result is 1, the previous memory state is passed completely to the cell. The new state is then computed by adding the product of the vectors (i_t) and (p_t), as described in Equation 4.
The output gate determines the value of the next hidden state (h_t). First, it concatenates the previous hidden state (h_(t-1)) with the current input (X_t) and applies a sigmoid function, as shown in Equation 5. Then, the updated cell state (C_t) is passed through a tanh function and multiplied by the gate's output. Finally, the student's performance is predicted from h_t, as described in Equation 6.
In Equations 1 to 5, W represents the gate weights and b the biases. Algorithm 1 shows the pseudocode of the LSTM layer. Each student's grades are distributed over the time steps t, from week 4 to week 16, and represented in the vector X = [X_1, ..., X_t]. The values of the forget gate, input gate, cell state, and output gate are calculated, and the new state is obtained in the vector h_t. An activation function is applied to obtain the output of the LSTM layer, which predicts whether a student passes or fails the course.
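The gate computations of Equations 1 to 6 can be made concrete with a NumPy sketch of a single LSTM cell step (variable names follow the equations; the dictionary-based weight layout is an assumption for illustration, not the paper's implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step following Equations 1-6.

    W and b are dicts holding the weight matrices and bias vectors of
    the forget ("f"), input ("i"), candidate ("p"), and output ("o")
    gates; each gate acts on the concatenation of h_(t-1) and X_t.
    """
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])   # Eq. 1: forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])   # Eq. 2: input gate
    p_t = np.tanh(W["p"] @ z + b["p"])   # Eq. 3: candidate state
    c_t = f_t * c_prev + i_t * p_t       # Eq. 4: cell-state update
    o_t = sigmoid(W["o"] @ z + b["o"])   # Eq. 5: output gate
    h_t = o_t * np.tanh(c_t)             # Eq. 6: new hidden state
    return h_t, c_t
```

Iterating this step over the weekly grade vector X = [X_1, ..., X_t] yields the final hidden state from which the pass/fail prediction is made.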
The design of our LSTM network consists of i) an LSTM encoder with 64 neurons and a ReLU activation function, which receives inputs of 5 to 12 attributes depending on the week being evaluated, ii) an LSTM decoder with a 32-neuron LSTM layer and a ReLU function, and iii) a Dense layer with two neurons that connects all outputs from the previous layer through the SoftMax function. An output of 0 indicates that the model predicts the student will fail the course, and 1 indicates that the model predicts the student will pass. Fig. 5 shows the details of the LSTM architecture used in this research.

E. TESTING DATA
We applied stratified k-fold cross-validation with k=5 for all evaluated models; the procedure was run five times. This process ensures that the evaluation measures (accuracy, precision, recall, F1-Score, classification error, and confusion matrix) are obtained as the average of the five iterations generated by cross-validation. Fig. 6 shows how the evaluation measures are obtained after comparing the model's results with the test data in each iteration.
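The stratified splitting idea can be sketched as a minimal round-robin assignment that preserves class proportions in every fold (an illustrative sketch; in practice a library routine such as scikit-learn's StratifiedKFold would be used):

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Assign each sample index to one of k folds so that every fold
    keeps approximately the overall class proportions.

    Samples are grouped by class label, then dealt round-robin
    across the k folds.
    """
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds
```

Each fold then serves once as the test set while the remaining folds are used for training, and the evaluation measures are averaged over the k iterations.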

IV. EXPERIMENTAL RESULTS
A. ENVIRONMENT
This research used a laptop with a Ryzen 7-5800X processor, 16 GB of RAM, and a 1 TB hard drive. The source code was developed in Python using Jupyter Notebook. We used Python libraries such as NumPy, Matplotlib, Pandas, Scikit-Learn, Keras, and TensorFlow.

B. RESULTS AND DISCUSSION
Experiments were carried out to verify that the proposed technique, using LSTM and EDM to predict students' performance in the programming fundamentals course, outperforms the other evaluated techniques. We evaluated i) the performance of the classifiers on the dataset without applying balancing techniques, ii) the performance of the classifiers applying data balancing techniques, and iii) the performance of the classifiers specifically in week 8.

1) PERFORMANCE EVALUATION OF THE CLASSIFIERS ON THE DATASET WITHOUT APPLYING BALANCING TECHNIQUES
Performance measures such as precision, recall, and F1-Score were applied to the seven predictive models (Long Short-Term Memory, Deep Neural Network, Decision Tree, Random Forest, Logistic Regression, Support Vector Classifier, and K-Nearest Neighbor), using the dataset with unbalanced data. Fig. 7 presents the accuracy achieved by the predictive models and its distribution by week. In week 7, RF achieved the highest accuracy at 77%, while SVC recorded the lowest at 75.3%. In weeks 8, 12, and 16, DT maintains an increase in accuracy ranging from 93.4% to 100%, followed by RF with accuracies of 92%, 97.1%, and 100% in weeks 8, 12, and 16, respectively. The LSTM model, however, reaches 76.2%, 87.4%, 90.8%, and 97.7% in weeks 7, 8, 12, and 16. This is attributed to the limited quantity of data and the bias due to imbalanced data, which allows traditional models to focus their predictions on the majority class (Passed).
Regarding the recall measure, in week 7, KNN obtained a 12% false-negative rate, meaning students are classified as failed when their actual status is passed. In week 8, DT achieves a recall of 97.9%, indicating that 2.1% of students are classified as failed while their actual status is passed. In week 12, DT achieved a recall of 99.7%, and in week 16, DT, LR, and RF all achieved a recall of 100%. Regarding the F1-Score, in week 7, KNN and SVC show better management of imbalanced data with 88% and 87.5%, respectively. Similarly, in week 8, DT and RF achieved the highest values, 97.9% and 97.8%. In week 12, DT and RF continue to obtain the best values, and finally, in week 16, DT, LR, and RF achieve a 100% F1-Score. LSTM presents 84.7%, 92.7%, 94.6%, and 98.9% in weeks 7, 8, 12, and 16, respectively. Fig. 10 displays the weekly F1-Score results for each predictive model.

2) EVALUATION OF THE PERFORMANCE OF THE CLASSIFIERS APPLYING DATA BALANCING TECHNIQUES
The problem of unbalanced data and the small amount of data for training machine learning models lead to biased results and performance. Data balancing techniques, SMOTE and GAN, were used to address both problems. Stratified cross-validation was performed five times. 4,000 synthetic records were generated with each of SMOTE and GAN; however, after generating the data with GAN, a data cleaning step was carried out, as it produced 64 inconsistent records.

a: PERFORMANCE EVALUATION OF THE MODELS OF MACHINE LEARNING USING THE SMOTE TECHNIQUE
Fig. 11 shows the accuracy of the seven predictive models using SMOTE to classify the two categories (pass and fail). It can be observed that DT achieves an accuracy of 75.8%, 93.1%, and 98.3% in weeks 7, 8, and 12, respectively. However, in week 16, the DNN, LSTM, and KNN models reach 99% accuracy.
Regarding the precision measure, KNN with SMOTE achieves the best precision in week 7 with 83.9%, equating to a 16.1% error rate in classifying false positives, meaning the model considers students as passed when they actually failed. In weeks 8 and 12, DT achieves the best precision with 92.1% and 98.1%, respectively. However, in week 16, KNN, DNN, and LSTM achieved better management of false positives, with a precision of 98.7%. Fig. 12 displays the precision results with SMOTE.
Regarding the recall measure, DT and SVC with SMOTE achieve better results in classifying false negatives. Fig. 13 shows the recall results with SMOTE for the predictive models distributed by week.
Regarding the F1-Score measure, DT with SMOTE emerges as the best classifier in weeks 7, 8, and 12 with 81.8%, 94.3%, and 98.8%, respectively; however, in week 16 it drops to 96.6%. Likewise, in week 16, KNN, DNN, and LSTM with SMOTE are the best classifiers with 99%. Fig. 14 displays the F1-Score results with SMOTE.

b: PERFORMANCE EVALUATION OF THE MODELS OF MACHINE LEARNING USING THE GAN TECHNIQUE
Fig. 15 presents the accuracy of the seven predictive models using GAN to classify the two categories (pass and fail). It can be observed that LSTM achieves accuracies of 51.3%, 98.3%, 98.5%, and 99.77% in weeks 7, 8, 12, and 16, respectively. Another notable predictive model is DNN, which achieves accuracies of 51.1%, 98.1%, 98.3%, and 99.5% in weeks 7, 8, 12, and 16.
Regarding average precision, LSTM with GAN shows precision values of 65.8%, 98.3%, 98.5%, and 99.7% in weeks 7, 8, 12, and 16, indicating that it learned to mitigate false-positive errors. Fig. 16 displays the precision results of the predictive models with GAN. The recall and F1-Score measures follow the same classification trend, as shown in Figs. 17 and 18.

3) PERFORMANCE OF THE CLASSIFIERS SPECIFICALLY IN WEEK 8
Fig. 19 summarizes the evaluation of the predictive models in week 8 according to their accuracy. DT achieves 93.4% accuracy using the original data and 93.1% using SMOTE as the data balancing method. When applying GAN to balance the data, LSTM achieves an accuracy of 98.3%. Fig. 20 summarizes the evaluation of the predictive models in week 8 based on their precision. DT achieves 92.1% precision using the original data and 92.1% using SMOTE. However, traditional predictive models using GAN achieve precision values lower than 75%, while DNN and LSTM reach 98.1% and 98.3% precision, respectively, when using GAN.
Fig. 21 presents the recall results of the seven predictive models in week 8. DT and RF achieve 97.9% and 97.8%, respectively, using the original data. Similarly, DT and RF attain a recall of 97.6% using SMOTE. However, when using GAN, the traditional predictive models show an increased error rate in classifying false negatives, while DNN and LSTM with GAN achieve 98.1% and 98.3% recall.
Regarding the F1-Score measure, DT and RF show the best percentages with the original data, 94.5% and 93.5%, respectively, and 94.3% and 93.7% with SMOTE in week 8. However, both drop to 75.8% and 75.9% using GAN, implying classifier errors of 24.2% and 24.1%, respectively. In contrast, DNN and LSTM achieve 98.1% and 98.3% using GAN. Fig. 22 presents the results of the predictive models using data balancing methods in week 8.
Table 7 presents the confusion matrix of each predictive model (DT, KNN, DNN, LSTM, LR, RF, SVC) under each data balancing condition (unbalanced data, SMOTE, GAN) in week 8. The confusion matrix allows for the derivation of accuracy, recall, precision, and F1-Score, as well as the classification error of each model. We can conclude that LSTM and DNN, both using GAN, misclassify two students as false positives and two as false negatives, resulting in a 3% classification error rate. In contrast, the DT model reaches a 34% classification error rate, predicting 25 students as false positives and 20 as false negatives. According to the analysis, the data generated by SMOTE and the original data present a similar distribution; however, the data generated by GAN contain floating-point values with 1 to 8 decimal places. GAN produces data that suit deep learning models such as LSTM and DNN, but traditional models do not train adequately with these data. Likewise, the data distribution by attribute for the original data and the data generated by SMOTE and GAN is shown in Table 8.
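The derivation of the evaluation measures from a binary confusion matrix can be sketched as follows (standard definitions; the counts in the usage example are illustrative, not values from Table 7):

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1-Score from the four
    cells of a binary confusion matrix (true/false positives/negatives).
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

The classification error reported in Table 10 is then simply one minus the accuracy, and specificity follows analogously as tn / (tn + fp).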
Table 9 shows the values obtained for the sensitivity measure, which evaluates the models' ability to predict a passing student as a true positive. With unbalanced data and SMOTE, LSTM shows lower percentages of 93% and 92%, respectively. However, using GAN, LSTM presents the best true-positive ratio, 98%, compared to the other predictive models. Similarly, the specificity measure evaluates the models' ability to predict a failing student as a true negative. LSTM shows the lowest percentages, 72% and 75%, with unbalanced data and SMOTE, respectively; however, it best classifies failing students, at 95%, using GAN.
According to the Area Under the Receiver Operating Characteristic Curve (AUC) measure, all predictive models score above 0.5, indicating that they have learned to classify students whose outcome is approved as positive. With unbalanced data, RF obtains the best value, 96%, and DT the lowest, 92%. With the SMOTE data balancing method, KNN, LSTM, and DNN show the best AUC results. Finally, the AUC evaluation with GAN highlights LSTM as the predictive model that best classifies true positives, with 87%.
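The AUC can be computed directly from predicted scores through the rank-based (Mann-Whitney) identity: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, counting ties as one half. The sketch below uses illustrative scores, not the paper's model outputs:

```python
# Sketch: AUC via the Mann-Whitney formulation,
# AUC = P(score(positive) > score(negative)), ties counted as 0.5.

def auc(y_true, scores):
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0     # positive ranked above negative
            elif p == n:
                wins += 0.5     # tie counts as half a win
    return wins / (len(pos) * len(neg))

# Illustrative labels and scores (not the paper's data)
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
print(auc(y_true, scores))  # 8 of 9 positive/negative pairs are ranked correctly
```

A value of 0.5 corresponds to random ranking, which is why all models scoring above 0.5 can be said to have learned the classification.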
The training time in milliseconds of each predictive model under each data balancing method is also presented. To GAN's training time, the synthetic data generation time (449 ms) must be added, as GAN trains a discriminator and a synthetic data generator within another neural network. We can conclude that GAN requires the longest overall training duration.
Table 10 presents the results of the classification error measure for week 8.
Table 11 presents the results of applying the Bonferroni-Holm correction [41] for pairwise comparisons between our proposed model (LSTM) and the other predictive models. The null hypothesis of the non-parametric test is that the mean F1-Scores of the algorithms, obtained with stratified 5-fold cross-validation, are equal, at a significance level of 0.05. The null hypothesis was rejected when comparing LSTM with SVC, KNN, and LR, indicating significant differences between the mean F1-Score values of these models. However, the null hypothesis is accepted when comparing LSTM with DT, DNN, and RF, as these models show no significant differences in F1-Score.
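The Bonferroni-Holm step-down procedure can be sketched as follows. The p-values here are hypothetical, chosen only to illustrate an accept/reject pattern; they are not the values reported in Table 11:

```python
# Sketch of the Holm-Bonferroni step-down correction: sort p-values
# ascending and compare the k-th smallest against alpha / (m - k);
# stop rejecting at the first failure.

def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one hypothesis is retained, all larger p-values are too
    return reject

# Hypothetical p-values for six pairwise comparisons against LSTM
p = [0.001, 0.004, 0.012, 0.20, 0.35, 0.60]
print(holm_bonferroni(p))  # rejects the three smallest, retains the rest
```

Compared with a plain Bonferroni correction (a fixed threshold of alpha/m), the step-down thresholds are less conservative for the smaller p-values while still controlling the family-wise error rate.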

V. CONCLUSION AND FUTURE WORK
In this study, we presented the results and findings of our research on predicting students' academic performance in the Programming Fundamentals course using LSTM networks and EDM.
In this research, we used 667 records from the Programming Fundamentals course of the Computer Science degree and related programs at two Peruvian universities. After data cleaning, 661 records remained, with thirteen attributes comprising our models' input.
We included six additional predictive models: Deep Neural Network, Decision Tree, Random Forest, Logistic Regression, Support Vector Machine, and K-Nearest Neighbor. We addressed the problem of unbalanced and small data using balancing techniques such as SMOTE and GAN.
The performance of the proposed models with unbalanced data was evaluated. The results show that traditional models such as Decision Tree and Random Forest achieve accuracy between 77% (week 7) and 99.9% (week 16), but with notable rates of false positives and false negatives due to the bias of the predominant class (approved). Likewise, the models trained on balanced data show better precision and recall, with DNN and LSTM attaining the highest values.
Week 8 is strategic for organizational decision-making regarding the Programming Fundamentals course.In that context, LSTM with GAN presents an accuracy, recall, precision, and F1-Score of 98.3%, followed by DNN-GAN with 98.1%.
The results show that SMOTE yields better metric values than GAN as a balancing method for the traditional models. This is because SMOTE adds synthetic data drawn from the existing vector space with few variations, which makes it easy for the models to learn the SMOTE data and overfit. In contrast, GAN creates more realistic synthetic samples that differ from one another, so the experiments with GAN lead the models to generalize.
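SMOTE's core interpolation step explains why its synthetic data stay close to the original vector space: each synthetic point lies on the segment between a minority-class record and one of its nearest minority-class neighbors. A minimal sketch of that step, with illustrative feature values (neighbor search omitted for brevity):

```python
import random

# Sketch of SMOTE's interpolation: a synthetic minority sample is
# x + gap * (neighbor - x) for a random gap in [0, 1), so it always
# lies between two existing minority-class records.

def smote_sample(x, neighbor, rng):
    """Return one synthetic point on the segment between x and neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)
x = [10.0, 2.0]         # illustrative minority-class (failed) record
neighbor = [12.0, 4.0]  # a nearby minority-class record
synthetic = smote_sample(x, neighbor, rng)
# Every coordinate of the synthetic point is bounded by the two parents.
print(synthetic)
```

Because every generated coordinate is a convex combination of existing values, SMOTE cannot produce samples outside the convex hull of the minority class, unlike a GAN generator, which samples from a learned distribution.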
GAN has proven to adapt to the requirements and objectives of this research. However, the analysis of the data generated by GAN was performed manually, and 64 inconsistent records were found and eliminated.
For future research, more data should be considered, and attributes related to competencies and learning styles should be incorporated. Likewise, we expect to predict student dropout using Graph Neural Networks (GNN) and to incorporate other data balancing techniques, such as undersampling or hybrid methods.

FIGURE 1. Flow diagram of the proposed approach.

FIGURE 2. Imbalanced distribution of the class label.

FIGURE 4. The architecture of the LSTM cell.

MODEL VALIDATION

To verify the results of our models, we used five evaluation measures: accuracy, precision, recall, F1-Score, and classification error. Accuracy measures the overall quality of our model (7). Precision measures the percentage of positive predictions the model gets right (8). Recall indicates how many of the actual positive cases the model can identify (9). F1-Score compares performance by combining precision and recall (10). The classification error indicates the percentage of error our model generates (11). The outcomes of our model, compared with the actual outcomes, were categorized as follows: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). Sensitivity indicates the likelihood that a student who has actually passed will be correctly identified as passing by the predictive model (12). On the other hand, specificity indicates the likelihood that a student who has actually failed will be correctly identified as failing (13).

Accuracy = (TP + TN) / (TP + TN + FP + FN) (7)
Precision = TP / (TP + FP) (8)
Recall = TP / (TP + FN) (9)
F1-Score = 2 * (Recall * Precision) / (Recall + Precision) (10)
Classification Error = (FP + FN) / (TP + TN + FP + FN) (11)
Sensitivity = TP / (TP + FN) (12)
Specificity = TN / (TN + FP) (13)
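The evaluation measures above can be written directly as functions of the confusion-matrix counts. The counts used below are illustrative, not taken from the paper's tables:

```python
# The validation measures as functions of (TP, TN, FP, FN).

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)   # Eq. (7)

def precision(tp, fp):
    return tp / (tp + fp)                    # Eq. (8)

def recall(tp, fn):
    return tp / (tp + fn)                    # Eq. (9)

def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)               # Eq. (10)

def sensitivity(tp, fn):
    return tp / (tp + fn)                    # Eq. (12); identical to recall

def specificity(tn, fp):
    return tn / (tn + fp)                    # Eq. (13)

# Illustrative counts: 2 FP and 2 FN out of 133 students,
# roughly a 3% classification error (hypothetical, not Table 7's cells).
tp, tn, fp, fn = 65, 64, 2, 2
print(accuracy(tp, tn, fp, fn), f1_score(tp, fp, fn))
```

Note that sensitivity is the same quantity as recall; it is listed separately only because the text discusses it alongside specificity.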

FIGURE 7. Accuracy of predictive models distributed by weeks.

FIGURE 8. Precision of predictive models distributed by weeks.

FIGURE 9. Recall of predictive models distributed by weeks.

Fig. 9 displays the weekly results of the recall measure for each predictive model. Regarding the F1-Score, in week 7, KNN and SVC show better handling of the imbalanced data, with 88% and 87.5%, respectively. Similarly, in week 8, DT and RF achieve the highest values, 97.9% and 97.8%. In week 12, DT and RF continue to obtain the best values, and finally, in week 16, DT, LR, and RF achieve a 100% F1-Score. LSTM presents 84.7%, 92.7%, 94.6%, and 98.9% in weeks 7, 8, 12, and 16, respectively. Fig. 10 displays the weekly results of the F1-Score measure for each predictive model.

FIGURE 10. F1-Score of predictive models distributed by weeks.

FIGURE 11. Accuracy of predictive models distributed by weeks with SMOTE.

FIGURE 12. Precision of predictive models distributed by weeks with SMOTE.

FIGURE 13. Recall of predictive models distributed by weeks with SMOTE.

FIGURE 14. F1-Score of predictive models distributed by weeks with SMOTE.

FIGURE 15. F1-Score of predictive models distributed by weeks with GAN.

FIGURE 16. Precision of predictive models distributed by weeks with GAN.

FIGURE 17. Recall of predictive models distributed by weeks with GAN.

FIGURE 18. F1-Score of predictive models distributed by weeks with GAN.

FIGURE 19. Accuracy obtained by each predictive model at week 8.

FIGURE 20. The precision obtained by each predictive model at week 8.

FIGURE 21. The recall obtained by each predictive model at week 8.

FIGURE 22. The F1-Score obtained by each predictive model at week 8.
LSTM requires the longest training time with unbalanced data (63.45 ms). DNN shows the longest training time with SMOTE (60.18 ms) and with GAN (44.22 ms); however, GAN's training time must also account for the time spent generating the synthetic data.

TABLE 1. Summary of review of some previous work on student academic performance using machine learning.


TABLE 4. Feature encoding of attributes.

TABLE 5. Parameter configuration of each machine learning model.

TABLE 6. Parameter configuration for the LSTM model.

TABLE 7. Confusion matrix of each predictive model according to the data balancing method, obtained in the evaluation of week 8.

TABLE 8. Data generation according to the data balancing method.

TABLE 9. Measurements of sensitivity, specificity, AUC, and training time of the predictive models according to the data balancing method, applied to week 8.

TABLE 10. Classification error of the predictive models according to the data balancing method, applied to week 8.