Prediction of Course Grades in Computer Science Higher Education Program via a Combination of Loss Functions in LSTM Model

In the realm of education, the timely identification of potential challenges, such as learning difficulties leading to dropout risks, and the facilitation of personalized learning, emphasizes the crucial importance of early grade prediction. This study seeks to connect predictive modeling with educational outcomes, particularly focusing on addressing these challenges in computer science higher education programs. To address these issues, nonlinear dynamic systems, notably Recurrent Neural Networks (RNNs), have demonstrated efficacy in unraveling the intricate relationships within student learning traces, surpassing the constraints of traditional time series methods. However, the challenge of vanishing gradient issues hampers RNNs, leading to a significant decrease in gradient values during weight matrix multiplication. To solve this challenge, we introduce an innovative loss function, the MSECosine loss function crafted by seamlessly combining two established loss functions: Mean Square Error (MSE) and LogCosh. In assessing the performance of this novel loss function, we employed two self-collected datasets comprising learning management system (LMS) and assessment records from a higher education computer science program. These datasets serve as the testing ground for four deep time series models: Multilayer Perceptron (MLP), Convolutional Neural Network (CNN), Long Short-Term Memory network (LSTM), and CNN-LSTM. Employing 29 meticulously designed feature sets representing combination of demography, learning activities and assessment, LSTM emerges as the preeminent model which is consistent with our expectation that RNN is the best suited approach. Building on this groundwork, we solve the vanishing gradient issue and boost the LSTM model’s performance by integrating the proposed MSECosine loss function, resulting in an enhanced model termed eLSTM. 
Experimental results underscore the noteworthy achievements of the eLSTM model, which attains an accuracy of 0.6191% and a substantially reduced error rate of 0.1738. In addressing the vanishing gradient issue, the proposed MSECosine loss function performs two times better than standard loss functions. These outcomes surpass those of alternative approaches, highlighting the instrumental role of the MSECosine loss function in refining eLSTM for more accurate course grade prediction, as well as the role of the feature set that captures early grade prediction.


I. INTRODUCTION
Detecting students who are at risk of dropping out or failing is important, and predicting their academic performance early can be highly beneficial [1], [2], and [3]. Academic performance is the main factor in evaluating the quality of education for college students [4], [5], [6], [7], and [8]. Therefore, early grade prediction can be achieved through the use of sequential data that contains previous information about the student's activities [9], [10]. To deal with this type of data, dynamic systems have to be employed. One way to analyze dynamic systems is to use time series approaches [11].
The associate editor coordinating the review of this manuscript and approving it for publication was Tao Huang.
Accordingly, various time series forecasting schemes, such as simple, autoregressive, and exponential smoothing approaches, have been used in the past for early prediction in different fields, e.g., economics, stock market, and engineering [11], [12], and [13]. However, these methods are limited in their ability to learn complex patterns and can only handle simple prediction challenges using linear methods. Also, they demand a significant quantity of data to attain a high level of accuracy [11], [12], and [13].
Deep learning models can address this limitation when employed for time series forecasting tasks [14]. By leveraging the power of neural networks, deep time series models can effectively handle non-linear relationships and capture intricate patterns that may be missed by linear methods [14]. Consequently, this ability makes the deep time series scheme well-suited for challenging prediction tasks such as grade prediction [14], [15].
This study specifically addresses the prediction of course grades in a computer science higher education program via a novel combination of loss functions, MSECosine, in the LSTM model. While traditional time series forecasting schemes, including simple, autoregressive, and exponential smoothing approaches, have been employed in various fields for early prediction [11], [12], and [13], they exhibit limitations in learning complex patterns and are restricted to linear methods.
The existing literature on early grade prediction using these deep learning methods has primarily focused on understanding temporal relationships through new gating techniques, employing LSTM [16], [18], and [19]. However, this body of work has not thoroughly explored the significant issue of error magnification during the training phase [16], [18], and [19].
Accordingly, one way to address this issue is by utilizing suitable loss functions, such as Mean Square Error (MSE), which is considered the best function as it does not suffer from vanishing or exploding gradients due to its use of an exponential term [16], [18]. However, the MSE function can magnify errors when the network is not performing well [16], [18]. To tackle this problem, the logarithm function can be used to prevent the exponential function from expanding, reducing the skewness of exponential terms [17].
Hence, this study aims to fill this gap between predictive modeling and educational outcomes by proposing a novel loss function, MSECosine, to address this specific challenge and enhance network performance. This innovative loss function, a combination of Mean Square Error (MSE) and LogCosh, aims to address the vanishing gradient problem, presenting a unique approach that distinguishes our work from previous efforts.
Our hypothesis posits that the proposed MSECosine loss function effectively addresses the issue of error magnification during the training phase, leading to improved performance in early grade prediction. To investigate this, our research question examines how the MSECosine loss function impacts the performance of deep time series models in predicting student grades early on.
If the logarithm function can prevent the exponential loss function from growing, then combining the LogCosh and MSE loss functions will produce a desirable network error. In this research, we propose the novel MSECosine loss function to address error magnification during the training phase by combining the MSE and LogCosh loss functions.
The contribution of this work is threefold. Firstly, we present the results of early grade prediction through an empirical investigation of the accuracy of four deep time series models. Secondly, we propose a method called eLSTM, introducing the new loss function MSECosine as a solution to the vanishing gradient problem in deep time series models. Thirdly, we present the optimized early grade prediction technique using the proposed eLSTM based on the evaluation of 29 feature sets.
The remainder of this paper is organized into six sections. The literature review, covering the essentials of early grade prediction using deep time series models as well as the impact of the loss function on network performance, is presented in Part II. The methodology is discussed in Part III. The enhancement of the LSTM model using the proposed MSECosine loss function is addressed in Part IV. The experiments and attained results are presented in Part V. The discussion is given in Part VI. Finally, the paper is concluded in Part VII.

II. RELATED WORK
This section discusses the existing literature on time series prediction within the context of academic performance forecasting, emphasizing the limitations of traditional time series methods and the advantages of deep learning techniques, specifically Recurrent Neural Networks (RNNs) like LSTM.
Additionally, it highlights the superiority of LSTM over other models in predicting student performance, as demonstrated in various studies. Furthermore, we address the limitations of these models, which are overcome through the use of appropriate components such as loss functions.
Numerous optimization algorithms are currently being applied across diverse academic disciplines, including mechanical engineering, automobile engineering, aerospace engineering, etc. The relevance of these studies becomes particularly significant when considering the dynamic nature of academic performance [20], [21].
Academic performance, a nonlinear dynamic system, exhibits feedback loops, chaotic behavior, and sensitivity to initial conditions. Unlike linear systems, where input-output relationships are straightforward, nonlinear systems can have complex and disproportionate connections. This complexity is further compounded by the dynamic nature of academic performance, subject to changes over time due to various influences like study habits, motivation, and external factors.
Such intricate relationships mean that minor alterations in input variables can trigger significant and unpredictable shifts in output. The multifaceted interplay of factors affecting academic performance defies linear models. The evolution of these influences over time introduces volatility, potentially leading to unexpected fluctuations or abrupt changes in academic outcomes. For instance, doubling study hours may not result in double the grade improvement, due to intricate variable interactions and diminishing returns.
Recognizing the significance of nonlinear dynamic systems in academic performance prediction, deep time series models offer advantages over traditional methods. Nonlinear techniques are vital for predicting time delays inherent in dynamic systems, often encountered in time series forecasting [18], [22].
In various domains, researchers have employed traditional time series techniques, ranging from basic methods like averaging to more complex approaches such as nonlinear autoregressive networks with exogenous inputs (NARX) [23] and exponential smoothing (Figure 1) [23], [24]. These methods address prediction challenges but may fall short in capturing the intricate dynamics of academic performance. Classic time series models (Figure 2) rely solely on past inputs and current histories for predictions. While effective in industries like sales estimation, energy consumption forecasting, and passenger predictions, their linear nature limits their ability to handle complex patterns and latent factors [22]. These linear constraints lead to vulnerability to outliers, time inefficiency, and data limitations, necessitating heuristics and fine-tuning, especially for seasonality [22].
In contrast, deep time series methods, such as Recurrent Neural Network (RNN), have demonstrated remarkable performance across various domains, capturing intricate patterns and relationships [23], [24]. They provide more accurate control and forecasting in partially unknown environments, as evidenced by their success in natural language processing, image classification, and more [25], [26], and [27].
As a result, various deep learning techniques, including MLP, CNN, LSTM, and CNN-LSTM combinations, have been employed for time series prediction [13], [28]. These methods excel in capturing complex time-dependent relationships, as highlighted in Table 1. Academic performance forecasting is commonly achieved through grade prediction using standard machine learning (e.g., MLP, SVM) and time series (e.g., LSTM, GRU) methods [29], [32]. Notably, LSTM consistently outperforms other models in terms of precision, Mean Absolute Error (MAE), and Root Mean Square Error (RMSE) due to its effective capture of complex temporal dependencies in educational data [1]. See Table 2 for a comprehensive overview of these approaches.
The mentioned table demonstrates LSTM's efficacy in predicting student performance by surpassing baseline models, such as SVM and MLP, with superior accuracy [29]. LSTM's adeptness in handling sequential data and retaining long-term information enhances its adaptability to the complexities of educational data, resulting in enhanced predictions [28]. In contrast to alternatives, LSTM offers reliable and precise predictions, making it the preferred model for educational grade prediction [18].
These comparisons underscore LSTM's and Bi-LSTM's superiority in predicting student performance. Their ability to process sequential data and comprehend temporal dependencies proves promising for accurate and effective grade prediction. While both LSTM and Bi-LSTM [18], [33] exhibit strong performance, LSTM's computational efficiency sets it apart. Operating unidirectionally, LSTM processes input sequences from past to future or vice versa, reducing complexity and memory needs compared to Bi-LSTM [16]. Consequently, the LSTM model [34], [35], [36], [37] emerges as a compelling choice for grade prediction in educational data analysis, offering a balance of performance and computational efficiency.
30222 VOLUME 12, 2024. Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Although the above-mentioned approaches show promising results for time series prediction, they suffer from the vanishing gradient issue when working on large-scale time series forecasting problems. Several loss functions can overcome this issue. The above-mentioned deep time series forecasting methods are applied to a regression-type predictive modeling problem [33]. Multiple regression loss functions have been proposed to improve performance evaluation in regression analysis. The advantages and disadvantages of the regression loss functions are specified in Table 3.
Among the loss functions presented above, MSE and LogCosh are considered the most well-known regression loss functions. This is due to their differentiability, which avoids vanishing gradient issues when using large sequential data. The differentiability of MSE makes its mathematical operations easy to carry out. Also, this function evaluates the fitness of the model, with a lower value indicating a better fit. Meanwhile, the LogCosh function is beneficial for keeping balance, as it utilizes logarithm terms.

III. MATERIAL AND METHODS
This section outlines the methodology used in this research, which is divided into five phases: 1) data collection, 2) implementation of various feature selection models, 3) modeling of predictive deep time series models, 4) evaluation metrics applied to all tested models, and 5) enhancement of the best model using the proposed MSECosine loss function. The data collection and preparation phase provides information on 1) the description of the self-collected 'LMS' and 'Assessment' datasets, 2) the procedure for merging these two datasets, and 3) the preprocessing methods. The second phase presents the details of the implementation of the 29 designed feature sets and their importance. The third phase presents the principle and proposed framework of this study using four different time series models: MLP, CNN, LSTM, and CNN-LSTM. The fourth phase specifies the evaluation metrics applied to quantitatively assess the performance of each method tested in this study. Finally, the fifth phase presents the workflow diagram of eLSTM using the proposed MSECosine loss function. These phases are introduced below.

A. DATA COLLECTION AND PREPARATION
This section gives comprehensive details on the two collected datasets used to test the models of this research and describes the pre-processing schemes applied to prepare these data for prediction.
The second dataset comprises the access log of the UPM LMS, called PutraBLAST. This dataset consists of 11895 instances and contains 7 attributes, namely "Time", "Event_context", "Component", "Event_name", "Description", "Origin", and "Matric". Based on the features of this dataset, we calculate the frequency of access by each student to generate two features called "week 1-7" and "week 8-12" in each of their registered courses. This information is valuable in indicating the students' learning effort, as they access the LMS for learning activities.
Once these datasets are combined by mapping them according to the matric number, the total number of instances becomes 3721, and we utilized only the 21 attributes that provided more crucial information, e.g., student performance and engagement. The details of the aggregated 'Assessment' and 'LMS' dataset are shown in Table 5.
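The mapping by matric number described above can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the authors' code: the toy rows, the `Course` and `Grade` columns, and the access counts are hypothetical, and only the `Matric` key comes from the dataset description.

```python
import pandas as pd

# Hypothetical toy stand-ins for the two sources; only "Matric" is named in the paper.
assessment = pd.DataFrame({
    "Matric": [1001, 1002, 1003],
    "Course": ["CS101", "CS101", "CS101"],
    "Grade": ["A", "B", "C"],
})
lms = pd.DataFrame({
    "Matric": [1001, 1002, 1003, 1004],
    "FreqW1W7": [42, 17, 5, 9],
    "FreqW8W12": [30, 12, 2, 4],
})

# An inner join keeps only students that appear in both sources,
# so student 1004 (no assessment record) is dropped.
combined = assessment.merge(lms, on="Matric", how="inner")
print(combined)
```

An inner join is only one plausible choice; a left join on the assessment records would instead keep students who never accessed the LMS, with missing access counts.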

B. PRE-PROCESSING STEPS
The pre-processing step of this study involved two techniques called LabelEncoder and MinMaxScaler normalization. These methods are beneficial for preparing data for deep time series models. As the collected dataset contains multiple variables, it is necessary to convert categorical features to numerical form to be processed by time series models. Additionally, as time series models are highly sensitive to input scale, the MinMaxScaler normalization method was used to ensure that all features have the same range. A detailed description of each approach is provided in their respective sections.

1) LABEL ENCODER TECHNIQUE
Label Encoder is a technique used to convert categorical data into numerical form. This process involves assigning a unique numerical value to each category present in the dataset used in this research. These numerical values are arbitrary and have no intrinsic meaning. The procedure for converting categorical data into numeric form using LabelEncoder is outlined below.
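As a hedged illustration of this conversion, the sketch below applies scikit-learn's `LabelEncoder` to a small hypothetical column of letter grades. The values are invented for the example; the paper does not list which categorical attributes were encoded.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column; the values are illustrative only.
grades = ["B", "A", "C", "A", "B"]

encoder = LabelEncoder()
codes = encoder.fit_transform(grades)  # classes are sorted, then numbered 0..n-1

print(list(encoder.classes_))  # ['A', 'B', 'C']
print(codes.tolist())          # [1, 0, 2, 0, 1]
```

Because the integer codes follow the sorted order of the labels, they carry no intrinsic meaning, which matches the remark above that the assigned values are arbitrary.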

2) MINMAXSCALER NORMALIZATION METHOD
MinMaxScaler is a widely used data preprocessing technique in machine learning and data analysis. Its primary objective is to transform the data so that all features lie within a specified range.
In our case, we have applied the MinMaxScaler to the combined 'Assessment' and 'LMS' dataset, which contains both categorical and numerical attributes.
The MinMaxScaler method serves to normalize the data, making it amenable to various machine learning algorithms and ensuring that no single feature dominates the learning process due to differences in scale. This choice is based on its advantages, such as faster convergence when dealing with deep time series models, which are applied in this work, and the preservation of critical information in the data. The MinMaxScaler transformation can be expressed using the following mathematical formula, as stated in Eqn. (1):
$$I_{Scaled} = \frac{I - I_{Min}}{I_{Max} - I_{Min}} \tag{1}$$
where $I$ represents the input data, $I_{Scaled}$ denotes the scaled data after the transformation, and $I_{Min}$ and $I_{Max}$ signify the lowest and highest values in the input data, respectively. Therefore, this formula ensures that each feature is scaled proportionally to its range within the dataset, making it a valuable preprocessing step for prediction tasks when using deep time series models.
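The scaling in Eqn. (1) can be checked numerically with a short sketch. The feature matrix below is hypothetical, and the default output range of (0, 1) is an assumption, since the paper does not state the target range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: rows are records, columns are features on different scales.
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 800.0]])

scaler = MinMaxScaler()  # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X)

# The same result computed directly from Eqn. (1), column by column.
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_scaled)
print(np.allclose(X_scaled, X_manual))
```

In a train/test setting, the scaler would normally be fitted on the training portion only and then applied to the test portion, so that test-set minima and maxima do not leak into training.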

3) GENERATING STUDENT ENGAGEMENT FEATURES
As explained in the previous sections, the extracted and integrated data are transformed and normalized. We also generated several new features based on the LMS dataset to represent the learning efforts of the students.
We identified the frequency of weekly access by the students in each course from week 1 until week 12, and created two aggregated features based on their total frequency of access from week 1 until week 7 (called FreqW1W7), and from week 8 until week 12 (called FreqW8W12).
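The two aggregated engagement features can be sketched with a pandas group-by. This is an illustration under assumptions: the `Week` and `Course` columns are hypothetical (in the real log the teaching week would have to be derived from the "Time" attribute), and the rows are invented.

```python
import pandas as pd

# Hypothetical access log: one row per LMS event, already tagged with a teaching week.
log = pd.DataFrame({
    "Matric": [1001, 1001, 1001, 1002, 1002, 1001],
    "Course": ["CS101"] * 6,
    "Week":   [1, 3, 9, 2, 12, 7],
})

# Label each event with the aggregate feature it contributes to.
log["Period"] = log["Week"].apply(lambda w: "FreqW1W7" if w <= 7 else "FreqW8W12")

# Count events per student, course, and period, then spread the periods into columns.
features = (
    log.groupby(["Matric", "Course", "Period"])
       .size()
       .unstack(fill_value=0)
       .reset_index()
)
print(features)
```

Each row of `features` then holds one student-course pair with its access counts for weeks 1 to 7 and weeks 8 to 12.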

4) FEATURE SET GENERATION
To determine the impact of combinations of different attributes on the grade prediction model's performance, we designed 29 feature sets, shown in Figure 31. This is also to diminish the risk of overfitting that may occur with a single model. Figure 31 is shown in the Appendix.
Figure 3 illustrates the comprehensive framework employed for student grade prediction, constructed based on five stages: 1) Data collection, 2) Pre-processing, 3) Deep time series models, 4) Prediction, and 5) Output. In the initial stage of data collection, the dataset used in this investigation originates from two primary sources, 'Assessment' and 'LMS', represented as separate time series datasets in the early phase (refer to Table 4). Following this, the second stage is the preprocessing stage. In this stage, three traditional preprocessing steps, 1) Data Integration, 2) Label Encoding, and 3) Normalization, are employed. In the data integration step, the aforementioned 'Assessment' and 'LMS' datasets are combined (Assessment & LMS) to provide information about student performance and engagement. Subsequently, Label Encoding and Normalization are applied, in that order, to the combined Assessment & LMS dataset. The Label Encoding method transforms categorical features into a numerical format compatible with the deep time series models used in this study. Additionally, the MinMaxScaler normalization method is applied to enhance model performance and convergence. This normalization technique is chosen for its suitability in scaling input data to a specific range, recognizing the impact of input scale on deep time series models.
Moving forward, the third phase involves the development of four deep time series models, MLP, CNN, LSTM, and CNN-LSTM, for student grade prediction. During the prediction stage, the data are systematically split into training and testing sets, ensuring an unbiased assessment of the models' effectiveness and guarding against overfitting.
The output stage concludes the research framework, providing accuracy reports based on predictions made on the testing data. Evaluating the performance of each deep time series technique yields insights into the most effective approach. The model demonstrating higher accuracy and lower error values is selected as the optimal scheme for early student grade prediction. Subsequently, this chosen approach is refined further through the proposed MSECosine loss function, detailed in the subsequent section outlining the procedures taken to enhance the selected model.

5) SETUP OF DEEP TIME SERIES MODELS FOR GRADE PREDICTION
The combined Assessment & LMS dataset is divided into 70% for training and 30% for testing so that the models are well trained and thoroughly evaluated. After that, we created four different deep learning models: MLP, CNN, LSTM, and CNN-LSTM. To provide transparency in our model configurations, Table 6 outlines both the tested and chosen values of the key hyperparameters crucial for optimizing the performance of each time series model. All these time series models are built in the Python programming language using various libraries. The "Pandas" library, which provides several functions to load data in various formats, e.g., CSV, is utilized to load the time series dataset of this work. For the preprocessing stage, the "NumPy" and "Sklearn" libraries have been utilized. Besides these libraries, another three libraries, "Keras", "TensorFlow", and "Matplotlib", were used to build the four deep time series approaches.
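A 70/30 split of this kind might look as follows with scikit-learn. The arrays are random placeholders, and both `random_state=42` and the default shuffling are assumptions, since the paper does not say how the split was drawn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the preprocessed data: 100 records, 21 features, one target.
rng = np.random.default_rng(0)
X = rng.random((100, 21))
y = rng.random(100)

# 70% training, 30% testing; random_state is an assumption added for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
print(X_train.shape, X_test.shape)
```

For strictly ordered sequences, passing `shuffle=False` keeps the chronological order, so that the test portion contains only the most recent records.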
The MLP model utilized in this research comprises a single input layer with 100 time steps, 21 features drawn from the 29 designed feature sets (this number may vary depending on the selected set), and one channel. It is followed by one fully connected layer with 79 units and a single output layer with a single unit.
The CNN model consists of one 1D convolutional layer with 64 filters, a kernel size of 2, and a LeakyReLU activation function, followed by one max pooling layer with a pool size of 2, one flatten layer, and one output layer with a single unit.
Similarly, the LSTM model comprises one LSTM layer with 79 units and a LeakyReLU activation function, followed by one flatten layer and one output layer with a single unit.
Also, the CNN-LSTM technique contains a single distributed convolutional layer with 64 filters, each with a kernel size of 1, and a LeakyReLU activation function, followed by a single distributed max pooling layer with a pool size of 2, a single distributed flatten layer, one LSTM layer with 79 units and a LeakyReLU activation function, one flatten layer, and one output layer with a single unit.
The accuracy is computed as given in Eqn. (2):
$$\mathrm{Accuracy} = \frac{1}{T}\sum_{l=1}^{T}\left[g^{pred}_{l} == g_{l}\right] \tag{2}$$
where $T$ is the size of the testing set and $[g^{pred}_{l} == g_{l}]$ is an indicator function. If the indicator function is equal to 1, the predicted grade $g^{pred}_{l}$ for student $l$ is equivalent to the true grade $g_{l}$. Conversely, if the indicator function is equal to 0, there is a mismatch between the predicted and true grades. Finally, the term $\frac{1}{T}$ computes the average accuracy by dividing the sum of the indicator functions by the total number of instances ($T$).
Eqn. (3) represents the MSE formula, which quantifies the error between the true and predicted grades:
$$\mathrm{MSE} = \frac{1}{T}\sum_{l=1}^{T}\left(g_{l} - g^{pred}_{l}\right)^{2} \tag{3}$$
where $T$ is the testing set size and $\sum$ denotes the sum over all students' grades in the testing set. Also, $g_{l}$ and $g^{pred}_{l}$ denote the true and predicted grade of student $l$, respectively. The MSE is calculated as the average of the squared differences between the true and predicted grades. Eqn. (4) shows the RMSE formula, which calculates the square root of the average squared differences between the predicted $g^{pred}_{l}$ and actual $g_{l}$ values:
$$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{l=1}^{T}\left(g^{pred}_{l} - g_{l}\right)^{2}} \tag{4}$$
where $T$ is the number of observations, and $g^{pred}_{l}$ and $g_{l}$ represent the predicted and actual values, respectively.
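The three evaluation metrics described above (accuracy, MSE, and RMSE) can be written in a few lines of NumPy. The function names and the toy grade vectors are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

def accuracy(g_true, g_pred):
    """Share of test instances whose predicted grade equals the true grade."""
    g_true, g_pred = np.asarray(g_true), np.asarray(g_pred)
    return float(np.mean(g_pred == g_true))

def mse(g_true, g_pred):
    """Average of the squared differences between true and predicted grades."""
    g_true = np.asarray(g_true, dtype=float)
    g_pred = np.asarray(g_pred, dtype=float)
    return float(np.mean((g_true - g_pred) ** 2))

def rmse(g_true, g_pred):
    """Square root of the MSE."""
    return float(np.sqrt(mse(g_true, g_pred)))

# Hypothetical encoded grades for four students; one of the four predictions is wrong.
g_true = [3, 2, 4, 1]
g_pred = [3, 2, 2, 1]
print(accuracy(g_true, g_pred))  # 0.75
print(mse(g_true, g_pred))       # 1.0
print(rmse(g_true, g_pred))      # 1.0
```

The single miss differs by 2 grade levels, so the squared error of 4 averaged over 4 students gives an MSE of 1.0.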
The deep time series model that has the best performance is then selected to be improved with proposed MSECosine loss function as explained in the next section.

IV. ENHANCEMENT OF LSTM WITH MSECOSINE LOSS FUNCTION

A. PROPOSED MSECOSINE LOSS FUNCTION
In implementing the eLSTM model, we utilized several Python libraries, including Pandas, NumPy, Scikit-learn, Keras, TensorFlow, and Matplotlib. These libraries were employed for efficiency and to adhere to standard practices in data processing and model construction. Notably, the core mathematical components of Algorithm 2, particularly the proposed MSECosine loss function, which is a custom loss function, were implemented from scratch.
Referring to Eqn. (3), if the error obtained by the MSE loss function is large, the MSE loss function, owing to its square term, will amplify the error even further. This causes numerical instability and slow convergence. Hence, to address this concern, the present study introduces a novel MSECosine loss function.
The proposed MSECosine loss function is constructed by combining two loss functions, MSE and LogCosh. This combination is based on the convexity theorem, which allows convex functions to be combined in a specific way while preserving their desirable properties and convexity.
Therefore, the proposed MSECosine loss function can keep errors from growing and prevent the function from being too sensitive to outliers. The proposed MSECosine function derives these advantages from the MSE and LogCosh loss functions, respectively. Thus, it can be more robust and can handle a wider range of data than the MSE and LogCosh functions used individually.
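The convexity property invoked here rests on a standard fact, restated below for reference: a nonnegative weighted sum of convex functions is itself convex. The symbols $f$, $h$, $a$, $b$, $t$, and $e$ are generic placeholders introduced for this illustration and are not notation from the paper.

```latex
% f and h convex, weights a, b >= 0, and any t in [0, 1]:
\begin{aligned}
(af + bh)\bigl(t x + (1-t) y\bigr)
  &= a\, f\bigl(t x + (1-t) y\bigr) + b\, h\bigl(t x + (1-t) y\bigr) \\
  &\le a\bigl[t f(x) + (1-t) f(y)\bigr] + b\bigl[t h(x) + (1-t) h(y)\bigr] \\
  &= t\,(af + bh)(x) + (1-t)\,(af + bh)(y).
\end{aligned}

% Both building blocks are convex in a scalar residual e:
\frac{d^{2}}{de^{2}}\, e^{2} = 2 > 0,
\qquad
\frac{d^{2}}{de^{2}} \log\cosh(e) = \operatorname{sech}^{2}(e) > 0 .
```

Since a squared residual and the log-cosh of a residual each have a positive second derivative, any combination of the two with nonnegative weights stays convex in the residual.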
The steps below present the mathematical proof of the convexity of the proposed MSECosine function based on the convexity theorem. Let the proposed MSECosine be defined as $\mathrm{MSECosine}(g_{l} - g^{pred}_{l})$, where $g_{l}$ and $g^{pred}_{l}$ are the actual and predicted values, respectively.

$$\mathrm{MSECosine}\left(g_{l} - g^{pred}_{l}\right) = \mathrm{MSE}\left(g_{l}, g^{pred}_{l}\right) + (1 - \alpha) \times \mathrm{LogCosh}\left(g_{l}, g^{pred}_{l}\right) \tag{5}$$

where $\alpha$ is a hyperparameter between 0 and 1. This value controls the relative weights within the proposed MSECosine function. When $\alpha$ is close to 0, the MSECosine function gains the advantage of the LogCosh function in dealing with outliers. Otherwise, if $\alpha$ is close to 1, it benefits from the MSE function. $g_{l}$ and $g^{pred}_{l}$ signify the actual and predicted values over the 100 time steps of this study. Thus, to prove that $\mathrm{MSECosine}(g_{l} - g^{pred}_{l})$ is convex, the Hessian matrix must be positive semidefinite for each value of $g_{l}$ and $g^{pred}_{l}$. The Hessian matrix of $\mathrm{MSECosine}(g_{l} - g^{pred}_{l})$ with regard to $g^{pred}_{l}$ is specified as follows.
Here, M denotes a matrix. Subsequently, as both MSE and LogCosh are positive semidefinite in h for all values of $g_{l} - g^{pred}_{l}$, the proposed MSECosine loss function is convex.
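A minimal NumPy sketch of the combined loss is given below, following Eqn. (5) as printed (an unweighted MSE term plus a LogCosh term weighted by one minus alpha). Several points are assumptions made for illustration: both terms are averaged over the sequence, `alpha=0.5` is an arbitrary default, and the function names are hypothetical. It is not the authors' Keras implementation.

```python
import numpy as np

def log_cosh(e):
    """Numerically stable log(cosh(e)) = |e| + log1p(exp(-2|e|)) - log(2)."""
    a = np.abs(e)
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

def mse_cosine(g_true, g_pred, alpha=0.5):
    """MSE + (1 - alpha) * LogCosh, with both terms averaged over the sequence."""
    e = np.asarray(g_true, dtype=float) - np.asarray(g_pred, dtype=float)
    return float(np.mean(e ** 2) + (1.0 - alpha) * np.mean(log_cosh(e)))

# Hypothetical normalized grades and predictions.
g_true = np.array([0.8, 0.6, 0.9])
g_pred = np.array([0.7, 0.6, 0.4])
print(mse_cosine(g_true, g_pred, alpha=0.5))

# Midpoint check of convexity in the prediction vector (an illustration, not a proof).
p, q = np.array([0.0, 1.0, 2.0]), np.array([1.0, -1.0, 0.5])
mid = mse_cosine(g_true, 0.5 * (p + q))
assert mid <= 0.5 * mse_cosine(g_true, p) + 0.5 * mse_cosine(g_true, q)
```

With `alpha=1.0` the LogCosh term vanishes and the function reduces to plain MSE. The stable form of `log_cosh` avoids the overflow that a direct `np.log(np.cosh(e))` would hit for large residuals.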

B. ENHANCED LSTM MODEL
In this study, four deep time series models, namely MLP, CNN, LSTM, and CNN-LSTM, have been employed for early student grade prediction using the combined 'Assessment' and 'LMS' dataset. Then, the method with the highest precision, LSTM, is enhanced with the proposed MSECosine loss function, which is constructed by combining two popular loss functions, MSE and LogCosh. This step tests whether applying the proposed MSECosine loss function to the standard LSTM technique offers more precise predictions. The pre-processed data fed to the input layer is set to an input length of 100 data points, and the number of features is obtained from the designed feature sets (refer to Figure 31). To obtain the best time step value, various values were tested, and 100 was found to be the best.
In this layer, the input shape is determined by the number of time steps, which is set to 100, multiplied by the number of features per time step, which varies from 5 to 21. These feature values are obtained from the designed models (refer to Figure 31). Therefore, with the full set of 21 features, each input sample contains 100 × 21 = 2100 values.
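The input shape arithmetic can be illustrated by arranging a feature table into samples of 100 time steps. The sliding-window construction below is an assumption made for illustration; the paper does not state how the records were grouped into sequences, and `make_windows` is a hypothetical helper.

```python
import numpy as np

def make_windows(data, time_steps=100):
    """Stack overlapping windows into an array of shape (samples, time_steps, features)."""
    data = np.asarray(data, dtype=float)
    n_windows = data.shape[0] - time_steps + 1
    return np.stack([data[i:i + time_steps] for i in range(n_windows)])

# Hypothetical preprocessed table: 150 records with the full set of 21 features.
table = np.zeros((150, 21))
windows = make_windows(table, time_steps=100)

print(windows.shape)    # (51, 100, 21)
print(windows[0].size)  # 2100
```

Each sample therefore holds 100 × 21 = 2100 values when all 21 features are used, and proportionally fewer for the smaller feature sets.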
Afterward, the third stage is the eLSTM hidden layer. Here we use only one hidden layer, as most studies claim that an LSTM with a single hidden layer provides more accurate results [1]. This layer has 79 LSTM units that take between 5 and 21 features ($P_1, P_2, \ldots, P_{21}$) from the 29 designed feature sets as input at each of the 100 time steps. The eLSTM units then use their memory cells to process the input sequence, updating the hidden state and yielding a new hidden state and a new memory cell at each of the 100 time steps. Subsequently, the output from the last time step of each eLSTM unit is passed to the next layer, the output layer.
In the output layer, the output from the last time step of each eLSTM unit is concatenated into a single vector. This vector is then passed through a dense layer with one output neuron. The output of this neuron is the predicted value of the target variable, the student grade.
Finally, in the training stage, Stochastic Gradient Descent (SGD) [1] is used as the optimizer and MSE as the loss function [18], [34]. In this stage, the main concern is to make the predicted student grade as close as possible to the actual grade by adjusting the model's weights. This is done by computing the loss function and updating the weights using backpropagation.
The details of the proposed eLSTM model based on the 29 designed feature sets of this study are explained in Algorithm 2.
The proposed eLSTM model receives a sequence of observations $P = \{P_1, P_2, P_3, \ldots, P_N\}$ as input, where each observation is a vector of length $k$. Then, $v_0$ and $j_0$ represent the initial hidden state and cell state, respectively. Each of these two parameters is a vector of length $s$.
The input and hidden state are transformed using the weight matrices $we_f$, $we_i$, $we_c$, and $we_o$, each of size $s \times (k + s)$, and $we_y$, which has size $1 \times n$. Also, there are multiple bias vectors $ba_f$, $ba_i$, $ba_c$, and $ba_o$ of size $s$, and an output bias of size 1.
These weights and biases differ because they must learn different transformations of the input and hidden state to attain appropriate gate activations and outputs at each time step.
Next, the number of time steps is defined, which in this work is equal to 100. Furthermore, the hidden state is mapped to the prediction. By tuning these parameters during the training stage, the LSTM model can yield more precise predictions based on the input sequence $P$.
Consequently, the output is the sequence of forecasted outcomes $y = \{y_1, y_2, y_3, \ldots, y_N\}$, where each predicted value is a scalar.
Once the predicted value obtains from the eLSTM model after 100-time steps then the eLSTM model employed the proposed MSECosine loss function to compute the differences among the predicted value of y_pred T and true value y T .Therefore, the predicted sequence is describe as g_pred T = {g1 pred T , g2 pred T , . . ., gT _pred t } and the target sequence g T ={g1 T , g2 T , g3 T , . . ., g t }, then the proposed MSECosine loss function is specified in Eqn.(7),
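As a rough illustration of the two ingredients named for the proposed loss (MSE and LogCosh), the sketch below computes each of them on a predicted and a target sequence. The function `combined_loss` and its weighting parameter `alpha` are assumptions made purely for illustration: it is a simple weighted sum and should not be read as the definition given in Eqn. (7).

```python
import math

def mse_loss(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def log_cosh_loss(y_true, y_pred):
    """Mean of log(cosh(error)), computed in a numerically stable form."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        e = abs(p - t)
        # log(cosh(e)) = e + log(1 + exp(-2e)) - log(2), which avoids overflow for large e
        total += e + math.log1p(math.exp(-2.0 * e)) - math.log(2.0)
    return total / len(y_true)

def combined_loss(y_true, y_pred, alpha=0.5):
    """Hypothetical weighted sum of MSE and LogCosh (an assumption, not Eqn. (7))."""
    return alpha * mse_loss(y_true, y_pred) + (1.0 - alpha) * log_cosh_loss(y_true, y_pred)

# Made-up target and predicted grades, for illustration only
g_true = [3.0, 2.5, 4.0]
g_pred = [2.8, 2.9, 3.5]
example_value = combined_loss(g_true, g_pred)
```

For small errors LogCosh behaves like half the squared error, and for large errors it grows roughly linearly, which is why it is often described as less sensitive to occasional large errors than MSE.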

V. EXPERIMENT AND RESULTS
This section provides the settings and outcomes of the three experiments conducted in this study: the evaluation of deep time series approaches on the 29 designed feature set models, the comparison of the eLSTM method against the basic deep time series approaches, and the performance of the models on the feature sets.

A. EXPERIMENT 1: ENHANCED LSTM MODEL EXPLANATION
This section outlines the process used to identify the optimal deep time series model among MLP, CNN, LSTM, and CNN-LSTM. To evaluate these models, 29 distinct feature selection schemes were devised, utilizing a combined dataset ('Assessment' and 'LMS') (Figure 31). Referring to Tables 7, 8, and 9, and Figures 5, 6, and 7, we can conclude that the LSTM model outperformed the other three tested models by offering higher accuracy and lower error and RMSE values. Therefore, we selected the LSTM model as the best model for early student grade prediction.

B. EXPERIMENT 2: PERFORMANCE OF eLSTM MODEL COMPARED TO STANDARD LSTM MODEL
This section presents the comparison between the standard LSTM and the proposed eLSTM model in terms of accuracy, MSE, and RMSE. Two sub-experiments were conducted as follows:
• Experiment 2.1: LSTM vs eLSTM using all 29 feature sets
Deriving insights from Tables 7 to 10 and Figs. 5-11, the mean and median values of each time series technique are unequal. Accordingly, to assess whether the MLP, CNN, LSTM, CNN-LSTM, and proposed eLSTM methods differ, we use the Friedman test, which is regarded as the most well-known approach for testing differences among more than two samples. Table 11 reports the mean ranks from the Friedman test, together with the p-value, for the accuracy of MLP, CNN, LSTM, CNN-LSTM, and the proposed eLSTM based on the 29 designed feature selection methods. Referring to Table 11, this evaluation confirms that the time series models (MLP, CNN, LSTM, CNN-LSTM, and proposed eLSTM) are significantly different: the p-value for their accuracy is below the significance level α = 0.05, so the null hypothesis is rejected. Therefore, we proceed with five post hoc tests (Nemenyi, Bonferroni-Dunn, Finner, Li, and Holm) to obtain specific pairwise comparisons and identify the observed differences.
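The ranking logic behind the Friedman test can be sketched as follows. The accuracy values in `accuracy_table` are made up for illustration and are not results from this study; the helper names are hypothetical. The sketch omits the tie correction and uses a closed-form chi-square tail that is valid only for an even number of degrees of freedom (five models give four degrees of freedom), so it should be read as an approximation rather than a replacement for a statistics package.

```python
import math

def rank_descending(row):
    """Rank one row so the largest value gets rank 1; ties share the average rank."""
    order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    ranks = [0.0] * len(row)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and row[order[end + 1]] == row[order[start]]:
            end += 1
        average_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return ranks

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df % 2 != 0:
        raise ValueError("this closed form requires an even number of degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

def friedman_test(table):
    """Return (mean ranks, Friedman statistic, approximate p-value) for rows x models."""
    n, k = len(table), len(table[0])
    ranked = [rank_descending(row) for row in table]
    mean_ranks = [sum(r[j] for r in ranked) / n for j in range(k)]
    statistic = 12.0 * n / (k * (k + 1)) * sum(m * m for m in mean_ranks) - 3.0 * n * (k + 1)
    return mean_ranks, statistic, chi2_sf_even_df(statistic, k - 1)

# Columns: MLP, CNN, LSTM, CNN-LSTM, eLSTM; rows: feature sets (values are made up)
accuracy_table = [
    [0.50, 0.51, 0.54, 0.55, 0.56],
    [0.48, 0.50, 0.53, 0.54, 0.57],
    [0.52, 0.53, 0.55, 0.56, 0.58],
    [0.49, 0.50, 0.52, 0.53, 0.55],
    [0.51, 0.52, 0.56, 0.57, 0.60],
    [0.47, 0.49, 0.51, 0.52, 0.54],
]
mean_ranks, statistic, p_value = friedman_test(accuracy_table)
```

Because rank 1 is assigned to the highest accuracy in each row, the model with the smallest mean rank is the best performer, which matches the convention used when choosing the control method for the post hoc tests.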
Tables 12 and 13 present the adjusted p-values for the multiple comparisons among the five deep time series models: MLP, CNN, LSTM, CNN-LSTM, and the proposed eLSTM. The proposed eLSTM model is taken as the control method because it has the smallest mean rank among the tested techniques. Tables 12 and 13 demonstrate the highly significant improvement of the proposed eLSTM method over MLP, CNN, LSTM, and CNN-LSTM at the significance level α = 0.05. According to the results in these tables, the proposed eLSTM model improved accuracy by 3.97632%, 3.45922%, 0.929247%, and 0.503247% in comparison to MLP, CNN, LSTM, and CNN-LSTM, respectively (refer to Table 7).
VOLUME 12, 2024 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
The detailed results of each of these methods are presented in the Appendix. Based on the results provided in the Appendix, the designed Feature Set 11 (refer to Figure 31) produced higher accuracy in both the LSTM and the proposed eLSTM models. Thus, we can conclude that student engagement in the first week has a major impact on overall performance. The combination of feature sets plays a significant role in the performance of early grade prediction, as students' engagement and attainment develop over the semester [8].
Based on the proposed eLSTM model, which focuses on leveraging the model's ability to capture temporal dependencies and patterns, we found that seven attributes {Gender, Age, Country, Sponsorship Type, Course Name, MarksTest1, FreqW1} are the most important features for early grade prediction.
This is because the designed Feature Set 11, which consists of these seven attributes, produced the highest accuracy and the lowest error and RMSE values. Therefore, we conducted further experiments to observe the performance of the features more closely, as below:
Referring to Figures 12, 13, and 14, the proposed eLSTM model yields higher accuracy and lower error and RMSE values on the designed Feature Set 1 compared to the other four tested time series models. This signifies that the proposed eLSTM technique showed superiority on demographic categories.
With regard to the above-mentioned figures, the proposed eLSTM model performed better than the other four tested models on Feature Sets 4 and 5 in terms of accuracy.
Based on the above-mentioned figures, the proposed eLSTM model showed higher performance on the designed Feature Set 11 compared to all other models (in all of Experiments 1 to 3).
Referring to the above figures, the eLSTM scheme showed higher performance in terms of accuracy on the designed Feature Set 23.
This experiment is based on models developed using data prepared based on Feature Sets 24 to 28. Referring to Figures 24, 25, and 26, the highest accuracy belongs to the designed Feature Set 25 using the proposed eLSTM model.

6) EXPERIMENT 3.6: EFFECT OF ALL FEATURES
This experiment is based on models developed using all the attributes in the feature sets. Referring to the above-stated figures, the proposed eLSTM model produced higher performance than the other approaches (MLP, CNN, LSTM, and CNN-LSTM) in terms of accuracy, error, and RMSE values on the designed Feature Set 29.

VI. DISCUSSIONS
The LSTM model outperforms MLP, CNN, and CNN-LSTM, achieving superior accuracy with reduced error and RMSE. This is further enhanced by the proposed MSECosine loss function. Across six experiments with varied feature sets, the eLSTM model consistently excels in accuracy, error, and RMSE compared to MLP, CNN, LSTM, and CNN-LSTM.
Feature Set 11, combining demography, weekly engagement such as FreqW1W7, and Test1 marks, yields the highest accuracy. To gain insights into eLSTM's performance and feature influence, we employ the LIME library, highlighting the significance of Feature Set 11. The eLSTM's workings, driven by this feature set, illustrate its effectiveness, supported by the LIME interpretation provided in the Appendix.
To establish the superiority of the proposed MSECosine loss function in addressing the vanishing gradient issue compared to state-of-the-art loss functions, we monitored the trend of the loss values. Figure 11 presents a comparative analysis of the loss trends of the MSE and MAE loss functions and the proposed MSECosine loss function. Feature Set 11, known for its highest accuracy, was specifically chosen to delve deeper into the effectiveness of the MSECosine approach compared to the conventional method.
The proposed eLSTM technique in this study employs LIME explanation, comprising three key elements: 1) Instance explanation, 2) LIME explanation, and 3) Prediction probability. In the "Instance explanation," eight features (Gender, Age, Country, Sponsorship Type, Course Name, MarksTest1, FreqW1, and FreqW1W7) are presented to elucidate the model's prediction. LIME generates explanations by altering feature values and observing prediction changes.
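The perturb-and-observe idea described above can be sketched with a simplified local surrogate. This is not the `lime` library itself: it is a minimal illustration, assuming Gaussian perturbations, an exponential proximity kernel, and a weighted linear fit, with `black_box` standing in as a hypothetical prediction function for a model such as eLSTM.

```python
import numpy as np

def local_linear_explanation(predict_fn, instance, n_samples=500, scale=0.1, seed=0):
    """Fit a proximity-weighted linear surrogate around one instance.

    Returns (feature weights, intercept); the intercept approximates the
    prediction at the instance itself.
    """
    rng = np.random.default_rng(seed)
    instance = np.asarray(instance, dtype=float)
    # Alter the feature values by sampling points around the instance
    perturbed = instance + rng.normal(0.0, scale, size=(n_samples, instance.size))
    # Observe how the black-box prediction changes
    predictions = np.array([predict_fn(row) for row in perturbed])
    # Give more weight to perturbed points that stay close to the instance
    distances = np.linalg.norm(perturbed - instance, axis=1)
    proximity = np.exp(-(distances ** 2) / (2.0 * scale ** 2))
    # Weighted least squares via row scaling, with an intercept column
    design = np.hstack([perturbed - instance, np.ones((n_samples, 1))])
    sqrt_w = np.sqrt(proximity)
    coef, *_ = np.linalg.lstsq(design * sqrt_w[:, None], predictions * sqrt_w, rcond=None)
    return coef[:-1], coef[-1]

def black_box(x):
    """Hypothetical stand-in for a trained model's prediction function."""
    return 2.0 * x[0] - 3.0 * x[1] + 0.5

feature_weights, intercept = local_linear_explanation(black_box, [0.2, 0.4])
```

A positive entry in `feature_weights` means that increasing the feature raises the prediction near this instance, and a negative entry means it lowers it, which is the same reading applied to the importance weights in the instance explanation.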
Within the instance explanation, features are paired, each assigned an importance weight indicating its impact on the prediction. Positive weights indicate a higher predicted probability for the target class, while negative weights imply a reduced probability. For instance, "Age," with an importance score of 10.09, significantly influences the grade prediction, favoring passing. Similarly, "Sponsorship Type," with an importance score of 5.67, positively affects the outcome.
The prediction probability section categorizes "PASS" and "FAIL." Positive and negative weights for these categories indicate feature impact on predictions. Attributes like {Age, Sponsorship Type} and {Course, Country} impact predictions, while "FreqW1W7," "FreqW1," and "Gender" have opposing effects. This study's findings underscore eLSTM's superiority over the other tested time series approaches, attributed to its MSECosine loss function, which offers improved training balance and gradient stability. These advantages elevate performance and generalization, differentiating eLSTM from other models.
Therefore, it can provide educators with a practical tool for early student grade prediction. The model, demonstrated in Figure 30, helps identify students at risk, enabling targeted interventions. By using the proposed eLSTM, which combines LSTM with MSECosine, insights into temporal dependencies and vanishing gradient issues can be gained, capturing nuanced performance patterns. Additionally, the feature selection process identifies influential factors, providing a holistic view of student success. Thus, institutions can integrate these findings into decision support systems for refined early warning and proactive interventions. These approaches consider factors like attendance, engagement, and historical academic performance, empowering informed intervention decisions.

VII. CONCLUSION AND FUTURE WORK
This study has extensively investigated the advantages of deep time series models over traditional methods for educational sequential data prediction. The utilization of RNNs, with their incorporation of nonlinear activation functions and feedback loops, has proven to be instrumental in capturing intricate nonlinear correlations among variables, distinguishing them from conventional time series models. However, a notable challenge encountered during the training phase of RNNs is the issue of vanishing or exploding gradients, which can hinder the model's convergence.
A significant contribution of this study lies in proposing a novel solution, namely the MSECosine loss function, constructed through the combination of MSE and LogCosh. The aim of the proposed MSECosine loss function is to address this limitation by providing more control over error magnification, particularly during suboptimal predictions. The amalgamation of these functions mitigates sudden spikes in error values, contributing to a more stable training process. We utilise assessment and learning activity data from a higher education computer science program for early course grade prediction using the proposed solution, and compare it with benchmark approaches. This investigation is conducted by exploring the application of the models on 29 feature sets that are constructed from several combinations of demography, learning activities, and assessment features.
The culmination of Experiments 1, 2, and 3 holds significant relevance for predicting course grades in higher education computer science programs. Experiment 1 establishes a foundation by evaluating deep time series approaches on varied feature sets, while Experiment 2 refines the analysis by comparing the eLSTM method against basic approaches. Experiment 3, focusing on specific factors like demography, weekly engagement, and continuous assessment, adds granularity to the predictive models. These experiments collectively offer insights into the nuanced dynamics influencing academic performance, providing a holistic framework for the development of accurate and effective early course grade prediction models. This comprehensive approach is particularly vital in computer science programs, where diverse factors contribute to student success, aiding educators in tailored interventions and support strategies to enhance the overall learning experience.
Our research underscores the pivotal role of the proposed MSECosine loss function in effectively regulating errors during the training phase of LSTM models and enhancing the overall performance of deep time series approaches. By strategically combining the logarithmic term from Log-Cosh with the squared-error term of MSE, our approach offers a refined mechanism for error control. The eLSTM model, stemming from this methodology, surpasses other architectures in terms of accuracy and demonstrates lower error values. Specifically, our approach achieves an impressive 0.6191% accuracy with Feature Set 11, providing robust evidence of the superior performance of eLSTM for early grade prediction. In essence, our method not only introduces a novel approach for mitigating vanishing gradient issues but also attains the best results for early grade prediction, showcasing its effectiveness in improving both training dynamics and predictive accuracy. The performance of the proposed MSECosine loss function in mitigating the vanishing gradient issue is twice as effective as that of standard loss functions.
Moving forward, this research unlocks possibilities for further study in the realm of deep learning for time series data, and motivates further work in early grade prediction. Future research endeavors could involve the refinement of existing loss functions, exploration of additional regularization techniques, and the incorporation of interpretability tools to enhance the transparency of model predictions. Additionally, the generalizability of the MSECosine loss function across various educational datasets and contexts warrants further investigation.

Figure 2 represents the sequence of the methodology of this work.

Algorithm 1 Steps of Encoding the Features from Categorical Form to Numeric
Begin
1. Identify the categorical attributes in the combined dataset 'Assessment' and 'LMS'.
2. Define an instance of the LabelEncoder class to use for fitting and transforming the categorical attributes.
3. Fit the LabelEncoder instance to the categorical column to create a mapping of categories to numerical values.
4. Transform the categorical column using the fitted LabelEncoder instance to create the numerical column.
5. Replace the original categorical column with the new numerical column in the dataset.
End

This technique operates through three essential steps: 1) Compute the minimum (I_Min) and maximum (I_Max) of the input values. These values are critical as they determine the scaling range. 2) Scale the input data using Eqn. (1). This equation maps each data point into the range [−1, 1] based on its relationship with the minimum and maximum values. 3) Specify the scaling range; in this research, we have chosen to scale the data to the range [−1, 1].
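The encoding and scaling steps described here can be sketched in a few lines. The sketch uses plain Python instead of the LabelEncoder class so that it stays self-contained; `fit_label_mapping` mirrors the sorted-category mapping that a label encoder produces. The min-max formula in `min_max_scale` is the standard one and is assumed, not confirmed, to correspond to Eqn. (1). The sponsorship values are made up for illustration.

```python
def fit_label_mapping(values):
    """Map each distinct category to an integer, assigned in sorted order."""
    return {category: index for index, category in enumerate(sorted(set(values)))}

def transform_labels(values, mapping):
    """Replace each category with its numeric code."""
    return [mapping[v] for v in values]

def min_max_scale(values, low=-1.0, high=1.0):
    """Scale values linearly so the minimum maps to `low` and the maximum to `high`."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant column carries no spread; place it at the middle of the range
        return [(low + high) / 2.0 for _ in values]
    return [low + (high - low) * (v - v_min) / (v_max - v_min) for v in values]

# Hypothetical categorical column (values are illustrative, not from the datasets)
sponsorship = ["Self", "Scholarship", "Loan", "Self"]
mapping = fit_label_mapping(sponsorship)          # categories -> integer codes
encoded = transform_labels(sponsorship, mapping)  # numeric column
scaled = min_max_scale(encoded)                   # values in [-1, 1]
```

Fitting the mapping separately from transforming the column keeps the same category-to-number assignment available for data that arrives later, such as the test split.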

FIGURE 3. The principle and framework of student grade prediction.

Figure 4 illustrates the architecture diagram of the eLSTM scheme. According to Figure 4, the eLSTM model of this study consists of five stages: 1) Pre-processing, 2) Input layer, 3) eLSTM hidden layer, 4) Output layer, and 5) eLSTM model training based on a 70% and 30% train-test split. The pre-processed data fed to the input layer is set to an input length of 100 data points, and the number of features is obtained from the designed feature selection models (refer to Figure 31). To obtain the best time step value, various time step values were tested, and the value of 100 was found to be the best. In this layer, the input shape is calculated as the product of the number of time steps, which is set to 100, and the number of features per time step, which ranges from 5 to 21 depending on the designed model (refer to Figure 31). Therefore, the input contains up to 100 × 21 = 2100 feature values.
for T in range(1, t + 1):
    # Concatenate the input and the prior hidden state into a new vector
    Q_T = np.concatenate((P_T, v_{T-1}))
    # Calculate the forget gate activation using the sigmoid function σ
    f_T = sigmoid(np.dot(we_f, Q_T) + ba_f)
    # Calculate the input gate activation using the sigmoid function σ
    i_T = sigmoid(np.dot(we_i, Q_T) + ba_i)
    # Calculate the candidate cell state c̃_T using the tanh function
    c̃_T = np.tanh(np.dot(we_c, Q_T) + ba_c)
    # Adjust the cell state c_T using the element-wise multiplication symbol ⊙
    c_T = f_T ⊙ c_{T-1} + i_T ⊙ c̃_T
    # Calculate the output gate activation O_T using the sigmoid function σ
    O_T = sigmoid(np.dot(we_o, Q_T) + ba_o)
    # Tune the hidden state v_T using element-wise multiplication ⊙ combined with the hyperbolic tangent function
    v_T = O_T ⊙ np.tanh(c_T)
    # Compute the prediction
    g_pred_T = np.dot(we_y, Q_T) + ba_y
    # Compute the proposed MSECosine loss function L_T between the predicted and actual label value
    L_T = MSECosine(g_pred_T, Actual_label_T)
# Iterate through all previous steps for each of the 100 time steps T = 1, 2, 3, ..., t
# Return the sequence of predictions g_pred_T
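The gate computations listed above can be exercised with a small runnable sketch. It assumes randomly initialised weights, k = 5 input features, s = 79 hidden units, and t = 100 time steps; the parameter dictionary and the function name `lstm_forward` are illustrative choices. One further assumption: the prediction is read from the current hidden state v_T, following the statement that the hidden state is mapped to the prediction, so `we_y` has length s here. The loss computation is left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(P, params, s):
    """Run one LSTM layer over a sequence P of shape (t, k); return per-step predictions."""
    v = np.zeros(s)  # initial hidden state
    c = np.zeros(s)  # initial cell state
    predictions = []
    for P_T in P:
        Q = np.concatenate((P_T, v))                             # input joined with prior hidden state
        f = sigmoid(params["we_f"] @ Q + params["ba_f"])         # forget gate
        i = sigmoid(params["we_i"] @ Q + params["ba_i"])         # input gate
        c_cand = np.tanh(params["we_c"] @ Q + params["ba_c"])    # candidate cell state
        c = f * c + i * c_cand                                   # element-wise cell state update
        o = sigmoid(params["we_o"] @ Q + params["ba_o"])         # output gate
        v = o * np.tanh(c)                                       # new hidden state
        predictions.append(float(params["we_y"] @ v + params["ba_y"]))
    return np.array(predictions), v, c

# Illustrative sizes and random weights (assumptions, not trained values)
k, s, t = 5, 79, 100
rng = np.random.default_rng(0)
params = {name: rng.normal(0.0, 0.1, size=(s, k + s)) for name in ("we_f", "we_i", "we_c", "we_o")}
params.update({name: np.zeros(s) for name in ("ba_f", "ba_i", "ba_c", "ba_o")})
params["we_y"] = rng.normal(0.0, 0.1, size=s)
params["ba_y"] = 0.0

P = rng.uniform(-1.0, 1.0, size=(t, k))  # one input sequence scaled to [-1, 1]
g_pred, v_last, c_last = lstm_forward(P, params, s)
```

The last entry of `g_pred` corresponds to the output taken from the final time step, which is the value passed on to the output layer.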
Data normality was assessed using box plots, which provide the MIN, Quartile1, Median, Quartile3, MAX, and Mean values. Comparing the mean and median values indicates normality; a disparity between them prompts the use of non-parametric tests.
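The six box plot values and the mean-versus-median comparison can be computed with the Python standard library, as in the sketch below. The function names are hypothetical, the quartile convention (`method="inclusive"`) is an assumption, and the sample lists are made up for illustration. No cut-off for the mean-median gap is implied; the sketch only reports the gap.

```python
import statistics

def box_plot_summary(values):
    """Return the six summary values reported for each box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "MIN": min(values),
        "Quartile1": q1,
        "Median": median,
        "Quartile3": q3,
        "MAX": max(values),
        "Mean": statistics.fmean(values),
    }

def mean_median_gap(values):
    """Absolute difference between mean and median; a large gap suggests skewed data."""
    return abs(statistics.fmean(values) - statistics.median(values))

# Made-up samples: one symmetric, one skewed by a single large value
symmetric = [1, 2, 3, 4, 5]
skewed = [1, 1, 1, 1, 10]
summary = box_plot_summary(symmetric)
gap_symmetric = mean_median_gap(symmetric)
gap_skewed = mean_median_gap(skewed)
```

For the symmetric sample the mean and median coincide, while the skewed sample shows a clear gap, which is the kind of disparity that motivates a non-parametric test.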

FIGURE 5. Box plot of accuracy using 29 feature sets.

FIGURE 6. Box plot of error using 29 feature sets.

FIGURE 7. Box plot of RMSE using 29 feature sets.

2) EXPERIMENT 2.2: EFFECT OF PROPOSED LOSS FUNCTION
To show the superiority of the proposed MSECosine loss function over state-of-the-art loss functions in terms of addressing the vanishing gradient issue, we monitor the trend of the loss value. Figure 11 compares the loss of the MSE and MAE loss functions with that of the proposed MSECosine loss function of this work. We used Feature Set 11, since it has the best accuracy, to further investigate the effectiveness of the proposed MSECosine against the standard approach.

FIGURE 11. Comparison of loss trend values across each epoch between the MSE, MAE, and proposed MSECosine loss functions.

Referring to Figure 11, all three loss functions (MSE, MAE, and the proposed MSECosine) demonstrated a decrease over 100 epochs. However, the initial values differed, with MSE starting at 4.1550, MAE at 1.4462, and the proposed

• Experiment 3.1: Effect of demography
• Experiment 3.2: Effect of demography, weekly engagement in week 1 until 7, Test1 marks
• Experiment 3.3: Effect of demography, weekly engagement in week 1 until 7, Test1 marks and FreqW1W7
• Experiment 3.4: Effect of demography, weekly engagement in week 8 until 12 and continuous marks
• Experiment 3.5: Effect of demography, weekly engagement in week 8 until 12, continuous marks and Freq 8 to 12
• Experiment 3.6: Effect of using all features
The detailed design of the experiments is provided below:

1) EXPERIMENT 3.1: EFFECT OF DEMOGRAPHY
This experiment is based on models developed using data prepared based on Feature Set 1. Figures 12, 13, and 14 display the comparisons of the five models (MLP, CNN, LSTM, CNN-LSTM, and proposed eLSTM) based on accuracy, error, and RMSE.

FIGURE 12. Accuracy comparison based on feature set 1.

FIGURE 13. Error comparison based on feature set 1.
of five models of MLP, CNN, LSTM, CNN-LSTM, and proposed eLSTM based on the accuracy, error, and RMSE.

FIGURE 16. Error comparisons based on Feature Set 2 to 9.

FIGURE 19. Error comparisons based on Feature Set 10 to 17.

4) EXPERIMENT 3.4: EFFECT OF DEMOGRAPHY, WEEKLY ENGAGEMENT IN WEEK 8 UNTIL 12, AND CONTINUOUS MARKS
This experiment is based on models developed using data prepared based on Feature Sets 18 to 23.

FIGURE 21. Accuracy comparisons based on Feature Set 18 to 23.

FIGURE 22. Error comparisons based on Feature Set 18 to 23.

5) EXPERIMENT 3.5: EFFECT OF DEMOGRAPHY, WEEKLY ENGAGEMENT IN WEEK 8 UNTIL 12, CONTINUOUS MARKS, AND FREQW8 TO W12

FIGURE 24. Accuracy comparisons based on Feature Set 24 to 28.

FIGURE 25. Error comparisons based on Feature Set 24 to 28.

FIGURE 27. Accuracy comparison based on Feature Set 29.

FIGURE 30. Visualizing the inner working of the proposed eLSTM models using LIME for precise early grade prediction.

FIGURE 31. Construction of feature sets for student grade prediction models.

TABLE 1. Application of deep learning techniques for time series prediction.

TABLE 2. Standard and deep time series approaches for grade prediction.

TABLE 3. Pros and cons of regression loss functions.

TABLE 4. Time series datasets descriptions.

TABLE 6. Tested hyperparameters for model configuration optimization.

TABLE 7. Descriptive statistic of accuracy based on 29 features sets.

TABLE 8. Descriptive statistic of error based on 29 features sets.

TABLE 9. Descriptive statistic of RMSE based on 29 features sets.

Table 10 presents the comparison of the accuracy, error, and RMSE of LSTM with the proposed eLSTM model based on six values: MIN, Quartile1, Median, Quartile3, MAX, and Mean. Figures 8, 9, and 10 show the box plots of the standard LSTM and eLSTM models based on Table 10.

TABLE 10. Comparison of accuracy, MSE, RMSE based on 29 features sets.

FIGURE 8. Box plot of accuracy between LSTM and eLSTM using 29 feature sets.

TABLE 11. Mean of rank by Friedman test with the p-value for MLP, CNN, LSTM, CNN-LSTM, and proposed eLSTM Models based on all 29 designed feature selection methods.

TABLE 12. Adjusted p-value for tests for multiple comparisons based on all 29 feature sets.

TABLE 13. Adjusted p-value for tests for multiple comparisons among four models of MLP, CNN, CNN-LSTM, and proposed eLSTM based on all 29 feature selection methods.

TABLE 14. MLP performance on the twenty-nine designed feature set using the MSE loss function.

TABLE 15. CNN performance on the twenty-nine designed feature set using the MSE loss function.

TABLE 16. LSTM performance on the twenty-nine designed feature set using the MSE loss function.

TABLE 17. CNN-LSTM performance on the twenty-nine designed feature set using the MSE loss function.

TABLE 18. Proposed eLSTM performance on the twenty-nine designed feature selection models using the proposed MSECosine loss function.