Deep learning model to predict students retention using BLSTM and CRF

There is increasing awareness that predictive analytics helps universities to evaluate student performance. Big data analytics applied to sources such as student demographic datasets can provide insight that helps to support academic success and completion rates. For example, learning analytics is an essential component of big data in universities that gives strategic decision makers the opportunity to perform time series analysis of learning activities. A two-year retrospective analysis of student learning data from the University of Ha'il was conducted for this study. A predictive deep learning technique, bidirectional long short-term memory (BLSTM), was used to identify students whose retention was at risk. The model draws on diverse features that can be used to assess how new students will perform and thus contributes to early prediction of student retention and dropout. Further, the conditional random field (CRF) method for sequence labeling was used to predict each student label independently. Experimental results obtained with the predictive model indicate that student retention can be predicted with a high level of accuracy using BLSTM and CRF deep learning techniques.


I. INTRODUCTION
An enduring challenge in higher education all around the world is student retention [1]. Simply, higher education institutions are increasingly aware of the critical need to develop innovative approaches that ensure students graduate in a timely fashion and are well trained and workforce-ready in their field of study. As the volume and variety of data collected in both traditional and online university offerings continues to expand, new opportunities arise to apply big data analytics to challenges in higher education. Student retention is most commonly conceptualized as year-by-year retention or persistence rates as well as graduation rates [2]. Together, these rates indicate student success rates, which are typically defined as primary key indicators of university performance. In addition, they reflect the overall quality of student learning behaviour.
Prior research has shown that the decision by a student to voluntarily withdraw from their course of study may be influenced by both personal and institution-related factors [3,4]. Regarding institutional factors specifically, higher education institutions increasingly recognize that student retention rates can be important indicators of student satisfaction with the institution and/or the learning curriculum [5,6]. The present study examined the relationship between the preparatory year (bridging year) program at the University of Ha'il, undertaken by students prior to commencing their bachelor's degree, and the retention rate of these students. As such, its findings contribute to a deeper level of understanding of the role of institution-led pre-degree academic preparatory programs in the graduation outcomes of students. It also provides a unique practical contribution to the strategic planning relevant to our understanding of the relationship between the decisions made by students to remain enrolled on their course of study and the main factors that influence their decision making. Three research questions were addressed in this study: 1) What are the main features of students at the early stages of study which help to indicate student retention rates?; 2) How can student retention rates be improved?; and 3) What is the impact of first-year course grades on graduation rates for undergraduate students in a given period?
This paper focuses on two key performance indicators that are commonly used in universities and are particularly important to any investigation of student behaviours because they indicate whether students may be at risk of discontinuing their studies:
i. First-year student retention rate, defined as the percentage of first-year undergraduate students who continue at the university through to the next year in relation to the total number of first-year students in the same year.
ii. Graduation rate for undergraduate students in a given period, defined as the percentage of undergraduate students in each subject cohort who completed their programs during the specified period (no more than one successive year out of university).
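The two indicators defined above are simple ratios, and a minimal sketch may make the definitions concrete. The function names and the cohort counts below are hypothetical and chosen only for illustration; they are not figures from the University of Ha'il dataset.

```python
def retention_rate(first_year_enrolled: int, continued_next_year: int) -> float:
    """Percentage of first-year students who continue into the next year."""
    if first_year_enrolled <= 0:
        raise ValueError("cohort must contain at least one student")
    return 100.0 * continued_next_year / first_year_enrolled


def graduation_rate(cohort_size: int, completed_in_period: int) -> float:
    """Percentage of a subject cohort that completes the program in the period."""
    if cohort_size <= 0:
        raise ValueError("cohort must contain at least one student")
    return 100.0 * completed_in_period / cohort_size


# Hypothetical counts, for illustration only.
first_year_retention = retention_rate(first_year_enrolled=200, continued_next_year=170)
cohort_graduation = graduation_rate(cohort_size=120, completed_in_period=90)
print(f"retention = {first_year_retention:.1f}%, graduation = {cohort_graduation:.1f}%")
```

With these illustrative counts the sketch reports a retention rate of 85.0% and a graduation rate of 75.0%.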
The main objectives of this paper are as follows: 1) to identify students at risk of failure so that universities can devise early intervention plans to improve student performance and prevent dropout; and 2) to ascertain the effectiveness of the deep bidirectional long short-term memory (BLSTM) model combined with the conditional random field (CRF) method for the early prediction of students at risk of non-completion, compared to conventional approaches.

II. LITERATURE REVIEW
The common elements, related to student retention, are defined as follows:

A. STUDENT RETENTION AND ITS RELATIONSHIP TO STUDENT PERFORMANCE
Student retention refers to the outcome whereby, once enrolled in their course, students "remain and successfully complete their studies" [5]. The literature typically measures student graduation in terms of retention rates over a period of 5-6 years from the time of enrolment [7]. There is general agreement within the field of higher education that "student retention arises from a complex combination of student, institutional and external factors" [8]. These factors manifest in different ways for different students and, as a result, there is a complex relationship between the contextual factors that underpin a student's decision to stay enrolled in or to withdraw from a course and the performance of the student. Notably, student retention in higher education is invariably linked in the literature to student performance while completing their studies as well as to the university enrolment system and the acceptance of students on particular courses [9].
Findings from research conducted in the Saudi higher education context show that student retention may be influenced by personal, social, academic, and institutional factors [4,10]. Personal and social factors refer to those aspects related to the students' personalities and their external communities, respectively and may include such things as learning motivation, ability to manage the demands of the course, family support, financial status, and the like [10]. Academic factors relate to the learning program and materials the students engage with during the course of study, whereas institutional factors are related to the policies and procedures of the institutions such as admission policies and the provision of student support services [10]. In addition, research shows that factors such as student ethnicity, race, and gender, as well as the distance from the student's hometown to the learning institution, can be important predictors of student retention outcomes [3,4].

B. GRADUATE RETENTION
Included in the broader construct of student retention is graduate retention. This refers to students who have graduated from their undergraduate degree and choose to pursue postgraduate studies. The graduate retention rate therefore refers to the number of graduated students who are retained throughout their postgraduate study program [11].

C. STUDENT ENGAGEMENT
In the field of education, student engagement broadly refers to the level of attention, interest, and motivation the student displays towards the learning materials or topic [2]. As such, it is a complex and multidimensional construct that is strongly correlated to student retention [8]. As a generalization, students who demonstrate positive engagement with the learning materials and/or topic are more likely to achieve successful learning outcomes and to complete their courses of study [2]. Similar to student retention, student engagement is potentially influenced by multiple factors and may therefore be experienced in different ways by different students across different learning contexts [8].

D. STUDENT ATTRITION / STUDENT DROPOUT
Terms relevant to a lack of student retention include student withdrawal, student attrition, and student dropout. These terms refer to the circumstance where the student has been unable to complete his/her studies and has subsequently left the course prior to its completion [12].
The complex interrelationship of factors influencing student retention in higher education is also increasingly influenced by advancements in both information and communication technologies and the capabilities of universities to collect, manage, and analyze student enrolment data. As [8] suggests, "increased network capabilities, machine learning and artificial intelligence are poised to fundamentally impact on the relationship between students, teachers and institutions."

E. IMPORTANCE OF STUDENT RETENTION
The broader literature generally acknowledges that student retention rates are a major consideration for higher education institutions worldwide [5,6,8]. Although it is also acknowledged that students may choose to withdraw voluntarily from their study program, institutional factors or elements of the curriculum design may also prompt a withdrawal [3,10]. As a result, tertiary institutions around the world realize the need to have in place strategies and plans to monitor and address student attrition factors [6]. Moreover, tertiary institutions around the world are aware of increasing student dropout rates. For instance, the American College Testing (ACT) Report on student retention and graduation rates in colleges in the United States (US) from 1991 to 2012 reported that student graduations (within 5 years of course commencement) across all college institutions were at 51.9% in 2012, a decrease from 54.4% in 1991 [13].
In terms of the evidence around the factors that influence student retention, research shows that student experiences of the culture of the university and the pedagogical approaches implemented by teachers can have a significant influence. For instance, a study by [14] of 265 university students attending colleges in the US found that students who did not return the following year for study or who changed their major to another field had significantly lower perceptions of social connectedness and satisfaction with faculty members (approachability and quality of interactions) compared to students who returned. Further to the relationship between student-faculty member interactions and student retention, [15] shows that the number and quality of meetings held with an academic advisor can also impact first-year university students' decisions to stay enrolled in the course. Indeed, [15] concluded from the study of 363 first-year students studying at universities in the US from Fall 2009 to Fall 2010 that every satisfying meeting with an academic advisor increases the odds that a student will remain in the course by 13%. A recent study by [16] focused on the influence of financial stress, debt levels, and the availability of financial counselling on the retention rates of 2,475 undergraduate students studying in the US. The researchers found that both financial stress and student loan debt contributed to an increased likelihood of withdrawal from college. Moreover, they found that students who had sought out financial counselling were more likely to withdraw from college within the following year compared to students who had not accessed financial counselling [16]. Given the ongoing advancements in technologies for collecting, storing, managing, and mining data, it is not surprising that universities are increasingly utilizing them to be better informed about student retention outcomes.
A study by [17], for instance, applied three data mining methods: artificial neural networks, decision trees, and logistic regression, to 8 years' worth of institutional data from a university in the US mid-west region to predict the retention/attrition rates of 25,224 freshmen students. The researcher reported that the artificial neural networks performed the best (81% prediction accuracy) and found that educational and financial variables were the most important predictors of student attrition. Moreover, for universities, an important outcome to emerge from advancements in technologies for teaching and learning is the growth of online learning platforms. Online learning provision has potentially important implications for student retention rates at universities. For instance, it has potential as a learning pathway to overcome some of the factors influencing a student's decision to withdraw from their studies such as distance to university [18]. Conversely, it has the potential to accentuate some factors leading to student withdrawal such as the feeling of not having access to adequate support [19]. In terms of the research evidence, a study recently conducted by [20] compared university students' (n=15) and university faculty members' (n=15) perceptions of the main factors that influenced student retention in online education. The researchers reported that the top five factors, according to the students, were increased faculty instruction, the provision of meaningful feedback, accessing course credits for previous study, achieving a desired grade point average (GPA), and the provision of institutional support. The top five factors to influence student retention on online courses, according to the faculty members, were student self-discipline, faculty-student interaction quality, the provision of institutional support, grade received, and accessing transfer credits [20]. 
In response to the complex interrelationship of factors influencing student retention rates in higher education, institutions across the sector have an interest in better understanding the types of infrastructure and preparatory programs required to improve student retention. [21] conducted a study to identify the characteristics of students most associated with community college retention. In addition, they examined the relationship between student participation in a preparatory study skills course and overall retention rates. The study sample comprised 1,740 freshman students from a community college with three campuses across four county districts in the US. In terms of student characteristics, the researchers found gender followed by age were the most significant predictors of retention. Specifically, females were more likely to be retained than males and students aged 40 years and above were more likely to be retained than those aged 18-39. Regarding participation in a study skills course, the researchers reported that participation was a significant predictor of retention. That is, the student participants who successfully completed the study skills course were 63.6% more likely to be retained compared to students who did not take the course.

III. PROPOSED METHOD
Improvements to retention rates can be initiated by creating a student demographic dataset as a measure of undergraduate student performance at the university. These data consist of university-related information collected on the students, typically including grade averages, standardized assessment test results, participation rates, and attendance. The dataset can also be used to measure student perceptions of the university and to predict retention. It is useful for the university to collect and store student demographic data for analysis with statistical and machine learning methods, so that predicted student retention can serve as a key indicator in the university quality assurance process [22]. This paper examines two performance indicators: first-year student retention rate and student graduation rate.
The prediction method in this research can be summarised as follows:
1. Collect student data from the university dataset during the first and second years of their study.
2. Adopt a sampling method to select a subset of the student data. The sampling criterion used was "college=computer science and engineering". Save the data in a demographic file for use in the prediction process.
3. Extract relevant features from each student's data and save them in feature vectors: KPI-1 and KPI-2 features.
4. Train the features using a bidirectional LSTM learning network to determine optimal weights.
5. Label the output of the BLSTM features using CRF to predict each student's label independently.
The proposed method is illustrated in Fig. 1.

A. DATA COLLECTION AND PREPROCESSING
Student engagement in the university environment is a major concern for higher education institutions due to its implications for student retention. A better understanding of this issue can be achieved by selecting data from the university dataset using structured query language (SQL). Then a sampling approach is employed to select only significant features of students during their first and second years. For instance, in the first year, information stored about students includes their preparatory GPA, Math 1 Gr, Eng 1 Gr, etc. These features are saved in a student demographic data file. The demographic data held for each student could be increased to include other important features such as second year course grades, GPAs of semesters 3 and 4, and assessment outcomes from the first and second years of study. The key reason for using demographic data is to perform valuable and effective selection from a large amount of data without noise. Hence, we need to extract features from the university's back-end dataset and transform the retention prediction problem into a time series prediction problem.
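The selection step described above can be sketched with an in-memory SQLite database. The table name, column names, and rows are hypothetical stand-ins, since the schema of the university back end is not given here; only the sampling criterion on the college comes from the text, and dropping records with missing values is an assumed form of noise removal.

```python
import sqlite3

# In-memory stand-in for the university back end; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students ("
    "student_id INTEGER PRIMARY KEY, college TEXT, "
    "prep_gpa REAL, math1_grade REAL, eng1_grade REAL)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
    [
        (1, "computer science and engineering", 3.4, 85.0, 78.0),
        (2, "business", 2.9, 70.0, 88.0),
        (3, "computer science and engineering", 2.1, 55.0, 61.0),
        (4, "computer science and engineering", None, 90.0, 92.0),  # incomplete record
    ],
)

# Sampling criterion from the text, plus removal of records with missing values
# (the latter is an assumption about what "without noise" means in practice).
query = (
    "SELECT student_id, prep_gpa, math1_grade, eng1_grade "
    "FROM students "
    "WHERE college = ? "
    "AND prep_gpa IS NOT NULL "
    "AND math1_grade IS NOT NULL "
    "AND eng1_grade IS NOT NULL "
    "ORDER BY student_id"
)
demographic_rows = conn.execute(
    query, ("computer science and engineering",)
).fetchall()
conn.close()

for row in demographic_rows:
    print(row)
```

The parameterized `WHERE college = ?` clause keeps the sampling criterion separate from the query text, so the same selection could be repeated for another college.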

B. FEATURE EXTRACTION
The main aim of preprocessing student information is to remove incorrect data from the dataset [23]. This step is mandatory for extracting relevant features from the data. The features are described in Table 1. Two dimensions of student features have been extracted as key performance indicators (KPIs) in this paper: graduation rate and first-year student retention rate. We treat the BLSTM prediction problem as a sequence labelling problem in which both the fetched student features and the output labels are constructed as sequences and saved into a feature vector. The input sequences are denoted by $\text{KPI-1} = \{f_1, f_2, f_3, f_4, f_5, f_6\}$ and $\text{KPI-2} = \{f_7, f_8, f_9, f_{10}, f_{11}, f_{12}, f_{13}\}$, with a feature size of $n = 13$, and the corresponding labels form an output sequence $Y = \{y_1, y_2, \ldots, y_t\}$, where $t$ is the time series index. $X_i = \{x_{t=1}, x_{t=2}, \ldots, x_{t=n}\}$ stands for the input feature vector of a semester in $S = \{s_1, s_2, s_3, s_4\}$, whereas $Y_i$ stands for the corresponding output label.
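As a rough illustration of how a 13-value student record might be split into the KPI-1 and KPI-2 feature vectors and paired with a label, consider the sketch below. The helper names, the feature values, and the label coding (0 for an ongoing student, 1 for a student at risk) are assumptions made for the example; the actual features are those listed in Table 1.

```python
from typing import Dict, List, Sequence, Tuple

KPI1_SIZE = 6  # f1..f6
KPI2_SIZE = 7  # f7..f13


def split_features(record: Sequence[float]) -> Dict[str, List[float]]:
    """Split one 13-value student record into the KPI-1 and KPI-2 vectors."""
    expected = KPI1_SIZE + KPI2_SIZE
    if len(record) != expected:
        raise ValueError(f"expected {expected} features, got {len(record)}")
    return {
        "KPI-1": [float(v) for v in record[:KPI1_SIZE]],
        "KPI-2": [float(v) for v in record[KPI1_SIZE:]],
    }


def build_sequences(
    records: Sequence[Sequence[float]], labels: Sequence[int]
) -> List[Tuple[Dict[str, List[float]], int]]:
    """Pair each student's feature vectors with that student's output label."""
    if len(records) != len(labels):
        raise ValueError("each record needs exactly one label")
    return [(split_features(r), y) for r, y in zip(records, labels)]


# Hypothetical records for two students (placeholder values, not real data).
records = [
    [3.4, 85, 78, 3.1, 80, 74, 3.0, 72, 69, 2.8, 75, 70, 66],
    [2.1, 55, 61, 1.9, 50, 58, 1.7, 48, 52, 1.5, 45, 50, 47],
]
labels = [0, 1]  # assumed coding: 0 = ongoing, 1 = at risk

dataset = build_sequences(records, labels)
for features, label in dataset:
    print(label, features["KPI-1"], features["KPI-2"])
```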

C. TRAIN DATA USING BIDIRECTIONAL LONG SHORT-TERM MEMORY (BLSTM)
BLSTM is regarded as one of the leading recurrent neural network (RNN) architectures and is widely used in classification and prediction based on time series data. It can process entire data sequences and retain them in compressed form using the sequence labelling method. BLSTM also performs well with long-term data dependencies [24] and in many sequence labelling tasks such as speech and handwriting recognition. This advantage inspired us to formulate the prediction problem from a sequence perspective. The previous step characterizes students by their features; we then train the BLSTM over the student features in both directions, forward and backward, each with its own hidden states, before concatenating the outputs from both directions. BLSTM has proven effective in various related works. The typical BLSTM structure [25] is depicted in Fig. 2. It is composed of basic LSTM units, with each unit mapping between the input sequence $X$ and the output sequence $Y$. That is, a given student record has a corresponding feature sequence $X = \{x_1, x_2, \ldots, x_n\}$ and label sequence $Y = \{y_1, y_2, \ldots, y_n\}$. The mapping process goes through multiple gates, including the input gate, forget gate, current memory cell, and output gate, denoted $i_t$, $f_t$, $c_t$, and $o_t$, respectively. Furthermore, each student is mapped to a feature embedding vector $x_t \in \mathbb{R}^{d}$, where $d$ is the dimension of the student feature vector. The LSTM then uses a hidden layer $h_t \in \mathbb{R}^{k}$ that is computed from the mapped embedding vector, where $k$ is the size of the hidden layer. The mapping process then continues at each time step $t$ through the gate updates of the LSTM unit.
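For reference, the gate updates of a standard LSTM unit can be written as follows. This is the common textbook formulation and may differ in notation from the one used for the model in Fig. 2. The weight matrices $W$ and $U$, the bias vectors $b$, the candidate cell state $\tilde{c}_t$, the logistic sigmoid $\sigma$, and the element-wise product $\odot$ are assumed notation.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

In the bidirectional case, one such unit reads the sequence forward and a second reads it backward, and the two hidden states at each step are concatenated, $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$, before being passed to the labelling layer.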

D. PREDICTION USING CONDITIONAL RANDOM FIELDS (CRF)
The main task of BLSTM is to extract and encode selected features of students based on the two performance indicators, KPI-1 and KPI-2. The aim is to obtain clear indications of the differences between the encoded features of students who enrolled on the same course or in the same semester. This leads to the adoption of CRF [26] for sequence labeling to predict each student's label independently. The CRF models the relationships between adjacent labels with a transition score and learns the interactions between pairs of features and labels with a state score, as shown in Fig. 3. For instance, there are three types of students based on their features: 1) ongoing students; 2) students at risk of dropping out; and 3) unsurpassed university students. The ongoing or normal students have feature weights labeled as 0. The conditional probability of a label sequence $y$ given the encoded features $x$ is the exponential of the sequence score, normalized over all candidate label sequences, where the score function covers the transition between each label pair $(y', y)$ for a given $x$, as shown in Fig. 4. During training of the CRF model, we employed maximum likelihood estimation (MLE) as introduced in [27]: for a training pair $(x, y)$, we maximize the log-likelihood of the correct label sequence. During the decoding process, we search for the label sequence with the highest conditional probability.
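The training objective and the decoding step of a linear-chain CRF can be sketched in plain Python as follows. This is a generic illustration under stated assumptions, not the implementation used in the experiments: the emission (state) scores stand in for BLSTM outputs, and both score tables as well as the label names are hypothetical. The log-likelihood is the score of the given label sequence minus the log of the normalizer, and decoding uses the Viterbi algorithm.

```python
import math
from typing import List, Sequence

LABELS = ["ongoing", "at_risk", "unsurpassed"]  # label names assumed for illustration


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(emissions, transitions, labels: Sequence[int]) -> float:
    """State score at every step plus transition score between adjacent labels."""
    score = emissions[0][labels[0]]
    for t in range(1, len(labels)):
        score += transitions[labels[t - 1]][labels[t]] + emissions[t][labels[t]]
    return score


def log_partition(emissions, transitions) -> float:
    """Forward algorithm: log of the summed exp-scores of all label sequences."""
    k = len(transitions)
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            emissions[t][j]
            + log_sum_exp([alpha[i] + transitions[i][j] for i in range(k)])
            for j in range(k)
        ]
    return log_sum_exp(alpha)


def log_likelihood(emissions, transitions, labels: Sequence[int]) -> float:
    """Quantity maximized during MLE training for one (features, labels) pair."""
    return sequence_score(emissions, transitions, labels) - log_partition(
        emissions, transitions
    )


def viterbi(emissions, transitions) -> List[int]:
    """Return the label sequence with the highest conditional probability."""
    k = len(transitions)
    best = list(emissions[0])
    back = []
    for t in range(1, len(emissions)):
        step, pointers = [], []
        for j in range(k):
            i_best = max(range(k), key=lambda i: best[i] + transitions[i][j])
            pointers.append(i_best)
            step.append(best[i_best] + transitions[i_best][j] + emissions[t][j])
        best = step
        back.append(pointers)
    path = [max(range(k), key=lambda j: best[j])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical scores for one student over four semesters (rows) and three labels.
emissions = [[2.0, 0.5, 0.1], [1.5, 0.8, 0.2], [0.3, 2.0, 0.1], [0.2, 1.8, 0.4]]
transitions = [[0.5, 0.0, -0.5], [-0.5, 1.0, -1.0], [0.0, -1.0, 0.8]]

decoded = viterbi(emissions, transitions)
print([LABELS[i] for i in decoded])
print(log_likelihood(emissions, transitions, decoded))
```

Because the transition table rewards staying in the same label, the decoded sequence depends on neighbouring semesters as well as on the per-semester scores, which is the main reason to place a CRF on top of the BLSTM rather than classifying each step in isolation.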

IV. EXPERIMENTAL RESULTS
The student data was collected from the University of Ha'il dataset. In our experiments, we trained student features using BLSTM to assign optimal weights. Then we applied the CRF model to mark student features with labels. Later we compared each student with their neighboring students to predict retention and graduation rates. The university dataset consisted of 35,000 student records. We had two classes of student data based on the KPIs: first year and second year students. Two thousand first year students were selected from the preparatory dataset and 949 second years from the College of Computer Science and Engineering dataset. As a result, each student had 13 features. We regarded the first and second semesters as a time series for predicting student retention rate and the third and fourth semesters as a time series for predicting student completion rate.
The principal objective was to investigate the number of students at risk of discontinuing their study, as indicated by dropout risk and the risk of not completing on time. In this paper, we selected registered students attending the College of Computer Science and Engineering (CCSE). Three College departments were included in our investigation: Computer Science (CS), Software Engineering (SE), and Computer Engineering (CE). We combined data from both students who were at risk and from those with unsurpassed records into a CCSE demographic dataset suitable for learning status prediction. Details of the CCSE dataset are presented in Table 2. The preprocessed datasets were divided into two groups, 80% of which were used as training datasets and 20% of which were used as testing datasets. BLSTM was used as a deep learning model for time series predictions, the parameters and values of which are displayed in Table 3. Performance was evaluated using precision, recall, and F score, computed as $\text{Precision} = TP / (TP + FP)$, $\text{Recall} = TP / (TP + FN)$, and $F = 2 \cdot \text{Precision} \cdot \text{Recall} / (\text{Precision} + \text{Recall})$, where TP is a true positive, indicating that the retention status of the student is at risk, as is their predicted status. FP is a false positive, where the student's retention status is not at risk, but the predicted tag indicates that there is a risk. FN is a false negative, where the student's retention status is at risk but their predicted status suggests that there is no risk. We trained our prediction model over the first and second semesters, with the mean and standard deviation metrics shown in Fig. 5. The figure shows the performance of our model, the BLSTM with CRF, and the performance of another model using only the BLSTM, both of which were tested on the CCSE dataset. Our solution achieved a precision of 89.1%, a recall of 88.5%, and an F score of 88.8%. The results for the second year are promising and exceed those for predicting student retention in the first year.
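The evaluation metrics can be computed from the TP, FP, and FN counts as in the sketch below. The counts are hypothetical, since the confusion counts behind the reported figures are not given; the final lines only check that the reported F score is consistent with the reported precision and recall under the usual harmonic-mean definition.

```python
from typing import Tuple


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F score from at-risk prediction counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f_score(precision, recall)


# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f(tp=80, fp=10, fn=20)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")

# Consistency check on the reported percentages (89.1 and 88.5).
reported_f = f_score(89.1, 88.5)
print(round(reported_f, 1))  # 88.8
```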
The most recent student data from the academic year 2020-2021 were used to evaluate the performance of our BLSTM+CRF method compared to that of the state-of-the-art prediction methods. Table 4 compares the results obtained by Logistic Regression [28], Decision Tree [29], Random Forest [30], Naïve Bayes [31], Support Vector Machines [32], and Neural Network [33] methods based on the most intuitive measure of success: accuracy. Accuracy is defined as the ratio of students whose at-risk retention status was correctly predicted to the total number of students.

V. CONCLUSION
The retention rates of students across their selected field of study and the overall graduation rates over a specified period are issues of concern to higher education organizations. Indeed, according to university teachers, universities must be persistent in their efforts to improve student retention rates. This paper has a dual focus: first, on the use of data preprocessing and deep learning algorithms to extract sensitive information about students, the data in this case being extracted from a demographic data file; and second, on the utilization of key performance indicators with the CRF method as a classifier. The unique contribution of this paper is that, by employing these algorithms, we formulated a deep learning predictor model to investigate student retention at four levels of study (over a two-year period). Experimental results were obtained from a case study of the University of Ha'il. The results obtained with the BLSTM learning method and the predictive CRF approach indicated that prediction of student retention was possible with an accuracy of over 0.85 in most scenarios and with FP rates ranging from 0.05 to 0.10 in most cases. Future work will utilize the ant colony optimization (ACO) method to enhance the structure of BLSTM cells, because it allows the researcher to determine the optimal weights of the connected BLSTM cells. Furthermore, it can reduce the number of BLSTM cells required whilst at the same time improving predictive ability.