Interpretable Models for Early Prediction of Certification in MOOCs: A Case Study on a MOOC for Smart City Professionals

Over the last few years, Massive Open Online Courses (MOOCs) have expanded rapidly and tend to become the most typical form of online and distance higher education. As a result, a tremendous amount of data is generated and stored on MOOC online learning platforms. This data should be effectively transformed into knowledge, thus providing valuable feedback to learners and enhancing decision-making practices in the educational field. Despite the benefits and learning prospects that MOOCs offer to learners, there is a considerable divergence between enrollment and completion rates. In this context, the main scope of this study is to exploit predictive analytics and explainable artificial intelligence for the early prediction of student certification in an 11-week MOOC for smart cities, namely DevOps. A plethora of Machine Learning models were built employing familiar classification algorithms. The experimental results revealed that the models based on the Gradient Boosting, Logistic Regression and Light Gradient Boosted Machine classifiers prevailed in terms of Accuracy, Area Under Curve, Recall, Precision, F1-score, Kappa, and Matthews Correlation Coefficient, achieving a predictive accuracy of 94.41% at the end of the second week of the course. Therefore, students who are less likely to obtain a certificate can be identified at an early enough stage to provide them with sufficient support actions and targeted intervention strategies. Finally, the performance attributes (i.e., overall grades per week) proved to be the most important predictors for identifying students at risk of failure.


I. INTRODUCTION
Over the last few years, there has been increasing interest in Massive Open Online Courses (MOOCs) offered by top-quality universities worldwide. MOOCs have expanded rapidly and tend to become the most typical form of online and distance higher education [1]. In light of this new trend, diverse and large groups of learners, varying in several characteristics such as age, nationality, family obligations and educational level, may attend, for free or at low cost, flexible and short-term courses of their interest and study at their own pace without requiring physical attendance [2]. Additionally, even after registration has been completed, attendees differ significantly with regard to their objectives, motivation, interests, and interaction with the course content [3].
As a result, a tremendous amount of data is generated and stored on MOOC online learning platforms regarding student learning behavior and performance, engagement with learning material, social interactions, assignment scores, learning outcomes, and demographic information such as gender, ethnicity, education level, employment status and employment type, to name a few [4]. However, this data should be mined and effectively transformed into knowledge, in order to provide valuable feedback to students and enhance decision-making practices in the educational field [5]. Educational Data Mining (EDM) and Learning Analytics (LA) constitute powerful tools for uncovering valuable information from vast amounts of raw data.
EDM and LA have significantly evolved over the past two decades and form an integral part of educational research [6]. These different but complementary scientific fields aim primarily at enhancing learning experience and optimizing the quality of teaching [7]. More precisely, EDM concerns the development and implementation of Data Mining (DM) methods in data derived from different educational settings for addressing a wide range of learning problems [8]. LA refers to the exploitation of educational data for supporting educational practices, understanding learning behavior and improving student performance [9]. Notwithstanding their common objective, the principal difference between them is that EDM tackles an educational problem from the technological viewpoint, whereas LA is mainly focused on the pedagogical aspect of it [10].
One of the most studied problems in the fields of EDM and LA is prediction, which involves building a Machine Learning (ML) model for inferring future learning characteristics of students based on historical data and ML methods [11]. This process is known as predictive analytics [12]. Regarding MOOCs, prediction is an umbrella term for a wide array of specific predictive problems which have attracted the attention of many researchers. These problems encompass primarily the prediction of student behaviors and outcomes, such as dropout, retention, completion, certification [13], final exam grade and course grade [14]. The first four problems refer to binary classification tasks (i.e., the output attribute has two class labels), whereas grade prediction is a typical multiclass classification or regression task (i.e., the output attribute is a numerical one). In particular, the term "certification", which is the main focus of our research, refers to the successful completion of a MOOC [15] (i.e., achieving an average score above a predefined threshold).
Despite the educational benefits and learning prospects that MOOCs offer to students, there is a considerable divergence between enrollment rates and completion rates [16]. Lack of time and interactivity, course timetable and length, motivation, subject interest, feelings of isolation, and poor background knowledge are important factors behind low completion rates [17]. These factors can be classified into two main groups: student-related and course-related factors [2]. Recent research has shown that the success of a MOOC is directly linked to the provision of support services to potential low-performing students, which could motivate them to successfully complete the course and receive a certification [18].
In this context, the main scope of this study is to exploit predictive analytics for predicting student certification in a MOOC. The contribution of this study is three-fold. First, we examine whether predictive analytics could help us gain insight into student online learning behavior and build highly effective models for predicting student certification at the end of an 11-week course. Second, we intend to understand and explain the predictions made by a predictive model and specify the features of students which have a great impact on earning a course certificate. Explainable Artificial Intelligence (XAI) is a new research field attempting to build ML models that are easier to interpret and understand than the so-called black-box models [19]. Finally, the results of the study provide evidence that students who are less likely to obtain a certificate can be identified at an early enough stage to provide them with sufficient support actions and properly targeted intervention strategies.
The rest of the paper is organized as follows. Section II presents a brief review of previous work in predicting student certification in MOOCs. The dataset used in the study is described in Section III, whereas Section IV formulates the experimental design and analyzes the produced results. Finally, the paper concludes considering some thoughts for future research directions.

II. RELATED WORK
A great deal of existing research has focused on building predictive models for predicting student behaviors and outcomes in MOOCs. Additionally, several ML models have been utilized in recent years to address the problem of unsuccessful MOOC completion.
Coleman et al. applied Latent Dirichlet Allocation (LDA), an unsupervised probabilistic method, for uncovering behavioral patterns of students enrolled in a MITx course [20]. These patterns were then used for building a mixed-membership model to predict certification of students.
Al-Shabandar et al. examined the efficiency of various ML algorithms for predicting student certification in 15 MOOC courses offered by Harvard and MIT [21]. To this end, they considered two feature datasets: the clickstream dataset (i.e., video views, content interaction, access to assignments, and posts in discussion forums) and the demographic dataset (i.e., age, gender, and education level). Random Forests (RF) and Multilayer Perceptrons (MLPs) were found to prevail in terms of several metrics such as classification accuracy and kappa.
Four classification algorithms (i.e., Logistic Regression (LR), k Nearest Neighbors (k-NN), Gradient Boosting and RF) were applied in [22] for the early prediction of student certification. The produced models were evaluated in terms of F1-score and Area Under Curve (AUC) at the end of each one of the first four weeks of the course.
Very recently, Moore and Wang examined the influence of student motivational dispositions on completing a HarvardX MOOC [3]. For this purpose, they employed Latent Profile Analysis (LPA) for detecting specific groups among course completers. Two latent profiles were identified regarding intrinsic and extrinsic motivation of students. What is more, it was found that educational background, gender, and latent profile were all substantially related to the course grade.
In a similar study, Gitinabard et al. combined forum posts and online activity log files for predicting student dropout and certification in a MOOC course delivered by Coursera [23]. To this end, LR and Support Vector Machines (SVMs) were applied, creating two classification models. The predictive accuracy was above 85% from the first week of the course, whereas AUC exceeded 90%. Additionally, submissions and video watching were found to be the most influential attributes. Video watching frequency per week was also confirmed to be a significant attribute for grade prediction in a weekly-organized MOOC [24], [25].

III. DATA DESCRIPTION
This study was conducted as a part of the Erasmus+ Sector Skills Alliance project called "DevOps: Competences for Smart Cities" (https://devops.uth.gr/dev/). DevOps aims at equipping current and aspiring smart city professionals with appropriate skills to enable transformative urban innovation and support technologically enhanced urban governance, with a special emphasis on the DevOps methodology [26]. Registration in the DevOps MOOC (https://smartdevopsmooc.eu/moodle/pages/login.php) started on 15 September 2020 and lasted one month, whereas the start date of the course was 19 October 2020.
The course was organized in a weekly setup (11 weeks in total), and it was structured around approximately 1-2 modules per week (15 modules in total). Each module was available in English and consisted of 2 to 5 learning units, each of which included an automatically graded multiple-choice assessment quiz. Weeks 7 and 11 were used as reflection weeks to help learners catch up on their study or get some free time to reflect on what had been taught. The course content was designed to address the European Qualifications Framework (EQF) level 5, since this is the requisite level of autonomy and responsibility for smart city professionals.
The registration form comprised a questionnaire regarding personal, demographic and employment data of students, in which they were informed that all data would be acquired and used according to the GDPR regulation (EU 2016/679) for evaluating the quality of the course. In addition, participants were asked to consent to the analysis of their data; otherwise, they could skip the questionnaire and proceed with the registration providing only their full name and email. As a result, an overall number of 961 students enrolled in the course, of whom 944 provided demographic data.
Considering that the quality of data is essential for building effective and robust ML models [27], a preprocessing analysis was performed for cleaning and preparing the data before applying any ML algorithm. For this purpose, the missing values of the numerical attributes were imputed employing the mean imputation method, whereas the missing values of the qualitative attributes were replaced with the constant "unknown". Besides that, all zero-activity records during the course were excluded. Finally, 936 records were filtered and saved in a comma-separated values (.csv) file, allowing a plethora of DM methods and ML algorithms to be applied.
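The three preprocessing steps above can be sketched as follows. This is a minimal illustration, not the study's code: column names are hypothetical, and `None` is assumed to mark a missing value.

```python
def preprocess(records, numeric_cols, qualitative_cols, activity_col):
    """Impute missing values and drop records with zero platform activity."""
    # Mean imputation: average each numeric column over its non-missing values.
    means = {}
    for col in numeric_cols:
        observed = [r[col] for r in records if r[col] is not None]
        means[col] = sum(observed) / len(observed) if observed else 0.0

    cleaned = []
    for r in records:
        if r[activity_col] == 0:          # exclude zero-activity records
            continue
        row = dict(r)
        for col in numeric_cols:
            if row[col] is None:
                row[col] = means[col]     # mean imputation
        for col in qualitative_cols:
            if row[col] is None:
                row[col] = "unknown"      # constant-value imputation
        cleaned.append(row)
    return cleaned
```

The same pipeline is typically a few lines of pandas (`fillna` with the column means, `fillna("unknown")`, and a boolean filter on the activity column); the explicit version makes the order of operations visible.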
The initial dataset comprised eleven qualitative attributes regarding personal information of students such as gender, educational background, employment, skills, previous MOOC experience, and average available study hours per week (Table I); it is therefore called the Demographics set of attributes. This dataset was gradually enriched with two other sets of attributes, namely the Performance set and the Activity set. The Performance set consisted of ten attributes about students' learning achievements during the first two weeks of the course, mainly considering grading scores in quizzes (100-point scale) and the overall grades in modules 1 and 2 (Table II). The Activity set comprised twelve numerical attributes concerning students' recorded activity in the online learning platform, such as the number of views, posts, discussions, and connections, as well as the total time devoted to the first two modules of the course. The target attribute "Course Total Grade" is a binary one with two possible values {0, 1}, where 0/1 means that a student has not/has obtained a certification of successful completion of the course (i.e., a student achieved an average quiz score of 80% or more). As regards the distribution of the two output classes in the dataset, 75.75% of the records represent students who obtained a certification, whereas 24.25% correspond to students who did not get a course certificate.
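The derivation of the binary target can be made concrete with a short sketch, assuming (as stated above) an 80% threshold on the average quiz score; the function name is illustrative.

```python
CERTIFICATION_THRESHOLD = 80.0  # average quiz score on the 100-point scale

def course_total_grade(quiz_scores):
    """Return 1 (certified) if the average quiz score meets the threshold, else 0."""
    if not quiz_scores:
        return 0
    average = sum(quiz_scores) / len(quiz_scores)
    return 1 if average >= CERTIFICATION_THRESHOLD else 0
```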
Table IV provides a descriptive statistics summary (i.e., count, mean, standard deviation, minimum, 1st quartile, median, 3rd quartile and maximum) for each of the performance and activity attributes, in order to gain a better understanding of the distribution of the data.
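Such a summary can be reproduced per attribute with the standard library alone (a pandas `DataFrame.describe()` call yields the same set of figures); the helper below is an illustrative sketch, not code from the study.

```python
import statistics

def describe(values):
    """Summary statistics for one numerical attribute, as in Table IV."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),   # sample standard deviation
        "min": min(values),
        "25%": q1, "50%": median, "75%": q3,
        "max": max(values),
    }
```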
In addition, a matrix heatmap is illustrated in Fig. 1, depicting the correlation between the numerical attributes of the dataset. Each square of the matrix represents the correlation between the attributes paired on the two axes. Red indicates positive correlation between two attributes, whereas blue indicates negative correlation. Moreover, the intensity of the color implies how strongly the attributes are correlated: the deeper the color, the stronger the correlation. A cursory reading of the matrix heatmap reveals that there is a perfect positive correlation between the performance attributes.
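The quantity behind each square of such a heatmap is the Pearson correlation coefficient between two attribute columns; values near +1 correspond to the deep-red squares and values near -1 to deep-blue ones. A minimal sketch of its computation:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

In practice the full matrix and the figure are produced with a `DataFrame.corr()` call and a plotting library such as seaborn.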

IV. METHODOLOGY AND EXPERIMENTS
A plethora of supervised learning algorithms were employed for building corresponding predictive models using PyCaret [28]. PyCaret is an open-source ML library in Python, which enables the implementation of several classification methods. Specifically, the algorithms used in the experiments are as follows:
- Adaptive Boosting (AdaBoost) classifier [29], which was originally used for binary classification problems. This meta-estimator seeks to build a strong classifier by exploiting a set of weak classifiers.
- Gradient Boosting (GB) classifier [30], another common boosting method, which fits each new estimator to the errors made by the previous one.
- Classification and Regression Tree (CART) [31], a very powerful decision tree algorithm for both classification and regression problems, using the Gini coefficient for splitting the dataset at the node with the larger uncertainty.
- Extremely Randomized Trees (Extra) [32], a very fast ensemble-based algorithm which randomly chooses the split point at each node.
- Linear Discriminant Analysis (LDA) [33], an algorithm for two-class classification problems, which employs Bayes' theorem for calculating the probability of the output class given the input attributes, under the assumption that the output classes are linearly separable.
- Light Gradient Boosted Machine (LightGBM) ensemble method [34], a robust extended version of the GB classifier.
- Logistic Regression (LR) [35], a representative statistical method for binary classification problems, which models the probability of a student being classified as certified (with two possible outcomes: 0 and 1) given the values of the independent attributes.
- Random Forest (RF) [36], a popular bagging-based algorithm, which combines the output of several decision trees trained on sub-samples of the dataset, producing the final prediction via majority voting.
Several studies indicate the effectiveness of these methods for building highly accurate and robust predictive models in the fields of EDM and LA [11]. What is more, boosting and ensemble methods show top-performing results in both classification and regression problems without requiring any special parameter configuration. The parameter settings of these methods are presented in Table V; in most cases, the default settings were adopted.
Since the distribution of the two output classes in the dataset was imbalanced, we applied the Synthetic Minority Oversampling Technique (SMOTE) [37] for augmenting the minority class. In addition, the k-fold (k=10) cross-validation resampling procedure was used for evaluating the performance of the predictive models [38], [39]. Accordingly, the dataset was randomly divided into k folds of equal size. Each fold was used for evaluating the performance of the model trained on the remaining folds, and the final measure was the average of the performance measures computed on each test fold.
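Both resampling steps can be sketched in a few lines. This is a toy illustration under stated assumptions, not the study's pipeline: the actual experiments used SMOTE and cross-validation as provided by PyCaret, whereas here SMOTE is reduced to its core idea (a synthetic minority point interpolated between a minority sample and one of its nearest minority neighbours) and the k-fold split to an index partition.

```python
import random

def smote(minority, n_synthetic, k=3, seed=0):
    """Generate synthetic minority samples by nearest-neighbour interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, nb)))
    return synthetic

def kfold_indices(n, k=10, seed=0):
    """Partition n record indices into k disjoint test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]
```

Each fold returned by `kfold_indices` serves once as the test set while the remaining indices form the training set, and the reported metric is the average over the k test folds.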
One of the main goals of our study was the early prediction of student certification. Therefore, the experiments were performed in three consecutive steps each of which was linked to a specific time-point.
Step 1 (Week 0) corresponded to the data available before the start of the course, whereas Steps 2 and 3 corresponded to the time-points at the end of the first (Week 1) and second (Week 2) weeks of the course, respectively. The attributes used in each experimental step are shown in Table VI.

TABLE VI ATTRIBUTES USED IN EACH STEP

Step 1: Demographic attributes; Number of views in introductory forum; Number of views in announcements forum
Step 2: Attributes of Step 1; Number of forum views in
Step 3: Demographic attributes; Performance attributes; Activity attributes

V. RESULTS
A broad variety of widely used evaluation metrics were calculated to quantify the performance of the classification models in each one of the experimental steps. More specifically, these metrics include accuracy (Acc), Area Under Curve (AUC), Recall, Precision, F1-score, Kappa, and Matthews Correlation Coefficient (MCC).
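All of these metrics can be derived from the binary confusion matrix. The sketch below is an illustrative helper following the standard definitions, not code from the study:

```python
import math

def metrics(tp, fp, fn, tn):
    """Evaluation metrics from a binary confusion matrix."""
    n = tp + fp + fn + tn
    acc = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: observed agreement corrected for chance agreement
    pe = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    kappa = (acc - pe) / (1 - pe)
    # Matthews Correlation Coefficient
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"acc": acc, "precision": precision, "recall": recall,
            "f1": f1, "kappa": kappa, "mcc": mcc}
```

Unlike plain accuracy, Kappa and MCC remain informative on imbalanced data such as this dataset's 75.75/24.25 class split, which is why they are reported alongside the other metrics.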
The results are shown in Tables VII-IX, where the best value for each metric per step is highlighted in bold. Overall, it is observed that GB, LightGBM and LR produce the best-performing models. All metrics increase over time, as could be expected, since new information about students is added from week to week. Accuracy is greater than 83% for all models (except LDA) in Week 0, reaching 93.46% at the end of the first week and 94.41% at the end of the second week. F1-score, which is the harmonic mean of precision and recall, achieves a value of 87.05% and 88.91% at Week 1 and Week 2, respectively. In addition, AUC ranges from 0.9376 to 0.9863, showing that very precise models are created with a high-level measure of class separability. It is noticed that the difference between training and 10-fold cross-validation accuracy is very small, especially for training sizes above 400, indicating the models' ability to minimize bias and variance.
The feature importance plot displays how useful each of the 10 most important input attributes is for predicting the target attribute in each learning model. Besides that, it provides valuable information for better understanding both the data and the model. It is observed that the performance attributes are mostly the most important ones, especially the total grades in modules 1 and 2, as well as the quiz grades. Regarding the activity attributes, the number of views in the announcements forum and the number of connections in modules 1 and 2 seem to be the most significant predictors. These findings are in line with the results of recent research, showing that certificate achievers engage in more course-associated and graded assessment quizzes [40].
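The importances reported above are model-specific, but the underlying idea can be illustrated generically: a feature matters to the extent that scrambling its column degrades accuracy (permutation importance). The sketch below is a deterministic toy version with a hypothetical rule-based "model"; it is not the mechanism PyCaret used.

```python
def accuracy(model, X, y):
    """Fraction of records the model classifies correctly."""
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature):
    """Accuracy drop when one feature column is cyclically shifted."""
    baseline = accuracy(model, X, y)
    shifted = [list(x) for x in X]
    column = [x[feature] for x in X]
    column = column[1:] + column[:1]  # deterministic "permutation"
    for row, v in zip(shifted, column):
        row[feature] = v
    return baseline - accuracy(model, shifted, y)
```

For a model that only looks at, say, a grade feature, permuting that column destroys its accuracy while permuting an ignored activity feature changes nothing, mirroring the dominance of the performance attributes in the plots.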
Regarding the interpretability of the ML models, we also provide the SHAP (SHapley Additive exPlanations) summary plots [41] for the LightGBM (Fig. 3), GB (Fig. 4) and RF (Fig. 5) models. In these plots, the input attributes are ordered in descending importance from top to bottom, and their impact on the model output is shown (red for positive impact, blue for negative). It is shown that a large total grade in module 2 increases the chance of a learner earning a certificate. Therefore, a result that was previously produced by a black-box method is now reasonably explainable and easily understandable.
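Conceptually, a SHAP value is the Shapley value of a feature: its average marginal contribution to the model output over all orderings of the features. The exhaustive computation below is an illustrative sketch for a handful of features with a hypothetical `value` function; the SHAP library approximates the same quantity efficiently for real models such as LightGBM.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, features):
    """Exact Shapley value of each feature.

    `value(subset)` returns the model output when only `subset`
    of the features is known (the coalition's "payoff").
    """
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # weight of coalitions of this size in the Shapley formula
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                total += weight * (value(set(subset) | {f}) - value(set(subset)))
        phi[f] = total
    return phi
```

A useful sanity check is the efficiency property: the per-feature values sum exactly to the difference between the full-model output and the baseline, which is what lets a SHAP plot decompose an individual prediction.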
Finally, we attempted to build new classification models based only on the two most important attributes in each step (Table X). It is observed (Tables XI-XIII) that the effectiveness of all methods was slightly improved for all metrics, showing that the integration of XAI may also contribute towards building more effective and robust learning models regardless of the ML method applied.

TABLE X ATTRIBUTES USED IN EACH STEP

Step 1: Number of views in introductory forum; Number of views in announcements forum
Step 2: Number of views in announcements forum; Overall grade in Module_1
Step 3: Number of views in announcements forum; Overall grade in Module_2

Finally, the results reveal that the produced predictive models could serve as an early alert system for identifying students at risk of failure from the very beginning of the course. Therefore, a series of targeted support actions and intervention strategies could be implemented to improve student learning outcomes [42]. It is important that the prediction, and therefore the intervention, take place early enough, i.e., after the first two weeks of the course, thus giving educators the ability to measure its effectiveness in addressing the learning needs and difficulties of learners during the following weeks of the course [43].

VI. CONCLUSIONS
In the present study, an attempt was made to exploit predictive analytics for predicting student certification in a MOOC for smart city professionals. To this end, a plethora of high-performing ML models were produced employing a variety of classification algorithms. The experiments were performed at three consecutive time-steps, corresponding to the period before the course start and the first two of the eleven weeks of the course.
The results revealed that students at risk of failure can be identified with an accuracy greater than 94% at the end of the second week of the course. Hence, the models may serve as an early alert system for educators, since students less likely to obtain a certificate can be identified at an early enough stage to provide them with sufficient support actions and properly targeted intervention strategies. This is of vital importance for universities seeking to increase retention rates in MOOCs and provide high-quality education to learners. Furthermore, we identified the features which have the greatest impact on earning a course certificate. Our findings indicate that the performance attributes regarding student grading in quizzes and activities during the first two weeks of the course are decisive for accurately identifying low performers.
Although our research is a case study, the findings are in line with recent research. Explainable learning models could provide more meticulous and expressive information about student learning behavior and performance [19]. To this end, in future work we intend to build predictive models based on the most informative features of the dataset and apply them to other MOOCs. In addition, we intend to experiment with new ML methods, such as Semi-Supervised Learning, which have proven very effective in the EDM field [44].