Data Analytics on Online Student Engagement Data for Academic Performance Modeling

In large MOOC cohorts, the sheer variance and volume of discussion forum posts can make it difficult for instructors to distinguish nuanced emotion in students, such as engagement levels or stress, purely from textual data. Sentiment analysis has been used to build student behavioral models that capture emotion; however, more recent research suggests that separating sentiment and stress into distinct measures could improve these approaches. Detecting stress in a MOOC corpus is challenging because students may use language that does not conform to standard definitions, but newer techniques such as TensiStrength provide more nuanced measures of stress by treating it as a spectrum. In this work, we introduce an ensemble method that extracts feature categories of engagement, semantics and sentiment from an AdelaideX student dataset. Stacked and voting methods are used to compare how accurately these features can predict student grades. The stacked method performed best across all measures, with our Random Forest baseline further demonstrating that negative sentiment and stress had little impact on academic results. As a secondary analysis, we explored whether stress among student posts increased in 2020 compared to 2019 due to COVID-19, but found no significant change. Importantly, our model indicates that there may be a relationship between features, which warrants future research.

I. INTRODUCTION

[…] emotional state of their cohort, which may be important in student outcomes. This has motivated studies such as [1] and [2] to develop approaches to detect salient features among forum 'noise' and provide ways for instructors to identify urgent posts for timely intervention. Many studies have used sentiment analysis to interpret student behavior in MOOC courses through discussion forum posts [3], [4], [5], [6]. While sentiment is useful for understanding opinions, attitudes and emotion, more recent studies have sought to distinguish further nuances in features such as stress to develop more holistic models of student behavior. The challenge of detecting sentiment and stress in a MOOC corpus is that language used by students may not always conform to standard meanings. Therefore, detecting stress requires refined methods, as demonstrated by the development of models such as TensiStrength [7].

[…] and test its applicability using student posting data, which may also be subject to event-specific stressors.

The work is driven by the following research questions:
1) How does stress compare to other discussion forum features such as engagement, semantics and sentiment in determining student academic performance?
2) Did stress increase among student cohorts during the pandemic?

To achieve this, we use an ensemble method consisting of three machine learning algorithms (Naïve Bayes, Random Forests and Deep Learning), with overall results filtered using stacked and voting methods. TensiStrength is used to extract stress features and provide numeric calculations for sentiment and stress measures. This will provide a more measured understanding of the impact of COVID-19 on online learning. As far as we are aware, this study is one of the first to utilise TensiStrength in the educational space for detecting stress.

Our overarching contributions are the following: […]

The development of the ensemble model provides a platform-agnostic tool that can assist in identifying posts that require urgent intervention, adding both theoretical and methodological contributions to the MOOC research domain.

The paper is structured as follows. After this Introduction, Section II discusses state-of-the-art works related to our study. The research problem and aim are defined in Section III, followed by the research design and technical details presented in Section IV. In Section V, the experiment design and experimental results are reported, with the related discussions presented in Section VI. Finally, Section VII provides the conclusion and discusses future work.

II. RELATED WORK

Data-mining techniques are well-established in Social Media research for retrieving textual content to model user behavior [11], [12], [13], [14], [15]. While sentiment studies have made significant advances into health fields such as mental health (e.g., [16], [17]), the application of machine learning techniques to educational settings such as MOOCs is still developing. A study by [18] determined that standard sentiment analysis methods such as those used in social media research were unsuitable for the MOOC context. Instead, they developed a BERT-based sentiment analyzer that outperformed state-of-the-art social media sentiment predictors with 0.94 accuracy. This demonstrates the need for purposed models.

Previously, sentiment analysis has been used for […] student posts more accurately, particularly when there are uncommon expressions and alternative modes of phrasing to denote their feelings about a course. In analysing event-based language, [23], [24] note that event-based posts such as 'running late' may be correlated as stressful language, when in fact this might just be a neutral statement. In [23], […] to analyse textual behavior across the duration of a course.

For a more comprehensive model of student behavior, it is necessary to incorporate features beyond sentiment. To this end, [30] analyzed 'burstiness' (posting frequency) at particular temporal points in a course, which may explain why students demonstrate particular sentiment or stress at different milestones or times in a semester.

In systematic reviews of sentiment analysis in the education domain, [6] and [31] found that Naïve Bayes and Deep Learning were among the more common techniques used. Similarly, [4] demonstrate that Random Forest is widely used to analyse forum messages. We therefore adopt these three algorithms as baseline classifiers for our experiment design. Our work combines prior research into the development of a user behavioral model in the MOOC, taking the COVID-19 pandemic as an overarching event. Numerous reports indicate that this event impacted traditional student cohorts, but there is presently a lack of understanding about its effect on MOOC students. By extracting features such as interaction patterns, common semantic behavior and more nuanced analysis of sentiment and stress, a more holistic model of student behavior in an online context can be determined.

III. RESEARCH AIM
We posit that student behavior can be represented as a set of behavioral features. These features, denoted by f, are quantified or calculated and make up feature vector sets, denoted by FV, which contain each feature weighting. Equation 1 defines these feature vectors. Here we use a student's academic performance, defined as ap, inside MOOCs as a label for these feature sets. We aim to understand and clarify the coefficients within each behavior set and their impact on both the overall behavior model and ap by solving the function f() defined by Eq. 2. The coefficients α, β, . . . , γ define the impact or importance of each individual feature vector on ap for the model.
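The equations themselves did not survive extraction; the following is a hedged LaTeX reconstruction from the definitions above, where each FV_i collects one category's quantified features and the coefficients weight each vector's contribution (the exact functional form in the original may differ):

```latex
% Eq. 1 (reconstruction): a feature vector set collects the weighted features
FV_i = \langle f_{i1}, f_{i2}, \dots, f_{in} \rangle

% Eq. 2 (reconstruction): academic performance as a weighted combination
% of the engagement, semantic and sentiment feature vectors
ap = f\big(\alpha \cdot FV_{\mathit{eng}} + \beta \cdot FV_{\mathit{sem}} + \dots + \gamma \cdot FV_{\mathit{sen}}\big)
```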

IV. RESEARCH DESIGN
Our conceptual model is depicted in Figure 2. This framework learns the function in Eq. 2 from a data source (solid-lined boxes) and performs data engineering to synthesise additional values (dashed boxes) to finalise the proposed model. Based on observations of MOOC student data, the model is designed to learn from three feature categories: engagement, semantics and sentiment, as depicted in the Venn diagram in Figure 3. The learned model is then trained and validated using machine learning prediction algorithms, which output a usable instance of our proposed model.

The feature extraction layer comprises the feature categories shown in Figure 3. Engagement describes the intensity, or level of interaction, a student has with the discussion forum and incorporates measures determined by overall course activity. We choose to use the total number of active days, denoted by ad, recorded per student in a course as our temporal measure. An active day in this context is defined as a day where a student has interacted with course content beyond viewing a page, and is one of the measures that does […]
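As an illustration of how the active-days measure might be derived from an interaction log (the event-log schema below, with columns student_id, timestamp and event_type, is hypothetical and not from the source):

```python
import pandas as pd

# Hypothetical event log: one row per student interaction.
# An "active day" counts only days with interaction beyond viewing a page.
events = pd.DataFrame({
    "student_id": ["s1", "s1", "s1", "s2"],
    "timestamp": pd.to_datetime(
        ["2020-03-01 10:00", "2020-03-01 14:00", "2020-03-02 09:00", "2020-03-01 11:00"]),
    "event_type": ["post", "page_view", "quiz_attempt", "page_view"],
})

# Exclude passive page views, then count distinct interaction days per student.
active = events[events["event_type"] != "page_view"].copy()
active["day"] = active["timestamp"].dt.date
ad = active.groupby("student_id")["day"].nunique().rename("ad")
print(ad)  # s1 has 2 active days; s2 only viewed pages, so is absent
```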

Academic performance, ap, is used both as self-evaluation for students and as a label. In the dataset, ap is originally provided as raw floating-point values. For our purpose, we convert these to an adapted grade scheme reflecting grade milestones at The University of Adelaide [32]: High Distinction (HD) is between 85% and 100%; Distinction (D) is between 75% and 84%; Credit (C) is between 65% and 74%; Pass (P) is between 50% and 64%; and Fail (F) is 49% and under.
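A minimal sketch of this conversion (the cutoffs follow the scheme above; the function name is ours):

```python
def to_grade(score: float) -> str:
    """Map a raw percentage score to the adapted University of Adelaide grade scheme."""
    if score >= 85: return "HD"   # High Distinction: 85-100%
    if score >= 75: return "D"    # Distinction: 75-84%
    if score >= 65: return "C"    # Credit: 65-74%
    if score >= 50: return "P"    # Pass: 50-64%
    return "F"                    # Fail: 49% and under

assert [to_grade(x) for x in (92.5, 80.0, 70.0, 55.0, 30.0)] == ["HD", "D", "C", "P", "F"]
```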

In addition to these engagement and semantic measures, we extract a set of sentiment and stress features denoted by sen and str, respectively. These represent measures of sentiment (how positive or negative a person feels about the topic they are discussing) and stress (how stressed a person feels about the topic they are discussing) among student posts. These measures are often used separately in social media environments [21], [33], [34], but rather than representing them as single spectrum values, here each feature is made up of a score describing the intensity of either end of the relax/stress or positive/negative sentiment spectrums. Using two separate measures rather than a single binary spectrum is valuable as it allows for an element of nuance in sentiment measuring, or 'mixed feelings' from users. Additionally, we believe that the opposite of stress is not necessarily 'relaxation', but rather a related measure, as not all stress is inherently bad. For example, a student with high stress and positive sentiment scores may be more accurately described as 'excited' about something, compared to a student with high stress and negative sentiment scores.
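To illustrate the dual-spectrum idea (the tuple layout and the labels 'excited'/'distressed' are our own illustration; SentiStrength and TensiStrength both emit paired positive/negative-style scores of this general shape):

```python
# SentiStrength-style sentiment: (positive 1..5, negative -1..-5);
# TensiStrength-style stress: (relaxation 1..5, stress -1..-5).
post_a = {"sen": (4, -1), "str": (1, -4)}  # positive sentiment + high stress
post_b = {"sen": (1, -4), "str": (1, -4)}  # negative sentiment + high stress

def describe(post: dict) -> str:
    """Combine the two spectrums instead of collapsing them to one value."""
    pos, neg = post["sen"]
    relax, stress = post["str"]
    if stress <= -3:  # strongly stressed either way
        return "excited" if pos >= -neg else "distressed"
    return "neutral/relaxed"

print(describe(post_a), describe(post_b))  # excited distressed
```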

TensiStrength uses a lexical approach with manually derived lists of terms related to stress and relaxation [7]. The approach ranks terms numerically based on their contextual use, for example, as responses toward situations or states. Differentiating between 'good stress' and 'bad stress' is a valuable addition to our user behavior model, as it allows mental-state measures to be described in terms of how they impact academic performance as well as student behavior as a whole. These features are summarised in Table 2. […] We exclude students who have a value of p < 2 and a ratio of r < 0.1 (Eq. 3) to refine the dataset to active students.
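A sketch of this filtering step, assuming a per-student dataframe with hypothetical columns p (post count) and r (the Eq. 3 posting ratio); whether a student must fail both conditions or either one to be excluded is not explicit in the text, so the sketch keeps only students who satisfy both thresholds:

```python
import pandas as pd

students = pd.DataFrame({
    "student_id": ["s1", "s2", "s3"],
    "p": [5, 1, 12],        # number of posts per student
    "r": [0.3, 0.5, 0.05],  # posting ratio from Eq. 3
})

# Keep only active students: at least 2 posts and a ratio of at least 0.1.
active_students = students[(students["p"] >= 2) & (students["r"] >= 0.1)]
print(active_students)  # only s1 survives: s2 fails on p, s3 fails on r
```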
To construct f(s, cd), we utilise a modification of BERT [35] optimised for NLP transformations on sentences, Sentence-BERT [36]. Standard BERT maps sentences to a vector space but has limitations with common similarity measures; Sentence-BERT overcomes this using a Siamese/triplet network architecture, which improves processing efficiency over large sets of sentences. We use it to convert the post content string (s) and the course description string (cd), sourced from each course's 'about' page, into semantic sentence embedding values (s_v and cd_v respectively), while also converting s into a semantically structured data item. The distance value is calculated by comparing the course description value with each post and computing the cosine distance between them using PyTorch's formula [37], outlined in Eq. 5. The process of creating the variables for the proposed model is outlined in Algorithm 1. The semantic distance score typically ranges between 0 (not semantically similar to the corpus) and 1 (semantically very similar or the same as the corpus). Negative scores in the context of Sentence-BERT are inferred to indicate posts that not only have very little in common with the overall course topic, but also add little-to-no value to the discussion forum. Posts with semantic similarity scores of ot < 0.02 were removed, as were short posts (10 words or fewer), to filter out the 'noise' of introductory or meaningless posts.
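A minimal sketch of this scoring step using the sentence-transformers library (the pretrained checkpoint and the example strings are our assumptions; the paper does not specify which model was used):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint, not from the paper

course_description = "An introduction to data analytics and machine learning."  # cd
posts = [
    "How do I tune hyperparameters for the week 3 regression task please?",
    "Hi everyone, happy to be here!",
]

# Embed the course description and each post (s_v, cd_v in the paper's notation).
cd_v = model.encode(course_description, convert_to_tensor=True)
s_v = model.encode(posts, convert_to_tensor=True)

# Cosine similarity between each post embedding and the course description
# (computed on the PyTorch backend, as in Eq. 5).
ot = util.cos_sim(s_v, cd_v).squeeze(1)

# Filter: drop off-topic posts (ot < 0.02) and short posts (10 words or fewer).
kept = [p for p, score in zip(posts, ot.tolist())
        if score >= 0.02 and len(p.split()) > 10]
print(kept)  # the short greeting is filtered out as forum 'noise'
```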

To calculate sentiment and stress, the 'BERT-ified' text content, s, is used. Sentiment scores are calculated using the SentiStrength library [38], which has a proven record for providing insight into users' short informal texts [10], [39]. This treats the 'BERT-ified' post content string s as the input and returns sentiment feature values, which we manually add to our dataset. We calculate stress scores based on the body text of each student's post using the TensiStrength library [7]. This is represented in Eq. 6.

[…]

This combined method follows ensemble machine learning principles of using stacked and voting methods [40]. A stacking method is an aggregate of our models' predictions, taking the best results for features across our models [41]. This allows the strengths of each model to contribute to our prediction service. Comparatively, the voting method uses a 'majority rule' decision for our predictions, not unlike our Random Forests model but applied across several models, generating results from the combined decisions of all of our outlined models.
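A compact sketch of the two ensembles with scikit-learn (the hyperparameters are our assumptions, and the MLP stands in for the paper's Keras deep learning model):

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# The three baselines named in the paper; the MLP is a stand-in for the deep model.
baselines = [
    ("nb", GaussianNB()),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("dl", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)),
]

# Stacked method: a meta-learner aggregates the base models' predictions.
stacked = StackingClassifier(estimators=baselines, cv=5)

# Voting method: 'majority rule' over the base models' predicted labels.
voting = VotingClassifier(estimators=baselines, voting="hard")

# Both expose fit/predict: e.g. stacked.fit(X_train, y_train).predict(X_test),
# where X holds the feature vectors and y the ap grade labels.
```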

V. EXPERIMENTS
The aim of our experiment is to determine any significant relationship between student posting behavior on discussion forums and final grade, ap. From this, we can determine whether student behavior exhibits measurable, predictable patterns that lead to a particular grade. We use ap as the label for our data model and the remaining features for prediction. Our testing/training split is 30/70% respectively, and we incorporate k-fold cross-validation as described in Algorithm 3, with k = 5, to mitigate the risk of an unbalanced dataset and to investigate performance stability. Experiments were conducted using Python in a Jupyter Notebook environment on a remote server owned by The University of Adelaide. We utilised the Python libraries PyTorch, Keras/TensorFlow, scikit-learn and Sentence-BERT as described in previous sections. Our performance measurement uses the industry-standard metrics accuracy, precision, recall and F1, generated with 5-fold cross-validation.
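A sketch of this evaluation loop (the synthetic data and the stratified split are our assumptions; any of the baseline or ensemble models above could be passed in place of the Random Forest):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate, train_test_split

# Placeholder data standing in for the extracted features and ap grade labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))  # e.g. columns for ad, p, r, ot, sen, str
y = rng.choice(["HD", "D", "C", "P", "F"], size=500)

# 70/30 train/test split as described above (stratification is our assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# 5-fold cross-validation over accuracy, precision, recall and F1.
scores = cross_validate(
    RandomForestClassifier(random_state=0), X_train, y_train, cv=5,
    scoring=["accuracy", "precision_macro", "recall_macro", "f1_macro"])
for metric, values in scores.items():
    if metric.startswith("test_"):
        print(metric, round(values.mean(), 3))
```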

A. RESULTS AND ANALYSIS
From initial experimentation, we observed that whether a […] TensiStrength is used to extract stress from the dataset. Table 5 shows the values for positive and negative sen and str, with a slightly greater degree of stress, −str, for verified students in both years.

This model data shows a greater number of total posts in 2020 compared to 2019. Average ad reduced for all students in 2020, but r increased significantly. Average overall p is similar between 2019 and 2020, with verified student post numbers reducing slightly in 2020. We validate the ap scores using the features through our modelling layer processes, with performance measures for the baseline models outlined in Tables 6, 7 and 8.

Results show high performance measures of > 0.8 for all auditing student cohorts, while verified cohorts have mixed results. Random Forests performed the best of the baseline models, with all performance metrics reaching approximately 0.5, save for an F1 score for verified students of 0.4651.

In Tables 9 and 10, the results of the voting and stacked methods are compared. Looking purely at verified students, the baseline models generally outperformed the ensemble method in accuracy; however, the voting method had higher F1 values, with 0.6781 for 2019 verified students and 0.502 for 2020 verified students, exceeding both the baseline models and the stacked method. The voting method was able to more accurately return ap values for students.

Of the two, the voting method achieved the best performance metrics across the board, out-performing most […]

VI. DISCUSSION

As highlighted in Table 3, in 2020 the active days ad for all students decreased noticeably, while posting ratios r for all students increased substantially. While students were spending less time on the courses, their time on the forums was up on the previous year as they showed more productive behaviors. However, on-topic scores ot were more or less the same as the previous year's averages, so students were not exhibiting more relevant posting behavior in spite of the higher ratios. Results in Table 3 showed that the total number of students in 2020 within our refined dataset was closer to the original number, with 4098 retained of the original 4258 after processing. Far fewer students from 2019 'survived' the pre-processing compared to their original count, with 2269 of 4553, meaning over half were lost. The implication for our ensemble method is that a greater proportion of 2020 students were engaging in a more meaningful way compared to the 2019 cohort. This might suggest that the pandemic gave verified students more incentive or time to participate in 2020 than the group from 2019.

The differences in activity between auditing and verified students were also captured in Table 3. Verified students tended to produce more posts on the forum and have more active days in the courses compared to auditing students, whose overall ad was particularly down in 2020 from 2019. This was particularly evident in Table 4, where there were significant differences in the active days of auditing students between the two years, meaning the 2020 group was comparatively less engaged and the cohort as a whole was clearly affected by something. This could indicate a difference in priorities during the pandemic for the 2020 cohort, with auditing students unable to engage with online studies due to life circumstances. Verified students' ad was also down in 2020 compared to the previous year; however, their activity levels were still significant enough to indicate that verified students were far more invested in their outcomes. Table 4 also showed a significant p-value for all engagement measures for verified students, indicating that the 2020 verified students were behaving differently compared to the previous cohort. While verified students are generally expected to participate more in a course by virtue of paying, it was clear that there was far more engagement in the 2020 cohort as a whole. These students may also have been driven to participate more by unseen factors. As one starting point to further understand this change, future work needs to investigate when active days ad occurred for each of these cohorts. […] their students that engaging with forums more frequently can have positive effects on their overall well-being. In terms of the ensemble method's potential as a real-time monitoring tool, the use of TensiStrength demonstrates that there is value in detecting stress in conjunction with other categorical features. This can give instructors not only insight into student interaction, but also emotional data that can help in understanding the overall mood of a large cohort.

VII. CONCLUSION AND FUTURE WORK
This work developed an ensemble method for modelling student behavior using features of engagement, semantics and sentiment/stress extracted from a MOOC discussion forum dataset. Our objective was to observe the role of stress in academic performance, with a comparison between pre- and during-COVID cohorts as a secondary analysis. The results show that engagement had the most impact on student outcomes, with stress and sentiment rated the least important, even during the pandemic. Addressing the research questions posited in Section I: (1) stress had little impact on academic performance and ranked among the least important features in both years, and (2) stress did not increase during the pandemic, with results indicating its importance decreased compared to 2019. TensiStrength was used for more nuance in understanding stress, which may be useful for MOOC researchers who are improving the potential of real-time monitoring tools.

The work is limited by the selected data range. While we aimed to compare pre- and during-COVID behaviors, one year is perhaps inadequate to formulate an understanding of pre-pandemic behaviors. It was clear that students in 2020 were engaging more actively with the forums compared to the previous year, but whether this was due to the effects of the pandemic remains unknown. Additional analysis should expand the time range selection to make a comparison between yearly behaviors that would further contextualise the results of 2020. Future work should also utilise more granular analysis to model the behaviors of sets of students within the datasets for refined comparisons. An interesting future endeavour may be to identify a set of students who are represented longitudinally across the course and model their student journey over pre- and during-COVID years. As indicated in our Discussion, a more longitudinal, granular analysis that uses the same set of students would provide more contextualised and meaningful insight into the impact of stress and generate a clearer comparison. Nonetheless, our approach of separating sentiment and stress into distinctive features makes a contribution to textual classification studies.

[…] Wuhan University of Technology. She has more than 100 publications, including those at AAAI, ICDM, ICMR, CIKM, TOIS, and TOIT, which have received more than 1300 citations on Google Scholar. Her research interests include information retrieval, recommender systems, data mining, multimedia computing, and natural language processing.

CHRISTOPHER DANN is an inclusive, goal-oriented leader whose purpose is to make a positive impact on the educational experiences of learners ''glocally.'' He is currently a Senior Lecturer with the School of Teacher Education, Curriculum and Pedagogy (Technologies), University of Southern Queensland. His current research explores the possible impact of machine learning and artificial intelligence on the teaching and learning process from the perspective of teachers and students across educational contexts.

YAN LI is currently a Professor in computer science with the School of Mathematics, Physics and Computing, University of Southern Queensland, Australia. Her research interests include artificial intelligence, big data analytics, signal and image processing, biomedical engineering, and computer networking technologies and security.