Journals & Magazines >IEEE Access >Volume: 13

Student Engagement Dataset (SED): An Online Learning Activity Dataset

Data extraction and processing from Moodle and grade databases. Semester II of 2018/2019 learning session data was extracted from the Universiti Malaya database. The sele...

Abstract:

Distance learning has become a popular educational medium, and the Internet has spread since the early 2000s. To leverage this phenomenon, learning analytics and data min...Show More

Metadata

Abstract:

Distance learning has become a popular educational medium, and the Internet has spread since the early 2000s. To leverage this phenomenon, learning analytics and data mining can provide insights into improving pedagogy and assessing student engagement. To this end, a student-centric dataset was constructed by extracting data from Universiti Malaya’s Moodle-based Virtual Learning Environment (VLE), which serves approximately 25,000 students annually. In this paper, we present the Student Engagement Dataset (SED). The dataset consists of 16,609 students and 2,407 courses. It contains information such as grades and daily logged online activities (approximately 12 million data points), including temporal data across four tables. The tables include student engagement features created by aggregating raw activity data. Here, we present the dataset’s properties and describe the data collection, selection, and processing steps. Correlation analysis of student engagement features showed a statistically significant but weak negative correlation between the number of courses, early morning logins, assignments, and top students’ performance. SED is expected to present new opportunities for researchers in the learning analytics domain.

Data extraction and processing from Moodle and grade databases. Semester II of 2018/2019 learning session data was extracted from the Universiti Malaya database. The sele...

Published in: IEEE Access ( Volume: 13)

Page(s): 23607 - 23617

Date of Publication: 17 January 2025

Electronic ISSN: 2169-3536

DOI: 10.1109/ACCESS.2025.3531102

Funding Agency:

Contents

SECTION I.

Introduction

Distance learning has become a popular educational medium with widespread Internet adoption since the early 2000s. Parallel improvements in video and audio recording capabilities and affordable cloud storage technologies have provided provisions for recording classes and online data storage. Finally, as classes could be recorded and stored online, a new form of distance learning emerged, loosely defined here as ‘e-learning’ or ‘Online Learning’.

As the COVID-19 pandemic entailed lockdowns, educational institutions have experienced periods of total online activity via their Learning Management Systems (LMSs) or Virtual Learning Environments (VLEs). The use of VLEs/LMSs has increased significantly compared to that in the pre-pandemic period [1]. There has been a case where approximately 67,000 students registered for an online learning environment during the initial period of lockdown in Germany [2]. Online courses during this period suffered from the lack of student engagement and decreased overall satisfaction [3]. Many methods such as online flipped classes were implemented to address these problems [4].

To our knowledge, there is no formal definition of Learning Analytics. However, [5] identified an agreement within the research community on what learning analytics focuses on. Learning analytics is used to ‘study the underlying relations between interactions and students’ academic performance or between participation levels and attrition rates in online courses in explaining learning behaviours, increasing student success, improvement and retention, and detecting at-risk students.’

‘Participation levels’ is synonymous with ‘Engagement Levels,’ the terminology used in this work. There is no formalised consensus on which exact data constitutes or models ‘engagement’ in online classes. Several data models have been used to build predictive performance models [5], [6], [7], [8], [9]. The same model does not predict success in online courses or teaching imparted via VLEs [5].

Engagement or participation encompasses much more than clickstream data (the content a student clicks on and when). It encompasses many different types of coursework submitted by a student, such as assignments, quizzes, participation in forums and discussion-like activities online. It also captures how quickly or how late students submitted an assignment, relative measures of activity (the frequency of their activity relative to other students in the same class or group) and the log of activities. The kind of agent with which the student interacts can also be considered a factor of engagement and is a performance indicator [5]. Thus, this study focuses on building a dataset that consolidates and represents such data in a statistically meaningful way.

To our knowledge, three primary public datasets are available in the Learning Analytics domain: The KDD Cup 2010 dataset, the Massive Open Online Courses Cube (MOOCCube) dataset, and the Open University Learning Analytics Dataset (OULAD) [10], [11], [12]. The KDD Cup 2010 dataset contains student interaction data for students studying in various schools from 2005 to 2007 [10]. It has a data model unique to two intelligent tutoring systems (ITS). The dataset recorded interactions between the student and ITS. No behavioural or demographic data were recorded. The ITS differs from many VLEs and online course LMSs, lacking features that enable communication between human agents, students, and students and teachers. Some VLEs or LMSs, such as Moodle, may have forums that enable this communication. The MOOCCube dataset contains data collected from the MOOC website XuetangX, which hosts online courses from many university branches in China. The dataset contains 700 courses, 100,000 concepts and 8 million students [11]. This dataset does not contain any demographic data or data on communication between human agents. The advantage of MOOC platforms is the large number of users, which allows visualisation, as in [13]. An example of an education dataset during the COVID-19 pandemic is [14], but it mainly captures web searches related to online learning.

The university from which the data in this work was mined is Universiti Malaya (UM), a public university in Malaysia that uses Moodle as its VLE. The VLE is officially known as SPeCTRUM and is accessible at https://spectrum.um.edu.my. Universiti Malaya has applied a two-semester system. Semester I typically starts in October and ends in January, whereas Semester II starts in February and ends in June, followed by a semester break for 11 to 12 weeks. Moodle is not used to actively teach material via online meetings during regular or lockdown periods, but rather,

hosts lecture notes, learning activities, and other essential learning resources,
occasionally hosts (or hyperlinks to) lecture videos,
assignment submission portal,
record student attendance,
conducts surveys/questionnaires on student satisfaction with the courses,
conducts discussions on topics,
hosts singular questions on topics, and
other miscellaneous activities include notifying students of any changes in plans within the course.

The above is a list of some of the features an instructor can take advantage of in the VLE, and is not bound to conduct surveys on student satisfaction. Although a grading function on Moodle is used to grade students partially, the official grades are stored in a different system, that is, non-LMS. Each course may contain one or more batches of students, which can be further divided into groups, with each group assigned to an instructor. Each instructor can freely upload lecture notes and occasionally post links to lecture videos for their group. Assignments were common to all groups within a course. The Moodle VLE has a forum for each group of students. In addition, instructors often open discussions on a particular topic. A discussion can be considered a collection of posts at timestamps. Anyone within a group can post a string of texts in a discussion. None of the activities bore any weight on the student’s grades.

Each course at Universiti Malaya runs for approximately 15 weeks per semester with a one-week break, typically after 7 or 8 weeks of lectures. A detailed breakdown of the academic sessions for the dataset published in this study is presented in Table 1. The University also adapts two types of assessments that would contribute to students’ final grades: the continuous assessment and the final examination - although weightage may vary between courses. A general scheme for marking and grading is presented in Table 2. Examples of continuous assessments include quizzes, assignments, and presentations, which were conducted from Week 1 to Week 14. The continuous assessment for each course differs and can consist of multiple assessment types. Final examinations are generally paper-based examinations held from week 16 onwards. The same course may or may not be offered every semester. The number of students per course also differed.

TABLE 1 Academic Session 2018/2019

TABLE 2 General Marking and Grading Scheme

The students enrolled at Universiti Malaya were informed about the Data Protection Policy and Personal Data Protection Act on Learning Analytics. This information is disseminated during registration each semester, where data are potentially used and shared for academic research purposes. This is as per Universiti Malaya Privacy Policy (https://um.edu.my/privacy-policy.html) and the Malaysian Act 709 on Personal Data Protection (https://www.pdp.gov.my/ppdpv1). Both policies govern all data collected and stored from all online services (including the Moodle-based LMS) provided by Universiti Malaya. The policies implement ethical considerations such as transparency, accountability and security. For example, students can update their personal data and withdraw their consent. Data collection purposes, usage, and security protocols are also specified. Different services may have different data owners and administrators. In the case of the Moodle-based LMS (or SPeCTRUM), the data is owned and administered by the Information Technology Department (JTM) alongside its partner, the Academic Strategic Planning Department (ASPD). All data used in this study were obtained by the authors with written consent from JTM and ASPD. The SPeCTRUM data privacy and retention policy, which highlights specific categories, can be accessed at https://spectrum.um.edu.my/admin/tool/dataprivacy/summary.php.

The data extraction and preprocessing methods are presented in the next section. Then, detailed properties of the data and brief early stages of statistical insights are provided. Finally, in the last section, we address some examples of potential usage of SED. This dataset is available at https://github.com/shahreeza-kassim/SED.

SECTION II.

Methods

This section describes the data preparation, validation, feature extraction, consolidation, and anonymisation phases. The overall data preparation for SED is shown in Fig. 1. The data collected in this study were restricted to the regular semester of the Semester II 2018/2019 academic session. Two types of raw data were extracted. The first was the student’s online learning behaviour data extracted using PostgreSQL from the Moodle platform. The second was the grade data extracted from a different database system as a CSV file. We aggregated these two data types into a student summary dataset. All implementations in this study, including the different stages of data processing in Fig. 1, were executed using Jupyter Notebook with Python libraries, such as NumPy and Pandas.

FIGURE 1.

Data extraction and processing from Moodle and grade databases. Semester II of 2018/2019 learning session data was extracted from the Universiti Malaya database. The selected data sessions were validated and processed, so features related only to student engagement were chosen.

Show All

A. Phase 1: Data Extraction

The goal was to formulate a data model representing student engagement during online course activities. Here, data relevant to student engagement are identified and termed ‘student engagement data’ or SED. More formally, we define SED as data on the following activities recorded on the university’s LMS platform: logins into the LMS platform, attendance sessions, assignments, questions, questionnaires, surveys, quizzes, and forum posts. Messages were not considered SED owing to the private nature of the data. The Moodle databases were organised by session. We extracted data from the Semester II 2018/19 (pre-COVID) session. Each session database contains 435 tables, but here, only the log data of students are extracted, as other tables mainly consist of data unrelated to student engagement, such as web page configuration. We also extracted the corresponding grade data table from a different database. The grade data extracted here are the continuous assessment and final examination results. The unprocessed data are referred to as Student_log and Student_grade data. Table 2 presents the general marking and grading schemes.

The main criterion for data selection is the relatedness of the data to student engagement. Therefore, our data selection considered similar data in previous studies of [15] and [16], such as the number of views per course and assignments. We introduced new features not seen in other Moodle-related studies, such as the faculty to which a course belongs (Student_grade data) and the time an event is created (Student_log data).

B. Phase 2: Data Cleaning and Anonymisation

Data cleaning was performed using both the Student_log and Student_grade data, where data was mainly checked for missing values, data inconsistency, and noisy data. Data were filtered by date from February 18, 2019, to June 30, 2019. The dates reflect the start and end of Semester II 2018/2019 (Week 1 to Week 19). For Student_log data, information unrelated to student engagement, such as static and configuration data, was dropped. Non-student data (e.g., faculty members, coordinators, and guest lecturers) and students with zero courses were removed during this phase. The minimum number of students per course and the minimum number of courses a student took were set to one. There are no restrictions on

whether the types of courses are social science or sciences,
undergraduate or postgraduate students/courses,
gender, race, religion or nationalities (including local and international students),
Status of a student enrolled in a course. An example is by year (first-year or final-year student). Status includes whether a student is taking the course for the first time or is retaking the course (usually due to past failures),
the minimum or maximum number of students per course, and
the minimum or maximum number of courses per student.

For the Student_grade data, missing values were removed first. Marks outside the range values listed in Table 2 were then rescaled based on the grades they were awarded. The corresponding middle-range grades replace the outlier marks. For example, the original mark of 150 was replaced with 73 for the B+ grade. The replacement occurs when a course instructor sets a different marking range rather than following the general scheme.

The dataset anonymisation process implemented in this study aligns with the Universiti Malaya Privacy Policy, inclusive of all ethical considerations. De-identification of personal information was done by removing private information for students, such as email addresses, names, departments, faculties, forum posts, and private messages. Consequently, each student is given a randomised unique user ID, termed userid, throughout this paper. Course names were removed, and course codes were replaced with a randomised course ID, termed courseid. We decided to omit course names or titles to avoid potential quasi-identifying attributes in our dataset. The cleaned and anonymised versions of Student_log and Student_grade data were saved in CSV format as Student_log.csv and Student_grade_detailed.csv, respectively. Other quasi-identifying attributes such as age, gender and the state or country of origin of students were also removed to avoid identifying individuals through public data.

C. Phase 3: Data Aggregation and Merge

Student_log data consists of time-dependent information, such as the login time of a student or the timestamp when a student accesses or takes actions regarding specific resources on the LMS. Here, we aggregated the time data according to different ‘time periods’ of the day rather than simply deriving the total number of logins per student. We also aggregated the total number of student activities. The results are derivative features, as shown in columns 5-12 (aggregated time periods) and 13-20 (aggregated student activities) of Section III Data Records, under D Table Student_activity_summary. The aggregated Student_log data was saved as Student_activity_summary.csv.

The total number of courses graded and their total marks were derived and saved as Student_grade_aggregated.csv. These data were also merged and added to Student_activity _summary.csv. The total mark data in the CSV file are normalised by the number of courses each student or user takes. Discrepancies exist between the courses and students in Student_log and Student_grade data. The discrepancies are caused by the fact that there are no indicators of student dropouts in the data extracted from Moodle. One student may have dropped out but was not omitted from Moodle. However, it will eventually be filtered out during the merger with the Student_grade data. This is because grade data is only possible if a student officially completes the course.

SECTION III.

Data Records and Analysis

In this section, we present the data records and a brief analysis of the four data frames presented in this study to demonstrate the types of insights that can be derived. The data used in this study is student-centric. After processing the raw data as described in the previous section, four tables or data frames are presented.

Table Student_activity_summary.csv
Table Student_grade_detailed.csv
Table Student_grade_aggregated.csv
Table Student_log.csv

All four tables are connected by an anonymised userid, where one userid represents one student. The four tables cover both performance and learning behaviour data. Column identifiers in CSV files are highlighted in italics. The first row of each CSV file represents column names. Each row corresponds to an individual student’s activities and/or grade information. All four tables were checked for missing values, inconsistencies, and duplicates.

A. Table Student_grade_Detail

The dataset contains student grade information and has 90,089 rows and five columns. Each column corresponds to

userid - the unique identifier for each student.
courseid(2407 unique courses) - the unique identifier for each course.
formatted agreed mark - the mark for each userid on a scale of 0-100 for each courseid.
actual grade - the grade category for each userid for each courseid.
faculty- the faculty to which a course belongs to.

From the table Student_grade_detailed, there are 16,352 students and 2407 courses across 21 faculties. Students can take more than one course. The term faculties here is loosely used as some courses are only categorised broadly as ‘undergraduate courses’ or ‘postgraduate courses,’ indicating that these courses were likely to be offered at a broader level rather than by each faculty. Fig. 2 shows the number of students and courses by each faculty. ‘University’ refers to common courses across all faculties, such as ‘Critical Thinking and Problem-Solving Skills,’ and the high number of students but relatively low number of such courses. Next to the ‘University’, the Faculty of Engineering has the highest number of enrolled students. However, the Faculty of Science offers the highest number of courses.

FIGURE 2.

The number of students enrolled and the number of courses offered by 21 different categories or faculties. Data is from Student_grade_detailed.csv.

Show All

The ‘University’ courses can also be attributed to the high number of students receiving a median mark of 75.0, as indicated in Fig. 3. Such courses have relatively lower complexity than other faculty cores or specialisation courses.

FIGURE 3.

The average student marks (Upper) and corresponding grades (Lower) were distributed across all courses they took. The marks here are normalised by the number of courses taken per student. The median mark is 75.0, the mean mark is 72.2, and the standard deviation is 16.0. Grades other than those listed in Table 2 are variations of Grade F. Data is from Student_grade_detailed.csv.

Show All

Fig. 3 also shows the grade distribution, where most students received grades A- to A. The standard grading system is presented in Table 2. The grades not listed in Table 2 but shown in Fig. 3, such as ‘W2,’ are used for different categorisations of grade F. An example is the categorisation of students dropping out of courses or semesters.

B. Table Student_grade_Aggregated

The dataset contains aggregated student grade information, consisting of 16,909 rows and three columns. Each column corresponds to

userid - the unique identifier for each student.
number_of_courses - the total number of courses completed by each student.
total_marks - the sum of marks for all courses completed by each student.

Fig. 4 shows the number of courses per student extracted from the table. Most students took six to seven courses across all faculties. 1277 students took one course, and 63 took ten or more courses. It is plausible that a student took only one course, for example, the final-year students who only needed to complete a single course to graduate.

FIGURE 4.

Distribution of number of courses taken per student across all courses. Data is from Student_grade_aggregated.csv.

Show All

C. Table Student_log

This dataset contains information on students’ activities or engagements. It comprises 12,139,424 rows and six columns and is the most extensive file. Each column corresponds to

userid - the unique identifier for each student.
component (36 unique components) - Moodle’s user interface components.
action (37 unique actions) - the actions taken by a user on a target.
target (74 unique targets) - the target component of an action by a student.
courseid (2826 unique courses) - the course id resources a user accesses.
timecreated - the time when a user accessed a particular resource.
contextid– the unique identifier for a context in terms of hierarchy (course, activity)
contextinstanceid– the unique identifier for an instance of a context.

This table captures granular data on students’ learning activities. Specifically, which component of the LMS (per course) does the student ‘click’ and when did they access it. It also contains student login information. Each course has multiple resources that can be viewed through the components, actions, and target attributes. For example, the combination in the second row of Fig. 5 can be interpreted as “On April 30 2019, at 4.57 AM, userid 13473 created a discussion in the forum of courseid 4000 with contextid of 255015 and contextinstanceid of 119878”.

FIGURE 5.

The relationship between component, action and target. The data shown is a snippet from Student_log.csv.

Show All

While contextid captures the hierarchy-level context (system>course>content), contextinstanceid points to the specific instantiation of the context. For example, contextid=99 might refer to quiz for a course. But contextinstanceid=12 refers to a specific quiz that has been created.

Some resources are accessed more often than others, as shown in Fig. 6 and 7. To improve readability, the vertical axes of these two figures were purposely scaled logarithmically, and only items that had been accessed 1000 times at a minimum are shown.

FIGURE 6.

The number of accesses/clicks by each component. The vertical axis is shown in the log scale as the number of accesses/clicks by each component varies. Data is from Student_log.csv.

Show All

FIGURE 7.

The number of accesses/clicks by each action. The vertical axis is shown in the log scale as the number of accesses/clicks by each component varies. Data is from Student_log.csv.

Show All

The temporal data can also be extracted, as shown in Fig. 8. A period of low activity can be seen in the middle of the time series, most because of the semester break. Note that the absence of a data point indicates that the student did not access the LMS on that day.

FIGURE 8.

The number of daily accesses across all components of courseid 1248 by two randomly sampled students. Data is from Student_log.csv.

Show All

D. Table Student_Activity_Summary

This dataset contains the aggregated data from Student_log.csv, which is then merged and consolidated with data from Student_grade_aggregated.csv. As a result, some of the features are derivative of other features. It consists of 16,909 rows and 20 columns. Each column corresponds to

userid - the unique identifier for each student.
no_of_courses - the total number of courses taken by each student.
average_marks - the marks are normalised by the number of courses each student takes.
total_login- The total number of LMS logins by each student.
weekend_login - the number of LMS logins each student makes during the weekend.
weekday_login -the total number of LMS logins during the weekday by each student.
midnight _login - the total number of LMS logins from 12 AM to 4 AM for each student.
early_morning_login - the total number of LMS logins from 4 AM to 8 AM for each student.
late_morning_login- the total number of LMS logins from 8 AM to12 PM for each student.
afternoon_login - the total number of LMS logins from 1PM to 4 PM for each student.
evening_login - the total number of LMS logins from 4PM to 8 PM for each student.
night_login- the total number of LMS logins from 8 PM to 12 AM for each student.
no_of _viewed_courses- the total number of views across each student’s courses.
no_of_attendance_taken - the total number of attendances taken in LMS by each student.
no_of_all_files_downloaded - the total number of files downloaded from LMS by each student.
no_of_assignments- the total number of assignments submitted by each student across all courses.
no_of_forum_created - the total number of forums each student participated in. This includes starting a topic, continuing, or replying to a topic/thread.
no_of_quizzes - the total number of quizzes across all courses each student has.
no_of_quizzes_completed - the total number of quizzes each student completes across all courses.
no_of_quizzes_attempt- The total number of quizzes attempted by each student across all courses.

Except for userid and number_of_courses, all the other columns were normalised by the number of courses each student took. This normalisation aims to enable a fair comparison of engagement levels between students. Without this normalisation, students who take more courses would naturally have higher engagements/access to the LMS and higher total_ marks. This would reduce the impact on students who took fewer courses but were relatively active.

Forum-related data is only 0.1% of the Student_log data and is relatively small. However, when used as an engagement feature, we have included no_of_forum_created to demonstrate the properties and behaviour of forum-related data. The content of the forum posts is not included in this study as it requires further preprocessing of texts to avoid privacy issues. Standardisation of English is needed as posts are created using a mixture of languages. The privacy of the posts must also be considered, as there can be posts which contain personally identifiable data such as a student’s name or matriculation number.

The attributes or features in this table are focused on student engagement or activities but without temporal information. Aggregation was performed through the summation of activities in Student_log. Activities are identified through the component, action, and target columns of the Student_log table. The feature average_mark in this table depends on the combination of courses each student takes. For example, a student might take a combination of easy courses and score a high average mark. The only possible indicator of course complexity is the type of faculty offering the course (see Student_grade_detailed.csv). Feature weekend_login to night_loginare time-based attributes aggregated from the Student_log timecreated variable. Other researchers can use the datasets presented here and aggregate their features, such as the number of assignments viewed.

Aggregating login data into different time period helps to smoothen the data. In this study, a 4-hour interval is chosen as the time window. However, aggregation also comes with the risk of information loss. For example, patterns such as students tend to login more from 12 AM to 1 AM than 3 AM to 4 AM for the midnight_login session would not be captured by aggregation. Users of SED are advised to consider the size and phase of the time window when aggregating data.

The summary statistics for the Student_activity_summary data (except for the feature userid) are shown in Table 3. All features have a minimum value of 0 or 1. Note that the median average_marks from Table 3 is lower than the median of total_marks in Fig. 3. These two are not of the same distribution where the former is the normalised version of the latter. no_of_assignments has a low median value of 0.75. This does not represent the average number of assignments for each student but rather an average ratio due to normalisation. However, it suggests that some courses do not have assignments, contributing to the low ratio.

TABLE 3 Summary Statistics for Engagement Data

Owing to the large sample size, normality was checked for selected features using the Quantile-Quantile (QQ) plot, as shown in Fig. 9. For most feature distributions, many data points deviated from the linear line in red. The effect is more profound for data points at the distribution’s right tail, indicating extreme outliers. This is evidence of non-normality and shows that except for average_marks, other feature distributions are positively skewed. This argument is further supported by the data in Table 3, in which each corresponding distribution has a mean value higher than the median.

FIGURE 9.

QQ plot for selected engagement feature distributions. Data is from Student_activity_summary.csv.

Show All

Here, we visualise the distribution of the time-based aggregated features using boxplots, as shown in Fig. 10. For readability and comparison purposes, outliers were excluded in Fig. 10 by the Interquartile Range (IQR) method where data points that are below Q1 - 1.5IQR or above Q3 + 1.5IQR were removed.

FIGURE 10.

Box plot for the number of average logins according to the time of the day. The number of logins was normalised by the courses each student took.

Show All

The figure shows that when using the median as the measure of central tendency, students mainly logged in to the LMS in descending order of midnight, afternoon, and early morning. The afternoon logins can be considered as students mainly accessing the LMS during class hours. A simple inference from the combination of midnight and early morning logins indicates that students will likely be active throughout the night, possibly during days without classes.

SECTION IV.

Correlation Analysis

In the previous section, we presented four tables or data frames to describe students’ online learning activities and academic results. We also characterised student engagement behaviour through data aggregation in one of the four tables. The result is a dataset of engagement features, as listed in Table 3. As shown in Fig. 9, most features do not approximate the normal distribution. Therefore, Spearman’s correlation is used to identify whether a monotonic relation exists between each student’s engagement features and performance. The variable average_marks represents students’ performance.

We computed the Spearman correlation coefficient $\rho$ and corresponding P-values for two cases with a significance threshold of $\alpha = 0.05$ . The first case used 16,909 observations from Student_activity_summary. The second is selecting only students who scored ≥ 90 as their average marks. These students are considered ‘Top Students’ with a sample size of N=172. The former case yielded a Spearman correlation of $\rho$ and P-values less than −0.1 and 0.05 for all engagement features. This shows a weak to negligible negative correlation between all features and the student’s average mark. Low and stabilised P-values are expected, as even the most minor observations detected are considered statistically significant in a large sample size. This leads to the conclusion that, although the correlations between engagement features and average_marks are weak to negligible, they are statistically significant.

Table 4 presents the Spearman correlation results computed for the second case. Similar to the first case, almost all features have weak to negligible negative correlations.

TABLE 4 Spearman Correlation Between Engagement Features and Average Marks of Top Students (n=172)

However, the variable number_of_courses shows a slightly stronger correlation of −0.22. Furthermore, only correlations between number_of_courses, early_morning_login, no_of_assignments, and average_marks are considered statistically significant as these features are the only ones with P-values below the significance level of 0.05.

The outcome indicates that for a sample population of top students, students taking fewer courses log less in the early morning and do fewer assignments, which correlates with higher marks. This may suggest that quality of engagement rather than quantity matters.

While the weak negative correlations of the discussed features are statistically significant (P < 0.05), the practical significance is limited. Statistical significance implies that the observed relationship is unlikely to be due to random chance in the given sample. However, our weak correlations of at most −0.22 indicate that changes in engagement features explain a tiny proportion of the variation in student performance. The normal convention assumes a correlation coefficient equal to or more than |0.5| as a strong correlation. However, policymakers should carefully define and consider the suitable correlation strength threshold (such as setting a high threshold of 0.7) in their decision-making process.

Our analysis so far is on how each student’s engagement correlates with their average marks. However, it is acknowledged that students engage differently between courses. This can lead to the ‘portability of model’ problem when doing predictive modelling [17]. In SED, there are courses where students participate actively and courses where students are less engaged. We briefly analysed one engagement feature, the number of views by courseid using SED. The result is a highly skewed distribution with a median of 635 views per course and a standard deviation of 6236.17. A considerable number of courses have over a total of more than 10,000 views. No figures are shown here for brevity purposes. Do courses with a high number of student engagements result in a higher average mark? We normalised the number of views each course received by dividing it by the number of students per course. We then computed the Spearman correlation between this normalised number of views and each student’s average marks per course. The result shows a Spearman correlation coefficient of −0.168 with a P value approaching zero. This again suggests a weak and negligible negative correlation but is statistically significant. Further results of our analysis will be reserved for future work. Nevertheless, here, we have demonstrated how SED can be further analysed on courseid level where the data is course-centric rather than student-centric.

SECTION V.

Usage Notes, Limitations, and Conclusion

The dataset presented here describes students’ online activities via LMS and the corresponding results regarding marks and grades. One of the datasets was created to represent students’ engagement features, but the raw data on interactions between students and the online learning platform were also made available. The data are kept as raw but clean as possible; however, as the data records and analysis have shown, outliers or zero data are inevitable. From the engagement features of Student_activity_summary, some features have a deluge of data, and some have almost no data (for example, number_of_forum_created). This indicates that other than viewing the courses, downloading the materials, submitting assignments and answering quizzes, most LMS components, such as chat or forums, are underutilised.

SED is mainly limited by the demographic of the student population. According to Universiti Malaya 2018 annual report [18], the students enrolled were mainly Malaysians at 87.8% of the total student population. Note that SED has only managed to capture data on approximately half of the reported 30,593 students enrolled. The percentage of undergraduate and postgraduate students are relatively balanced at 56.6% and 43.4% respectively. The report and Moodle LMS system investigated provided no information on student’s ethnicity, religion, gender, age group, or social economy background. There is a possibility that predictive models trained on SED would not perform well on a dataset with a different demographic background.

Many studies have been conducted on Moodle-based data, primarily on developing predictive models to predict student performance. The researcher used Petri-Net based on the correlation strength between the engagement features and the target variable [19]. Others have focused on the Machine Learning approach to model the relationship between students’ interaction with the LMS and their performance [15], [16], [17] and [20], [21], [22]. The reported prediction of Machine Learning models in these studies varies, with some models achieving above 85% accuracy ([20], [21], and [22]), whereas others only achieved approximately 60% accuracy [8]. However, these models have not been evaluated using a common dataset. The educational institution providing the data used in these studies, and in most cases, where the researcher belongs, has restricted access. Many of the datasets used were limited by the number of students sampled (ranging from approximately 50 to 3,000), the number of courses (1 to 24), the level of courses (high school to graduate studies) and the nature of the courses (mainly science). The datasets used in [15], [16], [17] and [20], [21], [22] also differed in their sampling duration, ranging from one semester to two years.

SED intends to address these limitations by providing a dataset of 16,909 students across 2,407 courses, including graduate to postgraduate studies and covers both science and social studies. SED focuses on only one semester of data to ensure that the population profile and demographics remain the same. Furthermore, SED provides the summation of time-based and raw temporal data regarding when a student accesses a component on the LMS.

Future works in learning analytics and predictive modelling can leverage the newly published SED on students’ learning behaviour to enhance educational outcomes. Researchers can build better predictive models of students’ performance for early detection of at-risk students or understanding procrastination behaviour (such as [21]). SED can also be used to characterise the engagement profile of high-performing students and the properties of courses that induce high student participation, as briefly demonstrated in this study. The result of these studies can then help educators, policymakers, and educational institutions introduce early intervention programs, design better courses, and improve the personalisation of education.

ACKNOWLEDGMENT

The authors thank Qazi Zarif Ul Islam for extracting, mining, and preprocessing raw data from SQL dump files to CSV.

References is not available for this document.

Student Engagement Dataset (SED): An Online Learning Activity Dataset

Abstract:

Metadata

Abstract:

Funding Agency:

Introduction