Profile-Based Cluster Evolution Analysis: Identification of Migration Patterns for Understanding Student Learning Behavior

Educational process mining is a research domain that uses students’ learning behavior to match the courses students actually take against the designed curriculum. While most works deal with the case perspective (i.e., traces of the cases), the temporal case perspective has not been discussed. The temporal case perspective aims to understand the temporal patterns of cases (e.g., students’ learning behavior in a semester). This study proposes a modified cluster evolution analysis, called profile-based cluster evolution analysis, for students’ learning behavior based on profiles. The results show three salient features: (1) cluster generation; (2) within-cluster generation; and (3) time-based between-cluster generation. The cluster evolution phase modifies the existing cluster evolution analysis with a dynamic profiler. The model was tested on actual educational data of the Information System Department in Indonesia. The results showed the learning behavior of students who graduated on time, of students who graduated late, and of students who dropped out. Changes in students’ learning behavior were observed through the migration of students from cluster to cluster in each semester. Furthermore, there were distinct learning behavior migration patterns for each category of students based on their performance. The migration patterns can help academic stakeholders identify students who are likely to drop out, graduate on time, or graduate late. These results can be used as recommendations to academic stakeholders for curriculum assessment and development and dropout prevention.


I. INTRODUCTION
Educational process mining (EPM) is a research area that focuses on educational data modeling and analysis to represent students' learning behavior and match actual students' learning behavior with curriculum guidelines [1]. By examining students' learning behavior patterns, we can assess the curriculum and evaluate students' performance during the study period. Some works on EPM have been applied in academic institutions to predict students' dropout [2], to recommend a correct path to students [3], to identify, examine and facilitate the evaluation of learning processes [4]-[6], and to enhance the curriculum design [3], [7], [8]. While most existing works emphasize the process perspective and the activity perspective, no previous works attempt to apply the temporal case perspective (i.e., students' profiles) in accordance with trace (i.e., curriculum) alignment.
The associate editor coordinating the review of this manuscript and approving it for publication was Claudio Cusano.
In the educational domain, traces can be regarded as course sequences (i.e., curricula) taken by students. Concerning curriculum alignment, it is necessary to consider two curriculum types: noncentralized and centralized. A noncentralized curriculum is a guideline that lets students decide on courses independently, without strict boundaries such as course prerequisites. On the other hand, a centralized curriculum provides a prerequisite guideline (e.g., Calculus I should be taken before Calculus II) for students to meet the academic expectations set by stakeholders (i.e., government and/or policymakers). Many works on EPM consider the noncentralized curriculum, while works on EPM with a centralized curriculum face some challenges regarding the behavior and performance aspects. Concerning the behavior aspect, for example, a student can graduate before the required time if he/she manages to take courses ahead of schedule. Another challenge involves the performance requirement according to which a student needs to retake courses in later semesters, which may delay graduation. Most of the time, course managers must manually manage the courses to be opened each semester, taking into account both students' behavior and performance while attempting to achieve the target educational performance, such as a low dropout rate and a high percentage of students graduating on time.
Work in EPM has utilized process mining to identify a process model with limited analysis in profiling students' behavior. However, the process models of the EPM may be too large and complex for analysis by stakeholders. To analyze various traces (e.g., students' behaviors), there is an approach to group cases (i.e., instances) based on specific profiles called trace clustering. Originally, trace clustering grouped cases as a full trace without considering partial traces (i.e., time-based trace or temporal case). The result of trace clustering produces homogeneous sets of traces from a full event log and constructs simpler process models [9]- [12]. Some works have used clustering to group students based on specific indicators (i.e., performance and activity) to obtain more specific and accurate models of students' behavior [13], [14]. In the domain of EPM, the event log can be partially cut in accordance with the timestamp (i.e., period of study or semester). While work has been conducted on trace clustering in the domain of EPM [15], no studies have shown how temporal cases (i.e., students' profiles) changed over time. Most EPM emphasizes the discovered process model, i.e., the model representing students' behavior during the study period. However, the evolution of behavior and performance over time remains unexplored. In other words, there is a need to explore and analyze the behavior and performance of students for each semester (called a temporal case) to help stakeholders maintain education quality.
Some previous works have analyzed the evolution of time series data. Shao et al. proposed a new synchronization-based clustering approach for evolving data streams called SyncTree. SyncTree investigates the cluster structure of the data stream between any two timestamps in the past and provides a synchronized microcluster to analyze the cluster evolution [16]. Disegna et al. proposed a method to analyze cluster evolution using repeated cross-sectional ordinal data on tourism market segmentation that helps destination managers and municipalities describe and verify the efficacy of policy and strategies adopted over years without the need to rely on longitudinal surveys, which are often difficult to conduct [17]. Ramon-Gonen and Gelbard proposed the cluster evolution analysis (CEA) model, which addresses three phenomena likely to occur over time: (1) changes in the number of clusters; (2) changes in cluster characteristics; and (3) between-cluster migration of objects. Among the existing approaches, CEA by Ramon-Gonen and Gelbard is the appropriate method to identify repetition at various points in time and detect patterns that predict prospective loss of value as well as patterns that indicate the stability and preservation of value over time [18]. However, the existing CEA does not consider any mixed profiles during the cluster development stages. In this study, the mixed profiles are the course sequence alignment and the respective grade point average (GPA) per semester. The two profiles have different distributions and characteristics at a certain time. Understanding students' learning behavior in a timely manner requires analysis at a certain point in time (i.e., the semester) to understand the possible actions in the next timestamp (i.e., the next semester).
This study attempts to modify the existing CEA and develop a new method called profile-based cluster evolution analysis (P-CEA). P-CEA considers extracted profiles at a certain time and uses the profiles to understand subsequent learning behavior. In other words, the original profile indicates students' learning behavior regarding the course sequence, while the additional profile is extracted from the GPA in the respective semester. The P-CEA is expected to determine (1) the amount of data clustered per semester, (2) how cluster characteristics appear per semester, and (3) how cluster migration appears from semester to semester. This study is expected to provide an overview of the learning patterns of students who graduate on time or drop out.

II. RELATED WORK
This section addresses work on EPM and CEA in the domain of educational sectors, such as sequential-based curriculum assessment, trace-based clustering that clusters observation data based on the similarity of the sequences, and relevant works on CEA in various domains.

A. EDUCATIONAL PROCESS MINING
Process mining aims to gain knowledge from event log transactional data. The main tasks of process mining are threefold: 1) discovery, 2) conformance, and 3) enhancement. In discovery, the event log data are analyzed to discover a process model that best describes the event data. Many works have developed process discovery algorithms to address representational biases and specific problems in particular domains. Conformance checking replays the event log on top of the a priori model to quantify the fitness of the event log for the process model as well as to detect and quantify commonalities or deviations. Enhancement takes the event log and the a priori model to generate a new (enhanced) model. The last task and multiple perspectives from process data (e.g., organization, control-flow, performance, and resource perspective) can create a simulation model.
VOLUME 9, 2021
In the field of EPM, sequence matching alignment is a curriculum assessment technique [19]. The proposed technique aims to assess the conformity between student learning behavior and curriculum design and groups them into several categories, i.e., match, before, after, repetitive and elective. This research shows that only students' overall behavior conforms to the curriculum. Cairns et al. proposed methods to understand students' learning behavior. The methods aimed to discover, analyze, and provide a visual representation of the educational process. The methods used the dotted chart, process modeling, social network, and clustering plug-in on ProM. The results show students' learning behavior, but they cannot show the conformity of students' learning behavior with curriculum guidelines. The results can also show the correlations between students' learning behavior and students' performance [20].

B. TRACE-BASED CLUSTERING & CLUSTER EVOLUTION ANALYSIS
Trace clustering in process mining can be used for handling scattered and large event logs by dividing them into several groups of clusters with similar types (Song et al., 2013). Trace clustering differs from other process mining algorithms, which usually discover spaghetti-like process models that are difficult to analyze, especially from event logs of highly flexible environments [21], [22]. Trace clustering can be used to make the obtained model simpler by identifying homogeneous sets [9].
Trace-based clustering for educational data uses sequence matching alignment values to measure the conformity between students' learning behavior and curriculum guidelines. This technique helps to identify homogeneous sets of traces within a heterogeneous log and makes it easier to discover groups of students' learning behavior. Priyambada et al. proposed aggregate profile clustering to discover the characteristics of students' learning behavior and its relationship with students' performance [23]. Priyambada et al. also proposed segmented-trace clustering that clusters students' learning behavior per semester. The results showed that the students could be grouped into various clusters per semester that have different learning behavior and performance characteristics. However, the method cannot show the changes in students' learning behavior in a timely manner (e.g., from one semester to another semester) (Priyambada et al., 2018).
Ramon-Gonen and Gelbard proposed a model called cluster evolution analysis (CEA) that addresses three phenomena that are likely to occur over time: changes in the number of clusters, changes in cluster characteristics, and between-cluster migration of objects. The pattern of the sequence was obtained by using the generalized sequential pattern (GSP) algorithm, which is used to obtain frequent sequences from sequence data [24]. The results showed that the model could identify repeated clusters at various points in time and detect pattern migration [18]. This research modified the existing CEA model to identify changes in the learning behavior of students who graduate on time or drop out. This information can provide insights to academic stakeholders for curriculum assessment and development and dropout prevention.

III. METHODOLOGY
This section discusses the methodology of this study. The modified cluster evolution analysis model for the migration of students' learning behavior is presented in Figure 1. The flow of the analysis is as follows: Data Preprocessing, Profile Assessment, and Cluster Evolution Analysis. Data preprocessing transforms the raw data into event logs and then maps it into projected event logs. Profile assessment includes assessing the behavior log and performance log. Cluster Evolution Analysis involves Temporal Clustering, Cluster-to-State Labeling, Cluster Trajectory, Pattern Recognition, and Pattern Analysis. Each of the phases will be discussed in the next subsection.

A. DATA PREPROCESSING
Data preparation aims to obtain students' learning behavior from the institution database and prepare the data. Data preparation consists of data collection and sequence matching alignment.
Students' learning behavior can be captured from students' performance recorded in academic transcripts. In other words, all information related to students' academic results in the program is retrieved.
Definition 1: Event Log (L) An event log (L) is a multiset of traces. Each trace is mapped to a case and consists of a collection of events, which is defined as follows: $E$ is a set of events, and $E = CS \times T \times G$, where $CS$ is a set of subjects (courses), $T$ is a set of timestamps, and $G$ is a set of grades. The set of courses ($CS$), as mentioned above, is a collection of concatenations of course name and course ID. A case represents a student's study history. Every case has a trace, denoted as $\sigma \in E^*$. $C = E^*$ is the set of possible event sequences (traces describing a case). An example of an event log can be seen in Table 1.
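As an illustration of Definition 1, an event log can be sketched as a mapping from each case (student) to a trace of (course, timestamp, grade) events. This is a minimal sketch only: the names `Event` and `build_event_log` and the sample rows are hypothetical and are not taken from Table 1.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    course: str    # element of CS (hypothetical course IDs)
    semester: int  # element of T (coarse-grained timestamp)
    grade: float   # element of G, assumed here to be on a 0-4 scale


def build_event_log(rows):
    """Group (student, course, semester, grade) rows into one trace per case."""
    log = defaultdict(list)
    for student, course, semester, grade in rows:
        log[student].append(Event(course, semester, grade))
    # Order each trace by timestamp; events in the same semester keep
    # their input order because Python's sort is stable.
    return {s: sorted(trace, key=lambda e: e.semester) for s, trace in log.items()}


# Hypothetical sample data.
rows = [
    ("ST001", "CS02", 1, 3.0),
    ("ST001", "CS01", 1, 4.0),
    ("ST001", "CS03", 2, 2.0),
    ("ST002", "CS01", 1, 3.5),
]
event_log = build_event_log(rows)
```

Each value of `event_log` plays the role of a trace σ; the multiset of all such traces is the event log.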
A timestamp with coarse granularity (i.e., a unit of time such as a semester, quarter, or trimester) cannot order the events that fall within the same unit, unlike a timestamp with finer granularity such as hours or seconds. A data structure suited to such coarse-grained timestamps is therefore used to analyze the data.

B. PROFILE ASSESSMENT
Before proceeding with the clustering approach, it is necessary to build the features. This section discusses the feature engineering that turns the collected data into prepared data for cluster evolution. First, the event log is transformed into a projected event log. The projected event log is then aligned with the curriculum guidelines, which creates a set of sequence alignment features. This set of sequence alignment features is the input for cluster evolution.

Definition 2: Projected event log (PL)
A projected event log is a multiset of traces where events with the same timestamp are grouped into a set. A projected event ($E_t$) is defined as the collection of events that happened at the same timestamp $t$. Note that the events within a projected event have no inherent order. A projection function $f$ groups the events of each trace by timestamp; in this study, the events within each projected event are listed in lexicographic order. Hence, the projected event log from Table 1 can be seen in Table 2.

To align the students' data with the curriculum guidelines, we utilized sequence matching alignment [19] and produced sequence alignment features. The sequence alignment features contain the mixture profiles, namely the alignment with the curriculum guidelines and the grade point average per semester. For example, ST001 finished two courses (CS01 and CS02) in semester 1. Both courses aligned with the curriculum guidelines (refer to Table 3). Hence, the values of $d^1_1$, $d^1_2$, $d^1_3$, and $d^1_4$ are 2, 0, 0, and 0, respectively. In semester 3, ST001 took two courses (CS04 and CS05), while the curriculum guidelines prescribed CS05 and CS06. In other words, CS05 is aligned with the curriculum guidelines, and CS04 should have been taken before semester 3. Hence, the values of $d^3_1$, $d^3_2$, $d^3_3$, and $d^3_4$ are 1, 0, 1, and 0, respectively. The resulting sequence alignment features for the data in Table 2 can be seen in Table 4.

The grade point average per semester is denoted as follows:

$$g^i = \frac{\sum_{p} cr^i_p \, G^i_p}{cr^i}$$

where $cr^i_p$ is the credit of the $p$-th course in the $i$-th semester, $cr^i$ is the sum of the credits of all courses in the $i$-th semester, and $G^i_p$ is the grade of the $p$-th course in the $i$-th semester.
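The projection step and the credit-weighted semester GPA can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names `project` and `semester_gpa` are hypothetical, each event is assumed to carry its credit value as a fourth field (Definition 1 lists only course, timestamp, and grade), and the sample grades and credits are invented.

```python
from collections import defaultdict


def project(trace):
    """Group (course, semester, grade, credit) events by semester.

    Events inside one semester are sorted lexicographically by course ID.
    """
    by_semester = defaultdict(list)
    for course, semester, grade, credit in trace:
        by_semester[semester].append((course, grade, credit))
    return {sem: sorted(events) for sem, events in sorted(by_semester.items())}


def semester_gpa(events):
    """Credit-weighted mean grade of one projected event."""
    total_credit = sum(credit for _, _, credit in events)
    return sum(grade * credit for _, grade, credit in events) / total_credit


# Hypothetical trace: (course, semester, grade, credit).
trace = [("CS02", 1, 3.0, 2), ("CS01", 1, 4.0, 3), ("CS03", 2, 2.0, 3)]
projected = project(trace)
gpa_sem1 = semester_gpa(projected[1])  # (4.0 * 3 + 3.0 * 2) / 5
```

Sorting inside `project` reproduces the lexicographic ordering of each projected event, while the dictionary keys preserve the semester order.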
Profile assessment consists of behavior profiles and performance profiles. The behavior profile is the sequence alignment matching the behavior log and curriculum guidelines. The performance profile is the performance measure according to the grade and credit numbers of the student.
The mixture profile is denoted as follows:

$$D^i_m = \langle d^i_1, d^i_2, d^i_3, d^i_4, g^i \rangle$$

where $D^i_m$ is the set of student learning behaviors and the grade point average of the $m$-th student in the $i$-th semester. $d^i_l$ is a collection of alignment aggregations in the $i$-th semester with 4 categories, $l = 1, 2, 3, 4$, named match, after, before, and repetition, respectively. $g^i$ is the grade point average in the $i$-th semester. Suppose that ST001 has behavior profile $d^2 = \langle 2, 0, 0, 0 \rangle$ and performance profile $g^2 = 2.8$; the student's mixture profile is then $D^2_{ST001} = \langle 2, 0, 0, 0, 2.8 \rangle$.

C. PROFILE-BASED CLUSTER EVOLUTION ANALYSIS
The cluster evolution analysis (CEA) model addresses three phenomena likely to occur over time: (1) changes in the number of clusters; (2) changes in cluster characteristics; and (3) between-cluster migrations of objects. The results showed that the model was capable of identifying repeated clusters at various points in time and detecting patterns that predict prospective loss of value. The CEA model used in this research addresses three phenomena as well, identifying repeated clusters at each semester and detecting patterns that lead to the dropout of students. The flows of the P-CEA model for the migration of students' learning behavior are as follows: change in the number of clusters, changes in cluster characteristics, and between-cluster migration of objects.

1) TEMPORAL CLUSTERING
Period-based clustering adopts a general clustering method. Clustering groups objects in such a way that similar objects are placed in the same group and dissimilar objects are placed in different groups. In this research, student cluster analysis was used to determine the composition of student learning profiles. One of the popular approaches in clustering is k-means, a method that produces $k$ clusters, where $k$ is defined by the user. The grouping is selected by finding the smallest sum of squared errors (SSE) with respect to the cluster means.
The k-means algorithm is used to cluster the data per semester, and the elbow method is used to determine the number of clusters. The SSE for the $i$-th semester is denoted as follows:

$$SSE^i = \sum_{j=1}^{k} \sum_{D^i_m \in CL^i_j} \lVert D^i_m - \mu^i_j \rVert^2$$

where $CL^i_j$ is cluster $j$ in semester $i$ and $\mu^i_j$ is the mean point in $CL^i_j$.
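The per-semester clustering and the SSE used by the elbow method can be sketched in plain Python as follows. This is a minimal sketch, not the implementation used in the study: the function `kmeans`, its deterministic initialization (the first k distinct points), and the sample mixture profiles are all assumptions made for illustration.

```python
import math


def kmeans(points, k, iterations=100):
    """Plain Lloyd's k-means; returns (centroids, assignments, sse).

    Initial centroids are the first k distinct points, a deterministic
    simplification (practical use would favor random restarts).
    """
    centroids = []
    for p in points:
        if p not in centroids:
            centroids.append(p)
        if len(centroids) == k:
            break
    if len(centroids) < k:
        raise ValueError("fewer than k distinct points")
    assignments = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_assignments == assignments:
            break  # converged
        assignments = new_assignments
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    sse = sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignments))
    return centroids, assignments, sse


# Hypothetical mixture profiles for one semester:
# four alignment counts followed by the semester GPA.
profiles = [
    (5, 0, 0, 0, 3.6), (5, 0, 0, 0, 3.4), (4, 1, 0, 0, 3.5),
    (2, 0, 1, 2, 1.8), (2, 0, 2, 1, 1.5), (1, 0, 2, 2, 1.2),
]
sse_by_k = {k: kmeans(profiles, k)[2] for k in (1, 2, 3)}
```

Plotting `sse_by_k` against k and looking for the point where the decrease flattens is the elbow heuristic; repeating this for each semester's profiles gives one clustering per semester.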
Various clusters are obtained from the students' learning behavior data for each semester. There are nine datasets, one per semester. The students' learning behavior in the 1st semester is not included because in this semester the courses taken by the students are defined by the curriculum. Thus, all students have the same learning behavior, which would result in a single cluster in the 1st semester.

2) CLUSTER-TO-STATE LABELING
Cluster similarity identification based on cluster characteristics aims to group the clusters into cluster types (states). This step enables us to identify the changes in the cluster characteristics. However, this method does not work properly on data with bounded functions. The number of courses and the courses themselves are specific to the curriculum guidelines, so the labels can differ for each educational organization.
Each state is assigned a categorical label. There are 2 categories: the behavior label, based on the percentage of average matching courses out of the average total courses of the state, and the performance label, based on the average semester GPA. The behavior label is denoted as BL = {x|x ∈ N}, where BL is the set of possible behavior labels and x is the behavior label. The performance label is denoted as FL = {y|y ∈ N}, where FL is the set of possible performance labels and y is the performance label.
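The labeling of a cluster with a state can be sketched as below. All thresholds here are hypothetical placeholders chosen for illustration (the actual ranges are organization-specific), and the function names `behavior_label`, `performance_label`, and `state_of` are not from the original method. The sketch assumes the first element of a cluster mean is the matching-course count, the next three are the other alignment counts, and the last is the semester GPA.

```python
def behavior_label(avg_match, avg_total, high=0.75, medium=0.50):
    """Map the share of matching courses to H, M, or L (assumed cutoffs)."""
    ratio = avg_match / avg_total if avg_total else 0.0
    if ratio >= high:
        return "H"
    if ratio >= medium:
        return "M"
    return "L"


def performance_label(avg_gpa, cutoffs=(3.5, 3.0, 2.0)):
    """Map the average semester GPA to 1 (best) .. 4 (worst), assumed cutoffs."""
    for label, cutoff in enumerate(cutoffs, start=1):
        if avg_gpa >= cutoff:
            return label
    return len(cutoffs) + 1


def state_of(centroid):
    """Combine both labels for a cluster mean (alignment counts..., gpa)."""
    *alignment, gpa = centroid
    return f"{behavior_label(alignment[0], sum(alignment))}{performance_label(gpa)}"


# Hypothetical cluster means.
state_high = state_of((4.5, 0.5, 0.0, 0.0, 3.55))
state_low = state_of((1.0, 0.0, 2.0, 2.0, 1.2))
```

Because the state depends only on the cluster mean, clusters from different semesters that share a label can be treated as the same cluster type when tracking migrations.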

3) CLUSTER TRAJECTORY AND PATTERN RECOGNITION
We constructed a new trajectory-oriented dataset in which each student carries a list of all the cluster types (states) to which the student belonged at different points in time, namely, the path each student has gone through between clusters. In the resulting path matrix, the rows represent the students, the columns represent semesters, and each cell represents the code of the cluster type (state) to which the student belonged at a certain point in time. Students who dropped out are marked "DO". Students who graduated are marked "GR". The cluster trajectory $CT$ of the $m$-th student is the sequence of clusters $CL^i_{m,j}$ over the semesters, where $CL^i_{m,j}$ is the $j$-th cluster in the $i$-th semester for the $m$-th student and $|CL^i|$ is the number of clusters in the $i$-th semester.
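The path matrix can be sketched as a dictionary with one row per student. This is a sketch under assumptions: the function `build_path_matrix` and its input layout are hypothetical, the state labels are invented, and the sketch assumes that the "DO" or "GR" marker fills every column from the exit semester onward.

```python
def build_path_matrix(states, exits, semesters):
    """Return {student: [state per semester]}.

    states: {(student, semester): state label}
    exits:  {student: (first semester without enrolment, "DO" or "GR")}
    """
    students = sorted({student for student, _ in states})
    matrix = {}
    for student in students:
        row = []
        for sem in semesters:
            if (student, sem) in states:
                row.append(states[(student, sem)])
            elif student in exits and sem >= exits[student][0]:
                row.append(exits[student][1])  # dropped out or graduated
            else:
                row.append(None)  # no record and no known exit
        matrix[student] = row
    return matrix


# Hypothetical states for two students over semesters 2-4.
states = {
    ("ST005", 2): "H1", ("ST005", 3): "H2", ("ST005", 4): "H1",
    ("ST008", 2): "H3", ("ST008", 3): "M4",
}
exits = {"ST008": (4, "DO")}
path_matrix = build_path_matrix(states, exits, semesters=[2, 3, 4])
```

Each row of `path_matrix` is one student's cluster trajectory and can be fed directly to a sequential pattern miner.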
Suppose that ST005-ST009 were assigned to certain clusters in each semester; an example of the resulting cluster trajectories can be seen in Table 5. Students ST005, ST007, and ST009 graduated in the 9th semester, and student ST006 graduated in the 8th semester; however, student ST008 dropped out in the 5th semester. In the former section, we described the process used to identify prominent patterns. A migration graph was designed, providing all the between-cluster student migrations over the entire 9-semester period. Bars represent a period (2nd semester, 3rd semester, ...), the size of each bar section represents the size of the cluster, and colors represent the cluster type (state). The arrows represent the movement (migration) of students. The number near each arrow represents the number of students who migrated.
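The arrow weights of such a migration graph can be derived from the path matrix by counting transitions between consecutive semesters. The following is a minimal sketch: the function `migration_counts` and the sample trajectories are hypothetical and do not reproduce Table 5.

```python
from collections import Counter


def migration_counts(path_matrix):
    """Count students moving from state src (column t) to dst (column t + 1)."""
    counts = Counter()
    for row in path_matrix.values():
        for t, (src, dst) in enumerate(zip(row, row[1:])):
            if src in ("DO", "GR") or src is None:
                continue  # no migration after a student has left
            counts[(t, src, dst)] += 1
    return counts


# Hypothetical trajectories over three consecutive semesters.
paths = {
    "ST005": ["H1", "H2", "H1"],
    "ST006": ["H1", "H2", "H2"],
    "ST008": ["H3", "M4", "DO"],
}
edges = migration_counts(paths)
```

Each key `(t, src, dst)` corresponds to one arrow between two adjacent bars, and its count is the number printed near that arrow.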
To analyze the migration of students, we used sequential pattern mining with the generalized sequential pattern (GSP) algorithm to identify prominent patterns in the between-cluster migration of students. The GSP algorithm determines the frequent patterns of the sequences. The support value of a pattern in each category is the fraction of that category's sequences that contain the pattern. The length of the frequent sequences depends on the minimum support value, which is the minimum percentage of the overall data that must share the same pattern. The minimum support in this study tends to be low because the sequences are quite long (9 elements, one per semester).
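The support computation that underlies frequent-sequence mining can be sketched as follows. This sketch covers only support counting for given candidate patterns, not the candidate generation and pruning of the full GSP algorithm; the function names and the sample trajectories are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    # 'item in remaining' advances the iterator, so order is enforced.
    return all(item in remaining for item in pattern)


def support(pattern, sequences):
    """Fraction of sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in sequences) / len(sequences)


def frequent(candidates, sequences, min_support):
    """Keep the candidate patterns whose support reaches min_support."""
    return [p for p in candidates if support(p, sequences) >= min_support]


# Hypothetical trajectories of one student category.
on_time = [
    ["H1", "H2", "H1", "M2"],
    ["H1", "H1", "H2", "M2"],
    ["H2", "H2", "H1", "L3"],
]
candidates = [["H1", "M2"], ["H2", "H1"], ["M2", "H1"]]
kept = frequent(candidates, on_time, min_support=0.5)
```

Running the same functions separately on each student category gives per-category supports, which is how patterns can be compared between students who graduate on time, graduate late, or drop out.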
The pseudocode of P-CEA can be seen in Table 6.

A. DATA INTRODUCTION
This section addresses the results of the analysis of a dataset of Information Systems students' academic data from the 2nd semester to the 10th semester. The first semester data were excluded since all students took the same courses in accordance with the curriculum guidelines. The department regularly updates the curriculum every five years. It should be noted that the curriculum guidelines for this study and the batch of students were not changed until the students graduated. The curriculum guideline is designed for eight (8) consecutive semesters, which is the standard length of study for bachelor's degrees. During the eight (8) semesters, there are two phases, i.e., the preparation phase (1st semester to 4th semester) and the bachelor's phase (5th semester to 8th semester). The preparation phase comprises basic courses that are offered to the students from the 1st semester to the 4th semester. Students need to pass all basic courses to continue their study in the 5th semester; otherwise, they drop out. If a student achieves good performance (i.e., semester GPA ≥ 3.0), he/she can take courses with a total of up to 24 credits. On the other hand, students with poor performance (i.e., semester GPA < 2.0) can take only up to 16 credits. Students who achieve a semester GPA of at least 2.0 but below 3.0 can take up to 20 credits.
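The credit-limit rule described above can be written as a small function. This is an illustrative sketch: the name `max_credits` is hypothetical, and a semester GPA of exactly 3.0 is assumed to fall in the 24-credit band, following the "≥ 3.0" condition.

```python
def max_credits(semester_gpa):
    """Maximum credits a student may take in the next semester."""
    if semester_gpa >= 3.0:
        return 24  # good performance
    if semester_gpa >= 2.0:
        return 20  # intermediate performance
    return 16      # poor performance (GPA below 2.0)


limits = {gpa: max_credits(gpa) for gpa in (3.4, 2.5, 1.7)}
```

This rule is one reason course-taking patterns diverge in later semesters: a lower limit forces low-performing students to spread courses over more semesters.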

B. RESULT
1) TEMPORAL CLUSTERING
The k-means algorithm was used to cluster the data per semester, and the elbow method was used to determine the number of clusters. The number of clusters for each semester can be seen in Table 7.
The number of clusters in each semester represents how complex the course-taking pattern is. The number of clusters in the early semesters was not high because most of the courses are defined as a set of courses that need to be taken simultaneously. The number of clusters increased in the 5th-8th semesters because there were more courses, and students freely took courses based on their performance. For example, students with low performance are allowed to take fewer credits and often must retake courses. On the other hand, students with high performance have higher maximum credits and tend to take more courses per semester. The last two semesters have the smallest number of clusters because there are no courses left in the curriculum guideline and the number of students is small because most of them have graduated.

2) CLUSTER-TO-STATE LABELING
For this case study, we defined three labels (|x| = 3) for the behavior label based on the level of conformity to the curriculum guideline, which is measured as the percentage of matching courses out of the total courses. For the performance label, the semester GPA (GPA-S) is bound to its own semester and scaled from 0 to 4, and we defined four labels (|y| = 4) based on the GPA-S value. The labels of each category can be seen in Tables 8 and 9. The results of state assignments for each cluster using categorical labels are shown in Table 10.

3) CLUSTER TRAJECTORY AND PATTERN RECOGNITION
To obtain a detailed migration pattern, this section shows the migration graph, which is divided into three categories: students who graduated on time, students who graduated late, and students who dropped out.
In Figure 3, which displays the migration graph of students who graduated on time, we find that the majority of students who graduated on time remain in high GPA-S clusters (green and blue) and have high conformity with the curriculum guidelines (label H). No students who graduated on time migrated to or remained in the low GPA-S (red) cluster. From the thickness of the lines in the migration graph, we can see that students who graduated on time had a prominent pattern in the 2nd-5th semesters and then scattered. The L label, which represents low conformity to the curriculum guideline, only appears in the late semesters (7th and 8th semesters).

Figure 4 displays the migration graph of students who graduated late (9th and 10th semesters). There is no prominent pattern for this group of students. In the migration between the 2nd and 3rd semesters, the majority of students who graduated late split into two large groups. Both groups had high conformity (H) to the curriculum guidelines but had different results for the GPA-S (green and yellow). No students who graduated late migrated to the low GPA-S cluster (red) until the 8th semester, when they did so because they failed to complete their final projects. The difference between this group and the group of students who graduated on time is that most of the students in the on-time group migrated to the high GPA-S cluster (blue) in the 8th semester, while most of the students in this group migrated to the low GPA-S cluster (red).

Figure 5 shows the migration graph of students who dropped out. Half of the students dropped out in the 3rd semester (5 students), one student in the 5th semester, and four students in the 6th semester. Most of the students who dropped out had consistently declining performance in terms of the level of conformity and/or GPA-S. Most students who dropped out were in the low GPA-S cluster (red) regardless of their conformity with the curriculum guidelines.
The GSP algorithm yields sequence patterns from the sequential migration data of students from cluster to cluster for nine semesters. We set the minimum support parameter to 0.1 because it yields full sequences (9 sequences). We separated the sequential migration data into three datasets: students who graduated on time, students who graduated late, and students who dropped out. We set a different minimum support parameter for the sequences of students who dropped out in order to show all the sequences. The results of the sequences for each student category can be seen in Tables 11-13.

TABLE 11. Generalized sequence pattern of students who graduated on time.

The sequence patterns show that there are general patterns that differ between students who graduated on time, graduated late, and dropped out. The first and only L (low match) and 3 (low GPA) occur in the 7th semester for the majority of students who graduated on time. Most of the cluster labels over the entire sequence of students who graduated on time consist of H and M as their learning behavior level of conformity and 1 and 2 as their semester GPA level.
Students who graduated late have three general sequences. Two of them (Seq1 and Seq2) have H3 as their cluster label in the 3rd semester, which means that they have high conformity with curriculum guidelines but a low average semester GPA. After that, they have stable GPA-S with 2 (green) as their performance label and H and M as their learning behavior label. In the 8th semester, all prominent sequences have 4 (red) as their performance label due to failure to complete the Final Project course, which has 6 credits. For the 9th semester, all prominent sequences have L1 as their profile label, which means a low behavior level of conformity (the curriculum guideline only provides courses until the 8th semester) and good performance. From this pattern, we can conclude that the course in the 8th semester is the biggest problem for the students who graduated late, followed by the courses in the 3rd and 6th semesters.
Students who dropped out had 7 different sequences. Five of them had a low average semester GPA, and almost all students dropped out after they had a low semester GPA, except one student (Seq7), indicating that they were unable to catch up with other students. Seq2 and Seq6 last only until the 2nd semester, Seq3 until the 3rd semester, and Seq4 until the 4th semester, while Seq1 and Seq5 continue until the 5th semester. Seq5 shows fluctuating performance (up in even semesters, down in odd semesters), whereas Seq1 remains on the 4 (red) performance label while its behavior label declines because of the increasing number of courses that need to be taken. Only Seq7 did not have a red-labeled performance but still dropped out in the 3rd semester.

C. DISCUSSION
The temporal case is an interesting perspective in educational process mining because most of the research in this domain focuses on the case perspective. The temporal case perspective can show the change of a pattern in a period of time, especially in a centralized curriculum. A centralized curriculum is a guideline that can be helpful for students to achieve academic expectations by giving a prerequisite course chain. This type of curriculum can also be a problem when students fail a course. Usually, when students fail a course, they need to wait one year to retake the course because of the odd/even semester system. When students fail a prerequisite course, there will be a chain reaction to other courses that can only be taken after they pass the prerequisite course. This situation may prolong the students' study period.
The main result of this paper is the migration of students from cluster to cluster each semester, which represents how students change their behavior and performance over time. The P-CEA can serve as a tool for monitoring students' performance by updating the data at the end of each semester. At that point, department stakeholders can see where each student stands and how their behavior and performance changed from the previous semester. The clusters in the migration graph are sorted from the highest curriculum guideline conformity and performance to the lowest. The results show that most of the students who graduated on time tended to migrate to the upper area of the graph, whereas all the students who dropped out tended to migrate to the lower area. The students who graduated late were scattered across all areas of the graph: some migrated to the upper part and others to the lower part. Department stakeholders can provide counseling to students whose performance is declining or who are migrating to the lower area of the graph, for example, students who migrated to a cluster with low conformity with the curriculum guideline (L) and/or a low GPA-S (performance label 4, red).
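As a rough illustration of how such migrations could be tallied at the end of each semester, the following sketch counts cluster-to-cluster transitions and flags students whose latest label suggests a need for counseling. The student identifiers, label sequences, and function names are hypothetical. The label format (a conformity letter H/M/L followed by a performance label from 1 to 4, with 4 shown in red) follows the labels used in the text, and the flagging rule is an assumption derived from the example above, not the authors' implementation.

```python
from collections import Counter

# Hypothetical profile labels per student, one label per semester.
sequences = {
    "s1": ["H1", "H1", "H2", "H1"],
    "s2": ["H2", "M3", "L4", "L4"],
    "s3": ["H1", "H1", "H2", "H1"],
}


def count_migrations(seqs):
    """Count (semester, source label, target label) transitions over all students."""
    counts = Counter()
    for labels in seqs.values():
        # Pair each semester's label with the next semester's label.
        for sem, (src, dst) in enumerate(zip(labels, labels[1:]), start=1):
            counts[(sem, src, dst)] += 1
    return counts


def needs_counseling(labels):
    """Assumed rule: flag low conformity (L) or performance label 4 in the latest semester."""
    latest = labels[-1]
    return latest[0] == "L" or latest[1:] == "4"


migrations = count_migrations(sequences)
flagged = sorted(sid for sid, labels in sequences.items() if needs_counseling(labels))
```

The transition counts correspond to the edge weights of a migration graph, and the flagged list is one simple way to turn the graph into a counseling shortlist.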
Another area that can be analyzed is the occurrence of clusters. A cluster with a low level of conformity can appear in the early semesters for two reasons. First, low-performing students may be unable to take a course from the previous semester, or they may be retaking a course that they previously failed. Second, high-performing students may have already taken the scheduled courses and need to take courses from the next semester to make full use of their maximum credit. Department stakeholders can consider actions to help students, such as opening the courses that students need, by looking at the occurrence of clusters, especially clusters with a low GPA-S. When such a course is opened, students can take or retake it immediately and do not have to wait until the next year. This type of action will help students who are at risk of graduating late.
The difference between P-CEA and CEA lies in the data and in the cluster-to-state labeling process. P-CEA uses educational data on course-taking activity and transforms it into profile-based data. CEA uses financial data, which has different characteristics, such as data distribution and range of values. The educational data used in P-CEA also has characteristics that depend on the semester: the data from earlier semesters can differ from the data from later semesters, because students in higher semesters have fewer course options, having already taken many courses in previous semesters.
The methodology works well, as the migration graphs show: the differences between the migration patterns of each category can be seen clearly. The process in P-CEA is a modification of the CEA process that allows educational data to be processed, especially in the cluster-to-state labeling step. CEA includes a step that mixes data values across timeframes. Because the data used in P-CEA have different characteristics in each semester, the labeling must be done separately for each timeframe. With P-CEA, the data are labeled within their own timeframe (semester).
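The idea of labeling within each timeframe can be sketched as follows. This is a minimal illustration, not the labeling rule of P-CEA: it assumes that each cluster is summarized by a mean semester GPA and that the performance label is simply the cluster's rank within its own semester. The centroid values, cluster identifiers, and function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical cluster summaries: (semester, cluster id, mean semester GPA).
centroids = [
    (1, "a", 3.6), (1, "b", 3.1), (1, "c", 2.2),
    (8, "a", 3.0), (8, "b", 2.4), (8, "c", 1.0),
]


def label_per_semester(rows):
    """Assign a performance label to each cluster using only its own semester's data."""
    by_semester = defaultdict(list)
    for sem, cluster_id, gpa in rows:
        by_semester[sem].append((gpa, cluster_id))

    labels = {}
    for sem, items in by_semester.items():
        # Label 1 goes to the cluster with the highest mean GPA in this semester.
        ranked = sorted(items, reverse=True)
        for label, (_gpa, cluster_id) in enumerate(ranked, start=1):
            labels[(sem, cluster_id)] = label
    return labels


labels = label_per_semester(centroids)
```

In this toy data, the best cluster of semester 8 receives label 1 even though its mean GPA is lower than that of the second-best cluster in semester 1, which is the kind of semester-specific difference that a pooled labeling would blur.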

V. CONCLUSION
This study proposed a profile-based cluster evolution analysis (P-CEA) for dynamic data that change over time. The proposed method was tested on educational data in which the profiles were related to the course sequence alignment and the grade point average of students per semester. The P-CEA yields students' learning behavior patterns that help stakeholders recognize students who are likely to graduate on time or drop out.
Detailed migration pattern detection produces three migration graphs, one for each study-period outcome (students who graduated on time, students who graduated late, and students who dropped out), which show how students' migration per semester relates to their performance. The migration graph for students who graduated on time shows that students migrated to a better state or remained in the same state, with match as their dominant learning behavior and a good average semester GPA. Students who dropped out remained in the same state or declined in terms of conformity with curriculum guidelines and had a low average semester GPA.
By knowing students' migration per semester, department management and stakeholders can determine the actions to take for students who show declining performance or who remain in a state with repetitive or later as their dominant learning behavior, and can help prevent dropouts by issuing warnings and providing counseling. Another action that stakeholders can take is to open the courses that students need, based on the cluster labels in the migration graph. A limitation of this study is that it cannot detect students who drop out for reasons unrelated to their coursework: some students who drop out have good grades and good learning behaviors. Future studies could include students' demographic data to give more insight into the patterns. Other aspects, such as social network analysis of course-taking activity between students or lecturers, could also be explored.
SATRIO ADI PRIYAMBADA (Member, IEEE) received the bachelor's and master's degrees in information systems from the Institut Teknologi Sepuluh Nopember, Indonesia. He is currently pursuing the Ph.D. degree in computer science with the Graduate School of Science and Technology, Kumamoto University, Japan. He is a member of the Human Interface Cyber Communication Laboratory. His research interests include process mining, machine learning, and educational data mining.
MAHENDRAWATHI ER received the bachelor's degree in industrial engineering from the Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia, and the master's degree in operations management and manufacturing systems and the Doctoral degree in manufacturing engineering and operations management from Nottingham University, Nottingham, U.K. Since 2006, she has been a Lecturer with the Information Systems Department, Institut Teknologi Sepuluh Nopember. She teaches among others enterprise resource planning, business process management, supply chain management, and e-business. She has published her work in international journals and conferences. Her research interests include business process management, supply chain management, and enterprise systems. She is a member of the Enterprise Systems Research Laboratory.
BERNARDO NUGROHO YAHYA received the B.S. degree in industrial engineering from Petra Christian University, the M.S. degree in information system engineering from Dongseo University, and the Ph.D. degree in industrial engineering from Pusan National University. He is currently a Full Professor with the Industrial and Management Engineering Department, Hankuk University of Foreign Studies, South Korea. He has been working on various industry business consulting and engineering projects with Korean companies. His current research interests include statistical pattern recognition, machine learning, business process intelligence, and data analytics.