Clustering Algorithms in an Educational Context: An Automatic Comparative Approach

Despite an increasing consensus regarding the significance of properly identifying the most suitable clustering method for a given problem, a surprising amount of educational research, including both educational data mining (EDM) and learning analytics (LA), neglects this critical task. This shortcoming could in many cases have a negative impact on the prediction power of both the EDM and LA based approaches. To address such issues, this work proposes an evaluation approach that automatically compares several clustering methods using multiple internal and external performance measures on 9 real-world educational datasets of different sizes, created from the University of Tartu’s Moodle system, to produce two-way clustering. Moreover, to investigate the possible effect of normalization on the performance of the clustering algorithms, this work performs the same experiment on a normalized version of the datasets. Since such an exhaustive evaluation includes multiple criteria, the proposed approach employs a multiple criteria decision-making method (i.e., TOPSIS) to rank the most suitable methods for each dataset. Our results reveal that the proposed approach can automatically compare the performance of the clustering methods and accordingly recommend the most suitable method for each dataset. Furthermore, our results show that in both normalized and nonnormalized datasets of different sizes with 10 features, DBSCAN and k-medoids are the best clustering methods, whereas agglomerative and spectral methods appear to be among the most stable and highly performing clustering methods for such datasets with 15 features. Regarding datasets with more than 15 features, OPTICS is among the top-ranked algorithms among the nonnormalized datasets, and k-medoids is the best among the normalized datasets. 
Interestingly, our findings reveal that normalization may have a negative effect on the performance of certain methods, e.g., spectral clustering and OPTICS; however, it appears to mostly have a positive impact on all of the other clustering methods.


I. INTRODUCTION
Rapid advances in information and communication technology have resulted in noticeable changes in education, e.g., in the way students learn and teachers teach. The use of the Internet to deliver online or blended courses is one example of such changes [1]. Currently, a significant number of institutions (both private and public) employ learning management systems (LMSs) to present and support online
learning. These LMSs not only facilitate managing and offering learning resources but also provide educators with opportunities to monitor the learning progress of students by tracking their actions in the system. In other words, LMSs enable educators to modify and adapt their teaching to improve student learning in response to the data collected from students' online behavior. However, since these log data are raw and neither present stable information nor reflect pre-existing theories, it is difficult to employ them productively in revising the learning process [2]. Educational data mining (EDM), learning analytics (LA), and their applications, which usually aim to provide a helpful understanding of the learning process to both students and instructors by converting raw educational data into useful information, are some possible responses to this challenge [3]-[5]. To uncover hidden patterns and knowledge from educational data, researchers usually employ various EDM or LA approaches, e.g., supervised and unsupervised learning techniques (for a systematic review of EDM research, see [3]). Even though supervised learning can achieve high prediction accuracy, it is often inapplicable to educational data with no predefined class labels [6]. Unsupervised learning methods, henceforth referred to as clustering, not only are capable of discovering hidden, underlying patterns and structures in unlabeled data but also can be used to label such data for the subsequent use of supervised methods.
The main aim of clustering methods is to group objects into clusters such that objects within a cluster are more similar to one another than to those in other clusters. The practical value of this capability is that it enables the personalized differentiation of learning processes, which has been key to improving learning outcomes (e.g., [7]). In other words, clustering can be regarded as the task of modeling the data as a simplified set of properties, providing insightful explanations about important aspects of a dataset. Supervised methods are usually less demanding than clustering; however, clustering methods can provide more intuition when addressing complex data [8]. In the field of education, clustering approaches have been applied to several different variables, such as student motivation and behavior, time spent on learning tasks, and so on [9]. The performance of clustering methods can differ significantly across data types and applications because these methods usually operate in spaces with different dimensions and must deal with incomplete, noisy, and sampled data. Multiple clustering methods have therefore been developed [10]-[16], making the selection of the most appropriate clustering method for a given dataset or problem a difficult challenge. Thus, comparing and recommending clustering algorithms in an educational context could be beneficial to both researchers and practitioners in the field. As in other disciplines, evaluating a learning method's performance is a crucial challenge in education. While a limited set of performance measures suffices to evaluate supervised methods, assessing clustering methods is more difficult owing to the nature of cluster analysis (e.g., [17]-[20]).
In cluster analysis, one key question that must be addressed is cluster validation, i.e., the evaluation of a clustering algorithm's quality. This task is intrinsically challenging due to the lack of objective measures. Different types of validation methods exist for clustering algorithms [21]; among them, internal and external validation are the most frequently used by researchers [22]. Essentially, these measures assess clustering methods from different viewpoints, and in practice, no clustering method can achieve the best performance on all of these performance metrics for a given problem domain [23]. A number of studies revolve around developing performance measures for clustering methods with the aim of determining the appropriateness of the produced clusters [24], [25]. Surprisingly, however, although there is an increasing consensus concerning the importance of properly identifying the best clustering method and subsequently interpreting the produced result for a given problem, only a limited number of research studies [25]-[27], if any, have comprehensively considered both internal and external measures for the evaluation of clustering methods in an educational context. In addition to the tedious process behind the experimentation and data preprocessing, one main reason is that cluster evaluation normally involves multiple conflicting criteria (owing to the large number of external and internal metrics). To address this issue, i.e., to comprehensively evaluate the performance of different clustering methods on educational datasets (e.g., from Moodle), this study models the cluster evaluation task as a multiple criteria decision-making (MCDM) problem and accordingly proposes a novel approach for the automatic evaluation of clustering methods in an educational context.
Our approach, inspired by [23], involves empirically studying the performance of seven different clustering methods (from different families) on 18 educational datasets of different sizes to generate two-way clustering results. Thereafter, seventeen performance metrics (three internal and fourteen external) are used to measure the performance of the produced clustering results. This study then employs a well-known MCDM method, called TOPSIS, which takes the performance measures as inputs, to rank each clustering algorithm for each dataset as a means of validating our proposed evaluation approach. This approach makes it possible to find and recommend the most suitable clustering method for different educational datasets to practitioners, researchers, and decision makers. However, it should be noted that the findings reported in this study are relevant and limited to educational datasets that contain students' (online) activity data; for other types of educational data, other results could be anticipated. The main contributions of this work can be summarized as follows: 1) To model a cluster evaluation task as a multiple criteria decision-making problem and accordingly propose a novel approach for the evaluation of clustering methods in an educational context. 2) To automatically find and recommend the most suitable clustering method for different educational datasets, for practitioners, researchers, and decision makers. 3) To systematically and comprehensively compare well-known clustering methods used in an educational context using multiple real-world educational datasets (with different sizes and performance measures). 4) To investigate the effect of normalization on the performance of clustering algorithms in an educational context (on datasets with different sizes and feature numbers).
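To make the ranking step concrete, the following minimal TOPSIS sketch ranks three hypothetical clustering methods scored on two criteria. The scores, weights, and criterion directions below are invented for illustration only and are not taken from our experiments.

```python
import numpy as np

def topsis(matrix, weights, benefit):
    """Rank alternatives (rows) on criteria (columns) with TOPSIS.

    matrix  : (n_alternatives, n_criteria) performance scores
    weights : criterion weights
    benefit : True where larger is better, False where smaller is better
    """
    M = np.asarray(matrix, dtype=float)
    # 1. Vector-normalize each criterion column.
    norm = M / np.linalg.norm(M, axis=0)
    # 2. Apply the criterion weights.
    v = norm * np.asarray(weights, dtype=float)
    # 3. Ideal and anti-ideal solutions per criterion.
    benefit = np.asarray(benefit)
    ideal = np.where(benefit, v.max(axis=0), v.min(axis=0))
    anti = np.where(benefit, v.min(axis=0), v.max(axis=0))
    # 4. Euclidean distances to both reference points.
    d_pos = np.linalg.norm(v - ideal, axis=1)
    d_neg = np.linalg.norm(v - anti, axis=1)
    # 5. Closeness coefficient in [0, 1]; higher is better.
    return d_neg / (d_pos + d_neg)

# Hypothetical example: three clustering methods scored on
# silhouette (benefit criterion) and Davies-Bouldin (cost criterion).
scores = [[0.62, 0.80],
          [0.55, 0.95],
          [0.48, 1.30]]
closeness = topsis(scores, weights=[0.5, 0.5], benefit=[True, False])
ranking = np.argsort(-closeness)  # best-ranked method first
```

The closeness coefficient combines all criteria into a single score per method, which is what allows conflicting internal and external metrics to yield one ranking per dataset.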
Sections II and III of this paper review the related work and describe the materials and methods, respectively; Sections IV and V present the experiments and results, and Section VI provides the conclusions and future directions.

II. RELATED WORK
A significant number of EDM and LA studies have applied clustering methods to educational data to provide instructors and students with useful insights into their learning processes. For example, in the studies conducted by [28] and [29], different clustering methods, including k-means, were employed to cluster the e-learning behavior of students and to determine the impact of human characteristics on students' preferences, respectively. Similarly, using item response theory and k-means, [30] successfully identified the learning ability of students in a collaborative-learning environment. In a different attempt, [31] employed expectation-maximization clustering to identify active and passive collaborators within a group of students at a European university. Moreover, [32] determined the different behavior patterns adopted by students in online discussion forums using an agglomerative clustering method. According to the systematic review conducted by [3], clustering methods are generally applied in EDM, LA, and broader educational research for several reasons, including the analysis of students' motivations, attitudes, and behavior and the understanding of students' learning styles. Furthermore, that review highlights k-means as the clustering method most frequently used in EDM and LA research. Although it provides a deep review of 30 years of clustering methods in the educational context, it gives no indication of comprehensive comparative studies using educational datasets. A few studies do use educational datasets to compare clustering methods (e.g., [33]), but they mainly neglect systematically comparing well-known clustering methods, applying multiple performance measures, considering datasets with different sizes and characteristics, studying datasets with different feature numbers (e.g., large and medium), investigating the possible effect of normalization on a given dataset, and so forth.
Furthermore, among those few studies that evaluate the quality of clustering algorithms, there is no study that approaches such an evaluation task using a combination of multiple criteria.
On the other hand, there are several studies, in other disciplines, that revolve around systematically and comprehensively comparing clustering methods with the aim of providing guidelines and recommendations to researchers and practitioners [22], [23], [34]-[37]. In the study conducted by [23], for example, the authors used different performance metrics to assess clustering algorithms in the field of financial risk analysis, employing a decision-making algorithm to choose the best method on three different datasets. The results of their study confirmed that a single clustering method is unlikely to achieve the best performance on all performance measures. Additionally, they highlighted repeated bisection as the most suitable clustering method in their study. In the context of the text-independent speaker verification task, a comparative study of clustering methods was conducted by [37]. The authors used three datasets and six different clustering methods, including k-means, random swap, expectation-maximization, hierarchical clustering, self-organizing maps (SOM), and fuzzy c-means. According to their findings, with a small number of clusters, SOM and hierarchical methods perform weakly and have a lower accuracy than the other methods. In the study conducted by [35], four performance measures were used to evaluate five clustering methods (namely, k-means, multivariate Gaussian mixture, hierarchical clustering, spectral, and nearest-neighbor methods) on 35 gene expression datasets. The results indicate that the multivariate Gaussian mixture method outperforms the other methods, that the spectral method is sensitive to the proximity measure used, and that k-means shows a good, stable performance similar to that of the Gaussian mixture method.
Using six datasets to model protein interactions in the yeast Saccharomyces cerevisiae, [38] conducted a comparative study of clustering methods (i.e., Markov clustering, restricted neighborhood search clustering, super-paramagnetic clustering, and molecular complex detection) and found that restricted neighborhood search clustering is the most robust and achieves the highest performance with respect to variation in the choice of the clustering methods' parameters. They moreover found that the other clustering methods behave more stably with regard to dataset alterations. Finally, [22] conducted a broad comparative study on nine clustering methods (including k-means, CLARA, hierarchical, expectation-maximization, hcmodel, spectral, subspace, OPTICS, and DBSCAN) using 400 artificial datasets of different sizes. Their findings reveal that spectral methods tend to show good performance (compared to the others) when the default configurations of the clustering methods are considered.
Evidently, despite an increasing consensus regarding the significance of properly identifying the clustering method that is best suited for a given problem, a surprising amount of educational research (including EDM and LA) neglects this critical task [25]-[27]. While several clustering techniques have been employed in an educational context, it remains an open question which technique performs best for a given dataset [22]. As previously discussed, to the best of our knowledge, there are no comprehensive comparative studies using educational datasets that consider multiple performance measures (both internal and external), well-known clustering methods, and real-world educational datasets (with different sizes and features) to produce two-way clustering and accordingly provide researchers and practitioners in education disciplines with appropriate, much-needed guidelines. Such guidelines would eventually help to improve EDM and LA research and practice. To fill this gap, this study puts forward an approach for the evaluation of clustering methods in an educational context that can find the most suitable clustering method for the dataset at hand. This approach could help to ensure the suitability of a clustering method for a given dataset because, in many EDM or LA studies, the classes generated by clustering methods are later used as ground truth for classification methods, and wrongly clustered data can adversely affect the prediction power of the approach.

III. MATERIALS AND METHODS
Even though a diverse range of clustering methods has been proposed and employed in the literature, some methods are used more often than others [3]. Moreover, several frequently used methods employ similar mathematical concepts (e.g., the similarity matrices of graph and spectral clustering) or are based on similar assumptions about the data (e.g., agglomerative and divisive methods) and are thus expected to produce similar performance in typical usage scenarios. For these reasons, this study considers clustering methods drawn from different families. Several taxonomies breaking clustering methods down into different families have been proposed in the literature. For example, [39] grouped clustering methods into partitioning and hierarchical methods; [23] broke clustering methods down into partitioning, grid-based, hierarchical, model-based, density-based, constraint-based, and frequent-pattern-based methods; [40] classified clustering methods as partitioning, hierarchical, density-based, grid-based, and model-based approaches; [41] separated clustering methods into density-based, partitioning, and hierarchical methods; and finally, [22] added more categories and defined clustering methods as partition-based, linkage, model-based, spectral, subspace-based, and density-based methods. It is apparent that different families of clustering methods have been formed by several researchers according to different criteria, such as objective functions and cluster structures [42], [43]. This study considers seven well-known methods commonly used in the area of education: expectation-maximization (EM) from the model-based family, k-means and k-medoids from the partitioning family, OPTICS and DBSCAN from the density-based family, spectral clustering from the family of spectral methods, and agglomerative clustering from the hierarchical family.
These methods were selected according to the recent systematic review of clustering methods conducted in [3].
Regarding the performance metrics used for evaluating clustering methods, several studies have proposed different metrics (e.g., [44]), among them internal and external metrics. External validation metrics measure the similarity between a clustering method's results and the correct partitioning (predefined class labels), whereas internal validation metrics measure the appropriateness of a clustering structure according to intracluster similarity and intercluster dissimilarity, without using external information (class labels). Both internal and external validation metrics play an important role in choosing an optimal clustering method for a certain dataset [45]. The present study employs three internal metrics, namely, the Dunn index, the Davies-Bouldin index, and the silhouette index, together with fourteen external metrics. While some of the external metrics are closely related and belong to the same pair-counting category, they possess some differences; for example, they can behave partially with regard to the distribution of class sizes in a partition, and they can be biased toward the number of clusters [22]. Considering a large number of (internal and external) performance metrics enables us to compare different clustering methods and possibly highlight the best method for a certain dataset regardless of the partiality of the individual metrics [46].

A. CLUSTERING METHODS
This section presents a brief overview of seven clustering methods used in this study.
K-means is an algorithm that has been broadly employed in educational research [3]. As input, this algorithm requires a distance metric and the number of clusters (k). In this method, k objects are selected at random as cluster centers. Thereafter, using the minimum squared-error criterion, which measures the distance between the cluster center and an object, all of the objects are grouped into k clusters. In the final stage, each cluster's new mean is computed, and this process repeats until the center of each cluster remains unchanged. The time complexity of this algorithm is O(nkd), where d denotes the number of features, k is the number of clusters, and n is the number of objects. Key advantages of the k-means algorithm are its simple implementation, low computational cost, and good (or acceptable) results in many practical situations. Nonetheless, it has several disadvantages: it requires the number of clusters to be specified in advance, which can strongly affect the final classification; it is unsuitable for situations with nonconvex clusters; it is sensitive to outliers; it scales poorly with the number of dimensions; and it depends on the preliminary starting condition [47].
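The assignment and update steps described above can be sketched in a few lines. The following minimal implementation uses a deterministic farthest-point initialization and synthetic two-blob data, both of which are our own illustrative choices rather than part of the experimental setup.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Minimal Lloyd's algorithm: assign each point to its nearest
    center, then move each center to the mean of its cluster."""
    # Deterministic farthest-point initialization (instead of the
    # random seeding described in the text) for reproducibility.
    centers = X[[0]]
    for _ in range(1, k):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centers = np.vstack([centers, X[d.argmax()]])
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center becomes the mean of its cluster.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):  # converged: centers unchanged
            break
        centers = new
    return labels, centers

# Two well-separated synthetic blobs; k-means should recover them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
labels, centers = kmeans(X, k=2)
```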
K-medoids, a modification of k-means, is among the algorithms that have been proposed to address the limitations of k-means. In this method, similar to k-means, the goal is to minimize the distance between the point specified as the cluster center and the data points in that cluster. Unlike k-means, k-medoids selects actual data points as cluster centers, namely those with minimal average dissimilarity to the points assigned to the cluster (the most centrally situated point in the cluster). The time complexity of k-medoids is O(k(n-k)^2). Compared to k-medoids, the k-means method appears to be less flexible in certain situations; for example, it cannot be used with certain similarity measures, such as the absolute Pearson correlation, because the distance used must be consistent with the mean; moreover, it appears to be unsuitable for clustering nonspherical groups of objects because it minimizes squared distances to the cluster means. Nonetheless, k-means has a lower computational cost and requires less time to run than k-medoids [48].
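The difference from k-means, i.e., that each center must be an actual data point, can be illustrated with a small Voronoi-iteration sketch (a simplified variant of PAM; initialization strategy and data are our own illustrative choices):

```python
import numpy as np

def kmedoids(X, k, n_iter=100):
    """Voronoi-iteration k-medoids: like k-means, but each cluster
    center must be an actual data point (the medoid)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    # Deterministic farthest-point initialization of the medoids.
    medoids = [0]
    for _ in range(1, k):
        medoids.append(int(D[:, medoids].min(axis=1).argmax()))
    medoids = np.array(medoids)
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)     # assign to nearest medoid
        new = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):
                # New medoid: the member with minimal total distance to
                # the other members (most centrally situated point).
                new[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new, medoids):          # converged
            break
        medoids = new
    return labels, medoids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
labels, medoids = kmedoids(X, k=2)
```

Because the medoid update only needs pairwise dissimilarities, this variant works with any dissimilarity measure, which is the flexibility advantage over k-means noted above.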
Expectation-maximization (EM) is a model-based clustering method whose main goal is to define each data object's cluster membership according to a probability. This method is composed of two steps: the expectation step, which estimates the likelihood of each object belonging to a cluster, and the maximization step, which computes the parameters of the distributions so as to maximize the likelihood of the data under those distributions [49]. The time complexity of this algorithm is O(dni), where i is the number of iterations. Among the advantages of EM are its stability and its simple implementation. Furthermore, it is regarded as particularly suitable for incomplete datasets. However, this method suffers from some issues: it mostly fails to find small clusters, the obtained clusters can depend strongly on the initial conditions, the expectation and maximization steps can be intractable, and convergence can be slow.
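The two alternating steps can be seen in a tiny one-dimensional, two-component Gaussian mixture sketch (initialization scheme and data are our own illustrative assumptions, not a general-purpose implementation):

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """Tiny EM for a two-component 1-D Gaussian mixture.
    E-step: responsibilities; M-step: weighted parameter updates."""
    x = np.asarray(x, float)
    mu = np.array([x.min(), x.max()])            # crude initialization
    var = np.array([x.var(), x.var()]) + 1e-6
    pi = np.array([0.5, 0.5])                    # mixing weights
    for _ in range(n_iter):
        # E-step: posterior probability of each component per point.
        pdf = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = pi * pdf
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate means, variances, and mixing weights.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / len(x)
    return mu, var, pi

# Two well-separated 1-D components around 0 and 8.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(8, 1, 200)])
mu, var, pi = em_gmm_1d(x)
```

The soft responsibilities are what distinguish EM from the hard assignments of k-means: each point contributes fractionally to every component's parameters.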
Agglomerative clustering is a hierarchical method that considers the linkage between data points. In this method, which follows a bottom-up fashion, data objects are grouped into a tree. It initially places each data point in its own cluster and subsequently combines these clusters into larger clusters. This process iterates until a certain termination condition is reached or all data points are gathered in one cluster [50]. Clusters are agglomerated according to the similarity measure between them. The time complexity of this algorithm is O(n^2). Some advantages of this method are its simple implementation and the support it offers in deciding on the number of clusters (the dendrogram produced is more informative than the unstructured set of flat clusters produced by k-means). On the other hand, this method is less flexible, especially in undoing previous steps: once instances have been assigned to a cluster, they can no longer be moved around. Another weakness of this method is its unsuitability for large datasets. Moreover, this method is vulnerable to outliers, and the final results can be affected by the order of the data.
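The bottom-up merging can be sketched naively as follows; the single-linkage criterion and synthetic data are illustrative choices (real implementations use far more efficient data structures):

```python
import numpy as np

def agglomerative(X, n_clusters, linkage="single"):
    """Naive bottom-up clustering: start with singleton clusters and
    repeatedly merge the two closest ones until n_clusters remain."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sub = D[np.ix_(clusters[a], clusters[b])]
                # Single linkage: minimum pairwise distance between the
                # two clusters; "complete" would use the maximum instead.
                d = sub.min() if linkage == "single" else sub.max()
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)        # merge cluster b into a
    labels = np.empty(len(X), dtype=int)
    for lab, members in enumerate(clusters):
        labels[members] = lab
    return labels

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(5, 0.3, (10, 2))])
labels = agglomerative(X, n_clusters=2)
```

Note that each merge is irrevocable, which illustrates the "cannot undo previous steps" limitation mentioned above.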
Spectral clustering is another class of clustering methods, developed to address an issue of conventional clustering methods, namely, determining nonlinear discriminative hypersurfaces [51]. It initially builds an affinity matrix, representing the data as a weighted graph, in which the similarity between the m-th and n-th points is given by the value in the m-th row and n-th column. Note that for a weighted-graph representation of the data, this method usually benefits from weighted kernel k-means, which is a generalization of the k-means method. Thereafter, to group the data based on a given criterion, it uses the eigenvalues and eigenvectors of the matrix. Various types of similarity matrices can be used in this method, such as the Laplacian matrix. The k-means step of this algorithm has a time complexity of O(ndki), where n is the number of data points, d is the dimensionality of each point, and i is the number of iterations required for k-means to converge; the overall time complexity of spectral clustering is O(nzki), where z is the average number of nonzero entries per row of the similarity matrix. One main advantage of this method is that it does not impose a predetermined shape on the clusters because it derives an adjacency structure from the original dataset. On the other hand, one disadvantage is the demanding computation of the eigenvectors of the similarity matrix; in other words, it can become computationally expensive on large datasets.
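The affinity-matrix/eigenvector pipeline can be sketched as follows. This is a minimal unnormalized variant with a Gaussian kernel and a plain k-means step on the embedding; the kernel bandwidth, the farthest-point initialization, and the data are illustrative assumptions.

```python
import numpy as np

def spectral_clustering(X, k, sigma=1.0, n_iter=100):
    """Unnormalized spectral clustering sketch: Gaussian affinity ->
    graph Laplacian -> k smallest eigenvectors -> k-means on the rows."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq / (2 * sigma ** 2))        # affinity (similarity) matrix
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W            # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)               # eigenvalues in ascending order
    emb = vecs[:, :k]                         # spectral embedding of the points
    # k-means on the embedded rows, with farthest-point initialization.
    idx = [0]
    for _ in range(1, k):
        d = ((emb[:, None, :] - emb[idx][None, :, :]) ** 2).sum(axis=2).min(axis=1)
        idx.append(int(d.argmax()))
    centers = emb[idx]
    for _ in range(n_iter):
        labels = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        centers = np.array([emb[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
labels = spectral_clustering(X, k=2)
```

The dense eigendecomposition here is the expensive step noted in the text, which is why spectral methods struggle on large datasets.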
DBSCAN (which stands for Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering method: in a set of points, it groups together points with many nearby neighbors (points that are tightly packed together) [52]. Points whose nearest neighbors are far away are considered outliers (points that lie alone in low-density areas). The radius epsilon (ε) and the minimum number of points needed to create a cluster (η) are the inputs of this method. Initially, DBSCAN randomly selects an unvisited starting point. If a sufficient number of points lie within the ε-neighborhood of this point, a cluster is formed; otherwise, the point is labeled as noise. All points in the ε-neighborhood of a point found to be part of a cluster are also considered part of that cluster. A point initially labeled as noise may later be found to lie within the ε-neighborhood of a point with a sufficiently dense neighborhood and thereby become a member of a cluster. This procedure continues until a cluster is completely formed. Thereafter, a new unvisited point is chosen and processed in the same way, yielding the next cluster or noise. DBSCAN has a time complexity of O(n^2). DBSCAN handles outliers in the dataset well and is good at separating high-density clusters from low-density clusters. However, similar to other clustering methods, it suffers from some shortcomings; for example, it struggles with high-dimensional data and with clusters of similar (or varying) density.
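The expansion procedure above corresponds to the following textbook-style sketch (parameter values and data are illustrative):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Textbook DBSCAN: grow clusters from core points (at least
    min_pts neighbors within eps); unreachable points stay noise (-1)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.where(D[i] <= eps)[0] for i in range(n)]
    labels = np.full(n, -1)            # -1 = noise / unassigned
    visited = np.zeros(n, bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:
            continue                   # not a core point; may stay noise
        labels[i] = cluster
        queue = list(neighbors[i])     # expand the new cluster
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster            # border or core point joins
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    queue.extend(neighbors[j]) # core point: keep expanding
        cluster += 1
    return labels

# Two dense blobs plus one isolated outlier that should become noise.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(5, 0.3, (20, 2)),
               [[10.0, -10.0]]])
labels = dbscan(X, eps=1.0, min_pts=4)
```

The isolated point never accumulates enough neighbors to become a core point and is never reached from one, so it keeps the noise label.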
OPTICS (which stands for Ordering Points To Identify the Clustering Structure) is also a density-based method, based on the concept of maximal density-reachability [53]. Similar to DBSCAN, this method begins with a data point and grows its neighborhood. However, to solve the issue of DBSCAN with detecting meaningful clusters in data of varying density, it produces a (linear) ordering in which spatially nearest points become neighbors. Furthermore, it stores, for each point, a special distance representing the density that must be accepted for a cluster so that both points belong to the same cluster. OPTICS has a time complexity of O(n log n). One advantage of this method is its capability to handle clusters with irregular shapes and large density variations; a disadvantage is that it is computationally more expensive and slower to run than DBSCAN.
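A simplified sketch of the ordering and reachability-distance bookkeeping is shown below (this omits the cluster-extraction step of full OPTICS; the core-distance convention, parameters, and data are our own illustrative simplifications):

```python
import heapq
import numpy as np

def optics_order(X, eps, min_pts):
    """Simplified OPTICS: produce the cluster ordering and reachability
    distances; valleys of low reachability correspond to clusters."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Core distance: distance to the min_pts-th closest point
    # (counting the point itself); undefined (inf) beyond eps.
    core = np.sort(D, axis=1)[:, min_pts - 1]
    core[core > eps] = np.inf
    reach = np.full(n, np.inf)
    processed = np.zeros(n, bool)
    order = []
    for start in range(n):
        if processed[start]:
            continue
        heap = [(np.inf, start)]       # priority queue on reachability
        while heap:
            r, i = heapq.heappop(heap)
            if processed[i]:
                continue
            processed[i] = True
            order.append(i)
            if not np.isfinite(core[i]):
                continue               # not a core point: do not expand
            # Improve the reachability of i's eps-neighbors.
            for j in np.where(D[i] <= eps)[0]:
                if not processed[j]:
                    new_r = max(core[i], D[i, j])
                    if new_r < reach[j]:
                        reach[j] = new_r
                        heapq.heappush(heap, (new_r, j))
    return order, reach

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
order, reach = optics_order(X, eps=1.5, min_pts=4)
```

Unlike DBSCAN, no single density threshold is fixed up front: clusters at different densities appear as separate valleys in the reachability plot.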

B. PERFORMANCE MEASURES
One of the most crucial issues in cluster analysis is evaluating the quality of the produced clusters [54]. According to several studies in the literature (e.g., [22]), the measures used to evaluate the performance of a clustering method are categorized into two groups: internal and external performance metrics. In brief, internal metrics rely on information naturally belonging to the data to evaluate the appropriateness of the formed clusters (without class labels or external information), whereas external metrics measure the similarity between the results of clustering methods and predefined class labels (the correct partitioning). Some advantages of internal evaluation metrics are that they consider the compactness, connectedness, and separation of the clusters and that they rely only on the clustered data themselves, assigning the best score to the algorithm that produces clusters with high intracluster similarity and low intercluster similarity. On the other hand, internal metrics can be biased toward algorithms that use the same cluster model (e.g., k-means is a distance-based model, and distance-based criteria will overrate it), and while they can compare algorithms and highlight which performs better, they cannot provide further information about the validity of the produced results. For external measures, some advantages are that such indices rely on data that were not used for clustering (e.g., a gold standard created by humans), measure how similar the produced clusters are to the benchmark clusters, and make no assumption about the structure of the clusters.
As previously mentioned, this study uses three well-known internal and 14 external performance measures that have been widely employed in various fields of study (e.g., [22], [55]) to thoroughly and impartially evaluate the performance of different clustering algorithms on educational datasets of different sizes. The following sections briefly describe these performance metrics and the notation employed.
Notation. In this study, for a dataset D with N objects, a partition P = {P_1, . . . , P_K} of D is assumed, where ∪_{i=1}^{K} P_i = D and P_i ∩ P_j = ∅ for 1 ≤ i ≠ j ≤ K, with K the number of clusters. Provided that ''true'' class labels for the data are available, another partition C = {C_1, . . . , C_{K'}} on D can be used, where K' is the number of classes. Let the number of objects in cluster P_i from class C_j be denoted by n_ij; then, a contingency matrix, as shown in Table 1, can be used to record the overlap between the two partitions. The notation shown in this contingency matrix will be used throughout this paper.
The Dunn index is an internal performance metric that assesses the quality of the produced clusters by measuring both the intracluster diameter and the intercluster distance. This metric uses the respective equation shown in Table 2 to compute the DI for any number of clusters. In the equation, dist(P_i, P_j) denotes the distance between clusters P_i and P_j, where dist(P_i, P_j) is the minimum, over data points a_i ∈ P_i and a_j ∈ P_j, of the distance between a_i and a_j, and diam(P_l) is the diameter of cluster P_l, i.e., the maximum distance between any two data points in P_l. The Davies-Bouldin index is an internal performance metric that uses quantities and features inherent to the dataset to evaluate how well the clustering has been performed. The aim of this method is to minimize the average similarity between each cluster and the cluster most similar to it. The corresponding equation is given in Table 2.
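Both internal indices can be computed directly from these definitions. The following sketch (using illustrative synthetic data and a centroid-based scatter for Davies-Bouldin) contrasts a correct partition with a random one:

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn = (min inter-cluster distance) / (max cluster diameter);
    larger values indicate compact, well-separated clusters."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    ks = np.unique(labels)
    inter = min(D[np.ix_(labels == a, labels == b)].min()
                for a in ks for b in ks if a < b)
    intra = max(D[np.ix_(labels == c, labels == c)].max() for c in ks)
    return inter / intra

def davies_bouldin(X, labels):
    """DB = average over clusters of the worst (s_i + s_j) / d(c_i, c_j)
    ratio; smaller values indicate better clustering."""
    ks = np.unique(labels)
    cents = np.array([X[labels == c].mean(axis=0) for c in ks])
    # s_i: mean distance of cluster members to their centroid (scatter).
    s = np.array([np.linalg.norm(X[labels == c] - cents[i], axis=1).mean()
                  for i, c in enumerate(ks)])
    ratios = [max((s[i] + s[j]) / np.linalg.norm(cents[i] - cents[j])
                  for j in range(len(ks)) if j != i)
              for i in range(len(ks))]
    return float(np.mean(ratios))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
good = np.array([0] * 30 + [1] * 30)      # the true blob structure
bad = rng.integers(0, 2, 60)              # a random partition
```

On such data, the correct partition yields a high Dunn value and a low Davies-Bouldin value, while the random partition does the opposite.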
Silhouette is an internal validation index that indicates how well objects are placed within clusters. In brief, it calculates the silhouette width for each data point, along with the average silhouette width for each cluster and for the whole dataset. To do so, it employs the corresponding equation listed in Table 2 for computing the silhouette width of the i-th data point. In the equation, any object in the dataset is represented by i, and x(i) and y(i) are the average dissimilarity of i to the other objects in the same cluster P_1 and to all objects in the neighboring cluster P_2, respectively, where P_1 ≠ P_2 and a distance measure is used to calculate the dissimilarity. Since x(i) and y(i) measure the dissimilarity of i to its own and neighboring clusters, respectively, a good clustering method will have a Sil(i) value near 1. The average Sil(i) over the whole dataset is used for measuring the quality of the produced clusters.
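The per-point silhouette width and its dataset average can be sketched as follows, assuming Euclidean dissimilarity; names are illustrative rather than taken from Table 2.

```python
import numpy as np

def silhouette_width(X, labels, i):
    """Sil(i) = (y(i) - x(i)) / max(x(i), y(i)) for one data point.
    x(i): mean distance to the other points in i's own cluster;
    y(i): smallest mean distance to any other (neighboring) cluster."""
    d = np.linalg.norm(X - X[i], axis=1)
    own = labels[i]
    same = (labels == own)
    same[i] = False               # exclude the point itself
    if same.sum() == 0:           # convention: singleton clusters score 0
        return 0.0
    x = d[same].mean()
    y = min(d[labels == k].mean() for k in np.unique(labels) if k != own)
    return (y - x) / max(x, y)

def silhouette(X, labels):
    """Average silhouette width over the dataset (values near 1 are good)."""
    return np.mean([silhouette_width(X, labels, i) for i in range(len(X))])
```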
The Rand index is an external validity index that measures the amount of pairwise agreement between the set of class labels C and the formed clustering P. This metric is defined using the respective equation in Table 2. In the equation, w denotes the count of pairs of objects that have the same label in C and are assigned to the same cluster in P, and x denotes the count of pairs that have the same label in C but are assigned to different clusters. Moreover, y denotes the count of pairs in the same cluster that have different class labels, and z denotes the count of pairs that have different labels in C and were assigned to different clusters in P. This index results in a value between 0 and 1, where 1 indicates that C and P are identical.
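The four pair counts and the resulting index can be written out directly; this is a minimal sketch using the standard formula RI = (w + z) / (w + x + y + z), with our own function name.

```python
from itertools import combinations

def rand_index(C, P):
    """Rand index from the four pair counts described above.
    C: true class labels, P: cluster assignments (parallel sequences)."""
    w = x = y = z = 0
    for i, j in combinations(range(len(C)), 2):
        same_class = (C[i] == C[j])
        same_cluster = (P[i] == P[j])
        if same_class and same_cluster:
            w += 1        # agree: together in both partitions
        elif same_class and not same_cluster:
            x += 1        # same class, split by the clustering
        elif not same_class and same_cluster:
            y += 1        # different classes, merged by the clustering
        else:
            z += 1        # agree: apart in both partitions
    return (w + z) / (w + x + y + z)
```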
The Adjusted Rand index is an external performance metric that measures the percentage of correct decisions made by the algorithm while correcting for chance. This metric is defined using the corresponding equation in Table 2, where E(R) denotes the expected value of the index. The Jaccard index is an external performance metric that measures the similarity between two partitions. It is defined by the corresponding equation in Table 2. In the equation, x denotes the count of pairs of objects that have the same label in C and are assigned to the same cluster in P, y denotes the count of pairs with the same label in C that are assigned to different clusters, and j denotes the number of pairs of points within the same cluster that have different class labels. This index results in a value between 0 and 1, where 1 indicates that C and P are identical.
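One common contingency-matrix form of the chance-corrected index is sketched below; Table 2's exact notation may differ, and the function name is ours.

```python
from collections import Counter
from math import comb

def adjusted_rand(C, P):
    """Adjusted Rand index via the contingency-matrix formula:
    (sum_ij C(n_ij, 2) - E) / (max_index - E), which corrects the
    raw pair-agreement count for agreement expected by chance."""
    n = len(C)
    nij = Counter(zip(P, C))        # contingency counts n_ij
    a = Counter(P)                  # row sums: cluster sizes
    b = Counter(C)                  # column sums: class sizes
    sum_ij = sum(comb(v, 2) for v in nij.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Identical partitions score 1.0, while an uninformative clustering can score near or below 0.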
The F-measure is an external performance metric defined as the harmonic mean of the precision and recall, where the precision indicates the percentage of positive points in a cluster and the recall indicates the true positive rate. This metric is defined using the corresponding equation listed in Table 2.
The Fowlkes-Mallows index is an external performance metric that is widely employed to measure the similarity between two clusterings. This similarity can be measured between a benchmark classification and a clustering or between two hierarchical clusterings. This metric is defined using the corresponding equation listed in Table 2.
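Over pairs of points, the index is the geometric mean of pairwise precision and recall, FM = TP / sqrt((TP + FP)(TP + FN)); a minimal sketch with our own naming:

```python
from itertools import combinations
from math import sqrt

def fowlkes_mallows(C, P):
    """Fowlkes-Mallows index: TP / sqrt((TP+FP)(TP+FN)), where a
    'positive' decision is placing a pair of points in one cluster."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(C)), 2):
        same_class, same_cluster = C[i] == C[j], P[i] == P[j]
        if same_cluster and same_class:
            tp += 1       # pair correctly placed together
        elif same_cluster:
            fp += 1       # pair wrongly placed together
        elif same_class:
            fn += 1       # pair wrongly separated
    return tp / sqrt((tp + fp) * (tp + fn))
```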
The Mutual Information is an external performance metric that measures the similarity between two labelings of the same data. This metric is defined using the corresponding equation listed in Table 2.
The Normalized Mutual Information is an external performance metric that normalizes the mutual information score so that the results scale between 0 and 1, where 0 denotes no mutual information and 1 refers to perfect correlation. This metric is defined using the corresponding equation listed in Table 2. In the equation, P and C denote the cluster and class labels, respectively, H(P) and H(C) refer to the entropy of P and C, respectively, and MI(P, C) denotes the mutual information between P and C. The entropy is expressed as H(N) = −Σ_{i=1}^{N} P(n_i) log_2 P(n_i). The Adjusted Mutual Information is an external performance metric that adjusts the mutual information (MI) score by considering chance normalization. In other words, it accounts for the fact that MI usually yields a higher score for a larger number of clusters, regardless of whether more information is actually shared. This metric is defined using the corresponding equation listed in Table 2.
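The entropy, mutual information, and their normalized ratio can be sketched as below. We normalize by sqrt(H(P) H(C)); other normalizations (mean or max of the entropies) also appear in the literature, and Table 2 may use a different variant.

```python
from collections import Counter
from math import log2, sqrt

def entropy(labels):
    """H = -sum p log2 p over the label distribution (in bits)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def mutual_info(P, C):
    """MI(P, C) from the joint and marginal label distributions."""
    n = len(P)
    joint = Counter(zip(P, C))
    pc, cc = Counter(P), Counter(C)
    return sum((nij / n) * log2(n * nij / (pc[p] * cc[c]))
               for (p, c), nij in joint.items())

def nmi(P, C):
    """Mutual information normalized into [0, 1] by sqrt(H(P) * H(C))."""
    return mutual_info(P, C) / sqrt(entropy(P) * entropy(C))
```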
The Rogers and Tanimoto index is an external performance metric that computes the similarity coefficient of two clusterings of the same dataset from the observations' comemberships (a comembership refers to a pair of observations that are partitioned together). This metric is defined using the corresponding equation listed in Table 2. In the equation, w and x denote the counts of pairs of objects that have the same label and are assigned to the same cluster and to different clusters, respectively. Moreover, y denotes the count of pairs in the same cluster that have different class labels, and z denotes the count of pairs that have different labels and were assigned to different clusters.
The Variation of Information index is a distance measure between two clusterings that is closely related to the mutual information, being a straightforward linear expression involving it. Unlike the mutual information, the variation of information obeys the triangle inequality, and it measures the information lost or gained when changing from clustering P_i to P_j. This metric is defined using the corresponding equation listed in Table 2.
The Geometric accuracy is obtained by calculating the geometric mean of the sensitivity and the positive predictive value. This metric is basically concerned with balancing the sensitivity and the predictive value, which reflect two conflicting inclinations of clustering: usually, when objects of different and similar complexity are partitioned into the same cluster, the positive predictive value decreases and the sensitivity increases. This metric is defined using the corresponding equation listed in Table 2.
The Overlapping Normalized Mutual Information is an extension of the NMI developed to handle overlapping partitions. This metric is defined using the corresponding equation listed in Table 2.
The Purity index is a straightforward external performance metric for measuring the quality of clustering. In this index, the purity is calculated by assigning each cluster to the most frequent class in the cluster. Thereafter, the accuracy of this assignment is measured by counting the correctly assigned objects and dividing by the total number of objects. This metric is defined using the corresponding equation listed in Table 2.
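The two steps just described (majority class per cluster, then the fraction of matches) translate directly into code; the function name is ours.

```python
from collections import Counter

def purity(C, P):
    """Assign each cluster its most frequent class, count the matching
    objects, and divide by the total number of objects."""
    n = len(C)
    clusters = {}
    for c, p in zip(C, P):
        clusters.setdefault(p, []).append(c)      # group classes by cluster
    return sum(Counter(members).most_common(1)[0][1]
               for members in clusters.values()) / n
```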
The Completeness index is an external performance metric that indicates a clustering result's completeness. In other words, a clustering result satisfies completeness when all of the objects that are members of a given class are assigned to the same cluster. This metric is defined using the corresponding equation listed in Table 2.

C. MULTIPLE CRITERIA DECISION-MAKING APPROACH FOR THE EVALUATION OF CLUSTERING METHODS
The performance of clustering methods can differ significantly for different types of data and applications because they usually operate in spaces with different dimensions and should address incomplete, noisy, and sampled data. Multiple clustering methods have therefore been developed for such reasons. Moreover, a large number of performance measures (internal and external) have been proposed for the evaluation of different clustering methods, which makes the selection of the most appropriate clustering methods for a given dataset or problem a difficult challenge.
To address this challenge and comprehensively evaluate the performance of different clustering methods on educational datasets, this study models the cluster evaluation task as a multiple criteria decision-making (MCDM) problem and accordingly proposes a novel approach for the evaluation of clustering methods in an educational context. In our empirical approach, this study examines the performance of seven different clustering methods (from different families) on 18 educational datasets of different sizes to generate two-way clustering results. Thereafter, seventeen performance metrics are used to measure the performance of the produced clustering results. This study then employs a well-known MCDM method (called TOPSIS, see the following section), which takes the performance measures as inputs, to rank each clustering algorithm for each dataset as a means of validating our evaluation approach (some examples of TOPSIS applications can be found in [56], [57]). This makes it possible to find and recommend the most suitable clustering method for different educational datasets, which can serve the needs of practitioners, researchers, and decision makers (see Fig. 1).

1) TOPSIS
There exist several MCDM methods for evaluating various contradictory criteria in decision making; among them, the technique for order preference by similarity to an ideal solution (TOPSIS) is the method most frequently used by researchers in different disciplines [58]. Even though some studies considered using more than one MCDM method for their decision making (e.g., the VIKOR and DEA methods), they found that in many cases, some of these methods do not produce reliable results across different problem domains, and the output mostly depends on the application domain and the problem at hand [59]. However, a significant number of studies, conducted in different domains, concluded that the TOPSIS method is among the strongest decision-making methods and can be applied to many different domains and problems [23]. TOPSIS is therefore selected to evaluate clustering algorithms in our approach. This technique, developed by [60], was first used to rank alternatives across many criteria. Basically, by minimizing the distance to the ideal solution and maximizing the distance to the negative-ideal solution, TOPSIS aims to find the best alternatives. Below is a step-by-step description of the TOPSIS process used in this study. In the first step, the normalized decision matrix is computed using Equation (1), which transforms the various attribute dimensions into nondimensional attributes, making it possible to compare attributes: r_ij = x_ij / sqrt(Σ_{j=1}^{n} x_ij²), i = 1, . . . , m; j = 1, . . . , n. (1)
In this equation, i and j index the criteria (the evaluation metrics here) and the alternatives (the clustering methods here), respectively, while x_ij represents the value of the i-th criterion C_i for the alternative A_j.
In the second step, the weighted normalized decision matrix is computed by constructing a set of weights (w_i) for the criteria. The weighted normalized value v_ij is computed as v_ij = w_i r_ij, j = 1, . . . , n; i = 1, . . . , m, (2) where w_i is the weight of the i-th criterion and Σ_{i=1}^{m} w_i = 1. In the third step, both the ideal and negative-ideal solutions are found using Equations (3) and (4): A* = {v_1*, . . . , v_m*} = {(max_j v_ij | i ∈ I), (min_j v_ij | i ∈ J)}, (3) A⁻ = {v_1⁻, . . . , v_m⁻} = {(min_j v_ij | i ∈ I), (max_j v_ij | i ∈ J)}, (4) where I = {i = 1, 2, . . . , m | i associated with benefit criteria} and J = {i = 1, 2, . . . , m | i associated with cost criteria}. This experiment, which aims to evaluate clustering methods, considers the variation of information and the DI as cost criteria that require minimization and all of the other metrics as benefit criteria that require maximization.
In the fourth step, the separation measures from both the ideal and negative-ideal solutions are calculated using Equations (5) and (6): S_j* = sqrt(Σ_{i=1}^{m} (v_ij − v_i*)²) and S_j⁻ = sqrt(Σ_{i=1}^{m} (v_ij − v_i⁻)²), j = 1, . . . , n. In the fifth step, the ratio that measures the relative closeness to the ideal solution is computed using Equation (7): C_j* = S_j⁻ / (S_j* + S_j⁻), with 0 < C_j* < 1. In the final step, the alternatives are ranked by maximizing the ratio C_j*. The time complexity of TOPSIS has three stages: the attribute normalization and weighting is O(n²); computing the positive and negative ideal solutions and the distances to them is O(n); and ranking the results is O(1).
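As an illustration, the six steps above can be sketched in a short numpy implementation. This is a minimal sketch, not the authors' code; the function and parameter names are ours, and we orient the matrix as alternatives in rows and criteria in columns.

```python
import numpy as np

def topsis(X, weights, benefit):
    """Rank m alternatives (rows of X) on n criteria (columns).
    weights: length-n array summing to 1; benefit[j] is True for a
    criterion to maximize and False for a cost criterion to minimize."""
    # Step 1: vector-normalize each criterion column (Equation (1))
    R = X / np.sqrt((X ** 2).sum(axis=0))
    # Step 2: apply the criterion weights (Equation (2))
    V = R * weights
    # Step 3: ideal and negative-ideal solutions (Equations (3)-(4))
    ideal = np.where(benefit, V.max(axis=0), V.min(axis=0))
    anti = np.where(benefit, V.min(axis=0), V.max(axis=0))
    # Step 4: Euclidean separation from each solution (Equations (5)-(6))
    s_pos = np.linalg.norm(V - ideal, axis=1)
    s_neg = np.linalg.norm(V - anti, axis=1)
    # Step 5: relative closeness to the ideal solution (Equation (7))
    closeness = s_neg / (s_pos + s_neg)
    # Step 6: rank alternatives by descending closeness
    return np.argsort(-closeness), closeness
```

With two benefit criteria and one alternative dominating the other, the dominant alternative receives closeness 1 and rank first.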

IV. EXPERIMENTS
This section first describes the datasets developed for this study and then presents our proposed approach in the form of an algorithm that automates the whole process of comparing and selecting the best algorithm for each dataset. Finally, the results from our experiment are discussed to validate our proposed evaluation approach.

A. DATASETS AND DATA PREPROCESSING
To properly compare clustering methods, several datasets with various sizes and numbers of features are required. Several studies in different fields have used artificial datasets to compare clustering methods. However, the results and guidelines from most of these studies are not applicable to real-world datasets, because such datasets mostly lack the normal distribution of data samples, as characterized by certain features that are divided into certain classes. This experiment used students' activity data extracted from the University of Tartu's Moodle system. To create our datasets, for each course, different types of data were extracted, including the number of times the course resource was viewed, the number of times the course modules were viewed, the number of times the course materials were downloaded, the number of times the feedback was viewed, the number of times the feedback was received, the number of times a forum discussion was viewed, the number of attempts at quizzes, the number of times a discussion was created in a forum, the number of times book chapters were viewed, the number of times a book list was viewed, the number of times an assignment was submitted, the number of times an assignment was viewed, the assignment grade, the quiz grades, the number of times a discussion was viewed in a forum, the final grade, the number of times a post was created in a forum, the number of times comments were viewed, the number of times posts were updated in a forum, and the number of posts. Using the extracted data, this study created nine different datasets with different student numbers and feature sizes. For example, the first dataset contains ten features with a small number of students, the second dataset includes ten features with a medium number of students, and the third dataset contains ten features with a large number of students.
In addition to these nine datasets, a normalized version of each dataset was generated using the min-max normalization technique to investigate the possible effect of data normalization on clustering algorithm performance. This normalization technique is used in this study because the attributes are mostly measured on different scales. The technique maps the minimum and maximum values of X to 0 and 1, respectively, so that the entire range of values of X is mapped to the range 0 to 1. Note that this study used more than one course with similar attributes to generate the datasets with a large number of students. Additionally, before applying the clustering methods, according to the guidelines provided by the handbook of EDM [61], the authors performed data cleaning and preprocessing to address the sparsity issue and reduce the effect of noise and outliers. Furthermore, a filter-based feature selection approach was used to determine the high-potential variables to form our datasets for the experiment (small, medium, and high numbers of features). The description of the datasets is given in Table 3.
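The min-max mapping just described is a one-liner per feature column; a minimal sketch (the guard for constant columns is our addition):

```python
import numpy as np

def min_max_normalize(X):
    """Map each feature column of X to [0, 1] via
    x' = (x - min) / (max - min); constant columns map to 0."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    rng = np.where(mx > mn, mx - mn, 1.0)   # avoid division by zero
    return (X - mn) / rng
```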

B. THE PROPOSED EVALUATION APPROACH
Algorithm 1 illustrates our proposed generalizable approach to systematizing the process of comprehensively evaluating clustering algorithms and selecting the most suitable algorithm for a given problem. Fig. 1 also shows the framework of our proposed approach. Table 4 shows the values of the performance metrics for the clustering algorithms on the datasets with a small number of features and a large number of students (SF-LS), a medium number of features and a small number of students (MF-SS), and large numbers of features and students (LF-LS). These three datasets are presented as a sample (datasets with small, medium, and large feature numbers), and the full list of performance measure results is provided in the Appendix (Tables 6 and 7). As is apparent, no clustering algorithm reached the best performance on all metrics for any dataset. This circumstance clearly implies that, when evaluating clustering algorithms, more than a single performance measure should be employed in order to find the best-performing algorithm for the problem at hand.
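The overall loop of Algorithm 1 (run every clustering algorithm, score it with every metric, and hand the resulting decision matrix to TOPSIS) can be sketched as follows. All names here are illustrative, not the paper's; the algorithms, metrics, and ranking function are passed in as callables.

```python
def select_best_algorithm(dataset, algorithms, metrics, rank_fn):
    """Sketch of the evaluation loop: build an (algorithms x metrics)
    decision matrix, then let a ranking function (e.g., TOPSIS) pick
    the best alternative."""
    scores = []
    for fit in algorithms:                         # each returns labels
        labels = fit(dataset)
        scores.append([m(dataset, labels) for m in metrics])
    ranking = rank_fn(scores)                      # best alternative first
    return ranking[0]
```

In the paper's setting, `algorithms` would be the seven clustering methods, `metrics` the seventeen performance measures, and `rank_fn` the TOPSIS ranking.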

V. RESULTS AND DISCUSSION
Table 5 summarizes the rankings generated by the TOPSIS method for the different algorithms on the different datasets. In addition to the best-performing algorithm, the second- and third-ranked methods are considered where necessary, as in some cases the differences among the first three ranked methods were small. Therefore, this approach can be used, in general, to highlight and recommend one or two best-performing algorithms among all seven algorithms for each dataset.
According to Table 5 and Fig. 2a, for the dataset with small numbers of features and students, spectral clustering strongly outperforms the other algorithms, while OPTICS tends to be the lowest performing method. For the dataset with a small number of features and a medium number of students, DBSCAN, followed by spectral clustering, achieves the highest and second-highest performance, respectively, whereas EM, followed by k-means, shows the weakest performance. For datasets with a small number of features and a large number of students, k-medoids and spectral clustering outperform the other methods and are ranked first and second among the algorithms, respectively, while agglomerative clustering shows the lowest performance. Thus, regardless of the number of students, in datasets with a small number of features, spectral clustering appears to be a better and more stable option than the other clustering methods.
For datasets with a medium number of features and a small number of students (see Fig. 2b), our findings show that agglomerative clustering and k-medoids are the best and worst methods, respectively. For datasets with a medium number of features and either a medium or large number of students, spectral clustering outperforms the other algorithms, while EM tends to have the weakest performance. Considering the second and third ranked methods in datasets with a medium number of features and a small number of students (which are not very different from each other), it can be said that overall, the spectral algorithm has the best performance and is a stable clustering method for datasets with a medium number of features, regardless of the number of students. Moreover, it is apparent that EM mostly tends to perform poorly on datasets with a medium number of features. The findings from datasets with small and medium numbers of features (regardless of the number of students) are in line with [22], which concluded that spectral clustering outperforms several other clustering methods due to its ability to define nonlinear discriminative hypersurfaces.
Finally, considering both the first and second ranked methods in datasets with a large number of features (20 features here), it can be said that the OPTICS algorithm has the best performance and is a stable clustering method regardless of the number of students (see Fig. 2c). Furthermore, spectral clustering mostly does not perform well in datasets with a large number of features, regardless of the number of students. OPTICS thus can be considered the recommended option for datasets with a large number of features. Similar to spectral clustering, which shows limited performance on large datasets, some other methods, including agglomerative clustering, appear to perform poorly. These findings are in line with those reported, in disciplines other than education, by [34], [35], and [37], which concluded that hierarchical clustering methods (e.g., agglomerative) perform poorly on large datasets, leading to low accuracy in experiments.
Considering the performance of all of the algorithms on the different datasets, it can be said that spectral clustering is preferred in datasets with a small or medium number of features (15 or fewer), regardless of the number of students. However, it tends to perform poorly in datasets with a large number of features (more than 15), for which OPTICS strongly outperforms the other methods and appears to be the most stable algorithm. Surprisingly, k-means, which has been frequently employed in EDM and LA research scenarios that mostly lacked comparisons of clustering algorithms, was never ranked best on any dataset. These findings can serve to notify the educational research community to take such comparison tasks more seriously and possibly take more measures before instinctively choosing a clustering algorithm, e.g., k-means. In this regard, our findings are consistent with those reported by researchers in different disciplines, e.g., [62].

According to Table 5 and Fig. 3a, for the normalized dataset with small numbers of features and students, DBSCAN outperforms the other algorithms, while spectral clustering tends to be the lowest performing method. For the normalized dataset with a small number of features and a medium number of students, OPTICS and agglomerative clustering have the best and worst performance, respectively. For normalized datasets with a small number of features and a large number of students, k-medoids outperforms the other methods, while spectral clustering is among the lowest performing methods. Thus, regardless of the number of students, in normalized datasets with a small number of features, DBSCAN, OPTICS, and k-medoids appear to be better options than the other clustering methods.
Although spectral clustering and DBSCAN were strongly affected by normalization, the results for the other methods are fairly similar to those on the nonnormalized datasets, with a slight improvement in performance: normalization worsens the performance of spectral clustering and improves the performance of all of the other methods.

For the normalized datasets with a medium number of features, regardless of the number of students, considering both the first and second ranks, our findings show that agglomerative and spectral clustering are the best and worst methods, respectively (see Fig. 3b). Here as well, almost all of the clustering algorithms improved their performance under normalization, except for spectral clustering; k-means, EM, and agglomerative clustering appear to benefit strongly from normalization. In nonnormalized datasets with a medium number of features, spectral clustering showed the best performance, while in normalized datasets with a medium number of features, agglomerative clustering ranked best. Therefore, it is fair to say that spectral and agglomerative clustering are among the high-performing algorithms in nonnormalized and normalized datasets with a medium number of features (i.e., 15 features), respectively.

Finally, considering both the first and second ranked methods in normalized datasets with a large number of features (20 features here), regardless of the number of students, it can be said that the k-medoids algorithm has the highest performance and is the most stable clustering method (see Fig. 3c). Furthermore, OPTICS mostly does not perform well in normalized datasets with a large number of features, including the normalized dataset with a large number of features and a small number of students. This finding also indicates a negative effect of normalization on the OPTICS method.
Interestingly, the spectral algorithm, which was negatively affected by normalization in both a small and medium number of features, shows better performance with normalized datasets that have a large number of features. In contrast, OPTICS and DBSCAN appear to be mostly negatively affected by normalization when they address datasets with a large number of features. Last, the k-medoids approach tends to benefit strongly from the normalization of datasets with a large number of features. In nonnormalized datasets with a large number of features, OPTICS was shown to have the best performance, while in normalized datasets with a large number of features, k-medoids is ranked the best. Therefore, it is fair to say that OPTICS and k-medoids are among the highest performing algorithms in nonnormalized and normalized datasets with a large number of features (more than 15), respectively.
In brief, these findings suggest that normalization can have a negative effect on the performance of the spectral algorithm in datasets with 15 or fewer features; however, it appears to mostly have a positive impact on several other clustering methods. This result is in line with those reported by other researchers, e.g., [29], who state that spectral clustering, in general, appears to be sensitive to different measures. On the other hand, methods such as OPTICS and agglomerative clustering tend to benefit more from normalization in datasets with a smaller number of features (15 or fewer), whereas k-medoids clustering benefits more from normalization in datasets with more than 15 features. Consequently, our findings reveal that in normalized datasets with more than 10 features, the agglomerative and k-medoids methods are preferred, while in datasets with 10 or fewer features, different clustering methods could be employed, and in general, the differences between the performance of the clustering methods are not considerable.

VI. CONCLUSION AND FUTURE DIRECTIONS
While several clustering techniques have been employed in an educational context, it remains controversial which technique performs best for a given dataset. Even though a growing number of researchers highlight the importance of comparing and identifying the most suitable clustering methods for the problem at hand, many educational studies ignore this critical task. This limitation could potentially have a negative effect on the prediction power of both EDM and LA approaches. This study models the evaluation task as a multiple criteria decision-making problem and employs the TOPSIS method to rank the results produced by multiple internal and external performance measures when applied to several clustering algorithms. This approach helps to automatically recommend the best clustering methods for each dataset. In this study, 18 different datasets of different sizes (created from the University of Tartu's Moodle system) were evaluated to not only find the most suitable clustering method for each dataset but also investigate the possible effect of using different sizes and types of educational datasets on the performance of clustering algorithms in two-way clustering. Furthermore, this study examined the effect of normalization on the quality of the clusters produced by the different clustering methods.
Our findings reveal that for datasets with 15 or fewer features, regardless of the number of students, spectral clustering appears to be more robust and outperforms the other clustering methods. However, it tends to perform poorly in datasets with a large number of features (more than 15). This finding implies that spectral clustering is sensitive to increases in the number of features and the data size. Concerning the most suitable clustering method for datasets with a large number of features, OPTICS strongly outperforms the other methods and appears to be the most stable algorithm. Unexpectedly, the k-means method, which has been frequently employed by researchers in educational research (both EDM and LA research), could not rank first on any of the datasets. These findings could motivate the educational community to pay more attention to the evaluation task when using clustering methods, because using a low-performing method could reduce the prediction power of their EDM and LA approaches.
With regard to the effect of normalization on the performance of clustering methods, this study found that normalization can have a negative effect on the performance of the spectral algorithm in datasets with 15 or fewer features; however, it appears to have a positive impact on all of the other clustering methods. More explicitly, the spectral algorithm, which was negatively affected by normalization for both small and medium numbers of features, shows better performance with normalized datasets that have a large number of features. Methods such as OPTICS and agglomerative clustering tend to benefit more from normalization in datasets with a smaller number of features (15 or fewer), whereas k-medoids benefits more from normalization in datasets with more than 15 features. In conclusion, it can be said that, in general, normalization can have a negative effect on the performance of certain methods, e.g., spectral clustering and OPTICS; however, it mostly appears to have a positive impact on all of the other clustering methods.
A few limitations of this study could be regarded as directions for future research. For example, the proposed evaluation approach was applied to rather small datasets for two-way clustering. In future work, the proposed automatic approach could be applied to larger educational datasets with a larger value of k. Another direction for future work would be to employ more advanced or heuristic clustering methods that are capable of automatically reducing the effect of noise and outliers in large datasets, while addressing the issue of dimensionality and offering a good level of computational performance ([13], [14]). Finally, another interesting direction for future work would involve implementing similar experiments on more datasets from different educational sources, such as games or intelligent tutoring systems, to compare with our findings from this study and provide broader recommendations and guidelines to researchers and practitioners in the educational community.

MARGUS PEDASTE (Senior Member, IEEE) received the master's degree in biology education and the Ph.D. degree in biology and earth science education from the University of Tartu, Tartu, Estonia. He is currently a Professor of educational technology with the University of Tartu. His research interests include the effect of educational technologies, virtual and augmented reality, science education, and teacher education and agency.

YUEH-MIN HUANG (Senior Member, IEEE) received the M.S. and Ph.D. degrees in electrical engineering from the University of Arizona, in 1988 and 1991, respectively. He is currently the Chair Professor with the Department of Engineering Science, National Cheng Kung University, Taiwan. He has coauthored three books and has published more than 250 refereed journal research articles. His research interests include e-learning, multimedia communications, wireless networks, and artificial intelligence.