An Automated Process for the Repository-Based Analysis of Ontology Structural Metrics

Quantitative metrics are generally applied by scientists to measure and assess the properties of data and knowledge resources. In ontology engineering, a number of metrics have been developed in the last few years to analyse different features of ontologies. However, this community has not generated any standard framework for studying the properties of ontologies, nor sufficient knowledge about the usefulness and validity of these metrics as measurement instruments for evaluating and comparing ontologies. Recently, 19 ontology structural metrics were studied using the OBO Foundry and AgroPortal ontology repositories. This study was based on how each metric partitioned the two datasets into five groups by applying the $k$-means algorithm. The results suggested that the use of five clusters for every metric might be suboptimal. In this paper, we propose an automated process for the study of ontology structural metrics that includes the selection of an optimal number of clusters for each metric. This optimal number is automatically obtained by using statistical properties of the generated clusters. Moreover, the cosine similarity is used for estimating the similarity of two repositories from the perspective of the behaviour of the same set of metrics. The results on the two datasets allow for a more realistic perspective on the behaviour of the metrics. In this paper, we show and discuss the difference observed in the comparative behaviour of the metrics on the two repositories when using the optimal number of clusters with respect to a predetermined number for every metric. The proposed method is not specific to ontology metrics and can therefore be applied to other types of metrics.


I. INTRODUCTION
Semantic web technologies and ontologies are widely used in the development of knowledge-based systems [1]- [7]. Their success lies in the combination of four main features present in almost all ontologies: standard identifiers for classes and relations that represent the phenomena within a domain; a vocabulary for a domain; metadata that describe the intended meaning of the classes and relations; and machine-readable axioms and definitions that enable computational access to some aspects of the meaning of classes and relations [8].
The associate editor coordinating the review of this manuscript and approving it for publication was Juan A. Lara.
The development of hundreds of biological and biomedical ontologies has resulted in the need for repository-based initiatives to facilitate the finding and sharing of biomedical knowledge. Examples of such repositories are AgroPortal [9], OBO Foundry [10], BioPortal [11], OntoBee [12], Ontology Lookup Service (OLS) [13] or AberOWL [14].
Consequently, in these repositories, researchers can find different ontologies covering similar biological subdomains, and they need support in making informed decisions about which to use, the differences between the ontologies, and so on. The ontology community has recognised the need for reference methods to measure the quality of ontologies [15], which could contribute to guiding users in their decisions, but there is no community agreement thus far. In the last few decades, different types of approaches for evaluating the quality of ontologies have been proposed:

• Qualitative approaches: These approaches are based on the application of qualitative criteria. A diagnostic task based on ontology descriptions, using three categories of criteria (structural, functional and usability profiling), was proposed in [16]. In [17], an approach that applies four qualitative criteria (philosophical rigour, ontological commitment, content correctness, and fitness for a purpose) was proposed.
• Quantitative approaches: These approaches use quantitative metrics for measuring ontology properties [18]- [25]. They vary in the number and type of metrics used, which means that each approach may be interested in the quantitative assessment of the concrete facets of ontology quality. This has been the most active area in the last few years.
• Principles and guidelines: These approaches propose to follow a series of recommendations to ensure the quality of the ontology. Examples are the OBO Foundry principles [10] or the MIRO guideline [26].

In this work, we were interested in quantitative metrics, as they enable objective and reproducible evaluation processes. The relevance of quantitative metrics has been sufficiently demonstrated in the literature in the last few decades for different fields, including software engineering [27], [28] and recruitment [29]. Unfortunately, the metrics displayed by ontology repositories are limited, and each repository includes different metrics.
In our case, we believe that increasing the knowledge about the usage of metrics can help to increase the knowledge about the engineering of ontologies and to define effective methods for linking the values of the metrics to the structural properties of ontologies. Our belief is reinforced by recent works which have started to apply quantitative metrics to ontology design patterns [30]. This is why in the last few years, we have been studying and learning how quantitative metrics can be applied for analysing and understanding the structure and engineering of ontologies. Early works studied how to combine quantitative metrics to analyse the features of ontologies [19], [31] and how to use them to study the evolution of ontologies from the metrics perspective [32]. The main target of these works was the analysis of ontologies.
In particular, every ontology was assigned a rating between 1 and 5 for every metric. In [31], [32], such a rating was based on the application of best practices. In [32], as all the versions of an ontology could be considered a repository, the rating was based on the distribution of the data in the repository. One feature of this approach is the use of five categories for every metric. After these works, an important question remained unanswered: which metrics provide a better classification of ontologies so that they can be more useful for discriminating between ontologies? In [33], a method based on public repository data for answering such questions was presented by applying 19 ontology structural metrics to the OBO Foundry repository and to AgroPortal. The behaviour of some metrics was the same in both the repositories, but it changed for the others. To the best of our knowledge, this past study was the first one that focused on the reliability of ontology metrics. In that study, the behaviour was measured by the stability and the goodness of the clustering, two statistical properties of the five clusters into which the ontology repository was partitioned for a given metric. However, the study had the limitation of using five clusters for every metric, which might not be optimal.
In the present study, we addressed the abovementioned limitation by defining a method for selecting the optimal number of clusters for each metric, and analysed the behaviour of each metric by using the OBO Foundry and AgroPortal datasets used in [33]. The optimal number of clusters for a metric might vary for different datasets or repositories. Many validity indexes have been considered for analysing the clustering result with a prefixed number k, and different rules have been formulated to choose a k value within a plausible range of cluster numbers. However, each rule is based on a specific index, and no standard consensus on the best validity index and rule was found in the literature. Our main aim was to search for the optimal number of clusters by combining two validity indexes: the stability and the goodness of the clustering. Hence, the first contribution of this work was an automated process for analysing ontology structural metrics for classifying the ontologies included in a repository. Note that a certain metric was studied in [33] using five clusters in each repository, whereas in the present study, the optimal number of clusters for the same metric might differ between repositories. Metrics with similar behaviour in different repositories could be the most appropriate for standardisation. This led to the second contribution, as the characterisation of the metrics described the content of the repository. Finally, we proposed a metric which measured the similarity between repositories on the basis of the behaviour of the metrics. The proposed methods were applied to ontology metrics and repositories in this study, which provided new insights in the field of ontology evaluation methods. However, these methods are generic, so they can be applied to study the behaviour of other quantitative metrics.

A. ONTOLOGY METRICS AND THEIR PROCESSING
In this work, we studied a set of 19 ontology structural metrics, M = {M_1, . . . , M_19}, described in Table 7. These metrics accounted for different features of an ontology (e.g. cohesion, multiple inheritance, ratio of properties per class, and ratio of annotations per class). All the metrics included in this work were quantitative. Therefore, for every metric, we could define a function f(x) with an ontology as its domain and the raw values of the metric as its range. Note that the metrics might have different ranges of raw values. Our method was designed for application to a repository, defined in this work as a set of ontologies θ = {θ_1, . . . , θ_n}. Hence, once the raw values for a given metric were computed for all of the ontologies in the repository, we applied a clustering algorithm to partition the ontologies into k non-empty and non-overlapping categories. The clustering could be seen as a function n(f(x)), which generated an ordered factor of k categories. Furthermore, n(f(x)) partitioned the range of f(x) into k non-prefixed continuous intervals that contained all of the observed samples in the experimental data. This process was repeated for each metric. The results of the process for partitioning the range of f(x) into k intervals were therefore metric and dataset dependent, as the clusters were dynamically generated for a given dataset. Each ontology would then be associated with a particular cluster for each metric. The assessment of all of the ontologies included in the same cluster for the feature measured by a certain metric would be the same. This could also serve to perform a qualitative evaluation of the ontologies, but this was out of the scope of the present study. Our focus was on the metrics and on the repositories. Fig. 7 describes the process applied to a repository of ontologies for a certain value of k and for each metric. First, the raw values for all the ontologies and for all the metrics were obtained.
Second, the clustering algorithm performed, for each metric, the scaling of the raw values of the ontologies into the k categories.
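As a minimal illustration of this scaling step, the following sketch partitions the raw scores of one metric into k ordered categories with a naive 1-D k-means. This is not the implementation used in the paper, and the raw scores are hypothetical:

```python
def kmeans_1d(values, k, iters=100):
    """Partition raw 1-D metric scores into k ordered categories.

    Naive k-means: centres are seeded at evenly spaced order
    statistics, then points are reassigned to the nearest centre
    until the centres stop moving.
    """
    vals = sorted(values)
    centers = [vals[i * (len(vals) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vals:
            j = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[j].append(v)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return clusters

# Hypothetical raw scores of one metric over a small repository
raw = [0.02, 0.05, 0.07, 0.40, 0.45, 0.48, 0.90, 0.95, 0.97]
categories = kmeans_1d(raw, k=3)
# Each ontology is now assigned to one of the 3 ordered intervals
```

Because the data are one-dimensional and the initial centres are sorted, the resulting clusters correspond to ordered, non-overlapping intervals of the metric's range, as required above.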

B. CLUSTERING METHODS
There are different clustering algorithms for the partitioning of objects into a predefined number k of non-empty and nonoverlapping clusters [34], [35]. Here, we applied the k-means clustering method, which is one of the most well-known and widely used clustering methods [36]. Such an algorithm aims to find the partitioning by maximising both the compactness of the objects within a cluster and the separability between the clusters.
Nevertheless, our algorithm for selecting the optimal k could be applied to any partitioning method, such as the Partitioning Around Medoids (PAM) and Clustering LARge Applications (CLARA) algorithms [37]. The PAM method minimises the sum of the dissimilarities of the objects in the dataset to their nearest medoid (the most centrally located object of each cluster), which produces spherical clusters. The CLARA method takes a sample of objects from the dataset in which PAM is applied to find medoids and to compute the best clustering as a result.
Furthermore, accelerated techniques for these clustering methods have recently been developed to reduce the computational cost [38]. These improvements are mainly related to the BUILD initialisation and SWAP refinement algorithms, which in the PAM method consist of the following: (1) the complexity of the BUILD phase for choosing an initial clustering is reduced from O(kn^2) to O(kn), and (2) the workload of finding the closest new medoids in the SWAP refinement is reduced from O(k(n − k)^2) to O((n − k)^2). As CLARA uses PAM, CLARA is also improved by these accelerated techniques. Both will also be applied in the comparison of the different clustering methods for selecting the optimal k for each metric.
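For intuition, the PAM objective (minimising the total dissimilarity of the objects to their nearest medoid) can be sketched by exhaustive search on a tiny 1-D dataset. This toy version ignores the BUILD/SWAP phases and their accelerations, and the data are hypothetical:

```python
from itertools import combinations

def pam_exhaustive_1d(values, k):
    """Toy PAM: try every k-subset of points as medoids and keep the
    one minimising the sum of distances to the nearest medoid."""
    def cost(medoids):
        return sum(min(abs(v - m) for m in medoids) for v in values)
    best = min(combinations(sorted(set(values)), k), key=cost)
    return list(best)

raw = [1.0, 1.1, 1.2, 5.0, 5.2, 9.8, 10.0, 10.1]
medoids = pam_exhaustive_1d(raw, k=3)  # one central object per group
```

Exhaustive search is exponential in k and only serves to make the objective concrete; the BUILD and SWAP heuristics, and the accelerations cited above, exist precisely to avoid it.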

C. ANALYSIS OF THE OPTIMAL NUMBER OF CATEGORIES
The methods described in the previous section can be applied for different values of k, which means that different groupings of the ontologies can be obtained, as the optimal number of clusters is usually unknown. The lack of a gold standard has led to the need for a validation process of an unsupervised classification method. We propose to validate the results of each clustering by measuring two robust features of the k categories generated: stability and goodness of the clustering. In a nutshell, the optimal k is the one whose clustering is the most stable and accurate. Nevertheless, the two criteria involved in determining such an optimum may not return the same k.
Next, the cluster validation criteria applied are described. We also propose an algorithm which automates the process for selecting the best k when the ones reported by both indexes are different; this algorithm can be applied to any unsupervised partitioning method.

1) CLUSTERING STABILITY
Stability refers to whether the structure of a meaningful valid cluster is significantly altered by small variations in the data. Therefore, it seems reasonable to assess the effect of potential changes on each category of the clustering (C_1, . . . , C_k) by a bootstrap resampling procedure [39]. For each metric M_i in M, we measured the stability S_{M_i,k}(C_j) of each category C_j by using the Jaccard coefficient of similarity between sets [40]. In a repository, this index provides the proportion of concordant ontologies between C_j and the most similar cluster in a bootstrapped clustering of a metric M_i. Thereby, the stability measure S_{M_i,k}(C_j) is derived as the mean of the values obtained from the b bootstrap samples. Such scores can be interpreted as degrees of statistical stability [41], thus classifying each category into the types displayed in Table 1. In addition, to compare the metrics M_i, for i = 1, . . . , m, on a repository in terms of clustering stability, the mean of the category stability scores for each metric, S_k(M_i), can be used as a global stability criterion for M_i, assuming the same relative importance of the categories.

2) GOODNESS OF THE CLUSTERING
The goodness of the clustering is assessed by two internal cluster validation measurements, namely, the cohesion and the separation of the clusters. The first feature reflects how closely related the ontologies in a category are, whereas the other quantifies how well-separated a category is from the rest of the categories. Advantageously, both of them can be gathered into several validity indexes, such as the Silhouette width (sil) [42], Calinski-Harabasz (ch) [43], Dunn (dunn) [44], and Davies-Bouldin (db) [45] measurements. Among them, we propose to use the Silhouette width because it can provide the classification quality for both ontologies and metrics, which gives it a competitive edge over the others.
For each metric, the sil coefficient of a particular ontology represents the degree of confidence in the clustering and is given by

sil_{M_i,k}(θ_l) = (b_l − a_l) / max(a_l, b_l),

where a_l is the average distance between the ontology θ_l and all the others belonging to the same category, and b_l is the average distance between the ontology θ_l and the ones of the closest neighbouring category. In particular, sil_{M_i,k}(θ_l) measures how well each ontology θ_l has been clustered, varying its value between −1 and 1, which can be interpreted as in [42]. A value close to 1 (respectively, −1) means that the ontology tends to be 'well-classified' (respectively, 'misclassified'). A value close to zero signifies that the ontology could equally be assigned to the nearest neighbouring category. For comparison purposes, the overall goodness of the clustering for a metric M_i is evaluated by the global Silhouette coefficient, which is defined as the average of the sil scores, sil_k(M_i) = (1/n) Σ_{l=1}^{n} sil_{M_i,k}(θ_l), for i = 1, . . . , m. As suggested by Kaufman and Rousseeuw [37], the global Silhouette width score can be interpreted as the effectiveness of the clustering structure found, in terms of the metrics (see Table 2).
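Both validity indexes can be computed directly from a partition of the raw scores. A compact sketch follows, with hypothetical clusters; the bootstrap resampling loop around the Jaccard index is omitted for brevity:

```python
def silhouette(clusters):
    """Global Silhouette width: mean over all points of
    (b - a) / max(a, b), with a the mean intra-cluster distance
    and b the mean distance to the closest neighbouring cluster."""
    scores = []
    for i, own in enumerate(clusters):
        for x in own:
            a = (sum(abs(x - y) for y in own) / (len(own) - 1)
                 if len(own) > 1 else 0.0)
            b = min(sum(abs(x - y) for y in other) / len(other)
                    for j, other in enumerate(clusters) if j != i)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def jaccard(c1, c2):
    """Agreement between a category and its bootstrap counterpart."""
    return len(c1 & c2) / len(c1 | c2)

# Two well-separated hypothetical categories: silhouette close to 1
s = silhouette([[0.0, 0.1, 0.2], [1.0, 1.1, 1.2]])
# A category keeping 3 of its 4 ontologies after resampling
j = jaccard({"ont1", "ont2", "ont3", "ont4"}, {"ont1", "ont2", "ont3"})  # 0.75
```

In the actual procedure, the Jaccard agreement is averaged over b bootstrap replicates per category, and the category means are averaged again to obtain the global stability S_k(M_i).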

3) ALGORITHM FOR SELECTING THE OPTIMAL k
The methods presented in the previous sections permit the partitioning of a set of ontologies based on their values for a certain metric M_i and for a range of k values. For each k within this range, the clustering is carried out, allowing one to quantify its validity indexes, such as the stability and the goodness of the clustering. We propose to use the stability and the goodness scores to select the value of k that provides the best clustering of the ontologies for a metric M_i. The clustering obtained for the optimal k should be used to analyse the set of ontologies from the perspective of the metric M_i.
The algorithm for determining the optimal k is presented in Algorithm 1, from k_s = argmax_k S_k(M_i) and k_g = argmax_k sil_k(M_i) for each metric M_i. Its purpose is to search for the optimal number k of clusters by combining stability and goodness, which provides the most robust clustering with respect to such validity measures. This goal is pursued by the following reasoning:
1) If the highest stability and the highest goodness are obtained for the same value of k, k_s = k_g, then return this value as the optimal one.
2) If the highest stability and the highest goodness are obtained for different values, k_s ≠ k_g, then the optimal k value should be determined as follows:
a) If both k_s and k_g provide at least stable classifications, or both provide non-stable classifications, the optimal k should be the value with the largest Silhouette width, i.e. k_g.
b) If k_s provides at least stable and reasonable classifications and k_g does not provide stable classifications, we should select k_s.
c) If k_s provides at least stable, but less than reasonable, classifications and k_g does not provide stable classifications, then if k_g provides at least a reasonable silhouette, we should select k_g. Otherwise, k_s should be selected.
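The decision rules above can be sketched as follows. The score tables are per-metric dicts mapping k to the global stability and Silhouette scores; the 'stable' (0.75) and 'reasonable' (0.50) cut-offs are assumed here to mirror the usual interpretations behind Tables 1 and 2:

```python
STABLE, REASONABLE = 0.75, 0.50  # assumed thresholds (Tables 1 and 2)

def optimal_k(stability, goodness):
    """Select k for one metric from {k: global score} dicts."""
    k_s = max(stability, key=stability.get)   # most stable k
    k_g = max(goodness, key=goodness.get)     # k with best silhouette
    if k_s == k_g:                            # rule 1
        return k_s
    stable = lambda k: stability[k] >= STABLE
    reasonable = lambda k: goodness[k] >= REASONABLE
    if stable(k_s) == stable(k_g):            # rule 2a: both (non-)stable
        return k_g
    if stable(k_s) and reasonable(k_s):       # rule 2b
        return k_s
    if stable(k_s) and not reasonable(k_s):   # rule 2c
        return k_g if reasonable(k_g) else k_s
    return k_g  # unreachable: S[k_s] is maximal, so a stable k_g implies a stable k_s

# Hypothetical score tables: both candidates stable -> prefer k_g (rule 2a)
print(optimal_k({3: 0.90, 4: 0.80, 5: 0.70}, {3: 0.60, 4: 0.70, 5: 0.50}))  # 4
```

Note that the last branch can never fire: since S_{k_s} is the maximum stability, k_g being stable implies k_s is stable as well, so the three prose cases are exhaustive.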

D. REPOSITORY PROFILE
The calculation of the optimal k value for each metric in a repository enables one to define the profile of a repository as a vector with m elements, one per metric. Each element in the profile is the optimal k value for the metric M_i. The repository profile may be considered as characteristic information on the repository from the perspective of a set of metrics, which can be used to compare repositories. Given an ordered set M of m metrics that can be applied over a repository, the repository can have different q-dimensional profiles, where q is the number of metrics used for constructing the profile. For a q-dimensional profile with q smaller than m, there are different possibilities for selecting the q metrics. The elements in the profile preserve the order defined in M. In this work, we used q-dimensional profiles corresponding to the metrics given in Table 7, with q ≤ 19.
Moreover, the similarity between two repositories based on their q-dimensional profiles, q ≤ m, can be interpreted as the degree of homogeneity of the repositories with respect to these q metrics, which can be obtained by applying the cosine similarity:

sim(r_1, r_2) = (r_1 · r_2) / (||r_1|| ||r_2||),

where r_1 and r_2 are the corresponding q-dimensional profiles of the repositories. Note that this similarity relates the behaviour of the metrics on the repositories, not their content. Moreover, the similarity between profiles requires both vectors to be homogeneous; that is, they must represent the same ordered set of metrics.
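A sketch of this profile comparison follows; the profile vectors here are hypothetical, not the ones measured in the study:

```python
import math

def cosine_similarity(r1, r2):
    """Cosine similarity of two q-dimensional repository profiles.

    Both profiles must list the optimal k values for the same
    ordered set of metrics."""
    dot = sum(a * b for a, b in zip(r1, r2))
    norm1 = math.sqrt(sum(a * a for a in r1))
    norm2 = math.sqrt(sum(b * b for b in r2))
    return dot / (norm1 * norm2)

# Hypothetical 5-metric profiles: identical except for one metric
repo1 = [3, 4, 3, 3, 3]
repo2 = [3, 3, 3, 3, 3]
sim = cosine_similarity(repo1, repo2)  # close to, but below, 1
```

Identical profiles yield a similarity of exactly 1, matching the interpretation of fully homogeneous repositories with respect to the chosen subset of metrics.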

A. EXPERIMENTAL SETUP
We applied our process to two repositories of ontologies, AgroPortal and OBO Foundry. More concretely, we analysed 78 AgroPortal ontologies (θ_agr = {θ_agr,1 , . . . , θ_agr,78}) and 119 OBO Foundry ones (θ_obo = {θ_obo,1 , . . . , θ_obo,119}), which were studied in [33]. Our results are based on the raw scores of ontology structural metrics retrieved from the OQuaRE REST service, 1 and can be reproduced through either the evaluome web service [46] or the R package evaluomeR [47]. The evaluome tool allows the evaluation of the reliability of metrics with a user-friendly interface. It analyses the stability and the goodness of the clusters reported for each metric by using both global validation criteria described in Sections II-C1 and II-C2. The package evaluomeR [47] is available in Bioconductor. In addition, the optimal number of clusters for each metric is assessed by the decision-making algorithm proposed in Section II-C3, which is implemented in evaluomeR release v1.3.4 hosted on GitHub. 2 evaluomeR depends on the following packages: cluster [48], corrplot [49], Rdpack [50], plotrix [51], fpc [52], SummarizedExperiment [53], and MultiAssayExperiment [54]. It requires R version 3.6 or higher to run [55]. Other dependencies, such as Bioconductor or CRAN R packages, are automatically downloaded via the Bioconductor install manager. Note that the accelerated techniques described in Section II-B for the clustering methods are implemented in the R cluster package from version 2.0.9 onwards; consequently, the experimental results of evaluomeR are obtained through these accelerated optimisations of the PAM and CLARA algorithms. The resulting tables (CSVs), figures, and evaluomeR scripts are available in the directory usecases/kcomparison 3 in our GitHub repository.

B. CLUSTERING METHODS COMPARISON
A comparison of the three clustering methods described in Section II-B is shown in Fig. 1 for AgroPortal and Fig. 2 for OBO Foundry. In general, the optimal k value selected for each metric on both the repositories by k-means is lower than or equal to the values reported for PAM and CLARA, except for NACOnto, POnto, and WMCOnto2 in the AgroPortal repository (see Fig. 8). In most of the cases, k-means provides similar or better classifications according to Tables 1 and 2, based on their global stability and global goodness scores. In the rest of the paper, we will apply the k-means algorithm to analyse and discuss the automated process proposed for the study of ontology structural metrics, because it is one of the most well-known methods. The users of our process can choose the most suitable clustering method for their study.

C. DESCRIPTIVE STATISTICS OF THE METRICS
We used violin plots [56] to describe the distributional shape of the raw scores on each repository (see Fig. 3 and Fig. 4). In these graphical representations, the kernel density trace overlaid on the boxplot may assist in revealing the existence of groups of ontologies; the raw scores were scaled into the interval [0, 1] for visual comparison. At a quick glance, these figures convey different behaviours of most of the ontology structural metrics in each repository, as well as between both the repositories.

D. SELECTION OF THE OPTIMAL k VALUES
In order to determine the optimal number of categories, we performed our evaluation on the generated clusterings with k varying between 3 and 15. This range was selected to retain all the ontology variability in both the repositories, θ_agr and θ_obo, thereby avoiding elementary binary classifications. The number of bootstrap replications was set at 500 in order to reach relatively reliable and accurate outcomes (as reported in [33]). Table 3 and Table 8 display the global stability scores for each metric on the AgroPortal and OBO Foundry repositories, respectively. The highest stability S_k(M_i) was reached when k = 3 for 12 out of the 19 metrics on AgroPortal and for 11 on OBO Foundry. Furthermore, the highest stability score for each metric fell within the range of 0.71 to 0.96 on AgroPortal (of 0.79 to 0.97 on OBO Foundry), so there were no 'Unstable' clusterings of the metrics, and there was only one 'Doubtful' clustering, CROnto, on the AgroPortal repository. Moreover, there were 8 (9) 'Stable' clusterings, for the metrics CBOOnto, CBOOnto2, LOCOMOnto, NOMOnto, POnto, RFCOnto, TMOnto, and WMCOnto2 (AROnto, CBOOnto, CBOOnto2, DITOnto, INROnto, NOMOnto, RFCOnto, WMCOnto, and WMCOnto2), and 10 (10) 'Highly stable' clusterings for the rest of the metrics. In detail, 5.26% (0%) of the metrics were classified as 'Doubtful', 42.11% (47.37%) were 'Stable', and 52.63% (52.63%) were 'Highly stable'.
In addition, Table 4 and Table 9 show the global goodness scores for each metric on the AgroPortal and OBO Foundry repositories, respectively. The highest goodness sil_k(M_i) was achieved when k = 3 for 12 out of 19 metrics on both the repositories. Moreover, the highest goodness score per metric ranged from 0.66 to 0.91 on AgroPortal (from 0.60 to 0.93 on OBO Foundry), so no metrics generated either unstructured clusterings or weak ones. Indeed, 73.68% (57.89%) of the metrics provided categories with 'Strong structure', and 26.32% (42.11%) reproduced categories with 'Reasonable structure' in AgroPortal (OBO Foundry). Table 5 reports the optimal k values, and Fig. 5 shows the scores returned from the evaluation analysis performed on each of the 19 ontology structural metrics in the AgroPortal and OBO Foundry repositories. Note that there were 11 (57.89%) metrics in AgroPortal and 13 (68.42%) in OBO Foundry whose corresponding k values for the global stability and goodness criteria were different. The optimal k displayed in the column 'Optimal k' was the result of the application of the algorithm proposed in Section II-C3, thus providing an automatic decision-making process to carry out the statistical optimisation of the repository-based analysis of the 19 ontology structural metrics. This automatic process revealed that k = 3 was optimal for most of the metrics in both the repositories, that is, for 15 metrics from θ_agr and 16 metrics from θ_obo out of the 19 in total (78.95% and 84.21%, respectively).

E. ANALYSIS OF THE OPTIMAL k VALUES ACROSS REPOSITORIES
The optimal k values for stability and goodness across the metrics are shown in Table 5 for the θ_agr and θ_obo repositories. The graphical representations of the stability and the goodness scores for the optimal k are shown in Fig. 5. Furthermore, the qualitative interpretation of their stability and goodness is shown in Table 6. Moreover, 18 metrics from θ_agr and 19 from θ_obo provided at least stable clusterings (>0.75). All the metrics exhibited at least a reasonable clustering structure in both the θ_agr and θ_obo repositories when using their optimal k.
In addition, we could compare the behaviour of the metrics in both the θ agr and the θ obo repositories when we used their optimal k value. The main findings were as follows: • The following metrics had the same optimal k (k = 3) in both the repositories: ANOnto, CBOOnto, CBOOnto2, CROnto, DITOnto, INROnto, NOCOnto, NOMOnto, PROnto, RFCOnto, RROnto, and WMCOnto.
• The clustering structures presented a similar distribution for the metrics with the same optimal k, except for CBOOnto, CBOOnto2, CROnto, INROnto, and RFCOnto. The difference in the Silhouette scores between the two repositories is clearly observed in Fig. 9 and 10 for CBOOnto, and in Fig. 11 and 12 for the RFCOnto metric.

TABLE 3.
Stability scores for the θ_agr repository in the k range [3,15]. The highest score for each metric is highlighted in bold lettering, and NA is reported when there are insufficient different values to build k clusters.

TABLE 4.
Goodness scores for the θ agr repository in the k range [3,15]. The highest score for each metric is highlighted in bold lettering.

F. REPOSITORY PROFILES AND SIMILARITY
The repository profiles could be affected by the clustering algorithm used; Fig. 8 displays a visual comparison of the 19-dimensional profiles according to the three methods for each repository, although we focused on the k-means algorithm to compare repositories, as mentioned before.
The comparison between the AgroPortal and the OBO Foundry repositories was carried out through their 19-dimensional profiles, which were defined from the optimal k values in Table 5 by using the 19 metrics included in the study, ordered as shown in Table 7.

TABLE 5.
Optimal k values for the metrics in the θ_agr and θ_obo repositories. The column Stability presents the most stable k value for each metric, whereas Goodness exhibits the k value with the greatest silhouette for such a metric. The Optimal k column displays the optimal k obtained by the proposed selection algorithm.

FIGURE 5.
Visual comparison between θ agr and θ obo for the stability and goodness scores from the optimal k value selected for each metric.
The cosine similarity of these two 19-dimensional profiles was 0.77. Moreover, the same optimal k was achieved in both the repositories for 12 metrics, i.e. the same value in the corresponding components of the profiles r_agr and r_obo. Thus, the similarity of the corresponding 12-dimensional profiles was 1, which can be interpreted as the two repositories being fully homogeneous with respect to that subset of 12 metrics.
Furthermore, note that in [33], perfect correlations were found for certain metrics. For instance, the CBOnto and CBOnto2 metrics presented a positive perfect correlation, and CBOnto and INROnto had an almost perfect positive correlation; both were detected in both repositories. Nevertheless, dependent metrics do not affect the characteristic information on a repository, as such characteristic information is provided by the independent metrics. For each metric, dependent or independent, from a subset of q metrics, the optimal k value is independently calculated by evaluomeR to determine the q-dimensional profile, and the component corresponding to a dependent metric is a function of the remaining metrics. It may provide redundant information without affecting the comparison, although it could slightly modify the similarity because of the increase in the dimension.

VOLUME 8, 2020

TABLE 6.
Classification of the stability and goodness scores across metrics for their optimal k value in the θ_agr and θ_obo repositories.
For example, let us take the following set of 6 independent metrics: ANOnto, AROnto, CBOOnto, CROnto, NOCOnto, and POnto. From Table 5, the 6-dimensional profiles for θ_agr and θ_obo based on these independent metrics were as follows: (. . ., 4, 3, 3, 3, 3). Hence, the cosine similarity between both the repositories was 0.983. Next, we added two metrics highly correlated with CBOnto to consider an 8-dimensional profile. As stated before, these metrics were CBOnto2 and INROnto. Therefore, the 8-dimensional profiles with respect to the ordered set of metrics (ANOnto, AROnto, CBOOnto, CBOOnto2, CROnto, INROnto, NOCOnto, POnto) were as follows: (3, 4, 3, 3, 3, 3, 3, 3). This did not affect the components of the 6-dimensional profiles, although the similarity slightly increased to 0.987. In any case, these results reported a high degree of homogeneity between both the repositories based on these 6 or 8 metrics.
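The arithmetic behind similarities of this magnitude is easy to check. For two hypothetical 6-dimensional profiles that agree on four components and differ only in which metric carries the single 4, the cosine similarity is 60/61 ≈ 0.9836:

```python
import math

# Hypothetical profiles; both have norm sqrt(61)
r_a = [3, 4, 3, 3, 3, 3]
r_b = [3, 3, 3, 3, 3, 4]
dot = sum(x * y for x, y in zip(r_a, r_b))  # 9 + 12 + 9 + 9 + 9 + 12 = 60
norm_sq = sum(x * x for x in r_a)           # 61 (same for r_b)
similarity = dot / norm_sq                  # norms are equal, so cos = 60/61
```

Appending two extra components that agree in both profiles raises both the dot product and the squared norm by the same amount, which is why adding correlated metrics with the same optimal k can only nudge the similarity upwards.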

A. PROPOSED METHOD
The method presented in this work is a natural evolution of [33]. There, the goal was to develop a mechanism for analysing the behaviour of a set of metrics for a prefixed value of k. This method was extended to deal with ranges of values of k and to automatically suggest the optimal number of clusters for each metric. Note that this algorithm is just one possibility and that the users of the method are free to design their own algorithm if they prefer to prioritise and interpret the stability and the goodness scores in a different manner. Therefore, the proposed method optimises the analysis, thereby constituting an excellent benchmark for the developers of ontology metrics, the developers of ontology evaluation methods, and the managers of ontology repositories. The method aims at helping to make informed decisions on the use and interpretation of metrics in a repository. Note that the method used in this study can also be applied to other types of quantitative metrics, such as the ones presented in [57]-[59]. The only requirement for the application of the method is to have a dataset for which at least one metric has been measured and whose range of values permits one to build a certain number k of clusters. Consequently, we believe that the method is of interest to data scientists in general.

B. METRICS
In [33], 19 ontology structural metrics were evaluated assuming a partitioning into five clusters of the values of each ontology for each metric in each repository. The results obtained in this work showed two main differences (see Table 5): (1) the optimal k value was different from five for all the metrics included in the study; and (2) the optimal k value was the same in both repositories for 12 out of 19 metrics. This is an indicator of the existence of differences between the two repositories of ontologies studied from the perspective of the ontology features measured by these metrics. (Table 7 defines the 19 metrics evaluated: Column 1 shows the acronym of the metric and its reference; Column 2 describes the ontology facet measured by the metric; and Column 3 describes how the metric is calculated [3,15].) Our interpretation of one metric having the same optimal number of clusters in both repositories is that it behaves in the same way in both of them; hence, it classifies the data into the same number of groups. However, this does not necessarily mean that the distribution of the values for the metric is the same in both repositories. For example, the optimal number of clusters for ANOnto or CBOnto was 3, but the shapes of their
violin plots in AgroPortal and OBO Foundry were different (see Fig. 3 and Fig. 4). In [33], six metrics (NACOnto, NOCOnto, POnto, TMOnto, TMOnto2, and WMCOnto) provided stable or highly stable clusters. Now, with the use of the optimal value of k for each metric (see Fig. 5 and Table 6), only CROnto (0.68) in AgroPortal was not stable or highly stable. Consequently, 18 out of 19 metrics were stable or highly stable in both repositories when the optimal k values suggested by our method were used. The use of the optimal k value revealed that all the metrics had at least a reasonable structure, and such clusters were equally or more strongly structured than the ones built using a prefixed k = 5 in both repositories. Moreover, 14 metrics had a strong structure in AgroPortal and 11 in the OBO Foundry. In most cases, the optimal k value was the one providing the largest goodness value. When the optimal k value was provided by the largest stability value, the metrics were highly stable and had at least a reasonable structure in AgroPortal (DITOnto, PROnto, and RROnto) and were at least stable and had a reasonable structure in the OBO Foundry (AROnto, DITOnto, POnto, PROnto, and RROnto). It can be seen that the optimal k value was provided by the stability for DITOnto, PROnto, and RROnto in both repositories. These three metrics were related to the existence of properties and relations, including taxonomic ones, in the ontology. The following 10 metrics had a strong structure in both repositories: ANOnto, AROnto, CBOnto, CBOnto2, CROnto, INROnto, NACOnto, TMOnto2, WMCOnto, and WMCOnto2. These metrics were also stable or highly stable in both repositories; therefore, we can conclude that they exhibit the same behaviour in both repositories.
Most of these metrics are related to important features of ontologies, such as the taxonomy (CBOnto, CBOnto2, INROnto, NACOnto, TMOnto2, WMCOnto, and WMCOnto2), annotations (ANOnto), attributes (AROnto), and individuals (CROnto). These results show the benefits of using the optimal k value and reveal the differences between the repositories. Fig. 6 shows the changes in stability and goodness when the optimal k value was used instead of k = 5. Both aspects improved for most of the metrics when the optimal k value was used, which was the expected result. There were cases, such as LCOMOnto in AgroPortal, in which the stability score was higher for k = 5, although the optimal k was 3 because of the goodness results, as goodness was given priority when the same degree of stability was achieved for different values of k.
We could also compare the content of the clusters obtained with the optimal k and with k = 5 for the 19 metrics. Figs. 13 and 14 provide a visual comparison of the content of the clusters generated in each repository, from which different types of situations can be inferred. Two examples are described next for the AgroPortal case (see Fig. 13). For WMCOnto, the red and black clusters were common to both values of k, and the blue one (optimal k) included the remaining three clusters from k = 5. For ANOnto, the content of the cluster shown with red data points for k = 5 was distributed over two clusters of the optimal k, and the same happened for the cluster of green points. Consequently, there was no common pattern relating the content of the clusters obtained with the optimal k to the ones obtained with k = 5; hence, each clustering generated its own structure.
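The kind of comparison made in Figs. 13 and 14 can also be quantified with a contingency table between the two clusterings. A sketch with hypothetical labels mimicking the WMCOnto situation, where one optimal-k cluster absorbs several k = 5 clusters:

```python
from collections import Counter

def cluster_overlap(labels_a, labels_b):
    """Contingency table between two clusterings of the same items:
    counts how many items fall in cluster a of one clustering and
    cluster b of the other."""
    return Counter(zip(labels_a, labels_b))

# Hypothetical assignments for ten ontologies: k = 5 labels (a) and
# optimal-k labels (b); cluster 2 of b absorbs clusters 2-4 of a.
a = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
b = [0, 0, 0, 0, 2, 2, 2, 2, 2, 2]
table = cluster_overlap(a, b)
print(table[(3, 2)], table[(4, 2)])  # → 2 2
```

A row of the table concentrated in a single column indicates a k = 5 cluster fully absorbed by an optimal-k cluster; a row spread over several columns indicates a split, as observed for ANOnto.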

C. REPOSITORIES
Our method uses the data existing in a given repository to study the behaviour of a set of metrics. It is not surprising that the optimal number of clusters for a metric might differ between repositories, as their content is expected to be different. The general interpretation of the results obtained in both repositories is as follows: some metrics could be appropriate for certain repositories and not for others. Note that the repositories are in continuous evolution, so our results hold for the particular versions of the ontologies used in the study; changes in the repositories might imply changes in our results.
In our study, both repositories were analysed with k varying between 3 and 15 (see Tables 3-9). The results showed that k > 8 could not be applied for three metrics (AROnto, DITOnto, and TMOnto2) in AgroPortal, whereas such an application was possible in the OBO Foundry repository. This was attributed to the number of different values for these metrics in AgroPortal, and could also be due to the lower number of ontologies in the AgroPortal repository. Generally speaking, not being able to obtain the stability or goodness for certain values of k should not be interpreted negatively for the metric, the repository, or the method; it simply makes explicit that such a number of categories does not make sense for the particular metric in the particular repository. We believe that this is another example of how the method captures the signal differentiating the repositories. The differences for these three metrics could be intuitively confirmed by inspecting their violin plots. A clear example is TMOnto2: the violin plot in AgroPortal suggests three large groups, whereas the data were more dispersed in the OBO Foundry. If we compare the optimal k values for these three metrics in both repositories, we can see that the optimal k is 3/9 (AgroPortal/OBO Foundry) for AROnto, 3/3 for DITOnto, and 3/10 for TMOnto2. For AROnto and TMOnto2, the optimal k for the OBO Foundry is larger, showing that the content of the repositories is different from the perspective of these metrics.
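One plausible check consistent with this explanation (k-means cannot form more non-empty clusters than there are distinct metric values) can be sketched as:

```python
def feasible_k_range(values, k_min=3, k_max=15):
    """Values of k that can be tried for a metric in a repository:
    k-means cannot form more non-empty clusters than there are
    distinct metric values."""
    distinct = len(set(values))
    return [k for k in range(k_min, k_max + 1) if k <= distinct]

# Hypothetical metric with only 8 distinct values across a repository,
# so k > 8 is not applicable:
print(feasible_k_range([1, 1, 2, 2, 3, 4, 5, 5, 6, 7, 8]))  # → [3, 4, 5, 6, 7, 8]
```

An empty result for some k simply signals that such a number of categories does not make sense for that metric in that repository, as discussed above.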
In Section IV-B, 12 metrics were identified as having the same optimal k value in both repositories; therefore, the behaviour was different for 7 metrics:
• The optimal k value was higher in AgroPortal for the metrics NACOnto, POnto, TMOnto, and WMCOnto2.
• The optimal k value was higher in OBO Foundry for the metrics AROnto, LCOMOnto, and TMOnto2.
This difference implied that the optimal partitioning of the repositories using these metrics required a different number of groups. Moreover, 5 out of these 7 metrics measured facets related to the taxonomy. In three of these cases, the difference was of one cluster, but the situation was different for NACOnto (10-3), LCOMOnto (3-8), TMOnto, and TMOnto2 (3-10). For example, both TMOnto and TMOnto2 were related to ancestors: TMOnto referred to classes with multiple ancestors, whereas TMOnto2 referred to the direct ancestors of the classes targeted by TMOnto. Thus, we can conclude that this result confirms the adaptive nature of our method to the actual distribution of the data points and that each metric captures a specific feature.
The optimal number of clusters for a given metric may differ between repositories, which makes it a data-dependent feature. We could draw an analogy between this repository-based feature and parameters such as K and Lambda in BLAST [60], which are calculated on the basis of the content of the sequence database and are used for computing the scores and the E-value. In our case, the optimal number of clusters was used to build the repository profile, the q-dimensional profiles. We think that the profiles may be an interesting and flexible tool to estimate whether two sets of metrics have a similar behaviour across repositories, because they can also be applied to subsets of the set of metrics. For example, the similarity of the behaviour of a set of metrics related to a particular feature (e.g., classes or annotations) could be studied using our q-dimensional profile approach.

D. REPOSITORY PROFILES
The scope of our automated process was also enlarged by collecting the optimal number of clusters k for each metric on a repository into a vector referred to as the repository profile. This profile can be considered feature information of the repository with respect to an ordered set M of m metrics, allowing users to compare the classification performance of the set M on different ontology portals in terms of the similarity between their m-dimensional profiles using the cosine function. As a result, the profiles could be ranked according to their cosine similarity measures. Furthermore, it is worth remarking on the flexibility of the repository profile, which may be used for any q-dimensional ordered subset of metrics chosen by the user, q ≤ m.
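A repository profile and its q-dimensional sub-profiles are straightforward to build once the optimal k values are known. A sketch with hypothetical values:

```python
def profile(optimal_k_by_metric, metrics):
    """q-dimensional profile: the optimal-k values of an ordered
    subset of metrics, in the given order."""
    return tuple(optimal_k_by_metric[m] for m in metrics)

# Hypothetical optimal-k values for one repository (illustrative only).
agr = {"ANOnto": 3, "AROnto": 3, "CROnto": 3, "POnto": 4}
print(profile(agr, ("POnto", "ANOnto")))  # → (4, 3)
```

Two repositories can then be compared on any ordered subset of metrics by taking the cosine similarity of their sub-profiles.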

E. LIMITATIONS AND FURTHER WORK
Our method for obtaining the optimal number of clusters for a metric does not analyse the existence of exceptional cases, such as clusters with only one member. Such outlier cases can be identified by exploring the violin plots shown in Section III-C, but this must be done manually. For instance, the NOCOnto metric depicts outlier cases in Fig. 3 and Fig. 4, and TMOnto2 describes an exceptional case.
In the violin plot of θ_obo, the scaled raw data of TMOnto2 have two well-separated density functions on two small clusters, whereas in θ_agr the distribution is visually different (these two density functions overlap).
The existence of such cases should be reported to the algorithm that selects the optimal number of clusters. As further work, we will develop methods for identifying such cases automatically.
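A first automatic step in that direction could simply flag singleton clusters; a minimal sketch:

```python
from collections import Counter

def singleton_clusters(labels):
    """Return the clusters that contain a single member, so that the
    optimal-k selection can be warned about exceptional cases."""
    sizes = Counter(labels)
    return sorted(c for c, n in sizes.items() if n == 1)

print(singleton_clusters([0, 0, 1, 2, 2, 2, 3]))  # → [1, 3]
```

The same idea generalises to flagging clusters below any minimum size threshold.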
In addition, we have studied the metrics on two repositories. It would be interesting to include more repositories in further studies, which could contribute to finding metrics with (in)consistent behaviour across repositories. Moreover, the analysis of the similarity between the repositories by using the q-dimensional profiles could generate new insights.
The present study shares some limitations with [33], as we did not include metrics such as consistency or formal correctness, which are usually implemented as Boolean functions. In this study, we considered a minimum of k = 3, as we believe that our method is less interesting for Boolean-like metrics; this choice preserves the variability of the metrics and avoids elementary binary classifications. We also did not include non-structural metrics, as we wanted to perform the study under the same conditions as those described in [33]. Finally, we proposed the repository profiles, which could be used to determine the degree of homogeneity of repositories with respect to a set of metrics. Further experiments will be carried out to associate homogeneity degrees with ranges of similarity scores.

V. CONCLUSION
In this paper, we have presented a method for an automatic decision-making process based on the optimal analysis of repositories from the perspective of the features measured using a set of metrics. The method is based on measuring and analysing the statistical properties of the clusterings obtained by applying the metrics to an ontology repository. The results showed that the use of the optimal k value for such an analysis allowed for a finer distinction of the profile of the ontology repositories. We believe that this type of study may well help users to generate new insights into the properties of an ontology repository and its content. The method presented was applied to ontologies but is generic and, therefore, can be applied to other types of metrics and repositories.

FIGURE 13. Comparison of the optimal k and k = 5 for θ_agr: clusters for the optimal k are shown as ellipses, and those for k = 5 are represented by the colors of the data points.