Metadata-Based Clustering and Selection of Metadata Items for Similar Dataset Discovery and Data Combination Tasks

Data integration, which aims to solve problems and create new services by combining datasets, has attracted considerable attention. The discovery of similar datasets that can be combined is a critical step. In the literature on similar dataset discovery, it is important to select an appropriate discovery method for each information need, such as the domain. However, conventional studies have evaluated discovery methods in inconsistent ways, using different domains, test datasets, and evaluation metrics. This prevents the appropriate method from being selected for each situation. Furthermore, the specific effects of combining different methods are not well known, despite conventional studies arguing for the importance of such combinations. This study attempts to understand (1) which similarity indicators should be employed for each domain and (2) the effects of combining different indicators on performance. We evaluated 16 inter-dataset clustering models based on different metadata-based similarity indicators, using unified evaluation metrics and datasets for 15 domains. Our results (1) suggest that different similarity indicators should be used for different domains and (2) demonstrate that most combinations of different methods can improve clustering performance.


I. INTRODUCTION
Data integration, which aims to solve problems and create new services by combining datasets, has attracted significant attention [1], [2]. This makes the discovery of combinable datasets one of the most critical issues. With improvements in computer processing power and the development of cloud data services, platform providers have developed business ecosystems in which large and diverse datasets are distributed [3], [4], [5], [6]. Some tools discover appropriate datasets from such enormous candidate pools by applying information-retrieval techniques [7], [8]. Furthermore, there has recently been growing interest in approaches based on dataset similarity rather than simple query matching. Because it is impractical for users to know in advance the perfect query words that represent the required datasets [9], similarity-based approaches allow users to avoid constructing appropriate query words and can carry richer semantics [10]. Therefore, conventional studies have proposed numerous similarity indicators for datasets.
Although conventional studies have argued for the importance of selecting appropriate methods [10], more practical discussions are needed. This research field is still in its infancy, and each study has used different evaluation datasets and metrics. For example, Bernhauer et al. [10] used datasets from the Czech National Open Data Catalog and evaluated their results using Precision@k and PR-AUC. On the other hand, Wang et al. [14] used gold standards in Elsevier DataSearch and evaluated their results mainly using nDCG and F-measure. Consequently, it is difficult to identify an appropriate method by quantitatively comparing their experimental results. In addition, many conventional studies have limited the evaluation to a single domain [14], [15], [18] or evaluated only domain-agnostic performance [10], [11], [12], [17], [19], [20]. These facts make it difficult to determine which metadata-based similarity indicator should be employed for each domain. Furthermore, the specific effects of combining different methods are not well known, despite conventional studies pointing to the importance of such combinations [10].
This paper attempts to understand (1) which similarity indicators should be employed for each domain and (2) the effects of combining different indicators on performance. We compared and evaluated the clustering performance of 16 metadata-based similarity indicators using the metadata of 1500 datasets (15 domains × 100 datasets) from the Kaggle data platform. Our research contributions are as follows: (1) we compared 16 metadata-based similarity indicators across 15 domains under unified evaluation metrics and datasets, and (2) we demonstrated which combinations of indicators lead to improved clustering performance. Table 1 shows a comparison between our study and conventional studies.

II. TASK DEFINITION
Our research objectives are to understand (1) which similarity indicators should be employed for each domain and (2) the effects of a combination of different indicators on performance. We attempted to understand these through inter-dataset clustering. Whereas general clustering takes a single dataset and generates clusters of instances within it [21], [22], inter-dataset clustering takes multiple datasets as input to obtain groups of similar datasets. In the former, every cluster element follows the same data format and schema, whereas the latter does not necessarily satisfy these constraints. For example, conventional studies have investigated the discovery of similar speech recognition corpora [23], the clustering of geographic dataset series [18], and the categorization of similar research papers [19].
We work on inter-dataset clustering based on metadata-based dataset similarity in the same manner as [18] and [19]. A more procedural statement of our objective is to find clustering models that can reproduce the true dataset clusters domain by domain. A clustering model predicts the cluster of each dataset given the distance values between all dataset pairs. To evaluate the metadata-based similarity indicators, we varied only the distance values given as input to each model; we unified all other elements of the models, such as the clustering algorithm and random parameters.
Our experiments quantitatively evaluate how well each metadata-based similarity indicator rebuilds the true dataset clusters for each of 15 domains defined by Kaggle dataset tags. If the performance of similarity indicators differs greatly across domains, the results of this study can provide productive insights for future studies on similar dataset discovery.

III. RELATED WORKS
A. DISCOVERY OF SIMILAR DATASETS
There are several approaches to discovering combinable datasets. The most typical is based on information retrieval. Kato et al. [24] indicated, through experiments on government open datasets, that retrieval difficulty depends on user needs. Wang et al. [14] showed that an ontology-based retrieval method is effective for the biomedical datasets of Elsevier DataSearch. Sakaji et al. [13] evaluated similarity searches using Word2Vec and BERT vectors on Kaggle datasets. Bernhauer et al. [10] compared several similarity search methods and organized the advantages of similarity search and full-text ad hoc retrieval.
They claimed that combining different methods is important because ad hoc retrieval generally performs better, whereas similarity search can improve recall.
When the discovery target is limited to scientific datasets attached to research papers, information-recommendation approaches are predominant. In contrast to retrieval-based approaches, they do not necessarily depend on text information because they mainly use link structures such as citations [25], [26], [27] and co-author relationships [28]. As mentioned above, such information is only available for datasets linked to research papers.
There have been several attempts at inter-dataset clustering. Whereas general clustering takes a single dataset and generates clusters of instances within it [21], [22], this approach takes multiple datasets as input to obtain groups of similar datasets. Siegert et al. [23] attempted to discover subsets of the most similar corpora among six speech emotion recognition corpora, using the acoustic features in each dataset for clustering. In contrast to such content-based approaches, there have also been efforts based on metadata. Sajid et al. [19] classified research papers using metadata instead of the paper bodies, claiming that metadata-based methods are also applicable to non-open-access papers. Lacasta et al. [18] focused on detecting related dataset series using geographic metadata. They indicated that clustering methods based on text metadata effectively cluster geospatial datasets, because such datasets have no unified data modality and content-based discovery methods cannot be applied.
In summary, metadata-based methods remain effective for datasets with mixed modalities or limited access to their content, whereas content-based methods do not. Our study is similar to Lacasta et al. [18] in the context of metadata-based dataset clustering but differs in the dataset domains. Furthermore, we consider methods based on ontologies and non-text metadata, which are not compared in these conventional studies.

B. METADATA-BASED SIMILARITY INDICATOR BETWEEN DATASETS
Conventional studies have proposed three approaches to computing dataset similarity from metadata: text-based, ontology-based, and variable-based. Text-based approaches regard the similarity between two pieces of text metadata as the dataset similarity; these have been the most widely evaluated and discussed. There have been proposals based on the Jaccard coefficient and on the cosine similarity between text vectors [29], [30], [31]. Sakaji et al. [13] evaluated Word2Vec and BERT for open-domain datasets in Kaggle. Wang et al. [15] found that the cosine similarity between BERT vectors achieved the best Precision@k when retrieving biomedical datasets from Elsevier's DataSearch; they also observed that BM25, a classical indicator, showed the best performance in some cases. Skopal et al. [16] introduced data-transitive similarity, which uses intermediary datasets to relate non-similar datasets. Bernhauer et al. [10] evaluated this indicator and compared it broadly with other similarity indicators based on text metadata, indicating that TF-IDF-based data-transitive similarity achieved higher Precision-Recall AUC than the Jaccard coefficient, Word2Vec, and BERT in six similarity search experiments on the Czech Open Data Catalog. Our work is similar to Bernhauer et al. [10] in comparing diverse similarity indicators. However, we focus on inter-dataset clustering instead of similarity-based dataset search. In addition, we evaluate inter-dataset similarity indicators across fifteen domains and compare not only text-based but also ontology- and variable-based ones.
Ontology-based indicators calculate the similarity between terms mapped onto an ontology, such as Wikidata or WordNet. While text-based similarity indicators reflect the entire contents of the text metadata, this approach uses only the information contained in the ontology; its main characteristic is therefore that it can remove noise contained in text metadata. Škoda et al. [17] proposed Navigational distance, which can provide a structured explanation of dataset similarity, and proved the concept using the Wikidata ontology. Wang et al. [14] applied Wu-Palmer similarity and Resnik similarity to the ad hoc retrieval of biomedical datasets, showing that Wu-Palmer similarity outperformed Google distance and the cosine similarity between Word2Vec vectors. These conventional efforts have been quantitatively evaluated and compared with other approaches in only a few domains. This paper evaluates these indicators across fifteen domains and compares them with text-based and variable-based approaches.
While both of the previous approaches use text metadata, another approach employs variables instead. Variables correspond to the set of strings that refer to the dataset contents, e.g., the attribute names of a database or the column names of a tabular dataset. Bogatu et al. [32] proposed D3L (Dataset Discovery in Data Lakes), a series of dataset similarity indicators for data lakes; one of them uses the Jaccard coefficient between the variables of data lakes. Several efforts by Zhang and Balog [11], [12] also employed variables, proposing a skip-gram-based embedding method for variables combined with cosine similarity. Sakaji et al. [20] applied the Dice coefficient between variables to datasets of D-Ocean, a Japanese data platform.
Their experimental results indicated that it can represent the content and geographic similarity of datasets as well as Word2Vec and BERT can. Although this effort is important in that it verified variable-based indicators on open data platforms, it did not include a quantitative evaluation based on objective label information. This study provides such a quantitative evaluation and, furthermore, deeply verifies the effectiveness of the variable-based approach by comparing it with other similarity indicators.
These studies evaluated similarity indicators on different domains, datasets, and evaluation metrics, which makes it difficult to compare their results and determine which similarity indicators should be selected for each domain. In addition, no effort has exhaustively compared similarity indicators across the approaches presented above. Efforts to evaluate and compare feature extraction or similarity methods on the same metrics are flourishing in many research fields [33], [34], [35]. Although Bernhauer et al. [10] is a typical example of such a study in the dataset discovery field, they focused only on text-metadata-based similarity indicators. This study provides a more extensive comparison of representative similarity indicators.

IV. METHODOLOGY
A. EXPERIMENTAL PROCESS
Figure 1 provides an overview of our experimental process, which consists of three steps: preparing the input data, clustering, and evaluation. The first step involves metadata acquisition, preprocessing, and conversion into the input data for the clustering models. We describe the details of metadata acquisition, preprocessing, and similarity calculation in Sections IV-B2, IV-B3, and IV-B4.
In the clustering step, we predict the domain label of each dataset group by applying clustering to the similarity matrix built in the previous step. We employed K-Medoids, a clustering method that accepts a matrix of dataset similarities or distances as input (Section IV-C).
The evaluation step indirectly evaluates the metadata-based similarity indicators through the performance of the clustering models. We used three evaluation metrics (Section IV-D) and fifteen domain labels based on Kaggle dataset tags (Section IV-B1).

B. PREPARING THE INPUT DATA
1) INPUT AND EVALUATION DATASET SELECTION
We acquired sets of similar datasets for the evaluation of the inter-dataset clustering models. We exploited ''subject'' tags in Kaggle to sample similar datasets within a domain. Subject tags represent the main contents of datasets, e.g., Covid-19 and Investing, and have a hierarchical structure; for example, the Investing tag is a child of the Finance tag. We first selected five tags (Finance, Health Conditions, Sports, Government, and Internet) according to the number of datasets belonging to each tag. Next, we extracted three child tags for each selected tag based on their frequencies. In summary, we used the 15 tags in Table 2 as the dataset domains for evaluation. We randomly selected 100 datasets for each of the 15 domains, for a total of 1500 datasets, and then acquired the metadata of each dataset; we present more details on the metadata and the acquisition process in the next subsection. Note that we preliminarily excluded datasets related to two or more different domains, which prevents our experiment and evaluation from being complicated by multi-label datasets.

2) METADATA OVERVIEW AND ACQUISITION PROCESS
This study defines metadata as data that satisfies at least one of the following conditions: being contained in the Meta Kaggle dataset [36] or being acquirable from the Kaggle API. Figure 2 illustrates a sample pair of metadata and the actual contents of a dataset. We employ three types of metadata: the title, the description, and the variables. The title and description are natural language sentences that describe the dataset contents; the term ''text metadata'' in conventional studies mainly refers to these. The title is relatively short and abstractly expresses the essential information, whereas descriptions often contain more detailed and specific information. We obtained this metadata from the Meta Kaggle dataset [36] on June 1, 2023.
A variable is a logical set of short strings, such as column names or database attribute names. For example, a medical dataset may have ''patient_id,'' ''sex,'' and ''age.'' Because the Meta Kaggle dataset excludes variables, we acquired them using the Kaggle API on June 12, 2023.

3) PREPROCESSING METADATA AND MAKING INPUT DATA
We applied lowercasing, tokenization, lemmatization, and stopword removal to preprocess the text metadata. We used the en_core_web_sm model of spaCy [37] for tokenization and lemmatization. We removed stopwords using NLTK, together with words that appear frequently in the descriptions of Kaggle datasets. We manually added the following stopwords: *, #, data, datum, dataset, context, content, acknowledgement. For the variables, we applied lowercasing, tokenization, and lemmatization with the same tools used for the text metadata.
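As a rough illustration, the preprocessing above can be sketched in Python. This is a minimal stand-in: the regex tokenizer and the tiny stopword set below replace spaCy's en_core_web_sm lemmatizer and the full NLTK stopword list, so the output only approximates the paper's actual pipeline.

```python
import re

# Hand-picked stopwords standing in for NLTK's list plus the paper's
# manual additions (*, #, data, datum, dataset, context, content,
# acknowledgement); the full lists are assumptions here.
STOPWORDS = {"the", "a", "an", "of", "and", "for", "in", "is",
             "data", "datum", "dataset", "context", "content",
             "acknowledgement", "*", "#"}

def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize, and drop stopwords from one metadata field.

    Lemmatization is omitted here; the paper uses spaCy's
    en_core_web_sm model for tokenization and lemmatization.
    """
    tokens = re.findall(r"[a-z0-9_']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("COVID-19 Dataset: daily case data for US states"))
# → ['covid', '19', 'daily', 'case', 'us', 'states']
```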
We build similarity matrices from the preprocessed metadata as the input data for clustering. Each is a matrix whose (i, j) element corresponds to the similarity between datasets i and j. We compute the similarity of each dataset pair using the metadata-based similarity indicators described in Section IV-B4.
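The matrix construction can be sketched as follows, with scikit-learn's TfidfVectorizer standing in for one of the similarity indicators of Section IV-B4; the toy titles are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy titles standing in for preprocessed Kaggle metadata.
titles = [
    "covid daily case counts by state",
    "covid deaths and recoveries worldwide",
    "premier league match results",
]

# TF-IDF vectors, then an n-by-n matrix whose (i, j) element is the
# cosine similarity between datasets i and j.
vectors = TfidfVectorizer().fit_transform(titles)
sim_matrix = cosine_similarity(vectors)

assert sim_matrix.shape == (3, 3)
print(sim_matrix.round(2))
```

The two covid-related titles share a token and thus obtain a nonzero off-diagonal similarity, while the unrelated title does not.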

4) METADATA-BASED SIMILARITY INDICATOR
The term ''metadata-based similarity indicator'' in this paper refers to an inter-dataset indicator that consists of metadata, a data processing method, and a distance function. We use these indicators to compute the similarity between all dataset pairs and then generate similarity matrices as input data for the clustering models. This paper compares 16 similarity indicators, which fall into three approaches: text-based, ontology-based, and variable-based.
A text-based approach directly computes the similarity between text metadata. The most primitive indicator is the Jaccard coefficient between the word sets of text metadata:

Jaccard(T_i, T_j) = |Unique(T_i) ∩ Unique(T_j)| / |Unique(T_i) ∪ Unique(T_j)|,    (1)

where T_i and T_j are the text metadata of the corresponding datasets i and j, and Unique(T) is the set of unique words of the document T.
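A minimal sketch of the Jaccard computation over unique word sets; the helper name and the toy titles are our own illustrations.

```python
def jaccard(text_i: str, text_j: str) -> float:
    """Jaccard coefficient between the unique-word sets of two
    text metadata fields."""
    unique_i, unique_j = set(text_i.split()), set(text_j.split())
    if not unique_i | unique_j:
        return 0.0
    return len(unique_i & unique_j) / len(unique_i | unique_j)

# Two toy titles sharing 2 of their 4 unique words.
print(jaccard("covid case counts", "covid death counts"))  # → 0.5
```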
The more advanced text-based approaches use vector similarity. Following conventional studies [10], [13], [14], [15], we employ TF-IDF, Word2Vec, and BERT as vectorizers of text metadata. Equation (2) shows the TF-IDF-based conversion of text metadata T into the dataset vector V:

V = (tf(w_1, T) · idf(w_1), ..., tf(w_n, T) · idf(w_n)),    (2)

where w_i is a word in the vocabulary, and n is the vocabulary size of the entire text metadata set. The function tf(w_i, T) is the frequency of the word w_i in T, and idf(w_i) is the inverse of the number of documents containing the word w_i. The definition of the Word2Vec-based conversion function is as follows:

V = (1 / |T|) Σ_{w ∈ T} Word2Vec(w),    (3)

where w is a word in text metadata T, and Word2Vec(w) is a function that converts a word into the corresponding vector. The BERT-based function is as follows:

V = BERT_emb(BERT_tok(T)),    (4)

where BERT_emb(BERT_tok(T)) generates the document-level vector from the tokenized text metadata BERT_tok(T). These similarity indicators regard the cosine similarity between the resulting vectors as the dataset similarity.

Another text-based approach is data-transitive similarity, proposed by Skopal et al. [16]. It assumes that datasets x and y are transitively similar when both are similar to the same dataset i. Equation (5) defines the data-transitive similarity DT(x, y):

DT(x, y) = ⊕_{i ∈ D} (sim(x, i) ⊎ sim(i, y)),    (5)

where x, y, and i are different datasets, and D is the set of datasets. The operator ⊕ is an outer aggregation over all datasets in D, such as min, max, or avg. The operator ⊎ is an inner aggregation over two similarity values, e.g., sum, minus, or multiply. Following Bernhauer et al. [10], this paper uses the cosine similarity between TF-IDF vectors as the underlying similarity, ⊕ is max, and ⊎ is multiply.

An ontology-based approach maps words in text metadata onto an ontology and computes a graph similarity as the dataset similarity. We adopt WordNet as an ontology that can be applied to various dataset domains, and we employ Navigational distance and Wu-Palmer similarity for comparison with the other similarity indicators. Navigational distance is an ontology-based indicator contained in the framework proposed by Škoda et al. [17]:

ND(T_i, T_j) = ⊕_{(C_i, C_j) ∈ mapD(T_i) × mapD(T_j)} Path(C_i, C_j),    (6)

where C_i and C_j are concepts on an ontology mapped from text metadata T_i and T_j, mapD is a function that maps the words of the input text metadata onto the ontology, and Path is the path similarity between two concepts. The aggregation operator ⊕ runs over all concept pairs, e.g., sum, max, or avg. Following Škoda et al. [17], we adopt the average function as the aggregation.
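To make the aggregation concrete, the following sketch computes an average path similarity over all concept pairs on a toy is-a hierarchy that stands in for WordNet. The hierarchy, the function names, and the 1/(1 + path length) similarity mirror WordNet's path similarity, but everything here is our illustrative assumption, not the paper's implementation.

```python
# A toy is-a hierarchy standing in for WordNet (child -> parent).
PARENT = {"dog": "canine", "cat": "feline", "canine": "carnivore",
          "feline": "carnivore", "carnivore": "animal", "animal": None}

def ancestors(concept):
    """Chain from a concept up to the root, starting with itself."""
    chain = [concept]
    while PARENT[concept] is not None:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def path_similarity(ci, cj):
    """1 / (1 + length of the shortest is-a path), as in WordNet."""
    anc_i, anc_j = ancestors(ci), ancestors(cj)
    lcs = next(a for a in anc_i if a in anc_j)  # lowest common ancestor
    dist = anc_i.index(lcs) + anc_j.index(lcs)
    return 1.0 / (1.0 + dist)

def navigational_similarity(concepts_i, concepts_j):
    """Average path similarity over all concept pairs (cf. the
    average aggregation adopted from Škoda et al.)."""
    pairs = [(ci, cj) for ci in concepts_i for cj in concepts_j]
    return sum(path_similarity(ci, cj) for ci, cj in pairs) / len(pairs)

# dog-cat: path length 4 -> 0.2; dog-canine: path length 1 -> 0.5.
print(navigational_similarity(["dog"], ["cat", "canine"]))  # → 0.35
```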
The definition of another similarity indicator based on Wu-Palmer similarity is as follows:

WPS(T_i, T_j) = ⊕_{(C_i, C_j) ∈ mapD(T_i) × mapD(T_j)} WP(C_i, C_j),    (7)

where C_i and C_j are concepts mapped from text metadata T_i and T_j, mapD(T) is a function that maps each word to the corresponding concept, and WP(C_i, C_j) is the Wu-Palmer similarity, defined as follows:

WP(C_i, C_j) = 2 · Dep(LCS_{C_i, C_j}) / (Dep(C_i) + Dep(C_j)),    (8)

where LCS_{C_i, C_j} is the lowest of the common ancestors of C_i and C_j, and Dep(C) returns the hierarchy depth of an input concept C. The other elements are the same as in Equation (6). To reduce the computational cost, we applied TF-IDF-based keyphrase extraction [38] to the text metadata and mapped onto the ontology only the words contained in the top five keyphrases.

A variable-based approach uses variables instead of text metadata to evaluate dataset similarity. As described in Section IV-B2, variables consist of the attribute names of databases or the column names of tabular datasets. Following the definition of Sakaji et al. [20], we use a variable-based similarity indicator based on the Dice coefficient:

Dice(V_i, V_j) = 2 |V_i ∩ V_j| / (|V_i| + |V_j|),    (9)

where V_i and V_j are the variables of the corresponding datasets i and j. In addition, we consider another variable-based similarity indicator using TF-IDF vectorization. Several variable elements, such as ''id'' and ''date,'' appear across many datasets, so we expect a better similarity representation by reducing the influence of such elements. We compute this similarity by replacing the text metadata T with the variables V in Equation (2).
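A minimal sketch of the Dice coefficient between variable sets; the column names are invented. Note how generic variables such as ''id'' and ''date'' inflate the score between otherwise unrelated datasets, which is exactly what motivates the TF-IDF-weighted variant.

```python
def dice(vars_i: set[str], vars_j: set[str]) -> float:
    """Dice coefficient between two variable sets."""
    if not vars_i and not vars_j:
        return 0.0
    return 2 * len(vars_i & vars_j) / (len(vars_i) + len(vars_j))

# A medical and a sports dataset sharing only generic columns.
medical = {"id", "date", "patient_age", "diagnosis"}
sports = {"id", "date", "team", "score"}
print(dice(medical, sports))  # → 0.5, despite unrelated contents
```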
To summarize this section, Table 3 lists the sixteen similarity indicators compared in this study. It shows the components of each similarity indicator (the metadata, the data processing method, and the distance function) and the approach type to which each indicator belongs. In addition, the ''Model name'' column gives the name of the clustering model using each similarity indicator; we use these names again in Section V.

C. CLUSTERING
This study employed the K-Medoids algorithm as the clustering method. K-Medoids is a non-hierarchical clustering method that is also suited for community detection in networks. While the basic concept of K-Medoids is similar to that of K-Means, it uses a medoid instead of a centroid. The centroid is the average vector of the samples in a cluster, and K-Means reassigns each sample to the cluster whose centroid is nearest. In contrast, the medoid is not an average vector but one of the samples contained in the cluster, defined as follows:

medoid(X_i) = argmin_{m ∈ X_i} Σ_{x ∈ X_i} distance(x, m),    (10)

where X_i is a cluster and distance(x, m) is the distance between samples x and m. Unlike most other methods, K-Medoids accepts a matrix of dataset similarities or distances as input. Because we compare similarity indicators that do not generate a feature vector for each dataset, K-Medoids is suitable for our experiment. This study aims to evaluate different metadata-based similarity indicators for each dataset domain; constructing the best method for inter-dataset clustering is outside its scope. Therefore, we do not compare different clustering algorithms.
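A compact sketch of K-Medoids over a precomputed distance matrix, alternating an assignment step with the medoid update defined above. This is our simplified implementation for illustration; the experiments use a library implementation, not this code.

```python
import numpy as np

def k_medoids(dist: np.ndarray, k: int, max_iter: int = 100, seed: int = 2023):
    """Minimal K-Medoids on a precomputed n-by-n distance matrix."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # Assignment step: each sample joins its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            # Medoid update: the member minimizing total intra-cluster distance.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.argmin(dist[:, medoids], axis=1), medoids

# Toy 1-D demo: two obvious groups of three points each.
pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
dist = np.abs(pts[:, None] - pts[None, :])
labels, medoids = k_medoids(dist, k=2)
print(labels)
```

In the actual experiments, the distance matrix would be derived from a similarity matrix (for example, as one minus the similarity).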

D. EVALUATION METRICS
To evaluate the clustering performance of each similarity indicator, we used NMI, ARI, and Purity. These are typical quantitative evaluation metrics for clustering models given true cluster labels. NMI (Normalized Mutual Information) is a metric based on mutual information that evaluates the global structural similarity between a clustering result and the true clusters. Equation (11) defines NMI.

NMI(X, Y) = I(X; Y) / sqrt(H(X) · H(Y)),    (11)

where X and Y are different clustering results, one of which is the partition based on true labels. I(X; Y) is the mutual information, and H(X) is the entropy of X. Clustering results X and Y are similar when this metric takes a high value.
The clustering performs well when the obtained clusters are similar to the partition based on true labels. ARI (Adjusted Rand Index) is another metric that considers global structural similarity. As with NMI, a high ARI value means that the clustering performed well.

TABLE 3. Sixteen types of dataset similarity indicators based on metadata used in our experiment. In the ''Model name'' column, the terms T, D, and V mean the title, the description, and the variables. In the ''Data processing method'' column, ''Keyphrase + WordNet'' means mapping extracted keyphrases onto the WordNet ontology. In the ''Distance function'' column, ''DT similarity'' means the data-transitive similarity.
Equation (12) shows the definition of ARI.

ARI(X, Y) = (RI(X, Y) − ExpRI(X, Y)) / (maxRI(X, Y) − ExpRI(X, Y)),    (12)

where RI(X, Y) is the Rand index without adjustment for chance, and ExpRI(X, Y) and maxRI(X, Y) refer to its expected and maximum values. Equations (13), (14), and (15) define these terms:

RI(X, Y) = Σ_{x,y} C(n_xy, 2),    (13)

ExpRI(X, Y) = [Σ_x C(a_x, 2) · Σ_y C(b_y, 2)] / C(n, 2),    (14)

maxRI(X, Y) = (1/2) [Σ_x C(a_x, 2) + Σ_y C(b_y, 2)],    (15)

where n_xy is the number of samples belonging to both cluster x and cluster y, a_x and b_y are the sizes of clusters x and y, n is the total number of samples, and C(·, 2) is the binomial coefficient. Purity aggregates the precision of each cluster. The definition of purity is as follows:

Purity(X, Y) = Σ_{x ∈ X} (|x| / n) max_{y ∈ Y} Precision(x, y),    (16)

where X is the clustering result, and Y is the partition based on true labels. The clustering is considered good when the proportion of the most frequent label y in each cluster x is large. Precision(x, y) follows the usual definition of precision.
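The three metrics can be computed as sketched below, with scikit-learn supplying NMI and ARI and purity implemented directly. Note that scikit-learn's NMI normalizes by the arithmetic mean of the entropies by default, which may differ from the normalization used here; the toy labels are invented.

```python
from collections import Counter

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def purity(pred, true):
    """Purity: each predicted cluster votes for its most frequent true
    label; the per-cluster majority counts are summed and normalized."""
    clusters = {}
    for p, t in zip(pred, true):
        clusters.setdefault(p, []).append(t)
    majority = sum(Counter(members).most_common(1)[0][1]
                   for members in clusters.values())
    return majority / len(pred)

# Toy clustering: one sample is placed in the wrong cluster.
pred = [0, 0, 1, 1, 1, 2]
true = ["a", "a", "b", "b", "a", "c"]
print(normalized_mutual_info_score(true, pred))
print(adjusted_rand_score(true, pred))
print(purity(pred, true))  # → 0.8333… (5 of 6 samples majority-correct)
```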

E. PARAMETER SETTINGS
For Word2Vec and BERT, we used publicly available pre-trained models without fine-tuning. We used the wiki-news-300d-1M model [39] from Meta Research (https://fasttext.cc/docs/en/english-vectors.html) to obtain Word2Vec word vectors. This model was trained on Wikipedia 2017, the UMBC webbase corpus, and the statmt.org news dataset, and it converts words into 300-dimensional vectors. We adopted the bert-base-uncased model from Devlin et al. [31] to obtain BERT embeddings. It was trained on English Wikipedia and BookCorpus and generates 768-dimensional vectors.
For the ontology-based approaches, we applied a keyphrase extraction method based on TF-IDF, using the implementation by Boudin et al. [38] to extract the top five ranked keyphrases from each text metadata. We used NLTK [40] to map words onto the ontology and to compute distances between WordNet concepts.
For clustering with K-Medoids, we set the number of clusters to 15, the same as the total number of dataset domains. We applied random assignments to the initial clusters and set the maximum number of iterations to 100, following the default settings of the library. In addition, we set the random seed to a fixed integer of 2023.

V. RESULTS
Table 4 indicates the evaluation results over the 15 domains shown in Table 2. First, we found that indicators based on the cosine similarity of text metadata achieved high performance. The BERT-based indicator using the title showed consistently high performance on all evaluation metrics, followed by the TF-IDF-based and Word2Vec-based indicators. The other indicators, including the ontology-based and variable-based ones, performed comparatively poorly, with no significant differences among them.
One interesting tendency in Table 4 is that most similarity indicators using both the title and the description performed worse than those using only the title. We consider the main factor to be the low ratio of useful information in Kaggle dataset descriptions. For example, descriptions often contain information unnecessary for dataset discovery, such as credits, acknowledgments, and template sentences. As an exception, the ontology-based indicators improved some evaluation metric values when a description was added. These indicators employ keyword extraction and ontology mapping; we consider that these data processes increased the influence of the useful information in descriptions rather than the noise. From these observations, indicators based on the cosine similarity between dataset title vectors are the better choice in terms of domain-agnostic performance.

Tables 5, 6, 7, 8, and 9 show the clustering performances for each domain, covering the medical, financial, sporting, governmental, and Internet domains. Each table shows the highest F-value based on Precision-Recall for each similarity indicator and each domain described in Table 2.

TABLE 10. The structural differences between two clustering results for each comparison condition. The smaller the values, the greater the structural differences between the two clustering results.

We can
see from each table that the best indicator differs completely across domains. As one interesting observation, although some similarity indicators using the WordNet ontology performed poorly in almost all domains, each showed the best performance in exactly one domain: Wu-Palmer similarity using the dataset title outperformed BERT and TF-IDF in the cancer domain, and Navigational distance using the title did so in the crime domain. In contrast, some similarity indicators performed exceedingly poorly in a few domains. Although the cosine similarity between BERT title vectors was the best indicator in the largest number of domains, it showed significantly lower performance in the three financial domains; its F-values there are equal to or lower than those of the Dice coefficient between variables, which performs poorly in many domains. These results more directly explain why an appropriate discovery method should be selected for each domain. In addition, they reveal the appropriate similarity indicators for each domain, such as BERT performing well for sporting domains but not for financial domains.

VI. DISCUSSION
From the results, we found that the most effective similarity indicator differs across domains. Similarity indicators based on vectors of text metadata were highly versatile, and each ontology-based indicator worked optimally in only one domain. While text-metadata-based similarity indicators worked effectively, variable-based ones did not perform satisfactorily under any experimental condition. However, text and variable metadata might complement each other because they contain significantly different information, and conventional studies have also argued for the importance of combining different methods [10]. In this section, we discuss the following two points: (1) whether discoverable datasets differ between text metadata and variables and (2) whether combining two similarity indicators based on different types of metadata can improve clustering performance.

A. INFLUENCE OF SIMILARITY INDICATORS ON CLUSTER STRUCTURES
We expect structural differences in the clustering results if the discoverable datasets differ depending on the metadata used. Therefore, we quantitatively compared the global structural differences and local overlaps between the two clustering

TABLE 11. The local overlaps between two clustering results for each medical domain and each comparison condition. We used the Jaccard coefficient to measure local overlaps.

TABLE 12. The overlaps between the two clustering results for each financial domain and each comparison condition. We used the Jaccard coefficient to measure local overlaps.

TABLE 13. The overlaps between the two clustering results for each sporting domain and each comparison condition. We used the Jaccard coefficient to measure local overlaps.

TABLE 14. The overlaps between the two clustering results for each governmental domain and each comparison condition. We used the Jaccard coefficient to measure local overlaps.

TABLE 15. The overlaps between the two clustering results for each Internet domain and each comparison condition. We used the Jaccard coefficient to measure local overlaps.
results.First, we computed the NMI and ARI between clustering results for all possible pairs of models.There was a significant difference between the two clustering results in their global structure when these metrics took low values.In addition, we evaluated local overlaps by calculating the Jaccard coefficient between each cluster pair in the two clustering results.Note that each cluster used to calculate the Jaccard coefficient has the highest F-value based on Precision-Recall for each domain, same as in the Tables 5 to 9. To identify the influence of metadata on discoverable datasets, we defined the six conditions shown in Table 10.
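The comparison described above can be sketched as follows: the Jaccard coefficient for local overlap, and a from-scratch Adjusted Rand Index for global structure, both over flat cluster assignments. The label lists and dataset ids are illustrative assumptions; in practice, library implementations such as scikit-learn's `adjusted_rand_score` and `normalized_mutual_info_score` would typically be used instead.

```python
# Sketch of the structural comparison between two clustering results.
# Illustrative only: the label lists below are not the paper's data.
from collections import Counter
from math import comb

def jaccard(cluster_a, cluster_b):
    """Local overlap between two clusters given as sets of dataset ids."""
    a, b = set(cluster_a), set(cluster_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand Index between two flat label assignments."""
    n = len(labels_a)
    pairs = Counter(zip(labels_a, labels_b))   # contingency-table cells
    rows, cols = Counter(labels_a), Counter(labels_b)
    index = sum(comb(v, 2) for v in pairs.values())
    sum_rows = sum(comb(v, 2) for v in rows.values())
    sum_cols = sum(comb(v, 2) for v in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case (e.g., both trivial)
        return 1.0
    return (index - expected) / (max_index - expected)

labels_text = [0, 0, 1, 1, 2, 2]   # hypothetical text-metadata clustering
print(adjusted_rand(labels_text, labels_text))   # identical clusterings -> 1.0
print(jaccard({"ds1", "ds2"}, {"ds2", "ds3"}))   # 1/3
```

Low ARI (and NMI) values between two models then signal globally different cluster structures, while per-cluster Jaccard coefficients localize where the two results agree.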
Tables 11 to 15 show that the pairs based on different metadata have the minimum overlap between their clusters in eleven domains. These results suggest that the discoverable datasets differ between text-metadata-based and variable-based similarity indicators.

B. INFLUENCE OF COMBINING SIMILARITY INDICATORS ON CLUSTERING PERFORMANCE
Based on the results of the previous subsection, we discuss whether the combination of text-metadata-based and variable-based similarity indicators can improve clustering performance. We evaluated 120 clustering models obtained by combining two different similarity indicators in the manner described in Section IV. We then computed the ratio of combinations whose evaluation metric values improved over those before combining, under the same comparison conditions as in Table 10. Table 16 shows the results. The ratio of combinations with improved metric values was highest when combining text-metadata-based and variable-based indicators. In particular, we observed an improvement in the ARI value for 27 of the 28 combinations, and the ratios under this condition exceeded the second-best result by more than 0.1 points. From these results, we conclude that variable-based similarity indicators are more valuable when combined with text-metadata-based indicators than when used alone.
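The combination step is performed "in the manner described in Section IV", which is not reproduced in this excerpt. One common realization of combining two indicators is a weighted average of their similarity matrices before clustering; the sketch below uses that assumption together with a simple single-link grouping over a similarity threshold, so both the combination rule and the clustering step are illustrative choices, not necessarily the paper's exact procedure.

```python
# Hedged sketch of combining two similarity indicators: a weighted average
# of the two similarity matrices, then single-link grouping over a threshold.
# The matrices, weight, and threshold below are illustrative assumptions.

def combine(sim_a, sim_b, weight=0.5):
    """Element-wise weighted average of two similarity matrices in [0, 1]."""
    n = len(sim_a)
    return [[weight * sim_a[i][j] + (1 - weight) * sim_b[i][j]
             for j in range(n)] for i in range(n)]

def single_link_clusters(sim, threshold=0.5):
    """Group datasets whose combined similarity meets the threshold (union-find)."""
    n = len(sim)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Toy example: text metadata links datasets 0-1; variables link 1-2.
sim_text = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]]
sim_vars = [[1.0, 0.2, 0.1], [0.2, 1.0, 0.9], [0.1, 0.9, 1.0]]
print(single_link_clusters(combine(sim_text, sim_vars)))  # [[0, 1, 2]]
```

In the toy example, neither indicator alone links all three datasets, but the averaged matrix keeps both the 0-1 and 1-2 links above the threshold, illustrating how complementary indicators can merge evidence.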

VII. LIMITATION
The first limitation of this study is the dependency on the nature of the data platform. Because we collected all datasets from Kaggle, we should be aware that the characteristics of the platform may affect our experimental results. For example, almost all similarity indicators using text metadata showed a decrease in clustering performance when employing the dataset description. We consider the low quality of dataset descriptions on Kaggle to be one of the main factors: descriptions there frequently include credits, acknowledgments, template text, and other information unrelated to the dataset contents. Therefore, there is scope for verification on other data platforms to reduce the influence of any single platform.
The second limitation is the dependency on the ontologies. We compared and evaluated Wu-Palmer Similarity and Navigational Distance as ontology-based similarity indicators. This study adopted the WordNet ontology for generalizability to diverse domains, whereas other studies have used other ontologies; for example, Wang et al. adopted the MeSH (Medical Subject Headings) terminology to discover biomedical datasets. Which ontology should be selected for each domain is beyond the focus of this paper and remains an issue for future work.
The third limitation is the nature of the metadata used. Each similarity indicator that we evaluated relies on one of the following types of metadata: dataset titles, descriptions, and variables. As shown in Section VI, we can improve clustering performance by combining two similarity indicators based on variables and text metadata, and the experimental results suggest that this improvement stems from the differences in discoverable datasets caused by the metadata. We expect that additionally combining these metadata with other types of metadata will lead to further improvements. One important future direction is to examine metadata beyond the scope of this paper, such as relationships based on citations and co-authorships in datasets related to research papers.

VIII. CONCLUSION
This study evaluated 16 metadata-based similarity indicators on the inter-dataset clustering task for each dataset domain. We selected 15 domains and acquired the metadata of 1500 datasets from the Kaggle data platform for evaluation. We found that the most effective similarity indicator differs from domain to domain. While a BERT-based indicator performed best for the largest number of domains, it performed poorly for the financial domains. In contrast, some indicators were effective for only one domain: the indicator based on Wu-Palmer Similarity showed the best performance only for the cancer domain, and a similar result was obtained for the Navigational distance-based indicator in the crime domain.
Further analysis indicated that the combination of two different similarity indicators can improve the clustering performance. Although the performances of the variable-based similarity indicators were remarkably poor, 96.4% (27/28) of the combinations of variable-based and text-metadata-based indicators improved their ARI values. We conclude that the wide separation between the distributions of datasets discoverable by variable-based and text-metadata-based indicators led to this improvement; indeed, the NMI and ARI between variable-based and text-metadata-based cluster sets took exceedingly low values.
Conventional studies have asserted the difficulty of selecting a single optimal method for all situations and the importance of combining different methods. Our results add more practical detail to these assertions for future studies. From the experimental results, this study presented which similarity indicator should be employed for each of the 15 dataset domains. In addition, we identified specific requirements for combining two indicators to improve clustering performance. Future work includes analysis using other data platforms and the development of a novel similarity indicator by optimizing the combination ratios based on our results.
Our contributions are summarized as follows. (1) We evaluated 16 metadata-based inter-dataset similarity indicators among 15 dataset domains through the clustering task. Our study covered three types of similarity indicators: text-based, ontology-based, and variable-based. (2) We also analyzed the effects of combining two different similarity indicators on clustering performance. (3) Clustering performance varies significantly depending on the domain; our experimental results indicate the appropriate similarity indicators, which differ for each domain. (4) Most of the combinations can improve clustering performance.

FIGURE 1. An overview of our experimental process.

FIGURE 2. An example of metadata and actual contents in the dataset.

TABLE 1. A comparison among this paper and conventional studies.

TABLE 2. The list of Kaggle dataset tags used as the dataset domains for evaluation.

TABLE 4. A comparison of clustering performances over 15 domains. Bold values indicate the highest among the same metrics.

TABLE 5. A comparison of clustering performances for each medical domain.

TABLE 6. A comparison of clustering performances for each financial domain.

TABLE 7. A comparison of clustering performances for each sporting domain.

TABLE 9. A comparison of clustering performances for each Internet domain.

TABLE 16. The ratio of combinations with clustering performance improvement.
Table 10 shows the global structural differences, and Tables 11 to 15 indicate the local overlaps between two clustering results for each condition. According to Table 10, both the NMI and the ARI between clustering results based on different metadata took the lowest values.