Restoration of Data Structures Using Machine Learning Techniques

Tabular data is the most common format used to represent real-world information. Almost all programs created for storing or processing data, such as relational database systems, spreadsheets, and statistical analysis software, can import or export tabular data. However, these programs are not sufficiently robust to automatically solve the problems of importing messy delimited files or files that contain data from multiple tables. In addition, messy datasets may contain data separated by multiple delimiters and lacking column names, and some table rows may have substituted or deleted columns. This paper proposes the STCExtract algorithm for reconstructing the table structures and data into which the input file can be arranged. The STCExtract algorithm is designed to be domain-independent and modular with respect to machine learning algorithms and other parameters. The algorithm was developed as a two-phase process: the original data tables are recognized in the first phase, and the columns of the original data tables are recognized in the second phase. The STCExtract algorithm was evaluated through extensive experiments using multiple real datasets. Multiple messy datasets were generated for the four experiments. Three experiments were conducted to determine the optimal parameters for the STCExtract algorithm. A fourth experiment was conducted to evaluate the proposed algorithm. The results show that the STCExtract algorithm correctly arranged the structure of the tables with an accuracy of 94.4% to 100%. The accuracy of the STCExtract algorithm in the second phase (when the data were allocated to columns) ranged from 59.7% to 90.2%.


I. INTRODUCTION
With the advancement of information and communication technologies, the amount of stored and electronically accessible data has significantly increased in recent years. Electronic data can be found in various forms, including unstructured data stored in textual documents, semi-structured XML data, and structured data in relational databases. Various interfaces, such as web browsers, structured query languages, and desktop applications, can be used to access data.
Accordingly, the data can be divided into three categories: unstructured, structured, and semi-structured. Unstructured data do not necessarily have a defined format or follow any rules regarding the data content. Unstructured data include text, video, audio, and photos. In contrast, structured data are organized into semantic entities. Similar entities are grouped in the form of relations, and entities in the same group have the same attribute descriptions. The descriptions of all entities in a group comprise a schema. Attributes usually have a predefined format and length and usually follow a predefined order. An example of structured data is the relational data stored in relational databases. The third type is semi-structured data. Semi-structured data include email, Hypertext Markup Language (HTML), Electronic Data Interchange (EDI), and JavaScript Object Notation (JSON). Although they contain structural information, they are not entirely structured. In contrast to structured data, entities of the same group can have different attributes in semi-structured data, and the order of the attributes is not particularly important. Instead, appropriate markers or labels are used to mark the attributes themselves.
The associate editor coordinating the review of this manuscript and approving it for publication was Hongwei Du.
A special type of unstructured data is data that is reliably known to have had a structure that has been corrupted for some reason. Such a dataset may contain different versions of the initial structures of one or more entities that have changed over time (e.g., changes in the order or number of attributes). It may also contain entity structures that were broken when the structured data were exported as a messy-delimited file, for example because different delimiters were used to separate entity attributes and rows. In addition to these characteristics, a problem arises because of the heterogeneity of the data itself. Heterogeneous sets of structured data can contain information of different data types, such as dates or time intervals, explicit numeric data (e.g., year), coded numeric data (e.g., 0 or 1 for presence or absence), categorical data (e.g., countries), and text fields (e.g., short description). A further problem with such datasets is that values may be missing for some attributes; that is, the value of such attributes can be left blank or filled with some default value. Finally, some of the attributes can be multi-valued, that is, vector types rather than scalar types.
This paper proposes the STCExtract algorithm for extracting original structures and tables from a messy-delimited text file. Given an input messy-delimited file, STCExtract reconstructs the table structure and the data into which the input file can be arranged. Furthermore, the STCExtract algorithm is designed to be domain-independent, modular with respect to machine learning algorithms, and modular with respect to other parameters, such as the function for calculating the distance between structures.
Machine learning (ML) is a part of artificial intelligence (AI) that allows software applications to predict outcomes without being explicitly programmed to do so. Machine learning algorithms use historical data as inputs to predict new output values, and can be trained in several ways. Based on the learning method, machine learning algorithms can be broadly categorized as supervised or unsupervised. Supervised learning algorithms use labeled data, meaning that the data have a known outcome or target variable. Unsupervised learning algorithms use unlabeled data, meaning that the data do not have a known outcome or target variable. Many machine learning algorithms exist in the literature; this paper uses two of them, K-means and hierarchical clustering, for reconstructing the table structure, and five of them (K-means, Mini-batch K-means, Birch, Gaussian mixture, and Fuzzy K-means (FCM)) for reconstructing the table columns.
The remainder of this paper is organized as follows. Section II presents the related work. In Section III, the problem formulation and the STCExtract algorithm are described. Section IV presents the experimental setup and results of the experimental analysis. Finally, Section V concludes the paper.

II. RELATED WORK
This section presents the relevant background for classifying and comparing approaches and methods for schema discovery. The analysis carried out in this section is primarily based on review papers [1], [2], [3], [4] in which state-of-the-art schema information extraction approaches are presented. These studies analyzed the approaches to schema discovery, the techniques for schema discovery, the characterization of discovery procedures, the input and output data formats, and the quality of the results.

A. APPROACHES TO SCHEMA DISCOVERY
Approaches to schema discovery consist of several steps, in which schema-related information is extracted from the original input set of schema information. This information can be extracted from RDF, OEM, or other data types. Schema discovery approaches can be grouped into three categories based on the basic algorithmic concepts behind the schema discovery method. These categories include implicit schema discovery, explicit enrichment of existing schemas, and the discovery of structural patterns within existing schemas.
Implicit schema discovery approaches attempt to discover a schema from a data source without requiring additional schema information. This procedure aims to extract new information from a dataset by analyzing the entities and connections between the elements. These approaches begin the analysis with an empty schema and attempt to characterize the dataset in two ways: grouping similar instances and grouping similar paths [5], [6], [7], [8], [9].
Approaches that use explicit enrichment of existing schemas begin with existing schemas and statements on the schema that can complement or enrich them. Their goal is to generate new types using existing types, or to generate other schemas. These approaches can be classified into two categories. The first is based on machine learning techniques, whereas the second is based on statistical data analysis. These techniques primarily use the connectivity between instances [10], [11], [12].
The procedures for discovering structural patterns within existing schemas begin with identifying all possible structural patterns (versions) of entities in a given dataset of a known schema. This procedure aims to analyze the co-occurrence relationship between the properties in the dataset rather than to discover types and additional schemas, as in the explicit enrichment of existing schemas. A different set of properties describes each instance within the dataset. The set of properties that describes an instance represents its structural pattern. These approaches can be classified into two categories based on the scope of the pattern recognition. The first is based on exact pattern recognition, whereas the second is based on approximate pattern recognition. Papers describing exact pattern recognition procedures are provided in [13], [14], [15], [16], [17], [18], and [19], whereas those describing approximate pattern-discovery approaches are provided in [20], [21], and [22].

B. TECHNIQUE FOR SCHEMA DISCOVERY
Techniques that enable the discovery of the required schema elements are applied as steps within these approaches. The techniques used to identify patterns can be divided into three groups. The first group consists of techniques based on machine learning, the second on formal methods, and the third on statistical methods. Machine learning-based techniques use machine learning concepts to discover the dependencies between data in a given dataset. These techniques can be divided according to several criteria: the type of learning (unsupervised and supervised machine learning), classification algorithm (e.g., K-NN), clustering algorithm (e.g., K-means and hierarchical clustering), and rule-mining algorithm (frequent pattern mining algorithms such as Apriori). Formal methods are based on mathematical and formal logic to determine the dependencies between the data in a given dataset. These techniques include Formal Concept Analysis, which is used in pattern discovery to discover classes by generating a network of concepts, and bi-simulation, which is used to cluster similar paths in data graphs. Techniques based on statistical methods attempt to generate and quantitatively describe datasets. In schema discovery, this technique does not focus on determining the frequency of properties such as the distribution of properties across types.

C. CHARACTERIZATION, FORMATS, AND QUALITY
Techniques for discovering the required schema elements are also characterized according to their behavior when working with different data. This characterization can be divided into several groups of properties: the first group concerns the extensibility of the solution when working with an enormous amount of data; the second group concerns whether the solution is reached simultaneously or incrementally; the third group concerns whether the solution consistently produces the same output for the same set of input data; and the fourth group concerns whether the solution processes data locally or remotely.
When using techniques and algorithms to detect schema elements, it is necessary to consider both the input data that can be delivered to the algorithms and the output data that the algorithms can produce. The input data can be divided into two categories, depending on their purpose. The first category comprises datasets and the second comprises user-defined parameters. Input datasets are typically described as RDF data graphs or OEM data. Some approaches require the definition, even partially, of the ontology for the dataset, whereas others use only its instances. The second category consists of additional data that can be user-defined parameters required by basic algorithms, such as the similarity thresholds used to compare graph elements or the desired number of clusters. Depending on the chosen approach and technique, output data can include several elements. The output may include information related to data types and subtypes, semantic relationships between data, hierarchical relationships between data, data access plans, patterns between data, and other information.
The quality of the results is a crucial element when analyzing the use of techniques and algorithms for detecting schema elements. The quality of the resulting schema expresses the extent to which it adequately describes an input dataset. Multiple aspects of the output schema quality can be observed, including schema relevance, schema completeness, and class accuracy (precision). The relevance aspect involves recognizing only the elements, links, or other schema-related information provided by the approach considered in the schema. The completeness aspect concerns the comprehensiveness of the generated schema; that is, checking whether it contains all the elements that exist in the actual schema. The precision aspect involves checking the correctness of the generated type declarations between entities and corresponding instances.

III. PROBLEM STATEMENT AND STCEXTRACT ALGORITHM OVERVIEW
Existing schema discovery algorithms predominantly focus on structured and semi-structured data, the initial structure of which is known, and attempt to discover additional features. These algorithms may not be intended for and may have difficulty with unstructured data containing multiple tables that are reliably known to have a corrupted structure. Existing schema discovery algorithms cannot be directly applied to such data, and require modifications. Therefore, these solutions were not compared to the STCExtract algorithm proposed in this study. This section presents the problem statement and details of the STCExtract algorithm.

A. TERMINOLOGY AND PROBLEM STATEMENT
A messy delimited file can be considered as a set of $n$ rows $\{r_1, r_2, \ldots, r_n\}$, with row $r_i$ consisting of $m$ fields $\{f_{i1}, f_{i2}, \ldots, f_{im}\}$. Each row may have a different number of fields. The database from which the messy file originates comprises a set of $k$ tables $\{t_1, t_2, \ldots, t_k\}$, where the $i$-th table $t_i$ consists of $l$ columns $\{c_{i1}, c_{i2}, \ldots, c_{il}\}$. Each table may have a different number of columns. Each row $r$ belongs to exactly one table $t$, and one or more fields $f$ of a given row belong to exactly one column $c$ of a given table. During the formation of a messy file, consecutive rows can originate from two or more tables, some fields can be missing, field orders can be changed, column headers of the original table can be missing, and fields can contain symbols that can be used as separators. Field types may be scalar (e.g., number, string, date) or vectors of scalar fields. Consequently, a row is separated into tokens rather than directly into fields. For example, using different delimiters, row $r_i$ can be split into $b$ tokens $\{t_{i1}, t_{i2}, \ldots, t_{ib}\}$. The main goal of this paper is to extract a set of tables and their columns corresponding to the structures of the original tables and then populate the tables and columns using the messy file content. The goal involves recognizing the tables using row and token values, determining the number of columns for each table, and (if necessary) aligning the columns (merging columns or adding new ones).

B. RESEARCH QUESTIONS
The research questions addressed in this paper are as follows:
• RQ1: Is it possible to propose an algorithm based on machine learning techniques that reliably restores a data structure that is known to have existed but has been corrupted for any reason?
• RQ2: Can the proposed algorithm be extended by using rules and functions to detect scalar data types of columns within a structure?
• RQ3: Can the proposed algorithm be extended using rules and functions to detect the vector data types of columns within a structure?
This section explains the sequence of operations of the STCExtract algorithm illustrated in Figure 1.
In general, the STCExtract algorithm is based on cluster analysis. Cluster analysis or clustering involves grouping a set of objects such that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). Cluster analysis is not a specific machine learning algorithm but a general task to be solved. In the STCExtract algorithm, clustering occurs twice. In the first clustering phase, the structure and order of the tokens in each input file row are used as the set of objects. This clustering aims to discover clusters/tables and group rows into the discovered clusters. In the second clustering phase, the token values, token structure, and order in the rows belonging to the same cluster/table are used as the object set. The goal of this clustering is to place tokens in the clusters/columns of a table, and to group tokens belonging to the same field. The STCExtract algorithm is executed as a sequence of actions on an input messy-delimited file.

1) SPLIT ALL ROWS INTO TOKENS
In this step, all rows from a messy delimited file are read row-by-row and split into tokens. Splitting a row into tokens is a dialect detection problem. The dialect detection problem is solved using the hypothesis that multiple delimiters were used. Each row in the delimited file is split on several delimiters (tabs, commas, semicolons, colons, and vertical bars). Delimiters can be added as parameters in the STCExtract algorithm. Splitting is applied to all rows. When splitting the content of a row, each portion of the row and its position within the row are saved. This step results in a table containing all rows of the input file and all row items (tokens) in the format (line ID, token value, token position). At this stage, the source file is also split using its known separators to generate a file for external validation.
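The splitting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_rows`, the constant `DEFAULT_DELIMITERS`, and the sample rows are hypothetical, and quoting or escaping rules, which a real messy file may require, are ignored.

```python
import re

# Delimiters listed in the text; the constant name is an assumption.
DEFAULT_DELIMITERS = ["\t", ",", ";", ":", "|"]


def split_rows(lines, delimiters=DEFAULT_DELIMITERS):
    """Return (line_id, token_value, token_position) triples for all rows."""
    # Build one alternation pattern so that every delimiter splits the row.
    pattern = re.compile("|".join(re.escape(d) for d in delimiters))
    records = []
    for line_id, line in enumerate(lines, start=1):
        tokens = pattern.split(line.rstrip("\r\n"))
        for position, value in enumerate(tokens):
            records.append((line_id, value, position))
    return records


# Hypothetical sample rows mixing several delimiters.
rows = ["tt0001\tCarmencita;1894,short", "7|Blacksmith Scene:1893"]
records = split_rows(rows)
```

Additional delimiters can be passed through the `delimiters` argument, which mirrors the statement that delimiters are parameters of the algorithm.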

2) CLASSIFICATION OF TOKENS
In this step, each extracted token is classified into one of the appropriate categories. A token is classified based on its value, using different sets of rules and techniques. The initial rules are based on regular expressions that classify tokens into three categories: number, string, and date. Tokens whose value is an empty string or null are coded as strings. In the STCExtract algorithm, additional rules for token categories can be added as parameters. Dictionary-based techniques can be used in addition to regular expression techniques. In this technique, a table contains a set of key-value pairs, where a key represents a token or cleaned token, and a value represents a token category. Initially, this key-value table is empty. The output from this step is a table containing all the tokens in the format (line ID, token value, token position, cleaned value, and token category).
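A minimal sketch of rule-based token classification is shown below. The regular expressions, the cleaning rule (whitespace stripping), and the name `classify_token` are assumptions made for illustration; the paper does not list its actual patterns.

```python
import re

# Illustrative patterns only; the real rule set is a parameter of the algorithm.
NUMBER_RE = re.compile(r"^[+-]?\d+([.,]\d+)?$")
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$|^\d{1,2}[./]\d{1,2}[./]\d{4}$")


def classify_token(value, dictionary=None):
    """Return (cleaned_value, category) for a single token."""
    cleaned = value.strip()
    # Dictionary lookup takes precedence when a key-value table is supplied.
    if dictionary and cleaned in dictionary:
        return cleaned, dictionary[cleaned]
    # Empty and null tokens are coded as strings, as described in the text.
    if cleaned == "" or cleaned.lower() == "null":
        return cleaned, "string"
    # Dates are tested before numbers so that date-like values are not
    # swallowed by a more permissive numeric rule.
    if DATE_RE.match(cleaned):
        return cleaned, "date"
    if NUMBER_RE.match(cleaned):
        return cleaned, "number"
    return cleaned, "string"
```

For example, `classify_token(" 1894 ")` yields the cleaned value `"1894"` with category `"number"`, whereas an empty token is coded as `"string"`.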

3) GENERATE STRUCTURES
This step creates the set of objects for the first clustering phase. An object corresponds to one row in the messy file and contains an ordered vector of the categories of the tokens in that row. Each row is additionally linked to the structure to which it belongs. A generated string value is added to the token category (number, string, or date) during this phase to preserve token order. Therefore, it is possible to distinguish, for example, between a string in the first position and a string in the third position. Once all the rows are linked to the structures, they are aggregated according to their values for future processing. This step results in a table containing all the objects with the same structure in the format (object structure ID, number of appearances, number of tokens, structure of a clustering object, and description).
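The following sketch illustrates how structures could be generated and aggregated. The position suffix format (`string_0`, `number_2`) and the function names are assumptions; the paper only states that a generated value is attached to each category to preserve order.

```python
from collections import Counter


def build_structure(categories):
    """Attach the position to each category, e.g. ('string_0', 'number_1')."""
    return tuple(f"{cat}_{pos}" for pos, cat in enumerate(categories))


def aggregate_structures(rows_categories):
    """Link every row to its structure and count identical structures.

    rows_categories maps a line ID to the ordered list of token categories.
    """
    row_to_structure = {}
    counts = Counter()
    for line_id, categories in rows_categories.items():
        structure = build_structure(categories)
        row_to_structure[line_id] = structure
        counts[structure] += 1
    # One record per distinct structure, most frequent first.
    table = [
        {
            "structure_id": i,
            "appearances": n,
            "num_tokens": len(structure),
            "structure": structure,
        }
        for i, (structure, n) in enumerate(counts.most_common(), start=1)
    ]
    return row_to_structure, table
```

Aggregating identical structures means that the first clustering phase operates on distinct structures with frequencies rather than on every row, which keeps the object set small.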

4) CLUSTERING STRUCTURES ON TABLE LEVEL
In this step, the first clustering phase is performed. This phase aims to discover clusters/tables based on the structure of tokens and their order in each row of the input file, the frequency of their appearance, and the similarity between them. Each discovered cluster is a table.
Clustering is a task for which many algorithms have been proposed. However, no clustering technique is universally applicable, and different techniques support different clustering objectives. Therefore, an understanding of the clustering problem and technique is required in order to apply a suitable method to a given problem. This paper uses hard clustering, in which every data point (structure) belongs to a single cluster.
Two customized algorithms were tested during this phase: hierarchical clustering and K-means clustering. For each algorithm, six distance functions were used (linear, linear_weight, log, log_weight, reverse, and reverse_weight), together with five cluster-center detection algorithms and three object comparators.
The cluster-center detection algorithms were as follows:
• MinDistance: the cluster center is the object with the smallest distance to all other objects in the cluster.
• MaxFreq: the cluster center is the object with the highest number of appearances in the cluster.
• PositionMax: the cluster center is a generated object that has, at each position, the type that appears most often at that position in the cluster.
• GroupMax: the cluster center is a generated object that has, at each position, the type that appears most often at that position in the cluster, ignoring repetitions of the same consecutive tokens.
• GroupMax2: the cluster center is a generated object that has, at each position, the type that appears most often in the cluster, ignoring repetitions of the same consecutive tokens.
The object comparators were as follows:
• Matcher: combines the length similarity score and the position similarity score with equal weight.
• Matcher1: combines the length similarity score and the position similarity score of sequences of equal type with equal weight.
• Matcher2: combines the length similarity score, the position similarity score, the length similarity score of sequences of equal type, and the position similarity score of sequences of equal type with equal weight.
The results of this step are the sets of detected clusters/tables.
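To make the comparator and center-detector descriptions concrete, the sketch below implements one possible reading of Matcher, MaxFreq, and PositionMax. The exact scoring formulas are not given in the text, so the definitions of the length score (ratio of the shorter to the longer length), the position score (share of positions with equal types), and the frequency weighting in `position_max_center` are all assumptions.

```python
from collections import Counter


def matcher_similarity(a, b):
    """Equal-weight mix of a length score and a position score (assumed forms)."""
    if not a and not b:
        return 1.0
    longest = max(len(a), len(b))
    length_score = min(len(a), len(b)) / longest
    position_score = sum(x == y for x, y in zip(a, b)) / longest
    return 0.5 * length_score + 0.5 * position_score


def max_freq_center(cluster):
    """MaxFreq: the structure with the most appearances in the cluster.

    cluster maps a structure (tuple of types) to its number of appearances.
    """
    return max(cluster, key=cluster.get)


def position_max_center(cluster):
    """PositionMax: per position, the type that appears most often there."""
    longest = max(len(structure) for structure in cluster)
    center = []
    for pos in range(longest):
        votes = Counter()
        for structure, appearances in cluster.items():
            if pos < len(structure):
                # Weighting votes by appearances is an assumption.
                votes[structure[pos]] += appearances
        center.append(votes.most_common(1)[0][0])
    return tuple(center)
```

A distance for K-means or hierarchical clustering could then be derived as `1 - matcher_similarity(a, b)`, although the six distance functions named in the text may be defined differently.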

5) EVALUATION OF RESULTS AND SELECTING THE MOST APPROPRIATE RESULT -PHASE1
Clustering validation has long been recognized as a vital issue for the success of clustering applications. There are two methods for evaluating cluster quality. One is external evaluation, in which the truth labels in the datasets are known beforehand. The other is internal evaluation, which is performed without labels, using only the dataset itself. The proposed algorithm was evaluated using external validation.
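As an illustration of external validation, the sketch below scores a clustering against known table labels by mapping each cluster to its majority label (often called purity). This is an assumed stand-in: the text does not specify how cluster labels are matched to true tables, and the function name `external_accuracy` is hypothetical.

```python
from collections import Counter, defaultdict


def external_accuracy(predicted, truth):
    """Share of rows whose cluster's majority label equals their true label.

    predicted and truth are equal-length, non-empty sequences with one
    entry per row.
    """
    labels_by_cluster = defaultdict(list)
    for cluster_id, true_label in zip(predicted, truth):
        labels_by_cluster[cluster_id].append(true_label)
    # Each cluster contributes the size of its largest true-label group.
    correct = sum(
        Counter(labels).most_common(1)[0][1]
        for labels in labels_by_cluster.values()
    )
    return correct / len(truth)
```

With predicted clusters `[0, 0, 0, 1, 1]` and true tables `["a", "a", "b", "b", "b"]`, four of the five rows agree with their cluster's majority label, giving a score of 0.8.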

6) ASSIGNING ROWS TO CLUSTERS
In this step, each row is assigned to one detected table based on the relationship between the rows and the comparison objects (structures), which were themselves assigned to the clusters in the previous steps. Each row is thus assigned to a specific cluster for further processing, based on the associated table. In subsequent steps, the tokens of each row are assigned to the corresponding columns of a detected table.

7) SAMPLING DATA FOR CLUSTERING ON COLUMN LEVEL -OPTIONAL
Data sampling is performed during this step. This step aims to reduce the computational time of token clustering by using only a subset of rows and their tokens in the column-clustering phase. It is essential to note that all rows are clustered based on this sampling, not just the sampled rows. This step is optional.
Sampling is performed over all rows to be structured, from which only p percent (here, 20% of the data) is randomly selected. A row whose structure appears only once is always selected for the sample. In this step, the token content is also changed. If a row contains the same token at multiple positions, a new token value is created using a combination of the initial value, position, and class type. If a token value appears more than twice at the row level, the token value is changed by combining the initial value, position, and class type.
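The sampling rule can be sketched as follows. The interpretation that singleton-structure rows are added on top of the random p percent, the fixed seed, and the name `sample_rows` are assumptions for illustration.

```python
import random
from collections import Counter


def sample_rows(row_to_structure, fraction=0.2, seed=0):
    """Pick a random fraction of rows, always keeping singleton structures.

    row_to_structure maps a line ID to the (hashable) structure of that row.
    """
    counts = Counter(row_to_structure.values())
    # Rows whose structure appears only once are always kept.
    singletons = [r for r, s in row_to_structure.items() if counts[s] == 1]
    others = [r for r, s in row_to_structure.items() if counts[s] > 1]
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    k = min(len(others), round(fraction * len(row_to_structure)))
    return sorted(set(singletons) | set(rng.sample(others, k)))
```

Keeping every singleton structure ensures that rare row layouts are represented in the column-clustering phase even when the random draw would have missed them.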

8) CLUSTERING DATA ON COLUMN LEVEL
In this step, the second clustering phase is performed. This phase aims to discover columns based on the values of tokens and their order in each row of the input file, the frequency of their appearance, and the similarity between them. Each discovered cluster represents a column.
During this phase, five standard clustering algorithms were tested: K-means, Mini-batch K-means, Birch, Gaussian mixture, and Fuzzy K-means (FCM). To determine the number of clusters (parameter k), the structure with the most rows belonging to the cluster/table is selected. The number of elements in this structure determines the parameter k. If two structures have the same highest number of rows, the structure with fewer columns is selected. It is assumed that most data within a cluster were obtained from the same original table.
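The rule for choosing k can be expressed compactly. In this sketch, the function name `choose_k` and the input layout (a mapping from structure to row count within one detected table) are assumptions.

```python
def choose_k(structure_counts):
    """Return k: the length of the structure covering the most rows.

    structure_counts maps a structure (tuple of token types) to the number
    of rows in the table that have that structure. Ties on row count are
    broken in favour of the structure with fewer elements.
    """
    best_structure, _ = min(
        structure_counts.items(),
        key=lambda item: (-item[1], len(item[0])),
    )
    return len(best_structure)
```

For instance, if a three-element and a two-element structure both cover ten rows, the tie is broken in favour of the shorter structure and k is 2.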
The results of this step are the sets of detected clusters/columns. Each set corresponds to one of the selected clustering algorithms.

9) COLUMN ALIGNMENT
In this step, multiple tokens are combined and aligned into a single column. Column alignment is performed in two phases:
• In the first phase, adjacent columns that belong to the same cluster, according to the values attributed by the clustering algorithm, are merged. Here, the token value type is not considered, because some tokens are recognized as numbers but are part of the messy text.
• In the second phase, any remaining redundant columns are attached to the last column, and the value assigned by the algorithm to the previous position is maintained.
An alignment process is performed for each clustering algorithm. The results of this step are sets of aligned tokens in the clusters/columns.
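The two alignment phases can be sketched for a single row as follows. The function name `align_row`, the space used to join merged tokens, and the assumption that k is at least 1 are illustrative choices; the original delimiters between merged tokens are not recoverable in this simplified form.

```python
def align_row(tokens, labels, k, joiner=" "):
    """Align one row's tokens into at most k columns.

    tokens: token values in row order.
    labels: cluster label assigned to each token by the clustering step.
    k: expected number of columns for the table (assumed to be >= 1).
    """
    # Phase 1: merge adjacent tokens that share the same cluster label,
    # regardless of their detected value type.
    merged = []
    for token, label in zip(tokens, labels):
        if merged and merged[-1][1] == label:
            merged[-1][0] += joiner + token
        else:
            merged.append([token, label])
    # Phase 2: attach any columns beyond k to the last column, keeping the
    # label that the last retained column already had.
    if len(merged) > k:
        tail = joiner.join(token for token, _ in merged[k - 1:])
        merged = merged[:k - 1] + [[tail, merged[k - 1][1]]]
    return [(token, label) for token, label in merged]
```

For a row with tokens `["Main St", "12", "London", "UK", "x"]`, labels `[0, 0, 1, 2, 3]`, and k = 3, the first phase merges the two address tokens and the second phase folds the surplus token into the last column.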

10) EVALUATION OF RESULTS AND SELECTING THE MOST APPROPRIATE RESULT -PHASE2
In this step, a set of detected clusters/columns is selected. Each set contains clusters. Each cluster corresponds to an extracted column in the selected table. One set of results is selected using external validation at the table and column level. This step results in the selected detected tables and their populated columns.

11) GENERATE OUTPUT
In this step, the datasets produced in the previous phases are exported for use in various software environments.

D. POSITIONING
Based on previous research, this algorithm can be classified as an implicit schema discovery algorithm, because no additional information is available in the input messy file. The STCExtract algorithm attempts to characterize a dataset by grouping similar instances. This algorithm uses an unsupervised machine learning technique and enables the use of multiple clustering algorithms. The scalability of the STCExtract algorithm relies on the selected algorithms and parameters. The solution is reached simultaneously (not incrementally) and locally, and the results are stable. Input data are messy text files with unknown data types but with predefined patterns for detecting primitive types. The output of the algorithm represents a structure in the form of tables and columns, corresponding to the original dataset. The quality of a solution can be measured by calculating its accuracy, precision, recall, or adequate substitutes.

IV. EXPERIMENTAL SETUP
This section describes the data sources used in our experiments. It also states the problems addressed in the experiments and formally describes the procedure for creating and selecting the final messy-delimited test files. Four experiments were conducted, as illustrated in Figure 2.
The overall experiment was set up such that Experiments 1, 2, and 3 were used to determine the input parameters for the STCExtract algorithm. Each experiment investigated problems defined by the research questions addressed in the previous section. In Experiment 1, the algorithm results were tested using datasets with tables containing only columns with a scalar data type. In Experiments 2 and 3, the algorithm's results were tested using datasets with tables containing columns with scalar data types and one or more columns with vector data types, respectively.
The evaluation results of these three experiments determined the parameters of the STCExtract algorithm that were used as the input parameters for Experiment 4. The evaluation of the results of Experiment 4 represents the final results of this paper and was performed according to the methodology described in [23].
A. DATA
Data were collected from two publicly available sources: the Internet Movie Database and Kaggle. The data sources are described in the following sections. Brief information regarding the datasets used is provided in Table 1, and more detailed information is provided in the Appendix.

1) IMDB DATABASE
The Internet Movie Database (IMDb) contains information and statistics on films, television programs, actors, directors, and other professionals in the film industry. The official IMDb database contains seven datasets. Each dataset is a compressed tab-separated values (TSV) file in UTF-8 format and is updated daily. IMDb datasets are available to customers for personal and noncommercial purposes. Four datasets from this database were used [24].

2) THE GAME REVIEWS
The Game Review Dataset is a structured dataset comprising four files with information on game reviews. Only the review_info.csv file was used in the experiments. One column was removed from each file [25].

3) THE CALENDAR
The Calendar database is a catalog of over three million records with limited information about data sources. This data source was used because it contained data fields in the dataset [26].
VOLUME 11, 2023

4) BARCELONA TRAFFIC ACCIDENTS
The dataset was created by reconciling yearly traffic accident reports in Barcelona recorded by Guardia Urbana and provided by OpenDataBCN (public Barcelona government data portal). The released dataset contains 27 columns. Seventeen columns were used in the experiments [27].

5) MONTHLY ELECTRICITY PRODUCTION
The data were collected from the International Energy Agency (IEA) website and include monthly information on energy production in various countries from 2010 to 2022. Energy production is measured in gigawatt-hours (GWh) and covers a range of energy products including hydro, wind, solar, geothermal, nuclear, and fossil fuels [28].

6) PORT SHIPMENT DATASET
This dataset contains information on the international shipments of ports. The dataset includes the name of the product, its price, weight, length, width, and height, as well as the date on which it was shipped and the destination port. This dataset is not based on actual shipments or real-world data [29].

7) LARGE MOVIE DATASET
This dataset contains over 200k movies watched by 7k+ unique users, ratings, and genres. The dataset includes the ID of a user, name of the movie watched, rating given by the user, and movie genre [30].

8) EVERY PUB IN ENGLAND
This dataset includes information on 51,566 pubs. It contains the following columns: the Food Standard Agency's ID for the pub, name of the pub, address fields separated by commas, postcode of the pub, easting, northing, latitude, longitude, and the local authority the pub falls under. The data were derived from the Food Standard Agency's Food Hygiene Ratings and the ONS Postcode Directory. The data are licensed under an open government license [31].

9) INSTITUTES OF TECHNOLOGY ADMISSIONS DATASET
This dataset contains 200,000 student profiles from the Indian Institute of Technology (IIT). The dataset provides valuable information on student admissions, academic background, and financial aspects. It includes fields such as student ID, date of birth, field of study, specialization, year of admission, expected year of graduation, current semester, fees, and discounts on fees [32].

B. EXPERIMENTS

1) EXPERIMENT 1
The objective of Experiment 1 was to determine the combination of parameters and applied algorithms that performed best in the clustering of structures, if the input data contained only columns with scalar data types.
Structures from various tables, with all columns containing only scalar data types, were clustered in Experiment 1. There were three test datasets: the first was created based on two tables (exp1_case1), the second on three tables (exp1_case2), and the third on five tables (exp1_case3) from the input dataset (Table 2).

2) EXPERIMENT 2
The objective of Experiment 2 was to determine the combination of parameters and applied algorithms that performed best in the clustering of structures when the input data contained columns of scalar and vector data.
The clustering of structures belonging to various tables was performed in Experiment 2, with each table containing columns of scalar data type and one table containing one column of vector data type in addition to its scalar columns. Four test datasets were used. The first dataset was created based on two tables (exp2_case0), the second on three tables (exp2_case1), the third on four tables (exp2_case2), and the fourth on six tables (exp2_case3) of the input dataset. The datasets for Experiment 2 were generated by using the tables used in Experiment 1. Experiment 2 used an additional table with one column of vector data type (Table 2).

3) EXPERIMENT 3
The objective of Experiment 3 was to determine the combination of parameters and applied algorithms that performed best in the clustering of structures when the input data contained columns with scalar and multiple vector data types.
The clustering of structures belonging to various tables was performed in Experiment 3, with each table containing columns of scalar data types and one table containing two columns of vector data type in addition to its scalar columns. There were four test datasets. The first dataset was created based on two tables (exp3_case0), the second on three tables (exp3_case1), the third on four tables (exp3_case2), and the fourth on six tables (exp3_case3) from the input dataset. The datasets for Experiment 3 were created using the tables from Experiment 1. Experiment 3 used an additional table with two columns of vector data type (Table 2).

4) EXPERIMENT 4
The objective of Experiment 4 was to re-evaluate clustering using the STCExtract algorithm with selected combinations of input parameters for Phase 1 for clustering structures, and to use clustering algorithms selected for Phase 2 for clustering table columns.Experiment 4 used five new datasets, and one dataset from Experiment 2.
In Experiment 4, the clustering of structures belonging to different tables was performed, where the tables had only columns with scalar data types (case1s, case1n, and case2s). Structure clustering was also performed for datasets that combine tables having only scalar columns with tables having both scalar and vector columns (case1v and case2v). Five test datasets were used: the first was created based on two tables (exp4_case1s), the second on three tables (exp4_case2s), the third on three tables (exp4_case1n), the fourth on three tables (exp4_case1v), and the fifth on four tables (exp4_case2v) from the input dataset (Table 3).

C. CREATION OF MESSY DELIMITED FILE
The messy test datasets required for the experiments are not publicly available. Therefore, messy datasets for the experiments were created based on publicly available tabular datasets. Depending on the experiment, two or more datasets (Table 1) were selected as listed in Table 2 or Table 3.
During the selection process, column names were eliminated, column positions were changed in randomly selected rows, and fields were randomly deleted. For the first three experiments, 100 messy datasets were created for each of the 11 cases, that is, 1100 messy datasets in total. Each table from the original dataset is represented by the same number of rows to create a balanced dataset. The rule that the converted structures do not share rows between the input datasets was used as an additional criterion for selecting the final test sample.
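The corruption steps described above can be illustrated with a minimal Python sketch. It is not the authors' generator: the function name `make_messy_rows`, the probabilities `swap_prob` and `delete_prob`, the single `;` delimiter, and the final shuffling of rows are all assumptions introduced for illustration only.

```python
import random

def make_messy_rows(tables, rows_per_table, swap_prob=0.3, delete_prob=0.1,
                    delimiter=";", seed=0):
    """Build the lines of a messy delimited file from several tables.

    tables: list of tables; each table is a list of rows and each row is a
    list of string fields. Column names are assumed to be already dropped.
    """
    rng = random.Random(seed)
    lines = []
    for table in tables:
        # The same number of rows per table keeps the dataset balanced.
        for row in rng.sample(table, rows_per_table):
            fields = list(row)
            # Change column positions in randomly selected rows.
            if len(fields) > 1 and rng.random() < swap_prob:
                i, j = rng.sample(range(len(fields)), 2)
                fields[i], fields[j] = fields[j], fields[i]
            # Randomly delete fields, but always keep at least one.
            kept = [f for f in fields if rng.random() >= delete_prob]
            if not kept:
                kept = [rng.choice(fields)]
            lines.append(delimiter.join(kept))
    # Assumption: rows of different tables are interleaved in the output.
    rng.shuffle(lines)
    return lines
```

Calling `make_messy_rows([table_a, table_b], rows_per_table=100)` would yield 200 header-less lines in which the table of origin is no longer marked, which is the kind of input Phase 1 has to separate again.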

D. DESIGN OF EXPERIMENTS 1, 2 AND 3
Design of experiments (DOE) comprises a series of steps: planning, execution of the experiment, and analysis of the collected experimental data using various statistical methods to draw valid and objective conclusions [33]. An experiment can be defined as a series of tests in which a set of input variables or factors (x) is changed by an experimenter in a controlled manner (c) to observe and identify how the response (y) of the system is affected by these changes [34]. The experiments were performed systematically using factorial experiments (so-called factorial experimental designs/arrays), wherein several factors were altered during each experimental run to examine the impact of the factors and their interactions on the response quantity [33]. A factorial experiment whose design consists of all possible combinations of chosen factors and levels is called a full-factorial design (FFD) [35].
Each experiment consisted of two phases. The goal of Phase 1 is to obtain clusters corresponding to the data tables from the input dataset. Steps 1-6 of the STCExtract algorithm create clusters at the table level. The goal of Phase 2 is to obtain clusters corresponding to the columns of the corresponding tables based on the clusters created after Phase 1. Steps 7-11 of the STCExtract algorithm are used for clustering at the column level.
In Phase 1, six distance functions, five cluster center detectors, three object comparators, and two algorithms for clustering structures were tested. The distance functions, object comparators, cluster center detectors, and algorithms for clustering structures represent the input parameters of the proposed STCExtract algorithm, that is, factors with different levels. Because the number of levels was not the same for every factor, these experiments were asymmetrical (mixed-level) factorial experiments, and the number of unique factor combinations was $\#\mathrm{Runs} = 6^1 \times 5^1 \times 3^1 \times 2^1 = 180$ (referred to in this paper as ResultsPh1). ResultsPh1 was generated for each of the 11 cases in the three experiments. In total, 1980 ResultsPh1 were obtained.
In Phase 2 of the STCExtract algorithm for column clustering, five machine learning algorithms were tested: K-means, FCM (Fuzzy C-means), Gaussian Mixture, Birch (balanced iterative reducing and clustering with hierarchies), and Mini Batch K-means. The proposed STCExtract algorithm includes these machine learning algorithms for clustering columns as an additional parameter.
Combining all the mentioned parameters from Phase 1 and Phase 2 of the STCExtract algorithm, $6^1 \times 5^1 \times 3^1 \times 2^1 \times 5^1 = 900$ combinations for clustering the input dataset were tested for each case. A total of 9900 column-level clustering results (hereafter referred to as ResultsPh2) were obtained.
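The run counts of the full-factorial design can be checked by enumerating the factor levels. The sketch below only counts combinations; the level names are the ones used in this paper, while the variable names are chosen here for illustration.

```python
from itertools import product

# Factor levels as named in the paper.
DISTANCES = ["linear", "linear_weight", "log", "log_weight",
             "reverse", "reverse_weight"]
DETECTORS = ["GroupMax", "GroupMax2", "MaxFreq", "MinDistance", "PositionMax"]
MATCHERS = ["ElementMatcher", "ElementMatcher1", "ElementMatcher2"]
STRUCT_ALGOS = ["Hierarchical", "K-means"]
COLUMN_ALGOS = ["K-means", "FCM", "Gaussian Mixture", "Birch",
                "Mini Batch K-means"]
N_CASES = 11  # cases across Experiments 1, 2, and 3

# Full-factorial design: every combination of every level.
phase1_runs = list(product(DISTANCES, DETECTORS, MATCHERS, STRUCT_ALGOS))
phase2_runs = list(product(phase1_runs, COLUMN_ALGOS))

print(len(phase1_runs), len(phase1_runs) * N_CASES)  # 180 1980
print(len(phase2_runs), len(phase2_runs) * N_CASES)  # 900 9900
```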

E. TOOLS
The Python 3.10 programming language in PyCharm 2021.3.1 (Community Edition) was used to implement the STCExtract algorithm. For data analysis, Python scripts were created in Jupyter Notebook 6.5.4. MS SQL Server 2008R2 was also used for data analysis. Testing was performed on three personal computers as follows:
• 11th Gen Intel® Core™ i5-1135G7 Processor with 8 GB RAM.

F. EVALUATION METRICS
After the formation of the cluster, that is, the creation of the generated output tables, the obtained results were compared with the original data at the level of rows belonging to the corresponding tables.Based on this comparison, external measures, such as Adjusted Mutual Information (AMI) and Adjusted Rand index (ARI), were calculated for each ResultsPh1.
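As an illustration of the external measure used here, the following sketch computes the Adjusted Rand index from the standard pair-counting formula using only the Python standard library. It is not taken from the STCExtract implementation; the function name and the representation of rows as two parallel label lists are assumptions. AMI is not covered by this sketch.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_tables, cluster_ids):
    """ARI between the original table of each row and its assigned cluster."""
    n = len(true_tables)
    if n < 2:
        return 1.0  # nothing to compare
    # Contingency counts: rows per (table, cluster), per table, per cluster.
    cells = Counter(zip(true_tables, cluster_ids))
    tables = Counter(true_tables)
    clusters = Counter(cluster_ids)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_tables = sum(comb(c, 2) for c in tables.values())
    sum_clusters = sum(comb(c, 2) for c in clusters.values())
    expected = sum_tables * sum_clusters / comb(n, 2)
    max_index = 0.5 * (sum_tables + sum_clusters)
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)
```

A value of 1 means that the clusters reproduce the original tables exactly (up to renaming), and values near 0 correspond to random labeling.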
To determine the number of correctly recognized rows overall, the number that belong to tables in which all columns have a scalar data type, and the number that belong to tables that contain columns with a vector data type in addition to columns with a scalar data type, the following values were calculated:

$$\mathrm{accuracy}_{ph1} = \frac{x_{ph1}}{y_{ph1}} \quad (1)$$

$$\mathrm{accuracy}_{ph1\_scalar} = \frac{x_{ph1\_scalar}}{y_{ph1\_scalar}} \quad (2)$$

$$\mathrm{accuracy}_{ph1\_vector} = \frac{x_{ph1\_vector}}{y_{ph1\_vector}} \quad (3)$$

In Equation (1), $x_{ph1}$ is the total number of correctly recognized rows for each input table and $y_{ph1}$ is the sum of the rows from all input tables. Because a cluster may contain data from various input tables, the number of correctly recognized rows $x_{ph1}$ was obtained from the table with the maximum number of correctly recognized rows in the given cluster.
In Equation (2), $x_{ph1\_scalar}$ is the number of correctly recognized rows for each input table where all columns have a scalar data type, and $y_{ph1\_scalar}$ is the total number of rows of all input tables where all columns have a scalar data type.
In Equation (3), $x_{ph1\_vector}$ is the number of correctly recognized rows for each input table with a column or columns containing a vector data type, and $y_{ph1\_vector}$ is the total number of rows of all the input tables with columns of a vector data type.
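The Phase 1 accuracy values can be sketched as follows. The sketch assumes that each cluster is credited with the rows of its most frequent original table, and that the scalar and vector variants simply restrict the count to a subset of tables; the function name `phase1_accuracy` and the `subset` parameter are hypothetical.

```python
from collections import Counter, defaultdict

def phase1_accuracy(true_tables, cluster_ids, subset=None):
    """Return x / y, where x counts rows of the dominant table per cluster.

    subset: optional set of table names (e.g. only scalar tables or only
    tables with vector columns); None means all input tables.
    """
    per_cluster = defaultdict(Counter)
    for table, cluster in zip(true_tables, cluster_ids):
        per_cluster[cluster][table] += 1
    correct = 0
    for counts in per_cluster.values():
        # The table with the maximum number of rows in this cluster.
        table, rows = counts.most_common(1)[0]
        if subset is None or table in subset:
            correct += rows
    total = sum(1 for t in true_tables if subset is None or t in subset)
    return correct / total if total else 0.0

# Example: tables A and B are scalar-only, table V has a vector column.
truth = ["A"] * 4 + ["B"] * 4 + ["V"] * 2
found = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2]
print(phase1_accuracy(truth, found))                     # 0.9
print(phase1_accuracy(truth, found, subset={"A", "B"}))  # 0.875
print(phase1_accuracy(truth, found, subset={"V"}))       # 1.0
```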
After the formation of clusters at the column level, that is, the creation of the generated output tables, the obtained ResultsPh2 were compared with the original data at the level of rows and columns belonging to the corresponding tables. Because one cluster may contain data from several input tables, the number of correctly recognized rows per table and column was taken from the table with the maximum number of correctly recognized rows in the given cluster.
To determine the number of correctly recognized rows and columns overall, the number that belong to tables in which all columns have a scalar data type, and the number that belong to tables that contain columns with a vector data type in addition to columns with a scalar data type, the following values were calculated:

$$\mathrm{accuracy}_{ph2} = \frac{x_{ph2}}{y_{ph2}} \quad (4)$$

$$\mathrm{accuracy}_{ph2\_scalar} = \frac{x_{ph2\_scalar}}{y_{ph2\_scalar}} \quad (5)$$

$$\mathrm{accuracy}_{ph2\_vector} = \frac{x_{ph2\_vector}}{y_{ph2\_vector}} \quad (6)$$

In Equation (4), $x_{ph2}$ is the total number of correctly recognized rows for each input table and all its columns, and $y_{ph2}$ is the sum of the rows from all input tables. If one cluster contains data from several input tables, the number of correctly recognized rows and columns $x_{ph2}$ was obtained from the table with the maximum number of correctly recognized rows with all columns in the given cluster.
In Equation (5), $x_{ph2\_scalar}$ is the number of correctly recognized rows for each input table and its columns, where all columns contain scalar data types, and $y_{ph2\_scalar}$ is the total number of rows for all input tables where all columns contain scalar data types.
In Equation (6), $x_{ph2\_vector}$ is the number of correctly recognized rows and columns for each input table with a column or columns containing the vector data type, and $y_{ph2\_vector}$ is the total number of rows for all input tables with columns containing the vector data type.

G. EVALUATION PHASE 1 -CLUSTERING STRUCTURES
After Phase 1 of the STCExtract algorithm, three analyses were performed.
• The Adjusted Rand index (ARI) metric results for each experiment and case were analyzed independently for each parameter of the proposed STCExtract algorithm to determine whether any parameter significantly affected ResultsPh1 and how they interacted.
• The number of recognized clustered structures (k: number of clusters) and the structure of the original dataset (t: number of tables) were analyzed independently.
• Regardless of the experiment, the best results for the Adjusted Mutual Information (AMI) and Adjusted Rand index (ARI) metrics for 1980 ResultsPh1 were compared.
In the first two analyses, each input parameter was analyzed independently to determine whether any parameter significantly affected ResultsPh1 at the case level, and how the parameters interacted with each other at the experimental level. For each ResultsPh1, a consolidated measure was calculated. This measure was obtained by comparing the ARI values of ResultsPh1 among the experimental cases and choosing the minimum ARI value. Descriptive statistics (measures of central tendency and variability) were used to evaluate the ARI values of ResultsPh1. In addition, a one-way analysis of variance (ANOVA) test, with the significance level set at 0.05, was used to compare the means of the input parameters for each experiment. ANOVA is a statistical tool for evaluating whether a null hypothesis can be rejected; it tests whether the means of three or more groups are equal. For example, for the object comparators, the null hypothesis was that any difference between the three object comparator functions would be due to chance. An ANOVA test typically uses a statistic called the 'p-value'. The null hypothesis was rejected if the p-value was less than the significance level. A more detailed explanation is provided in [35]. Tukey's Honest Significant Difference (Tukey's HSD) test was used to determine which groups differed.
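The consolidated measure and the ANOVA test statistic can be sketched with the standard library alone. The function names and the dictionary layout are assumptions made for illustration; the sketch returns only the F statistic, and obtaining the p-value additionally requires the F distribution (for example via `scipy.stats.f_oneway`, which is not used here).

```python
from statistics import mean

def consolidated_measure(ari_by_case):
    """Minimum ARI of one parameter combination across an experiment's cases."""
    return min(ari_by_case.values())

def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over a list of groups of ARI values."""
    k = len(groups)                          # number of groups (factor levels)
    n = sum(len(g) for g in groups)          # total number of observations
    grand = mean(x for g in groups for x in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical ARI values for one parameter combination in Experiment 1.
print(consolidated_measure(
    {"exp1_case1": 0.9, "exp1_case2": 0.7, "exp1_case3": 0.8}))  # 0.7
```

For the object comparators, `groups` would hold three lists of 60 ARI values each, one per matcher type.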
In addition, at the case level, the number of clusters k obtained after Phase 1 and the number of tables t used when creating a test dataset with a broken structure were calculated for each ResultsPh1. Based on these two parameters, the number of ResultsPh1 in which k is equal to, greater than, or less than t was calculated for each case and each experiment. If k<t, several data tables have been merged within one cluster, which can degrade the performance of the solution. In this case, the unstructured input data cannot be clustered correctly, because at least one cluster contains not one table but several different tables. If k>t, all tables may be recognized correctly but divided into several parts, with a certain number of table rows incorrectly classified. If k is equal to t, all input tables may be recognized correctly; however, inside one of the k clusters, rows may be incorrectly clustered and may belong to another table.
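The comparison of k and t reduces to a three-way classification plus a percentage, as in the following sketch. The labels "merged", "split", and "equal" and both function names are chosen here for illustration and do not come from the STCExtract code.

```python
def compare_k_t(k, t):
    """Classify one result by the number of clusters k and tables t."""
    if k < t:
        return "merged"  # several tables share at least one cluster
    if k > t:
        return "split"   # at least one table is spread over several clusters
    return "equal"

def share_k_ge_t(results):
    """Percentage of (k, t) pairs in which k is greater than or equal to t."""
    hits = sum(1 for k, t in results if k >= t)
    return 100.0 * hits / len(results)

pairs = [(2, 2), (3, 2), (1, 2), (2, 3)]
print([compare_k_t(k, t) for k, t in pairs])  # ['equal', 'split', 'merged', 'merged']
print(share_k_ge_t(pairs))                    # 50.0
```

The percentages reported in the following analyses (results in which the number of recognized clusters is greater than or equal to the number of input tables) correspond to `share_k_ge_t` applied to one group of results.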

1) ANALYSIS -OBJECT COMPARATOR
The results for the three object comparators (ElementMatcher, ElementMatcher1, and ElementMatcher2) in Experiments 1, 2, and 3 are shown in Figures 3, 4, and 5, respectively. The results based on ARI values from ResultsPh1 are shown in Figures 3a, 4a, and 5a. Figures 3b, 4b, and 5b compare the number of identified clusters k with the number of input tables t. The circles in Figures 3a, 4a, and 5a represent mean values.
Table 4 shows the mean values of the three object comparators according to the descriptive statistics for each case in Experiment 1.
This step tested the influence of the three object comparator functions on the ARI values of 180 ResultsPh1. ResultsPh1 was divided into three groups by matcher_type, with 60 results each. The test was repeated for each case in Experiment 1 and for the consolidated measure. The null hypothesis was rejected for the consolidated measure. Tukey's HSD test for the consolidated measure showed a p-value of 0.012 between ElementMatcher and ElementMatcher1.
In Experiment 1, ElementMatcher yielded the highest percentage of results (74.4%) in which the number of recognized clusters was greater than or equal to the number of input tables. ElementMatcher1 came second at 72.8% and ElementMatcher2 came third at 57.8%.
The ElementMatcher and ElementMatcher1 functions had a more significant impact on ResultsPh1 than ElementMatcher2 according to the analysis based on the comparison of the number of clusters and tables, ANOVA, and Tukey's HSD test for the consolidated measure of Experiment 1.
Table 5 shows the mean values of the three object comparators according to the descriptive statistics for each case in Experiment 2. As in Experiment 1, an ANOVA test was performed for all the cases in Experiment 2. The results showed that the null hypothesis should be rejected for all cases and the consolidated measure, except for exp2_case3. Tukey's HSD test for the consolidated measure showed a p-value of 0.006 between ElementMatcher and ElementMatcher2. Regarding the comparison of the number of recognized clusters and input tables, in Experiment 2, ElementMatcher had the largest share of results in which the number of recognized clusters was greater than or equal to the number of input tables (73.33%), followed by ElementMatcher1 (64.17%) and ElementMatcher2 (50.42%). According to the second criterion for analyzing the results to determine the object comparator in Experiment 2, ElementMatcher achieved the best results. In the case of ANOVA and Tukey's HSD test for the consolidated measure, ElementMatcher and ElementMatcher2 influenced ResultsPh1.
Table 6 shows the mean values of the three object comparators according to the descriptive statistics for each case in Experiment 3. As in the previous two experiments, an ANOVA test was performed for all the cases in Experiment 3. The results showed that the null hypothesis should be rejected for exp3_case0 and exp3_case1. Since the ANOVA test did not show that the three object comparators significantly impacted ResultsPh1, the consolidated measure's descriptive statistics were consulted. It was determined that ElementMatcher2 has the highest maximum value (0.76), followed by ElementMatcher (0.65) and ElementMatcher1 (0.59). As shown in Figure 5, ElementMatcher1 has a median value of 0.17 for the consolidated measure, followed by ElementMatcher at 0.14 and ElementMatcher2 at 0.00. Regarding the comparison of the number of recognized clusters, in Experiment 3, ElementMatcher had the largest share of results in which the number of recognized clusters was greater than or equal to the number of input tables (67.5%), followed by ElementMatcher1 (54.58%) and ElementMatcher2 (31.67%).
According to the second criterion for analyzing the results and determining the functions for comparing objects in Experiment 3, ElementMatcher achieved the best results.Regarding descriptive statistics for the consolidated measure, ElementMatcher1 had the most significant influence on the results of clustering structures (ResultsPh1).

2) ANALYSIS -DISTANCE FUNCTIONS
The comparative results for the six distance functions (linear, linear_weight, log, log_weight, reverse, and reverse_weight) in Experiments 1, 2, and 3 are shown in Figures 6, 7, and 8. The results based on the ARI values from ResultsPh1 are shown in Figures 6a, 7a, and 8a. Figures 6b, 7b, and 8b compare the number of identified clusters k with the number of input tables t. The circles in Figures 6a, 7a, and 8a represent mean values.
Table 7 shows the mean values of the six distance functions according to the descriptive statistics for each case in Experiment 1. This step tested the influence of the six distance functions on ARI values using 180 ResultsPh1. The results were divided into six groups according to distance type, with 30 results each. An ANOVA was used to determine whether there was a significant difference between the six distance functions. The test was repeated for each case and the consolidated measure in Experiment 1. The analysis showed that the null hypothesis was rejected only in the case of exp1_case3. When comparing the number of recognized clusters, the linear function had the highest share of results in Experiment 1 (82.22%) in which the number of recognized clusters was greater than or equal to the number of input tables, followed by the log function (77.78%). Since the ANOVA test did not show that any of the six distance functions significantly impacted ResultsPh1, the results of descriptive statistics for the consolidated measure were used. Regarding the maximum value at the consolidated measure level, the functions linear_weight and log_weight had the highest recorded values (0.93), followed by linear (0.90), reverse (0.88), log (0.83), and reverse_weight (0.81). However, if we consider the median for the consolidated measure, the reverse function has the highest median value (0.90), followed by the linear and log functions (0.58).
A linear function can be selected according to the second criterion to determine the influence of the distance functions on the results of structure clustering.Based on the analysis of the ARI values of the consolidated measures, the linear_weight and log_weight functions can be selected according to the maximum value, and the reverse function according to the median.
Table 8 shows the mean values of the six distance functions according to the descriptive statistics for each case in experiment 2.
Based on the ANOVA test, the null hypothesis was rejected for exp2_case0. Regarding the comparison of the number of recognized clusters and input tables, the linear and reverse functions had the highest share of experimental results (73.33%) in which the number of recognized clusters was greater than or equal to the number of input tables, followed by the log function (70.83%). Because ANOVA did not show that any of the six distance functions had a significant impact on the results of the clustering of structures, the results of the descriptive statistics for the consolidated measure were consulted. For the maximum value at the consolidated measure level, the linear_weight function has the highest recorded value (0.83), followed by the log_weight function (0.73). However, if we consider the median of the consolidated measure, the reverse function has the highest median value (0.37), followed by the linear and log functions (0.34).
A linear function can be selected according to the second criterion to determine the influence of the distance functions on the results of structure clustering.Based on the analysis of the ARI values of the consolidated measures, the linear_weight function can be selected according to the maximum value, and the reverse function according to the median.
Table 9 shows the mean values of the six distance functions according to the descriptive statistics for each case in experiment 3.
Based on the ANOVA test, the null hypothesis was rejected only for exp3_case0. Regarding the comparison of the number of recognized clusters, the linear and reverse functions had the highest share of results in Experiment 3 (65%) in which the number of recognized clusters was greater than or equal to the number of input tables, followed by the log function (61.67%). Since the ANOVA test did not show that any of the six distance functions significantly influenced ResultsPh1, the results of descriptive statistics for the consolidated measure were consulted. For the maximum value of the consolidated measure, the linear_weight function had the highest recorded value (0.76), followed by the reverse_weight function (0.65). However, considering the median of the consolidated measure, the linear function had the highest value (0.47), followed by the log function (0.45).
According to the second criterion for determining the influence of distance functions on the results of structure clustering, the linear and reverse functions can be selected. Based on an analysis of the ARI values of the consolidated measures, the linear_weight function can be selected according to the maximum value, and the linear function can be selected according to the median.

3) ANALYSIS OF CLUSTER CENTER DETECTORS
The five cluster center detectors (GroupMax, GroupMax2, MaxFreq, MinDistance, and PositionMax) were compared in Experiments 1, 2, and 3.
Table 10 shows the mean values of the five cluster center detectors according to the descriptive statistics for each case in Experiment 1. In this step, the influence of the five cluster center detector functions on the ARI values was tested using 180 ResultsPh1. The results were divided into five groups by combiner_type, with 36 results each. The ANOVA test determined whether there was a significant difference between the five cluster center detectors. The test was repeated for each case in Experiment 1. The analysis showed that the null hypothesis was rejected in every case and at the consolidated measure level. Based on Tukey's HSD test, the interdependence between most functions for detecting the cluster center was determined in the case of the consolidated measure. All the other functions interacted with PositionMax, and PositionMax yielded the worst results. Analysis of the descriptive statistics for the consolidated measure determined that, for the maximum value, the GroupMax2 function had the highest recorded value (0.93), followed by MinDistance (0.90). However, if we consider the median of the consolidated measure, the MaxFreq function has the highest median value (0.74), followed by the MinDistance function (0.61). Regarding the comparison of the number of recognized clusters, the MaxFreq function had the highest share of results in Experiment 1 (90.74%) in which the number of recognized clusters was greater than or equal to the number of input tables, followed by the MinDistance function (77.78%).
The MaxFreq function can be selected according to the second criterion for determining the influence of the cluster center detectors on ResultsPh1. In the case of analysis based on the ARI values of the consolidated measures, the GroupMax2 function can be selected according to the maximum value and the MaxFreq function according to the median.
Table 11 shows the mean values of the five cluster center detectors according to the descriptive statistics for each case in Experiment 2. In the case of the consolidated measure, the interdependence between most of the functions for detecting the cluster center was determined based on Tukey's HSD test. All the other functions interacted with PositionMax, and PositionMax yielded the worst results. Analysis of the descriptive statistics for the consolidated measure in Experiment 2 indicated that, for the maximum value, the MinDistance function had the highest recorded value (0.83), followed by GroupMax2 (0.76). However, if we consider the median for the consolidated measure, the MaxFreq function has the highest median value (0.59), followed by the MinDistance function (0.51). Regarding the comparison of the number of recognized clusters, the MaxFreq function had the highest share of results (86.81%) in which the number of recognized clusters was greater than or equal to the number of input tables, followed by the MinDistance function (74.31%).
The MaxFreq function can be selected according to the second criterion for determining the influence of the cluster center detectors on the results of structure clustering. Based on the analysis of the ARI values of the consolidated measures, the MinDistance function can be selected according to the maximum value and the MaxFreq function according to the median.
Table 12 shows the mean values of the five cluster center detectors according to the descriptive statistics for each case in Experiment 3. The ANOVA test showed that the null hypothesis was rejected for all cases in Experiment 3 and for the consolidated measure. Tukey's HSD test was used to establish the interdependence between most cluster center detectors. Regarding the comparison of the number of recognized clusters, the MaxFreq function had the highest share of results in which the number of clusters was greater than or equal to the number of input tables (83.33%), followed by the MinDistance function (73.61%). An analysis of the descriptive statistics for the consolidated measure in Experiment 3 revealed that the MaxFreq function had the highest recorded maximum value (0.76), followed by MinDistance (0.65). For the median of the consolidated measure, the MaxFreq function also had the highest value (0.49), followed by MinDistance (0.47).
The MaxFreq function can be selected according to the first and second criteria for determining the impact of the cluster center detectors on ResultsPh1.

4) ANALYSIS -CLUSTERING ALGORITHMS
The comparative results of the two clustering algorithms (Hierarchical and K-means) for Experiments 1, 2, and 3 are shown in Figures 12, 13.
This step tested the influence of the two clustering algorithms on the ARI values of ResultsPh1. The data analysis was performed using descriptive statistics. The analysis of the results based on the ARI value showed that, in all three cases of Experiment 1 and for the consolidated measure, the K-means algorithm had a higher median and mean value than the Hierarchical algorithm. K-means also has a higher maximum value for the consolidated measure (0.93) than the Hierarchical algorithm (0.88). When comparing the number of recognized clusters, the K-means algorithm had a higher percentage of Experiment 1 results (77.78%) than the Hierarchical algorithm (58.89%) in which the number of recognized clusters was greater than or equal to the number of input tables.
Based on previous analysis, K-means clustering can be used in Experiment 1.
An analysis of the results based on the ARI values showed that the K-means algorithm had higher mean and median values than the Hierarchical algorithm in all four cases of Experiment 2 and for the consolidated measure. Based on the consolidated measure, the Hierarchical algorithm achieves the highest maximum value (0.83), followed by K-means (0.76).
When comparing the results of Experiment 2 for the number of recognized clusters, the Hierarchical algorithm yielded slightly more results (62.78%) than the K-means algorithm (62.50%) in which the number of recognized clusters was greater than or equal to the number of input tables.
The Hierarchical algorithm can be used according to the second criterion to determine the impact on ResultsPh1. Based on the analysis of the ARI values of the consolidated measures, the Hierarchical algorithm can be chosen if the selection criterion is the maximum value, and the K-means algorithm if the selection criterion is the median.
An analysis of the results based on the ARI values showed that the K-means algorithm had higher mean and median values than the Hierarchical algorithm in all four cases of Experiment 3 and for the consolidated measure.
When comparing the results of Experiment 3 for the number of recognized clusters, the Hierarchical algorithm yielded slightly more results (52.50%) than the K-means algorithm (50.00%) in which the number of recognized clusters was greater than or equal to the number of input tables.
The Hierarchical algorithm can be used according to the second criterion to determine the impact on ResultsPh1. Based on the analysis of the ARI values of the consolidated measures, the Hierarchical algorithm can be chosen if the selection criterion is the maximum value, and the K-means algorithm if the selection criterion is the median.

5) ANALYSIS OF THE INTERDEPENDENCE OF ALL PARAMETERS
The results were observed separately in previous analyses at the case and experimental levels.In contrast, the results of the three experiments were observed together to determine which input parameter combination provided the best solution for all three experiments.An additional consolidated measure ('exp_mul') was added to this analysis to consider the three consolidated measures created at the practical level and to connect the three experiments.Adjusted Rand Information (ARI) and Adjusted Mutual Information (AMI) for every ResultsPh1 were used for this analysis.Namely, Result-sPh1 by ARI and AMI values for the consolidated measure 'exp_mul' were compared.
ARI and AMI scores of approximately zero indicate randomly labeled clusters, so results with such values were not considered further in the analysis. The analysis found that the consolidated measure 'exp_mul' based on AMI values equals zero for 50% of ResultsPh1, and that the consolidated measure 'exp_mul' based on ARI values equals zero for 53.3% of ResultsPh1. In the next step, the ResultsPh1 for which 'exp_mul' differs from zero were taken for both the AMI and ARI values. These two sets of ResultsPh1 differed by six results; that is, the set based on AMI values contained 3.33% more ResultsPh1. The 'exp_mul' values of all six results ranged from 0.011 to 0.014.
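The role of ARI as a chance-corrected score can be illustrated with a minimal sketch. The function below computes the standard Adjusted Rand Index from pair counts using only the Python standard library; the function name and the toy labelings are assumptions made for illustration and are not taken from the STCExtract implementation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand Index of two labelings of the same items (standard formula)."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("both labelings must cover the same items")
    n = len(true_labels)
    if n < 2:
        return 1.0  # no pairs to compare; treated as perfect agreement

    # Contingency counts: items per (true, predicted) pair, per true label, per predicted label.
    cell_counts = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)

    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2          # agreement of a perfect match
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (maximum - expected)


# Toy labelings (hypothetical): rows labeled by source table vs. by detected cluster.
same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A relabeled but identical partition scores 1, whereas a partition that cuts across the true one scores at or below zero, which is why near-zero values are treated as uninformative in the analysis.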
Given that these values are approximately equal to zero, the decision to continue working with the set of 84 ResultsPh1 can be accepted, because these results appear in both the ARI and AMI sets of ResultsPh1. Then, from the ResultsPh1 for which 'exp_mul' is not equal to zero, the results in which the number of clusters is less than the number of tables were excluded. Fifty-six results were excluded, and 28 structure-clustering results remained. The outcome of these steps is a set of ResultsPh1 (hereafter called SelectedResulsPh1) in which 'exp_mul' differs from zero according to both the ARI and AMI values and in which the number of clusters generated during structure clustering is greater than or equal to the number of input tables.
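The selection steps described above can be sketched as a simple filter. This is a minimal illustration under stated assumptions: the record type `ResultPh1`, its fields, the near-zero tolerance, and the reading of 'exp_mul' as the product of the three experiment-level consolidated measures (suggested by its name) are all hypothetical and are not confirmed details of the STCExtract implementation.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class ResultPh1:
    """Hypothetical record for one Phase 1 parameter combination."""
    params: str
    ari_per_experiment: tuple  # consolidated ARI for Experiments 1, 2, 3
    ami_per_experiment: tuple  # consolidated AMI for Experiments 1, 2, 3
    clusters: int              # number of recognized clusters (k)
    tables: int                # number of input tables (t)


# Assumed tolerance: values up to about 0.014 were treated as zero in the text.
NEAR_ZERO = 0.015


def exp_mul(values):
    """Assumed definition: product of the experiment-level consolidated measures."""
    return prod(values)


def select_results(results, tolerance=NEAR_ZERO):
    """Keep results whose 'exp_mul' is non-zero for both ARI and AMI and where k >= t."""
    return [
        r for r in results
        if abs(exp_mul(r.ari_per_experiment)) > tolerance
        and abs(exp_mul(r.ami_per_experiment)) > tolerance
        and r.clusters >= r.tables
    ]


# Toy data (not from the paper's experiments).
toy_results = [
    ResultPh1("combo_a", (0.8, 0.7, 0.9), (0.7, 0.7, 0.8), clusters=4, tables=3),
    ResultPh1("combo_b", (0.5, 0.0, 0.9), (0.6, 0.1, 0.8), clusters=3, tables=3),
    ResultPh1("combo_c", (0.6, 0.6, 0.6), (0.6, 0.6, 0.6), clusters=2, tables=3),
]
selected = [r.params for r in select_results(toy_results)]
```

In this toy run, only the first combination survives: the second has a zero product for ARI, and the third produces fewer clusters than input tables.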
A simple count over the SelectedResulsPh1 (28 results) for each input parameter showed that ElementMatcher participated in 14 results, ElementMatcher1 in 11, and ElementMatcher2 in three. Among the cluster center detectors, only three functions appeared in SelectedResulsPh1: MaxFreq in 18 results, GrupMax2 in 5, and MinDistance in the remaining 5. Regarding the distance functions in SelectedResulsPh1, the linear function appeared in eight results, linear_weight in six, log_inside in eight, log_weight in three, one further distance function in two, and reverse_weight in one. Regarding the clustering algorithm, the Hierarchical algorithm yielded 19 results and K-means yielded nine.
For further selection from SelectedResulsPh1, the average value over all 11 cases ('exp_average') was calculated based on the AMI and ARI values. A position was assigned to each unique 'exp_mul' and 'exp_average' value. According to ARI, 19 unique values were determined for 'exp_mul' and 25 for 'exp_average'. According to AMI, 16 unique values were determined for 'exp_mul' and 25 for 'exp_average'. Inspection of the results showed that some parameter combinations share the same 'exp_mul' or 'exp_average' value; the data were therefore processed without rounding. Two possible combinations of parameters that yielded the best results were obtained based on independent counting of the input parameters. The results for these two input parameter combinations are listed in Table 13. Considering the number of unique values for ARI and AMI, it was determined that these combinations of input parameters were not among the top ten results. Table 14 shows the results that fall within the top 10 values per position for the two values 'exp_mul' and 'exp_average' according to ARI and for the two values 'exp_mul' and 'exp_average' according to AMI. The first two parameter combinations listed in Table 14 are among the top three combinations. These were selected as the best combinations of input parameters, for which the results of Phase 1 of the STCExtract algorithm were further analyzed at the column level.
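Assigning a position to each unique value amounts to a dense ranking, in which tied combinations share a position. The sketch below is one way to express this; the function name, the combination labels, and the scores are hypothetical, and the values are compared exactly (without rounding), as the text describes.

```python
def rank_positions(scores):
    """Map each parameter combination to the position of its unrounded score.

    Combinations with identical scores share a position (dense ranking),
    and the highest score receives position 1.
    """
    unique_desc = sorted(set(scores.values()), reverse=True)
    position_of = {value: pos for pos, value in enumerate(unique_desc, start=1)}
    return {name: position_of[value] for name, value in scores.items()}


# Toy 'exp_mul' scores (not from the paper's experiments).
toy_exp_mul = {"combo_a": 0.42, "combo_b": 0.42, "combo_c": 0.40, "combo_d": 0.10}
positions = rank_positions(toy_exp_mul)
```

Here the two tied combinations both receive position 1, so four combinations yield only three unique positions, which mirrors how 28 selected results can produce fewer unique values.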

H. EVALUATION PHASE 2 - CLUSTERING COLUMNS
The accuracy results of Phase 2 of the proposed STCExtract algorithm were analyzed independently for each Phase 2 parameter (the machine learning algorithm for clustering columns), for each experiment, and for each case within the experiment. The analysis used the combinations of input parameters that gave the best structure-clustering results after Phase 1.
Table 15, Table 16, and Table 17 show the combined results of Phase 1 (the structure clustering phase) and Phase 2 (the column clustering phase) for the two combinations of input parameters (selected clustering algorithm, distance function, cluster center detector, and object comparator) that yielded the best results in Phase 1. The input parameters "ElementMatcher2 - linear_weight - MaxFreq - Hierarchical" are represented by the column labeled "The first combination of input parameters," and the input parameters "ElementMatcher - linear_weight - MaxFreq - Hierarchical" are represented by the column labeled "The second combination of input parameters." Table 15 shows the results of Experiment 1, in which table columns can only have a scalar data type; it therefore reports only the values accuracy_ph1 and accuracy_ph2, because accuracy_ph1 = accuracy_ph1_scalar and accuracy_ph2 = accuracy_ph2_scalar.
Based on the results shown in Table 15, Table 16, and Table 17, it can be observed that the Gaussian mixture, K-means, and Mini-batch K-means algorithms yielded the same results for correctly recognized rows at the table and column levels. By contrast, the results for the remaining two algorithms were the same or slightly worse. Comparing the two combinations of input parameters for the Gaussian mixture, K-means, and Mini-batch K-means algorithms gives the following picture. In Experiment 1, the second combination of input parameters provided a better result for Case 3, and the results were identical for both combinations in the first two cases. In Experiment 2, the first combination of input parameters yielded better results in two cases and the second combination yielded better results in the other two cases. In Experiment 3, both combinations yielded the same results in one case, the first combination yielded a better result in two cases, and the second combination yielded a better result in one case. The results show that the STCExtract algorithm consistently recognizes tables in which all columns are of a scalar type, but does not recognize neighboring vector-type columns (e.g., exp3_case3).

I. RESULTS AND DISCUSSION
The results of Experiment 4, which evaluated the accuracy of the STCExtract algorithm, are presented in Tables 18, 19, and 20. The answers to the research questions that validate the proposed STCExtract algorithm are presented below.

1) ANSWER TO RQ1
The proposed STCExtract algorithm comprises two phases. Data structures are recognized in the first phase (the tables from which the data originate) and in the second phase (the columns of the tables from which the data originate). Machine learning techniques were used in each phase to identify the appropriate table and table column structures.
The STCExtract algorithm correctly arranged the structures of the tables with an accuracy of 94.4% to 100%, as shown in Table 18 from Experiment 4. The accuracy of the STCExtract algorithm in the second phase (when the data were allocated to columns) ranged from 59.7% to 90.2%. From Table 18, it can be observed that the accuracy of the proposed algorithm varies from case to case. Therefore, to better cover the behavior of the algorithm in its two phases, the lowest and highest accuracies recorded for the phases in each case of Experiment 4 were used as the range of the accuracy measure. The algorithm does not address the recognition of data subtypes or the names of the corresponding columns. The objective of new research on this issue might be to recognize the names of data subtypes, columns, and tables, as well as their relationships.

2) ANSWER TO RQ2
In the first and second phases of the STCExtract algorithm, the impact of six different independent parameters on identifying the similarity of tables or columns was analyzed. Based on Experiment 4 and its evaluation, the STCExtract algorithm better recognized tables in which all columns have scalar data types. For case1s and the second combination of parameters, the results in Table 18 show that the STCExtract algorithm correctly allocated 100% of the rows to the tables to which they belong. By contrast, in Phase 2, when allocating the data to the columns of those tables, the STCExtract algorithm correctly allocated 90.2% of the data. Table 19 shows that 9.8% of the rows were correctly allocated to their tables but not to their columns. Table 20 shows that, within this 9.8%, 4.8% are rows in which columns were substituted or deleted. For case2s, the STCExtract algorithm did not successfully allocate 0.9% of the rows by table and column.
The columns were allocated successfully for 91.7% of the rows, and 7.4% were incorrectly recognized. Table 20 shows that, within this 7.4%, 4% are rows in which columns were substituted or deleted. In each case, the share of rows in which columns were substituted or deleted was approximately 5%. The difference of 1% refers to rows incorrectly allocated to tables. For case1n, the STCExtract algorithm correctly allocated 99.4% of the rows to the corresponding tables. In Phase 2, the STCExtract algorithm correctly allocated the columns of 59.7% of the rows. The STCExtract algorithm also successfully grouped a further 6.4% of the data into columns, of which 3.2% were in rows where columns were substituted or deleted. According to Tables 18 and 19, the total share of rows allocated by table and by columns for case1n is therefore 66.1%. Only 1% of the rows in case1n were not adequately allocated by tables and columns. The remaining 33.9% are rows that were successfully allocated to tables but for which the number of columns was not recognized correctly. This is one of the reasons for the large percentage of poorly clustered columns in the open_pubs table (which makes up 33.35% of the data in the input test dataset). The open_pubs table contains one column with a structured address consisting of four values separated by commas. Consequently, it can be asserted that the STCExtract algorithm cannot recognize the number of columns in a table when a column is intended to hold structured data separated by a character that the algorithm treats as a delimiter. For example, when address fields are separated by commas, the STCExtract algorithm divides the address into independent columns.
The STCExtract algorithm is modular and can be extended with new rules and functions to detect scalar data types. This may be the subject of future work that would address determining the number of columns and sorting the data into columns only for those parts of the data where this is not done well (e.g., case1n).
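The effect of a structured value that contains a delimiter character can be illustrated with a minimal sketch. The row below is invented for illustration and is not taken from the open_pubs dataset; the helper `split_on_delimiters` is a hypothetical stand-in and not the STCExtract splitting routine.

```python
import re


def split_on_delimiters(row, delimiters):
    """Split a raw row on any of the given delimiter characters."""
    pattern = "|".join(re.escape(d) for d in delimiters)
    return re.split(pattern, row)


# Invented row: name | structured address (four comma-separated values) | latitude
row = "The Crown|12 High Street, Leeds, West Yorkshire, LS1 1AA|53.80"

fields_bar_only = split_on_delimiters(row, ["|"])       # intended 3 columns
fields_both = split_on_delimiters(row, ["|", ","])      # address broken apart
```

When the comma is also treated as a delimiter, the single address column is split into four, so the row appears to have six columns instead of three.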

3) ANSWER TO RQ3
In the first and second phases of the STCExtract algorithm, the influence of the six independent functions on detecting the similarity of tables or columns was investigated. Based on Experiment 4 and its evaluation, it was found that the STCExtract algorithm does not recognize the data structure at the column level when the minimum number of elements of most vectors is greater than one, as shown in case1v. However, for case2v, the STCExtract algorithm allocated 90.2% of the rows at the column level and practically all rows (99.6%) to the corresponding tables. The STCExtract algorithm correctly assigned 99.5% of the rows to their tables when all columns had a scalar data type, and 99.9% of the rows when the tables had columns with both scalar and vector data types. During Phase 2 (the column clustering phase), the STCExtract algorithm correctly allocated 91.2% of the rows at the column level for tables in which all columns have a scalar data type. Of the other 8.3% of rows, in which the number of columns was correctly clustered, a discrepancy occurred in 4.9% owing to the substitution or deletion of columns. The remaining 3.4% of poorly clustered columns may arise because some column values contain a delimiter, such as a vertical bar. During the reconstruction phase, such a column value is remerged from multiple columns, but the STCExtract algorithm uses a comma to do so. Thus, poor remerging of only one column can degrade the overall accuracy in Phase 2.
Because the number of columns was not successfully determined, the column clustering process was unsuccessful for tables with a column of vector data type (case1v). For case2v, almost all rows (99.9%) were correctly allocated to tables, whereas columns were successfully allocated for 87.2% of the rows. Of the other 12.7% of rows, in which the number of columns was correctly recognized, a discrepancy of 5% was due to the substitution or deletion of columns. In the remaining 7.7% of the cases, the values in one or more columns likely contained a delimiter.
The STCExtract algorithm is modular and can be expanded with new rules and functions to detect vector data types. This may be the subject of future work that would address determining the number of columns and sorting the data into columns only for those parts of the data where this is not done well (e.g., case1v).
After evaluating the STCExtract algorithm, the best results were obtained when ElementMatcher was selected as the object comparison function, linear_weight as the function for determining the distance of a given object from the cluster center, MaxFreq as the cluster center detection function, and the Hierarchical algorithm for clustering structures at the table level, with the columns of the converted structures clustered in Phase 1 then clustered using the Gaussian mixture, K-means, or Mini-batch K-means algorithm.
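The remerging issue mentioned above can be made concrete with a small sketch. It assumes, for illustration only, a row delimited by vertical bars in which one value itself contains a vertical bar; the helper `remerge` and the sample values are hypothetical and do not reproduce the STCExtract reconstruction step.

```python
def remerge(fields, start, stop, joiner=","):
    """Rejoin fields[start:stop] into one value, using a comma as the joiner."""
    return fields[:start] + [joiner.join(fields[start:stop])] + fields[stop:]


# Hypothetical row whose second column value contains the delimiter itself.
original_fields = ["42", "red|blue", "2023"]
raw_row = "|".join(original_fields)   # "42|red|blue|2023"

split_fields = raw_row.split("|")     # the value is broken into two fields
rebuilt_fields = remerge(split_fields, 1, 3)
```

The rebuilt row has the right number of columns, but the remerged value is "red,blue" rather than the original "red|blue", so a row-level comparison against the source data fails even though only one column is affected.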

J. THREATS TO VALIDITY
Although the data used in the experiments were obtained from various sources, they may not accurately reflect all actual datasets or their structures. Twelve tables from various datasets were used as the basis for testing the proposed algorithm. Tables 2 and 3 show that data from the various datasets were combined. A maximum of six datasets were combined. The algorithm is yet to be tested on more than six messy datasets.
Other concerns are related to the number of columns in the dataset.STCExtract is yet to be tested if the dataset contains

FIGURE 1. The sequence of actions in the STCExtract algorithm.

FIGURE 3. Experiment 1 - Object comparators: (a) ResultsPh1 based on ARI values; (b) results based on the number of recognized clusters (k) and the number of input tables (t).

FIGURE 4. Experiment 2 - Object comparators: (a) ResultsPh1 based on ARI values; (b) results based on the number of recognized clusters (k) and the number of input tables (t).

FIGURE 5. Experiment 3 - Object comparators: (a) ResultsPh1 based on ARI values; (b) results based on the number of recognized clusters (k) and the number of input tables (t).

FIGURE 6. Experiment 1 - Distance functions: (a) ResultsPh1 based on ARI values; (b) results based on the number of recognized clusters (k) and the number of input tables (t).

FIGURE 7. Experiment 2 - Distance functions: (a) ResultsPh1 based on ARI values; (b) results based on the number of recognized clusters (k) and the number of input tables (t).

TABLE 9.

FIGURE 8. Experiment 3 - Distance functions: (a) ResultsPh1 based on ARI values; (b) results based on the number of recognized clusters (k) and the number of input tables (t).

FIGURE 9. Experiment 1 - Cluster center detectors: (a) ResultsPh1 based on ARI values; (b) results based on the number of recognized clusters (k) and the number of input tables (t).

FIGURE 10. Experiment 2 - Cluster center detectors: (a) ResultsPh1 based on ARI values; (b) results based on the number of recognized clusters (k) and the number of input tables (t).

FIGURE 11. Experiment 3 - Cluster center detectors: (a) ResultsPh1 based on ARI values; (b) results based on the number of recognized clusters (k) and the number of input tables (t).
, and 14, respectively. The results based on the ARI values of ResultsPh1 are shown in Figures 12a, 13a, and 14a. Figures 12b, 13b, and 14b compare the number of identified clusters k with the number of input tables t. The circles in Figures 12a, 13a, and 14a denote mean values.

FIGURE 12. Experiment 1 - Clustering algorithms: (a) ResultsPh1 based on ARI values; (b) results based on the number of recognized clusters (k) and the number of input tables (t).

FIGURE 13. Experiment 2 - Clustering algorithms: (a) ResultsPh1 based on ARI values; (b) results based on the number of recognized clusters (k) and the number of input tables (t).

FIGURE 14. Experiment 3 - Clustering algorithms: (a) ResultsPh1 based on ARI values; (b) results based on the number of recognized clusters (k) and the number of input tables (t).

TABLE 2. Details of the test datasets, including the number of converted structures used for Experiments 1, 2, and 3.

TABLE 3. Details of the test dataset, including the number of converted structures used for Experiment 4.

TABLE 4. Experiment 1 - Mean ARI value for object comparators.

TABLE 5. Experiment 2 - Mean ARI value for object comparators.

TABLE 6. Experiment 3 - Mean ARI value for object comparators.

TABLE 7. Experiment 1 - Mean ARI value for distance functions.

TABLE 8. Experiment 2 - Mean ARI value for distance functions.

TABLE 10. Experiment 1 - Mean ARI value for cluster center detectors.

TABLE 11. Experiment 2 - Mean ARI value for cluster center detectors.

TABLE 12. Experiment 3 - Mean ARI value for cluster center detectors.

TABLE 13. The best results for the STCExtract algorithm parameters selected by counting.

TABLE 14. The best results for the STCExtract algorithm parameters selected by maximal exp_mul and the average value.

TABLE 15. Experiment 1 - Results of the data clustering algorithms and column level for the selected combinations of input parameters.

TABLE 17. Experiment 3 - Results of the data clustering algorithms and column level for the selected combinations of input parameters.

TABLE 18. Rows that are correctly recognized by tables and by the tables' columns.

TABLE 19. Rows that are correctly recognized by tables but not by the tables' columns.

TABLE 20. Rows that are correctly recognized by tables but not by the tables' columns in the case of substituted or deleted columns.