Semi-Automated Formalization and Representation of Engineering Knowledge Extracted from Spreadsheet Data

The development of new methods and tools for the formalization and representation of complex knowledge in the context of creating intelligent systems remains within the scope of scientific research. There is a trend toward automating knowledge formalization and representation by using various sources of information, in particular, spreadsheets. This paper proposes a novel approach to the semi-automatic formalization and representation of engineering knowledge in the form of conceptual models and knowledge base codes obtained from spreadsheet data. Our approach comprises three phases: (I) rule-driven transformation of source spreadsheet tables to a specific canonical form (data preprocessing); (II) domain knowledge formalization and representation via the extraction and aggregation of conceptual model fragments from canonicalized tables; (III) model-driven synthesis of knowledge base source codes from a domain model. The approach is implemented by our tools: TabbyXL supports the development of software applications for spreadsheet data canonicalization; Personal Knowledge Base Designer is used to build and aggregate conceptual model fragments, as well as to construct a target knowledge base and generate source codes. Our case study on industrial safety inspection (ISI) demonstrates the applicability of the approach for prototyping knowledge bases containing decision-making rules.


I. INTRODUCTION
Currently, the development of new methods and tools for formalization and representation of complex engineering knowledge in the context of the creation of domain-specific knowledge-based systems for solving various practical engineering problems remains in the scope of scientific research [1]. Such systems are actively used in design [2], diagnostics and forecasting conditions of complex technical systems [3], selection of design materials [4], risk analysis [5], etc.
The main elements of intelligent systems are knowledge bases, the research of which does not lose its relevance [6]. In recent years, there has been an increasing trend to automate knowledge formalization and representation by using a semi-automated transformation of various sources of information (texts, documentation, databases, spreadsheet tables, web resources, etc.), as well as the principles of visual, conceptual and cognitive modeling.
Spreadsheets are one of the most convenient ways to structure and represent statistical and other data. For this reason, they are widely used, and their number on the Internet alone is about 150 million [7]. Spreadsheets contain useful knowledge in many domains (e.g., engineering, business, etc.) and can be a valuable knowledge source. In the last decade, approaches to automating the development of knowledge bases have emerged based on the analysis of spreadsheets in CSV and Excel formats, as well as web tables [8][9]. These approaches focus on automated knowledge extraction from tabular data and, as a rule, target a specific structure (model) of the table. Moreover, raw tabular data do not always contain the experience and expert knowledge that are important for solving subject tasks. Besides, knowledge bases require a detailed description, confirmation, and verification by experts. In this regard, ontology and conceptual modeling methods are additionally used to build complete and consistent knowledge bases, with spreadsheets serving as a source of knowledge at the terminological and essential levels.
This paper presents a novel three-phase approach to the semi-automatic formalization and representation of engineering knowledge in the form of conceptual models built from spreadsheet data; it covers the following tasks: (I) extracting arbitrary source tables from a spreadsheet and transforming them to an original canonicalized form; (II) formalizing and representing domain model fragments extracted from canonicalized tables and aggregating them into a complete domain model; (III) synthesizing source codes of a knowledge base in a target knowledge representation language from a complete domain model.
We develop and integrate our two tools to demonstrate the applicability of the proposed approach: (I) TabbyXL [10] for canonicalization of source arbitrary tables to a canonicalized form; (II) PKBD (Personal Knowledge Base Designer) [11] for formalization and representation of knowledge in the form of domain models and source codes of knowledge bases.
Our contribution consists of the following results. For the first time, we propose an approach for the formalization and representation of complex engineering knowledge in the form of conceptual models and knowledge bases that utilizes rule-driven spreadsheet data extraction. Our approach facilitates the formalization, representation, and codification of knowledge and data represented in the form of spreadsheets. Experiments were conducted to demonstrate that the approach is a feasible way of generating fragments of a conceptual model (ontology) in the Industrial Safety Inspection (ISI) area, which is our case study. As a result, conceptual models for ISI tasks were developed. It is important to highlight that the resulting conceptual models were used to synthesize source codes and high-level specifications for knowledge bases. We also demonstrate such synthesis by the example of generating a knowledge base represented as source code in CLIPS (C Language Integrated Production System), DROOLS, and PHP (Hypertext Preprocessor).
The paper is organized as follows. Section 2 considers the related works. Section 3 presents the preliminaries of this work. Section 4 explains our three-phase approach, including a description of its implementation. Section 5 illustrates a case study of our approach for solving the ISI task, including an illustrative example and the experimental evaluation; Section 6 discusses the results, while Section 7 presents concluding remarks and future work.

II. RELATED WORKS
Analyzing and transforming data from spreadsheets to formalize and represent knowledge is a popular area of research. Solutions in this area can be divided into two large groups of approaches:
- end-to-end or full transformation [12][13][14][15][16][17][18]: transform source spreadsheets directly into some knowledge structures, in most cases Resource Description Framework (RDF) or Web Ontology Language (OWL) files;
- partial or step-by-step transformation [19][20][21][22][23]: provide partial solutions using intermediate forms of data and knowledge representation, for example, in the form of conceptual models; they separately solve the problems of transforming spreadsheets and conceptual models (mainly in the form of UML class diagrams).
Below, we present some examples of these studies.

A. END-TO-END AUTOMATED TRANSFORMATION OF SPREADSHEET DATA INTO KNOWLEDGE
Currently, the most popular way to represent domain knowledge is knowledge graphs (semantic networks) or ontologies. For this reason, most of the studies aimed at obtaining machine-interpretable knowledge structures based on understanding tabular data (including solving the tasks of detection, analysis, and transformation of tables) are focused on the RDF and OWL formats. From a methodological point of view, they provide the extraction of separate knowledge fragments (or mini-ontologies) with their subsequent aggregation into one expanding domain model. An example of such a study is the TANGO approach [9]. There are also examples of research and commercial software: RDF123 [13], Owlifier [24], Datalift [16], Any2OWL [17], Spread2RDF, Any23, TopBraid Composer, etc.
The main disadvantages of end-to-end spreadsheet data transformation studies are the following:
- orientation to well-structured data, which implies using either a specific table layout (usually "entity" tables containing descriptions of instances of a particular entity, with all columns being its attributes, i.e. one-dimensional tables) or specific data sets (for example, the Gold standard 1 );
- orientation to a specific way of presenting tables, for example, HTML;
- high qualification requirements for describing transformations (in cases where it is possible to configure systems);
- lack of validation and representation of the obtained knowledge using visual domain-specific or system-wide notations;
- lack of code generation in a specific knowledge representation language that would allow one to directly use and integrate the obtained codes in applications.
Despite significant progress in this area and the growing popularity of integrating spreadsheet data with semantic technologies, studies aimed at analyzing and transforming spreadsheet data into other knowledge representation formalisms are poorly represented. The exception is association rules [18], which belong to data mining methods and are not considered in these studies as an element of knowledge-based or expert systems.
In general, the trend of weak support for other knowledge representation formalisms can be explained by the possibility of their equivalent transformation; however, this does not solve the issue of code generation for a particular programming language. In turn, rule-based knowledge representation languages (such as DROOLS, JESS, and CLIPS) are still useful when developing industrial intelligent and knowledge-based systems. Knowledge bases with rules attract developers with their visibility, high modularity, ease of making additions and changes, and the simplicity of the logical inference mechanism.

B. PARTIALLY AUTOMATED TRANSFORMATION OF SPREADSHEET DATA INTO KNOWLEDGE
As an example of studies that can provide partial transformation of spreadsheet data into knowledge, we can highlight studies that solve two different tasks, from which it is possible to build a step-by-step transformation chain or pipeline:
- transforming spreadsheet data to conceptual models;
- transforming conceptual models to knowledge bases and source codes.
Although conceptual models can be used to represent ontologies, and some researchers identify them with knowledge graphs and semantic networks, in most cases they reflect only one specific aspect of ontologies, for example, structural or behavioral. In this regard, there are studies aimed at obtaining certain types of conceptual models from spreadsheet data.
In particular, we can distinguish the following studies, which consider the solution of the first task.
Hung et al. [8] propose TRANSHEET, an approach for transforming spreadsheet data to a structured target model in the XML format. TRANSHEET enables users to perform mappings via a familiar and expressive spreadsheet-like formula language. This language is designed for specifying mappings between spreadsheet data and the target schema.
Hermans et al. [27] present a systematic approach called GYRO for the automation of the extraction of UML class diagrams from spreadsheet data. GYRO automatically transforms spreadsheet data in Excel format by exploiting the commonality in tables, like the two-dimensional patterns. These patterns are located within a spreadsheet table using a combination of parsing and pattern matching algorithms.
Cunha et al. [21] suggest an approach called CLASSSHEETS based on searching for functional dependencies between data cells, where the results of the transformation are relational models. The authors of [21] also show how to systematically transform extended CLASSSHEETS models to UML class diagrams enriched with constraints expressed in OCL (Object Constraint Language). UML class diagrams are generated under the notation of the USE framework.
Amalfitano et al. [20] describe a heuristic-based reverse engineering process for inferring conceptual data models in the form of UML class diagrams from spreadsheet tables in the Excel format. This process is fully automatic and has been defined in an industrial context and validated by an experiment involving three different spreadsheet-based information systems from the considered automotive industrial domain.
Most of the studies in this group also have some limitations related to the support of certain specific predefined models of source spreadsheets with a mixed logical and physical structure, and they are almost all focused on obtaining UML class diagrams.
The second task is usually solved independently of the first one. The existing solutions focus on analyzing specific formats of conceptual models and generating knowledge bases in the form of ontologies or logical rules presented in specific knowledge representation languages.
In particular, Zedlitz and Luttenberger [22] present an approach for transforming UML class diagrams into OWL 2 ontology using the QVT-r language from OMG standard.
Albert and Franconi [19] proposed an integrated method and the ORM2OWL tool for transforming conceptual domain models in the Object Role Modelling (ORM) format into an OWL ontology. eXtensible Stylesheet Language Transformations (XSLT) was used at the first stage of transforming conceptual models from the XML format to ORM.
Starr and Oliveira [23] proposed a method using CmapTools conceptual maps as the main means for expressing expert knowledge, as well as a set of formal transformations applied to these maps to transform them into domain ontology in the OWL format.
Other examples of solving the task of automated transformation of UML models into ontologies are presented in [24][25][26].
These approaches and software have several drawbacks, in particular:
- the absence or limitation of code generation of knowledge bases in different knowledge representation languages;
- a limited set of supported formats of conceptual models, as well as the complexity of describing the models for code generation;
- complex implementation of transformations, which, in turn, causes a variety of software to be used by researchers within each separate transformation (sometimes separate software is used at each stage of the transformation).
Thus, the problem of developing methods and software for extracting and transforming data from spreadsheets with arbitrary layouts and data sets into ontologies and knowledge bases, with subsequent validation of results by subject specialists, as well as the generation of source codes and their integration into applications, remains an urgent task. At the same time, it is promising to use an intermediate representation of knowledge in the form of conceptual models that reflect the generated knowledge at the terminological (T-Box) level.
In this paper, to overcome the shortcomings mentioned, we propose a new approach to the semi-automated formalization and representation of complex engineering knowledge in the context of creating knowledge bases, the main features of which are the following:
- the main source of knowledge is spreadsheets with an arbitrary layout (structure);
- spreadsheet data is represented with the use of a special canonical form;
- conceptual models are an intermediate means to formalize and represent knowledge extracted from canonicalized spreadsheets and the basis for synthesizing source codes of knowledge bases.

III. PRELIMINARIES

A. SPREADSHEET SOURCE AND CANONICALIZED TABLE
We formally introduce tables and their types. A table is a grid of cells arranged in rows and columns. Such tables are used as visual communication patterns, as well as data arrangement and organization tools. In this paper, we primarily focus on spreadsheet tables, i.e. tables embedded in various documents and reports. Such spreadsheet tables typically contain data from various dimensions or named entities and are presented in the Excel format (XLSX or CSV). Such spreadsheet tables are designated as arbitrary in [10], since they may have a different layout and design style due to the specifics of domain data.
Another type of table is the relational one. Relational tables contain high-quality relational data [30] and can be converted into a relational model. Relational tables contain a set of entities, which can be arranged in rows (horizontal) or columns (vertical); the remainder of the cells contain their descriptive attributes.
To represent extracted tabular data, we propose a canonical form that corresponds to relational tables in layout. The special canonical form we use is formally defined as follows:

CF = (D, R_H, C_H)    (1)

where D is a data column that contains only entries (i.e. values of a source table), R_H is a column of paths of row labels (i.e. headings addressing the values by rows), and C_H is a column of paths of column labels (i.e. headings addressing the values by columns). A path of labels can express either a separation of a category into subcategories or a reading order.
In the examples, we use the vertical bar to denote separated labels in a path. Our approach uses the canonical form as an intermediate data representation between source spreadsheet tables and target conceptual models.
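The canonical form above can be sketched as a list of records, each pairing an entry with its two label paths. The following minimal Python sketch (the record type and all values are our own illustrative assumptions, not part of TabbyXL) shows how the "|"-separated paths encode label hierarchies:

```python
from typing import NamedTuple

class CanonicalRecord(NamedTuple):
    """One record of a canonicalized table CF = (D, R_H, C_H)."""
    data: str      # D: an entry (a value of the source table)
    row_path: str  # R_H: "|"-separated path of row labels
    col_path: str  # C_H: "|"-separated path of column labels

# Hypothetical records of one canonicalized table (values are illustrative).
table = [
    CanonicalRecord("12.5", "Element|Shell", "Thickness|Nominal"),
    CanonicalRecord("11.9", "Element|Shell", "Thickness|Measured"),
]

# Splitting a path on the vertical bar recovers the label hierarchy.
assert table[0].col_path.split("|") == ["Thickness", "Nominal"]
```

Each record therefore carries everything needed to interpret the entry: its value and the category paths addressing it by row and by column.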

B. CONCEPTUAL MODEL AND MODEL TRANSFORMATIONS
A model is an abstraction of a system under study that makes it possible to have a better understanding of and to reason about it [31]. Models can be divided into different categories, in particular, mathematical models, graph models, etc. According to [32], models are divided into three levels, namely: conceptual models, specification models, and implementation models. A conceptual model represents concepts (entities) and relationships between them, and it is mainly built to formalize and represent the static characteristics of some system.
Recently, numerous conceptual modeling techniques have been created that can be applied across multiple disciplines to increase the user's understanding of the system to be modeled [33]. Various visual and text notations, universal modeling languages, and standards, in particular, UML class diagrams, IDEF1x and others, are widely used in designing conceptual models.
Typical usage of conceptual models is to build various information systems, for example, knowledge-based systems. Conceptual models can be used as the basis for generating domain ontologies and knowledge bases. The model transformation aims at converting source models to target ones.

IV. APPROACH
Our approach consists of three phases: (I) extracting data from source tables and transforming them to the canonical form, (II) formalization and representation of knowledge extracted from canonicalized tables, (III) synthesizing source codes of a target knowledge base from the conceptual model. The workflow diagram of this process is shown in Figure 1.

A. OBTAINING SPREADSHEETS
The source data we are interested in can be found in document tables. First of all, such tables should be extracted from digital media such as web pages and PDF documents. Note that we do not describe this process here; however, it can be realized by using data scraping tools (e.g. Tabula or TabbyPDF) with additional manual verification and correction.
Phase 1 starts with extracting spreadsheet tables as shown in Figure 1. The tables we are interested in typically have an arbitrary layout rather than a relational one. Phase 1 aims at transforming them to the canonical form (e.g. Figure 2). We consider this process in terms of table understanding [9] with the following steps: (I) role analysis, i.e. recovering data items (entries and labels) from cells; (II) structure analysis, i.e. recovering relationships between data items; (III) interpretation, i.e. separating labels into named or anonymous categories.
In our case, extracted tables are represented by two source forms resulting from our case study. The first of them (Form 1) has merged cells expressing either data repetition or hierarchical relationships between headings. An example of a source table in Form 1 is illustrated in Figure 2 (left). The target table in the canonical form corresponding to the source table is shown in Figure 2 (right).

To develop rules for the analysis and interpretation of source tables in Form 1, we use the following assumptions. A corner cell starting at the first row and first column contains a stub head. The corner covers all head rows and all stub columns. A head cell contains one column label (e.g. h1, h2, …, hn). Such labels can compose hierarchical (parent-child) relationships expressed by spanning and nested cells. A child label should be placed in a nested cell while its parent label is in the corresponding spanning cell (e.g. h3 and h4 are children of the parent h2 in Figure 2). Each path of column labels belongs to the category C_H (e.g. the "Column Heading" column of the canonicalized table in Figure 2). A cell of the stub part contains one row label (e.g. s1, s2, …, sm). A multicolumn stub expresses hierarchical paths of parent-child relationships. Each path is read in a row from the left column (parent) to the right one (child). Any path of row labels belongs to the category R_H (e.g. the "Row Heading" column of the canonicalized table in Figure 2).
A body part is a data block placed below the head and to the right of the stub. A body cell contains one entry (e.g. d1, d2, …, dk). A merged cell should be considered as a set of repeated entries with the same value (e.g. d3, d4, and d8 in Figure 2). Each entry is addressed by one path of column labels and one path of row labels. For example, d9 is addressed by the (h2|h4) path of column labels and the (s|s3) path of row labels.
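This addressing scheme can be sketched in Python as a mapping from (column-label path, row-label path) pairs to entries, reusing the h/s/d names of Figure 2; the `expand_merged` helper is our own hypothetical illustration of the "repeated entries" assumption, not part of TabbyXL:

```python
# A hypothetical sketch of the Form 1 addressing scheme: each body entry is
# keyed by one column-label path and one row-label path, as for d9 in Figure 2.
body = {
    ("h1", "s|s1"): "d1",
    ("h2|h3", "s|s3"): "d8",
    ("h2|h4", "s|s3"): "d9",
}

def expand_merged(addresses, value):
    """Treat a merged body cell as repeated entries with the same value."""
    return {addr: value for addr in addresses}

# e.g. a cell merged over two addresses yields two identical entries.
body.update(expand_merged([("h1", "s|s2"), ("h1", "s|s3")], "d3"))
assert body[("h2|h4", "s|s3")] == "d9"
assert body[("h1", "s|s2")] == body[("h1", "s|s3")] == "d3"
```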
These assumptions allowed us to develop a ruleset for canonicalizing source tables. It was expressed in CRL, our domain-specific language for table analysis and interpretation rules [34].
It should be noted that the transformation rules depend only on the structure of source tables, not on their content. Therefore, when processing new source spreadsheet table layouts, we need to create a new set of transformation rules in CRL to obtain canonical tables. The rules that we used in this case are listed below.
Rule 1 creates column labels from cells placed in a head. When there exists a cell c0 in the left top corner (line 02) and a cell c1 is located in the same rows and to the right of it (line 03), then a label is created in c1 (line 05) and this label is associated with the category "Column Heading" (line 06).

Rule 2 creates parent-child pairs from labels placed in a head. When there exists a pair of labels l1 and l2 originated from the head determined by a corner cell c0, and the cell of l1 spans the cell of l2 by columns (lines 02-05), then the parent label l1 is associated with the child label l2 (line 07).

Rule 3 creates row labels from cells placed in a stub. When a cell c1 is located in the same columns as the corner cell c0 (lines 02-03), then a label is created in the cell c1 (line 05) and this label is associated with the category "Row Heading" (line 06).

Rule 4 creates parent-child pairs from labels placed in a stub. When there exists a pair of labels l1 and l2 originated from the stub determined by a corner cell c0, and the cell of l1 spans the cell of l2 by rows (lines 02-05), then the parent label l1 is associated with the child label l2 (line 07).

Rule 5 creates parent-child pairs from a label placed in a stub head and labels placed in a stub. When a label l1 is originated from a cell located below the corner cell c0 and in the 1st column (lines 02-03), then the parent label originated from c0 is associated with the child label l1 (line 05).

Rule 8 associates entries with labels by the same rows and columns. When there exists a triplet of an entry e and two terminal labels l1 and l2 located in the same columns and the same rows as the cell of e, respectively (lines 02-07), then both labels l1 and l2 are associated with the entry e (lines 09-10).

The second form (Form 2) is a list with only two columns. Each row puts together a heading and data, i.e. a label si and an entry di. It also merges cells for repeating headings and data. Figure 3 depicts an example of a source table in Form 2 (left) and its target table (right). Form 2 does not have column headings. The ruleset expressed in CRL intended for the data canonicalization from source tables of Form 2 is listed below.

Rule 1 creates labels from cells placed in the left column. When a cell c is located in the 1st column (line 02), then a label is created in c (line 04).
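The canonicalization of Form 2 tables described above can be sketched in plain Python (a minimal sketch under our own assumptions, not the actual CRL/TabbyXL implementation; we model a merged cell as `None` inheriting the last seen label):

```python
def canonicalize_form2(rows):
    """Sketch: convert a two-column (label, entry) list with merged cells
    into canonical records. A merged (repeating) cell is given as None and
    inherits the last seen label; Form 2 has no column headings."""
    records, last_label = [], None
    for label, entry in rows:
        if label is not None:
            last_label = label
        records.append(
            {"DATA": entry, "RowHeading": last_label, "ColumnHeading": ""}
        )
    return records

# Hypothetical two-column source rows; the second row's label cell is merged.
rows = [("Material", "Steel 20"), (None, "Steel 09G2S"), ("Diameter", "3200")]
recs = canonicalize_form2(rows)
assert recs[1]["RowHeading"] == "Material"  # merged label is repeated
assert recs[2]["DATA"] == "3200"
```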

B. FORMALIZATION AND REPRESENTATION OF KNOWLEDGE
A canonicalized table is transformed into one fragment of knowledge formalized and represented as a conceptual model that describes a limited subset of domain concepts and relationships. Its paths of labels are interpreted as a hierarchy of domain concepts (classes or attributes). This process is driven by a set of transformation rules taking into account the five cases of canonicalized table records described in [35]. Cases 1-4 originate from source tables of Form 1 and are described below.
In case 1, each of the two paths in a record (a row heading and a column heading) has only one label, as shown in Figure 4 (UML is used just for visualization). In case 2, a record contains a path of two row labels and a path of only one column label, as also shown in Figure 4. In the latter case, labels originated from the stub of a source table can be read as hierarchical (parent-child) relationships.
In case 3, each of both paths in a record contains a pair of labels (Fig. 5). The paths of labels can be read as hierarchical relationships.
In case 4, a path of row labels in a record has only one label, while a path of column labels in the record contains two or more labels (Fig. 6). In this case, labels originated from the head of a source table can be read as multiple parent-child relationships. The last case 5 originates from source tables of Form 2. We assume that a record contains only one path with only one row label, as shown in Figure 7. In all cases 1-5, each record contains only one entry.
We define a transformation algorithm for processing paths of row labels (RL), as well as a similar algorithm for processing paths of column labels (CL); in the latter, for example, the third CL label of a path yields a new property name. All parent-child relationships extracted from canonicalized tables are interpreted as associations without cardinality. Generalization relationships are not processed, since the transformations were considered in the context of creating knowledge bases containing logical rules, in particular for CLIPS and DROOLS, which do not have such relationships.
All values of class attributes are formed using entries from the data column. At the current stage, all entries are denoted by a string datatype. As a result, each canonicalized table is transformed into a conceptual model fragment.
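The per-record transformation can be sketched as follows (a simplified illustration under our own assumptions: the first row label names a class, the column-label path names a string attribute, and consecutive row labels form parent-child pairs; this is not the full RL/CL algorithm):

```python
def record_to_fragment(row_path, col_path, entry):
    """Sketch: transform one canonical record into a conceptual model
    fragment. Parent-child pairs are later interpreted as associations
    without cardinality; all entry values are kept as strings."""
    labels = row_path.split("|")
    return {
        "class": labels[0],
        "attributes": {col_path: str(entry)},
        "pairs": list(zip(labels, labels[1:])),  # parent-child pairs
    }

# Case 2: a path of two row labels and one column label.
f = record_to_fragment("Vessel|Shell", "Defect", "crack")
assert f["class"] == "Vessel"
assert f["pairs"] == [("Vessel", "Shell")]
assert f["attributes"]["Defect"] == "crack"
```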
The diagram fragments extracted from tables are merged into a complete domain model. This process applies rules for merging extracted classes and clarifying their names, attributes, and associations as follows.
Rule 1: Merge two classes when they have equal names from duplicate fragments of class diagrams.
Rule 2: Merge two classes when they have the same structure, i.e. when their sets of attributes are equal. In this case, only the first class with this structure stays in the model.

Rule 3: Merge two classes when they have similar names. The resulting fragments of class diagrams can describe the same objects or processes. We suggest using a simple string comparison method based on the Levenshtein distance [36] to determine the similarity between two class names. If the distance is less than or equal to three, then we assume the classes to be similar. Note that this alone is not enough, so we also look at the structure of the classes (names of attributes must partly match).
Rule 4: Create a new association between two classes if homonymous classes and attributes exist, i.e. a class name in one fragment is equivalent to an attribute name in another class. In this case, the attribute of the same name is removed.
Rule 5: Remove duplicate associations between classes.

C. SYNTHESIZING RULE-BASED KNOWLEDGE BASES
The final phase generates source codes of a target knowledge base from the conceptual model obtained at the previous phase. The synthesis algorithm is based on the generalized method for transforming a conceptual model to a knowledge base presented in [37]. This method implies the application of metamodels and a domain-specific declarative language for describing correspondences between source elements of a conceptual model and target elements of a rule-based model. The algorithm includes the following main steps:

Step 1: Serialize a complete conceptual model represented in a UML class diagram to the XML format using the OMG XMI standard.
Step 2: Extract constructions of a rule-based model from the structure of a UML class diagram serialized in XML.
Step 3: Modify the obtained rule-based model by using the Rule Visual Modeling Language (RVML) [37], a UML extension designed for rule-based engineering.
Step 4: Generate source codes of the target knowledge base by transforming the rule-based model.
Note that the rule-based model we use can be considered a universal tool for the intermediate representation of extracted knowledge in the form of logical rules; it does not depend on any particular knowledge representation language. This algorithm is discussed in detail in [37]. Our software provides source code generation for the following languages: CLIPS, DROOLS, and PHP.
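As an illustration, the final code generation step can be sketched as simple template filling for the CLIPS target (a hypothetical sketch in Python with our own template and slot names; it is not PKBD's actual generator, which is driven by metamodel correspondences):

```python
def to_clips_template(name, slots):
    """Render a class of the rule-based model as a CLIPS deftemplate."""
    body = "\n".join(f"  (slot {s})" for s in slots)
    return f"(deftemplate {name}\n{body})"

def to_clips_rule(name, condition, action):
    """Render one logical rule as a CLIPS defrule: condition => action."""
    return f"(defrule {name}\n  {condition}\n  =>\n  (assert {action}))"

template = to_clips_template("defect", ["kind", "size"])
rule = to_clips_rule("detect-danger", "(defect (kind crack))",
                     "(state (value dangerous))")
assert template.startswith("(deftemplate defect")
assert "=>" in rule
```

Analogous generators would emit DROOLS `rule ... when ... then ... end` blocks and PHP conditionals from the same language-independent rule model.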

V. IMPLEMENTATION
The approach was implemented by integrating two tools: TabbyXL [10], which extracts relational data from source spreadsheet tables, and PKBD [11], which generates and aggregates conceptual models from canonicalized tables.
TabbyXL enables software development for rule-driven data extraction from arbitrary spreadsheet tables. Such software converts tabular data to a canonical form. Both the source (arbitrary) form and the target (canonical) form are determined by user-defined rules. The two scenarios for the implementation of these rules are described below.
Scenario 1: The transformation rules are expressed in the domain-specific language CRL. They can be automatically translated to the Java source code of an executable application.
Scenario 2: The transformation rules are expressed in a general-purpose rule-based language (e.g. DROOLS or JESS). They can be executed by an appropriate rule engine that is compatible with "Java Rule Engine API".
PKBD provides prototyping of knowledge bases by using logical rules and visual modeling based on the RVML notation. It supports various data sources including mind maps, class diagrams, and spreadsheet tables. To integrate both tools, we developed a plug-in module for PKBD that implements rules for transforming spreadsheet cell values and their relationships into the resulting domain entities (taxonomical fragments). The module also aggregates the fragments into a complete domain model by using the rule-based operations for clarifying entity names, merging, and separation.

VI. CASE STUDY
The developed method and tools were used to solve tasks in the field of Industrial Safety Inspection (ISI) [38]. We consider our case study and evaluation in detail below.

A. INDUSTRIAL SAFETY INSPECTION
ISI is a procedure required to confirm the compliance of the technical equipment state with industrial safety requirements. There are national standards and normative acts regulating this procedure [38][39][40]. They define the composition and stages of ISI in a general form. In most cases, implementation details depend on the technical abilities and experience of inspecting organizations [41].

Our case study relies on the experience of the Irkutsk Research and Design Institute of Chemical and Petrochemical Engineering (IrkutskNIIhimmash) to demonstrate some details of the ISI procedure. This organization has accumulated a large amount of information on technical state evaluation and risk assessment in the form of printed and electronic documents. Note that they mainly use word processors and spreadsheets without involving specialized software to prepare their ISI reports. However, representing this information in the form of semantic structures such as conceptual models and knowledge bases can improve the efficiency of technical condition evaluation as well as residual life and risk assessment. A substantial part of this information is represented as tables in ISI reports. These tables are of particular interest for knowledge-based engineering due to their high degree of structurization and formalization.

We designed an experiment involving domain experts in ISI to demonstrate the usefulness of our approach. The considered area of the safety management of industrial equipment includes the tasks of monitoring, diagnosing, and forecasting technical conditions, as well as risk assessment. Only complete and adequately formalized and represented knowledge in the form of domain models provides the proper solution to these tasks.

1) OBTAINING CANONICALIZED TABLES
We used a dataset of 216 spreadsheet tables extracted from 6 ISI reports; 173 of them have a unique layout and content in 5817 cells. We selected 161 tables and transformed them to the canonical form by executing the transformation rules described above. These tables, accompanied by the transformation rules, are referred to as the ISI-161 dataset (ISI-161). Figure 8 shows some examples of these tables, containing information about elements of the inspected object and the results of a hardness measurement. Figure 9 depicts the corresponding canonicalized tables obtained from the source ones via TABBYXL.
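The canonicalization step can be illustrated with a minimal sketch. We assume here (as a simplification, not the actual TABBYXL/CRL implementation) that canonicalization of a cross-tab table means flattening it into rows where each data value is paired with its row and column headings; all names below are illustrative.

```python
# Hypothetical sketch of table canonicalization: flattening a matrix table
# into canonical (DATA, ROW_HEADING, COL_HEADING) rows, where each entry
# carries its full addressing context. This is NOT the TABBYXL API; it only
# illustrates the idea of the canonical form.

def canonicalize(row_headings, col_headings, data):
    """Turn a matrix table into a list of canonical entries."""
    canonical = []
    for i, row_label in enumerate(row_headings):
        for j, col_label in enumerate(col_headings):
            canonical.append({
                "DATA": data[i][j],
                "ROW_HEADING": row_label,
                "COL_HEADING": col_label,
            })
    return canonical

# A tiny table in the spirit of the hardness-measurement examples:
rows = ["Shell", "Bottom"]
cols = ["Hardness, HB", "Thickness, mm"]
values = [[120, 8.0], [115, 7.5]]

table = canonicalize(rows, cols, values)
# Each entry is now self-describing, e.g.
# {"DATA": 120, "ROW_HEADING": "Shell", "COL_HEADING": "Hardness, HB"}
```

Once every value is addressed by its headings, layout differences between source tables disappear, which is what makes the subsequent rule-driven extraction of entities feasible.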

2) FORMALIZATION AND REPRESENTATION OF KNOWLEDGE
The canonicalized tables of ISI-161 were transformed with the aid of PKBD. A total of 429 entities, including 59 classes (concepts), 338 attributes (properties), and 32 associations (relationships), were extracted, depending on the activated aggregation rules. Note that the experts estimated that only about 56% of the 429 entities were useful for further processing. The aggregation of the taxonomical fragments into a complete conceptual model reduced them to 242 entities, including 25 classes, 196 attributes, and 21 associations.
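The reduction from 429 to 242 entities comes from aggregation. A minimal sketch of the core merging idea follows; PKBD's actual rule-based operations (name clarification, merging, separation) are richer, and the fragment layout used here is an assumption for illustration.

```python
# Hedged sketch of fragment aggregation: taxonomical fragments extracted
# from separate canonicalized tables are merged when class names coincide,
# uniting their attribute sets. Only the merging operation is shown.

def aggregate(fragments):
    """Merge fragments of the form {class_name: set_of_attributes}."""
    model = {}
    for fragment in fragments:
        for cls, attrs in fragment.items():
            model.setdefault(cls, set()).update(attrs)
    return model

f1 = {"Vessel": {"diameter", "wall_thickness"}}
f2 = {"Vessel": {"material"}, "Defect": {"location"}}

model = aggregate([f1, f2])
# The two "Vessel" fragments collapse into one class holding the union of
# their attributes; distinct classes ("Defect") are kept separate.
```

Duplicate classes appearing in many tables collapse into one, which is exactly why the aggregated model is substantially smaller than the raw set of extracted entities.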
The resulting fragments were verified by the domain experts and compared with the existing ISI-models dataset provided by IrkutskNIIhimmash (ISI-models). In this comparison, the domain experts found that 17% (69 out of 400) of the concepts from the ISI-models dataset have identical counterparts among the concepts of the models obtained by analyzing the ISI-161 dataset, including entities (Table I), properties, and relationships (Table II). Figure 10 presents a fragment from the ISI-models dataset (the 01.mdl file) and the corresponding concepts from the transformation results.
The model validation revealed the need to use additional information sources to build useful knowledge bases, for example, conceptual models describing the dynamics of technical states (ISI-models), in particular for solving diagnostic and forecasting tasks. These models were obtained from a survey of experts at IrkutskNIIhimmash. Moreover, the coincidence reaches 24% (106 of 400) when we complement the concepts from the ISI-models dataset with the relevant properties of the corresponding concepts from the ISI-161 dataset. Table III shows the quantitative characteristics of the compared datasets.
As a result, the processing of 161 spreadsheet tables from 6 ISI reports provided 24% of the elements of the ISI domain model. This confirms the applicability of the approach to conceptual model construction.

3) SYNTHESIZING RULE-BASED KNOWLEDGE BASES
The resulting conceptual models were transformed into knowledge base structures using PKBD. Figure 11 shows examples of the resulting refined structures in RVML. Note that these structures were used as prototypes or drafts for ontologies and knowledge bases, in particular for the software supporting ISI [41]. For this purpose, we used the ability of our software to synthesize syntactically correct source code that includes descriptions of template facts and rules for PHP, DROOLS, and CLIPS (Figure 12).
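The model-to-code step can be sketched for the CLIPS target as follows. The function name and the shape of the model input are assumptions for illustration; the real PKBD generator also emits rules and supports the PHP and DROOLS targets.

```python
# Illustrative sketch of synthesizing a CLIPS template fact from a
# conceptual-model class: each class becomes a deftemplate, each attribute
# a slot. Not PKBD's actual generator; names are hypothetical.

def to_clips_deftemplate(class_name, attributes):
    """Render one conceptual-model class as CLIPS deftemplate source code."""
    slots = "\n".join(f"  (slot {attr})" for attr in attributes)
    return f"(deftemplate {class_name}\n{slots})"

code = to_clips_deftemplate("vessel", ["diameter", "wall-thickness", "material"])
print(code)
# (deftemplate vessel
#   (slot diameter)
#   (slot wall-thickness)
#   (slot material))
```

Because the generator works from a language-independent model, adding another target (e.g., a DROOLS declared type) amounts to writing another rendering function over the same class descriptions.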

B. EXPERIMENTAL EVALUATION
TabbyXL and PKBD implement (a) the transformation from arbitrary tables to canonicalized ones, (b) the transformation from canonicalized tables to conceptual models, and (c) the transformation from conceptual models to source code. To evaluate the performance of our implementation, we use the well-known measures of recall and precision. For the first transformation, they were adopted as follows:

recall = |R ∩ S| / |S|,  precision = |R ∩ S| / |R|,

where R is the set of entities in the resulting table, and S is the set of cell values in the corresponding source spreadsheet. For the second transformation, they are calculated as follows:

recall = |TP| / (|TP| + |FN|),  precision = |TP| / (|TP| + |FP|),

where TP is the set of correctly transformed entities (cell values), FP is the set of incorrectly transformed entities, and FN is the set of untransformed entities that could be transformed. Table IV enumerates the results of the evaluation for both transformations.
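For the second transformation, the measures follow directly from the sizes of the TP, FP, and FN sets described in the text; the example sets below are invented purely to exercise the arithmetic.

```python
# Computing recall and precision for the tables-to-models transformation
# from the TP/FP/FN sets of entities. The concrete set members here are
# made-up illustrations, not data from the ISI-161 evaluation.

def precision(tp, fp):
    return len(tp) / (len(tp) + len(fp))

def recall(tp, fn):
    return len(tp) / (len(tp) + len(fn))

tp = {"Vessel", "Defect", "Hardness"}  # correctly transformed entities
fp = {"SpuriousMerge"}                 # incorrectly transformed entities
fn = {"ThirdLevelConcept"}             # transformable entities that were missed

assert precision(tp, fp) == 0.75  # 3 / (3 + 1)
assert recall(tp, fn) == 0.75     # 3 / (3 + 1)
```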

VII. DISCUSSION
The performance evaluation showed that, using this approach, we managed to transform most of the arbitrary and canonicalized tables. Note that the transformations were evaluated only formally. However, in the context of our case study, the ISI report tables were not entirely adequate (estimated at 17% and 24%) from the viewpoint of a qualitative evaluation of the utility of the obtained models. An analysis of the concepts obtained from the ISI-161 and ISI-models datasets showed a difference in the emphasis of these models. While the ISI-161 dataset contains information mainly about the results of the technical diagnostics of equipment, the ISI-models dataset focuses on describing the entire ISI procedure, including tasks such as developing an ISI program, analyzing and interpreting the diagnostic results, making repair decisions, and forming a conclusion (report) for ISI. For this reason, 173 concepts from ISI-161 were not used further.

We identified the main reasons for the reduced recall and precision at both transformation stages. The failed cases of the first stage (table canonicalization) were mainly caused by: (I) new layouts of source tables that are not supported by the transformation rules we developed; (II) erroneous assignment of a cell type in the resulting data (e.g., the date data type assigned to a cell with the numeric data type). The errors of the second stage (tables to conceptual models) occurred mostly due to: (I) imperfection of the transformation rules for processing an embedded hierarchy of concepts, in particular, skipping the third hierarchy level of concepts for row headings; (II) imperfection of the aggregation strategies for conceptual model fragments (e.g., the merging of the "gt_20, MPa" and "gb_20, MPa" concepts, which are syntactically similar but semantically different); (III) invalid cell types in the input spreadsheet data.
We made a qualitative comparison of our approach with others in terms of generating conceptual models from spreadsheets (Table V). Each of the considered methods fails to process the table layouts from ISI reports. Therefore, the use of our approach and software (TABBYXL and PKBD) is promising in this respect; in particular, it helps to generate higher-level abstractions (e.g., specifications and source code of rule-based knowledge bases) based on the formalized and represented knowledge.
PKBD synthesizes syntactically correct source code based on the extracted knowledge. However, the internal representation of rules (which ensures independence from a specific programming language) supports generating only fairly simple rules, without such language-specific elements as variables, calculated expressions, and functions. Such constructs can be added after code synthesis, during debugging and integration.

VIII. CONCLUSION
We propose a novel three-phase approach based on spreadsheet data and conceptual model extraction and transformation to automate the formalization and representation of complex engineering knowledge in the context of knowledge base construction. Our approach was implemented by the data-level integration of two tools: TABBYXL and PKBD. TABBYXL is used to extract spreadsheet data from arbitrary spreadsheets and to transform them into the canonical form. PKBD generates a conceptual model from the canonicalized tables and synthesizes a rule-based knowledge base. We used the ISI-161 dataset to conduct the experimental evaluation of our approach. The experiments revealed that it is suitable for processing spreadsheet data from ISI reports and for generating domain models and knowledge base source code in the ISI area. However, the set of rules for aggregating fragments of conceptual models needs improvement. For example, semantic similarity can be used to improve the merging of classes and individual attributes. It should also be noted that the thematic connectivity of source spreadsheet tables is important for obtaining useful domain models.
The novelty of our approach is justified as follows: (I) an original canonical form of tables, providing an intermediate representation and automated processing of arbitrary tables with different layouts; (II) the use of CRL, a domain-specific language, for expressing rules for the transformation of spreadsheet tables from ISI reports; (III) the algorithms for converting canonicalized tables to fragments and aggregating them into a conceptual model; (IV) the use of our tools (TABBYXL and PKBD); (V) ISI-161, a new dataset of source tables and conceptual models for the ISI procedure.
Our approach is considered solely in the context of semi-automated formalization and representation of knowledge, with only spreadsheet tables used as source data, without additional information. The proposed solution can support the development of knowledge-based systems in different domains.
In the future, we plan to enhance the accuracy by clarifying and involving layout properties of source tables, and by improving the rules for aggregating and converting canonicalized tables. Table transformations can also be improved by using external taxonomies (ontologies) and annotation techniques. It is also interesting to find out whether our approach can be applied in domains other than ISI. However, it should be noted that a meaningful, rather than formal, evaluation of the results requires the simultaneous existence of both tabular data and conceptual models for a certain domain, which is quite rare.