maplib: Interactive, Literal RDF Model Mapping for Industry

Knowledge graphs are important for industrial digitalization. Industrial knowledge graphs are often mapped from multiple existing large data sources, and creating a mapping requires the time of scarce subject matter experts (SME). Interactive, literate programming for large-scale mapping would allow mapping engineers to make good use of SME time and to document their work. Currently, there are no open-source tools supporting such a process. To solve this problem, we implement maplib, which leverages existing tooling from data science. In data science, there is widespread use of literate programming using frameworks such as Jupyter notebooks to interactively prepare data and create analyses using in-memory tables called DataFrames. Maplib is implemented in Rust using Polars DataFrames and has Python bindings, allowing us to leverage tooling used in data science. Maplib implements the OTTR mapping language, which is highly suited for industrial use cases. Maplib features a SPARQL engine defined directly on DataFrames, making querying possible immediately after mapping. We evaluate our approach by comparing mapping and querying performance with Morph-KGC and SPARQL Anything on the GTFS Madrid benchmark. Our approach materializes the graph and is ready to query $47\times$ to $182\times$ faster, and scales to models that are over twice as large. Morph-KGC and SPARQL Anything perform better for most, but not all, of the queries once the graph has been constructed. However, on the end-to-end task of mapping and querying, which is very important for interactive mapping, maplib performs better for most queries.


I. INTRODUCTION
Knowledge graphs are digital representations of our knowledge of the world, made up of nodes representing concrete or abstract entities or concepts, and edges between them representing relationships of different kinds. In industrial settings, knowledge graphs can integrate heterogeneous data sources into a unified model of an asset or process. Industrial knowledge graphs are important for industrial digitalization [1], [2]. Constructing or mapping industrial knowledge graphs requires the time of scarce subject matter experts (SME). Interactive, literate approaches to knowledge graph construction can be useful in these settings, as the short duration of the feedback loop allows an engineer to make good use of the SME.
The associate editor coordinating the review of this manuscript and approving it for publication was Mu-Yen Chen.
Data analysis and data engineering in Python are very popular. In combination with interactive Jupyter Notebooks [3], Python is thought to improve the speed with which data professionals can explore, process and analyse a data set. Literate programming was invented by Donald Knuth, and is a form of programming that follows the form of literature: the flow of the program follows a line of reasoning [4]. In literate programming, prose is interspersed with code and outputs. Jupyter Notebooks are considered a form of literate programming, and allow the data analysis process to be inspected and documented as it proceeds [5], [6]. Data analysis and data engineering in Python very often involve the Pandas library [7], which has powerful in-memory tables called DataFrames. Pandas offers a rich and performant set of operations on these DataFrames, and can import from a myriad of heterogeneous sources with few lines of code [8]. These operations overlap to a large degree with the tasks that are performed when constructing a knowledge graph.
Open-source, interactive, literate knowledge graph construction tools that allow intermediary results to be inspected and manipulated do not exist, making it difficult to quickly and exploratively create knowledge graphs. Integrating with the DataFrame-based Jupyter Notebook ecosystem could provide this support with no additional work. While there does exist an open-source tool for knowledge graph construction and querying in Python called Morph-KGC [9], which is based on Pandas, it relies on atomic, non-interactive execution of knowledge graph construction, and does not actually integrate with DataFrames. This lack of integration is likely due to the use of end-to-end declarative mappings using RML, a mapping language that defines end-to-end transformations from sources such as databases or files to knowledge graphs [10]. RML implementations typically allow neither inspection of nor interaction with intermediary results [11]. RML also does not support templates, which are important in industrial settings with a large degree of modularity [12]. Morph-KGC relies on third-party databases and performs a data transformation and copying operation in order to populate these databases with the mapped knowledge graph [13]. Using DataFrames as a database primitive, one can forgo such a transformation step and query the mapping output directly.
To address these gaps, we introduce maplib. Maplib is written in Rust, and uses Polars DataFrames [14] as the core data structure. Polars DataFrames perform better than Pandas DataFrames due to lazy execution, high parallelism, and the implementation of Polars in Rust [14], [15]. It has Python bindings that allow DataFrames to be transferred back and forth between Rust and Python with minimal cost. Maplib implements OTTR [12], which is a template-based mapping language oriented around API-like functional abstraction. These abstractions are highly suitable for integration with DataFrames. The template-based nature of OTTR also makes it highly suitable for industrial settings. We adapt a DataFrames-based SPARQL query engine that we developed earlier [16] for maplib, and make it possible to query mapped knowledge graphs immediately after they are constructed, without a separate transformation/copying step. The SPARQL engine of maplib is also able to extend the knowledge graph with the results of construct queries without transformation or copying, permitting fast enrichment of the knowledge graph. To evaluate maplib, we compare it with state-of-the-art solutions Morph-KGC [9] and SPARQL Anything [17] on the Madrid GTFS benchmark [18], which features knowledge graph construction together with challenging queries.
Our paper is structured as follows. We first introduce relevant background knowledge in Section II. We consider the motivation, research problem, and requirements in Section III. We describe existing solutions in relation to our requirements in Section IV. The solution approach and implementation are described in Section V. We evaluate maplib in Section VI, before presenting our conclusions as well as future work in Section VII.

II. BACKGROUND
This section covers background knowledge important to our work. We discuss Semantic Web Technologies (SWT) and knowledge graphs in Section II-A. Major languages used to construct knowledge graphs are introduced in Section II-B. We introduce literate programming and how it is being used in modern data analysis in Section II-C. Finally, we describe important technologies for modern data analysis and data engineering that we use in this work in Section II-D.

A. SEMANTIC WEB TECHNOLOGIES AND KNOWLEDGE GRAPHS
The Semantic Web started as an initiative to create interoperable ways of sharing meaningful information on the Web, where websites would share information in particular formats, allowing information to be integrated across websites and used to support new use cases [19]. Out of this work came powerful ways of representing knowledge, and for querying and reasoning about that knowledge.
The Resource Description Framework (RDF) is a way of representing knowledge based on triples containing subjects, verbs, and objects [20]. Triples either describe some relationship between the subject and the object or associate a piece of data with the subject. The subjects and objects of such a collection of triples form the nodes of a graph, while the verbs form the edges. The non-data nodes are called resources and may refer to real or imagined entities. Verbs typically have agreed-upon meanings. In sum, such a graph represents knowledge about a domain of discourse, and is often referred to as a knowledge graph.
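As a minimal illustration of the triple model described above, the following Python sketch stores a few triples as plain tuples and derives the nodes and edge labels of the resulting graph. The `ex:` identifiers and the `Literal` wrapper are hypothetical names chosen for illustration and are not taken from any RDF library.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """A piece of data attached to a subject (hypothetical wrapper)."""
    value: object


# Each triple is (subject, verb, object); resources are plain strings here.
triples = {
    ("ex:Pump1", "rdf:type", "ex:Pump"),
    ("ex:Pump1", "ex:partOf", "ex:Plant7"),
    ("ex:Pump1", "ex:ratedPowerKw", Literal(75)),
}

# Subjects and objects form the nodes of the graph; verbs label the edges.
nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
resources = {n for n in nodes if not isinstance(n, Literal)}
edge_labels = {v for _, v, _ in triples}
```

Here the third triple associates a piece of data with the subject, while the first two relate the subject to other resources.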
Higher-order knowledge graphs called ontologies represent the concepts and terms in the knowledge graph, and are typically used to represent a particular domain of discourse in a uniform and coherent way [21]. Knowledge graphs based on RDF can be queried with SPARQL [22], a query language which allows users to formulate queries using the concepts from ontologies. We call such data access Ontology Based Data Access (OBDA) [23]. Compared to query languages such as SQL, SPARQL abstracts away more technical details on how data are stored, and focuses on expressing the conceptual relationships inherent in our information retrieval task. SPARQL can either be executed on a triple store which physically stores the RDF triples, or we can have data stay in an existing database and translate SPARQL queries into the query language of this database. In the first case, we say that we are materializing the Knowledge Graph. In the latter case, we say that we have a Virtual Knowledge Graph (VKG) [24].
VOLUME 11, 2023

B. RDF MAPPING LANGUAGES
A core challenge is to transform data from other formats into RDF-based knowledge graphs so that the data can, for example, be queried and reasoned about. The Relational database To RDF Mapping Language (R2RML) is a W3C-recommended declarative language for describing mappings from relational databases to RDF [25]. The language describes how to construct the subject IRI for all rows in a table, and how to construct a number of triples for each row. The language is designed with two use cases in mind. The first use case is to materialize RDF triples based on data in a relational database. The second use case is to support VKGs. That is, the mapping should contain the information necessary to be able to answer SPARQL queries by translating them into SQL queries over the relational database [26].
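The row-wise idea behind such mappings can be sketched in a few lines of Python. This is not R2RML syntax: the subject IRI template, the column-to-predicate table, and the `ex:` vocabulary are assumptions made purely for illustration.

```python
SUBJECT_TEMPLATE = "http://example.org/stop/{stop_id}"  # hypothetical IRI template
PREDICATES = {  # column name -> predicate (hypothetical vocabulary)
    "stop_name": "ex:name",
    "stop_lat": "ex:lat",
}

rows = [
    {"stop_id": "s1", "stop_name": "Sol", "stop_lat": 40.4169},
    {"stop_id": "s2", "stop_name": "Atocha", "stop_lat": 40.4066},
]


def map_rows(rows):
    """Yield (subject, predicate, object) triples, with one subject IRI per row."""
    for row in rows:
        subject = SUBJECT_TEMPLATE.format(**row)
        for column, predicate in PREDICATES.items():
            yield (subject, predicate, row[column])


triples = list(map_rows(rows))
```

Materialization corresponds to running such a function over every row; the virtual case would instead use the same mapping information to rewrite queries.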
RML is a language that extends R2RML to cover other data sources such as CSV and JSON files [10], iterating over rows of the CSV or matches of a JSONPath in case of a JSON file. Multiple implementations of RML exist, such as SDM-RDFizer [27] and Morph-KGC [9] for Python. R2RML has also been extended (R2RML-F) with a facility for specifying functions that should be applied to the input data before it is used to instantiate triples [28], using a standardized set of functions called FnO [29]. Since RML includes R2RML, we will use the term ''RML'' to refer to them collectively in the rest of the paper.
The Façade-X mapping language allows users to introduce heterogeneous data sources such as relational databases, JSON files, and CSV files into SPARQL through the SPARQL Service construction [17]. An abstraction called a facade is used to expose such data as RDF, using special predicates that are used to access, e.g., the columns of a CSV file or keys in a JSON. This allows end-to-end mappings to be defined declaratively, using lightly annotated SPARQL queries.
However, both RML and Façade-X lack facilities for functional abstraction over mappings. Individual components may be reused, but it is impossible to group and nest such components. The issue of reusable components has been addressed by Ontology Design Patterns (ODP) [30], but this approach has been criticized for being hard to use in practice [31]. In particular, ODP was criticized for not having user-facing abstractions [31]. The Reasonable Ontology Templates (OTTR) mapping language addresses this shortcoming by creating a templating language that is based on functional templates with arguments [12].
OTTR templates are essentially reusable macros that can be referenced by users. The head of an OTTR macro consists of its parameters. In the body, other templates are called using a combination of these parameters and static terms. Executing an OTTR template consists of rewriting each template in the body, replacing each parameter with the argument provided. The process of substitution is repeated with the templates in the body until there are only ottr:Triple-templates, which have three arguments that map one-to-one with the subject, verb, and object of the RDF triples to be created. OTTR templates are nested, but cycles are disallowed [31]. We discuss OTTR in greater detail and provide an example in Section V.
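The expansion process described above can be sketched as a small recursive rewrite in Python. This is a simplified illustration rather than stOTTR syntax or the maplib API: the dictionary encoding, the `?`-prefix convention for parameters, and the `ex:` template names are all assumptions.

```python
# Each template maps to (parameter names, body).
# The body is a list of (template name, arguments).
# Arguments starting with "?" are parameters; everything else is a static term.
TEMPLATES = {
    "ex:Typed": (["?x", "?cls"], [("ottr:Triple", ["?x", "rdf:type", "?cls"])]),
    "ex:PumpInPlant": (
        ["?pump", "?plant"],
        [
            ("ex:Typed", ["?pump", "ex:Pump"]),
            ("ottr:Triple", ["?pump", "ex:partOf", "?plant"]),
        ],
    ),
}


def expand(name, args):
    """Rewrite a template instance until only ottr:Triple instances remain."""
    if name == "ottr:Triple":
        return [tuple(args)]  # (subject, verb, object)
    params, body = TEMPLATES[name]
    binding = dict(zip(params, args))
    triples = []
    for callee, callee_args in body:
        # Replace each parameter with the argument provided; keep static terms.
        substituted = [binding.get(a, a) for a in callee_args]
        triples.extend(expand(callee, substituted))
    return triples


result = expand("ex:PumpInPlant", ["ex:Pump1", "ex:Plant7"])
```

The recursion terminates only because the template definitions are acyclic, which is the reason cycles are disallowed.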
The Terse Syntax for OTTR (stOTTR) is a syntax for specifying OTTR templates in text in a way that seeks to be easy to use [12]. OTTR differs significantly from RML and Façade-X in that it makes no attempt to describe the data source or to allow translation of SPARQL queries to a source database. OTTR is especially suited for industrial use cases, as these tend to be built on standardized, modular representations that are combined in myriad ways to represent, e.g., various types of equipment and their properties [12], [32], [33].

C. LITERATE PROGRAMMING
Literate programming was invented by Donald Knuth, and refers to programming that takes a form similar to literature [4]. The purpose of a literate program is to structure code for readability, following a train of thought, motivating and describing the steps with prose. In a step called weaving, the output of the program is combined with the verbal description to create an integrated document [4].
Modern literate programming environments such as Jupyter Notebook have become very popular among data scientists for documenting data processing and analysis, particularly in an exploratory stage [5]. Jupyter Notebooks combine literate programming with interactivity, allowing data scientists to get rapid feedback at each stage of data processing and analysis, provided these steps are not long-running.

D. PANDAS, DataFrames, ARROW, POLARS AND PARQUET
Pandas is a high-performance open-source library for data manipulation and analysis in Python [7], which has become a standard tool in data analysis. In a survey of over 23,000 Python developers conducted by the Python Software Foundation and IDE-vendor JetBrains, 53% of respondents answered that they used Python for data analysis, and 55% of respondents used Pandas. Python is among the world's most popular programming languages, perhaps the most popular [6]. It is not clear from the survey how representative it is for the population of Python developers, but it is safe to assume that Pandas is an extremely popular data analysis/data engineering tool.
A central data structure in Pandas is the DataFrame. DataFrames are in-memory tables of data, consisting of named columns of data of a homogeneous type. In Pandas, DataFrames are built on data structures from NumPy, a Python library for numerical computations [34]. The Apache Arrow project seeks to standardise such columnar in-memory representations, making it easy to share them among processes on the same machine [35]. By agreeing on the standard, one application can pass an Apache Arrow array to another by providing the pointer to in-memory data instead of copying data. Additionally, Apache Arrow Flight enables transport of columnar data from e.g. databases without having to perform the costly row-oriented serialization associated with ODBC [36].
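The benefit of handing over a reference to in-memory data rather than a copy can be illustrated with Python's standard library alone. The sketch below uses `array` and `memoryview` as a stand-in for a shared columnar buffer; it is an analogy and does not use the Apache Arrow API, and `consumer_sum` is a hypothetical name.

```python
from array import array

column = array("d", [1.5, 2.5, 3.5])  # contiguous column of a homogeneous type


def consumer_sum(buffer):
    """Stand-in for a second component that reads the column."""
    return sum(buffer)


view = memoryview(column)   # shares the underlying memory, no copy is made
copy = array("d", column)   # an independent copy, for contrast

column[0] = 10.0            # the producer updates the column in place

shared_total = consumer_sum(view)  # sees the update
copied_total = consumer_sum(copy)  # still reflects the old values
```

The copy costs memory and time proportional to the column length, whereas the view costs neither, which is the property the shared columnar standard is meant to provide across libraries.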
The open-source Polars library is based on Apache Arrow. It is built in Rust, but has Python bindings. It provides an API that is very similar to Pandas DataFrames, but that additionally supports lazy execution, allowing optimizations to be applied to the execution plan. Polars DataFrames can easily be created from Pandas DataFrames and vice versa. Apache Parquet is a column-oriented file format that is commonly used in data lakes [37]. Parquet can be read into and written from Apache Arrow with very little serialization overhead.
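To make the idea of lazy execution concrete, the following sketch records operations in a plan and reorders them before running anything. The `LazyTable` class is hypothetical and far simpler than the Polars lazy API; it only illustrates one optimization, moving a filter ahead of a column computation that does not produce the filtered column.

```python
class LazyTable:
    """Records operations instead of running them (hypothetical, not Polars)."""

    def __init__(self, rows, plan=()):
        self.rows = rows
        self.plan = plan

    def with_column(self, name, func):
        return LazyTable(self.rows, self.plan + (("with_column", name, func),))

    def filter(self, column, predicate):
        return LazyTable(self.rows, self.plan + (("filter", column, predicate),))

    def optimized_plan(self):
        """Move each filter ahead of steps that do not produce its column."""
        plan = list(self.plan)
        for i in range(1, len(plan)):
            j = i
            while (j > 0 and plan[j][0] == "filter"
                   and plan[j - 1][0] == "with_column"
                   and plan[j - 1][1] != plan[j][1]):
                plan[j - 1], plan[j] = plan[j], plan[j - 1]
                j -= 1
        return plan

    def collect(self):
        """Run the optimized plan on a copy of the rows."""
        rows = [dict(r) for r in self.rows]
        for op, name, func in self.optimized_plan():
            if op == "filter":
                rows = [r for r in rows if func(r[name])]
            else:
                for r in rows:
                    r[name] = func(r)
        return rows


lazy = (
    LazyTable([{"x": 1}, {"x": 2}, {"x": 3}])
    .with_column("y", lambda r: r["x"] * 10)
    .filter("x", lambda v: v > 1)
)
plan_ops = [step[0] for step in lazy.optimized_plan()]
result = lazy.collect()
```

Because the filter runs first, the derived column is computed for two rows instead of three; on large tables this kind of reordering is where lazy execution saves work.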

III. PROBLEM
Exploratory data engineering for analytics relies on literate, interactive programming in the notebook format. At their best, notebooks enable data engineers to quickly and collaboratively prototype pipelines for data processing in the form of an illustrated story. However, Jupyter Notebooks face issues with testability, reproducibility, and modularity that make it challenging to use them in production and maintain them over time [38], [39], [40]. We will assume that notebooks are rewritten into structured form after the exploration phase is done in order to meet common quality requirements for production code (cf. [38]). The core problem addressed by this work is to provide such a facility for mapping engineers who are building industrial knowledge graphs: a facility that can be used in the exploration phase, with the resulting notebooks then refactored into structured form for production deployment and ease of maintenance.
In the rest of this section, we will motivate the problem of template-based interactive mapping that integrates well with DataFrames, arguing that such tooling can improve knowledge graph mapping practices, especially for industry. Finally, we describe our requirements.

A. MOTIVATION
1) THE VALUE OF INTERACTIVITY AND LITERATE PROGRAMMING
OBDA is used to make it easier to access data, especially when those data are located in separate databases, as we are able to abstract away the technical schemas of the individual data sets and integrate them in a common RDF model. Mapping engineers need to understand these data sets in order to map them to RDF and integrate them. Attempting to perform a rudimentary data integration is a way of accomplishing this task that provides quick feedback on the assumptions of the mapping engineer.
When integrating enterprise data sets, one can encounter insufficiently documented, ambiguous data that are hard to reconcile without extensive domain knowledge [41]. In such cases, an SME may have to be consulted. Such individuals are typically in high demand, and it is important to make effective use of their time. A similar argument also applies to the target ontology which will be instantiated. In industrial settings, there are sometimes precise modeling rules to follow when creating a model according to a standard. Correct instantiation of such ontologies relies on domain knowledge. Being able to interactively work on constructing a mapping is helpful, as this provides immediate feedback to the mapping engineer on how well she has understood and manipulated the source data sets, and on how appropriate the mapping is. An interactive mapping environment should enable the mapping engineer to quickly follow up with the SME to resolve questions.
The form of the feedback is important in order to be able to collaborate and communicate the results of the exploratory data integration. It is ineffective to manually collect the outputs of various mapping scripts and programs in a document. Corrections to the mapping upstream of such outputs lead to repetitive manual updating of the documentation. A literate programming environment such as Jupyter Notebooks avoids this toil, as the script is in a literate form with outputs that update when the script is rerun. The audience of this literate program is not just mapping engineers. A literate mapping will inform the SME on how the data sets of the organization can be integrated, and make visible the consequences of such integration through queries or other means of inspection.

2) THE VALUE OF INTEGRATING WITH EXISTING DATA ENGINEERING TOOLS AND FORMATS
Today, there are very mature data engineering tools that are very well maintained. Pandas is likely to remain so, as thousands of engineers and organizations depend on it. There are multiple advantages to being highly interoperable with DataFrame-based data engineering tooling that are due to the broad acceptance of these tools. It is easy to find documentation, and common issues are well documented on websites such as Stack Overflow. There is extensive support for reading and writing data to and from different file formats and databases. Many of the data preparation tasks that must be conducted by mapping engineers are already implemented. There is support for configurable literate outputs in Jupyter Notebooks. Existing data engineering tooling has high performance enabling interactive processing of large datasets. Many data engineers will know these tools very well, and thus face less effort in adopting a mapping engineering tool as their existing skills can be reused. By integrating with such tooling, it is possible for developers to focus on solving the tasks in mapping engineering that are not found in general data engineering tools. Less effort is required to maintain the tooling over time, as important functionality that is not unique to mapping engineering will be maintained separately by a much larger community.
Increasingly, analytical databases are relying on the column-based Apache Arrow and Apache Parquet formats for storage, transmission, and computations on data [42], [43], [44], [45], [46]. We have already discussed how these formats can drastically reduce serialization overhead. Modern DataFrame-based data engineering tooling has good interoperability with these formats. By interoperating well with DataFrames we can exploit these technological developments in knowledge graph construction, for instance by reducing the time taken to write and upload a knowledge base into a triplestore, if and when the triplestore supports such formats.

3) THE VALUE OF ENRICHMENT FOR INDUSTRIAL INFORMATION MODELS
We can consider the case where an application uses SPARQL to access a materialized knowledge graph as a read-only cache to offload a transactional database. Such a situation occurs often in data-intensive systems, and can be handled with caches built using batch processing [47]. However, some SPARQL queries are by their very nature expensive, and may in any case be more time-consuming than we would like for our application. In this case, it can be advantageous to materialize auxiliary RDF properties of interest. Such enrichment is obviously supported with SPARQL insert.
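The kind of enrichment discussed above can be sketched in plain Python by emulating a single construct-style query over a set of triples. The graph contents and the `ex:locatedIn` property are hypothetical, and the function below hard-codes one query pattern; it is not a SPARQL engine.

```python
graph = {
    ("ex:Sensor1", "ex:partOf", "ex:Pump1"),
    ("ex:Sensor2", "ex:partOf", "ex:Pump1"),
    ("ex:Pump1", "ex:partOf", "ex:Plant7"),
}


def construct_located_in(graph):
    """Emulate: CONSTRUCT { ?a ex:locatedIn ?c }
    WHERE { ?a ex:partOf ?b . ?b ex:partOf ?c }"""
    part_of = [(s, o) for s, p, o in graph if p == "ex:partOf"]
    return {
        (a, "ex:locatedIn", c)
        for a, b in part_of
        for b2, c in part_of
        if b == b2  # join on the shared variable ?b
    }


new_triples = construct_located_in(graph)
graph |= new_triples  # enrichment: add the constructed triples to the graph
```

After enrichment, an application can look up `ex:locatedIn` directly instead of evaluating the two-step join on every request.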
Access control is a feature that is not standardized and supported by SPARQL engines. In enterprise settings, it is important to limit access, for instance in accordance with geographical location or with the role of the user. Multiple approaches exist that rely on materializing access control as part of the knowledge graph, and on subsequently rewriting SPARQL queries to ensure the user only accesses the appropriate part of the graph [48], [49]. SPARQL insert queries are one way of creating and maintaining such materialized access control, provided the knowledge graph is sufficiently standardized.
Industrial knowledge graphs may index real-time data arising from sensors and actuators in facilities such as factories or power plants [50]. Applications may read from these values, producing aggregates, KPIs or other derived data, and we have argued in previous work that queries over industrial knowledge graphs can determine the integration of reusable applications [16]. The output data from these applications should also be indexed by the industrial knowledge graph, and an enrichment step is one possible place to add these nodes.

4) THE VALUE OF TEMPLATES
We have already discussed how industrial standards for representing information often are based on components that are composed in order to represent e.g. particular pieces of equipment. For instance, the IEC 61850 standard defines ways of representing information from electrical equipment in part 7-3 [51] that is reused when representing sensor data from hydroelectric power plants in IEC 61850-7-410 [52] and when representing sensor data from wind turbines in IEC 61400-25 [53]. A template-based mapping tool will be able to reuse these templates and create more uniform models. Uniform models are important when knowledge graphs support industrial application integrations, as they make it easier to identify places where e.g. a graphical user interface element or statistical analysis can be reused. This is particularly the case when seeking to reuse the same application across industrial installations or across companies.

B. REQUIREMENTS
Our solution will have to support interactive mapping through a scripting language with library-based integration. Moreover, the solution should be interoperable with DataFrames and Apache Arrow-based data engineering tools. Primarily this requirement applies on the input side. Coupled with interactivity, we should be able to pipe outputs from Arrow-based tools into the mapping library effectively, without using disk-based integration. These tools allow us to define all of the operations of SPARQL. This further implies that the mapped model can be immediately queryable, or queryable with very little I/O overhead if the database is Arrow-based and external.
We will further require the mapping tool to support SPARQL-based inspection of mapped models, enabling inspection in the exploration phase, and validations and tests needed for production deployment in later phases. To support industrial scenarios involving templating, our mapping library should support templates. Additionally, we should be able to use this SPARQL support to post-process the model, enriching it with the outputs of construct queries. Such support will also be important for industrial scenarios where the knowledge graph supports applications.
All of the mentioned functionality should have high performance for large datasets, as this is important for the mapping to be interactive in a practical way. By choosing an in-memory Arrow model, we are limited by available memory. Scaling the solution to work interactively with big data is outside our scope.
We summarise these requirements below:
R1 Support interactive mapping in a script-based language
R2 Interoperability with established Apache Arrow-based data engineering tools
R3 SPARQL-based inspection
R4 Templating support
R5 SPARQL-based post-processing with construct
R6 High performance for large datasets

IV. EXISTING SOLUTIONS
There are multiple existing interactive mapping tools that warrant discussion. SPARQL Anything implements the Façade-X mappings discussed earlier (Section II-B). SPARQL Anything enables the Façade-X language to be used with the Apache Jena ARQ SPARQL engine [54], [55]. SPARQL Anything supports one-off command line interface (CLI) based mapping as well as an Apache Jena Fuseki-based server for executing mappings [55], [56]. Using the server executable, SPARQL Anything allows an interactive mapping process in the sense that users can alternate between querying source data, inserting data from these sources, and querying and enriching the resulting knowledge base. Users of the server must however interact with it using the SPARQL Endpoint, e.g. using HTTP with JSON or a SPARQL GUI [57]. Morph-CSV is a Python-based tool for executing SPARQL queries over CSV files mapped with RML [58]. The tool is not interactive, as it is run from the CLI.
Development on the Morph-CSV project appears to have stopped, with no new releases since 2020 [59], and the effect of time has rendered us unable to run the Morph-CSV code without considerable debugging of dependencies. We have therefore not considered Morph-CSV any further. Arenas-Guerrero et al. construct a related tool called Morph-KGC in Python which allows users to construct a materialized knowledge graph based on an RML mapping [9], which is then loaded into either RDFLib [60] or Oxigraph [61] for interactive querying using Python. Morph-KGC relies on the data engineering library Pandas to read inputs specified in RML mappings, but cannot directly read from Pandas DataFrames. RML mapping rules are potentially interdependent, and parallelization of their execution is not trivial. Arenas-Guerrero et al. introduce a mapping partitioning approach for RML, which allows for improved execution times and reduced memory usage [9].
In a recent review of declarative RDF generation tools, Van Assche et al. find that existing mapping tools execute the mapping process atomically from source to RDF [11]. This is for instance the case for the above mapping tools. Atomic execution is also the case for existing mapping tools implementing OTTR support. Extensions to OTTR called bOTTR and tabOTTR enable OTTR to read structured input such as CSV files, Excel files, and SQL, but process mappings that read such data atomically [12]. Van Assche et al. find that many mapping tools have support for data transformations. These are functions that are applied to input values to transform them before they are encoded in triples [11]. The authors cite the need to extract a year from a date when the target schema requires a birth year, while the input contains the birth date.
There are problems with the atomic execution of mappings both in the exploratory phase and when making mappings production ready. In the exploratory phase, it is useful to observe intermediary results. For instance, when importing and mapping data from SQL, one will typically perform multiple iterations of adjusting the query and the mapping to get the correct end result. Validating the results of the query is very useful in this process. If there are data transformations involved, the utility of intermediary results in troubleshooting the mapping increases. When creating production-ready code, mapping engineers need maintainable mapping code with logging, data validation, and tests. Atomic execution, however, makes it difficult to accomplish these tasks within the mapping process. For instance, it is hard to add validations for the results of SQL queries, to log intermediary statistics such as the number of SQL results and time taken, and to add unit tests. Either the scope of the mapping tool must increase to cover such functionality, or users must move the boundaries of atomic execution and rely on the file-based import of preprocessed CSVs. Alternatively, the mapping language could be extended with a testing facility, but this would greatly increase the scope and complexity of the mapping language.
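The contrast with atomic execution can be made concrete with a small, non-atomic pipeline in Python in which the birth-year transformation mentioned above is a separate step whose result can be inspected, validated, and logged before any triple is produced. The row contents, the `ex:` names, and the validation bounds are assumptions made for illustration.

```python
import logging
from datetime import date

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("mapping")

rows = [
    {"person": "ex:p1", "birth_date": "1969-07-20"},
    {"person": "ex:p2", "birth_date": "1985-01-02"},
]

# Step 1: transform, keeping the intermediary result available for inspection.
for row in rows:
    row["birth_year"] = date.fromisoformat(row["birth_date"]).year

# Step 2: validate and log before any triple is produced.
assert all(1900 <= row["birth_year"] <= date.today().year for row in rows)
log.info("validated %d rows", len(rows))

# Step 3: map the validated rows to triples.
triples = [(row["person"], "ex:birthYear", row["birth_year"]) for row in rows]
```

Each step is ordinary code, so it can be covered by unit tests or timed without extending the mapping language itself.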
The review of mapping tools by Van Assche et al. [11] found at the time that only Morph-CSV was based on Python, but it has CLI-based integration. Some of the tools described by Van Assche et al. do have library-based integration, but these are written in Java and have no Python library with bindings. The more recent Morph-KGC allows library-based Python interaction, but is bound by RML requirements to do integrations as specified in the mapping files. With existing tooling then, the way to pass a DataFrame to a mapping tool is to edit the mapping document to point to a CSV file in the file system and write the DataFrame to CSV. In case the mapping tool is CLI-based, one executes the mapping through the Python subprocess module [62], which permits calling local executables. With SPARQL Anything, the mapping is executed by sending a request to an HTTP endpoint. In the case of Morph-KGC, one has to call the materialize-method that points to a configuration file. Morph-KGC reads this configuration file, which then points to the RML mapping file which then points to the CSV file. RML mapping files are in any case stored on disk, and one has to leave the notebook environment to edit them. CSV files are not typed, so even though one has taken care to properly parse the DataFrame columns, they must be parsed again inside the mapping file, this time with another set of parsing functions that the data engineer must look up.
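The loss of type information in a CSV round trip is easy to demonstrate with Python's standard `csv` module. The sketch below uses an in-memory buffer in place of a file on disk and a plain dictionary in place of a DataFrame row; the column names are hypothetical.

```python
import csv
import io

rows = [{"stop_id": 1, "lat": 40.4169, "wheelchair": True}]

# Write the typed row to CSV (an in-memory buffer stands in for a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["stop_id", "lat", "wheelchair"])
writer.writeheader()
writer.writerows(rows)

# Read it back, as a mapping tool pointed at the CSV file would.
buffer.seek(0)
round_tripped = list(csv.DictReader(buffer))

original_types = {k: type(v).__name__ for k, v in rows[0].items()}
csv_types = {k: type(v).__name__ for k, v in round_tripped[0].items()}
```

Every value comes back as a string, so the integer, float, and boolean columns would each have to be parsed a second time on the mapping side.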
In this section, we have thus far discussed mapping backends. There are also mapping editors that permit interactivity, but none meets all of our requirements. An open-source R2RML mapping editor described by Sengupta et al. [63] allowed users to create a mapping using a wizard which provides feedback at each step. The wizard allows the user to view the results of SQL queries as well as an interactive view of the generated triples as the mapping is being created. The work was published approximately ten years ago at the time of writing, and we were unable to find the source code for the editor. More recently, Heyvaert et al. [64] created a mapping rule visualization tool that was able to provide better visibility into the input data sets used by RML mappings, as well as the resulting triples. In contrast to wizard-based editors, the tool of Heyvaert et al. supports non-linear workflows where users have source definitions, mappings, and produced triples available at the same time. There was however no support for performing and displaying manipulations on the input data set, as the editor executes the mapping using an end-to-end RML processor. Additionally, the tool has no support for inspecting the results by querying. Ontopic Studio is a commercial offering that also provides an interactive GUI-based way of exploring the data source and the mapped knowledge graph with SPARQL queries [65]. Their tool is however specialized for the setting of virtual knowledge graphs, not materializations.

V. SOLUTION APPROACH AND IMPLEMENTATION
Our solution approach is to perform stateful mapping, where what we call a mapping state is updated by calling a simple API. In this section, we first describe our chosen mapping language OTTR in Section V-A. We describe the API through a simple example in Section V-B. The template expansion process is described in detail in Section V-C. We discuss implementation details in Section V-D and compare our solution with Morph-KGC and SPARQL Anything in Section V-E.

A. OTTR
Multiple mapping languages could in principle be supported by the API, but we have chosen the OTTR mapping language. We describe how this choice supports our requirements below. Whereas RML is designed for tight coupling with data sources in order to support SPARQL queries to be executed over source relational databases, OTTR is designed with loose coupling through an API in mind. This design feature makes it much better suited for composition in a data engineering workflow. Additionally, OTTR supports our need for reusable templates in industrial mapping scenarios (cf. [12], [32], [33]). We use the terse syntax for OTTR, which is called stOTTR [66]. By requiring the use of OTTR, our implementation will meet R4. OTTR does not support interdependencies such as joins between mapping rules. The absence of this feature is in fact an advantage over RML in our context. It is likely that data engineers with experience in DataFrame-based tools prefer to perform joins using these DataFrames, where they can be worked with and validated interactively, instead of inside an atomically executed mapping language where join results can only be inspected by inspecting the generated knowledge graph. As such, lacking this feature helps us support R1 of interactive mapping.
Through industrial experience, we have found alignment problems that are much harder than can be handled with joins defined in RML. In industry, there are typically distinct systems managing operational data flows and asset structure. The naming schemes for individual streams of industrial data (tags) could be based on loosely followed company internal standards and be constrained in the number of characters allowed by legacy communication protocols such as MODBUS [67], whereas an asset management system has a different way of identifying the data stream. Connecting data from asset management to operational data thus requires advanced string-matching tools or even lookup tables created from analog documentation. First, by pushing interdependent computations outside of the mapping framework, we simplify our tooling and create a uniform workflow for industrial settings. Second, the lack of interdependencies makes the OTTR mapping problem trivially parallelizable, which makes it much easier to improve the performance of the mapping tool, allowing us to meet R6 with greater ease.
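The kind of alignment that is pushed outside the mapping framework can be sketched as follows. This is a minimal illustration under stated assumptions, not part of maplib: the tag names, asset identifiers, the lookup table and the normalization rule are all hypothetical.

```python
import re

# Hypothetical lookup table, e.g. transcribed from paper documentation:
# legacy tag name -> identifier used by the asset management system.
LOOKUP = {"PMP_01_FLW": "Pump-001/Flow"}


def normalize(name: str) -> str:
    """Upper-case the name and drop every character that is not A-Z or 0-9."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())


def align(tags, assets):
    """Map each tag to an asset id: lookup table first, then normalized match."""
    by_norm = {normalize(a): a for a in assets}
    result = {}
    for tag in tags:
        if tag in LOOKUP:
            result[tag] = LOOKUP[tag]
        else:
            result[tag] = by_norm.get(normalize(tag))  # None if unmatched
    return result


tags = ["PMP_01_FLW", "vlv-0002.pos", "TNK9LVL"]
assets = ["Pump-001/Flow", "VLV_0002_POS", "Tank-10/Level"]
alignment = align(tags, assets)
```

The resulting alignment table can be joined onto the input data before any template is expanded, so the mapping itself stays free of joins. Unmatched tags (here `TNK9LVL`) surface as `None` and can be inspected interactively.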

B. API
To initialize a mapping state, a constructor takes named mapping rules as input, which crucially are not connected to input data. The API consists of an operation for applying a mapping rule to a DataFrame of inputs in order to map triples, an operation for querying the triples in the mapping state using SPARQL, an operation for adding triples with a SPARQL construct query, and data output methods. Fig. 1 shows the operations for adding triples, querying them, and enriching the knowledge base. The constructor is used to create a mapping state, and takes a list of stOTTR documents as arguments. A stOTTR document is a string that typically contains prefixes and stOTTR templates. An example abbreviated stOTTR-document together with an instantiation of a Mapping object in Python is given in Listing 1. We have based this example on the example templates for mapping knowledge about Pizzas in Skjaeveland et al. [12], which the authors use to introduce OTTR and the stOTTR-syntax. We note that the datatype annotation xsd:anyURI on the ?c parameter is necessary since country occurs only in the object position and may be a string literal or an IRI. The same goes for the ingredient parameter ?is, except this datatype is nested in a list. In OTTR, template rule applications are called template expansions, which is why the API method for rule application is called expand. The expand method takes as input a template identifier (IRI or prefixed) together with an Apache Arrow-based DataFrame with corresponding columns. Note that Apache Arrow is only a specification of how columnar data is represented in-memory, and that multiple implementations exist.
Listing 1. Example instantiation of mapping-object using an abbreviated template from [12].
Apache Arrow specifies a set of datatypes, including list types. We expect the input DataFrame to use these types, and not to perform string-encoding of input data. If we want to expand the template with heterogeneous types in one of the arguments, we must do separate calls to expand. From the datatype of a column, an implementation should infer the OTTR datatype (and consequently the RDF datatype) of the argument if it is not specified in the template signature. Inconsistent datatypes should produce an error. This type-inference also extends to Apache Arrow list types. In sum, this API design allows us to support R2 of interoperating with data engineering tools in a comprehensive way.
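The inference rule described above can be illustrated with a small sketch. It works on plain Python lists rather than Apache Arrow columns, and the correspondence between value types and XSD datatypes is an assumption made for the example, not the table used by maplib.

```python
from datetime import datetime

# Assumed, simplified correspondence between Python value types and XSD datatypes.
XSD = {
    bool: "xsd:boolean",
    int: "xsd:long",
    float: "xsd:double",
    str: "xsd:string",
    datetime: "xsd:dateTime",
}


def infer_datatype(column):
    """Infer a single datatype for a column; list columns give List<...> types.

    Raises TypeError when the values disagree (or the column is empty, in
    which case nothing can be inferred in this sketch).
    """
    inferred = set()
    for value in column:
        if isinstance(value, list):
            inferred.add(f"List<{infer_datatype(value)}>")
        else:
            inferred.add(XSD[type(value)])
    if len(inferred) != 1:
        raise TypeError(f"inconsistent datatypes: {sorted(inferred)}")
    return inferred.pop()


scalar_type = infer_datatype([1, 2, 3])
list_type = infer_datatype([["a"], ["b", "c"]])
```

A column that mixes, say, integers and strings raises an error, which corresponds to the requirement that heterogeneous arguments be expanded in separate calls.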
We build a very simple dataset below in Listing 3, and expand the ex:Pizza template. We first define some prefixes in Listing 2.

Listing 3. Example dataset and template expansion.
The output of printing the DataFrame looks approximately like Table 1. Now, we can inspect the model using a SPARQL query. The SPARQL query produces the results given in Table 2. Next, the model may be enriched using the insert method. First, we define a construct-query that we want to enrich the model with, and execute it to see that it produces the expected results. This query produces a list of DataFrames of length one. The contained DataFrame is shown in Table 3.
Since the results are as expected, we enrich the model and check that it is updated in Listing 6. The results of this query are shown in Table 4. Note that the insert method accepts construct-queries in addition to insert-queries, as this allows us to use the same construct-query to verify and to insert new triples without modifying the string. Finally, we may export the mapping state either to an NTriples-file or to a collection of Parquet-files using the write_ntriples or write_native_parquet methods respectively.
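The stateful workflow of this section (construct a state from rules, expand, query, preview an enrichment, then insert it) can be mimicked with a toy stand-in. The class below is not maplib and does not parse stOTTR or SPARQL; templates and construct rules are plain Python functions, and all names (`ToyMappingState`, `pizza:fromCountry`, `ex:ItalianPizza`) are hypothetical.

```python
class ToyMappingState:
    """Toy stand-in for a mapping state; NOT the maplib implementation."""

    def __init__(self, templates):
        # templates: name -> function turning one input row (a dict) into triples.
        # The rules are registered without any reference to input data.
        self.templates = dict(templates)
        self.triples = set()

    def expand(self, name, rows):
        """Apply a named mapping rule to tabular input and store the triples."""
        for row in rows:
            self.triples.update(self.templates[name](row))

    def query(self, s=None, p=None, o=None):
        """Return triples matching one pattern; None acts as a variable."""
        return sorted(
            t for t in self.triples
            if all(q is None or q == v for q, v in zip((s, p, o), t))
        )

    def insert(self, construct):
        """Add the triples that `construct` derives from the current state."""
        new = set(construct(self.triples)) - self.triples
        self.triples |= new
        return len(new)


def pizza(row):
    yield (row["p"], "rdf:type", "pizza:Pizza")
    yield (row["p"], "pizza:fromCountry", row["c"])


def italian(triples):
    return [(s, "rdf:type", "ex:ItalianPizza")
            for s, p, o in triples
            if p == "pizza:fromCountry" and o == "ex:Italy"]


state = ToyMappingState({"ex:Pizza": pizza})
state.expand("ex:Pizza", [{"p": "ex:Margherita", "c": "ex:Italy"}])
preview = italian(state.triples)  # verify the enrichment first ...
added = state.insert(italian)     # ... then apply the very same rule
```

As in the text, the same rule is used unchanged for previewing and for inserting, so there is no opportunity for the verified and the applied enrichment to drift apart.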

C. TEMPLATE EXPANSION
This section looks closely at the template expansion, and how it works in our solution approach. OTTR template expansion can be implemented using a simple recursive function on the instance or row level, which we will discuss as a baseline algorithm, before we look at the column-based algorithm being used. Listing 7 shows this baseline recursive function written in Python notation. We allow ottr:Triple to refer to the corresponding Template object, even though this is not proper Python.

Listing 7. Instance-level recursive procedure for expanding OTTR templates.
expand_row takes a template as the first argument. A template consists of a signature and a body with a list of patterns [31]. In Listing 1 the signature was between the [ and ]-brackets, while the lines with ottr:Triple(...) were the patterns. The base condition of this recursive function is when we reach an ottr:Triple-template. In this case, we can create the result. Otherwise, we perform recursive calls to expand_row after creating a new list of arguments, replacing parameters with their arguments.
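A runnable sketch of such a row-level procedure is given below. It follows the description above (base case at ottr:Triple, otherwise substitute parameters and recurse) but is our own reconstruction under simplifying assumptions: list parameters and expanders are left out, and the body of the ex:Pizza template, including the predicate `pizza:fromCountry`, is assumed for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    name: str
    params: list                                  # signature, e.g. ["?p", "?c"]
    patterns: list = field(default_factory=list)  # body: (Template, args) pairs


TRIPLE = Template("ottr:Triple", ["?s", "?v", "?o"])  # the base template


def expand_row(template, args):
    """Expand one instance of `template` into a list of (s, v, o) triples."""
    if template is TRIPLE:
        return [tuple(args)]  # base case: the arguments are the triple
    binding = dict(zip(template.params, args))
    triples = []
    for sub_template, sub_args in template.patterns:
        # Replace parameters by their arguments; constants pass through.
        new_args = [binding.get(a, a) for a in sub_args]
        triples.extend(expand_row(sub_template, new_args))
    return triples


PIZZA = Template("ex:Pizza", ["?p", "?c"], [
    (TRIPLE, ["?p", "rdf:type", "pizza:Pizza"]),
    (TRIPLE, ["?p", "pizza:fromCountry", "?c"]),
])
triples = expand_row(PIZZA, ["ex:Margherita", "ex:Italy"])
```

Each input row triggers its own chain of recursive calls, which is the per-row overhead that the column-based procedure avoids.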
We consider a call to expand_row with the ex:Pizza template from Listing 1, without considering the list parameter ?is as it is not covered by Listing 7. The call is shown in Listing 8. We will allow ex:Pizza to refer to a Template object for simplicity, even though this is not proper Python syntax, as we did with ottr:Triple previously.

Listing 8. An example call to the baseline row-based expansion function.
We iterate through the patterns in our template (excluding the cross for simplicity), producing new lists of arguments where parameter variables are replaced by their arguments. This produces new calls to expand_row, which are shown in Listing 9.
Listing 9. New calls to expand_row in our example.
The end result is a list of triples as shown in Listing 10. However, in our case, we are expanding templates in a column-based fashion. Although we follow the semantics of OTTR template expansion, there are important differences. The process of expanding OTTR templates in a column-oriented way is presented using Python syntax in Listing 11.
Listing 11. Simplified algorithm for recursive expansion of OTTR templates in Python.
The expand_col method takes four arguments: the template to expand, the DataFrame and const_args-dict that together contain exactly the template parameters, and a set of column names that are unique. Initially, the const_args-dict is empty. The function produces a list of Result-objects, which contain the resulting triples. These have a normalized form consisting of a DataFrame with subject and object columns, together with a constant for the verb in the dict const_args. The is_unique field tells us if the DataFrame rows are unique. After we have run the function, we store the results directly in the triplestore. The storage operation (not shown) makes note of whether the stored DataFrame has only unique values or not, but does not check whether it is actually the case, as this is an expensive operation.
As in expand_row, the base case is when the template is ottr:Triple. The normalize_results-function (implementation not shown) takes a result and produces a list of normalized results. Normalization is done in order to simplify the implementation of the triplestore. We normalize each result into one or more DataFrames with a subject and object column, converting constants to DataFrame columns. In case the verb is a column (not a constant) we convert the result into as many results as there are unique verb IRIs by partitioning the result DataFrame on the verb-column.
In the future, we may allow heterogeneous representations of verbs in the database with constant subjects or objects. However, this causes a lot of complexity in query processing logic which we have chosen to avoid for the time being. After results are normalized they can be stored directly in the triplestore by updating a map according to the verb IRI and object-datatype of the result to be stored.
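Since the implementation of normalize_results is not shown, the following sketch illustrates one way the described normalization could work. It represents a DataFrame as a dict of equally long Python lists; the function name and data layout are assumptions made for this example.

```python
def normalize_result(columns, const_args):
    """Split one result into (verb, {"subject": [...], "object": [...]}) pairs.

    `columns` maps column names to equally long lists and must contain at
    least one column; `const_args` holds the constant arguments. Together
    they cover subject, verb and object.
    """
    n = len(next(iter(columns.values())))  # number of rows

    def col(name):
        # Constants are converted to full columns.
        return columns[name] if name in columns else [const_args[name]] * n

    subjects, objects = col("subject"), col("object")
    if "verb" in const_args:
        return [(const_args["verb"], {"subject": subjects, "object": objects})]
    # The verb is a column: partition the rows by verb IRI.
    parts = {}
    for s, v, o in zip(subjects, columns["verb"], objects):
        part = parts.setdefault(v, {"subject": [], "object": []})
        part["subject"].append(s)
        part["object"].append(o)
    return list(parts.items())


constant_verb = normalize_result(
    {"subject": ["ex:a", "ex:b"]},
    {"verb": "rdf:type", "object": "pizza:Pizza"},
)
column_verb = normalize_result(
    {"subject": ["ex:a", "ex:b", "ex:c"],
     "verb": ["ex:p", "ex:q", "ex:p"],
     "object": [1, 2, 3]},
    {},
)
```

Every normalized result has a single constant verb, so it can be stored under exactly one key of a verb-indexed map.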
If the base condition has not been reached, we create as many DataFrames as there are new templates to instantiate in the body of the template using the prep_pat_args-function, selecting and renaming the appropriate columns. We create new dictionaries of renamed constant arguments and keep track of which of the new columns are unique. The par_map function denotes that this operation is done in parallel and collected in a list. This parallelism comes in addition to the parallelism inherent in Polars operations. Next, we execute the expand_col function (also in parallel) for the prepared tuples of arguments. Finally, we must flatten the resulting list of lists of results into a list of results, which we return.
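The column-based recursion can be sketched in plain Python as follows. This is a simplified reconstruction and not the maplib code: DataFrames are dicts of lists, the parallel par_map is replaced by ordinary list comprehensions, list parameters and normalization are omitted, and the template table and the predicate `pizza:fromCountry` are hypothetical.

```python
TRIPLE = "ottr:Triple"
TRIPLE_PARAMS = ["subject", "verb", "object"]
# Hypothetical template table: name -> (parameter names, body patterns).
TEMPLATES = {
    "ex:Pizza": (["p", "c"], [
        (TRIPLE, ["?p", "rdf:type", "pizza:Pizza"]),
        (TRIPLE, ["?p", "pizza:fromCountry", "?c"]),
    ]),
}


def prep_pat_args(pattern, df, const_args, unique):
    """Select and rename the columns and constants that one pattern needs."""
    callee, args = pattern
    params = TRIPLE_PARAMS if callee == TRIPLE else TEMPLATES[callee][0]
    new_df, new_const, new_unique = {}, {}, set()
    for param, arg in zip(params, args):
        name = arg[1:] if arg.startswith("?") else None
        if name in df:
            new_df[param] = df[name]  # the column is reused, not copied
            if name in unique:
                new_unique.add(param)
        elif name in const_args:
            new_const[param] = const_args[name]
        else:
            new_const[param] = arg    # constant term written in the pattern
    return callee, new_df, new_const, new_unique


def expand_col(template, df, const_args, unique):
    """Return a list of (df, const_args, is_unique) results for `template`."""
    if template == TRIPLE:
        # Rows are unique whenever at least one of their columns is unique.
        return [(df, const_args, bool(unique))]
    prepared = [prep_pat_args(p, df, const_args, unique)
                for p in TEMPLATES[template][1]]       # par_map in the text
    nested = [expand_col(*args) for args in prepared]  # parallel in the text
    return [result for results in nested for result in results]


df = {"p": ["ex:Margherita", "ex:Hawaii"], "c": ["ex:Italy", "ex:Canada"]}
results = expand_col("ex:Pizza", df, {}, {"p"})
```

The number of recursive calls depends on the size of the template, not on the number of rows, and the column objects are passed along by reference rather than copied.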
We are now ready to consider an example execution of the expand_col-method in Listing 12. The df-argument is the DataFrame from Table 1.
Listing 12. First call to expand_col in our example.
Continuing our example, we prepare the argument-lists shown in Listing 13.
Listing 13. List of list of arguments created for the ex:Pizza-template.
We display the DataFrames df_1, df_2 and df_3 in Tables 5, 6 and 7 respectively. These are created in parallel, and data are only copied as necessary. The DataFrame df_3 is the result of "exploding" the is-column of df, which is a built-in function in Polars, duplicating other columns as necessary. This means that the subject-column (renamed from p) no longer can be guaranteed to be unique, leading to an empty unique-set. Finally, we call the expand_col-function in parallel with the arguments above. We reach the base case of the expand_col-method, and the object-column is added to df_1 in the normalization process (normalize_results). The normalization process does not change the other resulting DataFrames. These resulting DataFrames are ready to store and query once we ensure that the DataFrame associated with pizza:hasIngredient is deduplicated.
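The effect of exploding a list column on uniqueness can be seen in a small pure-Python sketch. It imitates the operation on a dict of lists and is not the Polars implementation; the pizza and ingredient names are made up.

```python
def explode(df, list_col):
    """Give each element of `list_col` its own row, repeating other columns.

    Rows whose list is empty are dropped in this sketch.
    """
    out = {name: [] for name in df}
    for i, items in enumerate(df[list_col]):
        for item in items:
            for name in df:
                out[name].append(item if name == list_col else df[name][i])
    return out


df = {
    "p": ["ex:Margherita", "ex:Hawaii"],
    "is": [["ex:Mozzarella", "ex:Tomato"], ["ex:Pineapple"]],
}
df_3 = explode(df, "is")
# The subject column now repeats values, so it can no longer be marked unique.
p_is_unique = len(set(df_3["p"])) == len(df_3["p"])
```

Because `ex:Margherita` now appears twice, uniqueness of the p-column is lost, which is why the unique-set passed on for this pattern is empty.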

D. IMPLEMENTATION DETAILS
We implement the solution in Rust, but create Python bindings where data are exchanged using Apache Arrow Interprocess Communication (IPC), meaning data are not copied [35]. We base our SPARQL engine on an engine used in our previous work with Apache Arrow-based SPARQL processing [16]. It relies on the Polars-library for representing triples and processing queries, and on the spargebra [68] and oxrdf [69] libraries for parsing SPARQL expressions and representing RDF data. The solution is available on GitHub (https://github.com/magbak/maplib) under the Apache 2.0 License. The real column-based expansion procedure keeps track of the data types of the arguments. We store the triples created by mapping as a verb and data type-indexed collection of Polars DataFrames. Associated with each verb-IRI is a map from data types to Polars DataFrames containing the columns subject and object of the predicate, where the object has the given data type. Column-based triple storage has been successfully adopted by previous approaches to analytic / read-only SPARQL querying [70], [71].
Before SPARQL processing can begin, we deduplicate the triplestore. We will however skip deduplication for a verb if it is already deduplicated or the user has (indirectly) indicated that the triples associated with the verb are unique. When processing SPARQL queries, we create a lazy expression corresponding to the SPARQL algebra-expression [22] of the query as provided by the spargebra-library [68]. The semantics of SPARQL-algebra can be mapped mostly directly to Polars operations on DataFrames. We thus rely on optimized parallel execution by Polars for actual processing. We follow the best practice advice of Polars and use a global string cache managed by Polars for strings (e.g. IRIs), performing joins on categorical integer values instead of on the strings themselves. Our SPARQL implementation does not yet support solution mappings where a single variable has multiple data types. Such support adds a lot of complexity to query processing, as we need to keep track of whether we are using the native or the string encoding of a variable. This support will be added in a future release.
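The storage layout and the lazy deduplication described above can be sketched with nested dictionaries. The class below is a toy illustration with hypothetical names, using lists of (subject, object) tuples in place of Polars DataFrames.

```python
class ToyTripleStore:
    """Verb- and datatype-indexed storage with deduplication before querying."""

    def __init__(self):
        # verb IRI -> object datatype -> {"rows": [(s, o), ...], "unique": bool}
        self.index = {}

    def add(self, verb, datatype, rows, is_unique=False):
        """Store rows; `is_unique` is trusted and never verified here."""
        by_type = self.index.setdefault(verb, {})
        if datatype not in by_type:
            by_type[datatype] = {"rows": list(rows), "unique": is_unique}
        else:
            # Appending may introduce duplicates, so the flag is reset.
            by_type[datatype]["rows"].extend(rows)
            by_type[datatype]["unique"] = False

    def deduplicate(self):
        """Deduplicate only the tables not already known to be unique."""
        processed = 0
        for by_type in self.index.values():
            for table in by_type.values():
                if not table["unique"]:
                    table["rows"] = list(dict.fromkeys(table["rows"]))
                    table["unique"] = True
                    processed += 1
        return processed


store = ToyTripleStore()
store.add("rdf:type", "IRI", [("ex:a", "pizza:Pizza")], is_unique=True)
store.add("pizza:hasIngredient", "IRI",
          [("ex:a", "ex:Tomato"), ("ex:a", "ex:Tomato")])
first_pass = store.deduplicate()   # only the ingredient table is processed
second_pass = store.deduplicate()  # nothing left to do
```

Tables flagged as unique are never touched, so the cost of deduplication is paid only where duplicates may actually exist.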

E. COMPARISON WITH EXISTING APPROACHES
The column-oriented graph construction algorithm described above differs from Morph-KGC in important ways. Morph-KGC also employs a column-oriented approach to mapping, applying RML mappings to DataFrames backed by Pandas. However, data is loaded into a SPARQL database by writing to an in-memory ntriples-file, which is then loaded in bulk. 2 Before the ntriples-file is created, data are also deduplicated. We avoid these expensive operations by re-using the target representation for mapping as the underlying format for the SPARQL engine and by allowing the user to specify uniqueness. In many cases, we copy very little to no data during template expansions, and merely change how we relate to it. This is for instance the case for the c-column in Listing 1. When data is copied, the sequential nature of the columnar format likely means copying is very fast. In order to support RML, Morph-KGC must also support joins as part of the mapping, whereas OTTR, which we implement, has no such support for joins. We return to this difference and what it implies in our discussion in Section VI-C.
When SPARQL Anything is used to materialize the knowledge graph, Apache Jena is indirectly used to query the source data set files and update the graph. From the highly abstract and configurable Apache Jena source code it is much harder to tell what this process actually involves, and an expert or detailed study of the Jena code base is likely required to determine that with certainty.

VI. EVALUATION
In this section, we evaluate the degree to which our solution meets R6 of high performance for large datasets. We evaluate this requirement by comparing our solution to two state-of-the-art approaches on a demanding benchmark:
• The Morph-KGC approach of Arenas-Guerrero et al. [9].
• The SPARQL Anything approach of Asprino et al. [17].
In Section VI-A we describe how we set up and ran the benchmark. Our results are presented in Section VI-B and discussed in Section VI-C. Code and instructions for running the benchmark and our raw results are available on GitHub. 3

A. METHODS
We compare performance on the General Transit Feed Specification (GTFS) 4 Madrid benchmark described by Chaves-Fraga et al. [18], which contains a generator of scalable datasets based on openly accessible data from the public transport system in Madrid. The GTFS is a standardized format that public transport organizations can use to publish data, initially created at Google [72]. The GTFS Madrid benchmark contains both an RML mapping from CSV files and 18 SPARQL queries over the mapped knowledge graph. We use the same scaling factors (5, 10, 50, 100, 500) as are used in the benchmark paper [18].
We compare maplib with Morph-KGC [9] and SPARQL Anything [17]. In order to benchmark maplib, we created a stOTTR mapping equivalent to the RML mapping used by Chaves-Fraga et al. [18]. We verify that the mapped models are identical up to small differences in datatypes due to differing defaults in maplib and Morph-KGC (e.g. xsd:long instead of xsd:integer), and crucially, that queries produce the same rows. We eliminated query 10 from consideration, as the generated data and the query do not agree on data types (comparing xsd:string with xsd:duration).
SPARQL Anything has a GTFS mapping in a code repository [73] associated with a recent GTFS benchmark [74], which is specified as a single SPARQL query annotated with a service that refers to the GTFS CSV files. However, running the joint query did not terminate in a reasonable timeframe (30+ minutes and still running). To achieve comparable performance, we split the joint materialization query into a materialization query for each GTFS CSV file. SPARQL Anything exists in a stateless and in a server-based variant. The server-based variant is more appropriate for our setting. Running SPARQL Anything Server, we first run the ten materialization queries and then the GTFS queries using the SPARQLWrapper-library in Python.
Our evaluation consisted of materializing the knowledge graph and executing the GTFS queries in 10 separate executions. We set a timeout of ten minutes for knowledge graph materialization and a timeout of two minutes for query execution. We recorded wall times for materialization and for executing each query, as interactive creation of the mapping often requires (re-)execution of mapping and queries. The evaluation was run on a high-end laptop from 2017 with an Intel i7-7600U CPU @ 2.80GHz, 16 GB of RAM, and a high-performance SSD. We used the 0. Morph-KGC requires the RML-file to be located in the file system, and the parameters to the materialization function to be specified in an .ini-file. Thus, Morph-KGC will read two extra files, but these are very small compared to the GTFS data sets, and we do not expect this difference to be reflected in our results. Similarly, SPARQL Anything reads ten very small files with the SPARQL queries necessary to materialize the GTFS graph.

B. RESULTS
The results from executing the materialization stage are given in Fig. 2, and the mean materialization times are given in Table 8. We denote Morph-KGC with an RDFLib materialization by "M-KGC(R)" and Morph-KGC with an Oxigraph materialization by "M-KGC(O)". SPARQL Anything is denoted by "SA". Following the convention of Chaves-Fraga et al. [18], "TO" in a table cell indicates that the materialization took longer than 10 minutes, and "M" denotes the case where materialization ran out of memory. "X" indicates that a scale was skipped due to a smaller scale having timed out. Morph-KGC also reported a mapping-rule processing time of four seconds, relating to the partitioning approach described in Arenas-Guerrero et al. [9].
Mean query execution times (seconds) for each query, scale, and solution are presented in Table 9. The variability of execution times is small, and we omit the standard deviations from the table for readability. In the table, N indicates that the query is not supported by the implementation in question. This is only the case for maplib and query 15. X indicates that the query was not run due to missing materialization. M indicates that query processing ended in an error of the kind one gets from Rust when it has run out of memory. TO indicates that the query used more than the allotted two minutes to complete.

C. DISCUSSION
The results of our evaluation show a 47×, 60×, and 182× performance improvement in maplib over SPARQL Anything, Morph-KGC with Oxigraph, and Morph-KGC with RDFLib respectively. Moreover, our solution is able to materialize larger data sets than both Morph-KGC and SPARQL Anything, effectively allowing us to work interactively with larger data sets. Maplib materialization time is less than ten seconds even for the largest successfully completed materialization.
Query processing times are on the whole in favor of SPARQL Anything and Morph-KGC (with Oxigraph) over maplib. Oxigraph and SPARQL Anything are able to do considerably better than maplib on many queries in terms of time and memory (Q3, Q5, Q6, Q7, Q8, Q11, Q12, Q17 and Q18). Morph-KGC with RDFLib times out on a lot of queries, but achieves comparable performance to the other query engines on many of the queries where it does not time out. Maplib is however able to do considerably better than Oxigraph and SPARQL Anything on queries Q1, Q2, Q9, Q13, Q14 and Q16. RDFLib is able to do considerably better than Oxigraph and maplib on Q11 and Q13.
In an interactive knowledge graph construction session, the engineer likely re-runs both knowledge graph construction and query evaluation(s) as a whole. The purpose of the query is precisely to inspect the effect of the mapping change. Mapping times and query execution times should therefore be considered in conjunction. Considering materialization times and query times jointly, it is clear that the materialization time differences dominate the query performance differences in all but one query (Q13), leaving maplib with a performance advantage for the joint problem of materialization and querying. Moreover, the materialization process in maplib is much more scalable than those of SPARQL Anything and Morph-KGC.
Both RML and SPARQL Anything specify the source CSV files as part of the mapping definition files (Turtle and SPARQL respectively). stOTTR does not describe input data files, and we give paths to input files in the Python script. To run Morph-KGC and SPARQL Anything on different data sets, we either had to adjust the mapping files or we had to move a different data set into the data directory. We found that this process impeded our productivity. This setup is also likely to complicate development environments where the same mapping is tested with different inputs. To test the RML or SPARQL Anything mapping with different data sets, for instance against a data set that contains an error we want to handle, we must copy the mapping files into an adjacent directory or change the source mapping-files to point to different CSV source files.
While the partitioning approach of Arenas-Guerrero et al. [9] is impressive, it comes at a performance cost (about 4 seconds in the benchmark, independent of scale). In a one-way data engineering setting that materializes triples, it can be avoided altogether. In the GTFS-Madrid benchmark for instance, the STOPS.csv-file contains the stop_id-column. The STOP_TIMES.csv also contains this column, with a foreign key relationship to stop_id. A static prefix is used by the RML-mapping to construct the IRI for a stop based on data in STOPS.csv, and stop_id in STOP_TIMES.csv is used to reference this constructed IRI. If we can guarantee the foreign key constraint between these files, and have sufficient information to construct the IRI in both of them, the IRIs can be computed independently for both files without the need for a join. This is the approach we have chosen in our benchmark implementation for maplib and SPARQL Anything. IRI construction is string concatenation, which scales linearly and is easily parallelizable. Joins, however, do not scale linearly. A bad query execution plan for these joins may have been the cause of the long-running materialization of SPARQL Anything that we had to break up (cf. Section VI-A).
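The join-free strategy can be illustrated with a few rows of made-up data. The prefix, the predicates and the row values below are hypothetical and do not come from the benchmark mapping; only the column name stop_id is taken from the text.

```python
STOP_PREFIX = "http://example.org/stops/"  # hypothetical static prefix
TRIP_PREFIX = "http://example.org/trips/"  # hypothetical static prefix


def stop_iri(stop_id):
    """IRI construction is plain string concatenation."""
    return STOP_PREFIX + stop_id


stops = [
    {"stop_id": "S1", "stop_name": "Sol"},
    {"stop_id": "S2", "stop_name": "Atocha"},
]
stop_times = [
    {"trip_id": "T1", "stop_id": "S2"},
    {"trip_id": "T1", "stop_id": "S1"},
]

# Each file is mapped on its own; no join between the two is needed.
stop_triples = [(stop_iri(r["stop_id"]), "ex:name", r["stop_name"])
                for r in stops]
stop_time_triples = [(TRIP_PREFIX + r["trip_id"], "ex:stopsAt",
                      stop_iri(r["stop_id"]))
                     for r in stop_times]

# With the foreign key guaranteed, every reference hits a described stop.
described = {s for s, _, _ in stop_triples}
dangling = [o for _, _, o in stop_time_triples if o not in described]
```

Both list comprehensions touch each row exactly once and are independent of each other, which is what makes this step linear and easy to parallelize.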
If on the other hand, the table containing the foreign key (STOP_TIMES.csv) does not have sufficient information to construct the IRI of the referenced table (STOPS.csv), the join is mandatory. In this case, we can join in the necessary data before we start mapping, and construct the mapping such that it contains no join. This strategy comes at the expense of having to create the IRI twice, but simplifies and improves the parallelism in the mapping process. Future research into RML parallelization should compare the above strategy of getting rid of dependencies early with the approach of Arenas-Guerrero et al. [9].
With an interactive mapping process in Jupyter, users should be able to compute this amended data set as part of their general data engineering workflow. If the join is expensive, they can place it at the very start of the workflow or store the amended dataset and start the workflow by importing it, effectively eliminating the need to recompute the join in further mapping iterations.
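When the referencing table lacks the information needed for the IRI, the join can be done once before mapping, as in the sketch below. The scenario is hypothetical: we assume the stop IRI is built from a stop_code column that only the referenced table contains, and the prefix is made up.

```python
# Hypothetical case: the stop IRI is built from stop_code, which only
# the referenced table (stops) contains.
stops = [
    {"stop_id": "S1", "stop_code": "A10"},
    {"stop_id": "S2", "stop_code": "B20"},
]
stop_times = [
    {"trip_id": "T1", "stop_id": "S2"},
    {"trip_id": "T2", "stop_id": "S1"},
]

# Join once, up front: look up the missing column through the foreign key.
code_by_id = {r["stop_id"]: r["stop_code"] for r in stops}
amended = [{**r, "stop_code": code_by_id[r["stop_id"]]} for r in stop_times]

# The amended table now carries everything a join-free mapping needs.
iris = ["http://example.org/stops/" + r["stop_code"] for r in amended]
```

The `amended` table is the dataset that can be stored and re-imported in later mapping iterations, so the join is not recomputed each time the mapping changes.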
The choice of Polars over Pandas likely plays a role in the materialization performance difference between Morph-KGC and maplib, as this library is known to be faster [15]. To further understand the difference, we added time measurements to the process Morph-KGC uses to create the ntriples-string and load this string into Oxigraph (cf. Section V-E). In a single sample run for scale 10 that took 48.91 seconds for Oxigraph, 0.86 seconds were spent creating the ntriples-string, and a further 22.75 seconds were spent bulk-loading the ntriples-string into Oxigraph. The extra work done by Morph-KGC described above on e.g. joins, as well as the collection of all produced triples in a set described in Section V-E, likely also plays a role. However, there is also a role to be played by Polars. Comparing CSV-load times, Polars takes approximately 0.06 seconds to load the CSVs at scale 10, and Pandas takes approximately 0.3 seconds. Similar differences likely exist for operations that build the IRIs in the mapping and may well add up to several seconds at scale 10. However, it appears likely that the fact that maplib does less work is responsible for the majority of the difference between maplib and Morph-KGC in materialization times, and not that the work done by both solutions takes less time in maplib.
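The proportions implied by the sample run can be made explicit with a short calculation on the figures quoted above (the symbol names are ours):

```latex
\begin{align*}
t_{\text{serialize}} + t_{\text{load}} &= 0.86\,\text{s} + 22.75\,\text{s} = 23.61\,\text{s},\\
\frac{t_{\text{serialize}} + t_{\text{load}}}{t_{\text{total}}}
  &= \frac{23.61\,\text{s}}{48.91\,\text{s}} \approx 0.48,\\
\frac{t_{\text{CSV, Pandas}}}{t_{\text{CSV, Polars}}}
  &\approx \frac{0.3\,\text{s}}{0.06\,\text{s}} = 5 .
\end{align*}
```

Roughly half of the sample run is thus spent on serializing and bulk-loading, work that maplib does not perform at all, whereas the fivefold CSV-loading difference amounts to only about a quarter of a second in absolute terms.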

VII. CONCLUSION AND FUTURE WORK
In this work we have described maplib. Maplib allows users to perform interactive, literal mapping and enrichment of knowledge graphs through SPARQL, allowing us to meet our requirements of supporting interactive mapping in a script-based language with SPARQL based inspection and post-processing (R1, R3 and R5). By using OTTR instead of RML or Façade-X as the mapping language, we allow greater interactivity and greatly simplify the integration with DataFrames compared to the state-of-the-art Morph-KGC and SPARQL Anything libraries, allowing us to meet R2. Using OTTR also allows for improved and less complicated parallelism in the mapping process compared to Morph-KGC and SPARQL Anything. Compared to RML, the cost of this choice is that our mappings cannot be used to rewrite SPARQL queries into SQL queries over a source database. Additionally, OTTR has excellent template support, allowing us to support R4 for industrial use cases.
Using the Polars-library, we are able to use the same representation to store mapped triples and query them, with minimal data transformation and copying. We show that our solution is able to outperform the state-of-the-art Morph-KGC and SPARQL Anything libraries by factors of 47× to 182× in the materialization step, and also in terms of scalability. Query execution is less efficient with maplib, but these differences are dominated by the relative difference in materialization time in all but one query, allowing maplib to exhibit much shorter feedback loops in exploratory knowledge graph construction than Morph-KGC and SPARQL Anything. These performance increases allow us to meet R6 of high performance for large datasets. The performance improvements over Morph-KGC and SPARQL Anything are not the results of any clever algorithms created by the author, but a result of simplifying the mapping process and providing greater interoperability with Apache Arrow-based DataFrames, meaning we have to do less data transformation work and have greater access to high performance tooling such as Polars.
Our experience indicates that RML and Façade-X-based mapping is currently ill-suited for integration in interactive, literal Python-based workflows commonly used by Data Engineers. Using it in this way incurs toil on the part of the user, limits the interactivity and literal feedback from the mapping process and incurs a performance penalty. Additionally, the design of RML and Façade-X makes it harder to make the workflow production ready once the exploration stage is completed. The OTTR approach is better suited for this purpose by virtue of the functional, API-based design.
Maplib is currently limited to a single machine, but we are looking into a distributed, scalable implementation. The same API can be implemented in a distributed setting, reusing parts of the implementation. The mapping step is embarrassingly parallel and can be scaled easily. We have shown the benefit of Apache Arrow-based integrations over an approach that requires data transformation. This benefit can be extended to a distributed setting as Polars DataFrames produced by maplib can be immediately exported to cloud object storage as Parquet in a folder structure partitioned by predicate IRI and data type. Exporting from Arrow to Parquet is very fast. This ensures distributed support for the expand-method. Next, a data lakehouse such as Dremio can be configured to query the data set using SQL. With this in place, a mapping for a Virtual Knowledge Graph (e.g. Ontop) can be automatically constructed, so that the dataset can be queried using SPARQL, and support for the query and insert methods can thus be added.
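A folder structure partitioned by predicate IRI and data type could, for instance, be derived as in the sketch below. The `verb=`/`datatype=` naming, the percent-encoding of IRIs and the file name are assumptions made for illustration and do not describe the layout written by maplib.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def partition_path(root, verb_iri, datatype_iri):
    """Build a partition folder path for one verb/datatype table."""
    # IRIs contain characters such as "/" and ":", so they are
    # percent-encoded before being used as folder names.
    verb = quote(verb_iri, safe="")
    datatype = quote(datatype_iri, safe="")
    return (PurePosixPath(root) / f"verb={verb}"
            / f"datatype={datatype}" / "part-0.parquet")


path = partition_path(
    "kg",
    "http://example.org/hasIngredient",
    "http://www.w3.org/2001/XMLSchema#string",
)
```

With such a layout, each verb and data type combination maps to its own folder, so a query engine only needs to read the partitions for the predicates a query mentions.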
While the support for interactive SPARQL queries means we are able to perform validations of mapping output, the SHACL specification defines a set of validation constraints that make such validations easier to define [75]. SHACL constraints are also a natural match for reusable mapping templates in industry, as they help us ensure that new users of the templates use them correctly, and in accordance with an underlying industrial standard [12]. In future work, we therefore plan on adding SHACL support. Having access to highly expressive and performant Polars primitives likely makes it possible to do so in a performant way.
Similarly, support for reasoning could likely benefit industrial applications. Supporting reasoning could e.g. be useful to create auxiliary structures that further harmonize and standardize the model, and thus improve interoperability. We plan on exploring support for reasoning in future work.