Technological foundations of Ontological ecosystems on the 3rd generation Blockchains

In this article we present the technological foundations on which an ecosystem of semantic data objects can be implemented on the latest Blockchain-based systems. As ontologies are the most important citizens among the semantic data objects, the ecosystem is referred to as Ontospace. The foundations can be characterized by their architectural, cryptographic and transactional aspects. The architectural aspect borrows from the latest Layer-2 protocols of the 3rd generation blockchains and from the rules for creating Linked Data systems. The cryptographic aspect represents original work that attempts to resolve the issue of efficient hashing of graph data structures. The transactional aspect is concerned with the consistency of graph replication, the conditions for direct access to graph data from blockchain smart-contracts, and the linkage between sidechains bearing semantic objects and the main network. Large parts of the work were implemented in the context of the Ontochain project, a part of the Next Generation Internet EU Initiative.


I. INTRODUCTION
Blockchain technology [1], [2] has revolutionized the ways in which distributed databases are designed and implemented. Fully decentralized, with cryptographically guaranteed immutability, consistency and persistence of data, Blockchain offers a general conceptual framework for the ultimately secured storage of data. However, the technology was first implemented to facilitate a new kind of financial transaction, and was later developed to enable general programming through programs called smart-contracts, which implement generalized transactions. It was never meant to store large amounts of data, or data whose structure is imposed by an underlying data model, as in relational, key-value or graph databases. If storing data on the Blockchain itself is an inevitable demand of a given project, the developers can do so by serializing the data elements and embedding them into blocks of the chain. However, such an approach is inefficient: it does not facilitate efficient querying of data and, for many popular Blockchains associated with traded cryptocurrencies, it is also costly. The typical solution is to store data in conventional centralized databases while keeping the digests (hashes) of the data elements on the Blockchain. To improve the storage model, distributed P2P data sharing networks have been introduced in place of centralized databases, offering better data persistence.
This fundamental deficiency of standard Blockchains is particularly troublesome when the data to be stored is used to build knowledge representation systems, like Knowledge Graphs [3] or Ontologies [4]. What these systems need most is trust, the notion of shared truth, and efficiency of knowledge discovery, extraction and consumption. When it comes to trust, awareness of the need for mechanisms that make it a fundamental feature of knowledge representation has been present since the inception of one particular branch of the domain, namely the Semantic Web [5]. The original Semantic Web Layer Cake depicted the trust layer on top of the stack. However, there is still no unified and holistic trust management model universally adopted, even by the Semantic Web community [6].
On the other hand, trust in data consistency and immutability is the most important feature of Blockchains. Even in the trustless environment of the permissionless Blockchains, there are no reasons not to trust the data. The entire ecosystem of cryptocurrencies and decentralized financial products has been built on the notion of Blockchain trust.
The opportunity of using the Blockchain trust model to bring higher security and confidence in data motivated creators of solutions like: BigchainDB [7], ProvenDB [8], FlureeDB [9], Exonum [10] or ChainSQL [11]. The authors of this paper have created a working model of a system that directly addresses the challenge and brings Blockchain mechanisms to the RDF graph database [12]. This model is called GraphChain [13], and its first application was for enhancing trust of the digital identity system for legal entities [14].
Recently, the European Commission Next Generation Internet initiative has launched a project called Ontochain. The project aims at the creation of "Blockchain-based knowledge management solutions that address the challenge of secure and transparent knowledge management as well as service interoperability on the Internet". The authors of this paper have proposed the construction of a framework for onchain data management for Ontochain, based on the concept of the aforementioned GraphChain solution.
The framework, depicted in Figure 1, has been designed as an ecosystem of blockchains compliant with the Blockchain Layer-2 protocol, in which multiple sidechains coexist to provide services to different application domains. The Blockchains of the ecosystem can be closely coupled with graph databases, so that the Blockchain state of the world includes the graph database state of the world. If the graph database is an RDF store, then the RDF named graphs correspond to the blocks of the Blockchain and effectively form a chain of distinguishable structures synchronized with the chain of blocks. The building blocks of the ecosystem called Ontospace (representing the entirety of Blockchains and semantic data pools) are:
• OntoSidechain - a single Blockchain of the Layer-2 protocol sidechain type,
• Ontonode - a single node of an OntoSidechain,
• Ontopod - a part of the Ontonode responsible for handling the semantic data chains of named RDF graphs, implemented with the use of RDF graph databases (triplestores),
• Ontoshell - software modules for external communication with the Ontonode (API & Linked Data via HTTP),
• OntoHub - a special OntoSidechain designed to store the most important top ontologies for the domain.
Designing and implementing such a system required meeting many challenges resulting from the architectural, cryptographic, and transactional aspects of the system. The architectural aspect concerned the implementation details of the Layer-2 protocol, which was assumed to be the best foundation for the ecosystem construction, and the design of components that allow for interaction with the system by standard web-based methods. We address this aspect in Section II. The cryptographic aspect concerned specific algorithms for calculating hash functions for graph structures and is addressed in Section III. Section IV elaborates on the mechanisms for transactionally correct, synchronized updates of the Blockchain and graph database states.
Section V is devoted to a discussion of the related work. The paper is concluded with a summary of the results obtained and with the plans for the further development.

Contributions
The key contributions of our article can be summarized as follows:
1) Development of the architecture of a Blockchain-based ecosystem for Ontologies and semantic data objects that preserves both Linked Data and Blockchain standards.
2) Exploration of the Layer-2 Blockchain protocols for the creation of trusted Knowledge Representation systems using multi-chain architectures.
3) Development of an innovative method for integrity proofs (hashing) of named RDF graphs, which includes the concept of vicious circle free RDF graphs.
4) A proposal for a modification of the Blockchain (Ethereum) client which improves access to a database synchronised with the Blockchain.

II. ARCHITECTURAL ASPECTS

A. GRAPHCHAIN - THE FOUNDATION OF THE ONTOSPACE ECOSYSTEM
The definition and the first implementation of GraphChain, which forms the foundation on which Ontospace has been built, were first proposed in 2018 [13]. GraphChain is a Blockchain solution where the fundamental data model is a collection of linked named RDF Graphs. GraphChain implements mechanisms typical of Blockchains, such as data hashing, linking of named graphs into chains, replication of named graphs and achieving consensus on their content. The GraphChain network maintains a collection of RDF graphs in the databases of the network nodes, forming a distributed system of chained RDF named graphs [15].
The fundamental advantage of such an approach for the users is that they can work with the chained named graphs using standard tools developed in the domain of semantic web technology, like SPARQL for querying [16], Linked Data mechanisms for accessing the nodes of the graphs [17], reasoners for ontologies [18] and many others, while benefiting from Blockchain mechanisms in their capacity to guarantee trust in the data. The first implementation of GraphChain used the 1st generation Blockchain framework of Hyperledger Indy [13]. While it was beneficial for the first application (the digital identity system for legal entities implemented for the LEI.INFO portal 4 [14]), it was not rich enough for more sophisticated applications that required the capacity of a smart-contract based transaction. The work reported here implemented GraphChain using Ethereum-based Layer-2 protocols, belonging to the family of the 3rd generation Blockchains.

B. 3RD GENERATION BLOCKCHAINS AND THE LAYER-2 PROTOCOL
When designing the architecture for Ontospace, we were motivated by the design principles of Layer-2 blockchain protocols and by principles of distributed databases. The Layer-2 blockchain protocols are overlay networks built on top of standard (Layer-1) blockchains, characterized by independent processing of transactions. From the perspective of the standard, main blockchain, these transactions are authenticated off-chain transactions, and the use of the main blockchain is reduced to the resolution of disputes. This allows for much faster transaction processing and a reduction of transaction cost. From the architectural perspective, they can be created in an orthogonal way, as chains that operate independently and in parallel to the main blockchain network. The growth and popularity of various kinds of Layer-2 protocols (e.g. State Channels [19], Rollups [20], Plasma [21], ZK-STARKs [22], Commit Chains [23] and sidechains [24]) enabled the creation of innovative financial services based on blockchains, known as DeFi (Decentralized Finance) [25].
4 https://lei.info/
However, while the main motivation behind most existing Layer-2 protocol designs is related to the scalability issues of cryptoasset applications, what motivated the authors of this paper was the need to design a system that combines blockchain design with the design of a distributed knowledge representation system based on RDF semantic data objects. The design of such a system assumes a specific modification of the blockchain client to allow for non-standard access to the RDF graph data. While such a modification would not be recommended on the main network (Layer-1), it is possible on a Layer-2 sidechain, assuming that it allows the parent chain smart-contract to verify operations of the sidechains. Reasoning this way, we concluded that applying some basic principles of Layer-2 sidechain protocols to the design of blockchain-based knowledge representation systems allows for the creation of an ecosystem whose main purpose is to enable trusted representation for solutions like Knowledge Graphs, Ontologies and semantic data sets, based on the RDF representation.
VOLUME 4, 2016
The top-level architecture of the Ontospace ecosystem has already been introduced in the previous section of the paper (see Figure 1), and was described in detail in the previous paper [12]. However, for completeness of the explanation, let us review the ecosystem architecture. The ecosystem contains multiple sidechains, called OntoSidechains. Each sidechain of the ecosystem is a full-fledged blockchain network composed of Ontonodes, illustrated in Figure 2.
An Ontonode is an integrated software component composed of a Blockchain node, an RDF Graph Database (Ontopod), synchronization middleware, and a set of modules responsible for external communication with the node (Ontoshell), which implements methods for accessing the data in the RDF Graph Database. Figure 3 presents another depiction of the ecosystem, based on elements typical of Layer-2 protocol solutions. There is one particularly distinguished chain in the Ontospace: Ontohub. From the design perspective, it is identical to the other OntoSidechains, but its application is different. Designed as the primary distributed resource for the knowledge of the ecosystem, it serves as the repository of its most important (e.g. top-level) ontologies.
In accordance with the rules of Layer-2 protocols, the sidechains of Ontospace, the OntoSidechains, are tethered (pegged) to the trusted parent network using Merkle Trees built from hashes of Ethereum transactions which include the hashes of the RDF named graphs. As this part of our work falls into the transactional aspect of the system, it is described in Section IV. Section III is concerned with the cryptographic method for RDF graph hashing.
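To make the idea of pegging through a Merkle Tree more concrete, the following is a minimal sketch of how a single root can commit to a batch of transactions that carry named graph hashes. It is an illustration under stated assumptions, not the Ontospace implementation: the helper names (`sha256`, `merkle_root`), the use of SHA-256 instead of Ethereum's Keccak-256, the duplication of the last node on odd levels, and the transaction payloads are all hypothetical.

```python
import hashlib
from typing import List


def sha256(data: bytes) -> bytes:
    """Return the raw SHA-256 digest (assumed here; Ethereum itself uses Keccak-256)."""
    return hashlib.sha256(data).digest()


def merkle_root(leaves: List[bytes]) -> bytes:
    """Compute a binary Merkle root over the given leaf payloads."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: an odd level is padded by repeating its last node.
            level = level + [level[-1]]
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


# Hypothetical transaction payloads, each embedding a named graph hash.
transactions = [b"tx1:graph-hash-a", b"tx2:graph-hash-b", b"tx3:graph-hash-c"]
root = merkle_root(transactions)
```

Only the root would need to be recorded on the parent network; changing any transaction, and therefore any embedded graph hash, changes the root.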

C. ARCHITECTURE OF THE ACCESS TO THE RDF GRAPHS
The design of the OntoSidechain assumes that each node of the network (Ontonode) contains an RDF graph database with identical content, i.e. the same chained sequence of named graphs. However, when building a web interface which allows for interaction with the data according to the Linked Data principles, it is important to provide a unique endpoint for the HTTP-based access methods. This requirement was behind the access architecture presented in Figure 4.
The access architecture assumes the presence of three layers:
1) The layer of distributed RDF graph databases replicated at every Ontonode of the OntoSidechain.
2) The load balancer, which distributes requests over the nodes of the OntoSidechain.
3) The web interface, which implements the SPARQL endpoint and the REST API and provides the user interface.
In the PoC phase of the aforementioned OntoChain project, the RDF databases were implemented using Blazegraph [18], the Nginx web server [26] was selected as the load balancer, and the web interface was implemented using the Java Tomcat web server [27].
One of the most important functionalities of the OntoSidechains is the ability to store ontologies. In a typical scenario, the ontologies are processed in the OntoHub chain. For example, when a user plans to store an ontology, the process illustrated in Figure 6 is initiated. It begins with the upload of the graph that represents the ontology. First, the graph, in a serialization of choice (Ontohub supports most of the RDF serializations), is posted to the Ontohub web interface and is then redirected by the load balancer to a selected Ontonode. Next, the service inserts the graph into the graph database (Ontopod) using a standard SPARQL request or a dedicated connector. When the insert succeeds, the synchronization middleware detects the change and sends a new transaction, containing the hash calculated from the new graph, to the blockchain node. After a successful transaction, the web service is informed about that fact and can return a "success" response.

D. METAGRAPH
Because of the data immutability requirements, graphs stored in the system cannot be changed or deleted. Thus, when designing the graph database data management and the Ontoshell functionalities, a unique approach was required. Every time a graph is uploaded, the Ontonode service has to check whether the graph already exists and, if so, change the graph's name and store it without modifying previous versions. To keep track of the graph changes, a structure called the Metagraph was proposed. The Metagraph is a special named RDF graph that bears the name of the original graph uploaded by the user (further named real_graph_uri). Its content is a tree of graph version names with update dates, for easy access to the graph history, and a pointer to the latest graph in the sequence. The structure of the Metagraph is depicted in Figure 5.
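The bookkeeping role of the Metagraph can be illustrated with a small in-memory model. This is only a sketch under simplifying assumptions: the real Metagraph is an RDF named graph with a tree of versions, whereas here it is modelled as an append-only list, and the class name, field names (other than real_graph_uri) and the example identifiers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Metagraph:
    """Simplified model: version names with update times and a latest pointer."""
    real_graph_uri: str
    versions: List[Tuple[str, int]] = field(default_factory=list)  # (name, Unix ms)
    latest: Optional[str] = None

    def add_version(self, version_name: str, update_time_ms: int) -> None:
        """Register a new version; existing versions are never modified."""
        if any(name == version_name for name, _ in self.versions):
            raise ValueError("version already recorded; stored graphs are immutable")
        self.versions.append((version_name, update_time_ms))
        self.latest = version_name


# Hypothetical identifiers, used for illustration only.
meta = Metagraph(real_graph_uri="http://example.org/ontology")
meta.add_version("urn:example:ontology-v1", 1_600_000_000_000)
meta.add_version("urn:example:ontology-v2", 1_600_000_500_000)
```

After the second upload, the first version is still present in the history, and only the pointer to the latest version has moved.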

E. MODIFIED BLOCKCHAIN CLIENT
The key feature of the Ontonode is the tight coupling between the blockchain client and Ontopod, the graph database engine. In the case of the Ethereum client, the coupling makes the graph data available almost as on-chain data, despite the fact that the Ontopod object is external to the blockchain client. Typically in such a situation, third-party services, known as Oracles, connect smart-contracts with the external data [28]. In the design presented here, smart-contracts can interact with graph data directly, without the mediation of an Oracle. This was achieved by the creation of new EVM instructions (opcodes). In the PoC implementation with Hyperledger Besu, it was straightforward to create such new opcodes. At a minimum, a single read operation is needed: a function accepting the ID of the named graph to be fetched and returning the fetched graph. During the operation, a connection with the graph database is used to fetch the data. After the data is fetched, its integrity and authenticity are also verified. In the simplest case, it is not even necessary to modify the Solidity compiler to include the new opcodes. It is possible to embed the new operation in a singleton smart-contract that acts as a proxy, accessed through its address by all other smart-contracts requiring graph database access. In a more sophisticated implementation, a modified Solidity compiler can translate an added language construct to the EVM opcode. We will discuss the transactional aspects of such a modification in Section IV of this paper.

III. CRYPTOGRAPHIC ASPECTS
An efficient mechanism for the RDF graphs integrity proofs, realized by cryptographic hashing is an essential part of the process of building a chained sequence of subgraphs (RDF named graphs [29]) that follows the creation of Blockchain blocks in OntoSideChains.

A. PRELIMINARIES
The graph data models are essentially different from the block data models used by most Blockchain implementations, and what is crucial in their handling is the cryptographically safe and repeatable computation of RDF hashes (digests). However, this seemingly simple task is far from trivial [30], due to the following circumstances:
• The presence of "blank nodes", i.e. identifiers which are implementation dependent, so they may change while transferring the same graph between different RDF databases, or even between different accesses to the same database.
• No predefined order of the RDF graph building blocks, the triples. They form an unordered set, so serializations of the same graph may differ between accesses.
• When an RDF graph is serialized by different software routines, it can be encoded differently. As hashing is sensitive to encoding, it is subject to encoding-related problems.
There is also a more fundamental problem. The obvious goal of RDF graph hashing is to ensure that any two graphs that have the same hash are isomorphic. On the other hand, as no polynomial-time algorithm is known for the graph isomorphism problem within the class of all possible RDF graphs, it seems infeasible to design a universal hashing algorithm that would be computationally efficient.
For that reason, when designing the Ontospace ecosystem, we have defined a subclass of RDF graphs for which we can construct a reliable, fast RDF hashing algorithm. We call this subclass "a vicious circle free class". The possibility that an RDF graph contains vicious circles is linked to the mechanism of blank nodes, i.e. to triples whose subject or object is expressed not as a global IRI but as a local identifier. Blank nodes represent unknown information which can stand for something in the real world. Searching an RDF graph for information can be compared to searching an encyclopedia, where unknown terms are explained by other terms, some of them unknown as well. Those unknown terms are analogous to blank nodes. While looking them up further in the encyclopedia, we can find other unknown terms, and so on. It would be a logical error of an encyclopedia if there were a circle in such a sequence of term explanations. Imagine that the encyclopedia entry "vicious circle" refers to "circulus vitiosus", and "circulus vitiosus" refers back to "vicious circle". In this particular case, the process of looking up the entry would itself be its explanation (or rather a joke for those who know its meaning), but in most cases such looping is a logical error known as a vicious circle. This reasoning demands a more formal analysis.

B. VICIOUS CIRCLE FREE GRAPHS
The IRIs used in RDF graphs are global identifiers that refer to any online resource, e.g. the IRI https://dbpedia.org/page/Lemon refers to the lemon fruit in DBpedia [31]. Literals are sets of lexical values. Blank nodes are locally-scoped identifiers for resources that are not otherwise named. Blank nodes cannot be referenced from outside of an RDF document. They can be bijectively relabelled without affecting the interpretation of the document. According to [29], we define I as the set of all IRIs, B as the set of all blank nodes, and L as the set of all literals.
We also assume that t = (s, p, o) is an RDF triple, where s is a subject, p is a predicate, and o is an object. The typical roles of the elements s, p and o of an RDF triple t are as follows:
• the subject s is either an IRI or a blank node that refers to the primary resource described by t;
• the predicate p is an IRI that identifies the relation between the subject and the object;
• the object o is either a literal, an IRI or a blank node that fills the value of the relation.
According to [29], we define an RDF graph as a finite set of RDF triples. RDF graphs are isomorphic if they are the same up to bijective blank-node relabeling; isomorphic RDF graphs therefore have the same content. Because an RDF graph is a set of triples, the triples of an RDF dataset can be written in any order; moreover, blank nodes can be labelled arbitrarily. It is therefore important to have an efficient algorithm for deciding whether two RDF graphs are isomorphic. A formal definition of RDF graph isomorphism is the following.
Definition 1 (Isomorphism of RDF graphs): Let $G$ and $G'$ be two RDF graphs. Let $B$ and $B'$ be the sets of all blank nodes used in $G$ and $G'$, respectively. We say that $G$ and $G'$ are isomorphic if there exists a bijection $\varphi : B \to B'$ such that $(s, p, o) \in G$ if and only if $(\hat{\varphi}(s), p, \hat{\varphi}(o)) \in G'$, where $\hat{\varphi}$ extends $\varphi$ by the identity on IRIs and literals.

C. EMBEDDING GRAPHS AND THE COMPLEXITY
Let $G = (V, A)$ be a finite directed graph, that is, $G$ consists of vertices $v \in V$ and arcs, i.e. ordered pairs $(v, w) \in A$. Two directed graphs $G$ and $H$ are isomorphic if there is a bijection $\psi : G \to H$ between their vertex sets such that $(v, w)$ is an arc of $G$ if and only if $(\psi(v), \psi(w))$ is an arc of $H$. A directed acyclic graph is a directed graph with no directed cycles.
Let $G = (V, A)$ be a finite directed graph. Let $\{v_1, v_2, \dots, v_n\}$ be an enumeration of all vertices of $G$, and fix $n$ blank nodes $b_1, b_2, \dots, b_n$ and one IRI $u$. For any arc $(v_i, v_j) \in A$ we form the triple $(b_i, u, b_j)$. Let us define an RDF graph $\bar{G}$ as the set of all those triples.
The RDF graph $\bar{G}$ looks very artificial. It consists entirely of blank nodes, and such a document contains no information, although its structure includes $G$. It is defined against the spirit of RDF. The definition of an RDF graph is so general that it, unfortunately, includes such pathological examples. It is possible to refine the class of RDF graphs so as to omit such peculiar RDF documents and still have an RDF graph class large enough to cover most interesting databases. Now, we will focus on showing that some kind of refinement is really necessary to build a fast hashing algorithm. The following fact will allow us to show that the isomorphism problem in the class of all RDF graphs is at least as hard as the isomorphism problem for directed graphs.
Theorem 2: Let $G$ and $H$ be two finite directed graphs. Then $G$ and $H$ are isomorphic if and only if $\bar{G}$ and $\bar{H}$ are isomorphic.
Proof 1: Let $\psi : G \to H$ be a graph isomorphism. Then $G$ and $H$ have the same number of vertices, say $n$. Let $v_1, \dots, v_n$ and $w_1, \dots, w_n$ be enumerations of $G$ and $H$, respectively, such that $w_i = \psi(v_i)$ for $i \le n$. Further, let $b_1, \dots, b_n$ and $b'_1, \dots, b'_n$ be the blank nodes used to define $\bar{G}$ and $\bar{H}$, respectively. Note that $(b_i, u, b_j) \in \bar{G}$ if and only if $(v_i, v_j)$ is an arc of $G$, if and only if $(w_i, w_j)$ is an arc of $H$, if and only if $(b'_i, u, b'_j) \in \bar{H}$. Hence the bijection $b_i \mapsto b'_i$ shows that $\bar{G}$ and $\bar{H}$ are isomorphic RDF graphs. Now, suppose that $\bar{G}$ and $\bar{H}$ are isomorphic. It means that there is a bijection $\varphi$ between $B = \{b_1, \dots, b_n\}$ and $B' = \{b'_1, \dots, b'_n\}$ such that $(b_i, u, b_j) \in \bar{G}$ if and only if $(\varphi(b_i), u, \varphi(b_j)) \in \bar{H}$. We define $\psi : G \to H$ by the formula $\psi(v_i) = w_j$ whenever $\varphi(b_i) = b'_j$. Then $(v_i, v_k)$ is an arc of $G$ if and only if $(\psi(v_i), \psi(v_k))$ is an arc of $H$. Therefore $G$ and $H$ are isomorphic.
No polynomial-time algorithm is known for the isomorphism problem of directed graphs: the problem belongs to NP, but it is neither known to be solvable in polynomial time nor known to be NP-complete. Theorem 2, along with the fact that there exist trivial polynomial-time conversions between a graph $G$ and the corresponding RDF graph $\bar{G}$, implies that the RDF graph isomorphism problem is at least as hard as the isomorphism problem for directed graphs. The embedding $G \to \bar{G}$ of directed graphs into the class of RDF graphs that we have defined has the useful property stated in Theorem 2. Let us generalize this property in the following definition. We say that an embedding $G \to e(G)$ of finite directed graphs into RDF graphs is proper whenever $G$ and $H$ are isomorphic finite directed graphs $\iff$ $e(G)$ and $e(H)$ are isomorphic RDF graphs.
Let $C$ be a subclass of all RDF graphs. If the isomorphism problem for $C$ can be solved in polynomial time, then, unless the isomorphism problem for directed graphs can be as well, there is no polynomial-time computable proper embedding of finite directed graphs into $C$.

D. VICIOUS CIRCLE FREE RDF GRAPH
Let us define a type of RDF bases in which no sequence of blank node references forms a circle or a loop. We will call them vicious circle free RDF bases (or graphs). Let $G$ be an RDF graph. Let $B = \{b_1, \dots, b_n\}$ be the set of all blank nodes in $G$. By $B(G)$ we denote a directed graph whose set of vertices $V_B(G)$ consists of all blank nodes of $G$ and whose set of arcs is given by $A_B(G) = \{(b_i, b_j) : (b_i, p, b_j) \in G \text{ for some predicate } p\}$. We will call $B(G) := (V_B(G), A_B(G))$ the blank node subgraph of $G$.
Definition 3 (Vicious circle free RDF graph): We say that an RDF graph $G$ is vicious circle free if its blank node subgraph $B(G)$ is acyclic.

E. DETECTING RDFS WITH ACYCLIC B(G)-GRAPHS
There are many known algorithms for cycle detection in directed graphs. Many of those are, in fact, adjusted versions of methods used for topological sorting of the graph (see [32, Section 22.4] for details). Here, we are going to present one such method, based on Kahn's algorithm, which runs in O(V + E) time [32] (see [33] for the original paper).
Algorithm 1 presents pseudocode, which happens to be completely valid Python code, that performs cycle detection in the B(G) graph. For this method to be used properly, the BG_Graph has to be built beforehand, while the RDF graph is read and the structure of its blank nodes is tracked.
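A minimal sketch of such a Kahn-style check is given below. It is written under stated assumptions and is not a reproduction of Algorithm 1: triples are modelled as tuples of strings, blank nodes are recognised by the `_:` prefix, an arc of B(G) is assumed to exist whenever a triple has a blank subject and a blank object, and the function names are hypothetical.

```python
from collections import deque
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def is_blank(term: str) -> bool:
    """Assumption: blank nodes are written with the N-Triples '_:' prefix."""
    return term.startswith("_:")


def blank_node_subgraph(triples: Iterable[Triple]) -> Tuple[Set[str], Set[Tuple[str, str]]]:
    """Collect the vertices and arcs of B(G) from the triples of G."""
    nodes, arcs = set(), set()
    for s, _p, o in triples:
        if is_blank(s):
            nodes.add(s)
        if is_blank(o):
            nodes.add(o)
        if is_blank(s) and is_blank(o):
            arcs.add((s, o))
    return nodes, arcs


def is_vicious_circle_free(triples: Iterable[Triple]) -> bool:
    """Kahn's algorithm: B(G) is acyclic iff every vertex can be removed."""
    nodes, arcs = blank_node_subgraph(triples)
    successors = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for a, b in arcs:
        successors[a].append(b)
        indegree[b] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        n = queue.popleft()
        removed += 1
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    return removed == len(nodes)
```

Each blank node and each arc is visited a constant number of times, which matches the O(V + E) bound quoted above.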

F. INTERWOVEN HASH
Having defined a subclass of RDF graphs which do not have vicious circles, we can safely proceed with a proposal of a hashing algorithm suitable for this class of RDF graphs.
We were inspired by the approach to the calculation of the RDF graph hash (digest) first presented in [13], in which the digest of a graph is obtained by summing the hashes of its triples. We named this approach the "DotHash" algorithm. Its summation operation is associative and commutative, and it allows for an incremental algorithm in which the hash of a graph created by the addition of new triples is computed by combining the hash of the original graph with the sum of the hashes of the added triples. An optimal approach (i.e. one exhibiting a good compromise between performance and security) is based on the "AdHash" algorithm [34] and defines the specific summation operation as a modulo operation with a suitably large value of the divisor. Incrementality of the calculations is of high importance for our application, as it allows for a very efficient implementation in situations where new triples may be added to existing graphs. In addition to that, the method allows for highly efficient optimization. However, the method requires a very specific and non-generic approach when the graph contains blank nodes. In the original approach, its authors proposed a method where the blank nodes are labelled using statements like:

[ _bNode hasLabel L ]
Such labels are then used to rename the blank nodes during hash verification calculations. Of course, then, the calculation of hash will result in the same value. While this approach is practical, we did not find it generic enough and proposed the modified DotHash approach.
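The incremental property of a summation-based digest can be illustrated with a short sketch. It assumes SHA-256 as the per-triple hash and 2^256 as the divisor; both choices, as well as the function names, are assumptions made for illustration, and blank nodes are deliberately left out.

```python
import hashlib
from typing import Iterable

MODULUS = 2 ** 256  # assumed "suitably large" divisor


def triple_digest(ntriples_line: str) -> int:
    """Hash one serialized triple and interpret the digest as an integer."""
    return int.from_bytes(hashlib.sha256(ntriples_line.encode("utf-8")).digest(), "big")


def graph_digest(triples: Iterable[str]) -> int:
    """Order-independent digest: the sum of triple digests modulo MODULUS."""
    return sum(triple_digest(t) for t in triples) % MODULUS


def extend_digest(current: int, added: Iterable[str]) -> int:
    """Update a digest after new triples are added, without rehashing the graph."""
    return (current + graph_digest(added)) % MODULUS
```

Because modular addition is associative and commutative, `extend_digest(graph_digest(old), new)` equals `graph_digest` of the combined triples, regardless of the order in which the triples are listed.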
We named our approach "Interwoven DotHash". The fundamental feature of the "Interwoven DotHash" method is its ability to compute the graph hash without prior canonicalization of the entire graph or the use of additional triples with ad hoc labels. We will also use the acronym "iHash" for the method.
The method we propose ignores the actual labels of the blank nodes when computing hashes for vicious circle free graphs, while preserving the essential effect of the blank nodes: their ability to do what can be described as "weaving" multiple triples together.
We assume that the hash of a named graph is computed with a combining operation that is associative and commutative and that supports incremental hashing of the graph.
As for DotHash, the combining operation can be implemented as the sum of the hashes of the triples modulo a sufficiently large divisor. The hash of a triple is computed from its serialized form. Because the blank nodes are anonymous, their actual form and names do not matter. What is essential is the structure of the weaving of the triples together in the vicious circle free graphs.
Using such an approach, we propose the following algorithm for the calculation of the hash:
(1) If the triple does not contain a blank node, we compute its hash by applying the SHA-256 algorithm [35] to its N-Triples serialized format [36].
(2) If the triple contains a blank node as its Subject, we compute its hash as the sum of the SHA-256 results for the N-Triples serialized Predicate and Object and the SHA-256 results for the non-blank nodes of all those triples where the blank node appears in the Object nodes.
(3) If the triple contains a blank node as its Object, we compute its hash as the sum of the SHA-256 results for the N-Triples serialized Subject and Predicate and the SHA-256 results for the non-blank nodes of all those triples where the blank node appears in the Subject nodes.
(4) If the triple contains blank nodes in both the Subject and Object nodes, use the above rules twice, once for the Subject and then for the Object.
The iHash pseudo-code is presented in Algorithm 2 and Algorithm 3. For practical application, we have implemented Interwoven Hash in Java, C#, Python, Javascript and Solidity.
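The four rules can be sketched in a few lines of Python. The sketch below is one possible reading of the rules and should not be taken as Algorithm 2 or Algorithm 3: it assumes that triples are tuples of already serialized terms, that blank nodes carry the `_:` prefix, that the "non-blank nodes" of a triple are hashed as one space-joined string, and that the divisor is 2^256. All function names are hypothetical.

```python
import hashlib
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]
MODULUS = 2 ** 256  # assumed divisor of the combining operation


def is_blank(term: str) -> bool:
    return term.startswith("_:")


def sha256_int(text: str) -> int:
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest(), "big")


def non_blank_part(triple: Triple) -> str:
    """Serialize only the non-blank terms, so blank node labels never matter."""
    return " ".join(term for term in triple if not is_blank(term))


def triple_hash(triple: Triple, graph: Iterable[Triple]) -> int:
    s, _p, o = triple
    if not is_blank(s) and not is_blank(o):
        return sha256_int(" ".join(triple) + " .")            # rule (1)
    total = 0
    if is_blank(s):                                           # rule (2)
        total += sha256_int(non_blank_part(triple))
        total += sum(sha256_int(non_blank_part(t)) for t in graph if t[2] == s)
    if is_blank(o):                                           # rule (3); (4) applies both
        total += sha256_int(non_blank_part(triple))
        total += sum(sha256_int(non_blank_part(t)) for t in graph if t[0] == o)
    return total % MODULUS


def ihash(graph: Iterable[Triple]) -> int:
    """Combine the triple hashes with addition modulo MODULUS."""
    triples = list(graph)
    return sum(triple_hash(t, triples) for t in triples) % MODULUS
```

Under these assumptions the result does not depend on the order of the triples or on the labels chosen for blank nodes, because a blank node contributes only through the non-blank terms of the triples it weaves together.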

IV. TRANSACTIONAL ASPECTS
The design choices we have made in the process of creating Ontospace have an impact on the transactional features of the combined system of Blockchain and the RDF graph database. In this section, we address the most important processes of the combined system in its transactional behavior:
a. Replication of the RDF named graphs and assurance of consistency
b. Direct access to the RDF graph data using a modified Ethereum client
c. Tethering of the Layer-2 OntoSidechains to the parent blockchains.

A. TRANSACTIONAL ASPECTS OF THE RDF GRAPHS REPLICATION
To ensure consistency between the synchronised replication of the RDF named graphs and the Blockchain replication and consensus mechanisms, an elaborate protocol has been proposed. The protocol assumes an interaction between the replication of graphs and the Blockchain internal replication mechanism through the use of specific smart-contracts. Such an approach helps to preserve the proper sequence of "world states" after transactions involving both the graph databases and the blockchain. To explain how the protocol works, it is best to follow the steps of the process from the user upload of a new graph to the insertion of the graph into the database local to each node of the system. Figure 6 illustrates the process:
(1) The user inserts a new graph through the REST API exposed by the Ontoshell component of the Ontonode.
(6) The graph data is requested from the source Ontopod database.
(7) After validating the graph data with the iHash from the smart-contract, the graph is inserted into the local Ontopod instance.

1) Graph version names
An important factor in the successful operation of the process is the named graph identification scheme based on the Internationalized Resource Identifier (IRI). The named graph version IRI is constructed in accordance with the Decentralized Identifiers (DID) syntax specification. The adopted graph version IRI scheme is: where:
• IHASH - the iHash calculated for this graph, written as a lowercase hex string
• VERSION_TIME - the update date as Unix time with milliseconds
We also used a regular expression to test the adopted IRI validity: The identifiers of graph versions have a specific structure that aids the identification and management of named graphs.

2) New graph addition
The graph database (Ontopod) endpoints are not directly exposed to external users. Instead, there is a special layer called Ontoshell, specifically designed to accept requests. The primary Ontoshell interface implements the SPARQL 1.1 Graph Store HTTP Protocol [37].
After a POST or PUT request, the RDF graph is not immediately stored in the graph database. Before that, iHash of that graph is computed and a new IRI for the current graph version is generated. Then, the graph is inserted into the graph database, the metagraph is created (or modified) and the smart-contract method is executed for adding the new graph info to the blockchain.
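The order of operations described above can be sketched as follows. This is only an illustration under stated assumptions: `stand_in_hash` is a plain SHA-256 over sorted triples and is not the iHash algorithm (it does not handle blank nodes), `FakeGraphLedger` is a hypothetical in-memory substitute for the smart-contract call, and the version identifier is kept as a `(hash, time)` pair rather than a real version IRI.

```python
import hashlib
import time


def stand_in_hash(triples):
    """Order-independent SHA-256 digest of triples (a stand-in, NOT iHash)."""
    digest = hashlib.sha256()
    for line in sorted(" ".join(triple) for triple in triples):
        digest.update(line.encode("utf-8") + b"\n")
    return digest.hexdigest()


class FakeGraphLedger:
    """Hypothetical in-memory substitute for the smart-contract call."""

    def __init__(self):
        self.events = []

    def add_graph(self, metagraph_iri, version_time, graph_hash):
        self.events.append((metagraph_iri, version_time, graph_hash))


def handle_upload(graphs, metagraphs, ledger, metagraph_iri, triples, now_ms=None):
    """Follow the order in the text: hash, version id, store, metagraph, ledger."""
    version_time = int(time.time() * 1000) if now_ms is None else now_ms
    graph_hash = stand_in_hash(triples)
    version_id = (graph_hash, version_time)    # placeholder for the version IRI
    graphs[version_id] = list(triples)         # insert into the graph database
    metagraphs.setdefault(metagraph_iri, []).append(version_id)
    ledger.add_graph(metagraph_iri, version_time, graph_hash)
    return version_id
```

The ledger call comes last, so other nodes only learn about a graph version that the source node can already serve.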

3) Graph data propagation
Blockchain data is propagated through the network in the standard way for the given Blockchain technology of choice (in the case described here, Ethereum). On every Ontonode there is a special task scheduled for polling the blockchain ledger for new transactions with graph data. When a transaction with a new graph arrives, Ontonode starts a graph database synchronization operation.

FIGURE 6. Ontochain synchronization overview

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

4) Graph database (Ontopod) update
Ontonode reads the metagraph IRI (real_graph_uri), the timestamp and the iHash of the new graph content, assembles the graph version IRI from this information and sends a request to the source graph for the actual graph data.
Then, the synchronization code checks whether the iHash of the graph data matches the one from the smart-contract, and whether the metagraph content is the same as the local version (if present). If all checks are successful, the new graph is inserted into the local Ontopod, the metagraph is updated and the node status is synchronized.
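The two checks performed before the local insert can be sketched as below. The sketch rests on several assumptions that are not taken from the implementation: the digest is a SHA-256 stand-in rather than iHash, a metagraph is modelled as a list of version identifiers, and "same metagraph content" is interpreted as "the history reported by the source, before the new version, equals the local history".

```python
import hashlib


def stand_in_hash(triples):
    """Order-independent SHA-256 digest of triples (a stand-in, NOT iHash)."""
    digest = hashlib.sha256()
    for line in sorted(" ".join(triple) for triple in triples):
        digest.update(line.encode("utf-8") + b"\n")
    return digest.hexdigest()


def apply_graph_event(local_graphs, local_meta, event, fetched_triples, source_history):
    """Insert the fetched graph only if both checks from the text pass.

    event: dict with "metagraph", "time" and "hash", as read from the ledger.
    source_history: versions the source lists for the metagraph before this one
                    (an assumed representation of the metagraph content).
    """
    if stand_in_hash(fetched_triples) != event["hash"]:
        return False                       # content differs from on-chain digest
    known = local_meta.get(event["metagraph"])
    if known is not None and known != source_history:
        return False                       # metagraph diverged from local copy
    version_id = (event["hash"], event["time"])
    local_graphs[version_id] = list(fetched_triples)
    local_meta[event["metagraph"]] = list(source_history) + [version_id]
    return True
```

A rejected event leaves the local state untouched, so the node can retry the same event later or from another source.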

5) Adding new OntoSidechain node
When a new node is added to the chain, or when a node is restarted after a longer pause, the synchronization mechanism is essentially the same as for a working node. The graph records are downloaded in batches asynchronously. There is a separate smart-contract on the blockchain for storing the index of the last graph record synchronized for each node, as well as the URLs of the nodes.

6) Smart-contracts
There are two smart contracts deployed on the OntoSidechain blockchain that are responsible for the graph synchronization mechanism: GraphEventStorage and NodeSynchronizationRegistry.
The GraphEventStorage contract is responsible for storing information about the graphs added to the OntoSidechain. After adding a new graph to the database, the Ontonode must call the addGraph function to notify all other Ontonodes about the new graph. The data contained in the GraphEvent structure will then allow other Ontonodes to fetch the new graphs from their original source.
The role of NodeSynchronizationRegistry contract is to persist the information about nodes and about Ontonode graph synchronization progress, i.e. how many graph events have been processed from the GraphEventStorage contract. Its most important field is nodeSynchronizationProgress which is a counter of the synchronized graphs.
If nodeSynchronizationProgress is equal to GraphEventStorage.graphEventCount for a given nodeId, it signifies that the Ontonode's graph database is up to date. If nodeSynchronizationProgress is less than GraphEventStorage.graphEventCount for a given nodeId, it signifies that the Ontonode's graph database needs to fetch new graphs from the source. If nodeSynchronizationProgress is zero for a given nodeId, it signifies that the Ontonode has not yet begun or has just started synchronizing the first batch.
The key feature of the synchronizer is the use of the smart-contract for monitoring its synchronization progress. After it finishes synchronizing a batch, it persists the index (N) of the last synchronized graph in the NodeSynchronizationRegistry, i.e. the index in the GraphEventStorage.graphEvents array. When the synchronization job runs again, it reads the value from the index (N + 1). If (N + 1) <= GraphEventStorage.graphEventCount, it starts the new batch.
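The progress rules and the batch loop can be illustrated with a small sketch. It assumes that indices are 1-based (so that progress equal to the event count means "up to date"), that the registry is a plain dictionary standing in for NodeSynchronizationRegistry, and that `batch_size` and `process_event` are hypothetical parameters introduced only for this example.

```python
def sync_status(progress, event_count):
    """Classify a node by comparing its progress with the event count."""
    if progress > event_count:
        raise ValueError("progress cannot exceed the number of graph events")
    if progress == event_count:
        return "up_to_date"
    if progress == 0:
        return "not_started"
    return "behind"


def next_batch(progress, event_count, batch_size):
    """Return the 1-based indices of the next batch (empty if nothing is new)."""
    first = progress + 1
    if first > event_count:
        return []
    last = min(progress + batch_size, event_count)
    return list(range(first, last + 1))


def run_sync_job(registry, node_id, graph_events, batch_size, process_event):
    """Process one batch and persist the index of the last synchronized event."""
    progress = registry.get(node_id, 0)
    for index in next_batch(progress, len(graph_events), batch_size):
        process_event(graph_events[index - 1])   # 1-based index into the list
        progress = index
    registry[node_id] = progress                 # persisted after the batch
    return progress
```

With five graph events and a batch size of two, repeated calls to `run_sync_job` advance the stored progress to 2, 4 and 5, after which further runs find an empty batch.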
The NodeSynchronizationRegistry smart-contract is also used to reinforce the replication mechanism. In the basic algorithm described in Subsection IV-A4, the synchronizer sends a request for the graph data to the node that initially entered that graph into the system. If that node was unavailable for some reason, the synchronization process would be stopped. To prevent that from happening, the synchronizer uses the data from NodeSynchronizationRegistry to check which nodes have already got the particular graph (through nodeSynchronizationProgress field) and tries to send the request to them to fetch the data.
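A possible shape of this fallback is sketched below. The function names, the dictionary of per-node progress values and the convention that `fetch` returns `None` (or raises `ConnectionError`) for an unavailable node are assumptions made for illustration only.

```python
def candidate_sources(event_index, origin_node, progress_by_node, self_node):
    """Origin node first, then every other node whose progress covers the event."""
    candidates = [origin_node]
    for node_id, progress in sorted(progress_by_node.items()):
        if node_id in (origin_node, self_node):
            continue
        if progress >= event_index:      # this node has already synchronized it
            candidates.append(node_id)
    return candidates


def fetch_with_fallback(event_index, candidates, fetch):
    """Try each candidate in order; return (node, data) from the first success."""
    for node_id in candidates:
        try:
            data = fetch(node_id, event_index)
        except ConnectionError:
            data = None                  # treat a failed connection as "no data"
        if data is not None:
            return node_id, data
    raise ConnectionError("no node could provide graph event %d" % event_index)
```

Any data obtained from a fallback node would still need to pass the same digest check as data obtained from the original source.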

B. TRANSACTIONAL ASPECTS OF THE MODIFIED ETHEREUM CLIENT
To guarantee direct access from smart-contracts to the RDF graphs, we have proposed a modification of the Blockchain client (described in Subsection E of Section II). However, such a modification has important transactional ramifications, as Ethereum is a "transaction-based state machine" where all transactions in the blockchain, processed sequentially, result in the same machine state, i.e. "the version of the world of Ethereum". All nodes of the network have to arrive at the same "world" state; otherwise, network participants would see different states (e.g. account balances, elements of knowledge graphs) depending on which node they ask. To fulfill this requirement, all transaction processing from existing blocks needs to be deterministic. For this reason, in the standard implementation, the following operations, common to every programming language, cannot be performed by a smart contract:
• generating random numbers,
• directly calling external APIs,
• external database operations.
The results of these operations can be unexpected and vary depending on the runtime and execution environment. API calls could succeed on some nodes while failing on others, and the returned values could also differ.
If a modified Ethereum client were to add an ability for smart contracts to access data outside of the internal state, then such a mechanism would need to satisfy the following requirements:

a: Data immutability
Once accessed, the external data cannot be changed. As Ethereum transactions must be deterministic, the data access operations need to obey the same restriction. If the underlying external data kept changing, then the same transactions would yield different states depending on when they were executed: for example, a new node synchronizing with the network and processing old transactions would arrive at a different machine state than the nodes that processed the transactions as they were arriving. Hence, the underlying external database has to be append-only, with no ability to delete old records. Moreover, it must be impossible to attempt to access data that does not exist, e.g., to pass an ID of a non-existent record. If a past transaction failed to retrieve such a record and at some future point it came into existence, then divergent EVM states would occur (a state fork), as old transactions would start yielding different outcomes.
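The reasoning behind the immutability requirement can be made concrete with a toy replay. The sketch below is not the modified client; `AppendOnlyStore` and `replay` are hypothetical names, and a "transaction" is reduced to a read of one record ID.

```python
class AppendOnlyStore:
    """Toy external database: records can be appended but never changed."""

    def __init__(self):
        self._records = []

    def append(self, value):
        self._records.append(value)
        return len(self._records) - 1      # the new record's ID

    def snapshot(self):
        """Copy of the store contents as seen at this moment."""
        return dict(enumerate(self._records))


def replay(record_ids, snapshot):
    """Re-execute read-only 'transactions' against a store snapshot."""
    return [snapshot.get(record_id, "<missing>") for record_id in record_ids]


store = AppendOnlyStore()
first_id = store.append("graph-A")
early = store.snapshot()                   # what an early node saw
store.append("graph-B")
late = store.snapshot()                    # what a late-joining node sees

# Reads of records that existed at execution time replay identically.
same = replay([first_id], early) == replay([first_id], late)
# A read of a record that did not exist yet replays differently later.
diverged = replay([1], early) != replay([1], late)
```

Here `same` is true and `diverged` is true: appending alone is not enough, and reads of not-yet-existing records have to be ruled out as well.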

b: Eventual finality
As Ethereum transactions processed by all the nodes in the network must yield the same machine state, it is obligatory that the external data provides time-critical guarantees of finality. Once the external calls are executed, there has to be 100% certainty that they will yield the same results across the whole network.
In order to achieve that, a condition similar to this needs to be fulfilled: If a read from an external data source is executed and data older than X blocks is accessed, then the operation will be successful and return the same value for all nodes in the network.
The higher the X, the more time the external database network would have to synchronize new records. After X blocks have passed, a database read must return a record, i.e., all external database states older than X blocks must be final.
Guaranteeing this condition can sometimes be impossible in a complex distributed system; thus, it is crucial that proper mechanisms are in place to prevent a violation of the condition during the transaction processing phase. Sometimes a database access operation fails unexpectedly due to network errors, hardware problems, etc. If that is the case, the transaction cannot be accepted until the operation succeeds; if it never succeeds, the transaction must eventually be discarded. A state synchronisation scenario is presented in Figure 7.
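The finality condition and the accept-or-discard rule can be sketched as follows. The names, the result labels and the bounded retry count `max_attempts` are assumptions made for this illustration; the text only states that a failing transaction must eventually be discarded.

```python
class TransientReadError(Exception):
    """Hypothetical error type for network or hardware failures."""


class FlakyReader:
    """Demo helper: fails `failures` times, then returns `value`."""

    def __init__(self, failures, value):
        self.failures = failures
        self.value = value

    def __call__(self):
        if self.failures > 0:
            self.failures -= 1
            raise TransientReadError("temporary failure")
        return self.value


def external_read(record_block, current_block, x, read, max_attempts=3):
    """Gate an external read on finality, then retry transient failures."""
    if current_block - record_block <= x:
        return ("not_final", None)      # record is not yet older than X blocks
    for _ in range(max_attempts):
        try:
            return ("accepted", read())
        except TransientReadError:
            continue                    # not accepted until the read succeeds
    return ("discarded", None)          # gave up after max_attempts failures
```

A larger `x` gives the external database more time to replicate a record before any transaction is allowed to depend on it, at the price of higher latency for newly added graphs.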

c: Security guarantees
As Ethereum networks and transaction processing are meant to be trust-less, the same requirement extends to the data provided by the external data source. Data providers can be incentivized to provide false data in order to extract value from the network, e.g., tokens locked in a smart contract. The data provision in a smart contract should be followed by checks guaranteeing that the data has not been tampered with. Ideally, it would be a verifiable cryptographic signature of the data creator.

d: Data availability
In order for the newly connected Ethereum node to synchronize with the existing network, it will need to have access to a fully synchronized external data source to process all the transactions properly. As both external database and an Ethereum node could be starting in a desynchronized state, an additional external database will need to be provided for the needs of initial synchronization.
The efforts to meet these requirements have brought very good results. Using the PoC implementation of the system, we experimentally verified that the combined Blockchain and graph database system fulfills the requirements, while access to the graph data was an order of magnitude faster than when using Ethereum Oracles and an unmodified client [38].

C. MECHANISMS FOR ONTOSIDECHAINS TETHERING INTO THE PARENT CHAINS
Following Layer-2 protocols, the Ontospace ecosystem demands that sidechains be tethered (or, in common parlance, pegged) to the mainnet (or to a parent net which can in turn be tethered to the mainnet). As the main goals of the Ontospace ecosystem are not just crypto-asset applications, as is the case with the majority of Layer-2 blockchains, but are related to trusted knowledge representation, the mechanism we designed for the tethering is different from the standard Layer-2 mechanisms like rollups or state channels. What motivated our approach was the need for additional security of the data on OntoSidechains, gained from the linkage (tethering) to the mainnet. We have used Merkle Trees [39] for that purpose. Depending on the OntoSidechain activity (number of new blocks generated per unit of time), a Merkle Tree is generated every 2^N transactions of the OntoSidechain. In the leaves of the tree, the mechanism stores both the Blockchain transaction hash and the corresponding RDF named graph hash (iHash, the Interwoven Hash described in the preceding sections). Alternatively, only the Blockchain transaction hash is stored, if the RDF named graph hash is already stored in the transaction. This is illustrated in Figure 8 and in Figure 9.
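A minimal sketch of such a tree, together with a standard Merkle proof check against its root, is given below. It assumes SHA-256, a power-of-two number of leaves, fixed-length input hashes, and one-byte prefixes that separate leaf hashes from inner-node hashes; none of these choices is taken from the Ontochain implementation.

```python
import hashlib


def h(data):
    return hashlib.sha256(data).digest()


def leaf(tx_hash, graph_hash=b""):
    """Leaf from a transaction hash and, optionally, the graph hash."""
    return h(b"\x00" + tx_hash + graph_hash)


def build_levels(leaves):
    """Return all tree levels, from the leaves up to the single root."""
    if not leaves or len(leaves) & (len(leaves) - 1):
        raise ValueError("the number of leaves must be a power of two")
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(b"\x01" + prev[i] + prev[i + 1])
                       for i in range(0, len(prev), 2)])
    return levels


def proof(levels, index):
    """Sibling hashes from leaf to root, each flagged if it sits on the left."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        path.append((level[sibling], sibling < index))
        index //= 2
    return path


def verify(leaf_hash, path, root):
    """Recompute the root from a leaf and its proof, then compare."""
    node = leaf_hash
    for sibling, sibling_is_left in path:
        pair = sibling + node if sibling_is_left else node + sibling
        node = h(b"\x01" + pair)
    return node == root
```

Only the root (`build_levels(leaves)[-1][0]`) would need to be stored on the parent chain; an auditor holding one leaf and its proof of N sibling hashes can check membership without the other transactions.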
The roots of the Merkle Trees generated for the sidechain are stored in the transactions of the parent chain. Using a standard Merkle proof, every transaction can be audited and verified against the root of the Merkle Tree. Such an audit can be initiated on both the sidechain and the mainnet, by code specifically designed for the audits.

Merkle Tree

There are multiple types of Layer-2 implementations [40]-[45]. The first approach is Loopring [40], which runs as a public set of smart contracts responsible for trade and settlement, with an off-chain group of agents aggregating and communicating orders. Another proposal is AZTEC [41], which defines a set of zero-knowledge proofs that determine a confidential transaction protocol, designed for use within Blockchain protocols that support Turing-complete general-purpose computation. Another proposal is Zecale [43], a general-purpose proof aggregator that uses a recursive composition of small arguments. Yet another approach is Hermes [42], a platform for trading sensor data that uses distributed ledgers as intermediaries to add safeguards against malicious behavior. Other approaches are Raiden [44] and Lightning [45]. Both proposals operate on top of a blockchain and enable fast peer-to-peer transactions. They are based on state channels and the creation of communication channels between nodes outside the blockchain. Those networks are in charge of managing transactions between connected nodes, which reduces the main chain's workload.

FIGURE 9. Pegged sidechain
On the other hand, the Ethereum platform offers a few proposals for scaling [19]-[21]. These approaches are associated with a server or cluster of servers, each of which may be referred to as a node, operator, block producer, or similar term. State channels [19] utilize multisig contracts to enable participants to transact quickly and freely off-chain, then settle finality with the mainnet. Rollups [20] are another proposal, which performs transaction execution outside Layer-1 and then posts the data to Layer-1, where consensus is reached. There are two approaches based on rollups: ZK-rollups 6 and Optimistic rollups 7. The first generates a cryptographic proof, a succinct non-interactive argument of knowledge, which is known as a validity proof and is posted on Layer-1. The second offers improvements in scalability because, after a transaction, it proposes the new state to the mainnet. The next approach is Plasma [21], which aims at extending the concept of sidechains as a way to reduce the number of transactions to be processed by the Layer-1 Blockchain.

B. BLANK NODES AND GRAPH DIGESTS
Blank nodes are well-studied. There are papers covering their theory [46], semantics [47] and complexity [29]. Carroll [48] presents a method to canonicalise RDF graphs with blank nodes in such a manner that they can be digitally signed. The proposed method is based on writing the graph to the N-Triples serialization, mapping all blank nodes to a global blank node, sorting the RDF triples lexically, and then relabelling the blank nodes. Another method to compute the digest of an RDF graph is proposed by Sayers and Karp [49], where there is an assumption that all blank node labels are fixed. Yet another method is proposed by Giereth [50], who proposes to artificially add new triples to distinguish individual blank nodes in the encryption process. The next method is presented by Lantzaki et al. [51] and is based on computing a signature for blank nodes from the constant terms in their direct neighborhood. Two further algorithms are proposed by Hogan [52]. The first one computes an iso-canonical form and generates the same result for a pair of input RDF graphs if and only if they are isomorphic. The second one computes an equi-canonical form and generates the same result for pairs of simple-equivalent graphs.

C. MERKLE TREE-BASED GRAPH INTEGRITY
Crosby and Wallach [53] present the History-Based Merkle Tree, a tree-based history data structure for tamper-evident logging based on the Merkle tree [39]. The position-aware Merkle tree is another method, proposed by Mao et al. [54]. In this approach, each node in a Merkle tree can keep track of its position relative to its parent node. Yet another proposal is the Merklix tree [55], a binary tree that has both Merkle and radix tree features. Sutton and Samavi [56] proposed two methods: a semantic-based approach and a structure-based approach. The first method leverages timestamps as an indexing key to construct a sorted Merkle tree variation. The second method utilizes the redundant structure of large RDF datasets to compress the dataset statements prior to generating a variation of a Merkle tree.

VI. CONCLUSIONS AND FUTURE WORK
Our work demonstrated a realistic possibility for the creation of a Knowledge Representation system on Blockchain. This possibility stems from a carefully designed synergy between RDF graph databases, Linked Data access methods and Blockchain functionalities that guarantee immutability, non-repudiation and decentralization of the Knowledge Representation realized by standard RDF graph databases. The work involved meeting important challenges related to cryptographic methods adequate for the semantic objects expressed as RDF graphs, transactional requirements resulting from a modified standard Blockchain client, and architectural challenges resolved using 3rd generation Blockchain Layer-2 protocols.
As the theoretical work reported here was accompanied by a Proof-of-Concept kind of software development, we were able to positively verify the developed technological foundations.
Work is now in progress on a production-grade system using the technologies described here, and on a possible application of our approach to the next generation of graph databases, i.e. the Property Graphs [57], which promise higher performance and new extended capabilities, but which require the development of concepts like subgraphs, property graph hashing (integrity proofs) and partial replication of such graphs.