An Abstraction-Based Approach for Privacy-Aware Federated Process Mining

Process awareness is an essential success factor in any type of business. Process mining uses event data to discover and analyze actual business processes. Although process mining is growing fast and it has already become the basis for a plethora of commercial tools, research has not yet sufficiently addressed the privacy concerns in this discipline. Most of the contributions made to privacy-preserving process mining consider an intra-organizational setting, where a single organization wants to safely publish its event data so that process mining experts can analyze the data and provide insights. However, in real-life settings, organizations need to collaborate for performing their processes, e.g., a supply chain process may involve many organizations. Therefore, event data and processes are often distributed over several partner organizations, yet organizations hesitate to share their data due to privacy and confidentiality concerns. In this paper, we introduce an abstraction-based approach to support privacy-aware process mining in inter-organizational settings. We implement our approach and demonstrate its effectiveness using real-life event logs.


I. INTRODUCTION
Process mining provides a family of techniques to discover, analyze, and improve latent business processes [1]. It provides fact-based and actionable insights into the actual processes using event logs. Three basic types of process mining are (1) process discovery, where the goal is to learn real process models from event logs, (2) conformance checking, where the aim is to find commonalities and disconformities between a process model and an event log, and (3) process enhancement, where the aim is to extend or improve a process model using different aspects of the available data.
Events are the smallest units of process execution characterized by their attributes. Process mining requires that each event contains at least the following main attributes to enable the application of analysis techniques: case id, activity, and timestamp. The case id often refers to an individual to whom the event belongs, e.g., a patient or customer. The activity refers to the activity associated with the event, and the timestamp is the time when the activity was executed. A sequence of events having a fixed ordering based on their timestamps is called a trace, which is considered a crucial case attribute for process mining techniques.
The associate editor coordinating the review of this manuscript and approving it for publication was Jolanta Mizera-Pietraszko.
Depending on the context of a process, the corresponding events may contain more attributes. For example, in the healthcare context, the resource attribute can be used to indicate the activity performer, e.g., a nurse, or an event attribute may show the disease of the corresponding patient. Table 1 shows a part of an event log recorded by an information system in a hospital. Some of the event attributes may refer to individuals such as case id and resource. For instance, in Table 1, the case id attribute refers to the patients whose data are recorded, and the resource attribute refers to the employees performing activities for the patients. Moreover, some attributes may be considered as sensitive attributes, e.g., the disease attribute in Table 1. Privacy issues are highlighted when person-specific information is included in an event log. For example, in Table 1, knowing that "Injection" was performed for a patient, the corresponding case, which is case 4, is re-identified. Consequently, the whole sequence of activities performed for the patient and also the disease are disclosed.
The terms inter-organizational process mining, cross-organizational process mining, and federated process mining all refer to a sub-discipline of process mining where the goal is to jointly discover, monitor, analyze, and improve cross-organizational processes [2], [3], [4], [5]. Remaining in the healthcare context, consider a collection of clinics and hospitals involved in the treatment process of some patients. Federated process mining can be used to discover the overall treatment process that traverses several hospitals, find bottlenecks in the process, identify successful/failed treatment processes, etc. However, process mining is rarely applied in an inter-organizational setting mainly due to privacy/confidentiality concerns. Setting the right inter-organizational boundaries, regarding privacy issues, is an important element of advancing process mining.
Organizations, such as healthcare providers, are obviously unwilling to share their entire event logs containing highly sensitive information with other parties involved in a joint process. Moreover, they cannot afford to trust third parties. Thus, the main challenge of Privacy-Aware Federated Process Mining (PAFPM) is to get process mining insights regarding the entire process while considering privacy, and without a need for a trusted third party.
We consider two main levels for privacy concerns in federated process mining: the individual level and the organizational level. The former aims to protect private data belonging to individuals in organizations. The latter considers all the internal activities of an organization as sensitive private information that should not be revealed. We propose an approach that can address both levels of privacy concerns. Our approach for PAFPM focuses on the following challenges: (1) no need to bound the number of involved parties, (2) no need for a trusted third party, (3) no need for designing complex communication protocols among parties, e.g., secure multi-party computation protocols, and (4) possibility of ensuring that all the involved parties can share data while allowing for the necessary levels of data utility and data privacy.
FIGURE 1. The general overview of the abstraction-based approach for privacy-aware federated process mining. Inside the dashed squares is considered as the trusted environment, and outside these squares is considered as the untrusted environment.
The proposed approach is based on the concept of abstractions in process mining [6]. Abstractions are intermediate results of process mining algorithms that relate event logs and final results. For example, a directly follows graph, representing directly follows relations between activities, is an abstraction of process discovery algorithms that relates an event log to a formal process model describing the observed behavior in the event log. Abstractions are generated by specific abstraction functions that reduce event logs, containing highly sensitive detailed information, to the minimal information required for obtaining specific results. Thus, sharing abstractions is of lower risk for organizations compared to sharing original event logs. However, using abstractions raises the following challenges that need to be addressed. First, given a single event log, the effectiveness of an abstraction function on both aspects of data utility and data privacy needs to be evaluated. For the specific type of abstraction used in this paper, we demonstrate the usefulness of the abstraction function. Second, abstractions need to be shared in such a way that the provided data utility and data privacy for a single event log do not degrade in an inter-organizational setting. Figure 1 depicts the general overview of the abstraction-based approach for privacy-aware federated process mining. Gray arrows depict sharing abstractions where the same type of abstractions of event logs, i.e., obtained by applying the same abstraction function, are shared rather than sharing original sensitive event logs. The challenge w.r.t. the data utility is to merge abstractions in such a way that the merged abstraction is the same as the abstraction obtained from the collection of event logs by applying the same abstraction function. If so, the process mining results obtained by sharing abstractions are the same as the results obtained by sharing original event logs.
In this paper, we focus on the control-flow aspect of process mining that requires the basic three attributes (i.e., case id, activity, and timestamp) to perform two main activities of process mining, i.e., process discovery and conformance checking. After discovering a joint process model using the abstraction-based approach, we propose the so-called Risk-Aware Reveal Method (RARM) that can be used to answer more in-depth inquiries about the process while considering the privacy concerns of organizations.
The remainder is organized as follows. In Section II, the preliminaries are explained. In Section III, we discuss related work. In Section IV, we analyze the data utility and privacy constraints of abstractions, and the risk-aware reveal method is proposed to address the limitations. In Section V, we demonstrate our approach for privacy-aware federated process mining. In Section VI, we employ real-life event logs to show the effectiveness of the abstraction-based approach, and Section VII concludes the paper with a discussion regarding limitations and possible next steps.

II. BACKGROUND
In this section, we introduce some basic concepts and provide formal models that will be used in the remainder of the paper to describe the approach.

A. EVENT LOG
We first introduce some basic notations. For a given set A, A* is the set of all finite sequences over A, and B(A) is the set of all multisets over the set A. A finite sequence over A of length n is a mapping σ: {1, …, n} → A, represented as σ = ⟨a_1, a_2, …, a_n⟩ where a_i = σ(i) for any 1 ≤ i ≤ n. |σ| denotes the length of the sequence. For σ ∈ A*, we write a ∈ σ iff there exists 1 ≤ i ≤ |σ| such that σ(i) = a. Given A and B as two multisets, A ⊎ B is the sum over multisets, e.g., [a², b³] ⊎ [b², c²] = [a², b⁵, c²]. For σ_1, σ_2 ∈ A*, σ_1 ⊑ σ_2 if σ_1 is a subsequence of σ_2, e.g., ⟨z, b, c, x⟩ ⊑ ⟨z, x, a, b, b, c, a, b, c, x⟩, and σ_1 ⊕ σ_2 is the concatenation of two sequences, e.g., ⟨a, b, c⟩ ⊕ ⟨d, e⟩ = ⟨a, b, c, d, e⟩. For σ ∈ A*, {a ∈ σ} is the set of elements in σ, and [a ∈ σ] is the multiset of elements in σ, e.g., [a ∈ ⟨x, y, z, x, y⟩] = [x², y², z].
Definition 1 (Event): An event is a tuple e = (c, a, t, r, d_1, …, d_m), where c ∈ C is the case id, a ∈ A is the activity associated with the event, t ∈ T is the event timestamp, r ∈ R is the resource, and d_1, …, d_m is a list of additional attribute values, where for any 1 ≤ i ≤ m, d_i ∈ D_i (the domain of the i-th additional attribute). E = C × A × T × R × D_1 × ⋯ × D_m is the universe of events. For e = (c, a, t, r, d_1, …, d_m), π_case(e) = c, π_act(e) = a, π_time(e) = t, π_res(e) = r, and π_dom_i(e) = d_i, 1 ≤ i ≤ m, are its projections. An event log is L ⊆ E where events are unique.
Definition 2 (Trace, Trace Variant): A trace σ = ⟨e_1, e_2, …, e_n⟩ ∈ E* is a sequence of events, s.t., for each e_i, e_j ∈ σ: π_case(e_i) = π_case(e_j), and π_time(e_i) ≤ π_time(e_j) for 1 ≤ i < j ≤ n. A trace variant σ ∈ A* is a trace where all the events are projected on the activity attributes.
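To make the notions of event, trace, and trace variant concrete, the following minimal Python sketch groups events by case id, orders them by timestamp, and projects them onto activities. The class name Event, the helper names, the integer timestamps, and the toy data are assumptions made for illustration only; they are not taken from Table 1.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    case: str      # case id (c)
    activity: str  # activity (a)
    time: int      # timestamp (t); an integer stands in for a real timestamp
    resource: str  # resource (r)


def traces(log):
    """Group events by case id and order each group by timestamp."""
    by_case = defaultdict(list)
    for e in log:
        by_case[e.case].append(e)
    return {c: sorted(es, key=lambda e: e.time) for c, es in by_case.items()}


def variant(trace):
    """Project a trace onto its activities, i.e., the trace variant."""
    return tuple(e.activity for e in trace)


# Hypothetical toy event log (a set, so events are unique).
log = {
    Event("c1", "register", 1, "r1"),
    Event("c1", "visit", 2, "r2"),
    Event("c2", "register", 1, "r1"),
    Event("c2", "release", 3, "r3"),
    Event("c1", "release", 4, "r3"),
}
tr = traces(log)
```

Here, variant(tr["c1"]) yields the activity sequence of case c1 in timestamp order.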

Definition 3 (Simple Event Log):
A simple event log is a multiset of trace variants, i.e., L ∈ B(A*). We assume that each trace in L belongs to an individual and σ ≠ ⟨⟩ if σ ∈ L. A_L = {a ∈ σ | σ ∈ L} is the set of activities in L, and L̃ = {σ ∈ L} denotes the set of trace variants in L. Table 2 shows a simple event log derived from Table 1. In this paper, the term event log refers to a simple event log unless it is clearly mentioned that we mean a set of events.
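Definition 4 (Entropy of Event Log): ent : B(A*) → R≥0 is a function which retrieves the entropy of traces in an event log, s.t., for L ∈ B(A*), $\mathit{ent}(L) = -\sum_{\sigma \in \tilde{L}} \frac{L(\sigma)}{|L|} \log_2 \frac{L(\sigma)}{|L|}$, where $\tilde{L}$ is the set of trace variants in L, L(σ) is the frequency of σ in L, and |L| is the number of traces in L. max_ent(L) = log_2 |L| is the maximal entropy that can be achieved when all the trace variants are unique.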
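As an illustration of Definition 4, the sketch below computes the entropy and the maximal entropy of a simple event log represented as a collections.Counter that maps trace variants (tuples of activities) to their frequencies. This representation and the toy logs are assumptions chosen for the example.

```python
import math
from collections import Counter


def ent(log: Counter) -> float:
    """Shannon entropy (base 2) of the trace-variant distribution of a simple log."""
    n = sum(log.values())
    return -sum((k / n) * math.log2(k / n) for k in log.values())


def max_ent(log: Counter) -> float:
    """Entropy reached when all |L| traces are unique variants."""
    return math.log2(sum(log.values()))


# Hypothetical logs: two variants with two traces each, and four unique traces.
skewed = Counter({("a", "b"): 2, ("a", "c"): 2})
unique = Counter({("a",): 1, ("b",): 1, ("c",): 1, ("d",): 1})
```

For the log with four unique traces, ent and max_ent coincide, which matches the remark that the maximal entropy is reached when all trace variants are unique.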
Definition 5 (Directly Follows Relations (DFR)): Let σ = ⟨a_1, a_2, …, a_n⟩ ∈ A* be a trace variant. DFR_σ = [(▶, a_1), (a_1, a_2), …, (a_{n−1}, a_n), (a_n, ■)] is the multiset of directly follows relations between activities in σ, where the first and last activities are tuples with ▶ and ■ as the dummy start and end activities, respectively. The tuple including ▶ is called the start relation, and the one including ■ is called the end relation. For L ∈ B(A*), DFR_L = ⊎_{σ ∈ L} DFR_σ is the multiset of directly follows relations between activities in the traces of L. Given dfr = (a, b), π_1(dfr) = a and π_2(dfr) = b are the projections of dfr.
Definition 6 (Directly Follows Graph (DFG)): Let L be a simple event log, A_L be the set of activities in L, and DFR_L be the multiset of directly follows relations in L. DFG_L = (A_L ∪ {▶, ■}, DFR_L) is the directly follows graph of L.
Figure 2 shows the DFG of the simple event log shown in Table 2, where each node represents an activity, and the directed arcs represent the DFRs between activities. Note that ▶ and ■ denote the dummy start and end activities, respectively. The numbers above arcs represent the frequency of the corresponding DFRs. A multiset of DFRs is a specific type of event log abstraction that can be used to generate a DFG. Definition 7 provides a generic definition for abstraction functions on simple event logs.
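The computation of directly follows relations and of the directly follows graph can be sketched as follows. The simple event log is again assumed to be a Counter of trace variants, and the function names and the toy log are hypothetical (the log is not Table 2).

```python
from collections import Counter

START, END = "▶", "■"  # dummy start and end activities


def dfr_of_trace(variant):
    """Multiset of directly follows relations of one trace variant."""
    path = (START, *variant, END)
    return Counter(zip(path, path[1:]))


def dfr_of_log(log):
    """Multiset sum of the relations over all traces of a simple event log."""
    total = Counter()
    for variant, freq in log.items():
        for rel, k in dfr_of_trace(variant).items():
            total[rel] += k * freq
    return total


def dfg(log):
    """Directly follows graph: (nodes, multiset of arcs)."""
    nodes = {a for v in log for a in v} | {START, END}
    return nodes, dfr_of_log(log)


# Hypothetical simple event log with three traces.
toy_log = Counter({("a", "b", "c"): 2, ("a", "c"): 1})
nodes, arcs = dfg(toy_log)
```

The arc frequencies in arcs correspond to the numbers that would be drawn above the arcs of the graph.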

B. DISCLOSURE RISKS
The main process mining activities, i.e., process discovery and conformance checking, can be performed using simple event logs including only sequences of activities. Such event logs, which do not contain other attributes, seem to be safe. However, the trace itself, as a complete sequence of activities performed for a case, is considered a sensitive attribute [7]. In this subsection, we demonstrate two types of disclosure risks associated with publishing simple event logs.
Consider the event log shown in Table 2. Assuming that an adversary knows that a patient's data are in the event log, little information about the activities performed for the patient could result in a successful re-identification. For example, if the adversary knows that an injection was performed for a victim patient, the only matching case is 4. We assume that the adversary's Background Knowledge (BK) is a subsequence of activities performed for a victim case which can be considered as the strongest assumable knowledge w.r.t. the available information in simple event logs. Thus, the attack model is defined as follows.
Definition 8 (Attack Model): Let L be a simple event log and A_L be the set of activities in L. We formalize the attack model as a function match_L : A_L* → B(A_L*), s.t., for bk ∈ A_L*, match_L(bk) = [σ ∈ L | bk ⊑ σ] retrieves the traces of L that contain bk as a subsequence.
For example, if the adversary knows that bk = ⟨HO, BT⟩ is a subsequence of activities performed for a case, case 2 is the only matching case. Once a case is re-identified, the complete sequence of activities performed for the case is disclosed, which is considered sensitive information. The strength of such an attack highly depends on the strength of the background knowledge, which can be quantified based on the length of the sequence, the so-called size of BK [7].
Definition 9 (Background Knowledge Candidates): Let L be a simple event log. Given l ∈ N_{>0} as the size of background knowledge, cand_l(L) = {σ ∈ A_L* | |σ| = l} denotes the set of candidates for the background knowledge of size l.
For example, given Table 2 as the simple event log L, cand_1(L) = {RE, VI, HO, BT, IN, RL}. In [7], the authors introduce two main types of disclosure risks associated with such an attack: case disclosure and trace disclosure.
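The attack model and the background knowledge candidates can be illustrated with a short sketch. It assumes that matching means "the background knowledge is a subsequence of the trace", as described above, and that a simple event log is a Counter of trace variants; the function names and the toy log are hypothetical.

```python
from collections import Counter
from itertools import product


def is_subsequence(bk, trace):
    """True if bk can be obtained from trace by deleting elements."""
    it = iter(trace)
    return all(a in it for a in bk)


def match(log, bk):
    """Sub-log of traces that contain bk as a subsequence."""
    return Counter({v: k for v, k in log.items() if is_subsequence(bk, v)})


def candidates(log, size):
    """All activity sequences of the given length over the activities of the log."""
    acts = sorted({a for v in log for a in v})
    return [tuple(p) for p in product(acts, repeat=size)]


# Hypothetical simple event log.
toy_log = Counter({("a", "b", "c"): 2, ("a", "d", "c"): 1})
```

In this toy log, the background knowledge ("d",) matches a single trace, which mirrors the re-identification examples given in the text.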

1) CASE DISCLOSURE
The uniqueness of traces w.r.t. the background knowledge of size l is used to measure the corresponding case disclosure in an event log. Equation (1) calculates the case disclosure which is the average uniqueness based on the candidates of background knowledge.
FIGURE 2. The DFG of the event log shown in Table 2. The nodes represent activities, and the arcs represent the directly follows relations.
Uniqueness alone does not capture every risk. Consider a scenario where, for a candidate of BK, several identical traces in the event log match the knowledge. Since all the matching traces are the same, the adversary learns the trace of the victim case without needing to single out a specific trace. Trace disclosure is therefore defined based on the entropy of the matching traces: the lower the entropy of the matching traces, the higher the trace disclosure risk.

2) TRACE DISCLOSURE
The entropy of matching traces w.r.t. the background knowledge of size l is used to measure the corresponding trace disclosure in an event log. Equation (2) calculates the trace disclosure, where max_ent(match_L(bk)) is the maximal entropy for the matching traces, which is achieved when all the matching traces are unique.
Note that in both Equations (1) and (2), equal weights are considered for the candidates of background knowledge. However, one can consider different weights based on different criteria, e.g., the sensitivity of activities. Moreover, the worst cases can be used rather than average values, i.e., the maximal uniqueness in Equation (1) or the minimal entropy in Equation (2).
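Since the exact forms of Equations (1) and (2) are defined in [7], the following sketch shows only one plausible reading of the two measures with equal weights. It assumes that the uniqueness of a candidate is 1/|match_L(bk)| (and 0 if nothing matches), and that the trace disclosure of a candidate is 1 − ent/max_ent of its matching traces, with a single matching trace counted as fully disclosed. These conventions, the function names, and the toy log are assumptions, not the authors' definitions.

```python
import math
from collections import Counter
from itertools import product


def is_subsequence(bk, trace):
    it = iter(trace)
    return all(a in it for a in bk)


def match(log, bk):
    return Counter({v: k for v, k in log.items() if is_subsequence(bk, v)})


def candidates(log, size):
    acts = sorted({a for v in log for a in v})
    return [tuple(p) for p in product(acts, repeat=size)]


def ent(log):
    n = sum(log.values())
    return -sum((k / n) * math.log2(k / n) for k in log.values())


def case_disclosure(log, size):
    """Average uniqueness over all candidates (assumed: 1/|match|, 0 if no match)."""
    cands = candidates(log, size)
    total = 0.0
    for bk in cands:
        n = sum(match(log, bk).values())
        if n:
            total += 1.0 / n
    return total / len(cands)


def trace_disclosure(log, size):
    """Average of 1 - ent/max_ent over all candidates (assumed conventions)."""
    cands = candidates(log, size)
    total = 0.0
    for bk in cands:
        m = match(log, bk)
        n = sum(m.values())
        if n == 0:
            continue              # no matching trace, nothing disclosed
        if n == 1:
            total += 1.0          # a single matching trace is fully disclosed
        else:
            total += 1.0 - ent(m) / math.log2(n)
    return total / len(cands)


# Hypothetical simple event log.
toy_log = Counter({("a", "b"): 2, ("a", "c"): 1})
```

Replacing the averages by a maximum over the candidates would give the worst-case variants mentioned above.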
Unsurprisingly, event logs containing more information provide more opportunities for attackers. If we consider an event log where traces are sequences of events with more attributes rather than only activities, each attribute could be an attack point or sensitive information. For example, in the event log shown in Table 1, if the adversary knows that, for a victim patient, the visit activity was performed by Doctor 1, the only matching case is 1. Once the case is re-identified, all the other attributes are revealed, e.g., the disease, which is a sensitive attribute.

C. FEDERATED PROCESS MINING (FPM)
Federated process mining has been explored by researchers from different angles and in different contexts, from EDI-supported inter-organizational business processes [3] to the application of Blockchain technology in cross-organizational process mining [8]. The majority of papers in this field focus on the application of process mining in supply chains [9], [10], [11], although supply chains are only one of the types of interoperability among organizations.
In [2], different types of interoperability are introduced (see Figure 3) including (a) chained execution: the process is split into a number of disjoint subprocesses that are executed by different organizations in sequential order, (b) subcontracting: one organization subcontracts subprocesses to other organizations, (c) case transfer: the process description is the same among organizations. However, cases can be transferred among organizations, and at any time, each case resides at exactly one organization, (d) capacity sharing: the process description is the same among organizations and the execution of tasks is distributed among organizations, and (e) loosely coupled: the process is cut into subprocesses, and different subprocesses could be active at the same time.
In most of the above-mentioned types of interoperability, the communication type can be synchronous or asynchronous. In the asynchronous type of communication, a case's process can be simultaneously executed in several organizations. On the contrary, in the synchronous type of communication, a case's process cannot be run in different organizations at the same time. In this paper, we consider the synchronous type of communication and describe our approach based on the different types of interoperability.

III. RELATED WORK
In this section, we provide a short summary of the research that has been done in privacy-preserving process mining, federated process mining, and privacy-aware federated process mining.

A. PRIVACY-PRESERVING PROCESS MINING
Privacy and confidentiality issues in process mining have recently been growing in importance. The work done so far covers different aspects of the topic, ranging from discussing challenges [12], [13], [14], [15] to providing privacy guarantees [16], [17], [18], [19], [20], [21] and privacy quantification [7], [22]. Confidentiality has been introduced as one of the main challenges of Responsible Process Mining (RPM) in [12]. The proposed privacy preservation techniques can be categorized into three different categories: group-based, noise-based, and encryption-based.
The group-based privacy preservation techniques are often based on the concept of k-anonymity [23] and its extensions, e.g., l-diversity [24] and t-closeness [25]. Some examples are as follows. In [19], the authors introduce a group-based privacy preservation technique for preserving the privacy of resources, who are performing activities. In [18] and [26], TLKC-privacy is introduced and extended to deal with high variability issues in event logs for applying group-based anonymization techniques. The noise-based privacy preservation techniques are based on the notion of differential privacy [27]. For example, in [16], [17], [21], [28], and [29], the notion of differential privacy is utilized to provide privacy guarantees in process mining. A general framework for confidentiality in process mining based on encryption and abstraction is proposed in [30].
There is also other work targeting other aspects of privacy and confidentiality in process mining. Some examples are as follows. In [13], the authors provide an overview of privacy challenges for process mining in human-centered industrial environments. The data privacy and utility requirements for healthcare event data are discussed in [14]. In [31], the authors propose a solution that allows the outsourcing of process mining while ensuring confidentiality. In [32], the goal is to propose a privacy-preserving system design for process mining. A privacy-preserving method for discovering roles from event logs is introduced in [33]. The risks regarding privacy degradation of privacy preservation techniques when event data are continuously published are discussed in [34] and [35]. In [22], the authors propose a measure to evaluate the re-identification risk of event logs. Also, in [7], a general privacy quantification framework and some measures are introduced to evaluate the effectiveness of privacy preservation techniques. In [36], the authors propose a privacy extension for the XES standard (www.xes-standard.org) to manage privacy metadata. Some tools have also been provided to support the proposed techniques in practice such as [37], [38], and [39].

B. FEDERATED PROCESS MINING
In [2], inter-organizational process mining is explained and multiple categories of inter-organizational data flows are characterized. In [3], EDI messages are used to illustrate a case study of effective inter-organizational process mining in the automobile industry. In [4], the authors focus on improving the performance aspect of the process by utilizing the insights gained from cross-organizational process mining. In [40], the authors propose an approach to compare collections of process models and their event logs recorded in different Dutch municipalities. Furthermore, cloud computing [41] and blockchains [8] have been recognized as opportunities within the cross-organizational process mining context. In [9], an approach is proposed to discover distributed processes in supply chains. In [10], the authors describe basic patterns to capture modeling concepts that arise commonly in supply chains. In [11], the authors focused on different case notions in supply chains where objects are grouped, and in [5], the so-called federated process mining has been introduced to enable cross-organizational process mining by providing a framework that recommends event log abstractions.
VOLUME 11, 2023

C. PRIVACY-AWARE FEDERATED PROCESS MINING
Most related to our work are [42], [43], and [44]. In [42], the authors propose a technique based on secure multi-party computation for discovering directly follows graphs considering only two parties. In [43], the authors propose a framework for sharing public process models and discovering organization-specific process models from multiple parties which requires a trusted third party. In [44], the authors propose an approach for discovering process models in inter-organizational settings. This approach relies on a (semi) trusted third party and uses secure multi-party computation algorithms, e.g., for securely computing unions and aggregated values.

IV. PRIVACY AND UTILITY ANALYSIS
In this section, we analyze the data utility and privacy issues when directly follows relations are shared instead of an event log. We propose the risk-aware reveal method to overcome the data utility shortcomings.

A. PRIVACY ANALYSIS
The disclosure risks demonstrated in Subsection II-B are based on sequences of activities. Thus, it seems that removing the concept of trace by using the abstraction function abs_DFR, which maps an event log onto a multiset of DFRs, eliminates such risks. However, similar risk analyses can be done based on DFGs obtained from DFRs.
As demonstrated in Definition 7, the main advantage of sharing abstractions such as DFRs is that they impose uncertainty regarding original event logs. However, there may be situations where certain information about original event logs can be revealed. In the following, we explain such a situation. Given an event log L, the complete paths on DFG_L, i.e., the paths from the start node to the end node, represent trace variants that may or may not be the variants of L. However, given an activity a ∈ A_L as a node of DFG_L, if there exists only one complete path on DFG_L that contains a, then that path represents a trace variant of the original log L. For example, given the event log shown in Table 2, only one complete path on its DFG contains the activity IN. Consider a scenario where the background knowledge of an adversary contains an activity that satisfies the above-mentioned condition. As a result, the whole sequence of activities performed for a victim case is disclosed. For instance, if the background knowledge of an adversary is bk = ⟨IN⟩, then the only matching path is ⟨▶, RE, IN, RL, ■⟩, which is the trace variant of case 4.
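The condition described above can be checked mechanically. The following sketch enumerates the complete paths of a DFG and reports the activities that lie on exactly one of them. It assumes an acyclic DFG (with loops, the number of complete paths is unbounded, and the sketch only follows paths without repeated nodes); the function names and the arcs are hypothetical and do not reproduce Figure 2.

```python
START, END = "▶", "■"


def complete_paths(edges, start=START, end=END):
    """All paths from the start node to the end node without repeated nodes."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, []).append(b)
    paths, stack = [], [(start,)]
    while stack:
        path = stack.pop()
        if path[-1] == end:
            paths.append(path)
            continue
        for nxt in succ.get(path[-1], []):
            if nxt not in path:
                stack.append(path + (nxt,))
    return paths


def uniquely_exposed(edges):
    """Map each activity that lies on exactly one complete path to that path."""
    paths = complete_paths(edges)
    exposed = {}
    for a in {x for p in paths for x in p[1:-1]}:
        containing = [p for p in paths if a in p]
        if len(containing) == 1:
            exposed[a] = containing[0]
    return exposed


# Hypothetical DFG arcs (frequencies omitted).
toy_edges = {
    (START, "a"), ("a", "b"), ("a", "c"), ("b", "d"),
    ("c", "d"), ("d", END), ("a", "e"), ("e", END),
}
exposed = uniquely_exposed(toy_edges)
```

In this toy graph, the activities b, c, and e each lie on a single complete path, so knowing any one of them would reveal a full trace variant of the underlying log.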
Such scenarios are more relevant to the disclosure risk analysis of results [45] and privacy preservation techniques for result protection that are out of the scope of this paper.

B. DATA UTILITY LIMITATIONS
Abstracting the control-flow aspect of an event log from a simple event log to a multiset of DFRs implies several data utility limitations such that even the most straightforward analyses that are based on traces cannot be performed. Two examples are shown below:
• The most frequent traces in an event log. For example, in the healthcare context, it may be important to know what are the most frequent sequences of activities performed for patients.
• All traces that include a particular sequence or set of activities. For example, it may be helpful to know what is the process of treatment for patients who had a blood test before being visited by doctors.
Obviously, it is also impossible to answer inquiries that are based on the attributes removed from an original event log by simplifying the event log. In the following, we provide two types of such inquiries:
• All traces with a certain case or event attribute. It may be helpful to know the process of treatment for patients of a specific range of age, or the patients who are visited by particular doctors.
• A set of attributes based on other attributes. For example, it may be important to know the set of activities performed by a particular doctor, or the set of joint activities performed by a set of doctors.
We categorize such queries into two main categories: Trace-Based Queries (TBQs) and Attribute-Based Queries (ABQs). All the queries expecting traces as responses are considered as trace-based, while attribute-based queries are those that expect event or case attributes as responses. In the following, we introduce the risk-aware reveal method that can be used to answer such in-depth questions regarding a process and mitigate the aforementioned utility limitations.
Figure 4 shows the general overview of our abstraction-based approach for privacy-aware process mining in an intra-organizational setting. DFRs, as an abstraction of the control-flow aspect, are shared with process analysts to perform process discovery and get control-flow insights.
To answer more in-depth questions triggered by control-flow insights, the Risk-Aware Reveal Method (RARM) can be used. RARM can provide more information in a selective manner to answer more in-depth questions regarding original traces and also removed attributes from an original event log. It can provide responses without revealing data about cases that are irrelevant to addressing a specific query.
The main component of RARM is the so-called Risk-Aware Reveal Engine (RARE). For each query, RARE first provides a response in the trusted environment. Then, it does a risk analysis before composing a response to the untrusted environment. If the risks are above a preset threshold, the engine refuses to answer the query. For TBQs, the result of a query is a multiset of traces, and RARE does case disclosure and trace disclosure analyses. For ABQs, RARE checks the sensitivity of the attribute of interest in the query based on the predefined set of sensitive attributes. If the attribute in the query is non-sensitive, the result can be shared. However, the queries regarding the sensitive attributes are refused. In Subsection V-G, we demonstrate the usage of RARM in inter-organizational settings, where several organizations have to be involved to provide answers for trace-based and attribute-based queries.
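The decision logic of RARE can be sketched as follows. The sketch is deliberately simplified: the risk score unique_share is a hypothetical stand-in for the case disclosure and trace disclosure analyses, refusal is modeled by returning None, and all function names, attribute names, and toy data are assumptions made for illustration.

```python
from collections import Counter


def unique_share(result):
    """Hypothetical risk score: share of returned traces whose variant occurs once."""
    n = sum(result.values())
    return sum(k for k in result.values() if k == 1) / n if n else 0.0


def answer_tbq(log, predicate, threshold, risk=unique_share):
    """Trace-based query: compute the result in the trusted environment,
    then release it only if the risk does not exceed the threshold."""
    result = Counter({v: k for v, k in log.items() if predicate(v)})
    return result if risk(result) <= threshold else None


def answer_abq(events, attribute, sensitive):
    """Attribute-based query: refuse if the requested attribute is sensitive."""
    if attribute in sensitive:
        return None
    return {e[attribute] for e in events}


# Hypothetical data held in the trusted environment.
toy_log = Counter({("a", "b"): 3, ("a", "c"): 1})
toy_events = [
    {"case": "c1", "resource": "r1", "disease": "x"},
    {"case": "c2", "resource": "r2", "disease": "y"},
]
```

With a threshold of 0.5, the query for traces containing b is answered, whereas the query for traces containing c is refused because its result consists of a single unique trace.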

V. PRIVACY-AWARE FEDERATED PROCESS MINING
In this section, we expand our abstraction-based approach to enable Privacy-Aware Federated Process Mining (PAFPM) which has the following main properties. It does not limit the number of parties, it does not need a trusted third party, it supports privacy at both the individual level and the organizational level, and it does not require designing communication protocols among parties.

A. PROBLEM STATEMENT
For describing the main approach, we assume that privacy concerns are at the level of individuals. Particularly, as explained in Subsection II-B, traces are considered as sensitive private information. Nevertheless, we later explain that the approach can also support privacy at the level of organizations. We also consider the following standard assumptions.
The sets of activities of organizations are disjoint, and the involved organizations share the same set of case identifiers for the joint cases. C joint ⊆C denotes the set of joint case identifiers. Note that organizations may use different case identifiers for the joint cases in their internal environment and use a mapping to map the shared identifiers to the internal ones. There are two main challenges when abstractions are shared rather than entire event logs: (C1) How to merge abstractions from different organizations such that the merged abstraction is the same as the abstraction that can be obtained by applying the same abstraction function to the merged event logs, and (C2) How to answer more in-depth questions regarding the information not included in the shared abstractions.
Definition 10 (Merging Abstractions Challenge): Let O be the universe of organization identifiers, and LA be the universe of abstractions. Consider C_L = {L_1, L_2, …, L_n} as an event log collection, where L_i ∈ B(A*) for 1 ≤ i ≤ n. Assume C_LA = {la_1, la_2, …, la_n} as the collection of abstractions, where la_i = abs(L_i) ∈ LA represents the abstraction of L_i belonging to the organization o_i ∈ O using abs as an abstraction function. If merge(C_LA) ∈ LA is an overall abstraction obtained by merging the individual abstractions, then merge(C_LA) has to be the same as abs(C_L).
Since we consider DFRs as abstractions, the challenge of merging abstractions C1 is specialized to the challenge of merging DFRs. In inter-organizational process mining, merging DFRs is a challenge because of the missing so-called handover relations related to the interconnections among the organizations. Thus, to address challenge C1, we first define the concept of handovers. Then, we demonstrate the process of retrieving missing handover relations, and finally we explain the merging process based on the different types of interoperability described in Subsection II-C. We adapt RARM in the inter-organizational setting to address challenge C2.
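The following sketch illustrates why simply summing the DFRs of the individual organizations does not solve challenge C1. It assumes a chained execution with one joint case, where the first organization executes a1 and a2 and then hands the case over to the second organization, which executes b1 and b2. The activity names and the joint log are hypothetical.

```python
from collections import Counter

START, END = "▶", "■"


def dfr(log):
    """Multiset of directly follows relations of a simple event log."""
    total = Counter()
    for variant, freq in log.items():
        path = (START, *variant, END)
        for rel in zip(path, path[1:]):
            total[rel] += freq
    return total


# Each organization only sees its own part of the joint case.
log_org1 = Counter({("a1", "a2"): 1})
log_org2 = Counter({("b1", "b2"): 1})
# Assumed joint trace that a (non-existent) global observer would see.
joint_log = Counter({("a1", "a2", "b1", "b2"): 1})

naive_merge = dfr(log_org1) + dfr(log_org2)
target = dfr(joint_log)

missing = target - naive_merge    # relations that only the joint trace contains
spurious = naive_merge - target   # dummy start/end relations at the border
```

Here, missing contains the handover relation (a2, b1), and spurious contains the end relation of the first organization and the start relation of the second one, so the naive merge differs from the abstraction of the joint log.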

B. HANDOVERS
A so-called handover happens when a case moves from one organization to another. A directly follows relation indicating a handover is called a handover relation, and the involved activities are called handover activities. The first handover activity of a handover relation is called the handover to activity that hands over a case to another organization. The second handover activity is called the handover from activity that receives the handed-over case from another organization.
Definition 11 (Handover Relation (HoR)): Let L and L ′ be two simple event logs belonging to two organizations involved in a joint process, C joint ⊆ C be the set of joint cases, and c 1 ∈ C joint . Consider σ c 1 = ⟨a 1 , a 2 , . . . , a n ⟩ as the trace of case c 1 in L, and σ ′ c 1 = ⟨b 1 , b 2 , . . . , b m ⟩ as the trace of case c 1 in L ′ . hor = (a i , b j ) is a handover relation, s.t., 1 ≤ i ≤ n, 1 ≤ j ≤ m, a i ∈ σ c 1 , and b j ∈ σ ′ c 1 . Given hor = (a, b) as a handover relation, π 1 (hor) = a and π 2 (hor) = b are the projection functions, and considering a ′ ∈ A as an activity, set 1 (hor, a ′ ) = (a ′ , b) assigns the activity a ′ to the first handover activity, and set 2 (hor, a ′ ) = (a, a ′ ) assigns a ′ to the second handover activity.
Since each organization has only access to its own event log, merged DFRs obtained by sharing the DFRs of different organizations in an inter-organizational setting do not include handover relations. To retrieve such missing relations, the organizations need to share the so-called handover tables together with DFRs. A handover table is a collection of handover records that are defined as follows.
Definition 12 (Handover Record): Let O be the universe of organization identifiers including ⊥ as the null identifier. A handover record is a tuple rec = (id, c, o 1 , a, o 2 , o 3 ) where id ∈ N >0 is the incremental identifier of the record, c ∈ C is the case identifier of the case involved in the handover, o 1 ∈ O is the identifier of the organization that generated the handover record, a ∈ A is a handover activity, o 2 ∈ O indicates the organization that handed over the case to o 1 , and o 3 ∈ O indicates the organization to which o 1 hands over the case. HR = N >0 × C × O × A × O × O denotes the universe of handover records.
Given rec = (id, c, o 1 , a, o 2 , o 3 ), π id (rec) = id, π case (rec) = c, π org (rec) = o 1 , π act (rec) = a, π from (rec) = o 2 , and π to (rec) = o 3 are the projections of the record. Note the following constraints in Definition 12. An organization cannot have a handover with itself, and a handover record has to indicate one and only one type of handover activity, i.e., handover from or handover to.
Consider the chained execution type of interoperability, where cases can move from one organization predictably to the next one. An example of this type of interoperability in the healthcare context could be a patient arriving at the emergency room, receiving a sepsis treatment, and ultimately a specialty check-up. Figure 5 shows example event logs for such a scenario, and Figure 6 shows the handover tables of the event logs in Figure 5. For example, the first record of EC's handover table shows that by performing the IVA activity, EC hands over a case to ST. Consequently, the first record of ST's handover table shows that by performing the REG activity, ST receives the handed-over case from EC. Note that the records of a handover table must be inserted in the order in which the handovers happen in reality.
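The record structure of Definition 12 and its two constraints can be sketched in Python as follows. The class name HandoverRecord, the field names, and the function is_valid are hypothetical; None is assumed to stand for the null identifier ⊥. The two example records mirror the first records of EC and ST described above.

```python
from typing import NamedTuple, Optional

class HandoverRecord(NamedTuple):
    rec_id: int              # incremental identifier (> 0)
    case_id: str             # case involved in the handover
    org: str                 # o1: organization that generated the record
    act: str                 # handover activity
    from_org: Optional[str]  # o2: organization that handed the case to o1
    to_org: Optional[str]    # o3: organization to which o1 hands the case

def is_valid(rec: HandoverRecord) -> bool:
    """Check the constraints stated for Definition 12."""
    # Exactly one of "handover from" and "handover to" is specified.
    exactly_one_type = (rec.from_org is None) != (rec.to_org is None)
    # An organization cannot have a handover with itself.
    no_self_handover = rec.org not in (rec.from_org, rec.to_org)
    return rec.rec_id > 0 and exactly_one_type and no_self_handover

# EC hands over case 1 to ST by performing IVA;
# ST receives case 1 from EC by performing REG.
ec_rec = HandoverRecord(1, "1", "EC", "IVA", None, "ST")
st_rec = HandoverRecord(1, "1", "ST", "REG", "EC", None)
```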

C. RETRIEVING HANDOVERS
Algorithm 1 demonstrates the process of retrieving missing handover relations. Before explaining the algorithm, we need to define the selection operations over handover tables.

Algorithm 1 The Process of Retrieving Missing Handover Relations
Input: A collection of handover tables HoTs = {HoT 1 , HoT 2 , · · · , HoT n }
Input: A set of joint case identifiers C joint ⊆ C
Output: A multiset of handover relations HoRs ∈ B(A × A)
foreach c ∈ C joint do
    foreach HoT ∈ HoTs do
        HoT c = φ(HoT , (case, c));
        while HoT c ≠ ∅ do
            hor = (⊥, ⊥);
            rec = min id (HoT c );
            if π to (rec) ≠ ⊥ then
                set 1 (hor, π act (rec));
            else
                HoT from ← get HoT ∈ HoTs where ∃ rec ′ ∈ HoT : π org (rec ′ ) = π from (rec);
                HoT from c = φ(HoT from , (case, c));
                HoT to from c = φ(HoT from c , (to, π org (rec)));
                rec to = min id (HoT to from c );
                set 1 (hor, π act (rec to ));
                remove rec to from HoT from ;
            end
            if π from (rec) ≠ ⊥ then
                set 2 (hor, π act (rec));
            else
                HoT to ← get HoT ∈ HoTs where ∃ rec ′ ∈ HoT : π org (rec ′ ) = π to (rec);
                HoT to c = φ(HoT to , (case, c));
                HoT from to c = φ(HoT to c , (from, π org (rec)));
                rec from = min id (HoT from to c );
                set 2 (hor, π act (rec from ));
                remove rec from from HoT to ;
            end
            remove rec from HoT and HoT c ;
            add hor to HoRs;
        end
    end
end
return HoRs;

min id (HoT ) and max id (HoT ) retrieve the records with the minimum and maximum id in HoT , respectively.
The retrieving process can be started from the first record of any case in any of the handover tables. Note that handover records in handover tables are ordered based on the timestamps of the handover activities. The handover activity of the record is considered as the handover from (to) activity of a handover relation if the handover from (to) organization of the record is specified. The missing handover activity of the handover relation is retrieved by referring to the first corresponding record of the case in the handover table of the organization specified as the handover from (to) organization of the starting record. The processed records are removed from the handover tables of the organizations, and this process continues until all the handover tables become empty.
For example, the retrieving process for a handover relation w.r.t. the handover tables shown in Figure 6 is as follows. Starting the process from case 1 in the handover table of EC (Figure 6 (a)), since the handover to organization is specified, IVA is considered as the first handover activity, i.e., the handover to activity. To retrieve the second handover activity, i.e., the handover from activity, first, the handover table of the organization specified as the handover to organization is obtained (the handover table of ST). In the handover table of ST (Figure 6 (b)), all the records of case 1 are obtained. The handover activity of the first of these records where the handover from organization is EC (i.e., REG) is considered as the second handover activity.
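The retrieval procedure can be sketched in Python as follows. This is an illustrative reading of Algorithm 1, not the published implementation: handover tables are assumed to be lists of dictionaries, None stands for ⊥, the helper names rec and retrieve_handovers are hypothetical, and the tables are assumed to be consistent (every record has a partner record). The toy tables follow the chained execution example (IVA to REG, REL to DCO) for a single case.

```python
from collections import Counter

def rec(rec_id, case, org, act, frm=None, to=None):
    """Build a handover record as a dict; None stands for ⊥."""
    return {"id": rec_id, "case": case, "org": org, "act": act,
            "frm": frm, "to": to}

def retrieve_handovers(hots, joint_cases):
    """Return a Counter (multiset) of handover relations (a, b).

    hots maps an organization identifier to its handover table.
    The caller's tables are left untouched.
    """
    tables = {org: sorted(hot, key=lambda r: r["id"])
              for org, hot in hots.items()}
    hors = Counter()
    for c in joint_cases:
        for org, table in tables.items():
            while True:
                own = [r for r in table if r["case"] == c]
                if not own:
                    break
                first_rec = own[0]  # record with the minimum id for case c
                if first_rec["to"] is not None:
                    # Handover-to record: the partner is the earliest record
                    # of the receiving organization naming `org` as sender.
                    partner_org, field = first_rec["to"], "frm"
                else:
                    # Handover-from record: the partner is the earliest record
                    # of the sending organization naming `org` as receiver.
                    partner_org, field = first_rec["frm"], "to"
                partner = next(r for r in tables[partner_org]
                               if r["case"] == c and r[field] == org)
                if first_rec["to"] is not None:
                    hor = (first_rec["act"], partner["act"])
                else:
                    hor = (partner["act"], first_rec["act"])
                hors[hor] += 1
                table.remove(first_rec)
                tables[partner_org].remove(partner)
    return hors

hots = {
    "EC": [rec(1, "1", "EC", "IVA", to="ST")],
    "ST": [rec(1, "1", "ST", "REG", frm="EC"),
           rec(2, "1", "ST", "REL", to="SC")],
    "SC": [rec(1, "1", "SC", "DCO", frm="ST")],
}
hors = retrieve_handovers(hots, ["1"])
```

Because the tables are sorted by id before processing, taking the first matching record corresponds to the min id selection in the algorithm.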
The retrieved handover relations need to be added to the merged DFRs obtained through sharing DFRs by each individual organization. Figure 7 shows the overall abstraction merging process to obtain the original DFRs including handover relations. In the following, we demonstrate the update operation for the different types of interoperability.

D. THE UPDATE OPERATION FOR CHAINED EXECUTION
We first demonstrate the problem that arises from not sharing handover tables. Consider the event logs shown in Figure 5 as the event logs of a chained execution scenario. Figure 8 (a), (b), and (c) show the DFRs of these event logs. Figure 8 (d) shows the frequency-annotated DFG obtained from the merged DFRs. One can see that the resulting graph does not reflect the real paths followed by the patients in the event logs. For example, the DFG includes three start activities, while ERR is the only start activity for all the patients. That is because handover relations have not been captured. For example, the handover relations between REL in ST and DCO in SC have been replaced with (REL, ■) and (▶, DCO). Such missing relations can be retrieved by sharing handover tables.
Since we consider the synchronous type of communication (see Subsection II-C), in chained execution scenarios, handovers cannot happen in the middle of an intra-organizational trace. Thus, in the merged DFRs, each retrieved handover relation replaces one end relation and one start relation that match it. For example, (IVA, REG) is a handover relation retrieved by processing the first record of EC and ST in Figure 6. In the merged DFRs, (IVA, ■) and (▶, REG) are replaced with this relation. We call this specific type of update a coupling update.
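A minimal Python sketch of such a coupling update is given below. It is an illustration rather than the formal definition: DFRs are assumed to be a collections.Counter over activity pairs, the function name coupling_update is hypothetical, and the intra-organizational traces used to build the toy input (⟨ERR, IVA⟩ for EC, ⟨REG, REL⟩ for ST, ⟨DCO⟩ for SC) are assumed for the example.

```python
from collections import Counter

START, END = "▶", "■"

def coupling_update(m_dfrs, hors):
    """For each handover (a, b), drop (a, ■) and (▶, b) and add (a, b)."""
    updated = Counter(m_dfrs)
    for (a, b), count in hors.items():
        updated[(a, END)] -= count
        updated[(START, b)] -= count
        updated[(a, b)] += count
    return +updated  # unary plus drops zero and negative counts

# Merged DFRs of the three organizations without handover relations.
m_dfrs = Counter({
    (START, "ERR"): 1, ("ERR", "IVA"): 1, ("IVA", END): 1,   # EC
    (START, "REG"): 1, ("REG", "REL"): 1, ("REL", END): 1,   # ST
    (START, "DCO"): 1, ("DCO", END): 1,                      # SC
})
hors = Counter({("IVA", "REG"): 1, ("REL", "DCO"): 1})
coupled = coupling_update(m_dfrs, hors)
```

After the update, the toy DFRs have a single start relation and a single end relation, as expected for a chained execution.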

E. THE UPDATE OPERATION FOR SUBCONTRACTING
In the subcontracting type of interoperability, an organization hands over a part of the process to a sub-organization or a third party. In this type of interoperability, handovers can happen in the middle of an intra-organizational trace. Note that we still assume the synchronous type of communication, i.e., a case's process cannot be continued in the main organization while it runs in a sub-organization. An example of this type of interoperability in the healthcare context is a patient who is registered at emergency care, goes to a laboratory for certain tests, and returns to the emergency care.
Consider the event logs shown in Figure 10 as the event logs for such a scenario. Figure 11 shows the handover tables for this scenario, and Figure 12 (a) and (b) show the DFRs. One can see that due to missing handover relations, the graph obtained from the merged DFRs does not reflect the real paths followed by the patients in the event logs. For example, the handover relations between IVA in EC and REG in LB and between RES in LB and DCO in EC have been replaced with (IVA, DCO), (▶, REG), and (RES, ■).
The process of updating the merged DFRs using handovers is based on two main properties of the synchronous subcontracting scenarios: (P1) handovers happen in the middle of traces of the main organization that outsources part of the process, and (P2) a case's process in a sub-organization starts by receiving the first handover from the main organization, and it ends with the last handover to the main organization. P1 implies that one directly follows relation of the main organization needs to be replaced with two handover relations, and P2 shows that the start and end relations in a sub-organization need to be removed. For example, (IVA, REG) and (RES, DCO) are the handover relations retrieved by processing the handover tables shown in Figure 11. In the merged DFRs, (IVA, DCO), (▶, REG), and (RES, ■) are replaced with these relations. We call this specific type of update a decoupling update.
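A minimal Python sketch of such a decoupling update follows. It is an illustration rather than the formal definition and makes several assumptions: DFRs are a collections.Counter over activity pairs, each case visits the sub-organization once, and the handover relations are supplied as (outgoing, returning) pairs. The function name decoupling_update and the toy traces (⟨ERR, IVA, DCO⟩ for EC and ⟨REG, RES⟩ for LB) are hypothetical.

```python
from collections import Counter

START, END = "▶", "■"

def decoupling_update(m_dfrs, handover_pairs):
    """Apply properties P1 and P2 for each (outgoing, returning) pair.

    An outgoing handover (a, b) moves the case to the sub-organization and
    a returning handover (c, d) moves it back to the main organization.
    """
    updated = Counter(m_dfrs)
    for (a, b), (c, d) in handover_pairs:
        updated[(a, d)] -= 1      # P1: artificial relation in the main org
        updated[(START, b)] -= 1  # P2: artificial start in the sub-org
        updated[(c, END)] -= 1    # P2: artificial end in the sub-org
        updated[(a, b)] += 1      # add the outgoing handover relation
        updated[(c, d)] += 1      # add the returning handover relation
    return +updated  # unary plus drops zero and negative counts

m_dfrs = Counter({
    (START, "ERR"): 1, ("ERR", "IVA"): 1, ("IVA", "DCO"): 1, ("DCO", END): 1,  # EC
    (START, "REG"): 1, ("REG", "RES"): 1, ("RES", END): 1,                     # LB
})
pairs = [(("IVA", "REG"), ("RES", "DCO"))]
decoupled = decoupling_update(m_dfrs, pairs)
```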

F. THE UPDATE OPERATION FOR CASE TRANSFER, CAPACITY SHARING, AND LOOSELY COUPLED
In this subsection, we focus on the case transfer, capacity sharing, and loosely coupled types of interoperability. In all three scenarios, organizations can take part in the events of a process in random order. Thus, the main difference between these scenarios and the ones discussed in the previous subsections is that handovers resulting from these types of interoperability happen randomly, and they do not follow any specific rule.
Random handovers mean that we cannot follow a specific rule to update merged DFRs with missing handovers. Hence, extra information needs to be provided by the organizations involved in the process. In particular, each organization needs to specify the DFRs involved in handovers. A DFR is involved in handovers if its non-dummy activities are involved in handovers. For instance, in the event logs shown in Figure 10, (IVA, DCO) is indicated as a DFR involved in handovers because IVA is involved in a handover from EC to LB, and DCO is involved in a handover from LB to EC. (▶, REG) is also involved in handovers because REG, as a non-dummy activity, is involved in a handover from EC to LB.
Given dfr h = ((a, b), i) ∈ ((A ∪ {▶}) × (A ∪ {■})) × {0, 1} as a DFR with a handover indicator, π dfr (dfr h ) = (a, b) and π hor (dfr h ) = i are the projections of dfr h onto the directly follows relation and the handover indicator, respectively.
Note that indicating handover DFRs does not reveal sensitive information. It only imposes extra effort on the organizations, yet, at the same time, it drastically simplifies the process of updating merged DFRs with missing handover relations. Definition 18 demonstrates the process of updating merged DFRs with handover relations, where the DFRs involved in handovers are indicated.
Definition 18 (Update): Let L 1 , L 2 , . . . , L n be simple event logs belonging to n organizations involved in a process, HoRs ∈ B(A × A) be a multiset of handover relations among the organizations, and mDFRs = abs DFR h (L 1 ) ⊎ abs DFR h (L 2 ) ⊎ . . . ⊎ abs DFR h (L n ) be the merged DFRs. update(mDFRs, HoRs) updates mDFRs based on HoRs as follows:
update(mDFRs, HoRs) = [π dfr (dfr h ) | dfr h ∈ mDFRs ∧ π hor (dfr h ) = 0] ⊎ HoRs
The general idea of updating merged DFRs is to add missing handover relations and remove the wrong DFRs added because of unknown interconnections among organizations. In Definition 15 and Definition 16, we exploited some properties of the interconnections to update merged DFRs with the minimum available information. However, when there is no specific property for interconnections, we utilize DFRs with handover indicators to update DFRs. Since the DFRs involved in handovers are indicated, one can simply remove all of them and add missing handover relations.
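The update of Definition 18 maps directly onto multiset operations. The following Python sketch assumes that DFRs with handover indicators are stored in a collections.Counter keyed by ((a, b), i). The activity names p, q, r, s, x, y, z and the indicator values are a hand-made toy example: a main organization with trace ⟨p, q, r, s⟩ hands the case over after q and receives it back at r, while the other organization performs ⟨x, y, z⟩.

```python
from collections import Counter

START, END = "▶", "■"

def update(m_dfrs_h, hors):
    """Keep the DFRs whose handover indicator is 0 and add the handovers."""
    kept = Counter()
    for (dfr, indicator), count in m_dfrs_h.items():
        if indicator == 0:
            kept[dfr] += count
    return kept + hors

main_org = Counter({
    ((START, "p"), 0): 1, (("p", "q"), 0): 1, (("q", "r"), 1): 1,
    (("r", "s"), 0): 1, (("s", END), 0): 1,
})
other_org = Counter({
    ((START, "x"), 1): 1, (("x", "y"), 0): 1,
    (("y", "z"), 0): 1, (("z", END), 1): 1,
})
hors = Counter({("q", "x"): 1, ("z", "r"): 1})

updated = update(main_org + other_org, hors)

# DFRs of the joint trace, for comparison.
joint = [START, "p", "q", "x", "y", "z", "r", "s", END]
expected = Counter(zip(joint, joint[1:]))
```

In this toy example, the updated multiset coincides with the DFRs of the joint trace ⟨p, q, x, y, z, r, s⟩.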

G. RARM FOR FEDERATED PROCESS MINING
In this subsection, we demonstrate the risk-aware reveal method for answering more in-depth questions about a process. We explain the general approach for two main types of queries, i.e., attribute-based and trace-based.

1) ATTRIBUTE-BASED QUERIES
A query is sent to the risk-aware reveal engines (RAREs) of all the involved organizations.

2) TRACE-BASED QUERIES
To get a complete response for the trace-based queries, a process analyst may need to send several queries to different organizations in a specific order depending on the responses received from each single organization. Similar to the attribute-based queries, a query is first sent to the risk-aware reveal engines of all the involved organizations. RARE of each organization verifies the risk associated with a response. If the risk is above a predefined threshold, the RARE refuses the query. Otherwise, it provides a response with the corresponding handover tables for the cases whose data are included in the response. Such handover tables are utilized by the process analyst to get the possible missing pieces of the response from other organizations. A response provided for the trace-based queries by an organization with the identifier o ∈ O is a set Res o ⊆ C × A * × 2 HR (see Definition 22).
Consider the chained execution type of interoperability, the event logs shown in Figure 5, and the following query: what are the traces of cases whose treatment process contains IVA? Assuming that the risks are acceptable for the organizations, EC, ST, and SC each provide a response. The process of sending queries stops when both the launcher and terminator organizations of the cases in all the responses are visited. Given a case c, the launcher organization is the one that starts the process of the case c (Definition 20), and the terminator organization is the one that ends the process of the case c (Definition 21). For each case, the process analyst joins the traces provided by the organizations, from the launcher organization to the terminator organization.
Definition 20 (Launcher Organization): Let c ∈ C joint be a case, and HoTs = {HoT 1 , HoT 2 , · · · , HoT n } be a collection of handover tables of the organizations involved in the process of case c. An organization with the handover table HoT i ∈ HoTs is the launcher organization of case c iff π from (min id (φ(HoT i , (case, c)))) = ⊥.
Definition 21 (Terminator Organization): Let c ∈ C joint be a case, and HoTs = {HoT 1 , HoT 2 , · · · , HoT n } be a collection of handover tables of the organizations involved in the process of case c. An organization with the handover table HoT i ∈ HoTs is the terminator organization of case c iff π to (max id (φ(HoT i , (case, c)))) = ⊥.
The response provided regarding a case c ∈ C joint is Res c = σ 1 ⊕ σ 2 ⊕ · · · ⊕ σ n , where (c, σ i , HoT i ) ∈ Res o i , o 1 is the launcher organization of case c, and o n is the terminator organization of the case.
Figure 14 shows an overview of our approach for PAFPM for all the types of interoperability where cases can be shared among organizations. The general approach for all the mentioned types of interoperability is the same. The only difference is the update operation applied to the merged DFRs, which depends on the type of interoperability. Note that the generic update definition (Definition 18) can be used for all the types of interoperability if the DFRs involved in handovers are indicated. The federated DFG obtained from DFRs containing handovers can provide a high-level understanding of the joint process. However, to analyze more complicated aspects of the process, e.g., which activities are performed simultaneously, the process discovery step needs to be done.
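Definitions 20 and 21 and the joining of per-organization traces can be sketched in Python for the chained execution case. The sketch makes simplifying assumptions: a response is reduced to one trace per organization for the case, None stands for ⊥, each organization appears once in the chain, and the names Rec, records_of, launcher, terminator, and join_chained as well as the toy traces are hypothetical.

```python
from typing import NamedTuple, Optional

class Rec(NamedTuple):
    rec_id: int
    case_id: str
    org: str
    act: str
    from_org: Optional[str]
    to_org: Optional[str]

def records_of(hot, c):
    """Records of case c in a handover table, ordered by id."""
    return sorted((r for r in hot if r.case_id == c), key=lambda r: r.rec_id)

def launcher(hots, c):
    """Organization whose first record of c has no 'from' organization."""
    return next(org for org, hot in hots.items()
                if records_of(hot, c) and records_of(hot, c)[0].from_org is None)

def terminator(hots, c):
    """Organization whose last record of c has no 'to' organization."""
    return next(org for org, hot in hots.items()
                if records_of(hot, c) and records_of(hot, c)[-1].to_org is None)

def join_chained(responses, hots, c):
    """Concatenate traces from the launcher to the terminator organization."""
    org, last = launcher(hots, c), terminator(hots, c)
    joined = list(responses[org])
    while org != last:
        org = records_of(hots[org], c)[-1].to_org  # follow the handover
        joined += responses[org]
    return joined

hots = {
    "EC": [Rec(1, "1", "EC", "IVA", None, "ST")],
    "ST": [Rec(1, "1", "ST", "REG", "EC", None),
           Rec(2, "1", "ST", "REL", None, "SC")],
    "SC": [Rec(1, "1", "SC", "DCO", "ST", None)],
}
responses = {"EC": ["ERR", "IVA"], "ST": ["REG", "REL"], "SC": ["DCO"]}
joined = join_chained(responses, hots, "1")
```

For subcontracting and the other interoperability types, an organization can contribute several trace segments, so the joining step would need to interleave segments rather than concatenate whole traces.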

VI. IMPLEMENTATION AND EVALUATION
In general, three criteria can be considered for evaluating different aspects of our approach: data utility, privacy, and inter-organizationality. In Section IV, we explained the privacy and data utility aspects of the abstraction used in this paper, i.e., DFRs. Since DFRs remove the concept of trace, they can mitigate the disclosure risks w.r.t. the control-flow aspect (see Subsection II-B). Nevertheless, as we explained, there are situations where the original trace variants of an event log can be revealed based on its DFG. We also demonstrated the data utility shortcomings of abstractions and introduced the risk-aware reveal method to overcome the shortcomings. Since the main focus of this paper is on inter-organizational process mining, in this section, we evaluate the inter-organizationality aspect that also incorporates the other aspects.
We employ Sepsis as a real-life event log for our experiments [46]. Sepsis is an event log recorded by an information system in a hospital that contains 15214 events and 16 unique activities performed for 1050 patients (cases). We demonstrated five different types of interoperability including chained execution, subcontracting, case transfer, capacity sharing, and loosely coupled. Assuming that handover tables are provided by the organizations involved in a process, the most challenging part of the abstraction-based approach is the update operation. The update operation for case transfer, capacity sharing, and loosely coupled relies on the information regarding handover indications in directly follows relations. Given such information, the update operation is a straightforward task. Thus, we mainly focus on the chained execution and subcontracting types of interoperability. We implemented a Python script for our evaluation. The implementation is available as a GitLab repository 1 and can be installed as a Python package. 2

A. SCENARIO DISCOVERY
In this subsection, we demonstrate the process of discovering the chained execution and subcontracting types of interoperability from the Sepsis event log. As described in [47], Sepsis is an event log collected from three main departments: Emergency Room (ER), Laboratory (LB), and Financial (FI). Table 3 shows our categorization for the activities in the Sepsis event log. Note that to avoid having uncategorized activities, our categorization for the activities is more general compared to the categories discussed in [47]. Namely, we consider "Return ER" as an activity performed by the financial department and the admission activities as performed by the laboratory.
To discover the chained execution scenario based on the Sepsis event log, we generalize the activities to their department level. A DFG discovered from the generalized event log shows the paths that patients (cases) follow at the department level. The set of cases following a path that corresponds to the department-level DFG shown in Figure 15 generates a sub-event-log from Sepsis that matches a chained execution scenario. We name this event log Sepsis-CE. By projecting Sepsis-CE onto the activities of each department, we get three event logs for three different departments. We name these event logs Sepsis-CE-ER, Sepsis-CE-LB, and Sepsis-CE-FI.
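The generalization and projection steps can be sketched in Python as follows. The sketch is illustrative only: the activity-to-department mapping is a small hand-made excerpt and is not Table 3, the two toy cases are invented, and the chained path ER, LB, FI is an assumption about the department-level DFG of Figure 15. The function names department_path and project are hypothetical.

```python
# Hypothetical excerpt of an activity-to-department mapping.
DEPARTMENT = {
    "ER Registration": "ER",
    "ER Triage": "ER",
    "CRP": "LB",
    "Admission NC": "LB",
    "Return ER": "FI",
}

def department_path(trace):
    """Generalize activities to departments and collapse consecutive repeats."""
    path = []
    for activity in trace:
        department = DEPARTMENT[activity]
        if not path or path[-1] != department:
            path.append(department)
    return path

def project(trace, department):
    """Keep only the activities of one department."""
    return [a for a in trace if DEPARTMENT[a] == department]

log = {
    "c1": ["ER Registration", "ER Triage", "CRP", "Admission NC", "Return ER"],
    "c2": ["ER Registration", "CRP", "ER Triage"],
}

# Cases whose department-level path matches the assumed chained execution.
chained_cases = {c for c, t in log.items()
                 if department_path(t) == ["ER", "LB", "FI"]}

# Per-department event logs obtained by projection.
sub_logs = {d: {c: project(log[c], d) for c in chained_cases}
            for d in ("ER", "LB", "FI")}
```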
To discover a subcontracting scenario, we generalize ER and FI to one department, called Emergency Ward (EW). After this generalization, the same set of cases, as the ones in Sepsis-CE, follow a subcontracting scenario, i.e., the cases follow a path that corresponds to the department-level DFG shown in Figure 16. Although the set of cases is the same, for the sake of simplicity, we name this event log Sepsis-SB. By projecting Sepsis-SB onto the activities of each department, i.e., EW and LB, we get two event logs for two different departments. We name these event logs Sepsis-SB-EW and Sepsis-SB-LB. Note that Sepsis-SB-LB is also the same as Sepsis-CE-LB. Table 4 shows the general statistics of the event logs that we obtained for the above-mentioned scenarios.

B. THE MERGING CHALLENGE
In this subsection, we show the results of applying our approach to the event logs explained in the previous subsection. Figure 17 shows the original DFG for the main event log in both scenarios, i.e., Sepsis-CE/SB. As explained in Section V, the first step in all the scenarios is that each organization applies the abstraction function to its own private event log and shares the resulting DFRs. Figure 18 shows the DFG of the merged DFRs without handover relations for the chained execution scenario. Since handover relations are missing, one can see three submodels with their own start and end activities, and there is no connection between the activities of different departments. In fact, the concept of chained execution has completely vanished. Based on our scenario for the chained execution type of interoperability, all the cases follow a path matching the DFG shown in Figure 15. Thus, we expect to see the activities of the ER department at the beginning and the activities of the FI department at the end (as shown in Figure 17). By applying Algorithm 1 to the handover tables of all the departments, we retrieved 1198 missing handover relations. Figure 19 shows the DFG of the merged DFRs updated with these missing handover relations using the coupling update operation. This DFG is exactly the same as the original one.
Figure 20 shows the DFG of the merged DFRs without handover relations for the subcontracting scenario. One can see two submodels corresponding to two departments, EW and LB. There is no connection between the activities of the two departments, and the submodels have their own start and end activities. Based on our scenario for the subcontracting type of interoperability, the activities of the LB department should appear between the activities of EW. Similar to the chained execution scenario, we apply Algorithm 1 to retrieve the missing handover relations. Then, the decoupling update operation is applied to update the merged DFRs with the missing handovers. Figure 21 shows the DFG of the updated merged DFRs, which is exactly the same as the DFG of the original event log (Figure 17).

VII. CONCLUSION AND DISCUSSION
In this paper, we proposed an abstraction-based approach for privacy-aware federated process mining. We employed DFRs as abstractions of event logs. We introduced the risk-aware reveal method to overcome their data utility limitations. In Section V, we specialized our approach to federated process mining for five different interoperability scenarios. We introduced the concept of handover relations and handover tables and demonstrated an algorithm for retrieving missing handover relations in an inter-organizational setting. We also demonstrated update operations to update directly follows relations with missing handover relations. We employed Sepsis as a real-life event log to evaluate our approach for reproducible scenarios.
In our problem setting, we assumed that privacy concerns are at the level of individuals, i.e., traces are sensitive information that need to be protected. However, our approach can also support the department level of sensitive information. If we assume that the entire internal activities of an organization are private, the organization can share only its handover table. As a result, the generated DFG in the untrusted environment only represents the communication points of the organization.
For explaining the risk-aware reveal method, we focused on an intuitive type of attack and the corresponding disclosure risks. However, attack scenarios and their corresponding risk analysis can be done more extensively. Since organizations are not aware of the event logs and risk thresholds required by other organizations, they may provide responses that violate the risk requirements of one another. Namely, intersection-based attacks can be launched [15]. For example, consider the following scenario. In organization o 1 , the age attribute is considered as a sensitive attribute and it gets generalized before publishing. However, in organization o 2 , age is not considered as a sensitive attribute and it is shared without generalization. If there exists only one case in a specific range in the response provided by o 2 , the privacy requirement of o 1 is violated.
Each individual response contains more than one case considering a sequence of activities with a maximum length of 2 as the background knowledge. However, the integrated responses for cases are as follows: Res c 1 = ⟨a, b, c, d, e, f ⟩, Res c 2 = ⟨a, b, d, e⟩, Res c 3 = ⟨b, c, e, f ⟩. As can be seen, there are sequences of activities with length 2 that single out a case. For example, ⟨c, e⟩ singles out the case c 3 , and ⟨c, d⟩ singles out the case c 1 . Such risks can be mitigated using an integration engine that considers the risk thresholds of all the organizations and re-evaluates the risks associated with integrated responses before exposing them to the untrusted environment. Such an integration engine can be considered as a semi-trusted third party that never gets unprotected information and may not misbehave. Nevertheless, we still need to realize third-party independent solutions for such scenarios.
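The singling-out effect in the integrated responses can be checked with a short Python sketch. It assumes that the background knowledge consists of contiguous activity sequences of length exactly 2; the function name singling_out is hypothetical.

```python
from collections import defaultdict

def singling_out(responses, length=2):
    """Map each contiguous sequence that occurs in exactly one case to that case."""
    cases_by_seq = defaultdict(set)
    for case, trace in responses.items():
        for i in range(len(trace) - length + 1):
            cases_by_seq[tuple(trace[i:i + length])].add(case)
    return {seq: next(iter(cases))
            for seq, cases in cases_by_seq.items() if len(cases) == 1}

integrated = {
    "c1": ["a", "b", "c", "d", "e", "f"],
    "c2": ["a", "b", "d", "e"],
    "c3": ["b", "c", "e", "f"],
}
risky = singling_out(integrated)
```

Under this assumption, three sequences each identify a single case: ⟨c, d⟩ for c 1 , ⟨b, d⟩ for c 2 , and ⟨c, e⟩ for c 3 . An integration engine of the kind described above would need to run a check of this sort on the joined traces rather than on the individual responses.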
Moreover, the current risk-aware reveal engine employs no privacy-preserving technique. It solely analyzes the risks associated with a single response and either refuses the corresponding request or shares the response. In the future, the engine can be equipped with privacy preservation techniques, e.g., differential privacy, that provide privacy guarantees for responses. The engine is also stateless, i.e., it does not keep track of queries. In the future, the engine can be upgraded to a stateful one that tracks queries to avoid privacy leakage resulting from responses provided to several queries.
We described our approach for the synchronous type of communication. However, the approach can also support the asynchronous type of communication using modeling techniques that can represent concurrencies. In the future, we plan to extend this technique with other representation models to support the asynchronous type of communication. We also plan to perform a case study to comprehensively evaluate the effectiveness of the risk-aware reveal method.