Improving Predictive Process Monitoring Through Reachability Graph-Based Masking of Neural Networks

Predicting the next event during process runtime is an objective of interest in predictive process monitoring (PPM). Decay replay mining is one of the few deep learning-based next event prediction approaches that are built upon process model notations. However, this algorithm does not fully intertwine its neural network with the available process knowledge contained in the process model. This work, which is an extended version of an earlier conference publication, investigates the reachability graphs of underlying Petri net process models for masking the neural network of decay replay mining to ultimately increase the quality of next event predictions. A more comprehensive set of experiments is performed to provide robust statistical evidence of the usefulness of the approach and to qualify claims and hypotheses made in the earlier work. In addition, the decay replay mining approach is applied with the suggested reachability graph-based masking extension to a healthcare use case of sepsis patients, facilitating decision-making for healthcare practitioners. The obtained results further underscore the validity of masking neural networks using knowledge contained in the reachability graph of a Petri net process model.

As an example, the process of patient care in a hospital can be analyzed using process mining techniques and observations of the procedures that are performed on the patient. These steps can be any observations from patient admission to patient discharge, such as administrative notes, the taking of laboratory samples, surgeries, or hospital floor transfers. Process discovery and conformance checking can provide an as-is view on the process using historically recorded data to provide insights on, e.g., inefficiencies. The outcomes of process mining analyses can help to manage resources more efficiently and improve hospital treatment-specific key performance indicators in the future [5].
In recent years, the set of process mining techniques has been extended by machine learning methods to provide a predictive perspective on how processes are executed, beyond as-is-only views of historically recorded data. These techniques are categorized under the term predictive process monitoring (PPM) [6], [7]. PPM can have many objectives and usually focuses on the prediction of events and corresponding attributes such as timestamps of observations or resources during process runtime. This allows stakeholders of the process and process mining practitioners to gain transparent and ideally early insights into process deviations and enables them to intervene and prevent undesired process outcomes from occurring. Common questions in the business context that can be answered during runtime are "will the customer complain or not?" or "will an order be delivered, canceled, or withdrawn?" [6]. PPM has an exhaustive set of applications beyond the mentioned example, such as enabling the sharing of process knowledge across different organizations [8] and building the cornerstone for prescriptive process monitoring [9]. This shows that PPM is an important process mining task whose applied significance is expected to grow further.
Predicting future observations, while the patient care process is executed, enables the detection of potentially unexpected patient care deviations. This prediction is of importance to healthcare practitioners, including physicians, as it allows for timely intervention [10]. During the initial stage of the COVID-19 pandemic, physicians were interested in the early prediction of patient observations, such as needed ventilation, transfers to ICU, and mortality risks [11]. PPM techniques are suitable and well-acknowledged methods for such objectives, with deep learning-based approaches demonstrating superior predictive performance [12], [13], [14]. However, a problem with these techniques is that they are commonly built on subsequences of observations, which disregards knowledge obtained from process mining initiatives. To date, a limited number of PPM techniques leverage process models in combination with deep learning [12]. As a consequence, almost all of the PPM methods that are based on deep learning are disconnected from process discovery results, although the main objective of process discovery is to learn about the overall process [15]. Therefore, leveraging the results of process discovery, i.e., process models, should be considered when developing PPM methods, especially as this comes at no additional cost. To the best of the authors' knowledge, no methods that fully integrate process knowledge into PPM in this way have been published yet. However, initial research efforts such as Jacobs et al. [16] are documented.
Decay replay mining [17] is a PPM method that utilizes process models, specifically Petri nets, in combination with neural networks for the task of next event prediction. While Petri nets provide an interpretable overview of how processes are executed, they lack associated probabilities of Petri net states. Specifically, Petri nets show which internal states can be visited from a given state. However, there are no probabilities associated with states indicating which state follows next in time, given that multiple next states are possible. Decay replay mining overlays the Petri net with neural networks to introduce the missing probability dimension and enables modeling of long-term and latent dependencies. The method extends a Petri net with time decay functions. Sequences of observations are synchronously replayed to obtain vectors of Petri net states. These states are then used to train a neural network classifier to predict the next observation, i.e., the next event. Decay replay mining has been successfully applied to various problems [18], including healthcare [10], [19], [20], [21]. While this method utilizes outcomes from process discovery, the neural network is not yet interlocked with the Petri net to its full extent [22]. This article is an extended work that has been originally presented at the 2021 International Conference on Cyber-physical Social Intelligence [22]. In terms of contributions, this article provides a deeper understanding of the distinct theoretical rationales of the originally proposed method and more exhaustive empirical insights based on enlarged experiments with various new result perspectives. In addition, multiple applied perspectives are granted by relating the method to PPM within the healthcare domain throughout this article.
Also, a separate subsection of this article focuses on a healthcare-related use case and the application of the proposed approach to next event prediction of sepsis patient hospital trajectories [23].
The remainder of this article is structured as follows. The introduction is provided in this section, before a preliminaries section in Section II. Afterward, related work is discussed in Section III. The originally proposed methodology is described in a more holistic manner in Section IV, which relates the theoretical rationales to the healthcare domain. An extended experimental evaluation is described and the results discussed in Section V, followed by a use case investigation of a hospital process of sepsis patients in Section V-C. Section VI provides the conclusion of this article.

II. PRELIMINARIES
This section is based on the Preliminary Section in [22].

A. Petri Nets
The following definitions are based on [1] and [17]. A Petri net is a mathematical, graph-based model, which can be used to represent the logic of processes. Its definition consists of three sets: P is the set of places, T corresponds to the set of transitions, and F ⊆ (P × T) ∪ (T × P) defines the set of arcs. P ∪ T are the nodes of a Petri net, which are unidirectionally connected using F. In the case that transitions correspond to events (and vice versa), a Petri net is called labeled. A labeled Petri net is formally defined as PN = (P, T, F, π) (1), where the set of all possible events of a process is denoted by A and π is a function that maps events to transitions and vice versa. If a transition is hidden, then π maps the transition to a nonobservable event ⊥, i.e., π : T → A ∪ {⊥}; hence, ∀a ∈ A ∃! t ∈ T : π(t) = a applies.
A node x connects unidirectionally to a second node y if and only if (x, y) ∈ F. Consequently, the incoming nodes for any node x ∈ P ∪ T are defined as ·x = {y | (y, x) ∈ F}. Outgoing nodes are consequently defined as x· = {y | (x, y) ∈ F}.
Each p ∈ P can hold tokens. The nonnegative integer number of tokens that are held by a place is returned by the function ρ(p). The vector M ∈ Z^|P| defines the state of the Petri net where Z is the set of all nonnegative integers. M_i equals ρ(p_i), where i corresponds to the ith place in P and i = 1, ..., |P|. The set M describes all possible states of the Petri net. M is also called marking. In process mining, the beginning of process execution and its end are usually indicated by a dedicated start and end. Similarly, Petri nets have a dedicated source and sink place such that all other nodes are located in between the source and sink nodes. In other words, the dedicated source place has no incoming nodes, whereas the dedicated sink place has no outgoing nodes. All other nodes have at least one incoming node and one outgoing node. The initial marking, i.e., M_init, equals exactly one token in the dedicated source place, while for all other places, the number of tokens held equals 0. Similarly, for the final marking M_final, only one token is held in the dedicated sink place, while all others do not hold any tokens.
A transition is called enabled if and only if all incoming places hold at least one token. This requirement is defined as ∀p ∈ ·t : ρ(p) ≥ 1. When a transition is executed, i.e., fired, the number of tokens of all incoming places of that transition is reduced by one. Accordingly, the number of tokens of all outgoing places of that transition is increased by one.
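The token-game semantics above can be sketched as a small Python class. This is an illustrative sketch only; the class and method names (PetriNet, enabled, fire) are not taken from the article or any library.

```python
# Minimal sketch of a labeled Petri net with the enabling/firing
# semantics described above. All names are illustrative.
class PetriNet:
    def __init__(self, places, transitions, arcs, labels):
        self.places = list(places)              # P
        self.transitions = list(transitions)    # T
        self.arcs = set(arcs)                   # F ⊆ (P×T) ∪ (T×P)
        self.labels = labels                    # π: T -> A ∪ {None}, None = hidden
        self.marking = {p: 0 for p in places}   # M, tokens held per place

    def pre(self, t):
        # ·t — incoming places of transition t
        return [p for p in self.places if (p, t) in self.arcs]

    def post(self, t):
        # t· — outgoing places of transition t
        return [p for p in self.places if (t, p) in self.arcs]

    def enabled(self, t):
        # A transition is enabled iff every incoming place holds a token.
        return all(self.marking[p] >= 1 for p in self.pre(t))

    def fire(self, t):
        # Firing consumes one token per incoming place and
        # produces one token per outgoing place.
        assert self.enabled(t), f"transition {t} is not enabled"
        for p in self.pre(t):
            self.marking[p] -= 1
        for p in self.post(t):
            self.marking[p] += 1

# A tiny workflow net: source -> t1 -> p1 -> t2 -> sink.
net = PetriNet(
    places=["source", "p1", "sink"],
    transitions=["t1", "t2"],
    arcs=[("source", "t1"), ("t1", "p1"), ("p1", "t2"), ("t2", "sink")],
    labels={"t1": "register", "t2": "close"},
)
net.marking["source"] = 1   # initial marking M_init
net.fire("t1")
print(net.marking)          # {'source': 0, 'p1': 1, 'sink': 0}
```

Firing `t2` afterward moves the single token into the sink place, i.e., the final marking M_final.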

B. Reachability Graphs
A Petri net can be converted into a reachability graph [24]. In this article, a reachability graph of a Petri net is defined as RG = (N, K), where N is a set of nodes and K is a set of directed edges. Every node of the RG maps to one marking of the Petri net, meaning that N = M. When an edge of the RG connects a node x to y, this corresponds to the transition required to be fired to move from the marking of x to the marking of y. The function η maps a node N of an RG to its respective marking M of the Petri net. The inverse of this function, η^{-1}, maps a marking M of the Petri net to a node N of the RG. Similarly, κ maps an edge of K to the corresponding transition in T, and κ^{-1} maps a transition of T to a set of edges of the RG. Conditions (2) and (3), which formalize this correspondence between nodes and markings and between edges and transitions, apply for a reachability graph of a Petri net.
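For a bounded net (such as the sound workflow nets assumed later in this article), the reachability graph can be built by breadth-first search over markings. The sketch below assumes the net is given as dictionaries of pre- and post-places per transition; the function names are illustrative.

```python
from collections import deque

# Sketch of reachability-graph construction by breadth-first search
# over markings. Markings are encoded as sorted tuples of
# (place, token_count) pairs so they can serve as node identifiers.
def reachability_graph(pre, post, m_init):
    """Return (nodes, edges): nodes are marking keys, edges are (m, t, m')."""
    def enabled(m, t):
        return all(m.get(p, 0) >= 1 for p in pre[t])

    def fire(m, t):
        m2 = dict(m)
        for p in pre[t]:
            m2[p] -= 1
        for p in post[t]:
            m2[p] = m2.get(p, 0) + 1
        return m2

    key = lambda m: tuple(sorted((p, n) for p, n in m.items() if n > 0))
    nodes, edges = {key(m_init)}, set()
    queue = deque([m_init])
    while queue:
        m = queue.popleft()
        for t in pre:
            if enabled(m, t):
                m2 = fire(m, t)
                k2 = key(m2)
                edges.add((key(m), t, k2))
                if k2 not in nodes:
                    nodes.add(k2)
                    queue.append(m2)
    return nodes, edges

# Net with a choice: from p0, either t_a or t_b leads to p1, then t_c to p2.
pre  = {"t_a": ["p0"], "t_b": ["p0"], "t_c": ["p1"]}
post = {"t_a": ["p1"], "t_b": ["p1"], "t_c": ["p2"]}
nodes, edges = reachability_graph(pre, post, {"p0": 1})
print(len(nodes), len(edges))   # 3 markings, 3 edges
```

Note that the two transitions t_a and t_b produce two parallel edges between the same pair of markings, illustrating that κ^{-1} maps a transition to a set of edges.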

III. RELATED WORK

A. Next Event Prediction
A major focus in PPM is the prediction of the next event given the partial sequence of already observed events. Within PPM, the most common probabilistic methods tackling this objective are hidden Markov models [25], probabilistic finite automata [26], and probabilistic process models [27]. A more recent probabilistic approach uses dynamic Bayesian networks to achieve a competitive performance score [28]. These explicit probabilistic approaches have the benefit of being understandable and interpretable, making them well-suited models for healthcare applications in which explainability is of utmost importance to practitioners. Moreover, they are relatable to process models such as Petri nets even without using process model notations. On the downside, these methods are usually incapable of leveraging all available event information. This is especially unfortunate as clinical data are usually high-dimensional [10]. Moreover, these methods commonly fail in discovering long-term dependencies [28], which are usually present in treatment-relevant processes.
Due to the higher performance of neural networks in terms of accuracy and precision, next event prediction algorithms are increasingly focusing on deep learning. Multiple deep learning architectures for PPM, particularly next event prediction, have been proposed.
One of the first deep learning PPM methods has been proposed in [29] by leveraging recurrent neural networks to predict the next event given a partially observed sequence of observations of a given object moving through the process, i.e., a case. In particular, the authors created a word embedding for each event from the event log and then trained a long short-term memory (LSTM) network. The neural network therefore learns the rules of the process implicitly. This approach has later been followed, adapted, and improved in [13].
Convolutional neural networks were explored for next event prediction as well [30], [31], [32]. The fundamental idea of these deep learning architectures is based on convolutions, which resulted in state-of-the-art applications in computer vision. For next event prediction, these approaches convert temporal data contained in the event logs into a 2-D spatial format such that they can be treated as images and therefore processed by convolutional neural networks.
As recurrent and convolutional neural networks might not always converge due to limited training data or nonideal architectural setups, the work of [14] proposed the application of generative adversarial networks to the problem of next event prediction. Their method puts two neural networks into an adversarial setting such that the predictions made cannot be distinguished from the ground truth. Due to the semisupervised setting, this setup can overcome data limitation issues.
Graph neural networks have been employed for next event prediction in [33]. Their objective has been to use gated graph neural networks to derive explainable next events. Mehdiyev et al. [34] proposed a stacked autoencoder approach by encoding events into n-gram features with a sliding window. In addition, feature hashing is applied. The resulting features are then used to train unsupervised stacked autoencoders. Just recently, Pasquadibisceglie et al. [12] proposed a multiview deep learning approach for predicting next events using event log data.
All these deep learning models frequently outperform probabilistic techniques. However, due to the black-box nature of deep learning, these methods lack interpretability. In addition, these methods ignore the knowledge contained in process models that were established by in-advance process mining efforts such as process discovery. Only a limited number of methods use deep learning combined with explicit process model notations for the prediction of the next event, one of them being decay replay mining introduced in the following.

B. Decay Replay Mining
The decay replay mining [17] method is one of the methods that utilize in-advance discovered Petri nets in combination with deep learning for the prediction of the next event. This method consists of two steps. The first step enhances all p ∈ P of a Petri net with time decay functions to model time by replaying an event log on the Petri net. Every time a token is created in a given place of the Petri net, the time delta between the timestamp of the corresponding event leading to this place and the last token creation of this place is calculated. If only one token is created in a process execution run, the time delta is calculated between the token creation and the end of the process. The time decay function is a clipped linear function with a slope that is parametrized using the average of the time deltas for each place. When replaying an event log on the enhanced Petri net, a continuous time-based value can be obtained for each place of the Petri net, which describes the time progression since the last token creation in that place. The enhanced Petri net is then used to replay event sequences from an event log to obtain multiple vectors at any time τ of the event sequence: M(τ), which is a vector that represents M of the Petri net at time τ; C(τ), representing the number of tokens that were added per p ∈ P; and F(τ), which is a vector containing the above described time decay function values, specifically one per p ∈ P at time τ. The decay replay mining approach concatenates those vectors to a so-called timed state sample, denoted by S(τ). This concatenated vector corresponds to a time-based representation of the Petri net's state as it contains the vector of time decay function values (i.e., modeled time), the token counts per place, and the marking at a given time. S(τ) is therefore an enhancement of a time-independent marking.
The second step of the method then builds a conventional dense neural network using the vectors S(τ) as input to predict the next event at a given time τ. The output of the neural network is a softmax layer over the set A.
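The assembly of a timed state sample can be sketched as follows. The decay parametrization (one slope alpha per place and a clipped linear form starting at a maximum value) follows the description above, but the concrete function and variable names are assumptions for illustration, not the article's implementation.

```python
# Sketch of assembling a timed state sample S(τ): the decay values F(τ),
# the token-creation counts C(τ), and the marking M(τ) are concatenated
# into one fixed-length vector of size 3·|P|. Names are illustrative.
def decay_value(t_now, t_last_token, alpha, f_max=1.0):
    """Clipped linear decay: f_max at the last token creation, falling with slope alpha."""
    if t_last_token is None:
        return 0.0                                   # place never received a token
    return max(0.0, f_max - alpha * (t_now - t_last_token))

def timed_state_sample(places, marking, counts, last_times, alphas, t_now):
    f = [decay_value(t_now, last_times[p], alphas[p]) for p in places]  # F(τ)
    c = [counts[p] for p in places]                                     # C(τ)
    m = [marking[p] for p in places]                                    # M(τ)
    return f + c + m                                                    # S(τ)

places = ["source", "p1", "sink"]
s = timed_state_sample(
    places,
    marking={"source": 0, "p1": 1, "sink": 0},
    counts={"source": 1, "p1": 1, "sink": 0},
    last_times={"source": 0.0, "p1": 5.0, "sink": None},
    alphas={"source": 0.1, "p1": 0.1, "sink": 0.1},
    t_now=7.0,
)
print(len(s))   # 9, i.e., 3·|P|
```

The decay entries encode how recently each place received a token, which is exactly the time dimension that a plain marking lacks.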
This approach led to high predictive performance and has been applied in healthcare [10], [19], [20] to predict 30-day readmission events of heart failure patients and mortality events of diabetes ICU and paralytic ileus patients based on electronic health record data. Especially for diagnostic event predictions that can assist in treatment decisions of critically ill patients, decay replay mining is a potential candidate method as it can provide the physician with the required decision transparency. Physicians can follow the careflow through the Petri net graph at every point in time and understand the Petri net states that lead to a particular prediction. However, this approach does not interlock the Petri net with the neural network to its full extent, as the first step of the method is decoupled from the second one, i.e., the neural network does not explicitly leverage the structure of the underlying Petri net.

C. Prescriptive Process Monitoring
PPM and its associated methods, such as decay replay mining, provide, among other outputs, probabilities of the next events at each time step during process execution. This ability builds the cornerstone for a further emerging process mining subdiscipline titled prescriptive process monitoring. While PPM provides the necessary likelihoods of how process execution unfolds, value for practitioners in the form of measurable process improvements can usually only be generated when the predictions are translated into corresponding prescriptions to control the process execution effectively. Research has shown that process practitioners select obvious and subjective actions for the intervention of predicted at-risk cases; however, these actions did not lead to a reduction in faulty process executions [9]. This underscores the need for effective prescriptions upon accurate predictions. Multiple recent studies address this challenge. Fahrenkrog-Petersen et al. [35] proposed a framework to extend PPM by raising alarms that trigger interventions to prevent negative outcomes. The experimental investigation of their approach showed that the cost of negative outcomes can be effectively minimized. Similarly, Bozorgi et al. [36] developed a causal machine learning model to trigger cost-optimized interventions to reduce the cycle time of process executions. De Leoni et al. [37] proposed a framework that combines prescriptive analytics with PPM to design a process-aware recommender system. This system does not just provide generalized predictions but recommends effective interventions to reduce risks based on factual data. Another approach has been developed by Weinzierl et al. [38], which transforms PPM, particularly next event prediction, into the next best action with respect to process KPIs.

IV. METHODOLOGY
This section describes the reachability graph-based extensions of the decay replay mining approach [22] and provides additional technical details and logical rationales. In addition to the initial and mostly theoretical proposition in [22], the rationales and advantages of the extensions are elaborated for next event prediction. The core idea of the decay replay mining extension is to strengthen the coupling with the Petri net in the next event prediction process to take advantage of the knowledge contained in the Petri net. As a general prerequisite, it is assumed that the Petri net is a sound workflow net [39]. The three originally proposed extensions are visualized in Fig. 1. The first one is a reachability graph exploration step, which is described in Section IV-A. The second extension encompasses the calculation of masking vectors during replay and the corresponding timed state sample expansion in Section IV-B. Third, an adaptation of the feedforward neural network architecture is proposed in Section IV-C, comprising the addition of a reachability graph-based masking layer and the replacement of rectified linear unit [40] with Swish activation functions [41]. Finally, an end-user view is provided in Section IV-D.

A. Reachability Graph Exploration
In the reachability graph, every M is a node. All of a node's outgoing edges correspond to the potential future events uncovered by the process mining methods that extracted the Petri net. This property is currently not used by the decay replay mining method to limit the set of possible next events dynamically based on the state of the Petri net. However, this knowledge is available, comes at no additional cost, and should therefore be leveraged. The idea of the reachability graph exploration step is hence to calculate, for every marking of the Petri net, the set of possible next events to eliminate events that cannot occur based on the logic of the Petri net. Therefore, the concept of a reduced reachability graph RG_r and its calculation is proposed in this section. The objective of RG_r is twofold: first, to significantly reduce the usually large size of a reachability graph, and second, to eliminate edges that correspond to hidden transitions and hence are of no value for next event prediction.

(Fig. 2. Conversion from Petri net to (reduced) reachability graph. Rectangles correspond to transitions, round circles correspond to places, and oval circles correspond to markings. For simplicity, each place can hold only one token; therefore, a marking is composed of the set of place identifiers that hold a token.)
Formally, RG_r encompasses a node set N_r ⊆ N and an edge set K_r. N_r contains η^{-1}(M_init) together with every N ∈ N whose marking is reached immediately after firing a nonhidden transition. Two nodes N_0 and N_1 of RG_r are connected by an edge when the marking of N_1 is reachable from the marking of N_0 via either one nonhidden transition or a sequence of hidden transitions followed by a nonhidden transition. The idea is to detect all paths from one marking to another along which exactly one nonhidden transition, i.e., an event in A, will be observed. This provides the information whether an event can be observed as a next step, and is therefore a next event candidate, given that the Petri net is currently in the state of the respective node. Per definition, an edge K ∈ K_r maps to the corresponding Petri net transition, i.e., κ(K) = κ(K_n), where K_n is the final, nonhidden edge of the respective path in the RG. Thus, every edge of the reduced reachability graph corresponds to an observable event included in A, meaning that the reachability graph has been reduced to nonhidden transitions only. Every edge expresses one nonhidden transition, regardless of whether that transition is already enabled; the only requirement is that the nonhidden transition can be enabled from the current marking by firing hidden transitions. The reduced reachability graph builds the basis for the calculation of the later reachability masks as all nodes of RG_r represent the Petri net's markings that are reachable by executing a nonhidden transition. In addition, all markings of RG_r are the markings that are part of the timed state samples. Therefore, the reduced reachability graph can be used as a map to detect a potential subset of next events given the marking from a timed state sample. Fig. 2 provides an example visualization of the conversion from a Petri net to a reachability graph and from a reachability graph to the reduced reachability graph.
From the reduced reachability graph, it is straightforward to deduce the set of next possible events given the state of a Petri net.
In the decay replay mining algorithm flow, RG_r is calculated after the enhancement of the places P. To summarize, RG_r encompasses a node for every marking that is reached immediately after executing a nonhidden transition. As the marking after executing such a nonhidden transition is used as a component of a timed state sample, the set of outgoing edges of the respective node corresponds to a subset of A, which relates to the possible next events according to the information contained in the Petri net.
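The reduction step can be sketched as a traversal of the full reachability graph that skips over hidden edges. This is an illustrative reading of the construction described above, not the article's implementation: edges are given as (source, label, destination) triples with label None for hidden transitions, and an RG_r edge exists whenever a destination node is reachable via zero or more hidden edges followed by exactly one nonhidden edge.

```python
from collections import deque

# Sketch of deriving the reduced reachability graph RG_r from a labeled
# reachability graph. Node names and the edge encoding are illustrative.
def reduce_rg(edges, initial):
    out = {}
    for src, lbl, dst in edges:
        out.setdefault(src, []).append((lbl, dst))

    def hidden_closure(n):
        # All nodes reachable from n via hidden (label None) edges only.
        seen, queue = {n}, deque([n])
        while queue:
            x = queue.popleft()
            for lbl, y in out.get(x, []):
                if lbl is None and y not in seen:
                    seen.add(y)
                    queue.append(y)
        return seen

    reduced, nodes, frontier = set(), {initial}, deque([initial])
    while frontier:
        n0 = frontier.popleft()
        for x in hidden_closure(n0):
            for lbl, n1 in out.get(x, []):
                if lbl is not None and (n0, lbl, n1) not in reduced:
                    reduced.add((n0, lbl, n1))
                    if n1 not in nodes:
                        nodes.add(n1)
                        frontier.append(n1)
    return nodes, reduced

# m0 --tau(hidden)--> m1 --a--> m2 --b--> m3
edges = [("m0", None, "m1"), ("m1", "a", "m2"), ("m2", "b", "m3")]
nodes, red = reduce_rg(edges, "m0")
print(sorted(red))   # [('m0', 'a', 'm2'), ('m2', 'b', 'm3')]
```

Note that the intermediate marking m1, reached only via a hidden transition, does not become a node of RG_r: only the initial marking and markings reached directly after a nonhidden transition survive the reduction.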

B. Masking Vector
With each node of RG_r corresponding to a marking that is observed after firing a nonhidden transition, it is straightforward to discover the node N ∈ N_r that reflects the marking M reached after replaying an event sequence from the event log. As mentioned before, M(τ), which is included in S(τ), is a marking that can be found in the set of nodes of RG_r. Therefore, η^{-1}(M(τ)) ∈ N_r applies. Consequently, the set of possible next events given M(τ) can be denoted by A_M(τ) and is definable such that ∀K ∈ η^{-1}(M(τ))· : π(κ(K)) ∈ A_M(τ) and |η^{-1}(M(τ))·| = |A_M(τ)| according to the Petri net, as described in [22]. This means that every marking, i.e., every node of the RG_r, can have a different set of possible next events according to the Petri net.
Using this set to derive a probability of the next event conditioned on the marking is more reasonable than calculating the probability of the next event over the complete set of events A if it is already known that certain events cannot occur given the respective marking of the Petri net. By doing this, variance is reduced, which should ultimately lead to better predictive performance. To incorporate this conditional knowledge into the next event probability calculation of the neural network, a masking vector MSK is proposed. The objective of MSK is to positively weight events a ∈ A that can occur next and to cancel the weights of events that cannot occur next according to the reachability graph and the marking. MSK is a vector of length |A| where each value is either 0 or 1. A value of 1 at the ith index of MSK means that the ith event of A is present in A_M(τ); inversely, a 0 corresponds to the nonpresence of the ith event in A_M(τ). By doing this, MSK is a fixed-length vector describing potential next events using information from the Petri net, derived via its reachability graph. In special circumstances, MSK can be a zero vector. This reflects the case that the final marking of the Petri net is reached, meaning that no further events can occur.
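Building MSK from the reduced reachability graph amounts to a lookup of the current node's outgoing edge labels. The following sketch uses the same illustrative edge encoding as before and an assumed fixed event-to-index mapping.

```python
# Sketch of building the masking vector MSK for a given RG_r node:
# 1 at the index of every event that can occur next, 0 elsewhere.
# event_index maps each a ∈ A to a fixed position; names are illustrative.
def masking_vector(node, rg_r_edges, event_index):
    msk = [0] * len(event_index)
    for src, label, dst in rg_r_edges:
        if src == node:
            msk[event_index[label]] = 1
    return msk

event_index = {"a": 0, "b": 1, "c": 2}
rg_r = [("m0", "a", "m1"), ("m0", "b", "m2"), ("m2", "c", "m3")]
print(masking_vector("m0", rg_r, event_index))   # [1, 1, 0]
print(masking_vector("m3", rg_r, event_index))   # [0, 0, 0] — final marking
```

The second call illustrates the special zero-vector case: the node m3 has no outgoing edges, so no further event can occur.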

C. Reachability Mask Layer and Activation Function
This section discusses how specifically MSK is applied to mask the dense neural network using an additional network layer. The idea of the masking layer is to add an operation that cancels the probabilities for every a ∈ A − A_M(τ). Thus, the overall objective of interlocking the neural network more strongly with the knowledge contained in the Petri net is followed. When predicting the next event using this approach, a reduced subset is considered rather than the global system event set. In other words, at every prediction step, the potential next event results from a set modeled by the Petri net that supposedly reflects the true logic of the process. In [17], the originally proposed neural network consists of four fully connected hidden layers. The input to the first fully connected hidden layer is S(τ). The final layer returns a softmax vector, which is interpreted as probabilities per a ∈ A.
To incorporate the knowledge from the reachability graph into the probability calculation, two adaptations are proposed in [22]. First, the input of the original architecture is replaced with the concatenated S(τ) and MSK vectors. By doing this, the mask vector is available to the predictive model from the beginning, and the hidden layers can use this information for the probability calculation of the next events. Hence, the neural network does not need to deduce this information exclusively from S(τ), as in the original architecture [17]. It is assumed that this leads to an optimized latent representation in the hidden layers for the prediction of the next event.
Another architectural modification addresses adding a layer to the output of the neural network. Initially, work [17] naively calculated the softmax probability vector over all a ∈ A. However, the probabilities should be computed such that the probability for every a ∈ A − A_M(τ) is close to or equal to zero. This can be realized by computing the Hadamard product of the original softmax vector and the masking vector MSK. Normalizing the derived vector recovers a probability vector whose values sum to 1. The reachability masking layer can thus formally be described as y' = (y ⊙ MSK) / Σ_{i=1}^{|A|} (y ⊙ MSK)_i, where y denotes the original softmax output and ⊙ the Hadamard product. By following this approach, probabilities for the next event are exclusively computed using A_M(τ) rather than A as in the original approach. As this can be incorporated as a separate layer, it is also part of the weight optimization during the training stage of the neural network, meaning that the structural properties of the Petri net are incorporated into the latent representation learning of the hidden layers. Finally, the initial work of [22] proposed exchanging the rectified linear unit activation functions in [17] with Swish ones [41]. When comparing those two functions, the Swish activation function has various potential benefits. One benefit is that it is a smooth function with no abrupt changes in the slope. Swish is also a nonmonotonic function. Consequently, the output landscape of the Swish activation function is smoother than that of the rectified linear unit function [41]. This is presumably an advantage in providing an improved learning signal during the weight optimization process of the neural network in the training stage. As a consequence, the expectation is that Swish delivers a better training signal than a rectified linear unit activation function for the purpose of this article.
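The two output-side operations can be sketched in NumPy as follows: the Swish activation and a masking step that multiplies the softmax output elementwise by MSK and renormalizes. This is a minimal sketch of the described computation, not the article's network code; the epsilon guard against an all-zero mask is an added assumption.

```python
import numpy as np

# Sketch of the Swish activation and the reachability masking layer:
# softmax over all of A, Hadamard product with MSK, then renormalization
# so the surviving entries form a probability vector.
def swish(x):
    return x / (1.0 + np.exp(-x))        # x * sigmoid(x)

def masked_softmax(logits, msk, eps=1e-12):
    z = np.exp(logits - logits.max())
    probs = z / z.sum()                  # ordinary softmax over all of A
    masked = probs * msk                 # Hadamard product with MSK
    return masked / max(masked.sum(), eps)

logits = np.array([2.0, 1.0, 0.5, -1.0])
msk = np.array([1.0, 0.0, 1.0, 0.0])     # only events 0 and 2 are reachable
p = masked_softmax(logits, msk)
print(p)                                 # masked-out entries are exactly 0
```

The masked entries receive exactly zero probability while the remaining mass is redistributed over A_M(τ), which is the behavior described above.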

D. End-User View
This section provides an end-user perspective since reachability-graph-based masking can be used to visually assist in understanding the prediction process. Fig. 3 provides a visual example of how the proposed approach outputs can be provided to an end user compared to standard decay replay mining. The standard decay replay mining method simply provides a probability for all events of the event space occurring next. In contrast, the proposed approach limits the next possible events to only three events due to the process knowledge contained in the reachability graph. Moreover, the reachability graph can be used as a visual support of this prediction to highlight the current state of the partially observed case (i.e., sequence).

V. EXPERIMENTAL EVALUATION
This section provides the evaluation of the above described method in terms of predictive performance and expands the limited experiments and results discussion of [22]. The section is split into two parts. A description of the experimental setup and its differences to [22] is provided in Section V-A. Then, the results of the experiments are discussed in-depth in Section V-B. The corresponding experimental source code is available on GitHub.1

A. Setup
In this section, the selected datasets and evaluation metrics are discussed.
The helpdesk data [42], [43] each describe a ticketing process of a software organization. The first helpdesk event log has nine different events, 2662 cases, and an average case length of 3.6 events. The second helpdesk event log has nine different events, 4580 cases, and an average case length of 4.7 events.
The BPIC'12 [44] dataset represents a loan application process of a Dutch financial institute. The event log has been split into five smaller processes. Each of them is used as a separate dataset. The first split encompasses events corresponding to the work subprocess where the lifecycle status of the event is complete. This subprocess has six different events, 9658 cases, and an average case length of 7.5 events.
The second split encompasses events corresponding to the work subprocess. This split disregards the value of the lifecycle status. The BPIC'12 work dataset has seven different events, 9658 cases, and an average case length of 17.6 events.
The third split encompasses events corresponding to the offer subprocess of the loan application. This BPIC'12 offer dataset has seven different events, 5015 cases, and an average case length of 6.2 events.
Another split encompasses events corresponding to the actual application subprocess. The BPIC'12 application dataset has ten different events, 13 087 cases, and an average case length of 4.6 events.
The final split considers all events without any filtering applied. This complete BPIC'12 dataset has 24 different events, 13 087 cases, and an average case length of 20 events.
The BPIC'13 dataset [45] describes the handling of incidents and problems at Volvo IT. As the dataset covers two subprocesses, it is split similarly to BPIC'12. One split focuses on problem management, whereas another split focuses on the incident management subprocess. The BPIC'13 problem management dataset has five different events, 1487 cases, and an average case length of 4.5 events. The BPIC'13 incident management dataset has four different events, 7554 cases, and an average case length of 8.7 events.
The splitting into subprocesses leads to nine datasets that are used for evaluating the approach experimentally. Every dataset is randomly divided five times into unique training, validation, and testing subsets, using a 60/20/20 split ratio each time.
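Such a repeated split could be sketched as follows; case-level splitting and the dataset size are assumptions for illustration:

```python
import random

def split_cases(case_ids, seed, ratios=(0.6, 0.2, 0.2)):
    """Randomly divide case identifiers into training/validation/testing
    subsets according to the given ratios."""
    rng = random.Random(seed)
    ids = list(case_ids)
    rng.shuffle(ids)
    n_train = int(ratios[0] * len(ids))
    n_val = int(ratios[1] * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# five unique 60/20/20 splits of a dataset with 1000 (hypothetical) cases
splits = [split_cases(range(1000), seed=s) for s in range(5)]
```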
2) Metrics: As in [22], the modifications described above are evaluated using the unweighted and weighted multiclass receiver operating characteristic area under the curve (mAUROC) metrics. The AUROC score is a robust measure of how well a classifier can separate between two classes. The unweighted mAUROC is the mean of the one-vs-rest AUROC scores for each a ∈ A. For the weighted mAUROC, the support of events is used to compute the weighted mean of the one-vs-rest AUROCs of each a ∈ A. AUROC metrics have been chosen for evaluation as this provides a superior classification measure compared to other metrics [46]. Generally, the AUROC score estimates "the probability that a classifier ranks a randomly chosen positive instance higher than a negative instance" [47].
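The two metrics can be computed, for instance, with scikit-learn's `roc_auc_score`; the following self-contained sketch instead implements the rank-based definition directly, assuming that class labels double as column indices into the probability matrix:

```python
import numpy as np

def auroc_ovr(y_true, scores, positive):
    """One-vs-rest AUROC: probability that a randomly chosen positive
    instance is ranked above a randomly chosen negative one (ties 0.5)."""
    pos = scores[y_true == positive]
    neg = scores[y_true != positive]
    gt = (pos[:, None] > neg[None, :]).sum()
    eq = (pos[:, None] == neg[None, :]).sum()
    return (gt + 0.5 * eq) / (len(pos) * len(neg))

def mauroc(y_true, proba, weighted=False):
    """Unweighted (macro) or support-weighted mean of the one-vs-rest
    AUROC scores over all classes present in y_true."""
    classes = np.unique(y_true)
    aurocs = np.array([auroc_ovr(y_true, proba[:, c], c) for c in classes])
    if weighted:
        support = np.array([(y_true == c).sum() for c in classes])
        return float(np.average(aurocs, weights=support))
    return float(aurocs.mean())

# toy example with three classes and perfect ranking
y_true = np.array([0, 0, 1, 1, 2])
proba = np.array([[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.2, 0.7, 0.1],
                  [0.3, 0.5, 0.2], [0.1, 0.2, 0.7]])
```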
In the evaluation of this article, each model is trained two times with different initial random seeds on each of the five training splits per dataset. This leads to ten training iterations per model and dataset and ultimately results in 90 different comparisons. The original decay replay mining method is selected as a baseline for comparison, as this is the initial model upon which improvements were suggested in Section IV. The 95% confidence intervals are computed based on the mAUROC score differences between the models described above and the baseline model for each dataset and split.
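A sketch of such an interval over per-run score differences is given below; the normal approximation is an assumption, as the article does not state the exact interval construction:

```python
import numpy as np

def ci95(differences):
    """95% confidence interval for the mean score difference between an
    experimental model and the baseline (normal approximation; this is an
    assumption, not necessarily the article's exact method)."""
    d = np.asarray(differences, dtype=float)
    mean = d.mean()
    sem = d.std(ddof=1) / np.sqrt(len(d))  # standard error of the mean
    return mean - 1.96 * sem, mean + 1.96 * sem

# hypothetical per-run mAUROC differences (proposed model minus baseline)
lo, hi = ci95([0.010, 0.012, 0.008, 0.011, 0.009])
```

An interval that lies entirely above zero, as in this toy example, would indicate a consistent improvement over the baseline.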

B. Predictive Performance
In this section, the predictive performance of the proposed decay replay mining extensions is experimentally evaluated. First, the different experiments are described followed by a results discussion.
First, for each of the nine datasets, one Petri net model is discovered using the inductive miner algorithm [48] implemented in PM4Py [49]. The Petri net is discovered based on the complete event log before training-validation-test splitting, as the underlying process model should generalize the behavior of the underlying process as well as possible [22]. Moreover, this potentially mitigates shortcomings of the process discovery algorithm by providing a more comprehensive dataset. At this experimental stage, potential data leakage during process discovery is negligible, as both the baseline method and the proposed approach are performed on top of the same Petri nets.
Multiple different combinations of the proposed approach are evaluated against the baseline, similar to the procedure in [22]. First, the basic decay replay mining approach with only the Swish activation function replacement is tested against the original decay replay mining approach with rectified linear unit activation functions. The objective of this setup, which is denoted by S, is to investigate the exclusive contribution factor of the Swish activation function to the mAUROC scores.
Second, the proposed approach from Section IV is tested without the Swish replacement, i.e., by still leveraging the rectified linear unit activation functions, against the original decay replay mining approach. With this setup, the objective is to find the exclusive contribution of the reachability mask layers. This experiment is denoted by M-R.
Third, all proposals are considered in an experiment denoted by M-S, including the reachability mask layers and the Swish activation function replacement.
As a fourth and fifth experiment, two slight variations of the M-S experiment are performed by increasing and decreasing the dropout rate parameter by five percentage points to investigate whether the dropout rate has an impact. These, too, are compared to the original decay replay mining approach. As described earlier, each experimental and baseline model is trained ten times per dataset rather than five times as in [22]. Two scores, the weighted and unweighted mAUROC score differences to the original decay replay mining method, are obtained. The aggregated results per experiment are visualized in Figs. 4 and 5, showing the 95% confidence intervals for unweighted and weighted mAUROC value differences, respectively.
It can be observed that replacing only the activation function of the decay replay mining method with a Swish activation function does not lead to any improvement. In both cases, the weighted and unweighted mAUROC scores range from slightly lower to unchanged. Therefore, this setup has, if anything, a negative impact on performance and should not be considered as a sole improving extension of the method.
For the experiment denoted by M-R, the results remain mixed. The 95% confidence interval of both, the weighted and unweighted mAUROC score differences, is almost perfectly centered around zero. Therefore, the mask layer as a standalone extension might improve but sometimes worsen the outcome compared to the baseline. Similar to S, M-R should therefore not be considered for improvement of the decay replay mining approach alone.
However, when combining both extensions, i.e., M-S, an exclusively positive 95% confidence interval for both the weighted and unweighted mAUROC score differences can be observed. While the weighted mAUROC seems to improve only by 0.03%-0.2%, the unweighted mAUROC usually increases by about 1%, solely by considering the structural properties of the underlying Petri net and substituting the activation function of the neural network. While the improvements seem rather small, they must be put in relation to the fact that the decay replay mining baseline already performs at a state-of-the-art level [17].
The outcomes of M-S + 0.05 and M-S − 0.05 are very similar to M-S and underscore the fact that the proposed changes improve the decay replay mining predictive performance in terms of AUROC. For these experiments, the 95% confidence intervals of the weighted mAUROC differences mostly overlap with those of M-S; therefore, a statistically significant difference cannot be established. The same applies to the unweighted mAUROC score differences. Therefore, a dropout increase or decrease of 0.05 does not have a positive impact on the overall performance of the proposed modifications.
Another observation can be made when comparing the weighted with the unweighted mAUROC difference confidence intervals. In all experimental cases, the weighted mAUROC confidence intervals are significantly smaller, indicating only minor improvements, if any. Naturally, the unweighted mAUROC confidence intervals are larger. While this could indicate a larger variance, it can also be understood as a sign that the proposed approach helps balance minority class predictions.
While these results are mostly congruent with the initial results obtained in [22], a difference can be spotted when looking at the sizes of the confidence intervals. Due to the increased number of experiments, the work of this article establishes a more trustworthy and robust statistical confidence in the improvement of the proposed changes. In particular, all confidence intervals are narrower than those first reported.
To provide more detail on the aggregated results, Fig. 6 provides the obtained weighted and unweighted mAUROC score differences for each run of the M-S, M-S + 0.05, and M-S − 0.05 experiments per dataset. From here, it can be seen that the largest improvements are usually made on the unweighted mAUROC scores, which underscores the assumption that minority class predictions are particularly improved by the proposed extensions. This is reasonable, as the masking layers and the standardization that recovers a probability vector using (5) can straightforwardly lead to a high probability for an event rarely seen in the training data, thereby emphasizing its probability. Another positive observation from this visualization is that an improvement is usually observed rather than a degradation. Degradation is, if observed at all, rather small. However, it is noticeable that for the BPIC12, BPIC12 offer, and BPIC12 application datasets, improvements could not be observed; here, the proposed extensions seem to have failed. The explanation for the phenomenon observed in the case of the BPIC12 offer and BPIC12 application datasets can be derived from the variance of the visualized data points. As neither an improvement nor a predictive performance degradation is observed, both models perform equally well. This means that the original decay replay mining method performs very well on those datasets, predicting the events equally well and implicitly learning the reachability masks, as originally intended in [17].
The phenomenon for the BPIC12 dataset is different. Here, almost no improvements can be observed but rather small performance decreases. The reason for this behavior can be found within the structure of the Petri net discovered from the BPIC12 dataset. This Petri net is highly parallelized with many hidden transitions. Therefore, the reachability masks are meritless as they cannot reduce the global event set A dynamically, as intended in Section IV. This is a limitation of the proposed approach, which is strongly intertwined with the quality of the discovered Petri net.

[Fig. 7. Petri net discovered from the preprocessed sepsis event log. The initial marking and final marking are shown in green and red, respectively. White rectangles correspond to events mappable to a ∈ A, black rectangles are hidden transitions, and circles represent places.]

C. Selected Deep Dive
In this section, a selected deep dive, specifically a comprehensive healthcare use case, is presented that highlights the advantages of the proposed reachability-based masking of the decay replay mining neural networks for the next event prediction of sepsis patients throughout their hospital journey. The initial situation is described within the problem definition in Section V-C1. The experimental setup is provided in Section V-C2. Finally, the results are discussed in Section V-C3.
1) Problem Definition: Patients that are diagnosed with sepsis upon hospital admission are in a life-threatening condition. As mortality rates of sepsis patients can range from 15% to 56% [50], physicians and nurses are required to perform immediate and correct actions upon the hospital admission to mitigate the life-threatening risks for their patients [51].
In this example, the process under investigation can be described as the patient's trajectory from arrival in the emergency department to patient discharge [52], using observations recorded in an electronic health record system. The event log used originates from a Dutch hospital with 1050 patients showing sepsis symptoms [23]. This event log was originally built to leverage process mining for the analysis of patient trajectories and to validate compliance with medical guidelines [52]. While process mining methods have been shown to be effective in analyzing patient trajectories, this can be taken one step further to provide practitioners with PPM, particularly next event prediction capabilities. In the case of sepsis patients, predicting the next event allows the physician to take a statistical perspective on the patient's future journey to facilitate critical decision-making. PPM can provide the physician with early critical indications, thereby contributing to better decisions that ultimately mitigate the risks for the patient. The objective is therefore to predict the next event given a partially observed patient trajectory as correctly as possible and with a high probability.
2) Application: In order to apply the proposed approach, multiple preprocessing steps are required on the original sepsis event log. The original event log consists of 1050 cases with in total more than 15 000 event occurrences belonging to 16 unique events. The details of the event log and its collection can be found in [52]. Among those unique events is also a readmission event, which is disregarded in this use case as it occurs after patient discharge. The remaining events are considered. However, the laboratory measurement events, i.e., CRP, Leucocytes, and Lactic Acid, carry additional continuous metadata, i.e., the resulting value of the laboratory measurement. These events are expanded: the arithmetic mean value is calculated for each of these laboratory measurement types. If the observed value is larger than the mean, then a discrete suffix high is added to the original event. If it is lower, a suffix low is appended. If the event has no recorded continuous value, the suffix none is appended. This discretization preprocessing has been successfully applied in the work of [10] and is therefore adopted in this use case. Ultimately, this leads to a preprocessed sepsis event log with 21 unique events, 14 920 event occurrences, and an average patient trajectory, i.e., case length, of 14.2 event occurrences.
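A minimal sketch of this discretization step is shown below; the DataFrame column names `event` and `value` are assumptions for illustration, not the original log schema:

```python
import pandas as pd

def discretize_lab_events(df, lab_events=("CRP", "Leucocytes", "Lactic Acid")):
    """Append a 'high'/'low'/'none' suffix to laboratory measurement events
    based on the arithmetic mean of each measurement type."""
    df = df.copy()
    for event in lab_events:
        rows = df["event"] == event
        mean = df.loc[rows, "value"].mean()  # NaN values are skipped

        def suffix(v, m=mean):
            if pd.isna(v):
                return "none"
            return "high" if v > m else "low"

        df.loc[rows, "event"] = event + " " + df.loc[rows, "value"].map(suffix)
    return df

# hypothetical mini-log: CRP mean is 20, so 10 -> low, 30 -> high
log = pd.DataFrame({
    "event": ["CRP", "CRP", "CRP", "ER Registration"],
    "value": [10.0, 30.0, None, None],
})
out = discretize_lab_events(log)
```

Events outside the three laboratory measurement types pass through unchanged.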
Similar to the experimental evaluation in Section V, this event log is used to discover a Petri net using the inductive miner algorithm. The resulting Petri net is visualized in Fig. 7. The preprocessed event log is randomly split five times into training, validation, and test datasets with a 60/20/20 split ratio. For each split, M-S and the original decay replay mining method are each trained two times. Then, the same metrics as before are evaluated.

3) Results:
From the obtained results that are visualized in Fig. 8, it can be seen that the proposed approach leads on average to an improvement of predictive performance. This supports the outcomes of the experiments performed in Section V-B. The unweighted mAUROC score differences tend to be larger than the weighted ones, indicating that the less frequent events are predicted better. Moreover, the 95% confidence intervals show that an improvement can usually be assumed when applying the proposed extensions for predicting the next event of sepsis patient hospital trajectories. By more closely interlocking the Petri net graph structure with decay replay mining, one can obtain up to 1% in mAUROC improvement based on the observed results.
Furthermore, the hypothesis is investigated that the proposed extensions lead to a better prediction of rare events. Therefore, each obtained one-vs-rest AUROC score for each unique event is plotted in a 2-D space where the Y-axis shows the AUROC score and the X-axis the support of the event. The results are visualized in Fig. 9. By visual inspection, the results confirm the hypothesis, as it can be seen that improvements are mainly observed for events with low support. The reachability-based masking seems to have less of an impact the larger the support of an event gets. Therefore, the reachability masking seems to address a weak point of decay replay mining, i.e., the prediction of rare events.
In this applied section, it has been demonstrated that reachability-based masking of neural networks for PPM can improve the prediction of the next events of patient trajectories using the specific example of sepsis patients admitted to the hospital. This can provide physicians and other responsible stakeholders a tool that supports them in decision-making.
Most importantly, it has been shown that due to the proposed extensions, rare events can be predicted better using information that is visualizable in a reachability graph and therefore interpretable to the practitioner.

VI. CONCLUSION
This article investigated a method to further interlock process discovery outcomes with PPM and is an extension of the work in [22]. The state-of-the-art decay replay mining approach for next event prediction has been considered for experimental extension. This algorithm builds upon a Petri net, usually discovered from process mining efforts, and calculates state representations using an event log, which are then fed to a neural network for next event prediction. The three extensions of the algorithm that were initially proposed in [22] have been extended with rationales and more detailed explanations. This includes, first, the simultaneous discovery of a reduced reachability graph; then, the calculation of a reachability mask to dynamically limit the subset of next event candidates; and, finally, a neural network adaptation using a reachability-based network layer and the replacement of activation functions.
Compared to the initial publication in [22], a comprehensive and complementary set of tenfold experiments has been performed, leading to more robust and statistically trustworthy outcomes. These experiments have statistically shown that the proposed approach improves the quality of predictions. However, the outcomes of this work relativize those of [22] due to its limited set of experiments. In addition, an analysis per dataset has been performed, providing new insights into the circumstances under which the proposed approach is useful. Finally, a healthcare use case has been selected and investigated in-depth using sepsis patient trajectories throughout a Dutch hospital. This experiment underscores the validity of the claims made in the earlier experiments. Moreover, this use case has shown that the proposed approach specifically increases the predictive performance of rare events.
However, from the experimental evaluation, further limitations of the approach were identified. On the one hand, it has been clearly shown that the approach does not work with Petri nets of limited quality. In the case of the BPIC12 dataset, the Petri net allowed for too much behavior, and therefore, the reachability graph was not able to reduce the set of potential next events significantly. As a consequence, the reachability masks do not have any value and lead solely to a meritless increase of neural network complexity. This highlights the need for high-quality process discovery algorithms such that discovered Petri nets do reflect realistically the behavior of the process at any point in time. This limitation also highlights the importance of research efforts such as [15] and [53] to measure and obtain trustworthy and generalizing process models.
On the other hand, it has also been observed that the proposed approach does not lead to predictions of higher quality if the underlying Petri net correctly depicts the process behavior and is at the same time simple. As seen in the case of the BPIC12 offer and BPIC12 application datasets, neither an improvement nor degradation of the predictions could be observed compared to the baseline model. This is a strong indicator that the baseline decay replay mining model already learned the structure of the Petri net well without a strong intertwinement of the neural network and the Petri net, hence making the reachability graph-based masking redundant.
Future research is anticipated in three directions. First, measurements between an event log and a Petri net should be investigated to provide an early indicator of whether reachability-based masking will be advantageous when planning to apply decay replay mining to the PPM task of next event prediction. Second, it is assumed that better generalizing process models lead to a more precise reachability mask, therefore benefiting the proposed approach [22]. An empirical study is anticipated to investigate whether this relationship in fact exists. Third, a nonmetric-based evaluation is recommended, focusing on process experts, to evaluate whether the proposed enhancements provide more clarity and value to practitioners and to investigate how their process knowledge can be added to the prediction process.