Improving Process Discovery Algorithms Using Event Concatenation

Process mining is the discipline of analyzing and improving processes recorded in event logs. Real-life event logs contain noise, infrequent behaviors, and extensive concurrency; as a result, the process models generated by process discovery algorithms are often inefficient and complex. Because of these shortcomings in the event log, current process discovery algorithms fail to pre-process the data adequately and to describe real-life phenomena. Existing process mining algorithms are also limited by their filtering mechanisms, parameters, and pre-defined features. It is critical to use a high-quality event log to generate a robust process model; however, pre-processing an event log is usually a cumbersome and challenging procedure. In this paper, we propose a novel pre-processing step aimed at obtaining a higher-quality event log from a set of raw data and, consequently, a better-performing process model. The proposed approach concatenates events that hold concurrent relations, based on the probability of the frequency of concurrent events, producing simpler and more accurate process models. The performance of the pre-processing approach is evaluated on 18 publicly available real-life benchmark datasets. We show that the proposed pre-processing framework significantly reduces the complexity of the process model and improves the model's F-Measure.


I. INTRODUCTION
The goal of process mining techniques is to support organizations by discovering, monitoring, and improving their processes [1]. Process mining has applications in a variety of fields such as health care [2][3][4], insurance [5], and the manufacturing industry [6]. Process mining consists of three steps: process discovery, conformance checking, and enhancement [7]. Process discovery transforms an event log into a process model that describes the behavior of the processes in the form of Petri Nets (PN) [8], Business Process Modeling Notation (BPMN) [9], Event-driven Process Chains (EPCs) [10], and Causal Nets (CN) [11]. Alpha-Algorithm [12], Heuristics Miner [13], Fodina [14], Inductive Miner [15], and Split Miner (SM) [16] are some examples of existing process discovery algorithms. Conformance checking illustrates the deviations of the event log from the process model. Finally, the enhancement step focuses on improving the process by implementing modifications to the process model and to the event log as well. Among these three steps, process discovery is the most fundamental, as it demonstrates how process instances are executed in real life.
Most process discovery algorithms use all available data in an event log to produce a process model. However, noise, inappropriate infrequent behaviors, and concurrency are present in real-life event logs. The expression garbage-in garbage-out (Figure 1) captures the fact that the poor quality of an event log leads to poor quality of the corresponding generated process model. Therefore, the quality of the event log information is of the utmost importance when it comes to the generation of a process model [17].
The objective of the current work is to propose a pre-processing step aimed at producing a simpler and more accurate process model. This approach concatenates events that hold concurrent relations based on a probability algorithm, which is explained comprehensively in the methodology section. The proposed model is tested on 18 real-life benchmark datasets, resulting in a statistically significant reduction in the complexity of the process model and improvement of the F-Measure metric compared with using the raw event log. We also compared the results of our method with those of the 4 best recent pre-processing approaches and determined whether the differences are statistically significant.
The structure of this paper is as follows. Section II summarizes the literature review. The preliminaries are provided in Section III. We focus on the proposed approach in Section IV, and Section V evaluates the approach by testing it against the raw event logs. We then conclude the paper and discuss future work in Section VI.

II. RELATED WORKS
Most process discovery algorithms use all behavior of the event log to generate a process model [7], [11], [18]-[20]. Although recent process discovery algorithms have been shown to remove some of the noise and infrequent behaviors in an event log [14], [15], the resultant process model is not clear in its execution semantics. Even though pre-processing of an event log is a cumbersome and ad hoc task, several methods have been developed that directly pre-process the raw event log.
A general rule for filtering activities is to filter out infrequent activities from the main event log. An example of an efficient tool that supports activity filtering is the plugin Filter Log using Simple Heuristics in the ProM process mining toolkit [21]. Another process discovery tool, named Inductive Visual Miner, has been developed by Leemans et al. [14]. Inductive Visual Miner filters activities by using a slider mechanism.
Another technique for performing the pre-processing tasks is to distinguish outlier traces in an event log and to filter them out. An example is provided by Ghionna et al. [22]. This technique initially determines frequent patterns in an event log and applies the Markov Cluster algorithm (MCL) to the traces of the event log. In this method, outlier traces are those that are unassigned to a cluster, and they are subsequently filtered out from the event log.
Lu et al. proposed another novel method, which uses event mapping to distinguish outlier events from the events that are part of the mainstream behavior of a process. Between two executions of the process, events that were mapped to each other are considered similar in behavior, and unmapped events are considered dissimilar. Dissimilar behaviors are treated as outlier behaviors and are filtered out from the event log to generate a more accurate process model [23].
A supervised manual approach has been proposed by Cheng and Kumar, which filters noisy traces from an event log. The authors used the Process Rule-Induction Modular (PRISM) algorithm to train on marked sub-logs, which are labeled with noisy and clean traces. Unmarked sub-logs are then predicted as noisy or clean. The noisy traces are subsequently removed from the entire log [24].
Another related work builds a prefix automaton of the event log, as recently proposed by Conforti et al. [25]. The method filters out outlier events from the event log by using an Integer Linear Programming (ILP) solver. A prefix automaton of the event log that is minimal in terms of the number of arcs is built. Infrequent arcs, and the events belonging to these arcs, are finally removed from the event log.
Another technique has been developed by Fani Sani et al. In this work, sequential pattern mining techniques are used to differentiate between outlier events and mainstream behavior events [26].
More recently, Suriadi et al. have suggested another pattern-based approach to provide a document that contains typical problems any event log may encounter and provides solutions to these problems. The authors confirm that these document patterns can serve as a repository of knowledge for analysis that is conducted in a semi-automated manner [27]. Tax et al. showed that filtering out chaotic activities from an event log helps to discover more accurate process models. Chaotic activities are defined to be those that can happen extemporaneously at any point in the process execution. Direct and indirect entropy-based activity filtering is used in separating chaotic activities from an event log [28].
Fernandez et al. proposed a pre-processing step to extract execution workflows from raw event logs by using the Google Cluster traces method. The process model is then characterised and analysed by measuring complexity metrics [29].
Fani Sani et al. proposed an instance selection method that selects specific variants or traces in the event logs to be fed to the process discovery algorithms. They introduced two types of selection: variant-based and trace-based sampling. In variant-based sampling, each variant in a sampled event log represents all the traces with the same behavior in the original event log, so the frequency of each sample is 1. In trace-based sampling, on the other hand, the frequency of each unique sample varies and is not necessarily 1 [30]. Fischer et al. presented an approach to detect and remove timestamp-related issues (timestamp imperfections) in an event log. They defined 15 metrics related to timestamp quality across two axes: four levels of abstraction (event, activity, trace, log) and four quality dimensions (accuracy, completeness, consistency, uniqueness) [31].
The summary of the mentioned methods is shown in Table 1.

Table 1. Summary of related pre-processing methods.
Ref.  Year  Findings
[21]  2005  Filtering activities using Simple Heuristics in the Heuristic Miner algorithm
[14]  2013  Filtering activities using a slider mechanism in the Inductive Visual Miner algorithm
[22]  2008  Detecting and filtering outlier traces from an event log by applying MCL to the traces of the event log
[23]  2005  Detecting outlier events from an event log through event mapping
[24]  2015  Detecting and filtering out noisy traces from an event log by using the PRISM rule-induction algorithm
[25]  2017  Filtering out outlier events from the event log by using an ILP solver
[26]  2018  Detecting outlier events by using sequential pattern mining techniques
[27]  2017  Identifying typical problems that an event log may encounter through a pattern-based approach
[28]  2019  Filtering out chaotic activities from an event log
[29]  2020  Filtering noisy traces by using the Google Cluster traces method
[30]  2020  Filtering out specific traces using an instance selection method
[31]  2020  Detecting and eliminating timestamp-related issues in the event log through the proposed model

The above pre-processing methods are useful for generating a more accurate and simpler model when an event log exhibits little concurrency between events. However, when the event log contains many concurrent relations between events, these techniques end up removing most of the relevant traces. Since most real-life processes contain concurrency between events as well as loops, this motivates us to develop a new pre-processing method that yields significant results for any type of event log, not just event logs with little concurrency.
To achieve this, we first find the probability of the frequency of pairs of events with a concurrent relation in the event log. We then concatenate these events in descending order of their probability sums. When two pairs of events have the same probability sum, a re-position function is performed, which is explained in detail in the methodology section. In the end, self-loops are removed from the event log. The resultant event log is fed to the SM process discovery algorithm to generate an efficient process model. The efficiency of the process model is evaluated by measuring common metrics, such as F-Measure and complexity, on 18 real-life benchmark datasets. We demonstrated statistical improvements in the efficiency of the process model by comparing the results of our pre-processing approach to those of using the raw event log. Moreover, the results of our method were statistically compared to those of the 4 best recent pre-processing approaches.

III. PRELIMINARIES
In this section, we introduce the process mining notation used throughout this paper. The preliminaries include brief definitions of the event log, directly-follows frequency, self-loops, concurrency relations, PN, F-Measure, complexity, and the SM algorithm, followed by an introduction to process mining and the notation that eases the understanding of our pre-processing approach.

Definition 1: Event Log.
Let E be the universe of events and T the set of activities. A trace t is a finite sequence of events, t = ⟨e_1, e_2, ..., e_n⟩ with e_i ∈ E for 1 ≤ i ≤ n. An event log L is a multi-set of traces over T, i.e., L ∈ B(T*). Moreover, each event refers to a task executed within a process and carries a label; we retrieve the label of an event e with the labeling function λ, written λ(e) [16].
Definition 2: Directly-Follows Frequency.
Given an event log L and any two event labels a and b, the directly-follows frequency of a → b denotes the number of times b immediately appears after a in at least one trace of the given event log L. We denote this frequency by |a → b| [16].
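As a sketch of the directly-follows frequency, the counts |a → b| for an entire log can be collected in a single pass over each trace. This is a hypothetical helper for illustration, not the implementation used in the paper; a log is assumed to be a list of traces, each a list of labels:

```python
from collections import Counter

def directly_follows(log):
    """Count |a -> b|: how often label b immediately follows label a
    across all traces of the event log (a list of label sequences)."""
    freq = Counter()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            freq[(a, b)] += 1
    return freq

# Toy log with two traces.
log = [["a", "b", "c"], ["a", "c", "b"]]
print(directly_follows(log)[("a", "b")])  # 1
```

Because a `Counter` returns 0 for absent keys, |a → b| = 0 falls out naturally for pairs that never appear adjacently.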

Definition 3: Self-loops.
A self-loop exists in an event log L if |a → a| is positive for some event e ∈ E with λ(e) = a [16].

Definition 4: Concurrency Relation.
Given an event log L, any two labels a and b, and any two events e_i, e_j ∈ E with λ(e_i) = a and λ(e_j) = b, the two events are said to have a concurrent relation, denoted (e_i || e_j), if and only if |a → b| > 0 and |b → a| > 0 [16].
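Given the directly-follows frequencies, Definition 4 reduces to a symmetry check: a pair of labels is concurrent exactly when each directly follows the other at least once. A minimal sketch, where `freq` is assumed to map ordered label pairs to their directly-follows counts:

```python
def concurrent_pairs(freq):
    """Return label pairs (a, b) with |a -> b| > 0 and |b -> a| > 0."""
    return {tuple(sorted(pair))
            for pair, n in freq.items()
            if n > 0 and pair[0] != pair[1]
            and freq.get((pair[1], pair[0]), 0) > 0}

freq = {("b", "c"): 2, ("c", "b"): 1, ("a", "b"): 3}
print(concurrent_pairs(freq))  # {('b', 'c')}
```

Here ("a", "b") is excluded because b never directly follows into a, so the relation holds in only one direction.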

Definition 5: Petri Net.
A Petri net (PN) is a mathematical modeling technique used to represent a process model in many different areas, such as healthcare and manufacturing. A PN can be formally described as a tuple (P, T, F, π), where P is a set of places, T is a set of transitions, F is a set of directed arcs, and π is a function that maps transitions to activities [32]. PNs are visualized as follows: circles represent places and rectangles represent transitions. The unidirectional arcs in the set F connect transitions to places and vice versa. The function π maps each transition to an activity u ∈ U; note that an activity can be non-observable, in which case the corresponding transition is called a hidden transition and is shown as a black rectangle. A PN contains at least one initial marking and one final marking, which represent the start and the end of a process execution, respectively. A transition is enabled if all of its input places hold at least one token; only an enabled transition can fire. When a transition fires, one token is removed from each of its input places, so the number of tokens in those places is reduced by one, and one token is added to each of its output places, so the number of tokens in those places increases by one. One of the important properties to maintain in a PN is soundness [1].
Definition 6: F-Measure.
F-Measure is a common metric used to measure the quality of discovered process models. It creates a balance between the fitness and precision values; in other words, F-Measure is a trade-off between fitness and precision [33], calculated as their harmonic mean:

F-Measure = 2 × (Fitness × Precision) / (Fitness + Precision),

which, in classification terms, corresponds to F = 2TP / (2TP + FP + FN), where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives. Note that fitness defines how well a process model reproduces the behavior of the event log; a fitness of 1 means the model is able to reproduce all traces in the event log. In this paper, we use an alignment-based approach to measure fitness [34]. In the alignment-based method, the degree of alignment between the real traces in the log and the traces produced by the process model is measured.
Precision shows the capability of a model to produce only the behaviors that are found in the event log. A value of 1 indicates that all traces produced by the process model appear in the event log. In this work, the alignment-based method of [35] is used to measure precision.
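Once the alignment-based fitness and precision values are known, combining them into the F-Measure is a one-liner. A minimal sketch (the guard for the degenerate all-zero case is our own assumption):

```python
def f_measure(fitness, precision):
    """Harmonic mean of fitness and precision, as used for F-Measure."""
    if fitness + precision == 0:
        return 0.0
    return 2 * fitness * precision / (fitness + precision)

print(f_measure(0.9, 0.6))  # 0.72
```

The harmonic mean penalises imbalance: a model with fitness 0.9 but precision 0.6 scores 0.72, well below the arithmetic mean of 0.75.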

Definition 7: Complexity.
Complexity is another metric used to measure the quality of a process model. Complexity shows how easy a process model is to understand. Several metrics are used to measure the complexity of a model [36]. In this work, Control-Flow Complexity (CFC) [37], the size of the model [38], and structuredness [38] are used to measure the complexity of the models. CFC measures how much branching is induced by the split gateways in a process model; the higher the CFC value, the more complex the model. The size of a process model counts the number of nodes and arcs in the model; the lower the number of nodes and arcs, the lower the complexity. Finally, a process model is structured if, for every split gateway in the process model, there is a corresponding join gateway. The more structured a process model is, the simpler it is.
SM is a process discovery algorithm that produces a sound PN from an event log, and it shows markedly more improvement in producing models with lower complexity and higher F-Measure values than other process discovery algorithms. The SM algorithm includes the following 5 steps. First, it generates a directly-follows graph and identifies short loops. Second, it discovers the existing concurrency relations between events. Then it applies filtering; to guarantee the soundness of this process, each node must be located on a path from the start node to the end node. Afterward, the algorithm derives choice and concurrency relations by adding split gateways. In the end, the join gateways are discovered. To accomplish this, SM introduces 2 threshold values: the frequency threshold ε, which controls the filtering process, and η, which controls the concurrency relations. Both of these parameters are percentiles whose numerical values lie in the range 0 to 1. The output of these steps is a BPMN model, which can be converted to a PN using ProM's BPMN Miner package [1]. This algorithm is publicly available as a Java application [39].
Process mining consists of 3 steps: process discovery, conformance checking, and enhancement of the process model [1]. An event log is the input of a process discovery algorithm and the output is the process model that can be in different forms such as PN, BPMN, EPCs, and CNs. We focus on BPMN, which is the output of the SM. Conformance checking evaluates whether the process model is an accurate representation of the event log. The most commonly used metrics of evaluations are F-Measure and complexity. Enhancement is used to provide improvement to the event log and its corresponding process model. By doing process discovery and conformance checking, we will uncover any problems, loopholes, and potential solutions, which can be subsequently implemented. Finally, we provide recommendations to improve a process model.
Given an event log L containing n events and m traces, we define

f(e_i) = count(e_i),

where count(e_i) is the frequency, i.e., the total number of times the event e_i is present across every trace t_j in the event log L.
Let e_i and e_j be two events in the given event log L. We define f(e_i → e_j) as the frequency, or number of times, that e_j happens immediately after e_i in L; similarly, f(e_j → e_i) denotes the number of times that e_i happens immediately after e_j in L. We denote the probability of the frequency that an event e_j occurs immediately after event e_i as

P(e_i → e_j) = f(e_i → e_j) / f(e_i),

and, similarly, the probability of the frequency that an event e_i occurs immediately after event e_j as

P(e_j → e_i) = f(e_j → e_i) / f(e_j).

Let P(e_i || e_j) = P(e_i → e_j) + P(e_j → e_i) be the probability sum of a concurrent set of events e_i || e_j, and let P(e_k || e_l) = P(e_k → e_l) + P(e_l → e_k) denote the probability sum of a second set of concurrent events e_k || e_l. For two concurrent pairs (e_i, e_j) and (e_k, e_l) ∈ L such that e_i || e_j and e_k || e_l, we define an index for the concurrent pair e_i || e_j as (I^s_{e_i}, I^{s+1}_{e_j}), i.e., e_i occurs at position s and e_j at position s+1 of a trace, where 0 ≤ s < ∞. We define a second index for e_i || e_j as (I^q_{e_i}, I^{q+1}_{e_j}), where 0 ≤ q < ∞. Similarly, for the second pair e_k || e_l, we define the indices (I^s_{e_k}, I^{s+1}_{e_l}) and (I^q_{e_k}, I^{q+1}_{e_l}).
Let e_i and e_j be two concurrent events. We define F^s_{e_i e_j} as the total number of instances in which e_i and e_j are concurrent at the index (I^s_{e_i}, I^{s+1}_{e_j}), and F^q_{e_i e_j} as the total number of instances in which e_i and e_j are concurrent at the index (I^q_{e_i}, I^{q+1}_{e_j}), over all traces t_j in the event log L.
Finally, p * denotes the cut-off probability in a given concurrent relation.
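The notation above can be made concrete with a small sketch. Assuming P(e_i → e_j) is the directly-follows count normalised by the frequency of the preceding event (our reading of the definitions), the probability sum of a concurrent pair is:

```python
from collections import Counter

def probability_sum(log, a, b):
    """P(a||b) = P(a -> b) + P(b -> a), with P(a -> b) = f(a -> b)/f(a).
    Assumes both a and b occur in the log (no zero-division guard)."""
    count = Counter(e for t in log for e in t)            # f(e)
    df = Counter(p for t in log for p in zip(t, t[1:]))   # f(x -> y)
    p_ab = df[(a, b)] / count[a]
    p_ba = df[(b, a)] / count[b]
    return p_ab, p_ba, p_ab + p_ba

# a and b follow each other in both orders, so they are concurrent.
log = [["a", "b"], ["b", "a"], ["a", "b"]]
print(probability_sum(log, "a", "b"))
```

On this toy log, f(a) = f(b) = 3, f(a → b) = 2, and f(b → a) = 1, giving directional probabilities 2/3 and 1/3 and a probability sum of 1.0.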

IV. APPROACH
The proposed approach is a pre-processing algorithm that aims to remove concurrency and self-loops, based on a probability algorithm, to improve the quality of an event log and thereby achieve an optimal process model. The method achieves a higher F-Measure compared to using the raw event log. Moreover, our approach results in a simpler and more accurate model.
The approach consists of the following steps. Given an event log L containing sequences of events E = ⟨e_i, e_j, ..., e_n⟩, we find P(e_i || e_j) for all possible combinations of concurrent events (e_i || e_j) in the event log. We introduce a threshold value p* and select only the combinations whose probabilities pass the threshold: a given concurrent relation is selected if and only if both of its corresponding probabilities exceed the predefined threshold p*. We then add the probabilities of the selected combinations and sort the sums in descending order. We perform a re-position step for the combinations with equal probability sums. After the re-position step, we concatenate the ordered combinations on the original event log. Finally, we remove all the self-loops present in the event log. The intuition behind concatenating events with a concurrent relation is that, in many real-life event logs, concurrency accounts for a significant part of the behavior captured in the log; therefore, by concatenating some of the concurrent events, the complexity of the model decreases significantly. The various steps of the proposed model are shown in Figure 2. The corresponding source code is publicly available on our GitHub repository.

A. SELECTING COMBINATIONS OF CONCURRENT EVENTS
1) Discovering concurrent events.
In this stage, we discover all concurrent relations between all sets of 2 events by calculating all possible combinations of 2 events using C(n, 2) = n! / (2!(n−2)!) and choosing only those pairs for which both (e_i → e_j) and (e_j → e_i) occur.
2) Finding P(e_i → e_j) and P(e_j → e_i).
We calculate the frequency f(e_i) for each event e_i in each trace t_j ∈ L.
We then calculate f(e_i → e_j) and f(e_j → e_i) for only the concurrent events e_i || e_j, and from them P(e_i → e_j) and P(e_j → e_i).

3) Introducing the threshold value p*.
We introduce the threshold value p*, which denotes the cut-off probability in a given concurrent relation. The value of the threshold p* is found via hyper-parameter optimization and selection of the highest F-Measure values.

4) Selecting combinations using p*.
The concurrent combination of events will be chosen if and only if both P (e j → e i ) and P (e i → e j ) exceed the value of threshold p * such that, P (e i → e j ) > p * and P (e j → e i ) > p * .
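The selection and ordering of steps 3-5 can be sketched as a filter followed by a descending sort on the probability sums. The `prob` mapping of pairs to their two directional probabilities is a hypothetical input, standing in for the values computed in step 2:

```python
def select_combinations(prob, p_star=0.7):
    """Keep a concurrent pair only if BOTH directional probabilities
    exceed p*, then sort the survivors by descending probability sum."""
    sums = {pair: p1 + p2
            for pair, (p1, p2) in prob.items()
            if p1 > p_star and p2 > p_star}
    return sorted(sums, key=sums.get, reverse=True)

prob = {("a", "b"): (0.8, 0.9),   # both exceed 0.7 -> kept, sum 1.7
        ("c", "d"): (0.9, 0.5),   # one direction too low -> dropped
        ("e", "f"): (0.75, 0.8)}  # kept, sum 1.55
print(select_combinations(prob))  # [('a', 'b'), ('e', 'f')]
```

Note that ("c", "d") is dropped even though one of its probabilities is high, because the rule requires both directions to exceed p*.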

5) Adding and sorting concurrent event probabilities.

This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication.

Lastly, we take the sum of P(e_i → e_j) and P(e_j → e_i) for all concurrent events that meet the condition above. For example, let P(e_i || e_j) = P(e_i → e_j) + P(e_j → e_i) be the probability sum of a concurrent set of events e_i || e_j, and let P(e_k || e_l) = P(e_k → e_l) + P(e_l → e_k) denote the probability sum of a second set of concurrent events (e_k, e_l).
We then sort the P(e_i || e_j) values of all chosen concurrent events in descending order, e.g., P(e_i || e_j) > P(e_k || e_l).

B. RE-POSITION OF THE COMBINATION WITH EQUAL PROBABILITY SUMS
This step uses the selected combinations of step A as input. In this stage, the selected combinations with a concurrent relation are sorted in descending order based on their probability sums. However, we may come across several cases with equal probability sums, in which case a re-position is necessary to sort them for the concatenation step. Re-ordering the position of those combinations is done according to the following steps.
P(e_i || e_j) = P(e_k || e_l):
Consider concurrent events (e_i, e_j) and (e_k, e_l) ∈ L such that P(e_i || e_j) = P(e_k || e_l).
1) For 2 concurrent events e_i || e_j, we find all the indices (I^s_{e_i}, I^{s+1}_{e_j}) of (e_i, e_j) for every trace t_j ∈ L.
2) For e_i || e_j, we calculate F^s_{e_i e_j} and F^q_{e_i e_j}, the total number of instances in which e_i and e_j are concurrent at the indices (I^s_{e_i}, I^{s+1}_{e_j}) and (I^q_{e_i}, I^{q+1}_{e_j}), respectively.
3) If F^q_{e_i e_j} > F^s_{e_i e_j}, then we announce the indices (I^q_{e_i}, I^{q+1}_{e_j}) as the most probable, or most likely, index for the concurrent events e_i || e_j.
4) We repeat steps 1-3 for all the concurrent event combinations with equal probability sums.
P(e_i || e_j) = P(e_k || e_l) and F^s_{e_i e_j} = F^q_{e_i e_j}:
We take 2 concurrent events (e_i, e_j) and (e_k, e_l) ∈ L such that P(e_i || e_j) = P(e_k || e_l) and F^s_{e_i e_j} = F^q_{e_i e_j}.
1) We take 2 indices for the concurrent events e_i || e_j, (I^s_{e_i}, I^{s+1}_{e_j}) and (I^q_{e_i}, I^{q+1}_{e_j}), such that F^s_{e_i e_j} = F^q_{e_i e_j}.
2) We find that (I^s_{e_i}, I^{s+1}_{e_j}) happens before (I^q_{e_i}, I^{q+1}_{e_j}) for e_i || e_j.
3) We announce (I^s_{e_i}, I^{s+1}_{e_j}) as the earlier index of the two.
4) We repeat steps 1-3 for all the concurrent event combinations with equal probability sums and equal frequencies for the indices.

P(e_i || e_j) = P(e_k || e_l) and F^s_{e_i e_j} = F^q_{e_i e_j} and (I^s_{e_i}, I^{s+1}_{e_j}) = (I^q_{e_i}, I^{q+1}_{e_j}):
In the rare scenario where 2 combinations have the same probability sum and the same most likely indices in an event log, we choose the order of the combinations for concatenation randomly.
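The three tie-breaking cases above amount to a lexicographic sort key: higher probability sum first, then higher frequency at the most likely index, then the earlier index, with any remaining full ties broken randomly. The tuple layout below is our own hypothetical encoding of these statistics:

```python
import random

def order_combinations(stats, seed=None):
    """Each entry: (pair, prob_sum, freq_at_best_index, best_index).
    Entries still tied after all three criteria keep a random relative
    order, because the input is shuffled before the stable sort."""
    rng = random.Random(seed)
    entries = list(stats)
    rng.shuffle(entries)  # randomises the order of full ties
    return [p for p, *_ in sorted(entries, key=lambda s: (-s[1], -s[2], s[3]))]

stats = [(("a", "b"), 1.5, 2, 4),   # tied sum and frequency, later index
         (("c", "d"), 1.5, 2, 1),   # tied sum and frequency, earlier index
         (("e", "f"), 1.8, 1, 0)]   # highest probability sum
print(order_combinations(stats, seed=0))  # [('e', 'f'), ('c', 'd'), ('a', 'b')]
```

Because Python's sort is stable, shuffling first and then sorting leaves only the fully tied entries in random order, matching the last re-position case.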

C. CONCATENATION AND REMOVING SELF-LOOPS
1) Replacement with the concatenated event.
The order of the combinations for concatenation was finalized in the previous step. Our algorithm selects the ordered combinations one by one for concatenation. To concatenate the 2 events in a combination, suppose we have a combination (e_i, e_j): we remove the event e_i and replace e_j with the concatenated event e_i ⊕ e_j. Note that the name of the concatenated event is derived from the names of the 2 events being concatenated; in this case, the name of the concatenated event is e_i ⊕ e_j. This step is repeated until all selected combinations have been concatenated in the event log.

2) Replacement with the most probable combination.
After concatenating all combinations in the event log in descending order, there may remain events that are not yet concatenated. We examine each such event individually and check whether e_i ∈ e_i ⊕ e_j, i.e., whether the event is part of a concatenated event. If so, we remove e_i and replace it with e_i ⊕ e_j. If e_i ∉ e_i ⊕ e_j, the final trace remains unchanged.

3) Removing self-loops.
Finally, we remove all the self-loops, i.e., occurrences of e_i → e_i. Let E = ⟨e_i, e_j, ..., e_n⟩ be the sequence of events in a given trace t_j and T = {t_1, t_2, ..., t_n} be the set of all traces in the event log L. Our algorithm is as follows:

Algorithm 1 Stochastic Concatenation Algorithm
Input: Raw event log, L;
Output: Output event log, L'';
Initialisation:
1: choose threshold 0 < p* < 1; C ← ∅;
LOOP Process
2: for each pair e_i || e_j ∈ L, i < j do
3:   if P(e_i → e_j) > p* and P(e_j → e_i) > p* then
4:     add e_i || e_j to C;
5:   end if
6: end for
7: D ← pairs of C sorted by descending probability sum P(e_i || e_j);
8: for each pair e_i || e_j ∈ D with equal probability sums do
9:   find all F^s_{e_i e_j}, the total number of instances in which e_i and e_j are concurrent at the index (I^s_{e_i}, I^{s+1}_{e_j});
10:  announce the index with the highest frequency as the most likely index; if the frequencies are equal, announce the earlier index;
11:  if (I^s_{e_i}, I^{s+1}_{e_j}) = (I^q_{e_i}, I^{q+1}_{e_j}) then
12:    G ← order of the tied combinations chosen randomly;
13:  end if
14: end for
15: replace the keys of D that have equal probability sums with the corresponding ordered keys of G;
16: for each trace t ∈ T and each pair e_i || e_j ∈ D do
17:   if e_i || e_j occurs in t then
18:     remove e_i and replace e_j with e_i ⊕ e_j;
19:   end if
20: end for
21: L'' ← remove all self-loops e_i → e_i;
22: return L''
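The core of the concatenation and self-loop removal can be sketched on a single trace. This is a simplified illustration, not Algorithm 1 itself: each adjacent occurrence of a selected pair, in either order, is merged into one concatenated event, and any self-loops that result are dropped:

```python
def concatenate_pair(trace, pair, sep="+"):
    """Merge adjacent occurrences of the pair (either order) into one
    concatenated event, then drop the resulting self-loops."""
    a, b = pair
    out, i = [], 0
    while i < len(trace):
        if i + 1 < len(trace) and {trace[i], trace[i + 1]} == {a, b}:
            out.append(a + sep + b)  # name derived from the two event names
            i += 2
        else:
            out.append(trace[i])
            i += 1
    # remove self-loops e_i -> e_i
    return [e for k, e in enumerate(out) if k == 0 or e != out[k - 1]]

print(concatenate_pair(["x", "a", "b", "b", "a", "y"], ("a", "b")))
# ['x', 'a+b', 'y']
```

Here "a b" and "b a" both collapse to the same concatenated event, and the resulting "a+b" → "a+b" self-loop is removed, so the trace shrinks from six events to three.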

V. EVALUATION
In this section, the proposed model is evaluated. After modifying the event log, we used the SM algorithm to generate the process models. The SM algorithm takes an event log in XES or MXML format and the thresholds η and ε as inputs, and produces a BPMN process model as output. For the evaluation, a set of publicly available logs was used. In the following parts, we provide a summary of the benchmark datasets and then discuss the results.

A. DATASETS
The 4TU Centre for Research Data, as of August 2021, provides a collection of several real-life event logs [40]. The event logs listed in Table 2 were used to evaluate the proposed approach: all logs of the annual Business Process Intelligence Challenge (BPIC), the Road Traffic Fine Management Process (RTFMP), the SEPSIS Cases log from a hospital, in which each trace denotes a pathway through the hospital, and the Hospital event log, which contains information related to billing. These logs record executions of processes in a variety of fields such as healthcare, finance, and government affairs. A pre-processing step [41] was applied to remove infrequent behaviors from the BPIC15, BPIC14, and BPIC17 logs because of the complexity of these logs; without it, the process discovery algorithm generates a model with an F-Measure close to zero. The statistics of the logs are shown in Table 2.

B. RESULTS
To conduct the experiments, we used RapidProM [42] which extends RapidMiner with process mining analysis capabilities. This platform helps us to use a workflow for the experiments. The algorithm takes a threshold value of p * to choose the combinations of concurrent events for the concatenation step.
To find the optimal value of p*, we ran hyper-parameter optimization with steps of 0.1 on the 18 benchmark datasets. The optimal p* was chosen based on the highest F-Measure value. The experimental results showed that choosing p* = 0.7 yielded the highest F-Measure in 13 out of the 18 datasets; the remaining 5 showed no or negligible differences in F-Measure before and after concatenation. Consequently, the optimal value of p* was chosen to be 0.7. Figure 3 illustrates the variation in F-Measure under hyper-parameter optimization with steps of 0.1 on the 18 benchmark datasets. Based on this figure, it can be observed that the F-Measure value peaks at the threshold of 0.7 on almost all datasets.
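The hyper-parameter search can be sketched as a simple grid over p* in steps of 0.1, keeping the value with the highest F-Measure. Here `evaluate` is a hypothetical callback standing in for the full discover-and-measure pipeline:

```python
def best_threshold(evaluate, step=0.1):
    """Grid-search p* over (0, 1) in the given step, maximising F-Measure."""
    candidates = [round(k * step, 10) for k in range(1, round(1 / step))]
    return max(candidates, key=evaluate)

# Toy objective that peaks at 0.7, mimicking the observed behaviour.
print(best_threshold(lambda p: -(p - 0.7) ** 2))  # 0.7
```

In the actual experiments, `evaluate` would concatenate the log with the candidate p*, run SM, and return the resulting F-Measure for each dataset.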
Also, to choose the optimal values of the SM thresholds, we used 16 different settings of the two thresholds η and ε. For the first experiments, we used the optimal threshold values for the SM algorithm and the raw event logs to discover process models. Then, we applied our methodology to the raw event logs and used the optimal threshold values for our algorithm and the SM algorithm to discover process models from the concatenated logs. Finally, the quality of the process models was measured using the common metrics of fitness, precision, and F-Measure as proxies of accuracy, and size, CFC, and structuredness as proxies of complexity. The results of this evaluation on the 18 benchmark datasets are summarized in Tables 3 and 4.
In the cases in which improvements were observed in any of the evaluation metrics after applying concatenation, the Wilcoxon test was run to calculate the P-value and determine whether the improvements are statistically significant. The significance level was set to 0.05.
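For illustration, the significance test applied here can be sketched in pure Python using the normal approximation of the Wilcoxon signed-rank test; in practice a statistics library would be used, and the exact-distribution variant is preferable for small samples.

```python
import math

def wilcoxon_signed_rank(before, after):
    """Two-sided Wilcoxon signed-rank test (normal approximation).
    Returns (W+, p_value). Zero differences are dropped and tied
    |differences| receive average ranks. A minimal sketch of the
    test the paper applies, not a library-grade implementation."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    ranked = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[ranked[j + 1]]) == abs(diffs[ranked[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average rank for tied |differences|
        for k in range(i, j + 1):
            ranks[ranked[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p
```

With 18 paired measurements (one per benchmark dataset), a p-value below 0.05 is taken as a statistically significant improvement.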
The measurement tools for fitness and precision operate on PNs, whereas the output of SM process discovery is a BPMN model. Therefore, we converted the resulting BPMN models to PNs using ProM's BPMN Miner package [43]. The complexity metrics were computed on the BPMN models. All tests were performed on a computer running Windows 10 with an Intel i7-6700 CPU and 16 GB of RAM. Table 3 shows the experimental results for the best obtained F-Measure values and the corresponding fitness and precision values before and after pre-processing. Our results show that the F-Measure values were significantly improved by concatenating the events that have a concurrent relation in the real-life event logs. 14 of the 18 benchmark datasets show a clear, major improvement in F-Measure, as depicted in Figure 4c; Figures 4a and 4b illustrate the corresponding fitness and precision, respectively. Table 4 compares the discovered process models with respect to their complexity before and after the pre-processing. The complexity of the models was evaluated using three metrics mentioned earlier in the preliminaries section: CFC, size, and structuredness. All of these metrics are inversely related to the understandability of the process models. Based on our results, 16 out of 18 benchmark datasets clearly show that the process models generated by SM after the pre-processing step are simpler in terms of CFC.
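For reference, the F-Measure reported throughout is, as is standard in process-model evaluation, the harmonic mean of fitness and precision:

```python
def f_measure(fitness, precision):
    """Harmonic mean of fitness and precision, the standard
    F-Measure definition used when evaluating process models."""
    if fitness + precision == 0:
        return 0.0
    return 2 * fitness * precision / (fitness + precision)
```

Because it is a harmonic mean, a small sacrifice in fitness is outweighed by a larger gain in precision, which is exactly the trade-off the pre-processing step exploits.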
Besides, the overall size of the models was greatly reduced in 17 out of 18 benchmark datasets, and the structuredness of the models improved in 14 out of 18. These results indicate that our models are considerably easier to understand and less complex, as shown in Figures 5a, 5b, and 5c. Comparing P-values, 14 out of 18 datasets improved significantly in both F-Measure and CFC, 12 out of 18 showed significant improvement in size, and 13 out of 18 improved significantly in structuredness. Hence, our proposed approach yields statistically significant improvements across several evaluation metrics on the 18 benchmark datasets.
Based on Table 2, the improvements in F-Measure and complexity result from decreasing or removing the infrequent traces, which causes a minimal sacrifice in fitness while increasing precision considerably. In some event logs, our approach removed as much as 30% of the infrequent traces from the original event logs to reach the best F-Measure value. The percentage of traces remaining in the event logs after pre-processing is shown in Table 2.
Another reason our approach improves the F-Measure values and the complexity of the process models is that concatenating some concurrent relations removes some of the unnecessary behaviors from the event logs. Table 2 shows that the number of distinct events decreases after the pre-processing step, indicating that unnecessary behaviors were removed, which in turn led to significant improvements in the F-Measure values and the complexity of the process models.
The RTFMP dataset contains highly frequent behavior, whereas the SEPSIS dataset contains many distinct traces that occur only once. Both datasets also contain many concurrency relations and loops compared to the other datasets. The algorithm successfully eliminated concurrency and removed self-loops by concatenating a large set of events with concurrency relations; as a result, most infrequent behaviors and infrequent traces in the event logs were removed. The algorithm therefore enabled the process discovery algorithm to generate process models with distinctly higher F-Measure values and lower complexity. To elucidate further, most of the BPIC15 datasets contain only a few concurrency relations between events compared to the other datasets. Even when a concurrency relation exists, the probability that the second event follows the first may exceed the p* threshold while the probability that the first event follows the second does not; in such cases, our algorithm does not choose the combination for concatenation. In fact, our algorithm selects only combinations that have strong concurrent relations in both directions. As a result, the F-Measure values for the BPIC15 datasets either improved slightly or did not change compared to the results on the raw data. In terms of complexity, even though our algorithm made only minor changes to these event logs, the complexity of the process models still improved in most of the BPIC15 datasets. Table 5 reports the total discovery running time of the proposed approach, i.e., the sum of the running time of our pre-processing step and the running time of the process discovery step after applying our approach.
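The two-directional selection rule described above can be sketched as follows. This is our reading of the paper's frequency-based formulation; in particular, the probability definition P(a → b) = f(a → b) / f(a) and the strict both-directions condition are assumptions for illustration.

```python
from collections import Counter

def concurrent_pairs(traces, p_star=0.7):
    """Select event pairs for concatenation: a pair (a, b) is chosen
    only when each event directly follows the other with probability
    above p_star, i.e., both P(a -> b) and P(b -> a) exceed the
    threshold. Assumes P(a -> b) = f(a -> b) / f(a)."""
    freq = Counter()     # f(e): occurrences of each event
    follows = Counter()  # f(a -> b): b immediately after a
    for trace in traces:
        for i, e in enumerate(trace):
            freq[e] += 1
            if i + 1 < len(trace):
                follows[(e, trace[i + 1])] += 1
    chosen = set()
    for (a, b), f_ab in follows.items():
        f_ba = follows.get((b, a), 0)
        if a != b and f_ba:
            if f_ab / freq[a] > p_star and f_ba / freq[b] > p_star:
                chosen.add(frozenset((a, b)))
    return chosen
```

This makes the BPIC15 observation concrete: if only one direction clears p*, the pair is skipped, so logs with weak or one-sided concurrency are left nearly unchanged.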
It also shows the total discovery running time without any pre-processing step. Comparing the total discovery running time with and without the proposed pre-processing step on the 18 benchmark datasets, 14 of them ran faster when the pre-processing step was applied. As a result, even though our proposed approach includes a hyper-parameter optimization step, the total discovery time improved. This may be due to the removal of some concurrency and self-loops from the logs by our algorithm, producing more efficient event logs.
Moreover, we ran experiments to compare the results of our methodology with the best 4 recent pre-processing approaches in the existing literature. To do so, we used the threshold of 0.7 for our proposed approach, and thresholds of 0.4 and 0.4 for η and ε, respectively, for the SM algorithm. For the existing approaches, we first applied their algorithms to the logs, and the resulting logs were then fed to the SM algorithm with the same thresholds of 0.4 for η and ε to discover the process models. As above, in the cases in which improvements over the existing approaches were observed in any of the evaluation metrics, the Wilcoxon test was run to measure the level of significance. Comparing P-values for F-Measure, 17 out of 18 datasets improved significantly compared to the other approaches. Additionally, in terms of CFC, size, and structuredness, the P-values indicated significant improvements on all datasets, as shown in Table A1 in the appendix. Hence, the results indicate that our proposed approach statistically outperforms the best 4 recent pre-processing approaches. Table A2 in the appendix shows that, compared with the best 4 recent approaches in the literature, the proposed approach improved not only on the datasets with fewer concurrency relations between events but also on those that contain more concurrency. Table A2 was constructed to find out whether there is any relation between the percentage of distinct concurrent events in each dataset and the improvement of each evaluation metric for the proposed model and the four best existing methods in the literature. As can be observed, the existing pre-processing methods are useful for generating a more accurate and simpler model when an event log exhibits few concurrency relations between events. However, when the event log contains many concurrent relations between events, these techniques do not improve the results. Since most real-life processes contain concurrency between events and loops, we believe our approach is more useful on real-life datasets than the other approaches. For instance, for the RTFMP and BPIC-cp datasets, in which the percentage of concurrent events was higher, the other methods showed no or negligible improvements, whereas for datasets such as BPIC14-f and BPIC12, which had a lower percentage of concurrency, the other methods did show improvements. Our method, by contrast, showed improvements on almost all evaluation metrics regardless of the percentage of concurrency in the event logs.

VI. CONCLUSION
Process mining provides insights into processes as they actually happen in real life. Most process discovery algorithms are designed to model event logs that are assumed to be clean. However, real-life event logs are mostly noisy and contain infrequent behaviors, many concurrency relations, and loops; as a result, the generated process models are often inaccurate and highly complex. Adding a pre-processing step before process discovery essentially removes these issues, makes the data easier to visualize, and drastically improves the performance of the process discovery algorithms. Here, we proposed a pre-processing technique that increases the performance of process discovery algorithms through a probability-based concatenation of events that hold concurrent relations. For evaluation, 18 real-life benchmark datasets were first fed to the SM algorithm, since it is the leading discovery algorithm in terms of F-Measure and complexity, to discover process models. Then, our pre-processing step was applied to the same event logs, and the resulting event logs were fed to the SM algorithm to discover process models. Next, we measured the F-Measure and complexity of the models in both cases. Finally, the results of our proposed model were compared with the best 4 recent pre-processing approaches in the existing literature. In the cases in which improvements were observed in any of the evaluation metrics, the Wilcoxon test was run to measure the level of significance. The results indicate that the pre-processing step statistically increases the F-Measure values and decreases the complexity of the models by concatenating some of the unnecessary concurrent events and removing self-loops.
In future work, we aim to incorporate other perspectives, such as developing pre-processing approaches that streamline and improve our results by combining different prevailing pre-processing strategies. Finding the optimal threshold values and speeding up the process is challenging and time-consuming; consequently, we plan to find techniques that automatically adjust the optimal parameter values instead of the hyper-parameter optimization we currently use. Also, since we concatenated the events that hold concurrent relations, we aim to test the effects of concatenating events that hold other relations, such as causality and conflict relations.

NOMENCLATURE
E           set of events
T           set of traces
L           event log
ℒ           multi-set of labels
l           label of each event
λ           function that retrieves the label of each event
P           set of places in a PN
T_i         set of transitions in a PN
F           set of directed arcs in a PN
π           function that maps a transition to either a single observable event or to the non-observable event
⊥           non-observable event
U           set of activities
p*          threshold value for the concatenation algorithm
η           SM threshold which controls concurrency relations
ε           SM threshold which controls the filtering process
f(e_i)      frequency of e_i
f(e_i e_j)  frequency of e_j occurring immediately after event e_i
F^s_{e_i e_j}  s-th frequency of the indices for the combination (e_i, e_j)
I^s_{e_i}   s-th index of event e_i in an event log
(e_i ⊕ e_j) concatenated event for (e_i, e_j)
f(e_i → e_j)  frequency with which e_j happens immediately after e_i
f(e_j → e_i)  frequency with which e_i happens immediately after e_j
P(e_i → e_j)  probability that e_j happens immediately after e_i
P(e_j → e_i)  probability that e_i happens immediately after e_j
P(e_i ∥ e_j)  probability of the frequency with which event e_j occurs immediately after event e_i
P(e_j ∥ e_i)  probability of the frequency with which event e_i occurs immediately after event e_j

This article has been accepted for publication in IEEE Access. Citation information: DOI 10.1109/ACCESS.2022.3185235