Complex Process Modeling in Process Mining: A Systematic Review

Process mining techniques are used to extract knowledge about the efficiency and compliance of an organization’s business processes through process models. Real-life processes are unstructured, and applying process mining to discover such processes often results in complex process models that do not provide actionable insights. Several solutions have been presented to overcome this problem. However, the process mining domain lacks an explicit definition of complexity and its measurement. This vagueness results in ad-hoc solutions that vary according to the approach, modelling construct, and process properties. Additionally, the strengths and limitations of the proposed solutions have not been adequately highlighted. Therefore, we conducted a systematic literature review on complexity in process mining over six popular scholarly literature indexing databases. Based on the review results, an explicit definition of complexity, the main contributing factors, and their impact on process mining results were identified. We discovered various process complexity metrics and their application context. The analysis of studies led to the development of a taxonomy consisting of four different approaches for addressing the complexity problem, along with their strengths and limitations. Finally, the open research challenges and potential for future research are discussed.


The current age of technology has significantly changed how an organization manages its business operations. Organizations have shifted from manual processing to automated and technological methods of business operations. Information systems are used almost everywhere, from banks to hospitals. With the increased usage of technology for information management, there has been an increase in data generation. This outburst of data makes it difficult for organizations to extract valuable insights from these systems. Regardless of the statistical analysis techniques used to assess business operations, it is also crucial for an organization to know how efficiently its business operates, where and why bottlenecks exist, and how they can be removed. Although data mining techniques can uncover certain patterns in the data of business operations, no temporal relationship exists between such data. End-to-end, multi-perspective insights into process execution are not possible using data mining techniques [1]. To make informed decisions, even a data scientist finds it crucial to analyze the relationship between data and business operational processes, which is not possible without a holistic understanding of the underlying process [2].

The associate editor coordinating the review of this manuscript and approving it for publication was Fu Lee Wang.

Process mining (PM) is an umbrella term for approaches that combine data mining and business process management and that analyze event log data using advanced algorithms, machine learning, and statistical methods to analyze and improve business processes. PM techniques extract the ...

... trace may contain five activities whereas in the other case, it may contain twenty-five. Such a variable process is called a complex or unstructured process.
The model generated by process mining using logs of such an environment results in a complex or spaghetti-like process model [8], [11]. Examples of both the complex and structured process models are shown in Fig. 1. The process model on the right, i.e., the structured model, is easy to understand and clearly conveys the flow of the process. In contrast, the model on the left is overly complex and does not provide any insight into the process execution flow.

According to Mendling et al. [12] and Reijers and Mendling [13], as cited in Li et al. [14], a human analyst's ability to understand a process model is known to be influenced by the complexity and density of a process model. So, to understand and improve processes, process models must not be overly complex and should be easy to understand [15].

Several researchers have attempted to resolve this problem. However, a general understanding of the complexity problem, what causes this problem, and what approaches can be used to mitigate this problem remains missing. Reviewing the literature on complexity in process mining will help to better understand the primary factors that contribute to complexity and possible resolution strategies. A broader analysis of the strengths and limitations of the complexity reduction approaches is also essential to understand the suitability of each approach. Furthermore, the identification of research gaps will help in identifying the untapped research areas. We conducted a systematic literature review on complexity in the process mining domain to answer the previously mentioned questions. Six popular scholarly indexing databases were systematically searched, specifically focusing on papers published between 2012 and 2022.
In addition to the above ...

... their focus was complexity reduction techniques focused on block-structured process models and specific modeling languages such as the YAWL modeling notation. Methods have evolved since then, and the focus has shifted from only a ...

... [20] investigated whether a low-structured process can be modeled using process discovery techniques. However, their research was an experimental study of applying process maps, the results of which were already known from the earlier work of Günther and Van Der Aalst [8]. A systematic review was conducted by Duan and Wei [21], focusing only on the complexity caused by duplicate tasks in process mining. Van Zelst et al. [22] performed a literature review on abstractions in process mining and presented a taxonomy of works on event abstraction techniques. Nevertheless, their assessment was explicitly focused only on event abstraction.

Although a fair bit of literature exists on the topic, there is no systematic review of the diverse approaches dealing with the problem of complexity in the process mining domain. Almost every proposed approach to deal with complexity in process mining holds a different view on this problem, and there is a lack of a unified view of the topic. Therefore, this research aims to fill this void by systematically reviewing the available literature and presenting the strengths and limitations of existing approaches and opportunities for future works, along with the taxonomy of different approaches used to resolve this problem.

For this purpose, we formulated the following research questions and sub-questions, as presented in the following section.

... also enforces the selection of only the conference proceedings and journal articles. Furthermore, some significantly important papers were added by snowballing; these remain unaffected by the duration bound.
Table 1 shows the inclusion criteria, whereas the exclusion criteria are presented in Table 2.

In this stage, the review plan was executed. The search results were extracted in their respective formats (BIB/CSV/CIW). For Google Scholar specifically, the "Publish or Perish" software [24] was used to search and extract search results. All the search results were imported into the Zotero reference manager. The inter- and intra-database duplications were resolved based on the method for systematic reviews [25], [26].

A form was designed in a Microsoft Excel sheet to collect answers corresponding to the research questions. During full-text screening, papers were deemed ineligible if their objectives were unclear, their methodology remained ambiguous, or their novelty was weak. Furthermore, to maintain quality, articles with replicated publications were excluded, in addition to those with contrasting claims, such as a title and abstract that indicated novelty while the paper's body suggested otherwise.

We report the results of our findings based on the questions this review intends to answer. In the results and discussion section, the answers to the questions are first presented in frequency-based quantitative analysis format. A taxonomy of approaches/techniques that deal with process model complexity in process mining is formulated. Moreover, we also elaborate on the interpretation of the results based on the review.

VOLUME 10, 2022

III. RESULTS
• The model discovered from the process is not comprehensible, i.e., the model and how the process was executed cannot be understood.

In light of the above two perspectives on what makes a process complex, we derive the following definition of process complexity:

A process can be termed a complex process if any of the following conditions holds:

• A process executed in a less restricted environment
• A process whose event log contains a fine-grained level of detail about process execution
• A process in which applying process mining results in spaghetti-like visualization, complicating the understandability of the process execution.

From the above definition, we can safely say that the complexity in process mining is the level of difficulty in bringing simplicity to processes. Please remember that the terms "complex," "flexible," and "unstructured" processes are used synonymously throughout the process mining literature. So, we also use these terms interchangeably throughout this paper.

The following three significant factors were found to be the reasons for introducing complexity to the process, as also shown graphically in Fig. 5.

Flexibility in process execution remains the top reason for introducing complexity to the process where there is no restriction imposed on the execution behavior of the process. The number of process execution behaviors increases with the number of activities. Even if an order-preserving constraint is imposed on a few activities, a fraction of the process execution behaviors still increases with the number of activities. Modeling such dynamic behavior will result in a complex process model.

Noise in the log was observed as the third most dominant factor contributing to process complexity. In process mining terms, noise is commonly defined as infrequent behavior [4].

Most process discovery algorithms model the most frequent behaviors in the process [8], [27], [28]. For example, a path from activity A to B may appear 100 times in a log. At the same time, some paths may occur infrequently, such as activity A to C being observed three times and activity B to A being observed five times (a looping pattern). Process discovery from such a log will result in a spaghetti-like visualization, which complicates the understandability of the process execution behavior.
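The frequencies in this example can be reproduced with a small directly-follows count. The sketch below is illustrative only: the toy log, the function names, and the `min_count` threshold are assumptions chosen to match the numbers in the text, not part of any reviewed technique.

```python
from collections import Counter


def directly_follows_counts(traces):
    """Count how often activity a is directly followed by activity b."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


def frequent_arcs(counts, min_count):
    """Keep only arcs seen at least min_count times (hypothetical threshold)."""
    return {arc: n for arc, n in counts.items() if n >= min_count}


# Toy log built so that A->B occurs 100 times, B->A 5 times, A->C 3 times.
log = (
    [["A", "B"]] * 90
    + [["A", "B", "A", "B"]] * 5   # loop: contributes A->B twice, B->A once
    + [["A", "C"]] * 3
)

counts = directly_follows_counts(log)
model_all = frequent_arcs(counts, min_count=1)    # every observed arc
model_main = frequent_arcs(counts, min_count=10)  # frequent behavior only

print(dict(counts))
print(sorted(model_main))
```

With every arc drawn, the model contains the loop and the rare branch; with the threshold applied, only the A to B path remains. This is the simplification that frequency-based discovery performs, and it is also the point at which infrequent behavior disappears from view.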

According to Conforti et al. [29], models discovered using ...

... do not have a specific position and occur randomly [30], [31].

... structure, which leads to difficult-to-understand process models.

Another impact of complex process mining is the inability to render an accurate process model. Noise introduces erratic connections in the process model that are never executed in reality, thus resulting in a false reflection of reality [30].

Table 3 presents the different complexity metrics found in the literature. We categorize process evaluation metrics into two major categories based on their goals:

... more specific behavior in the reference model. According to Conforti et al. [29], noise significantly reduces model precision because it establishes erroneous links between model activities. Generalization is the inverse of the precision metric. It measures how much additional behavior is observed in the mined model, which is not present in the log and vice versa [29]. The simplified versions of the formulas found in the literature for calculating the fitness and precision measures are presented in (1) ...

F-measure, also called f-score or f1-score, is the harmonic mean of fitness and precision, as shown in (3). Researchers observed that if a particular behavior is excluded from the model, the fitness value decreases, increasing the precision (accuracy) value. Considering the trade-off between fitness and precision metrics, researchers proposed the f-measure as a way of balancing model fitness and precision [36].
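As a minimal sketch of the harmonic mean described above, the function below computes the f-measure from fitness and precision values that are assumed to lie in [0, 1]. The function name and the example values are illustrative assumptions; how fitness and precision themselves are obtained is outside the scope of this sketch.

```python
def f_measure(fitness: float, precision: float) -> float:
    """Harmonic mean of fitness and precision: 2 * f * p / (f + p)."""
    if not (0.0 <= fitness <= 1.0 and 0.0 <= precision <= 1.0):
        raise ValueError("fitness and precision are assumed to lie in [0, 1]")
    if fitness + precision == 0.0:
        return 0.0  # convention used here to avoid division by zero
    return 2.0 * fitness * precision / (fitness + precision)


# Illustrative values only: a balanced model versus one that overfits to
# fitness at the cost of precision.
balanced = f_measure(0.9, 0.6)
lopsided = f_measure(1.0, 0.2)
print(round(balanced, 3), round(lopsided, 3))
```

Because the harmonic mean is dominated by the smaller of the two values, a model with perfect fitness but poor precision scores lower than a model that is moderately good on both, which is why the measure is used to balance the two.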
If a technique produces multiple models, e.g., clustering or abstraction-based approaches, it is logical to use the average fitness, precision, or f-measure [37], [38].

The number of subprocesses metric is explicitly related to the abstraction-based approaches where a process is simplified by dividing it into subprocesses and assessing whether the formed subprocess relates to the reference model [39].

In a process model, activities are represented as nodes, and the relationship between two activities is portrayed as an arc between them. For example, if activity A is followed by activity B, the resulting process model will contain an arc between activities A and B. As the number of activities and their relationships increases, the process model becomes complex, resulting in a spaghetti model that is less understandable. The total number of arcs and nodes is the ...

... number of arcs going out from each of such constructs [29].

CFC is specifically relevant when the process model representation is a Petri net [29]. The CFC measurement formula is shown in (5).

$$\mathrm{CFC} = \sum_{c \,\in\, \text{all split constructs in process model}} \text{outgoing arcs}(c) \qquad (5)$$

The Coefficient of Network Connectivity (CNC) measures the ratio between the number of arcs and nodes [40]. Parts of the process model containing cycles tend to be more challenging to understand than sequential ones. So, an increase in cycles results in a rise in the complexity of the process model. Refer to (6), where |A| represents the total number of arcs whereas |N| represents the total number of nodes in the model.

... is the count of unique process variants and unique activity classes. According to Baier et al. [28], a drop in the activity classes and variants was observed after performing abstraction, resulting in complexity reduction of the process model because of a lower variation.
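The effect of abstraction on the number of variants and activity classes can be illustrated with a short sketch. The log, the activity names, and the label mapping below are hypothetical, and the abstraction shown (relabel, then merge consecutive repeats) is a deliberately simple stand-in rather than the technique of Baier et al. [28].

```python
def variant_and_activity_counts(traces):
    """Return (number of unique variants, number of unique activity classes)."""
    variants = {tuple(trace) for trace in traces}
    activities = {activity for trace in traces for activity in trace}
    return len(variants), len(activities)


def abstract_trace(trace, mapping):
    """Relabel low-level activities and merge consecutive repeats."""
    abstracted = []
    for activity in trace:
        label = mapping.get(activity, activity)
        if not abstracted or abstracted[-1] != label:
            abstracted.append(label)
    return abstracted


# Hypothetical log with two low-level checks that occur in varying order.
log = [
    ["register", "check_id", "check_credit", "approve"],
    ["register", "check_credit", "check_id", "approve"],
    ["register", "check_id", "approve"],
]
# Hypothetical mapping from low-level to high-level activity labels.
mapping = {"check_id": "check", "check_credit": "check"}

before = variant_and_activity_counts(log)
after = variant_and_activity_counts([abstract_trace(t, mapping) for t in log])
print(before, after)
```

In this toy case the three variants collapse into one and the four activity classes into three, which is the kind of drop in variation that the metric is meant to capture.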

The frequency of different metrics found for both the structural and behavioral complexity dimensions is presented in Fig. 9. Although most of the studies claim that structural complexity has been taken care of, our analysis, as shown in Fig. 9, reveals that significantly less emphasis has been given to structural complexity metrics. The x-axis shows the usage of each metric in selected studies, whereas the y-axis represents the name of each metric.

Among the process complexity metrics found, the number of nodes and edges was the most straightforward way of measuring process complexity, and it remained the dominant structural complexity metric throughout the literature. It also has the advantage of being feasible for both block-structured and graph-based process models. Based on the studies' data, the number of nodes and edges and the Average Connector Degree (ACD) were negatively correlated with understandability [41]. The cyclomatic complexity metric is another potential measure, with its roots in the software quality domain, where it is used to quantify the number of possible code execution paths. It is said that the more control structures in the code, the more branching and complex the code. Similar to the complexity in code caused by control structures, split constructs cause complexity and branching in process models. Cyclomatic complexity is equally helpful for process model complexity analysis [44] since the basic idea of complexity and understanding is similar in both cases. It was further found that an increase in places, transitions, and specifically the number of splits and joins affects the comprehensibility of the process model [47].

Nevertheless, the suitability of a structural complexity measure is coupled with the process modeling notation in hand.

The frequency is based on their usage in selected articles.

Taking a different perspective on clustering approaches, De Koninck et al. [57] developed a technique to explain why certain traces were clustered into specific clusters. They did so by investigating common behavioral patterns in clusters, such as the presence of activities X and Y in a cluster or a relation X directly followed by Y in the cluster, and many other such rules. Ekanayake et al. [40] used a mixed-method approach. They first clustered traces based upon variants, and then abstraction was introduced to derive subprocesses in each cluster by abstraction of activities that split and joined in the same place.

In conclusion, the clustering approaches effectively divide the log into subsets. Nevertheless, the main objective of reducing the complexity of process models and making them understandable has remained uninvestigated. The emphasis of evaluation mostly remained on the behavioral quality of process models. It is acknowledged that the resulting process model should exhibit similar behavior as in reference models; however, the actionable knowledge is also related to the simplicity of process models and is vital to consider ...

Günther and Van Der Aalst [8] first presented the concept of abstraction and aggregation in process mining. Their work is inspired by the idea of cartography (the study of maps), where they aim to show only relevant information at a specific level. They used the concept of aggregation to show process model elements (arcs and nodes) in an aggregated manner while abstracting from insignificant details to simplify the model significantly. Instead of Petri nets, they used process maps, a Directly-Follows Graph (DFG) based notation, to represent the process model. Despite the inability of DFGs to differentiate between splits (AND and OR splits), the DFG notation remains the most popular process modeling notation in process mining, since 25 commercial process mining products use DFGs [61]. However, their abstraction mechanism is guided by the frequency of activities, i.e., infrequent activities are abstracted from the process map. Although such models simplify the process significantly, compliance checking is not trivial using them.

... containing the same preceding and succeeding activities is abstracted based on window size. They do so recursively by introducing multilevel abstractions over the log. Their technique simplifies the process model but is limited by assuming that each class at a specific abstraction level belongs to one higher-level class. However, in real-life cases, the same class may belong to multiple higher-level process classes.

Instead of activity-based abstraction, Tsagkani and Tsalgatidou [38] considered additional attributes for abstractions, such as role hierarchies. However, their abstraction technique was highly influenced by the experience and expertise of those involved in abstraction.

Vathy-Fogarassy et al. [70] proposed a goal-based process discovery methodology specifically for a healthcare environment and ignored tasks not required to analyze the disease under focus. They simplify the process model based on the domain-specific taxonomy by introducing multilevel abstractions over the log. However, their approach is specific to the healthcare domain, where a process taxonomy such as disease investigation levels is present. They used no behavioral or structural complexity metric for model assessment and relied on visual analysis. Nevertheless, their technique has the potential to be applied in other domains too. A similar healthcare-specific approach was suggested by Erdogan and Tarhan [71]. Their approach remained focused on challenges associated with the evaluation of complex processes primarily originating from the healthcare domain.

Noise filtration approaches apply filters on certain properties of the log or model against specific thresholds to simplify the process model. No single definition of noise exists in the process mining literature; instead, noise is defined in an ad-hoc manner, depending on the context. However, according to Sani et al. [9], filtering approaches lower the size of process instances needed by process discovery algorithms and thereby reduce the complexity of the process.

Considering the complex nature of processes in the healthcare domain, Kaymak et al. [72] used a goal-based process discovery approach and filtered behavior irrelevant to the goal of process discovery. They claimed that their process is of medium complexity; however, they remained vague about complexity assessment and relied upon visual analysis of model complexity.

An incremental process discovery approach was proposed by Schuster et al. [78]. Rather than automatically discovering the whole process model, they suggest human involvement in discovering the process model. Each trace is added to the process model, and its impact on model complexity is observed. A trace that increases model complexity is filtered out. The approach seems feasible when only a small number of traces are mined. However, a high number of traces results in time-consuming process discovery and ignores the main objective of process mining, i.e., to discover the process automatically.
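The add-observe-filter loop described above can be sketched as follows. This is not the method of Schuster et al. [78]: the human decision is replaced here by a hypothetical numeric budget `max_new_arcs`, model complexity is approximated by the number of directly-follows arcs, and all names are assumptions made for illustration.

```python
def trace_arcs(trace):
    """Directly-follows arcs contributed by a single trace."""
    return set(zip(trace, trace[1:]))


def incremental_discovery(traces, max_new_arcs):
    """Add traces one by one; reject a trace that adds too many new arcs.

    The first trace is always accepted so that the model is never empty.
    """
    model_arcs = set()
    accepted, rejected = [], []
    for trace in traces:
        new_arcs = trace_arcs(trace) - model_arcs
        if not accepted or len(new_arcs) <= max_new_arcs:
            model_arcs |= new_arcs
            accepted.append(trace)
        else:
            rejected.append(trace)
    return model_arcs, accepted, rejected


# Hypothetical traces, processed in the order given.
traces = [
    ["A", "B", "C"],
    ["A", "C"],
    ["A", "D", "B", "E", "C"],
]
model, accepted, rejected = incremental_discovery(traces, max_new_arcs=1)
print(sorted(model))
print(rejected)
```

The result depends on the order in which traces are presented, and every trace has to be inspected individually, which illustrates why the approach becomes time-consuming for large logs.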

Finally, Zhang et al. [34] extracted mainstream behavior from the event log, i.e., the traces occurring more frequently or those containing frequently occurring activities. They used mainstream behaviors to extract behavioral probabilities of traces using Hidden Markov Models. The traces having a low probability against mainstream behavior are removed from the log, thus simplifying the process model. Their frequency-based filtration technique also remained biased against infrequent behavior.
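The idea of scoring traces against mainstream behavior can be sketched with a much simpler model than the one used in [34]. The example below substitutes a first-order Markov chain over visible activities for the Hidden Markov Model, so it is an illustration of the filtering principle only; the thresholds `min_count` and `min_prob`, the marker labels, and the toy log are all assumptions.

```python
from collections import Counter, defaultdict

START, END = "<start>", "<end>"  # artificial boundary markers


def mainstream_traces(traces, min_count):
    """Traces whose variant occurs at least min_count times."""
    freq = Counter(tuple(t) for t in traces)
    return [t for t in traces if freq[tuple(t)] >= min_count]


def transition_probabilities(traces):
    """Estimate P(next activity | current activity) from the given traces."""
    counts = defaultdict(Counter)
    for trace in traces:
        path = [START, *trace, END]
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return {
        a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
        for a, nxt in counts.items()
    }


def trace_probability(trace, probs):
    """Product of transition probabilities; unseen transitions give 0."""
    p = 1.0
    path = [START, *trace, END]
    for a, b in zip(path, path[1:]):
        p *= probs.get(a, {}).get(b, 0.0)
    return p


def filter_log(traces, min_count, min_prob):
    """Drop traces that are improbable under the mainstream model."""
    probs = transition_probabilities(mainstream_traces(traces, min_count))
    return [t for t in traces if trace_probability(t, probs) >= min_prob]


log = [["A", "B", "C"]] * 8 + [["A", "C"]] * 4 + [["A", "D", "C"]]
kept = filter_log(log, min_count=3, min_prob=0.1)
print(len(log), len(kept))
```

The single trace through D is removed because the mainstream model has never seen the transition from A to D. Whether that trace is noise or a rare but legitimate case cannot be decided from frequencies alone.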

Although filtration-based techniques use a straightforward method to deal with complexity, the frequency-based treatment of behavior filtration is somewhat unreasonable. From the perspective of compliance checking, infrequent behavior does not always represent noise [2]. Such behaviors are important for further investigation into why and when they happened. Conformance checking is performed after process discovery; this implies that if infrequent illegal behaviors are removed using filtration techniques, the violations against standard process executions would not be detected. This problem calls for techniques that can distinguish between "infrequent legal behavior" and "infrequent illegal behavior" and allow for filtration only over infrequent legal behavior to preserve the compliance checking properties in the log. Secondly, most noise filtration techniques rely on visual analysis for complexity analysis. The complexity analysis should instead be based on standard metrics. Thirdly, there was a gap in evaluating techniques against real-world data sets and in the context of real-world scenarios.

Pattern mining-based approaches simplify complex process models by extracting execution behaviors that contain specific patterns and generating process models only from such patterns. Such patterns represent subprocesses, and process discovery is based upon frequent behavioral patterns in the log, commonly referred to as Local Process Models (LPMs) [79]. LPM discovery techniques are guided by patterns. The traces which do not contain rule-satisfying patterns are removed from the log [80].

A frequency-based pattern mining approach was used by Liesaputra et al. [80]. They simplified the log based on thresholds, such as the frequency of specific patterns in the log. Infrequent patterns are abstracted away, and only frequent patterns are mined.

... In the case of highly variable data with low frequencies, it will be challenging to implement such techniques for model simplification.

In addition to the previously mentioned four dominant complexity reduction approaches, Kaouni et al. [85] proposed a visual analysis approach for complexity reduction. They propose using dotted charts and frequency-based graphs for process analysis. Although graphs and dotted charts do help in conducting a preliminary evaluation of the process data [2], [86], [87] and can be used as supplementary analysis types, the end-to-end analysis of the process is not possible using such visualizations.

Alongside the complexity reduction strategies, we also analyzed the literature on the types of datasets utilized for evaluation and validation of the complexity reduction techniques in order to determine the researchers' concentration on specific areas. Fig. 12 depicts the datasets of several processes utilized by researchers for evaluating and validating complexity reduction approaches. Almost one-third (30%) of the studies made use of healthcare datasets. The healthcare domain was followed by datasets for the processes of securing a bank loan, managing incidents, and administering traffic fines. The majority of these datasets are from the Business Process Intelligence Challenge (BPIC) event, which is a business process analysis competition in which competitors get both real-world and synthetic datasets. In addition to process complexity research, these datasets are popular, commonly used throughout the process mining literature, and openly accessible.

The strength of clustering-based techniques lies in dealing with process complexity in an unsupervised fashion and segmenting a complex process to the trace level. However, this may also result in an unacceptable clustering solution when traces are clustered based on a specific perspective rather than random. Trace-level clustering approaches are suitable for dealing with model complexity when trace-level heterogeneity is observed in the log. In this direction, Jablonski et al. [52] used a frequency-based trace clustering method. Their clustering solution did not differentiate between frequent and infrequent behavioral patterns. Although there is a possibility that traces end up in the wrong clusters using their approach, all behavior is included in the final trace clustering solution. When trace-level fitness is the goal, trace clustering is a better choice because it results in a good average fitness value [45].

The curse of dimensionality was another limitation in trace clustering approaches. When many features are available for clustering, dimensionality reduction techniques can be used, but this results in the loss of individual features' impact over ...

... appearing in multiple levels of the activity hierarchy was used in isolation without considering the activity execution context to determine the correct level [14], [64], [67]. The validation of the abstraction solution remained missing and mostly remained influenced by the experience of the abstractor [38]. It was observed that experiments were performed on simulated datasets only, and pattern-based abstraction approaches remained expensive as they generate many candidate patterns for abstraction [67]. Further work on the validation and optimization of abstraction-based approaches is required.

As opposed to clustering and abstraction, the strength of filtration-based techniques lies in the straightforward treatment of the log to simplify the process model by applying constraints to remove noise from the log. Behaviors not fulfilling the pre-specified criteria are considered noise and thereby removed from the log. A frequency-based treatment was employed on the log to detect the noise and remove it. Some researchers used the directly-follows dependency of events as a metric to filter infrequent behaviors from logs. In treating infrequent behavior, domain knowledge is of utmost importance, as such behaviors are essential for compliance checking.

... to validate patterns such as support, confidence [33], [79], and ...

The abstraction approach combined with the clustering techniques is a prospective approach to better deal with model complexity. At the same time, trace clustering can also benefit from pattern-based approaches, such as making clusters based on patterns rather than activity frequencies.

It was also noted that there is a lack of work on looking at model complexity from the log perspective. Similar labels are considered equal in clustering, but their context may differ [39]. The same activity labels from different departments affect the aggregate frequency and will be clustered together during clustering, but in reality, the context may differ, and they should be clustered separately. Moreover, in a manual recording of logs, an inconsistency in activity names will render an activity a separate modeling construct, thus increasing the complexity. The same holds for abstraction and pattern-based approaches, too, where the impact of duplicate instantiations of an activity, e.g., a duplicate label (the same activity in multiple levels), needs to be evaluated. This calls for an investigation of the effect of such patterns on model complexity and the relevant remediation approaches.

One important finding was that many experiments had been conducted on already present datasets from the BPI Challenges data repository. One reason for this can be the unavailability of datasets. However, datasets from other domains and sources must also be considered for the generalization of the approach.

Regarding the discriminatory treatment of infrequent behavior, there is a need to differentiate between real noise and infrequent behavior and to redefine the term noise in process mining, for example by differentiating infrequent legal behavior from infrequent illegal behavior. There can be many alternative explanations for infrequent behavior in the log, e.g., infrequently occurring non-compliant behavior is critical for compliance checking. An approach to clustering and modeling frequent and infrequent behaviors separately can help to reduce this impact. In a similar avenue, incorporating domain knowledge to filter out irrelevant activities is also a potential approach to deal with the effect of frequency-based biases when simplifying the process model.

In the context of pattern-based approaches, the problem of computational complexity exists. Researchers encountered a high computation time during pattern detection [79]. Further research is required to optimize the pattern computation time.

Finally, it is recommended that complexity metrics be utilized for complexity assessment rather than relying on visual analysis. The type of modeling notation should also be kept in mind, as not all metrics work for all types of modeling notations. For example, the number of nodes and arcs, density, CNC, and CN measures seem equally feasible for two popular modeling notations, i.e., Petri nets and Directly-Follows Graphs (DFGs). On the other hand, the P/T-CD and CFC metrics are unique to the Petri net modeling notation, as they evaluate split constructs in the process model that are not used in Directly-Follows Graphs.

... dealt with by only recording certain abstracted versions of activities. Nevertheless, the fine-grained recording of process execution is highly significant for diagnostics and accurately pinpointing process performance bottlenecks. Both supervised and unsupervised event abstraction strategies are observed in the literature. Supervised abstraction is limited to being guided by domain knowledge, whereas in the case of unsupervised event abstraction, there remains uncertainty about whether the abstraction hierarchy is valid. It remains an open question how to assess the validity of abstraction in cases where no domain knowledge is present. Further research is needed in this direction.

In the case of post-event execution, the processes can be simplified using process mining algorithms. Researchers have proposed four approaches: clustering, abstraction, filtration, and pattern mining. The Fuzzy miner [8] incorporates clustering and abstraction mechanisms to simplify process models. Because it generates simple, scalable, and filtered models based on Directly-Follows Graphs (DFGs), the Fuzzy miner remains the top choice for commercial process mining tools [61], [89]. However, DFG-based models fail to distinguish between a choice and a split construct [10].
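As a rough illustration of what a DFG is and how the notation-independent size measures (nodes, arcs, density, CNC, CN) could be obtained from it, the following minimal sketch builds a DFG from a toy event log. The function names, the toy log, and the formulas are assumptions made for illustration: density is taken as arcs / (nodes × (nodes − 1)), CNC as arcs / nodes, and CN as the cyclomatic number arcs − nodes + 1, which are common definitions in the process model metrics literature and may differ in detail from those used in the reviewed studies.

```python
from collections import Counter


def directly_follows_graph(traces):
    """Count how often activity a is directly followed by activity b."""
    dfg = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dfg


def size_metrics(dfg):
    """Size-based complexity measures of a DFG (assumed definitions)."""
    nodes = {activity for arc in dfg for activity in arc}
    n, a = len(nodes), len(dfg)
    return {
        "nodes": n,
        "arcs": a,
        # Share of possible directed arcs (self-loops not counted as possible)
        "density": a / (n * (n - 1)) if n > 1 else 0.0,
        # Coefficient of network connectivity: arcs per node
        "cnc": a / n if n else 0.0,
        # Cyclomatic number, assuming a single connected graph
        "cn": a - n + 1,
    }


# Hypothetical toy log: each trace is the ordered list of activities of one case
toy_log = [
    ["register", "check", "approve", "archive"],
    ["register", "check", "reject", "archive"],
    ["register", "approve", "archive"],
]

dfg = directly_follows_graph(toy_log)
metrics = size_metrics(dfg)
print(metrics)
```

Because every directly-follows relation becomes an arc, each additional trace variant tends to add arcs without adding nodes, which is one way to see why flexible processes yield dense, hard-to-read DFGs.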

The filtration approach was found to be used as an alternative. Frequency-based event prioritization is an interesting way to model only certain recurring behaviors. Nevertheless, the filtration approach has negative implications for the later stages of a process mining project. It threatens the validity of the conformance checking perspective of process mining, where process compliance is assessed against pre-specified rules and regulations. Non-compliant behavior often occurs infrequently and will be ignored if filtration approaches are used. Filtration approaches can benefit from a prospective direction in which frequent and infrequent behavior are treated distinctly. Compliance-relevant behaviors should be preserved during filtration, and the remaining behaviors can be filtered out as usual.
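The distinct treatment suggested above can be sketched as a variant filter that drops infrequent trace variants unless a compliance rule marks them as relevant. This is only a minimal sketch under stated assumptions, not a method taken from the reviewed studies: the function `filter_variants`, the threshold `min_share`, the rule `skips_check`, and the toy log are all hypothetical.

```python
from collections import Counter


def filter_variants(traces, min_share, is_compliance_relevant):
    """Keep frequent variants, plus infrequent ones flagged as compliance-relevant."""
    variants = Counter(tuple(trace) for trace in traces)
    total = sum(variants.values())
    kept = []
    for trace in traces:
        variant = tuple(trace)
        frequent = variants[variant] / total >= min_share
        if frequent or is_compliance_relevant(variant):
            kept.append(trace)
    return kept


def skips_check(variant):
    """Hypothetical rule: 'approve' occurring without a preceding 'check' is a violation."""
    if "approve" not in variant:
        return False
    return "check" not in variant[:variant.index("approve")]


# Hypothetical toy log with one dominant variant and two rare ones
toy_log = (
    [["register", "check", "approve"]] * 8
    + [["register", "approve"]]                    # rare and non-compliant
    + [["register", "check", "check", "approve"]]  # rare, harmless rework
)

filtered = filter_variants(toy_log, min_share=0.2, is_compliance_relevant=skips_check)
print(len(filtered))
```

With these inputs, the rare rework variant is filtered out, while the rare non-compliant variant survives and remains visible to a later conformance check. In practice, the rule would have to come from domain knowledge or pre-specified regulations.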

Subjective process mining techniques, such as guided and Local Process Model (LPM) discovery, are bounded by the modeler's choice. The process modeler selects only those fragments of interest and concern to the process, e.g., process fragments that involve high financial costs. They are superior to abstraction- and filtration-based model discovery for producing simpler models, as their focus remains on a specific part of the process regardless of frequent or infrequent behavior. However, LPM and guided process model discovery are only feasible when end-to-end model discovery is not a concern. Still, we think pattern-based approaches are promising for dealing with logs where multiple processes or subprocesses are recorded in a single event log.

Results indicate that trace clustering remained the most popular technique for dealing with complexity. However, it was noted that rather than applying the straightforward simplification approach, the clustering techniques implicitly

The findings of this review comprehensively shed light on the process complexity problem in process mining. We think that the arguments about the practical and theoretical implications of existing approaches will help both novice and established researchers in this domain understand this problem in a broader context. Moreover, the review is expected to pave the way for extending knowledge in this direction through the proposed future research endeavors.

Although this review was intended to be comprehensive, there are some threats to the validity of the results and findings. For quality purposes, only papers indexed in popular databases and having at least one citation were included in the review. Relaxing this criterion would increase the number of articles but compromise quality. Moreover, the focus remained on journal articles and conference proceedings; book chapters and workshop papers were excluded. However, many conference proceedings have been published as book chapters in the well-known Springer Lecture Notes series. Such articles are unaffected by our book exclusion criterion.

Process mining techniques hold the potential to find bottlenecks and improve the business processes of organizations. However, if the process mining results are complex or not understandable, the whole project becomes useless. Several researchers have approached the complexity problem in process mining, but a general overview of the topic remained missing. In this paper, we conducted a systematic literature review to obtain a unified overview of the approaches for dealing with the problem of complexity in the process mining domain. Six well-known research databases were searched. In addition to a conceptual model of the complexity problem, a taxonomy of complexity reduction approaches was formulated. It was identified how the process complexity problem is realized across different studies, what factors contribute to it, and how complexity is analyzed and prevented. Subsequently, research gaps and future research directions were identified.

Findings reveal that flexibility in the process, a fine-grained level of detail, and noise in the logs are the main contributors to process complexity. Moreover, it was found that the complexity problem is addressed using four approaches: clustering, abstraction, noise removal, and pattern mining. The metrics used to assess complexity were also identified. It was noted that the emphasis of complexity analysis remained on behavioral complexity measures, while less importance is given to structural complexity, which directly relates to the process model's comprehensibility. Finally, several research gaps and future research directions are presented.
received the bachelor's degree from the University of Malaya, Malaysia, the master's degree from the University of Putra Malaysia, and the Ph.D. degree from the University of Malaya. She has more than 20 years of teaching experience since she started her career as a Lecturer at the University of Malaya, where she is currently an Associate Professor with the Department of Information Systems, Faculty of Computer Science and Information Technology. She was involved in various research projects, leading to the publication of several academic papers in the areas of information systems, specifically on educational technology, recommender systems, and data mining. She has published more than 70 conference papers at renowned local and international conferences. A number of her works were also published in reputable international journals. She has participated in many competitions and exhibitions to promote her research works. She has been appointed as a competition judge for several innovation competitions. To date, she has successfully supervised ten Ph.D. and 23 master's students to completion. She hopes to extend her research beyond information systems in her quest to elevate the quality of teaching and learning.