Graph-Based Token Replay for Online Conformance Checking

Conformance checking detects deviations in business process executions. An online detection method is needed to respond immediately and anticipate possible impacts. The state-of-the-art online conformance checking technique is Prefix-Alignment (PA). However, this technique must maintain the administration data of all cases in memory. In an online environment, the last event of a case is never known, whereas PA requires last-event information to release a case from memory and free up space for other cases. Hence, PA does not meet the requirement of online conformance checking to process an infinite event stream without running into memory limits. PA also requires a complex state-space search, especially for large and complex reference process models. In this paper, a Graph-Based Online Token Replay (GO-TR) method is proposed. This method uses a graph database to adapt the Token-Based Replay (TBR) technique, which has a simple replay computation. We propose a Replay Image (RI) to store the case administration and develop a Cypher-based algorithm that simulates token replay on the RI to handle the event stream. We also propose a Cypher-based algorithm to identify and replay invisible paths. The experimental results show that GO-TR successfully adapts TBR and solves the wrong-placed token problem of TBR. GO-TR yields higher replay throughput than PA for relatively small amounts of data in online conformance checking. In terms of memory usage, GO-TR is superior to PA because it does not suffer from memory limitation problems.

... techniques work in offline environments. TBR was first introduced by Rozinat and Aalst [1] as a replay technique based on Petri nets, while Alignment [2] is currently the de facto standard for offline conformance checking because of its ability to provide optimal alignment information.

The associate editor coordinating the review of this manuscript and approving it for publication was Zhangbing Zhou.

In real life, there are many conditions that require immediate inspection. Therefore, an online conformance checking technique is required. Online detection makes it possible to anticipate possible impacts as soon as possible. Prefix-Alignment (PA) [3], [4], [5] is a state-of-the-art replay-based online conformance checking technique. It is a modification of the conventional Alignment technique. However, PA has a limitation of maintaining all of the ...

3. Algorithm for the identification and replay of a graph-based invisible path. We take advantage of the graph database to identify invisible paths accurately and efficiently.

The experimental results show that GO-TR successfully adapts TBR to the graph database and at the same time provides a solution to the wrong-placed token problem of TBR. We also found that, for relatively small amounts of data, GO-TR achieves higher throughput than prefix-alignment. However, GO-TR's throughput decreases as the amount of data increases. In terms of memory usage, GO-TR is superior to PA because it does not have memory limitation problems.

The rest of the paper is organized as follows. Section 2 describes related work. Section 3 explains the definitions and concepts that underlie our proposal. Section 4 discusses the fundamentals of our proposed method. Section 5 presents the experimental results and discusses the findings. Section 6 provides conclusions and an overview of future research opportunities.

In this section, research related to this work is described, from conventional conformance checking and online conformance checking to the development of graph-based process mining.

At the beginning of its growth, process mining was oriented toward extracting event log data in an offline environment. Likewise, conformance checking techniques, such as Token-Based Replay (TBR) [1] and Alignment [2], can only work in an offline environment.

TBR is a conformance checking technique whose replay output describes the series of activities resulting from a replay. The basic TBR algorithm is very simple, but when the reference model contains an invisible task, TBR requires additional effort to detect it. Rozinat and Aalst [1] proposed detecting invisible paths by building a local reachability graph and then tracing the entire state space. This method requires complex computations that slow down TBR execution.

The Alignment technique [2] improved on TBR by building a synchronous product between the reference model and the execution log to choose the best replay route. The resulting series of activities is referred to as an alignment, and the alignment with the smallest cost is known as the optimal alignment.

Berti and Aalst [11] proposed an Improved TBR (ITBR) that adds a preprocessing step to detect the list of all invisible paths at the beginning. The invisible path search selects the shortest route from a list of invisible-path candidates and then runs the algorithm for invisible replay. This solution makes ITBR faster than the Alignment technique for Petri nets, including models with invisible transitions. However, ITBR requires an invisible task replay check which, when it fails, leaves a wrong-placed token problem.

Our work was inspired by the TBR algorithm, which we modified to work on a graph database. A Cypher-based algorithm was also built to identify invisible paths accurately, to replay them efficiently, and to avoid the wrong-placed token problem.

In this section, some of the definitions and concepts that underlie the proposed method are described.

An event log is data generated as a record of activities in an information system. An example can be seen in Table 1. Each row is an event that describes an instance of a process. Event logs are generally stored in XES (eXtensible Event Stream) format. XES groups the events of each case into a single trace, ordered sequentially, according to the case id. A simple illustration of the XES format of the event log in Table 1 is given in Table 2.

Each event arrival adds to the completeness of a case bound by a behavioral relationship, so event streams containing interleaved events of different case ids must be handled separately and concurrently.
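To make the last point concrete, the following minimal Python sketch separates an interleaved event stream into per-case partial traces. The function name `group_stream_by_case`, the (case id, activity) tuple layout, and the sample events are illustrative assumptions and are not taken from the original method.

```python
from collections import defaultdict


def group_stream_by_case(events):
    """Group an interleaved event stream into per-case partial traces.

    `events` is an iterable of (case_id, activity) pairs in arrival order
    (an assumed, simplified event format).
    """
    partial_traces = defaultdict(list)
    for case_id, activity in events:
        # Each arrival extends the partial trace of its own case only.
        partial_traces[case_id].append(activity)
    return dict(partial_traces)


# Hypothetical stream in which two cases alternate.
stream = [
    ("c1", "Queued"),
    ("c2", "Queued"),
    ("c1", "Accepted"),
    ("c2", "Completed"),
    ("c1", "Completed"),
]
traces = group_stream_by_case(stream)
```

In an online setting the grouping would be updated one event at a time instead of over a finished list, but the per-case bookkeeping is the same.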

One of the important activities in operational support is deviation detection in the form of online conformance checking. In contrast to the offline environment, an online conformance checking system has the following unique characteristics [19]: (a) it cannot see the complete case, so it focuses on the event stream as a partial trace of a particular case, and (b) when a deviation occurs, a fast response is required. Fig. 2 illustrates an online conformance checking system for detecting deviations. Because of these unique characteristics, the methods used in the offline environment cannot be applied directly to the online environment. Further modifications and improvements are needed so that the techniques and algorithms can respond to the data flow in real time.

The differences in requirements between offline and online conformance checking are summarized in Table 3, with reference to [18] for the assumptions on data streams and [19] for the unique characteristics of online conformance checking.

This section focuses on TBR, which is adopted in our proposed method. The theory of Alignment is described in [2] and [18].

The TBR discussed in this paper is a method that works ... it is assumed that the environment consumes the tokens from the final marking, so the value of c is increased. If the marking reached after completing the replay of the trace differs from the final marking, the missing tokens are inserted, and finally the number of remaining tokens r is calculated. The following holds during the replay: c ≤ p + m and m ≤ c, so that the relation p + m = c + r holds at the end of the replay.
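The counters can be illustrated with a minimal Python sketch of token replay on a Petri net without invisible transitions. Here p, c, m, and r are taken to be the numbers of produced, consumed, missing, and remaining tokens, following the usual TBR convention. The net encoding, the function names, and the two-transition example are assumptions made for illustration, and the fitness formula is the standard one from the token-based replay literature rather than something stated in this section.

```python
from collections import Counter


def token_replay(net, initial, final, trace):
    """Replay `trace` and return the token counters (p, c, m, r).

    `net` maps a transition label to (input places, output places).
    Invisible transitions are deliberately not handled in this sketch.
    """
    marking = Counter(initial)
    p = sum(initial.values())  # the environment produces the initial tokens
    c = 0
    m = 0
    for label in trace:
        inputs, outputs = net[label]
        for place in inputs:
            if marking[place] == 0:
                m += 1  # insert a missing token so the transition can fire
                marking[place] += 1
            marking[place] -= 1
            c += 1
        for place in outputs:
            marking[place] += 1
            p += 1
    # The environment consumes the tokens of the final marking.
    for place, count in final.items():
        for _ in range(count):
            if marking[place] == 0:
                m += 1
                marking[place] += 1
            marking[place] -= 1
            c += 1
    r = sum(marking.values())  # tokens left behind
    return p, c, m, r


def fitness(p, c, m, r):
    """Standard token-replay fitness: 0.5 * (1 - m/c) + 0.5 * (1 - r/p)."""
    return 0.5 * (1 - m / c) + 0.5 * (1 - r / p)


# Hypothetical sequential net: source -> A -> p1 -> B -> sink.
net = {"A": (["source"], ["p1"]), "B": (["p1"], ["sink"])}
fitting = token_replay(net, {"source": 1}, {"sink": 1}, ["A", "B"])
deviating = token_replay(net, {"source": 1}, {"sink": 1}, ["B"])
```

For the deviating trace, one token is missing at p1 and one remains at source, so p + m = c + r holds with 2 + 1 = 2 + 1.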

A graph database is a database management system based on graph theory. Graph theory uses nodes to store entities and edges to represent the relationships among them. Graph databases emphasize the relationships between data points. The graph implementation in this study uses the Neo4j GDBMS and the graph query language Cypher [9].

The main elements of a graph are nodes and relationships. A node in Cypher is symbolized by parentheses ''()''. Adding a label, as in ''(:Label)'', restricts the match to nodes with that label. In addition, a variable can be bound to the node, as in ''(variableName:Label)'', so that variableName can subsequently be used to access the nodes labeled Label.

A relationship is symbolized by an arrow such as ''-->'', which indicates the direction of the relationship, since each relationship is associated with an ordered pair of nodes, i.e., a source (from) node and a destination (to) node. Cypher relationship patterns always require two nodes, even if no specific node is declared. A minimal example of a relationship pattern in Cypher is therefore ''()-->()'', i.e., a relationship can never exist without source and destination nodes.

... database (or it could be that the event log is already available natively in the graph database), then a graph-based process model discovery is made [7], [20], [21]. The results obtained are still in the directly-follows graph (DFG) representation. The next step is to convert the DFG to a Petri net using Algorithm 1.

[Algorithm 1, surviving fragment: add Place and relation at join point; merge relation on the output side of Place]

The reference model in Petri net form can be converted to its reachability graph by using PM4Py, which provides libraries for generating reachability graphs. Fig. 9 is an example of a process model that produces the reachability graph presented in Fig. 10.
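What a reachability graph contains can be shown with a short Python sketch. This is not PM4Py's implementation; it is a plain breadth-first exploration that assumes a safe net (at most one token per place), so that a marking can be represented as a set of places. The net encoding, the function name, and the AND-split example are hypothetical.

```python
from collections import deque


def reachability_graph(net, initial_places):
    """Explore all markings reachable from the initial marking.

    `net` maps a transition name to (input places, output places).
    Markings are frozensets of places (safe-net assumption).
    Returns the set of markings and a list of (source, transition, target) edges.
    """
    start = frozenset(initial_places)
    states = {start}
    edges = []
    queue = deque([start])
    while queue:
        marking = queue.popleft()
        for name, (inputs, outputs) in net.items():
            if set(inputs) <= marking:  # the transition is enabled
                successor = frozenset((marking - set(inputs)) | set(outputs))
                edges.append((marking, name, successor))
                if successor not in states:
                    states.add(successor)
                    queue.append(successor)
    return states, edges


# Hypothetical net with an AND-split and an AND-join.
net = {
    "split": (["source"], ["p1", "p2"]),
    "A": (["p1"], ["p3"]),
    "B": (["p2"], ["p4"]),
    "join": (["p3", "p4"], ["sink"]),
}
states, edges = reachability_graph(net, ["source"])
```

Each state of the result corresponds to one marking, and each edge records which transition moves the net from one marking to the next, which is the information needed later to search for paths made only of invisible transitions.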

The resulting reachability graph object is then loaded into ...

The GO-TR schema can be seen in Fig. 11. GO-TR accepts input data in the form of event streams. Each incoming event is accompanied by its case id as a replay reference. When a case arrives with a new id, a replica of the process model is created from the reference master as a Replay Image (RI).

VOLUME 10, 2022

FIGURE 11. Proposed graph-based online token replay.

Algorithm 5 explains the GO-TR replay algorithm in detail. The GO-TR technique begins by detecting the identity of the incoming event. If the event has a case id that has never been seen, a new process instance is recognized as being in progress (line 6). Therefore, a new Replay Image (RI) is prepared in the graph database, identifiable through the case id (line 7). Next, the program will make sure the activity name ...

The basic Token-Based Replay algorithm cannot replay an invisible task. To replay a visible task that is preceded by an invisible task, the algorithm needs a missing token insertion, which will be detected as a deviation. We propose a graph-based method to handle invisible task replay. Our proposed method takes advantage of the graph database's ability to store graph data natively and its fast node traversal capabilities.
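The per-case bookkeeping described for Algorithm 5 (a new case id triggers the creation of a Replay Image copied from the reference master) can be sketched in Python as follows. In GO-TR the RI lives in the graph database; the dictionary-based `ReplayImageStore` below is only an in-memory stand-in, and its class name, method names, and model layout are assumptions made for illustration.

```python
import copy


class ReplayImageStore:
    """In-memory stand-in for per-case Replay Images (RIs)."""

    def __init__(self, master_model):
        self.master_model = master_model
        self.images = {}

    def get_or_create(self, case_id):
        # A case id that has never been seen gets its own copy of the master.
        if case_id not in self.images:
            self.images[case_id] = copy.deepcopy(self.master_model)
        return self.images[case_id]


def handle_event(store, case_id, activity):
    """Route one incoming event to the RI of its case and record it."""
    image = store.get_or_create(case_id)
    image["replayed"].append(activity)
    return image


# Hypothetical master model: an initial marking plus an empty replay history.
master = {"marking": {"Source": 1}, "replayed": []}
store = ReplayImageStore(master)
handle_event(store, "case-1", "Queued")
handle_event(store, "case-2", "Queued")
handle_event(store, "case-1", "Accepted")
```

The deep copy keeps the reference master untouched, so the state of one case never leaks into another case or into the master.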

In every iteration of the replay event that comes (in

An invisible path is found when a reachable target_state with the shortest distance is found (lines 7-8). The invisibleReplay function (Algorithm 9) updates the attributes of all nodes and edges along the invisible path found in order to simulate the replay on that path (line 9). The function returns True if the invisible path is found (line 11). Otherwise, it returns False if the spf_target_state is not obtained, in which case no invisible path is available (lines 11-12).
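The idea of searching for a shortest invisible path before changing any state can be sketched in Python. The paper performs this search with Cypher queries on the graph database; the breadth-first search below is a substitute used only for illustration, and the function name, the edge format, and the example markings are hypothetical.

```python
from collections import deque


def find_invisible_path(rg_edges, invisible, current, target_place):
    """Return the shortest list of invisible transitions leading from
    `current` to a marking that contains `target_place`, or None.

    `rg_edges` is a list of (source_marking, transition, target_marking)
    taken from a reachability graph; markings are frozensets of places.
    """
    # Keep only the edges labeled with invisible transitions.
    successors = {}
    for source, transition, target in rg_edges:
        if transition in invisible:
            successors.setdefault(source, []).append((transition, target))

    queue = deque([(current, [])])
    seen = {current}
    while queue:
        marking, path = queue.popleft()
        if target_place in marking:
            return path
        for transition, target in successors.get(marking, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, path + [transition]))
    return None  # no invisible path: the caller leaves the marking unchanged


# Hypothetical reachability graph fragment.
edges = [
    (frozenset({"p1"}), "skip_a", frozenset({"p2"})),
    (frozenset({"p2"}), "skip_b", frozenset({"p3"})),
    (frozenset({"p1"}), "A", frozenset({"p3"})),  # visible, ignored by the search
]
invisible = {"skip_a", "skip_b"}
found = find_invisible_path(edges, invisible, frozenset({"p1"}), "p3")
not_found = find_invisible_path(edges, invisible, frozenset({"p1"}), "p9")
```

Because the search only reads the graph, a failed search has no side effects on the marking; tokens are moved only after a complete path to the target is known.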

The experiment was carried out on a computer with an Intel Core i7-3632QM processor, 16 GB of RAM, and Python 3.6. The pm4py 2.2.4 library was used to perform process discovery and to generate the reachability graphs.

The following are the scenarios of the experiment that was carried out:

The event data was sent to the conformance checking machine as a stream based on the order of arrival (not following the arrival timestamp). The data set used was CCC19 (see footnote 3), which is a public data set. We replicated the case ids five-fold and ten-fold relative to the original 20 available case variants to compare the throughput of both techniques.

2 https://data.4tu.nl/articles/dataset/BPI_Challenge_2013_incidents/12693914/1
3 https://data.4tu.nl/articles/dataset/Conformance_Checking_Challenge_2019_CCC19_/12714932

This experiment aimed to compare the memory usage of PA and GO-TR. The dataset used was the public CCC19 data. The variable observed in this experiment was the amount of memory consumed as events arrived.

This section presents and discusses the results of the experiments that were carried out.

In this section, several test scenarios were observed to ensure the correctness of the GO-TR replay results by comparing them with TBR results. The dataset used was BPIC13 incident management. Fig. 12 presents the process model of the BPIC13 incident log. The model was generated with PM4Py using the Inductive Miner with a noise threshold of 0.2. PM4Py was also used to generate the reachability graph in Fig. 13 from the Petri net object model. There are AND branches (e.g., the AND-split in tau_1) and loops (e.g., Accepted). There is also an invisible task, tauJoin_4, which has two inputs. The first input, p_12, is linked to the invisible tasks skip_13 and skip_9, while the second input, p_9, is connected to the visible task Accepted. This condition is hereinafter referred to as an invisible task with multi visibility input tasks.

The first experiment used a normal case, Queued → Accepted → Completed. This experiment was to prove that the TBR algorithm used in GO-TR could recognize invisible paths. The experimental results are presented in Table 4. The next experiment used a case containing a loop, i.e., Queued → Accepted → Queued → Completed → Completed. The experimental results are presented in Table 5. The marking movement presented in Table 6 can be explained as follows. First, the system provides the initial marking on the place labeled ''Source''. Then, with the arrival of the first event, labeled ''Queued'', the TBR algorithm starts to work. With the reference model in Fig. 12, it can be seen that the initial marking on ''Source'' leaves all input places of ''Queued'' without tokens. Therefore, the TBR algorithm will try to ... (p_8, p_11) → (p_8, p_13), can be found so that ''Queued'' can be replayed normally to produce the state (p_8, p_14).

The results in Table 4 and Table 5 show that GO-TR and ITBR work well in all normal cases. They also give the same results for all statistics.

The third experiment, with the ''wrong-placed token'' problem, contains the following activities: Queued → Completed → Completed. A good TBR algorithm will recognize that it is necessary to add a missing token to p_16. With marking ... ''wrong-placed token'' problem is different. It is interesting.

The results from the marking in the third experiment for GO-TR and ITBR are presented in Table 6.

The marking movement presented in Table 6 can be ... it comes to p_12 the replay stops as p_9 has no token and it is not connected to the invisible task. Therefore, tauJoin_4 cannot be activated. As a result, the replay attempt via the invisible path fails to reach p_16. However, ITBR has already moved the token from p_14 to p_12, so the marking becomes (p_12, p_8). The token now at p_12 is the ''wrong-placed token''. The token position at p_12 will cause errors in the analysis because, apart from being reachable through the activation of ''Queued'', p_12 can also be reached directly via skip_9.

b. Meanwhile, GO-TR, using Algorithm 1, searches for the invisible path from (p_14, p_8) to p_16 through the reachability graph. In this case, the invisible path is not found, so the marking does not change, i.e., it stays at (p_14, p_8). As a result, GO-TR is safe from the ''wrong-placed token'' problem.

The last event to be replayed is the second activity with the ''Completed'' label. This time, ITBR finds the route p_17 → p_16 as an invisible path. On the other hand, in GO-TR, because the previous activity was a deviation, no state ''(p_14, p_8, p_17)'' is found in the reachability graph. In that case, we had to run Algorithm 2 through a simulation by tracing the invisible paths. If the search fails to reach the target (the place with the missing token), the algorithm rolls back all states of the invisible paths to the initial condition. As a result, both ITBR and GO-TR can find the invisible path that reaches p_16. After a successful replay of ''Completed'', the marking positions return to (p_12, p_8, p_17) for ITBR and (p_14, p_8, p_17) for GO-TR.

In the GO-TR simulation, the CCC19 log data needs to be converted from an event log to an event stream representation by sorting the events by arrival time.

As the first step of the GO-TR experiment, we load the reference model into the graph database using Algorithm 2. The algorithm generates a Petri net representation of the process model from its PNML format in the graph database. In the next step, GO-TR also needs a reachability graph (RG) to identify the invisible paths. We use the PM4Py library to generate the RG model from the Petri net ... Table 7 and the throughput results are shown in Table 8 and depicted in Fig. 14.

The PA with ''w=full'' requires the most computation because it performs a complete optimal alignment computation from the beginning for each new event arrival. A PA with a small window, for example ''w=1'', is very fast. However, it weakens the guarantee of obtaining an optimal alignment. Moreover, it still has memory limitation problems.

Based on the results of the experiment, GO-TR has the highest throughput when the number of cases is small. This is due to its simple computation, which allows it to execute replays in a short time. On the other hand, the PA computation strongly influences replay speed: the faster the computation, the greater the throughput. At ''w=1'', the PA throughput is close to the GO-TR throughput.

The data in Table 8 show that GO-TR experiences a decrease in throughput as the number of handled cases increases. This is because, in GO-TR, each case has its own RI. The more cases are handled, the longer the query time required in the replay process.

... the RI. Based on the experiment, GO-TR is shown not to be vulnerable to memory limitation problems.

In this paper, we propose Graph-based Online Token Replay (GO-TR), a replay-based online conformance checking technique that is not vulnerable to memory limitations. Our proposed solution adapts the token replay technique to a graph database. In building GO-TR, we made several contributions: proposing Replay Images as representations of Petri net models in a graph database, adapting Token-Based Replay to a graph database for online conformance checking on event stream data, and proposing Cypher-based algorithms for invisible path identification and invisible path replay.

Based on observations and analysis of the experiments, GO-TR has successfully adapted TBR and is not vulnerable to the wrong-placed token problem. For small amounts of data, GO-TR achieves higher throughput than PA. However, GO-TR's throughput decreases as the amount of data increases. In terms of memory usage, GO-TR has an advantage over PA because it is not vulnerable to memory limitations.

In future work, a study will be conducted on maintaining query performance as the data grows. In addition, the performance of GO-TR's response to high-speed data also needs to be observed.

INDRA WASPADA is currently pursuing the Ph.D. degree in computer science with the Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia. He is also a Lecturer and a Researcher with the Department of Computer Science, Universitas Diponegoro, Semarang. His current research interest includes process mining. He is also interested in data mining and business process management.

RIYANARTO SARNO (Senior Member, IEEE) received the Ph.D. degree in 1992. He is currently a Professor with the Informatics Department, Institut Teknologi Sepuluh Nopember (ITS). He is the author of more than five books and over 300 scientific articles, which led to his inclusion in the top 2% of world-ranking scientists by Stanford University in 2020. He has researched process mining for a period of five years.