Petri Net Based Data-Flow Error Detection and Correction Strategy for Business Processes

Conceptual modeling, which includes both control-flow and data-flow modeling, has posed great challenges for conventional business process management systems. To support systematic data-flow modeling and analysis, we propose a Petri net-based approach in this paper. We first introduce a new type of Petri net, called WFIO-net, whose firing rule is formally defined by extending the classical Petri net with read and write semantics, to model both control-flow and data-flow information. Then, we discuss three possible data-flow errors in a WFIO-net, i.e., missing, redundant, and conflicting data errors. To detect such data-flow errors, we develop a polynomial-time algorithm based on the so-called Activity-Data Incidence Matrix (ADIM) of a WFIO-net. Following that, we propose three correction strategies to resolve the detected data-flow anomalies. Finally, we present a property loan approval business process case study, and the results demonstrate that the proposed detection and correction approaches are effective and can be applied to real-life settings.


I. INTRODUCTION
With the growth of event data from large information systems, Business Process Management (BPM) has received increasing attention from researchers and practitioners in various domains. A business process is made up of a well-defined collection of activities (also referred to as tasks) and their execution order. Modeling and verification of business processes (i.e., workflows) have been studied for many years, and it is well known that the control-flow is the backbone of a business process [1]. To capture the useful information in a process more comprehensively, many other characteristics of a process, such as time (e.g., constraints on activity execution duration), resources (e.g., roles), and data-flows (e.g., decisions), are becoming increasingly important for today's business process analytics.
The associate editor coordinating the review of this manuscript and approving it for publication was Shouguang Wang.
Up to now, business process structural correctness verification has been extensively studied, and various approaches have been proposed [1]-[5]. Meanwhile, to cope with further analysis requirements of a process, such as the time and resources mentioned above, more advanced techniques have also been studied, for example, time management [6]-[10] and resource management [11]-[13] in a business process. The former often considers activity execution duration, deadline constraints, and the temporal consistency of a process, whereas the latter focuses on analyzing the resource usage of activities, as activities always need to access system resources (e.g., facilities) and social resources (e.g., employees) during their execution.
Similar to time and resources, data is another important factor for business process modeling and analysis. For instance, the routing choices of a process typically depend on certain data elements. Since a constructed data-flow model could be erroneous, various methods for verifying the correctness of data-flows have been investigated in recent years [14]-[21]. However, to the best of our knowledge, all of these approaches focus only on detecting data-flow errors, and how to correct the detected anomalies has not yet been fully studied.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
In this paper, we introduce a novel approach targeting the systematic modeling of data-flow in business processes and the detection and correction of data-flow errors. More precisely, our research goal is to develop a methodology for detecting data-flow errors and correcting them with effective strategies by extending our previous work [25]. The main contributions of this paper are summarized as follows:
• We introduce a Petri net-based approach (WFIO-net) for formally modeling business processes, from both the control-flow and data-flow aspects.
• We present comprehensive formal definitions of three possible data-flow errors in our approach.
• To detect the data-flow errors, we introduce an algorithm that runs in polynomial time. Moreover, to resolve the detected data-flow errors, we develop a set of correction strategies.
• We present a property loan approval business process case study for our approach. Our results demonstrate that the proposed approach is practical and can be applied to real-life settings.
The remainder of this paper is organized as follows. Section II discusses related work. Section III introduces a systematic data-flow modeling approach based on WFIO-net. Section IV presents the formal definitions of three data-flow errors, together with their detection algorithm and correction strategies. Section V presents a property loan approval business process to validate the applicability of the proposed approach, and Section VI concludes the paper.

II. RELATED WORK
This section summarizes the work related to control-flow and data-flow verification of business processes.

A. CONTROL-FLOW AND DATA-FLOW MODELING AND VERIFICATION
As one of the most dominant models, the WF-net [1], which is a special type of Petri net, has been widely applied for process-oriented modeling and analysis. For a WF-net, a correctness criterion called soundness is defined by van der Aalst in the work [1]. Note that a WF-net does not always represent a correct business process, since errors such as deadlocks and livelocks may occur in the process model. For some subclasses of WF-nets, for example free-choice nets, soundness can be decided in polynomial time. However, most current verification techniques require an exhaustive exploration of the state space of a net, and thus they often suffer from low detection efficiency [2], [5], [40], [41]. To address this limitation, the works [41]-[44] propose a guard-driven reachability graph approach that can improve the analysis performance.
The concept of data-flow correctness verification of business processes is first introduced in the work [14]. Generally, there are three main types of data-flow problems, i.e., missing data, redundant data and inconsistent data. Currently, only a few approaches can be used to detect and correct these errors. For example, approaches based on Dual Workflow Nets (DWF-nets) [15] try to extend a control-flow model by adding data-flow elements. However, a DWF-net is very complex even for a very small business process, and there is no explicit indication of how to create this type of model.
To obtain control-flow information from data-flow, Sun and Zhao present an approach in [17]. They investigate the possibility of incorporating formal analytics into business process designs, to alleviate the intensive intellectual challenges faced by business analysis. This analytics can construct a business process using relevant activity information and their associated (i.e., input and output) data. Moreover, in [18], Trcka et al. introduce the notion of Workflow Data (WFD) nets. Basically, a WFD is a formal business process model with data which can be read, written and destroyed. In [19], a data-flow matrix and a set of relations between data elements and activities are first defined. Then, some basic types of data-flow errors are conceptualized on the basis of UML activity diagrams. To extend the applicability of the approaches based on activity diagram, a number of rules are defined for the transformation from activity diagrams to Petri nets by Wohed et al. [39]. Their work is further generalized by Sundari et al. in [20]. More recently, a comprehensive survey on data-flow modeling and verification of business processes is reported by Dolean and Petrusel [21]. They conclude that: (1) data is essential in a business process and its execution requires data; (2) the control-flow cannot be executed without data; and (3) there is no modeling approach focusing on how data state changes during process execution.
In fact, the problem of finding the described data errors in our Petri net-based formal model is quite similar to classical model checking problems, which are addressed by state-based model checkers such as TAPAAL 2.0 [22]. However, traditional model checking tools (like TAPAAL) normally have the following three limitations when checking data-flow errors: (1) timed-arc Petri nets do not support data elements explicitly. Therefore, we would have to extend the tool or transform the data perspective constraints into timed-arc Petri net constructs; (2) the analysis of the TAPAAL tool is based on exploring the reachable states of a model. Although an efficient algorithm has been proposed, performance is still an issue because the search space can be huge. In contrast, as we will present later, our proposed approach performs its computation on a structural representation of the Petri net (i.e., the incidence matrix), which avoids state-space exploration and keeps the analysis simple; and (3) TAPAAL-like tools provide full support for time factors, and thus they are well suited for time performance analysis. However, the data-flow error detection problem does not require a model to contain any time factors, as the temporal checking is based on the precedence relationship among transitions. In such scenarios, the detection process would be over-complicated when using a TAPAAL-like tool.

B. SUMMARY
Based on the above literature review, we can see that there is a long history of studies on business process correctness verification. Although an abundance of formal techniques have been proposed to detect control-flow errors, there is very little support for systematic data-flow modeling and verification. Moreover, existing work on data-flow has the following problems: (1) most existing data-flow analysis approaches only provide methods to check data-flow errors, without formal models that have operational semantics; and (2) how to correct the detected data-flow anomalies is ignored.
To address these issues, we propose an approach for defining and detecting data-flow anomalies in a formal way, as presented in the following sections. Effective anomaly resolution methods are very helpful for constructing correct workflow models. Therefore, we also introduce several strategies to support the correction of detected data-flow errors. Moreover, we demonstrate the effectiveness and applicability of the proposed approach with a property loan approval business process case.

III. PETRI NET BASED MODELING OF CONTROL-FLOW AND DATA-FLOW
Our work is based on Petri nets, or WF-nets to be more precise. Here, we assume readers are familiar with the basic concepts of Petri nets [23]-[26]. Some basic terminology and notation are given in the following to keep the paper self-contained.

A. PETRI NET AND WORKFLOW NET
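A 3-tuple $N = (P, T, F)$ is called a net if: (1) $P$ is a finite set of places and $T$ is a finite set of transitions such that $P \cap T = \emptyset$ and $P \cup T \neq \emptyset$; and (2) $F \subseteq (P \times T) \cup (T \times P)$ is the flow relation. Here, we give the formal definition of a Petri net following the work [16], [23], [24] and [2], [27]-[38].
Definition 1: A Petri net is a 4-tuple $(P, T, F, M_0)$, where $(P, T, F)$ is a net and $M_0 : P \to Z^+$ is the initial marking; for any $p \in P$, $M_0(p)$ represents the number of tokens in place $p$, and $Z^+$ is the set of non-negative integers.
For any $x \in P \cup T$, the set ${}^{\bullet}x = \{y \mid (y, x) \in F\}$ is named its pre-set (input), and $x^{\bullet} = \{y \mid (x, y) \in F\}$ is named its post-set (output). To describe the semantics of a Petri net, we use markings. A marking is a multi-set of places, indicating how many tokens each place contains. The initial marking is denoted by $M_0$. A transition $t \in T$ is enabled at a marking $M$, denoted by $M[t\rangle$, if $M(p) \geq 1$ for every $p \in {}^{\bullet}t$. If $M[t\rangle$ holds, $t$ may fire, and this yields a new marking $M'$ such that $M'(p) = M(p) - 1$ if $p \in {}^{\bullet}t \setminus t^{\bullet}$, $M'(p) = M(p) + 1$ if $p \in t^{\bullet} \setminus {}^{\bullet}t$, and $M'(p) = M(p)$ otherwise. A Petri net can model a business process, and we call such a model a WF-net. The definition of a WF-net is given below, following the work [1].
Definition 2: A Petri net $(P, T, F, M_0)$ is a WF-net if: (1) there is one source place $p_s \in P$ such that ${}^{\bullet}p_s = \emptyset$; (2) there is one sink place $p_e \in P$ such that $p_e^{\bullet} = \emptyset$; (3) any $x \in P \cup T$ is on a path from $p_s$ to $p_e$; and (4) for any $p \in P$, $M_0(p) = 1$ if $p = p_s$, and otherwise $M_0(p) = 0$.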
Note that a path in a Petri net is defined as a sequence of nodes (i.e., places and transitions) linked by directed arcs. A WF-net is capable of modeling the different control-flow constructs of a business process. In the following, we use transition and activity synonymously, as the WF-net is specifically used to model business processes in this paper.
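The structural WF-net conditions (a single source place, a single sink place, and every node lying on a path between them) can be checked mechanically. The following is a minimal sketch under the assumption that a net is given as plain Python collections; the function name `is_wf_net` and the node labels in the usage note are illustrative and do not come from the paper.

```python
from collections import deque


def is_wf_net(places, transitions, arcs):
    """Check the structural WF-net conditions on a net (P, T, F).

    places, transitions: iterables of node names; arcs: iterable of (x, y) pairs.
    """
    nodes = set(places) | set(transitions)
    succ = {n: set() for n in nodes}
    pred = {n: set() for n in nodes}
    for x, y in arcs:
        succ[x].add(y)
        pred[y].add(x)

    # A source place has an empty pre-set; a sink place has an empty post-set.
    sources = [p for p in places if not pred[p]]
    sinks = [p for p in places if not succ[p]]
    if len(sources) != 1 or len(sinks) != 1:
        return False

    def reachable(start, neighbours):
        # Breadth-first search along the given neighbour relation.
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    # Every node must be reachable from the source and must reach the sink.
    return (reachable(sources[0], succ) == nodes
            and reachable(sinks[0], pred) == nodes)
```

For a simple sequence such as `ps -> t1 -> p1 -> t2 -> pe` the check succeeds, whereas adding a transition that consumes from `p1` but produces nothing makes it fail, because that transition is not on any path to the sink.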

B. CONTROL-FLOW NET MODEL
The control-flow aspect of a business process is modelled by a WF-net model.

C. DATA-FLOW NET MODEL
The control-flow net model can be used to model business process logic. However, it does not support the data-flow elements of a business process. To remedy this problem, we propose a data-flow net model as below.
Definition 4 [25]: A Data-flow Net Model is a 5-tuple where (1) $P_D$ represents the data element set; (2) $T_A$ is the activity set of the business process;
For the transitions in Fig. 1, the corresponding data-flow net model is shown in Fig. 2. It can be formalized as:

D. WFIO-NET MODEL
To combine the modeling of the control-flow and data-flow elements of a business process, we introduce a new kind of workflow net, called WFIO-net, by extending each transition with its corresponding input and output data sets. Its detailed definition is given as follows.
Definition 5: A WFIO-net is a 6-tuple $IO = (P, T, F, I, O, M_0)$, where:
• $P = P_D \cup P_L$ and $P_D \cap P_L = \emptyset$, where $P_D$ is the data place set and $P_L$ is the logic place set;
• $T = T_A \cup T_L$, where $T_A$ is the activity transition set and $T_L$ is the logic transition set;
• $F = F_L \cup F_D$, where $F_L$ is the token flow and $F_D$ is the data flow;
• $I : T_A \to P_D$. For any $t_a \in T_A$, $I(t_a)$ is the input data set of transition $t_a$;
• $O : T_A \to P_D$. For any $t_a \in T_A$, $O(t_a)$ is the output data set of transition $t_a$; and
• $M_0$ is the initial marking.
Based on the above definition, we can see that $(P_L, T_L, F_L, M_0)$ is a WF-net and it represents the control-flow structure of a business process. Compared to a WF-net, a WFIO-net has three main differences: (1) a WFIO-net has two types of places, i.e., the data place set $P_D$ and the logic place set $P_L$; (2) a WFIO-net has two kinds of flow relations, i.e., one represents the traditional token flow $F_L$ while the other represents the data flow $F_D$. Graphically, in a WFIO-net, a token flow arc is drawn as a directed arc ending with an arrow, and a data flow arc is drawn as a directed arc ending with a small circle; and (3) the transition firing rule of a WFIO-net is different from that of a classical WF-net (see the remainder of this subsection for details).
From the above differences and according to Definition 5, we can see that a WFIO-net model is actually the union of a control-flow net model and a data-flow net model, obtained by merging their transitions with the same labels. Therefore, if we take the control-flow net model and the data-flow net model in Fig. 1 and Fig. 2 as inputs, we obtain the WFIO-net model shown in Fig. 3.
In more detail, we denote the pre-set of a transition $t$ in a WFIO-net as ${}^{\bullet}t$, which is the union of the logic pre-set of $t$ and the data pre-set of $t$ (or read place set). Similarly, the post-set $t^{\bullet}$ of a transition $t$ also involves two parts, namely the logic post-set of $t$ and the data post-set (or write place set) of $t$. For example, Fig. 4(a) shows a simple transition model of a WFIO-net. The pre-set of $t_a$ is ${}^{\bullet}t_a = \{p_1, p_{read}\}$, its logic pre-set is $\{p_1\}$, and its read place set is $\{p_{read}\}$. Under the firing rule of a WFIO-net, the marking of any place that is not affected by the firing remains unchanged, i.e., $M'(p) = M(p)$. Based on this, we have the following explanations for the WFIO-net model illustrated in Fig. 4: (1) $p_{read}$ and $p_{write}$ represent data places, and $p_1$ and $p_2$ represent logic places; (2) $t_a$ is enabled in Fig. 4(a) as $p_1$ and $p_{read}$ each contain a token; and (3) Fig. 4(b) gives the state after firing $t_a$, where the token in $p_1$ is moved to $p_2$ and $p_{write}$ obtains one token without consuming the token in $p_{read}$, i.e., data elements are not consumed when they are read.
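The behaviour described for Fig. 4 can be illustrated with a small executable sketch. It assumes a marking is a dictionary from place names to token counts, and that each write adds one token to the written data place; the latter is an assumption made here for illustration, and the function names `enabled` and `fire` are hypothetical.

```python
def enabled(marking, logic_pre, read_places):
    """A transition is enabled if every logic input place and every
    read (data) place holds at least one token."""
    return all(marking.get(p, 0) >= 1 for p in set(logic_pre) | set(read_places))


def fire(marking, logic_pre, logic_post, read_places, write_places):
    """Return the marking reached by firing the transition.

    Logic tokens are consumed and produced as in a classical Petri net;
    tokens in read places are tested but NOT consumed.
    """
    if not enabled(marking, logic_pre, read_places):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for p in logic_pre:
        new_marking[p] -= 1                            # consume control token
    for p in logic_post:
        new_marking[p] = new_marking.get(p, 0) + 1     # produce control token
    for p in write_places:
        # Assumption: a write adds one token to the data place.
        new_marking[p] = new_marking.get(p, 0) + 1
    return new_marking


# The situation of Fig. 4(a): p1 and p_read are marked.
m0 = {"p1": 1, "p2": 0, "p_read": 1, "p_write": 0}
m1 = fire(m0, logic_pre={"p1"}, logic_post={"p2"},
          read_places={"p_read"}, write_places={"p_write"})
```

After the call, `m1` has the control token in `p2`, a token in `p_write`, and the token in `p_read` still in place, which matches the description of Fig. 4(b).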

IV. DATA-FLOW ERROR DETECTION AND CORRECTION STRATEGIES
In this section, we present the details of the data-flow errors of a business process as well as their detection and correction strategies.

A. TAXONOMY OF BASIC DATA-FLOW ERRORS
As discussed by Sun et al. in the work [10], the missing data error, redundant data error and conflicting data error are the most fundamental data-flow anomalies in business processes. Other, more advanced types of data-flow errors have been given by Sadiq et al. in [7], and they can be represented by these basic ones. Therefore, in this work, we restrict ourselves to the three basic data-flow errors. Their details and formal definitions in a WFIO-net are given as follows.

1) MISSING DATA ERROR
A missing data error occurs when a data element is accessed without being initialized. It is very similar to a variable in a program that is used without being defined or initialized. For the missing data error in a WFIO-net, we give its formal definition as below.
Definition 6: Let $IO = (P, T, F, I, O, M_0)$ be a WFIO-net. For any $p_d \in P_D$, $p_d$ is defined as a missing data element if: (1) ${}^{\bullet}p_d = \emptyset$ and (2) $p_d^{\bullet} \neq \emptyset$. In Definition 6, the first condition indicates that the data element $p_d$ is not initialized, and the second one means that this data element is used by at least one follow-up activity. Similar to the analysis in the work [10] and [7], missing data errors would lead to system exceptions and dangling. Therefore, they should be detected and corrected before system enactment.

2) REDUNDANT DATA ERROR
A redundant data error occurs when a data element is produced but never used by any follow-up activity. It is similar to a program variable that is defined and initialized but never used. The formal definition of redundant data errors based on a WFIO-net is given as below.
Definition 7: Let $IO = (P, T, F, I, O, M_0)$ be a WFIO-net. For any $p_d \in P_D$, $p_d$ is defined as a redundant data element if: (1) ${}^{\bullet}p_d \neq \emptyset$ and (2) $p_d^{\bullet} = \emptyset$. Similarly, the first condition in Definition 7 indicates that the data element is initialized, and the second one means that the data element is not used by any follow-up activity. Redundant data errors may lead to system inefficiency, and therefore, they should be detected and corrected beforehand.

3) CONFLICTING DATA ERROR
A conflicting data error occurs when there exist multiple versions of the same data element in one process instance. For example, multiple activities attempt to initialize (or write) the same data element at the same time. The formal definition of conflicting data error based on a WFIO-net is given as follows.
Definition 8: Let $IO = (P, T, F, I, O, M_0)$ be a WFIO-net. For any $p_d \in P_D$, $p_d$ is defined as a conflicting data element if: (1) ${}^{\bullet}p_d \neq \emptyset$; (2) $p_d^{\bullet} \neq \emptyset$; and (3) $|{}^{\bullet}p_d| \geq 2$. In Definition 8, the first and third conditions mean that the data element is initialized by multiple activities, and the second means that the data element will be used. Conflicting data errors may cause uncertainty or even confusion, and therefore, they should be detected and corrected before a business process is enacted.
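For quick reference, the three error types can be read as constraints on the number of writers $|{}^{\bullet}p_d|$ and readers $|p_d^{\bullet}|$ of a data place, following the prose explanations that accompany Definitions 6-8. The predicate names below are introduced here only for illustration and are not part of the original definitions.

```latex
\begin{aligned}
\mathrm{Missing}(p_d)     &\iff |{}^{\bullet}p_d| = 0 \;\wedge\; |p_d^{\bullet}| \geq 1,\\
\mathrm{Redundant}(p_d)   &\iff |{}^{\bullet}p_d| \geq 1 \;\wedge\; |p_d^{\bullet}| = 0,\\
\mathrm{Conflicting}(p_d) &\iff |{}^{\bullet}p_d| \geq 2 \;\wedge\; |p_d^{\bullet}| \geq 1.
\end{aligned}
```

Read this way, the three conditions are mutually exclusive: a data place with several writers but no reader is classified as redundant rather than conflicting.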

B. DATA-FLOW ERRORS CHECKING ALGORITHM
In this subsection, we present an approach to detect these previously defined data-flow errors.
According to the definitions in [23], any kind of Petri net can be represented in the form of an incidence matrix. The existing incidence matrix only captures control-flow information and includes no data-flow relation. To this end, we define the Activity-Data Incidence Matrix, ADIM for short, as follows. The ADIM of the WFIO-net in Fig. 4 is constructed as the matrix shown in Fig. 5.
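As an illustration of the idea, an activity-by-data matrix can be built directly from the input and output data sets of each activity. The sketch below is not the paper's formal ADIM definition: the cell encoding (`"r"` for read, `"w"` for write, `"rw"` for both, and the empty string for no relation) is an assumption borrowed from the r/w notation used for the case study, and `build_adim` is a hypothetical helper name.

```python
def build_adim(activities, data_elements, inputs, outputs):
    """Build an activity-by-data matrix as a list of rows.

    inputs[t]  : set of data elements read by activity t    (I(t))
    outputs[t] : set of data elements written by activity t (O(t))
    Cell encoding (assumed): "" none, "r" read, "w" write, "rw" both.
    """
    adim = []
    for t in activities:
        row = []
        for d in data_elements:
            cell = ""
            if d in inputs.get(t, set()):
                cell += "r"
            if d in outputs.get(t, set()):
                cell += "w"
            row.append(cell)
        adim.append(row)
    return adim


# The single transition of Fig. 4: t_a reads p_read and writes p_write.
adim = build_adim(
    activities=["t_a"],
    data_elements=["p_read", "p_write"],
    inputs={"t_a": {"p_read"}},
    outputs={"t_a": {"p_write"}},
)
```

Rows correspond to activities and columns to data elements, so each column summarizes which activities read or write one data element.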
To detect the three data-flow errors described above in a WFIO-net based on its ADIM, we propose an efficient method, presented in Algorithm 1. There, we check for possible errors on each data element (line 2). For each element, we first compute its statistical information (the numbers of reads and writes) over all transitions using the ADIM (lines 3-9), and then we identify the errors based on Definitions 6-8 (lines 10-16). The complexity of Algorithm 1 is determined by its two nested loops (line 2 and line 3), and thus the complexity of the algorithm is $O(|P_D| \times |T_A|)$, where $|P_D|$ is the number of data elements and $|T_A|$ is the number of activities.
Considering the WFIO-net in Section III as an example, we run Algorithm 1 by taking the ADIM in Fig. 4 as input, and we obtain MisSet = $\{p_{d7}\}$, RedSet = $\{p_{d4}, p_{d10}\}$ and ConSet = $\{p_{d5}\}$. More specifically, (1) data element $p_{d7}$ is missing because it is accessed by activity $t_{A7}$ without being initialized; (2) data elements $p_{d4}$ and $p_{d10}$ are redundant because they are initialized by activities $t_{A7}$ and $t_{A9}$ respectively, but are never used by any activity in the business process; and (3) data element $p_{d5}$ is conflicting because it is initialized (or written) by different activities ($t_{A3}$, $t_{A4}$ and $t_{A5}$), i.e., there may exist different versions of $p_{d5}$ in a single business process instance.
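The detection step can be sketched in a few lines of Python. This is a hedged reconstruction in the spirit of Algorithm 1, not its original listing: it assumes the matrix is a list of rows (one per activity) whose cells contain `"r"` and/or `"w"`, and the counters `r_num` and `w_num` mirror the RNum and WNum counters of the algorithm. The small matrix below is a toy fragment that only reproduces the facts stated in the text; the reader `tA6` of `pd5` is a hypothetical placeholder, since the full model of Fig. 3 is not reproduced here.

```python
def detect_data_flow_errors(adim, data_elements):
    """Return (MisSet, RedSet, ConSet) as lists of data element names."""
    mis_set, red_set, con_set = [], [], []
    for j, d in enumerate(data_elements):       # outer loop: data elements
        r_num = w_num = 0
        for row in adim:                        # inner loop: activities
            if "r" in row[j]:
                r_num += 1
            if "w" in row[j]:
                w_num += 1
        if w_num == 0 and r_num > 0:
            mis_set.append(d)                   # read but never written
        elif w_num > 0 and r_num == 0:
            red_set.append(d)                   # written but never read
        elif w_num >= 2 and r_num > 0:
            con_set.append(d)                   # several writers, and read
    return mis_set, red_set, con_set


data = ["pd4", "pd5", "pd7", "pd10"]
toy_adim = [
    ["",  "w", "",  ""],    # tA3 writes pd5
    ["",  "w", "",  ""],    # tA4 writes pd5
    ["",  "w", "",  ""],    # tA5 writes pd5
    ["",  "r", "",  ""],    # tA6 reads pd5 (hypothetical reader)
    ["w", "",  "r", ""],    # tA7 reads pd7 and writes pd4
    ["",  "",  "",  "w"],   # tA9 writes pd10
]
mis, red, con = detect_data_flow_errors(toy_adim, data)
```

The two nested loops visit every cell of the matrix once, which is where the $O(|P_D| \times |T_A|)$ bound comes from.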

C. DATA-FLOW ERROR CORRECTION STRATEGIES
In this subsection, we propose different strategies to correct the three types of data-flow errors in a WFIO-net. When performing data-flow error corrections, we give priority to preserving the control-flow, so as to avoid changing the business process logic. In the following, we present the details of our correction strategies.

1) CORRECTION STRATEGY 1
Missing Data Error Correction: Let $IO = (P, T, F, I, O, M_0)$ be a WFIO-net, and let $p_d \in P_D$ be a missing data element. An activity $t_{AM}$ that generates the data element $p_d$ should be added to the WFIO-net, satisfying $T' = T \cup \{t_{AM}\}$, with $t_{AM}$ writing $p_d$. The activity $t_{AM}$ can be added at any location in the WFIO-net before the missing data element is used. To keep the correction simple and straightforward, we suggest adding it sequentially, immediately before the activity that uses the missing data element. An example of this strategy is shown in Fig. 6. There, $p_d$ is used by an activity $t_a$ without being initialized. Therefore, we add the activity $t_{AM}$ to initialize $p_d$ before $t_a$. In this way, the missing data error is resolved.

2) CORRECTION STRATEGY 2
Redundant Data Error Correction: Let $IO = (P, T, F, I, O, M_0)$ be a WFIO-net, and let $p_d \in P_D$ be a redundant data element. The redundant data element $p_d$ should be removed from the WFIO-net, satisfying $P' = P \setminus \{p_d\}$, together with the removal of the data-flow arcs connected to $p_d$. An example of this strategy is shown in Fig. 7. In the example, $p_d$ is produced by activity $t_a$ but is never used by follow-up activities. Therefore, we simply remove $p_d$ from the WFIO-net, and consequently the redundant data error is corrected.

3) CORRECTION STRATEGY 3
Conflicting Data Error Correction: Let $IO = (P, T, F, I, O, M_0)$ be a WFIO-net, and let $p_d \in P_D$ be a conflicting data element. We duplicate the conflicting data element $p_d$ according to its different producers. Fig. 8 shows an example of how to apply this strategy. There, $p_d$ is initialized (or written) by activities $t_a$ and $t_b$. Therefore, according to Correction Strategy 3, we duplicate the conflicting data element $p_d$ so that each producer writes its own copy. In this way, the conflicting data error is resolved.
Consider the WFIO-net in Fig. 3 as a running example. Based on Algorithm 1, we know that the data-flow errors are MisSet = $\{p_{d7}\}$, RedSet = $\{p_{d4}, p_{d10}\}$, and ConSet = $\{p_{d5}\}$. Following the three proposed data-flow correction strategies, we obtain the corrected WFIO-net model illustrated in Fig. 9. More specifically, (1) data element $p_{d7}$ is missing because it is accessed by activity $t_{A7}$ without being initialized. According to Correction Strategy 1, we add activity $t_{AM}$ to initialize $p_{d7}$ before it is used by $t_{A7}$. To avoid changing the business logic, we add it sequentially before $t_{A7}$. Therefore, the missing data error is resolved. (2) Data elements $p_{d4}$ and $p_{d10}$ are redundant because they are initialized by activities $t_{A7}$ and $t_{A9}$ respectively, but are not used by any activity in the business process. According to Correction Strategy 2, we remove $p_{d4}$ and $p_{d10}$ from the WFIO-net. Therefore, the redundant data errors are resolved. (3) Data element $p_{d5}$ is conflicting because it is initialized by activities $t_{A3}$, $t_{A4}$ and $t_{A5}$, i.e., there exist different versions of $p_{d5}$ in one process instance. According to Correction Strategy 3, we duplicate the conflicting data element $p_{d5}$ according to its different producers as $p_{d1}$, $p_{d2}$ and $p_{d3}$. In this way, the conflicting data error is corrected.
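The data-flow side of the three correction strategies can be sketched as small transformations on the read and write sets of the activities. This is an illustrative sketch under several assumptions: the net is reduced to two dictionaries (`inputs` and `outputs`, mapping each activity to a set of data elements), the control-flow insertion of the new activity before its consumer is not modeled, copies are named after their producers, and every former reader of a conflicting element is rewired to read all copies. None of the function names or naming schemes below come from the paper.

```python
import copy


def correct_missing(inputs, outputs, d, new_activity="t_AM"):
    """Strategy 1: add an activity that writes the missing element d."""
    inputs, outputs = copy.deepcopy(inputs), copy.deepcopy(outputs)
    inputs[new_activity] = set()
    outputs[new_activity] = {d}
    return inputs, outputs


def correct_redundant(inputs, outputs, d):
    """Strategy 2: remove the redundant element d and its write arcs."""
    inputs, outputs = copy.deepcopy(inputs), copy.deepcopy(outputs)
    for written in outputs.values():
        written.discard(d)
    return inputs, outputs


def correct_conflicting(inputs, outputs, d):
    """Strategy 3: give every producer of d its own copy of d."""
    inputs, outputs = copy.deepcopy(inputs), copy.deepcopy(outputs)
    producers = sorted(t for t, written in outputs.items() if d in written)
    copies = []
    for t in producers:
        duplicate = f"{d}_{t}"          # assumed naming scheme for copies
        outputs[t].discard(d)
        outputs[t].add(duplicate)
        copies.append(duplicate)
    for read in inputs.values():
        if d in read:
            read.discard(d)
            read.update(copies)         # assumption: readers read all copies
    return inputs, outputs


# p_d is written by t_a and t_b and read by a hypothetical activity t_c.
ins = {"t_a": set(), "t_b": set(), "t_c": {"p_d"}}
outs = {"t_a": {"p_d"}, "t_b": {"p_d"}, "t_c": set()}
new_ins, new_outs = correct_conflicting(ins, outs, "p_d")
```

Each function returns modified copies and leaves its arguments untouched, so the corrections can be applied one after another and the detection step re-run on the result to confirm that the corresponding error set is empty.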

V. CASE STUDY
In this section, a property loan approval business process case study, which is widely adopted in current literature such as [19], is used to validate the applicability of our proposed approach.
The whole business process can be described as follows: (1) the process starts when a property loan application request is received; (2) then the completeness of this application is verified; (3) if this application is not complete, the missing information is requested; (4) to determine the qualifications of an applicant, the financial service company verifies the applicant's employment status and checks the applicant's credit history as well as the applicant's liquid assets at the same time; (5) if the applicant is judged to be qualified, the current interest rate is locked for a certain period; (6) after this, the financial service company requests the appraisal information for the applicant; (7) then the loan application is evaluated. Based on the applicant's credit score, the loan amount and the appraised value of the property, the risk level associated with the loan is calculated; (8) if the risk is higher than the applicable threshold, the loan amount should be adjusted; (9) the financial service company then contacts the applicant to discuss the necessary adjustment and other options; (10) if the applicant agrees with everything, the application is forwarded to a loan officer for signature; and (11) if the loan amount is over $500,000, the general manager's signature is required. The activity information and the corresponding data elements are described in Tables 1 and 2 respectively.
The read/write relations between activities and data elements of the property loan approval business process are given in Table 3. There, ''w'' represents a write operation and ''r'' stands for a read operation. For example, activity $A_6$ needs to read data elements $D_1$ and $D_8$ before execution, and writes $D_9$ and $D_{10}$ when it finishes.
According to Definition 3, the control-flow net model of this property loan approval business case is shown in Fig. 10. According to Definition 4, the data-flow net model of this property loan approval business process is illustrated in Fig. 11. According to Definition 5, i.e., a WFIO-net model is the union of a control-flow net model and a data-flow net model obtained by merging transitions with the same labels, taking the control-flow and data-flow nets in Fig. 10 and Fig. 11 as inputs yields the WFIO-net in Fig. 12.
Moreover, the ADIM of the WFIO-net in Fig. 12 is constructed and shown in Fig. 13.
Taking the ADIM in Fig. 13 as input, based on Algorithm 1, we obtain MisSet = $\{D_{18}\}$, RedSet = $\{D_{13}, D_{20}, D_{21}\}$ and ConSet = $\{D_{10}\}$. Namely, (1) data element $D_{18}$ is missing because it is accessed by activity $A_{11}$ without initialization; (2) data elements $D_{13}$, $D_{20}$ and $D_{21}$ are redundant because they are initialized by activities $A_8$, $A_{12}$ and $A_{13}$ respectively, but are not used by any activity; and (3) data element $D_{10}$ is conflicting because it is initialized by different activities ($A_4$, $A_5$ and $A_6$), i.e., there exist different versions of $D_{10}$ in one process instance.
Based on the above analysis, we apply the proposed data-flow correction strategies as follows: (1) data element $D_{18}$ is missing because it is accessed by activity $A_{11}$ without being initialized. According to Correction Strategy 1, we add activity $A_{Manual}$ ($t_{AM}$) to initialize $D_{18}$ before it is used by $A_{11}$. To avoid changing the business logic, we add it sequentially before $A_{11}$. In this way, the missing data error is resolved; (2) data elements $D_{13}$, $D_{20}$ and $D_{21}$ are redundant because they are initialized by activities $A_8$, $A_{12}$ and $A_{13}$ respectively, but are not used by any activity in the process. On the basis of Correction Strategy 2, we remove $D_{13}$, $D_{20}$ and $D_{21}$ from the WFIO-net. In this way, the redundant data errors are removed; and (3) data element $D_{10}$ is conflicting as it is initialized by activities $A_4$, $A_5$ and $A_6$, i.e., there exist different versions of $D_{10}$ in one process instance. Using Correction Strategy 3, we duplicate the conflicting data element $D_{10}$ according to its different producers as $D_{10A}$, $D_{10B}$, and $D_{10C}$. Then, the conflicting data error is resolved. The corrected WFIO-net model of the property loan approval business process is shown in Fig. 14. There, the correction related to the missing data error is highlighted in light green, and the correction related to the conflicting data error is highlighted in red. The case study shows that our approach is applicable to a realistic business process.

VI. CONCLUSION
To formalize the systematic modeling, detection and correction of data-flow in business processes, we propose a Petri net-based approach in this paper. More precisely, we introduce a Petri net-based approach (WFIO-net) for formally modeling business processes, from both the control-flow and data-flow aspects. Moreover, we provide formal definitions for three basic data-flow errors in our approach, and also develop an effective algorithm to detect these errors. To resolve the detected data-flow errors, we finally propose a set of correction strategies.
To the best of our knowledge, this is the first work towards formally modeling, detecting and correcting data-flow anomalies in a business process. Our case study has shown that the proposed approach is practical and thus can be applied to real-life settings. Our future work mainly lies in the following two aspects: (1) we will investigate the possibility of a more detailed taxonomy of data-flow errors. For instance, the missing data error could be divided into several sub-classes, such as absolute missing and conditional missing; and (2) we will explore the opportunities for incorporating access control permission factors into our WFIO-net so as to achieve a more accurate verification. For example, a write operation on a data element will not be permitted while the element is being read or written by another activity.