Extracting Reusable Fragments From Data-Centric Process Variants

Data-centric process management supports knowledge workers in performing knowledge-intensive processes in a flexible way. An essential ingredient of data-centric process management is the process template, which is manually modified for a specific case to suit the context of that case. Modifying templates results in many different yet related process variants. However, manually modifying a template is time-consuming and may lead to errors. This article defines an approach to extract reusable fragments from data-centric process variants. The set of extracted fragments is minimal. By composing the fragments, not only the input variants but also many more process variants can be derived. We have implemented the approach in a prototype and evaluated it on several business processes. Using the fragment extraction approach, complex data-centric process variants can be designed more efficiently and their quality can improve, since well-known modifications are applied.


INTRODUCTION
MANY business processes in modern organizations rely on knowledge workers who have to make informed decisions about specific cases. The available data and knowledge drive decision making and processing in such knowledge-intensive processes (KiPs) [1]. Data-centric business process management approaches such as case management [2] support KiPs. A key notion in data-centric process management is that of a template: a representation of a "baseline" process. The template is modified to suit the needs of the particular case being processed. Common modifications are adding and deleting elements of the template [3]. Such modifications result in a variant of the template.
Creating new data-centric process variants is labor intensive, since the exact changes need to be specified explicitly. Moreover, the changes may have undesirable side effects, for instance a deadlock or a task that is done twice. However, if some process variants of good quality have already been defined, the modifications that were made to the template to define these variants can also be used to quickly create additional process variants of good quality.
To support this way of working, this paper develops an approach to reuse modifications that were used in defining existing data-centric process variants. The modifications are not explicitly defined by domain experts, but are extracted in an automated way from the template and the variants. By applying modifications that were already applied to the template for existing variants, the design time for additional data-centric process variants is reduced. Moreover, the quality of new process variants can improve, since well-proven modifications are applied to generate them.
More concretely, this paper defines an approach to extract reusable fragments from a data-centric process template and variants based on the template. Each fragment represents a cluster of work that is not in the template, but was included in an input variant. We view these extracted fragments as features that can be composed into new data-centric process variants by applying an existing feature composition approach [4]. The notions of feature and feature composition originate from the field of Software Product Line Engineering [5], where features are used to distinguish common and variable parts in software artifacts and where feature composition is used to generate different variants of a software product [6], [7].
As host language for modeling data-centric processes, we use Guard-Stage-Milestone (GSM) schemas, a technique to declaratively model life cycles of key business entities, called business artifacts [8], [9]. In previous work, we defined a feature-oriented composition operator for GSM schemas [4], which defines how a GSM fragment, viewed as feature, is applied to a base GSM schema (template). Using feature-oriented composition, extracted GSM fragments can be composed into variants in a declarative way. GSM schemas are one of the predecessors of the industry standard for data-centric processes, Case Management Model and Notation [10], [11], which has been endorsed by the Object Management Group. Thus, the results in this paper provide a basis for applying feature extraction to CMMN models.
The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 introduces GSM schemas, feature-oriented composition and the problem of extracting fragments from a set of variant GSM schemas, while Section 4 gives a formal background on GSM schemas. Section 5 explains how a base schema (template) can be constructed for a set of variant GSM schemas. Section 6 defines how fragments can be extracted from two GSM schemas, where one extends the other. Section 7 defines the method for extracting fragments for a set of variant GSM schemas. Section 8 evaluates the method on several business processes. Finally, Section 9 concludes the paper.

RELATED WORK
Model fragments have been used in a few other data-centric and declarative process modeling approaches [16], [21]. Fragment-based case management [16] is an approach for production case management in which procedural, activity-centric process fragments are composed at run-time by linking them, i.e., a case variant is executed in a distributed fashion by executing linked process fragments. The fragments are linked via shared data objects. Mukkamala et al. [21] define a commutative composition operator on instances of DCR graphs. Each DCR graph instance can be viewed as a process fragment being executed. Both approaches assume that the composed process fragments already exist, whereas the approach in this paper focuses on extracting fragments from existing process variants such that the fragments can be composed.
A few data-centric and declarative process modeling approaches support management of process variants [22], [23], [24], [25]. Cognini et al. [23] describe an approach for modeling process fragments of knowledge-intensive processes, but do not address extraction of fragments. Other declarative modeling approaches support the derivation of process variants by different configurations of a single process model [24], [25] or by applying editing operations to a template model [3], [22], but these are not based on process fragments.
Other related work studies the general problem of designing data-centric process models, either by defining a methodology for specifying data-centric process models [26] or by defining an automated synthesis of data-centric process models [27], [28], [29], [30], but these works do not consider fragments that are composed. Hewelt et al. [31] evaluate three manual approaches to define fragments for case management, but none of these approaches analyzes existing variants. In combination with the feature-composition approach defined earlier [4], the fragment extraction approach in this paper enables an alternative design method for data-centric process models that facilitates reuse of process fragments in new process variants.
Variability has been well studied for procedural, activity-centric business process models, which differ considerably from data-centric and declarative process models. Recent surveys [32], [33], [34] describe the state of the art in dealing with variability in procedural modeling languages [33], in different phases of the business process life cycle [32], and in discovering processes from event logs [34]. For activity-centric process models, approaches exist to extract shared fragments in process model repositories [35], [36], [37], to detect differences between process models [38] and to discover configurable process fragments from process models [39]. All these approaches consider procedural, graph-like process models, which differ considerably from declarative and data-centric process models like GSM schemas.
In software engineering, feature extraction has been studied for software artifacts, e.g., [40], [41]. However, those features need to be manually identified, whereas in our method features are derived automatically. Studying how domain knowledge from experts can improve the quality of the generated features is an interesting direction for further work.
In sum, the main contribution of this paper is an approach to extract reusable fragments from data-centric, declarative process variants, such that additional process variants can be composed from the fragments in an efficient way.

OVERVIEW
We briefly introduce, by example, GSM schemas, feature-based composition of GSM fragments, and the problem of extracting reusable fragments from a set of GSM schema variants. In the next sections, formal definitions are provided.
GSM Schemas. A Guard-Stage-Milestone (GSM) schema defines the life cycle of a business artifact [8], [9]. A business artifact is a key business entity that is changed during a business process, for instance an order or a claim request [12]. Key modeling constructs in GSM schemas are stages and milestones. A stage represents a cluster of business activity performed for the artifact. Stages are organized into a hierarchy. Each atomic stage contains exactly one task, an atomic piece of work. A milestone represents a business objective, usually achieved by completing an attached stage. Stages and milestones change status if certain conditions, called sentries, are met. There are two kinds of sentries. Plus sentries ensure that a stage is opened or a milestone achieved, while minus sentries ensure that a stage is closed or a milestone invalidated. Guards are plus sentries of stages.
Example 1. Fig. 1 shows sample GSM schemas. Rounded rectangles denote stages, circles denote milestones, and each diamond denotes the presence of a guard. Other sentries are not visualized. In the original process (a), business criteria for a partner contract are assessed: first data about the partner is gathered and prechecked, and next a detailed check is performed to decide whether the criteria should be changed or not and a report is written. If new information arrives before the business criteria have been assessed, the data is gathered anew and the business criteria check is restarted, if applicable. Sentries for this process are listed in Tables 1 and 2. For instance, milestone PCS is achieved if plus sentry BPECS is true, that is, if milestone BPECS has been achieved.
GSM Variants and Feature-Oriented Composition. A GSM schema can be modified into another GSM schema. In this way, a set of variant GSM schemas can be derived from a GSM schema. To derive a variant, several changes can be applied to a GSM schema [3]. For instance, Figs. 1b and 1c show three variants that are modifications of the GSM schema BCA in Fig. 1a. They have been derived by inserting stages and milestones for additional checks on the credit level and the addressable market. Sentries also need to be modified for these variants: Tables 3 and 4 show the modified sentries for the stages and milestones of the different variants in Fig. 1.
A GSM fragment can be used to specify a set of related changes. For instance, a fragment containing stage Credit Check and milestones CCS and CCU specifies the changes between BCA (Fig. 1a) and $\mathit{BCA}_1$ (Fig. 1b). To enable the reuse of shared parts between GSM variants, we proposed in earlier work [4] to view GSM fragments as features and to use feature composition, well-known from the field of Software Product Line Engineering [6], [7], to compose GSM fragments into different variants; we defined a declarative, binary operator '$\bullet$' for composing GSM schemas [4], where the left-hand side GSM schema is composed with, or applied to, the right-hand side GSM schema.
Extracting Fragments. Although the feature-oriented composition approach [4] allows different variants to be generated from a common base GSM schema (template) and a set of fragments, it assumes that the fragments already exist. But defining fragments manually can be time-consuming and costly, since it relies on knowledge from domain experts. In this paper, we define an approach to extract fragments from a set of variant GSM schemas derived from the same base GSM schema. The approach can be automated and does not rely on domain knowledge to define fragments, because that knowledge is already codified in the different variants. Moreover, extracting fragments enables the reuse of modifications among different GSM variants, each fragment representing one set of related modifications.
In essence, the approach decomposes a set of GSM schema variants into reusable fragments. Fig. 2 shows the fragments extracted for the schema BCA and its three variants $\mathit{BCA}_1$, $\mathit{BCA}_2$ and $\mathit{BCA}_3$. For each variant that refines (extends) another variant, denoted with an arrow, one or more fragments are constructed. For instance, $\mathit{BCA}_1$ refines BCA and feature $f_2$ is constructed. By composing $f_2$ with BCA, variant $\mathit{BCA}_1$ is constructed.
The original schema BCA is not a base schema, because $\mathit{BCA}_3$ does not refine BCA. Therefore, a base schema is constructed by taking the intersection of BCA and $\mathit{BCA}_3$. This schema $\mathit{BCA}_{in}$ is refined by all other schemas, including BCA. Compared to BCA, the only change in $\mathit{BCA}_{in}$ is that the plus sentry (guard) of Write Report is false.
Some fragments overlap with each other or are combinations of other fragments. For instance, the fragment $f_3$ that reflects the change between $\mathit{BCA}_2$ and BCA is actually equivalent to the composition of fragments $f_2$ and $f_4$, so $f_3 = f_2 \bullet f_4$. The approach detects overlaps in fragments and refactors them, which means that feature $f_3$ is not in the output fragment set, since it can be composed from $f_2$ and $f_4$. Four fragments $f_1, f_2, f_4, f_5$ are generated. Two fragments, $f_1$ and $f_5$, are conflicting, since both define a new sentry for Write Report. The other pairs of fragments can be arbitrarily combined, meaning that their order in a composition chain does not influence the outcome. In total, 11 distinct variants can be constructed from these four fragments. This means that not only the four original GSM schemas can be constructed by applying combinations of the four fragments to $\mathit{BCA}_{in}$, but also 7 additional variants, for instance a variant in which fragments $f_2$, $f_4$ and $f_5$ have been composed.
In sum, the approach takes a set of GSM schema variants as input. If there is no base schema, it first finds or constructs a base schema such that each variant refines the base schema (Section 5). Next, it extracts a minimal, complete set of reusable fragments (Sections 6 and 7). Each input variant can be constructed from the base schema by composing one or more of the extracted fragments, but many more variants can be constructed from the extracted fragments.
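The count of 11 variants can be checked with a short enumeration. The sketch below is only an illustration: it assumes that every non-empty, conflict-free selection of fragments yields a distinct variant and that composition order is irrelevant for non-conflicting fragments, as stated above. The function name `valid_selections` is hypothetical.

```python
from itertools import combinations

# Fragment names and the conflicting pair are taken from the running example.
fragments = ["f1", "f2", "f4", "f5"]
conflicts = [{"f1", "f5"}]  # both define a new sentry for Write Report


def valid_selections(fragments, conflicts):
    """Return all non-empty fragment selections without a conflicting pair."""
    selections = []
    for size in range(1, len(fragments) + 1):
        for combo in combinations(fragments, size):
            chosen = set(combo)
            if not any(pair <= chosen for pair in conflicts):
                selections.append(frozenset(chosen))
    return selections


selections = valid_selections(fragments, conflicts)
print(len(selections))  # 11
```

Of the 15 non-empty subsets of four fragments, the 4 subsets that contain both `f1` and `f5` are excluded, which leaves 11.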

GSM SCHEMAS
In this section, we formally define GSM schemas [8]. A GSM schema of a business artifact consists of data attributes and status attributes. Data attributes model the information state of the business artifact. A status attribute is a Boolean variable that denotes the status of a stage or milestone. For a status attribute of a stage, value true denotes that the stage is open, value false that the stage is closed. For a status attribute of a milestone, value true denotes that the milestone is achieved, value false that the milestone is invalid.
Event-Condition-Action rules define for which event under which condition a status attribute changes value (action). The event-condition part of a rule is called a sentry. The event of a sentry is optional. We distinguish between external and internal events. An external event signifies a change in the environment. It is either a stage completion event $C{:}S$, where $S$ is an atomic stage, defined below, or a named external event $E{:}n$, where $n$ is an event name. An internal event signifies a change in value of a status attribute $a$: internal event $+a$ denotes that $a$ becomes true, $-a$ that $a$ becomes false. For instance, +Preliminary Check in Table 1 is an internal event that signifies that stage Preliminary Check gets opened. The condition of a sentry is a Boolean expression that can refer to data attributes or status attributes. The action of each rule is that a status attribute becomes true or false, which leads to two types of sentries. A plus sentry defines when a stage becomes open or a milestone gets achieved. A minus sentry defines when a stage is closed or a milestone becomes invalid.
Stages and milestones can be nested inside other stages. A milestone cannot contain any other milestone or stage. We require that the nesting relation induces a forest, i.e., the nesting relation is acyclic and if a stage or milestone is nested in two other stages $S_1$, $S_2$, then either $S_1$ is nested in $S_2$ or $S_2$ in $S_1$. The most nested stages, which are called atomic, launch tasks. To ease the presentation, we assume that atomic stages launch tasks having the same label, so for instance stage Detailed Check launches a task with the same name.
Definition 1 (GSM schema). A GSM schema is a tuple $G = (A, E, \prec, R^+, R^-)$, where:
$A = D \cup S \cup M$ is a collection of three disjoint finite sets of attributes, called data, stage, and milestone attributes, respectively. We write $A_s$ to refer to the set of status attributes $S \cup M$.
$E = E_{ext} \cup E_{cmp}$ is a collection of two disjoint, finite sets of events, called named external and stage completion events, respectively.
$\prec \; \subseteq (S \cup M) \times S$ is a strict partial order on stages and milestones, where $a_1 \prec a_2$ means that $a_1$ is a descendant of $a_2$. Relation $\prec$ induces a forest, i.e., if $a_1 \prec a_2$ and $a_1 \prec a_3$, then $a_2 \prec a_3$ or $a_3 \prec a_2$. A stage that has no descendants is called atomic.
$R^+$, $R^-$ are functions assigning to each status attribute in $A_s$ non-empty sets of sentries (see Definition 2). For each $a \in A_s$, $R^+(a)$ is the set of plus sentries that define the conditions when to open stage $a \in S$ or achieve milestone $a \in M$, while $R^-(a)$ is the set of minus sentries that define the conditions when to close stage $a \in S$ or invalidate milestone $a \in M$.
Relation $\prec$ is visualized using nesting. For instance, BPECS $\prec$ Preliminary Check in Fig. 1a. The top level stage is not shown.
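The forest condition on the nesting relation can be checked mechanically. The following is a minimal sketch, assuming the relation is given as a transitively closed set of (descendant, ancestor) pairs; the function name `is_forest` and the stage name `Top` (standing in for the top level stage that is not shown) are hypothetical.

```python
def is_forest(order):
    """Check the forest condition on a transitively closed nesting relation.

    `order` is a set of (descendant, ancestor) pairs.
    """
    ancestors = {}
    for descendant, ancestor in order:
        ancestors.setdefault(descendant, set()).add(ancestor)
    for ancs in ancestors.values():
        for a2 in ancs:
            for a3 in ancs:
                # Any two ancestors of the same element must be comparable.
                comparable = a2 == a3 or (a2, a3) in order or (a3, a2) in order
                if not comparable:
                    return False
    return True


nesting = {
    ("BPECS", "Preliminary Check"),
    ("BPECS", "Top"),
    ("Preliminary Check", "Top"),
}
print(is_forest(nesting))  # True
```

A relation in which a milestone is nested in two unrelated stages, for example `{("m", "S1"), ("m", "S2")}`, fails the check.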
Each sentry $\varphi$ in set $R^+(a)$, where $a \in S \cup M$, maps into an Event-Condition-Action rule "$\varphi$ then $+a$", where sentry $\varphi$ is the Event-Condition part and action $+a$ denotes for $a \in S$ that stage $a$ gets opened and for $a \in M$ that milestone $a$ gets achieved. Each sentry $\varphi$ in set $R^-(a)$ maps into a rule "$\varphi$ then $-a$", where action $-a$ denotes for $a \in S$ that stage $a$ gets closed and for $a \in M$ that milestone $a$ becomes invalid. Each sentry in set $R^+(a)$ or $R^-(a)$ is sufficient for triggering a status change in the stage or milestone $a$.
For defining sentries, we assume a condition language $C$ that includes predicates over integers and Boolean connectives. The condition formulas may refer to stage, milestone and data attributes from the universe of attributes $U$. Keyword orig implicitly refers to the original sentry defined in another GSM schema for the same status attribute [4], [42]; below we introduce a feature composition operator '$\bullet$' that replaces occurrences of orig with the sentry definition it refers to.
Definition 2 (Sentry). A sentry has the form $t \wedge g$, where $t$ is the event part and $g$ the condition part. The event part $t$ is either empty (trivially true), a named external event $E{:}n \in E_{ext}$, where $n$ is an event name, a stage completion event $C{:}S \in E_{cmp}$, where $S$ is an atomic stage, or an internal event $+a$ or $-a$, where $a$ is a stage or milestone attribute in $A_s$. The condition $g$ is a Boolean formula in CNF in the condition language $C$ that refers to $A \cup \{orig\}$, so $g$ can use data attributes in $D$, status attributes in $S \cup M$ and the keyword orig. The condition part can be omitted if it is equivalent to true.
The condition part of a sentry is a Boolean formula in conjunctive normal form (CNF) in order to ease the presentation. However, this is not a severe restriction: a condition part not in CNF can be rewritten using Boolean laws into an equivalent set (disjunction) of sentries. For instance, formula $E{:}n \wedge (a > 10 \vee b < 5)$ is equivalent to the set of sentries $\{ E{:}n \wedge a > 10, \; E{:}n \wedge b < 5 \}$.
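The rewriting in this example can be sketched in a few lines. The code below assumes, as the example suggests, that the target sentries are plain conjunctions of atomic conditions; the representation (an event string plus a list of OR-clauses) and the function name `split_sentry` are hypothetical choices for illustration only.

```python
from itertools import product


def split_sentry(event, clauses):
    """Distribute disjunctions so that each resulting sentry is a conjunction.

    `clauses` is a list of clauses joined by AND; each clause is a list of
    atomic conditions joined by OR. The result is a set of
    (event, conjuncts) pairs, one per choice of one atom from every clause.
    """
    return {(event, frozenset(choice)) for choice in product(*clauses)}


sentries = split_sentry("E:n", [["a > 10", "b < 5"]])
for event, conjuncts in sorted(sentries, key=lambda s: sorted(s[1])):
    print(event, sorted(conjuncts))
```

Each resulting sentry is sufficient on its own, which matches the reading of a sentry set as a disjunction.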
A GSM schema can have a sentry that contains orig. Such a GSM schema is not self-contained, since orig is not interpretable by a process enactment engine. Instead, such a schema needs to be composed with another GSM schema in order to replace orig. In earlier work we proposed a declarative feature composition operator '$\bullet$' for GSM schemas [4], [42], similar to feature-based composition introduced for software product lines [7]. Given a GSM schema $G_1$ and a GSM schema $G_2$, applying $G_1$ to $G_2$, also known as composing $G_1$ with $G_2$, is expressed as $G_1 \bullet G_2$. The resulting GSM schema $G = G_1 \bullet G_2$ is constructed by merging the GSM schemas $G_1$ and $G_2$ by taking the pairwise union of the different components of the GSM schema tuples, except for sentries (rules). For shared stages and milestones, the definitions of sentries in $G_1$ override those in $G_2$. If keyword orig is used in a sentry of a shared stage or milestone in $G_1$, in the evaluation of operator '$\bullet$' each occurrence of orig is replaced with the sentries of that shared stage or milestone in $G_2$. For instance, if $G_1$ and $G_2$ share milestone $m$ and data attribute $x$, the sentry of $m$ in $G_2$ is $x > 10$ and in $G_1$ is $orig \wedge x < 100$, then the sentry of $m$ in $G_1 \bullet G_2$ is $x > 10 \wedge x < 100$. This way, operator '$\bullet$' allows for both coarse-grained composition, by inserting new stages and milestones and sentries, and fine-grained composition, by modifying sentries using orig.
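The replacement of orig during composition can be illustrated with a simplified model. This is a sketch under stated assumptions, not the operator as defined in [4]: a schema is reduced to a mapping from status attributes to one set of sentries (plus and minus sentries, events and hierarchy are not distinguished), a sentry is a set of conjuncts, and when the original schema has several sentries, orig is assumed to stand for each of them in turn. The name `compose` is hypothetical.

```python
ORIG = "orig"


def compose(g1, g2):
    """Apply g1 to g2 in a simplified schema model.

    Schemas map each status attribute to a set of sentries; a sentry is a
    frozenset of conjuncts (strings) that may contain ORIG.
    """
    result = {}
    for attr in g1.keys() | g2.keys():
        if attr not in g1:
            result[attr] = set(g2[attr])  # only in g2: keep as is
            continue
        new_sentries = set()
        for sentry in g1[attr]:
            if ORIG in sentry:
                rest = sentry - {ORIG}
                # Assumption: orig stands for each original sentry in turn.
                for original in g2.get(attr, set()):
                    new_sentries.add(rest | original)
            else:
                new_sentries.add(sentry)  # g1 overrides g2
        result[attr] = new_sentries
    return result


g2 = {"m": {frozenset({"x > 10"})}}
g1 = {"m": {frozenset({ORIG, "x < 100"})}}
composed = compose(g1, g2)
print(sorted(next(iter(composed["m"]))))  # ['x < 100', 'x > 10']
```

The example reproduces the milestone `m` from the text: the composed sentry is the conjunction of `x > 10` and `x < 100`.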
We distinguish between complete and fragment GSM schemas, based on orig.
Definition 3 (Complete, Fragment). A GSM schema $G$ defines status attribute $a$ if $a \in A_s$ and either no plus sentry of $a$ or no minus sentry of $a$ contains orig. A GSM schema $G$ refines status attribute $a$ if $a \in A_s$ and at least one plus and one minus sentry of $a$ contains orig. A GSM schema $G$ is complete if $G$ defines each status attribute $a \in A_s$ and a fragment if $G$ refines one or more status attributes $a \in A_s$.
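The distinction between defining and refining a status attribute can be expressed directly as predicates. The sketch below assumes a simplified schema model in which each status attribute maps to a pair (plus sentries, minus sentries) and each sentry is a set of conjuncts; all function names are hypothetical.

```python
ORIG = "orig"


def uses_orig(sentries):
    """True if at least one sentry in the set contains the orig keyword."""
    return any(ORIG in sentry for sentry in sentries)


def refines(schema, attr):
    plus, minus = schema[attr]
    return uses_orig(plus) and uses_orig(minus)


def defines(schema, attr):
    # No plus sentry or no minus sentry contains orig.
    plus, minus = schema[attr]
    return not uses_orig(plus) or not uses_orig(minus)


def is_complete(schema):
    return all(defines(schema, attr) for attr in schema)


def is_fragment(schema):
    return any(refines(schema, attr) for attr in schema)


fragment = {
    "DCS": ({frozenset({ORIG})}, {frozenset({ORIG})}),
    "AS": ({frozenset({"p"})}, {frozenset({"q"})}),  # placeholder sentries
}
print(is_fragment(fragment), is_complete(fragment))  # True False
```

In this model, `defines` and `refines` are complementary for every status attribute, so a schema is either complete or a fragment.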
To compare GSM schemas, we next define a refinement relation '$\sqsubseteq$' that plays a pivotal role in these comparisons.
Relation '$\sqsubseteq$' is defined both on complete and fragment GSM schemas. We exploit this in Section 7 to refactor fragments.
If $G_1 \sqsubseteq G_2$, so $G_2$ refines $G_1$, then $G_2$ extends $G_1$ with additional status attributes or sentries, or $G_2$ modifies one or more sentries of $G_1$ by adding additional conjuncts.
The refinement relation '$\sqsubseteq$' defines an order on a set of GSM schemas. We next define base and variant schemas.
Example 2. For the GSM schemas in Fig. 1, $\mathit{BCA}_3$ does not refine BCA (cf. Tables 2 and 4). Consequently, the GSM schemas in Fig. 1 do not have a base schema; in the next section, we explain how a base schema can be derived in that case.
In the next sections we show how fragments are extracted from a base schema and a set of variant schemas. First, we explain how a base schema can be constructed.

CONSTRUCTING BASE SCHEMAS
A set of complete GSM schemas may have no base schema. In that case, a base schema can be constructed, as we explain in this section.
The base schema is constructed by taking the intersection of the complete GSM schemas. We first define a general, binary intersection operator '$\cap$' on GSM schemas. The sets of data attributes, stages and milestones of an intersection GSM schema can be easily constructed by taking the intersection of the data attributes, stages and milestones of the two input schemas. The most involved step is constructing the sentries of the intersection. The aim is to construct sentries for the intersection such that, by using fragments that use the orig keyword in their sentries, the sentries of the two schemas can be derived. How these fragments are extracted is defined in the next section.
To properly define sentries of an intersection, we need to define the largest common subformula of two input sentries. For example, if $b_1, b_2, b_3$ are Boolean variables, then the largest common subformula of sentries $b_1 \wedge b_2$ and $b_1 \wedge b_2 \wedge b_3$ is $b_1 \wedge b_2$, whereas the largest common subformula of sentries $b_1 \wedge b_2$ and $b_3$ is false. For sentries $\varphi_1, \varphi_2$, let $\mathit{lcs}(\varphi_1, \varphi_2)$ denote the largest common subformula of $\varphi_1$ and $\varphi_2$.
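For sentries that are plain conjunctions, the largest common subformula can be sketched as the set of shared conjuncts. This is an illustrative simplification: sentries are modeled as sets of conjunct strings, event parts are not treated separately, and `None` is used as a hypothetical stand-in for false.

```python
def lcs(sentry1, sentry2):
    """Largest common subformula of two conjunctive sentries.

    Sentries are sets of conjuncts; None represents the formula false.
    """
    s1, s2 = frozenset(sentry1), frozenset(sentry2)
    if s1 == s2:
        return s1  # identical sentries, including the trivially true one
    common = s1 & s2
    return common if common else None


print(sorted(lcs({"b1", "b2"}, {"b1", "b2", "b3"})))  # ['b1', 'b2']
print(lcs({"b1", "b2"}, {"b3"}))                      # None
```

The two calls mirror the two cases of the example above.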
We also need to select the maximal largest common subformulas, rather than all largest common subformulas. To motivate this, consider the example in Fig. 3. The intersection of the GSM schemas in Figs. 3a and 3b equals (a), since (b) contains (a). However, if all largest common subformulas are used, the schema in Fig. 3c results: milestone m4 has a sentry $m1 = \mathit{lcs}(m1 \wedge m2, \; m1 \wedge m3)$. However, this sentry is already subsumed by sentry $m1 \wedge m2$.
Function $\mathit{max\_els}$ returns the maximal elements of a set $F$ of formulas, i.e., the formulas in $F$ that are not strictly contained by other formulas in $F$: $\mathit{max\_els}(F) = \{ \varphi \in F \mid \nexists\, c \in F \text{ such that } c \text{ strictly contains } \varphi \}$.
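With conjunctive sentries modeled as sets of conjuncts, strict containment becomes the proper-subset test, and the selection of maximal elements is a one-line filter. The representation and the name `max_els` as a Python function are assumptions made for this sketch.

```python
def max_els(formulas):
    """Keep the formulas that are not strictly contained in another one.

    Each formula is a frozenset of conjuncts; `f < g` is the proper-subset
    test, read here as "g strictly contains f".
    """
    formulas = set(formulas)
    return {f for f in formulas if not any(f < g for g in formulas)}


candidates = {frozenset({"m1", "m2"}), frozenset({"m1"})}
print([sorted(f) for f in max_els(candidates)])  # [['m1', 'm2']]
```

This reproduces the Fig. 3 discussion: the sentry `m1` is dropped because `m1` and `m2` together already contain it.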
The intersection operator defined next uses $\mathit{lcs}$ and $\mathit{max\_els}$ to properly define the sentries of the intersection of two complete GSM schemas.
Definition 6 (Intersection of GSM schemas). Let $G_1, G_2$ be two complete GSM schemas. Then
The intersection $\mathit{BCA}_{in}$ of BCA and $\mathit{BCA}_3$ contains the status attributes of BCA, and has the same plus and minus sentries as BCA with one exception: the guard of Write Report in $\mathit{BCA}_{in}$ is false.
In the online appendix, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TSC.2022.3190987, we have stated conditions under which the intersection results in a valid GSM schema. The next lemma states that a base schema constructed from two variant GSM schemas is contained by both the variants.
Lemma 1. Let $G_1, G_2$ be two GSM schemas such that $G =$
Having defined the intersection of two GSM schemas, we now define an algorithm COMPUTEINTERSECTIONS that, for a given set $V$ of complete GSM schemas, computes a unique base schema that is refined by all schemas in $V$ (Fig. 4). The intersection operator plays a pivotal role in the algorithm. In order to extract fine-grained fragments, the algorithm also returns the intermediate intersections that are computed. The algorithm first computes set $B$ of local base schemas in $V$ (l. 2), i.e., schemas that may be refined by other schemas but do not refine other schemas themselves. Note that relation $\sqsubseteq$ is acyclic, so set $B$ is not empty. While there is more than one schema in $B$ (l. 4), the intersection of each pair of local base schemas in $B$ is computed and added to the set of computed intersections $I$ (l. 5-7), and next set $B$ of local base schemas is computed again from $I$ (l. 8). Finally, if $B$ contains a single schema, the set $I$ of computed intersections is returned (l. 10).
It follows easily from Lemma 1 that the algorithm terminates. It always returns a single base schema, so the algorithm is also correct. Note that if $V$ already has a base schema, no intersections are computed.
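The loop structure of COMPUTEINTERSECTIONS can be sketched as follows. This is not the algorithm of Fig. 4 itself but a simplified illustration: schemas are abstracted to sets of opaque elements, refinement to set inclusion and intersection to set intersection, and local base schemas are assumed to be the schemas that refine no other schema in the set (the reading under which BCA and $\mathit{BCA}_3$ are the local base schemas of the running example). The element names and function names are hypothetical.

```python
from itertools import combinations


def local_bases(schemas, refines):
    """Schemas that do not refine any other schema in the given set."""
    return {g for g in schemas
            if not any(h != g and refines(g, h) for h in schemas)}


def compute_intersections(variants, refines, intersect):
    """Return the set of intersections computed until one base remains."""
    bases = local_bases(variants, refines)
    intersections = set()
    while len(bases) > 1:
        for g1, g2 in combinations(bases, 2):
            intersections.add(intersect(g1, g2))
        bases = local_bases(intersections, refines)
    return intersections


# Abstract stand-ins for the running example (element names are invented).
bca = frozenset({"core", "report_guard_a"})
bca1 = bca | {"credit"}
bca2 = bca1 | {"market"}
bca3 = frozenset({"core", "report_guard_b", "credit", "board"})

result = compute_intersections(
    {bca, bca1, bca2, bca3},
    refines=lambda g, h: h <= g,       # g refines h if g contains h
    intersect=lambda g, h: g & h,
)
print([sorted(g) for g in result])  # [['core']]
```

When the input set already has a base schema, the loop body is never entered and the returned set is empty, in line with the remark above.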

EXTRACTING FRAGMENTS
In this section, we explain how fragments can be extracted from a pair of variant schemas. Let $G_1$, $G_2$ be GSM schemas such that $G_2$ refines $G_1$, i.e., $G_2$ extends $G_1$ with new status attributes or modifies for some shared status attributes the sentries or hierarchy relation. GSM schemas $G_1$ and $G_2$ can be complete or fragments. We next characterize fragments of $G_2$ that specify changes made to $G_1$, so fragments containing parts of $G_2$ that are new or that are different from $G_1$. By composing the extracted fragments with $G_1$, variant $G_2$ is obtained.
To properly identify fragments of $G_2$, we first define the general change set of $G_1$ and $G_2$. The change set contains a status attribute $a$ of $G_2$ if: $G_2$ introduces $a$, so $a$ is not in $G_1$; or $G_2$ extends the sentries of $a$ by either adding sentries or by adding extra conjuncts to sentries of $a$ in $G_1$; or $G_2$ extends the set of direct descendants, called children, of $a$ with additional stages and milestones. For instance, in comparison with BCA in Fig. 1a, $\mathit{BCA}_1$ in Fig. 1b introduces stage Credit Check and milestones CCS and CCU, extends the sentries of milestones PCS and PCU by extending the sentry of PCS with an additional conjunct and adding a sentry to PCU, and extends the set of children of stage Preliminary Check by introducing the new stage and milestones.
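The three conditions for membership in the change set can be sketched as a direct comparison of two schemas. The model below is an assumption made for illustration: each status attribute maps to its set of sentries and its set of children. The sentry strings and the nesting used in the example data are placeholders, since the actual sentries are listed in tables that are not reproduced here.

```python
def change_set(g1, g2):
    """Status attributes of g2 that are new or changed compared to g1."""
    changed = set()
    for attr, info in g2.items():
        if attr not in g1:
            changed.add(attr)  # introduced by g2
        elif info["sentries"] != g1[attr]["sentries"]:
            changed.add(attr)  # sentries added or extended
        elif info["children"] - g1[attr]["children"]:
            changed.add(attr)  # additional children
    return changed


# Placeholder data loosely following the BCA / BCA_1 example.
bca = {
    "Preliminary Check": {"sentries": {"g0"}, "children": {"PCS", "PCU"}},
    "PCS": {"sentries": {"BPECS"}, "children": set()},
    "PCU": {"sentries": {"u0"}, "children": set()},
    "Detailed Check": {"sentries": {"PCS"}, "children": set()},
}
bca1 = {
    "Preliminary Check": {
        "sentries": {"g0"},
        "children": {"PCS", "PCU", "Credit Check", "CCS", "CCU"},
    },
    "PCS": {"sentries": {"BPECS and CCS"}, "children": set()},
    "PCU": {"sentries": {"u0", "CCU"}, "children": set()},
    "Detailed Check": {"sentries": {"PCS"}, "children": set()},
    "Credit Check": {"sentries": {"c0"}, "children": set()},
    "CCS": {"sentries": {"c1"}, "children": set()},
    "CCU": {"sentries": {"c2"}, "children": set()},
}
delta = change_set(bca, bca1)
print(sorted(delta))
```

Detailed Check is unchanged in this data and therefore stays outside the change set.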
In the definition of change sets, we use a function $r : S \to \mathcal{P}(S \cup M)$ that assigns to each stage the set of children of that stage. Given a stage $S$ of GSM schema $G$, $r(S) = \{ x \mid x \in S \cup M \wedge x \prec S \wedge \nexists\, y \in S : x \prec y \prec S \}$, so $r(S)$ contains the direct descendants of $S$.
Definition 7 (Change set). Let $G_1, G_2$ be two GSM schemas such that $G_1 \sqsubseteq G_2$. Then change set $\Delta(G_1, G_2)$ is the set of status attributes in $G_2$ that are new or that have changed compared to $G_1$ with respect to sentries or hierarchy:
Each fragment should also be coherent, i.e., there should not be unconnected parts. This means that status attributes that are changed in $G_2$ should only be in the same fragment if they are connected via sentries or hierarchy. If some status attributes are unconnected, they are expressed as separate fragments. For instance, Credit Check and Analyze in Management Board in $\mathit{BCA}_3$ are both changes compared to $\mathit{BCA}_{in}$, but they are not connected since DCS is not in the change set of $\mathit{BCA}_{in}$ and $\mathit{BCA}_3$. Therefore, for these two changes two fragments should be created.
To define coherent fragments from a change set, we introduce the notion of context: a set of changed status attributes that are connected. First, we introduce a few auxiliary definitions for a GSM schema $G$ with status attributes $a_1, a_2 \in A_s$. Then $a_1$ touches $a_2$ via a sentry, denoted $\mathit{touchesSentry}(a_1, a_2)$, if and only if there is a sentry $\varphi$ such that either $\varphi \in R^+(a_1) \cup R^-(a_1)$ and $a_2$ is referenced by $\varphi$, or $\varphi \in R^+(a_2) \cup R^-(a_2)$ and $a_1$ is referenced by $\varphi$. And $a_1$ touches $a_2$ via hierarchy, denoted $\mathit{touchesHierarchy}(a_1, a_2)$, if and only if $a_1$ is a stage that is a child of $a_2$ or $a_2$ is a stage that is a child of $a_1$. Finally, for $A_1 \subseteq A_s$, let $\mathit{touches}(A_1) = \{ a_2 \in A_s \mid \exists a_1 \in A_1 : a_1 = a_2 \vee \mathit{touchesSentry}(a_1, a_2) \vee \mathit{touchesHierarchy}(a_1, a_2) \}$.
Next, a context is defined as a set of status attributes in the change set $\Delta(G_1, G_2)$ such that each attribute in the set (in)directly touches all other attributes in the set.
Definition 8 (Context). Let $G_1, G_2$ be two GSM schemas such that $G_1 \sqsubseteq G_2$ and let $A \subseteq A_s^2$. Then $\mathit{border}(A) = \mathit{touches}(A) \cap \Delta(G_1, G_2)$. Set $A$ is a context of $G_1, G_2$ if $A$ is a minimal, non-empty set such that $\mathit{border}(A) = A$. In general, if $C_1, C_2 \subseteq \Delta(G_1, G_2)$ and both $C_1$ and $C_2$ are contexts, then either $C_1 = C_2$ or $C_1 \cap C_2 = \emptyset$. In other words, the context sets partition the change set $\Delta(G_1, G_2)$.
A context can be easily computed by initializing $A$ with an element $a$ from $\Delta(G_1, G_2)$ and iteratively adding set $\mathit{border}(A)$ to $A$ until $\mathit{border}(A) = A$. Note that $\mathit{border}(\Delta(G_1, G_2)) = \Delta(G_1, G_2)$, so this procedure terminates.
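The fixpoint procedure for computing a context can be sketched as below. The sketch assumes that the touches relation (via sentry or hierarchy) has already been flattened into a set of unordered pairs; the particular pairs in the example are hypothetical, chosen so that both changes touch DCS, which is itself outside the change set.

```python
def touches(attrs, links):
    """Attributes in `attrs` plus everything linked to one of them."""
    result = set(attrs)
    for link in links:
        if link & result.intersection(attrs):
            result |= link
    return result


def context_of(start, delta, links):
    """Grow a context from `start` until border(A) = A."""
    context = {start}
    while True:
        border = touches(context, links) & delta
        if border == context:
            return frozenset(context)
        context = border


def contexts(delta, links):
    return {context_of(a, delta, links) for a in delta}


board = "Analyze in Management Board"
delta = {"Credit Check", "CCS", "CCU", board, "AS", "AU", "Write Report"}
links = {
    frozenset({"Credit Check", "CCS"}),
    frozenset({"Credit Check", "CCU"}),
    frozenset({"CCS", "DCS"}),        # hypothetical link through DCS
    frozenset({board, "DCS"}),
    frozenset({board, "AS"}),
    frozenset({board, "AU"}),
    frozenset({"Write Report", "AS"}),
}
for ctx in sorted(contexts(delta, links), key=len):
    print(sorted(ctx))
```

Because DCS is not in the change set, it never enters a border, so the two groups of changes end up in separate contexts.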
We next define how a fragment is extracted from a context $C$. We first introduce a few auxiliary definitions. Given a sentry $\varphi$, let $\alpha(\varphi)$ denote the set of data and status attributes that are referenced in $\varphi$. For a status attribute $a \in A_s$, let $\mathit{refs}(a)$ denote the set of data and status attributes that are referenced in the plus and minus sentries of $a$, so $\mathit{refs}(a) = \bigcup_{\varphi \in R^+(a) \cup R^-(a)} \alpha(\varphi)$. For a set $A \subseteq A_s$ of status attributes, we define $\mathit{refs}(A) = \bigcup_{a \in A} \mathit{refs}(a)$.
Next, if a status attribute $a \in C$ is shared by $G_1$ and $G_2$, the sentries of $a$ in the fragment need to be defined such that composing the fragment with $G_1$ yields the sentries of $G_2$ for $a$. Since $G_1 \sqsubseteq G_2$, for each plus (minus) sentry $\varphi$ of $a$ in $G_1$, either (1) $\varphi$ is also a plus (minus) sentry for $a$ in $G_2$, or (2) there is a plus (minus) sentry $\varphi \wedge c$ for $a$ in $G_2$. For (1), the corresponding plus (minus) sentry for $a$ in the fragment becomes $orig$, for (2) $orig \wedge c$, to express that the fragment extends $\varphi$ with conjunct $c$. Note that a sentry in $G_2$ of the form $\varphi \vee c$ is not allowed (cf. Definition 2); instead, a set of sentries $\{\varphi, c\}$ can be used, which becomes $\{orig, c\}$ in the fragment.
To define the sentries of an extracted fragment, we introduce a function $\mathit{weave}$ that, given a set $F_1$ of original sentries and a set $F_2$ of modified sentries, defines a new set of sentries. If a sentry is both in $F_2$ and $F_1$, it is replaced with orig. If a sentry $\varphi$ is in $F_2$ but not in $F_1$, then in $\varphi$ each occurrence of a sentry $c \in F_1$ is replaced with orig, written $\varphi[c/orig]$.
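The behavior of weave can be illustrated on conjunctive sentries modeled as sets of conjuncts. The sketch makes two assumptions that go beyond the text: event parts are not treated separately, and when several original sentries occur inside a modified sentry, the longest one is replaced. The function body is therefore one possible reading, not the definition used in the paper.

```python
ORIG = "orig"


def weave(original, modified):
    """Rewrite modified sentries in terms of orig where possible.

    Both arguments are sets of sentries; a sentry is a frozenset of
    conjuncts.
    """
    woven = set()
    for sentry in modified:
        if sentry in original:
            woven.add(frozenset({ORIG}))  # unchanged sentry
            continue
        # Original sentries that occur as a sub-conjunction of this sentry.
        contained = [c for c in original if c and c <= sentry]
        if contained:
            base = max(contained, key=len)  # assumption: prefer the longest
            woven.add((sentry - base) | {ORIG})
        else:
            woven.add(sentry)  # entirely new sentry
    return woven


f1 = {frozenset({"x > 10"})}
f2 = {frozenset({"x > 10", "x < 100"})}
print([sorted(s) for s in weave(f1, f2)])  # [['orig', 'x < 100']]
```

Composing the woven sentry with the original schema, where orig is replaced by `x > 10`, gives back the modified sentry.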
Definition 9 (Fragment extraction). Let $G_1, G_2$ be two GSM schemas such that $G_1 \sqsubseteq G_2$ and let $C \subseteq \Delta(G_1, G_2)$ be a context of $G_1$ and $G_2$. Then $\mathit{extract}(G_1, C, G_2)$ denotes the fragment extracted for $C$.
The definition states that the attributes of the fragment extracted for context $C$ are the status attributes of $C$ plus additional attributes that are referenced in rules of status attributes in $C$. According to the definition of GSM schemas, each sentry can only reference status and data attributes that are in the GSM schema, hence these attributes are included. For instance, the context $C_2$ in $BCA_3$ w.r.t. $BCA_{in}$, defined in Example 4, does not include milestone DCS, since DCS is not in the change set of $BCA_{in}$ and $BCA_3$. However, the opening sentry for Analyze in Management Board references DCS (see Table 3), so milestone DCS should be included in the extracted fragment for $C_2$.
Example 5. Table 5 shows the fragments extracted for the variants introduced in Section 3, cf. Fig. 2. To illustrate Definition 9 in more detail, consider the extracted fragment $f_5 = \mathit{extract}(BCA_{in}, C_2, BCA_3)$ with context $C_2$ = {AS, Analyze in Management Board, AU, Write Report}. Since the opening sentry of Analyze in Management Board references DCS, this milestone is also included as a status attribute of the fragment, with plus and minus sentries $\mathit{orig}$. The plus sentry for Write Report becomes AS. Status attributes Analyze in Management Board, AS and AU are introduced by $f_5$ and therefore their sentries do not contain the $\mathit{orig}$ keyword.
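How the attribute set of an extracted fragment is assembled can be sketched as follows. This is an illustrative sketch only: sentries are again modelled as frozensets of referenced attribute names, events are ignored, and the sentry data in `SENTRIES` is a placeholder that only records the two references mentioned in the text (Analyze in Management Board referencing DCS, Write Report referencing AS); the actual sentries are given in Tables 3 and 5.

```python
from typing import Dict, FrozenSet, Iterable, List, Set

ORIG = "orig"


def refs(attrs: Iterable[str], sentries: Dict[str, List[FrozenSet[str]]]) -> Set[str]:
    """Attributes referenced in the (plus and minus) sentries of the given attributes."""
    referenced: Set[str] = set()
    for attr in attrs:
        for s in sentries.get(attr, ()):
            referenced |= set(s) - {ORIG}  # orig is a keyword, not an attribute
    return referenced


def fragment_attributes(context: Set[str],
                        sentries: Dict[str, List[FrozenSet[str]]]) -> Set[str]:
    """Status attributes of the context plus everything their sentries reference."""
    return set(context) | refs(context, sentries)


C2 = {"AS", "Analyze in Management Board", "AU", "Write Report"}

# Placeholder sentries (hypothetical simplification of Tables 3 and 5).
SENTRIES = {
    "Analyze in Management Board": [frozenset({"DCS"})],
    "Write Report": [frozenset({"AS"})],
    "AS": [],
    "AU": [],
}

attrs = fragment_attributes(C2, SENTRIES)
```

With this placeholder data, `attrs` consists of the four attributes of the context plus DCS, mirroring why DCS ends up in the fragment although it is not in the change set.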
Correctness: We state the correctness of the approach with a few lemmas; proofs are in the online appendix, available in the online supplemental material. The first lemma states that each extracted fragment is a GSM schema.
Lemma 2. Let $G_1, G_2$ be two GSM schemas such that $G_1 \sqsubseteq G_2$. Let $C \subseteq \Delta(G_1, G_2)$ be a context of $G_1$ and $G_2$. Then $G = \mathit{extract}(G_1, C, G_2)$ is a GSM schema.
Before showing correctness, we state two auxiliary definitions for GSM schemas. The first definition rules out GSM schemas that have data attributes and events that are not referenced in any sentry. Such GSM schema variants cannot be reconstructed by composing the fragments extracted from them, since these fragments will not contain the unreferenced data attributes and events.
Definition 10 (Fully referenced). Let $G$ be a GSM schema. Then $G$ is fully referenced if each data attribute in $D$ and each event in $E$ is referenced by one or more sentries.
The second definition rules out refinements $G_1 \sqsubseteq G_2$ for which the extracted fragment does not yield, when composed with $G_1$, the sentries in $G_2$. Composing an extracted fragment may result in additional sentries compared to $G_2$. For example, suppose $G_1, G_2$ have shared milestones $p$, $q$, $r$ and $z$ with $R_1^+(z) = \{p, q\}$ and $R_2^+(z) = \{p, q, p \wedge r\}$. Then the extracted fragment $f$ will contain plus sentries $\{\mathit{orig}, p \wedge r, \mathit{orig} \wedge r\}$ for $z$. Then $f \bullet G_1$ yields the plus sentries $\{p, q, p \wedge r, q \wedge r\}$ for $z$, which include a sentry $q \wedge r$ that is not in $R_2^+(z)$. Alternatively, composing an extracted fragment may result in different sentries, namely if one of the sentries in $G_2$ uses $\mathit{orig}$. (Then $G_2$ and $G_1$ are fragments that have been refactored, as explained in the next section.) For example, suppose $G_1, G_2$ have shared milestones $p$, $q$ and $z$ with $R_1^+(z) = \{p\}$ and $R_2^+(z) = \{p, \mathit{orig} \wedge q\}$. Then the extracted fragment $f$ will contain plus sentries $\{\mathit{orig}, \mathit{orig} \wedge q\}$ for $z$. Then $f \bullet G_1$ gives the plus sentries $\{p, p \wedge q\}$ for $z$, with sentry $p \wedge q$ that is different from sentry $\mathit{orig} \wedge q$ in $R_2^+(z)$.
To rule out such refinements, we next characterize three different ways of refining a set of sentries. Let $\Phi_1, \Phi_2$ be two sets of sentries. Then $\Phi_2$ preserves $\Phi_1$ if $\Phi_1 \subseteq \Phi_2$ and each sentry in $\Phi_2 \setminus \Phi_1$ does not contain $\mathit{orig}$ and does not contain a sentry in $\Phi_1$. And $\Phi_2$ extends $\Phi_1$ if there is a set of sentries $\Psi$ such that $\Phi_2 = \bigcup_{\psi \in \Psi} \{\varphi \wedge \psi \mid \varphi \in \Phi_1\}$ and each sentry in $\Psi$ does not contain $\mathit{orig}$ and does not contain a sentry in $\Phi_1$. And $\Phi_2$ overrides $\Phi_1$ if each sentry in $\Phi_2$ does not contain $\mathit{orig}$ and does not equal or contain a sentry in $\Phi_1$.
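The two counterexamples above can be replayed with a short Python sketch. It assumes, consistently with both examples but without claiming to be the paper's formal composition operator, that composing a fragment with a base schema replaces orig in each fragment sentry by every original sentry in turn. Sentries are modelled as frozensets of atoms (conjunctions), and "contains a sentry" is read as a subset test; the function names are hypothetical.

```python
from typing import FrozenSet, Set

ORIG = "orig"
Sentry = FrozenSet[str]


def s(*atoms: str) -> Sentry:
    """Build a conjunctive sentry from atom names (illustrative helper)."""
    return frozenset(atoms)


def compose_sentries(fragment: Set[Sentry], base: Set[Sentry]) -> Set[Sentry]:
    """Expand orig in each fragment sentry with every sentry of the base schema."""
    result: Set[Sentry] = set()
    for phi in fragment:
        if ORIG in phi:
            rest = phi - {ORIG}
            result |= {frozenset(rest | psi) for psi in base}
        else:
            result.add(phi)
    return result


def preserves(phi2: Set[Sentry], phi1: Set[Sentry]) -> bool:
    """phi2 preserves phi1: phi1 is kept and the new sentries do not reuse it."""
    def reuses_original(phi: Sentry) -> bool:
        return ORIG in phi or any(psi <= phi for psi in phi1)
    return phi1 <= phi2 and not any(reuses_original(phi) for phi in phi2 - phi1)


# First example: composition introduces the extra sentry q AND r.
first = compose_sentries({s(ORIG), s("p", "r"), s(ORIG, "r")}, {s("p"), s("q")})
# Second example: orig AND q is expanded to p AND q.
second = compose_sentries({s(ORIG), s(ORIG, "q")}, {s("p")})
# {p, q, p AND r} does not preserve {p, q}, because p AND r contains p.
is_preserving = preserves({s("p"), s("q"), s("p", "r")}, {s("p"), s("q")})
```

Here `first` contains the four sentries p, q, p ∧ r and q ∧ r, `second` contains p and p ∧ q, and `is_preserving` is false, which is why such refinements are excluded.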
The next lemma shows that consistent refinements result in equal sentry sets for refined status attributes.
Lemma 3. Let $G_1, G_2$ be two GSM schemas such that $G_1 \sqsubseteq G_2$ and let $C \subseteq \Delta(G_1, G_2)$ be a context such that $G_2$ refines $G_1$ consistently in $C$. Let $G_C = \mathit{extract}(G_1, C, G_2)$ and $G_{C1} = G_C \bullet G_1$. For each $a \in C$, $R_{C1}^+(a) = R_2^+(a)$ and $R_{C1}^-(a) = R_2^-(a)$.
Based on this lemma, the following lemma ensures that if $G_2$ refines $G_1$ consistently and $G_2$ is fully referenced, then composing the extracted fragments with $G_1$ yields $G_2$.
Lemma 4. Let $G_1, G_2$ be two GSM schemas such that $G_1 \sqsubseteq G_2$, let $F$ be all extracted fragments, so $F = \{\mathit{extract}(G_1, C, G_2) \mid C \subseteq \Delta(G_1, G_2) \text{ is a context}\}$, and let $f_1, \ldots, f_n$ be an arbitrary sequence of all fragments in $F$, so $F = \{f_1, \ldots, f_n\}$. If $G_2$ refines $G_1$ consistently and $G_2$ is fully referenced, then $f_n \bullet \cdots \bullet f_1 \bullet G_1 = G_2$.
The lemma is key for ensuring the correctness of the extraction algorithm and the refactoring algorithm defined in the next section.

METHOD
We next define a method that extracts a set of fragments from a set of GSM schemas. Each extracted fragment is specified as a GSM schema fragment. We first state the requirements on the method and then the method itself. We also discuss how the extracted fragments can be properly composed into variants.

Requirements
There are two requirements that the set of fragments that the method extracts must satisfy. The fragment set must be:

R1 minimal: there is no fragment that is a combination of other fragments in the set.
R2 complete: each variant can be derived by composing one or more fragments with the base schema.
Requirement R1 states that a fragment cannot be a composite of other fragments. For instance, variant $BCA_2$ can be derived by composing two fragments with the base schema, one of which is defined for generating variant $BCA_1$. It is therefore redundant to introduce a third fragment to derive $BCA_2$ directly from $BCA$. Requirement R2 ensures that each variant that is input can be derived from the base schema by composing it with a combination of the fragments.

Algorithm
Algorithm COMPUTEFRAGMENTS in Fig. 5 extracts fragments while meeting the two requirements. The input is a set $V$ of GSM schema variants. The set may contain more than one base schema, i.e., a schema $G_{base}$ that does not refine any other schema in $V$. In that case, procedure COMPUTEINTERSECTIONS, defined in Section 5, iteratively computes a set $I$ of intersections of the base schemas until $I$ contains a single base schema (l. 3). This set $I$ is added to $V$ (l. 4).
Next, for each pair of GSM schemas $G_1, G_2$, if $G_1 \sqsubseteq G_2$ and $G_2$ is fully referenced, then for each context $C \subseteq \Delta(G_1, G_2)$ such that $G_2$ refines $G_1$ consistently in $C$, a GSM schema $\mathit{extract}(G_1, C, G_2)$ is constructed and added to the set $F$ of fragments (l. 5-l. 6).
The set of extracted fragments may contain fragments that are not minimal. For instance, applying the algorithm to the base schema in Fig. 1a and 1 results in one fragment $f_3$ that is constructed for $\Delta(BCA, BCA_2)$ but is itself a composition of fragments $f_2$ and $f_4$ (see Fig. 2). Therefore, set $F$ is refactored to replace fragments such as $f_3$ with more fine-grained fragments (l. 17); the invoked procedure RefactorFragments() is defined below. Finally, the set of fragments after refactoring is returned (l. 18).
ComputeIntersections() returns set $\{BCA_{in}\}$. The resulting set of five schemas has seven refinements (cf. Fig. 2), and each refinement is consistent. The set of extracted fragments at l. 16 in Fig. 5 is shown in Table 5. For some refinements the same fragments are extracted; for instance, fragment $f_1$ is extracted for the three refinements $BCA_{in} \sqsubseteq BCA$, $BCA_{in} \sqsubseteq BCA_1$ and $BCA_{in} \sqsubseteq BCA_2$.
Algorithm REFACTORFRAGMENTS, defined in Fig. 6, takes a set of fragments $F$ and refactors them into a minimal set $F_{new}$, i.e., there is no fragment in $F_{new}$ that is a combination of other fragments in $F_{new}$. All fragments in $F$ can be constructed by combining one or more of the fragments in $F_{new}$.
In more detail, the algorithm first initializes two sets $F_{old}$ and $F_{new}$ that represent the set of fragments before and after the last refactoring iteration, respectively (l. 2-3). While these two sets are different (l. 4), refactoring is possible and a refactoring iteration is performed that consists of the following steps. First, set $F_{old}$ is assigned the set $F_{new}$ of fragments that are processed for refactoring (l. 5). Sets $R$ and $N$ contain the set of refactored fragments and the set of new fragments, respectively. For each pair of schemas $G_1, G_2 \in F_{old}$ (l. 8), if $G_2$ is fully referenced (l. 9) and $G_2$ refines $G_1$ consistently (l. 10), then $G_2$ is replaced with new fragments, so $G_2$ is added to $R$ (l. 11); otherwise $G_2$ is not refactored. The fragments extracted for each context of $G_1$ and $G_2$ (l. 12ff) replace $G_2$ and are therefore added to $N$ (l. 13). After $N$ and $R$ have been computed by processing each pair of fragments in $F_{old}$, set $F_{new}$ is updated with the unchanged fragments in $F_{old}$, i.e., the fragments that are not in $R$, and the new fragments in $N$ (l. 18). If refactoring is still possible, so $F_{old} \neq F_{new}$ (l. 4), the refactoring procedure is repeated. The algorithm finally returns the set of refactored fragments $F_{new}$.
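The refactoring loop can be sketched in Python as shown below. The sketch follows the textual description only; Fig. 6 remains the authoritative definition. The GSM-specific operations (consistent refinement, the fully-referenced check, context computation and extraction) are passed in as functions, and the usage example replaces them with a deliberately simple toy model in which a fragment is a set of change units, refinement is the proper-subset relation, and extraction is set difference. That toy model and all names in it are hypothetical.

```python
from typing import Callable, Hashable, Iterable, Optional, Set


def refactor_fragments(
    fragments: Iterable[Hashable],
    refines_consistently: Callable[[Hashable, Hashable], bool],
    fully_referenced: Callable[[Hashable], bool],
    contexts: Callable[[Hashable, Hashable], Iterable],
    extract: Callable[[Hashable, object, Hashable], Hashable],
) -> Set[Hashable]:
    """Repeat refactoring iterations until the fragment set no longer changes."""
    f_new: Set[Hashable] = set(fragments)
    f_old: Optional[Set[Hashable]] = None
    while f_old != f_new:
        f_old = set(f_new)
        refactored: Set[Hashable] = set()  # set R: fragments that are replaced
        added: Set[Hashable] = set()       # set N: fragments that replace them
        for g1 in f_old:
            for g2 in f_old:
                if g1 == g2:
                    continue
                if fully_referenced(g2) and refines_consistently(g1, g2):
                    refactored.add(g2)
                    for context in contexts(g1, g2):
                        added.add(extract(g1, context, g2))
        f_new = (f_old - refactored) | added
    return f_new


# Toy model (hypothetical): f3 combines the change units of f2 and f4.
f1, f2, f4, f5 = (frozenset({unit}) for unit in ("u", "x", "y", "v"))
f3 = f2 | f4

result = refactor_fragments(
    {f1, f2, f3, f4, f5},
    refines_consistently=lambda g1, g2: g1 < g2,
    fully_referenced=lambda g: True,
    contexts=lambda g1, g2: [g2 - g1],
    extract=lambda g1, context, g2: frozenset(context),
)
```

In the toy run the first iteration replaces `f3` by `f4` and `f2`, the second iteration finds nothing to refactor, and the loop stops with the four fragments `f1`, `f2`, `f4` and `f5`.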
Example 7. Continuing Example 6, consider RefactorFragments($F$), with $F = \{f_1, f_2, f_3, f_4, f_5\}$ (cf. Fig. 2). In $F$, fragment $f_3$ consistently refines $f_2$, and $f_3$ consistently refines $f_4$. Therefore, in the first refactoring iteration fragment $f_3$ is replaced, so $R = \{f_3\}$. The extracted fragment for $f_3$ and $f_2$ equals $f_4$; the extracted fragment for $f_3$ and $f_4$ equals $f_2$. Therefore, $N = \{f_2, f_4\}$, so $F_{new}$ becomes $\{f_1, f_2, f_4, f_5\}$. In the second refactoring iteration, no fragments can be refactored, so $R = N = \emptyset$, and the while loop terminates.
The next lemma states the correctness of the algorithm.
Lemma 5. Let $V$ be a set of GSM schemas. The set of fragments generated by COMPUTEFRAGMENTS($V$) in Fig. 5 is minimal and complete.

From Fragments to Variants
Extracted fragments returned by ComputeFragments() can have interdependencies. One fragment $f_1$ may define a status attribute that another fragment $f_2$ refines (cf. Definition 3). If a composition chain contains both $f_1$ and $f_2$, then $f_1$ should be before $f_2$ in the chain. Next, $f_1$ and $f_2$ are in conflict if they define the same status attributes: if a chain contains both $f_1$ and $f_2$ then the one occurring last in the chain overrides the effect of the other.
We next define well-formed composition chains, in which fragments are properly composed. In a composition chain $G_n \bullet \cdots \bullet G_1$, GSM schema $G_j$, where $1 \leq j \leq n$, introduces status attribute $a$ if $a \in A_j$, and for each earlier $G_i$ in the chain, where $1 \leq i < j$, $a \notin A_i$. So $G_j$ is the first schema in the chain that has $a$ as attribute. In that case, $G_j$ must define $a$, i.e., it must not use $\mathit{orig}$ in its sentries (cf. Definition 3). A composition chain $f_n \bullet \cdots \bullet f_1 \bullet G_{base}$, where $\{f_1, \ldots, f_n\} \subseteq F$, is well-formed if for each status attribute $a$ and each GSM schema $G_i$ in the chain: if $G_i$ introduces $a$, then $G_i$ defines $a$.
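The well-formedness condition lends itself to a direct check, sketched below in Python. The representation is an assumption made for illustration: a schema is a dictionary from status attribute to its list of sentries (plus and minus sentries are not distinguished), a sentry is a frozenset of atoms, and a chain is listed in application order, with the base schema first. The schemas `base`, `frag_a` and `frag_b` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set

ORIG = "orig"
Schema = Dict[str, List[FrozenSet[str]]]


def defines(schema: Schema, attr: str) -> bool:
    """A schema defines an attribute if none of its sentries uses orig."""
    return all(ORIG not in sentry for sentry in schema[attr])


def is_well_formed(chain: List[Schema]) -> bool:
    """Check that every schema that introduces an attribute also defines it.

    `chain` is given in application order: base schema first, then f_1, ..., f_n.
    """
    seen: Set[str] = set()
    for schema in chain:
        for attr in schema:
            introduces = attr not in seen
            if introduces and not defines(schema, attr):
                return False
        seen |= set(schema)
    return True


# Hypothetical schemas: frag_a introduces milestone m, frag_b refines it via orig.
base: Schema = {"p": [], "z": [frozenset({"p"})]}
frag_a: Schema = {"m": [frozenset({"p"})]}
frag_b: Schema = {"m": [frozenset({ORIG, "q"})]}

good_chain = is_well_formed([base, frag_a, frag_b])
bad_chain = is_well_formed([base, frag_b, frag_a])
```

The first chain is well-formed because `frag_a` introduces and defines `m` before `frag_b` refines it; in the second chain `frag_b` would introduce `m` while using orig, so the check fails.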
Given a set $F$ of extracted fragments, the set of well-formed composition chains that can be constructed from $F$ and its subsets can be enumerated. Composition chains that are permutable result in identical variants [4]. Fragments $f_i$, $f_j$ in a composition chain can be permutable, so $f_i \bullet f_j = f_j \bullet f_i$. In that case, changing the ordering of $f_i$ and $f_j$ in the chain results in the same variant. If $F$ is a set of permutable fragments, so each distinct pair of fragments in $F$ is permutable, we write $F \bullet G_{base}$, which abbreviates a set of composition chains constructed by taking all the permutations of elements in $F$, each permutation translating into a composition chain that ends with $G_{base}$. By definition, all the composition chains in the set result in the same variant.
Example 8. Table 6 shows the 11 different variants that can be generated from the fragments extracted for the variants of BCA. Note that 7 variants are new. Composition chains containing both $f_1$ and $f_5$ are not listed, since in such composition chains the fragment occurring last overrides the effect of the other. For instance,

EVALUATION
To evaluate the feasibility of the fragment extraction approach and assess its effectiveness in generating data-centric process variants, we have implemented the approach in a prototype tool and applied it to several existing data-centric business processes. The tool is programmed in Java and processes CMMN schemas [10]. An advantage of CMMN over GSM is the existence of tool support for designing and executing CMMN-based schemas. The tool uses a subset of CMMN that resembles GSM, as explained elsewhere [4].
We applied the approach and the tool to model in CMMN variants of six business processes that have been described before in the literature [23], [43], [44], [45], [46], [47]. Table 7 lists the descriptive statistics of the six analyzed business processes and the extracted fragments. The table describes for each business process the average size of the CMMN schemas. In accordance with CMMN, we distinguish between tasks, which correspond to atomic stages in GSM, and stages, which correspond to composite stages in GSM. Next, the table lists key statistics for the algorithms ComputeIntersections(), ComputeFragments() and RefactorFragments(), plus the number of before relations and conflict relations between distinct fragments, which were introduced in Section 7.3. The second-to-last column lists whether the fragments are complete, i.e., whether each variant can be constructed by composing one or more fragments with the base schema. Each variant is fully referenced by construction. The final column shows how many distinct variants can be generated by composing the fragments with the base schema. For both columns, only well-formed composition chains are considered in which there are no conflicting fragments.
The results show that for each business process multiple fragments could be extracted. Fig. 7 shows the average complexity of the extracted fragments, where the complexity of a CMMN schema is defined as the sum of the number of stages, tasks, milestones and sentries in the schema [48]. Computing intersections and refactoring fragments were needed in most cases. For most processes, the fragments generate many additional variants in addition to the input variants, with one interesting exception. For the Acknowledging Unborn Child (AUC) case, one of the variants cannot be constructed from the generated fragments, since three relevant fragments have a cycle in the before-relation.
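Why a cycle in the before-relation blocks the construction of a variant can be made concrete with a small ordering check. The sketch below is not part of the prototype, which is written in Java; it uses Python's standard `graphlib` module (Python 3.9 or later) to look for a composition order that respects the before-relation, and the fragment names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Iterable, List, Optional, Set, Tuple


def order_fragments(fragments: Iterable[str],
                    before: Set[Tuple[str, str]]) -> Optional[List[str]]:
    """Return a composition order that respects `before`, or None if it is cyclic.

    A pair (f_i, f_j) in `before` means that f_i must be composed before f_j.
    """
    graph = {fragment: set() for fragment in fragments}
    for earlier, later in before:
        if earlier in graph and later in graph:
            graph[later].add(earlier)  # TopologicalSorter expects predecessors
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None


# An acyclic before-relation admits an order.
acyclic = order_fragments(["g1", "g2"], {("g1", "g2")})
# Three fragments with a cyclic before-relation cannot be ordered.
cyclic = order_fragments(["g1", "g2", "g3"],
                         {("g1", "g2"), ("g2", "g3"), ("g3", "g1")})
```

In the second call no order exists, so no composition chain can contain all three fragments, which corresponds to the situation reported for the AUC case.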
For the Social Support Act (SSA) and AUC processes, the fragments resulted in an abundance of variants compared to the other analyzed processes. Both processes are performed by municipalities, which need to perform certain tasks by law, but there is no strict ordering specified for these tasks. This means the different input variants share common fragments but each variant has different dependencies between these fragments. To reduce the number of generated variants, we performed for both processes a second experiment in which we added to the set of input variants the artificial base schema computed in the first experiment. This way, no intersections needed to be constructed in the second experiment. For both processes, this resulted in a set of complete fragments that still generated many more variants compared to the set of input variants (SSA-B and AUC-B). Two interesting observations can be made. First, AUC has an incomplete set of fragments, while AUC-B has a complete set. Second, the complexity of the generated fragments decreased for SSA but increased for AUC (see Fig. 7). The reason is that for AUC-B two large conflicting fragments are generated that both exceed the complexity of the fragments generated for AUC. For AUC we performed another experiment in which we added another intersection that is in between the base schema and the two variants for which the two large fragments were constructed (AUC-BI). This resulted in fragments with lower complexity than those generated for AUC-B, though still higher than for AUC. More fragments are generated for AUC-BI compared to AUC-B, but still considerably fewer than for AUC.
The results show that the approach can be applied to a given set of data-centric process variants in order to extract reusable fragments that can be composed into not only the input variants, but also many more additional variants. Knowledge workers can use this approach to extract fragments, select the most meaningful ones, and this way quickly generate multiple variants from those fragments.

CONCLUSION
This paper has defined a novel approach to extract reusable fragments from data-centric, declarative process model variants. Using feature-oriented composition [4], [42], the extracted fragments can be composed in a declarative way to obtain not only the original variants, but also additional variants. The approach can be used to decompose variants of data-centric process templates into reusable fragments that codify well-tested modifications. This way, complex data-centric process variants can be efficiently composed in a declarative way.
Though the approach has been defined for GSM schemas as host language, it is also applicable to other languages on a conceptual level. Defining the approach for other data-centric process modeling languages requires the definition of composition, intersection, and refinement operators, plus definitions for concepts like change sets and context.
In future work, the practical applicability of the approach for knowledge workers can be further evaluated. Also, an open challenge is to reconcile feature-oriented composition of fragments, which is additive, with modifications of variants, which is non-additive, since deletions can be applied [3]. Another direction is to explore the repairs of fragments that cannot be refactored into minimal fragments.
(ECOWS 2009) and the programme co-chair of IEEE EDOC 2020. His main research interest is in data-driven business process management for knowledge- and decision-intensive processes. He is and has been involved in several national and EU research projects together with industry that focus on developing advanced, flexible business process support. He is a member of the ACM.