Deep Analysis of Process Model Matching Techniques

Process Model Matching (PMM) aims to automatically identify corresponding activities between two process models that exhibit similar behavior. Recognizing the diverse applications of process model matching, several techniques have been proposed in the literature. Typically, the effectiveness of these matching techniques has been evaluated using three widely used performance measures: Precision, Recall, and F1 score. In this study, we have established that the values of these three measures over a complete dataset do not provide deeper insights into the capabilities of the matching techniques. To that end, we have made three significant contributions. Firstly, we have enhanced four benchmark datasets by classifying their corresponding activities into three sub-types; the enhanced datasets can be used for surface-level evaluation, as well as for a deeper evaluation of matching techniques. Secondly, we have conducted a systematic search of the literature to identify an extensive set of 27 matching techniques and subsequently proposed a taxonomy of these matching techniques. Finally, we have performed 432 experiments to evaluate the effectiveness of all the matching techniques, and we present key observations about the effectiveness of the techniques.


I. INTRODUCTION
The conceptual models that depict the workflows of an organization are called business process models [1], [2]. It is well established that these models are valuable assets for organizations, due to a broad range of application areas, ranging from documenting software requirements to configuring ERP systems [3]. Process Model Matching (PMM) refers to the identification of corresponding activities between two process models that represent the same or similar behavior [4]-[6]. Examples of corresponding activities of two process models are: 'Assess candidate' - 'Evaluate applicant' and 'Get birth certificate' - 'Collect letter of birth registration'.
The application scenarios of PMM techniques are manifold, and include computing similarity between process models, querying process models, and harmonizing variants of process models [7]-[10]. Recognizing the importance of identifying corresponding activities, several attempts have been made to develop new matching techniques, as well as to use string matching techniques for PMM [11]. Typically, the effectiveness of these techniques is evaluated in terms of three measures, Precision, Recall, and F1 score, using benchmark datasets from the process model matching contests [11]-[13]. In this study, we have synthesized these available benchmark datasets to reveal two interrelated problems: a) the F1 scores achieved by PMM techniques are inflated due to the presence of trivial corresponding pairs, and b) the three measures merely establish the effectiveness of a PMM technique at surface level and do not provide deeper insights into PMM techniques.
The associate editor coordinating the review of this manuscript and approving it for publication was Imran Sarwar Bajwa.
In the following subsections, we first illustrate the PMM problem; we then establish the limitations of the existing approaches that we have identified; finally, a summary of our contributions is presented.
A. ILLUSTRATION OF PROCESS MODEL MATCHING PROBLEM
Figure 1 shows two process models, A and B, separated by horizontal boxes. The corresponding activities between process models A and B are highlighted in gray, where the changes in the intensity of the colour, from light gray to dark gray, represent the varying levels of difference between the formulations of the labels. More specifically, corresponding activities having identical or similar labels are highlighted in light gray. Examples of such pairs are: 'submit application' - 'submit application' and 'Rank student' - 'Ranking of student'. In contrast, corresponding activities having the same semantics but a significantly different formulation of labels are highlighted in dark gray. Examples of such pairs are: 'Check academics' - 'Verify education documents' and 'Appoint applicant' - 'Send offer letter'. In this study, we contend that, in the presence of such varying levels of difference in the labels of corresponding activities, it is desirable that PMM techniques be able to identify all types of corresponding activities.

B. ESTABLISHING THE PROBLEM
There are three publicly available datasets that have been widely used to evaluate the effectiveness of PMM techniques [4], [13]. These datasets include real-world process models from three different domains: university admission, registration of newborns, and asset management. The datasets are formally referred to as University Admissions (UA), Birth Registration (BR), and Asset Management (AM) datasets [12]. Ontological representations of these datasets have also been generated to define an ontology mapping task, formally called OAEI'17 dataset [14].
The first two datasets are composed of 9 process models each, whereas the third dataset is composed of 72 process models. Furthermore, each of the three datasets includes 36 pairs of process models, and they provide gold-standard annotations for 1575, 633, and 799 activity pairs, respectively. In addition to the three datasets, we have also used another, much larger dataset, the Multi-Genre (MG) dataset, which is manifold larger than the UA, BR, and AM datasets. The MG dataset used in this study is composed of 600 process models, 406 process model pairs, and gold-standard annotations for 89558 activity pairs.
For this study, we have synthesized all four datasets by re-assessing their gold-standard annotations. To that end, two researchers with expertise in natural language processing were involved in the re-assessment process. Both researchers independently reviewed each corresponding pair in the four datasets and reached the same conclusion: all these datasets contain a substantial number of corresponding pairs that are either identical or similar, i.e., having only a slight variation in the formulation of the labels. The results of our synthesis are presented in Table 1. In the table, the corresponding pairs having identical labels or slight variations in the formulation of labels are represented as trivial corresponding pairs. Based on these results, we have identified the following two interrelated issues regarding the evaluation of PMM techniques.

1) INFLATED SCORES
All four datasets include a substantial number of trivial corresponding pairs, which artificially inflate the Precision, Recall, and F1 scores achieved by a matching technique. Therefore, the values of the evaluation measures computed over the complete datasets do not correctly represent the effectiveness of a PMM technique.

2) SURFACE LEVEL EVALUATION
Due to the possible diversity in the corresponding pairs, illustrated in Figure 1, the use of a single Precision, Recall, or F1 score for a complete dataset merely provides a surface-level evaluation of a matching technique. That is, it does not provide any deeper insights into the capabilities of PMM techniques.
These limitations justify the need for enhancing the benchmark datasets by classifying the corresponding pairs based on the variations in the labels of activities, and for computing Precision, Recall, and F1 scores for each type of corresponding pair.

C. EXTENSION OF PREVIOUS WORK
This paper is a significantly extended version of our previous work [15] and has the following notable extensions:

1) ENHANCING ADDITIONAL DATASET
In this extended version, we have enhanced four datasets, compared to the three datasets used in our previous work. This is a significant extension, as the fourth dataset is composed of 89958 activity pairs, which is 29-fold more than the total number of pairs in the three previously used datasets. Furthermore, the fourth dataset includes 6443 corresponding pairs, which is 12-fold more than the total number of corresponding pairs in the three previously used datasets.

2) NOVEL TAXONOMY
The extended version provides an overview of a comprehensive set of 27 matching techniques that were identified by employing a systematic search and screening procedure. Furthermore, a novel taxonomy of the matching techniques is proposed, which will serve as a reference for the BPM research community.

3) NUMEROUS ADDITIONAL EXPERIMENTS
In the earlier version, three semantic similarity techniques and three datasets were used for experimentation. In contrast, in this extended version, experiments are performed using 27 matching techniques and four datasets. Notably, the techniques include 12 syntactic techniques, 12 semantic techniques, and three state-of-the-art word embeddings. Furthermore, as discussed earlier, the additional dataset is manifold larger than the other three datasets.

D. PAPER ORGANIZATION
Section II presents the enhancements that we have made to the existing benchmark datasets. Section III presents search and screening details, as well as our proposed taxonomy of the matching techniques. Furthermore, an overview of each technique is presented in this section. Section IV presents the details of experimental settings, results of the experiments, as well as the analysis of the results of all the techniques. Finally, the paper concludes in Section V.

II. ENHANCED BENCHMARK DATASETS
This section presents our first contribution, the enhanced benchmark datasets, which include a classification of corresponding pairs into three sub-types. In particular, we first discuss the conceptual bases that we have used for enhancing the datasets. Subsequently, the details of our enhancements, as well as the specifications of the enhanced datasets, are presented.

A. CONCEPTUAL BASES FOR ENHANCING DATASETS
It is widely acknowledged that the level of similarity between a given pair of strings can be defined in terms of three types: Near Copy, Light Revision, and Heavy Revision [16]-[18]. Near Copy represents a slight variation in the formulation of the two strings in the pair, produced by changing the form of a word or adding stop words. An example pair of this type is 'best student' - 'the best student'. Light Revision represents a substantial variation in the formulation of the two strings, which can be achieved by adding new words or replacing words with their synonyms. For instance, 'Identify best student' is a light revision of 'find the outstanding student'. Heavy Revision represents a significant variation in the phrasing of the two strings, such that the semantics are not changed. The change operations may involve the use of alternative words or the reordering of existing words. Examples of heavy revisions of the activity label 'find the best student' are 'perform class topper search' and 'find the student having the highest GPA in the class'.

B. ENHANCING THE BENCHMARK DATASETS
This study has used the conceptual bases presented above to enhance the four benchmark datasets. To that end, we first introduce the three types of pairs that we have used, followed by a discussion of the pre-processing of the datasets that we performed. Finally, the enhancement procedure and the specifications of the enhanced datasets are presented.

1) DIVERSITY IN CORRESPONDING PAIRS
This study has used three types of pairs, Verbatim, Modified Copy, and Heavy Revision, to represent the diversity in the corresponding pairs. These types stem from the levels of string similarity discussed in the preceding subsection. Below, we provide an overview of our three types of pairs and our criteria for deciding the category of a given pair. Note that the criteria were developed based on our synthesis of the four datasets, keeping in view the definitions of Near Copy (NC), Light Revision (LR), and Heavy Revision (HR). The criteria for each type of pair, as well as their respective examples from the benchmark datasets, are presented in Table 2.
• Verbatim (VB): Two activities in a pair are verbatim of each other if their labels are either identical or almost the same. In particular, a corresponding pair is verbatim if the two labels are identical, identical except for stop words, or re-ordered versions of each other (see Table 2).
VOLUME 8, 2020

2) PRE-PROCESSING DATASETS
As a first formal step towards enhancing the benchmark datasets, we also analyzed the publicly available results, as well as the gold-standard annotations, of all twelve matching systems from the matching contest. The analysis revealed the following three discrepancies:
• The UA dataset contains at least 188 corresponding activity pairs that are not meaningful, as they include logical labels that do not belong to an activity. An example of such a pair is 'IntermediateCatchEvent' - 'ExclusiveGateway'.
• All four datasets (UA, BR, AM, MG) contain activity pairs that have identical or similar labels but are declared as non-equivalent in the benchmark datasets. The four datasets include 13, 42, 213, and 1704 such pairs, respectively.
• Each dataset contains duplicate pairs, which are likely to skew the overall performance scores, as well as the scores of each type of pair after classification.
To overcome these discrepancies, firstly, the meaningless pairs were removed from the UA dataset. Secondly, the identical or similar labels that were declared as non-equivalent were marked as equivalent pairs; that is, 13, 42, 213, and 1704 pairs were marked in the UA, BR, AM, and MG datasets, respectively. Finally, 189, 164, 78, and 789 duplicate pairs were omitted from the four datasets, leaving 171, 259, 378, and 7358 corresponding pairs in the four datasets.

3) CLASSIFYING CORRESPONDING PAIRS
The corresponding pairs in the pre-processed datasets were used for the classification task. That is, each corresponding pair was classified as VB, MC, or HR, based on the criteria presented in Table 2. In particular, two experts independently classified all the corresponding pairs into one of the three types, using the developed criteria. Subsequently, all conflicts were resolved by discussing each pair and developing consensus about its type. Finally, each dataset was enhanced by annotating each pair with its agreed type.
The distribution of the types of pairs in each dataset is presented in Table 3. A key observation from the table is that the number of pairs of each type differs significantly. For instance, the number of HR pairs in the UA dataset is more than twice the number of VB and MC pairs. This lack of uniformity, in terms of the number of pairs of a specific type, reinforces that the use of overall effectiveness scores merely provides a surface-level evaluation of a matching technique and does not provide any deeper insights into its capabilities. Hence, this study computes the overall effectiveness scores, as well as the effectiveness score for each type of pair.

III. TAXONOMY OF THE MATCHING TECHNIQUES
In this section, we present our second contribution, which includes identifying a comprehensive set of matching techniques and building a taxonomy of the process model matching techniques that exist in the literature. Additionally, an overview of each matching technique is presented in this section, thereby providing a unified and comprehensive source of knowledge for the research community.

A. SEARCH STRATEGY
Starting from a survey of process model matching techniques, we used the snowballing technique to identify existing studies. In particular, we started with a notable study [5] and applied backward and forward tracing to identify relevant studies. As a result of the backward tracing, we found twenty-six studies that discuss one or more similarity measures. As a result of the forward tracing, we initially collected seven research studies, of which five either did not discuss any similarity measure or discussed a similarity measure already collected during the backward tracing step. At this step, we also included the latest studies that rely on three state-of-the-art text similarity techniques, namely word embeddings, for process model matching. Accordingly, we identified the twenty-seven similarity measures that are included in this study. Note that we only considered research articles published in journals and in the proceedings of international conferences and workshops. The developed taxonomy is presented in the following subsection.

B. THE TAXONOMY
In this study, we have included all twenty-seven similarity techniques and designed a taxonomy for these measures. The taxonomy is composed of more abstract categories, as well as specific ones, based on the commonalities between the techniques. Figure 2 shows the first level of our taxonomy.
It can be observed from the figure that there are three categories of similarity techniques: syntactic, semantic, and word embeddings. The syntactic techniques compute the similarity between a label pair based on the exact content of the involved labels. These are very basic measures that do not even consider simple relations between the words of the labels, such as inflected forms, the inclusion or exclusion of stop words, or synonyms. In contrast, the semantic similarity measures take the different forms of words and the relationships between words into consideration when computing the similarity between labels. The third category, word embeddings, includes a set of state-of-the-art techniques that go beyond synonyms and consider the context of words during the comparison of the activity labels in a pair.

1) SYNTACTIC SIMILARITY
The syntactic similarity measures are further divided into two categories: word-based similarity measures and string-based similarity measures. The word-based similarity measures compare the two labels in a pair based on the individual words that occur in the labels, whereas the string-based similarity measures treat each complete label as a string and use string matching operations to compute the similarity between the two labels. The word-based similarity measures are further divided into two categories: overlap-based similarity measures and vector-based similarity measures. The overlap-based similarity measures compare the label pair on the basis of the words common to both labels. The vector-based similarity measures first generate a word vector for each label, based either on term frequency or on term frequency-inverse document frequency, and subsequently compute the cosine between the two vectors to decide the similarity of the labels. The string-based similarity measures are also divided into two categories: distance-based similarity measures and substring-based similarity measures. A measure that computes the similarity of a label pair by checking whether one label is part of the other is called a substring-based similarity measure. In contrast, the distance-based similarity measures compare the labels based on different types of distances between them, such as edit distance and Hamming distance.

2) SEMANTIC SIMILARITY
The semantic similarity measures are divided into two categories: word-similarity based similarity measures and word-relatedness based similarity measures. The word-similarity based measures compare the label pair by also considering the synonyms of the words in the labels, whereas the word-relatedness based measures consider other relations between words, computed by measures such as Lin and Lesk. The word-similarity based measures are further divided into two categories based on the weights assigned to identical words and synonymous words. Weighted similarity measures assign weights to the identical and synonymous words that exist between the labels, whereas non-weighted similarity measures do not assign weights to the different types of word pairs. As shown in Figure 4, the word-relatedness based similarity measures are further divided into two categories: individual and ensembled. Similarity measures that use only a single word-relatedness measure, such as Lin or Lesk, are placed in the individual category.

C. THE MATCHING TECHNIQUES
We have selected 27 similarity measures that have previously been used in the process model matching literature. A brief overview of each similarity measure follows.

1) JACCARD SIMILARITY
Jaccard similarity computes the similarity between labels based on the number of words that are common between them. Formally, Jaccard similarity is defined as follows:

Jaccard(L1, L2) = Wcom / (|L1| + |L2| − Wcom)

where L1 and L2 are the two labels, |Ln| is the length of the n-th label in terms of the number of words, and Wcom is the number of words of the first label that are also present in the second label. This similarity measure returns a similarity score between 0 and 1, where 1 represents that both labels are identical and 0 represents that both labels are entirely different. For instance, consider the two labels ''conduct oral exam'' and ''conduct interview''. There is only one word, ''conduct'', in the first label that is also available in the second label. Therefore, the Jaccard similarity between the labels becomes 1 / (3 + 2 − 1) = 0.25.

2) DICE SIMILARITY
Dice similarity is a variant of Jaccard similarity that also computes the similarity between a label pair based on the words common to the two labels; however, it weights the common words differently. Formally, Dice similarity is defined as follows:

Dice(L1, L2) = 2 · Wcom / (|L1| + |L2|)

where L1 and L2 are the two labels, |Ln| is the length of the n-th label in terms of the number of words, and Wcom is the count of words from the first label that can also be found in the second label. This similarity measure returns a similarity score between 0 and 1, where 1 represents that both labels are identical and 0 represents that both labels are entirely different. For the two labels ''conduct oral exam'' and ''conduct interview'', the Dice similarity becomes 2 · 1 / (3 + 2) = 0.4.
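The two word-overlap measures above can be sketched in a few lines of Python. This is a minimal illustration, not the contest implementations; lowercasing and whitespace tokenization are our assumptions, and duplicate words within a label are counted once.

```python
def jaccard(label1: str, label2: str) -> float:
    """Jaccard similarity: common words over the union of both word sets."""
    w1, w2 = set(label1.lower().split()), set(label2.lower().split())
    common = len(w1 & w2)
    return common / (len(w1) + len(w2) - common)

def dice(label1: str, label2: str) -> float:
    """Dice similarity: common words counted twice, over the total word count."""
    w1, w2 = set(label1.lower().split()), set(label2.lower().split())
    return 2 * len(w1 & w2) / (len(w1) + len(w2))

print(jaccard("conduct oral exam", "conduct interview"))  # 0.25
print(dice("conduct oral exam", "conduct interview"))     # 0.4
```

Both functions return 1.0 for identical word sets and 0.0 for disjoint ones, matching the ranges stated above.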

3) COSINE TF
To compute cosine similarity, label vectors must first be generated. There are different ways to generate the label vectors; term frequency is the simplest one, where the count of each word in a label is used as its term frequency. All unique words of both labels, in a fixed order, are used as the vocabulary for the vectors. After the label vectors have been generated, the cosine similarity is calculated as follows:

CosineTF(L1, L2) = (V1 · V2) / (|V1| · |V2|)

where V1 and V2 are the term-frequency vectors of labels L1 and L2. The equation returns a similarity value between 0 and 1, where 1 means both labels are identical and 0 means both labels are completely dissimilar. For example, for the two candidate labels ''conduct oral exam'' and ''conduct interview'', the cosine similarity with term frequency is 1 / (√3 · √2) ≈ 0.41.

5) SUBSTRING
Two labels are similar if they are identical or one label is a substring of the other. This binary decision means that the two labels are either similar or dissimilar. For example, ''conduct oral exam'' and ''conduct interview'' are dissimilar labels as per this criterion, whereas the pair ''conduct interview'' and ''conduct interview with panel'' is similar, as one label is a complete substring of the other.
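The term-frequency cosine computation can be sketched as follows. This is a minimal sketch under the same assumptions as before (lowercasing, whitespace tokenization); it is not the contest implementation.

```python
import math
from collections import Counter

def cosine_tf(label1: str, label2: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two labels."""
    tf1 = Counter(label1.lower().split())
    tf2 = Counter(label2.lower().split())
    vocab = sorted(set(tf1) | set(tf2))          # ordered vocabulary of unique words
    dot = sum(tf1[w] * tf2[w] for w in vocab)    # V1 . V2
    norm1 = math.sqrt(sum(c * c for c in tf1.values()))
    norm2 = math.sqrt(sum(c * c for c in tf2.values()))
    return dot / (norm1 * norm2)

print(round(cosine_tf("conduct oral exam", "conduct interview"), 2))  # 0.41
```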

6) EDIT DISTANCE [17]
There are several variants of computing similarity using edit distance. Edit distance, also known as Levenshtein distance, between two labels L1 and L2 is the minimal number of atomic string operations required to transform one label into the other, where the atomic operations are the insertion, deletion, and substitution of a character. In the variant below, which is the most commonly used, the edit distance is normalized by the string length of the longer label:

Sim(L1, L2) = 1 − ED(L1, L2) / max(|L1|, |L2|)

The equation returns a similarity score between 0 and 1, where 1 means both labels are identical and 0 means both labels are completely dissimilar. For example, ''documentation'' and ''document'' are two labels for which five atomic operations are required to transform one label into the other; therefore, the similarity score becomes 1 − 5/13 = 0.615.

7) EDIT DISTANCE [16]
The equation given below is another variant of edit distance similarity, in which the edit distance is normalized by the sum of the lengths of both labels:

Sim(L1, L2) = 1 − ED(L1, L2) / (|L1| + |L2|)

The equation returns a similarity score between 0 and 1, where 1 means both labels are identical and 0 means both labels are completely dissimilar. For example, for the two labels ''documentation'' and ''document'', five atomic operations are required to transform one label into the other; the similarity score then becomes 1 − 5/21 = 0.761.

8) EDIT DISTANCE [15]
The equation given below is yet another variant of edit distance similarity, in which the edit distance is weighted by the length of the shorter label:

Sim(L1, L2) = (min(|L1|, |L2|) − ED(L1, L2)) / min(|L1|, |L2|)

The equation returns a similarity score between 0 and 1, where 1 means both labels are identical and 0 means both labels are completely dissimilar. For example, ''documentation'' and ''document'' have an edit distance of five; the similarity score then becomes (8 − 5)/8 = 3/8 = 0.375.
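The three edit-distance variants above differ only in how the distance is normalized. A compact sketch, assuming the standard dynamic-programming Levenshtein distance (our choice; the cited techniques do not prescribe an algorithm):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two rolling rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def sim_max(a: str, b: str) -> float:
    """Variant 6: normalized by the longer label."""
    return 1 - edit_distance(a, b) / max(len(a), len(b))

def sim_sum(a: str, b: str) -> float:
    """Variant 7: normalized by the sum of both label lengths."""
    return 1 - edit_distance(a, b) / (len(a) + len(b))

def sim_min(a: str, b: str) -> float:
    """Variant 8: weighted by the shorter label."""
    m = min(len(a), len(b))
    return (m - edit_distance(a, b)) / m
```

For ''documentation'' and ''document'' the three variants reproduce the worked values above (≈ 0.615, ≈ 0.762, and 0.375).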

9) HAMMING DISTANCE
Earlier, we discussed three similarity techniques based on edit distance; this technique is based on the Hamming distance. The equation given below is used to calculate the similarity between two labels based on their Hamming distance, where the Hamming distance is normalized by the string length of the longer label:

Sim(L1, L2) = 1 − HD(L1, L2) / max(|L1|, |L2|)

Typically, the Hamming distance is calculated on equal-sized labels; however, activity labels may not be of comparable size. Therefore, the Hamming distance is extended to count the extra length of the longer label. The Hamming distance is calculated by inspecting every index of both labels: if the characters of the two labels at a specific index are not the same, the Hamming distance is incremented by one. The equation returns a similarity value between 0 and 1, where 1 means both labels are identical and 0 means both labels are completely dissimilar. For example, the Hamming distance of the two labels ''documentation'' and ''document'' is five; the similarity score therefore becomes 1 − 5/13 = 0.615.
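The extended Hamming similarity described above can be sketched as follows (a minimal sketch; position-wise comparison plus the length surplus of the longer label):

```python
def hamming_similarity(a: str, b: str) -> float:
    """Hamming distance extended to unequal lengths, normalized by the longer label."""
    # mismatches position-by-position over the shared prefix length ...
    dist = sum(x != y for x, y in zip(a, b))
    # ... plus one unit for every extra character of the longer label
    dist += abs(len(a) - len(b))
    return 1 - dist / max(len(a), len(b))

print(round(hamming_similarity("documentation", "document"), 3))  # 0.615
```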

10) LONGEST COMMON SUBSEQUENCE
LCS finds the longest common subsequence of two labels rather than a common substring. A substring must occupy consecutive indexes in the label, whereas the characters of a common subsequence may occur at any indexes, as long as their relative order is preserved. The LCS distance is calculated as:

LCSDistance(L1, L2) = |L1| + |L2| − 2 · |LCS(L1, L2)|

The LCS distance is equivalent to an edit distance that allows only the insertion and deletion operations, or that counts a substitution as two operations. The equation given below is used to calculate the similarity between two labels based on their LCS distance, where the LCS distance is normalized by the sum of the lengths of both labels:

Sim(L1, L2) = 1 − LCSDistance(L1, L2) / (|L1| + |L2|)

The equation returns a similarity score between 0 and 1, where 1 means both labels are identical and 0 means both labels are completely dissimilar. For example, consider the two labels ''Send letter of acceptance'' and ''Send acceptance''; their LCS distance is ten, therefore the similarity value becomes 1 − 10/40 = 0.75.
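The LCS-based similarity can be sketched with the standard dynamic-programming recurrence for the longest common subsequence (our choice of algorithm, for illustration):

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (characters need not be adjacent)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            # extend the LCS on a match, otherwise carry the best so far
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[-1]))
        prev = curr
    return prev[-1]

def lcs_similarity(a: str, b: str) -> float:
    """1 - LCS distance normalized by the sum of both label lengths."""
    lcs_dist = len(a) + len(b) - 2 * lcs_length(a, b)
    return 1 - lcs_dist / (len(a) + len(b))

print(lcs_similarity("Send letter of acceptance", "Send acceptance"))  # 0.75
```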

11) JARO SIMILARITY
Jaro similarity is considered among the best syntactic similarity techniques for shorter labels. In contrast to edit distance, Jaro similarity is based on the concept of common characters between the labels of a pair. Consider two labels L1 and L2 of lengths n and m, respectively, with i an index into L1 and j an index into L2. Two identical characters, L1[i] = L2[j], are considered common characters if and only if they are not too far apart, i.e., their positions differ by less than half the length of the longer label. This procedure finds the common characters between labels L1 and L2 and is denoted cc(L1, L2). As a result, we obtain an ordered vector of the characters of L1 that are common with L2, denoted L1', and, similarly, an ordered vector of the characters of L2 that are common with L1, denoted L2'. Formally, Jaro similarity is defined as follows:

Jaro(L1, L2) = 1/3 · ( |cc(L1, L2)| / n + |cc(L1, L2)| / m + (|cc(L1, L2)| − t(L1, L2)) / |cc(L1, L2)| )

where t(L1, L2) is the number of transpositions between L1' and L2', calculated as half the Hamming distance between the vectors L1' and L2'. The equation returns a similarity score between 0 and 1, where 1 means both labels are identical and 0 means both labels are completely dissimilar. For example, for the two labels ''Send letter of acceptance'' and ''Send acceptance'', the Jaro similarity value becomes 0.76.
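A sketch of the Jaro computation, assuming the standard matching window of ⌊max(n, m)/2⌋ − 1 (the conventional formulation; the paper's exact window definition may differ slightly):

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity from common characters and transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    match1 = [False] * len(s1)
    match2 = [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):
        # a character is common only if an unmatched copy lies within the window
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    seq1 = [c for i, c in enumerate(s1) if match1[i]]
    seq2 = [c for j, c in enumerate(s2) if match2[j]]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2   # transpositions
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

print(round(jaro("Send letter of acceptance", "Send acceptance"), 2))  # 0.76
```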

12) JARO SIMILARITY VARIANT
This technique is a variant of Jaro similarity with a small change: it also uses the common characters computed in the opposite direction, cc(L2, L1), because cc(L1, L2) is, in general, not equal to cc(L2, L1). The equation returns a similarity score between 0 and 1, where 1 means both labels are identical and 0 means both labels are completely dissimilar. For the two labels ''Send letter of acceptance'' and ''Send acceptance'', the similarity score becomes 0.73.

13) MEAN OF STATIC WEIGHTED WORD COMPARISON
This technique assigns a weight of 1.0 to the words common to both labels and a weight of 0.75 to those pairs of uncommon words that are synonyms of each other. Subsequently, the score is normalized by the word count of the longer label. As this technique assigns a higher value to common words and a lower value to synonymous words, it favors the common words. Formally, it is defined as follows:

Sim(L1, L2) = (1.0 · |L1 ∩ L2| + 0.75 · S) / max(|L1|, |L2|)

where |L1 ∩ L2| is the count of words common to both labels, and S is the number of synonym pairs among the uncommon words, determined by a synonym function that returns 1 if a pair of uncommon words from the two labels are synonyms of each other and 0 otherwise. The WordNet dictionary is used to look up synonyms.
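The weighting scheme can be sketched as follows. The tiny synonym table stands in for the WordNet lookup and is purely a hypothetical illustration, as is the word pairing by simple membership tests:

```python
# Toy synonym table standing in for WordNet (an assumption for illustration only).
SYNONYMS = {frozenset({"exam", "test"}), frozenset({"conduct", "perform"})}

def are_synonyms(a: str, b: str) -> bool:
    return frozenset({a, b}) in SYNONYMS

def static_weighted_similarity(label1: str, label2: str,
                               w_identical: float = 1.0,
                               w_synonym: float = 0.75) -> float:
    """Weight identical words by 1.0 and synonym pairs among the remaining
    words by 0.75, then normalize by the longer label's word count."""
    words1, words2 = label1.lower().split(), label2.lower().split()
    common = set(words1) & set(words2)
    rest1 = [w for w in words1 if w not in common]
    rest2 = [w for w in words2 if w not in common]
    synonyms = sum(any(are_synonyms(w, v) for v in rest2) for w in rest1)
    score = w_identical * len(common) + w_synonym * synonyms
    return score / max(len(words1), len(words2))

print(static_weighted_similarity("conduct oral exam", "perform test"))  # 0.5
```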

14) MEAN OF DOUBLED WEIGHTED WORD COMPARISON
Like the previous technique, this technique assigns a weight wi = 1.0 to the words common to both labels and a weight ws = 0.75 to those pairs of uncommon words that are synonyms of each other. The values of these weights can be assigned dynamically between 0 and 1, but 1 and 0.75 are the recommended values for common words and synonymous words, respectively. Compared to the previous technique, the contribution of wi is doubled by multiplying it by 2, and the contribution of ws is doubled by counting the synonyms for both labels instead of the first label only. Finally, the similarity score is obtained by normalizing the resultant value by the word count of both labels:

Sim(L1, L2) = (2 · wi · |L1 ∩ L2| + ws · (S1 + S2)) / (|L1| + |L2|)

where |L1 ∩ L2| is the count of words common to both labels, and S1 and S2 are the synonym counts for the first and second label, respectively, determined by a synonym function that returns 1 if a pair of uncommon words from the two labels are synonyms of each other and 0 otherwise.

15) MONGE ELKAN
The Monge-Elkan technique takes a word from the first label, compares it with all words of the second label, and records only the maximum value of these comparisons. The procedure is repeated for all words of the first label. The resultant values are summed and normalized by the total number of words in the first label. Formally, it is defined as follows:

Sim(L1, L2) = ( Σ a∈L1 max b∈L2 WordSimilarity(a, b) ) / |L1|

In the equation, the WordSimilarity function returns one if the two words a and b are identical. If the two words are synonyms, a predefined value is returned; this value is less than one and greater than zero, and typically 0.5 or 0.75 is used. If the two words are neither identical nor synonyms, the function returns zero.
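The best-match-per-word aggregation can be sketched as follows; the word-level scores and the toy synonym table are hypothetical stand-ins for the WordNet-based WordSimilarity function:

```python
# Hypothetical word-level scores: 1.0 for identical words, 0.75 for
# synonyms (toy table standing in for WordNet), 0.0 otherwise.
SYNONYMS = {frozenset({"exam", "interview"})}

def word_similarity(a: str, b: str) -> float:
    if a == b:
        return 1.0
    if frozenset({a, b}) in SYNONYMS:
        return 0.75
    return 0.0

def monge_elkan(label1: str, label2: str) -> float:
    """Average, over the first label's words, of each word's best match
    in the second label."""
    words1, words2 = label1.lower().split(), label2.lower().split()
    return sum(max(word_similarity(a, b) for b in words2)
               for a in words1) / len(words1)
```

Note that the measure is asymmetric: it normalizes by the first label's word count only.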

16) MONGE ELKAN LIN
Similar to the Monge Elkan technique, this technique takes a word from the first label and compares it with all words of the second label, but records only the maximum value from these comparisons. The procedure is repeated for all words of the first label. The resulting values are summed and normalized by the total number of words in the first label. Formally, it is defined as follows:

Sim(L1, L2) = (Σ_{a ∈ L1} max_{b ∈ L2} Lin(a, b)) / |L1|

Compared to the earlier technique, this technique computes the Lin similarity value between two words instead of the synonym similarity.

17) DICE WITH SYNONYMS
It is a syntactic technique that considers only the words common to both labels; additionally, it considers synonyms. The equation for this technique is as follows:

Sim(L1, L2) = 2 · SCW / (WC(L1) + WC(L2))

where L is used for label representation and WC(L) denotes the word count of the specific label. Additionally, SCW is the count of words from the first label that have a common word or a synonym in the second label.
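A minimal sketch of Dice with synonyms follows; the synonym set is a hypothetical stand-in for a WordNet lookup.

```python
def dice_with_synonyms(label1, label2, synonyms=frozenset()):
    l1, l2 = label1.lower().split(), label2.lower().split()

    def has_match(word, other):
        # a word counts if it appears in the other label or has a synonym there
        return word in other or any(
            (word, b) in synonyms or (b, word) in synonyms for b in other)

    scw = sum(1 for a in l1 if has_match(a, l2))
    # Dice-style normalization by the word counts of both labels
    return 2 * scw / (len(l1) + len(l2))
```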

19) LIN WITH OPTIMAL PAIRS
This technique also computes the Lin similarity between two candidate labels by employing a similar approach. However, this technique pairs the words in an optimal way to obtain the overall maximum Lin similarity value for both labels, in contrast to greedy pairing, where each pair is selected based on the maximum Lin value. The Lin value of the label pair is calculated by averaging the Lin values of the word pairs that are selected by the optimal pairing. Furthermore, the Lin similarity is based on the Information Content (IC) of the LCS (Least Common Subsumer) divided by the sum of the Information Content of the synsets of both words. Formally, it is defined as follows:

Lin(s1, s2) = 2 · IC(lcs) / (IC(s1) + IC(s2))
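The difference between greedy and optimal pairing can be sketched as follows. For short labels, the optimal one-to-one assignment can be found by brute force over word permutations; the word-level similarity table below is hypothetical and stands in for WordNet Lin values.

```python
from itertools import permutations

# Hypothetical word-level Lin similarity values (stand-in for WordNet).
SIM = {("buy", "purchase"): 0.9, ("goods", "purchase"): 0.8,
       ("goods", "order"): 0.5, ("buy", "order"): 0.2}

def sim(a, b):
    if a == b:
        return 1.0
    return SIM.get((a, b), SIM.get((b, a), 0.0))

def greedy_score(l1, l2):
    # each word of the first label independently takes its best match,
    # so two words may claim the same word of the second label
    return sum(max(sim(a, b) for b in l2) for a in l1) / len(l1)

def optimal_score(l1, l2):
    # try every assignment of l1 words to DISTINCT l2 words, keep the best
    # (assumes len(l1) <= len(l2))
    best = max(sum(sim(a, b) for a, b in zip(l1, perm))
               for perm in permutations(l2, len(l1)))
    return best / len(l1)
```

In this toy example, greedy pairing lets both 'buy' and 'goods' match 'purchase' (score 0.85), while optimal pairing maximizes the total over one-to-one word assignments (score 0.70).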

20) LESK WITH GREEDY PAIRS
The LESK value of a label pair is calculated by averaging the LESK values of the word pairs that are selected by greedy pairing. Similar to the greedy pairing of Lin, each word of the first label is compared with all words of the second label, and the maximum LESK value for each word of the first label is chosen. These values are then multiplied by their respective IDF values. After that, all these values are added and then normalized by the word count of the first label. The same steps are repeated for the second label. In the end, both values are averaged to compute the resultant value.
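A minimal sketch of these steps is given below. The word-level LESK (gloss-overlap) scores and document frequencies are hypothetical stand-ins for values computed from WordNet glosses and a corpus.

```python
import math

# Hypothetical word-level LESK scores and document frequencies.
LESK = {("reject", "rejection"): 0.8}
DF = {"reject": 2, "send": 5, "letter": 4, "of": 10, "rejection": 1}
N_DOCS = 10

def lesk(a, b):
    return 1.0 if a == b else LESK.get((a, b), LESK.get((b, a), 0.0))

def idf(word):
    return math.log(N_DOCS / DF.get(word, 1))

def directed_score(l1, l2):
    # best LESK match for each word of l1, weighted by that word's IDF,
    # normalized by the word count of l1
    return sum(idf(a) * max(lesk(a, b) for b in l2) for a in l1) / len(l1)

def lesk_greedy(l1, l2):
    # average of the two directed scores makes the measure symmetric
    return (directed_score(l1, l2) + directed_score(l2, l1)) / 2
```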

21) LESK WITH OPTIMAL PAIRS
This technique also computes the LESK similarity between two candidate labels. The difference is that this technique employs optimal pairing to obtain the overall maximum LESK similarity value for both labels, in contrast to greedy pairing, where each pair is selected based on the maximum LESK value. The LESK value of a label pair is calculated by averaging the LESK values of the word pairs that are selected by the optimal pairing.

22) PATH SIMILARITY
This technique does not consider the words common to both labels, because the inclusion of common words increases the similarity of both labels and could influence the overall similarity decision. The technique calculates the path similarity of the uncommon word pairs using the WordNet dictionary and normalizes the score by the count of words of the first label that are uncommon with the words of the second label. The path similarity is based on the shortest path between the two words in the WordNet taxonomy. Formally, PathSimilarity(w1, w2) denotes the path similarity value of the word pair w1 and w2 from the WordNet dictionary.
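The path measure itself can be sketched over a tiny hand-made is-a taxonomy; the taxonomy and words below are hypothetical, and in practice WordNet is used.

```python
from collections import deque

# Toy is-a taxonomy (parent -> children), standing in for WordNet.
EDGES = {
    "entity": ["document", "person"],
    "document": ["certificate", "letter"],
    "person": ["applicant"],
}

def neighbors(node):
    # treat is-a links as undirected for shortest-path search
    adj = set(EDGES.get(node, []))
    adj |= {p for p, kids in EDGES.items() if node in kids}
    return adj

def path_similarity(a, b):
    # 1 / (shortest path length + 1), as in WordNet's path measure
    if a == b:
        return 1.0
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:  # breadth-first search from a to b
        node, d = frontier.popleft()
        if node == b:
            return 1.0 / (d + 1)
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return 0.0  # no connecting path
```

For example, 'certificate' and 'letter' are two edges apart via 'document', giving a path similarity of 1/3.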

23) INTERSECTION OF SENSES
This technique considers only the uncommon words of both labels; the common words of both labels are ignored. For each label, a synonym set is maintained that contains all the WordNet synonym sets (synsets) of the uncommon words of that label. After that, the length of the intersection of these two sets is divided by the length of the larger set. The equation for this technique is as follows:

Sim(L1, L2) = |∂(L1) ∩ ∂(L2)| / max(|∂(L1)|, |∂(L2)|)

where ∂(L1) is the synonym set of all uncommon words of the first label, and ∂(L2) is defined analogously for the second label.
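A minimal sketch of this measure is shown below, with a toy synset lookup standing in for WordNet; the synset names are hypothetical.

```python
# Toy synset lookup standing in for WordNet (hypothetical synsets).
SYNSETS = {
    "assess": {"assess.v.01", "evaluate.v.01"},
    "evaluate": {"evaluate.v.01"},
    "candidate": {"candidate.n.01", "applicant.n.01"},
    "applicant": {"applicant.n.01"},
}

def intersection_of_senses(label1, label2):
    l1, l2 = label1.lower().split(), label2.lower().split()
    common = set(l1) & set(l2)
    # collect the synsets of the UNCOMMON words only
    s1 = set().union(*(SYNSETS.get(w, set()) for w in l1 if w not in common))
    s2 = set().union(*(SYNSETS.get(w, set()) for w in l2 if w not in common))
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / max(len(s1), len(s2))
```

Note that identical labels score 0 under this measure, since all of their words are common and are therefore ignored; this is why the technique is meant to complement, not replace, common-word measures.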

24) LIN OF SEMANTIC COMPONENTS
This technique extracts three semantic components (action, business object, additional information) from both labels, and subsequently the Lin value is calculated for each component of both labels. Finally, these values are summed, and the resultant value is normalized by the maximum number of semantic components in any label. Formally, it is defined as follows:

Sim(L1, L2) = (Lin_a(L1, L2) + Lin_bo(L1, L2) + Lin_add(L1, L2)) / max(|L1|, |L2|)

In the equation, Lin_a(L1, L2) calculates the Lin value between the action components of both labels. If a label does not have a component, then 0 is returned for that specific component instead of the Lin value. Here, a, bo, and add are used as abbreviations for the action, business object, and additional information components, and |L1| denotes the number of semantic components that are present in label L1. This technique assumes that the action semantic component is present in both labels.

25) WORD EMBEDDINGS
In addition to the above, we have also used three state-of-the-art representations of words, called word embeddings. In particular, we have used three established techniques for generating word embeddings: Word2vec, fastText, and GloVe. Word2vec embeddings were proposed by Google Inc. [19], fastText embeddings were proposed by Facebook Inc. [20], and the GloVe embeddings were proposed by a leading research group at Stanford University [21]. Among these techniques, Word2vec and fastText are based on skip-gram with the negative-sampling training method, whereas GloVe is based on global word-to-word co-occurrence statistics.
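A common way to apply such embeddings to label matching is to represent each label by the mean of its word vectors and compare labels by cosine similarity. The sketch below uses tiny hypothetical 3-dimensional vectors; in practice, pre-trained Word2vec, fastText, or GloVe vectors would be loaded instead.

```python
import math

# Tiny hypothetical word vectors (stand-ins for pre-trained embeddings).
VECTORS = {
    "assess": [0.9, 0.1, 0.0],
    "evaluate": [0.85, 0.15, 0.05],
    "candidate": [0.1, 0.9, 0.2],
    "applicant": [0.12, 0.88, 0.22],
}

def label_vector(label):
    # mean of the word vectors (assumes every word has a vector)
    words = [VECTORS[w] for w in label.lower().split() if w in VECTORS]
    n = len(words)
    return [sum(v[i] for v in words) / n for i in range(len(words[0]))]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def embedding_similarity(label1, label2):
    return cosine(label_vector(label1), label_vector(label2))
```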
In the subsequent sections, a comprehensive evaluation of these techniques is presented.

IV. COMPREHENSIVE EVALUATION OF TECHNIQUES
This section presents our third contribution, which is a comprehensive evaluation of all the similarity techniques identified in the preceding section. Below, we introduce the evaluation measures and experimental setup used for evaluating the similarity techniques. Subsequently, a comprehensive analysis of the results is presented.

A. EVALUATION MEASURES
The three widely established measures to evaluate the effectiveness of matching techniques are Precision, Recall, and F1 score. Precision is the ratio of true positive pairs to the sum of true positive and false positive pairs [22], Recall is the ratio of true positive pairs to the sum of true positive and false negative pairs, and the F1 score is the harmonic mean of Precision and Recall.
In this study, we have performed 432 matching experiments. That is, experiments were performed using the 27 matching techniques on all four datasets (UA, BR, AM, MG) to generate overall performance scores, as well as for the three types of pairs (VB, MC, and HR). For the experimentation, all the matching techniques were implemented in such a way that each implementation takes an activity pair as input and returns a normalized similarity score between 0 and 1. Subsequently, each similarity score is converted into a binary decision of equivalent or non-equivalent, represented by 1 and 0, respectively. These binary decisions are used to compute Precision, Recall, and F1 score. Note that for the overall results, the values of P, R, and F1 were computed, whereas for VB, MC, and HR pairs only Recall scores were computed, due to the absence of non-equivalent pairs. Several matching techniques involve computing the semantic similarity between words. In this study, we have used WordNet to determine the semantic similarity between words from two labels. WordNet is a well-established and widely used lexical database for the English language that is composed of over 150,000 concepts, including verbs, adverbs, nouns, and adjectives [23]. Related concepts in the database are grouped together, forming synsets [24], [25]. Furthermore, the concepts can have additional relations, such as is-a and part-of relationships, to form a network of concepts.
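The evaluation pipeline described above can be sketched as follows: similarity scores are binarized at a cutoff threshold and compared against gold-standard decisions. The 0.75 default threshold below is illustrative.

```python
def evaluate(scores, gold, threshold=0.75):
    # binarize each similarity score at the cutoff threshold
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(predicted, gold) if p == 0 and g == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```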

C. RESULTS AND ANALYSIS
The results of all the experiments performed using the 27 techniques and the four datasets are presented in Tables 4-7. Based on these results, we present the following three key observations: a comparison of matching techniques, a comparison of pair types, and a comparison of datasets. Subsequently, we present the results of the error analysis.

1) COMPARISON OF MATCHING TECHNIQUES
It can be observed from the results of the UA dataset presented in Table 4 that fastText outperformed all other techniques in terms of overall scores, as well as for each type of pair. Similarly, it can be observed from the results of the BR and MG datasets presented in Tables 5 and 7 that the scores achieved by fastText are higher than those of all the other 26 techniques. A possible reason for the higher effectiveness of fastText is that these embeddings generate a vector representation of each word in a high-dimensional space using a very large collection of tokens, which is useful for calculating the similarity between label pairs. A notable observation from the results of the AM dataset presented in Table 6 is that the F1 scores achieved by some matching techniques, such as Jaro and Path similarity, are higher than that of fastText. However, the Recall scores achieved by these techniques are lower than those of the other techniques, indicating the effectiveness of fastText in identifying a higher number of corresponding pairs.

2) COMPARISON OF TYPES OF PAIRS
To gain deeper insights into the performance of the matching techniques, we computed the average Recall score for each type of pair: VB, MC, and HR. The average scores generated for the three types of pairs using the UA, BR, AM, and MG datasets are presented in Figures 6, 7, 8, and 9, respectively. It can be observed from the results of the UA dataset presented in Figure 6 that all the types of techniques achieved a very high Recall score for VB pairs, a lower score for MC pairs, and an even lower score for HR pairs. These results indicate: a) a deeper insight into the effectiveness of matching techniques is desired, b) the overall F1 score achieved by each matching technique is inflated due to the VB pairs, and c) the HR corresponding pairs are harder to detect than the other two types of pairs. It can also be observed from the figure that the difference in the performance of the techniques for VB pairs is minuscule. A manual review of these pairs revealed that the differences between VB pairs are so small that even the syntactic measures can identify these corresponding pairs. Similar observations can be made about the BR, AM, and MG datasets from the results presented in Figures 7, 8, and 9. It can further be observed that the overall Recall score achieved by the techniques is the highest for the AM dataset, whereas it is the lowest for the MG dataset. A manual analysis of the AM dataset revealed that a majority of the corresponding pairs, almost 67%, lie in the VB category, and hence they are easily detectable. It can also be observed that there is only a slight difference between the effectiveness of the techniques for VB pairs, whereas the difference is significant for MC and HR pairs. The reason for the VB pairs is that the labels involved in these pairs differ only slightly from each other; hence, these corresponding pairs are identified by all the techniques.

D. ERROR ANALYSIS
We have performed an error analysis of the results of the matching techniques. In particular, Table 8 presents a quantitative analysis of the results of word embeddings, which are the most effective category of matching techniques. The first row in the table shows the number of corresponding pairs that were identified by at least one type of word embedding but not identified by any syntactic or semantic technique. These numbers represent the effectiveness of the word embedding techniques as compared to the other techniques.
The second row in Table 8 shows the number of corresponding pairs in each dataset which were not identified by any matching technique, including word embeddings. It can be observed from the table that 3893 pairs in the MG dataset were not identified by any matching technique, which is 47% of total corresponding pairs. This higher percentage of the corresponding pairs represents the challenging nature of the MG dataset.
We also performed a qualitative analysis of the corresponding pairs that were not identified by any matching technique. In particular, we manually analyzed each corresponding pair and grouped the pairs together to identify the causes of errors. A brief overview of these causes is as follows: • Firstly, the activity pairs in which the length of one label is significantly greater than that of the other label in the pair were not identified by any matching technique. Examples of such pairs are presented in Table 9.
• Secondly, the activity pairs that were declared as corresponding pairs in a business context were not identified by any matching technique. That is, these pairs include words having no semantic relationship with each other according to WordNet; however, the two labels have the same business meaning in the context of the business process. An example of such an activity pair is ''Initiate discussion if on due date -Start debate through message''.
• Thirdly, the corresponding activity pairs in which one label subsumes the other label were not correctly identified by any matching technique. For instance, in the corresponding pair ''Reject -Send letter of rejection'', the latter label subsumes the former label. Similarly, ''Update the applicant status -Mark applicant as suitable'' is an example of a corresponding pair in which one label semantically subsumes the other. That is, both the options of marking an applicant as suitable and unsuitable are represented in the label ''Update the applicant status'', whereas the other label is semantically subsumed in the first label as it represents only one option.
• Finally, the corresponding activity pairs in which the number of semantically similar words is smaller than the number of dissimilar words were not identified by the matching techniques. Our manual investigation of these errors identified that this is due to the cutoff threshold of 0.75.

V. CONCLUSIONS AND DISCUSSION
The effectiveness of PMM techniques is evaluated using three widely used benchmark datasets, in terms of Precision, Recall, and F1 scores. Our comprehensive analysis of these datasets, as well as of the results of existing techniques, has revealed two interrelated issues: a) the datasets include a substantial number of identical or similarly labeled activity pairs, which inflates the performance scores, and b) the overall Precision, Recall, and F1 scores are useful for a surface-level evaluation of a technique; however, they do not provide the necessary insights into the capabilities of a matching technique. In this study, we have made three key contributions. Firstly, we have refined four existing datasets containing 92,565 activity pairs by removing their discrepancies. Subsequently, we have enhanced these datasets by classifying the corresponding pairs into three types, based on the level of differences in the participating labels. The notable features of this contribution are that these types are well-rooted in the literature and that we have employed a systematic procedure for the extension, involving clearly defined classification criteria and multiple researchers.
Secondly, we have identified an extensive set of 27 matching techniques from a literature survey and developed a taxonomy of these techniques. This study includes the procedure for developing the taxonomy along with an overview of each matching technique. As a third contribution, we have performed 432 matching experiments using the enhanced datasets for a comprehensive evaluation of 27 matching techniques.
The results of the experiments revealed that the state-of-the-art word embeddings are somewhat promising for identifying two types of activity pairs, Modified Copy and Heavy Revision pairs, whereas both syntactic and semantic techniques almost completely failed in the identification of Heavy Revision pairs. Based on the results, we recommend the development of a next generation of matching techniques that can show promising results for all three types of pairs.

MUHAMMAD ALI received the M.Sc. degree from the Punjab University College of Information Technology (PUCIT), University of the Punjab, Lahore, and the master's degree from the University of Management and Technology. He is currently pursuing the Ph.D. degree with the PUCIT, University of the Punjab. He is a Systems Analyst with the Directorate of IT Services, University of Gujrat, Gujrat. He has over a decade of software development experience.
KHURRAM SHAHZAD received the master's and Ph.D. degrees from the KTH Royal Institute of Technology, Stockholm. He has been associated with the Information Systems Group, Technical University Eindhoven, Eindhoven, and the University of Fribourg, Fribourg. He is currently an Assistant Professor with the Punjab University College of Information Technology (PUCIT), University of the Punjab, Lahore. He has published more than 35 articles in international conferences and journals.
SYED IRTAZA MUZAFFAR received the M.Phil. degree in computer science from the Punjab University College of Information Technology (PUCIT), University of the Punjab, Lahore, where he is currently pursuing the Ph.D. degree with the PUCIT. He has three years of teaching and development experience. He is currently a Lecturer of computer science with the University of Central Punjab, Lahore.

MUHAMMAD KAMRAN MALIK received the Ph.D. degree in computer science. He has more than 15 years of teaching and development experience and holds one U.S. patent. He has provided consultancy to many multinational firms on natural language processing, machine learning, and data science. He has authored 25 journal articles and conference papers. He is an Assistant Professor with the Punjab University College of Information Technology (PUCIT), University of the Punjab, Lahore, Pakistan. His research interests include natural language processing, machine learning, and data science.