Exploring Natural Language Processing in Model-To-Model Transformations

In this paper, we explore the possibility of applying natural language processing in visual model-to-model (M2M) transformations. We present our research results on information extraction from text labels in process models created using Business Process Model and Notation (BPMN) and use case models depicted in Unified Modeling Language (UML), using the most recent developments in natural language processing (NLP). Here, we focus on three relevant tasks, namely, the extraction of verb/noun phrases that would be used to form relations, the parsing of conjunctive/disjunctive statements, and the detection of abbreviations and acronyms. Techniques combining state-of-the-art NLP language models with formal regular expression grammar-based structure detection were implemented to solve the relation extraction task. To achieve these goals, we benchmark the most recent state-of-the-art NLP tools (CoreNLP, Stanford Stanza, Flair, Spacy, AllenNLP, BERT, ELECTRA), as well as custom BERT-BiLSTM-CRF and ELMo-BiLSTM-CRF implementations, trained with certain data augmentations to improve performance on the most ambiguous cases; these tools are further used to extract noun and verb phrases from short text labels generally used in UML and BPMN models. Furthermore, we describe our attempts to improve these extractors by solving the abbreviation/acronym detection problem using machine learning-based detection, as well as by processing conjunctive and disjunctive statements, due to their relevance to performing advanced text normalization. The obtained results show that the best phrase extraction and conjunctive phrase processing performance was obtained using the Stanza-based implementation; yet, our trained BERT-BiLSTM-CRF outperformed it for the verb phrase detection task. While this work was inspired by our ongoing research on partial model-to-model transformations, we believe it to be applicable in other areas requiring similar text processing capabilities as well.


I. INTRODUCTION
As one of the most established topics in natural language processing (NLP), information extraction is focused on extracting various structures of interest from unstructured textual information. Recent advances in deep learning and NLP fields enable the development of high-performing models by using large amounts of data and wide contexts to automatically extract relevant features, which can be transferred and reused in other related tasks. Such techniques enable complex context-driven detection of grammatical and semantic inconsistencies [1], extraction of relations, aspects, or entities [2], [3], tagging of entities of interest in the text [4], deduplication, identification of similarities or synonymous forms [5], and solutions to other similar problems. Moreover, successful implementation of such tasks requires fundamental knowledge about multiple techniques at the intersection of information retrieval, computational linguistics, ontology engineering, and machine learning. This work is inspired by our previous research on NLP-enhanced information extraction in model-to-model transformations [6], [7]. However, the need for similar solutions was also identified in other areas involving visual modeling, such as business process modeling [8], [9], [10].
The associate editor coordinating the review of this manuscript and approving it for publication was Arianna Dulizia.
In this paper, we address the issue of relation extraction from graphical models focused on the detection of semantic relationships within the given text. More specifically, we aim to extract subject-verb relations which can be easily extended to triplets (subject, verb, object) using associative or compositional relationships from the source model (for instance, a Use Case element is usually associated with one or more Actors using an Association relationship). Therefore, such relationships will be defined between two or more entities and represent certain connections between them. Many recent papers address relation detection between entities of predefined types (such as PERSON, ORGANIZATION, LOCATION) and their semantic relations using supervised learning [11], while we aim to perform more generalized extraction by extracting all available verb and noun pairs. This is not a trivial task, although it has been previously addressed in document processing using pattern-based analysis [12], distant supervision [13], [14] and rule-based extraction systems [15]. In addition to the extraction of verb/noun phrases from the text labels, in this paper, we also study the problem of identifying and properly interpreting abbreviations and acronyms, which is a very relevant topic in model-driven systems development, especially in the field of automated model transformations. While it may be handled using external sources, like acronym databases, dictionaries, or thesauri, real-world cases may be more complex to interpret due to ambiguities, contextual dependency, or simply the lack of proper text formatting (for instance, acronyms may be written in lowercase if a less formal communication or discourse context is considered, such as chatbots or tweets). Finally, we address the problem of processing conjunctive/disjunctive statements by parsing them into multiple "subject-verb" relations. In the context of our research, they can later be combined with related elements to form valid associative relations (triplets). This is also a sophisticated problem due to the natural language ambiguities or inconsistency in the underlying NLP technology. All the above-mentioned issues are discussed in more detail in Section III.
The main objective of this paper is to evaluate the capabilities of the most recent developments in NLP for processing text labels in graphical models and to validate their suitability by performing the extraction of noun/verb phrases from the names of model elements under certain real-world conditions and constraints which are usually not addressed in more generalized NLP-related research.
To solve these problems, we first identify and enumerate multiple anti-patterns for naming model elements extracted from a real-world dataset; these anti-patterns complicate the task and should be handled separately by using additional techniques. Further, we apply deep learning-based sequence tagging models, trained with augmented data to address some of these ambiguities, and combine them with predefined formal grammar-based extraction. In this paper, we specifically consider the processing of text labels in graphical models created using two prominent visual modeling standards, namely, Business Process Model and Notation (BPMN) [16] and Unified Modeling Language (UML) [17]. To our knowledge, this research is one of the first attempts to apply novel deep learning-driven techniques for the extraction of information from such models. Additionally, we provide evaluations of two related tasks, namely, conjunctive/disjunctive statement processing and acronym detection, which may significantly enhance the performance of our developed relation extractors in this context. We consider our findings to be also applicable to other NLP topics that involve the processing of similar texts, such as process mining, aspect-based sentiment analysis, or conversational intelligence.
Further in this paper, Section II gives a short introduction to model-to-model transformations with their reliance upon NLP functionality and provides a concise review of NLP techniques that we consider to be relevant to our research and model-to-model transformations in general. Section III summarizes the main challenges, which must be addressed when solving similar problems, and provides a structured list of element naming anti-patterns, which provide additional noise during automated text processing and illustrate the complexity of this problem. Further, solutions for three inter-related tasks are discussed: Section IV describes the verb/noun extraction task and the experimental results on this subject; Section V deals with the processing of conjunctive and disjunctive statements; similarly, Section VI presents abbreviation and acronym detection challenges together with the corresponding experimental results. Section VII provides a discussion of our experimental findings, the identified issues, and possible improvements. Finally, the paper is concluded with Section VIII providing certain insights on the future work and conclusions.

II. INTRODUCING NLP TO MODEL-TO-MODEL (M2M) TRANSFORMATIONS
Let us assume that a system analyst has a valid UML use case model, created either by himself or obtained from external parties, which he intends to use as a part of some system specification. Therefore, he wants to use it as a source of knowledge to develop a conceptual data model for that business domain in the form of a UML class model. Model-to-model transformations enable direct reuse of the input model without the need to manually develop the target model; they also provide the benefits of transferring and reusing the whole logic of model transformations for other instances. Unfortunately, existing solutions provide only complete model transformations, which are quite rigid due to their solid formal foundations and offer very limited support for integration with complementary functionality, such as natural language processing [18], [19]. Therefore, in this section we will rely on our own development [7], [20] to demonstrate use cases for NLP-based transformations, as our solution enables the user to apply intuitive drag-and-drop actions to certain source model elements and provides relevant extension points to integrate the required functionality. These actions trigger selective transformation actions to generate a set of one or more related target model elements and represent those elements in the opened target model diagram.
In our example, we use the UML use case model as the source model, and UML class model as the target model. Furthermore, we present the situation where it is necessary to apply more advanced processing to produce a semantically valid fragment of a target model. We assume that the user dragged Actor element Customer from the UML use case model onto the opened UML class diagram Order Management (Fig. 1, tag 1), which triggered a transformation action to execute the specific transformation specification. This specification is visually designed and is specified to be executed particularly after an action, dragging an Actor element from the use case model onto the UML class diagram, is triggered. The transformation specification instructs the transformation engine to select Customer element together with instances of Use Case elements associated with this Actor and transform them into UML Class elements and a set of UML Associations connecting those classes. Now, we assume that in the exemplary use case model, Customer is associated with two Use Case elements, particularly, Return back item and Fillin complaint form. This results in generation of a UML class diagram fragment as presented in Fig. 1, tag 2.
While at first sight this would seem like a straightforward and simple transformation, this particular example illustrates a situation where certain NLP processing is already required to acquire a correct result. The reason behind this is that the conditions defining the extraction of multi-word verb and/or noun phrases are non-trivial. In our case, the association between the two classes Customer and Item is named with the two-word phrase return back, which is extracted from the name of the source element, particularly the Use Case element Return back item. Moreover, actual verb phrases are not limited to one or two words: they may be phrasal-prepositional verbs containing both a particle and a preposition (come up with) or may even be distributed across the whole phrase, e.g., when the particle follows the object (associate the object with), although such cases are observed less frequently in the formal language used in modeling practice.
The above-mentioned examples are just sample cases where straightforward text chunking is not sufficient and certain involvement of NLP technology is required to obtain correct transformation results. Further, we provide more examples which may require additional steps for linguistic preprocessing:
• Hierarchical relations created after one element is dragged onto another if text labels of these elements match some form of the semantic relationships (such as generalization, synonymy, hyponymy, hypernymy or holonymy).
• Entity deduplication when multiple entries have the same meaning but different expressions. In some cases they are not considered synonyms; for instance, acronym and abbreviation resolution does not result in synonymous entries but rather in duplicate representations.
• Processing of more complex phrasal structures like conjunctions/disjunctions, or combinations of the above (e.g., create invoice and send it to the manager). This may also include mining of ternary associations or relationships, as well as identifying possible coreferences.
• Text normalization, such as having two sets of elements that differ only in syntactic structures. For instance, consider two sets of associated elements in the source model, Actor Administrator and Use Case Monitors instance, and Actor Administrator and Use Case Monitor instance. The only difference here lies in the present tense form of the verb monitors, where normalization to the infinitive form monitor would result in deduplication of output elements and, hence, a clearer and more concise output model. While this is a very straightforward and less likely scenario, more sophisticated cases may involve disambiguation of acronyms, or detection of missing words as well as grammatical errors.
Furthermore, we list the main NLP fields which could be applicable in this context in Table 1, together with our insights on their further applications in improving the quality of model-to-model transformations. Most of them will not be considered in this research; yet, they are proposed as additional extension points for improving the final pipeline. Moreover, this table is also supplemented with core techniques used to solve these problems; it clearly indicates that deep learning techniques are the most widely researched and applied to solve these problems. For more extensive reviews of the techniques, as well as more discussions on their weaknesses or future prospects, we refer to recent NLP survey papers such as [76], [77], [78], [79], and [80]. Additionally, their performance can be significantly boosted after applying transfer learning with pretrained language models, such as BERT [39], ELMo [41], RoBERTa [81], ELECTRA [82], XLNet [83], T5 [84] or Microsoft's DeBERTa [85]. Therefore, from the technological point of view, one would need to consider the integration of deep learning-based techniques, which requires satisfying certain technological constraints. This is the first work which tries to bridge these two fields by performing a thorough evaluation of the existing NLP implementations for processing short text labels, which is required in the context of model-to-model transformations.

III. RELATION EXTRACTION-RELATED TEXT LABELING ANTI-PATTERNS
In this section, we enumerate a set of modeling and element naming issues, which make the automated processing of labels in graphical models rather intricate. While certain modeling best practices are generally considered in modeling [86], [87], actual real-world cases tend to contain various issues (such as linguistic or modeling ambiguities) that make them very difficult to deal with using automated tools. Hence, while the processing of text labels created following best modeling practices could be considered a relatively uncomplicated task (assuming that the tagging bias of the underlying implementation is not considered), significant deviations might easily complicate it.
VOLUME 10, 2022
To identify the most common text labeling issues in graphical models, we used a large dataset provided by the BPM Academic Initiative (BPMAI) [88], which contained over 4100 real-world process models presented in BPMN notation. We excluded instances that did not meet certain requirements (e.g., all the elements in the models were named using single letters without any semantic meaning, or the text labels were not in English). Labels from the BPMN Task elements were extracted from the remaining models as one of the main objects of interest in our research. After analyzing the extracted labels, a set of naming anti-patterns for activity-like Task elements was formed (Table 2) together with examples and some heuristic rules for detecting these anti-patterns; in our opinion, the latter could be applied for the initial screening and filtering tasks in other types of graphical models as well.
The detection rules are not formal in any way but can be used as guidelines to identify the cases of anti-patterns. Also, the morphosyntactic analysis might have to be carried out to properly detect sophisticated cases of element naming anti-patterns in graphical models. Moreover, other elements representing subjects or entities (such as BPMN Lane, Pool elements) may contain invalid names as well, including multiple subjects, phrases, or some of the anti-patterns from Table 2.
It is worth noting that some of the observed naming cases indicated invalid modeling practices, for instance, naming activity elements as conditions or decision points (e.g., Available, Yes, Check if available). Naming activities as whole triplets <actor-relationship-activity> is yet another quite common bad practice in process modeling. The latter should be transformed into a combination of a BPMN Lane or Pool element with an activity-like element in it. One may also identify cases that combine multiple anti-patterns; for instance, the name of an activity may contain conjunctive/disjunctive clauses that join multiple verb phrases into one rambling label (e.g., Mark the invoice as invalid and return to customer), which considerably increases the complexity of the NLP tasks. Even though resolving conjunctive/disjunctive clauses is a challenging task, it can still be handled by using dependency parsing-based extraction, which is further addressed in Section V.
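The kind of initial screening discussed above can be sketched in a few lines of Python. This is only an illustrative sketch: the flag names, the word lists, and the function `screen_label` are hypothetical and are not the detection rules from Table 2; a real implementation would likely rely on morphosyntactic analysis rather than keyword checks.

```python
import re
from typing import List

# Hypothetical keyword lists chosen for illustration only.
CONDITION_WORDS = {"yes", "no", "if", "whether"}
CONJUNCTIONS = {"and", "or"}


def screen_label(label: str) -> List[str]:
    """Return heuristic flags suggesting that a label may need special handling."""
    tokens = re.findall(r"[a-z]+", label.lower())
    flags = []
    # Labels such as "A" or "B" carry no semantic meaning.
    if not tokens or all(len(t) == 1 for t in tokens):
        flags.append("no_semantic_content")
    # A single word cannot contain both a verb phrase and a noun phrase.
    if len(tokens) < 2:
        flags.append("too_short_for_verb_noun_pair")
    # Activities named as conditions or decision points.
    if label.strip().endswith("?") or CONDITION_WORDS & set(tokens):
        flags.append("condition_like")
    # Explicit or comma-implied conjunctions/disjunctions.
    if "," in label or CONJUNCTIONS & set(tokens):
        flags.append("conjunctive_or_disjunctive")
    return flags
```

Labels that receive no flags (for example, Create invoice) would pass directly to phrase extraction, while flagged labels could be filtered out or routed to dedicated processing.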

IV. PHRASE EXTRACTION EXPERIMENT
In this section, we evaluate the capabilities of the existing NLP tools to properly extract noun/verb phrases from the given text labels. This task is closely related to the relation extraction task, given its goal to extract tuples (verb phrase, noun phrase) from the given chunk of text that can further be used to construct semantic associative relations after combining with semantics from the source models (e.g., associative relationships between UML Use Case and Actor elements). Moreover, this task is important for successful model-to-model transformations because the extracted tuples are used to generate sets of elements for various target models or augment the existing models with additional elements.
Further, we present basic aspects of our experimentation on extracting noun/verb phrases from the text labels extracted from the real-world dataset which contains BPMN process models and UML use case models; both types of these models contain activity-like elements which are subjects for specific processing. Section IV-A describes the preliminaries and setup of the experiment, while Section IV-B presents the evaluation methodology; in its turn, Section IV-C elaborates on the main findings in this experiment.

A. EXPERIMENT SETUP
Information extraction (and more specifically, relation extraction) is widely supported by multiple commercial and academic engineering efforts that provided multiple options for selecting the initial starting point for our research. While new techniques emerge frequently, they are based on the generally available text corpora that do not provide the flexibility and specificity required to fulfill our goals. More specifically, our initial testing of such tools helped us to recognize the possibility of confusion in verb/noun recognition if the infinitive verb form is used; this is not handled correctly by generic POS taggers. On the other hand, the development of specialized datasets is usually challenging and time-demanding. Therefore, given the lack of specialized resources required for successful implementation, we chose to adopt and test existing tools by complementing them with additional extraction functionality and applying certain enhancements to the existing ones. Moreover, some of these toolkits provide implementations for a wide array of related problems, such as tokenization, POS tagging, lemmatization, syntactic analysis, dependency parsing, co-reference resolution, or relation extraction, which may significantly enhance the required pipelines. Additionally, some libraries provide other interesting tools; for instance, Stanford CoreNLP [89] also provides a natural logic annotator that enables quantifier detection and annotation, as well as CRF-based true case recognition, which is also important for knowledge base acquisition and normalization and relates to the problems addressed in this work. While quantifier detection is not among such issues, it can be tested and integrated into future pipelines as well.
Further, we list the set of implementations selected for our experimental implementation and evaluation 1 :
1 The final datasets, experimental code and results are available at https:
• Stanford CoreNLP toolkit [89] which relies on conditional random field (CRF) implementations for performing both part-of-speech tagging and NER-related tasks.
• Stanford Stanza [90] which uses Bi-LSTM to implement components and pipelines for multiple NLP tasks such as tokenization, lemmatization, POS tagging, and dependency/constituency parsing.
• Flair [24] toolkit by Zalando Research, which applies pooled contextualized embeddings together with deep recurrent neural networks, as well as provides its pretrained language models.
• AllenNLP [25] which relies on deeply contextualized ELMo embeddings based on combined character-level CNN and Bi-LSTM architecture.
• BERT [39], which is one of the most dominant techniques in NLP at the moment of writing this paper, based on transformer architecture and masked language modeling.
• ELECTRA [82], which is an improvement over BERT that replaces tokens with plausible alternatives sampled from a generator network during model training, instead of using masked tokens. The main goal of the model is to predict whether each token in the corrupted input was replaced with a generator sample. The ELECTRA authors show that this pretraining task is more efficient than the one used in BERT and that the final model is capable of substantially outperforming the BERT model in terms of model size, amount of computing, and scalability [82].
The fact that these tools use different machine learning or deep learning approaches to solve NLP tasks has also motivated us to test their performance in the context of our approach. In this work, we use the BERT 2 and ELECTRA 3 implementations from the Hugging Face repository, which are already fine-tuned for part-of-speech tagging tasks.
Additionally, we developed our own taggers that were biased towards the recognition of conflicting verb forms by performing augmentations of the original text inputs with their copies containing infinitive verb forms as replacements for the original ones; a similar approach was successfully applied in our previous work to improve performance for the base CRF-based tagger [6]. The OntoNotes corpus [91] was used as the base data source due to its resemblance to the communication cases observed in graphical process and system models. For the reference implementation, we selected the Bi-LSTM-CRF architecture [37], which has been proven to be the best performing one at the time of writing. It consists of a single input embeddings layer, a bidirectional LSTM hidden layer to process both past and future features, and a CRF layer at the output, which helps to improve tagging accuracy by learning and applying constraints at the sentence level to simultaneously optimize the labeling output and ensure its validity. For our experimental purposes, we implemented two versions of our customized taggers:
• BERT-BiLSTM-CRF that uses original pretrained BERT embeddings at the input layer,
• ELMo-BiLSTM-CRF that relies on ELMo embeddings at the input layer.
For training these models, CRF (also known as Viterbi) loss, based on the maximization of the conditional probability, was used; for more details on its derivation, we refer to [37] and [92]. Moreover, the learning rate was set to 0.1, the hidden layer size was set to 128, and the early stopping parameter for termination, if no convergence is further observed, was set to 10. The SimpleNLG library [93] was used to normalize tense for verb phrases, while the NLTK [22] toolkit was used to implement text chunking with POS tags obtained as an output from the above-mentioned tools.
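The augmentation idea described above (adding copies of training sentences in which verbs appear in their infinitive form) could look roughly as follows. This is a minimal sketch under stated assumptions, not the procedure actually used for training: the set of replaced tags, the choice of VB as the new tag, the function name `augment_with_infinitives`, and the caller-supplied `lemmatize` function are all hypothetical.

```python
from typing import Callable, List, Tuple

TaggedSentence = List[Tuple[str, str]]  # (word, Penn Treebank POS tag)

# Assumption: finite verb forms are the ones replaced by base forms.
FINITE_VERB_TAGS = {"VBZ", "VBD", "VBP"}


def augment_with_infinitives(
    sentences: List[TaggedSentence],
    lemmatize: Callable[[str], str],
) -> List[TaggedSentence]:
    """Return the original sentences plus copies with verbs in base form."""
    augmented = list(sentences)
    for sentence in sentences:
        copy = [
            (lemmatize(word), "VB") if tag in FINITE_VERB_TAGS else (word, tag)
            for word, tag in sentence
        ]
        # Add the copy only if at least one verb was actually rewritten.
        if copy != sentence:
            augmented.append(copy)
    return augmented


# Toy lemma lookup standing in for a real lemmatizer (illustration only).
toy_lemmas = {"checks": "check", "sent": "send"}
example = [[("The", "DT"), ("clerk", "NN"), ("checks", "VBZ"), ("invoices", "NNS")]]
result = augment_with_infinitives(example, lambda w: toy_lemmas.get(w, w))
```

In practice the toy lookup would be replaced by a proper lemmatizer, and the proportion of augmented copies would need tuning to limit the bias discussed in the results.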
Listing 1 presents the formal grammar, based on regular expressions (regex) over part-of-speech tags, which was used for noun/verb phrase extraction. It relies upon the Universal Dependencies scheme [94], which is used by the Spacy, Flair, and Stanza tools. Here, NP defines a noun phrase, VP defines a verb phrase, and PNP defines a proper noun phrase. As the Stanford CoreNLP and ELECTRA pretrained implementations use Penn Treebank notation for their POS tagger output, the grammar is adjusted for their cases (Listing 2); here, additionally, ADP defines an adposition, and ANP defines a partial noun phrase, which is further used as a block in NP extraction.
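To illustrate how a regex grammar over part-of-speech tags can drive phrase extraction, the following self-contained Python sketch encodes the tag sequence as a string and matches chunk patterns against it. The two patterns are hypothetical simplifications written for this example; they are not the grammars from Listing 1 or Listing 2, and the sketch uses only the standard library instead of the NLTK chunker.

```python
import re
from typing import List, Tuple

Tagged = List[Tuple[str, str]]  # (word, Universal POS tag)

# Hypothetical, simplified patterns; each tag is encoded as "<TAG>".
GRAMMAR = [
    ("VP", re.compile(r"(?:<AUX>)*<VERB>(?:<ADV>|<PART>)*")),
    ("NP", re.compile(r"(?:<DET>)?(?:<ADJ>)*(?:<NOUN>|<PROPN>)+")),
]


def chunk(tagged: Tagged) -> List[Tuple[str, str]]:
    """Extract (label, phrase) chunks in their textual order."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    found = []
    for label, pattern in GRAMMAR:
        for match in pattern.finditer(tag_string):
            # Every token contributes exactly one "<", so counting them
            # converts character offsets back into token indexes.
            start = tag_string.count("<", 0, match.start())
            end = tag_string.count("<", 0, match.end())
            phrase = " ".join(word for word, _ in tagged[start:end])
            found.append((start, label, phrase))
    found.sort()
    return [(label, phrase) for _, label, phrase in found]


pairs = chunk([("return", "VERB"), ("back", "ADV"), ("item", "NOUN")])
```

With the tags given in the example, the label return back item yields a verb phrase return back followed by a noun phrase item. The quality of such extraction depends entirely on the upstream tagger, which is why tagging errors on infinitive verb forms matter so much for short labels.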
The datasets used in the experiments were obtained after pre-processing a relatively large number of BPMN process and UML use case models, obtained from various sources. The final experimentation set of such models consisted of:
• 32 BPMN process models and 25 UML models that were collected freely from the Internet;
• A large sample of preprocessed and cleansed BPMN process models, which were selected from a large set of Signavio BPMN models provided by BPM Academic Initiative [88].
The acquired final set of models was processed, and the names of Task elements (for BPMN process models) and Use Case elements (for UML use case models) were extracted for experimentation. It was expected that Task and Use Case elements would contain at least one verb or verb phrase, and one noun or noun phrase. The extracted elements were cleaned of semantic inconsistencies, grammatical errors, invalid names, and common modeling errors, as well as filtered to exclude invalid practices listed in Table 2. In this stage, we also excluded entries containing multiple verb phrases in their names (e.g., conjunctive/disjunctive clauses), as the recognition of such structures was not a part of this experiment (this is later addressed in Section V). However, having a single verb phrase with multiple noun phrases in conjunctive or disjunctive form could be considered processable and would result in multiple valid tuples of target transformation outputs.
After performing the aforementioned steps, we obtained a dataset of 4044 valid entries that were then used to manually extract verb phrase and noun phrase pairs. The whole extraction procedure was performed by the authors of this paper. These pairs were set as a "gold standard" to validate the outputs acquired from the automatic extraction using the selected extractors. The final dataset included 328 instances having no verb phrases, and 3716 instances containing both verb and noun phrases.

B. EVALUATION METHODOLOGY
The developed extractors were evaluated in terms of accuracy, precision, recall, and F-measure, which measured the ability to match the acquired outputs to the "gold standard" outputs. In our experiment, two different aspects were taken into consideration:
• Whether the extractor successfully determined that the phrase contained one or more noun/verb phrases that must have been extracted. In case there is no particular phrase found, the output would be empty.
• Whether the extractor successfully extracted the required verb phrases or noun phrases. Note that it was required to evaluate if both verb phrases and noun phrases were successfully extracted. In cases where multiple phrases were marked as an output, it was considered that strictly all of them had to be present in the output for it to be marked as correct.
Extraction accuracy is defined as the ratio of correctly extracted verb/noun phrase instances (together with empty outputs when such instances were absent) to the total number of entries:
\[ \text{accuracy} = \frac{\text{number of correctly extracted instances}}{\text{number of total instances}} \]
Precision is defined as the ratio of correctly extracted concepts to the number of total extracted concepts, whereas recall is the ratio of correctly extracted concepts to the number of correct concepts:
\[ \text{precision} = \frac{\text{number of correctly extracted concepts}}{\text{number of total extracted concepts}}, \qquad \text{recall} = \frac{\text{number of correctly extracted concepts}}{\text{number of correct concepts}} \]
F1-measure (also referred to as F1-score) is defined as the harmonic mean of these two measures:
\[ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]
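The measures defined above can be computed as in the following Python sketch. It assumes one particular reading of the definitions: each entry is represented by a set of phrases (empty when nothing should be extracted), an entry counts as correct only when the predicted set equals the gold set exactly, and precision and recall are counted over entries with non-empty outputs. The function name `evaluate` and this counting scheme are assumptions made for illustration.

```python
from typing import Dict, List, Set


def evaluate(gold: List[Set[str]], predicted: List[Set[str]]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 under strict set matching."""
    total = len(gold)
    # Correct entries include empty outputs for entries with no gold phrases.
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    extracted = sum(1 for p in predicted if p)   # non-empty predictions
    relevant = sum(1 for g in gold if g)         # non-empty gold entries
    true_pos = sum(1 for g, p in zip(gold, predicted) if p and g == p)

    precision = true_pos / extracted if extracted else 0.0
    recall = true_pos / relevant if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": correct / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


gold = [{"return back"}, {"send"}, set(), {"check"}]
pred = [{"return back"}, {"sends"}, set(), set()]
scores = evaluate(gold, pred)
```

In this toy example two of four entries match exactly, one of two non-empty predictions is correct, and one of three gold phrases is recovered, so accuracy and precision are both 0.5 while recall is one third.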

C. EXPERIMENT RESULTS
The results of the experimental extraction of verb phrases and noun phrases from the names of activity-like elements are presented in Table 3. It depicts both results of detecting whether the given entry had particular types of phrases, as well as the performance of extracting these phrases from the respective entries.
The obtained results indicate that the extractor based on the RNN-based Stanza tagger outperformed CNN-based and CRF-based tools (Spacy and CoreNLP, respectively) in solving our problem. Extraction using Stanza's Bi-LSTM-based tagger showed the best performance in 2 tasks, while the Flair tagger resulted in the second-best performance. The extractor based on our custom BERT-BiLSTM-CRF tagger outperformed other implementations in detecting verb phrase presence and in verb phrase extraction. Moreover, both custom taggers also showed improvements over their generic versions, i.e., ELMo-BiLSTM-CRF resulted in better performance than the original AllenNLP ELMo, and BERT-BiLSTM-CRF performed better than the BERT-based POS tagger. This is quite optimistic, considering the size and specificity of the dataset. However, some caution should be taken while interpreting these results, given that our custom-trained taggers were biased towards the identification of infinitive forms of conflicting verbs. This implies that in some other cases they could fail to correctly tag words that were handled correctly by a tagger trained using conventional corpora; this effect was initially confirmed in our previous research applying similar principles to train custom POS taggers [6]. Therefore, more attention should be given to improving and tuning the custom taggers applied in this research, as well as to finding an optimal balance between an increase in performance for verb detection and a possible decrease in other tasks that are performed better using generic POS taggers.
Nevertheless, the results of the leading extractor (based on the Stanford Stanza toolkit) are quite encouraging: the achieved F1-score was more than 0.8 in most of the performed evaluation tasks, which is notable given the limitations and the level of unavoidable ambiguity in the testing dataset. One of the main challenges in this particular case is the fact that corpora currently available for training, like OntoNotes [91] or English Web Treebank [95], are better suited to working with whole documents rather than the analysis of short text and, therefore, do not represent the specificity addressed in this paper. We tried to mitigate this issue with additional augmentations of the input text, which resulted in certain performance improvements; developing text corpora that are better adjusted for this specific task would certainly help to improve performance even further.

V. PARSING CONJUNCTIVE/DISJUNCTIVE STATEMENTS
Although the techniques described in Section IV proved their efficiency during the extraction of verb phrases and noun phrases, the tools we experimented with in the phrase extraction task are not capable of processing the more complex examples discussed in Section III when applied directly; conjunctive/disjunctive statements are a good example of that. The complexity can be illustrated with the following examples, which depict multiple cases of conjunctive statements (disjunctive statements may be formulated almost identically):
• check dates and suggest modifications: the statement includes the conjunction "and".
• consult project, check progress: the statement does not include a direct conjunction, but it is inferred.
• receive invoice, packing slip, and shipment from supplier: multiple nominal subjects are related to the single verb receive.
• calculate and send price offer: contains a single nominal subject that has dependencies on multiple verbs.
Obviously, the presented examples are not the most sophisticated text labels one could find in real-world models. This is not surprising, given that natural language is among the most complex objects for automated machine processing. It is worth noting that the topic of processing conjunctive/disjunctive statements is not widely researched, although it has received some attention from researchers working on sentence simplification [96] or detecting boundaries of the whole conjunction span [97]. Also, many works on sentence simplification rely upon parse trees [15], [98], [99], which is in line with our research.
In Section V-A, we provide an algorithm based on dependency parsing, which is used to extract pairs of noun/verb phrases from conjunctive/disjunctive statements. Section V-B describes an experimental setup using a real-world dataset consisting of conjunctive/disjunctive phrases that are then processed using the proposed solution. Finally, Section V-C provides the evaluation results, a discussion, and some ideas for our future research.

A. ALGORITHM FOR EXTRACTING NOUN/VERB PHRASES FROM CONJUNCTIVE/DISJUNCTIVE PHRASES
Next, we briefly describe a dependency parsing-based algorithm for extracting pairs of noun phrases and verb phrases from conjunctive/disjunctive phrases (see Algorithm 1). The input is the parsed and tagged document D; hence, it requires a part-of-speech tagger and a dependency parser as part of its processing pipeline. We define D_TOK as the set of tokens that constitute document D, together with the parsing and tagging output. Further, this document is also processed to create noun phrase spans (further denoted as SNP) and verb phrase spans (denoted as SVP) by using predefined grammars (such as presented in Section IV-A). Later, we use correspondence indexes Ind_NP and Ind_VP to map each token in the document to a corresponding noun phrase or a verb phrase. These indexes enable traversing dependency relationships at a phrase level and at the same time reduce the ambiguity that is observed after using different dependency parsers. We denote the head of the dependency relationship from the token tok as Dep_Head(tok), and the end as Dep_End(tok). Finally, we denote by GET the operation that retrieves an entry from the index, given its indexing value. The syntactic dependencies are expected to be labeled using Universal Dependencies format [94], particularly DOBJ as the dependent object, OBJ as the object, POBJ as the object of the preposition, CONJ as the conjunction.
The output of this algorithm is a collection of tuples of verb phrases and noun phrases. It is expected that the input contains both nouns and verbs, otherwise, tuples with empty values instead of the verb or noun phrases can be returned as a result.
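The three conjunct-pairing patterns handled by Algorithm 1 can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: it assumes that a label has already been chunked into verb phrases and noun phrases, and it replaces the dependency traversal with the order of the chunks in the label. The function name `pair_conjuncts` and the `(kind, text)` chunk representation are hypothetical.

```python
from typing import List, Tuple

Chunk = Tuple[str, str]  # (kind, text); kind is "VP" or "NP" (assumed representation)


def pair_conjuncts(chunks: List[Chunk]) -> List[Tuple[str, str]]:
    """Pair verb phrases with noun phrases for one conjunctive/disjunctive label."""
    verbs = [text for kind, text in chunks if kind == "VP"]
    nouns = [text for kind, text in chunks if kind == "NP"]

    # One noun phrase shared by several verb phrases.
    if len(nouns) == 1 and len(verbs) > 1:
        return [(verb, nouns[0]) for verb in verbs]

    # One verb phrase governing several noun phrases.
    if len(verbs) == 1:
        return [(verbs[0], noun) for noun in nouns]

    # Several of both: attach each noun phrase to the nearest verb phrase on its left.
    # A noun phrase with no preceding verb phrase gets an empty verb value.
    pairs = []
    last_verb = ""
    for kind, text in chunks:
        if kind == "VP":
            last_verb = text
        else:
            pairs.append((last_verb, text))
    return pairs


shared_noun = pair_conjuncts([("VP", "calculate"), ("VP", "send"), ("NP", "price offer")])
shared_verb = pair_conjuncts(
    [("VP", "receive"), ("NP", "invoice"), ("NP", "packing slip"), ("NP", "shipment")]
)
separate = pair_conjuncts(
    [("VP", "check"), ("NP", "dates"), ("VP", "suggest"), ("NP", "modifications")]
)
```

Working on chunks keeps the sketch independent of any particular parser; in a real pipeline the grouping of conjuncts would come from the CONJ dependencies rather than from chunk order.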

B. EXPERIMENT SETUP
To evaluate our approach, we extracted a dataset of 410 entries from the same set of process models that was used in our phrase extraction experiment. The final dataset comprised only those text labels that included at least one conjunctive or disjunctive clause. Then we manually extracted all available verb/noun phrase parts to create a "gold standard" dataset to be used as a reference point for our evaluation.
The algorithm presented in Section V-A was implemented as a separate module without any text normalization capability. To perform comparative testing, we implemented the module in Python using Spacy and extended it to use Stanford Stanza, which integrates flexibly with the Spacy framework; this enabled us to compare the dependency parsing capabilities of these toolkits.
Again, for the evaluation, we used metrics like the ones described in Section IV-B, that is, accuracy, precision, recall, and F1-Score. Here, accuracy is defined as the ratio of the entries processed correctly to the total number of entries. Note that this is a very strict measure, as it considers an extraction valid only if all noun/verb phrase pairs were extracted correctly. However, this technique can generate a larger or smaller number of entries compared to the actual outputs. To address this issue and provide an evaluation of partially correct outputs, we defined two additional metrics to evaluate the performance in terms of the number of generated output instances:
• The mean deviation between the number of extracted outputs and benchmark output results:
$$MeanDiff = \frac{1}{n}\sum_{i=1}^{n}\left|\#_i^{actual} - \#_i^{extracted}\right|$$
• The mean Sørensen-Dice coefficient, which is used to evaluate the average similarity between the actual and extracted instance sets:
$$MeanSDC = \frac{1}{n}\sum_{i=1}^{n}\frac{2\,\left|O_i^{actual} \cap O_i^{extracted}\right|}{\#_i^{actual} + \#_i^{extracted}}$$
Here, $n$ is the total number of processed entries in the dataset; $O_i^{actual}$ is the benchmark set of verb phrase/noun phrase pairs extracted for the i-th dataset entry; $O_i^{extracted}$ is the set of output elements extracted for the i-th dataset entry; $\#_i^{actual}$ and $\#_i^{extracted}$ represent the number of elements in $O_i^{actual}$ and $O_i^{extracted}$, respectively.

Algorithm 1
3: if Dep_END(tok) ∈ (DOBJ, OBJ, POBJ) then
4:   results ← results ∪ (GET(Ind_VP, Dep_START(tok)), GET(Ind_NP, Dep_END(tok)))
5: else if Dep_END(tok) = CONJ then
6:   Ind_POS ← index of POS tags and tokens for conjuncts in Dep_END(tok)
     ▷ Assume pattern <VERB>, <VERB> and <VERB> <NOUN>
7:   if |GET(Ind_POS, NOUN)| = 1 and |GET(Ind_POS, VERB)| > 1 then
8:     noun ← GET(Ind_POS, NOUN)
9:     for all verb ∈ GET(Ind_POS, VERB) do
10:      results ← results ∪ (GET(Ind_VP, verb), GET(Ind_NP, noun))
     ▷ Assume pattern <VERB> <NOUN>, <NOUN> and <NOUN>
11:  else if |GET(Ind_POS, VERB)| = 1 then
12:    verb ← GET(Ind_POS, VERB)
13:    for all noun ∈ GET(Ind_POS, NOUN) do
14:      results ← results ∪ (GET(Ind_VP, verb), GET(Ind_NP, noun))
15:  else if |GET(Ind_POS, NOUN)| > 1 then
16:    for all noun ∈ GET(Ind_POS, NOUN) do
17:      results ← results ∪ (GET(Ind_VP, LEFTMOSTVERB_noun), GET(Ind_NP, noun))
Output: the set of (verb phrase, noun phrase) tuples results
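The evaluation measures of this experiment can be sketched in a few lines of Python. The sketch assumes that the mean deviation is the mean absolute difference between the sizes of the benchmark and extracted sets, and that the Sørensen-Dice coefficient takes its usual form 2|A ∩ B| / (|A| + |B|); treating two empty sets as identical is an additional assumption. All function names are hypothetical.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (verb phrase, noun phrase)


def strict_accuracy(actual: List[Set[Pair]], extracted: List[Set[Pair]]) -> float:
    """Share of entries whose extracted pair set equals the benchmark set exactly."""
    hits = sum(1 for a, e in zip(actual, extracted) if a == e)
    return hits / len(actual)


def mean_diff(actual: List[Set[Pair]], extracted: List[Set[Pair]]) -> float:
    """Mean absolute difference between benchmark and extracted set sizes (assumed form)."""
    return sum(abs(len(a) - len(e)) for a, e in zip(actual, extracted)) / len(actual)


def mean_sdc(actual: List[Set[Pair]], extracted: List[Set[Pair]]) -> float:
    """Mean Sørensen-Dice coefficient between benchmark and extracted sets."""
    total = 0.0
    for a, e in zip(actual, extracted):
        if not a and not e:
            total += 1.0  # assumption: two empty sets count as a perfect match
        else:
            total += 2 * len(a & e) / (len(a) + len(e))
    return total / len(actual)


# Two illustrative entries: the first is only partially extracted, the second fully.
gold = [
    {("calculate", "price offer"), ("send", "price offer")},
    {("check", "dates"), ("suggest", "modifications")},
]
predicted = [
    {("send", "price offer")},
    {("check", "dates"), ("suggest", "modifications")},
]
```

The example shows why the strict accuracy is harsh: the first entry counts as a complete failure for accuracy, while the Dice coefficient still credits the one pair that was found.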

C. EXPERIMENT RESULTS
The results of the experiment are presented in Table 4, which summarizes the performance of both the Spacy and Stanza models. The obtained results demonstrate the influence of the underlying dependency parser: the implementation based on the Stanza toolkit significantly outperformed the Spacy-based implementation. Unfortunately, the extraction accuracy score for both implementations was very low, indicating that they failed to extract all the expected verb/noun phrase pairs from each given input text; this is also reflected in the relatively high values of MeanDiff and MeanSDC. At the same time, the precision, recall, and F1-Score values, which are calculated at the macro level, are not disappointing; yet both implementations of the algorithm and the underlying technology could still be improved in the future.
Here, the performance of the experimental implementation resulted in F1-Score = 0.631, although we must also take into consideration the influence of a sample bias. The significance of the underlying parse model was also obvious, as the Stanza-based processor significantly outperformed the implementation based on Spacy. Again, all the mandatory pipeline steps (text tagging, text chunking into noun/verb phrases, and dependency parsing) have proven to be crucial to the overall quality of phrase processing. A failure in any of these steps inevitably translates into errors in the further steps of the developed pipeline. Therefore, we can safely conclude that the dependency parser plays the most important role of all. This was particularly visible in the experimental cases when insignificant changes to the input entry (e.g., adding an adjective to one of the nouns) resulted in completely different parse trees compared to the initial ones; it complicated the analysis significantly or even resulted in cases not covered by the used formal grammar. This indicates the need for more extensive research and improvements in both extraction and dependency parsing areas. We believe that it could be achieved by integrating and testing recent developments in dependency parsing, based on neural techniques as described in [51] and [52], among others.

VI. ACRONYM/ABBREVIATION DETECTION
Acronym/abbreviation detection is a text normalization task that has to deal with multiple issues and ambiguities when deciding whether a given word in the text is an abbreviation or an acronym. While many cases can be handled by simply performing a search for a particular candidate's expansive form in the text or performing a search in dictionaries and word lists, this is not trivial when it comes to widely used acronyms. The first issue is that these acronyms/abbreviations might be present in dictionaries and at the same time overlap with some general words (e.g., acronym IT overlaps with pronoun it); another common issue is omitting the expanded form of an acronym/abbreviation due to its widespread use, which makes it almost impossible to automatically identify it as an acronym/abbreviation of some particular phrase with simple backtracking in the input (the aforementioned acronym IT can be seen as an example in this context as well). Acronym/abbreviation expansion is yet another similar task, which aims to replace a given abbreviation or acronym with the expansive form that is the most appropriate in the given context. This task is not a trivial one either: for instance, EM could refer to entity matching; however, it could also be expectation maximization or entity model, with all these expansive forms coming from a single computer science domain. Unfortunately, current research tends to focus on long text passages, which highly reduces its applicability in the context of our research.
In model-to-model transformation, as well as in other relevant topics, the acronym/abbreviation (A/A) detection task helps one to properly match full concept names with their abbreviated forms, thus adding to greater consistency of the models being developed. The A/A detection task itself comprises two interrelated subtasks:
• A/A detection seeks to detect candidate A/A, which must be expanded (what must be replaced?);
• A/A expansion is focused on finding the right expansion for the given A/A (what is the replacement?).
Acronym/abbreviation (A/A) expansion is often considered as a simple expansion of entries that are identified as A/A due to their writing style or absence in relevant sources, like thesauri or dictionaries. While simple A/A mapping lists are generally applied for common text normalization tasks, they may not always provide the correct result, unless they are restricted to having single meanings in specific or even multiple contexts. Therefore, real-world use cases may easily complicate this seemingly uncomplicated task. The complexity of the task may rise depending on the diversity of corpus or data required to properly train one's implementation to resolve models. The expansion problem will not be further addressed in this paper due to certain limitations of the dataset.
While recent developments in acronym detection tend to apply state-of-the-art deep learning techniques (as stated in Table 1), they are not applicable in our context due to relatively short text input. Therefore, we will model this problem in a more traditional yet efficient way by applying context-based classification techniques within a space of contextual, morphological, and linguistic features. While a similar approach was successfully tested in [56] and [100], we propose using a different set of features that are preferred due to data limitations. The target variable of the classifier is simply an indicator of whether the particular word represents an acronym or abbreviation.
Further in this section, we provide an empirical evaluation of A/A detection in BPMN element names. To make it more consistent with other experiments presented in this paper, we will use the same initial set of the BPMN process and UML use case models as in the experiment presented in Section IV. Hence, Section VI-A describes the preliminaries and setup of the experiment, while Section VI-B presents and briefly discusses the results obtained during that experiment.

A. EXPERIMENT SETUP
The initial dataset of process models was used as the source for developing the feature dataset for our A/A detection experiment. The feature dataset was created from all the available words in the extracted text by applying simple heuristic rules:
• An acronym or abbreviation must contain at most 5 characters. It can be observed that the longer the word is, the smaller the probability of it being an acronym. Therefore, words with more than the predefined number of characters are not considered to be acronyms and are excluded from further analysis.
• The word representing an acronym or abbreviation is not available in the dictionary. Since WordNet does not contain all the English words and their forms, we used the Enchant library, which is generally used for grammatical error correction, to check for the word existence.
The first rule helped to identify the candidate entries for the feature dataset, and the entries longer than the predefined length threshold were not considered as candidates for acronyms and abbreviations. The second rule helped to perform its primary labeling. After the automated generation of the dataset, some manual adjustments were performed: fixing automated labeling errors and ambiguities, removing redundant and duplicate entries, and identifying situations that were not covered by the above-listed heuristics and could not be handled automatically. All this was done to make the feature dataset more consistent and suitable for the development of our detection classifier. The feature dataset examination also helped to identify that most of the acronyms were written in uppercase, which also helped to simplify the semi-automated labeling task. To avoid feature leakage, we removed the feature of the uppercase word as it would serve as a proxy for the label otherwise (in practical applications, it might serve as a very strong indicator for acronym presence). To perform POS tagging required for the POS-based feature generation, we used the Stanford Stanza tagger that showed the best performance in our previous experiment presented in Section IV-C.
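The two heuristic rules can be illustrated with a short Python sketch. It is a simplified stand-in rather than the procedure used to build the feature dataset: a plain set of lowercase words replaces the Enchant dictionary lookup, and the function name `label_candidates` is hypothetical.

```python
from typing import Iterable, List, Set, Tuple

MAX_LEN = 5  # length threshold stated in the first rule


def label_candidates(
    words: Iterable[str], dictionary: Set[str], max_len: int = MAX_LEN
) -> List[Tuple[str, bool]]:
    """Return (word, preliminary A/A label) for every word that passes the length rule."""
    labeled = []
    for word in words:
        # Rule 1: words longer than the threshold are excluded from further analysis.
        if len(word) > max_len:
            continue
        # Rule 2: a word missing from the dictionary gets a preliminary positive label.
        labeled.append((word, word.lower() not in dictionary))
    return labeled


# Toy dictionary standing in for a real spell-checking dictionary (assumption).
toy_dictionary = {"send", "to", "it", "department"}
labels = label_candidates(["send", "PO", "to", "IT", "department", "dept"], toy_dictionary)
```

In this toy run, IT receives a negative preliminary label because it collides with the pronoun it, which is exactly the kind of case that the manual adjustment step has to correct.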
After performing the feature generation procedure, a feature dataset with a total of 16579 entries was created. Each entry in the dataset was a vector of 16 features extracted from the text labels in the BPMN process and UML use case models, together with the label indicating whether a word represents an acronym or an abbreviation. The full set of features is presented in Table 5. The features has.special and long.char.seq were excluded from further analysis as the final dataset did not contain any such entries. Nonetheless, these features could be useful while performing further research with more extensive datasets and/or contexts, and thus they are included in Table 5 along with other features as a reference for future consideration. This left us with 14 features that were further used as the inputs for the classifier.
To develop the acronym detection classifier, the following techniques were considered:
• CatBoost [101] is a high-performing gradient boosting classifier. One of its most exceptional features is the ability to efficiently work directly with the categorical feature variables, which helps to improve performance when numerous categorical features are used.
• XGBoost [102] is one of the best performing gradient boosting-based ensemble classifiers, widely used to solve various classification tasks.
• Random Forest [103], [104] is a widely used decision tree ensemble technique based on bagging and random feature selection.
To handle the high level of class imbalance in the input dataset, weighted classification was applied to improve detection performance. Also, grid search was used to optimize the performance of CatBoost and XGBoost by selecting their optimal hyperparameters. The Random Forest classifier was run with default parameters, but using 200 estimators. All the classifiers were implemented in Python using the scikit-learn, catboost, and xgboost libraries. Similar to the experiments presented in Section IV-B, accuracy, precision, recall, and F1-score were used to measure performance.

B. EXPERIMENT RESULTS
Figure 1 presents the results obtained using the classifiers described in Section VI-A. They show that CatBoost significantly outperformed Random Forest and slightly outperformed XGBoost in terms of precision and F1-Score. This is not surprising, due to the design of the CatBoost tool and its ability to work directly with categorical variables. Its superiority over the XGBoost classifier was also confirmed by McNemar's test, which resulted in p < 0.05 (p = 0.029). Table 6 also provides an insight into the feature importance obtained using the CatBoost classifier. The results indicate that morphological features of tokens next to the target word were identified as the most important, whereas the presence of a particular word in an English dictionary or similar referential source played a less influential role as expected. One of the reasons for this is that abbreviations are usually created by the people who create models and write documentation (e.g., business/system analysts); they either coin various acronyms and abbreviations themselves or use already established A/A to make the text more compact (compact text labels are particularly relevant in visual modeling).
Contextual part-of-speech features seem to play an important role as well because they capture acronym usage patterns in spoken or written language; this is also supported by the high importance of the features of preceding tokens, as well as more distant contextual features. This suggests testing wider context features (like prev.pos3, next.pos4, etc.); however, such features are not considered in this paper due to the limited size of the processed text phrases.
Alternatively, one might consider sequence-tagging models (such as Markov models or recurrent neural networks) that directly apply such context, yet their training would require larger datasets and the inclusion of an even greater number of additional features (lexical and morphological). Emerging deep learning approaches, such as [58] or similar, seem to be a viable solution as well, although their training might require a significant amount of labeled data, and their applicability for the given problem must be verified.
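The McNemar's test used in this section to compare CatBoost with XGBoost can be illustrated with a small standard-library sketch. The text does not state which variant of the test was applied; the sketch assumes the exact (binomial) two-sided variant, which only looks at the entries on which the two classifiers disagree. The function name `mcnemar_exact` and the toy predictions are hypothetical.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(y_true: Sequence[int], pred_a: Sequence[int], pred_b: Sequence[int]) -> float:
    """Two-sided exact McNemar p-value for two classifiers evaluated on the same entries."""
    # b: classifier A is right and B is wrong; c: A is wrong and B is right.
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    n = b + c
    if n == 0:
        return 1.0  # the classifiers never disagree, so there is no evidence of a difference
    # Under the null hypothesis, the discordant entries split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data: classifier A is right on all 10 entries, B is wrong on 8 of them.
truth = [1] * 10
clf_a = [1] * 10
clf_b = [0] * 8 + [1] * 2
p_value = mcnemar_exact(truth, clf_a, clf_b)
```

Because the test conditions on the discordant entries only, it fits a comparison of two classifiers that were evaluated on the same test set.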

VII. DISCUSSION
With the experiments described in this paper, we explored the capabilities of the advanced NLP tools to process short text fragments (text labels), which is required to enable advanced capabilities in processing our model-to-model transformations. While this is inspired by our previous research [6], [7], we believe that the presented research results could be applicable in other relevant fields as well. Similar text normalization is required for practical process mining where the names of the composing elements need to be unified from multiple data sources while reducing the number of duplicates to a minimum. It is also applicable in conversational intelligence when intent processing is required to identify the responsive action for the inquiry. The experiments indicate that the recent developments in the field of NLP and deep learning could provide the needed tools to solve such and other similar problems.
Overall, the experiments presented in this paper revealed several issues, which should be addressed and might be required to handle separately:
• Bad modeling (in particular, element naming) practices were not considered in the extraction activities. During the initial dataset screening, we observed many such cases that were summarized in Table 2. Detecting the most common bad modeling practices and introducing an automated resolution of such cases into the developed solution could provide even greater automated processing results.
• A more thorough analysis of the outputs showed that some tagging tools, like Spacy, were quite sensitive to the letter casing, which is also significant for the practical application of NLP technology in model-to-model transformations as well as in other relevant fields. While this is less relevant when processing long text passages or whole documents, the importance increases when more specific text processing is considered. This is stipulated by different modeling styles used by practitioner modelers who prefer starting each word with a capital letter while naming model elements such as activities, tasks, use cases, etc. (this is verified by the analysis of the BPMAI dataset used in our research, as well as our personal experience), and some tools may fail to tag such labels correctly. For example, return invoice could be tagged as <VERB><NOUN>; however, Return Invoice might as well become <NOUN><NOUN>, which would be an incorrect tagging result. Again, in our related experiments, we reverted all text labels to lowercase to mitigate this problem. Unfortunately, such normalization might remove relevant features that could be used to detect abbreviations.
• The previous issue is also relevant for other related problems. While such cases could be normalized to lowercase, doing so increases the risk of failure in the other tasks like named entity recognition where capital letters play a crucial role. Moreover, NLP tools may face difficulties detecting named entities within fully lowercase entries (e.g., United States was identified as LOCATION, while united states was not).
• Detection performance can be negatively affected by the presence of non-alphanumeric symbols (e.g., dashes, commas, apostrophes) within words. It is advisable to remove such symbols from the model element names wherever possible. This issue might be mitigated using more advanced tokenizers capable of handling most of these cases, but the risk of failing to properly handle them still exists.
• Generally, using conjunctive/disjunctive clauses in activity-like element names indicates a bad modeling practice as such instances should be refactored to two or more atomic elements. As stated previously, processing such statements appeared to be a very challenging task requiring the support of several advanced NLP techniques, such as dependency or constituency parsing. In its turn, this would bring in other kinds of errors from the underlying parser model.
• In our experimentation, we observed general ambiguity in detecting abbreviations. The A/A detection experiment confirmed the applicability of a machine learning-based approach to handling this problem. Yet, A/A expansion is a more complicated task as full forms of concepts designated by A/A might not be present in models under the scope, especially if those A/A are well-known and heavily used (e.g., IT, USA). External sources, such as domain vocabularies and linked data can be applied by matching them contextually to each model instance containing cases of acronyms and abbreviations. Again, this requires additional sources of input data, together with a more extensive dataset, and could be considered as one of the directions for our future research.
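The tension between lowercasing labels for tagging and keeping casing cues for abbreviation detection, noted in the list above, can be illustrated with a small sketch. It is not part of the evaluated pipeline: the function name `normalize_label`, the decision to drop every non-alphanumeric symbol, and the rule that only multi-letter all-uppercase tokens are flagged are assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Anything that is neither a word character nor whitespace is treated as a symbol.
_SYMBOLS = re.compile(r"[^\w\s]")


def normalize_label(label: str) -> List[Tuple[str, bool]]:
    """Lowercase a label for tagging while recording which tokens were all-uppercase."""
    tokens = _SYMBOLS.sub("", label).split()
    # The flag is computed before lowercasing, so the casing cue is not lost.
    return [(token.lower(), token.isupper() and len(token) > 1) for token in tokens]


tagged_input = normalize_label("Return Invoice to IT")
```

The lowercase tokens can then be passed to a POS tagger, while the boolean flags remain available as a separate feature for A/A detection.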

VIII. CONCLUSION
The NLP discipline has seen impressive advancements and improvements during the last several years, with the number of NLP applications increasing dramatically. Also, the progress in deep learning has resulted in a significant increase in the performance of solving different linguistic tasks. In this paper, research on applying the recent developments for processing small text phrases is discussed. While the need for this research originated from our recent research on model-to-model transformations [6], [7], we may identify several other areas that could benefit from similar text processing capabilities, such as process mining, aspect-based sentiment analysis or conversational interfaces with command-like short text processing capability. At the same time, all these areas share the same NLP-related issues that have to be dealt with to ensure satisfactory performance of the underlying NLP technology (e.g., identical representation of verbs and nouns, lack of context required for the automated processing).
In this paper, we addressed the problem of extracting relation tuples from the process and system requirements' models containing elements expressing activity-like statements. As stated in Section III, it is not an easily solved problem, due to multiple ambiguities, applied modeling practices, and many other issues that are not addressed in common NLP processing toolkits. Among such issues, one may emphasize the processing of disjunctive or conjunctive statements (which is considered to be a bad modeling practice) and the presence of shortened forms, like acronyms or abbreviations.
To solve the issues addressed in this paper, we evaluated several current state-of-the-art implementations from the perspective of our research, while combining them under our custom formal grammar-based extraction to derive prototype implementations. Additionally, we implemented and tested our custom tagging tools, based on input corpora augmentations and bidirectional LSTM-CRF architecture with BERT and ELMo embeddings at the input layer. In the first experiment, the Stanza-based implementation showed the best performance results in noun/verb extraction tasks. Yet, we showed that the implementation based on our custom BERT-BiLSTM-CRF tagger helped to improve the detection of verb phrase presence and verb phrase extraction as compared to the generic tagger implementations, including the generic BERT-based tagger. This was expected as bias towards proper tagging of verbs could reduce the ability to correctly tag nouns in short text statements. Hence, balancing between biased and unbiased tagging still requires further research.
Our second experiment with processing disjunctive and conjunctive statements showed this task to be more challenging than expected, due to the dependence of our implementation on the performance of underlying dependency parser toolkits. Unfortunately, while such statements are also considered to be bad modeling practice, they are widely used in real-world cases (this is also verified from the initial analysis of BPMAI dataset) and need to be addressed carefully. This is an important topic relevant for multiple information extraction and other NLP-related areas, such as relation extraction or aspect-based sentiment analysis. It has been proven to be a complicated task due to the generally unstructured nature of natural language texts. Handling of these issues is also discussed in this paper providing additional insights for further improvements in this area. Results obtained after applying our technique described in Section V-C indicate that there is still a lot of potential for further improvements. While at this stage, we did not consider training custom parsers, we hope to achieve more progress in the future after carrying out more extensive studies and taking advantage of the improvements in dependency parsing, constituency parsing, and general relation extraction algorithms.
Finally, in the third experiment, we tested a machine learning-based approach for the acronym/abbreviation detection issue. While this issue is widely discussed in multiple papers (see Table 1 for more details on that), these works tend to focus on processing longer text statements or even whole documents, which is not suitable for our particular case. Due to limitations discussed in previous sections, we approached this issue by applying context-based classification using token-level and text label-level features. We found that our trained classifier was able to obtain a precision of 0.78 and F1-Score of 0.73, which we consider to be a rather positive result given multiple constraints and limitations. In the future, we might as well test the developed solution in other settings by expanding our developed dataset to include more specific cases. The results are expected to be improved after applying the classifier to a more extensive and comprehensive dataset, which would lead to exploiting additional token-level, phrase-level, or even whole model-level features, and is still subject to our further research. In this paper, we did not consider acronym/abbreviation expansion, due to certain limitations and requirements discussed in Section VI. Yet, it is an interesting challenge that will be addressed in our future developments.
While our research contributes to text processing for the system modeling domain, there is still a lot of room for future research. In this paper, we experimented with text labels of activity-like elements acquired from the BPMN process models and UML use case models. However, other models, like UML activity models, state machines (or other kinds of statechart models) could also be successfully tested. Moreover, applying these techniques to larger and more elaborate datasets might reveal other cases that could be addressed by tuning the formal grammars or processing algorithms discussed in this paper. Additionally, one could also resort to creating specialized datasets or text corpora which would enable the development of even more specialized extraction tools.
In addition, several technological constraints should be addressed, particularly the optimization of the final models for deployment, since larger deep learning models require a significant amount of resources to run. This may require investigation of model reduction techniques such as distillation or quantization.
Finally, it is safe to state that in model-to-model transformation (as well as in other areas involving the processing of graphical models), one could also benefit from other existing NLP capabilities, such as the extraction of semantic relationships (synonymy, hyponymy, hypernymy, etc.), analysis and correction of grammatical errors. Indeed, fully automated processing requires significant input and capabilities from multiple fields of linguistic processing to ensure the high performance of the developed NLP applications, as discussed in Table 1. This paves the road for our next near-future developments and experimentation.