Analyzing the Trade-Off Between Complexity Measures, Ambiguity in Insertion System and Its Applications

Insertion is one of the basic operations in DNA computing. Based on this operation, an evolutionary computation model, the insertion system, was defined. For the above defined evolutionary computation model, varying levels of ambiguity and basic descriptional complexity measures were defined. In this paper, we define twelve new (descriptional) complexity measures based on the integral parts of the derivation, such as axioms, strings to be inserted, and contexts used in the insertion rules. Later, we analyze the trade-off among the (newly defined) complexity measures and the existing ambiguity levels. Finally, we examine the application of the analyzed trade-off in natural languages and modelling of bio-molecular structures.

of natural computing or bio-inspired computing models which bridged the gap between nature and computer science. As a result, many bio-inspired computing models have been defined, namely membrane computing, sticker systems, splicing systems, Watson-Crick automata, insertion-deletion systems, DNA computing, and H-systems [6], [35], [36]. In formal language theory, language generation depends on rewriting operations, which paved a new dimension for insertion systems. If a string $\beta$ is lodged between two substrings $\alpha_1$, $\alpha_2$ of a string $\alpha_1\alpha_2$ to get a new string $\alpha_1\beta\alpha_2$, then the operation performed on the strings is called insertion. The insertion operation was first studied theoretically in [16]. In DNA computing, the insertion operations have (some) biological relevance, which in turn has (some) biologically relevant properties in human genetics. In [34], the application of the insertion operation in the domain of genetics has been investigated.

In 1969, Solomon Marcus introduced contextual grammars, which are mainly based on descriptive linguistics [30]. In contextual grammars based on the selector, the context is inserted to the left and right of selectors. Using the adjoining operation iteratively, the strings of the language are generated, whereas in an insertion system the string is inserted based on the left and right context. In [33], different ambiguity levels were defined and studied for external and internal contextual grammars, depending on the parts that are used in the derivation. For more details on the ambiguity issues related to contextual grammars, we cite [18], [21], [22], [32], [34]. As the insertion system can be viewed as the counterpart of contextual grammars, in the similar line of [23], [24], [25], [35].

The associate editor coordinating the review of this manuscript and approving it for publication was Yang Li.
This motivated us to define new descriptional complexity measures for insertion systems, to perform the trade-off, and to investigate the application of the analyzed trade-off.

The paper is organized as follows. The preliminaries are dealt with in Section II. The newly introduced descriptional complexity measures of insertion systems are discussed in Section III. The trade-off results between the newly defined complexity measures and the various ambiguity levels of insertion systems are investigated in Section IV. The application of the trade-off between ambiguity and measures in natural languages and in the modelling of bio-molecular structures is probed in Section V. The comparative study is dealt with in Section VI. The conclusion and the future work are given in Section VII.

We start by discussing the fundamental notations used in formal language theory. $V$ ($\Sigma$) is called an alphabet set. $T$ is called a terminal set. The free monoid generated by $V$ ($\Sigma$) is represented as $V^*$ ($\Sigma^*$). The null string is denoted by $\lambda$. Strings or words are the elements of $V^*$ ($\Sigma^*$). By eliminating $\lambda$ from $V^*$ ($\Sigma^*$), we obtain $V^+$ ($\Sigma^+$). A language $L$ is given as $L \subseteq V^*$ ($\Sigma^*$). For more details, we refer to [40].

An insertion system is defined as $\gamma = (V, A, R)$, where $V$ is an alphabet, $A$ is a finite language over $V$ (the axioms), and $R$ is a finite set of insertion rules of the form $(u, \beta, v)$. In an insertion rule (IR), the pair $(u, v) \in V^* \times V^*$ is called the context, and $\beta \in V^+$ is the string to be inserted (IS). Given an insertion rule, the string $\beta$ is inserted depending on the left context (LC) $u$ and the right context (RC) $v$. If $(u, v) = (\lambda, \lambda)$, then $\beta$ can be inserted anywhere in the word.

Given an insertion rule $(u, \beta, v)$, $y$ can be derived from $x$ (written $x \Rightarrow y$, with $x, y \in V^*$) as follows. Consider the derivation step $x = x_1 u {\downarrow} v x_2$, $y = x_1 u \underline{\beta} v x_2$, for some $x_1, x_2 \in V^*$ and $(u, \beta, v) \in R$, where $\downarrow$ marks the location of the string to be inserted and the inserted string is underlined. The language generated by $\gamma$ is defined as $L(\gamma) = \{w \in V^* \mid x \Rightarrow^* w, \text{ for some } x \in A\}$, where $\Rightarrow^*$ is the reflexive and transitive closure of the relation $\Rightarrow$.
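The derivation relation and the generated language defined above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `one_step` and `language`, the length bound `max_len`, and the example system with axiom `ab` and rule `(a, ab, b)` are assumptions made here and do not come from the paper.

```python
from collections import deque

def one_step(word, rules):
    """Yield every y with word => y, using one rule (u, beta, v).

    The rule applies at position i when u ends exactly at i and v starts at i.
    The empty string "" plays the role of lambda.
    """
    for u, beta, v in rules:
        for i in range(len(word) + 1):
            if word[:i].endswith(u) and word[i:].startswith(v):
                yield word[:i] + beta + word[i:]

def language(axioms, rules, max_len):
    """Return all words of L(gamma) of length at most max_len (breadth-first closure)."""
    seen = {x for x in axioms if len(x) <= max_len}
    queue = deque(seen)
    while queue:
        x = queue.popleft()
        for y in one_step(x, rules):
            if len(y) <= max_len and y not in seen:
                seen.add(y)
                queue.append(y)
    return seen

# Hypothetical example system: axiom "ab", rule (a, ab, b).
words = language({"ab"}, [("a", "ab", "b")], max_len=6)
```

Because every inserted string is non-empty, bounding the word length is enough to make the closure finite, which is why the sketch enumerates the language only up to `max_len`.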

However, the language $L_2$ is unambiguous, as there exists an unambiguous insertion system $\gamma_2$ which generates $L_2$.

Proof: Let the language $L_3 = \{cba^2ca^n \mid n \geq 1\} \cup \{dba^2da^n \mid n \geq 1\}$. The following 5-ambiguous insertion system $\gamma_3$ can be used to generate $L_3$.

$cba^2ca^k$, $k \geq 1$ can be generated. Likewise, by using the insertion rule $(d, a, a)$, $dba^2da^k$, $k \geq 1$ can be generated.

While deriving $cba^2ca^r$ or $dba^2da^s \in L_3$, $r, s \geq 1$, the position of the string to be inserted, $a$, is unique in the derivation.

Proof: Let the language $L_4 = \{c^2a^n \mid n \geq 0\} \cup \{d^2a^n \mid n \geq 0\} \cup \{c^2a^nd^2a^m \mid n, m \geq 0\}$. The following 4-ambiguous

To generate $c^2a^l$, $l \geq 0$ of $L_4$, the insertion system must have an insertion rule of the form $(c^2, a^i, \lambda)$,

insertion rule $(c^2, a, \lambda)$. As the string to be inserted, $a^i$, is the same for any arbitrary system, the insertion system $\gamma_4$ is 1- and 3-unambiguous.

However, $L_4$ is unambiguous, as there exists an unambiguous

The following 4-ambiguous insertion system $\gamma_5$ can be used to generate $L_5$.
First, we discuss the measure Ax. The axioms $c^2ab^3$, $d^2ab^3$, $c^2d^2$ can be used to derive the first, second and third parts of $L_5$, so Ax($L_5$) $\leq 3$. It is easy to see that at least three axioms are needed to generate $L_5$, which implies Ax($L_5$) $= 3$. Next, we discuss the measure mLen−LCon. The insertion rule $(c^2, ab^3, \lambda)$ is used to generate the first part of the language $L_5$. The $d^2(ab^3)^n$, $n \geq 1$ part of $L_5$ can be derived using the insertion rule $(d^2, ab^3, \lambda)$. By using the insertion rules alternately, the third part of $L_5$ can be generated. Any system which generates $L_5$ should have insertion rules with mLen−LCon($L_5$) $\leq 2$, which implies mLen−LCon($L_5$) $= 2$. Next, we discuss the measure MLen−LCon+InsStr. Any system which generates $L_5$ should have the left contexts $c^2$, $d^2$ in the insertion rules, which implies MLen−LCon($L_5$) $= 2$. Likewise, the inserted string should be $ab^3$, which implies MLen−InsStr($L_5$) $= 4$. As the measure MLen−LCon+InsStr is the combination of the above two measures, we conclude MLen−LCon+InsStr($L_5$) $= 6$.
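The arithmetic in this discussion can be reproduced with a small sketch. The definitions of the measures are given in Section III and are not repeated here, so the readings used below are assumptions: Ax counts the axioms, MLen and mLen are taken as the maximal and minimal length of the named component over all rules, and the combined measure is taken as the sum of its two parts. The function name `measures` is hypothetical, and the code evaluates the measures of one concrete system, built from the axioms and rules mentioned in the text, rather than the minimum over all systems generating $L_5$.

```python
def measures(axioms, rules):
    """Compute a few descriptional measures of one insertion system.

    Assumed readings (not taken verbatim from the paper):
    MLen = maximal length, mLen = minimal length, combined measure = sum.
    Rules are triples (u, beta, v); "" stands for lambda.
    """
    left_lengths = [len(u) for u, _, _ in rules]
    inserted_lengths = [len(beta) for _, beta, _ in rules]
    return {
        "Ax": len(axioms),
        "MLen-LCon": max(left_lengths),
        "mLen-LCon": min(left_lengths),
        "MLen-InsStr": max(inserted_lengths),
        "MLen-LCon+InsStr": max(left_lengths) + max(inserted_lengths),
    }

# Axioms and rules named in the discussion of L_5 (c^2 = "cc", ab^3 = "abbb").
axioms_5 = ["ccabbb", "ddabbb", "ccdd"]
rules_5 = [("cc", "abbb", ""), ("dd", "abbb", "")]
values_5 = measures(axioms_5, rules_5)
```

Under these assumed readings the sketch gives 3 for Ax, 2 for both left-context measures, 4 for MLen-InsStr and 6 for the combined measure, matching the values stated in the text.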
Consider any system $\gamma_7$ which generates $L_7$, where TLen−LCon($L_7$) $= 1$ and TLen−LCon+InsStr($L_7$) $= 5$. Since the strings of the form $a(bc^3)^i$, $i \geq 1$ belong to $L_7$, some insertion rule must have a context of the form $(a, (bc^3)^t)$, $t \geq 0$. Likewise, since the strings of the form $(bc^3)^jd$, $j \geq 1$ belong to $L_7$, some insertion rule must have a context of the form $((bc^3)^s, d)$, $s \geq 0$. In both cases, the inserted string will be $(bc^3)^k$, $k \geq 1$. In order to prove that the insertion system $\gamma_7$ is 2-ambiguous, let us take a string $a(bc^3)^{e+f}d \in L_7$. This word can be obtained from the (same) axiom by two (different) unordered CCS. In one sequence, using the context $(a, (bc^3)^t)$ throughout, the string $a(bc^3)^{e+f}d$ can be obtained. In another sequence, using the context $((bc^3)^s, d)$ throughout, the string $a(bc^3)^{e+f}d$ can be derived. Thus, the same word $a(bc^3)^{e+f}d \in L_7$ is derived from two different unordered CCS. Therefore, the system $\gamma_7$ is 2-ambiguous.

However, the language $L_7$ is 2-unambiguous, since there exists a 2-unambiguous insertion system which is minimal in {TLen−RCon, TLen−RCon+InsStr}: $\gamma_7 = (\{a, b, c, d\}, \{abc^3, bc^3d, ad, abc^3d\}, \{(c^3, bc^3, \lambda)\})$. As the system $\gamma_7$ uses only one insertion rule, there is only one context, $(c^3, \lambda)$. Therefore, the system $\gamma_7$ is 2-unambiguous. The system $\gamma_7$ is minimal in the measures {TLen−RCon, TLen−RCon+InsStr}. Note that the insertion system $\gamma_7$ is not minimal in the measures {TLen−LCon, TLen−LCon+InsStr}.

To generate the strings of the form $a^2b^kc^2$, $k \geq 0 \in L_8$, the insertion rule should have the string $b$.
However, if such an insertion string is present in any of the insertion rules, then the system $\gamma_8$ may generate some strings $a^2b^{3p}$, $p \geq 1$ and $b^{2q}c^2$, $q \geq 1$ which $\notin L_8$. From the above claim, all the parts of $L_8$ cannot be produced by the insertion string $b$, which implies mLen−InsStr($\gamma_8$) $= 2$. Next, we discuss the measure TINS−Str. Since the strings of the structure $a^2b^{3p}$, $p \geq 1$ belong to $L_8$, some insertion rule will certainly have the string $b^3$. Likewise, since the strings of the structure $b^{2q}c^2$, $q \geq 1$ belong to $L_8$, some insertion rule will certainly have the string $b^2$. Therefore, we conclude MLen−InsStr($\gamma_8$) $\geq 3$ and TINS−Str($\gamma_8$) $\geq 5$.

Next, we discuss the measure Ax. Any system which generates $L_8$ will have the three axioms $a^2b^3$, $b^2c^2$, $a^2c^2$. We now discuss why the system should also have the axiom $a^2bc^2$. If the system does not have the axiom $a^2bc^2$, it could presumably be generated from the axiom $a^2c^2$ by inserting the string $b$. But we have previously proved that the system cannot have $b$ as the string to be inserted. Therefore, $a^2bc^2$ should be present among the axioms, and the system $\gamma_8$ is minimal in the measure Ax.
The system will produce a unique derivation step for any word in $L_8$, starting from an axiom by inserting the string $b^6$, which shows that $\gamma_8$ is 0-unambiguous. As the system uses the

The system $\gamma_9$ is minimal with respect to {Ax, mLen−InsStr}. First, we will prove this for the measure mLen−InsStr. From

Next, we discuss the measure Ax. Any system which generates $L_9$ will have the three axioms $ba^2$, $a^4c$, $bc$. We now discuss why the system $\gamma_9$ should have the axiom $bac$. If the system does not have the axiom $bac$, it could presumably be generated from the axiom $bc$ by inserting the string $a$. But we have previously proved that the system cannot have $a$ as the string to be inserted. Therefore, $bac$ should be present among the axioms, and the system $\gamma_9$ is minimal in the measure Ax.

Consider any system $\gamma_9$ which generates $L_9$ and is minimal in the measures Ax, mLen−InsStr. In order to claim that $\gamma_9$ is 0-ambiguous, let us take the strings $ba^{2r}$ and $a^{4s}c \in L_9$, for large values of $r$ and $s$. To produce the words of the form $ba^{2p}$, $p \geq 1$ and $a^{4q}c$, $q \geq 1$, the insertion system $\gamma_9$ should have strings of the form $a^{2t}$, $t \geq 1$ and $a^{4u}$, $u \geq 1$ respectively. Consider a word $ba^{4tm+2un}c \in L_9$, for $m \geq 1$, $n \geq 0$. This word can be obtained from two distinct axioms, $bc$ and $bac$. Starting from the axiom $bc$, the word $ba^{4tm+2un}c$ can be obtained by inserting the string $a^{2t}$ $m$ times and the string $a^{4u}$ $n$ times. On the other hand, the word $ba^{4tm+2un}c$ can be derived from the axiom $bac$. In one derivation, the string $a^{2t}$ can be inserted $m - i_1$ times, $i_1 \geq 1$. In another derivation, the string $a^{4u}$ can be inserted $n + i_2$ times, $i_2 \geq 1$. Thus, the word $ba^{4tm+2un}c$ is obtained from two different axioms $bc$, $bac$. Therefore, the system $\gamma_9$ is 0-ambiguous.
Next, we have to prove that $L_9$ is 0-unambiguous, by showing that there exists a 0-unambiguous system $\gamma_9 = (\{a, b, c\}, \{ba^2, ba^4, a^4c, bc, bac, ba^2c, ba^3c, ba^4c\}, \{(a, a^4, \lambda)\})$ which generates $L_9$. The system will produce a unique derivation step for any word in $L_9$, starting from an axiom by inserting the string $a^4$, which shows that $\gamma_9$ is 0-unambiguous. As the system uses the insertion rule $(a, a^4, \lambda)$, the system $\gamma_9$ is minimal in the measures {MLen−RCon, mLen−RCon, TLen−RCon+InsStr, MLen−RCon+InsStr}.
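The claim that no word of this system is reachable from two different axioms can be checked mechanically up to a length bound. The sketch below reads 0-ambiguity as "some word is derivable from two different axioms", which is how the argument above uses the term; the function names, the bound of 14 symbols, and the contrasting system with axioms `bc`, `bac` and rule `(b, a, λ)` are assumptions introduced for illustration.

```python
from collections import deque

def one_step(word, rules):
    """Yield every word obtained from `word` by one insertion rule (u, beta, v)."""
    for u, beta, v in rules:
        for i in range(len(word) + 1):
            if word[:i].endswith(u) and word[i:].startswith(v):
                yield word[:i] + beta + word[i:]

def reachable(axiom, rules, max_len):
    """All words of length <= max_len derivable from a single axiom."""
    seen = {axiom}
    queue = deque([axiom])
    while queue:
        x = queue.popleft()
        for y in one_step(x, rules):
            if len(y) <= max_len and y not in seen:
                seen.add(y)
                queue.append(y)
    return seen

def axiom_sources(axioms, rules, max_len):
    """Map each generated word to the set of axioms it can be derived from."""
    sources = {}
    for axiom in axioms:
        for word in reachable(axiom, rules, max_len):
            sources.setdefault(word, set()).add(axiom)
    return sources

# The system gamma_9 given in the text, with a^4 written out as "aaaa".
axioms_9 = ["baa", "baaaa", "aaaac", "bc", "bac", "baac", "baaac", "baaaac"]
sources_9 = axiom_sources(axioms_9, [("a", "aaaa", "")], max_len=14)
unique_axiom = all(len(s) == 1 for s in sources_9.values())

# Hypothetical contrast: "bac" is both an axiom and derivable from "bc".
sources_bad = axiom_sources(["bc", "bac"], [("b", "a", "")], max_len=5)
```

A bounded check like this is only evidence, not a proof, but it matches the observation that inserting $a^4$ preserves the number of $a$'s modulo 4, so each word points back to exactly one axiom.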

In this section, we analyze the application and significance of the trade-off in natural languages and in the modelling of bio-molecular structures. Before moving on to the application, we first discuss the controlling parameters and limitations of

VOLUME 10, 2022

In addition, many natural languages contain sentences beyond context-free [7], [28]. In this regard,

as the sentence has semantic ambiguity, since it can be interpreted in different ways. The different interpretations of the above mentioned sentence can be: whether any group is hunting for dogs, whether the category of dogs belongs to the hunting type, or whether the phrase hunting dogs refers to a music band, a basketball team or a secret code. In fact, the correct phrases of the sentence are They are, They are hunting, They are dogs, They are hunting dogs. Assume that we want to construct an insertion system which generates the above sentence. As there is no concept of non-terminals (variables) in an insertion system, it can be called a pure grammar. Since the insertion system is a pure grammar, every derivation step should represent a correct phrase; the correct phrases are They are, They are hunting, They are dogs, They are hunting dogs. Consider 'They are' as the axiom and insertion rules of the form (They are, dogs, $\lambda$) and (They are, hunting, $\lambda$).

By using the above axiom and the insertion rules, the derivations can be of the following forms (the underlined words indicate the inserted string): (1) They are $\Rightarrow$ They are $\underline{\text{dogs}}$ $\Rightarrow$ They are $\underline{\text{hunting}}$ dogs, which gives all three correct phrases. (2) They are $\Rightarrow$ They are $\underline{\text{hunting}}$ $\Rightarrow$ They are $\underline{\text{dogs}}$ hunting, which is not a correct phrase. So, with the above insertion rules, all the correct phrases cannot be generated. However, if we consider the three insertion rules (They are, dogs, $\lambda$), (They are, hunting, $\lambda$), (They are, hunting, dogs), all the correct phrases can be derived from the axiom; alternatively, using different insertion rules we may get all the correct phrases of the sentence, but the number of insertion rules will be larger. So, to derive the above sentence, we need three insertion rules.
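The two derivations above can be replayed on word sequences instead of character strings. The following sketch treats each sentence and each rule component as a tuple of words; the function names and the bound of four words are assumptions made for illustration, and the rules are the two rules (They are, dogs, λ) and (They are, hunting, λ) from the example.

```python
from collections import deque

def one_step(sentence, rules):
    """Yield every sentence reachable by one insertion.

    Sentences and rule parts (left context, inserted words, right context)
    are tuples of words; the empty tuple plays the role of lambda.
    """
    for left, inserted, right in rules:
        for i in range(len(sentence) + 1):
            left_ok = i >= len(left) and sentence[i - len(left):i] == left
            right_ok = sentence[i:i + len(right)] == right
            if left_ok and right_ok:
                yield sentence[:i] + inserted + sentence[i:]

def generated(axiom, rules, max_words):
    """All sentences with at most max_words words derivable from the axiom."""
    seen = {axiom}
    queue = deque([axiom])
    while queue:
        current = queue.popleft()
        for nxt in one_step(current, rules):
            if len(nxt) <= max_words and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

THEY_ARE = ("They", "are")
TWO_RULES = [(THEY_ARE, ("dogs",), ()), (THEY_ARE, ("hunting",), ())]
phrases = {" ".join(s) for s in generated(THEY_ARE, TWO_RULES, max_words=4)}
```

With these two rules the generated set contains the four correct phrases, and it also contains the incorrect phrase "They are dogs hunting" from derivation (2), which is the over-generation the example points at.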

Such sentences can be stored compactly if there exists an unambiguous system which generates them, but such a system may not be minimal with respect to the measure(s). As insertion systems are found to be one of the prominent (rewriting) grammar mechanisms, they can be recognized as one of the fit (rewriting) mechanisms to generate natural languages [30]. The above example clearly shows that the sentence can be generated by an unambiguous system that is not minimal in terms of the components used to generate the sentence. The above case study makes explicit the importance of studying the trade-off in natural languages.

In computational biology, there are many research problems based on the gene sequence that need to be addressed, such as gene structure prediction, gene sequence alignment, bio-molecular modelling, and construction of phylogenetic trees. Gene structure prediction and bio-molecular modelling problems are approached with relevant pattern matching algorithms. The above discussed computational biology problems are somewhat akin to investigating the structural frameworks in computational linguistics. The gene structure prediction and bio-molecular modelling problems can be handled in an effective and succinct manner if there exists a unique grammar model/system which generates/models the structure. To model or predict a structure, it should first be expressed as a gene sequence. Such sequences can be visualized as strings formed over the four basic chemical symbols $a$, $t$, $g$ and $c$ (DNA). The complements of these four chemical symbols are given as $\bar{a} = t$, $\bar{g} = c$, $\bar{t} = a$, $\bar{c} = g$. As the bio-molecular structures can be expressed in terms of (gene) sequences, this has kindled researchers to study the connection and application of formal language theory and computational biology [42]. In addition to that, the genetic

In both the derivations, the same sequence $cgatatgccg$ is derived from two different axioms $cg$ and $at$. Therefore, the system $\gamma_{od}$ is 0-ambiguous.

Case 2: Consider an orthodox string $cgtagccgat \in L_{od}$, which can be obtained by two different ordered CS:

In CS1, the order of the gene sequences used by the insertion rules is $ta$, $gc$, $cg$, $at$, $cg$, whereas in CS2, the order of the gene sequences used by the insertion rules is $ta$, $cg$, $at$, $gc$, $cg$. Thus, the gene sequence $cgtagccgat$ can be derived by two different ordered CS. Therefore, the system $\gamma_{od}$ is also 1-ambiguous.

Case 3: Consider the string $atcgcgta \in L_{od}$, which can be derived in two different descriptions by $\gamma_{od}$, which are given below:

$atcgcgta$.

In both the descriptions the axiom is the same, $cg$, and the contexts used in the insertion rules, $(\lambda, \lambda)$, are also the same, but the positions at which the gene sequence $y\bar{y}$ is inserted are different. Therefore, the system $\gamma_{od}$ is also 5-ambiguous.
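The situation in Case 3, same axiom and same contexts but different insertion positions, can be made concrete by enumerating derivations. The sketch below assumes that the rule $(\lambda, y\bar{y}, \lambda)$ stands for inserting any of the pairs at, ta, gc, cg at any position, and it records a description as the sequence of (position, inserted pair) steps; both the representation and the function name `descriptions` are assumptions made here, not the paper's formal definition of a description.

```python
COMPLEMENT = {"a": "t", "t": "a", "g": "c", "c": "g"}
PAIRS = [y + COMPLEMENT[y] for y in "atgc"]  # at, ta, gc, cg

def descriptions(axiom, target):
    """Return every derivation of `target` from `axiom`.

    Each derivation is a tuple of (position, inserted_pair) steps; every step
    uses the empty context (lambda, lambda), so only positions can differ.
    """
    found = []

    def extend(word, steps):
        if word == target:
            found.append(tuple(steps))
            return
        if len(word) >= len(target):
            return  # insertions only lengthen the word, so this branch is dead
        for pos in range(len(word) + 1):
            for pair in PAIRS:
                extend(word[:pos] + pair + word[pos:], steps + [(pos, pair)])

    extend(axiom, [])
    return found

case3 = descriptions("cg", "atcgcgta")
```

For the axiom `cg` and the target `atcgcgta` the enumeration returns more than one step sequence, for example one that starts by inserting `at` at position 0 and one that starts by inserting `cg` at position 2, which illustrates the positional ambiguity described above.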

The above example gives clear evidence of the existence of different levels of ambiguity for the same language $L_{od}$ on different gene sequences. In addition, the above (ambiguity) example reveals that the analysis of ambiguity issues in gene sequences has to be carried out with utmost care, because ambiguity issues play a pivotal role in some computational biology problems such as protein sequence analysis, parallel gene recognition, and prediction of gene locations. For more practical applications on the importance of ambiguity in gene sequences, we refer to [1]

The axiom, the intermediate sequences and the final sequence to be generated can be represented as a tree. If the intermediate gene sequences are different, then we have more than one phylogenetic tree for the same gene sequence. Such a study of the different intermediate sequences will help us to study more on the inheritance properties. The following example of a phylogenetic tree will give a better understanding of the ambiguity. One such phylogenetic tree is shown in Figure 1. In Figure 1

by two different paths from the root node. The above scenario clearly shows that a different perspective can be given in the visualization of ambiguity in phylogenetic trees. On the other hand, while predicting the gene structure, we need an optimal system, and at the same time the system which generates/models the bio-molecular structure should be unambiguous. Consider the system $\gamma_{od} = (\{y, \bar{y}\}, \{\lambda\}, \{(\lambda, y\bar{y}, \lambda)\})$ which generates $L_{od}$. One insertion rule is enough to generate all the strings in $L_{od}$. The system $\gamma_{od}$ is minimal in {Ax, MLen−LCon, MLen−RCon, mLen−LCon, mLen−RCon, TLen−LCon, TLen−TCon}. The language $L_{od}$ can be generated by an unambiguous system, but the unambiguous system will definitely not be minimal in the above mentioned measures. This example shows the importance and application of the trade-off study between complexity measures and ambiguity levels in the modelling of bio-molecular structures.
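Membership in the language generated by a system of this shape can be tested with a simple stack reduction. The sketch assumes that $y$ ranges over $a$, $t$, $g$, $c$ with the complements given earlier, and that $L_{od}$ is exactly the set of strings obtained from $\lambda$ by repeatedly inserting a pair $y\bar{y}$ anywhere; under that reading a string belongs to the language precisely when cancelling adjacent complementary pairs reduces it to the empty string. The function name `in_l_od` is hypothetical.

```python
COMPLEMENT = {"a": "t", "t": "a", "g": "c", "c": "g"}

def in_l_od(sequence):
    """Check whether `sequence` reduces to the empty string.

    Scans left to right, cancelling each symbol against the stack top when
    the two form a complementary pair (at, ta, gc or cg).
    """
    stack = []
    for symbol in sequence:
        if symbol not in COMPLEMENT:
            return False  # not a string over {a, t, g, c}
        if stack and stack[-1] == COMPLEMENT[symbol]:
            stack.pop()   # adjacent complementary pair cancels
        else:
            stack.append(symbol)
    return not stack

# The three sequences discussed in Cases 1 to 3.
case_sequences = ["cgatatgccg", "cgtagccgat", "atcgcgta"]
all_members = all(in_l_od(s) for s in case_sequences)
```

The check runs in linear time and says nothing about how many derivations a string has; it only separates the membership question from the ambiguity question studied in the three cases.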

In this section, we discuss the comparative study of the trade-off results obtained for insertion systems and their applications in natural languages and in the modelling of bio-molecular structures. Table 6 shows the comparative study of the proposed results and applications with other relevant grammar models. From the comparative study, it is evident that the work on insertion systems, insertion-deletion systems, and variants of insertion-deletion systems is mainly directed towards reducing the weights in simulating the recursively enumerable languages by means of suitable normal forms, whereas in this paper we have defined some new descriptional complexity measures and analyzed the trade-off between ambiguity levels and descriptional complexity measures. In addition, we have discussed the application of the analyzed trade-off, which was missing in the various research works carried out on insertion systems.

Turkey, and the African University of Science and Technology, Abuja, Nigeria. He led several funded research projects and supervised several graduate (M.Sc. and Ph.D.) and undergraduate students. He edited three books and has more than 100 papers in major international journals and conferences published by major publishers, such as IEEE, ACM, Elsevier, and Springer. His research interests include artificial intelligence and learning technologies. He is also interested in smart devices (such as smartphones and tablets) applications development and innovation. He is a member of the editorial board of several international journals and a program committee member of several international conferences. He is a member of the IEEE Technical Committee on Multimedia, and the IEEE Technical Committee on Learning Technologies. He is a Senior Member of IEEE and ACM Computer Societies.

MANUEL MAZZARA received the Ph.D. degree in computing science from the University of Bologna, Italy. He is currently a Professor of computer science at Innopolis University, Russia, with a research background in software engineering, service-oriented architecture, concurrency theory, formal methods, and software verification. He is also the Director of the Institute of Software Development and Engineering and the Head of the International Cooperation Office at Innopolis University. He published many relevant and highly-cited papers, in particular in the field of service engineering and software architectures. He has collaborated with European and U.S. industries, plus governmental and inter-governmental organizations, such as the United Nations, always at the edge between science and software production. The work conducted by Dr. Manuel Mazzara and his team in recent years focuses on the development of theories, methods, tools, and programs covering the two major aspects of software engineering: the process side, related to how we develop software, and the product side, concerning the results of this process.