Genome-wide Analysis to Identify Palindromes, Mirror and Inverted Repeats in SARS-CoV-2, MERS-CoV and SARS-CoV-1

Research on SARS-CoV-2 is in full swing to understand the origin and evolution of this deadly virus, which may in turn support its rapid detection. To this end, atypical genomic sequences that may be unique to SARS-CoV-2, or to the Coronaviridae family in general, can be investigated. Such sequences in virus genomes may be responsible for target prediction, replication, defence mechanisms and viral packaging. This has motivated us to explore different types of repeats, namely palindromes, mirror repeats and inverted repeats, in SARS-CoV-2, MERS-CoV and SARS-CoV-1. For this purpose, the reference sequence of each of SARS-CoV-2, MERS-CoV and SARS-CoV-1 is divided into descriptors, that is, subsequences of length k, using the k-mer technique. These descriptors are then represented as a collection of tokens, which are subsequently used to identify palindromes, mirror repeats and inverted repeats in the respective reference sequence. The highest numbers of palindromes, mirror repeats and inverted repeats are identified for descriptor length 10. For this length, the numbers of palindromes are 38, 42 and 33 and the numbers of mirror repeats are 52, 38 and 33 for SARS-CoV-2, MERS-CoV and SARS-CoV-1 respectively. For inverted repeats, with a descriptor length of 10 and an intervening length of 5, the values are 59, 56 and 70 respectively. Moreover, the identified repeats are searched for in 108246, 291 and 340 SARS-CoV-2, MERS-CoV and SARS-CoV-1 virus sequences respectively to find the population coverage of such repeats. The coverage exceeds 99% in most cases and reaches 100% in some. Furthermore, the GC contents of these repeats, which mostly lie between 20% and 50%, are evaluated in order to understand their binding efficacy.


I. INTRODUCTION
The outbreak of Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2), the virus causing COVID-19, has caused more than five million deaths worldwide, prompting research around the globe to control the virus. The most common symptoms of COVID-19 include fever, dry cough, dyspnea, headache and pneumonia, while the less common symptoms are gastrointestinal symptoms, myalgias, rhinorrhea and chest pain [1]; the fatality rate is high in patients suffering from cancer, diabetes or cardiovascular disorders [2]. SARS-CoV-2 belongs to the family Coronaviridae, which also includes the MERS-CoV and SARS-CoV-1 viruses [1], [3]. Research on SARS-CoV-2 is in full swing to understand the origin and evolution of this deadly virus. This may lead to immune-related studies of viruses [4]. To this end, atypical genomic sequences that may be unique to SARS-CoV-2, or to the Coronaviridae family in general, can be investigated [5]-[8]. Studies [9]-[11] have suggested that such sequences in virus genomes may be responsible for target prediction, replication, defence mechanisms and viral packaging. In this regard, Goswami et al. [6] have identified inverted repeats in hotspot mutations using the Palindrome analyzer webserver. Siddique et al. [12] have also identified simple sequence repeats using web tools such as Sequence Repeat Identification Tool and IMEx-web: Imperfect Microsatellite Extraction Webserver. From the biological perspective as well, repeats are correlated with several phenomena. They are known to form functional associations between genes in genetic networks and to be involved in the co-expression of genes as well as of an entire gene or operon [13]. Moreover, inverted repeats have many important biological functions; they play a significant role in genome instability and may lead to mutation and disease.
The aforementioned studies have motivated us to explore different types of repeats, namely palindromes, mirror repeats and inverted repeats, in SARS-CoV-2, MERS-CoV and SARS-CoV-1. A palindromic sequence is a symmetrical sequence such that reading it in the reverse direction gives its exact complement; equivalently, it is identical to its own reverse complement. For example, TGCA is a palindrome of length 4. Note that a palindrome is always even in length. A mirror repeat, on the other hand, is a sequence whose second half is the reverse of its first half on the same strand of DNA. For example, TTAGGATT is a mirror repeat. Like palindromes, mirror repeats are also even in length. An inverted repeat is a sequence that is followed downstream by its reverse complement, separated by an intervening sequence whose length can vary from zero to any number. For example, CGTAAxxxxxxTTACG is an inverted repeat, where xxxxxx is the intervening sequence of length 6. If the intervening length is zero, the inverted repeat is a palindrome, as in CGTAATTACG. The significance of each of these repeats is manifold. As a palindromic sequence reads the same in both directions, it is important for replication, repair mechanisms and transcription. These repeated sequence patterns are thus widely studied for their fundamental importance in understanding genome function and organization [13].
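The three definitions above can be written down as simple string tests. The following is a minimal Python sketch for illustration only; it is not the implementation used in this work, and all function names are hypothetical. It assumes sequences over the alphabet A, C, G, T and, following the definition above, treats only even-length sequences as mirror repeats.

```python
# Illustrative helpers only; the names are hypothetical and not taken from the paper.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_palindrome(seq: str) -> bool:
    """A palindrome equals its own reverse complement."""
    return seq == reverse_complement(seq)


def is_mirror_repeat(seq: str) -> bool:
    """A mirror repeat reads the same when reversed on the same strand.

    Assumption: only even-length sequences qualify, as stated in the text.
    """
    return len(seq) % 2 == 0 and seq == seq[::-1]


def is_inverted_repeat(seq: str, arm: int, gap: int) -> bool:
    """An arm of length `arm`, `gap` intervening bases, then the arm's reverse complement."""
    if len(seq) != 2 * arm + gap:
        return False
    left, right = seq[:arm], seq[arm + gap:]
    return right == reverse_complement(left)
```

With these helpers, the examples from the text behave as described: TGCA and CGTAATTACG pass `is_palindrome`, TTAGGATT passes `is_mirror_repeat`, and CGTAA followed by any six bases and TTACG passes `is_inverted_repeat` with `arm=5` and `gap=6`.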
In this work, the reference sequences of SARS-CoV-2, MERS-CoV and SARS-CoV-1 are divided into descriptors, that is, subsequences of length k, using the popular k-mer technique. These descriptors are then represented as a collection of tokens, which are subsequently used to identify palindromes, mirror repeats and inverted repeats in the respective reference sequence. In this regard, repeats of length 10 to 20 are identified in the reference sequences of SARS-CoV-2, MERS-CoV and SARS-CoV-1. Subsequently, the identified repeats are searched for in 108246, 291 and 340 SARS-CoV-2, MERS-CoV and SARS-CoV-1 virus sequences respectively to find their population coverage. Moreover, the GC contents of these repeats are evaluated in order to understand their binding efficacy. Furthermore, repeats are also identified in mutated variants of the virus such as Alpha, Beta, Delta and Gamma. It is worth mentioning that we have used our own method to identify palindromes, mirror repeats and inverted repeats, in contrast to works such as [6] and [12], which have used online tools for the identification of repeats. Note also that, although the author of [14] has identified palindromes in SARS-CoV-2 genomes, that identification is not biologically appropriate; this has further motivated us to correctly identify the different types of repeats in SARS-CoV-2, MERS-CoV and SARS-CoV-1. Palindromic sequences can also be identified using tools such as http://www.biophp.org/minitools/find_palindromes/demo.php, https://www.bioinformatics.nl/cgi-bin/emboss/palindrome, https://www.novoprolabs.com/tools/dna-palindrome and those reported in [13] and [15]. However, from the perspective of SARS-CoV-2, our work is the first of its kind to provide a comprehensive picture of palindromes, mirror repeats and inverted repeats. Moreover, the code used to identify such repeats is freely available to readers, which is not the case for many of the works in the literature.

II. MATERIALS AND METHODS
In this section, the various terms used in this work, the collection of the datasets for SARS-CoV-2, MERS-CoV and SARS-CoV-1 genomes and the proposed pipeline are discussed.

A. BACKGROUND
Before delving into the methodology, the terms used in this work are briefly discussed. The aim of this work is to find palindromic sequences, mirror repeats and inverted repeats in SARS-CoV-2, MERS-CoV and SARS-CoV-1. Palindromes, mirror repeats and inverted repeats have already been described in the Introduction. Descriptors are small segments, or subsequences, of length k into which the reference sequence of each virus is divided in order to identify the different repeats considered in this work.

B. DATA ACQUISITION
The pipeline of the work is given in Figure 1

C. PIPELINE OF THE WORK
This work is executed according to the pipeline given in Figure 1(a), while Figure 1(b) provides a demonstration of the workflow.
Algorithm 1 presents in detail the workflow of Figure 1(b) for the identification of palindromic sequences, mirror repeats and inverted repeats. To identify each such repeat, the reference sequence χ of each virus is first divided into subsequences, or descriptors, of length k using the popular k-mer technique and stored in C. Thereafter, these descriptors are represented as a collection of words, or tokens, using the function tokenizedDocument and stored in P. Subsequently, each token is checked to see whether it qualifies as a palindromic sequence, a mirror repeat or an inverted repeat. For palindromes, the reverse complement of each token stored in P is calculated; a token that is identical to its reverse complement is a palindrome. For mirror repeats, each token is flipped in the horizontal direction, that is, reversed; a token that is identical to its flipped version is a mirror repeat. Finally, for inverted repeats, the starting positions (start_pos_F) of the tokens in the reference sequence are calculated first. From these, the starting positions (start_pos_rev) of the probable reverse complements are calculated by considering k and η, where η is the intervening length of the inverted repeat. Based on start_pos_rev and the corresponding end positions (end_pos_rev), the sequences lying between these two positions are extracted from the reference sequence χ. If an extracted sequence matches the reverse complement of the original token, the original token is considered to be an inverted repeat. Once the repeats are identified, their population coverage is calculated over all the virus sequences. Furthermore, the GC content of the identified repeats is also reported.
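The identification steps described above can be sketched as a single pass over the k-mers of a reference sequence. The Python code below is an illustrative sketch rather than the implementation used in this work, which relies on the functions named in the text. The names `kmers` and `find_repeats` are hypothetical, plain strings stand in for the tokens, positions are reported as 1-based coordinates, and k is taken to be the length of one arm of an inverted repeat; all of these are assumptions.

```python
from typing import Dict, Iterator, List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(reference: str, k: int) -> Iterator[Tuple[int, str]]:
    """Yield (1-based start, descriptor) for each of the N - k + 1 windows."""
    for i in range(len(reference) - k + 1):
        yield i + 1, reference[i:i + k]


def find_repeats(reference: str, k: int, eta: int) -> Dict[str, List[Tuple[int, str]]]:
    """Classify every descriptor of length k (assumed even) in the reference.

    eta is the intervening length used for inverted repeats.
    """
    found: Dict[str, List[Tuple[int, str]]] = {
        "palindrome": [], "mirror": [], "inverted": [],
    }
    for start, token in kmers(reference, k):
        rc = reverse_complement(token)
        if token == rc:
            found["palindrome"].append((start, token))
        if token == token[::-1]:
            found["mirror"].append((start, token))
        # 0-based start of the candidate partner: skip the token and eta bases.
        rev_start = (start - 1) + k + eta
        partner = reference[rev_start:rev_start + k]
        # A short slice means the partner would run past the end of the reference.
        if len(partner) == k and partner == rc:
            found["inverted"].append((start, token))
    return found
```

For instance, calling `find_repeats("CGTAAGGGGGGTTACG", k=5, eta=6)` reports the descriptor CGTAA at position 1 as the only inverted repeat, since TTACG follows it after six intervening bases.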
The time complexity of the proposed method for finding palindromes, mirror repeats and inverted repeats, as given in Algorithm 1, can be calculated as follows. Let the length of the reference sequence be N. The time complexity of finding the descriptors using kmercount is O(N) (Line number 3). The time complexity of navigating through the sequence of descriptors returned by kmercount is O(α), where α (= N − k + 1) is the number of descriptors. So, the time complexity of representing the descriptors as tokens is O(αk) (Line number 4), where k is the length of each of the α descriptors. The time complexity of navigating through the tokens is O(α) as well. Line 7 has a time complexity of O(k) and so do lines
Algorithm 1: Pseudo-code for identification of palindromic sequence, mirror repeat and inverted repeat
Input: χ (reference sequence of virus genomes), k (length of the descriptors of probable repeats), η (intervening length for inverted repeat)
Output:

III. RESULTS
The experiments in this work are conducted using MATLAB R2021a and the results following Figure 1(a) are given next. Figure 1(b) provides a glimpse of each of the repeats identified in this work. As can be seen from the figure, ACCAGTTAACTGGT is a palindrome because, when read in the reverse direction, it gives TGGTCAATTGACCA, which is its exact complement. AATGACTTCAGTAA is a mirror repeat, as the last 7 nucleotides, TCAGTAA, are a mirror image of the first 7, that is, AATGACT. Finally, AACACTTCxxxxxGAAGTGTT is an example of an inverted repeat, as GAAGTGTT is the exact reverse complement of AACACTTC, with an intervening length of 5 (marked by x). The total numbers of palindromes, mirror repeats and inverted repeats in the reference genomic sequence of each of SARS-CoV-2, MERS-CoV and SARS-CoV-1 are shown in Tables 1 and 2. The intervening length for the inverted repeats has been varied from 5 to 8. For SARS-CoV-2, inverted repeats are present only for lengths 18 and 20, with intervening lengths of 5 and 7, while for MERS-CoV and SARS-CoV-1, such sequences are present for length 18 and intervening length 5. The results for palindromes, along with the corresponding coding gene, population coverage and GC content, are given in Table 3 with the descriptor length varying from 14 to 20. For length = 20, the palindrome for SARS-CoV-2 is ACACTGGTAATTACCAGTGT; its genomic location in the reference sequence is 5745 and the coding gene is NSP3. The corresponding population coverage and GC content are 99.82% and 40% respectively. For MERS-CoV, with a descriptor length of 18, the palindromic sequence is CTATAGAGATCTCTATAG at genomic location 16171; it belongs to the coding gene RNA-dependent RNA polymerase, with a population coverage and GC content of 28.52% and 33.33% respectively. For SARS-CoV-1, for a length of 20, the palindrome is CTTTAACAAGCTTGTTAAAG at genomic location 25963 and belongs to the coding gene ORF3a. The corresponding population coverage and GC content are 91.17% and 30% respectively.
The findings in this work are solely based on the proposed technique. In this regard, apart from the identification of the repeats, their population coverage and GC content are also elaborated. The corresponding results for mirror repeats and inverted repeats are given in Tables 4 and 5 respectively. For length = 20, the mirror repeat for SARS-CoV-2 is CTCAATGACTTCAGTAACTC and its genomic location in the reference sequence is 9977. The coding gene for the sequence is NSP4, with a population coverage of 99.56% and a GC content of 40%. For MERS-CoV, for a mirror repeat length of 14, one of the sequences is CAGATGTTGTAGAC at genomic location 12574, which belongs to the coding gene NSP8, while for SARS-CoV-1, for a length of 16, the mirror repeat is CTACTGACCAGTCATC at genomic location 7643, which belongs to the coding gene NSP3. For inverted repeats with a length of 20 and an intervening length of 7, as reported in Table 5, the SARS-CoV-2 sequence is ATAAAGAACTxxxxxxxAGTTCTTTAT, where 'xxxxxxx' denotes the intervening sequence of length 7. In this case the intervening sequence is TTAAGTC. The sequence ATAAAGAACTTTAAGTCAGTTCTTTAT (obtained by substituting TTAAGTC for the x's in ATAAAGAACTxxxxxxxAGTTCTTTAT) has the starting coordinate 15775 and belongs to the coding gene RNA-dependent RNA polymerase, with a population coverage of 99.88% and a GC content of 22.22%. According to [16], [17], targeting GC-rich genes is difficult, and thus repeats with moderate GC content can be regarded as good candidates for target sites of a virus. The repeats identified in this work can therefore be considered to be such targets. Examples of each type of repeat are provided in Figure 1(b) as well, where ACCAGTTAACTGGT is a palindrome, AATGACTTCAGTAA is a mirror repeat and ACACTTCxxxxxGAAGTGT is an inverted repeat in which the intervening sequence of length 5 is CCACA (marked in orange in the figure).
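The GC content values quoted above follow directly from counting G and C bases, and population coverage can be read as the percentage of virus sequences that contain a given repeat. The Python sketch below illustrates both quantities. It is not the code used in this work: the function names are hypothetical, and it assumes that coverage is computed by exact substring matching, which the text does not state explicitly.

```python
from typing import Iterable


def gc_content(seq: str) -> float:
    """Percentage of G and C bases in a sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def population_coverage(repeat: str, genomes: Iterable[str]) -> float:
    """Percentage of genomes containing the repeat.

    Assumption: a genome counts as covered if the repeat occurs in it
    as an exact substring.
    """
    genome_list = list(genomes)
    if not genome_list:
        raise ValueError("at least one genome is required")
    hits = sum(repeat in genome for genome in genome_list)
    return 100.0 * hits / len(genome_list)
```

As a check against the text, the SARS-CoV-2 palindrome ACACTGGTAATTACCAGTGT contains 8 G or C bases out of 20, giving 40%, and the 27-base inverted repeat ATAAAGAACTTTAAGTCAGTTCTTTAT contains 6 out of 27, giving 22.22%.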
Figure 1(c) provides pictorial representations of Tables 1 and 2, where '5', '6', '7' and '8' are the intervening lengths (η) of the inverted repeats. Figure 1(d) shows the repeats common to SARS-CoV-2, MERS-CoV and SARS-CoV-1 for descriptor length (k) 10. It can be seen from the figure that, although some of the repeats are shared, most of them differ. The similarity may be attributed to the fact that all three viruses belong to the same family, Coronaviridae, while their structures have changed under evolutionary pressure. All the results for the identified palindromes, mirror repeats and inverted repeats of SARS-CoV-2, MERS-CoV and SARS-CoV-1, along with the coding gene, population coverage and GC content, are given in the supplementary material as Excel files.
Furthermore, major mutations of the original reference sequence of the virus have led to important variants like Alpha, Beta, Delta and Gamma. We have also identified the palindromes, mirror repeats and inverted repeats in such variants. As can be seen from Figure 2, most of such repeats are common among all the variants. In Figure 2

V. LIMITATIONS OF THE STUDY
Extensive experiments have been performed in this study to find palindromes, mirror repeats and inverted repeats in SARS-CoV-2, MERS-CoV and SARS-CoV-1. However, the study has certain limitations. First, there are about 7 species of coronavirus, namely 229E, NL63, OC43, HKU1, MERS-CoV, SARS-CoV-1 and SARS-CoV-2, that transmit from animals to humans, but in this study we have considered only SARS-CoV-2, MERS-CoV and SARS-CoV-1. The study can be extended to the other species, which we intend to do in our future work. Second, in vivo verification can add further significance to the study. We intend to collaborate with biologists in the future to perform it.
... to understand the binding efficacy with the corresponding complements while targeting such repeats for detecting the virus. Hence, further research in this direction can be conducted using the identified repeats. Moreover, such repeats are identified in major variants of SARS-CoV-2 such as Alpha, Beta, Delta and Gamma, and it is seen that most of the repeats, being conserved in nature, are common across the original sequence and the different variants. As future work, parameterized matching [18] can be considered for identifying repeats.

AVAILABILITY OF DATA AND MATERIALS
All the SARS-CoV-2, MERS-CoV and SARS-CoV-1 virus genomes with their corresponding reference sequences and the final results of this work are available at "http://www.nitttrkol.ac.in/indrajit/projects/COVID-Repeats/".