DNA Sequences Compression by GP²R and Selective Encryption Using Modified RSA Technique

Humans have always been fascinated by the possibility of acquiring more information in the minimum possible time and space. Effective lossless compression, efficient data structures, and fast searching of DNA (deoxyribonucleic acid) data are essential because they make the data easier to access and communicate. The proposed algorithm is a new lossless compression algorithm that works in two tiers. First, it searches for exact Genetic Palindrome (GP), Palindrome (P), and Reverse (R) substrings [GP²R]; each reported substring is replaced by a corresponding ASCII character, and the mapping is recorded in a Library file. Because it holds the ASCII codes, the Library file acts as a signature and also secures the data. Second, a modified RSA technique is proposed for selective encryption, an approach that reduces the computational resources needed for very large DNA data. The experimental work shows that encrypting 44% to 45% of the original sequence alters more than 95% of the original file. The technique achieves a compression rate of 3.851273 bits per base. The complexity of the algorithm is O(n), and its running time is a few seconds. This is a hybrid approach to compression and encryption. To reduce the compression rate further, the first-pass output is compressed again by a second pass, but the second pass is lossy. The experiments are performed on benchmark DNA sequences.


I. INTRODUCTION
The amount of DNA extracted from organisms and sequenced is increasing exponentially [1]. This raises two problems: storage and safe transmission. The storage problem, which matters in practice, depends on the size needed for each base. DNA sequence data grow from megabytes (MB) to terabytes (TB) annually [2]-[8]. DNA has some logical organization [9]; hence, designing a data structure for storing, accessing, and efficiently processing it is challenging [10], [11]. A DNA database requires an efficient compression algorithm for storage. The available compression methods [12], [13] cannot be applied aptly to biological data because DNA sequences have special properties [14]. The DNA data of a living organism are nonrandom, so two-bit encoding techniques cannot be applied directly and have some limitations [14]. Huffman's lossless compression technique, in both its static and adaptive models, is not well suited because a DNA sequence contains very few distinct characters [10], [14]. Genomic data have distinctive characteristics, and many repetitions occur within a DNA sequence [10]. Some specific structures [10], [15], [16] are present in DNA sequences, and researchers have kept these in mind when proposing DNA compression algorithms. It is not easy to find the exact GP²R match positions in a long DNA sequence. The proposed algorithm has three parts: i) finding all exact GP²R substrings; ii) encoding the GP²R and non-match regions; and iii) encrypting the compressed file, the library file, or both using the modified RSA technique. The basic principle of this algorithm is quick and sensitive homology searching [17], which serves as our exact GP²R search engine.
The associate editor coordinating the review of this manuscript and approving it for publication was Lo'ai A. Tawalbeh.
The substring technique creates a library file online, and the ASCII characters are placed in the source file. We also developed further algorithms for one-to-one character matching, changing the sequence orientation, measuring file sizes, and so on.
In accordance with its function, DNA shows different properties from other kinds of data. Compression algorithms for text files exploit short repeated patterns and contextual similarity to achieve compression. These techniques cannot be applied successfully to DNA. Compression reduces the file size, and encryption ensures the security of a text file that is sent over an unreliable network such as the internet. Many algorithms have been developed, each with its own strengths and weaknesses. Data compression tries to minimize storage space, and encryption protects a nucleotide genome sequence from unauthorized use. If the genome sequence is sent from source to destination and an unauthorized user can access it and change it by inserting waste information, the structure of the genome changes and its potential use is lost. The sequence is compressed together with encryption, taking care that no data is lost. Nowadays, wireless transmission of DNA/RNA/protein sequences is essential. Low computational cost and small storage space are in wide demand. Information security, that is, keeping the data safe from hackers, is a hard problem. The proposed technique keeps the data safe from unauthorized users. We also compare the selective encryption technique with the RSA algorithm. The algorithm is also applied to artificial data. The process is shown in Fig. 1.
The proposed algorithm consists of three stages: i) discovering all exact GP²R substrings; ii) encoding the exact GP²R match regions and the non-match regions; and iii) encrypting the compressed file, the library file, or both using the modified RSA technique.
The proposed compression method provides two-tier security: i) the data are compressed and the sequence is encrypted, producing two separate text files one at a time, each containing ASCII codes drawn from 256 different characters; and ii) selective encryption with the modified RSA technique is applied. The principle of selective encryption is that a small part of the complete message is selected for encryption while the remaining part is left in clear, provided that security is not put at risk. Using the proposed selective encryption in RSA increases the speed of encryption. The choice of the part R is important from the viewpoint of security and depends on the type of medium. In the selective encryption technique, a part of the bitstream (R) is used rather than the whole bitstream (M). Here M is the plaintext message and R is the selected part, so M - R is the unencrypted part. The security level increases as R increases, but so does the encryption time. The R part is selectively encrypted using the RSA technique based on single-character, multiple-character, or alphanumeric-character selection. If the text is decrypted without the key value or with a wrong key, the resulting text differs from the original.
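The split of a message M into an encrypted part R and a clear part M - R can be illustrated with a minimal sketch. It assumes textbook RSA with toy parameters (p = 61, q = 53, e = 17) in place of the modified RSA technique, whose details are not given in this section; the function names and the choice of selected characters are hypothetical.

```python
# Toy selective encryption with textbook RSA (an assumption; this is NOT the
# modified RSA technique of the paper, and the primes are far too small for real use).
P, Q = 61, 53
N = P * Q                      # modulus: 3233
PHI = (P - 1) * (Q - 1)        # 3120
E = 17                         # public exponent, coprime with PHI
D = pow(E, -1, PHI)            # private exponent via modular inverse (Python 3.8+)


def selective_encrypt(message, selected):
    """Encrypt only the characters in `selected` (the R part); keep the rest in clear."""
    return [pow(ord(ch), E, N) if ch in selected else ch for ch in message]


def selective_decrypt(tokens):
    """Invert selective_encrypt: integers are ciphertext, strings are clear text."""
    return "".join(chr(pow(t, D, N)) if isinstance(t, int) else t for t in tokens)


def encryption_ratio(tokens):
    """Fraction of symbols that were encrypted, i.e. |R| / |M|."""
    return sum(isinstance(t, int) for t in tokens) / len(tokens)


message = "&c#@&g#@&g#c@"                       # a short text of ASCII symbols and bases
tokens = selective_encrypt(message, selected={"&", "#"})
recovered = selective_decrypt(tokens)
print(recovered == message, encryption_ratio(tokens))
```

Enlarging the selected set raises the encryption ratio and the number of modular exponentiations, which mirrors the trade-off between security level and encryption time described above.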
This is a hybrid model of compression and security techniques, proposed as a strong hybrid cryptosystem. If selective encryption is applied before compression, we observe a very low level of security: because a DNA sequence has only four symbols, an unauthorized user can decrypt the sequence by trial and error, and the selection options are few. The compressed output text has more symbols than the input text. Applying selective encryption after compression is easy, gives a very high level of security, and increases the selection options. This is a two-pass method: the first pass uses the GP²R technique and the second pass uses the modified RSA technique; the output of the first pass is the input of the second pass, which produces the final result.

II. EXISTING COMPRESSION ALGORITHMS
All genome compression methods exploit redundancy within the sequence but differ greatly in how they do so. In general, compression algorithms can be classified into naive bit manipulation, dictionary-based, statistical, and referential algorithms. Most compression methods used today, including those for DNA, fall into two groups. The first is the statistical method, which compresses data by replacing a more frequent symbol with a shorter code. The second is the dictionary-based method, which compresses data by replacing long sequences with short references to a similar sequence in a dictionary.
Statistical compression techniques such as CTW and arithmetic coding compress DNA sequences well, but the Huffman technique is not applicable to DNA sequences. Both Lempel-Ziv 78 (LZ78) and Lempel-Ziv 77 (LZ77) work on the dictionary principle. GS compress introduces a dictionary-based design that extends LZ77 with reverse complements. Rivals et al. [18] developed Cfact, another compression algorithm, which uses a suffix tree data structure to search for the longest matching repeats. Sadeh proposed a lossy compression technique based on approximate string matching. CTW and arithmetic coding achieve a good compression rate, but their limitation is low decompression speed.
Biocompress [19], designed specifically for DNA compression, was proposed by Grumbach and Tahi. Biocompress-2 [11] is a modification of Biocompress based on an order-2 finite-context arithmetic encoder. The sliding-window algorithm proposed by Lempel and Ziv [20] is known as LZ77. Compression algorithms such as Cfact [21], Off-line [22], DNASC [23], and B2DNR [24] are based on the common characteristic of sequence repetitions. Shibata et al. proposed a Boyer-Moore algorithm for compressed pattern matching. In its first phase, DNACompress [25], [26] uses the PatternHunter tool to find the highest-scoring complementary palindromes and approximate repeats, which it encodes in the second phase.
A popular DNA compression algorithm is GenCompress-1 [25], which is based only on the replacement operation. GenCompress-2, a modified version of GenCompress-1, is based on deletion and insertion operations in the subsequence; the compression performance is the same in both cases.
The GENBIT Compress tool [27] was designed by Rajeswari and Apparao specifically for DNA sequence compression based on binary bit coding. They also introduced another DNA compression algorithm called HUFFBIT [28], which produced better results than GENBIT. They also designed the DNABIT Compress tool (DBC) [29], which assigns binary bits in the 'bit-preprocessing stage' to repeats and reverses of small DNA fragments.

III. EXISTING SELECTION ENCRYPTION ALGORITHMS
RSA, DES, 3DES, and AES are the popular encryption algorithms discussed in this paper. The use of the internet and networks is growing quickly, so there is a growing need to secure data sent over different networks using different services. Different encryption methods are used to provide network and data security [30]. In this paper, we survey existing work on encryption techniques. Every method has its own importance and works in its own characteristic way. For data storage and retrieval, we use compression and encryption algorithms in parallel to enhance the security level.
Ren et al. [31] proposed the idea of selective encryption in the form of a probabilistically selective encryption algorithm.
Kala [32] proposed a technique for wireless ad hoc networks that combines Quadrature Mirror Filters and lossless compression with selective encryption.
Singh and Samaddar [33] used the selective encryption technique [34] in RSA based on a singular cubic curve for text-based documents.

IV. MOTIVATION & CONTRIBUTION
The main purpose of this work is to analyze the benefit of performing compression and encryption together. First, the sequence is grouped into words of three or four bases, each of which is replaced by a special symbol; as a result, compression and encryption are obtained simultaneously. This reduces the complexity of the standard procedure. The secure private key is generated from the groups of nucleotide bases that are each replaced by a single ASCII code, and it is known only to the user who transmits the sequence. The decompression process is known to everyone, but the private key is known to no one except the transmitting user; the process is known as cryptanalysis, and it is very reliable. The selective encryption first efficiently searches for the sensitive regions of the DNA sequence. We have measured the effectiveness of the process using the Levenshtein distance (LD).
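For reference, the Levenshtein distance between two texts can be computed with the standard dynamic-programming recurrence. The sketch below is illustrative only; the helper `percent_changed`, which normalizes the distance by the longer text length, is an assumption and not necessarily the effectiveness measure used in the experiments.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def percent_changed(original, modified):
    """Hypothetical effectiveness measure: distance as a percentage of the longer text."""
    longest = max(len(original), len(modified), 1)
    return 100.0 * levenshtein(original, modified) / longest


print(levenshtein("acgt", "aggt"), percent_changed("acgt", "aggt"))
```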

V. PROPOSED TECHNIQUE OF GENETIC PALINDROME, PALINDROME AND REVERSE
A. METHODOLOGY OF GP²R TECHNIQUE
Consider a string S consisting of the four symbols g, t, c, and a, whose size is b bytes. The process is shown graphically in Fig. 2. The substring tta is the genetic palindrome (GP) of aat, which appears twice in the string; the substring ata is the palindrome (P) of ata, which appears twice in the string; and gta is the reverse (R) of atg, which appears twice in the string. After compression the string is s = &c#@&g#@&g#c@. The string has 31 characters and requires 31 bytes of storage; after compression it requires 15 bytes, which is less than the input file size, and the corresponding Library file entries are aat-&, ata-#, and atg-@. Special characters, e.g. N, may occur in some cases, and this case is also handled. Searching for exact reverse, genetic palindrome, and palindrome substrings, the encoding analysis, and the decoding procedure are discussed in detail in [35].
Step 1: enter the input file name
Step 2:
Step 3:
Step 4:
Step 5: Repeat for i=0 to i<3. then fclose(lib else then if feof(inp) putc(a,com),putc(b,com),putc(c,com). break; else then fclose(com).
Step 9: While !feof(inp) then fclose(inp) and fclose(com) "\ntime needed for execution is=%2.3f",difftime(end,beg) and remove(finp) End for loop End while loop
Step 10: End
2nd pass compression algorithm based on GP²R
INITIALIZATION OF INPUTS: 1. The compressed file is used as the source file 2. The 2nd-pass compressed file is the target file
ESTIMATED OUTPUT: The 2nd-pass compressed file is *COM and the library file is *LIB
Step 1: enter the input file name
Step 2:
Step 3:
Step 4:
Step 5: Repeat for i=0 to i<3. then fclose(lib).
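The three relations in the worked example, and its length accounting, can be checked with a short sketch. It assumes that the genetic palindrome of a word is its base-wise complement (a-t, c-g), which is inferred only from the aat/tta pair in the example; a reverse-complement reading would instead give att. The helper names are hypothetical, and the sketch does not reproduce the file-based steps listed above.

```python
# Base-wise complement table (assumed reading of "genetic palindrome" in the example).
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse(word):
    """Reverse (R): the word read backwards, e.g. atg -> gta."""
    return word[::-1]


def complement(word):
    """Assumed genetic palindrome (GP): base-wise complement, e.g. aat -> tta."""
    return word.translate(COMPLEMENT)


def is_palindrome(word):
    """Palindrome (P): the word reads the same in both directions, e.g. ata."""
    return word == word[::-1]


# Library entries and compressed string taken from the worked example.
LIBRARY = {"&": "aat", "#": "ata", "@": "atg"}
COMPRESSED = "&c#@&g#@&g#c@"


def expanded_length(compressed, library):
    """Number of bases represented: library symbols count as whole words, others as one base."""
    return sum(len(library.get(symbol, symbol)) for symbol in compressed)


print(reverse("atg"), complement("aat"), is_palindrome("ata"))
print(expanded_length(COMPRESSED, LIBRARY))
```

Nine library symbols of three bases each plus four unmatched bases account for the 31 characters of the input string.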

VI. RESULTS & DISCUSSION OF GENETIC PALINDROME, PALINDROME & REVERSE TECHNIQUE
The genetic palindrome, palindrome, and reverse algorithm was tested on the standard benchmark data used in [11]. Two sets of data (data set-1 and data set-2), which come from different sources, are used for testing. Data set-3 consists of some real DNA sequences taken from GenBank. Encryption ratio (ER): this criterion measures the ratio between the size of the encrypted part and the whole data size.
For cellular sequences, the results are presented graphically in Fig. 4 for data set-1 and Fig. 7 for data set-2. Fig. 4 and 7 consider only the compressed file and show that the compression rate depends on file size. The minimum average compression rate is 3.58077 bits/base for data set-1 and 3.58072 bits/base for data set-2, where the word size is 3 and the sequence orientation is complement. The compression rate increases when the word size increases. The compression time also increases when the word size increases from 3 to 4. Compression with a word size of 3 bases is therefore better than with a word size of 4 or 5. The graphs are heterogeneous because the sequences come from different species, as shown in Fig. 4 and 7 for both data sets. The results show that an increase in file size decreases the compression rate. The disk utilization and encryption rate are shown in Fig. 5 for data set-1 and Fig. 8 for data set-2, where both change in parallel with file size.
For artificial data, the results are presented graphically in Fig. 4 for data set-1 and Fig. 7 for data set-2. Fig. 4 and 7 consider only the compressed file and show that the compression rate depends on file size as well as word size. The minimum average compression rate is 3.60244 bits/base for data set-1 and 3.60539 bits/base for data set-2, where the word size is 3 and the sequence orientation is complement. The graphs are homogeneous because the sequences are randomly generated, as shown in Fig. 4 and 7. Comparing cellular sequences with artificial data, the figures show two visibly different graph characteristics: cellular sequences are structured, nonrandom data, whereas random data are unstructured. The library file is constant in size for an artificial sequence, whereas it varies in size for a cellular sequence. The compression and decompression time for cellular sequences is less than for artificial sequences.
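The rates above are expressed in bits per base. A minimal sketch of that unit follows, assuming the common definition (compressed size in bits divided by the number of bases in the input); whether the library file is counted in the compressed size is not stated here, so the function takes a single byte count and the name `compression_rate` is hypothetical.

```python
def compression_rate(compressed_bytes, n_bases):
    """Bits per base under the assumed definition: 8 * compressed bytes / input bases."""
    if n_bases <= 0:
        raise ValueError("n_bases must be positive")
    return 8 * compressed_bytes / n_bases


# Worked example from Section V: 31 bases stored in 15 bytes after compression.
example_rate = compression_rate(15, 31)

# Plain two-bit packing for comparison: 32 bases fit in 8 bytes.
two_bit_rate = compression_rate(8, 32)

print(round(example_rate, 3), two_bit_rate)
```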
When sequences are encrypted with three- or four-character secret keys (in the genetic palindrome, palindrome, and reverse technique, sequences are compressed by subsequences/words of different sizes), we calculate the percentage of encryption and the percentage of modification of the actual text. On average, encrypting 44%-45% of the actual text (for both data set-1 and data set-2) modifies 95% of the actual file (for both data sets).
The entropy after compression is two to three times the entropy before compression. As a result, the randomness of both the compressed file and the library file increases, so an attacker cannot attack the sequence easily. Fig. 10, 11, and 12 for data set-1 and Fig. 13, 14, and 15 show that the encryption is in increasing order. Fig. 15 for data set-1 and Fig. 16 for data set-2 show the entropy of the compressed file and the library file before and after compression; the entropy increases in both cases.
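A minimal sketch of a character-level Shannon entropy measurement follows; the function name is hypothetical, and entropy over single characters is an assumption about how the entropy was measured. A text over the four bases can reach at most 2 bits per symbol, whereas a text over 256 symbols can reach up to 8 bits per symbol, which is consistent with entropy rising after the bases are replaced by ASCII symbols.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy in bits per symbol of the character distribution of `text`."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


uniform_bases = shannon_entropy("acgt")       # four equally likely symbols
single_symbol = shannon_entropy("aaaa")       # no uncertainty at all
print(uniform_bases, single_symbol)
```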
We tested the program on real DNA sequences collected from the National Center for Biotechnology Information site (www.ncbi.nlm.nih.gov/genbank); the result is presented in Fig. 18. This technique also avoids lossy compression. The compression rate and ratio are the same in all the data sets. After first-pass compression, the two sets of DNA sequences are converted into plain text files of a different size, and on these files we measure the percentage of encryption and the percentage of effect on the actual file as the level changes.
In selective encryption, the file is encrypted according to a selection principle, i.e., selection of a single character, a double character, or an alphanumeric character. Results are obtained for the original file, the compressed file, and the library file of each data set. The results show that RSA exhibits the highest avalanche effect. The avalanche effect indicates the extent of diffusion of the message: a change of one bit in the plaintext brings about a significant change in the bits of the ciphertext. The data are presented in Fig. 19 to 22 for the original file, Fig. 23 to 28 for the compressed file, and Fig. 29 to 31 for the library file, for both data sets. The decryption time is always less than the encryption time and is independent of file size for both data sets. The encryption rate and avalanche effect increase with file size in both data sets. At the highest level of Hamming distance, the effect on the original file is highest, on the basis of top-level interchanging. On the other hand, at a lower level of Hamming distance the effect is proportional. The encryption increases with respect to output text effectiveness. The percentage of effect on the actual file increases when the input file size increases, and vice versa. Decoding the encrypted text with the appropriate character selection recovers the original text, as in our previous experiment. However, if the text is decrypted without the appropriate character selection value or with an incorrect key, the message will be different. Table 1 shows the encryption and decryption throughput. The results show that the decryption throughput is less than the encryption throughput. The throughput for data set-2 is better than for data set-1, and selective encryption applied to the compressed file performs better than encryption of the original file.
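The avalanche effect is commonly quantified as the bit-level Hamming distance between two ciphertexts whose plaintexts differ in one bit. The sketch below shows that measurement under the assumption that both ciphertexts are byte strings of equal length; the function names are hypothetical and the exact procedure used for the figures is not specified here.

```python
def hamming_bits(a: bytes, b: bytes) -> int:
    """Number of bit positions in which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def avalanche_percent(cipher_a: bytes, cipher_b: bytes) -> float:
    """Percentage of ciphertext bits that changed between the two ciphertexts."""
    if not cipher_a:
        raise ValueError("ciphertexts must be non-empty")
    return 100.0 * hamming_bits(cipher_a, cipher_b) / (8 * len(cipher_a))


print(hamming_bits(b"\x00", b"\xff"), avalanche_percent(b"\x0f", b"\x00"))
```

A value near 50% means that about half of the ciphertext bits flipped, which is the behaviour usually expected of a well-diffusing cipher.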
The encryption and decryption times for both data sets are also presented graphically in Fig. 19 and 21 for the source file, Fig. 23 and 26 for the compressed file, and Fig. 29 for the library file. The encryption time is always greater than the decryption time, and both are independent of file size. This is the basic principle of all encryption techniques (AES / DES / RSA / DNA), so these techniques are very effective for security purposes.
The calculated compression score is 2.32453 for data set-1 and 3.88363 for data set-2. This algorithm is compression friendly because it has no impact on data compression efficiency.
The second pass result is shown in Table 2. It is shown that data is lost in the second pass.

VII. CONCLUSION
The results show that the compression rate and ratio vary because the data sets come from different sources. This algorithm is very helpful for storing a DNA database. Our algorithm stores each DNA sequence as a record in the database rather than maintaining it as a file. Using the decryption and decompression algorithm, the original sequence can be recovered immediately at the client end without any error. The algorithm is user-friendly.
The experimental results show that the reverse, palindrome, and genetic palindrome matching patterns are similar in all kinds of sources. The lookup table plays a major part in finding the regularities and similarities of DNA sequences.
The output text file contains ASCII symbols and the non-matched bases c, t, g, and a, which provides information security. It is very useful for data transmission and provides data protection. This process protects the particular source of the DNA sequence. It gives better security than a static lookup table (LUT).
The internal genetic palindrome, palindrome, and reverse structure is the key idea of our algorithm. The algorithm acts as a DNA sequence compression model that brings out the real features of DNA sequences. The output of our experiments also shows that our process is better than other standard processes such as the RAY compression ratio [36]. Our method detects regularities in DNA sequences such as crossover and mutation. The algorithm does not attain a better compression rate and ratio than other standard methods, but it provides very high information security. Encryption performance depends on speed, and our results show that the modified RSA algorithm is faster than the RSA algorithm. A limitation of this work is that if the non-matched base pairs and ASCII codes are compressed again, we cannot provide corresponding ASCII codes, because the compression is one-pass. Using other orientations shows no meaningful changes. Important observations are:
i) The genetic palindrome, palindrome, and reverse subsequence size varies from 2 to 5, and no match is found when the subsequence size becomes more than six.
ii) Substrings of length three are repeated more often than substrings 4 or 5 bases long, so 3-base substrings are more compressible than other substring lengths.
iii) The cellular DNA sequence is more compressible than other orientations.
iv) The compression rate and ratio of cellular DNA sequences are distinguishably different because this data set is collected from different species. For the artificial data generated by the random string generation process, the compression rate and ratio are similar in both data sets.
v) We have developed a second-pass compression algorithm to reduce the compression rate and ratio, but the second-pass algorithm is lossy, so we cannot consider it in our work.
vi) Our algorithm works more efficiently on a short pattern than on a long pattern.
vii) The efficiency of the compression-encryption output file increases, and the method is user-friendly.
viii) With the GP²R technique the output file contains more than four characters, so Huffman's technique can then be applied easily.
ix) The modified RSA technique was also tested with other compression methods such as repeat, reverse, and repeat-reverse.
x) The algorithm was also compared on some real DNA sequences taken from GenBank, and the compression rate and ratio are the same as for the benchmark data.
Our method provides higher information security than other standard methods. The first pass produces two separate files, each having more than four symbols. If the two files are transmitted one after the other, decrypting them is very hard for unauthorized persons. In addition, the first-pass output file contains 256 symbols, so the selection options increase and the security becomes very strong. From the tables and graphs, it is observed that cellular DNA sequences have a logical organization and are structured, systematic, and nonrandom, whereas artificial data are random and unstructured. It is also observed that the compression ratio of cellular DNA sequences follows the equation 1 - Output/(2 × Input), whereas the compression ratio of artificial data follows the equation 1 - Output/Input, where the output size is a number of bits. The same table also shows the average compression gain of the sequences: the lower the compression rate, the higher the compression gain.
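The two compression-ratio expressions can be written out directly. In the sketch below the function names are hypothetical, and measuring the input as a count of bases is an assumption, since only the unit of the output size (bits) is stated.

```python
def ratio_cellular(output_bits, input_size):
    """Compression ratio stated for cellular DNA: 1 - Output / (2 * Input)."""
    return 1 - output_bits / (2 * input_size)


def ratio_artificial(output_bits, input_size):
    """Compression ratio stated for artificial data: 1 - Output / Input."""
    return 1 - output_bits / input_size


# Illustrative numbers only (not results from the experiments).
print(ratio_cellular(100, 100), ratio_artificial(50, 100))
```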

SANKAR PRASAD MONDAL was an Assistant Professor with the Department of Mathematics, Midnapore College (Autonomous) and with the National Institute of Technology, Agartala, for five years. He is currently an Assistant Professor with the Department of Natural Science, Maulana Abul Kalam Azad University of Technology, Haringhata, India. He has five years of research experience in the field of operations research, differential equation, fuzzy sets, mathematical biology, and fuzzy differential equation.
ALI AHMADIAN received the Ph.D. degree in applied mathematics from the Universiti Putra Malaysia, Selangor, Malaysia, in 2013. He is currently a Fellow Researcher with Universiti Putra Malaysia. He has published more than 70 peer-reviewed scientific publications. He is a Reviewer for 70 international journals. He was involved in several national and international projects related to the applications of fuzzy systems in real-world systems. His current research interests include artificial intelligence, fuzzy fractional calculus, interval-valued functions, numerical simulations, and mathematical modeling.
FERIAL GHAEMI received the Ph.D. degree in nanotechnology from the Institute for Advanced Technology, Universiti Putra Malaysia, in 2015. She joined the Institute for Tropical Forestry and Forest Products, in 2016, where she is currently a Postdoctoral Research Fellow. She is a University Lecturer with the National University of Malaysia. She has published more than 30 research works in prestigious CIJ journals. Her main research interests include the synthesis of different types of nanomaterials with applications arising in drug delivery, polymer composites, and microextraction techniques.
NORAZAK SENU is currently an Associate Professor with the Institute for Mathematical Research, Universiti Putra Malaysia. As his main interests are working on different types of differential equations and modeling real-world systems using such equations, he published more than 100 articles in the peer-reviewed international journals. He obtained several prizes for his research works from the Ministry of Education, Malaysia, and achieved a number of governmental grants to support his scientific works. VOLUME 8, 2020