DNA Design Based on Improved Ant Colony Optimization Algorithm With Bloch Sphere

DNA computing and coding have good application prospects in data storage, data computing, data encryption, and other fields. Designing a DNA code set that meets a variety of constraints is therefore an important problem in current research. The purpose of DNA coding design is either to find as many qualified codes as possible under given conditions or to keep each DNA code in the set as far as possible from the other codes. This paper pursues the former goal. The algorithm uses the Bloch sphere to initialize the DNA coding population and uses the ant colony algorithm to find the optimal DNA code set. In addition, crossover and mutation operations are added to make the generated population more random and diverse. Experimental results show that, under certain specific conditions, the number of codes obtained by this algorithm is larger than that obtained by other algorithms.


I. INTRODUCTION
DNA computing was proposed by Head [1] in 1987. Adleman [2] verified Head's conjecture with an innovative method in 1994. DNA code sets that meet given constraints play an important role in various DNA fields. With the rapid development of DNA-related technology, DNA coding is used not only in DNA computing but also in other technologies, such as data storage [3], DNA nanostructures [4], DNA microarrays [5], and image processing and encryption [6], [7]. Deng et al. [8] proposed an improved hybrid coding method for DNA-based data storage that combines variable-length run-length limited (VL-RLL) coding and low-density parity-check (LDPC) coding. Their experimental results show that the hybrid method performs better than traditional DNA data storage techniques. Immink et al. [9] proposed a sequence replacement method for K constraints and Q metadata. Experimental data show that this method improves significantly on existing replacement techniques. Calais et al. [10] proved that, under typical concentration conditions, the maximum melting temperature and 14 base saturation provide useful guidance for all technical solutions, and verified the effectiveness of the optimization method by experiments. Yin et al. [11] proposed a new nonlinear control strategy and an improved NOL-HHO algorithm. Experiments on several functions show that the algorithm obtains a better lower bound for DNA storage and has stronger global search ability. Weber et al. [12] designed a DNA code set with a specified minimum distance and a certain error detection ability. Experiments show that the code set detects errors well.
Common constraints include the Hamming distance constraint (HD), the Reverse Complement Hamming distance constraint (RC), and the GC constraint. These three constraints are used in this paper and are described in Section II. Many researchers are currently studying DNA code sets under these constraints. Cao et al. [13] proposed a K-means multiple optimization algorithm (KMVO) based on the Hamming distance constraint, the GC constraint, and other constraints. The algorithm finds better coding bounds than previous multiple optimization algorithms (MVO). The experimental results also show that the algorithm stores more information in a given length, which improves space utilization. Kim et al. [14] proposed a DNA coding structure based on binary Hadamard matrices. The experimental data show that, in terms of minimum complement Hamming distance, the DNA codes of length 8 or 16 constructed in that work are larger than all previously known results. Tulpan et al. [15] designed an algorithm for the random, heuristic, and linear construction of DNA strands satisfying the Hamming distance constraint and complement constraints, and used GC constraints to preprocess the DNA codes. The results show that the algorithm is better than other methods when the linear construction is used. Deaton et al. [16] proposed an algorithm based on a constraint stricter than the Hamming distance and reverse Hamming distance constraints. Tulpan et al. [17], [18] proposed a random local search algorithm. Experiments show that the algorithm finds DNA code sets of better quality. The premise of DNA code set design is to meet the corresponding constraints. Research on DNA coding can be divided into two aspects: (1) finding DNA codes of better quality while meeting the constraints [19]; (2) finding as many DNA codes as possible while meeting a given quality. The focus of this paper is the second case.
In this paper, we design an ant colony algorithm based on the Bloch sphere to find code sets under various constraints. In this method, the Bloch coordinates are regarded as three gene sequences, and each chromosome is composed of three genes. The ant colony algorithm is then used to find the best initial code set. Compared with a random search algorithm, a search strategy based on the genetic algorithm can search a wider range and gives better results. The comparison in Section IV shows that, in some cases, the proposed search algorithm obtains better results than previous search algorithms.

The associate editor coordinating the review of this manuscript and approving it for publication was Sabah Mohammed.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

II. CONSTRAINTS ON DNA CODING
A DNA code [20] of length n is a sequence $(x_1, \ldots, x_n)$ with $x_i \in \{A, G, C, T\}$, where A, G, C, and T denote the four nucleotides. The problem of DNA coding design is to find a set of codes of length n that meets the constraints. Suppose that, in the set S, code X is $5'-x_1 x_2 \ldots x_n-3'$ and code Y is $5'-y_1 y_2 \ldots y_n-3'$. Given a distance d, the following constraints apply:

A. HAMMING DISTANCE CONSTRAINT (HD)
The Hamming distance is the number of positions at which two codes of the same length carry different bases. For two different codes X and Y, the Hamming distance constraint is expressed as $H(X, Y) \ge d$, where $H(X, Y)$ is the Hamming distance between X and Y and d is the required distance. The Hamming distance is calculated as follows:

$$H(X, Y) = \sum_{i=1}^{n} h(x_i, y_i) \qquad (1)$$

$$h(x_i, y_i) = \begin{cases} 0, & x_i = y_i \\ 1, & x_i \ne y_i \end{cases} \qquad (2)$$

In formula (2), the i-th symbol of code X is compared with the i-th symbol of code Y: identical symbols contribute 0 and different symbols contribute 1. For example, if code X is 100000 and code Y is 011111, the Hamming distance of the two codes is 6. In formula (1), n is the code length.
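As an illustration of the Hamming distance defined above, the following minimal Python sketch counts the positions at which two equal-length codes differ. The function name `hamming_distance` is chosen here for illustration and does not come from the paper.

```python
def hamming_distance(x: str, y: str) -> int:
    """Count the positions at which two equal-length codes differ."""
    if len(x) != len(y):
        raise ValueError("codes must have the same length")
    return sum(1 for a, b in zip(x, y) if a != b)


# The example from the text: the two codes differ in every position.
print(hamming_distance("100000", "011111"))
# Two DNA codes that differ only in the last base.
print(hamming_distance("ACGT", "ACGA"))
```

A code set satisfies the HD constraint for a distance d when this function returns at least d for every pair of distinct codes in the set.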

B. REVERSE COMPLEMENT HAMMING DISTANCE CONSTRAINT (RC)
In DNA coding, the code X and the code Y may be equal. The Reverse Complement Hamming distance constraint is expressed as

$$H(X, Y^{RC}) \ge d$$

where $H(X, Y^{RC})$ is the Reverse Complement Hamming distance between code X and code Y, and $Y^{RC}$ is the reverse Watson-Crick complement of Y.
In DNA computing experiments, single-stranded DNA molecules diffuse freely in solution, so a code can hybridize with the reverse complement of another code. The Reverse Complement Hamming distance describes the degree of difference between code X and the reverse complement of code Y. Experiments have shown that the more bases in which two DNA codes differ, the fewer complementary base pairs they can form, and the less prone they are to non-specific hybridization.
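The RC constraint can be illustrated with a short Python sketch. It assumes the standard Watson-Crick pairing (A with T, C with G) and takes the reverse complement by reversing the code and complementing each base. The helper names are illustrative only.

```python
# Standard Watson-Crick pairing.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(y: str) -> str:
    """Reverse the code, then replace each base with its complement."""
    return "".join(COMPLEMENT[base] for base in reversed(y))


def rc_hamming_distance(x: str, y: str) -> int:
    """Hamming distance between x and the reverse complement of y."""
    if len(x) != len(y):
        raise ValueError("codes must have the same length")
    return sum(1 for a, b in zip(x, reverse_complement(y)) if a != b)


print(reverse_complement("AACG"))
print(rc_hamming_distance("CGTT", "AACG"))  # x equals the reverse complement of y
```

A distance of zero means that the two strands are perfectly complementary and would hybridize, which is the situation the RC constraint is meant to exclude.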

C. GC CONSTRAINT (GC)
Every DNA code consists of the four bases T, C, G, and A, among which G and C play a very important role. In DNA coding, the total number of G and C bases is required to equal $\lfloor n/2 \rfloor$, that is, $|G| + |C| = \lfloor n/2 \rfloor$. The GC content is calculated as follows:

$$GC(X) = \frac{|G| + |C|}{n}$$

where $|G|$ and $|C|$ are the numbers of G and C bases in the DNA sequence X, respectively, and n is the length of X. For example, X = 12343434 gives GC(X) = 25%.
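The GC constraint can be checked with a few lines of Python. This sketch assumes that the constraint fixes the number of G and C bases at the integer part of n/2, which matches the parameter w used in the experiments. The function names are hypothetical.

```python
def gc_count(x: str) -> int:
    """Number of G and C bases in a code."""
    return sum(1 for base in x if base in "GC")


def gc_content(x: str) -> float:
    """Fraction of bases that are G or C."""
    return gc_count(x) / len(x)


def satisfies_gc(x: str) -> bool:
    """Assumed form of the GC constraint: exactly floor(n/2) G/C bases."""
    return gc_count(x) == len(x) // 2


print(gc_content("GCATATAT"))    # 2 of 8 bases are G or C
print(satisfies_gc("ACGTACGT"))  # 4 of 8 bases are G or C
```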

III. RELATED ALGORITHMS
A. BLOCH SPHERE CODING
1) BLOCH SPHERE CODING
In 2002, Ham and Kim [21] proposed using the characteristics of chaos (such as randomness, ergodicity, and regularity) to make the population more diverse. The Logistic mapping is used to generate the variables:

$$x_{j+1} = \mu \, x_j \, (1 - x_j), \quad j = 1, 2, \ldots, m$$

where m is the number of variables and the value of µ generally lies in [0, 4]. If µ is greater than 4, the sequence diverges. The n chromosomes are generated through the above formula.
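The Logistic mapping can be sketched in Python as follows. The sketch uses the standard form of the map, in which each value is µ times the previous value times one minus the previous value. The starting value `x0` and the function name are assumptions made for illustration, since the text does not state how the sequence is seeded.

```python
def logistic_sequence(x0: float, mu: float, m: int) -> list:
    """Generate m values of the logistic map, starting after x0."""
    values = []
    x = x0
    for _ in range(m):
        x = mu * x * (1.0 - x)  # one step of the logistic map
        values.append(x)
    return values


# With mu close to 4 the sequence is chaotic but stays inside [0, 1].
chaotic = logistic_sequence(x0=0.3, mu=3.9, m=5)
print(chaotic)
```

For µ in [0, 4] and a starting value in [0, 1], every generated value stays in [0, 1], which is why larger values of µ are avoided.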
The Bloch sphere is a unit two-dimensional sphere on which antipodal points correspond to mutually orthogonal states. The north and south poles of the Bloch sphere correspond to the up and down states of the electron, which can be expressed as $|0\rangle$ and $|1\rangle$, respectively. In a two-level quantum system, any possible state $|\psi\rangle$ can be represented by two mutually orthogonal basis states, $|0\rangle$ and $|1\rangle$. In physics, $|0\rangle$ and $|1\rangle$ are the only two outcomes of a quantum measurement, and the state can be expressed as $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$, with $\alpha, \beta \in \mathbb{C}$ and $|\alpha|^2 + |\beta|^2 = 1$. On the three-dimensional Bloch sphere, a qubit can be expressed as

$$|\psi\rangle = e^{i\delta}\left(\cos\frac{\theta}{2}\,|0\rangle + e^{i\varphi}\sin\frac{\theta}{2}\,|1\rangle\right)$$

where $e^{i\delta}$ is called the global phase, which has the same effect on $|0\rangle$ and $|1\rangle$. The relative phase $e^{i\varphi}$, in contrast, affects the two basis states differently. Keeping only the relative phase gives $|\psi\rangle = \cos\theta\,|0\rangle + e^{i\varphi}\sin\theta\,|1\rangle$, so the ranges of θ and φ can be determined:

$$0 \le \theta \le \frac{\pi}{2}, \quad 0 \le \varphi \le 2\pi$$

The Bloch sphere is obtained by plotting, in three-dimensional space $\mathbb{R}^3$, all points determined by 2θ and φ. The spherical formula is as follows:

$$x = \cos\varphi \, \sin 2\theta, \quad y = \sin\varphi \, \sin 2\theta, \quad z = \cos 2\theta$$

[Figure: the three-dimensional Bloch sphere.]
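The mapping from the angles to a point on the Bloch sphere can be illustrated in Python. The sketch assumes the convention in which the state is written with cos θ and sin θ, so that the point on the sphere is drawn with the polar angle 2θ and the azimuthal angle φ. The function name is illustrative.

```python
import math


def bloch_coordinates(theta: float, phi: float) -> tuple:
    """Point on the unit sphere for polar angle 2*theta and azimuth phi."""
    x = math.sin(2.0 * theta) * math.cos(phi)
    y = math.sin(2.0 * theta) * math.sin(phi)
    z = math.cos(2.0 * theta)
    return (x, y, z)


# theta = 0 is the north pole; theta = pi/2 is the south pole.
print(bloch_coordinates(0.0, 0.0))
print(bloch_coordinates(math.pi / 4, math.pi / 2))
```

Each of the three coordinates lies in [-1, 1], which is why a transformation of the solution space is needed before the coordinates can be interpreted as bases.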

2) TRANSFORMATION OF SOLUTION SPACE
In DNA coding, the four bases must be expressed as codes that a computer can read, and a corresponding mathematical model must be established. The mapping is defined as follows:

$$f(b) = \begin{cases} 0, & b = T \\ 1, & b = C \\ 2, & b = G \\ 3, & b = A \end{cases}$$

That is, the four DNA bases T, C, G, and A are represented by the numbers 0, 1, 2, and 3, respectively, which facilitates the subsequent DNA coding work.
The values of the n chromosomes generated by formula (4) lie in [-1, 1], which does not meet the requirements of the DNA code set. The chromosomes are therefore mapped into the required space by a solution-space transformation with parameters a and b. Because the number of bases used in this paper is 4, the value of b is set to 4 and the value of a is set to 1. After this transformation, a set of DNA codes in the range [0, 4] is obtained.
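The numeric representation of the bases can be written as a pair of lookup tables in Python. The mapping of T, C, G, and A to 0, 1, 2, and 3 is taken from the text; the helper names `encode` and `decode` are illustrative.

```python
# Mapping stated in the text: T, C, G, A -> 0, 1, 2, 3.
BASE_TO_INT = {"T": 0, "C": 1, "G": 2, "A": 3}
INT_TO_BASE = {value: base for base, value in BASE_TO_INT.items()}


def encode(sequence: str) -> list:
    """Convert a DNA string into its numeric representation."""
    return [BASE_TO_INT[base] for base in sequence]


def decode(numbers: list) -> str:
    """Convert a numeric representation back into a DNA string."""
    return "".join(INT_TO_BASE[number] for number in numbers)


print(encode("TCGA"))
print(decode([2, 3, 0, 0, 3, 1, 3]))
```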

B. IMPROVED ANT COLONY ALGORITHM
Ant colony optimization (ACO) [22], also known as the ant algorithm, is a probabilistic algorithm for finding an optimal path and is also a simulated evolutionary algorithm. It is inspired by the way ants find the shortest path while searching for food.
The ant colony algorithm was first applied to the Traveling Salesman Problem (TSP). After several years of development, it is now used in other fields as well [23], [24]. Although the ant colony algorithm performs well in many fields, its most successful applications are in combinatorial optimization problems.

1) FUNDAMENTAL
Biologists have discovered that ants are polymorphic social creatures without vision. They rely on the pheromone along the way to find the best path. For example, when ants reach an intersection that they have never passed before, they randomly choose a path and release pheromone on the chosen path. When later ants pass through the intersection, they choose the path with more pheromone with greater probability. The pheromone on a path also evaporates over time. As a result, more and more ants pass through the optimal path, its pheromone becomes more and more concentrated, and the probability of choosing it becomes greater and greater. Correspondingly, the pheromone on paths that the ants do not choose becomes weaker and weaker, and the probability of choosing them becomes smaller and smaller. In the end, the ants choose the best path.

2) ANT COLONY TRANSFER STRATEGY
The DNA code set $S_k$ is constructed as follows. First, a certain number of DNA codes are initialized to form the code set V (if there is a GC constraint, it suffices to ensure that each code contains $\lfloor n/2 \rfloor$ G and C bases when the code set is initialized).
Ant k randomly selects a code from the code set V and adds it to $S_k$, and then constructs the candidate code set with respect to $S_k$. The candidate set is defined as follows:

$$\text{candidates} = \{\, v \in V \mid H(v, s) \ge d \ \text{and} \ H(v, s^{RC}) \ge d, \ \forall s \in S_k \,\}$$

The formula above uses the Hamming distance constraint and the Reverse Complement Hamming distance constraint, but other constraints can be selected according to the actual situation. Each code in the candidate set meets the corresponding constraints with respect to all codes in $S_k$. The selection probability of each code in the candidate set is then calculated, and roulette selection is used to pick one code and add it to $S_k$. This is repeated until the candidate set is empty. At this point the first ant has completed its task, and the second ant performs the same operation, and so on until the last ant has finished. The ant transfer strategy [25] for adding candidate codes to $S_k$ is as follows:

$$p(v_i) = \frac{\tau_i^{\alpha} \, \eta_i^{\beta}}{\sum_{v_j \in \text{candidates}} \tau_j^{\alpha} \, \eta_j^{\beta}} \qquad (11)$$

where $\eta_i$ is the heuristic function, which is computed from $d_i$, the sum of the distances (Hamming distance and Reverse Complement Hamming distance) between the i-th code and the other codes. α is the pheromone factor and β is the heuristic function factor. The larger the value of β, the larger the weight of the heuristic function in formula (11). $\tau_i$ is the pheromone concentration of the i-th code. The two major factors that affect the selection probability are therefore the pheromone concentration $\tau_i$ and the heuristic function $\eta_i$; the higher the pheromone concentration, the greater the probability that the code is selected. However, as the code length increases, the time needed to calculate $p(v_i)$ also increases, so a new formula is needed to reduce the calculation time. On the basis of the original ant transfer strategy, the formula is simplified and a simplified version of the ant transfer strategy is proposed. The simplified version of the ant transfer strategy [26] is shown in formula (12).
Compared with formula (11), formula (12) does not need to calculate the distance between codes, which greatly improves the calculation efficiency and shortens the calculation time. The basic ant colony algorithm calculates the distances between all DNA codes when evaluating the ant transfer strategy and uses them to select the code to add to the set, which greatly increases the computing time. The improved ant transfer strategy only needs the pheromone concentration, and the ant relies on the pheromone concentration alone to select a code. This also suggests that the improved ant transfer strategy is general: it can be combined with other algorithms to improve the robustness of a hybrid algorithm and reduce its running time.
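The roulette selection used by the simplified transfer strategy can be sketched in Python. The sketch assumes that the probability of a candidate is its pheromone concentration divided by the total pheromone of all candidates; this is one plausible reading of a strategy that depends on pheromone alone and is not taken verbatim from the paper. The function name and the dictionary layout are hypothetical.

```python
import random


def roulette_select(candidates: list, pheromone: dict, rng: random.Random) -> str:
    """Pick one candidate with probability proportional to its pheromone."""
    if not candidates:
        raise ValueError("candidate set is empty")
    total = sum(pheromone[code] for code in candidates)
    threshold = rng.random() * total
    cumulative = 0.0
    for code in candidates:
        cumulative += pheromone[code]
        if threshold < cumulative:
            return code
    return candidates[-1]  # guard against floating-point rounding


rng = random.Random(42)
tau = {"ACGT": 3.0, "TTAA": 1.0, "GGCC": 1.0}  # illustrative pheromone values
print(roulette_select(list(tau), tau, rng))
```

Because no distances between codes are computed, one selection costs time linear in the number of candidates, regardless of the code length.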

3) UPDATE LOCAL PHEROMONE
Communication within the ant colony takes place through pheromone, which plays a crucial role in the cooperation of the colony. When an ant k completes a code set $S_k$, it updates the local pheromone according to the update rule in [27]. During this update, the pheromone content of the codes that do not belong to $S_k$ remains unchanged, while the pheromone content of the codes that belong to $S_k$ is reduced. The rationale is that later ants are more likely to choose codes with a higher pheromone concentration. After the local update, the pheromone concentration of the codes outside $S_k$ is higher than that of the codes inside $S_k$, so later ants preferentially choose codes that were not selected before, which increases the diversity of the solution space. The pheromone value is generally confined to a fixed range, and the local update changes the pheromone content of a code. It is therefore necessary to check whether the pheromone content of the code leaves the specified range. If the pheromone content of a code falls below the preset minimum, it is set to the minimum. Conversely, if it exceeds the maximum, it is set to the maximum. Updating the local pheromone thus consists of two parts: the first part updates the pheromone, and the second part handles pheromone values outside the specified range.
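The two parts of the local update can be sketched in Python. The reduction rule used here, multiplication by one minus a decay factor, is an assumption made for illustration, since the update formula of [27] is not reproduced in the text. The names `decay`, `tau_min`, and `tau_max` are likewise hypothetical.

```python
def local_update(pheromone: dict, selected: list,
                 decay: float = 0.1, tau_min: float = 0.1, tau_max: float = 10.0) -> None:
    """Reduce the pheromone of the selected codes and clamp it to the allowed range."""
    for code in selected:
        reduced = (1.0 - decay) * pheromone[code]            # part 1: assumed reduction rule
        pheromone[code] = min(tau_max, max(tau_min, reduced))  # part 2: clamp to [tau_min, tau_max]


tau = {"ACGT": 1.0, "TTAA": 1.0, "GGCC": 0.1}
local_update(tau, ["ACGT", "GGCC"], decay=0.5)
print(tau)  # codes outside the selected set keep their pheromone
```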

4) UPDATE GLOBAL PHEROMONE
After an iteration is completed, all ants have built their DNA code sets, and the global pheromone must be updated. Because pheromone in the natural environment evaporates over time, the algorithm includes evaporation to simulate this phenomenon. ρ is the pheromone volatilization factor, $\tau(v_i)$ is the new pheromone content, Q is the pheromone constant, and $L_k$ is the number of codes that meet the constraint conditions. According to the update rule, the ant colony algorithm can be divided into three models: the Ant Cycle model, the Ant Quantity model, and the Ant Density model. This paper uses the Ant Cycle model, in which ants release pheromone after completing a whole path cycle. The global pheromone is updated as follows:

$$\tau(v_i) \leftarrow (1 - \rho) \, \tau(v_i) + \Delta\tau(v_i), \quad \Delta\tau(v_i) = \sum_{k} \Delta\tau_k(v_i)$$

where $\Delta\tau_k(v_i)$ is the pheromone deposited by ant k, which is determined by Q and $L_k$ if $v_i$ belongs to $S_k$ and is zero otherwise. Global pheromone updating therefore includes two processes: increasing and evaporating. In nature, the pheromone released by ants does not accumulate without limit, because it slowly dissipates over time. On a path that few ants pass, the evaporation exceeds the increase, so the pheromone content becomes smaller and smaller until it is exhausted. In contrast, on a path that many ants pass, the increase exceeds the evaporation, so the pheromone on this path becomes more and more concentrated. After an iteration is completed, the evaporation of pheromone on every code is $\rho \cdot \tau(v_i)$, which leads the ants to slowly forget previous paths and to search paths they have not yet explored, thus increasing the diversity of the solution space search.
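The evaporation and deposit steps can be sketched in Python. Evaporation removes the fraction ρ of each code's pheromone, as stated in the text. The deposit rule is an assumption: the sketch defaults to the classic Ant Cycle form, Q divided by L_k, and exposes it as a replaceable function because the paper's own deposit formula is not reproduced here. All names are illustrative.

```python
def global_update(pheromone: dict, ant_sets: list,
                  rho: float = 0.1, q: float = 1.0, deposit=None) -> None:
    """Evaporate pheromone on every code, then deposit on codes chosen by each ant."""
    if deposit is None:
        # Assumed default: classic Ant Cycle deposit Q / L_k.
        deposit = lambda q_const, l_k: q_const / l_k
    for code in pheromone:
        pheromone[code] *= (1.0 - rho)  # evaporation of rho * tau
    for code_set in ant_sets:
        if not code_set:
            continue
        amount = deposit(q, len(code_set))  # L_k taken as the size of the ant's set
        for code in code_set:
            pheromone[code] += amount


tau = {"ACGT": 1.0, "TTAA": 1.0}
global_update(tau, [["ACGT"]], rho=0.5, q=1.0)
print(tau)
```

Passing a different `deposit` function, for example one that grows with the set size, changes how strongly large code sets are rewarded without touching the evaporation step.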

C. CROSSOVER
In an evolutionary algorithm, if there is only an update operation, the population evolves in a single direction, which leads to premature convergence and reduces the ability to search the solution space. The crossover operation not only makes the population more diverse but also passes excellent individuals on to the next generation. Both aspects play an important role in evolutionary algorithms. The common crossover strategies of evolutionary algorithms [28] are single-point crossover, two-point crossover, multi-point crossover, and uniform crossover. To make the population more diverse, this paper uses full crossover, in which every gene of the population participates in the crossover, although this greatly increases the calculation time. Suppose the population size is 4 and each chromosome has 5 genes. The whole crossover process is shown in Figure 1, and the crossover result is shown in Figure 2.

D. MUTATION
The mutation operation is also an indispensable part of an evolutionary algorithm. To a certain extent, mutation prevents the algorithm from converging prematurely to a local optimum. However, to keep the algorithm stable, the mutation probability is generally very small. The mutation method used in this paper is as follows: when the mutation condition is met, a randomly selected gene in a chromosome is replaced by another gene of the same chromosome. Assuming that the population size is 4 and the number of genes per chromosome is 5, Figure 3 shows the population before mutation, Figure 4 shows the population after mutation, and the red part in Figure 4 marks the mutated value.
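The mutation described above can be sketched in Python under one reading of the text: with a small probability, one randomly chosen position of the chromosome receives the value found at another randomly chosen position of the same chromosome. The function name and the `rate` parameter are assumptions made for illustration.

```python
import random


def mutate(chromosome: list, rate: float, rng: random.Random) -> list:
    """With probability `rate`, copy one gene of the chromosome onto another position."""
    mutated = list(chromosome)
    if len(mutated) >= 2 and rng.random() < rate:
        target, source = rng.sample(range(len(mutated)), 2)  # two distinct positions
        mutated[target] = mutated[source]
    return mutated


rng = random.Random(7)
print(mutate([0, 1, 2, 3, 0], rate=1.0, rng=rng))
```

The input list is left untouched and a new list is returned, so the population before and after mutation can be compared directly.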

E. CALCULATE FITNESS
The constraints in this paper are combination constraints (the Hamming distance constraint, the Reverse Complement Hamming distance constraint, etc.), so the fitness function of the algorithm is built on the Reverse Complement Hamming distance constraint. A fitness value of zero means that every code in the set satisfies the combination constraint.

F. ALGORITHM DESCRIPTION
1) Use Bloch Sphere Coding to initialize the population.
2) Ant k starts to work and randomly selects a code from the code set and adds it to $S_k$.
3) Each ant constructs a candidate set based on its own set.
4) According to the ant transfer strategy (roulette), select a code from the candidate set and add it to $S_k$.
5) Determine whether the candidate set is empty. If not, return to step 4. Otherwise, go to step 6.
6) Update the local pheromone.
7) Determine whether all ants have completed the update. If not, return to step 4. Otherwise, go to step 8.
8) Update the local optimal solution and determine whether the maximum number of iterations has been reached. If not, update the global optimal solution and the global pheromone and return to step 2. Otherwise, go to step 9.
9) Obtain the global optimal solution $S_{bs}$ and use it as the input of the genetic algorithm.
10) Use Bloch Sphere Coding to initialize the set S.
11) Perform crossover, mutation, and fitness calculation on the set S and update the code set.
12) Determine whether the set S is empty. If it is not empty, go to step 11. Otherwise, go to step 13.
13) Determine whether the maximum number of iterations has been reached. If not, go to step 10. Otherwise, go to step 14.
14) Output the final set $S_{bs}$ and end the operation.
The algorithm flow chart is shown in Figure 5.
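Steps 2 to 5 of the procedure, in which one ant builds its code set, can be sketched in Python as follows. The sketch assumes the HD and RC constraints with a common distance d, pheromone-proportional roulette selection, and strictly positive pheromone values. It omits the check of a code against its own reverse complement, and all function names are hypothetical.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def hamming(x: str, y: str) -> int:
    return sum(1 for a, b in zip(x, y) if a != b)


def reverse_complement(y: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(y))


def compatible(v: str, s: str, d: int) -> bool:
    """HD and RC constraints between two codes."""
    return hamming(v, s) >= d and hamming(v, reverse_complement(s)) >= d


def construct_set(pool: list, pheromone: dict, d: int, rng: random.Random) -> list:
    """One ant: random first code, then roulette picks until no candidate is left."""
    first = rng.choice(pool)                                   # step 2
    chosen = [first]
    candidates = [v for v in pool if v != first and compatible(v, first, d)]  # step 3
    while candidates:                                          # step 5
        weights = [pheromone[v] for v in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]  # step 4
        chosen.append(pick)
        candidates = [v for v in candidates if v != pick and compatible(v, pick, d)]
    return chosen


pool = ["AAAA", "CCCC", "AAAC", "ACAC", "CACA"]
tau = {code: 1.0 for code in pool}
print(construct_set(pool, tau, d=3, rng=random.Random(3)))
```

Filtering the candidate list only against the most recently added code is sufficient, because every remaining candidate was already checked against the earlier members of the set.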

IV. EXPERIMENT
A. EXPERIMENTAL ENVIRONMENT
The experiments were run on an i7-10070 CPU with 16 GB of memory. MATLAB was used to run the experiments and compare the results.

B. EXPERIMENTAL DATA
This manuscript uses the Bloch spherical coding method, which offers more randomness and diversity than the traditional random coding method. The population generated by Bloch spherical coding serves as the experimental data.

C. PARAMETER SETTINGS
The combination of the ant colony algorithm and the genetic algorithm is used to solve the problem of DNA code set design. Table 1 shows the parameters of the ant colony algorithm, and Table 2 shows the parameters of the genetic algorithm.

D. EXPERIMENTAL PROCESS
In this experiment, we first use Bloch spherical coding to initialize the population, which gives the population randomness and diversity. We then remove duplicate codes and pass the remaining codes to the improved ant colony algorithm to obtain one or more optimal DNA code sets. Next, the genetic algorithm (crossover and mutation) is used to expand the DNA code set. After many cycles, the final DNA code set is obtained.

E. MEASURE STANDARD
In this experiment, Bloch spherical coding, the improved ant colony algorithm, and the genetic algorithm are used to obtain the DNA code set, so the evaluation index is the number of DNA codes obtained under given conditions (given constraints and DNA code length); the larger, the better. Table 3 explains the superscript attached to each experimental result, that is, the algorithm that produced it. Table 4 and Table 5 give the maximum set sizes obtained by each algorithm for $A_4^{RC}(n, d)$ and $A_4^{GC,RC}(n, d, w)$, respectively. The code length n ranges over [4, 13], and the distance d ranges over [3, n]. In each column, A denotes the previous experimental results (the letter in the upper right corner of each result indicates the method used; see Table 3 for details), and B denotes the experimental results of this paper. The symbol ''-'' indicates that the distance d is greater than the code length. A number in bold indicates that the result of this paper is better than the previous result, and the symbol ''.'' indicates that the experiment has not been run. In this paper, the value of w is $\lfloor n/2 \rfloor$.

F. EXPERIMENTAL RESULT
In Table 4 and Table 5, some results, such as $A_4^{GC,RC}(12, 6, 6)$, are far from the existing experimental results, and some results are equal to the previous results, such as $A_4^{RC}(9, 7) = 8$. There are also cases in the code length range [4, 13] where the number of codes found by the new algorithm is larger than the existing results, such as $A_4^{GC,RC}(11, 4, 5) = 2569$. This shows that the algorithm can find more codes that meet the constraints in some cases. The algorithm based on Bloch sphere coding, the improved ant colony algorithm, and the genetic algorithm obtains better results for two reasons: (1) randomly generated coding is replaced by Bloch sphere coding, which makes the generated population more random, enlarges the search range, and also influences the subsequent search of the solution space; (2) compared with random search, the code set obtained by the crossover and mutation of the genetic algorithm can approach the optimal region, whereas random search has no such evolutionary mechanism and is a blind search. Although the algorithm obtains better code sets in some cases, in many cases the code set is not better than the previous one, which indicates that the algorithm still has considerable room for improvement. In addition, the latest intelligent optimization methods could be added to the algorithm to obtain better performance.

V. CONCLUSION
In this paper, we combine Bloch sphere coding, the ant colony algorithm, and the genetic algorithm to design DNA code sets that meet the constraints. The Bloch spherical coding method is more diverse than the traditional random generation method, which enlarges the search space of the algorithm and makes it more likely to find DNA code sets that meet the constraints. The ant colony algorithm has good stability and global search ability. The genetic algorithm lets the population adapt better to the new environment and expands the existing solution space through mutation, crossover, and other operations.
Through experiments, we find that in some cases the DNA code set found by the proposed algorithm is larger than the existing results. This supports the feasibility of the algorithm to a certain extent and provides support for future research on DNA code sets.