Improved Particle Swarm Based on Elastic Collision for DNA Coding Optimization Design

In DNA computing, the design of DNA coding sequences is an important factor affecting the reliability of the computation. For each DNA sequence design task, suitable constraints should be selected and the sequences designed to satisfy them. In this paper, an improved particle swarm optimization algorithm based on an elastic collision strategy (EC-PSO) is used to optimize the design of DNA sequences with a fitness function that combines multiple constraints. EC-PSO uses the idea of elastic collision to improve the best and worst positions within the population, introduces the flight mechanism of the sparrow search algorithm (SSA) to enhance the search capability of the algorithm and increase the diversity of the population, and then uses the harmony search algorithm to fine-tune the population and improve the quality of the solution. The effectiveness of the algorithm was verified by comparing it with six other algorithms on eight test functions. Finally, in the DNA sequence design experiment, the sequences designed by EC-PSO were more reasonable than those of the compared methods.


I. INTRODUCTION
With the rapid development of science and technology, the data generated by human beings every day is increasing exponentially, so research and development of high-performance computing is urgent. DNA computing is a high-performance computing technology based on biological DNA molecules. Compared with traditional electronic computers, it has the advantages of lower resource consumption, high computing efficiency and large storage capacity, and is well suited to the era of big data. In 1994, Professor Adleman [1] proved the feasibility of biological computing with DNA molecules as the carrier, and pioneered DNA computing. Subsequently, DNA computing has been applied in various fields, such as DNA nanotechnology [2] and DNA image encryption [3].
DNA coding is the core step of DNA computing, and a good DNA coding sequence determines the reliability of DNA computing. How to design high-quality DNA coding sequences is therefore an important problem in the field. The design of a DNA coding sequence is generally subject to multiple constraints, such as continuity, similarity, and H-measure, in order to meet the specified quality requirements. DNA coding design can therefore be regarded as an optimization problem with multiple objective functions [4].
A metaheuristic algorithm is a mathematical model, inspired by natural laws, that solves optimization problems by inference and induction. The genetic algorithm [5] is the first known metaheuristic algorithm; it showed the excellent optimization performance of metaheuristic algorithms in path planning [6], image processing [7], shop scheduling [8] and other problems. Subsequently, the particle swarm algorithm [9], tabu search algorithm [10], ant colony algorithm [11], artificial bee colony algorithm [12] and other metaheuristic algorithms have been studied. They all play an important role in optimization.
In order to obtain high-quality DNA sequences, metaheuristic algorithms have been widely used in DNA coding sequence optimization in recent years. Xiao et al. [13] combined the particle swarm optimization algorithm with a chaotic search algorithm and proposed a new quantum chaotic group evolution algorithm to solve the DNA sequence optimization problem. Ibrahim et al. [14] proposed a binary particle swarm optimization algorithm to design DNA coding sequences. Xiao et al. [15] used a multi-swarm particle swarm optimization algorithm based on time-varying acceleration coefficients to optimize DNA coding. Yang et al. [16] introduced a niche crowding strategy into the invasive weed optimization algorithm and used the resulting algorithm to solve the problem of DNA sequence design. Liu et al. [17] combined the bat algorithm and the particle swarm optimization algorithm with a hybrid strategy, and used fast non-dominated sorting to calculate and rank the fitness of DNA codes, achieving good optimization results. Bano et al. [18] proposed a multiobjective memetic generalized differential evolution algorithm, based on reverse learning and local search strategies, to design reliable DNA sequences. Yao et al. [19] proposed an activity-based bacterial foraging algorithm that considers the influence of bacterial activity on foraging; the quality of the DNA coding sequences was improved by introducing a bacterial activity regulation mechanism and a competitive exclusion mechanism. Zhou et al. [20] proposed an improved Bloch sphere ant colony algorithm, which uses the Bloch system to initialize the population and then uses the ant colony algorithm to search for the optimal DNA code. Li et al. [21] added a closed constraint and a paired-sequence constraint to the original constraints, and obtained stable and reliable DNA sequences in combination with an improved chaotic whale algorithm.
A great deal of work by researchers shows that metaheuristic algorithms have obvious advantages in solving DNA coding problems. However, as the scale of the DNA coding problem grows, the optimization time increases greatly and efficiency drops. The quality of the DNA codes is also difficult to guarantee because metaheuristic algorithms are strongly random and easily fall into local optima. Therefore, it is necessary to further improve the relevant algorithms to better solve the DNA coding problem.
Particle swarm optimization (PSO) is a classical metaheuristic algorithm. It has good optimization ability, few parameters, and is easy to implement, but it also has defects such as high randomness and a tendency to converge to local optima. To solve these problems, scholars have made many improvements. Recently, Zhang et al. [22] proposed a multiobjective particle swarm optimization algorithm with control parameters. An improved position updating formula was used to replace the inertia weight and acceleration coefficients of traditional PSO, and an adaptive disturbance updating strategy was added to improve the universality and optimization ability of the algorithm. Zhang et al. [23] used fuzzy clustering to guide population initialization and added local operators based on feature importance to strengthen the local search ability, obtaining an improved particle swarm optimization algorithm with good optimization ability. Although the above improvements have effectively enhanced the local search capability of the particle swarm algorithm, its global search is still lacking.
In order to balance the local and global search capabilities of the algorithm, this paper proposes an improved particle swarm algorithm based on elastic collision (EC-PSO). EC-PSO is compared with the Sparrow Search Algorithm (SSA) [24], PSO, Grey Wolf Optimizer (GWO) [25], Teaching-Learning-Based Optimization (TLBO) [26], Manta Ray Foraging Optimization (MRFO) [27] and Whale Optimization Algorithm (WOA) [28] on 8 standard test functions, and the results show that EC-PSO has better optimization ability. The designed DNA sequences show that EC-PSO produces more reasonable and higher-quality sequences than previously proposed algorithms. The main contributions of this paper are as follows:
1) An elastic collision strategy is proposed to improve the updating mode of individual positions and enhance the local search capability of the algorithm.
2) In the iterative process, the finder-phase update strategy of the sparrow search algorithm is introduced to enhance the global search capability of the algorithm.
3) The harmony search algorithm is used to screen the individuals in the population; individuals with lower fitness values are eliminated, and higher-quality solutions are obtained.
The rest of this paper is organized as follows. The second section introduces the constraints on DNA coding sequences. The third section introduces the improvement strategies of the particle swarm optimization algorithm based on elastic collision. The fourth section compares and analyzes EC-PSO and other optimization algorithms in function optimization. The fifth section compares and analyzes EC-PSO and other optimization algorithms in DNA coding sequence design. The sixth section concludes the paper and points out future work.

II. Constraints on DNA coding sequences
In 2004, Garzon et al. [29] discussed the problem of coding design for storing data in DNA sequences. DNA coding design should theoretically follow two major classes of constraints, namely combinatorial constraints and thermodynamic constraints. Common combinatorial constraints include H-measure and similarity, which prevent irregular hybridization of DNA molecules, and continuity and hairpin structure, which prevent DNA molecules from forming the wrong structure during hybridization. Common thermodynamic constraints include GC content, melting temperature and free energy; these constraints allow DNA molecules to maintain consistent biochemical properties when hybridized. The constraints used in this article are introduced next.
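In the formulas that introduce the constraints below, $x$ and $y$ denote two DNA coding sequences in the DNA sequence set $S$, $m$ is the number of DNA sequences, and $l$ is the number of bases in a given DNA sequence. The other functions involved are as follows.
$$T(i, j) = \begin{cases} i, & i > j \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
$$bc(a, b) = \begin{cases} 1, & a = \bar{b} \\ 0, & \text{otherwise} \end{cases} \tag{2}$$
$$eq(a, b) = \begin{cases} 1, & a = b \\ 0, & \text{otherwise} \end{cases} \tag{3}$$
In formula (1), $T(i, j)$ is a threshold function that returns $i$ if the parameter $i$ is greater than the specified threshold $j$, and 0 otherwise. In formula (2), $bc(a, b)$ is the function that determines whether the bases $a$ and $b$ are complementary, where $\bar{b}$ denotes the complement of base $b$. In formula (3), $eq(a, b)$ is the function that determines whether the bases $a$ and $b$ are equal.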
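The three helper functions described above can be sketched in a few lines of Python. This is only an illustration: the names `threshold`, `bc` and `eq` and the 0/1 return values of the two indicator functions are assumptions based on the verbal descriptions, not code from the paper.

```python
# Watson-Crick complement of each base.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def threshold(value, limit):
    """Threshold function: return value if it exceeds limit, otherwise 0."""
    return value if value > limit else 0


def bc(a, b):
    """Return 1 if bases a and b are complementary, otherwise 0 (assumed encoding)."""
    return 1 if COMPLEMENT[a] == b else 0


def eq(a, b):
    """Return 1 if bases a and b are equal, otherwise 0 (assumed encoding)."""
    return 1 if a == b else 0
```

For example, `threshold(5, 3)` returns 5 while `threshold(3, 3)` returns 0, because the value must strictly exceed the threshold.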

A. Continuity
Continuity refers to the number of consecutive occurrences of the same base in a DNA coding sequence. If this number is greater than the set threshold, the DNA molecular structure becomes very unstable because of the hydrogen bonding between the bases, and the molecule easily distorts or folds during the reaction, producing secondary structure. DNA molecules with secondary structure cannot be used in DNA computing, because they affect the reliability of the computation. The continuity of the DNA coding sequence should therefore be as small as possible. The continuity constraint is expressed by a mathematical model in which $t$ is the set continuity threshold. For different DNA sequence lengths, it is generally required that $t$ should not be greater than 2, to reduce the probability of secondary structure forming during the reaction. $c_a(x, i)$ returns the number of consecutive occurrences of the base $a$, with $a$ being one of A, T, C, or G.
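As a rough illustration of the continuity constraint, the sketch below penalizes every run of identical bases that is longer than the threshold. Squaring the run length is an assumption taken from a formulation commonly used in the DNA sequence design literature; it is not confirmed by the text above, and the function name `continuity` is hypothetical.

```python
from itertools import groupby


def continuity(seq, t=2):
    """Sum of squared lengths of all runs of identical bases longer than t.

    The squared penalty is an assumption, not the paper's stated formula.
    """
    total = 0
    for _, run in groupby(seq):
        length = len(list(run))
        if length > t:  # only runs that exceed the threshold are penalized
            total += length ** 2
    return total
```

With the threshold of 2 used later in the experiments, a sequence such as `AACCGGTT` scores 0, while a single run of three identical bases contributes 9.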

B. Hairpin
Hairpin structure refers to the reverse folding of a DNA molecule onto itself, with some bases brought close to each other and some of the folded bases pairing with each other to form a secondary structure. It is called a hairpin structure because its shape resembles a hairpin. In a DNA coding sequence, lowering the hairpin value reduces the self-reaction of the DNA molecule and thus avoids the emergence of secondary structure. A hairpin structure consists of a hairpin stem and a hairpin ring. Let the minimum stem length be $p_{min}$, the minimum ring length be $r_{min}$, the ring length be $r$, and the stem length be $p$; the hairpin value is calculated from these quantities. In formula (8), $pinlen(p, r, i) = \min(p + i,\ l - r - i - p)$.
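The idea of a stem enclosing a ring can be illustrated with a simplified check. The sketch below only looks for a perfectly paired, contiguous stem of at least `p_min` base pairs around a ring of at least `r_min` bases; it is not the scoring formula of the paper, and the function names, the extra length argument of `pinlen`, and the default thresholds (taken from the experiment settings) are assumptions.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def pinlen(p, r, i, l):
    """Stem length that fits for stem p, ring r, offset i in a sequence of length l."""
    return min(p + i, l - p - i - r)


def has_hairpin(seq, p_min=6, r_min=6):
    """Return True if a perfect stem of >= p_min pairs encloses a ring of >= r_min bases."""
    l = len(seq)
    for left_end in range(p_min - 1, l):          # last base of the left stem arm
        for r in range(r_min, l - left_end - 1):  # candidate ring length
            right_start = left_end + r + 1        # first base of the right stem arm
            max_pairs = min(left_end + 1, l - right_start)
            pairs = 0
            # Walk outwards from the ring while the bases stay complementary.
            while (pairs < max_pairs
                   and COMPLEMENT[seq[left_end - pairs]] == seq[right_start + pairs]):
                pairs += 1
            if pairs >= p_min:
                return True
    return False
```

A sequence built as a stem, a ring, and the reverse complement of the stem (for example `ACGTAC` + `AAAAAA` + `GTACGT`) is flagged, whereas a homopolymer is not.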

C. Similarity
Similarity refers to the degree to which the base structures of two DNA molecules are similar. Hamming distance refers to the number of differing bases at corresponding positions of two DNA sequences, and similarity is a constraint based on Hamming distance. In addition to Hamming distance, the similarity constraint also considers shifts. A shift matters when the Hamming distance between DNA sequences $x$ and $y$ is very large but becomes very small after the complementary strand of $y$ is moved one position to the right, so that $x$ and the complementary strand of $y$ are also prone to non-specific hybridization. Therefore, the smaller the similarity between DNA sequences, the lower the probability of non-specific hybridization between them. The similarity is calculated by formulas (9)-(12), in which $x$ and $y$ are two coding sequences in the DNA sequence set and the similarity value is divided into two parts. One is the discontinuous similarity, and the other describes the largest continuous common subsequence.
$y(-)^{g}y$ represents two copies of $y$ joined with a gap of length $g$ between them, and $i$ represents the number of positions by which the sequence is shifted to the right.
$ceq(x, y, i)$ returns the number of bases at which $x$ and $y$ are continuously equal starting at position $i$, where $i$ is a positive integer.
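To show why shifts matter, the following sketch counts equal bases between two sequences over all relative shifts and returns the largest count. It is a deliberately simplified stand-in: the gap-joined copies and the continuous and discontinuous thresholds of the full similarity measure are left out, and the function names are hypothetical.

```python
def shifted_matches(x, y, shift):
    """Count positions k where x[k] equals y[k - shift], ignoring overhangs."""
    return sum(
        1
        for k in range(len(x))
        if 0 <= k - shift < len(y) and x[k] == y[k - shift]
    )


def similarity(x, y):
    """Largest number of equal bases over all relative shifts (simplified sketch)."""
    return max(shifted_matches(x, y, s) for s in range(-(len(y) - 1), len(x)))
```

For `ACGT` and `TACG` the unshifted alignment has no equal bases (Hamming distance 4), yet a shift of one position exposes three matches, which is the situation the shift-aware constraint is meant to catch.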

D. H-measure
H-measure also estimates, on the basis of Hamming distance, the possibility of unintended base pairing between two DNA sequences. It helps avoid mismatched hybridization between $x$ and $y$ by counting the complementary bases between the two sequences. Sequence shifts are also considered in the calculation. The smaller the value of H-measure, the lower the possibility of cross-hybridization with the complementary strands of other DNA molecules in the same set. H-measure is calculated by formulas (13)-(16), in which $x$ and $y$ are two antiparallel DNA sequences. The H-measure calculation is divided into a continuous part and a discontinuous part, and $cbp(x, y, i)$ represents the number of consecutive complementary base pairs between $x$ and $y$ starting from position $i$.
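The split into a discontinuous and a continuous part can be sketched for a single alignment as follows. This is a hedged illustration under several assumptions: `y` is taken to be already written in antiparallel orientation, the maximization over shifts and gaps is omitted, the discontinuous threshold is interpreted as a fraction of the sequence length, and the default thresholds 0.17 and 6 are borrowed from the experiment settings. The function name `h_aligned` is hypothetical.

```python
from itertools import groupby

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def h_aligned(x, y, h_dis=0.17, h_con=6):
    """Discontinuous plus continuous complementarity score for one alignment."""
    flags = [COMPLEMENT[a] == b for a, b in zip(x, y)]
    pairs = sum(flags)
    # Discontinuous part: total complementary pairs, if above a length fraction.
    dis = pairs if pairs > h_dis * len(y) else 0
    # Continuous part: maximal runs of complementary pairs longer than h_con.
    con = 0
    for is_pair, run in groupby(flags):
        length = len(list(run))
        if is_pair and length > h_con:
            con += length
    return dis + con
```

Two fully complementary 8-mers score 16 (8 from each part), while a pair with only a short complementary stretch is penalized through the discontinuous part alone.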

E. GC content
GC content refers to the percentage of guanine (G) and cytosine (C) bases in the total number of bases of a DNA sequence. GC content is very important for keeping the chemical properties of DNA sequences stable. In DNA computing, the GC content of the DNA coding sequences should be as consistent as possible, generally around 50%. Because a G≡C pair contains three hydrogen bonds and an A=T pair contains two hydrogen bonds, GC content also indirectly affects the melting temperature.
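GC content follows directly from its definition; a minimal sketch (the function name is illustrative):

```python
def gc_content(seq):
    """Percentage of G and C bases in the sequence."""
    return 100.0 * sum(1 for base in seq if base in "GC") / len(seq)
```

A 20-base code with ten G or C bases therefore has the target GC content of 50%.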

F. Melting temperature
Melting temperature (Tm) is the temperature at which 50% of the double-stranded DNA molecules have separated into single strands under heating. Tm is an important factor affecting the reaction efficiency of DNA molecules. In DNA computing, the DNA coding sequences should have melting temperatures that are as similar as possible, so that the reaction between DNA molecules can be better controlled and the probability of non-specific hybridization effectively reduced. According to the nearest-neighbor thermodynamic model [30], Tm is calculated by formula (19), in which $\Delta H^{\circ}$ is the total enthalpy of the adjacent bases, $\Delta S^{\circ}$ is the total entropy of the adjacent bases, $R$ is the gas constant (1.987 cal/(K·mol)), and $C_T$ is the concentration of DNA.
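The quantities listed above fit the widely used nearest-neighbor expression $T_m = \Delta H^{\circ} / (\Delta S^{\circ} + R \ln(C_T / 4))$ for non-self-complementary duplexes. Whether formula (19) has exactly this form is an assumption. The sketch below takes the enthalpy and entropy totals as inputs instead of summing nearest-neighbor parameters, since no parameter table is given here; the function name and units are illustrative.

```python
import math

GAS_CONSTANT = 1.987  # cal/(K*mol)


def melting_temperature(delta_h, delta_s, c_t):
    """Tm in kelvin from total enthalpy (cal/mol), entropy (cal/(K*mol)) and
    strand concentration c_t (mol/L), using the assumed expression
    Tm = dH / (dS + R * ln(c_t / 4))."""
    return delta_h / (delta_s + GAS_CONSTANT * math.log(c_t / 4.0))
```

In this expression a higher DNA concentration gives a higher Tm for the same enthalpy and entropy, which is one reason the concentration is fixed in the experiments.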

III. Introduction to EC-PSO algorithm
In standard particle swarm optimization, individuals update their positions according to the current population information, so strengthening the exchange of position information between particles can facilitate optimization. It is worth noting that when searching in a complex environment, the best and worst positions in the population change only slightly, or not at all, over many searches, which limits the search ability of the algorithm and weakens the subsequent search steps. Researchers have noticed this problem and have applied partial updates to the best position. However, traditional update strategies re-evaluate the individual after several searches, which increases the computational complexity of the algorithm, so the improvement is limited.

A. Elastic Collision strategy
Elastic collision comes from mechanics in physics. When ball A hits a stationary ball B at speed V, the speed and direction of both balls change. Assuming that no energy is lost during the collision and that A initially moves in the positive direction, there are three scenarios after the collision.
(1) If the mass of A is greater than that of B, the two balls move in the positive direction after the collision.
(2) If the mass of A is less than that of B, A moves in the opposite direction and B moves in the positive direction after the collision.
(3) If the mass of A is equal to the mass of B, A is at rest after the collision, and B moves in the positive direction with a velocity equal to the original velocity V.
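The three scenarios follow from conservation of momentum and kinetic energy for a one-dimensional elastic collision with a stationary target. The sketch below evaluates the standard textbook result; the function name is illustrative, and how the paper maps these velocities onto particle positions is not shown here.

```python
def elastic_collision(m_a, m_b, v):
    """Velocities of A and B after A (mass m_a, velocity v) hits stationary B."""
    v_a = (m_a - m_b) / (m_a + m_b) * v
    v_b = 2.0 * m_a / (m_a + m_b) * v
    return v_a, v_b
```

A heavier A keeps moving forward, a lighter A rebounds, and equal masses exchange velocities, matching cases (1) to (3).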
In the design of this paper, the worst individual in the population is endowed with speed to impact the best individual, and the fitness value of the individual is equivalent to the mass of the object. The fitness value of the position after the collision is compared to select the best and worst individual respectively. The specific idea is shown in Figure 1.
As shown in Figure 1, the fitness value of the worst individual is $m_1$, and the fitness value of the best individual is $m_2$. The worst individual is assigned a velocity $v_1$, which is calculated by formulas (20)-(21), in which $X_{worst}$ represents the position of the worst individual and $X_{rand}$ represents the position of a randomly selected individual. The difference between the two is used to perceive the location information of the population and to capture some group information. $\omega$ is a nonlinear parameter factor used to improve the adaptability of the velocity; it depends on the maximum number of iterations $iter_{max}$ and the current number of iterations $iter$. The current number of iterations is regarded as the independent variable, and the velocity $v_1$ as the dependent variable. The velocity change diagram is shown in Figure 2.

VOLUME XX, 2022
It can be seen that as the number of iterations increases, the velocity gradually increases and changes more rapidly, which benefits the search in the middle and late stages and reduces the probability of falling into a local optimum.

B. Improved global search
In the PSO optimization process, the global search ability is weak, and the optimal solution may not be found. Many recently proposed algorithms are also slightly inadequate in function optimization. The sparrow search algorithm (SSA) is a new swarm intelligence algorithm that has extensive search ability and compares favorably with other algorithms in function optimization. The update formula of its discoverer stage is flexible.
In formula (27), $iter_{max}$ represents the maximum number of iterations of the algorithm; $\alpha$ is a uniform random number in (0,1); $Q$ is a random number that follows the standard normal distribution; $L$ is a $1 \times d$ matrix in which each entry is 1; the alarm value $R_2 \in [0,1]$ and the safety value $ST \in [0.5,1]$. If a sparrow in the population finds danger, it sends out an alarm signal. When the alarm value is greater than the safety value, the finder leads the population to another, safer area to forage.
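The discoverer update of the original sparrow search algorithm is usually written as $X_i \cdot \exp(-i / (\alpha \cdot iter_{max}))$ when the alarm value is below the safety value, and $X_i + Q \cdot L$ otherwise. The sketch below follows that published form as an assumption about formula (27); the function and parameter names are illustrative, and `index` is the 1-based rank of the individual.

```python
import math
import random


def discoverer_update(position, index, iter_max, r2, st, rng=random):
    """One SSA discoverer-stage update of a position vector (list of floats)."""
    if r2 < st:
        # Safe: shrink the position; alpha is drawn from (0, 1] to avoid
        # division by zero.
        alpha = 1.0 - rng.random()
        scale = math.exp(-index / (alpha * iter_max))
        return [x * scale for x in position]
    # Alarm raised: shift every dimension by the same normal random step Q,
    # which is what adding Q times a row vector of ones amounts to.
    q = rng.gauss(0.0, 1.0)
    return [x + q for x in position]
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible.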
In this paper, the finder-stage optimization strategy of SSA is used to update the particle position, which makes the subsequent optimization more reliable. Firstly, some individuals are randomly selected from the population for updating according to Equation (27), providing two individuals with large differences for the collision process, so that the next optimization range is more extensive. Some individuals are selected in a proportional manner. The proportion of individuals selected is set as 0.1. In this way, the selected individuals will not greatly increase the computational complexity of the algorithm. In addition, the diversity of the population is increased to some extent, and the optimization ability of the algorithm is improved.

C. Harmony search algorithm
The main idea of the harmony search algorithm (HS) [31] is to start from a randomly generated harmony memory, generate a candidate solution vector through harmony memory consideration, random selection and pitch adjustment, and then compare the fitness value of the candidate solution vector with that of the worst solution vector in the harmony memory to decide whether to update the harmony memory. In this paper, HS is used to locally adjust the individuals of the current population and eliminate the worst solution among the current candidate solutions, which prevents the updated worst solution from continuing to affect the search results and yields a group of high-quality solutions.
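One improvisation step of a basic harmony search can be sketched as below for a minimization problem. The parameter names `hmcr` (harmony memory considering rate), `par` (pitch adjusting rate) and `bw` (bandwidth), their default values, and the scalar bounds are conventional choices assumed for illustration; the paper's own settings are not given in this section.

```python
import random


def harmony_step(memory, fitness, lower, upper,
                 hmcr=0.9, par=0.3, bw=0.01, rng=random):
    """Improvise one candidate and replace the worst memory entry if better."""
    dim = len(memory[0])
    candidate = []
    for d in range(dim):
        if rng.random() < hmcr:
            # Harmony memory consideration: reuse a stored value.
            value = rng.choice(memory)[d]
            if rng.random() < par:
                # Pitch adjustment: small perturbation of the stored value.
                value += bw * rng.uniform(-1.0, 1.0)
        else:
            # Random selection within the search bounds.
            value = rng.uniform(lower, upper)
        candidate.append(min(max(value, lower), upper))
    worst = max(range(len(memory)), key=lambda k: fitness(memory[k]))
    if fitness(candidate) < fitness(memory[worst]):
        memory[worst] = candidate
    return memory
```

Because the worst entry is replaced only by a better candidate, the worst fitness in the memory never gets worse from one step to the next.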

D. EC-PSO algorithm
In order to improve the optimization ability of particle swarm optimization and increase the diversity of its optimization methods, an improved particle swarm optimization algorithm based on elastic collision is proposed in this paper. The algorithm uses elastic collision to improve the best and worst individuals in the population without increasing the computational complexity of the algorithm. The finder update formula of SSA is introduced to improve the global search ability of the algorithm. Finally, HS is used to update the candidate solutions and retain a set of high-quality solutions. The specific process is as follows:
Step 1. Initialize the population and parameters.
Step 2. Use formulas (25) and (26) to update the best and worst individuals.
Step 3. Use formula (27) to update the corresponding positions.
Step 4. Update the corresponding position and speed.
Step 5. Fine-tune the population with HS.
Step 6. Generate the corresponding optimal position and optimal solution.
Step 7. If iteration ends, output the optimal position and solution; otherwise, return to step 2.
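Step 4 relies on the position and speed update of particle swarm optimization. A minimal sketch of the standard PSO update is given below; the inertia weight `w`, the acceleration coefficients `c1` and `c2`, and their default values are conventional assumptions, and the paper may use a different variant.

```python
import random


def pso_update(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=random):
    """Standard PSO velocity and position update for one particle."""
    new_x, new_v = [], []
    for xd, vd, pd, gd in zip(x, v, pbest, gbest):
        vel = (w * vd
               + c1 * rng.random() * (pd - xd)   # pull towards personal best
               + c2 * rng.random() * (gd - xd))  # pull towards global best
        new_v.append(vel)
        new_x.append(xd + vel)
    return new_x, new_v
```

A particle that already sits on both its personal best and the global best with zero velocity stays where it is, which is one reason extra strategies such as the collision and finder updates are added to keep the search moving.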

E. Time complexity analysis of algorithm
Time complexity is an important measure of the quality of an algorithm and determines its practicality and timeliness. Let the population size of EC-PSO be $N$, the maximum number of iterations be $iter_{max}$, the dimension of the problem be $D$, and the randomly selected ratio be $r_1$. The time complexity of EC-PSO is analyzed as follows. From the macro point of view, the time complexity of a swarm intelligence optimization algorithm is $O(N \times iter_{max} \times D)$, and the same holds for EC-PSO. Although EC-PSO adds some strategies to the optimization process, it does not change the structure of the algorithm or increase the number of loops, so its time complexity is still $O(N \times iter_{max} \times D)$.
From the microscopic point of view, assume that the calculation time for introducing elastic collision is $t_1$, that for adding global search is $t_2$, and that for introducing harmony search is $t_3$. The other parts of the calculation are small in scale and can be ignored. As can be seen from the algorithm flow, $O(iter_{max} \times 2)$ is added in the phase of updating the best and worst positions, $O(iter_{max} \times r_1 \times N \times D)$ is added in the global search phase, and $O(iter_{max} \times N \times D)$ is added by the introduction of HS. Therefore, compared with PSO, the time complexity of EC-PSO is increased by $O(iter_{max}(2 + N \times D \times (r_1 + 1)))$. The order of magnitude does not increase, however, while the optimization efficiency and accuracy of the algorithm are effectively improved, so the additional cost is acceptable and worthwhile.

F. DNA sequence design process based on EC-PSO
In this paper, the improved particle swarm optimization algorithm based on elastic collision is applied to DNA sequence optimization, and the fitness function is a combination of the continuity, hairpin and Hamming constraints. When a DNA sequence meets these constraints, the fine-tuned individual is retained for subsequent operations; otherwise it is eliminated. The aim is to prevent secondary structures from developing among the individuals in the population and to reduce similarity to some extent. The magnitude of each fine-tuning step of harmony search is determined by the fine-tuning probability. When the number of iterations reaches a multiple of 100, the population is re-initialized to keep the algorithm from being trapped in poor designs and to remove the influence of the initialization stage. The specific flow chart is shown in Figure 3.

V. Algorithm performance test and analysis
In the performance test experiment, eight standard test functions were used to verify the validity of EC-PSO. In order to make the test more reasonable and comprehensive, different function types were used to test the optimization performance of the algorithm. The specific function information is shown in Table 1. F1-F4 are simple single-peak functions, F5-F7 are complex multi-peak functions, and F8 is a fixed-dimension function. In addition, the PSO, GWO, SSA, MRFO, TLBO and WOA algorithms are compared with EC-PSO to verify its advantages. In order to better show the optimization behavior of each algorithm, each algorithm is run 30 times, and the best value, worst value, median, mean, standard deviation and run time are calculated. For fairness, the population (pop) size and the number of iterations (iter) are the same for every algorithm. The specific parameters of each algorithm are shown in Table 2. All experiments were run on a PC with a Core i5-10200H CPU, 16 GB of RAM, the Windows 10 operating system, and MATLAB R2019a. The optimization results obtained in the experiment are shown in Table 3.
As can be seen from Table 3, EC-PSO finds the optimal value of the F1-F4 functions every time, while SSA and MRFO find the optimal value in most cases but occasionally deviate, showing poorer search stability, and the other algorithms perform worse. EC-PSO therefore has excellent performance in single-peak optimization. On the multi-peak functions, EC-PSO is slightly behind SSA and MRFO, but it still outperforms the other algorithms, showing that EC-PSO performs well on multi-peak problems. Finally, the performance of EC-PSO on the fixed-dimension function is almost the same as that of MRFO and better than that of the other algorithms. In conclusion, the multi-strategy EC-PSO algorithm has good optimization performance. In order to show the convergence of each algorithm clearly, the average convergence curves of each algorithm on each function are given in Figure 4. To show the differences between the algorithms, the horizontal axis range is shortened for some functions so that the differences in the early stage of the search can be seen.
Figure 4 shows that EC-PSO has a fast optimization speed and high accuracy. In particular, EC-PSO has obvious advantages in optimization effect and convergence speed on F1-F4 and F7. On the other hand, although EC-PSO does not achieve the best convergence on the F5-F6 functions, it is second only to SSA and MRFO, and it is very close to MRFO on the F8 function. In summary, although EC-PSO is slightly behind some algorithms on multi-peak problems, it is far superior to the other algorithms on single-peak problems. Compared with the standard PSO algorithm, the optimization ability of EC-PSO is greatly improved, which demonstrates the effectiveness of the improvement strategies proposed in this paper.

VI. DNA sequence optimization experiment based on EC-PSO
DNA sequence design needs to meet multiple constraints, so DNA sequence optimization can be regarded as an optimization problem with multiple objective functions. In order to verify the practicality of EC-PSO, it was applied to DNA sequence optimization. In EC-PSO, a DNA sequence is a candidate solution, and the algorithm searches for the best DNA sequences that meet the constraints. The experimental environment is the same as in the algorithm performance test. The parameters for DNA sequence optimization are as follows: the minimum stem length and minimum ring length thresholds of the hairpin structure are both 6; the continuity threshold is 2; in the continuous case, the threshold of similarity and H-measure is 6; in the discontinuous case, the threshold of similarity and H-measure is 0.17. The design is optimized for codes with a length of 20. In addition, the concentration of DNA molecules is 10 nM and the salt concentration is 1 mol/L. Table 3 shows the specific parameters of EC-PSO.
In order to highlight the performance of EC-PSO in DNA optimization, NCIWO [32], IWO [33], CPSO [34], DMEA [35] and DNA sequences designed by NUPACK were selected for comparison. NUPACK is a software package developed by researchers at Caltech that generates DNA sequence structures using code. In addition, the parameters of all algorithms are set according to the literature, and the constraint formulas are consistent with those in this paper, which ensures the rationality and fairness of the experiment. In order to test the effectiveness of the algorithm in DNA sequence optimization, 7 DNA sequences containing 20 bases each were generated in the experiment. The base arrangement of these sequences and the values of each constraint after optimization are shown in Table 4.
In DNA sequence design, the smaller the values of continuity and hairpin, the less likely the DNA is to produce secondary structures. The smaller the values of H-measure and similarity, the lower the probability that the DNA molecules will hybridise incorrectly. The smaller the variation of Tm and GC content, the better. According to Table 4, only EC-PSO has continuity and hairpin values that are all 0. The hairpin value of the last sequence in DMEA is 3. The continuity value of the first sequence in NCIWO is 9. The hairpin values of the first and third sequences in NUPACK are 3. The continuity and hairpin values of IWO and CPSO are not zero, so their optimization effect is poor. Therefore, EC-PSO can effectively avoid secondary structure in DNA molecules. Figure 5 shows the average continuity and hairpin values of each algorithm.
In EC-PSO, the fine-tuning operation of HS is used to remove individuals with low fitness values, which reduces the similarity and H-measure values to a certain extent and plays a very important role in DNA sequence optimization. Low similarity and H-measure values can effectively prevent DNA sequences from mismatching during the reaction. It can also be seen from Figure 6 that the mean similarity and H-measure values of EC-PSO are the smallest, indicating that DNA sequences optimized by EC-PSO hybridize more regularly during the reaction. In a DNA sequence, the optimal proportion of guanine (G) and cytosine (C) is 50%. In this experiment, the GC content is optimized to 50% under the condition that the other constraints are met. In Table 4, only EC-PSO, NCIWO and CPSO achieve this goal. However, it can be seen from Figure 7 that the variance of the melting temperature of NCIWO and CPSO is greater than that of EC-PSO, that is, their melting temperatures vary more. This indicates that the DNA sequences optimized by EC-PSO have better stability in biochemical reactions.

VII. CONCLUSION
To better solve the problem of DNA sequence optimization design, this article proposes a particle swarm optimization algorithm improved by an elastic collision strategy (EC-PSO). The algorithm uses elastic collision theory to update the best and worst positions, introduces the discoverer update formula of SSA to improve the search ability of the algorithm, and finally uses HS to screen the population and select better individuals. The effectiveness of the algorithm is verified on eight test functions. In DNA sequence design optimization, the sequences obtained are of higher quality and can effectively meet the requirements of DNA computing. Because the constraints restrict one another, in future work we plan to further improve the algorithm and use non-dominated sorting to screen more high-quality DNA sequences.