Optimization Method for Distributed Database Query Based on an Adaptive Double Entropy Genetic Algorithm

In a distributed database environment, multi-join query optimization is one of the key factors affecting database performance. Genetic algorithms are well suited to this type of problem. However, the traditional genetic algorithm suffers from low efficiency and is prone to premature convergence when applied to query optimization, mainly because of a lack of population diversity. Therefore, this paper sets up a mathematical model for distributed database query optimization and proposes an adaptive genetic algorithm based on double entropy. The algorithm uses two types of entropy: genotype entropy and phenotype entropy. Genotype entropy is used to optimize the distribution of the initial population, ensuring that the initial population has good diversity. Phenotype entropy, which can be divided into individual entropy and population entropy, is used to optimize the genetic strategy. Individual entropy is used to optimize the selection strategy, and population entropy is used to optimize the crossover and mutation operators, which maintains population diversity during iteration and accelerates convergence. The experimental results show that the proposed algorithm is effective for query optimization of a distributed database.


I. INTRODUCTION
In the era of big data, as data volumes continue to grow, the disadvantages of traditional centralized databases are becoming increasingly apparent. Distributed database systems emerged to cope with complex, changeable network requirements and massive data. A distributed database system (DDBS) is a collection of data that are logically related but distributed over different sites of a computer network [1]. The sites can operate independently, but they can also communicate with each other through the network and cooperate on a complex task as a unified whole. The performance of a DDBS depends on its ability to handle queries efficiently, and query processing in a DDBS requires transferring data between different sites on the network. In a DDBS, the query cost mainly includes CPU, I/O, and communication costs, and communication cost is the most important factor affecting the performance of a query. The communication cost is the cost of transferring data among the different sites that participate in the query. Data transmission and local data processing together constitute the distributed query strategy, which is known as the query execution plan (QEP). The multi-join query is one of the most common operations in a distributed database. When multiple relations are joined, the same query admits many different join orders, and each order corresponds to a QEP. As the number of relational tables increases, the number of different QEPs increases rapidly, which leads to high computational complexity. Consequently, traditional database query methods are inefficient for queries over massive data and are difficult to adapt to distributed queries. Finding an intelligent method that quickly identifies the QEP with the lowest communication cost in the search space, thereby reducing the query cost and improving query response efficiency, has therefore become the focus of current research.
The associate editor coordinating the review of this manuscript and approving it for publication was Jagdish Chand Bansal.
Researchers have proposed various strategies for the optimization of distributed database queries, including SDD-1 [2], [3], dynamic programming [4], [5], simulated annealing [6], [7], and genetic algorithms [8]-[10]. The work in [11] proposed a query optimization method based on the Tabu-GEP algorithm, which combines the Tabu search strategy with the GEP algorithm and improves the performance of the classic GEP algorithm. The query time and the generation time of the optimal query strategy were both significantly reduced compared with the original algorithm. However, the time complexity of the algorithm was still high. The work in [12] proposed an adaptive genetic algorithm that reintroduces individuals scattered outside the converged part of the population into the genetic operations and adaptively adjusts the evolutionary strategy according to the fitness of individuals to maintain diversity. However, it may also introduce undesirable genes, which slows down the optimization process. The work in [13] combined multiple ant colonies with a genetic algorithm, overcoming the blindness of the early search of the ant colony algorithm, and used a smoothing mechanism and mutual learning among ant colonies to avoid premature convergence to a local optimum. It is better at preventing the algorithm from falling into a local optimum and can obtain a better query strategy. However, the quality of the initial pheromone depends too heavily on the results of the genetic algorithm. The work in [14] provided an HMSST+ algorithm to optimize the storage and query strategy of a distributed memory database. It uses an SST connection selection strategy to quickly calculate the optimal connection scheme. This algorithm can improve query efficiency and has strong scalability. However, the improvement is not significant for more complex query statements.
Because the classical genetic algorithm is prone to premature convergence, it is difficult to obtain an ideal optimal solution. In this paper, we introduce the concept of information entropy into the genetic algorithm and propose an adaptive double-entropy genetic algorithm (ADEGA), which is based on two types of entropy. We use the genotype entropy of the population to optimize the initial population distribution. During evolution, we select an appropriate evolutionary strategy according to the phenotype entropy of the population, adaptively adjusting the genetic operators and maintaining individual diversity. This improves the global search ability of the algorithm and allows a good solution to be obtained quickly. Experiments show that the ADEGA algorithm obtains good optimization results and effectively improves the efficiency of distributed database queries.

II. QUERY EXECUTION COST MODEL

A. THE REPRESENTATION OF QUERY EXECUTION PLAN
Because the join order of the relational tables is not fixed, many different QEPs exist for the same query. The QEPs of a distributed database can be represented by a query binary tree [15], as shown in Fig. 1. In the figure, the leaf nodes of the binary tree represent the relational tables in the database, and the intermediate nodes represent the intermediate result sets for the joins of the left and right relational tables. In general, for a join of n relational tables, there are n! different QEPs. As the number of relational tables increases, the number of QEPs grows factorially. The problem is similar to the classical traveling salesman problem (TSP) [16], and both are NP-hard. Therefore, it is practically infeasible to search for the optimal QEP exhaustively.
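To make the growth of the search space concrete, the following minimal sketch evaluates the n! count stated above for the table counts used later in the experiments. The function name `qep_count` is a hypothetical helper introduced here for illustration only.

```python
import math


def qep_count(n_tables: int) -> int:
    """Number of join orders for n tables, using the n! count given in the text."""
    return math.factorial(n_tables)


# Table counts taken from the experimental section (4, 6, 8, 10, 12 tables).
for n in (4, 6, 8, 10, 12):
    print(f"{n:2d} tables -> {qep_count(n)} candidate QEPs")
```

Even at 12 tables the count is in the hundreds of millions, which is why exhaustive enumeration is not attempted.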

B. COST ESTIMATION
In this paper, we consider the multi-join query and use ''∞'' to represent the join operation of two relational tables. For the join of two relational tables in the multi-join, $S = R_1 \infty R_2$, the number of records in the intermediate result set after the join is given by equation (1), where $|R_1|$ and $|R_2|$ represent the cardinalities of $R_1$ and $R_2$.

The width of a relational table R is the sum of the widths of its attributes:

$$W_R = \sum_{i=1}^{n} W_i \tag{2}$$

In the expression, i represents the ith attribute of table R, n represents the total number of attributes of table R, and $W_i$ is the width of the ith attribute. The join attributes involved in the join operation can be divided into two types. If the join attribute is not a pure join attribute, the repeated join attributes should be removed from the result set, and only one join attribute is retained. If the join attribute is a pure join attribute, all of them are removed from the result set. Therefore, the width $W_S$ of the intermediate result set S formed by each join operation is given by equation (3), in which $join(R_1, R_2)$ represents the join attributes of table $R_1$ and table $R_2$. The size of the intermediate dataset, $Size(S)$, formed by a join operation is therefore:

$$Size(S) = |S| \times W_S \tag{4}$$

VOLUME 10, 2022

Because the data of the distributed database are stored separately in data tables at different sites, table $R_1$ and table $R_2$ may be at the same site or at different sites. When they are not at the same site, the data of the smaller relational table must be transferred to the site at which the larger relational table is located. Therefore, it is necessary to calculate the transmission cost of the data among the sites. For the join of two relational tables, $J = R_1 \infty R_2$, the cost model $cost(j)$ is given by

$$cost(j) = \begin{cases} |J| \times W_j, & R_1 \text{ and } R_2 \text{ are in the same site} \\ |J| \times W_j + \min(Size(R_1), Size(R_2)), & R_1 \text{ and } R_2 \text{ are in different sites} \end{cases} \tag{5}$$

In the expression, $|J|$ is the number of records in the join result, $W_j$ is the width of the intermediate dataset after the join, and $\min(Size(R_1), Size(R_2))$ is the size of the smaller of table $R_1$ and table $R_2$.
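The two cases of equation (5) can be sketched as a small function. This is a minimal illustration, not the authors' implementation: it assumes that the record count and width of the join result have already been estimated elsewhere and are passed in as plain numbers, and the names `table_size` and `join_cost` are hypothetical.

```python
def table_size(records: int, width: int) -> int:
    """Size of a table or intermediate result: record count times record width."""
    return records * width


def join_cost(join_records: int, join_width: int,
              size_r1: int, size_r2: int, same_site: bool) -> int:
    """Cost of one join following the two cases of equation (5).

    join_records and join_width describe the intermediate result and are
    assumed to be estimated beforehand (they are inputs here, not computed).
    """
    local_cost = join_records * join_width
    if same_site:
        # Same site: only the cost of the join itself.
        return local_cost
    # Different sites: add the transfer of the smaller of the two tables.
    return local_cost + min(size_r1, size_r2)


# Illustrative numbers only (not from the paper's dataset).
r1 = table_size(records=50, width=10)   # 500
r2 = table_size(records=30, width=10)   # 300
print(join_cost(100, 10, r1, r2, same_site=True))
print(join_cost(100, 10, r1, r2, same_site=False))
```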
The upper part of equation (5) indicates that when two tables are joined at the same site, the cost consists only of the cost of the join itself, because no data are transmitted between sites. The lower part of equation (5) indicates that when the two tables are at different sites, data must be transferred between sites, which produces an additional transmission cost. The cost then includes two parts: the first is the cost of the join, and the second is the cost of data transmission between sites, which equals the size of the smaller of the two tables. For a join query with n relational tables, the intermediate nodes are $(j_1, j_2, \ldots, j_n)$, and the total cost estimation model, COST, is defined as

$$COST = \sum_{i=1}^{n} cost(j_i) \tag{6}$$

1) GLOBAL DATA DICTIONARY
The global data dictionary [18] is a record of the global structure and information of a distributed database. It mainly includes static information about the data tables, such as field information, table names, the number of table records, the number of sites where the tables are located, and the database to which each table belongs. When a global data dictionary is built, multiple records should be created if a field appears in more than one table, or if the table containing a field is distributed over more than one site.
In the process of distributed query, by accessing the global data dictionary, the relevant database information and site information can be obtained, so that the next query decomposition operation can be carried out.

2) NETWORK PERFORMANCE MATRIX
The network performance matrix is an intuitive representation of network performance between sites. The number of rows and the number of columns both equal the number of sites, and the matrix is theoretically symmetric. When cost evaluation involves data transmission between different sites, the network performance matrix is consulted to obtain the network performance parameters between the sites, and these parameters are taken into account in the cost evaluation. Consider a network performance matrix M for four sites. Each diagonal element corresponds to a site communicating with itself. Because there is no network communication cost in this case, the diagonal elements are set to −1. The other elements show that network performance varies among sites. The best performance is between site 1 and site 3, with a value of 3.2. The worst performance is between site 1 and site 2, with a value of 1.2. The value of 0 between site 2 and site 4 indicates that these two sites cannot communicate with each other owing to a fault.
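A matrix of this kind can be sketched as a nested list with a small lookup helper. Only the entries named in the text (the −1 diagonal, 3.2 between sites 1 and 3, 1.2 between sites 1 and 2, and 0 between sites 2 and 4) come from the paper; the remaining entries below are hypothetical placeholders chosen to lie between the stated worst and best values, and `link_performance` is an illustrative name.

```python
# Illustrative 4-site matrix. Entries marked "hypothetical" are NOT from the paper.
M = [
    [-1.0, 1.2, 3.2, 2.0],   # site 1 (2.0 to site 4 is hypothetical)
    [1.2, -1.0, 2.5, 0.0],   # site 2 (2.5 to site 3 is hypothetical)
    [3.2, 2.5, -1.0, 1.8],   # site 3 (1.8 to site 4 is hypothetical)
    [2.0, 0.0, 1.8, -1.0],   # site 4
]


def is_symmetric(matrix):
    """Check the symmetry property stated for the performance matrix."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))


def link_performance(matrix, site_a, site_b):
    """Look up the performance between two sites (1-based site numbers).

    Returns None for a site paired with itself (no network cost applies) and
    raises ValueError when the entry is 0, i.e. the sites cannot communicate.
    """
    if site_a == site_b:
        return None
    value = matrix[site_a - 1][site_b - 1]
    if value == 0:
        raise ValueError(f"sites {site_a} and {site_b} cannot communicate")
    return value


print(link_performance(M, 1, 3))
```

How the looked-up value enters the transmission cost is not specified in this section, so the sketch stops at the lookup.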

III. IDEA OF OPTIMIZATION ALGORITHM
The genetic algorithm (GA) is a random search method derived from the laws of biological evolution [19]. It exhibits strong robustness and fast convergence. Its steps generally include selection, crossover, and mutation. However, classic genetic algorithms commonly suffer from premature convergence and easily fall into a local optimum, which greatly affects their optimization ability. The main reason is that, owing to a lack of population diversity, the search is confined to one region, so the results obtained are optimal only within that region rather than globally. Therefore, maintaining population diversity is very important for genetic algorithms, and many researchers have proposed measures for doing so [20]-[22].
Entropy [23] is a quantitative index used to measure the diversity and richness of a system state. By monitoring and controlling the change in entropy, the system can be steered in a certain direction. Therefore, we introduce information entropy into the genetic algorithm [23], [24] as a control index of state change in the population, so that the algorithm maintains good population diversity throughout the evolution process. In this paper, we propose an adaptive double-entropy genetic algorithm (ADEGA). Two types of entropy are used to control the algorithm. Genotype entropy is used to optimize the generation of the initial population, so that the initial population has a better distribution and a better initial genetic advantage. Phenotype entropy is used to control the genetic operators during the evolution process. Phenotype entropy can be divided into population entropy and individual entropy. Individual entropy mainly affects the selection process, whereas population entropy acts on the crossover, mutation, and recombination processes. With these two types of entropy, the algorithm can adaptively adjust the evolutionary strategy, maintain population diversity, and obtain excellent optimization results.
The flow of the optimization algorithm designed in this paper is shown in Fig. 2.

IV. QUERY OPTIMIZATION BASED ON DOUBLE ENTROPY GENETIC ALGORITHM
To address the query optimization problem of distributed databases and the shortcomings of the genetic algorithm, this paper proposes an improved algorithm that uses two types of entropy to optimize the genetic algorithm and applies it to distributed database queries. The main components of the optimization algorithm are the problem encoding, the fitness function, the initial population optimization, and the genetic operators.

A. ENCODING SCHEME AND FITNESS FUNCTION
In this paper, we chose real encoding to encode the tree structure of the query execution plan. It is assumed that each data table is encoded into a one-digit integer, and the order of the code string represents the access order of the data tables. For example, for code string 12345, the corresponding access order is 1→2→3→4→5. For the encoding of n tables, the encoding length is simply the number of tables.
The specific encoding method is as follows. First, we number the n data tables participating in the query from 1 to n and encode the leaf nodes of the binary tree into an ordered sequence from bottom to top according to the principle of post-order traversal, so that the length of the sequence is n. Then, the access order of the sites during the query can be obtained from the site to which each relational table belongs. Finally, the query cost can be obtained by substituting the relevant data of the tables and sites into the cost calculation model. For example, the query binary tree shown in Fig. 1, which contains relational tables R 1 , R 2 , R 3 , R 4 and R 5 , can be numbered as 1, 2, 3, 4, and 5, and the corresponding encoded string is 12345. Assume that the five tables are located at four different sites and that the sites corresponding to tables 1 to 5 are 1, 2, 2, 3, and 4. The site access order corresponding to the encoded string 12345 is then 12234. When two adjacent access sites are the same, the join operation only needs to calculate the size of the intermediate data set after joining, and there is no transmission cost between sites. For different adjacent access sites, however, the transmission cost between sites must be added to the size of the intermediate data set, and it is the main part of the query cost.
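The mapping from an encoded string to its site access order can be sketched as follows, using the table-to-site assignment from the example above. The names `TABLE_SITE`, `site_access_order`, and `transfers_needed` are hypothetical helpers, and the sketch relies on the one-digit-per-table assumption stated for the encoding.

```python
# Table-to-site assignment from the example: tables 1..5 at sites 1, 2, 2, 3, 4.
TABLE_SITE = {1: 1, 2: 2, 3: 2, 4: 3, 5: 4}


def site_access_order(code: str, table_site: dict) -> list:
    """Translate a join-order string such as "12345" into the site access order.

    Each character is one table number (one-digit encoding, as in the text).
    """
    return [table_site[int(ch)] for ch in code]


def transfers_needed(site_order: list) -> list:
    """For each pair of adjacent accesses, True if the sites differ (transfer needed)."""
    return [a != b for a, b in zip(site_order, site_order[1:])]


order = site_access_order("12345", TABLE_SITE)
print(order)                    # site access order for the example string
print(transfers_needed(order))  # which adjacent steps cross sites
```

In the example, only the step between tables 2 and 3 stays within one site, so it is the only join that incurs no transmission cost.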
The goal of query optimization is to obtain the QEP with the lowest join cost, and the fitness value should reflect the quality of the encoded individual corresponding to each join order. The smaller the join cost, the greater the corresponding fitness value. Therefore, the fitness value should be inversely proportional to the join cost. In this paper, we calculate the cost of each QEP according to equations (5) and (6), and the fitness function of each QEP is obtained by taking the reciprocal of its cost:

$$f = \frac{1}{COST} \tag{7}$$

B. INITIAL POPULATION OPTIMIZATION
The initial population should represent the entire solution space well [25], and its distribution directly affects the quality of the subsequent populations. In general genetic algorithms, the initial population is randomly generated. Owing to this randomness, the initial chromosomes may be concentrated in a local area of the solution space. In that case, the initial population cannot represent the whole solution space, population diversity is lost, and the genetic advantage is greatly reduced.
To improve the diversity of the initial population, we use genotype entropy to assist in generating the initial population. Genotype entropy reflects the diversity of the individual loci. Assume that there is an initial population composed of M individuals with an encoding length of L (as shown in Fig. 3), where $x_i^j$ represents the jth gene of individual i in the population. The genotype entropy $H_j$ of the jth gene in the population can be defined as:

$$H_j = -\sum_{k \in V_j} P_{jk} \log P_{jk}, \qquad P_{jk} = \frac{N_{jk}}{M} \tag{8}$$

In the formula, k represents a possible value of the jth gene, and $V_j$ is the set of such values, whose size theoretically equals the encoding length L. $P_{jk}$ is the frequency with which the jth gene in the population takes the value k, that is, the ratio of the number $N_{jk}$ of individuals with value k at locus j to the population size M. The genotype entropy H of the whole population is defined as the average of the genotype entropies $H_j$ over all loci:

$$H = \frac{1}{L} \sum_{j=1}^{L} H_j \tag{9}$$

The specific generation process of the initial population is as follows:
1) In the individual definition domain, generate N 0 individuals randomly (N 0 < N, where N is the target population size) and calculate their entropy H 0 .
2) In the individual definition domain, generate an individual randomly, and then calculate the entropy H of the new individual together with the existing individuals. If H ≥ H 0 , accept the new individual and update the value of H 0 to H; otherwise, reject the new individual, generate another new individual randomly, and continue step 2) until H ≥ H 0 .
3) Repeat step 2) until the number of individuals in the initial population reaches the target number N.
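The generation steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: individuals are assumed to be random permutations of the table numbers, the natural logarithm is used, and a `max_tries` cap (not part of the description above) is added so that the loop cannot stall when no candidate can raise the entropy any further. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def genotype_entropy(population):
    """Average over all loci of the entropy of gene values at that locus."""
    m = len(population)
    length = len(population[0])
    total = 0.0
    for j in range(length):
        counts = Counter(ind[j] for ind in population)
        total += -sum((c / m) * math.log(c / m) for c in counts.values())
    return total / length


def random_individual(n_tables, rng):
    """Assumption: an individual is a random permutation of table numbers 1..n."""
    genes = list(range(1, n_tables + 1))
    rng.shuffle(genes)
    return genes


def init_population(n_tables, n_target, n_seed, rng, max_tries=1000):
    # Step 1: random seed group of n_seed individuals and its entropy H0.
    population = [random_individual(n_tables, rng) for _ in range(n_seed)]
    h0 = genotype_entropy(population)
    # Steps 2-3: add individuals one at a time, keeping entropy from dropping.
    while len(population) < n_target:
        for _ in range(max_tries):
            candidate = random_individual(n_tables, rng)
            h = genotype_entropy(population + [candidate])
            if h >= h0:
                break
        # Assumption: after max_tries rejections the last candidate is accepted
        # anyway, so the procedure always terminates.
        population.append(candidate)
        h0 = h
    return population


pop = init_population(n_tables=5, n_target=20, n_seed=5, rng=random.Random(0))
print(len(pop), round(genotype_entropy(pop), 3))
```

The cap matters in practice: once every locus is already uniformly distributed, any single added individual lowers the entropy, so a strict acceptance test would never succeed.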
For example, for the join query of five relational tables in Fig. 1, the query tree can be encoded into the gene sequence 12345, which is one of the combinations in the solution space. With the above steps, each round produces a new gene sequence that differs as much as possible from the individuals already in the population. Retaining only such individuals ensures that the genotype entropy of the expanded population is not lower than that of the previous population. An initial population generated in this way is well distributed over the entire solution space, has good diversity, and can accelerate evolution.

C. GENETIC OPERATION
In the iterative process of the genetic algorithm, the fitness of the chromosome is the core of the judgment. In this paper, the fitness value was inversely proportional to the cost of the query execution plan. The higher the fitness value, the smaller the corresponding query cost, and the better the corresponding QEP. The fitness value is calculated by substituting the encoding sequence into the cost model and taking the reciprocal, which can be regarded as the phenotype of the chromosome. Therefore, this paper introduces the concept of phenotype entropy, referring to [26].
Definition 1 (Population Entropy): Population entropy represents the phenotype entropy of the entire population. Assume that S is the search space and that the population of generation t is $P_t = \{x_1^t, x_2^t, \ldots, x_N^t\} \in S^N$, where N is the population size, and consider the sub-population produced by the population of generation t. We define the active window $w_t$ of generation t as follows: 1) the active window of the initial population is $w_0 = [l_0, u_0]$, where $l_0$ is the lower limit of the fitness value of the initial population and $u_0$ is the upper limit; 2) the active window of generation t is $w_t = [l_t, u_t]$, and the active window of generation t + 1 is $w_{t+1} = [l_{t+1}, u_{t+1}]$, where $l_{t+1} = \min(l_t, l_{ot})$ and $u_{t+1} = \max(u_t, u_{ot})$. Here $l_t$ and $u_t$ are the lower and upper limits of the population fitness of generation t, and $l_{ot}$ and $u_{ot}$ are the lower and upper limits of the fitness value of the sub-population produced by generation t. The active window is then divided equally into K intervals. The range of the jth interval in the population of generation t can be expressed as:

$$\left[\, l_t + \frac{(j-1)(u_t - l_t)}{K},\; l_t + \frac{j\,(u_t - l_t)}{K} \,\right], \quad j = 1, 2, \ldots, K \tag{10}$$

If the number of individuals of the population $P_t$ falling into the jth interval of the active window $w_t$ is $n_j$, then the individual density in this interval is $n_j / N$, and the population entropy E of generation t can be defined as:

$$E = -\sum_{j=1}^{K} \frac{n_j}{N} \log \frac{n_j}{N} \tag{11}$$

Definition 2 (Individual Entropy):
In the population $P_t$ of generation t, if the fitness value of chromosome i falls into the jth interval of the active window, the individual entropy $\varepsilon_i$ of chromosome i can be defined as

$$\varepsilon_i = -\frac{1}{N} \log p_{ij} \tag{12}$$

where $p_{ij}$ is the individual density of the jth interval, in which chromosome i falls, and N is the population size. Population entropy and individual entropy are interrelated. The additivity of entropy can be verified from equations (11) and (12): every individual in the jth interval contributes the same term, and summing these terms over all individuals gives the population entropy. That is, the population entropy is equal to the sum of all individual entropies. Population entropy measures the distribution of the population at the macro level, while individual entropy describes the distribution of individuals at the micro level.
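The two definitions and their additivity can be checked numerically with a short sketch. It assumes the natural logarithm, takes the active window limits as given inputs, and places a fitness value equal to the upper limit in the last interval; the function names and the sample fitness values are illustrative only.

```python
import math


def interval_index(f, lo, hi, k):
    """Index (0-based) of the active-window interval that fitness f falls into."""
    if hi == lo:
        return 0  # degenerate window: everything falls into one interval
    return min(int((f - lo) / (hi - lo) * k), k - 1)


def interval_counts(fitness, lo, hi, k):
    counts = [0] * k
    for f in fitness:
        counts[interval_index(f, lo, hi, k)] += 1
    return counts


def population_entropy(fitness, lo, hi, k):
    """E = -sum_j (n_j / N) log(n_j / N) over the non-empty intervals."""
    n = len(fitness)
    counts = interval_counts(fitness, lo, hi, k)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def individual_entropies(fitness, lo, hi, k):
    """eps_i = -(1 / N) log(p_ij), with p_ij the density of i's interval."""
    n = len(fitness)
    counts = interval_counts(fitness, lo, hi, k)
    return [-(1 / n) * math.log(counts[interval_index(f, lo, hi, k)] / n)
            for f in fitness]


# Illustrative fitness values (not from the paper).
fitness = [0.1, 0.2, 0.25, 0.9]
E = population_entropy(fitness, lo=0.1, hi=0.9, k=4)
eps = individual_entropies(fitness, lo=0.1, hi=0.9, k=4)
print(E, sum(eps))  # the two values agree up to rounding
```

Individuals in crowded intervals receive a small individual entropy, and isolated individuals receive a large one, which is what makes the quantity useful for protecting rare individuals during selection.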

1) SELECTION OPERATOR
In this paper, we combine the screening effects of fitness and individual entropy on the population. The selection probability of each individual is given by equation (13), where $f_i$ is the fitness value of the ith individual in the population and $\varepsilon_i$ is the individual entropy defined above. We use roulette wheel selection [27] to filter the population. When population diversity is high, individuals with higher fitness values are more likely to be retained. When population diversity has been lost, individuals with small fitness values have a greater chance of being retained in the next generation. This avoids the loss of useful genes and helps maintain population diversity.
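Roulette wheel selection itself can be sketched independently of how the selection probabilities are computed. The sketch below takes the probabilities as an input list, so it makes no assumption about the form of the selection probability formula; `roulette_select` is a hypothetical name.

```python
import random
from itertools import accumulate


def roulette_select(probabilities, k, rng):
    """Draw k indices, each with chance proportional to its probability."""
    cumulative = list(accumulate(probabilities))
    total = cumulative[-1]
    picks = []
    for _ in range(k):
        r = rng.random() * total
        for idx, edge in enumerate(cumulative):
            if r < edge:
                picks.append(idx)
                break
        else:
            # Guard against floating-point rounding at the upper edge.
            picks.append(len(cumulative) - 1)
    return picks


# Illustrative probabilities only.
print(roulette_select([0.1, 0.6, 0.3], k=5, rng=random.Random(42)))
```

Because the draw is scaled by the sum of the inputs, the list does not need to be normalized beforehand.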

2) CROSSOVER OPERATOR
In the genetic algorithm, the crossover operator is the main method for generating new individuals. In this paper, we consider the influence of the generation number and of entropy on the crossover operator. In the early stages of evolution, a large crossover probability should be adopted to accelerate the generation of new individuals. In the later stages of evolution, however, the crossover probability should be reduced to avoid destroying the structure of excellent individuals. When the population entropy is large, the individual diversity of the population is high. At this time, it is necessary to concentrate on mining the structure of better solutions, so the crossover probability should be increased. When the population entropy is small, the crossover probability should be reduced to avoid destroying the optimal solution structure. Therefore, the crossover probability $P_C$ is set according to equation (14), in which $P_{c0}$ is the initial crossover probability (0.9 in this paper), g is the current generation, G is the maximum number of generations, E is the population entropy of the current population, and H is the theoretical maximum of the population entropy.
For two individuals $x_1^t$ and $x_2^t$ of generation t whose gene values lie in the range [a, b], the crossover operator generates the offspring $x_1^{t+1}$ and $x_2^{t+1}$ according to equation (15), where u is a random number uniformly distributed in [0, 1].

3) MUTATION OPERATOR
When the algorithm iterates into a local area, it may fall into a local optimum. To make the algorithm jump out of the local area and continue to search globally, we adopt the following mutation strategy. When the population entropy is large, the diversity of the population is high and the solution space is sufficiently covered, so the mutation operator only needs to search for the optimal value in the current range. At this time, the probability and step length of the mutation should be reduced. Conversely, when the population entropy is small, the population lacks diversity, and it is necessary to increase the probability and step length of the mutation so that the algorithm can jump out of the local area and search globally. As the number of iterations increases, the step length of the mutation should be reduced to avoid breaking the optimal solution structure that has been found. The mutation probability, however, should increase, because the crossover probability is small in the later iterations and the mutation operator is then used to assist the generation of new individuals. According to the above analysis, the mutation probability $P_M$ is set according to equation (16), in which $P_{M0}$ is the initial mutation probability (0.06 in this paper). Mutated individuals are generated according to equation (17), where $x'$ is the mutated individual of x, r and $r_1$ are random numbers within (0, 1), and u and l are the upper and lower limits of the values of the chromosome gene.

4) REORGANIZATION OF PARENT AND OFFSPRING POPULATIONS
To maintain good population diversity during the evolution process, when the offspring and parent individuals are reorganized to form the next-generation population, the change in population entropy is used as the selection criterion. This ensures that, in every round, the population entropy of the next generation is not lower than that of the previous generation. The specific steps are as follows.
1) Combine all the parent individuals and all the offspring individuals generated by the genetic operations to form an intermediate population, calculate the fitness value of every individual, and sort the individuals by fitness value from large to small.
2) According to the elite retention strategy, select the first N 0 individuals with the best fitness values from the intermediate population, record them as the temporary population P 0 , calculate its population entropy E 0 , and remove these N 0 individuals from the intermediate population.
3) Select the first individual from the remaining intermediate population, add it to P 0 , and calculate the population entropy E of the new P 0 . If E ≥ E 0 , retain the individual in P 0 and remove it from the intermediate population. Otherwise, do not retain the individual; select the next individual from the intermediate population and repeat step 3) until E ≥ E 0 .
4) Repeat step 3) until the number of individuals in the temporary population P 0 reaches the required N. Then, P 0 is the population of the next generation.
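The reorganization steps above can be sketched as follows. Several details are assumptions made only for this illustration: the active window is taken as the fitness range of the intermediate population, K = 10 intervals are used, E 0 is kept fixed after step 2) as the steps are written, and if no remaining individual passes the entropy test the best remaining one is taken so that the loop terminates. The names `population_entropy` and `reorganize` are hypothetical.

```python
import math


def population_entropy(fitness, lo, hi, k):
    """Entropy of fitness values over k equal intervals of the window [lo, hi]."""
    counts = [0] * k
    for f in fitness:
        j = 0 if hi == lo else min(int((f - lo) / (hi - lo) * k), k - 1)
        counts[j] += 1
    n = len(fitness)
    return -sum((c / n) * math.log(c / n) for c in counts if c)


def reorganize(parents, offspring, fitness_fn, n_target, n_elite, k=10):
    # Step 1: merge parents and offspring, sort by fitness from large to small.
    pool = sorted(parents + offspring, key=fitness_fn, reverse=True)
    fits = [fitness_fn(x) for x in pool]
    lo, hi = fits[-1], fits[0]  # assumed active window: range of the pool
    # Step 2: keep the n_elite best individuals and compute their entropy E0.
    chosen = list(range(n_elite))
    remaining = list(range(n_elite, len(pool)))
    e0 = population_entropy([fits[i] for i in chosen], lo, hi, k)
    # Steps 3-4: add individuals whose inclusion keeps the entropy >= E0.
    while len(chosen) < n_target and remaining:
        base = [fits[i] for i in chosen]
        for pos, idx in enumerate(remaining):
            if population_entropy(base + [fits[idx]], lo, hi, k) >= e0:
                break
        else:
            pos = 0  # assumed fallback: no candidate passed, take the best left
        chosen.append(remaining.pop(pos))
    return [pool[i] for i in chosen]


# Toy example: individuals are plain numbers and fitness is the number itself.
print(reorganize([1, 2, 3], [4, 5, 6], fitness_fn=float, n_target=3, n_elite=2))
```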
This method keeps the population as diverse as possible while still selecting excellent individuals for the next generation, which can increase the search speed of the algorithm.

V. SIMULATION RESULTS
In this section, we compare the performance of the ADEGA algorithm with that of other algorithms for distributed queries on a distributed database composed of five servers. The dataset used in the experiment contained more than 2 million records obtained from the Internet, which were divided into several tables and randomly allocated to the databases of the five servers. All experiments were carried out on an Intel (R) 2.2 GHz machine with 8 GB of physical memory, and all five servers used the CentOS 7 system and the MySQL database.
The comparison algorithms selected in this paper were the adaptive genetic algorithm (AGA) [28] and the parallel ant colony algorithm (PACA) [29]. The parameters of the algorithms were set as follows: the population size was 100, the maximum number of iterations was 500, the initial crossover probability was 0.9, and the initial mutation probability was 0.06. In the experiment, the communication performance among sites was represented by a network performance matrix.
In this experiment, we tested distributed database queries with 4, 6, 8, 10, and 12 tables. First, to compare the proposed algorithm with the comparison algorithms, we used the three algorithms to process a join query of 10 tables. The convergence diagram for the three algorithms is shown in Fig. 4. When a curve becomes flat, the corresponding algorithm has converged. The figure shows that ADEGA starts to converge after the smallest number of iterations, PACA after slightly more iterations than ADEGA, and AGA after the largest number of iterations. Thus, compared with the comparison algorithms, the proposed algorithm converges faster and finds its best solution more quickly. The ordinate in the figure represents the minimum cost in the population after each iteration. When all three algorithms have converged, the minimum cost of ADEGA is the smallest, the minimum cost of PACA is slightly larger than that of ADEGA, and the minimum cost of AGA is the largest. Thus, the proposed algorithm outperforms the comparison algorithms, and the solution it finds is closer to the global optimal solution.
The above results show that the algorithm proposed in this paper has better convergence performance and obtains better results than the comparison algorithms.
Second, for distributed database queries with 4, 6, 8, 10, and 12 join tables, we performed the search using the three algorithms. We used the amount of data transmitted by the query execution plan to represent the query cost. An algorithm was considered to have converged when its minimum query cost did not change for 100 consecutive generations. We recorded the minimum cost, the number of iterations, and the search time at convergence for each algorithm. The results are presented in Table 1, Table 2, and Table 3.
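The convergence criterion described above can be sketched as a simple check on the history of per-generation minimum costs. The function name `has_converged` and the choice of a plain list as the history are assumptions made for illustration; exact equality is used because the criterion is stated as the cost not changing.

```python
def has_converged(min_cost_history, patience=100):
    """True if the minimum cost has not changed over the last `patience` generations.

    min_cost_history holds one minimum cost per generation, oldest first.
    An unchanged run of `patience` generations needs patience + 1 equal entries.
    """
    if len(min_cost_history) <= patience:
        return False
    recent = min_cost_history[-(patience + 1):]
    return all(cost == recent[0] for cost in recent)


# Illustrative history: the cost improves once, then stays flat.
history = [6.0] + [5.0] * 101
print(has_converged(history))
```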
From the results shown in Table 1, Table 2, and Table 3, we can conclude the following. When the number of join tables is 4, the optimization results of the three algorithms are almost the same. This is because, with 4 join tables, there are only 24 different QEPs, which is fewer than the population size of 100. The initial population then contains almost all possible chromosome sequences, and all three algorithms can find the global optimal QEP at the beginning. Thus, there is little difference among the three algorithms under this condition. However, as the number of join tables increases to 6, 8, and more, the search space far exceeds the population size, and the optimal solution must be searched for gradually. Thus, the performance difference among the three algorithms becomes increasingly significant. Under these conditions, ADEGA performs best: the solution it finds has the lowest query cost, and it requires the fewest iterations and the shortest search time. PACA ranks second, and AGA performs worst. Moreover, the larger the number of join tables, the greater the gap between ADEGA and the comparison algorithms. This indicates that the ADEGA proposed in this paper has a better optimization effect and higher query efficiency in distributed database multi-join queries.
Finally, we tested the effect of the proposed algorithm on distributed database queries. We applied the optimal query schemes found by the three algorithms in our distributed database environment, recorded the query time of each scheme, and used the following evaluation indicators:

$$\text{search cost ratio} = \frac{\text{search time of current optimal query scheme}}{\text{search time of global optimal query scheme}} \tag{18}$$

$$\text{query cost ratio} = \frac{\text{execution time of current optimal query scheme}}{\text{execution time of global optimal query scheme}} \tag{19}$$

The results are shown in Fig. 5 and Fig. 6. From Fig. 5 and Fig. 6, we can see that when the number of tables is 4, because the initial population may contain all different QEPs, the search cost ratios and query cost ratios of all three algorithms are very close to 1, and there is little difference in search performance among the three algorithms. However, as the number of join tables increases, ADEGA has a smaller search cost ratio and query cost ratio than AGA and PACA, and both ratios are closer to 1. This indicates that, as the number of join tables increases, ADEGA finds the solution closest to the global optimal solution more quickly and efficiently, and the query efficiency is greatly improved. This is because the proposed algorithm maintains good population diversity throughout the iteration process, so that it can jump out of local optima, avoid premature convergence, and search more effectively for the global optimal solution.

VI. CONCLUSION
To address the premature convergence that occurs when traditional genetic algorithms are applied to multi-join queries in distributed databases, this paper proposes an adaptive double-entropy genetic algorithm (ADEGA) based on genotype entropy and phenotype entropy. The algorithm optimizes the initial population distribution based on genotype entropy and adaptively selects genetic strategies based on phenotype entropy to maintain population diversity during iteration. The experimental results show that, by maintaining population diversity in the evolution process, the algorithm is effectively prevented from falling into a local optimum, its global search ability is improved, and a better query execution plan can be obtained.