Boosting perturbation-based iterative algorithms to compute the median string

The most competitive heuristics for calculating the median string are those that use perturbation-based iterative algorithms. Given the complexity of this problem, which under many formulations is NP-hard, the computational cost of the exact solution is not affordable. In this work, we address the heuristic algorithms that solve this problem, emphasizing their initialization and the policy used to order the possible editing operations. Both factors weigh significantly on the solution of this problem. The choice of the initial string influences the algorithm's speed of convergence, as does the criterion chosen to select the modification to be made in each iteration. To obtain the initial string, we use the median of a subset of the original dataset; to obtain this subset, we apply the Half Space Proximal (HSP) test to the median of the dataset. This test provides sufficient diversity within the members of the subset while at the same time fulfilling the centrality criterion. Similarly, we provide an analysis of the stop condition of the algorithm, improving its performance without substantially damaging the quality of the solution. To analyze the results of our experiments, we measured the execution time of each proposed modification of the algorithms, the number of computed edit distances, and the quality of the solution obtained. With these experiments, we empirically validated our proposal.


I. INTRODUCTION
The median string problem has attracted the attention of the scientific community in different domains, from early work by Kohonen [10] through word recognition and prototyping. Examples include the encoding of representative shapes as strings [32], handwritten character recognition [2], and prototyping as a way to condense a dataset [51]. In classification tasks, an approximation to the median string is a better prototype than the string of the set that accumulates the least distance to the rest [15]. Although they converge rapidly, some heuristics hardly improve the quality of the approximation to the median string with respect to the starting point. In contrast, heuristics that converge to better solutions in problems such as character string classification [51] can take tens of hours to find a solution [29], [43]. Work is therefore needed to improve the speed of these algorithms.
Even though there are algorithms that find the exact solution to this problem [11], their computational cost is exponential: for a set $S$ of strings of length $l$, $O(l^{|S|})$ time is needed. Furthermore, various authors have shown that the median string problem is W[1]-hard in $|S|$, even for binary alphabets, under the Levenshtein distance. Approximate approaches are therefore used in practice; examples are the greedy algorithms in [4], [27].
Many heuristics can be classified as perturbation-based iterative refinement algorithms. An initial seed or string is modified through editing operations (also called perturbations in the literature) to get closer to the median string. There are different strategies for selecting the editing operation to be tested. One example is to test possible insertions, deletions, and substitutions in a pre-established order, without estimating the probability that the operation will be successful [15]. In [1] the authors propose a different approach, ranking operations by an index that allows the most promising ones to be tested first; a better variant of this index was proposed in [29]. We cover both approaches in Section II.
This work proposes two variations to the perturbation-based iterative refinement algorithms for computing the median string that improve their performance. The first variation affects the algorithm's initialization: we seek an alternative to the median string belonging to the dataset, which was the best initialization proposed in previous works. The new initialization consists of computing the median of a subset of the original dataset. We propose that this subset be the Half Space Proximal neighbors of the median of the string set. The second variation modifies the stop condition: we propose trimming the list of possible edits in each iteration. This trimming occurs when the expected quality of the operations is less than a threshold; in our case, we took 0 as the threshold.
The rest of this paper is organized as follows. In Section II, we review relevant concepts and related work. In Section III, we describe our proposal for a new initialization of the algorithms and the modification of the stop condition. In Section IV, we present the experiments carried out and discuss the results, comparing them with other relevant algorithms. Finally, Section V presents our conclusions.

II. PRELIMINARIES AND RELATED WORK
As described in [1], [29], the median string problem can be defined as follows. Let $\Sigma$ be an alphabet, let $\Sigma^*$ be the set of all strings over $\Sigma$, and let $\varepsilon$ be the empty symbol over this alphabet. For two strings $S_i, S_j \in \Sigma^*$, an edit operation is a pair $(a, b) \neq (\varepsilon, \varepsilon)$, written $a \to b$, which transforms a string $S_i$ into $S_j$ if $S_i = \sigma a \tau$ and $S_j = \sigma b \tau$, where $\sigma$ and $\tau$ represent substrings.
We denote as $E_{S_i}^{S_j} = \{e_1, e_2, \ldots, e_n\}$ the sequence of edit operations transforming $S_i$ into $S_j$.
There are multiple approaches to tackle the median string problem; in this paper, we focus on the algorithms based on perturbations. This idea was originally conceived by Kohonen [10], who proposed starting from the median of the set and systematically mutating each of its symbols, verifying whether the sum of the distances decreased. That work established no criteria to evaluate the feasibility of each of the possible modifications.
We call perturbations each of the editing operations defined for the Levenshtein edit distance [13], that is, the substitution of a symbol $a$ by a symbol $b$, the deletion of a symbol $a$, or the insertion of a symbol $b$. We denote these operations, as well as their costs, by $w(a, b)$, $w(a, \varepsilon)$, and $w(\varepsilon, b)$, respectively. One interpretation of the edit distance is to find the sequence of transformations from the string $s_i$ to $s_j$ such that the sum of the costs of the operations involved is minimal.
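The interpretation of the edit distance as a minimum-cost sequence of operations can be illustrated with the classic dynamic programming recurrence. The following is a minimal sketch, not the implementation used in the experiments; the names `edit_distance`, `unit_cost`, and `EPS` are illustrative assumptions, and the empty Python string stands in for the empty symbol.

```python
from typing import Callable

EPS = ""  # assumption: the empty string plays the role of the empty symbol


def unit_cost(a: str, b: str) -> float:
    """Unit-cost w(a, b): 0 for a match, 1 for any substitution, insertion, or deletion."""
    return 0.0 if a == b else 1.0


def edit_distance(s: str, t: str, w: Callable[[str, str], float] = unit_cost) -> float:
    """Minimum total cost of the edit operations that transform s into t."""
    # Row 0: the empty prefix of s is turned into t[:j] by insertions only.
    prev = [0.0] * (len(t) + 1)
    for j in range(1, len(t) + 1):
        prev[j] = prev[j - 1] + w(EPS, t[j - 1])
    for i in range(1, len(s) + 1):
        # Column 0: s[:i] is turned into the empty string by deletions only.
        curr = [prev[0] + w(s[i - 1], EPS)] + [0.0] * len(t)
        for j in range(1, len(t) + 1):
            curr[j] = min(
                prev[j] + w(s[i - 1], EPS),           # delete s[i-1]
                curr[j - 1] + w(EPS, t[j - 1]),       # insert t[j-1]
                prev[j - 1] + w(s[i - 1], t[j - 1]),  # substitute or match
            )
        prev = curr
    return prev[len(t)]
```

Passing a different `w` (for example, one derived from a substitution matrix) changes the costs without changing the recurrence; only two rows of the table are kept in memory at any time.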
Some authors, such as [4], take the empty string as the initial string $\hat{s}$ and perform a greedy search, appending at each iteration the symbol that is estimated to lead to the lowest sum of distances. First, $\sum_{s_i \in S} d(\hat{s} + c_j, s_i)$ is calculated for each $c_j \in \Sigma$ using dynamic programming [50], storing the row relative to $c_j$ from the dynamic programming array. The sum of the minimum value in each row yields the estimate of how promising $c_j$ is; the lower, the better. The process is repeated until no $c_j$ leads to an improvement or the length of $\hat{s}$ matches that of the longest string in $S$. In [12] this procedure is enhanced with a different quality estimate, as well as a procedure for resolving cases where more than one symbol has the same quality.
In [5] the authors introduced the idea of performing multiple operations simultaneously to speed up the calculation. The algorithm takes the set median as the initial approximation $\hat{s}$. In each iteration, the distances from $\hat{s}$ to all the strings in $S$ are computed, registering the list of edit operations in each case. After that, for each position $i$ of $\hat{s}$, the most frequent operation is applied. This process is repeated until there are no further improvements. Experiments show that the algorithm is faster than [10]; however, the authors provide no details on the quality of the solution. This strategy was later revisited in [45], which shows that there may be an inverse relationship between convergence speed and solution quality, since the quality of the solution deteriorates as the number of simultaneous operations increases. The authors also stated that this could slightly increase the probability of finding the global optimum, since algorithms that change only one symbol at a time can get stuck at a local optimum.
In [14] the authors propose a one-operation-at-a-time approach that evaluates operations in a specific order. For each position $i$ of $\hat{s}$, each possible substitution of the $i$-th symbol is applied to identify $\hat{s}_{sub}$, the string with the smallest sum of distances. The string $\hat{s}_{ins}$ is calculated in a similar way, and $\hat{s}_{del}$ results from the deletion of the symbol at position $i$. The new candidate is the best among the three options and $\hat{s}$; then the process continues for $i + 1$. The authors also evaluated two optimizations to reduce the algorithm's $O(|\Sigma| \times |S| \times L^3)$ time. The division technique splits each string in $S$ into $d$ substrings, so that a string $s \in S$ yields the strings $s_1, s_2, \ldots, s_d$. The procedure independently calculates the median of each resulting set, $\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_d$, and these are concatenated to obtain the median of $S$. The other improvement concerns insertions and substitutions, which account for most of the computation time. For substitutions, only the two symbols closest to the one at position $i$ are evaluated, and for insertions, only the symbol at position $i - 1$ and its two closest symbols. This avoids the $|\Sigma|$ factor. They compare the three choices in a classification task where the median is a prototype for $k$-NN classification, but without considering the sum of distances. Regardless, the results suggest that the splitting optimization leads to faster convergence but lower-quality prototypes.
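The position-by-position refinement described for [14] can be sketched as follows. This is a simplified illustration under stated assumptions (unit-cost Levenshtein distance, a non-empty alphabet, and no division or neighbor-symbol optimizations); the function names are hypothetical and do not come from [14].

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def sum_of_distances(cand: str, strings: list[str]) -> int:
    return sum(levenshtein(cand, s) for s in strings)


def sweep_once(cand: str, strings: list[str], alphabet: str) -> tuple[str, int]:
    """Visit each position once, keeping the best single edit if it improves."""
    best, best_cost = cand, sum_of_distances(cand, strings)
    i = 0
    while i <= len(best):
        # Insertions before position i (i == len(best) appends at the end).
        options = [best[:i] + a + best[i:] for a in alphabet]
        if i < len(best):
            # Substitutions of the i-th symbol and its deletion.
            options += [best[:i] + a + best[i + 1:] for a in alphabet if a != best[i]]
            options.append(best[:i] + best[i + 1:])
        cost, option = min((sum_of_distances(o, strings), o) for o in options)
        if cost < best_cost:  # keep the current candidate on ties
            best, best_cost = option, cost
        i += 1
    return best, best_cost


def refine(cand: str, strings: list[str], alphabet: str) -> tuple[str, int]:
    """Repeat sweeps until a full sweep yields no improvement."""
    cost = sum_of_distances(cand, strings)
    while True:
        new, new_cost = sweep_once(cand, strings, alphabet)
        if new_cost >= cost:
            return cand, cost
        cand, cost = new, new_cost
```

Every option costs one sum of distances, that is, $|S|$ edit distance computations, which is why later works focus on ranking the candidate operations instead of testing them exhaustively.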
Another work that advocates simultaneous operations is [3]. Even though the proposal is not an algorithm for the median string, it can be easily adapted. The main loop is similar to that of [5]: it calculates the distance from $\hat{s}$ to the strings in $S$ to get the frequency of the edit operations. However, only operations with a frequency greater than a threshold $\eta$ are considered. The authors focused on prototyping for classification and report no experiments assessing the quality of the median string.
More recently, [1] improved the results in [14] by applying perturbations one at a time. When calculating the distance from $\hat{s}$ to the strings in $S$, the authors also take into account the editing operations. The hypothesis is that the operations in these edit sequences have a better chance of improving the solution, given the results in [2], which state that if we apply to $\hat{s}$ one of the operations of the edit sequence of $d(\hat{s}, s_i)$ to get $\hat{s}'$, then $d(\hat{s}', s_i) \le d(\hat{s}, s_i)$. This is the underlying idea in [3], [5], but the authors also took into account the cost of the operation to estimate how much the sum of distances would decrease by applying the operations. The estimated probability, from highest to lowest, determines the order in which operations are evaluated. In cases where an edit operation is not present in all edit sequences, the heuristic is optimistic, as it does not consider how the application of the operation will affect the distance from $\hat{s}$ to the strings that do not request this operation. In [29] the authors tackled this problem, offering an estimate for some of the cases, with a significant improvement in convergence rate and without affecting the quality of the solution. A comparison between algorithms performing multiple operations simultaneously [3], [5] and one operation at a time [1], [14], [29] suggests that the former converge faster but with a lower approximation quality. This can be explained by the side effect of editions in other positions, as suggested by [5].
Recently, [47] describes the applications of median string in DNA motif classification, where the median string is computed using Markov chains. Also, a prototype generation in the string space via approximate median for data reduction in nearest neighbor classification is presented in [49].

III. OUR PROPOSAL
Most algorithms start from the empty string [1], [5] or from the string belonging to the dataset that accumulates the least distance to the rest, understood as the median of the dataset [1], [4], [10], [14], [29], [43]-[45]. Our goal is to propose an alternative initialization since, according to the works above, initialization affects the speed of convergence of the algorithms.
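The set median used as the usual starting point can be computed directly from its definition, as in the minimal sketch below. It assumes the unit-cost Levenshtein distance; `set_median` is an illustrative name, and ties are resolved in favor of the first string in the list.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def set_median(strings: list[str]) -> str:
    """String of the set that accumulates the least distance to the rest."""
    return min(strings, key=lambda c: sum(levenshtein(c, s) for s in strings))
```

This naive version computes a quadratic number of edit distances in the size of the set, which is part of the cost of any initialization built on the set median.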

A. SELECTING A BETTER INITIALIZATION USING THE HALF SPACE PROXIMAL (HSP) TEST.
The Half Space Proximal (HSP) graph was originally defined by Chávez et al. [19]; it is a sparse subgraph of the complete graph in a metric space. To construct this graph, the HSP test is applied separately to each object $S_i \in S$. The output of the HSP test on $S_i$ is a partition of $S$, $P = \{P_1, P_2, \ldots\}$, where each $P_j$ has a representative object $p_{jr}$ that is connected to $S_i$ in the HSP graph. One advantage of this test is that it reduces the number of edges of the complete graph. The connection of the vertices in the HSP graph guarantees two desirable properties for our work: proximity and diversity.
The HSP test proceeds iteratively. Let $S'$ be the set of strings in iteration $i$ (when $i = 0$, $S' = S$). In each iteration $i$, the algorithm finds the nearest neighbor of $m$ in $S'$, $p_{ir} = kNN(m, S')$. All the strings that are closer to $p_{ir}$ than to $m$ form a new subset $P_i$ of $S'$, where $p_{ir}$ is the representative of $P_i$. All the strings in $P_i$ are removed from $S'$. This process repeats until $S'$ is empty. After this, every string is reassigned to the subset corresponding to the closest representative.
In Fig. 1, we show pseudo-code for our implementation of the Half Space Proximal test. In our case, we previously removed the element $m$ from the dataset. The output of this algorithm is a set of disjoint subsets $P_i$ derived from the original dataset. Each $P_i \in P$ has a representative $p_{ir}$, which is the element of $P_i$ closest to $m$. We initialize $P$ as an empty set (line 1) and iterate until every string in $S$ has been added to some subset $P_i \in P$ (lines 2-15). In each iteration, we search for the nearest neighbor of $m$, $nn \in S$ (lines 4-7); then, all the strings closer to $nn$ than to $m$ are removed from $S$ and added to $Q$ (lines 8-13), including $nn$, which becomes the representative of $Q$. The set $Q$ is added to $P$ (line 14), and a new iteration begins if $S$ is not empty. From line 16 to line 31, each element $p_{ij} \in P_i$ such that $p_{ij} \neq p_{ir}$ is reassigned to the nearest representative. This part of the algorithm takes $O((n - p) \times p)$ time, where $n$ is the size of $S$ and $p$ is the number of subsets in $P$.
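The two stages of the test described above (building the subsets, then reassigning non-representatives) can be sketched as follows. This is an illustrative reading of the description, not a transcription of Fig. 1: the name `hsp_test`, the generic `dist` argument, and the convention that each returned group lists its representative first are assumptions. As in the text, `m` is assumed to have been removed from `items` beforehand.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def hsp_test(m: T, items: Sequence[T], dist: Callable[[T, T], float]) -> list[list[T]]:
    """Partition items around m; each group starts with its representative."""
    remaining = list(range(len(items)))  # indices, so duplicates are handled
    reps: list[int] = []
    while remaining:
        # The nearest remaining neighbor of m becomes a representative.
        nn = min(remaining, key=lambda i: dist(m, items[i]))
        reps.append(nn)
        # Drop nn and every item strictly closer to nn than to m.
        remaining = [
            i for i in remaining
            if i != nn and dist(items[i], items[nn]) >= dist(items[i], m)
        ]
    # Reassignment: every non-representative joins its closest representative.
    groups = {r: [items[r]] for r in reps}
    rep_set = set(reps)
    for i, item in enumerate(items):
        if i not in rep_set:
            nearest = min(reps, key=lambda r: dist(item, items[r]))
            groups[nearest].append(item)
    return [groups[r] for r in reps]
```

For strings, `dist` would be the edit distance in use. The sketch recomputes distances freely for clarity; an implementation that counts edit distance computations, as in the experiments, would cache them.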
In Fig. 2 we illustrate an example of the Half Space Proximal test. To simplify, we show a possible spatial relation of the different strings in the set $S$ in terms of the distance between them. In Fig. 2(a), the star represents the set median $m \in S$. Then, in Fig. 2(b), string $s_1$, represented by a white square, is selected as the closest to $m$, and the space is split into two parts: any string closer to $s_1$ than to $m$ is assigned to $P_1$, and $p_{1r} = s_1$. In Fig. 2(c), the string represented by a white triangle is selected as the closest to $m$, and the previous process is repeated to build $P_2$. This process continues, selecting the nearest string $s_3$, represented by a white diamond, to build $P_3$, and so on, until all strings belong to some subset of $P = \{P_1, P_2, \ldots, P_n\}$. In our example, we can see in Fig. 2(d) that all strings are already assigned. The last stage of the test is to reassign every string $s_i \neq p_{ir}$ to the $P_i$ whose $p_{ir}$ is closest to $s_i$. The final reassignment is shown in Fig. 2(e).
We propose Algorithm 3 for calculating the median string when the strings in the dataset have different weights. This algorithm takes as input an instance set $S$, an initialization string $R$, and a weight vector $W$ holding the weight of each element of $S$. The algorithm iterates through the same steps until no edition applied to $\hat{S}$ improves the result. First, the distances between $R$ and each $S_i$ and the respective editions $E_{S_i}^{R}$ are computed (lines 4-7). Each edition in $E_{S_i}^{R}$ has an associated weight $W_i$ used to update the statistics (line 6). In lines 8-14, the repercussion of each edition affecting the same position is computed, generating a goodness index for the editions. All editions are inserted in a priority queue $Q$, sorted by goodness index. Then, we discard from $Q$ all editions $q_i$ with $q_i.goodnessIndex \le 0$ (line 16). Next, we dequeue editions from $Q$ to obtain a new candidate $R'$, applying $e_k$ to $\hat{S}$ (lines 18-19). These two steps are repeated while the new candidate $R'$ is worse than $\hat{S}$ and $Q$ is not empty. Finally, the algorithm returns $\hat{S}$. It is worth noting that the vector $W$ for the strings in $S$ can be computed using any weighting procedure. In our particular setup, we use the output of Algorithm 1 as follows: $S = \{p_{ir} \in P_i\}$ and $W = \{|P_i|\}$.
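The weighting setup (representatives as the instance set, subset sizes as weights) and the objective it induces can be sketched as below. The sketch assumes that each subset is given as a list whose first element is its representative and that the unit-cost Levenshtein distance is used; `weights_from_partition` and `weighted_sum_of_distances` are hypothetical helper names, not part of the algorithm listings.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def weights_from_partition(partition: list[list[str]]) -> tuple[list[str], list[int]]:
    """Representatives become the instance set; subset sizes become the weights."""
    reps = [group[0] for group in partition]   # assumption: representative first
    weights = [len(group) for group in partition]
    return reps, weights


def weighted_sum_of_distances(cand: str, reps: list[str], weights: list[int]) -> int:
    """Objective minimized over the reduced, weighted instance set."""
    return sum(w * levenshtein(cand, r) for r, w in zip(reps, weights))
```

Evaluating a candidate against the representatives costs one edit distance per subset rather than one per string of the original dataset, which is where the reduced instance set saves computation.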
This algorithm iterates by considering one perturbation at a time until an iteration yields no improvement. Each iteration may consider several different editions. In the worst case, their number is upper-bounded by $O(l \times \Sigma^2)$, where $l$ is the length of the longest string. The experimental evaluation shows that this bound is rather pessimistic and that our heuristic usually needs just a few editions per iteration. This is a crucial difference from the algorithm in [1], which uses more operations per iteration.
For each edition explored during an iteration, the algorithm computes the distance from the new candidate $R'$ to all the elements in $S$ (lines 17-20), which takes $O(N \times dc)$ time, where $dc$ is the time to compute the edit distance (Levenshtein in our experiments). By providing a better ranking, we save on the number of operations explored per iteration and thus on the number of times this distance is computed, which is expensive: in the case of Levenshtein, for example, it is $O(l^2)$. However, to do that, we spend some computation to bound the repercussion (lines 8-14). This takes $O(l \times \Sigma^2)$ time and is usually worth it, as $\sqrt{l} \ge \Sigma$ in most applications.

B. TRIMMING THE LIST OF POSSIBLE EDIT OPERATIONS.
The algorithm presented in [29] generates a goodness index for each edition, taking into account how this edition impacts other editing alternatives affecting the same position. This goodness index is more precise than considering only the frequency of the editions, or the frequency multiplied by the cost, as proposed in [1]. In Fig. 3, line 15 shows that the goodness index sorts the operations; unlike in [1], [29], we propose to trim $Q$, discarding the editing operations that have a negative value of the goodness index (line 16). This modification can reduce the size of $Q$, speeding up the while loop in lines 17-20, because that loop runs until an improvement is achieved or $Q$ is empty.
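The effect of trimming on the inner loop can be illustrated with a small sketch. It assumes that the goodness indices are already available as `(goodness, edit)` pairs, since their computation follows [29] and is not reproduced here; `first_improving_edit`, `apply_edit`, and `cost` are hypothetical names supplied by the caller. Edits whose index does not exceed the threshold are discarded before any candidate is evaluated, following the discard rule described for line 16 of the algorithm.

```python
import heapq
from typing import Any, Callable


def first_improving_edit(
    current: str,
    current_cost: float,
    ranked_edits: list[tuple[float, Any]],
    apply_edit: Callable[[str, Any], str],
    cost: Callable[[str], float],
    threshold: float = 0.0,
) -> tuple[str, float, int]:
    """Test edits in decreasing goodness order; stop at the first improvement.

    Returns the chosen string, its cost, and the number of candidates evaluated.
    """
    # Trimming: keep only edits whose goodness index exceeds the threshold.
    # The counter k breaks ties so that edits themselves are never compared.
    queue = [(-g, k, e) for k, (g, e) in enumerate(ranked_edits) if g > threshold]
    heapq.heapify(queue)
    evaluated = 0
    while queue:
        _, _, edit = heapq.heappop(queue)
        candidate = apply_edit(current, edit)
        evaluated += 1
        candidate_cost = cost(candidate)  # one sum of distances per candidate
        if candidate_cost < current_cost:
            return candidate, candidate_cost, evaluated
    return current, current_cost, evaluated
```

When no edit survives the trimming, the loop body never runs and the iteration ends without computing a single edit distance, which is the saving that trimming targets.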
The basis of the algorithm in Fig. 4 was presented in [29]. The original algorithm considered all strings $S_i \in S$ with the same relevance. We have made the necessary modifications so that the strings $S_i \in S$ can have different weights. It is important to notice that, unlike in [1], [14], [15], [29], for the calculation of the median of $S$ we use an approach in which each $S_i$ is weighted according to the size of the subset $P_i$ that it represents. The idea of weighting strings has been studied in [28], but only when computing the median of two strings.
Combining the modifications described in Section III-A and in Section III-B, we have four different algorithms, labeled Median-all, HSP-all, Median-trimmed, and HSP-trimmed. The first part of each algorithm's name refers to the initial string: Median for those that start from the set median and HSP for those that apply the modifications described in Section III-A. The second part of the name refers to how we deal with the operation list: all for those that test the whole operation list and trimmed for those that apply the modifications described in Section III-B.

IV. EXPERIMENTAL RESULTS
Our experimental evaluation uses different alphabets, set sizes, and string lengths. In Eq. 1, we show the ratio used to evaluate the quality of the obtained median string $\hat{S}$, where $S_M$ is the set median. In addition, we compare the number of edit distances required by the algorithms to converge. We also took into consideration the execution time of each experiment. As expected, in all the experiments, time was proportional to the number of edit distances calculated.
The second dataset considers 23 symbols representing different amino acids. We selected 175 samples of orthologs of the insulin protein, representing 70 species, obtained from the eggNOG online application (footnote 1), with lengths ranging in [100, 300] and an average of 150. With them, we prepared 26 different sets in total: 5 different sets for each of the sizes {20, 40, 80, 120, 160} strings, and 1 set of size 175 with all the data available. We use the well-known BLOSUM62 [18] cost function.
We also generated a third dataset containing synthetic Freeman chain codes as in [1], [16], [29]. With these data, we aimed to study how the algorithms scale for sets with sizes of {45, 90, 180, 270, 360} and average string lengths of {20, 40, 80, 160, 320} symbols. The length variation among strings in the same set was 10%. We generated 5 different sets for each possible combination of set size and string length, making a total of 125 independent sets.
We designed experiments to compare our proposal with the best algorithm described in [29], labeled as Median-all, and in [1], labeled Frequency and Frequency*Cost, in terms of edit distances calculated, average distance to median string and time.
In Fig. 5, we can differentiate three groups of algorithms. At the top, we see the algorithms labeled Frequency and Frequency*Cost, presented in [1]. These two algorithms compute the highest number of edit distances, and this number increases very rapidly as the size of the dataset grows. In the central region, we see the algorithm labeled Median-all, presented in [29], and a variant that takes as its starting point the one described in Section III, labeled HSP-all. These two algorithms perform better than those mentioned above. As the dataset size grows, the differences between them become more evident. Finally, in the bottom part of Fig. 5, the algorithms Median-trimmed and HSP-trimmed are shown. These two algorithms are the ones that perform best.
In Fig. 6, the quality of the solution achieved by the same algorithms is studied. As can be seen, the quality of the two algorithms computing fewer edit distances is slightly worse. In Fig. 6, we can differentiate two groups of algorithms; at the top, we see the algorithms labeled Median-trimmed and HSP-trimmed. Except for these two, the algorithms behave similarly regarding the quality of the obtained median string. Finally, as expected, Fig. 7 shows a behavior similar to that of Fig. 5, i.e., the running time of each method is proportional to the number of edit distances that it computes.
Next, we examine in detail the effect of each modification when applied independently. From Fig. 8 to Fig. 16, we compare the algorithm that takes as its starting point the one proposed in Section III, labeled HSP-all, with the same algorithm starting from the set median, labeled Median-all. It is essential to clarify that we include the edit distances calculated to obtain the starting point for those algorithms using the HSP test. For all the datasets, the results show that in most cases our proposal requires fewer operations, while, as can be seen in Fig. 11 and Fig. 13, the quality of the median string obtained on the Freeman chain code datasets is equivalent in both cases.
[...] explained in detail in the previous section. We label as Median-trimmed the algorithm that takes as its starting point the median of the set and trims the list of operations when the expected quality of the operation is zero. We label as HSP-trimmed the algorithm that takes as its starting point the one proposed in Section III and trims the list of operations when the expected quality of the operation is zero. In Fig. 8, Fig. 9, Fig. 10, Fig. 14, Fig. 15, and Fig. 16, we notice that trimming can reduce the number of calculated edit distances and thus lead to a decrease in the execution time. However, the quality of the median is slightly worse when trimming, as can be seen in Fig. 11, Fig. 12, and Fig. 13.
Finally, we can see the significant difference, in terms of edit distances calculated and execution time, between the current state of the art, Median-all, and our new proposal, HSP-trimmed. We can also compare the quality of the median string achieved by each of the algorithms. The results show that the loss of quality of the median string in HSP-trimmed is small and is acceptable in cases where a high speed of convergence is required.
Footnote 1: http://eggnogdb.embl.de/#/app/home

V. CONCLUSIONS
A new starting point can be used with satisfactory results in perturbation-based iterative refinement algorithms to compute the median string. We obtain this new starting point from computing the median of a subset of the original dataset. The string subset consists of the Half Space Proximal neighbors of the median string. The above modification implied weighting the elements of the subset depending on the number of instances they represented. We also show that trimming the list of operations improved the stop condition of these algorithms. The above trimming occurs when the expected quality of the operations is less than a threshold, 0 in our case.
The combination of the two heuristics above in our approach produces a more competitive solution than state-of-the-art algorithms. Comparing Median-all and HSP-trimmed, we reduced edit distance computations by 86% on average. Similarly, we decreased execution time by 82% on average. These reductions in execution time and in the number of computed edit distances induced a slight increase of 2% in the average distance to the median string.
PEDRO MIRABAL received a Ph.D. degree in Computer Science from the University of Concepción, Chile in 2019. He is currently Professor in the Department of Informatics Engineering, Faculty of Engineering, at Universidad Católica de Temuco, Chile. His research interests include NLP, data structures, and algorithms.
JOSE ABREU is a Researcher at the University of Alicante, Institute for Computing Research. He has been a member of the Cuban chapter of the International Association of Pattern Recognition and a Full-Time Professor at the University of Matanzas and at the Catholic University of the Most Holy Conception. His research interests cover (i) data-driven solutions in Natural Language Processing and (ii) instance selection and prototype construction algorithms.
DIEGO SECO is an Associate Professor at the Department of Computer Science, University of Concepción, Chile. He obtained his Ph.D. in Computer Science at the University of A Coruña, Spain, in 2009. His research interests include geographic information retrieval, geographic information systems, and compressed data structures and algorithms for textual and geographic data.
ÓSCAR PEDREIRA has M.Sc. and Ph.D. degrees in Computer Science from the University of A Coruña, Spain. He has been an Associate Professor at the same institution since 2008. He is a researcher at the Database Laboratory. His research interests include topics in databases (algorithms for similarity search, data structures and algorithms for graph databases, geographic information systems) and in software engineering (process improvement, testing, MDE, and SPL). He has co-authored many articles published in journals and conferences relevant to the research areas mentioned. He has continuously participated in research projects and in technology and knowledge transfer projects with different companies.
EDGAR CHAVEZ received a Ph.D. degree in computer science in 1999 from Centro de Investigacion en Matematicas (CIMAT). He is a full professor at Centro de Investigacion Cientifica y de Educacion Superior de Ensenada (CICESE), Ensenada, Mexico. He is interested in multimedia information retrieval, similarity search, indexing, and clustering algorithms.