LVMapper: A Large-variance Clone Detector Using Sequencing Alignment Approach

To detect large-variance code clones (i.e. clones with relatively more differences) in large-scale code repositories is difficult because most current tools can only detect almost identical or very similar clones. It will make promotion and changes to some software applications such as bug detection, code completion, software analysis, etc. Recently, CCAligner made an attempt to detect clones with relatively concentrated modifications called large-gap clones. Our contribution is to develop a novel and effective detection approach of large-variance clones to more general cases for not only the concentrated code modifications but also the scattered code modifications. A detector named LVMapper is proposed, borrowing and changing the approach of sequencing alignment in bioinformatics which can find two similar sequences with more differences. The ability of LVMapper was tested on both self-synthetic datasets and real cases, and the results show substantial improvement in detecting large-variance clones compared with other state-of-the-art tools including CCAligner. Furthermore, our new tool also presents good recall and precision for general Type-1, Type-2 and Type-3 clones on the widely used benchmarking dataset, BigCloneBench.


I. INTRODUCTION
Clone code is generated by copying, pasting and modifying code fragments for reuse, which are common operations in software development [1], [2]. In the past, code clones with relatively more modifications (called large-variance code clones) were difficult to be found by existing tools [3] because most of them were suitable for finding the identical or very similar clones. However, in our experimental observation, large-variance code cloning is ubiquitous. Compared with the clones provided by traditional tools, the large-variance code clones have a broader range. Accordingly, they have an important impact on and make changes to some software applications such as bug detection, code completion, software analysis and so on. Recently, a meaningful attempt has been made with CCAligner [3]. It is a large-gap clone detection tool which can detect the code clones with relatively concentrated modifications. Our study focuses on a more general case: to detect large-variance code clones that include not only the clones with concentrated code modifications but also those with scattered code modifications.
Software development usually involves two modes: homologous modification and heterologous development. In homologous modification there is always an original version of code and modifications on the original version generate new clone code. Hence, the similarity in lexical is the inherent characteristic of homologous modification. By contrast, in heterologous development, different programmers develop similar functionalities and these introduce semantic clones, where the lexical similarity is not inherent. Furthermore, large-variance clones of homologous modification are generated by two cases. One of the common cases is that the old version of software are modified and expanded iteratively. Another case is the reuse and modification of open source software. Both cases are very common in software development, therefore it is important to detect the clones with homologous modification. Fig. 1 shows an example of clones between two different versions of project Ant 1.6.5 and 1.10.5. In code B, the code segments of lines 1-2, 6, 8-10, 18-20, 27-28 are the same as those of the lines 1-2, 7, 9-11, 13-15, 18-19 in code A, respectively. Lines 4-5, 7, 14 and 21 in code B are modified from lines 3-5, 8, 12 and 16 in code A. Other lines in code B are new extension part of code A. These modifications in clone codes are scattered and occupy a certain portion of the code.
Existing tools have made attempts but still have more or less limitations in finding Type-3, especially large-variance code clones. For one of the best tools with good performance on Type-3 clone, SourcererCC [4] has to decrease the threshold of similarity to find large-variance clones at the cost of accuracy loss. Another popular detector CCFinderX is good at identifying Type-1 clones and Type-2 clones, but cannot directly support Type-3 clone detection. iClones identifies the Type-3 clones by merging the nearby small Type1-2 clone fragments, but the recall of Type-3 is low due to the simple strategy. NiCad uses a Longest Common Sub-sequence (LCS) algorithm and can tolerate discontinuous subsequences. However, it does not scale and its precision suffers with decreasing thresholds. CCAligner is a good recent attempt in detecting clones with relatively concentrated modifications. It can detect large-gap but misses scenarios where modifications are scattered. Besides, some semantic methods have certain ability to detect variance clones because there is an overlap between  semantic clones and syntactical clones. Deckard [5] builds the characteristic vectors from abstract syntax tree (AST) to detect clones, but suffers from low precision and recall rate. Deep learning methods such as Oreo [6] encode software metrics into semantic vectors and achieve good results, but they mainly focus on semantic clones. For these considerations, we present a tool aimed at detecting large-variance code clones called LVMapper. Our proposed code clone detector that can find clones with more general variance is based on locate-filter-verify method. Its key idea mainly comes from third-generation sequencing alignment method [7]- [9]. In bioinformatics, the third-generation sequencing alignment based on seed-and-extend strategy performs well with sequence difference up to 30%. LVMapper uses small windows of continuous lines (called seeds) with lower costs to locate and filter the candidate pairs of clone codes. In order to verify whether these candidate pairs of codes are cloned, another feature that code clones always have certain proportion of order-preserving code lines is considered. Based on this property, a heuristic algorithm which is more efficient than the Longest Common Subsequence (LCS) algorithm is proposed. Besides, a dynamic threshold that changed with the code size is used for the verification of code clones. It makes LVMapper identify clones with more modifications while guaranteeing certain precision.
To evaluate the large-variance clone detection performance of LVMapper, we carried out experiments on self-synthetic and real datasets. In order to test the capability of finding large-variance clone, we compared our tool's performance with NiCad, SourcererCC and CCAligner on 4 Java and 4 C projects. For the self-synthetic dataset experiment, we generated clones by inserting scattered different lines to source code, and then evaluated the performance of LVMapper and other tools in detecting clones with various modifications. We also used the BigCloneBench [10], [11] to compare and measure the different type clones recall of LVMapper with CCFinderX [12], iClones [13], Deckard [5], NiCad [14], SourcererCC [4] and CCAligner [3]. The experiments show that LVMapper performed the best in detecting large-variance clones and had good recall and precision for general Type-1 to Type-3 clones.
The main contributions of our work are as follows: (1) Goal contribution: CCAligner has advantages in detecting large-gap clones while our work extends the detection approach of large-variance clones to more general cases. It identifies not only the clones with concentrated code modifications but also the clones with the scattered code modifications. We also give a concrete definition of the large-variance clones.
(2) Method contribution: Inspired by the idea of the seedand-extend method in bioinformatics, we develop a novel tool with locate-filter-verify procedure and it is suited to detect clone with large variance. We propose a dynamic threshold to promote the accuracy and recall, and a rapid method to verify, avoiding the time-consuming dynamic programming.
(3) Result contribution: We compared LVMapper with other state-of-the-art detectors on real cases of software projects, self-synthetic programs and the state-of-the-art benchmarks. The results show that LVMapper is much better than the stateof-the-art tools in large-variance clone detection. In addition, our new tool has good recall and precision for general Type-1, Type-2 and Type-3 clones.
The rest sections of the paper are organized as follows. Some terminologies and definitions of code clone are introduced in Section II. Section III provides the details of our detection tool. Section IV presents the results of the experiments to evaluate the detection ability of our approach. In Section V the related work of clone detection is discussed and Section VI describes the limitation. Finally, Section VII concludes the paper and briefly introduces future work.

II. TERMINOLOGIES & DEFINITIONS
Code block is a statement sequence within braces and usually represents a single function. Clone pair is a pair of similar code portions. The minimum clone size is the minimum number of lines or tokens that either of a clone pair should have. The standard minimum clone size is 6 lines or 50 tokens which we also follow in this paper. Four primary clone types are agreed by researchers and the former work [1], [15]: Type-1 (textual similarity) and Type-2 (lexical similarity) clones are syntactically identical code fragments except for variances in white space, layout, comments and variances in identifier names, literal values, white space, layout and comments, respectively. Type-3 (syntactic similarity) clones are code fragments which are similar but have statements added, modified and/or removed with respect to each other. Type-4 (semantic similarity) clones are code fragments that implement the same functionality but are different in syntax.
Type-3 and Type-4 clones are difficult to partition because there is no clear boundary between syntactically similar Type-3 clone and dissimilar Type-4 clone. Hence, BigCloneBench Before giving the definition of large-variance clones, we first define a new similarity of code pairs and then obtain the difference degree of code pairs. In the past, Jaccard similarity [16] is usually used to measure the similarity of code pairs, which is not suitable for code pairs with large difference in size. For example, given code block A and B, assume the size of B is double the size of A and all lines of A appears in B. According to the Jaccard similarity, the similarity of code blocks A and B are only 0.5, which is not practical. In fact, all of A are cloned by B.
Definition 1 Similarity and Difference of code pairs: Given two code blocks A and B consisting of l(A) and l(B) prettyprinted lines, respectively. Let l(A ∩ B) be the number of shared lines in A and B. The harmonic similarity sim(A, B) of A and B, the single-side similarity sim(A|B) that A relative to B, and the single-side similarity sim(B|A) that B relative to A are: The difference diff(A,B) of A and B,and the difference diff (A|B) that A relative to B, and the difference diff (B | A) that B relative to A are defined according to the similarity, respectively: Taken the previous example of code A and code B which is double size of A, the single-side similarity sim(A|B) = 1, which means that A is all in B and it matches the actual situation. The difference degree of large-variance clones is set higher than 0.15 mainly based on two considerations. First, it's difficult for most code clone detection tools or methods to find such clones, even the better tools that are able to detect Type-3 clones cannot find them well. Another consideration is the compatibility with the large-gap clones described in [3], in which the volume difference between blocks A and B is used to specify variance clones. Assuming that A is the block with smaller size and B is the block with larger size, their method is to detect the large-gap clones for the set of blocks λ-difference volume (λ = l(A)/l(B) ≤ 0.7). According to our definitions, the sim(A,B) is at most 0.85, then the diff(A,B) is at least 0.15. Hence, the code clones described in [3] are included in our large-variance clones but ours have a broader scope.

III. METHOD
Inspired by the seed-and-extend approach which is typically used in sequencing alignment from bioinformatics [7], [8], [17], we proposed a locate-filter-verify procedure for clone detection. In bioinformatics, the seeding step uses the subsequences of the query to quickly locate exact match in reference and the extending step extends and refines the candidate positions by a dynamic programming alignment. Our method includes three phases: locate-filter-verify. The first two phases are designed to seek out the candidate clone pairs with low cost and high recall. In the last phase we design a heuristic algorithm to further eliminate the false clone pairs and improve the accuracy. The uses of dynamic threshold, seeds index and avoiding time-consuming dynamic programming are the keys and innovations of our method. Fig. 2 shows the general procedure. The rest of this section will provide detail descriptions of each phase.

A. Lexical Analysis
Lexical analysis for code includes extracting code blocks from source code and tokenizing the code. TXL [18], which is commonly used in previous tools [3], [4], is adapted to extract code blocks from source code files. After obtaining the code blocks, the tokenizing step that mostly based on Flex [19] begins. Identifiers including variables and function names are replaced by the same token 'id' to tolerate Type-2 changes. The extracted code blocks are pretty-printed. The tokens of each line are concatenated into a single token sequence except the white spaces.

B. Seeds Indexing
It is necessary to establish a seed index to speed up the computation for the locating and filtering phases, where the seeds are all of the k-line sliding windows (i.e., code fragments of continuous k lines) for a code block. The seeds are basic units for matching instead of tokens. For example, given a code block with 10 lines and sliding windows size k = 3, the number of all seeds is obviously 8. In LVMapper, these seeds are converted to hash value and the seed is also regarded as the hash value.
Here are the detailed steps of indexing. LVMapper scans all code blocks, collects all seeds and indexes them in a hash table. The key of the element in hash table is seed's hash and the value is a set of corresponding positions (e.g., block number plus line number). Like CCAligner, we use MurmurHash hash function [20] in order to guarantee the efficiency with the low collision rate. For the value in the hash table, we packed the block number and line number into a 64-bit integer.

C. Locating via the Shared Seed
The locating phase is a preliminary selection for possible code clone pairs. In this phase, the goal is to collect as many candidates as possible without losing real clone pairs. Because the standard minimum clone size is 6 lines [10], [15], CCAligner chooses 6-line or longer windows to match the possible clone pairs. The experiments in CCAligner [3] shows that 6 lines with 1 mismatch windows balanced recall with precision. However, the cost of this approach is still considerably high. We use a lower cost and more efficient way to achieve this. Here, the 3-line sliding windows are chosen as seeds to collect the possible clone pairs that share seed(s).
For any code segments of 6-line with 1 mismatch that can be found by CCAligner, in LVMapper, they can also be identified by two non-overlapping 3-line exact windows. The reason is that the 6 lines window in CCAligner can be covered by two non-overlapping 3-line windows. According to pigeonhole principle, one line modification affects at most one 3-line window and the other one remains unchanged. As a result, the identification ability of our method is better than that of CCAligner.
Algorithm 1 lists the steps of clone detection process, in which lines 4-11 belong to the locating phase. To retrieve these seeds, we create a hash table for efficiency. Once the index has been built, the candidates of each code clock can be obtained by utilizing this index. Let the current inquiring code block be A. Every sliding overlapping 3-line window in A is the seed, whose hash value is used to find positions in hash table with the same key (line 9). If the block id of the position, denoted as B, is greater than A, the position will be added to CollisionList (A) the collision list of block A (line 11). This practice eliminates the duplication of detecting clone for the same two blocks with reverse order. When B is the current inquiring block, block A is not considered anymore because the pair of A and B has been considered before. After the last seed of current block is queried, the positions in the CollisionList (A) are sorted according to block id and then according to line id (line 12). Every block B that has the collided seed with A will be further filtered and verified.

D. Filtering via the Common Seeds Number
Through the locating phase, two code blocks that share common seed(s) may be clone pair. Then we take into account the clone possibility of these Collision code blocks. This phase is called the filtering phase and it picks out candidate clone pairs. It gives us two benefits. First, the number of candidate code pairs can be reduced significantly. Therefore the processing time and the false positive rate can also be reduced. Besides, the probability of the candidate pairs being the true clones increase significantly. Our idea of the filtering phase is mainly based on considering the number of shared seeds for two code blocks to measure the possibility of being clone.
The selection of candidate clone pairs depends on the similarity between the two code blocks, and the similarty is calculated by the number of seeds they share. In our method, we adopt the Equation (2) to compute the similarity of a code pair, because the definition of single-side similarity has better ability to discover large variance code clones. As we get the number of shared seeds between two code blocks, the former of Equation (2) can be equivalently changed to: where s is the number of shared seeds, t is total seeds number of B and L is the length in line of B. For any pair of the collision code blocks, the higher the SR A (B) value is, the more likely they are to be a clone pair.  The method of threshold setting for SR A (B) is the key point of our method, and it is also significantly different from other methods. In order to enhance the ability of detecting large variance code clones, here we use a dynamic threshold to filter the code pairs. The function of the threshold is designed as a multi-segment function, corresponding to the length of code blocks A and B. As the length of code blocks A and B increases, this threshold can be reduced. Actually, the reason can be explained by an understandable analogy. For example, given real-life conversations, how to judge whether two different conversations belong to the same topic? We have such a consensus that long conversations with lower rate of common sentences could discuss the same subject while short conversations should have higher rate to judge as discussing the same subject. In the experiments of next section, the filtering threshold is set to about 0.1 for both code blocks with lengths more than 10. Note that SourcererCC uses the fixed ratio of shared tokens to verify clone pairs and it sets the ratio threshold as 0.7 to guarantee the precision.
Lines 13-16 in Algorithm 1 belongs to filtering phase. In practice, once we get the candidate blocks list CollisionList (A) of current inquiring block A, for every candidate block B in the list, LVMapper counts the number of B in CollisionList (A) (line 14). As we mentioned above, we treat each overlapping k-line windows as seed to vote for potential clone blocks, then every position added in the collision list is the block that have the same seed with A. In this case, the number B in CollisionList (A) is the number of votes that B gets from A. If B gets more votes, then it is more likely to be clone code of A. The idea is similar to the idea of seed-and-extend in the sequencing alignment [8]. The threshold for SR A (B) is θ (line 17).

E. Verifying via the Ordered Common Lines
Unlike the first two phases, the last phase (called the verifying phase) further measures the clone possibility of the candidate clone pairs output by the filtering phase from another perspective: if two blocks are large-variance clone, an important feature of them is that the common lines of code in them have order preserving property. Actually, previous tool NiCad [14] used similar idea. It is based on a Longest Common Sub-sequence (LCS) algorithm. Not as complex as NiCad, we design a heuristic simple algorithm for this order preserving property. The idea of the heuristic algorithm is to count the order preserving number of two adjacent code lines in one code block.
Similarly, the similarity of the candidate pair is measured by another characteristic quantity: the rate of ordered common lines. This characteristic quantity OS (A, B) is defined as: where common lines is the ordered common lines of A and B, and min lines(A, B) is the minimum size in line of A and B.
Like the threshold mentioned in the filtering phase, we also use a dynamic threshold for verifying candidate clone pairs. The detailed setting will also be discussed in Section IV.
We implement the heuristic algorithm as follows and lines 17-38 in Algorithm 1 show the steps. For every candidate pair of block A and B survived from the locating and filtering phase, assume block A is smaller than B. We scan from the first line in A to find contiguous lines which also appear in B

IV. EVALUATION
The performance of LVMapper for detecting large-variance clones and general Type-1, Type-2, Type-3 were thoroughly evaluated in real and self-synthetic dataset. We first introduce the parameter setting of seed length and dynamic threshold of different phases. Then the detailed information of different experiments is provided.
A. Parameter Setting 1) Choice of Seed Length: As the seeds play a major role in locating and filtering phases, the choice of seeds length is important and has an impact on the performance of LVMapper. If the seeds are too long, the recall of our method will be affected. In contrast, if the seeds are too short, the effectiveness of the locating and filtering will be eroded. In Section III we analyzed the choice of seed length theoretically. Here we also used experiments to evaluate the performance of detection for different seed lengths k quantificationally. We used the BigCloneBench [10], [11] to evaluate the detection ability of LVMapper for different seed lengths, because it is not only a benchmark for general clones but also contains large-variance clones. Besides, we considered the memory use and execution time of different seed lengths for Linux kernel dataset.
BigCloneBench is a benchmark which contains different types of manually validated clones in the repository IJaDataset-2.0 [21] and it defines clone types by syntactic similarity as described in Section II. The framework BigCloneEval [22] summarizes recall performance for different clone types of clone detectors automatically and it is widely used in previous work [4], [6]. We configured the BigCloneEval with minimum clone size 6 lines and 50 tokens which are consistent with the standard minimum clone size. The seed length of LVMapper was set as 2-line, 3-line and 4-line, with other parameters fixed. The recall was reported by BigCloneEval. And for each parameter, we measured the precision by randomly validating 400 reported clone pairs. Table I shows the detailed results. Because the recall rate of Weakly Type-3&4 is under 1%, we provided the number of detected clones instead, denoted as # of Weakly Type-3&4. As seen from Table I, the recall of Type-1, Type-2 and Very Strongly Type-3 were nearly 100% for all seed length. When the seed length became longer, the recall of type-3 with lower similarity decreased. For seed length of 4-line, the recall of Strongly Type-3 and Moderately Type-3 was under 80% and 20%, respectively. And the number of Weakly Type-3&4 fell to 22998. However, while the recall for seed length of 2-line was the highest, especially in Weakly Type-3&4, the precision declined apparently. The recall and precision for seed length 3-line strike a balance. The number of Weakly Type-3&4 was over 30000 and the precision kept at 88%.
Besides, we took into consideration the memory use and execution time of different seed lengths. The Linux kernel 4.18 was used as the target source code and it has 25782 files with 12964738 lines of code (LOC) measured by cloc [23]. As shown in Table II, the execution time of 2-line method was as much as 19 times compared to the execution time of 3-line method and the memory use increased about 100MB. The configuration of seed length 3-line had the least memory use and medium execution time. The execution time of 4-line method was the shortest, but it had more memory requirement. Note that the configuration of seed length 3-line has the least memory use. The reason is that for the method using 4-line, larger window space allows greater variation of seed, resulting in greater hash index space. For the method using 2-line, more position values (i.e., the block id and line id corresponding to the seeds) occupy the storage space. Taken together, the seed length of 3-line balanced not only the recall and precision, but also the space and time. Therefore, we selected 3-line as our default configuration. 2) Threshold for Filter and Verification: In filtering and verifying phases, we use dynamic threshold for judgment of results. The threshold is defined according to the block size in order to be better adapted to code clone judgment.
For θ, the threshold of SR A (B) in the filtering phase, it is defined as: where L is the size in length of the collided block B. We set θ = 0.5 when B has 6 lines and we set θ = 0.1 when the size of B is greater than 10 lines. When the code size is small, there are plenty of statements that have similar forms. So LVMapper filters the smaller candidate block with stricter standards. The use of the dynamic threshold utilizes the characteristic of source code. Moreover, for the ratio of ordered common sequences in the verifying phase, assume the smaller block of the candidate pair is block A. We adapt a more-refined piecewise function to define the threshold δ, which is used in the verifying phase, according to A.size l: In Equation (8), g(l) = -α·l+β relies on size of A. In our implementation, we set α = 0.025, β = 0.8 empirically. For the smaller blocks whose length is smaller than 10 lines, the ratio of minimum matching continuous lines is 0.55, because in our observation the precision will be slashed when δ < 0.55. For medium size blocks with length from 10 to 30 lines, the ratio of ordered common sequences linearly decreases with the smaller blocks length. As big blocks are more likely to be modified in code clone, the lower limit of δ is 0.3 which allows large-variance for big code blocks and ensures certain accuracy.

B. Large-variance Clone Detection
To test the large-variance clone detection ability, we first compared LVMapper with others in eight open source projects dataset. Then we tried to construct synthetic dataset by inserting different number of lines to further evaluate the detecting ability for different variance proportions.
1) Empirical Test: Here we evaluated the large-variance clone detection ability and studied the existence and pervasiveness of large-variance clones. Considering that CCAligner is a good large-gap clone detection method in a very recent study [3], for all the methods in the experiments, we calculate the number of reported clone pairs that satisfied the setting of volume difference λ ≤ 0.7. So we can directly compare with their experimental results for fairness. To validate the FP (false positive number), for each projects, we randomly selected 100 samples from our results to validate if they are true clone pairs or not and calculated the false positive rate. Then we used the false positive rate to estimate the FP.
In order to compare the detection ability of LVMapper with the state-of-the-art tools, we selected the best two clone detection tools for Type-3 and large-gap clone detection SourcererCC and CCAligner. The results data of SourcererCC and CCAligner were taken from that study straightforwardly. We did not provide the result of NiCad here, because NiCad detected almost none of clones with largely different sizes or variances. For all the experiments using these 8 projects, we considered the clones with minimum length of 10 lines, which is consistent with that of the experiments in paper of CCAligner.
The detecting number of large-variance clones (shorted as LV) in 8 projects are shown in Table III. The number of large variance clones detected by LVMapper was markedly more than that detected by SourcererCC and CCAligner. In project JDK1.2.2, the large-variance clones reported by LVMapper are 970 while CCAligner only reported 15 and SourcererCC only reported 4. CCAligner performed better than SourcererCC at detecting the clones with largely different sizes, which was in fact the target of CCAligner. For all projects we tested on, the precisions (which are 1 − LV F P ) of LVMapper were all over 85%. Among the reported pairs of LVMapper, we found that many clone pairs has scattered modifications and insertions which were missed by CCAligner. We summarized the number of different types clones and the proportion of LV clones detected by LVMapper in Table IV. We classified the clone pairs reported by LVMapper to Type-1&Type-2, Type-3 and LV clones. As we can see from Table IV, the majority of reported pairs belong to the clones of Type-3. The proportion of the large-variance clones LVMapper reported ranges from 18% to 62% in these projects. There is a high proportion of large-variance clones in open source projects and they should not be overlooked. 2) Large-variance Clone Injection Evaluation: Considering that the real datasets in previous experiments may be unbalanced in data distribution, we further designed the experiment using self-synthetic data to simulate clones with various proportions of insertions and to measure the impacts on the clone detection tools. Given the source code fragment, we tried to insert scattered lines in different quantities and tested the largevariance detection ability for the clone of the file after insertion with the original one.
To evaluate the detection ability of detectors for largevariance clones, we used 200 original code fragments from open source project jdk 1.8.0 as target code fragments. About one third of them have 15 to 20 lines, one third have 20 to 25 lines, and the remaining part of them have 25 to 30 lines. The number of inserted lines ranges from 1 to 20 lines. For each number of inserting lines, we generated 200 synthetic clones and tested if the tools can detect the clone pairs. We evaluated the state-of-the-art tools CCAligner, SourcererCC, NiCad, iClones with their default configuration. Fig. 3 shows the recall of tools detecting clones for different numbers of inserting lines.
From Fig. 3, we see that when the number of inserting line is 1, all of the detection tools except iClones had the recall of over 95%. The recall fell down with the number of inserting lines increasing. When the inserting line was more than 3, SourcererCC, CCAligner and NiCad descended faster while the performance of LVMapper descended slowly and it maintained the highest recall. For number of inserting lines were smaller than 11 NiCad performed better than CCAligner, while CCAligner showed its advantage when the number of inserting lines continued to increases. At the same time, iClones could hardly detect clones when number of inserting lines was more than 7. After insertion of 16 or more lines, the recall of all the other detection tools was lower than 30% while the recall of LVMapper still remained at above 80%. In this experiment, NiCad showed good performance when the clone pairs have small variance. CCAligner had the advantage in detecting part of the clones that has relatively concentrated gaps. LVMapper performed the best in both small insertions cases and large-variance cases.

C. General Clone Detection
Apart from the evaluation of detection ability for largevariance clones above, we also compared the detection performance of LVMapper for general clones (i.e. from Type1 to Type3) with other clone detectors. We used BigCloneEval [22] to test the recall of tools on BigCloneBench [10]. The configuration of NiCad was minimum length 6 lines, similarity threshold 70%, blind renaming and literal abstraction. We used the default configuration of LVMapper, which has seed length of 3-line, and the threshold is described previously. The configuration of SourcererCC was minimum one token and similarity threshold 70%. We set CCAligner with minimum clone size of 6 lines, window size q = 6, edit distance e = 1, and similarity threshold of 60%. The result of Deckard [5], iClones [13] and CCFinderX [12] were from [4]. The number of Weakly Type-3&4 of Deckard was estimated by the recall rate in [4]. As iClones and CCFinderX did not perform well in detecting Moderately Type-3 clones, we did not run them for the performance of Weakly Type-3&4 (denoted as "-" in Table V). We evaluated the clone pairs in BigCloneEval with the setting of considering minimum clone size prettyprinted 6 source lines and minimum clone size 50 tokens. For the measurement of precision, as a common practice in the researches [3], [4], [6], we randomly picked 400 pairs from the reported clones of each tool and manually validated the true clone pairs.
The results for general clone detection performance in BigCloneBench listed in Table V has two parts: the last line is the precision and the rest are the recall. The recall of LVMapper for Type-1, Type-2 and Very Strongly Type-3 were 100% or nearly 100%. For Strongly Type-3, NiCad performed the best with recall of 95% and LVMapper was the second with recall of 81%. With the decreasing in the similarity of the clone pairs, more variance exists within clone pairs. LVMapper had the best performance in Moderately Type-3 with 10% higher than the second best. The number of Weakly Type-3&4 clones LVMapper detected was more than double by that of CCAligner. Although Deckard detected the most pairs in Weakly Type-3&4, it had poor recall for other types of clones and the precision is only 34.8%. LVMapper kept its precision at about 88% while showing excellent detecting capability. Compared with the state-of-the-art clone detectors, LVMapper has good recall and precision for all types of clones.

D. Comparison with semantic method
In the experiments above, the detection ability of LVMapper and other non-semantic tools were tested thoroughly. In order to compare the clone detection ability of LVMapper with semantic method and analyze the difference between them, here we compared LVMapper with the latest machine learning based tool Oreo [6]. Because Oreo only supported clone detection with Java code, we used the same Java projects mentioned in Section IV-B. Oreo was executed with the default configuration. Table VI shows the LV (number of large-variance clones) detected by LVMapper and Oreo for the Java projects and O/L is the ratio of LV that Oreo could detect among the LV detected by LVMapper. The results shows that the largevariance clones detected by Oreo are only a small part of that detected by LVMapper. The main reason is that the semantic  Type-1  100  100  100  100  60  100  100  Type-2  99  99  100  98  58  82  93  Very Strongly Type-3  98  97  100  93  62  82  62  Strongly Type-3  81  70  95  61  31  24  15  Moderately Type-3  20  10  1  5  clone detector focuses on detecting clones that are almost the same or very similar in semantic. As the semantics of code with more modifications may be changed, the large-variance clones are difficult to be detected by semantic methods.

E. Scalability
To test the scalability of LVMapper, we selected 1M LOC, 10M LOC, 20M LOC and 30M LOC from the inter-project Java repository IJaDataset-1.0 [24] as the target files to detect clones. We used a standard desktop with a 3.5GHz quadcore i7-4770k CPU and 24GB of memory. As CCAligner and SourcererCC had relative good scalability in a recent study [3], we compared the execution time of LVMapper with CCAligner and SourcererCC. The minimum lines was set as 6 for LVMapper and CCAligner, and the minimum tokens was set as 50 for SourcererCC.
The execution time across different scales of datasets are shown in Table VII. SourcererCC had less execution time for 1M LOC to 30M LOC and it ran faster than LVMapper on 30M LOC. CCAligner scaled to 10M LOC and it failed for the 20M LOC and 30M LOC inputs with an out of memory error (denoted as "-" in Table VII). LVMapper was the fastest for 1M LOC, 10M LOC and 20M LOC. For 30M LOC, LVMapper was slightly slower than SourcererCC but they were of the same order of magnitude. Although SourcererCC has good scalability but the large-variance detection ability of SourcererCC is limited.

V. RELATED WORK
There are many code clone detection tools proposed in the literature. More descriptions of these tools and methods can be found in [1], [2], [15], [25]- [30]. At present, the code clone detection of Type-3 is still a difficult task, especially for large-variance code clones. According to the types of clone similarity, the clone detection methods can be divided into two categories. One is the non-semantic method, and the other is the semantic method. Our method belongs to the former.

A. Non-semantic Methods
These methods or tools determine whether the code pairs are clones or not base on the similarity of code words and code sentences. These clone detection methods mainly include the text based, the token based, the tree and graph based and the metrics based methods. Among these methods, some researchers classified the latter two as the semantic method.
For the text based tools [14], [31], [32], two code blocks are compared in the form of text or strings. Johnson [31] proposed a fingerprinting technique to identify similar source code and to speed up processing speed. Ducasse [32] developed a line based comparison detection tool. NiCad [14] is based on a two phases process, viz., identification of potential clones and code comparison using longest common subsequences. Compared with LVMapper, it adopts similar locating and verifying strategy. NiCad can detect Type-3 clones, but did not perform well in the test of detection ability for large-variance clones as shown in Fig. 3.
For the token based tools [4], [13], [33], tokens are firstly extracted from the source code by lexical analysis, and it is better than simple keyword matching since it tolerates different identifiers. CCFinder [33] is a popular tool based on token, but it does not support Type-3 clone detection. In their work, they used suffix tree to find identical subsequences and increase the threshold to filter small clones. Essentially, these operations are equivalent to the indexing and filtering technology in LVMapper. iClones [13] and SourcererCC [4] are also influential representatives of such tools. Göde and Koschke [13] developed the incremental tool iClones by merging neighboring Type-1/Type-2 clones to big clones or Type-3 clones. However, iClones can only detect Type-3 clones with small variance. Sajnani [4] developed a fast clone detection tool SourcererCC which uses tokens composition to verify clones, but it is constrained to the identification ability of token granularity. CCAligner [3] has good performance in detecting clones with relatively concentrated modifications but it misses scenarios where modifications are scattered.
For the tree and graph based tools [5], [34]- [37], abstract syntax tree (AST) and program dependency graph (PDG) are frequently used as the representation of source code. Yang [34] and Deckard [5] proposed AST approaches for finding the syntactic differences between two programs. Duplix [35] and PDG-DUP [36] are PDG-based tools which use program slicing to find isomorphic subgraphs. These tree and graph based tools suffer from large execution times and poor scalabilities. To this end, CCSharp [37] improves the time performance and accuracy of the PDG-based method, but it still cannot achieve good performance in large scale dataset. Besides, these tools will fail to detect large-variance clones since structure of tree and graph may be changed during the extension and modification of the code.
For the metrics based tools [38]- [40], some metrics and characteristic features for tree and graph of source code can be used for code clone detection. Both Mayrand [38] and Balazinska [39] extracted metrics from an AST representation of source code and used the metrics for clone identification. Patenaude [40] used the metrics of source code that can be divided into five categories, viz., classes, coupling, methods, hierarchical structure and clones. These methods extract some features from tree or graph or source code to verify the semantic similarity of two code blocks. They have similar limitations to that of the tree and graph based tools for largevariance clones.

B. Semantic Methods
Apart from the tree and graph based and the metrics based tools mentioned above, these kind of clone detection tools include the semantic space mapping based tools [41], the software behavior based tools [42]- [44] and so on. Substantially, the tools based on semantics adopt semantic abstraction or modeling for source code rather than abstraction of lexical and syntactic similarity. Due to overlap of semantic clones and large-variance clones, however, the methods based on semantics can also find a small part of large-variance clones shown as Table VI. Machine learning is always an effective way to deal with complex problems, including code clone detection especially for semantic clone. White [45] presented an unsupervised deep learning approach to detect clones, which can automatically learn discriminating features of source code. Wei [46] proposed a method to detect clones by learning representations and Hamming distance of code fragments. Zhao [47] encoded code control flow and data flow into a semantic matrix for detecting semantic clones. Recently, Saini [6] put forwarded a machine learning based method called Oreo, which can find the code clones in the overlap between syntactic and semantic zone. Oreo used the clones that are almost identical or very similar to train the deep learning model and the large-variance clone detection ability of Oreo is limited. Machine learning methods always face the issues of efficiency and dependency on the initial training data. The experiment in Section IV-D shows that the machine learning method could not detect the large-variance clones well, which means the modifications between clones may change the semantics of code.

VI. LIMITATION
Our tool is aimed at detecting the large-variance clones with homologous modification. As shown in Fig. 3 for the large-variance clone injection evaluation, the large-variance clone detection ability of LVMapper is significantly better than others. Note that this is the experiment of large variance clone detection just for syntactic similarity and our method did not detect semantic clones. In the BigCloneBench experiment shown as Table V, our tools have made better progress in Moderately Type-3 and Weakly Type-3&4, but it has not yet reached a satisfactory level. The reason is probably that Moderately Type-3 and Weakly Type-3&4 clones are more of semantic clones that are generated by heterologous development.
The scalability of LVMapper on larger scale dataset needs to be tested. LVMapper on 30M LOC shows satisfactory performance (3h) in the evaluation. It is meaningful to test it on over 100M LOC for validating its scalability further.

VII. CONCLUSION & FUTURE WORK
The large-variance code clones are generated by homologous modification and can be used in software development and other applications. Our experiments found that these clones were widespread in Type-3 clones, even in some datasets up to half or more. And the large-variance code clone changes the past clone detection methods that focus on finding almost identical or very similar code pairs. Therefore, The research on large-variance code clone is important and meaningful. In this paper, we proposed a novel concrete definition and a detector LVMapper borrowing from the idea of sequencing alignment in bioinformatics for large-variance code clones. Effective and innovative technologies such as dynamic threshold, avoiding time-consuming dynamic programming and seeds index are designed in our method. A series of testing on real cases of software projects, self-synthetic programs and the state-of-the-art benchmarks showed the large-variance clone detection performances of LVMapper are much better than the other state-of-the-art tools, and it has good recall and precision for general Type-1 to Type-3 clones. Furthermore, we will make LVMapper more scalable for the clone detection on larger scale datasets. And it will be important work to do research on software engineering applications such as code recommendation and completion, refactoring and bug propagation for large-variance clones in the future.