A Survey of Software Clone Detection From Security Perspective

In software engineering, two code fragments that are identical or closely similar with minor modifications, typically as a result of copy-and-paste, are called software/code clones. Code clones complicate software maintenance and debugging because identifying every copy of a compromised code fragment in other locations is time-consuming. Researchers have studied code clone detection for a long time, but the discussion has mainly focused on software engineering management and system maintenance. Another significant issue is that code cloning gives attackers an easy way to inject malicious code. A thorough survey of code clone identification/detection from the security perspective is therefore needed to provide a comprehensive review of existing related works and to propose potential future research directions. This paper addresses that need. We review and introduce existing security-related works following three different classifications and various comparison criteria. We then discuss three further research directions: (i) deep learning-based code clone vulnerability detection, (ii) vulnerable code clone detection for 5G-Internet of Things devices, and (iii) real-time detection methods for detecting clone attacks more efficiently. These directions are more advanced and adaptive to technological development than current technologies, and still leave ample room for future studies.


I. INTRODUCTION
In the field of software development, programmers often copy and paste a piece of source code directly from another source code fragment, sometimes with minor modifications, so that the two fragments look similar or even identical. This is called software/code cloning [1], [2]; some researchers also call it code duplication [3]-[5]. Many reasons exist for code cloning; the main reason is that code clones can help programmers to finish their tasks more quickly. This behavior, however, causes programming and maintenance issues. For instance, if a bug is found in a cloned code fragment of a software system, the programmer has to locate every copy of the bug and fix it, which increases software maintenance difficulties [1].
Furthermore, in terms of software system security, code clones could lead to vulnerability propagation if a vulnerable code fragment is cloned [6]. Even though software programmers are trying to write secure source code and minimize vulnerabilities in the source code when developing their systems [7], code clone behavior inevitably occurs during the software programming process and propagates system vulnerabilities [8], [9].
The associate editor coordinating the review of this manuscript and approving it for publication was Mansoor Ahmed.
Code clone vulnerability detection has been studied intensively. Jiang et al. [10] presented a scalable and accurate approach for detecting code clones on the basis of identifying similar subtrees. Yamaguchi et al. [11] proposed a method called Chucky to statically taint source code, which can identify anomalous or missing conditions linked to security-critical objects. Farhadi et al. [12] presented a malicious code clone detection technique for binary code based on token normalization levels. Some researchers have also applied more advanced and efficient technologies to improve the efficiency of vulnerability detection systems. Li et al. [13] proposed a deep learning-based system for source code vulnerability detection called VulDeePecker. For applications based on code clone detection, Gao et al. [14] applied an approach based on binary vulnerability search for cross-platform Internet of Things (IoT) devices. Hum et al. [15] proposed a system based on code evolution analysis and a clone detection technique to indicate cryptocurrencies that might be vulnerable.
In this paper, we aim to provide a comprehensive review of code clone detection methods based on vulnerable code analysis. We compare previous studies following several different analysis methods and detection techniques, and then discuss open questions and future research trends in this field. Our contributions are summarized as follows:
• We illustrate some key terminology, including related definitions, code clone types, and techniques with code samples.
• We comprehensively review and introduce previous security-related studies following three classifications and various comparison criteria.
• We discuss some open questions about deep learning-based code clone vulnerability detection approaches, which are more advanced and adaptive to technological development than current technologies, and about code clone detection for 5G-IoT devices and real-time detection methods that can detect clone attacks more efficiently.
To enable a better understanding of what we discuss in this paper, it is organized to answer the following questions:
1) What is a code clone and why does it make software vulnerable to cyberattacks?
2) What types of vulnerabilities can be detected by code clone detection mechanisms?
3) What types of technologies can detect vulnerable code clones and how do those technologies work?
4) What is the future research direction for vulnerable code clone detection?
The motivation of this paper is to find answers to the above research questions.
This paper focuses on the security perspective of code clone detection. Most studies were selected by searching for the keywords "code clone vulnerability detection" in e-resources, such as Elsevier and Springer, which are authentic and accepted with high impact factors. We also selected research works published in international conferences and journals organized by ACM and IEEE. We selected 50 works from IEEE, 17 works from ACM, 7 works from Springer, 5 works from Elsevier, and 4 works from arXiv preprints. This paper does not discuss in detail studies that do not concern the security issues of code clone detection, such as the general programming and maintenance issues arising from code clones.
The remainder of the paper is organized as follows: In Section II, we provide background information on why code cloning occurs, and introduce several representative studies and definitions of related terminology. In Section III, we thoroughly review existing security-related code clone analysis methods and detection techniques, and compare previous studies. In Section IV, we discuss deep learning-based approaches, 5G-IoT-based code clone detection, and real-time detection research directions. In Section V, we conclude this paper and outline avenues for future studies.

II. BACKGROUND
In this section, we introduce the following:
• general background about why code cloning occurs; that is, why programmers prefer to use copy-paste methods during their programming work;
• issues arising from code cloning behavior; and
• definitions of several code clone-related terms, types of code clones, and code clone detection phases and detection techniques.

A. WHY CODE CLONING OCCURS
A good software engineering project should be developed with thorough and mature programming; however, sometimes, programmers prefer to reuse a code fragment [16] to finish their tasks, even if this is not encouraged. We analyzed several reasons for code cloning.

1) COST AND TIME CONSTRAINTS
The main reason for code cloning is that it can help programmers to finish software development more efficiently by reducing cost and time, particularly when meeting task deadlines [2], [16], [17].

2) LIMITATIONS OF PROGRAMMERS' SKILLS
Some junior, and even senior, programmers may receive specific programming tasks beyond their capabilities. For example, they may lack programming language proficiency or have difficulty in understanding those tasks, which facilitates code reuse [18].

3) USE OF TEMPLATES
Increasingly, code templates are providing more thorough and mature code, algorithms, and frameworks for programmers to help them to finish software development more efficiently. However, programs that use the same template could include identical or closely similar code fragments, which leads to code clones [2], [16].

4) FEAR TO BRING IN NEW IDEAS
Sometimes, new ideas or fresh code may result in a lengthy software development life cycle, or even introduce new errors to existing software [19], [20]. Hence, programmers fear bringing new ideas or fresh code into their existing project [2].

5) ACCIDENTAL CLONING
Sometimes, a programmer writes a piece of code that accidentally matches existing code, which leads to a type of accidental code cloning [16].

B. ISSUES
As a result of the reasons for code cloning mentioned above, whether intentional or unintentional, code clones have led to some issues in software development and maintenance.
• Maintenance cost: Cloning a piece of code in software can increase the post-implementation maintenance effort. For instance, if one cloned code fragment is modified, all other cloned code fragments have to be located to maintain the consistency [21].
• Bug propagation: Cloning a piece of code that includes a bug can propagate the bug to different locations in the software system [2], [16], which also increases the maintenance effort needed to find this bug in all cloned code fragments.
• Vulnerability propagation: If a piece of code that is vulnerable to specific attacks is copied, it will lead to vulnerability propagation across the entire software system.
In this paper, we mainly discuss vulnerability detection approaches in software/code clones.
48158 VOLUME 9, 2021

C. CLONE DETECTION
Researchers have addressed clone issues by providing code clone detection tools or approaches. Baker [22] developed a program called dup, which can locate duplicate or near-duplicate code sections in large software systems. Kamiya et al. [23] proposed a token-based clone detection tool called CCFinder, which extracts code clones in C, C++, Java, COBOL, and other source files. Li et al. [19] proposed a tool called CP-Miner, which can identify cloned code fragments in large software systems using data mining techniques. Roy and Cordy [24] proposed a lightweight application called NiCad for a source code transformation system that can find near-miss clones by applying an efficient text line comparison technique. Several studies have also been conducted on clone detection approaches based on the application level. For Android applications, Crussell et al. [25] presented a detection tool called DNADroid, which can identify cloned applications by computing the similarity between two applications. Chen et al. [26] proposed an approach to measure the similarity between code sections in two applications on the basis of the method level. Similarly, Akram et al. [27] designed DroidCC for detecting cloned Android applications based on the source code level.
Meanwhile, many researchers have conducted excellent surveys on the topic of code clone detection. Rattan et al. [2] provided an extensive systematic literature review of software clones in general and software clone detection in particular, based on reviewing 213 articles from 2,039 articles published in 48 publication resources. Sheneamer and Kalita [1] discussed details of code clones, such as types of clones, detection phases of clones, detection techniques and tools, and challenges faced by clone detection techniques, by analyzing previous related studies. Saini et al. [16] discussed code clone detection and management to help researchers to start quickly on the basic concept of code clones and detection techniques. Ain et al. [3] provided a comprehensive review of the latest code clone detection tools and techniques, and a systematic literature review of 54 studies.
In this paper, we focus on providing a comprehensive literature review of vulnerability detection approaches in code clone areas to help to provide researchers with a clearer direction for future studies.

D. TERMINOLOGIES
We summarize several terms related to code cloning to help readers to obtain a basic understanding of code clones [1], [16], [28].

1) CODE FRAGMENT
A code fragment is a piece of source code, with or without comments, in a software project. It can contain any number of lines, statements, begin-end blocks, methods, or functions needed to run a program. For instance, CODE FRAGMENT 1-5 are five different code fragments.

Code Fragment 1: Original Code Fragment
Data: A string
Result: Count the number of one certain letter in the string

    def countElem(string, elem):
        num = string.count(elem, 0, len(string))
        # comment 1
        print(num)
    stri = "Hello world!"
    sub = "l"
    countElem(stri, sub)
    # comment 2

2) CODE CLONE PAIR
If a code fragment is identical or similar with minor modifications to another code fragment, which means that they are code clones, these two code fragments are called a code clone pair. For example, the pair that consists of CODE FRAGMENT 1 and CODE FRAGMENT 2 is a code clone pair.

3) CLONE CLASS
A clone class refers to a set of code clone pairs (more than two code fragments) related to each other, with the same equivalence relation. CODE FRAGMENT 1-5 could be a clone class.

4) CLONE GRANULARITY
Clone granularity can be regarded as a research or detection level. This means that the detection method can be executed at the level of, for example, functions, classes, blocks, statements, and files. Granularity can be predefined for directional detection or not predefined, as for free granularity clones.

5) PRECISION AND RECALL
Precision and recall are two critical factors for evaluating how accurately a system detects software clones. Precision refers to the percentage of reported clones that are true clones (that is, true positives among all reported results, which also include false positives), and recall refers to the percentage of the actual clones in the software system that are detected.
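As a minimal sketch of how these two metrics are computed, assume that clone pairs are represented as unordered pairs of fragment names. The function name and the sample data below are hypothetical and serve only as an illustration.

```python
def precision_recall(reported, actual):
    """Return (precision, recall) for a collection of reported clone pairs."""
    # Use frozensets so that (a, b) and (b, a) count as the same pair.
    reported = {frozenset(pair) for pair in reported}
    actual = {frozenset(pair) for pair in actual}
    true_positives = len(reported & actual)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical data: the detector reports 4 pairs, 3 of which are real clones,
# while the system actually contains 6 clone pairs.
reported = [("f1", "f2"), ("f1", "f3"), ("f2", "f3"), ("f4", "f9")]
actual = [("f1", "f2"), ("f1", "f3"), ("f2", "f3"),
          ("f1", "f4"), ("f2", "f4"), ("f3", "f4")]

p, r = precision_recall(reported, actual)
print(f"precision={p:.2f} recall={r:.2f}")
```

A detector that reports few but correct pairs scores high on precision and low on recall, which is why the two values are usually reported together.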

E. TYPES OF CLONES
To better understand what type of study belongs to code cloning and analyze the target source code more efficiently, the code clone issue could be classified into two main groups: textual-level clone and semantic-level clone [1], [29], [30]. CODE FRAGMENT 1-5 provide five code fragments (Python) as examples of these two groups of clones. CODE FRAGMENT 1 is an example of the original fragment.

1) TEXTUAL-LEVEL CLONE
This type of clone refers to two code fragments that have almost the same text [31]. Textual-level clones can be further classified into three types.

Code Fragment 2: Type-1 Exact Clone
Data: A string
Result: Count the number of one certain letter in the string

    def countElem(string, elem):
        num = string.count(elem, 0, len(string))  # comment 1
        print(num)
    stri = "Hello world!"
    sub = "l"
    countElem(stri, sub)  # comment 2

Type-1: Exact clone. A code fragment that is an exact copy of the original code fragment except for whitespace, blanks, and comments is regarded as an exact clone. Compared with the original code fragment, the Type-1 code fragment simply modifies the layout of comments and deletes one blank line, so it clearly belongs to the exact clone type.
Type-2: Renamed clone. A code fragment that is similar to the original code fragment except for the names of variables, functions, types, and literals is regarded as a renamed clone. As shown in CODE FRAGMENT 3, compared with the original code fragment, the Type-2 code fragment changes the function's name from 'countElem' to 'num_of_string,' and renames some variables, such as 'string' to 'a' and 'elem' to 'b.'
Type-3: Near miss clone. A code fragment that is almost the same as the original code fragment except for some modifications, such as added or removed statements, and a different use of literals, variables, layout, and comments [1] is regarded as a near miss clone. The Type-3 code fragment belongs to the near miss clone type because of the modification that only replaces variables 'string' and 'elem' with 'a' and 'b,' respectively.

2) SEMANTIC-LEVEL CLONE
The second group of code clones is based on the semantic level, and is called the Type-4 clone.
Type-4: Semantic clone. A code fragment that is similar to the original code fragment in terms of function rather than syntax [32] is referred to as a semantic clone. The Type-4 code fragment uses a 'for loop' to produce the same result that the original achieves using the function 'count,' which makes it a semantic clone.
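The following sketch illustrates the Type-2, Type-3, and Type-4 categories on the counting example. The variants are illustrative reconstructions based on the descriptions above, not the original listings. As assumptions made for this illustration, the functions return the count instead of printing it so that their results can be compared, and the guard statement added in the Type-3 variant is hypothetical.

```python
def count_elem(string, elem):
    # Original: rely on the built-in str.count.
    return string.count(elem, 0, len(string))


def num_of_string(a, b):
    # Type-2 (renamed): identifiers changed, structure untouched.
    return a.count(b, 0, len(a))


def count_elem_near_miss(a, b):
    # Type-3 (near miss): renamed variables plus an added statement.
    if not b:
        return 0
    return a.count(b, 0, len(a))


def count_elem_semantic(string, elem):
    # Type-4 (semantic): same result for a single letter, computed with a loop.
    num = 0
    for ch in string:
        if ch == elem:
            num += 1
    return num


variants = (count_elem, num_of_string, count_elem_near_miss, count_elem_semantic)
results = [f("Hello world!", "l") for f in variants]
print(results)
```

All four variants agree on this input, although only the first two share the same syntactic structure. That gap is what makes Type-3 and Type-4 clones harder to detect with purely textual comparison.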

Code Fragment 4: Type-3 Near Miss Clone
Data: A string
Result: Count the number of one certain letter in the string

    def countElem(string, elem):

Code Fragment 5: Type-4 Semantic Clone
Data: A string
Result: Count the number of one certain letter in the string

    def countElem(string, elem):

Fig. 1 shows the entire life cycle of code clone detection; some researchers prefer to call the process from pre-processing to report clones the code clone detection phases. The code clone detector is the main component of a clone detection system, and is in charge of acquiring copy-pasted or duplicated source code and then processing the major clone detection phases. For instance, Davey et al. [33] provided a comprehensive illustration of the fundamental process of developing SOM-based and DCL-based clone detection tools.

1) PRE-PROCESSING
Pre-processing is the first step of code clone detection [28]. It:
• removes all uninteresting or irrelevant parts of the source code, such as whitespace and comments, to reduce unrelated comparisons and calculation;
• identifies the remaining source code as source units, which are checked for direct clone relations to each other after the irrelevant fragments have been removed [1]; and
• divides source units into smaller comparison units depending on the comparison algorithm.
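The pre-processing step can be sketched for Python source as follows. This is a minimal illustration rather than the procedure of any particular tool: it assumes Python 3.9 or later (for `ast.unparse`), treats each function definition as a source unit, and uses hypothetical helper names.

```python
import ast


def preprocess(source):
    """Drop comments and normalize layout by round-tripping through the AST."""
    return ast.unparse(ast.parse(source))


def source_units(source):
    """Split the source into function-level units keyed by function name."""
    tree = ast.parse(source)
    return {node.name: ast.unparse(node)
            for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)}


original = '''
def countElem(string, elem):
    num = string.count(elem, 0, len(string))
    # comment 1
    print(num)
'''

type1_clone = '''
def countElem(string, elem):
    num = string.count(elem, 0, len(string))  # comment 1

    print(num)
'''

# After pre-processing, the two fragments are textually identical.
print(preprocess(original) == preprocess(type1_clone))
print(source_units(original)["countElem"])
```

Because comments and layout differences disappear in this step, Type-1 clones can afterwards be found by plain text comparison of the resulting units.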

2) CONVERSION
Conversion, also called transformation, is used to convert the source code acquired from the pre-processing step into a corresponding intermediate representation for further comparison [16]. Common types of intermediate representation are tokens, the abstract syntax tree (AST), and the program dependency graph (PDG), which we introduce in detail in Section III.
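As a hedged sketch of the conversion step, the code below turns Python source into an abstract syntax tree and replaces identifiers and literals with placeholders, so that fragments differing only in names compare equal. The class and function names are hypothetical, and the normalization rules are an assumption chosen for illustration.

```python
import ast


class IdentifierNormalizer(ast.NodeTransformer):
    """Rename identifiers in order of first appearance and mask literals."""

    def __init__(self):
        self.names = {}

    def _alias(self, name):
        # The same original name always maps to the same placeholder.
        return self.names.setdefault(name, f"id{len(self.names)}")

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._alias(node.id)
        return node

    def visit_Constant(self, node):
        return ast.Constant(value="LIT")


def normalized_dump(source):
    """Return a string form of the normalized AST for comparison."""
    tree = IdentifierNormalizer().visit(ast.parse(source))
    return ast.dump(tree)


original = '''
def countElem(string, elem):
    num = string.count(elem, 0, len(string))
    print(num)
'''

renamed = '''
def num_of_string(a, b):
    n = a.count(b, 0, len(a))
    print(n)
'''

loop_version = '''
def countElem(string, elem):
    num = 0
    for ch in string:
        if ch == elem:
            num += 1
    print(num)
'''

print(normalized_dump(original) == normalized_dump(renamed))
print(normalized_dump(original) == normalized_dump(loop_version))
```

The renamed fragment matches the original after normalization, whereas the loop-based version does not, because its tree has a different shape.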

3) DETECTION MATCHING
This step compares the source code units with target files using a particular comparison algorithm to identify similar source code fragments. The output of this step is a list of clone pairs or clone classes.
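A minimal sketch of the matching step is shown below. It assumes that each source unit has already been converted into a list of normalized tokens and uses the standard-library `difflib.SequenceMatcher` as the comparison algorithm; the threshold and the sample units are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_clone_pairs(units, threshold=0.8):
    """Return (name_a, name_b, similarity) for every pair at or above threshold."""
    pairs = []
    for (name_a, toks_a), (name_b, toks_b) in combinations(sorted(units.items()), 2):
        ratio = SequenceMatcher(None, toks_a, toks_b).ratio()
        if ratio >= threshold:
            pairs.append((name_a, name_b, ratio))
    return pairs


# Hypothetical normalized token sequences for three source units.
units = {
    "unit_a": "ID = ID . count ( ID , LIT , len ( ID ) )".split(),
    "unit_b": "ID = ID . count ( ID , LIT , len ( ID ) )".split(),
    "unit_c": "return ID + ID".split(),
}

pairs = find_clone_pairs(units)
for name_a, name_b, ratio in pairs:
    print(name_a, name_b, ratio)
```

Comparing every pair of units is quadratic in the number of units, which is one reason why scalable detectors rely on hashing or indexing instead of exhaustive pairwise comparison.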

4) FORMATTING
This step formats the list of clone pairs obtained from the previous step based on the comparison algorithm into a new clone pair list related to the original source code.

5) POST-PROCESSING
Post-processing, also called filtering/manual analysis [34], is not required by all code clone detection systems, and is used to filter out false positives or recover missed clones on the basis of reanalysis by human experts or automated heuristics.

6) REPORT CLONES
Clone results analyzed and confirmed by previous detection phases can be reported to the system for further action, such as correcting or removing the source code.

III. SECURITY-RELATED WORKS
In addition to the maintenance and debugging cost arising from code clone behavior, software vulnerability propagation is another serious issue. Programmers may use source code files, downloaded from websites, that have been heavily modified by attackers, which can help those attackers to infiltrate their target systems easily. Islam et al. [35] showed that the security vulnerabilities found in code clones have a higher severity of security risk than those in non-cloned source code by detecting code clones and vulnerabilities in 8.7 million lines of code over 34 software systems based on quantitative analysis with statistical significance. Karademir et al. [36] conducted an experiment that used the NiCad [37] clone detector to identify JavaScript vulnerabilities in PDF files. Nappa et al. [38] presented a systematic study of the effect of shared/cloned code on vulnerability patching for client-side applications.
In this section, we provide a comprehensive review of recent security-related studies, analyze and discuss their primary purpose, present systems or architectures, and evaluate results from different analysis methods and detection techniques.

A. STATIC ANALYSIS VERSUS DYNAMIC ANALYSIS
The code clone detector shown in Fig. 1 plays an essential role during the entire life cycle of clone detection. Furthermore, analysis methods can be regarded as core functions of clone detectors. Analysis methods of code clone detection can be classified as static analysis and dynamic analysis, in addition to hybrid analysis, which refers to the advanced combination of both.

1) STATIC ANALYSIS
Static analysis refers to analyzing a piece of source code to detect possible defects at an early stage without executing the program. Two types of static code analysis methods exist: one uses a tool that reads and checks the source code automatically to detect possible clones, and the other is performed by a human reviewing the source code, also called code review [39]. The reviewer could be an expert or peer developer who fully understands the source code and manually reviews it to identify any missed clones or false positives.

a: VULNERABILITY DETECTION VERSUS CODE CLONE DETECTION
Static code analysis is typically used for software/source code vulnerability detection. Source code vulnerability detection methods normally rely on code or function similarity comparison between detected and target files, based on normalizing or abstracting the source code into a representation. Code clone vulnerability detection can be regarded as a special type of software vulnerability detection, where the original code fragment is the target code fragment. Detection methods, particularly those for vulnerability identification, can be adopted as references in the scenario of detecting vulnerable code clones. In this section, we discuss both vulnerability detection and code clone detection. Table 1 provides a comparison of several vulnerable code clone detection studies based on static code analysis methods. It compares these studies by illustrating their primary research purposes, detection techniques, and evaluation factors.
i) Source Code Vulnerability Detection: Zhang et al. [40] proposed an approach that uses trace-based security testing methods to detect software vulnerabilities in C programs. They generated a program constraint (PC) and obtained a security constraint (SC) by applying symbolic execution for each hotspot. The judgment condition for a vulnerable hotspot was PC ∧ ¬SC, which means that some input satisfies PC but violates SC.
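The judgment condition, an input that satisfies PC but violates SC, can be illustrated with a small sketch. The original approach relies on symbolic execution and constraint solving; the brute-force enumeration below is only a stand-in for a solver, and the hotspot, both constraints, and all names are hypothetical.

```python
# Hypothetical hotspot: buf[index], where buf has BUF_SIZE elements.
BUF_SIZE = 64


def program_constraint(index):
    # PC: what the path leading to the hotspot allows (assumed for illustration).
    return 0 <= index <= 100


def security_constraint(index):
    # SC: what the hotspot requires in order to be safe.
    return 0 <= index < BUF_SIZE


def find_violation(candidates):
    """Return the first input that satisfies PC but violates SC, or None."""
    for index in candidates:
        if program_constraint(index) and not security_constraint(index):
            return index
    return None


witness = find_violation(range(-10, 200))
print(witness)
```

If no such input exists, the hotspot is considered safe along that path. A found input, by contrast, is a concrete witness that the program permits a value the hotspot cannot handle.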
To enhance the accuracy of vulnerable source code similarity analysis, Zhu et al. [44] proposed a solution that combines the Simhash algorithm and the MD5 matching algorithm. The authors addressed the problem that a traditional hash algorithm cannot capture the degree of difference between similar files, whereas locality-sensitive hashing generates identical or nearly identical fingerprints for similar files. They used the Simhash algorithm to complement the file-level homology analysis algorithm based on MD5 matching.
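The contrast between the two kinds of fingerprint can be sketched as follows. This is a generic Simhash illustration under stated assumptions, not the implementation of Zhu et al.: tokens are obtained by whitespace splitting, MD5 serves as the per-token hash, and the 64-bit fingerprint width and sample strings are hypothetical.

```python
import hashlib


def md5_fingerprint(text):
    """Exact-match fingerprint: any change yields a different digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def simhash(text, bits=64):
    """Locality-sensitive fingerprint built from per-token MD5 hashes."""
    weights = [0] * bits
    for token in text.split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


original = "int n = strlen ( src ) ; memcpy ( dst , src , n ) ;"
edited = "int k = strlen ( src ) ; memcpy ( dst , src , k ) ;"

print(md5_fingerprint(original) == md5_fingerprint(edited))
print(hamming(simhash(original), simhash(edited)))
```

MD5 matching answers only whether two files are identical, whereas the Hamming distance between Simhash fingerprints gives a graded similarity signal, which is why the two can complement each other.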
As mentioned previously, automatic static analysis has some limitations, such as missing checks in the source code. Yamaguchi et al. [11] introduced a method called Chucky that can automatically identify missing checks (for vulnerability discovery) in the source code based on static analysis to help to accelerate manual code review. Their method includes five major steps: (a) extract sources, sinks, conditions, assignments, and API symbols from a function's source code using a robust parser; (b) identify functions that operate in a similar context; (c) determine only those checks associated with a given source or sink; (d) embed a selected function and its neighbors in a vector space using the tainted conditions; and (e) perform anomaly detection for missing checks by identifying large distances from the normality model over the functions. They also suggested potential fixes.
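Steps (d) and (e) can be illustrated with a deliberately simplified sketch. It assumes that each function has already been reduced to the set of conditions it checks, and that the normality model is simply the fraction of neighboring functions that contain each condition. The function name and the sample conditions are hypothetical, and the scoring rule is an assumption for illustration rather than the published algorithm.

```python
def missing_check_score(target, neighbors):
    """Return (score, condition) for the check most neighbors have but target lacks."""
    conditions = sorted(set().union(target, *neighbors))
    best = (0.0, None)
    for cond in conditions:
        # Normality model: fraction of neighbors that perform this check.
        mean = sum(cond in n for n in neighbors) / len(neighbors)
        deviation = mean - (1.0 if cond in target else 0.0)
        if deviation > best[0]:
            best = (deviation, cond)
    return best


# Hypothetical conditions checked by four functions in a similar context.
neighbors = [
    {"len <= MAX", "ptr != NULL"},
    {"len <= MAX", "ptr != NULL"},
    {"len <= MAX"},
    {"len <= MAX", "fd >= 0"},
]
target = {"ptr != NULL"}

score, cond = missing_check_score(target, neighbors)
print(score, cond)
```

A high score means that the target function omits a check that its neighbors perform almost universally, which marks it as a candidate for manual review.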
ii) Vulnerable Code Clone Detection: Jang et al. [41] proposed a detection system called ReDeBug that focuses on detecting unpatched source code flaws from code cloning. Unpatched code clones refer to buggy code that is cloned by programmers but missed or left unpatched when patches to source files are debugged and installed. Compared with previous detection techniques, ReDeBug does not focus on the number of detected clones but on scalability across the entire operating system. ReDeBug operates as a language-agnostic system that identifies sequences of known vulnerable code fragments, which are extracted and normalized from the diff files of patches, to obtain the unpatched code clone list.
Li et al. [42] proposed a software vulnerability detection system by applying a backward trace analysis approach and symbolic execution method. Their system considers only vulnerability-related paths to mitigate the path exploration problem. They implemented this system using backward tracing of sensitive data used in a detected hotspot. They then used a data flow tree to recover the program's execution paths, which helped them to focus only on sensitive related data. Like Zhang et al. [40], they also applied PC and SC mechanisms to verify existing vulnerabilities. They also proposed a software vulnerability discovery mechanism using code clone verification (CLORIFI) [6], which can discover vulnerabilities in real-world programs in a scalable manner.
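The idea of matching normalized sequences taken from a patch against target code can be sketched as follows. This is a simplified illustration and not the ReDeBug implementation: it considers only the lines that a unified diff removes, uses a hypothetical window size of two lines, and compares windows with plain Python sets. The patch and both target functions are invented for the example.

```python
import re


def normalize(line):
    """Remove C-style line comments and all whitespace, then lowercase."""
    line = re.sub(r"//.*", "", line)
    return re.sub(r"\s+", "", line).lower()


def windows(lines, n):
    """Return the set of n-line sliding windows over the normalized lines."""
    norm = [s for s in map(normalize, lines) if s]
    return {tuple(norm[i:i + n]) for i in range(len(norm) - n + 1)}


def removed_lines(diff_text):
    """Lines deleted by a unified diff ('-' prefix, excluding the '---' header)."""
    return [line[1:] for line in diff_text.splitlines()
            if line.startswith("-") and not line.startswith("---")]


def is_unpatched(diff_text, target_source, n=2):
    """True if every vulnerable window from the patch still occurs in the target."""
    vulnerable = windows(removed_lines(diff_text), n)
    return bool(vulnerable) and vulnerable <= windows(target_source.splitlines(), n)


# Hypothetical security patch and two cloned functions.
patch = "\n".join([
    "--- a/copy.c",
    "+++ b/copy.c",
    "@@ -1,4 +1,4 @@",
    " void copy(char *dst, char *src) {",
    "-    int n = strlen(src);",
    "-    memcpy(dst, src, n);",
    "+    size_t n = strnlen(src, MAX_LEN);",
    "+    memcpy(dst, src, n);",
    " }",
])
unpatched_clone = "\n".join([
    "void dup(char *dst, char *src) {",
    "    int n = strlen(src);  // length",
    "    memcpy(dst, src, n);",
    "}",
])
patched_clone = "\n".join([
    "void dup(char *dst, char *src) {",
    "    size_t n = strnlen(src, MAX_LEN);",
    "    memcpy(dst, src, n);",
    "}",
])

print(is_unpatched(patch, unpatched_clone))
print(is_unpatched(patch, patched_clone))
```

Because the comparison works on normalized text lines rather than on a parsed program, the same procedure applies to any language, at the cost of missing clones whose identifiers were renamed.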
Kim et al. [43] proposed a scalable approach called VUDDY for code clone vulnerability detection. Its extreme scalability is achieved by leveraging function-level granularity and a length-filtering technique to reduce the number of signature comparisons. Their approach was divided into two main sections: pre-processing and clone detection [47].
• The pre-processing section includes retrieving functions from a given program using a robust parser, abstracting the source code by replacing it with symbols, normalizing the code body by removing unnecessary parts, and generating fingerprint dictionaries for the next detection process.
• The detection section works by comparing the fingerprint dictionary of vulnerabilities with the fingerprint dictionary of target programs by applying key lookup and hash lookup algorithms.
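The two sections above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the VUDDY implementation: abstraction is limited to replacing formal parameter names with the symbol FPARAM, the fingerprint is taken to be the pair of body length and MD5 digest, and the dictionary is keyed by length so that the length acts as a filter before the hash lookup. All names and sample functions are hypothetical.

```python
import hashlib
import re


def fingerprint(body, params=()):
    """Return (length, md5) of an abstracted and normalized function body."""
    for p in params:
        body = re.sub(rf"\b{re.escape(p)}\b", "FPARAM", body)  # abstraction
    body = re.sub(r"//.*", "", body)          # drop line comments
    body = re.sub(r"\s+", "", body).lower()   # drop whitespace, fold case
    return len(body), hashlib.md5(body.encode("utf-8")).hexdigest()


def build_dictionary(functions):
    """Map body length -> set of digests (the fingerprint dictionary)."""
    table = {}
    for body, params in functions:
        length, digest = fingerprint(body, params)
        table.setdefault(length, set()).add(digest)
    return table


def is_vulnerable_clone(body, params, vuln_table):
    """Key lookup on the length first, then hash lookup within that bucket."""
    length, digest = fingerprint(body, params)
    return digest in vuln_table.get(length, set())


# Hypothetical vulnerable function and two target functions.
vulnerable = [("int n = strlen(src); memcpy(dst, src, n);", ("dst", "src"))]
vuln_table = build_dictionary(vulnerable)

clone_body = "int n = strlen(inp);  // len\n memcpy(out, inp, n);"
fixed_body = "size_t n = strnlen(inp, MAX_LEN); memcpy(out, inp, n);"

print(is_vulnerable_clone(clone_body, ("out", "inp"), vuln_table))
print(is_vulnerable_clone(fixed_body, ("out", "inp"), vuln_table))
```

Since both lookups are dictionary and set operations, the cost of checking one function does not grow with the number of known vulnerabilities, which is the intuition behind the scalability claim.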
Bowman and Huang [45] used software source code properties to implement a more robust vulnerable code clone detection system called VGRAPH. Their system aims to identify vulnerable code modification and all types of clone attacks by comparing the code property relationships between three graph-based (code property graph) components extracted from the contextual code, vulnerable code, and patched code. They called it a Triplet Match. To evaluate their detection technique, Bowman and Huang also compared VGRAPH with four state-of-the-art vulnerability detection techniques, that is, FlawFinder, RATS, VUDDY [43], and ReDeBug [41], in accordance with the true positive, false positive, false negative, precision, recall, and F1 values.
Another scenario that may lead to missed clones involves the dynamic arguments of source code functions. Normally, static code analysis focuses on the static arguments of source code functions, and the dynamic arguments passed to those functions while the system is running are ignored. Mishra and Polychronakis [46] recently presented a compiler-level defense approach called Saffire against code clone/reuse attacks. Saffire performs static code analysis by eliminating the static arguments and restricting the acceptable dynamic values of arguments (user input, file address, and system status) during system runtime. This approach applies a narrow-scope form of data flow integrity to specialize functions with a restricted interface.
Although static code analysis is efficient in the early stage of the code clone detection life cycle, there are some inevitable limitations, such as time consumption, personnel training, and vulnerabilities introduced during program runtime. Goseva-Popstojanova and Perhinschi [48] evaluated three widely used commercial static code analysis tools to detect security vulnerabilities based on C/C++ and Java programs. Their experiment showed that a certain number of vulnerabilities were missed by all three tools. Furthermore, they did not provide any assurance of software product security and required further manual effort to classify reported warnings. Hence, dynamic analysis is needed for the late stages, particularly unit testing.

2) DYNAMIC ANALYSIS
In contrast to static analysis, dynamic analysis is performed by executing the program with real-time data to detect target system cloning issues [39]. Dynamic analysis can proceed on virtual machines, or even real processors, by monitoring the system's behavior while the system is running. This type of analysis method helps to detect vulnerabilities introduced during the entire system life cycle, particularly after static code analysis.
A critical role of dynamic analysis is to detect vulnerabilities introduced at runtime, so that clones are not missed during the entire system life cycle. It is not easy to provide an explicit definition of dynamic analysis. Some researchers analyze application similarity at the code/method/function level, but we classify this kind of analysis as dynamic analysis after the implementation phase. In Table 2, we summarize some dynamic analysis studies for clone attack detection based on various working environments with corresponding attack methods, technologies, and evaluations.

a: SENSOR NETWORKS
Sensor networks provide a vulnerable environment in which adversaries can easily compromise and duplicate sensors, and use them as weapons to obtain access to the entire network using legitimate credentials [50]. Parno et al. [51] presented a detection system to prevent the node replication attack in a distributed sensor network environment. However, their study did not mention further attacks that result from cloned compromised sensors spreading to the entire network. Choi et al. [49] provided a clone detection scheme called SET in sensor networks. They modeled a sensor network as a set of non-overlapping sub-regions, and assigned a unique identifier to each sensor node. The subset of each node in each sub-region is exclusive to other nodes. If adversaries capture, compromise, and duplicate sensor nodes in the network, the clone attack can be detected because of the intersecting subsets of the cloned nodes. Xing et al. [50] proposed an approach for the real-time detection of cloned-sensor attacks in wireless sensor networks by computing the fingerprint of each sensor to extract the neighborhood characteristics and checking the validity of the originator's fingerprint for each message. Their approach achieved high detection accuracy at a low computation and storage cost for node/sensor cloning scenarios during the fingerprint generation and detection phases. Furthermore, with no limitation on the number of cloned sensors, their approach improved on the results of related studies [49], [51].
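The reasoning behind detection through intersecting subsets can be illustrated with a simplified sketch. It is not the SET protocol itself: it assumes that each sub-region reports the identifiers of the nodes it contains, and that a legitimate identifier appears in exactly one sub-region. The function name and the sample reports are hypothetical.

```python
from collections import Counter


def find_cloned_ids(subregion_reports):
    """Return node IDs that appear in more than one sub-region."""
    counts = Counter(node
                     for members in subregion_reports.values()
                     for node in set(members))
    return sorted(node for node, seen in counts.items() if seen > 1)


# Hypothetical reports: node 2 was captured in region A and replicated in C.
reports = {
    "A": {1, 2, 3},
    "B": {4, 5},
    "C": {6, 2},
}

print(find_cloned_ids(reports))
```

This sketch only detects replicas placed in a different sub-region from the original node. Replicas within the same sub-region would require an additional local check.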

b: INTERNET OF THINGS
The rapid development of the IoT has triggered many security issues, including various malicious code injections into IoT devices. Program developers prefer to use the software clone method to finish tasks quickly because of the large scale and range of IoT devices. The consequent clone attacks need more efficient corresponding detection approaches. To detect code clones in IoT applications, Tekchandani et al. [53], and Luo et al. [54] provided good results based on semantic-level source code analysis; however, their studies were not primarily on cloned vulnerability detection. Sachidananda et al. [55] proposed a framework to detect various vulnerabilities located in IoT devices using the static analysis method. Their approach was efficient in terms of identifying many types of attacks, such as memory leaks, code injection, buffer overflow, and other code-related vulnerabilities. Liu et al. [56] also proposed a similar vulnerability detection method for IoT binary code, but not for code clone attack detection in particular.
Gao et al. [14] presented an approach called IoTSeeker for cross-platform IoT device vulnerability detection based on analyzing binary code at the semantic level. They constructed a labeled semantic flow graph to capture both data flow and control flow information from binary code. They then extracted semantic features as numerical vectors and built a detection neural network model for feature integration and vulnerability search. Finally, IoTSeeker calculated the cosine distance between two embedding vectors to identify whether vulnerable clones exist.
The supply chain provides another platform for introducing software clone attacks, such as cloned and compromised RFID tags, which may help attackers to acquire confidential credentials and authorization information to compromise the supply chain system. Researchers [57]- [60] have proposed several clone detection approaches for an RFID-embedded supply chain system, and these approaches can be applied to vulnerable RFID tag clone detection with the appropriate improvement.
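The last IoTSeeker step described above, comparing two embedding vectors, can be sketched as follows. This is only an illustration of the cosine measure: the vectors, the function names, and the threshold of 0.9 are assumptions and are not taken from Gao et al. [14]. Cosine distance is one minus the similarity computed here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def is_candidate_clone(u, v, threshold=0.9):
    """Flag a pair whose similarity reaches the (assumed) threshold."""
    return cosine_similarity(u, v) >= threshold


# Hypothetical embeddings of a known vulnerable function and a target.
vulnerable = [1.0, 2.0, 3.0]
target = [2.0, 4.0, 6.0]
print(is_candidate_clone(vulnerable, target))
```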
c: ANDROID APPLICATIONS
The Android operating system has become more popular and widely used, and more security concerns have attracted researchers' attention. Some researchers have detected source code similarities for Android applications, including Type-1, Type-2, and Type-3 clones, as well as vulnerabilities injected into applications during software runtime. Crussell et al. [25] presented a cloning attack detection tool called DNADroid, and Chen et al. [26] presented a similarity/clone detection approach called the centroid, both of which are based on comparing program dependency graphs between methods in candidate applications. Crussell et al.'s study focused only on identifying similar clones, which led to a low false positive rate but also to missed clones. Chen et al.'s approach is more accurate, has the explicit purpose of improving the detection system's accuracy and scalability, and has a greater focus on cross-platform application clone detection. Akram et al. [27] proposed a scalable clone detection approach called DroidCC based on excluding third-party libraries, normalization, and feature extraction, and evaluated their approach on a real-time dataset.

d: ETHEREUM SMART CONTRACT
With the rapid development of blockchain's distributed architecture, the Ethereum smart contract provides an environment for malicious code clones, in which a piece of injected contract code can propagate to other blocks. He et al. [52] focused on the ecosystem of the Ethereum smart contract to characterize vulnerable code clones using the fuzzy hashing technique to calculate the edit distance between two fingerprints. Their approach compares the similarity between generated fingerprints of user-created contract code and contract-created contract code during Ethereum virtual machine runtime.
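The fingerprint comparison step can be illustrated with a plain Levenshtein edit distance. This is a sketch under assumptions: the fuzzy hashing that produces the fingerprints is not reproduced, the fingerprints are treated as ordinary strings, and the normalization into a similarity score is a choice made here rather than the measure used by He et al. [52].

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def fingerprint_similarity(a, b):
    """Map the edit distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


# Hypothetical fingerprints that differ in one character.
print(fingerprint_similarity("3a9fbc01", "3a9fbd01"))
```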

e: CRYPTOCURRENCY
Cryptocurrency is another research topic of great interest because of its novel security protection structure and wide use in both academic research and industrial applications. Hum et al. [15] proposed an approach called CoinWatch for detecting code/system vulnerabilities in cryptocurrencies on the basis of code clone detection technology. They provided this type of approach because of the rapidly increasing use of cryptocurrencies (e.g., Bitcoin) and their publicly readable code structures [61], [62]. If one code fragment is vulnerable to cyberattacks, the vulnerability is propagated into other cloned code fragments or even cryptocurrencies.
CoinWatch has four main phases for vulnerability detection:
• CVE parsing & linking it with commits: The first phase involves CVE parsing and linking the result with possible commits. A target CVE is provided as input together with data publicly obtainable from its structured details [63]. After selecting a target CVE, CoinWatch performs code evolution analysis of the parent project to obtain bug fixing and bug introducing commits.
• Identification of vulnerable code: The bug introducing and fixing commits are then manually annotated to minimize the code responsible for the vulnerability and improve the program.
• Initial filtering: This phase can be regarded as pre-processing before moving to the detection process. To narrow down the search space and thus work more efficiently, CoinWatch filters the list of monitored projects on the basis of the fork's date before running the clone detector.
• Detection process: The last phase is the core part of CoinWatch: the clone detector. This part reports the cloned projects that are likely to be affected by the vulnerability given the filtered source code of the monitored cryptocurrencies.
The authors evaluated their approach by answering three research questions about clone prevalence in cryptocurrencies, the accuracy of CoinWatch, and the comparison of true positives with false positives in the vulnerability detection report.
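The initial filtering phase can be sketched as a date filter applied before any clone detection runs. The text only states that projects are filtered by fork date, so the concrete rule below is an assumption: a project is kept if it forked on or after the bug-introducing commit and before the bug-fixing commit. The `Project` type, the function name, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Project:
    name: str
    fork_date: date


def filter_by_fork_date(projects, bug_introduced, bug_fixed):
    """Keep projects that forked while the vulnerable code was present.

    Assumed rule: a fork made before the bug was introduced cannot have
    inherited it, and a fork made after the fix already contains the fix.
    """
    return [p for p in projects
            if bug_introduced <= p.fork_date < bug_fixed]


# Hypothetical monitored projects and commit dates.
monitored = [
    Project("coin-a", date(2015, 3, 1)),
    Project("coin-b", date(2017, 6, 15)),
    Project("coin-c", date(2019, 1, 10)),
]
candidates = filter_by_fork_date(
    monitored, bug_introduced=date(2016, 1, 1), bug_fixed=date(2018, 9, 1)
)
print([p.name for p in candidates])
```

Only the remaining candidates would then be passed to the clone detector, which is what keeps the search space small.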

3) DISCUSSION
Malicious people typically target web applications as an easy and flexible environment for code and script injection. However, few researchers have discussed this related clone problem. Vineetha and Krishna [65] researched this topic for code clone vulnerability analysis and detection in web applications by analyzing the web page structure and comparing the similarities. They did not propose a powerful detection system, and further evaluation for their approach is needed. Agrawal et al. [66] presented a detection framework to identify web application clones based on the source code level. They presented their framework following a detailed process introduction involving executing and monitoring, classifying and controlling, and refining and managing code. Many security practitioners have adapted this framework; however, the framework is limited to the source code level, which is not flexible for dynamic detection.
Following recent technological improvements, static or dynamic analysis per se cannot satisfy the requirement to prevent various increasing cyberattacks. For example, it is difficult and will take a longer time to trace back a piece of vulnerable code to its exact location through dynamic analysis only. Static analysis cannot obtain access to some types of source code files if the source code is not available or the executable file has been packed by packer or protector tools. Hybrid and advanced analysis methods, such as binary code-level detection methods, are necessary for more efficient code clone detection and source code fixing.

B. REPRESENTATIONS
Software clone detection techniques can be classified on the basis of different representations into five types: text-based, token-based, AST-based, program dependency graph (PDG)-based, and metric-based. In this section, we provide a review of security analysis studies on these clone detection techniques. In Table 3, we summarize several studies by comparing their representations, purpose, possible detected clone types, applied techniques, and evaluations.

1) TEXT-BASED
Text-based code clone detection technology simplifies the source code to a sequence of characters by removing unnecessary parts, such as comments, whitespace, and new lines, from the source code [72]- [74]. It compares the similarity between these character sequences individually, and then returns the matching results [28]. Text-based code clone detection can be used to detect Type-1 (exact clones), Type-2 (renamed clones), and Type-3 (near-miss clones) code clones, which are based on the textual level.
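A minimal sketch of the text-based idea follows: strip comments and whitespace, then compare the remaining lines. It assumes C-style comments and uses a naive regular expression that would also strip comment markers inside string literals. The function names are hypothetical, and the sketch is not the algorithm of any tool cited in this section.

```python
import re
from difflib import SequenceMatcher


def normalize(source):
    """Remove C-style comments, all whitespace, and empty lines."""
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)  # block comments
    source = re.sub(r"//[^\n]*", "", source)                    # line comments
    lines = ["".join(line.split()) for line in source.splitlines()]
    return [line for line in lines if line]


def text_similarity(a, b):
    """Ratio in [0, 1] over the normalized line sequences."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


original = "int x = 1; // init\n\nreturn x;\n"
copy = "int  x=1;\n/* same */\nreturn x;"
print(text_similarity(original, copy))
```

Because only layout and comments are removed, renamed identifiers still lower the score, which is why token-based normalization is often preferred for renamed clones.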
Karademir et al. [36] and Alalfi et al. [68] both presented approaches for detecting vulnerable near-miss clones (Type-3) based on a text-based technique. Karademir et al.'s approach identifies malware from JavaScript in Adobe Acrobat (PDF) files. It compares the similarity between collected PDF files that contain JavaScript malware and clean JavaScript. Their approach uses the NiCad clone detector, which is designed particularly for near-miss clone detection. Alalfi et al.'s approach identifies near-miss interaction clones in reverse-engineered UML sequence diagrams. Their approach works at the XMI level. They also used the NiCad clone detector to help to process the detection in the reverse-engineered behavioral model.

[Figure caption; opening words not recoverable] ... [64]. Code snippet (a) contains an SQL injection vulnerability on line 6 because the variable $title is placed in an SQL query without prior security processing. Lines 12 and 13 can result in an XSS attack by inserting database rows into the document directly. Part (b) shows an example of an abstract syntax tree generated from the SQL injection vulnerability of part (a), in which leaf nodes correspond to identifiers (variables), API symbols, or literals. Part (c) shows a template derived from the abstract syntax tree by replacing all variables and literals with wildcard symbols and introducing edges between nodes to represent the same variable.

2) TOKEN-BASED
Token-based code clone detection technology converts the source code into an intermediate representation, that is, a token sequence, using a certain token conversion tool before the detection phase [1], [28]. One converted token sequence can be compared with another converted token sequence under a matching rule to obtain the matching results for further processing. Representative token-based techniques are CCFinder [23] and CP-Miner [75]. Compared with text-based techniques, a token-based technique is more robust against code changes, such as formatting and spacing [10].
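Token normalization can be illustrated with Python's standard `tokenize` module standing in for the token conversion tool. This is a sketch, not the normalization used by CCFinder or CP-Miner: identifiers are replaced by `ID`, literals by `LIT`, and comments and layout tokens are dropped. The function name and the sample fragments are hypothetical.

```python
import io
import keyword
import tokenize

# Token types that carry layout or comments rather than program content.
SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def normalized_tokens(source):
    """Convert Python source into a normalized token sequence."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")       # any identifier
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            tokens.append("LIT")      # any literal
        else:
            tokens.append(tok.string)  # keywords and operators as-is
    return tokens


first = "def add(x, y):\n    return x + y  # sum\n"
renamed = "def plus(a, b):\n    return a + b\n"
print(normalized_tokens(first) == normalized_tokens(renamed))
```

Two fragments that differ only in identifier names, literal values, comments, or spacing map to the same sequence, which is the robustness property noted above.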
Farhadi et al. [67] proposed a scalable code clone detection approach called ScalClone for malware analysis on the basis of their previous approach, which was an assembly code clone detection method [12], [76]. Their approach discovers both exact and inexact clones at different token normalization levels using a large-scale assembly code search. Akram et al. [70] proposed a lightweight and scalable system called VCIPR for vulnerability detection in unpatched source code based on token normalization representation at function-level granularity. They built a fingerprint index of the top critical CVE's source code to detect unpatched code fragments in common open-source software.

3) TREE-BASED
Tree-based code clone detection technology is also referred to as AST-based technology [1]. In the code parsing process, the syntax tree-based method converts the source code into an AST, and the tree nodes serve as the representation for the matching and detecting phases [28]. The matching result is returned by comparing two converted syntax trees.
In the code clone area, a source program can be parsed into a parse tree or AST that represents the source code [10]. Subtrees can be compared through exact or close subtree matches to detect whether any code clones exist [77]- [79].
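Exact subtree matching on normalized ASTs can be sketched with Python's standard `ast` module. The sketch makes several assumptions: the input is Python source, every identifier and literal is replaced by a wildcard, and only statement-level subtrees are compared. The class and function names are hypothetical, and close (inexact) subtree matching is not covered.

```python
import ast


class _Normalizer(ast.NodeTransformer):
    """Replace identifiers and literals with wildcards."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_arg(self, node):
        node.arg = "_"
        return node

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node

    def visit_Constant(self, node):
        return ast.copy_location(ast.Constant(value=None), node)


def subtree_fingerprints(source):
    """Set of dumps of every statement subtree after normalization."""
    tree = _Normalizer().visit(ast.parse(source))
    return {ast.dump(n) for n in ast.walk(tree) if isinstance(n, ast.stmt)}


def shared_subtrees(a, b):
    """Statement subtrees that occur in both fragments."""
    return subtree_fingerprints(a) & subtree_fingerprints(b)


f = "def f(x):\n    y = x + 1\n    return y\n"
g = "def g(n):\n    total = n + 2\n    return total\n"
print(subtree_fingerprints(f) == subtree_fingerprints(g))
```

Here `f` and `g` differ only in names and one literal, so all of their statement subtrees coincide after normalization.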
Unruh et al. [64] proposed an approach to semiautomatically detect vulnerable code snippets starting from certain web tutorials and QA websites, which aim at assisting programmers' coding tasks. They applied AST-based graph traversals to verify similarities in analyzed code snippets that correspond to the original vulnerability. Unruh et al. provided an example of an identified vulnerable code snippet taken from a popular PHP tutorial and its corresponding AST structure, and derived the template shown in Fig. 3. Shi et al. [71] proposed a two-phase framework (training phase and detection phase) to identify vulnerable source code clones in operating systems. The approach learns correlations on the basis of AST normalization at function-level granularity.

4) PDG-BASED
PDG-based detection technology refers to converting source code into a control flow and data flow graph, and then returning the matching result by comparing the similarities between the sub-graphs [28].
For Type-4 (semantic clones) code clones, the PDG-based code clone detection method is efficient in terms of detecting source code vulnerabilities because it preserves the semantic features of the program [28]. Several research studies have been conducted on the basis of this type of graph cooperation method for vulnerability detection.
The dynamic analysis subsection above introduced some studies that used program dependency graph abstraction for feature extraction. For instance, Crussell et al. [25] proposed DNADroid and Chen et al. [26] proposed the centroid, both for detecting Android application cloning vulnerabilities. Their approaches are capable of identifying Type-4 (semantic clones) code clones in a dynamic software operating environment.

5) METRIC-BASED
The metric-based code clone detection method parses the program by dividing the source code into several small code segments, and then calculates the difference value among these code segments and determines whether the calculated values are the same (a clone) [28]. Mayrand et al. [80] discussed using metric extraction techniques to automatically detect function cloning in a software system. Their study focused on analyzing and comparing control graph metrics and data flow graph metrics on the basis of a previous AST representation. Few researchers have primarily studied, or specially mentioned, applying metric-based detection techniques for vulnerable code clone detection. Therefore, a more in-depth survey is needed regarding this aspect.
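The metric-based idea can be sketched by computing a small metric vector per function and reporting pairs with identical vectors. The four metrics chosen here (statement count, branch or loop count, call count, parameter count) are an assumption for illustration and are not the metric set of Mayrand et al. [80]. Python's `ast` module is used for parsing, and the function names are hypothetical.

```python
import ast


def function_metrics(source):
    """Map each function name to a tuple of simple structural metrics."""
    metrics = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            inner = list(ast.walk(node))
            metrics[node.name] = (
                sum(isinstance(n, ast.stmt) for n in inner) - 1,  # minus the def
                sum(isinstance(n, (ast.If, ast.For, ast.While)) for n in inner),
                sum(isinstance(n, ast.Call) for n in inner),
                len(node.args.args),
            )
    return metrics


def metric_clones(source):
    """Pairs of functions whose metric vectors are identical."""
    m = function_metrics(source)
    names = sorted(m)
    return [(a, b) for i, a in enumerate(names)
            for b in names[i + 1:] if m[a] == m[b]]


sample = (
    "def a(x):\n    if x > 0:\n        return x\n    return -x\n\n"
    "def b(y):\n    if y < 0:\n        return 0\n    return y\n\n"
    "def c(z):\n    return z\n"
)
print(metric_clones(sample))
```

Equal vectors only indicate clone candidates; unrelated functions can share the same counts, so a metric match is usually confirmed with a finer comparison.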

6) DISCUSSION
As illustrated above, text, token, and AST-based detection techniques can identify textual-based clone attacks, and the PDG-based detection technique can detect semantic-based clone attacks. However, (i) few researchers have aimed to present a hybrid detection approach that is efficient in terms of detecting both textual and semantic-based code/software clone attacks; and (ii) it is not easy and obvious to select a normalized source code representation while designing a detection approach because there are no selection criteria.

C. BINARY-LEVEL DETECTION
When reviewing previous studies, we found that many researchers focused on analyzing binary code-based similarity comparison. One reason for binary-based analysis is that software source code is not always available. For privacy protection reasons, researchers have to find another way to obtain software or application code, or function information. Another reason might be the huge task load of pre-processing, filtering, and feature extraction for source code information. Khoo et al. [81] provided a search system that identifies binary code using instruction mnemonics, control flow sub-graphs [89], and data constants extracted from binary code fragments. Lee et al. [83] introduced a method for identifying software vulnerabilities from assembly code using a deep learning mechanism. Hu et al. [82] presented a semantics-based approach to identify binary code clones. Table 4 summarizes several binary-level-based code clone detection techniques following a set of specific criteria. As introduced in Table 4, some researchers have proposed efficient binary-code reuse analysis and detection methods. For instance, Farhadi et al. [12] introduced a method to identify malicious cloned code binaries based on the token normalization technique; Xue et al. [85] proposed a framework to detect vulnerable code clones by slicing binary codes and identifying domain-specific code fragments; Ishiura [87] proposed detecting the loss of guards by comparing binary-code pairs with or without problematic optimization; Ding [88] proposed learning lexical semantic relationships and the vector representation directly from plain assembly code instead of manually specifying it from prior knowledge; Liu et al. [56] proposed a long short-term memory (LSTM)-based approach to detect binary-level software vulnerabilities automatically.
However, binary-level analysis still faces several challenges. For instance, it remains difficult to accurately determine all valid control flow paths from the source code at system runtime and to perform accurate static data flow analysis to identify argument values [46]. Mishra and Polychronakis [90] proposed Shredder for statically analyzing Windows applications at the binary level using backward dataflow analysis to derive expected argument values and generate application-wide policies for critical system functions. To address limitations in binary-level analysis, after Shredder, they proposed Saffire (Section III). Hence, binary code-based clone attack detection is an important future research direction.

IV. FUTURE RESEARCH DIRECTION
For security analysis, several important topics on software clone detection remain, which we discuss here. Following the discussion in Section III, we summarize a potential research direction, which is an integration of intelligent detection techniques, code clone detection for IoT devices and dynamic detection mechanisms.
Some researchers have provided efficient results. For example, Gao et al. [14] and Liu et al. [56] proposed deep learning-based approaches for binary vulnerability detection at the semantic level for IoT devices. They trained a neural network model with numerical vectors transformed from the semantic features of the captured data flow and control flow information. We discuss three aspects of this type of research topic; however, we believe that there is a wider research space for this topic.

A. INTELLIGENT DETECTION
From Table 1, we found that some detection approaches based on the static analysis method partially relied on manually analyzing source code or generating representations, which typically takes time and effort, and is not efficient for solving the big-code problem [91]. Many researchers are moving toward applying more intelligent technology, such as deep learning and neural network models, to the research area of vulnerable source code detection [92], [93].
Kim et al. [94] studied obfuscation techniques and detected obfuscated macro code by extracting 15 static discriminant features and training five machine learning classifiers. Wang et al. [95] researched the patch level for "0-day" vulnerability detection by automatically identifying secret security patches in open-source software. They trained the identification model with extracted features from more than 4,700 security patches from a database to detect similar patches or vulnerabilities.
Li et al. [13], [96], [97] proposed deep learning-based approaches (VulPecker, VulDeePecker, SySeVR) for software vulnerability detection. Their approaches were aimed at automatically detecting vulnerable source code fragments by training a BLSTM neural network. They compared similarities between source code fragments and target vulnerabilities by generating code gadgets and transforming these code gadgets into vector representations, which were used as the neural network input. Their approach performed well in terms of finding vulnerabilities compared with similar systems, and was able to find many types of vulnerabilities simultaneously.
Although the VulDeePecker series and Kim et al.'s study focused only on source code vulnerability detection based on similarity comparison, they offer an efficient and applicable method for vulnerable code clone detection. The clone detection system can be made more intelligent and automated by training it using the original vulnerable code fragments on the basis of deep neural network models and appropriate feature selection.

B. 5G-IoT DETECTION
The IoT network provides an environment for attackers to inject malicious code easily. IoT devices, particularly small devices, such as baby monitors, can be attacked easily by malicious code cloning without complicated or extensive code. The cloned vulnerability can spread in a moment through a network of a vast number of devices.
The global data volume is increasing, which makes 5G technology indispensable. For existing technologies, it is more challenging to meet the requirements of the rapidly developing IoT world. Next-generation technology, that is, 5G, will provide IoT devices with unlimited connectivity in the future internet world. Hence, code cloning is a primary challenge for 5G-IoT technology. Ullah et al. [98] proposed an approach to identify code clones in specific 5G-IoT applications using a control flow graph and deep learning model.

C. DYNAMIC DETECTION
From Table 2, we conclude that real-time clone attack detection is another possible research topic that needs further attention. Particularly for mobile applications and web environments, attackers can intrude on a running application at any time by executing malicious software clone behavior and controlling compromised applications. Applying real-time detection techniques to platforms at the application level is very necessary. As previously discussed, researchers have proposed code clone vulnerability detection approaches for IoT devices [14], [56], [99], and RFID-enabled supply chain systems [57]- [60]. Thus, a real-time cloning detection approach is needed to protect systems more efficiently.

V. CONCLUSION
In this paper, we provided a comprehensive review of previous studies on software/code clone detection from the security perspective. We compared and summarized several detection approaches based on static code analysis and dynamic analysis, respectively. Additionally, we outlined different representation-based studies and provided some meaningful information to researchers, such as possible detected clone types, the research purpose, and applied techniques or tools. We also discussed vulnerable code clone detection issues at the binary code level. Then we proposed a future research direction, including three potential topics, intelligent detection, 5G-IoT-based clone detection and real-time detection, which were generated from the literature review.
This survey provides a summary of previous vulnerable code clone detection-related results to help researchers to acquire basic knowledge of this topic, and select the correct techniques or tools while identifying potential research issues and future directions.