Revisiting Binary Code Similarity Analysis using Interpretable Feature Engineering and Lessons Learned

Binary code similarity analysis (BCSA) is widely used for diverse security applications, including plagiarism detection, software license violation detection, and vulnerability discovery. Despite the surging research interest in BCSA, it is significantly challenging to perform new research in this field for several reasons. First, most existing approaches focus only on the end results, namely, increasing the success rate of BCSA, by adopting uninterpretable machine learning. Moreover, they utilize their own benchmark, sharing neither the source code nor the entire dataset. Finally, researchers often use different terminologies or even use the same technique without citing the previous literature properly, which makes it difficult to reproduce or extend previous work. To address these problems, we take a step back from the mainstream and contemplate fundamental research questions for BCSA. Why does a certain technique or a certain feature show better results than the others? Specifically, we conduct the first systematic study on the basic features used in BCSA by leveraging interpretable feature engineering on a large-scale benchmark. Our study reveals various useful insights on BCSA. For example, we show that a simple interpretable model with a few basic features can achieve a comparable result to that of recent deep learning-based approaches. Furthermore, we show that the way we compile binaries or the correctness of underlying binary analysis tools can significantly affect the performance of BCSA. Lastly, we make all our source code and benchmark public and suggest future directions in this field to help further research.


INTRODUCTION
P ROGRAMMERS reuse existing code to build new soft- ware.It is common practice for them to find the source code from another project and repurpose that code for their own needs [1].Inexperienced developers even copy and paste code samples from the Internet to ease the development process.
This trend has deep implications for software security and privacy.When a programmer takes a copy of a buggy function from an existing project, the bug will remain intact even after the original developer has fixed it.Furthermore, if a developer in a commercial software company inadvertently uses library code from an open-source project, the company can be accused of violating an open-source license such as the GNU General Public License (GPL) [2].
Unfortunately, detecting such problems from binary code using a similarity analysis is not straightforward, particularly when the source code is not available.This is because binary code lacks high-level abstractions, such as data types and functions.For example, it is not obvious from binary code to determine whether a memory cell represents an integer, a string, or another data type.Moreover, identifying precise function boundaries is radically challenging in the first place [3], [4].
However, despite the surging research interest in binary code similarity analysis (BCSA), we found that it is still significantly challenging to conduct new research on this field for several reasons.
First, most of the methods focus only on the end results without considering the precise reasoning behind their approaches.For instance, during our literature study in the field, we observed that there is a prominent research trend in applying BCSA techniques to cross-architecture and cross-compiler binaries of the same program [11], [12], [13], [15], [16], [19], [22].Those approaches aim to measure the similarity between two or more seemingly distinct binaries generated from different compilers targeting different instruction sets.To achieve this, multiple approaches have devised complex analyses based on machine learning to extract the semantics of the binaries, assuming that their semantics should not change across compilers nor target architectures.However, none of the existing approaches clearly justifies the necessity of such complex semanticsbased analyses.One may imagine that a compiler may generate structurally similar binaries for different architectures, even though they are syntactically different.Do compilers and architectures really matter for BCSA in this regard?Unfortunately, it is difficult to answer this question because most of the existing approaches leverage uninterpretable machine learning techniques [12], [13], [19], [20], [21], [23], [24], [25], [26], [27], [28], [29].Further, it is not even clear why a BCSA algorithm works only on some benchmarks and not on others.
Second, every existing paper on BCSA that we studied utilizes its own benchmark to evaluate the proposed technique, which makes it difficult to compare the approaches with one another.Moreover, reproducing the previous results is often infeasible because most researchers reveal neither their source code nor their dataset.Only 10 of the 43 papers that we studied fully released their source code, and only two of them opened their entire dataset.
Finally, researchers in this field do not use unified terminologies and often miss out on critical citations that have appeared in top-tier venues of other fields.Some of them even mistakenly use the same technique without citing the previous literature properly.These observations motivate one of our research goals, which is to summarize and review widely adopted techniques in this field, particularly in terms of generating features.
To address these problems, we take a step back from the mainstream and contemplate fundamental research questions for BCSA.As the first step, we precisely define the terminologies and categorize the features used in the previous literature to unify terminologies and build knowledge bases for BCSA.We then construct a comprehensive and reproducible benchmark for BCSA to help researchers extend and evaluate their approaches easily.Lastly, we design an interpretable feature engineering model and conduct a series of experiments to investigate the influence of compilers, their options, and their target architectures on the syntactic and structural features of the resulting binaries.
Our benchmark, which we refer to as BINKIT, encompasses various existing benchmarks.It is generated by using major compiler options and targets, which include 8 architectures, 9 different compilers, 5 optimization levels, as well as various other compiler flags.BINKIT contains 243,128 distinct binaries and 36,256,322 functions built for 1,352 different combinations of compiler options, on 51 real-world software packages.We also provide an automated script that helps extend BINKIT to handle different architectures or compiler versions.We believe this is critical because it is not easy to modify or extend previous benchmarks, despite us having their source codes.Cross-compiling software packages using various compiler options is challenging because of numerous environmental issues.To the best of our knowledge, BINKIT is the first reproducible and extensible benchmark for BCSA.
With our benchmark, we perform a series of rigorous studies on how the way of compilation can affect the resulting binaries in terms of their syntactic and structural shapes.To this end, we design a simple interpretable BCSA model, which essentially computes relative differences between BCSA feature values.We then build a BCSA tool that we call TIKNIB, which employs our interpretable model.With TIKNIB, we found several misconceptions in the field of BCSA as well as novel insights for future research as follows.
First, the current research trend in BCSA is founded on a rather exaggerated assumption: binaries are radically different across architectures, compiler types, or compiler versions.However, our study shows that this is not necessarily the case.For example, we demonstrate that simple numeric features, such as the number of incoming/outgoing calls in a function, are largely similar across binaries compiled for different architectures.We also present other elementary features that are robust across compiler types, compiler versions, and even intra-procedural obfuscation.With these findings, we show that TIKNIB with those simple features can achieve comparable accuracy to that of the state-of-theart BCSA tools, such as VulSeeker, which relies on a complex deep learning-based model.
Second, most researchers focus on vectorizing features from binaries, but not on recovering lost information during the compilation, such as variable types.However, our experimental results suggest that focusing on the latter can be highly effective for BCSA.Specifically, we show that TIKNIB with recovered type information achieves an accuracy of over 99% on all our benchmarks, which was indeed the best result compared to all the existing tools we studied.This result highlights that recovering type information from binaries can be as critical as developing a novel machine learning algorithm for BCSA.
Finally, the interpretability of the model helps advance the field by deeply understanding BCSA results.For example, we present several practical issues in the underlying binary analysis tool, i.e., IDA Pro, which is used by TIKNIB, and discuss how such errors can affect the performance of BCSA.Since our benchmark has the ground truth and our tool employs an interpretable model, we were able to easily pinpoint those fundamental issues, which will eventually benefit binary analysis tools and the entire field of binary analysis.Contribution.In summary, our contributions are as follows: • We study the features and benchmarks used in the past literature regarding BCSA and clarify less-explored research questions in this field.• We propose BINKIT 1 , the first reproducible and expandable BCSA benchmark.It contains 243,128 binaries and 36,256,322 functions compiled for 1,352 distinct combinations of compilers, compiler options, and target architectures.
• We develop a BCSA tool, TIKNIB 2 , which employs a simple interpretable model.We demonstrate that TIKNIB can achieve an accuracy comparable to that of a state-ofthe-art deep learning-based tool.We believe this will serve as a baseline to evaluate future research in this field.• We investigate the efficacy of basic BCSA features with TIKNIB on our benchmark and unveil several misconceptions and novel insights.• We make our source code, benchmark, and experimental data publicly available to support open science.

BINARY CODE SIMILARITY ANALYSIS
Binary Code Similarity Analysis (BCSA) is the process of identifying whether two given code snippets have similar semantics.Typically, it takes in two code snippets as input and returns a similarity score ranging from 0 to 1, where 0 indicates the two snippets are completely different, and 1 means that they are equivalent.The input code snippet can be a function [11], [16], [19], [21], [24], [30], [31], [32], or even an entire binary image [7], [8].Additionally, the actual comparison can be based on functions, even if the inputs are entire binary images [12], [13], [15], [23], [33], [34], [35].At a high level, BCSA performs four major steps as described below: (S1) Syntactic Analysis.Given a binary code snippet, one parses the code to obtain a disassembly or an Abstract Syntax Tree (AST) of the code, which is often referred to as an Intermediate Representation (IR) [36].This step corresponds to the syntax analysis in traditional compiler theory, where source code is parsed down to an AST.If the input code is an entire binary file, we first parse it based on its file format and split it into sections.(S2) Structural Analysis.This step analyzes and recovers the control structures inherent in the given binary code, which are not readily available from the syntactic analysis phase (S1).In particular, this step involves recovering the control-flow graphs (CFGs) and call graphs (CGs) in the binary code [37], [38].Once the control-structural information is obtained, one can use any attribute of these control structures as a feature.We distinguish this step from semantic analysis (S3) because binary analysis frameworks typically provide CFGs and CGs for free; the analysts do not have to write a complex semantic analyzer.(S3) Semantic Analysis.Using the control-structural information obtained from S2, one can perform traditional program analyses, such as data-flow analysis and symbolic analysis, on the binary to figure out the underlying semantics.In this step, one can generate features that represent sophisticated program semantics, such as how register values flow into various program points.One can also enhance the features gathered from S1-S2 along with the semantic information.(S4) Vectorization and Comparison.The final step is to vectorize all the information gathered from S1-S3 to compute the similarity between the binaries.This step essentially results in a similarity score between 0 and 1.
Figure 1 depicts the four-step process.The first three steps determine the inputs to the comparison step (S4), which are often referred to as features.Some of the first three steps can be skipped depending on the underlying features being used.The actual comparison methodology in S4 can also vary depending on the BCSA technique.For example, one may compute the Jaccard distance [39] between feature sets, calculate the graph edit distance [40] between CFGs, or even leverage deep learning algorithms [41], [42].However, as the success of any comparison algorithm significantly depends on the chosen features, this paper focuses on features used in previous studies rather than the comparison methodologies.
In this section, we first describe the features used in the previous papers and their underlying assumptions ( §2.1).We then discuss the benchmarks used in those papers and point out their problems ( §2.2).Lastly, we present several research questions identified during our study ( §2.3).

Scope.
Our study focuses on 43 recent BCSA papers (from 2014 to 2020) that appeared in 27 top-tier venues of different computer science areas, such as computer security, software engineering, programming languages, and machine learning.There are, of course, plentiful research papers in this field, all of which are invaluable.Nevertheless, our focus here is not to conduct a complete survey on them but to introduce a prominent trend and the underlying research questions in this field, as well as to answer those questions.We particularly focus on features and datasets used in those studies, which lead us to four underexplored research questions that we will discuss in §2.3; our goal is to investigating these research questions by conducting a series of rigorous experiments.Because of the space limit, we excluded papers [43], [44], [45], [46], [47], [48], [49], [50], [51], [52], [53] that were published before 2014 and those not regarding top-tier venues, or binary diffing tools [54], [55], [56] used in the industry.Additionally, we excluded papers that aimed to address a specific research problem such as malware detection, library function identification, or patch identification.Although our study focuses only on recent papers, we found that the features we studied in this paper are indeed general enough; they cover most of the features used in the older papers.

Features Used in Prior Works
We categorize features into two groups based on when they are generated during BCSA.Particularly, we refer to features obtained before and after the semantic analysis step (S3) as presemantic features and semantic features, respectively.Presemantic features can be derived from either S1 or S2, and semantic features can be derived from S3.We summarize both features used in the recent literature in Table 1.

Presemantic Features
Presemantic features denote direct or indirect outcomes of the syntactic (S1) and structural (S2) analyses.Therefore, we refer to any attribute of binary code, which can be derived without a semantic analysis, as a presemantic feature.We can further categorize presemantic features used in previous literature based on whether the feature represents a number or not.We refer to features representing a number as numeric presemantic features, and others as non-numeric presemantic features.The first half of Table 1 summarizes them.Numeric presemantic features.Counting the occurrences of a particular property of a program is common in BCSA as such numbers can be directly used as a numeric vector in the similarity comparison step (S4).We categorize numeric

BB-level Numbers
This mark denotes a feature that is not directly used for similarity comparison but is required for extracting other features used in post-processing.
Similarly, numeric features can be extracted from a CFG as well.CFG-level numeric features can also reflect structural information that underlies a CFG.For example, a function can be encoded into a numeric vector, which consists of the number of nodes (i.e., basic blocks) and edges (i.e., control flow), as well as grouped instructions in its CFG [11], [28], [61].One may extend such numeric vectors by adding extra features such as the number of successive nodes or the betweenness centrality of a CFG [12], [23], [28].The concept of 3D-CFG [72], which places each node in a CFG onto a 3D space, can be utilized as well.Here, the distances among the centroids of two 3D-CFGs can represent their similarity score [18].Other numeric features can be the graph energy, skewness, or cyclomatic complexity of a CFG [17], [28], [71].Even loops in a CFG can be converted into numeric features by counting the number of loop headers and tails, as well as the number of forward and backward edges [65].
Finally, previous approaches utilize numeric features obtained from CGs.We refer to them as CG-level numeric features.Most of these approaches measure the number of callers and callees in a CG [11], [17], [19], [23], [28], [65], [71], [73].When extracting these features, one can selectively apply an inter-procedural analysis using the ratio of the in-/out-degrees of the internal callees in the same binary and the external callees of imported libraries [15], [18], [20], [28].This is similar to the coupling concept [74], which analyzes the inter-dependence between software modules.The extracted features can also be post-processed using machine learning [19].Non-numeric presemantic features.Program properties can also be directly used as a feature.The most straightforward approach involves directly comparing the raw bytes of binaries [6], [53], [75].However, people tend to not consider this approach because byte-level matching is not as robust compared to simple code modifications.For example, antimalware applications typically make use of manually written signatures using regular expressions to capture similar, but syntactically different malware instances [76].Recent approaches have attempted to extract semantic meanings from raw binary code by utilizing a deep neural network (DNN) to build a feature vector representation [19], [25].
Another straightforward approach involves considering the opcodes and operands of assembly instructions or their intermediate representations [18], [77].Researchers often normalize operands [32], [34], [57] because their actual values can significantly vary across different compiler options.Recent approaches [62], [70] have also applied reoptimization techniques [78] for the same reason.To compute a similarity score, one can measure the number of matched elements or the Jaccard distance [15] between matched groups, within a comparison unit such as a sliding window [58], basic block [34], or tracelet [57].Here, a tracelet denotes a series of basic blocks.Although these approaches take different comparison units, one may adjust their results to compare two procedures, or to find the longest common subsequence [32], [34] within procedures.If one converts assembly instructions to a static single assignment (SSA) form, s/he can compute the tree edit distance between the SSA expression trees as a similarity score [10].Recent approaches have proposed applying popular techniques in natural language processing (NLP) to represent an assembly instruction or a basic block as an embedded vector, reflecting their underlying semantics [20], [21], [24], [26], [27], [29].
Finally, some features can be directly extracted from functions.These features may include the names of imported functions, and the intersection of two inputs can show their similarity [19], [61].Note that these features can collaborate with other features as well.
First, one straightforward method to represent the semantics of a given code snippet is to use symbolic constraints.The symbolic constraints could express the output variables or states of a basic block [7], a program slice [16], [59], [63], or a path [8], [79], [80].Therefore, after extracting the symbolic constraints from a target comparison unit, one can compare them using an SMT solver.
Second, one may represent code semantics using I/O samples [8], [15], [18], [22].The key intuition here is that two identical code snippets produce consistent I/O samples, and directly comparing them would be time-efficient.One can generate I/O samples by providing random inputs [8], [22] to a code snippet, or by applying an SMT solver to the symbolic constraints of the code snippet [15], [18].One can also use inter-procedural analysis to precisely model I/O samples if the target code includes a function call [15], [18].
Third, the runtime behavior of a code snippet can directly express its semantics, as presented by traditional malware analysis [81].By executing two target functions with the same execution environment, one can directly compare the executed instruction sequences [64] or visited CFG edges of the target functions [66].For comparison, one may focus on specific behaviors observed during the execution [18], [28], [30], [31], [35], [67], [82]: the read/write values of stack and heap memory, return values from function calls, and invoked system/library function calls during the executions.To extract such features, one may adopt fuzzing [31], [83], or an emulation-based approach [67].Moreover, one can further check the call names, parameters, or call sequences for system calls [18], [33], [35], [63], [67].
The next category is to manually annotate the high-level semantics of a program or function.One may categorize library functions by their high-level functionality, such as whether the function manipulates strings or whether it handles heap memory [15], [18], [61].Annotating cryptographic functions in a target code snippet [84] is also helpful because its complex operations hinder analyzing the symbolic constraints or behavior of the code [63].
The fifth category is extracting features from a program slice [85], because they can represent its data-flow semantics in an abstract form.Specifically, one can slice a program into a set of strands [14], [62].Here, a strand is a series of instructions within the same data flow, which can be obtained from backward slicing.Next, these strands can be canonicalized, normalized, or re-optimized for precise comparison [14], [62].Additionally, one may hash strands for quick comparison [68] or extract symbolic constraints from the strands [59].One may also extract features from a program dependence graph (PDG) [86], which is essentially a combination of a data-flow graph and CFG, to represent the convoluted semantics of the target code, including its structural information [13].

Key Assumptions from Past Research
During our literature study, we found that most of the approaches highly rely on semantic features extracted in (S3), assuming that they should not change across compilers nor target architectures.However, none of them clearly justifies the necessity of such complex semantics-based analyses.They focus only on the end results without considering the precise reasoning behind their approaches.This is indeed the key motivation for our research.Although most existing approaches focus on complex analyses, there may exist elementary features that we have overlooked.For example, there may exist effective presemantic features, which can beat semantic features regardless of target architectures and compilers.It can be the case that those known features have not been thoroughly evaluated on the right benchmark as there has been no comprehensive study on them.
Furthermore, existing research assumes the correctness of the underlying binary analysis framework, such as IDA Pro [95], which is indeed the most popular tool used, as shown in the rightmost column of Table 2.However, CFGs derived from those tools may be inherently wrong.They may miss some important basic blocks, for instance, which can directly affect the precision of BCSA features.
Indeed, both (S1) and (S2) are challenging research problems by themselves: there are abundant research efforts to improve the precision of both analyses.For example, disassembling binary code itself is an undecidable problem [96], and writing an efficient and accurate binary lifter is significantly challenging in practice [36], [97].Identifying functions from binaries [3], [4], [96], [98], [99], [100], [101] and recovering control-flow edges [102] for indirect branches are still active research fields.All these observations lead us to research questions in §2.3.

Benchmarks Used in Prior Works
It is imperative to use the right benchmark to evaluate a BCSA technique.Therefore, we studied the benchmarks used in the past literature, as shown in Table 2.However, during the study, we found that it is radically difficult to properly evaluate a new BCSA technique using the previous benchmarks.
We only mark items that are stated explicitly in the paper.Due to the lack of details about firmware images, we were not able to mark optimization options or compilers used to create them.For papers that do not explicitly state the number of binaries in their dataset, we estimated the number and marked it with parentheses.† This table focuses on two major compilers: GCC and Clang, as other compilers only support a limited number of architectures.
We infer the target architectures of the dataset as they are not stated explicitly in the paper.This indicates that only a portion of the code and dataset is available.For example, discovRE [11] makes available only their firmware images, and αDiff [19] opens transformed function images but not the actual dataset.
First, we were not able to find a single pair of papers that use the same benchmark.Some of them share packages such as GNU coreutils [15], [30], [31], but the exact binaries, versions, and compiler options are not the same.Although there is no known standard for evaluating BCSA, it is surprising to observe that none of the papers use the same dataset.We believe this is partly because of the difficulty in preparing the same benchmark.For example, even if we can download the same version of the source code used in a paper, it is extraordinarily difficult to crosscompile the program for various target architectures with varying compiler options; it requires significant effort to set up the environment.However, only two out of 43 papers we studied fully open their dataset.Even in that case, it is hard to rebuild or extend the benchmark because of the absence of a public compilation script for the benchmark.
Second, the number of binaries used in each paper is limited and may not be enough for analytics.The #Binaries column of Table 2 summarizes the number of program binaries obtained from two different sources: application packages and firmware images.Since a single package can contain multiple binaries, we manually extracted the packages used in each paper and counted the number of binaries in each package.We counted only the binaries after a successful compilation, such that the object files that were generated during the compilation process were not counted.If a paper does not explicitly mention package versions, we used the most recent package versions at the time of writing and marked them with parentheses.Note that only 6 out of 43 papers have more than 10,000 binaries, and none reaches 100,000 binaries.Firmware may include numerous binaries, but it cannot be directly used for BCSA because one cannot generate the ground truth without having the source code.
Finally, previous benchmarks only cover a few compilers, compiler options, and target architectures.Some papers do not even describe their tested compiler options or package versions.The Compiler column of the table presents the number of minor versions used for each major version of the compilers.Notably, all the benchmarks except one consider less than five different major compiler versions.The Extra column of the table shows the use of extra compiler options for each benchmark.Only a few consider function inlining and Link-Time Optimization (LTO).None of them deal with the Position Independent Executable (PIE) option, although, currently, it is widely used [103].
All these observations lead us to the research questions outlined in the next subsection ( §2.3) and eventually motivate us to create our own benchmark that we call BINKIT, which is shown in the last row of Table 2.

Research Problems and Questions
We now summarize several key problems observed from the previous literature and introduce research questions derived from these problems.First, none of the papers uses the same benchmark for their evaluation, and the way they evaluate their techniques significantly differs.Second, only a few of the studies release their source code and data, which makes it radically difficult to reproduce or improve upon existing works.Furthermore, most papers use manually chosen ground truth data for their evaluation, which are easily error-prone.Finally, current state-of-the-art approaches in BCSA focus on extracting semantic features with complex analysis techniques (from §2.1.1 and §2.1.2).These observations naturally lead us to the below research questions.Note that some of the questions are indeed openended, and we only address them in part.
RQ1.How should we establish a large-scale benchmark and ground truth data?
One may build benchmarks by manually compiling application source code.However, there are so many different compiler versions, optimization levels, and options to consider when building binaries.Therefore, it is desirable to automate this process to build a large-scale benchmark for BCSA.It should be noted that many of the existing studies have also attempted to build ground truth from source code.However, the number of binaries and compiler options used in those studies is limited and is not enough for data-driven research.Furthermore, those studies release neither their source code nor dataset ( §2.2).On the contrary, we present a script that can automatically build large-scale ground truth data from a given set of source packages with clear descriptions ( §3).

RQ2. Is the effectiveness of presemantic features limited to the target architectures and compiler options used?
We note that most previous studies assume that presemantic features are significantly less effective than semantic features, as they can largely vary depending on the underlying architectures and compiler optimizations used.For example, compilers may perform target-specific optimization techniques for a specific architecture.Indeed, 36 out of the 43 papers (≈ 84%) we studied focus on new semantic features in their analysis, as shown in Table 1.To determine whether this assumption is valid, we investigate it through a series of rigorous experimental studies.Although byte-level information significantly varies depending on the target and the optimization techniques, we found that some presemantic features, such as structural information obtained from CFGs, are broadly similar across different binaries of the same program.Additionally, we demonstrated that utilizing such presemantic features without a complex semantic analysis can achieve an accuracy that is comparable to that of a recent deep learning-based approach with a semantic analysis ( §5).
RQ3. Can debugging information help BCSA achieve a high accuracy rate?
We are not aware of any quantitative study on how much debugging information affects the accuracy of BCSA.Most prior works simply assume that debugging information is not available, but how much does it help?How would decompilation techniques affect the accuracy of BCSA?To answer this question, we extracted a list of function types from our benchmark and used them to perform BCSA on our dataset.Surprisingly, we were able to achieve a higher accuracy rate than any other existing works on BCSA without using any sophisticated method ( §6).
RQ4. Can we benefit from analyzing failure cases of BCSA?
Most existing works do not analyze their failure cases as they rely on uninterpretable machine learning techniques.However, our goal is to use a simple and interpretable model to learn from failure and gain insights for future research.Therefore, we manually examined failure cases using our interpretable method and observed three common causes for failure, which have been mostly overlooked by the previous literature.First, COTS binary analysis tools indeed return false results.Second, different compiler backends for the same architecture can be substantially different from each other.Third, there are architecture-specific code snippets for the same function.We believe that all these observations help in setting directions for future studies ( §7).

ESTABLISHING LARGE-SCALE BENCHMARK AND GROUND TRUTH FOR BCSA (RQ1)
Building a large-scale benchmark for BCSA and establishing its ground truth is challenging.One potential approach for generating the ground truth data is to manually identify similar functions from existing binaries or firmware images [10], [57], [59].However, this requires domain expertise and is often error-prone and time-consuming.
Another approach for obtaining the ground truth is to compile binaries from existing source code with varying compiler options and target architectures [13], [15], [16], [23].If we compile multiple binaries (with different compiler options) from the same source code, one can determine which function corresponds to which source lines.Unfortunately, most existing approaches do not open their benchmarks nor the compilation scripts used to produce them (Table 2).
Therefore, we present BINKIT, which is a comprehensive benchmark for BCSA, along with automated compilation scripts that help reproduce and extend it for various research purposes.The rest of this section details BINKIT and discusses how we establish the ground truth (RQ1).

BINKIT: Large-Scale BCSA Benchmark
BINKIT is a comprehensive BCSA benchmark that comprises 243,128 binaries compiled from 51 packages of source code with 1,352 distinct combinations of compilers, compilation options, and target architectures.Therefore, BINKIT covers most of the benchmarks used in existing approaches, as shown in Table 2. BINKIT includes binaries compiled for 8 different architectures.For example, we use both littleand big-endian binaries for MIPS to investigate the effect of endianness.It uses 9 different versions of compilers: GCC v{4.9.4,5.5.0, 6.4.0, 7.3.0,8.2.0} and Clang v{4.0, 5.0, 6.0, 7.0}.We also consider 5 optimization levels from O0 to O3 as well as Os, which is the code size optimization.Finally, we take PIE, LTO, and obfuscation options into account, which are less explored in BCSA.
We select GNU software packages [104] as our compilation target because of their popularity and accessibility: they are real applications that are widely used on Linux systems, and their source code is publicly available.We successfully compiled 51 GNU packages for all our target architectures and compiler options.
To better support targeted comparisons, we divide BINKIT into six datasets: NORMAL, SIZEOPT, NOINLINE, PIE, LTO, and OBFUSCATION.The summary of each dataset is shown in Table 3.Each dataset contains binaries obtained by compiling the GNU packages with different combinations of compiler options and targets.There is no intersection among the datasets.
NORMAL includes binaries compiled for 8 different architectures with different compilers and optimization levels.We did not use other extra options such as PIE, LTO, and no-inline for this dataset.
SIZEOPT is the same as NORMAL except that it uses only the Os optimization option instead of O0-O3.
Similarly, PIE, NOINLINE, LTO, and OBFUSCATION are no different from NORMAL except that they are generated by using an additional flag to enable PIE, to disable inline optimization, to enable LTO, and to enable compile-time obfuscation, respectively.PIE makes memory references in binary relative to support ASLR.On some architectures, e.g., x86, compilers inject additional code snippets to achieve relative addressing.As a result, the compiled output can differ severely.Although PIE became the default on most Linux systems [103], it has not been well studied for BCSA.Note we were not able to compile all 51 packages with the PIE option enabled.Therefore, we have fewer binaries in PIE than NORMAL.
Function inlining embeds callee functions into the body of the caller.This can make presemantic features largely vary.Therefore, we investigate the effect of function inlining on BCSA by explicitly turning off the inline optimization with the fno-inline option.
LTO is an optimization technique that operates at link time.It removes unnecessary code blocks, thereby reducing the number of presemantic features.However, it has also been less studied in BCSA.We were only able to successfully compile 29 packages when the LTO option was enabled.
Finally, the OBFUSCATION dataset uses Obfuscator-LLVM [105] to obfuscate the target binaries.We chose Obfuscator-LLVM among various other tools previously used [105], [106], [107], [108], [109], [110] because it is the most commonly used [20], [31], [61], [67], [70], and we can directly compare the effect of obfuscation using the vanilla LLVM compiler.We use Obfuscator-LLVM's latest version with four obfuscation options: instruction substitution (SUB), bogus control flow (BCF), control flow flattening (FLA), and a combination of all the options.We regard each option as a distinct compiler, as shown in the Comp column of Table 3.One can obfuscate a single binary multiple times.However, we only applied it once.This is because obfuscating a binary multiple times could emit a significantly large binary, which becomes time-consuming for IDA Pro to preprocess.For example, when we obfuscate a2ps twice with all three options, the compiled binary reaches over 30 MB, which is 30 times larger than the normal one.
The number of packages and that of compiler options used in compiling each dataset differ because some packages can be compiled only with a specific set of compile options and targets.Some packages fail to compile because they have architecture-specific code, such as inline assemblies, or because they use compiler-specific grammars.For example, Clang does not support both the LTO option and the Os option to be turned on.There are also cases where packages have conflicting dependencies.We also excluded the ones that did not compile within 30 min because some packages require a considerable amount of time to compile.For instance, smalltalk took more than 10 h to compile with the obfuscation option enabled.
To summarize, BINKIT contains 243,128 binaries and 36,256,322 functions in total, which is indeed many orders of magnitude larger than the other benchmarks that appear in the previous literature.The Source column of Table 2 shows the difference clearly.BINKIT does not include firmware images because our goal is to automatically build a benchmark with clear ground truth.One may extend our benchmark with firmware images.However, it would take significant manual effort to identify their ground truth.For additional details regarding each package, please refer to Table 12 in the Appendix.
Our benchmark and compilation scripts are available on GitHub.Our compilation environment is based on Crosstool-NG [111], GNU Autoconf [112], and Linux Parallels [113].Through this environment, we compiled the entire datasets of BINKIT in approximately 30 h on our server machine with 144 Intel Xeon E7-8867v4 cores.

Building Ground Truth
Next, we establish the ground truth for our dataset.We first define the criteria for determining the equivalence of two functions.In particular, we check whether two functions with the same name originated from the same source files and have the same line numbers.Additionally, we verify that both functions come from the same package and have the same name in their binaries to ensure their equivalence.
Based on these criteria, we constructed the ground truth by performing the following steps.First, we compiled all the binaries with debugging information using the -g option.We then leveraged IDA Pro [95] to identify functions in the compiled binaries.Next, we labeled each identified function with its name, package name, binary name, as well as the name of the corresponding source file and line numbers.To achieve this, we wrote a script that parses the debugging information from each binary.
Using this information, we then sanitize our dataset to avoid having incorrect or biased results.Among the identified functions, we selected only the ones in the code (.text) segments, as functions in other segments may not include valid binary code.For example, we disregarded functions in the Procedure Linkage Table (.plt)sections because these functions are wrappers to call external functions and do not include actual function bodies.In our dataset, we filtered out 40% of the identified functions in this step.
We also disregarded approximately 4% of the functions that are generated by the compiler, but not by the application developers.We can easily identify such compiler intrinsic functions by checking the corresponding source files and line numbers.For example, GCC utilizes intrinsic functions such as __udivdi3 in libgcc2.cor __aeabi_uldivmod in bpabi.S to produce highly optimized code.
Additionally, we removed duplicate functions within the same project/package.Two different binaries often share the same source code, especially when they are in the same project/package.For example, the GNU coreutils package contains 105 different executables that share 80% of the functions in common.We removed duplicate functions within each package by checking the source file names and their line numbers.Moreover, compilers can also generate multiple copies of the same function within a single binary due to optimization.These functions share the same source code but have a difference in their binary forms.For example, some parts of the binary code are removed or reordered for optimization purposes.As these functions share a large portion of the code, considering all of them would produce a biased result.To avoid this, we selected only one copy for each of the functions in our experiments.This step filtered out approximately 54% of the remaining functions.The last column of Table 3 reports the final counting results, which is the number of unique functions.
By performing all the above steps, we can automatically build large-scale ground truth data.The total time spent building the ground truth of all our datasets was 13,300 seconds.By leveraging this ground truth data, we further investigate the remaining research questions (i.e., RQ2-RQ4) in the following sections.To encourage further research, we have released all our datasets and source code.

BUILDING AN INTERPRETABLE MODEL
Previous BCSA techniques focused on achieving a higher accuracy by leveraging recent advances in deep learning techniques [12], [13], [19], [25].This often requires building a complicated model, which is not straightforward to understand and hinders researchers from reasoning about the BCSA results and further answering the fundamental questions regarding BCSA.Therefore, we design an interpretable model for BCSA to answer the research questions and implement TIKNIB, which is a BCSA tool that employs the model.This section illustrates how we obtain such a model and how we set up our experimental environment.

TIKNIB Overview
At a high level, TIKNIB leverages a set of presemantic features widely used in the previous literature to reassess the effectiveness of presemantic features (RQ2).It evaluates each feature in two input functions, based on our similarity scoring metric ( §4.3), which directly measures the difference between each feature value.In other words, it captures how much each feature differs across different compile options.Note TIKNIB is intentionally designed to be simple so that we can answer the research questions presented in §2.3.Despite the simplicity of our approach, TIKNIB still produces a high accuracy rate that is comparable to state-ofthe-art tools ( §5.2).We are not arguing here that TIKNIB is the best BCSA algorithm.

Features Used in TIKNIB
Recall from RQ2, one of our goals is to reconsider the capability of presemantic features.Therefore, we focus on choosing various presemantic features used in the previous BCSA literature instead of inventing novel ones.
However, creating a comprehensive feature set is not straightforward because of the following two reasons.First, there are numerous existing features that are similar to one another, as discussed in §2.Second, some features require domain-specific knowledge, which is not publicly available.For example, several existing papers [11], [12], [13], [17], [18], [23], [61], [65] categorize instructions into semantic groups.However, grouping instructions is largely a subjective task, and there is no known standard for it.Furthermore, most existing works do not make their grouping algorithms public.
We address these challenges by (1) manually extracting representative presemantic features and (2) open-sourcing our feature extraction implementation.Specifically, we focus on numeric presemantic features.Because these features are represented as numbers, the relationship among their values across different compile options can be easily observed.
Table 4 summarizes the selected features.Our feature set consists of CFG-and CG-level numeric features as they can effectively reveal structural changes in the target code.In particular, we utilize features related to basic blocks, CFG edges, natural loops, and strongly connected components (SCCs) from CFGs, by leveraging NetworkX [114].We also categorize instructions into several semantic groups based on our careful judgment by referring to the reference manuals [115], [116], [117] and leveraging Capstone [118]'s internal grouping.Next, we count the number of instructions in each semantic group per each function (i.e., CFG).Additionally, we take six features from CGs.The number of callers and callees represents a unique number of outgoing and incoming edges from CGs, respectively.
To extract these features, we conducted the following steps.First, we pre-processed the binaries in BINKIT with IDA Pro [95].We then generated the ground truth of these binaries as we described in §3.2.For those functions of which we have the ground truth, we extracted the aforementioned features.Table 5 shows the time spent for each of these steps.The IDA pre-processing took most of the time as IDA performs various internal analyses.Meanwhile, the feature extraction took much less time as it merely operates on the precomputed results from the pre-processing step.

Scoring Metric
Our scoring metric is based on the computation of the relative difference [119] between feature values.Given two functions A and B, let us denote a value of feature f for each function as A f and B f , respectively.Recall that any feature in TIKNIB can be represented as a number.We can compute the relative difference δ between the two feature values as follows: Let us suppose we have N distinct features (f 1 , f 2 , • • • , f N ) in our feature set.We can then define our similarity score s between two functions A and B by taking the average of relative differences for all the features as follows: Although each numeric feature can have a different range of values, TIKNIB can effectively handle them using relative differences by representing the difference of each feature with a value between 0 and 1.Therefore, the score s is always within the range of 0 to 1.
Furthermore, we can intuitively understand and interpret the BCSA results using our scoring metric.For example, suppose there are two functions A and B derived from the same source code with and without compiler option X, respectively.If the relative difference of the feature value f between the two functions is small, it implies that f is a robust feature against compiler option X.
In this paper, we focus only on simple relative differences, rather than exploring complex relationships among the features for interpretability.However, we believe that our approach could be a stepping-stone toward fabricating more improved interpretable models to understand such complex relationships.by the number of functions (the last column of Table 3).

Feature Selection
Based on our scoring metric, we perform lightweight preprocessing to select useful features for BCSA as some features may not help in making a distinction between functions.To measure the quality of a given feature set, we compute the area under the receiver operating characteristic (ROC) curve (i.e., the ROC AUC) of generated models.Suppose we are given a dataset in BINKIT, which is generated from source code containing N unique functions.In total, we have a maximum of N • M functions in our dataset, where M is the number of combinations of compiler options used to generate the dataset.The actual number of functions can be less than N • M due to function inlining.For each unique function λ, we randomly select two other functions with the following conditions.(1) A true positive (TP) function, λ TP , is generated from the same source code as in λ, with different compiler options, and (2) a true negative (TN) function, λ TN , is generated from source code that is different from the one used to generate λ, with the same compiler options as for λ TP .We generate such pairs for each unique function, thereby acquiring around 2 • N function pairs.We then compute the similarity scores for the functions in each pair and their AUC.
We note that the same methodology has been used in prior works [12], [13].We chose the method as it allows us to efficiently analyze the tendency over a large-scale dataset.One may also consider top-k [12], [13], [14], [20] or precision@k [12], [20] as an evaluation metric, but this approach has too much computational overhead: O((N • M ) 2 ) operations.
Unfortunately, there is no efficient algorithm for selecting an optimal feature subset to use; it is indeed a wellknown NP-hard problem [120].Therefore, we leverage a greedy feature selection algorithm [121].Starting from an empty set F, we determine whether we can add a feature to F to increase its AUC.For every possible feature, we make a union with F and compute the corresponding AUC.We then select one that maximizes the AUC and update F to include the selected feature.We repeat this process until the AUC does not increase further by adding a new feature.Although our approach does not guarantee finding an optimal solution, it still provides empirically meaningful results, as we describe in the following sections.

Experimental Setup
For all experiments in this study, we perform 10-fold crossvalidation on each test.When we split a test dataset, we ensure functions that share the same source code (i.e., source file name and line number) are either in a training or testing set, but not in both.For each fold, during the learning phase, i.e., the feature selection phase, we select up to 200K functions from a training set and conduct feature selection, as training millions of functions would take a significant amount of time.Limiting the number of functions for training may degrade the final results.However, when we tested the number of functions from 100K to 1000K, the results remained almost consistent.In the validation phase, we test all the functions in the testing set without any sampling.Thus, after 10-fold validation, all the functions in the target dataset are tested at least once.

PRESEMANTIC FEATURE ANALYSIS (RQ2)
We now present our experimental results using TIKNIB on the presemantic features ( §4.2) to answer RQ2 ( §2.3).With our comprehensive analysis of these features, we obtained several useful insights for future research.In this section, we discuss our findings and lessons learned.

Analysis Result
To analyze the impact of various compiler options and target architectures on BCSA, we conducted a total of 72 tests using TIKNIB.We conducted the tests on our benchmark, BINKIT, with the ground truth that we built in §3.Table 6 describes the experimental results where each column corresponds to a test we performed.Note that we present only 26 out of 72 tests because of the space limit.Unless otherwise specified, all the tests were performed on the NORMAL dataset.As described in §4.4,we prepared 10-fold sets for each test.We divided the tests into seven groups according to their purposes, as shown in the top row of the table.For example, the Arch group contains a set of tests to evaluate each feature against varying target architectures.
For each test, we select function pairs for training and testing as described in §4.4.That is, for a function λ, we select its corresponding functions (i.e., λ TP and λ TN ).Therefore, N functions produce 2•N functions pairs.The first row ( 1 ) of Table 6 shows the number of function pairs for each test.When selecting these pairs, we deliberately choose the target options based on the goal of each test.For instance, we test the influence of varying the target architecture from x86 to ARM (x86 vs. ARM column of Table 6).For each function λ in the x86 binaries of our dataset, we select both λ TP and λ TN from the ARM binaries compiled with the same compiler option as in λ.In other words, we fix all the other options, except for the target architecture for choosing λ TP and λ TN so we can focus on our testing goal.The same rule applies to other columns.For the Rand.columns, we alter all the compiler options in the group randomly to generate function pairs.
The second row ( 2 ) of Table 6 presents the time spent for training and testing in each test, which excludes the time for loading the function data on the memory.The average time spent for a single function was less than 1 ms.
Each cell in the third row ( 3 ) of Table 6 represents the average of δ(λ f , λ TN f ) − δ(λ f , λ TP f ) for feature f , which we call the TP-TN gap of f .This TP-TN gap measures the similarity between λ TP and λ, as well as the difference between λ TN and λ, in terms of the target feature.Thus, when the gap of a feature is larger, its discriminative capability for BCSA is higher.As we conduct 10-fold validation for each test, we highlight the cells with gray when the corresponding feature is chosen in all ten trials.Such features show relatively higher TP-TN gaps than the others do in each test.We also present the average TP-TN gaps in the fourth row ( 4 ) of the table.
The average number of the selected features in each test is shown in the fifth row ( 5 ) of Table 6.A few presemantic features could achieve high AUCs and average precisions (APs), as shown in the sixth row ( 6 ) and seventh row ( 7 ) of the same table, respectively.We now summarize our observations as follows.

Optimization is largely influential
Many researchers have focused on designing a model for cross-architecture BCSA [11], [15], [18], [22], [33].However, our experimental results show that architecture may not be the most critical factor for BCSA.Instead, optimization level was the most influential factor in terms of the relative difference between presemantic features.In particular, we measured the average TP-TN gap of all the presemantic features for each test (Avg. of TP-TN Gap row of the table) and found that the average gap of the O0 vs. O3 test (0.41) is less than that of the x86 vs. ARM test (0.46) and the x86 vs. MIPS test (0.42).Furthermore, the optimization level random test (Rand.column of the Opt Level group) shows the lowest AUC (0.96) compared to that of the architecture and compiler group (0.98).These results confirm that compilers can produce largely distinct binaries depending on the optimization techniques used; hence, the variation among the binaries due to the optimization is considerably greater than that due to the target architecture on our dataset.

Compiler version has a small impact
Approximately one-third of the previous benchmarks shown in Table 2 employ multiple versions of the same compiler.However, we found that even the major versions of the same compiler produce similar binaries.In other words, compiler versions do not heavily affect presemantic features.Although Table 6 does not include all the tests we performed because of the space constraints, it is apparent from the Compiler column that the two tests between two different versions of the same compiler, i.e., GCC v4 vs. GCC v8 and Clang v4 vs. Clang v7, have much higher TP-TN gaps (0.52) than other tests, and their AUCs are close to 1.0.

GCC and Clang have diverse characteristics
Conversely, the GCC vs. Clang test resulted in the lowest TP-TN gap (0.44) and AUC (0.97) among the tests in the Compiler group.This can be because each compiler employs a different back-end, thereby producing different binaries.Another potential problem is that the techniques inside each optimization level can vary depending on the compiler.We detail this in §7.2.0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 7 Average Precision (AP) 0.95 0.93 0.99 0.97 1.00 1.00 0.97 0.99 0.99 0.97 0.98 0.99 1.00 0.98 0.96 0.99 0.98 0.98 0.98 1.00 0.98 0.98 1.00 0.95 0.93 0.90 Std. of AP 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 All values in the table are 10-fold cross validation averages.We color a cell gray if a feature was consistently selected (i.e., 10 times) during the 10-fold validation.Due to the space constraints, we only display features that have been selected at least once during the 10-fold validation.† We compare a function from the NORMAL to the corresponding function in each target dataset.‡ We match functions whose compiler options are largely distant to test for bad cases.Please refer to §5.1.8for additional information.↑ These in the first rows ( 1 ) are divided by 10 4 instead of 10 6 .
* The train and test times in the seconds rows ( 2 ) do not include the time for data loading.

ARM binaries are closer to x86 binaries than MIPS
The tests in the Arch group measure the influence of target architectures with the NORMAL dataset.Overall, the target architecture did not have much of an effect on the accuracy rate.The AUCs were over 0.98 in all the cases.Surprisingly, the x86 vs. ARM test had the highest TP-TN gap (0.46) and AUC (1.0), indicating that the presemantic features of the x86 and ARM binaries are similar to each other, despite being distinct architectures.The ARM vs. MIPS test showed a lower TP-TN gap (0.43) and AUC (0.98) although both of them are RISC architectures.Additionally, the effect of the word size (i.e., bits) and endianness was relatively small.Nevertheless, we cannot rule out the possibility that our feature extraction for MIPS binaries is erroneous.We further discuss this issue in §7.1.

Closer optimization levels show similar results
We also measured the effect of size optimization (Os) by matching function λ in the NORMAL dataset with a function (λ TP and λ TN ) in the SIZEOPT dataset.Subsequently, the binaries compiled with the Os option were similar to the ones compiled with the O1 and O2 options.This is not surprising because Os enables most of the O2 techniques in both GCC and Clang [125], [126].Furthermore, we observe that the O1 and O2 options produce similar binaries, although this is not shown in Table 6 due to the space limit.

Extra options have less impact
To assess the influence of the PIE, no-inline, and LTO options, we compared functions in the NORMAL dataset with those in the PIE, NOINLINE, and LTO datasets, respectively.For the no-inline test, we limit the optimization level from O1 to O3 as function inlining is applied from O1.It was observed that the influence of such extra options is not significant.Binaries with and without the PIE option were similar to each other because it only changes the instructions to use relative addresses; hence, it does not affect our presemantic features.Function inlining also does not affect several features, such as the number of incoming calls, which results in a high AUC (0.97).LTO does not exhibit any notable effects either.However, by analyzing each test case, we found that some options affect the AUC more than others.For example, in the no-inline test, the AUC largely decreases as the optimization level increases: O1 (0.995), O2 (0.981), and O3 (0.967).This is because as more optimization techniques are applied, more functions are inlined and transformed in the NORMAL, while their corresponding functions in the NOINLINE are not inlined.On the other hand, in the LTO test, the AUC increases as the version of Clang increases: v4 (0.956), v5 (0.968), v6 (0.986), and v7 (0.986).In contrast, GCC shows stable AUCs (0.987-0.988) across all versions, and all the AUCs are higher than those of Clang.This result indicates that varying multiple options would significantly affect the success rate, which we describe below.

Obfuscator-LLVM does not affect CG features
Many previous studies [20], [31], [61], [67], [70] chose Obfuscator-LLVM [105] for their obfuscation tests as it significantly varies the binary code [20].However, applying all of its three obfuscation options shows an AUC of 0.95 on our dataset, which is relatively higher than that of the optimization level tests.Obfuscation severely decreases the average TP-TN gaps except for CG features.This is because Obfuscator-LLVM applies intra-procedural obfuscation.The SUB obfuscation substitutes arithmetic instructions while preserving the semantics; the BCF obfuscation notably affects CFG features by adding bogus control flows; the FLA obfuscation changes the predicates of control structures [127].However, none of them conducts interprocedural obfuscation, which modifies the function call relationship.Thus, we encourage future studies to use other obfuscators, such as Themida [128] or VMProtect [107], for evaluating their techniques against inter-procedural obfuscation.

Comparison target option does matter
Based on the experimental results thus far, we perform extra tests to understand the influence of comparing multiple compiler options by intentionally selecting λ TP and λ TN from binaries that could provide the lowest TP-TN gap.In this study, we present two of them because of the space limit.Specifically, for the first test, we selected functions from 32-bit ARM binaries compiled using GCC v4 with the O0 option, and the corresponding λ TP and λ TN functions from 64-bit MIPS big-endian binaries compiled using Clang v7 with the O3 option.For the second test, we changed the Clang compiler to the Obfuscator-LLVM with all three obfuscation options turned on.The Bad column of the table summarizes the results.The AUC in both cases was approximately 0.93 and 0.91, respectively.Their average TP-TN gaps were also significantly lower (0.42 and 0.27) than those in the other tests.This signifies the importance of choosing the comparison targets for evaluating BCSA techniques.Existing BCSA research compares functions for all possible targets in a dataset, as shown in the Rand.tests in this study.However, our results suggest that researchers should carefully choose evaluation targets to avoid overlooking the influence of bad cases.

Comparison Against State-of-the-Art Techniques
From our experiments in §5.1, we show that using only presemantic features with a simple linear model (i.e., TIKNIB) is enough to obtain high AUC values.Next, we compare TIKNIB with state-of-the-art techniques.
To accomplish this, we chose one of the latest approaches, VulSeeker [13], as our target because it utilizes both presemantic and semantic features in a numeric form by leveraging neural network-based post-processing.Thus, As the index of the dataset grows, the number of packages, architectures, and compiler options increases ASE1 and ASE3 are the datasets used in VulSeeker [13].For all datasets, the optimization levels are O0-O3.we can directly evaluate our simple model using numeric presemantic features.Note that our goal is not to claim that our approach is better, but to demonstrate that the proper engineering of presemantic features can achieve results that are comparable to those of state-of-the-art techniques.
For this experiment, we prepared the datasets of VulSeeker, along with the additional ones as listed in Table 7.We refer to these datasets as ASE1 through ASE4.ASE1 and ASE3 are the ones used in VulSeeker, and ASE2 and ASE4 are extra ones with more packages, target architectures, and compilers.Note that the number of packages, architectures, and compiler options increases as the index of the dataset increases.The optimization levels for all datasets are O0-O3.We intentionally omitted firmware images used in the original paper, as they do not provide solid ground truth.For each dataset, we established the ground truth in the same way described in §3.2.The time spent for IDA preprocessing, ground truth building, and feature extracting was 2197 s, 889 s, and 239 s, respectively.We then conducted experiments with the methodology explained in §4; note that the same methodology was used in the original paper.
Figure 2 depicts the results.Figure 2a shows that the AUCs of TIKNIB on ASE1 and ASE3 are 0.9724 and 0.9783, respectively.However, those of VulSeeker were 0.99 and 0.8849 as reported by the authors [13].Figure 2b illustrates that the AUC of each fold in ASE3 ranged from 0.9777 to 0.9793, which is higher than that of VulSeeker (0.8849).Therefore, TIKNIB was more robust than VulSeeker in terms of the size and compile options in the dataset.TIKNIB also exhibits stable results, even for ASE2 and ASE4.
From these results, we conclude that presemantic features combined with proper feature engineering can achieve results that are comparable to those of state-of-the-art BCSA techniques.Although our current focus is on comparing feature values, it is possible to extend our work to analyze the complex relationships among the features by utilizing advanced machine learning techniques [12], [13], [19], [20], [21], [23], [24], [25], [26], [27], [28], [29].We tested three 64-bit architectures (aarch64, x86-64, and mips64el) that are widely used in software packages.All rank and precision@1 values are averaged; because each test involves multiple combinations of options (the first row), their results are averaged.* These are the results of the vulnerable tls1_process_heartbeat function in OpenSSL v1.0.1f.† This is the result of the vulnerable dtls1_process_heartbeat function in OpenSSL v1.0.1f, which is similar to but distinct from the tls1_process_heartbeat function.‡ These are the results of their patched versions in OpenSSL v1.0.1u.
We chose the tls1_process_heartbeat function in the OpenSSL package as our target function because it contains the infamous Heartbleed vulnerability (i.e., CVE-2014-0160), which has been widely used in prior studies for evaluation [11], [12], [20], [23].We utilized two versions of OpenSSL in the ASE4 dataset shown in Table 7: v1.0.1f contains the vulnerable function, while v1.0.1u contains the patched version.As the dataset was compiled with 288 distinct combinations of compiler options and architectures, each function has 576 samples: 288 (the number of possible combinations) × 2 (the number of available OpenSSL versions) ≈ 576.
Notably, testing all possible combinations of options entails a significant computational overhead; it requires 288 (the number of options for our target function) × 287 (the number of options for a function in OpenSSL) × 2 (the number of OpenSSL versions) × 5K (the number of functions in OpenSSL) ≈ 826M operations.Therefore, we focused on architectures and compiler options that are widely used in software packages.Specifically, we chose three 64-bit architectures (aarch64, x86-64, and mips64el) and two levels of optimization (O2-O3).This setup reflects real-world scenarios, as many software packages use O2-O3 by default: coreutils uses O2, while OpenSSL uses O3.Previous studies [20], [59] also used the same setup (O2-O3) except that they only tested x86 binaries.Additionally, we selected four compilers (Clang v4.0, Clang, v7.0, GCC v4.9.4, and GCC v8.2.0) to consider extreme cases.Consequently, there were 24 possible combinations of these architectures and compiler options.
We conducted a total of 552 tests on these 24 option combinations: 24 (the number of options for our target function) × 23 (the number of options for a function in OpenSSL).
For each test, we simply computed the similarity scores for all function pairs using TIKNIB and checked the rank of the vulnerable function.To reflect real-world scenarios, we assumed in all tests that we were not aware of the precise optimization level, compiler type, or compiler version of the testing binary.On the other hand, we assumed that we could recognize the architecture of the testing binary as it is straightforward.Therefore, when we train TIKNIB, we chose a feature set that achieved the best performance across all possible combinations of optimization levels, compiler types, and compiler versions, while setting the source and target architectures fixed.For training, we used the NOR-MAL dataset (Table 3) as it does not include OpenSSL; thus, the training and testing datasets are completely distinct.
Table 8 summarizes the results, with each column corresponding to the tests for the specified options.We organized the results by the option group specified in each column after running all 522 tests.The first row of the table (# of Option Pairs) indicates the total number of option pairs, which is the same as that of true positive pairs.The remaining rows of the table show the averaged values obtained by the option pair tests.For example, the All to All column represents the averaged results of all possible combinations (24 × 23).The ARM to MIPS column, on the other hand, represents the averaged results of all combinations with the source and target architectures set to ARM and MIPS, respectively.That is, we queried the vulnerable functions compiled with ARM and searched for their true positives compiled with MIPS while varying the other options.
In the majority of the tests, TIKNIB successfully identified the vulnerable function with a rank close to 1.0 and a precision@1 close to 1.0, demonstrating its effectiveness in vulnerability discovery.Meanwhile, it performed marginally worse in the tests for MIPS.This result corroborates our observation in §5.1 that feature extraction for MIPS binaries can be erroneous.We further discuss this issue in §7.1.Additionally, the last three rows of Table 8 display the ranks of additional functions worth noting.The dtls represents the DTLS implementation of our target function (i.e., dtls1_process_heartbeat), which also contains the same vulnerability.Due to its similarity to our target function, it was ranked highly in all tests.The last two rows of the table present the ranks of the patched versions of these two functions in OpenSSL v1.0.1u.Notably, the patch of the vulnerability affects the presemantic features of these functions, particularly the number of control transfer and arithmetic instructions.Consequently, the patched functions had a low rank.

Analyzing Real-World Vulnerabilities on Firmware Images of IoT Devices
We further evaluate the efficacy of presemantic features by identifying vulnerable functions in real-world firmware images of IoT devices using TIKNIB.For the firmware images, 82 ( 82%) † Among 43 BCSA papers that we studied in §2, 10 released their source code, and two of these 10 support both ARM and MIPS architectures (Gemini [12] and VulSeeker [13]).However, we were not able to compare the results directly because these tools released neither their firmware datasets nor complete source code.Here, we present the results stated in the latest one [13]; note that their firmware dataset is different from the one that we used.
we utilized the firmware dataset of FirmAE [129], which is one of the industry-leading large-scale firmware emulation frameworks.The dataset consists of 1,124 firmware images of wireless routers and IP cameras from the top eight vendors.
Particularly, we search for another infamous vulnerability (CVE-2015-1791) from OpenSSL, which has a race condition error in the ssl3_get_new_session_ticket() function.This vulnerability has also been extensively used in previous studies [12], [13], [23].Since there exist numerous functions (≈52M) in the firmware images, identifying the vulnerable function among them is sufficient to evaluate the impact of presemantic features.We evaluate our system, TIKNIB, against state-of-theart techniques that support both ARM and MIPS architectures [12], [13].Notably, these two architectures are prevalent in IoT devices [129].However, while analyzing the repositories of these tools, we found that they did not include their complete source code nor datasets.As a result, we were unable to directly compare our system to theirs.Instead, we compared the results to the ones stated in the paper [13].Specifically, we compiled the vulnerable version of OpenSSL (i.e., v1.0.1f) using a variety of compiler options and architectures, including six architectures (x86, ARM, and MIPS, each with 32 and 64 bits), two compilers (GCC v4.9.4 and v5.5.0), and two optimization levels (O2-O3).Here, we used two optimization levels (O2-O3) because many real-world software packages use them by default, as described in §5.3.Consequently, we obtained 24 samples of the vulnerable function.Notably, this dataset is essentially a subset of the ASE3 dataset, which is introduced in Table 7.Then, we queried each sample vulnerable function against all 52M functions in the 1,124 firmware images.This resulted in 24 similarity scores for each of the 52M functions.We then calculated the top-k result by averaging the similarity scores for each function.Finally, we manually counted the number of functions that were actually vulnerable in the top-100 results.
Table 9 summarizes the top-k results for the average similarity score for all 52M firmware functions.While our dataset is distinct from those used in the previous studies [12], [13], TIKNIB equipped with presemantic features achieved a level of performance comparable to that of the state-of-the-art tools.It should be noted that our objective is not to assert that our approach is superior to the state-of-theart tools, but rather to demonstrate the efficacy of appropriately utilizing presemantic features.Additionally, our experimental results indicate that the real-world IoT firmware images (at least those that we tested) are highly likely to be compiled with O2 or O3.

BENEFIT OF TYPE INFORMATION (RQ3)
To assess the implication of debugging information on BCSA, we use type information as a case study on the presumption that they do not vary unless the source code is changed.Specifically, we extract three types of features per function: the number of arguments, the types of arguments, and the return type of a function.Note that inferring the correct type information is challenging and is actively researched [130], [131].In this context, we only consider basic types: char, short, int, float, enum, struct, void, and void * .To extract type information, we create a type map to handle custom types defined in each package by recursively following definitions using Ctags [132].We then assign a unique prime number as an identifier to each type.To represent the argument types as a single number, we multiply their type identifiers.
To investigate the benefit of these type features, we conducted the same experiments described in §5, and Table 10 presents the results.Here, we explain the results by comparing them with Table 6, which we obtained without using the type features.The first row of Table 10 shows that the average number of selected features, including type features, is smaller than that of selected features ( 5 ) in Table 6.Note that all three type features were always selected in all tests.The second row in Table 10 shows that utilizing the type features could achieve a large TP-TN gap on average (over 0.50); the corresponding values in 4 of Table 6 are much smaller.Consequently, the AUC and AP with type features reached over 0.99 in all tests, as shown in the last two rows of Table 10.Additionally, it shows a similar result (i.e., an AUC close to 1.0) on the ASE datasets that we utilized for the state-of-the-art comparison ( §5.2).
This result confirms that type information indeed benefits BCSA in terms of the success rate, although recovering such information is a difficult task.Therefore, we encourage further research on BCSA to take account of recovering debugging information, such as type recovery or inference, from binary code [130], [131], [133], [134], [135], [136].

FAILURE CASE INQUIRY (RQ4)
We carefully analyzed the failure cases in our experiments and found their causes.It was possible because our benchmark (i.e., BINKIT) has the ground truth and our tool (i.e., TIKNIB) uses an interpretable model.We first checked the TP-TN gap of each feature for failure cases and further analyzed them using IDA Pro.We found that optimization largely affects the BCSA performance, as described in §5.1.In this section, we discuss other failure causes and summarize the lessons learned; however, many of these causes are closely related to optimization.We categorized the causes into three cases: (1) errors in binary analysis tools ( §7.1), (2) differences in compiler back-ends ( §7.2), and (3) architecture-specific code ( §7.3).0.99 0.99 1.00 0.99 1.00 1.00 0.99 0.99 1.00 0.99 1.00 1.00 1.00 1.00 0.99 1.00 1.00 0.99 1.00 1.00 1.00 1.00 1.00 0.99 0.99 0.99 Average Precision (AP) 0.99 0.99 1.00 0.99 1.00 1.00 0.99 0.99 1.00 0.99 1.00 1.00 1.00 1.00 0.99 1.00 1.00 0.99 1.00 1.00 1.00 1.00 1.00 0.99 0.99 0.98 All values in the table are 10-fold cross validation averages.† We compare a function from the NORMAL to the corresponding function in each target dataset.‡ We match functions whose compiler options are largely distant to test for bad cases.Please refer to §5.1.8for additional information.

Errors in Binary Analysis Tools
Most BCSA research heavily relies on COTS binary analysis tools such as IDA Pro [95].However, we found that IDA Pro can yield false results.First, IDA Pro fails to analyze indirect branches, especially when handling MIPS binaries compiled with Clang using the position-independent code (PIC) option.The PIC option sets the compiler to generate machine code that can be placed in any address, and it is mainly used for compiling shared libraries or PIE binaries.Particularly, compilers use register-indirect branch instructions, such as jalr, to invoke functions in a position-independent manner.For example, when calling a function, GCC stores the base address of the Global Offset Table (GOT) in the gp register, and uses it to calculate the function addresses at runtime.In contrast, Clang uses the s0 or v0 register to store such base addresses.This subtle difference confuses IDA Pro and makes it fail to obtain the base address of the GOT, so that it cannot compute the target addresses of indirect branches.
Moreover, IDA Pro sometimes generates incomplete CFGs.When there is a switch statement, compilers often make a table that stores a list of jump target addresses.However, IDA Pro often failed to correctly identify the number of elements in the table, especially on the ARM architecture, where switch tables can be placed in a code segment.Sometimes, switch tables are located between basic blocks, and it is more difficult to distinguish them.
The problem worsens when handling MIPS binaries compiled for Clang with PIC, because switch tables are typically stored in a read-only data section, which can be referenced through a GOT.Therefore, if IDA Pro cannot fully analyze the base address of the GOT, it also fails to identify the jump targets of switch statements.
As we manually analyzed the errors, we may have missed some.Systematically finding such errors is a difficult task because the internals of many disassembly tools are not fully disclosed, and they differ significantly.One may extend the previous study [96] to further analyze the errors of disassembly tools and extracted features, and we leave this for future studies.
During the analysis, we found that IDA Pro also failed to fetch some function names if they had a prefix pre-defined in IDA Pro, such as off_ or sub_.For example, it failed to fetch the name of the off_to_chars function in the tar package.We used IDA Pro v6.95 in our experiments, but we found that its latest version (v7.5) does not have this issue.

Diversity of Compiler Back-ends
From §5.1, the characteristics of binaries largely vary depending on the underlying compiler back-end.Our study reveals that GCC and Clang emit significantly different binaries from the same source code.First, the number of basic blocks for the two compilers significantly differs.To observe how the number changes depending on different compiler options and target architectures, we counted the number for the NORMAL dataset.Figure 3 illustrates the number of functions and basic blocks in the dataset for selected compiler options and architectures (see Appendix for details).As shown in the figure, the number of basic blocks in binaries compiled with Clang is significantly larger than that in binaries compiled with GCC for O0.We figured out that Clang inserts dummy basic blocks for O0 on ARM and MIPS; these dummy blocks have only one branch instruction to the next block.These dummy blocks are removed when the optimization level increases (O1) as optimization techniques in Clang merge such basic blocks into their predecessors.
In addition, the two compilers apply different internal techniques for the same optimization level, while they express the optimization level with the same terms (i.e., O0-O3 and Os).In particular, by analyzing the number of caller and callee functions, we discovered that GCC applies function inlining from O1, whereas Clang applies it from O2. Consequently, the number of functions for each compiler significantly differs (see the number of functions in O1 for Clang and that for GCC in Figure 3).
Moreover, we discovered that two compilers internally leverage different function-level code for specific operations.For example, GCC has functions, such as __umoddl3 in libgcc2.cor __aeabi_dadd in ieee754-df.S, to optimize certain arithmetic operations.Furthermore, on x86, GCC generates a special function, such as __x86.get_pc_thunk.bx, to load the current instruction pointer to a register, whereas Clang inlines this procedure inside the target function.These functions can largely affect the call-related features, such as the number of control transfer instructions or that of outgoing calls.Although we removed these compiler-specific functions so as not to include them in our experiments ( §3.2), they may have been inlined in their caller functions in higher optimization levels (O2-O3).Considering such functions took approximately 4% of the identified functions by IDA Pro, they may have affected the resulting features.
Similarly, the two compilers also utilize different instruction-level codes.For example, in the case of move instructions for ARM, GCC uses conditional instructions, such as MOVLE, MOVGT, or MOVNE, unless the optimization level is zero (O0).In contrast, Clang utilizes regular move instructions along with branch instructions.This significantly affects the number of instructions as well as that of basic blocks in the resulting binaries.Consequently, in such special cases, the functions compiled using GCC have a relatively smaller number of basic blocks compared with those using Clang.
Finally, compilers sometimes generate multiple copies of the same function for optimization purposes.For example, they conduct inter-procedural scalar replacement of aggregates, removal of unused parameters, or optimization of cache/memory usage.Consequently, a compiled binary can have multiple functions that share the same source code but have different binary code.We found that GCC and Clang operate differently on this.Specifically, we discovered three techniques in GCC that produce function copies with special suffixes, such as .part,.cold,or .isra.For instance, for the get_data function of readelf in binutils (in O3), GCC yields three copies with the .israsuffix, while Clang does not produce any such functions.Similarly, for the tree_eval and expr_eval functions in bool (in O3), GCC produces two copies with the .coldsuffix, but Clang does not.Although we selected only one such copy in our experiments to avoid biased results ( §3.2), the other copies can still survive in their caller functions by inlining.
In summary, the diversities of compiler back-ends can largely affect the performance of BCSA, by making the resulting binaries divergent.Here, we have introduced the major issues we discovered.We encourage further studies to investigate the implications of detailed options at each optimization level across different compilers.

Architecture-Specific Code
When manually inspecting failures, we found that some packages have architecture-specific code snippets guarded with conditional macros such as #if and #ifdef directives.For example, various functions in OpenSSL, such as mul_add and BN_UMULT_HIGH, are written in architecturespecific inline assembly code to generate highly optimized binaries.This means that a function may correspond to two or more distinct source lines depending on the target architecture.
Therefore, instruction-level presemantic features can be significantly different across different architectures when the target programs have architecture-specific code snippets, and one should consider such code when designing crossarchitecture BCSA techniques.

DISCUSSION
Our study identifies several future research directions in BCSA.First, many BCSA papers have focused on building a general model that can result in stable outcomes with any compiler option.However, one could train a model targeting a specific set of compiler options, as shown in our experiment, to enhance their BCSA techniques.It is evident from our experiment's results that one can easily increase the success rate of their technique by inferring the compiler options used to compile the target binaries.There exists such an inference technique [137], and combining it with existing BCSA methods is a promising research direction.
Second, there are only a few studies on utilizing decompilation techniques for BCSA.However, our study reveals the importance of such techniques, and thus, invites further research on leveraging them for BCSA.One could also conduct a comprehensive analysis on the implication of semantic features along with decompilation techniques.
Additionally, we investigated fundamental presemantic features in this study.However, the effectiveness of semantic features is not well-studied yet in this field.Therefore, we encourage further research into investigating the effectiveness of semantic features along with other presemantic features that are not covered in the study.In particular, adopting NLP techniques would be another essential study as in many recent studies.
Recall from §2, we did not intend to completely survey the existing techniques, but instead, we focused on systematizing the fundamental features used in previous literature.Furthermore, our goal was to investigate underexplored research questions in the field by conducting a series of rigorous experiments.For a complete survey, we refer readers to the recent surveys on BCSA [152], [153].

CONCLUSION
We studied previous BCSA literature in terms of the features and benchmarks used.We discovered that none of the previous BCSA studies used the same benchmark for their evaluation, and that some of them required manually fabricating the ground truth for their benchmark.This observation inspired us to design BINKIT, the first large-scale public benchmark for BCSA, along with a set of automated build scripts.Additionally, we developed a BCSA tool, TIKNIB, that employs an interpretable model.Using our benchmark and tool, we answered less-explored research questions regarding the syntactic and structural BCSA features.We discovered that several elementary features can be robust across different architectures, compiler types, compiler versions, and even intra-procedural obfuscation.Further, we proposed potential strategies for enhancing BCSA.We conclude by inviting further research on BCSA using our findings and benchmark.

Fig. 1 :
Fig. 1: Typical workflow of binary analysis (upper) and similarity comparison (lower) in binary code similarity analysis.Tools may skip some of the steps.

Fig. 3 :
Fig. 3: The final number of functions and basic blocks in NORMAL (see Appendix for the detailed version).

TABLE 1 :
Summary of the features used in previous studies.

TABLE 2 :
Summary of the datasets used in previous studies.

TABLE 4 :
Summary of numeric presemantic features used in TIKNIB., arith, data transfer, cmp, and logic instrs.# of shift, bit-manipulating, float, misc instrs.# of arith + shift, and data transfer + misc instrs.# of all/unconditional/conditional control transfer instrs.Avg.# of edges per a basic block Avg./Sum of basic block, loop, and SCC sizes Avg.# of all, arith, data transfer, cmp, and logic instrs.

TABLE 5 :
Breakdown of the feature extracting time for BINKIT.