Fine-Grained Compiler Identification With Sequence-Oriented Neural Modeling

Different compilers and optimization levels can be used to compile the source code. Revealed in reverse from the produced binaries, these compiler details facilitate essential binary analysis tasks, such as malware analysis and software forensics. Most existing approaches adopt a signature matching based or machine learning based strategy to identify the compiler details, showing limits in either the detection accuracy or granularity. In this work, we propose NeuralCI (Neural modeling-based Compiler Identification) to infer these compiler details including compiler family, optimization level and compiler version on individual functions. The basic idea is to formulate sequence-oriented neural networks to process normalized instruction sequences generated using a lightweight function abstraction strategy. To evaluate the performance of NeuralCI, a large dataset consisting of 854,858 unique functions collected from 19 widely used real-world projects is constructed. The experiments show that NeuralCI achieves on average 98.6% accuracy in identifying the compiler family, 95.3% accuracy in identifying the optimization level, 88.7% accuracy in identifying the compiler version, 94.8% accuracy in identifying the compiler family and optimization level, and 83.0% accuracy in identifying all compiler components simultaneously, outperforming existing function-level compiler identification methods in terms of both detection accuracy and comprehensiveness.


I. INTRODUCTION
In the software production process, diverse toolchains and toolchain settings can be adopted to transform the source code to the final binary. For example, different compilers like GCC and Clang as well as different compiler options like O0-O3 can be used by the developers, in consideration of the stability and performance issue, the size requirement on the produced binary and their familiarity with the tools. Besides, it is also a common practice to apply various kinds of code obfuscation techniques [31], [32] and packers [30], [41] in the binary production process, with which developers protect the core algorithms and codes from being easily reverse engineered [13], [43], while malware writers use them to hide malicious part in the binary from being easily detected by security analysis tools [20], [33].
The associate editor coordinating the review of this manuscript and approving it for publication was Claudia Raibulet.
Usually, binaries produced with these different toolchains and toolchain settings exhibit significant differences when viewed in a straight way [37], [38], [51]. These differences just indicate that toolchain footprints are preserved during the translation process from source code to binary code, enabling the possibility of revealing the toolchain and toolchain setting choices made during the production process of a binary. This task, which in the literature is called binary program provenance analysis, provides ways to spy on the specifics of the binary production process. As its major subtask to focus on the compilation phase, compiler identification attempts to infer from a piece of binary code the compiler-related details such as the specific compiler family, the optimization options, etc., which can facilitate essential binary analysis tasks such as malware analysis and software forensics. For example, in the scenario of software plagiarism detection, where the source code of the plaintiff program is generally accessible, we can eliminate the interference of different compiler settings to the similarity analysis by recompiling the source code with the same compiler settings identified from the defendant binary program.
Overall, there have been relatively few works conducted on compiler identification, which mainly fall into two categories: signature matching based methods [6]-[8] and learning based methods [27]- [29], [39]. The former, implemented in several reverse engineering tools like IDA [6] and PEiD [8], performs whole program level identification via matching against a corpus of generic and rigid signatures. The drawbacks to these kinds of methods are the stringent expertise in constructing a good enough compiler-specific signature as well as their coarse identification granularity. The latter formulates compiler identification as a machine learning task, which trains models to capture compiler-specific patterns, and infer the compiler details on previously unseen binaries. For such kinds of methods, syntactic or structural features are extracted based on artificially defined templates, such as idioms [29] which are short sequences of instructions with wildcards, or graphlets [28] which are small subgraphs within the CFG (Control Flow Graph). As typical feature engineering based methods, their effectiveness, to a large extent, depends on the quality of expert-defined feature extraction templates, where more domain-specific knowledge is required.
In recent years, tremendous successes have been witnessed in applying natural language processing techniques and deep learning models to various program analysis tasks [16], [23], [34], [45], [48], which leverage many layers of non-linearities to capture invariances from transformation in the raw input space, and have thus automatically boosted the semantic richness of the learned representations. Inspired by these successes, in this paper, we attempt to adopt some of the most popular neural network structures to achieve fast and accurate fine-grained compiler identification at the function level. Specifically, we feed typical convolutional neural network (CNN) and recurrent neural network (RNN) based structures with normalized assembly instruction sequences to train classification models for inferring the compiler families, the optimization levels, and the compiler versions. Our intuition is based on the observation that co-occurring instructions together with their orderings in short instruction sequences form good enough signals for distinguishing different compilers or optimization levels, which can be substantially captured by neural models. Our main contributions are summarized as follows:
• We propose to reveal fine-grained compiler details for individual functions by designing a lightweight function abstraction strategy and leveraging typical sequence-oriented neural networks. It alleviates the task complexity and human bias impacts by handing over the professional process of extracting and selecting features significant for compiler identification from the domain experts to the less human intervened neural networks.
• We elaborate two neural network structures, CNN and RNN, to solve the compiler identification problem. The former contains three variations (i.e., one naïve and two attention augmented ones), and the latter contains four variations (i.e., one naïve and three attention augmented ones). All of them are implemented in a tool called NeuralCI (Neural modeling based Compiler Identification).
• We construct a large dataset consisting of 854,858 unique functions by processing a set of diverse real-world projects, and systematically evaluate and compare the performance of the proposed methods with respect to revealing compiler family, optimization level, compiler version, and compiler setting combination respectively. The experimental evaluation shows that NeuralCI achieves promising performance in revealing these fine-grained compiler details and outperforms existing function-level compiler identification methods in both detection performance and comprehensiveness. Our dataset and source code implementation of NeuralCI have been made public at https://github.com/zztian007/NeuralCI to facilitate further research.
The rest of this paper is organized as follows: Section II formulates the problem to solve and our design overview. Section III and Section IV describe in detail the considerations and designs of the function abstraction and the neural network models respectively. The dataset construction and experimental evaluations are presented in Section V. Section VI mainly discusses the limitations of our method. Section VII summarizes the related works. Finally, Section VIII concludes.

II. PROBLEM DEFINITION AND DESIGN OVERVIEW

A. PROBLEM OVERVIEW
The goal of compiler identification is to reveal in reverse from the final produced binary the compiler-related details applied in processing the program source code. The feasibility of this task lies in the usually significant differences imposed by different compiler and optimization settings. Such differences across homogenous binaries, which drive and meanwhile bother massive binary code similarity analysis researches, just indicate that the different mechanisms, heuristics and design choices within certain compilers will ultimately manifest on their outputs, providing evidence of program provenance.
To give a more intuitive sense of the differences, Figure 1 lists the corresponding binary codes produced with several different compiler and optimization settings from a simple C function. Taking the binary code in Figure 1(b) (compiled with gcc 4.7 and the generally recommended optimization level O2) as the base, we compare it against the binary code generated with every other compiler setting using the binary code comparison tool BinDiff. The detected similarity scores are 0.76, 0.23, and 0.98 respectively for the binary code pairs ⟨1b, 1c⟩, ⟨1b, 1d⟩ and ⟨1b, 1e⟩ (we use ⟨1b, 1c⟩ to denote the pair of binary codes shown in Figure 1(b) and Figure 1(c) for the sake of simplicity), illustrating notable yet varying impacts on the produced binaries via changing the compiler family, compiler optimization level or even the compiler version.
More specifically, it can be observed that the compiled binaries are distinguishing in the length of their function bodies, the specific instructions and instruction combinations used, as well as their instruction orderings. As depicted in Figure 1(b) and Figure 1(d), where the same compiler yet different optimization levels are adopted, the most significant impact on the produced binaries is observed for this short program. The code in Figure 1(d) where no optimization is applied, contains the well-known function prologue of push rbp and mov rsp, rbp that saves the caller's stack frame, and epilogue of pop rbp and retn that appears at the end of a function for restoring the stack and registers to the state before the function is called, while the code in Figure 1(b) does not use the stack at all but begins with some accesses to the function arguments passed in register esi and edi. If functions generated by the compiler gcc with O0 setting were known to almost invariably begin with the typical prologue and end with the typical epilogue, while other compiler settings hardly ever did so, then the existence of such typical instruction combinations will serve as significant features to identify compiler optimization settings.
Likewise, as illustrated in Figure 1(b) and Figure 1(c), where two different compilers gcc and clang with the same optimization level are applied, the generated function bodies are of the same length, but different instructions, including both operators and operands, are adopted to achieve the same functionality. For example, for judging the condition a > 1, gcc utilizes cmp edi, 1 and jle, while clang adopts another instruction combination of cmp eax, 2 and jl. In addition, differences are shown in the instruction layouts. For example, the starting instruction cmp esi, 1 in Figure 1(b) that judges the condition b == 1 is arranged as the third instruction in Figure 1(c). The least differences are observed between the codes in Figure 1(b) and Figure 1(e), where two different gcc versions are applied to process the source code.
As shown, their binary codes are almost identical, except that Figure 1(b) adopts the combination of two instructions mov edx, eax and sar edx, 1Fh to accomplish the same goal (setting all the bits in edx to the value of the highest bit in eax) as the single instruction cdq adopted in Figure 1(e). Despite not being as impactful on the produced binaries as the compiler family and optimization level, the compiler version can also enforce significant changes to the produced binary codes, where Figure 2 gives one such representative example taken from our constructed dataset.
49162 VOLUME 9, 2021
From the discussions above, we can see that clues for inferring the compiler-related settings are well implied in their produced binaries. Yet these clues are scattered all over the binary, and generally subtle and trivial, making them difficult to be systematically abstracted and refined to achieve effective and efficient detection without a deep and laborious large-scale analysis from domain experts. Thus, the manually identified clues or features can either be insufficient or even contrary to what intuition might suggest. For instance, the well-known push rbp and mov rsp, rbp prologue seems to be a good indicator for determining the compiler settings adopted, which, however, turns out not to be in the top ranked idioms [28], [29].
Traditional machine learning based methods generally extract an extremely large number of features with artificially defined templates, and reduce them with certain feature selection strategies to ensure scalability. Without a comprehensive understanding of the characteristics of the binary code, the compiler and the programming language, the manually crafted feature extraction and selection strategies tend to either capture massive irrelevant and redundant features that make the whole compiler identification approach non-scalable, or fail to extract subtle yet significant ones, which leads to unreliable identification results. Inspired by the tremendous successes and superior feature learning power of deep learning in various program analysis tasks [16], [23], [34], [45], [48], [51], in this work, we resort to typical neural network structures to automatically capture and select the scattered, subtle yet significant features that manifest compiler settings, so as to achieve less human intervened yet effective and efficient fine-grained compiler identification.

B. PROBLEM DEFINITION
We formally define the compiler identification task we are going to solve as follows.

Definition 1 (Fine-Grained Compiler Identification):
Given an individual function f in its binary form and stripped of debug and symbol information, we infer from it the compiler settings D adopted in the compilation phase that produces it, with a set of models M which are sought in a learning way.
The fine granularity that we want to achieve in Definition 1 is reflected in two aspects. Firstly, the object to be operated on is an individual function rather than a whole program [6], [12], [39], [45]. We refer to the individual function as an independent function, about which we know nothing else (such as its adjacent functions in the call graph) but just the function itself. It does not distinguish between compiler-related [27] or user-defined functions, which means a compiler-related function (i.e. compiler helper/utility functions) can also be an individual function. Secondly, the compiler settings D to be revealed contain more elements, including the compiler family, optimization level, compiler version or combinations of them (such as both compiler family and optimization level), than the simple compiler family [23], [29] or optimization level [14]. That is, when we say a model m ∈ M infers the compiler optimization level, we know a priori the compiler family of the binary to be detected. For example, given a binary function f compiled with gcc, we would choose the model trained on gcc-compiled (with different optimization levels) samples to identify whether it is compiled with optimization level O0 or O2. When we say a model m ∈ M infers both compiler family and optimization level, such a model can determine simultaneously whether f is compiled with gcc and O1 or clang and O3. Apparently, depending on the specific component in D to be revealed, the difficulty of training an accurate model m ∈ M varies, as different amounts of clues or footprints are preserved in the finally produced binaries under different compiler settings. Intuitively, and as confirmed in our experimental evaluations, the difficulty of learning a model that accurately works for compiler identification increases in the order of the models trained for detecting the compiler family, optimization level, and compiler version solely; $M_{combo}$ denotes the model for detecting compiler setting combinations.

C. OUR DESIGN OVERVIEW
The overview of NeuralCI, which we design to solve the compiler identification problem discussed above, is depicted in Figure 3. It consists of two phases: the training phase (upper figure) and the detection phase (lower figure). The training phase includes three steps. As NeuralCI is a deep learning-based method, the first step is to construct a high-quality dataset comprised of labeled functions, which will be discussed in Section V. The second step takes each raw function as input and outputs a normalized instruction sequence via a lightweight abstraction strategy implemented in the function abstraction module. Then these normalized sequences together with their ground truth labels are fed into the neural network-based classification module to train compiler identification models. The detection phase reads in an individual function, processes it with the function abstraction module and utilizes the trained models to produce predictions. In the following sections, we discuss the details of the function abstraction module and the neural network-based classification module, respectively. Table 1 shows some important notations used in this paper.

III. FUNCTION ABSTRACTION
A function must be represented in certain forms such that it can be processed for further analysis [16], [23], [34], [48], [51]. The typical ways include using the raw byte sequence, the assembly instruction sequence, or the control flow graph [23] to depict a function. As discussed in Section II-A and many existing works [12], [37], [38], [51], different compiler settings generally produce distinguishable assembly instructions in their specific instruction contents, orderings, and combinations. Also, as indicated by the outstanding compiler family identification accuracy in [28], short assembly instruction sequences successfully capture compiler-related features. Thus, in this work we choose to use the assembly instruction sequence within the function body as the representation of each function and use IDA Pro for the parsing.¹ That is, a function $f$ is represented as a sequence $S_f = \{ins_1, ins_2, \cdots, ins_n\}$, where $n$ denotes the number of instructions within the function, and each assembly instruction $ins_i$ consists of an opcode (i.e. mnemonic) and an ordered list of operands. Also, the sequence ensures that ∀i, j ∈
However, many existing binary analysis tasks [23], [51] suggested that it is unwise to work directly on raw assembly instructions. For our case, we want to capture features reflecting the compiler details rather than the function functionality. Using all instructions as they appear exactly in the functions may immerse us in too many details of the functionalities, which may accordingly increase the complexity of representation learning and decrease the embedding quality (as too many different instructions are preserved) in the instruction embedding phase, and distract the attention of successive neural network training as well.
On the other hand, excessive normalization to the instructions will introduce a mass of unintentional human bias, and lead to the loss of certain subtle yet significant features, such as the instance of the different ways of gcc and clang adopted to handle the a > 1 predicate.
To this end, we choose to process the raw instructions with a lightweight abstraction strategy. Specifically, we perform normalization on each assembly instruction in a function with the following rules:
• The mnemonics remain unchanged.
• All registers in the operands remain unchanged.
• All base memory addresses in the operands are substituted with the symbol MEM.
• All isolated immediates² with absolute values below a certain threshold (which is set to 5,000 in the current design) remain intact, while in all other cases the immediates in the operands are substituted with the symbol IMM.
The first two rules are devised in consideration of the significant differences caused by different compiler settings to the instructions' operators and operands of register type, which apparently form good indicators (thus should not be weakened) for inferring the settings. The third rule is to omit meaningless displacements that appear solely in the operand of the mov instruction for direct addressing. For instance, the instruction ''mov ebx, [0x3435422]'' will be transformed to ''mov ebx, MEM'' according to the first three rules. The fourth rule is to avoid considering unnecessary details while still retaining subtle yet significant ones. For example, as shown in Figure 1
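The four normalization rules can be sketched as follows. This is a minimal illustration rather than the authors' Java implementation: it assumes Intel-syntax operands separated by commas, treats an operand that is a bare number as an "isolated" immediate, and (as a further assumption) replaces numbers that appear inside other bracketed memory expressions with IMM. The function names are hypothetical.

```python
import re

IMM_THRESHOLD = 5000  # threshold stated in the text

# Numeric literal forms assumed here: 0x1F, IDA-style 1Fh, or plain decimal.
_NUM = r"(?:0x[0-9a-f]+|[0-9][0-9a-f]*h|[0-9]+)"
_WHOLE_NUM = re.compile(r"^-?" + _NUM + r"$", re.IGNORECASE)
_INNER_NUM = re.compile(r"\b" + _NUM + r"\b", re.IGNORECASE)


def _abs_value(token):
    """Absolute value of a numeric literal in one of the assumed syntaxes."""
    t = token.lower().lstrip("-")
    if t.startswith("0x"):
        return int(t, 16)
    if t.endswith("h"):
        return int(t[:-1], 16)
    return int(t)


def normalize_operand(op):
    op = op.strip()
    if _WHOLE_NUM.match(op):
        # Rule 4: isolated immediates below the threshold stay intact.
        return op if _abs_value(op) < IMM_THRESHOLD else "IMM"
    if op.startswith("[") and op.endswith("]"):
        inner = op[1:-1].strip()
        if _WHOLE_NUM.match(inner):
            # Rule 3: a direct (base) memory address becomes MEM.
            return "MEM"
        # Assumption: non-isolated immediates (e.g. displacements) become IMM.
        return "[" + _INNER_NUM.sub("IMM", inner) + "]"
    # Rule 2: registers (and anything unrecognized) stay unchanged.
    return op


def normalize_instruction(ins):
    parts = ins.strip().split(None, 1)
    mnemonic = parts[0]  # Rule 1: the mnemonic is kept as-is.
    if len(parts) == 1:
        return mnemonic
    operands = [normalize_operand(o) for o in parts[1].split(",")]
    return mnemonic + " " + ", ".join(operands)
```

For example, `normalize_instruction("mov ebx, [0x3435422]")` yields `mov ebx, MEM`, matching the example in the text, while `cmp edi, 1` is left untouched so that the small constant that distinguishes gcc from clang survives.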

IV. NEURAL MODELS FOR COMPILER IDENTIFICATION

A. INSTRUCTION EMBEDDING
As we adopt a learning model for detecting compiler settings, the normalized assembly instruction sequences must be transformed to numerical vectors such that they are able to be fed as inputs to the subsequent classifiers. Since our design chooses to leverage the advanced deep neural networks to grasp significant compiler-setting-related patterns, the straightforward BoW (bag of word) representation fails to work. As such, we use word embedding to distribute a vector for each unique instruction first, based on which the whole instruction sequence can then be modeled and represented.
There are several word embedding choices that can be leveraged for our application, of which the one-hot encoding has been widely deployed. It represents each unique word by an n-dimensional vector, with the i-th dimension being set to 1 and all other dimensions being set to 0, where i is the index of the word in the vocabulary of size n. This technique is computationally intractable as the generated vectors are too sparse (the same dimension as the size of the whole vocabulary) and generally needs to do joint-learning with subsequent neural networks, making the learnt word semantics significantly task-specific. In this respect, NeuralCI leverages the popular skip-gram model [24] to learn more compact vector representations that carry instruction co-occurrence relationships and lexical semantics in an independent and unsupervised manner, so as to make the learnt vectors reusable in other binary analysis tasks [16], [19], [48], [51]. Specifically, we treat each basic block as a sentence and each normalized instruction within the basic block as a word, and feed all basic blocks from our binary collection to the skip-gram model to learn for each unique instruction a d-dimensional vector, by minimizing the loss of observing an instruction's neighborhood (within a window w) conditioned on its current embedding. The objective function of skip-gram can be defined as [26]:
$$\min_{\phi} \; -\log \Pr\big(\{ins_{i-w}, \cdots, ins_{i+w}\} \setminus ins_i \mid \phi(ins_i)\big)$$
where $\phi(ins_i)$ is the current embedding $e_i$ of $ins_i$. We train the embedding model for 100 epochs with the learning rate of 0.001 and context window size w of 5.
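To make the notion of an instruction's neighborhood concrete, the sketch below enumerates the (center, context) pairs that a skip-gram model would be trained on when each basic block is treated as a sentence. It is only an illustration of the windowing step; the function name and the toy basic block are hypothetical, and the actual embedding training in the paper is done with gensim.

```python
def skipgram_pairs(basic_blocks, window=5):
    """Collect (center, context) pairs within +/- `window` inside each basic block.

    basic_blocks: list of basic blocks, each a list of normalized instructions.
    Contexts never cross a basic-block boundary, since a block is one sentence.
    """
    pairs = []
    for block in basic_blocks:
        for i, center in enumerate(block):
            lo = max(0, i - window)
            hi = min(len(block), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, block[j]))
    return pairs


# Toy basic block (hypothetical), with a window of 1 for readability.
block = ["cmp esi, 1", "mov eax, edi", "retn"]
pairs = skipgram_pairs([block], window=1)
```

With the window of 5 used in the paper, every instruction is paired with up to five predecessors and five successors in the same basic block.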

B. NEURAL NETWORK MODELS
Based on the instruction embeddings learned by skip-gram, it is promising to explore different schemes such as maxpooling, averaging or concatenation to aggregate the embeddings for each normalized instruction sequence, and then feed it to any classification model for compiler identification. However, it still faces the following two limitations: (1) skip-gram assigns each instruction a static embedding vector, which is not context-aware to different sequences it interacts with; this may fail to learn the compiler-related features; (2) since instruction sequences are abstracted from functions, they may not only enjoy local instruction correlations, but also global or long-range instruction dependency; in this respect, it calls for sequence learning models to better capture the expressive compiler-specific patterns and features from instruction sequences for compiler identification. As advanced neural network structures, both CNN and RNN have boosted the state-of-the-arts in sequence learning. As such, in this work, NeuralCI attempts both CNN and RNN models to learn the semantic and structural information of instruction sequences and leverages these advances to identify their compiler settings.

1) CNN-BASED MODELS
CNN is known to learn the local correlations with shared weights and utilize pooling mechanism to greatly reduce the number of parameters needed to find important local patterns. In other words, CNN is able to attend to those frequently co-occurring instructions in the short sequences. In our model formulation, we further take advantage of different kernel-size filters to thoroughly extract interacted salient features among different instruction grams to capture the behaviors of compilers. The structure of our CNN-based models is shown in Figure 4. On the basis of instruction embedding, each instruction sequence is first transformed into a raw feature matrix $A \in \mathbb{R}^{l \times d}$, where $l$ is the sequence length³ and $d$ is the instruction embedding dimension:
$$A = [e_1, e_2, \cdots, e_l]^{\top}$$
where $e_i \in \mathbb{R}^{d}$ is the corresponding embedding of $ins_i$ in the sequence. Then, in the convolutional layer, $k$ convolutional filters with shape $n \times d$ are adopted for convolution operations on the raw feature matrix $A$, obtaining a new feature matrix $c \in \mathbb{R}^{(l-n+1) \times k}$, where $n$ denotes the kernel size. To extract different views of feature patterns, we convolve $A$ with kernels of size 2, 3 and 4 (analogous to 2, 3 and 4 instruction grams respectively), whose outputs are then passed through a 1D max-pooling layer for dimensionality reduction. The resulting representations are finally concatenated and passed through a dense layer to be fed to a softmax layer for compiler prediction. Impressed by the performance gains of integrating attention into neural network structures in many natural language processing (NLP) tasks, we further attempt introducing an attention layer between the input layer and the convolution layer as depicted in Figure 4. In our case, the intuition of introducing such an attention layer is to capture long-term contextual information and correlation between non-consecutive instructions. Specifically, NeuralCI implements two kinds of attention, including the scaled dot product attention and an additive attention, which are originally used to estimate the relevance between hidden vectors.
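The convolution and pooling steps described above can be illustrated with a small pure-Python sketch. It is not the TensorFlow model used in the paper: the ReLU activation and the use of global max-over-time pooling are assumptions made for the illustration, and all names are hypothetical. It shows that one filter of kernel size n over an l × d matrix yields l − n + 1 activations, and that pooling each filter gives one feature per filter.

```python
def conv1d_valid(A, filt):
    """Slide one n x d filter over the l x d matrix A (no padding).

    Returns a list of l - n + 1 activations.
    """
    n = len(filt)
    out = []
    for start in range(len(A) - n + 1):
        s = 0.0
        for r in range(n):
            for c in range(len(filt[r])):
                s += A[start + r][c] * filt[r][c]
        out.append(max(0.0, s))  # ReLU; the activation is an assumption
    return out


def multi_kernel_features(A, filters_by_size):
    """filters_by_size maps kernel size -> list of n x d filters.

    Each filter is max-pooled over time (assumed global pooling) and the
    pooled values are concatenated into one feature vector.
    """
    features = []
    for n in sorted(filters_by_size):
        for filt in filters_by_size[n]:
            features.append(max(conv1d_valid(A, filt)))
    return features


# Toy input: l = 5 instructions, d = 2 embedding dimensions (hypothetical values).
A = [[1, 0], [0, 1], [1, 1], [0, 0], [2, 1]]
ones = lambda n: [[1, 1] for _ in range(n)]
features = multi_kernel_features(A, {2: [ones(2)], 3: [ones(3)], 4: [ones(4)]})
```

With k filters per kernel size, the concatenated vector has 3k entries for the kernel sizes 2, 3 and 4 used in the paper.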
The scaled dot product attention is the attention mechanism proposed in the famous Transformer [42], which computes attentions simultaneously with matrix operations:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
where $\mathrm{Attention}(Q, K, V) \in \mathbb{R}^{l \times d_v}$ is the attention matrix in which each row $g_i$ is a $d_v$-dimensional attention vector for the corresponding row in the input matrix, $Q \in \mathbb{R}^{l \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$ and $V \in \mathbb{R}^{m \times d_v}$ are three matrices derived by multiplying the input matrix with weight matrices, and $\sqrt{d_k}$ is a scaling factor to ensure more stable gradients.
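As a small worked illustration of the scaled dot product attention, the following pure-Python sketch computes the attention weights and the attended output for given Q, K and V. In the model these matrices would come from learned projections of the input; here they are passed in directly, and the helper names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for Q (l x d_k), K (m x d_k), V (m x d_v)."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)  # l x m similarity scores
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights  # output is l x d_v


# One query against two identical keys: the weights must be uniform.
out, w = scaled_dot_product_attention([[1.0, 2.0]],
                                      [[0.0, 0.0], [0.0, 0.0]],
                                      [[2.0, 4.0], [4.0, 8.0]])
```

Each output row is a weighted average of the rows of V, which is how an instruction can draw on non-consecutive instructions elsewhere in the sequence.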
The additive attention mechanism we adopt is the one originally used to enhance sequence tagging [50], based on which we calculate a context vector $g_i$ for each (instruction) embedding $e_i$ in matrix $A$ as:
$$\beta_{i,j} = \sigma\big(W_z \tanh(W_x e_i + W_y e_j + b_x) + b_y\big)$$
$$\alpha_{i,j} = \frac{\exp(\beta_{i,j})}{\sum_{k} \exp(\beta_{i,k})}$$
$$g_i = \sum_{j} \alpha_{i,j} e_j$$
where $\beta_{i,j}$ captures the correlation between the instruction embeddings $e_i$ and $e_j$; $\alpha_{i,j}$ is the attention weight normalized with the Softmax function so as to ensure $\alpha_{i,j} \geq 0$ and $\sum_j \alpha_{i,j} = 1$; $\sigma$ is the element-wise sigmoid function; $W_x$, $W_y$ and $W_z$ are respectively the weight matrices corresponding to $e_i$, $e_j$ and their non-linear combinations; $b_x$ and $b_y$ are the bias vectors.
For both attention mechanisms, NeuralCI concatenates the vectors output from the attention layer with the initial embedding vectors as a new representation to be further processed with the convolution and subsequent layers. For simplicity, we use Neural BS CNN, Neural SD CNN, and Neural AD CNN to specify the base CNN, the enhanced CNN with scaled dot product attention and the enhanced CNN with additive attention throughout the experimental evaluations.

2) RNN-BASED MODELS
RNN is known to learn the sequential dependency, and is strict in aligning the positions and contexts for the instances in the input sequences. By capturing long range dependencies between the instructions in a sequence, RNN can produce a comprehensive and contextualized latent vector representation for the entire instruction sequence. In consideration of the vanishing/exploding gradient issue faced by naïve RNN, NeuralCI employs the Gated Recurrent Unit (GRU) [15], [17] in the current design to alleviate the problem. Figure 5 depicts the basic structure of our RNN-based models. Similar to the CNN-based models, the matrix $A$ composed of instruction embeddings forms the input, which is read through a GRU cell such that various non-linear transformations are performed to generate a hidden vector state $h_i$ at each timestep $i$. Improved over GRU, bidirectional GRU (BiGRU), which leverages two GRU cells reverse to each other to process the sequence, captures both the previous and the future timestep features via forward and backward states.
Therefore, NeuralCI adopts the BiGRU structure to jointly capture both the forward and backward sequential dependency and global contextual information in the instruction sequence, so that the hidden state vector $h_i$ at timestep $i$ can be concatenated as:
$$h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$$
After forward and backward reading the entire input sequence, in Figure 5(a) the hidden state $h_l$ corresponding to the last timestep will act as the latent vector representation in our naïve RNN model, which is then fed to subsequent dense and Softmax layers for prediction. During the training process, the training loss is adopted to measure the correctness of sequence learning and compiler prediction. Also, dropout is applied to prevent the neural network from overfitting.
Alternatively, as illustrated in Figure 5(b) and 5(c), Neu-ralCI also attempts introducing an attention layer between the GRU layer and the dense layer on the basis of our naïve RNN model, in the sense of attending to partial informative instructions (that are potentially more important for compiler identification) rather than focusing on all instructions equally. In Figure 5(b), NeuralCI aggregates the attention vectors produced by either of the two attention mechanisms discussed in CNN-based models) with an average pooling 4 operation, generating a single informative vector representation for a sequence. Besides these two attention mechanisms, as shown in Figure 5(c), NeuralCI also incorporates another kind of attention mechanism [46] which introduces an external context/query vector µ w to compute for each word (embedding) in a sentence an importance weight and then compute a weighted sum of the word embeddings based on the importance weights as the sentence's vector representation. For our application, we compute a vector for the instruction sequence with this attention mechanism as follows: where, α i is the normalized attention weights through Softmax function; W x is the weight matrix corresponding to h i , b x is the bias vector, both of which are to be jointly learned with the context vector µ w during the training process. The same as the naïve RNN, the latent vector produced with either of the three attention mechanisms is further fed to subsequent dense and Softmax layers for compiler prediction. Also, to get these RNN models distinguishable, we use Neural BS GRU , Neural SD GRU , Neural AD GRU and Neural QU GRU (the most impressive characteristic of the third attention mechanism is the introduction of an external query vector, thus we call it QUerybased-attention for short) to depict them for simplicity in the following experimental evaluations. • IDA Pro is then used to identify and extract functions from each binary. 
Also, we discard trivial functions (functions containing just a few instructions, such as stub functions) that are meaningless to analyze. We consider functions containing fewer than 10 instructions as trivial in our current setting.

V. EXPERIMENTS AND EVALUATION
• To prevent the neural models from seeing, during training, functions that are nearly identical to those used in testing, which would otherwise inflate the performance metrics, we keep only unique functions. Specifically, a function is considered redundant if it has the same normalized instructions as another function. We then label each remaining function with the compiler settings used to compile the binary in which the function resides. With these settings, we finally construct a dataset comprising a total of 854,858 unique functions from 4,810 binaries⁵, with an average of about 260 instructions per function. The distribution of the function sizes is depicted in Figure 6, where 50% of functions contain fewer than 100 instructions, and nearly 90% of functions contain fewer than 500 instructions.
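The filtering and deduplication steps described above can be sketched as follows. This is a minimal illustration rather than the released implementation: the function name `build_dataset`, the label strings, and the choice to keep the first occurrence of a duplicated sequence are assumptions made for the example.

```python
from typing import Iterable, List, Tuple

MIN_INSTRUCTIONS = 10  # functions with fewer instructions are treated as trivial


def build_dataset(
    functions: Iterable[Tuple[List[str], str]]
) -> List[Tuple[Tuple[str, ...], str]]:
    """Drop trivial functions and keep one copy of each normalized sequence.

    Each input item is (normalized_instructions, compiler_setting_label).
    Assumption: when a sequence occurs more than once, the first one is kept.
    """
    seen = set()
    dataset = []
    for instructions, label in functions:
        if len(instructions) < MIN_INSTRUCTIONS:
            continue  # trivial function, e.g. a stub
        key = tuple(instructions)
        if key in seen:
            continue  # same normalized instructions as an earlier function
        seen.add(key)
        dataset.append((key, label))
    return dataset


# Hypothetical toy input: one duplicate and one trivial function are removed.
samples = [
    (["mov reg, reg"] * 12, "gcc-O0"),
    (["mov reg, reg"] * 12, "gcc-O1"),
    (["ret"], "clang-O2"),
    (["push reg"] + ["add reg, imm"] * 11, "clang-O2"),
]
unique_functions = build_dataset(samples)
```

Deduplicating before the train/test split is what keeps identical functions from appearing on both sides of the split.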

B. IMPLEMENTATION DETAILS AND EXPERIMENTAL SETTINGS
We have implemented NeuralCI as a prototype tool. It utilizes IDA Pro to parse binaries and obtain functions and their raw assembly instructions. The function abstraction module is implemented in Java, and the neural modeling module is implemented using Python and the TensorFlow framework. The skip-gram implementation provided in gensim [4] is used to generate instruction embedding vectors, with the embedding size d set to 100. The maximum length l for an instruction sequence is set to 500, which covers the complete semantics of nearly 90% of the functions in our dataset. The number of convolution filters in the CNN-based models is set to 128, and so is the dimension of the GRU hidden state vectors. For experimental settings, we randomly split the dataset into training, validation and testing sets according to a percentage of 80%, 10% and 10% respectively. The neural network models are trained with a Tesla V100 GPU card using a batch size of 128, an initial learning rate of 0.001 (we divide the learning rate by 10 when the validation loss has stopped improving for at least 5 epochs), and the Adam optimizer. For each epoch, the training samples are shuffled and the accuracy on the validation set is calculated. Also, the EarlyStopping mechanism is leveraged to stop the training once the validation accuracy no longer rises, so as to avoid over-fitting, non-convergence and other problems. Finally, we take the model with the best accuracy as the final model to be further evaluated on the testing set with respect to performance metrics including accuracy, precision, recall and f1-score.
⁵ It should be noted that not every project can be successfully compiled under every compiler setting. Also, the successful compilation of different projects produces varying numbers of binaries.
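The learning-rate rule stated above (divide by 10 once the validation loss has stopped improving for 5 epochs) can be sketched in plain Python as follows. The class name `PlateauSchedule` and the loss values are hypothetical. In a Keras training loop the same behaviour is usually obtained with the `ReduceLROnPlateau` and `EarlyStopping` callbacks, although the text does not say which mechanism the prototype uses for the learning-rate reduction.

```python
class PlateauSchedule:
    """Divide the learning rate by `factor` after `patience` stale epochs.

    An epoch is stale when its validation loss is not lower than the best
    loss seen so far. All names here are illustrative assumptions.
    """

    def __init__(self, lr=0.001, factor=10.0, patience=5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best_loss = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the updated rate."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr /= self.factor
                self.stale_epochs = 0  # start counting again at the new rate
        return self.lr


# Hypothetical validation losses: two improving epochs, then five stale ones.
schedule = PlateauSchedule()
rates = [schedule.step(loss) for loss in [1.0, 0.9] + [0.95] * 5]
```

In this toy run the rate stays at 0.001 for the first six epochs and drops to one tenth of that value on the fifth consecutive stale epoch.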

C. EVALUATION
In the following parts, we evaluate the performance of NeuralCI on identifying the compiler family, optimization level, compiler version and compiler setting combination respectively, and report the comparative results across the neural network models as well as against existing function level methods that support the detection of the corresponding compiler settings. Note that the accuracy values in Tables 2, 3, 5, 6, 7, and 8 all refer to the total accuracy, and the precision, recall, and f1-score values in these tables all refer to the weighted-average precision, recall and f1-score respectively.
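To be specific, let $k$ be the number of classes, $\{c_1, c_2, \cdots, c_k\}$ be the numbers of samples in each class, and $\{\hat{c}_1, \hat{c}_2, \cdots, \hat{c}_k\}$ be the numbers of samples in each class that are correctly classified by a classifier. The total accuracy can be defined as:
$$\mathrm{accuracy} = \frac{\sum_{i=1}^{k} \hat{c}_i}{\sum_{i=1}^{k} c_i}$$
Let $\{p_1, p_2, \cdots, p_k\}$, $\{r_1, r_2, \cdots, r_k\}$ and $\{f_1, f_2, \cdots, f_k\}$ be the precision, recall and f1-score values computed with respect to the $k$ classes respectively. The weighted-average precision, recall and f1-score can be defined as:
$$\mathrm{precision} = \frac{\sum_{i=1}^{k} c_i\, p_i}{\sum_{i=1}^{k} c_i}, \qquad \mathrm{recall} = \frac{\sum_{i=1}^{k} c_i\, r_i}{\sum_{i=1}^{k} c_i}, \qquad \mathrm{f1} = \frac{\sum_{i=1}^{k} c_i\, f_i}{\sum_{i=1}^{k} c_i}$$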
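As a small worked example of these definitions, the sketch below computes the total accuracy and a weighted-average metric from per-class counts. The function names and the numbers are hypothetical and are not taken from the paper's tables.

```python
def total_accuracy(class_sizes, correct_counts):
    """Correctly classified samples divided by all samples."""
    return sum(correct_counts) / sum(class_sizes)


def weighted_average(class_sizes, per_class_values):
    """Average of per-class values, weighted by the number of samples per class."""
    total = sum(class_sizes)
    return sum(c * v for c, v in zip(class_sizes, per_class_values)) / total


# Hypothetical two-class example: 600 and 400 samples per class.
class_sizes = [600, 400]
correct_counts = [570, 360]
per_class_recall = [570 / 600, 360 / 400]  # recall of a class = correct / class size

accuracy = total_accuracy(class_sizes, correct_counts)
weighted_recall = weighted_average(class_sizes, per_class_recall)
```

With class sizes as weights, the weighted-average recall coincides with the total accuracy (both are 0.93 here), since each class contributes its own count of correctly classified samples. The weighted precision and f1-score generally differ from it.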

1) PERFORMANCE OF IDENTIFYING COMPILER FAMILY
In this experiment, we take the compiler family that each function is compiled with as the ground truth, and train and evaluate the NeuralCI models with the experimental settings described in Section V-B. As Table 2 shows, the NeuralCI models achieve high accuracy and f1-score in identifying the compiler family. Also, there are no obvious performance differences between the NeuralCI models, among which Neural SD GRU exhibits the best performance with an accuracy of 98.7% and an f1-score of 0.987 for the compiler family identification task.

2) PERFORMANCE OF IDENTIFYING OPTIMIZATION LEVEL
In this experiment, the optimization levels of certain compilers are taken as the ground truth to train and evaluate NeuralCI. We do not use the default 4-level optimization option setting, but adopt the same strategy as in [27], [28], which condenses the 4 optimization levels into 2 classes, 'low' and 'high', considering the finding in these studies that it is difficult to distinguish between O2 and O3 compiled binaries. That is, O0 and O1 are considered as the low optimization class, denoted by O L , while O2 and O3 are considered as the high optimization class, denoted by O H . In addition to this problem simplification strategy, we also test the performance of the models with a 3-level split that condenses the optimization levels into O0, O1 and O H . The data summarized in Table 3 shows:
1) The NeuralCI models exhibit relatively good detection results in both the 2-level and the 3-level optimization condensation cases, again with little performance difference observed across the models.
2) Surprisingly, for some models NeuralCI performs equally well or even better on the 3-level optimization option identification task than on the 2-level task. As the most obvious example, the largest difference is observed when applying Neural AD GRU to identify the optimization option for Clang, where accuracies of 90.4% and 92.1% are achieved in the 2-level and 3-level cases respectively. Table 4 gives the corresponding detailed precision and recall values for each optimization level. As it shows, by splitting O1 out of O L , the recall of identifying O H increases drastically from 0.781 to 0.910, indicating that many O H samples that are falsely classified as O L in the 2-level case are now correctly classified. Meanwhile, some samples actually belonging to O1 are now falsely classified as O H , causing the precision of identifying O H to decrease from 0.845 to 0.816 and leading to the relatively low recall value of 0.89 for the newly separated class O1. This matches the common sense that binaries compiled with O1, O2 or O3 are more similar to each other than to binaries compiled with O0, where no optimization is applied.
3) The performance of NeuralCI in identifying the optimization level varies across compilers, with the best and worst performance (a margin of about 7% in accuracy) observed on GCC and Clang respectively, indicating that the differences introduced into the produced binaries by the different optimization levels of GCC seem more obvious. Also, the varying detection difficulty of compiler optimization levels suggests that different compilers define their optimization levels in different ways.
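The two label condensation schemes used in this experiment can be written down as simple lookup tables. The sketch below is illustrative only: the names `TWO_LEVEL`, `THREE_LEVEL` and `condense` are assumptions, and the class names `OL` and `OH` stand for the low and high optimization classes.

```python
# 2-level scheme: O0/O1 -> low, O2/O3 -> high.
TWO_LEVEL = {"O0": "OL", "O1": "OL", "O2": "OH", "O3": "OH"}

# 3-level scheme: O0 and O1 stay separate, O2/O3 -> high.
THREE_LEVEL = {"O0": "O0", "O1": "O1", "O2": "OH", "O3": "OH"}


def condense(levels, scheme):
    """Map raw optimization flags to the class labels of the chosen scheme."""
    return [scheme[level] for level in levels]


raw_levels = ["O0", "O1", "O2", "O3"]
two_level_labels = condense(raw_levels, TWO_LEVEL)
three_level_labels = condense(raw_levels, THREE_LEVEL)
```

Only the ground-truth labels change between the two settings; the instruction sequences fed to the models stay the same.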

3) PERFORMANCE OF IDENTIFYING COMPILER VERSION
Identifying the specific compiler version used to compile a function is intuitively more challenging. Table 5 summarizes the detection results of NeuralCI applied to GCC and Clang, both of which have multiple versions in our dataset. As it shows, the performance of NeuralCI varies across compiler families: accuracies ranging from 93.7% to 94.5% are achieved on the major versions of GCC, compared with accuracies ranging from 82.3% to 84.3% on Clang. This suggests that different compiler families exhibit different rates of ''churn'' across versions, with GCC producing significantly more varied code between major versions than Clang. Moreover, we evaluate whether NeuralCI is capable of identifying minor compiler versions (GCC 4.7, 4.8 and 4.9). As the right columns of Table 5 show, the accuracies are much lower than those achieved in the major version identification case, which conforms to the common sense that more changes are generally introduced across major version releases than across minor version releases of a program. The fair detection accuracies indicate that NeuralCI can capture very subtle yet significant features which may otherwise be missed by manually crafted feature extraction and selection strategies. Again, no apparent gaps are observed across the neural network models in NeuralCI.

4) PERFORMANCE OF IDENTIFYING COMPILER SETTING COMBINATION
Detecting compiler setting combinations from a binary can generally be achieved by jointly applying multiple models, each of which targets a different part of the settings. For example, to determine both the compiler family and the optimization level for a binary function, we can first use the trained model for compiler family identification to detect the family that compiled the function, and then use the optimization level identification model corresponding to the identified family to further detect the optimization level. An alternative is to train a single model that detects these settings simultaneously. To check whether NeuralCI is capable of this challenging task, we first evaluate its performance in identifying the compiler family and optimization level together. As the results in Table 6 show, all the models in NeuralCI exhibit relatively good and similar detection performance. Further, Table 7 reports the results of NeuralCI in identifying all compiler setting combinations⁶ simultaneously, where accuracies ranging between 81.5% and 83.8% are observed across different models.
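The first, multi-model approach described above can be sketched as a two-stage lookup: a family classifier selects which family-specific optimization-level classifier to apply. Everything in this sketch is hypothetical, including the function name and the toy stand-in classifiers, whose rules are arbitrary placeholders and not real compiler signatures.

```python
from typing import Callable, Dict, Sequence, Tuple

# A classifier maps a normalized instruction sequence to a label.
Classifier = Callable[[Sequence[str]], str]


def identify_family_and_level(
    instructions: Sequence[str],
    family_model: Classifier,
    level_models: Dict[str, Classifier],
) -> Tuple[str, str]:
    """Predict the family first, then apply that family's level model."""
    family = family_model(instructions)
    level = level_models[family](instructions)
    return family, level


# Toy stand-ins for trained models (arbitrary rules, for illustration only).
def toy_family_model(seq):
    return "GCC" if "tok_a" in seq else "Clang"


toy_level_models = {
    "GCC": lambda seq: "OH" if len(seq) < 4 else "OL",
    "Clang": lambda seq: "OL",
}

prediction = identify_family_and_level(
    ["tok_a", "tok_b", "tok_c"], toy_family_model, toy_level_models
)
```

A design consequence of the cascade is that an error in the first stage routes the function to the wrong second-stage model, which is one reason to also evaluate a single model trained on the combined labels.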

5) COMPARISONS WITH EXISTING METHODS
In this section, we compare NeuralCI with existing methods, including Idioms [29] and Graphlets [28], that support compiler identification on individual functions. Specifically, we implement three methods based on the ORIGIN [10] code shared on GitHub that extract idiom or graphlet features from binary functions, select significant ones via mutual information computation, and train classifiers with the support vector machine (SVM)⁷ on the selected idiom features, the graphlet features, and both of them, producing three models for each specific compiler identification task, which we denote as IDM, GRA, and IDM-GRA respectively. Following these works, we choose the top 20,000 features with the highest mutual information scores as significant features and use linear SVMs, considering their good performance on high-dimensional data sets. Despite that, due to the natural properties of SVM, these methods still fail to scale to our whole dataset, either exhausting the available memory or never finishing training. To deal with that, we randomly select 100,000 samples from the whole dataset, taking 80% of them as the training data and the rest as the test data. We repeat the above process using 10-fold cross validation⁸ to obtain averaged evaluation results. Table 8 summarizes the experimental results, where the values in the column NeuralCI report the best performance achieved by our models. As it shows, NeuralCI outperforms the other methods in terms of detection accuracy, precision, recall, f1 and f0.5 values in almost every compiler identification task, except the task of identifying the optimization level of the ICC compiler, where IDM-GRA outperforms NeuralCI by a small margin.
⁶ We report the results on the major compiler versions, considering that NeuralCI does not perform as well on identifying minor compiler versions (as shown in Table 5).
⁷ Besides the SVM model, a CRF model which incorporates the adjacency between functions is also adopted in the compared works [29] and [28]. Since it assumes a sequence of functions rather than treating each function independently, we only compare with their SVM-based model, which operates on each function independently.
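The feature selection step of the baseline methods, ranking features by mutual information with the compiler label and keeping the top ones, can be illustrated with the sketch below. It assumes binary presence/absence features and uses hypothetical feature names. It is not the ORIGIN implementation, and the subsequent linear SVM training is omitted.

```python
import math
from collections import Counter


def mutual_information(feature_present, labels):
    """Mutual information (in bits) between a discrete feature and the labels."""
    n = len(labels)
    joint = Counter(zip(feature_present, labels))
    feature_counts = Counter(feature_present)
    label_counts = Counter(labels)
    mi = 0.0
    for (f, y), count in joint.items():
        p_xy = count / n
        p_x = feature_counts[f] / n
        p_y = label_counts[y] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


def select_top_k(feature_columns, labels, k):
    """Return the names of the k features with the highest mutual information."""
    scores = {
        name: mutual_information(column, labels)
        for name, column in feature_columns.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: four functions, two candidate features.
labels = ["gcc", "gcc", "clang", "clang"]
feature_columns = {
    "idiom_a": [1, 1, 0, 0],  # perfectly aligned with the label
    "idiom_b": [1, 0, 1, 0],  # independent of the label
}
selected = select_top_k(feature_columns, labels, k=1)
```

In the setting described above, k would be 20,000 and the selected feature columns would form the input of the linear SVM.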

VI. DISCUSSION
Similar to existing binary analysis methods that work on the function level, NeuralCI also assumes the reliable acquisition of functions from the binaries to be analyzed. In this paper, we select the state-of-the-art commercial reverse engineering tool IDA Pro for parsing, in consideration of its performance and wide application in the binary analysis domain. Alternatives, such as Binary Ninja [2], Radare2 [9], Dyninst [3], Angr [1], etc., are also compatible with NeuralCI. But in general, correct parsing of binaries is still a complex and open issue. To obtain functions more reliably, NeuralCI might benefit from recent advances that adopt deep learning to identify functions from binaries [34].
In addition to the impact of the limitations of binary parsing tools, NeuralCI may also suffer from various kinds of code obfuscation [18], [31] applied to the code to be analyzed. On the one hand, code obfuscation techniques such as compression and encryption [41] may disable the analysis at the first step, as they interrupt the correct parsing and extraction of functions from binaries, which is the fundamental basis for NeuralCI and other function-level compiler identification methods. On the other hand, code obfuscation techniques that change the code [30], [32], such as instruction replacement, dead code insertion, function inlining, etc., are very likely to destroy or submerge features important to compiler identification, hence decreasing the detection accuracy. One possible solution to this problem is to deobfuscate the binaries first with deobfuscation techniques [22], [44], [49] before performing compiler identification. Another possible way is to adopt the idea of adversarial training [40], which trains the compiler identification model with adversarial examples (i.e., obfuscated samples), to improve the resistance against code obfuscation attacks. Overall, how to achieve robust compiler identification in the face of code obfuscation is a challenging yet important research topic, and we leave it as our future work.
To facilitate other researchers conducting experiments and presenting their findings, we have made the dataset and NeuralCI's source code public on GitHub. Constructing a relatively reliable dataset is challenging because the compilation itself is difficult: the source files usually have numerous dependencies that complicate the compilation process. Worse still, we must ensure that the projects are correctly compiled with the specified compiler settings. In the end, 19 projects with relatively complete compilation environments are processed, which cost us a massive amount of time reading the guidelines accompanying each project in order to configure and compile them with the specified compiler settings, so as to produce correctly labeled samples. One concern regarding the dataset is its diversity, which correlates with the generalization ability of the trained models. To address this, we try to select projects that involve different program types and application domains. Another concern is the effectiveness of NeuralCI on other compilers or compiler settings. For example, as the projects used to construct our dataset are widely used open source projects that work under Linux, it is difficult to successfully compile them with the MSVC compiler, which generally works under Windows. Thus we did not test NeuralCI on identifying MSVC-compiled binaries. Besides, there are other compilers that can produce binaries, such as the IBM XL compiler [5]. Covering all of them would clearly improve the usability of compiler identification research, but it is a non-trivial task that requires a lot of human effort. We have made our constructed dataset public as a first step, in the hope that more researchers will participate in compiler identification and related investigations, so that together we can gradually refine and enrich the dataset with more samples involving more kinds of compilers and programs.
We did not perform systematic hyper-parameter tuning, but rather adopted either the default or the commonly used empirical values for the hyper-parameters. As illustrated by the evaluation results, NeuralCI trained with the current parameter setting achieves very promising detection accuracy. A systematic or exhaustive grid-search-based hyper-parameter tuning may further improve NeuralCI's performance, at the cost of considerably more computing resources and time. We leave it as our future work as well.

VII. RELATED WORK
In general, existing works on compiler identification can be divided into two categories: signature matching based methods and learning based methods.

A. SIGNATURE MATCHING BASED METHODS
The signature matching based methods [6]-[8] search the binary program against a corpus of generic and rigid signatures, and attribute to the whole program the compiler label corresponding to the matched signature string. This kind of method has been implemented in several reverse engineering tools, such as IDA Pro [6], PEiD [8] and LANGUAGE 2000 [7], owing to its high detection efficiency and low cost. Its drawbacks lie in the considerable expertise required to construct a good enough compiler-specific signature, as well as in the fragility of its accuracy, which is easily affected by slight differences between signatures. Besides, the signatures usually depend on the metadata or details of program headers, which can be easily altered or become unavailable in stripped binaries. Moreover, these tools identify compilers on the whole binary, while a program can be produced with multiple compilers in scenarios such as statically linking library code to produce the final binary program.

B. MACHINE LEARNING BASED METHODS
This type of method formulates compiler identification as a machine learning task performed on (in most instances stripped) binaries, based on the belief that the implicit features of the resulting binaries reflect design and implementation decisions of the compiler that is used to produce the binaries. Specifically, these methods train models that capture compiler-specific patterns, which are then used to infer the compiler provenance of previously unseen binaries.
The pioneering work [29] adopting this type of approach was conducted by Rosenblum et al., who defined a set of idioms (short sequences of instructions with wildcards) and utilized mutual information calculation to capture and select significant patterns indicative of the source compiler for the program binaries. High accuracy was observed for inferring the compiler families, but its performance on optimization level identification is unknown as no evaluation was conducted. ORIGIN [28] achieved superior accuracy in recovering the compiler details by introducing graphlets (small and non-isomorphic subgraphs within the CFG) in addition to idioms so as to capture additional structural features. Hidden Markov models were learned by observing the differences in the type and frequency of instructions comprising the binaries compiled with different compilers, and proved to be accurate in identifying the compiler family for a whole program [12], [39]. However, a separate model needs to be learned for each individual compiler family. Also, these models do not extract information regarding the optimization levels. To improve efficiency in terms of computational resources and detection time, BinComp [11], [27] adopted a stratified approach to infer different compiler details at different granularities. It identifies the compiler family for the whole program via signature matching, and conducts compiler version and optimization level detection for compiler-related functions. However, as the compiler-related functions usually constitute a small portion of all functions in real-world programs, where user-defined functions dominate, the method does not solve the function-level compiler identification problem which we are targeting in this paper.
Basically, the accuracy of these machine learning based methods greatly depends on the quality of expert-defined feature extraction templates and feature selection strategies, which are prone to human bias and may capture many features that are irrelevant or redundant for the compiler provenance task while failing to capture closely relevant ones.
In recent years, deep learning techniques have been applied with significant success to the domain of binary program analysis [16], [23], [34], [45], [48], [51]. BinEye [45] is one of the few works that utilize neural network models to implement compiler identification. It combines word embedding and position embedding to encode the raw bytes of an object file, and then utilizes a CNN to learn a model that supports optimization level recognition on each individual object file. Our work differs in that we achieve finer grained identification of both compiler family and optimization level for each individual function by adopting an abstraction strategy that operates on assembly instructions rather than the raw bytes. o-glassesX [25] designs an attention augmented CNN-based model for compiler identification, where positional encoding is also applied to capture features of instruction orders. Structure2Vec [23] utilizes a graph embedding network to transform the function CFGs into vectors, which are then fed into a dense layer to train a classifier for compiler family identification. Compared to this work, we operate directly on the instructions comprising a function with a lightweight abstraction strategy, and adopt the much faster sequence-oriented neural networks to train models for comprehensive detection of the optimization level, the compiler version, and the compiler setting combination in addition to the compiler family as done by Structure2Vec.

VIII. CONCLUSION
In this work, we model functions as normalized instruction sequences using a lightweight abstraction strategy, which are then fed to well-designed neural networks to solve the problem of fine-grained compiler identification. We implement the methods in a prototype tool, NeuralCI, and evaluate its performance on a large dataset consisting of 854,858 unique functions in total. As the experimental evaluation shows, NeuralCI outperforms existing function level compiler identification methods. The accuracies reported on detecting the compiler family, optimization level, compiler version and the compiler setting combination strongly suggest that deep neural networks are capable of capturing subtle yet significant features indicative of compiler settings, and serve as a promising and reliable way to reveal these compiler settings with less reliance on human effort and domain knowledge.
Although several popular attention mechanisms that work well on NLP tasks are adopted to enhance our models, only limited improvements are observed. This suggests that the subtle clues implied in a sequence have already been largely exploited by our models. Besides the instructions and their orders, different compiler settings can also cause significant changes to the structures of binary functions. In consideration of that, we plan to model a function with more comprehensive representation forms, such as the CFG, which preserves both syntactic and structural information, and resort to graph neural networks to enhance the performance on the compiler identification task. The explainability of the models is another direction we plan to focus on, with which we may find some interesting patterns to provide heuristics for researchers in the domain. There exist some candidate techniques [21], [35], [36] that can be referred to, such as the gradient based saliency map and the regression model based LEMNA. Also, well designed neural networks that show advanced performance on tasks such as video deblurring [47] are promising candidates for the compiler provenance task after proper adjustment.