o-glasses: Visualizing x86 Code from Binary Using a 1d-CNN

Malicious document files used in targeted attacks often contain a small program called shellcode. It is often hard to prepare a runnable environment for dynamic analysis of these document files because they exploit specific vulnerabilities. In these cases, it is necessary to identify the position of the shellcode in each document file to analyze it. If the exploit code uses executable scripts such as JavaScript and Flash, it is not so hard to locate the shellcode. On the other hand, it is sometimes almost impossible to locate the shellcode when it does not contain any JavaScript or Flash but consists of native x86 code only. Binary fragment classification is often applied to visualize the location of regions of interest, and shellcode must contain at least a small fragment of x86 native code even if most of it is obfuscated, for example, a decoder for the obfuscated body of the shellcode. In this paper, we propose a novel method, o-glasses, to visualize the shellcode by recognizing the x86 native code using a specially designed one-dimensional convolutional neural network (1d-CNN). The fragment size needs to be as small as the minimum size of the x86 native code in the whole shellcode. Our results show that a 16-instruction sequence (approximately 48 bytes on average) is sufficient for the code fragment visualization. Our method, o-glasses (1d-CNN), outperforms other methods in that it recognizes x86 native code with a surprisingly high F-measure (about 99.95%).


Introduction
In recent years, targeted attacks have become a major threat. In a targeted email attack, an email contains a request to open an attached file or click on a hyperlink in the email body. If the recipient does so, then some malware is launched. Most such malware is newly crafted, unknown malware, and is thus often hard for antivirus scanners to catch. In particular, malicious document files used in targeted email attacks often contain an executable file embedded within a decoy document file: over 60% of the attached files in targeted email attacks occurring in 2014 were reported to be document files [1].
The left-hand side of Fig. 1 shows a typical structure for a malicious document file. The malicious document file consists of four parts: exploit code, shellcode, an executable file, and a decoy document file. Exploit code is a program designed to exploit a document processor vulnerability. The exploit code is executed when the malicious document file is opened, leading to execution of the shellcode. Shellcode is a program designed to create an executable file and a decoy document from the remainder of the file and to launch the executable. Then, the PC that opened the malicious document file becomes controllable by attackers. The right-hand side of Fig. 1 shows a typical execution process of a malicious document file.
To obtain information about the attackers, we should not only detect the malware but also figure out its features in detail. Here, we face several problems. First, we should prepare a runnable environment for the malware in order to conduct dynamic analysis. When the target file is a malicious document exploiting specific vulnerabilities, it is often difficult to prepare an environment in which the exploit activates (OS versions, browsing software, language, patches, and so on) because the conditions are intertwined in complicated ways. Therefore, we are often forced to conduct static analysis. When the target file is an executable file, it is easy to find the entry point for analysis. However, when the target file is a document file, it is not so easy to find the entry point. In this case, we focus on the shellcode executed after the exploit code. When the malware uses JavaScript or Flash, we can figure out the location of the shellcode quickly. However, exploit code uses not only JavaScript and Flash but also font and image files, for example, a TIFF image (CVE-2017-5133 [2,3], CVE-2004-1308 [4]), a JPEG 2000 image (CVE-2016-8332 [5,6]), and a TrueType font (CVE-2011-3402 [7]). When searching for shellcode, it is important to consider various types of exploit code. Thus, our target is a class of malicious document files that contain x86 native code hidden somewhere in them.
Although attackers tend to use obfuscation to protect their code, shellcode must contain at least a small fragment of x86 native code, such as a decoder. Fig. 2 shows an example of a small decoder containing 17 opcodes in only a 29-byte sequence. This code was obtained from a malicious document file with a size of more than 100 kB used in a real attack. Our challenge, therefore, is finding a small amount of code like that shown in Fig. 2 in often large document files. To do this, we introduce a novel method, called o-glasses, to visualize the shellcode by recognizing the x86 native code using a specially designed one-dimensional convolutional neural network (1d-CNN).
In summary, the main contributions of our approach are as follows:

Easily Collectible Training Datasets
One of the most significant problems in using machine learning is how to prepare the training dataset. Even an excellent model cannot demonstrate its performance without large samples. However, studies of malware using machine learning sometimes struggle to collect samples because they need examples of malware for training. In contrast, our approach does not need malware for the training dataset. Thus, samples for learning are easily available for anyone.

High Recognition Rate for x86 Code
Conventional signature-based malware detectors do not work when unknown code is embedded. On the other hand, program code is not supposed to be in document files. So, extracting shellcode from malicious document files becomes feasible if we can separate program code precisely from normal byte sequences in document files. The solution provided in this paper is based on the assumption that shellcode and general program code have similar distributions of code.

Visual Analysis for Supporting Analysts
Visualizing a binary as an image helps to quickly get an overview of the file. While some experienced analysts can deduce the location of the embedded program code from a grayscale image converted from the binary file, even inexperienced analysts can achieve similar results using our proposed methods.

Preliminaries
The proposed solution lies in the static analysis of files. We do not take into account the file structure. The only thing of importance for us is whether a file fragment is a piece of x86 code.

x86 Architecture
The x86 and x86-64 architectures are probably the most widely used CISC (Complex Instruction-Set Computing) architectures [8]. Their instruction sets are rich and complex, and most importantly they support instructions of varying length. Instruction lengths range from just one byte (i.e., instructions comprising just a one-byte opcode) to 15 bytes.

Assumptions
We made the following assumptions.
Assumption 1 The distribution of the byte sequence from x86 code is dissimilar to that from document files.
Assumption 2 The distribution of shellcode is the same as that of common x86 code.
In other words, we expect to be able to detect shellcode by detecting any x86 code.
We next describe Shannon entropy, conventional visualization methods, and the deep learning models (multi-layer perceptron [MLP] and CNN) used in the study.

Shannon Entropy
We calculate the information entropy of each file fragment using the Shannon entropy rate given by

$$H(X) = -\frac{1}{8} \sum_{x=0}^{255} p(x) \log_2 p(x) \qquad (1)$$

where X is a random variable over [0, 255] and p(x) is the probability that X takes the byte value x. The entropy rates are real numbers between 0 and 1, where 1 means the file fragment is uniformly random.
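The entropy rate can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes p(x) is estimated as the relative frequency of each byte value in the fragment, and the function name `entropy_rate` is hypothetical.

```python
import math
from collections import Counter


def entropy_rate(fragment: bytes) -> float:
    """Shannon entropy of the byte histogram, divided by 8 so it lies in [0, 1]."""
    if not fragment:
        return 0.0
    total = len(fragment)
    h = 0.0
    for count in Counter(fragment).values():
        p = count / total          # relative frequency of one byte value
        h -= p * math.log2(p)      # accumulate -p * log2(p)
    return h / 8.0                 # 8 bits is the maximum entropy of one byte


if __name__ == "__main__":
    print(entropy_rate(bytes(256)))         # a block of identical bytes
    print(entropy_rate(bytes(range(256))))  # every byte value exactly once
```

A block of identical bytes gives the minimum rate, and a block in which all 256 byte values occur equally often gives the maximum.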

Conventional Visualization Methods
Visualizing a binary as an image is very helpful for getting a quick overview of the file. In this section, we describe three conventional visualization methods.
Grayscale A technique for representing different files with grayscale images was introduced by Conti et al. [9] and was applied to automatic malware classification by Nataraj et al. [10].
Bit-image representation of a binary file Goto [11] implemented the visualization of a binary file as a "bit-image" in a hex editor named "Stirling" in 1998. In Stirling, a given binary is read as a vector of 8-bit unsigned integers and then organized into a two-dimensional array. This can be visualized as a bit-image in four colors: 0x00 (null) in white, 0x01-0x1F (control characters) in light blue, 0x20-0x7F (ASCII) in red, and 0x80-0xFF in black.
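The four-color mapping described above can be sketched as follows. This is only an illustration of the byte-range rule; the function names and the row width of 128 bytes are assumptions and are not taken from Stirling itself.

```python
def bit_image_color(byte: int) -> str:
    """Map one byte value to the color class used in the bit-image."""
    if byte == 0x00:
        return "white"       # null
    if byte <= 0x1F:
        return "light blue"  # control characters 0x01-0x1F
    if byte <= 0x7F:
        return "red"         # ASCII range 0x20-0x7F
    return "black"           # 0x80-0xFF


def bit_image(data: bytes, width: int = 128) -> list:
    """Arrange the color classes into rows of `width` bytes (width is an assumption)."""
    return [
        [bit_image_color(b) for b in data[i:i + width]]
        for i in range(0, len(data), width)
    ]


if __name__ == "__main__":
    for row in bit_image(bytes([0x00, 0x01, 0x41, 0xFF]), width=2):
        print(row)
```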
Structural entropy Document files contain data of various kinds: metadata, text, and packed data. All of these file areas differ not only in size but also in the level of information entropy. If a document file is considered as a system of such elements, then we can use the term structural entropy for its characterization. Sorokin [12] built entropy diagrams by using the sliding window method. He selected 256 bytes for the window (block) size and 128 bytes for the window (block) shift. In our experiment, we used the same block size but changed the block shift to 1 byte to provide more detail. We calculate the entropy level at each offset and visualize the structural entropy as a grayscale image.
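The sliding-window computation can be sketched as below, using the 256-byte window from the text and a configurable shift (1 byte in our setting, 128 bytes in Sorokin's). The function names are hypothetical, and the straightforward recomputation per window is chosen for clarity rather than speed.

```python
import math
from collections import Counter


def entropy_rate(block: bytes) -> float:
    """Byte-level Shannon entropy divided by 8, so the value lies in [0, 1]."""
    total = len(block)
    h = 0.0
    for count in Counter(block).values():
        p = count / total
        h -= p * math.log2(p)
    return h / 8.0


def structural_entropy(data: bytes, window: int = 256, shift: int = 1) -> list:
    """Entropy rate of every window of `window` bytes, moved `shift` bytes at a time."""
    return [
        entropy_rate(data[offset:offset + window])
        for offset in range(0, len(data) - window + 1, shift)
    ]


if __name__ == "__main__":
    # Toy input: a low-entropy region followed by a high-entropy region.
    data = bytes(256) + bytes(range(256))
    profile = structural_entropy(data)
    print(len(profile), profile[0], profile[-1])
```

Each value in the resulting profile can then be scaled to a gray level to draw the image.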

MLP
A standard MLP neural network has a three-tier structure: the input layer, the hidden layers, and the output layer. Every layer in an MLP consists of nodes fully connected with the nodes in the adjacent layer.

CNN
In our method for recognizing x86 native code, we use a 1d-CNN [13]. In contrast to an MLP, a CNN has limited connections between each layer (see Fig. 3) and nodes in an intermediate layer receive only input from a localized part of the previous layer, which is called the receptive field. Tools based on CNNs have now led to great results in a wide range of vision tasks [14]. Generally, image data are continuous data. So, when image data are input to a CNN, high object recognition power can be obtained by adjusting the CNN's local receptive field. On the other hand, program code is classified as discrete data when viewed one byte at a time. Therefore, when binary data directly converted into an image are input to a CNN, there is the possibility that the benefit of the local receptive field cannot be obtained. Nevertheless, program code is a sequence of instructions, which may reduce the variation, so the possibility of receiving the benefit of the CNN's local receptive field is not entirely ruled out.
Weight sharing is a mechanism in which all links to nodes of a local receptive field have the same weight. In the case of Fig. 3, the three blue links have the same weight. Similarly, the three red links have the same weight. By using the local receptive field in this way, the result for some input data is the same as the result for shifted input data. This allows us to reflect all the data in intermediate layers despite the limited connectivity to the input.
Several hyperparameters control the size of the output volume of the convolutional layer (Fig. 4): the kernel field size, the depth, stride, and zero-padding. We will ignore zero-padding because we do not use it. The depth (D) of the output volume controls the number of neurons in a layer that connect to the same region of the input volume. The stride (S) controls how depth columns around the spatial dimensions (width and height) are allocated.
The spatial size of the output volume can be computed as a function of the input volume W, the kernel field size of the convolutional layer neurons K, and the stride with which they are applied S. The formula for calculating how many neurons "fit" in a given volume is given by (W − K)/S + 1.
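As a worked instance of this formula, take the sizes used for the first convolutional layer later in this paper (a 2048-bit input, a 128-bit kernel, and a stride of 128). The second expression applies the same formula to a width-2 kernel with a stride of 1 over the resulting 16 positions:

$$\frac{W - K}{S} + 1 = \frac{2048 - 128}{128} + 1 = 16, \qquad \frac{16 - 2}{1} + 1 = 15.$$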

Related Work
Methods of analyzing malware can be divided into two types: static analysis and dynamic analysis. We focus on static analysis as explained previously.
OfficeMalScanner [15] (OMS) is an analysis tool for document files. OMS scans entire files for generic shellcode patterns, an embedded signature of document files, or an embedded executable file. Although this method incorporates fuzzy search, it is easy to avoid detection because the number of the search patterns is small.
MDScan [16] is a standalone malicious document scanner. The tool analyzes PDF document files individually and detects malicious code. The tool combines static analysis of the document format representation and dynamic analysis of the embedded script code. The method focuses only on JavaScript in PDF files. Hence, the method does not work well when the exploit code is not written in JavaScript.
There are several approaches to malware detection that use binary or grayscale images (binary texture analysis [17], malware images [10], support vector machines [18] and visualization of binary files [19]). These approaches are aimed toward the detection and the classification of malicious software based on image processing techniques. Hence, they do not focus on finding small amounts of x86 code, such as shellcode, as we are doing here.
Binary fragment classification can visualize the location of regions of interest. To find shellcode, the fragment size needs to be as small as the shellcode itself. Xu et al. [20] treated a 1024-byte file fragment as a grayscale image and used an image classification method to classify file fragments. They focused on file type classification for digital forensics. It is difficult to make the fragment smaller because the texture of its grayscale image becomes harder to analyze. Hence, it is difficult to find shellcode using this method.

Training Data
We prepared two categories of dataset for training, both of which can be gathered easily. One category is labelled "Program" and comprises various sets of x86 code taken from two sources: GitHub and Ubuntu 16.04. The other category is called "Others" and consists of various document files and portions of data extracted from them. The "Others" category contains "CFB," "OOXML," and "PDF" files. CFB stands for compound file binary [21], and it is used as a container like the FAT16 file system. CFB is used in files with the extensions ".doc," ".xls," ".ppt," ".jtd" (used by the "Ichitaro" Japanese word processor), ".hwp" (used by the "Araea Han-geul" Korean word processor), and so on. OOXML stands for Office Open XML [22], which is a zip container in reality. OOXML is used in ".docx," ".xlsx," and ".pptx" files. PDF stands for portable document format [23], which has the extension ".pdf." For each category and source or file type, we constructed three types of dataset: the whole files, 256-byte blocks extracted from these files, and 2048-bit segments of code extracted from the files. Table 1 shows an outline of our datasets. An outline of how to make our datasets is shown in Fig. 5. The methods for making each of our types of dataset are as follows.
File The following procedure is conducted for making the file dataset in the "Program:GitHub" category.
- Gather various C/C++ source code files from GitHub [24]
- Compile these files into x86 object files by using gcc [25]
- Extract only the native code from these object files.
To make the file dataset in the "Program:Ubuntu" category, we extracted program code from the ELF files in the "/bin" and "/sbin" directories of Ubuntu 16.04 using the header information.
Finally, to make the file datasets in the "Others" category, we used a search engine to gather various open-source document files. Table 2 shows the keywords used for this search. We downloaded document files from the beginning of the list of search results. We then checked these downloaded files using VirusTotal [26], and we removed suspicious files that were detected as malware.
Block "Block" datasets are made by extracting 256-byte blocks by random sampling from every file in a "File" dataset. We calculated Shannon entropy rates (Equation (1)) for each block in the "Block" datasets. The distributions of the entropy rates for blocks in the "Program" and "Others" categories are shown in Fig. 6.
In the x86 architecture [8], 15 bytes is basically the maximum length of one instruction. Thus we padded each instruction with null bytes to achieve a fixed length of 16 bytes (one byte larger than the maximum instruction length) and combined 16 of these padded instructions to form a code segment that has a convenient length for our analysis. Although 15 bytes is the basic maximum length of instruction, longer instructions could appear in theory (particularly when the file being interpreted as x86 code is actually a document file). However, we did not find any instruction longer than 15 bytes in our experiment. The results of disassembling x86 native code and a CFB file are shown in Fig. 7. The average of the lengths of each "instruction" is 2.95 bytes for the "Program" category and 2.38 bytes for the "Others" category. As shown in the figure, there are various lengths of instruction in the x86 CPU architecture, which appear to have no regular pattern.
The frequencies of each instruction length in files from each category are shown in Fig. 8.
Fig. 8. Distributions of instruction lengths in files from the "Program" and "Others" categories. The total number of instructions in each category is over 600,000.
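The padding and serialization step can be sketched as follows. The sketch assumes that a disassembler has already split the byte stream into individual instructions (that step is not shown), that bits are emitted most-significant first, and that over-long "instructions" are rejected; all three choices, and the function names, are illustrative assumptions rather than details given in this paper.

```python
INSTR_SLOT_BYTES = 16      # fixed slot: one byte more than the 15-byte maximum
INSTRS_PER_SEGMENT = 16    # 16 slots * 16 bytes * 8 bits = 2048 bits


def pad_instruction(instr: bytes) -> bytes:
    """Right-pad one instruction with null bytes to the fixed slot size."""
    if not 1 <= len(instr) <= 15:
        raise ValueError("expected an instruction of 1 to 15 bytes")
    return instr.ljust(INSTR_SLOT_BYTES, b"\x00")


def build_segment(instructions: list) -> list:
    """Serialize 16 padded instructions into a list of 2048 bit values (0 or 1)."""
    if len(instructions) != INSTRS_PER_SEGMENT:
        raise ValueError("a segment needs exactly 16 instructions")
    padded = b"".join(pad_instruction(i) for i in instructions)
    # Assumption: most significant bit of each byte comes first.
    return [(byte >> (7 - bit)) & 1 for byte in padded for bit in range(8)]


if __name__ == "__main__":
    # Toy input: sixteen one-byte nop instructions (opcode 0x90).
    segment = build_segment([b"\x90"] * 16)
    print(len(segment), segment[:8])
```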

Proposed Visualization Methods
In this section, we introduce three visualization methods: o-glasses (1d-CNN), o-glasses (MLP), and o-glasses (entropy), which are based on a 1d-CNN, an MLP, and entropy, respectively. These methods classify the input block as either "Program" or "Others" and visualize the input block as an image in two colors ("Program" is shown in red, "Others" is shown in green). Fig. 9 shows the result of visualizing "notepad.exe" using our methods and three conventional methods. The details of each method are as follows.

o-glasses (1d-CNN)
First, we consider the o-glasses method based on a 1d-CNN.
Local receptive field for the x86 instruction set We aim to make our model specialize in recognition of native program code. If binary data such as x86 code are input directly into a convolutional layer, the layer cannot identify single instructions as expected, because the input data consist of variable-length instructions serialized as one-dimensional data. We therefore convert the input data into N-bit fixed-length instructions to obtain features of the instructions. Additionally, the kernel field size and the stride should be adjusted to N. We selected 128 (16 bytes) as the value of N because this is a convenient size that is just larger than the maximum size (15 bytes) of an x86 instruction. In the 1d-CNN, the first layer consists of local receptive fields against each instruction. Therefore, it is expected that the next layer obtains the relationships among instructions.
Our 1d-CNN The whole of our 1d-CNN is shown in Fig. 10. We serialize a set of 16 fixed-length instructions into an array of 2048 bit values as input data.
Fig. 10. Outline of our 1d-CNN
The first layer is a convolutional layer (Bit-CNN). We apply 96 filters, each 128 bits wide, to the 2048-bit input volume. Choosing a stride of 128, the output volume is 16 × 96. The second layer is also a convolutional layer (Instruction-CNN). We apply 256 filters of width 2 to the 16 × 96 input volume with a stride of 1. We expect that the second layer will obtain the features of the relationship between two adjacent instructions. Our 1d-CNN does not contain any pooling layer. The 3rd to the 5th layers are fully connected. Their output volumes are 400, 400, and 2, respectively. We add two batch normalization [27] layers before the 1st and 2nd fully connected layers to speed up and stabilize the learning process. After each layer except the last one, we apply a ReLU [28] layer. The ReLU layer applies the function f(u) = max(u, 0) to all of the values in the input volume.
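The layer sizes described above can be traced with a short shape calculation. This is a sketch under stated assumptions: the output width of the second convolution (15) and the flattened size fed to the first fully connected layer (15 × 256 = 3840) are derived here from the formula (W − K)/S + 1 and are not stated explicitly in the text; the helper name is hypothetical.

```python
def conv_output_len(w: int, k: int, s: int) -> int:
    """Number of kernel positions without zero-padding: (W - K) / S + 1."""
    return (w - k) // s + 1


INPUT_BITS = 2048  # 16 instructions * 128 bits

bit_cnn_len = conv_output_len(INPUT_BITS, k=128, s=128)   # one position per instruction
instr_cnn_len = conv_output_len(bit_cnn_len, k=2, s=1)    # pairs of adjacent instructions

shapes = [
    ("input", (INPUT_BITS,)),
    ("Bit-CNN (96 filters, kernel 128, stride 128)", (bit_cnn_len, 96)),
    ("Instruction-CNN (256 filters, kernel 2, stride 1)", (instr_cnn_len, 256)),
    ("flatten (assumed)", (instr_cnn_len * 256,)),
    ("fully connected 1", (400,)),
    ("fully connected 2", (400,)),
    ("fully connected 3 + softmax", (2,)),
]

if __name__ == "__main__":
    for name, shape in shapes:
        print(f"{name}: {shape}")
```

Because the first kernel size and stride both equal the 128-bit slot, each first-layer output position corresponds to exactly one padded instruction.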
The softmax function is used in the final layer of our network:

$$y_k = \frac{\exp(u_k)}{\sum_{j=1}^{K} \exp(u_j)}$$

where $u_k$ is the input to the k-th output unit, $y_k$ is its output, K = 2, and k ∈ {1, 2}. Our network is trained under a cross-entropy regime. The cross-entropy function for one training sample $(x_n, t_n)$ for n ∈ [1, N] is

$$E_n = -\sum_{k=1}^{K} t_{nk} \log y_k(x_n)$$

where the input data is $x_n \in \{0, 1\}^{2048}$, the true label is $t_n \in \{0, 1\}^{K}$, and the number of output units is K. The sum of the errors $E_n$ calculated from each training sample is the total error function E:

$$E = \sum_{n=1}^{N} E_n$$

where the number of samples is N.
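The softmax output and the per-sample cross-entropy error can be sketched in plain Python as follows. The function names are hypothetical, and subtracting the maximum input before exponentiation is a common numerical-stability step that is an assumption of this sketch, not something stated in the paper.

```python
import math


def softmax(u: list) -> list:
    """Softmax over the K output-unit inputs u_1..u_K."""
    m = max(u)                                # stability shift (assumption)
    exps = [math.exp(v - m) for v in u]
    z = sum(exps)
    return [e / z for e in exps]


def cross_entropy(y: list, t: list) -> float:
    """E_n = -sum_k t_k * log(y_k) for a one-hot label t."""
    return -sum(tk * math.log(yk) for tk, yk in zip(t, y) if tk)


if __name__ == "__main__":
    y = softmax([2.0, 0.0])           # toy inputs to the two output units
    print(y, cross_entropy(y, [1, 0]))
```

The total error E is then the sum of `cross_entropy` over all training samples.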

Stochastic gradient descent
We use the stochastic gradient descent (SGD) method to minimize the error function in the backpropagation algorithm. To economize on the computational cost of each iteration, SGD samples a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems. The current weight $w_t$ is updated to $w_{t+1}$ using the following equation:

$$w_{t+1} = w_t - \eta \nabla E_n(w_t)$$

where η is the learning rate.
A compromise between computing the true gradient and the gradient of a single example is to compute the gradient against more than one training example (called a "mini-batch") at each step:

$$w_{t+1} = w_t - \eta \nabla E_{N_m}(w_t)$$

where $E_{N_m}$ is the error computed over the training samples indexed by $N_m$, and $N_m$ is a subset of the index set {1, . . ., N} such that $\bigcup_m N_m = \{1, \ldots, N\}$ and $N_{m_i} \cap N_{m_j} = \emptyset$ for $i \neq j$.
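The mini-batch update can be illustrated on a toy one-parameter problem. This is a sketch, not the training code used in this paper: the quadratic error, the toy data, the learning rate of 0.5, the batch size of 10, and the choice to average the gradient over each mini-batch are all assumptions made for the example.

```python
import random


def make_minibatches(n_samples: int, batch_size: int, rng: random.Random) -> list:
    """Shuffle the indices and split them into disjoint subsets covering all samples."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, n_samples, batch_size)]


def grad_sample(w: float, x: float, t: float) -> float:
    """Gradient of the toy error E_n = 0.5 * (w * x - t)**2 with respect to w."""
    return (w * x - t) * x


# Toy data (hypothetical): targets follow t = 3 * x, so the optimum is w = 3.
xs = [i / 100 for i in range(1, 101)]
ts = [3.0 * x for x in xs]

rng = random.Random(0)
w, eta = 0.0, 0.5
for epoch in range(200):
    for batch in make_minibatches(len(xs), 10, rng):
        # Mean gradient over the mini-batch (averaging is an assumption).
        g = sum(grad_sample(w, xs[n], ts[n]) for n in batch) / len(batch)
        w = w - eta * g   # w_{t+1} = w_t - eta * gradient

if __name__ == "__main__":
    print(w)
```

Reshuffling at the start of every epoch gives a fresh partition of the index set each time, which matches the disjoint-and-covering condition on the mini-batches.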

o-glasses (MLP)
Like the previous method, this method focuses on each block of the target file. The input data size is one block (a 256-byte sequence), and the block shift is 1 byte. The network containing hidden layers and the output layer is the same as fully connected layers 3-5, shown in Fig. 10, for the 1d-CNN (see Section 5.1).

o-glasses (Entropy)
The method detects program code based on whether the entropy of the block lies within a given range. When appropriate range criteria are selected, this method achieves reasonable accuracy in the detection of program code.
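The range test itself is a one-line rule, sketched below. The bounds `LOW` and `HIGH` are placeholders chosen only for illustration; the range actually selected in our experiments is the one reported with the results and is not reproduced here.

```python
def classify_by_entropy(rate: float, low: float, high: float) -> str:
    """Label a block "Program" if its entropy rate lies in [low, high], else "Others"."""
    return "Program" if low <= rate <= high else "Others"


# Hypothetical bounds for illustration only.
LOW, HIGH = 0.60, 0.85

if __name__ == "__main__":
    for rate in (0.10, 0.72, 0.99):   # toy entropy rates
        print(rate, classify_by_entropy(rate, LOW, HIGH))
```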

Recognition Performance
We investigated the detection rates of program code by our methods using the training datasets described in Section 4. Table 3 shows an overview of the results. In the comparison of the different algorithms, we use the F-measure defined by

$$F\text{-measure} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

In this calculation, precision is given by

$$\mathrm{precision} = \frac{TP}{TP + FP}$$

and recall is given by

$$\mathrm{recall} = \frac{TP}{TP + FN}$$

where TP is the true positive rate, FP is the false positive rate, FN is the false negative rate, and TN is the true negative rate (see Table 4).
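The F-measure calculation can be sketched as follows. The function name is hypothetical, the zero-denominator handling is an assumption of this sketch, and the counts in the usage example are made up for illustration; they are not results from this paper.

```python
def f_measure(tp: float, fp: float, fn: float) -> tuple:
    """Return (F-measure, precision, recall) from TP, FP, and FN."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0, precision, recall
    f = 2.0 * precision * recall / (precision + recall)
    return f, precision, recall


if __name__ == "__main__":
    # Made-up counts: 90 true positives, 10 false positives, 30 false negatives.
    print(f_measure(tp=90, fp=10, fn=30))
```

Note that TN does not enter the calculation, so the F-measure is insensitive to how many "Others" blocks are correctly rejected.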
o-glasses (Entropy) We examined many ranges for the entropy rate-based binary classifier and chose the range that gives the maximum F-measure for the training dataset. The F-measure for entropy in Table 3 was calculated using this "range" parameter (also shown in the table) against the test dataset.
o-glasses (MLP) and o-glasses (1d-CNN) To train and test our network, 10-fold cross-validation was used. After 200 epochs, we calculated the F-measure, the precision, and the recall of the test data.
Here is our parameter configuration: learning rate (η) = 0.001 and mini-batch size = 100. The learning curves of the error are shown in Figs. 11 and 12.

Experiments with Malicious Documents
In this section, we visualize three malicious document files to discuss the effectiveness of our methods. The first malicious document file contains 127 bytes of x86 code. The second malicious document file contains 29 bytes of x86 code. The third malicious document file does not use vulnerabilities, and does not have any x86 code. These files are referred to in the following discussion as File 1, File 2, and File 3, respectively. The parameters used in these experiments are the same as those described in the previous section. After 200 epochs of training using all our datasets, we visualized the three files.
File 1: CVE-2014-7247 File 1 contains a compressed executable. If we can analyze the file dynamically, it is easy to output the executable. However, this file is a .jtd document file for Ichitaro, which is Japanese word processing software similar to Microsoft Word. The old version of Ichitaro had a vulnerability called CVE-2014-7247, which this document targets. So, we need the old version for dynamic analysis. When we do not have the old version, we must find the decoder for the executable file to output it. Therefore, we need to find the shellcode.
This document file contains 127 bytes of x86 code. The code is split into two sequences; the size of the first sequence is 77 bytes, and the size of the second sequence is 50 bytes. The first sequence contains code for jumping to the second sequence.
Fig. 13 shows the result of visualizing File 1. The o-glasses (1d-CNN) method shows an x86 code sequence at almost the same location as the first sequence. However, the method could not locate the second sequence.
Fig. 14 shows the results of visualizing File 2. The o-glasses (1d-CNN) method could not locate the decoder. However, it found a sequence of "nop" instructions located just before the decoder.
File 3: VBA script downloader Unlike the other two files, File 3 does not contain any executable file. Additionally, this document file does not attack any vulnerabilities. Instead, a VBA script in this document file downloads an executable file from the internet and runs it. Therefore, this document file does not contain any x86 code. As shown in Fig. 15, o-glasses (1d-CNN) correctly reports no x86 code in this document, while the other methods report many false-positive blocks. Thus, human examiners can confidently focus on the positive blocks reported by o-glasses (1d-CNN) to search for real shellcode in malicious documents.

Discussion
In this section, we discuss the usage and limitations of our methods, and areas for future work.
Easily Collectible Training Datasets One of the most significant problems in using machine learning is how to prepare the training dataset. Even an excellent model cannot demonstrate its performance without large samples. Many studies of malware using machine learning have struggled to collect samples because they need hard-to-collect malware. In contrast, our approach does not need malware for the training dataset. Since all we need to collect is x86 code and normal document files, it is possible for anyone to create training datasets from easily accessible sources. Surprisingly, in spite of this fact, our method, o-glasses (1d-CNN), can find the locations of shellcode almost exactly. Therefore, our proposed methods suggest a possible beneficial effect for professional malware analysis. On the other hand, some shellcode is known to contain garbage code. We did not consider such cases, and therefore our dataset needs to be improved.
Fig. 14. Results of visualizing File 2 using various methods
High Recognition Rate for x86 Code In this paper, we have presented a method of recognizing program code in document files using a 1d-CNN. Using a local receptive field and weight sharing, our 1d-CNN can capture important features of instructions. Thus, even if the input instruction sequence is shifted, our network can recognize program code with a high degree of success, as measured by the F-measure.
The result of our experiments, inputting 16 opcodes into our network, is that the F-measure reaches about 99.95%. While this value seems to be very high at first glance, it means that, when the target file size is 100 KB, about 50 bytes of noise is generated in the visualization result. When looking for a small program like shellcode, this noise becomes an obstacle to analysis. Although our method has already achieved a real-use level of performance for human analysts, it still needs further improvement for automatic shellcode detection.
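The noise estimate can be read as the following back-of-the-envelope calculation. It assumes, as the text implies, that 1 − F is treated as an approximate per-offset misclassification rate, that every byte offset is classified once (a block shift of 1 byte), and that 100 KB is taken as 100,000 bytes:

$$(1 - 0.9995) \times 100{,}000 \ \text{bytes} = 0.0005 \times 100{,}000 \ \text{bytes} = 50 \ \text{bytes}.$$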
Visual Analysis to Support Analysts In this paper, we visualized several malicious document files and showed that we could find some small programs like shellcode. Furthermore, in the case of a document file which does not contain any x86 native code, other methods do not provide convincing evidence that x86 native code is not present. But, by using our method, we can be fairly confident that a file does not contain x86 native code.
However, some malicious document files do not contain x86 native code, but contain interpreted code such as JavaScript. Our methods do not cover such files. For these files, it is necessary to analyze the malware by another method, which may be combined with our o-glasses (1d-CNN) method.

Conclusion
In this paper, we proposed a 1d-CNN for detecting program code in document files. We observed that a local receptive field for a 128-bit fixed-length instruction is effectively formed in the first layer of our network. We can balance both a high precision rate and a high recall rate for detecting program code by using our network. Our network can narrow down a target for human static analysis of unknown malware. Future work includes increasing the number of malicious document files used to check the validity of our proposed method. Another task is to combine our network with various analysis methods for unknown malware.

Fig. 1. Typical structure and execution process of a malicious document

Fig. 2. An example of a small decoder with a 29-byte sequence. It contains only 17 opcodes, and it decodes the body of the shellcode with 1-byte-key xor encoding ("0xCC" in this example).

Fig. 3. Schematic diagrams of a CNN and an MLP

Fig. 6. Distributions of the Shannon entropy rate for blocks in the two categories of dataset

Fig. 9. The result of visualizing "notepad.exe." In the case of the grayscale image, we adopted the conventional conversion techniques [9,10] except for fixing a 128-pixel (byte) image width. In the case of the structural entropy image, we selected 256 bytes for the block size. Our methods classify the input block as either "Program" (red) or "Others" (green). The block sizes are 256 in o-glasses (entropy) and o-glasses (MLP), and 16 instructions in o-glasses (1d-CNN). The block shifts are 1 byte in all the methods.
Figs. 11 and 12. The blue areas indicate the range of possible values of the training errors in the 10-fold cross-validation process. The solid blue lines indicate the average of the training errors in the 10-fold cross-validation process. The red areas indicate the range of possible values of the test errors in the 10-fold cross-validation process. The solid red lines indicate the average of the test errors in the 10-fold cross-validation process. From these figures, it can be seen that our 1d-CNN method does not cause over-fitting.

File 2: CVE-2012-0158 File 2 contains an executable file encoded with a 2-byte-key xor. This document file is a Word (.doc) document file and attacks a vulnerability called CVE-2012-0158. This document file contains only 29 bytes of x86 code. This is the code that we mentioned in the Introduction (see also Fig. 14).

Table 1. Number of elements in each of our datasets

Table 3. Performance of our methods to detect x86 code