Introduction
Since the birth of software, bugs and vulnerabilities have followed it like a shadow. The continuous growth of software scale brings great challenges to testing. For complex, systematic software such as Sensor-Cloud Systems and Applications (SCSA), vulnerability testing requires a huge amount of manpower, which makes testing and monitoring the diverse applications of such systems very difficult. To relieve the pressure on developers when testing complex software, many automated testing methods have been proposed. In this paper, we focus on a simple and efficient automatic vulnerability detection method: fuzzing. Fuzzing finds software vulnerabilities by constructing test inputs and running a large number of tests against the software. The software system that performs fuzz testing is called a fuzzer.
Among the many fuzzers, mutation-based fuzzers such as American Fuzzy Lop (AFL) [1] can achieve high efficiency. They commonly leverage random mutation strategies to produce test inputs and evaluate input quality through feedback. However, their knowledge-blind mutation strategies cannot produce syntactically and semantically valid test inputs, so they struggle to explore deep vulnerabilities and to handle highly structured inputs.
In real software systems, it is very common for programs to require complex, structured inputs. Take XML and JavaScript (JS) for instance: XML files are widely used to transfer data between SCSA applications, and JS code is the foundation of the interaction interfaces of SCSA software. For such complex inputs, the parser checks syntactic and semantic validity before execution. Therefore, grammatical reliability is key to generating valid inputs, and researchers are inclined to take advantage of prior knowledge such as Context-Free Grammars (CFGs) to guide mutation. In the process of fuzzing, grammars are often utilized to generate valid initial seeds (e.g., [30], [33]) or to guide mutation (e.g., [31], [32], [36]).
An input is converted into an AST at the early parsing stage of common interpreters and engines, and only inputs that pass the parsing stage can be executed. Hence, existing grammar-based fuzzers (GBFs) (e.g., [31], [32]) implement strict control over the grammatical reliability of mutation based on ASTs. Usually, the units of GBFs' mutation strategies are internal nodes along with their subtrees; they perform mutation by replacing, splicing, deleting, and other subtree operations. In this paper, we focus on mutation-based greybox fuzzing with grammar guidance.
A. Problems and Challenges
Most grammar-based mutation operations require a reliable subtree source. Replacement mutation, for example, takes a subtree as the source to replace the original one. Existing GBFs generally obtain this source by randomly selecting a seed from the seed queue and parsing it into an AST ([31], [32]). Consequently, the subtree sources obtained this way are random in quality, small in quantity, and not strictly controlled in context, which leads to great fluctuations in the effect of fuzzing, as evaluated in Section IV. Ensuring abundant, high-quality subtree sources during mutation is thus an urgent problem (challenge 1). With a reliable source of subtrees, we can ensure the diversity and correctness of mutation results, thus improving the effect of fuzzing.
At the same time, existing GBFs pay more attention to methods of mutation but discuss little about where and how many times to mutate (i.e., mutation priority and frequency). Yet in classic mutation-based fuzzers, the influence of mutation priority and frequency has been shown to be significant ([1], [19], [21]). Many mutation-based fuzzers have developed approaches to optimize the power schedule, such as reducing the search space of inputs ([16], [19], [21], [23]) or modeling the power schedule process ([17], [18], [24]).
GBFs like NAUTILUS [31] and Superion [32] also leverage efficient fuzzing strategies such as coverage feedback to guide mutation and achieve better efficiency. However, their application of these strategies remains shallow: they keep guiding and optimizing strategies separate from grammar-based mutation methods instead of combining them organically. Some guiding and optimizing methods that have proved efficient during mutation, e.g., mutation priority and power schedule strategies, remain unexplored in GBFs. Hence, we introduce these two methods into GBFs to improve fuzzing efficiency. In doing so, we encounter the following challenges:
In existing mutation-based fuzzers, the mutation order of bytes in a test input is determined by a predefined priority, typically computed via taint tracking [19] or data flow analysis [21]. This ignores grammatical connections between bytes, which are often difficult to discover by taint tracking or data flow analysis alone. For example, in an XML declaration like <?xml version="1.0" encoding="utf-8"?>, the tokens <?xml and ?> are grammatically connected, and mutating only one of them may lead to syntax errors; performing mutations on syntax-invalid inputs is a waste of time. Traditional analysis methods generally target consecutive bytes ([19], [21]), hence it is hard for them to find such connections.
Therefore, byte-level prioritization is cumbersome, complex, and unusable for complex, highly structured inputs. We need a more reasonable and efficient granularity for prioritization than the byte level, together with a well-defined, grammar-oriented criterion of priority (challenge 2).
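As a concrete illustration of this grammatical pairing, the following sketch uses Python's standard XML parser as a stand-in for a target program's parser (the snippet is illustrative, not from the paper): corrupting one half of the paired tokens <?xml and ?> produces an input that is rejected at the parsing stage, so any effort spent mutating it further is wasted.

```python
# Byte-level mutation on grammar-linked tokens: "<?xml" and "?>" must
# appear as a pair, so corrupting either half yields a syntactically
# invalid input that the parser rejects before deeper logic is reached.
import xml.etree.ElementTree as ET

valid = '<?xml version="1.0" encoding="utf-8"?><root/>'
ET.fromstring(valid)  # parses fine: the paired tokens are intact

# Flip one byte inside the closing "?>" of the declaration.
mutated = valid.replace("?>", "!>")
try:
    ET.fromstring(mutated)
    rejected = False
except ET.ParseError:
    rejected = True
print(rejected)  # the mutated input never gets past the parsing stage
```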
Power schedule strategies are commonly leveraged to determine the number of mutations applied to a test input file. In grammar-oriented approaches, however, each AST internal node has a different meaning given by the grammar. If the AST is treated as a whole during energy assignment, energy is likely to be distributed unfairly across nodes. For example, in Fig. 1, mutating the chardata node will not bring as much effect as mutating the element or content node, because in the XML grammar [5], non-terminals like chardata do not have production rules as complex and varied as the latter two. To maximize the mutation benefit, how to define the energy of different AST nodes is also challenging (challenge 3).
B. The Proposed Approach
To solve the aforementioned challenges, we propose a novel grammar-based mutation strategy focusing on the subtree source and power schedule. We implement it based on a fuzzing framework that combines initial seed generation with effective mutation.
To solve the problem of the insufficient subtree source of existing grammar-based mutation methods in challenge 1, we propose a subtree-pool-based random mutation strategy. In the process of fuzzing, we construct and incrementally update a subtree pool, which provides abundant random sources during mutation. Simultaneously, we propose an AST-node-level power schedule to solve the problems of defining priority and assigning energy in a GBF in challenge 2 and challenge 3. We bring priority mutation and power schedule down to the AST-node level: for each node, we assign energy according to its rarity and influence, which determines its mutation priority and frequency. In summary, our contributions are as follows:
We propose a subtree pool so that grammar-based mutation has a more abundant and reliable subtree source.
We introduce priority mutation and power schedule strategies into GBFs at the AST-node level.
Exhaustive experiments on real programs that handle XML and JavaScript inputs demonstrate that our method achieves a notable improvement in efficiency.
Background
In this section, we will first introduce the main categories and process of greybox fuzzing. Then we will present the related grammars commonly used for input generation.
A. Greybox Fuzzing
Fuzzing can be divided into whitebox (e.g. [2], [3]), greybox (e.g. [1], [16], [19]), or blackbox (e.g. [12]) based on how much it utilizes the internal structure of target programs. Also, depending on the method of producing test inputs, fuzzing can be classified as mutation-based (e.g. [1], [19], [23], [32]) or generation-based (e.g. [26], [29], [30]).
Among the many greybox fuzzing methods, AFL is the most widely used and studied; hence most coverage-guided, mutation-based fuzzers follow its overall architecture, as shown in Fig. 2. This architecture has proved efficient [32]. It takes a seed corpus as input and maintains a seed queue initialized from the initial seeds. In the process of fuzzing, it selects a seed from the queue, mutates it to generate test inputs, executes them, and adds the inputs that exhibit new behavior (e.g., new coverage) back to the queue.
GBFs are good at avoiding the aforementioned problem. They leverage predefined grammars to guide mutation ([31], [32], [36]) or generation ([30], [33]), so that the generated inputs have a high probability of passing the syntax and semantic checks of the parser, decreasing the number of invalid inputs and improving the fuzzing result.
B. Related Grammar
Grammars are very effective prior knowledge, defining the organizational rules and structure of a programming language. For grammar-blind fuzzers, it is not trivial to process the highly structured test inputs defined by grammars. Fortunately, many complex inputs (e.g., programming languages) can be described using CFGs, which can be obtained easily from public documentation [5]; this makes it possible to fuzz with grammars. Generally speaking, a CFG is a tuple $G = (N, T, R, S)$, where:
$N$ is a finite set of non-terminal symbols or variables that represent different types of phrases or clauses in a sentence. $T$ is a finite set of terminal symbols that constitute the actual content of a sentence, where $N \cap T = \varnothing$. $R$ is a finite set whose members are the production rules of the grammar; a production rule has the form $A \rightarrow a$, where $A \in N$ and $a \in (N \cup T)^{*}$. $S$ is the start symbol, leveraged to represent the entire sentence (or program), where $S \in N$.
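To make the tuple concrete, here is a toy encoding of $G = (N, T, R, S)$ for a tiny XML-like fragment, together with one leftmost derivation; the grammar and symbol names are illustrative, not taken from the paper.

```python
# A toy CFG G = (N, T, R, S) for a tiny XML-like language.
N = {"element", "content", "chardata"}          # non-terminal symbols
T = {"<a>", "</a>", "hello"}                    # terminal symbols
R = {                                           # production rules A -> a
    "element": [["<a>", "content", "</a>"]],
    "content": [["chardata"], ["element"]],
    "chardata": [["hello"]],
}
S = "element"                                   # start symbol

assert N.isdisjoint(T)   # the definition requires N ∩ T = ∅
assert S in N            # and S ∈ N

def derive_leftmost(symbols):
    # expand the leftmost non-terminal using its first production rule
    for i, s in enumerate(symbols):
        if s in N:
            return symbols[:i] + R[s][0] + symbols[i + 1:]
    return symbols  # all terminals: derivation finished

# element => <a> content </a> => <a> chardata </a> => <a> hello </a>
sent = [S]
while any(s in N for s in sent):
    sent = derive_leftmost(sent)
print(" ".join(sent))  # -> <a> hello </a>
```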
Although a CFG is capable of producing most complex inputs, because of its context-free nature, inputs generated through a CFG have great difficulty passing the semantic checks of a parser. Therefore, Wang et al. [30] proposed the Probabilistic Context-Sensitive Grammar (PCSG), with a shape of $G_{p} = (G_{cs}, q)$:
$G_{cs}$ is a Context-Sensitive Grammar (CSG) built on the basis of a CFG. It is a tuple $(N, T, R_{c}, S)$, whose constituent elements are defined the same way as in a CFG, except for $R_{c}$. The production rules that make up $R_{c}$ are in the form $[c]A \rightarrow a$, where $[c]$ is a predefined context shaped like [type of A's great-grandparent, type of A's grandparent, type of A's parent, value of A's first sibling or type of A's first sibling if the value is null]. $q$ assigns each production rule a probability and ensures that, for a given non-terminal symbol, the probabilities of the applicable production rules under all contexts sum to one [30].
Our Approach
To address the challenges mentioned in Section I-A, we propose a novel grammar-aware mutation strategy with subtree-pool-based random mutation and an AST-node-level power schedule. We implement our strategy on a grammar-aware greybox fuzzing prototype, which combines efficient coverage feedback with valid seed generation.
A. Overview
The overall structure of our prototype, as shown in Fig. 3, is divided into two main modules: generator and fuzzer. The generator takes as input one CFG and a seed corpus corresponding to the target program. It learns a PCSG from the corpus, which is leveraged to generate syntactically and semantically valid seeds for the fuzzer. We then utilize the overall structure of AFL [1] as the main body of the fuzzer, in which we implement our power schedule and subtree pool strategies. Note that the PCSG model constructed by the generator is also used as part of the mutation strategy (Section III-C2) to guide the power schedule (light dotted line in Fig. 3).
B. Generator
We construct the generator based on an improved and streamlined version of the PCSG model in Skyfire [30]: we fine-tune the probabilistic definition and shorten the whole process. As shown in the generator part of Fig. 3, we divide the generation process into PCSG model learning and valid-seed generation. We first analyze the PCSG model construction along with the adjustments we made, and then describe the seed generation process.
1) PCSG Construction
In the process of PCSG learning, we take a collected corpus and one CFG as input. For every file (e.g., an XML file or a JS code file) in the corpus, we parse it into an AST and gather information such as the context and frequency of each production rule to construct the PCSG model.
The PCSG model proposed in Skyfire [30] leverages statistical probability to measure the rarity of each production rule, as presented in Formula 1, where $count([c]R_{k})$ is the number of occurrences of production rule $R_{k}$ under context $[c]$ in the corpus and $count(A)$ is the number of occurrences of the non-terminal $A$: \begin{equation*} P([c]R_{k}) = \frac {count([c]R_{k})}{count(A)} \tag{1}\end{equation*}
This is a prior probability: knowledge acquired without, or before, observing a predefined fact. Here, in the process of seed generation, the predefined fact is the context, and what we actually need is the probability of each production rule with its context known (i.e., a posterior probability). Formula 2 presents the posterior probability based on Bayes' theorem. \begin{equation*} P(R_{k}|[c]) =\frac {P(R_{k})P([c]|R_{k})}{P([c])} \tag{2}\end{equation*}
Compared with the probability of Formula 1, the posterior probability can be regarded as a refinement of it. Since the posterior probability includes more information about usage scenarios, it better reflects the underlying truth of the data generation process.
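The relationship between the two formulas can be checked numerically. The sketch below uses toy corpus counts (not from the paper) and verifies that the Bayes form of Formula 2 agrees with directly conditioning the counts on the context.

```python
# Toy counts: occurrences of (context, rule) pairs for one non-terminal A
# with rules R1, R2 and contexts c1, c2.
from collections import Counter

counts = Counter({("c1", "R1"): 8, ("c1", "R2"): 2,
                  ("c2", "R1"): 1, ("c2", "R2"): 9})
count_A = sum(counts.values())  # total occurrences of A

def prior(rule):                # P(R_k) = count(R_k) / count(A)
    return sum(v for (c, r), v in counts.items() if r == rule) / count_A

def likelihood(ctx, rule):      # P([c] | R_k)
    total_r = sum(v for (c, r), v in counts.items() if r == rule)
    return counts[(ctx, rule)] / total_r

def evidence(ctx):              # P([c])
    return sum(v for (c, r), v in counts.items() if c == ctx) / count_A

def posterior(rule, ctx):       # Formula 2: Bayes' theorem
    return prior(rule) * likelihood(ctx, rule) / evidence(ctx)

# The Bayes form reduces to count([c]R_k) / count([c]): direct conditioning.
direct = counts[("c1", "R1")] / (counts[("c1", "R1")] + counts[("c1", "R2")])
print(round(posterior("R1", "c1"), 3), round(direct, 3))  # both 0.8
```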
2) Seed Generation Process
In the process of generation, we take advantage of leftmost derivation. It begins from the start symbol of the PCSG and iteratively expands the leftmost non-terminal with a selected production rule at each step. The probability of Formula 2 determines which production rule is chosen to replace the original non-terminal: the smaller the posterior probability of a rule in the PCSG model, the more likely it is to be selected. The goal of this strategy is to select rarely used rules so as to trigger deeper bugs [30].
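The generation step above can be sketched as follows. The grammar and the posterior probabilities are illustrative assumptions, not the paper's learned model; the point is the inverse-probability weighting: rules with a smaller posterior get a larger selection weight.

```python
# Probabilistic leftmost derivation: expand the leftmost non-terminal,
# preferring rules with a low posterior probability (rare rules).
import random

rules = {  # non-terminal -> list of (expansion, posterior probability)
    "S": [(["<a>", "B", "</a>"], 0.9), (["<a>", "</a>"], 0.1)],
    "B": [(["text"], 0.7), (["S"], 0.3)],
}

def pick_rare(options, rng):
    # weight each rule by (1 - p): the smaller the posterior,
    # the more likely the rule is selected
    weights = [1.0 - p for _, p in options]
    return rng.choices([e for e, _ in options], weights=weights)[0]

def generate(rng, max_steps=50):
    sent = ["S"]
    for _ in range(max_steps):
        idx = next((i for i, s in enumerate(sent) if s in rules), None)
        if idx is None:
            break  # all symbols are terminals: derivation finished
        sent = sent[:idx] + pick_rare(rules[sent[idx]], rng) + sent[idx + 1:]
    return " ".join(sent)

print(generate(random.Random(0)))
```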
C. Fuzzer
In the fuzzer module of Fig. 3, we follow the classic AFL processing architecture. Initially, we instrument the target source code to generate an instrumented binary (①) and run it (②) using the valid seed inputs produced by the generator. Then, we construct and update the seed queue (③), and select a seed from the queue for further mutation (④). In the mutation step (⑤), we integrate two grammar-based strategies: subtree-pool-based random mutation and the AST-node-level power schedule. The mutated test input is then executed on the instrumented binary (⑥) and the queue is iteratively updated (③) based on the result of the execution (i.e., coverage feedback). The two mutation strategies are described in detail as follows.
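The loop of steps ① through ⑥ can be summarized in a highly simplified sketch. The `run_target` and `mutate` functions below are placeholders standing in for the instrumented binary and the grammar-based mutation step; they are illustrative assumptions, not the paper's implementation.

```python
# A skeletal coverage-guided fuzzing loop in the AFL style.
import random

def run_target(data):
    # stand-in for executing the instrumented binary (steps 2/6);
    # "edge coverage" is faked as the set of distinct characters
    return set(data)

def mutate(seed, rng):
    # stand-in for the grammar-based mutation step (5)
    i = rng.randrange(len(seed))
    return seed[:i] + rng.choice("abc<>/x") + seed[i + 1:]

def fuzz(initial_seeds, rounds=200, seed=0):
    rng = random.Random(seed)
    queue = list(initial_seeds)        # step 3: seed queue
    global_cov = set()
    for s in queue:                    # step 2: run the initial seeds
        global_cov |= run_target(s)
    for _ in range(rounds):
        s = rng.choice(queue)          # step 4: select a seed
        test = mutate(s, rng)          # step 5: mutate it
        cov = run_target(test)         # step 6: execute the test input
        if not cov <= global_cov:      # interesting: new coverage found
            global_cov |= cov
            queue.append(test)         # step 3: update the queue
    return queue, global_cov

queue, coverage = fuzz(["<a>x</a>"])
print(len(queue), sorted(coverage))
```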
1) Subtree-Pool-Based Random Mutation
In the mutation process, we select a seed from the queue and mutate it to produce multiple test inputs; this process is called a round of mutation. In a round of mutation, we first parse the selected seed into an AST and then perform mutation operations on it. Particularly, for each internal node $n$, we perform one of the following two mutation operations.
Replacement mutation selects a matching subtree from the subtree pool and replaces the original subtree of $n$ with it. While mutating, we maintain a subtree pool, shown in Fig. 4. We design a subtree set for each type of node (i.e., non-terminal symbol) under each context, and treat these sets as the source of subtrees for replacement mutation. As shown in Algorithm 1, we first record the type $t$ and context $c$ of $n$, then randomly choose from the subtree pool a subtree whose root node has type $t$ and a context consistent with $c$, and replace the subtree of $n$ with it. The advantage of the subtree pool is that it not only increases the randomness of replacement mutation but also ensures the semantic correctness of mutated test inputs. Among the mutation methods designed by predecessors ([31], [32]), replacement mutation commonly selects another seed S' from the seed queue, parses it into an AST A', and takes A' as the replacement source. Such methods cannot guarantee randomness and semantic validity at the same time, since there is no strategy ensuring that a subtree fitting the current context $c$ can be found in a randomly selected seed S'; moreover, the subtree search space in the subtree pool is much larger than that in S'.
Havoc mutation performs the havoc strategy of AFL on the string form of $n$ and its subtree. It randomly combines bitflip, arithmetic mutation, interesting values, etc., which is a highly stochastic approach. We take this strategy only when the subtree pool cannot provide a sufficient number of subtrees with root type $t$ and context $c$ as sources.
Subtree pool architecture. While mutating a test input, we first convert it to an AST; for each internal node, a replacement subtree is drawn from the pool set that matches the node's type and context.
Algorithm 1 Subtree-Pool-Based Random Mutation
Input: s: test input to be mutated; pool: subtree pool
Output: M: a set of mutated test inputs
1: A ← parse(s)
2: if parse error then
3:   return ∅
4: end if
5: /* assign energy to each internal node */
6: for all internal node n in A do
7:   compute Energy(n) by Formula 3
8: end for
9: /* mutate nodes according to their energy */
10: for internal node n in A do
11:   for i ← 1 to Energy(n) do
12:     if the pool holds enough subtrees matching the type t and context c of n then
13:       replace the subtree of n with a random matching subtree from the pool
14:     else
15:       perform havoc mutation on the string form of n and its subtree
16:     end if
17:     add the mutated input to M
18:   end for
19: end for
20: return M
Throughout the fuzzing process, we incrementally maintain the subtree pool. Specifically, at the end of each round of mutation, before we put an interesting test input into the seed queue, we parse it, extract the subtree of each internal node along with its type and context, and add them to the subtree pool.
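The pool described above is essentially a map from (node type, context) to a set of subtrees, grown incrementally as interesting inputs are found. The following sketch shows one possible shape of such a structure; the class name, `min_size` threshold, and string-valued subtrees are illustrative assumptions.

```python
# A subtree pool keyed by (node type, context), supporting incremental
# updates and context-consistent random draws for replacement mutation.
import random
from collections import defaultdict

class SubtreePool:
    def __init__(self, min_size=1):
        self.sets = defaultdict(list)  # (type, context) -> list of subtrees
        self.min_size = min_size       # fewer than this -> fall back to havoc

    def add(self, node_type, context, subtree):
        # called when an interesting input enters the seed queue,
        # so the pool grows throughout fuzzing
        self.sets[(node_type, context)].append(subtree)

    def draw(self, node_type, context, rng):
        # return a replacement subtree, or None when the pool cannot
        # supply enough sources (the caller then uses havoc mutation)
        candidates = self.sets.get((node_type, context), [])
        if len(candidates) < self.min_size:
            return None
        return rng.choice(candidates)

pool = SubtreePool()
ctx = ("document", "content", "element", None)  # illustrative context tuple
pool.add("element", ctx, "<b>1</b>")
pool.add("element", ctx, "<c/>")
rng = random.Random(0)
sub = pool.draw("element", ctx, rng)
print(sub in {"<b>1</b>", "<c/>"})  # a context-consistent subtree was found
```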
2) AST-Node-Level Power Schedule
In guided mutation-based fuzzers, the power schedule is conventionally leveraged to determine the number of times a particular test input is fuzzed [13] (i.e., at the file level). We are the first to introduce the idea of power schedule into an AST-based mutation approach, accurate down to the node level. More specifically, we assign energy to each internal node of the AST; this energy determines how many times that node is mutated within a round of mutation.
In our approach, the energy assigned to each node determines its importance in a round of mutation, that is, its priority among nodes and its number of mutations. We design the AST-node-level power schedule from the observation that the benefits of mutating different nodes are not equal. For instance, in the example of Fig. 1, mutating the chardata node will not bring as much effect as mutating the element or content node, for it is less important than the latter two. We measure the importance of each node along two dimensions: rarity and influence.
For an internal node $n$, we compute its energy as \begin{align*} Energy(n)=&a\times Rarity + b \times Influence \\=&{U \times \left[{a \times (1-p)+b \times \left({\frac {impact_{cov}}{d}}\right)}\right] } \tag{3}\end{align*}
Here $U$ is a base energy unit, and $a$ and $b$ are weight coefficients that balance the two criteria; $p$, $impact_{cov}$, and $d$ are defined below.
Rarity describes how uncommon a node is from the perspective of production rules. Typically, an AST internal node and its children conform to one production rule of the PCSG. In Section III-B, we assign a posterior probability to each production rule under a particular context; starting from there, the posterior probability $p$ of the production rule that an internal node and its children comply with in the current context serves as the rarity measure of the node. If the probability $p$ of a rule is lower, the rule is less used in the current context [30], and mutations on its corresponding node are more likely to find potential vulnerabilities. Therefore, we assign more mutation energy to rarer nodes.
Influence describes how much impact a node, along with its subtree, has across the entire AST, measured from a coverage-based perspective. We traverse the entire AST, evaluate the coverage that each node can affect, and record it as $impact_{cov}$. Specifically, during the traversal of an AST, we trim one internal node and its corresponding subtree at a time [32], observe the difference in the AST's coverage before and after trimming, and treat it as $impact_{cov}$. Obviously, the $impact_{cov}$ of nodes at lower depth (such as the root node) will be larger, hence we also consider the depth $d$ when calculating the total influence. Generally speaking, for an internal node, the smaller its $d$ and the larger its $impact_{cov}$, the more mutation energy is assigned to that node.
Implementation and Evaluation
We implement our design and strategies in a C/C++ prototype following the overall structure of AFL [1] and Skyfire [30]. Particularly, for a given CFG [5], we utilize the open-source tool ANTLR v4 [4] to automatically generate the parser and lexer, so that we can parse inputs into ASTs and traverse them easily. We publish our project on https://github.com/J-Daniel/FSA.
A. Evaluation Setup
1) Target Programs
To evaluate the efficiency of our proposed approach, we choose XML and JavaScript engines as target programs. They both handle complex and structured inputs that can be defined through CFG grammars.
XML is a markup language designed to structure, store, and transmit information. It is the most commonly used format for transferring data between SCSA applications, and it is becoming increasingly popular in the field of information storage and description. We select a series of real-world XML engines, including the commonly used libxml 2.9.2 [6], Libplist 2.2.0 [7], Libxslt 1.1.31 [8], and Sablotron [9], as our test targets.
JavaScript is a lightweight, interpreted (or just-in-time compiled) programming language with first-class functions. It is commonly used to build web applications in the application layer of SCSA. As a programming language, JS has a more sophisticated CFG than XML, making its test inputs more intricate in behavior. We choose the lightweight JerryScript engine 2.4.0 [10] and the open-source WebKit engine [11] as the target programs for testing JS.
2) Experiment Environment
We perform fuzz testing and evaluate our prototype on a Debian 8.3.0 server with 4-core CPUs and 2 GB RAM. CMake 3.16.0 and Clang 7.0.1 are used.
3) Research Questions
We set the following research questions and answer them in the following evaluation.
RQ1: How does our method perform compared with other state-of-the-art fuzzers?
RQ2: Can the subtree pool provide a rich enough subtree resource base for mutation?
RQ3: How to balance rarity and influence in AST-node-level Power Schedule?
B. RQ1: Overall Performance
We choose open source fuzzers AFL [1] and Superion [32] for comparison evaluation. AFL is the most widely used mutation-based fuzzer. Superion is a state-of-the-art fuzzer based on grammar-oriented mutation and can both achieve stable syntactic accuracy and fuzzing efficiency.
As shown in Fig. 5, the fuzzing results and efficiency of our prototype improve to a certain extent on the tested programs compared with our competitors. Here, FSA in the figures refers to Fuzzing with Subtree-pool-based random mutation and AST-node-level power schedule.
Overall performance of our strategies over 6 real world programs, here FSA represents Fuzzing with Subtree-pool-based random mutation and AST-node-level power schedule.
FSA is a big improvement over AFL, mainly because the chosen programs handle complex, structured inputs like XML and JS code. This also demonstrates the grammar blindness of AFL, which cannot effectively handle such inputs. For a relatively simple grammar like XML, AFL does not fall far behind GBFs, while for a grammar like JavaScript, AFL lags far behind the grammar-based fuzzers.
On the other hand, FSA also compares favorably with Superion. On Libxml and Sablotron, the efficiency of FSA is better than Superion's, partly thanks to the reliable seeds produced by the generator in Fig. 3 as initial inputs. At the same time, the curve of FSA shows a more obvious rising trend and a steeper slope than Superion's, which also proves that our proposed mutation strategy is effective.
Meanwhile, on JerryScript and WebKit, GBFs like FSA and Superion substantially outperform AFL, whereas for simple inputs such as XML the gap between them is not as obvious. At the same time, FSA performs better than Superion on programs that handle XML inputs, but not on JerryScript and WebKit. This warns us that our strategy's ability to handle complex grammars needs improvement. One reason is that the more complex the grammar, the more expensive the construction and maintenance of the subtree pool become, which reduces overall fuzzing efficiency. Secondly, with a complex grammar it is much more time-consuming to traverse ASTs, which increases the cost of each round of mutation and thus inhibits the performance of FSA.
C. RQ2: Evaluating Subtree Pool
In order to provide an abundant source of subtrees for the replacement mutation strategy, we designed a storage structure named the subtree pool. In the experiments later in this section, we compare our method with AFL and Superion to evaluate the quality of the subtree source during mutation.
More specifically, we count the number of unique subtrees (i.e., the search space) that each fuzzer can provide at different times during fuzzing. For AFL and Superion, we take the seeds in their seed queues, parse them, and count the unique subtrees. For FSA, we directly count the number of unique subtrees in the subtree pool. TABLE 1 shows the results of our evaluation.
As can be seen from TABLE 1, in the process of fuzzing, the number of unique subtrees (subtree search space) obtained by FSA is larger than that of the other two fuzzers. The table also shows that as the number of unique subtrees increases, the edge coverage increases as well, indicating that the number of unique subtrees has an obvious influence on the fuzzing effect. FSA's subtree pool is thus shown to improve the diversity of unique subtrees, which greatly helps fuzzing performance and efficiency.
Compared with AFL, FSA mutates according to the grammar and is more capable of producing reliable seeds than AFL's grammar-blind strategy; AFL does not take the grammar into account, so a considerable number of its seeds cannot be parsed correctly. Compared with Superion, our method updates the subtree pool incrementally, which is more robust and reliable than Superion's random selection of a seed from the queue as the subtree source each time.
D. RQ3: Optimizing Power Schedule Strategy
In our strategy, we assign energy to each internal node of the AST to determine the priority and number of times for each node to be mutated. In the following experiment, we evaluate the two criteria (i.e., rarity and influence) of the power schedule in Formula 3 and discuss their precedence in determining the final assigned energy. We use different settings of the weights $a$ and $b$ in Formula 3 and compare the resulting fuzzing performance.
From TABLE 2, it is clear that assigning a larger weight to rarity leads to better fuzzing results; that is, the rarity of an AST node has a greater impact on the final energy assignment than its influence.
Generally speaking, FSA performs better than AFL and Superion in terms of overall efficiency, with a greater advantage on simple XML inputs than on JavaScript code. Thanks to the effective mutation strategy, valid seed generation, and the subtree pool, FSA has a strong ability to deal with complex, structured inputs. At the same time, we also show that the rarity of AST nodes has a greater impact than influence on the final energy assignment during the power schedule. However, we should note that FSA's lead is not that large, especially when dealing with complex grammars like JavaScript; that is something we must continue to improve.
Related Work
In this section, we discuss the works most relevant to ours, categorized broadly into guided mutation-based and grammar-based approaches. First, we introduce guided mutation-based fuzzers, which are the source of priority mutation and power schedule strategies, and also the starting point of our thoughts on improving mutation strategies. Then, we introduce grammar-based fuzzers, which excel at handling complex inputs in a grammar-oriented way; it is from grammar-based fuzzers that we observe problems and derive our challenges.
A. Guided Mutation-Based Fuzzing
Mutation-based fuzzing is a commonly used and efficient way to find vulnerabilities in programs. Such fuzzers usually need a corpus to form the initial seed queue. In the process of fuzzing, certain strategies are used to mutate seeds and generate test inputs; some metric (e.g., coverage) measures each test input and guides mutation, and the seed queue is updated with the inputs that perform well. Most mutation-based fuzzers follow the overall structure of AFL [1], a fuzzer that has achieved quite high efficiency with random mutation strategies and instrumentation-based code coverage measurement.
While AFL is fast, its blind mutation strategy has a negative influence on its efficiency to some extent. Therefore, some methods of guided mutation have been developed in the hope of improving the knowledge-blindness of basic AFL.
Some taint-based fuzzers, such as BuzzFuzz [14] and TaintScope [15], use taint analysis to confirm the relationship between input bytes and branch behaviors. Subsequent fuzzers (e.g., VUzzer [16], Angora [19], Steelix [20], GreyOne [21], Matryoshka [22], etc.) improve on these methods considerably. By enhancing taint tracking, using data flow, or utilizing control flow analysis, they are capable of determining prioritized bytes of test inputs and getting through hard-to-reach edges (e.g., branches with magic-byte comparisons). These fuzzers are thus able to learn some logic of a target program and trigger deeper bugs. At the same time, some learning-based fuzzers use machine learning to find the prioritized bytes. NEUZZ [23] is a representative one: it utilizes a neural network model to calculate the gradient of each input byte and then takes advantage of the gradient distribution to guide mutation, allowing the fuzzer to explore specific edges.
As for power schedule, fuzzers like AFL-fast [13], MOPT [17] and Ecofuzz [18] utilize various mathematical models (e.g. Markov chain, PSO and MAB) to help energy assignment. Note that AFLSmart [24] leverages the structure representation in seed files to guide mutation, and defines a validity-based power schedule to generate test inputs that are more likely to explore deeper program logic. Although it is close to main ideas of grammar-based and generation-based fuzzers, its application scenarios are limited due to the relatively simple chunk strategy and virtual file structure.
B. Grammar-Based Fuzzing
Even though guided mutation-based fuzzers perform well in solving hard-to-reach constraints, when it comes to complex, highly structured inputs like XML and JavaScript, they are usually stuck at the syntax checking phase of parsers due to their incomplete consideration of syntax and semantics. Therefore, more GBFs have been proposed. They utilize specific grammars to automatically generate syntactically and semantically valid inputs that help the fuzzer pass the checking phase of parsers.
In order to fuzz C compilers, CSmith [25] generates C source code according to the C language grammar. It randomly generates C programs that conform to the grammar and generation rules, avoiding compilation errors caused by undefined or undeclared constructs. LangFuzz [26] learns code snippets from a test set based on a grammar and reassembles the snippets to generate new test inputs, assuming that inputs recombined from problematic test sets are more likely to trigger program defects than randomly collected ones. IFuzzer [27] takes a CFG as input and generates ASTs; it extracts code fragments from the test corpus and utilizes a genetic evolutionary algorithm to recombine the fragments into new test inputs. jsfunfuzz [28] uses knowledge of historical vulnerabilities and hard-coded rules to generate test inputs, and found more than 1800 vulnerabilities in Mozilla's JavaScript interpreter.
Learn&Fuzz [29] is the first to use a recurrent neural network (RNN)-based learning model to learn a grammar from an existing corpus of PDF inputs and generate new ones. The learned model not only generates a large number of new, well-formed inputs, but also increases code coverage when fuzzing PDF engines. However, it is limited to the simple grammar of PDF objects and does not delve into the higher-level hierarchical structure of PDF documents.
Skyfire [30] also learns a distribution model from a corpus. Unlike Learn&Fuzz, it requires a CFG as input. Skyfire learns a probabilistic context-sensitive grammar (PCSG) from the corpus and then uses it to generate initial seeds for fuzzing. These grammar-valid seeds help the fuzzer pass the syntax and semantic checking phase of parsers. However, Skyfire does not use the learned model to guide the mutation phase of fuzzing, so there is still a high chance that test inputs stop conforming to the syntax and semantics during mutation.
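Skyfire's seed generation can be caricatured as weighted sampling over production probabilities estimated from a corpus. The sketch below is context-free and uses invented XML-like rules and counts; the real system learns a context-sensitive grammar with far richer features.

```python
import random
from collections import Counter

# Hypothetical corpus statistics: how often each production was observed
# in parsed seed files (counts are invented for illustration).
rule_counts = {
    "<elem>": Counter({("<open>", "<elem>", "<close>"): 3,
                       ("<text>",): 7}),
}
TERMINALS = {"<open>": "<a>", "<close>": "</a>", "<text>": "hi"}

def sample(symbol, stats, terminals):
    """Expand a symbol, choosing rules in proportion to learned counts."""
    if symbol in terminals:
        return terminals[symbol]
    rules = list(stats[symbol].keys())
    weights = [stats[symbol][r] for r in rules]
    rule = random.choices(rules, weights=weights, k=1)[0]
    return "".join(sample(s, stats, terminals) for s in rule)
```

Biasing toward frequently observed productions is what makes the generated seeds resemble realistic, well-formed corpus files rather than uniformly random derivations.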
Therefore, more and more grammar-based mutation strategies have emerged. NAUTILUS [31] proposes a number of novel mutation strategies, such as rule mutation, random recursion, and splicing, and introduces coverage-based scheduling into a GBF. These strategies can significantly improve the efficiency of fuzzing, but NAUTILUS reduces the fuzzer's entire input specification to a single CFG and therefore does not consider semantic checking. Superion [32] develops AST-based trimming and mutation methods and is implemented as an extension of AFL. Unlike the two fuzzers above, CodeAlchemist [33] takes semantics into account: it uses semantics-aware assembly to combine JS code blocks, generating semantically valid JS snippets during fuzzing. When mutating ASTs, the aforementioned fuzzers are more concerned with how to mutate than with where and how many times to mutate. This is exactly our starting point: the nodes of an AST are not all equivalent, and if we differentiate them and apply a priority-based mutation strategy, we can make mutation more efficient.
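To make subtree-style AST mutation concrete, here is a minimal sketch of splicing: collect subtrees from donor seeds into a pool keyed by nonterminal kind, then swap a random child of the target tree for a pool subtree of the same kind, so the result stays grammar-valid. The `Node` class and field names are invented for illustration.

```python
import random

class Node:
    """Toy AST node: a nonterminal kind, children, and optional text."""
    def __init__(self, kind, children=None, text=""):
        self.kind, self.children, self.text = kind, children or [], text

def collect(node, pool):
    """Index every subtree of a donor tree by its nonterminal kind."""
    pool.setdefault(node.kind, []).append(node)
    for c in node.children:
        collect(c, pool)

def splice(tree, pool):
    """Replace one random child with a pool subtree of the same kind."""
    nodes = []
    def walk(n):
        nodes.append(n)
        for c in n.children:
            walk(c)
    walk(tree)
    target = random.choice([n for n in nodes if n.children])
    i = random.randrange(len(target.children))
    candidates = pool.get(target.children[i].kind, [])
    if candidates:
        target.children[i] = random.choice(candidates)
    return tree
```

Because the replacement subtree is rooted at the same nonterminal, the spliced tree still derives from the grammar and survives syntax checking.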
At the same time, more fuzzing methods based on intermediate representation (IR) have been proposed, such as FreeDom [34] and Squirrel [35]. Compared to ASTs, IR is more compact, more general, and closer to the lower layers of the compilation pipeline. Some learning-based approaches, like Montage [36], train a neural network language model on sequences of subtrees derived from ASTs and then use it for test input generation.
Conclusion
In this paper, we have proposed a novel mutation strategy for GBFs that combines subtree-pool-based random mutation with an AST-node-level power schedule. We provide an abundant and reliable source of subtrees for grammar-based mutation on ASTs, introduce the key ideas of priority mutation and power scheduling into grammar-based fuzzing, and define energy assignment and priority from the novel perspective of the AST-node level. A prototype built on these strategies performs well and efficiently on real programs such as XML parsers and simple JavaScript engines.
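The general shape of a node-level power schedule can be sketched as follows. The priority signals (subtree size, past coverage reward, mutation count) and the weighting formula below are invented for exposition and are not the exact schedule implemented in our prototype.

```python
# Illustrative node-level energy assignment: favor nodes that are cheap
# to mutate (small subtrees) and have historically yielded new coverage,
# while still exploring nodes that have been mutated rarely.

def node_energy(subtree_size: int, new_cov_hits: int, mutations: int) -> float:
    rarity = 1.0 / (1 + mutations)           # explore less-mutated nodes
    reward = new_cov_hits / (1 + mutations)  # exploit productive nodes
    return (reward + rarity) / (1 + subtree_size / 10)

def schedule(nodes, budget):
    """Split a mutation budget across nodes proportionally to energy."""
    energies = {n: node_energy(*stats) for n, stats in nodes.items()}
    total = sum(energies.values()) or 1.0
    return {n: max(1, round(budget * e / total)) for n, e in energies.items()}
```

The point of treating nodes unequally is visible here: a small, historically productive node absorbs most of the budget, while large unproductive nodes still receive a minimum of one mutation.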
Although our fuzzer improves on previous methods, it is still inefficient and ineffective when testing more complex JS engines. In the future, we will continue to improve and refine the proposed mutation strategy to accommodate more structured grammars and more complex target programs, such as the JS engines commonly used in browsers. We will refine our mutation approach from the perspective of intermediate representation, which is closer to the underlying level, and will focus on reducing the resource cost of the subtree pool. At the same time, we will work to improve the efficiency of our prototype so that it outperforms state-of-the-art fuzzers.