Observing the host- or network-level behavior of malware as it executes constitutes an essential technique for researchers seeking to understand malicious code. Dynamic malware analysis systems like Anubis, CWSandbox, and others have proven invaluable in generating ground truth characterizations of malware behavior. The anti-malware community regularly applies these ground truths in scientific experiments, for example to evaluate malware detection technologies [10, 17, 19, 24, 26, 30, 33, 44, 48, 52], to disseminate the results of large-scale malware experiments, to identify new groups of malware, or as training datasets for machine learning approaches. However, while analysis of malware execution clearly holds importance for the community, the data collection and subsequent analysis processes face numerous potential pitfalls.
In this paper we explore issues relating to prudent experimental evaluation for projects that use malware-execution datasets. Our interest in the topic arose while analyzing malware and researching detection approaches ourselves, during which we discovered that well-working lab experiments could perform much worse in real-world evaluations. Investigating these difficulties led us to identify and explore the pitfalls that caused them. For example, we observed that even a slight artifact in a malware dataset can inadvertently lead to unforeseen performance degradation in practice.
Thus, we highlight that performing prudent experiments involving such malware analysis is harder than it seems. Related to this, we have found that the research community's efforts (including ours) frequently fall short of fully addressing existing pitfalls. Some of the shortcomings have to do with presentation of scientific work, i.e., authors remaining silent about information that they could likely add with ease. Other problems, however, go deeper, and bring into question the basic representativeness of experimental results.
As in any science, it is desirable for our community to ensure we undertake prudent experimental evaluations. We define experiments reported in our paper as prudent if they are correct, realistic, transparent, and do not harm others. Such prudence provides a foundation for the reader to objectively judge an experiment's results, and only well framed experiments enable comparison with related work. As we will see, however, experiments in our community's publications could oftentimes be improved in terms of transparency, e.g., by adding and explaining simple but important aspects of the experiment setup. These additions render the papers more understandable, and enable others to reproduce results. Otherwise, the community finds itself at risk of failing to enable sound confirmation of previous results.
In addition, we find that published work frequently lacks sufficient consideration of experimental design and empirical assessment to enable translation from proposed methodologies to viable, practical solutions. In the worst case, papers can validate techniques with experimental results that suggest the authors have solved a given problem, but the solution will prove inadequate in real use. In contrast, well-designed experiments significantly raise the quality of science. Consequently, we argue that it is important to have guidelines regarding both experimental design and presentation of research results.
We aim in this work to frame a set of guidelines for describing and designing experiments that incorporate such prudence, hoping to provide touchstones not only for authors, but also for reviewers and readers of papers based on analysis of malware execution. To do so, we define goals that we regard as vital for prudent malware experimentation: transparency, realism, correctness, and safety. We then translate these goals to guidelines that researchers in our field can use.
We apply these guidelines to 36 recent papers that make use of malware execution data, 40% from top-tier venues such as ACM CCS, IEEE S&P, NDSS and USENIX Security, to demonstrate the importance of considering the criteria. Figure 1 shows the number of papers we reviewed by publishing year, indicating that usage of such datasets has steadily increased. Table II (on page 6) lists the full set of papers. We find that almost all of the surveyed papers would have significantly benefited from considering the guidelines we frame, indicating, we argue, a clear need for more emphasis on rigor in methodology and presentation in the subfield. We also back up our assessment of the significance of some of these concerns by a set of conceptually simple experiments performed using publicly available datasets.
We acknowledge that fully following the proposed guidelines can be difficult in certain cases, and indeed this paper comes up short in some of these regards itself. For example, we do not fully transparently detail our survey datasets, as we thought that doing so might prove more of a distraction from our overall themes than a benefit. Still, the proposed guidelines can—when applicable—help with working towards scientifically rigorous experiments when using malware datasets.
To summarize our contributions:
We begin by discussing characteristics important for prudent experimentation with malware datasets. In formulating these criteria, we draw inspiration from extensive experience with malware analysis and malware detection, as well as from lessons we have learned when trying to assess papers in the field and, in some cases, reproducing their results. We emphasize that our goal is not to criticize malware execution studies in general. Instead, we highlight pitfalls when using malware datasets, and suggest guidelines for devising prudent experiments with such datasets.
We group the pitfalls that arise when relying on data gathered from malware execution into four categories. Needless to say, compiling correct datasets forms a crucial part of any experiment. We further experienced how difficult it proves to ensure realism in malware execution experiments. In addition, we must provide transparency when detailing the experiments to render them both repeatable and comprehensible. Moreover, we believe that legal and ethical considerations mandate discussion of how to conduct such experiments safely, mitigating harm to others. For each of these four “cornerstones of prudent experimentation”, we now outline more specific aspects and describe guidelines to ensure prudence. As we will show later, the following guidelines can be used by our community to overcome common shortcomings in existing experiments.
The previous section described guidelines for designing and presenting scientifically prudent malware-driven experiments. As an approach to verify if our guidelines are in fact useful, we analyzed in which cases they would have significantly improved experiments in existing literature. This section describes our methodology for surveying relevant publications with criteria derived from our guidelines.
Initially, we establish a set of criteria for assessing the degree to which experiments presented in our community adhere to our guidelines. We aim to frame these assessments with considerations of the constraints the reviewer of a paper generally faces, because we ultimately wish to gauge how well the subfield develops its research output. Consequently, we decided not to attempt to review source code or specific datasets, and refrained from contacting individual authors to clarify details of the presented approaches. Instead, our goal is to assess the prudence of experiments given all the information available in a paper or its referenced related work, but no more. We employed these constraints since they in fact reflect the situation that a reviewer faces. A reviewer typically is not supposed to clarify missing details with the authors (and in the case of double-blind submissions, lacks the means to do so). That said, we advocate that readers facing different constraints should contact authors to clarify lacking details whenever possible.
Table I lists the guideline criteria we used to evaluate the papers. We translate each aspect addressed in § II into at least one concrete check that we can perform when reading a given paper.1 We defined the assessment criteria in an objective manner such that each item can be answered without ambiguity. We also assign a three-level qualitative importance rating to each check, based on our experience with malware execution analysis. Later on, this rating allows us to weigh the interpretation of the survey results according to the criteria's criticality levels.
For an informal assessment of our approach, we asked the authors of two papers to apply our criteria.2 The researchers were asked if the criteria were applicable, and if so, if the criteria were met in their own work. During this calibration process, we broadened the check to determine coverage of false positives and false negatives, to allow us to perform a generic assessment. In addition, as we will discuss later, we realized that not all criteria can be applied to all papers.
We assessed each of the guideline criteria against the 36 scientific contributions (“papers”) in Table II. We obtained this list of papers by systematically going through all of the proceedings of the top-6 computer- and network-security conferences from 2006–2011.3 We added a paper to our list if any of its experiments make use of datasets driven by PC malware execution. We then also added an arbitrary selection of relevant papers from other, less-prestigious venues, such that in total about two fifths (39%) of the 36 surveyed papers were taken from the top-6 security conferences. As Figure 1 shows, we see increasing use of malware execution during recent years.
The surveyed papers use malware datasets for diverse purposes. A significant number used dynamic analysis results as input for a training process of malware detection methods. For example, Botzilla  and Wurzinger et al.  use malicious network traffic to automatically generate payload signatures of malware. Similarly, Perdisci et al.  propose a methodology to derive signatures from malicious HTTP request patterns. Livadas et al.  identify IRC-based C&C channels by applying machine-learning techniques to malware execution results. Zhu et al.  train SVMs to model the abnormally high network failure rates of malware. Morales et al.  manually derive characteristics from malware observed during execution to create detection signatures. Malheur ,  can cluster and classify malware based on ordered behavioral instructions as observed in CWSandbox. Kolbitsch et al.  present a host-based malware detection mechanism relying on system call slices as observed in Anubis.
In addition, we have surveyed papers that used malware execution solely to evaluate methodologies. Most of these papers leverage malware traces to measure true positive rates of malware detection mechanisms , 1719, 24, 30, 33, 44, 48, 52 . Typically, the authors executed malware samples in a contained environment and used the recorded behavior as ground truth for malicious behavior, either via network traces (for assessing network-based IDSs) or via host behavior such as system call traces (for system-level approaches). Similarly, researchers have used malware execution traces for evaluating methodologies to understand protocol semantics , , to extract isolated code parts from malware binaries , to detect if malware evades contained environments , or to improve the efficiency of dynamic analysis .
A third group of papers used malware traces to obtain a better understanding of malware behavior. For example, JACKSTRAWS  leverages Anubis to identify botnet C&C channels. Similarly, FIRE  identifies rogue networks by analyzing malware communication endpoints. Caballero et al.  execute malware to measure the commoditization of pay-per-install networks. DISARM  measures how different malware behaves in virtualized environments compared to Anubis. Bayer et al.  and Jang et al.  present efficient clustering techniques for malware behavior. Bailey et al.  label malware based on its behavior over time. Finally, Bayer et al.  and Rossow et al.  analyze the behavioral profiles of malware samples as observed in Anubis and Sandnet.
To ensure consistency and accuracy in our survey results, two of our authors conducted an initial survey of the full set of papers. Employing a fixed pair of reviewers helps to ensure that all papers received the same interpretation of the guideline criteria. When the two reviewers did not agree, a third author decided on the specific case. In general, if in doubt or when encountering vague decisions, we classified the paper as conforming with the guideline (“benefit of the doubt”). Note that our assessments of the papers contain considerably more detail than the simple statistical summaries presented here. If a paper lacked detail regarding experimental methodology, we further reviewed other papers or technical reports describing the particular malware execution environment. We mark criteria results as “unknown” if after doing so the experimental setup remained unclear.
We carefully defined subsets of applicable papers for all criteria. For instance, executions of malware recompiled to control network access do not require containment policies. Similarly, analyzing the diversity of false positives only applies to methodologies that have false positives, while removing goodware samples only matters when relying on unfiltered datasets with unknown (rather than guaranteed malicious) binaries. Also, removing outdated or sinkholed samples might not apply if the authors manually assembled their datasets. Balancing malware families is applicable only for papers that use datasets in classification experiments and if authors average classification performances over the (imbalanced) malware samples. Moreover, we see a need to separate datasets in terms of families only if authors suggest that a methodology performs well on previously unseen malware types. We further define real-world experiments to be applicable only for malware detection methodologies. These examples show that building subsets of applicable papers is vital to avoid skew in our survey results. Consequently, we note for all criteria the number of papers to which we deemed they applied.
We also sometimes found it necessary to interpret criteria selectively for individual papers. For example, whereas true-positive analysis is well-defined for assessing malware detection approaches, we needed to consider how to translate the term to other methodologies (e.g., malware clustering or protocol extraction). Doing so enabled us to survey as many applicable papers as possible, while keeping the criteria fairly generic and manageable. In the case of malware clustering techniques, we translated recall and precision to true positive and false positive rate, respectively. This highlights the difficulty of arriving at an agreed-upon set of guidelines for designing prudent experiments.
We divide our survey interpretation into three parts. First, in a per-guideline analysis, we discuss to what extent specific guidelines were met. The subsequent per-paper analysis assesses whether only a small fraction of all papers accounts for the results, or if our findings hold more generally across all of the papers. Finally, a top-venue analysis details how papers appearing in more competitive research venues (as previously defined) compare with those appearing in other venues.
Table III lists the results of our assessment methodology ordered by theme and importance. The second major column includes statistics on all surveyed papers, while the third major column represents data from publications at top-tier venues only. App specifies the number of papers for which the criterion applied. OK states the proportion of those applicable papers that adhered to the guideline, whereas Ukwn specifies the proportion for which we could not assess the guideline due to lack of experimental description.
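As a rough illustration of how the three columns relate, the following sketch tallies App, OK, and Ukwn for a single criterion. The label strings (`ok`, `violated`, `unknown`, `n/a`) and the function name are assumptions made for this example and do not describe the tooling used for the survey.

```python
from collections import Counter


def summarize_criterion(assessments):
    """Summarize one criterion from a list of per-paper labels.

    Each label is one of 'ok', 'violated', 'unknown', or 'n/a'
    (hypothetical labels chosen for this sketch).
    """
    counts = Counter(assessments)
    applicable = len(assessments) - counts["n/a"]
    if applicable == 0:
        # The criterion applied to no paper, so no proportions exist.
        return {"App": 0, "OK": None, "Ukwn": None}
    return {
        "App": applicable,                        # papers the criterion applied to
        "OK": counts["ok"] / applicable,          # share that adhered to the guideline
        "Ukwn": counts["unknown"] / applicable,   # share that could not be assessed
    }


labels = ["ok"] * 3 + ["violated"] * 4 + ["unknown"] + ["n/a"] * 2
print(summarize_criterion(labels))
```

Both proportions are taken over the applicable papers only, so papers to which a criterion does not apply cannot dilute the result.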
In this section we highlight instances of criteria that potentially call into question the basic correctness of a paper's results.
In five cases, we find papers that mix behavioral traces taken from malware execution with traces from real systems. We find it difficult to gauge the degree of realism in such practices, since malware behavior recorded in an execution environment may deviate from the behavior exhibited on systems infected in the wild. For instance, Celik et al.  have pointed out that time-sensitive features such as frames per hour exhibit great sensitivity to the local network's bandwidth and connectivity latency; blending malware flows into other traces thus requires great care in order to avoid unnatural heterogeneity in those features. Another difference is generally the lack of user interaction in malware execution traces, which typically exists in real system traces. Consequently, we argue that researchers should not base real-world evaluations on mixed (overlay) datasets. On the positive side, two papers avoided overlay datasets and instead deployed sensors to large networks for real-world evaluations , .
In two papers, the authors present new findings on malware behavior derived from datasets of public dynamic analysis environments, but did not remove goodware from such datasets. Another two malware detection papers include potentially biased false negative experiments, as the datasets used for these false negative evaluations presumably contain goodware samples. We illustrate in § V-B that a significant ratio of samples submitted to public execution environments consists of goodware. Other than these four papers, all others filtered malware samples using anti-virus labels. However, no author discussed removing outdated or sinkholed malware families from the datasets, which has significant side effects in at least one such case.
Summarizing, at least nine (25%) distinct papers appear to suffer from clearly significant problems relating to our three most basic correctness criteria. In addition, observing the range of further potential pitfalls and the survey results, we speculate that more papers may suffer from other significant biases. For example, in another 15 cases, the authors did not explicitly discuss the presence/absence of sinkholed or inactive malware samples. In addition, three malware detection papers do not name malware families, but instead use a diverse set of malware binaries during experiments. We illustrate in § V-C that such datasets are typically biased and potentially miss significant numbers of malware families. We further observed seven papers with experiments based on machine learning that did not employ cross-validation and thus potentially failed to generalize the evaluation to other datasets. To name good examples, the authors in , , ,  chose a subset of malware families and balanced the number of samples per family prior to the training process. Similarly, we observed authors performing cross-validation to avoid overfitting detection models , , , .
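The two practices named as good examples, balancing the number of samples per family and cross-validating, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the `(sample_id, family)` tuple layout, and the family-disjoint fold construction are choices made for this example and are not taken from any surveyed paper. Conventional k-fold cross-validation splits by sample; the family-disjoint variant shown here is only needed when a method claims to generalize to unseen families.

```python
import random
from collections import defaultdict


def balance_by_family(samples, per_family, seed=0):
    """Keep at most `per_family` randomly chosen samples of each family.

    `samples` is a list of (sample_id, family) tuples.
    """
    rng = random.Random(seed)
    by_family = defaultdict(list)
    for sample_id, family in samples:
        by_family[family].append(sample_id)
    balanced = []
    for family in sorted(by_family):
        ids = by_family[family]
        chosen = rng.sample(ids, min(per_family, len(ids)))
        balanced.extend((sample_id, family) for sample_id in chosen)
    return balanced


def family_folds(samples, k):
    """Yield k (train, test) splits in which no family appears on both sides.

    If k exceeds the number of families, some test sets are empty.
    """
    families = sorted({family for _, family in samples})
    for fold in range(k):
        held_out = set(families[fold::k])
        train = [s for s in samples if s[1] not in held_out]
        test = [s for s in samples if s[1] in held_out]
        yield train, test
```

Capping the samples per family prevents a single large family from dominating averaged classification results, and holding out whole families gives a more honest estimate for previously unseen malware types.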
Lastly, nearly all of the papers omitted discussion of possible biases introduced by malware execution, such as malware behavior that significantly differs if binaries execute in a virtual machine. Typically, further artifacts or biases (for example, due to containment policies) exist when executing malware, as illustrated in § V-E. We highlight the importance of real-world scenarios, as they favor methodologies which evaluate against realistic and correct datasets.
We observed two basic problems regarding transparent experiment descriptions in our community. First, descriptions of experimental setups lack sufficient detail to ensure repeatability. For example, 20% of the papers do not name or describe the execution environment. For a third of the papers it remains unclear on which OS the authors tested the proposed approach, and about a fifth do not name the malware families contained in the datasets. Consequently, in the majority of cases the reader cannot adequately understand the experimental setup, nor can fellow researchers hope to repeat the experiments. In addition, 75% do not describe containment policies.
Second, we find the majority of papers incompletely describe experimental results. That is, papers frequently fail to interpret the numeric results they present, though doing so is vital for effectively understanding the import of the findings. Consider the simple case of presenting detection rates. In which exact cases do false positives occur? Why do some malware families raise false negatives while others do not? Do the true positives cover sufficient behavioral diversity?
Our survey reveals that only a minority of papers includes real-world evaluations, and very few papers offer significant sample sizes (e.g., in numbers of hosts) for such experiments. The lack of real-world experiments makes it hard to judge whether a proposed methodology will also work in practice. We find that authors who do run real-world experiments often use locally accessible networks (e.g., a university campus, or a research lab). Doing so does not constitute a problem per se, but authors often base such experiments on the untenable assumption that these environments do not contain malware activity. In eight cases, authors used university networks for a false positive analysis only, although their methodology should also detect malware in such traces.
We noted a further eight papers that model malicious behavior on malware samples controlled by the authors themselves. Without justification, it seems unlikely that such malware samples behave similarly to the same samples when infecting victim machines in the wild. The malware execution environment may introduce further biases, e.g., via author-controlled servers that may exhibit unrealistically deterministic communication patterns. All of these cases lack representative real-world evaluations, which could have potentially offset these criteria.
We find that the typical paper evaluates its methodology against eight (median) distinct malware families, and five (14%) evaluated using only a single family. Similarly, two thirds of the surveyed malware detection methodologies evaluated against eight or fewer families. There may be a good reason for not taking into account further families, e.g., if no other malware families are applicable for a specific experiment. In general, however, we find it difficult to gauge whether such experiments provide statistically sound results that can generalize.
Most papers did not deploy or adequately describe containment. More than two thirds (71%) completely omit treatment of any containment potentially used during the experiments. The reason for this may be that authors rely on references to technical reports for details on their containment solution. We found, however, that only a few such reports detail the containment policies in place. Two papers state that the authors explicitly refrained from deploying containment policies.
The preceding discussion has shown the high potential of our guidelines for improving specific prudence criteria. As a next step, we analyze how many papers can in total benefit from significant improvements.
To do so, Figure 2 details how many of the most important criteria (• in Table I)4 a paper violated. The fewer criteria a paper met, the more its experiments could have been improved by using our guidelines. The figure shows that only a single paper fulfilled all of the applicable guidelines. More than half (58%) of the papers violate three or more criteria. In general, the plot shows a correlation between the number of violated criteria and the number of applicable criteria. This means that our guidelines become increasingly important when designing more complex experiments.
We then separate the results into presentation and safety issues (middle graph) and incorrect or unrealistic experiments (right graph). We find that lacking transparency and safety constitutes a problem in half of the cases. Far more papers (92%) have deficiencies in establishing correct datasets and realistic experiments. Note that this does not imply that the experiments suffer from heavy flaws. It does flag, however, that many papers remain silent about important experimental descriptions. In addition, this analysis shows that experiments in applicable papers could be significantly improved in terms of correct datasets and realistic experiments.
In some cases, malware datasets were reused in related papers (such as ), often inheriting problems from the original experiments. In such cases, issues are mostly with the original paper. However, we knowingly did not remove such papers, as we wanted to survey the use instead of the creation of malware datasets.
We now ask ourselves if experiments presented at top-tier conferences appear to be more prudent than others. To measure this, Figure 3 compares results for the ten most important guidelines (• in Table I). We do not observe any obvious prudence tendency towards top-tier conferences or other venues. The first strong difference regards the prevalence of real-world experiments: while more papers presented at top-tier venues include real-world scenarios, authors base these on potentially skewed overlay datasets (e.g., mixing malware traces in real traces). Second, we observed more papers interpreting false positives at top-tier conferences than at other venues. However, while the number and ratios of violations slightly differ across the criteria, the violations generally remain comparable. We therefore conclude that research published in top-tier conferences would equally benefit from our guidelines as papers presented at other venues. Thus, these shortcomings appear endemic to our field, rather than emerging as a property of less stringent peer review or the quality of submitted works.
We now conduct four experiments that test four hypotheses we mentioned in previous sections. In particular, we will analyze the presence of (1) goodware, (2) malware family imbalances, (3) inactive and sinkholed samples, and (4) artifacts in malware datasets taken from contained environments that accept public submissions. Similar datasets were used in many surveyed experiments, raising the significance of understanding pitfalls with using such datasets. As we will show, our illustrative experiments underline the importance of proper experiment design and careful use of malware datasets. At the same time, these experiments show how we can partially mitigate some of the associated concerns.
We conducted all malware execution experiments in Sandnet, using a Windows XP SP3 32-bit virtual machine connected to the Internet via NAT. We deployed containment policies that redirect harmful traffic (e.g., spam, infections) to local honeypots. We further limited the number of concurrent connections and the network bandwidth to mitigate DoS activities. An in-path honeywall NIDS watched for security breaches during our experiments. Other protocols (e.g., IRC, DNS or HTTP) were allowed to enable C&C communication. The biases affecting the following experiments due to containment should thus remain limited. We did not deploy user interaction during our experiments. As Windows XP malware was most prevalent among the surveyed papers, we did not deploy other OS versions during dynamic analysis.
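The shape of such a containment policy can be sketched as a per-flow decision function. Everything concrete below is an assumption for illustration: the port numbers, honeypot names, and the connection limit are hypothetical and are not the values used in Sandnet, and bandwidth limiting is not modeled.

```python
from dataclasses import dataclass

# Hypothetical mapping of ports carrying harmful traffic to local honeypots.
HONEYPOT_PORTS = {
    25: "smtp-honeypot",         # spam
    135: "infection-honeypot",   # propagation attempts
    139: "infection-honeypot",
    445: "infection-honeypot",
}
MAX_CONCURRENT_CONNECTIONS = 50  # hypothetical limit to mitigate DoS activity


@dataclass
class Flow:
    dst_port: int
    open_connections: int  # connections the analysis VM currently holds open


def decide(flow: Flow) -> str:
    """Return the containment action for a new outbound flow."""
    if flow.dst_port in HONEYPOT_PORTS:
        return "redirect:" + HONEYPOT_PORTS[flow.dst_port]
    if flow.open_connections >= MAX_CONCURRENT_CONNECTIONS:
        return "drop"
    # Remaining protocols (e.g., IRC, DNS, HTTP) pass to allow C&C communication.
    return "allow"


print(decide(Flow(dst_port=25, open_connections=0)))
print(decide(Flow(dst_port=80, open_connections=3)))
```

Stating a policy in this explicit form also makes it straightforward to report in a paper, which addresses the transparency gap noted for containment.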
We base experiments V-B and V-C on 44,958 MD5-distinct malware samples and a diverse set of more than 100 malware families. We (gratefully) received these samples as a snapshot of samples submitted to a large public dynamic analysis environment during Jan. 1–30, 2011. The samples originated from a diverse set of contributors, including security companies, honeypot infrastructures, and spamtraps. To analyze the dynamic malware behavior in experiments V-E and V-D, we randomly chose 10,670 of these 44,958 samples. We executed this subset of samples and recorded the malware's network traces at the Internet gateway. An execution typically lasts for at least one hour, but for reasons of scale we stopped execution if malware did not show network activity in the first 15 minutes. The reader can find the data regarding execution date, trace duration, MD5 hashes, and family names of the malware samples used in the experiments at our website.5 As we use the following experiments to measure the presence of imbalances, goodware, sinkholing and artifacts, we explicitly did not clean up our dataset in this regard.
Experiments that erroneously consider legitimate software samples as malware suffer from bias. For example, when evaluating detection accuracies, legitimate software may cause false positives. Similarly, surveys of malicious behavior will exhibit bias if the underlying dataset contains legitimate software. Thus, in this experiment, we test our hypothesis that goodware is significantly present in public dynamic analysis systems' datasets.
To give lower bounds for the ratio of goodware, we queried the MD5 hash sum of all 44,958 binaries in two whitelists during the first week in November 2011. First, we query Shadowserver.org's bin-test  for known software. Second, we consulted Bit9 Fileadvisor , a file reputation mechanism also used by anti-spam vendors. bin-test revealed 176 (0.4%) goodware samples. The Bit9 Fileadvisor recognized 2,025 (4.5%) samples. In combination, both lists revealed 2,027 unique binaries as potentially being benign. As Bit9 also includes malicious software in their database, we inspected a small sample of the 2,027 known binaries to estimate the ratio of goodware in the hits. In particular, we manually analyzed a subset of 100 randomly selected matches and found 78 to be legitimate software. Similarly, we cross-checked the 2,027 binaries via VirusTotal and found that 67.5% did not register any anti-virus detection. Estimating more conservatively, we use the minimum ratio of goodware samples (67.5%) to extrapolate the number of goodware samples within the 2,027 “whitelisted” samples. This translates to a lower bound of 1,366 (3.0%) goodware samples in our total dataset. We can also approximate an upper bound estimate regarding the prevalence of non-malicious samples by observing that 33% of the samples that were scanned by VirusTotal were not detected by any of the 44 vendors listed at VirusTotal. We therefore conclude that the ratio of legitimate binaries (3.0%-33%) may significantly bias experiments.
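The lower-bound extrapolation can be retraced with a few lines of arithmetic. The sketch below uses only the figures stated above; the constant and function names are introduced for this example.

```python
TOTAL_SAMPLES = 44_958      # MD5-distinct samples in the dataset
WHITELIST_HITS = 2_027      # unique hits across bin-test and Bit9 Fileadvisor
UNDETECTED_RATIO = 0.675    # whitelist hits without any VirusTotal detection
MANUAL_RATIO = 78 / 100     # legitimate samples in the manually inspected subset


def goodware_lower_bound(hits, ratios, total):
    """Return (estimated goodware count, share of the dataset).

    The smallest of the given goodware ratios is used, which yields the
    most conservative estimate.
    """
    conservative = min(ratios)
    count = hits * conservative
    return count, count / total


count, share = goodware_lower_bound(
    WHITELIST_HITS, [UNDETECTED_RATIO, MANUAL_RATIO], TOTAL_SAMPLES
)
print(f"lower bound: about {count:.0f} samples ({share:.1%} of the dataset)")
```

With the rounded ratio of 67.5%, the estimate lands near 1,368 rather than the reported 1,366. The small gap presumably reflects rounding of the published percentage, and the resulting share of the dataset is 3.0% either way.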
In this experiment we test our hypothesis stating that polymorphic malware manifests in an unduly large proportion in randomly collected sets of malware samples. We used the VirusTotal labels obtained in Experiment V-B and counted the occurrences of malware families for each anti-virus vendor. To obtain the malware family names, we parsed the naming schemes of three anti-virus vendors (Avira, Kaspersky and Symantec) commonly used by our community to assign malware labels.
The CDF in Figure 4 shows the relationship of malware families to prevalences of families in our dataset. Ideally, a multi-family malware corpus stems from a uniform distribution, i.e., each malware family contributes the same number of samples. In our dataset, randomly collected from a large public dynamic analysis environment, we find this goal clearly violated: some malware families far dominate others. For example, when relying on Kaspersky, almost 80% of the malware samples belong to merely 10% of the families. In the worst case, this would mean that experiments performing well with four fifths of the samples may not work with the remaining 90% of the malware families. In summary, unless researchers take corresponding precautions, polymorphic malware families can disproportionately dominate randomly drawn corpora of malware samples.
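A quick way to check a corpus for this kind of imbalance is to compute the share of samples covered by the largest families. The following is a minimal sketch; the function name and the toy labels are assumptions for illustration, and in practice the labels would come from parsed anti-virus family names.

```python
from collections import Counter


def sample_share_of_top_families(family_labels, top_fraction):
    """Return the share of samples that belong to the largest families.

    `family_labels` holds one family name per sample; `top_fraction` is the
    fraction of families (by size, largest first) to consider.
    """
    sizes = sorted(Counter(family_labels).values(), reverse=True)
    n_top = max(1, round(len(sizes) * top_fraction))
    return sum(sizes[:n_top]) / sum(sizes)


# Toy corpus: one dominant family and nine small ones.
labels = ["dominant"] * 82 + [f"small{i}" for i in range(9)] * 2
print(sample_share_of_top_families(labels, 0.10))
```

A value far above `top_fraction` signals that a few families dominate the corpus, in which case per-family balancing or per-family reporting of results is advisable.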
The identification of correctly functioning malware samples poses one of the major challenges of automated dynamic analysis. Large fractions of analyzed samples do not exhibit any behavior . Further complicating things, even if network communication manifests, it remains unclear whether it constitutes successful operation and representative behavior. During recent years, Shadowserver.org, Spamhaus, and other individuals/organizations have exercised take-overs of botnet infrastructure or botnet-employed domains. Such achievements can have the significant side effect of perturbing empirical measurement: takedowns introduce “unnatural” activity in collected datasets .
To assess the magnitude of these issues, we analyzed which of the 10,670 executed samples showed network activity, but apparently failed to bootstrap malicious activities. We used Avira to identify the malware families. Only 4,235 (39.7%) of the 10,670 samples showed any network activity. Of the 22 families with at least 5 distinct samples showing any HTTP activity, 14 (63%) included samples that only had failing HTTP communication (HTTP response codes 4XX/5XX). Similarly, of the most prevalent 33 families that used DNS, eight (24%) contained samples that did not have any communication other than the (typically failed) DNS lookups. We observed such inactive samples to be more prevalent in some families (e.g., Hupigon 85%, Buzus 75%), while other families (e.g., Allaple 0%) were less affected.
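The HTTP part of this analysis can be approximated with a short script. This is a minimal sketch under stated assumptions: each sample is represented as a `(family, status_codes)` pair, which is a hypothetical input format, and the family names and codes in the example are invented.

```python
from collections import defaultdict


def only_failing_http(status_codes):
    """True if the sample made HTTP requests and every response was 4XX/5XX."""
    return bool(status_codes) and all(400 <= c <= 599 for c in status_codes)


def families_with_failing_samples(samples, min_samples=5):
    """Find families that include samples with only failing HTTP traffic.

    samples: iterable of (family, [HTTP status codes]) pairs.
    Only families with at least min_samples HTTP-active samples count.
    Returns (sorted affected families, number of eligible families).
    """
    by_family = defaultdict(list)
    for family, codes in samples:
        if codes:  # ignore samples without any HTTP activity
            by_family[family].append(only_failing_http(codes))
    eligible = {f: flags for f, flags in by_family.items()
                if len(flags) >= min_samples}
    affected = sorted(f for f, flags in eligible.items() if any(flags))
    return affected, len(eligible)


# Invented example data; min_samples is lowered to keep it short.
samples = [
    ("FamilyA", [404]), ("FamilyA", [503, 404]),
    ("FamilyB", [200]), ("FamilyB", [200, 404]),
    ("FamilyC", [500]),
]
affected, eligible = families_with_failing_samples(samples, min_samples=2)
print(affected, eligible)
```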
Next, we tried to quantify the effects of sinkholed malware infrastructure. We contacted various security organizations to obtain information about sinkholed C&C servers. These contacts enabled us to obtain the sinkholing patterns of four different organizations operating sinkholing infrastructure. We then searched for these patterns among the 4,235 samples showing network activity. Most significantly, we found that during 59 of the 394 Sality executions (15%) and 27 of the 548 Virut executions (5%), at least one sinkholed domain was contacted. Although we are aware that additional malware families in our dataset have sinkholed domains (e.g., Renos/Artro, Gozi, TDSS, SpyEye, ZeuS, Carberp, Vobfus/Changeup, Ramnit, Cycbot), we could not spot sinkholing of these in our sample dataset. Taken together, at least eleven of the 126 active families (8.7%) in our dataset are potentially affected by sinkholing.
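The text does not specify what the sinkholing patterns look like. Purely as an illustration, the sketch below assumes that a pattern is an IP network to which sinkholed domains resolve; the network shown is a reserved documentation range, not real sinkhole infrastructure, and all function names are hypothetical.

```python
import ipaddress

# Assumption: sinkhole operators are identified by the networks their
# sinkholes live in. 192.0.2.0/24 is a reserved documentation range.
SINKHOLE_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def contacted_sinkhole(resolved_ips, networks=SINKHOLE_NETWORKS):
    """True if any contacted address falls into a known sinkhole network."""
    return any(
        ipaddress.ip_address(ip) in net
        for ip in resolved_ips
        for net in networks
    )


def sinkhole_rate(executions):
    """Count executions that contacted at least one sinkhole.

    executions: list with one list of contacted IP strings per execution.
    Returns (number of affected executions, affected fraction).
    """
    hits = sum(contacted_sinkhole(ips) for ips in executions)
    rate = hits / len(executions) if executions else 0.0
    return hits, rate


# Invented example: four executions of one family.
executions = [
    ["192.0.2.10"],
    ["203.0.113.5"],
    ["198.51.100.7", "192.0.2.99"],
    [],
]
hits, rate = sinkhole_rate(executions)
print(f"{hits} of {len(executions)} executions ({rate:.0%}) hit a sinkhole")
```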
In summary, execution of inactive or sinkholed samples will not yield representative activity, highlighting the need for authors to consider and quantify their impact.
Due to the specific setups of malware execution environments, the artifacts introduced into recorded malware traces can be manifold. For example, network traffic contains specific values such as the environment's IP address or the Windows user name. We found such easy-to-spot artifacts widespread across many malware families. Specifically, we analyzed which of the recorded network traces contain the analysis environment's IP address, Windows user name, or OS version. For instance, more than 10% of all Virut samples that we executed transmitted the bot's public IP address in plaintext. Similarly, one in five Katusha samples sent the Windows user name to the C&C server. The use of “Windows NT 5.1.2600” as HTTP User-Agent, as for example by Swizzor (57%) or Sality (52%), likewise occurs frequently. These illustrative examples of payload artifacts are incomplete, yet already more than a third (34.7%) of the active malware families in our dataset communicated either Sandnet's external IP address, our VM's MAC address, the VM's Windows user name, or the exact Windows version string in plaintext in at least one case.
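A plaintext scan for such payload artifacts might look like the sketch below. Apart from the Windows version string quoted in the text, all environment values are placeholders, and the function name is hypothetical. The scan only finds values transmitted verbatim; encoded or encrypted occurrences, or a MAC address in a different notation, would be missed.

```python
# Placeholder values describing a hypothetical analysis environment.
ENVIRONMENT = {
    "external_ip": "198.51.100.23",
    "mac_address": "00:16:3e:12:34:56",
    "user_name": "analyst",
    "os_version": "Windows NT 5.1.2600",  # version string quoted in the text
}


def find_artifacts(payload, environment=ENVIRONMENT):
    """Return the names of environment values found verbatim in a payload.

    payload: raw bytes of a recorded network message.
    The comparison is case-insensitive.
    """
    lowered = payload.lower()
    return {
        name
        for name, value in environment.items()
        if value and value.lower().encode() in lowered
    }


request = b"GET /gate.php HTTP/1.1\r\nUser-Agent: Windows NT 5.1.2600\r\n\r\n"
print(find_artifacts(request))
```

Traces flagged this way can be reviewed before they enter a dataset, so that a detector does not learn the environment instead of the malware.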
More dangerous types of biases may hide in such datasets, unbeknownst to researchers. For instance, methodologies relying on time-based features should consider artifacts introduced by specific network configurations, such as limited bandwidth during malware execution. Similarly, containment policies may bias the analysis results. For example, we have observed spambots that cease running if a containment policy redirects their spam delivery to a local server that simply accepts all incoming mail.
In general, it is hard to measure the exact ratio of malware families generating any artifact. Some artifacts, such as limited bandwidth or particular system configurations such as installed software, are inherent to all malware families. Consequently, authors need to carefully consider artifacts for each experiment. The best advice to preclude artifacts is to either carefully and manually assemble a dataset, or to perform representative real-world experiments.
Kurkowski et al.'s survey of the technical quality of publications in the Mobile Ad Hoc Networking community inspired our methodology. As their survey's verification strategies do not immediately apply to our community's work, we needed to establish our own review criteria. Krishnamurthy and Willinger have identified common methodological pitfalls in a similar fashion to ours, but regarding the Internet measurement community. They established a set of standard questions authors ought to consider, and illustrate their applicability in a number of measurement scenarios. Closer to our community, Aviv and Haeberlen have discussed a set of challenges in evaluating botnet detectors in trace-driven settings, and proposed distributed platforms such as PlanetLab as a potential enabler for more collaborative experimentation and evaluation in this space. Moreover, Li et al. explored difficulties in evaluating malware clustering approaches. Supporting our observations, they observed that using balanced and well-designed datasets has significant effects on evaluation results. They then show the importance of creating ground truths in malware datasets, broaching concerns related to some guidelines in this paper.
Perhaps most closely related to our effort is Sommer and Paxson's approach to explaining the gap between success in academia and actual deployments of anomaly-based intrusion detection systems. The authors find five reasons: (1) a very high cost of errors; (2) lack of training data; (3) a semantic gap between results and their operational interpretation; (4) enormous variability in input data; and (5) fundamental difficulties in conducting prudent evaluations. In fact, the anomaly detection community has suffered from these problems for decades, whereas experiments with malware datasets are increasingly used in our community. Consequently, our work complements theirs in that we shift the focus from anomaly detection to malware experiments in general.
Malware datasets typically stem from dynamic analysis in specially prepared environments. To ensure diverse datasets, malware must not evade dynamic analysis. Others have studied the extent to which malware can detect and evade dynamic analysis. Chen et al. present a taxonomy of dynamic analysis fingerprinting methods and analyze to what extent these are used. Paleari et al. present methods to automatically generate tests that effectively detect a variety of CPU emulators. Most recently, Lindorfer et al. analyzed how and to what extent malware samples evade Anubis.
Stinson and Mitchell presented a first approach to evaluate existing botnet detection methodologies. They focus on possible evasion methods by evaluating six specific botnet detection methodologies. Their survey is orthogonal to ours, as we explore how authors design experiments with malware datasets. Further, we provide guidelines on how to define prudent experiments that evaluate methodologies in the absence of any evasion techniques. In addition, we assist researchers in designing experiments in general rather than evaluating specific methodologies.
In this work we have devised guidelines to aid with designing prudent malware-based experiments. We assessed these guidelines by surveying 36 papers from our field. Our survey identified shortcomings in most papers from both top-tier and less prominent venues. Consequently, we argue that our guidelines could have significantly improved the prudence of most of the experiments we surveyed.
But what may be the reasons for our discouraging results? The observed shortcomings in experimental evaluation likely arise from several causes. Researchers may not have developed a methodical approach for presenting their experiments, or may not see the importance of detailing various aspects of the setup. Deadline pressures may lead to a focus on presenting novel technical content as opposed to the broader evaluation context. Similarly, detailed analyses of experimental results are often not given sufficient emphasis. In addition, page-length limits might hamper the introduction of important aspects in final copies. Finally, researchers may simply overlook some of the presented hidden pitfalls of using malware datasets.
Many of these issues can be addressed through devoting more effort to presentation, as our transparency guidelines suggest. Improving the correctness and realism of experiments is harder than it seems, though. For instance, while real-world scenarios are vital for realistic experiments, conducting such experiments can prove time-consuming and may raise significant privacy concerns for system or network administrators. Furthermore, it is not always obvious that certain practices can lead to incorrect datasets or unrealistic scenarios. For example, it takes great care to account for artifacts introduced by malware execution environments, and it is not obvious that, for instance, experiments on overlay datasets may be biased. The impact of imprudent experiments grows further in those instances where current practices inspire others to perform similar experiments, a phenomenon we observed in our survey.
We hope that the guidelines framed in this paper improve this situation by helping to establish a common set of criteria that can ensure prudent future experimentation with malware datasets. While many of our guidelines are not new, we witnessed possible improvements to experiments for every one of the criteria. We believe this approach holds promise both for authors, by providing a methodical means to contemplate the prudence and transparent description of their malware experiments, and for readers/reviewers, by providing more information by which to understand and assess such experiments.
We thank our shepherd David Brumley for his support in finalizing this paper. We also thank all anonymous reviewers for their insightful comments. We thank all our anonymous malware sample feeds. Moreover, we thank Robin Sommer for his valuable discussion input. This work was supported by the Federal Ministry of Education and Research of Germany (Grant 01BY1110, MoBE), the EU “iCode” project (funded by the Prevention, Preparedness and Consequence Management of Terrorism and other Security-related Risks Programme of the European Commission DG for Home Affairs), the EU FP7-ICT-257007 SysSec project, the US National Science Foundation (Grant 0433702) and Office of Naval Research (Grant 20091976).
1 Although the guideline “Choose appropriate malware stimuli” is in the Realism section, we added the criterion “Mentioned trace duration” (as one possible criterion for this guideline) to the Transparency category.
2 One of these is a coauthor of this paper, too. However, he undertook applying the criteria prior to obtaining any knowledge of the separate assessment of his paper made as part of our survey.
3 We determined the top-6 conferences based on three conference-ranking websites: (1) Microsoft Academic Search - Top Conferences in Security & Privacy (http://academic.research.microsoft.com/RankList?entitytype=3&topdomainid=2&subdomainid=2), (2) Guofei Gu's Computer Security Conference Ranking and Statistic (http://faculty.cs.tamu.edu/guofei/sec_conf_stat.htm), and (3) Jianying Zhou's Top Crypto and Security Conferences Ranking (http://icsd.i2r.a-star.edu.sg/staff/jianying/conference-ranking.html). As all rankings agreed on the top 6, we chose those as constituting top-tier conferences: ACM CCS, IEEE S&P, NDSS, USENIX Security, and two conferences (Crypto and Eurocrypt) without publications in our focus. We defined this list of top-venues prior to assembling the list of papers in our survey.
4 We note that we devised the importance ranking prior to conducting the analyses in this section.