2019-10-17

When Results Are All That Matters: Consequences

by Andreas Zeller and Sascha Just; with Kai Greshake

The Case

In our previous post "When Results Are All That Matters: The Case of the Angora Fuzzer", we reported our findings when investigating the Angora fuzzer [1]. If you have not read that post yet, you should stop here and read our write-up first. There, we focus on our findings and problems that surprised us when experimenting with Angora.
In this article, we have collected some suggestions to advance the field of fuzzing and have a long-term impact on the reliability of software.

1. Science is about insights, not products.

To ensure scientific progress, we need to know which technique works, how, and under which circumstances. We write papers to document such insights such that the next generation of researchers, as well as the non-scientific world, can build on them.  The value of a paper comes from the impact of its insights.

2. Scientists and companies can create tools.

It is fun to build a tool, and if it works well, the better.  Typically, this will involve not one single magical technique, but a multitude of techniques working together.  Tools will have to succeed on the market, though, and will be evaluated not based on their insights, but their effectiveness.

Evaluating tools for their effectiveness can be part of a scientific approach. However, evaluation settings should
  1. be fair and thus not be defined by tool authors; and
  2. avoid overspecialization and thus involve tests not previously known to tool authors.
In other words, the only way to obtain reliable performance comparisons is by independent assessment.   Other communities do this through specific tool contests that operate on secret benchmarks created for this very purpose.  And of course, tools need to be available for evaluation in the first place. It is nice to see the security community to adapt such techniques, such as artifact evaluation.

3. Combinations of techniques must be assessed individually.

If results depend on a larger set of novel processing steps, the contribution of each must be – for instance, by replacing each processing step by a naive approach and assessing the impact of the change.  All decisions affecting performance must be well motivated and documented.

Without assessing the impact of each step individually, one can still have a great tool, but the insight on what makes it great will be very limited.  As an analogy: We know that Usain Bolt is a record shattering sprinter; the scientific insight is to find out why.

4. Document your hypotheses, experiments, and results.

Good scientific practice mandates that experiments and their results be carefully documented.  This helps others (but also yourself!) in assessing and understanding the decisions in the course of your project.  If you make some design decision, such as a parameter setting, after examining how your software runs on some example, it is important that the motivation for this design decision can be traced back to the experiment and its result.

If this sounds like lot of work, that's because it is.  We're talking about the scientific method, not some fiddling around with parameters until we reach the desired result on a benchmark.  Fortunately, there are great means to help you with these tasks.  Jupyter Notebooks [8], for instance, allow you to collect your hypotheses (in natural language), your experiment design, its results (in beautiful and interactive graphs, among others), and your next refinement step – allowing anyone (as well as yourself) to understand how a specific result came to be.  Be sure to place your notebooks (and code) under version control from day one, and throw in some tests and assertions for quality assurance.  Control your environment carefully to make results reproducible for anyone.

5. Having benchmarks to compare tools and approaches is helpful, but brings risks.

Benchmarks are helpful means to assess the performance of tools.  However, they bring two risks.
  1. First, there is the risk of having researchers focus on the benchmark rather than insights.  It is nice to have a well-performing tool, but its scientific value comes from the insights that make its performance.
  2. Second, benchmarks bring the risk of researchers knowingly or unknowingly optimizing their tools for this very benchmark. We have seen this with compilers, databases, mobile phones, fault localization, machine learning, and now fuzzing.  To mitigate the risk of overspecialization, tool performance should be compared on programs they have not seen before.
A benchmark like LAVA-M is representative for detecting buffer overflows during input processing but very little else.  As the LAVA creators state themselves, "LAVA currently injects only buffer overflows into programs" and "A significant chunk of future work for LAVA involves making the generated corpora look more like the bugs that are found in real programs." [3].

It has been shown that optimizing against the artificial LAVA bugs, such as 4-byte string triggers, can have very naive approaches yield impressive results [2].  The conceptual match between the features injected by LAVA and those features exploited by fuzzers such as Angora is striking.

The question of what makes a good benchmark for fuzzers and test generation is still open.  One possible alternative to LAVA-M is Google's fuzzing test suite which contains a diverse set of programs with real bugs [5].  Michael Hicks has compiled excellent guidelines on how to evaluate fuzzers [4, 6].

6. Researchers must resist the temptation of optimizing their tools towards a specific benchmark.

While developing an approach, it is only natural to try it out on some examples to assess its performance, such that results may guide further refinement. The risk of such guidance, however, is that development may result in overspecialization – i.e., an approach that works well on a benchmark, but not on other programs.  As a result, one will get a paper without impact and a tool that nobody uses.

Every choice during implementation has to be questioned "Will this solve a general problem that goes way beyond my example?", and one should take that choice only with a positive, well-motivated answer, possibly involving other experts who would be asked in the abstract.  We recommend that during implementation, only a very small set of examples should be used for guidance; the evaluation should later be run on the full benchmark.

Good scientific practice mandates to create a research and evaluation plan with a clear hypothesis well before the evaluation, and possibly even before the implementation.  This helps to avoid being too biased towards one's own approach. Note that the point of the evaluation is not to show that an approach works, but to precisely identify the circumstances under which it works and the circumstances under which it does not work.

Papers should investigate those situations and clearly report them. Again, papers are about insights, not competition.

7. It is nice to have tools discovering vulnerabilities...

...especially as these vulnerabilities have a value on their own. However, vulnerabilities do not follow statistical distribution rules (hint: otherwise it would be easier to find them).  Having a tool find a number of vulnerabilities in a program, therefore, is not necessarily a good predictor to find bugs in another program.

In any case, the process through which vulnerabilities were found must be carefully documented and made fully reproducible; for random-driven approaches such as fuzzers, one thus needs to log and report random seeds.  Obviously, one must be clear not to optimize tools towards given vulnerabilities.

For fuzzing tools, the technical challenge is to find inputs that cover a wide range of behavior across the program and not only during input processing and error handling.  Let us remind you that during testing, executing a location is a necessary condition for finding a bug in that very location.  Since we are still far from reaching satisfying results in covering functionality, improvements in code coverage are important achievements regardless of bugs being found.

8. What does this mean for reviewers and authors?

Papers must clearly show how the insights of the paper contribute to the result, both in terms of motivation as well as in evaluation.

In many cases, it will be hard to describe all the details of all the necessary steps in the paper. Therefore, it will be necessary to supply an artifact that allows for not only reproduction but also applying it on subjects not seen before; again, all design decisions in the code must be motivated and documented. This is tedious; this is rigorous; this is how science works.

Reviewers should be aware that an approach is not simply "better" because it performs well on a benchmark or because it found new bugs.  Approaches have a long-term impact not only through performance, but also through innovation, generality, and simplicity. Researchers are selected as reviewers because the community trusts them to assess such qualities. Tool performance that is achieved through whatever means has little scientific value.

Having said that, conference organizers should create forums for tool builders and tool users to discuss lessons learned.  Such exchanges can be extremely fruitful for scientific progress, even if they may not be subject to rigorous scientific assessment.  Tool contests with clear and fair rules would allow assessing the benefits and fallbacks of current approaches, and again to foster and guide discussions on where the field should be going.  A contest like Rode0day [7] could serve as a starting point.

Conclusion

Having tools is good, and having tools that solve problems is even better.  As scientists, however, we also must understand why what works and what does not.  As tools and vulnerabilities come and go, it is these insights that have the longest impact.  Our papers, our code and our processes, therefore, must all be shaped to produce, enable, assess, and welcome such insights.  This is the long-term path of how we as scientists can help to make software more reliable and more secure.

Acknowledgments.  Marcel Böhme, Cas Cremers, Thorsten Holz, Mathias Payer, and Ben Stock provided helpful feedback on earlier revisions of this post.  Thanks a lot!

References

[1] P. Chen and H. Chen, "Angora: Efficient Fuzzing by Principled Search." 2018 IEEE Symposium on Security and Privacy (IEEE S&P), San Francisco, CA, 2018, pp. 711-725.
[2] Of bugs and baselines
[3] B. Dolan-Gavitt et al., "LAVA: Large-Scale Automated Vulnerability Addition," 2016 IEEE Symposium on Security and Privacy (S&P), San Jose, CA, 2016, pp. 110-121.
[4] George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. 2018. Evaluating Fuzz Testing. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS '18). ACM, New York, NY, USA, 2123-2138.
[5] Google's fuzzing test suite
[6] Michael Hicks, Evaluating Empirical Evaluations (for Fuzz Testing
[7] Rode0day
[8] Project Jupyter

2019-10-10

When Results Are All That Matters: The Case of the Angora Fuzzer

by Andreas Zeller and Sascha Just; with Kai Greshake

The Case

"Fuzzers" are programs that generate random inputs to trigger failures in tested programs. They are easily deployed and have found numerous vulnerabilities in various programs. The last decade has seen a tremendous increase in fuzzing research [4].

In 2018, Chen and Chen published the paper "Angora: Efficient Fuzzing by Principled Search" [1] featuring a new gray box, mutation-based fuzzer called Angora. Angora presented extraordinary results in its effectiveness and shines with its stellar performance on the Lava-M benchmark [3], dramatically outperforming all competition.

The reason behind this breakthrough and leap in effectiveness towards deeper penetration of target programs was cited as the combination of four techniques: scalable byte-level taint tracking, context-sensitive branch count, input length exploration, and search based on gradient descent. The former three techniques had already been explored in earlier work; but the novel key contribution, as also indicated by the paper title, was the Gradient Descent approach to solve constraints modeling branch conditions as functions and approximating their gradient.

One of our students was intrigued by the outstanding performance and wanted to investigate why and how optimization strategies affect the fuzzing performance that much - with the initial goal of actually further improving on Angora. To properly measure the effect of his work, he went and worked directly on the Angora source code. During his study [2], he came across some confusing findings, which we summarize below.

The Findings

Angora's Gradient Descent performs similar to Random Search on LAVA-M

Contrary to what is stated in the Angora paper, we could not find real differences between using a random search algorithm and using Gradient descent to solve constraints. After further investigation, it turns out that in almost all optimization cycles, the constraint is solved immediately when the first input is executed. That first input is generated by using magic byte extraction – and only when it is replaced by a random starting point does the performance of Random Search drop off. Since 32-bit magic byte values comprise all of the bugs artificially injected into the LAVA benchmark, it does not surprise that this feature dominates the performance instead of Gradient Descent.

Of course, this raises the question of why Angora is so successful. Magic byte extraction is not a new technique, but in the case of this benchmark, in combination with taint tracking, good heuristics, and other improvements it leads to spectacular results on LAVA-M. However, Gradient Descent contributed little to that success.

These findings are confirmed by Angora's authors [5]: "If the fuzzer can extract the magic bytes and copy them to the correct location in the input, then obviously the fuzzer can solve the constraint immediately." They point out that Gradient Descent will show and did show advantages outside of solving magic byte constraints, even if this is not their evaluation setting. And indeed, our findings do not imply that Gradient Descent would not work; on the contrary, we still find it a highly promising idea. However, whether and when it can make a difference is an open question.

The Angora Evaluation Methodology is Inadequate

In the Angora paper, the authors claimed to have investigated the efficacy of each contribution individually and proved the effectiveness of Gradient Descent. Unfortunately, the comparison they present to support this claim is biased towards Gradient Descent. They tried to compare the constraint-solving capabilities of random search, random search with magic byte extraction and Gradient descent. They collected an input corpus using AFL and then started Angora with each algorithm and the same input corpus. This way, all algorithms start with the same set of constraints to solve and the authors reported the corresponding solve rate for each algorithm. 
 
Since AFL uses random mutations to solve constraints, it is not surprising that all constraints that can be easily solved with random mutations have been already solved in the input corpus. This gives random search an immediate disadvantage. 
 
While inspecting the code of Angora, we also found another issue. The evaluation ([1, Section 5.4]) states that Gradient Descent is compared to random search with Magic Byte extraction. However, the Angora Gradient Descent implementation itself makes use of Magic Byte Extraction, thus guaranteeing that it always performs at least as well as its closest competitor in the comparison. This methodology is inadequate and the original conclusions about the efficacy of Gradient Descent are not supported by our research. 
 
The Angora authors state that their "exploitation routines are a part of the enhancements aforementioned. During gradient descent, Angora first descends by the entire gradient. Only when this fails to solve the constraint does Angora try to descend by each partial derivative as a crude attempt of salvation." [5] According to our findings, however, these heuristics dominate the search of Gradient Descent for constraints with a large number of input dimensions – a fact that should have been discovered in an adequate evaluation setting.

Odd Heuristics and Parameters

The Angora code contains a number of optimizations and behaviors that are not documented in the paper. While the volume of information that can be published in a paper is limited, many of these had a significant impact on the potential performance; yet, they are not documented or motivated in the released source code. These are some examples:
  • The Gradient Descent implementation goes beyond what is described in the paper. It contains special exploitation routines that inject fixed values that are known to trigger overflows and numerical bugs. The algorithm also does not only use the entire gradient but instead descends by each partial derivative.
  • Parameters like the optimization budget are set to fixed, arbitrary values without justification or documentation for the chosen parameters:
    // SEARCH
    pub const ENABLE_DET_MUTATION: bool = true;
    pub const MAX_SEARCH_EXEC_NUM: usize = 376;
    pub const MAX_EXPLOIT_EXEC_NUM: usize = 66;
    pub const MAX_NUM_MINIMAL_OPTIMA_ROUND: usize = 8;
    pub const MAX_RANDOM_SAMPLE_NUM: usize = 10;
    pub const GD_MOMENTUM_BETA: f64 = 0.0;
    pub const GD_ESCAPE_RATIO: f64 = 1.0;
    pub const BONUS_EXEC_NUM: usize = 66;
    

    Values like MAX_SEARCH_EXEC_NUM (376; the number of iterations to solve a constraint) or BONUS_EXEC_NUM (66; the number of additional iterations added under certain conditions) would be odd choices in an experiment design; computer scientists would normally prefer multiples of powers of 10, 2, or 5. These choices are not justified or documented; and in any case, the impact of these choices (compared to others) would have to be determined.
    The value of 376 for MAX_SEARCH_EXEC_NUM is a particularly odd choice. Since the few constraints that Gradient Descent operated on were actually either solved after a maximum of 50 iterations or not at all, setting the iteration budget to at most 50 should actually increase the performance on LAVA-M.
The Angora authors confirm that the "current values seem to work well on most programs, but we did not rigorously evaluate them or search for their optimal values." [5] This implies that the authors themselves do not know which factors contribute to Angora's evaluation performance, or why their choices would be particularly well suited to settings outside of the evaluation.

Reproducibility

All of the above findings refer to the published Angora code, which may differ from the one used in the paper evaluation. We have therefore asked the Angora authors to provide us with the version used for the experiments described in the paper. They said that they "have been steadily maintaining and improving the Angora repository over the past year, with numerous bug fixes and enhancements. Its current performance is similar to, if not better than, the version on which our paper's evaluation was based." Unfortunately, the history of the public repository does not go back to the time the paper was published, and our request for the original version remains unfulfilled. We do not know whether the Angora authors would be able to reproduce their published results.

Summary

The performance and effectiveness of the Angora tool are uncontested, notably on the LAVA-M benchmark. If you want to fuzz a program that is part of the LAVA-M benchmark or very similar, Angora is likely to give you good results. Finally, the authors are to be commended for making their code available, and for providing detailed and timely answers to our questions. However,
  1. Angora's novel Gradient Descent approach has little to no impact on its performance on LAVA-M;
  2. The performance of Angora is impacted by several decisions that are not documented, motivated, assessed, or evaluated individually or sufficiently; and
  3. The evaluation methodology is inadequate to support the paper's conclusions on Angora's performance.
Guidelines for good scientific practice mandate that all information and decisions that helped to achieve a specific result be well documented. This is not the case here – neither in the paper nor in the code.

It would be a mistake, however, to point at the Angora authors only. Chen and Chen did make their code available, allowing for independent assessment. Several recent fuzzing papers use similar techniques and benchmarks as Angora, yet only few make their code available. Can we trust their results?

This calls for the community to discuss and question its procedures and criteria on what makes good research. Reviewers and researchers must not only focus on great results, but also assess how these results have been obtained, whether they have been rigorously evaluated, and how they translate into insights that will stand the test of time. How this can be achieved for programs with thousands of lines of undocumented code is a pressing question.

For now, the Angora case tells us how much we do not know, and how much further insights into what works and what does not will be needed. The good news is that this opens lots of opportunities for discussion – and future research.


Acknowledgments.  Marcel Böhme, Cas Cremers, Thorsten Holz, Mathias Payer, and Ben Stock provided helpful feedback on earlier revisions of this post.  Thanks a lot!

If you liked this post, also see our follow-up post with consequences.

You can contact Kai Greshake as development@kai-greshake.de.


References