2019-10-10

When Results Are All That Matters: The Case of the Angora Fuzzer

by Andreas Zeller and Sascha Just; with Kai Greshake

The Case

"Fuzzers" are programs that generate random inputs to trigger failures in tested programs. They are easily deployed and have found numerous vulnerabilities in various programs. The last decade has seen a tremendous increase in fuzzing research [4].

In 2018, Chen and Chen published the paper "Angora: Efficient Fuzzing by Principled Search" [1], featuring a new gray-box, mutation-based fuzzer called Angora. Angora reported extraordinary effectiveness, shining with stellar performance on the LAVA-M benchmark [3] and dramatically outperforming all competition.

The paper attributed this breakthrough in effectiveness and deeper penetration of target programs to the combination of four techniques: scalable byte-level taint tracking, context-sensitive branch count, input length exploration, and search based on gradient descent. The first three techniques had already been explored in earlier work; the novel key contribution, as the paper title indicates, was the Gradient Descent approach to solving constraints, which models branch conditions as black-box functions and approximates their gradients.
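
To give an intuition of the idea, here is a minimal sketch in Rust, the language Angora itself is written in. This is not Angora's actual algorithm: the branch condition, the distance function, and the greedy unit-step search below are simplified assumptions chosen purely for illustration.

    // Minimal sketch of the general idea -- not Angora's actual algorithm.
    // A branch condition such as `b0 + 2*b1 == 500` is modeled as a black-box
    // distance function that is 0 exactly when the branch is taken; its
    // gradient is approximated by finite differences over individual bytes.

    /// Hypothetical branch condition, expressed as a distance to "taken".
    fn branch_distance(buf: &[u8]) -> i64 {
        (buf[0] as i64 + 2 * buf[1] as i64 - 500).abs()
    }

    /// Greedy unit-step descent guided by finite-difference probes.
    fn descend(mut buf: Vec<u8>, budget: usize) -> Option<Vec<u8>> {
        for _ in 0..budget {
            let f0 = branch_distance(&buf);
            if f0 == 0 {
                return Some(buf); // constraint solved: the branch flips
            }
            // Probe each byte one unit up and down; keep the best improvement.
            let mut best: Option<(Vec<u8>, i64)> = None;
            for i in 0..buf.len() {
                for &byte in &[buf[i].saturating_add(1), buf[i].saturating_sub(1)] {
                    let mut probe = buf.clone();
                    probe[i] = byte;
                    let f = branch_distance(&probe);
                    if f < best.as_ref().map_or(f0, |b| b.1) {
                        best = Some((probe, f));
                    }
                }
            }
            match best {
                Some((next, _)) => buf = next,
                None => return None, // stuck in a local minimum
            }
        }
        None
    }

    fn main() {
        // Starting from an all-zero seed, the sketch finds [0, 250],
        // which satisfies 0 + 2*250 == 500 and thus flips the branch.
        println!("{:?}", descend(vec![0u8; 2], 1_000));
    }

The appeal of the approach is that it needs only concrete executions of the target, rather than symbolic execution and constraint solvers.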

One of our students was intrigued by the outstanding performance and wanted to investigate why and how optimization strategies affect fuzzing performance so much - with the initial goal of further improving on Angora. To properly measure the effect of his work, he worked directly on the Angora source code. During his study [2], he came across some confusing findings, which we summarize below.

The Findings

Angora's Gradient Descent performs similarly to Random Search on LAVA-M

Contrary to what is stated in the Angora paper, we could not find real differences between using a random search algorithm and using Gradient Descent to solve constraints. After further investigation, it turns out that in almost all optimization cycles, the constraint is solved immediately when the first input is executed. That first input is generated by magic byte extraction – and only when it is replaced by a random starting point does the performance of Random Search drop off. Since all of the bugs artificially injected into the LAVA benchmark are guarded by 32-bit magic byte comparisons, it is not surprising that this feature, rather than Gradient Descent, dominates the performance.
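
To illustrate why this matters, consider a simplified, hypothetical LAVA-style check; the magic constant, the input offset, and the function names below are our own and not taken from the benchmark.

    // Simplified, hypothetical LAVA-style check -- not from the benchmark:
    // the injected "bug" fires only when four input bytes equal a 32-bit
    // magic constant.
    const MAGIC: u32 = 0xDEAD_BEEF;
    const OFFSET: usize = 8; // assume bytes 8..12 flow into the comparison

    fn buggy_program(input: &[u8]) -> bool {
        if input.len() < OFFSET + 4 {
            return false;
        }
        let v = u32::from_le_bytes([
            input[OFFSET],
            input[OFFSET + 1],
            input[OFFSET + 2],
            input[OFFSET + 3],
        ]);
        v == MAGIC // the injected bug would trigger here
    }

    /// What "magic byte extraction" buys the fuzzer: byte-level taint
    /// tracking reveals which input offset flows into the comparison and
    /// which constant it is compared against, so solving the constraint
    /// is a single copy followed by one execution.
    fn solve_with_magic_bytes(seed: &[u8]) -> Vec<u8> {
        let mut input = seed.to_vec();
        input[OFFSET..OFFSET + 4].copy_from_slice(&MAGIC.to_le_bytes());
        input
    }

    fn main() {
        let seed = vec![0u8; 16];
        assert!(!buggy_program(&seed));
        // One targeted mutation solves the constraint on the first try ...
        assert!(buggy_program(&solve_with_magic_bytes(&seed)));
        // ... while a blind random mutation has a 1-in-2^32 chance per attempt.
        println!("constraint solved on the first targeted execution");
    }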

Of course, this raises the question of why Angora is so successful. Magic byte extraction is not a new technique; but on this benchmark, in combination with taint tracking, good heuristics, and other improvements, it leads to spectacular results on LAVA-M. However, Gradient Descent contributed little to that success.

These findings are confirmed by Angora's authors [5]: "If the fuzzer can extract the magic bytes and copy them to the correct location in the input, then obviously the fuzzer can solve the constraint immediately." They point out that Gradient Descent will show and did show advantages outside of solving magic byte constraints, even if this is not their evaluation setting. And indeed, our findings do not imply that Gradient Descent would not work; on the contrary, we still find it a highly promising idea. However, whether and when it can make a difference is an open question.

The Angora Evaluation Methodology is Inadequate

In the Angora paper, the authors claim to have investigated the efficacy of each contribution individually and to have proven the effectiveness of Gradient Descent. Unfortunately, the comparison they present to support this claim is biased towards Gradient Descent. They tried to compare the constraint-solving capabilities of random search, random search with magic byte extraction, and Gradient Descent: they collected an input corpus using AFL and then started Angora with each algorithm on that same corpus. This way, all algorithms start with the same set of constraints to solve, and the authors reported the corresponding solve rate for each algorithm.
 
Since AFL uses random mutations to solve constraints, it is not surprising that all constraints that can easily be solved with random mutations have already been solved in the input corpus. This puts random search at an immediate disadvantage.
 
While inspecting Angora's code, we found another issue. The evaluation ([1, Section 5.4]) states that Gradient Descent is compared to random search with magic byte extraction. However, the Angora Gradient Descent implementation itself makes use of magic byte extraction, thus guaranteeing that it always performs at least as well as its closest competitor in the comparison. This methodology is inadequate, and the original conclusions about the efficacy of Gradient Descent are not supported by our research.
 
The Angora authors state that their "exploitation routines are a part of the enhancements aforementioned. During gradient descent, Angora first descends by the entire gradient. Only when this fails to solve the constraint does Angora try to descend by each partial derivative as a crude attempt of salvation." [5] According to our findings, however, these heuristics dominate the search of Gradient Descent for constraints with a large number of input dimensions – a fact that should have been discovered in an adequate evaluation setting.
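
To make the quoted behavior concrete, the following sketch paraphrases it; again, this is not Angora's actual code, and the helper names, unit-step rule, and example constraint are assumptions made purely for illustration.

    // Rough paraphrase of the behaviour described above -- not Angora's
    // actual code. The helper names, the unit-step rule, and the example
    // constraint are assumptions made purely for illustration.

    /// Finite-difference approximation of the gradient of `f` at `x`.
    fn grad<F: Fn(&[i64]) -> i64>(f: &F, x: &[i64]) -> Vec<i64> {
        let f0 = f(x);
        (0..x.len())
            .map(|i| {
                let mut probe = x.to_vec();
                probe[i] += 1;
                f(probe.as_slice()) - f0
            })
            .collect()
    }

    /// Take unit steps against the sign of `dir` while `f` keeps decreasing.
    /// Returns true iff the constraint (f == 0) has been solved.
    fn descend_along<F: Fn(&[i64]) -> i64>(
        f: &F,
        x: &mut Vec<i64>,
        dir: &[i64],
        budget: usize,
    ) -> bool {
        for _ in 0..budget {
            if f(x.as_slice()) == 0 {
                return true;
            }
            let next: Vec<i64> = x.iter().zip(dir).map(|(xi, di)| xi - di.signum()).collect();
            if f(next.as_slice()) >= f(x.as_slice()) {
                break; // no further progress in this direction
            }
            *x = next;
        }
        f(x.as_slice()) == 0
    }

    fn solve<F: Fn(&[i64]) -> i64>(f: &F, mut x: Vec<i64>, budget: usize) -> Option<Vec<i64>> {
        // 1) First, descend by the entire gradient ...
        let g = grad(f, x.as_slice());
        if descend_along(f, &mut x, &g, budget) {
            return Some(x);
        }
        // 2) ... and only if that fails, descend by each partial derivative.
        for i in 0..x.len() {
            let mut dir = vec![0; x.len()];
            dir[i] = grad(f, x.as_slice())[i];
            if descend_along(f, &mut x, &dir, budget) {
                return Some(x);
            }
        }
        None
    }

    fn main() {
        // Hypothetical two-dimensional constraint: solved when x0 + 2*x1 == 500.
        let f = |x: &[i64]| (x[0] + 2 * x[1] - 500).abs();
        println!("{:?}", solve(&f, vec![0, 0], 10_000));
    }

In this toy example, the full-gradient phase stalls one unit away from the solution and the per-coordinate fallback finishes the job, which illustrates how such auxiliary heuristics, rather than the gradient step itself, can end up doing most of the work.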

Odd Heuristics and Parameters

The Angora code contains a number of optimizations and behaviors that are not documented in the paper. While the volume of information that can be published in a paper is limited, many of these have a significant impact on performance; yet, they are neither documented nor motivated in the released source code. Here are some examples:
  • The Gradient Descent implementation goes beyond what is described in the paper. It contains special exploitation routines that inject fixed values known to trigger overflows and numerical bugs. The algorithm also does not use only the entire gradient, but additionally descends by each partial derivative.
  • Parameters like the optimization budget are set to fixed, arbitrary values without justification or documentation:
    // SEARCH
    pub const ENABLE_DET_MUTATION: bool = true;
    pub const MAX_SEARCH_EXEC_NUM: usize = 376;
    pub const MAX_EXPLOIT_EXEC_NUM: usize = 66;
    pub const MAX_NUM_MINIMAL_OPTIMA_ROUND: usize = 8;
    pub const MAX_RANDOM_SAMPLE_NUM: usize = 10;
    pub const GD_MOMENTUM_BETA: f64 = 0.0;
    pub const GD_ESCAPE_RATIO: f64 = 1.0;
    pub const BONUS_EXEC_NUM: usize = 66;
    

    Values like MAX_SEARCH_EXEC_NUM (376; the number of iterations to solve a constraint) or BONUS_EXEC_NUM (66; the number of additional iterations added under certain conditions) would be odd choices in an experiment design; computer scientists would normally prefer multiples of powers of 10, 2, or 5. These choices are neither justified nor documented; and in any case, their impact (compared to other choices) would have to be determined.
    The value of 376 for MAX_SEARCH_EXEC_NUM is a particularly odd choice. Since the few constraints that Gradient Descent operated on were either solved within a maximum of 50 iterations or not at all, setting the iteration budget to at most 50 should actually increase the performance on LAVA-M.
The Angora authors confirm that the "current values seem to work well on most programs, but we did not rigorously evaluate them or search for their optimal values." [5] This implies that the authors themselves do not know which factors contribute to Angora's evaluation performance, or why their choices would be particularly well suited to settings outside of the evaluation.

Reproducibility

All of the above findings refer to the published Angora code, which may differ from the one used in the paper evaluation. We have therefore asked the Angora authors to provide us with the version used for the experiments described in the paper. They said that they "have been steadily maintaining and improving the Angora repository over the past year, with numerous bug fixes and enhancements. Its current performance is similar to, if not better than, the version on which our paper's evaluation was based." Unfortunately, the history of the public repository does not go back to the time the paper was published, and our request for the original version remains unfulfilled. We do not know whether the Angora authors would be able to reproduce their published results.

Summary

The performance and effectiveness of the Angora tool are uncontested, notably on the LAVA-M benchmark. If you want to fuzz a program that is part of the LAVA-M benchmark or very similar to it, Angora is likely to give you good results. The authors are also to be commended for making their code available, and for providing detailed and timely answers to our questions. However,
  1. Angora's novel Gradient Descent approach has little to no impact on its performance on LAVA-M;
  2. The performance of Angora is impacted by several decisions that are not documented, motivated, assessed, or evaluated individually or sufficiently; and
  3. The evaluation methodology is inadequate to support the paper's conclusions on Angora's performance.
Guidelines for good scientific practice mandate that all information and decisions that helped to achieve a specific result be well documented. This is not the case here – neither in the paper nor in the code.

It would be a mistake, however, to point at the Angora authors only. Chen and Chen did make their code available, allowing for independent assessment. Several recent fuzzing papers use techniques and benchmarks similar to Angora's, yet only a few make their code available. Can we trust their results?

This calls for the community to discuss and question its procedures and criteria on what makes good research. Reviewers and researchers must not only focus on great results, but also assess how these results have been obtained, whether they have been rigorously evaluated, and how they translate into insights that will stand the test of time. How this can be achieved for programs with thousands of lines of undocumented code is a pressing question.

For now, the Angora case tells us how much we do not know, and how much further insights into what works and what does not will be needed. The good news is that this opens lots of opportunities for discussion – and future research.


Acknowledgments.  Marcel Böhme, Cas Cremers, Thorsten Holz, Mathias Payer, and Ben Stock provided helpful feedback on earlier revisions of this post.  Thanks a lot!

If you liked this post, also see our follow-up post with consequences.

You can contact Kai Greshake at development@kai-greshake.de.


References






1 comment:

  1. I think one possible lesson here is a (difficult to stick to in fuzzing/testing) goal of investigating new techniques by implementing "one weird trick" on an existing fuzzer baseline, whenever possible. If you build a very different fancy fuzzer from scratch, there are lots of possible "gotchas" in assigning credit/blame for performance. If you add 50 lines of code to AFL, on the other hand...

    But this is hard to do for designs like VUzzer, Angora, Eclipser, etc.

