Your comments are valuable, and we thank you.

Xuejun Yang is a Senior Staff Engineer at FutureWei Technologies. He is the DFA on the 2011 paper, “Finding and Understanding Bugs in C Compilers.”

Today Ken and I discuss a clever idea from that paper.

The paper was brought to our attention just now in a meaty comment by Paul D. We thank him for it—the topic interests both of us. We don’t think Paul D. means to be anonymous, but in keeping with that we’ll give just a cryptic hint to his identity: The saying “a man on the make” is widely known, but for more than a millennium he has been the unique person in the world to whom it applies literally. Update 11/20: Turns out we (I, Ken) were wrong about the identity, see this.

Yang was made unique by being listed out of alphabetical order on the paper. This is notable because the most common practice in our field is to list alphabetically irrespective of prominence. Hence we’ve invented the term ‘DFA’ for “Designated” or “Distinguished” First Author. The other authors are Yang Chen, Eric Eide, and John Regehr, all from the University of Utah.

## The Topic

Paul D.’s comment notes that there was evidence that verification methods could improve compiler correctness. By compiler we mean the program that transforms high-level code into machine code. These programs are run countless times every day, and their correctness is clearly very important.

Their correctness is tricky for several reasons. The main one is that almost all compilers try to optimize code: when they translate code into instructions, they try to rewrite or rearrange those instructions to yield better performance. Compilers have been doing this forever. The trouble is that changing instructions to increase performance is dangerous. The changes must not affect the values that are computed. If they are not made carefully, they can make the program faster but the answers incorrect. This is the reason correctness is tricky.

Formal verification requires a lot of effort. The highest effort should go into mission-critical software. But compilers are already mission-critical themselves, unless we know that no mission-critical software will be compiled with a particular one. Hence it is notable when formal verification makes a compiler more reliable.

## The Paper

The idea in the paper Paul referenced is quite elegant. They built a program called Csmith. It operates as follows:

Suppose that ${X}$ is a compiler they wish to test. Then generate various legal C programs ${P}$. For each of these let ${A}$ be the answer that ${X(P)}$ yields. Here ${X(P)}$ is the compiled program. Then check whether ${A}$ is correct.

For example:

```c
int foo (void) {
  signed char x = 1;
  unsigned char y = 255;
  return x > y;
}
```

Some compilers returned ${1}$, but the correct answer is ${0}$: under C’s integer promotions, both operands are converted to int before the comparison, so the test is ${1 > 255}$. There are further examples in a 2012 companion paper and these slides from an earlier version. The Csmith homepage has long lists of compiler bugs they found.

Of course if ${X}$ crashes or refuses to compile ${P}$, then the compiler is wrong. But what happens if ${A}$ is computed? How does Csmith know whether the answer is correct? This seems to be really hard. The correctness testing must be automated: the whole approach is based on allowing tons of random programs to be tested. They cannot assume that humans will be used to check the outputs.

This is the clever idea of this paper. They assume that there are at least two compilers say ${X}$ and ${Y}$. Then let ${A}$ be the output of ${X(P)}$ and let ${B}$ be the output of ${Y(P)}$. The key insight is:

If ${A}$ is not equal to ${B}$, then one of the compilers is wrong.

A very neat and elegant idea. For software in general this is called differential testing.

This at least raises an alert when some compiler has a problem with some program. One can use this trick to discover programs that cause at least some compilers to misbehave. This is extremely valuable: it allowed Csmith to discover hundreds of errors in production compilers, errors that had previously been missed.

## Smart Fuzzing

Fuzzing is defined by Wikipedia as testing by “providing invalid, unexpected, or random data as inputs to a computer program.” An early historical example, Apple’s “Monkey” program, worked completely randomly. To ensure that the bugs it finds are meaningful and analyzable, Csmith needed a deeper, structured, “intelligent” design, not just the generation of mayhem.

For one, Csmith needed to avoid programs ${P}$ that do not have deterministic behavior. The formal C standards itemize cases in which compilers are allowed to have arbitrary, even self-inconsistent, behavior, and there are lots of them in C. A bug report based on such dubious code could be dismissed out of hand.

For another, the probability that a program ${P}$ built haphazardly by the original Csmith version would reveal bugs was observed to peak at about 80KB of source code, about 1,000 lines across multiple pages. Those don’t make great examples. So Csmith has its own routines to compress the bug instances it finds. Simple tricks include shortening numerical expressions to only their bug-sensitive parts; others include lifting local variables out of blocks and bypassing pointer jumps.

A third goal is that the generator should branch out to all aspects of the language—in this case, C—not just the “grungy” parts that are ripe for finding compiler bugs. The paper talks about this at length. Regehr, who was Yang’s advisor, is also a blogger. His current post, dated November 4, is titled, “Helping Generative Fuzzers Avoid Looking Only Where the Light is Good, Part 1.” We guess that “Part 2” will go even more into details.

## Formal Methods as Bugscreen

Regarding the formally-verified CompCert compiler, Paul D. quoted from the paper:

The striking thing about our CompCert results is that the middle-end bugs we found in all other compilers are absent. As of early 2011, the under-development version of CompCert is the only compiler we have tested for which Csmith cannot find wrong-code errors. This is not for lack of trying: we have devoted about six CPU-years to the task. The apparent unbreakability of CompCert supports a strong argument that developing compiler optimizations within a proof framework, where safety checks are explicit and machine-checked, has tangible benefits for compiler users.

This August 2019 paper by Michaël Marcozzi, Qiyi Tang, Alastair Donaldson, and Cristian Cadar gives recent results involving Csmith and other tools. They have an interesting discussion on page 2, from which we excerpt:

In our experience working in the area […], we have found compiler fuzzing to be a contentious topic. Research talks on compiler fuzzing are often followed by questions about the importance of the discovered bugs, and whether compiler fuzzers might be improved by taking inspiration from bugs encountered by users of compilers “in the wild.” Some … argue that any miscompilation bug, whether fuzzer-found or not, is a ticking bomb that should be regarded as severe, or avoided completely via formal verification (in the spirit of CompCert).

They go on to say, however, that when a fully-developed compiler is used for non-critical software, the kinds of bugs typically found by fuzzing tend to have questionable importance. Their paper is titled, “A Systematic Impact Study for Fuzzer-Found Compiler Bugs.”

So far they have found definite results that seem to have mixed implications. In their future-work section they note that they have evaluated the impact of bugs in compilers on the intended function of programs they compile, but not on possible security holes—which as we noted in our Cloudflare post can come from (misuse of) simple code that is completely correct. This leads us further to wonder, coming full-circle, whether formal methods might help quantify the relative importance of aspects of a language and areas of a compiler to guide more-intelligent generation of test cases.

## Open Problems

The above comment is interesting, but perhaps finding obscure bugs is important. Perhaps such bugs could be used to attack systems. That is, perhaps someone could use them to break into a system. Security may be compromised by any error, even one unlikely to occur in the wild.

What do you think?

21 Comments
1. Paul W. Homer permalink
November 19, 2019 9:13 pm

Obscure bugs are probably the most valuable ones to black hat hackers these days. If you find a few that are exploitable, before everyone else, you can leverage them to gain access. It seems to be quite profitable.

From my perspective, although we can’t automatically assess the fit of any given piece of software to a given real world problem (a big source of bugs) we can ensure that the software is at least predictable for a very large set of inputs, both on a micro and macro scale. In the case of differential testing (sometimes called reference testing) one program can be seen as predicting the output of the other. Even if they choose different paths through the computations, they eventually need to converge towards the end.

Paul.

2. Paul D. permalink
November 20, 2019 12:48 am

I’m not sure what the “man on the make” comment refers to — is that Paul D. Smith, the maintainer of GNU Make? That’s not me. I apologize for the confusion.

Marcozzi et al. make this observation about the relative importance of fuzzer-found compiler bugs:

“Contrary to fuzzer-found bugs, user-reported ones were discovered by spotting miscompilations in the code of a real application. Our results tend to show that this does not make user-reported bugs likelier to trigger when compiling other applications. On the contrary, different binaries are only spotted for 2/10 bugs and 2% of the package builds, far below the values of 11/20 and 8% reached by Csmith and EMI.”

This would be evidence that fuzzer-found bugs are more serious than user-reported bugs, at least in compilers.

I get the impression that many, perhaps most, compiler development efforts these days are using random testing techniques as part of their testing effort. Personally, when testing Common Lisp implementations, I’ve found it to be very effective.

The mainstreaming of fuzzing as a desirable, even required, testing technique is remarkable, given the disdain with which the older testing literature treated random testing. Computing power is now millions of times cheaper than it was then. This is even making mutation testing [Lipton, 1971] something that’s being applied in the “real world”, at Google:

• November 20, 2019 3:25 pm

You’ve guessed the answer right—but oh my, I put a 90-percenter on it being a unique match to you when I augmented Dick’s draft of the intro.

Thanks for highlighting one of the Marcozzi et al. paper’s conclusions. We didn’t find a way to put them crisply, beyond saying “definite results with seemingly mixed implications.”

November 20, 2019 5:15 am

I think the old paper “Social Processes and Proofs of Theorems and Programs” is an interesting read even today. Some of the points it makes are excellent — for example on the social nature of proof verification in mathematical practice, or the eternal existence of a formal-informal gap when specifying systems. But some of the points were also the expression of the authors’ subjective opinion at the time, very negative about formal verification as a scientific field, and have been conclusively proven wrong in recent years. For example the paper says:

> But the stakes do not affect our belief
> in the basic impossibility of verifying
> any system large enough and flexible
> enough to do any real-world task.

But today we have a verified compiler for the C programming language, for an SML dialect, verified operating systems, a verified XML-RPC implementation, verified cryptography stacks, etc. While we are still far from the full verification of a personal computer, we have collectively verified large systems that can be used for real-world tasks. There remain many challenges, research challenges and verification-engineering challenges, to get there, but a central idea of the paper, that formal verification cannot scale to large programs due to the discontinuity between small routines and large systems, is now falsified, or at least put into very serious doubt.

I was thus interested to see your previous post on this paper. Is it time to admit that your prediction at the time has now been proved wrong, and provide a new assessment of the state of formal verification of software and its research potential? The post did not provide such a discussion (it focused on the parts of the paper that are correct, which is much easier).

Then you got an excellent comment about the Compcert C compiler, a formally verified C compiler, and you made a new post about this comment. However this present post is fairly disappointing again:

– Instead of giving publicity to the Compcert result, a formally verified, real-world C compiler, you decided to give the highlight to a different work, not on formal verification of software, that was used in the comment because it brings empirical evidence on the value of formal verification. (This is also an excellent work, but still the shift from the original question of formal verification is a bit disappointing.)

– You also report on a different work, trying to assess the impact of fuzzer-found bugs in real-world, battle-tested (but not formally verified) compilers. Unfortunately, your presentation misrepresents this work: you say that “the kind of bugs found by fuzzing tend to have questionable importance”, but the paper experimentally shows that *all* bugs found within battle-tested compilers, whether human-found or fuzzer-found, have limited impact on existing software (which is a sort of obvious result, although it’s good to get empirical confirmation: critical bugs affecting a lot of code have been found and fixed earlier in the compiler’s lifetime).

• Paul D. permalink
November 20, 2019 7:31 am

gasche: It’s important to understand the limitations of CompCert. It did not verify the entire compiler, just the “middle end”. Csmith found bugs in other, unverified parts.

Moreover, if I am not misunderstanding, the verification of the middle end was of the form “if this terminates normally, the result is correct”, not “this will always terminate normally, with correct results.” This means, for example, that one does not have to verify that (say) a graph coloring register allocator is always correct, one just has to have a pass that checks that the register allocation is legal, and abort if it’s not (and also verify that checker). This seems perfectly fine for a compiler, where the concern is that wrong code bugs will silently slip through; aborts at compile time can be addressed by ordinary bug fixing as they occur. It wouldn’t be fine for verifying (say) avionics software controlling an Ariane 5 launcher.

(There’s a whiff of P vs. NP in this separation of computing and checking.)

Ultimately, the competition between verification and testing (or other means to improve quality) will be economic. As computing power has gotten cheaper, techniques capable of hands-off automation have an advantage. Verification systems will have to become less and less “hands on” just to keep up.

November 20, 2019 7:58 am

I’m familiar with the CompCert work. (I think your statement that it “only verifies the middle end” is an exaggeration. CompCert verifies the whole pipeline between a token stream as input and an abstract syntax of an assembly program as output. There do exist informal gaps at the edges of the system, some of which have been closed after this work was published, and some that are still around.)

Indeed, CompCert uses translation-validation in some parts (to reduce the verification effort), so the correctness statement only applies to correct compilation runs. Note that even when you avoid translation-validation, you will rarely have system specifications saying “this will always terminate normally and …”: most problem domains have error cases, and those must/can be described in the specification.

I certainly agree that there is an economic competition between verified and non-verified software, and I am not in any way trying to claim that all software, or even all software for which we can work out reasonable specifications, is going to be formally verified. But it is evident, with the current state of research, that for large systems for which we do have reasonable (possibly partial) specifications, it is technically, humanly and socially possible to produce a formally-verified implementation, unlike what the original article predicted. (One important aspect of this is the ability to use proof assistants to separate the high-level/interesting mathematical content, typically lemma statements, that is shared and improved and grown socially as standard mathematical proofs, from the low-level parts, partly delegated to automated provers, partly accepted by verified-software authors as things they don’t look at in detail unless they need to, closer to large-scale software systems than to usual mathematical proofs.) Whether we (we the programming industry?) actually want to do more formal verification of software systems is a fairly different question.

One important area of programming where I’m not aware of important practical verification results is numerical programming (as in the avionics software you mention). I am far from an expert in this area, but my understanding is that it is difficult to give good/strong specifications to these systems when their computations are numerically unstable, and there is often no good human-level specification of the relation between the program artifact (with precision approximations) and the underlying mathematical model. There are a few examples of small-scale verified software (see for example https://www.lri.fr/~sboldo/research.html ), but I haven’t heard of formal specifications and verifications for a real-world simulation program, for example.

4. November 20, 2019 10:02 am

The researchers involved claim that CompCert is a verified C compiler, but this is soap powder advertising: https://shape-of-code.coding-guidelines.com/2013/03/10/verified-compilers-and-soap-powder-advertising/

There are other source generation tools that have generated code that finds mistakes in CompCert (not many mistakes, but a few).

There are finite resources available to fix mistakes in compilers.

Human written code contains certain usage patterns: http://www.knosof.co.uk/cbook
It makes sense for fuzzers to mimic these patterns, on the basis that compilers are then being tested with the distribution of input they are most likely to encounter. Some fuzzers are becoming more ‘human’-like.

5. November 20, 2019 7:00 pm

I wish my brain was like C#(interpreter). After 2 hours of work, it stops working and needs to recompile my
whole life(like C++). Which takes hours
I need to share my private parts? My source code is private I need to make it public.
Online life like an interpreter, not offline

6. November 25, 2019 8:23 am

> This is notable because the most common practice in our field is to list alphabetically irrespective of prominence.

(Humor about DFA aside :)) This paper was presented at PLDI. In the Programming Languages field, it is fairly common to have first authors as the PhD student who did the most work and their advisor as the last author. I am unsure what this scheme of authorship order is called generally.

7. Song Li permalink
December 3, 2019 1:06 pm

Back in 2005 I implemented an industrial-level Verilog / VHDL random syntax forest walker to generate random Verilog / VHDL test cases for Altera (now Intel) Quartus II. This became part of the Quartus II test suites, and my colleague further extended the test suites using other HDL simulators to verify the simulation results. On the first day using this we generated 10K test cases and had around 1K failures; those boiled down to around 20 bugs.

December 3, 2019 2:52 pm

Dear Song Li:

Thanks for this info. Is there a report on the type of bugs that were found? I would be interested to see a sample or a type of the bugs.

Best

Dick

• Song Li permalink
December 3, 2019 6:01 pm

Hi Dick,

I am sorry, this was 14 years ago, and later we found many more bugs – 20+ was only day 1 – so I may not remember everything.

If my memory serves, the most common bugs we found were invalid C++ pointers. Quartus II wasn’t the best memory-managed C++ software I’ve seen, and I had to implement a memory manager for it to handle designs larger than 1GB on 32-bit Windows, but that is another story. Many of the auto-generated cases would trigger the same memory bug.

The 2nd most frequent bugs we found were elaboration issues; elaboration in HDL design is roughly the counterpart of code generation for sequential programming languages. Bringing in another simulator’s results helps find those bugs.

The rest were just miscellaneous types of bugs. It turned out that bugs, too, follow a power-law distribution when aggregated by type.

Cheers

Song