Don't trust the compiler: part #2
In his previous article, yazoo gives an example of why you shouldn't trust the compiler. But in the end, that was the user's fault (yours, mine...), not the compiler's. Instead of saying "don't trust the compiler", we should rather say "don't trust your knowledge of the compiler", as a reminder that WE need to feed it the right code in the first place; after that, we trust the compiler to do the right thing, right?
All in all, while some programmers won't trust the compiler to optimize their code, most programmers will trust it to compile their code correctly. Code that runs properly is usually far more important than code that runs fast. So we trust the compiler religiously on correctness, and we're wrong to. Here's why.
At work, I can hear people around me saying "oh, it's the compiler's fault", "rebuild, and it'll work this time", "nah, my code is right", and so on. In almost every single case, I was able to prove them wrong, and show them why the code was actually bad: the compiler was producing proper code, it just wasn't doing what the code's original intention was. Almost every case. Until recently.
One of the most common issues people have with compilers nowadays is pointer aliasing. It also seems to be such a commonly reported gcc bug that they had to create a specific entry for it on their bug page. Our codebase here uses a huge amount of pointer aliasing, so, naturally, I sadly had to turn the optimization off with that infamous -fno-strict-aliasing option on the command line.
Now, as you've probably guessed by now, we managed to hit an actual bug in gcc. Without further ado, here's the testcase I managed to write that shows this bug (or, if you prefer, here's the plain text version of it). This code is plagued by a bad case of pointer aliasing, as described by the inner comments. So if you want to compile it optimized, but still have it work, you need to compile it with the -O3 and -fno-strict-aliasing options. The proper result of that code is a fairly small vector, something like x = 2.857178, y = -11.428467, z = 1.904770. With a buggy gcc, or if you forget -fno-strict-aliasing, you get values in the range of hundreds, such as x = -247.555176, y = 301.496094, z = -23.009033.
Looking at the code, you can see that it does some pointer aliasing, transforming a single float into a vector class and hoping that, as a side effect, the next floats in memory will be affected too. That works in theory, because the floats should sit next to each other in memory, but in practice it's a violation of the strict aliasing rule: a float cannot magically become a class, or vice versa. Disclaimer: I still consider the code correct. It violates the strict aliasing rule all right, but apart from that, I think the C++ code itself is bug-free. Feel free to prove me wrong though; I don't pretend to be perfect.
Well, in any case, I'm not exactly sure which bug I've hit inside gcc, because only the y and z values are bogus; x is correct. As if gcc understood that casting the first float of each row into a vector touches that float, but not the ones behind it. And this happens whether you use -fno-strict-aliasing or not! Previous (and fixed, using a workaround) versions of gcc weren't even trying to optimize this: they simply output the proper values, whatever happened. So to me, this is an actual optimization that was introduced into the gcc mainline, but was somehow buggy, as it wasn't doing what it was told.
I've already built myself a matrix of gcc versions in order to pinpoint where this code fails within the vanilla gcc mainline, and it seems the buggy optimization was introduced in gcc 4.3.0, and removed in gcc 4.3.3. gcc 4.4.0 reintroduces that optimization, but this time it works fine. So technically, I'm talking about a closed, fixed problem.
But feel free to take this code and test it on the various gcc flavors around, to see whether they're affected, and report it there. I had the very unpleasant surprise of discovering that, although it's officially a 4.1.2, the official Redhat gcc has this problem, even though the vanilla version doesn't: it means they took the buggy optimization from 4.3.0+ and put it into their 4.1.2, without following up afterward to discover that it was not a good one! And the very pleasant surprise of seeing that Debian's 4.3.2 doesn't, even though the vanilla version does: Debian found out about the buggy optimization, but instead of bumping their gcc revision to 4.3.3 or 4.3.4, potentially picking up more bugs along the way, they decided to backport the fixes.
I mean, what the hell?! What should I trust? What should I believe? Is Redhat doing it again, releasing a compiler that produces code so bad that a linux kernel compiled with it won't even boot, all for the sake of fast code? Is code speed more important than code stability and correctness? Or were they just not paying attention to the gcc releases and known bugs? In that case, what makes Redhat, an organization that takes money from customers, less focused on bugs than others, such as Debian? Okay, I won't say that one should blindly trust Debian either. But still, what and who should I trust then? Should I just always use a vanilla gcc from upstream? Is DJB right after all, in forbidding altered binary distributions of his work?
I think I still won't be quick to say "it's the compiler's fault", but from now on, I'll take it with a grain of salt. And still blame Redhat for releasing buggy compilers. I still haven't properly digested gcc 2.96.
Using the gcc git repository and the magical git bisect command, I finally found that the issue I was facing was fixed in gcc svn revision r142040. What's really disturbing is that this commit comes with a few new torture testcases. I'm now going to try running all of these torture testcases against the Redhat gcc, and see how many of them come out defunct...
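For those who haven't used git bisect, the workflow can be demonstrated on a throwaway toy repository (everything here is mine, not the gcc tree; against gcc, the run command would instead be a script that builds the compiler and exits non-zero when the aliasing testcase miscompiles):

```shell
# Build a demo repo with 10 commits, where commit 7 introduces a "bug"
# (the word bad in data.txt), then let git bisect find it automatically.
set -e
rm -rf /tmp/bisect-demo && mkdir /tmp/bisect-demo && cd /tmp/bisect-demo
git init -q
git config user.email demo@example.com
git config user.name demo
for i in 1 2 3 4 5 6 7 8 9 10; do
    if [ "$i" -ge 7 ]; then echo "bad $i" > data.txt
    else echo "good $i" > data.txt; fi
    git add data.txt && git commit -q -m "commit $i"
done

# bad = the tip (commit 10), good = the first commit. bisect run calls
# the given command at each step and treats a non-zero exit as "bad",
# binary-searching the history in O(log n) steps.
git bisect start HEAD HEAD~9
git bisect run grep -q good data.txt

first_bad=$(git log -1 --format=%s refs/bisect/bad)
echo "first bad: $first_bad"       # first bad: commit 7
git bisect reset
```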