Friday, November 23, 2007

Worst Bug of the Year

I recently encountered a bug during Democracy 2's first beta test that drove me mental. It took me two and a half days to fix. Here are the details...

The game would crash on shutdown. Not in debug, and not in release mode if run from within Visual Studio. But if I just clicked the exe and ran it, I'd get an exit crash.
Everything everyone knows about release-mode-only bugs says this is a variable that I hadn't given a default value (debug will do that for you, and even release run from Visual Studio does it).
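
For anyone who hasn't met this class of bug, here is a minimal sketch of the usual cure: give every member a default value so nothing depends on whatever happens to be sitting in memory. The DilemmaState struct and its fields are hypothetical names made up for illustration, not the actual Democracy 2 code.

```cpp
#include <cassert>
#include <string>

// Hypothetical struct for illustration only; not the real game data.
struct DilemmaState {
    int         timesShown = 0;     // default member initializers: every
    bool        resolved   = false; // instance starts from known values,
    float       weight     = 1.0f;  // in debug and release alike
    std::string name;               // std::string default-constructs to ""
};

// Sanity check that a freshly constructed state has the defaults above.
void CheckDefaults() {
    DilemmaState s;
    assert(s.timesShown == 0);
    assert(!s.resolved);
    assert(s.weight == 1.0f);
    assert(s.name.empty());
}
```

With the initializers in place there is no code path that can leave a member unset, so this particular source of "works in debug, dies in release" goes away.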

I spent at least a day looking for such a variable, and found none. Bah...

So I try all sorts of tricks to see WHERE it crashes, including running 'release with symbols', which still crashes but lets me see where. It's in the code where a dilemma is being freed. Aha... I must be corrupting the heap somewhere during the game, right?

But no...
Not only is ALL the dilemma data fine, but if I run heapCheck() JUST before the memory is freed (in the destructor), it says the heap is fine too. Everything seems ok and the memory seems fine right up to the picosecond where it clearly isn't and the game crashes... On top of that, it happens in the 34th dilemma. If I remove #34, it happens in the next one, so the data must be fine. How weird...

Day 2. By coincidence, I realise it runs fine on my laptop. That's XP, not Vista, but I have XP testers with the bug. How strange...

Day 2.5... I am experimenting and realise to my amazement that if I exit using the quit button, it's fine, but not if I use Alt+F4 or the Windows X button. Why could that be? I step through the code...

Holy cow, in the quit button stuff I release all my game data and then do this:

PostQuitMessage(0);

Nothing wrong with that. But in the code for WM_DESTROY I do this:

PostQuitMessage(0);
GetGame()->ReleaseData();

Holy crap. I see the problem and it's fixed. On a single-core PC I think this is fine, but on multi-core, it looks like one core does the PostQuitMessage() while the other starts freeing my data. The PostQuitMessage() stuff (which is part of Windows) finishes and terminates the process while the other core is still deleting the dilemmas and, in code terms, pulls the rug from under the other core's memory.

Somehow in debug, this isn't allowed to happen.
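
Here is a minimal sketch of the reordering, so that freeing the data comes first and posting the quit message is the very last thing the handler does. To keep it compilable anywhere, PostQuitMessageStub() and the Game class below are hypothetical stand-ins for the real Win32 PostQuitMessage() and the real game class, and the "already released" guard is an extra assumption, not something from the original code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records the order of calls so the sequence can be inspected.
std::vector<std::string> g_log;

// Hypothetical stand-in for the Win32 PostQuitMessage().
void PostQuitMessageStub(int /*exitCode*/) {
    g_log.push_back("PostQuitMessage");
}

// Hypothetical stand-in for the game class behind GetGame().
class Game {
public:
    void ReleaseData() {
        if (released_) return;      // assumption: make a second call harmless
        released_ = true;
        g_log.push_back("ReleaseData");
    }
    bool Released() const { return released_; }
private:
    bool released_ = false;
};

Game* GetGame() {
    static Game game;
    return &game;
}

// What the WM_DESTROY handler does after the fix:
// free everything first, then ask the message loop to quit.
void OnDestroy() {
    Game* game = GetGame();
    assert(game != nullptr);
    game->ReleaseData();
    PostQuitMessageStub(0);         // nothing runs after this
}
```

Calling OnDestroy() leaves g_log holding "ReleaseData" followed by "PostQuitMessage", which is the same order the quit button path already used.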

Lesson learned, kids! Always make sure you run bugger all code after PostQuitMessage()!

Now back to the fun stuff.