Ordering randomness

One of the things game developers hate most is ‘random bugs’. I’d love to get a euro for every time I received a bug report, asked how to reproduce it, and heard: ‘I don’t know, it’s random’. Thing is, in the majority of cases that simply isn’t true. There are no truly random bugs; they are caused by a certain sequence of events. Sure, the sequence may be incredibly rare and very unlikely, but usually it’s not random. If you can identify the sequence, that’s half of the work done already. That’s what distinguishes great testers from average or good ones: they are able to catch all those subtle details and hints and include them in the report. After a while, you learn who on your QA team has a gift for that. In many cases, however, you have to do most of the investigation yourself.

Unless you’re very lucky, it’ll take many tries to notice the patterns that lead to a problem. At first, when I don’t have any clues, I like to simply check whether I can reproduce it at all (while still paying attention to the basic conditions). If it’s a relatively isolated case that doesn’t require playing a level for 30 minutes, it shouldn’t take that much time. Then it’s time to form some hypotheses about the possible root of the bug. Where were we, what happened before, who was nearby, what state (weapon, armor, movement, inventory, etc.) was our character in, what were the quest states: the more, the better. Now, the important part is to make sure that our assumed sequence of events really does lead to the bug. Otherwise, it’d be tough to ensure that we really fixed it.

Say we have a crash that occurs roughly once every 20 tries. We think we know why it happens, we fix it or at least fix something… Now, how do we make sure the bug’s not there anymore? Of course, we can try to play it 100 times, but even if it doesn’t happen now, that doesn’t really prove anything. A much better way is to create a situation where the bug manifests itself every single time. Sounds difficult? It doesn’t have to be. Remember, we have control over the source code!

A recent example: I found a rare crash during some gunfights. It’d happen maybe once every 20-30 runs. After a while, I formed a hypothesis that it may occur when an NPC attacks in the same or the next frame as the player. Next step: analyze the complete code path in that case. It should start at the moment you suspect is the cause and end at the place where the bug shows up. On paper, I found a path (a sequence of events) that could indeed lead to a crash (it included some weird conditions, like a quite unlikely order of sending messages, but it was possible). How can we verify that we’re right? Well, we can run the game many times, trying to attack exactly when the NPC does. Not exactly a piece of cake at 30+ frames per second. Luckily, there’s an easier way.

Let’s repeat: we have total control over the source code, so we can prepare whatever scenario we can imagine. In my case it was trivial: if the NPC attacked, it’d set a flag (global, of course; if you hack something, do it properly), and the player’s Update function in the same frame would trigger a counterattack if this flag was set. Compile, load the level, engage, bam - it crashes (actually, at first it didn’t, but that’s because there was one frame of delay; after forcing both attacks into the same frame, it did). At this point, we’re home. Not only is our bug reproducible in 100% of runs, but when we fix it, we can actually verify that the fix works with absolute certainty. I’m simplifying things a little here and making it sound easier than it really is; usually it’s still a long and tedious process (in the case described it took me several hours just to notice the important patterns and then analyze the possible paths).
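
To make the idea a bit more concrete, here is a rough C++ sketch of that kind of hack. It is not the actual game code: Npc, Player, RunFrame and g_npcAttackedThisFrame are all made-up names, and the "attack" is reduced to a counter. The only point is to show how a throwaway global flag can force both attacks into the same frame.

```cpp
// Hypothetical debug-only hack: a global flag raised by the NPC when it attacks.
static bool g_npcAttackedThisFrame = false;

struct Npc {
    bool wantsToAttack = false;
    int attacksMade = 0;

    void Update() {
        if (wantsToAttack) {
            ++attacksMade;
            g_npcAttackedThisFrame = true;  // repro hack: tell the player about it
            wantsToAttack = false;
        }
    }
};

struct Player {
    int attacksMade = 0;

    void Attack() { ++attacksMade; }

    void Update() {
        // Repro hack: counterattack in the very same frame as the NPC.
        if (g_npcAttackedThisFrame) {
            Attack();
        }
    }
};

// Runs one simulated frame; returns true if both sides attacked in it.
bool RunFrame(Npc& npc, Player& player) {
    g_npcAttackedThisFrame = false;
    const int npcBefore = npc.attacksMade;
    const int playerBefore = player.attacksMade;

    // In this sketch the NPC has to update first. With the opposite order
    // the player would only see the flag one frame later.
    npc.Update();
    player.Update();

    return npc.attacksMade > npcBefore && player.attacksMade > playerBefore;
}
```

Setting npc.wantsToAttack and calling RunFrame once is then enough to get the "both attack in one frame" situation on demand, instead of trying to hit it by hand.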

Among the worst examples of ‘random’ bugs are all kinds of thread-related problems, especially data races. I described one such problem before. That was a situation where I actually fixed the problem by analyzing the different possible paths of execution rather than reproducing it, because it was so rare that it’d take hours every time. That’s also how Relacy works. It basically simulates various execution scenarios (possible thread switch points), introducing order to the seemingly random world of multi-threading.
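
The general idea of walking through every thread switch point can be shown with a toy example. To be clear, this is not Relacy and does not use its API; SimThread, RunStep, Explore and FindLostUpdates are names invented for this sketch. Two simulated "threads" each do a non-atomic increment (load, then store), and the code simply tries every possible interleaving of those steps and records the schedules that lose an update.

```cpp
#include <string>
#include <vector>

// Each simulated thread performs: tmp = counter; counter = tmp + 1;
struct SimThread {
    int step = 0;  // 0 = about to load, 1 = about to store, 2 = done
    int tmp = 0;   // thread-local copy of the counter
};

// Executes the next step of thread t against the shared counter.
void RunStep(SimThread& t, int& counter) {
    if (t.step == 0) {
        t.tmp = counter;      // load
    } else {
        counter = t.tmp + 1;  // store
    }
    ++t.step;
}

// Recursively tries every possible switch point between threads A and B.
// Schedules (e.g. "ABAB") that end with counter != 2 are recorded as failing.
void Explore(SimThread a, SimThread b, int counter,
             const std::string& schedule, std::vector<std::string>& failing) {
    if (a.step == 2 && b.step == 2) {
        if (counter != 2) {
            failing.push_back(schedule);
        }
        return;
    }
    if (a.step < 2) {
        SimThread a2 = a;
        int c = counter;
        RunStep(a2, c);
        Explore(a2, b, c, schedule + 'A', failing);
    }
    if (b.step < 2) {
        SimThread b2 = b;
        int c = counter;
        RunStep(b2, c);
        Explore(a, b2, c, schedule + 'B', failing);
    }
}

// Returns every interleaving in which one of the two increments is lost.
std::vector<std::string> FindLostUpdates() {
    std::vector<std::string> failing;
    Explore(SimThread{}, SimThread{}, 0, "", failing);
    return failing;
}
```

Of the six possible schedules here, only "AABB" and "BBAA" produce the correct result; the other four lose an increment. A failing schedule is exactly the kind of deterministic sequence of events that is so hard to hit by running real threads over and over.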

To finish, let me tell you the story of a true (pseudo)random bug that did happen to me, though. It was the final months of development and we got this very peculiar crash report. The log file contained a complete callstack, but it looked, well, impossible. The code was crashing inside some sound system function, but all the functions before it were related to rendering. There was no possibility for such a sequence of calls to occur (or so I thought). The more I checked it, the more obvious it was that this crash just couldn’t happen. In despair, I started to examine the latest commits. A few days before, I had made some modifications to our copy-protection wrapper that involved copying buffers of random data back and forth. After some more debugging, it turned out that one of the buffers was too small, so I was overwriting memory. Among other things, the renderer’s virtual table had been trashed with random data. However, this random number happened to be exactly the same as the address of a sound system function (plus, only the first function had been ‘substituted’; the rest of the vtable was OK). The crash occurred only because those routines took a different number of arguments, which was actually a fortunate coincidence. Otherwise, it’d probably have taken me much, much longer to trace it (most likely we’d only have got a ‘something broken with renderer’ report).
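
For what it’s worth, this class of overwrite is usually cheap to guard against at the copy site. The helper below is only an illustration, not the fix that went into the project; CheckedCopy is a hypothetical name, and it assumes the caller knows the real size of both buffers.

```cpp
#include <cstddef>
#include <cstring>

// Copies srcSize bytes from src into dst only if they fit into dstSize bytes.
// Returns false and leaves dst untouched instead of writing past its end.
bool CheckedCopy(void* dst, std::size_t dstSize,
                 const void* src, std::size_t srcSize) {
    if (srcSize > dstSize) {
        return false;  // would overrun dst and trash whatever lives after it
    }
    std::memcpy(dst, src, srcSize);
    return true;
}
```

A failed copy that is reported (or asserted on) at the call site points straight at the undersized buffer, rather than at whatever unrelated object happened to sit next to it in memory.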

Old comments

shash 2009-09-08 09:11:55

Nice post. One strange bug that happened with older compilers (VC2k3, at least) was that doing partial builds when only a few headers changed led to the compiler not updating dependencies correctly, resulting in an incorrect link. It was easily noticeable: once you stepped through the code, it jumped to locations it never should have. It hasn’t happened on VC2k5 or VC2k8, though.

Sobol 2009-09-05 17:05:59

My biggest random bug hunt was connected to a network game I developed some time ago. Some crashes were really so “random-like” that the whole team was sitting and staring at the screen with manga-like stupid facial expressions :P Networking problems are somewhat similar to MT problems, although they are quite specific. If you cannot find a random bug, just call it a random feature :P
