Steve Ellmore: What salmon know about software engineering

Last time, I wrote about papering over problems, with a focus on a specific type of mistake that I frequently see programmers making. The whole idea was to find the root cause, rather than adding "fault tolerance" at the site of the crash.

Now, I'd like to generalize that thought to a philosophy for preventing some types of failure from ever occurring.

When a program malfunctions, it's often expensive and time consuming to diagnose the issue. It involves trying to reproduce the incorrect behavior: getting the timing and user-actions right, and doing them in the right order, all-the-while sleuthing what chain of events led up to the problem occurring. Often this involves figuring out which steps are irrelevant, and which matter; inspecting memory, and looking for patterns. Memory stomps and threading issues are notoriously difficult to get reliable repro steps for.

In some senses, from the programmer's perspective, it's worse to have a program that mostly works than one that clearly does not work. I often joke about the "programmer's curse", where bugs are impossible to repro when a programmer is watching.

When I encounter a crash, or a malfunction at runtime, I generally adopt the attitude that my program is missing an 'assert'. This isn't always the case, but in most cases it is. My thinking is that if the program had adequate asserts in place, then one of those should have tripped long before the fault or malfunction occurred.

More specifically, I look at each problem as if it's potentially a cascade of missing asserts.

After debugging, when I've finally got a picture of the critical chain of events leading to the problem, I start at the most downstream place where I could have detected the problem, and add an assertion there. To be clear, I add an assertion close to the symptoms (not necessarily anywhere near the root cause). Ideally at this time, I repro the crash again and make sure my assertion catches it...after all it's possible to make mistakes even when writing assertions!

After that's in place, and verified to work, I move a little upstream, to a place where I could have caught the problem slightly earlier, and add an assertion there. And I repeat this process all the way upstream to the root of the problem. Of course, it's important to move gradually upstream because if you immediately put an assertion at the root cause, you effectively cut off your ability to repro the downstream failures.

Most modern game engines are heavily data driven, so sometimes, the origin of the fault is in data. At that point, I jump into the data conditioning pipeline to make sure that it detects the problematic data and reports an error. Sometimes it's possible to take it even further, and modify the editor to prevent problematic data from being input in the first place.

At the end of the exercise, I go ahead and fix the problem. During this process, I've potentially altered dozens of places in the code, all of which are now slightly more diligent about preventing mistakes from slipping through. This expounds on an idea that I brought up last time: a crash (or any type of malfunction) is an opportunity to better understand the code, and beyond that, to make it more robust.

This theme of methodical and incremental steps to make the code better and more robust will recur as my blog develops, but I hope this entry inspires you to go beyond fixing bugs and spend some time on better bug detection.

Steve Ellmore

Friday, March 27, 2015

What salmon know about software engineering

No comments :

Post a Comment