Friday, March 27, 2015

What salmon know about software engineering

Last time, I wrote about papering over problems, with a focus on a specific type of mistake that I frequently see programmers making. The whole idea was to find the root cause, rather than adding "fault tolerance" at the site of the crash.

Now, I'd like to generalize that thought to a philosophy for preventing some types of failure from ever occurring.

When a program malfunctions, it's often expensive and time-consuming to diagnose the issue. It involves trying to reproduce the incorrect behavior: getting the timing and user actions right, and doing them in the right order, all the while sleuthing out which chain of events led up to the problem. Often this means figuring out which steps are irrelevant and which matter; inspecting memory, and looking for patterns. Memory stomps and threading issues are notoriously difficult to get reliable repro steps for.

In some sense, from the programmer's perspective, it's worse to have a program that mostly works than one that clearly doesn't. I often joke about the "programmer's curse": bugs are impossible to repro whenever a programmer is watching.

When I encounter a crash, or a malfunction at runtime, I generally adopt the attitude that my program is missing an 'assert'. That isn't always true, but it usually is. My thinking is that if the program had adequate asserts in place, one of them should have tripped long before the fault or malfunction occurred.
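To make that concrete, here's a minimal C++ sketch (the particle system and all of its names are hypothetical, purely for illustration). Without the assert, an oversized spawn request corrupts state and crashes much later, somewhere deep in the renderer; with it, the program trips right where the budget is violated:

```cpp
#include <cassert>
#include <cstddef>

struct ParticleSystem {
    static const std::size_t kMaxParticles = 1024;
    std::size_t count = 0;

    void Spawn(std::size_t n) {
        // Trip here, at the moment the invariant is violated, rather
        // than letting a later overrun crash with no obvious connection
        // to this call.
        assert(count + n <= kMaxParticles && "Spawn exceeds the particle budget");
        count += n;
    }
};
```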

More specifically, I look at each problem as if it's potentially a cascade of missing asserts.

After debugging, when I've finally got a picture of the critical chain of events leading to the problem, I start at the most downstream place where I could have detected the problem, and add an assertion there. To be clear, I add an assertion close to the symptoms (not necessarily anywhere near the root cause). Ideally, I then repro the crash again and make sure my assertion catches it...after all, it's possible to make mistakes even when writing assertions!

After that's in place, and verified to work, I move a little upstream, to a place where I could have caught the problem slightly earlier, and add an assertion there. And I repeat this process all the way upstream to the root of the problem. Of course, it's important to move gradually upstream because if you immediately put an assertion at the root cause, you effectively cut off your ability to repro the downstream failures.
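Here's a sketch of what that cascade might look like, using a made-up scenario: a crash caused by drawing a mesh whose vertex buffer was already released. The numbered comments follow the process above, from the most downstream catch back up to the root:

```cpp
#include <cassert>

struct VertexBuffer { bool alive = true; };
struct Mesh { VertexBuffer* vb = nullptr; };

// 1) Most downstream: right where the crash manifested.
void DrawMesh(const Mesh& m) {
    assert(m.vb && m.vb->alive && "drawing a mesh with a dead vertex buffer");
    // ...submit the draw call...
}

// 2) A little upstream: where the mesh was queued for rendering.
void QueueForRender(const Mesh& m) {
    assert(m.vb && m.vb->alive && "queuing a mesh with a dead vertex buffer");
    // ...add to the render queue...
}

// 3) At the root: the code that releases buffers.
void ReleaseBuffer(VertexBuffer& vb) {
    assert(vb.alive && "releasing a vertex buffer twice");
    vb.alive = false;
}
```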

Most modern game engines are heavily data-driven, so sometimes the origin of the fault is in data. At that point, I jump into the data conditioning pipeline to make sure that it detects the problematic data and reports an error. Sometimes it's possible to take it even further, and modify the editor to prevent problematic data from being input in the first place.
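As a sketch of what the data-side check might look like (the asset format and names here are invented), the conditioning step refuses to cook data that would violate the runtime's assumptions, and says exactly why:

```cpp
#include <cmath>
#include <cstdio>

// Hypothetical cooked-asset record.
struct SpeedCurveKey {
    float time;
    float speed;
};

// Returns false (and reports the offending asset) rather than letting
// bad data slip into the build and trip an assert at runtime.
bool ValidateSpeedCurve(const SpeedCurveKey* keys, int count, const char* assetName) {
    for (int i = 0; i < count; ++i) {
        if (!std::isfinite(keys[i].speed) || keys[i].speed < 0.0f) {
            std::fprintf(stderr, "ERROR: %s: key %d has invalid speed %f "
                         "(must be finite and >= 0)\n", assetName, i, keys[i].speed);
            return false;
        }
        if (i > 0 && keys[i].time <= keys[i - 1].time) {
            std::fprintf(stderr, "ERROR: %s: key times must be strictly "
                         "increasing\n", assetName);
            return false;
        }
    }
    return true;
}
```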

At the end of the exercise, I go ahead and fix the problem. During this process, I've potentially altered dozens of places in the code, all of which are now slightly more diligent about preventing mistakes from slipping through. This expands on an idea that I brought up last time: a crash (or any type of malfunction) is an opportunity to better understand the code, and beyond that, to make it more robust.

This theme of methodical and incremental steps to make the code better and more robust will recur as my blog develops, but I hope this entry inspires you to go beyond fixing bugs and spend some time on better bug detection.

Friday, March 13, 2015

Papering over problems

Programming is all about making decisions. Making good decisions is the "art" of programming. I'll write a lot more about "programming as art" in the future, and also on the decisions that programmers have to make, but today I'm going to talk specifically about a particular choice: what to do when you encounter a crash bug.

Over my career, I've seen a lot of code bases, and worked with a lot of teams. All too often, programmers make check-ins with descriptions that say something like "fixed bug" (more on better check-in comments in the future too!), but when you look at the actual change, the "fix" is just to check some pointer for null and skip the operation in that case. This behavior is so pervasive that many programmers do it without realizing it's a problem.
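The pattern tends to look something like this contrived sketch (all names made up):

```cpp
struct Vec3 { float x = 0, y = 0, z = 0; };
struct Target { Vec3 position; };

struct Missile {
    Target* target = nullptr;
    void SteerToward(const Vec3& /*p*/) { /* ... */ }
};

// The classic "fixed bug" check-in: wrap the crash site in a null
// check and skip the work. The crash is gone, but the question of
// *why* 'target' was null goes unanswered -- the missile now silently
// flies in a straight line instead.
void UpdateHoming(Missile& m) {
    if (m.target) {  // <-- the entire "fix"
        m.SteerToward(m.target->position);
    }
}
```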

In the words of Will McAvoy: "The first step in solving any problem is recognizing there is one."

The problem, of course, is that you need to consider what caused the pointer to be null in the first place. If the design of the system is loosely bound (i.e. the pointer is sometimes expected to be null), then perhaps checking for null is the appropriate fix. But if it's not, then I contend that you haven't "fixed" anything - rather, you've papered over a problem.
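To illustrate the distinction (again with made-up names): one pointer is optional by design, the other is guaranteed by an invariant, and they deserve opposite treatment:

```cpp
#include <cassert>

struct Animation { /* ... */ };

struct Character {
    // Loosely bound by design: a character may legitimately have no
    // override animation, so checking for null is the correct handling.
    Animation* overrideAnim = nullptr;

    // Tightly bound by design: every character is built with an idle
    // animation, so null here means a bug upstream, not a case to handle.
    Animation* idleAnim = nullptr;
};

void UpdateAnimation(Character& c) {
    if (c.overrideAnim) {
        // ...optional path: absence is expected...
    }

    assert(c.idleAnim && "character was built without an idle animation");
    // ...unconditional path: the invariant guarantees this pointer...
}
```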

There are consequences to this behavior: all of a sudden the program *appears* to work. But perhaps it malfunctions in some less obvious, but potentially more serious way downstream. Worse yet, the program might no longer crash, but once in a blue moon it might exhibit strange behavior that nobody can explain. And good luck tracking that down.

Blindly checking for null convolutes the code. Someone else working in the same spot might see your null check, and assume that, by design, the pointer can be null. So they add null checks of their own, and before too long it's not clear whether *any* pointers can be relied upon.

Crashes are your friend. They're an opportunity to get to the root cause of an issue - perhaps even an incredibly rare issue. Crashes are like comets...you can cower superstitiously, or, like Newton, you can use them to understand the universe on a deeper level.

A pointer being null might be caused by:
- an initialization issue
- a lifetime issue (relationships between objects not being enforced)
- someone inappropriately clearing the pointer
- a memory stomp
- or something even more exotic

The point is that until you understand what's really going on, you haven't diagnosed the crash at all! The appropriate action is to diagnose the upstream cause of the issue, not to put in some downstream hack to paper over the problem.

Papering over the problem is a deal with the devil. Sure, you're a "hero" because of your quick fix...but usually the cost is that someone else has to spend much more time investigating the real cause when it comes back to bite. And nine times out of ten, the real problem could easily have been discovered, but laziness and/or expediency steals that opportunity.

Don't get me wrong: system programmers are The Night Watch, and there's nothing we love more than rolling up our sleeves and spending several uninterrupted hours sleuthing out a seemingly impossible bug. But our time is better spent diagnosing real issues, as opposed to issues that are easily avoidable with better technique.

At Disbelief, we call handling failures in-line "soft-fails". And we have a rule: you don't use them unless you understand why the failure can occur, and even then you must put a comment explaining the reason at the point where you add the soft-fail. This covers all the bases: it prevents people from tossing in soft-fails because they're lazy; it demands that people understand why the failure can occur; and it documents the case for future observers.
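Here's a sketch of what a rule-abiding soft-fail might look like. The audio scenario is invented, but it's the shape we're after: the failure mode is understood, and the comment records why it can legitimately happen:

```cpp
struct AudioBuffer { /* ... */ };
struct Voice {
    void Submit(const AudioBuffer& /*b*/) { /* ... */ }
};

void SubmitAudio(Voice* voice, const AudioBuffer& buffer) {
    // SOFT-FAIL: on PC the audio device can be yanked mid-frame (e.g.
    // a USB headset is unplugged). The OS invalidates the voice before
    // we receive the notification, so a null voice here is an understood,
    // expected race. We drop this frame's submit; the device-recovery
    // path re-creates the voice next frame.
    if (!voice) {
        return;
    }
    voice->Submit(buffer);
}
```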

At this point, I think I should stop and talk about technique vs. practicality. There are more factors at play in development than just software engineering. If the null pointer crash is stalling your team, and you can be reasonably sure that there aren't data-destructive consequences to patching it, then the right tradeoff might be to patch it temporarily. Equally, if your project could get cancelled because you need to hit a milestone, then practicality has to rule the day. And a code base that's overly paranoid about everything can cripple development because it trips up on situations that are, in fact, expected.

But don't use those factors as an excuse not to spend some time investigating - many problems can be fixed with just a tiny bit more effort. And my plea to all the programmers who might be reading this is to spend those extra few minutes of effort and find the real problem. And to be mindful of the consequences and the cost of adding a quick fix.

I'll finish by saying that I used null pointers as the example throughout this post, but the same principle applies to other varieties of soft-fail. If speed is below zero, don't just clamp it to zero; think about whether speed is allowed to be below zero in your design. Just the other day, I saw an experienced programmer writing code to squish NaNs, but did that person really take the time to understand why NaNs were present in the first place?
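As a parting sketch (hypothetical names again), here's the difference between squishing the symptom and surfacing the problem:

```cpp
#include <cassert>
#include <cmath>

// Squishing the symptom: the NaN disappears, but whatever produced it
// (a divide by zero? uninitialized memory? a stomp?) is still there.
float SquishedSpeed(float speed) {
    return std::isnan(speed) ? 0.0f : speed;
}

// Surfacing the problem: trip the moment the bad value appears, find
// the producer, and only then decide whether clamping is ever legitimate.
float CheckedSpeed(float speed) {
    assert(!std::isnan(speed) && "NaN speed: find the producer, don't squish it");
    assert(speed >= 0.0f && "negative speed: does the design allow this?");
    return speed;
}
```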

If you want to get to the point where you're solving the deepest, most mystifying crashes, then doing due diligence whenever you encounter a crash is a great way to start.