Friday, March 13, 2015

Papering over problems

Programming is all about making decisions. Making good decisions is the "art" of programming. I'll write a lot more about "programming as art" in the future, and also on the decisions that programmers have to make, but today I'm going to talk specifically about a particular choice: what to do when you encounter a crash bug.

Over my career, I've seen a lot of code bases, and worked with a lot of teams. All too often, programmers make check-ins with descriptions that say something like "fixed bug" (more on better check in comments in the future too!), but when you take a look at their actual change, the "fix" is just to check some pointer for null, and skip the operation in that case. This behavior is so pervasive that many programmers do it without realizing that it's a problem.

In the words of Will McAvoy: "The first step in solving any problem is recognizing there is one."

The problem, of course, is that you need to consider what caused the pointer to be null in the first place. If the design of the system is loosely bound (i.e. the pointer is sometimes expected to be null), then perhaps checking for null is the appropriate fix. But if it's not, then I contend that you haven't "fixed" anything - rather, you've papered over a problem.

There are consequences to this behavior: all of a sudden the program *appears* to work. But perhaps it malfunctions in some less obvious, but potentially more serious way downstream. Worse yet, the program might no longer crash, but once in a blue moon it might exhibit strange behavior that nobody can explain. And good luck tracking that down.

Blindly checking for null convolutes the code. Someone else working in the same spot might see your null check, and assume that, by design, the pointer can be null. So they add null checks of their own, and before too long it's not clear whether *any* pointers can be relied upon.

Crashes are your friend. They're an opportunity to get to the root cause of an issue - perhaps even an incredibly rare issue. Crashes are like can cower superstitiously, or, like Newton, you can use them to understand the universe on a deeper level.

A pointer being null might be caused by:
- an initialization issue
- a lifetime issue (relationships between objects not being enforced)
- someone inappropriately clearing the pointer
- a memory stomp
- or something even more exotic

The point is that until you understand what's really going on, you haven't diagnosed the crash at all! The appropriate action is to diagnose the upstream cause of the issue, not to put in some downstream hack to paper over the problem.

Papering over the problem is a deal with the devil. Sure, you're a "hero" because of your quick fix...but usually the cost is that someone else has to spend much more time investigating the real cause when it comes back to bite. And nine times out of ten, the real problem could easily be discovered, but laziness and/or immediacy steals that opportunity.

Don't get me wrong, system programmers are The Night Watch, and there's nothing we love more than rolling up our sleeves and spending several uninterrupted hours sleuthing into a seemingly impossible bug, but our time is better spent diagnosing real issues, as opposed to issues that are easily avoidable with better technique.

At Disbelief, we call handling failures in-line "soft-fails". And we have a rule that says you don't use them, unless you understand why the failure can occur, and further, you have to put a comment explaining the reason at the point where you add the soft-fail. This covers all the bases: it prevents people from tossing in soft-fails because they're lazy; it demands that people understand why the failure can occur; and it documents the case for future observers.

At this point, I think I should stop and talk about technique vs. practicality. There are more factors at play in development than just software engineering. If the null pointer crash is stalling your team, and you can be reasonably sure that there aren't data-destructive consequences to patching it, then the right tradeoff might be to patch it temporarily. Equally, if your project could get cancelled because you need to hit a milestone, then practicality has to rule the day. And a code base that's overly paranoid about everything can cripple development because it trips up on situations that are, in fact, expected.

But don't use those factors as an excuse not to spend some time investigating - many problems can be fixed with just a tiny bit more effort. And my plea to all the programmers who might be reading this is to spend those extra few minutes of effort and find the real problem. And to be mindful of the consequences and the cost of adding a quick fix.

I'll finish by saying that I used null pointers as an example during this post, but the same principle applies to other varieties of soft-fail. If speed is below zero, don't just compare it to zero, think about whether speed is allowed to be below zero in your design. Just the other day, I saw an experienced programmer writing code to squish NaNs, but did that person really take the time to understand why NaNs were present in the first place?

If you want to get to the point where you're solving the deepest, most mystifying crashes, then doing due diligence whenever you encounter a crash is a great way to start.


  1. Not quite the same, obviously, but recently at work there was a discussion about visual scripting methods, and we were arguing about firing actions in parallel (ie, all actions emanate from a single event) versus firing actions as a chain, where one successful action feeds into the next action on success (in these cases the timing wasn’t super important).

    The argument against doing it my way was, “Well, if one thing in that chain fails, then the entire chain is going to fail.” And my response was, “Yes! Exactly!”

  2. Yeah, it's very much related. It's much better for a program to fail obviously than to fail subtly.