Thursday, May 28, 2015

The learning curve

Once upon a time in the not-too-distant past, games were small. Game engines were often crafted in a tight, consistent way that made understanding the entire code base relatively easy. It was feasible to keep up to date on "state of the art" techniques with a little bit of dedicated reading.

That time has passed. The days of being able to maintain intimate knowledge of every single system in a AAA game are long gone: games are just too big. And it's not just games that are bigger: game development itself is a much bigger field than it was 10 years ago. There's just too much to understand, and too many breakthroughs being made on a daily basis for anyone to stay on top of it all.

Probably the most useful skill in game development is being able to learn quickly.

I personally learn new things every day. It's a job requirement. When you have a bug in an unfamiliar system, your mission is to quickly get up to speed on what it's doing, what it's supposed to be doing, and why it's not doing what it's supposed to be doing. That all involves acquiring domain knowledge, and overcoming the learning curve.

It's well established that the best way to learn is by doing. But how does one learn how to learn? Well, I can tell you something that doesn't help...

All too often in practice, when people get stuck, they ask for help too easily. And just as often, their lead programmer's focus is on unblocking them, rather than on letting them learn how to unblock themselves.

Having an attentive, helpful and always-available lead is good, right? Wrong.

This behavior is bad for the individuals, it's bad for the lead, and it's bad for team dynamics: it creates an environment where only a handful of people are capable, and the rest of the team quickly falls apart when they're not around.

I'm not saying that teaching is ineffective. But if you think about it, the lesson being taught here is a reinforcement of behavior: on a meta-level, it's teaching people to be dependent, rather than self-sufficient.

And it's worse than that. A strange anomaly I've observed is that engineers seem to have a "helplessness bit". Once it's flipped, it tends to stay flipped. After they get stuck and receive help, they continue to request help on things that they could figure out for themselves, because their bit has been flipped.

In order to learn how to solve hard problems, you have to solve hard problems.

That means grinding up against them. It means dealing with frustration, and helplessness, and figuring out a way to make progress anyway. It means figuring out new techniques to unblock yourself, and on the meta-level, figuring out how to figure out new techniques.

Giving engineers an easy answer when they get stuck is actually doing them a disservice. It denies them the opportunity to develop the skills they need to unblock themselves, both now and in the future. It's giving them a fish rather than teaching them how to fish.

Of course, the constraints of a particular project might not afford the luxury of personal development. But I think this should always be a conscious choice, and that it should be made with due consideration to what you gain (immediate progress) vs. what you lose (teaching your engineers to solve hard problems by themselves).

So given a project with enough wiggle-room to "do it right", how should you conduct business? Well, here's my advice for both mentors and mentees.


Mentee:
  • Before you ask for help, spend an appropriate amount of time investigating the problem for yourself.
  • What would you do if your lead/mentor was not available? How would your lead/mentor go about solving this problem?
  • Don't be lazy. Solving problems is hard. It's always going to be easier to have someone explain something to you, but the goal is to become more self-reliant. Put in the effort.
  • Be methodical, and observant. Write down what you've tried, and fully investigate anything you don't understand.
  • If you need more knowledge, figure out where to find that knowledge. Hit the documentation. Read through the code in the functions you're calling. F11 is your friend..."step into" code and see what it does.
  • Check your assumptions, and if possible, reinforce them by using asserts. Is the refcount what you expect it to be? Who owns this object? What chain of events led us to this point?
  • When you decide to ask for help, make it a conscious choice.
  • When you do ask for help, ask for specific help and have your data ready: What does the callstack look like? What specific guidance do you need to become unstuck? Understand that your mentor's time is valuable - they probably have problems of their own to solve.
  • When you receive help, you should be looking for that "lightbulb" moment, when things click into place. At that point, thank the person who helped you, and get back to "independent" mode. Unflip the "helplessness bit".
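To make the "check your assumptions" advice concrete, here's a minimal sketch in C++ (the `Texture` type and its manually tracked refcount are hypothetical, invented for illustration):

```cpp
#include <cassert>

// Hypothetical resource with a manually tracked refcount.
struct Texture {
    int refcount = 1;
};

void releaseTexture(Texture* tex) {
    // Turn assumptions into asserts: the pointer must be valid, and the
    // refcount must be positive, or something upstream has already gone wrong.
    assert(tex != nullptr);
    assert(tex->refcount > 0);
    --tex->refcount;
}
```

The asserts cost nothing in a release build, but in a debug build they trip at the first moment your mental model and the program's actual state disagree.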

Mentor:
  • Provide context, but let your mentee solve the problems. Set expectations: make it clear that the responsibility to complete the task is on the mentee.
  • If the project affords it, plan tasks that build on your mentee's previous knowledge. Set them up for success.
  • Unless the task is critical path, give your mentee time and space to figure things out. Don't hover. But of course, don't be absent either. There's a huge difference between being "hands off" and being plain negligent!
  • When you decide to help someone, make it a conscious choice, not something you do by default. Weigh the consequences.
  • When someone asks for help, require that they've done their homework first. If they haven't, ask them to do that before you provide help.
  • Provide guidance by asking questions rather than giving answers. "How do you know that your assumptions are correct?", "How can you go about figuring that out?", etc.
  • When you unblock someone, don't solve the entire problem for them. Give them what they need to be unstuck and then let them figure out the rest.

This arrangement is a contract, and it's important for both parties to understand that this is how the relationship works so both sides know what to expect. As a mentee, this isn't the easiest path. But the world needs more self-reliant programmers. And smooth seas do not make skillful sailors.

Friday, March 27, 2015

What salmon know about software engineering

Last time, I wrote about papering over problems, with a focus on a specific type of mistake that I frequently see programmers making. The whole idea was to find the root cause, rather than adding "fault tolerance" at the site of the crash.

Now, I'd like to generalize that thought to a philosophy for preventing some types of failure from ever occurring.

When a program malfunctions, it's often expensive and time consuming to diagnose the issue. It involves trying to reproduce the incorrect behavior: getting the timing and user actions right, and doing them in the right order, all the while sleuthing out the chain of events that led up to the problem. Often this involves figuring out which steps are irrelevant and which matter; inspecting memory; and looking for patterns. Memory stomps and threading issues are notoriously difficult to get reliable repro steps for.

In some senses, from the programmer's perspective, it's worse to have a program that mostly works than one that clearly does not work. I often joke about the "programmer's curse", where bugs are impossible to repro when a programmer is watching.

When I encounter a crash, or a malfunction at runtime, I generally adopt the attitude that my program is missing an 'assert'. This isn't always the case, but in most cases it is. My thinking is that if the program had adequate asserts in place, then one of those should have tripped long before the fault or malfunction occurred.

More specifically, I look at each problem as if it's potentially a cascade of missing asserts.

After debugging, when I've finally got a picture of the critical chain of events leading to the problem, I start at the most downstream place where I could have detected the problem, and add an assertion there. To be clear, I add an assertion close to the symptoms (not necessarily anywhere near the root cause). Ideally at this time, I repro the crash again and make sure my assertion catches it...after all it's possible to make mistakes even when writing assertions!

After that's in place, and verified to work, I move a little upstream, to a place where I could have caught the problem slightly earlier, and add an assertion there. And I repeat this process all the way upstream to the root of the problem. Of course, it's important to move gradually upstream because if you immediately put an assertion at the root cause, you effectively cut off your ability to repro the downstream failures.
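As a sketch of what that cascade can look like (the entity pipeline here is entirely hypothetical): the render-time assert sits closest to the symptom and gets added first, and each layer upstream catches the problem a little earlier.

```cpp
#include <cassert>
#include <string>

struct Entity {
    std::string name;
    bool initialized = false;
};

// Most upstream: catch bad data at the source.
Entity loadEntity(const std::string& name) {
    assert(!name.empty() && "entity name missing from source data");
    Entity e;
    e.name = name;
    e.initialized = true;
    return e;
}

// Midstream: catch the problem earlier than the crash site.
void updateEntity(const Entity& e) {
    assert(e.initialized && "entity updated before initialization");
}

// Most downstream: this assert is the first one added, closest to the symptom.
void renderEntity(const Entity& e) {
    assert(e.initialized && "rendering an uninitialized entity");
}
```

Worked from the bottom up, each assert is verified against the repro before the next one is added upstream.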

Most modern game engines are heavily data driven, so sometimes, the origin of the fault is in data. At that point, I jump into the data conditioning pipeline to make sure that it detects the problematic data and reports an error. Sometimes it's possible to take it even further, and modify the editor to prevent problematic data from being input in the first place.
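A data-conditioning check might look something like this (the `AnimationClip` fields are invented for illustration): the pipeline rejects the bad asset with an actionable error instead of letting it crash the game at runtime.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical piece of game data.
struct AnimationClip {
    std::string name;
    float duration = 0.0f;
};

// Runs at build/condition time, not at runtime: report bad data to the
// content creator with enough context to fix it.
void validateClip(const AnimationClip& clip) {
    if (clip.duration <= 0.0f) {
        throw std::runtime_error("animation clip '" + clip.name +
                                 "' has non-positive duration");
    }
}
```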

At the end of the exercise, I go ahead and fix the problem. During this process, I've potentially altered dozens of places in the code, all of which are now slightly more diligent about preventing mistakes from slipping through. This expands on an idea that I brought up last time: a crash (or any type of malfunction) is an opportunity to better understand the code, and beyond that, to make it more robust.

This theme of methodical and incremental steps to make the code better and more robust will recur as my blog develops, but I hope this entry inspires you to go beyond fixing bugs and spend some time on better bug detection.

Friday, March 13, 2015

Papering over problems

Programming is all about making decisions. Making good decisions is the "art" of programming. I'll write a lot more about "programming as art" in the future, and also on the decisions that programmers have to make, but today I'm going to talk specifically about a particular choice: what to do when you encounter a crash bug.

Over my career, I've seen a lot of code bases, and worked with a lot of teams. All too often, programmers make check-ins with descriptions that say something like "fixed bug" (more on better check-in comments in the future too!), but when you take a look at their actual change, the "fix" is just to check some pointer for null, and skip the operation in that case. This behavior is so pervasive that many programmers do it without realizing that it's a problem.

In the words of Will McAvoy: "The first step in solving any problem is recognizing there is one."

The problem, of course, is that you need to consider what caused the pointer to be null in the first place. If the design of the system is loosely bound (i.e. the pointer is sometimes expected to be null), then perhaps checking for null is the appropriate fix. But if it's not, then I contend that you haven't "fixed" anything - rather, you've papered over a problem.

There are consequences to this behavior: all of a sudden the program *appears* to work. But perhaps it malfunctions in some less obvious, but potentially more serious way downstream. Worse yet, the program might no longer crash, but once in a blue moon it might exhibit strange behavior that nobody can explain. And good luck tracking that down.

Blindly checking for null convolutes the code. Someone else working in the same spot might see your null check, and assume that, by design, the pointer can be null. So they add null checks of their own, and before too long it's not clear whether *any* pointers can be relied upon.

Crashes are your friend. They're an opportunity to get to the root cause of an issue - perhaps even an incredibly rare issue. Crashes are like comets...you can cower superstitiously, or, like Newton, you can use them to understand the universe on a deeper level.

A pointer being null might be caused by:
- an initialization issue
- a lifetime issue (relationships between objects not being enforced)
- someone inappropriately clearing the pointer
- a memory stomp
- or something even more exotic

The point is that until you understand what's really going on, you haven't diagnosed the crash at all! The appropriate action is to diagnose the upstream cause of the issue, not to put in some downstream hack to paper over the problem.

Papering over the problem is a deal with the devil. Sure, you're a "hero" because of your quick fix...but usually the cost is that someone else has to spend much more time investigating the real cause when it comes back to bite. And nine times out of ten, the real problem could easily be discovered, but laziness and/or immediacy steals that opportunity.

Don't get me wrong, system programmers are The Night Watch, and there's nothing we love more than rolling up our sleeves and spending several uninterrupted hours sleuthing into a seemingly impossible bug, but our time is better spent diagnosing real issues, as opposed to issues that are easily avoidable with better technique.

At Disbelief, we call handling failures in-line "soft-fails". And we have a rule that says you don't use them, unless you understand why the failure can occur, and further, you have to put a comment explaining the reason at the point where you add the soft-fail. This covers all the bases: it prevents people from tossing in soft-fails because they're lazy; it demands that people understand why the failure can occur; and it documents the case for future observers.
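Under that rule, a soft-fail might look like this sketch (the audio API here is made up): the in-line handling is allowed only because the comment explains exactly why the failure can legitimately occur.

```cpp
struct Sound { int id; };

// Hypothetical lookup: returns nullptr when the sound isn't resident.
Sound* findSound(int soundId) {
    static Sound footstep{42};
    return soundId == 42 ? &footstep : nullptr;
}

// Returns true if the sound was actually played.
bool playFootstep(int soundId) {
    Sound* snd = findSound(soundId);
    if (snd == nullptr) {
        // Soft-fail: streaming audio banks may not be resident yet during
        // a level transition, so a missing footstep sound is expected and
        // benign here. The comment is mandatory; a bare null check is not.
        return false;
    }
    // ... play the sound ...
    return true;
}
```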

At this point, I think I should stop and talk about technique vs. practicality. There are more factors at play in development than just software engineering. If the null pointer crash is stalling your team, and you can be reasonably sure that there aren't data-destructive consequences to patching it, then the right tradeoff might be to patch it temporarily. Equally, if your project could get cancelled because you need to hit a milestone, then practicality has to rule the day. And a code base that's overly paranoid about everything can cripple development because it trips up on situations that are, in fact, expected.

But don't use those factors as an excuse not to spend some time investigating - many problems can be fixed with just a tiny bit more effort. And my plea to all the programmers who might be reading this is to spend those extra few minutes of effort and find the real problem. And to be mindful of the consequences and the cost of adding a quick fix.

I'll finish by saying that I used null pointers as an example during this post, but the same principle applies to other varieties of soft-fail. If speed is below zero, don't just compare it to zero, think about whether speed is allowed to be below zero in your design. Just the other day, I saw an experienced programmer writing code to squish NaNs, but did that person really take the time to understand why NaNs were present in the first place?
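For example, rather than silently clamping, assert the design invariant first (a hypothetical drag calculation, where speed is a magnitude by design):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical movement update: by design, speed is a magnitude, so it can
// never legitimately be negative or NaN. Assert those invariants instead of
// silently squishing bad values.
float applyDrag(float speed, float drag, float dt) {
    assert(!std::isnan(speed) && "NaN speed: find out where it came from");
    assert(speed >= 0.0f && "negative speed violates the design");
    float newSpeed = speed - drag * dt;
    // Clamping to zero here IS by design: drag can slow motion, not reverse it.
    return newSpeed > 0.0f ? newSpeed : 0.0f;
}
```

The clamp survives, but only because the design says it should; the asserts make sure nobody upstream is feeding the function garbage.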

If you want to get to the point where you're solving the deepest, most mystifying crashes, then doing due diligence whenever you encounter a crash is a great way to start.

Friday, February 27, 2015

Welcome

Welcome to my blog!

I'm going to use this blog to post my thoughts about software engineering, game development, and management. The posts will be in random order, based on what I'm thinking about at the time. Sometimes, they'll be based on situations that I've encountered in the past. Sometimes they'll be based on my current thinking. You might even gain some insight into what's currently going on at Disbelief.

My goal is to make my posts useful to the reader, especially if you're an up-and-coming programmer. Even if you don't learn something new, I hope you'll at least relate to the situations I describe: maybe you'll recognize the actions of a wrong-headed producer or manager!