Advent of Code in Production, Day 7: Incident Review

If Advent of Code were a whole system, it might look like this. Of course, the first deployment of a complex system is never smooth. This is a review of that incident.


We've been designing a system to help Santa's elves gather magical star fruit, following along with the scenarios presented in Advent of Code. So far, we've built out some core functionality of the application, as well as some core system architecture. On day 6, we started deploying it and using the system for real. As with virtually every new product release, there were some issues. So, we declared a production incident and worked through those problems. Now it's time to review what happened.

The most important thing with incident reviews is to learn something from the process. But we can do better than just learning something. There are some things we need to learn more than others. So, we'll try to tailor our review to surface those things and share that knowledge. In this case, I think our elves most need to learn to improve their operational practices. Maybe they even need to learn that they can improve. It seems like they do a lot of improvising and a lot of last-minute scrambling, even around very common activities. So, if I were establishing a real incident review process for this organization, I would have a couple of primary goals for it:

  1. Promote earlier and more proactive communication. On a long timeline, I would want to get to a point where people anticipate each other's needs. But that's the end of a long road, and the place to start is creating more safety to ask questions.
  2. Start identifying things that do work well and get them established as regular processes. It seems like the elves are managing, but my guess is they're using a lot of adaptive capacity on routine work. The idea is to try to make routine things more routine, so they have more capacity available to adapt to the unexpected.

Incident Report

This incident occurred on December 6-7, during the initial deployment to use SFTools on a production communicator. The communicator we had available to use was misconfigured and initially could not connect to the rest of the communicator network. We were eventually able to patch the radio module and connect to the network, but still could not use the messaging system. We believe this was due to outdated message versions, so we tried to perform a system update to get the latest version. This failed because the communicator's disk was mostly full and there wasn't enough space available to download the updates. Without knowing what was stored on disk, we opted to do some quick analysis and a little bit of guesswork in identifying the minimal set of files that could be removed to enable the update. After that, the update was successful, and the communicator worked as expected.
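
As a side note, one way to approximate "the minimal set of files that could be removed" is to look for the smallest single directory whose deletion would free enough room for the update. The sketch below shows that kind of check in Python. The disk capacity and update size are made-up placeholders, not figures from the actual communicator, and the real analysis was far more ad hoc than this.

```python
import os

# Placeholder numbers for illustration only; the real communicator's
# disk capacity and update footprint aren't documented in this report.
DISK_CAPACITY = 70_000_000
SPACE_NEEDED_FOR_UPDATE = 30_000_000


def directory_sizes(root):
    """Total up the size of every directory under root, including
    everything nested beneath it."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root, topdown=False):
        total = sum(
            os.path.getsize(os.path.join(dirpath, name))
            for name in filenames
            if os.path.isfile(os.path.join(dirpath, name))
        )
        # Children are visited before parents (topdown=False), so their
        # totals are already in `sizes`.
        total += sum(
            size for path, size in sizes.items()
            if os.path.dirname(path) == dirpath
        )
        sizes[dirpath] = total
    return sizes


def smallest_deletable_directory(root):
    """Return the smallest directory whose removal would leave enough
    free space for the update, or None if no single directory is enough."""
    sizes = directory_sizes(root)
    free = DISK_CAPACITY - sizes[root]
    shortfall = SPACE_NEEDED_FOR_UPDATE - free
    if shortfall <= 0:
        return None  # already enough free space
    candidates = [(size, path) for path, size in sizes.items() if size >= shortfall]
    return min(candidates, default=None)
```

The point isn't that we ran anything this tidy in the field; it's that the shape of the problem is well understood, which is part of why the guesswork felt acceptable.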

Recommendations

We want to be careful not to suggest that these issues could have been prevented. That would be speaking with hindsight, and it's not productive. However, there are steps we can take to respond to issues like this more effectively in the future, or to detect them earlier.

  • Add an IT asset management facet to the expedition's inventory management process. Communicators are general purpose computers. They should be inspected, restored to a consistent state, and prepared for use before they're issued.
  • Allow people to check out their own equipment from inventory. The members of the expedition are responsible and highly competent. They can be trusted to manage their own needs.
  • Provide some guidance in the form of checklists to prepare for important expedition events such as moving and establishing camps.
  • Consider building consistent teams of expedition members. Small teams can build familiarity, learn each other's needs and skills, and provide support as needed.
  • Investigate how the radio module got into a nonfunctional state. That's a surprising and worrisome failure mode.

Complications

  • The communicator we had for initial deployment was known by the group as "the broken one."
  • We were not able to secure a communicator to deploy until after leaving base camp, so this incident response happened with limited resources while en route to the remote camp.
  • In addition to being known to be broken, the communicator was in a generally unknown state from unknown prior use.
  • System updates require nearly half the total disk space on a communicator.

Timeline

This timeline was recreated after the fact, and mostly without the benefit of any timestamped communications or events. It may not be entirely accurate, but that's okay. Incident timelines are not very useful on their own. Its purpose is to contextualize the choices made and actions taken during the incident response.

  • Morning Dec 6 - We accompany a large crew going to establish a field camp.
  • For the first time, one of the elves gives us a communicator. It's broken, but they issue it anyway because we have a reputation for being able to fix these things.
  • We begin investigating the device and identify the communication failures.
  • Afternoon Dec 6 - Mostly spent traveling.
  • Evening Dec 6 - The communicator radios are reasonably well documented. Between the docs and some experimentation, we're able to write a patch for the radio that restores some functionality. Specifically, this enables the radio to identify packets in the communication stream (there's a sketch of this kind of marker detection after the timeline).
  • With some more experimentation we can patch the radio module to correctly process multi-packet messages.
  • We discover that the communicator is not correctly deserializing those messages. We stop for the day.
  • Morning Dec 7 - We investigate message serialization and learn the serializers are provided by system updates. We try to perform an update and discover the disk is too full to download it.
  • We try to take a partial update but learn there's no good way to do that.
  • So, we start looking for ways to clear space on the disk. This is mostly exploratory work, searching for files that seem safe to delete. We find some and delete them.
  • After that, we're able to perform a system update, and the communicator seems to be fully functional.
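
For a bit of technical context on the radio patch mentioned above: both fixes boil down to scanning the incoming character stream for the first run of distinct characters that marks a boundary. Here's a minimal sketch of that sliding-window check. The window sizes (4 characters for a packet marker, 14 for a message marker) are assumptions carried over from the puzzle this scenario follows, not anything documented about the communicator hardware.

```python
def find_marker(stream: str, window: int) -> int | None:
    """Return the index just past the first run of `window` distinct
    characters in the stream, or None if there is no such run."""
    for end in range(window, len(stream) + 1):
        if len(set(stream[end - window:end])) == window:
            return end
    return None


# Example stream from the puzzle this scenario mirrors.
example = "mjqjpqmgbljsphdztnvjfqwrcgsmlb"
print(find_marker(example, 4))   # start-of-packet marker -> 7
print(find_marker(example, 14))  # start-of-message marker -> 19
```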

In Real Life

Normally an incident review would include the people who were involved with the incident. My preference would be to do some short one-on-one interviews with the key players to understand what they did and why. That would be followed by a group review to share learnings and discuss productive follow-up. The review should also produce a report for the broader organization, which could be given as a presentation, or not, depending on group preference.

In this case, all of the people involved are fictional, and I'm already butting up against the limits of the authorial liberties I want to take with this series. I'm not trying to write the Phoenix Project. So, we'll have to make do with just reading the report.