Mental maps, part 2: incidents and observability

We map the system so that we can change the system, so then we must remap the system. That's the tight inner loop of software development.

[Photo: a radio telescope at night, against a background of stars in the sky]

I wrote recently that the fundamental role of software developers is discovery, learning, and building mental maps of the system. Personally, I think it's a good article, and you should read it. I ended that post with a discussion of on-boarding, as both a time and a task when the developer's explicit goal is to build a mental model of the system. The other members of the project often put some extra effort into helping with that task. And the only real expected outcome is that the new developer becomes familiar enough with the project to no longer need that extra attention. Of course, on-boarding isn't special; improving discoverability can happen at any time. As it happens, there are other times, and other tasks, when exploration or understanding is a primary goal. Two notable cases are incidents and observability. Devoting care and attention to these tasks contributes to a virtuous cycle toward being more effective as software developers.

Incidents

Let's first make sure we're talking about the same things. When I talk about an incident with a software system, that's usually a significant disruption in service. When the service gets slow, error-prone, or unresponsive, and people get called in to fix it, that's an incident. An incident might also be a disruption to the normal operation of the system that doesn't (noticeably) disrupt service. You might use an incident response framework to coordinate a major release or upgrade. Either case works for our purposes. The critical thing is that the system is not operating as normal and desired, and people are actively working to understand why and correct it.

All happy families are alike; each unhappy family is unhappy in its own way.

Incident response is about understanding what's causing a system to act strangely, and then addressing it. It's explicitly a task to build a mental model of the system. Actually, more than that, it's often a task to re-build a mental model of the system. Whether we call something an incident or not depends in part on whether the system is behaving as intended. That intent reflects our goals and desires, as well as our preexisting understanding. Then reality comes along and upends what we thought we understood. This can be very stressful in the moment. But it's a goldmine for learning and discovery. A critical part of the way we learn about complex systems is by observing how they behave in response to stimuli. An incident is almost definitionally a significant new behavior. The Anna Karenina principle suggests there's an unlimited number of ways a system can break, and only a limited number of ways it can work. Incidents present a rich opportunity to explore some of that unbounded possibility space.

Of course, it's pretty rare that we can take our time to do a careful study during an incident. The priority is almost always to return the system to working order. This means that incident response tends to involve the people who already have the best understanding of the system. That makes sense, but it also leaves everyone else out of a great opportunity. We can recapture much of that opportunity by conducting reviews of our recent incidents. There's been quite a lot written about how to do that, so I won't repeat it. I think the Howie process from Jeli.io and the Learning From Incidents conference and community are great resources. You might also search for terms like "SRE, incident, review, retrospective, postmortem, blameless, and blame aware."

I will say that you should pay attention to what you're learning, not just how you're learning. You can't really control what lessons an incident has to teach, but you can control which you focus on. That should be informed by what your organization actually needs. That is, what understanding you lack, or where your mental maps don't align (with each other, or with the part of the world you just discovered). If you're new at this as an organization, then you might want to start with the skills or confidence you lack. And then when you're done, find ways to share your learnings. The retrospective meeting itself is best kept to just the people involved. But discussion and documentation that comes out of the retro would likely benefit a very large audience.

Observability

Observability is the other topic I'd like to discuss in terms of helping ourselves build the mental maps we need to be effective as software developers. I'll start again with a bit of definition, because it's a term that's taken on a lot of marketing weight recently. The term originates in the realm of control theory. I'll be honest, my education in that area is informal. Maybe someday I'll have the bandwidth to formalize it, but for now I'll just stick to how it comes up in practice. Observability is a quality of a system that indicates how well the operators can deduce the system's internal state based solely on its observable output. That is, without stopping the system, disassembling it, or modifying the inputs, do you have enough information to reason about what the system is doing and why? If so, you have good observability; congrats! If not, you have poor observability.

Not just telemetry

If you've encountered Observability™ marketing, you've probably heard a lot about the three pillars of logs, metrics, and traces. Collectively, these things are telemetry. Telemetry is a good and important mechanism to improve observability, certainly. Being able to process and analyze your telemetry is just as important. But don't forget about your system's regular, for-purpose behavior. Status and error messages are just as much something you can observe as logs and traces are, for instance. That's somewhat tangential to my point here, but I think it's important to say.
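To make that concrete, here's a minimal sketch in Python of the difference between a failure that tells the operator something about internal state and one that doesn't. The function, error type, and field names are hypothetical, invented for illustration rather than drawn from any particular system:

```python
class PaymentError(Exception):
    pass


def capture_payment(order_id: str, gateway_status: int, attempts: int, max_attempts: int) -> None:
    # Hypothetical example: the names and failure mode are made up for illustration.
    if gateway_status != 200:
        # A bare `raise PaymentError("payment failed")` leaves the operator guessing.
        # Reporting what the system believed at the moment of failure makes the
        # ordinary error path itself an observable output.
        raise PaymentError(
            f"capture failed for order {order_id}: gateway returned "
            f"{gateway_status}, retries exhausted ({attempts}/{max_attempts})"
        )
```

The second message doesn't replace telemetry; it just acknowledges that the system's normal, for-purpose outputs are part of its observable surface too.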

As software developers, we often find ourselves working on complex systems. One of the key features of complex systems is that they are not actually comprehensible. Regardless of an individual developer's knowledge, skill, or experience, it's not actually possible to have enough information about a complex system to know with certainty how it will behave, or what is causing a given behavior. They don't even necessarily have singular causes or behaviors. What this means for us is that we can't recreate and examine arbitrary states. What we can do is make sense of the system based on its observed behavior. We can theorize, test, adapt, and theorize some more. This is the only really effective way to build useful mental models of a complex system. We have to poke it, and then see how it responds. Good observability lets us do this much more effectively.

Telemetry is a vital component of observability. Having that telemetry permits a much more robust understanding of the system. The work of getting that telemetry has the same effect. There's very little telemetry that you'll ever just get for free. Some frameworks (web servers, for instance) come with some prepackaged logging, but that's about the best you can normally expect. That means in order to get useful signals out of your system, you have to instrument the system to produce those signals. The process of doing that will involve a lot of hunting through the code base to find and expose interesting data. It requires thinking about how to collect and expose that data, and in the best case might even involve design changes to the system to make the data more accessible. But most importantly, it requires having some idea what signals would be useful. That requires a useful model of the system: its purpose, its behavior, and the people who operate it.
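As a rough sketch of what that instrumentation work looks like at the smallest scale, here's one way it might look in Python. The function, signal names, and the in-process counter are all stand-ins I've made up; a real system would hand the same signals to a proper metrics and logging pipeline:

```python
import logging
import time

logger = logging.getLogger("orders")

# Stand-in for a real metrics client (Prometheus, StatsD, OpenTelemetry, ...).
counters: dict[str, int] = {}


def bump(name: str) -> None:
    counters[name] = counters.get(name, 0) + 1


def fulfill_order(order_id: str) -> None:
    # Sketch of instrumenting an existing code path to emit useful signals.
    start = time.monotonic()
    try:
        # ... the original business logic lives here ...
        bump("orders.fulfilled")
        logger.info("fulfilled order %s in %.3fs", order_id, time.monotonic() - start)
    except Exception:
        bump("orders.failed")
        logger.exception("fulfillment failed for order %s after %.3fs",
                         order_id, time.monotonic() - start)
        raise
```

Deciding that order identifiers, durations, and a success/failure split are the signals worth emitting is exactly the modeling work described above; the mechanics of emitting them are the easy part.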

The purpose of maps

We put all this work into drawing these mental maps for ourselves, even though that work is very often incidental to something else. A lot of people act as though being a programmer is about knowing programming languages, or data structures and algorithms, or maybe design patterns. But every decision you'll ever make about any of those is downstream from your understanding of the system you're working on. We make these mental maps of our systems, not because we enjoy it, but because we need them in order to work on the system. Don't get me wrong, we can also enjoy it; I certainly do. But the motivating factor is that we're trying to get something done which requires some understanding of the system.

That's important. Maintaining and operating software systems requires a robust understanding of the system. And those systems are constantly changing, so we need to constantly revise our understanding of them. In order to understand the system, we have to explore and experiment with it. Taken together, the result is that we can do faster, better, more reliable work on systems that are easier and safer to explore. But don't stop there. Software isn't some naturally occurring thing. We build it. We decide how it's built. And we can build it to be safer and easier to explore. That's not just a "nice to have" feature. It's not a luxury that only other teams can afford. We map the system so that we can change the system, so then we must remap the system. That's the tight inner loop of software development. Optimizing that literally makes us better at our jobs.


Cover Photo by Igor Mashkov