In software engineering, we often talk about scale: scaling systems, the scalability of a design, making things scalable. That might include things like load balancing, caching, event buses, replication, and any number of other technical features. These are all important, valuable, necessary things. These are complex and interesting topics. And these things are also only part of the equation. We are a part of the systems we build. We spend a lot of time talking about, thinking about, and working on the technical aspect. But these are sociotechnical systems. The people who build and operate a system are just as much a factor in the outcome as the network and services it's made of. When we plan to scale these systems, ignoring the social part—ourselves and our teams—can create just as much of a hazard as any other part.
Technical interviews for software developers often have a data structures & algorithms test. Presumably it's to evaluate whether a candidate can recognize problems that are addressable with certain well-known solutions. These tests are very often poorly applied, but the objective is a reasonable one. Some of these things are harder than others. But all of it is likely impossible if you don't even recognize that a problem exists in the first place, or what kind of problem it is. The same is true in numerous other contexts. As a project grows, matures, and scales, there will be challenges to overcome. These challenges may be difficult. But they'll be practically impossible unless you recognize that the problem exists at all. I can't tell you what the solutions to your problems are. I can't even tell you what your specific problems are. My intent here is to identify some categories of these problems, as a framework to help you recognize and think about them yourself.
When I say that the systems we build as software engineers are sociotechnical, what does that mean? Basically, we're part of the system. We tend to be accustomed to thinking of our computer systems in terms of their hardware and software. It's the programs we wrote, the runtimes and operating systems that execute them. It's the CPUs and RAM that flip bits around. It's the networks and disks that allow us to communicate, store, and retrieve data. It's the sensors, screens, speakers, lights, and motors that interact with the rest of the world. And that's all true, the system is the collection of those things.
But that's not all the system is. Those things are machines. Machines need to be built, maintained, repaired, and operated. All of that is done by people. That work is critical to the function of the rest of the system. We give the machines work to do. We modify them to do different work. We repair them when they break down, and we restore them to good order when they get into undesired states. If we stopped doing those things, or even if we did them differently, that would dramatically change the way the system behaves. Without us, the system would not exist, and would not function. We're part of the system.
The purpose of scaling (up) a system is to be able to handle more demand than in the past. We do this because we expect the system to grow. Or at least we hope it will. We want to be able to support more customers. We want to support increased usage from our existing customers. And we want to do it without sacrificing the quality or level of service we provide. We also want our costs to grow slower than the demand does. We accomplish that by designing the system to be scalable. But, scalable how? What is being scaled exactly? And in response to what?
There are various ways that a sociotechnical system can scale, and the one we usually mean is to scale capacity in response to (or anticipation of) demand. Capacity is the amount of work that the system can take on and process. This is what we're focused on when we talk about things like caching, sharding, eventual consistency, or Big O complexity. These are all valid concerns. But at the same time, remember that there are people in the system, not just computers. Capacity also means the amount of support you can provide to users. The ability of your operations teams to respond to events happening in the system. Even the ability of the developers to continue to develop the system.
Compared with staffing up more support and operations roles, computers and software are easy. The capacity of your system includes those functions, too. So, any plan to scale the system that ignores the work of support and operations is—at best—incomplete. Support and ops can grow through experience and training. Tooling. Documentation. Those things only buy you so much capacity, and then the only remaining avenue to scale is hiring. Which is hard—and ironically it doesn't scale well. I'll get into why that is later. But remember that computers are easy.

There are many times when development teams will scale the capacity for work done by the technical part of the system with no regard for the social part: the people. This has the effect of shifting work and complexity off of the computers, software, and developers. Instead that work is shifted onto ops and support. This is obviously rude and insensitive. It's also bad engineering. The narrow view of the system taken by development teams means that the system doesn't actually respond well to scaling demand. It may be faster to build complicated features, confusing workflows, and error-intolerant software. But these things don't accomplish their goals. They just shift the burden onto different parts of the sociotechnical system. Good engineering will take this into account. It recognizes that machines are supposed to serve us, not the other way around. It means ensuring that as much work as possible is done by the tools we build. And minimizing the new work (and the difficulty of that work) that those tools create for us.
“Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can.”
— Jamie Zawinski
Capacity may be what comes to mind first when we talk about scale, but it's only one dimension. Systems also scale in terms of scope: the kinds of things the system even does. Any successful system will have pressure to expand its features over time. People use it for some tasks, and then want to continue using it for related tasks. The truth in Zawinski's law isn't about email specifically, it's about that expansion of features. Our systems will scale to encompass new and larger scopes.
That added scope comes with added complexity. New interactions, new dependencies, and new failure modes. Designing a system to be scalable to larger scope means accounting for those things, mitigating them, and limiting their growth. Failures to do so often go unrecognized as failures to scale, but they are noticed and felt by development teams. There's a very good chance that any complaint about "legacy" code is actually hinting at this. Legacy code is code that's hard to modify and maintain, hard to evolve, hard to understand. Designing a system to be scalable in scope means tackling that problem. The software itself is a major component of this. Some software architectures are more scalable than others in various dimensions. Just about any architecture will scale better than churning out features without regard for how they'll respond to changing requirements and expectations.
But again, our systems are not just technical systems. What we do matters. The process of development also influences how scalable, or not, the system will be. Documentation, training, review, knowledge sharing, and automation can all make a system scale more gracefully to larger feature sets. So can including features that are just for your benefit. It's unlikely that your customers will ever ask you to build a feature flagging system, or blue/green deployments. But having these and other similar features makes it easier and safer to continue modifying the system over time. These things are features of a scalable system design. Without them, it's likely the only option to scale is by adding more people to more development teams. This is much harder to do. It's more expensive and less effective.
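To make the feature-flag idea concrete, here's a minimal sketch of a percentage-based flag check. It assumes an in-memory registry; real systems typically back this with a config service, and all the names here (`FeatureFlags`, `set_rollout`, `is_enabled`) are illustrative rather than any particular library's API.

```python
import hashlib

class FeatureFlags:
    """A toy registry of feature flags with percentage rollouts."""

    def __init__(self):
        self._rollouts = {}  # flag name -> rollout percentage, 0..100

    def set_rollout(self, name, percent):
        self._rollouts[name] = percent

    def is_enabled(self, name, user_id):
        percent = self._rollouts.get(name, 0)  # unknown flags are off
        # Stable bucketing: hash the flag and user together so the same
        # user always lands in the same bucket, and rollouts don't flicker.
        digest = hashlib.sha256(f"{name}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return bucket < percent

flags = FeatureFlags()
flags.set_rollout("new-checkout", 25)  # ship dark, enable for ~25% of users
print(flags.is_enabled("new-checkout", "user-42"))
```

The point isn't the twenty lines of code; it's that a flag like this lets you merge and deploy unfinished work safely, which is capacity for the development team, not for customers.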
Time and Space
Even if nothing else changes about your system, time will continue to pass. People will pursue their careers, sometimes at other companies. New software trends will unfold. Languages and frameworks will rise and fall in popularity. Security vulnerabilities will be discovered. All of these things affect your system. Some you need to respond to directly, others affect your ability to respond. Likewise, people and things move. Your customers and their usage might shift to other parts of the world. Your company might be bought or sold, or maybe just move offices for more mundane reasons. You might experience a global airborne pandemic that makes remote work a mandatory health precaution.
Where and how you work, where your users are, when they use your systems. All of these things can change, and likely will. New competitors may crop up or leave the market. Your market itself may change, and people's expectations along with it. All of these are things that you will need to adapt to. In the prior dimensions many of the solutions had at least a degree of software in them. This dimension mostly doesn't. Maybe infrastructure as code helps you shift your deployments to other data centers when needed, but that's a bit of an exception. The way people live and work changes over time. Sometimes abruptly. And the response to that is a question of adapting to new ways of working. Examine your tools and processes and communication habits and change them when they no longer suit you. Doing that effectively requires practice. Even more, practice is required to recognize that it should be done. That means doing that examination regularly.
The Team Itself
Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.
— Melvin E. Conway
Teams change and grow. People join and leave. They get promotions, take vacations, switch roles, and even move between teams. And all the while, they're learning about the system, changing the system, and taking part in the system. They're developing habits and social norms. They're becoming repositories of implicit knowledge. Those habits, those norms, that implicit knowledge is critical in determining how a system will scale. I find Conway's law to be one of the most useful in understanding how sociotechnical systems behave. Technical systems reproduce the communication patterns of the teams that build them. This holds true very consistently. Rigid, defensive, closed-off communication will produce a rigid, brittle, disjointed product. The way to make this scalable is to foster safety, curiosity, and open communication.
That kind of safety also builds resilience, both within the team and within the product. Because, again, those things are all the system, and the system is all of them. Resilience is the capacity to absorb and respond to adverse and often unexpected events. For the computers, that could be spikes in demand or network failures. It looks like recovering from error states or degrading gracefully when dependencies are unavailable. These are complex and difficult things to achieve. But they're only part of the story, and they're once again the easy part, compared to building resilience within a team.
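The easy, technical half of that story, degrading gracefully when a dependency is unavailable, can be sketched in a few lines. This is a toy fallback-to-cache pattern under assumed names (`get_profile`, `fetch_remote`); production systems layer on timeouts, circuit breakers, and staleness limits.

```python
def get_profile(user_id, fetch_remote, cache):
    """Try the live service first; on failure, serve the last known value."""
    try:
        profile = fetch_remote(user_id)
        cache[user_id] = profile  # refresh the fallback copy on success
        return profile
    except ConnectionError:
        # Dependency is down: degrade to stale data instead of failing.
        return cache.get(user_id)

cache = {}

def flaky(user_id):
    raise ConnectionError("service unavailable")

# With nothing cached yet, a dead dependency yields None, not a crash.
print(get_profile("u1", flaky, cache))  # prints None
```

Notice that the code only handles the failure; deciding that stale data is acceptable here at all is a judgment the team makes, which is the harder half of the story.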
For a team, resilience looks like teaching and curiosity, volunteering support, anticipating each other's needs, or valuing rest. It's doing things for yourselves and each other, and creating slack in the workflow, and then making use of that slack as needed. A resilient team is one that has the safety to try things and fail, or to make mistakes. Not just psychological safety, although that's a big part of it. But actual material safety. It's one where the cost of failure is small, where mistakes can be noticed quickly and the risk is contained, where damage is isolated and easily repaired. Doing that work is hard. And it never ends. This isn't a status that you achieve. It's a condition that you create, foster, preserve, and protect. But the difference in what teams can do with this kind of safety and resilience vs without is night and day. If you want to go fast, go far, go big: this is what it takes.
I hope that provides a more useful and complete way to think about scaling and growth. When people talk about scale, it's so often just in mechanical terms of throughput and algorithmic complexity. That view is, at best, woefully incomplete. That's if it's not myopically narrow on purpose. We can be better than that. We owe it to our users and ourselves to be better. And we likewise deserve it.