Testing needs a better vocabulary

If you build or test software, you've almost certainly seen some variant of the Test Pyramid. The basic concept it's trying to communicate is sound: focus your testing resources on the kinds of tests that provide the most value. Specifically, that means focusing primarily on unit tests, with only as many higher-level tests as you can do easily. Or reliably. Or quickly? Cheaply?

The Fabled Test Pyramid

The test pyramid is good advice, but a terrible guide to implementation. Which is why I assumed before that you've seen "some variant" of it. It may have even been one of these. People love to try to improve on the pyramid, because it's so unhelpful at a practical level. That there are so many versions of this very simple idea is a problem. Or rather, it's a symptom of a problem. The root of the problem is that we don't have any good words to describe these concepts. We talk about a "hierarchy of tests" that progress from "low level" unit tests to "high level" functional tests. Or acceptance tests? End-to-end? Apparently the only kind of testing we have a common and clear enough understanding of is unit tests. Everything gets very fuzzy and ambiguous very quickly from there.

Of course the other contributor to the multitude of test pyramids is that people seem to expect the pyramid to actually be a practical guide to something. Really, it's a visual aid that describes the relationship your tests should have to each other. It's not a goal, or a plan, or even practical advice. It's just trying to communicate that you should have a large foundation of unit tests before you spend resources on less well-isolated kinds of tests.

The army we have

Before we try to change the world, let's first try to understand it as it is. We do actually have quite a few words to describe various levels of a testing hierarchy. These are the terms I encounter most frequently, with the strictest definition I can give for their common usage.

  • Unit tests are tests of the most basic conceptual units of your software system. How you define that unit will largely be informed by whether you are following some TDD variant like ATDD or BDD, or something else. As a practical matter, for this article, we'll assume it's somewhere between single functions and groups of classes (or an analogous structure in your trendy functional language of choice). Regardless, unit tests can and should isolate the units from their dependencies and be very narrowly targeted, with as close to zero side effects as the design allows.
  • Component tests are bigger unit tests. They might just test larger units, or collections of units. They otherwise should follow the same pattern as unit tests.
  • Service tests are end-to-end/functional tests of a web service.
  • End-to-end tests, user acceptance tests, or functional tests are traditional QA tests executed through the regular user interface against a complete live system in a production-like environment (hopefully not in production). These tests may or may not be automated. They are rife with unmitigated external dependencies; often there are even numerous unknown dependencies.
  • Integration tests are tests of two live things interacting with each other. What those things are depends entirely on who is using the term. These tests should, but often don't, eliminate external dependencies.

And then beyond these terms, there are numerous others that are either completely redundant or refer to subsets of another category (think smoke tests).

The enemy we have

The quote by Donald Rumsfeld that I linked above has a corollary, but I couldn't find an attribution for it: "you have to go to war with the enemy you have, not the enemy you want". In other words, you have to acknowledge and understand your problem before you can solve it; and solutions which don't work aren't actually solutions.

The vocabulary we use to describe our tests doesn't do a very good job of describing our tests. I would suggest the reason is that the way we're trying to categorize our tests isn't actually useful. We talk about "levels" of tests, but that's not really meaningful except to distinguish unit tests from any other type. Instead, we should consider the objectives and requirements of the test. What is it testing, and who will test it? Here's how I would define various categories of software tests.

Not a pyramid

Unit Tests

Unit tests are whitebox (or graybox) tests of some basic unit of source code which will require developer-level access and understanding to create. These tests are entirely isolated from external dependencies. The defects they detect are technical gaps or inconsistencies that could have otherwise caused crashes, or the kind of generic unhelpful error dialogs that users hate, or the loss or corruption of data, and so on. In addition to what you probably think of as "unit tests" already, this category would include any kind of static analysis and code review. My definition wouldn't require that these be automated, but in most cases there is no practical way to execute them manually.
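
To make the isolation requirement concrete, here's a minimal sketch in Python using the standard unittest and unittest.mock modules. The InvoiceCalculator and its tax-rate dependency are hypothetical stand-ins for illustration, not code from any particular project:

    import unittest
    from unittest.mock import Mock

    # Hypothetical unit under test: computes an invoice total using an
    # injected tax-rate lookup, so the dependency can be swapped out in tests.
    class InvoiceCalculator:
        def __init__(self, tax_rate_client):
            self.tax_rate_client = tax_rate_client

        def total(self, subtotal, region):
            rate = self.tax_rate_client.rate_for(region)
            return round(subtotal * (1 + rate), 2)

    class InvoiceCalculatorTest(unittest.TestCase):
        def test_total_applies_regional_tax_rate(self):
            # The external tax service is replaced with a mock, so the test
            # exercises only this unit and has no side effects.
            tax_client = Mock()
            tax_client.rate_for.return_value = 0.07
            calculator = InvoiceCalculator(tax_client)

            self.assertEqual(calculator.total(100.00, "CA"), 107.00)
            tax_client.rate_for.assert_called_once_with("CA")

    if __name__ == "__main__":
        unittest.main()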

System Integration Tests

System integration tests are graybox or blackbox tests of the interactions between two live systems. These tests don't necessarily have to be created by developers, but they may be limited in their usefulness unless the system and components were built or modified by developers to be testable in this way. They can and should be isolated from all external dependencies, but doing this effectively may require developer support. The kinds of defects these tests discover relate to access and communication, performance, configuration errors, and unknown dependencies (for example, accidentally relying on behavior in a dependency that is caused by a bug). These tests can and should be automated.
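
As a rough illustration of the shape these take, here's a sketch of a test between two live services where the only thing under scrutiny is their interaction. Everything here is hypothetical: an orders service and an inventory service running locally as test instances (with their own external dependencies stubbed out), exercised with the third-party requests library:

    import requests

    # Hypothetical local test instances of the two systems under test.
    ORDERS_URL = "http://localhost:8001"
    INVENTORY_URL = "http://localhost:8002"

    def test_placing_an_order_reserves_inventory():
        # Arrange: seed the inventory service directly through its own API.
        requests.put(f"{INVENTORY_URL}/stock/widget", json={"quantity": 5}).raise_for_status()

        # Act: place an order through the orders service, which should call
        # the inventory service to reserve stock.
        response = requests.post(f"{ORDERS_URL}/orders", json={"sku": "widget", "quantity": 2})
        assert response.status_code == 201

        # Assert: the interaction between the two systems had the expected
        # effect on the inventory service's state.
        stock = requests.get(f"{INVENTORY_URL}/stock/widget").json()
        assert stock["quantity"] == 3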

Functional Tests

Functional tests are blackbox tests of a complete live system in an environment that is reasonably production-like. These tests will have a high level of dependence on external systems, but these dependencies can and should be limited and managed. The defects these tests detect will be related to the requirements, and are usually caused by a misunderstanding of what was required, or assumptions that were not well-known or well-communicated. These tests will also be sensitive to any defect that was missed by any other kind of test, but will not be very good at identifying root cause. Functional tests should not be created by the developers who built the software in the first place. These tests can be automated, but the numerous and possibly hidden dependencies can make them unreliable. Thus some discretion is warranted on which ones should be automated.
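
When these are automated, they tend to look something like the following sketch, which drives a browser with Selenium. The URL, element locators, and credentials are all hypothetical; the point is that the test knows only what a user would know:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    def test_user_can_log_in():
        # The test drives the real UI of a complete running system and has
        # no access to internals beyond what a user sees.
        driver = webdriver.Chrome()
        try:
            driver.get("https://staging.example.com/login")  # hypothetical environment
            driver.find_element(By.ID, "username").send_keys("test-user")
            driver.find_element(By.ID, "password").send_keys("not-a-real-password")
            driver.find_element(By.ID, "submit").click()

            # Success is judged from the outside, the same way a user would judge it.
            assert "Dashboard" in driver.title
        finally:
            driver.quit()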

The nice thing about standards is that there are so many to choose from

You'll notice I didn't include component tests or service tests. I actually left out a lot of other kinds of tests, but those are the two I made a point to exclude. Component tests are unit tests. The difference is a matter of scale, rather than type. If you're a developer talking to another developer, then it may be useful to distinguish between a small and a large test. For a general audience, though, all tests that require developers to create them are "unit tests", and adding more terms confuses the issue. Similarly, service tests are functional tests. They target a service interface instead of a user interface, but the nature of what the test requires and what it reveals is the same. If you're a test engineer talking to another test engineer about how to build or execute tests against a service, then the term may be valuable. But again, to a general audience, the extra term is confusing. By similar logic, I specify "System Integration" instead of just integration, because if different audiences are free to assume different contexts for the integration, then the term loses its usefulness. System Integration is a term that ISTQB uses to distinguish the integration of complete apps or systems from the integration of subsystems within a self-contained application.
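
For comparison with the browser-driven sketch above, a "service test" in this sense is just a functional test pointed at a service interface rather than a user interface. Again, the endpoint and payload are made up for illustration:

    import requests

    BASE_URL = "https://staging.example.com/api"  # hypothetical environment

    def test_login_endpoint_issues_a_token():
        # Same blackbox posture as the UI test above, just aimed at the
        # service interface instead of the user interface.
        response = requests.post(
            f"{BASE_URL}/login",
            json={"username": "test-user", "password": "not-a-real-password"},
        )
        assert response.status_code == 200
        assert "token" in response.json()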


More definitions

Terms I used above but didn't define:

  • Whitebox tests are ones which have full knowledge of the design and implementation of the system. By way of analogy: you're able to examine all of the parts and the way they fit together. A strict definition would exclude any test which requires the system to actually be operating. Whether traditional unit tests are "operating" the system is debatable.
  • Graybox tests are ones which have some knowledge of the design and implementation of a system, as well as some ability to inspect the state of the system while it's running. In practical terms, if your test involves attaching a debugger to something, you can probably think of it as graybox.
  • Blackbox tests only have the running system to interact with. The system may be configured in a way that facilitates testing, but it does not expose any special access to its internal state.