Chaos Engineering
Chaos Engineering is a test-in-production strategy that's essential to distributed-architecture development, particularly microservices, but it's invaluable even if your system is a single monolith or client application. It's used most famously at Amazon and Netflix, but it applies everywhere.
I first saw a sort of chaos testing decades ago, when dinosaurs roamed the Earth. We were working with agility, even back then. Our system grew incrementally and was deployed into the real production environment every few days. Every time we deployed, my coworker Larry would walk over to the running system and try to break it. He’d type ridiculous inputs, enter things in random order, and use insanely long strings and numbers. He wrote programs that sent vast numbers of random events into the production system at unpredictable intervals. The system was partially robotic, so we’d do similar things with the physical hardware: everything from power fluctuations at critical times to deliberately breaking parts of the machine (usually by installing known-to-be-defective parts or simply unplugging things while the machine worked). I once tossed a teddy bear into the works to see how the system handled unexpected travel limits. Chaos. (The bear was never the same.)
Larry always found problems.
We (and many others) called this monkey testing, a reference to the monkeys in Gulliver’s Travels that type random characters from which books emerge (they’re not actually in the book, by the way; Swift uses students 😄). Netflix uses the term “monkey” as well, I assume for the same reason.
In addition to finding bugs, this approach led us to a significant change in how we worked. We started writing code assuming that it would fail. We wanted it Larry-proof. We never assumed that anything would just work, and we wrote the code so that when (not if) it failed, it failed gracefully. This way of working became habit, integral to development, not something we added after the fact. You cannot add reliability and fault tolerance with a ticket that says “Make it reliable and fault tolerant.” The thinking is similar to Deming’s “Inspection is too late. The quality, good or bad, is already in the product.”
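Here's a minimal sketch of what that habit looks like in code. The function and the limit are hypothetical, not from our actual system; the point is that the code assumes its input is hostile, enforces its own limits, and rejects garbage explicitly instead of limping along.

```python
# A sketch of "Larry-proof" input handling. The names and the limit are
# invented for illustration; the habit is what matters: validate
# everything and fail loudly and early, never mysteriously.

MAX_NAME_LEN = 256  # assumed limit; the important thing is that one exists

def rename_part(raw_name: str) -> str:
    """Validate a rename request instead of trusting it."""
    name = (raw_name or "").strip()   # tolerate None and whitespace junk
    if not name:
        raise ValueError("rename rejected: empty name")
    if len(name) > MAX_NAME_LEN:      # Larry loves insanely long strings
        raise ValueError(f"rename rejected: name longer than {MAX_NAME_LEN}")
    return name  # only now is the input safe to act on
```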
Nowadays, this sort of testing is essential for distributed systems. Consider Netflix, which has about 6,000 microservices, all running independently and communicating at unpredictable times under varying, unpredictable loads. To top that off, the topology is constantly changing as service instances replicate to handle load. The system’s runtime behavior is nondeterministic. There is no way to test that with typical static testing before deployment (though they do that too, of course).
Netflix solved the problem with the Chaos Monkey, which randomly kills microservices in the real production system (the one you actually use to watch movies). Other monkeys do things like inject random load by issuing large numbers of synthetic messages. The system is heavily monitored, of course, so if performance starts to degrade, the monkeys are killed immediately. All of this happens with millions of users connected, not in some mockup or test environment. You can’t simulate a production system of that size.
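Here's a toy sketch of the idea (this is not Netflix's actual code). The list_instances, kill, and error_rate hooks are hypothetical stand-ins for your own orchestration and monitoring layers; the essential part is that monitoring acts as the kill switch for the monkey itself.

```python
import random
import time

# A toy Chaos Monkey loop. The hooks passed in are hypothetical:
#   list_instances() -> list of running service instances
#   kill(instance)   -> terminate one instance, in production
#   error_rate()     -> current fraction of failing requests, from monitoring

ERROR_BUDGET = 0.01  # assumed threshold: stop chaos if >1% of requests fail

def chaos_monkey(list_instances, kill, error_rate):
    while error_rate() < ERROR_BUDGET:            # monitoring is the kill switch
        victim = random.choice(list_instances())  # pick a random instance
        kill(victim)                              # kill it and see what breaks
        time.sleep(random.uniform(60, 600))       # wait a random interval
    # Performance degraded past the budget: stop injecting chaos
    # and let the system recover.
```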
Distributed systems are not the only place where chaos testing can work. For example, I do not put business logic in the UI. Consequently, the real work is done on the server in response to events coming in from the UI, and the server sends events in the other direction that trigger UI updates. That leaves things open for a Virtual Larry (or VL). When testing, I can replace the UI with a VL that emits events for the backend to catch, and which catches the UI-update events to verify that the right things happened. These synthetic events do the same sort of crazy stuff Larry used to do. Of course, you can also build a VL that tests in a more restrained fashion.
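Here's a minimal VL sketch, assuming a backend that receives UI events and publishes UI-update events. The send_event and await_ui_update hooks, and the event types, are invented for illustration.

```python
import random
import string

# A minimal Virtual Larry. send_event() and await_ui_update() are
# hypothetical stand-ins for however your backend receives UI events
# and publishes UI-update events; the event types are invented examples.

def larry_string(max_len=10_000):
    """Insanely long strings of random characters, just like Larry's."""
    return "".join(random.choices(string.printable,
                                  k=random.randint(0, max_len)))

def virtual_larry(send_event, await_ui_update, rounds=1_000):
    event_types = ["create", "rename", "delete", "move"]
    for _ in range(rounds):
        event = {"type": random.choice(event_types),  # events in random order...
                 "payload": larry_string()}           # ...with ridiculous inputs
        send_event(event)                             # emit toward the backend
        update = await_ui_update(timeout=5.0)         # catch the UI-update event
        assert update is not None, f"backend went silent after {event['type']}"
```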
A similar testing architecture is useful anywhere you have components with a hard boundary and a well-defined interface, whether that interface is made of APIs or messages. Test the component with a Virtual Larry that simulates the surrounding context gone mad.
So, when it comes to testing, at least, a little chaos is a good thing.