Alarm fatigue and ignorable warnings

We had a minor Major Incident^* today, and it was a nice little example of alarm fatigue.

A service — which was thankfully not in use yet — just up and stopped running in production. No sign of it where it used to be in Kubernetes or even in ArgoCD. Thankfully, when we spun up the major incident process, someone had an “oh… oh no” moment and realized that they’d done a terraform apply in what should’ve been an unrelated repo right around the time the incident started and proactively joined the call.

We carefully redeployed the impacted service, all was good, we spun down the incident.

It turned out that the Terraform in the unrelated repo had been copied, in part, from the service that disappeared. On investigation, we found that the state file location in this repo wasn’t updated after the code was copied, so when the Terraform in this repo was applied, it happily destroyed the resources in the state file that its own code didn’t declare. And this happens regularly enough here that we even have a bit in our Terraform CI that catches unexpected-looking state file paths and emits a warning at the top of the plan!
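For the curious, here is a minimal sketch of what a check like that might look like, in Python. This is not our actual CI code. The convention that a compliant state key starts with the repo name, the function name, and the example paths are all assumptions made up for illustration.

```python
import re
from typing import Optional

# Matches a `key = "..."` line as it would appear inside a Terraform
# backend block. Anchoring at line start avoids matching e.g. `kms_key`.
KEY_PATTERN = re.compile(r'^\s*key\s*=\s*"([^"]+)"', re.MULTILINE)


def state_key_warning(repo_name: str, backend_hcl: str) -> Optional[str]:
    """Return a warning if the state key looks like it belongs to another repo.

    Hypothetical convention: a compliant key starts with "<repo name>/".
    Returns None when the key looks fine.
    """
    match = KEY_PATTERN.search(backend_hcl)
    if match is None:
        return "no state key found in backend configuration"
    key = match.group(1)
    if not key.startswith(f"{repo_name}/"):
        return f"state key {key!r} does not start with {repo_name!r}/"
    return None
```

A check like this flags a copied backend block, but under the assumed convention it also flags every older repo whose perfectly correct key predates the convention, which is where the trouble starts.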

One thing that was a little surprising, though, is that a site reliability engineer from our Kubernetes team was pairing with the involved product engineer on the change that triggered the issue. That SRE has a lot of experience with Kubernetes and Terraform, and so would be less likely to point to the wrong state file, especially with a warning present.

But that warning is somewhat new, and we’ve only insisted on “compliant” state file paths since the warning was introduced. That in turn means that old repos often trigger the warning, even though their state file path is correct. And that in turn means that people who spend a lot of time working in established repos’ Terraform (like, say, a site reliability engineer who works with a lot of older repos) have learned that it’s safe to ignore the warning, because it usually is safe for them to ignore it when they’re doing their typical work.

But pairing with this other engineer was not their typical work.

We had decided a while ago that old repos which trigger the warning didn’t need to be updated, because we wanted to prevent a small change to an old, stable Terraform codebase from requiring the extra effort of a state move. I imagine whatever time we saved over that period was lost in one incident today. We haven’t had the incident review yet so I don’t know what we’ll do about that, but I won’t be surprised if we make it a blocking error requiring a state file move. Of course, moving state files around also introduces some risk! On the other hand, it’s probably automatable, except that then there’s a risk that the automation will get something wrong that we won’t notice until the next incident.

(I’m reminded of the primarily Japanese railway practice of point and call, where a railway operator is required to point at an indicator and say its status out loud. For example, we could decide that when we encounter that warning, we have to paste the state file location from the code into a GitHub comment, as a forcing function to make sure anyone who wants to bypass the warning verifies that it’s a false alarm. My current main use of point and call is to ensure the sunroof on my car is closed when I’m about to go into a car wash.)
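If we did go the point-and-call route, the bypass check could be tiny. A rough sketch in Python, again with invented names and paths: it assumes something else has already fetched the comment text and read the configured state key out of the code.

```python
def acknowledged(configured_key: str, comment_body: str) -> bool:
    """True only if some line of the comment is exactly the configured state key.

    Hypothetical helper: surrounding whitespace and Markdown backticks are
    stripped from each line, but otherwise the match must be exact, so
    "lgtm" or a vaguely similar path does not count.
    """
    if not configured_key:
        # Never treat an empty key as acknowledged by a blank line.
        return False
    lines = [line.strip().strip("`") for line in comment_body.splitlines()]
    return configured_key in lines
```

Requiring an exact match is the point: the engineer has to go look at the path in the code to produce the comment, which is the "point" half of point and call.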

There’s no blame on the product engineer and site reliability engineer here. We created a situation in which we encouraged ignoring a warning, and someone did. If anything, this low-impact incident was the best possible opportunity to discover that we were producing alarm fatigue, because it gave a clear demonstration of how things might go wrong. And the fact that everything was in Terraform in the first place made it easy to undo the change and redeploy the affected service.

^* By which I mean someone spun up our Major Incident process for help resolving a problem, even though in the end we understood the customer impact from the problem to be very low.