
Cloud Support & Maintenance: Five Strategies for Sleeping Through Saturdays

An on-call rotation that doesn't wake you up isn't luck. It's five specific strategies applied consistently. Here's what actually works.

John Lane 2023-11-09 6 min read

There are two kinds of on-call rotations. The first is the one where the pager goes off every weekend, the team dreads the schedule, and the senior engineer is quietly job-hunting. The second is the one where the pager rarely fires, the alerts that do fire are actually actionable, and the person on call catches up on reading. The second kind is not luck. It's the result of five specific strategies applied with discipline.

Here they are, in the order we roll them out when a new customer asks us to take over their cloud support and maintenance work.

1. Delete the alerts that don't matter

The first thing we do on a new engagement is look at the alert history. Most teams have hundreds of alerts that fire regularly, get acknowledged, and get ignored. This is called alert fatigue, and it's the leading cause of real incidents being missed. When the pager goes off 15 times a day, the 16th time — the one that matters — gets swiped away with everything else.

The fix is ruthless. Every alert has to answer two questions: "is this actually broken?" and "can a human do something about it right now?" If the answer to either is no, the alert gets deleted, converted to a dashboard, or rewritten with better thresholds.
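A rough way to find the candidates is to walk the alert history and look for alerts that fire constantly but almost never lead to anyone doing anything. A minimal sketch, assuming the history can be exported to a CSV; the file name and column names here are made up for the example:

```python
# Sketch: flag alerts that fire often but almost never lead to action.
# Assumes an exported alert history CSV with columns
#   alert_name, fired_at, led_to_action   (layout is hypothetical).
import csv
from collections import Counter

FIRE_THRESHOLD = 30       # roughly once a day over a month
ACTION_RATE_CUTOFF = 0.05  # acted on in fewer than 1-in-20 firings

fires = Counter()
actions = Counter()

with open("alert_history_30d.csv", newline="") as f:
    for row in csv.DictReader(f):
        name = row["alert_name"]
        fires[name] += 1
        if row["led_to_action"].lower() == "true":
            actions[name] += 1

print("Candidates to delete, convert to a dashboard, or re-threshold:")
for name, count in fires.most_common():
    if count >= FIRE_THRESHOLD and actions[name] / count <= ACTION_RATE_CUTOFF:
        print(f"  {name}: fired {count}x, acted on {actions[name]}x")
```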

In practice we usually delete 40 to 60 percent of a customer's alerts in the first week. Nothing bad happens. What happens is that the remaining alerts start to mean something again. The on-call engineer stops tuning out the pager, and when a real problem hits, they notice.

The rule is that every alert should correspond to a runbook, and every runbook should have been tested at least once. If you can't write the runbook, you don't understand the alert well enough to wake someone up for it.

2. Automate the boring fixes

The second wave of improvement comes from noticing the patterns in the alerts that remain. Certain things break the same way over and over: a disk fills up, a process runs out of memory, a certificate is about to expire, a cache gets stale. These are not incidents. They're maintenance tasks that happen to fire alerts.

Every time we see the same fix applied more than twice, we automate it. A disk that keeps filling up gets a job that trims old logs. An expiring cert gets renewed automatically. A memory leak in a service that can't be fixed soon gets a scheduled auto-restart. Each of these takes maybe two hours to set up and eliminates a pager alert forever.
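As one concrete example of the pattern, here's a minimal sketch of the "disk fills up" fix: a script meant to run on a schedule (cron, a systemd timer, a Kubernetes CronJob) that trims old logs once usage crosses a threshold. The directory, threshold, and retention period are placeholders, not values we'd recommend for any particular system.

```python
# Sketch: trim old logs when disk usage crosses a threshold.
# Meant to run on a schedule; the path and limits are placeholders.
import os
import shutil
import time
from pathlib import Path

LOG_DIR = Path("/var/log/myapp")   # hypothetical log directory
USAGE_LIMIT = 0.80                 # start trimming at 80% full
MAX_AGE_DAYS = 14                  # never delete anything newer than this

def disk_usage_fraction(path: Path) -> float:
    total, used, _free = shutil.disk_usage(path)
    return used / total

if disk_usage_fraction(LOG_DIR) >= USAGE_LIMIT:
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    # Walk the oldest files first, delete anything older than the cutoff,
    # and stop once usage drops back under the limit.
    for log_file in sorted(LOG_DIR.glob("*.log*"), key=os.path.getmtime):
        if os.path.getmtime(log_file) < cutoff:
            log_file.unlink()
        if disk_usage_fraction(LOG_DIR) < USAGE_LIMIT:
            break
```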

Six months of this, and the on-call rotation starts to look very different. Most of what used to wake people up no longer exists as an alert, because the underlying problem resolves itself.

3. Build the patch window into the business calendar

Cloud infrastructure still needs patching. Operating systems, Kubernetes versions, managed database upgrades, certificate rotations — all of it has to happen, and all of it carries some risk. The strategy that works is to make the patch window boring and predictable, not clever.

We run a monthly maintenance window, always the same day of the month, always the same time, always announced in advance. Nothing important ships to production during that window. Non-production environments get patched first, then staging, then production. The patches that require downtime happen during the window; the ones that can be done live happen any time.
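The phasing itself doesn't need to be clever either. Here's a sketch of the ordering, assuming a patch step and a health check exist for each environment; the script names are placeholders for whatever the stack actually uses (an Ansible playbook, a pipeline trigger, a managed-upgrade API call).

```python
# Sketch: phased monthly patch run -- non-prod, then staging, then prod.
# The patch and health-check commands are placeholders.
import subprocess
import sys

ENVIRONMENTS = ["dev", "staging", "production"]  # patched strictly in this order

def run(cmd: list[str]) -> None:
    print(f"$ {' '.join(cmd)}")
    subprocess.run(cmd, check=True)  # stop the whole run if a patch step fails

for env in ENVIRONMENTS:
    run(["./apply-patches.sh", "--env", env])
    health = subprocess.run(["./health-check.sh", "--env", env])
    if health.returncode != 0:
        print(f"Health check failed in {env}; stopping before the next tier.")
        sys.exit(1)
```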

This sounds like ITIL-style process. It isn't. The difference is that it's two hours a month, it runs by itself because it's automated, and it prevents the alternative: a panicked emergency patch at 11pm on a Wednesday because the latest critical CVE just dropped.

Organizations that skip regular patching always end up doing panicked patching later. The panicked version is 10x more disruptive than the scheduled one.

4. Test your backups like you test your code

The second most common source of weekend pages we see, after alert fatigue, is data loss: a corrupted database, a deleted table, an accidentally dropped column. The fix for this class of problem is boring but effective: tested backups.

"Tested" is the operative word. Most organizations have backups. Very few have recently restored from one. The first time you try to restore a 2TB database from an S3 snapshot and discover it takes six hours and the process nobody has run in 18 months doesn't work anymore is a bad day.

We run quarterly restore drills for every customer. The process is: pick a random backup, spin up a clean environment, restore into it, run a data integrity check, then tear it down. If anything fails, we fix the process and run the drill again. This sounds like overkill until the day a customer needs to restore for real and the process just works.
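The drill itself can be a short script wrapped around whatever backup tooling the customer already runs. A hedged skeleton, with every helper command a placeholder:

```python
# Sketch: quarterly restore drill -- restore a random backup into a scratch
# environment, verify it, tear it down. All helper commands are placeholders.
import random
import subprocess

def sh(*cmd: str) -> str:
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

backups = sh("./list-backups.sh").splitlines()
backup_id = random.choice(backups)

sh("./create-scratch-env.sh", "--name", "restore-drill")
try:
    sh("./restore-backup.sh", "--backup", backup_id, "--target", "restore-drill")
    # Integrity check: row counts, checksums, a handful of known records --
    # whatever proves the data actually came back.
    sh("./verify-data.sh", "--target", "restore-drill")
    print(f"Restore drill passed for backup {backup_id}")
finally:
    sh("./teardown-env.sh", "--name", "restore-drill")
```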

The other thing quarterly drills catch is backup drift. New databases get created and nobody adds them to the backup job. Retention policies get tightened and nobody notices. Encryption keys rotate and the old backups become unreadable. None of this shows up until you try to use the backup. Drills surface it early.
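Drift is also cheap to check for between drills: compare the list of databases that exist today against the list the backup job actually covers. A sketch, assuming both inventories can be pulled from somewhere; the listing commands are placeholders:

```python
# Sketch: catch backup drift -- databases that exist but aren't backed up.
# Both listing commands are placeholders for the real inventory sources.
import subprocess

def listing(cmd: list[str]) -> set[str]:
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return set(out.split())

live = listing(["./list-databases.sh"])          # everything that exists today
covered = listing(["./list-backup-targets.sh"])  # everything the backup job knows about

missing = live - covered
if missing:
    print("Not covered by the backup job:", ", ".join(sorted(missing)))
else:
    print("Backup coverage matches the live inventory.")
```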

5. Write things down and keep them current

The last strategy is the least glamorous and the one that most directly determines whether your Saturday gets interrupted: maintain a runbook library that matches reality.

Every on-call handoff should include an up-to-date runbook for every service in scope. Every new service that goes to production should have its runbook written before it goes live — not as a best practice, as a deploy blocker. Every incident should result in a runbook update, either to fix an outdated step or to add a new scenario.
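One way to make "runbook before go-live" a real deploy blocker rather than a best practice is a trivial CI check. This sketch assumes runbooks live in the repository as runbooks/&lt;service&gt;.md, which is a made-up convention for the example, not something every team does.

```python
# Sketch: CI gate that fails the deploy if the service has no runbook.
# Assumes runbooks are markdown files at runbooks/<service>.md -- a
# hypothetical convention used only for this example.
import sys
from pathlib import Path

service = sys.argv[1] if len(sys.argv) > 1 else ""
if not service:
    sys.exit("usage: check_runbook.py <service-name>")

runbook = Path("runbooks") / f"{service}.md"
if not runbook.exists() or runbook.stat().st_size == 0:
    sys.exit(f"Deploy blocked: no runbook at {runbook}. Write it before go-live.")

print(f"Runbook found for {service}: {runbook}")
```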

This is the work engineers hate most and the work that pays off most over a year. When the pager goes off at 2am and the person answering it is the junior engineer who joined two months ago, the difference between a 10-minute resolution and a 2-hour resolution is whether the runbook tells them exactly what to do.

A good runbook answers: what does this alert mean, what are the first three things to check, what are the common causes, what commands fix them, and who to escalate to if none of the above works. Every time we onboard a customer we find runbooks that are out of date, generic, or missing entirely. Every time we leave a customer, those same runbooks exist and have been tested against real incidents.

What this looks like when it's working

A team running all five of these strategies has an on-call rotation that gets paged maybe once or twice a week, with most of those alerts resolved in under 15 minutes by following a known procedure. Patch windows happen without drama. Backups are known-good. The team ships features during the week and takes the weekend off. When a real incident hits — and they still do — the response is calm because the people involved have practiced the moves.

A team running none of these strategies has a pager that goes off every night, an on-call rotation nobody wants to be on, and a slow bleed of senior engineers to companies that run their operations better.

Three Takeaways

  1. Alert fatigue is the root cause of missed incidents. Delete alerts aggressively.
  2. Automate the boring fixes. Every repeated pager alert is a to-do item, not a lifestyle.
  3. Test your backups and keep your runbooks current. These two habits separate teams that sleep from teams that don't.

Talk with us about your infrastructure

Schedule a consultation with a solutions architect.
