High Availability on Cloud: What 99.99% Actually Costs You
High availability is an engineering budget decision, not a checkbox — here is what each extra nine actually costs in cloud architecture.

"We need high availability" is a sentence that has launched more overengineered cloud architectures than any other. The problem is not that HA is bad. The problem is that it is almost always specified in units that have no engineering meaning. "Four nines" is not a requirement — it is a number on a slide. The real question is: for which workload, measured how, over what time window, and who is paying for the complexity? Here is how we actually scope HA for customers, and what each additional nine of availability really costs.
The Nines, Decoded
Availability targets look clean on paper and translate into uncomfortable budgets on implementation. Here is the math that usually goes unspoken.
- 99% (two nines) — 3.65 days of downtime per year. This is what a single-instance app in a single availability zone gets you if nothing goes catastrophically wrong. Basically free.
- 99.9% (three nines) — 8.76 hours of downtime per year. Achievable with a single multi-AZ deployment, automated health checks, and reasonable incident response. Modest premium.
- 99.99% (four nines) — 52.6 minutes per year. Requires multi-AZ active-active, automated failover tested regularly, and a team that can detect and mitigate within minutes. Significant premium.
- 99.999% (five nines) — 5.26 minutes per year. Requires multi-region active-active, consensus storage, chaos engineering as a discipline, and a 24x7 on-call rotation staffed by engineers who know the system cold. Enormous premium.
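The arithmetic behind those downtime budgets is worth sanity-checking yourself. A minimal sketch in plain Python, nothing vendor-specific:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability)

for target in (0.99, 0.999, 0.9999, 0.99999):
    minutes = downtime_budget_minutes(target)
    print(f"{target:.3%}: {minutes:,.1f} min/year ({minutes / 60:.2f} hours)")

# 99.000%: 5,256.0 min/year (87.60 hours), roughly 3.65 days
# 99.900%: 525.6 min/year (8.76 hours)
# 99.990%: 52.6 min/year (0.88 hours)
# 99.999%: 5.3 min/year (0.09 hours)
```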
Most businesses asking for four nines would be happier and richer with three and a well-practiced recovery plan. Most businesses asking for five nines have not done the math on what it costs to engineer a system that actually delivers it.
The First Takeaway: Availability Is Not Measured the Way You Think
The most common HA mistake is measuring the wrong thing. Your SLO is typically something like "the API returns a 2xx response within 500ms." Your monitoring measures whether the load balancer returns a 2xx within 500ms. These are not the same thing.
The synthetic check problem
Most HA dashboards are driven by synthetic checks from a single monitoring vendor. The check hits a health endpoint every 30 seconds from three cloud regions. If all three regions see a 200, the system is "up." What the check does not see: a database primary that is returning stale reads, an authentication path that works for 90 percent of users but not the 10 percent in a broken shard, a feature flag that silently disabled your checkout page.
The fix is to instrument for user-perceived availability: sample real user sessions, measure error rates and latency percentiles from the client side, and alert on meaningful statistical deviations rather than binary up-or-down states. The first time a team does this, they are almost always shocked at how different their real availability is from their dashboard.
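As a concrete illustration, here is a minimal sketch of a user-perceived availability SLI, assuming you already collect per-request records (status code, latency) from real client traffic. The record shape and the 500ms threshold are illustrative, not any particular vendor's schema:

```python
from dataclasses import dataclass

@dataclass
class RequestSample:
    status_code: int
    latency_ms: float

def is_good_event(sample: RequestSample, latency_slo_ms: float = 500.0) -> bool:
    # "Good" mirrors the SLO as the user experiences it: a 2xx within the
    # latency target. A 200 that took four seconds still counts against you.
    return 200 <= sample.status_code < 300 and sample.latency_ms <= latency_slo_ms

def user_perceived_availability(samples: list[RequestSample]) -> float:
    # The denominator is real user traffic, not synthetic probes, so a broken
    # shard or a silently disabled checkout page shows up in the number.
    if not samples:
        return 1.0
    return sum(is_good_event(s) for s in samples) / len(samples)
```

The important design choice is the denominator: real user requests rather than synthetic probes, which is exactly what the health-endpoint check cannot see.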
The SLO math
The other measurement trap: computing availability over a calendar month hides the shape of your incidents. A single one-hour outage on the first of the month produces the same availability percentage as two 30-minute outages spread across the month, but the first incident likely breached customer SLAs while the second probably did not. SLIs, SLOs, and error budgets (as described in Google's SRE book and now standard practice) are the right framework. If your team is still reporting "four nines this month" as a headline, you have not had the right conversation yet.
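A minimal error-budget sketch in that framing, assuming a rolling window of event counts like those above; the numbers are illustrative:

```python
def error_budget_report(good_events: int, total_events: int, slo: float = 0.9999) -> dict:
    sli = good_events / total_events       # fraction of events that met the SLO
    budget = 1 - slo                       # fraction allowed to fail over the window
    burned = 1 - sli                       # fraction that actually failed
    return {"sli": sli, "budget_remaining": 1 - burned / budget}

# One one-hour outage and two 30-minute outages burn the same budget over the
# window, but they breach customer-facing SLAs very differently.
print(error_budget_report(good_events=9_998_700, total_events=10_000_000, slo=0.9999))
# {'sli': 0.99987, 'budget_remaining': -0.2999...}  i.e. budget overspent by ~30%
```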
The Second Takeaway: Multi-Region Is Rarely the Cheapest Path to Reliability
Multi-region active-active sounds like the gold standard. It is also the most expensive architecture to build correctly, and many teams who commit to it end up with a system less reliable than a well-operated single-region deployment. The reasons are instructive.
The consistency tax
Active-active across regions forces you to confront data consistency. Either you accept eventual consistency (which many applications cannot) and spend engineering effort resolving conflicts, or you use a globally consistent store (Spanner, CockroachDB, Cosmos DB strong consistency mode) and pay 5 to 20x the storage cost plus higher write latency. There is no third option that does not eventually bite you.
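To make "resolving conflicts" concrete, here is a deliberately naive last-write-wins merge, the kind of conflict resolution that eventual consistency pushes into application code; the field name is illustrative:

```python
def last_write_wins(region_a: dict, region_b: dict) -> dict:
    # Whichever region saw the later timestamp wins wholesale. If both regions
    # accepted writes to different fields of the same record during a network
    # partition, one region's write is silently discarded; the real engineering
    # effort is in noticing and handling exactly these cases, per record type.
    return region_a if region_a["updated_at"] >= region_b["updated_at"] else region_b
```

The hidden cost is not the five lines; it is deciding, for every record type, whether silently discarding one region's write is acceptable.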
The failure mode multiplication
Every additional region is another thing that can fail. We have seen deployments where a badly tested failover procedure caused a longer outage than the regional event it was supposed to mitigate. Chaos engineering exists because active-active failover must be exercised routinely or it will not work the one time it matters.
What we usually recommend instead
For most customers, a single region with multi-AZ deployment, well-tuned auto-scaling, and a clean disaster recovery plan to a warm standby in a second region hits 99.95 to 99.99 percent availability at a fraction of the cost and complexity of active-active. The failover is not instantaneous — it is minutes, not seconds — but the total annual downtime is often lower because the simpler architecture has fewer failure modes to begin with.
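A back-of-the-envelope comparison shows why the simpler architecture can win on delivered availability. The incident frequencies and failover times below are assumptions for illustration, not measured figures:

```python
MINUTES_PER_YEAR = 365 * 24 * 60

def expected_downtime(incidents_per_year: float, minutes_per_incident: float) -> float:
    return incidents_per_year * minutes_per_incident

# Warm standby: occasional in-region blips absorbed by multi-AZ, plus a rare
# regional failover that costs minutes of detection and traffic cutover.
warm_standby = (
    expected_downtime(incidents_per_year=4, minutes_per_incident=5)
    + expected_downtime(incidents_per_year=0.2, minutes_per_incident=30)
)

# Active-active: near-zero downtime on a regional failure, but more moving
# parts tend to mean more self-inflicted incidents (failed failover drills,
# cross-region config drift).
active_active = expected_downtime(incidents_per_year=6, minutes_per_incident=8)

for name, downtime in (("warm standby", warm_standby), ("active-active", active_active)):
    print(f"{name}: {downtime:.0f} min/year, {1 - downtime / MINUTES_PER_YEAR:.4%} available")
# warm standby: 26 min/year, 99.9951% available
# active-active: 48 min/year, 99.9909% available
```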
The customers who genuinely need sub-minute RTO across a region failure are typically regulated financial services, certain healthcare workloads, and real-time bidding systems. Everyone else is buying complexity they will not operate well.
The Third Takeaway: People and Process Are the Real Bottleneck
The math works out roughly like this: the marginal cost of moving from three nines to four nines of infrastructure is real but bounded. The marginal cost of moving from three nines to four nines of actual delivered availability is dominated by humans. Detection, triage, escalation, on-call staffing, runbook quality, chaos testing cadence — these are where the additional nines live.
What a real four-nines operation looks like
- Detection in under two minutes. Automated health checks that catch real user impact, and alerting thresholds tuned to page a human before customers notice (see the burn-rate sketch after this list).
- Runbooks for every foreseeable failure mode. A runbook is a document that tells a tired engineer at 3am exactly what to do. If the runbook is not written, you do not have four nines.
- On-call rotation with at least five engineers per tier. Smaller rotations burn out and lose institutional knowledge. Your people cost per nine goes up non-linearly.
- Quarterly game days. Actual practice failing over, actual practice restoring from backup, actual practice responding to a simulated outage. Teams that do not practice do not execute well in real incidents.
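For the detection bullet above, here is a minimal burn-rate alerting sketch in the spirit of the multi-window burn-rate alerts from Google's SRE workbook. The window sizes and the 14x threshold are illustrative assumptions and should be tuned to your own SLO and paging tolerance:

```python
def burn_rate(error_rate: float, slo: float = 0.9999) -> float:
    """How many times faster than sustainable the error budget is burning."""
    return error_rate / (1 - slo)

def should_page(error_rate_5m: float, error_rate_1h: float, slo: float = 0.9999) -> bool:
    # Page only when both a short and a long window show a fast burn: the short
    # window gives detection speed, the long window suppresses one-off blips.
    return burn_rate(error_rate_5m, slo) > 14 and burn_rate(error_rate_1h, slo) > 14

# Example: 0.2% of requests failing over the last 5 minutes and the last hour
# burns a 99.99% budget at 20x the sustainable rate, so page immediately.
print(should_page(error_rate_5m=0.002, error_rate_1h=0.002))  # True
```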
The dirty secret of cloud HA: the infrastructure part is commoditized. You can buy multi-AZ, multi-region, active-active infrastructure off the shelf. What you cannot buy is the operational maturity to run it well. That takes 12 to 24 months of engineering investment and it will not appear on any vendor's price sheet.
What We Actually Tell Customers
The framework we use: start with the business impact of an hour of downtime, translate that into a dollar figure, and compare it to the annual engineering cost of each availability tier. Most mid-market customers land at three nines with a fast recovery story, not four or five. The exceptions are in regulated industries or high-transaction-value systems where an hour of downtime has a six-figure impact.
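Sketching that framework as arithmetic, with every dollar figure and downtime number below a placeholder assumption rather than a benchmark:

```python
COST_OF_DOWNTIME_PER_HOUR = 20_000  # assumption: business impact of one hour down

# Downtime budgets from the table above; engineering costs are placeholders for
# the all-in annual spend (infrastructure plus the people to run it).
tiers = {
    "three nines": {"downtime_minutes": 525.6, "engineering_cost": 150_000},
    "four nines":  {"downtime_minutes": 52.6,  "engineering_cost": 600_000},
    "five nines":  {"downtime_minutes": 5.3,   "engineering_cost": 2_500_000},
}

for name, tier in tiers.items():
    downtime_cost = tier["downtime_minutes"] / 60 * COST_OF_DOWNTIME_PER_HOUR
    total = downtime_cost + tier["engineering_cost"]
    print(f"{name}: ${total:,.0f}/year all-in")
# three nines: $325,200/year all-in
# four nines: $617,533/year all-in
# five nines: $2,501,767/year all-in
```

With these assumptions, three nines plus a fast recovery story is the cheapest total-cost answer, and the calculus only flips when an hour of downtime carries a six-figure impact.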
When customers ask for four nines we ask two questions: can you name the last three outages at your last provider, and what did the post-mortems identify as the root cause? If the answers are vague, the conversation we need to have is not about adding more regions — it is about building the observability and response capabilities that make any architecture operable.
Three Takeaways
- Measure user-perceived availability, not synthetic uptime. The gap between the two is where real incidents hide.
- Multi-region active-active is expensive and often less reliable than a simpler architecture operated well. Only commit to it when the business case genuinely requires it.
- Each additional nine is paid for in people, not in infrastructure. Operational maturity is the binding constraint, and it takes years to build.
Talk with us about your infrastructure
Schedule a consultation with a solutions architect.