Cloud Reliability in 2024: What's Actually New, What's Just Marketing
An honest look at the reliability claims hyperscalers are making this year — which ones change how you should build, and which ones you can ignore.

The cloud reliability marketing cycle has a pattern. Every year there's a new availability zone design, a new managed service with "four nines built in," and a new story about chaos engineering that makes it sound like outages are solved. Then a major region goes down for four hours and everyone remembers that reliability is a property you build, not a checkbox you buy.
Here's what's actually worth paying attention to in 2024, from the perspective of people who have to clean up the outages.
1. Zonal Isolation Is Finally Real (Mostly)
For a long time, "multi-AZ" was a thin veneer. Control planes crossed zones. Shared regional services had hidden dependencies. When us-east-1 had its famous outages, being in "multiple availability zones" sometimes didn't help because the failure was in a regional control plane you didn't know you depended on.
This has genuinely improved. AWS has re-architected several regional services to be zonal. Azure has pushed availability zone coverage deeper into managed services. GCP has always had the cleanest zonal story and continues to widen its lead.
What this means for you: multi-AZ is now worth the cost for genuinely critical workloads, where it used to be partially theater. But the reliability boundary is still the region for control-plane operations (IAM changes, new resource provisioning). If your incident response depends on being able to spin up new capacity during a regional control-plane incident, you have a latent failure mode.
2. The Managed Service Reliability Ceiling
Managed databases, queues, and object stores now offer SLAs in the 99.99 percent range. That sounds impressive until you remember that 99.99 percent is still about 52 minutes of outage per year, and those minutes tend to happen all at once. If your SLA is 99.9 percent (roughly 8.8 hours per year) and your dependency is 99.99 percent, you're fine. If your SLA is 99.99 and you depend on three managed services each at 99.99, the math stops being flattering.
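To make the arithmetic concrete, here's a quick back-of-the-envelope sketch in plain Python of what those numbers mean, and what happens when you stack serial dependencies:

```python
# Back-of-the-envelope SLA math: yearly downtime budget implied by an
# availability target, and the composite availability of a chain of
# dependencies that all have to be up at once.

MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes(availability: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return (1 - availability) * MINUTES_PER_YEAR

def composite(*availabilities: float) -> float:
    """Availability of a serial chain: every dependency must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

print(f"99.99% alone: {downtime_minutes(0.9999):.0f} min/year")    # ~53 minutes
print(f"99.9%  alone: {downtime_minutes(0.999):.0f} min/year")     # ~526 minutes (~8.8 h)

chain = composite(0.9999, 0.9999, 0.9999)
print(f"Three 99.99% dependencies in series: {chain:.6f} "
      f"(~{downtime_minutes(chain):.0f} min/year)")                # ~158 minutes
```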
The advancement worth noting: cross-region replication for managed databases has gotten good enough that it's the default path for new critical workloads, not an afterthought. Azure SQL hyperscale active geo-replication, DynamoDB global tables, and Spanner multi-region configurations are all mature enough to rely on. They were not, five years ago.
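To give a sense of how low the barrier has gotten, here's a hedged boto3 sketch of adding a replica region to an existing DynamoDB table. The table name and regions are placeholders, and it assumes the newer global tables version, where replicas are managed through update_table:

```python
import boto3

# Hedged sketch: promote an existing single-region DynamoDB table to a
# global table by adding a replica region. Assumes the current global
# tables version (2019.11.21). Table name and regions are placeholders,
# and the source table needs DynamoDB Streams enabled first.

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.update_table(
    TableName="orders",
    ReplicaUpdates=[
        {"Create": {"RegionName": "us-west-2"}},
    ],
)

# Replication setup is asynchronous; poll describe_table and wait for the
# new replica to report ACTIVE before relying on it for failover.
table = dynamodb.describe_table(TableName="orders")["Table"]
print(table.get("Replicas", []))
```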
3. Chaos Engineering as a Managed Service
AWS Fault Injection Simulator, Azure Chaos Studio, and SaaS tools like Gremlin have made it actually feasible to test your failure modes in production without writing your own orchestration. Five years ago, chaos engineering was a Netflix thing. Now it's something a four-person team can adopt in a week.
Worth the investment. The first time FIS drops an availability zone on your staging environment and the dependency graph reveals a service you forgot about, you'll know why.
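If you want a feel for what adopting it looks like, here's a hedged sketch of a game-day driver: it starts a pre-built FIS experiment template (say, one that disrupts a single availability zone in staging) and waits for it to finish. The template ID is a placeholder; the actions, targets, and stop conditions live in the template you define separately:

```python
import time
import boto3

# Hedged sketch of a chaos game-day driver: start a pre-built AWS FIS
# experiment template and poll until it finishes. The template ID is a
# placeholder for a template you've already created.

fis = boto3.client("fis", region_name="us-east-1")

experiment = fis.start_experiment(
    experimentTemplateId="EXT1234567890abc",      # placeholder template ID
    tags={"game-day": "2024-q3"},
)
experiment_id = experiment["experiment"]["id"]

# Wait for the experiment to leave its running states.
while True:
    status = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
    if status not in ("pending", "initiating", "running"):
        break
    time.sleep(30)

print(f"Experiment finished with status: {status}")
```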
4. The Control Plane Is Still a Single Point of Failure
Here's what the marketing doesn't say: every hyperscaler still has a global or near-global control plane for some operations. IAM, billing, certain DNS operations, account-level policy. When these fail, every region is affected simultaneously. The December 2021 AWS outage is the canonical example, but there have been smaller incidents since.
No architectural change you make in one cloud account fixes this. The only real mitigations are multi-cloud or cloud-plus-on-prem for the workloads where a two-hour outage is genuinely unacceptable. And multi-cloud has its own costs — you're trading a 2-hour-every-three-years outage for 10-20 percent higher steady-state complexity and cost. Only worth it for a small subset of workloads.
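A rough way to sanity-check whether a workload is in that subset, with deliberately made-up numbers you should replace with your own:

```python
# Illustrative break-even check for the multi-cloud question. Every input
# here is an assumption; plug in your own downtime cost and budget.

outage_hours_per_3_years = 2        # rough exposure to global control-plane outages
downtime_cost_per_hour = 50_000     # hypothetical cost of an hour of downtime
annual_infra_spend = 2_000_000      # hypothetical cloud + ops budget
multicloud_overhead = 0.15          # midpoint of the 10-20 percent overhead above

avoided_outage_cost = (outage_hours_per_3_years / 3) * downtime_cost_per_hour
overhead_cost = annual_infra_spend * multicloud_overhead

print(f"Outage cost avoided:  ~${avoided_outage_cost:,.0f}/year")   # ~$33,333
print(f"Multi-cloud overhead: ~${overhead_cost:,.0f}/year")         # ~$300,000
# With these inputs the overhead dwarfs the avoided outage cost, which is why
# multi-cloud only pays off when downtime is existential, not merely expensive.
```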
5. Region Pair Selection Matters More Than It Used To
Hyperscalers have started to publish "paired region" guidance. Azure has always been clear about this (East US / West US, North Europe / West Europe). AWS has gotten more explicit with regional services. GCP uses multi-region storage classes that span specific geographies.
The trap: picking a "close" secondary region that shares a failure domain with the primary. us-east-1 and us-east-2 are the easy pairing in every practical dimension (latency, replication cost, operational familiarity), but they still share vendor-level risk. For true DR you want geographic separation; for multi-AZ HA within a region, the close, cheap path is fine.
Decide which problem you're solving before you pick the region.
6. Immutable Infrastructure Is Finally the Default
Five years ago, making "immutable infrastructure" a goal was a multi-quarter project. Today, Terraform, Pulumi, and the various IaC tools are mature enough that rebuilding an environment from scratch in 15 minutes is the baseline, not the aspiration.
Why this matters for reliability: recovery time objectives (RTO) are a function of how long it takes to rebuild. When rebuilding is a shell script, RTO is hours. When it's IaC with tested modules, RTO is minutes. We've walked customers through DR exercises where the old process would have been "restore from backup, call Microsoft, pray" and the new process is "terraform apply in the DR region, restore database, done."
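What "tested modules" looks like in practice is mostly parameterizing the region, so the DR copy is a config change rather than new code. A minimal sketch using Pulumi's Python SDK (resource names and CIDRs are placeholders):

```python
import pulumi
import pulumi_aws as aws

# Minimal sketch of region-parameterized IaC. The deployment region comes
# from stack config, so standing the environment up in a DR region is a new
# stack with a different "region" value, not a rewrite.

config = pulumi.Config()
region = config.require("region")              # e.g. "us-east-1", or the DR region

provider = aws.Provider("regional", region=region)
opts = pulumi.ResourceOptions(provider=provider)

vpc = aws.ec2.Vpc("app-vpc", cidr_block="10.0.0.0/16", opts=opts)
subnet = aws.ec2.Subnet("app-subnet", vpc_id=vpc.id, cidr_block="10.0.1.0/24", opts=opts)
backups = aws.s3.Bucket("app-backups", opts=opts)

pulumi.export("vpc_id", vpc.id)
pulumi.export("backup_bucket", backups.id)
```

Standing up the DR copy is then a new stack, a new region value, and one deploy, which is exactly the kind of process you can actually test quarterly.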
The advancement here isn't a new product. It's that the tools are mature enough that doing it right is no longer expensive.
7. Observability Has Finally Eaten the Reliability Story
The biggest reliability improvement most teams can make this year is not a new service or a new architecture. It's better observability. OpenTelemetry has become the lingua franca for tracing. Managed Prometheus / Managed Grafana options exist in every cloud. Log aggregation is cheap and fast.
The connection to reliability: you can only operate what you can see. Teams with good tracing find problems in minutes that used to take days. Teams without it are still grepping through log files when the incident is three hours old.
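The setup cost is genuinely small now. A minimal tracing sketch with the OpenTelemetry Python SDK, exporting over OTLP to whichever managed backend you've picked (the collector endpoint and service name are placeholders):

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Minimal OpenTelemetry setup: batch spans and ship them over OTLP to a
# collector, which forwards to whatever managed backend you use. Endpoint
# and service name are placeholders.

provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("place-order") as span:
    span.set_attribute("order.id", "12345")
    # ... call the payment service, write to the database, etc.
```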
This is the one advancement that gives us the most leverage per dollar for customers. A month of telemetry work pays off forever.
What We Actually Recommend
For a customer with a real reliability requirement, meaning 99.9 percent availability or better and the ability to survive a single-region outage:
- Run active-active across two AZs within a region. Good default for most workloads; see the sketch after this list.
- Use a managed database with a cross-region read replica. Not for load, for failover.
- Keep your IaC clean enough to stand up in a second region within an hour. Test this quarterly.
- Instrument aggressively. OpenTelemetry + a managed backend. Budget for it.
- Run at least one chaos game day per quarter. Learn what breaks before production teaches you.
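For the first item, a hedged boto3 sketch of what active-active across two AZs means mechanically: an Auto Scaling group spread over subnets in two zones, registered with a load balancer target group. Names, subnet IDs, the launch template, and the target group are placeholders that must already exist:

```python
import boto3

# Hedged sketch: an Auto Scaling group spanning two availability zones,
# registered with a load balancer target group. All identifiers are
# placeholders for resources created elsewhere (in IaC, ideally).

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-active-active",
    LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},
    MinSize=2,
    MaxSize=6,
    DesiredCapacity=2,
    # One subnet per AZ; instances are balanced across both zones.
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222",
    TargetGroupARNs=[
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web/abc123"
    ],
    HealthCheckType="ELB",
    HealthCheckGracePeriod=120,
)
```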
This is not glamorous. It's also how sites stay up.
Three Takeaways
- Most "reliability advancements" this year are refinements to things that already existed. The story is maturity, not novelty.
- The control plane is still the ceiling. For the workloads where hours of downtime is existential, single-cloud reliability has a real limit.
- Observability is the reliability investment with the highest ROI. Spend the month. It pays for itself the first time an incident is 10 minutes instead of 10 hours.
Talk with us about your infrastructure
Schedule a consultation with a solutions architect.