
Reading Cloud SLAs: What the Numbers Actually Promise

An honest walkthrough of what hyperscaler SLAs actually guarantee, what the service credits are really worth, and the architectural work you still have to do yourself.

By John Lane, 2023-01-04 (5 min read)

Cloud SLAs are the most misread documents in IT procurement. Executives see "99.99 percent" and assume they bought four nines of uptime. They did not. They bought a service credit against a monthly bill if the specific service they signed up for misses a specific definition of availability measured in a specific way. Twenty-three years of running infrastructure has taught us that the gap between what customers think an SLA means and what it actually means is the single largest source of production surprise we encounter. Here is how to read them.

What Four Nines Is Really Worth

Four nines (99.99 percent) allows about 52.6 minutes of downtime per year, or roughly 4.4 minutes per month. This sounds tight and it is, but the question you have to ask is "whose downtime, measured how, starting when?" In almost every hyperscaler SLA, the clock only starts ticking when the provider declares an incident, the metric is averaged across customers or regions, and single-instance failures are explicitly excluded from the calculation.

Read that again. A single VM going down does not count toward your SLA. A single AZ going down usually does not count against a regional SLA. The SLA measures the health of the platform, not the health of your application.

For a single Azure VM, the SLA is 99.9 percent if you use premium storage, which is about 43 minutes of allowable downtime per month. For a multi-AZ VM set, it rises to 99.99 percent. For a true multi-region deployment, you can get to 99.995 percent or better, but only because you built the redundancy yourself. The cloud did not give you four nines. You bought three nines of platform and added the rest with architecture.
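The arithmetic behind those figures is easy to check. Here is a minimal sketch, assuming a 30-day month and, for the redundancy case, copies that fail independently of each other (real zones and regions share failure modes, so treat that result as an upper bound). The function names are ours, not anything from a provider SDK.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_pct: float, period_days: float = 30) -> float:
    """Minutes of downtime allowed in the period at the given availability."""
    return (1 - availability_pct / 100) * period_days * MINUTES_PER_DAY


def redundant_availability_pct(availability_pct: float, copies: int) -> float:
    """Availability of N copies when any single copy can serve traffic.

    Assumes the copies fail independently, which is optimistic.
    """
    failure = 1 - availability_pct / 100
    return (1 - failure ** copies) * 100


if __name__ == "__main__":
    for pct in (99.9, 99.99):
        print(
            f"{pct}%: {downtime_budget_minutes(pct):.1f} min/month, "
            f"{downtime_budget_minutes(pct, 365):.1f} min/year"
        )
    print(f"two independent 99.9% copies: {redundant_availability_pct(99.9, 2):.4f}%")
```

The second function is the reason redundancy is worth paying for: on paper, two independent 99.9 percent copies come out far better than either one alone. The paper number only holds if failover actually works when you need it.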

What the Service Credits Are Actually Worth

Here is the part that gets buried. If your provider misses the SLA, you are entitled to a service credit — usually 10 percent to 50 percent of the monthly bill for the affected service, depending on how badly they missed. That credit is not cash. It is a discount on next month's invoice for that same service. You cannot use it against other services. You cannot convert it to money. And in almost every contract we have read, the maximum credit is capped at 100 percent of the monthly fee for the service in question.

Which means the total financial liability the hyperscaler takes on for breaking the SLA is, at worst, one month's bill for the affected service. If an outage costs your business $400,000 in lost transactions and your monthly VM bill is $3,000, the most you will recover is $3,000 and you will have to file paperwork for it. The SLA is not an insurance policy. It is a performance commitment with a token forfeit attached.
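To put numbers on that gap, here is a rough sketch of how a tiered credit plays out. The tier boundaries in CREDIT_TIERS are hypothetical, chosen only to fall inside the 10 to 50 percent range described above. Your provider's schedule will differ, so substitute the real one from your contract.

```python
# Hypothetical schedule: (availability floor in percent, credit as a fraction
# of the monthly fee). Ordered from best availability to worst.
CREDIT_TIERS = [
    (99.99, 0.00),  # SLA met, no credit
    (99.0, 0.10),
    (95.0, 0.25),
    (0.0, 0.50),
]


def service_credit(measured_pct, monthly_fee, tiers=CREDIT_TIERS, cap_fraction=1.0):
    """Credit owed for one service in one month, capped at cap_fraction of the fee."""
    for floor_pct, credit_fraction in tiers:
        if measured_pct >= floor_pct:
            return monthly_fee * min(credit_fraction, cap_fraction)
    return 0.0


if __name__ == "__main__":
    monthly_fee = 3_000      # monthly VM bill from the example above
    business_loss = 400_000  # lost transactions from the example above

    credit = service_credit(measured_pct=99.5, monthly_fee=monthly_fee)
    best_case = monthly_fee * 1.0  # the 100 percent cap

    print(f"credit at 99.5% measured availability: ${credit:,.0f}")
    print(f"best-case recovery: ${best_case:,.0f} "
          f"({best_case / business_loss:.2%} of the loss)")
```

Even at the cap, the recovery is under one percent of the loss in this example.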

If you need financial protection against outages, you buy business interruption insurance from an actual insurer. The cloud SLA does not replace it.

The Exclusions That Matter

Every SLA has an exclusions section and the exclusions section is where the real contract lives. The usual exclusions:

  • Scheduled maintenance windows. These do not count as downtime, even if they take you offline.
  • Customer misconfiguration. If you locked yourself out of your own API with a bad IAM policy, that is not the provider's problem.
  • DDoS attacks and other "force majeure" events. These are excluded in most contracts even though they happen regularly.
  • Beta or preview services. Anything with "preview" or "beta" in the name usually has no SLA at all or an explicitly weaker one.
  • Regional failures caused by natural disasters. Your regional SLA assumes the region is up. If an earthquake takes out us-east-1, the credit probably does not apply.

The practical meaning of the exclusions is that the SLA only covers the unglamorous failures — a hypervisor crash, a control plane bug, a network fabric hiccup. Everything that tends to make the news is excluded.

Composite SLAs: The Multiplication Trap

If your application depends on three services with 99.9 percent SLAs each, your effective SLA is not 99.9 percent. It is 99.7 percent, because when every request needs all three and they fail independently, the availabilities multiply. Ten dependent services at 99.9 percent each give you 99.0 percent, or about 7 hours of allowable downtime per month. This is the math nobody does before they draw the architecture diagram.
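The multiplication takes a few lines to do. This sketch assumes every dependency sits on the request path and fails independently of the others. Correlated failures change the result, so treat it as a planning estimate rather than a guarantee.

```python
import math


def composite_availability_pct(dependency_pcts):
    """Availability of a chain in which every dependency must be up."""
    return math.prod(p / 100 for p in dependency_pcts) * 100


def monthly_downtime_hours(availability_pct, days=30):
    """Allowable downtime per month, in hours, at the given availability."""
    return (1 - availability_pct / 100) * days * 24


if __name__ == "__main__":
    for count in (1, 3, 10):
        composite = composite_availability_pct([99.9] * count)
        print(
            f"{count:>2} services at 99.9%: {composite:.2f}% "
            f"({monthly_downtime_hours(composite):.1f} h/month)"
        )
```

Swap in the actual SLA percentages for your own dependency list before you commit to a number.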

We've seen customers promise 99.95 percent to their own end users based on SLAs from a database, a message queue, an API gateway, a CDN, an auth provider, and a compute platform, each with its own 99.9 percent or 99.95 percent commitment. The composite was closer to 99.5 percent. They were going to miss their promise by design, and they did not know it until we drew the math on a whiteboard.

What to Actually Promise Your Customers

The honest number you can commit to is a clear step below the composite of your platform dependencies, and you should architect to beat it. If your cloud platform gives you 99.9 percent across the services you depend on, promise 99.5 percent to your customers and build toward 99.9 percent. That leaves room for your own bugs, your own deployment errors, and the outages the SLA does not cover.

If you need to promise better than that, the architectural cost is steep: multi-region active-active, independent control planes, data replication with conflict resolution, tested failover runbooks, and a DR drill calendar. These are not checkboxes. They are ongoing engineering commitments that consume real headcount.

How We Read SLAs Before Signing

Here is the short checklist we use when a customer asks us to evaluate a cloud SLA before they commit:

  1. What is the measurement window? Monthly is standard. Annual is rare and worse for the customer, because one bad month gets averaged out over the year.
  2. What counts as downtime? Read the definition. "Unable to process requests" is narrower than "degraded performance."
  3. What is excluded? Scheduled maintenance, beta services, customer fault. All of them.
  4. What is the credit cap? Almost always 100 percent of monthly fees for the affected service.
  5. How do you claim the credit? If you have to file a ticket within 30 days and provide logs, assume most customers never claim.
  6. What is the composite SLA across your dependencies? Do the multiplication. It is always lower than the individual numbers.
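The first item on that list is worth a quick worked example. The sketch below measures a single hypothetical 4-hour outage against a 99.9 percent target, first over a 30-day month and then over a 365-day year. The outage length and the target are illustrative assumptions, not figures from any particular contract.

```python
def availability_pct(downtime_minutes, period_days):
    """Measured availability over a period, given total downtime in minutes."""
    total_minutes = period_days * 24 * 60
    return (1 - downtime_minutes / total_minutes) * 100


if __name__ == "__main__":
    target_pct = 99.9
    outage_minutes = 4 * 60  # one hypothetical 4-hour outage

    for label, days in (("monthly", 30), ("annual", 365)):
        measured = availability_pct(outage_minutes, days)
        verdict = "breach" if measured < target_pct else "met"
        print(f"{label} window: {measured:.3f}% -> SLA {verdict}")
```

The same outage breaches the SLA under a monthly window and disappears under an annual one, which is why the window matters as much as the headline percentage.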

Three Takeaways

  1. An SLA is a refund policy, not an insurance policy. The maximum financial recovery is almost always capped at one month's bill for the affected service.
  2. Four nines from the platform does not give you four nines for your application. The composite SLA across dependencies is what matters, and it is almost always worse than any individual number.
  3. The exclusions section is the real contract. Scheduled maintenance, customer fault, and force majeure cover most of the outages that actually hurt.
