
Cloud Scalability Services: Five Patterns and Their Failure Modes

Autoscaling is not magic — it is five distinct patterns, each with specific failure modes that punish teams who don't understand them.

John Lane 2024-04-04 7 min read

"The cloud scales automatically" is one of those statements that is technically true and operationally misleading. Yes, cloud platforms provide scalability services. No, they do not magically make your application scale. They give you five or six specific patterns you can apply, each one solves a specific problem, each one has a specific failure mode, and choosing the wrong pattern for the workload is one of the fastest ways to spend a lot of money without improving anything.

Here are the five patterns we use regularly, what each one actually does, and the failure modes that will bite you if you do not pay attention.

1. Horizontal Autoscaling (Instance Count)

The classic pattern. You define a minimum and maximum number of instances, define a metric that drives scaling decisions — CPU, memory, request count, queue depth — and the platform adds or removes instances as the metric moves. EC2 Auto Scaling Groups, Azure VMSS, GCP Managed Instance Groups, and the equivalent in every container orchestrator.
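
For concreteness, here is a minimal sketch of a target-tracking policy using boto3. The group name and target value are illustrative placeholders, not a recommendation:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Target tracking: the platform adds or removes instances to keep the
# group's average CPU near the target value.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",  # hypothetical group name
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 60.0,  # workload-specific; see the failure modes below
    },
)
```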

When it works

When your workload is stateless (see my earlier post on this — stateless is a discipline, not a checkbox), when instance startup time is measured in seconds to a couple of minutes, when the scaling metric is predictive of actual load, and when the costs of over-provisioning and under-provisioning are roughly symmetric.

How it fails

Slow startup defeats the pattern. If your instance takes 8 minutes to boot, warm up, and start serving traffic, the autoscaler is always chasing yesterday's load. By the time the new instances are ready, the spike has passed or the outage has already happened. Fix the startup time or switch to a pattern that pre-warms capacity.

Scaling metric lag. Scaling on CPU feels right until you realize your application hits a latency wall at 60 percent CPU, not 80 percent. By the time CPU crosses the threshold, you are already in trouble. Scale on the metric that actually predicts user pain — usually p99 latency or queue depth — not the metric that is easy to measure.
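
If the platform cannot scale on your pain metric directly, publish it yourself. A minimal sketch, assuming boto3 and a hypothetical namespace, that pushes queue depth to CloudWatch so a scaling policy can track it:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_queue_depth(depth: int) -> None:
    # Custom metric the autoscaler can target instead of CPU.
    cloudwatch.put_metric_data(
        Namespace="MyApp/Scaling",  # hypothetical namespace
        MetricData=[{
            "MetricName": "QueueDepth",
            "Value": float(depth),
            "Unit": "Count",
        }],
    )
```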

Scale-in storms. The autoscaler pulls instances as load drops, and then a new wave arrives before the next scale-out. Use cooldown periods aggressively. Do not let the autoscaler thrash.
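
The cooldown idea is simple enough to show in a toy loop. This is not production code, just the shape of the guard:

```python
import time

SCALE_IN_COOLDOWN_SECONDS = 300
_last_scale_in = 0.0

def maybe_scale_in(current_load: float, threshold: float) -> bool:
    """Refuse to scale in again until the cooldown has elapsed, so a
    brief dip in load cannot trigger a remove/add thrash cycle."""
    global _last_scale_in
    now = time.monotonic()
    if current_load < threshold and now - _last_scale_in >= SCALE_IN_COOLDOWN_SECONDS:
        _last_scale_in = now
        return True   # safe to remove an instance
    return False      # load still high, or we scaled in too recently
```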

2. Vertical Autoscaling (Instance Size)

Vertical autoscaling changes the size of the instance rather than the count. Bigger VM, more CPU and memory, same architecture. Kubernetes has the Vertical Pod Autoscaler for this. Most VM platforms require a stop-start to resize, so vertical scaling is typically slower and more disruptive than horizontal scaling.

When it works

When the workload is stateful and hard to horizontally scale — a single large database instance, a legacy application with in-memory state, a single-threaded service. When the load pattern is predictable enough that you can resize on a schedule rather than in real time. When the platform supports online resizing (some managed databases do).
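
A scheduled resize is scriptable. A minimal boto3 sketch, with a hypothetical instance ID and target size; note that it performs exactly the stop-start disruption described below, so it belongs in a maintenance window:

```python
import boto3

ec2 = boto3.client("ec2")
INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical

ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

ec2.modify_instance_attribute(
    InstanceId=INSTANCE_ID,
    InstanceType={"Value": "m5.2xlarge"},  # hypothetical target size
)

ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])
```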

How it fails

Hitting the ceiling. Every cloud has a largest instance size. When you hit it, vertical scaling stops working and you have the architecture problem you were hoping to avoid. Know where the ceiling is, and plan not to hit it.

The stop-start is not free. Resizing a VM on most platforms requires a reboot. Resizing a database may require a failover. Both interrupt service. Do not rely on vertical scaling as a real-time response to load.

3. Serverless (Per-Request Scaling)

Lambda, Azure Functions, Cloud Run, Cloud Functions. The platform manages the instances entirely — you deploy a function, it runs when called, you pay per invocation and per millisecond of execution time. No capacity planning, no idle cost, no scaling configuration.

When it works

Spiky, unpredictable workloads with modest per-invocation work. Webhook handlers, event processors, cron jobs, ad-hoc data transforms, glue code that runs rarely. Workloads where the cold start time is not in the user's critical path. Workloads where the per-invocation cost model is cheaper than an always-on VM.

How it fails

Cold starts. The first invocation on a new instance takes longer than subsequent invocations. Sometimes a lot longer. If your users feel the cold start, they will hate the experience. Keep functions warm for user-facing paths, or use a pattern (provisioned concurrency, min instances) that eliminates cold starts at a cost.
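
On AWS, for example, provisioned concurrency reserves warm capacity. A minimal boto3 sketch, with a hypothetical function name, alias, and concurrency figure; this trades idle cost for predictable latency:

```python
import boto3

lambda_client = boto3.client("lambda")

lambda_client.put_provisioned_concurrency_config(
    FunctionName="checkout-handler",  # hypothetical user-facing function
    Qualifier="live",                 # hypothetical alias
    ProvisionedConcurrentExecutions=10,
)
```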

The hidden per-invocation cost. Serverless looks cheap until you run a workload with high volume. At a few million invocations a day, serverless stops being cheaper than a small fleet of always-on containers. Do the math for your actual volume, not for the demo.
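
The math fits in a one-screen script. The prices below are illustrative placeholders, not current rates; the point is the shape of the comparison, with serverless cost growing linearly in invocations while an always-on fleet stays flat:

```python
INVOCATIONS_PER_DAY = 5_000_000
MS_PER_INVOCATION = 100
MEMORY_GB = 0.5
GB_SECOND_PRICE = 0.0000166667         # illustrative per-GB-second rate
PER_REQUEST_PRICE = 0.20 / 1_000_000   # illustrative per-request rate

compute = INVOCATIONS_PER_DAY * (MS_PER_INVOCATION / 1000) * MEMORY_GB * GB_SECOND_PRICE
requests = INVOCATIONS_PER_DAY * PER_REQUEST_PRICE
serverless_per_day = compute + requests           # ~ $5.17/day here

CONTAINER_HOURLY = 0.04                # illustrative small-container rate
fleet_per_day = 3 * CONTAINER_HOURLY * 24         # ~ $2.88/day here

# At this volume the always-on fleet already wins.
print(f"serverless: ${serverless_per_day:.2f}/day, fleet: ${fleet_per_day:.2f}/day")
```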

Integration complexity. Serverless forces you to handle state in external services — a database, a queue, a cache — because the function itself has no state. This is fine if you designed for it. If you are lifting a stateful application into serverless, you are in for a bad time.

Vendor lock-in through event sources. The programming model for Lambda is subtly different from Azure Functions, which is subtly different from Cloud Run. Moving between them is not a one-line change. Plan for this if portability matters.

4. Queue-Based Load Leveling

Instead of scaling the worker fleet to match incoming load in real time, put a queue between the front end and the workers. The front end accepts requests, puts them on the queue, and returns a job ID. Workers pull from the queue at a rate they can sustain. Load spikes get absorbed by the queue, and the worker fleet scales on queue depth rather than on request rate.
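
The shape of the pattern, sketched with Python's standard library (a real system would use SQS, Pub/Sub, RabbitMQ, or similar):

```python
import queue
import threading
import time
import uuid

jobs: "queue.Queue[tuple[str, dict]]" = queue.Queue()

def accept_request(payload: dict) -> str:
    """Front end: enqueue and return a job ID without doing the work."""
    job_id = str(uuid.uuid4())
    jobs.put((job_id, payload))
    return job_id

def worker() -> None:
    """Worker: drain at a sustainable rate; spikes pile up in the
    queue instead of overwhelming the workers."""
    while True:
        job_id, payload = jobs.get()
        time.sleep(0.1)  # stand-in for the real work
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
```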

When it works

When the work is asynchronous by nature — image processing, report generation, email sending, batch analytics. When users do not need a synchronous response. When the fairness question ("who gets served first under overload?") can be answered by "FIFO" or a simple priority scheme.

How it fails

Queues can grow without bound. An unmonitored queue can grow from hundreds of messages to millions of messages overnight, and the resulting worker scaling or recovery job can be catastrophic. Monitor queue depth, alert on it, and implement a circuit breaker on the producer side if the queue is growing faster than it drains.
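
Extending the earlier sketch, a toy producer-side circuit breaker looks like this; the threshold is arbitrary and would be tuned per workload:

```python
MAX_QUEUE_DEPTH = 10_000  # arbitrary; tune per workload

def try_enqueue(payload: dict) -> str | None:
    """Shed load instead of letting the queue grow without bound."""
    if jobs.qsize() >= MAX_QUEUE_DEPTH:   # jobs/accept_request from the sketch above
        return None  # reject: tell the caller to back off and retry later
    return accept_request(payload)
```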

Dead letter queues are not optional. Messages fail. Poison messages will keep failing forever and block the main queue if there is no dead letter handling. Design the failure path before the happy path.
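
On SQS, for example, dead letter handling is a queue attribute. A minimal boto3 sketch with a hypothetical queue URL and DLQ ARN:

```python
import json

import boto3

sqs = boto3.client("sqs")

# After maxReceiveCount failed receives, a message moves to the DLQ
# instead of blocking the main queue forever.
sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/work-queue",  # hypothetical
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:work-dlq",  # hypothetical
            "maxReceiveCount": "5",
        }),
    },
)
```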

Latency becomes visible. Users notice when "I clicked submit" doesn't produce an immediate result. The design has to communicate that the work is in progress, or users will retry, and retries on queue-based systems create cascading load.

5. Caching (Buying Time With Memory)

Not technically a scaling service, but absolutely a scalability pattern. Put a cache in front of the expensive thing — a database query, an API call, a computation — and serve repeat requests from the cache at a fraction of the cost. Redis, Memcached, CDN edge caches, HTTP caches. The fastest scaling is the work you never do.
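
The canonical form is cache-aside. A minimal sketch assuming redis-py, with a stub standing in for the expensive origin query:

```python
import json

import redis

r = redis.Redis()
TTL_SECONDS = 300

def fetch_user_from_db(user_id: str) -> dict:
    # Stand-in for the expensive origin query.
    return {"id": user_id, "name": "example"}

def get_user(user_id: str) -> dict:
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)            # cache hit: no origin cost
    user = fetch_user_from_db(user_id)       # cache miss: pay full cost once
    r.setex(key, TTL_SECONDS, json.dumps(user))  # repeat reads are now cheap
    return user
```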

When it works

When the data has a reasonable TTL, when the read:write ratio is high (dozens or hundreds of reads per write), when the cache hit rate is measurable and high, and when the cache failure mode — "the cache is down, everything falls through to the origin" — is survivable.

How it fails

Cache stampedes. When a cache entry expires and 1,000 concurrent requests all hit the database simultaneously to repopulate it, the database falls over and the outage begins. Use single-flight patterns, lease-based repopulation, or jittered TTLs to prevent this. Every production system I have seen without a stampede mitigation has had at least one outage caused by a stampede.
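
Two of those mitigations fit in a short sketch: a jittered TTL so a cohort of keys does not expire at once, and an in-process single-flight guard so only one caller repopulates a missing key. A distributed system would use a Redis lock or lease instead of a thread lock:

```python
import random
import threading

_cache: dict[str, object] = {}
_locks: dict[str, threading.Lock] = {}
_locks_guard = threading.Lock()

def jittered_ttl(base_seconds: int, jitter: float = 0.2) -> int:
    """Spread expirations so entries written together do not expire together."""
    return int(base_seconds * (1 + random.uniform(-jitter, jitter)))

def get_single_flight(key: str, fetch):
    """Only one caller per key runs the expensive fetch; the rest wait
    on the lock and then find the value already cached."""
    if key in _cache:
        return _cache[key]
    with _locks_guard:
        lock = _locks.setdefault(key, threading.Lock())
    with lock:
        if key in _cache:        # re-check: the winner may have filled it
            return _cache[key]
        value = fetch()          # only the first caller pays this cost
        _cache[key] = value
        return value
```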

Stale data bugs. Users see outdated information, edits do not appear, and debugging is miserable because the behavior is not reproducible from the database alone. Have a principled story about cache invalidation, and accept that it is one of the two hardest problems in computer science.

Cache memory pressure. Caches fill up. Eviction policies matter. If the cache is evicting your hot keys because a batch job filled it with cold keys, the cache is actively hurting you.

The Meta-Pattern

Real scalability is always a combination of these patterns, chosen for the specific shape of the workload. The serverless evangelists will tell you Lambda solves everything. The Kubernetes evangelists will tell you HPA solves everything. The caching evangelists have been telling you caching solves everything since 1995. None of them are completely wrong, and none of them are right. Understand the five patterns, know the failure modes, and build the scalability story that matches the workload you have — not the one you wish you had.
