
Scalability in the Cloud: Five Application Patterns and How They Break

Scalability is not a property of 'the cloud' — it is a property of how your application is shaped, and most shapes break in predictable places.

John Lane 2023-03-25 6 min read

People talk about "scalability in the cloud" as if it were a single topic. It is not. Scalability depends entirely on what your application looks like, and the failure modes are different for every shape. The cloud makes some of these shapes cheaper and faster to scale. It does not fix their inherent limitations.

Here are five common application patterns we see in the wild, how each one scales, and where each one tends to fall over.

1. The Stateless Web API

This is the easy case and the one the cloud marketing decks are built around. Your API servers are stateless (sessions in Redis or JWT), behind a load balancer, reading and writing to a shared database. Horizontal scaling is trivial — add more instances, done. Autoscaling groups handle this in their sleep.

The failure mode is always the database. You can run 400 API instances if you want. They will all happily hammer a single Postgres primary until connection pool exhaustion, query queueing, or IOPS limits bring the whole system to its knees. We see this constantly — teams proud of their "infinitely scalable" API tier that melts at 300 requests per second because the database is doing sequential scans on a 50-million-row table that nobody indexed.

Fix in priority order: indexes, query review, connection pooling (pgbouncer is still the right answer), read replicas, and only then think about sharding or a different database engine. The last step is the expensive rewrite and you want to exhaust the first four before you consider it.
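
As a rough sketch of the connection-pooling step (assuming SQLAlchemy on the application side; the pgbouncer address, credentials, pool sizes, and query are purely illustrative), the idea is to cap how many connections each API instance can open:

    # Sketch: cap per-instance connections so that scaling to hundreds of API
    # instances does not translate into thousands of sessions on the primary.
    # The pgbouncer host, credentials, and pool numbers are illustrative.
    from sqlalchemy import create_engine, text

    engine = create_engine(
        "postgresql+psycopg2://app:secret@pgbouncer.internal:6432/appdb",
        pool_size=10,        # steady-state connections per API instance
        max_overflow=5,      # brief bursts beyond the steady pool
        pool_timeout=3,      # fail fast instead of queueing forever
        pool_pre_ping=True,  # discard dead connections before handing them out
    )

    with engine.connect() as conn:
        count = conn.execute(
            text("SELECT count(*) FROM orders WHERE user_id = :uid"), {"uid": 42}
        ).scalar_one()

pgbouncer then multiplexes those client connections onto a much smaller set of server connections, which is what keeps the primary from drowning as the instance count grows.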

The read-replica trap

Read replicas look like free scalability until you discover two things. First, replication lag is real — queries against a replica see data a few hundred milliseconds stale, which breaks read-after-write semantics in ways that confuse users. Second, a lot of application code reads and writes in the same transaction, which the replica can't serve. Plan for both before you add replicas.
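
A common mitigation for the read-after-write problem is to pin a user's reads to the primary for a short window after they write. A minimal sketch, assuming an in-process map and a two-second window; a real version would share this state across instances, for example in Redis:

    # Sketch: read-your-writes by routing a user to the primary for a short
    # window after any write. The window length and in-process dict are
    # illustrative assumptions.
    import time

    PIN_WINDOW_SECONDS = 2.0
    _last_write_at: dict[str, float] = {}

    def record_write(user_id: str) -> None:
        _last_write_at[user_id] = time.monotonic()

    def connection_for_read(user_id: str, primary, replica):
        """Use the primary if this user wrote recently, otherwise a replica."""
        wrote_at = _last_write_at.get(user_id)
        if wrote_at is not None and time.monotonic() - wrote_at < PIN_WINDOW_SECONDS:
            return primary
        return replica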

2. The Stateful Web Socket Application

Real-time applications — collaborative editors, chat, live dashboards, multiplayer games — are harder. Each connection occupies memory on a specific instance. Scaling out means deciding how new connections are placed and how messages route between instances that hold related connections.

We usually see this built with a hub-and-spoke pub/sub model (Redis Streams, NATS, or a cloud-managed service bus) where each instance subscribes to the channels for the connections it is currently holding. The failure mode is the pub/sub layer itself — it becomes the bottleneck once message volume exceeds what a single node can handle. Redis is fine up to tens of thousands of messages per second. Beyond that you need something built for it, like NATS JetStream or Kafka.
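
To make the hub-and-spoke shape concrete, here is a minimal sketch using plain Redis Pub/Sub for brevity (the Streams and NATS variants follow the same pattern); the channel naming and the websocket object's send() method are assumptions:

    # Sketch: each instance subscribes only to channels for the rooms it is
    # currently holding and forwards messages to its local websockets.
    import redis.asyncio as redis

    LOCAL_SOCKETS: dict[str, set] = {}  # room_id -> websockets held by THIS instance

    async def forward_room(r: redis.Redis, room_id: str) -> None:
        pubsub = r.pubsub()
        await pubsub.subscribe(f"room:{room_id}")
        async for message in pubsub.listen():
            if message["type"] != "message":
                continue
            for ws in LOCAL_SOCKETS.get(room_id, set()):
                await ws.send(message["data"])  # assumes an async websocket with .send()

    async def publish(r: redis.Redis, room_id: str, payload: bytes) -> None:
        # Any instance can publish; only instances holding sockets for the room receive it.
        await r.publish(f"room:{room_id}", payload)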

The other failure mode is uneven distribution. Load balancers that assign new connections round-robin work fine until one instance happens to get the user whose chat room has 5,000 participants, and that instance starts OOMing while others sit at 5 percent. Sticky connections by tenant or shard key are the defense.
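
One way to implement that stickiness is to hash the tenant or room key onto an instance, for example with rendezvous hashing, so every connection for the same key lands in the same place; the instance names below are illustrative:

    # Sketch: rendezvous (highest-random-weight) hashing pins a shard key to one
    # instance so its connections land together instead of being spread round-robin.
    import hashlib

    INSTANCES = ["ws-0", "ws-1", "ws-2", "ws-3"]

    def instance_for(shard_key: str, instances: list[str] = INSTANCES) -> str:
        def weight(instance: str) -> int:
            digest = hashlib.sha256(f"{instance}:{shard_key}".encode()).digest()
            return int.from_bytes(digest[:8], "big")
        return max(instances, key=weight)

    # The routing tier maps room "acme-support" to the same instance every time:
    print(instance_for("acme-support"))

Rendezvous hashing also keeps most keys in place when an instance joins or leaves, which plain round-robin assignment cannot do.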

3. The Batch Processing Pipeline

Batch jobs — ETL, data enrichment, ML feature generation, nightly reports — scale beautifully in the cloud because they fit the "bursty, interruptible, independent" shape that spot instances and serverless were built for. Throw work on a queue, run a pool of workers that grows and shrinks with queue depth, and accept that any single worker can die without losing the job.

The failure modes are more subtle. First, long tails — most jobs finish in five minutes but three of them take three hours because of a data skew problem nobody anticipated. The batch wall-clock time is determined by the slowest job, not the median. Fix this with pre-work partitioning of oversized inputs.
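
A sketch of the pre-work partitioning idea, assuming row-addressable inputs and an arbitrary chunk size:

    # Sketch: split oversized inputs into bounded chunks before enqueueing, so the
    # batch's wall-clock time tracks the chunk size rather than the largest input.
    # The 50,000-row cap and the Job shape are illustrative.
    from dataclasses import dataclass

    MAX_ROWS_PER_JOB = 50_000

    @dataclass
    class Job:
        source: str
        start_row: int
        end_row: int

    def partition(source: str, total_rows: int) -> list[Job]:
        return [
            Job(source, start, min(start + MAX_ROWS_PER_JOB, total_rows))
            for start in range(0, total_rows, MAX_ROWS_PER_JOB)
        ]

    # A skewed 3-million-row input becomes 60 uniform jobs instead of one straggler.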

Second, the coordination point. Somebody has to aggregate the results. If your aggregator is a single-threaded Python script reading every output file sequentially, it will become the bottleneck once you scale past a dozen workers. Design the aggregation step to be parallel or incremental from the start.
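
One minimal way to keep the aggregator from going single-threaded is to parse partial results in parallel and fold them into a running total as each worker output arrives, rather than reading every file sequentially at the end; the JSON-counts shape here is an assumption:

    # Sketch: parse worker outputs in parallel and merge them incrementally.
    # File layout and the merge operation are illustrative.
    import json
    from concurrent.futures import ProcessPoolExecutor, as_completed
    from pathlib import Path

    def load_partial(path: Path) -> dict[str, int]:
        return json.loads(path.read_text())

    def aggregate(output_dir: str) -> dict[str, int]:
        totals: dict[str, int] = {}
        paths = list(Path(output_dir).glob("*.json"))
        with ProcessPoolExecutor() as pool:
            futures = [pool.submit(load_partial, p) for p in paths]
            for future in as_completed(futures):
                for key, count in future.result().items():
                    totals[key] = totals.get(key, 0) + count
        return totals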

Third, cost. Serverless batch (Lambda, Cloud Functions) is cheap at small scale and expensive at large scale. The crossover point is usually around 100,000 invocations per day — below that, serverless; above that, workers on spot instances.

4. The Monolith with a Database Behind It

Half the systems we inherit look like this. One big application, one big database, cron jobs, some file uploads, maybe a search index. The conventional wisdom is to refactor it into microservices. Do not do this as a scalability play. Microservices do not make a slow system faster. They usually make it slower, because every call that was a function call becomes a network call with JSON serialization, TLS handshakes, and retry logic.

For a monolith, the scaling playbook is different:

  • Vertical scaling first. Monoliths are usually memory or CPU bound, and modern cloud instances go up to hundreds of cores and terabytes of RAM. A $2,000/month instance with 64 cores will outrun a clever 20-microservice architecture for most workloads.
  • Read offload. Move reports and analytics off the transactional database to a replica or a data warehouse.
  • Async where it hurts. Identify the one or two request paths that are actually synchronous when they don't need to be and move them to a background job.
  • Cache aggressively. A well-placed Redis or Memcached in front of the hot read paths delivers 90 percent of the scalability benefit of the architecture rewrite at 2 percent of the cost (see the cache-aside sketch after this list).

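A minimal cache-aside sketch for the caching step, assuming redis-py, a 60-second TTL, and a placeholder fetch function:

    # Sketch of cache-aside on a hot read path: check Redis, fall back to the
    # database, then populate the cache with a TTL. The host, key format, TTL,
    # and fetch_from_db are illustrative assumptions.
    import json
    import redis

    cache = redis.Redis(host="cache.internal", port=6379)
    TTL_SECONDS = 60

    def get_product(product_id: int, fetch_from_db) -> dict:
        key = f"product:{product_id}"
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
        product = fetch_from_db(product_id)  # the one expensive query
        cache.set(key, json.dumps(product), ex=TTL_SECONDS)
        return product
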
Most monoliths can scale to 100x their current load with these four steps. Refactor for reasons other than scalability — team structure, deploy velocity, blast radius — not because somebody said microservices are more scalable.

5. The ML Inference Service

ML workloads, especially LLM inference, break most of the assumptions above. A single request can consume an entire GPU for several seconds. Autoscaling by CPU is meaningless. The cost per request is high enough that caching, routing, and batching matter more than raw throughput.

The patterns that actually work:

  • Batching at the server. vLLM, TGI, and llama.cpp all support request batching — multiple concurrent requests share a forward pass. This can 5x or 10x throughput on the same hardware.
  • Routing by complexity. Not every query needs the big model. Use a classifier to send simple queries to a 1B-parameter model on CPU and complex queries to the GPU. We run this pattern internally and it cuts our compute cost by 60 percent (see the routing sketch after this list).
  • Caching. Semantic caching of prompts (not just exact-match) catches a surprising percentage of repeated queries.
  • Pre-warm the model. Cold starts on GPU inference are 30 to 90 seconds. Autoscale to N+1, not to zero.

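As a sketch of the routing-by-complexity idea, with a crude heuristic standing in for a trained classifier and purely illustrative endpoints and payload shapes:

    # Sketch: route simple prompts to a small CPU-served model and complex ones
    # to the GPU-served model. Endpoints, payload/response shape, and the
    # heuristic are illustrative; a real router would use a trained classifier.
    import requests

    SMALL_MODEL_URL = "http://cpu-inference.internal/v1/completions"  # ~1B model on CPU
    LARGE_MODEL_URL = "http://gpu-inference.internal/v1/completions"  # large model on GPU

    def looks_complex(prompt: str) -> bool:
        # Stand-in classifier: long prompts, code, or multi-step asks go to the big model.
        return len(prompt) > 500 or "def " in prompt or "step by step" in prompt.lower()

    def complete(prompt: str) -> str:
        url = LARGE_MODEL_URL if looks_complex(prompt) else SMALL_MODEL_URL
        resp = requests.post(url, json={"prompt": prompt, "max_tokens": 256}, timeout=30)
        resp.raise_for_status()
        return resp.json()["choices"][0]["text"]
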
The failure mode is GPU availability. Good luck autoscaling H100 capacity at 2 p.m. on a Tuesday when every other startup is doing the same thing. For predictable production load, reserve the capacity or run it on your own on-prem hardware — which is one of the reasons we have steered several customers back to private cloud for inference workloads.

What Scalability Actually Means

Scalability is not "add more instances." It is "identify the bottleneck, fix it, find the next one, repeat." Every system has a bottleneck somewhere, and the cloud makes moving the bottleneck around fast and cheap. It does not remove the bottleneck for you.

The teams that handle this well share three habits. They load test before they believe anything is scalable. They profile before they optimize. And they are honest about whether their application is actually one of the easy shapes or one of the hard ones — because the playbook for each is different and mixing them wastes months.
