Cloud Load Balancing: Five Decisions That Determine Tail Latency

Load balancer benefits get listed in every vendor brochure. The interesting question is which configuration decisions actually determine whether users hate your app at the 99th percentile.

John Lane 2023-08-26 5 min read

Every cloud load balancer product page lists the same benefits. Scalability. High availability. SSL termination. Health checks. If you have been near infrastructure for more than a year, you can recite the list in your sleep. What the product pages do not tell you is which configuration decisions actually determine whether your users have a good time at the 99th percentile and which ones are marketing-driven noise. Here are the five that matter, in the order we usually hit them during an architecture review.

Decision one: Layer 4 or Layer 7

This is the first fork, and it is more consequential than most teams give it credit for. A Layer 4 load balancer — AWS NLB, Azure Standard Load Balancer, GCP TCP/UDP balancer — operates on TCP connections. It is fast, handles any protocol, preserves source IPs cleanly with Proxy Protocol, and scales to ludicrous connection counts. A Layer 7 load balancer — AWS ALB, Azure Application Gateway, GCP HTTPS LB — speaks HTTP, understands headers, can route on path and host, and can do content rewrites.

The trap is assuming you want Layer 7 because "HTTP is the application protocol." Layer 7 balancers are slower, terminate connections on both sides, add latency, and impose their own connection pooling that can interact badly with backends. If your traffic pattern is a few dozen long-lived gRPC streams or WebSocket connections, Layer 4 is almost always the right answer. If you need path-based routing to five different microservices per host, Layer 7 pays for itself. Do not pick based on which one has the fancier console. Pick based on whether you actually need HTTP-aware routing decisions.
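To make the fork concrete, here is a minimal Go sketch of what each layer actually does. The addresses and backends are placeholders, and this is an illustration, not a production proxy. The Layer 4 version splices bytes between two TCP connections and never inspects the payload; the Layer 7 version parses HTTP and routes on the path, which is exactly the capability the extra latency buys you.

    package main

    import (
        "io"
        "log"
        "net"
        "net/http"
        "net/http/httputil"
        "net/url"
    )

    // Layer 4: copy raw bytes between client and backend. Any protocol
    // works, because the proxy never looks inside the stream.
    func l4Proxy(listen, backend string) error {
        ln, err := net.Listen("tcp", listen)
        if err != nil {
            return err
        }
        for {
            client, err := ln.Accept()
            if err != nil {
                return err
            }
            go func() {
                defer client.Close()
                upstream, err := net.Dial("tcp", backend)
                if err != nil {
                    return
                }
                defer upstream.Close()
                go io.Copy(upstream, client) // client -> backend
                io.Copy(client, upstream)    // backend -> client
            }()
        }
    }

    // Layer 7: parse HTTP, then route on the path. This is the part a
    // Layer 4 balancer cannot do, because it never sees the request.
    func l7Proxy() http.Handler {
        api, _ := url.Parse("http://10.0.0.10:8080") // placeholder backends
        static, _ := url.Parse("http://10.0.0.20:8080")
        mux := http.NewServeMux()
        mux.Handle("/api/", httputil.NewSingleHostReverseProxy(api))
        mux.Handle("/", httputil.NewSingleHostReverseProxy(static))
        return mux
    }

    func main() {
        go func() { log.Fatal(l4Proxy(":9000", "10.0.0.10:8080")) }()
        log.Fatal(http.ListenAndServe(":9001", l7Proxy()))
    }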

Decision two: connection draining and graceful shutdown

This is the one that separates teams that understand load balancers from teams that think they do. When you deregister a backend — whether for a deployment, a scaling event, or a health check failure — the load balancer needs to stop sending new connections to that backend while allowing in-flight requests to finish. The window for this is called connection draining (AWS calls it deregistration delay), and the provider defaults, which range from disabled to several minutes, are wrong for almost everyone.

If your longest request takes 10 seconds, 30 seconds of draining is fine. If you have a legacy endpoint that can take two minutes to render a report, 30 seconds of draining will kill those requests on every deploy. If you are serving WebSockets or long-polled HTTP, you need much longer drain windows and an in-app mechanism that tells clients to reconnect gracefully.

Get this wrong and you get mysterious 502s every deployment. Engineering blames the load balancer. The load balancer is working exactly as configured. The configuration is wrong.
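The backend has to cooperate with the drain window, too. Here is a minimal sketch in Go, assuming the deploy tooling sends SIGTERM after deregistering the instance and assuming a 120-second drain window; both are placeholders to match to your own setup.

    package main

    import (
        "context"
        "log"
        "net/http"
        "os"
        "os/signal"
        "syscall"
        "time"
    )

    func main() {
        srv := &http.Server{Addr: ":8080"}

        go func() {
            if err := srv.ListenAndServe(); err != http.ErrServerClosed {
                log.Fatal(err)
            }
        }()

        // Wait for the SIGTERM the deploy tooling sends after it has
        // deregistered this backend from the load balancer.
        stop := make(chan os.Signal, 1)
        signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
        <-stop

        // Stop accepting new connections and give in-flight requests up
        // to 120 seconds to finish. This value must not exceed the load
        // balancer's drain window, or the balancer will cut connections
        // the server is still serving.
        ctx, cancel := context.WithTimeout(context.Background(), 120*time.Second)
        defer cancel()
        if err := srv.Shutdown(ctx); err != nil {
            log.Printf("drain expired with requests still in flight: %v", err)
        }
    }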

Decision three: health check semantics

The second source of mystery 502s is health checks that do not reflect what "healthy" actually means. We see one of two failure modes constantly. Either the health check is too shallow — it hits / or /health and returns 200 even when the database connection pool is exhausted — or it is too deep and flaps because a dependency is intermittently slow.

The right pattern is a dedicated health endpoint that checks the things that would make a request actually fail. Database reachability, critical cache availability, essential downstream services. But it should be forgiving on slow-but-working dependencies and strict on clearly-broken ones. And it absolutely should not share the expensive code path of real traffic, because then the moments a backend is under the most load are exactly the moments its health check fails, and pulling capacity out of rotation at the peak of an incident makes the real failure worse.
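A sketch of that pattern in Go, assuming an already-opened *sql.DB pool; the specific dependencies are placeholders, and the point is the split between strict and forgiving checks.

    package health

    import (
        "context"
        "database/sql"
        "net/http"
        "time"
    )

    // Handler fails hard on dependencies that would break real requests
    // and stays forgiving about ones that are merely slow.
    func Handler(db *sql.DB) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            // Strict: if we cannot reach the database quickly, real
            // requests will fail too, so report unhealthy.
            ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
            defer cancel()
            if err := db.PingContext(ctx); err != nil {
                http.Error(w, "db unreachable", http.StatusServiceUnavailable)
                return
            }

            // Forgiving: a slow-but-working cache degrades requests
            // rather than failing them, so check it with a generous
            // timeout and log the result instead of returning a 503.

            w.WriteHeader(http.StatusOK)
        }
    }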

Health check interval and threshold matter too. A two-second interval with a three-failure threshold means you pull a backend out of rotation six seconds after it starts failing. That is usually right. Longer intervals save money on health check traffic but delay detection. Shorter intervals create noise and false positives.

Decision four: sticky sessions versus stateless

Sticky sessions — pinning a user to a specific backend — are one of those features that sound useful and are almost always wrong in modern architectures. The pitch is that your application keeps some session state in memory, so routing the same user to the same backend avoids cache misses or login loops. The reality is that sticky sessions create uneven load distribution, make rolling deployments painful, and fall apart when the pinned backend dies.

If you are reaching for sticky sessions, the real problem is that your application has local state. Move the state to Redis, Memcached, or a proper session store. Stateless backends scale linearly, heal from failures gracefully, and deploy in any order. Sticky sessions are a crutch for architectures that have not quite gotten to stateless yet. They are not a feature, they are a workaround.
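The stateless version is not complicated. A sketch in Go, with the shared store behind an interface so the backing can be Redis, Memcached, or whatever you already run; the interface and names are illustrative, not a specific client library.

    package session

    import "net/http"

    // Store abstracts the shared session store. Back it with Redis or
    // Memcached and any backend can serve any user's next request.
    type Store interface {
        Get(sessionID string) (userID string, err error)
    }

    // Middleware loads session state from the shared store on every
    // request, so the load balancer is free to route each request to
    // any backend. No pinning, no local state to lose on a deploy.
    func Middleware(store Store, next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            cookie, err := r.Cookie("session_id")
            if err != nil {
                http.Error(w, "unauthenticated", http.StatusUnauthorized)
                return
            }
            if _, err := store.Get(cookie.Value); err != nil {
                http.Error(w, "session expired", http.StatusUnauthorized)
                return
            }
            next.ServeHTTP(w, r)
        })
    }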

The one legitimate use case is WebSocket pinning, because the connection itself is stateful by definition. Even there, you want to keep the rest of the application stateless and treat the WebSocket connection as ephemeral.

Decision five: where you do TLS

Most load balancers offer to terminate TLS for you. This is convenient. It is also a decision with real consequences, and not the consequences people usually talk about.

Terminating TLS at the load balancer means the connection from the balancer to the backend is plaintext unless you re-encrypt. In a cloud VPC, this is probably fine for most threat models. Traffic between the balancer and the backend is inside the provider's network and does not cross the public internet. For regulated workloads — PCI, HIPAA, anything with a serious compliance story — you usually need end-to-end encryption, which means either passing TLS through (Layer 4 mode) or re-encrypting on the backend side. Re-encrypting costs CPU and latency. Pass-through means the load balancer cannot do any Layer 7 routing, because it cannot see the request.
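For the re-encrypt option, the backend side is just a TLS listener of its own. A minimal Go sketch; the certificate paths are placeholders and would in practice point at an internal CA certificate the load balancer is configured to trust.

    package main

    import (
        "log"
        "net/http"
    )

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("ok"))
        })

        // The balancer terminates the client's TLS, then opens a second
        // TLS connection to this listener, so traffic stays encrypted
        // end to end inside the VPC.
        log.Fatal(http.ListenAndServeTLS(":8443",
            "/etc/ssl/backend.crt", "/etc/ssl/backend.key", mux))
    }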

The other trap is certificate management. ACM on AWS, App Service Certificates on Azure, Certificate Manager on GCP — each has its own quirks around auto-renewal, SAN limits, and which services can consume the certificates. Pick your TLS termination point with certificate rotation in mind, because a certificate expiring at 3am because nobody renewed it is a classic on-call story, and one you do not want to live through.

What the benefits list should actually say

Instead of "high availability, scalability, health checks," the honest version would read something like this. Load balancers let you treat a pool of identical backends as a single logical endpoint. They are a necessary piece of infrastructure. They are also easy to misconfigure in ways that amplify rather than absorb failures. The five decisions above determine whether yours is the kind that saves you during an incident or the kind that makes the incident worse.

If you are building new, start with the simplest balancer that meets your routing needs, keep backends stateless, tune your drain and health check settings to match your actual request patterns, and do not pay for features you will not configure correctly. The best load balancer is the one that disappears into the background because it is doing exactly what you configured it to do. The worst one is the one that shows up in your incident postmortems as a contributing factor.
