Cloud Performance: Four Insider Takeaways from Real Tuning

Most cloud performance problems aren't what you think they are. Here are four lessons from tuning real production workloads — and what we'd do differently next time.

John Lane 2024-02-07 5 min read

Cloud performance tuning has a rhythm that is different from on-prem. On-prem, you tune against a fixed set of hardware you can see and touch. In the cloud, you tune against a shared substrate you can't see, whose behavior changes under load you don't control. The skills overlap, but the reflexes are different. Here are four things we've learned from tuning real production workloads across Azure, AWS, and private cloud — the ones that come up over and over, and that almost nobody reads about until they're in the middle of a fire.

1. Noisy neighbors are real, and the answer is rarely "call the vendor"

Shared tenancy means that the VM sitting on the same host as yours may be doing something CPU-heavy, memory-heavy, or IO-heavy, and that activity will leak into your workload. It won't show up as "your VM is slow." It'll show up as intermittent latency spikes with no corresponding change in your own metrics.

This is especially common on the cheaper cloud instance families (burstable CPU credits, general-purpose shared tenancy, burst-IO storage tiers) and almost nonexistent on dedicated hosts and provisioned-IOPS storage. The first move when you're debugging inexplicable latency is to check whether you're on a shared tier. If you are, pay for the upgrade, run the workload for 48 hours, and see if the problem goes away. Most of the time, it does.

We see teams spend weeks profiling application code looking for the source of tail latency that was coming from a noisy neighbor the whole time. The signal that should make you suspect it: no correlated change in your own app metrics, but 99th-percentile latency spikes at seemingly random intervals. That's the fingerprint of something happening off-box.
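A minimal sketch of that fingerprint check, assuming you can export per-minute request latencies and the VM's own average CPU utilization as aligned series (the window size and thresholds here are illustrative, not a recommendation): flag windows where the p99 jumps while your CPU barely moves.

```python
import numpy as np

def noisy_neighbor_fingerprint(latencies_ms, cpu_percent, spike_factor=3.0):
    """Flag windows where p99 latency spikes but our own CPU does not move.

    latencies_ms: list of per-minute arrays of request latencies (ms)
    cpu_percent:  per-minute average CPU utilization of the VM (%),
                  aligned one-to-one with latencies_ms
    """
    p99 = np.array([np.percentile(window, 99) for window in latencies_ms])
    cpu = np.asarray(cpu_percent, dtype=float)

    baseline = np.median(p99)
    spikes = p99 > spike_factor * baseline          # latency spiked...
    cpu_flat = np.abs(cpu - np.median(cpu)) < 10.0  # ...but our CPU barely changed

    # True = latency spike with no matching change in our own metrics:
    # look off-box before profiling the app.
    return spikes & cpu_flat
```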

2. Network is usually the thing, and it's the hardest to see

The second-most-common source of performance surprise in the cloud is the network. Not bandwidth — that's usually fine. The things that bite are:

  • Cross-AZ traffic charges and latency. A chatty microservice calling its database across availability zones is both slow and expensive.
  • DNS resolution time. The default cloud DNS resolvers are usually fast, but when they're not, every outbound call pays a latency tax.
  • Egress bandwidth limits per instance. Some instance sizes have egress caps much lower than customers expect, and a sudden burst of outbound traffic will throttle everything else on the VM.
  • TCP connection churn. Applications that open a new connection per request, especially with TLS handshake overhead, burn a surprising amount of time on setup rather than work.

Network is hard to see because it doesn't show up in the usual metrics. CPU is fine, memory is fine, disk is fine — and yet the service is slow. The tell is latency that doesn't scale with throughput the way you'd expect. Reach for packet traces, flow logs, and instance-level network metrics before you reach for a profiler.

The fix is usually architectural: co-locate chatty services in the same AZ, use connection pools, cache DNS, and pick instance sizes with adequate network capacity. None of this is visible from the application code.
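As a rough illustration of the connection-churn point, here's a sketch comparing per-request connections against a pooled session using Python's requests library; the endpoint is a placeholder, and the actual gap depends on network distance and TLS configuration.

```python
import time
import requests

URL = "https://example.com/health"   # placeholder endpoint
N = 50

# New connection (and TLS handshake) for every request.
start = time.perf_counter()
for _ in range(N):
    requests.get(URL, timeout=5)
churn = time.perf_counter() - start

# One pooled, keep-alive connection reused for every request.
start = time.perf_counter()
with requests.Session() as session:
    for _ in range(N):
        session.get(URL, timeout=5)
pooled = time.perf_counter() - start

print(f"per-request connections: {churn:.2f}s, pooled session: {pooled:.2f}s")
```

With the pooled session, the TCP and TLS setup is paid once instead of N times, which is usually the bulk of the difference.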

3. "Performance" usually means tail latency, not median

Teams coming from single-tenant environments tend to optimize for median latency. "The p50 is 40 milliseconds, we're great." The problem is that users and downstream systems don't experience median. They experience the worst case. The p99 is what makes a system feel slow or unreliable, and the p99.9 is what causes incidents.

In the cloud especially, tail latency is dominated by factors outside the application: GC pauses on a noisy host, a retry storm in a dependency, a storage request that got queued behind someone else's backup job, a load balancer that decided to reroute traffic at exactly the wrong moment. Optimizing median performance will not fix any of those.

The move is to measure tail latency from day one, set SLOs against the p99 (not the p50), and treat tail latency regressions the same way you treat a build failure. When you do this, you start noticing a category of problems that were previously invisible, and you make better architectural decisions as a result.

The corollary: beware of benchmarks that report average throughput. They will tell you a service can handle 10,000 requests per second when what you actually need to know is whether it can handle that load with p99 latency under your SLO. Those are not the same question.
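A quick sketch of the difference, with purely synthetic numbers: a latency sample whose mean looks comfortably fast while the p99 blows a hypothetical 200 ms SLO.

```python
import numpy as np

def passes_slo(latencies_ms, p99_slo_ms=200.0):
    """Judge the benchmark the way users experience it: by the tail."""
    mean = float(np.mean(latencies_ms))
    p99  = float(np.percentile(latencies_ms, 99))
    p999 = float(np.percentile(latencies_ms, 99.9))
    print(f"mean={mean:.0f}ms  p99={p99:.0f}ms  p99.9={p999:.0f}ms")
    return p99 <= p99_slo_ms

# Illustrative only: 98% fast requests, 2% slow ones.
rng = np.random.default_rng(0)
sample = np.where(rng.random(10_000) < 0.02,
                  rng.uniform(300, 800, 10_000),   # the slow tail
                  rng.uniform(20, 60, 10_000))     # the fast majority
print("SLO met:", passes_slo(sample))  # mean ~50ms, p99 ~550ms: fails
```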

4. Cost is a performance metric whether you want it to be or not

This is the one that trips up engineers who come from on-prem. On-prem, you tune for speed and cost is a capital expense that was decided years ago. In the cloud, cost is a variable expense that changes with every optimization decision. Every performance choice is implicitly a cost choice.

Using a bigger instance to avoid CPU saturation costs money every hour, forever. Adding a caching layer to reduce database load adds a Redis bill. Turning on auto-scaling to handle peak load costs more than a fixed footprint on average. Switching to a faster storage tier can double the storage bill.

The right mindset is to treat cost as a constraint in the tuning process, the same way you treat latency. "We cut p99 from 400ms to 120ms but tripled the bill" is not always a good outcome. Sometimes it is — if the latency reduction unlocks revenue — but sometimes the right answer is to accept slightly worse performance for meaningfully better economics.
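One way to make the tradeoff visible is to write it down as arithmetic before committing. A sketch with hypothetical hourly prices and instance counts (none of these figures come from a real engagement):

```python
HOURS_PER_MONTH = 730

def monthly_cost(hourly_rate, instance_count):
    return hourly_rate * instance_count * HOURS_PER_MONTH

# Hypothetical numbers for illustration only.
current  = {"p99_ms": 400, "cost": monthly_cost(0.38, 6)}   # ~$1,664/mo
proposed = {"p99_ms": 120, "cost": monthly_cost(0.77, 9)}   # ~$5,059/mo

delta_cost = proposed["cost"] - current["cost"]
delta_p99  = current["p99_ms"] - proposed["p99_ms"]

print(f"p99 improvement: {delta_p99} ms")
print(f"monthly cost increase: ${delta_cost:,.0f}")
print(f"cost per ms of p99 improvement: ${delta_cost / delta_p99:,.2f}/month")
```

Whether that price per millisecond is worth paying is a business question, but at least it's now a question someone can answer on purpose.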

We have saved customers tens of thousands of dollars a month by going the other direction: accepting a small performance regression in exchange for a dramatic cost reduction. The key is making the tradeoff visible and deliberate, not stumbling into it because nobody was watching the bill.

The tuning checklist we actually use

When we walk into a customer environment with a performance problem, the order of operations is:

  1. Measure the tail, not the median. Get p99 and p99.9 for every critical path.
  2. Rule out shared tenancy. Upgrade to dedicated or premium tiers for 48 hours and see if the problem moves.
  3. Check the network. Cross-AZ traffic, DNS time, connection churn, instance egress limits.
  4. Profile the application. Only after the first three — because most cloud performance problems aren't in the app.
  5. Calculate cost impact of every proposed fix. Write it down before implementing.
  6. Measure again. Compare tail, not median.

Most teams start at step 4 and spend weeks there before circling back to steps 2 and 3. The order matters. Infrastructure-side problems are cheap to rule out and expensive to miss.

Three Takeaways

  1. The p99 is the metric that matters. Median hides the problems users actually notice.
  2. Most cloud performance surprises are shared-tenancy or network issues, not application bugs.
  3. Every performance decision is a cost decision. Tune the tradeoff, not just the latency.

Talk with us about your infrastructure

Schedule a consultation with a solutions architect.
