
Cloud Performance Optimization: Six Tactics from Real Production Tuning

Six performance tactics that have consistently paid off on real customer workloads, ordered by the ratio of impact to effort. No microbenchmarks, no synthetic scenarios.

John Lane 2023-05-12 6 min read

Performance optimization advice on the internet is mostly aspirational. Someone tuned a microbenchmark and wrote it up. Useful if you are writing a database from scratch. Not useful if you are trying to make a production workload faster on Monday. The tactics in this piece are the ones we reach for first on real customer workloads — ordered by the ratio of impact to effort, not by how interesting they are to talk about. Twenty-three years of tuning production systems has sharpened this list quite a bit.

Tactic 1: Fix the N+1 Queries Before Anything Else

The first thing we look at on a slow web application is the query log, and the first thing we usually find is an N+1 query pattern that should have been a single join. A page that renders 50 items hitting the database 51 times — once for the list and once per item for a related row. The fix is a single SELECT ... JOIN or an ORM hint to eager-load the relationship. The impact is typically 5x to 20x on the page load time. The effort is about half a day once you find it.

The reason this is always the first thing to check is that N+1 is the single most common performance bug in any application built with an ORM, and it hides well in development where the database is local and the latency per query is under a millisecond. In production, where the database is across a network hop at 1 to 3 milliseconds per call, the N+1 suddenly becomes a 150 millisecond tax on every page view. That tax dominates every other thing you might optimize.

Tools: enable slow query logging with a threshold of 100 milliseconds, turn on ORM query logging in staging, and watch the log while you click through the actual pages users complain about. The bugs find themselves.
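The pattern is easy to see side by side. This sketch uses sqlite and a toy two-table schema (the table and column names are illustrative, not from any real engagement) to contrast the 51-round-trip version with the single join:

```python
import sqlite3

# Toy schema standing in for a real app: a list page of 50 items,
# each with one related detail row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE details (item_id INTEGER, blurb TEXT);
""")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, f"item{i}") for i in range(50)])
conn.executemany("INSERT INTO details VALUES (?, ?)",
                 [(i, f"blurb{i}") for i in range(50)])

# N+1 pattern: 1 query for the list, then one query per row (51 round trips).
def render_n_plus_one():
    rows = conn.execute("SELECT id, name FROM items ORDER BY id").fetchall()
    return [(name, conn.execute("SELECT blurb FROM details WHERE item_id = ?",
                                (item_id,)).fetchone()[0])
            for item_id, name in rows]

# The fix: one JOIN, one round trip.
def render_joined():
    return conn.execute("""
        SELECT items.name, details.blurb
        FROM items JOIN details ON details.item_id = items.id
        ORDER BY items.id
    """).fetchall()

assert render_n_plus_one() == render_joined()
```

Against a local sqlite file the two versions are indistinguishable; against a database 1 to 3 milliseconds away, the first one pays that latency 51 times per page view.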

Tactic 2: Add the Index You Have Been Putting Off

The second tactic is the obvious one and it still works. Find the query that shows up most often in the slow log, look at the execution plan, and add the index the plan says is missing. The impact is usually 10x to 100x on that specific query. The risk is tiny if the table is not already massive.

The only judgment call is index maintenance overhead. Every index slows down writes by a small amount. On a read-heavy table, this does not matter — add the index. On a write-heavy table with 10 indexes already, pause and think about which indexes are actually used (most databases will tell you) and whether any of the existing ones can be dropped.

The mistake we see most often is people using query hints or forcing a specific plan instead of adding the right index. Query hints are a sign that your index coverage is wrong. Fix the indexes, not the hint.
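Here is the before-and-after as a runnable sketch. sqlite stands in for the production database; on Postgres or MySQL the equivalent step is reading `EXPLAIN (ANALYZE)` output for the top query in the slow log. The table and index names are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes the access path.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM users WHERE email = 'a@example.com'"
before = plan(query)   # full table scan: the plan says SCAN
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = plan(query)    # now a search using idx_users_email
print(before, "->", after)
```

The plan change from a scan to an index search is the whole point: the same query, no hints, and the optimizer picks the fast path on its own because the right index exists.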

Tactic 3: Cache the Expensive Read, Not the Cheap One

Caching advice is often "cache everything." That is usually wrong. Caching has a real cost — cache invalidation complexity, staleness bugs, memory budget — and it only pays back when the cached operation is expensive relative to the cache lookup.

The rule we use is: cache anything that takes longer than 50 milliseconds to compute and is read more than 10 times between writes. A database query that takes 200 milliseconds and is read 100 times per minute is a perfect cache target. A query that takes 5 milliseconds is probably not worth caching because Redis itself costs 1 to 2 milliseconds round trip.

The specific technologies change with fashion, but the pattern that keeps working is: Redis or Memcached in front of a relational database, keyed by a deterministic hash of the query parameters, with a TTL that matches how stale the data can be. 60 seconds for list views, 5 minutes for detail pages, 1 hour for reference data. The TTL is the invalidation strategy, and "TTL plus explicit invalidation on writes" is simpler than "event-driven invalidation" in every case we have seen.
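The cache-aside shape of that pattern is small enough to sketch. This version uses a plain dict as an in-memory stand-in for Redis (a real deployment would swap the dict for a Redis client and `SETEX`); the function names are ours, not from any library:

```python
import hashlib
import json
import time

store = {}  # in-memory stand-in for Redis: key -> (expires_at, value)

def cache_key(query_name, **params):
    # Deterministic hash of the query parameters: sorting the keys means
    # the same params always produce the same key, regardless of call order.
    blob = json.dumps(params, sort_keys=True)
    return f"{query_name}:{hashlib.sha256(blob.encode()).hexdigest()}"

def cached(query_name, ttl, compute, **params):
    # Cache-aside: check the cache, fall back to the expensive read,
    # store the result with a TTL that matches acceptable staleness.
    key = cache_key(query_name, **params)
    hit = store.get(key)
    if hit is not None and hit[0] > time.time():
        return hit[1]
    value = compute(**params)
    store[key] = (time.time() + ttl, value)
    return value
```

Usage follows the rule from the text: `cached("list_view", 60, run_list_query, page=1)` for a list view, a 300-second TTL for detail pages, 3600 for reference data. Explicit invalidation on writes is just a delete of the same deterministic key.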

Tactic 4: Move the Work to the Database When You Can

The third-most-common performance mistake is pulling 10,000 rows across the network to the application layer and then filtering or aggregating them there. Databases are extremely good at filtering and aggregation. The network between the app and the database is extremely not good at moving 10,000 rows.

Rewriting a Python or Node loop into a SQL GROUP BY or WHERE clause is usually a 10x to 50x improvement on that operation and reduces the memory footprint of the application tier at the same time. The tradeoff is that the logic now lives in SQL, which some teams find harder to test and maintain. That tradeoff is worth it for the endpoints that actually show up in the slow log.
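The rewrite looks like this in miniature. The orders table is a toy example, and sqlite stands in for the remote database, but the shape of the change is exactly what we do on real engagements:

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("eu", 10), ("eu", 20), ("us", 5), ("us", 7)])

# The slow pattern: pull every row over the wire, aggregate in the app tier.
def totals_in_app():
    totals = defaultdict(int)
    for region, amount in conn.execute("SELECT region, amount FROM orders"):
        totals[region] += amount
    return dict(totals)

# The fix: let the database aggregate; only one row per region crosses the network.
def totals_in_db():
    return dict(conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region"))

assert totals_in_app() == totals_in_db() == {"eu": 30, "us": 12}
```

With four rows both versions are instant; with 10,000 rows the first version ships 10,000 rows across the network and the second ships a handful.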

Tactic 5: Connection Pooling Is Not Optional

The fifth tactic is the one that surprises people who have not tuned a production database before. Every connection to Postgres or MySQL has a real cost — memory on the database, CPU to authenticate, a TCP handshake. An application that opens a connection per request will crush a database that could easily handle 10x the actual query load, because most of the work is connection churn, not query execution.

The fix is a connection pool sized to match the database's connection capacity, not the application's concurrency. For Postgres with 100 max connections, you want the pool to top out around 80 connections total across all app instances, not 80 per instance. For Postgres at scale you usually want PgBouncer in front of it to multiplex application connections against a smaller real-connection pool.

We have seen this single change take a database from "constantly at 80 percent CPU" to "idling at 15 percent" with no application changes. It is boring and it works.
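The sizing arithmetic is worth writing down, because "80 per instance" is the mistake everyone makes once. This helper is a sketch of that arithmetic; the 20 percent headroom and the small reserve for superuser and ad-hoc sessions are illustrative defaults we picked for the example, not universal constants:

```python
def pool_size_per_instance(db_max_connections, app_instances,
                           headroom=0.8, reserved=3):
    # Size the pool to the database's capacity, not the app's concurrency:
    # keep ~20% headroom below max_connections, hold back a few slots for
    # superuser and ad-hoc psql sessions, then split the remainder across
    # every app instance that will open a pool.
    budget = int(db_max_connections * headroom) - reserved
    return max(1, budget // app_instances)

# Postgres with max_connections=100 and 4 app instances:
# an 80-connection total budget, minus a small reserve, split four ways.
print(pool_size_per_instance(100, 4))  # 19
```

When the instance count scales past what this math allows, that is the signal to put PgBouncer in front and multiplex application connections against a smaller real-connection pool.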

Tactic 6: Size the Instance to the Workload, Then Stop

The last tactic is the one that saves the most money and gets the least attention: right-size the instance and turn off autoscaling on anything that does not actually need it. Most VMs and databases we audit are running on instance sizes that were picked during initial setup and never revisited. The workload grew or shrank, the instance stayed the same.

Tools like AWS Compute Optimizer and Azure Advisor are actually useful here. They look at 14 days of CPU and memory utilization and recommend a smaller instance if the workload is not using what it has. The typical recommendation saves 30 to 50 percent on compute cost with no performance impact. Take the recommendation, set a reminder to re-run the analysis in 90 days, and move on.
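The core check those tools run is simple enough to illustrate. This is a toy heuristic of our own invention, not Compute Optimizer's actual (proprietary) logic, and the 40 percent threshold is an assumption for the example:

```python
def downsize_candidate(cpu_p95, mem_p95, threshold=0.4):
    # Toy heuristic, not Compute Optimizer's real algorithm: if both CPU and
    # memory peaks (p95 over the observation window, as fractions of capacity)
    # stay under 40%, then one size smaller — which roughly doubles relative
    # utilization — would still leave ~20% headroom at peak.
    return cpu_p95 < threshold and mem_p95 < threshold

print(downsize_candidate(0.25, 0.30))  # True: both dimensions have slack
print(downsize_candidate(0.25, 0.55))  # False: memory would be tight
```

The point of the sketch is that the check has to cover every dimension: an instance idling on CPU but running hot on memory is not a downsize candidate, which is why eyeballing a CPU graph alone gets this wrong.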

The related discipline is not enabling autoscaling on workloads that have stable load. Autoscaling is a tool for spiky traffic. For a workload that runs at 40 percent CPU every business day, a fixed-size deployment is cheaper and more predictable than an autoscaling group. Autoscaling is not free — every scale event is a potential source of warm-up latency, and the margin the autoscaler keeps in reserve is capacity you are paying for.

What We Actually Do On an Engagement

When a customer asks us to "make the application faster," we follow the order above: enable slow query logging, find the N+1s, add the missing indexes, look for anything that could be cached cheaply, rewrite the worst SQL-in-app-code patterns, fix connection pooling, right-size the instances. Eighty percent of the wins in a typical engagement come from the first three steps. The glamorous optimizations — async rewrites, message queues, read replicas — come much later and only when the cheap wins are exhausted.

Three Takeaways

  1. N+1 queries and missing indexes account for most performance problems in most applications. Fix them before you touch anything else.
  2. Cache the expensive read, not every read. The overhead of cache infrastructure has to be earned by the work it avoids.
  3. Right-sizing is the cheapest optimization you can do and nobody does it. Run the tool, take the recommendation, save 30 percent.
