3 Scalability Challenges Most Architects Miss
Database hot spots, cache invalidation, and distributed transactions — the three scalability problems that catch teams off-guard in production.

"We'll scale when we need to" is the most common scalability strategy, and it's fine for most applications. Most never need to scale beyond a single well-tuned database and a few application servers. But when scaling does matter, the problems that bite are rarely the ones the architecture diagrams predict. The throughput issue you expected to hit turns out to be fine. The one you didn't expect takes down production on Black Friday.
Three specific problems catch teams out often enough that they deserve attention. None of them are new. All of them are still killing production systems in 2025.
1. Database Hot Spots
The seductive promise of a distributed database is that you can just add nodes. Add a node, get more throughput. In practice, a single hot partition can make a 100-node database perform worse than a single-node deployment, because one node is melting while the rest are idle.
The Pattern
You shard your data by customer_id. You have a million customers. In theory, load is spread evenly. In practice, 5 percent of your customers generate 80 percent of the traffic — and those customers end up on 2 or 3 specific nodes. Those nodes saturate. The rest of the cluster is idle. Your app sees latency spikes and timeouts.
This happens with DynamoDB (partition throttling), Cassandra (hot partitions), Cosmos DB (hot logical partitions), Redis Cluster (hot keys), and any sharded SQL setup where the sharding key is not uniformly distributed.
Symptoms
- Latency spikes that track individual customers or accounts
- Monitoring shows some nodes at 95 percent CPU while others are at 10 percent
- Throttling errors from your database under load that's well below provisioned throughput
- Retry storms that make the problem worse
What to Do
- Pick a partition key that spreads load evenly. Customer ID alone is usually wrong for multi-tenant systems with power-law usage. A composite key (customer_id plus a random or hashed bucket) spreads a hot customer's traffic across more partitions; see the sketch below.
- Implement write-through caching for read-heavy hot keys so that only a fraction of reads hit the database.
- Monitor per-partition metrics, not just cluster averages. Cluster average CPU hides hot spot problems.
- Rate-limit per-customer at the application layer to prevent a single tenant from overwhelming shared resources.
The fundamental lesson: uniform distribution is an assumption, not a guarantee. Verify it at load with real data.
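The composite-key fix is mostly a key-construction detail. Below is a minimal sketch in Python, assuming a DynamoDB-style key-value store; the bucket count, the key format, and the function names are illustrative rather than any particular library's API.

```python
import hashlib

NUM_BUCKETS = 16  # illustrative; size it to the hottest tenant's throughput

def partition_key(customer_id: str, item_id: str) -> str:
    """Composite key: tenant id plus a deterministic bucket derived from the
    item id. A hot customer now spreads across NUM_BUCKETS partitions, while
    a point read for a known item still hits exactly one partition."""
    bucket = hashlib.sha256(item_id.encode()).digest()[0] % NUM_BUCKETS
    return f"{customer_id}#{bucket}"

def scatter_keys(customer_id: str) -> list[str]:
    """The cost of the spread: a 'give me everything for this customer' query
    must fan out across all buckets and merge the results client-side."""
    return [f"{customer_id}#{b}" for b in range(NUM_BUCKETS)]
```

The trade-off is explicit: writes and point reads stay cheap, but per-customer scans now touch NUM_BUCKETS partitions. That is usually a good trade when a handful of tenants dominate traffic.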
2. Cache Invalidation
Phil Karlton's quote — "There are only two hard things in computer science: cache invalidation and naming things" — is overused because it's true. Cache invalidation bugs are responsible for a disproportionate share of hard-to-debug production incidents.
The Patterns That Bite
Stale data after updates. User updates their profile. The read cache serves the old version for the next 5 minutes. User is confused and opens a ticket. Worse: user updates their payment method, charge fails because the cache returned the old method, user churns.
Thundering herd on expiration. A hot cache entry expires. A thousand concurrent requests all miss the cache simultaneously and all try to populate it from the database. The database gets hammered, latency spikes, requests time out, the cache fills but half the clients have already given up. This is the classic thundering herd.
Cache stampede on warm-up. You deploy. The cache is empty. Every request is a miss. The database is overwhelmed. You roll back the deploy, but now the cache is still empty on the old version and the database is still overwhelmed. Recovery takes much longer than the deploy.
Inconsistent cache across nodes. You have a Redis cluster. You invalidate a key. One node drops it. The other nodes still return it because the invalidation message was delayed. Users see different data depending on which node their request hit.
What to Do
- Write-through cache invalidation — when the application writes to the database, it writes to the cache in the same operation (or invalidates the cache entry).
- Short TTLs with background refresh for hot data. Background refresh means a separate job re-populates the cache before it expires, so clients never see a miss on hot keys.
- Cache stampede protection via lock-based population (one client populates, others wait) or stale-while-revalidate (serve stale for a few seconds while the refresh happens). A sketch of both follows this list.
- Cache warming on deploy — populate the cache with known-hot keys before accepting traffic. Don't rely on the cache filling from user traffic.
- Consistent hashing for distributed caches so that adding or removing a node only affects a small fraction of keys.
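Lock-based population and stale-while-revalidate are easier to see in code than in prose. Here is a minimal in-process sketch in Python; in a distributed cache the lock would live in Redis or behind a singleflight-style helper rather than in process memory, and the class and parameter names here are ours.

```python
import threading
import time
from typing import Callable

class StampedeGuard:
    """One caller refreshes an expired entry; everyone else keeps getting the
    stale value until the refresh lands."""

    def __init__(self, ttl: float, stale_grace: float):
        self.ttl = ttl                  # freshness window, seconds
        self.stale_grace = stale_grace  # how long stale data may still be served
        self._cache: dict[str, tuple[float, object]] = {}
        self._locks: dict[str, threading.Lock] = {}
        self._meta = threading.Lock()

    def _lock_for(self, key: str) -> threading.Lock:
        with self._meta:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key: str, load: Callable[[], object]) -> object:
        now = time.monotonic()
        entry = self._cache.get(key)
        if entry and now - entry[0] < self.ttl:
            return entry[1]                      # fresh hit
        lock = self._lock_for(key)
        if lock.acquire(blocking=False):         # we won the right to refresh
            try:
                value = load()                   # exactly one call hits the database
                self._cache[key] = (time.monotonic(), value)
                return value
            finally:
                lock.release()
        if entry and now - entry[0] < self.ttl + self.stale_grace:
            return entry[1]                      # serve stale while the refresh runs
        with lock:                               # cold key, no stale copy: wait, re-check
            entry = self._cache.get(key)
            if entry and time.monotonic() - entry[0] < self.ttl:
                return entry[1]
            value = load()
            self._cache[key] = (time.monotonic(), value)
            return value
```

Usage is one line per read, e.g. guard.get(f"profile:{user_id}", lambda: load_profile_from_db(user_id)), where load_profile_from_db is whatever your data layer calls the real query.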
3. Distributed Transactions
You need to charge a customer, decrement inventory, and send a confirmation email. In a single database, this is a transaction. In a microservices architecture, charge is in one service, inventory is in another, email is in a third. You can't wrap them in a database transaction.
Why Naive Solutions Fail
- "Just call each service sequentially and roll back on failure" — The rollback steps can fail too. Now you have a charge without an inventory decrement, and you can't reverse the charge because the refund service is down.
- "Just use two-phase commit" — 2PC exists but has serious availability problems. Every participant has to be available during the transaction, and the coordinator is a single point of failure. Modern distributed systems almost universally avoid it.
- "Just hope failures are rare" — They aren't. At scale, rare is several times per hour.
What Actually Works
- Saga pattern. A workflow of steps, each of which has a compensating action. If step 3 fails, run the compensating actions for steps 1 and 2. Compensations are business logic, not technical rollback — "refund the charge" not "undo the write." A sketch follows this list.
- Outbox pattern. The service writes to its database and to an "outbox" table in the same transaction. A separate process reads the outbox and publishes events. Guarantees at-least-once delivery without distributed transactions.
- Idempotency everywhere. Every operation needs to be safe to retry. Use idempotency keys, natural keys, or deduplication.
- Eventual consistency with clear SLAs. Your UI acknowledges that the operation is in progress, shows status, and handles failures explicitly. "Your order has been placed and is being processed" is honest and avoids pretending you have synchronous transactions.
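To make the saga idea concrete, here is a minimal orchestration sketch in Python. It assumes an in-process orchestrator and made-up Step/run_saga names; a production saga persists its progress durably so a crash mid-workflow can resume or compensate, which this sketch does not show.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    action: Callable[[dict], None]      # forward call, e.g. charge the card
    compensate: Callable[[dict], None]  # business-level undo, e.g. refund the charge

def run_saga(steps: list[Step], ctx: dict) -> bool:
    """Run the steps in order; if one fails, run the compensations for every
    step that already succeeded, in reverse order."""
    completed: list[Step] = []
    for step in steps:
        try:
            step.action(ctx)
            completed.append(step)
        except Exception:
            for prev in reversed(completed):
                # Compensations can fail too; a real system retries them from
                # durable state instead of firing them once and hoping.
                prev.compensate(ctx)
            return False
    return True
```

The order example from above becomes a list of steps: the charge with a refund as its compensation, the inventory decrement with a restock, and the confirmation email with a no-op.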
This is harder to design than it looks. Budget accordingly.
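Some of the work is mechanical, though. The outbox pattern in particular is a small amount of code; below is a minimal sketch using SQLite as a stand-in for the service's own database, with an illustrative schema, topic name, and relay function.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect("orders.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY, body TEXT);
    CREATE TABLE IF NOT EXISTS outbox (
        id TEXT PRIMARY KEY, topic TEXT, payload TEXT, published INTEGER DEFAULT 0
    );
""")

def place_order(order: dict) -> str:
    """The business row and its event land in ONE local transaction:
    either both exist or neither does, with no distributed commit."""
    order_id = str(uuid.uuid4())
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, json.dumps(order)))
        conn.execute(
            "INSERT INTO outbox (id, topic, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "order.placed", json.dumps({"order_id": order_id})),
        )
    return order_id

def relay_outbox(publish) -> None:
    """Run from a separate relay process: push unpublished events to the broker,
    then mark them. A crash between publish and mark re-sends the event, which
    is why consumers must be idempotent (at-least-once delivery)."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for event_id, topic, payload in rows:
        publish(topic, payload)
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (event_id,))
```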
The Meta-Lesson
The common thread: scalability is not really about scaling up. It's about what happens when one of your assumptions breaks at scale. Your assumption of uniform load breaks when a few users dominate traffic. Your assumption of cache consistency breaks when the cache and database disagree. Your assumption of atomic operations breaks when operations span services.
Designing for scale means designing for what happens when assumptions break. That's mostly about isolation, idempotency, and graceful degradation, not throughput benchmarks.
What We'd Actually Do
For a team expecting meaningful scale:
- Load test with realistic distributions. Not uniform. Use production traces if you have them.
- Per-partition monitoring for every sharded system.
- Write-through cache invalidation as the default.
- Saga pattern for any multi-service workflow.
- Idempotency on every write endpoint. Natural keys or idempotency tokens (a sketch follows this list).
- Chaos engineering to confirm the degradation paths work when things fail.
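The idempotency item deserves its own sketch, because the core of the technique is small. Below, SQLite again stands in for the service's database and a stub stands in for the real payment-provider call; the table and function names are illustrative.

```python
import sqlite3

conn = sqlite3.connect("payments.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS processed (idempotency_key TEXT PRIMARY KEY, response TEXT)"
)

def call_payment_provider(amount_cents: int, card_token: str) -> str:
    return f"charged {amount_cents} cents on {card_token}"  # stand-in for the real call

def charge(idempotency_key: str, amount_cents: int, card_token: str) -> str:
    """Retry-safe write: the first call with a given key does the work;
    every retry with the same key replays the stored result."""
    row = conn.execute(
        "SELECT response FROM processed WHERE idempotency_key = ?", (idempotency_key,)
    ).fetchone()
    if row:
        return row[0]  # duplicate request: return the original outcome, charge nothing

    response = call_payment_provider(amount_cents, card_token)
    with conn:
        # Note: two truly concurrent duplicates can race past the SELECT above.
        # A production version reserves the key first (insert a 'pending' row,
        # do the work, then update it) so the PRIMARY KEY constraint, not the
        # SELECT, is what prevents double charging.
        conn.execute("INSERT INTO processed VALUES (?, ?)", (idempotency_key, response))
    return response
```

Clients generate the idempotency key (typically a UUID per logical operation) and send the same key on every retry of that operation.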
Three Takeaways
- Hot partitions are the most common scaling failure mode. Monitor them specifically, not just averages.
- Cache invalidation bugs are production incidents waiting to happen. Design for staleness and stampedes up front.
- Distributed transactions don't work. Design the system so you don't need them.
Talk with us about your infrastructure
Schedule a consultation with a solutions architect.