E-Commerce on Cloud Infrastructure: The Boring Decisions That Determine Black Friday
What actually breaks under peak e-commerce load — and the unglamorous infrastructure decisions that separate sites that survive from sites that don't.

Everyone has opinions about e-commerce platforms. Shopify vs. BigCommerce vs. Magento vs. custom. Those conversations matter less than people think. What actually determines whether your site stays up during a 20x traffic spike is a series of infrastructure decisions most teams don't make until they're already bleeding. Here are the ones that matter.
1. Your Checkout Path Is a Separate Service, Whether You Designed It That Way or Not
When traffic spikes, the browse path and the checkout path need different scaling strategies. Browse is cacheable, mostly read, tolerates eventual consistency, and can be served from the edge. Checkout is stateful, transactional, price-sensitive, and cannot be cached. A homepage that can serve 10,000 requests per second from a CDN tells you nothing about whether your payment flow survives 500 concurrent checkouts.
Treat them as separate concerns:
- Catalog, search, product detail pages: aggressive CDN caching, generous TTLs, edge rendering where possible. These should survive any traffic level because they're not touching your database.
- Cart and checkout: dedicated application capacity, database connections reserved, rate limiting at the front door so a bot flood doesn't eat the connection pool that real customers need.
Most e-commerce outages we've investigated happened because the catalog and the checkout were sharing a database connection pool, and a surge in catalog traffic starved checkout. The technical fix is small. Teams just never think about it until the post-mortem.
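One way to make the separation concrete is to give checkout its own reserved connection budget so a catalog surge can never starve it. A minimal in-process sketch, assuming a bounded pool per concern (the pool sizes and the string stand-in for a real database connection are illustrative, not tuned values):

```python
import threading
from contextlib import contextmanager

class ConnectionPool:
    """Minimal bounded pool: acquire blocks (with a timeout) when the
    budget is exhausted instead of opening unbounded connections."""
    def __init__(self, name, size, timeout=2.0):
        self.name = name
        self._slots = threading.BoundedSemaphore(size)
        self._timeout = timeout

    @contextmanager
    def connection(self):
        if not self._slots.acquire(timeout=self._timeout):
            raise TimeoutError(f"{self.name} pool exhausted")
        try:
            # A real pool would hand out a live database connection here.
            yield f"conn-from-{self.name}"
        finally:
            self._slots.release()

# Separate budgets: catalog can saturate its own pool without touching
# the connections checkout depends on.
catalog_pool = ConnectionPool("catalog", size=40)
checkout_pool = ConnectionPool("checkout", size=10)
```

The same effect can be achieved with two PgBouncer databases or two ORM engines; the point is that the budgets are disjoint.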
2. Inventory Is the Hardest Problem and Everyone Underestimates It
Inventory is the one place where "eventually consistent" is not an option. Overselling 200 units of an $800 item is far worse than a checkout that takes 30 seconds longer. Underselling (showing out-of-stock for items you actually have) carries a real cost too.
The patterns that work:
- Reserve on add-to-cart, decrement on order commit. Use a short TTL on reservations (10-15 minutes) and release expired ones. This handles abandoned carts.
- One authoritative store for inventory counts. Redis or PostgreSQL with row-level locks, not a microservice with eventual consistency.
- Circuit breaker for inventory reads. If the inventory service is degraded, fall back to "in stock" display but block add-to-cart. Showing stale availability is better than showing none at all.
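The reserve-on-add-to-cart pattern can be sketched in a few dozen lines. This is a single-process illustration with an in-memory store guarded by a lock; in production the same logic would live in Redis or in PostgreSQL with row-level locks, and the class and method names here are illustrative assumptions:

```python
import time
import threading
import uuid

class InventoryStore:
    def __init__(self, counts, reservation_ttl=15 * 60):
        self._lock = threading.Lock()
        self._counts = dict(counts)    # sku -> available units
        self._reservations = {}        # token -> (sku, qty, expires_at)
        self._ttl = reservation_ttl

    def _expire(self, now):
        # Release holds whose TTL has lapsed (abandoned carts).
        for token, (sku, qty, expires) in list(self._reservations.items()):
            if expires <= now:
                self._counts[sku] += qty
                del self._reservations[token]

    def reserve(self, sku, qty, now=None):
        """Hold units on add-to-cart. Returns a token, or None if short."""
        now = now if now is not None else time.monotonic()
        with self._lock:
            self._expire(now)
            if self._counts.get(sku, 0) < qty:
                return None
            self._counts[sku] -= qty
            token = uuid.uuid4().hex
            self._reservations[token] = (sku, qty, now + self._ttl)
            return token

    def commit(self, token):
        """Turn a reservation into a sale on order commit."""
        with self._lock:
            return self._reservations.pop(token, None) is not None
```

Note that all mutation goes through one lock on one authoritative store; a real system would also re-check the TTL on commit.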
The common mistake is treating inventory as just another database table. It isn't. It's the hottest contention point in your system during exactly the moments that matter most.
3. Database Connections Are Not Infinite
Every e-commerce Black Friday post-mortem we've read — and we've read a lot — includes the phrase "the database connection pool was exhausted." This is a solved problem and teams still fall into it every year.
The rules:
- Use a connection pooler. PgBouncer for PostgreSQL, ProxySQL for MySQL. Your application does not connect directly to the database at scale.
- Size the pool based on database capacity, not application concurrency. A 4-core PostgreSQL box doesn't want 500 connections. It wants 50. The pooler queues the rest.
- Set connection timeouts aggressively. A hung connection is worse than a rejected request because it holds a slot forever.
- Separate read replicas for read-heavy paths. Product detail pages don't need to touch the primary.
Managed services help but don't solve it. Even on Aurora or Cloud SQL, you still need a pooler in front for high-concurrency workloads.
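A minimal PgBouncer fragment showing the shape of these rules. The numbers are illustrative assumptions for a small primary, not recommendations; size them against your own load tests:

```ini
; Illustrative PgBouncer fragment -- values are examples, not tuned.
[pgbouncer]
pool_mode = transaction        ; release the server connection at commit
default_pool_size = 50         ; sized to the database, not the app tier
max_client_conn = 2000         ; app connections queue here instead
query_wait_timeout = 5         ; fail fast rather than hold a slot
server_idle_timeout = 60
```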
4. Cache Everything You Can, Be Honest About What You Can't
E-commerce has a beautiful property: most of what users see is the same for everyone. Product pages, category pages, search results (for logged-out users), the homepage. All of this can be cached at the edge with long TTLs.
What can't be cached:
- Personalized prices, inventory counts, cart state, "you've viewed this" recommendations.
The trick is designing the page so the cacheable shell loads instantly from the edge, and the personalized fragments load via a separate request to an API that can itself be heavily optimized. This is the pattern Shopify, Amazon, and most high-traffic e-commerce sites have converged on. Fight the urge to render everything server-side from a single request.
A Varnish or CDN cache hit rate of 90%+ on product pages is achievable and changes the economics of the whole system. Your origin servers serve 10% of the traffic. Your database barely notices the holiday.
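The shell/fragment split ultimately comes down to response headers: the shared shell is edge-cacheable, the personalized fragment API is explicitly not. A sketch, where the route prefixes and TTLs are illustrative assumptions:

```python
def cache_headers(path: str) -> dict:
    """Return Cache-Control headers for the shell/fragment split."""
    if path.startswith(("/product/", "/category/", "/search")):
        # Shared shell: let the CDN hold it, serve stale while refreshing.
        return {"Cache-Control": "public, max-age=300, stale-while-revalidate=60"}
    if path.startswith(("/api/cart", "/api/price", "/api/inventory")):
        # Personalized fragments: never cached at the edge.
        return {"Cache-Control": "private, no-store"}
    # Default: short shared TTL.
    return {"Cache-Control": "public, max-age=60"}
```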
5. Rate Limit Before the Attack, Not During
Bot traffic is a given in e-commerce. Scrapers, sneaker bots, checkout bots, price monitors. A sudden surge of traffic during a sale is often half legitimate customers and half bots racing for limited inventory. If your rate limiting is configured for normal traffic, your origin servers will be drowning in bot requests at exactly the moment real customers need them.
What actually works:
- WAF with bot detection at the edge. Cloudflare, AWS WAF, Azure Front Door. Block or challenge before traffic reaches your origin.
- Graduated rate limits. Different limits for logged-in users vs. anonymous, for cart endpoints vs. browse endpoints.
- JavaScript challenge on checkout initiation. Annoys users slightly, kills 90% of bot checkout attempts.
- Honeypot URLs in robots.txt. Any IP that touches them gets a harder rate limit for the rest of the session.
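Graduated limits can be expressed as a token bucket per (client class, endpoint class). A sketch with illustrative numbers, not recommendations; in practice this logic runs at the edge or WAF, not in application code:

```python
import time

# (client class, endpoint class) -> (burst capacity, refill per second)
LIMITS = {
    ("anonymous", "browse"):   (60, 1.0),
    ("anonymous", "checkout"): (5, 0.05),
    ("logged_in", "browse"):   (120, 2.0),
    ("logged_in", "checkout"): (20, 0.2),
}

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now=None):
        """Spend one token if available, refilling based on elapsed time."""
        now = now if now is not None else time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {key: TokenBucket(*cfg) for key, cfg in LIMITS.items()}
```

Note the asymmetry: anonymous checkout gets a small burst and a slow refill, because that is the path bots race for and the path your connection pool pays for.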
The rate limiting rules we ship to customers have a long list of exceptions and tunings, but the principle is simple: protect the expensive paths first, and protect them at the edge before they reach your application.
6. Have a Surge Runbook, and Run the Drill
Everyone who runs e-commerce says they have a surge plan. Most teams actually have a document nobody has read since 2022. The teams that survive their busy seasons run the drill.
A real runbook has:
- Pre-scaled capacity for the web tier (don't trust autoscaling to react fast enough to a flash sale)
- Database connection pool sizes validated against load test results, not guessed
- A clear "feature flag off" list for non-essential features (recommendations, personalization, analytics) that can be disabled instantly to shed load
- A known-good read-only mode for the site if writes fail
- On-call coverage during the window, not "whoever gets paged"
- A post-event review on the calendar before the event
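The "feature flag off" list works best when the flags live in one store with a defined shed order, so load can be dropped in one call instead of six console sessions. A sketch under that assumption (the flag names are illustrative):

```python
# Non-essential features, ordered from first-to-shed to last-to-shed.
SHED_ORDER = ["recommendations", "personalization", "analytics"]

class FlagStore:
    def __init__(self):
        self.flags = {name: True for name in SHED_ORDER}

    def enabled(self, name):
        return self.flags.get(name, False)

    def shed(self, levels=1):
        """Disable the first `levels` non-essential features to drop load."""
        for name in SHED_ORDER[:levels]:
            self.flags[name] = False
```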
The last one matters more than the others. The teams that improve year over year are the ones that run a retrospective within a week while memories are fresh.
What Holidays Actually Look Like When You Do It Right
A retail customer we worked with moved from a single-VM monolith to the architecture described above over about four months. The next Black Friday had 14x their normal traffic. The site served it with:
- 2 additional web tier replicas (from 6 to 8)
- Zero database tier changes
- Cache hit rate of 94% on product pages
- Checkout p95 latency of 340 ms (up from 290 ms on a normal day)
- Zero oversold items
The prep work was three weeks of load testing, connection pool tuning, and cache validation. The actual event was quiet. That's the goal. Boring Black Friday is good Black Friday.
Three Takeaways
- Browse and checkout have different scaling profiles. Treat them as separate concerns or one will starve the other when it matters.
- Inventory is a real problem that deserves dedicated design. It's the single biggest source of post-mortem regret in e-commerce outages.
- Cache aggressively, rate limit at the edge, run the drill. None of this is novel. All of it separates the sites that survive peak from the ones that don't.
Talk with us about your infrastructure
Schedule a consultation with a solutions architect.