
Data Center Scalability: Maximizing Value of Modern Facilities

Scalability in a data center isn't a single knob you turn. It's six distinct problems, and the facility that gets all six right is rarer than the brochures suggest.

John Lane · 2024-12-13 · 5 min read

"Scalable" is the most abused word in infrastructure marketing. Every colo, every cloud, every hyperconverged appliance vendor claims it. And yet, every year we walk into customer sites where growth has hit a wall that nobody saw coming — a wall made of power density, a cooling curve, a storage fabric, a license ceiling, or a single 10-gig uplink that was fine until it wasn't.

After 23 years of building and operating data centers for customers, I've come to believe scalability is not one problem. It's six. A facility is only as scalable as its weakest dimension, and most shops never audit all six until one of them breaks.

The Six Dimensions That Actually Matter

1. Power density

The dirty secret of the modern data center is that rack power has quietly doubled and then doubled again. A cabinet that drew 4 kW ten years ago now draws 12 kW for a mid-tier GPU box and 25 kW plus for a training node. If your facility was designed for 5 kW per rack and you're trying to grow into AI workloads, you don't have a scalability problem — you have a rebuild problem.

Before you sign any colo contract today, ask for the sustained (not peak) kW per cabinet, ask whether the PDUs and branch circuits can actually deliver it, and ask whether the UPS and generator tiers scale with it. "Up to 20 kW" usually means "in one row, if nobody else orders the same."
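
To put numbers on that conversation, here's a minimal sketch of the check worth running before the contract goes out: what a rack of gear actually draws sustained versus what the facility will commit to per cabinet. Every figure below is an illustrative assumption, not a measurement from any particular site.

```python
# Rough rack power budgeting sketch. All figures are illustrative assumptions,
# not measurements from any particular facility.

def sustained_rack_kw(servers_per_rack: int, avg_server_watts: float,
                      overhead_factor: float = 1.10) -> float:
    """Estimate sustained draw per cabinet, with ~10% for switches, fans, and PDU losses."""
    return servers_per_rack * avg_server_watts * overhead_factor / 1000.0

# Example: 8 mid-tier GPU nodes averaging 1.4 kW each under sustained load
demand_kw = sustained_rack_kw(servers_per_rack=8, avg_server_watts=1400)

facility_commit_kw = 12.0   # what the colo will commit to per cabinet, sustained
print(f"Demand: {demand_kw:.1f} kW, facility commit: {facility_commit_kw:.1f} kW")
if demand_kw > facility_commit_kw:
    print("Growing into this workload means stranded rack space or a rebuild.")
```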

2. Cooling headroom

Power in equals heat out. CRAC and CRAH units don't scale linearly — they scale in steps, and each step is a capital project. Hot aisle containment, rear-door heat exchangers, and eventually direct liquid cooling are the progression. If your facility is still running open aisles with perforated tiles, your cooling ceiling is lower than your power ceiling, and you'll hit it first.
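
The arithmetic is unforgiving. A quick back-of-envelope check, using illustrative rack counts and the standard kW-to-tons conversion, shows how fast heat load outruns installed CRAH capacity:

```python
# Back-of-envelope cooling check: power in equals heat out.
# Rack counts and CRAH sizes are illustrative, not sizing guidance.

KW_PER_TON = 3.517                     # 1 ton of refrigeration ~= 3.517 kW of heat
racks, kw_per_rack = 40, 12
heat_kw = racks * kw_per_rack          # 480 kW of IT load becomes 480 kW of heat
tons_required = heat_kw / KW_PER_TON   # ~136 tons

crah_units_installed = 4
tons_per_crah = 30                     # CRAHs scale in steps, not smoothly
tons_installed = crah_units_installed * tons_per_crah

print(f"Heat load: {tons_required:.0f} tons, installed: {tons_installed} tons")
# Each additional step is a capital project, which is why the cooling ceiling
# usually arrives before the power ceiling does.
```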

3. Network fabric

The fastest way to turn a scalable design into a bottleneck is an oversubscribed top-of-rack. Leaf-spine Clos fabrics are the default now for a reason: they scale east-west at line rate. But the spine switches, the optics budget, and the cable trays all need to grow with the rack count. A 100 Gbps spine feels generous until you add a few dozen storage nodes doing erasure coding reconstruction at 3 AM.
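
Oversubscription is easy to compute and easy to forget. Here's the check, sketched with assumed port counts and speeds rather than any specific switch model:

```python
# Oversubscription check at the leaf: downlink bandwidth toward servers vs
# uplink bandwidth toward the spine. Port counts and speeds are assumptions.

server_ports, server_speed_gbps = 48, 25      # 48 x 25G toward the rack
uplink_ports, uplink_speed_gbps = 6, 100      # 6 x 100G toward the spine

downlink = server_ports * server_speed_gbps   # 1200 Gbps
uplink = uplink_ports * uplink_speed_gbps     # 600 Gbps
ratio = downlink / uplink

print(f"Leaf oversubscription: {ratio:.1f}:1")   # 2.0:1 in this example
# 2:1 or 3:1 is often tolerable for general compute; storage rebuilds and
# GPU east-west traffic will expose anything much worse.
```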

4. Storage throughput and IOPS, not just capacity

Every vendor quotes capacity. Few quote sustained throughput under mixed workloads. A petabyte-scalable array that delivers great numbers on a single-client benchmark can collapse to a fraction of that when 200 VMs simultaneously boot or a backup window opens. If you're planning for growth, benchmark your storage at target load before you sign, not after.
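
A demand-side estimate is worth doing before that benchmark conversation, so you know what number the array actually has to hit. The per-VM figures below are illustrative assumptions, not measurements:

```python
# Rough demand-side estimate for a boot storm, to compare against the array's
# measured (not quoted) numbers. Per-VM figures are illustrative assumptions.

vms = 200
boot_read_iops_per_vm = 400        # small random reads during OS boot
boot_read_mbps_per_vm = 25

aggregate_iops = vms * boot_read_iops_per_vm   # 80,000 IOPS
aggregate_mbps = vms * boot_read_mbps_per_vm   # 5,000 MB/s

print(f"Boot storm demand: ~{aggregate_iops:,} IOPS, ~{aggregate_mbps:,} MB/s")
# Compare against throughput measured at target capacity utilization with a
# mixed read/write profile, not the single-client, empty-array benchmark.
```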

5. Control plane and orchestration

vCenter, Nutanix Prism, Proxmox cluster, Kubernetes control plane — pick your poison. Each has practical limits. vCenter can theoretically manage thousands of hosts but gets painful well before that. Kubernetes etcd starts to suffer past a few thousand nodes. Your orchestration layer scales differently than your hardware, and most teams discover this only when their dashboards start timing out.

6. Operational capacity

This is the one nobody puts on a spec sheet. If you double your rack count and don't double your hands, your change velocity goes to zero. A "scalable" facility with a shrinking ops team is less scalable than a smaller one that's well-staffed. Automation helps, but automation is also a project with a staffing cost.

The Pattern We See Over and Over

A customer buys a data center or colo footprint sized for today plus 50 percent. Three years later they're at 80 percent power utilization, 90 percent cooling utilization, and the network team is quietly begging for a spine refresh. Each dimension is fixable in isolation. Fixing them together, under a compressed timeline, during business hours, is where budgets explode and careers end.

The better pattern is what we now call "balanced growth headroom." You don't oversize any one dimension by 2x — you oversize every dimension by a modest factor and audit them quarterly. 30 percent headroom on power, 30 percent on cooling, 40 percent on network throughput, 50 percent on storage IOPS, 25 percent on control-plane capacity, and a clear staffing trigger that fires before the ops team is underwater. It's less glamorous than a "hyperscale-ready" sales deck, but it actually works.
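
If you want to operationalize that quarterly audit, a minimal sketch might look like the following. The headroom targets mirror the percentages above; the utilization numbers are placeholders that would come from your own monitoring.

```python
# Minimal sketch of a quarterly "balanced growth headroom" audit.
# Utilization values are placeholders, not measurements.

target_headroom = {            # fraction of capacity kept free, per dimension
    "power": 0.30, "cooling": 0.30, "network": 0.40,
    "storage_iops": 0.50, "control_plane": 0.25,
}

current_utilization = {        # placeholder values from monitoring
    "power": 0.78, "cooling": 0.74, "network": 0.55,
    "storage_iops": 0.62, "control_plane": 0.70,
}

for dim, headroom in target_headroom.items():
    ceiling = 1.0 - headroom
    used = current_utilization[dim]
    status = "OVER" if used > ceiling else "ok"
    print(f"{dim:>14}: {used:.0%} used, trigger at {ceiling:.0%} -> {status}")
# The dimension that trips first is the one that sets your real expansion date.
```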

Liquid Cooling Is No Longer Optional

If you're building or refreshing a data center in 2024 and you're not at least reserving the plumbing for direct liquid cooling, you're designing a facility that will be functionally obsolete inside five years. The math on GPU-dense workloads doesn't work with air cooling past roughly 30 kW per rack, and the economics of retrofitting liquid into a live hall are brutal.

You don't have to deploy liquid on day one. But you do need to land the chilled water loop, size the CDU pads, and leave the floor space. Retrofits cost three to five times what a new build costs, and they cost you uptime while you're doing it.
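
For a sense of why air gives out around that 30 kW mark, here's the standard sensible-heat airflow arithmetic, with an assumed 20°F supply-to-return delta:

```python
# Why air runs out of road around 30 kW per rack: the airflow one cabinet
# would need. Delta-T is an assumption; the conversion factors are standard.

rack_kw = 30
delta_t_f = 20                          # supply-to-return temperature rise, deg F
btu_per_hr = rack_kw * 1000 * 3.412     # watts -> BTU/hr
cfm_required = btu_per_hr / (1.08 * delta_t_f)

print(f"~{cfm_required:,.0f} CFM through one cabinet")   # roughly 4,700 CFM
# Pushing that much air through a single rack means fan power, noise, and
# pressure problems that liquid simply sidesteps.
```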

Software Is Half the Story

Hardware scalability buys you nothing if the software layer can't use it. A hyperconverged platform that scales to 64 nodes on paper but degrades at 40 because of metadata contention is not scalable past 40 — it's just labeled that way. This is where we see the biggest gap between vendor claims and customer reality. Test the scaling curve before you commit the purchase order. Put real workloads on it at 25 percent, 50 percent, 75 percent, and 100 percent of target size. If the vendor won't lab that for you, that's your answer.
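
One way to make that lab exercise concrete is to look at per-node efficiency across the four run sizes rather than the headline total. The throughput figures below are placeholders; the shape of the curve is what matters.

```python
# Sketch of reading a scaling curve from lab runs at 25/50/75/100% of target
# node count. Throughput numbers are placeholders, not any vendor's results.

runs = [                      # (node_count, measured aggregate throughput)
    (16, 16_000),             # ops/sec or MB/s, whatever your workload measures
    (32, 31_000),
    (48, 42_000),
    (64, 47_000),
]

base_nodes, base_tput = runs[0]
base_per_node = base_tput / base_nodes

for nodes, tput in runs:
    efficiency = (tput / nodes) / base_per_node
    print(f"{nodes:>3} nodes: {tput:>7,} total, {efficiency:.0%} of linear")
# If efficiency falls off a cliff before your target size, the platform's real
# limit is where the cliff is, not where the spec sheet says.
```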

For private cloud specifically, Proxmox, VMware vSAN, Nutanix, and Ceph-based stacks all have different break points. We've put all of them in production. The one with the longest linear scaling curve, in our hands, depends more on workload shape than on brand.

A Quick Checklist Before You Sign Anything

If you're about to buy a new facility or expand an existing one, here are the questions I'd want answered in writing:

  • Sustained kW per cabinet (not burst), and how much of the hall can run at that density simultaneously.
  • Cooling approach and the kW-per-rack ceiling before liquid is required.
  • Network oversubscription ratio at the leaf and spine, and the optics budget for growth.
  • Storage performance at 80 percent capacity utilization under mixed workload, not empty array benchmarks.
  • Control-plane tested limits, not spec-sheet limits.
  • Operational staffing model at your target size, including on-call rotation.

Any vendor that can answer all six honestly is the vendor you want. Any vendor that treats those questions as hostile is telling you something important.

Three Takeaways

  1. Scalability is multi-dimensional. A facility that's scalable on power but not cooling, or network but not orchestration, is not scalable.
  2. Test the scaling curve, don't trust the spec. Every platform breaks somewhere before the brochure says it will.
  3. Reserve plumbing for liquid cooling today. The retrofit cost curve only goes up, and it goes up fast.

Talk with us about your infrastructure

Schedule a consultation with a solutions architect.
