HPC in the Cloud: Where Bursting Beats Owning
Cloud HPC is brilliant for bursty workloads and a waste of money for steady-state ones. Here is how to tell the difference.

High-performance computing in the cloud is a perfect example of a technology whose pricing rewards certain workloads and punishes others. If you understand which is which, cloud HPC is a legitimate superpower. If you do not, you will burn through a research budget in two months and end up back on-prem wondering what happened.
The Honest Economics
A modest HPC cluster — say 500 cores, 2 TB of RAM, 100 TB of parallel storage, a decent interconnect — will cost something like $300K to $500K as a capex purchase with a three- to five-year useful life. Running the equivalent continuously on AWS or Azure will cost roughly $600K to $900K per year. On paper, owning wins by a wide margin.
What that calculation hides is utilization. If your cluster runs at 20 percent utilization — which is charitable for most research and engineering shops — you are paying for four idle cores for every one that is working. The cloud flips that. You pay only for cores that are computing. The break-even point is usually somewhere between 30 and 50 percent sustained utilization depending on the instance type and the discount program you can negotiate.
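Here is a back-of-the-envelope version of that break-even. Every figure is an illustrative assumption; swap in your own quotes and negotiated discounts before drawing conclusions:

```python
# Back-of-the-envelope break-even between owning and renting HPC capacity.
# Every figure below is an illustrative assumption, not a quote.

capex = 450_000                      # assumed cluster purchase price ($)
useful_life_years = 4                # assumed depreciation period
opex_per_year = 130_000              # assumed power, cooling, colo, admin ($/yr)
cloud_full_time_per_year = 650_000   # assumed cost of equivalent capacity running 24/7 ($/yr)

on_prem_per_year = capex / useful_life_years + opex_per_year

# Cloud spend scales (roughly) with utilization; the owned cluster costs the
# same whether it is busy or idle.
for utilization in (0.10, 0.20, 0.30, 0.40, 0.50, 0.70):
    cloud_per_year = cloud_full_time_per_year * utilization
    winner = "cloud" if cloud_per_year < on_prem_per_year else "on-prem"
    print(f"{utilization:>4.0%} sustained: cloud ${cloud_per_year:>9,.0f}/yr "
          f"vs on-prem ${on_prem_per_year:>9,.0f}/yr -> {winner}")

print(f"break-even ≈ {on_prem_per_year / cloud_full_time_per_year:.0%} sustained utilization")
```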
This is why bursty workloads win in the cloud and steady-state workloads win on-prem. The test is not "do I need HPC" — it is "how much of the time do I actually need HPC."
Workloads Where Cloud HPC Earns Its Keep
Design of experiments and parameter sweeps
You have a simulation that runs in 4 hours on a single node and you need to sweep 200 parameter combinations. On-prem, that is 800 node-hours: more than three days of wall clock even with a ten-node cluster entirely to yourself, and weeks once you are queuing behind everyone else. On cloud, spin up 200 instances, run them in parallel, have results in 4 hours, and pay for 800 instance-hours. The total cost is often a few hundred dollars and the wall-clock savings rearrange the research calendar.
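One way to launch that kind of sweep is an AWS Batch array job, submitted with boto3. A hedged sketch: the region, job queue, and job definition names are placeholders for resources you would already have set up, and mapping each array index to one parameter combination is an assumption about how your container reads its work:

```python
import boto3

batch = boto3.client("batch", region_name="us-east-1")  # region is an assumption

# Submit one array job with 200 child tasks. Each child reads its
# AWS_BATCH_JOB_ARRAY_INDEX environment variable and runs the parameter
# combination at that index. The queue and job definition names below are
# placeholders for resources you would have created beforehand.
response = batch.submit_job(
    jobName="param-sweep",
    jobQueue="hpc-burst-queue",                     # hypothetical job queue
    jobDefinition="simulation:3",                   # hypothetical job definition
    arrayProperties={"size": 200},                  # 200 parameter combinations
    timeout={"attemptDurationSeconds": 6 * 3600},   # generous cap per child
)
print("submitted array job:", response["jobId"])
```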
Monthly or quarterly runs
Risk modeling, end-of-quarter financial simulations, annual reservoir models. These jobs run for a day or two and then the hardware sits idle for weeks. Cloud is obvious here.
Rendering farms
Rendering is embarrassingly parallel, rendering nodes are expensive to buy, and demand is cyclical. We have seen studios move entirely to cloud rendering and never look back.
Genomics and bioinformatics pipelines
Most sequencing pipelines are batch jobs with periods of heavy use and long idle stretches. Terra, DNAnexus, and AWS HealthOmics have legitimate integrations with the standard tools.
Workloads Where Cloud HPC Does Not Work
Tightly-coupled MPI at scale
If your code is running 1,000-rank MPI jobs over InfiniBand, cloud HPC is still a compromise. AWS has EFA, Azure has HB/HC-series with InfiniBand, GCP has Compute Engine with high-bandwidth networking — but the latency and consistency are not quite the same as a purpose-built cluster with a tuned fabric. For latency-sensitive CFD, molecular dynamics, or weather modeling, the cloud will work but you may lose 15 to 30 percent of scaling efficiency compared to a good on-prem cluster.
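The efficiency number sounds abstract until you convert it into core-hours. A toy calculation, with both efficiencies assumed rather than measured:

```python
# What a scaling-efficiency gap means in raw compute, using assumed numbers.
ideal_core_hours = 50_000    # core-hours the job would need at perfect scaling (assumption)

for fabric, efficiency in (("tuned on-prem interconnect", 0.95),
                           ("cloud interconnect", 0.78)):
    actual = ideal_core_hours / efficiency
    print(f"{fabric}: {actual:,.0f} core-hours "
          f"({actual / ideal_core_hours - 1:.0%} over ideal)")
```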
Long-running steady-state workloads
If your cluster runs at 70 percent utilization 24/7, you are paying the hyperscaler premium for no good reason. Own the hardware. Put it in a colo. The three-year TCO will be 40 to 60 percent lower.
Data-gravity-bound workloads
If your input data is 500 TB of seismic traces or instrument telemetry already sitting on a NAS in your facility, data movement becomes the dominant cost and the dominant schedule risk: ingress is slow even where it is free, and egress charges apply every time results or datasets come back out. The math changes entirely. Either commit to a full cloud-first pipeline (move the data once, keep it there) or accept that on-prem is the honest answer.
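Some rough arithmetic on why. The egress rate and link speed below are assumptions; check your provider's current price sheet and your actual uplink:

```python
# Rough data-movement math for a 500 TB dataset. Rates are illustrative
# assumptions; verify against current provider pricing.

dataset_tb = 500
dataset_gb = dataset_tb * 1_000

egress_per_gb = 0.08          # assumed internet egress rate ($/GB)
link_gbps = 10                # assumed dedicated link to the cloud region

# Cost of pulling the full dataset back out of the cloud even once:
print(f"one full egress: ${dataset_gb * egress_per_gb:,.0f}")

# Time just to move it in over the link (ingress is typically free, but not fast):
transfer_days = (dataset_gb * 8) / link_gbps / 3600 / 24   # GB -> gigabits -> seconds -> days
print(f"transfer time at {link_gbps} Gbps: ~{transfer_days:.1f} days")
```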
The Bursting Pattern That Works
The architecture we recommend more often than any other for HPC customers looks like this:
- An on-prem base cluster sized to the steady-state demand. Enough capacity for day-to-day work, interactive sessions, dev and test, and the 80 percent of jobs that do not need elasticity.
- A cloud burst target for the other 20 percent — the parameter sweeps, the big deadlines, the experimental runs that would otherwise block the shared queue.
- A workload manager that handles both — Slurm with cloud bursting plugins, PBS Pro, or LSF. The scheduler decides where jobs go based on resource availability and priority; a minimal sketch of that routing decision follows below.
- A shared data layer that both sides can read from. For large-scale genomics or seismic work, this usually means a cloud object store as the canonical location and an on-prem cache for active datasets.
This is not the most elegant architecture and it is not what the hyperscaler sales team wants to sell you. It is the one that survives an actual finance review.
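To make the scheduler bullet concrete, here is a hedged sketch of the routing decision. Real deployments express this as Slurm, PBS, or LSF policy rather than application code; the thresholds and job attributes are assumptions for illustration:

```python
# Hedged sketch of the "where does this job run" decision a burst-capable
# scheduler makes. Thresholds and job fields are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Job:
    cores: int
    interactive: bool          # interactive/dev sessions stay local
    deadline_critical: bool    # hard deadlines justify burst spend

def route(job: Job, free_onprem_cores: int, queue_wait_hours: float) -> str:
    """Return 'on-prem' or 'cloud-burst' for a single job."""
    if job.interactive:
        return "on-prem"        # latency matters, keep it local
    if job.cores <= free_onprem_cores and queue_wait_hours < 4:
        return "on-prem"        # fits now, no reason to pay burst rates
    if job.deadline_critical or queue_wait_hours > 24:
        return "cloud-burst"    # waiting costs more than the instances
    return "on-prem"            # default: wait in the local queue

# Example: a 2,000-core sweep against a mostly full local cluster
print(route(Job(cores=2_000, interactive=False, deadline_critical=True),
            free_onprem_cores=300, queue_wait_hours=36))   # -> cloud-burst
```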
Storage Is Where Cloud HPC Budgets Die
The mistake almost every first-time cloud HPC project makes is underestimating storage costs. Compute is metered in obvious units (hours, cores) but storage has a thicker pricing sheet: the GB-month charge, the request charges on object stores, the egress fee when data leaves the region, the snapshot charges, the provisioned throughput on parallel file systems.
A 100 TB FSx for Lustre file system costs substantially more per month than the amortized monthly cost of a 100 TB on-prem NAS. Lustre in the cloud is the right choice if you need its performance for a bursty workload, but it is not a place to park data between runs. Keep canonical data in S3 and hydrate a parallel file system only for active jobs.
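A hedged sketch of that comparison. The per-GB rates are assumptions roughly in the neighborhood of published list prices, and the NAS line amortizes an assumed purchase plus operations over five years:

```python
# Rough monthly cost of parking 100 TB, under illustrative assumed rates.
capacity_gb = 100 * 1_000

fsx_lustre_per_gb_month = 0.15      # assumed parallel file system rate ($/GB-month)
s3_standard_per_gb_month = 0.023    # assumed object storage rate ($/GB-month)
nas_capex = 120_000                 # assumed on-prem NAS purchase ($)
nas_opex_per_year = 15_000          # assumed support, power, admin ($/yr)
nas_life_years = 5

print(f"FSx for Lustre, parked 24/7: ${capacity_gb * fsx_lustre_per_gb_month:,.0f}/month")
print(f"S3 as the canonical store:   ${capacity_gb * s3_standard_per_gb_month:,.0f}/month")
print(f"On-prem NAS, amortized:      "
      f"${(nas_capex / nas_life_years + nas_opex_per_year) / 12:,.0f}/month")
```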
Spot Instances Are the Real Discount
For workloads that can tolerate interruption — parameter sweeps, rendering, most ML training with checkpointing — spot instances cut cloud HPC costs by 60 to 80 percent. The catch is that your workflow must handle preemption gracefully: checkpoint often, reschedule automatically, and avoid holding locks across instance lifetimes.
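What preemption tolerance looks like in the job itself is mostly mundane: checkpoint on a timer and again when the termination signal lands. A minimal sketch, where the checkpoint path and the per-step work are placeholders for your own pipeline:

```python
import json
import os
import signal
import sys

CHECKPOINT = "checkpoint.json"   # in practice this lives on shared or object storage
stop_requested = False

def request_stop(signum, frame):
    # Spot reclaim typically arrives as SIGTERM shortly before shutdown.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, request_stop)

# Resume from the last checkpoint if one exists, otherwise start fresh.
state = {"step": 0}
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT) as f:
        state = json.load(f)

TOTAL_STEPS = 10_000
while state["step"] < TOTAL_STEPS:
    state["step"] += 1           # placeholder for one unit of real simulation work

    # Checkpoint periodically, and immediately if a preemption warning arrived.
    if state["step"] % 100 == 0 or stop_requested:
        with open(CHECKPOINT, "w") as f:
            json.dump(state, f)
        if stop_requested:
            sys.exit(0)          # exit cleanly; the scheduler requeues the job

print("done")
```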
If your workload cannot be made preemption-tolerant, on-demand cloud HPC is rarely competitive with on-prem. If it can, cloud HPC with spot is often the cheapest option available, including owning.
Three Takeaways
- Bursty workloads belong in the cloud, steady-state workloads belong on owned hardware. The break-even is around 30 to 50 percent sustained utilization.
- Storage costs are the silent budget killer. Plan the data lifecycle before you plan the compute.
- If you can tolerate preemption, spot pricing changes the math entirely. Design for it from the start or the cloud HPC premium will eat your budget.
Talk with us about your infrastructure
Schedule a consultation with a solutions architect.