We have been running virtual machines in production since the early 2000s, before VMware ESX had vMotion, when Xen was the cool kid, and when "live migration" was a demo at a conference rather than a Tuesday afternoon. Two decades later, a lot of what passes for "VM best practices" on the internet is outdated, vendor-coached, or just wrong. Here are four things that still matter when you are putting VMs into production and writing the check for the hypervisor.

1. Sizing Is an Economics Problem, Not an Engineering Problem

The single biggest mistake we see customers make is treating VM sizing as an engineering optimization problem when it is actually an economics problem. The question is not "how much CPU and RAM does this workload need to run." The question is "at what density can I run this workload before either the user experience degrades or the resource contention causes operational noise."

For most business workloads, CPU is cheap and RAM is expensive. You can overcommit CPU at 4:1 or 5:1 (vCPUs to physical cores) and most hosts will not notice. You cannot overcommit RAM without paying for it in ballooning, swapping, and unhappy users. This means your density is almost always gated by memory, not by cores.
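To make the memory-versus-cores point concrete, here is a minimal sketch of the arithmetic. The host shape, the 16 GB hypervisor reservation, and the `vms_per_host` helper are our own illustrative assumptions, not figures from any vendor sizing guide.

```python
from dataclasses import dataclass


@dataclass
class Host:
    cores: int                  # physical cores in the host
    ram_gb: int                 # installed RAM
    reserved_ram_gb: int = 16   # assumed hypervisor overhead; adjust for your platform


def vms_per_host(host: Host, vm_vcpus: int, vm_ram_gb: int,
                 cpu_overcommit: float = 4.0) -> dict:
    """Count how many identical VMs fit by CPU and by RAM, and report which one gates."""
    # CPU is overcommitted (4:1 by default); RAM is not overcommitted at all.
    by_cpu = int(host.cores * cpu_overcommit // vm_vcpus)
    by_ram = (host.ram_gb - host.reserved_ram_gb) // vm_ram_gb
    return {
        "by_cpu": by_cpu,
        "by_ram": by_ram,
        "density": min(by_cpu, by_ram),
        "gated_by": "ram" if by_ram <= by_cpu else "cpu",
    }


# Illustrative host: 32 cores, 512 GB RAM, running 2 vCPU / 16 GB VMs.
example = vms_per_host(Host(cores=32, ram_gb=512), vm_vcpus=2, vm_ram_gb=16)
print(example)
```

With these assumed numbers the host has room for roughly twice as many VMs by CPU as by RAM, which is the pattern we see on most real estates.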

The practical consequence is that you should buy hosts with more RAM than you think you need, and you should resist the temptation to hand out 16 GB VMs to workloads that actually need 4. Every gigabyte you hand out is a gigabyte you cannot give to the next workload, and the next workload is coming. Right-sizing VMs against actual utilization data is boring work that saves real money. Do it quarterly.
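A quarterly right-sizing pass can be sketched roughly as follows. The 95th-percentile rule, the 25 percent headroom, and the 2 GB rounding step are assumptions we picked for illustration; `recommend_ram_gb` is a hypothetical helper, and the usage samples would come from whatever monitoring system you already run.

```python
import math


def recommend_ram_gb(allocated_gb, used_gb_samples, headroom=0.25, step_gb=2):
    """Suggest a RAM allocation from observed usage.

    Takes the 95th percentile of the samples, adds headroom, and rounds up
    to the next step. This sketch only ever shrinks an allocation; it does
    not try to detect VMs that need more memory.
    """
    if not used_gb_samples:
        return allocated_gb  # no data, no recommendation
    ordered = sorted(used_gb_samples)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    target = ordered[p95_index] * (1 + headroom)
    rounded = math.ceil(target / step_gb) * step_gb
    return min(allocated_gb, rounded)


# Illustrative samples (GB in use) for a VM that was handed 16 GB.
samples = [2.5, 3.0, 3.1, 2.8, 3.4, 3.0, 2.9, 3.2, 3.3, 3.1]
print(recommend_ram_gb(16, samples))
```

The percentile and headroom are the knobs worth arguing about; workloads with rare but legitimate memory spikes need a longer sample window before you trust the result.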

2. Snapshots Are Not Backups, and Long-Lived Snapshots Are Time Bombs

This is the one that has bitten more customers than any other, and we still see it today. A snapshot is a diff against a running disk. It is fine for short-term operations: take a snapshot, do an upgrade, verify it works, and commit the snapshot within an hour or two. What is not fine is leaving the snapshot in place for weeks or months.

Long-lived snapshots cause three specific problems. First, they keep growing as the VM writes data and can eventually fill the datastore. Second, committing a large snapshot takes hours and can effectively lock the VM during the commit. Third, they interfere with backup software that also uses snapshots, resulting in silent backup failures that you only discover when you need the backup.

A snapshot is not a backup. Repeat that sentence until it is reflexive. Backups are copies of the data that live somewhere else, on different hardware, ideally on different media, tested on a schedule. Snapshots are a temporary convenience for short-term rollback. Mixing the two up is how organizations lose data.

Set a hard rule that no snapshot lives longer than 24 hours, write a scheduled task that alerts you when one does, and aggressively commit or delete anything older.
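The alerting half of that rule could look something like the sketch below. The `Snapshot` record and the `inventory` list are hypothetical stand-ins: in practice you would populate them from your hypervisor's own inventory tooling, which this example deliberately does not try to model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=24)  # the hard limit from the rule above


@dataclass
class Snapshot:
    vm: str
    name: str
    created: datetime  # timezone-aware, UTC


def stale_snapshots(snapshots, now=None, max_age=MAX_AGE):
    """Return every snapshot older than max_age."""
    now = now or datetime.now(timezone.utc)
    return [s for s in snapshots if now - s.created > max_age]


# Hypothetical inventory; a fixed "now" keeps the example reproducible.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
inventory = [
    Snapshot("app01", "pre-upgrade", now - timedelta(hours=2)),
    Snapshot("db01", "before-patch", now - timedelta(days=9)),
]

for s in stale_snapshots(inventory, now=now):
    age = now - s.created
    print(f"ALERT {s.vm}/{s.name} is {age.days}d {age.seconds // 3600}h old")
```

Run it from whatever scheduler you already trust and send the output somewhere a human will read it. Whether the task should also delete stale snapshots automatically depends on how much you trust the inventory data.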

3. The Hypervisor You Pick Is a Twenty-Year Commitment, So Do Not Pick It on Features Alone

VMware, Hyper-V, Proxmox, Nutanix AHV, Xen, KVM with oVirt or Harvester — they all run VMs. The feature matrices look different on the vendor websites, but in practice most workloads do not care which hypervisor they run on. What you are actually picking when you pick a hypervisor is a supplier relationship, a licensing model, a toolchain, and a skills market.

Broadcom's acquisition of VMware changed the math on this dramatically. Customers who were paying $500 per socket per year for vSphere Standard are now being quoted multi-year subscriptions that are five to ten times more expensive. For a lot of customers this has made VMware economically non-viable, and they have migrated to Proxmox, Hyper-V, or Nutanix. The migrations are not always fun (VM format conversions, tooling retraining, integration rework), but they are doable, and they pay for themselves within one or two years at current pricing.

Our current default recommendation for new deployments is Proxmox if you have the skills and the appetite, Hyper-V if you are already all-in on Microsoft, and Nutanix if you want an integrated appliance model with a single vendor to call. We keep existing VMware customers on VMware only when the cost of migration exceeds the cost of staying, and that calculation has gotten worse for VMware every year since 2023.
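The stay-or-migrate comparison is a simple payback calculation. The sketch below shows the shape of it. Every figure in the example call is a placeholder we made up for illustration, not a real quote or a customer number, and `payback_years` is a hypothetical helper.

```python
def payback_years(migration_cost, annual_cost_staying, annual_cost_after):
    """Years until a one-off migration cost is recovered by lower running costs.

    Returns None when the new platform is not cheaper to run, in which
    case the migration never pays for itself on cost alone.
    """
    annual_savings = annual_cost_staying - annual_cost_after
    if annual_savings <= 0:
        return None
    return migration_cost / annual_savings


# Placeholder inputs only: substitute your own quotes and project estimates.
print(payback_years(migration_cost=120_000,
                    annual_cost_staying=100_000,
                    annual_cost_after=20_000))
```

Be honest about what goes into `migration_cost`: conversion tooling, retraining, integration rework, and the staff time spent running two platforms in parallel all belong in that number.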

4. Patching the Hypervisor Is Riskier Than Patching the Guest

Patching a guest VM is routine. Patching the hypervisor underneath a hundred running VMs is an event. It requires live migration (which is fastest and least painful with shared storage or software-defined storage such as vSAN), it requires a maintenance window, and it requires that your cluster has enough spare capacity to evacuate at least one host without service impact.

The mistake we see is customers who build a cluster at 90 percent utilization "because why waste capacity" and then cannot patch the hypervisor without shutting VMs down. Your cluster needs enough headroom to tolerate N+1 at minimum — meaning if any one host dies (or gets patched), the rest of the cluster can absorb its VMs without anyone noticing. For small clusters, N+1 is a big percentage. A three-host cluster running at 66 percent utilization is healthy. A three-host cluster running at 85 percent utilization is a pager waiting to ring.
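The headroom arithmetic behind those numbers is short enough to write down. This sketch assumes identical hosts and load that spreads evenly across the survivors, which real clusters only approximate. The function names are ours.

```python
def max_safe_utilization(hosts: int, tolerate_failures: int = 1) -> float:
    """Highest cluster-wide utilization that still fits after losing hosts."""
    if hosts <= tolerate_failures:
        raise ValueError("cluster must have more hosts than failures to tolerate")
    return (hosts - tolerate_failures) / hosts


def utilization_after_failure(hosts: int, current_utilization: float,
                              failures: int = 1) -> float:
    """Utilization of the surviving hosts once the lost hosts' load is absorbed."""
    return current_utilization * hosts / (hosts - failures)


for hosts in (3, 4, 5, 8):
    print(hosts, "hosts: ceiling", round(max_safe_utilization(hosts), 3))

for load in (0.66, 0.85):
    after = utilization_after_failure(3, load)
    verdict = "fits" if after <= 1.0 else "does NOT fit"
    print(f"3 hosts at {load:.0%}: {after:.0%} after losing one ({verdict})")
```

Under these assumptions a three-host cluster at 66 percent sits right at its N+1 ceiling, while the same cluster at 85 percent would need more than 100 percent of the two surviving hosts. The ceiling rises as you add hosts, which is why N+1 hurts small clusters the most.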

The same logic applies to firmware updates, BIOS updates, and NIC driver updates. These things are not optional — they fix real security issues and performance bugs — but they are only safe to apply when you have the capacity to take a host out of rotation.

Plan the capacity around the operations you need to perform, not just the workload you need to run.

What We'd Actually Do

For a new deployment today, we build a cluster of three to five hosts with enough RAM that utilization with one host out of service lands around 70 percent, use local NVMe plus software-defined storage or a shared all-flash array, pick Proxmox or Hyper-V as the hypervisor, and automate backups to immutable object storage. We pin snapshot lifetime to 24 hours via a scheduled task, right-size VMs quarterly against utilization data, and patch the hypervisor tier monthly during a rolling window.

None of that is exciting. It is, however, the thing that keeps VM estates running for years without drama.

Three Takeaways

RAM is what gates density, not CPU. Size your hosts and your VMs accordingly, and do not hand out memory you will never need.
Snapshots are not backups. Long-lived snapshots are an outage waiting to happen. Kill them automatically.
Pick your hypervisor for its licensing model and ecosystem, not its feature matrix. The features converge. The vendor relationships do not.

Virtual Machines: Four Things to Know Before You Standardize on Them
