Cloud Memory & Storage Tiering: Five Insights from Workload Tuning

Cloud storage is not one thing — it's a dozen things with different latency, throughput, and billing profiles. Here are five insights from actually tuning workloads against them.

John Lane · 2023-12-05 · 5 min read

"Cloud storage" is not a category. It is roughly a dozen different products with different latency characteristics, throughput ceilings, consistency models, and billing structures. Treating them as interchangeable is how you end up with a workload that technically runs but costs four times what it should, or one that benchmarks fine but falls over in production. Here is what we have learned from moving real workloads across real cloud storage tiers.

1. Object Storage Is Not a Filesystem and Pretending It Is Will Hurt You

S3, Azure Blob, and Google Cloud Storage are key-value stores with a REST API. They are not filesystems. They do not support random writes, they do not have atomic renames, their locking and concurrency semantics are nothing like POSIX, and their latency per operation is measured in tens of milliseconds rather than microseconds. When you put a FUSE driver in front of them and mount them as a filesystem, things appear to work. They work until someone runs a real workload against them.

The specific failure modes are predictable. Anything that does random writes (databases, logs, VM disks) performs terribly. Anything that does a lot of small files (source code trees, build artifacts, mail spools) spends all its time on per-operation overhead. Anything that relies on POSIX semantics for locking or atomic updates is a lottery.

Object storage is the right answer for: backups, media libraries, data lakes, static website assets, container registries, and anything that is read-mostly with large blobs. It is the wrong answer for: databases, VM disks, shared home directories, build systems, and anything that expects a filesystem. Do not mount buckets as filesystems for production workloads. Use native APIs or a purpose-built gateway.
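
To make the "use native APIs" advice concrete, here is a minimal sketch of object-native access via boto3. The bucket and key names are hypothetical; the point is the shape of the API, which has exactly two useful verbs: whole-object writes and (optionally ranged) reads.

```python
import boto3

# Hypothetical bucket and key names, for illustration only.
BUCKET = "example-media-archive"
KEY = "renders/2023/final-cut.mp4"

s3 = boto3.client("s3")

# Whole-object upload: object stores replace blobs atomically at the
# object level, so an "update" means re-uploading the entire object.
with open("final-cut.mp4", "rb") as f:
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=f)

# Ranged GET: the object-native substitute for a random read.
# One HTTP round trip, tens of milliseconds, no POSIX seek.
resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range="bytes=0-1048575")
first_mib = resp["Body"].read()

# What there is no verb for: seeking into the object and rewriting
# 4 KiB in place. That is why databases and VM disks do not belong here.
```

If your access pattern cannot be expressed in those two verbs, you are in the wrong tier.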

2. Block Storage Has Two Different Cost Curves and You Need to Know Which One You Are On

Cloud block storage (EBS, Azure Managed Disks, Persistent Disk) is billed on two dimensions: capacity and IOPS (plus, on some tiers, throughput). Most people budget only for the first. For most small disks, you pay mostly for capacity. For large or high-performance disks, the IOPS component dominates.

The mistake we see repeatedly is customers who provision a 4 TB gp3 volume "because it's big enough" and get the default baseline, which on gp3 is 3,000 IOPS regardless of volume size and is not enough for a busy database. Performance is awful. Then they panic-upgrade to io2 at several times the cost, when the right answer was to pay for additional provisioned IOPS on the gp3 volume for a fraction of the price.
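
To put rough numbers on that, here is a back-of-envelope sketch. The prices are us-east-1 list prices at the time of writing and will drift, so substitute your region's current numbers before trusting the output.

```python
# Illustrative monthly cost model for the 4 TB database volume above.
GP3_GB_MONTH = 0.08          # $/GB-month
GP3_INCLUDED_IOPS = 3_000    # baseline included with every gp3 volume
GP3_EXTRA_IOPS = 0.005       # $/provisioned IOPS-month above baseline

IO2_GB_MONTH = 0.125         # $/GB-month
IO2_IOPS = 0.065             # $/provisioned IOPS-month (first 32,000 IOPS)

def gp3_cost(size_gb: int, iops: int) -> float:
    extra = max(0, iops - GP3_INCLUDED_IOPS)
    return size_gb * GP3_GB_MONTH + extra * GP3_EXTRA_IOPS

def io2_cost(size_gb: int, iops: int) -> float:
    return size_gb * IO2_GB_MONTH + iops * IO2_IOPS

size_gb, iops = 4_096, 16_000   # 4 TB volume, busy-database IOPS
print(f"gp3: ${gp3_cost(size_gb, iops):,.0f}/month")   # ~ $393
print(f"io2: ${io2_cost(size_gb, iops):,.0f}/month")   # ~ $1,552
```

Same capacity, same IOPS, roughly a four-to-one price difference for a workload gp3 handles fine.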

Read the billing documentation for your provider's block storage options and understand which dimension you are being charged on. Benchmark your actual workload against the tier you think you need before you commit. The performance gap between adjacent tiers is often much smaller than the price gap: the tier below the one you assumed you needed frequently passes the benchmark at half the cost.

3. The Cloud Does Not Have Enough RAM for You to Ignore the Working Set Question

Cloud VMs have fixed ratios of RAM to vCPU depending on the instance family: memory-optimized families typically give you 8 GB per vCPU, general-purpose 4, and compute-optimized 2. If your workload's working set (the data it actually needs to touch frequently) does not fit in RAM, you are going to hit disk, and disk latency in the cloud is measured in hundreds of microseconds on the fast tiers and tens of milliseconds on the cheap ones.

The single most impactful performance tuning exercise we do on cloud workloads is profiling the working set size against actual RAM. If you have a 200 GB database on a 16 GB VM with no read replicas, your cache hit rate is terrible, you are hammering the EBS volume, and on some volume types you are paying a per-operation charge for every page read. The fix is often to move up one instance family and down one storage tier: more RAM, fewer provisioned IOPS, the same or lower total cost, and dramatically better performance.
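
As a sketch of what profiling the working set can look like in practice, assuming a PostgreSQL database and the psycopg2 driver, the buffer cache hit ratio in pg_stat_database is a cheap first signal. Sustained values well below 0.99 on an OLTP workload usually mean the working set does not fit in RAM. Connection details are placeholders.

```python
import psycopg2

# Placeholder connection parameters.
conn = psycopg2.connect(host="db.internal", dbname="appdb",
                        user="metrics", password="secret")

with conn, conn.cursor() as cur:
    # blks_hit  = block reads served from shared_buffers
    # blks_read = block reads that went to disk (or the OS page cache)
    cur.execute("""
        SELECT blks_hit, blks_read
        FROM pg_stat_database
        WHERE datname = current_database()
    """)
    hit, read = cur.fetchone()
    ratio = hit / (hit + read) if (hit + read) else 1.0
    print(f"buffer cache hit ratio: {ratio:.4f}")
```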

4. Cold Storage Is a Trap Unless You Understand the Retrieval Cost

S3 Glacier, Azure Archive, and Google Coldline look amazing on the price list: a few dollars per terabyte per month for data that you do not need to access often. What the price list buries in a footnote is the retrieval cost and the minimum storage duration.

If you put data in Glacier and delete it after two months, you are billed for the 90-day minimum anyway. If you need to restore a terabyte of Glacier data urgently, expedited retrieval carries both a per-gigabyte fee and a per-request charge, and for data stored as many small objects that can turn a $6 monthly bill into a $1,000 recovery fee in an emergency.
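
Here is a back-of-envelope model of both gotchas. The prices are illustrative, roughly in line with S3 Glacier Flexible Retrieval list prices at the time of writing; check your provider's current sheet before relying on any of them.

```python
# Illustrative archive-tier pricing; substitute current list prices.
STORAGE_GB_MONTH = 0.0036     # ~ $3.70/TB-month at rest
MIN_STORAGE_DAYS = 90         # early deletion is billed up to this floor
EXPEDITED_PER_GB = 0.03       # expedited retrieval fee, per GB
EXPEDITED_PER_1K_REQ = 10.0   # expedited retrieval fee, per 1,000 requests

def early_delete_cost(gb: float, days_stored: int) -> float:
    """Storage bill for data deleted before the minimum duration."""
    billed_days = max(days_stored, MIN_STORAGE_DAYS)
    return gb * STORAGE_GB_MONTH * billed_days / 30

def expedited_restore_cost(gb: float, object_count: int) -> float:
    """Per-GB fee plus per-request fee; requests dominate for small objects."""
    return gb * EXPEDITED_PER_GB + object_count / 1_000 * EXPEDITED_PER_1K_REQ

print(f"delete 1 TB after 60 days:   ${early_delete_cost(1024, 60):,.0f}")
print(f"restore 1 TB (100k objects): ${expedited_restore_cost(1024, 100_000):,.0f}")
# ~ $11 of storage either way, but ~ $1,031 for the urgent restore.
```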

Cold storage is the right answer for: compliance archives you genuinely hope you will never read, legal hold data with seven-year retention, and second copies of long-term backups that you already have hot copies of elsewhere. It is the wrong answer for: anything you might need to read once a quarter, monthly audit extracts, or the "maybe we will need this" pile. For those, standard infrequent-access tiers are usually the right compromise.

5. Intra-Region Traffic Is Not Always Free and You Need to Check

The cloud providers all advertise that traffic within a region is free. What they do not advertise prominently is the long list of exceptions. Traffic between availability zones within a region is typically charged per gigabyte in each direction, not zero. Traffic across VPC peering is charged. Private endpoints (AWS PrivateLink, Azure Private Endpoint) carry both a fixed hourly cost and a per-gigabyte cost. NAT gateways charge a per-gigabyte processing fee on traffic in both directions, on top of an hourly fee.

For a typical application, these charges are small enough to ignore. For any workload that moves serious amounts of data between services — a data pipeline, a streaming analytics platform, a backup system pulling data from multiple accounts — they are not small. We have seen customers with $30,000 per month NAT gateway bills because every Kubernetes pod was reaching the internet through a single gateway for container image pulls.
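
A quick way to see which camp you are in is to pull last month's bill grouped by usage type and surface the transfer and NAT line items, sketched below with boto3's Cost Explorer client. Usage-type names vary by account and region, so the sketch filters by substring rather than exact match.

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2023-11-01", "End": "2023-12-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

for group in resp["ResultsByTime"][0]["Groups"]:
    usage_type = group["Keys"][0]
    cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
    # Transfer and NAT usage types typically contain these substrings;
    # confirm against the line items on your own bill.
    if cost > 1 and ("DataTransfer" in usage_type or "NatGateway" in usage_type):
        print(f"{usage_type:45s} ${cost:,.2f}")
```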

Audit your data flows. Use gateway endpoints (free) instead of interface endpoints (not free) for S3 and DynamoDB. Use private connectivity between accounts instead of public IPs. Cache container images inside your VPC. These are not micro-optimizations. They are the difference between a reasonable cloud bill and an outrageous one.
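
For the gateway-endpoint fix specifically, the call is short. The sketch below uses placeholder VPC and route-table IDs in us-east-1; a gateway endpoint adds route-table entries and, unlike an interface endpoint, carries no hourly or per-gigabyte charge.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Gateway endpoint for S3: S3 traffic stays on AWS's network via the
# listed route tables instead of traversing a NAT gateway.
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",            # placeholder
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0123456789abcdef0"],  # placeholder
)
```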

What We'd Actually Do

For a new cloud workload, we profile the working set before picking an instance, pick the smallest block storage tier that meets the benchmark, use object storage through native SDKs (never FUSE), and plan egress carefully before the first byte ships. We push cold data to archive tiers only when we are certain we will not need it for a year or more, and we tag every storage resource with an owner and a retention policy so nothing survives its usefulness by more than a month.

Three Takeaways

  1. Object storage is a key-value store. Treat it like one, or it will embarrass you.
  2. Working set size against available RAM is the single biggest cloud performance lever. It is also the one nobody measures before provisioning.
  3. Egress and archival retrieval are the hidden costs that make cloud bills look unreasonable. Audit both before committing to an architecture that depends on them being cheap.

Talk with us about your infrastructure

Schedule a consultation with a solutions architect.
