
What Cloud Backup Vendors Don't Put on the Datasheet

Six things about cloud backup that matter during an actual restore and never appear in the marketing collateral.

John Lane 2022-06-07 6 min read

Cloud backup is now a commodity in the same way that office coffee is a commodity. Everybody has it, most of it is adequate, and the differences only show up when things go wrong. The problem is that "things going wrong" is the only scenario that matters for backup software, and the vendor datasheets say almost nothing useful about it.

Here are six things we've learned about cloud backup in production that never show up in the marketing material. If you're evaluating a product or auditing an existing deployment, these are the questions to ask.

1. Restore speed is not symmetric with backup speed

The datasheet tells you how fast the product ingests data. It rarely tells you how fast it can push data back. These numbers are usually wildly different, and for good reasons.

Backup is a steady-state workload. It runs on a schedule, it can chunk and compress over time, it tolerates throttling, and the product has hours or days to finish. Restore is an emergency. You want everything, right now, and the product has to rehydrate from compressed/deduplicated/encrypted storage, decrypt it, decompress it, reassemble it, and push it over a network to a target that might not even exist yet.

We've seen 10:1 ratios between ingest rate and restore rate on some products. A tenant that can ingest 500 MB/s of change data can restore at 50 MB/s. That means your 5 TB database takes about a day to pull back from the cloud, not an hour. If your RTO assumed otherwise, your RTO was wrong.

Test this. Not in a POC with a toy workload. Test with a restore of a real Tier 1 database to a real target, measure the wall-clock time, and update your RTO planning with the actual number.
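
Turning a measured restore rate into RTO numbers is a few lines of arithmetic. A minimal sketch, assuming placeholder dataset sizes and a 50 MB/s measured rate; swap in whatever your own test restore actually produces:

```python
# Convert a measured restore rate into wall-clock hours per dataset.
# Dataset sizes and the restore rate are illustrative placeholders.

def restore_hours(dataset_gb: float, restore_mb_per_s: float) -> float:
    """Hours to pull a dataset back at a sustained restore rate (MB/s)."""
    return (dataset_gb * 1024) / restore_mb_per_s / 3600

datasets_gb = {"tier1-db": 5 * 1024, "file-shares": 12 * 1024}  # hypothetical workloads
measured_restore_mb_per_s = 50  # from your own test restore, not the datasheet

for name, size_gb in datasets_gb.items():
    print(f"{name}: ~{restore_hours(size_gb, measured_restore_mb_per_s):.1f} hours")
```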

2. Egress charges are the line item nobody forecasts

Cloud storage is so cheap per gigabyte that people stop thinking about the economics. Then a restore happens, a few terabytes move from the provider back to the customer environment, and the next month's bill has a five-figure line item that nobody predicted.

This is particularly brutal with hyperscaler object storage. AWS, Azure, and GCP all charge for egress out of their backup tiers, and those egress rates are designed to discourage bulk retrieval. A full-environment restore from archive tier can cost more than the yearly storage bill.

Some vendors absorb the egress in their subscription pricing (you're still paying for it, just in the subscription rate). Others pass it through as an at-cost line item that you don't see until it happens. The contract language is worth reading slowly.
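
Putting a number on the worst case ahead of time is a back-of-envelope exercise. A sketch with assumed sizes and rates; the per-GB figures below are placeholders, not anyone's published pricing, so substitute your provider's current egress and retrieval rates:

```python
# Worst-case restore cost: archive-tier retrieval fees plus egress.
# All sizes and rates are placeholder assumptions, not published pricing.

def restore_cost_usd(restore_tb: float, egress_per_gb: float, retrieval_per_gb: float = 0.0) -> float:
    gb = restore_tb * 1024
    return gb * (egress_per_gb + retrieval_per_gb)

worst_case_tb = 40        # assumed full-environment restore size
egress_per_gb = 0.09      # illustrative egress rate
retrieval_per_gb = 0.02   # illustrative archive-tier retrieval fee

print(f"Worst-case restore bill: ~${restore_cost_usd(worst_case_tb, egress_per_gb, retrieval_per_gb):,.0f}")
```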

3. Deduplication ratios are marketing math

When a vendor tells you their dedupe ratio is 20:1, ask them the source of that number. It almost always comes from a synthetic dataset — identical VMs, identical databases, identical file shares — that bears no resemblance to your production environment.

Real-world dedupe ratios on mixed workloads are more like 2:1 to 5:1. Encrypted databases dedupe at approximately 1:1 because encryption destroys the byte-level similarity dedupe depends on. Virtualized workloads with identical OS images dedupe better, but only on the OS portion; the user data on top of them rarely dedupes at all.

If your budgeting was based on the vendor's stated ratio, your budget is wrong by a factor of 4 to 10. This usually surfaces as an unpleasant surprise three months in, when storage consumption is several times what was forecast.
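
Re-running the storage budget with a conservative ratio takes a minute. A sketch with illustrative numbers; the measured ratio stands in for whatever your own 90-day audit shows:

```python
# Storage forecast under the datasheet dedupe ratio vs. a conservative one.
# The protected-data size and both ratios are illustrative assumptions.

def stored_tb(logical_tb: float, dedupe_ratio: float) -> float:
    return logical_tb / dedupe_ratio

protected_tb = 100
print(f"Datasheet ratio (20:1):  {stored_tb(protected_tb, 20):.0f} TB stored")
print(f"Measured ratio (3:1):    {stored_tb(protected_tb, 3):.0f} TB stored")
```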

4. The delete-and-recreate problem

Most cloud backup products handle file changes gracefully. They don't always handle file deletions gracefully, particularly when a workload pattern involves deleting a large file and creating a new one with different content.

The failure mode looks like this: your application deletes a 50 GB file, creates a new 50 GB file, and does this every night as part of a data refresh. The backup product sees two operations — a delete and a create — and stores 50 GB of new data every day. After 30 days you have 1.5 TB of storage for a 50 GB working set, with no deduplication, because each day's file is genuinely different content from every other day's.

This is particularly common with database exports, report generators, ETL staging, and workloads that use temporary file patterns for persistent-looking data. Check your retention math against your workload patterns, not against the vendor's assumptions about normal file behavior.
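
The retention math for this pattern fits on a napkin. A sketch mirroring the example above, with made-up numbers:

```python
# Retained storage for a delete-and-recreate workload: each refresh is new
# content, so nothing dedupes across days. Numbers mirror the example above.

def retained_gb(file_gb: float, refreshes_per_day: int, retention_days: int) -> float:
    return file_gb * refreshes_per_day * retention_days

print(f"{retained_gb(50, 1, 30):,.0f} GB retained for a 50 GB working set")  # ~1,500 GB
```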

5. Immutability is not immutable until you read the fine print

"Immutable backup" has become a checkbox item since ransomware made it non-negotiable. But immutability comes in several flavors, and only one of them actually protects you from the scenario you're worried about.

The three flavors you'll encounter:

  • Soft immutability — the backup product prevents deletion through its own UI and API. If an attacker compromises the backup admin account, they can turn this off. Not real immutability.
  • Provider-side immutability — the cloud storage tier itself enforces the write-once policy. AWS S3 Object Lock in Governance mode, Azure Blob immutable policies. Protects against the backup product being compromised but can sometimes be overridden by high-privilege provider accounts.
  • Compliance-mode immutability — the cloud storage tier enforces the lock in a way that cannot be removed even by the account root. S3 Object Lock in Compliance mode, Azure Blob immutable storage with a locked retention policy. This is what you want. This is what your cyber insurance probably requires.

Ask the vendor which one they use. If the answer is vague, assume it's soft immutability. Configure compliance-mode yourself on the underlying storage if the vendor architecture allows.
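
If the product writes to an S3 bucket you control, a minimal boto3 sketch of the compliance-mode setup looks like this, assuming a hypothetical bucket name. Keep in mind that Object Lock can only be enabled when the bucket is created, and a compliance-mode lock cannot be shortened or removed afterwards, so rehearse this in a sandbox account first:

```python
# Enforce compliance-mode immutability on an S3 backup bucket you control.
# The bucket name is hypothetical. A COMPLIANCE retention lock cannot be
# removed once applied, even by the account root; test in a sandbox first.
import boto3

s3 = boto3.client("s3")
bucket = "example-backup-target"

# Object Lock can only be turned on at bucket creation time.
s3.create_bucket(Bucket=bucket, ObjectLockEnabledForBucket=True)

# Default every new object version to a 90-day compliance-mode lock.
s3.put_object_lock_configuration(
    Bucket=bucket,
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 90}},
    },
)
```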

6. Granular restore is usually slower than full restore

For reasons that make sense if you understand how backup indexing works but surprise most customers, restoring a single 5 MB file from a 10 TB backup set can take longer than restoring the entire backup set to a scratch destination.

This happens because the product has to walk its index, locate the file's chunks across potentially thousands of containers, pull just those containers from cloud storage, decrypt/decompress them, reassemble the file, and present it. If the containers are in archive tier, you pay retrieval time on each one. A full restore streams sequentially from far fewer containers and uses the network efficiently.

This matters for "I deleted one file, can you get it back?" scenarios. The answer is yes, but maybe not in five minutes. Set expectations with your users accordingly. Also, this is why file-level backup products (not image-level) exist as a separate category — they optimize for this exact case at the cost of being worse at full-system recovery.
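
For setting those expectations, a toy estimate of the single-file path is enough; the container counts and per-retrieval times below are assumptions for illustration, not measurements from any product:

```python
# Rough expectation-setter for single-file restores: time is dominated by
# per-container retrieval, not by the size of the file. All numbers are
# illustrative; archive-tier retrievals can take hours per container.

def single_file_restore_minutes(containers: int, retrieval_min_each: float,
                                file_mb: float, restore_mb_per_s: float) -> float:
    return containers * retrieval_min_each + file_mb / restore_mb_per_s / 60

# A 5 MB file whose chunks span 8 containers at ~3 minutes each from warm storage
print(f"~{single_file_restore_minutes(8, 3, 5, 50):.0f} minutes for a 5 MB file")
```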

What to Actually Do With This

I am not telling you not to use cloud backup. Cloud backup is the baseline of any modern data protection strategy, and the alternative — tapes in a drawer, snapshots on the same SAN as production, nothing at all — is worse. I am telling you that the marketing claims are the marketing claims, and the operational reality is different.

Before you commit to a cloud backup platform for production:

  1. Measure real restore throughput for your specific workload mix
  2. Forecast egress costs for your worst-case restore scenario
  3. Audit your actual dedupe ratio after 90 days of real production data
  4. Walk through your largest deletes-and-recreates patterns and calculate retention storage
  5. Confirm the immutability mode in writing
  6. Test granular restore for the use cases you actually care about

Do the work up front, and the next time you have to use the product for real, you'll be pleasantly un-surprised. Skip the work, and you'll learn these six facts the hard way on your worst day of the year.
