Preventing a Data Disaster: A Practical Checklist
Most data disasters don't come from exotic failures. They come from the same ten or twelve unforced errors. Here is the checklist we use to prevent them.

Over twenty-three years of running and recovering production infrastructure, we have seen most categories of data disaster. Hardware failures, software bugs, human error, malicious deletion, ransomware, flood, fire, a contractor unplugging the wrong cable. The surprising thing is not the variety. The surprising thing is how repetitive the pattern is. A small number of unforced errors show up again and again, and the organizations that avoid them are the ones that survive everything else with their data intact.
This post is the practical checklist we give customers when they ask us how to stop losing data. It is not glamorous. It is not new. It is the list of things that, if you do all of them, will let you survive almost everything.
Know what data you actually have
The first and most commonly skipped item is a real inventory of where the organization's data lives. Not "we use SharePoint and SQL Server." An actual list of the systems that hold data, classified by sensitivity and business criticality, with a named owner for each.
Most organizations cannot produce this list when you ask for it. They have a general sense that customer data is in the CRM and financial data is in the ERP, but they also have seven file shares, three SharePoint sites, a pile of Excel files on an engineer's laptop, a handful of SaaS applications where somebody signed up with a corporate credit card two years ago, and a shadow database running on a desktop under someone's desk that everybody has forgotten about.
The checklist here is simple. Identify every system that holds data. Classify each one as crown-jewel, important, or operational. Assign an owner. Write the inventory down. Review it quarterly. This takes a few days the first time and a few hours per quarter thereafter, and it is the foundation of everything else on this list.
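To make the quarterly review concrete, here is a minimal sketch of what the inventory can look like once it is a structure you can query rather than a document nobody opens. The field names, classifications, and example entries are hypothetical; the point is that "review it quarterly" becomes a script instead of a good intention.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# A minimal shape for the inventory; field names and entries are hypothetical.
@dataclass
class DataSource:
    name: str
    classification: str      # "crown-jewel", "important", or "operational"
    owner: str
    last_reviewed: date

INVENTORY = [
    DataSource("CRM (Salesforce)", "crown-jewel", "j.doe", date(2024, 1, 15)),
    DataSource("Engineering file share", "important", "k.lee", date(2023, 6, 2)),
]

# The quarterly review is the part that actually slips; flag anything overdue.
for src in INVENTORY:
    if date.today() - src.last_reviewed > timedelta(days=92):
        print(f"Overdue review: {src.name} (owner: {src.owner})")
```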
Back up the things that matter, not the things that are easy
The second unforced error is backing up the wrong things. We routinely see organizations with religious adherence to backing up their Windows file servers, while ignoring their SaaS data (Microsoft 365 mailboxes, SharePoint sites, Teams, OneDrive) because "Microsoft handles that." Microsoft does not handle that in the way you think. They protect their infrastructure. They do not protect you from yourself, from a ransomware event that encrypts files through a sync client, or from a malicious employee who deletes a year of shared mailbox data on their last day.
Every data source on your inventory needs an answer to two questions. What is backing it up? And what is the tested recovery procedure when something goes wrong? If either answer is "nothing," you have a disaster waiting to happen. SaaS backup tools — Veeam for M365, Afi, SkyKick, Druva, Spanning, and others — are inexpensive compared to the cost of losing the data.
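One way to keep those two questions honest is to record the answers next to each inventory entry and flag the gaps automatically. A minimal sketch, with hypothetical entries:

```python
# Hypothetical extension of the inventory: every source must answer both questions.
SOURCES = [
    {"name": "M365 mailboxes", "backup": "Veeam for M365", "tested_restore": "2024-02-10"},
    {"name": "SharePoint sites", "backup": None, "tested_restore": None},
    {"name": "SQL Server (ERP)", "backup": "nightly full + log shipping", "tested_restore": None},
]

for src in SOURCES:
    if not src["backup"]:
        print(f"{src['name']}: nothing is backing this up")
    elif not src["tested_restore"]:
        print(f"{src['name']}: backed up, but the restore has never been tested")
```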
Keep at least one copy you can't accidentally destroy
The 3-2-1 backup rule is decades old and still mostly right — three copies of your data, on two different media, with one copy offsite. The modern version adds a fourth leg: at least one copy that is immutable and cannot be altered or deleted by a compromised administrator.
Immutability is not optional in 2024. Ransomware operators have gotten very good at finding and destroying backup systems before they fire the encryption payload. We have seen this first-hand multiple times. The organizations that recover are the ones whose backups were in a form the attacker could not reach — air-gapped tape, immutable object storage in a separate tenant with separate credentials, or a backup appliance with a vendor-managed insider-resistant mode.
Every backup infrastructure review we do includes the question "how does an attacker with domain admin rights destroy this?" If the answer is "they could," you have more work to do.
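If your offsite copy lives in object storage, that question has a checkable answer. The sketch below assumes AWS S3 with Object Lock and uses hypothetical bucket names; other object stores and backup appliances have their own equivalents, but the idea is the same: verify on a schedule that the copy you are counting on really cannot be deleted by an administrator.

```python
import boto3
from botocore.exceptions import ClientError

# Bucket names are hypothetical; point this at wherever your backup copies land.
BACKUP_BUCKETS = ["backups-primary", "backups-offsite"]

s3 = boto3.client("s3")

for bucket in BACKUP_BUCKETS:
    try:
        cfg = s3.get_object_lock_configuration(Bucket=bucket)
    except ClientError:
        print(f"{bucket}: WARNING - no Object Lock configuration at all")
        continue
    rule = cfg.get("ObjectLockConfiguration", {}).get("Rule", {}).get("DefaultRetention", {})
    if rule.get("Mode") == "COMPLIANCE":
        print(f"{bucket}: COMPLIANCE mode, default retention {rule.get('Days')} days")
    else:
        # GOVERNANCE mode (or no default retention) can still be bypassed
        # by a sufficiently privileged account.
        print(f"{bucket}: WARNING - objects are not locked in compliance mode")
```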
Test the restore, not the backup
The most painful disaster recovery story is the one where the backup ran successfully every night for years and the restore didn't work when it was needed. This happens constantly. It happens because the backup job exit code is not the same thing as a recoverable dataset, and because the only way to know that a backup is recoverable is to recover it.
The minimum acceptable testing cadence, for the data you would miss most, is quarterly. Take a representative backup, restore it to an isolated environment, and verify the data is intact and usable. Document the procedure and the time it took. If the recovery took longer than your RTO, fix the procedure or lower the expectation.
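A restore drill is easy to script around, even when the restore itself is driven by a vendor tool. The sketch below uses placeholder commands (restore-tool and verify-tool stand in for whatever your backup product actually provides) and a hypothetical RTO; the useful part is the timing and the explicit pass or fail against it.

```python
import subprocess
import time

# Hypothetical values for one crown-jewel dataset; adjust per system.
RTO_SECONDS = 4 * 3600   # the recovery time you have promised the business
RESTORE_CMD = ["restore-tool", "--source", "backup://crm/latest",
               "--target", "restore-test-env"]          # placeholder command
VERIFY_CMD = ["verify-tool", "--target", "restore-test-env"]  # row counts, checksums, spot checks

start = time.monotonic()
subprocess.run(RESTORE_CMD, check=True)   # fail loudly if the restore fails
subprocess.run(VERIFY_CMD, check=True)    # "intact and usable", not just "files exist"
elapsed = time.monotonic() - start

print(f"Restore plus verification took {elapsed / 3600:.1f} hours")
if elapsed > RTO_SECONDS:
    print("FAIL: recovery exceeded the RTO; fix the procedure or change the RTO")
```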
The organizations that survive ransomware and hardware failure alike are the ones that have rehearsed the restore so many times that it feels routine when it matters.
Watch for silent corruption
One of the more insidious data disasters is silent corruption — bit rot on storage media, filesystem errors that don't trip an alert, database corruption that propagates into backups before anyone notices, or a failing disk that is returning wrong data without announcing it. The failure is invisible until someone tries to read the corrupted data months later and finds it broken.
Modern filesystems and storage arrays have scrub and checksum capabilities that catch most of this. Turn them on. Monitor them. ZFS, Btrfs, ReFS, enterprise storage arrays like Pure, NetApp, and Dell EMC all have some version of end-to-end data integrity that will flag corruption before it becomes a disaster. The feature is usually off by default or requires explicit configuration.
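As one concrete example, a ZFS pool that gets a scheduled scrub still needs something watching the result. A minimal cron-friendly sketch, with hypothetical pool names:

```python
import subprocess

# Minimal health check for ZFS pools, meant to run after a scheduled "zpool scrub".
POOLS = ["tank", "backup"]   # hypothetical pool names

for pool in POOLS:
    result = subprocess.run(["zpool", "status", "-x", pool],
                            capture_output=True, text=True)
    report = result.stdout.strip()
    if "is healthy" in report:
        print(f"{pool}: healthy")
    else:
        # Checksum and device errors surface here; wire this into your alerting.
        print(f"{pool}: ATTENTION NEEDED\n{report}")
```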
For databases, run consistency checks on a schedule (DBCC CHECKDB for SQL Server, equivalent tools for PostgreSQL, Oracle, and MySQL) and alert on failures. Do not wait for the nightly backup job to discover the corruption for you.
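For SQL Server, that can be as small as a scheduled script that runs the check and treats any error as an alert. A sketch, with a hypothetical connection string and database list:

```python
import pyodbc

# Hypothetical connection string and database list; adjust for your environment.
CONN_STR = "DRIVER={ODBC Driver 18 for SQL Server};SERVER=sql01;Trusted_Connection=yes"
DATABASES = ["crm", "erp", "reporting"]

conn = pyodbc.connect(CONN_STR, autocommit=True)   # keep each DBCC statement standalone
cursor = conn.cursor()

for db in DATABASES:
    try:
        # WITH NO_INFOMSGS suppresses the noise; corruption still comes back as errors,
        # which pyodbc raises as exceptions.
        cursor.execute(f"DBCC CHECKDB ([{db}]) WITH NO_INFOMSGS")
        print(f"{db}: consistency check passed")
    except pyodbc.Error as exc:
        # Wire this into your alerting instead of printing.
        print(f"{db}: CONSISTENCY CHECK FAILED - {exc}")
```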
Slow down privileged destructive actions
A disturbing number of data disasters come from an engineer typing the right command at the wrong time. rm -rf in the wrong directory. A DROP TABLE against production instead of the lower environment. A storage snapshot deletion that cascaded. A deploy script that wiped a database it was supposed to migrate.
The defense is to put friction between the human and the irreversible action. Separate production and non-production credentials so that the engineer's default shell cannot reach production. Require a typed confirmation for destructive commands. Use tools that show the blast radius of a change before it runs. Implement change control for production database operations, even when it is slower and more bureaucratic than everyone would like.
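The typed-confirmation pattern is small enough to build into your own tooling. A minimal sketch of the idea, with a hypothetical target name:

```python
import sys

def confirm_destructive(action: str, target: str) -> None:
    """Require the operator to retype the target name before anything irreversible runs."""
    print(f"You are about to run {action} against: {target}")
    typed = input("Type the target name exactly to continue: ")
    if typed != target:
        print("Confirmation did not match. Aborting.")
        sys.exit(1)

# Example: guard a production table drop behind the prompt.
confirm_destructive("DROP TABLE", "prod.orders")
# ...only now run the actual destructive operation...
```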
None of this is about not trusting the engineers. It is about recognizing that everyone has a bad Tuesday afternoon eventually, and the goal is for that Tuesday afternoon not to end the company.
Know who has the keys and what happens when they leave
Behind a surprising number of data disasters is a single human point of failure. A cloud account with one admin whose password manager went with them when they left. A backup system whose encryption key is known only to the engineer who set it up. A SaaS tool signed up with a personal email address that disappeared.
Make a list of every system's admin credentials and verify that two or more current employees can access each one. Store the critical credentials in a shared password manager with a break-glass procedure. When someone leaves, rotate anything they knew. This is tedious and it is the reason organizations have survived otherwise unsurvivable departures.
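The "two or more current employees" rule is easy to check mechanically once the access list is written down. A sketch with hypothetical employees and systems; in practice the data would come from your password manager's export or API:

```python
# Hypothetical current-employee roster and admin access map.
CURRENT_EMPLOYEES = {"asmith", "bjones", "cpark"}

ADMIN_ACCESS = {
    "aws-root": {"asmith", "bjones"},
    "backup-appliance": {"bjones"},
    "dns-registrar": {"dgone", "asmith"},   # dgone left last quarter
}

for system, holders in ADMIN_ACCESS.items():
    active = holders & CURRENT_EMPLOYEES
    if len(active) < 2:
        print(f"{system}: only {len(active)} current employee(s) can reach this - fix it")
    departed = holders - CURRENT_EMPLOYEES
    if departed:
        print(f"{system}: rotate credentials known to {', '.join(sorted(departed))}")
```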
The checklist, summarized
A short version you can tape to the wall:
- Inventory every data source, classify it, and name an owner.
- Back up every data source, including SaaS.
- Keep at least one immutable, offsite copy.
- Test restores quarterly and time them.
- Enable scrub and checksum features on storage and databases.
- Separate production and non-production credentials.
- Put friction, such as typed confirmation and change control, in front of irreversible actions.
- Document admin access and rotate on departures.
Organizations that do all eight don't experience many data disasters. Organizations that do four or five of them experience enough disasters to wish they had done all eight.
Three takeaways
- Data disasters come from repetition, not creativity. The same dozen unforced errors cause most of them. Close the list and you close the risk.
- Immutable backups are what ransomware recovery rests on. If your attacker can reach and destroy the backup, the backup does not exist.
- Restores, not backups, are the thing you actually need. Test the restore on a schedule or assume it doesn't work.
Talk with us about your infrastructure
Schedule a consultation with a solutions architect.