Disaster Recovery Strategy and the Cloud: Lessons from a TACC CIO Event
Notes from a Texas CIO roundtable on disaster recovery — what has actually changed in the last few years, what hasn't, and how cloud has quietly rewritten the DR playbook.

I recently spent a day at a CIO roundtable hosted out of Austin, centered on disaster recovery and business continuity strategy. The room included IT leaders from healthcare, higher education, state government, and mid-market enterprise — the usual Texas mix. What surprised me wasn't the topics. It was how much the DR conversation has quietly changed in the last three or four years, and how many organizations still haven't caught up.
Here are the lessons I took away, filtered through twenty-plus years of building infrastructure for organizations that can't afford to be down.
The DR Playbook of 2015 Is Obsolete
A decade ago, disaster recovery meant a secondary data center, a replication link, a runbook in a binder, and an annual test that most people didn't enjoy. The cost of that setup was enormous — you were essentially paying for a second copy of your production environment and hoping you never needed it.
The CIOs in the room who were still doing DR that way were almost all actively trying to get out of it. The ones who had already moved on had replaced the secondary data center with a cloud target, shrunk their runbooks to automated recovery plans, and dropped their recovery time objectives from "days" to "hours or less." The cost savings were real. The capability improvement was more important.
The key realization: when your DR target is cloud object storage plus on-demand compute, the only thing you pay for continuously is the storage. The compute spins up only when you actually need it. That pricing model alone has rewritten the economics of disaster recovery for small and mid-sized organizations that could never afford a proper hot site before.
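To make that concrete, here's a rough back-of-the-envelope comparison. Every number below is an illustrative assumption, not a quote from any provider or vendor; plug in your own data volumes and rates.

```python
# Back-of-the-envelope comparison of two DR postures.
# All prices and volumes are illustrative assumptions.

TB = 1024  # GB per TB, close enough for a rough estimate

protected_data_gb = 20 * TB          # assume ~20 TB of protected data
object_storage_per_gb_month = 0.01   # assumed cool-tier object storage price
dr_compute_per_hour = 25.0           # assumed cost to run the recovered environment
recovery_events_per_year = 1         # assume one real failover or full test per year
hours_per_recovery = 72              # assume three days of DR-site runtime per event

# Model A: traditional secondary data center (amortized hardware, colo, links)
secondary_site_monthly = 18_000.0    # assumed fully loaded monthly cost

# Model B: pay continuously for storage, pay for compute only when it's invoked
storage_monthly = protected_data_gb * object_storage_per_gb_month
compute_monthly = (recovery_events_per_year * hours_per_recovery * dr_compute_per_hour) / 12
cloud_dr_monthly = storage_monthly + compute_monthly

print(f"Secondary data center:              ~${secondary_site_monthly:,.0f}/month")
print(f"Cloud storage + on-demand compute:  ~${cloud_dr_monthly:,.0f}/month")
```

The exact figures will vary wildly by organization; the shape of the comparison generally won't.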
Ransomware Changed the Threat Model
Every CIO in the room mentioned ransomware. Not as a distant threat — as a thing that had either happened to them, happened to a peer, or was actively shaping their 2024 budget. The traditional DR mindset assumed the bad day was a fire, a flood, or a hardware failure. Modern DR planning has to assume the bad day is a ransomware actor who has been in your environment for three weeks and has already deleted or encrypted your backups.
The implications are significant. Backup immutability is no longer optional. Air-gapped or logically separated backup copies are table stakes. Recovery testing has to include "we lost the production environment and the backup admin credentials" as a scenario, not just "the primary SAN failed." Several CIOs had rebuilt their entire backup strategy in the last 18 months specifically because the old one assumed a threat model that didn't survive contact with modern attackers.
The practical answer that kept coming up: immutable backups in a cloud object store, with retention policies that can't be shortened by a compromised admin, plus regular restore tests that include credential compromise scenarios. That pattern has become the de facto standard for organizations that have either been hit or watched a peer get hit.
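For teams on AWS, a minimal sketch of that pattern looks something like the following, using S3 Object Lock in compliance mode. The bucket name and retention window are illustrative, and Azure immutable blob storage and most backup vendors offer an equivalent feature.

```python
# Minimal sketch: an S3 bucket whose backups cannot be deleted, and whose
# retention cannot be shortened, even by a compromised administrator account.
# Bucket name and retention window are illustrative.
import boto3

s3 = boto3.client("s3", region_name="us-west-2")
bucket = "example-dr-backups-immutable"  # hypothetical bucket name

# Object Lock can only be enabled at bucket creation time.
s3.create_bucket(
    Bucket=bucket,
    ObjectLockEnabledForBucket=True,
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

# COMPLIANCE mode is the key detail: unlike GOVERNANCE mode, no credential,
# including the account root user, can shorten or remove the retention period.
s3.put_object_lock_configuration(
    Bucket=bucket,
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 35}},
    },
)

# Every backup written to this bucket now inherits the 35-day lock.
s3.put_object(Bucket=bucket, Key="nightly/2024-06-01.bak", Body=b"...")
```

Compliance mode is the detail that matters: governance-style locks can be lifted by an administrator with the right permission, which is exactly the credential you have to assume is compromised.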
Recovery Time Objectives Are Finally Honest
Here's something that came up repeatedly: the gap between the RTO in the DR plan and the RTO the business actually expects. For years, organizations wrote 24-hour or 48-hour RTOs into their DR plans because that's what they could realistically hit. The business assumed 4 hours because that's what they could tolerate. Nobody closed the loop until there was an incident, at which point everyone was unhappy.
Modern cloud-based DR lets you hit single-digit-hour RTOs for most workloads without buying a second data center, which means the business and IT can finally agree on a number. The CIOs who had done this exercise reported that the conversation with the business changed dramatically. Instead of "here's what we can do, sorry," it was "here's what three different RTO targets cost — pick one." That's a much better conversation.
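To give a flavor of what that conversation looks like, here's an illustrative menu of three common DR patterns for a single mid-sized workload. The patterns are standard; the costs are placeholder assumptions, and the framing is the point.

```python
# Illustrative "pick an RTO" menu. Monthly costs are assumptions for one
# mid-sized workload, not vendor figures.
options = [
    {"pattern": "Backup and restore", "rto": "12-24 hours", "monthly_cost": 400},
    {"pattern": "Pilot light",        "rto": "1-4 hours",   "monthly_cost": 1_500},
    {"pattern": "Warm standby",       "rto": "< 1 hour",    "monthly_cost": 6_000},
]

print(f"{'Pattern':<20}{'RTO':<14}{'Est. cost/month'}")
for o in options:
    print(f"{o['pattern']:<20}{o['rto']:<14}${o['monthly_cost']:,}")
```

The business may well pick the cheap row. That's fine. The point is that they picked it with the price tag in front of them.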
The corollary is that if your DR plan still has a 48-hour RTO for your core ERP, there's a good chance the business would pay to shorten it. Ask them.
The Cloud Isn't Always the DR Target
One of the more interesting threads at the event was about DR for cloud-native workloads. Several organizations had moved production to Azure or AWS and then realized they still needed a DR story, because "the cloud" is not a DR plan — cloud regions can fail, cloud accounts can be compromised, and cloud providers occasionally have billing disputes that become operational problems at the worst possible moment.
The emerging pattern for cloud-native DR is multi-region within the same provider for most workloads, plus cross-provider or on-prem DR for the handful of truly critical systems. That sounds expensive, and it can be, but for the systems that actually matter, it's the only way to survive a provider-level incident. The 2017 AWS S3 outage and the 2021 Azure AD authentication outage both made the point loudly enough that most mature cloud shops have internalized it.
For organizations running mixed workloads — some on-prem, some in cloud — the honest DR architecture is usually cross-environment: on-prem workloads DR to cloud, cloud workloads DR to a secondary region or, for the critical few, back to on-prem. It's not simple. It's what works.
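As one concrete piece of that picture, here's a sketch of replicating a primary-region backup bucket to a second region with S3 replication. The bucket names and IAM role ARN are placeholders, and both buckets are assumed to already exist with versioning enabled; the same idea applies to whatever store holds your DR copies.

```python
# Minimal sketch: replicate the primary-region backup bucket to a second
# region so a regional incident doesn't take the DR copy with it.
# Bucket names and the IAM role ARN are hypothetical placeholders; the role
# needs S3 replication permissions on both buckets.
import boto3

source_bucket = "example-dr-backups-us-east-1"                            # hypothetical
dest_bucket_arn = "arn:aws:s3:::example-dr-backups-us-west-2"             # hypothetical
replication_role_arn = "arn:aws:iam::123456789012:role/example-s3-repl"   # hypothetical

s3 = boto3.client("s3", region_name="us-east-1")

# Versioning is a prerequisite for S3 replication.
s3.put_bucket_versioning(
    Bucket=source_bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

s3.put_bucket_replication(
    Bucket=source_bucket,
    ReplicationConfiguration={
        "Role": replication_role_arn,
        "Rules": [
            {
                "ID": "dr-copy-to-second-region",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},  # replicate everything in the bucket
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": dest_bucket_arn},
            }
        ],
    },
)
```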
Testing Is the Thing Everyone Still Struggles With
The single most common confession at the roundtable: "we don't test often enough." Every CIO said some version of this. DR tests are disruptive, they require coordination across teams, and nothing bad happens if you skip them. So they get skipped, deferred, or run in a reduced scope that doesn't catch the real problems.
The organizations that had gotten serious about testing had all done one thing: automated it. They had replaced the annual manual DR test with automated recovery validation running monthly or weekly in isolated sandboxes, using the same tooling that their backup vendor ships. The annual full test still happened, but it was a confirmation exercise rather than a discovery exercise, because the automated tests had caught the small problems as they emerged.
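What that automated validation can look like, as a skeleton: restore the latest backup into an isolated sandbox, then check that the recovered system actually works rather than just that it booted. The restore helper below is a hypothetical placeholder for the vendor-specific step; the validation logic is generic.

```python
# Skeleton of a scheduled recovery test, assuming your backup platform can
# restore into an isolated sandbox network via API or CLI.
import datetime
import urllib.request


def restore_latest_backup_to_sandbox() -> str:
    """Placeholder: trigger the backup platform's sandbox restore and return
    the URL of the recovered application. Replace with your vendor's API/CLI."""
    print("Restoring latest backup into isolated sandbox network...")
    return "http://10.99.0.10:8080"  # hypothetical sandbox address


def validate_recovery(base_url: str) -> bool:
    """Check that the recovered system works, not just that it powered on."""
    checks = {
        "app responds": f"{base_url}/healthz",
        "data is present": f"{base_url}/api/orders/count",
    }
    for name, url in checks.items():
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                ok = resp.status == 200
        except OSError:
            ok = False
        print(f"[{'PASS' if ok else 'FAIL'}] {name}")
        if not ok:
            return False
    return True


if __name__ == "__main__":
    started = datetime.datetime.now()
    sandbox = restore_latest_backup_to_sandbox()
    success = validate_recovery(sandbox)
    elapsed = datetime.datetime.now() - started
    print(f"Recovery test {'passed' if success else 'FAILED'} in {elapsed}")
```

Wire something like this into a scheduler and a monthly report, and the annual full test becomes the confirmation exercise the roundtable described.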
If you can't test your DR more than once a year, you probably don't actually have a DR plan. You have an aspiration.
What I'd Take Home From the Event
Three things I'd hand to any IT leader thinking about DR strategy in 2024:
- Move your DR target to cloud object storage if you haven't already. The economics, the immutability story, and the testability all point the same direction. The capital cost is gone and the capability is better.
- Rebuild your threat model around ransomware first, hardware failure second. If your DR plan survives a credential compromise, it survives a fire. The reverse is not true.
- Automate your recovery tests. Monthly sandbox recovery beats annual full-scope testing that nobody wants to run. You'll catch more problems earlier, and you'll actually know your plan works.
The DR conversation has changed. The organizations that are keeping up are spending less money on DR than they used to, and getting better outcomes. The ones that haven't caught up are paying for last decade's architecture and waiting for the bad day. Pick which group you want to be in.