Cyber Recovery: Hope Is Not a Strategy
94% of ransomware attacks target backups. If you haven't tested your recovery plan, you don't have one.
Adapted from a masterclass at FutureScot Digital Scotland — but the lessons apply to anyone who’d rather not explain to the board why the backups didn’t work.
On 19 November 2024, I ran a masterclass at FutureScot Digital Scotland on cyber recovery strategies. The audience was largely public sector, but the questions afterwards made it clear: everyone is worried about this. And most organisations are worried for good reason — because their disaster recovery plans are, to put it charitably, optimistic fiction.
So here’s the expanded version of that session. If you’re responsible for keeping systems running when everything goes wrong, this is for you.
The Uncomfortable Reality
Let’s not kid ourselves: most organisations are sitting ducks for cyberattacks. Ransomware, supplier outages, and plain old misconfiguration mean that outages aren’t a question of “if” but “when”.
The stats are grim:
- 97% of unplanned outages last an average of seven hours
- 94% of ransomware attacks now target backups specifically
- Only 31% of organisations are confident in their disaster recovery plans
That last number should terrify you. Nearly 70% of organisations know their DR plans probably won’t work. They’re just hoping they won’t have to find out.
If you think your organisation is the exception, you’re probably deluding yourself. I say this with love.
RTO, RPO, and the Fantasy of “Zero Downtime”
Two metrics matter when everything catches fire:
Recovery Time Objective (RTO) — How long can you afford to be offline?
Recovery Point Objective (RPO) — How much data can you afford to lose?
The lower your targets, the more you’ll pay. That’s not a vendor upsell — it’s physics. Continuous replication costs more than daily backups. Instant failover costs more than manual recovery. Set your targets based on a proper business impact analysis, not wishful thinking or vendor PowerPoint slides.
And no, you can’t just write “zero downtime” on a requirements document and expect it to happen. That’s not how any of this works.
Tiering Your Workloads
Here’s where pragmatism beats perfectionism:
| Tier | Description | Typical RTO | Typical RPO |
|---|---|---|---|
| T1 | Crown jewels — business-critical systems | Minutes | Near-zero |
| T2 | Important but not critical | Hours | Hours |
| T3 | Stuff nobody will miss for a day | Days | 24 hours |
Over-engineering everything to T1 standards is a fast track to budget hell. Be honest about what actually matters. That internal wiki? Probably T3. The payment processing system? T1, obviously.
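To make that table concrete, here is a minimal sketch of tier targets living somewhere explicit and machine-readable rather than in a slide deck. The thresholds and workload names are illustrative placeholders, not recommendations; your real numbers come out of the business impact analysis.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class RecoveryTier:
    description: str
    rto: timedelta   # maximum tolerable downtime
    rpo: timedelta   # maximum tolerable data loss

# Illustrative targets only; yours come from a business impact analysis
TIERS = {
    "T1": RecoveryTier("Crown jewels", rto=timedelta(minutes=15), rpo=timedelta(seconds=30)),
    "T2": RecoveryTier("Important, not critical", rto=timedelta(hours=4), rpo=timedelta(hours=1)),
    "T3": RecoveryTier("Nobody misses it for a day", rto=timedelta(days=1), rpo=timedelta(hours=24)),
}

# Hypothetical workload-to-tier mapping for a made-up organisation
WORKLOADS = {
    "payment-processing": "T1",
    "customer-portal": "T2",
    "internal-wiki": "T3",
}

def targets_for(workload: str) -> RecoveryTier:
    """Look up the agreed recovery targets for a workload."""
    return TIERS[WORKLOADS[workload]]

print(targets_for("payment-processing"))
```

The value isn't the code itself; it's that the agreed targets sit in one version-controlled place, so later test results can be checked against them rather than against memory.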
The Shared Responsibility Model: Stop Blaming the Cloud
Here’s a reality check that catches people out: your cloud provider is responsible for the infrastructure. Your data resilience is on you.
If you botch your backups or misconfigure your recovery, don’t expect AWS or Azure to swoop in and save you. That’s not how the shared responsibility model works. Multi-AZ sounds fancy, but it won’t help if you haven’t:
- Locked down your data with proper access controls
- Implemented immutable backups, so ransomware can't encrypt or delete them (see the sketch after this list)
- Actually tested your recovery runbooks
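On the immutable backups point specifically: if your backups land in Amazon S3, Object Lock in compliance mode is one way to make them undeletable for a fixed retention window, even by an administrator whose credentials have been stolen. A minimal sketch using boto3; the bucket name, region, and 30-day retention are placeholders:

```python
import boto3

s3 = boto3.client("s3", region_name="eu-west-2")
BUCKET = "example-org-backup-vault"  # placeholder; bucket names are globally unique

# Object Lock can only be switched on when the bucket is created
s3.create_bucket(
    Bucket=BUCKET,
    CreateBucketConfiguration={"LocationConstraint": "eu-west-2"},
    ObjectLockEnabledForBucket=True,
)

# Compliance mode: locked object versions cannot be deleted by anyone,
# the root account included, until the retention period expires.
s3.put_object_lock_configuration(
    Bucket=BUCKET,
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}},
    },
)
```

Azure has an equivalent in immutable blob storage policies. Whatever the platform, the principle is the same: the backup copy must be something a compromised admin account cannot quietly destroy.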
“Computer says no” isn’t an acceptable answer when the board asks why you couldn’t recover.
Accountability isn’t optional. If you can’t prove your plan works, you don’t have a plan. You have a document.
Recovery Strategies: Horses for Courses
There’s no one-size-fits-all. Different strategies trade off cost against recovery speed:
Backup & Restore — Cheapest option. Restore from backups when needed. Simple, but slow. Hours to days for recovery depending on data volumes.
Pilot Light — Minimal environment kept running with core components. Database replication active, but compute scaled down to near-zero. Faster recovery, moderate cost.
Warm Standby — Scaled-down but functional live environment. Can handle traffic at reduced capacity immediately, then scale up. Faster still, pricier still.
Active/Active — Continuous replication, multiple live sites, automatic failover. Zero downtime, zero data loss, and a bill to match.
Here’s the thing — don’t waste money on Active/Active for systems nobody cares about. Mix and match strategies by workload and by component.
For example: your database might need Warm Standby because data loss is unacceptable, but the front-end application can sit on Backup & Restore because it’s stateless and can be redeployed in minutes. The web tier isn’t where your data lives.
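As a rough illustration of deciding per component, here's a sketch that picks the cheapest strategy capable of meeting a component's targets. The thresholds, component names, and restore-time figures are all invented for the example; real numbers come from your own analysis and your own tests.

```python
from datetime import timedelta
from typing import Optional

def pick_strategy(required_rto: timedelta,
                  required_rpo: Optional[timedelta],
                  restore_time: timedelta) -> str:
    """Cheapest strategy that plausibly meets the targets for one component.

    required_rpo is None for stateless components with no data to lose.
    restore_time is how long a rebuild from backups actually takes for this
    component: minutes for a stateless app, potentially hours for a big database.
    """
    # A near-zero RPO effectively demands live replication of some kind.
    needs_replication = required_rpo is not None and required_rpo <= timedelta(minutes=1)

    if not needs_replication and restore_time <= required_rto:
        return "backup & restore"   # the cheapest option already meets the target
    if required_rto <= timedelta(minutes=5):
        return "active/active"
    if required_rto <= timedelta(hours=1):
        return "warm standby"
    return "pilot light"

# The worked example from the text: stateful database vs stateless front end
print(pick_strategy(timedelta(minutes=30), timedelta(seconds=30), timedelta(hours=6)))  # warm standby
print(pick_strategy(timedelta(minutes=30), None, timedelta(minutes=10)))                # backup & restore
```

The exact thresholds matter far less than the habit of deciding per component, with the targets written down.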
It’s about business impact, not technical vanity.
Testing: The Bit Everyone Ignores
An untested backup is a fantasy, not a recovery plan.
I cannot stress this enough. I’ve seen organisations with beautiful DR documentation, automated backup jobs running every night, and zero evidence that any of it actually works. Then something goes wrong, and they discover the backups were silently failing for six months.
Here’s a testing cadence that actually works:
| Activity | Frequency | Purpose |
|---|---|---|
| Backup verification | Daily | Confirm jobs completed successfully |
| Component restore | Monthly | Test individual system recovery |
| Tabletop exercise | Quarterly | Walk through scenarios with the team |
| Full DR test | Annually | End-to-end recovery in isolated environment |
Document your actual RTO/RPO versus your targets. If your target RTO is four hours but your last test took twelve, you don’t have a four-hour RTO. You have a twelve-hour RTO and a document that says four.
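One way to keep yourself honest is to record the measured figures from every test alongside the targets and treat any miss as a defect to be fixed, not a footnote. A minimal sketch with invented figures:

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class DrTestResult:
    system: str
    target_rto: timedelta
    measured_rto: timedelta   # wall-clock time from "go" to service restored
    target_rpo: timedelta
    measured_rpo: timedelta   # age of the most recent recoverable data

def report(results: list[DrTestResult]) -> None:
    """Print a pass/fail line per system, comparing measured against target."""
    for r in results:
        ok = r.measured_rto <= r.target_rto and r.measured_rpo <= r.target_rpo
        status = "PASS" if ok else "FAIL"
        print(f"{r.system}: {status} "
              f"(RTO {r.measured_rto} vs target {r.target_rto}, "
              f"RPO {r.measured_rpo} vs target {r.target_rpo})")

# Invented figures for illustration
report([
    DrTestResult("payment-processing",
                 target_rto=timedelta(hours=4), measured_rto=timedelta(hours=12),
                 target_rpo=timedelta(minutes=15), measured_rpo=timedelta(minutes=5)),
])
```

Until a retest says otherwise, the measured number is your real RTO.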
Treat every test as a chance to find what’s broken before reality does it for you.
The Greatest Hits of Failure
I’ve seen these patterns repeatedly. Don’t be a statistic:
“Set and forget” DR plans — Written three years ago, never updated, references infrastructure that no longer exists.
Incomplete documentation — If it’s not written down, it doesn’t exist. If the person who knows how to recover the system is on holiday, you’re in trouble.
Single points of failure — One backup location? All backups in the same region as production? Really?
Ignoring dependencies — Everything is connected. Your application might recover fine, but if the authentication service it depends on doesn’t, you’re still down.
Staff who don’t know the runbooks — Documentation is worthless if nobody’s trained on it.
Penny-pinching on DR — Saving money now, paying in reputation and regulatory fines later.
A Framework That Works
If you need a process — and you do — here’s one that’s stood up to real-world use:
1. Assess
Identify what actually matters. Run a business impact analysis. Talk to the business, not just IT. Find out which systems are genuinely critical and which ones people think are critical because they’ve never been asked to prioritise.
2. Design
Set realistic RTO/RPO targets based on the assessment. Pick recovery strategies appropriate to each tier. Don’t let technical preferences override business requirements.
3. Implement
Deploy the infrastructure. Automate everything possible. Lock down access. Implement immutable backups. Make sure your recovery environment is actually separate from production — ransomware that compromises your production admin credentials shouldn’t automatically compromise your DR environment too.
4. Test
Relentlessly. See the testing section above. If you’re not testing, you’re hoping. And hope is not a strategy.
5. Maintain
Update documentation when things change. Train staff. Review after incidents. Improve based on test results. Threats evolve, and so should your resilience posture.
This isn’t a one-time project. It’s a continuous practice.
The Broader Point
Resilience isn’t optional. It’s not a nice-to-have that can wait until next financial year. It’s not something you can outsource entirely to your cloud provider and forget about.
Every organisation — public sector, private sector, large, small — is a target. The question isn’t whether you’ll face an incident, but whether you’ll recover from it.
Set clear targets. Test relentlessly. Document everything. Train your people. And stop pretending the cloud will save you from your own negligence.
If you can’t prove your recovery works, it doesn’t.
This post is adapted from my masterclass at FutureScot Digital Scotland on 19 November 2024. If you’re wrestling with DR strategy, or you’ve got war stories about recovery plans that didn’t survive contact with reality, I’d be interested to hear them.
Architecture Notes / Takeaways
- Set RTO/RPO based on business impact analysis, not wishful thinking
- Your cloud provider won't save you from your own misconfiguration
- An untested backup is a fantasy, not a recovery plan
- Match recovery strategies to workload criticality — Active/Active for everything is budget suicide