
Managed Cloud Security: Four Practices That Actually Reduce Risk

Cloud security is drowning in products. Here are the four practices that actually reduce risk in the environments we run — and the ones that mostly just check a box.

John Lane · 2024-08-24 · 7 min read

Cloud security is an industry that has figured out how to sell a lot of products. Walk any RSA show floor and you will see hundreds of vendors promising to protect your cloud — posture management, workload protection, runtime detection, shift-left, shift-right, CNAPP, CSPM, CWPP, whatever acronym came out last quarter. Most of these tools produce alerts. Most of those alerts get ignored. And the incidents that actually hurt customers almost always come from things the tooling did not catch or the team did not prioritize.

The useful question is not "which tools should we buy?" It's "which practices actually move the needle on risk?" Here are the four we keep coming back to. They are boring, which is why they work, and they are hard, which is why they are not as common as they should be.

Practice 1: Identity is the perimeter — treat it that way

Every serious cloud incident in the last five years that we have read about, investigated, or helped respond to started with identity. Compromised credentials, overly broad IAM roles, service accounts with production access, stolen OAuth tokens, phished MFA codes. The network perimeter stopped being the main defense the day workloads moved to the cloud, and identity took its place. Most organizations still have not rebuilt their security posture around that fact.

The practice is to treat every identity like a potential attacker and make the blast radius of a compromised identity as small as possible. Concretely:

  • Every human identity has MFA, and the MFA is phishing-resistant. SMS is not phishing-resistant. TOTP is not phishing-resistant under a determined adversary. Hardware keys or platform authenticators are. If you have a privileged user without a hardware key, you have a hole.
  • Service accounts have the minimum permissions they need, scoped to the minimum resources they need, for the minimum time they need. Broad "admin" service accounts are the single worst thing in most cloud environments. We find them constantly when we audit new customers.
  • Permissions are reviewed on a schedule. Humans and service accounts both accrete permissions over time. Someone needed a temporary grant to debug an issue, the grant got added, nobody removed it. Six months later the temporary grant is the path to compromise. A quarterly review catches these before they matter.
  • Access to production is logged and visible, and unusual activity triggers an alert. If an engineer accesses production at 3 AM from a country they have never logged in from before, someone needs to know within minutes, not days.

None of this requires buying a new product. It requires using the features that are already in your identity provider and actually enforcing them. Most organizations have the tools. Few are using them to their limits.
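
To make the last bullet concrete, here is a minimal sketch of that kind of unusual-access rule, assuming a generic stream of authentication events. The event fields and the alert() hook are placeholders for whatever your identity provider and paging setup actually emit.

```python
# A minimal sketch, not a product: flag production access from a country a user
# has never logged in from before, or at unusual hours. The AccessEvent shape
# and alert() are assumptions; wire them to your identity provider and pager.
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class AccessEvent:
    user: str
    country: str         # derived from the source IP via GeoIP
    timestamp: datetime  # UTC

# Countries each user has previously logged in from, seeded from history.
seen_countries: dict[str, set[str]] = defaultdict(set)

def alert(message: str) -> None:
    print(f"ALERT: {message}")  # stand-in for paging / Slack / SIEM integration

def check_event(event: AccessEvent) -> None:
    """Alert on production access from a new country or in the small hours."""
    if event.country not in seen_countries[event.user]:
        alert(f"{event.user} accessed production from new country {event.country}")
    if event.timestamp.hour < 6:  # crude "3 AM" heuristic; tune per team
        alert(f"{event.user} accessed production at {event.timestamp:%H:%M} UTC")
    seen_countries[event.user].add(event.country)

check_event(AccessEvent("jdoe", "RO", datetime(2024, 8, 20, 3, 12, tzinfo=timezone.utc)))
```

Seed the baseline from a few months of login history before turning on alerts, or the first week will be mostly noise.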

Practice 2: Assume every public-facing thing will be scanned — design accordingly

Anything with a public IP or DNS name is being scanned, constantly, by opportunistic attackers looking for the easy wins. This is not a hypothesis. You can watch it in the logs of any new public endpoint — traffic from scanners arrives within minutes of the first DNS record resolving. Most of the scanning is looking for the same specific things over and over: default credentials, exposed management interfaces, unpatched CVEs in common software, misconfigured cloud storage buckets, and open databases with no authentication.

The practice is to design with that reality in mind. Treat every public-facing thing as already under attack by automated tools, and make sure none of the easy-win categories apply to you.

  • Nothing is public unless it needs to be. Management interfaces, databases, internal dashboards, dev environments — private network or VPN only. "Security by obscurity" has a bad reputation, but removing something from the internet entirely is not obscurity, it's deletion of the attack surface.
  • Public-facing services are behind a WAF and a rate limiter. Not because the WAF catches everything, but because it filters out 99 percent of the commodity scanning noise and lets you see the stuff that's actually interesting.
  • Storage buckets are private by default and the default is enforced at the organization level. Nearly every customer data exposure you read about traces back to a bucket that somebody made public and forgot about. Organization-level policies that block public buckets are cheap and head off the most common catastrophic mistake in cloud security.
  • Patching is automatic and measured. You will not keep up with CVEs manually. Auto-patching for OS packages, automatic base image updates, and a measurable SLA for applying critical patches to everything public-facing. If you cannot answer "how long does a critical CVE live in our environment?" with a number, the answer is "too long."

The goal is not to be unbreakable. It is to be annoying enough that automated attackers move on to the next target and targeted attackers have to work for it.
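
As one concrete example, here is a sketch of a bucket audit using boto3, assuming an AWS environment. The organization-level enforcement itself belongs in an org policy or SCP; this script only verifies that every bucket actually ended up with public access blocked.

```python
# Sketch of a periodic audit: fail on any S3 bucket that does not block public
# access. Assumes boto3 and AWS credentials; adapt the idea for other providers.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def blocks_public_access(bucket: str) -> bool:
    try:
        cfg = s3.get_public_access_block(Bucket=bucket)["PublicAccessBlockConfiguration"]
    except ClientError:
        return False  # no public-access-block configuration at all counts as a failure
    return all(cfg.get(flag) for flag in (
        "BlockPublicAcls", "IgnorePublicAcls",
        "BlockPublicPolicy", "RestrictPublicBuckets",
    ))

for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]
    if not blocks_public_access(name):
        print(f"FAIL: {name} does not block public access")
```

Run it on a schedule and page on any failure; the audit is cheap and catches drift that the organization policy was supposed to prevent.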

Practice 3: Logs you never read are a waste of money

Most organizations collect logs. Many organizations pay a lot of money to collect logs. Far fewer organizations can answer the question "if we were compromised last Tuesday, would we find out from our logs?" The answer is usually "probably not, because we do not actually look at them."

The practice is to treat logging as a response capability, not a compliance checkbox. Collect what you need to reconstruct an incident, retain it long enough that a delayed discovery still has a paper trail, and actually use it.

  • Retention is at least ninety days for most logs, a year for authentication and administrative actions. Many breaches are not discovered for months. If your logs rolled off after thirty days, you cannot investigate what happened.
  • Logs are centralized and searchable. If the investigation starts with "let me SSH to each host and grep," you have already lost. A central log store with a real query interface turns a multi-day investigation into a multi-hour one.
  • Detection rules are written for your environment, not generic. Off-the-shelf detection content catches generic attacks. The incidents that matter are the ones targeted at you, and you need rules that understand what "normal" looks like in your environment. "Someone accessed a database they've never accessed before from an IP they've never used before" is a rule you write once and get value from forever.
  • Real investigations happen periodically, even without an alert. Schedule a monthly threat hunt where someone actually pokes around the logs looking for weird things. You find stuff. You always find stuff. Some of it is real.

If you are spending real money on logging and the team does not use the logs on a regular basis, you have bought a very expensive compliance artifact. Fix the usage, not the volume.
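
The "new database from a new IP" rule above is a good example of how little code a high-value detection needs. Here is a sketch written as a daily batch job over a centralized log store; fetch_events() and the field names are assumptions to be swapped for your actual query layer.

```python
# Sketch of a daily detection: flag users who touched a database they have never
# touched before, from an IP they have never used before. fetch_events() is a
# stub for your central log store's query interface.
from datetime import date, timedelta

def fetch_events(start: date, end: date) -> list[dict]:
    """Return DB access events as dicts with 'user', 'database', 'source_ip'."""
    raise NotImplementedError  # query your centralized log store here

def hunt(today: date, baseline_days: int = 90) -> list[dict]:
    baseline = fetch_events(today - timedelta(days=baseline_days), today)
    known_dbs = {(e["user"], e["database"]) for e in baseline}
    known_ips = {(e["user"], e["source_ip"]) for e in baseline}

    findings = []
    for e in fetch_events(today, today + timedelta(days=1)):
        new_db = (e["user"], e["database"]) not in known_dbs
        new_ip = (e["user"], e["source_ip"]) not in known_ips
        if new_db and new_ip:  # both new at once is the interesting signal
            findings.append(e)
    return findings
```

The same pattern, baseline the (user, resource) pairs you have seen and flag first-time combinations, covers a surprising number of useful detections.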

Practice 4: Backups are the last line against the worst day

The worst day in cloud security is ransomware, and the thing that turns a bad day into a company-ending day is the state of the backups. Organizations that recover quickly had backups they could actually restore from. Organizations that paid the ransom almost universally did not — their backups were encrypted alongside the production data, or they had not tested a restore recently enough to be confident it would work.

The practice is to run backups as if they are the most important thing in the environment, because on your worst day they will be.

  • Backups are immutable and isolated from production credentials. A credential that can write to production should not be able to delete backups. This is the single most important property and it is violated constantly. Use object lock, use separate accounts, use write-once storage.
  • Backups are tested by actually restoring them, not by reading the status report. A backup that has never been restored is not a backup, it is a hope. Schedule quarterly restore drills, write them up, fix the things that break.
  • Retention is long enough to survive a delayed attack. If the attacker was in your environment for two months before detonating the ransomware, a thirty-day retention policy means your backups are already encrypted. Longer retention is cheap compared to not being able to recover.
  • The restore runbook exists and has been followed recently. Restoring from backup at 2 AM during an incident is not the time to learn the process. Someone on the team should have done it in a calm moment, and the runbook should reflect what actually worked.

On our worst incident responses — the ones where customers called us after something bad had already happened — the thing that determined the outcome was almost always the state of the backups. Good backups made the recovery painful but survivable. Bad backups made the recovery a board meeting.
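
To make the restore drill concrete, here is a minimal sketch of one for a Postgres dump stored in S3. The bucket, key, table, and scratch database URL are placeholders; the point is that the drill is scripted, repeatable, and fails loudly when the restored data is empty.

```python
# Quarterly restore drill, sketched. All names here are placeholders; the real
# value is running something like this on a schedule and writing up what broke.
import subprocess
import boto3

BACKUP_BUCKET = "example-backups"                     # placeholder bucket
DUMP_KEY = "nightly/app-db-latest.dump"               # placeholder object key
SCRATCH_DB = "postgresql://localhost/restore_drill"   # isolated, non-production

def run_drill() -> None:
    # 1. Pull the most recent dump from the (immutable) backup bucket.
    boto3.client("s3").download_file(BACKUP_BUCKET, DUMP_KEY, "/tmp/drill.dump")

    # 2. Restore it into a scratch database that is not production.
    subprocess.run(
        ["pg_restore", "--clean", "--if-exists", "--no-owner",
         "--dbname", SCRATCH_DB, "/tmp/drill.dump"],
        check=True,
    )

    # 3. Verify the restore contains data, not just schema ("users" is a placeholder table).
    count = subprocess.run(
        ["psql", SCRATCH_DB, "-tAc", "SELECT count(*) FROM users"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    assert int(count) > 0, "restored database is empty; drill failed"
    print(f"Restore drill passed: users table has {count} rows")

if __name__ == "__main__":
    run_drill()
```

Whatever the drill restores, write down how long it took; that number is your real recovery time, not the one on the slide.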

The short version

If you did only these four things — identity hygiene, minimal public exposure, useful logging, and verified backups — you would have better security than the majority of cloud environments we see. Everything else is optimization on top of these fundamentals. If you have not nailed the fundamentals, buying more tools will not save you. Nail them first, then consider what else you need. In our experience, most customers find they need a lot less than they thought.
