
Cloud IAM: Seven Things Most Teams Get Wrong

Seven IAM mistakes we find on nearly every security assessment, and the pragmatic fixes that don't require rebuilding your identity stack from scratch.

John Lane 2023-01-24 5 min read

Identity is the new perimeter. Everyone has said this for a decade. What nobody says is that most organizations, including sophisticated ones, have an IAM configuration that would fail a basic tabletop exercise. We do security assessments as part of our infrastructure work, and the same seven IAM mistakes show up on almost every one of them. Here they are, in descending order of how much damage they do.

1. Long-Lived Access Keys for Humans

The single most common finding is a human user with an IAM access key that has not been rotated in months or years. The key lives in a dotfile, a CI secret, or a sticky note, and it has administrator privileges because rotating it is painful and scoping it down is even more painful.

Long-lived keys for humans are a 2012 pattern. In 2023, humans authenticate through SSO to a short-lived session that assumes a role. AWS IAM Identity Center, Azure AD conditional access with PIM, GCP workforce identity — they all do this. The migration is unpleasant because you have to touch every developer's workflow, but the post-migration state is that nobody has a standing key on a laptop that can be stolen. That is worth the unpleasantness.
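While the migration is in flight, it helps to know how bad the current state is. A minimal sketch of a key-age scan, with the detection logic kept as a pure function so it runs anywhere; in practice you would feed it rows from the IAM credential report or `iam.list_access_keys` via boto3 (the key IDs and the 90-day threshold below are illustrative assumptions, not AWS requirements):

```python
from datetime import datetime, timedelta, timezone

# Illustrative rotation threshold; pick whatever your policy mandates.
ROTATION_THRESHOLD = timedelta(days=90)

def stale_keys(keys, now=None):
    """Return IDs of access keys older than the rotation threshold."""
    now = now or datetime.now(timezone.utc)
    return [k["id"] for k in keys if now - k["created"] > ROTATION_THRESHOLD]

# Hypothetical sample rows; real ones come from the credential report.
keys = [
    {"id": "AKIAEXAMPLEOLD", "created": datetime(2021, 3, 1, tzinfo=timezone.utc)},
    {"id": "AKIAEXAMPLENEW", "created": datetime.now(timezone.utc) - timedelta(days=7)},
]
print(stale_keys(keys))  # → ['AKIAEXAMPLEOLD']
```

Anything this scan flags is a key that should become an SSO-backed role assumption instead.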

2. Service Accounts With Console Passwords

The second most common finding is a service account that has both a programmatic access key and a console password, usually because someone needed to "test something" six months ago and never cleaned up. Service accounts should never be able to log into a console. Console access is an indicator of a human user pretending to be a service, which is an indicator of a human who is about to share credentials with another human.

The fix is a scheduled scan that flags any non-human principal with console access and either removes it or files a ticket. We recommend running this scan weekly. The first run will find things. Every subsequent run should find nothing, and if it finds something, that is a conversation worth having.
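The weekly scan can be sketched as follows. The flags here are precomputed so the logic is self-contained; in a real run the input would come from `iam.list_users` plus `iam.get_login_profile` via boto3 (which raises `NoSuchEntityException` when a user has no console password), and how you distinguish human from non-human principals — a naming convention, a tag — is up to you:

```python
def console_enabled_service_accounts(principals):
    """Return names of non-human principals that hold a console password."""
    return [p["name"] for p in principals
            if not p["is_human"] and p["has_console_password"]]

# Hypothetical sample data; names and flags are illustrative.
principals = [
    {"name": "alice",      "is_human": True,  "has_console_password": True},
    {"name": "svc-backup", "is_human": False, "has_console_password": True},
    {"name": "svc-deploy", "is_human": False, "has_console_password": False},
]
print(console_enabled_service_accounts(principals))  # → ['svc-backup']
```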

3. Star Permissions In "Temporary" Policies

Every environment has at least one IAM policy with a Resource: "*" or Action: "*" in it. The policy is always labeled "temporary" and was always added during an incident. Nobody ever goes back and scopes it down because the incident is over and everything works.

The honest way to handle this is a quarterly review of every policy in the account, not just the ones labeled temporary. The review should ask: what is the minimum set of resources and actions this role needs to do its job? The answer is almost never a star. Tools like AWS Access Analyzer and Azure's role assignment recommendations can generate a proposed narrower policy based on actual usage, which saves an enormous amount of time.
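A first pass at the quarterly review can be automated: walk every policy document and flag Allow statements with a bare star. A minimal sketch over the standard IAM policy JSON shape (the `Sid` values and the sample policy are hypothetical; real documents come from `iam.get_policy_version`):

```python
def wildcard_statements(policy_doc):
    """Return Sids of Allow statements granting Action '*' or Resource '*'."""
    flagged = []
    for stmt in policy_doc.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        # Action/Resource may be a string or a list; normalize to lists.
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if "*" in actions or "*" in resources:
            flagged.append(stmt.get("Sid", "<no Sid>"))
    return flagged

temporary_policy = {  # the classic incident-era "temporary" policy
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "TempFix", "Effect": "Allow", "Action": "*", "Resource": "*"},
        {"Sid": "ReadLogs", "Effect": "Allow",
         "Action": ["logs:GetLogEvents"], "Resource": ["arn:aws:logs:*:*:*"]},
    ],
}
print(wildcard_statements(temporary_policy))  # → ['TempFix']
```

Note that a scoped ARN containing wildcards, like the `ReadLogs` statement, is not flagged; only a bare `*` is. Deciding whether scoped wildcards are acceptable is exactly the judgment call the quarterly review exists for.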

4. Cross-Account Roles Without External ID

If you have a third-party vendor that needs access to your AWS account — a security scanner, a cost tool, a backup service — they should be assuming a role with an external ID condition. The external ID is a shared secret that prevents the confused deputy problem: another customer of the same vendor cannot trick the vendor's automation into accessing your account.

We find this missing constantly. The role is set up with a trust policy that allows the vendor's account, but no external ID condition, so any role in the vendor's account can assume it. If the vendor is compromised, your account is compromised.

The fix is a one-line addition to the trust policy. The pain is discovering all the vendor roles you already set up without it.
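The one-line addition is the `Condition` block below. A minimal sketch of the corrected trust policy, built as a Python dict so it can be printed or diffed against what is deployed; the vendor account ID and external ID are placeholders, and the real external ID is a secret issued by the vendor:

```python
import json

VENDOR_ACCOUNT = "111122223333"      # placeholder vendor account ID
EXTERNAL_ID = "example-external-id"  # placeholder; issued by the vendor

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{VENDOR_ACCOUNT}:root"},
        "Action": "sts:AssumeRole",
        # Without this Condition, any principal in the vendor's account
        # can assume the role: the confused-deputy hole described above.
        "Condition": {"StringEquals": {"sts:ExternalId": EXTERNAL_ID}},
    }],
}
print(json.dumps(trust_policy, indent=2))
```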

5. No Lifecycle on User Accounts

When someone leaves the company, their account gets disabled. Usually. If you are lucky. What almost never gets cleaned up are the service accounts they created, the API keys they generated, the roles they assumed, and the resources those roles touched. Six months later, the ex-employee's access keys are still sitting in a shared dotfile on one of the ops team's machines.

The fix is a joiner-mover-leaver process that ties human identity to service accounts and keys through tagging. Every service account should be tagged with the human owner. When the human owner leaves, the tag is orphaned, and an automated scan flags the orphan for review. This is boring infrastructure work and almost no organization does it until after their first painful incident.
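The orphan scan is a simple join between two datasets: service-account tags and the active employee roster. A minimal sketch, assuming an `owner` tag convention (the tag key, account names, and employee list below are all illustrative; in practice the tags come from the cloud API and the roster from your HR system):

```python
def orphaned_service_accounts(service_accounts, active_employees):
    """Flag accounts whose 'owner' tag does not map to an active employee."""
    return [sa["name"] for sa in service_accounts
            if sa["tags"].get("owner") not in active_employees]

active_employees = {"alice", "bob"}
service_accounts = [
    {"name": "svc-etl",    "tags": {"owner": "alice"}},
    {"name": "svc-report", "tags": {"owner": "carol"}},  # carol has left
    {"name": "svc-legacy", "tags": {}},                  # never tagged
]
print(orphaned_service_accounts(service_accounts, active_employees))
# → ['svc-report', 'svc-legacy']
```

Untagged accounts fall out of the same query, which is useful: the first run of this scan doubles as an inventory of everything nobody owns.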

6. MFA Optional for Privileged Roles

MFA is almost universal now for human login. What is not universal is MFA as a condition of assuming a privileged role. Many organizations let a developer log in with MFA to their own account, then assume an admin role in another account without a fresh MFA prompt. That is not real MFA — that is MFA for the front door with the back door wide open.

The fix is a trust policy condition that requires aws:MultiFactorAuthPresent: true (or the equivalent in Azure and GCP) on privileged role assumption, and a session duration short enough that the MFA actually expires. We recommend 1 hour for admin roles and 8 hours for developer roles. Longer sessions defeat the purpose.
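Concretely, the trust policy condition looks like the sketch below (the account ID is a placeholder). The session duration is not part of the policy document; it is set on the role itself, via `MaxSessionDuration` in AWS, which is why both pieces are shown:

```python
admin_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:root"},  # placeholder
        "Action": "sts:AssumeRole",
        # Assumption fails unless the caller's session carries an MFA claim.
        "Condition": {"Bool": {"aws:MultiFactorAuthPresent": "true"}},
    }],
}

# Session lengths per the recommendation above; configured on the role
# (MaxSessionDuration), not in the trust policy document.
ADMIN_MAX_SESSION_SECONDS = 1 * 3600  # 1 hour for admin roles
DEV_MAX_SESSION_SECONDS = 8 * 3600    # 8 hours for developer roles
print(ADMIN_MAX_SESSION_SECONDS, DEV_MAX_SESSION_SECONDS)  # → 3600 28800
```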

7. No Break-Glass Account

The seventh mistake is not having a break-glass account — a root or global admin account that is sealed in a physical envelope, has a password nobody uses day-to-day, has MFA on a hardware token stored in a safe, and exists solely for the scenario where your SSO provider is down and you need to recover access.

We find two failure modes here. The first is no break-glass account at all, which means an Okta or Entra ID outage locks you out of your own cloud environment and you are on the phone with support for 6 hours. The second is a break-glass account that exists but is used regularly for convenience, which defeats the purpose because the credentials leak into normal workflows.

The correct configuration is one account, documented, never used, audited monthly to confirm the hardware token is still in the safe and the password has not been changed. Boring is the goal.
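The monthly audit can be partly automated from the IAM credential report. A minimal sketch of the checks on the break-glass account's row; the field names are modeled on the AWS credential report (`password_last_used` reports `no_information` for a never-used password) but your provider's export may differ, so treat them as assumptions:

```python
def break_glass_violations(row, setup_password_date):
    """Return audit findings for the break-glass account's report row."""
    problems = []
    if row["password_last_used"] != "no_information":
        problems.append("account has been used for day-to-day login")
    if row["password_last_changed"] != setup_password_date:
        problems.append("password changed since setup")
    if not row["mfa_active"]:
        problems.append("hardware MFA token is not active")
    return problems

# Hypothetical healthy row: never logged in, password untouched, MFA on.
healthy = {"password_last_used": "no_information",
           "password_last_changed": "2022-06-01", "mfa_active": True}
print(break_glass_violations(healthy, "2022-06-01"))  # → []
```

The one check the script cannot do is confirm the hardware token is physically in the safe; that stays on the human checklist.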

What We Recommend for Most Customers

If you are staring at this list and wondering where to start, the order we use is: kill long-lived human keys first (biggest attack surface), add MFA conditions on privileged roles (highest leverage), set up a break-glass account (protects you from your SSO provider), then work through policy scoping and lifecycle management over the next quarter.

None of this is hard in isolation. All of it is hard because it touches developer workflows, and developer workflows are defended by developers. The political work is harder than the technical work.

Three Takeaways

  1. Long-lived access keys for humans are the single largest IAM risk in most environments. Kill them first, accept the disruption, move on.
  2. The exclusions and conditions on IAM policies matter more than the permissions themselves. An admin policy with an MFA condition is safer than a read-only policy without one.
  3. A break-glass account is cheap insurance you will only appreciate during an SSO outage. Set it up now, not after.
