
DevOps on Cloud: Four Practices That Actually Survive

Four DevOps practices we have watched survive every tool change, team reorg, and cloud migration of the last decade. The rest is fashion.

John Lane 2023-03-09 6 min read

DevOps has accumulated enough fashion over the last decade that it is hard to tell what actually works from what sounded good on a conference stage. We have watched customers adopt and abandon Jenkins, CircleCI, GitLab, GitHub Actions, ArgoCD, Flux, Harness, and three different flavors of Kubernetes operators. Through all of that, four practices have survived every tool change, every team reorg, and every cloud migration we have been part of. These are the ones worth fighting for. The rest is negotiable.

Practice 1: Everything That Matters Lives in a Repo

The first practice is the oldest and the most important: if a piece of configuration, a deployment script, a database migration, or an infrastructure definition matters, it lives in a git repository with a commit history and a review process. If it is not in a repo, it does not exist for operational purposes, and the first outage that exposes it will be expensive.

This sounds obvious, and it is not. Every environment we audit has at least one critical piece of configuration that exists only in someone's home directory, a forgotten S3 bucket, or a SaaS dashboard nobody backs up. We have found production Terraform state in a single engineer's Dropbox. We have found database schema migrations that existed only as SQL files attached to Jira tickets. We have found a load balancer configuration that had been clicked into the console three years earlier, for reasons nobody could remember.

The survival rule is: if you cannot reproduce it from a repo, it is not reproducible. Act accordingly. The cost of getting there is a few weeks of tedious import and reconciliation work per environment. The cost of not getting there is measured in middle-of-the-night recovery exercises.
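The import half of that reconciliation is less tedious than it used to be. As one possible shape, here is a sketch of pulling a console-clicked load balancer under Terraform management with an import block (Terraform 1.5+); the resource name and ARN are placeholders for whatever the audit turns up, and the same idea applies to whichever IaC tool you already use.

```hcl
# Sketch: adopt a console-created load balancer into Terraform using an
# import block. The ARN and names below are placeholders.
import {
  to = aws_lb.legacy_public
  id = "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/legacy-public/50dc6c495c0c9188"
}

# `terraform plan -generate-config-out=generated.tf` drafts a matching
# resource "aws_lb" "legacy_public" block from the live resource, which you
# then review, commit, and own like any other code.
```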

Practice 2: The Pipeline Runs the Same Way Every Time

The second practice is pipeline determinism. When the same commit is built and deployed twice, the result should be byte-identical, or as close to it as the build toolchain allows. This means pinned dependencies, pinned base images, pinned tool versions, and a build that does not reach out to the internet for anything except the explicit inputs you declared.

Non-deterministic pipelines are the source of most "it worked yesterday" problems. A build that implicitly pulls latest from npm, runs apt-get update during image construction, or relies on an unpinned GitHub Action will eventually produce a different artifact for the same source code, and nobody will know why.

The fix is boring. Lock files for every package manager. Digest-pinned base images for every Dockerfile. Explicit versions on every GitHub Action. A private package mirror if you can afford one, so the upstream registry going down does not block your deploys. The organizations that do this have builds that work at 2 a.m. on a Sunday when half the internet is on fire. The ones that do not will eventually have a bad day.
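Concretely, the boring fix tends to look like the sketches below. The digest and commit SHA are placeholders you look up once and thereafter change only on purpose, and the hash-pinned requirements file stands in for whichever lock file your package manager produces.

```dockerfile
# Pin the base image by digest, not just by tag, so a re-pushed "3.12-slim"
# tag can no longer change what you build. Look the digest up once, e.g. with
# `docker buildx imagetools inspect python:3.12-slim`, and paste it here.
FROM python:3.12-slim@sha256:<digest-you-looked-up>
COPY requirements.txt .
# --require-hashes makes pip reject anything not listed, with its hash, in the lock file.
RUN pip install --no-cache-dir --require-hashes -r requirements.txt
```

```yaml
# Pin actions to a full commit SHA rather than a mutable tag, and keep the
# human-readable release in a trailing comment. The SHA here is a placeholder.
jobs:
  build:
    runs-on: ubuntu-22.04                          # not ubuntu-latest
    steps:
      - uses: actions/checkout@<full-commit-sha>   # tracks v4
```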

Practice 3: Observability Before Features

The third practice is that every new service gets logs, metrics, and traces on day one. Not "after we ship the MVP." Not "when we have time." On day one, before the service handles real traffic.

The reason this practice survives is that the cost of retrofitting observability into a service that is already live is roughly 10x the cost of building it in from the start. The first production incident for an unobserved service is the one where the team spends four hours guessing what is happening because there is nothing to look at, and then spends the next week adding instrumentation that should have been there all along. We have watched this exact sequence play out dozens of times.

The minimum bar we enforce is: structured logs to a centralized log store, a handful of business-relevant metrics (request rate, error rate, latency p50/p95/p99), and distributed traces for any call that crosses a service boundary. The specific tools change every few years. The requirement does not.
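For concreteness, the day-one bar can be as small as the sketch below, assuming prometheus_client for metrics and stdlib logging for structured output; the service, route, and metric names are illustrative, the percentiles come out of the histogram on the scrape side, and a real setup would add OpenTelemetry traces for anything that crosses a service boundary.

```python
"""Minimal day-one instrumentation sketch; all names are illustrative."""
import json
import logging
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Requests handled", ["route", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["route"])

log = logging.getLogger("payments-api")
logging.basicConfig(level=logging.INFO, format="%(message)s")


def handle(route: str, request_id: str) -> None:
    start = time.perf_counter()
    status = "200"
    try:
        pass  # the actual handler work goes here
    except Exception:
        status = "500"
        raise
    finally:
        elapsed = time.perf_counter() - start
        REQUESTS.labels(route=route, status=status).inc()
        LATENCY.labels(route=route).observe(elapsed)  # p50/p95/p99 come from the histogram
        # One structured line per request, parseable by the central log store.
        log.info(json.dumps({
            "event": "request", "route": route, "request_id": request_id,
            "status": status, "duration_s": round(elapsed, 4),
        }))


if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for the scraper
    handle("/charge", "req-123")
```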

A corollary that never gets the emphasis it deserves: dashboards that nobody looks at are not observability. They are wall art. The real measure of observability is whether the on-call engineer can answer "what is wrong" during an incident, not whether there is a dashboard with 40 graphs on it.

Practice 4: Rollback Is a First-Class Operation

The fourth practice is that rollback is a tested, documented, one-button operation — not a theoretical capability. Every deployment pipeline we build assumes that rollback will happen eventually, and designs the forward path to make it cheap.

This shapes a lot of downstream decisions. Database migrations have to be backward-compatible across one version, because rollback happens at the application layer independently of the data layer. Feature flags cover behavioral changes, so rollback can be a flag flip instead of a redeploy. Blue-green or canary deployments exist precisely so that rollback is a traffic shift, not a code push.
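The migration constraint usually resolves into the expand/contract pattern. A sketch, using hypothetical table and column names and PostgreSQL syntax:

```sql
-- Release N ("expand"): additive only. The previous application version never
-- reads the new column, so rolling the app back remains safe.
ALTER TABLE orders ADD COLUMN currency_code CHAR(3);

-- Between releases: backfill out of band, in batches, not inside the deploy.
UPDATE orders SET currency_code = 'USD' WHERE currency_code IS NULL;

-- Release N+1 ("contract"): only once no deployed version needs the old shape
-- do you tighten constraints or drop what the old code relied on.
ALTER TABLE orders ALTER COLUMN currency_code SET NOT NULL;
```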

The discipline that enforces this is drill-based. Every quarter, take a non-critical service, deploy a benign change, and roll it back through the pipeline. Time the operation. If it takes longer than 10 minutes, something is wrong; fix it before you need it for real. If the rollback did not actually work, fix that immediately. If the rollback drill has not been run in the last six months, assume the capability is broken, because it almost certainly is.
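The drill itself does not need to be elaborate. One possible shape, if the service happens to run as a Kubernetes Deployment (the point is the timer and the pass/fail threshold, not the tool):

```bash
#!/usr/bin/env bash
# Quarterly rollback drill sketch; "sandbox-api" is a placeholder for a
# non-critical service, and the rollback command is whatever your pipeline uses.
set -euo pipefail

start=$(date +%s)
kubectl rollout undo deployment/sandbox-api
kubectl rollout status deployment/sandbox-api --timeout=10m
elapsed=$(( $(date +%s) - start ))

echo "rollback completed in ${elapsed}s"
# Anything over 10 minutes fails the drill and gets fixed now, not during an incident.
[ "${elapsed}" -le 600 ]
```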

Teams that drill rollback regularly are calm during incidents. Teams that have only rolled back in theory are the ones you read about on the postmortem blogs.

What We Leave Off This List

Notice what is not here. Kubernetes is not on this list. GitOps is not on this list. Service meshes are not on this list. Not because those things are bad — they are often useful — but because they are tools, and the four practices above work regardless of which tools you pick. A team that does these four things on plain VMs with Ansible will outperform a team that does none of them on a bleeding-edge Kubernetes stack with a service mesh and a GitOps operator.

The DevOps fashion cycle loves new tools. The boring, practice-first version of DevOps is what actually survives.

What We Actually Recommend

Start with the repo rule. Inventory every piece of configuration that matters and get it under version control. This is the foundation everything else sits on.

Then fix pipeline determinism. Pin everything. Add a private mirror if you rely on public package registries for production builds.

Then enforce observability on every new service as a shipping requirement. Make it a blocker on the code review checklist, not an afterthought.

Then drill rollback. Actually do it. Put a recurring event on the calendar and run the drill even when nothing is broken.

Do these four things and the rest of DevOps becomes tool selection, which is a much easier problem.

Three Takeaways

  1. The practices that survive are the ones about discipline, not the ones about tools. Every tool we picked ten years ago has been replaced. The practices have not.
  2. Rollback drills are the cheapest insurance in operations. A team that has rolled back in anger is calm in a real incident.
  3. Observability before features is a line in the sand worth defending. Retrofitting instrumentation after an incident always costs more than building it in from day one.

Talk with us about your infrastructure

Schedule a consultation with a solutions architect.
