Building a DevOps Pipeline That Survives Year Two

Most DevOps pipelines look great at launch and fall apart by the second year. Here's how to build one that doesn't — based on what we see break in real customer environments.

John Lane · 2023-07-31 · 6 min read

Every DevOps pipeline looks great the day it's built. The pipeline that still looks great two years later is rare, and the difference between the two is almost never about which tools you picked. It's about a handful of decisions that determine whether the pipeline ages well or rots.

Here's what we've seen work across dozens of customers building pipelines that have to survive real teams and real turnover.

1. Version-Control Everything, Including the Pipeline Itself

The pipeline configuration lives in the same repo as the code it builds. Not in a separate "devops" repo, not in the CI tool's UI, not in a Jenkins job configuration that someone clicked together in 2019. In the repo, reviewed as code, with history.

This sounds obvious in 2023. It's still not the default in probably half the environments we audit. You'll find Jenkins jobs whose last change is from a person who left the company, GitHub Actions workflows that reference a deprecated action version, and CircleCI configs that work by accident.

What "everything" includes:

  • Build and test steps
  • Deployment scripts
  • Infrastructure (Terraform, Bicep, CloudFormation, Pulumi)
  • Secrets management configuration (not the secrets themselves)
  • Alerting and monitoring rules
  • Runbook documentation

If the pipeline isn't in git, you don't have a pipeline. You have a ritual.

2. Fast Feedback or Nothing

A developer's willingness to run the pipeline is inversely proportional to how long it takes. If CI takes 30 minutes, developers find ways around it. If it takes 3 minutes, they trust it.

Rules we enforce:

  • Unit tests: under 2 minutes for the full suite. If they're slower, they're testing the wrong thing. Mock out the slow dependencies.
  • Integration tests: under 10 minutes. Parallelize ruthlessly. Spin up ephemeral databases per test class, not per suite.
  • Full pipeline on PR: under 15 minutes. This is the number developers will tolerate. Past it, they start skipping CI and hoping.
  • Full pipeline to production: under an hour. Not because production deploys need to be fast, but because the slow pipeline is telling you something is wrong.

When pipelines are slow, the answer is almost always parallelization and caching, not bigger runners. Break the test suite into shards. Cache dependencies aggressively (npm, pip, maven, docker layers). Use matrix builds for the things that genuinely need to run in parallel.
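
In practice, the sharding half of that is small. Here's a minimal sketch of hash-based test sharding in Python — the tests/ layout, shard count, and pytest invocation are stand-ins for whatever your stack actually uses:

```python
# shard_tests.py -- deterministic hash-based test sharding.
# Sketch only: the tests/ layout, shard count, and pytest invocation
# are assumptions; swap in your own layout and runner.
import hashlib
import subprocess
import sys
from pathlib import Path

def shard_for(path: str, total_shards: int) -> int:
    """Map a test file to a shard by hashing its path. Hash-based
    assignment is stable across runs, so every CI worker computes
    the same split with no coordination."""
    digest = hashlib.sha256(path.encode()).hexdigest()
    return int(digest, 16) % total_shards

def main() -> None:
    shard_index = int(sys.argv[1])   # e.g. 0..3, from the CI matrix
    total_shards = int(sys.argv[2])  # e.g. 4
    tests = sorted(str(p) for p in Path("tests").rglob("test_*.py"))
    mine = [t for t in tests if shard_for(t, total_shards) == shard_index]
    # Each matrix job runs only its slice of the suite.
    sys.exit(subprocess.call(["pytest", "-q", *mine]))

if __name__ == "__main__":
    main()
```

Each matrix job runs `python shard_tests.py $SHARD_INDEX $SHARD_TOTAL`. The tradeoff: hashing balances shards by file count, not runtime, so teams with a few slow test files eventually shard by recorded duration instead.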

3. Separate Build From Deploy

One of the quiet design mistakes is treating "build, test, deploy" as one pipeline. When build and deploy are the same pipeline, a rollback means rebuilding from a git commit. That's slow, and it means your "rollback" might produce a subtly different artifact than the one that was running.

The pattern that works:

  • Build pipeline: produces a versioned, immutable artifact (a container image, a zip, a package). Outputs to a registry. Does not touch production.
  • Deploy pipeline: takes an artifact reference and promotes it through environments (dev → staging → prod). Never builds, never recompiles.

The deploy pipeline's input is an artifact tag, not a git commit. Rollback is "deploy the previous tag," which is a 30-second operation. You can also deploy the same artifact to staging and then to production without rebuilding, which is the only way to have any real confidence that the thing you tested in staging is the thing that went to production.
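
To make that concrete, here's roughly what the entire deploy step can look like — a sketch assuming a container workflow on Kubernetes, with a hypothetical registry path and deployment name:

```python
# promote.py -- deploy an existing artifact by tag.
# Sketch only: the registry path, deployment name, and one-namespace-
# per-environment layout are assumptions.
import subprocess
import sys

REGISTRY = "registry.example.com/acme/web"  # hypothetical

def promote(tag: str, environment: str) -> None:
    """Point the environment's deployment at an already-built image.
    Note what's absent: no checkout, no compile, no docker build."""
    image = f"{REGISTRY}:{tag}"
    subprocess.run(
        ["kubectl", "set", "image", "deployment/web", f"web={image}",
         "--namespace", environment],
        check=True,
    )
    # Block until the rollout finishes (or fails) so CI gets a real signal.
    subprocess.run(
        ["kubectl", "rollout", "status", "deployment/web",
         "--namespace", environment],
        check=True,
    )

if __name__ == "__main__":
    promote(sys.argv[1], sys.argv[2])  # e.g. promote.py v4.23.7 staging
```

The only input is a tag. Promoting `v4.23.7` to prod after a good staging run ships the exact bytes you tested, and rollback is the same call with the previous tag.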

4. Environments Are Not Snowflakes

The second-biggest source of pipeline rot: environment drift. Dev looks different from staging looks different from production. Something works in staging and breaks in production because of a config difference nobody documented.

The fix:

  • Infrastructure as code for every environment. Same code, different variable files. If production and staging use different modules, you're going to get drift.
  • Environments as cattle, not pets. You should be able to delete staging and recreate it from code in under an hour. If you can't, it's a snowflake and it's going to cause an incident.
  • Configuration externalized, not baked into the image. Same container image runs in all environments. Only environment variables differ (see the sketch after this list).
  • Secrets from a vault, not from a CI variable. AWS Secrets Manager, Azure Key Vault, Google Secret Manager, or HashiCorp Vault. Never in the CI tool's config.
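
Here's the "configuration externalized" rule as a minimal Python sketch — the variable names are hypothetical, but the shape is the point: one image, environment-specific values injected at runtime, and a hard failure if one is missing:

```python
# config.py -- configuration from the environment, never from the image.
# Sketch only: the variable names are hypothetical.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    database_url: str
    log_level: str
    feature_flags_url: str

def _require(name: str) -> str:
    """Fail fast at startup so a missing variable is a crash at deploy
    time, not a mystery in production at 2 AM."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"missing required env var: {name}")
    return value

def load_settings() -> Settings:
    # The same container image runs in dev, staging, and prod;
    # only these environment variables differ between them.
    return Settings(
        database_url=_require("DATABASE_URL"),
        log_level=os.environ.get("LOG_LEVEL", "INFO"),
        feature_flags_url=_require("FEATURE_FLAGS_URL"),
    )
```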

5. Tests That Actually Catch What Breaks

Most test suites grow in size and shrink in usefulness. They get slower every year, they fail intermittently, and they catch fewer real bugs. At some point the team stops trusting them and starts ignoring failures.

The fix is unglamorous: prune. Delete tests that:

  • Haven't failed in six months (they're not testing anything)
  • Fail intermittently without a clear cause (they're noise)
  • Test implementation details rather than behavior (they break on refactors)
  • Mock so much that they're testing the mocks, not the code

A test suite of 500 fast, reliable tests is more valuable than 5,000 slow, flaky ones. Teams hate deleting tests. Do it anyway.

The other lever: add tests at the integration and end-to-end layer that exercise real user paths. One good e2e test that logs in, creates an order, and checks out is worth dozens of unit tests that check if a price formatter returns "$" before the number.
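
For scale, here's a sketch of what that one test might look like, written with Playwright — the staging URL and selectors are hypothetical placeholders for your app's real ones:

```python
# test_checkout_path.py -- one end-to-end test of the money path.
# Sketch only: the staging URL and selectors are hypothetical.
from playwright.sync_api import sync_playwright, expect

def test_login_order_checkout() -> None:
    """Exercise the real user path: log in, add to cart, check out."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        page.goto("https://staging.example.com/login")
        page.fill("#email", "e2e-bot@example.com")
        page.fill("#password", "not-a-real-secret")
        page.click("button[type=submit]")

        page.goto("https://staging.example.com/products/widget")
        page.click("#add-to-cart")
        page.click("#checkout")

        # The assertion that matters: the order confirmation exists.
        expect(page.locator(".order-confirmation")).to_be_visible()
        browser.close()
```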

6. Observability Belongs in the Pipeline

The pipeline should produce observable artifacts in production. Deploy events, version metadata, and feature flag changes should all land in your observability platform. When an incident happens at 2:14 PM, the on-call engineer should see "deploy v4.23.7 at 2:11 PM" as the top entry in the timeline.

Concretely:

  • Tag every metric with the deployed version. Prometheus labels, Datadog tags, whatever your platform uses.
  • Emit deploy events to your monitoring tool. Grafana, Datadog, and Honeycomb all support deploy annotations (see the sketch after this list).
  • Log version on startup. Every service logs its version and git SHA on boot.
  • Deploy notifications to incident channels. Even a simple Slack message in #deploys saves a real hour during an incident.
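
The deploy-event piece is a few lines. A sketch assuming Grafana's annotations API — the Grafana URL and token handling are stand-ins for your setup:

```python
# announce_deploy.py -- emit a deploy event as a Grafana annotation.
# Sketch only: the Grafana URL and token env var are assumptions.
import os
import time
import requests

def announce_deploy(version: str, git_sha: str) -> None:
    """POST to Grafana's annotations API so the deploy shows up on
    dashboard timelines next to the metrics it may have moved."""
    response = requests.post(
        "https://grafana.example.com/api/annotations",
        headers={"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"},
        json={
            "time": int(time.time() * 1000),  # epoch milliseconds
            "tags": ["deploy", f"version:{version}"],
            "text": f"deploy {version} ({git_sha[:8]})",
        },
        timeout=10,
    )
    response.raise_for_status()

if __name__ == "__main__":
    announce_deploy(os.environ["VERSION"], os.environ["GIT_SHA"])
```

Call it as the last step of the deploy pipeline, after the rollout succeeds.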

7. Rollback Is a First-Class Feature

"How do I roll back?" should have an answer that does not involve reading documentation. Ideally, the CI/CD tool has a button. Failing that, a single command. Failing that, a runbook so short it fits on a Post-it.

Rollback strategies we recommend, in order of preference:

  • Deploy the previous artifact tag. The cleanest, cheapest rollback. Works if your deploy pipeline is separated from build.
  • Feature flags for risky changes. Turn off the feature without a redeploy. Good for UI changes and new features.
  • Blue-green deploys. Route traffic back to the previous environment. Fast but requires double the capacity.
  • Canary deploys with automated rollback. Percentage-based rollout with automatic rollback on error threshold. Best-in-class, more infrastructure to build.

The first one handles 80% of cases. Build that before you worry about the rest.
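
As a sketch of how small "deploy the previous tag" can be — this reuses the hypothetical promote.py from section 3 and assumes the deploy pipeline appends each released tag to a simple history file (that file is an invented convention, not a standard):

```python
# rollback.py -- "deploy the previous tag" as one command.
# Sketch only: reuses the hypothetical promote.py from section 3, and
# the deploy-history file is an invented convention.
import json
import sys

from promote import promote

def rollback(environment: str) -> None:
    """Redeploy the last-known-good artifact: no rebuild, no git archaeology."""
    with open(f"deploy-history-{environment}.json") as f:
        history = json.load(f)  # newest first, e.g. ["v4.23.7", "v4.23.6"]
    previous_tag = history[1]
    promote(previous_tag, environment)

if __name__ == "__main__":
    rollback(sys.argv[1])  # e.g. python rollback.py prod
```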

8. Plan for the Second Year

The pipeline you build today will be maintained by someone else in two years. Write it for them:

  • README in the repo explaining the pipeline. Not auto-generated. A human-written "here's how this works and why."
  • Decision log for why weird things are weird. The two lines that look redundant but are load-bearing. Explain them or someone will delete them and break production.
  • Test the rollback quarterly. If it hasn't been used, assume it's broken.
  • Review the pipeline when the team changes. New owners should be able to understand it within a day.

Three Takeaways

  1. The pipeline ages as well as the documentation does. Treat the README and the runbook as first-class deliverables.
  2. Fast feedback is not optional. If developers don't trust the pipeline, they'll find ways around it, and your quality gates stop working.
  3. Separate build from deploy so rollback is trivial. This one decision pays dividends every time something goes wrong.

Talk with us about your infrastructure

Schedule a consultation with a solutions architect.
