AI & Automation

7 Ways to Deploy ML in Production Without Regret

Model versioning, A/B testing, drift detection, rollback — the MLOps patterns that separate production ML from science projects.

John Lane 2021-08-10 5 min read

Most ML projects die in the gap between "the notebook works" and "it's running in production and people depend on it." The data scientist's notebook does not translate to a production deployment without a lot of engineering. The engineering is not optional. Shipping an unreliable model to production is worse than not shipping a model at all — users lose trust, cleanup is expensive, and the project gets cancelled.

Here are seven practices that separate production ML from one-off experiments.

1. Version Everything (Code, Data, Model, Config)

A model is defined by four things: the training code, the training data, the hyperparameters, and the weights. Reproducing or debugging a model means having access to all four at a specific point in time.

What to version:

  • Code: Git, like everything else.
  • Data: DVC, LakeFS, or a snapshot convention in cloud object storage. "The dataset as of 2026-03-01" should be a specific, retrievable artifact.
  • Model weights: A model registry — MLflow, Weights & Biases, Vertex AI Model Registry, SageMaker Model Registry.
  • Hyperparameters and run metadata: Logged at training time, stored with the model.

When the model in production starts behaving strangely, you need to be able to reconstruct exactly how it was trained. Without this, you are guessing.
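A minimal sketch of what this can look like with MLflow, assuming a tracking server is already configured; the dataset URI, model name, tags, and commit hash below are illustrative placeholders, not a prescribed setup:

```python
# Sketch: log code version, data snapshot, hyperparameters, and weights together.
# The URIs, tags, and model name are placeholders for illustration.
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_train = np.random.rand(500, 10)           # stand-in for the real training set
y_train = np.random.randint(0, 2, 500)

params = {"n_estimators": 200, "max_depth": 8}

with mlflow.start_run():
    # Hyperparameters and run metadata, stored with the run
    mlflow.log_params(params)
    mlflow.set_tag("data_snapshot", "s3://ml-data/churn/snapshot-2026-03-01/")
    mlflow.set_tag("git_commit", "abc1234")

    model = RandomForestClassifier(**params).fit(X_train, y_train)
    mlflow.log_metric("train_accuracy", model.score(X_train, y_train))

    # Weights go to the registry so serving can pull a specific version later
    mlflow.sklearn.log_model(model, "model", registered_model_name="churn-model")
```

The point is that a single run ID resolves to all four pieces: the commit, the data snapshot, the hyperparameters, and the weights.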

2. Separate Training and Serving Infrastructure

Training is resource-intensive, bursty, tolerant of failure. Serving is steady-state, latency-sensitive, intolerant of failure. Running them on the same cluster leads to one of two outcomes: training starves serving during peak load, or serving wastes expensive GPU capacity during off-hours.

The pattern:

  • Training on spot instances or preemptible VMs when possible. 60 to 90 percent cheaper than on-demand. Use checkpointing to survive interruptions.
  • Serving on dedicated instances sized for steady state, with autoscaling for peaks.
  • Model artifacts as the handoff — training produces a versioned model in the registry, serving pulls from the registry. No shared infrastructure required.
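The handoff can stay that thin in code. A sketch of the serving side, reusing the hypothetical churn-model registered in the earlier example; the serving process only needs access to the registry, never to the training cluster:

```python
# Sketch: serving pulls a specific, immutable model version from the registry.
# "churn-model" and version "3" are placeholders; a stage or alias works too.
import mlflow.pyfunc

MODEL_URI = "models:/churn-model/3"          # registry reference, not a file on a shared disk
model = mlflow.pyfunc.load_model(MODEL_URI)  # loaded once at serving startup

def predict(features):
    # features: a DataFrame with the same columns used at training time
    return model.predict(features)
```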

3. Canary Deployments for Models

Never route 100 percent of traffic to a new model version on release. Route 5 percent, watch the metrics, then 25 percent, then 50, then full rollout. If metrics degrade at any stage, roll back.

What to watch during a canary:

  • Prediction latency (p50, p95, p99)
  • Error rate
  • Model-specific metrics (prediction distribution, confidence scores, output range)
  • Downstream business metrics (conversion, click-through, retention)

The last one matters most. A model that scores well in offline evaluation can underperform in production for reasons that only show up once real users are interacting with it.
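In practice the traffic split usually lives in a load balancer or service mesh, but the idea fits in a few lines. A sketch of hash-based routing, with made-up model names and a 5 percent starting weight:

```python
# Sketch: deterministic canary routing so each user consistently hits the same version.
import hashlib

CANARY_WEIGHT = 0.05  # raise in stages (5 -> 25 -> 50 -> 100) as metrics hold

def pick_model_version(user_id: str) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "model-v2-canary" if bucket < CANARY_WEIGHT * 100 else "model-v1-stable"

print(pick_model_version("user-42"))  # same answer for this user on every request
```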

4. Drift Detection

Models trained on 2024 data assume the world in 2025 looks like 2024. It usually doesn't. Inputs change (seasonality, new user segments, economic shifts), outputs change, and the relationships between them change. A model that was accurate six months ago can be quietly wrong today.

Types of drift to monitor:

  • Data drift: The distribution of inputs changes. A feature that was normally 0 to 100 is now 0 to 200.
  • Label drift: The distribution of ground-truth outputs changes. The base rate of the positive class shifts from 5 percent to 15 percent.
  • Concept drift: The relationship between inputs and outputs changes. The model's internal assumptions are no longer correct.

Tools: Evidently AI, Arize, WhyLabs, NannyML, or a roll-your-own monitoring pipeline. The monitoring doesn't need to be fancy: histograms of inputs and outputs compared to training data, computed daily, alerted on significant change, will catch most drift.
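As a concrete version of that histogram comparison, here is a sketch using the Population Stability Index; the feature values and the 0.2 threshold are illustrative:

```python
# Sketch: compare today's input distribution to the training distribution with PSI.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    edges = np.histogram_bin_edges(expected, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf          # catch values outside the training range
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)             # avoid log(0) on empty bins
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

training_feature = np.random.normal(50, 10, 10_000)   # stand-in for the training snapshot
todays_feature = np.random.normal(60, 12, 2_000)      # stand-in for today's production inputs

score = psi(training_feature, todays_feature)
if score > 0.2:   # common rule of thumb: above 0.2 suggests meaningful drift
    print(f"ALERT: input drift detected, PSI={score:.2f}")
```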

5. A/B Testing With Real Business Metrics

Offline model evaluation tells you the model is better on the holdout set. That's not the same as the model being better in production. The only way to know is to A/B test against the current model and measure the business outcome.

What's often surprising:

  • A "better" model (higher AUC, lower RMSE) can produce worse business outcomes because it optimizes for the wrong thing.
  • A simpler, less accurate model can outperform a complex one because it's faster and latency matters more than accuracy for the use case.
  • The baseline (no model, heuristic, current model) often performs better than expected.

Run A/B tests long enough to get statistical significance. Don't stop early because the new model looks good after two days.
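A sketch of the kind of check that closes out the test: a two-sided, two-proportion z-test on conversion, with made-up counts standing in for the experiment logs:

```python
# Sketch: z-test for conversion rates of control (current model) vs. treatment (candidate).
from math import sqrt
from statistics import NormalDist

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided
    return z, p_value

z, p = two_proportion_z(conv_a=480, n_a=10_000, conv_b=530, n_b=10_000)
print(f"z={z:.2f}, p={p:.3f}")  # decide at the planned sample size, not a two-day peek
```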

6. Rollback Without Panic

You will need to roll back a model. When it happens, you want the rollback to be a one-command operation that you've tested this month, not an emergency exercise.

The pattern:

  • Model versions in the registry are immutable. Once deployed, the artifact doesn't change.
  • The serving infrastructure can point to any past version via config.
  • Rollback is a config change, not a rebuild.
  • Test rollback in staging on a schedule.

A team that can roll back a model in two minutes can deploy more aggressively. A team that dreads rollback deploys less often and misses improvements.
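A sketch of the rollback-as-config-change pattern, assuming the registry setup from earlier; the config file name and keys are hypothetical:

```python
# Sketch: serving resolves its model from a small config file, so rollback is a config edit.
import json
import mlflow.pyfunc

def load_serving_model(config_path: str = "serving_config.json"):
    with open(config_path) as f:
        cfg = json.load(f)   # e.g. {"model_name": "churn-model", "version": "7"}
    uri = f"models:/{cfg['model_name']}/{cfg['version']}"
    return mlflow.pyfunc.load_model(uri)

# Rollback: change "version" from "7" back to "6" and redeploy or hot-reload.
# No rebuild, no retraining; the version 6 artifact is still sitting immutable in the registry.
```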

7. Cost Monitoring Per Prediction

ML workloads are the easiest part of a cloud bill to lose track of. A single bad deployment can 10x your inference costs overnight. Monitor cost per prediction as a first-class metric.

What to track:

  • Compute cost for training runs, broken down by project
  • Inference cost per 1000 predictions, by model
  • Storage cost for training data and model artifacts
  • Data transfer costs (these sneak up, especially with cross-region traffic)

Alert when cost per prediction rises significantly. Investigate the cause — could be a model that got larger, inefficient batching, a traffic spike, or a misconfiguration.
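A sketch of the arithmetic, with illustrative numbers and an arbitrary 50 percent alert threshold; in practice the inputs come from your billing export and request logs:

```python
# Sketch: cost per 1000 predictions as a first-class, alertable metric.
def cost_per_1000(hourly_cost_usd: float, predictions_per_hour: int) -> float:
    return hourly_cost_usd / predictions_per_hour * 1000

baseline = cost_per_1000(hourly_cost_usd=3.06, predictions_per_hour=120_000)   # ~$0.026
today = cost_per_1000(hourly_cost_usd=6.12, predictions_per_hour=110_000)      # ~$0.056

if today > 1.5 * baseline:   # alert on a significant rise, then go find the cause
    print(f"ALERT: cost per 1k predictions rose from ${baseline:.3f} to ${today:.3f}")
```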

Where Self-Hosted Open-Weight Models Fit

In 2025, the self-hosted option is viable for a wider range of workloads than it used to be. A single GPU can run a 35B-parameter model at 40 to 50 tokens per second via llama.cpp or vLLM. For classification, extraction, summarization, and other non-frontier tasks, a fine-tuned open-weight model often matches or exceeds GPT-4-class quality at a fraction of the cost.

When self-hosted makes sense:

  • Sustained high inference volume (thousands of requests per hour)
  • Data residency or compliance that rules out cloud APIs
  • Customization that proprietary APIs don't allow
  • Latency requirements that cloud APIs can't meet

When it doesn't:

  • Bursty or low volume — the GPU sits idle
  • Frontier-class quality requirements
  • Small team without ML ops capacity

What We'd Actually Do

For a team deploying their first production ML model:

  1. Model registry from day one. MLflow is free and adequate.
  2. Separate training and serving infrastructure. Don't share.
  3. Canary deployments even if it feels like overkill. The one time you need it, you'll be grateful.
  4. Basic drift monitoring. Histograms, daily, alerted.
  5. One-command rollback. Tested monthly.
  6. Cost per prediction reported weekly.

This is maybe 2 to 4 weeks of engineering on top of the ML work itself. It's what separates shipping ML from experimenting with ML.

Three Takeaways

  1. Version everything, not just code. Reproducibility is the foundation of debuggability.
  2. Drift is inevitable. Monitor for it from day one, not when something breaks.
  3. A production ML model is a product, not a science project. Apply product rigor — canaries, rollback, monitoring, cost tracking.

Talk with us about your infrastructure

Schedule a consultation with a solutions architect.
