Deploying ML Models in the Cloud: From Notebook to Production
Most ML models never leave the notebook. The ones that do usually fail in the same four places. Here is the deployment shape we use to keep them alive.

The uncomfortable truth about enterprise ML is that most models never leave the notebook. The ones that do usually crash into the same small set of problems within the first 90 days of production. This is not a tooling problem — SageMaker, Vertex AI, and Azure ML all work fine — it is a shape-of-the-problem problem. Teams underestimate what "deploying a model" actually means and overestimate how much the managed service will handle for them.
Here is the deployment shape we recommend, the four places production ML typically fails, and what to do about each one.
The Deployment Shape
A deployed model has five things around it, not one. The model is the small piece. The pieces that matter are:
- A serving layer that turns a prediction request into a prediction response. This is a small HTTP or gRPC service, ideally stateless, that loads the model into memory and returns inferences. For most models this is a container running Triton, TorchServe, BentoML, or plain FastAPI with the model loaded at startup (a minimal sketch follows this list).
- A feature pipeline that transforms raw inputs into the feature vector the model expects. This is the part teams always forget, and it is the part that breaks first. The transforms the notebook did on the training data need to be reproduced exactly, in production, on every single inference request. If they are not, you have silent training-serving skew and your model is quietly wrong.
- A model registry that versions the model artifacts, tracks which version is in production, and supports rollback. MLflow, SageMaker Model Registry, Vertex AI Model Registry, or a plain S3 bucket with a version tag all work. What matters is that "which model is live" is a question you can answer with a single command (an example follows below).
- An observability layer that tracks prediction volume, latency, input distributions, and output distributions over time. This is how you catch data drift before your users do.
- A retraining pipeline that takes new labeled data and produces a new model version on a schedule or trigger. Kubeflow, Airflow, Prefect, Metaflow — pick one and stick with it.
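To make the first item concrete, here is a minimal serving-layer sketch. It assumes a pickled scikit-learn-style model saved as model.pkl and a caller that already sends the transformed feature vector; the file name, route, and field names are illustrative, not a prescribed layout.

```python
# Minimal serving layer: a stateless FastAPI app with the model loaded once at
# process startup. Assumes a pickled scikit-learn-style model at MODEL_PATH;
# all names here are illustrative.
import pickle
from pathlib import Path

from fastapi import FastAPI
from pydantic import BaseModel

MODEL_PATH = Path("model.pkl")  # in practice, pulled from the model registry at deploy time

with MODEL_PATH.open("rb") as f:
    model = pickle.load(f)  # loaded once, reused for every request

app = FastAPI()


class PredictRequest(BaseModel):
    features: list[float]  # the already-transformed feature vector


class PredictResponse(BaseModel):
    prediction: float


@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest) -> PredictResponse:
    # scikit-learn-style models expect a 2D array: one row per request
    y = model.predict([req.features])[0]
    return PredictResponse(prediction=float(y))
```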
If any of those five pieces is missing, your deployment is incomplete. Most of the "failed ML project" post-mortems we have read come down to one or more of those pieces never getting built.
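And as the one-command registry example promised above: assuming MLflow as the registry and a registered model named churn-model (a placeholder, along with the Production stage convention), checking which version is live takes a few lines.

```python
# "Which model is live?" answered with one command, assuming an MLflow registry
# and a registered model named "churn-model" (both assumptions, not prescriptions).
from mlflow.tracking import MlflowClient

client = MlflowClient()
for mv in client.get_latest_versions("churn-model", stages=["Production"]):
    print(mv.name, mv.version, mv.source)
```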
Highlight One: Feature Parity Is the Whole Ballgame
The notebook loads a CSV, runs some pandas transforms, fits a model, and saves it. The deployment loads raw events from Kafka, transforms them in a different language or a different library version, and sends them to the model. The transforms are almost the same. The model is quietly wrong.
The fix is to treat feature transformation as code that is shared between training and serving. Feast, Tecton, and the cloud-native feature stores (Vertex AI Feature Store, SageMaker Feature Store) exist for this reason. Even if you do not use a formal feature store, put the transformation logic in a Python package that is imported by both the training pipeline and the serving container. Never reimplement transforms in a different language for serving — that is how you get drift that takes three months to diagnose.
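As a sketch of what "shared transformation logic" looks like in practice, the module below assumes pandas and a handful of illustrative columns; the point is that the training job and the serving container both import build_features from the same package, so the transforms cannot diverge.

```python
# features.py: one transform module imported by BOTH the training job and the
# serving container, so the feature logic lives in exactly one place.
# Column names and transforms are illustrative.
import numpy as np
import pandas as pd

FEATURE_COLUMNS = ["amount_log", "is_domestic", "days_since_signup"]


def build_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Turn raw events into the exact feature frame the model was trained on."""
    out = pd.DataFrame(index=raw.index)
    out["amount_log"] = np.log1p(raw["amount"].clip(lower=0))
    out["is_domestic"] = (raw["country"] == "US").astype(int)
    out["days_since_signup"] = (
        pd.to_datetime(raw["event_time"]) - pd.to_datetime(raw["signup_time"])
    ).dt.days
    return out[FEATURE_COLUMNS]


# Training: build_features(raw_training_df) before fitting.
# Serving:  build_features(pd.DataFrame([request_payload])) on every request.
```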
Highlight Two: Latency Budgets Are Cumulative
A model that returns a prediction in 40 ms looks fast in the notebook. In production, the user-facing request that invokes it has to do a database lookup (20 ms), fetch features from a feature store (15 ms), deserialize and validate the request (5 ms), call the model (40 ms), post-process the output (10 ms), and log the result (5 ms). That is 95 ms before the network round trip. Add a cross-region hop and you are over 150 ms. Add a cold container start and you are at 2 seconds.
Before you deploy, write down the latency budget for the entire request, not just the model inference. Allocate the budget across the pieces. Test the whole chain, not just the model call. And keep the model and its feature store in the same region — cross-region feature fetches are one of the most common reasons production ML feels slow.
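One way to make that concrete is to write the budget down in code and assert against it in a load test. The sketch below uses the stage names and millisecond targets from the example above, which are illustrative, and a small timed() helper of our own invention to wrap each stage of the real request path.

```python
# Write the budget down in code and test the whole request chain against it.
# Stage names and millisecond targets mirror the example above and are illustrative.
import time
from contextlib import contextmanager

BUDGET_MS = {
    "db_lookup": 20,
    "feature_fetch": 15,
    "validate": 5,
    "inference": 40,
    "postprocess": 10,
    "logging": 5,
}

timings: dict[str, float] = {}


@contextmanager
def timed(stage: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000


# In a load test, wrap each stage of the real request path, e.g.:
#   with timed("feature_fetch"):
#       features = feature_store.get(entity_id)   # hypothetical call
def check_budget() -> None:
    for stage, limit_ms in BUDGET_MS.items():
        actual = timings.get(stage, 0.0)
        assert actual <= limit_ms, f"{stage}: {actual:.1f} ms over its {limit_ms} ms budget"
    assert sum(timings.values()) <= sum(BUDGET_MS.values()), "total budget exceeded"
```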
Highlight Three: GPU Serving Is Usually the Wrong Default
Teams assume that because training needed a GPU, serving needs one too. For most models, that is not true. A transformer with 100 million parameters can serve on a CPU at 50 to 100 ms per request, which is fine for most applications. GPU serving only pays off when you have batch throughput above some break-even point — usually around 100 requests per second for small models, higher for large ones.
The exception is LLMs and large vision models. A 7B-parameter language model running on CPU is not usable; it needs a GPU or a specialized inference accelerator. For those, the question is not "CPU or GPU" but "which GPU, batched how, with what quantization." Our default for self-hosted LLM serving is llama.cpp or vLLM on a cost-effective GPU tier (L4, A10, or consumer-class if the policy allows), with batching enabled and quantization applied where accuracy testing supports it.
Do not pay for H100s to serve a model that fits in 8 GB and runs fine on a $0.50/hour instance.
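For the vLLM path, a minimal sketch looks like the following. The model name, quantization choice, and memory setting are assumptions to replace with whatever your own accuracy and cost testing supports; the engine handles request batching internally.

```python
# Self-hosted LLM serving with vLLM: the engine handles batching; quantization
# is opt-in. Model name, quantization choice, and memory setting are assumptions
# to replace with whatever your own accuracy and cost testing supports.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",  # a pre-quantized checkpoint (illustrative)
    quantization="awq",
    gpu_memory_utilization=0.90,           # fits comfortably on an L4/A10-class card
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Summarize the incident report in two sentences."], params)
print(outputs[0].outputs[0].text)
```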
Highlight Four: Drift Detection Has to Be Built In on Day One
The model that was 94 percent accurate in validation is 78 percent accurate six months later, and nobody notices until a customer complains. This happens constantly and it is almost always preventable.
Build two things on day one. First, log every prediction with its input features and its output. Second, run a nightly job that compares the input distribution and the output distribution against a baseline (the training set, or the first week of production data). Alert when the distributions drift beyond a threshold. Retrain when the alerts fire.
The tools for this exist — Evidently, WhyLabs, Arize, and the cloud-native monitoring products all do a reasonable job — but the critical thing is not the tool. It is the decision to log every prediction from day one, so you have the data to look back at when something goes wrong. Storage is cheap. Debugging a drift problem with no historical data is not.
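As a sketch of the nightly comparison, the job below uses a plain two-sample Kolmogorov-Smirnov test from SciPy rather than any particular monitoring product; the paths, feature names, and p-value threshold are illustrative and should be tuned before they page anyone.

```python
# Nightly drift check: compare yesterday's logged inputs against a frozen
# baseline (the training set here) one feature at a time with a two-sample
# KS test. Paths, feature names, and the threshold are illustrative.
import pandas as pd
from scipy.stats import ks_2samp

DRIFT_P_VALUE = 0.01
FEATURES = ["amount_log", "days_since_signup", "session_length"]

baseline = pd.read_parquet("s3://ml-logs/baseline/training_features.parquet")
recent = pd.read_parquet("s3://ml-logs/predictions/latest_day/")

drifted = []
for col in FEATURES:
    stat, p_value = ks_2samp(baseline[col].dropna(), recent[col].dropna())
    if p_value < DRIFT_P_VALUE:
        drifted.append((col, round(stat, 3)))

if drifted:
    # Hook this into whatever alerting you already run (PagerDuty, Slack, email).
    print(f"Input drift detected: {drifted}")
```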
The Takeaway
Deploying an ML model on cloud infrastructure is a solved problem in the sense that the tools exist and work. It is an unsolved problem in the sense that most teams do not set up the full shape — serving, features, registry, observability, retraining — and so the model either never ships or dies quietly within a year.
If you are starting a new ML deployment, our advice is: get all five pieces in place before you launch, even if the model itself is trivial. A trivial model with a full deployment shape around it is infinitely more useful than a sophisticated model in a notebook nobody can reproduce. And when the model does need to get better, the infrastructure is already there to support the next version.