Machine Learning in the Cloud: Eight Insights from Real Production Models
Running ML in the cloud is not the same thing as doing ML research. Here are eight lessons from putting models in front of actual users.

There is a large and growing gap between "we trained a model" and "we run a model in production that customers depend on." Most of the ML content you will read online is about the first thing. This post is about the second. These are the lessons we've picked up running machine learning workloads — classical, deep learning, and more recently LLM-backed services — on cloud infrastructure for real customers.
1. The training bill is usually not the problem. Inference is.
Judging by LinkedIn, you would think the entire cost of ML is the training run. For a research lab, sure. For a production service, the training run is a one-time expense and the inference bill is the one you pay every month, forever. A model that took $5,000 to train and costs $15,000 a month to serve in production is a cost problem at the serving layer, not the training layer. Optimize accordingly.
The best inference cost reductions come from the boring stuff: right-sized instances, batching requests, caching responses, serving smaller models, and not calling the model at all when a simpler heuristic will do. Before you buy bigger GPUs, check whether you actually need to run inference for every request.
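As a sketch of what that serving-layer gate can look like — the heuristic, the model call, and the in-memory cache below are all placeholders for whatever your stack actually uses (a rules engine, Redis, a hosted endpoint):

```python
import hashlib

# Illustrative gate: cheapest answer first, model call last.
_cache: dict[str, str] = {}

def simple_heuristic(text: str) -> str | None:
    # Placeholder: handle the easy cases without a model at all.
    if not text.strip():
        return ""
    return None

def run_model(text: str) -> str:
    # Placeholder for the actual (expensive) inference call.
    return f"model output for: {text[:40]}"

def handle_request(text: str) -> str:
    key = hashlib.sha256(text.encode()).hexdigest()
    if key in _cache:
        return _cache[key]              # cache hit: zero inference cost
    shortcut = simple_heuristic(text)
    if shortcut is not None:
        return shortcut                 # heuristic hit: still zero inference cost
    answer = run_model(text)            # only now do we pay for the GPU
    _cache[key] = answer
    return answer
```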
2. "Just use a hosted API" is the right answer more often than ML engineers want to admit
For a large class of problems, calling OpenAI, Anthropic, or a comparable hosted model is the correct architecture. You get state-of-the-art quality with zero infrastructure work. The only reasons to self-host are cost at scale, latency requirements the hosted APIs can't meet, data sensitivity, or a use case that genuinely needs a custom fine-tune.
Most organizations reach for self-hosting before they have a reason to, burn three months on infrastructure, and end up with a worse model that costs more. The rule we use: start with a hosted API, measure the actual cost and latency at your production volume, and only self-host if the numbers force you to. The tipping point is usually somewhere north of $10,000 per month in API spend, and even then it's close.
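A back-of-the-envelope version of that comparison, with illustrative numbers rather than anyone's real pricing:

```python
# Break-even sketch: hosted API spend vs. self-hosting at your measured volume.
# Every number below is an assumption -- substitute your own traffic and quotes.

requests_per_month = 2_000_000
tokens_per_request = 1_500                 # prompt + completion, measured
hosted_price_per_1k_tokens = 0.002         # assumed blended $/1K tokens

hosted_monthly = requests_per_month * tokens_per_request / 1_000 * hosted_price_per_1k_tokens

gpu_instances = 2
gpu_hourly_rate = 4.00                     # assumed on-demand $/hour per instance
ops_hours_per_month = 40                   # someone has to run the thing
ops_hourly_cost = 100

self_hosted_monthly = gpu_instances * gpu_hourly_rate * 730 + ops_hours_per_month * ops_hourly_cost

print(f"hosted API:  ${hosted_monthly:,.0f}/month")      # ~$6,000 with these inputs
print(f"self-hosted: ${self_hosted_monthly:,.0f}/month")  # ~$9,840 with these inputs
```

With these particular assumptions the hosted API still wins, which is the usual result below that $10,000-a-month range.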
3. GPU availability is a planning problem
If your production plan depends on H100s or A100s, you need to know where they are and whether you can actually get them. Azure has been tight for two years. AWS has gotten better. GCP is usually the best of the three for high-end GPUs. Specialty providers — Lambda, CoreWeave, RunPod — often have capacity the hyperscalers don't, at better prices, but with less enterprise polish.
We have had projects delayed by months because a customer assumed they could spin up eight A100s in their preferred region and couldn't. Check availability before you commit to an architecture.
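One cheap sanity check on AWS (the other providers have equivalent catalogs): ask the region whether it offers the instance type at all. This checks the catalog, not live capacity or your account quota, so you still need a quota request and a test launch before you commit.

```python
import boto3

# Does the region even list the GPU instance types we need?
# Catalog check only -- says nothing about live capacity or account quotas.
ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.describe_instance_type_offerings(
    LocationType="availability-zone",
    Filters=[{"Name": "instance-type", "Values": ["p4d.24xlarge", "p5.48xlarge"]}],
)

for offering in resp["InstanceTypeOfferings"]:
    print(f'{offering["InstanceType"]} is offered in {offering["Location"]}')
```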
4. Model weights are a data problem, and data problems are boring
A 70B parameter model is 140 GB in fp16. A quantized version is 40 to 80 GB. Moving that much data around is a real operational concern. Pulling it from object storage on every container start adds minutes to your cold start. Baking it into a container image makes the image enormous. Caching it on a shared volume means you need a shared volume that is actually fast.
The most common failure mode we see is teams treating model weights like code and then being surprised when their deploys take 20 minutes. Weights are data, not code. Version them, cache them, and plan the distribution carefully.
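A minimal sketch of the "weights are data" pattern: versioned weights in object storage, pulled once onto a shared fast volume instead of on every container start. The bucket, paths, and use of the aws CLI here are assumptions.

```python
import os
import subprocess

WEIGHTS_BUCKET = "s3://example-models"      # hypothetical bucket
CACHE_ROOT = "/mnt/model-cache"             # shared volume that is actually fast

def ensure_weights(model_name: str, version: str) -> str:
    """Return a local path to the versioned weights, downloading only on a cache miss."""
    local_dir = os.path.join(CACHE_ROOT, model_name, version)
    marker = os.path.join(local_dir, ".complete")

    if os.path.exists(marker):
        return local_dir                    # cache hit: no multi-minute download

    os.makedirs(local_dir, exist_ok=True)
    subprocess.run(
        ["aws", "s3", "sync", f"{WEIGHTS_BUCKET}/{model_name}/{version}/", local_dir],
        check=True,
    )
    open(marker, "w").close()               # mark the version as fully downloaded
    return local_dir
```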
5. Eval is the work
Anyone can get a model to produce output. The hard part is knowing whether the output is good. Every production ML system we've helped build has eventually needed an evaluation suite: a fixed set of inputs, expected properties of the outputs, and a way to measure drift when the model or the prompt changes.
Teams that skip eval end up in a world where they can't tell whether a prompt tweak made things better or worse, and they're afraid to change anything. Teams that invest in eval can iterate confidently because they know what "better" means. This is the unglamorous work that separates real ML engineering from prompt-hacking.
Classical ML has the same lesson with slightly different vocabulary: your validation set is the most important artifact in the project, not the model.
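The smallest eval suite that still counts is a fixed list of cases with a property check per case, tracked in version control next to the prompt. The cases below are illustrative, and `generate` is a placeholder for whatever produces output in your system:

```python
# Minimal eval harness: fixed inputs, a property check per case, one score.
EVAL_CASES = [
    {"input": "Summarize: the invoice is 30 days overdue and unpaid.",
     "check": lambda out: "overdue" in out.lower() and len(out) < 200},
    {"input": "Extract the total from: 'Total due: $1,250.00'",
     "check": lambda out: "1,250" in out or "1250" in out},
]

def run_eval(generate) -> float:
    passed = sum(1 for case in EVAL_CASES if case["check"](generate(case["input"])))
    score = passed / len(EVAL_CASES)
    print(f"{passed}/{len(EVAL_CASES)} cases passed ({score:.0%})")
    return score
```

Run it on every prompt or model change; if the score drops, you find out before your users do.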
6. Observability for ML is different from observability for services
A regular service has latency, error rate, and saturation. A machine learning service has those, plus: input distribution drift, prediction distribution drift, confidence score shifts, and user-feedback signals. If you only monitor the first three, you won't notice when your model starts producing worse results because the world changed underneath it.
We have seen fraud detection models silently degrade over six months because nobody was monitoring the input distribution, and by the time the fraud team noticed, the false negative rate had doubled. The model was still serving fine — it was just wrong.
Plan your ML observability the same way you plan service observability: up front, as a first-class concern.
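As a sketch of what the distribution checks look like — the test, threshold, and windowing below are assumptions; the point is that something computes this on a schedule:

```python
import numpy as np
from scipy.stats import ks_2samp

def check_drift(reference: np.ndarray, current: np.ndarray, alpha: float = 0.01) -> bool:
    """Two-sample KS test between a reference window and the current window."""
    stat, p_value = ks_2samp(reference, current)
    drifted = bool(p_value < alpha)
    print(f"KS statistic={stat:.3f}, p={p_value:.4f}, drifted={drifted}")
    return drifted

# Run the same check on prediction scores, not just inputs: a fraud model whose
# score distribution quietly shifts is exactly the failure mode described above.
```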
7. Retrieval-augmented generation is usually better than fine-tuning
For most business use cases involving LLMs, retrieval-augmented generation (RAG) beats fine-tuning. It's cheaper, easier to iterate on, and keeps your domain knowledge out of a model you don't control. Fine-tuning has its place — tasks that need specific output formats, classification with proprietary labels, style transfer — but it's a much bigger commitment than most teams realize.
The pattern we recommend: start with good retrieval, good prompting, and a capable off-the-shelf model. Only fine-tune when you've hit a clear wall that retrieval can't solve. Most projects never hit that wall.
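The skeleton is small enough to fit in one function: retrieve, put the context in the prompt, call the model. The keyword-overlap retriever below is deliberately naive (a real system would use an embedding index), and `call_llm` stands in for your hosted-API client:

```python
# Bare-bones RAG sketch. DOCUMENTS and the retriever are illustrative only.
DOCUMENTS = [
    "Refunds are processed within 5 business days of approval.",
    "Enterprise plans include a 99.9% uptime SLA.",
    "Support tickets are answered within 24 hours on weekdays.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    q_terms = set(question.lower().split())
    ranked = sorted(DOCUMENTS, key=lambda d: -len(q_terms & set(d.lower().split())))
    return ranked[:k]

def answer(question: str, call_llm) -> str:
    context = "\n".join(retrieve(question))
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)   # call_llm: your hosted-API client, passed in
```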
8. Cost attribution is harder than it looks
On a regular cloud workload, you tag resources by project and you get a clean bill. On an ML workload, the costs are spread across training jobs, inference endpoints, feature store reads, data pipeline runs, experiment tracking, and storage for model artifacts. Getting a clean "how much does Model X cost us per month" number requires discipline.
The teams that do this well tag every resource at creation time, attribute API-layer costs by request type, and produce a monthly rollup per model. The teams that don't do it well discover one quarter that the ML team is the single largest cost center in the company and nobody can explain why.
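On AWS, the monthly rollup itself is one Cost Explorer query once the tags exist; the tag key and dates below are assumptions, and the hard part is enforcing the tag at creation time, not writing the query:

```python
import boto3

# Monthly cost per model, grouped by a "model" cost allocation tag.
# Assumes every ML resource was tagged at creation and the tag is activated
# for cost allocation in the billing console.
ce = boto3.client("ce")

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-05-01", "End": "2024-06-01"},   # illustrative month
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "model"}],
)

for group in resp["ResultsByTime"][0]["Groups"]:
    tag = group["Keys"][0]                                      # e.g. "model$fraud-v3"
    cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{tag}: ${cost:,.2f}")
```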
What we'd actually do for a new customer
For a company just starting to put ML into production, the pattern we recommend looks like this:
- Start with hosted APIs where possible. Measure cost and latency.
- Build an eval suite before you ship anything to production.
- Monitor input and output distributions, not just service metrics.
- Keep fine-tuning as a last resort, not a first instinct.
- Plan for GPU availability before committing to a self-hosted architecture.
- Tag every ML resource from day one.
None of this is exciting, but exciting is not what makes ML work in production. Boring and disciplined is what makes it work.
Three Takeaways
- Inference is the cost line that matters. Optimize serving before you optimize training.
- Eval is the work. If you can't measure "better," you can't get better.
- Start with hosted APIs. Self-host only when the numbers force you to.
Talk with us about your infrastructure
Schedule a consultation with a solutions architect.