6 Ways Cloud Actually Enhances AI Development
GPU access, data pipelines, model serving, and cost optimization — how cloud enables real AI work, and when self-hosted makes more sense.

Cloud is essential for most AI development, yet there are more situations than ever where self-hosting makes sense. The honest framing: cloud solves specific problems at specific phases of the AI lifecycle, and the right answer for any given team is a mix. Here are the six ways cloud enhances AI work, with the tradeoffs for each and when you might want to run things yourself instead.
1. GPU Access Without Capital Commitment
The biggest reason to use cloud for AI development is GPU access. Buying GPUs has been difficult for three years running: lead times on H100s have been measured in quarters, on-prem deployments require facilities work, and consumer GPUs are barred from data center deployment by NVIDIA's driver EULA.
What cloud gets you:
- On-demand access to A100s, H100s, and newer generations
- Hourly billing instead of multi-year capital commitment
- Access to different GPU types for different workloads (T4 for inference, A100 for training, H100 for frontier work)
- Spot pricing for training workloads that can tolerate interruption
Reality check on availability: Even cloud GPUs are constrained. Try provisioning 8 H100s on-demand in us-east-1 and you'll hit capacity limits. Reserved capacity, spot instances, and specialty providers (Lambda Labs, CoreWeave, RunPod) often beat the hyperscalers on both availability and price.
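To make the spot option concrete, here's roughly what a one-time spot request looks like with boto3. The AMI ID and instance type are placeholders, not recommendations, and availability varies by region and day:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder deep-learning AMI
    InstanceType="p4d.24xlarge",      # 8x A100; swap for what's available
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            # One-time request: don't re-launch when reclaimed. The
            # training loop resumes from a checkpoint instead.
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
print(response["Instances"][0]["InstanceId"])
```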
When self-hosted wins:
- Sustained workloads with predictable GPU utilization
- Non-frontier models that run on consumer-class hardware (RTX 4090, Radeon AI PRO R9700)
- Data residency requirements that rule out cloud entirely
- Cost optimization at scale — a well-utilized owned GPU pays for itself in months vs cloud hourly rates
Our own llama-server deployment runs Qwen3.5-35B on a Radeon AI PRO R9700 at 48 tokens per second via the Vulkan backend — for roughly the cost of the hardware, we get the equivalent of tens of thousands of dollars of cloud inference per year. That's specifically because the workload is sustained.
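The break-even math is worth running against your own numbers. A back-of-envelope sketch, with illustrative figures rather than quotes:

```python
# Back-of-envelope break-even: owned GPU vs. cloud hourly rates.
# All numbers are illustrative assumptions, not vendor pricing.
hardware_cost = 1300.0         # one-time prosumer GPU purchase (USD)
cloud_rate = 3.50              # comparable cloud GPU, USD/hour on-demand
utilization_hours_per_day = 20 # sustained workload

daily_cloud_equivalent = cloud_rate * utilization_hours_per_day
breakeven_days = hardware_cost / daily_cloud_equivalent
print(f"Break-even after ~{breakeven_days:.0f} days "
      f"(~${daily_cloud_equivalent * 365:,.0f}/year of cloud equivalent)")
```

With these assumptions the hardware pays for itself in under a month; at low utilization the same math flips in cloud's favor, which is the whole point.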
2. Data Pipeline Tools Without Building Them
Training models requires pipelines — data ingestion, cleaning, transformation, versioning, feature engineering. Building these from scratch is a lot of engineering work. Cloud-managed services bypass most of it.
Tools worth using:
- Snowflake, BigQuery, Databricks for warehouse-style analytics and feature engineering at scale
- Dataflow, Dataproc, EMR for batch processing
- Kinesis, Event Hubs, Pub/Sub for streaming data
- Airflow (managed), Prefect, Dagster for orchestration (see the sketch after this list)
- Feature stores (Vertex AI Feature Store, SageMaker Feature Store, Tecton) for feature reuse
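As a taste of what orchestration buys you over bare scripts (retries, logging, scheduling), here's a minimal Prefect flow. The step functions and the source path are hypothetical stand-ins for your own pipeline:

```python
# Minimal orchestration sketch with Prefect. Step bodies are toy
# placeholders; the structure is what the tool provides.
from prefect import flow, task

@task(retries=3, retry_delay_seconds=60)
def ingest(source: str) -> list[dict]:
    # Pull raw records from wherever they live (API, bucket, database).
    return [{"text": "example", "label": 1}]

@task
def clean(records: list[dict]) -> list[dict]:
    # Drop empties, normalize fields, dedupe.
    return [r for r in records if r.get("text")]

@flow(log_prints=True)
def training_data_pipeline(source: str = "s3://my-bucket/raw/"):
    records = ingest(source)
    cleaned = clean(records)
    print(f"Prepared {len(cleaned)} records")
    return cleaned

if __name__ == "__main__":
    training_data_pipeline()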
When to roll your own:
- Small team, small data volumes — a couple of Python scripts might be enough
- Very specific performance requirements
- Budget constraints
The rule of thumb: if your data pipeline work is not a competitive differentiator, buy it. If it is, build it.
3. Managed Model Serving
Deploying a model for inference involves load balancing, autoscaling, observability, security, and rollback. Cloud managed services handle most of this.
Options:
- Azure ML Managed Endpoints, SageMaker Endpoints, Vertex AI Endpoints — cloud-native model serving (invocation sketch after this list)
- Hugging Face Inference Endpoints — model-agnostic, simpler API, good for open-weight models
- Modal, Replicate — serverless GPU inference, pay per request
- Self-hosted on Kubernetes with KServe, Seldon, or BentoML — more control, more ops burden
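For a sense of the calling convention, here's a hedged sketch of invoking a deployed SageMaker endpoint with boto3. The endpoint name and payload schema are placeholders; both depend on how the model was deployed:

```python
import json
import boto3

# The managed service handles routing, scaling, and auth; the client
# just posts a payload to a named endpoint.
runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

response = runtime.invoke_endpoint(
    EndpointName="my-classifier-endpoint",  # placeholder name
    ContentType="application/json",
    Body=json.dumps({"inputs": "The shipment arrived two weeks late."}),
)
prediction = json.loads(response["Body"].read())
print(prediction)
```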
Decision factors:
- Latency requirements (cloud APIs have more network hops than local)
- Cost at volume (managed services have premium pricing)
- Customization needs (batching, quantization, speculative decoding, custom kernels)
- Compliance (data residency, PHI, regulated workloads)
4. Experiment Tracking and Reproducibility
Cloud ML platforms ship experiment tracking that makes reproducible ML less painful, and third-party tools cover the same ground: MLflow, Weights & Biases, Vertex AI Experiments, and SageMaker Experiments all log hyperparameters, metrics, and artifacts with a few lines of setup.
Why it matters: six months after training a model that works, you will need to reproduce it or explain it. Without tracking, that's a research project. With tracking, it's a few clicks.
This is one of the cheapest wins in ML ops. Turn it on from day one even if you're not using anything else from the cloud ML platform.
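A minimal MLflow example, assuming a tracking server at a placeholder URI:

```python
import mlflow

# Point the client at your tracking server; the URI here is a placeholder.
# Managed platforms (Databricks, SageMaker, Vertex AI) speak the same API.
mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("churn-model")

with mlflow.start_run():
    mlflow.log_params({"lr": 3e-4, "epochs": 10, "batch_size": 64})
    for epoch in range(10):
        val_loss = 1.0 / (epoch + 1)  # stand-in for a real validation loss
        mlflow.log_metric("val_loss", val_loss, step=epoch)
    # mlflow.log_artifact("model.safetensors")  # attach any file on disk
```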
5. Distributed Training for Frontier Work
Training large models requires distributed infrastructure: multiple GPUs across multiple nodes with high-speed interconnect. Cloud vendors have the fabric (InfiniBand or RoCE between nodes, NVLink within them) that most on-prem deployments don't.
Patterns:
- Data parallelism (each GPU has a copy of the model, different data batches)
- Model parallelism (the model is split across GPUs)
- Pipeline parallelism (different layers on different GPUs)
- FSDP / ZeRO for memory-efficient training
Tools: PyTorch DDP, DeepSpeed, FSDP, Megatron. These work on-prem but are easier to scale in cloud because the network is already fast and the node provisioning is handled.
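Here's what the data-parallel pattern looks like as a minimal PyTorch DDP skeleton, with a stand-in model and synthetic batches; a real loop would add a DistributedSampler and checkpointing:

```python
# Minimal data-parallel training with PyTorch DDP. Launch with:
#   torchrun --nproc_per_node=8 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")  # torchrun sets rank/world size
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 10).cuda(local_rank)  # stand-in model
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(100):
    x = torch.randn(64, 1024, device=f"cuda:{local_rank}")  # stand-in batch
    loss = model(x).sum()
    loss.backward()  # gradients are all-reduced across ranks here
    optimizer.step()
    optimizer.zero_grad()

dist.destroy_process_group()
```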
When not to bother: Training a LoRA adapter or fine-tuning a 7B model is usually single-node. Don't overbuild.
6. Cost Optimization via Spot and Reserved Capacity
Cloud ML is expensive at sticker price and significantly cheaper with the right discounts.
Levers:
- Spot instances for training: 60 to 90 percent discount. Requires checkpointing because instances can be reclaimed (see the sketch after this list).
- Reserved capacity for steady-state inference: 30 to 60 percent discount for 1-3 year commitments.
- Committed use discounts on GCP: similar structure to reserved.
- Savings plans on AWS: flexible commitment across compute types.
- Quantization: 8-bit or 4-bit inference dramatically cuts memory requirements, letting you use smaller, cheaper GPUs.
- Batching: Higher batch sizes at inference time improve throughput and reduce per-request cost.
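The checkpointing that spot training requires doesn't need to be fancy. A sketch assuming a PyTorch loop, where train_one_epoch is a placeholder for your own code:

```python
import os
import torch

CKPT = "checkpoint.pt"  # in practice, sync this to object storage too

def save_checkpoint(model, optimizer, epoch):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "epoch": epoch,
    }, CKPT)

def load_checkpoint(model, optimizer):
    # Resume where the last (possibly reclaimed) instance left off.
    if not os.path.exists(CKPT):
        return 0
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1

# In the training loop: checkpoint every epoch so a spot reclaim
# costs at most one epoch of work.
# start_epoch = load_checkpoint(model, optimizer)
# for epoch in range(start_epoch, num_epochs):
#     train_one_epoch(...)
#     save_checkpoint(model, optimizer, epoch)
```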
Numbers to know: A T4 GPU on-demand is roughly $0.35/hour. An A100 on-demand is roughly $3-4/hour. H100s are $5-10/hour. Spot cuts all of these by more than half. Reserved cuts another 30-50 percent on top. The difference between naive provisioning and optimized provisioning is often 5x or more.
Where Self-Hosted Open-Weight Models Fit
In 2025, the case for self-hosted open-weight models has gotten much stronger. Models like Qwen3, Llama 3.3, Mistral Large, and Gemma 3 run on consumer or prosumer hardware with quality that rivals cloud APIs for many tasks. For organizations with data residency constraints, sustained inference workloads, or tight unit economics, self-hosted is genuinely viable.
Our own infrastructure uses llama-server (llama.cpp's Vulkan backend) on a single 32GB Radeon GPU to serve Qwen3.5-35B for internal workloads. Speculative decoding isn't available for that architecture, but the 48 tok/s throughput is enough for real work. The cost is the hardware, once. The alternative would be tens of thousands of dollars per year in cloud API spend for the same workload.
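Because llama-server exposes an OpenAI-compatible API, clients need nothing special. The host, port, and model name below are placeholders for our setup:

```python
from openai import OpenAI

# llama-server serves whatever model it was launched with; the api_key
# is only checked if the server was started with one configured.
client = OpenAI(base_url="http://llama-server.internal:8080/v1",
                api_key="unused")

resp = client.chat.completions.create(
    model="qwen",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize this ticket: ..."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```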
When to try it:
- You have predictable inference volume
- Frontier-quality isn't required
- You can handle the ops burden
- You care about data residency or cost at scale
When not to:
- Bursty or low volume workloads
- You need frontier-class models (GPT-4 class, Claude Opus class)
- Your team doesn't have the time to manage it
What We'd Actually Do
For a team building an AI capability:
- Prototype on cloud APIs (Anthropic, OpenAI direct, or via Bedrock/Azure AI Foundry)
- Move to managed cloud services for production training and serving
- Experiment tracking from day one via MLflow or a cloud equivalent
- Add self-hosted open-weight models for workloads where they fit — classification, summarization, extraction
- Optimize cost via spot, reserved, quantization, and batching as volumes grow
Three Takeaways
- Cloud is the right answer for bursty AI work; self-hosted wins for sustained workloads. Know which side of the line your workload is on.
- Experiment tracking is the cheapest high-leverage tool in ML ops. Turn it on before anything else.
- Cost optimization on cloud GPU workloads is worth serious attention. The difference between naive and optimized is often 5x.
Talk with us about your infrastructure
Schedule a consultation with a solutions architect.