
6 Production-Ready AI Platforms Compared

Honest comparison of six cloud AI platforms we've actually used — cost per inference, vendor lock-in, and the tradeoffs nobody mentions in the marketing.

John Lane · 2021-11-04 · 5 min read

Every cloud vendor will tell you their AI platform is the best. We've shipped production workloads on all six of the ones covered here, and the honest answer is that each one is the best at something and terrible at something else. This is the comparison we wish existed when we were evaluating them.

The six platforms: Azure AI Foundry (formerly Azure OpenAI + ML Studio), AWS Bedrock + SageMaker, Google Vertex AI, Anthropic API direct, Hugging Face Inference Endpoints, and self-hosted via llama.cpp or vLLM.

Cost Per Inference

Raw cost is the easiest thing to compare and the most misleading metric. Here's where the six sit for a typical mid-size chat workload (think customer support assistant, 2000-token context, 500-token response):

  • Azure AI Foundry / AWS Bedrock / Vertex AI: All three resell the same frontier models (GPT-4o class, Claude Sonnet class, Gemini Pro class) at prices within 5 percent of each other. Cost won't be what separates them.
  • Anthropic API direct: Slightly cheaper than Bedrock for Claude models, but you lose the IAM integration and enterprise billing.
  • Hugging Face Inference Endpoints: Cheaper for open-weight models (Llama, Mistral, Qwen) at moderate scale. Breaks even with hyperscalers around 1M requests/month.
  • Self-hosted (llama.cpp, vLLM on your own GPU): Cheapest by a wide margin at sustained load, most expensive at light load. A single RTX 4090 or a Radeon AI PRO R9700 can serve a Qwen3 35B model at 40-50 tokens per second — our in-house setup runs that exact configuration and costs roughly $0.0001 per request vs $0.01+ on cloud APIs.

The key insight: cost per inference matters less than utilization. A cloud API you use 100 times a day costs less than a GPU you bought. A GPU you use 100,000 times a day costs less than the equivalent cloud API by 10x.
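
As a rough illustration of that utilization curve, here's a back-of-the-envelope sketch. The per-request price and daily GPU cost are assumptions for illustration, not quotes:

    # Rough break-even sketch: cloud API per-request price vs. an owned GPU.
    # All numbers are illustrative assumptions, not vendor quotes.
    CLOUD_COST_PER_REQUEST = 0.01   # ~$0.01/request on a frontier-model API
    GPU_COST_PER_DAY = 5.00         # amortized hardware + power for one GPU

    def cheaper_option(requests_per_day: int) -> str:
        cloud_daily = requests_per_day * CLOUD_COST_PER_REQUEST
        return "self-hosted GPU" if GPU_COST_PER_DAY < cloud_daily else "cloud API"

    # 100 requests/day -> $1/day on the API, the idle GPU loses.
    # 100,000 requests/day -> $1,000/day on the API, the GPU wins by a wide margin.
    print(cheaper_option(100))      # cloud API
    print(cheaper_option(100_000))  # self-hosted GPU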

Vendor Lock-In

This is the topic the vendors don't want to discuss.

Highest lock-in: Vertex AI and Azure AI Foundry. Custom model tuning, feature stores, pipelines, and prompt management are all proprietary. Moving off either takes months of re-engineering.

Medium lock-in: AWS Bedrock. Bedrock itself is a thin wrapper over multiple model providers, so switching models inside Bedrock is easy. Moving off Bedrock entirely is harder because of IAM, CloudWatch, and KMS integrations.

Low lock-in: Anthropic API direct, Hugging Face. Standard REST endpoints, portable prompts, no unique IAM surface.

No lock-in: self-hosted. The weights and the inference engine are yours. The tradeoff is operational burden.

If you're early in your AI journey and you don't know which model family you'll settle on, stay portable. Use Bedrock or a thin abstraction layer (LiteLLM, OpenRouter) so you can swap providers without rewriting your application.
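
Here's a minimal sketch of what that abstraction looks like with LiteLLM. The model identifiers are examples, not recommendations:

    # pip install litellm
    from litellm import completion

    # Same call shape regardless of provider; only the model string changes.
    messages = [{"role": "user", "content": "Summarize this ticket in one sentence."}]

    resp = completion(model="anthropic/claude-3-5-sonnet-20241022", messages=messages)
    # Later, swap to Bedrock (or Azure, or a local vLLM server) without touching call sites:
    # resp = completion(model="bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0", messages=messages)

    print(resp.choices[0].message.content)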

Feature Depth

Azure AI Foundry

Strongest on enterprise integrations — Entra ID auth, private endpoints, content safety, prompt flow. If you're already a Microsoft shop, this is the path of least resistance. The model catalog is solid (GPT-4 family, Llama, Mistral, Phi). Content filtering is strict by default, sometimes aggressively so.
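
As a rough sketch of the Entra ID path (endpoint, deployment name, and API version below are placeholders you'd replace with your own):

    # pip install openai azure-identity
    from azure.identity import DefaultAzureCredential, get_bearer_token_provider
    from openai import AzureOpenAI

    # Entra ID token auth instead of API keys; values below are placeholders.
    token_provider = get_bearer_token_provider(
        DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
    )
    client = AzureOpenAI(
        azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
        azure_ad_token_provider=token_provider,
        api_version="2024-06-01",  # check the current GA version for your region
    )

    resp = client.chat.completions.create(
        model="YOUR-DEPLOYMENT-NAME",  # the deployment name, not the base model name
        messages=[{"role": "user", "content": "Hello from Azure AI Foundry"}],
    )
    print(resp.choices[0].message.content)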

AWS Bedrock + SageMaker

Bedrock gives you multi-vendor models (Anthropic, Meta, Mistral, Cohere, Amazon). SageMaker gives you training and custom deployment. The combo is powerful but operationally complex — we've seen teams spend weeks just getting a SageMaker endpoint correctly wired up behind a private network. Fine for teams that already have AWS expertise.
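
A minimal Bedrock call through boto3's Converse API, assuming IAM credentials are already in place; the model ID is an example:

    # pip install boto3
    import boto3

    # Assumes AWS credentials/IAM are already configured; model ID is an example.
    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

    resp = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
        messages=[{"role": "user", "content": [{"text": "Classify this support ticket."}]}],
        inferenceConfig={"maxTokens": 500},
    )
    print(resp["output"]["message"]["content"][0]["text"])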

Google Vertex AI

Best-in-class for data pipeline integration (BigQuery, Dataflow) and for Gemini-family models. Vertex AI Search and Vertex AI Matching Engine are genuinely good for RAG workloads. Worst developer experience of the three hyperscalers in our opinion — the console is busy, the docs are scattered.
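
A minimal Vertex AI sketch for a Gemini call; project, region, and model name are placeholders:

    # pip install google-cloud-aiplatform
    import vertexai
    from vertexai.generative_models import GenerativeModel

    # Project, region, and model name below are placeholders/examples.
    vertexai.init(project="YOUR-PROJECT-ID", location="us-central1")

    model = GenerativeModel("gemini-1.5-pro")
    resp = model.generate_content("Summarize last quarter's BigQuery spend report.")
    print(resp.text)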

Anthropic API direct

Cleanest developer experience of the lot. If you only need Claude models and you don't need enterprise billing through a hyperscaler, go direct. Prompt caching, tool use, and computer use are released here first.
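
A minimal sketch with the official Python SDK; the model name is an example, not a recommendation:

    # pip install anthropic
    import anthropic

    # Reads ANTHROPIC_API_KEY from the environment; model name is an example.
    client = anthropic.Anthropic()

    resp = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": "Draft a reply to this customer."}],
    )
    print(resp.content[0].text)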

Hugging Face Inference Endpoints

The right answer for open-weight models at moderate scale. Pay per hour of dedicated GPU, not per token. Good for long-running workloads with predictable throughput. Weak on enterprise features — no private endpoints without the enterprise plan.
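
Calling a dedicated endpoint is a couple of lines with huggingface_hub; the endpoint URL and token below are placeholders:

    # pip install huggingface_hub
    from huggingface_hub import InferenceClient

    # Assumes a text-generation (TGI-backed) endpoint; URL and token are placeholders.
    client = InferenceClient(
        model="https://YOUR-ENDPOINT.endpoints.huggingface.cloud",
        token="hf_...",
    )

    out = client.text_generation("Extract the order ID from: ...", max_new_tokens=64)
    print(out)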

Self-hosted (llama.cpp, vLLM)

The right answer when data residency, cost at scale, or customization matters more than convenience. We run a Qwen3 35B MoE model on a 32GB Radeon AI PRO R9700 via llama.cpp with the Vulkan backend, hitting 48 tokens/second on a single GPU at a marginal cost of effectively zero once the hardware is paid for. Operational burden is real — you own updates, monitoring, and model lifecycle.
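
Both llama-server and vLLM expose an OpenAI-compatible API, so the client side stays boring. A sketch, assuming a server is already running locally (base URL and model name depend on how you launched it):

    # pip install openai -- works against llama-server or vLLM's OpenAI-compatible API
    from openai import OpenAI

    # No real key needed for a local server; base_url and port match your launch flags.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="local-model",  # llama-server largely ignores this; vLLM expects the served model name
        messages=[{"role": "user", "content": "Summarize this log excerpt."}],
    )
    print(resp.choices[0].message.content)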

The Tradeoffs Nobody Mentions

Latency to first token matters more than throughput. A streaming response that starts in 300ms feels snappier than one that starts in 2 seconds and finishes faster. Bedrock is noticeably slower to first token than Anthropic direct for the same Claude models. Why? Extra network hops through the AWS control plane.
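
Measuring time to first token yourself is cheap. A rough sketch against any OpenAI-compatible streaming endpoint; the model name is an example:

    # Rough time-to-first-token measurement against an OpenAI-compatible endpoint.
    import time
    from openai import OpenAI

    client = OpenAI()  # or point base_url at your provider via a shim

    start = time.monotonic()
    first_token_at = None
    stream = client.chat.completions.create(
        model="gpt-4o",  # example model name
        messages=[{"role": "user", "content": "Explain vector search in two sentences."}],
        stream=True,
    )
    for chunk in stream:
        if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
            first_token_at = time.monotonic()
    print(f"TTFT: {first_token_at - start:.2f}s, total: {time.monotonic() - start:.2f}s")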

Rate limits are the real bottleneck. The published rate limits are not the limits you'll hit. Azure OpenAI quotas vary by region and by model deployment. Bedrock throttles on per-model tokens-per-minute (TPM) limits. Plan for 30-50% of published limits in production.
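
In practice that means treating throttling as normal operation. A sketch of exponential backoff with jitter; the retry counts and delays are starting points to tune, not recommendations:

    # Exponential backoff with jitter for throttled (HTTP 429) responses.
    import random
    import time

    def call_with_backoff(fn, max_retries=5, base_delay=1.0):
        for attempt in range(max_retries):
            try:
                return fn()
            except Exception:  # narrow this to your SDK's throttling exception
                if attempt == max_retries - 1:
                    raise
                delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
                time.sleep(delay)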

Model updates can break your app. A silent model version update can change outputs enough to break a brittle prompt. Pin model versions explicitly in production and test before you upgrade.
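
Concretely, that means config like this: dated, fully qualified model IDs (the IDs below are examples) rather than floating aliases.

    # Pin dated model versions in config; avoid floating aliases in production.
    MODELS = {
        "support_chat": "claude-3-5-sonnet-20241022",   # pinned, example ID
        "classification": "gpt-4o-2024-08-06",          # pinned, example ID
        # "support_chat": "claude-3-5-sonnet-latest",   # floating alias -- avoid
    }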

What We'd Actually Do

For a company building an AI-powered feature for the first time:

  1. Prototype on Anthropic or OpenAI direct. Fastest path to "does this work at all."
  2. Move to Bedrock or Azure AI Foundry for production if you need enterprise billing, compliance, or IAM integration.
  3. Add a self-hosted open-weight model for workloads where cost per inference matters more than frontier quality — classification, summarization, extraction. These workloads often run fine on a 7B or 35B open model.
  4. Abstract the provider. Use LiteLLM or a similar shim so you can move.
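
Put together, the routing layer can stay small. A sketch (task names, model strings, and the local server URL are illustrative) that sends cheap structured work to a local open-weight model and open-ended work to a frontier API via LiteLLM:

    # Route cheap structured tasks to a local open-weight model, the rest to a frontier API.
    # Task names, model strings, and the local URL are illustrative.
    from litellm import completion

    ROUTES = {
        "classify":  "openai/local-qwen",                     # served by a local vLLM/llama-server
        "extract":   "openai/local-qwen",
        "assistant": "anthropic/claude-3-5-sonnet-20241022",  # frontier model for open-ended work
    }

    def run(task: str, prompt: str):
        kwargs = {}
        if ROUTES[task].startswith("openai/local"):
            kwargs["api_base"] = "http://localhost:8000/v1"  # wherever the local server listens
            kwargs["api_key"] = "not-needed"                 # local server, no real key
        return completion(model=ROUTES[task], messages=[{"role": "user", "content": prompt}], **kwargs)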

Three Takeaways

  1. Cost per inference is a distraction at low volume and decisive at high volume. Know which side of the curve you're on before optimizing.
  2. Vendor lock-in is real on the platform features, not the models themselves. The more you use prompt flows, feature stores, and managed pipelines, the harder it is to leave.
  3. Self-hosted open-weight models have become genuinely viable in 2025. A single consumer GPU can replace a surprising percentage of cloud API calls for non-frontier tasks.

Talk with us about your infrastructure

Schedule a consultation with a solutions architect.
