Cloud Data Warehousing: When the Numbers Stop Adding Up

Six real advantages of cloud data warehousing, and the point where the economics flip and a self-hosted stack starts looking reasonable again.

John Lane 2023-02-16 6 min read

Cloud data warehouses — Snowflake, BigQuery, Redshift, Synapse, Databricks SQL — are the default answer for analytics in 2023, and for most organizations that default is correct. Cloud warehousing genuinely does things that on-prem warehousing could not do. But the economics have a crossover point, and once you cross it the bill gets large enough that a self-hosted stack becomes reasonable again. This is what 23 years of watching data platforms evolve looks like when we are being honest about the tradeoffs.

Advantage 1: Separation of Storage and Compute

The real innovation of modern cloud warehouses is not that they run in the cloud. It is that storage and compute are decoupled. You can store 50 terabytes of data for about $1,000 per month and query it with a warehouse that only runs during business hours. The same setup on a traditional MPP appliance required buying enough nodes to store all the data plus enough headroom for the query workload, whether or not you were using it.

For any organization with spiky query patterns — quarterly reports, month-end close, Monday morning dashboards — this architecture is a genuine improvement. You pay for compute when you use it and storage when you need it, and the two are not tied together.
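In Snowflake, for example, the pay-only-while-running model can be made concrete with an auto-suspending virtual warehouse. This is a sketch using standard Snowflake DDL; the warehouse name and size are illustrative:

```sql
-- Compute that exists only while queries run: suspend after
-- 60 idle seconds, wake automatically on the next query.
CREATE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE      = 'MEDIUM'
  AUTO_SUSPEND        = 60     -- seconds idle before suspending
  AUTO_RESUME         = TRUE
  INITIALLY_SUSPENDED = TRUE;
```

Storage billing continues around the clock, but the compute meter stops the moment the warehouse suspends.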

Advantage 2: Schema Evolution Without the Fear

Adding a column to a table in a traditional warehouse was a planned event. It required downtime, a DBA, and sometimes a full rebuild of downstream materialized views. In a modern cloud warehouse, schema evolution is an ALTER TABLE statement that runs in seconds regardless of table size, because the underlying storage is columnar and append-friendly.

This is a small-sounding benefit that turns out to be enormous in practice. Teams that used to batch schema changes into quarterly releases now ship them daily. The data model evolves with the business instead of being a dead weight the business has to work around.
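The whole ceremony reduces to something like the following (Snowflake-style syntax; the table and column names are made up):

```sql
-- Metadata-only change: no rewrite of existing data files.
-- Existing rows simply read the new column as NULL.
ALTER TABLE orders ADD COLUMN discount_code VARCHAR;
```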

Advantage 3: Semi-Structured Data Is a First-Class Citizen

Snowflake's VARIANT type, BigQuery's JSON type, Redshift's SUPER type — all of them let you query JSON without unpacking it into a relational schema first. This means event streams, API payloads, and log data can land in the warehouse and be queryable the same day, instead of requiring a two-sprint ingestion project.

The tradeoff is that queries against semi-structured data are slower than queries against fully normalized tables, and the optimizer has less to work with. You want to flatten hot columns eventually. But the ability to land raw data and figure out the schema later is a workflow improvement that traditional warehouses never offered.
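As a sketch of what "queryable the same day" looks like, here is a query against a raw JSON column using Snowflake's notation (the `:path` accessor and `::type` cast; BigQuery and Redshift have equivalents). Table and field names are hypothetical:

```sql
-- Query raw JSON events without defining a schema first.
-- payload is a VARIANT column holding the original event JSON.
SELECT
    payload:user_id::STRING    AS user_id,
    payload:event.name::STRING AS event_name,
    COUNT(*)                   AS events
FROM raw_events
WHERE payload:event.name::STRING = 'checkout'
GROUP BY 1, 2;
```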

Advantage 4: Concurrency Without the Planning

On an MPP appliance, concurrency was a fixed resource. You had a pool of slots, and when the pool was full, new queries waited. Cloud warehouses handle concurrency by spinning up more compute on demand, either through virtual warehouses in Snowflake, slot autoscaling in BigQuery, or concurrency scaling clusters in Redshift.

The practical effect is that 40 analysts hitting the dashboard at once on a Monday morning no longer degrade the overnight ETL jobs, because the two workloads run on separate compute. That isolation was simply not possible on older shared-resource architectures, and it removes an entire category of operational pain.
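In Snowflake terms, workload isolation plus elastic concurrency looks roughly like this (illustrative names; multi-cluster warehouses require the Enterprise edition):

```sql
-- Separate compute for ETL and BI: neither can starve the other.
CREATE WAREHOUSE etl_wh WAREHOUSE_SIZE = 'LARGE';

-- Multi-cluster warehouse: adds clusters as analyst concurrency
-- rises in the morning and removes them as it falls.
CREATE WAREHOUSE bi_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY    = 'STANDARD';
```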

Advantage 5: Time Travel and Zero-Copy Cloning

Every modern warehouse lets you query a table as of a specific timestamp in the past — 1 hour ago, 1 day ago, sometimes up to 90 days ago. And every modern warehouse lets you clone a multi-terabyte table in seconds without copying the underlying data, because the clone is a metadata operation on top of the same storage.

These two features change how data engineering work gets done. Testing a migration script against a real dataset used to require provisioning a new environment and copying the data overnight. Now it is a single command that runs in 30 seconds. Recovering from an accidental DELETE used to mean restoring a backup. Now it is a single query that selects from the table as it existed 10 minutes ago and writes the rows back.

The productivity gain here is larger than any other single feature and it is the thing self-hosted alternatives have the hardest time replicating.
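Both workflows described above are one-liners in Snowflake's syntax (shown here as a sketch; `orders` and `order_id` are hypothetical):

```sql
-- Zero-copy clone: a metadata operation, seconds even on terabytes.
CREATE TABLE orders_dev CLONE orders;

-- Time travel: the table as it existed 10 minutes (600 s) ago.
SELECT * FROM orders AT (OFFSET => -600);

-- Recover rows lost to an accidental DELETE by reinserting
-- anything present 10 minutes ago but missing now.
INSERT INTO orders
SELECT * FROM orders AT (OFFSET => -600)
WHERE order_id NOT IN (SELECT order_id FROM orders);
```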

Advantage 6: Ecosystem and Integration Gravity

The sixth advantage is not technical. It is that every modern BI tool, every ETL vendor, every data catalog, and every reverse-ETL product has a native connector for Snowflake and BigQuery and usually nothing else as a first-class option. Choosing a cloud warehouse means you can plug in Looker, Tableau, dbt, Fivetran, Census, Hightouch, and a dozen other tools with a connection string and an OAuth flow.

Choosing a self-hosted warehouse — ClickHouse, StarRocks, DuckDB at scale — means you will be writing glue code to integrate the same tools, and some of them will not work at all. The cost of the ecosystem gap is real and usually larger than the cost savings on compute.

Where the Numbers Stop Adding Up

Here is the part that nobody talks about. Snowflake and BigQuery are cheap at small and medium scale and expensive at large scale. The crossover point, in our experience, sits around $30,000 to $50,000 per month in warehouse spend. Below that, the operational savings of managed infrastructure dominate the cost calculation, and paying for compute-by-the-second is almost always the right answer.

Above that, the math starts to favor self-hosted alternatives. A ClickHouse cluster on dedicated hardware can deliver the same analytical workload for 30 to 50 percent of the cloud warehouse bill, and at $500,000 per year in savings the headcount cost of running it yourself stops being prohibitive. You need an engineer who understands the stack, but you were going to need a data engineer either way.

Three workload patterns tip the math:

  1. Sustained, high-concurrency BI on stable data, with no usage spikes for elastic compute to optimize around.
  2. Append-heavy time-series analytics, where ClickHouse is genuinely better.
  3. Anything that moves 10+ terabytes in and out of the warehouse per day, where cloud egress fees get painful.

If your warehouse bill is under $20,000 per month, stop reading this section and stay on Snowflake or BigQuery. The operational overhead of a self-hosted warehouse is real and you are not saving enough to justify it.

What We Actually Recommend

For most mid-market customers, we recommend starting on Snowflake or BigQuery, watching the monthly bill carefully, and revisiting the architecture when warehouse compute spend crosses $30,000 per month. Below that threshold, any engineering time spent on warehouse cost optimization is time that should have been spent on business value. Above it, the ROI flips and self-hosted alternatives start to deserve a serious look.
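"Watching the monthly bill" does not have to mean waiting for the invoice. On Snowflake, the built-in metering view gives you warehouse-level spend directly (multiply credits by your contracted credit price):

```sql
-- Month-over-month compute consumption per warehouse.
SELECT
    DATE_TRUNC('month', start_time) AS month,
    warehouse_name,
    SUM(credits_used)               AS credits
FROM snowflake.account_usage.warehouse_metering_history
GROUP BY 1, 2
ORDER BY 1 DESC, 3 DESC;
```

BigQuery exposes the equivalent through its `INFORMATION_SCHEMA` jobs views; either way, put the number on a dashboard and alert when it trends toward the threshold.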

The other thing we recommend is to aggressively separate raw data from modeled data. Land everything raw. Model only what the business actually queries. Most warehouse bills are large because every department ingested data "just in case" and never threw anything away. The cheapest terabyte is the one you did not store.

Three Takeaways

  1. Cloud warehouses win at small and medium scale because of the ecosystem, not the compute. The connectors, BI integration, and managed operations are the real value.
  2. The economics flip somewhere around $30,000 to $50,000 per month in warehouse compute. Above that, self-hosted alternatives become reasonable again.
  3. The cheapest terabyte is the one you did not store. Data retention discipline beats every other cost optimization.

Talk with us about your infrastructure

Schedule a consultation with a solutions architect.
