Big Data Analytics in the Cloud: Three Secrets the Vendors Don't Lead With

The cloud is genuinely good at big data analytics. It's also genuinely expensive, and the vendors bury the reasons why. Here are three things to know before you commit.

John Lane · 2023-12-16 · 5 min read

Cloud analytics platforms are one of the areas where the cloud is genuinely, unambiguously better than on-prem for most customers. Spinning up a Snowflake warehouse or a BigQuery dataset and getting productive in an afternoon is a superpower that on-prem Hadoop clusters never offered. We are not anti-cloud on analytics — we have built plenty of customer analytics stacks on it. But the pricing, architecture, and governance gotchas are real, and the vendor sales motion does not lead with them. Here are three things to know before you commit.

1. The Pricing Model Punishes Careless Queries More Than Careless Storage

The traditional on-prem data warehouse charged you for hardware up front and then let you run whatever queries you wanted against it. The cloud data warehouse inverts this. Storage is cheap (on the order of tens of dollars per terabyte per month on Snowflake, BigQuery, or Redshift, and less still on colder object-storage tiers). Compute is where the bill lives.

This has a specific consequence that nobody warns new customers about: a single badly written query by a single junior analyst can cost thousands of dollars in minutes. A SELECT * against a 50 TB fact table with no partition pruning will scan the whole table and charge you for every byte. BigQuery's on-demand pricing is $6.25 per TB scanned. Snowflake bills by warehouse size and wall-clock time, so a large warehouse running a bad query for an hour can cost hundreds of dollars.
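
You can see the exposure before running anything: BigQuery will dry-run a query and report how many bytes it would scan without executing it. A minimal sketch, assuming the google-cloud-bigquery client library and default credentials; the project, dataset, and table names are placeholders:

```python
from google.cloud import bigquery

ON_DEMAND_PRICE_PER_TB = 6.25  # USD, on-demand list price at time of writing

client = bigquery.Client()

# Dry run: BigQuery plans the query and reports bytes scanned without running it.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT * FROM `my_project.analytics.fact_orders`",  # hypothetical 50 TB fact table
    job_config=job_config,
)

tb_scanned = job.total_bytes_processed / 1024**4
print(f"Would scan {tb_scanned:,.1f} TB, roughly ${tb_scanned * ON_DEMAND_PRICE_PER_TB:,.2f} per run")
```

At that rate, the 50 TB full scan described above works out to roughly $312 every time someone hits run, and far more if the query is wired into a dashboard that refreshes on a schedule.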

The fix is not "train your analysts better," although that helps. The fix is technical: query cost controls, per-user query quotas, mandatory partition pruning policies, materialized views for common aggregations, and alerting when a single query exceeds a threshold. Every mature cloud analytics practice we have seen has these controls. Every immature one has a story about the month the bill was three times what they expected because of one developer testing a notebook on production data.
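
On BigQuery, the bluntest of those controls is a hard cap on bytes billed per query: anything that would exceed the cap fails with an error instead of running up the bill. A minimal sketch, again assuming the google-cloud-bigquery client; the 1 TB cap and table name are placeholder choices, not recommendations:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Any query that would bill more than ~1 TB fails instead of silently scanning everything.
ONE_TB = 1024**4
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=ONE_TB)

job = client.query(
    "SELECT customer_id, SUM(amount) AS total "
    "FROM `my_project.analytics.fact_orders` "        # hypothetical table
    "WHERE order_date >= '2023-12-01' "
    "GROUP BY customer_id",
    job_config=job_config,
)
rows = job.result()  # raises an error if the cap would be exceeded, rather than billing it
```

Snowflake's rough equivalents are resource monitors with credit quotas and statement timeouts on the warehouse, which kill the runaway query instead of the budget.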

Do not let the first month of an analytics workload run without cost controls in place. The cloud providers will happily charge you for mistakes you did not know you were making.

2. The Lakehouse Is Mostly Marketing, Except When It Isn't

"Lakehouse" is the architecture where you store your data in open formats (Parquet, Delta, Iceberg) on object storage, and then use a query engine (Databricks, Snowflake, Trino, Athena) to read it. The pitch is that you get the cheap storage of a data lake with the query performance of a data warehouse.

The pitch is mostly true, but the gap between the pitch and the reality depends on three specific things that the vendor decks do not highlight.

First, the table format matters enormously. Plain Parquet files in folders work for a while, but they do not support ACID transactions, schema evolution, or efficient point updates. Delta, Iceberg, and Hudi solve this but add complexity — you are now running a metastore, managing file compaction jobs, and worrying about small file problems. If you are not going to actually use the transactional features, you do not need a lakehouse. You need a data lake with a query engine.
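
If that sounds abstract, here is roughly what the extra moving parts look like with the deltalake (delta-rs) Python package; a minimal sketch assuming that package and pyarrow are installed, with a placeholder bucket path. The transactional append is one line, but the compaction and vacuum steps are jobs you now have to schedule and monitor:

```python
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

TABLE_URI = "s3://my-data-lake/events"  # hypothetical path

# ACID append: readers never see a partially written batch.
batch = pa.table({"event_id": [1, 2, 3], "amount": [9.99, 24.50, 3.00]})
write_deltalake(TABLE_URI, batch, mode="append")

# The operational cost you take on: periodic compaction of small files
# and cleanup of files no longer referenced by the transaction log.
dt = DeltaTable(TABLE_URI)
dt.optimize.compact()                            # rewrite many small files into fewer large ones
dt.vacuum(retention_hours=168, dry_run=False)    # delete unreferenced files older than 7 days
```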

Second, the query engine matters. Trino and Presto are great for interactive queries. Spark is great for batch ETL and machine learning. Snowflake and BigQuery give you warehouse performance but cost warehouse prices. Picking one engine to do all three jobs is a compromise. Most mature architectures use two or three.

Third, the governance story is incomplete. Fine-grained access control across a lakehouse spanning multiple query engines is still hard. Unity Catalog, Lake Formation, and the various open-source attempts are all works in progress. If your compliance story requires row-level security and audit logging, budget real engineering time to make it work.

The lakehouse is a legitimate architecture. It is not a shortcut. Treat it as a real platform investment, not a configuration choice.

3. Data Gravity Is Real and Expensive to Reverse

Once you put petabytes of data into a cloud analytics platform, the cost and effort to move it back out (to another cloud, to on-prem, or to a different analytics engine) is significant. Egress fees alone on multi-terabyte extracts from S3 run to roughly $90 per terabyte at list prices, before any negotiated discounts. Re-ingest and re-transform on the other side takes weeks. The data pipelines that feed the warehouse are coupled to the warehouse's APIs and SQL dialects in a hundred small ways that you will only discover when you try to port them.
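
The numbers are easy to sanity-check yourself. A back-of-the-envelope sketch; the $0.09 per GB figure is a typical list price for egress to the internet and will vary by provider, region, destination, and discounts, and the warehouse size is a made-up example:

```python
# Rough cost of moving a warehouse's worth of data out of a cloud provider.
EGRESS_PER_GB = 0.09   # USD, assumed list price; varies by provider and destination
data_tb = 500          # hypothetical warehouse size

egress_cost = data_tb * 1024 * EGRESS_PER_GB
print(f"Egress alone for {data_tb} TB: ~${egress_cost:,.0f}")
# ~$46,000 before you have re-ingested, re-transformed, or re-validated anything.
```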

This is not a reason to avoid cloud analytics. It is a reason to think carefully about which cloud and which platform before you commit. The specific questions we ask customers before they pick a platform are:

  • How portable is the storage format? Parquet on S3 or ADLS is portable. Proprietary formats inside a managed warehouse are not.
  • How portable is the SQL? Standard SQL with minimal vendor extensions is portable. Every Snowflake-specific function you use is another link in the lock-in chain.
  • How portable are the pipelines? dbt is portable across most warehouses. Vendor-specific ETL tools are not.
  • Where does the source data actually live? Pulling TBs of data across regions every day is expensive. The warehouse should be close to the source systems.

If you answer these questions up front and make deliberate choices, you can build an analytics stack that is cloud-native today and portable tomorrow. If you do not, you will be locked in within a year, and the lock-in will be invisible until you try to leave.

What We'd Actually Do

For a mid-market customer who is new to cloud analytics, we usually recommend starting with a managed warehouse (Snowflake or BigQuery) for the first eighteen months because the time-to-value is unbeatable and the team does not need to learn a dozen new tools at once. We put strict cost controls in place from day one, including per-user query quotas and daily spend alerts. We use dbt for transformations and keep the SQL as ANSI-compatible as we reasonably can.
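
The daily spend alert does not need a third-party tool on day one. A minimal sketch of the alerting half against BigQuery's INFORMATION_SCHEMA.JOBS_BY_PROJECT view, assuming the google-cloud-bigquery client and on-demand pricing; the region, threshold, and alert hook are placeholders:

```python
from google.cloud import bigquery

ON_DEMAND_PRICE_PER_TB = 6.25   # USD, on-demand list price at time of writing
DAILY_LIMIT_USD = 200           # hypothetical per-user daily threshold

client = bigquery.Client()

# Sum bytes billed per user over the last 24 hours.
query = """
    SELECT user_email, SUM(total_bytes_billed) AS bytes_billed
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
      AND job_type = 'QUERY'
    GROUP BY user_email
"""
for row in client.query(query).result():
    spend = (row.bytes_billed or 0) / 1024**4 * ON_DEMAND_PRICE_PER_TB
    if spend > DAILY_LIMIT_USD:
        # Replace the print with a real alerting hook (Slack, PagerDuty, email) in practice.
        print(f"ALERT: {row.user_email} billed ~${spend:,.2f} in the last 24 hours")
```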

Once the team is mature and the data volume justifies it (usually 10+ TB of active data and a clear ROI), we evaluate whether moving some of the workload onto Parquet/Iceberg on object storage with Trino or Databricks makes sense for cost and flexibility. For some customers it does. For others, the managed warehouse remains the right answer forever because the total cost of ownership including engineering time is lower.

Either way, the first principle is the same: cloud analytics is about compute cost, not storage cost. Control the queries and you control the bill.

Three Takeaways

  1. One bad query can cost thousands. Cost controls are day-one infrastructure, not a future phase.
  2. Lakehouse is a real architecture but not a shortcut. Budget real platform engineering, or stay with a managed warehouse until you can.
  3. Data gravity locks you in whether you notice or not. Pick portable formats, portable SQL, and portable pipelines unless you are certain you will never want to move.
