Big Data Analytics on Cloud: Where the Cost Curve Actually Bends
A practical look at the six decisions that determine whether your analytics platform costs $5K a month or $50K — with the same data and the same questions.

"Big data" is a phrase that has mostly outlived its usefulness. Today, a single PostgreSQL instance can handle what we called "big data" in 2014. What's actually hard in 2023 is building an analytics platform that answers the questions the business cares about at a cost that doesn't double every year. The technology choices are mostly solved. The cost curve is where teams still get hurt.
Here's where the money actually goes — and the decisions that bend the curve.
1. The Storage vs. Query Cost Split
Every modern data platform separates storage from compute. This is good architecturally and misleading financially. You pay separately for storing data and for running queries on it. The ratio between those costs is usually 1:10 or worse — storage is cheap, compute is expensive.
What this means in practice:
- S3 / Azure Blob / GCS is approximately $20-25/TB/month for standard tier. A 50 TB data lake costs $1,000-1,250/month to store.
- BigQuery, Athena, Snowflake, Redshift Serverless charge by scanned bytes or compute-second. A single poorly-written query on that same 50 TB lake can cost $250.
- One bad dashboard refreshing every 5 minutes against an un-partitioned table can cost more per month than your entire storage bill: at the roughly $5/TB-scanned rate implied above, a refresh every 5 minutes is about 8,600 runs a month, so even 1 TB scanned per refresh is over $40K against a storage bill of roughly $1,000.
The lesson: storage is not where you save money. Query patterns are.
2. Partition or Pay
Partitioning is the single biggest lever in analytics cost. An un-partitioned table scans the whole dataset for every query. A partitioned-by-date table scans only the relevant date range. The cost difference is often 50-100x for time-series workloads, which is most analytics workloads.
The rules we enforce (a BigQuery sketch of the first two follows the list):
- Partition every table by a time column. Day is usually right; hour if you have per-hour queries; month if you have years of data and rarely query recent data alone.
- Cluster / Z-order on the second-most-common filter column. Customer ID, region, product category — whatever your analysts filter on most.
- Compact small files aggressively. A table with 200,000 tiny Parquet files is slow and expensive. Run a compaction job weekly.
- Drop columns at write time that nobody ever queries. Parquet is columnar, but a SELECT * query still pays to scan every column you kept.
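As a concrete illustration of the first two rules, here is roughly what they look like in BigQuery SQL. The table and column names are made up for the example; Snowflake clustering keys or Delta/Iceberg partitioning and Z-ordering express the same idea.

```sql
-- Hypothetical events table, partitioned by day and clustered on customer_id.
CREATE TABLE analytics.events
PARTITION BY event_date
CLUSTER BY customer_id
AS SELECT * FROM analytics.events_staging;

-- A dashboard query that filters on the partition column scans one day, not 50 TB.
SELECT customer_id, COUNT(*) AS orders
FROM analytics.events
WHERE event_date = CURRENT_DATE()        -- partition filter: every other day is pruned
  AND event_type = 'order_placed'
GROUP BY customer_id;
```

The pruning only happens when the query filters on the partition column itself; a dashboard that filters on some derived expression, or skips the date filter entirely, quietly goes back to full scans.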
A customer we worked with was spending $18K/month on BigQuery before we looked at it. After partitioning three tables and rewriting five dashboards to hit the partition key, the bill was $4K. Same data, same questions, four months to pay for itself.
3. Choose the Right Engine for the Right Query
"Just use Snowflake" is a valid answer for a lot of use cases. It's also an expensive one if you use it for things that cheaper engines do better. The landscape is more nuanced than the vendor marketing suggests:
- Snowflake: the default for mixed analytics workloads where you want one tool, don't want to manage infrastructure, and can pay for convenience. Excellent for ad-hoc SQL across varied data.
- BigQuery: the cheapest per-TB-scanned for pure SQL analytics if your data is already in GCS. Serverless, no warehouse management. Can get expensive if scanned bytes aren't controlled.
- Redshift Serverless: AWS-native, good integration with the rest of the AWS data stack. Cost-effective if your ETL is in AWS already.
- Athena: pay-per-query on top of S3. Cheap for occasional queries, expensive for dashboards that refresh every five minutes.
- ClickHouse: the best choice for real-time analytics dashboards, event analytics, and anything with high query concurrency. Self-hosted or managed (ClickHouse Cloud, Altinity). Often 10x cheaper than Snowflake for the same workload if it fits ClickHouse's sweet spot.
- DuckDB: the right answer for single-user analytics on data up to a few hundred GB. Laptop-scale. Cheaper than any of the above for the right problem.
The mistake is using Snowflake for a problem DuckDB would solve, or using ClickHouse for a problem Snowflake handles better. Match the engine to the workload.
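The DuckDB point is worth making concrete, because "laptop-scale" is literal. A minimal sketch, with a made-up bucket path, using DuckDB's httpfs extension to query Parquet in S3 directly; credential setup is omitted and varies by DuckDB version.

```sql
-- DuckDB: query Parquet in S3 directly, no warehouse involved.
INSTALL httpfs;
LOAD httpfs;

SELECT region,
       COUNT(*)     AS events,
       SUM(revenue) AS revenue
FROM read_parquet('s3://example-data-lake/events/date=2023-06-*/*.parquet')
GROUP BY region
ORDER BY revenue DESC;
```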
4. ETL Is Where the Money Hides
The compute cost of your data pipeline is often larger than the compute cost of your queries. Every team focuses on query optimization and ignores the nightly ETL running on a 64-core cluster for 4 hours.
Patterns that actually cut ETL cost:
- Incremental, not full refresh. Process only yesterday's data instead of reprocessing the whole history. This should be the default, not an optimization (a dbt sketch follows this list).
- Push-down predicates. Filter before you join. Most query planners do this automatically if you write the SQL in the right shape.
- Spark only when you need Spark. A Python script with Pandas or DuckDB can process 50 GB in minutes on a 4-core box. Don't spin up an EMR cluster for it.
- Delete intermediate tables. Staging tables accumulate. Put TTL policies on them or they'll quietly grow your storage bill.
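Since dbt shows up in the recommended stack below, here is roughly what the incremental pattern looks like as a dbt model. The model, source, and column names are hypothetical, the partition_by form shown is the BigQuery adapter's, and the right lookback predicate depends on how late your data arrives.

```sql
-- models/fct_orders.sql: hypothetical incremental dbt model.
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'merge',
    unique_key = 'order_id',
    partition_by = {'field': 'order_date', 'data_type': 'date'}
) }}

SELECT
    order_id,
    customer_id,
    order_date,
    amount
FROM {{ source('raw', 'orders') }}

{% if is_incremental() %}
  -- On incremental runs, only process rows newer than what's already loaded.
  WHERE order_date >= (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
```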
5. Pre-Aggregate for Dashboards, Raw for Ad-Hoc
Dashboards and ad-hoc analysis have opposite cost profiles.
- A dashboard runs the same queries hundreds or thousands of times per day. Optimize it by pre-computing the answers: materialized views, summary tables, cube builds. The query the dashboard runs should scan kilobytes, not terabytes.
- Ad-hoc analysis runs unique queries once. You can't pre-compute for it. The cost discipline here is column pruning, partition filters, and teaching analysts to check the query cost estimate before running.
The classic mistake is running both workloads against the same raw tables, which means your dashboards are paying full-scan costs every five minutes and your analytics bill is a mystery nobody wants to investigate.
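Here is a minimal sketch of the pre-aggregation side on BigQuery, reusing the hypothetical table names from above. BigQuery materialized views restrict which expressions are allowed; a scheduled summary table does the same job anywhere if those restrictions get in the way.

```sql
-- Daily summary the dashboard actually queries, kept fresh by BigQuery.
CREATE MATERIALIZED VIEW analytics.daily_revenue AS
SELECT
  event_date,
  region,
  COUNT(*)     AS orders,
  SUM(revenue) AS revenue
FROM analytics.events
GROUP BY event_date, region;

-- The dashboard query scans kilobytes of summary, not terabytes of raw events.
SELECT event_date, region, revenue
FROM analytics.daily_revenue
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);
```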
6. Data Retention Is a Business Decision, Not a Technical One
Most organizations keep all their data forever because nobody has ever asked "what does it cost to keep this data?" The answer is usually more than they'd guess, and much more than the value of the 3-year-old rows nobody queries.
A reasonable retention policy:
- Raw event data: 90 days in hot storage, then tier to cheaper cold storage (S3 Glacier, Azure Archive) for compliance.
- Aggregated summaries: 3-5 years in hot storage. Small enough that cost is negligible.
- Customer-facing data (orders, accounts, audit logs): retention driven by legal / regulatory requirements, not storage cost.
This conversation is uncomfortable because it requires a business decision. Someone has to say "we don't need events from 2021." But it's the highest-leverage cost discussion in any data platform, and it's the one that rarely happens until the bill forces it.
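Once the decision is made, enforcement is mostly configuration. A sketch of the hot-storage window on the warehouse copy in BigQuery, with hypothetical table names; tiering the raw files in S3 or Blob to cold storage is a bucket lifecycle rule rather than SQL.

```sql
-- Raw events: partitions older than 90 days are dropped automatically.
ALTER TABLE analytics.events
SET OPTIONS (partition_expiration_days = 90);

-- Aggregated summaries: keep roughly five years; the size is negligible.
ALTER TABLE analytics.daily_revenue_summary
SET OPTIONS (partition_expiration_days = 1825);
```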
What an Efficient Platform Actually Looks Like
For a mid-market customer processing 500 GB of new data per day, the stack we recommend most often:
- Raw storage: S3 or Azure Blob, Parquet files, partitioned by date.
- Transformation: dbt running on Snowflake or BigQuery for batch, incremental models where possible.
- Warehouse: Snowflake or BigQuery, with aggressive materialized views for dashboards.
- Real-time dashboards: ClickHouse if concurrency is high, otherwise the warehouse's own dashboarding.
- BI tool: Metabase, Looker, or Power BI depending on existing stack.
- Cost monitoring: dashboards showing query cost per user per dashboard, reviewed weekly.
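The cost-monitoring item is the one teams most often skip, and it is the cheapest to build. A sketch of the underlying query against BigQuery's INFORMATION_SCHEMA jobs view (Snowflake's ACCOUNT_USAGE.QUERY_HISTORY plays the same role); the $5/TB rate is an assumption, so substitute your contract pricing.

```sql
-- Who scanned what this month, priced at an assumed $5 per TB billed.
SELECT
  user_email,
  COUNT(*) AS queries,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2)     AS tb_billed,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 5, 2) AS est_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time >= TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), MONTH)
GROUP BY user_email
ORDER BY est_cost_usd DESC;
```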
Total monthly cost for this profile is usually $3K-8K depending on query patterns. Teams that don't enforce the partition discipline end up at $15K-30K for the same workload.
Three Takeaways
- Query cost is 10x storage cost — focus your optimization there. Partitioning and pre-aggregation are the highest-leverage investments.
- Match the engine to the workload. Snowflake for mixed analytics, ClickHouse for real-time dashboards, DuckDB for laptop-scale. One-size-fits-all is expensive.
- Retention policies are a business decision and they're the biggest conversation nobody has. Have it before the bill forces it.
Talk with us about your infrastructure
Schedule a consultation with a solutions architect.