The False Economy of Cheap Storage

A Principal Data Engineer at Amazon recently posed a question that lit up LinkedIn: if data lakes are so much cheaper per terabyte than data warehouses, why pay for a warehouse at all?

His analogy was sharp: "That is like saying: if Google Drive is cheaper than a database, why not run your entire company's analytics from a folder of CSVs?"

It sounds clever for about two seconds. Then the whole thing falls apart the moment real usage begins.

But the ensuing debate — hundreds of comments from senior engineers across the industry — revealed something more interesting than the original question. The data engineering community is stuck in a three-way argument between lakes, warehouses, and lakehouses. And all three camps are optimizing for the wrong thing.

Three camps, same argument

Camp 1: "Just use the lake." Store everything cheaply, query it when you need to. The problem: as one Senior Data Engineer put it, "It's all fun and games until an analyst runs an unpartitioned SELECT * on four years of nested JSON files and the compute bill arrives."
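The unpartitioned-scan problem is easy to make concrete with scanned-bytes pricing (the Athena-style $5/TB rate is real, but the data volumes below are assumptions for illustration):

```python
# Why the unpartitioned scan hurts: a scanned-bytes cost sketch.
# Assumes Athena-style pricing of ~$5 per TB scanned; the daily data
# volume is a hypothetical figure, not taken from any real workload.
PRICE_PER_TB = 5.00
tb_per_day = 0.01             # assume ~10 GB of nested JSON lands per day
days_of_history = 4 * 365     # four years of accumulated files

full_scan_tb = tb_per_day * days_of_history   # no partition pruning: scan it all
pruned_scan_tb = tb_per_day * 30              # the query only needed last month

print(f"Unpartitioned: {full_scan_tb:.1f} TB -> ${full_scan_tb * PRICE_PER_TB:.2f}")
print(f"Partitioned:   {pruned_scan_tb:.1f} TB -> ${pruned_scan_tb * PRICE_PER_TB:.2f}")
```

One careless query pays for four years of history when partition pruning can't kick in; the same question against partitioned data costs a rounding error.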

Camp 2: "You need both." The lake is where data lands, the warehouse is where data becomes usable. This is the conventional wisdom — and it means maintaining two systems, ETL pipelines between them, and a FinOps team to manage the combined costs.

Camp 3: "Lakehouse solves everything." Apache Iceberg, Delta Lake, Hudi — open table formats that bring ACID transactions and schema enforcement to the lake. One commenter from Dremio argued: "The 'raw, untrustworthy lake' problem is largely solved when you layer the right engine on top."

Each camp has a point. But they're all debating the arrangement of furniture while the building is on fire.

The part nobody's talking about

The original post correctly identified that you're not just optimizing for storage cost. You're optimizing for query performance, reliability, governance, consistency, and trust.

But here's what the entire debate misses: the cost model of every option discussed — lake, warehouse, lakehouse — is per-query.

Whether you're running Snowflake, Databricks, BigQuery, Athena, or a lakehouse with Trino on top, you're paying for compute every time you ask a question. The meter is always spinning.

This was fine when humans ran 50 queries a day. It is not fine when AI agents run 50 queries a minute.
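A back-of-envelope calculation makes the gap concrete (the per-query cost here is an assumed average for illustration, not any vendor's list price):

```python
# How per-query pricing scales when agents replace humans.
# COST_PER_QUERY is an assumed average, purely for illustration.
COST_PER_QUERY = 0.50  # assumed average compute cost per query (USD)

human_queries_per_day = 50
agent_queries_per_day = 50 * 60 * 24  # 50 queries/minute, around the clock

human_monthly = human_queries_per_day * 30 * COST_PER_QUERY
agent_monthly = agent_queries_per_day * 30 * COST_PER_QUERY

print(f"Human analyst: ${human_monthly:,.0f}/month")
print(f"AI agent:      ${agent_monthly:,.0f}/month")
print(f"Scale factor:  {agent_monthly / human_monthly:,.0f}x")
```

Same architecture, same pricing model, a 1,440x jump in the bill. The meter didn't break; the usage pattern did.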

A single Snowflake Cortex AI query on 1.18 billion records cost one company $5,269. That's not a billing error. That's the per-query model working exactly as designed — at a scale it was never designed for.

The FinOps market is projected to reach $27 billion by 2030. That's a $27 billion industry whose sole purpose is managing the cost chaos created by per-query pricing. Something is deeply wrong when an entire industry exists just to explain your database bill.

You're asking the wrong question

One commenter — a fractional CTO — made the most honest observation in the entire thread:

"NO DATABASE VENDOR has a real solution. That's because the real solution involves time, organization, and commitment, not their product."

He's half right. Time, organization, and commitment matter. But the reason vendors don't have a real solution is simpler than that: they can't. Snowflake, Databricks, and BigQuery are structurally trapped. Their revenue models depend on per-query billing. Moving to capacity pricing would destroy their economics. It's the classic innovator's dilemma.

The right question isn't "lake vs. warehouse vs. lakehouse." It's: what does it actually cost to turn my data into decisions?

And that cost includes:

  • Per-query compute charges that scale with AI agent usage
  • ETL pipeline maintenance consuming a disproportionate share of engineering time
  • FinOps teams managing cost dashboards instead of building product
  • Throttled AI agents because every query has a price tag
  • Governance bolted on as an afterthought rather than built into the architecture

Storage and compute don't have to be separated

What if storage and compute weren't separated? What if you didn't need a lake AND a warehouse AND a lakehouse AND an ETL pipeline AND a FinOps team?

What if you could store massive datasets and query them in seconds, with governance built in, at a fixed monthly cost regardless of how many queries you run?
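To see what a fixed monthly cost changes, here's a hedged break-even sketch. Both prices are assumptions chosen for illustration, not quotes from MinusOneDB or any metered vendor:

```python
# Break-even sketch: per-query billing vs. fixed capacity pricing.
# Both price points below are assumptions, not real vendor prices.
per_query_cost = 0.25        # assumed USD per query on a metered platform
capacity_monthly = 5_000.0   # assumed flat monthly capacity price

break_even_queries = capacity_monthly / per_query_cost
print(f"Break-even: {break_even_queries:,.0f} queries/month")

# A single agent at 50 queries/minute crosses that line the same day.
agent_rate_per_hour = 50 * 60
hours_to_break_even = break_even_queries / agent_rate_per_hour
print(f"An agent at 50 q/min reaches it in {hours_to_break_even:.1f} hours")
```

Under these assumptions, capacity pricing wins the moment query volume stops being a human-scale number, which is exactly what agent workloads guarantee.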

This is the architectural bet MinusOneDB made: move storage and compute back together, make distributed search the foundation rather than an add-on, and charge for capacity rather than per query.

The result: AI agents query freely. No per-query bill. No FinOps dashboard. No lake-to-warehouse ETL. SOC 2 certified with audit logging built in.

This argument isn't going away

The lake vs. warehouse vs. lakehouse argument will keep going. People will keep proposing new table formats, new query engines, new ways to arrange the same fundamental architecture.

But the companies that pull ahead won't be the ones who picked the right layer. They'll be the ones who stopped paying per question.

As one commenter noted: "Good Data Engineers don't optimize for the cheapest layer. They optimize for the full system."

Exactly. And the full system includes the bill.