
How Cloud Data Warehouses Changed the Rules


Modern cloud data warehouses like Snowflake, Firebolt, and Databricks have reimagined what databases can do — prioritizing massive scalability, low storage costs, and lightning-fast analytical queries. But in doing so, they’ve made trade-offs that make them very different from traditional relational databases like PostgreSQL, Oracle, or MySQL.

Inserts Welcome. Updates and Deletes? Not So Much

Traditional relational databases are designed to handle frequent row-level operations: updates, deletes, and multi-statement transactions with ACID guarantees. In contrast, cloud data warehouses discourage such usage — often subtly, sometimes explicitly.

For example:

  • In Snowflake, single-row UPDATE and DELETE operations are possible, but slow and inefficient. These commands trigger background copy-on-write and micro-partition recompaction, making them expensive.

  • Firebolt doesn’t support row-level MERGE or UPSERT natively and encourages append-only ingest models.

  • In Databricks Delta Lake, updates and deletes work, but they trigger costly file rewrites, which is why MERGE INTO is recommended for batch-style changes.
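The batch-style alternative to many single-row statements is a single MERGE. A minimal sketch — table and column names are illustrative, but the syntax shown works in both Snowflake and Databricks Delta Lake:

```sql
-- Batch upsert: one MERGE instead of many single-row UPDATEs/DELETEs.
-- Table and column names here are hypothetical.
MERGE INTO customers AS t
USING staged_changes AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED AND s.is_deleted THEN DELETE
WHEN MATCHED THEN UPDATE SET t.email = s.email, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, updated_at)
  VALUES (s.customer_id, s.email, s.updated_at);
```

Because the warehouse rewrites whole files or micro-partitions anyway, folding thousands of changes into one MERGE amortizes that cost once instead of paying it per row.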

Separation of Storage and Compute: Elastic by Design

Traditional relational databases tightly couple storage with compute — one machine (or a fixed cluster) owns the data and handles the queries. Cloud data warehouses took the opposite approach: decoupling storage and compute, allowing each to scale independently.

For example:

  • In Snowflake, you can spin up multiple "virtual warehouses" (compute clusters) on top of the same shared data. Want faster queries? Just use a larger warehouse. Want to reduce costs? Suspend the warehouse.

  • In Firebolt, each engine is isolated and can be independently scaled to support different workloads (ingestion, dashboards, experimentation).

  • Databricks allows you to bring up ephemeral Spark clusters that read from shared Delta tables stored on S3 or ADLS.

This decoupling allows for burstable compute — run a massive query, scale up temporarily, then scale back down — something a traditional RDBMS cannot do easily.
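In Snowflake, this burstable pattern is a pair of DDL statements (the warehouse name is illustrative):

```sql
-- Snowflake: scale compute independently of the shared storage layer.
-- Resize for a heavy query, then shrink and suspend to stop paying.
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'XLARGE';

-- ... run the expensive query ...

ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'SMALL';
ALTER WAREHOUSE analytics_wh SUSPEND;
```

A suspended warehouse incurs no compute charges; the data stays queryable the moment any warehouse resumes.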

Object Storage and Metadata: Why Ingestion is Fast

At the heart of most cloud data warehouses lies object storage (like S3 or GCS), not traditional block storage. This shift changes how "loading data" works.

In many cases, the data never moves:

  • In Snowflake, external tables let you query Parquet or CSV files directly from S3, using schema-on-read.

  • Databricks can read Delta Lake tables directly from S3; a Delta table is essentially a transaction log of metadata pointing at Parquet files.

  • Firebolt supports external tables that create a logical view over files in your data lake — the actual bits stay put.
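A Snowflake-flavored sketch of the external-table approach — the stage, bucket path, and credentials are placeholders:

```sql
-- Snowflake: query Parquet files sitting in S3 without copying them in.
-- Stage name, bucket, and credentials are illustrative.
CREATE STAGE lake_stage
  URL = 's3://my-bucket/events/'
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>');

CREATE EXTERNAL TABLE events_ext
  LOCATION = @lake_stage
  FILE_FORMAT = (TYPE = PARQUET)
  AUTO_REFRESH = FALSE;

-- Schema-on-read over files that never moved.
SELECT COUNT(*) FROM events_ext;
```

"Loading" here is really just registering metadata; the Parquet files remain where they were written.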

Row-Level Operations Are a Last Resort

Cloud data warehouses are fundamentally columnar and immutable at the storage layer. That means row-level changes often require rewriting entire files or partitions.

Examples:

  • In Databricks Delta, updating a single row can mean rewriting the entire data file that contains it.

  • Snowflake manages data in immutable micro-partitions; any DELETE or UPDATE invalidates the affected partitions and writes new ones.

  • Firebolt doesn't support direct row-level updates at all, pushing users toward append + deduplicate or flip-flop tables.

Design patterns like slowly changing dimensions (SCD) are harder to implement when every update is a batch overwrite.
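One common workaround for the append-only constraint is to keep every version of a row and expose only the latest through a view. A sketch using Snowflake's QUALIFY clause (table and column names are illustrative):

```sql
-- Append + deduplicate: never update in place; keep appending versions
-- and read only the newest one per key. Names are hypothetical.
CREATE VIEW customers_current AS
SELECT *
FROM customers_history
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY customer_id
  ORDER BY updated_at DESC
) = 1;
```

As a side effect, the full history table gives you SCD-style auditability for free — at the cost of storage and a window function on every read.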

ETL and CDC Are Awkward Fits

Change Data Capture (CDC) is the backbone of many real-time ETL systems. But CDC is inherently row-based — which clashes with the batch-friendly nature of cloud data warehouses.

Challenges:

  • Tools like Airbyte or Debezium emit row-level changes that cloud data warehouses struggle to ingest efficiently.

  • In Snowflake, ingesting from CDC streams like Kafka requires landing data in a stage or external table, then merging it in batch.

  • Databricks users often have to write custom logic to coalesce CDC events into upserts.

  • Firebolt doesn’t support out-of-the-box CDC pipelines and typically expects batch inserts or full reloads.

The end result is that building reliable ETL pipelines requires extra engineering effort — often including custom buffering, data deduplication, and late-arrival handling.
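The usual landing pattern folds buffered row-level events into the target in batch. A Snowflake-flavored sketch, assuming a staging table of CDC events with an operation code — all names are illustrative:

```sql
-- CDC landing pattern: buffer row-level change events in a staging
-- table, deduplicate to the latest event per key, then MERGE in batch.
-- Names are hypothetical; op is the CDC operation code ('I','U','D').
MERGE INTO orders AS t
USING (
  -- keep only the newest event per key from the buffered batch
  SELECT *
  FROM cdc_buffer
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY order_id ORDER BY event_ts DESC
  ) = 1
) AS s
  ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.status = s.status, t.event_ts = s.event_ts
WHEN NOT MATCHED AND s.op <> 'D' THEN
  INSERT (order_id, status, event_ts)
  VALUES (s.order_id, s.status, s.event_ts);
```

The deduplication step matters: if the same key appears twice in one batch, an unfiltered MERGE would raise a nondeterministic-update error.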
