<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Wirekite Blog]]></title><description><![CDATA[Thoughts and opinions about all things data. And Wirekite.]]></description><link>https://blog.wirekite.io</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1740292730164/4d52c367-02c5-46e4-948a-520affe9e528.png</url><title>Wirekite Blog</title><link>https://blog.wirekite.io</link></image><generator>RSS for Node</generator><lastBuildDate>Mon, 13 Apr 2026 22:56:50 GMT</lastBuildDate><atom:link href="https://blog.wirekite.io/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Zeroth Commandment: Thou Shall Not Copy]]></title><description><![CDATA[If you’ve written database backends, you learn a crucial design doctrine: Excess Data Copying is Bad.
Wirekite is designed around a related doctrine: “Thou Shall Not Copy (Unnecessarily)”.
If you’re moving data from source to target, you need to do t...]]></description><link>https://blog.wirekite.io/zeroth-commandment-thou-shall-not-copy</link><guid isPermaLink="true">https://blog.wirekite.io/zeroth-commandment-thou-shall-not-copy</guid><category><![CDATA[performance]]></category><category><![CDATA[Databases]]></category><category><![CDATA[database migrations]]></category><category><![CDATA[Database management system]]></category><dc:creator><![CDATA[Wirekite]]></dc:creator><pubDate>Mon, 21 Jul 2025 06:13:05 GMT</pubDate><content:encoded><![CDATA[<p>If you’ve written database backends, you learn a crucial design doctrine: Excess Data Copying is Bad.</p>
<p>Wirekite is designed around a related doctrine: “Thou Shall Not Copy (Unnecessarily)”.</p>
<p>If you’re moving data from source to target, you need to do the following at a minimum:</p>
<ol>
<li><p>Extract the data from the source database. Many data sources (MySQL/MariaDB and PostgreSQL in particular) permit this to be done by a direct dump out of the database itself to a file. Others (Oracle, Microsoft SQL Server, and many more) require you to stream data from the source database instance into the address space of the client, where the client dumps the data to a file. The former is a single copy**; the latter is three copies: across the network (even for a localhost extract), into the client’s address space, and out to a file.</p>
</li>
<li><p>Transfer data from the source environment to the target environment. In some cases, this is as simple as connecting to the target db and pointing it at the file dumped by the source application. In others - particularly for cloud data warehouse targets such as Firebolt or Snowflake - you need to upload the output from (1) above to something like Amazon S3. Assuming the usual case of the target db instance living on a different physical host than the source db instance, the optimal case involves a read of the file plus a network transfer from the client to the target db instance. The less optimal case involves four copies: reading the file into the data transfer client’s address space, a network transfer, reading the bytes off the network at the other end of the connection, and writing the bytes to remote storage (either an S3 object or a file on a remote compute instance, whether in the cloud or not).</p>
</li>
<li><p>The final load to the target. This involves getting the data from transient storage into the address space of the target db instance. In some cases, such as MySQL’s LOAD DATA INFILE, this is a direct access to the stored file by the db engine itself, which is a single copy**. In others, such as S3 storage or data loaded from an application (i.e., using big INSERT statements or mechanisms such as PostgreSQL COPY FROM STDIN), the data is copied twice more: once into the address space of the client and once more across the network.</p>
</li>
</ol>
<p>**We’ve ignored copies inside the source and target DB instances themselves - and a bunch of potential copying done by networking/firewalls/etc - as the number of copies done by these is difficult to determine at the application level.</p>
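<p>The doctrine can be made concrete with a toy sketch (a hedged illustration, not Wirekite’s actual extractor; <code>source_rows</code> is a hypothetical stand-in for a source-database cursor):</p>

```python
# Toy sketch: the same rows extracted two ways. The buffered path holds
# every row in memory and again as one serialized string before writing;
# the streaming path keeps at most one row in flight. source_rows() is a
# hypothetical stand-in for a source-database cursor.
import csv
import io

def source_rows(n):
    """Pretend cursor yielding rows from the source database."""
    for i in range(n):
        yield (i, f"name-{i}")

def buffered_dump(path, n):
    rows = list(source_rows(n))          # copy 1: all rows in client memory
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)      # copy 2: one big serialized buffer
    with open(path, "w", newline="") as f:
        f.write(buf.getvalue())          # copy 3: buffer written to the file
    return path

def streaming_dump(path, n):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in source_rows(n):       # one row at a time, straight through
            writer.writerow(row)
    return path
```

<p>Both produce byte-identical output; the streaming version simply refuses to materialize intermediate copies, which is the spirit of the doctrine.</p>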
<p>Depending on the specifics and networking of the source and target, Wirekite’s Extractors, Movers (if needed), and Loaders will execute 3 to 9 copies of the data while doing the data movement.</p>
<p>Other tools - particularly those using Kafka or other intermediate tooling as part of the data transfer process - make many more copies of the data, especially if they perform row-by-row data conversions along the way. If they parse at the row and column level, there can be several memory copies during parsing, plus the copies involved in building an intermediate format such as JSON or <a target="_blank" href="https://en.wikipedia.org/wiki/Apache_Parquet">Parquet</a>, as well as extra copying to persistent storage if they use a “database-backed” transfer system like Kafka.</p>
<p>On the target side, they have to convert from the intermediate format to something that can be ingested by the database instance. This will also involve several additional copies while parsing the JSON or Parquet into address space structures and reformatting these into output suitable for target ingestion.</p>
<p>The simplicity of our data flow - determined by source and target capabilities as well as network topology - is a big reason Wirekite benchmarks so much faster than many other data migration tools.</p>
<p>Another big reason is multithreaded extract and load, but that’s a topic for another blog post…</p>
]]></content:encoded></item><item><title><![CDATA[The Change Data Capture performance problem]]></title><description><![CDATA[Many Change Data Capture (CDC) tools use something like the below approach to propagate changes from sources to targets:

Gather changes from the source dataworld using vendor-specific change capture methods (binlog reader APIs, replication streams, ...]]></description><link>https://blog.wirekite.io/the-change-data-capture-performance-problem</link><guid isPermaLink="true">https://blog.wirekite.io/the-change-data-capture-performance-problem</guid><category><![CDATA[change data capture]]></category><category><![CDATA[performance]]></category><dc:creator><![CDATA[Wirekite]]></dc:creator><pubDate>Thu, 10 Jul 2025 07:00:44 GMT</pubDate><content:encoded><![CDATA[<p>Many Change Data Capture (CDC) tools use something like the below approach to propagate changes from sources to targets:</p>
<ul>
<li><p>Gather changes from the source dataworld using vendor-specific change capture methods (binlog reader APIs, replication streams, insert/update/delete table triggers, etc).</p>
</li>
<li><p>Stream changes using something like Kafka from the source platform to the target platform.</p>
</li>
<li><p>Use a client-side loader that reads the events out of Kafka, converts them to appropriate SQL DML - typically single-row INSERT/UPDATE/DELETE statements and BEGIN/COMMIT as necessary - and runs them on the target to execute the changes.</p>
</li>
</ul>
<p>This “works correctly” in the sense that changes on the source will make it to the target, and the target will correctly reflect them.</p>
<p>But…</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751833172676/1ece9fbb-28b8-411a-bf1b-e24a4c0f7145.jpeg" alt class="image--center mx-auto" /></p>
<p>This method has a low performance ceiling for several reasons…</p>
<ol>
<li><p>CDC processing on the source is typically single-threaded, particularly if you intend to serialize transactions. To be fair, there isn’t too much the CDC processing software can do about this as it’s an artifact of the vendor CDC implementation.</p>
</li>
<li><p>Data must be formatted into stream-friendly events and written to the streaming system, such as Kafka.</p>
</li>
<li><p>On the client side, the events must be read out of the streaming system, converted to some sort of SQL DML, and written to the target. This is often done one query at a time in a single-threaded fashion.</p>
</li>
<li><p>One complicating factor is that many cloud database engines don’t process single-row change DML efficiently because of their storage design.</p>
</li>
</ol>
<p>This may be fine for environments that average a few to a few dozen changes per second. But if you have a very active dataworld - hundreds or more changes per second from dozens of active client connections - CDC tools using such methods will quickly “fall behind”, and without “idle periods” on the source, they may fall far enough behind that they simply can’t recover.</p>
<p>Wirekite uses a faster method. We’re still serial on most extracts as most CDC APIs are fundamentally serialized (and to be transactionally correct in the target, we have to have some unit-of-work that is serial), but we use a much simpler and faster transfer mechanism than event-streaming, and we use multi-row loading operations to post changes to the target database.</p>
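<p>A minimal sketch of the multi-row idea (an illustration only - not Wirekite’s loader, and with SQLite standing in for the target database):</p>

```python
# Hedged sketch of multi-row change application (not Wirekite's actual
# loader; SQLite stands in for the target purely for illustration).
# Instead of one statement per CDC event, buffer events and apply each
# batch with a single multi-row operation, cutting per-statement overhead.
import sqlite3

def apply_inserts_batched(conn, events, batch_size=1000):
    """events: iterable of (id, value) insert events, applied in batches."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO t (id, value) VALUES (?, ?)", batch)
            batch.clear()
    if batch:                                  # flush the final partial batch
        conn.executemany("INSERT INTO t (id, value) VALUES (?, ?)", batch)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, value TEXT)")
apply_inserts_batched(conn, ((i, f"v{i}") for i in range(5000)))
```

<p>The win comes from amortizing statement overhead: 5,000 events become five round trips instead of 5,000.</p>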
<p>This approach allows us to process over <strong>180,000 changes per second</strong> on some particularly fast data sources and targets, as shown by this benchmark: <a target="_blank" href="https://benchmarks.wirekite.io/mysql-to-firebolt-10-million-ops-inserts-updates-deletes-0-mins-53-secs">10 Million Changes: MySQL to Firebolt</a>.  </p>
<p>You won’t typically need this level of speed - few single-instance database engines can handle this much activity - but it’s good to know it’s there if needed…</p>
]]></content:encoded></item><item><title><![CDATA[Firebolt: Experience with the Fastest Data Warehouse]]></title><description><![CDATA[Working with Firebolt over the past few months has been nothing short of an adventure — one filled with optimism, head-scratching, performance highs, and some deep engineering sighs. It’s a platform with a lot of promise, backed by impressive speed b...]]></description><link>https://blog.wirekite.io/firebolt-experience-with-the-fastest-data-warehouse</link><guid isPermaLink="true">https://blog.wirekite.io/firebolt-experience-with-the-fastest-data-warehouse</guid><dc:creator><![CDATA[Wirekite]]></dc:creator><pubDate>Tue, 01 Jul 2025 07:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Working with Firebolt over the past few months has been nothing short of an adventure — one filled with optimism, head-scratching, performance highs, and some deep engineering sighs. It’s a platform with a lot of promise, backed by impressive speed benchmarks and a modern architecture. But like any early-stage rocket ship, it has its quirks — especially if you’re the kind of engineer who likes to dig deep and move fast. Here’s a breakdown of why Firebolt has been such an interesting ride.</p>
<p>Firebolt Benchmark - <a target="_blank" href="https://www.firebolt.io/blog/introducing-firescale">https://www.firebolt.io/blog/introducing-firescale</a></p>
<p>Wirekite Benchmark - <a target="_blank" href="https://benchmarks.wirekite.io/series/benchmarks-extract-and-load">https://benchmarks.wirekite.io/series/benchmarks-extract-and-load</a></p>
<h2 id="heading-no-command-line-no-party">No Command Line? No Party</h2>
<p>Let’s start with the obvious elephant in the room — no CLI. In an age where every serious tool offers a command line for real engineers to script, automate, test, and break things beautifully, Firebolt chose to go GUI-only. Sure, the web interface looks polished and is friendly to first-timers, but for engineers who like to tinker, build wrappers, or showcase product strengths and limitations through scripts, a lack of CLI is a red flag. It feels like you’re being asked to drive a sports car using only the touchscreen. There’s just no replacement for a well-documented, responsive, Unix-y CLI interface. And for many seasoned engineers, if it doesn’t have a CLI, it doesn’t feel real.</p>
<h2 id="heading-initial-setup">Initial Setup</h2>
<p>Then there’s the initial configuration — arguably the moment where a tool either wins or loses its future champions. Firebolt makes a poor first impression. The security model is… chaotic. You’re thrown into a maze of owners, users, roles, service accounts, policies, logins, keys, and organization scope. One misstep and you’re locked out. Want to "just play" with the database? Not so fast. It's ironic that in an age of seamless SaaS onboarding, setting up Firebolt felt difficult. Contrast that with the early days of MySQL, which won hearts (and market share) simply by being easy to install, run, and explore. Firebolt, unfortunately, missed that memo.</p>
<h2 id="heading-blistering-ingestion-speeds">Blistering Ingestion Speeds</h2>
<p>But once you have it set up, ingesting data into Firebolt is snappy, really snappy. Compared to other cloud data warehouses, Firebolt’s performance here is eye-catching. They’ve made serious engineering choices to optimize ingestion pipelines, and it shows. For teams importing large datasets, this speed is addictive. You find yourself asking: “Why doesn’t every platform work this fast?” It’s a major win, and credit where it’s due — Firebolt shines bright in the lanes it was built for.</p>
<h2 id="heading-blistering-query-speeds">Blistering Query Speeds</h2>
<p>Firebolt's tagline isn’t marketing fluff — their decoupled compute and storage, combined with advanced indexing (like aggregating indexes and join indexes), really do make queries run much faster than traditional cloud data warehouses. This is especially noticeable with:</p>
<ul>
<li><p>Large fact tables</p>
</li>
<li><p>Star/snowflake schemas</p>
</li>
<li><p>High-concurrency workloads</p>
</li>
</ul>
<h2 id="heading-awesome-indexing">Awesome Indexing</h2>
<p>Firebolt brings real indexing to cloud data warehouses, which is pretty rare in the modern OLAP world. It supports:</p>
<ul>
<li><p>Primary indexes to organize how data is stored physically</p>
</li>
<li><p>Aggregating indexes for super-fast summary queries</p>
</li>
<li><p>Join indexes to optimize complex joins between large tables</p>
</li>
</ul>
<p>This indexing model allows users to pre-optimize for known access patterns, leading to big wins in performance and cost.</p>
<h2 id="heading-lower-storage-costs">Lower Storage Costs</h2>
<p>Firebolt uses F3 (Firebolt File Format), a columnar compressed format designed to optimize for:</p>
<ul>
<li><p>Fast sequential scans</p>
</li>
<li><p>Predictable performance</p>
</li>
<li><p>Storage cost reduction</p>
</li>
</ul>
<p>In our benchmarks, storage costs were often lower than Redshift or Snowflake for the same datasets, especially after compaction.</p>
<h2 id="heading-fast-in-slow-around">Fast In, Slow Around</h2>
<p>But speed has a funny way of revealing tradeoffs. Once you try to update or delete data, the ride slows to a crawl. Firebolt, like many OLAP systems, isn’t designed for transactional workloads — fair. But the degree of slowness in operations like <code>UPDATE</code> or <code>DELETE</code> feels surprising and sometimes unworkable. And more frustratingly, there’s no <code>MERGE</code> or <code>UPSERT</code> support in SQL — a staple in modern data warehousing for merging new records with existing data. That’s a tough pill to swallow for customers expecting to incrementally update their data warehouse without a full overwrite every time.</p>
<h2 id="heading-medium-is-not-the-new-big">Medium is Not the New Big</h2>
<p>Another limitation that caught us off guard — the size caps. Firebolt currently tops out at "M" size for compute engines. In a market where vendors offer XL, 2XL, even 4XL clusters for big data workloads, Firebolt’s constraint feels underwhelming. We’re dealing with warehouse-scale data here — hundreds of terabytes, sometimes petabytes. If "M" is your max, you’re signaling to customers that you’re not quite ready for enterprise-scale heavy lifting. It's a curious choice, and one that will need to be addressed if Firebolt is to compete in the big leagues.</p>
<h2 id="heading-outsourced-migrations">Outsourced Migrations</h2>
<p>Data migration, often the most painful part of any cloud transition, has been outsourced in Firebolt's case — mainly to tools like Airbyte and dbt. While it’s understandable that Firebolt didn’t want to reinvent the wheel, the performance of these tools, in practice, has been underwhelming. Our benchmarks showed slow throughput and unreliable sync behavior, especially with large datasets and complex schema transformations. In critical pipelines, these lags become show-stoppers. Offloading this responsibility without tight native integration leaves users juggling too many moving parts — and that slows adoption.</p>
]]></content:encoded></item><item><title><![CDATA[Why Data Movement Speed Matters]]></title><description><![CDATA[At Wirekite, we optimized our implementations around an ethos of speed over some other software engineering goals. We don’t use off-the-shelf data formatting libraries or other tools that are commonly used by others in the data movement space, and we...]]></description><link>https://blog.wirekite.io/why-speed-matters</link><guid isPermaLink="true">https://blog.wirekite.io/why-speed-matters</guid><category><![CDATA[performance]]></category><category><![CDATA[data migration tools]]></category><dc:creator><![CDATA[Wirekite]]></dc:creator><pubDate>Mon, 23 Jun 2025 07:00:00 GMT</pubDate><content:encoded><![CDATA[<p>At Wirekite, we optimized our implementations around an ethos of speed over some other software engineering goals. We don’t use off-the-shelf data formatting libraries or other tools that are commonly used by others in the data movement space, and we aren’t afraid to “reinvent the wheel” if we feel that we need a better wheel.</p>
<h2 id="heading-speed-matters-in-data-migration-projects">Speed Matters in Data Migration Projects</h2>
<p>In our experience, the faster physical data movement is, the less complex the overall project ends up being, and the higher its chance for success.</p>
<p>For example, consider a very large data migration project from an existing on-premise data and application world to a new data world, whether it’s in “the cloud” or a new on-premise environment. In such a migration, you have a lot of moving parts to consider:</p>
<ul>
<li><p>You have to migrate the database schema from the old database technology to the new technology, and figure out how this will be done.</p>
</li>
<li><p>Application data layers will have to be recoded to work with the new dataworld.</p>
</li>
<li><p>You have to physically move existing data from the old dataworld to the new.</p>
</li>
<li><p>As bulk data movement isn’t instantaneous, you will need some scheme for “catching up” the new world with the updates the old world received while the initial data was being extracted, transferred, and loaded.</p>
</li>
</ul>
<p>Most organizations with skilled developers and operations people can handle the schema, application, and initial data movement parts, although often with manual processes that may be quite tedious and error-prone. But they usually can make these work - assuming project scope-creep is kept to a minimum.</p>
<p>Where things get icky is in the catch-up phase. If the extract, transfer, and load takes many hours or days as it often does, you have various choices:</p>
<ol>
<li><p>A lengthy downtime of your production environment to avoid lag while the base data is extracted, transferred, and loaded to the new environment. You switch your apps over after everything has been loaded (and validated, etc). It’s a “clean” cutover, but may involve hours or days of downtime.</p>
</li>
<li><p>App-coded “catch-up logic”, using app-specific mechanisms, such as copying over a user’s history as that user’s activity is encountered by the app (and keeping metadata somewhere that says which users have been migrated). This is extremely hard to get right, and the migration may take weeks or months, so you’ll have to run two live environments until you finally decide that enough users have been migrated.</p>
</li>
<li><p>Coding your application data layer so it “mirrors” changes to the old and new environment. This is also very hard to get right.</p>
</li>
<li><p>The best scenario is if your data movement solution can do the initial data move as well as replicating both backlogged and new changes from your old to new environment quickly enough that the two environments will be fully “synched” within a reasonable timeframe. But you need a lot of “extra” performance - the change propagation can’t just be a bit faster than the change-rate in prod - to pull this off.</p>
</li>
</ol>
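<p>Back-of-envelope arithmetic shows why the last scenario needs that “extra” performance (a hedged sketch with illustrative rates, not measurements):</p>

```python
# Hedged catch-up arithmetic (illustrative rates, not measurements).
# After the bulk load you hold `backlog_hours` worth of production changes.
# Propagating at rate p while prod keeps changing at rate r, the backlog
# drains at (p - r), so catch-up time = backlog_hours * r / (p - r).
def catch_up_hours(backlog_hours, prop_rate, prod_rate):
    if prop_rate <= prod_rate:
        return float("inf")         # never converges: propagation too slow
    return backlog_hours * prod_rate / (prop_rate - prod_rate)

# 24 h of backlog with propagation only 10% faster than prod takes ~240 h
# to clear; the same backlog at 5x prod speed clears in 6 h.
```

<p>Being “a bit faster than prod” makes the denominator tiny and the catch-up window enormous; a large speed margin collapses it to hours.</p>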
<p>If the last scenario can happen, you can avoid data-related downtime, and just need to “flip a switch” to go from your old app and dataworld to the new world. Other than possibly a blip while the switch is flipped, you won’t have user-facing downtime - and your app developers can focus on the already-nontrivial task of getting the data layer to work in the new world without the transitional app logic or lengthy downtime mentioned in the previous scenarios.</p>
<h2 id="heading-speed-matters-in-data-warehouse-projects">Speed Matters in Data Warehouse Projects</h2>
<p>Data warehouse projects are different from data redeployment projects in that the production-facing dataworld will not be shut down, and the data warehouse will be used for reporting and analytics, not customer-facing applications.</p>
<p>But the basic migration problem is similar: you have to get bulk data transferred to the data warehouse, and have some way to keep the data warehouse at least reasonably synchronized with prod.</p>
<p>Many organizations implement strategies such as once-a-week reloads of their entire dataset into their data warehouse. This is expensive and error-prone, and it leaves a several-day lag between current production and your reporting and analytics, so you may miss emerging trends in your data that matter to the business.</p>
<p>If you have fast change propagation, your analytics and reporting world may only lag production by a few seconds or minutes, and you can detect trends in your data as soon as they appear.</p>
<h2 id="heading-speed-and-wirekite">Speed and Wirekite</h2>
<p>Our focus on speed allows Wirekite customers to have simpler - and more successful - engineering processes for data rehosting efforts, and truly fresh and up-to-date data in data warehouses and other analytics environments.</p>
]]></content:encoded></item><item><title><![CDATA[Fast Targets, Slow Sources]]></title><description><![CDATA[Modern cloud data warehouses have revolutionized the analytics landscape with their impressive ingestion capabilities. Platforms like Snowflake, Firebolt, and Databricks proudly showcase their lightning-fast data loading tools—Snowpipe for continuous...]]></description><link>https://blog.wirekite.io/fast-targets-slow-sources</link><guid isPermaLink="true">https://blog.wirekite.io/fast-targets-slow-sources</guid><dc:creator><![CDATA[Wirekite]]></dc:creator><pubDate>Thu, 19 Jun 2025 07:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Modern cloud data warehouses have revolutionized the analytics landscape with their impressive ingestion capabilities. Platforms like Snowflake, Firebolt, and Databricks proudly showcase their lightning-fast data loading tools—Snowpipe for continuous data ingestion, Lakehouse architectures for unified processing, and optimized file formats that can process terabytes in minutes. These innovations have dramatically reduced the time it takes to analyze data once it arrives at the warehouse, leading many organizations to believe they've solved the data pipeline performance puzzle. However, this focus on destination-side optimization has created a dangerous blind spot in the data engineering community: source extraction.</p>
<p>The reality is that these powerful platforms, despite their sophisticated ingestion mechanisms, predominantly expect data to arrive as files—whether in Parquet, CSV, JSON, or other formats. While they excel at processing these files at unprecedented speeds, they offer little to address the fundamental challenge of extracting data from source systems. This creates an ironic situation where organizations invest millions in state-of-the-art cloud warehouses capable of processing petabytes per hour, only to find themselves bottlenecked by legacy extraction processes that trickle data out. The mismatch is akin to building a sixteen-lane superhighway that connects to a narrow country road.</p>
<p>This extraction bottleneck becomes particularly acute when dealing with operational databases that power critical business applications. Traditional databases like Oracle, SQL Server, or even modern systems like PostgreSQL and MySQL, weren't designed with bulk data extraction in mind. They're optimized for transactional workloads, not for efficiently scanning and extracting millions of rows while maintaining production performance.</p>
<p>The problem is compounded by the industry's current obsession with Change Data Capture (CDC). While CDC is undoubtedly crucial for maintaining synchronized systems, many engineers focus exclusively on optimizing incremental updates while completely overlooking the initial historical data load. This oversight can be catastrophic: if your initial extraction and load takes days or weeks to complete, your CDC pipeline starts with such a significant lag that it may never catch up to real-time. Consider a scenario where you're extracting 10 TB of historical data at 100 MB/second while the source system generates changes at 150 MB/second—mathematically, you'll never achieve synchronization.</p>
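<p>That scenario’s arithmetic is worth making explicit (the numbers are the hypothetical ones from the example above):</p>

```python
# The 10 TB / 100 MB/s / 150 MB/s scenario from the text, made explicit.
initial_mb = 10 * 1024 * 1024          # 10 TB of historical data, in MB
extract_rate_mbps = 100                # extraction throughput
change_rate_mbps = 150                 # rate at which the source changes

extract_secs = initial_mb / extract_rate_mbps        # ~29 hours of extraction
backlog_mb = change_rate_mbps * extract_secs         # changes accrued meanwhile

# The backlog grows at 150 MB/s but can only drain at 100 MB/s,
# so the pipeline falls further behind forever.
print(f"initial load: {extract_secs / 3600:.1f} h; "
      f"backlog at cutover: {backlog_mb / 1024 / 1024:.1f} TB")
```

<p>By the time the 10 TB initial load finishes, roughly 15 TB of new changes are waiting, and the net drain rate is negative.</p>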
<p>The mathematics of data pipeline performance are unforgiving—every component must perform adequately for the system to function effectively. A data pipeline's throughput is determined by its slowest component. This means that having a warehouse capable of ingesting data at 10 GB/second is meaningless if your source extraction is limited to 100 MB/second. Furthermore, slow initial loads create cascading problems: they extend project timelines, increase costs due to prolonged dual-running systems, and create larger windows of data inconsistency.</p>
<p>The path forward requires a fundamental shift in how we approach data pipeline architecture. Instead of viewing extraction and loading as separate concerns, we need holistic solutions that optimize the entire data flow. This includes investing in parallel extraction techniques that can leverage source database read replicas, implementing intelligent partitioning strategies that allow for concurrent extracts, and developing extraction tools that understand source database internals to minimize production impact. Cloud warehouse vendors must also recognize that their responsibility doesn't end at the ingestion API—true end-to-end performance requires collaboration with source system vendors and investment in extraction technologies. Only when we achieve balance across all components—extraction, transformation, loading, and CDC—can we realize the full potential of modern cloud data warehouses and deliver on the promise of real-time, data-driven decision making.</p>
]]></content:encoded></item><item><title><![CDATA[Why Companies Don't Publish Benchmarks?]]></title><description><![CDATA[I think we know :).
Although, some do.
Wirekite Benchmarks - https://benchmarks.wirekite.io/series/benchmarks-extract-and-load
Firebolt Benchmark - https://www.firebolt.io/blog/introducing-firescale]]></description><link>https://blog.wirekite.io/why-companies-do-not-publish-benchmarks</link><guid isPermaLink="true">https://blog.wirekite.io/why-companies-do-not-publish-benchmarks</guid><dc:creator><![CDATA[Wirekite]]></dc:creator><pubDate>Sat, 14 Jun 2025 07:00:00 GMT</pubDate><content:encoded><![CDATA[<p>I think we know :).</p>
<p>Although some do.</p>
<p>Wirekite Benchmarks - <a target="_blank" href="https://benchmarks.wirekite.io/series/benchmarks-extract-and-load">https://benchmarks.wirekite.io/series/benchmarks-extract-and-load</a></p>
<p>Firebolt Benchmark - <a target="_blank" href="https://www.firebolt.io/blog/introducing-firescale">https://www.firebolt.io/blog/introducing-firescale</a></p>
]]></content:encoded></item><item><title><![CDATA[How Cloud Data Warehouses Changed the Rules]]></title><description><![CDATA[Modern cloud data warehouses like Snowflake, Firebolt, and Databricks have reimagined what databases can do — prioritizing massive scalability, low storage costs, and lightning-fast analytical queries. But in doing so, they’ve made trade-offs that ma...]]></description><link>https://blog.wirekite.io/how-cloud-data-warehouses-changed-the-rules</link><guid isPermaLink="true">https://blog.wirekite.io/how-cloud-data-warehouses-changed-the-rules</guid><dc:creator><![CDATA[Wirekite]]></dc:creator><pubDate>Mon, 09 Jun 2025 07:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Modern cloud data warehouses like Snowflake, Firebolt, and Databricks have reimagined what databases can do — prioritizing massive scalability, low storage costs, and lightning-fast analytical queries. But in doing so, they’ve made trade-offs that make them very different from traditional RDBMS systems like PostgreSQL, Oracle, or MySQL.</p>
<h2 id="heading-inserts-welcome-updates-and-deletes-not-so-much"><strong>Inserts Welcome. Updates and Deletes? Not So Much</strong></h2>
<p>Traditional RDBMS systems are designed to handle frequent row-level operations, including updates, deletes, and transactions with ACID guarantees. In contrast, cloud data warehouses discourage such usage — often subtly, sometimes explicitly.</p>
<p>For example:</p>
<ul>
<li><p>In Snowflake, single-row <code>UPDATE</code> and <code>DELETE</code> operations are possible, but slow and inefficient. These commands trigger background copy-on-write and micro-partition recompaction, making them expensive.</p>
</li>
<li><p>Firebolt doesn’t support row-level <code>MERGE</code> or <code>UPSERT</code> natively and encourages append-only ingest models.</p>
</li>
<li><p>In Databricks Delta Lake, updates and deletes work, but they trigger costly file rewrites, which is why <code>MERGE INTO</code> is recommended for batch-style changes.</p>
</li>
</ul>
<h2 id="heading-separation-of-storage-and-compute-elastic-by-design"><strong>Separation of Storage and Compute: Elastic by Design</strong></h2>
<p>Traditional RDBMS systems tightly couple storage with compute — one machine (or a fixed cluster) owns the data and handles the queries. Cloud data warehouses took the opposite approach: decoupling storage and compute, allowing each to scale independently.</p>
<p>For example:</p>
<ul>
<li><p>In Snowflake, you can spin up multiple "virtual warehouses" (compute clusters) on top of the same shared data. Want faster queries? Just use a larger warehouse. Want to reduce costs? Suspend the warehouse.</p>
</li>
<li><p>In Firebolt, each engine is isolated and can be independently scaled to support different workloads (ingestion, dashboards, experimentation).</p>
</li>
<li><p>Databricks allows you to bring up ephemeral Spark clusters that read from shared Delta tables stored on S3 or ADLS.</p>
</li>
</ul>
<p>This decoupling allows for burstable compute — run a massive query, scale up temporarily, then scale back down — something a traditional RDBMS cannot do easily.</p>
<h2 id="heading-object-storage-and-metadata-why-ingestion-is-fast"><strong>Object Storage and Metadata: Why Ingestion is Fast</strong></h2>
<p>At the heart of most cloud data warehouses lies object storage (like S3 or GCS), not traditional block storage. This shift changes how "loading data" works.</p>
<p>In many cases, the data never moves:</p>
<ul>
<li><p>In Snowflake, external tables let you query Parquet or CSV files directly from S3, using schema-on-read.</p>
</li>
<li><p>Databricks can read Delta Lake tables directly from S3; these tables are essentially metadata pointers to Parquet files.</p>
</li>
<li><p>Firebolt supports external tables that create a logical view over files in your data lake — the actual bits stay put.</p>
</li>
</ul>
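<p>As a concrete sketch, a Snowflake external table definition maps columns onto files that stay in the data lake. The stage, path, and column names below are illustrative assumptions, not from the original post.</p>

```python
# Schema-on-read: a Snowflake external table is metadata over files that
# already sit in object storage. Stage, path, and column names are
# illustrative.
create_external = """
CREATE EXTERNAL TABLE events (
    event_type VARCHAR AS (VALUE:event_type::VARCHAR),
    amount     NUMBER  AS (VALUE:amount::NUMBER)
)
LOCATION = @my_s3_stage/events/
FILE_FORMAT = (TYPE = PARQUET)
"""
# No bytes are copied at CREATE time; queries read the Parquet files in place.
print(create_external.strip())
```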
<h2 id="heading-row-level-operations-are-a-last-resort"><strong>Row-Level Operations Are a Last Resort</strong></h2>
<p>Cloud data warehouses are fundamentally columnar and immutable at the storage layer. That means row-level changes often require rewriting entire files or partitions.</p>
<p>Examples:</p>
<ul>
<li><p>In Databricks Delta, a single row update may result in rewriting the entire data file that contains the row.</p>
</li>
<li><p>Snowflake manages data in immutable micro-partitions, and any <code>DELETE</code> or <code>UPDATE</code> invalidates the old micro-partition and creates a new one.</p>
</li>
<li><p>Firebolt doesn't support direct row-level updates at all, pushing users toward append + deduplicate or flip-flop tables.</p>
</li>
</ul>
<p>Design patterns like slowly changing dimensions (SCD) are harder to implement when every update is a batch overwrite.</p>
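<p>The cost asymmetry is easy to simulate. In this toy Python model of immutable columnar files (not any vendor's actual storage format), changing one row forces a copy-on-write rewrite of the whole file that contains it.</p>

```python
# Toy model of immutable columnar storage: each "file" is a frozen batch
# of rows. Updating one row rewrites the entire containing file.
files = {
    "part-000": ({"id": 1, "v": "a"}, {"id": 2, "v": "b"}),
    "part-001": ({"id": 3, "v": "c"},),
}

def update_row(files, row_id, new_v):
    """Copy-on-write update: returns new file map plus rows rewritten."""
    out, rewritten = {}, 0
    for name, rows in files.items():
        if any(r["id"] == row_id for r in rows):
            # The whole file is rewritten, even for a one-row change.
            out[name + ".v2"] = tuple(
                {**r, "v": new_v} if r["id"] == row_id else r for r in rows
            )
            rewritten += len(rows)
        else:
            out[name] = rows  # untouched files are reused as-is
    return out, rewritten

new_files, n = update_row(files, 2, "B")
print(n)  # 2 rows rewritten to change 1 row
```

<p>Real micro-partitions hold hundreds of thousands of rows, so the rewrite amplification for a single-row update is far worse than in this two-row toy.</p>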
<h2 id="heading-etl-and-cdc-are-awkward-fits"><strong>ETL and CDC Are Awkward Fits</strong></h2>
<p>Change Data Capture (CDC) is the backbone of many real-time ETL systems. But CDC is inherently row-based — which clashes with the batch-friendly nature of cloud data warehouses.</p>
<p>Challenges:</p>
<ul>
<li><p>Tools like Airbyte or Debezium emit row-level changes that cloud data warehouses struggle to ingest efficiently.</p>
</li>
<li><p>In Snowflake, ingesting from CDC streams like Kafka requires landing data in a stage or external table, then merging it in batch.</p>
</li>
<li><p>Databricks users often have to write custom logic to coalesce CDC events into upserts.</p>
</li>
<li><p>Firebolt doesn’t support out-of-the-box CDC pipelines and typically expects batch inserts or full reloads.</p>
</li>
</ul>
<p>The end result is that building reliable ETL pipelines requires extra engineering effort — often including custom buffering, data deduplication, and late-arrival handling.</p>
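<p>The coalescing step that users end up writing by hand can be sketched as a small fold over the event stream: keep only the net effect per key, then apply it as one batched upsert/delete. The event shape below loosely follows Debezium's <code>op</code> codes (c=create, u=update, d=delete); the field names are illustrative.</p>

```python
# Coalesce a row-level CDC stream into one net operation per key, so the
# warehouse sees a single batched merge instead of many row-level writes.
events = [
    {"op": "c", "key": 1, "after": {"id": 1, "name": "alice"}},
    {"op": "u", "key": 1, "after": {"id": 1, "name": "alicia"}},
    {"op": "c", "key": 2, "after": {"id": 2, "name": "bob"}},
    {"op": "d", "key": 2, "after": None},
]

def coalesce(events):
    """Fold ordered events into final upserts and deletes per key."""
    state = {}
    for ev in events:
        state[ev["key"]] = None if ev["op"] == "d" else ev["after"]
    upserts = [row for row in state.values() if row is not None]
    deletes = [k for k, row in state.items() if row is None]
    return upserts, deletes

upserts, deletes = coalesce(events)
print(upserts)  # [{'id': 1, 'name': 'alicia'}]
print(deletes)  # [2]
```

<p>Note how key 2 (created then deleted within the batch) nets out to a single delete — the warehouse never has to materialize the intermediate row.</p>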
]]></content:encoded></item><item><title><![CDATA[Why 60% of Data Migration Projects Fail]]></title><description><![CDATA[Almost every other engineer I talk to is working on a Data Migration project. And they are frustrated. Here’s an example.
https://www.reddit.com/r/dataengineering/comments/1axtzgp/data_migration_projects_the_good_the_bad_and_the/?rdt=37557
Data migra...]]></description><link>https://blog.wirekite.io/why-60-of-data-migration-projects-fail-a-technical-perspective</link><guid isPermaLink="true">https://blog.wirekite.io/why-60-of-data-migration-projects-fail-a-technical-perspective</guid><dc:creator><![CDATA[Wirekite]]></dc:creator><pubDate>Sat, 07 Jun 2025 07:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Almost every other engineer I talk to is working on a Data Migration project. And they are frustrated. Here’s an example.</p>
<p><a target="_blank" href="https://www.reddit.com/r/dataengineering/comments/1axtzgp/data_migration_projects_the_good_the_bad_and_the/?rdt=37557">https://www.reddit.com/r/dataengineering/comments/1axtzgp/data_migration_projects_the_good_the_bad_and_the/?rdt=37557</a></p>
<p>Data migration is a critical process for organizations aiming to modernize their systems, yet it often leads to unexpected challenges. According to <a target="_blank" href="https://www.gartner.com/smarterwithgartner/6-ways-cloud-migration-costs-go-off-the-rails">Gartner</a>, approximately 60% of data migration projects fail or exceed their budgets and schedules. While managerial oversights contribute to these failures, technical issues are frequently at the core. This article explores the primary technical pitfalls that can derail data migration efforts.</p>
<h2 id="heading-lack-of-comprehensive-data-mapping"><strong>Lack of Comprehensive Data Mapping</strong></h2>
<p>Over time, as organizations evolve, the original architects of data systems may depart, leaving behind undocumented or poorly understood data structures. This knowledge gap can result in inadequate data mapping, where teams are unsure about the relevance or usage of certain data fields. Without a clear understanding of the data landscape, migrations can lead to incomplete or incorrect data transfers, compromising the integrity of the new system.​</p>
<h2 id="heading-over-reliance-on-migration-tools-without-deep-understanding"><strong>Over-Reliance on Migration Tools Without Deep Understanding</strong></h2>
<p>While data migration tools can streamline the process, an over-dependence on them without a thorough grasp of their functionalities can be detrimental. Relying solely on vendor promises or feature lists without conducting extensive testing can result in overlooked nuances, leading to data mismatches or loss. It's essential to perform end-to-end testing with substantial data samples to ensure the tool's compatibility with specific organizational needs.​</p>
<h2 id="heading-insufficient-application-level-testing-post-migration"><strong>Insufficient Application-Level Testing Post-Migration</strong></h2>
<p>Migrating data is not just about transferring information; it's about ensuring that applications function seamlessly with the new data structures. Neglecting comprehensive application-level testing can lead to unexpected behaviors, from minor glitches to significant operational disruptions. Some features might underperform, while others could behave unpredictably due to subtle differences in data handling between systems.​</p>
<h2 id="heading-attempting-dual-system-operations-and-the-illusion-of-rollback"><strong>Attempting Dual-System Operations and the Illusion of Rollback</strong></h2>
<p>In an effort to minimize downtime, organizations might try to operate both old and new systems simultaneously or maintain the option to revert to the legacy system. This approach can introduce complexities like data drift, where inconsistencies arise between systems. Moreover, implementing reverse ETL processes to synchronize data can strain resources and complicate the migration, often leading to more issues than solutions.​</p>
<h2 id="heading-super-small-or-no-downtime-requirements"><strong>Super Small or No Downtime Requirements</strong></h2>
<p>Aiming for zero downtime during migrations is ambitious and, in many cases, unrealistic. Not allocating sufficient time for the migration process, including potential error corrections, can result in rushed implementations and overlooked issues. It's crucial to plan for adequate downtime, ensuring that there's a buffer to address unforeseen challenges without compromising the system's stability.​</p>
<h2 id="heading-excessive-check-summing-leading-to-resource-drain"><strong>Excessive Check-summing Leading to Resource Drain</strong></h2>
<p>While verifying data integrity is vital, overemphasis on check-summing every data point can consume significant computational resources and time. This exhaustive approach can delay the migration process and impact application performance. A balanced strategy that focuses on critical data validation, combined with efficient monitoring, can ensure data integrity without overburdening the system.​</p>
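<p>One balanced approach is to always compare cheap aggregates (such as row counts) on both sides, and checksum only a deterministic sample of rows rather than every byte. A minimal Python sketch; the 10% sampling rule and row layout are assumptions for illustration.</p>

```python
import hashlib

# Balanced validation: cheap aggregates for everything, a cryptographic
# checksum for a deterministic ~10% sample of rows.
def sample_checksum(rows, modulus=10):
    """Hash every row whose primary key falls in a deterministic sample."""
    h = hashlib.sha256()
    for row in sorted(rows, key=lambda r: r["id"]):
        if row["id"] % modulus == 0:  # deterministic sample by key
            h.update(repr(sorted(row.items())).encode())
    return h.hexdigest()

source = [{"id": i, "v": i * 2} for i in range(100)]
target = list(source)

# Same counts and same sampled checksum on both sides: accept the migration.
ok = len(source) == len(target) and sample_checksum(source) == sample_checksum(target)
print(ok)  # True
```

<p>Because the sample is keyed deterministically, the same rows are hashed on both sides without coordinating a random seed, and the work scales with the sample rate rather than the full table size.</p>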
]]></content:encoded></item><item><title><![CDATA[The Ambiguity Shield]]></title><description><![CDATA[In the world of modern business, ambiguity has become a subtle but powerful weapon — not to challenge uncertainty, but to shield against accountability. Too often, companies and managers invoke the "intangibles," the "nuances," or the "complexities" ...]]></description><link>https://blog.wirekite.io/the-ambiguity-shield</link><guid isPermaLink="true">https://blog.wirekite.io/the-ambiguity-shield</guid><dc:creator><![CDATA[Wirekite]]></dc:creator><pubDate>Wed, 04 Jun 2025 07:00:00 GMT</pubDate><content:encoded><![CDATA[<p>In the world of modern business, ambiguity has become a subtle but powerful weapon — not to challenge uncertainty, but to shield against accountability. Too often, companies and managers invoke the "intangibles," the "nuances," or the "complexities" of a problem not to solve it — but to avoid the burden of measurement and merit. It’s a clever trick: by wrapping decisions in subjectivity, one can avoid the uncomfortable task of proving or disproving success.</p>
<p>Yes, not everything can be measured. But that’s a truth frequently twisted to serve those who fear objectivity. As Peter Drucker famously said, "What gets measured gets managed." And its corollary should be equally feared: "What is not measured is conveniently forgotten."</p>
<h2 id="heading-hiding-behind-the-intangibles">Hiding Behind the Intangibles</h2>
<p>There’s a fine line between recognizing complexity and weaponizing it. When managers avoid key questions like:</p>
<ul>
<li><p><em>How fast is our system compared to others?</em></p>
</li>
<li><p><em>Is this product genuinely better, or just marketed better?</em></p>
</li>
<li><p><em>Did this feature move the metric it was meant to?</em></p>
</li>
</ul>
<p>…and instead respond with "It's hard to say…" or "There are too many variables," they're often not defending truth.</p>
<p>In technical environments especially, ambiguity is frequently a political smokescreen. A system may be slow, a feature may be broken, an engineer may be outperforming peers — and yet, the organization will cling to fuzzy KPIs or vague language instead of calling things as they are. The reason? Precision brings consequences.</p>
<h2 id="heading-almost-everything-can-be-measured">Almost Everything <em>Can</em> Be Measured</h2>
<p>The irony is that most things are more measurable than they appear — particularly in technical contexts. If a system is too slow, benchmarking tools exist. If two algorithms compete, A/B testing can arbitrate. If engineers want to assess productivity, peer code reviews, commit velocity, bug rates, and design complexity metrics all offer insight. No system is perfect, but they can be fair.</p>
<p>And even if benchmarks have an error margin — say, ±5% — that shouldn’t stop us from benchmarking altogether. We accept uncertainty in almost every scientific and engineering field. So why do we let it become a blocker in corporate performance reviews, architecture decisions, or vendor evaluations?</p>
<p>The answer is cultural: a bias toward ambiguity protects status, avoids conflict, and delays change.</p>
<h2 id="heading-ambiguity-as-strategy">Ambiguity as Strategy</h2>
<p>When merit is hard to measure, influence and perception take over. This is where ambiguity becomes more than an oversight — it becomes a strategy.</p>
<ul>
<li><p>The product no one wants, but which survives because no one agrees on how to define “usage.”</p>
</li>
<li><p>The team delivering poorly, but defended by “intangible contributions.”</p>
</li>
<li><p>The leader avoiding decisions because “it’s not measurable” — when in reality, they fear being measured.</p>
</li>
</ul>
<p>In these cases, ambiguity isn't just passivity. It's an active rejection of objectivity. It's how the underperforming are protected, the overperforming are demoralized, and progress is slowed.</p>
<h2 id="heading-towards-measured-progress">Towards Measured Progress</h2>
<p>Instead of chasing perfect measurements, we should chase honest ones. A flawed benchmark is still better than no benchmark — and often, it’s enough to guide decisions, track improvement, or reward merit. Organizations that lean into measurement (even imperfect measurement) create a culture of learning, transparency, and trust. Those that hide behind ambiguity foster complacency, opacity, and power games.</p>
<p>The choice isn’t between false certainty and endless complexity. It’s between doing the hard work of objectivity — or continuing the easy slide into subjective noise.</p>
<h2 id="heading-final-thought">Final Thought</h2>
<p>Ambiguity may sound like humility, but in practice, it often serves as a defense mechanism against accountability. The most honest question any organization can ask itself is this:</p>
<blockquote>
<p>“Are we unsure because things are truly uncertain — or because we’re afraid of what certainty might reveal?”</p>
</blockquote>
<p>The answer might just be measurable.</p>
]]></content:encoded></item><item><title><![CDATA[The 4 Hidden Disconnects]]></title><description><![CDATA[In fast-paced product-driven organizations, success often hinges not just on the brilliance of the product itself, but on how well the people around it collaborate, understand, and communicate with each other. Yet time and again, companies fall into ...]]></description><link>https://blog.wirekite.io/the-4-hidden-disconnects-that-impact-product</link><guid isPermaLink="true">https://blog.wirekite.io/the-4-hidden-disconnects-that-impact-product</guid><dc:creator><![CDATA[Wirekite]]></dc:creator><pubDate>Sun, 01 Jun 2025 07:00:00 GMT</pubDate><content:encoded><![CDATA[<p>In fast-paced product-driven organizations, success often hinges not just on the brilliance of the product itself, but on how well the people around it collaborate, understand, and communicate with each other. Yet time and again, companies fall into invisible traps—deep disconnects between key players that quietly undermine alignment, momentum, and strategic clarity.</p>
<p>While engineers are busy building, salespeople are selling, and executives are managing vision and investors, it's common for these groups to operate in parallel rather than in sync. At the heart of many dysfunctional product journeys lies a misalignment between the principal engineer—the one who knows the product inside-out—and everyone else. Here are four of the most critical disconnects companies often overlook:</p>
<h2 id="heading-ceo-vs-principal-engineer-the-vision-gap"><strong>CEO vs. Principal Engineer: The Vision Gap</strong></h2>
<p>In many organizations, the CEO is so focused on investor meetings, growth metrics, and hiring strategies that they lose touch with the core product. This results in a top executive who cannot clearly articulate what the product does, how it stands out from the competition, or what truly makes it valuable. While these other duties are important, not understanding your own product is a strategic risk. The principal engineer, who has built the product from scratch, often sees this as a failure of leadership—fueling frustration and widening the gap between engineering and the executive suite.</p>
<h2 id="heading-salesperson-vs-principal-engineer-the-feature-mirage">Salesperson vs. Principal Engineer: The Feature Mirage</h2>
<p>Sales teams frequently operate without a deep grasp of what the product can do, which leads to awkward scenarios where a salesperson sells the wrong features to the wrong people. More troubling is when these sales are made to other salespeople within the client organization, compounding the miscommunication. The principal engineer is left fielding support tickets, complaints, or feature requests that make no technical or business sense. This creates a culture of over-promising, under-delivering, and misaligned expectations between customers and engineers.</p>
<h2 id="heading-product-manager-vs-principal-engineer-the-feasibility-disconnect">Product Manager vs. Principal Engineer: The Feasibility Disconnect</h2>
<p>Product managers are expected to chart the roadmap, but without a realistic grasp of the underlying architecture, they may demand features that are either technically impossible or irrelevant to the customer. Sometimes, PMs chase feature ideas born from market trends or internal brainstorming without proper validation. Meanwhile, the principal engineer sees these asks as wasteful or absurd. Without a shared language or mutual understanding, the roadmap becomes a battleground rather than a collaborative effort.</p>
<h2 id="heading-principal-engineer-vs-everyone-the-founders-bias"><strong>Principal Engineer vs. Everyone: The Founder’s Bias</strong></h2>
<p>While the principal engineer holds the most intimate knowledge of the product, they can also become its greatest bottleneck. Having built the system from the ground up, they often develop a subconscious emotional bias—resisting any input that challenges their original design or philosophy. This rigidity makes them dismissive of feedback, skeptical of strategy, and immune to change. Ironically, the same person who once drove innovation may now be the one anchoring the product in the past.</p>
]]></content:encoded></item></channel></rss>