
Data This Week #15

Welcome back to Data This Week!

Whether you’re dialing in your data lakehouse architecture or untangling Kafka offsets, this week brings a fantastic lineup of reading for the data engineering space. We’ve been deep in the trenches recently managing dialect conversions and query optimizations across Spark, Iceberg, and Trino, and seeing the ecosystem evolve so rapidly right now is both exciting and a bit dizzying.

Here are the top reads, tools, and community discussions for senior data folks this week.

📚 Blogs to Read

Building a Spark Declarative Pipeline: A Modern Financial Data Lakehouse

Databricks recently open-sourced Spark Declarative Pipelines (SDP), and this piece walks through building a financial lakehouse with SDP, Apache Iceberg, and AWS Glue. It’s a great look at shifting toward a configuration-driven Medallion architecture, moving away from imperative boilerplate like .writeStream and manual checkpoints.

Why it matters: The push toward declarative data pipelines mirrors what Kubernetes did for infrastructure—abstracting away operational complexity so engineers can focus on data contracts and business logic rather than pipeline plumbing. SDP on top of Iceberg gives you the durability of open table formats without sacrificing the control your financial compliance requirements demand. Read more →


10 AWS Glue & Apache Iceberg Errors I Hit and Exactly How I Fixed Them

If you’ve spent any time working with Iceberg and AWS Glue, you know the integration isn’t always seamless. This highly practical guide breaks down ten real-world errors you’ll inevitably hit and provides the exact fixes to get your pipelines unblocked.

Why it matters: The Iceberg-Glue integration surface is notoriously finicky—from catalog version mismatches and partition evolution pitfalls to IAM permission gaps that only surface at runtime. A guide that documents the exact error messages and deterministic fixes saves hours of spelunking through sparse documentation and GitHub issues. Read more →


MOR Isn’t a Storage Optimization. It’s an Architectural Shift.

Merge-on-Read (MOR) tables are often viewed as just a neat storage trick to lower write latency. This article argues that adopting MOR in your lakehouse is actually a fundamental architectural shift in how modern data platforms trade off high-volume streaming ingestion, continuous mutations, and read latency.

Why it matters: Treating MOR as a simple configuration toggle is how teams end up with compaction backlogs that silently degrade query performance. Understanding that MOR redefines the contract between your ingestion layer and your query engine—and that it demands a deliberate compaction and vacuuming strategy—is the difference between a well-architected lakehouse and one that runs fine in staging but degrades under real production load. Read more →
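To make the write/read tradeoff concrete, here is a minimal in-memory sketch of the merge-on-read idea. It is a toy model, not how Iceberg, Hudi, or Delta actually lay out files: the `MorTable` class and its method names are hypothetical, and real engines track deletes and updates in separate files rather than a Python list.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MorTable:
    """Toy merge-on-read table: an immutable base plus an append-only delta log."""

    base: dict = field(default_factory=dict)    # key -> row
    deltas: list = field(default_factory=list)  # (op, key, row) tuples

    def upsert(self, key: str, row: dict) -> None:
        # Writes are cheap: append to the log, never rewrite the base.
        self.deltas.append(("upsert", key, row))

    def delete(self, key: str) -> None:
        self.deltas.append(("delete", key, None))

    def read(self) -> dict:
        # Reads pay the cost: every pending delta is replayed over the base.
        merged = dict(self.base)
        for op, key, row in self.deltas:
            if op == "upsert":
                merged[key] = row
            else:
                merged.pop(key, None)
        return merged

    def compact(self) -> None:
        # Compaction folds the deltas into a new base so reads are cheap again.
        self.base = self.read()
        self.deltas = []


table = MorTable(base={"t1": {"amount": 100}})
table.upsert("t2", {"amount": 250})
table.delete("t1")
print(table.read())       # {'t2': {'amount': 250}}
print(len(table.deltas))  # 2
table.compact()
print(len(table.deltas))  # 0
```

The longer `deltas` grows between `compact()` calls, the more work each `read()` does, which is the compaction-backlog problem in miniature.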


Quack: The DuckDB Client-Server Protocol

DuckDB instances can now talk to each other via “Quack,” a new HTTP-based remote protocol natively supporting bulk transfers and fast small writes. It brings client-server, multi-writer capabilities to DuckDB, pulling it out of the strictly “in-process” niche into a scalable backend for concurrent environments.

Why it matters: DuckDB’s in-process model has always been its biggest strength and its most limiting constraint for team-shared workloads. Quack is a deliberate architectural step toward making DuckDB a viable backend for concurrent analytical workloads without requiring a heavy server deployment—worth tracking closely for teams using DuckDB as a local query layer over object storage. Read more →


SQL Patterns I Use to Catch Transaction Fraud

A highly practical look at the specific SQL patterns and techniques used to identify and flag fraudulent transactions in your data warehouse. A great refresher for anyone handling financial, retail, or e-commerce pipelines.

Why it matters: Fraud detection logic often lives in scattered application code or ad-hoc notebooks that are difficult to audit and maintain. Codifying these patterns as reusable SQL gives your data warehouse a single, versioned source of truth for fraud signals—one that integrates cleanly with dbt models, data quality checks, and downstream alerting pipelines. Read more →
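As an illustration of what a codified fraud signal can look like, here is a small sketch of a velocity check: flag any transaction that is the third or later on the same card within ten minutes. This is a commonly used pattern, not necessarily one the article covers. The `txns` schema, the thresholds, and the `flag_velocity` helper are all assumptions, and SQLite stands in for the warehouse so the example runs anywhere.

```python
import sqlite3

# Count each card's transactions in the 600 seconds up to and including
# the current one, then keep rows that reach the (assumed) threshold of 3.
VELOCITY_SQL = """
SELECT txn_id, card_id, txns_last_10m
FROM (
    SELECT t.txn_id,
           t.card_id,
           (SELECT COUNT(*)
              FROM txns AS p
             WHERE p.card_id = t.card_id
               AND p.ts BETWEEN t.ts - 600 AND t.ts) AS txns_last_10m
      FROM txns AS t
) AS scored
WHERE txns_last_10m >= 3
ORDER BY txn_id
"""


def flag_velocity(rows):
    """Load (txn_id, card_id, ts, amount) rows and return the flagged ones."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE txns (txn_id INTEGER, card_id TEXT, ts INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO txns VALUES (?, ?, ?, ?)", rows)
    flagged = conn.execute(VELOCITY_SQL).fetchall()
    conn.close()
    return flagged


# ts is seconds since an arbitrary start; the data is made up for illustration.
sample = [
    (1, "card_a", 0, 20.0),
    (2, "card_a", 120, 25.0),
    (3, "card_a", 300, 900.0),
    (4, "card_b", 100, 40.0),
    (5, "card_a", 5000, 15.0),
    (6, "card_b", 4000, 42.0),
]
print(flag_velocity(sample))  # [(3, 'card_a', 3)]
```

In a warehouse with window-frame support, the correlated subquery would typically become a windowed `COUNT(*)`; the subquery form is used here only because it runs on any SQLite build.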


Reading the Last Written Offset in Kafka: A Producer Checkpoint Pattern

A deep dive into an interesting Kafka pattern where the producer reads its own last written offset to handle robust checkpointing, ensuring reliability and preventing data duplication during pipeline restarts.

Why it matters: Exactly-once semantics in Kafka are notoriously hard to achieve across heterogeneous producers. This producer-side checkpoint pattern is a pragmatic alternative that sidesteps the complexity of idempotent producers and transactional APIs. It is particularly useful when your producers are third-party systems or are written in languages whose Kafka client libraries lack mature transactional support. Read more →
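One way to read the pattern, sketched below with an in-memory list standing in for a single Kafka partition: every record carries the position it was read from in the source, and on restart the producer looks at the last record it wrote to decide where to resume. This is an interpretation under stated assumptions, not the article's code. `InMemoryPartition`, `resume_position`, `run_producer`, and the `source_seq` field are hypothetical names, and the sketch assumes a single writer per partition.

```python
class InMemoryPartition:
    """Hypothetical stand-in for one Kafka topic partition (an append-only log)."""

    def __init__(self):
        self.records = []

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1  # offset of the record just written

    def last_record(self):
        return self.records[-1] if self.records else None


def resume_position(partition):
    """Recover the checkpoint from the log itself instead of a side store."""
    last = partition.last_record()
    return 0 if last is None else last["source_seq"] + 1


def run_producer(source, partition, crash_after=None):
    """Publish source items, optionally 'crashing' after N sends."""
    start = resume_position(partition)
    for sent, seq in enumerate(range(start, len(source))):
        if crash_after is not None and sent == crash_after:
            return  # simulated crash: nothing further is written
        partition.append({"source_seq": seq, "payload": source[seq]})


source = ["a", "b", "c", "d", "e"]
partition = InMemoryPartition()
run_producer(source, partition, crash_after=2)  # writes a, b, then crashes
run_producer(source, partition)                 # restarts and resumes at c
print([r["payload"] for r in partition.records])  # ['a', 'b', 'c', 'd', 'e']
```

Against a real cluster, `last_record()` would mean finding the partition's end offset and fetching the record just before it. The guarantee also depends on earlier writes having been acknowledged before the restart.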


🛠️ Tools

pbi_scanner

What it is: A new DuckDB extension that allows you to query Power BI Semantic Models directly using DAX. It supports multiple authentication paths (such as Azure CLI and service principals) and local metadata caching.

Why you should check it out: If you are bridging the gap between enterprise BI and local DuckDB instances, this is worth pulling down and testing. Being able to query a Power BI Semantic Model from a DuckDB session, rather than context-switching into the Power BI service, dramatically lowers the friction for data engineers who need to audit DAX measure logic, validate data freshness, or cross-reference model outputs against upstream source tables without leaving their local environment.

GitHub → LinkedIn Announcement →


💬 Community Sentiments

The Exhaustion of Explaining Why We Can’t Use LLMs for Data Validation

A trending discussion on Reddit perfectly captured a shared industry frustration: non-technical leadership pushing to use LLMs for schema evolution and data quality checks. While leveraging AI tools to accelerate local development workflows has genuine merit, the core of our data infrastructure relies on strict determinism. Running critical ETL pipelines or financial row-count validation on probabilistic models that hallucinate is a recipe for disaster.

Key Takeaways from the thread:

The core tension: LLMs excel at fuzzy, contextual reasoning—exactly the opposite of what schema validation and row-count reconciliation require. The senior engineers in this thread aren’t anti-AI; they’re pushing back on applying the wrong tool to a problem where correctness is binary.

The organizational pattern: The pressure is almost always top-down, driven by leadership eager to justify AI investment rather than bottom-up from engineers who’ve evaluated the tradeoff. Recognizing this dynamic is the first step to having a productive conversation rather than a defensive one.

The consensus: It’s reassuring to see the senior engineering community collectively push back on forcing generative AI into places that demand strict logic. The distinction isn’t “AI vs. no AI”; it’s about deploying AI where probabilistic output is acceptable and keeping deterministic systems deterministic where correctness guarantees are non-negotiable. Read more →


That’s all for this week! See you in the next edition.