Data This Week #14

Welcome to Data This Week. This issue covers the shift from batch ETL to real-time CDC, a comprehensive architectural map for LLM engineers, a fascinating new diskless take on Kafka, a deep dive into Iceberg write strategies, and how Instacart keeps search fast across billions of store-level SKUs.

Here are the top reads, tools, and community discussions for senior data folks this week.

📚 Blogs to Read

Building a Streaming ELT Pipeline from MySQL to Kafka with Flink CDC

A practical look at transitioning from traditional batch ingestion to real-time change data capture. The article breaks down the mechanics of using Flink CDC to stream row-level MySQL changes directly into Kafka. For senior engineers, it’s a solid blueprint for decoupling operational databases from analytical sinks without the latency overhead of heavy batch ETL tools.

Why it matters: Batch ETL introduces hours of lag between operational events and analytical availability. Flink CDC collapses that window to seconds, and this walkthrough gives you a production-ready implementation path without the usual hand-waving around connector configuration and schema evolution. Read more →
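
To make "row-level changes" concrete, here is a minimal Python sketch (not taken from the article) of how a downstream consumer might replay change events into a keyed view. The event shape, an `op` code with `before` and `after` row images, is a simplified assumption loosely modeled on Debezium-style envelopes; the `id` primary key and the function names are hypothetical.

```python
def apply_change(state, event):
    """Apply one row-level change event to a dict keyed by primary key.

    Assumed op codes: "c" = insert, "u" = update, "r" = snapshot read,
    "d" = delete. This is an illustrative envelope, not a connector schema.
    """
    op = event["op"]
    if op in ("c", "u", "r"):
        row = event["after"]
        state[row["id"]] = row          # upsert the latest row image
    elif op == "d":
        state.pop(event["before"]["id"], None)  # drop the deleted key
    else:
        raise ValueError(f"unknown op code: {op!r}")
    return state


def replay(events):
    """Fold an ordered stream of change events into the current table state."""
    state = {}
    for event in events:
        apply_change(state, event)
    return state


EXAMPLE_EVENTS = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"},
     "after": {"id": 1, "status": "shipped"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]
```

Calling `replay(EXAMPLE_EVENTS)` leaves only order 1 in its latest state. The point of the sketch is that ordering per key matters: replaying the same events out of order would produce a different view.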

The Must-Know Topics for an LLM Engineer

If you are a data engineer expanding into the AI space, this piece serves as a comprehensive architectural map. It covers the end-to-end LLM engineering stack, from tokenization and attention mechanisms through optimizations such as FlashAttention to retrieval-augmented generation (RAG) architectures. It effectively bridges the gap between traditional data pipelines and the operational nuances of managing LLM context, fine-tuning, and hallucination mitigation.

Why it matters: Data engineers are increasingly being asked to own the infrastructure that feeds, fine-tunes, and serves large language models. This article gives you the conceptual vocabulary and architectural overview needed to contribute meaningfully to those systems without getting lost in the ML theory weeds. Read more →

Ursa: A New Diskless Lakestream Engine for Kafka

Stanislav Kozlovski introduces Ursa, a minimally invasive fork of Kafka designed for “diskless topics.” By flushing mixed-partition data directly to S3 and asynchronously compacting it into open table formats like Apache Iceberg or Delta Lake, Ursa dramatically cuts cross-AZ network costs. It’s a fascinating look at the ongoing convergence of event streaming and data lakehouse architectures to solve infrastructure cost bloat.

Why it matters: Cross-AZ replication costs are one of the most underappreciated line items in a Kafka deployment. Ursa’s approach of treating S3 as the durable commit log—while keeping the Kafka API surface intact—is an elegant architectural bet on the continued cost trajectory of object storage versus EBS volumes. Read more →

ClusteredWriter vs FanoutWriter in Apache Iceberg: What I Learned During My DE Journey

A deep dive into the physical write mechanics of Apache Iceberg. The author contrasts ClusteredWriter (which requires input presorted by partition so it can keep a low memory footprint) with FanoutWriter (which keeps multiple file handles open for incoming partitions, skipping the sort phase but risking out-of-memory errors). This is a crucial read for anyone tuning Iceberg write performance and memory management at scale.

Why it matters: Choosing the wrong writer strategy is one of the most common causes of OOM errors and small-file proliferation in Iceberg workloads. Knowing when ClusteredWriter’s low memory footprint justifies the compute cost of sorting, and when FanoutWriter’s memory risk is acceptable, directly affects both runtime cost and table health over time. Read more →
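
The trade-off can be illustrated with a toy Python model. This is a sketch under stated assumptions, not Iceberg's implementation: the class names, the in-memory lists standing in for data files, and the `max_open` counter are all hypothetical. It only shows why one strategy needs clustered input and the other needs memory proportional to the number of partitions seen.

```python
from collections import defaultdict


class ClusteredWriterSketch:
    """Toy model: one open file at a time, so input must arrive grouped by partition."""

    def __init__(self):
        self.current = None              # partition whose file is currently open
        self.closed = set()              # partitions whose file was already closed
        self.files = defaultdict(list)   # stand-in for written data files
        self.max_open = 0

    def write(self, partition, row):
        if partition != self.current:
            if partition in self.closed:
                # A partition reappearing means the input was not clustered.
                raise ValueError(f"input not clustered: {partition!r} seen again")
            if self.current is not None:
                self.closed.add(self.current)   # close the previous file
            self.current = partition
            self.max_open = 1                   # never more than one open file
        self.files[partition].append(row)


class FanoutWriterSketch:
    """Toy model: one open file per partition seen, so no sort is required."""

    def __init__(self):
        self.open_files = {}             # every partition stays open until the end
        self.max_open = 0

    def write(self, partition, row):
        self.open_files.setdefault(partition, []).append(row)
        self.max_open = max(self.max_open, len(self.open_files))


def write_all(writer, rows):
    """Feed (partition, row) pairs to a writer and return it for inspection."""
    for partition, row in rows:
        writer.write(partition, row)
    return writer


UNSORTED_ROWS = [("a", 1), ("b", 2), ("a", 3)]
```

With `UNSORTED_ROWS`, the fanout sketch finishes with two files open at once, while the clustered sketch rejects the third row unless the input is sorted first. In a real table with thousands of partitions, that open-file count is where the memory pressure comes from.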

How Instacart Built a Search for Billions of Products

Search at grocery scale isn’t just about keyword matching or vector embeddings; it’s heavily constrained by real-time, store-level inventory. This engineering breakdown details how Instacart evolved their search infrastructure to handle complex user intent, dynamic ranking, and hyper-local availability, highlighting the operational realities of building AI-enhanced search when your underlying “truth” constantly fluctuates.

Why it matters: Most search literature assumes a relatively stable corpus. Instacart’s constraint—where relevance is fundamentally tied to what’s physically on a shelf at a specific store right now—forces architectural decisions that apply broadly to any system where ground truth is high-velocity. This is a rare, honest look at what AI-enhanced search actually costs at production scale. Read more →

Data Landscape

An interactive, opinionated map tracking the relevant open standards across the data ecosystem. Rather than just listing every vendor, it’s an excellent visual resource for making sense of the ever-expanding modern data stack and identifying which open-source tools are truly gaining standard-bearer status in their respective categories.

Why it matters: In a space where new tools launch weekly, having a curated, visual layer on top of the noise is genuinely useful. The Data Landscape cuts through vendor marketing by focusing specifically on open standards—a much more durable lens for evaluating tooling decisions that have multi-year architectural consequences. Explore →

🛠️ Tools

Jikkou 1.0: Declarative Kafka Now with Iceberg

What it is: Jikkou brings the Kubernetes kubectl experience to your data infrastructure. It is an open-source “Resource as Code” framework that lets you manage Kafka topics, ACLs, quotas, and schemas via declarative YAML files. With the 1.0 release, it extends its GitOps approach to Apache Iceberg, allowing platform teams to version-control and automate their Kafka and Iceberg provisioning natively within CI/CD pipelines.

Why you should check it out: Ad-hoc topic creation and manual schema updates are a major source of configuration drift and production incidents. Jikkou enforces the same declarative, auditable, and repeatable infrastructure model that Kubernetes made standard for compute—now applied to your streaming and lakehouse layers. The Iceberg support in 1.0 makes this a compelling choice for any team trying to unify their data platform provisioning under a single GitOps workflow.
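
For readers new to the declarative model, the core idea is a diff between desired and actual state. The Python sketch below illustrates that reconciliation step in general terms; it is not Jikkou's API or resource schema, and the `plan_changes` function and the topic dictionaries are hypothetical stand-ins for what a tool would read from version-controlled files and from the live cluster.

```python
def plan_changes(desired, actual):
    """Compare desired and actual resource specs and return an ordered change plan.

    Both arguments map a resource name (for example, a topic) to its spec dict.
    The result is a list of (action, name, spec) tuples.
    """
    plan = []
    for name in sorted(desired.keys() | actual.keys()):
        want = desired.get(name)
        have = actual.get(name)
        if have is None:
            plan.append(("create", name, want))   # declared but not deployed
        elif want is None:
            plan.append(("delete", name, have))   # deployed but no longer declared
        elif want != have:
            plan.append(("update", name, want))   # drift between spec and cluster
    return plan


# Hypothetical inputs: what the repo declares vs. what the cluster reports.
DESIRED_TOPICS = {
    "orders": {"partitions": 6},
    "payments": {"partitions": 3},
}
ACTUAL_TOPICS = {
    "orders": {"partitions": 3},
    "legacy": {"partitions": 1},
}
```

Running `plan_changes(DESIRED_TOPICS, ACTUAL_TOPICS)` yields a delete for `legacy`, an update for `orders`, and a create for `payments`. Because the plan is derived from files in version control, it can be reviewed in a pull request before anything touches the cluster.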

💬 Community Sentiments

How Are You Centralizing Knowledge/Context from AI?

An emerging challenge is hitting teams using AI coding assistants: the fragmentation of context. Engineers on Reddit are noting that agents like Claude Code or Codex are generating brilliant architectural insights and debugging runbooks, but dumping them into siloed, local markdown files in individual repositories.

Key Takeaways from the thread:

The S3 + MCP approach: Several engineers are proposing dumping raw session files into an S3 data lake with an MCP (Model Context Protocol) server layered on top for retrieval—essentially treating the AI’s scratchpad as a first-class data asset with a queryable interface.

CI/CD-enforced promotion: Others advocate for strict CI/CD rules that promote high-signal agent scratchpads into a shared, indexed repository, ensuring that only vetted insights make it into the company-wide “brain” rather than creating a garbage-in, garbage-out knowledge dump.

The consensus: As AI agents become deeply integrated into data engineering workflows, we need scalable ways to feed localized insights back into a company-wide knowledge base. The tooling doesn’t exist yet in a clean, opinionated form—this is an open architectural problem actively looking for a standard. Read more →

That’s all for this week! See you in the next edition.