Welcome to this week’s data newsletter! We’ve curated the latest technical deep-dives, new infrastructure tools, and community discussions for data professionals, engineers, and architects looking to stay ahead of the curve.
Here is your weekly roundup.
📚 Blogs to Read
Why Data Engineers Should Care About Pydantic
If you are still using scattered if statements to validate untyped data or API responses, it’s time to upgrade. This piece highlights how Pydantic v2 serves as an explicit contract layer for Python data pipelines. By enforcing schema validation, type coercion, and environment configuration (via BaseSettings) at the very boundaries of your system, you can fail fast and avoid the expensive nightmare of debugging bad data deep inside a pipeline. Read more →
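The contract-layer idea is easiest to see in code. Here is a minimal sketch of boundary validation with Pydantic v2 (the field names and data are illustrative, not from the article): lax-mode coercion turns stringly-typed input into real types, and malformed records are rejected before they enter the pipeline.

```python
from datetime import date
from pydantic import BaseModel, ValidationError


class Order(BaseModel):
    """Explicit contract for records entering the pipeline."""
    order_id: int
    amount: float
    placed_on: date


# Raw input as it often arrives from an API: everything is a string.
raw = {"order_id": "42", "amount": "19.99", "placed_on": "2024-01-15"}

order = Order.model_validate(raw)  # coerces strings into typed fields
assert order.order_id == 42 and order.placed_on == date(2024, 1, 15)

# Bad data fails fast at the boundary instead of deep inside the pipeline.
try:
    Order.model_validate({"order_id": "oops", "amount": "19.99",
                          "placed_on": "2024-01-15"})
except ValidationError as exc:
    print(f"rejected at the boundary: {exc.error_count()} error(s)")
```

The same pattern extends to environment configuration via `pydantic-settings` (`BaseSettings`), which validates environment variables at startup with the same model machinery.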
Something’s Off in Databricks Vector Search… Here’s What I Found
A practical, hands-on evaluation of Databricks Vector Search acting as a similarity search engine. The author dives into a recent use case, exposing some of the hidden complexities and nuances where the implementation isn’t quite as straightforward as the documentation implies. A must-read if you’re planning to adopt it for your own RAG architecture. Read more →
Tansu: Stateless Kafka-compatible Operations
Imagine Kafka without the operational headache of managing persistent broker state, replication, or leader elections. Tansu strips the Kafka architecture down to its API protocol and relies entirely on external storage (like S3, Postgres, or SQLite) for durability. It’s an incredibly lean approach that can scale to zero, run on a 256MB machine, and natively enforce schemas before sinking directly to open table formats like Apache Iceberg and Delta Lake. Read more →
Building a GenAI Cost Supervisor Agent in Databricks
Capital One’s engineering team shares how they escaped the “dashboard trap” by building an AI agent that answers highly specific ad-hoc questions about GenAI spend. By turning Databricks System Tables into a knowledge base mapped to 20 Unity Catalog SQL functions, their GenAI Cost Supervisor can dynamically reason through FinOps questions—giving insights on token efficiency and operational bottlenecks that a static dashboard simply cannot provide. Read more →
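The core pattern — governed SQL functions exposed as tools an agent can route between — can be sketched in plain Python. Everything below is hypothetical: the function names, the returned numbers, and the keyword routing (a real agent would let the LLM choose the tool and each tool would execute a Unity Catalog SQL function over System Tables).

```python
# Hypothetical sketch of the "SQL functions as agent tools" pattern.
# Tool names and figures are invented for illustration.

def tokens_by_model() -> dict:
    # stand-in for a governed SQL function querying system billing tables
    return {"model-large": 1_200_000, "model-small": 350_000}

def cost_per_workspace() -> dict:
    # stand-in for a per-workspace spend aggregation function
    return {"analytics": 412.50, "ml-serving": 980.25}

TOOLS = {
    "token usage": tokens_by_model,
    "workspace cost": cost_per_workspace,
}

def answer(question: str) -> dict:
    # Keyword routing stands in for the LLM's tool-selection step.
    for phrase, tool in TOOLS.items():
        if phrase in question.lower():
            return tool()
    raise ValueError("no matching tool for question")

print(answer("What is our token usage this week?"))
```

The appeal of the pattern is that each tool is a governed, auditable SQL function rather than free-form generated SQL, so the agent's reach is bounded by what the catalog exposes.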
RAG Is a Data Engineering Problem — Here’s How to Build On It
A great reminder that Retrieval-Augmented Generation (RAG) is fundamentally an infrastructure and pipeline challenge, not just an ML trick. This step-by-step guide maps out the end-to-end orchestration required for a robust RAG system—from generating embeddings and managing vector databases to LangChain orchestration and setting up evaluation frameworks like RAGAS. Read more →
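The retrieval half of that pipeline reduces to a simple loop: embed documents, embed the query, rank by similarity. A toy sketch, using a bag-of-words stand-in for a real embedding model (production systems would use an actual model and a vector database, not an in-memory list):

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real pipelines call an embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


docs = [
    "kafka brokers replicate partitions",
    "pydantic validates python data",
    "iceberg stores table snapshots",
]
index = [(doc, embed(doc)) for doc in docs]  # stand-in for a vector DB


def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


print(retrieve("validate data in python"))
```

Swap the toy pieces for a real model, a vector store, and an evaluation harness and you have the skeleton the article fleshes out — which is exactly why it is a data engineering problem.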
🛠️ Tools
Tansu (tansu-io/tansu)
What it is: An open-source, stateless Apache Kafka API-compatible broker written in asynchronous Rust.
Why you should check it out: Tansu is rethinking the message broker by separating compute from storage. As a single, statically linked binary, it drops into your environment and allows you to use PostgreSQL, S3, or SQLite as the storage backend. Notably, it has native schema registry capabilities (Avro, JSON Schema, Protobuf) and can act as a direct pipeline, automatically writing validated topics straight into Apache Iceberg or Delta Lake tables.
💭 Community Sentiments
Testing in DE Feels Decades Behind Traditional SWE. What Does Your Team Actually Do?
A recent discussion over on r/dataengineering perfectly captured the culture shock software engineers experience when moving into data engineering.
The Unit Test Dilemma: In traditional SWE, unit tests are non-negotiable. In DE, writing unit tests often equates to duplicating pipeline logic just to see if the outputs match, which many argue is pointless.
Data Validation vs. Code Testing: The consensus from senior engineers is that you cannot adequately test data the way you test code. Instead of focusing purely on unit tests, the real value comes from data contracts, boundary validation layers (like dbt tests and Great Expectations), pipeline observability, and ensuring idempotency so failures can easily be replayed.
The Streaming Nightmare: Several engineers noted that testing batch processing is child’s play compared to streaming pipelines, where statefulness, restart behaviors, and exactly-once guarantees make replicating source behavior in a test environment highly complex.
Ultimately, the thread highlights that while DE testing is different, the field is shifting away from code coverage toward robust data observability and strict boundary enforcement. Read more →
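The idempotency point from that thread is worth making concrete: if writes are keyed by a primary key, replaying a failed batch converges to the same table state instead of duplicating rows. A minimal in-memory sketch (in practice this is a MERGE/upsert against a real table):

```python
def upsert(table: dict, batch: list[dict], key: str = "id") -> dict:
    # Merge rows by primary key: replaying the same batch is a no-op.
    for row in batch:
        table[row[key]] = row
    return table


table: dict = {}
batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]

upsert(table, batch)
state_after_first_run = dict(table)

upsert(table, batch)  # simulate a retried/replayed pipeline run
assert table == state_after_first_run  # same final state, no duplicates
print(len(table))
```

An append-only version of `upsert` would double the row count on replay — which is precisely why the thread's senior engineers rank idempotency above unit-test coverage for pipeline reliability.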
That’s all for this week! See you in the next edition.