Welcome to this week’s edition of Data This Week! We have an excellent lineup of deep dives into database internals, modern semantic layers, and community discussions around schema evolution.
📚 Blogs to Read
Unified Lakehouse with Apache Doris & Paimon: Xiaomi Achieves 6x Faster Performance
The Problem: Xiaomi’s previous OLAP stack was fragmented across four query engines (Presto, Druid, Doris, Spark) and four storage formats (Iceberg, Paimon, Doris, Druid). This created massive data redundancy, inconsistent governance, and high operational overhead.
The Solution: They consolidated down to just two compute engines (Doris for real-time interactive analytics, Spark for offline batch) and a single unified storage layer (Apache Paimon). Doris brings a distributed vectorized execution framework and millisecond-level query latency, while Paimon provides an open table format with petabyte-scale storage, native transactions, and schema evolution.
Why it matters: This case study is a textbook example of simplifying a sprawling data platform. By collapsing four engines and four formats into a clean two-layer architecture, Xiaomi cut aggregation query times from 40s to 8s (5x faster) and achieved a 6x reduction in average query latency. For teams drowning in multi-engine complexity, this piece demonstrates that a well-integrated lakehouse can deliver both real-time and batch analytics without the governance nightmare of maintaining separate data silos. Read more →
Optimizing Top-K in Postgres
Native Postgres full-text search (ts_rank) chokes at massive scale because it computes a score for every matching document before ranking them. To solve this, ParadeDB embeds dedicated search engine mechanics directly into Postgres. By integrating high-speed Rust search libraries and industry-standard relevance algorithms (like BM25), they bypass Postgres’s native scoring bottleneck. The real magic is Block-Max WAND (BMW) pruning—an execution method that intelligently skips scoring rows that mathematically cannot make the top results.
Why it matters: If you are building RAG or heavy search systems over 100M+ documents, this breakdown shows how pushing Top-K filters down the execution plan keeps latency flat, saving you from deploying a separate search database. Read more →
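To make the pruning intuition concrete, here is a toy Python sketch of WAND-style Top-K (a simplification of Block-Max WAND, and nothing like ParadeDB's actual Rust implementation): each term carries a precomputed upper-bound score, and any document whose upper bound cannot beat the current k-th best score is skipped without ever being fully scored.

```python
import heapq

def top_k_with_pruning(postings, query_terms, k):
    """Toy WAND-style Top-K: skip documents whose score upper bound
    cannot beat the current k-th best score.

    postings: {term: {doc_id: score}} -- precomputed per-term scores.
    Returns the top-k (score, doc_id) pairs and how many documents
    were actually scored.
    """
    # Upper bound: the best score any document achieves for each term.
    max_score = {t: max(postings[t].values(), default=0.0) for t in query_terms}

    candidates = set()
    for t in query_terms:
        candidates.update(postings[t].keys())

    heap = []    # min-heap holding the current top k (score, doc_id)
    scored = 0
    for doc in candidates:
        threshold = heap[0][0] if len(heap) == k else float("-inf")
        # Prune: even a best-case score on every matching term can't win.
        upper = sum(max_score[t] for t in query_terms if doc in postings[t])
        if upper <= threshold:
            continue
        score = sum(postings[t].get(doc, 0.0) for t in query_terms)
        scored += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
    return sorted(heap, reverse=True), scored
```

The real BMW variant keeps these upper bounds per block of the posting list rather than per term, letting it skip whole index blocks at a time, which is where the flat-latency behavior at 100M+ documents comes from.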
Monitoring dbt Runs: Logging Execution
Relying solely on Slack alerts for failed dbt jobs leaves you blind to silent performance degradation. This guide demonstrates how to parse and ingest dbt’s native artifacts (specifically run_results.json) back into your data warehouse as a robust, queryable metadata layer. By treating execution logs as a first-class data asset, you transition from reactive debugging to programmatic observability.
Why it matters: For platform teams managing complex DAGs, this approach unlocks the ability to track historical model execution times, pinpoint warehouse compute bottlenecks, and catch creeping SLA violations before they become actual incidents. Read more →
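The core move is simpler than it sounds: run_results.json is just JSON, so flattening it into warehouse rows is a few lines. A minimal sketch, assuming the standard artifact fields (unique_id, status, execution_time) — check your dbt version's artifact schema before relying on exact field names:

```python
import json

def parse_run_results(raw: str):
    """Flatten a dbt run_results.json artifact into warehouse-ready
    rows, one per executed model."""
    artifact = json.loads(raw)
    generated_at = artifact["metadata"]["generated_at"]
    rows = []
    for result in artifact["results"]:
        rows.append({
            "model": result["unique_id"],
            "status": result["status"],
            "execution_seconds": result["execution_time"],
            "generated_at": generated_at,
        })
    return rows
```

Append each run's rows to a metadata table and historical execution time per model becomes an ordinary SQL query.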
The Internals of PostgreSQL
When you hit a wall debugging complex lock contention or tuning aggressive vacuuming processes, surface-level documentation rarely cuts it. Authored by Hironobu Suzuki, this exhaustive, free online book is arguably the definitive architectural guide to Postgres short of reading the actual C source code. It meticulously breaks down the exact mechanics of Multi-Version Concurrency Control (MVCC), Write-Ahead Logging (WAL), Heap Only Tuples (HOT) updates, and query processing pipelines.
Why it matters: For senior engineers and DBAs, this isn’t a tutorial—it’s a mandatory reference manual to bookmark for understanding exactly how Postgres manages memory, buffers, and concurrency under the hood. Read more →
🛠️ Tools
DataJunction (Netflix’s Semantic Layer)
Graph-Based Metadata: Stores metrics and upstream abstractions as interconnected nodes, abstracting away the underlying physical tables.
API-First Approach: Delivers metric definitions as SQL or directly as data, ensuring consistency across BI dashboards, experimentation platforms, and ad-hoc analysis.
Why it matters: Recently highlighted by the Netflix Tech Blog, DataJunction (DJ) is an open-source metrics platform designed to solve the fragmented “semantic layer” problem. Instead of redefining “Total Streaming Hours” in Tableau, dbt, and custom Python scripts, DJ acts as a central repository. It infers dependency graphs to generate accurate join paths dynamically. For data teams struggling with metric discrepancies across the business, DataJunction offers a robust, highly scalable abstraction layer. Read more → | Website →
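To see why graph-based metadata helps, here is a toy sketch of dependency resolution over a metric graph (the node names and structure are illustrative, not DataJunction's actual model): given a requested metric, walk its upstreams and return a build order, which is the skeleton of how a semantic layer can derive join paths instead of having them hand-written per dashboard.

```python
from graphlib import TopologicalSorter

# Toy metric graph: each node lists its direct upstream dependencies.
GRAPH = {
    "hours_per_subscriber": ["total_streaming_hours", "active_subscribers"],
    "total_streaming_hours": ["playback_events"],
    "active_subscribers": ["subscriptions"],
    "playback_events": [],
    "subscriptions": [],
}

def build_order(metric: str) -> list[str]:
    """Return every node needed to compute `metric`, upstreams first."""
    needed, stack = set(), [metric]
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(GRAPH[node])
    # Restrict the graph to the needed subset, then topologically sort.
    sub = {n: [d for d in GRAPH[n] if d in needed] for n in needed}
    return list(TopologicalSorter(sub).static_order())
```

Because every consumer asks the same graph, "hours_per_subscriber" resolves to the same upstream tables and join path whether the caller is a BI tool or a notebook.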
💭 Community Sentiments
Do any ETL tools handle automatic schema change detection?
The Consensus: While modern ELT tools (like Fivetran and Airbyte) handle additive schema evolution (e.g., new columns) reasonably well, destructive changes (drops, type casting) will almost always break downstream models.
The Shift to Contracts: Senior engineers in the thread largely agree that trying to solve this at the ETL tool layer is a losing battle. The conversation heavily pivots toward implementing strict Data Contracts at the ingestion layer, adding row-level anomaly checks, and quarantining broken payloads before they hit production modeled tables. Read more →
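In spirit, a minimal ingestion-layer contract check looks like this (a hedged sketch with made-up field names, not any particular tool's API): validate each payload against the expected schema and route violations to a quarantine table instead of letting them break downstream models.

```python
# Expected schema for an incoming payload; field names are illustrative.
CONTRACT = {"user_id": int, "event": str, "amount": float}

def enforce_contract(rows):
    """Split rows into (accepted, quarantined) based on the contract:
    exact column set and matching Python types."""
    accepted, quarantined = [], []
    for row in rows:
        ok = (
            set(row) == set(CONTRACT)
            and all(isinstance(row[k], t) for k, t in CONTRACT.items())
        )
        (accepted if ok else quarantined).append(row)
    return accepted, quarantined
```

Real systems layer row-level anomaly checks on top, but the key design choice is the same: broken payloads are diverted before they reach production modeled tables, not discovered after.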
That’s all for this week! See you in the next edition.