Welcome to the fourth edition of Data This Week! Here’s a curated collection of the most interesting reads, tools, and community sentiments from the data engineering world.
📚 Blogs to Read
Apache Iceberg on Quanton: 3x Faster Apache Spark Workloads
Onehouse recently detailed their new Quanton engine, which aims to accelerate ETL workloads on Apache Iceberg by up to 3x without requiring you to change your existing Spark jobs or SQL. Rather than a bolt-on framework, the performance gains come from rewrites at the low-level execution layer.
Key Takeaways: Quanton utilizes SIMD vectorized execution to process columnar data in batches, entirely bypassing the bottlenecks of scalar loops. It also customizes the I/O path for high-throughput data scanning (parallel pre-fetching and decoding), making it a massive win for heavy read/write benchmarks on the Lakehouse. Read more →
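To make the batching idea concrete, here is a pure-Python sketch (not Quanton's code) of the difference between row-at-a-time scalar processing and batched columnar processing. Real SIMD gains come from native vector instructions operating on contiguous columns; this only illustrates the access pattern that makes those instructions applicable.

```python
# Illustrative only: contrast row-at-a-time processing with columnar
# batch processing. All names and data here are made up.

rows = [{"price": p, "qty": q} for p, q in [(9.5, 3), (4.0, 10), (7.25, 2)]]

def revenue_scalar(rows):
    # Row-at-a-time: one multiply per loop iteration, poor data locality,
    # and no opportunity for the CPU to vectorize across values.
    total = 0.0
    for row in rows:
        total += row["price"] * row["qty"]
    return total

def revenue_columnar(prices, qtys):
    # Columnar batch: the engine sees two contiguous arrays, which is the
    # layout SIMD instructions need to process many values per cycle.
    return sum(p * q for p, q in zip(prices, qtys))

# Pivot the row-oriented data into columns, as a columnar reader would.
prices = [r["price"] for r in rows]
qtys = [r["qty"] for r in rows]
```

Both paths compute the same answer; the win at the native layer comes entirely from the second layout.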
Engineering VP Josh Clemm on how we use Knowledge Graphs, MCP, and DSPy in Dash
Building RAG in a notebook is easy; building it at enterprise scale is a different beast. Dropbox’s VP of Engineering opens up about the architecture behind Dash, their universal search tool.
Key Takeaways: Dropbox moved away from passing massive context windows directly to the LLM. Instead, they model documents and relationships into a Knowledge Graph to isolate the exact information needed. To combat ballooning token costs from MCP (Model Context Protocol) tool calling, they highly recommend building “super tools” rather than 10 different retrieval tools, and relying heavily on DSPy for at-scale prompt optimization. Read more →
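The "super tool" recommendation is easiest to see in code. Below is a hypothetical sketch of the pattern (all names are illustrative, not Dropbox's actual API): instead of exposing many single-purpose retrieval tools, each of whose schema is re-sent to the model on every MCP exchange, you expose one parameterized tool and let the model select sources via an argument.

```python
# Hypothetical "super tool" pattern: one parameterized retrieval entry
# point instead of ten separately described tools. Names are invented
# for illustration.

def search_docs(query): return [f"doc hit for {query}"]
def search_calendar(query): return [f"event hit for {query}"]
def search_messages(query): return [f"message hit for {query}"]

SOURCES = {
    "docs": search_docs,
    "calendar": search_calendar,
    "messages": search_messages,
}

def retrieve(query: str, sources: list[str]) -> dict[str, list[str]]:
    """The single 'super tool': the model picks back ends via an argument,
    so only one tool schema ever occupies the context window."""
    return {s: SOURCES[s](query) for s in sources if s in SOURCES}

# One tool call fans out to several back ends:
result = retrieve("Q3 planning", ["docs", "calendar"])
```

The token saving is structural: the model carries one tool description instead of ten, regardless of how many back ends exist behind it.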
Ten years late to the dbt party (DuckDB edition)
Veteran data engineer Robin Moffatt shares his “lightbulb moment” with dbt after ignoring it for nearly a decade. Using DuckDB to extract UK Environment Agency data, he explores how dbt perfectly handles the “T” in ELT.
Key Takeaways: For senior folks who grew up on legacy ETL tools, this is a great read on modern separation of concerns. Moffatt highlights how dbt’s modularity, version control, and built-in freshness checks beautifully solve the “Day 2” operational nightmares that custom pipelines inevitably face. Read more →
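The layering Moffatt describes can be sketched in a few lines. The example below uses the stdlib sqlite3 module in place of DuckDB purely so it is self-contained; the table and values are invented stand-ins for the UK Environment Agency data. In dbt, each CREATE VIEW would be its own version-controlled model file, wired together with ref().

```python
import sqlite3

# A minimal sketch of dbt-style layered "T" in ELT: raw -> staging -> mart.
# sqlite3 stands in for DuckDB here; the data is made up.

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_readings (station TEXT, level_m REAL);
    INSERT INTO raw_readings VALUES
        ('thames_kingston', 1.20),
        ('thames_kingston', 1.35),
        ('avon_bath', 0.80);

    -- 'staging' model: clean and standardize the raw load
    CREATE VIEW stg_readings AS
        SELECT station, level_m FROM raw_readings WHERE level_m IS NOT NULL;

    -- 'mart' model: a business-level aggregate built on the staging layer
    CREATE VIEW avg_level_by_station AS
        SELECT station, AVG(level_m) AS avg_level_m
        FROM stg_readings GROUP BY station;
""")

rows = conn.execute(
    "SELECT station, avg_level_m FROM avg_level_by_station ORDER BY station"
).fetchall()
```

The payoff is the separation: fixing a cleaning rule in the staging view propagates to every downstream model, which is exactly the modularity custom pipelines struggle to retrofit.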
Scaling PostgreSQL to power 800M ChatGPT users
OpenAI shares a fascinating deep dive into how they scaled their Postgres backend to handle the unprecedented traffic of ChatGPT. Relational databases still rule, but at this scale, multiversion concurrency control (MVCC) overhead and connection storms become your biggest enemies.
Key Takeaways: To survive extreme read/write spikes, OpenAI relies on strict workload isolation (moving “noisy neighbor” analytical queries to dedicated instances), heavy usage of PgBouncer for connection pooling, and a custom cache-locking mechanism to prevent database-crushing read surges during cache-miss storms. Read more →
Local AWS Data Lakehouse
Cloud computing costs for sandbox testing can easily spiral out of control. This engineering blog walks through setting up a complete, mock AWS Lakehouse environment locally.
Key Takeaways: A great resource for Data Engineers looking to improve their CI/CD loops and local testing. Being able to mock S3, Glue, and Athena locally means faster iterations and zero risk of racking up a hefty query bill while debugging a pipeline.
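One prerequisite for this setup is making your pipeline code endpoint-configurable. Here is a minimal stdlib sketch of that pattern (the variable name and helper are illustrative, not from the linked post): resolve the S3 endpoint from the environment, so the same code talks to real AWS in production and to a local emulator in tests (LocalStack, for example, listens on http://localhost:4566 by default).

```python
import os

# Illustrative pattern: make the S3 endpoint an environment override.
# Unset -> clients talk to real AWS; set -> everything stays local.
# Helper and env var name are invented for this sketch.

def client_kwargs() -> dict:
    """Extra kwargs for an S3 client, e.g. boto3.client('s3', **client_kwargs())."""
    endpoint = os.environ.get("S3_ENDPOINT_URL")
    return {"endpoint_url": endpoint} if endpoint else {}

os.environ["S3_ENDPOINT_URL"] = "http://localhost:4566"  # local sandbox run
local = client_kwargs()

del os.environ["S3_ENDPOINT_URL"]                        # production-like run
prod = client_kwargs()
```

Because the override lives in the environment rather than the code, CI can flip an entire pipeline onto the mock stack without touching a single job definition.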
🛠️ Tools
Metabase AI Data Generator
Generating realistic mock data for pipeline testing, unit tests, or BI dashboards has historically been a tedious chore. Metabase just released an open-source, conversational prompt builder that generates custom datasets. You just define the schema and business context via natural language, and the local DataFactory engine spins up the rows. You can export the output directly to CSV or SQL inserts. Read more →
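Stripped of the conversational layer, the core idea is schema in, rows and CSV out. The stdlib sketch below illustrates that shape with a hand-written schema (this is not Metabase's DataFactory, and the column names are invented):

```python
import csv, io, random

# Not Metabase's DataFactory: a stdlib sketch of the schema -> rows -> CSV
# pipeline that such a generator automates. Schema below is made up.

random.seed(7)  # deterministic output for repeatable tests

SCHEMA = {
    "order_id": lambda i: i + 1,
    "region": lambda i: random.choice(["EMEA", "AMER", "APAC"]),
    "amount_usd": lambda i: round(random.uniform(5, 500), 2),
}

def generate(n: int) -> list[dict]:
    # Each column has a generator; a row is one draw from every column.
    return [{col: gen(i) for col, gen in SCHEMA.items()} for i in range(n)]

def to_csv(rows: list[dict]) -> str:
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(SCHEMA))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

rows = generate(3)
csv_text = to_csv(rows)
```

The appeal of the Metabase tool is replacing the hand-written SCHEMA dict with a natural-language description of the business context.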
Alibaba ZVec
The embedded database renaissance continues. ZVec is Alibaba’s newly open-sourced, in-process vector database built on their battle-tested Proxima engine. Think of it as SQLite for vector search. It brings millisecond search latency for billions of vectors with zero server overhead. If you are building edge AI apps, local RAG pipelines, or simply want to avoid the infrastructure burden of a standalone vector database instance, ZVec is worth testing. Read more →
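To unpack what "embedded" means here, the sketch below is a tiny in-process, brute-force nearest-neighbor store (emphatically not ZVec's API; see the linked post for that): the whole index lives inside your process, with no server to deploy or operate. Engines like ZVec reach billion-vector scale by swapping the linear scan for approximate nearest-neighbor indexes.

```python
import math

# Illustrative in-process vector search, not ZVec's API. Brute-force
# cosine similarity over an in-memory list; real embedded engines use
# ANN indexes instead of this linear scan.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class TinyVectorStore:
    def __init__(self):
        self._items: list[tuple[str, list[float]]] = []

    def add(self, key: str, vec: list[float]) -> None:
        self._items.append((key, vec))

    def search(self, query: list[float], k: int = 1) -> list[str]:
        # Rank every stored vector by similarity to the query.
        ranked = sorted(self._items, key=lambda kv: cosine(query, kv[1]),
                        reverse=True)
        return [key for key, _ in ranked[:k]]

store = TinyVectorStore()          # no server, no connection string
store.add("cat", [1.0, 0.1, 0.0])
store.add("car", [0.0, 1.0, 0.9])
best = store.search([0.9, 0.2, 0.0], k=1)
```

The "SQLite for vector search" analogy is exactly this deployment model: a library you link, not a service you run.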
💭 Community Sentiments
Red flag! Red flag? White flag!
A fascinating discussion erupted on r/dataengineering this week regarding the current state of technical interviews. A hiring manager realized a candidate was using an LLM (like Claude/ChatGPT) to generate almost verbatim answers to the technical questions during the assessment.
The twist? The hiring manager also used AI to write the interview questions and expected answers.
The Community Takeaway: The industry is at a strange crossroads. Many senior engineers pointed out the hypocrisy of failing a candidate for using the exact same tools the engineering team uses daily to be productive. The consensus is that traditional Q&A trivia and take-home tests are officially dead. Interviews must shift toward conversational architecture design, real-time troubleshooting, and evaluating how candidates think rather than what they can recite. Read more →
That’s all for this week! See you in the next edition.