
Data This Week #5

Welcome to the fifth edition of Data This Week! Here’s a curated collection of the most interesting reads, tools, and community sentiments from the data engineering world.

📚 Blogs to Read

Spark Is Not Just Lazy. Spark Compiles Dataflow.

Re-evaluating the common oversimplification that “Spark is lazy.” This post dives deep into how Apache Spark doesn’t just defer execution, but actively compiles a directed acyclic graph (DAG) of the dataflow.

Why it matters: It highlights the internal optimization benefits of DAG execution over eager evaluation, explaining why labeling Spark simply as “lazy” undersells the sophisticated Cost-Based Optimizer (CBO) and physical planning happening under the hood. Read more →
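The post's point can be sketched without Spark itself. The toy below (plain Python, not Spark's actual API) shows the mechanism being described: transformations only record a logical plan, and an action first "compiles" that plan (here, fusing adjacent filters into a single predicate) before executing it, which is only possible because the engine sees the whole dataflow up front.

```python
# Toy illustration (not Spark's API): transformations build a plan;
# only an action compiles and runs it, so the engine sees the whole
# dataflow and can optimize it, e.g. fuse adjacent filters.

class LazyFrame:
    def __init__(self, data, plan=None):
        self.data = data
        self.plan = plan or []          # logical plan: list of (op, fn)

    def filter(self, fn):
        return LazyFrame(self.data, self.plan + [("filter", fn)])

    def map(self, fn):
        return LazyFrame(self.data, self.plan + [("map", fn)])

    def _compile(self):
        # "Physical planning": merge consecutive filters into one predicate.
        compiled = []
        for op, fn in self.plan:
            if op == "filter" and compiled and compiled[-1][0] == "filter":
                prev = compiled[-1][1]
                compiled[-1] = ("filter", lambda x, p=prev, f=fn: p(x) and f(x))
            else:
                compiled.append((op, fn))
        return compiled

    def collect(self):                  # action: compile, then execute once
        rows = self.data
        for op, fn in self._compile():
            rows = [fn(x) for x in rows] if op == "map" else [x for x in rows if fn(x)]
        return rows

df = LazyFrame(range(10)).filter(lambda x: x % 2 == 0).filter(lambda x: x > 2).map(lambda x: x * 10)
print(df.collect())  # [40, 60, 80] -- two filters ran as one fused pass
```

Spark's Catalyst does far more than this (predicate pushdown, cost-based join reordering, codegen), but the shape is the same: plan first, optimize globally, execute last.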


Jack of all trades: query federation in modern OLAP databases

Fresha’s data engineering team explores query federation using StarRocks, tackling the challenge of querying across heterogeneous sources without building brittle, complex ETL pipelines.

Why it matters: The architecture utilizes StarRocks’ vectorized execution engine and deep integration with Apache Iceberg. It showcases how federation reduces data duplication, simplifies lakehouse architecture, and leverages caching and predicate pushdowns to keep query latency low. Read more →
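StarRocks SQL isn't reproduced here, but the core idea of federation (one engine, one query, several independent sources) has a minimal stand-in in Python: SQLite's `ATTACH DATABASE` lets a single statement join tables living in separate databases, with the engine handling filtering itself rather than an ETL job copying data first. Table and column names below are made up for illustration.

```python
import sqlite3

# Minimal federation analogy (not StarRocks): one query engine reading
# two separate databases in a single SQL statement.
sales = sqlite3.connect(":memory:")
sales.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL)")
sales.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                  [(1, 10, 99.0), (2, 11, 45.5)])

# ATTACH a second (here also in-memory) database under the name 'crm'.
sales.execute("ATTACH DATABASE ':memory:' AS crm")
sales.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
sales.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                  [(10, "Ada"), (11, "Grace")])

# One query spans both sources; no pipeline copied data between them.
rows = sales.execute("""
    SELECT c.name, o.amount
    FROM orders o JOIN crm.customers c ON o.customer_id = c.id
    WHERE o.amount > 50
""").fetchall()
print(rows)  # [('Ada', 99.0)]
```

In a real federated setup the "attached" sources are remote systems (Iceberg tables, Postgres, etc.), which is where predicate pushdown and caching become essential to keep latency low.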


Next Generation DB Ingestion at Pinterest

Pinterest outlines their migration from a legacy, batch-oriented database dump architecture to a unified, low-latency Change Data Capture (CDC) framework processing petabytes of data.

Why it matters: They replaced 24-hour batch jobs with a Debezium/TiCDC + Kafka + Flink + Spark + Iceberg stack, cutting latency to just minutes. The post also details a clever “Bucket Join” workaround in Spark to bypass full table shuffles during massive base-table upserts. Read more →
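The post's Spark internals aren't reproduced here, but the bucket-join trick can be sketched in plain Python: hash-partition both sides into the same buckets by join key, then join each bucket pair independently. No row ever needs to cross buckets, which is what lets Spark skip the full-table shuffle when base table and delta share the same bucketing.

```python
# Sketch of the bucket-join idea (pure Python, not Pinterest's code):
# hash both sides into the same N buckets by join key, then hash-join
# each bucket pair locally -- no global shuffle required.

N_BUCKETS = 4

def bucketize(rows, key, n=N_BUCKETS):
    buckets = [[] for _ in range(n)]
    for row in rows:
        buckets[hash(row[key]) % n].append(row)
    return buckets

def bucket_join(left, right, key):
    joined = []
    for lb, rb in zip(bucketize(left, key), bucketize(right, key)):
        index = {}
        for r in rb:
            index.setdefault(r[key], []).append(r)
        for l in lb:                      # hash join within each bucket
            for r in index.get(l[key], []):
                joined.append({**l, **r})
    return joined

base = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]   # hypothetical base table
delta = [{"id": 2, "upd": "B"}, {"id": 3, "upd": "C"}]  # CDC delta
print(bucket_join(base, delta, "id"))  # [{'id': 2, 'v': 'b', 'upd': 'B'}]
```

At petabyte scale the payoff is that each bucket pair fits on one executor, so the upsert becomes many small local joins instead of one cluster-wide shuffle.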


How CyberArk Achieved 4x Faster Support Using Apache Iceberg and Amazon Bedrock

A practical look at how CyberArk supercharged their customer support operations by blending open table formats with Generative AI.

Why it matters: They leveraged Apache Iceberg for scalable, efficient ingestion of messy, unstructured customer logs. By plugging this into Amazon Bedrock, they enabled autonomous AI investigations that provide instant answers and make ticket resolution 4x faster. Read more →


🛠️ Tools

Databricks Announces General Availability of Zerobus Ingest (LakeFlow Connect)

Databricks has officially rolled out Zerobus Ingest as part of LakeFlow Connect, offering a fully managed, serverless approach to real-time data streaming.

Why it matters: It eliminates the need to manage external Kafka-style message buses by streaming data directly into Unity Catalog-governed Delta tables. It also supports schema-free JSON ingestion via the Databricks Variant type, cutting infrastructure costs and simplifying real-time streaming architectures. Read more →
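To see what Variant-style, schema-free ingestion buys you, here is a plain-Python sketch (not the Databricks API; field names are hypothetical): records with differing shapes are ingested as-is, and paths are extracted at query time instead of being declared in a schema up front.

```python
import json

# Sketch of the Variant idea in plain Python: ingest JSON of varying
# shape as-is, extract dotted paths at read time (like v:user.id in SQL).
raw_events = [
    '{"user": {"id": 1}, "action": "click", "meta": {"x": 10}}',
    '{"user": {"id": 2}, "action": "view"}',          # no "meta" field
]

def extract(doc, path, default=None):
    """Walk a dotted path through nested dicts; missing paths yield default."""
    for part in path.split("."):
        if not isinstance(doc, dict) or part not in doc:
            return default
        doc = doc[part]
    return doc

events = [json.loads(e) for e in raw_events]
print([extract(e, "user.id") for e in events])   # [1, 2]
print([extract(e, "meta.x") for e in events])    # [10, None]
```

The point is operational: upstream producers can add or drop fields without breaking ingestion, and schema questions move to query time.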


Stoolap

Stoolap is a high-performance embedded SQL database written in pure Rust. Its architecture prioritizes:

  • Memory-first design with optional disk persistence
  • Full ACID transactions with MVCC
  • Cost-based query optimizer with adaptive execution
  • Multiple index types (B-tree, Hash, Bitmap, HNSW)
  • Parallel query execution via Rayon
  • Minimal unsafe code (only in performance-critical hot paths)

View on GitHub →
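Stoolap itself is Rust, but the MVCC item in the list above is worth a language-agnostic sketch. In this toy Python version (not Stoolap's implementation), every write appends a version stamped with a commit timestamp, and a reader sees the newest version committed at or before its snapshot, so readers never block writers.

```python
# Toy MVCC sketch: writers append versions tagged with a commit timestamp;
# a reader at snapshot ts sees the latest version with commit_ts <= ts.

class MVCCStore:
    def __init__(self):
        self.versions = {}   # key -> list of (commit_ts, value)
        self.clock = 0

    def write(self, key, value):
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock    # commit timestamp of this write

    def read(self, key, snapshot_ts):
        visible = [v for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None

db = MVCCStore()
t1 = db.write("balance", 100)   # commit_ts = 1
t2 = db.write("balance", 80)    # commit_ts = 2
print(db.read("balance", t1))   # 100 -- old snapshot still sees 100
print(db.read("balance", t2))   # 80
```

A real engine adds garbage collection of old versions and conflict detection on commit, but this is the visibility rule underneath.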


💭 Community Sentiments

Reddit: Which data quality tool do you use?

A popular thread sparked by a user mapping out 31 different specialized data quality and observability tools. The discussion reveals that while the market is flooded with SaaS platforms, many senior engineers still advocate for fundamental, code-driven approaches.

Keep it Simple: For raw ingestion issues, many still rely heavily on native dbt tests for basic row counts, uniqueness, and null checks rather than onboarding a new vendor.
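The kind of native dbt tests the thread refers to look like this (a minimal `schema.yml` sketch; the model and column names are hypothetical):

```yaml
# models/schema.yml -- built-in dbt tests: no vendor onboarding required
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ["placed", "shipped", "returned"]
```

Each entry compiles to a SQL query that fails the build when it returns rows, which covers the row-count, uniqueness, and null checks mentioned above.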

Shift-Left with Data Contracts: There is high enthusiasm for implementing data contracts (defining schemas, constraints, and semantic versioning) so both publishers and consumers automatically enforce quality before data enters the pipeline.
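What "enforcing a contract before data enters the pipeline" can look like, as a minimal Python sketch (field names and the contract format here are hypothetical, not any specific contract tool):

```python
# Minimal shift-left sketch: publisher and consumer share one versioned
# contract; records are validated against it before entering the pipeline.

CONTRACT = {
    "version": "1.2.0",   # semantic versioning of the contract itself
    "fields": {
        "order_id": {"type": int,   "required": True},
        "amount":   {"type": float, "required": True},
        "note":     {"type": str,   "required": False},
    },
}

def validate(record, contract=CONTRACT):
    errors = []
    for name, rule in contract["fields"].items():
        if name not in record:
            if rule["required"]:
                errors.append(f"missing required field: {name}")
        elif not isinstance(record[name], rule["type"]):
            errors.append(f"wrong type for {name}: {type(record[name]).__name__}")
    return errors

print(validate({"order_id": 7, "amount": 19.99}))  # [] -> accepted
print(validate({"order_id": "7"}))                 # type + missing-field errors
```

The same contract file can be checked in the publisher's CI and the consumer's ingestion step, which is the "both sides enforce quality" part.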

Auditing over Logging: For cross-system reconciliation, practitioners highly recommend using data-diff tools and structured audit trails (tracking exact rows extracted, transported, and rejected) over dumping text into generic logging solutions. Read more →
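The "structured audit over free-text logs" point can be made concrete with a small key-level diff (plain Python; the tables are hypothetical, not a specific data-diff tool): compare primary keys on both sides and report exactly which rows are missing, extra, or changed.

```python
# Key-level reconciliation sketch: a structured diff between source and
# target rather than grepping text logs for discrepancies.

def diff_by_key(source, target, key="id"):
    src = {row[key]: row for row in source}
    tgt = {row[key]: row for row in target}
    return {
        "missing_in_target": sorted(src.keys() - tgt.keys()),
        "extra_in_target":   sorted(tgt.keys() - src.keys()),
        "changed":           sorted(k for k in src.keys() & tgt.keys()
                                    if src[k] != tgt[k]),
    }

source = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
target = [{"id": 1, "v": "a"}, {"id": 3, "v": "X"}, {"id": 4, "v": "d"}]
print(diff_by_key(source, target))
# {'missing_in_target': [2], 'extra_in_target': [4], 'changed': [3]}
```

Persisting reports like this per pipeline run gives the audit trail of exact rows extracted, transported, and rejected that the thread recommends.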


💼 Jobs

dbt Labs

View openings →


Metabase

View openings →


That’s all for this week! See you in the next edition.