Data This Week #9

Welcome to this week’s roundup of the data ecosystem. This issue covers the evolution of streaming lakehouse architectures, a new unified data engine built to avoid serialization overhead, and some counter-intuitive truths about PostgreSQL performance.

Here are the top reads, tools, and community discussions for senior data folks this week.

📚 Blogs to Read

Data Inlining in DuckLake: Unlocking Streaming for Data Lakes

DuckLake is tackling the notorious “small files problem” in streaming lakehouses with a technique called data inlining. Instead of creating thousands of tiny Parquet/metadata files for every micro-batch, DuckLake stores small inserts and deletes directly inside an OLTP catalog database (like PostgreSQL or SQLite). A background CHECKPOINT eventually flushes them to Parquet.

Why it matters: The benchmark in the post reports a 926x speedup on aggregation queries compared to Iceberg under high-frequency inserts. This signals a shift toward using transactional databases as the active write-ahead log for object-storage data lakes, eliminating the need for heavy, continuous compaction jobs. Read more →
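To make the idea concrete, here is a toy model of data inlining in Python. It is a sketch under stated assumptions, not DuckLake's API: the class name, the INLINE_ROW_LIMIT threshold, and the use of in-memory lists as stand-ins for Parquet files are all hypothetical. It only illustrates the write path described above: small batches go into a catalog table, and a checkpoint flushes them into one file.

```python
import sqlite3

INLINE_ROW_LIMIT = 3  # hypothetical threshold: larger batches skip inlining


class InliningTable:
    """Toy model: small writes land in a catalog table, checkpoint flushes them."""

    def __init__(self):
        self.catalog = sqlite3.connect(":memory:")  # stands in for the OLTP catalog
        self.catalog.execute("CREATE TABLE inlined_rows (sensor TEXT, value REAL)")
        self.data_files = []  # each entry stands in for one Parquet file

    def insert(self, rows):
        rows = list(rows)
        if len(rows) <= INLINE_ROW_LIMIT:
            # Small batch: one transactional insert, no new file is created.
            with self.catalog:
                self.catalog.executemany(
                    "INSERT INTO inlined_rows VALUES (?, ?)", rows
                )
        else:
            # Large batch: write it as its own file right away.
            self.data_files.append(rows)

    def checkpoint(self):
        """Flush all inlined rows into a single file and clear the catalog table."""
        with self.catalog:
            rows = self.catalog.execute(
                "SELECT sensor, value FROM inlined_rows"
            ).fetchall()
            if rows:
                self.data_files.append(rows)
                self.catalog.execute("DELETE FROM inlined_rows")

    def scan(self):
        """Readers see file data plus any rows still inlined in the catalog."""
        for data_file in self.data_files:
            yield from data_file
        yield from self.catalog.execute("SELECT sensor, value FROM inlined_rows")


table = InliningTable()
for i in range(100):  # 100 single-row micro-batches
    table.insert([("s1", float(i))])

files_before = len(table.data_files)
table.checkpoint()
files_after = len(table.data_files)
total = sum(value for _, value in table.scan())
print(files_before, files_after, total)  # prints: 0 1 4950.0
```

In this model, a hundred micro-batches produce zero files until the checkpoint and exactly one afterward, while scans return the same rows at every point.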

Operating Trino at Scale With Trino Gateway

Expedia Group details its architecture for managing a massive, multi-cluster Trino deployment. With Trino Gateway in front of the clusters, queries are routed dynamically based on workload profile, which keeps resource-heavy ETL jobs away from high-concurrency BI dashboards. Expedia recently contributed several UI features back to the open-source Gateway so that routing rules no longer require manual config file editing.

Why it matters: If you are running Presto or Trino across multiple environments, intelligent workload routing is mandatory to protect SLAs. This blog provides a blueprint for migrating from static clusters to a dynamic, workload-aware compute mesh. Read more →
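As a rough illustration of workload-aware routing, the sketch below picks a cluster group from request attributes. It is not Trino Gateway's rule syntax or Expedia's configuration: the group names, the source values, and the `pick_routing_group` function are hypothetical, and the only assumption about Trino is that clients can report a source name and client tags with each query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryRequest:
    source: str             # client-reported source name
    client_tags: frozenset  # client-reported tags attached to the query


# Hypothetical routing groups; the names are illustrative only.
ETL_GROUP = "etl"
BI_GROUP = "bi"
DEFAULT_GROUP = "adhoc"

# Hypothetical list of dashboard tools that should land on the BI clusters.
BI_SOURCES = {"tableau", "superset", "looker"}


def pick_routing_group(req: QueryRequest) -> str:
    """Return the cluster group a query should be sent to."""
    if "etl" in req.client_tags or req.source.startswith("airflow"):
        return ETL_GROUP
    if req.source in BI_SOURCES:
        return BI_GROUP
    return DEFAULT_GROUP


requests = [
    QueryRequest(source="airflow-nightly", client_tags=frozenset()),
    QueryRequest(source="tableau", client_tags=frozenset()),
    QueryRequest(source="jupyter", client_tags=frozenset({"etl"})),
    QueryRequest(source="cli", client_tags=frozenset()),
]
for req in requests:
    print(req.source, "->", pick_routing_group(req))
```

Checking the ETL rule first means a tagged heavy job never lands on the dashboard clusters, even when it is launched from a BI tool.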

Short Video Analytics: Closing the Gap Between Signals and Truth

A practical exploration of how to build a Lakehouse architecture that supports both real-time operational decisions (sub-minute latency) and deep offline analytical modeling. The author outlines a dual-path architecture using Kafka + Spark Structured Streaming for the “Hot Path” and daily batch jobs for the “Cold Path,” tied together by a unified semantic layer.

Why it matters: A great refresher on applying Lambda architecture principles to the modern Lakehouse. It emphasizes solving “Decision Usability”—ensuring real-time operational metrics don’t drift from offline analytical truth. Read more →
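One way to picture the drift problem is a reconciliation check that compares hot-path metrics against the cold-path numbers once the daily batch lands. The sketch below is an assumption about how such a check might look, not the author's implementation: the metric names, the sample values, and the 2% tolerance are made up for illustration.

```python
def relative_drift(hot: float, cold: float) -> float:
    """Relative difference of the hot-path value from the cold-path value."""
    if cold == 0:
        return 0.0 if hot == 0 else float("inf")
    return abs(hot - cold) / abs(cold)


def drifted_metrics(hot_metrics, cold_metrics, tolerance=0.02):
    """Return {metric: drift} for metrics that deviate by more than tolerance."""
    drifted = {}
    for name, cold_value in cold_metrics.items():
        if name not in hot_metrics:
            continue  # metric only exists on the cold path, nothing to compare
        drift = relative_drift(hot_metrics[name], cold_value)
        if drift > tolerance:
            drifted[name] = drift
    return drifted


# Hypothetical numbers for one day of short-video analytics.
hot = {"views": 10_300, "completions": 4_010}
cold = {"views": 10_000, "completions": 4_000}

print(sorted(drifted_metrics(hot, cold)))  # prints: ['views']
```

Here the cold path is treated as the reference, so a flagged metric points at the streaming job rather than the batch job.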

High memory usage in Postgres is good, actually

PlanetScale clears up a common misconception: seeing your PostgreSQL node sitting at 80%+ RAM usage is not a cause for alarm. Because reading from RAM is ~1,000x faster than reading from NVMe drives, Postgres relies heavily on the OS page cache and shared_buffers to keep hot data close to the CPU.

Why it matters: It teaches you to stop alerting on total memory utilization and start looking at Resident Set Size (RSS). High cache memory is healthy; high RSS (process memory) indicates memory pressure and OOM risk, which is often relieved by a connection pooler such as PgBouncer. Read more →
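The distinction is easy to see in the numbers Linux already exposes. The sketch below parses text in the /proc/meminfo format and separates "not free" memory from page cache and from memory the kernel considers available. The sample values and the 10% alert threshold are invented, and alerting on MemAvailable is used here as a simple host-level proxy rather than the per-process RSS check the article recommends.

```python
# Invented sample in /proc/meminfo format (values in kB).
SAMPLE_MEMINFO = """\
MemTotal:       64000000 kB
MemFree:         3200000 kB
MemAvailable:   46080000 kB
Buffers:          640000 kB
Cached:         44160000 kB
"""


def parse_meminfo(text):
    """Parse /proc/meminfo-style lines into a {field: kB} dict."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


def memory_report(text):
    """Split memory usage into not-free, cache, and available percentages."""
    m = parse_meminfo(text)
    total = m["MemTotal"]
    return {
        "not_free_pct": 100 * (total - m["MemFree"]) / total,
        "cache_pct": 100 * (m["Buffers"] + m["Cached"]) / total,
        "available_pct": 100 * m["MemAvailable"] / total,
    }


def should_alert(report, min_available_pct=10.0):
    """Alert on low available memory, not on high total utilization."""
    return report["available_pct"] < min_available_pct


report = memory_report(SAMPLE_MEMINFO)
print(report)
print("alert:", should_alert(report))  # prints: alert: False
```

In this sample the host looks 95% "used", yet 70% of RAM is cache and 72% is still available, so no alert fires. On a Linux host the same functions can be fed the contents of /proc/meminfo directly.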

Ontul: A unified data engine for batch, streaming, and interactive SQL

A deep dive into Ontul, a new pure-Java data engine that runs batch, streaming, and interactive SQL on a single cluster. It boasts an Arrow-native architecture with zero serialization overhead (using Arrow Flight) and embedded RocksDB for state management, removing external dependencies on Hadoop, YARN, or Kubernetes.

Why it matters: The push for engine consolidation continues. Ontul’s native integration with Apache Iceberg via REST catalogs makes it an interesting lightweight alternative to maintaining separate Flink and Spark clusters. Read more →

Starburst Enterprise Performance Tuning

A practitioner’s guide to optimizing Starburst/Trino. The series focuses on the two-tier architecture (coordinator vs. worker nodes) and how to read the EXPLAIN plan to understand exactly what the Trino optimizer knows about your query before it scans a single row of data. Read more →

🛠️ Tools

TigerFS: PostgreSQL as a Filesystem

What it is: An experimental tool that mounts a PostgreSQL database as a local filesystem (via FUSE on Linux or NFS on macOS).

Why you should check it out: Every file corresponds to a database row, meaning you can interact with a live database using standard Unix tools like ls, grep, and cat, while retaining full ACID guarantees and concurrency. While developers will find it neat, the primary target is AI agent workflows: LLM agents struggle with complex SDKs and SQL dialects but are incredibly proficient at navigating and manipulating traditional filesystem interfaces.

TigerFS → | InfoQ coverage →

💬 Community Sentiments

Lessons from building a 6-tier streaming lakehouse

Over on r/dataengineering, a user shared lessons from building an ambitious “franken-stack” to process live crypto websocket ticks. The stack chained together Iggy, Flink, Paimon, Iceberg, Fluss, and LanceDB.

Key Takeaways from the trenches:

Beware of distributed state edge-cases: The author got burned by Paimon’s aggregation engine, which treated every INSERT as a delta. When Flink High Availability (HA) resurrected a finished seed job upon restart, it effectively double-counted the initial financial balances.

Engine interoperability is still bumpy: Reading Paimon Primary Key tables with DuckDB resulted in double-counted data because DuckDB eagerly globs all underlying Parquet files, including pre-compaction snapshots, and so bypasses the merge semantics.

The consensus: The community loved the experiment, jokingly dubbing it the “FFLIIP” stack. It serves as a great reminder to senior engineers: while vendors promise seamless streaming lakehouse interoperability, the reality of snapshot isolation, state checkpointing, and merge-on-read engines requires rigorous, idempotent pipeline design. Read more →
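The seed-job failure mode from the first takeaway is easy to reproduce in miniature. The sketch below is a toy model in plain Python, not Paimon or Flink code: the two table classes, the event IDs, and the account names are hypothetical. It shows why a sum-merge table double-counts when a finished job is replayed, and how tagging each delta with a unique ID makes the replay harmless.

```python
from collections import defaultdict


class SumMergeTable:
    """Every insert is treated as a delta and added to the running total."""

    def __init__(self):
        self.balances = defaultdict(float)

    def insert(self, event_id, account, amount):
        # event_id is ignored: replays are indistinguishable from new deltas.
        self.balances[account] += amount


class IdempotentSumTable:
    """Same merge rule, but each event ID is applied at most once."""

    def __init__(self):
        self.balances = defaultdict(float)
        self.applied_event_ids = set()

    def insert(self, event_id, account, amount):
        if event_id in self.applied_event_ids:
            return  # replayed delta, already reflected in the balance
        self.applied_event_ids.add(event_id)
        self.balances[account] += amount


# Hypothetical seed data: (event_id, account, opening balance).
SEED_EVENTS = [("seed-1", "alice", 100.0), ("seed-2", "bob", 250.0)]


def run_seed_job(table):
    for event_id, account, amount in SEED_EVENTS:
        table.insert(event_id, account, amount)


naive, safe = SumMergeTable(), IdempotentSumTable()
for table in (naive, safe):
    run_seed_job(table)  # the original run
    run_seed_job(table)  # the same finished job resurrected after a restart

print(naive.balances["alice"], safe.balances["alice"])  # prints: 200.0 100.0
```

Deduplicating by event ID is only one option. Writing the seed as an absolute value with last-write-wins semantics instead of a delta would avoid the problem in a different way.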

That’s all for this week! See you in the next edition.