
Data This Week #2

Welcome to the second edition of Data This Week! Here’s a curated collection of the most interesting reads, tools, and community sentiments from the data engineering world.

📖 Blogs to Read

Real-Time Clickstream Analytics to Apache Iceberg with RisingWave Events API

RisingWave introduces an alternative to Kafka-heavy streaming architectures by offering direct HTTP ingestion via its Events API. The article demonstrates building a complete real-time clickstream pipeline that ingests events over HTTP (JSON/NDJSON), enriches them using materialized views with Postgres-style SQL, and persists results to Apache Iceberg via the Lakekeeper catalog. The architecture reduces operational overhead by auto-scaling compute resources, enables instant database branching for safe testing, and keeps analytics storage open and compatible with Spark, Trino, and BI tools. For teams handling clickstream, session analytics, or event-driven workloads, this presents a compelling path to reduce infrastructure complexity while maintaining real-time capabilities and avoiding vendor lock-in through open formats.

Read more →
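Because the Events API accepts plain HTTP POSTs, the producer side can be only a few lines of code. Here is a minimal Python sketch under stated assumptions: the endpoint URL is a placeholder (consult the RisingWave docs for the real host, port, path, and authentication), and only the NDJSON serialization is exercised without a live server.

```python
import json
import urllib.request

# Placeholder endpoint, NOT the documented API; check the RisingWave
# Events API docs for the real host, port, path, and auth scheme.
EVENTS_URL = "http://localhost:4566/events/clickstream"


def to_ndjson(events):
    """Serialize a list of dicts as NDJSON: one JSON object per line."""
    return "\n".join(json.dumps(e) for e in events)


def post_events(events, url=EVENTS_URL):
    """POST a batch of events to the ingestion endpoint as NDJSON."""
    req = urllib.request.Request(
        url,
        data=to_ndjson(events).encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


if __name__ == "__main__":
    batch = [
        {"user_id": "u_42", "page": "/pricing", "ts": "2024-05-01T12:00:00Z"},
        {"user_id": "u_42", "page": "/signup", "ts": "2024-05-01T12:00:07Z"},
    ]
    # A live RisingWave instance is required for the POST itself:
    # post_events(batch)
    print(to_ndjson(batch))
```

The point of the sketch: no Kafka client, no broker topology, just an HTTP request from whatever already emits the events.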


Efficient String Compression for Modern Database Systems

CedarDB reveals how they achieved a 2x storage reduction and faster query performance by implementing FSST (Fast Static Symbol Table) compression for text columns. The deep dive explains why string compression matters (strings represent roughly 50% of stored data) and walks through their hybrid approach combining dictionary compression with FSST tokenization. CedarDB compresses dictionary values with FSST rather than raw strings, enabling efficient integer-based filtering while achieving better compression ratios than dictionaries alone. The article offers valuable implementation details for database engineers considering advanced compression schemes, including the 40% size-penalty threshold they use to decide when enabling FSST is worthwhile.

Read more →
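The integer-filtering benefit of the dictionary layer is easy to see in isolation. This toy Python sketch shows plain dictionary encoding only; the FSST step CedarDB applies on top (a static table of frequent byte sequences compressing the dictionary entries themselves) is omitted:

```python
def dict_encode(values):
    """Toy dictionary encoding: map each distinct string to a small
    integer code; the column becomes a list of codes."""
    dictionary = {}
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return dictionary, codes


def filter_eq(dictionary, codes, needle):
    """Equality filter on encoded data: look the needle up once, then
    compare cheap integers instead of full strings per row."""
    code = dictionary.get(needle)
    if code is None:
        return []  # needle never occurs anywhere in the column
    return [i for i, c in enumerate(codes) if c == code]


col = ["apple", "banana", "apple", "cherry", "banana", "apple"]
d, codes = dict_encode(col)
print(filter_eq(d, codes, "apple"))  # row positions 0, 2, 5
```

This is why CedarDB compresses the dictionary entries with FSST rather than the raw column: the per-row data stays as integers, so filters keep this fast path.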


Databricks Lakebase is Generally Available

Databricks announces GA for Lakebase, their serverless Postgres implementation built on lakehouse architecture with compute-storage separation. The platform delivers instant database branching via zero-copy clones, automatic scaling (including scale-to-zero), point-in-time recovery with millisecond precision, and unified governance through Unity Catalog. The architecture eliminates the traditional tension between operational and analytical workloads—applications share the same data foundation as BI and ML without separate ETL pipelines or siloed databases.

Read more →
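Zero-copy branching rests on a familiar idea: copy-on-write. This toy Python sketch illustrates the concept only, not Lakebase's actual implementation; "page" here is just a list of rows:

```python
class CowStore:
    """Toy copy-on-write store: a branch duplicates only the page
    table (cheap), and a page is copied lazily, only when a branch
    first writes to it."""

    def __init__(self, pages=None):
        # page_id -> list of rows; the lists are shared across branches
        self.pages = dict(pages) if pages else {}

    def branch(self):
        # Copy the mapping, not the pages: instant, regardless of size.
        return CowStore(self.pages)

    def write(self, page_id, row):
        # Copy-on-write: privately clone the page before mutating it.
        page = list(self.pages.get(page_id, []))
        page.append(row)
        self.pages[page_id] = page

    def read(self, page_id):
        return self.pages.get(page_id, [])


main = CowStore()
main.write("p1", "original row")
dev = main.branch()                # instant: no data copied
dev.write("p1", "experimental row")
print(main.read("p1"))             # ['original row']; main is untouched
print(dev.read("p1"))              # ['original row', 'experimental row']
```

The same lazy-copy principle is what lets a clone of a large production database appear in seconds and cost almost nothing until it diverges.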


🛠️ Tools

AliSQL

Alibaba open-sources AliSQL, the production-hardened MySQL fork that powers its massive e-commerce infrastructure. The headline feature is native DuckDB integration as a storage engine, letting teams run lightweight analytical queries through MySQL's familiar interface. Built on MySQL 8.0.44 LTS, AliSQL offers a production-tested alternative for teams needing MySQL compatibility plus advanced features: analytical queries (via DuckDB), planned vector search for AI applications, and battle-tested stability from running Alibaba's workloads.

Check it out →


Data Quality Monitoring at Scale with Agentic AI

Databricks launches Data Quality Monitoring in Public Preview, applying agentic AI to solve the scaling problem of manual, rule-based quality checks. The system learns expected data patterns rather than requiring threshold configuration, automatically prioritizes tables using Unity Catalog lineage and certification tags, and intelligently scans based on table importance and update frequency. The roadmap includes automated alerts with intelligent root cause analysis, proactive data filtering to quarantine bad data before it reaches consumers, and expanding quality rules beyond freshness and completeness to include percent null, uniqueness, and validity checks.

Check it out →
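The shift from hand-set thresholds to learned expectations can be shown with a deliberately simple stand-in (Databricks' actual models are not public): derive acceptable row-count bounds from recent load history and flag outliers.

```python
from statistics import mean, stdev


def learn_bounds(history, k=3.0):
    """Learn expected row-count bounds from past loads instead of a
    hand-configured threshold: mean plus/minus k standard deviations.
    A toy stand-in for the 'learns expected patterns' idea."""
    mu, sigma = mean(history), stdev(history)
    return mu - k * sigma, mu + k * sigma


def check(history, today, k=3.0):
    """True if today's row count falls inside the learned bounds."""
    lo, hi = learn_bounds(history, k)
    return lo <= today <= hi


daily_rows = [10_120, 9_980, 10_340, 10_050, 9_900, 10_210, 10_080]
print(check(daily_rows, 10_150))  # typical volume: passes
print(check(daily_rows, 1_200))   # collapsed load: flagged as anomalous
```

The static-rule version of this ("alert if rows < 5,000") has to be written and maintained per table; the learned version adapts as each table's normal volume drifts, which is the scaling argument the announcement makes.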


💭 Community Sentiments

Fivetran Service Disruptions Discussion

A long-term Fivetran customer experienced a total production blackout after the vendor terminated services without warning over a billing lapse: a contact-record update that had been confirmed in writing was never actually applied. The administrative oversight paralyzed the customer's critical reverse-ETL architecture, including ERP and Salesforce pipelines, and the crisis was compounded by a rigid 24- to 48-hour reinstatement policy and the absence of any human escalation path. A proactive Customer Success (CS) team could have prevented the disaster by serving as an internal advocate to block automated service cuts for high-value accounts, verifying that the billing update was actually executed in the CRM, and providing a "human-in-the-loop" grace period to resolve the oversight without downtime.

Join the discussion →


That’s all for this week! See you in the next edition.