Data This Week #13

Welcome to Data This Week. This issue covers essential tuning knowledge for large-scale Spark workloads, a pragmatic rethink of where to enforce data quality constraints, two deep dives into Postgres at scale, a look at Stripe's zero-downtime database migrations, and news from the open-source and community fronts.

Here are the top reads, tools, and community discussions for senior data folks this week.

📚 Blogs to Read

Deep Dive into Spark Memory Management

If you have ever fought OOM errors during a massive sort-merge join, this piece by Kirill Bobrov is a must-read. It breaks down the balance between execution and storage memory within the JVM heap, and how parameters like spark.memory.storageFraction, which sets the share of that pool where cached blocks are protected from eviction, decide whether a job spills, evicts, or dies. Getting this right is the difference between a smooth petabyte-scale pipeline and burning money on failed clusters.

Why it matters: Spark’s unified memory model is one of the least understood levers available to data engineers. A solid grasp of how execution and storage memory compete—and how to tune the boundaries between them—is essential for anyone operating at scale. Read more →
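For a rough feel of the numbers, here is a minimal sketch of how the unified memory regions are sized. It assumes Spark's documented defaults (spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5) and the fixed 300 MB reservation; the function name and the 8 GB heap are our own illustration, not something taken from the article.

```python
# Sketch of Spark's unified memory arithmetic under default settings.
# The helper name and the example heap size are illustrative assumptions.
RESERVED_MB = 300  # fixed amount set aside before the fractions are applied


def unified_memory_regions(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Return the approximate size (in MB) of each on-heap region."""
    usable = heap_mb - RESERVED_MB
    if usable <= 0:
        raise ValueError("heap is smaller than the reserved memory")
    unified = usable * memory_fraction    # shared by execution and storage
    storage = unified * storage_fraction  # cached blocks here resist eviction
    return {
        "unified_mb": unified,
        "storage_protected_mb": storage,
        "execution_initial_mb": unified - storage,
        "user_mb": usable - unified,      # user data structures, metadata
    }


regions = unified_memory_regions(8192)  # an 8 GB executor heap
for name, size in regions.items():
    print(f"{name}: {size:.1f} MB")
```

Execution can borrow from storage (and the other way round) at runtime, so treat these figures as starting boundaries rather than hard limits.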

Validate Smarter at the Row-Level: A Four-Layer Approach

Data validation is a delicate balancing act. Too strict, and you break pipelines over minor type mismatches; too lax, and bad data infects your downstream metrics. Jon Duran outlines a pragmatic four-layer framework (Schema, Value Formats, Business Rules, and Metrics & Entities) to help you decide exactly where to enforce constraints based on actual business impact.

Why it matters: Most data quality frameworks treat all validation rules as equal, which leads to either alarm fatigue or silent failures. Tiering your constraints by business impact is a much more durable and operationally sane approach—especially when you need to make the tradeoff conversation explicit with stakeholders. Read more →
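To make the tiering idea concrete, here is a small sketch of layered row checks. Everything in it is hypothetical: the rule names, the sample fields, the KNOWN_CUSTOMERS set, and especially the choice of which layers block the pipeline versus only warn. That last choice is exactly what the framework asks you to decide based on business impact.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical reference data for the entity-level check.
KNOWN_CUSTOMERS = {"c1", "c2"}

# Assumed policy: failures in these layers stop the row; others only warn.
BLOCKING_LAYERS = {"schema", "business"}


@dataclass
class Rule:
    layer: str  # "schema", "format", "business", or "entity"
    name: str
    check: Callable[[dict], bool]


RULES = [
    Rule("schema", "order_id present", lambda r: "order_id" in r),
    Rule("format", "email looks valid",
         lambda r: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+",
                                str(r.get("email", ""))) is not None),
    Rule("business", "amount is non-negative",
         lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0),
    Rule("entity", "customer_id is a known entity",
         lambda r: r.get("customer_id") in KNOWN_CUSTOMERS),
]


def validate(row):
    """Return (errors, warnings) for one row, split by layer severity."""
    errors, warnings = [], []
    for rule in RULES:
        if not rule.check(row):
            target = errors if rule.layer in BLOCKING_LAYERS else warnings
            target.append(rule.name)
    return errors, warnings


errors, warnings = validate(
    {"order_id": 2, "email": "not-an-email", "amount": -5, "customer_id": "c9"}
)
print("block on:", errors)
print("warn on:", warnings)
```

Moving a layer in or out of BLOCKING_LAYERS is a one-line change, which keeps the strict-versus-lax tradeoff visible to stakeholders.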

RLS Sounds Great Until It Isn’t

Postgres Row Level Security (RLS) is often pitched as the perfect solution for multi-tenant access control. PlanetScale details the severe footguns that come with it at scale. Because Postgres struggles to cache expensive functions within policy definitions, RLS can become a massive CPU drain, and it can also clash with connection pooling. Sometimes, enforcing access control in the application layer is simply the safer architectural bet.

Why it matters: RLS is one of those features that looks elegant in a design review and painful in a post-mortem. For any team running multi-tenant workloads on Postgres, this is a critical read before you commit to a security model that may not survive production load. Read more →
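For contrast, here is a minimal sketch of what application-level tenant scoping can look like. It uses an in-memory SQLite database as a stand-in for Postgres, and the invoices table, TenantScopedRepo class, and tenant names are all made up for illustration; none of it comes from the PlanetScale post.

```python
import sqlite3


def make_demo_db():
    """Build a throwaway database with rows for two hypothetical tenants."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoices ("
        "id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, total REAL)"
    )
    conn.executemany(
        "INSERT INTO invoices (tenant_id, total) VALUES (?, ?)",
        [("acme", 10.0), ("acme", 25.5), ("globex", 99.0)],
    )
    return conn


class TenantScopedRepo:
    """Every query goes through here, so the tenant filter is never optional."""

    def __init__(self, conn, tenant_id):
        if not tenant_id:
            raise ValueError("tenant_id is required")
        self._conn = conn
        self._tenant_id = tenant_id

    def list_invoices(self):
        # The tenant predicate is a plain bound parameter, not a policy function.
        cursor = self._conn.execute(
            "SELECT id, total FROM invoices WHERE tenant_id = ? ORDER BY id",
            (self._tenant_id,),
        )
        return cursor.fetchall()


conn = make_demo_db()
print(TenantScopedRepo(conn, "acme").list_invoices())
print(TenantScopedRepo(conn, "globex").list_invoices())
```

The tradeoff is that the database no longer backstops a forgotten filter, so this only works if all data access is funneled through a layer like the one above.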

Stripe’s DocDB: Zero-Downtime Data Movement

A fascinating look at Stripe’s database tier, scaling to 5 million QPS with 5.5 nines of reliability. Jimmy Morzaria walks through how they built a custom platform to handle horizontal sharding and multi-tenant migrations without any downtime, all while maintaining the strict consistency required for global payment processing.

Why it matters: Zero-downtime migrations at this scale are an engineering discipline, not a lucky coincidence. The architectural patterns Stripe documents here—particularly around live traffic cutover and consistency guarantees during shard splits—are directly applicable to any team approaching the limits of a single database instance. Read more →

Amazon Aurora DSQL vs. Single-Instance PostgreSQL

Aurora DSQL brings a distributed, shared-nothing architecture to Postgres. The physical layout differences are massive: data is stored in primary key order (no heap) and relies on Optimistic Concurrency Control (OCC) instead of lock-based concurrency. If you are migrating transactional workloads to distributed systems, understanding these dialect and structural constraints is essential.

Why it matters: Aurora DSQL is not a drop-in replacement for Postgres—it’s a fundamentally different engine that speaks a Postgres dialect. Teams planning a migration need to fully understand where OCC semantics diverge from Postgres’s lock-based model, or they will encounter correctness surprises in production. Read more →
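If OCC is new to your team, the key behavioral shift is that conflicts surface at commit time and the client is expected to retry. The sketch below shows that general pattern with an in-memory, version-checked store. It is not Aurora DSQL's API; VersionedStore, ConflictError, and the interfere hook are hypothetical names used only to simulate a competing writer.

```python
class ConflictError(Exception):
    """Raised when the data changed between our read and our commit."""


class VersionedStore:
    """Toy key-value store that validates a version number at commit time."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))

    def commit(self, key, expected_version, new_value):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise ConflictError(f"{key!r} changed since it was read")
        self._data[key] = (current_version + 1, new_value)


def run_with_retry(store, key, update, max_attempts=3, interfere=None):
    """Read, compute, try to commit; start over from a fresh read on conflict."""
    for attempt in range(1, max_attempts + 1):
        version, value = store.read(key)
        new_value = update(value)
        if interfere is not None:
            interfere(attempt)  # demo hook standing in for a concurrent writer
        try:
            store.commit(key, version, new_value)
            return attempt
        except ConflictError:
            continue
    raise ConflictError(f"gave up on {key!r} after {max_attempts} attempts")


store = VersionedStore()
store.commit("balance", 0, 100)


def competing_writer(attempt):
    # Sneak in a write during our first attempt only.
    if attempt == 1:
        version, _ = store.read("balance")
        store.commit("balance", version, 50)


attempts = run_with_retry(store, "balance", lambda v: v + 10,
                          interfere=competing_writer)
print(attempts, store.read("balance"))
```

No locks are taken while the update is computed; the cost is that the whole read-compute-commit cycle has to be safe to repeat.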

🛠️ Tools

Velero Joins the CNCF Sandbox

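What it is: Velero is an open-source tool for backing up, restoring, and migrating Kubernetes cluster resources and persistent volumes. Broadcom has officially donated it to the CNCF Sandbox. Kubernetes has no built-in cluster-level backup and disaster recovery, which makes Velero critical for any stateful workload running on it.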

Why you should check it out: Shifting to neutral community governance helps keep Velero an open standard for Kubernetes backup and DR rather than a vendor-controlled tool. For teams standardizing disaster recovery protocols, this move makes Velero a safer long-term bet, since its future is now driven by community consensus rather than a single company's roadmap.

Rocky (rocky-data/rocky)

What it is: An open-source data engineering monorepo tool that bundles a fast Rust-based CLI engine, a Dagster integration, and a VS Code extension into a single, cohesive package.

Why you should check it out: By keeping parsing, type-checking, and editor tooling in a single repository, Rocky guarantees that DSL changes propagate atomically across the entire stack. No more version mismatches between your CLI, orchestrator plugin, and IDE extension—a genuinely elegant approach to toolchain coherence that is worth watching.

GitHub →

💬 Community Sentiments

SQLGlot Is Now 5x Faster While Still Being Written in Python

Over on the r/dataengineering subreddit, the creator of SQLGlot, a SQL parser and transpiler framework, announced a 5x speed improvement. They managed this by using mypyc to compile typed Python code into C extension modules without losing the standard Python interface.

Key Takeaways from the thread:

C-level speed, Python-level ergonomics: mypyc compiles standard, typed Python into a C extension without requiring any rewrites or a separate codebase. The library still imports and behaves exactly like regular Python for downstream consumers.

Massive win for semantic query analysis: Teams doing heavy dialect translation, query rewriting, or AST-level analysis at scale now get a near-free performance multiplier simply by upgrading their SQLGlot version.

The consensus: For any team leaning heavily on semantic query analysis or dialect translations, seeing C-level execution speeds on a Python-native library is a huge operational win. Upgrade your SQLGlot version. Read more →
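For a sense of what "typed Python that mypyc can compile" looks like, here is a tiny sketch. This is not SQLGlot code; split_statements is a hypothetical helper written for this example. The point is that the same file runs unchanged under the regular interpreter, and if it were saved as, say, splitter.py, running mypyc splitter.py would build a C extension with the same import name.

```python
from typing import List


def split_statements(sql: str) -> List[str]:
    """Split a SQL script on semicolons that sit outside single-quoted strings."""
    statements: List[str] = []
    current: List[str] = []
    in_string = False
    for ch in sql:
        if ch == "'":
            in_string = not in_string  # naive toggle; '' escapes toggle twice
        if ch == ";" and not in_string:
            stmt = "".join(current).strip()
            if stmt:
                statements.append(stmt)
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return statements


print(split_statements("SELECT 'a;b'; SELECT 2"))
```

Tight, fully annotated loops like this one are the kind of code where compilation tends to pay off, and callers keep importing the module exactly as before.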

That’s all for this week! See you in the next edition.