
Data This Week #11

Welcome to another edition of Data This Week. For those of us spending our days deep in the trenches of data engineering, this week brought notable shifts in how we handle open table formats, LLM integrations, and pipeline orchestration.

Here are the top reads, tools, and community discussions for senior data folks this week.

📚 Blogs to Read

Migrating Apache Iceberg Tables Between AWS Accounts: What Nobody Tells You

Migrating Iceberg tables across AWS accounts isn’t as simple as just copying Parquet files to a new bucket. This piece dives into the undocumented hurdles of moving Iceberg datasets—from resolving hardcoded S3 paths within metadata manifests to untangling cross-account Glue Catalog permissions. It’s a highly practical read for anyone actively designing multi-account data lakehouses.

Why it matters: Multi-account AWS architectures are the standard for regulated industries and large enterprises. Understanding the subtle, undocumented failure modes of Iceberg migrations at this boundary is essential operational knowledge that’s hard to find in official docs. Read more →
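To make the hardcoded-path issue concrete, here is a minimal sketch of rewriting a bucket prefix inside a table metadata document. The bucket names and the trimmed metadata are hypothetical, and this is not the procedure from the article: a real migration also has to deal with the Avro manifest lists and manifests, which this sketch does not touch.

```python
import json


def rewrite_prefix(node, old_prefix, new_prefix):
    """Return a copy of node with old_prefix replaced at the start of any string value."""
    if isinstance(node, dict):
        return {k: rewrite_prefix(v, old_prefix, new_prefix) for k, v in node.items()}
    if isinstance(node, list):
        return [rewrite_prefix(v, old_prefix, new_prefix) for v in node]
    if isinstance(node, str) and node.startswith(old_prefix):
        return new_prefix + node[len(old_prefix):]
    return node


# Hypothetical, heavily trimmed metadata document (real files carry many more fields).
metadata = {
    "location": "s3://source-account-bucket/warehouse/db/events",
    "snapshots": [
        {
            "snapshot-id": 1,
            "manifest-list": "s3://source-account-bucket/warehouse/db/events/metadata/snap-1.avro",
        }
    ],
}

migrated = rewrite_prefix(
    metadata,
    "s3://source-account-bucket/",  # assumed old bucket
    "s3://target-account-bucket/",  # assumed new bucket
)
print(json.dumps(migrated, indent=2))
```

The point of the sketch is that absolute paths are embedded at several levels of the metadata tree, so copying data files alone leaves the table pointing back at the source account.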


DuckLake 1.0

DuckLake 1.0 has officially hit production-ready status. Taking a radically different architectural bet than Iceberg or Delta Lake, DuckLake stores all metadata in a SQL database rather than scattering it across thousands of small metadata files on object storage. It tackles the notorious “small files problem” through data inlining: small streaming inserts are written directly into the catalog database instead of each producing a tiny data file. For those looking to simplify their analytics stack without the overhead of distributed clusters, this is a major milestone.

Why it matters: The “small files problem” and catalog scalability have long been pain points in lakehouse architectures. DuckLake’s approach of centralizing metadata in a SQL database is a bold architectural departure that could meaningfully reduce operational complexity for teams that don’t need the full distributed scale of Iceberg or Delta. Read more →
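The inlining idea can be sketched with a toy catalog in SQLite. This is a conceptual illustration only: the table names, the row threshold, and the `fake_write_file` helper are all hypothetical and do not reflect DuckLake's actual schema or API.

```python
import sqlite3

INLINE_ROW_LIMIT = 10  # hypothetical threshold; not a DuckLake setting


def make_catalog():
    """Create an in-memory SQL catalog with file metadata and inlined rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE data_files (table_name TEXT, path TEXT, row_count INTEGER)")
    con.execute("CREATE TABLE inlined_rows (table_name TEXT, id INTEGER, payload TEXT)")
    return con


def insert_batch(con, table_name, rows, write_file):
    """Store small batches in the catalog itself; register a data file for large ones."""
    if len(rows) <= INLINE_ROW_LIMIT:
        con.executemany(
            "INSERT INTO inlined_rows VALUES (?, ?, ?)",
            [(table_name, row_id, payload) for row_id, payload in rows],
        )
        kind = "inlined"
    else:
        path = write_file(table_name, rows)
        con.execute("INSERT INTO data_files VALUES (?, ?, ?)", (table_name, path, len(rows)))
        kind = "file"
    con.commit()
    return kind


def fake_write_file(table_name, rows):
    """Stand-in for a Parquet writer: returns a made-up path and writes nothing."""
    return f"s3://example-bucket/{table_name}/part-{len(rows)}.parquet"


con = make_catalog()
small = insert_batch(con, "events", [(1, "click")], fake_write_file)
large = insert_batch(con, "events", [(i, "view") for i in range(100)], fake_write_file)
print(small, large)
```

A single-row insert lands in the catalog and creates no file, which is the behavior that keeps streaming writes from producing thousands of tiny objects.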


Infrastructure as Code (IaC) for Data Engineers

IaC is no longer just a DevOps concern; it’s a mandatory discipline for modern data teams. This deep dive from Data Engineer Things explores how tools like Terraform and Terragrunt provide modular, repeatable blueprints for your data infrastructure. Whether you are provisioning Amazon Redshift Serverless workgroups or managing Apache Superset deployments on Kubernetes, applying DRY principles and configuration inheritance keeps the environment scalable and easy to rebuild after a failure.

Why it matters: Data infrastructure sprawl is a real and growing problem. Teams that treat their data stacks as manually provisioned snowflakes (no pun intended) accumulate enormous technical debt. This is a timely reminder that infrastructure reproducibility isn’t optional—it’s the foundation of any reliable data platform. Read more →


Getting started with Apache Iceberg write support in Amazon Redshift

Bridging the gap between the warehouse and the lake, Amazon Redshift now natively supports DELETE, UPDATE, and MERGE operations on Apache Iceberg tables. You can now maintain ACID compliance and execute complex upsert patterns directly against S3 using familiar SQL syntax. This allows you to process heavy transformations in Redshift while immediately making the results available to other analytical engines like Athena or Spark SQL.

Why it matters: This is a significant step toward true interoperability in the open lakehouse ecosystem. Redshift’s ability to write back to Iceberg tables eliminates a major architectural bottleneck, enabling hybrid query patterns that combine the power of a managed warehouse with the openness of a data lake. Read more →


Agents are only as good as the data they can join

Many of us are building AI agents, but their reasoning capabilities are fundamentally bottlenecked by the “join problem.” An agent cannot function reliably if it struggles to assemble ad-hoc context spanning a CRM, a warehouse, and external SaaS APIs. This post breaks down five architectural patterns for agent-data assembly, arguing that federated query engines, acting as a real-time integration layer, are the scalable future of agent infrastructure.

Why it matters: The hype around AI agents often glosses over the data plumbing required to make them work. This article cuts through the noise with a practical, architecture-first perspective, making it essential reading for data engineers who are now being asked to build the data layer for their company’s AI initiatives. Read more →
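As a rough illustration of the “join problem,” the sketch below assembles one customer's context from three stand-in sources. It is not one of the patterns from the post; the source shapes, field names, and the `fetch_tickets` helper are all hypothetical, and a federated query engine would push this join down to the sources rather than doing it in application code.

```python
def assemble_context(customer_id, crm, warehouse, saas_api):
    """Join one customer's records across a CRM dict, warehouse rows, and an API call."""
    account = crm.get(customer_id, {})
    orders = [row for row in warehouse if row["customer_id"] == customer_id]
    tickets = saas_api(customer_id)
    return {
        "customer_id": customer_id,
        "name": account.get("name"),
        "lifetime_value": sum(row["amount"] for row in orders),
        "open_tickets": sum(1 for t in tickets if t["status"] == "open"),
    }


# Hypothetical stand-ins for the three systems.
crm = {"c-1": {"name": "Acme Corp"}}
warehouse = [
    {"customer_id": "c-1", "amount": 120},
    {"customer_id": "c-1", "amount": 80},
    {"customer_id": "c-2", "amount": 999},
]


def fetch_tickets(customer_id):
    """Pretend SaaS API client returning support tickets for a customer."""
    data = {"c-1": [{"status": "open"}, {"status": "closed"}]}
    return data.get(customer_id, [])


context = assemble_context("c-1", crm, warehouse, fetch_tickets)
print(context)
```

Even in this toy form, the agent's answer depends on three systems agreeing on a customer key, which is exactly where ad-hoc context assembly tends to break.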


🛠️ Tools

LARQL (chrishayuk/larql)

What it is: A new query language built in Rust that treats a transformer’s Feed-Forward Network as a graph database, enabling you to query and even write to an LLM’s internal weights using familiar database semantics.

Why you should check it out: LARQL is a fascinating project where inference essentially becomes a KNN walk through graph edges rather than a black-box matrix multiplication. Most impressively, this architecture allows you to INSERT new factual knowledge directly into the model without the heavy compute costs of retraining or fine-tuning—a genuinely novel approach to the knowledge update problem in production AI systems.

GitHub →


💬 Community Sentiments

Dagster pricing update is beyond nuts

The data engineering community has been in an uproar this week over Dagster’s sudden pricing changes. Moving to a strict pay-per-credit model ($0.0035/credit) and reportedly stripping the 30,000 free credits from the starter plan with only two weeks’ notice has left many startup data teams scrambling.
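For a sense of scale, the per-credit arithmetic can be sketched as follows. This assumes a flat rate with no other fees, uses the figures as reported in the thread, and picks a hypothetical monthly usage equal to the former free allowance; check the vendor's current pricing page before relying on any of it.

```python
from decimal import Decimal

PRICE_PER_CREDIT = Decimal("0.0035")  # rate quoted in the thread


def monthly_cost(credits_used, free_credits=0):
    """Cost in dollars for a month, assuming a flat per-credit rate and no other fees."""
    billable = max(credits_used - free_credits, 0)
    return billable * PRICE_PER_CREDIT


usage = 30_000  # hypothetical usage, chosen to match the former free allowance

before = monthly_cost(usage, free_credits=30_000)  # with the reported free credits
after = monthly_cost(usage)                        # with the free credits removed
print(f"before: ${before:.2f}, after: ${after:.2f}")
```

Under these assumptions, a team that previously stayed inside the free allowance would go from paying nothing to about $105 a month for the same usage, and the bill grows linearly from there.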

Key Takeaways from the thread:

The cost shock is real: The magnitude of the price hike is forcing a serious industry-wide conversation about the hidden, unpredictable costs of managed orchestration platforms and whether the convenience premium is still justified.

Vendor lock-in fears are surging: The abruptness of the change has renewed fears about the long-term risks of building critical workflows on managed SaaS platforms with opaque pricing roadmaps, driving a wave of interest in self-hosted alternatives.

The consensus: The growing appeal of shifting back to self-hosted OSS deployments or alternatives like Airflow is palpable in this thread. A must-read if your team runs on Dagster Cloud. Read more →


That’s all for this week! See you in the next edition.