Data This Week #12

Welcome to Data This Week. This week, we’re diving into practical architecture shifts—from migrating cold data to lakehouses, to Databricks’ new visual data prep tool—alongside a stark reminder of why data observability matters most when the pager goes off at 2 AM.

Here are the top reads, tools, and community discussions for senior data folks this week.

📚 Blogs to Read

How We Cut Our Database Costs by Moving Cold Data from PostgreSQL to Amazon S3

Every growing engineering team eventually faces a bloated transactional database. Arcesium details how it offloaded cold reconciliation data (older than 90 days) from costly PostgreSQL to an Amazon S3-backed data lakehouse. Using Apache Iceberg as the table format and DuckDB for fast analytical reads, the team cut database storage costs by 40–60% while keeping historical data fully queryable. A great read on balancing hot transactional needs with cold OLAP economics.

Why it matters: Transactional databases are expensive to scale vertically, and cold data rarely justifies the cost. This piece offers a concrete, battle-tested blueprint for tiering your storage strategy using open formats—without sacrificing queryability. Read more →
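
To make the tiering idea concrete, here is a minimal sketch of the "archive first, then delete" step. It is not Arcesium's implementation: SQLite stands in for PostgreSQL, an in-memory CSV stands in for Iceberg tables on S3, and the `recon` table, its columns, and `archive_cold_rows` are hypothetical names chosen for illustration. Only the 90-day cutoff comes from the article.

```python
import csv
import io
import sqlite3
from datetime import datetime, timedelta, timezone


def archive_cold_rows(conn, archive, now, retention_days=90):
    """Copy rows older than the retention window to `archive`, then delete them.

    Returns the number of rows moved. Timestamps are ISO-8601 UTC strings in
    one consistent format, so plain string comparison orders them correctly.
    """
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    cold = conn.execute(
        "SELECT id, created_at, payload FROM recon "
        "WHERE created_at < ? ORDER BY id",
        (cutoff,),
    ).fetchall()
    if not cold:
        return 0
    # Write the archive copy first so a failed write never loses data.
    writer = csv.writer(archive)
    writer.writerow(["id", "created_at", "payload"])
    writer.writerows(cold)
    # Delete from the hot table only after the archive copy exists.
    with conn:
        conn.execute("DELETE FROM recon WHERE created_at < ?", (cutoff,))
    return len(cold)


# Demo with three rows: two past the 90-day window, one recent.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recon (id INTEGER PRIMARY KEY, created_at TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO recon VALUES (?, ?, ?)",
    [
        (1, (now - timedelta(days=200)).isoformat(), "settled"),
        (2, (now - timedelta(days=91)).isoformat(), "settled"),
        (3, (now - timedelta(days=5)).isoformat(), "open"),
    ],
)
conn.commit()

archive = io.StringIO()
moved = archive_cold_rows(conn, archive, now)
remaining = [row[0] for row in conn.execute("SELECT id FROM recon")]
print(moved, remaining)
```

The ordering is the point of the sketch: the hot table shrinks only after the cold copy has been written, so a failed export leaves the source untouched.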


Announcing the Public Preview of Lakeflow Designer

Databricks just pushed Lakeflow Designer into public preview, aiming to bridge the gap between business teams and raw data. It’s a no-code, AI-native visual interface that lets analysts and domain experts build data prep workflows using drag-and-drop canvases and natural language prompts. Because it runs natively on Databricks, it respects Unity Catalog governance and avoids data duplication. Crucially for us on the engineering side, every visual transformation generates real, production-ready Python code under the hood—meaning no more rewriting analysts’ low-code pipelines from scratch when they need to be moved to production.

Why it matters: The perennial tension between self-service analytics and engineering standards has always been a governance nightmare. Lakeflow Designer’s approach of generating production-grade Python from visual workflows is a meaningful step toward closing that gap without creating a shadow data stack. Read more →


What is a Vector Database?

With the explosion of GenAI and RAG applications, vector databases have shifted from niche to mainstream infrastructure. This system design deep-dive breaks down how vector databases differ from traditional relational or NoSQL systems, explaining the mechanics of vector embeddings, indexing algorithms like HNSW (Hierarchical Navigable Small World), and how those indexes make similarity search over unstructured data fast.

Why it matters: If your team is building any kind of RAG pipeline or semantic search layer, you need a first-principles understanding of how vector indexes actually work—not just which library to import. This is the clearest breakdown of HNSW mechanics you’ll find outside a research paper. Read more →
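
For a feel of what a vector index is speeding up, here is a minimal sketch of exact similarity search by brute force. This is deliberately not HNSW: it scores every stored vector against the query, which is the linear scan that an approximate graph index like HNSW is designed to avoid. The document ids, the toy 3-dimensional vectors, and the function names are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Exact nearest-neighbour search: score every vector, keep the best k."""
    scored = [
        (cosine_similarity(query, vec), doc_id)
        for doc_id, vec in vectors.items()
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
vectors = {
    "invoice": [0.9, 0.1, 0.0],
    "receipt": [0.8, 0.2, 0.1],
    "holiday": [0.0, 0.1, 0.9],
}
results = top_k([1.0, 0.0, 0.0], vectors, k=2)
print(results)
```

The cost of this scan grows linearly with the number of stored vectors, which is why production systems typically trade a little recall for an index that visits only a small part of the collection.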


Salesforce Data Migration Guide: Steps, Tools, and Best Practices

Data migrations are rarely a simple “lift and shift.” This comprehensive guide outlines a battle-tested three-step framework (Plan, Execute, Validate) for migrating legacy CRM data into Salesforce. It covers essential strategies for handling cross-platform schema alignments, navigating API constraints for high-volume batches, and avoiding the cascading failures that happen when historical triggers and automations are left on during a migration load.

Why it matters: CRM migrations are notoriously underestimated in complexity. The failure modes here—runaway automations, silent API throttling, schema mismatches—are exactly the kind of undocumented gotchas that turn a “simple” migration into a multi-week incident. Essential reading before your team touches a production CRM. Read more →


🛠️ Tools

SwiftLake (via Arcesium Engineering)

What it is: A lightweight Java SQL engine built directly on top of Apache Iceberg and DuckDB that allows you to execute fast analytical reads and perform complex writes—including SCD Type 1 and 2 merges and schema evolution—directly against Parquet files on S3.

Why you should check it out: If you want the economics of a data lakehouse without the overhead of managing a distributed Spark cluster, SwiftLake is worth your attention. It fits single-node workloads that need cloud storage scalability and Iceberg’s ACID guarantees without heavy compute infrastructure. A natural companion to the cold data tiering strategy covered in the Arcesium blog above.

Blog post →  |  GitHub →


💬 Community Sentiments

“Data pipeline blew up at 2am and I have no clue where it started”

A painful but highly relatable thread over on r/dataengineering. A senior engineer got paged at 2 AM because a revenue dashboard was showing garbage numbers. The culprit wasn’t a broken transform, but an upstream source that stopped sending fresh data. Because the ingestion layer didn’t “fail,” the downstream dbt models happily processed the empty and stale data.

Key Takeaways from the thread:

Catch bad data at the source: The most highly recommended solutions involved implementing strict row count and freshness checks before your transform layers even kick off (e.g., using dbt source freshness blocks).

Enforce Data Contracts at the boundary: Explicitly define expected schemas and volume thresholds at the ingestion layer so silent failures become loud failures.

Adopt the Write-Audit-Publish (WAP) pattern: Validate data in a staging environment before swapping it into production views, ensuring downstream consumers never see stale or empty datasets.
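
The takeaways above can be sketched together in a few lines. This is one possible shape rather than the approach from the thread: SQLite stands in for the warehouse, the publish step swaps a table rather than a view, and the `revenue` and `revenue_staging` tables, the `loaded_at` column, and the 24-hour and 1-row thresholds are all assumptions chosen for illustration.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def audit(conn, table, now, min_rows=1, max_age=timedelta(hours=24)):
    """Return a list of audit failures for a staged table (empty list = pass)."""
    # `table` is an internal, trusted name; never interpolate user input here.
    count, newest = conn.execute(
        f"SELECT COUNT(*), MAX(loaded_at) FROM {table}"
    ).fetchall()[0]
    failures = []
    if count < min_rows:
        failures.append(f"row count {count} below minimum {min_rows}")
    if newest is None or now - datetime.fromisoformat(newest) > max_age:
        failures.append(f"newest row {newest} is older than {max_age}")
    return failures


def write_audit_publish(conn, rows, now):
    """Write to staging, audit it, and publish only if the audit passes."""
    # Write: land the batch in a staging table that consumers never read.
    conn.execute("DROP TABLE IF EXISTS revenue_staging")
    conn.execute("CREATE TABLE revenue_staging (amount REAL, loaded_at TEXT)")
    conn.executemany("INSERT INTO revenue_staging VALUES (?, ?)", rows)
    conn.commit()
    # Audit: fail loudly instead of letting stale or empty data through.
    failures = audit(conn, "revenue_staging", now)
    if failures:
        raise ValueError("audit failed: " + "; ".join(failures))
    # Publish: swap the audited batch into production in one transaction.
    with conn:
        conn.execute("BEGIN")
        conn.execute("DROP TABLE IF EXISTS revenue")
        conn.execute("ALTER TABLE revenue_staging RENAME TO revenue")


now = datetime(2025, 1, 1, 2, 0, tzinfo=timezone.utc)
conn = sqlite3.connect(":memory:")

# A fresh batch passes the audit and becomes the production table.
fresh = [(120.0, (now - timedelta(hours=1)).isoformat())]
write_audit_publish(conn, fresh, now)

# A stale batch (the upstream source stopped sending data) is rejected,
# and production keeps serving the last good batch.
stale = [(0.0, (now - timedelta(days=3)).isoformat())]
try:
    write_audit_publish(conn, stale, now)
    rejected = False
except ValueError as exc:
    rejected = True
    print(exc)

published = conn.execute("SELECT amount FROM revenue").fetchall()
print(rejected, published)
```

The rejected batch stays in the staging table for inspection, while the production table is untouched, so the failure surfaces as an error in the pipeline instead of as bad numbers on a dashboard.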

The consensus: Prevention is always cheaper than a 2 AM firefighting session. A must-read if your team lacks observability at the ingestion layer. Read more →


That’s all for this week! See you in the next edition.