
Data This Week #1

Welcome to the first edition of Data This Week! Here’s a curated collection of the most interesting reads, tools, community sentiments, and job opportunities from the data engineering world.

📖 Blogs to Read

2025 Databases Retrospective

In his annual comprehensive review, Andy Pavlo dissects the major database trends of the past year, highlighting PostgreSQL’s continued dominance through high-profile acquisitions in the ecosystem (e.g., Neon, Crunchy Data) and the rise of new distributed PostgreSQL projects like Multigres. The post also notes the industry-wide adoption of the Model Context Protocol (MCP) to standardize how Large Language Models (LLMs) interact with databases, effectively turning databases into middleware.

Read more →


ADBC Arrow Driver for Databricks

This article introduces the new Arrow Database Connectivity (ADBC) driver for Databricks, which is designed to supersede traditional ODBC/JDBC drivers for analytical workloads. By leveraging the Apache Arrow format, the driver enables direct, columnar, and zero-copy data transfer between Databricks and client tools like Power BI. This architecture eliminates the costly overhead of serializing data into row-oriented formats, significantly boosting performance for high-volume data retrieval and analytics.

Read more →
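
To make the row-versus-column point concrete, here is a minimal pure-Python illustration (not the ADBC driver's actual API) of why columnar transfer sidesteps per-row serialization: a row-oriented result set forces the client to touch every value row by row, while a columnar layout hands over whole column buffers at once.

```python
# Conceptual illustration only -- not the ADBC/Databricks driver API.
# The same result set, viewed row-wise (ODBC/JDBC style) and column-wise
# (Arrow style).

rows = [(i, f"user{i}") for i in range(5)]  # row-oriented result set

# Row-oriented consumption: each row is converted individually,
# which is the per-row serialization overhead the article describes.
row_view = [{"id": r[0], "name": r[1]} for r in rows]

# Columnar consumption: the same data as contiguous columns. A client
# like Power BI can ingest entire column buffers without per-row work,
# and with Arrow on both sides no format conversion is needed at all.
columnar = {
    "id":   [r[0] for r in rows],
    "name": [r[1] for r in rows],
}

print(columnar["id"])  # [0, 1, 2, 3, 4]
```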


Scaled Data Replication at Uber

Uber’s engineering team details how they optimized their HiveSync service to replicate petabytes of data daily across data lakes. Facing scaling bottlenecks with the standard Hadoop DistCp tool, they moved resource-intensive tasks like Copy Listing and Input Splitting to the Application Master to reduce client contention. They also introduced “Uber jobs” to handle small file transfers locally within the Application Master, eliminating the overhead of launching hundreds of thousands of separate containers daily.

Read more →
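
The dispatch idea behind those “Uber jobs” can be sketched in a few lines: files below a size threshold are copied locally in the Application Master, and only large files are farmed out to distributed copy containers. The threshold and names below are illustrative assumptions, not Uber's actual implementation.

```python
# Hypothetical sketch of the small-file dispatch described above.
# Threshold is an assumed, illustrative cutoff.
SMALL_FILE_THRESHOLD = 32 * 1024 * 1024  # 32 MiB

def plan_transfers(files):
    """Split (path, size) pairs into AM-local copies and container-backed copies."""
    local, containers = [], []
    for path, size in files:
        (local if size <= SMALL_FILE_THRESHOLD else containers).append(path)
    return local, containers

files = [("/logs/a.parquet", 4 << 20), ("/logs/b.parquet", 512 << 20)]
local, containers = plan_transfers(files)
print(local)       # ['/logs/a.parquet']
print(containers)  # ['/logs/b.parquet']
```

Batching the small files inside the Application Master trades a little local work for avoiding a container launch per file, which is where the savings at Uber's scale come from.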


The AI Evolution of Graph Search at Netflix

Netflix shares their journey of transitioning from a complex, structured Domain-Specific Language (DSL) to natural language search for their internal Graph Search platform. The post explains their use of Retrieval Augmented Generation (RAG) to handle massive schemas and controlled vocabularies, ensuring the LLM has the right context to generate accurate queries. They also implemented rigorous validation steps, such as checking the generated query’s Abstract Syntax Tree (AST) for syntactic correctness and verifying fields against metadata to prevent semantic hallucinations.

Read more →
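
The two validation steps generalize well beyond Netflix's stack. Here is a hedged sketch using a Python expression as a stand-in for their query language: first parse the LLM-generated query to confirm it is syntactically valid, then check every referenced field against schema metadata to catch hallucinated names. The schema and function names are illustrative.

```python
import ast

KNOWN_FIELDS = {"title", "release_year", "genre"}  # illustrative schema metadata

def validate_generated_query(query: str) -> bool:
    # Step 1: syntactic check -- does the query even parse?
    try:
        tree = ast.parse(query, mode="eval")
    except SyntaxError:
        return False
    # Step 2: semantic check -- is every referenced field a real one?
    names = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    return names <= KNOWN_FIELDS

print(validate_generated_query("release_year > 2020"))  # True
print(validate_generated_query("relese_year > 2020"))   # False: hallucinated field
```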


🛠️ Tools

OpenEverest

OpenEverest is a new open-source tool designed to provision and manage databases on any Kubernetes cluster, aiming to provide a private DBaaS experience without cloud vendor lock-in. It currently supports MySQL, PostgreSQL, and MongoDB, offering essential operational features such as automated backups, scaling, and monitoring. The project is modular, allowing teams to mix and match storage and database engines while managing them through a unified web UI and API.

Check it out →
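
To give a feel for the Kubernetes-native, declarative model such tools use, here is a purely hypothetical manifest. The API group, kind, and every field below are invented for illustration and are not OpenEverest's actual CRD schema; consult the project's documentation for the real resource definitions.

```yaml
# Hypothetical manifest -- NOT OpenEverest's real CRD schema.
# Sketches the declarative spec a Kubernetes DBaaS operator reconciles:
# pick an engine, a size, and a backup policy.
apiVersion: example.openeverest.io/v1alpha1   # invented API group
kind: DatabaseCluster
metadata:
  name: orders-pg
spec:
  engine:
    type: postgresql
    version: "16"
  replicas: 3
  backups:
    schedule: "0 3 * * *"    # nightly at 03:00
```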


Pandas 3.0

This major release of the popular Python data analysis library introduces significant performance and behavior changes. Key updates include making Copy-on-Write (CoW) the default behavior, which eliminates the SettingWithCopyWarning and disables chained assignment to improve predictability. Additionally, string columns now default to the PyArrow-backed string data type for better performance, and the library now supports varied datetime resolutions (including milliseconds and seconds) to better handle a wider range of historical and future dates.

Check it out →


💭 Community Sentiments

Getting off of Fabric

A highly engaged discussion on the r/dataengineering subreddit highlights growing frustration with Microsoft Fabric’s shared capacity model. The original poster details their decision to migrate away from the platform, citing critical stability issues where heavy ETL pipelines spike capacity usage, causing “noisy neighbor” problems that render Power BI visuals unusable for end users.

Beyond the pricing model, the thread aggregates complaints about the “painful and opaque” debugging experience, particularly with random pipeline hangs that offer no actionable error logs. Users also expressed disappointment with the reliance on slow “Copy Data” activities for SQL Server ingestion and the feeling that many features labeled as production-ready still behave like preview software. The sentiment reflects a broader hesitation to lock into an “all-in-one” platform that limits architectural flexibility.

Join the discussion →


Context Graphs: Capturing the Why in the Age of AI

In a viral LinkedIn Pulse article, HubSpot founder Dharmesh Shah argues that while Knowledge Graphs capture the “what” (entities and relationships), the age of AI agents requires a new layer: the Context Graph. This new architecture is designed to capture the “why”—recording the decision traces, policy constraints, and state of the world at the exact moment a decision was made. Shah posits that without this historical causality, AI agents cannot effectively reason about past actions or learn from organizational nuance.

Read more →
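
One way to picture the idea is as a record type. The sketch below is a hypothetical illustration of what a single context-graph entry might capture, following the article's framing: not just the entities involved (the "what"), but the reasoning trace, the policies in force, and a snapshot of world state at decision time (the "why"). All field names are invented for illustration.

```python
from dataclasses import dataclass, field
import datetime

@dataclass
class DecisionRecord:
    """Hypothetical context-graph entry; field names are illustrative."""
    actor: str            # agent or person who decided
    action: str           # what was done
    rationale: str        # the "why": the reasoning trace
    policies: list[str]   # constraints in force at the time
    world_state: dict     # snapshot of relevant state at decision time
    decided_at: datetime.datetime = field(
        default_factory=lambda: datetime.datetime.now(datetime.timezone.utc)
    )

rec = DecisionRecord(
    actor="pricing-agent",
    action="apply 10% discount",
    rationale="churn risk above threshold for this account",
    policies=["max_discount<=15%"],
    world_state={"churn_score": 0.82},
)
print(rec.action)
```

A knowledge graph would store only the entities here (the account, the discount); the extra fields are what Shah argues an AI agent needs to reason about past decisions.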


💼 Jobs

Zapier

Zapier is hiring for Data Engineer roles across two key teams: Product Data Engineering and Data Platforms. The Product Data Engineering role focuses on building backend systems that power features like AI personalization and billing, while the Data Platforms role is dedicated to governance, security, and infrastructure scaling.

Key Requirements: 4+ years of experience with cloud data pipelines (AWS/GCP/Azure), proficiency in Python/TypeScript and SQL, and strong expertise in Databricks or Spark.

Locations: Americas (Product Data Engineering) and Americas/EMEA (Data Platforms).

View openings →


Coinbase

Coinbase has open positions for Senior Software Engineers and Engineering Managers within their Data Platform and Blockchain Platform teams. These roles involve building the systems that centralize internal and third-party data, enabling analytics and machine learning across the company. Specific opportunities include working on blockchain data processing and scaling the “Golden Record” data layers.

Key Requirements: Strong backend skills (Go, Java, or Python), experience with distributed systems (Kafka, Flink, Spark), and a background in building multi-tenant data infrastructure.

Locations: Remote (India, USA, Canada).

View openings →


Stripe

Stripe is actively recruiting for a variety of data roles, including Backend Engineer (Data Platform), Data Analyst, Data Scientist, and Machine Learning Engineer. The Data Platform team is looking for engineers to build scalable infrastructure that handles Stripe’s massive transaction volumes, while the Data Science roles focus on product analytics and payments optimization.

Key Requirements: Experience with distributed data systems (Trino, Spark, Flink), strong SQL skills, and a track record of rigorous data modeling. Engineering roles often require proficiency in Java or Go.

Locations: Remote (US/Canada) and hubs like Toronto, Seattle, New York, and Bengaluru.

View openings →


That’s all for this week! See you in the next edition.