OLake by Datazip invites you to their event

From CDC to Queryable Iceberg: Building a Production Ingestion Path with OLake + Trino

About this event

CDC pipelines into Apache Iceberg are straightforward to start but surprisingly hard to maintain at scale. The moment you move beyond full loads into continuous CDC, you accumulate equality delete files, small Parquet fragments, and schema drift that silently degrade query performance over time.

This hands-on workshop walks the full ingestion path end to end: configuring OLake to replicate a PostgreSQL source into S3-backed Iceberg tables, then querying that data live with Trino via Starburst. From there, it confronts the operational realities that surface after the first few CDC cycles: small files, delete file overhead, growing snapshot history, and the need for compaction.
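
The file accumulation described above can be inspected directly from Trino through the Iceberg connector's `$files` metadata table. A minimal sketch, assuming a catalog named `iceberg` and a replicated table `cdc.orders` (both placeholder names for whatever your setup uses):

```sql
-- Summarize the table's physical layout by file content type.
-- content = 0: data files, 1: position delete files, 2: equality delete files
SELECT
    content,
    count(*)                AS files,
    sum(record_count)       AS records,
    sum(file_size_in_bytes) AS bytes
FROM iceberg.cdc."orders$files"
GROUP BY content
ORDER BY content;
```

After a few incremental syncs, a growing count of small data files and equality delete files in this output is exactly the merge-on-read overhead the workshop digs into.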

Agenda

  1. Configuring OLake for PostgreSQL replication — source config, catalog discovery, and sync modes
  2. Running the sync — how OLake writes Parquet files into S3 as Iceberg tables after a full load
  3. CDC in practice — running incremental syncs and inspecting the resulting small data files and equality delete files
  4. Why file accumulation matters — MOR read overhead and what the layout looks like before any maintenance
  5. Querying with Starburst — connecting to Iceberg, validating ingested data, and running live queries
  6. Iceberg snapshot history and time travel — querying previous states using snapshot IDs
  7. Compaction with Starburst — merging small files, reducing delete overhead, and measuring the performance difference
  8. Best practices for Iceberg table maintenance in CDC pipelines
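
Agenda items 6 and 7 come down to a handful of Trino statements. The sketch below reuses the placeholder table `iceberg.cdc.orders`; the snapshot ID is illustrative, and the file-size and retention thresholds are tunable parameters, not recommendations:

```sql
-- 6. Snapshot history and time travel: list snapshots, then query an earlier state.
SELECT snapshot_id, committed_at, operation
FROM iceberg.cdc."orders$snapshots"
ORDER BY committed_at;

SELECT count(*)
FROM iceberg.cdc.orders
FOR VERSION AS OF 8954597067493422955;  -- a snapshot_id from the query above

-- 7. Compaction: merge small data files and apply accumulated deletes.
ALTER TABLE iceberg.cdc.orders
EXECUTE optimize(file_size_threshold => '128MB');

-- Then drop snapshots you no longer need to time travel to.
ALTER TABLE iceberg.cdc.orders
EXECUTE expire_snapshots(retention_threshold => '7d');
```

Note that `optimize` rewrites the file layout but keeps old snapshots readable; it is `expire_snapshots` that actually lets the underlying small files and delete files be removed from storage.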

Hosted by

  • External speaker
    Lester Martin, DevRel @ Starburst

    Lester is a data engineer and developer advocate with over a decade of hands-on work across data pipelines, lake analytics, and distributed query engines. He has worked extensively with Trino, Apache Iceberg, Hive, Spark, Flink, and Kafka, and spent eight years at Cloudera/Hortonworks helping enterprise teams build and operate Hadoop, Spark, and NoSQL workloads before moving into developer advocacy. At Starburst, he focuses on helping engineering teams adopt Trino and understand how query engines interact with open table formats like Iceberg at the storage layer. He regularly publishes technical deep-dives on Iceberg internals, including deletion vectors, snapshot management, and file layout behavior.

  • Team member
    Harsha Kalbalia, GTM @ Datazip | Founding Member @ Datazip

    Harsha is a user-first GTM specialist at Datazip, helping early-stage startups go from zero to one. With a knack for technical market strategy and a startup enthusiast's mindset, she bridges the gap between innovative solutions and meaningful market adoption.

  • Team member
    Nayan Joshi, Data Engineer & DevRel @ Datazip

OLake by Datazip

Fastest way to replicate your data to Apache Iceberg.

OLake is an open-source data ingestion tool, available on GitHub and developed by Datazip, Inc. It replicates data from transactional databases and streaming platforms (such as PostgreSQL, MySQL, MongoDB, Oracle, and Kafka) into open lakehouse table formats such as Apache Iceberg.