OLake by Datazip invites you to their event

From CDC to Queryable Iceberg: Building a Production Ingestion Path with OLake + Trino

About this event

CDC pipelines into Apache Iceberg are straightforward to start but surprisingly hard to maintain at scale. The moment you move beyond full loads into continuous CDC, you accumulate equality delete files, small Parquet fragments, and schema drift that silently degrade query performance over time.

This hands-on workshop walks the full ingestion path end to end: configuring OLake to replicate a PostgreSQL source into S3-backed Iceberg tables, then querying that data live with Trino via Starburst. From there, it confronts the operational realities that surface after the first few CDC cycles: small files, delete file overhead, growing snapshot history, and the need for compaction.
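
The file accumulation described above can be inspected directly from Trino through the Iceberg connector's `$files` metadata table. A minimal sketch, assuming a catalog named `iceberg` and a replicated table `cdc.orders` (both placeholder names for whatever your setup uses):

```sql
-- Summarize the table's physical layout by file content type.
-- content = 0: data files, 1: position delete files, 2: equality delete files
SELECT
    content,
    count(*)                AS files,
    sum(record_count)       AS records,
    sum(file_size_in_bytes) AS bytes
FROM iceberg.cdc."orders$files"
GROUP BY content
ORDER BY content;
```

After a few incremental syncs, a growing count of small data files and equality delete files in this output is exactly the merge-on-read overhead the workshop digs into.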

Agenda

  1. Configuring OLake for PostgreSQL replication — source config, catalog discovery, and sync modes
  2. Running the sync — how OLake writes Parquet files into S3 as Iceberg tables after a full load
  3. CDC in practice — running incremental syncs and inspecting the resulting small data files and equality delete files
  4. Why file accumulation matters — MOR read overhead and what the layout looks like before any maintenance
  5. Querying with Starburst — connecting to Iceberg, validating ingested data, and running live queries
  6. Iceberg snapshot history and time travel — querying previous states using snapshot IDs
  7. Compaction with Starburst — merging small files, reducing delete overhead, and measuring the performance difference
  8. Best practices for Iceberg table maintenance in CDC pipelines
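
Agenda items 6 and 7 come down to a handful of Trino statements. The sketch below reuses the placeholder table `iceberg.cdc.orders`; the snapshot ID is illustrative, and the file-size and retention thresholds are tunable parameters, not recommendations:

```sql
-- 6. Snapshot history and time travel: list snapshots, then query an earlier state.
SELECT snapshot_id, committed_at, operation
FROM iceberg.cdc."orders$snapshots"
ORDER BY committed_at;

SELECT count(*)
FROM iceberg.cdc.orders
FOR VERSION AS OF 8954597067493422955;  -- a snapshot_id from the query above

-- 7. Compaction: merge small data files and apply accumulated deletes.
ALTER TABLE iceberg.cdc.orders
EXECUTE optimize(file_size_threshold => '128MB');

-- Then drop snapshots you no longer need to time travel to.
ALTER TABLE iceberg.cdc.orders
EXECUTE expire_snapshots(retention_threshold => '7d');
```

Note that `optimize` rewrites the file layout but keeps old snapshots readable; it is `expire_snapshots` that actually lets the underlying small files and delete files be removed from storage.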

Hosted by

  • External speaker
    Lester Martin, DevRel @ Starburst

    Lester is a data engineer and developer advocate with over a decade of hands-on work across data pipelines, lake analytics, and distributed query engines. He has worked extensively with Trino, Apache Iceberg, Hive, Spark, Flink, and Kafka, and spent eight years at Cloudera/Hortonworks helping enterprise teams build and operate Hadoop, Spark, and NoSQL workloads before moving into developer advocacy. At Starburst, he focuses on helping engineering teams adopt Trino and understand how query engines interact with open table formats like Iceberg at the storage layer. He regularly publishes technical deep-dives on Iceberg internals, including deletion vectors, snapshot management, and file layout behavior.

  • Team member
    Harsha Kalbalia, GTM @ Datazip | Founding Member @ Datazip

    Harsha is a user-first GTM specialist at Datazip, helping early-stage startups go from zero to one. With a knack for technical market strategy and a startup enthusiast's mindset, she bridges the gap between innovative solutions and meaningful market adoption.

  • Team member
    Nayan Joshi, Data Engineer & DevRel @ Datazip

OLake by Datazip

Fastest way to replicate your data to Apache Iceberg.

OLake is an open-source data ingestion tool, available on GitHub and developed by Datazip, Inc. It replicates data from transactional databases and streaming platforms (such as PostgreSQL, MySQL, MongoDB, Oracle, and Kafka) into open lakehouse table formats such as Apache Iceberg.