About this event
10 years of Arrow. 30 minutes to understand why it's everywhere.
If you work with modern data infrastructure, Arrow is almost certainly running somewhere in your stack. Most engineers never notice it.
Arrow solved a real problem: moving data between systems required serializing and deserializing at every boundary. CPU cycles, memory copies, latency. At scale, that cost compounds fast. Arrow's solution was a language-agnostic columnar memory format any system could share without copying. What started as a memory layout spec became the execution substrate of the modern data stack.
In this 30-minute session, Badal Singh, who has contributed to Apache Iceberg Go and built OLake's Arrow-based ingestion writer pushing 550,000+ rows per second, will break down how Arrow works and why it ended up everywhere in the modern data stack.
Hosted by
Badal is a Software Engineer at Datazip working on distributed data systems and lakehouse infrastructure. His day-to-day involves building high-performance data writers and working deep in the internals of Apache Iceberg, Apache Arrow, and storage engines. He contributed to Apache Iceberg Go, implementing a partitioned table writer using Apache Arrow. At Datazip, he built V0 of OLake's Arrow-based Full Load and CDC writer for ingestion into Apache Iceberg tables — pushing full load throughput beyond 550,000 rows per second. He doesn't just use Arrow. He builds on top of it.
Harsha is a user-first GTM specialist at Datazip who helps early-stage startups go from zero to one. With a flair for technical go-to-market strategy and a startup enthusiast's mindset, she bridges the gap between innovative products and meaningful market adoption.
OLake is an open-source data ingestion tool developed by Datazip, Inc. and available on GitHub. It replicates data from transactional databases and streaming platforms (PostgreSQL, MySQL, MongoDB, Oracle, and Kafka) into open lakehouse table formats such as Apache Iceberg.