OLake by Datazip invites you to their event

Scalable, Real-Time Data Pipelines: Distributed Stream Processing in Practice

About this event

This technical session examines real-world challenges and patterns in building distributed stream processing systems. It focuses on scalability, fault tolerance, and latency trade-offs through a concrete case study, using frameworks such as Apache Storm to illustrate production concepts.

Why Should You Attend

Learn practical patterns for distributed stream processing at scale:

  • Master real-world challenges - Understand scalability, fault tolerance, and latency trade-offs in production
  • See architectural patterns - Stateless vs. stateful processing, event time vs. processing time decisions
  • Handle scale bottlenecks - Partitioning strategies, backpressure handling, and scheduling challenges
  • Learn from concrete examples - Real ML feature generation pipeline using Storm and Kafka

Perfect for: Data engineers building distributed streaming systems who need production-proven patterns.

----------------------------------------------------------------------------------------------------------------------

Agenda (30 minutes)

1. Stream Processing: Past and Present (4 minutes)

  • Rise of real-time data needs in ML, analytics, and user-facing apps
  • Shift from batch-first to event-first architectures

2. Distributed Stream Processing Fundamentals (5 minutes)

  • What stream processing is and how it differs from batch
  • Delivery guarantees: at-most-once, at-least-once, exactly-once
  • Batch vs. micro-batch vs. true streaming
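
The delivery guarantees above can be sketched in a few lines of plain Python (the event stream and names here are illustrative, not from the talk). Under at-least-once delivery a consumer may see the same event twice after a retry; deduplicating on a stable event id makes the result effectively exactly-once:

```python
# Sketch: at-least-once delivery plus idempotent processing.
# Event ids and the simulated stream are hypothetical.

def process(events, seen_ids, totals):
    """Apply each event once, skipping ids already processed."""
    for event_id, user, amount in events:
        if event_id in seen_ids:  # redelivered duplicate: skip
            continue
        seen_ids.add(event_id)
        totals[user] = totals.get(user, 0) + amount

# Simulated stream where event 2 is redelivered after a failure.
stream = [(1, "alice", 10), (2, "bob", 5), (2, "bob", 5), (3, "alice", 7)]
seen, totals = set(), {}
process(stream, seen, totals)
print(totals)  # bob's duplicate is counted only once
```

Dropping the `seen_ids` check would turn this into plain at-least-once (duplicates double-count); dropping retries entirely gives at-most-once (duplicates never occur, but events can be lost).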

3. Architectural Patterns (6 minutes)

  • Stateless vs. stateful processing
  • Event time vs. processing time
  • Schedulers

Common architecture: Kafka → Stream Processor → Sink (DB, Lake, Dashboard)
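
That source → processor → sink shape can be sketched in-memory (the `source`, `enrich`, and `sink` names are illustrative; a real pipeline would use a Kafka consumer as the source and a database or lake writer as the sink):

```python
# Minimal in-memory sketch of the Kafka → Stream Processor → Sink pattern.

def source():
    """Stand-in for a Kafka consumer yielding raw events."""
    yield {"user": "alice", "action": "click"}
    yield {"user": "bob", "action": "view"}

def enrich(event):
    """Stateless transform: each output depends only on the current event."""
    return {**event, "pipeline": "demo"}

sink = []  # stand-in for a DB, lake table, or dashboard feed
for evt in source():
    sink.append(enrich(evt))

print(sink)
```

Because `enrich` is stateless, instances of it can be scaled out freely; a stateful operator (e.g. a running count per user) would additionally need keyed partitioning and checkpointed state.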

4. Designing for Scale (6 minutes)

  • Partitioning strategies and operator parallelism
  • Handling backpressure and traffic spikes
  • Scheduling challenges and system bottlenecks
  • Fault tolerance and availability
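
A minimal sketch of the keyed-partitioning idea above (partition count and keys are illustrative): hash partitioning pins each key to a fixed partition, so all events for one key land on the same operator instance and per-key order is preserved.

```python
# Sketch: stable hash partitioning by key.
import zlib

NUM_PARTITIONS = 4  # illustrative

def partition_for(key: str) -> int:
    # crc32 is stable across runs (unlike Python's salted hash()),
    # which matters when producers and consumers must agree on the mapping.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

events = ["user-1", "user-2", "user-1", "user-3", "user-1"]
assignment = [(k, partition_for(k)) for k in events]

# Every occurrence of "user-1" maps to the same partition.
assert len({p for k, p in assignment if k == "user-1"}) == 1
print(assignment)
```

Hot keys are the catch: if one key dominates traffic, its partition becomes the bottleneck, which is where backpressure handling and key-splitting strategies come in.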

5. Case Study: Real-Time ML Feature Generation (10 minutes)

  • Event Source (Kafka): Collects user events
  • Stream Engine (Apache Storm): Processes and transforms streams
  • Storage (S3): Stores aggregated feature datasets
  • Setup: 1 Nimbus + 3 Workers distributed topology
  • Model Training: Python jobs consume features
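
The feature-generation step in a pipeline like this can be sketched as a tumbling-window count per user keyed on event time (the window size, user names, and timestamps are illustrative; in the real pipeline a Storm bolt would compute this and write results to S3):

```python
# Sketch: per-user event counts over tumbling event-time windows.
from collections import defaultdict

WINDOW_SECS = 60  # illustrative window size

def feature_counts(events):
    """events: iterable of (user, event_time_secs).
    Returns {(user, window_start): count}."""
    counts = defaultdict(int)
    for user, ts in events:
        window_start = (ts // WINDOW_SECS) * WINDOW_SECS
        counts[(user, window_start)] += 1
    return dict(counts)

events = [("alice", 5), ("alice", 59), ("bob", 61), ("alice", 61)]
print(feature_counts(events))
# {('alice', 0): 2, ('bob', 60): 1, ('alice', 60): 1}
```

Keying on event time (the timestamp in the record) rather than processing time keeps features correct even when events arrive late or out of order, at the cost of needing a policy for how long to wait for stragglers.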

Hosted by

  • Guest speaker
    Hasan Geren, Data Engineer @ ProcurePro

    Hasan is a Data Engineer who has:
    • Designed and optimised scalable databases and cloud storage architectures.
    • Built low-latency data pipelines to support real-time applications and analytics dashboards.
    • Developed AI/ML-based solutions, including LSTM models and recommendation systems, to enhance user engagement.
    • Collaborated across teams to drive actionable insights, ensuring data solutions align with business goals.

  • Guest speaker
    Shubham Satish Baldava

  • Team member
    Harsha Kalbalia, GTM @ Datazip | Founding Member @ Datazip

    Harsha is a user-first GTM specialist at Datazip, helping early-stage startups go from zero to one. With a knack for technical market strategy and a startup enthusiast's mindset, she bridges the gap between innovative solutions and meaningful market adoption.

OLake by Datazip

Fastest way to replicate your data to Apache Iceberg.

OLake is an open-source data ingestion tool available on GitHub, developed by Datazip, Inc. Its primary function is to replicate data from transactional databases and streaming platforms (such as PostgreSQL, MySQL, MongoDB, Oracle, and Kafka) into open lakehouse table formats such as Apache Iceberg.