Mastering Large-Scale Shuffle: The Dream11 Playbook with Remote Shuffle Service

Amit Shinde, Mohit Jain

English Session 2025-07-25 14:30 GMT+8 (ROOM : Mtn WanShou Hall) #datastorage

Dream11 uses Apache Celeborn as its remote shuffle service for petabyte-scale shuffle. Rack awareness minimizes cross-rack data transfer, while replicating shuffle data across different racks ensures high availability. Celeborn decouples shuffle storage from compute nodes, so storage scales elastically and independently of compute demand, and resources can be allocated dynamically to stay cost-efficient under varying workloads. Parallel partition writes and adaptive shuffle reads improve throughput and reduce latency. Together, this design improves fault tolerance and reduces job completion time by more than 50%.
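
The cross-rack replication idea in the abstract can be illustrated with a small, self-contained sketch. This is a toy model only, not Celeborn's actual slot-allocation logic and not Dream11's setup: the function name `assign_partitions`, the worker IDs, and the rack labels are all hypothetical. It assumes each shuffle partition gets one primary and one replica location, and that the replica must sit on a different rack from the primary.

```python
def assign_partitions(workers, num_partitions):
    """Toy rack-aware placement (hypothetical, for illustration only).

    workers: dict mapping worker id -> rack id.
    Returns a list of (primary, replica) worker ids, one pair per partition,
    where the replica is always on a different rack than the primary.
    """
    if len(set(workers.values())) < 2:
        raise ValueError("cross-rack replication needs workers in at least two racks")

    ids = sorted(workers)
    assignments = []
    for p in range(num_partitions):
        # Spread primaries round-robin across all workers.
        primary = ids[p % len(ids)]
        # Walk forward from the primary until a worker on another rack is found.
        # With at least two racks, this loop always finds a candidate.
        for step in range(1, len(ids)):
            candidate = ids[(p + step) % len(ids)]
            if workers[candidate] != workers[primary]:
                replica = candidate
                break
        assignments.append((primary, replica))
    return assignments


if __name__ == "__main__":
    cluster = {"w1": "rackA", "w2": "rackA", "w3": "rackB", "w4": "rackB"}
    for partition, (primary, replica) in enumerate(assign_partitions(cluster, 4)):
        print(partition, primary, cluster[primary], replica, cluster[replica])
```

Because every replica lands on a different rack, losing a single rack in this model still leaves one copy of each partition readable, which is the availability property the abstract describes.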

Speakers:

Amit Shinde: Technical Lead

Engineering Director at Dream11, with 14+ years of professional experience in software engineering.

Mohit Jain: Technical Lead

I work at the intersection of big data and ML, making Spark and Flink faster while scaling real-time, batch, and ML workloads. Whether it’s optimizing distributed computing, streamlining data pipelines, or building intuitive tools, my goal is to turn complex data challenges into seamless, high-performance solutions.