Mastering Large-Scale Shuffle: The Dream11 Playbook with Remote Shuffle Service

Ravi Rana

English Session #datastorage

At Dream11, Apache Celeborn serves as the remote shuffle service for petabyte-scale shuffle. It uses rack awareness both to minimize cross-rack data transfer and to keep shuffle data highly available by replicating it across different racks. Celeborn decouples shuffle storage from compute nodes, so storage can scale elastically and independently of compute demand, with resources allocated dynamically to stay cost-efficient under varying workloads. Parallel partition writes and adaptive shuffle reads improve throughput and reduce latency. Together, this design improves fault tolerance and reduces job completion time by over 50%.
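
The idea of replicating shuffle data across different racks can be illustrated with a minimal sketch. This is a toy model under stated assumptions, not Celeborn's actual slot-allocation logic: the function `place_partitions`, the worker names, and the rack labels are all hypothetical, and the round-robin policy is chosen only for simplicity.

```python
def place_partitions(workers, num_partitions):
    """Assign each partition a (primary, replica) worker pair on different racks.

    workers: dict mapping a hypothetical worker name to its rack label.
    Returns a list of (primary, replica) tuples, one per partition.
    """
    if len(set(workers.values())) < 2:
        raise ValueError("cross-rack replication needs workers on at least two racks")

    names = sorted(workers)
    placements = []
    for partition in range(num_partitions):
        # Spread primaries over all workers in round-robin order.
        primary = names[partition % len(names)]
        # Only workers on a different rack are eligible to hold the replica,
        # so losing one rack never loses both copies of a partition.
        candidates = [w for w in names if workers[w] != workers[primary]]
        replica = candidates[partition % len(candidates)]
        placements.append((primary, replica))
    return placements


# Hypothetical four-worker, two-rack cluster.
workers = {"w1": "rackA", "w2": "rackA", "w3": "rackB", "w4": "rackB"}
placements = place_partitions(workers, 8)
for partition, (primary, replica) in enumerate(placements):
    print(partition, primary, workers[primary], replica, workers[replica])
```

Each partition is written twice, but the two copies always sit on different racks, which is the availability property the abstract describes; a real scheduler would also weigh disk capacity and load.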

Speakers:


Technical Lead at Dream11 with 10+ years of professional experience in software engineering.