Practice of Flink Memory Governance at ByteDance

Yiheng Tang

Chinese Session #streaming

As the demand for streaming tasks continues to grow within ByteDance, Flink has been widely adopted at scale across various business domains. Among the resource costs of these large-scale tasks, memory is a significant contributor, especially heap memory: the total memory allocated to all tasks has reached tens of thousands of terabytes, yet JVM heap utilization remains below 50% and container-level memory usage is under 70%. Against the backdrop of company-wide cost reduction and efficiency improvement, we have carried out a series of memory optimizations focused on heap memory usage prediction, off-heap memory usage tracking, and simplifying Flink’s memory model. These efforts have been rolled out across ByteDance and have yielded memory savings of more than a thousand terabytes. In this talk, we will present the key optimizations jointly driven by the Flink and JVM teams at ByteDance and share the results we have achieved.
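For context on the memory model the talk refers to: in open-source Flink, a TaskManager's memory is split into several independently sized components (task heap, managed memory, network buffers, JVM metaspace and overhead), and budget reserved for one component cannot be borrowed by another, which is part of why end-to-end utilization can stay low. A minimal flink-conf.yaml sketch using the standard options (values are illustrative only, not ByteDance's production settings):

```yaml
# Total size of the TaskManager process: JVM heap + off-heap + JVM overhead.
taskmanager.memory.process.size: 4096m

# JVM heap reserved for user code running in tasks.
taskmanager.memory.task.heap.size: 1536m

# Off-heap memory managed by Flink (e.g. RocksDB state backend, batch sorting).
taskmanager.memory.managed.size: 1024m

# Fraction of total Flink memory used for network buffers (data exchange between tasks).
taskmanager.memory.network.fraction: 0.1

# Headroom for JVM metaspace and other JVM/native overhead.
taskmanager.memory.jvm-metaspace.size: 256m
taskmanager.memory.jvm-overhead.fraction: 0.1
```

A "unified memory pool" (agenda item 4 below) is one way to let such per-component budgets be shared rather than fixed at startup; how this is done at ByteDance is the subject of the talk.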

Agenda:

  1. Background
  • Current memory usage status of Flink at ByteDance
  • Key challenges we are facing
  2. Heap memory usage prediction
  3. Off-heap memory usage tracking
  4. Simplification of the Flink memory model: Unified memory pool
  5. Implementation and benefits
  6. Future plans
  • Exploration of further optimization directions

Speakers:


Yiheng Tang
Flink Runtime Developer at ByteDance