History-Based Optimizations for SparkSQL

Zhengjie He, Hongnan Gan

Chinese Session 2025-07-25 16:45 GMT+8 (ROOM : Mtn BaiWang Hall) #olap

This presentation explores the application of History-Based Optimizations (HBO) in SparkSQL. We will cover the fundamental principles of HBO and explain why it is worth adopting even where Adaptive Query Execution (AQE) is already in place. In particular, we will examine how HBO addresses limitations inherent to AQE and thereby improves query performance and resource utilization.

We will also present our practical implementation of HBO in several large-scale scenarios and show its effectiveness in optimizing complex data workflows. In addition, we will detail some of the optimization rules designed to achieve significant cost savings, including rules that reduce unnecessary I/O operations and optimize memory usage.

One key challenge we will discuss is implementing HBO across different computational engines, a common pain point in heterogeneous big data environments.

Key takeaways:

A set of HBO optimization rules aimed at cost savings
How HBO can be implemented across different computational engines
Practical insights into implementing HBO at large scale

Speakers:

Zhengjie He: ByteDance Data Engineer

I am a big data engineer with several years of experience in query engine optimization, specializing in performance tuning and cost-efficient query execution for large-scale data systems.

Hongnan Gan: ByteDance Senior R&D Engineer

BS and MEng degrees from Fudan University
SparkSQL kernel developer at ByteDance
Interested in SQL optimization