History based optimizations for SparkSQL

He Zhengjie

Chinese Session #olap

In this presentation, we will conduct an in-depth exploration of the application of History-Based Optimizations (HBO) within the context of SparkSQL, a critical component in modern big data processing ecosystems. Our discussion will not only cover the fundamental principles of HBO but also delve into the compelling reasons for its adoption, especially in scenarios where Adaptive Query Execution (AQE) is already in place. We will dissect how HBO addresses limitations inherent to AQE, thereby providing enhanced query performance and resource utilization.

The presentation will also showcase our practical implementation of HBO in diverse large-scale scenarios, highlighting its effectiveness in optimizing complex data workflows. In addition, we will detail some optimization rules designed to achieve significant cost savings, covering aspects such as reducing unnecessary I/O operations and optimizing memory usage.

One of the key challenges we will discuss is the implementation of HBO across different computational engines, a common pain point in heterogeneous big data environments.

Key takeaways:

  • A set of optimization rules
  • Cross engine
  • Practical insights into implementing HBO

Speakers:


I am a Big Data Engineer with several years of expertise in Query engine optimization, specializing in performance tuning and cost-efficient query execution for large-scale data systems.