Optimizing Parquet Storage: Metadata Management, Performance Tuning & Seamless Migration
Hongnan Gan
Chinese Session #datalake
This session delves into advanced techniques for optimizing Parquet storage ecosystems. We’ll explore a comprehensive approach to managing metadata, from efficient collection and storage to building a scalable metadata warehouse. Attendees will learn practical strategies for storage optimization, including ZSTD compression upgrades, local and global sorting, and column-level tuning for better performance. We’ll also cover seamless migration of legacy formats through in-place backtracking and heterogeneous format interoperability. Whether you’re handling large-scale datasets or fine-tuning storage efficiency, this session offers actionable insights to improve your Parquet workflows.
Key takeaways:
- Best practices for metadata lifecycle management (capture, storage, and utilization).
- Proven methods to boost query/compression performance with ZSTD and data organization.
- Strategies for transparent format migration without disrupting pipelines.
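As a rough illustration of why data organization matters for compression, the sketch below compares the compressed size of one column before and after sorting. It is not taken from the session: the column contents are made up, and `zlib` from the Python standard library stands in for ZSTD so the example runs without extra dependencies. The absolute numbers will differ with a real Parquet writer, but the direction of the effect should be similar.

```python
import random
import zlib


def compressed_size(values):
    """Serialize a column as newline-separated text and return its compressed size in bytes."""
    payload = "\n".join(values).encode("utf-8")
    # zlib is a stand-in codec here; a Parquet writer would be configured with ZSTD instead.
    return len(zlib.compress(payload, 6))


random.seed(0)

# Hypothetical low-cardinality column: 100,000 rows drawn from 50 distinct values.
categories = [f"region_{i:03d}" for i in range(50)]
column = [random.choice(categories) for _ in range(100_000)]

unsorted_size = compressed_size(column)
sorted_size = compressed_size(sorted(column))

print(f"unsorted: {unsorted_size} bytes")
print(f"sorted:   {sorted_size} bytes")
```

Sorting places identical values next to each other, which gives the codec long repeated runs to exploit. In a columnar file the same idea applies per row group, which is one reason sort order is worth tuning alongside the compression codec.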
Speakers:
- BS and MEng degrees from Fudan University
- Spark SQL kernel developer at ByteDance
- Interested in SQL optimization