Optimizing Parquet Storage: Metadata Management, Performance Tuning & Seamless Migration

Hongnan Gan

Chinese Session #datalake

This session delves into advanced techniques for optimizing Parquet storage ecosystems. We’ll explore a comprehensive approach to managing metadata, from efficient collection and storage to building a scalable metadata warehouse. Attendees will learn practical strategies for storage optimization, including upgrading compression to ZSTD, local and global sorting, and column-level tuning for better performance. We’ll also cover seamless migration of legacy formats through in-place backtracking and heterogeneous format interoperability. Whether you’re handling large-scale datasets or fine-tuning storage efficiency, this session offers actionable insights to improve your Parquet workflows.

Key takeaways:

  1. Best practices for metadata lifecycle management (capture, storage, and utilization).
  2. Proven methods to boost query/compression performance with ZSTD and data organization.
  3. Strategies for transparent format migration without disrupting pipelines.
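
As a rough illustration of why data organization affects compression, the sketch below compresses the same low-cardinality column twice, once in arrival order and once sorted. It is not taken from the session: it uses Python's standard-library zlib as a stand-in for ZSTD, a newline-separated text encoding as a stand-in for Parquet column pages, and a hypothetical `region_NN` column, so only the direction of the effect carries over, not the actual numbers.

```python
import random
import zlib


def compressed_size(values):
    """Join a column as newline-separated text and return its zlib-compressed size in bytes."""
    payload = "\n".join(values).encode("utf-8")
    return len(zlib.compress(payload, 6))


random.seed(0)

# Hypothetical low-cardinality column: 100,000 rows drawn from 50 distinct values.
column = [f"region_{random.randrange(50):02d}" for _ in range(100_000)]

# Same values, two layouts: arrival order versus sorted order.
unsorted_size = compressed_size(column)
sorted_size = compressed_size(sorted(column))

print(f"unsorted: {unsorted_size} bytes")
print(f"sorted:   {sorted_size} bytes")
```

Sorting groups identical values into long runs, which general-purpose compressors (and Parquet's dictionary and run-length encodings) can represent far more compactly. Which columns to sort by, and whether to sort within files or across the whole dataset, depends on the workload.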

Speaker:

Hongnan Gan

  • BS and MEng degrees from Fudan University
  • Spark SQL kernel developer at ByteDance
  • Interested in SQL optimization