Resolving Data Silos: Apache Gravitino's Production Implementation Practices at Bilibili

Tianhang Li

Chinese Session #datalake

Apache Gravitino is a unified metadata management platform that Bilibili has adopted to address data silos, providing an integrated metadata view across heterogeneous data sources such as Hive, Iceberg, and Kafka. With end-to-end lineage tracking, it traces data workflows from ingestion through processing to service delivery, which helps optimize resource utilization and supports impact analysis of schema changes. By integrating Iceberg’s partitioning strategies (e.g., Truncate/Bucket) and Branch features, Gravitino supports flexible data versioning, multi-stream data stitching, and isolated testing environments. For AI-driven scenarios, the platform provides systematic training dataset partitioning and model version management, ensuring reproducibility and streamlined AI asset governance. This solution reduces cross-source maintenance overhead, improves data consistency, and establishes a scalable technical foundation for Bilibili’s AI applications and enterprise-wide data governance.

Speakers:


Li Tianhang is a Big Data Development Engineer at Bilibili, where he specializes in metadata management and Spark compute engine optimization for large-scale data scenarios.