Lance, the data format for the frontiers of multimodal AI

Lei Xu

Chinese Session #ai

Frontier multimodal models are trained on petabytes of data, encompassing video, images, and long text. The complexity and scale of this new AI data pose challenges to existing data infrastructure.

The Apache-licensed Lance format is built on Apache Arrow and Apache DataFusion, with a core written in Rust, by a team of Apache Hadoop, Apache HBase, Apache Iceberg, Apache Arrow and Delta Lake PMC members. Lance is a new AI-focused columnar format and table format, heavily inspired by the Apache Parquet, Apache Iceberg and Apache Hudi projects. Its distinguishing features are random access and zero-cost schema evolution, two of the features AI engineers value most. These features set Lance apart from Apache Parquet, Apache ORC and Apache Iceberg, making it better suited to multimodal AI feature engineering and training.

The Lance format is used by many leading AI companies, including MidJourney, WorldLabs, Runway ML, and Character AI.

In this session, LanceDB CTO Lei Xu (Apache Hadoop PMC member) and ByteDance Volcano Engine expert Vino Yang (Apache Hudi PMC member) will jointly present:

  • The infrastructure challenges of supporting AI workloads at frontier multimodal AI companies
  • The core design principles behind the Lance format
  • How ByteDance Volcano Engine builds Lance Datalake on top of the Lance format and supports some of the best AI companies in the world

Speakers:


Lei Xu: CTO of LanceDB. Apache Hadoop / HDFS PMC member. Led the Machine Learning Platform and Data Infra teams at Cruise Automation.