Hybrid Search in Apache Doris
Lee Happen
Chinese Session #aiDoris’s hybrid search capability combines traditional full-text search (keyword-based lexical search) with vector search (semantics-based search) to deliver an advanced search function aimed at improving the accuracy and relevance of search results. This capability is particularly well-suited for complex search scenarios that require both keyword matching and semantic understanding, such as e-commerce, content recommendation, and knowledge base search. Core Concepts of Hybrid Search Hybrid search leverages the strengths of both search methods to provide more precise search results: Full-Text Search (BM25): Based on inverted indexes and keyword matching, it excels at accurately matching user-input query terms. Doris uses the BM25 algorithm (default) to calculate relevance scores between documents and queries, suitable for structured text search.
Vector Search (Semantic Search): By converting text into vectors (embeddings), machine learning models are used to compute semantic similarity between queries and documents, excelling at understanding query intent and context.
Fusion Mechanism: Hybrid search integrates the results of both methods using specific scoring and ranking techniques, such as Reciprocal Rank Fusion (RRF) or Convex Combination (CC), to balance lexical and semantic relevance.
How Hybrid Search Works Doris’s hybrid search relies on the following technical components and workflows: Text Fields: Inverted indexes are generated via tokenizers for full-text search.
Vector Fields: Models are used to convert text into vector types, stored in vector fields.
Composite Indexing: Supports simultaneous storage of text and vector fields, enabling hybrid queries.
Query Execution: Lexical Query: Uses match queries or similar DSL queries to retrieve documents matching keywords based on the BM25 algorithm.
Vector Query: Utilizes knn (K-Nearest Neighbors) queries or ANN (Approximate Nearest Neighbors) indexes to retrieve semantically relevant documents based on vector similarity (e.g., cosine similarity).
Hybrid Query: Combines both query types, such as running match and knn simultaneously, then integrates results using a fusion algorithm.
Result Fusion: Reciprocal Rank Fusion (RRF): Calculates a composite score based on document rankings in different query results, emphasizing documents that rank high across multiple search methods. RRF relies on rank order rather than specific scores, making it suitable for heterogeneous result fusion.
Convex Combination (CC): Combines BM25 and vector query scores through weighted summation, requiring manual adjustment of weights to balance the contributions of both search methods.
Scoring and Re-ranking: Hybrid search results can be further optimized using scripted scoring (script_score) or re-ranking models (Rerank) to ensure the final results better align with user intent.
Speakers:
Apache Doris PMC Member