Jo Kristian Bergum
Vespa Solutions Architect

Redefining Hybrid Search Possibilities with Vespa - part one

This is the first blog post in a series of two on hybrid search. It discusses first-phase retrievers in a ranking pipeline, where the retriever aims to find reasonably good candidates for a query under a limited computing budget, so that retrieval scales to large collections of documents without incurring high costs.

Introduction

Representational approaches in information retrieval involve converting queries and documents into structured forms, like vectors, to make retrieval more efficient. Representational approaches allow querying collections of documents without having to score all documents for a query.

Query and document representations for information retrieval come in two main types: sparse and dense. For both types, we can distinguish between unsupervised and supervised representations. In other words, the representations of queries and documents can either be learned using supervision or derived unsupervised from corpus statistics.

Let’s pause for a moment to underscore the distinction between the retrieval and ranking phases. Many developers, accustomed to working with technologies that offer minimal support for phased ranking, might instinctively view this through the lens of systems where retrieval and ranking are implemented by distinct services. In such systems, there’s often a separation of technologies, with retrieval handled by a subsystem—like Apache Solr or Elasticsearch—and ranking primarily executed within a dedicated ‘ranker’ system that re-ranks a shallow pool of candidates retrieved by the retriever subsystem.

Unlike systems characterized by discrete retrieval and ranking technologies, Vespa has rich support for both retrieval and phased ranking. In Vespa, the retrieval phase is expressed with the Vespa query language, using top-k scoring query operators (e.g., nearestNeighbor, wand). These operators, with the help of index structures, accelerate the search for the top-k scoring documents without having to score all documents in the collection.

Let us illustrate top-k scoring operators with an example from the database world. Imagine that we have a table products and we want to find the top-100 most popular products using the following SQL query:

SELECT * FROM products ORDER BY popularity DESC LIMIT 100;

If we haven’t specified any index on the popularity column, the above query would likely entail a linear table scan over all the rows in the products table, followed by a sort operation, and finally, returning the top-100 products sorted by descending popularity.

A simplistic index implementation to speed up such top-k queries, avoiding reading the popularity value of all the rows, would be to maintain a sorted data structure over all the unique popularity values, with each entry pointing to its row ID. At query time, we could traverse this sorted data structure, starting at the largest value, until we have found 100 products. This illustrates a basic form of efficient top-k retrieval. Things get more complex as you introduce additional filters to the query and begin ranking documents based on user-specified information, typically expressed as a free-text query.
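
To make the idea concrete, here is a minimal Python sketch of such a sorted index structure and the traversal described above. The data and names (popularity_index, top_k) are made up for illustration.

# A sorted "index" over the popularity column: (popularity, row_id) pairs,
# kept in descending popularity order.
popularity_index = [
    (987, "row-17"),
    (954, "row-3"),
    (901, "row-42"),
    (877, "row-8"),
]

def top_k(index, k):
    # Traverse from the largest popularity value until k rows are found,
    # without reading the popularity of every row in the table.
    return [row_id for _, row_id in index[:k]]

print(top_k(popularity_index, 2))  # ['row-17', 'row-3']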

In the following sections, we discuss representational approaches in information retrieval that involve converting queries and documents into structured forms, like vectors, to make retrieval (top-k search) more efficient, allowing querying collections of billions of documents without scoring all documents for a query.

Sparse representations and retrieval

With sparse representations, documents and queries are represented as sparse vectors with up to |V| dimensions, where V is the vocabulary (unique terms in the corpus). Only dimensions (terms) that have a non-zero weight impact the score.

A popular unsupervised scoring function using sparse representations is BM25. BM25 can be expressed as a dot product between the query term weights and the document term weights. The term weights are assigned using document collection statistics such as inverse document frequency and term frequency. The BM25 scoring function does not take the order of the terms in the text into account; put simply, a document or a query is represented as a “bag of terms”. BM25 offers the advantage of being easy to apply to your document collection, providing a decent baseline for text search without any labeled data. During index building, the corpus is analyzed to determine term occurrences and to calculate term statistics, such as term frequency and document frequency. The drawback, however, is its vulnerability to vocabulary mismatch: if the query uses different terms than those in the document, retrieval may fail.
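
As a rough illustration, the following Python sketch expresses the standard BM25 formula as a sum (dot product) over the query terms. The function signature and parameter defaults are assumptions for illustration, not Vespa’s implementation.

import math

def bm25(query_terms, doc_term_freqs, doc_len, avg_doc_len,
         doc_freq, num_docs, k1=1.2, b=0.75):
    """Toy BM25: a dot product between query term and document term weights."""
    score = 0.0
    for term in query_terms:
        tf = doc_term_freqs.get(term, 0)
        if tf == 0:
            continue  # only terms present in both query and document contribute
        idf = math.log(1 + (num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        weight = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * weight
    return score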

A more recent example of a sparse representational model is SPLADE (Sparse Lexical and Expansion Model for Information Retrieval). Given labeled data of the form (query, relevant document, irrelevant document), SPLADE learns sparse representations of documents and queries that optimize in-domain search ranking performance.

Compared to BM25, SPLADE representations of both queries and documents are denser, as queries and documents are expanded with terms. In addition to more non-zero dimensions, the term weights are contextualized by other terms in the text within the context length limitation of the Transformer model that SPLADE uses. As with BM25, the scoring function of SPLADE is expressed as a dot product between the non-zero query and document terms.
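
Conceptually, the scoring is the same kind of sparse dot product as for BM25, only over learned, expanded term weights. A minimal Python sketch with made-up token identifiers and weights:

# Sparse representations as {token_id: weight}, as produced by a model like SPLADE.
query_rep = {2622: 2.1, 7128: 1.7, 2933: 0.4}   # expanded query terms
doc_rep   = {2622: 1.9, 7128: 2.3, 2054: 0.8}   # expanded document terms

def sparse_dot_product(q, d):
    # Only dimensions that are non-zero in both query and document contribute.
    return sum(weight * d[token] for token, weight in q.items() if token in d)

print(sparse_dot_product(query_rep, doc_rep))  # ~7.9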

Top-k retrieval for both of the mentioned sparse representational methods can be done in (potentially) sub-linear time using inverted indices. The search for the top-k ranked documents can be accelerated by dynamic pruning algorithms like WAND. Dynamic pruning retrieval algorithms attempt to avoid exhaustively scoring all documents (linear complexity) that match at least one of the terms in the query.

The effectiveness of dynamic pruning algorithms for top-k scoring, as opposed to exhaustive scoring, depends on factors such as k, sparseness (the number of non-zero terms), and the distribution of term weights. In the case of learned sparse expansion models like SPLADE, the efficiency of result pruning suffers from both the increased density and the “wacky” term weights. This negative impact on serving performance might be unexpected, given that many have advocated for sparse expansion models based on their perceived simplicity in utilizing inverted index structures.

Acceleration of sparse top-k retrieval, using dynamic pruning algorithms like WAND, is the magic sauce of search engine implementations. Implementing a BM25 scoring function for a query and document pair is trivial, while implementing top-k pruning algorithms to avoid exhaustively scoring all documents is not. This means that technologies that advertise support for sparse vector representations might not dislodge mature search engines.

Sparse retrieval in Vespa

There are two WAND algorithm implementations for sublinear top-k retrieval in Vespa that are useful for retrieval models using sparse representations:

weakAnd

The weakAnd() query operator fully integrates with language-dependent linguistic processing of text. This operator does not require much thinking: define a string field in the Vespa schema, feed the documents, and issue queries:

{
  "query": "what was the manhattan project?",
  "yql": "select * from sources * where userQuery()"
}

This is a handy starting point for developers: they can easily set up a solid text ranking baseline for their data without delving into Vespa’s inner workings. Vespa speeds up the query with the WAND algorithm, skipping the need to score every matching document in the corpus. It’s worth noting that for certain cases with smaller document collections (a few million documents), scoring all matching documents is doable within 100 milliseconds on modern hardware, using multithreaded query execution, without attempting to prune the result.
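
A minimal schema sketch for such a text search baseline could look like the following; the field name text and enabling BM25 are illustrative assumptions:

schema doc {
    document doc {
        field text type string {
            indexing: summary | index
            index: enable-bm25
        }
    }
}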

wand

The wand() Vespa query operator is useful for retrieval using learned sparse representations of documents and queries, beyond text search use cases. The wand implementation in Vespa was primarily developed for efficient candidate retrieval in large-scale recommendation use cases, where interaction data is used to learn sparse representations of users and items.

The developer controls the vocabulary (the number of unique dimensions, or “terms”), and the query and document weights of each dimension are provided with integer precision. That means that if the sparse representational model uses float weights, the float weights must be scaled to integer representations to be used with the wand query operator.

Take SPLADE, for instance: it uses a vocabulary of around 30K wordpieces from the English BERT wordpiece tokenizer. Instead of using the actual word or subword string representations, you can use the subword vocabulary identifiers, which fit into the signed int type in Vespa. For instance, the phrase “what was the manhattan project” can be represented by the corresponding wordpiece token identifiers 2054, 2001, 1996, 7128, 2622.
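
A sketch of this mapping, using the Hugging Face transformers tokenizer for bert-base-uncased; the float weights and the scaling factor of 100 are illustrative assumptions:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("what was the manhattan project")
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)  # [2054, 2001, 1996, 7128, 2622]

# Float weights from a sparse model must be scaled to integers for wand,
# for example by multiplying with a constant and rounding.
float_weights = {2054: 0.12, 2001: 0.09, 1996: 0.07, 7128: 0.34, 2622: 0.53}
int_weights = {tid: round(w * 100) for tid, w in float_weights.items()}
print(int_weights)  # {2054: 12, 2001: 9, 1996: 7, 7128: 34, 2622: 53}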

The term weights are assigned by the sparse representational model. In Vespa, learned sparse representations for top-k retrieval with the wand query operator are best expressed using a schema field of type weightedset:

schema doc {

    field sparse_rep type weightedset<int> {
        indexing: summary | attribute
        attribute: fast-search
    }

}

In ranking phases, one can also use Vespa’s tensor support, representing the sparse representation with a mapped tensor, but this post focuses on using sparse representations for efficient top-k retrieval, where we currently have to use integer precision weights.

Then, using the model output (converted to integer weights), feed documents to Vespa:

{
    "id": "id:namespace:doc::1",
    "fields": {
        "sparse_rep": {
            "2054": 12,
            "2001": 9,
            "1996": 7,
            "7128": 34,
            "2622": 53
        }
    }
}

One can then query the sparse representation using the wand query operator:

{
    "yql": "select * from sources * where ({targetHits:100}wand(sparse_rep,@sparse_query_rep))",
    "sparse_query_rep": "{2622:45, 7128:23}"
}

Dense representations and retrieval

With dense representations, queries and documents are embedded into a latent low-dimensional dense vector space where most dimensions have a non-zero weight.

Scoring a document for a query can be done by a vector similarity function such as cosine or a dot product. Dense vector text representations are commonly referred to as text embeddings and similar to SPLADE, require inference with a Transformer-based language model for both queries and documents.

Dense representations using language models require in-domain training data (supervised, in the same way as other supervised retrieval models), or one must hope that an off-the-shelf embedding model transfers to your data in a zero-shot setting.

Retrieval over dense representations is accelerated by approximate nearest neighbor search algorithms, for example indexing the document vector(s) representations using HNSW indexing.

In addition to single-vector representations, there are also multi-vector representations of text, which represent a text using multiple term vectors. Take ColBERT, for instance: it learns contextualized term vector embeddings, and scoring uses a multi-vector similarity expression.
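
A minimal numpy sketch of this kind of multi-vector similarity (the MaxSim-style late interaction used by ColBERT), with made-up dimensions:

import numpy as np

def late_interaction_score(query_vectors, doc_vectors):
    # query_vectors: (num_query_terms, dim), doc_vectors: (num_doc_terms, dim).
    # For each query term vector, take the maximum dot product over all
    # document term vectors, then sum these maxima (ColBERT-style MaxSim).
    similarities = query_vectors @ doc_vectors.T   # (query_terms, doc_terms)
    return similarities.max(axis=1).sum()

query_vectors = np.random.rand(4, 128)    # 4 query term vectors, dimension 128
doc_vectors = np.random.rand(120, 128)    # 120 document term vectors
print(late_interaction_score(query_vectors, doc_vectors))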

Dense retrieval in Vespa

An important difference between using dense representations for efficient retrieval and using them only for ranking is that the latter does not require building data structures for fast, but approximate, nearest neighbor search.

In Vespa, it is possible to retrieve using a sparse representation over inverted indices, accelerated by any of the mentioned sparse retrieval methods, and then use dense representational models during the ranking phases.

Such approaches are attractive as they eliminate the need for HNSW data structures to facilitate fast dense retrieval, which can be a significant cost driver for real-time indexing in large-scale production deployments. This is also true for retrieval and ranking models that use several dense representational models: adding HNSW indexes to dense representations that are only used in ranking phases is a waste of computing and storage.

The following example defines two document embedding fields where only the first has an HNSW index enabled (controlled by the index switch).

schema doc {
    document doc {
        field body type array<string> {..}
        field llm_summary type string {..}
    }
    field embedding1 type tensor<bfloat16>(p{}, x[384]) {
        indexing: input body | embed model-1 | attribute | index
    }
    field embedding2 type tensor<bfloat16>(x[768]) {
        indexing: input llm_summary | embed model-2 | attribute
    }
}

For this type of schema, we can efficiently retrieve over the embedding1 field with an HNSW index while using embedding2 in ranking phases, without incurring the overhead of HNSW indexing for the embedding2 field. Details on representing text embedding models in Vespa are found in the embedding documentation.

{
  "yql": "select * from sources * where {targetHits:100}nearestNeighbor(embedding1, q1)",
  "input.query(q1)": "embed(model-1, \"the query\")",
  "input.query(q2)": "embed(model-2, \"the query\")"
}

In the above Vespa query example, we retrieve efficiently over the embedding1 field with HNSW enabled using the nearestNeighbor query operator. In addition, we run inference with model-2 to produce another query vector representation that can be used during Vespa ranking phases.

Enabling the combination of sparse and dense representations in the same query enhances the flexibility of the way we can implement hybrid search. This flexibility helps manage trade-offs involving indexing speed, resource usage, and query performance. For instance, the HNSW graph algorithm relies on accessing document vectors during both query and ingest times.

This implies that, for optimal performance, the vectors must be stored in memory. However, if vector similarity is not used for top-k retrieval, the vector data can be paged from disk during ranking phases, because the ranking phases perform significantly fewer random accesses to vector data.

Developers can use the Vespa paged attribute option, which allows the system to page the tensor data from disk as needed during ranking:

field embedding2 type tensor<bfloat16>(x[768]) {
    indexing: input llm_summary | embed model-2 | attribute
    attribute: paged
}

Vectors, or tensors in general, can easily have a large footprint per document (the field above uses 1536 bytes per document). Moving large amounts of vector data across the network for re-ranking will quickly saturate network capacity for smaller instance types. With Vespa, developers can express that the similarity calculations should happen where the data is stored, eliminating network throughput as a scaling bottleneck.

There is also increasing interest in using sparse representation models for retrieval and semantic (dense) models for re-ranking. In Lexically Accelerated Dense Retrieval (LADR), the authors propose an interesting approach that combines lexical (sparse) retrieval with re-ranking using a dense model. LADR uses a two-phase retrieval process, first retrieving a set of candidate documents using lexical (sparse) retrieval, for example BM25. The retrieved documents seed an exploration of a proximity graph built from document-to-document similarities. The candidates retrieved by the initial phase and those discovered through the graph traversal are then scored against the query with the semantic (dense representational) ranking model. Interestingly, this approach lifts precision ranking metrics like nDCG@10 compared to best-case exhaustive dense similarity search, because the initial sparse phase targets exact keyword matches.
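
A rough, self-contained Python sketch of this two-phase process; the toy lexical index, proximity graph, and dense scores below are made-up placeholders, not an actual LADR implementation:

# Toy stand-ins for the components LADR assumes: a lexical retriever,
# a document-to-document proximity graph, and a dense (semantic) scorer.
lexical_index = {"manhattan project": ["d1", "d4"]}
proximity_graph = {"d1": ["d2", "d3"], "d4": ["d5"]}
dense_scores = {"d1": 0.9, "d2": 0.95, "d3": 0.4, "d4": 0.7, "d5": 0.85}

def ladr_retrieve(query, k=3):
    # Phase 1: lexical (sparse) retrieval seeds the candidate pool.
    candidates = set(lexical_index.get(query, []))
    # Expand the pool by following proximity edges from the seed documents.
    for doc_id in list(candidates):
        candidates.update(proximity_graph.get(doc_id, []))
    # Phase 2: score all candidates with the dense (semantic) model.
    ranked = sorted(candidates, key=lambda d: dense_scores[d], reverse=True)
    return ranked[:k]

print(ladr_retrieve("manhattan project"))  # ['d2', 'd1', 'd5']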

Hybrid Retrieval in Vespa

Research indicates that combining dense and sparse methods can improve the overall ranking effectiveness. The classic hybrid retrieval approach combines dense and sparse retrieval, but requires technology that supports both retrieval types. Vespa supports hybrid retrieval by combining the sparse wand (or weakAnd) and dense nearestNeighbor query operators in the same query.

There are multiple useful ways to combine these two dynamic pruning algorithms in the Vespa Query Language for efficient top-k retrieval:

Disjunction (OR)

With disjunction, we can express retrieval using sparse and dense top-k retrieval algorithms in the same query. The retrieved hits (potentially disjoint sets of candidates) are then exposed to the Vespa ranking phases.

{
  "yql": "select * from sources * where ({targetHits:100}nearestNeighbor(embedding1, q1)) or userQuery()",
  "query": "the query",
  "input.query(q1)": "embed(model-1, \"the query\")"
}

In this example, we ask Vespa to expose to the ranking phases the 100 best hits (per node involved in the query) from the nearestNeighbor query operator, together with the best lexical matches from the weakAnd query operator.

How these candidates are scored and ranked is determined by the Vespa ranking phases. One can also use multiple nearestNeighbor query operators in the same query request.
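
For example, a query combining two nearestNeighbor operators against the same HNSW-indexed field could look like the following sketch, assuming we embed both the query and a reformulation of it:

{
  "yql": "select * from sources * where ({targetHits:100}nearestNeighbor(embedding1, q1)) or ({targetHits:100}nearestNeighbor(embedding1, q2))",
  "input.query(q1)": "embed(model-1, \"the query\")",
  "input.query(q2)": "embed(model-1, \"a reformulation of the query\")"
}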

Rank()

The Vespa rank() query operator retrieves using the first operand, and computes match features for the documents retrieved by the first operand using the other operands.

{
  "yql": "select * from sources * where rank(({targetHits:100}nearestNeighbor(embedding1, q1)), userQuery())",
  "query": "the query",
  "input.query(q1)": "embed(model-1, \"the query\")",
}

In the above example, Vespa uses dense retrieval to expose the top-100 hits to the ranking phases, while at the same time allowing the calculation of sparse matching features for the best semantically matching documents.

The following inverts the above logic: instead of dense retrieval first, it performs sparse retrieval using weakAnd and then uses the dense model in the ranking phases. This way of querying Vespa does not benefit from an HNSW index, because the top-k retrieval uses the lexical weakAnd operator.

{
  "yql": "select * from sources * where rank(userQuery(), ({targetHits:100}nearestNeighbor(embedding1, q1)))",
  "query": "the query",
  "input.query(q1)": "embed(model-1, \"the query\")",
}

Both query operators can be combined with the flexibility of the Vespa query language, for example, constraining the search by filters. Similarly, one can combine weakAnd with wand in the same query, or combine all three mentioned query operators.

The Vespa rank() query operator accepts an arbitrary number of operands, but only the first one is used for top-k retrieval.

Summary

In this post, we explored representational approaches in information retrieval, focusing on sparse and dense representations for queries and documents. We made a clear distinction between efficient retrieval and ranking, without going into details on how to combine different dense and sparse signals in the ranking phases. That is the topic for the upcoming blog post in this series, where we will look at how to combine sparse and dense ranking signals for hybrid ranking practically.