Improving Zero-Shot Ranking with Vespa Hybrid Search - part two
Where should you begin if you plan to implement search functionality but have not yet collected data from user interactions to train ranking models?
In the first post in the series, we introduced the difference between in-domain and out-of-domain (zero-shot) ranking. We also presented the BEIR benchmark and highlighted cases where in-domain effectiveness does not transfer to another domain in a zero-shot setting.
In this second post in this series, we introduce and evaluate three different Vespa ranking methods on the BEIR benchmark in a zero-shot setting. We establish a new and strong BM25 baseline for the BEIR dataset, which outperforms previously reported BM25 results. We then show how a unique hybrid approach, combining a neural ranking method with BM25, outperforms other evaluated methods on 12 out of 13 datasets on the BEIR benchmark. We also compare the effectiveness of the hybrid ranking method with emerging few-shot methods that generate in-domain synthetic training data via prompting large language models (LLMs).
Establishing a strong baseline
In the BEIR paper, the authors find that BM25 is a strong generalizable baseline text ranking model. Many, if not most, of the dense single vector embedding models trained on MS MARCO labels are outperformed by BM25 in an out-of-domain setting. Quote from BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models:
In-domain performance is not a good indicator for out-of-domain generalization. We observe that BM25 heavily underperforms neural approaches by 7-18 points on in-domain MS MARCO. However, BEIR reveals it to be a strong baseline for generalization and generally outperforming many other, more complex approaches. This stresses the point, that retrieval methods must be evaluated on a broad range of datasets.
What is interesting about reporting BM25 baselines is that there are multiple implementations, variants, and performance tweaks, as demonstrated in Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants. Unfortunately, various papers have reported conflicting results for BM25 on the same BEIR benchmark datasets. The BM25 effectiveness can vary due to different hyperparameters and different linguistic processing methods used in different system implementations, such as removing stop words, stemming, and tokenization. Furthermore, researchers want to contrast their proposed ranking approach with a baseline ranking method. It could be tempting to report a weak BM25 baseline, which makes the proposed ranking method stand out better.
Several serving systems implement BM25 scoring, including Vespa. Vespa's lexical (sparse) retrieval is also accelerated using the weakAnd Vespa query operator. This matters because implementing a BM25 scoring function in a system is trivial, but scoring all documents that contain at least one of the query terms approaches linear complexity. Dynamic pruning algorithms like weakAnd improve retrieval efficiency significantly compared to naive brute-force implementations that score every document matching any of the query terms.
BM25 has two hyperparameters, k1 and b, which impact ranking effectiveness. Additionally, most (14 out of 18) of the BEIR datasets have both title and text document fields, and in a real production environment, tuning their relative importance would be one of the first things a seasoned search practitioner does. In our BM25 baseline, we configure Vespa to calculate the BM25 score of the title and text fields independently, and we combine the two BM25 scores linearly. The complete Vespa rank profile is given below.
```
rank-profile bm25 inherits default {
    first-phase {
        expression: bm25(title) + bm25(text)
    }
    rank-properties {
        bm25(title).k1: 0.9
        bm25(title).b: 0.4
        bm25(text).k1: 0.9
        bm25(text).b: 0.4
    }
}
```
We modify the BM25 k1 and b parameters but use the same parameters for both fields. The values align with the Anserini defaults (k1=0.9, b=0.4).
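To make the role of k1 and b concrete, below is a minimal Python sketch of a BM25 scoring function for a single field, using one common IDF variant. It only illustrates the formula and the linear field combination; it is not Vespa's implementation.

```python
import math

def bm25(query_terms, field_terms, doc_freq, num_docs, avg_field_len, k1=0.9, b=0.4):
    """Minimal BM25 sketch scoring one document field against a query.

    doc_freq maps each term to the number of documents containing it.
    """
    score, field_len = 0.0, len(field_terms)
    for term in set(query_terms):
        tf = field_terms.count(term)
        if tf == 0:
            continue
        # Inverse document frequency (one common variant)
        idf = math.log(1 + (num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        # k1 controls term-frequency saturation, b controls length normalization
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * field_len / avg_field_len))
    return score

# Mirroring the rank profile above: total score = bm25(title) + bm25(text)
```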
The following table reports nDCG@10 scores on a subset (13) of the BEIR benchmark datasets. We exclude the four datasets that are not publicly available. We also exclude the BEIR CQADupStack dataset because it consists of 12 sub-datasets, where the overall nDCG@10 score is found by averaging each sub-dataset's nDCG@10 score; adding these sub-datasets would significantly increase the evaluation effort.
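For reference, here is a minimal Python sketch of how nDCG@10 can be computed from graded relevance judgments, using one common formulation of the gain. The results below are produced with standard IR evaluation tooling, not this sketch.

```python
import math

def dcg_at_k(relevances, k=10):
    # relevances: graded relevance labels of the ranked results, in rank order
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    # Normalize by the DCG of the ideal ordering of all judged relevance labels
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Example: a highly relevant (2) document at rank 1, a relevant (1) one at rank 3
print(ndcg_at_k([2, 0, 1], [2, 1]))  # ~0.95
```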
| BEIR Dataset | BM25 from BEIR Paper | Vespa BM25 |
|---|---|---|
| MS MARCO | 0.228 | 0.228 |
| TREC-COVID | 0.656 | 0.690 |
| NFCorpus | 0.325 | 0.313 |
| Natural Questions (NQ) | 0.329 | 0.327 |
| HotpotQA | 0.603 | 0.623 |
| FiQA-2018 | 0.236 | 0.244 |
| ArguAna | 0.315 | 0.393 |
| Touché-2020 (V2) | 0.367 | 0.413 |
| Quora | 0.789 | 0.761 |
| DBPedia | 0.313 | 0.327 |
| SCIDOCS | 0.158 | 0.160 |
| FEVER | 0.753 | 0.751 |
| CLIMATE-FEVER | 0.213 | 0.207 |
| SciFact | 0.665 | 0.673 |
| Average (excluding MS MARCO) | 0.440 | 0.453 |
The table summarizes the BM25 nDCG@10 results: Vespa BM25 versus BM25 from the BEIR paper.
The table above demonstrates that the Vespa implementation is a strong BM25 baseline, outperforming the BM25 results reported in the BEIR paper on average (0.453 versus 0.440 nDCG@10).
Evaluating Vespa ranking models in a zero-shot setting
With the new strong BM25 baseline established in the above section, we will now introduce two neural ranking models and compare their performance with the baseline.
Vespa ColBERT
We have previously described the Vespa ColBERT implementation in this blog post, and we use the same model weights in this work. The Vespa ColBERT model is based on a distilled 6-layer MiniLM model with 22M parameters, using quantized int8 weights (post-training quantization). The model uses only 32 vector dimensions per query and document token (wordpiece), in contrast to the original ColBERT model, which uses 128 dimensions. Furthermore, we use Vespa's support for bfloat16 to reduce the per-dimension storage from 4 bytes (float) to 2 bytes (bfloat16). We configure the maximum query length to 32 wordpieces and the maximum document length to 180 wordpieces. Both maximum length parameters align with the training and experiments on MS MARCO.
The ColBERT MaxSim scoring is implemented as a re-ranking model using Vespa phased ranking, re-ranking the top 2K hits ranked by BM25. We also compute and store the title term embeddings for datasets with titles, meaning we have two MaxSim scores for datasets with titles. We use a linear combination to combine the title and text MaxSim scores.
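To illustrate the late-interaction scoring, the following Python/NumPy sketch shows the MaxSim operator and the linear combination of the title and text MaxSim scores. It assumes per-wordpiece embeddings are already computed; inside Vespa, the equivalent computation is expressed with the tensor expression in the rank profile below.

```python
import numpy as np

def maxsim(query_embeddings: np.ndarray, doc_embeddings: np.ndarray) -> float:
    """ColBERT MaxSim: for each query token, take the maximum similarity over all
    document tokens, then sum over the query tokens.

    query_embeddings: (num_query_tokens, dim), doc_embeddings: (num_doc_tokens, dim)
    """
    similarities = query_embeddings @ doc_embeddings.T  # dot-product similarities
    return float(similarities.max(axis=1).sum())

def title_text_maxsim(query, title_tokens, text_tokens, title_weight=0.5):
    # Linear combination of the two MaxSim scores, mirroring the rank profile below
    return (title_weight * maxsim(query, title_tokens)
            + (1 - title_weight) * maxsim(query, text_tokens))
```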
The complete Vespa rank profile is given below.
```
rank-profile colbert inherits bm25 {
    inputs {
        query(qt) tensor<float>(qt{}, x[32])
        query(title_weight): 0.5
    }
    second-phase {
        rerank-count: 2000
        expression {
            (1 - query(title_weight)) *
                sum(
                    reduce(
                        sum(query(qt) * cell_cast(attribute(dt), float), x),
                        max, dt
                    ),
                    qt
                ) +
            query(title_weight) *
                sum(
                    reduce(
                        sum(query(qt) * cell_cast(attribute(title_dt), float), x),
                        max, dt
                    ),
                    qt
                )
        }
    }
}
```
The per-wordpiece ColBERT vectors are stored in Vespa using its support for storing and computing over tensors.
Note: Users can also trade efficiency against cost by storing the tensors on disk instead of in memory, using paging options. Paging is highly efficient in a re-ranking pipeline, as only a few thousand tensor values are potentially paged on demand.
Vespa Hybrid ColBERT + BM25
There are several ways to combine the ColBERT MaxSim score with BM25, including reciprocal rank fusion (RRF), which does not consider the model scores, only the ordering (ranking) the scores produce. Quote from Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods:
RRF is simpler and more effective than Condorcet Fuse, while sharing the valuable property that it combines ranks without regard to the arbitrary scores returned by particular ranking methods
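For context, here is a minimal Python sketch of RRF, using the commonly cited k=60 constant from the paper. We do not use RRF in this work, but it illustrates that only rank positions, not scores, are fused.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple rankings (lists of doc ids, best first) into one ranking.

    Only rank positions matter; the underlying model scores are ignored.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a BM25 ranking with a ColBERT ranking
print(reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d3", "d1"]]))  # ['d2', 'd1', 'd3']
```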
Another approach is to combine the model scores into a new score to produce a new ranking. We use a linear combination in this work to compute the hybrid score. Like the ColBERT-only model, we use BM25 as the first-phase ranking model and only calculate the hybrid score for the global top-ranking K documents from the BM25 model.
Before combining the scores, we want to normalize both the unbounded BM25 score and the bounded ColBERT score. Normalization is accomplished by simple min-max scaling of the scores. With min-max scaling, the scores from each ranking model are scaled to the range 0 to 1, which makes it easier to combine the two using relative weighting.
Since scoring in a production serving system might be spread across multiple nodes, no single node involved in the query knows the global minimum and maximum scores. We solve this problem by letting the Vespa content nodes involved in the query return both scores using Vespa match-features.
A custom searcher is injected into the stateless Vespa service that dispatches queries. This searcher calculates the max and min of both model scores, using the match features of the hits within the window of global top-k hits ranked by BM25. As with the ColBERT rank profile, we use a re-ranking window of 2000 hits, but we perform feature-score scaling and re-ranking in the stateless custom searcher instead of on the content nodes.
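The searcher itself is implemented in Java (see the implementation link below); the following Python sketch only illustrates the scaling and fusion logic it applies to the match features of the top-ranked hits. The equal weighting here is illustrative, not the exact production configuration.

```python
def min_max_scale(scores):
    lo, hi = min(scores), max(scores)
    # If all scores are equal, scaling is undefined; fall back to zeros
    return [0.0] * len(scores) if hi == lo else [(s - lo) / (hi - lo) for s in scores]

def fuse(hits, bm25_weight=0.5, colbert_weight=0.5):
    """hits: dicts holding the 'bm25' and 'colbert_maxsim' match features per hit."""
    bm25_scaled = min_max_scale([h["bm25"] for h in hits])
    colbert_scaled = min_max_scale([h["colbert_maxsim"] for h in hits])
    for hit, b, c in zip(hits, bm25_scaled, colbert_scaled):
        hit["hybrid_score"] = bm25_weight * b + colbert_weight * c
    return sorted(hits, key=lambda h: h["hybrid_score"], reverse=True)
```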
The complete Vespa rank profile is given below. Notice the match-features, which are returned with each hit to the stateless searcher (implementation), which performs the normalization and re-scoring. The first-phase scoring function is inherited from the previously described bm25 rank profile.
```
rank-profile hybrid-colbert inherits bm25 {
    function bm25() {
        expression: bm25(title) + bm25(text)
    }
    function colbert_maxsim() {
        expression {
            2 * sum(
                reduce(
                    sum(query(qt) * cell_cast(attribute(dt), float), x),
                    max, dt
                ),
                qt
            ) +
            sum(
                reduce(
                    sum(query(qt) * cell_cast(attribute(title_dt), float), x),
                    max, dt
                ),
                qt
            )
        }
    }
    match-features {
        bm25
        colbert_maxsim
    }
}
```
Results and analysis
As with the BM25 baseline model, we index one of the BEIR datasets at a time on a Vespa instance and evaluate the models. The following table summarizes the results. All numbers are nDCG@10. The best-performing model score per dataset is in bold.
| BEIR Dataset | Vespa BM25 | Vespa ColBERT | Vespa Hybrid |
|---|---|---|---|
| MS MARCO (in-domain) | 0.228 | **0.401** | 0.344 |
| TREC-COVID | 0.690 | 0.658 | **0.750** |
| NFCorpus | 0.313 | 0.304 | **0.350** |
| Natural Questions (NQ) | 0.327 | 0.403 | **0.404** |
| HotpotQA | 0.623 | 0.298 | **0.632** |
| FiQA-2018 | 0.244 | 0.252 | **0.292** |
| ArguAna | 0.393 | 0.286 | **0.404** |
| Touché-2020 (V2) | 0.413 | 0.315 | **0.415** |
| Quora | 0.761 | 0.817 | **0.826** |
| DBPedia | 0.327 | 0.281 | **0.365** |
| SCIDOCS | 0.160 | 0.107 | **0.161** |
| FEVER | 0.751 | 0.534 | **0.779** |
| CLIMATE-FEVER | **0.207** | 0.067 | 0.191 |
| SciFact | 0.673 | 0.403 | **0.679** |
| Average nDCG@10 (excluding MS MARCO) | 0.453 | 0.363 | **0.481** |
The table summarizes the nDCG@10 results per dataset. Note that MS MARCO is in-domain for ColBERT and Hybrid. The average nDCG@10 is only computed over the zero-shot (out-of-domain) datasets.
As shown in the table above, in an in-domain setting on MS MARCO, the Vespa ColBERT model significantly outperforms the BM25 baseline. The resulting nDCG@10 score aligns with reported MRR@10 results from previous work using ColBERT in-domain on MS MARCO. However, mixing in the BM25 baseline via the hybrid model hurts the nDCG@10 score on the MS MARCO evaluation, as we combine two models where the unsupervised BM25 model is significantly weaker than the ColBERT model.
The Vespa ColBERT model underperforms BM25 on several out-of-domain datasets, especially CLIMATE-FEVER. The CLIMATE-FEVER dataset has very long queries (20.2 words on average), which challenge the ColBERT model, configured with a maximum query length of 32 wordpieces in our experimental setup. Additionally, the Vespa ColBERT model underperforms the reported results for the full-sized ColBERT V2 model, which uses 110M parameters and 128 dimensions. This could indicate that the dimension compression and model distillation have a more significant negative impact in a zero-shot setting than in-domain.
These exceptions aside, the data shows that the unique hybrid Vespa ColBERT and BM25 combination is highly
effective, performing the best on 12 of 13 datasets. Its average
nDCG@10
score improves from 0.453 to 0.481 compared to the strong
Vespa BM25 baseline.
To reproduce the results of this benchmark, follow the open-sourced instructions.
Comparing hybrid zero-shot with few-shot methods
To compare the hybrid Vespa ranking performance with other models, we include the results reported in Promptagator: Few-shot Dense Retrieval From 8 Examples from Google Research.
Generating synthetic in-domain training data via prompting LLMs is a recent, emerging Information Retrieval (IR) trend, also described in InPars: Data Augmentation for Information Retrieval using Large Language Models.
The basic idea is to “prompt” a large language model (LLM) to generate synthetic queries for use in training in-domain ranking models. A typical prompt includes a few examples of queries and relevant documents; the LLM is then “asked” to generate synthetic queries for many of the documents in the corpus. The generated synthetic query-document pairs can be used to train neural ranking models. We include a quote describing the approach from the Promptagator paper:
Running the prompt on all documents from DT, we can create a large set of synthetic (q, d) examples, amplifying the information from few examples into a large synthetic dataset whose query distribution is similar to true task distribution QT and query-document pairs convey the true search intent IT. We use FLAN (Wei et al., 2022a) as the LLM for query generation in this work. FLAN is trained on a collection of tasks described via instructions and was shown to have good zero/few-shot performance on unseen tasks. We use the 137B FLAN checkpoint provided by the authors.
The Promptagator authors report results on a different subset of the BEIR datasets (excluding Quora and Natural Questions). In the following table we compare their reported results on the same BEIR datasets used in this work. We also include the most effective single-vector representation model (TAS-B) from the BEIR benchmark (zero-shot).
| BEIR Dataset | Vespa BM25 | Vespa Hybrid | TAS-B (dense) | PROMPTAGATOR few-shot (dense) | PROMPTAGATOR few-shot (cross-encoder) |
|---|---|---|---|---|---|
| TREC-COVID | 0.690 | 0.750 | 0.481 | 0.756 | 0.762 |
| NFCorpus | 0.313 | 0.350 | 0.319 | 0.334 | 0.370 |
| HotpotQA | 0.623 | 0.632 | 0.584 | 0.614 | 0.736 |
| FiQA-2018 | 0.244 | 0.292 | 0.300 | 0.462 | 0.494 |
| ArguAna | 0.393 | 0.404 | 0.429 | 0.594 | 0.630 |
| Touché-2020 (V2) | 0.413 | 0.415 | 0.173 | 0.345 | 0.381 |
| DBPedia | 0.327 | 0.365 | 0.384 | 0.380 | 0.434 |
| SCIDOCS | 0.160 | 0.161 | 0.149 | 0.184 | 0.201 |
| FEVER | 0.751 | 0.779 | 0.700 | 0.770 | 0.868 |
| CLIMATE-FEVER | 0.207 | 0.191 | 0.228 | 0.168 | 0.203 |
| SciFact | 0.673 | 0.679 | 0.643 | 0.650 | 0.731 |
| Average nDCG@10 | 0.436 | 0.456 | 0.399 | 0.478 | 0.528 |
Vespa ranking model comparison with few-shot models and single-vector TAS-B (zero-shot). The PROMPTAGATOR results are from Table 2 in the paper.
The dense TAS-B model underperforms both the BM25 baseline and the hybrid model. This result is in line with other dense models trained on MS MARCO; dense single-vector representation models struggle with generalization in new domains.
The PROMPTAGATOR single-vector representation model (110M parameters) performs better than the zero-shot Vespa hybrid model. Still, given that it uses in-domain adaptation, we don't consider the difference that significant (0.456 versus 0.478). Furthermore, we could also adapt the hybrid model on a per-dataset basis, for example, by adjusting the relative importance of the title and text fields. Interestingly, PROMPTAGATOR reports a BM25 baseline nDCG@10 score of 0.418 across these datasets, which is considerably weaker than the strong Vespa BM25 baseline of 0.436.
We also include the PROMPTAGATOR re-ranking model, a cross-encoder model with another 110M parameters, which re-ranks the top-200 results from the retriever model. This model outperforms all other methods described in this blog post series.
There is also exciting work (InPars v2) using LLMs to generate synthetic training data that reports strong cross-encoder results on BEIR, but with models of up to 3B parameters, which makes them impractical and costly for production use cases.
Cross-encoder models can only be deployed in a re-ranking phase, as they take both the query and the document as input and are more computationally intensive than the other methods presented in this blog post. Nevertheless, the computationally inexpensive Vespa hybrid model could be used as a first-phase retriever for cross-encoder models. We described cross-encoder models in Vespa in part four of our Pretrained Transformer Language Models for Search blog post series.
Deploying hybrid ranking models to production
We’ve made everything you need to deploy this solution available.
This research, evaluating Vespa zero-shot ranking models on BEIR, began with the COVID-19 Open Research Dataset (CORD-19). We have indexed the complete, final version of the CORD-19 dataset on https://cord19.vespa.ai/. You can select between all three ranking models described in this blog post. The demo search also includes result facets, result pagination, result snippets, and highlighting of matched query terms; essentially, everything you expect from a search engine implementation.
The Vespa app is open-source and deployed on Vespa Cloud, and the app can also be run locally using the open-source Vespa container image. The Vespa ColBERT model is CPU-friendly and does not require expensive GPU/TPU acceleration to meet user-serving latency requirements. The end-to-end retrieval and ranking pipeline, including query encoding, retrieval, and re-ranking, takes less than 60 ms.
Summary
In this blog post in a series on zero-shot ranking, we established a strong BM25 baseline on multiple BEIR datasets, improving over previously reported results. We believe that without a strong BM25 baseline model, we can overestimate the neural ranking progress, especially in a zero-shot setting where neural single vector representations struggle with generalization.
We then introduced a unique hybrid ranking model, combining ColBERT with BM25 and setting a new high bar for efficient and effective zero-shot ranking. We also compared this model's effectiveness with that of much larger models that use few-shot in-domain adaptation techniques involving billion-parameter LLMs.
Importantly, all the results presented in this blog post are easily reproduced using the open-sourced Vespa app, which is deployed to production and available at https://cord19.vespa.ai/.
For those interested in learning more about hybrid search in a zero-shot setting, we highly recommend two Vespa-related talks presented at Berlin Buzzwords 2022.