Representing BGE embedding models in Vespa using bfloat16
This post demonstrates how to use the recently announced BGE (BAAI General Embedding) models in Vespa. The open-sourced (MIT licensed) BGE models from the Beijing Academy of Artificial Intelligence (BAAI) perform strongly on the Massive Text Embedding Benchmark (MTEB leaderboard). We evaluate the effectiveness of two BGE variants on the BEIR trec-covid dataset. Finally, we demonstrate how Vespa’s support for storing and indexing vectors using bfloat16 precision saves 50% of the memory and storage footprint with close to zero loss in retrieval quality.
Choose your BGE Fighter
When deciding on an embedding model, developers must strike a balance between quality and serving costs: embedding inference, embedding storage, and nearest neighbor search compute.
These serving-related costs are all roughly linear with model parameters and embedding dimensionality (for a given sequence length): inference cost grows with the number of model parameters, while storage and nearest neighbor search costs grow with the embedding dimensionality. For example, using an embedding model with 768 dimensions instead of 384 increases embedding storage by 2x and nearest neighbor search compute by 2x.
Quality, however, does not scale linearly with model size, as demonstrated on the MTEB leaderboard.
Model | Dimensionality | Model params (M) | Accuracy Average (56 datasets) | Accuracy Retrieval (15 datasets) |
bge-small-en | 384 | 33 | 62.11 | 51.82 |
bge-base-en | 768 | 110 | 63.36 | 53 |
bge-large-en | 1024 | 335 | 63.98 | 53.9 |
In the following sections, we experiment with the small and base BGE variants, which give us reasonable accuracy at a much lower cost than the large variant. The inference complexity of the small model also makes it servable on CPU, allowing local iteration and development without managing GPU-related infrastructure complexity.
Exporting BGE to ONNX format for accelerated model inference
To use the embedding model from the Huggingface model hub in Vespa, we need to export it to ONNX format. We can use the Huggingface Optimum library for this:
$ optimum-cli export onnx --library transformers --task sentence-similarity -m BAAI/bge-small-en --optimize O3 bge-small-en
This exports the small model with the highest optimization level usable for serving on the CPU.
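Before deploying, the exported model can be sanity-checked locally. The snippet below is a minimal sketch (not from the original walkthrough) that loads the exported ONNX file and tokenizer with onnxruntime and tokenizers and runs a single forward pass; the output names depend on the export task, so we simply print them.
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

# Paths assume the output directory produced by the optimum-cli command above
tokenizer = Tokenizer.from_file("bge-small-en/tokenizer.json")
session = ort.InferenceSession("bge-small-en/model.onnx")

encoding = tokenizer.encode("a quick sanity check sentence")
feed = {
    "input_ids": np.array([encoding.ids], dtype=np.int64),
    "attention_mask": np.array([encoding.attention_mask], dtype=np.int64),
}
# Some exports also expect token_type_ids
if "token_type_ids" in {i.name for i in session.get_inputs()}:
    feed["token_type_ids"] = np.array([encoding.type_ids], dtype=np.int64)

outputs = session.run(None, feed)
print([o.name for o in session.get_outputs()], outputs[0].shape)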
Quantization (post-training) converts the float model weights (4 bytes per weight) to byte (int8), enabling faster inference on the CPU. As demonstrated in this blog post, quantization accelerates embedding model inference by 2x on CPU with negligible impact on retrieval quality.
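The post does not spell out the exact quantization command; as one hedged illustration (assuming default dynamic post-training quantization), the onnxruntime Python API can be used as below. Optimum also offers CLI support for ONNX Runtime quantization.
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic (post-training) quantization: float32 weights -> int8
quantize_dynamic(
    model_input="bge-small-en/model.onnx",
    model_output="bge-small-en/model-quantized.onnx",
    weight_type=QuantType.QInt8,
)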
Using BGE in Vespa
Using the Optimum generated ONNX model and tokenizer files, we configure the Vespa Huggingface embedder with the following in the Vespa application package services.xml file.
<component id="bge" type="hugging-face-embedder">
<transformer-model path="model/model.onnx"/>
<tokenizer-model path="model/tokenizer.json"/>
<pooling-strategy>cls</pooling-strategy>
<normalize>true</normalize>
</component>
BGE uses the CLS special token as the text representation vector (instead of average pooling). We also specify normalization so that we can use the prenormalized-angular distance metric for nearest neighbor search. See the configuration reference for details.
With this, we are ready to use the BGE model to embed queries and documents with Vespa.
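For intuition, the sketch below (an illustration, not part of the original post) shows what CLS pooling plus normalization corresponds to when computing the embedding directly with the Huggingface transformers library:
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en")
model = AutoModel.from_pretrained("BAAI/bge-small-en")

batch = tokenizer(["is remdesivir an effective treatment for COVID-19"],
                  padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    output = model(**batch)

# CLS pooling: take the hidden state of the first ([CLS]) token
cls_embedding = output.last_hidden_state[:, 0]
# L2-normalize, matching <normalize>true</normalize> in services.xml
embedding = torch.nn.functional.normalize(cls_embedding, p=2, dim=1)
print(embedding.shape)  # (1, 384) for bge-small-en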
Using BGE in Vespa schema
The BGE model family does not use instructions for documents like the E5 family, so we don’t need to prepend the document input with “passage: “ as with the E5 models. Since we configure the Vespa Huggingface embedder to normalize the vectors, we use the optimized prenormalized-angular distance-metric for the nearest neighbor search.
field embedding type tensor<float>(x[384]) {
indexing: input title . " " . input text | embed | attribute
attribute {
distance-metric: prenormalized-angular
}
}
Note that the above does not enable HNSW indexing; see this blog post on the tradeoffs related to introducing approximate nearest neighbor search. The small model embedding is configured with 384 dimensions, while the base model uses 768 dimensions:
field embedding type tensor<float>(x[768]) {
indexing: input title . " " . input text | embed | attribute
attribute {
distance-metric: prenormalized-angular
}
}
Using BGE in queries
Like the E5 family, the BGE models use a query instruction that is prepended to the input query text. We prepend the instruction text to the user query as demonstrated in the snippet below:
import requests

session = requests.Session()

query = 'is remdesivir an effective treatment for COVID-19'
body = {
  'yql': 'select doc_id from doc where ({targetHits:10}nearestNeighbor(embedding, q))',
  # The BGE query instruction is prepended to the user query before embedding
  'input.query(q)': 'embed(Represent this sentence for searching relevant passages: ' + query + ')',
  'ranking': 'semantic',
  'hits': '10'
}
response = session.post('http://localhost:8080/search/', json=body)
The BGE query instruction is Represent this sentence for searching relevant passages:. We are unsure why they chose such a long query instruction, as it hurts efficiency: transformer compute complexity is quadratic with the sequence length.
Experiments
We evaluate the small and base models on the trec-covid test split from the BEIR benchmark. We concatenate the title and the abstract as input to the BGE embedding models, as demonstrated in the Vespa schema snippets in the previous section.
Dataset | Documents | Avg document tokens | Queries | Avg query tokens | Relevance Judgments |
BEIR trec_covid | 171,332 | 245 | 50 | 18 | 66,336 |
All experiments are run on an M1 Pro (arm64) laptop with 8 v-CPUs and 32GB of memory, using the open-source Vespa container image. No GPU acceleration and no need to manage CUDA driver compatibility, huge container images due to CUDA dependencies, or forwarding host GPU devices to the container.
- We use the multilingual-search Vespa sample application as the starting point for these experiments. This sample app was introduced in Simply search with multilingual embedding models.
- The retrieval quality evaluation uses NDCG@10 (a minimal sketch of the metric follows this list).
- Both the small and base models are quantized (int8) to improve efficiency on CPU.
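For reference, NDCG@10 can be computed per query as sketched below. This is an illustration of the metric only; the exact evaluation tooling is not shown in this post.
import math

def ndcg_at_10(ranked_doc_ids, qrels):
    """NDCG@10 for one query. ranked_doc_ids is the ranked result list,
    qrels maps doc_id -> graded relevance judgment."""
    def dcg(gains):
        return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:10]]
    ideal = sorted(qrels.values(), reverse=True)[:10]
    return dcg(gains) / dcg(ideal) if ideal else 0.0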
Sample Vespa JSON formatted feed document (prettified) from the BEIR trec-covid dataset:
{
"put": "id:miracl-trec:doc::wnnsmx60",
"fields": {
"title": "Managing emerging infectious diseases: Is a federal system an impediment to effective laws?",
"text": "In the 1980's and 1990's HIV/AIDS was the emerging infectious disease. In 2003\u20132004 we saw the emergence of SARS, Avian influenza and Anthrax in a man made form used for bioterrorism. Emergency powers legislation in Australia is a patchwork of Commonwealth quarantine laws and State and Territory based emergency powers in public health legislation. It is time for a review of such legislation and time for consideration of the efficacy of such legislation from a country wide perspective in an age when we have to consider the possibility of mass outbreaks of communicable diseases which ignore jurisdictional boundaries.",
"doc_id": "wnnsmx60",
"language": "en"
}
}
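How documents are fed is not shown in this snippet; as one hedged option (not necessarily what was used for these experiments), a document on this format can be fed with Vespa’s Document v1 HTTP API, where the namespace and document type follow from the id field above:
import requests

doc = {
    "fields": {
        "title": "Managing emerging infectious diseases: Is a federal system an impediment to effective laws?",
        "text": "In the 1980's and 1990's HIV/AIDS was the emerging infectious disease. ...",
        "doc_id": "wnnsmx60",
        "language": "en",
    }
}
# id:miracl-trec:doc::wnnsmx60 -> namespace=miracl-trec, document type=doc, docid=wnnsmx60
response = requests.post(
    "http://localhost:8080/document/v1/miracl-trec/doc/docid/wnnsmx60",
    json=doc,
)
print(response.json())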
Evaluation results
Model | Model size (MB) | NDCG@10 BGE | NDCG@10 BM25 |
bge-small-en | 33 | 0.7395 | 0.6823 |
bge-base-en | 104 | 0.7662 | 0.6823 |
We contrast both BGE models with the unsupervised BM25 baseline from this blog post. Both models perform better than the BM25 baseline on this dataset. We also note that the NDCG@10 numbers we obtain with Vespa are slightly better than those reported on the MTEB leaderboard for the same dataset. We can also observe that the base model performs better on this dataset, but it is also 2x more costly due to the size of the embedding model and the embedding dimensionality. The bge-base model inference could benefit from GPU acceleration (without quantization).
Using bfloat16 precision
We evaluate using bfloat16 instead of float for the tensor representation in Vespa. Using bfloat16 instead of float reduces memory and storage requirements by 2x, since bfloat16 uses 2 bytes per embedding dimension instead of 4 bytes for float. See Vespa tensor value types.
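As a back-of-the-envelope illustration (using the 171,332 documents and the small model’s 384 dimensions from above), the saving for the embedding field is roughly:
docs, dims = 171_332, 384
float_bytes = docs * dims * 4  # float: 4 bytes per dimension
bf16_bytes = docs * dims * 2   # bfloat16: 2 bytes per dimension
print(f"float: {float_bytes / 1e6:.0f} MB, bfloat16: {bf16_bytes / 1e6:.0f} MB")
# float: 263 MB, bfloat16: 132 MB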
We do not change the type of the query tensor. Vespa casts the bfloat16 field representation to float at search time, allowing CPU acceleration of floating point operations. The cast operation does come with a small cost (20-30%) compared with using float, but the saving in memory and storage resource footprint is well worth it for most use cases.
field embedding type tensor<bfloat16>(x[384]) {
indexing: input title . " " . input text | embed | attribute
attribute {
distance-metric: prenormalized-angular
}
}
Model | NDCG@10 bfloat16 | NDCG@10 float |
bge-small-en | 0.7346 | 0.7395 |
bge-base-en | 0.7656 | 0.7662 |
By using bfloat16 instead of float to store the vectors, we save 50% of the memory cost and can store 2x more embeddings per instance type, with almost zero impact on retrieval quality.
Summary
Using the open-source Vespa container image, we’ve explored the recently announced strong BGE text embedding models with embedding inference and retrieval on our laptops. The local experimentation eliminates prolonged feedback loops.
Moreover, the same Vespa configuration files suffice for many deployment scenarios, whether on-premise, on Vespa Cloud, or locally on a laptop. The beauty is that dedicated infrastructure for managing embedding inference and nearest neighbor search as separate systems becomes obsolete with Vespa’s native embedding support.
If you are interested in learning more about Vespa, see Vespa Cloud - getting started or self-serve Vespa - getting started. Got questions? Join the Vespa community in Vespa Slack.