Jo Kristian Bergum
Vespa Solutions Architect

Representing BGE embedding models in Vespa using bfloat16


Photo by Rafael Drück on Unsplash

This post demonstrates how to use the recently announced BGE (BAAI General Embedding) models in Vespa. The open-source (MIT licensed) BGE models from the Beijing Academy of Artificial Intelligence (BAAI) perform strongly on the Massive Text Embedding Benchmark (MTEB) leaderboard. We evaluate the effectiveness of two BGE variants on the BEIR trec-covid dataset. Finally, we demonstrate how Vespa’s support for storing and indexing vectors using bfloat16 precision saves 50% of the memory and storage footprint with close to zero loss in retrieval quality.

Choose your BGE Fighter

When deciding on an embedding model, developers must strike a balance between quality and serving costs.

Triangle of tradeoffs

These serving-related costs are all roughly linear with model parameters and embedding dimensionality (for a given sequence length). For example, using an embedding model with 768 dimensions instead of 384 increases embedding storage by 2x and nearest neighbor search compute by 2x.
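The storage part of this relationship is simple arithmetic. The sketch below is illustrative only: the function name and the corpus size of one million documents are our own assumptions, and it counts raw vector bytes without any index or per-document overhead.

```python
def embedding_storage_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw bytes needed to store the vectors, ignoring index and per-document overhead."""
    return num_vectors * dimensions * bytes_per_value


num_docs = 1_000_000  # hypothetical corpus size, not taken from the experiments in this post
small = embedding_storage_bytes(num_docs, 384)  # float (4 bytes per dimension)
base = embedding_storage_bytes(num_docs, 768)

print(f"384 dims: {small / 1e9:.2f} GB")
print(f"768 dims: {base / 1e9:.2f} GB ({base / small:.0f}x)")
```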

Quality, however, is not nearly linear, as demonstrated on the MTEB leaderboard.

Model | Dimensionality | Model params (M) | Accuracy average (56 datasets) | Accuracy retrieval (15 datasets)
bge-small-en | 384 | 33 | 62.11 | 51.82
bge-base-en | 768 | 110 | 63.36 | 53
bge-large-en | 1024 | 335 | 63.98 | 53.9
A comparison of the English BGE embedding models; accuracy numbers are from the MTEB leaderboard. All three BGE models outperform OpenAI ada embeddings, which have 1536 dimensions and an unknown number of model parameters, on MTEB.

In the following sections, we experiment with the small and base BGE variants, which give us reasonable accuracy at a much lower cost than the large variant. The small model's low inference complexity also makes it servable on CPU, allowing local iteration and development without managing GPU-related infrastructure.

Exporting BGE to ONNX format for accelerated model inference

To use the embedding model from the Huggingface model hub in Vespa, we need to export it to ONNX format. We can use the Transformers Optimum library for this:

$ optimum-cli export onnx --task sentence-similarity -m BAAI/bge-small-en --optimize O3 bge-small-en

This exports the small model with the highest optimization level usable for serving on the CPU.

Quantization (post-training) converts the float model weights (4 bytes per weight) to byte (int8), enabling faster inference on the CPU. As demonstrated in this blog post, quantization accelerates embedding model inference by 2x on CPU with negligible impact on retrieval quality.
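To make the idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is an illustration only: the function names and the toy weights are ours, and the scheme that Optimum and ONNX Runtime actually apply may differ (for example in its use of zero points or per-channel scales).

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] using one shared scale (assumed scheme)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.003, 0.88]  # toy values, not real model weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each weight now needs 1 byte instead of 4; the rounding error is at most scale / 2.
print(q, scale)
print(max(abs(w - r) for w, r in zip(weights, restored)))
```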

Using BGE in Vespa

Using the Optimum generated ONNX model and tokenizer files, we configure the Vespa Huggingface embedder with the following in the Vespa application package services.xml file.

<component id="bge" type="hugging-face-embedder">
  <transformer-model path="model/model.onnx"/>
  <tokenizer-model path="model/tokenizer.json"/>
  <pooling-strategy>cls</pooling-strategy>
  <normalize>true</normalize>
</component>

BGE uses the CLS special token as the text representation vector (instead of average pooling). We also specify normalization so that we can use the prenormalized-angular distance metric for nearest neighbor search. See configuration reference for details.
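Conceptually, these two settings amount to the following. This is a sketch in plain Python with toy two-dimensional vectors, not the Vespa embedder implementation; the function names are our own.

```python
import math


def cls_pool(token_embeddings):
    """Use the first token vector ([CLS]) as the text representation."""
    return list(token_embeddings[0])


def l2_normalize(vector):
    """Scale the vector to unit length so angular distance reduces to a dot product."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return list(vector)
    return [v / norm for v in vector]


# Toy transformer output: one vector per token, [CLS] first.
token_embeddings = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
embedding = l2_normalize(cls_pool(token_embeddings))
print(embedding)  # [0.6, 0.8]
```

With average pooling, all three token vectors would contribute; with CLS pooling, only the first one does.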

With this, we are ready to use the BGE model to embed queries and documents with Vespa.

Using BGE in Vespa schema

Unlike the E5 family, the BGE model family does not use instructions for documents, so we don’t need to prepend “passage: ” to the document input as we do with the E5 models. Since we configure the Vespa Huggingface embedder to normalize the vectors, we use the optimized prenormalized-angular distance-metric for the nearest neighbor search.

field embedding type tensor<float>(x[384]) {
    indexing: input title . " " . input text | embed | attribute
    attribute {
      distance-metric: prenormalized-angular
    }
}

Note that the above does not enable HNSW indexing; see this blog post on the tradeoffs related to introducing approximate nearest neighbor search. The small model embedding is configured with 384 dimensions, while the base model uses 768 dimensions.

field embedding type tensor<float>(x[768]) {
    indexing: input title . " " . input text | embed | attribute
    attribute {
      distance-metric: prenormalized-angular
    }
}

Using BGE in queries

The BGE model uses query instructions like the E5 family that are prepended to the input query text. We prepend the instruction text to the user query as demonstrated in the snippet below:

import requests

session = requests.Session()

# BGE query instruction, prepended to the user query before embedding
instruction = 'Represent this sentence for searching relevant passages: '
query = 'is remdesivir an effective treatment for COVID-19'
body = {
    'yql': 'select doc_id from doc where ({targetHits:10}nearestNeighbor(embedding, q))',
    'input.query(q)': 'embed(' + instruction + query + ')',
    'ranking': 'semantic',
    'hits': '10'
}
response = session.post('http://localhost:8080/search/', json=body)

The BGE query instruction is Represent this sentence for searching relevant passages:. We are unsure why the authors chose such a long query instruction, since it hurts efficiency: compute complexity is quadratic with sequence length.

Experiments

We evaluate the small and base models on the trec-covid test split from the BEIR benchmark. We concatenate the title and the abstract as input to the BGE embedding models, as demonstrated in the Vespa schema snippets in the previous section.

Dataset | Documents | Avg document tokens | Queries | Avg query tokens | Relevance judgments
BEIR trec-covid | 171,332 | 245 | 50 | 18 | 66,336
Dataset characteristics; tokens are the number of language model token identifiers (wordpieces).

All experiments are run on an M1 Pro (arm64) laptop with 8 v-CPUs and 32GB of memory, using the open-source Vespa container image. We use no GPU acceleration, so there is no need to manage CUDA driver compatibility, handle huge container images due to CUDA dependencies, or forward host GPU devices to the container.

Sample Vespa JSON formatted feed document (prettified) from the BEIR trec-covid dataset:

{
  "put": "id:miracl-trec:doc::wnnsmx60",
  "fields": {
    "title": "Managing emerging infectious diseases: Is a federal system an impediment to effective laws?",
    "text": "In the 1980's and 1990's HIV/AIDS was the emerging infectious disease. In 2003\u20132004 we saw the emergence of SARS, Avian influenza and Anthrax in a man made form used for bioterrorism. Emergency powers legislation in Australia is a patchwork of Commonwealth quarantine laws and State and Territory based emergency powers in public health legislation. It is time for a review of such legislation and time for consideration of the efficacy of such legislation from a country wide perspective in an age when we have to consider the possibility of mass outbreaks of communicable diseases which ignore jurisdictional boundaries.",
    "doc_id": "wnnsmx60",
    "language": "en"
  }
}

Evaluation results

Model | Model size (MB) | NDCG@10 BGE | NDCG@10 BM25
bge-small-en | 33 | 0.7395 | 0.6823
bge-base-en | 104 | 0.7662 | 0.6823
Evaluation results for quantized BGE models.

We contrast both BGE models with the unsupervised BM25 baseline from this blog post. Both models perform better than the BM25 baseline on this dataset. We also note that the NDCG@10 numbers we obtain with Vespa are slightly better than those reported on the MTEB leaderboard for the same dataset. The base model performs better on this dataset, but it is also about 2x more costly due to the size of the embedding model and the embedding dimensionality. The bge-base model inference could benefit from GPU acceleration (without quantization).
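For readers unfamiliar with the metric, the sketch below shows one common way to compute NDCG@10 for a single query. It assumes linear gain (the graded relevance label itself) and a log2 rank discount; the function names and the toy judgments are ours, and this is not the evaluation script used to produce the numbers above.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: gain at rank r (1-based) is divided by log2(r + 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_doc_ids, judgments, k=10):
    """NDCG@k for one query; unjudged documents count as non-relevant (gain 0)."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy relevance judgments for one query (hypothetical document ids).
judgments = {"a": 2, "b": 1, "c": 0}
print(ndcg_at_k(["a", "b", "c"], judgments))  # ideal ordering
print(ndcg_at_k(["b", "a", "x"], judgments))  # top two results swapped
```

The reported score is the mean of this per-query value over all 50 test queries.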

Using bfloat16 precision

We evaluate using bfloat16 instead of float for the tensor representation in Vespa. Using bfloat16 instead of float reduces memory and storage requirements by 2x since bfloat16 uses 2 bytes per embedding dimension instead of 4 bytes for float. See Vespa tensor value types.

We do not change the type of the query tensor. Vespa casts the bfloat16 field representation to float at search time, allowing CPU acceleration of floating point operations. The cast operation does come with a small cost (20-30%) compared with using float, but the saving in memory and storage resource footprint is well worth it for most use cases.

field embedding type tensor<bfloat16>(x[384]) {
    indexing: input title . " " . input text | embed | attribute
    attribute {
      distance-metric: prenormalized-angular
    }
}
Using bfloat16 instead of float for the embedding tensor.

Model | NDCG@10 bfloat16 | NDCG@10 float
bge-small-en | 0.7346 | 0.7395
bge-base-en | 0.7656 | 0.7662
Evaluation results for BGE models - float versus bfloat16 document representation.

By using bfloat16 instead of float to store the vectors, we save 50% of the memory cost and can store 2x more embeddings per instance type, with almost zero impact on retrieval quality.
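The small quality loss follows from how bfloat16 relates to float: it keeps the sign bit, the full 8-bit exponent, and the top 7 mantissa bits of the 32-bit encoding. The sketch below illustrates this with Python's struct module. It assumes conversion by simple truncation of the lower 16 bits; Vespa's actual conversion may round instead, and the function names and the sample value are our own.

```python
import struct


def to_bfloat16_bits(value: float) -> int:
    """Keep the upper 16 bits of the IEEE 754 float32 encoding (truncation, an assumption)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    return bits >> 16


def from_bfloat16_bits(bits16: int) -> float:
    """Cast back to float by padding the lower 16 bits with zeros."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


x = 0.0731  # a typical magnitude for one dimension of a normalized embedding (toy value)
roundtrip = from_bfloat16_bits(to_bfloat16_bits(x))
print(x, roundtrip, abs(x - roundtrip) / x)
```

With truncation, the relative error per dimension stays below roughly 2^-7 (under 1%), which is consistent with the near-identical NDCG@10 numbers in the table above.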

Summary

Using the open-source Vespa container image, we’ve explored the recently announced strong BGE text embedding models with embedding inference and retrieval on our laptops. The local experimentation eliminates prolonged feedback loops.

Moreover, the same Vespa configuration files suffice for many deployment scenarios, whether on-premise, on Vespa Cloud, or locally on a laptop. The beauty is that, with Vespa’s native embedding support, separate infrastructure systems for managing embedding inference and nearest neighbor search become obsolete.

If you are interested in learning more about Vespa, see Vespa Cloud - getting started, or self-serve Vespa - getting started. Got questions? Join the Vespa community in Vespa Slack.