Thomas H. Thoresen
Senior Software Engineer
Arne H Juul
Senior Principal Vespa Engineer

Modernize your retrieval pipeline with ModernBERT and Vespa

Introduction

In December 2024, Answer.AI and LightOn.ai announced ModernBERT, a new and modernized version of BERT. In this blog post, we will take a look at how ModernBERT models can be used in Vespa to improve the efficiency of your retrieval pipelines. BERT was released in 2018 and is, in 2025, still one of the most downloaded models on the Hugging Face Hub. There have been many learnings and improvements since then, and ModernBERT’s goal was to incorporate the lessons of the past seven years into a new base model.

[Image: ModernBERT speed/performance benchmark comparison. Image credit: answer.ai]

As you can see in the image above, they succeeded: ModernBERT is a Pareto improvement, in both speed and performance, over BERT and its descendants.

Like BERT, ModernBERT is a base model, trained with the masked language modeling (MLM) objective (predicting masked tokens). This means that it is a general-purpose model that can, and should, be fine-tuned for a wide range of tasks, such as feature extraction (embedding), text classification, named entity recognition, and retrieval.

A couple of months after its release, we are already seeing many fine-tuned models based on ModernBERT, and in this blog post we will demonstrate how you can use these models in Vespa.

To view all the models fine-tuned on ModernBERT, check out its model tree on the Hugging Face Hub.

What’s modern about ModernBERT?

ModernBERT’s premise was to take all the recent learnings and advancements in architecture, training techniques, and dataset curation, and incorporate them into a new bi-encoder base model with a focus on:

  1. A modernized Transformer architecture
  2. Efficiency
  3. Modern data scale and sources

This results in a base model with features such as:

  • Longer context length: Where BERT has a maximum context length of 512 tokens, ModernBERT supports up to 8192 tokens. This means that you don’t necessarily need to chunk documents shorter than 8192 tokens. But be aware that just because you can stuff a longer document into a single vector representation doesn’t mean you should; evaluate for your specific use case.
  • Supports Matryoshka: Enables trading accuracy against efficiency by using only a sliced prefix of the vector representation, as shown in the sketch after this list. See also Shrinking embeddings for speed and accuracy and Matryoshka embeddings in Vespa.
  • Uses the tokenizer from AI2’s OLMo, which was explicitly trained to model code.
  • It is the first bi-encoder model with a substantial amount of code in its training data.
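
To make the Matryoshka point concrete, here is a minimal sketch of a Vespa schema field, with hypothetical field and embedder names, that keeps only the first 256 of a model's 768 output dimensions:

field embedding type tensor<float>(x[256]) {
    # A Matryoshka-trained model's 768-dim output is sliced to the
    # first 256 dimensions to fit this tensor type.
    indexing: input text | embed my-embedder-id | attribute | index
    attribute {
        distance-metric: angular
    }
}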

Use cases

There are two cases in particular where we think ModernBERT will shine:

1. Code retrieval

… out of all the existing open source encoders, ModernBERT is in a class of its own on programming-related tasks. We’re particularly interested in what downstream uses this will lead to, in terms of improving programming assistants.

– ModernBERT intro blog post

Thanks to a code-friendly tokenizer and the inclusion of code data in its training mixture, ModernBERT excels at code retrieval tasks. For example, gte-modernbert-base scores an impressive average nDCG@10 of 79.31 on the CoIR benchmark.

[Image: code-friendly tokenizer comparison. Image credit: jina.ai]

As you can see in the image above, the ModernBERT tokenizer handles code more logically than the other two tokenizers; for example, indentation of 4 and 8 spaces maps to distinct tokens.

2. Long context retrieval

ModernBERT shows superior performance on long context retrieval tasks, such as the MLDR and LoCo benchmarks.

[Image: long-context retrieval results table. Image credit: Hugging Face]

We were particularly impressed by the performance of the ColBERT (late interaction) version of ModernBERT on the MLDR benchmark. We hope (and expect) to see some “semi-official” “ColModernBERT” models soon, and will be sure to add them to our Model Hub when we do.

Using ModernBERT-models in Vespa

In general, Vespa supports most embedding models available on the Hugging Face Hub out of the box, through our generic huggingface-embedder.

There are some configurations specific to each model that are important to get right.
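
Regardless of which model you choose, the embedder component declared in services.xml is referenced from your schema. Below is a minimal sketch, with hypothetical schema and field names, using the my-embedder-id from the component examples that follow:

schema doc {
    document doc {
        field text type string {
            indexing: summary | index
        }
    }
    # "embed my-embedder-id" must match the component id in services.xml.
    field embedding type tensor<float>(x[768]) {
        indexing: input text | embed my-embedder-id | attribute | index
        attribute {
            distance-metric: angular
        }
    }
}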

Below are three ModernBERT models that we have added to our Model Hub, along with their configurations:

1. Nomic AI ModernBERT

  • Description: Trained from ModernBERT-base on the Nomic Embed datasets
  • Tensor Definition: tensor<float>(x[768]) (supports Matryoshka: x[256] is also possible)
  • Distance Metric: angular
  • License: apache-2.0
  • Source: nomic-ai/modernbert-embed-base @ 92168cb
  • Language: English

services.xml component definition:

<component id="my-embedder-id" type="hugging-face-embedder">
        <transformer-model model-id="nomic-ai-modernbert"/>
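        <!-- Name of the output to read from the ONNX model; this model exports its token embeddings as "token_embeddings" -->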
        <transformer-output>token_embeddings</transformer-output>
        <max-tokens>8192</max-tokens>
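        <!-- Task prefixes the model was trained with; Vespa prepends these to query and document text automatically -->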
        <prepend>
            <query>search_query:</query>
            <document>search_document:</document>
        </prepend>
</component>
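
At query time, the same embedder can encode the query text, and with the prepend configuration above, the search_query: prefix is added automatically. A hypothetical query, assuming an embedding field like the schema sketch earlier and a rank profile named semantic (sketched after the last model configuration below):

yql=select * from doc where {targetHits: 100}nearestNeighbor(embedding, q)
input.query(q)=embed(my-embedder-id, @text)
text=how to configure an embedder in vespa
ranking=semantic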

2. LightOn AI ModernBERT Large

  • Description: Trained from ModernBERT-large on the Nomic Embed datasets
  • Tensor Definition: tensor<float>(x[1024])
  • Distance Metric: angular
  • License: apache-2.0
  • Source: lightonai/modernbert-embed-large @ b3a781f
  • Language: English

services.xml component definition:

<component id="my-embedder-id" type="hugging-face-embedder">
        <transformer-model model-id="lightonai-modernbert-large"/>
        <max-tokens>8192</max-tokens>
        <prepend>
            <query>search_query:</query>
            <document>search_document:</document>
        </prepend>
</component>

3. Alibaba GTE ModernBERT

  • Description: GTE model trained from ModernBERT-base
  • Tensor Definition: tensor<float>(x[768]) (supports Matryoshka: x[256] is also possible)
  • Distance Metric: angular
  • License: apache-2.0
  • Source: Alibaba-NLP/gte-modernbert-base @ 3ab3f8c
  • Language: English

services.xml component definition:

<component id="my-embedder-id" type="hugging-face-embedder">
        <transformer-model model-id="alibaba-gte-modernbert"/>
        <max-tokens>8192</max-tokens>
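        <!-- Use the CLS token vector as the text embedding instead of the default mean pooling, matching how this model was trained -->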
        <pooling-strategy>cls</pooling-strategy>
</component>
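
Whichever model you choose, ranking by vector similarity requires a rank profile with a query tensor input matching the embedding dimensionality. A minimal sketch, with a hypothetical profile name matching the query example above:

rank-profile semantic {
    inputs {
        query(q) tensor<float>(x[768])
    }
    first-phase {
        # closeness is computed from the distance-metric declared on the field
        expression: closeness(field, embedding)
    }
}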

Conclusion

We are excited to see what you will build with ModernBERT in Vespa. Get started with a free trial of Vespa Cloud today.
