Vespa Blog

Re-autoresearching MSMARCO BM25, on Vespa

Fri, 29 May 2026 00:00:00 +0000

BM25 is having a moment

Google search interest in “BM25” jumped about 5× in early August 2025. Around the same time, OpenAI’s models started volunteering BM25 noticeably more — gpt-4o named it on 12% of neutral retrieval prompts; gpt-4.1 and gpt-5-chat on 30–35%.¹ The LIMIT paper showing dense embedding models flubbing trivial retrieval landed three weeks later.²

Whatever the reasons for this spike, the renewed interest in lexical search is likely a good thing. Lexical scoring is still a very robust baseline, especially in out-of-domain or zero-shot settings. And BM25 is a great baseline, but can we do better?

Building on BM25

Earlier this month, Doug Turnbull published a really neat autoresearch experiment: let an LLM iterate on a Python BM25 reranker for 8 rounds and see how much better it gets on the MSMARCO³ passage-ranking benchmark. He randomly selects a 650k-passage slice he calls “minimarco”, and his agent is able to improve MRR@10 from 0.4913 → 0.5350 (+0.044). Very cool!

Of course, we wanted to try it for ourselves too. Our Vespa twist on it: instead of letting an LLM write arbitrary code in reranker.py, can we get a similar lift while limiting our search space to existing Vespa rank features?

Spoiler: It turns out we can indeed get a significant improvement - and one that generalises better to the full dataset too. Let’s show you exactly how we did it!

Reproducing Doug’s setup

His recipe to create the “minimarco” subset is two lines of pandas:

collection = pd.read_csv("collection.tsv", sep="\t", names=["doc_id", "description"])
minimarco = collection.sample(n=650_000, random_state=42).reset_index(drop=True)

That trims the corpus from 8.84M MSMARCO passages down to 650k. Of the 6,980 dev queries, 543 still have a labeled relevant passage that landed in our random subset (the rest are unscoreable here, since their answer isn’t in the sample).

We loaded the 650k passages into Vespa with a minimal schema, setting the two BM25 hyperparameters to the Anserini-tuned MSMARCO defaults commonly used on this benchmark.

rank-profile bm25 inherits default {
    first-phase {
        expression: bm25(description)
    }
    rank-properties {
        bm25(description).k1: 0.6      # Anserini-tuned MSMARCO defaults
        bm25(description).b: 0.62
    }
}

Running the 543 scoreable queries against this schema gives MRR@10 = 0.4907. Doug’s number is 0.4913 - close enough that the small gap could be explained by e.g. differences in stemming - the way his SearchArray’s Snowball vs. Vespa’s OpenNLP English⁴ normalize words. So we’re starting from the same baseline - now let’s see what we can add!

The three tweaks that moved the needle

We had time to try about 20 things in an afternoon. Three of them robustly survived 10 paired rotations against the running baseline — meaning each round we drew 10 different random subsets of the dev queries, evaluated both the candidate and the current best config on the same queries, and only kept changes that beat the previous best across all 10 subsets.

#1: Stopword-limit on weakAnd

Vespa’s text() operator searches its tokens with a weakAnd by default, which has a built-in document-frequency (DF) based stopword filter. We set it extremely aggressively here - ranking.matching.weakand.stopwordLimit = 0.05, which means Vespa automatically drops query terms that appear in more than 5% of docs. No need to create a hand-curated list. This makes our queries faster too - the high-DF terms have the longest posting lists, so skipping them more than halved our wall-clock latency.⁵

stopword-limit	Δ MRR@10 paired
off	-
0.05	+0.0136
0.02	-0.0088

At 0.02 the filter has become way too aggressive and starts dropping important content words.

#2: nativeProximity

The nativeProximity feature is a continuous proximity score that rewards docs where the matched query terms are close together. Doug’s agent did something loosely similar by writing code to score based on adjacent-bigram phrase term frequencies. We just pulled it from Vespa’s rank-feature catalog:

rank-profile lexical inherits default {
    inputs {
        query(w_prox) double: 0.0      # query-time tunable weight
    }
    first-phase {
        expression: bm25(description) + query(w_prox) * nativeProximity
    }
    ...
}

Our sweep results: w_prox = 10 is a good value here, with a wide plateau (8-14 are all within noise). Wide plateaus like this are a positive sign that the gain is real, not a noise-fitted spike.

#3: fieldMatch.earliness

fieldMatch(description).earliness is a feature that rewards matches near the start of the field. A match is often a stronger signal if it appears early — writing tends to introduce its main topic up front (headline, abstract, summary). w_fm_early = 8 peaks at +0.0189 paired over the proximity-only anchor.

Iterating fast — and validating more thoroughly

A nice property of doing this in Vespa: every “did this weight help?” question is just a query parameter against the index which is already built when we fed the data. ~33 ms per query, ~0.4 s for a 109-query training eval, ~22 s to test one weight value across 10 paired rotations. It stayed fast because we weren’t paying for re-indexing or re-scoring of every query/document pair; we were just passing numbers in an HTTP body.

That speed is also what lets us validate each candidate more thoroughly. The eval set is tiny - 109 training queries per rotation - so any single split is noisy. Try 20 weight combos against one split, and one or two will look like wins by pure luck (the paired noise is ±0.014, bigger than the gains we’re chasing). Our fix is to test each candidate across many different random splits and keep only what holds up everywhere. That’s painful if every eval is an LLM call or a re-index, but nearly free when it’s just another query - so we ran 10 rotations per weight. That’s the difference between a real +0.005 and a lucky one, and possibly a big reason the single-split agent loop overfits: not enough independent looks at the data before committing to a change.

Our best recipe from this run

bm25 + 10·nativeProximity + 8·fieldMatch.earliness + sw=0.05. On the full 543-query minimarco dev set: MRR@10 = 0.5163 (+0.0256 over BM25 = 0.4907). Doug’s agent gets 0.5350 (+0.044), so it’s ahead of us by about 0.019 in absolute terms.

What happens on full MSMARCO?

First, what these numbers mean: minimarco is also where we tuned, so the minimarco column is in-sample. The 10-rotation gate guards against getting lucky on a single split — but not against fitting the 543-query subset as a whole. The full 8.84M-doc corpus is the real out-of-sample test, so that’s the column to trust for generalization.

Doug honestly flags this in his post: his agent’s gains don’t generalize well. On the full 8.84M-doc benchmark, his round-8 reranker scores 0.1991 vs BM25’s 0.1897 - only +0.0094 of the original +0.044 improvement survives. We tried our configurations on the full corpus too:

How well do our results transfer to the full dataset?

	minimarco	full MSMARCO	retention
BM25	0.4907	0.1901	N/A
Free-form-Python agent	0.5350 (+0.044)	0.1991 (+0.009)	21%
Our agent (Vespa rank features)	0.5060 (+0.015)	0.2053 (+0.015)	99%
Our manual sweep (Vespa rank features)	0.5163 (+0.026)	0.2106 (+0.021)	80%

Most of our minimarco lift from both attempts survives the jump to full MSMARCO - these are all features with generalizable signal, re-tuning the weights on the full dataset might squeeze out more.

Why this gap?

An LLM with free-form Python has more rope. By round 8 the agent hard-codes a stopword list containing vacat and medicin and a conditional like toks[1] not in ("can", "invent"). Those help on the minimarco subset; they don’t help anywhere else - it’s overfitting.
We’re using rank features that IR researchers and Vespa engineers have already validated as carrying generalizable signal. We didn’t have to invent nativeProximity from scratch here. The LLM has to rediscover something like it from termfreq primitives within a single round, which is harder.
The LLM writing arbitrary python does have the freedom to invent completely novel techniques, but on this extremely well-studied dataset we can perhaps consider it somewhat unlikely.

Our own “autoresearch” loop

In our results, we refer to our first experiment as “manual”, but it’s 2026 - that sweep wasn’t a human hand-typing weights either. We used a coding assistant (Claude Code) to do the exploration, with us steering at a high level and holding it to the 10-rotation rule. So really, it’s LLM-vs-LLM - what changes between the runs is the search space and how rigorously each one accepts a change, not human vs machine.

After the quick sweep we wondered: would a fully autonomous LLM agent like Doug’s, inside our constrained search space, generalize any better? We built a small loop (same eval_margin = 0.002, same rotating seeds, same gpt-5.5 model with xhigh reasoning) that edits the Vespa first-phase rank expression instead of reranker.py. The key difference from Doug’s setup: a change is accepted only if it clears the same 10-rotation paired-robustness check our manual sweep uses — so the agent and the sweep differ only in search space and autonomous-vs-steered, not in how rigorously a change is accepted. ~700 lines of Python, $6 of OpenAI spend, 30 minutes.

In our run the agent found two valuable features — nativeProximity and fieldMatch.earliness, the same pair the manual sweep landed on — and reached MRR@10 = 0.2053 on full MSMARCO (+0.0152): about 99% retention of its minimarco lift, even higher than the manual sweep’s 80% and far above the free-form-Python run’s 21%. Its absolute score sits a touch below the manual sweep’s (0.2053 vs 0.2106) because it didn’t add the stopwordLimit matching-side lever in this run — but almost all of the lift it did find carried over. That’s the constrained search space doing its job: the agent can’t encode token-specific tactics like a hard-coded stopword list, so its ceiling on overfitting is lower.

Our final config

body = {
    "yql": "select doc_id from passage where description contains ({language:'en'}text(@q))",
    "q": user_query_text,
    "ranking.profile": "lexical",
    "hits": 10,
    "language": "en",
    "input.query(w_prox)": 10.0,
    "input.query(w_fm_early)": 8.0,
    "ranking.matching.weakand.stopwordLimit": 0.05,
}

This is all that’s needed to use these features. Just a simple linear combination of well-known signals, and a demonstration that there is much more tuning potential to be had from lexical search too. And your coding agent already knows how to do it!

BM25 has served the IR community well for 30+ years, and we don’t think it’s going anywhere. But Vespa has all the pieces in place to go beyond the baseline - with lexical signals, advanced multi-vector ranking and more, and we keep working to raise the bar. Subscribe to the newsletter if you’d like to hear about it!

Going further

The full code for this experiment — the Vespa app, the paired-rotation sweep harness, and the LLM agent loop, with a step-by-step reproduction guide — is at vespaai-playground/msmarco-bm25-autoresearch.

To actually give your coding agent the Vespa knowledge to quickly succeed, there’s an official skills pack for Claude Code / Codex / Cursor / Gemini CLI: github.com/vespaai-playground/skills — which includes schema authoring, rank features, query building, etc.

Want to go even further? A weighted sum of three features is still on the simple end of the spectrum. The next step is to stop hand-weighting and let a model learn the best combination on your dataset: the RAG Blueprint collects ~190 lexical and semantic match/rank features and trains a GBDT (LightGBM) model to combine them — a more advanced version of what we did by hand here, and it mixes BM25-style lexical signals with vector-semantic ones.

Want to play with Vespa? Start with the free trial or pull the vespaengine/vespa container.

Notes

Come hang out in Vespa Slack or Discord if you want to chat ranking features or compare notes on retrieval evals. Thanks to Doug for publishing the experiment!

We asked the three models the same six neutral retrieval prompts, 10 reps each. BM25 mention rates: gpt-4o 12%, gpt-4.1 30%, gpt-5-chat 35%. The shift happens at gpt-4.1, not GPT-5. ↩
On the Theoretical Limitations of Embedding-Based Retrieval, Google DeepMind & Johns Hopkins, arXiv:2508.21038 (Aug 2025). LIMIT is a benchmark of deliberately simple retrieval queries — top embedding models scored under 20% recall@100 on queries as simple as “who likes apples?”. ↩
MSMARCO is a widely-used passage-ranking benchmark — 8.84M short passages and a dev set of 7,437 qrels (query + relevant-passage-ID pairs collected by human annotators). MRR@10 = mean reciprocal rank of the first relevant passage in each query’s top-10 results, averaged across queries (0 if no relevant passage in top-10). ↩
Vespa also ships Lucene Linguistics as an alternative for stemming, tokenization, and language detection — we stuck with the default OpenNLP here and did no tuning of linguistics. ↩
Tripling the query performance of lexical search takes a complementary latency-and-cost-focused approach — stopwordLimit=0.6 combined with adjust-target and filter-threshold for 3–11× speedup with negligible quality change. We pushed stopwordLimit to the aggressive end of the range for quality, since on MSMARCO the benefit comes from setting the parameter low enough to exclude common question words — about 72% of dev queries are question-style (“what is X”, “how do Y”), and at 0.6 those terms are not excluded. On keyword-heavy workloads the optimum would be higher — re-measure on your own query distribution before adopting. Query rewriting on lexical queries is likely stronger, but we didn’t do it here. ↩

Vespa Newsletter, May 2026

Wed, 27 May 2026 00:00:00 +0000

Welcome to the latest edition of the Vespa newsletter. In the previous update, we introduced the Vespa.ai Playground, the Vespa Kubernetes Operator, Pyvespa 1.0, and more.

This month, we’re announcing several updates focused on retrieval quality, ranking flexibility, and developer productivity. Each feature is designed to help engineering teams build faster, more accurate, and more maintainable retrieval and ranking systems, while giving businesses better relevance, lower operational overhead, and more predictable performance at scale.

Let’s dive into what’s new.

Calling All Vespians: Vespa.ai Live Has Landed

As more teams build with Vespa, bringing the community closer together has become a major focus for us. Earlier this year, we ran our first virtual meetups, extending our Slack community and were joined by more than 100 Vespians from around the world — from the US and Ukraine to Singapore, Australia, Kazakhstan, Egypt, and beyond. But while virtual events are great, nothing quite compares to meeting in person — learning from peers, exchanging ideas, and continuing the conversation over coffee, beer, or wine. That’s why we’re excited to announce our first in-person community meetup: Vespa.ai Live!

The event includes technical sessions, real-world user experiences, expert panels, interactive unconference discussions, and plenty of opportunities to connect with others building in this space. Hosted by The Search Juggler, Charlie Hull, the day will bring together external experts and leading authors Trey Grainger and Doug Turnbull, Vespa engineers, community voices with speakers from Walmart, Etsy and RavenPack, and a keynote from Vespa co-founder and CEO Jon Bratseth.

On September 9, join the pre-event training, with Vespa 101: Getting started with Vespa, and Ranking 202: A deeper dive into improving retrieval quality - details and registration.

Most of all, Vespa.ai Live is intended to be community-driven — where Vespians share lessons learned and boldly go beyond the frontier of modern AI retrieval.

Learn more about Vespa Live!

Product updates

Vespa Cloud: Detailed metric dashboards
Vespa Cloud: Index backup
Vespa Cloud: Fine-grained maintenance controls
Vespa Cloud: Voyage AI, OpenAI, and Mistral AI embedding integration
Vespa Cloud: Custom resource tags
Vespa skills for agents
A new query operator for text matching
Cluster-size independent configuration of relevance effort
Boolean array fields
Match specific array elements
In-memory document ids
Search group pinning
Near matching aware ranking
Detect ignored write operations
Accessing the max first phase score in re-ranking
Geo distance in grouping

Vespa Cloud: Detailed metric dashboards

As companies deploy their large-scale latency-sensitive applications on Vespa Cloud there is a need for more detailed insights into how the application is performing. While the Vespa Cloud Console has always provided an overview metrics dashboard, many finer details have only been available to Vespa’s operations engineers.

What’s new: We have added all the metrics dashboard used by Vespa engineers to the console so that customers who want to are empowered to dig as deeply as they like. We’ve also added explanations to the dashboards to make them easier to understand.

Vespa Cloud: Index backup

Reliability at scale means being prepared for the unexpected. Vespa Cloud now provides automated snapshot backups of indexes from content nodes, enabling catastrophe recovery without a full re-index.

What’s New: Vespa Cloud now supports backing up indexes from content nodes. The snapshot backups can be used for catastrophe restore of nodes. Read more.

Vespa Cloud: Fine-Grained maintenance controls

Vespa Cloud has always provided control over when and how application changes and Vespa upgrades are rolled out in production by the CD pipeline. In addition, Vespa Cloud does occasional OS upgrades as a background host level operation which is orchestrated but not rolled out by the CD pipeline.

We already see a need to run these processes more aggressively and anticipate that this trend will accelerate as capabilities similar to Claude Mythos become widely available. While this is necessary to maintain a strong security posture in the times ahead it has the potential to increase the impact from OS level maintenance on observed metrics, such as when these operations happen during peak traffic.

What’s new: Vespa Cloud now lets you control when these maintenance operations are allowed to happen in the same way as deployments are controlled. See the maintenance attribute in deployment.xml’s block-change tag.

Vespa Cloud: Voyage AI, OpenAI, and Mistral AI embedding integration

Embedding models are central to AI search, and it’s now simpler to use the most popular ones in Vespa Cloud.

What’s new: You can now save your API key in Vespa Cloud and invoke the embedding APIs of Voyage AI, OpenAI, and Mistral AI. These APIs can be invoked at document processing time (indexing), as well as query time. With Voyage AI and the voyage-4 model family, you can use the API for documents and a smaller, local model for queries, eliminating the need for an API call in the query path.

Read more in the Vespa embedding documentation for: Voyage AI, OpenAI, and Mistral AI.

Vespa Cloud: Custom resource tags

With Vespa Cloud Enclave, Vespa will provision resources in accounts owned by the customer. Many companies want to track these resources, e.g. for financial monitoring.

What’s New: Vespa now lets you declare custom resource tags in deployment.xml that will be applied on provisioned resources. The tag declarations can contain template variables such that resources can be tagged with e.g. the application they belong to. Read more.

Vespa skills for agents

Coding agents are good at working with Vespa applications. Giving them the relevant skills makes this even more efficient.

What’s new: We have released a collection of skills for agents working with Vespa applications, available here. This includes skills for working with application packages, feeding and queries, as well as migrating from ElasticSearch to Vespa. We run evaluations over these skills to ensure that they actually improve outcomes with current models.

A new query operator for text matching

To do lexical matching with an arbitrary input string in Vespa, you can use userInput(“my text”) in YQL. This assumes that the text can control simple syntax for controlling the matching, such as “some-field:” to specify the field to match. Sometimes, the text should just be interpreted as raw text with no such query syntax.

What’s new: Vespa now supports a new text() operator which interprets the argument text simply as raw text with no syntax. When there’s no syntax, the text can only end up searching one field or fieldSet and so the regular syntax can be used to specify the field: where my-field contains text(“my text”). Read more.

Cluster-size independent configuration of relevance effort

Vespa has various parameters to set how much effort (CPU) should be spent on providing good results in a query. These parameters are specified as a value per content node, so if you want the total expenditure to stay constant when you change the number of content nodes, you must remember to update these parameters.

This is easy to forget, and of course impossible with autoscaling activated. What’s more, when new nodes are added to clusters, they will initially have less data than other nodes, but will get the same setting as nodes with a full share of data.

What’s new: Vespa now supports alternatives of these configuration parameters prefixed by “total”, which allows you to specify values across all the content nodes. Vespa will automatically calculate the right share for each node in the cluster group, including when nodes temporarily have less data than normally.

The new “total-” parameters are:

NearestNeighbor and WeakAnd totalTargetHits: The minimum number of hits the query operator should produce
Match-phase total-max-hits / ranking.matchphase.totalmaxhits
First-phase total-keep-rank-count / ranking.totalKeepRankCount
Second-phase total-rerank-count / ranking.secondPhase.totalRerankCount

Boolean array fields

Once developers move beyond the basics to really put the power of Vespa to work they often want to pack large amounts of dynamic metadata into documents, such as for example storing information about each document’s relationship to each zip code in the US. At scale the memory efficiency with such usage really matters.

What’s new: Vespa now lets you create arrays of bits fields: field my_bits type array. Booleans can both be standalone or part of a struct type which is wrapped in an array. Read more.

Match specific array elements

Arrays in documents can be searched both as attributes and text indexes. You can also match multiple struct values or text tokens of the same array element by using the sameElement operator. In some use cases, you also want to match a specific index in the array.

What’s new: Vespa now lets you specify the array index you want to match in queries: select … where my_bits[94085] = true. You can also search for multiple indexes in the same query by using slightly more complicated syntax:

select … where my_array contains ({elementFilter:[33, 34]}sameElement(first_name contains "John", last_name contains "Doe")

See the elementFilter documentation. This is also supported in JSON queries by using an index attribute.

Search group pinning

When a content cluster has multiple groups, they will all have the same data, but their indexes will be slightly different since each node in each group has a different subset of the data written in a different order. This can lead to some inconsistency when a user is paging over a result set and hitting different groups.

What’s new: Vespa now lets you pin queries to a specific group to make pagination queries consistent. Read more.

In-memory document ids

Document ids in Vespa are only stored on disk only. This saves memory, but makes it impossible to retrieve the full ids of many documents really fast in queries and visiting.

What’s new: From Vespa 8.691 you can declare in the schema that document ids should reside in memory for fast access similar to attributes. Read more.

Near matching aware ranking

When using the near and onear query operators, the most intuitive ranking is using only the terms matching in the operator itself for rank scoring. Example:

Suppose near(term1, term2) matches document1 because of a single window where term1 and term2 appear close enough. If document1 contains term1 many additional times outside the valid proximity window, this is less relevant with respect to ranking.

What’s new: From Vespa 8.672, terms outside the match window are not considered in relevance calculations. Read more.

Detect ignored write operations

Content clusters can specify what documents they should receive in a document selection. Sending a document operation which is ignored by every cluster is not an error, but you may want to know.

What’s New: From Vespa 8.680, the document/v1 API includes a dedicated X-Vespa-Ignored-Operation HTTP response header. When an operation is ignored during routing (for example, because the target document no longer exists), this header is present and set to “true”. Read more in this issue.

Accessing the max first phase score in re-ranking

What’s new: firstPhaseMax is a new rank feature which exposes the rank score of the top scoring document locally on the node in second-phase ranking.

One usage of this is to enable dropping documents that score too low relative to the best-scoring document, by combining it with a rank-score-drop-limit.

Geo distance in grouping

What’s new: Grouping now lets you group by geo_distance:

all( group(fixedwidth(geo_distance(attribute(location), 63.4, 10.4).km, 10)) each(output(count())) )

Image Search demo

Try the image search demo to test how ranking profiles with different tensor precision in ranking affect results. Hint: binarized embeddings perform really well, read the report!

Vespa Learn

learn.vespa.ai is a self-paced course that teaches you how to build search, recommendation, and RAG applications with Vespa. You will go from zero to a working e-commerce search engine with hybrid retrieval and machine learning ranking, building it up one piece at a time across six modules.

What’s New on YouTube

Multimodal Intelligence for Life Sciences on AWS (Webinar)
The Personalization Problem in eCommerce AM (Webinar)
The Personalization Problem in eCommerce EM (Webinar)
The Relevance Problem in eCommerce (Webinar)
Vespa Now: Q1 Product Update (Webinar)
Zero Results Webinar

Find more videos in the @vespaai channel.

Blogs and ebooks

Learn how Kleinanzeigen built a single system with user behavioral profiles alongside ads, WAND for fast inner-product retrieval over sparse attribute vectors, embedding-based ANN search, and click and search events processed as document updates: From Elasticsearch to Vespa: Rebuilding the Kleinanzeigen Homepage Feed.
Scaling a Vespa Application: Feeding Fast and Furiously
The Vespa Cloud Metrics Dashboard
Using Large ONNX Models with External Data in Vespa Embedders
Asymmetric Retrieval: Spend on Docs, Embed your Queries for Free
How Metal AI Built an Agent-Driven Intelligence Platform on Vespa Cloud
Build a High-Quality RAG App on Vespa Cloud in 15 Minutes

Upcoming events

Commerce AI Summit London: June 3, London, UK. An executive-style event connecting retailers, brands, and AI solution providers.
Berlin Buzzwords: June 7-9, Berlin, Germany. Europe’s leading conference for data infrastructure, search, and machine learning.
Shoptalk Europe: June 9-11, Fira Gran Via, Barcelona. Europe’s home for retail innovation, bringing together 4,500+ trailblazers and 180+ speakers focused on AI and the future of commerce.
Etail UK: June 16-17, Manchester, England. A leading eCommerce and retail conference focused on digital commerce strategy, customer experience, and AI-driven personalization for modern retail teams.

👉 Follow us on LinkedIn to stay in the loop on upcoming events, blog posts, and announcements.

Thanks for joining us in exploring the frontiers of AI with Vespa. Ready to take your projects to the next level? Deploy your application for free on Vespa Cloud today.

Scaling a Vespa Application: Feeding Fast and Furiously

Tue, 28 Apr 2026 00:00:00 +0000

This is a blog/series on how to scale and evaluate a Vespa Application for serving enterprise-scale workloads and customer facing applications with potentially millions of users. Vespa is the AI search platform and all-in-one solution for all your retrieval and large scale computation needs.

In this blog I will show you how to feed a large dataset to a Vespa Application. We will be using the full MS_marco passages dataset, which is perhaps the most comprehensive open dataset for information retrieval. It is around 4GB and contains more than 8 million passages on a wide range of topics. The goal in this blog is to show how scaling works in Vespa through feeding the entire dataset as fast as we can.

Creating the Vespa Application

We will be using a pre-made sample application as our basis for scaling but the concepts are the same for any other application.

Setup:

Create a tenant on Vespa Cloud:

Go to console.vespa-cloud.com and create your tenant (unless you already have one).
Install the Vespa CLI using Homebrew:
```
$ brew install vespa-cli
```
Windows/No Homebrew? See the Vespa CLI page to download directly.

Configure the Vespa client:

$ vespa config set target cloud
$ vespa config set application your-tenant-name-here.scalingtutorial

Get Vespa Cloud control plane access:
```
$ vespa auth login
```
Follow the instructions from the command to authenticate.
Clone the sample application:
```
$ vespa clone scaling-tutorial scaling-app && cd scaling-app
```
This sample app is perfect for demonstrating scaling and performance as it is quite intensive to run both for feeding and querying. You can also check out sample-apps for other sample apps you can clone.
Add a certificate for data plane access to the application:
```
$ vespa auth cert
```
It is a good idea to take note of the path to the .pem files written here.

Add the cross-encoder and Colbert model

Export the cross-encoder ranker model to onnx format using the Optimum library from HF or download an exported ONNX version of the model (like in this example)

$ mkdir -p models
$ curl -L https://huggingface.co/Xenova/ms-marco-MiniLM-L-6-v2/resolve/main/onnx/model.onnx -o models/model.onnx
$ curl -L https://huggingface.co/Xenova/ms-marco-MiniLM-L-6-v2/raw/main/tokenizer.json -o models/tokenizer.json

Download the dataset

The msmarco passages dataset can be found here. Download, unzip it and place it in the ext/ folder in our application.

NOTE: You will need around 8GB of free disk space for the dataset and the subsets we will be creating.
Prepare the dataset for Vespa

Then run the script to convert it into the vespa feed format:
```
$ python3 ext/transform_ms_marco.py
```
which gives us the dataset and a few subsets of various sizes to feed to our application.

Deploying and Feeding

We now have everything we need for deployment, feeding and scaling! Scaling a vespa application is largely managed through the services.xml file. This is what the file currently looks like:

XML — services.xml

The important parts to take note of in this tutorial are the two resource specifiers in the and tags:

Container

 deploy:environment="dev" count="1">
     vcpu="1.0" memory="8Gb" architecture="arm64" storage-type="local" disk="59Gb"/>

Content

 deploy:environment="dev" count="1">
     vcpu="1.0" memory="8Gb" architecture="arm64" storage-type="local" disk="59Gb"/>

This is where we configure the machine resources that our Vespa application runs on in Vespa Cloud.

NOTE: when deploying to dev we need to add the specifier to ensure we actually get the resources we ask for, otherwise we default to what is quickly available.

Adding more resources or more nodes are the main parameters that need to be tweaked in order to scale your application. Right now we have provisioned the smallest amount of resources to our application.

Deploy the application to Vespa Cloud:

$ vespa deploy --wait 900

(It might take a little bit of time for all services and nodes to go up and start running.)

You can follow the progress of the deployment from the terminal or in your tenant in your cloud console. When it is finished you should get the message:

Application up!

If you go to your cloud console you should be able to see your application. Note that we haven’t fed it any documents yet, so it should look something like this:

Let’s feed some documents. Feed the smallest dataset to Vespa using:

vespa feed ext/corpus_transformed_1000.jsonl

or, on Unix systems:

time vespa feed ext/corpus_transformed_1000.jsonl

to see how long it takes.

It will take a few minutes as we are doing heavy computations on very modest resources.

If you want to see a live count of how many documents that are in Vespa you can run:

vespa query 'yql=select * from passage where true' 'hits=0' 'ranking=unranked'

to see how many documents have been processed so far (under documents).

On this lowest resource configuration we get this result.

vespa feed ext/corpus_transformed_1000.jsonl  4.96s user 7.11s system 3% cpu 5:56.05 total 

If we were to try and feed the whole 8.8 million passage msmarco dataset on this instance it would take more than a month to finish feeding!

We can do better!

Scaling

Before scaling the application we’ll delete the documents from our instance so that we have a fresh start.

Now lets assign more resources to our Vespa instance. From our schema we see that we are doing extensive computations during feeding (notice the configuration in the indexing parameters)

Schema

  field e5_embedding type tensor(x[384]) {
    # Using the e5 embedding model defined in services.xml
    indexing: input text | embed e5_embedding_model | attribute | index
    attribute {
      distance-metric: angular
    }
    index { # override default hnsw settings 
      hnsw {
        max-links-per-node: 32
        neighbors-to-explore-at-insert: 400
      } 
    }
  }

  field colbert_embeddings type tensor(dt{}, x[16]) {
    # No index - used for ranking, not retrieval 
    indexing: input text | embed colbert_embedding_model | attribute
    attribute: paged
  }

Embedding in Vespa happens in the container cluster, so it is a very reasonable guess that if we can make the embeddings go faster, our whole system will be faster (bellow in this blog we will show how to more thouroughly deduce scaling parameters). So lets start by scaling up the resources for the container node. To see what resource configurations we have available we must look at the instance type page in the documentation. Embedding-computations are best suited to run on GPUs, so we will select an instance type with a GPU:

Container

 deploy:environment="dev" count="1">
     vcpu="4.0" memory="16Gb" architecture="x86_64" storage-type="local" disk="125Gb">
         count="1" memory="16.0Gb"/>
    

Replace the resources in the container node in services.xml with the new instance type (see above). Leave the content node resources as is for now.

Run the command for checking document count to make sure that it is zero:

vespa query 'yql=select * from passage where true' 'hits=0' 'ranking=unranked'

and redeploy:

$ vespa deploy --wait 900

When the deployment is finished we’ll time the feeding process again.

time vespa feed ext/corpus_transformed_1000.jsonl

vespa feed ext/corpus_transformed_1000.jsonl  0.40s user 0.53s system 8% cpu 11.371 total 

11.4 seconds, that’s more like it! Instead of a month, this new instance would be able to crunch through the full dataset in just around a day!

We have now significantly upgraded a part of the hardware Vespa is running on. But before we scale up further we shall take a look at the metrics tab for our application. Go to Metrics and then resources.

This is where you see the resource usage history in your vespa instance, but most importantly it gives you a clear image of where your application is bottlenecked. The bottleneck for your application will be different depending on how your application is configured and the kind of computations you do. The previous 1000-line dataset was no match for the upgraded instance, so lets give it a bigger one to get some proper bottleneck data:

Delete the documents from the instance again, wait a bit, and run the command to ensure that we have no documents in our application

vespa query 'yql=select * from passage where true' 'hits=0' 'ranking=unranked'

Now we’ll feed the 50 000 line dataset to properly test and time the upgraded instance.

time vespa feed ext/corpus_transformed_50000.jsonl

Result

vespa feed ext/corpus_transformed_50000.jsonl  17.44s user 22.68s system 11% cpu 5:56.41 total (~17.6 hours for full dataset)

This is a more accurate reading of the instance’s performance, and at 5min 56s to feed 50 000 documents, the full dataset would take around 17 and a half hours.

Look at the resources in the metrics and set it to show only the last 30 minutes so that we can see more clearly what went on. Notice the CPU-utilisation and the GPU-utilisation graphs. Notice that the GPU usage on the container node hit 100% and stayed there for the entire feeding process. The CPU usage on the container node peaked at 80% but leveled at around 60% and the content node’s CPU barely went over 50%.

It is clear that on this Vespa Instance, the bottleneck for better feeding performance lies in the GPU processing. If we want to improve the feeding performance of the system, then we must increase the amount of GPUs in the container node.

Now that we know where the problem lies: Lets make it go faster! We’ll increase the amount of GPU nodes to 5 with the count="5" parameter in the container node in services.xml.

Container

 deploy:environment="dev" count="5">
     vcpu="4.0" memory="16Gb" architecture="x86_64" storage-type="local" disk="125Gb">
         count="1" memory="16.0Gb"/>
    

save the services.xml file and redeploy:

$ vespa deploy --wait 900

Now lets feed the larger dataset:

time vespa feed ext/corpus_transformed_500000.jsonl

Result

vespa feed ext/corpus_transformed_500000.jsonl  97.59s user 95.22s system 10% cpu 29:38.18 total (~8.8 hours for full dataset)

If we extrapolate the results we see we got around twice the speed of the single-container node instance. But why not 5 times the speed? Let’s look at the metrics.

We see that the container GPU utilization now sits comfortably at around 50% and the container CPU at around 20-30%. But the content node CPU utilization sits near 100%. The 5 content nodes with GPUs saturate the single content node’s ability to take in data. We have found the new bottleneck of the system.

We’ll add some more content nodes:

Content

 deploy:environment="dev" count="2">
     vcpu="1.0" memory="8Gb" architecture="arm64" storage-type="local" disk="59Gb"/>
 

Delete the documents, redeploy, and refeed:

$ vespa deploy --wait 900

time vespa feed ext/corpus_transformed_500000.jsonl

Result

vespa feed ext/corpus_transformed_500000.jsonl  92.22s user 88.21s system 17% cpu 16:51.51 total (~5.0 hours for full dataset)

Adding the second content node almost doubles the performance again. Look at the metrics to see what is going on.

We see now that the container GPU (70-80%) and the content node CPU (80-90%) are both highly utilised, whilst the container node CPU is around 40%. Since we are already on the smallest instance type with a GPU we can’t scale down the cpu to match the others, so we have actually found a near optimal balance of container and content node resources for feeding this application.

Now that we have found a good balance, lets really scale up!

Feeding Fast: 20 GPUs

If we want serious feed throughput, we need serious hardware. Let’s scale the container and content nodes proportionately and jump to 20 GPU container nodes and 8 content nodes at the same time:

Container

 deploy:environment="dev" count="20">
     vcpu="4.0" memory="16Gb" architecture="x86_64" storage-type="local" disk="125Gb">
         count="1" memory="16.0Gb"/>
    

Content

 deploy:environment="dev" count="8">
     vcpu="1.0" memory="8Gb" architecture="arm64" storage-type="local" disk="59Gb"/>

NOTE: At this point you will most likely hit the quotaExceeded error when you try to deploy. Vespa Cloud tenants have a default quota that prevents you from accidentally spending a lot of money. If you want to go past it, reach out to Vespa support. With the limit raised, redeploy:

$ vespa deploy --wait 900

Delete any existing documents, wait for the count to hit zero, and feed the 500 000 line dataset again:

time vespa feed ext/corpus_transformed_500000.jsonl

Result

vespa feed ext/corpus_transformed_500000.jsonl  59.66s user 48.08s system 42% cpu 4:13.19 total (~1 hour 15 min for full dataset)

At an estimated 1 hour and 15 minutes for the full dataset we see that we got pretty much exactly 4x feeding speed with 4x the resources. We also see that the utilisation metrics are essentially the same as the last run (feeding at 11:30), just faster.

Feeding Furiously: 100 GPUs

Finally, because we can: 100 GPU container nodes and 40 content nodes, and this time we will feed the full 8.8 million passage dataset in one go.

Container

 deploy:environment="dev" count="100">
     vcpu="4.0" memory="16Gb" architecture="x86_64" storage-type="local" disk="125Gb">
         count="1" memory="16.0Gb"/>
    

Content

 deploy:environment="dev" count="40">
     vcpu="1.0" memory="8Gb" architecture="arm64" storage-type="local" disk="59Gb"/>

Delete the documents, Deploy, then feed the full dataset:

time vespa feed ext/corpus_transformed_full.jsonl

Result

vespa feed ext/corpus_transformed_full.jsonl  695.31s user 605.48s system 108% cpu 20:04.23 total

The Vespa instance managed to process more than 8.8 million passages, with embeddings and ColBERT vectors computed for every single one, in just over 20 minutes (over a fast internet connection).

The feed client also gives us a nice summary at the end:

{
  "feeder.operation.count": 8841823,
  "feeder.seconds": 1201.608,
  "feeder.ok.count": 8841823,
  "feeder.ok.rate": 7358.324,
  "feeder.error.count": 399,
  "http.request.count": 8844266,
  "http.response.latency.millis.avg": 167,
  "http.response.code.counts": {
    "200": 8841823,
    "429": 2044
  }
}

The feeding process had an average feed rate of around 7358 documents per second. Now that is fast and furious!

Conclusion

The best way to scale your Vespa instance is to use the metrics dashboards to see where the bottlenecks lie. There is no singular best instance of Vespa as the computational requirements are highly dependent on how you define your application. Feed the instance a sizable corpus to see how it performs under sustained load, and adjust its resources accordingly.

The Vespa Cloud Metrics Dashboard

Fri, 24 Apr 2026 00:00:00 +0000

When something goes wrong in production, the hard part is rarely finding a metric. The hard part is figuring out which metric tells you where to look next.

The Vespa Cloud metrics dashboard is designed for exactly that. Instead of treating monitoring as a wall of graphs, it helps you move from symptom → bottleneck → action.

Start with three questions

Most production issues can be reduced to three questions:

Is the system healthy?
Where is latency added?
Are we running out of resources?

The dashboard mirrors that flow.

1. Is the system healthy?

Start on the Overview tab. This is the fastest place to answer “is anything obviously broken?”. A healthy system keeps read and write QoS close to 100%. If it drops, look at whether 4xx or 5xx responses are rising — 5xx responses usually mean the problem is on the server side. A rise in degraded or failed queries means it is time to continue into the Query tab.

See the docs for the full reference: Metrics Overview tab.

2. Where is latency added?

Latency in Vespa is layered — a slow request is not just “slow”, it can be slow in different parts of the path:

HTTP → container → content nodes → ranking

That is why the dashboard shows several latency metrics for what feels like the same request. If HTTP latency is much higher than query latency, payload size or network overhead may be the issue. If search-protocol latency on the content nodes is high, the bottleneck is deeper in the system.

See the docs for a layer-by-layer walkthrough: Query tab and Feed tab.

3. Are we running out of resources?

Once you know where the slowdown is, switch to the Resources tab. As a rule of thumb, sustained utilization above roughly 80% is a sign the cluster may need more headroom. If one host is much hotter than the others, enable per-host metrics and look for uneven load distribution.

See the docs for healthy-value tables and scaling guidance: Resources tab.

What’s new in the latest revision

The dashboard has picked up a few improvements worth calling out.

Health Indicators on the Overview tab

The Overview tab now opens with a dedicated Health Indicators row — five stat panels that surface stability issues in a single glance: Core Dumps (1h), Restarts (1h), Feed Blocked, Content Cluster with Groups/Nodes Down, and Container Nodes with Services Down.

Details and healthy values: Health Indicators.

Annotations for Service restart and Core dump

Annotations are the vertical lines drawn across every chart when an operational event happens — Vespa upgrades, feed blocked, data migration, reindexing, autoscaling changes. Two annotations were added recently and they are worth flagging:

Service restart — fires when a Vespa service process restarts. Outside of planned upgrades, restarts usually mean a crash, OOM, or forced stop.
Core dump — fires when a process core-dumps. Should be extremely rare.

When a latency anomaly lines up with one of these annotations, you get the context for the change without having to infer it from the graph alone. Both signals also feed the Overview’s Health Indicators row, so the same event shows up in three places: the counter, the annotation line, and the Health tab’s historical time series.

Full annotation reference: Annotations.

Container thread pool rows, one per configuration case

The Resources tab used to have a single thread-pool row that was mostly empty — a container only has the thread pools that match its services.xml configuration (

, , or both). The row has been split into three case-specific rows:

Thread Pools (search + document-api) for full-feature containers
Thread Pools (search only) for query-only containers
Thread Pools (document-api only) for feed-only containers

Classification is automatic — hidden variables derive the cluster list per case from Prometheus set operations, so only relevant rows render for a given deployment. Each thread pool now gets its own panel with avg (green) and max (yellow dashed) on the same chart.

Details: Container Thread Pools.

JVM memory breakdown (heap / direct / native)

The Resources tab separates the three layers of container memory: heap, direct, and native. This matters on container nodes that run embedders or local LLM components — model weights are memory-mapped and partially resident, but KV cache and compute buffers are allocated upfront as native memory. When node memory is high but heap and direct look normal, the native layer is usually where to look.

Details: JVM memory breakdown.

A simple workflow

A practical way to use the dashboard during an incident:

Open Overview and scan the Health Indicators row.
Confirm the symptom (QoS drop, latency spike, error-rate increase).
Use Query or Feed to find the slow layer.
Use Resources to confirm whether the cluster is saturated.
Cross-reference annotations for restarts, upgrades, reindexing, or migration.

That flow gets from “latency is up” to “this is the actual bottleneck” much faster than scanning every chart. The common workflows section of the docs has recipes for the most frequent scenarios.

Summary

The Vespa Cloud metrics dashboard works best as a troubleshooting tool — not a metrics catalog. Start with health, follow the latency path, confirm with resources, and use annotations to connect spikes to real events. The tab reference, healthy-value tables, and step-by-step workflows live in the Monitoring documentation.

Using Large ONNX Models with External Data in Vespa Embedders

Fri, 27 Mar 2026 00:00:00 +0000

Many popular ONNX models exceed the 2 GB protobuf format limit and store their weights in separate external data files. Until recently, these models could not be used directly in Vespa’s built-in embedders.

This was a long requested feature on our tracker (see GitHub issue #28761).

The 2 GB limitation

ONNX uses Google’s Protocol Buffers as its serialization format. Protobuf has a hard limit of 2 GB on message size. For smaller models, this is not a problem — all tensor data (the model weights) is embedded directly in the .onnx file, making it self-contained.

As models grow larger, they inevitably hit this limitation. For a model exceeding 2 GB, ONNX tooling splits it into two parts:

A small .onnx file containing the model graph structure (typically a few hundred KB to a few MB).
One or more external data files (commonly named .onnx_data) containing the actual tensor weights.

Note that reduced-precision variants of these models (INT8, FP16, etc.) are often small enough to fit in a single self-contained .onnx file. The external data split primarily affects the full-precision versions.

Previously, if you pointed a Vespa embedder at a model with external data files, ONNX Runtime would fail to load it because the data files were not available alongside the model file.

What changed

Vespa embedders now automatically handle ONNX models with external data files. When you configure an embedder with a URL pointing to an .onnx file, Vespa inspects the model to check whether it references any external data files. If it does, Vespa downloads those files automatically before loading the model.

This feature is available starting from Vespa 8.544.

How to use it

Here is an example using EmbeddingGemma 300M, which uses external data:

 id="default" version="1.0">
   id="gemma" type="hugging-face-embedder">
    
      url="https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX/resolve/main/onnx/model.onnx"/>
    
      url="https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX/resolve/main/tokenizer.json"/>
    2048
    
      task: search result | query: 
      title: none | text:

If you are deploying to Vespa Cloud, you can also use models from the Vespa Model Hub that use external data. For example, the Multilingual-E5-large model (will be available on Vespa Cloud 8.668+):

 id="default" version="1.0">
   id="e5" type="hugging-face-embedder">
     model-id="multilingual-e5-large"/>
    512
    
      query: 
      passage: 
    
  

This works with our ONNX-based embedders:

It’s also possible to use private models — authentication tokens are propagated when downloading external data files.

Current limitations

There are a few constraints to be aware of:

Embedders only. Models used directly in ranking expressions must still be self-contained and under 2 GB.
URL-referenced or Model Hub models only. Models bundled in the application package using the path attribute do not support external data. Models referenced via url or model-id (Vespa Cloud) are supported.
External data files must be co-located with the model. The external data files are resolved relative to the model URL. They must be in the same directory (or a subdirectory) as the .onnx file.

See the ONNX model documentation for the full list of requirements.

If you need more extensive support for ONNX models with external data — for example in ranking expressions — feel free to file an issue.

Asymmetric Retrieval: Spend on Docs, Embed your Queries for Free

Tue, 10 Mar 2026 00:00:00 +0000

At 10,000 queries per second with ~30-token queries, you’re pushing ~18 million tokens per minute through your embedding API. At $0.02 per million tokens, that’s over $15,000/month — just for query embeddings. Documents are embedded once. Queries are embedded forever.

What if you could drop that to $0?

That’s the promise of asymmetric retrieval: embed your documents with the best model money can buy, then embed queries with a tiny model running locally — for free. Voyage AI’s new voyage-4 family is the first to make this practical, and Vespa now has native support for it.

The asymmetric insight

The conventional approach is to use the same embedding model for documents and queries. Same model, same vector space. But it ignores a fundamental asymmetry.

Document embedding is a one-time cost. You embed each document once at indexing time, and it’s not latency-sensitive — whether it takes 10ms or 500ms doesn’t matter because no user is waiting. You can throw the biggest, most accurate model at it and take your time.

Query embedding is the opposite. It’s on the critical path of every single request, continuously, at scale. It needs to be fast, and at 10K QPS the cost dwarfs everything else.

Why use the same model for both?

Asymmetric retrieval splits these two concerns:

Documents — Embed once with voyage-4-large. Best accuracy, API-based, no rush.
Queries — Embed continuously with voyage-4-nano. Tiny, local, free.

This works because all four models in the Voyage 4 family — voyage-4-large, voyage-4, voyage-4-lite, and voyage-4-nano — produce compatible embeddings in a shared vector space.

It also means you can upgrade your query model independently. Start with voyage-4-nano for cost, move to voyage-4-lite for quality — without re-embedding a single document.

The shared embedding space opens up document-side flexibility too. In a multi-tenant system, you could use different models for different tiers — voyage-4-large for premium customers who need the best retrieval quality, voyage-4-lite for cost-sensitive tenants — all searchable with the same query model. Same index, same query path, different quality/cost tradeoffs per tenant.

The numbers

Cost

Let’s be concrete about the 10K QPS scenario:

10,000 QPS × 30 tokens = 300,000 tokens/sec
300,000 × 60 × 60 × 24 × 30 = ~777 billion tokens/month
At $0.02/1M tokens ≈ $15,500/month for query embeddings via API

With voyage-4-nano running locally on the Vespa container: $0/month. The model runs as part of the serving infrastructure you’re already paying for.

Latency

API calls add network round-trips. Local inference on voyage-4-nano runs in single-digit milliseconds on CPU.

Quality

Voyage 4 is state-of-the-art. On the RTEB benchmark (29 retrieval datasets, NDCG@10), voyage-4-large beats the competition:

Comparison	Improvement
vs. Gemini Embedding 001	+3.87%
vs. Cohere Embed v4	+8.20%
vs. OpenAI v3 Large	+14.05%

And asymmetric retrieval — querying with a smaller model against voyage-4-large document embeddings — preserves retrieval quality across medical, code, web, finance, and legal domains.

Storage

Binary quantization gives you a 16x memory reduction over bfloat16 — 2048-dim vectors go from 4,096 bytes to 256 bytes. The full-precision floats are still used for second-phase reranking, paged from disk only when needed. For a deeper dive on quantization tradeoffs, see Embedding Tradeoffs, Quantified.

Why this matters at scale

Cost and quality are table stakes. The real question for large-scale systems is: does this work in production?

Independent scaling

Vespa separates stateless containers (where embedding runs) from content clusters (where data lives). This means you can scale query embedding capacity independently from storage. Need more QPS? Add container nodes. More documents? Add content nodes. They don’t interfere.

No external API on the query path

This is the underrated benefit. With asymmetric retrieval, the query embedding model runs locally inside Vespa — your critical search path has zero dependency on an external API.

That matters when:

The API goes down. Every embedding API has outages. If your query path depends on one, your search goes down with it.
You get rate-limited. Traffic spikes don’t care about your API quota. A sudden 3x in query volume means dropped requests — or queued requests that blow your latency budget.
You need to scale fast. Adding Vespa container nodes takes minutes. Negotiating higher API rate limit may take days. On Vespa Cloud, autoscaling handles traffic spikes automatically — container clusters are stateless and scale up quickly.

Keeping the query path self-contained turns your search system from “works when everything is up” into “works, period.”

Two-phase ranking

Binary vectors are fast — Vespa can do ~1 billion hamming distance calculations per second. But binary quantization loses precision. Vespa’s phased ranking recovers it:

First phase: Hamming distance on binary embeddings. Fast, cheap, scans the full index.
Second phase: Float dot-product on the top 2,000 candidates. Accurate, but only touches a bounded set of vectors paged from disk.

This gives you the speed of binary search with the accuracy of full-precision reranking.

Enterprise-proven

This isn’t theoretical. Vespa runs search and recommendation at Spotify, Yahoo, and Perplexity — billions of documents, thousands of QPS, sub-100ms latency. The architecture handles it.

How to set this up

Here’s the complete Vespa configuration for asymmetric retrieval with Voyage AI.

Schema

Two embedding fields — binary for fast retrieval, float for accurate reranking:

schema doc {
  document doc {
    field id type string {
      indexing: summary | attribute
    }
    field text type string {
      indexing: index | summary
    }
  }

  field embedding_float type tensor(x[2048]) {
    indexing: input text | embed voyage-4-large | attribute
    attribute {
      distance-metric: prenormalized-angular
      paged
    }
  }

  field embedding_binary type tensor(x[256]) {
    indexing: input text | embed voyage-4-large | attribute
    attribute {
      distance-metric: hamming
    }
  }
}

The paged attribute on embedding_float tells Vespa to keep these vectors on disk, paging them into memory only during second-phase reranking. The binary embeddings stay in memory for fast first-phase retrieval.

Embedders (services.xml)

Two embedders — one API-based for documents, one local for queries:

 id="default" version="1.0">
   id="voyage-4-large" type="voyage-ai-embedder">
    voyage-4-large
    apiKey
    2048
     max-size="20" max-delay="20ms"/>
  

   id="voyage-4-nano" type="hugging-face-embedder">
     model-id="voyage-4-nano-int8"/>
     model-id="voyage-4-nano-vocab"/>
    32768
    mean
    true
    
      Represent the query for retrieving supporting documents: 
    
  

The voyage-ai-embedder handles vector quantization automatically — it infers the target precision from the destination tensor type. bfloat16 fields get full-precision embeddings; int8 fields get binary representations.

The hugging-face-embedder runs voyage-4-nano locally. No API calls, no rate limits, no cost. Both model references (voyage-4-nano-int8, voyage-4-nano-vocab) resolve via the Vespa Model Hub.

A note on “quantization” — two different things. The voyage-4-nano-int8 in the model-id refers to model weight quantization: the ONNX model file uses INT8 weights instead of FP32, which makes inference 2-3x faster on CPU with negligible quality loss. This is about how the model itself is stored and executed. The embedder still produces full-precision float vectors as output. Vector quantization is a separate concern — it’s about the precision of the output embeddings you store and search over (bfloat16, int8/binary, etc.). That’s controlled by the tensor type in your schema field, not the model format. These are independent knobs: you can run an INT8-quantized model that outputs float vectors, then store them as binary. For a deeper dive with benchmarks on both, see Embedding Tradeoffs, Quantified.

Rank profile

Two-phase ranking: hamming distance first, float reranking second:

rank-profile binary-with-rerank {
  inputs {
    query(q_float) tensor(x[2048])
    query(q_bin) tensor(x[256])
  }

  function binary_closeness() {
    expression: 1 - (distance(field, embedding_binary) / 2048)
  }

  function float_closeness() {
    expression: reduce(query(q_float) * attribute(embedding_float), sum, x)
  }

  first-phase {
    expression: binary_closeness
  }

  second-phase {
    expression: float_closeness
    rerank-count: 2000
  }
}

Querying

Both query tensors are produced by the local voyage-4-nano embedder:

yql=select * from doc where {targetHits: 100}nearestNeighbor(embedding_binary, q_bin)
&ranking=binary-with-rerank
&input.query(q_bin)=embed(voyage-4-nano, "your query here")
&input.query(q_float)=embed(voyage-4-nano, "your query here")
&hits=10

The nearest neighbor search runs on the binary field for speed, while the rank profile handles two-phase scoring.

For a complete runnable example with pyvespa, see the Voyage AI embeddings notebook.

Wrapping up

Asymmetric retrieval makes the most sense when:

High QPS — The cost savings scale linearly. At 10K QPS, you’re saving $15.5K/month. At 100K QPS, it’s $155K.
Large corpus — Documents are embedded once, so the large model cost is amortized. The bigger the corpus, the more you benefit from cheap queries.
Latency-sensitive — Local inference eliminates network round-trips.

When a single model is the better choice:

Low volume and latency-tolerant — At 10 QPS, the API cost is ~$15/month and the network round-trip doesn’t matter. One model is simpler to operate.
Quality above all else — Using voyage-4-large for both documents and queries gives you the best possible retrieval quality. If you can afford the API cost and latency, symmetric with the top model is hard to beat.

The Voyage 4 family and Vespa’s native integration make asymmetric retrieval practical for the first time. Embed documents with the best model available, query with a tiny local model, and let phased ranking close the quality gap.

Resources:

Voyage AI embeddings notebook — Full runnable example
Embedding documentation — Configuring embedders in Vespa
Binary quantization guide — Deep dive on binarization
Phased ranking — Multi-phase ranking architecture
Voyage 4 announcement — Model family details and benchmarks

For those interested in learning more about Vespa, join the Vespa community on Slack to exchange ideas, seek assistance from the community, or stay in the loop on the latest Vespa developments.

How Metal AI Built an Agent-Driven Intelligence Platform on Vespa Cloud

Tue, 10 Mar 2026 00:00:00 +0000

“95% of our retrieval is done by AI agents.” - Sergio Prada, Co-Founder & CTO, Metal

Metal needed a retrieval foundation that could evolve as fast as their product, without hitting a wall.

Introduction

Private equity firms manage vast amounts of unstructured data, including deal documents, expert call transcripts, financial statements, CRM records, and more. The challenge isn’t simply accessing this information. It’s connecting and understanding it, in context, across the investment lifecycle.

Metal AI was built to address this challenge. Its purpose-built institutional intelligence platform, used by established private equity firms transforms fragmented historical and live deal data into a living system of record that drives conviction at every stage of the investment lifecycle.

To deliver this vision at scale, Metal leverages Vespa.ai as its core retrieval layer, powering entity relationships, advanced ranking, and real-time context-aware retrieval across complex investment data.

The Need for Relationship-Driven Retrieval

As Metal’s product evolved, the limitations of traditional retrieval systems became clear.

Early architecture supported basic document search, but private equity workflows aren’t document-centric. They are entity- and relationship-driven. The enduring edge in private equity lies in drawing on decades of deal history, portfolio outcomes, and institutional knowledge. When that depth of experience surfaces reasoning and connections across time, every investment decision carries greater conviction.

Most traditional vector stores and search engines are fundamentally document-first. They index text, return similar passages, and rely primarily on semantic similarity or keyword matching. But for Metal’s use case, relevance requires more:

Understanding which answer is the most recent and legally approved
Identifying which company a metric belongs to
Connecting meetings to prior diligence activity
Applying business logic alongside semantic similarity

As Metal introduced more advanced workflows, like DDQ automation and agent-driven retrieval, the gap widened. Traditional systems struggle to:

Combine semantic similarity with recency and compliance rules within ranking
Support evolving data models without significant rework
Query across multiple object types in a unified way
Serve as a foundation for structured, iterative queries issued by AI agents

Layering custom logic on top of limited retrieval infrastructure would have created increasing technical debt, and each new entity type or ranking rule risked architectural compromise.

Metal needed a retrieval foundation that could evolve with the product, not constrain it.

Choosing a Retrieval Layer without Limits

Metal wasn’t simply selecting a search engine. They were selecting a long-term retrieval architecture.

Several capabilities distinguished Vespa:

Multi-entity modeling: Vespa supports multiple object types, like documents, people, activities, and financial data, as well as the relationships between them. This aligned with how Metal structures institutional knowledge.
Advanced ranking and filtering: Vespa can combine semantic similarity with structured filters like recency and business rules, enabling Metal to tailor retrieval to specific workflows.
Flexibility without re-architecture: New object types can be introduced without migrating existing data or rebuilding the system.
Operational simplicity: Moving to Vespa Cloud enabled the team to focus engineering capacity on product innovation instead of infrastructure.

These capabilities give Metal the ability to shape retrieval around business logic, rather than forcing business logic to adapt to infrastructure limitations.

“Our competitors focus on documents. With Vespa, we can focus on the full picture: companies, people, activities, and how they relate.” - Sergio Prada, Co-Founder & CTO, Metal

Architecture in Action

Metal treats retrieval as part of an AI agent orchestration layer, not just a standard search box.

When a user or agent asks a question like, “What’s this company’s EBITDA?”, the query is first interpreted by an AI agent. Rather than issuing a single plain-text search, the agent:

Determines which entity types to query (documents, companies, metrics, activities)
Applies structured parameters such as recency or workflow-specific filters
Executes retrieval against Vespa
Iterates as needed (paginating, refining, or querying related entities)
Assembles sufficient context before generating a response

Vespa powers this retrieval layer, enabling fast, structured queries across different object types and supporting the iterative retrieval process required by Metal’s agent-driven architecture.

Turning DDQ Chaos into Structured, Approved Intelligence

One clear example is Metal’s Due Diligence Questionnaire (DDQ) workflow. Private equity firms must respond to thousands of LP questionnaires using pre-approved answers. These responses cannot be freely generated by an LLM. They must come from content that has already been reviewed and approved by legal teams.

Answer banks change over time and are stored in unstructured formats like documents and spreadsheets. Metal indexes this data into Vespa, making the system aware of which documents are most recent. When answering a questionnaire, retrieval is prioritized not only by semantic similarity to the question but also by freshness.

This allows Metal to surface the most relevant and up-to-date approved answers, efficiently and reliably within its platform.

Scaling without Infrastructure Headaches

By building on Vespa Cloud, Metal achieved:

Improved feature velocity: The team can introduce new entity types and workflows quickly without architectural rework
Greater engineering focus: The team spends less time managing infrastructure and more time building differentiating product features
Scalable retrieval architecture: Metal can onboard new clients and data volumes without redesigning retrieval.
Confidence in long-term flexibility: Vespa is not a limiting factor as Metal expands into more advanced agent-driven workflows.

“Managing infrastructure can be a distraction. Vespa Cloud lets us focus on product.” - Sergio Prada, Co-Founder & CTO, Metal

Looking Forward: Build for an Agentic Future

Metal’s roadmap is deeply agentic. AI agents drive most interactions, deciding how best to query the platform and construct the context needed to answer sophisticated questions.

Because Vespa supports flexible, multi-entity retrieval with advanced ranking and real-time performance, Metal can:

Expand into more advanced analysis workflows
Build deeper relational structures between entities
Adapt retrieval strategies dynamically as business logic evolves

The result is an institutional intelligence platform that scales in both data volume and intelligence, evolving alongside the firm it serves.

“When you’re building something ambitious, you don’t want to hit a capability wall. Vespa gives us confidence that we won’t.” - Sergio Prada, Co-Founder & CTO, Metal

Build a High-Quality RAG App on Vespa Cloud in 15 Minutes

Mon, 02 Mar 2026 00:00:00 +0000

Retrieval-Augmented Generation (RAG) allows an LLM to answer questions using your data at query time. On their own, LLMs are powerful but limited: they can hallucinate, they have a fixed knowledge cutoff, and they know nothing about your private documents, internal wikis, or proprietary systems.

RAG bridges that gap by retrieving relevant information from your data and supplying it to the model as context, so responses are grounded in real, trusted sources rather than guesswork.

The Challenge: The Quality of the Context Window

In Retrieval-Augmented Generation (RAG), the real bottleneck is the LLM’s context window. You can’t simply pass your entire dataset into a prompt—there’s a strict token budget.

Because of this, the problem isn’t just retrieving information, but retrieving the right information. When the context window is filled with loosely matched or low-quality results, the LLM has little to work with and the quality of its answers drops accordingly.

High-quality RAG depends on semantic understanding, precise retrieval, and strong ranking across diverse data types so that every token in the context window earns its place.

The Solution: Out-of-the-Box RAG on Vespa Cloud

Vespa Cloud provides an out-of-the-box Vespa RAG Blueprint designed to maximize the quality of the context sent to the LLM. Instead of relying solely on nearest-neighbor vector search, Vespa combines semantic vector retrieval with lexical BM25 scoring and applies advanced ranking, using models such as BERT, LightGBM, or custom logic—to ensure that only the strongest candidates are selected.

This hybrid retrieval and ranking approach consistently surfaces the most relevant document chunks, which significantly improves the quality of the final generated answer.

In this blog post, we’ll build a complete Retrieval-Augmented Generation (RAG) application from end to end by leveraging the OOTB Vespa RAG app on Vespa cloud. The following diagram shows the architecture we’ll be working with:

The architecture consists of two main flows: data ingestion and query processing.

Data Ingestion (one-time setup)

First, we ingest our data sources, such as documents, PDFs, or web pages by using a Python-based pipeline. The pipeline processes the data, splits it into manageable chunks, generates embeddings, and feeds everything into a Vespa Cloud RAG application that is preconfigured with a schema and ranking profiles. This step populates the search index.

Query Flow (live interaction)

A user enters a question in the Vespa RAG UI.
The UI sends the query to a Python backend, which issues a hybrid search request (combining keyword and vector retrieval) to Vespa Cloud.
Vespa Cloud returns the most relevant document chunks.
The backend sends those chunks, along with the original query, to an LLM as context.
The model generates an answer grounded in that context and returns it to the backend.
The backend streams the answer back to the UI.

This architecture ensures that generated responses are grounded in your own data, combining Vespa’s retrieval and ranking strengths with the generative capabilities of large language models.

The end-to-end setup takes about 15 minutes, plus additional time to process your documents.

Deploy Vespa RAG Blueprint to Vespa Cloud

We’ll start by deploying a preconfigured RAG Blueprint to Vespa Cloud. This gives you a high-quality retrieval stack in minutes, and it’s free to get started. All of this is done directly from the Vespa Cloud console.

Sign up for Vespa Cloud

Go to the Vespa Cloud Console and create an account. If this is your first time using Vespa Cloud, the free trial is the fastest way to get going.

Deploy RAG Blueprint

In the console, select “Deploy your first application”.

Choose “Select a sample application to deploy directly from the browser”.

Select “RAG Blueprint”.

Click “Deploy” and wait for the deployment to complete.

Save your credentials

Once deployment finishes, the console will generate an access token. Save this immediately.

That token is how Python backend authenticates with Vespa Cloud. Treat it like a password.

Continue through the remaining setup screens, then open the application view. Note your endpoint URL

In the application view you will also find the endpoint URL. Save both the endpoint URL and the token; you will need them to configure Python backend in the next section.

You can download the Vespa application package by clicking the download icon if you’d like. From there, you can start building your data feeding pipeline, frontend service UI, and more. However, this blog provides a sample end-to-end RAG application, and the same Vespa application package is included, so there’s no need to download it separately.

Behind the Scenes: What You Just Deployed

When you clicked Deploy, Vespa Cloud automatically provisioned infrastructure and deployed a complete Vespa application package. This package includes everything needed for a high-quality RAG system: schemas, indexing logic, ranking profiles, and service configuration.

In other words, you didn’t just spin up a demo, you launched a ready-to-use, high-quality retrieval engine.

Let’s take a closer look at what’s inside.

The Schema

The RAG Blueprint uses a carefully designed schema that controls how documents are stored, chunked, embedded, and retrieved:

vespa_cloud/schemas/doc.sd:

schema doc {
    document doc {
        field id type string {
            indexing: summary | attribute
        }
        field title type string {
            indexing: index | summary
            index: enable-bm25
        }
        field text type string {
        }

        # Optional metadata fields for tracking document usage
        field created_timestamp type long {
            indexing: attribute | summary
        }
        field modified_timestamp type long {
            indexing: attribute | summary
        }
        field last_opened_timestamp type long {
            indexing: attribute | summary
        }
        field open_count type int {
            indexing: attribute | summary
        }
        field favorite type bool {
            indexing: attribute | summary
        }
    }

    # Binary quantized embeddings for the title (768 floats → 96 int8)
    field title_embedding type tensor<int8>(x[96]) {
        indexing: input title | embed | pack_bits | attribute | index
        attribute {
            distance-metric: hamming
        }
    }

    # Automatically chunks text into 1024-character segments
    field chunks type array<string> {
        indexing: input text | chunk fixed-length 1024 | summary | index
        index: enable-bm25
    }

    # Binary quantized embeddings for each chunk
    field chunk_embeddings type tensor<int8>(chunk{}, x[96]) {
        indexing: input text | chunk fixed-length 1024 | embed | pack_bits | attribute | index
        attribute {
            distance-metric: hamming
        }
    }

    fieldset default {
        fields: title, chunks
    }

    document-summary top_3_chunks {
        from-disk
        summary chunks_top3 {
            source: chunks
            select-elements-by: top_3_chunk_sim_scores
        }
    }
}

What’s happening here: Your documents store their raw content in title and text. During indexing, the text field automatically split into 1024-character chunks. Embeddings are generated for both titles and chunks, then binary-quantized using pack_bits, shrinking 768 floating-point values down to just 96 int8s. This dramatically reduces storage and improves performance while still supporting efficient vector similarity search.

At the same time, BM25 is enabled for lexical matching. This combination is what enables Vespa’s hybrid retrieval: semantic matching plus exact term relevance.

Out-of-the-Box Query Profiles:

The RAG Blueprint ships with four query profiles optimized for NyRAG’s client-side RAG architecture:

NyRAG Architecture:

User Query → NyRAG (generates search queries)
          → Vespa (retrieval + ranking)
          → NyRAG (generates final answer)

Query profiles control only the Vespa retrieval/ranking step. NyRAG handles all LLM interactions.

The 4 Profiles:

hybrid (default, fast)
- Retrieval: BM25 + Vector search with targetHits:100
- Ranking: Learned linear model (logistic regression)
- Best for: Everyday queries where you want fast, solid results
hybrid-with-gbdt (highest quality)
- Retrieval: Same as hybrid (BM25 + Vector, 100 targets)
- Ranking: Two-phase with LightGBM (GBDT) second-phase
- Best for: Complex queries where relevance matters most (~2-3x slower)
deepresearch (exhaustive search)
- Retrieval: BM25 + Vector with targetHits:10000 (100x more!)
- Ranking: Learned linear model
- Best for: Research scenarios needing maximum recall
deepresearch-with-gbdt (exhaustive + best quality)
- Retrieval: Deep search (10k targets)
- Ranking: Two-phase with GBDT
- Best for: When you need both maximum recall and best ranking

For Advanced Users: Query profiles bundle complete search configurations including YQL structure (with nearestNeighbor operators), ranking profiles, and all required parameters (like learned coefficients). The Vespa application also includes rag and rag-with-gbdt profiles with searchChain=openai for server-side RAG (direct API usage), but these conflict with NyRAG’s client-side architecture and aren’t included. Learn more in the technical guide.

Which profile should you use?

Start with hybrid for everyday use - fast and accurate
Switch to hybrid-with-gbdt when quality matters most (harder queries)
Use deepresearch when you need to find everything relevant (research mode)
Try deepresearch-with-gbdt for maximum recall + quality (slowest but most thorough)

Now that your RAG Blueprint Vespa Cloud application is up and running, it’s time to add the missing pieces: a simple frontend UI and a data ingestion pipeline. For this, we’ll use NyRAG, a tool included in the RAG-app-in-15min-ragblueprint repository.

NyRAG acts as the glue for the entire RAG workflow. It reads documents from local files or websites, splits text into manageable chunks, generates embeddings, feeds everything into Vespa, and finally exposes a lightweight chat UI where you can ask questions over your data. Instead of wiring all of this together yourself, NyRAG gives you a working end-to-end system out of the box.

Install NyRAG

# Clone the repository
git clone https://github.com/vespaai-playground/RAG-app-in-15min-ragblueprint.git
cd RAG-app-in-15min-ragblueprint

# Install uv (Fast, modern Python package manager)
# macOS
brew install uv

# Linux & macOS
# curl -LsSf https://astral.sh/uv/install.sh | sh
# Windows (PowerShell)
# powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

# Verify uv installation
uv --version

# Install dependencies using uv
uv sync
source .venv/bin/activate

# Windows (PowerShell)
# powershell -ExecutionPolicy Bypass
# . .\.venv\Scripts\activate

# Install nyrag locally
uv pip install -e .

# Verify nyrag installation
nyrag --help

Get an LLM API key

To generate final answers, NyRAG needs an OpenAI-compatible API key. The simplest way to get started is OpenRouter, which provides access to multiple LLMs through a single API.

In this walkthrough, we’ll use OpenRouter for convenience. In a real application, you’re free to swap in any compatible LLM provider. To continue, sign up for OpenRouter and generate an API key. You’ll use it in the next step when configuring NyRAG.

Start the NyRAG UI

# This script handles all configuration automatically
./run_nyrag.sh

# Windows (PowerShell)
# powershell -ExecutionPolicy Bypass
# .\run_nyrag.ps1

The run_nyrag.sh script starts the UI and wires up the configuration so NyRAG can talk to Vespa Cloud. In practice, it loads your project config, uses the token you provide for authentication, and starts the web UI on port 8000.

Open http://localhost:8000 in your browser.

Configure your project: Now you’ll configure your project using the web UI to connect to your Vespa Cloud deployment and set up document processing.

Step 1: Select and edit the example project

In the top header, the project dropdown shows “doc_example”. If you are starting from the example config, it is usually pre-selected. The configuration editor typically opens automatically; if it does not (for example you land directly in chat), open the three-dot menu (⋮) and choose “Edit Config”.

Description: Shows the project dropdown menu in the header with “doc_example” option

Note: If the configuration editor doesn’t appear (shows chat interface instead), click the three-dot menu (⋮) in the top right corner and select “Edit Config” to open it manually.

Step 2: Update your credentials

In the configuration editor, paste in the information you saved from Vespa Cloud and your LLM provider. You only need three things to get going: your Vespa tenant name, your Vespa endpoint + token, and your LLM API key.

Required fields to update:

# Your Vespa Cloud credentials (from Vespa Cloud Console)
cloud_tenant: your-tenant          # Your Vespa Cloud tenant name
vespa_cloud:
  endpoint: https://your-app.vespa-cloud.com  # Your Vespa token endpoint (not mtls)
  token: vespa_cloud_YOUR_TOKEN_HERE          # Your Vespa data plane token

# Your LLM configuration (default: OpenRouter)
llm_config:
  api_key: sk-or-v1-YOUR_KEY_HERE   # Your OpenRouter API key (or other provider)

Notes:

The default LLM provider is OpenRouter. If you switch providers, also update base_url and model to match. For the included example documents, start_loc defaults to ./dataset, so you can run the pipeline without changing anything else.

Step 3: Save and start processing

After updating the configuration, you can close the editor (changes are saved automatically) and start indexing. If you are using the example dataset, keep ./dataset as-is; otherwise, point start_loc at the folder (or site) you want to ingest. When you click “Start Indexing”, NyRAG reads your input, chunks it into 1024-character segments, generates embeddings, feeds everything to Vespa Cloud, and shows progress in the terminal panel so you can see exactly what is happening.

Description: Shows documents being processed with terminal logs displaying progress

Chat with Your Data

You can now start asking questions in the chat UI.

When you submit a query, NyRAG expands it into focused retrieval queries and sends them to Vespa. Vespa runs hybrid retrieval, combining BM25 keyword matching with vector similarity, and returns the most relevant chunks. Those chunks are packed into a compact context window and sent to the LLM, which generates an answer grounded entirely in your data.

A good way to sanity-check the setup is to start with a broad question like “What are the main topics in these documents?” and then follow up with something more specific to confirm the retrieved context makes sense.

At this point, you have a fully functional RAG application running on Vespa Cloud.

Improving Search Quality with Query Profiles

Want better search results? You can fine-tune how Vespa retrieves and ranks your documents using the Settings modal (⚙️ icon in the top right).

Change query profiles: Open the ⚙️ Settings panel, choose a Query Profile from the dropdown, and click “Save”. The very next query you run will use the new profile.

Description: Settings modal showing query profile selection dropdown with 4 available options

What each profile does:

hybrid: Fast hybrid search (BM25 + vector) with linear ranking
hybrid-with-gbdt: Same retrieval + advanced GBDT ranking (slower but best quality)
deepresearch: Exhaustive search with 10,000 retrieval targets (maximum recall)
deepresearch-with-gbdt: Exhaustive search + GBDT ranking (slowest, most thorough)

Pro tip: The quality difference between hybrid and hybrid-with-gbdt can be dramatic for complex queries. The GBDT model offers significantly better relevance at the cost of 2-3x higher latency. For research tasks where you need to find everything relevant, try deepresearch variants which cast a much wider net!

Manage Your Data

NyRAG also gives you simple tools for cleanup. Open the advanced menu (three-dot icon ⋮ in the top right) and you will find two cleanup actions. Clear Local Cache removes cached files for all projects on your machine, which is useful when you want to re-process from scratch locally. Clear Vespa Data deletes the indexed documents in Vespa for the project, which is useful when you want a clean index before re-feeding. Both actions ask for confirmation so you do not delete data by accident.

Bonus: Try Web Crawling Mode

In addition to local documents, NyRAG supports web crawling. By switching to the web_example project, you can point NyRAG at a website and have it crawl, extract, and index content automatically.

Switch to web crawling mode: Select web_example (web) from the dropdown at the top and open the configuration editor. If you are currently on the chat screen, open the three-dot menu (⋮) and choose “Edit Config” to bring the editor back. From there, update the same credential fields as you did for doc_example, then click “Start Indexing” to crawl and feed the site.

Description: Shows web crawling in progress with terminal logs displaying discovered URLs and processed pages

Web Mode Features: Web mode discovers and follows links automatically, while still respecting robots.txt and crawl delays so you do not hammer a site. It also does smart content extraction to drop navigation and boilerplate, deduplicates very similar pages, and supports resume so you can continue a crawl after interruption.

Example Use Cases: Web mode is a good fit for product documentation, knowledge bases, blog archives, help-center content, and technical wikis. In general, it works best on sites with consistent HTML structure and clean, text-heavy pages.

Tips: Start small. Crawl a limited part of a site first so you can sanity-check what gets extracted and indexed, then expand. Use exclude patterns to skip sections you do not want (for example /pricing or /sales/*), and keep an eye on the terminal output panel so you can spot loops, unexpected URLs, or pages that fail to parse.

Troubleshooting

Running into issues? We’ve got you covered! For detailed troubleshooting guides covering Vespa connection errors, LLM configuration, document processing, and more, see the Troubleshooting section in the main README.

Quick help: If you get stuck, the fastest path is usually to ask in the Vespa Slack community, where people can help you interpret logs and query behavior. If you think you found a bug or want to request an improvement, open an issue in GitHub Issues. And when you want deeper background on schema, ranking, and deployment, the Vespa Docs are your go-to reference.

Conclusion

Congratulations! You now have a working RAG app: a Vespa Cloud deployment that can retrieve high-quality context, and a small UI that lets you ingest data and chat with it.

Building a high-quality RAG system is never trivial. There are multiple moving parts to get right: the quality of the LLM, the size and management of its context window, and how effectively your retrieval system surfaces the most relevant information.

Thanks to the out-of-the-box Vespa RAG blueprint on Vespa Cloud, much of this complexity is handled for you. It comes with multiple ranking profiles, and its default hybrid retrieval setup combines vector similarity with BM25 text matching, ensuring your LLM sees the best possible context for every query.

Vespa Cloud doesn’t just make building RAG easier, it makes it scalable, fast, and reliable, giving you production-ready infrastructure, auto-scaling and observability without the headaches of self-hosting. Whether you’re experimenting with small datasets or scaling to millions of documents, Vespa Cloud provides the tools and flexibility to make your RAG project shine.

Want to dive deeper? Start with the RAG Blueprint Tutorial for a thorough conceptual walkthrough. And remember the Vespa Slack community is always there to help. Ask questions, share what you’ve built, or get advice on retrieval, ranking, and deployment strategies.

Ready to experience the power of Vespa Cloud for yourself? Sign up today and start building high-quality RAG applications with ease!

Vespa Newsletter, February 2026

Mon, 16 Feb 2026 00:00:00 +0000

Welcome to the latest edition of the Vespa newsletter. In the previous update, we introduced several new features and improvements, including Automated ANN Tuning, Accelerated Exact Vector Distance with Google Highway, Precise Chunk-Level Matching for Higher Retrieval Quality, Quantile Computation in Grouping for Instant Distribution Insights, and more.

Let’s dive into what’s new.

Product updates

Announcing the Vespa.ai Playground
The Vespa Kubernetes Operator
Faster result rendering with CBOR
Pyvespa 1.0 with improved HTTP performance
Hybrid search relevance evaluation tool
Configurable linguistics per field
“switch” operator in ranking expressions
Vespa is now available on GCP Marketplace
Feed data and run queries in the Vespa Console

Announcing the Vespa.ai Playground

The Vespa Playground is a new GitHub space where we share projects, tools, and demos built on the Vespa platform. It’s a practical place to explore real examples for embeddings, model training, and feed connectors that you can clone, run, and build on your own.

These repos are ideal for experimentation, learning, and inspiration, though they aren’t officially supported product releases.

Explore the Playground

The Vespa Kubernetes Operator

The safest, most robust and cost effective way to run Vespa is to deploy on Vespa Cloud, but for various reasons that’s not an option for everybody. For those who want to run Vespa securely at scale but can’t use Vespa Cloud we have now released the Vespa Kubernetes Operator. This brings many of the Vespa Cloud features such as security out of the box, dynamic provisioning, autoscaling and automated upgrades to your own Kubernetes environments.

Read more in the Kubernetes Operator documentation.

Faster result rendering with CBOR

Query result sets can be large, and increasingly so when the client is an LLM retrieving many chunks for model context. Layered ranking is designed to address this by extracting the most relevant content. Still, in some cases, the total latency is dominated by the time it takes to send the query response. Compressing with gzip can help, but is also CPU-intensive and slow.. From Vespa 8.623.5, json response generation is over twice as fast as before.

Another new option in this release is to use the CBOR format for query results. CBOR is a binary format so it can be serialized faster and produces smaller payloads, especially when the result contains lots of numeric data. Read more in the Query API reference and query performance guide.

Pyvespa 1.0 with improved HTTP performance

We have released the first major version of Pyvespa! This release switches the HTTP-client used by Pyvespa, from httpx to httpr, which gives big performance gains, especially for serializing and deserializing tensors, largely by taking advantage of the new CBOR serialization support in Vespa.

On preliminary benchmarks, we compared end-to-end latency for:

Vespa 8.591.16 + Pyvespa v0.63.0 (using JSON)
Vespa 8.634.24 + Pyvespa v1.0.0 (using CBOR)

The latter was ~4.9x faster when returning 400 hits with a 768-dim vector each. Performance gains will be smaller when not returning large result sets with tensors, but still significant. You may encounter different exceptions than before, but we strived to not change any user-facing API’s even if we bumped the major version.

Go to Pyvespa

Hybrid search relevance evaluation tool

Hybrid search combines lexical and embedding based search to get the best from both. One of the tasks you need to solve is to pick an embedding model that provides a good quality vs. cost tradeoff for your use case. We have done a systematic evaluation of modern alternatives in this blog.

The code used to run these experiments is now merged into Pyvespa. You can use the VespaMTEBApp to evaluate embedding model performance on any task/benchmark compatible with the mteb-library. See example usage from the tests.

Configurable linguistics per field

Vespa now lets you specify linguistics profiles on fields to select some specific linguistics processing in your Linguistics module. In Lucene Linguistics, linguistics profiles map to analyzer configuration, optionally in combination with a specific language.

For example, you can define a Lucene analyzer like this in services.xml:

  
      whitespace

        lowercase

And use it in the schema, under any field’s definition, like this:

field title type string {

  indexing: summary | index

  linguistics {

      profile: whitespaceLowercase

  }

}

By default the linguistics profile will be applied both when processing the text of the field and the text searching it, but you can also specify a different linguistics profile on the query side, which is useful for e.g. doing synonym query expansion.

We’ve added a sample application demonstrating how to use multiple Lucene linguistics profiles across multiple fields and updated the Vespa linguistics documentation with usage examples.

New “switch” operator in ranking expressions

We have added a “switch” function in ranking expressions as a clearer, more maintainable alternative to deeply nested if() clauses, making complex ranking easier to read, debug, and evolve.

switch (attribute(category)) {

    case "restaurant": myRestaurantFunction(),

    case "hotel": myHotelFunction(),

    default: myDefaultFunction()

}

Learn more

Vespa is now available on GCP Marketplace

Vespa Cloud is now listed on the GCP Marketplace, making it easier to deploy and manage Vespa using native Google Cloud billing and procurement. Vespa Cloud is already available on AWS Marketplace.

See details

Feed data and run queries in the Vespa Console

The onboarding experience is now even smoother for new Vespa Cloud users. When you follow the getting started guide and deploy a sample app from the browser, you can immediately feed data and run queries directly in the browser. This makes it easy to try your own data and see how it behaves in Vespa.

We also provide examples showing how to do the same using pyvespa, the Vespa CLI, or curl.

Try it Free

New content and learning resources

We published several new articles and resources since our last newsletter to help teams get more out of Vespa and stay ahead of new developments in search, RAG, and large-scale AI.

Examples and notebooks:

playground.vespa.ai

Videos, webinars, and podcasts

Blogs and ebooks

Upcoming Events

Lightning Lesson: Personalized Relevance with VLMs and Sparse Vectors: February 17, 11:30am ET

Intro to sparse vectors and tensors for efficient data handling
Using Vision-Language Models (VLMs) to extract high quality and nuanced features from images
Leveraging these features in sparse representations for hyper-personalized search & recommendations

February 18: The Zero Results Problem in eCommerce

🔗 10am CET (EMEA): Save your spot
🔗 1pm ET (Americas): Save your spot

March 11: The Relevance Problem in eCommerce

🔗 10am CET (EMEA): Save your spot
🔗 1pm ET (Americas): Save your spot

March 10: Vespa Q1 Product Update

🔗 10am CET (EMEA): Save your spot
🔗 1pm ET (Americas): Save your spot

👉 Follow us on LinkedIn to stay in the loop on upcoming events, blog posts, and announcements.

Thanks for joining us in exploring the frontiers of AI with Vespa. Ready to take your projects to the next level? Deploy your application for free on Vespa Cloud today.

Nexla + Vespa, The Power Duo for AI-Ready Data Pipelines

Mon, 16 Feb 2026 00:00:00 +0000

Partner Spotlight: Nexla

AI is transforming quickly. What started with Q&A chatbots has already evolved into deep research applications and, now, autonomous AI agents. Vespa is proud to be at the center of this shift, enabling some of the most proficient adopters of AI, such as Perplexity. To help organizations maximize the benefits of Vespa, we’re building a robust partner ecosystem. These partners help bring Vespa’s AI-native capabilities into real-world deployments across industries.

Meet the innovators shaping the future of AI. Today’s spotlight: Nexla

Nexla + Vespa.ai: The Power Duo for AI-Ready Data Pipelines

When AI systems fall short, it’s rarely the model’s fault. It’s the messy reality of data spread across systems and never quite staying in sync. That’s why Nexla and Vespa partnered together.

Nexla makes data usable.

Vespa makes data intelligent at scale.

Together, they turn messy, distributed enterprise data into real-time AI search, recommendation, and RAG systems, without months of custom code gluing things together.

Nexla: Making Enterprise Data Usable

Nexla is an enterprise-grade, AI-powered data integration platform that turns raw data from any source into production-ready data products. It provides a declarative, no-code way to move, transform, and validate data across ETL/ELT, reverse ETL, streaming, APIs, and RAG pipelines.

Think of Nexla as the layer that answers: “How do we reliably get the right data, in the right shape, to the systems that need it?

Core capabilities:

500+ Bidirectional Connectors: Pull data from databases, APIs, cloud storage, SaaS apps, and data warehouses, including systems like Salesforce, Snowflake, and Amazon S3.
Metadata Intelligence: Nexla automatically scans sources and generates Nexsets, virtual, ready-to-use data products with schemas, samples, and validation rules. Example: If a price field suddenly switches from numeric to string, Nexla detects it before bad data reaches production search.
Express (conversational pipelines): A conversational AI interface where you can simply describe what you need. Example: You can say, “Pull customer data from Salesforce and merge with Google Analytics,” and it builds the pipeline for you.
Universal integration styles: Supports ELT, ETL, CDC, R-ETL, streaming, API integration, and FTP in a single platform.

Nexla processes over 1 trillion records monthly for companies like DoorDash, LinkedIn, Carrier, and LiveRamp.

Vespa: Where Retrieval Becomes Reasoning

Vespa is a production-grade AI search platform that combines a distributed text search, vector search, structured filtering, and machine-learned ranking in a single system.

Think of Vespa as the engine that answers: “Given all this data, how do we retrieve, rank, and reason over it in real time?”

It powers demanding applications like Perplexity and supports search, recommendations, personalization, and RAG at massive scale.

Core capabilities:

Unified AI Search and Retrieval: Vespa natively combines vector and tensor search for semantic retrieval, full-text search for precise keyword matching, and structured filtering on attributes like categories, prices, and dates to enable richer, contextual search without stitching multiple systems together.
Real-time Retrieval and Inference at Scale: Rather than separating indexing, ranking, and inference across multiple systems, Vespa performs real-time machine-learned ranking and model inference where the data lives. This means you can serve fresh, personalized results with predictable sub-100 ms latency even for large datasets.
Multi-Phase Ranking and Custom Logic: Vespa lets you embed custom ranking logic, including ML models like XGBoost, directly into your search pipeline using ONNX. You can combine relevance signals, business rules, and semantic vectors in multi-stage ranking to fine-tune which results surface first.
Massive Scalability with High Throughput: Designed for real-world, high-traffic applications, Vespa can scale horizontally across clusters, handling billions of documents with sub-100ms query latency and up to 100k writes per second per node.
Multi-Vector and Multi-Modal Retrieval: Vespa natively handles multiple vectors per document, with support for token-level embeddings, ColPali-based visual document retrieval, and tensor-based computations for precise, cross-modal relevance and ranking.

GigaOm recognized Vespa as a leader in vector databases for two consecutive years, noting its performance advantages over alternatives like Elasticsearch, up to 12.9X higher throughput per CPU core for vector searches.

How Nexla and Vespa Work Together

The Nexla-Vespa partnership removes one of the hardest parts of AI systems: getting clean, well-modeled data into a high-performance retrieval engine, continuously.

Nexla recently launched a Vespa connector that makes data integration with Vespa seamless. The integration includes:

Vespa Connector in Nexla: Handles all data piping from sources like Amazon S3, PostgreSQL, Pinecone, Snowflake, and others directly into Vespa:

Vespa Nexla Plugin CLI: Automatically generates draft Vespa application packages (including schema files) directly from a Nexset, eliminating manual configuration:

This means you can move data from S3 to Vespa, migrate from Pinecone to Vespa, or sync PostgreSQL to Vespa, all without writing a single line of code.

When Nexla Clients Should Use Vespa

You’re a Nexla client. Use Vespa when you need:

Advanced AI search and RAG applications: If you’re building intelligent search, recommendation systems, or RAG applications that require hybrid search (combining semantic vector search with keyword matching and metadata filtering), Vespa is purpose-built for this. Nexla gets your data into Vespa, while Vespa delivers production-grade AI search with machine-learned ranking.

Real-time, high-scale query performance: When you need to serve thousands of queries per second across billions of documents with sub-100ms latency, Vespa’s distributed architecture scales horizontally without compromising quality. Nexla ensures your data flows continuously into Vespa with incremental updates and CDC support.

Complex ranking and inference: If your use case requires multi-phase ranking, custom ML models, or LLM integration at query time, Vespa executes these operations locally where data lives, avoiding costly data movement. Nexla prepares and transforms your data into the exact schema Vespa needs.

Cost efficiency at scale: Vespa delivers 5X infrastructure cost savings compared to alternatives like Elasticsearch while handling vector, lexical, and hybrid queries. Nexla minimizes integration costs by automating pipeline creation and schema management.

When Vespa Clients Should Use Nexla

You’re a Vespa client. Use Nexla when you need:

Multi-source data consolidation: Vespa is your search and inference engine, but data lives everywhere, S3 buckets, PostgreSQL databases, Snowflake warehouses, Salesforce CRMs, APIs, and files. Nexla connects to 500+ sources with bidirectional connectors and consolidates data into Vespa without custom ETL scripts.

Automated schema generation and management: Instead of manually writing Vespa schema files and managing schema evolution, Nexla’s Plugin CLI auto-generates schemas from your Nexsets. As source schemas change, Nexla’s metadata intelligence detects changes and propagates them downstream automatically.

Data transformation and enrichment: Before data hits Vespa, it often needs cleaning, filtering, enrichment, or format conversion. Nexla provides a no-code transformation library and supports custom SQL, Python, or JavaScript, all without maintaining separate ETL infrastructure.

Vector database migration: Moving from Pinecone, Weaviate, or another vector database to Vespa? Nexla handles the migration with zero code, extracting records, transforming data to match Vespa’s schema, and syncing documents continuously.

Data quality and monitoring: Nexla continuously monitors data flows with built-in validation rules, error handling, and automated alerts. When data quality issues arise, Nexla quarantines bad records and provides audit trails, ensuring Vespa always receives clean, trustworthy data.

Real-time and streaming pipelines: Vespa supports real-time updates, but getting real-time data from streaming sources (Kafka, APIs, databases with CDC) requires integration logic. Nexla handles streaming, batch, and hybrid integration styles, optimizing throughput and latency for each source type.

Conclusion

Nexla solves data readiness.

Vespa solves intelligence and precision at scale.

Together, they give teams a clean, practical path from raw enterprise data to real-time AI applications. Vespa gives you production-grade vector search, hybrid retrieval, and RAG capabilities at any scale. Nexla eliminates months of pipeline development and makes multi-source data flows conversational.

Ready to explore?

Start at express.dev for conversational pipeline building, or explore the Vespa connector in Nexla’s platform to see how quickly your data can power real AI applications.