<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Vespa Blog</title>
    <description>We Make AI Work</description>
    <link>https://blog.vespa.ai/</link>
    <atom:link href="https://blog.vespa.ai/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Tue, 09 Jun 2026 13:45:34 +0000</pubDate>
    <lastBuildDate>Tue, 09 Jun 2026 13:45:34 +0000</lastBuildDate>
    <generator>Jekyll v4.4.1</generator>
    
      <item>
        <title>Re-autoresearching MSMARCO BM25, on Vespa</title>
        <description>BM25 is having a moment. We reproduce Doug Turnbull&apos;s MSMARCO autoresearch experiment in Vespa and get a comparable MRR@10 lift from existing rank features — with twice the generalization to full MSMARCO.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-05-29-re-autoresearching-msmarco-bm25-on-vespa/lead-card-bg.png" />
        
        <content:encoded><![CDATA[<h2 id="bm25-is-having-a-moment">BM25 is having a moment</h2>

<p>Google search interest in “BM25” jumped about 5× in early August 2025. Around
the same time, OpenAI’s models started volunteering BM25 noticeably more —
gpt-4o named it on 12% of neutral retrieval prompts; gpt-4.1 and
gpt-5-chat on 30–35%.<sup id="fnref:bm25-probe"><a href="#fn:bm25-probe" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> The <a href="https://arxiv.org/abs/2508.21038">LIMIT paper</a>
showing dense embedding models flubbing trivial retrieval landed three
weeks later.<sup id="fnref:limit"><a href="#fn:limit" class="footnote" rel="footnote" role="doc-noteref">2</a></sup></p>

<p><img src="/assets/2026-05-29-re-autoresearching-msmarco-bm25-on-vespa/bm25-google-trends.png" alt="Google Trends: worldwide search interest in &quot;bm25&quot; over the past five years, roughly 5x in late 2025" /></p>

<p>Whatever the reasons for this spike, the renewed interest in lexical search is likely a good thing.
Lexical scoring remains a robust baseline, especially in
out-of-domain or zero-shot settings. BM25 is a great starting point, but
can we do better?</p>

<h2 id="building-on-bm25">Building on BM25</h2>

<p>Earlier this month, <a href="https://softwaredoug.com/">Doug Turnbull</a>
published a really neat
<a href="https://softwaredoug.com/blog/2026/05/17/autoresearching-a-better-msmarco-bm25">autoresearch experiment</a>:
let an LLM iterate on a Python BM25 reranker for 8 rounds and see how much
better it gets on the MSMARCO<sup id="fnref:msmarco"><a href="#fn:msmarco" class="footnote" rel="footnote" role="doc-noteref">3</a></sup> passage-ranking benchmark. He randomly selects a 650k-passage slice he
calls “minimarco”, and his agent is able to improve MRR@10 from <strong>0.4913 → 0.5350</strong>
(+0.044). Very cool!</p>

<p>Of course, we wanted to try it for ourselves too. Our Vespa twist on it: instead of letting an LLM write arbitrary code in <code class="language-plaintext highlighter-rouge">reranker.py</code>, can
we get a similar lift while limiting our search space to existing Vespa rank features?</p>

<p><img src="/assets/2026-05-29-re-autoresearching-msmarco-bm25-on-vespa/lead.png" alt="BM25 vs BM25 + three Vespa rank features on the full MSMARCO benchmark: MRR@10 lifts from 0.1901 to 0.2106, a 10.8% improvement" /></p>

<p>Spoiler: we can indeed get a significant improvement, and one that generalizes better to the full dataset too. Let’s show you exactly how we did it!</p>

<h2 id="reproducing-dougs-setup">Reproducing Doug’s setup</h2>

<p>His recipe to create the “minimarco” subset is two lines of pandas:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">collection</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="nf">read_csv</span><span class="p">(</span><span class="sh">"</span><span class="s">collection.tsv</span><span class="sh">"</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="sh">"</span><span class="se">\t</span><span class="sh">"</span><span class="p">,</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="sh">"</span><span class="s">doc_id</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">description</span><span class="sh">"</span><span class="p">])</span>
<span class="n">minimarco</span> <span class="o">=</span> <span class="n">collection</span><span class="p">.</span><span class="nf">sample</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">650_000</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">).</span><span class="nf">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>

<p>That trims the corpus from 8.84M MSMARCO passages down to 650k. Of the
6,980 dev queries, 543 still have a labeled relevant passage that landed in our
random subset (the rest are unscoreable here, since their answer isn’t in
the sample).</p>
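<p>A minimal sketch of that scoreability filter, assuming the qrels are available as <code class="language-plaintext highlighter-rouge">(query_id, doc_id)</code> pairs. The function name and the toy IDs are hypothetical and only illustrate the idea; they are not taken from Doug’s code or ours:</p>

```python
from collections import defaultdict


def scoreable_queries(qrels, sampled_doc_ids):
    """Keep only queries with at least one labeled relevant passage in the sample."""
    sampled = set(sampled_doc_ids)
    kept = defaultdict(set)
    for query_id, doc_id in qrels:
        if doc_id in sampled:
            kept[query_id].add(doc_id)
    return dict(kept)


# Toy data (hypothetical IDs): query 3's only relevant passage was not sampled,
# so it cannot be scored against this subset.
toy_qrels = [(1, 10), (1, 11), (2, 20), (3, 30)]
toy_sample = [10, 20, 99]
toy_scoreable = scoreable_queries(toy_qrels, toy_sample)
```

<p>Applied to the real dev qrels and the 650k sampled passage IDs, a filter like this is what leaves the 543 scoreable queries.</p>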

<p>We loaded the 650k passages into Vespa with a minimal schema, setting
the two BM25 hyperparameters to the Anserini-tuned MSMARCO defaults
commonly used on this benchmark.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rank-profile bm25 inherits default {
    first-phase {
        expression: bm25(description)
    }
    rank-properties {
        bm25(description).k1: 0.6      # Anserini-tuned MSMARCO defaults
        bm25(description).b: 0.62
    }
}
</code></pre></div></div>

<p>Running the 543 scoreable queries against this schema gives
<strong>MRR@10 = 0.4907</strong>. Doug’s number is <strong>0.4913</strong>. The gap is small enough
to be explained by, for example, differences in stemming: his <a href="https://github.com/softwaredoug/searcharray">SearchArray</a> setup uses Snowball, while
Vespa uses <a href="https://docs.vespa.ai/en/linguistics.html">OpenNLP English</a><sup id="fnref:stemming"><a href="#fn:stemming" class="footnote" rel="footnote" role="doc-noteref">4</a></sup>
to normalize words. So we’re starting from the same baseline. Now let’s see what we can add!</p>
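<p>For reference, MRR@10 as used throughout this post can be sketched as follows. The data layout (dicts keyed by query ID) and the function names are assumptions for illustration, not the evaluation harness itself:</p>

```python
def reciprocal_rank_at_10(ranked_doc_ids, relevant_doc_ids):
    """1/rank of the first relevant passage in the top 10, or 0.0 if none is there."""
    for rank, doc_id in enumerate(ranked_doc_ids[:10], start=1):
        if doc_id in relevant_doc_ids:
            return 1.0 / rank
    return 0.0


def mrr_at_10(results, qrels):
    """Average the per-query reciprocal ranks over all scoreable queries.

    results: query_id -> ranked list of doc IDs returned by the engine
    qrels:   query_id -> set of relevant doc IDs
    """
    scores = [
        reciprocal_rank_at_10(results.get(query_id, []), relevant)
        for query_id, relevant in qrels.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: hits at rank 2, rank 1, and a miss.
toy_results = {1: [5, 10], 2: [20], 3: [1, 2]}
toy_qrels = {1: {10}, 2: {20}, 3: {99}}
toy_mrr = mrr_at_10(toy_results, toy_qrels)
```

<p>Queries with no relevant passage in the top 10 contribute 0, which is why a single missed query visibly moves the score on a set this small.</p>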

<h2 id="the-three-tweaks-that-moved-the-needle">The three tweaks that moved the needle</h2>

<p>We had time to try about 20 things in an afternoon. Three of them robustly survived 10 paired rotations
against the running baseline — meaning each round we drew 10 different
random subsets of the dev queries, evaluated both the candidate and the
current best config on the same queries, and only kept changes that beat
the previous best across all 10 subsets.</p>
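<p>The acceptance rule can be sketched roughly like this. It assumes per-query reciprocal ranks have already been computed for both configs, that each rotation is a simple random sample of queries, and that a change must win on every rotation. The function name, the <code class="language-plaintext highlighter-rouge">margin</code> parameter, and the default subset size are illustrative assumptions; the harness in the repository may differ in detail:</p>

```python
import random


def passes_paired_rotations(candidate_rr, baseline_rr, n_rotations=10,
                            subset_size=109, margin=0.0, seed=0):
    """Accept a candidate only if it beats the baseline on every random subset.

    candidate_rr / baseline_rr: query_id -> reciprocal rank for that config.
    Both configs are always scored on the same queries (the "paired" part).
    """
    query_ids = sorted(set(candidate_rr) & set(baseline_rr))
    k = min(subset_size, len(query_ids))
    if k == 0:
        return False
    rng = random.Random(seed)
    for _ in range(n_rotations):
        subset = rng.sample(query_ids, k)
        cand = sum(candidate_rr[q] for q in subset) / k
        base = sum(baseline_rr[q] for q in subset) / k
        if cand - base <= margin:
            return False  # a single losing rotation rejects the change
    return True


# Toy check: a uniformly better config passes, an identical one does not.
toy_baseline = {q: 0.5 for q in range(200)}
toy_better = {q: 0.6 for q in range(200)}
```

<p>Requiring a win on all rotations is deliberately conservative: it trades a few missed real improvements for far fewer lucky ones.</p>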

<h3 id="1-stopword-limit-on-weakand">#1: Stopword-limit on weakAnd</h3>
<p>Vespa’s <a href="https://docs.vespa.ai/en/reference/query-language-reference.html#text"><code class="language-plaintext highlighter-rouge">text()</code></a>
operator searches its tokens with a
<a href="https://docs.vespa.ai/en/reference/query-language-reference.html#weakand"><code class="language-plaintext highlighter-rouge">weakAnd</code></a> by default,
which has a built-in document-frequency (DF) based stopword filter. We set it very aggressively here:
<a href="https://docs.vespa.ai/en/reference/query-api-reference.html#ranking.matching.weakand.stopwordLimit"><code class="language-plaintext highlighter-rouge">ranking.matching.weakand.stopwordLimit = 0.05</code></a>, which means Vespa automatically drops query terms that appear in more than 5% of docs. No need to create a hand-curated
list. This makes our queries faster too: the high-DF terms have the longest posting lists,
so skipping them more than halved our wall-clock latency.<sup id="fnref:tripling"><a href="#fn:tripling" class="footnote" rel="footnote" role="doc-noteref">5</a></sup></p>

<table>
  <thead>
    <tr>
      <th>stopword-limit</th>
      <th>Δ MRR@10 paired</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>off</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>0.05</strong></td>
      <td><strong>+0.0136</strong></td>
    </tr>
    <tr>
      <td>0.02</td>
      <td>-0.0088</td>
    </tr>
  </tbody>
</table>

<p>At 0.02 the filter becomes too aggressive and starts dropping important content words.</p>
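<p>To make the idea concrete, here is a toy illustration of a DF-based stopword cut. This is not how Vespa implements it (the real filter lives inside <code class="language-plaintext highlighter-rouge">weakAnd</code>); the function, the document counts, and the example terms are all made up for illustration:</p>

```python
def drop_high_df_terms(query_terms, doc_freq, num_docs, stopword_limit=0.05):
    """Drop query terms that occur in more than `stopword_limit` of all documents."""
    kept = []
    for term in query_terms:
        df_fraction = doc_freq.get(term, 0) / num_docs
        if df_fraction <= stopword_limit:
            kept.append(term)
    return kept


# Hypothetical document frequencies in a 1,000-document corpus.
toy_doc_freq = {"what": 400, "is": 500, "bm25": 3}
toy_query = ["what", "is", "bm25"]

aggressive = drop_high_df_terms(toy_query, toy_doc_freq, 1000, stopword_limit=0.05)
lenient = drop_high_df_terms(toy_query, toy_doc_freq, 1000, stopword_limit=0.6)
```

<p>With the limit at 0.05 only the rare content term survives, while at 0.6 the question words stay in the query. Lowering the limit further eventually starts cutting content terms as well, which matches what the table shows at 0.02.</p>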

<h3 id="2-nativeproximity">#2: nativeProximity</h3>
<p>The <strong><code class="language-plaintext highlighter-rouge">nativeProximity</code></strong> feature is a continuous proximity score that rewards docs
where the matched query terms are close together. Doug’s agent did something
loosely similar by writing code to score based on adjacent-bigram phrase term frequencies. We just pulled it from
<a href="https://docs.vespa.ai/en/reference/ranking/rank-features.html">Vespa’s rank-feature catalog</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rank-profile lexical inherits default {
    inputs {
        query(w_prox) double: 0.0      # query-time tunable weight
    }
    first-phase {
        expression: bm25(description) + query(w_prox) * nativeProximity
    }
    ...
}
</code></pre></div></div>

<p>Our sweep results: <code class="language-plaintext highlighter-rouge">w_prox = 10</code> is a good value here, with a wide plateau (8-14 are all within
noise). Wide plateaus like this are a positive sign that the gain is real, not a noise-fitted spike.</p>

<h3 id="3-fieldmatchearliness">#3: fieldMatch.earliness</h3>
<p><strong><a href="https://docs.vespa.ai/en/reference/rank-features.html#fieldMatch(name).earliness"><code class="language-plaintext highlighter-rouge">fieldMatch(description).earliness</code></a></strong> is a feature that rewards matches near the start
of the field. A match is often a stronger signal if it appears early —
writing tends to introduce its main topic up front (headline, abstract,
summary). <code class="language-plaintext highlighter-rouge">w_fm_early = 8</code> peaks at +0.0189 paired over the proximity-only anchor.</p>

<h3 id="iterating-fast--and-validating-more-thoroughly">Iterating fast — and validating more thoroughly</h3>
<p>A nice property of doing this in Vespa: every “did this weight help?”
question is just a query parameter against an index that was already built when we fed the data.
~33 ms per query, ~0.4 s for a 109-query training eval, ~22 s to test
one weight value across 10 paired rotations. It stayed fast because we
weren’t paying for re-indexing or re-scoring of every query/document pair;
we were just passing numbers in an HTTP body.</p>
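<p>A sketch of what “passing numbers in an HTTP body” looks like for a weight sweep. The field names are the ones from our final config later in this post; the helper function, the example query text, and the swept values are hypothetical, and actually sending each body to your Vespa query endpoint is left out:</p>

```python
def build_query_body(query_text, w_prox=10.0, w_fm_early=8.0, stopword_limit=None):
    """Build one query body; the rank weights are plain query-time inputs."""
    body = {
        "yql": "select doc_id from passage where description contains ({language:'en'}text(@q))",
        "q": query_text,
        "ranking.profile": "lexical",
        "hits": 10,
        "language": "en",
        "input.query(w_prox)": float(w_prox),
        "input.query(w_fm_early)": float(w_fm_early),
    }
    if stopword_limit is not None:
        body["ranking.matching.weakand.stopwordLimit"] = float(stopword_limit)
    return body


# Sweep the proximity weight: the index is untouched, only the request changes.
sweep_bodies = [build_query_body("what is bm25", w_prox=w) for w in (0, 4, 8, 10, 14)]
```

<p>Because nothing is re-indexed between values, the cost of a sweep is just the number of weights times the number of queries.</p>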

<p>That speed is also what lets us validate each candidate more thoroughly.
The eval set is tiny - 109 training queries per rotation - so any single split is noisy. Try 20 weight
combos against one split, and one or two will look like wins by pure luck
(the paired noise is ±0.014, bigger than the gains we’re chasing). Our fix
is to test each candidate across many <em>different</em> random splits and keep
only what holds up everywhere. That’s painful if every eval is an LLM call
or a re-index, but nearly free when it’s just another query - so we ran 10
rotations per weight. That’s the difference between a real +0.005 and a
lucky one, and possibly a big reason the single-split agent loop overfits:
not enough independent looks at the data before committing to a change.</p>

<h3 id="our-best-recipe-from-this-run">Our best recipe from this run</h3>

<p><strong><code class="language-plaintext highlighter-rouge">bm25 + 10·nativeProximity + 8·fieldMatch.earliness + sw=0.05</code></strong>.
On the full 543-query minimarco dev set: MRR@10 = <strong>0.5163</strong> (+0.0256
over BM25 = 0.4907). Doug’s agent gets <strong>0.5350</strong> (+0.044), so it’s ahead of us by
about 0.019 in absolute terms.</p>

<h2 id="what-happens-on-full-msmarco">What happens on full MSMARCO?</h2>

<p>First, what these numbers mean: minimarco is also where we <em>tuned</em>, so the
minimarco column is in-sample. The 10-rotation gate guards against getting lucky
on a single split — but not against fitting the 543-query subset as a whole. The
full 8.84M-doc corpus is the real out-of-sample test, so that’s the column to
trust for generalization.</p>

<p>Doug honestly flags this in his post: his agent’s gains don’t generalize
well. On the full 8.84M-doc benchmark, his round-8 reranker scores 0.1991
vs BM25’s 0.1897 - only +0.0094 of the original +0.044 improvement survives.
We tried our configurations on the full corpus too:</p>

<p><img src="/assets/2026-05-29-re-autoresearching-msmarco-bm25-on-vespa/hero.png" alt="MRR@10 by approach on the minimarco tuning subset vs the full MSMARCO benchmark" /></p>

<p>How well do our results transfer to the full dataset?</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>minimarco</th>
      <th>full MSMARCO</th>
      <th>retention</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>BM25</td>
      <td>0.4907</td>
      <td>0.1901</td>
      <td>N/A</td>
    </tr>
    <tr>
      <td>Free-form-Python agent</td>
      <td>0.5350 (+0.044)</td>
      <td>0.1991 (+0.009)</td>
      <td><strong>21%</strong></td>
    </tr>
    <tr>
      <td>Our agent (Vespa rank features)</td>
      <td>0.5060 (+0.015)</td>
      <td>0.2053 (+0.015)</td>
      <td><strong>99%</strong></td>
    </tr>
    <tr>
      <td>Our manual sweep (Vespa rank features)</td>
      <td>0.5163 (+0.026)</td>
      <td>0.2106 (+0.021)</td>
      <td><strong>80%</strong></td>
    </tr>
  </tbody>
</table>

<p>Most of our minimarco lift from both attempts survives the jump to full MSMARCO. These
are all features with generalizable signal, and re-tuning the weights on the full dataset might squeeze out more.</p>
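<p>The retention column can be reproduced from the MRR@10 values, assuming retention means the full-MSMARCO lift divided by the minimarco lift. The function name is ours; for the free-form-Python row we use Doug’s own BM25 baselines (0.4913 and 0.1897):</p>

```python
def lift_retention(mini_base, mini_tuned, full_base, full_tuned):
    """Fraction of the minimarco lift that survives on full MSMARCO."""
    return (full_tuned - full_base) / (mini_tuned - mini_base)


manual_sweep = lift_retention(0.4907, 0.5163, 0.1901, 0.2106)
vespa_agent = lift_retention(0.4907, 0.5060, 0.1901, 0.2053)
free_form = lift_retention(0.4913, 0.5350, 0.1897, 0.1991)

for name, value in [("manual sweep", manual_sweep),
                    ("Vespa agent", vespa_agent),
                    ("free-form Python", free_form)]:
    print(f"{name}: {value:.1%}")
```

<p>The inputs are rounded to four decimals, so the last digit of each percentage can shift slightly depending on rounding.</p>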

<p>Why this gap?</p>

<ul>
  <li>An LLM with free-form Python has more rope. By round 8 the agent
hard-codes a stopword list containing <code class="language-plaintext highlighter-rouge">vacat</code> and <code class="language-plaintext highlighter-rouge">medicin</code> and a
conditional like <code class="language-plaintext highlighter-rouge">toks[1] not in ("can", "invent")</code>. Those help on the
minimarco subset; they don’t help anywhere else - it’s overfitting.</li>
  <li>We’re using rank features that IR researchers and Vespa engineers have
already validated as carrying generalizable signal. We didn’t have to invent
<code class="language-plaintext highlighter-rouge">nativeProximity</code> from scratch here. The LLM has to rediscover something like it
from termfreq primitives within a single round, which is harder.</li>
  <li>The LLM writing arbitrary Python does have the freedom to invent completely
novel techniques, but on this extremely well-studied dataset that seems
somewhat unlikely.</li>
</ul>

<h2 id="our-own-autoresearch-loop">Our own “autoresearch” loop</h2>

<p>In our results, we refer to our first experiment as “manual”, but it’s 2026 - that sweep wasn’t a human hand-typing
weights either. We used a coding assistant (Claude Code) to do the
exploration, with us steering at a high level and holding it to the
10-rotation rule. So really, it’s LLM-vs-LLM - what changes
between the runs is the search space and how rigorously each one accepts a
change, not human vs machine.</p>

<p>After the quick sweep we wondered: would a fully <em>autonomous</em> LLM agent like Doug’s,
inside our constrained search space, generalize any better? We built a small loop
(same <code class="language-plaintext highlighter-rouge">eval_margin = 0.002</code>, same rotating seeds, same gpt-5.5 model with xhigh
reasoning) that edits the Vespa first-phase rank <em>expression</em> instead of
<code class="language-plaintext highlighter-rouge">reranker.py</code>. The key difference from Doug’s setup: a change is accepted only if it
clears the <strong>same 10-rotation paired-robustness check our manual sweep uses</strong> — so the
agent and the sweep differ only in search space and autonomous-vs-steered, not in how
rigorously a change is accepted. ~700 lines of Python, $6 of OpenAI spend, 30
minutes.</p>

<p>In our run the agent found two valuable features — <code class="language-plaintext highlighter-rouge">nativeProximity</code> and
<code class="language-plaintext highlighter-rouge">fieldMatch.earliness</code>, the same pair the manual sweep landed on — and reached
<strong>MRR@10 = 0.2053 on full MSMARCO (+0.0152)</strong>: about <strong>99% retention</strong> of its
minimarco lift, even higher than the manual sweep’s 80% and far above the
free-form-Python run’s 21%. Its absolute score sits a touch below the manual sweep’s
(0.2053 vs 0.2106) because it didn’t add the <code class="language-plaintext highlighter-rouge">stopwordLimit</code> matching-side lever in
this run — but almost all of the lift it <em>did</em> find carried over. That’s the
constrained search space doing its job: the agent can’t encode token-specific tactics
like a hard-coded stopword list, so its ceiling on overfitting is lower.</p>

<h2 id="our-final-config">Our final config</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">body</span> <span class="o">=</span> <span class="p">{</span>
    <span class="sh">"</span><span class="s">yql</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">select doc_id from passage where description contains ({language:</span><span class="sh">'</span><span class="s">en</span><span class="sh">'</span><span class="s">}text(@q))</span><span class="sh">"</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">q</span><span class="sh">"</span><span class="p">:</span> <span class="n">user_query_text</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">ranking.profile</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">lexical</span><span class="sh">"</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">hits</span><span class="sh">"</span><span class="p">:</span> <span class="mi">10</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">language</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">en</span><span class="sh">"</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">input.query(w_prox)</span><span class="sh">"</span><span class="p">:</span> <span class="mf">10.0</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">input.query(w_fm_early)</span><span class="sh">"</span><span class="p">:</span> <span class="mf">8.0</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">ranking.matching.weakand.stopwordLimit</span><span class="sh">"</span><span class="p">:</span> <span class="mf">0.05</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is all that’s needed to use these features. Just a simple linear combination of well-known signals,
and a demonstration that there <em>is</em> much more tuning potential to be had from lexical search too.
And your coding agent already knows how to do it!</p>

<p>BM25 has served the IR community well for 30+ years, and we don’t think it’s
going anywhere. But Vespa has all the pieces in place to go beyond the baseline -
with lexical signals, advanced multi-vector ranking and more, and we keep working to raise the bar.
<a href="https://vespa.ai/subscribe/">Subscribe</a> to the newsletter if you’d like to hear about it!</p>

<h2 id="going-further">Going further</h2>

<p>The full code for this experiment — the Vespa app, the
paired-rotation sweep harness, and the LLM agent loop, with a step-by-step
reproduction guide — is at <a href="https://github.com/vespaai-playground/msmarco-bm25-autoresearch"><code class="language-plaintext highlighter-rouge">vespaai-playground/msmarco-bm25-autoresearch</code></a>.</p>

<p>To actually give <em>your</em> coding agent the Vespa knowledge to quickly succeed, there’s an
official skills pack for Claude Code / Codex / Cursor / Gemini CLI:
<a href="https://github.com/vespaai-playground/skills">github.com/vespaai-playground/skills</a> —
which includes schema authoring, rank features, query building, etc.</p>

<p>Want to go even further? A weighted sum of three features is still on the simple end of
the spectrum. The next step is to stop hand-weighting and let a model learn
the best combination on your dataset: the <a href="https://blog.vespa.ai/the-rag-blueprint/">RAG Blueprint</a>
collects ~190 lexical <em>and</em> semantic match/rank features and trains a GBDT
(<a href="https://docs.vespa.ai/en/lightgbm.html">LightGBM</a>) model to combine them — a more advanced version of
what we did by hand here, and it mixes BM25-style lexical signals with
vector-semantic ones.</p>

<p>Want to play with Vespa? Start with the <a href="https://vespa.ai/free-trial/">free trial</a>
or pull the <a href="https://hub.docker.com/r/vespaengine/vespa/">vespaengine/vespa</a>
container.</p>

<h2 id="notes">Notes</h2>

<p>Come hang out in <a href="https://vespatalk.slack.com/">Vespa Slack</a> or
<a href="https://discord.vespa.ai/">Discord</a> if you want to chat ranking features
or compare notes on retrieval evals. Thanks to Doug for publishing the experiment!</p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:bm25-probe">
      <p>We asked the three models the same six neutral retrieval prompts, 10 reps each. BM25 mention rates: gpt-4o 12%, gpt-4.1 30%, gpt-5-chat 35%. The shift happens at gpt-4.1, not GPT-5. <a href="#fnref:bm25-probe" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:limit">
      <p><em>On the Theoretical Limitations of Embedding-Based Retrieval</em>, Google DeepMind &amp; Johns Hopkins, arXiv:2508.21038 (Aug 2025). LIMIT is a benchmark of deliberately simple retrieval queries — top embedding models scored under 20% recall@100 on queries as simple as “who likes apples?”. <a href="#fnref:limit" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:msmarco">
      <p><a href="https://microsoft.github.io/msmarco/">MSMARCO</a> is a widely-used passage-ranking benchmark — 8.84M short passages and a dev set of 7,437 <em>qrels</em> (query + relevant-passage-ID pairs collected by human annotators). <strong>MRR@10</strong> = mean reciprocal rank of the first relevant passage in each query’s top-10 results, averaged across queries (0 if no relevant passage in top-10). <a href="#fnref:msmarco" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:stemming">
      <p>Vespa also ships <a href="https://docs.vespa.ai/en/linguistics/lucene-linguistics.html">Lucene Linguistics</a> as an alternative for stemming, tokenization, and language detection — we stuck with the default OpenNLP here and did no tuning of linguistics. <a href="#fnref:stemming" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:tripling">
      <p><a href="https://blog.vespa.ai/tripling-the-query-performance-of-lexical-search/">Tripling the query performance of lexical search</a> takes a complementary latency-and-cost-focused approach — <code class="language-plaintext highlighter-rouge">stopwordLimit=0.6</code> combined with <code class="language-plaintext highlighter-rouge">adjust-target</code> and <code class="language-plaintext highlighter-rouge">filter-threshold</code> for 3–11× speedup with negligible quality change. We pushed <code class="language-plaintext highlighter-rouge">stopwordLimit</code> to the aggressive end of the range for quality, since on MSMARCO the benefit comes from setting the parameter low enough to exclude common question words — about 72% of dev queries are question-style (“what is X”, “how do Y”), and at <code class="language-plaintext highlighter-rouge">0.6</code> those terms are not excluded. On keyword-heavy workloads the optimum would be higher — re-measure on your own query distribution before adopting. Query rewriting on lexical queries is likely stronger, but we didn’t do it here. <a href="#fnref:tripling" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>
]]></content:encoded>
        <pubDate>Fri, 29 May 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/re-autoresearching-msmarco-bm25-on-vespa/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/re-autoresearching-msmarco-bm25-on-vespa/</guid>
        
        <category>ranking</category>
        
        <category>bm25</category>
        
        <category>information retrieval</category>
        
        <category>evaluation</category>
        
        
      </item>
    
      <item>
        <title>Vespa Newsletter, May 2026</title>
        <description>Advances in Vespa include finer control over deployments, smarter ranking, richer embedding integrations, and more scalable vector search.
</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/logo/logo-pi.jpg" />
        
        <content:encoded><![CDATA[<p>Welcome to the latest edition of the Vespa newsletter.
In the <a href="https://blog.vespa.ai/vespa-newsletter-december-2025/">previous update</a>, we introduced the Vespa.ai Playground, the Vespa Kubernetes Operator, Pyvespa 1.0,
and <a href="https://blog.vespa.ai/vespa-newsletter-february-2026/">more</a>.</p>

<p>This month, we’re announcing several updates focused on retrieval quality, ranking flexibility, and developer productivity. Each feature is designed to help engineering teams build faster, more accurate, and more maintainable retrieval and ranking systems, while giving businesses better relevance, lower operational overhead, and more predictable performance at scale.</p>

<p>Let’s dive into what’s new.</p>

<h3 id="calling-all-vespians-vespaai-live-has-landed">Calling All Vespians: Vespa.ai Live Has Landed</h3>

<p><img src="/assets/2026-05-26-newsletter/Vespa.ai-live.png" alt="Vespa.ai Live" /></p>

<p>As more teams build with Vespa, bringing the community closer together has become a major focus for us.
Earlier this year, we extended our Slack community with our first virtual meetups,
which were joined by more than 100 Vespians from around the world,
from the US and Ukraine to Singapore, Australia, Kazakhstan, Egypt, and beyond.
But while virtual events are great, nothing quite compares to meeting in person —
learning from peers, exchanging ideas, and continuing the conversation over coffee, beer, or wine.
That’s why we’re excited to announce our first in-person community meetup: Vespa.ai Live!</p>

<p>The event includes technical sessions, real-world user experiences, expert panels,
interactive unconference discussions,
and plenty of opportunities to connect with others building in this space.
Hosted by The Search Juggler, Charlie Hull,
the day will bring together external experts and leading authors Trey Grainger and Doug Turnbull,
Vespa engineers,
community voices with speakers from Walmart, Etsy and RavenPack,
and a keynote from Vespa co-founder and CEO Jon Bratseth.</p>

<p>On September 9, join the pre-event training, with <em>Vespa 101: Getting started with Vespa</em>,
and <em>Ranking 202: A deeper dive into improving retrieval quality</em> - <a href="https://www.tickettailor.com/events/vespaaias/2158210">details and registration</a>.</p>

<p>Most of all, Vespa.ai Live is intended to be community-driven —
where Vespians share lessons learned and boldly go beyond the frontier of modern AI retrieval.</p>

<p><a href="https://content.vespa.ai/vespa-live">Learn more about Vespa Live!</a></p>

<h3 id="product-updates">Product updates</h3>

<ul>
  <li>Vespa Cloud: Detailed metric dashboards</li>
  <li>Vespa Cloud: Index backup</li>
  <li>Vespa Cloud: Fine-grained maintenance controls</li>
  <li>Vespa Cloud: Voyage AI, OpenAI, and Mistral AI embedding integration</li>
  <li>Vespa Cloud: Custom resource tags</li>
  <li>Vespa skills for agents</li>
  <li>A new query operator for text matching</li>
  <li>Cluster-size independent configuration of relevance effort</li>
  <li>Boolean array fields</li>
  <li>Match specific array elements</li>
  <li>In-memory document ids</li>
  <li>Search group pinning</li>
  <li><em>Near</em> matching aware ranking</li>
  <li>Detect ignored write operations</li>
  <li>Accessing the max first phase score in re-ranking</li>
  <li>Geo distance in grouping</li>
</ul>

<h3 id="vespa-cloud-detailed-metric-dashboards">Vespa Cloud: Detailed metric dashboards</h3>
<p>As companies deploy their large-scale latency-sensitive applications on Vespa Cloud
there is a need for more detailed insights into how the application is performing.
While the Vespa Cloud Console has always provided an overview metrics dashboard,
many finer details have only been available to Vespa’s operations engineers.</p>

<p><strong>What’s new:</strong> We have added all the metrics dashboards used by Vespa engineers to the console,
so customers who want to can dig as deeply as they like.
We’ve also added explanations to the dashboards to make them easier to understand.</p>

<p><img src="/assets/2026-05-26-newsletter/dashboard.png" alt="Dashboard" /></p>

<p><a href="https://blog.vespa.ai/the-vespa-cloud-metrics-dashboard/">Read more</a>.</p>

<h3 id="vespa-cloud-index-backup">Vespa Cloud: Index backup</h3>
<p>Reliability at scale means being prepared for the unexpected.
Vespa Cloud now provides automated snapshot backups of indexes from content nodes,
enabling disaster recovery without a full re-index.</p>

<p><strong>What’s new:</strong> Vespa Cloud now supports backing up indexes from content nodes.
The snapshot backups can be used to restore nodes after a disaster.
<a href="https://docs.vespa.ai/en/operations/data-management.html#backup">Read more</a>.</p>

<h3 id="vespa-cloud-fine-grained-maintenance-controls">Vespa Cloud: Fine-grained maintenance controls</h3>
<p>Vespa Cloud has always provided control over when and how application changes and Vespa upgrades
are rolled out in production by the <a href="https://docs.vespa.ai/en/operations/automated-deployments.html">CD pipeline</a>.
In addition, Vespa Cloud performs occasional OS upgrades as a background host-level operation, which is orchestrated
but not rolled out by the CD pipeline.</p>

<p>We already see a need to run these processes more aggressively
and anticipate that this trend will accelerate as capabilities similar to Claude Mythos become widely available.
While this is necessary to maintain a strong security posture in the times ahead,
it can increase the impact of OS-level maintenance on observed metrics,
such as when these operations happen during peak traffic.</p>

<p><strong>What’s new:</strong> Vespa Cloud now lets you control when these maintenance operations are allowed to happen
in the same way as deployments are controlled.
See the maintenance attribute in <a href="https://docs.vespa.ai/en/reference/applications/deployment.html#block-change">deployment.xml’s block-change tag</a>.</p>

<h3 id="vespa-cloud-voyage-ai-openai-and-mistral-ai-embedding-integration">Vespa Cloud: Voyage AI, OpenAI, and Mistral AI embedding integration</h3>
<p>Embedding models are central to AI search, and it’s now simpler to use the most popular ones in Vespa Cloud.</p>

<p><strong>What’s new:</strong> You can now save your API key in Vespa Cloud and invoke the embedding APIs of
<a href="https://docs.voyageai.com/docs/introduction">Voyage AI</a>,
<a href="https://developers.openai.com/api/docs/guides/embeddings">OpenAI</a>,
and <a href="https://docs.mistral.ai/studio-api/knowledge-rag/embeddings">Mistral AI</a>.
These APIs can be invoked at document processing time (indexing), as well as query time. With Voyage AI and the voyage-4 model family, you can use the API for documents and a smaller, local model for queries, eliminating the need for an API call in the query path.</p>

<p>Read more in the Vespa embedding documentation for:
<a href="https://docs.vespa.ai/en/rag/embedding.html#voyageai-embedder">Voyage AI</a>,
<a href="https://docs.vespa.ai/en/rag/embedding.html#openai-embedder">OpenAI</a>,
and <a href="https://docs.vespa.ai/en/rag/embedding.html#mistral-embedder">Mistral AI</a>.</p>

<h3 id="vespa-cloud-custom-resource-tags">Vespa Cloud: Custom resource tags</h3>
<p>With Vespa Cloud Enclave, Vespa will provision resources in accounts owned by the customer.
Many companies want to track these resources, e.g. for financial monitoring.</p>

<p><strong>What’s New</strong>: Vespa now lets you declare custom resource tags in deployment.xml that will be applied to provisioned resources.
The tag declarations can contain template variables, so that resources can be tagged with, for example, the application they belong to.
<a href="https://docs.vespa.ai/en/reference/applications/deployment.html#resource-tags">Read more</a>.</p>

<h3 id="vespa-skills-for-agents">Vespa skills for agents</h3>
<p>Coding agents are good at working with Vespa applications.
Giving them the relevant skills makes this even more efficient.</p>

<p><strong>What’s new:</strong> We have released a collection of skills for agents working with Vespa applications,
available <a href="https://github.com/vespaai-playground/skills">here</a>.
This includes skills for working with application packages, feeding and queries, as well as migrating from Elasticsearch to Vespa.
We run evaluations over these skills to ensure that they actually improve outcomes with current models.</p>

<h3 id="a-new-query-operator-for-text-matching">A new query operator for text matching</h3>
<p>To do lexical matching with an arbitrary input string in Vespa, you can use <code class="language-plaintext highlighter-rouge">userInput("my text")</code> in YQL.
This assumes that the text may contain simple syntax for controlling the matching, such as “some-field:” to specify the field to match.
Sometimes, the text should just be interpreted as raw text with no such query syntax.</p>

<p><strong>What’s new:</strong> Vespa now supports a new <code class="language-plaintext highlighter-rouge">text()</code> operator, which interprets its argument simply as raw text with no syntax.
Without syntax, the text can only search a single field or fieldset,
so the regular YQL syntax is used to specify the field: <code class="language-plaintext highlighter-rouge">where my-field contains text("my text")</code>.
<a href="https://docs.vespa.ai/en/reference/querying/yql.html#text">Read more</a>.</p>

<h3 id="cluster-size-independent-configuration-of-relevance-effort">Cluster-size independent configuration of relevance effort</h3>
<p>Vespa has various parameters to set how much effort (CPU) should be spent on providing good results in a query.
These parameters are specified as a value <em>per content node</em>,
so if you want the total expenditure to stay constant when you change the number of content nodes,
you must remember to update these parameters.</p>

<p>This is easy to forget, and of course impossible with autoscaling activated.
What’s more, when new nodes are added to clusters, they will initially have less data than other nodes,
but will get the same setting as nodes with a full share of data.</p>

<p><strong>What’s new:</strong> Vespa now supports variants of these configuration parameters prefixed by “total”,
which let you specify values across all the content nodes.
Vespa will automatically calculate the right share for each node in the cluster group,
including when nodes temporarily have less data than normal.</p>

<p>The new “total-” parameters are:</p>

<ul>
  <li>NearestNeighbor and WeakAnd <a href="https://docs.vespa.ai/en/reference/querying/yql.html#totaltargethits">totalTargetHits</a>:
The minimum number of hits the query operator should produce</li>
  <li>Match-phase <a href="https://docs.vespa.ai/en/reference/schemas/schemas.html#match-phase-total-max-hits">total-max-hits</a> /
<a href="https://docs.vespa.ai/en/reference/api/query.html#ranking.matchphase.totalmaxhits">ranking.matchphase.totalmaxhits</a></li>
  <li>First-phase <a href="https://docs.vespa.ai/en/reference/schemas/schemas.html#total-keep-rank-count">total-keep-rank-count</a> /
<a href="https://docs.vespa.ai/en/reference/api/query.html#ranking.totalkeeprankcount">ranking.totalKeepRankCount</a></li>
  <li>Second-phase <a href="https://docs.vespa.ai/en/reference/schemas/schemas.html#secondphase-total-rerank-count">total-rerank-count</a> /
<a href="https://docs.vespa.ai/en/reference/api/query.html#ranking.secondphase.totalrerankcount">ranking.secondPhase.totalRerankCount</a></li>
</ul>
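
<p>To illustrate the idea behind the “total” parameters, here is a minimal Python sketch that splits a total budget across content nodes in proportion to how much data each node holds.
The function name <code class="language-plaintext highlighter-rouge">per_node_share</code> and the proportional-split rule are assumptions made for illustration; Vespa’s actual calculation may differ.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math


def per_node_share(total, docs_per_node):
    """Split a total budget across nodes in proportion to their document counts.

    Hypothetical helper, not a Vespa API: it only illustrates why a node
    holding less data should get a smaller share of the total.
    """
    all_docs = sum(docs_per_node)
    if all_docs == 0:
        # No data anywhere: nothing to distribute.
        return [0 for _ in docs_per_node]
    # Round up so the shares never add up to less than the requested total.
    return [math.ceil(total * docs / all_docs) for docs in docs_per_node]


# Three full nodes and one newly added node that has received 10% of its data.
shares = per_node_share(1000, [1_000_000, 1_000_000, 1_000_000, 100_000])
print(shares)
</code></pre></div></div>

<p>In this sketch the new node gets a much smaller share than the full nodes, and the shares are recomputed from the total whenever the node list changes.</p>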

<h3 id="boolean-array-fields">Boolean array fields</h3>
<p>Once developers move beyond the basics to really put the power of Vespa to work,
they often want to pack large amounts of dynamic metadata into documents,
for example information about each document’s relationship to each zip code in the US.
At scale, the memory efficiency of such usage really matters.</p>

<p><strong>What’s new:</strong> Vespa now lets you create fields that are arrays of bits: <code class="language-plaintext highlighter-rouge">field my_bits type array&lt;bool&gt;</code>.
Booleans can be either standalone or part of a struct type that is wrapped in an array.
<a href="https://docs.vespa.ai/en/reference/schemas/schemas.html#array">Read more</a>.</p>

<h3 id="match-specific-array-elements">Match specific array elements</h3>
<p>Arrays in documents can be searched both as attributes and text indexes.
You can also match multiple struct values or text tokens of the same array element by using the sameElement operator.
In some use cases, you also want to match a specific index in the array.</p>

<p><strong>What’s new:</strong> Vespa now lets you specify the array index you want to match in queries: <code class="language-plaintext highlighter-rouge">select … where my_bits[94085] = true</code>.
You can also search for multiple indexes in the same query by using slightly more complicated syntax:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>select … where my_array contains ({elementFilter:[33, 34]}sameElement(first_name contains "John", last_name contains "Doe"))
</code></pre></div></div>

<p>See the <a href="https://docs.vespa.ai/en/reference/querying/yql.html#elementfilter">elementFilter</a> documentation.
This is also supported in <a href="https://docs.vespa.ai/en/reference/querying/json-query-language.html#search-within-same-struct-element">JSON queries</a>
by using an index attribute.</p>

<h3 id="search-group-pinning">Search group pinning</h3>
<p>When a content cluster has <a href="https://docs.vespa.ai/en/performance/topology-and-resizing.html">multiple groups</a>,
they will all have the same data,
but their indexes will be slightly different since each node in each group has a different subset of the data written in a different order.
This can lead to some inconsistency when a user is paging over a result set and hitting different groups.</p>

<p><strong>What’s new:</strong> Vespa now lets you pin queries to a specific group to make pagination queries consistent.
<a href="https://docs.vespa.ai/en/content/elasticity.html#pinning-groups">Read more</a>.</p>

<h3 id="in-memory-document-ids">In-memory document ids</h3>
<p>Document ids in Vespa are stored on disk only.
This saves memory, but makes it impossible to retrieve the full ids of many documents really fast in queries and visiting.</p>

<p><strong>What’s new:</strong> From Vespa 8.691 you can declare in the schema that document ids should reside in memory for fast access similar to attributes.
<a href="https://docs.vespa.ai/en/schemas/documents.html#docid-in-results">Read more</a>.</p>

<h3 id="near-matching-aware-ranking"><em>Near</em> matching aware ranking</h3>
<p>When using the <em>near</em> and <em>onear</em> query operators,
the most intuitive ranking is using only the terms matching in the operator itself for rank scoring. Example:</p>

<p>Suppose near(term1, term2) matches document1 because of a single window where term1 and term2 appear close enough.
If document1 contains term1 many additional times outside the valid proximity window,
those occurrences are less relevant for ranking.</p>

<p><strong>What’s new:</strong> From Vespa 8.672, terms outside the match window are not considered in relevance calculations.
<a href="https://docs.vespa.ai/en/reference/querying/yql.html#near">Read more</a>.</p>

<h3 id="detect-ignored-write-operations">Detect ignored write operations</h3>
<p>Content clusters can specify what documents they should receive in a document selection.
Sending a document operation which is ignored by every cluster is not an error, but you may want to know.</p>

<p><strong>What’s New:</strong> From Vespa 8.680, the document/v1 API includes a dedicated
<a href="https://docs.vespa.ai/en/reference/api/document-v1.html#x-vespa-ignored-operation">X-Vespa-Ignored-Operation</a> HTTP response header.
When an operation is ignored during routing (for example, because the target document no longer exists),
this header is present and set to “true”.
Read more in <a href="https://github.com/vespa-engine/vespa/issues/36397">this issue</a>.</p>

<h3 id="accessing-the-max-first-phase-score-in-re-ranking">Accessing the max first phase score in re-ranking</h3>
<p><strong>What’s new:</strong> <em>firstPhaseMax</em> is a new <a href="https://docs.vespa.ai/en/reference/ranking/rank-features.html">rank feature</a>
which exposes the rank score of the top scoring document locally on the node in second-phase ranking.</p>

<p>One usage of this is to enable dropping documents that score too low <em>relative</em> to the best-scoring document,
by combining it with a <a href="https://docs.vespa.ai/en/reference/schemas/schemas.html#rank-score-drop-limit">rank-score-drop-limit</a>.</p>

<h3 id="geo-distance-in-grouping">Geo distance in grouping</h3>
<p><strong>What’s new:</strong> Grouping now lets you group by <a href="https://docs.vespa.ai/en/reference/querying/grouping-language.html#geo_distance">geo_distance</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>all( group(fixedwidth(geo_distance(attribute(location), 63.4, 10.4).km, 10)) each(output(count())) )
</code></pre></div></div>
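
<p>For intuition about what the expression above computes, here is a minimal Python sketch: it measures the great-circle distance from a document’s location to the point (63.4, 10.4) and maps it to a 10 km wide bucket.
The haversine formula, the Earth radius constant, and the function names are assumptions for illustration, not Vespa’s implementation.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def fixed_width_bucket(value, width):
    """Lower bound of the fixed-width bucket that value falls into."""
    return math.floor(value / width) * width


# A hypothetical document located in Oslo, bucketed by distance to (63.4, 10.4).
distance = haversine_km(59.9, 10.75, 63.4, 10.4)
print(fixed_width_bucket(distance, 10))
</code></pre></div></div>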

<h3 id="image-search-demo">Image Search demo</h3>
<p><img src="/assets/2026-05-26-newsletter/image-search.png" alt="Image search demo" /></p>

<p>Try the <a href="https://reverse-image-search.vespa-demos.ai/">image search demo</a> to test how ranking profiles with different tensor precision in ranking affect results. Hint: binarized embeddings perform really well,
<a href="https://reverse-image-search.vespa-demos.ai/article">read the report</a>!</p>

<h3 id="vespa-learn">Vespa Learn</h3>
<p><a href="http://learn.vespa.ai">learn.vespa.ai</a> is a self-paced course that teaches you how to build search, recommendation, and RAG applications with Vespa.
You will go from zero to a working e-commerce search engine with hybrid retrieval and machine learning ranking,
building it up one piece at a time across six modules.</p>

<h3 id="whats-new-on-youtube">What’s New on YouTube</h3>
<ul>
  <li><a href="https://vespa.ai/resource/multimodal-intelligence-for-life-sciences-on-aws/">Multimodal Intelligence for Life Sciences on AWS</a> <em>(Webinar)</em></li>
  <li><a href="https://vespa.ai/resource/the-personalization-problem-in-ecommerce-am/">The Personalization Problem in eCommerce AM</a> <em>(Webinar)</em></li>
  <li><a href="https://vespa.ai/resource/the-personalization-problem-in-ecommerce-em/">The Personalization Problem in eCommerce EM</a> <em>(Webinar)</em></li>
  <li><a href="https://vespa.ai/resource/the-relevance-problem-in-ecommerce/">The Relevance Problem in eCommerce</a> <em>(Webinar)</em></li>
  <li><a href="https://vespa.ai/resource/vespa-now-q1-product-update/">Vespa Now: Q1 Product Update</a> <em>(Webinar)</em></li>
  <li><a href="https://vespa.ai/resource/webinar-zero-results/">Zero Results Webinar</a></li>
</ul>

<p>Find more videos in the <a href="https://www.youtube.com/@vespaai">@vespaai</a> channel.</p>

<h3 id="blogs-and-ebooks">Blogs and ebooks</h3>
<ul>
  <li>Learn how <a href="https://www.kleinanzeigen.de/">Kleinanzeigen</a> built a single system with user behavioral profiles alongside ads,
WAND for fast inner-product retrieval over sparse attribute vectors,
embedding-based ANN search,
and click and search events processed as document updates:
<a href="https://medium.com/berlin-tech-blog/from-elasticsearch-to-vespa-rebuilding-the-kleinanzeigen-homepage-feed-part-1-b6164e366ab8">From Elasticsearch to Vespa: Rebuilding the Kleinanzeigen Homepage Feed</a>.</li>
  <li><a href="https://blog.vespa.ai/scaling-a-vespa-application-feeding-fast-and-furiously/">Scaling a Vespa Application: Feeding Fast and Furiously</a></li>
  <li><a href="https://blog.vespa.ai/the-vespa-cloud-metrics-dashboard/">The Vespa Cloud Metrics Dashboard</a></li>
  <li><a href="https://blog.vespa.ai/onnx-external-data-in-vespa-embedders/">Using Large ONNX Models with External Data in Vespa Embedders</a></li>
  <li><a href="https://blog.vespa.ai/asymmetric-retrieval-spend-on-docs-queries-for-free/">Asymmetric Retrieval: Spend on Docs, Embed your Queries for Free</a></li>
  <li><a href="https://blog.vespa.ai/agent-driven-intelligence-on-vespa-cloud/">How Metal AI Built an Agent-Driven Intelligence Platform on Vespa Cloud</a></li>
  <li><a href="https://blog.vespa.ai/build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/">Build a High-Quality RAG App on Vespa Cloud in 15 Minutes</a></li>
</ul>

<h3 id="upcoming-events">Upcoming events</h3>
<ul>
  <li><a href="https://retailx.events/commerceai-summit-2026">Commerce AI Summit London</a>: June 3, London, UK. An executive-style event connecting retailers, brands, and AI solution providers.</li>
  <li><a href="https://2026.berlinbuzzwords.de/">Berlin Buzzwords</a>:  June 7-9, Berlin, Germany. Europe’s leading conference for data infrastructure, search, and machine learning.</li>
  <li><a href="https://europe.shoptalk.com/home">Shoptalk Europe</a>: June 9-11, Fira Gran Via, Barcelona. Europe’s home for retail innovation, bringing together 4,500+ trailblazers and 180+ speakers focused on AI and the future of commerce.</li>
  <li><a href="https://etailuk.wbresearch.com/">Etail UK</a>: June 16-17, Manchester, England. A leading eCommerce and retail conference focused on digital commerce strategy, customer experience, and AI-driven personalization for modern retail teams.</li>
</ul>

<hr />
<p>👉 <a href="https://www.linkedin.com/company/vespa-ai/">Follow us on LinkedIn</a> to stay in the loop on upcoming events, blog posts, and announcements.</p>

<hr />

<p>Thanks for joining us in exploring the frontiers of AI with Vespa. Ready to take your projects to the next level? <a href="https://vespa.ai/free-trial/">Deploy your application for free</a> on Vespa Cloud today.</p>

]]></content:encoded>
        <pubDate>Wed, 27 May 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/vespa-newsletter-may-2026/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/vespa-newsletter-may-2026/</guid>
        
        
        <category>newsletter</category>
        
      </item>
    
      <item>
        <title>Scaling a Vespa Application: Feeding Fast and Furiously</title>
        <description>A tutorial on how to scale the resources in a Vespa application to increase feed throughput. Using the metrics dashboard for informed and optimised scaling.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/shaun-sullivan-4Ia69jX7rq4-unsplash.jpg" />
        
        <content:encoded><![CDATA[<p><em>This is a blog series on how to scale and evaluate a Vespa application for serving enterprise-scale workloads and customer-facing applications with potentially millions of users. Vespa is the AI search platform and all-in-one solution for all your retrieval and large-scale computation needs.</em></p>

<p>In this blog I will show you how to feed a large dataset to a Vespa application. We will be using the full MS MARCO passage dataset, which is perhaps the most comprehensive open dataset for information retrieval. It is around 4GB and contains more than 8 million passages on a wide range of topics. The goal of this blog is to show how scaling works in Vespa by feeding the entire dataset as fast as we can.</p>

<h1 id="creating-the-vespa-application">Creating the Vespa Application</h1>

<p>We will be using a pre-made sample application as our basis for scaling, but the concepts are the same for any other application.</p>

<p>Setup:</p>

<ol>
  <li>
    <p><strong>Create a <a href="https://docs.vespa.ai/en/learn/tenant-apps-instances.html">tenant</a> on Vespa Cloud:</strong></p>

    <p>Go to <a href="https://console.vespa-cloud.com/">console.vespa-cloud.com</a> and create your tenant (unless you already have one).</p>
  </li>
  <li><strong>Install the <a href="https://docs.vespa.ai/en/clients/vespa-cli.html">Vespa CLI</a></strong> using <a href="https://brew.sh/">Homebrew</a>:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ brew install vespa-cli
</code></pre></div>    </div>
    <p>Windows/No Homebrew? See the <a href="https://docs.vespa.ai/en/clients/vespa-cli.html">Vespa CLI page</a> to download directly.</p>
  </li>
  <li><strong>Configure the Vespa client:</strong>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vespa config set target cloud
$ vespa config set application your-tenant-name-here.scalingtutorial
</code></pre></div>    </div>
  </li>
  <li><strong>Get Vespa Cloud control plane access:</strong>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vespa auth login
</code></pre></div>    </div>
    <p>Follow the instructions from the command to authenticate.</p>
  </li>
  <li><strong>Clone the sample application:</strong>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vespa clone scaling-tutorial scaling-app &amp;&amp; cd scaling-app
</code></pre></div>    </div>
    <p>This sample app is perfect for demonstrating scaling and performance as it is quite intensive to run both for feeding and querying.
You can also check out <a href="https://github.com/vespa-engine/sample-apps">sample-apps</a> for other sample apps you can clone.</p>
  </li>
  <li><strong>Add a certificate for <a href="https://docs.vespa.ai/en/security/guide#data-plane">data plane access</a> to the application:</strong>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vespa auth cert
</code></pre></div>    </div>
    <p>It is a good idea to take note of the path to the .pem files written here.</p>
  </li>
  <li>
    <p><strong>Add the cross-encoder and ColBERT model</strong></p>

    <p>Export the cross-encoder ranker model to ONNX format using the <a href="https://huggingface.co/docs/optimum/index">Optimum</a> library from Hugging Face, or download an already exported ONNX version of the model (as in this example):</p>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir -p models
$ curl -L https://huggingface.co/Xenova/ms-marco-MiniLM-L-6-v2/resolve/main/onnx/model.onnx -o models/model.onnx
$ curl -L https://huggingface.co/Xenova/ms-marco-MiniLM-L-6-v2/raw/main/tokenizer.json -o models/tokenizer.json
</code></pre></div>    </div>
  </li>
  <li>
    <p><strong>Download the dataset</strong></p>

    <p>The MS MARCO passage dataset can be found <a href="https://huggingface.co/datasets/Tevatron/msmarco-passage-corpus">here</a>. Download it, unzip it, and place it in the <code class="language-plaintext highlighter-rouge">ext/</code> folder in our application.</p>

    <p><strong>NOTE: You will need around 8GB of free disk space for the dataset and the subsets we will be creating.</strong></p>
  </li>
  <li>
    <p><strong>Prepare the dataset for Vespa</strong></p>

    <p>Then run the script to convert it into the Vespa feed format:</p>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python3 ext/transform_ms_marco.py
</code></pre></div>    </div>
    <p>which gives us the dataset and a few subsets of various sizes to feed to our application.</p>
  </li>
</ol>

<h1 id="deploying-and-feeding">Deploying and Feeding</h1>

<p>We now have everything we need for deployment, feeding and scaling! Scaling a Vespa application is largely managed through the services.xml file. This is what the file currently looks like:</p>

<style>
.code-block{overflow:hidden;border:1px solid #e0e0e0;background:#fafafa;font-family:monospace}
.code-header{display:flex;align-items:center;justify-content:space-between;padding:8px 14px;background:#f4f4f4;border-bottom:1px solid #e0e0e0}
.code-lang{font-size:11px;color:#999;letter-spacing:.06em;text-transform:uppercase;font-family:sans-serif}
.copy-btn{font-size:11px;color:#999;background:none;border:1px solid #ddd;padding:2px 9px;cursor:pointer;font-family:sans-serif}
.copy-btn:hover{color:#333;border-color:#aaa}
.code-body{position:relative}
.code-scroll{overflow:hidden;max-height:132px;transition:max-height 0.4s cubic-bezier(0.4,0,0.2,1)}
.code-scroll.expanded{max-height:4000px}
.code-block pre{padding:14px 16px;font-size:12.5px;line-height:22px;color:#222;overflow-x:auto;white-space:pre;tab-size:2;margin:0;background:none}
.fade-overlay{position:absolute;bottom:0;left:0;right:0;height:72px;background:linear-gradient(to bottom,transparent,#fafafa);pointer-events:none;transition:opacity 0.3s ease}
.code-scroll.expanded~.fade-overlay{opacity:0}
.show-btn-wrap{display:flex;justify-content:center;padding:8px 0 12px;background:#fafafa;border-top:1px solid #e0e0e0}
.show-btn{font-size:12px;color:#555;background:none;border:none;padding:4px 12px;cursor:pointer;display:flex;align-items:center;gap:5px;font-family:sans-serif}
.show-btn:hover{color:#000}
.show-btn svg{transition:transform 0.3s ease}
.show-btn.open svg{transform:rotate(180deg)}
</style>

<div class="code-block">
  <div class="code-header">
    <span class="code-lang">XML — services.xml</span>
    <button class="copy-btn" onclick="(function(){var t=document.getElementById('vespa-raw').innerText;navigator.clipboard.writeText(t).then(function(){var b=document.querySelector('.copy-btn');b.textContent='Copied!';setTimeout(function(){b.textContent='Copy'},1500)})})()">Copy</button>
  </div>
  <div class="code-body">
    <div class="code-scroll" id="vespa-scroll">
      <pre id="vespa-raw">&lt;?xml version="1.0" encoding="utf-8" ?&gt;
&lt;!-- Copyright Vespa.ai. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. --&gt;
&lt;services version="1.0" xmlns:deploy="vespa" xmlns:preprocess="properties" minimum-required-vespa-version="8.311.28"&gt;

  &lt;container id="default" version="1.0"&gt;

    &lt;nodes deploy:environment="dev" count="1"&gt;
      &lt;resources vcpu="1.0" memory="8Gb" architecture="arm64" storage-type="local" disk="59Gb"/&gt;
    &lt;/nodes&gt;
   
    &lt;search/&gt;
    &lt;document-api/&gt;

     &lt;!-- See https://docs.vespa.ai/en/embedding.html#huggingface-embedder --&gt;
    &lt;component id="e5_embedding_model" type="hugging-face-embedder"&gt;
            &lt;transformer-model url="https://huggingface.co/intfloat/e5-small-v2/resolve/main/model.onnx"/&gt;
            &lt;tokenizer-model url="https://huggingface.co/intfloat/e5-small-v2/raw/main/tokenizer.json"/&gt;
            &lt;prepend&gt;
                &lt;query&gt;query:&lt;/query&gt;
                &lt;document&gt;passage:&lt;/document&gt;
            &lt;/prepend&gt;
    &lt;/component&gt;

    &lt;!-- See https://docs.vespa.ai/en/embedding.html#colbert-embedder --&gt;
    &lt;component id="colbert_embedding_model" type="colbert-embedder"&gt;
      &lt;transformer-model url="https://huggingface.co/colbert-ir/colbertv2.0/resolve/main/model.onnx"/&gt;
      &lt;tokenizer-model url="https://huggingface.co/colbert-ir/colbertv2.0/raw/main/tokenizer.json"/&gt;
    &lt;/component&gt;

     &lt;!-- See https://docs.vespa.ai/en/reference/embedding-reference.html#huggingface-tokenizer-embedder--&gt;
    &lt;component id="tokenizer" type="hugging-face-tokenizer"&gt;
      &lt;model path="models/tokenizer.json"/&gt;
    &lt;/component&gt;

  &lt;/container&gt;

  &lt;content id="msmarco" version="1.0"&gt;
    &lt;min-redundancy&gt;1&lt;/min-redundancy&gt;
    &lt;documents&gt;
      &lt;document mode="index" type="passage"/&gt;
    &lt;/documents&gt;
    &lt;nodes deploy:environment="dev" count="1"&gt;
      &lt;resources vcpu="1.0" memory="8Gb" architecture="arm64" storage-type="local" disk="59Gb"/&gt;
    &lt;/nodes&gt; 
    &lt;engine&gt;
      &lt;proton&gt;
        &lt;tuning&gt;
          &lt;searchnode&gt;
            &lt;requestthreads&gt;
              &lt;persearch&gt;4&lt;/persearch&gt;
            &lt;/requestthreads&gt;
            &lt;feeding&gt;
              &lt;concurrency&gt;1.0&lt;/concurrency&gt;
            &lt;/feeding&gt;
          &lt;/searchnode&gt;
        &lt;/tuning&gt;
      &lt;/proton&gt;
    &lt;/engine&gt;
  &lt;/content&gt;

&lt;/services&gt;
</pre>
    </div>
    <div class="fade-overlay"></div>
  </div>
  <div class="show-btn-wrap">
    <button class="show-btn" id="vespa-btn" onclick="(function(){var s=document.getElementById('vespa-scroll');var b=document.getElementById('vespa-btn');var open=s.classList.toggle('expanded');b.classList.toggle('open',open);b.innerHTML=open?'&lt;svg width=\'14\' height=\'14\' viewBox=\'0 0 14 14\' fill=\'none\'&gt;&lt;path d=\'M2 4.5L7 9.5L12 4.5\' stroke=\'#89b4fa\' stroke-width=\'1.8\' stroke-linecap=\'round\' stroke-linejoin=\'round\'/&gt;&lt;/svg&gt; Show less':'&lt;svg width=\'14\' height=\'14\' viewBox=\'0 0 14 14\' fill=\'none\'&gt;&lt;path d=\'M2 4.5L7 9.5L12 4.5\' stroke=\'#89b4fa\' stroke-width=\'1.8\' stroke-linecap=\'round\' stroke-linejoin=\'round\'/&gt;&lt;/svg&gt; Show all'})()">
      <svg width="14" height="14" viewBox="0 0 14 14" fill="none"><path d="M2 4.5L7 9.5L12 4.5" stroke="#888" stroke-width="1.8" stroke-linecap="round" stroke-linejoin="round" /></svg>
      Show all
    </button>
  </div>
</div>

<p><br /></p>

<p>The important parts to take note of in this tutorial are the two resource specifiers in the &lt;container&gt; and &lt;content&gt; tags:</p>

<p><strong>Container</strong></p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;nodes</span> <span class="na">deploy:environment=</span><span class="s">"dev"</span> <span class="na">count=</span><span class="s">"1"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;resources</span> <span class="na">vcpu=</span><span class="s">"1.0"</span> <span class="na">memory=</span><span class="s">"8Gb"</span> <span class="na">architecture=</span><span class="s">"arm64"</span> <span class="na">storage-type=</span><span class="s">"local"</span> <span class="na">disk=</span><span class="s">"59Gb"</span><span class="nt">/&gt;</span>
<span class="nt">&lt;/nodes&gt;</span>
</code></pre></div></div>

<p><strong>Content</strong></p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;nodes</span> <span class="na">deploy:environment=</span><span class="s">"dev"</span> <span class="na">count=</span><span class="s">"1"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;resources</span> <span class="na">vcpu=</span><span class="s">"1.0"</span> <span class="na">memory=</span><span class="s">"8Gb"</span> <span class="na">architecture=</span><span class="s">"arm64"</span> <span class="na">storage-type=</span><span class="s">"local"</span> <span class="na">disk=</span><span class="s">"59Gb"</span><span class="nt">/&gt;</span>
<span class="nt">&lt;/nodes&gt;</span>
</code></pre></div></div>

<p>This is where we configure the machine resources that our Vespa application runs on in Vespa Cloud.</p>

<p><strong>NOTE: when deploying to dev we need to add the <code class="language-plaintext highlighter-rouge">&lt;nodes deploy:environment="dev"&gt;</code> specifier to ensure we actually get the resources we ask for,
otherwise we default to what is quickly available</strong>.</p>

<p>The resources per node and the number of nodes are the main parameters to tweak in order to scale your application. Right now we have provisioned the smallest amount of resources for our application.</p>

<p>Deploy the application to Vespa Cloud:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vespa deploy --wait 900
</code></pre></div></div>
<p>(It might take a little while for all services and nodes to come up and start running.)</p>

<p>You can follow the progress of the deployment from the terminal or in your tenant in your cloud console. When it is finished you should get the message:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Application up!
</code></pre></div></div>
<p>If you go to your cloud console you should be able to see your application. Note that we haven’t fed it any documents yet, so it should look something like this:</p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/1howitshouldlook.png" alt="application view in console" /></p>

<p>Let’s feed some documents. Feed the smallest dataset to Vespa using:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa feed ext/corpus_transformed_1000.jsonl
</code></pre></div></div>
<p>or, on Unix systems:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>time vespa feed ext/corpus_transformed_1000.jsonl
</code></pre></div></div>
<p>to see how long it takes.</p>

<p>It will take a few minutes as we are doing heavy computations on very modest resources.</p>

<p>If you want to see a live count of how many documents are in Vespa, you can run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa query 'yql=select * from passage where true' 'hits=0' 'ranking=unranked'
</code></pre></div></div>
<p>The number of documents processed so far appears under <code class="language-plaintext highlighter-rouge">documents</code> in the response.</p>

<p>With this lowest resource configuration, we get the following result:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa feed ext/corpus_transformed_1000.jsonl  4.96s user 7.11s system 3% cpu 5:56.05 total 
</code></pre></div></div>

<p>If we were to try to feed the whole MS MARCO dataset of 8.8 million passages on this instance, it would take more than a month to finish feeding!</p>
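
<p>As a back-of-the-envelope check of that estimate, here is a small Python sketch that extrapolates linearly from the timing above. It assumes throughput stays constant as the corpus grows, which is a simplification, and the variable names are ours.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Wall-clock time for feeding 1,000 documents, taken from the output above (5:56.05).
sample_docs = 1_000
sample_seconds = 5 * 60 + 56.05

# Approximate size of the full corpus, as stated in the text.
total_docs = 8_800_000

docs_per_second = sample_docs / sample_seconds
estimated_days = total_docs / docs_per_second / (60 * 60 * 24)

print(f"{docs_per_second:.2f} docs/s, about {estimated_days:.0f} days for the full dataset")
</code></pre></div></div>

<p>At under three documents per second, this works out to roughly 36 days.</p>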

<p>We can do better!</p>

<h1 id="scaling">Scaling</h1>

<p>Before scaling the application we’ll delete the documents from our instance so that we have a fresh start.</p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/deleteDocs.png" alt="Deleting documents" /></p>

<p>Now let’s assign more resources to our Vespa instance. From our schema we can see that we are doing extensive computations during feeding (notice the configuration in the <code class="language-plaintext highlighter-rouge">indexing</code> parameters):</p>

<p><strong>Schema</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  field e5_embedding type tensor&lt;bfloat16&gt;(x[384]) {
    # Using the e5 embedding model defined in services.xml
    indexing: input text | embed e5_embedding_model | attribute | index
    attribute {
      distance-metric: angular
    }
    index { # override default hnsw settings 
      hnsw {
        max-links-per-node: 32
        neighbors-to-explore-at-insert: 400
      } 
    }
  }

  field colbert_embeddings type tensor&lt;int8&gt;(dt{}, x[16]) {
    # No index - used for ranking, not retrieval 
    indexing: input text | embed colbert_embedding_model | attribute
    attribute: paged
  }
</code></pre></div></div>

<p>Embedding in Vespa happens in the container cluster, so it is a reasonable guess that if we can make the embeddings go faster, our whole system will be faster (later in this post we will show how to deduce scaling parameters more thoroughly). So let’s start by scaling up the resources for the <strong>container</strong> node. To see which resource configurations are available, we must look at the <a href="https://docs.vespa.ai/en/performance/instance-types/aws-instance-types.html">instance type</a> page in the documentation.
Embedding computations are best suited to run on GPUs, so we will select an instance type with a GPU:</p>

<p><strong>Container</strong></p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;nodes</span> <span class="na">deploy:environment=</span><span class="s">"dev"</span> <span class="na">count=</span><span class="s">"1"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;resources</span> <span class="na">vcpu=</span><span class="s">"4.0"</span> <span class="na">memory=</span><span class="s">"16Gb"</span> <span class="na">architecture=</span><span class="s">"x86_64"</span> <span class="na">storage-type=</span><span class="s">"local"</span> <span class="na">disk=</span><span class="s">"125Gb"</span><span class="nt">&gt;</span>
        <span class="nt">&lt;gpu</span> <span class="na">count=</span><span class="s">"1"</span> <span class="na">memory=</span><span class="s">"16.0Gb"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;/resources&gt;</span>
<span class="nt">&lt;/nodes&gt;</span>
</code></pre></div></div>

<p>Replace the resources in the container node in services.xml with the new instance type (see above). Leave the content node resources as is for now.</p>

<p>Run the command for checking document count to make sure that it is zero:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa query 'yql=select * from passage where true' 'hits=0' 'ranking=unranked'
</code></pre></div></div>
<p>and redeploy:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vespa deploy --wait 900
</code></pre></div></div>
<p>When the deployment is finished we’ll time the feeding process again.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>time vespa feed ext/corpus_transformed_1000.jsonl
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa feed ext/corpus_transformed_1000.jsonl  0.40s user 0.53s system 8% cpu 11.371 total 
</code></pre></div></div>
<p>11.4 seconds, that’s more like it! Instead of more than a month, this new instance would be able to crunch through the full dataset in a little over a day!</p>

<p>We have now significantly upgraded a part of the hardware Vespa is running on. But before we scale up further we shall take a look at the <strong>metrics</strong> tab for our application. Go to <strong>Metrics</strong> and then <strong>resources</strong>.</p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/metrics_resources.png" alt="Metrics and Resources" /></p>

<p>This is where you see the resource usage history of your Vespa instance, but most importantly it gives you a clear picture of where your application is bottlenecked. The bottleneck will differ depending on how your application is configured and the kind of computations you do. The previous 1000-line dataset was no match for the upgraded instance, so let’s give it a bigger one to get some proper bottleneck data.</p>

<p>Delete the documents from the instance again, wait a bit, and run the command to ensure that we have no documents in our application:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa query 'yql=select * from passage where true' 'hits=0' 'ranking=unranked'
</code></pre></div></div>
<p>Now we’ll feed the 50 000 line dataset to properly test and time the upgraded instance.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>time vespa feed ext/corpus_transformed_50000.jsonl
</code></pre></div></div>
<p><strong>Result</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa feed ext/corpus_transformed_50000.jsonl  17.44s user 22.68s system 11% cpu 5:56.41 total (~17.6 hours for full dataset)
</code></pre></div></div>

<p>This is a more accurate reading of the instance’s performance, and at 5min 56s to feed 50 000 documents, the full dataset would take around 17 and a half hours.</p>

<p>Open the resource metrics and set the time range to the last 30 minutes so that we can see more clearly what went on. Look at the CPU utilisation and GPU utilisation graphs. The GPU usage on the container node hit 100% and stayed there for the entire feeding process. The CPU usage on the container node peaked at 80% but levelled off at around 60%, and the content node’s CPU barely went over 50%.</p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/1g1c_container.png" alt="1 GPU 1 content - container node resources" /></p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/1g1c_gpu.png" alt="1 GPU 1 content - GPU resources" /></p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/1g1c_content.png" alt="1 GPU 1 content - content node resources" /></p>

<p>It is clear that on this Vespa instance, the bottleneck for feeding performance lies in the GPU processing. If we want to improve the feeding performance of the system, we must increase the number of GPUs in the container cluster.</p>

<p>Now that we know where the problem lies, let’s make it go faster! We’ll increase the number of GPU nodes to 5 with the <code class="language-plaintext highlighter-rouge">count="5"</code> parameter in the container node in <code class="language-plaintext highlighter-rouge">services.xml</code>.</p>

<p><strong>Container</strong></p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;nodes</span> <span class="na">deploy:environment=</span><span class="s">"dev"</span> <span class="na">count=</span><span class="s">"5"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;resources</span> <span class="na">vcpu=</span><span class="s">"4.0"</span> <span class="na">memory=</span><span class="s">"16Gb"</span> <span class="na">architecture=</span><span class="s">"x86_64"</span> <span class="na">storage-type=</span><span class="s">"local"</span> <span class="na">disk=</span><span class="s">"125Gb"</span><span class="nt">&gt;</span>
        <span class="nt">&lt;gpu</span> <span class="na">count=</span><span class="s">"1"</span> <span class="na">memory=</span><span class="s">"16.0Gb"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;/resources&gt;</span>
<span class="nt">&lt;/nodes&gt;</span>
</code></pre></div></div>
<p>Save the <code class="language-plaintext highlighter-rouge">services.xml</code> file and redeploy:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vespa deploy --wait 900
</code></pre></div></div>
<p>Now let’s feed the larger dataset:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>time vespa feed ext/corpus_transformed_500000.jsonl
</code></pre></div></div>
<p><strong>Result</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa feed ext/corpus_transformed_500000.jsonl  97.59s user 95.22s system 10% cpu 29:38.18 total (~8.8 hours for full dataset)
</code></pre></div></div>
<p>If we extrapolate the results we see we got around twice the speed of the single-container node instance. But why not 5 times the speed? Let’s look at the metrics.</p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/5g1c_container.png" alt="5 GPU 1 content - container node resources" /></p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/5g1c_gpu.png" alt="5 GPU 1 content - container GPU resources" /></p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/5g1c_content.png" alt="5 GPU 1 content - content node resources" /></p>

<p>We see that the container GPU utilization now sits comfortably at around 50% and the container CPU at around 20-30%. But the content node CPU utilization sits near 100%. The 5 container nodes with GPUs saturate the single content node’s ability to take in data. We have found the new bottleneck of the system.</p>
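<p>Reading the bottleneck off the graphs amounts to finding the most utilised resource. A tiny Python sketch using the approximate readings quoted in this post (the dictionary keys and the helper name are made up for this illustration, and ranges are replaced by their midpoints):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def bottleneck(utilisation):
    """Return the name of the most utilised resource in a {name: percent} mapping."""
    return max(utilisation, key=utilisation.get)

# Approximate utilisation readings (percent) from the two runs so far.
one_gpu_one_content = {"container GPU": 100, "container CPU": 60, "content CPU": 50}
five_gpu_one_content = {"container GPU": 50, "container CPU": 25, "content CPU": 100}

print(bottleneck(one_gpu_one_content))
print(bottleneck(five_gpu_one_content))
</code></pre></div></div>
<p>The first run points at the container GPU and the second at the content node CPU, matching what the graphs show.</p>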

<p>We’ll add some more content nodes:</p>

<p><strong>Content</strong></p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;nodes</span> <span class="na">deploy:environment=</span><span class="s">"dev"</span> <span class="na">count=</span><span class="s">"2"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;resources</span> <span class="na">vcpu=</span><span class="s">"1.0"</span> <span class="na">memory=</span><span class="s">"8Gb"</span> <span class="na">architecture=</span><span class="s">"arm64"</span> <span class="na">storage-type=</span><span class="s">"local"</span> <span class="na">disk=</span><span class="s">"59Gb"</span><span class="nt">/&gt;</span>
<span class="nt">&lt;/nodes&gt;</span> 
</code></pre></div></div>
<p>Delete the documents, redeploy, and refeed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vespa deploy --wait 900
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>time vespa feed ext/corpus_transformed_500000.jsonl
</code></pre></div></div>
<p><strong>Result</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa feed ext/corpus_transformed_500000.jsonl  92.22s user 88.21s system 17% cpu 16:51.51 total (~5.0 hours for full dataset)
</code></pre></div></div>
<p>Adding the second content node almost doubles the performance again. Look at the metrics to see what is going on.</p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/5g2c_container.png" alt="5 GPUs 2 content - container node resources" /></p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/5g2c_gpu.png" alt="5 GPUs 2 content - container GPU resources" /></p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/5g2c_content.png" alt="5 GPUs 2 content - content node resources" /></p>

<p>We see now that the container GPU (70-80%) and the content node CPU (80-90%) are both highly utilised, whilst the container node CPU is around 40%. Since we are already on the smallest instance type with a GPU, we can’t scale down the CPU to match the others, so we have actually found a near-optimal balance of container and content node resources for feeding this application.</p>

<p>Now that we have found a good balance, let’s really scale up!</p>

<h1 id="feeding-fast-20-gpus">Feeding Fast: 20 GPUs</h1>

<p>If we want serious feed throughput, we need serious hardware. Let’s scale the container and content nodes proportionately and jump to 20 GPU container nodes and 8 content nodes at the same time:</p>

<p><strong>Container</strong></p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;nodes</span> <span class="na">deploy:environment=</span><span class="s">"dev"</span> <span class="na">count=</span><span class="s">"20"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;resources</span> <span class="na">vcpu=</span><span class="s">"4.0"</span> <span class="na">memory=</span><span class="s">"16Gb"</span> <span class="na">architecture=</span><span class="s">"x86_64"</span> <span class="na">storage-type=</span><span class="s">"local"</span> <span class="na">disk=</span><span class="s">"125Gb"</span><span class="nt">&gt;</span>
        <span class="nt">&lt;gpu</span> <span class="na">count=</span><span class="s">"1"</span> <span class="na">memory=</span><span class="s">"16.0Gb"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;/resources&gt;</span>
<span class="nt">&lt;/nodes&gt;</span>
</code></pre></div></div>

<p><strong>Content</strong></p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;nodes</span> <span class="na">deploy:environment=</span><span class="s">"dev"</span> <span class="na">count=</span><span class="s">"8"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;resources</span> <span class="na">vcpu=</span><span class="s">"1.0"</span> <span class="na">memory=</span><span class="s">"8Gb"</span> <span class="na">architecture=</span><span class="s">"arm64"</span> <span class="na">storage-type=</span><span class="s">"local"</span> <span class="na">disk=</span><span class="s">"59Gb"</span><span class="nt">/&gt;</span>
<span class="nt">&lt;/nodes&gt;</span>
</code></pre></div></div>
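<p>These node counts keep the 5:2 container-to-content ratio from the previous run. As a rough Python sketch, assuming that ratio keeps holding at larger scale (the helper name is hypothetical):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math
from fractions import Fraction

# Ratio from the balanced run: 5 GPU container nodes to 2 content nodes.
CONTENT_PER_CONTAINER = Fraction(2, 5)

def content_nodes_for(container_nodes):
    """Content node count that keeps the 5:2 ratio, rounded up to a whole node."""
    return math.ceil(container_nodes * CONTENT_PER_CONTAINER)

print(content_nodes_for(20))
</code></pre></div></div>
<p>Whether the ratio really holds is something to confirm in the metrics after each change, not something to take for granted.</p>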

<p><strong>NOTE:</strong> At this point you will most likely hit the <code class="language-plaintext highlighter-rouge">quotaExceeded</code> error when you try to deploy. Vespa Cloud tenants have a default quota that prevents you from accidentally spending a lot of money. If you want to go past it, reach out to Vespa support. With the limit raised, redeploy:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vespa deploy --wait 900
</code></pre></div></div>
<p>Delete any existing documents, wait for the count to hit zero, and feed the 500 000 line dataset again:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>time vespa feed ext/corpus_transformed_500000.jsonl
</code></pre></div></div>

<p><strong>Result</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa feed ext/corpus_transformed_500000.jsonl  59.66s user 48.08s system 42% cpu 4:13.19 total (~1 hour 15 min for full dataset)
</code></pre></div></div>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/20g8c_container.png" alt="20 GPUs 8 content - container node resources" /></p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/20g8c_gpu.png" alt="20 GPUs 8 content - container GPU resources" /></p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/20g8c_content.png" alt="20 GPUs 8 content - content node resources" /></p>

<p>At an estimated 1 hour and 15 minutes for the full dataset we see that we got pretty much exactly 4x feeding speed with 4x the resources. We also see that the utilisation metrics are essentially the same as the last run (feeding at 11:30), just faster.</p>
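<p>To compare the runs on equal footing, it can help to convert each timing to documents per second. A small Python sketch using the wall-clock times reported above (the variable and function names are made up for this illustration):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># (GPU container nodes, content nodes, documents fed, wall-clock seconds)
RUNS = [
    (1, 1, 50_000, 5 * 60 + 56.41),
    (5, 1, 500_000, 29 * 60 + 38.18),
    (5, 2, 500_000, 16 * 60 + 51.51),
    (20, 8, 500_000, 4 * 60 + 13.19),
]

def throughput(docs, seconds):
    """Average feed rate in documents per second."""
    return docs / seconds

rates = [throughput(docs, seconds) for _, _, docs, seconds in RUNS]

# Print each rate together with its speedup over the previous configuration.
for (gpus, content, _, _), rate, previous in zip(RUNS, rates, [None] + rates[:-1]):
    speedup = "" if previous is None else f", {rate / previous:.1f}x the previous run"
    print(f"{gpus} GPU / {content} content: {rate:.0f} docs/s{speedup}")
</code></pre></div></div>
<p>The step-to-step ratios come out at about 2.0x, 1.8x and 4.0x, in line with the speedups described in the text.</p>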

<h1 id="feeding-furiously-100-gpus">Feeding Furiously: 100 GPUs</h1>

<p>Finally, because we can: 100 GPU container nodes and 40 content nodes, and this time we will feed the full 8.8 million passage dataset in one go.</p>

<p><strong>Container</strong></p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;nodes</span> <span class="na">deploy:environment=</span><span class="s">"dev"</span> <span class="na">count=</span><span class="s">"100"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;resources</span> <span class="na">vcpu=</span><span class="s">"4.0"</span> <span class="na">memory=</span><span class="s">"16Gb"</span> <span class="na">architecture=</span><span class="s">"x86_64"</span> <span class="na">storage-type=</span><span class="s">"local"</span> <span class="na">disk=</span><span class="s">"125Gb"</span><span class="nt">&gt;</span>
        <span class="nt">&lt;gpu</span> <span class="na">count=</span><span class="s">"1"</span> <span class="na">memory=</span><span class="s">"16.0Gb"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;/resources&gt;</span>
<span class="nt">&lt;/nodes&gt;</span>
</code></pre></div></div>

<p><strong>Content</strong></p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;nodes</span> <span class="na">deploy:environment=</span><span class="s">"dev"</span> <span class="na">count=</span><span class="s">"40"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;resources</span> <span class="na">vcpu=</span><span class="s">"1.0"</span> <span class="na">memory=</span><span class="s">"8Gb"</span> <span class="na">architecture=</span><span class="s">"arm64"</span> <span class="na">storage-type=</span><span class="s">"local"</span> <span class="na">disk=</span><span class="s">"59Gb"</span><span class="nt">/&gt;</span>
<span class="nt">&lt;/nodes&gt;</span>
</code></pre></div></div>

<p>Delete the documents, deploy, then feed the full dataset:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>time vespa feed ext/corpus_transformed_full.jsonl
</code></pre></div></div>

<p><strong>Result</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa feed ext/corpus_transformed_full.jsonl  695.31s user 605.48s system 108% cpu 20:04.23 total
</code></pre></div></div>

<p>The Vespa instance managed to process more than 8.8 million passages, with embeddings and ColBERT vectors computed for every single one, in just over 20 minutes (over a fast internet connection).</p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/100g40c_console_3.png" alt="100 GPUs 40 content - feed complete at 8.84M documents" /></p>

<p>The feed client also gives us a nice summary at the end:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"feeder.operation.count"</span><span class="p">:</span><span class="w"> </span><span class="mi">8841823</span><span class="p">,</span><span class="w">
  </span><span class="nl">"feeder.seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">1201.608</span><span class="p">,</span><span class="w">
  </span><span class="nl">"feeder.ok.count"</span><span class="p">:</span><span class="w"> </span><span class="mi">8841823</span><span class="p">,</span><span class="w">
  </span><span class="nl">"feeder.ok.rate"</span><span class="p">:</span><span class="w"> </span><span class="mf">7358.324</span><span class="p">,</span><span class="w">
  </span><span class="nl">"feeder.error.count"</span><span class="p">:</span><span class="w"> </span><span class="mi">399</span><span class="p">,</span><span class="w">
  </span><span class="nl">"http.request.count"</span><span class="p">:</span><span class="w"> </span><span class="mi">8844266</span><span class="p">,</span><span class="w">
  </span><span class="nl">"http.response.latency.millis.avg"</span><span class="p">:</span><span class="w"> </span><span class="mi">167</span><span class="p">,</span><span class="w">
  </span><span class="nl">"http.response.code.counts"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"200"</span><span class="p">:</span><span class="w"> </span><span class="mi">8841823</span><span class="p">,</span><span class="w">
    </span><span class="nl">"429"</span><span class="p">:</span><span class="w"> </span><span class="mi">2044</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>The feeding process had an average feed rate of around <strong>7358 documents per second</strong>. Now that is fast and furious!</p>
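<p>The headline rate can be re-derived from the other fields in the summary, which also shows how small the share of throttled requests was. A short Python sketch using values copied from the summary above; the helper names are made up, and reading the <code class="language-plaintext highlighter-rouge">429</code> responses as throttling follows the standard HTTP meaning of that status code (Too Many Requests):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json

# Fields copied from the feed summary above.
SUMMARY = json.loads("""
{
  "feeder.seconds": 1201.608,
  "feeder.ok.count": 8841823,
  "http.request.count": 8844266,
  "http.response.code.counts": {"200": 8841823, "429": 2044}
}
""")

def ok_rate(summary):
    """Successful operations per second over the whole feed."""
    return summary["feeder.ok.count"] / summary["feeder.seconds"]

def throttled_share(summary):
    """Fraction of HTTP requests answered with 429 (Too Many Requests)."""
    counts = summary["http.response.code.counts"]
    return counts["429"] / summary["http.request.count"]

print(f"{ok_rate(SUMMARY):.0f} docs/s, {throttled_share(SUMMARY):.4%} of requests throttled")
</code></pre></div></div>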

<h1 id="conclusion">Conclusion</h1>

<p>The best way to scale your Vespa instance is to use the metrics dashboards to see where the bottlenecks lie. There is no single best Vespa configuration, as the computational requirements depend heavily on how you define your application. Feed the instance a sizable corpus to see how it performs under sustained load, and adjust its resources accordingly.</p>
]]></content:encoded>
        <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/scaling-a-vespa-application-feeding-fast-and-furiously/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/scaling-a-vespa-application-feeding-fast-and-furiously/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        
      </item>
    
      <item>
        <title>The Vespa Cloud Metrics Dashboard</title>
        <description>A guide to the Vespa Cloud metrics dashboard — how to move from symptom to bottleneck to action, and what&apos;s new in the latest revision.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-04-24-the-vespa-cloud-metrics-dashboard/dashboard-overview.png" />
        
        <content:encoded><![CDATA[<p>When something goes wrong in production, the hard part is rarely finding a metric.
The hard part is figuring out <strong>which metric tells you where to look next</strong>.</p>

<p>The Vespa Cloud metrics dashboard is designed for exactly that.
Instead of treating monitoring as a wall of graphs, it helps you move from
<strong>symptom → bottleneck → action</strong>.</p>

<h2 id="start-with-three-questions">Start with three questions</h2>

<p>Most production issues can be reduced to three questions:</p>

<ol>
  <li><strong>Is the system healthy?</strong></li>
  <li><strong>Where is latency added?</strong></li>
  <li><strong>Are we running out of resources?</strong></li>
</ol>

<p>The dashboard mirrors that flow.</p>

<h3 id="1-is-the-system-healthy">1. Is the system healthy?</h3>

<p>Start on the <strong>Overview</strong> tab. This is the fastest place to answer “is anything
obviously broken?”. A healthy system keeps read and write QoS close to 100%.
If it drops, look at whether 4xx or 5xx responses are rising — 5xx responses
usually mean the problem is on the server side. A rise in degraded or failed queries
means it is time to continue into the Query tab.</p>

<p><img src="/assets/2026-04-24-the-vespa-cloud-metrics-dashboard/dashboard-overview.png" alt="Metrics Overview" /></p>

<p>See the docs for the full reference:
<a href="https://docs.vespa.ai/en/operations/monitoring.html#overview-tab">Metrics Overview tab</a>.</p>

<h3 id="2-where-is-latency-added">2. Where is latency added?</h3>

<p>Latency in Vespa is layered — a slow request is not just “slow”, it can be
slow in different parts of the path:</p>

<p><strong>HTTP → container → content nodes → ranking</strong></p>

<p>That is why the dashboard shows several latency metrics for what feels like
the same request. If HTTP latency is much higher than query latency,
payload size or network overhead may be the issue. If search-protocol latency
on the content nodes is high, the bottleneck is deeper in the system.</p>

<p><img src="/assets/2026-04-24-the-vespa-cloud-metrics-dashboard/dashboard-query-rate-latency.png" alt="Query rate / Latency" /></p>

<p>See the docs for a layer-by-layer walkthrough:
<a href="https://docs.vespa.ai/en/operations/monitoring.html#query-tab">Query tab</a>
and <a href="https://docs.vespa.ai/en/operations/monitoring.html#feed-tab">Feed tab</a>.</p>

<h3 id="3-are-we-running-out-of-resources">3. Are we running out of resources?</h3>

<p>Once you know where the slowdown is, switch to the <strong>Resources</strong> tab.
As a rule of thumb, sustained utilization above roughly 80% is a sign the
cluster may need more headroom. If one host is much hotter than the others,
enable per-host metrics and look for uneven load distribution.</p>

<p><img src="/assets/2026-04-24-the-vespa-cloud-metrics-dashboard/dashboard-resources.png" alt="Node Resources" /></p>

<p>See the docs for healthy-value tables and scaling guidance:
<a href="https://docs.vespa.ai/en/operations/monitoring.html#resources-tab">Resources tab</a>.</p>

<h2 id="whats-new-in-the-latest-revision">What’s new in the latest revision</h2>

<p>The dashboard has picked up a few improvements worth calling out.</p>

<h3 id="health-indicators-on-the-overview-tab">Health Indicators on the Overview tab</h3>

<p>The Overview tab now opens with a dedicated <strong>Health Indicators</strong> row —
five stat panels that surface stability issues in a single glance:
Core Dumps (1h), Restarts (1h), Feed Blocked, Content Cluster with Groups/Nodes Down,
and Container Nodes with Services Down.</p>

<p><img src="/assets/2026-04-24-the-vespa-cloud-metrics-dashboard/dashboard-health.png" alt="Health Indicators" /></p>

<p>Details and healthy values:
<a href="https://docs.vespa.ai/en/operations/monitoring.html#health-indicators">Health Indicators</a>.</p>

<h3 id="annotations-for-service-restart-and-core-dump">Annotations for Service restart and Core dump</h3>

<p><strong>Annotations</strong> are the vertical lines drawn across every chart when an
operational event happens — Vespa upgrades, feed blocked, data migration, reindexing,
autoscaling changes. Two annotations were added recently and they are worth flagging:</p>

<ul>
  <li><strong>Service restart</strong> — fires when a Vespa service process restarts.
Outside of planned upgrades, restarts usually mean a crash, OOM, or forced stop.</li>
  <li><strong>Core dump</strong> — fires when a process core-dumps. Should be extremely rare.</li>
</ul>

<p>When a latency anomaly lines up with one of these annotations,
you get the context for the change without having to infer it from the graph alone.
Both signals also feed the Overview’s Health Indicators row, so the same event
shows up in three places: the counter, the annotation line, and the Health tab’s
historical time series.</p>

<p><img src="/assets/2026-04-24-the-vespa-cloud-metrics-dashboard/dashboard-annotation.png" alt="Dashboard Annotation" /></p>

<p>Full annotation reference:
<a href="https://docs.vespa.ai/en/operations/monitoring.html#dashboard-annotations">Annotations</a>.</p>

<h3 id="container-thread-pool-rows-one-per-configuration-case">Container thread pool rows, one per configuration case</h3>

<p>The Resources tab used to have a single thread-pool row that was mostly empty —
a container only has the thread pools that match its <code class="language-plaintext highlighter-rouge">services.xml</code> configuration
(<code class="language-plaintext highlighter-rouge">&lt;search&gt;</code>, <code class="language-plaintext highlighter-rouge">&lt;document-api&gt;</code>, or both). The row has been split into three
case-specific rows:</p>

<ul>
  <li><strong>Thread Pools (search + document-api)</strong> for full-feature containers</li>
  <li><strong>Thread Pools (search only)</strong> for query-only containers</li>
  <li><strong>Thread Pools (document-api only)</strong> for feed-only containers</li>
</ul>

<p>Classification is automatic — hidden variables derive the cluster list per case
from Prometheus set operations, so only relevant rows render for a given deployment.
Each thread pool now gets its own panel with avg (green) and max (yellow dashed)
on the same chart.</p>

<p><img src="/assets/2026-04-24-the-vespa-cloud-metrics-dashboard/dashboard-thread-pools.png" alt="Dashboard Thread Pools" /></p>

<p>Details:
<a href="https://docs.vespa.ai/en/operations/monitoring.html#container-thread-pools">Container Thread Pools</a>.</p>

<h3 id="jvm-memory-breakdown-heap--direct--native">JVM memory breakdown (heap / direct / native)</h3>

<p>The Resources tab separates the three layers of container memory: <strong>heap</strong>,
<strong>direct</strong>, and <strong>native</strong>. This matters on container nodes that run embedders
or local LLM components — model weights are memory-mapped and partially resident,
but KV cache and compute buffers are allocated upfront as <strong>native</strong> memory.
When node memory is high but heap and direct look normal, the native layer
is usually where to look.</p>

<p><img src="/assets/2026-04-24-the-vespa-cloud-metrics-dashboard/dashboard-jvm.png" alt="Dashboard JVM" /></p>

<p>Details:
<a href="https://docs.vespa.ai/en/operations/monitoring.html#jvm-memory">JVM memory breakdown</a>.</p>

<h2 id="a-simple-workflow">A simple workflow</h2>

<p>A practical way to use the dashboard during an incident:</p>

<ol>
  <li>Open <strong>Overview</strong> and scan the Health Indicators row.</li>
  <li>Confirm the symptom (QoS drop, latency spike, error-rate increase).</li>
  <li>Use <strong>Query</strong> or <strong>Feed</strong> to find the slow layer.</li>
  <li>Use <strong>Resources</strong> to confirm whether the cluster is saturated.</li>
  <li>Cross-reference <strong>annotations</strong> for restarts, upgrades, reindexing, or migration.</li>
</ol>

<p>That flow gets from “latency is up” to “this is the actual bottleneck” much faster
than scanning every chart. The
<a href="https://docs.vespa.ai/en/operations/monitoring.html#dashboard-workflows">common workflows</a>
section of the docs has recipes for the most frequent scenarios.</p>

<h2 id="summary">Summary</h2>

<p>The Vespa Cloud metrics dashboard works best as a troubleshooting tool —
not a metrics catalog. Start with health, follow the latency path, confirm with
resources, and use annotations to connect spikes to real events. The tab reference,
healthy-value tables, and step-by-step workflows live in the
<a href="https://docs.vespa.ai/en/operations/monitoring.html#vespa-cloud-dashboard">Monitoring documentation</a>.</p>
]]></content:encoded>
        <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/the-vespa-cloud-metrics-dashboard/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/the-vespa-cloud-metrics-dashboard/</guid>
        
        <category>monitoring</category>
        
        <category>metrics</category>
        
        <category>performance</category>
        
        
      </item>
    
      <item>
        <title>Using Large ONNX Models with External Data in Vespa Embedders</title>
        <description>Many ONNX models exceed the 2GB protobuf limit and store weights in external data files. Vespa now supports these models for embedders.
</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-03-27-onnx-external-data-in-vespa-embedders/onnx-external-data-splash.png" />
        
        <content:encoded><![CDATA[<p>Many popular ONNX models exceed the 2 GB <a href="https://protobuf.dev/">protobuf</a> format limit and store their weights in separate external data files.
Until recently, these models could not be used directly in Vespa’s built-in embedders.</p>

<p>This was a long-requested feature on our tracker (see <a href="https://github.com/vespa-engine/vespa/issues/28761">GitHub issue #28761</a>).</p>

<h2 id="the-2-gb-limitation">The 2 GB limitation</h2>

<p><a href="https://onnx.ai/">ONNX</a> uses Google’s Protocol Buffers as its serialization format.
Protobuf has a hard limit of 2 GB on message size.
For smaller models, this is not a problem — all tensor data (the model weights) is embedded directly in the <code class="language-plaintext highlighter-rouge">.onnx</code> file,
making it self-contained.</p>

<p>As models grow larger, they inevitably hit this limitation.
For a model exceeding 2 GB, ONNX tooling splits it into two parts:</p>

<ul>
  <li>A small <strong><code class="language-plaintext highlighter-rouge">.onnx</code> file</strong> containing the model graph structure (typically a few hundred KB to a few MB).</li>
  <li>One or more <strong>external data files</strong> (commonly named <code class="language-plaintext highlighter-rouge">.onnx_data</code>) containing the actual tensor weights.</li>
</ul>

<p>Note that reduced-precision variants of these models (INT8, FP16, etc.) are often small enough to fit in a single self-contained <code class="language-plaintext highlighter-rouge">.onnx</code> file.
The external data split primarily affects the full-precision versions.</p>

<p>Previously, if you pointed a Vespa embedder at a model with external data files, ONNX Runtime would fail to load it
because the data files were not available alongside the model file.</p>

<h2 id="what-changed">What changed</h2>

<p>Vespa embedders now automatically handle ONNX models with external data files.
When you configure an embedder with a URL pointing to an <code class="language-plaintext highlighter-rouge">.onnx</code> file,
Vespa inspects the model to check whether it references any external data files.
If it does, Vespa downloads those files automatically before loading the model.</p>

<p>This feature is available starting from Vespa 8.544.</p>

<h2 id="how-to-use-it">How to use it</h2>

<p>Here is an example using EmbeddingGemma 300M, which uses external data:</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;container</span> <span class="na">id=</span><span class="s">"default"</span> <span class="na">version=</span><span class="s">"1.0"</span><span class="nt">&gt;</span>
  <span class="nt">&lt;component</span> <span class="na">id=</span><span class="s">"gemma"</span> <span class="na">type=</span><span class="s">"hugging-face-embedder"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;transformer-model</span>
      <span class="na">url=</span><span class="s">"https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX/resolve/main/onnx/model.onnx"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;tokenizer-model</span>
      <span class="na">url=</span><span class="s">"https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX/resolve/main/tokenizer.json"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;max-tokens&gt;</span>2048<span class="nt">&lt;/max-tokens&gt;</span>
    <span class="nt">&lt;prepend&gt;</span>
      <span class="nt">&lt;query&gt;</span>task: search result | query: <span class="nt">&lt;/query&gt;</span>
      <span class="nt">&lt;document&gt;</span>title: none | text: <span class="nt">&lt;/document&gt;</span>
    <span class="nt">&lt;/prepend&gt;</span>
  <span class="nt">&lt;/component&gt;</span>
<span class="nt">&lt;/container&gt;</span>
</code></pre></div></div>

<p>If you are deploying to <a href="https://cloud.vespa.ai/">Vespa Cloud</a>, you can also use models from the
<a href="https://docs.vespa.ai/en/rag/model-hub.html">Vespa Model Hub</a> that use external data.
For example, the Multilingual-E5-large model (will be available on Vespa Cloud 8.668+):</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;container</span> <span class="na">id=</span><span class="s">"default"</span> <span class="na">version=</span><span class="s">"1.0"</span><span class="nt">&gt;</span>
  <span class="nt">&lt;component</span> <span class="na">id=</span><span class="s">"e5"</span> <span class="na">type=</span><span class="s">"hugging-face-embedder"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;transformer-model</span> <span class="na">model-id=</span><span class="s">"multilingual-e5-large"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;max-tokens&gt;</span>512<span class="nt">&lt;/max-tokens&gt;</span>
    <span class="nt">&lt;prepend&gt;</span>
      <span class="nt">&lt;query&gt;</span>query: <span class="nt">&lt;/query&gt;</span>
      <span class="nt">&lt;document&gt;</span>passage: <span class="nt">&lt;/document&gt;</span>
    <span class="nt">&lt;/prepend&gt;</span>
  <span class="nt">&lt;/component&gt;</span>
<span class="nt">&lt;/container&gt;</span>
</code></pre></div></div>

<p>This works with our ONNX-based embedders:</p>

<ul>
  <li><a href="https://docs.vespa.ai/en/embedding.html#huggingface-embedder"><code class="language-plaintext highlighter-rouge">hugging-face-embedder</code></a></li>
  <li><a href="https://docs.vespa.ai/en/embedding.html#colbert-embedder"><code class="language-plaintext highlighter-rouge">colbert-embedder</code></a></li>
  <li><a href="https://docs.vespa.ai/en/embedding.html#splade-embedder"><code class="language-plaintext highlighter-rouge">splade-embedder</code></a></li>
</ul>

<p>It’s also possible to use <a href="https://docs.vespa.ai/en/reference/rag/embedding.html#private-model-hub">private models</a> — authentication tokens are propagated when downloading external data files.</p>

<h2 id="current-limitations">Current limitations</h2>

<p>There are a few constraints to be aware of:</p>

<ul>
  <li>
    <p><strong>Embedders only.</strong> Models used directly in <a href="https://docs.vespa.ai/en/ranking/onnx.html">ranking expressions</a>
must still be self-contained and under 2 GB.</p>
  </li>
  <li>
    <p><strong>URL-referenced or Model Hub models only.</strong> Models bundled in the
<a href="https://docs.vespa.ai/en/application-packages.html">application package</a>
using the <code class="language-plaintext highlighter-rouge">path</code> attribute do not support external data.
Models referenced via <code class="language-plaintext highlighter-rouge">url</code> or <code class="language-plaintext highlighter-rouge">model-id</code> (Vespa Cloud) are supported.</p>
  </li>
  <li>
    <p><strong>External data files must be co-located with the model.</strong>
The external data files are resolved relative to the model URL.
They must be in the same directory (or a subdirectory) as the <code class="language-plaintext highlighter-rouge">.onnx</code> file.</p>
  </li>
</ul>
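
<p>To make the co-location rule concrete, here is a minimal Python sketch of resolving an external data file relative to the model URL. It illustrates the rule only and is not Vespa’s implementation; the file names <code class="language-plaintext highlighter-rouge">model.onnx_data</code> and <code class="language-plaintext highlighter-rouge">weights/part0.bin</code> and the helper <code class="language-plaintext highlighter-rouge">resolve_external_data</code> are assumptions made for the example.</p>

```python
import posixpath
from urllib.parse import urljoin, urlsplit

MODEL_URL = "https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX/resolve/main/onnx/model.onnx"

def resolve_external_data(model_url, relative_location):
    """Resolve an external data location against the URL of the .onnx file."""
    resolved = urljoin(model_url, relative_location)
    model_parts = urlsplit(model_url)
    resolved_parts = urlsplit(resolved)
    model_dir = posixpath.dirname(model_parts.path) + "/"
    # Reject locations on another host or outside the model's directory
    # (for example "../other.bin").
    same_host = resolved_parts.netloc == model_parts.netloc
    if not same_host or not resolved_parts.path.startswith(model_dir):
        raise ValueError("external data must live beside or below the model file")
    return resolved

# Hypothetical file name; the real name is whatever the .onnx file references.
data_url = resolve_external_data(MODEL_URL, "model.onnx_data")
print(data_url)
```

<p>A location in a subdirectory, such as <code class="language-plaintext highlighter-rouge">weights/part0.bin</code>, resolves the same way, while a location pointing above the model directory is rejected.</p>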

<p>See the <a href="https://docs.vespa.ai/en/ranking/onnx.html#limitations-on-model-size-and-complexity">ONNX model documentation</a>
for the full list of requirements.</p>

<p>If you need more extensive support for ONNX models with external data — for example in ranking expressions —
feel free to <a href="https://github.com/vespa-engine/vespa/issues">file an issue</a>.</p>
]]></content:encoded>
        <pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/onnx-external-data-in-vespa-embedders/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/onnx-external-data-in-vespa-embedders/</guid>
        
        <category>embedding</category>
        
        <category>onnx</category>
        
        
      </item>
    
      <item>
        <title>Asymmetric Retrieval: Spend on Docs, Embed your Queries for Free</title>
        <description>Documents are embedded once — worth the spend for maximum quality. Queries hit you on every request. This is what drives your cost at scale. Asymmetric retrieval with Voyage AI and Vespa. Real numbers, real config.
</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-03-10-asymmetric-retrieval-spend-on-docs-queries-for-free/hero.png" />
        
        <content:encoded><![CDATA[<p>At 10,000 queries per second with ~30-token queries, you’re pushing ~18 million tokens per minute through your embedding API. At $0.02 per million tokens, that’s <strong>over $15,000/month</strong> — just for query embeddings. Documents are embedded once. Queries are embedded forever.</p>

<p>What if you could drop that to $0?</p>

<p>That’s the promise of <strong>asymmetric retrieval</strong>: embed your documents with the best model money can buy, then embed queries with a tiny model running locally — for free. Voyage AI’s new <a href="https://blog.voyageai.com/2026/01/15/voyage-4/">voyage-4 family</a> is the first to make this practical, and Vespa now has native support for it.</p>

<h2 id="the-asymmetric-insight">The asymmetric insight</h2>

<p>The conventional approach is to use the same embedding model for documents and queries. Same model, same vector space. But it ignores a fundamental asymmetry.</p>

<p>Document embedding is a <strong>one-time cost</strong>. You embed each document once at indexing time, and it’s not latency-sensitive — whether it takes 10ms or 500ms doesn’t matter because no user is waiting. You can throw the biggest, most accurate model at it and take your time.</p>

<p>Query embedding is the opposite. It’s on the <strong>critical path of every single request</strong>, continuously, at scale. It needs to be fast, and at 10K QPS the cost dwarfs everything else.</p>

<p>Why use the same model for both?</p>

<p>Asymmetric retrieval splits these two concerns:</p>

<ol>
  <li><strong>Documents</strong> — Embed once with <code class="language-plaintext highlighter-rouge">voyage-4-large</code>. Best accuracy, API-based, no rush.</li>
  <li><strong>Queries</strong> — Embed continuously with <code class="language-plaintext highlighter-rouge">voyage-4-nano</code>. Tiny, local, free.</li>
</ol>

<p>This works because all four models in the Voyage 4 family — <code class="language-plaintext highlighter-rouge">voyage-4-large</code>, <code class="language-plaintext highlighter-rouge">voyage-4</code>, <code class="language-plaintext highlighter-rouge">voyage-4-lite</code>, and <code class="language-plaintext highlighter-rouge">voyage-4-nano</code> — produce <strong>compatible embeddings in a shared vector space</strong>.</p>

<p><img src="/assets/2026-03-10-asymmetric-retrieval-spend-on-docs-queries-for-free/asymmetric-embeddings.png" alt="Asymmetric retrieval: documents embedded with voyage-4-large via API, queries embedded with voyage-4-nano locally" /></p>

<p>It also means you can upgrade your query model independently. Start with <code class="language-plaintext highlighter-rouge">voyage-4-nano</code> for cost, move to <code class="language-plaintext highlighter-rouge">voyage-4-lite</code> for quality — without re-embedding a single document.</p>

<p>The shared embedding space opens up document-side flexibility too. In a multi-tenant system, you could use different models for different tiers — <code class="language-plaintext highlighter-rouge">voyage-4-large</code> for premium customers who need the best retrieval quality, <code class="language-plaintext highlighter-rouge">voyage-4-lite</code> for cost-sensitive tenants — all searchable with the same query model. Same index, same query path, different quality/cost tradeoffs per tenant.</p>

<h2 id="the-numbers">The numbers</h2>

<h3 id="cost">Cost</h3>

<p>Let’s be concrete about the 10K QPS scenario:</p>

<ul>
  <li>10,000 QPS × 30 tokens = 300,000 tokens/sec</li>
  <li>300,000 × 60 × 60 × 24 × 30 = ~777 billion tokens/month</li>
  <li>At $0.02/1M tokens ≈ <strong>$15,500/month</strong> for query embeddings via API</li>
</ul>
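
<p>The same arithmetic as a small Python sketch, so you can plug in your own traffic. The 30-token query length, the $0.02 price, and the 30-day month are taken from the scenario above; the function name is just for illustration.</p>

```python
SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # 30-day month, as in the scenario above

def monthly_query_embedding_cost(qps, tokens_per_query=30, usd_per_million_tokens=0.02):
    """Monthly API cost in USD for embedding every query."""
    tokens_per_month = qps * tokens_per_query * SECONDS_PER_MONTH
    return tokens_per_month / 1_000_000 * usd_per_million_tokens

for qps in (10, 10_000, 100_000):
    print(f"{qps} QPS: ${monthly_query_embedding_cost(qps):,.0f}/month")
```

<p>Because the cost is linear in QPS, a tenfold increase in traffic means a tenfold increase in the API bill.</p>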

<p>With <code class="language-plaintext highlighter-rouge">voyage-4-nano</code> running locally on the Vespa container: <strong>$0/month</strong>. The model runs as part of the serving infrastructure you’re already paying for.</p>

<h3 id="latency">Latency</h3>

<p>API calls add network round-trips. Local inference on <code class="language-plaintext highlighter-rouge">voyage-4-nano</code> runs in single-digit milliseconds on CPU.</p>

<h3 id="quality">Quality</h3>

<p>Voyage 4 is state-of-the-art. On the <a href="https://blog.voyageai.com/2026/01/15/voyage-4/">RTEB benchmark</a> (29 retrieval datasets, NDCG@10), <code class="language-plaintext highlighter-rouge">voyage-4-large</code> beats the competition:</p>

<style>
  table, th, td {
    border: 1px solid black;
  }
  th, td {
    padding: 5px;
  }
</style>

<table>
  <thead>
    <tr>
      <th>Comparison</th>
      <th>Improvement</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>vs. Gemini Embedding 001</td>
      <td>+3.87%</td>
    </tr>
    <tr>
      <td>vs. Cohere Embed v4</td>
      <td>+8.20%</td>
    </tr>
    <tr>
      <td>vs. OpenAI v3 Large</td>
      <td>+14.05%</td>
    </tr>
  </tbody>
</table>

<p><br />
And asymmetric retrieval — querying with a smaller model against <code class="language-plaintext highlighter-rouge">voyage-4-large</code> document embeddings — preserves retrieval quality across medical, code, web, finance, and legal domains.</p>

<h3 id="storage">Storage</h3>

<p>Binary quantization gives you a <strong>16x memory reduction</strong> over bfloat16 — 2048-dim vectors go from 4,096 bytes to 256 bytes. The full-precision floats are still used for second-phase reranking, <a href="https://docs.vespa.ai/en/content/attributes.html#paged-attributes-disadvantages">paged from disk</a> only when needed. For a deeper dive on quantization tradeoffs, see <a href="https://blog.vespa.ai/embedding-tradeoffs-quantified/">Embedding Tradeoffs, Quantified</a>.</p>
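
<p>As a rough illustration of where the 16x comes from, the Python sketch below packs the sign bit of each dimension into signed bytes. It is a simplified stand-in for what a binarizing embedder does, not Vespa’s code; the bit order (most significant bit first) and the toy input vector are assumptions, and the vector length is assumed to be a multiple of 8.</p>

```python
def binarize(vector):
    """Pack sign bits (1 for positive values) into signed int8 values, 8 dims per byte."""
    packed = []
    for start in range(0, len(vector), 8):
        byte = 0
        for value in vector[start:start + 8]:
            byte = byte * 2 + (1 if value > 0 else 0)  # most significant bit first
        packed.append(byte - 256 if byte >= 128 else byte)  # reinterpret as int8
    return packed

dims = 2048
floats = [(-1.0) ** i for i in range(dims)]  # toy vector: +1, -1, +1, ...
bits = binarize(floats)

bfloat16_bytes = dims * 2  # 2 bytes per dimension
binary_bytes = len(bits)   # 1 bit per dimension, 8 dimensions per byte
print(bfloat16_bytes, binary_bytes, bfloat16_bytes // binary_bytes)  # prints: 4096 256 16
```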

<h2 id="why-this-matters-at-scale">Why this matters at scale</h2>

<p>Cost and quality are table stakes. The real question for large-scale systems is: does this work in production?</p>

<h3 id="independent-scaling">Independent scaling</h3>

<p>Vespa separates stateless containers (where embedding runs) from content clusters (where data lives). This means you can scale query embedding capacity independently from storage. Need more QPS? Add container nodes. More documents? Add content nodes. They don’t interfere.</p>

<h3 id="no-external-api-on-the-query-path">No external API on the query path</h3>

<p>This is the underrated benefit. With asymmetric retrieval, the query embedding model runs locally inside Vespa — your critical search path has zero dependency on an external API.</p>

<p>That matters when:</p>

<ul>
  <li><strong>The API goes down.</strong> Every embedding API has outages. If your query path depends on one, your search goes down with it.</li>
  <li><strong>You get rate-limited.</strong> Traffic spikes don’t care about your API quota. A sudden 3x in query volume means dropped requests — or queued requests that blow your latency budget.</li>
  <li><strong>You need to scale fast.</strong> Adding Vespa container nodes takes minutes. Negotiating a higher API rate limit may take days. On <a href="https://docs.vespa.ai/en/cloud/autoscaling.html">Vespa Cloud</a>, autoscaling handles traffic spikes automatically, since container clusters are stateless and scale up quickly.</li>
</ul>

<p>Keeping the query path self-contained turns your search system from “works when everything is up” into “works, period.”</p>

<h3 id="two-phase-ranking">Two-phase ranking</h3>

<p>Binary vectors are fast — Vespa can do ~1 billion hamming distance calculations per second. But binary quantization loses precision. Vespa’s <a href="https://docs.vespa.ai/en/ranking/phased-ranking.html">phased ranking</a> recovers it:</p>

<ol>
  <li><strong>First phase</strong>: Hamming distance on binary embeddings. Fast, cheap, scans the full index.</li>
  <li><strong>Second phase</strong>: Float dot-product on the top 2,000 candidates. Accurate, but only touches a bounded set of vectors paged from disk.</li>
</ol>

<p>This gives you the speed of binary search with the accuracy of full-precision reranking.</p>
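
<p>The idea can be sketched in a few lines of plain Python. This is a toy model of the two phases over an in-memory list, not how Vespa executes ranking; the function names and the tuple layout of <code class="language-plaintext highlighter-rouge">docs</code> are made up for the example.</p>

```python
def hamming(a, b):
    """Number of differing bits between two equal-length lists of packed int8 bytes."""
    return sum(bin((x ^ y) % 256).count("1") for x, y in zip(a, b))

def dot(a, b):
    """Dot product of two equal-length float vectors."""
    return sum(x * y for x, y in zip(a, b))

def two_phase_search(q_bin, q_float, docs, rerank_count=2000, hits=10):
    """docs: list of (doc_id, packed_bits, float_vector) tuples."""
    # First phase: cheap hamming distance over every document.
    first = sorted(docs, key=lambda d: hamming(q_bin, d[1]))[:rerank_count]
    # Second phase: exact dot product, only on the surviving candidates.
    second = sorted(first, key=lambda d: dot(q_float, d[2]), reverse=True)
    return [doc_id for doc_id, _, _ in second[:hits]]

docs = [
    ("near-weak", [0b00001111], [0.2, 0.0]),
    ("near-strong", [0b00001110], [0.9, 0.0]),
    ("far", [-16], [1.0, 0.0]),  # -16 is 0b11110000 read as int8
]
result = two_phase_search([0b00001111], [1.0, 0.0], docs, rerank_count=2, hits=2)
print(result)
```

<p>Note the tradeoff in the toy data: the document that is far away in hamming space never reaches the second phase, whatever its float score. That is why the rerank count should be comfortably larger than the number of hits you return.</p>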

<h3 id="enterprise-proven">Enterprise-proven</h3>

<p>This isn’t theoretical. Vespa runs search and recommendation at Spotify, Yahoo, and Perplexity — billions of documents, thousands of QPS, sub-100ms latency. The architecture handles it.</p>

<h2 id="how-to-set-this-up">How to set this up</h2>

<p>Here’s the complete Vespa configuration for asymmetric retrieval with Voyage AI.</p>

<h3 id="schema">Schema</h3>

<p>Two embedding fields — binary for fast retrieval, float for accurate reranking:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>schema doc {
  document doc {
    field id type string {
      indexing: summary | attribute
    }
    field text type string {
      indexing: index | summary
    }
  }

  field embedding_float type tensor&lt;bfloat16&gt;(x[2048]) {
    indexing: input text | embed voyage-4-large | attribute
    attribute {
      distance-metric: prenormalized-angular
      paged
    }
  }

  field embedding_binary type tensor&lt;int8&gt;(x[256]) {
    indexing: input text | embed voyage-4-large | attribute
    attribute {
      distance-metric: hamming
    }
  }
}
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">paged</code> attribute on <code class="language-plaintext highlighter-rouge">embedding_float</code> tells Vespa to keep these vectors on disk, paging them into memory only during second-phase reranking. The binary embeddings stay in memory for fast first-phase retrieval.</p>

<h3 id="embedders-servicesxml">Embedders (services.xml)</h3>

<p>Two embedders — one API-based for documents, one local for queries:</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;container</span> <span class="na">id=</span><span class="s">"default"</span> <span class="na">version=</span><span class="s">"1.0"</span><span class="nt">&gt;</span>
  <span class="nt">&lt;component</span> <span class="na">id=</span><span class="s">"voyage-4-large"</span> <span class="na">type=</span><span class="s">"voyage-ai-embedder"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;model&gt;</span>voyage-4-large<span class="nt">&lt;/model&gt;</span>
    <span class="nt">&lt;api-key-secret-ref&gt;</span>apiKey<span class="nt">&lt;/api-key-secret-ref&gt;</span>
    <span class="nt">&lt;dimensions&gt;</span>2048<span class="nt">&lt;/dimensions&gt;</span>
    <span class="nt">&lt;batching</span> <span class="na">max-size=</span><span class="s">"20"</span> <span class="na">max-delay=</span><span class="s">"20ms"</span><span class="nt">/&gt;</span>
  <span class="nt">&lt;/component&gt;</span>

  <span class="nt">&lt;component</span> <span class="na">id=</span><span class="s">"voyage-4-nano"</span> <span class="na">type=</span><span class="s">"hugging-face-embedder"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;transformer-model</span> <span class="na">model-id=</span><span class="s">"voyage-4-nano-int8"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;tokenizer-model</span> <span class="na">model-id=</span><span class="s">"voyage-4-nano-vocab"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;max-tokens&gt;</span>32768<span class="nt">&lt;/max-tokens&gt;</span>
    <span class="nt">&lt;pooling-strategy&gt;</span>mean<span class="nt">&lt;/pooling-strategy&gt;</span>
    <span class="nt">&lt;normalize&gt;</span>true<span class="nt">&lt;/normalize&gt;</span>
    <span class="nt">&lt;prepend&gt;</span>
      <span class="nt">&lt;query&gt;</span>Represent the query for retrieving supporting documents: <span class="nt">&lt;/query&gt;</span>
    <span class="nt">&lt;/prepend&gt;</span>
  <span class="nt">&lt;/component&gt;</span>
<span class="nt">&lt;/container&gt;</span>
</code></pre></div></div>

<p>The <a href="https://docs.vespa.ai/en/rag/embedding.html#voyageai-embedder"><code class="language-plaintext highlighter-rouge">voyage-ai-embedder</code></a> handles vector quantization automatically — it infers the target precision from the destination tensor type. bfloat16 fields get full-precision embeddings; int8 fields get binary representations.</p>

<p>The <code class="language-plaintext highlighter-rouge">hugging-face-embedder</code> runs <code class="language-plaintext highlighter-rouge">voyage-4-nano</code> locally. No API calls, no rate limits, no cost. Both model references (<code class="language-plaintext highlighter-rouge">voyage-4-nano-int8</code>, <code class="language-plaintext highlighter-rouge">voyage-4-nano-vocab</code>) resolve via the <a href="https://docs.vespa.ai/en/rag/model-hub.html">Vespa Model Hub</a>.</p>

<p><strong>A note on “quantization” — two different things.</strong> The <code class="language-plaintext highlighter-rouge">voyage-4-nano-int8</code> in the <code class="language-plaintext highlighter-rouge">model-id</code> refers to <strong>model weight quantization</strong>: the ONNX model file uses INT8 weights instead of FP32, which makes inference 2-3x faster on CPU with negligible quality loss. This is about how the <em>model itself</em> is stored and executed. The embedder still produces full-precision float vectors as output. <strong>Vector quantization</strong> is a separate concern — it’s about the precision of the <em>output embeddings</em> you store and search over (bfloat16, int8/binary, etc.). That’s controlled by the tensor type in your schema field, not the model format. These are independent knobs: you can run an INT8-quantized model that outputs float vectors, then store them as binary. For a deeper dive with benchmarks on both, see <a href="https://blog.vespa.ai/embedding-tradeoffs-quantified/">Embedding Tradeoffs, Quantified</a>.</p>

<h3 id="rank-profile">Rank profile</h3>

<p>Two-phase ranking: hamming distance first, float reranking second:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rank-profile binary-with-rerank {
  inputs {
    query(q_float) tensor&lt;float&gt;(x[2048])
    query(q_bin) tensor&lt;int8&gt;(x[256])
  }

  function binary_closeness() {
    expression: 1 - (distance(field, embedding_binary) / 2048)
  }

  function float_closeness() {
    expression: reduce(query(q_float) * attribute(embedding_float), sum, x)
  }

  first-phase {
    expression: binary_closeness
  }

  second-phase {
    expression: float_closeness
    rerank-count: 2000
  }
}
</code></pre></div></div>

<h3 id="querying">Querying</h3>

<p>Both query tensors are produced by the local <code class="language-plaintext highlighter-rouge">voyage-4-nano</code> embedder:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>yql=select * from doc where {targetHits: 100}nearestNeighbor(embedding_binary, q_bin)
&amp;ranking=binary-with-rerank
&amp;input.query(q_bin)=embed(voyage-4-nano, "your query here")
&amp;input.query(q_float)=embed(voyage-4-nano, "your query here")
&amp;hits=10
</code></pre></div></div>
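
<p>If you send the query from application code, the same parameters can be sent as a JSON body to the query API. The Python sketch below only builds that body; the helper name is hypothetical, and it assumes the query text contains no double quotes.</p>

```python
import json

def build_query(text, target_hits=100, hits=10):
    """Build a JSON-serializable query body from the parameters shown above."""
    # Assumes `text` has no double quotes; escape or reject them in real code.
    embed = f'embed(voyage-4-nano, "{text}")'
    yql = (
        "select * from doc where "
        f"{{targetHits: {target_hits}}}nearestNeighbor(embedding_binary, q_bin)"
    )
    return {
        "yql": yql,
        "ranking": "binary-with-rerank",
        "input.query(q_bin)": embed,
        "input.query(q_float)": embed,
        "hits": hits,
    }

body = json.dumps(build_query("your query here"))
print(body)
```

<p>Both inputs use the same <code class="language-plaintext highlighter-rouge">embed(...)</code> expression; the declared tensor types in the rank profile decide which one ends up binary and which one stays float.</p>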

<p>The <a href="https://docs.vespa.ai/en/nearest-neighbor-search.html">nearest neighbor search</a> runs on the binary field for speed, while the rank profile handles two-phase scoring.</p>

<p>For a complete runnable example with pyvespa, see the <a href="https://vespa-engine.github.io/pyvespa/examples/voyage-ai-embeddings-cloud.html">Voyage AI embeddings notebook</a>.</p>

<h2 id="wrapping-up">Wrapping up</h2>

<p>Asymmetric retrieval makes the most sense when:</p>

<ul>
  <li><strong>High QPS</strong> — The cost savings scale linearly. At 10K QPS, you’re saving $15.5K/month. At 100K QPS, it’s $155K.</li>
  <li><strong>Large corpus</strong> — Documents are embedded once, so the large model cost is amortized. The bigger the corpus, the more you benefit from cheap queries.</li>
  <li><strong>Latency-sensitive</strong> — Local inference eliminates network round-trips.</li>
</ul>

<p>When a single model is the better choice:</p>

<ul>
  <li><strong>Low volume and latency-tolerant</strong> — At 10 QPS, the API cost is ~$15/month and the network round-trip doesn’t matter. One model is simpler to operate.</li>
  <li><strong>Quality above all else</strong> — Using <code class="language-plaintext highlighter-rouge">voyage-4-large</code> for both documents and queries gives you the best possible retrieval quality. If you can afford the API cost and latency, symmetric with the top model is hard to beat.</li>
</ul>

<p>The Voyage 4 family and Vespa’s native integration make asymmetric retrieval practical for the first time. Embed documents with the best model available, query with a tiny local model, and let phased ranking close the quality gap.</p>

<p><strong>Resources:</strong></p>

<ul>
  <li><a href="https://vespa-engine.github.io/pyvespa/examples/voyage-ai-embeddings-cloud.html">Voyage AI embeddings notebook</a> — Full runnable example</li>
  <li><a href="https://docs.vespa.ai/en/embedding.html">Embedding documentation</a> — Configuring embedders in Vespa</li>
  <li><a href="https://docs.vespa.ai/en/rag/binarizing-vectors.html">Binary quantization guide</a> — Deep dive on binarization</li>
  <li><a href="https://docs.vespa.ai/en/ranking/phased-ranking.html">Phased ranking</a> — Multi-phase ranking architecture</li>
  <li><a href="https://blog.voyageai.com/2026/01/15/voyage-4/">Voyage 4 announcement</a> — Model family details and benchmarks</li>
</ul>

<p>For those interested in learning more about Vespa, join the <a href="https://vespatalk.slack.com/">Vespa community on Slack</a> to exchange ideas,
seek assistance from the community, or stay in the loop on the latest Vespa developments.</p>
]]></content:encoded>
        <pubDate>Tue, 10 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/asymmetric-retrieval-spend-on-docs-queries-for-free/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/asymmetric-retrieval-spend-on-docs-queries-for-free/</guid>
        
        <category>embedding</category>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>voyage-ai</category>
        
        
      </item>
    
      <item>
        <title>How Metal AI Built an Agent-Driven Intelligence Platform on Vespa Cloud</title>
        <description>How Metal built an AI-Native Intelligence Platform on Vespa.ai, where 95% of retrieval is handled by AI agents.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-03-10-metal-case-study-agent-driven-intelligence-on-vespa-cloud/MetalxVespa.png" />
        
        <content:encoded><![CDATA[<blockquote>
  <p>“95% of our retrieval is done by AI agents.” - Sergio Prada, Co-Founder &amp; CTO, Metal</p>
</blockquote>

<p>Metal needed a retrieval foundation that could evolve as fast as their product, without hitting a wall.</p>

<h2 id="introduction">Introduction</h2>

<p>Private equity firms manage vast amounts of unstructured data, including deal documents, expert call transcripts, financial statements, CRM records, and more. The challenge isn’t simply accessing this information. It’s connecting and understanding it, in context, across the investment lifecycle.</p>

<p><a href="https://www.metal.ai/?utm_source=chatgpt.com">Metal AI</a> was built to address this challenge. Its purpose-built institutional intelligence platform, used by established private equity firms, transforms fragmented historical and live deal data into a living system of record that drives conviction at every stage of the investment lifecycle.</p>

<p>To deliver this vision at scale, Metal leverages <a href="http://vespa.ai">Vespa.ai</a> as its core retrieval layer, powering entity relationships, advanced ranking, and real-time context-aware retrieval across complex investment data.</p>

<h2 id="the-need-for-relationship-driven-retrieval">The Need for Relationship-Driven Retrieval</h2>

<p>As Metal’s product evolved, the limitations of traditional retrieval systems became clear.</p>

<p>Metal’s early architecture supported basic document search, but private equity workflows aren’t document-centric. They are entity- and relationship-driven. The enduring edge in private equity lies in drawing on decades of deal history, portfolio outcomes, and institutional knowledge. When that depth of experience surfaces reasoning and connections across time, every investment decision carries greater conviction.</p>

<p>Most traditional vector stores and search engines are fundamentally document-first. They index text, return similar passages, and rely primarily on semantic similarity or keyword matching. But for Metal’s use case, relevance requires more:</p>

<ul>
  <li>
    <p>Understanding which answer is the most recent and legally approved</p>
  </li>
  <li>
    <p>Identifying which company a metric belongs to</p>
  </li>
  <li>
    <p>Connecting meetings to prior diligence activity</p>
  </li>
  <li>
    <p>Applying business logic alongside semantic similarity</p>
  </li>
</ul>

<p>As Metal introduced more advanced workflows, like DDQ automation and agent-driven retrieval, the gap widened. Traditional systems struggle to:</p>

<ul>
  <li>
    <p>Combine semantic similarity with recency and compliance rules within ranking</p>
  </li>
  <li>
    <p>Support evolving data models without significant rework</p>
  </li>
  <li>
    <p>Query across multiple object types in a unified way</p>
  </li>
  <li>
    <p>Serve as a foundation for structured, iterative queries issued by AI agents</p>
  </li>
</ul>

<p>Layering custom logic on top of limited retrieval infrastructure would have created increasing technical debt, and each new entity type or ranking rule risked architectural compromise.</p>

<p>Metal needed a retrieval foundation that could evolve with the product, not constrain it.</p>

<h2 id="choosing-a-retrieval-layer-without-limits">Choosing a Retrieval Layer without Limits</h2>

<p>Metal wasn’t simply selecting a search engine. They were selecting a long-term retrieval architecture.</p>

<p>Several capabilities distinguished Vespa:</p>

<ul>
  <li>
    <p><strong>Multi-entity modeling:</strong> Vespa supports multiple object types, like documents, people, activities, and financial data, as well as the relationships between them. This aligned with how Metal structures institutional knowledge.</p>
  </li>
  <li>
    <p><strong>Advanced ranking and filtering:</strong> Vespa can combine semantic similarity with structured filters like recency and business rules, enabling Metal to tailor retrieval to specific workflows.</p>
  </li>
  <li>
    <p><strong>Flexibility without re-architecture:</strong> New object types can be introduced without migrating existing data or rebuilding the system.</p>
  </li>
  <li>
    <p><strong>Operational simplicity:</strong> Moving to Vespa Cloud enabled the team to focus engineering capacity on product innovation instead of infrastructure.</p>
  </li>
</ul>

<p>These capabilities give Metal the ability to shape retrieval around business logic, rather than forcing business logic to adapt to infrastructure limitations.</p>

<blockquote>
  <p>“Our competitors focus on documents. With Vespa, we can focus on the
full picture: companies, people, activities, and how they relate.” -
Sergio Prada, Co-Founder &amp; CTO, Metal</p>
</blockquote>

<h2 id="architecture-in-action">Architecture in Action</h2>

<p>Metal treats retrieval as part of an AI agent orchestration layer, not just a standard search box.</p>

<p>When a user or agent asks a question like, “What’s this company’s EBITDA?”, the query is first interpreted by an AI agent. Rather than issuing a single plain-text search, the agent:</p>

<ul>
  <li>
    <p>Determines which entity types to query (documents, companies, metrics, activities)</p>
  </li>
  <li>
    <p>Applies structured parameters such as recency or workflow-specific filters</p>
  </li>
  <li>
    <p>Executes retrieval against Vespa</p>
  </li>
  <li>
    <p>Iterates as needed (paginating, refining, or querying related entities)</p>
  </li>
  <li>
    <p>Assembles sufficient context before generating a response</p>
  </li>
</ul>

<p>Vespa powers this retrieval layer, enabling fast, structured queries across different object types and supporting the iterative retrieval process required by Metal’s agent-driven architecture.</p>

<h2 id="turning-ddq-chaos-into-structured-approved-intelligence">Turning DDQ Chaos into Structured, Approved Intelligence</h2>

<p>One clear example is Metal’s Due Diligence Questionnaire (DDQ) workflow. Private equity firms must respond to thousands of LP questionnaires using pre-approved answers. These responses cannot be freely generated by an LLM. They must come from content that has already been reviewed and approved by legal teams.</p>

<p>Answer banks change over time and are stored in unstructured formats like documents and spreadsheets. Metal indexes this data into Vespa, making the system aware of which documents are most recent. When answering a questionnaire, retrieval is prioritized not only by semantic similarity to the question but also by freshness.</p>

<p>This allows Metal to surface the most relevant, up-to-date approved answers efficiently and reliably within its platform.</p>
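
<p>One way to picture ranking by both similarity and freshness is a weighted blend with exponential decay. The following is a minimal sketch, not Metal’s actual ranking: the half-life, the weight, and the two sample answers are hypothetical. In Vespa, logic like this would typically be expressed in a rank profile rather than in client code.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

def freshness(age_days: float, half_life_days: float = 90.0):
    """Exponential decay: 1.0 for a brand-new document, 0.5 after one half-life."""
    return math.pow(0.5, age_days / half_life_days)

def blended_score(similarity: float, age_days: float, freshness_weight: float = 0.3):
    """Weighted mix of semantic similarity and freshness, both assumed to lie in [0, 1]."""
    return (1.0 - freshness_weight) * similarity + freshness_weight * freshness(age_days)

# Two hypothetical approved answers: the older one matches slightly better.
candidates = [
    {"id": "answer-2023", "similarity": 0.82, "age_days": 720},
    {"id": "answer-2025", "similarity": 0.80, "age_days": 30},
]
ranked = sorted(
    candidates,
    key=lambda c: blended_score(c["similarity"], c["age_days"]),
    reverse=True,
)
print([c["id"] for c in ranked])  # ['answer-2025', 'answer-2023']
</code></pre></div></div>

<p>With these assumed values, the recent answer wins even though the older one is a marginally closer semantic match.</p>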

<h2 id="scaling-without-infrastructure-headaches">Scaling without Infrastructure Headaches</h2>

<p>By building on <a href="https://vespa.ai/solutions/vespa-cloud/">Vespa Cloud</a>, Metal achieved:</p>

<ul>
  <li>
    <p>Improved feature velocity: The team can introduce new entity types and workflows quickly without architectural rework.</p>
  </li>
  <li>
    <p>Greater engineering focus: The team spends less time managing infrastructure and more time building differentiating product features.</p>
  </li>
  <li>
    <p>Scalable retrieval architecture: Metal can onboard new clients and data volumes without redesigning retrieval.</p>
  </li>
  <li>
    <p>Confidence in long-term flexibility: Vespa is not a limiting factor as Metal expands into more advanced agent-driven workflows.</p>
  </li>
</ul>

<blockquote>
  <p>“Managing infrastructure can be a distraction. Vespa Cloud lets us focus on product.” - Sergio Prada, Co-Founder &amp; CTO, Metal</p>
</blockquote>

<h2 id="looking-forward-build-for-an-agentic-future">Looking Forward: Build for an Agentic Future</h2>

<p>Metal’s roadmap is deeply agentic. AI agents drive most interactions, deciding how best to query the platform and construct the context needed to answer sophisticated questions.</p>

<p>Because Vespa supports flexible, multi-entity retrieval with advanced ranking and real-time performance, Metal can:</p>

<ul>
  <li>
    <p>Expand into more advanced analysis workflows</p>
  </li>
  <li>
    <p>Build deeper relational structures between entities</p>
  </li>
  <li>
    <p>Adapt retrieval strategies dynamically as business logic evolves</p>
  </li>
</ul>

<p>The result is an institutional intelligence platform that scales in both data volume and intelligence, evolving alongside the firm it serves.</p>

<blockquote>
  <p>“When you’re building something ambitious, you don’t want to hit a capability wall. Vespa gives us confidence that we won’t.” - Sergio Prada, Co-Founder &amp; CTO, Metal</p>
</blockquote>

]]></content:encoded>
        <pubDate>Tue, 10 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/agent-driven-intelligence-on-vespa-cloud/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/agent-driven-intelligence-on-vespa-cloud/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        
      </item>
    
      <item>
        <title>Build a High-Quality RAG App on Vespa Cloud in 15 Minutes</title>
        <description>Retrieval-Augmented Generation (RAG) allows an LLM to answer questions using your data at query time. On their own, LLMs are powerful but limited: they can hallucinate, they have a fixed knowledge cutoff, and they know nothing about your private documents, internal wikis, or proprietary systems.
</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/illustration_2.png" />
        
        <content:encoded><![CDATA[<p><strong>Retrieval-Augmented Generation (RAG)</strong> allows an LLM to answer questions using your data at query time. On their own, LLMs are powerful but limited: they can hallucinate, they have a fixed knowledge cutoff, and they know nothing about your private documents, internal wikis, or proprietary systems.</p>

<p>RAG bridges that gap by retrieving relevant information from your data and supplying it to the model as context, so responses are grounded in real, trusted sources rather than guesswork.</p>

<h2 id="the-challenge-the-quality-of-the-context-window">The Challenge: The Quality of the Context Window</h2>

<p>In RAG, the real bottleneck is the LLM’s context window. You can’t simply pass your entire dataset into a prompt, because there’s a strict token budget.</p>

<p>Because of this, the problem isn’t just retrieving information, but retrieving the right information. When the context window is filled with loosely matched or low-quality results, the LLM has little to work with and the quality of its answers drops accordingly.</p>

<p>High-quality RAG depends on semantic understanding, precise retrieval, and strong ranking across diverse data types so that every token in the context window earns its place.</p>
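
<p>The token budget can be illustrated with a small greedy packer. This is a sketch under stated assumptions: <code class="language-plaintext highlighter-rouge">pack_context</code> is a hypothetical helper, the chunks are assumed to arrive already ranked best-first, and tokens are approximated by whitespace-separated words rather than a real tokenizer.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def pack_context(ranked_chunks: list, token_budget: int):
    """Greedily keep the best-ranked chunks that still fit in the token budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:  # assumed sorted best-first
        cost = len(chunk.split())  # crude whitespace token estimate
        if used + cost &gt; token_budget:
            continue  # this chunk does not fit; a smaller one further down might
        selected.append(chunk)
        used += cost
    return selected, used

chunks = [
    "alpha beta gamma",
    "delta epsilon zeta eta theta",
    "iota kappa",
]
selected, used = pack_context(chunks, token_budget=5)
print(selected, used)  # ['alpha beta gamma', 'iota kappa'] 5
</code></pre></div></div>

<p>Because the budget is spent in rank order, a weak chunk near the top displaces a useful one further down, which is why ranking quality matters as much as recall.</p>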

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/illustration_2.png" alt="illustration_2" /></p>

<h2 id="the-solution-out-of-the-box-rag-on-vespa-cloud">The Solution: Out-of-the-Box RAG on Vespa Cloud</h2>

<p>Vespa Cloud provides an out-of-the-box Vespa <a href="https://docs.vespa.ai/en/examples/rag-blueprint.html">RAG Blueprint</a> designed to maximize the quality of the context sent to the LLM. Instead of relying solely on nearest-neighbor vector search, Vespa combines semantic vector retrieval with lexical BM25 scoring and applies advanced ranking (using models such as BERT, LightGBM, or custom logic) to ensure that only the strongest candidates are selected.</p>

<p>This hybrid retrieval and ranking approach consistently surfaces the most relevant document chunks, which significantly improves the quality of the final generated answer.</p>

<p>In this blog post, we’ll build a complete RAG application end to end using the out-of-the-box Vespa RAG app on Vespa Cloud. The following diagram shows the architecture we’ll be working with:</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/architecture_diagram.png" alt="Vespa RAG Architecture" /></p>

<p>The architecture consists of two main flows: data ingestion and query processing.</p>

<p><strong>Data Ingestion (one-time setup)</strong></p>

<p>First, we ingest our data sources, such as documents, PDFs, or web pages, using a Python-based pipeline. The pipeline processes the data, splits it into manageable chunks, generates embeddings, and feeds everything into a Vespa Cloud RAG application that is preconfigured with a schema and ranking profiles. This step populates the search index.</p>

<p><strong>Query Flow (live interaction)</strong></p>

<ol>
  <li>
    <p>A user enters a question in the <strong>Vespa RAG UI</strong>.</p>
  </li>
  <li>
    <p>The UI sends the query to a <strong>Python backend</strong>, which issues a hybrid search request (combining keyword and vector retrieval) to <strong>Vespa Cloud</strong>.</p>
  </li>
  <li>
    <p><strong>Vespa Cloud</strong> returns the most relevant document chunks.</p>
  </li>
  <li>
    <p>The backend sends those chunks, along with the original query, to an <strong>LLM</strong> as context.</p>
  </li>
  <li>
    <p>The model generates an answer grounded in that context and returns it to the backend.</p>
  </li>
  <li>
    <p>The backend streams the answer back to the UI.</p>
  </li>
</ol>

<p>This architecture ensures that generated responses are grounded in your own data, combining Vespa’s retrieval and ranking strengths with the generative capabilities of large language models.</p>
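
<p>The query flow above can be sketched in a few lines of Python. This is an illustration rather than the backend’s actual code: <code class="language-plaintext highlighter-rouge">answer_question</code> is a hypothetical function, and the Vespa Cloud query and the LLM call are passed in as plain callables so the sketch runs without a network connection.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def answer_question(question: str, search, generate, top_k: int = 3):
    """Retrieve chunks, build a grounded prompt, and ask the LLM.

    `search` and `generate` are injected callables standing in for the
    Vespa Cloud query and the LLM call.
    """
    chunks = search(question)[:top_k]
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return generate(prompt)

# Stubs so the sketch runs offline.
def fake_search(question):
    return ["Vespa combines BM25 and vector search.", "RAG grounds answers in your data."]

def fake_generate(prompt):
    return "stub answer for: " + prompt.splitlines()[-1]

print(answer_question("What does Vespa combine?", fake_search, fake_generate))
# stub answer for: Question: What does Vespa combine?
</code></pre></div></div>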

<p>The end-to-end setup takes about 15 minutes, plus additional time to process your documents.</p>

<hr />

<h2 id="deploy-vespa-rag-blueprint-to-vespa-cloud">Deploy Vespa RAG Blueprint to Vespa Cloud</h2>

<p>We’ll start by deploying a preconfigured RAG Blueprint to Vespa Cloud. This gives you a high-quality retrieval stack in minutes, and it’s free to get started. All of this is done directly from the Vespa Cloud console.</p>

<p><strong>Sign up for Vespa Cloud</strong></p>

<p>Go to the <a href="https://console.vespa-cloud.com/">Vespa Cloud Console</a> and create an account. If this is your first time using Vespa Cloud, the free trial is the fastest way to get going.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_1.png" alt="image_1" /></p>

<p><strong>Deploy RAG Blueprint</strong></p>

<p>In the console, select <strong>“Deploy your first application”</strong>.
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_2.png" alt="image_2" /></p>

<p>Choose <strong>“Select a sample application to deploy directly from the browser”</strong>.
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_3.png" alt="image_3" /></p>

<p>Select <strong>“RAG Blueprint”</strong>.
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_4.png" alt="image_4" /></p>

<p>Click <strong>“Deploy”</strong> and wait for the deployment to complete.
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_5.png" alt="image_5" />
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_8.png" alt="image_8" /></p>

<p><strong>Save your credentials</strong></p>

<p>Once deployment finishes, the console will generate an access token. <strong>Save this immediately.</strong>
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_9.png" alt="image_9" /></p>

<p>That token is how the Python backend authenticates with Vespa Cloud. Treat it like a password.</p>

<p>Continue through the remaining setup screens, then open the application view.
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_10.png" alt="image_10" />
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_11.png" alt="image_11" />
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_12.png" alt="image_12" />
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_13.png" alt="image_13" /></p>

<p><strong>Note your endpoint URL</strong></p>

<p>In the application view you will also find the endpoint URL. Save both the <strong>endpoint URL</strong> and the token; you will need them to configure the Python backend in the next section.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_15.png" alt="image_15" /></p>

<p>If you like, you can download the Vespa application package by clicking the download icon and use it as the starting point for your own data feeding pipeline, frontend, and more. This post, however, uses a sample end-to-end RAG application that already includes the same application package, so there’s no need to download it separately.</p>

<h2 id="behind-the-scenes-what-you-just-deployed">Behind the Scenes: What You Just Deployed</h2>

<p>When you clicked <strong>Deploy</strong>, Vespa Cloud automatically provisioned infrastructure and deployed a complete <strong>Vespa application package</strong>. This package includes everything needed for a high-quality RAG system: schemas, indexing logic, ranking profiles, and service configuration.</p>

<p>In other words, you didn’t just spin up a demo; you launched a ready-to-use, high-quality retrieval engine.</p>

<p>Let’s take a closer look at what’s inside.</p>

<h3 id="the-schema">The Schema</h3>

<p>The RAG Blueprint uses a carefully designed schema that controls how documents are stored, chunked, embedded, and retrieved:</p>

<p><code class="language-plaintext highlighter-rouge">vespa_cloud/schemas/doc.sd</code>:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">schema</span> <span class="n">doc</span> <span class="o">{</span>
    <span class="n">document</span> <span class="n">doc</span> <span class="o">{</span>
        <span class="n">field</span> <span class="n">id</span> <span class="n">type</span> <span class="n">string</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">summary</span> <span class="o">|</span> <span class="n">attribute</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">title</span> <span class="n">type</span> <span class="n">string</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">index</span> <span class="o">|</span> <span class="n">summary</span>
            <span class="nl">index:</span> <span class="n">enable</span><span class="o">-</span><span class="n">bm25</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">text</span> <span class="n">type</span> <span class="n">string</span> <span class="o">{</span>
        <span class="o">}</span>

        <span class="err">#</span> <span class="nc">Optional</span> <span class="n">metadata</span> <span class="n">fields</span> <span class="k">for</span> <span class="n">tracking</span> <span class="n">document</span> <span class="n">usage</span>
        <span class="n">field</span> <span class="n">created_timestamp</span> <span class="n">type</span> <span class="kt">long</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">summary</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">modified_timestamp</span> <span class="n">type</span> <span class="kt">long</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">summary</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">last_opened_timestamp</span> <span class="n">type</span> <span class="kt">long</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">summary</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">open_count</span> <span class="n">type</span> <span class="kt">int</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">summary</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">favorite</span> <span class="n">type</span> <span class="n">bool</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">summary</span>
        <span class="o">}</span>
    <span class="o">}</span>

    <span class="err">#</span> <span class="nc">Binary</span> <span class="n">quantized</span> <span class="n">embeddings</span> <span class="k">for</span> <span class="n">the</span> <span class="nf">title</span> <span class="o">(</span><span class="mi">768</span> <span class="n">floats</span> <span class="err">→</span> <span class="mi">96</span> <span class="n">int8</span><span class="o">)</span>
    <span class="n">field</span> <span class="n">title_embedding</span> <span class="n">type</span> <span class="n">tensor</span><span class="o">&lt;</span><span class="n">int8</span><span class="o">&gt;(</span><span class="n">x</span><span class="o">[</span><span class="mi">96</span><span class="o">])</span> <span class="o">{</span>
        <span class="nl">indexing:</span> <span class="n">input</span> <span class="n">title</span> <span class="o">|</span> <span class="n">embed</span> <span class="o">|</span> <span class="n">pack_bits</span> <span class="o">|</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">index</span>
        <span class="n">attribute</span> <span class="o">{</span>
            <span class="n">distance</span><span class="o">-</span><span class="nl">metric:</span> <span class="n">hamming</span>
        <span class="o">}</span>
    <span class="o">}</span>

    <span class="err">#</span> <span class="nc">Automatically</span> <span class="n">chunks</span> <span class="n">text</span> <span class="n">into</span> <span class="mi">1024</span><span class="o">-</span><span class="n">character</span> <span class="n">segments</span>
    <span class="n">field</span> <span class="n">chunks</span> <span class="n">type</span> <span class="n">array</span><span class="o">&lt;</span><span class="n">string</span><span class="o">&gt;</span> <span class="o">{</span>
        <span class="nl">indexing:</span> <span class="n">input</span> <span class="n">text</span> <span class="o">|</span> <span class="n">chunk</span> <span class="n">fixed</span><span class="o">-</span><span class="n">length</span> <span class="mi">1024</span> <span class="o">|</span> <span class="n">summary</span> <span class="o">|</span> <span class="n">index</span>
        <span class="nl">index:</span> <span class="n">enable</span><span class="o">-</span><span class="n">bm25</span>
    <span class="o">}</span>

    <span class="err">#</span> <span class="nc">Binary</span> <span class="n">quantized</span> <span class="n">embeddings</span> <span class="k">for</span> <span class="n">each</span> <span class="n">chunk</span>
    <span class="n">field</span> <span class="n">chunk_embeddings</span> <span class="n">type</span> <span class="n">tensor</span><span class="o">&lt;</span><span class="n">int8</span><span class="o">&gt;(</span><span class="n">chunk</span><span class="o">{},</span> <span class="n">x</span><span class="o">[</span><span class="mi">96</span><span class="o">])</span> <span class="o">{</span>
        <span class="nl">indexing:</span> <span class="n">input</span> <span class="n">text</span> <span class="o">|</span> <span class="n">chunk</span> <span class="n">fixed</span><span class="o">-</span><span class="n">length</span> <span class="mi">1024</span> <span class="o">|</span> <span class="n">embed</span> <span class="o">|</span> <span class="n">pack_bits</span> <span class="o">|</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">index</span>
        <span class="n">attribute</span> <span class="o">{</span>
            <span class="n">distance</span><span class="o">-</span><span class="nl">metric:</span> <span class="n">hamming</span>
        <span class="o">}</span>
    <span class="o">}</span>

    <span class="n">fieldset</span> <span class="k">default</span> <span class="o">{</span>
        <span class="nl">fields:</span> <span class="n">title</span><span class="o">,</span> <span class="n">chunks</span>
    <span class="o">}</span>

    <span class="n">document</span><span class="o">-</span><span class="n">summary</span> <span class="n">top_3_chunks</span> <span class="o">{</span>
        <span class="n">from</span><span class="o">-</span><span class="n">disk</span>
        <span class="n">summary</span> <span class="n">chunks_top3</span> <span class="o">{</span>
            <span class="nl">source:</span> <span class="n">chunks</span>
            <span class="n">select</span><span class="o">-</span><span class="n">elements</span><span class="o">-</span><span class="nl">by:</span> <span class="n">top_3_chunk_sim_scores</span>
        <span class="o">}</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p><strong>What’s happening here:</strong> Your documents store their raw content in <code class="language-plaintext highlighter-rouge">title</code> and <code class="language-plaintext highlighter-rouge">text</code>. During indexing, the <code class="language-plaintext highlighter-rouge">text</code> field is automatically split into 1024-character chunks. Embeddings are generated for both titles and chunks, then binary-quantized using <code class="language-plaintext highlighter-rouge">pack_bits</code>, shrinking 768 floating-point values down to just 96 <code class="language-plaintext highlighter-rouge">int8</code>s. This dramatically reduces storage and improves performance while still supporting efficient vector similarity search.</p>

<p>At the same time, BM25 is enabled for lexical matching. This combination is what enables Vespa’s hybrid retrieval: semantic matching plus exact term relevance.</p>
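
<p>To make the 768-to-96 reduction concrete, here is a minimal pure-Python sketch of binary quantization followed by Hamming distance. It assumes that positive values map to bit 1 and that bits are packed most significant bit first. <code class="language-plaintext highlighter-rouge">quantize</code> and <code class="language-plaintext highlighter-rouge">hamming</code> are illustrative helpers, not Vespa APIs; in the schema above the real work happens inside the indexing pipeline. Assuming 4-byte floats, 768 values occupy 3,072 bytes, so 96 bytes is a 32x reduction.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

def quantize(values: list):
    """Pack each group of 8 floats into one int8 (bit = 1 where the value is positive)."""
    packed = []
    for start in range(0, len(values), 8):  # length assumed to be a multiple of 8
        byte = 0
        for value in values[start:start + 8]:
            byte = byte * 2 + (1 if value &gt; 0 else 0)  # most significant bit first
        packed.append(byte - 256 if byte &gt; 127 else byte)  # reinterpret as signed int8
    return packed

def hamming(a: list, b: list):
    """Number of differing bits between two packed vectors."""
    return sum(bin((x ^ y) % 256).count("1") for x, y in zip(a, b))

embedding = [math.sin(i) for i in range(768)]  # stand-in for a real 768-float embedding
packed = quantize(embedding)
print(len(embedding), len(packed))  # 768 96
print(hamming(quantize([1.0, -1.0] * 4), quantize([1.0] * 8)))  # 4
</code></pre></div></div>

<p>Hamming distance only needs an XOR and a bit count per byte, which is why comparing packed vectors is so cheap compared with floating-point dot products.</p>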

<p><strong>Out-of-the-Box Query Profiles:</strong></p>

<p>The RAG Blueprint ships with four query profiles optimized for the client-side RAG architecture of NyRAG, the ingestion and chat tool set up later in this post:</p>

<p><strong>NyRAG Architecture:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>User Query → NyRAG (generates search queries)
          → Vespa (retrieval + ranking)
          → NyRAG (generates final answer)
</code></pre></div></div>
<p>Query profiles control <strong>only the Vespa retrieval/ranking step</strong>. NyRAG handles all LLM interactions.</p>

<p><strong>The 4 Profiles:</strong></p>

<ol>
  <li><strong>hybrid</strong> (default, fast)
    <ul>
      <li><strong>Retrieval:</strong> BM25 + Vector search with <code class="language-plaintext highlighter-rouge">targetHits:100</code></li>
      <li><strong>Ranking:</strong> Learned linear model (logistic regression)</li>
      <li><strong>Best for:</strong> Everyday queries where you want fast, solid results</li>
    </ul>
  </li>
  <li><strong>hybrid-with-gbdt</strong> (highest quality)
    <ul>
      <li><strong>Retrieval:</strong> Same as hybrid (BM25 + Vector, 100 targets)</li>
      <li><strong>Ranking:</strong> Two-phase with LightGBM (GBDT) second-phase</li>
      <li><strong>Best for:</strong> Complex queries where relevance matters most (~2-3x slower)</li>
    </ul>
  </li>
  <li><strong>deepresearch</strong> (exhaustive search)
    <ul>
      <li><strong>Retrieval:</strong> BM25 + Vector with <code class="language-plaintext highlighter-rouge">targetHits:10000</code> (100x more!)</li>
      <li><strong>Ranking:</strong> Learned linear model</li>
      <li><strong>Best for:</strong> Research scenarios needing maximum recall</li>
    </ul>
  </li>
  <li><strong>deepresearch-with-gbdt</strong> (exhaustive + best quality)
    <ul>
      <li><strong>Retrieval:</strong> Deep search (10k targets)</li>
      <li><strong>Ranking:</strong> Two-phase with GBDT</li>
      <li><strong>Best for:</strong> When you need both maximum recall and best ranking</li>
    </ul>
  </li>
</ol>
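
<p>The two-phase idea behind the <code class="language-plaintext highlighter-rouge">-with-gbdt</code> profiles can be sketched generically: score every candidate with a cheap function, then re-score only the top few with a costlier model. The code below is a conceptual illustration, not how Vespa implements it. The feature values, the linear weights, and the <code class="language-plaintext highlighter-rouge">precise</code> field standing in for a GBDT score are all made up.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def two_phase_rank(candidates: list, first_phase, second_phase, rerank_count: int = 2):
    """Score everything cheaply, then re-score only the top few with the costly model."""
    by_first = sorted(candidates, key=first_phase, reverse=True)
    head, tail = by_first[:rerank_count], by_first[rerank_count:]
    reranked = sorted(head, key=second_phase, reverse=True)
    return reranked + tail

def linear_score(doc):
    """Cheap first phase: a hypothetical learned linear combination."""
    return 0.4 * doc["bm25"] + 0.6 * doc["sim"]

def precise_score(doc):
    """Stand-in for an expensive second-phase model such as a GBDT."""
    return doc["precise"]

docs = [
    {"id": "a", "bm25": 0.9, "sim": 0.2, "precise": 0.95},
    {"id": "b", "bm25": 0.5, "sim": 0.7, "precise": 0.60},
    {"id": "c", "bm25": 0.1, "sim": 0.3, "precise": 0.99},
]
print([d["id"] for d in two_phase_rank(docs, linear_score, precise_score)])
# ['a', 'b', 'c']
</code></pre></div></div>

<p>Note that document <code class="language-plaintext highlighter-rouge">c</code> stays last despite its high second-phase score: it never reaches the rerank window, which is the trade-off that keeps the expensive model affordable.</p>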

<blockquote>
  <p><strong>For Advanced Users:</strong> Query profiles bundle complete search configurations including YQL structure (with <code class="language-plaintext highlighter-rouge">nearestNeighbor</code> operators), ranking profiles, and all required parameters (like learned coefficients). The Vespa application also includes <code class="language-plaintext highlighter-rouge">rag</code> and <code class="language-plaintext highlighter-rouge">rag-with-gbdt</code> profiles with <code class="language-plaintext highlighter-rouge">searchChain=openai</code> for <strong>server-side RAG</strong> (direct API usage), but these conflict with NyRAG’s client-side architecture, so they are not among the four profiles listed above. Learn more in the <a href="https://github.com/vespaai-playground/RAG-app-in-15min-ragblueprint#ranking-profiles">technical guide</a>.</p>
</blockquote>

<p><strong>Which profile should you use?</strong></p>
<ul>
  <li>Start with <strong><code class="language-plaintext highlighter-rouge">hybrid</code></strong> for everyday use - fast and accurate</li>
  <li>Switch to <strong><code class="language-plaintext highlighter-rouge">hybrid-with-gbdt</code></strong> when quality matters most (harder queries)</li>
  <li>Use <strong><code class="language-plaintext highlighter-rouge">deepresearch</code></strong> when you need to find everything relevant (research mode)</li>
  <li>Try <strong><code class="language-plaintext highlighter-rouge">deepresearch-with-gbdt</code></strong> for maximum recall + quality (slowest but most thorough)</li>
</ul>
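
<p>If you want to call a profile directly rather than through the UI, a request could be assembled roughly as follows. This sketch only builds the request and does not send it. It assumes the profiles read the user’s text from the <code class="language-plaintext highlighter-rouge">query</code> parameter and that the token is sent as a bearer token; <code class="language-plaintext highlighter-rouge">build_search_request</code> is a hypothetical helper, and the endpoint and token are placeholders.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json
import urllib.request

PROFILES = ("hybrid", "hybrid-with-gbdt", "deepresearch", "deepresearch-with-gbdt")

def build_search_request(endpoint: str, token: str, question: str, profile: str = "hybrid"):
    """Build (but do not send) a POST to the /search/ API for the given query profile."""
    if profile not in PROFILES:
        raise ValueError(f"unknown query profile: {profile}")
    body = {"query": question, "queryProfile": profile, "hits": 5}
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/search/",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {token}"},
        method="POST",
    )

request = build_search_request(
    "https://your-app.vespa-cloud.com",  # placeholder endpoint
    "vespa_cloud_YOUR_TOKEN_HERE",       # placeholder token
    "What are the main topics in these documents?",
    profile="hybrid-with-gbdt",
)
print(request.full_url)                          # https://your-app.vespa-cloud.com/search/
print(json.loads(request.data)["queryProfile"])  # hybrid-with-gbdt
</code></pre></div></div>

<p>Passing the request to <code class="language-plaintext highlighter-rouge">urllib.request.urlopen</code> would execute it against a live deployment.</p>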

<hr />

<p>Now that your RAG Blueprint Vespa Cloud application is up and running, it’s time to add the missing pieces: a simple frontend UI and a data ingestion pipeline. For this, we’ll use <strong>NyRAG</strong>, a tool included in the <a href="https://github.com/vespaai-playground/RAG-app-in-15min-ragblueprint"><code class="language-plaintext highlighter-rouge">RAG-app-in-15min-ragblueprint</code></a> repository.</p>

<p>NyRAG acts as the glue for the entire RAG workflow. It reads documents from local files or websites, splits text into manageable chunks, generates embeddings, feeds everything into Vespa, and finally exposes a lightweight chat UI where you can ask questions over your data. Instead of wiring all of this together yourself, NyRAG gives you a working end-to-end system out of the box.</p>

<h3 id="install-nyrag">Install NyRAG</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Clone the repository</span>
git clone https://github.com/vespaai-playground/RAG-app-in-15min-ragblueprint.git
<span class="nb">cd </span>RAG-app-in-15min-ragblueprint

<span class="c"># Install uv (Fast, modern Python package manager)</span>
<span class="c"># macOS</span>
brew <span class="nb">install </span>uv

<span class="c"># Linux &amp; macOS</span>
<span class="c"># curl -LsSf https://astral.sh/uv/install.sh | sh</span>
<span class="c"># Windows (PowerShell)</span>
<span class="c"># powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"</span>

<span class="c"># Verify uv installation</span>
uv <span class="nt">--version</span>

<span class="c"># Install dependencies using uv</span>
uv <span class="nb">sync</span>
<span class="nb">source</span> .venv/bin/activate

<span class="c"># Windows (PowerShell)</span>
<span class="c"># powershell -ExecutionPolicy Bypass</span>
<span class="c"># . .\.venv\Scripts\activate</span>

<span class="c"># Install nyrag locally</span>
uv pip <span class="nb">install</span> <span class="nt">-e</span> <span class="nb">.</span>

<span class="c"># Verify nyrag installation</span>
nyrag <span class="nt">--help</span>
</code></pre></div></div>

<p><strong>Get an LLM API key</strong></p>

<p>To generate final answers, NyRAG needs an OpenAI-compatible API key. The simplest way to get started is <strong>OpenRouter</strong>, which provides access to multiple LLMs through a single API.</p>

<p>In this walkthrough, we’ll use OpenRouter for convenience. In a real application, you’re free to swap in any compatible LLM provider. To continue, sign up for OpenRouter and generate an API key. You’ll use it in the next step when configuring NyRAG.</p>

<hr />

<h3 id="start-the-nyrag-ui">Start the NyRAG UI</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># This script handles all configuration automatically</span>
./run_nyrag.sh

<span class="c"># Windows (PowerShell)</span>
<span class="c"># powershell -ExecutionPolicy Bypass</span>
<span class="c"># .\run_nyrag.ps1</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">run_nyrag.sh</code> script starts the UI and wires up the configuration so NyRAG can talk to Vespa Cloud. In practice, it loads your project config, uses the token you provide for authentication, and starts the web UI on port 8000.</p>

<p>Open http://localhost:8000 in your browser.</p>

<p><strong>Configure your project:</strong>
Now you’ll configure your project using the web UI to connect to your Vespa Cloud deployment and set up document processing.</p>

<p><strong>Step 1: Select and edit the example project</strong></p>

<p>In the top header, the project dropdown shows <strong>“doc_example”</strong>. If you are starting from the example config, it is usually pre-selected. The configuration editor typically opens automatically; if it does not (for example you land directly in chat), open the three-dot menu (⋮) and choose <strong>“Edit Config”</strong>.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/nyrag_7.png" alt="Project selector dropdown with &quot;doc_example&quot; highlighted" />
<strong>Description</strong>: Shows the project dropdown menu in the header with “doc_example” option</p>

<blockquote>
  <p><strong>Note:</strong> If the configuration editor doesn’t appear (shows chat interface instead), click the <strong>three-dot menu</strong> (⋮) in the top right corner and select <strong>“Edit Config”</strong> to open it manually.</p>
</blockquote>

<p><strong>Step 2: Update your credentials</strong></p>

<p>In the configuration editor, paste in the information you saved from Vespa Cloud and your LLM provider. You only need three things to get going: your Vespa tenant name, your Vespa endpoint + token, and your LLM API key.</p>

<p><strong>Required fields to update:</strong></p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Your Vespa Cloud credentials (from Vespa Cloud Console)</span>
<span class="na">cloud_tenant</span><span class="pi">:</span> <span class="s">your-tenant</span>          <span class="c1"># Your Vespa Cloud tenant name</span>
<span class="na">vespa_cloud</span><span class="pi">:</span>
  <span class="na">endpoint</span><span class="pi">:</span> <span class="s">https://your-app.vespa-cloud.com</span>  <span class="c1"># Your Vespa token endpoint (not mtls)</span>
  <span class="na">token</span><span class="pi">:</span> <span class="s">vespa_cloud_YOUR_TOKEN_HERE</span>          <span class="c1"># Your Vespa data plane token</span>

<span class="c1"># Your LLM configuration (default: OpenRouter)</span>
<span class="na">llm_config</span><span class="pi">:</span>
  <span class="na">api_key</span><span class="pi">:</span> <span class="s">sk-or-v1-YOUR_KEY_HERE</span>   <span class="c1"># Your OpenRouter API key (or other provider)</span>
</code></pre></div></div>

<p><strong>Notes:</strong></p>

<p>The default LLM provider is OpenRouter. If you switch providers, also update <code class="language-plaintext highlighter-rouge">base_url</code> and <code class="language-plaintext highlighter-rouge">model</code> to match. For the included example documents, <code class="language-plaintext highlighter-rouge">start_loc</code> defaults to <code class="language-plaintext highlighter-rouge">./dataset</code>, so you can run the pipeline without changing anything else.</p>

<p><strong>Step 3: Save and start processing</strong></p>

<p>After updating the configuration, you can close the editor (changes are saved automatically) and start indexing. If you are using the example dataset, keep <code class="language-plaintext highlighter-rouge">./dataset</code> as-is; otherwise, point <code class="language-plaintext highlighter-rouge">start_loc</code> at the folder (or site) you want to ingest. When you click <strong>“Start Indexing”</strong>, NyRAG reads your input, chunks it into 1024-character segments, generates embeddings, feeds everything to Vespa Cloud, and shows progress in the terminal panel so you can see exactly what is happening.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/nyrag_10.png" alt="Processing progress with terminal logs" />
<strong>Description</strong>: Shows documents being processed with terminal logs displaying progress</p>
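
<p>To make the chunking step concrete, here is a minimal sketch of fixed-size character chunking. It assumes plain, non-overlapping 1024-character windows; the helper name <code class="language-plaintext highlighter-rouge">chunk_text</code> is hypothetical, and NyRAG's own chunker may handle overlap or sentence boundaries differently.</p>

```python
def chunk_text(text: str, chunk_size: int = 1024) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    # Step through the text in fixed-size windows; the last chunk may be shorter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


document = "x" * 2500
chunks = chunk_text(document)
print([len(c) for c in chunks])  # [1024, 1024, 452]
```

<p>Each chunk then gets its own embedding, which is why chunk size affects both retrieval granularity and feed volume.</p>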

<hr />

<h2 id="chat-with-your-data">Chat with Your Data</h2>

<p>You can now start asking questions in the chat UI.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/nyrag_ui.png" alt="nyrag_ui" /></p>

<p>When you submit a query, NyRAG expands it into focused retrieval queries and sends them to Vespa. Vespa runs hybrid retrieval, combining BM25 keyword matching with vector similarity, and returns the most relevant chunks. Those chunks are packed into a compact context window and sent to the LLM, which generates an answer grounded entirely in your data.</p>

<p>A good way to sanity-check the setup is to start with a broad question like “What are the main topics in these documents?” and then follow up with something more specific to confirm the retrieved context makes sense.</p>

<p>At this point, you have a fully functional RAG application running on Vespa Cloud.</p>

<h3 id="improving-search-quality-with-query-profiles">Improving Search Quality with Query Profiles</h3>

<p>Want better search results? You can fine-tune how Vespa retrieves and ranks your documents using the Settings modal (⚙️ icon in the top right).</p>

<p><strong>Change query profiles:</strong> Open the ⚙️ <strong>Settings</strong> panel, choose a <strong>Query Profile</strong> from the dropdown, and click <strong>“Save”</strong>. The very next query you run will use the new profile.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/nyrag_settings_query_profiles.png" alt="Settings modal with query profile dropdown" /><br />
<strong>Description</strong>: Settings modal showing query profile selection dropdown with 4 available options</p>

<p><strong>What each profile does:</strong></p>
<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">hybrid</code></strong>: Fast hybrid search (BM25 + vector) with linear ranking</li>
  <li><strong><code class="language-plaintext highlighter-rouge">hybrid-with-gbdt</code></strong>: Same retrieval + advanced GBDT ranking (slower but best quality)</li>
  <li><strong><code class="language-plaintext highlighter-rouge">deepresearch</code></strong>: Exhaustive search with 10,000 retrieval targets (maximum recall)</li>
  <li><strong><code class="language-plaintext highlighter-rouge">deepresearch-with-gbdt</code></strong>: Exhaustive search + GBDT ranking (slowest, most thorough)</li>
</ul>
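
<p>As a rough mental model of the linear ranking in the <code class="language-plaintext highlighter-rouge">hybrid</code> profile, the sketch below blends a lexical score and a vector similarity with fixed weights. The weights, the BM25 normalization, and the <code class="language-plaintext highlighter-rouge">Hit</code> type are illustrative assumptions, not the blueprint's actual rank profile.</p>

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    bm25: float        # lexical score, non-negative and unbounded
    similarity: float  # vector closeness, assumed to lie in [0, 1]


def linear_score(hit: Hit, bm25_weight: float = 0.3, vector_weight: float = 0.7) -> float:
    # Squash BM25 into [0, 1) so both signals share a comparable scale.
    normalized_bm25 = hit.bm25 / (1.0 + hit.bm25)
    return bm25_weight * normalized_bm25 + vector_weight * hit.similarity


hits = [
    Hit("a", bm25=9.0, similarity=0.2),
    Hit("b", bm25=1.0, similarity=0.8),
    Hit("c", bm25=0.0, similarity=0.5),
]
ranked = sorted(hits, key=linear_score, reverse=True)
print([h.doc_id for h in ranked])  # ['b', 'a', 'c']
```

<p>A GBDT profile replaces this fixed weighted sum with a learned model over many more features, which is where the extra quality and the extra latency come from.</p>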

<p><strong>Pro tip</strong>: The quality difference between <code class="language-plaintext highlighter-rouge">hybrid</code> and <code class="language-plaintext highlighter-rouge">hybrid-with-gbdt</code> can be dramatic for complex queries. The GBDT model offers significantly better relevance at the cost of 2-3x higher latency. For research tasks where you need to find everything relevant, try the <code class="language-plaintext highlighter-rouge">deepresearch</code> variants, which cast a much wider net!</p>

<hr />

<h3 id="manage-your-data">Manage Your Data</h3>

<p>NyRAG also gives you simple tools for cleanup. Open the advanced menu (three-dot icon ⋮ in the top right) and you will find two cleanup actions. <strong>Clear Local Cache</strong> removes cached files for all projects on your machine, which is useful when you want to re-process from scratch locally. <strong>Clear Vespa Data</strong> deletes the indexed documents in Vespa for the project, which is useful when you want a clean index before re-feeding. Both actions ask for confirmation so you do not delete data by accident.</p>

<hr />

<h2 id="bonus-try-web-crawling-mode">Bonus: Try Web Crawling Mode</h2>

<p>In addition to local documents, NyRAG supports web crawling. By switching to the web_example project, you can point NyRAG at a website and have it crawl, extract, and index content automatically.</p>

<p><strong>Switch to web crawling mode:</strong>  Select <code class="language-plaintext highlighter-rouge">web_example (web)</code> from the dropdown at the top and open the configuration editor. If you are currently on the chat screen, open the three-dot menu (⋮) and choose <strong>“Edit Config”</strong> to bring the editor back. From there, update the same credential fields as you did for <code class="language-plaintext highlighter-rouge">doc_example</code>, then click <strong>“Start Indexing”</strong> to crawl and feed the site.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/nyrag_indexing_web_2.png" alt="Web crawling in progress" /> 
<strong>Description</strong>: Shows web crawling in progress with terminal logs displaying discovered URLs and processed pages</p>

<p><strong>Web Mode Features:</strong> Web mode discovers and follows links automatically, while still respecting <code class="language-plaintext highlighter-rouge">robots.txt</code> and crawl delays so you do not hammer a site. It also does smart content extraction to drop navigation and boilerplate, deduplicates very similar pages, and supports resume so you can continue a crawl after interruption.</p>

<p><strong>Example Use Cases:</strong> Web mode is a good fit for product documentation, knowledge bases, blog archives, help-center content, and technical wikis. In general, it works best on sites with consistent HTML structure and clean, text-heavy pages.</p>

<p><strong>Tips:</strong> Start small. Crawl a limited part of a site first so you can sanity-check what gets extracted and indexed, then expand. Use <code class="language-plaintext highlighter-rouge">exclude</code> patterns to skip sections you do not want (for example <code class="language-plaintext highlighter-rouge">/pricing</code> or <code class="language-plaintext highlighter-rouge">/sales/*</code>), and keep an eye on the terminal output panel so you can spot loops, unexpected URLs, or pages that fail to parse.</p>
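
<p>To preview what a set of <code class="language-plaintext highlighter-rouge">exclude</code> patterns would skip before starting a crawl, a small check like the one below can help. It assumes the patterns are glob-style and matched against the URL path; NyRAG's actual matching rules may differ, and <code class="language-plaintext highlighter-rouge">is_excluded</code> is a hypothetical helper.</p>

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def is_excluded(url: str, exclude_patterns: list[str]) -> bool:
    """Return True if the URL path matches any glob-style exclude pattern."""
    path = urlparse(url).path
    return any(fnmatchcase(path, pattern) for pattern in exclude_patterns)


exclude = ["/pricing", "/sales/*"]
urls = [
    "https://example.com/docs/intro",
    "https://example.com/pricing",
    "https://example.com/sales/emea/contact",
    "https://example.com/blog/sales-tips",
]
kept = [u for u in urls if not is_excluded(u, exclude)]
print(kept)
# ['https://example.com/docs/intro', 'https://example.com/blog/sales-tips']
```

<p>Under these assumptions <code class="language-plaintext highlighter-rouge">/pricing</code> only matches that exact path, while <code class="language-plaintext highlighter-rouge">/sales/*</code> matches everything below <code class="language-plaintext highlighter-rouge">/sales/</code>.</p>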

<hr />

<h2 id="troubleshooting">Troubleshooting</h2>

<p>Running into issues? We’ve got you covered! For detailed troubleshooting guides covering Vespa connection errors, LLM configuration, document processing, and more, see the <strong><a href="https://github.com/vespaai-playground/RAG-app-in-15min-ragblueprint#troubleshooting">Troubleshooting section</a></strong> in the main README.</p>

<p><strong>Quick help:</strong> If you get stuck, the fastest path is usually to ask in the <a href="http://slack.vespa.ai/">Vespa Slack</a> community, where people can help you interpret logs and query behavior. If you think you found a bug or want to request an improvement, open an issue in <a href="https://github.com/vespaai-playground/RAG-app-in-15min-ragblueprint/issues">GitHub Issues</a>. And when you want deeper background on schema, ranking, and deployment, the <a href="https://docs.vespa.ai/">Vespa Docs</a> are your go-to reference.</p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p><strong>Congratulations!</strong> You now have a working RAG app: a Vespa Cloud deployment that can retrieve high-quality context, and a small UI that lets you ingest data and chat with it.</p>

<p>Building a high-quality RAG system is never trivial. There are multiple moving parts to get right: the quality of the LLM, the size and management of its context window, and how effectively your retrieval system surfaces the most relevant information.</p>

<p>Thanks to the out-of-the-box Vespa RAG blueprint on Vespa Cloud, much of this complexity is handled for you. It comes with multiple ranking profiles, and its default hybrid retrieval setup combines <strong>vector similarity with BM25 text matching</strong>, ensuring your LLM sees the best possible context for every query.</p>

<p>Vespa Cloud doesn’t just make building RAG easier, it makes it <strong>scalable, fast, and reliable</strong>, giving you production-ready infrastructure, auto-scaling and observability without the headaches of self-hosting. Whether you’re experimenting with small datasets or scaling to millions of documents, Vespa Cloud provides the tools and flexibility to make your RAG project shine.</p>

<p>Want to dive deeper? Start with the <a href="https://docs.vespa.ai/en/learn/tutorials/rag-blueprint.html">RAG Blueprint Tutorial</a> for a thorough conceptual walkthrough. And remember the <a href="https://vespatalk.slack.com/">Vespa Slack community</a> is always there to help. Ask questions, share what you’ve built, or get advice on retrieval, ranking, and deployment strategies.</p>

<p>Ready to experience the power of Vespa Cloud for yourself? <a href="https://cloud.vespa.ai/">Sign up</a> today and <strong>start building high-quality RAG applications with ease</strong>!</p>

]]></content:encoded>
        <pubDate>Mon, 02 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        
      </item>
    
      <item>
        <title>Vespa Newsletter, February 2026</title>
        <description>Advances in Vespa&apos;s retrieval performance, flexibility, and developer productivity.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/logo/logo-pi.jpg" />
        
        <content:encoded><![CDATA[<p>Welcome to the latest edition of the Vespa newsletter. In the <a href="https://blog.vespa.ai/vespa-newsletter-december-2025/">previous update</a>, we introduced several new features and improvements, including Automated ANN Tuning, Accelerated Exact Vector Distance with Google Highway, Precise Chunk-Level Matching for Higher Retrieval Quality, Quantile Computation in Grouping for Instant Distribution Insights, and <a href="https://blog.vespa.ai/vespa-newsletter-december-2025/">more</a>.</p>

<p>This month, we’re announcing several updates focused on retrieval quality, ranking flexibility, and developer productivity. Each feature is designed to help engineering teams build faster, more accurate, and more maintainable retrieval and ranking systems, while giving businesses better relevance, lower operational overhead, and more predictable performance at scale.</p>

<p>Let’s dive into what’s new.</p>

<h2 id="product-updates">Product updates</h2>

<ul>
  <li>Announcing the Vespa.ai Playground</li>
  <li>The Vespa Kubernetes Operator</li>
  <li>Faster result rendering with CBOR</li>
  <li>Pyvespa 1.0 with improved HTTP performance</li>
  <li>Hybrid search relevance evaluation tool</li>
  <li>Configurable linguistics per field</li>
  <li><strong>“switch”</strong> operator in ranking expressions</li>
  <li>Vespa is now available on GCP Marketplace</li>
  <li>Feed data and run queries in the Vespa Console</li>
</ul>

<h3 id="announcing-the-vespaai-playground">Announcing the Vespa.ai Playground</h3>

<p>The Vespa Playground is a new GitHub space where we share projects, tools, and demos built on the Vespa platform. It’s a practical place to explore real examples for embeddings, model training, and feed connectors that you can clone, run, and build on your own.</p>

<p>These repos are ideal for experimentation, learning, and inspiration, though they aren’t officially supported product releases.</p>

<p><a href="https://github.com/vespaai-playground">Explore the Playground</a></p>

<h3 id="the-vespa-kubernetes-operator">The Vespa Kubernetes Operator</h3>

<p>The safest, most robust, and most cost-effective way to run Vespa is to deploy on Vespa Cloud, but for various reasons that’s not an option for everybody. For those who want to run Vespa securely at scale but can’t use Vespa Cloud, we have now released the Vespa Kubernetes Operator. It brings many Vespa Cloud features, such as security out of the box, dynamic provisioning, autoscaling, and automated upgrades, to your own Kubernetes environments.</p>

<p>Read more in the <a href="https://docs.vespa.ai/en/operations/kubernetes/vespa-on-kubernetes.html">Kubernetes Operator documentation</a>.</p>

<h3 id="faster-result-rendering-with-cbor">Faster result rendering with CBOR</h3>

<p>Query result sets can be large, and increasingly so when the client is an LLM retrieving many chunks for model context. <a href="https://blog.vespa.ai/introducing-layered-ranking-for-rag-applications/">Layered ranking</a> is designed to address this by extracting the most relevant content. Still, in some cases, the total latency is dominated by the time it takes to send the query response. Compressing with gzip can help, but is also CPU-intensive and slow. From Vespa 8.623.5, JSON response generation is over twice as fast as before.</p>

<p>Another new option in this release is to use the <a href="https://cbor.io/">CBOR</a> format for query results. CBOR is a binary format so it can be serialized faster and produces smaller payloads, especially when the result contains lots of numeric data. Read more in the <a href="https://docs.vespa.ai/en/reference/api/query.html#presentation.format">Query API reference</a> and query <a href="https://docs.vespa.ai/en/performance/practical-search-performance-guide.html#hits-and-summaries">performance guide</a>.</p>
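
<p>To see why a binary format shrinks numeric payloads, the sketch below hand-encodes a 768-dimensional vector as a CBOR array of single-precision floats (following RFC 8949) and compares its size with the JSON text for the same values. The float32 choice and the helper name are assumptions for illustration; how Vespa lays out tensors in its CBOR responses is not covered here.</p>

```python
import json
import struct


def cbor_float32_array(values: list[float]) -> bytes:
    """Encode floats as a CBOR array of single-precision floats (RFC 8949)."""
    n = len(values)
    # Array header: major type 4, with the length inline or in 1-2 extra bytes.
    if n < 24:
        header = bytes([0x80 | n])
    elif n < 256:
        header = bytes([0x98, n])
    elif n < 65536:
        header = bytes([0x99]) + struct.pack(">H", n)
    else:
        raise ValueError("sketch only handles arrays shorter than 65536 items")
    # Each item: 0xfa marks a float32, followed by 4 big-endian bytes.
    body = b"".join(b"\xfa" + struct.pack(">f", v) for v in values)
    return header + body


vector = [i / 768 for i in range(768)]
json_bytes = json.dumps(vector).encode("utf-8")
cbor_bytes = cbor_float32_array(vector)
# The CBOR size is 3 header bytes + 768 * 5 item bytes = 3843 bytes.
print(len(json_bytes), len(cbor_bytes))
```

<p>Note that float32 drops precision compared with the full decimal text JSON emits, so part of the saving here comes from the narrower number type rather than from CBOR alone.</p>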

<h3 id="pyvespa-10-with-improved-http-performance">Pyvespa 1.0 with improved HTTP performance</h3>

<p>We have released the first major version of Pyvespa! This release switches the HTTP-client used by Pyvespa, from httpx to httpr, which gives big performance gains, especially for serializing and deserializing tensors, largely by taking advantage of the new CBOR serialization support in Vespa.</p>

<p>In preliminary benchmarks, we compared end-to-end latency for:</p>

<ol>
  <li>
    <p>Vespa 8.591.16 + Pyvespa v0.63.0 (using JSON)</p>
  </li>
  <li>
    <p>Vespa 8.634.24 + Pyvespa v1.0.0 (using CBOR)</p>
  </li>
</ol>

<p>The latter was ~4.9x faster when returning 400 hits with a 768-dim vector each. Performance gains will be smaller when not returning large result sets with tensors, but still significant. You may encounter different exceptions than before, but we have tried not to change any user-facing APIs, even though we bumped the major version.</p>

<p><a href="https://github.com/vespa-engine/pyvespa">Go to Pyvespa</a></p>

<h3 id="hybrid-search-relevance-evaluation-tool">Hybrid search relevance evaluation tool</h3>

<p>Hybrid search combines lexical and embedding-based search to get the best of both. One of the tasks you need to solve is picking an embedding model that provides a good quality vs. cost tradeoff for your use case. We have done a systematic evaluation of modern alternatives in <a href="https://blog.vespa.ai/embedding-tradeoffs-quantified/">this blog post</a>.</p>

<p>The code used to run these experiments is now merged into Pyvespa. You can use the VespaMTEBApp to evaluate embedding model performance on any task/benchmark compatible with the <a href="https://embeddings-benchmark.github.io/mteb/overview/available_benchmarks/">mteb-library</a>. See example usage from the <a href="https://github.com/vespa-engine/pyvespa/blob/master/tests/integration/test_integration_mtebevaluation.py">tests</a>.</p>

<h3 id="configurable-linguistics-per-field">Configurable linguistics per field</h3>

<p>Vespa now lets you specify linguistics profiles on fields to select the specific linguistics processing that your Linguistics module applies to each one. In Lucene Linguistics, linguistics profiles map to analyzer configuration, optionally in combination with a specific language.</p>

<p>For example, you can define a Lucene analyzer like this in services.xml:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  &lt;item key="profile=whitespaceLowercase;language=en"&gt;
    &lt;tokenizer&gt;
      &lt;name&gt;whitespace&lt;/name&gt;
    &lt;/tokenizer&gt;
    &lt;tokenFilters&gt;
      &lt;item&gt;
        &lt;name&gt;lowercase&lt;/name&gt;
      &lt;/item&gt;
    &lt;/tokenFilters&gt;
  &lt;/item&gt;
</code></pre></div></div>
<p>And use it in the schema, under any field’s definition, like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>field title type string {
  indexing: summary | index
  linguistics {
    profile: whitespaceLowercase
  }
}
</code></pre></div></div>
<p>By default the linguistics profile will be applied both when processing the text of the field and the text searching it, but you can also specify a different linguistics profile on the query side, which is useful for e.g. doing synonym query expansion.</p>
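
<p>To get a feel for what the <code class="language-plaintext highlighter-rouge">whitespaceLowercase</code> profile above does to text, here is a rough Python approximation of a whitespace tokenizer followed by a lowercase filter. It only imitates the effect and is not Lucene itself; the function name is hypothetical.</p>

```python
def whitespace_lowercase(text: str) -> list[str]:
    """Split on whitespace only, then lowercase each token."""
    # Punctuation stays attached to tokens because only whitespace separates them.
    return [token.lower() for token in text.split()]


print(whitespace_lowercase("Hybrid-Search  Guide, 2nd Edition"))
# ['hybrid-search', 'guide,', '2nd', 'edition']
```

<p>Because punctuation is kept, a query for <code class="language-plaintext highlighter-rouge">guide</code> would not match the token <code class="language-plaintext highlighter-rouge">guide,</code> under this profile, which is the kind of behavior worth checking before applying it to a field.</p>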

<p>We’ve added a sample application demonstrating how to use multiple Lucene linguistics <a href="https://github.com/vespa-engine/sample-apps/tree/master/examples/lucene-linguistics/multiple-profiles">profiles</a> across multiple fields and updated the Vespa <a href="https://docs.vespa.ai/en/linguistics/linguistics.html">linguistics documentation</a> with usage examples.</p>

<h3 id="new-switch-operator-in-ranking-expressions">New “switch” operator in ranking expressions</h3>

<p>We have added a “switch” function in ranking expressions as a clearer, more maintainable alternative to deeply nested if() clauses, making complex ranking easier to read, debug, and evolve.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>switch (attribute(category)) {
    case "restaurant": myRestaurantFunction(),
    case "hotel": myHotelFunction(),
    default: myDefaultFunction()
}
</code></pre></div></div>

<p><a href="https://docs.vespa.ai/en/ranking/ranking-expressions-features.html#the-switch-function">Learn more</a></p>

<h3 id="vespa-is-now-available-on-gcp-marketplace">Vespa is now available on GCP Marketplace</h3>

<p>Vespa Cloud is now listed on the GCP Marketplace, making it easier to deploy and manage Vespa using native Google Cloud billing and procurement. Vespa Cloud is already available on <a href="https://aws.amazon.com/marketplace/pp/prodview-5pkxkencasnoo?sr=0-1&amp;ref_=beagle&amp;applicationId=AWSMPContessa">AWS Marketplace</a>.</p>

<p><a href="https://console.cloud.google.com/marketplace/product/gcp-billing-marketplace/vespa-cloud">See details</a></p>

<h3 id="feed-data-and-run-queries-in-the-vespa-console">Feed data and run queries in the Vespa Console</h3>

<p>The onboarding experience is now even smoother for new Vespa Cloud users. When you follow the getting started guide and deploy a sample app from the browser, you can immediately feed data and run queries directly in the browser. This makes it easy to try your own data and see how it behaves in Vespa.</p>

<p>We also provide examples showing how to do the same using pyvespa, the Vespa CLI, or curl.</p>

<p><img src="/assets/2026-02-18-vespa-newsletter-february-2026/new-onboarding-console.png" alt="New onboarding experience" /></p>

<p><a href="https://login.console.vespa-cloud.com/u/signup/identifier?state=hKFo2SBsN1NBOERhNnRCbDhpajdqTnhYSTlzUlltUjNoUG5mZaFur3VuaXZlcnNhbC1sb2dpbqN0aWTZIERwRkg4NkVwRHg2aFk1Rjg0ZHZrYmdBZ0pFc1lTb29Io2NpZNkgVk92OGViclhwcEdBTnVpWWZHOWhKWk94MVM5T0dhTTQ">Try it Free</a></p>

<h2 id="new-content-and-learning-resources">New content and learning resources</h2>

<p>We published several new articles and resources since our last newsletter to help teams get more out of Vespa and stay ahead of new developments in search, RAG, and large-scale AI.</p>

<p><strong>Examples and notebooks:</strong></p>

<ul>
  <li><a href="http://playground.vespa.ai">playground.vespa.ai</a></li>
</ul>

<p><strong>Videos, webinars, and podcasts:</strong></p>

<ul>
  <li><a href="https://em360tech.com/podcasts/how-scale-ai-digital-commerce-effectively?utm_content=520974566&amp;utm_medium=social&amp;utm_source=linkedin&amp;hss_channel=lcp-100705136">How To Scale AI in Digital Commerce Effectively</a></li>
  <li><a href="https://vespa.ai/resource/vespa-now-year-in-review/">2025 Year in Review</a></li>
</ul>

<p><strong>Blogs and ebooks:</strong></p>

<ul>
  <li><a href="https://blog.vespa.ai/agentic-ai-powered-sales-for-developers-with-vespa/">Clarm: Agentic AI-powered Sales for Developers with Vespa Cloud</a></li>
  <li><a href="https://blog.vespa.ai/embedding-tradeoffs-quantified/">Embedding Tradeoffs, Quantified</a></li>
  <li><a href="https://blog.vespa.ai/enterpise-ai-search-vs-the-real-needs-of-customer-facing-apps/">Enterprise AI Search vs. the Real Needs of Customer-Facing Apps</a></li>
  <li><a href="https://blog.vespa.ai/eliminating-the-precision-latency-trade-off-in-large-scale-rag/">Eliminating the Precision–Latency Trade-Off in Large-Scale RAG</a></li>
  <li><a href="https://blog.vespa.ai/how-tensors-are-changing-search-in-life-sciences/">How Tensors Are Changing Search in Life Sciences</a></li>
  <li><a href="https://blog.vespa.ai/the-search-api-reset-incumbents-retreat-innovators-step-up/">The Search API Reset: Incumbents Retreat, Innovators Step Up</a></li>
  <li><a href="https://blog.vespa.ai/why-ai-search-platforms-are-gaining-attention/">Why AI Search Platforms Are Gaining Attention</a></li>
  <li><a href="https://blog.vespa.ai/why-life-sciences-ai-is-a-search-problem-5-of-5/">Why Life Sciences AI Is a Search Problem (Part 5 of 5)</a></li>
  <li><a href="https://blog.vespa.ai/why-life-sciences-ai-is-a-search-problem-4-of-5/">Why Life Sciences AI Is a Search Problem (Part 4 of 5)</a></li>
</ul>

<h3 id="upcoming-events">Upcoming Events</h3>

<p><img src="/assets/2026-02-18-vespa-newsletter-february-2026/maven.jpeg" alt="Personalized Relevance with VLMs and Sparse Vectors: February 17, 11:30am ET" />
<strong>Lightning Lesson: Personalized Relevance with VLMs and Sparse Vectors: February 17, 11:30am ET</strong></p>
<ul>
  <li>Intro to sparse vectors and tensors for efficient data handling</li>
  <li>Using Vision-Language Models (VLMs) to extract high quality and nuanced features from images</li>
  <li>Leveraging these features in sparse representations for hyper-personalized search &amp; recommendations</li>
</ul>

<p><a href="https://maven.com/p/b5ee84/personalized-relevance-with-vl-ms-and-sparse-vectors">Register Now</a></p>

<hr />

<p><img src="/assets/2026-02-18-vespa-newsletter-february-2026/eCommerce-Webinar-Series.png" alt="e-commerce-webinar-series" />
<strong>February 18: The Zero Results Problem in eCommerce</strong></p>
<ul>
  <li>🔗 10am CET (EMEA): <a href="https://www.airmeet.com/e/f4f6c070-c094-11f0-9be4-375c53bcf15c?utm_source=Newsletter&amp;utm_campaign=Zero%20results%20EMEA">Save your spot</a></li>
  <li>🔗 1pm ET (Americas): <a href="https://www.airmeet.com/e/305ace80-c3c0-11f0-9be4-375c53bcf15c?utm_source=Newsletter&amp;utm_campaign=Zero%20results%20(AMER)">Save your spot</a></li>
</ul>

<p><strong>March 11: The Relevance Problem in eCommerce</strong></p>
<ul>
  <li>🔗 10am CET (EMEA): <a href="https://www.airmeet.com/e/70338df0-c5fd-11f0-831c-01bcfd385865?utm_source=Newsletter&amp;utm_campaign=Relevance%20Problem%20EMEA">Save your spot</a></li>
  <li>🔗 1pm ET (Americas): <a href="https://www.airmeet.com/e/5bf695d0-c5fd-11f0-bb1f-e79dc2111266?utm_source=Newsletter&amp;utm_campaign=Relevance%20Problem%20AMER">Save your spot</a></li>
</ul>

<hr />

<p><img src="/assets/2026-02-18-vespa-newsletter-february-2026/Vespa-Now-Q1-Product-Update.png" alt="product-update" />
<strong>March 10: Vespa Q1 Product Update</strong></p>
<ul>
  <li>🔗 10am CET (EMEA): <a href="https://www.airmeet.com/e/79245020-f186-11f0-ace7-c7ef52349391?utm_source=Newsletter&amp;utm_campaign=Q1%20Product%20Update">Save your spot</a></li>
  <li>🔗 1pm ET (Americas): <a href="https://www.airmeet.com/e/3d23e680-f186-11f0-b12c-b1c5402490b0?utm_source=Newsletter&amp;utm_campaign=Q1%20Product%20update">Save your spot</a></li>
</ul>

<hr />
<p>👉 <a href="https://www.linkedin.com/company/vespa-ai/">Follow us on LinkedIn</a> to stay in the loop on upcoming events, blog posts, and announcements.</p>

<hr />

<p>Thanks for joining us in exploring the frontiers of AI with Vespa. Ready to take your projects to the next level? <a href="https://vespa.ai/free-trial/">Deploy your application for free</a> on Vespa Cloud today.</p>

]]></content:encoded>
        <pubDate>Mon, 16 Feb 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/vespa-newsletter-february-2026/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/vespa-newsletter-february-2026/</guid>
        
        
        <category>newsletter</category>
        
      </item>
    
      <item>
        <title>Nexla + Vespa, The Power Duo for AI-Ready Data Pipelines</title>
        <description>Nexla solves data readiness. Vespa solves intelligence and precision at scale. Together, they give teams a clean, practical path from raw enterprise data to real-time AI applications.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/images/New Partnership Nexla.png" />
        
        <content:encoded><![CDATA[<h3 id="partner-spotlight-nexla">Partner Spotlight: Nexla</h3>

<p>AI is transforming quickly. What started with Q&amp;A chatbots has already evolved into deep research applications and, now, autonomous AI agents. Vespa is proud to be at the center of this shift, enabling some of the most proficient adopters of AI, such as Perplexity. To help organizations maximize the benefits of Vespa, we’re building a robust partner ecosystem. These partners help bring Vespa’s AI-native capabilities into real-world deployments across industries.</p>

<p><strong>Meet the innovators shaping the future of AI. Today’s spotlight: Nexla</strong></p>

<h2 id="nexla--vespaai-the-power-duo-for-ai-ready-data-pipelines">Nexla + Vespa.ai: The Power Duo for AI-Ready Data Pipelines</h2>

<p>When AI systems fall short, it’s rarely the model’s fault. It’s the messy reality of data spread across systems and never quite staying in sync. That’s why Nexla and Vespa partnered.</p>

<p><a href="https://nexla.com/">Nexla</a> makes data usable.</p>

<p><a href="http://vespa.ai">Vespa</a> makes data intelligent at scale.</p>

<p>Together, they turn messy, distributed enterprise data into real-time AI search, recommendation, and RAG systems, without months of custom code gluing things together.</p>

<h2 id="nexla-making-enterprise-data-usable">Nexla: Making Enterprise Data Usable</h2>

<p>Nexla is an enterprise-grade, AI-powered data integration <a href="https://nexla.com/nexla-platform-overview">platform</a> that turns raw data from any source into production-ready data products. It provides a declarative, no-code way to move, transform, and validate data across ETL/ELT, reverse ETL, streaming, APIs, and RAG pipelines.</p>

<p>Think of Nexla as the layer that answers: “How do we reliably get the right data, in the right shape, to the systems that need it?”</p>

<p>Core capabilities:</p>

<ul>
  <li>
    <p><strong>500+ Bidirectional <a href="https://nexla.com/connectors/">Connectors</a>:</strong> Pull data from databases, APIs, cloud storage, SaaS apps, and data warehouses, including systems like Salesforce, Snowflake, and Amazon S3.</p>
  </li>
  <li>
    <p><strong>Metadata Intelligence:</strong> Nexla automatically scans sources and generates <a href="https://nexla.com/nexsets">Nexsets</a>, virtual, ready-to-use data products with schemas, samples, and validation rules.
Example: If a price field suddenly switches from numeric to string, Nexla detects it before bad data reaches production search.</p>
  </li>
  <li>
    <p><strong><a href="https://nexla.com/blog/introducing-express-conversational-data-platform/">Express</a> (conversational pipelines):</strong> A conversational AI interface where you can simply describe what you need.
Example: You can say, “Pull customer data from Salesforce and merge with Google Analytics,” and it builds the pipeline for you.</p>
  </li>
  <li>
    <p><strong>Universal <a href="https://nexla.com/data-integration/">integration</a> styles:</strong> Supports ELT, ETL, CDC, R-ETL, streaming, API integration, and FTP in a single platform.</p>
  </li>
</ul>

<p>Nexla processes over <strong>1 trillion records monthly</strong> for companies like DoorDash, LinkedIn, Carrier, and LiveRamp.</p>

<h2 id="vespa-where-retrieval-becomes-reasoning">Vespa: Where Retrieval Becomes Reasoning</h2>

<p>Vespa is a production-grade AI search platform that combines distributed text search, vector search, structured filtering, and machine-learned ranking in a single system.</p>

<p>Think of Vespa as the engine that answers: “Given all this data, how do we retrieve, rank, and reason over it in real time?”</p>

<p>It powers demanding applications like Perplexity and supports search, recommendations, personalization, and RAG at massive scale.</p>

<p>Core capabilities:</p>

<ul>
  <li>
    <p><strong>Unified AI Search and Retrieval:</strong> Vespa natively combines vector and <a href="https://vespa.ai/tensor-formalism/">tensor search</a> for semantic retrieval, full-text search for precise keyword matching, and structured filtering on attributes like categories, prices, and dates to enable richer, contextual search without stitching multiple systems together.</p>
  </li>
  <li>
    <p><strong>Real-time Retrieval and Inference at Scale:</strong> Rather than separating indexing, ranking, and inference across multiple systems, Vespa performs real-time machine-learned ranking and model inference where the data lives. This means you can serve fresh, personalized results with predictable sub-100 ms latency even for large datasets.</p>
  </li>
  <li>
    <p><strong>Multi-Phase Ranking and Custom Logic:</strong> Vespa lets you embed custom ranking logic, including ML models like XGBoost, directly into your search pipeline using ONNX. You can combine relevance signals, business rules, and semantic vectors in multi-stage ranking to fine-tune which results surface first.</p>
  </li>
  <li>
    <p><strong>Massive Scalability with High Throughput:</strong> Designed for real-world, high-traffic applications, Vespa can scale horizontally across clusters, handling billions of documents with sub-100ms query latency and up to 100k writes per second per node.</p>
  </li>
  <li>
    <p><strong>Multi-Vector and Multi-Modal Retrieval:</strong> Vespa natively handles multiple vectors per document, with support for token-level embeddings, ColPali-based visual document retrieval, and <a href="https://vespa.ai/tensor-formalism/">tensor-based computations</a> for precise, cross-modal relevance and ranking.</p>
  </li>
</ul>

<p>GigaOm recognized Vespa as a <strong><a href="https://content.vespa.ai/gigaom-report-v3-2025?_gl=1*1ep8wq0*_gcl_aw*R0NMLjE3NjQ4Nzg2NjIuQ2owS0NRaUFfOFRKQmhETkFSSXNBUFg1cXhRbHdEbHgtMndtQjdqRS1aYzhVWHRBSW4zTzZ2eEVrelNYTTdLUkNXSkZCTGpISml4MzNSZ2FBbkRxRUFMd193Y0I.*_gcl_au*MjkzNDEwODQ3LjE3NjUyODY2NTk.">leader</a> in vector databases</strong> for two consecutive years, noting its performance advantages over alternatives like Elasticsearch, up to <strong><a href="https://content.vespa.ai/vespa-vs-elasticsearch-performance-comparison">12.9X higher throughput</a> per CPU core for vector searches</strong>.</p>

<h2 id="how-nexla-and-vespa-work-together">How Nexla and Vespa Work Together</h2>

<p>The Nexla-Vespa partnership removes one of the hardest parts of AI systems: getting clean, well-modeled data into a high-performance retrieval engine, continuously.</p>

<p>Nexla recently launched a Vespa connector that makes data integration with Vespa seamless. The integration includes:</p>

<p><strong><a href="https://docs.nexla.com/user-guides/connectors/vespa_api/overview">Vespa Connector</a> in Nexla:</strong>
Handles all data piping from sources like Amazon S3, PostgreSQL, Pinecone, Snowflake, and others directly into Vespa:
<img src="/assets/images/nexla1.png" alt="" /></p>

<p><strong>Vespa Nexla Plugin CLI:</strong> Automatically generates draft Vespa application packages (including schema files) directly from a Nexset, eliminating manual configuration:
<img src="/assets/images/nexla2.png" alt="" /></p>

<p>This means you can move data from S3 to Vespa, migrate from Pinecone to Vespa, or sync <a href="https://nexla.com/demo-center/move-data-from-postgresql-to-vespa-ai-effortlessly/">PostgreSQL to Vespa</a>, all without writing a single line of code.</p>

<h2 id="when-nexla-clients-should-use-vespa">When Nexla Clients Should Use Vespa</h2>

<p>If you’re a Nexla client, use Vespa when you need:</p>

<p><strong>Advanced AI search and RAG applications:</strong>
If you’re building intelligent search, recommendation systems, or RAG applications that require hybrid search (combining semantic vector search with keyword matching and metadata filtering), Vespa is purpose-built for this. Nexla gets your data into Vespa, while Vespa delivers production-grade AI search with machine-learned ranking.</p>

<p><strong>Real-time, high-scale query performance:</strong>
When you need to serve thousands of queries per second across billions of documents with sub-100ms latency, Vespa’s distributed architecture scales horizontally without compromising quality. Nexla ensures your data flows continuously into Vespa with incremental updates and CDC support.</p>

<p><strong>Complex ranking and inference:</strong>
If your use case requires multi-phase ranking, custom ML models, or LLM integration at query time, Vespa executes these operations locally where data lives, avoiding costly data movement. Nexla prepares and transforms your data into the exact schema Vespa needs.</p>

<p><strong>Cost efficiency at scale:</strong>
Vespa delivers 5X infrastructure cost savings compared to alternatives like Elasticsearch while handling vector, lexical, and hybrid queries. Nexla minimizes integration costs by automating pipeline creation and schema management.</p>

<h2 id="when-vespa-clients-should-use-nexla">When Vespa Clients Should Use Nexla</h2>

<p>If you’re a Vespa client, use Nexla when you need:</p>

<p><strong>Multi-source data consolidation:</strong>
Vespa is your search and inference engine, but your data lives everywhere: S3 buckets, PostgreSQL databases, Snowflake warehouses, Salesforce CRMs, APIs, and files. Nexla connects to 500+ sources with bidirectional connectors and consolidates data into Vespa without custom ETL scripts.</p>

<p><strong>Automated schema generation and management:</strong>
Instead of manually writing Vespa schema files and managing schema evolution, Nexla’s Plugin CLI auto-generates schemas from your Nexsets. As source schemas change, Nexla’s metadata intelligence detects changes and propagates them downstream automatically.</p>

<p><strong>Data transformation and enrichment:</strong>
Before data hits Vespa, it often needs cleaning, filtering, enrichment, or format conversion. Nexla provides a no-code transformation library and supports custom SQL, Python, or JavaScript, all without maintaining separate ETL infrastructure.</p>

<p><strong>Vector database migration:</strong>
Moving from Pinecone, Weaviate, or another vector database to Vespa? Nexla handles the migration with zero code, extracting records, transforming data to match Vespa’s schema, and syncing documents continuously.</p>

<p><strong>Data quality and monitoring:</strong>
Nexla continuously monitors data flows with built-in validation rules, error handling, and automated alerts. When data quality issues arise, Nexla quarantines bad records and provides audit trails, ensuring Vespa always receives clean, trustworthy data.</p>

<p><strong>Real-time and streaming pipelines:</strong>
Vespa supports real-time updates, but getting real-time data from streaming sources (Kafka, APIs, databases with CDC) requires integration logic. Nexla handles streaming, batch, and hybrid integration styles, optimizing throughput and latency for each source type.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Nexla solves <strong>data readiness</strong>.</p>

<p>Vespa solves <strong>intelligence and precision at scale</strong>.</p>

<p>Together, they give teams a clean, practical path from raw enterprise data to real-time AI applications. <a href="http://vespa.ai">Vespa</a> gives you production-grade vector search, hybrid retrieval, and RAG capabilities at any scale. <a href="http://nexla.com">Nexla</a> eliminates months of pipeline development and lets you build multi-source data flows conversationally.</p>

<p><strong>Ready to explore?</strong></p>

<p>Start at <a href="http://express.dev">express.dev</a> for conversational pipeline building, or explore the <a href="https://docs.nexla.com/user-guides/connectors/vespa_api/overview">Vespa connector</a> in Nexla’s platform to see how quickly your data can power real AI applications.</p>
]]></content:encoded>
        <pubDate>Mon, 16 Feb 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/vespa-nexla-partnership/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/vespa-nexla-partnership/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        
      </item>
    
  </channel>
</rss>
