<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Vespa Blog</title>
    <description>We Make AI Work</description>
    <link>https://blog.vespa.ai/</link>
    <atom:link href="https://blog.vespa.ai/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Tue, 12 May 2026 06:24:19 +0000</pubDate>
    <lastBuildDate>Tue, 12 May 2026 06:24:19 +0000</lastBuildDate>
    <generator>Jekyll v4.4.1</generator>
    
      <item>
        <title>Scaling a Vespa Application: Feeding Fast and Furiously</title>
        <description>A tutorial on how to scale the resources in a Vespa application to increase feed throughput. Using the metrics dashboard for informed and optimised scaling.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/shaun-sullivan-4Ia69jX7rq4-unsplash.jpg" />
        
        <content:encoded><![CDATA[<p><em>This is a blog/series on how to scale and evaluate a Vespa Application for serving enterprise-scale workloads and customer facing applications with potentially millions of users. Vespa is the AI search platform and all-in-one solution for all your retrieval and large scale computation needs.</em></p>

<p>In this blog I will show you how to feed a large dataset to a Vespa Application. We will be using the full MS_marco passages dataset, which is perhaps the most comprehensive open dataset for information retrieval. It is around 4GB and contains more than 8 million passages on a wide range of topics. The goal in this blog is to show how scaling works in Vespa through feeding the entire dataset as fast as we can.</p>

<h1 id="creating-the-vespa-application">Creating the Vespa Application</h1>

<p>We will be using a pre-made sample application as our basis for scaling but the concepts are the same for any other application.</p>

<p>Setup:</p>

<ol>
  <li>
    <p><strong>Create a <a href="https://docs.vespa.ai/en/learn/tenant-apps-instances.html">tenant</a> on Vespa Cloud:</strong></p>

    <p>Go to <a href="https://console.vespa-cloud.com/">console.vespa-cloud.com</a> and create your tenant (unless you already have one).</p>
  </li>
  <li><strong>Install the <a href="https://docs.vespa.ai/en/clients/vespa-cli.html">Vespa CLI</a></strong> using <a href="https://brew.sh/">Homebrew</a>:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ brew install vespa-cli
</code></pre></div>    </div>
    <p>Windows/No Homebrew? See the <a href="https://docs.vespa.ai/en/clients/vespa-cli.html">Vespa CLI page</a> to download directly.</p>
  </li>
  <li><strong>Configure the Vespa client:</strong>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vespa config set target cloud
$ vespa config set application your-tenant-name-here.scalingtutorial
</code></pre></div>    </div>
  </li>
  <li><strong>Get Vespa Cloud control plane access:</strong>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vespa auth login
</code></pre></div>    </div>
    <p>Follow the instructions from the command to authenticate.</p>
  </li>
  <li><strong>Clone the sample application:</strong>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vespa clone scaling-tutorial scaling-app &amp;&amp; cd scaling-app
</code></pre></div>    </div>
    <p>This sample app is perfect for demonstrating scaling and performance as it is quite intensive to run both for feeding and querying.
You can also check out <a href="https://github.com/vespa-engine/sample-apps">sample-apps</a> for other sample apps you can clone.</p>
  </li>
  <li><strong>Add a certificate for <a href="https://docs.vespa.ai/en/security/guide#data-plane">data plane access</a> to the application:</strong>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vespa auth cert
</code></pre></div>    </div>
    <p>It is a good idea to take note of the path to the .pem files written here.</p>
  </li>
  <li>
    <p><strong>Add the cross-encoder and Colbert model</strong></p>

    <p>Export the cross-encoder ranker model to onnx format using the <a href="https://huggingface.co/docs/optimum/index">Optimum</a> library from HF or download an exported ONNX version of the model (like in this example)</p>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir -p models
$ curl -L https://huggingface.co/Xenova/ms-marco-MiniLM-L-6-v2/resolve/main/onnx/model.onnx -o models/model.onnx
$ curl -L https://huggingface.co/Xenova/ms-marco-MiniLM-L-6-v2/raw/main/tokenizer.json -o models/tokenizer.json
</code></pre></div>    </div>
  </li>
  <li>
    <p><strong>Download the dataset</strong></p>

    <p>The msmarco passages dataset can be found <a href="https://huggingface.co/datasets/Tevatron/msmarco-passage-corpus">here</a>. Download, unzip it and place it in the <code class="language-plaintext highlighter-rouge">ext/</code> folder in our application.</p>

    <p><strong>NOTE: You will need around 8GB of free disk space for the dataset and the subsets we will be creating.</strong></p>
  </li>
  <li>
    <p><strong>Prepare the dataset for Vespa</strong></p>

    <p>Then run the script to convert it into the vespa feed format:</p>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python3 ext/transform_ms_marco.py
</code></pre></div>    </div>
    <p>which gives us the dataset and a few subsets of various sizes to feed to our application.</p>
  </li>
</ol>
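<p>The contents of <code class="language-plaintext highlighter-rouge">ext/transform_ms_marco.py</code> are not reproduced in this post. As a rough sketch of what such a conversion involves, the snippet below turns one passage record into one line of Vespa's JSON feed format. The input keys <code class="language-plaintext highlighter-rouge">docid</code> and <code class="language-plaintext highlighter-rouge">text</code>, the <code class="language-plaintext highlighter-rouge">msmarco</code> namespace and the helper name <code class="language-plaintext highlighter-rouge">to_feed_line</code> are assumptions made for illustration; the actual script may differ.</p>

```python
import json


def to_feed_line(record, namespace="msmarco", doctype="passage"):
    """Build one JSONL feed line (a put operation) from a passage record.

    Hypothetical helper: the record keys and the namespace are assumptions.
    """
    doc_id = f"id:{namespace}:{doctype}::{record['docid']}"
    operation = {"put": doc_id, "fields": {"text": record["text"]}}
    return json.dumps(operation)


print(to_feed_line({"docid": "0", "text": "An example passage."}))
```

<p>Writing one such line per passage gives a <code class="language-plaintext highlighter-rouge">.jsonl</code> file of the kind that <code class="language-plaintext highlighter-rouge">vespa feed</code> reads.</p>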

<h1 id="deploying-and-feeding">Deploying and Feeding</h1>

<p>We now have everything we need for deployment, feeding and scaling! Scaling a Vespa application is largely managed through the services.xml file. This is what the file currently looks like:</p>

<style>
.code-block{overflow:hidden;border:1px solid #e0e0e0;background:#fafafa;font-family:monospace}
.code-header{display:flex;align-items:center;justify-content:space-between;padding:8px 14px;background:#f4f4f4;border-bottom:1px solid #e0e0e0}
.code-lang{font-size:11px;color:#999;letter-spacing:.06em;text-transform:uppercase;font-family:sans-serif}
.copy-btn{font-size:11px;color:#999;background:none;border:1px solid #ddd;padding:2px 9px;cursor:pointer;font-family:sans-serif}
.copy-btn:hover{color:#333;border-color:#aaa}
.code-body{position:relative}
.code-scroll{overflow:hidden;max-height:132px;transition:max-height 0.4s cubic-bezier(0.4,0,0.2,1)}
.code-scroll.expanded{max-height:4000px}
.code-block pre{padding:14px 16px;font-size:12.5px;line-height:22px;color:#222;overflow-x:auto;white-space:pre;tab-size:2;margin:0;background:none}
.fade-overlay{position:absolute;bottom:0;left:0;right:0;height:72px;background:linear-gradient(to bottom,transparent,#fafafa);pointer-events:none;transition:opacity 0.3s ease}
.code-scroll.expanded~.fade-overlay{opacity:0}
.show-btn-wrap{display:flex;justify-content:center;padding:8px 0 12px;background:#fafafa;border-top:1px solid #e0e0e0}
.show-btn{font-size:12px;color:#555;background:none;border:none;padding:4px 12px;cursor:pointer;display:flex;align-items:center;gap:5px;font-family:sans-serif}
.show-btn:hover{color:#000}
.show-btn svg{transition:transform 0.3s ease}
.show-btn.open svg{transform:rotate(180deg)}
</style>

<div class="code-block">
  <div class="code-header">
    <span class="code-lang">XML — services.xml</span>
    <button class="copy-btn" onclick="(function(){var t=document.getElementById('vespa-raw').innerText;navigator.clipboard.writeText(t).then(function(){var b=document.querySelector('.copy-btn');b.textContent='Copied!';setTimeout(function(){b.textContent='Copy'},1500)})})()">Copy</button>
  </div>
  <div class="code-body">
    <div class="code-scroll" id="vespa-scroll">
      <pre id="vespa-raw">&lt;?xml version="1.0" encoding="utf-8" ?&gt;
&lt;!-- Copyright Vespa.ai. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. --&gt;
&lt;services version="1.0" xmlns:deploy="vespa" xmlns:preprocess="properties" minimum-required-vespa-version="8.311.28"&gt;

  &lt;container id="default" version="1.0"&gt;

    &lt;nodes deploy:environment="dev" count="1"&gt;
      &lt;resources vcpu="1.0" memory="8Gb" architecture="arm64" storage-type="local" disk="59Gb"/&gt;
    &lt;/nodes&gt;
   
    &lt;search/&gt;
    &lt;document-api/&gt;

     &lt;!-- See https://docs.vespa.ai/en/embedding.html#huggingface-embedder --&gt;
    &lt;component id="e5_embedding_model" type="hugging-face-embedder"&gt;
            &lt;transformer-model url="https://huggingface.co/intfloat/e5-small-v2/resolve/main/model.onnx"/&gt;
            &lt;tokenizer-model url="https://huggingface.co/intfloat/e5-small-v2/raw/main/tokenizer.json"/&gt;
            &lt;prepend&gt;
                &lt;query&gt;query:&lt;/query&gt;
                &lt;document&gt;passage:&lt;/document&gt;
            &lt;/prepend&gt;
    &lt;/component&gt;

    &lt;!-- See https://docs.vespa.ai/en/embedding.html#colbert-embedder --&gt;
    &lt;component id="colbert_embedding_model" type="colbert-embedder"&gt;
      &lt;transformer-model url="https://huggingface.co/colbert-ir/colbertv2.0/resolve/main/model.onnx"/&gt;
      &lt;tokenizer-model url="https://huggingface.co/colbert-ir/colbertv2.0/raw/main/tokenizer.json"/&gt;
    &lt;/component&gt;

     &lt;!-- See https://docs.vespa.ai/en/reference/embedding-reference.html#huggingface-tokenizer-embedder--&gt;
    &lt;component id="tokenizer" type="hugging-face-tokenizer"&gt;
      &lt;model path="models/tokenizer.json"/&gt;
    &lt;/component&gt;

  &lt;/container&gt;

  &lt;content id="msmarco" version="1.0"&gt;
    &lt;min-redundancy&gt;1&lt;/min-redundancy&gt;
    &lt;documents&gt;
      &lt;document mode="index" type="passage"/&gt;
    &lt;/documents&gt;
    &lt;nodes deploy:environment="dev" count="1"&gt;
      &lt;resources vcpu="1.0" memory="8Gb" architecture="arm64" storage-type="local" disk="59Gb"/&gt;
    &lt;/nodes&gt; 
    &lt;engine&gt;
      &lt;proton&gt;
        &lt;tuning&gt;
          &lt;searchnode&gt;
            &lt;requestthreads&gt;
              &lt;persearch&gt;4&lt;/persearch&gt;
            &lt;/requestthreads&gt;
            &lt;feeding&gt;
              &lt;concurrency&gt;1.0&lt;/concurrency&gt;
            &lt;/feeding&gt;
          &lt;/searchnode&gt;
        &lt;/tuning&gt;
      &lt;/proton&gt;
    &lt;/engine&gt;
  &lt;/content&gt;

&lt;/services&gt;
</pre>
    </div>
    <div class="fade-overlay"></div>
  </div>
  <div class="show-btn-wrap">
    <button class="show-btn" id="vespa-btn" onclick="(function(){var s=document.getElementById('vespa-scroll');var b=document.getElementById('vespa-btn');var open=s.classList.toggle('expanded');b.classList.toggle('open',open);b.innerHTML=open?'&lt;svg width=\'14\' height=\'14\' viewBox=\'0 0 14 14\' fill=\'none\'&gt;&lt;path d=\'M2 4.5L7 9.5L12 4.5\' stroke=\'#89b4fa\' stroke-width=\'1.8\' stroke-linecap=\'round\' stroke-linejoin=\'round\'/&gt;&lt;/svg&gt; Show less':'&lt;svg width=\'14\' height=\'14\' viewBox=\'0 0 14 14\' fill=\'none\'&gt;&lt;path d=\'M2 4.5L7 9.5L12 4.5\' stroke=\'#89b4fa\' stroke-width=\'1.8\' stroke-linecap=\'round\' stroke-linejoin=\'round\'/&gt;&lt;/svg&gt; Show all'})()">
      <svg width="14" height="14" viewBox="0 0 14 14" fill="none"><path d="M2 4.5L7 9.5L12 4.5" stroke="#888" stroke-width="1.8" stroke-linecap="round" stroke-linejoin="round" /></svg>
      Show all
    </button>
  </div>
</div>

<p><br /></p>

<p>The important parts to take note of in this tutorial are the two resource specifiers in the &lt;container&gt; and &lt;content&gt; tags:</p>

<p><strong>Container</strong></p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;nodes</span> <span class="na">deploy:environment=</span><span class="s">"dev"</span> <span class="na">count=</span><span class="s">"1"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;resources</span> <span class="na">vcpu=</span><span class="s">"1.0"</span> <span class="na">memory=</span><span class="s">"8Gb"</span> <span class="na">architecture=</span><span class="s">"arm64"</span> <span class="na">storage-type=</span><span class="s">"local"</span> <span class="na">disk=</span><span class="s">"59Gb"</span><span class="nt">/&gt;</span>
<span class="nt">&lt;/nodes&gt;</span>
</code></pre></div></div>

<p><strong>Content</strong></p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;nodes</span> <span class="na">deploy:environment=</span><span class="s">"dev"</span> <span class="na">count=</span><span class="s">"1"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;resources</span> <span class="na">vcpu=</span><span class="s">"1.0"</span> <span class="na">memory=</span><span class="s">"8Gb"</span> <span class="na">architecture=</span><span class="s">"arm64"</span> <span class="na">storage-type=</span><span class="s">"local"</span> <span class="na">disk=</span><span class="s">"59Gb"</span><span class="nt">/&gt;</span>
<span class="nt">&lt;/nodes&gt;</span>
</code></pre></div></div>

<p>This is where we configure the machine resources that our Vespa application runs on in Vespa Cloud.</p>

<p><strong>NOTE: when deploying to dev we need to add the <code class="language-plaintext highlighter-rouge">&lt;nodes deploy:environment="dev"&gt;</code> specifier to ensure we actually get the resources we ask for,
otherwise we default to what is quickly available</strong>.</p>

<p>Adding more resources per node and adding more nodes are the two main levers for scaling your application. Right now we have provisioned the smallest amount of resources for our application.</p>

<p>Deploy the application to Vespa Cloud:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vespa deploy --wait 900
</code></pre></div></div>
<p>(It might take a little bit of time for all services and nodes to go up and start running.)</p>

<p>You can follow the progress of the deployment from the terminal or in your tenant in your cloud console. When it is finished you should get the message:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Application up!
</code></pre></div></div>
<p>If you go to your cloud console you should be able to see your application. Note that we haven’t fed it any documents yet, so it should look something like this:</p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/1howitshouldlook.png" alt="application view in console" /></p>

<p>Let’s feed some documents. Feed the smallest dataset to Vespa using:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa feed ext/corpus_transformed_1000.jsonl
</code></pre></div></div>
<p>or, on Unix systems:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>time vespa feed ext/corpus_transformed_1000.jsonl
</code></pre></div></div>
<p>to see how long it takes.</p>

<p>It will take a few minutes as we are doing heavy computations on very modest resources.</p>

<p>If you want a live count of the documents in Vespa while feeding, you can run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa query 'yql=select * from passage where true' 'hits=0' 'ranking=unranked'
</code></pre></div></div>
<p>to see how many documents have been processed so far (under <code class="language-plaintext highlighter-rouge">documents</code>).</p>
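<p>If you would rather read the count programmatically, a minimal sketch could parse the JSON that the query returns. It assumes the response follows Vespa's default JSON result layout, with the count under <code class="language-plaintext highlighter-rouge">root.coverage.documents</code>. The <code class="language-plaintext highlighter-rouge">sample</code> string is a hand-written, trimmed illustration, not captured output, and <code class="language-plaintext highlighter-rouge">document_count</code> is a hypothetical helper name.</p>

```python
import json


def document_count(response_text):
    """Return the number of documents covered by a Vespa query response."""
    root = json.loads(response_text)["root"]
    return root["coverage"]["documents"]


# Hand-written, trimmed example of a response body (not captured output).
sample = (
    '{"root": {"id": "toplevel", "fields": {"totalCount": 1000}, '
    '"coverage": {"coverage": 100, "documents": 1000}}}'
)
print(document_count(sample))
```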

<p>On this lowest resource configuration we get this result:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa feed ext/corpus_transformed_1000.jsonl  4.96s user 7.11s system 3% cpu 5:56.05 total 
</code></pre></div></div>

<p>If we were to try and feed the whole 8.8 million passage msmarco dataset on this instance it would take more than a month to finish feeding!</p>
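<p>The estimate above is a simple linear extrapolation from the timed run. A small sketch of that arithmetic, assuming the feed rate stays constant and using 8.8 million passages as the corpus size (the function names are illustrative):</p>

```python
def parse_elapsed(elapsed):
    """Convert a `time` value such as '5:56.05' or '11.371' to seconds."""
    seconds = 0.0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def extrapolate_hours(elapsed, docs_fed, total_docs=8_800_000):
    """Linearly scale a measured feed time up to the full corpus, in hours."""
    return parse_elapsed(elapsed) * total_docs / docs_fed / 3600


days = extrapolate_hours("5:56.05", docs_fed=1000) / 24
print(f"{days:.1f} days")  # 36.3 days
```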

<p>We can do better!</p>

<h1 id="scaling">Scaling</h1>

<p>Before scaling the application we’ll delete the documents from our instance so that we have a fresh start.</p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/deleteDocs.png" alt="Deleting documents" /></p>

<p>Now let’s assign more resources to our Vespa instance. From our schema we see that we are doing extensive computations during feeding (notice the configuration in the <code class="language-plaintext highlighter-rouge">indexing</code> parameters):</p>

<p><strong>Schema</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  field e5_embedding type tensor&lt;bfloat16&gt;(x[384]) {
    # Using the e5 embedding model defined in services.xml
    indexing: input text | embed e5_embedding_model | attribute | index
    attribute {
      distance-metric: angular
    }
    index { # override default hnsw settings 
      hnsw {
        max-links-per-node: 32
        neighbors-to-explore-at-insert: 400
      } 
    }
  }

  field colbert_embeddings type tensor&lt;int8&gt;(dt{}, x[16]) {
    # No index - used for ranking, not retrieval 
    indexing: input text | embed colbert_embedding_model | attribute
    attribute: paged
  }
</code></pre></div></div>

<p>Embedding in Vespa happens in the container cluster, so it is a reasonable guess that if we can make the embeddings go faster, our whole system will be faster (further down in this blog we will show how to deduce scaling parameters more thoroughly). So let’s start by scaling up the resources for the <strong>container</strong> node. To see what resource configurations we have available we must look at the <a href="https://docs.vespa.ai/en/performance/instance-types/aws-instance-types.html">instance type</a> page in the documentation.
Embedding computations are best suited to run on GPUs, so we will select an instance type with a GPU:</p>

<p><strong>Container</strong></p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;nodes</span> <span class="na">deploy:environment=</span><span class="s">"dev"</span> <span class="na">count=</span><span class="s">"1"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;resources</span> <span class="na">vcpu=</span><span class="s">"4.0"</span> <span class="na">memory=</span><span class="s">"16Gb"</span> <span class="na">architecture=</span><span class="s">"x86_64"</span> <span class="na">storage-type=</span><span class="s">"local"</span> <span class="na">disk=</span><span class="s">"125Gb"</span><span class="nt">&gt;</span>
        <span class="nt">&lt;gpu</span> <span class="na">count=</span><span class="s">"1"</span> <span class="na">memory=</span><span class="s">"16.0Gb"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;/resources&gt;</span>
<span class="nt">&lt;/nodes&gt;</span>
</code></pre></div></div>

<p>Replace the resources in the container node in services.xml with the new instance type (see above). Leave the content node resources as is for now.</p>

<p>Run the command for checking document count to make sure that it is zero:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa query 'yql=select * from passage where true' 'hits=0' 'ranking=unranked'
</code></pre></div></div>
<p>and redeploy:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vespa deploy --wait 900
</code></pre></div></div>
<p>When the deployment is finished we’ll time the feeding process again.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>time vespa feed ext/corpus_transformed_1000.jsonl
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa feed ext/corpus_transformed_1000.jsonl  0.40s user 0.53s system 8% cpu 11.371 total 
</code></pre></div></div>
<p>11.4 seconds, that’s more like it! Instead of more than a month, this new instance would be able to crunch through the full dataset in just over a day (roughly 28 hours at this rate)!</p>

<p>We have now significantly upgraded a part of the hardware Vespa is running on. But before we scale up further we shall take a look at the <strong>metrics</strong> tab for our application. Go to <strong>Metrics</strong> and then <strong>resources</strong>.</p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/metrics_resources.png" alt="Metrics and Resources" /></p>

<p>This is where you see the resource usage history of your Vespa instance, but most importantly it gives you a clear picture of where your application is bottlenecked. The bottleneck will differ depending on how your application is configured and the kind of computations you do. The previous 1000-line dataset was no match for the upgraded instance, so let’s give it a bigger one to get some proper bottleneck data:</p>

<p>Delete the documents from the instance again, wait a bit, and run the command to ensure that we have no documents in our application:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa query 'yql=select * from passage where true' 'hits=0' 'ranking=unranked'
</code></pre></div></div>
<p>Now we’ll feed the 50 000 line dataset to properly test and time the upgraded instance.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>time vespa feed ext/corpus_transformed_50000.jsonl
</code></pre></div></div>
<p><strong>Result</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa feed ext/corpus_transformed_50000.jsonl  17.44s user 22.68s system 11% cpu 5:56.41 total (~17.6 hours for full dataset)
</code></pre></div></div>

<p>This is a more accurate reading of the instance’s performance, and at 5min 56s to feed 50 000 documents, the full dataset would take around 17 and a half hours.</p>

<p>Look at the resources in the metrics and set it to show only the last 30 minutes so that we can see more clearly what went on. Notice the CPU-utilisation and the GPU-utilisation graphs. Notice that the GPU usage on the container node hit 100% and stayed there for the entire feeding process. The CPU usage on the container node peaked at 80% but leveled at around 60% and the content node’s CPU barely went over 50%.</p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/1g1c_container.png" alt="1 GPU 1 content - container node resources" /></p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/1g1c_gpu.png" alt="1 GPU 1 content - GPU resources" /></p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/1g1c_content.png" alt="1 GPU 1 content - content node resources" /></p>

<p>It is clear that on this Vespa instance, the bottleneck for feeding performance lies in the GPU processing. If we want to improve the feeding performance of the system, we must increase the number of GPUs available to the container cluster.</p>
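<p>The reasoning here is simply "find the resource closest to saturation". As a minimal sketch, with the approximate utilisation figures read off the graphs above typed in by hand (the dictionary keys and the helper name <code class="language-plaintext highlighter-rouge">find_bottleneck</code> are illustrative, not part of any Vespa API):</p>

```python
def find_bottleneck(utilisation):
    """Return the name of the resource with the highest utilisation (percent)."""
    return max(utilisation, key=utilisation.get)


# Approximate steady-state readings from the 1 GPU, 1 content node run.
readings = {"container GPU": 100, "container CPU": 60, "content CPU": 50}
print(find_bottleneck(readings))  # container GPU
```

<p>The same check can be repeated after every change in resources, since relieving one bottleneck usually exposes the next one.</p>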

<p>Now that we know where the problem lies, let’s make it go faster! We’ll increase the number of GPU nodes to 5 with the <code class="language-plaintext highlighter-rouge">count="5"</code> parameter in the container node in <code class="language-plaintext highlighter-rouge">services.xml</code>.</p>

<p><strong>Container</strong></p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;nodes</span> <span class="na">deploy:environment=</span><span class="s">"dev"</span> <span class="na">count=</span><span class="s">"5"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;resources</span> <span class="na">vcpu=</span><span class="s">"4.0"</span> <span class="na">memory=</span><span class="s">"16Gb"</span> <span class="na">architecture=</span><span class="s">"x86_64"</span> <span class="na">storage-type=</span><span class="s">"local"</span> <span class="na">disk=</span><span class="s">"125Gb"</span><span class="nt">&gt;</span>
        <span class="nt">&lt;gpu</span> <span class="na">count=</span><span class="s">"1"</span> <span class="na">memory=</span><span class="s">"16.0Gb"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;/resources&gt;</span>
<span class="nt">&lt;/nodes&gt;</span>
</code></pre></div></div>
<p>Save the <code class="language-plaintext highlighter-rouge">services.xml</code> file and redeploy:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vespa deploy --wait 900
</code></pre></div></div>
<p>Now let’s feed the larger dataset:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>time vespa feed ext/corpus_transformed_500000.jsonl
</code></pre></div></div>
<p><strong>Result</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa feed ext/corpus_transformed_500000.jsonl  97.59s user 95.22s system 10% cpu 29:38.18 total (~8.8 hours for full dataset)
</code></pre></div></div>
<p>If we extrapolate the results we see we got around twice the speed of the single-container node instance. But why not 5 times the speed? Let’s look at the metrics.</p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/5g1c_container.png" alt="5 GPU 1 content - container node resources" /></p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/5g1c_gpu.png" alt="5 GPU 1 content - container GPU resources" /></p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/5g1c_content.png" alt="5 GPU 1 content - content node resources" /></p>

<p>We see that the container GPU utilization now sits comfortably at around 50% and the container CPU at around 20-30%. But the content node CPU utilization sits near 100%. The 5 container nodes with GPUs saturate the single content node’s ability to take in data. We have found the new bottleneck of the system.</p>

<p>We’ll add some more content nodes:</p>

<p><strong>Content</strong></p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;nodes</span> <span class="na">deploy:environment=</span><span class="s">"dev"</span> <span class="na">count=</span><span class="s">"2"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;resources</span> <span class="na">vcpu=</span><span class="s">"1.0"</span> <span class="na">memory=</span><span class="s">"8Gb"</span> <span class="na">architecture=</span><span class="s">"arm64"</span> <span class="na">storage-type=</span><span class="s">"local"</span> <span class="na">disk=</span><span class="s">"59Gb"</span><span class="nt">/&gt;</span>
<span class="nt">&lt;/nodes&gt;</span> 
</code></pre></div></div>
<p>Delete the documents, redeploy, and refeed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vespa deploy --wait 900
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>time vespa feed ext/corpus_transformed_500000.jsonl
</code></pre></div></div>
<p><strong>Result</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa feed ext/corpus_transformed_500000.jsonl  92.22s user 88.21s system 17% cpu 16:51.51 total (~5.0 hours for full dataset)
</code></pre></div></div>
<p>Adding the second content node almost doubles the performance again. Look at the metrics to see what is going on.</p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/5g2c_container.png" alt="5 GPUs 2 content - container node resources" /></p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/5g2c_gpu.png" alt="5 GPUs 2 content - container GPU resources" /></p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/5g2c_content.png" alt="5 GPUs 2 content - content node resources" /></p>

<p>We see now that the container GPU (70-80%) and the content node CPU (80-90%) are both highly utilised, whilst the container node CPU is around 40%. Since we are already on the smallest instance type with a GPU we can’t scale down the CPU to match the others, so we have actually found a near-optimal balance of container and content node resources for feeding this application.</p>

<p>Now that we have found a good balance, let’s really scale up!</p>

<h1 id="feeding-fast-20-gpus">Feeding Fast: 20 GPUs</h1>

<p>If we want serious feed throughput, we need serious hardware. Let’s scale the container and content nodes proportionately and jump to 20 GPU container nodes and 8 content nodes at the same time:</p>

<p><strong>Container</strong></p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;nodes</span> <span class="na">deploy:environment=</span><span class="s">"dev"</span> <span class="na">count=</span><span class="s">"20"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;resources</span> <span class="na">vcpu=</span><span class="s">"4.0"</span> <span class="na">memory=</span><span class="s">"16Gb"</span> <span class="na">architecture=</span><span class="s">"x86_64"</span> <span class="na">storage-type=</span><span class="s">"local"</span> <span class="na">disk=</span><span class="s">"125Gb"</span><span class="nt">&gt;</span>
        <span class="nt">&lt;gpu</span> <span class="na">count=</span><span class="s">"1"</span> <span class="na">memory=</span><span class="s">"16.0Gb"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;/resources&gt;</span>
<span class="nt">&lt;/nodes&gt;</span>
</code></pre></div></div>

<p><strong>Content</strong></p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;nodes</span> <span class="na">deploy:environment=</span><span class="s">"dev"</span> <span class="na">count=</span><span class="s">"8"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;resources</span> <span class="na">vcpu=</span><span class="s">"1.0"</span> <span class="na">memory=</span><span class="s">"8Gb"</span> <span class="na">architecture=</span><span class="s">"arm64"</span> <span class="na">storage-type=</span><span class="s">"local"</span> <span class="na">disk=</span><span class="s">"59Gb"</span><span class="nt">/&gt;</span>
<span class="nt">&lt;/nodes&gt;</span>
</code></pre></div></div>
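<p>The node counts above keep the 5:2 container-to-content ratio found in the previous section. A small sketch of that proportional scaling, using a hypothetical helper <code class="language-plaintext highlighter-rouge">scale_cluster</code> that refuses targets which would need a fractional number of content nodes:</p>

```python
from fractions import Fraction


def scale_cluster(containers, content, target_containers):
    """Scale the content node count by the same factor as the container count."""
    factor = Fraction(target_containers, containers)
    scaled = content * factor
    if scaled.denominator != 1:
        raise ValueError("target does not keep the ratio with whole nodes")
    return target_containers, int(scaled)


print(scale_cluster(5, 2, 20))  # (20, 8)
```

<p>Exact fractions are used instead of floats so that the whole-node check cannot be thrown off by rounding.</p>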

<p><strong>NOTE:</strong> At this point you will most likely hit the <code class="language-plaintext highlighter-rouge">quotaExceeded</code> error when you try to deploy. Vespa Cloud tenants have a default quota that prevents you from accidentally spending a lot of money. If you want to go past it, reach out to Vespa support. With the limit raised, redeploy:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vespa deploy --wait 900
</code></pre></div></div>
<p>Delete any existing documents, wait for the count to hit zero, and feed the 500 000 line dataset again:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>time vespa feed ext/corpus_transformed_500000.jsonl
</code></pre></div></div>

<p><strong>Result</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa feed ext/corpus_transformed_500000.jsonl  59.66s user 48.08s system 42% cpu 4:13.19 total (~1 hour 15 min for full dataset)
</code></pre></div></div>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/20g8c_container.png" alt="20 GPUs 8 content - container node resources" /></p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/20g8c_gpu.png" alt="20 GPUs 8 content - container GPU resources" /></p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/20g8c_content.png" alt="20 GPUs 8 content - content node resources" /></p>

<p>At an estimated 1 hour and 15 minutes for the full dataset, we got almost exactly 4x the feeding speed with 4x the resources. The utilisation metrics are also essentially the same as in the last run (feeding at 11:30); the feed is just faster.</p>
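<p>The full-dataset estimate is a linear extrapolation from the timed sample run. A minimal sketch of that arithmetic, assuming the feed rate stays constant over the whole corpus (the function name is made up for illustration, and the full document count is the one reported by the feed client for the complete corpus):</p>

```python
# Extrapolate the full-corpus feed time from a timed sample run.
SAMPLE_DOCS = 500_000             # documents in the sample file
FULL_DOCS = 8_841_823             # document count reported for the full corpus
SAMPLE_SECONDS = 4 * 60 + 13.19   # "4:13.19 total" from the time output


def estimate_full_feed_seconds(sample_seconds, sample_docs, full_docs):
    """Scale the sample wall-clock time linearly by document count (assumes a constant rate)."""
    return sample_seconds * full_docs / sample_docs


full_seconds = estimate_full_feed_seconds(SAMPLE_SECONDS, SAMPLE_DOCS, FULL_DOCS)
print(f"sample rate: {SAMPLE_DOCS / SAMPLE_SECONDS:.0f} docs/s")
print(f"estimated full feed: {full_seconds / 60:.1f} minutes")
```

<p>The result lands at roughly an hour and a quarter, which matches the estimate quoted with the timing output.</p>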

<h1 id="feeding-furiously-100-gpus">Feeding Furiously: 100 GPUs</h1>

<p>Finally, because we can: 100 GPU container nodes and 40 content nodes, and this time we will feed the full 8.8 million passage dataset in one go.</p>

<p><strong>Container</strong></p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;nodes</span> <span class="na">deploy:environment=</span><span class="s">"dev"</span> <span class="na">count=</span><span class="s">"100"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;resources</span> <span class="na">vcpu=</span><span class="s">"4.0"</span> <span class="na">memory=</span><span class="s">"16Gb"</span> <span class="na">architecture=</span><span class="s">"x86_64"</span> <span class="na">storage-type=</span><span class="s">"local"</span> <span class="na">disk=</span><span class="s">"125Gb"</span><span class="nt">&gt;</span>
        <span class="nt">&lt;gpu</span> <span class="na">count=</span><span class="s">"1"</span> <span class="na">memory=</span><span class="s">"16.0Gb"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;/resources&gt;</span>
<span class="nt">&lt;/nodes&gt;</span>
</code></pre></div></div>

<p><strong>Content</strong></p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;nodes</span> <span class="na">deploy:environment=</span><span class="s">"dev"</span> <span class="na">count=</span><span class="s">"40"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;resources</span> <span class="na">vcpu=</span><span class="s">"1.0"</span> <span class="na">memory=</span><span class="s">"8Gb"</span> <span class="na">architecture=</span><span class="s">"arm64"</span> <span class="na">storage-type=</span><span class="s">"local"</span> <span class="na">disk=</span><span class="s">"59Gb"</span><span class="nt">/&gt;</span>
<span class="nt">&lt;/nodes&gt;</span>
</code></pre></div></div>

<p>Delete the documents, deploy, then feed the full dataset:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>time vespa feed ext/corpus_transformed_full.jsonl
</code></pre></div></div>

<p><strong>Result</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vespa feed ext/corpus_transformed_full.jsonl  695.31s user 605.48s system 108% cpu 20:04.23 total
</code></pre></div></div>

<p>The Vespa instance managed to process more than 8.8 million passages, with embeddings and ColBERT vectors computed for every single one, in just over 20 minutes (over a fast internet connection).</p>

<p><img src="/assets/2026-04-16-scaling-a-vespa-application-feeding-fast-and-furiously/100g40c_console_3.png" alt="100 GPUs 40 content - feed complete at 8.84M documents" /></p>

<p>The feed client also gives us a nice summary at the end:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"feeder.operation.count"</span><span class="p">:</span><span class="w"> </span><span class="mi">8841823</span><span class="p">,</span><span class="w">
  </span><span class="nl">"feeder.seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">1201.608</span><span class="p">,</span><span class="w">
  </span><span class="nl">"feeder.ok.count"</span><span class="p">:</span><span class="w"> </span><span class="mi">8841823</span><span class="p">,</span><span class="w">
  </span><span class="nl">"feeder.ok.rate"</span><span class="p">:</span><span class="w"> </span><span class="mf">7358.324</span><span class="p">,</span><span class="w">
  </span><span class="nl">"feeder.error.count"</span><span class="p">:</span><span class="w"> </span><span class="mi">399</span><span class="p">,</span><span class="w">
  </span><span class="nl">"http.request.count"</span><span class="p">:</span><span class="w"> </span><span class="mi">8844266</span><span class="p">,</span><span class="w">
  </span><span class="nl">"http.response.latency.millis.avg"</span><span class="p">:</span><span class="w"> </span><span class="mi">167</span><span class="p">,</span><span class="w">
  </span><span class="nl">"http.response.code.counts"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"200"</span><span class="p">:</span><span class="w"> </span><span class="mi">8841823</span><span class="p">,</span><span class="w">
    </span><span class="nl">"429"</span><span class="p">:</span><span class="w"> </span><span class="mi">2044</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>The feeding process had an average feed rate of around <strong>7358 documents per second</strong>. Now that is fast and furious!</p>
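<p>The headline rate can be re-derived from the summary itself. The sketch below copies a few fields from the output above and recomputes the rate, plus the share of requests that were answered with HTTP 429 (Too Many Requests). Reading the 429s as throttled requests that were later retried is an interpretation based on every operation ending in a 200, not something the summary states directly.</p>

```python
import json

# Fields copied from the feed client summary above (trimmed to the ones used here).
summary = json.loads("""
{
  "feeder.seconds": 1201.608,
  "feeder.ok.count": 8841823,
  "http.request.count": 8844266,
  "http.response.code.counts": {"200": 8841823, "429": 2044}
}
""")

# Successful operations per second of wall-clock feeding time.
ok_rate = summary["feeder.ok.count"] / summary["feeder.seconds"]

# Share of all HTTP requests that came back as 429.
throttled = summary["http.response.code.counts"]["429"]
throttled_share = throttled / summary["http.request.count"]

print(f"ok rate: {ok_rate:.0f} docs/s")
print(f"429 responses: {throttled} ({throttled_share:.4%} of requests)")
```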

<h1 id="conclusion">Conclusion</h1>

<p>The best way to scale your Vespa instance is to use the metrics dashboards to see where the bottlenecks lie. There is no single best Vespa configuration, because the compute requirements depend heavily on how you define your application. Feed the instance a sizable corpus to see how it performs under sustained load, and adjust its resources accordingly.</p>
]]></content:encoded>
        <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/scaling-a-vespa-application-feeding-fast-and-furiously/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/scaling-a-vespa-application-feeding-fast-and-furiously/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        
      </item>
    
      <item>
        <title>The Vespa Cloud Metrics Dashboard</title>
        <description>A guide to the Vespa Cloud metrics dashboard — how to move from symptom to bottleneck to action, and what&apos;s new in the latest revision.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-04-24-the-vespa-cloud-metrics-dashboard/dashboard-overview.png" />
        
        <content:encoded><![CDATA[<p>When something goes wrong in production, the hard part is rarely finding a metric.
The hard part is figuring out <strong>which metric tells you where to look next</strong>.</p>

<p>The Vespa Cloud metrics dashboard is designed for exactly that.
Instead of treating monitoring as a wall of graphs, it helps you move from
<strong>symptom → bottleneck → action</strong>.</p>

<h2 id="start-with-three-questions">Start with three questions</h2>

<p>Most production issues can be reduced to three questions:</p>

<ol>
  <li><strong>Is the system healthy?</strong></li>
  <li><strong>Where is latency added?</strong></li>
  <li><strong>Are we running out of resources?</strong></li>
</ol>

<p>The dashboard mirrors that flow.</p>

<h3 id="1-is-the-system-healthy">1. Is the system healthy?</h3>

<p>Start on the <strong>Overview</strong> tab. This is the fastest place to answer “is anything
obviously broken?”. A healthy system keeps read and write QoS close to 100%.
If it drops, look at whether 4xx or 5xx responses are rising — 5xx responses
usually mean the problem is on the server side. A rise in degraded or failed queries
means it is time to continue into the Query tab.</p>

<p><img src="/assets/2026-04-24-the-vespa-cloud-metrics-dashboard/dashboard-overview.png" alt="Metrics Overview" /></p>

<p>See the docs for the full reference:
<a href="https://docs.vespa.ai/en/operations/monitoring.html#overview-tab">Metrics Overview tab</a>.</p>

<h3 id="2-where-is-latency-added">2. Where is latency added?</h3>

<p>Latency in Vespa is layered — a slow request is not just “slow”, it can be
slow in different parts of the path:</p>

<p><strong>HTTP → container → content nodes → ranking</strong></p>

<p>That is why the dashboard shows several latency metrics for what feels like
the same request. If HTTP latency is much higher than query latency,
payload size or network overhead may be the issue. If search-protocol latency
on the content nodes is high, the bottleneck is deeper in the system.</p>

<p><img src="/assets/2026-04-24-the-vespa-cloud-metrics-dashboard/dashboard-query-rate-latency.png" alt="Query rate / Latency" /></p>

<p>See the docs for a layer-by-layer walkthrough:
<a href="https://docs.vespa.ai/en/operations/monitoring.html#query-tab">Query tab</a>
and <a href="https://docs.vespa.ai/en/operations/monitoring.html#feed-tab">Feed tab</a>.</p>

<h3 id="3-are-we-running-out-of-resources">3. Are we running out of resources?</h3>

<p>Once you know where the slowdown is, switch to the <strong>Resources</strong> tab.
As a rule of thumb, sustained utilization above roughly 80% is a sign the
cluster may need more headroom. If one host is much hotter than the others,
enable per-host metrics and look for uneven load distribution.</p>
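<p>As a rough illustration of those two checks, the sketch below flags hosts whose utilization stays above the 80% rule of thumb and hosts that run much hotter than the cluster average. The host names, the samples, and the 1.25x imbalance factor are all hypothetical; this does not read from any Vespa metrics API.</p>

```python
from statistics import mean

SATURATION_THRESHOLD = 0.80   # rule of thumb from the text
IMBALANCE_FACTOR = 1.25       # hypothetical cut-off for "much hotter than the others"

# Hypothetical per-host CPU utilization samples over a time window (fractions of 1.0).
samples = {
    "host-a": [0.55, 0.58, 0.60],
    "host-b": [0.52, 0.57, 0.59],
    "host-c": [0.91, 0.93, 0.95],
}

host_avg = {host: mean(values) for host, values in samples.items()}
cluster_avg = mean(host_avg.values())

# "Sustained" here means every sample in the window exceeds the threshold.
saturated = [
    host for host, values in samples.items()
    if all(value > SATURATION_THRESHOLD for value in values)
]

# A host is "hot" when its average is well above the cluster-wide average.
hot = [host for host, avg in host_avg.items() if avg > IMBALANCE_FACTOR * cluster_avg]

print("saturated:", saturated)
print("uneven load:", hot)
```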

<p><img src="/assets/2026-04-24-the-vespa-cloud-metrics-dashboard/dashboard-resources.png" alt="Node Resources" /></p>

<p>See the docs for healthy-value tables and scaling guidance:
<a href="https://docs.vespa.ai/en/operations/monitoring.html#resources-tab">Resources tab</a>.</p>

<h2 id="whats-new-in-the-latest-revision">What’s new in the latest revision</h2>

<p>The dashboard has picked up a few improvements worth calling out.</p>

<h3 id="health-indicators-on-the-overview-tab">Health Indicators on the Overview tab</h3>

<p>The Overview tab now opens with a dedicated <strong>Health Indicators</strong> row —
five stat panels that surface stability issues in a single glance:
Core Dumps (1h), Restarts (1h), Feed Blocked, Content Cluster with Groups/Nodes Down,
and Container Nodes with Services Down.</p>

<p><img src="/assets/2026-04-24-the-vespa-cloud-metrics-dashboard/dashboard-health.png" alt="Health Indicators" /></p>

<p>Details and healthy values:
<a href="https://docs.vespa.ai/en/operations/monitoring.html#health-indicators">Health Indicators</a>.</p>

<h3 id="annotations-for-service-restart-and-core-dump">Annotations for Service restart and Core dump</h3>

<p><strong>Annotations</strong> are the vertical lines drawn across every chart when an
operational event happens — Vespa upgrades, feed blocked, data migration, reindexing,
autoscaling changes. Two annotations were added recently and they are worth flagging:</p>

<ul>
  <li><strong>Service restart</strong> — fires when a Vespa service process restarts.
Outside of planned upgrades, restarts usually mean a crash, OOM, or forced stop.</li>
  <li><strong>Core dump</strong> — fires when a process core-dumps. Should be extremely rare.</li>
</ul>

<p>When a latency anomaly lines up with one of these annotations,
you get the context for the change without having to infer it from the graph alone.
Both signals also feed the Overview’s Health Indicators row, so the same event
shows up in three places: the counter, the annotation line, and the Health tab’s
historical time series.</p>

<p><img src="/assets/2026-04-24-the-vespa-cloud-metrics-dashboard/dashboard-annotation.png" alt="Dashboard Annotation" /></p>

<p>Full annotation reference:
<a href="https://docs.vespa.ai/en/operations/monitoring.html#dashboard-annotations">Annotations</a>.</p>

<h3 id="container-thread-pool-rows-one-per-configuration-case">Container thread pool rows, one per configuration case</h3>

<p>The Resources tab used to have a single thread-pool row that was mostly empty —
a container only has the thread pools that match its <code class="language-plaintext highlighter-rouge">services.xml</code> configuration
(<code class="language-plaintext highlighter-rouge">&lt;search&gt;</code>, <code class="language-plaintext highlighter-rouge">&lt;document-api&gt;</code>, or both). The row has been split into three
case-specific rows:</p>

<ul>
  <li><strong>Thread Pools (search + document-api)</strong> for full-feature containers</li>
  <li><strong>Thread Pools (search only)</strong> for query-only containers</li>
  <li><strong>Thread Pools (document-api only)</strong> for feed-only containers</li>
</ul>

<p>Classification is automatic — hidden variables derive the cluster list per case
from Prometheus set operations, so only relevant rows render for a given deployment.
Each thread pool now gets its own panel with avg (green) and max (yellow dashed)
on the same chart.</p>

<p><img src="/assets/2026-04-24-the-vespa-cloud-metrics-dashboard/dashboard-thread-pools.png" alt="Dashboard Thread Pools" /></p>

<p>Details:
<a href="https://docs.vespa.ai/en/operations/monitoring.html#container-thread-pools">Container Thread Pools</a>.</p>

<h3 id="jvm-memory-breakdown-heap--direct--native">JVM memory breakdown (heap / direct / native)</h3>

<p>The Resources tab separates the three layers of container memory: <strong>heap</strong>,
<strong>direct</strong>, and <strong>native</strong>. This matters on container nodes that run embedders
or local LLM components — model weights are memory-mapped and partially resident,
but KV cache and compute buffers are allocated upfront as <strong>native</strong> memory.
When node memory is high but heap and direct look normal, the native layer
is usually where to look.</p>

<p><img src="/assets/2026-04-24-the-vespa-cloud-metrics-dashboard/dashboard-jvm.png" alt="Dashboard JVM" /></p>

<p>Details:
<a href="https://docs.vespa.ai/en/operations/monitoring.html#jvm-memory">JVM memory breakdown</a>.</p>

<h2 id="a-simple-workflow">A simple workflow</h2>

<p>A practical way to use the dashboard during an incident:</p>

<ol>
  <li>Open <strong>Overview</strong> and scan the Health Indicators row.</li>
  <li>Confirm the symptom (QoS drop, latency spike, error-rate increase).</li>
  <li>Use <strong>Query</strong> or <strong>Feed</strong> to find the slow layer.</li>
  <li>Use <strong>Resources</strong> to confirm whether the cluster is saturated.</li>
  <li>Cross-reference <strong>annotations</strong> for restarts, upgrades, reindexing, or migration.</li>
</ol>

<p>That flow gets from “latency is up” to “this is the actual bottleneck” much faster
than scanning every chart. The
<a href="https://docs.vespa.ai/en/operations/monitoring.html#dashboard-workflows">common workflows</a>
section of the docs has recipes for the most frequent scenarios.</p>

<h2 id="summary">Summary</h2>

<p>The Vespa Cloud metrics dashboard works best as a troubleshooting tool —
not a metrics catalog. Start with health, follow the latency path, confirm with
resources, and use annotations to connect spikes to real events. The tab reference,
healthy-value tables, and step-by-step workflows live in the
<a href="https://docs.vespa.ai/en/operations/monitoring.html#vespa-cloud-dashboard">Monitoring documentation</a>.</p>
]]></content:encoded>
        <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/the-vespa-cloud-metrics-dashboard/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/the-vespa-cloud-metrics-dashboard/</guid>
        
        <category>monitoring</category>
        
        <category>metrics</category>
        
        <category>performance</category>
        
        
      </item>
    
      <item>
        <title>Using Large ONNX Models with External Data in Vespa Embedders</title>
        <description>Many ONNX models exceed the 2GB protobuf limit and store weights in external data files. Vespa now supports these models for embedders.
</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-03-27-onnx-external-data-in-vespa-embedders/onnx-external-data-splash.png" />
        
        <content:encoded><![CDATA[<p>Many popular ONNX models exceed the 2 GB <a href="https://protobuf.dev/">protobuf</a> format limit and store their weights in separate external data files.
Until recently, these models could not be used directly in Vespa’s built-in embedders.</p>

<p>This was a long-requested feature on our tracker (see <a href="https://github.com/vespa-engine/vespa/issues/28761">GitHub issue #28761</a>).</p>

<h2 id="the-2-gb-limitation">The 2 GB limitation</h2>

<p><a href="https://onnx.ai/">ONNX</a> uses Google’s Protocol Buffers as its serialization format.
Protobuf has a hard limit of 2 GB on message size.
For smaller models, this is not a problem — all tensor data (the model weights) is embedded directly in the <code class="language-plaintext highlighter-rouge">.onnx</code> file,
making it self-contained.</p>

<p>As models grow larger, they inevitably hit this limitation.
For a model exceeding 2 GB, ONNX tooling splits it into two parts:</p>

<ul>
  <li>A small <strong><code class="language-plaintext highlighter-rouge">.onnx</code> file</strong> containing the model graph structure (typically a few hundred KB to a few MB).</li>
  <li>One or more <strong>external data files</strong> (commonly named <code class="language-plaintext highlighter-rouge">.onnx_data</code>) containing the actual tensor weights.</li>
</ul>

<p>Note that reduced-precision variants of these models (INT8, FP16, etc.) are often small enough to fit in a single self-contained <code class="language-plaintext highlighter-rouge">.onnx</code> file.
The external data split primarily affects the full-precision versions.</p>

<p>Previously, if you pointed a Vespa embedder at a model with external data files, ONNX Runtime would fail to load it
because the data files were not available alongside the model file.</p>

<h2 id="what-changed">What changed</h2>

<p>Vespa embedders now automatically handle ONNX models with external data files.
When you configure an embedder with a URL pointing to an <code class="language-plaintext highlighter-rouge">.onnx</code> file,
Vespa inspects the model to check whether it references any external data files.
If it does, Vespa downloads those files automatically before loading the model.</p>

<p>This feature is available starting from Vespa 8.544.</p>

<h2 id="how-to-use-it">How to use it</h2>

<p>Here is an example using EmbeddingGemma 300M, which uses external data:</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;container</span> <span class="na">id=</span><span class="s">"default"</span> <span class="na">version=</span><span class="s">"1.0"</span><span class="nt">&gt;</span>
  <span class="nt">&lt;component</span> <span class="na">id=</span><span class="s">"gemma"</span> <span class="na">type=</span><span class="s">"hugging-face-embedder"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;transformer-model</span>
      <span class="na">url=</span><span class="s">"https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX/resolve/main/onnx/model.onnx"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;tokenizer-model</span>
      <span class="na">url=</span><span class="s">"https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX/resolve/main/tokenizer.json"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;max-tokens&gt;</span>2048<span class="nt">&lt;/max-tokens&gt;</span>
    <span class="nt">&lt;prepend&gt;</span>
      <span class="nt">&lt;query&gt;</span>task: search result | query: <span class="nt">&lt;/query&gt;</span>
      <span class="nt">&lt;document&gt;</span>title: none | text: <span class="nt">&lt;/document&gt;</span>
    <span class="nt">&lt;/prepend&gt;</span>
  <span class="nt">&lt;/component&gt;</span>
<span class="nt">&lt;/container&gt;</span>
</code></pre></div></div>

<p>If you are deploying to <a href="https://cloud.vespa.ai/">Vespa Cloud</a>, you can also use models from the
<a href="https://docs.vespa.ai/en/rag/model-hub.html">Vespa Model Hub</a> that use external data.
For example, the Multilingual-E5-large model (will be available on Vespa Cloud 8.668+):</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;container</span> <span class="na">id=</span><span class="s">"default"</span> <span class="na">version=</span><span class="s">"1.0"</span><span class="nt">&gt;</span>
  <span class="nt">&lt;component</span> <span class="na">id=</span><span class="s">"e5"</span> <span class="na">type=</span><span class="s">"hugging-face-embedder"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;transformer-model</span> <span class="na">model-id=</span><span class="s">"multilingual-e5-large"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;max-tokens&gt;</span>512<span class="nt">&lt;/max-tokens&gt;</span>
    <span class="nt">&lt;prepend&gt;</span>
      <span class="nt">&lt;query&gt;</span>query: <span class="nt">&lt;/query&gt;</span>
      <span class="nt">&lt;document&gt;</span>passage: <span class="nt">&lt;/document&gt;</span>
    <span class="nt">&lt;/prepend&gt;</span>
  <span class="nt">&lt;/component&gt;</span>
<span class="nt">&lt;/container&gt;</span>
</code></pre></div></div>

<p>This works with our ONNX-based embedders:</p>

<ul>
  <li><a href="https://docs.vespa.ai/en/embedding.html#huggingface-embedder"><code class="language-plaintext highlighter-rouge">hugging-face-embedder</code></a></li>
  <li><a href="https://docs.vespa.ai/en/embedding.html#colbert-embedder"><code class="language-plaintext highlighter-rouge">colbert-embedder</code></a></li>
  <li><a href="https://docs.vespa.ai/en/embedding.html#splade-embedder"><code class="language-plaintext highlighter-rouge">splade-embedder</code></a></li>
</ul>

<p>It’s also possible to use <a href="https://docs.vespa.ai/en/reference/rag/embedding.html#private-model-hub">private models</a> — authentication tokens are propagated when downloading external data files.</p>

<h2 id="current-limitations">Current limitations</h2>

<p>There are a few constraints to be aware of:</p>

<ul>
  <li>
    <p><strong>Embedders only.</strong> Models used directly in <a href="https://docs.vespa.ai/en/ranking/onnx.html">ranking expressions</a>
must still be self-contained and under 2 GB.</p>
  </li>
  <li>
    <p><strong>URL-referenced or Model Hub models only.</strong> Models bundled in the
<a href="https://docs.vespa.ai/en/application-packages.html">application package</a>
using the <code class="language-plaintext highlighter-rouge">path</code> attribute do not support external data.
Models referenced via <code class="language-plaintext highlighter-rouge">url</code> or <code class="language-plaintext highlighter-rouge">model-id</code> (Vespa Cloud) are supported.</p>
  </li>
  <li>
    <p><strong>External data files must be co-located with the model.</strong>
The external data files are resolved relative to the model URL.
They must be in the same directory (or a subdirectory) as the <code class="language-plaintext highlighter-rouge">.onnx</code> file.</p>
  </li>
</ul>
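<p>To make the co-location rule concrete, here is a sketch of standard relative URL resolution applied to the EmbeddingGemma model URL from the example above. It shows the general mechanism only and is not taken from Vespa's implementation; the external data file names are assumed for illustration.</p>

```python
from urllib.parse import urljoin

# Model URL from the EmbeddingGemma example earlier in this post.
MODEL_URL = (
    "https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX"
    "/resolve/main/onnx/model.onnx"
)


def resolve_external_data(model_url, location):
    """Resolve an external data location relative to the directory of the model file."""
    return urljoin(model_url, location)


# Both file names below are assumptions, used only to show the resolution.
print(resolve_external_data(MODEL_URL, "model.onnx_data"))
print(resolve_external_data(MODEL_URL, "weights/part_0.bin"))
```

<p>A location in the same directory or a subdirectory resolves under the model's own path, which is why the files have to live next to the <code class="language-plaintext highlighter-rouge">.onnx</code> file.</p>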

<p>See the <a href="https://docs.vespa.ai/en/ranking/onnx.html#limitations-on-model-size-and-complexity">ONNX model documentation</a>
for the full list of requirements.</p>

<p>If you need more extensive support for ONNX models with external data — for example in ranking expressions —
feel free to <a href="https://github.com/vespa-engine/vespa/issues">file an issue</a>.</p>
]]></content:encoded>
        <pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/onnx-external-data-in-vespa-embedders/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/onnx-external-data-in-vespa-embedders/</guid>
        
        <category>embedding</category>
        
        <category>onnx</category>
        
        
      </item>
    
      <item>
        <title>Asymmetric Retrieval: Spend on Docs, Embed your Queries for Free</title>
        <description>Documents are embedded once — worth the spend for maximum quality. Queries hit you on every request. This is what drives your cost at scale. Asymmetric retrieval with Voyage AI and Vespa. Real numbers, real config.
</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-03-10-asymmetric-retrieval-spend-on-docs-queries-for-free/hero.png" />
        
        <content:encoded><![CDATA[<p>At 10,000 queries per second with ~30-token queries, you’re pushing ~18 million tokens per minute through your embedding API. At $0.02 per million tokens, that’s <strong>over $15,000/month</strong> — just for query embeddings. Documents are embedded once. Queries are embedded forever.</p>

<p>What if you could drop that to $0?</p>

<p>That’s the promise of <strong>asymmetric retrieval</strong>: embed your documents with the best model money can buy, then embed queries with a tiny model running locally — for free. Voyage AI’s new <a href="https://blog.voyageai.com/2026/01/15/voyage-4/">voyage-4 family</a> is the first to make this practical, and Vespa now has native support for it.</p>

<h2 id="the-asymmetric-insight">The asymmetric insight</h2>

<p>The conventional approach is to use the same embedding model for documents and queries. Same model, same vector space. But it ignores a fundamental asymmetry.</p>

<p>Document embedding is a <strong>one-time cost</strong>. You embed each document once at indexing time, and it’s not latency-sensitive — whether it takes 10ms or 500ms doesn’t matter because no user is waiting. You can throw the biggest, most accurate model at it and take your time.</p>

<p>Query embedding is the opposite. It’s on the <strong>critical path of every single request</strong>, continuously, at scale. It needs to be fast, and at 10K QPS the cost dwarfs everything else.</p>

<p>Why use the same model for both?</p>

<p>Asymmetric retrieval splits these two concerns:</p>

<ol>
  <li><strong>Documents</strong> — Embed once with <code class="language-plaintext highlighter-rouge">voyage-4-large</code>. Best accuracy, API-based, no rush.</li>
  <li><strong>Queries</strong> — Embed continuously with <code class="language-plaintext highlighter-rouge">voyage-4-nano</code>. Tiny, local, free.</li>
</ol>

<p>This works because all four models in the Voyage 4 family — <code class="language-plaintext highlighter-rouge">voyage-4-large</code>, <code class="language-plaintext highlighter-rouge">voyage-4</code>, <code class="language-plaintext highlighter-rouge">voyage-4-lite</code>, and <code class="language-plaintext highlighter-rouge">voyage-4-nano</code> — produce <strong>compatible embeddings in a shared vector space</strong>.</p>

<p><img src="/assets/2026-03-10-asymmetric-retrieval-spend-on-docs-queries-for-free/asymmetric-embeddings.png" alt="Asymmetric retrieval: documents embedded with voyage-4-large via API, queries embedded with voyage-4-nano locally" /></p>

<p>It also means you can upgrade your query model independently. Start with <code class="language-plaintext highlighter-rouge">voyage-4-nano</code> for cost, move to <code class="language-plaintext highlighter-rouge">voyage-4-lite</code> for quality — without re-embedding a single document.</p>

<p>The shared embedding space opens up document-side flexibility too. In a multi-tenant system, you could use different models for different tiers — <code class="language-plaintext highlighter-rouge">voyage-4-large</code> for premium customers who need the best retrieval quality, <code class="language-plaintext highlighter-rouge">voyage-4-lite</code> for cost-sensitive tenants — all searchable with the same query model. Same index, same query path, different quality/cost tradeoffs per tenant.</p>

<h2 id="the-numbers">The numbers</h2>

<h3 id="cost">Cost</h3>

<p>Let’s be concrete about the 10K QPS scenario:</p>

<ul>
  <li>10,000 QPS × 30 tokens = 300,000 tokens/sec</li>
  <li>300,000 × 60 × 60 × 24 × 30 = ~777 billion tokens/month</li>
  <li>At $0.02/1M tokens ≈ <strong>$15,500/month</strong> for query embeddings via API</li>
</ul>
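<p>The same arithmetic as a small script, in case you want to plug in your own traffic numbers. It assumes a 30-day month and a flat price per token, as in the list above.</p>

```python
QPS = 10_000
TOKENS_PER_QUERY = 30
PRICE_PER_MILLION_TOKENS = 0.02          # USD
SECONDS_PER_MONTH = 60 * 60 * 24 * 30    # 30-day month

tokens_per_second = QPS * TOKENS_PER_QUERY
tokens_per_month = tokens_per_second * SECONDS_PER_MONTH
monthly_cost = tokens_per_month / 1_000_000 * PRICE_PER_MILLION_TOKENS

print(f"{tokens_per_second:,} tokens/sec")
print(f"{tokens_per_month / 1e9:.1f} billion tokens/month")
print(f"${monthly_cost:,.0f}/month")
```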

<p>With <code class="language-plaintext highlighter-rouge">voyage-4-nano</code> running locally on the Vespa container: <strong>$0/month</strong>. The model runs as part of the serving infrastructure you’re already paying for.</p>

<h3 id="latency">Latency</h3>

<p>API calls add network round-trips. Local inference on <code class="language-plaintext highlighter-rouge">voyage-4-nano</code> runs in single-digit milliseconds on CPU.</p>

<h3 id="quality">Quality</h3>

<p>Voyage 4 is state-of-the-art. On the <a href="https://blog.voyageai.com/2026/01/15/voyage-4/">RTEB benchmark</a> (29 retrieval datasets, NDCG@10), <code class="language-plaintext highlighter-rouge">voyage-4-large</code> beats the competition:</p>

<style>
  table, th, td {
    border: 1px solid black;
  }
  th, td {
    padding: 5px;
  }
</style>

<table>
  <thead>
    <tr>
      <th>Comparison</th>
      <th>Improvement</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>vs. Gemini Embedding 001</td>
      <td>+3.87%</td>
    </tr>
    <tr>
      <td>vs. Cohere Embed v4</td>
      <td>+8.20%</td>
    </tr>
    <tr>
      <td>vs. OpenAI v3 Large</td>
      <td>+14.05%</td>
    </tr>
  </tbody>
</table>

<p><br />
And asymmetric retrieval — querying with a smaller model against <code class="language-plaintext highlighter-rouge">voyage-4-large</code> document embeddings — preserves retrieval quality across medical, code, web, finance, and legal domains.</p>

<h3 id="storage">Storage</h3>

<p>Binary quantization gives you a <strong>16x memory reduction</strong> over bfloat16 — 2048-dim vectors go from 4,096 bytes to 256 bytes. The full-precision floats are still used for second-phase reranking, <a href="https://docs.vespa.ai/en/content/attributes.html#paged-attributes-disadvantages">paged from disk</a> only when needed. For a deeper dive on quantization tradeoffs, see <a href="https://blog.vespa.ai/embedding-tradeoffs-quantified/">Embedding Tradeoffs, Quantified</a>.</p>
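<p>The 16x figure follows directly from the bit widths. The sketch below works through the sizes and shows one common way to binarize a vector, keeping one sign bit per dimension. Thresholding at zero and the helper name are assumptions for illustration, not a description of how Vespa packs the bits internally.</p>

```python
DIMS = 2048

bfloat16_bytes = DIMS * 2    # 2 bytes per dimension
binary_bytes = DIMS // 8     # 1 bit per dimension, 8 dimensions per byte


def binarize(vector):
    """Keep one bit per dimension (1 if the value is positive) and pack 8 bits per byte.

    Assumes len(vector) is a multiple of 8.
    """
    packed = bytearray(len(vector) // 8)
    for i, value in enumerate(vector):
        if value > 0:
            packed[i // 8] |= 128 >> (i % 8)   # most significant bit first
    return bytes(packed)


example = [0.5, -0.1] * (DIMS // 2)   # hypothetical 2048-dim embedding
print(bfloat16_bytes, "bytes as bfloat16")
print(len(binarize(example)), "bytes as packed bits")
print(bfloat16_bytes // binary_bytes, "x smaller")
```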

<h2 id="why-this-matters-at-scale">Why this matters at scale</h2>

<p>Cost and quality are table stakes. The real question for large-scale systems is: does this work in production?</p>

<h3 id="independent-scaling">Independent scaling</h3>

<p>Vespa separates stateless containers (where embedding runs) from content clusters (where data lives). This means you can scale query embedding capacity independently from storage. Need more QPS? Add container nodes. More documents? Add content nodes. They don’t interfere.</p>

<h3 id="no-external-api-on-the-query-path">No external API on the query path</h3>

<p>This is the underrated benefit. With asymmetric retrieval, the query embedding model runs locally inside Vespa — your critical search path has zero dependency on an external API.</p>

<p>That matters when:</p>

<ul>
  <li><strong>The API goes down.</strong> Every embedding API has outages. If your query path depends on one, your search goes down with it.</li>
  <li><strong>You get rate-limited.</strong> Traffic spikes don’t care about your API quota. A sudden 3x in query volume means dropped requests — or queued requests that blow your latency budget.</li>
  <li><strong>You need to scale fast.</strong> Adding Vespa container nodes takes minutes. Negotiating a higher API rate limit may take days. On <a href="https://docs.vespa.ai/en/cloud/autoscaling.html">Vespa Cloud</a>, autoscaling handles traffic spikes automatically, because container clusters are stateless and scale up quickly.</li>
</ul>

<p>Keeping the query path self-contained turns your search system from “works when everything is up” into “works, period.”</p>

<h3 id="two-phase-ranking">Two-phase ranking</h3>

<p>Binary vectors are fast — Vespa can do ~1 billion hamming distance calculations per second. But binary quantization loses precision. Vespa’s <a href="https://docs.vespa.ai/en/ranking/phased-ranking.html">phased ranking</a> recovers it:</p>

<ol>
  <li><strong>First phase</strong>: Hamming distance on binary embeddings. Fast, cheap, scans the full index.</li>
  <li><strong>Second phase</strong>: Float dot-product on the top 2,000 candidates. Accurate, but only touches a bounded set of vectors paged from disk.</li>
</ol>

<p>This gives you the speed of binary search with the accuracy of full-precision reranking.</p>
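<p>To make the two phases concrete, here is a minimal pure-Python sketch of the idea. It is not how Vespa implements ranking internally: the function names, the tuple layout of <code class="language-plaintext highlighter-rouge">docs</code>, and the toy 8-bit vectors are all assumptions made for illustration.</p>

```python
from typing import List, Sequence, Tuple

# Hypothetical in-memory document: (doc_id, packed_bits, float_vector)
Doc = Tuple[str, Sequence[int], Sequence[float]]


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count differing bits between two equally long sequences of packed bytes."""
    return sum(bin((x ^ y) & 0xFF).count("1") for x, y in zip(a, b))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two float vectors."""
    return sum(x * y for x, y in zip(a, b))


def two_phase_search(
    query_bits: Sequence[int],
    query_float: Sequence[float],
    docs: List[Doc],
    rerank_count: int,
    hits: int,
) -> List[Tuple[str, float]]:
    # First phase: cheap hamming distance against every document.
    by_hamming = sorted(docs, key=lambda d: hamming_distance(query_bits, d[1]))
    candidates = by_hamming[:rerank_count]
    # Second phase: accurate float dot product, but only on the bounded candidate set.
    scored = [(doc_id, dot(query_float, vec)) for doc_id, _, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:hits]


# Toy data: 8-bit binary vectors and 2-dim float vectors.
docs = [
    ("a", [0b11110000], [1.0, 0.0]),
    ("b", [0b11110001], [0.0, 1.0]),
    ("c", [0b00001111], [5.0, 5.0]),
]
print(two_phase_search([0b11110000], [0.2, 0.9], docs, rerank_count=2, hits=2))
```

<p>Document <code class="language-plaintext highlighter-rouge">c</code> never reaches the second phase because its hamming distance puts it outside the rerank window. That is the tradeoff the rerank count controls: a larger window costs more float reads but gives the first phase more room for error.</p>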

<h3 id="enterprise-proven">Enterprise-proven</h3>

<p>This isn’t theoretical. Vespa runs search and recommendation at Spotify, Yahoo, and Perplexity — billions of documents, thousands of QPS, sub-100ms latency. The architecture handles it.</p>

<h2 id="how-to-set-this-up">How to set this up</h2>

<p>Here’s the complete Vespa configuration for asymmetric retrieval with Voyage AI.</p>

<h3 id="schema">Schema</h3>

<p>Two embedding fields — binary for fast retrieval, float for accurate reranking:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>schema doc {
  document doc {
    field id type string {
      indexing: summary | attribute
    }
    field text type string {
      indexing: index | summary
    }
  }

  field embedding_float type tensor&lt;bfloat16&gt;(x[2048]) {
    indexing: input text | embed voyage-4-large | attribute
    attribute {
      distance-metric: prenormalized-angular
      paged
    }
  }

  field embedding_binary type tensor&lt;int8&gt;(x[256]) {
    indexing: input text | embed voyage-4-large | attribute
    attribute {
      distance-metric: hamming
    }
  }
}
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">paged</code> attribute on <code class="language-plaintext highlighter-rouge">embedding_float</code> tells Vespa to keep these vectors on disk, paging them into memory only during second-phase reranking. The binary embeddings stay in memory for fast first-phase retrieval.</p>

<h3 id="embedders-servicesxml">Embedders (services.xml)</h3>

<p>Two embedders — one API-based for documents, one local for queries:</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;container</span> <span class="na">id=</span><span class="s">"default"</span> <span class="na">version=</span><span class="s">"1.0"</span><span class="nt">&gt;</span>
  <span class="nt">&lt;component</span> <span class="na">id=</span><span class="s">"voyage-4-large"</span> <span class="na">type=</span><span class="s">"voyage-ai-embedder"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;model&gt;</span>voyage-4-large<span class="nt">&lt;/model&gt;</span>
    <span class="nt">&lt;api-key-secret-ref&gt;</span>apiKey<span class="nt">&lt;/api-key-secret-ref&gt;</span>
    <span class="nt">&lt;dimensions&gt;</span>2048<span class="nt">&lt;/dimensions&gt;</span>
    <span class="nt">&lt;batching</span> <span class="na">max-size=</span><span class="s">"20"</span> <span class="na">max-delay=</span><span class="s">"20ms"</span><span class="nt">/&gt;</span>
  <span class="nt">&lt;/component&gt;</span>

  <span class="nt">&lt;component</span> <span class="na">id=</span><span class="s">"voyage-4-nano"</span> <span class="na">type=</span><span class="s">"hugging-face-embedder"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;transformer-model</span> <span class="na">model-id=</span><span class="s">"voyage-4-nano-int8"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;tokenizer-model</span> <span class="na">model-id=</span><span class="s">"voyage-4-nano-vocab"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;max-tokens&gt;</span>32768<span class="nt">&lt;/max-tokens&gt;</span>
    <span class="nt">&lt;pooling-strategy&gt;</span>mean<span class="nt">&lt;/pooling-strategy&gt;</span>
    <span class="nt">&lt;normalize&gt;</span>true<span class="nt">&lt;/normalize&gt;</span>
    <span class="nt">&lt;prepend&gt;</span>
      <span class="nt">&lt;query&gt;</span>Represent the query for retrieving supporting documents: <span class="nt">&lt;/query&gt;</span>
    <span class="nt">&lt;/prepend&gt;</span>
  <span class="nt">&lt;/component&gt;</span>
<span class="nt">&lt;/container&gt;</span>
</code></pre></div></div>

<p>The <a href="https://docs.vespa.ai/en/rag/embedding.html#voyageai-embedder"><code class="language-plaintext highlighter-rouge">voyage-ai-embedder</code></a> handles vector quantization automatically — it infers the target precision from the destination tensor type. bfloat16 fields get full-precision embeddings; int8 fields get binary representations.</p>

<p>The <code class="language-plaintext highlighter-rouge">hugging-face-embedder</code> runs <code class="language-plaintext highlighter-rouge">voyage-4-nano</code> locally. No API calls, no rate limits, no cost. Both model references (<code class="language-plaintext highlighter-rouge">voyage-4-nano-int8</code>, <code class="language-plaintext highlighter-rouge">voyage-4-nano-vocab</code>) resolve via the <a href="https://docs.vespa.ai/en/rag/model-hub.html">Vespa Model Hub</a>.</p>

<p><strong>A note on “quantization” — two different things.</strong> The <code class="language-plaintext highlighter-rouge">voyage-4-nano-int8</code> in the <code class="language-plaintext highlighter-rouge">model-id</code> refers to <strong>model weight quantization</strong>: the ONNX model file uses INT8 weights instead of FP32, which makes inference 2-3x faster on CPU with negligible quality loss. This is about how the <em>model itself</em> is stored and executed. The embedder still produces full-precision float vectors as output. <strong>Vector quantization</strong> is a separate concern — it’s about the precision of the <em>output embeddings</em> you store and search over (bfloat16, int8/binary, etc.). That’s controlled by the tensor type in your schema field, not the model format. These are independent knobs: you can run an INT8-quantized model that outputs float vectors, then store them as binary. For a deeper dive with benchmarks on both, see <a href="https://blog.vespa.ai/embedding-tradeoffs-quantified/">Embedding Tradeoffs, Quantified</a>.</p>
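<p>As a rough illustration of what vector binarization does to the numbers, the sketch below thresholds each float at zero and packs eight bits into one signed 8-bit value, which is how 2048 floats can fit a <code class="language-plaintext highlighter-rouge">tensor&lt;int8&gt;(x[256])</code> field. The bit order and the zero threshold are assumptions for this example; the exact encoding produced by the embedder may differ.</p>

```python
from typing import List, Sequence


def binarize(vector: Sequence[float]) -> List[int]:
    """Pack the sign bits of a float vector into signed 8-bit integers.

    Assumptions for this sketch: values greater than zero map to bit 1,
    and the first dimension lands in the most significant bit.
    """
    if len(vector) % 8 != 0:
        raise ValueError("vector length must be a multiple of 8")
    packed = []
    for start in range(0, len(vector), 8):
        byte = 0
        for value in vector[start:start + 8]:
            byte = (byte << 1) | (1 if value > 0 else 0)
        # Reinterpret the unsigned byte (0..255) as a signed int8 (-128..127).
        packed.append(byte - 256 if byte > 127 else byte)
    return packed


print(binarize([1.0, -1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0]))  # [-80]
print(len(binarize([0.1] * 2048)))  # 256
```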

<h3 id="rank-profile">Rank profile</h3>

<p>Two-phase ranking: hamming distance first, float reranking second:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rank-profile binary-with-rerank {
  inputs {
    query(q_float) tensor&lt;float&gt;(x[2048])
    query(q_bin) tensor&lt;int8&gt;(x[256])
  }

  function binary_closeness() {
    expression: 1 - (distance(field, embedding_binary) / 2048)
  }

  function float_closeness() {
    expression: reduce(query(q_float) * attribute(embedding_float), sum, x)
  }

  first-phase {
    expression: binary_closeness
  }

  second-phase {
    expression: float_closeness
    rerank-count: 2000
  }
}
</code></pre></div></div>

<h3 id="querying">Querying</h3>

<p>Both query tensors are produced by the local <code class="language-plaintext highlighter-rouge">voyage-4-nano</code> embedder:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>yql=select * from doc where {targetHits: 100}nearestNeighbor(embedding_binary, q_bin)
&amp;ranking=binary-with-rerank
&amp;input.query(q_bin)=embed(voyage-4-nano, "your query here")
&amp;input.query(q_float)=embed(voyage-4-nano, "your query here")
&amp;hits=10
</code></pre></div></div>

<p>The <a href="https://docs.vespa.ai/en/nearest-neighbor-search.html">nearest neighbor search</a> runs on the binary field for speed, while the rank profile handles two-phase scoring.</p>
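<p>If you are issuing this request from code, the parameters need URL encoding. The following sketch builds the same request with the Python standard library only. The endpoint <code class="language-plaintext highlighter-rouge">https://example.invalid</code> and the helper names are placeholders, and rejecting double quotes in the query text is a simplification made here rather than a Vespa rule.</p>

```python
from typing import Dict, Union
from urllib.parse import urlencode


def build_query_params(query_text: str, target_hits: int = 100, hits: int = 10) -> Dict[str, Union[str, int]]:
    """Build the query parameters shown above for a given query string."""
    if '"' in query_text:
        # Escaping inside embed(...) is out of scope for this sketch.
        raise ValueError("query text must not contain double quotes")
    embed_expr = f'embed(voyage-4-nano, "{query_text}")'
    return {
        "yql": (
            "select * from doc where "
            f"{{targetHits: {target_hits}}}nearestNeighbor(embedding_binary, q_bin)"
        ),
        "ranking": "binary-with-rerank",
        # The same local embedder produces both query tensors.
        "input.query(q_bin)": embed_expr,
        "input.query(q_float)": embed_expr,
        "hits": hits,
    }


def build_search_url(endpoint: str, query_text: str) -> str:
    """Return a URL-encoded GET request against the /search/ path of the endpoint."""
    return f"{endpoint.rstrip('/')}/search/?{urlencode(build_query_params(query_text))}"


# Placeholder endpoint; substitute your own application endpoint.
print(build_search_url("https://example.invalid", "your query here"))
```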

<p>For a complete runnable example with pyvespa, see the <a href="https://vespa-engine.github.io/pyvespa/examples/voyage-ai-embeddings-cloud.html">Voyage AI embeddings notebook</a>.</p>

<h2 id="wrapping-up">Wrapping up</h2>

<p>Asymmetric retrieval makes the most sense when:</p>

<ul>
  <li><strong>High QPS</strong> — The cost savings scale linearly. At 10K QPS, you’re saving $15.5K/month. At 100K QPS, it’s $155K.</li>
  <li><strong>Large corpus</strong> — Documents are embedded once, so the large model cost is amortized. The bigger the corpus, the more you benefit from cheap queries.</li>
  <li><strong>Latency-sensitive</strong> — Local inference eliminates network round-trips.</li>
</ul>

<p>When a single model is the better choice:</p>

<ul>
  <li><strong>Low volume and latency-tolerant</strong> — At 10 QPS, the API cost is ~$15/month and the network round-trip doesn’t matter. One model is simpler to operate.</li>
  <li><strong>Quality above all else</strong> — Using <code class="language-plaintext highlighter-rouge">voyage-4-large</code> for both documents and queries gives you the best possible retrieval quality. If you can afford the API cost and latency, symmetric with the top model is hard to beat.</li>
</ul>
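<p>The linear scaling behind these figures is easy to reproduce. The sketch below takes the $15.5K/month at 10K QPS quoted above as its only reference point and assumes the savings equal the avoided API cost at a constant per-query price, which is a simplification.</p>

```python
# Reference point taken from the text: 10K QPS corresponds to $15.5K/month.
REFERENCE_QPS = 10_000
REFERENCE_MONTHLY_USD = 15_500


def monthly_query_embedding_cost(qps: float) -> float:
    """Scale the reference cost linearly with query volume (assumed constant price per query)."""
    return REFERENCE_MONTHLY_USD * qps / REFERENCE_QPS


for qps in (10, 10_000, 100_000):
    print(f"{qps:>7} QPS: ${monthly_query_embedding_cost(qps):,.2f}/month")
```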

<p>The Voyage 4 family and Vespa’s native integration make asymmetric retrieval practical for the first time. Embed documents with the best model available, query with a tiny local model, and let phased ranking close the quality gap.</p>

<p><strong>Resources:</strong></p>

<ul>
  <li><a href="https://vespa-engine.github.io/pyvespa/examples/voyage-ai-embeddings-cloud.html">Voyage AI embeddings notebook</a> — Full runnable example</li>
  <li><a href="https://docs.vespa.ai/en/embedding.html">Embedding documentation</a> — Configuring embedders in Vespa</li>
  <li><a href="https://docs.vespa.ai/en/rag/binarizing-vectors.html">Binary quantization guide</a> — Deep dive on binarization</li>
  <li><a href="https://docs.vespa.ai/en/ranking/phased-ranking.html">Phased ranking</a> — Multi-phase ranking architecture</li>
  <li><a href="https://blog.voyageai.com/2026/01/15/voyage-4/">Voyage 4 announcement</a> — Model family details and benchmarks</li>
</ul>

<p>For those interested in learning more about Vespa, join the <a href="https://vespatalk.slack.com/">Vespa community on Slack</a> to exchange ideas,
seek assistance from the community, or stay in the loop on the latest Vespa developments.</p>
]]></content:encoded>
        <pubDate>Tue, 10 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/asymmetric-retrieval-spend-on-docs-queries-for-free/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/asymmetric-retrieval-spend-on-docs-queries-for-free/</guid>
        
        <category>embedding</category>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>voyage-ai</category>
        
        
      </item>
    
      <item>
        <title>How Metal AI Built an Agent-Driven Intelligence Platform on Vespa Cloud</title>
        <description>How Metal built an AI-Native Intelligence Platform on Vespa.ai, where 95% of retrieval is handled by AI agents.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-03-10-metal-case-study-agent-driven-intelligence-on-vespa-cloud/MetalxVespa.png" />
        
        <content:encoded><![CDATA[<blockquote>
  <p>“95% of our retrieval is done by AI agents.” - Sergio Prada, Co-Founder &amp; CTO, Metal</p>
</blockquote>

<p>Metal needed a retrieval foundation that could evolve as fast as their product, without hitting a wall.</p>

<h2 id="introduction">Introduction</h2>

<p>Private equity firms manage vast amounts of unstructured data, including deal documents, expert call transcripts, financial statements, CRM records, and more. The challenge isn’t simply accessing this information. It’s connecting and understanding it, in context, across the investment lifecycle.</p>

<p><a href="https://www.metal.ai/?utm_source=chatgpt.com">Metal AI</a> was built to address this challenge. Its purpose-built institutional intelligence platform, used by established private equity firms, transforms fragmented historical and live deal data into a living system of record that drives conviction at every stage of the investment lifecycle.</p>

<p>To deliver this vision at scale, Metal leverages <a href="http://vespa.ai">Vespa.ai</a> as its core retrieval layer, powering entity relationships, advanced ranking, and real-time context-aware retrieval across complex investment data.</p>

<h2 id="the-need-for-relationship-driven-retrieval">The Need for Relationship-Driven Retrieval</h2>

<p>As Metal’s product evolved, the limitations of traditional retrieval systems became clear.</p>

<p>Early architecture supported basic document search, but private equity workflows aren’t document-centric. They are entity- and relationship-driven. The enduring edge in private equity lies in drawing on decades of deal history, portfolio outcomes, and institutional knowledge. When that depth of experience surfaces reasoning and connections across time, every investment decision carries greater conviction.</p>

<p>Most traditional vector stores and search engines are fundamentally document-first. They index text, return similar passages, and rely primarily on semantic similarity or keyword matching. But for Metal’s use case, relevance requires more:</p>

<ul>
  <li>
    <p>Understanding which answer is the most recent and legally approved</p>
  </li>
  <li>
    <p>Identifying which company a metric belongs to</p>
  </li>
  <li>
    <p>Connecting meetings to prior diligence activity</p>
  </li>
  <li>
    <p>Applying business logic alongside semantic similarity</p>
  </li>
</ul>

<p>As Metal introduced more advanced workflows, like DDQ automation and agent-driven retrieval, the gap widened. Traditional systems struggle to:</p>

<ul>
  <li>
    <p>Combine semantic similarity with recency and compliance rules within ranking</p>
  </li>
  <li>
    <p>Support evolving data models without significant rework</p>
  </li>
  <li>
    <p>Query across multiple object types in a unified way</p>
  </li>
  <li>
    <p>Serve as a foundation for structured, iterative queries issued by AI agents</p>
  </li>
</ul>

<p>Layering custom logic on top of limited retrieval infrastructure would have created increasing technical debt, and each new entity type or ranking rule risked architectural compromise.</p>

<p>Metal needed a retrieval foundation that could evolve with the product, not constrain it.</p>

<h2 id="choosing-a-retrieval-layer-without-limits">Choosing a Retrieval Layer without Limits</h2>

<p>Metal wasn’t simply selecting a search engine. They were selecting a long-term retrieval architecture.</p>

<p>Several capabilities distinguished Vespa:</p>

<ul>
  <li>
    <p><strong>Multi-entity modeling:</strong> Vespa supports multiple object types, like documents, people, activities, and financial data, as well as the relationships between them. This aligned with how Metal structures institutional knowledge.</p>
  </li>
  <li>
    <p><strong>Advanced ranking and filtering:</strong> Vespa can combine semantic similarity with structured filters like recency and business rules, enabling Metal to tailor retrieval to specific workflows.</p>
  </li>
  <li>
    <p><strong>Flexibility without re-architecture:</strong> New object types can be introduced without migrating existing data or rebuilding the system.</p>
  </li>
  <li>
    <p><strong>Operational simplicity:</strong> Moving to Vespa Cloud enabled the team to focus engineering capacity on product innovation instead of infrastructure.</p>
  </li>
</ul>

<p>These capabilities give Metal the ability to shape retrieval around business logic, rather than forcing business logic to adapt to infrastructure limitations.</p>

<blockquote>
  <p>“Our competitors focus on documents. With Vespa, we can focus on the
full picture: companies, people, activities, and how they relate.” -
Sergio Prada, Co-Founder &amp; CTO, Metal</p>
</blockquote>

<h2 id="architecture-in-action">Architecture in Action</h2>

<p>Metal treats retrieval as part of an AI agent orchestration layer, not just a standard search box.</p>

<p>When a user or agent asks a question like, “What’s this company’s EBITDA?”, the query is first interpreted by an AI agent. Rather than issuing a single plain-text search, the agent:</p>

<ul>
  <li>
    <p>Determines which entity types to query (documents, companies, metrics, activities)</p>
  </li>
  <li>
    <p>Applies structured parameters such as recency or workflow-specific filters</p>
  </li>
  <li>
    <p>Executes retrieval against Vespa</p>
  </li>
  <li>
    <p>Iterates as needed (paginating, refining, or querying related entities)</p>
  </li>
  <li>
    <p>Assembles sufficient context before generating a response</p>
  </li>
</ul>

<p>Vespa powers this retrieval layer, enabling fast, structured queries across different object types and supporting the iterative retrieval process required by Metal’s agent-driven architecture.</p>

<h2 id="turning-ddq-chaos-into-structured-approved-intelligence">Turning DDQ Chaos into Structured, Approved Intelligence</h2>

<p>One clear example is Metal’s Due Diligence Questionnaire (DDQ) workflow. Private equity firms must respond to thousands of LP questionnaires using pre-approved answers. These responses cannot be freely generated by an LLM. They must come from content that has already been reviewed and approved by legal teams.</p>

<p>Answer banks change over time and are stored in unstructured formats like documents and spreadsheets. Metal indexes this data into Vespa, making the system aware of which documents are most recent. When answering a questionnaire, retrieval is prioritized not only by semantic similarity to the question but also by freshness.</p>

<p>This allows Metal to surface the most relevant and up-to-date approved answers efficiently and reliably within its platform.</p>
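<p>One simple way to picture ranking by both similarity and freshness is a weighted blend of the two signals. The sketch below is purely illustrative and is not Metal’s actual ranking logic: the <code class="language-plaintext highlighter-rouge">Answer</code> class, the weight, and the one-year half-life are all hypothetical.</p>

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Answer:
    answer_id: str
    similarity: float  # semantic similarity to the question, assumed in 0..1
    age_days: float    # days since the answer was approved


def freshness(age_days: float, half_life_days: float = 365.0) -> float:
    """Exponential decay: 1.0 for a brand-new answer, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life_days)


def score(answer: Answer, freshness_weight: float = 0.3) -> float:
    """Blend similarity and freshness; the weight is an arbitrary illustrative choice."""
    return (1 - freshness_weight) * answer.similarity + freshness_weight * freshness(answer.age_days)


def rank(answers: List[Answer]) -> List[Answer]:
    """Order answers from highest to lowest blended score."""
    return sorted(answers, key=score, reverse=True)


old = Answer("approved-2021", similarity=0.90, age_days=1460)
new = Answer("approved-2025", similarity=0.88, age_days=0)
print([a.answer_id for a in rank([old, new])])
```

<p>Here the slightly less similar but much newer answer ranks first, which matches the behavior described above.</p>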

<h2 id="scaling-without-infrastructure-headaches">Scaling without Infrastructure Headaches</h2>

<p>By building on <a href="https://vespa.ai/solutions/vespa-cloud/">Vespa Cloud</a>, Metal achieved:</p>

<ul>
  <li>
    <p>Improved feature velocity: The team can introduce new entity types and workflows quickly without architectural rework.</p>
  </li>
  <li>
    <p>Greater engineering focus: The team spends less time managing infrastructure and more time building differentiating product features.</p>
  </li>
  <li>
    <p>Scalable retrieval architecture: Metal can onboard new clients and data volumes without redesigning retrieval.</p>
  </li>
  <li>
    <p>Confidence in long-term flexibility: Vespa is not a limiting factor as Metal expands into more advanced agent-driven workflows.</p>
  </li>
</ul>

<blockquote>
  <p>“Managing infrastructure can be a distraction. Vespa Cloud lets us focus on product.” - Sergio Prada, Co-Founder &amp; CTO, Metal</p>
</blockquote>

<h2 id="looking-forward-build-for-an-agentic-future">Looking Forward: Build for an Agentic Future</h2>

<p>Metal’s roadmap is deeply agentic. AI agents drive most interactions, deciding how best to query the platform and construct the context needed to answer sophisticated questions.</p>

<p>Because Vespa supports flexible, multi-entity retrieval with advanced ranking and real-time performance, Metal can:</p>

<ul>
  <li>
    <p>Expand into more advanced analysis workflows</p>
  </li>
  <li>
    <p>Build deeper relational structures between entities</p>
  </li>
  <li>
    <p>Adapt retrieval strategies dynamically as business logic evolves</p>
  </li>
</ul>

<p>The result is an institutional intelligence platform that scales in both data volume and intelligence, evolving alongside the firm it serves.</p>

<blockquote>
  <p>“When you’re building something ambitious, you don’t want to hit a capability wall. Vespa gives us confidence that we won’t.” - Sergio Prada, Co-Founder &amp; CTO, Metal</p>
</blockquote>

]]></content:encoded>
        <pubDate>Tue, 10 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/agent-driven-intelligence-on-vespa-cloud/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/agent-driven-intelligence-on-vespa-cloud/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        
      </item>
    
      <item>
        <title>Build a High-Quality RAG App on Vespa Cloud in 15 Minutes</title>
        <description>Retrieval-Augmented Generation (RAG) allows an LLM to answer questions using your data at query time. On their own, LLMs are powerful but limited: they can hallucinate, they have a fixed knowledge cutoff, and they know nothing about your private documents, internal wikis, or proprietary systems.
</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/illustration_2.png" />
        
        <content:encoded><![CDATA[<p><strong>Retrieval-Augmented Generation (RAG)</strong> allows an LLM to answer questions using your data at query time. On their own, LLMs are powerful but limited: they can hallucinate, they have a fixed knowledge cutoff, and they know nothing about your private documents, internal wikis, or proprietary systems.</p>

<p>RAG bridges that gap by retrieving relevant information from your data and supplying it to the model as context, so responses are grounded in real, trusted sources rather than guesswork.</p>

<h2 id="the-challenge-the-quality-of-the-context-window">The Challenge: The Quality of the Context Window</h2>

<p>In Retrieval-Augmented Generation (RAG), the real bottleneck is the LLM’s context window. You can’t simply pass your entire dataset into a prompt—there’s a strict token budget.</p>

<p>Because of this, the problem isn’t just retrieving information, but retrieving the right information. When the context window is filled with loosely matched or low-quality results, the LLM has little to work with and the quality of its answers drops accordingly.</p>

<p>High-quality RAG depends on semantic understanding, precise retrieval, and strong ranking across diverse data types so that every token in the context window earns its place.</p>
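<p>A small sketch shows why ranking quality matters under a fixed token budget: only the best-scoring chunks that fit make it into the prompt. The whitespace-based token estimate and the greedy selection are simplifying assumptions; a real application would use the LLM’s own tokenizer.</p>

```python
from typing import List, Tuple


def estimate_tokens(text: str) -> int:
    """Crude token estimate based on whitespace splitting (an assumption for this sketch)."""
    return len(text.split())


def pack_context(ranked_chunks: List[Tuple[float, str]], token_budget: int) -> List[str]:
    """Greedily select the highest-scoring chunks that still fit in the budget."""
    selected: List[str] = []
    used = 0
    for _, text in sorted(ranked_chunks, key=lambda chunk: chunk[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost > token_budget:
            continue  # this chunk does not fit, but a shorter one further down might
        selected.append(text)
        used += cost
    return selected


chunks = [(0.9, "a b c"), (0.5, "d e f g"), (0.7, "h i")]
print(pack_context(chunks, token_budget=5))
```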

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/illustration_2.png" alt="illustration_2" /></p>

<h2 id="the-solution-out-of-the-box-rag-on-vespa-cloud">The Solution: Out-of-the-Box RAG on Vespa Cloud</h2>

<p>Vespa Cloud provides an out-of-the-box Vespa <a href="https://docs.vespa.ai/en/examples/rag-blueprint.html">RAG Blueprint</a> designed to maximize the quality of the context sent to the LLM. Instead of relying solely on nearest-neighbor vector search, Vespa combines semantic vector retrieval with lexical BM25 scoring and applies advanced ranking, using models such as BERT, LightGBM, or custom logic, to ensure that only the strongest candidates are selected.</p>

<p>This hybrid retrieval and ranking approach consistently surfaces the most relevant document chunks, which significantly improves the quality of the final generated answer.</p>

<p>In this blog post, we’ll build a complete RAG application from end to end using the out-of-the-box Vespa RAG app on Vespa Cloud. The following diagram shows the architecture we’ll be working with:</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/architecture_diagram.png" alt="Vespa RAG Architecture" /></p>

<p>The architecture consists of two main flows: data ingestion and query processing.</p>

<p><strong>Data Ingestion (one-time setup)</strong></p>

<p>First, we ingest our data sources, such as documents, PDFs, or web pages, using a Python-based pipeline. The pipeline processes the data, splits it into manageable chunks, generates embeddings, and feeds everything into a Vespa Cloud RAG application that is preconfigured with a schema and ranking profiles. This step populates the search index.</p>

<p><strong>Query Flow (live interaction)</strong></p>

<ol>
  <li>
    <p>A user enters a question in the <strong>Vespa RAG UI</strong>.</p>
  </li>
  <li>
    <p>The UI sends the query to a <strong>Python backend</strong>, which issues a hybrid search request (combining keyword and vector retrieval) to <strong>Vespa Cloud</strong>.</p>
  </li>
  <li>
    <p><strong>Vespa Cloud</strong> returns the most relevant document chunks.</p>
  </li>
  <li>
    <p>The backend sends those chunks, along with the original query, to an <strong>LLM</strong> as context.</p>
  </li>
  <li>
    <p>The model generates an answer grounded in that context and returns it to the backend.</p>
  </li>
  <li>
    <p>The backend streams the answer back to the UI.</p>
  </li>
</ol>

<p>This architecture ensures that generated responses are grounded in your own data, combining Vespa’s retrieval and ranking strengths with the generative capabilities of large language models.</p>
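<p>Steps 2 to 5 of the query flow can be sketched in a few lines of Python. This is a simplified outline rather than the sample application’s actual backend: <code class="language-plaintext highlighter-rouge">search</code> and <code class="language-plaintext highlighter-rouge">generate</code> are hypothetical callables standing in for the Vespa Cloud request and the LLM call, and the prompt wording is made up.</p>

```python
from typing import Callable, List


def build_prompt(question: str, chunks: List[str]) -> str:
    """Combine the retrieved chunks and the original question into one prompt."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer_question(
    question: str,
    search: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    chunks = search(question)                # steps 2-3: hybrid search returns ranked chunks
    prompt = build_prompt(question, chunks)  # step 4: chunks plus query become the LLM context
    return generate(prompt)                  # step 5: the model answers from that context


# Stubs so the sketch runs without a Vespa deployment or an LLM.
def fake_search(question: str) -> List[str]:
    return ["Vespa combines vector and lexical retrieval."]


def fake_generate(prompt: str) -> str:
    return prompt  # echo the prompt to show what the model would receive


print(answer_question("What does Vespa combine?", fake_search, fake_generate))
```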

<p>The end-to-end setup takes about 15 minutes, plus additional time to process your documents.</p>

<hr />

<h2 id="deploy-vespa-rag-blueprint-to-vespa-cloud">Deploy Vespa RAG Blueprint to Vespa Cloud</h2>

<p>We’ll start by deploying a preconfigured RAG Blueprint to Vespa Cloud. This gives you a high-quality retrieval stack in minutes, and it’s free to get started. All of this is done directly from the Vespa Cloud console.</p>

<p><strong>Sign up for Vespa Cloud</strong></p>

<p>Go to the <a href="https://console.vespa-cloud.com/">Vespa Cloud Console</a> and create an account. If this is your first time using Vespa Cloud, the free trial is the fastest way to get going.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_1.png" alt="image_1" /></p>

<p><strong>Deploy RAG Blueprint</strong></p>

<p>In the console, select <strong>“Deploy your first application”</strong>.
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_2.png" alt="image_2" /></p>

<p>Choose <strong>“Select a sample application to deploy directly from the browser”</strong>.
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_3.png" alt="image_3" /></p>

<p>Select <strong>“RAG Blueprint”</strong>.
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_4.png" alt="image_4" /></p>

<p>Click <strong>“Deploy”</strong> and wait for the deployment to complete.
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_5.png" alt="image_5" />
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_8.png" alt="image_8" /></p>

<p><strong>Save your credentials</strong></p>

<p>Once deployment finishes, the console will generate an access token. <strong>Save this immediately.</strong>
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_9.png" alt="image_9" /></p>

<p>That token is how the Python backend authenticates with Vespa Cloud. Treat it like a password.</p>

<p>Continue through the remaining setup screens, then open the application view.
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_10.png" alt="image_10" />
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_11.png" alt="image_11" />
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_12.png" alt="image_12" />
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_13.png" alt="image_13" /></p>

<p><strong>Note your endpoint URL</strong></p>

<p>In the application view you will also find the endpoint URL. Save both the <strong>endpoint URL</strong> and the token; you will need them to configure the Python backend in the next section.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_15.png" alt="image_15" /></p>

<p>You can download the Vespa application package by clicking the download icon if you’d like. From there, you can start building your data feeding pipeline, frontend service UI, and more. However, this blog provides a sample end-to-end RAG application that already includes the same Vespa application package, so there’s no need to download it separately.</p>

<h2 id="behind-the-scenes-what-you-just-deployed">Behind the Scenes: What You Just Deployed</h2>

<p>When you clicked <strong>Deploy</strong>, Vespa Cloud automatically provisioned infrastructure and deployed a complete <strong>Vespa application package</strong>. This package includes everything needed for a high-quality RAG system: schemas, indexing logic, ranking profiles, and service configuration.</p>

<p>In other words, you didn’t just spin up a demo, you launched a ready-to-use, high-quality retrieval engine.</p>

<p>Let’s take a closer look at what’s inside.</p>

<h3 id="the-schema">The Schema</h3>

<p>The RAG Blueprint uses a carefully designed schema that controls how documents are stored, chunked, embedded, and retrieved:</p>

<p><code class="language-plaintext highlighter-rouge">vespa_cloud/schemas/doc.sd</code>:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">schema</span> <span class="n">doc</span> <span class="o">{</span>
    <span class="n">document</span> <span class="n">doc</span> <span class="o">{</span>
        <span class="n">field</span> <span class="n">id</span> <span class="n">type</span> <span class="n">string</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">summary</span> <span class="o">|</span> <span class="n">attribute</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">title</span> <span class="n">type</span> <span class="n">string</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">index</span> <span class="o">|</span> <span class="n">summary</span>
            <span class="nl">index:</span> <span class="n">enable</span><span class="o">-</span><span class="n">bm25</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">text</span> <span class="n">type</span> <span class="n">string</span> <span class="o">{</span>
        <span class="o">}</span>

        <span class="err">#</span> <span class="nc">Optional</span> <span class="n">metadata</span> <span class="n">fields</span> <span class="k">for</span> <span class="n">tracking</span> <span class="n">document</span> <span class="n">usage</span>
        <span class="n">field</span> <span class="n">created_timestamp</span> <span class="n">type</span> <span class="kt">long</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">summary</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">modified_timestamp</span> <span class="n">type</span> <span class="kt">long</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">summary</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">last_opened_timestamp</span> <span class="n">type</span> <span class="kt">long</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">summary</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">open_count</span> <span class="n">type</span> <span class="kt">int</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">summary</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">favorite</span> <span class="n">type</span> <span class="n">bool</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">summary</span>
        <span class="o">}</span>
    <span class="o">}</span>

    <span class="err">#</span> <span class="nc">Binary</span> <span class="n">quantized</span> <span class="n">embeddings</span> <span class="k">for</span> <span class="n">the</span> <span class="nf">title</span> <span class="o">(</span><span class="mi">768</span> <span class="n">floats</span> <span class="err">→</span> <span class="mi">96</span> <span class="n">int8</span><span class="o">)</span>
    <span class="n">field</span> <span class="n">title_embedding</span> <span class="n">type</span> <span class="n">tensor</span><span class="o">&lt;</span><span class="n">int8</span><span class="o">&gt;(</span><span class="n">x</span><span class="o">[</span><span class="mi">96</span><span class="o">])</span> <span class="o">{</span>
        <span class="nl">indexing:</span> <span class="n">input</span> <span class="n">title</span> <span class="o">|</span> <span class="n">embed</span> <span class="o">|</span> <span class="n">pack_bits</span> <span class="o">|</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">index</span>
        <span class="n">attribute</span> <span class="o">{</span>
            <span class="n">distance</span><span class="o">-</span><span class="nl">metric:</span> <span class="n">hamming</span>
        <span class="o">}</span>
    <span class="o">}</span>

    <span class="err">#</span> <span class="nc">Automatically</span> <span class="n">chunks</span> <span class="n">text</span> <span class="n">into</span> <span class="mi">1024</span><span class="o">-</span><span class="n">character</span> <span class="n">segments</span>
    <span class="n">field</span> <span class="n">chunks</span> <span class="n">type</span> <span class="n">array</span><span class="o">&lt;</span><span class="n">string</span><span class="o">&gt;</span> <span class="o">{</span>
        <span class="nl">indexing:</span> <span class="n">input</span> <span class="n">text</span> <span class="o">|</span> <span class="n">chunk</span> <span class="n">fixed</span><span class="o">-</span><span class="n">length</span> <span class="mi">1024</span> <span class="o">|</span> <span class="n">summary</span> <span class="o">|</span> <span class="n">index</span>
        <span class="nl">index:</span> <span class="n">enable</span><span class="o">-</span><span class="n">bm25</span>
    <span class="o">}</span>

    <span class="err">#</span> <span class="nc">Binary</span> <span class="n">quantized</span> <span class="n">embeddings</span> <span class="k">for</span> <span class="n">each</span> <span class="n">chunk</span>
    <span class="n">field</span> <span class="n">chunk_embeddings</span> <span class="n">type</span> <span class="n">tensor</span><span class="o">&lt;</span><span class="n">int8</span><span class="o">&gt;(</span><span class="n">chunk</span><span class="o">{},</span> <span class="n">x</span><span class="o">[</span><span class="mi">96</span><span class="o">])</span> <span class="o">{</span>
        <span class="nl">indexing:</span> <span class="n">input</span> <span class="n">text</span> <span class="o">|</span> <span class="n">chunk</span> <span class="n">fixed</span><span class="o">-</span><span class="n">length</span> <span class="mi">1024</span> <span class="o">|</span> <span class="n">embed</span> <span class="o">|</span> <span class="n">pack_bits</span> <span class="o">|</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">index</span>
        <span class="n">attribute</span> <span class="o">{</span>
            <span class="n">distance</span><span class="o">-</span><span class="nl">metric:</span> <span class="n">hamming</span>
        <span class="o">}</span>
    <span class="o">}</span>

    <span class="n">fieldset</span> <span class="k">default</span> <span class="o">{</span>
        <span class="nl">fields:</span> <span class="n">title</span><span class="o">,</span> <span class="n">chunks</span>
    <span class="o">}</span>

    <span class="n">document</span><span class="o">-</span><span class="n">summary</span> <span class="n">top_3_chunks</span> <span class="o">{</span>
        <span class="n">from</span><span class="o">-</span><span class="n">disk</span>
        <span class="n">summary</span> <span class="n">chunks_top3</span> <span class="o">{</span>
            <span class="nl">source:</span> <span class="n">chunks</span>
            <span class="n">select</span><span class="o">-</span><span class="n">elements</span><span class="o">-</span><span class="nl">by:</span> <span class="n">top_3_chunk_sim_scores</span>
        <span class="o">}</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p><strong>What’s happening here:</strong> Your documents store their raw content in <code class="language-plaintext highlighter-rouge">title</code> and <code class="language-plaintext highlighter-rouge">text</code>. During indexing, the <code class="language-plaintext highlighter-rouge">text</code> field is automatically split into 1024-character chunks. Embeddings are generated for both titles and chunks, then binary-quantized using <code class="language-plaintext highlighter-rouge">pack_bits</code>, shrinking 768 floating-point values down to just 96 <code class="language-plaintext highlighter-rouge">int8</code>s. This dramatically reduces storage and improves performance while still supporting efficient vector similarity search.</p>
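
<p>To make the 768 → 96 arithmetic concrete, here is a minimal pure-Python sketch of bit packing and Hamming distance. It assumes the usual convention (positive values become 1, everything else 0, most significant bit first); the helper names and the toy vectors are made up for illustration and are not Vespa’s implementation.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def pack_bits(values):
    """Binarize floats (1 for positive, 0 otherwise) and pack 8 bits per signed byte."""
    packed = []
    for start in range(0, len(values), 8):
        byte = 0
        for position, value in enumerate(values[start:start + 8]):
            # value == abs(value) and value != 0 holds only for positive values
            if value == abs(value) and value != 0:
                byte += 2 ** (7 - position)  # most significant bit first
        # Reinterpret the unsigned byte (0..255) as a signed int8 (-128..127)
        packed.append(int.from_bytes(bytes([byte]), "big", signed=True))
    return packed


def hamming_distance(a, b):
    """Count differing bits between two packed int8 vectors of equal length."""
    return sum(bin((x ^ y) % 256).count("1") for x, y in zip(a, b))


# Two toy 768-dimensional "embeddings" (made-up values, not real model output)
title_vec = [1.0 if i % 2 == 0 else -1.0 for i in range(768)]
query_vec = [1.0 if i % 3 == 0 else -1.0 for i in range(768)]

packed_title = pack_bits(title_vec)
packed_query = pack_bits(query_vec)
print(len(packed_title))  # 96 bytes for 768 dimensions
print(hamming_distance(packed_title, packed_query))
</code></pre></div></div>

<p>A smaller Hamming distance means more matching bits, which is why the schema pairs the packed fields with <code class="language-plaintext highlighter-rouge">distance-metric: hamming</code>.</p>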

<p>At the same time, BM25 is enabled for lexical matching. This combination is what enables Vespa’s hybrid retrieval: semantic matching plus exact term relevance.</p>

<p><strong>Out-of-the-Box Query Profiles:</strong></p>

<p>The RAG Blueprint ships with four query profiles optimized for NyRAG’s client-side RAG architecture:</p>

<p><strong>NyRAG Architecture:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>User Query → NyRAG (generates search queries)
          → Vespa (retrieval + ranking)
          → NyRAG (generates final answer)
</code></pre></div></div>
<p>Query profiles control <strong>only the Vespa retrieval/ranking step</strong>. NyRAG handles all LLM interactions.</p>

<p><strong>The 4 Profiles:</strong></p>

<ol>
  <li><strong>hybrid</strong> (default, fast)
    <ul>
      <li><strong>Retrieval:</strong> BM25 + Vector search with <code class="language-plaintext highlighter-rouge">targetHits:100</code></li>
      <li><strong>Ranking:</strong> Learned linear model (logistic regression)</li>
      <li><strong>Best for:</strong> Everyday queries where you want fast, solid results</li>
    </ul>
  </li>
  <li><strong>hybrid-with-gbdt</strong> (highest quality)
    <ul>
      <li><strong>Retrieval:</strong> Same as hybrid (BM25 + Vector, 100 targets)</li>
      <li><strong>Ranking:</strong> Two-phase with LightGBM (GBDT) second-phase</li>
      <li><strong>Best for:</strong> Complex queries where relevance matters most (~2-3x slower)</li>
    </ul>
  </li>
  <li><strong>deepresearch</strong> (exhaustive search)
    <ul>
      <li><strong>Retrieval:</strong> BM25 + Vector with <code class="language-plaintext highlighter-rouge">targetHits:10000</code> (100x more!)</li>
      <li><strong>Ranking:</strong> Learned linear model</li>
      <li><strong>Best for:</strong> Research scenarios needing maximum recall</li>
    </ul>
  </li>
  <li><strong>deepresearch-with-gbdt</strong> (exhaustive + best quality)
    <ul>
      <li><strong>Retrieval:</strong> Deep search (10k targets)</li>
      <li><strong>Ranking:</strong> Two-phase with GBDT</li>
      <li><strong>Best for:</strong> When you need both maximum recall and best ranking</li>
    </ul>
  </li>
</ol>
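
<p>The difference between the linear and the two-phase profiles can be sketched in a few lines of Python. This is an illustration only: the feature names, coefficients, and the stand-in second-phase function are hypothetical, and the real profiles use the blueprint’s own learned values and a LightGBM model.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

# Hypothetical coefficients for illustration; the blueprint ships its own learned values
COEFFICIENTS = {"bm25": 0.8, "vector_similarity": 2.5}
INTERCEPT = -1.0


def first_phase_score(features):
    """Logistic regression: a weighted sum of features squashed into (0, 1)."""
    linear = INTERCEPT + sum(COEFFICIENTS[name] * features[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-linear))


def two_phase_rank(candidates, second_phase_model, rerank_count):
    """Order all candidates cheaply, then re-score only the best few with a costlier model."""
    ordered = sorted(candidates, key=first_phase_score, reverse=True)
    head, tail = ordered[:rerank_count], ordered[rerank_count:]
    return sorted(head, key=second_phase_model, reverse=True) + tail


def stand_in_second_phase(features):
    # Stand-in for a GBDT model: any function mapping features to a score
    return features["vector_similarity"]


candidates = [
    {"id": "a", "bm25": 2.0, "vector_similarity": 0.2},
    {"id": "b", "bm25": 0.5, "vector_similarity": 0.9},
    {"id": "c", "bm25": 0.1, "vector_similarity": 0.1},
]
ranked = two_phase_rank(candidates, stand_in_second_phase, rerank_count=2)
print([candidate["id"] for candidate in ranked])
</code></pre></div></div>

<p>Only the top <code class="language-plaintext highlighter-rouge">rerank_count</code> candidates pay for the expensive model, which is where the extra latency of the GBDT profiles comes from.</p>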

<blockquote>
  <p><strong>For Advanced Users:</strong> Query profiles bundle complete search configurations including YQL structure (with <code class="language-plaintext highlighter-rouge">nearestNeighbor</code> operators), ranking profiles, and all required parameters (like learned coefficients). The Vespa application also includes <code class="language-plaintext highlighter-rouge">rag</code> and <code class="language-plaintext highlighter-rouge">rag-with-gbdt</code> profiles with <code class="language-plaintext highlighter-rouge">searchChain=openai</code> for <strong>server-side RAG</strong> (direct API usage), but these conflict with NyRAG’s client-side architecture and aren’t included. Learn more in the <a href="https://github.com/vespaai-playground/RAG-app-in-15min-ragblueprint#ranking-profiles">technical guide</a>.</p>
</blockquote>

<p><strong>Which profile should you use?</strong></p>
<ul>
  <li>Start with <strong><code class="language-plaintext highlighter-rouge">hybrid</code></strong> for everyday use - fast and accurate</li>
  <li>Switch to <strong><code class="language-plaintext highlighter-rouge">hybrid-with-gbdt</code></strong> when quality matters most (harder queries)</li>
  <li>Use <strong><code class="language-plaintext highlighter-rouge">deepresearch</code></strong> when you need to find everything relevant (research mode)</li>
  <li>Try <strong><code class="language-plaintext highlighter-rouge">deepresearch-with-gbdt</code></strong> for maximum recall + quality (slowest but most thorough)</li>
</ul>
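
<p>If you want to try a profile outside NyRAG, a request along these lines should work against the Vespa query API. This is a sketch under assumptions: the endpoint and token are placeholders, and it assumes the profile needs nothing beyond the query text. <code class="language-plaintext highlighter-rouge">queryProfile</code> is the Query API parameter that selects a profile by name.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json
import urllib.request

# Placeholders: substitute the token endpoint and token from your own deployment
ENDPOINT = "https://your-app.vespa-cloud.com"
TOKEN = "vespa_cloud_YOUR_TOKEN_HERE"


def build_search_request(user_query, query_profile="hybrid", hits=5):
    """Build (but do not send) a POST to /search/ that selects a named query profile."""
    body = {
        "query": user_query,
        "queryProfile": query_profile,
        "hits": hits,
    }
    return urllib.request.Request(
        ENDPOINT + "/search/",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + TOKEN,
        },
        method="POST",
    )


request = build_search_request("how does binary quantization work", "hybrid-with-gbdt")
print(request.full_url)
# To send it: urllib.request.urlopen(request), then json.load() the response
</code></pre></div></div>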

<hr />

<p>Now that your RAG Blueprint Vespa Cloud application is up and running, it’s time to add the missing pieces: a simple frontend UI and a data ingestion pipeline. For this, we’ll use <strong>NyRAG</strong>, a tool included in the <a href="https://github.com/vespaai-playground/RAG-app-in-15min-ragblueprint"><code class="language-plaintext highlighter-rouge">RAG-app-in-15min-ragblueprint</code></a> repository.</p>

<p>NyRAG acts as the glue for the entire RAG workflow. It reads documents from local files or websites, splits text into manageable chunks, generates embeddings, feeds everything into Vespa, and finally exposes a lightweight chat UI where you can ask questions over your data. Instead of wiring all of this together yourself, NyRAG gives you a working end-to-end system out of the box.</p>

<h3 id="install-nyrag">Install NyRAG</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Clone the repository</span>
git clone https://github.com/vespaai-playground/RAG-app-in-15min-ragblueprint.git
<span class="nb">cd </span>RAG-app-in-15min-ragblueprint

<span class="c"># Install uv (Fast, modern Python package manager)</span>
<span class="c"># macOS</span>
brew <span class="nb">install </span>uv

<span class="c"># Linux &amp; macOS</span>
<span class="c"># curl -LsSf https://astral.sh/uv/install.sh | sh</span>
<span class="c"># Windows (PowerShell)</span>
<span class="c"># powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"</span>

<span class="c"># Verify uv installation</span>
uv <span class="nt">--version</span>

<span class="c"># Install dependencies using uv</span>
uv <span class="nb">sync
source</span> .venv/bin/activate

<span class="c"># Windows (PowerShell)</span>
<span class="c"># powershell -ExecutionPolicy Bypass</span>
<span class="c"># . .\.venv\Scripts\activate</span>

<span class="c"># Install nyrag locally</span>
uv pip <span class="nb">install</span> <span class="nt">-e</span> <span class="nb">.</span>

<span class="c"># Verify nyrag installation</span>
nyrag <span class="nt">--help</span>
</code></pre></div></div>

<p><strong>Get an LLM API key</strong></p>

<p>To generate final answers, NyRAG needs an OpenAI-compatible API key. The simplest way to get started is <strong>OpenRouter</strong>, which provides access to multiple LLMs through a single API.</p>

<p>In this walkthrough, we’ll use OpenRouter for convenience. In a real application, you’re free to swap in any compatible LLM provider. To continue, sign up for OpenRouter and generate an API key. You’ll use it in the next step when configuring NyRAG.</p>

<hr />

<h3 id="start-the-nyrag-ui">Start the NyRAG UI</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># This script handles all configuration automatically</span>
./run_nyrag.sh

<span class="c"># Windows (PowerShell)</span>
<span class="c"># powershell -ExecutionPolicy Bypass</span>
<span class="c"># .\run_nyrag.ps1</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">run_nyrag.sh</code> script starts the UI and wires up the configuration so NyRAG can talk to Vespa Cloud. In practice, it loads your project config, uses the token you provide for authentication, and starts the web UI on port 8000.</p>

<p>Open http://localhost:8000 in your browser.</p>

<p><strong>Configure your project:</strong>
Now you’ll configure your project using the web UI to connect to your Vespa Cloud deployment and set up document processing.</p>

<p><strong>Step 1: Select and edit the example project</strong></p>

<p>In the top header, the project dropdown shows <strong>“doc_example”</strong>. If you are starting from the example config, it is usually pre-selected. The configuration editor typically opens automatically; if it does not (for example you land directly in chat), open the three-dot menu (⋮) and choose <strong>“Edit Config”</strong>.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/nyrag_7.png" alt="Project selector dropdown with &quot;doc_example&quot; highlighted" />
<strong>Description</strong>: Shows the project dropdown menu in the header with “doc_example” option</p>

<blockquote>
  <p><strong>Note:</strong> If the configuration editor doesn’t appear (shows chat interface instead), click the <strong>three-dot menu</strong> (⋮) in the top right corner and select <strong>“Edit Config”</strong> to open it manually.</p>
</blockquote>

<p><strong>Step 2: Update your credentials</strong></p>

<p>In the configuration editor, paste in the information you saved from Vespa Cloud and your LLM provider. You only need three things to get going: your Vespa tenant name, your Vespa endpoint + token, and your LLM API key.</p>

<p><strong>Required fields to update:</strong></p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Your Vespa Cloud credentials (from Vespa Cloud Console)</span>
<span class="na">cloud_tenant</span><span class="pi">:</span> <span class="s">your-tenant</span>          <span class="c1"># Your Vespa Cloud tenant name</span>
<span class="na">vespa_cloud</span><span class="pi">:</span>
  <span class="na">endpoint</span><span class="pi">:</span> <span class="s">https://your-app.vespa-cloud.com</span>  <span class="c1"># Your Vespa token endpoint (not mtls)</span>
  <span class="na">token</span><span class="pi">:</span> <span class="s">vespa_cloud_YOUR_TOKEN_HERE</span>          <span class="c1"># Your Vespa data plane token</span>

<span class="c1"># Your LLM configuration (default: OpenRouter)</span>
<span class="na">llm_config</span><span class="pi">:</span>
  <span class="na">api_key</span><span class="pi">:</span> <span class="s">sk-or-v1-YOUR_KEY_HERE</span>   <span class="c1"># Your OpenRouter API key (or other provider)</span>
</code></pre></div></div>

<p><strong>Notes:</strong></p>

<p>The default LLM provider is OpenRouter. If you switch providers, also update <code class="language-plaintext highlighter-rouge">base_url</code> and <code class="language-plaintext highlighter-rouge">model</code> to match. For the included example documents, <code class="language-plaintext highlighter-rouge">start_loc</code> defaults to <code class="language-plaintext highlighter-rouge">./dataset</code>, so you can run the pipeline without changing anything else.</p>

<p><strong>Step 3: Save and start processing</strong></p>

<p>After updating the configuration, you can close the editor (changes are saved automatically) and start indexing. If you are using the example dataset, keep <code class="language-plaintext highlighter-rouge">./dataset</code> as-is; otherwise, point <code class="language-plaintext highlighter-rouge">start_loc</code> at the folder (or site) you want to ingest. When you click <strong>“Start Indexing”</strong>, NyRAG reads your input, chunks it into 1024-character segments, generates embeddings, feeds everything to Vespa Cloud, and shows progress in the terminal panel so you can see exactly what is happening.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/nyrag_10.png" alt="Processing progress with terminal logs" />
<strong>Description</strong>: Shows documents being processed with terminal logs displaying progress</p>
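
<p>For intuition, fixed-length chunking can be sketched as plain string slicing. This is a naive illustration with a hypothetical helper name; the actual chunker may pick boundaries differently.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def chunk_fixed_length(text, length=1024):
    """Split text into consecutive segments of at most `length` characters."""
    return [text[start:start + length] for start in range(0, len(text), length)]


document = "Vespa " * 500  # 3000 characters of filler text
chunks = chunk_fixed_length(document)
print([len(chunk) for chunk in chunks])  # [1024, 1024, 952]
</code></pre></div></div>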

<hr />

<h2 id="chat-with-your-data">Chat with Your Data</h2>

<p>You can now start asking questions in the chat UI.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/nyrag_ui.png" alt="nyrag_ui" /></p>

<p>When you submit a query, NyRAG expands it into focused retrieval queries and sends them to Vespa. Vespa runs hybrid retrieval, combining BM25 keyword matching with vector similarity, and returns the most relevant chunks. Those chunks are packed into a compact context window and sent to the LLM, which generates an answer grounded entirely in your data.</p>

<p>A good way to sanity-check the setup is to start with a broad question like “What are the main topics in these documents?” and then follow up with something more specific to confirm the retrieved context makes sense.</p>
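
<p>The “pack chunks into a context window” step can be pictured roughly like this. It is a hypothetical sketch, not NyRAG’s actual code: the hit structure, the character budget, and the prompt wording are all assumptions, and the separators are not counted against the budget.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def pack_context(chunks, max_chars):
    """Greedily keep the highest-scoring chunks that fit in a character budget."""
    selected = []
    remaining = max_chars
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        cost = len(chunk["text"])
        if cost in range(remaining + 1):  # true when the chunk fits in what is left
            selected.append(chunk["text"])
            remaining -= cost
    return "\n---\n".join(selected)


hits = [
    {"text": "Vespa supports hybrid retrieval.", "score": 0.91},
    {"text": "BM25 scores exact term matches.", "score": 0.75},
    {"text": "An unrelated and rather long passage. " * 10, "score": 0.40},
]
context = pack_context(hits, max_chars=100)
prompt = "Answer using only this context:\n" + context + "\n\nQuestion: What does Vespa support?"
print(prompt)
</code></pre></div></div>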

<p>At this point, you have a fully functional RAG application running on Vespa Cloud.</p>

<h3 id="improving-search-quality-with-query-profiles">Improving Search Quality with Query Profiles</h3>

<p>Want better search results? You can fine-tune how Vespa retrieves and ranks your documents using the Settings modal (⚙️ icon in the top right).</p>

<p><strong>Change query profiles:</strong> Open the ⚙️ <strong>Settings</strong> panel, choose a <strong>Query Profile</strong> from the dropdown, and click <strong>“Save”</strong>. The very next query you run will use the new profile.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/nyrag_settings_query_profiles.png" alt="Settings modal with query profile dropdown" /><br />
<strong>Description</strong>: Settings modal showing query profile selection dropdown with 4 available options</p>

<p><strong>What each profile does:</strong></p>
<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">hybrid</code></strong>: Fast hybrid search (BM25 + vector) with linear ranking</li>
  <li><strong><code class="language-plaintext highlighter-rouge">hybrid-with-gbdt</code></strong>: Same retrieval + advanced GBDT ranking (slower but best quality)</li>
  <li><strong><code class="language-plaintext highlighter-rouge">deepresearch</code></strong>: Exhaustive search with 10,000 retrieval targets (maximum recall)</li>
  <li><strong><code class="language-plaintext highlighter-rouge">deepresearch-with-gbdt</code></strong>: Exhaustive search + GBDT ranking (slowest, most thorough)</li>
</ul>

<p><strong>Pro tip</strong>: The quality difference between <code class="language-plaintext highlighter-rouge">hybrid</code> and <code class="language-plaintext highlighter-rouge">hybrid-with-gbdt</code> can be dramatic for complex queries. The GBDT model offers significantly better relevance at the cost of 2-3x higher latency. For research tasks where you need to find everything relevant, try <code class="language-plaintext highlighter-rouge">deepresearch</code> variants which cast a much wider net!</p>

<hr />

<h3 id="manage-your-data">Manage Your Data</h3>

<p>NyRAG also gives you simple tools for cleanup. Open the advanced menu (three-dot icon ⋮ in the top right) and you will find two cleanup actions. <strong>Clear Local Cache</strong> removes cached files for all projects on your machine, which is useful when you want to re-process from scratch locally. <strong>Clear Vespa Data</strong> deletes the indexed documents in Vespa for the project, which is useful when you want a clean index before re-feeding. Both actions ask for confirmation so you do not delete data by accident.</p>

<hr />

<h2 id="bonus-try-web-crawling-mode">Bonus: Try Web Crawling Mode</h2>

<p>In addition to local documents, NyRAG supports web crawling. By switching to the web_example project, you can point NyRAG at a website and have it crawl, extract, and index content automatically.</p>

<p><strong>Switch to web crawling mode:</strong>  Select <code class="language-plaintext highlighter-rouge">web_example (web)</code> from the dropdown at the top and open the configuration editor. If you are currently on the chat screen, open the three-dot menu (⋮) and choose <strong>“Edit Config”</strong> to bring the editor back. From there, update the same credential fields as you did for <code class="language-plaintext highlighter-rouge">doc_example</code>, then click <strong>“Start Indexing”</strong> to crawl and feed the site.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/nyrag_indexing_web_2.png" alt="Web crawling in progress" /> 
<strong>Description</strong>: Shows web crawling in progress with terminal logs displaying discovered URLs and processed pages</p>

<p><strong>Web Mode Features:</strong> Web mode discovers and follows links automatically, while still respecting <code class="language-plaintext highlighter-rouge">robots.txt</code> and crawl delays so you do not hammer a site. It also does smart content extraction to drop navigation and boilerplate, deduplicates very similar pages, and supports resume so you can continue a crawl after interruption.</p>
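
<p>To see what respecting <code class="language-plaintext highlighter-rouge">robots.txt</code> means in practice, here is a small sketch using Python’s standard <code class="language-plaintext highlighter-rouge">urllib.robotparser</code>. The rules and the crawler name are made up, and this is not necessarily how NyRAG implements the check.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from urllib.robotparser import RobotFileParser

# A made-up robots.txt, parsed from memory so the example needs no network access
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for url in ["https://example.com/docs/intro", "https://example.com/private/notes"]:
    print(url, parser.can_fetch("my-crawler", url))

# Seconds a polite crawler should wait between requests, if the site declares it
print(parser.crawl_delay("my-crawler"))
</code></pre></div></div>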

<p><strong>Example Use Cases:</strong> Web mode is a good fit for product documentation, knowledge bases, blog archives, help-center content, and technical wikis. In general, it works best on sites with consistent HTML structure and clean, text-heavy pages.</p>

<p><strong>Tips:</strong> Start small. Crawl a limited part of a site first so you can sanity-check what gets extracted and indexed, then expand. Use <code class="language-plaintext highlighter-rouge">exclude</code> patterns to skip sections you do not want (for example <code class="language-plaintext highlighter-rouge">/pricing</code> or <code class="language-plaintext highlighter-rouge">/sales/*</code>), and keep an eye on the terminal output panel so you can spot loops, unexpected URLs, or pages that fail to parse.</p>
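
<p>Before a long crawl, you can dry-run your exclude patterns against a few URLs. The sketch below assumes glob-style matching on the URL path, which may differ from how NyRAG interprets <code class="language-plaintext highlighter-rouge">exclude</code>; the helper name is hypothetical.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from fnmatch import fnmatchcase
from urllib.parse import urlparse

# Hypothetical patterns, taken from the examples in the tip above
EXCLUDE = ["/pricing", "/sales/*"]


def is_excluded(url, patterns=EXCLUDE):
    """Return True if the URL path matches any glob-style exclude pattern."""
    path = urlparse(url).path
    return any(fnmatchcase(path, pattern) for pattern in patterns)


for url in [
    "https://example.com/docs/start",
    "https://example.com/pricing",
    "https://example.com/sales/emea",
]:
    print(url, is_excluded(url))
</code></pre></div></div>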

<hr />

<h2 id="troubleshooting">Troubleshooting</h2>

<p>Running into issues? We’ve got you covered! For detailed troubleshooting guides covering Vespa connection errors, LLM configuration, document processing, and more, see the <strong><a href="https://github.com/vespaai-playground/RAG-app-in-15min-ragblueprint#troubleshooting">Troubleshooting section</a></strong> in the main README.</p>

<p><strong>Quick help:</strong> If you get stuck, the fastest path is usually to ask in the <a href="http://slack.vespa.ai/">Vespa Slack</a> community, where people can help you interpret logs and query behavior. If you think you found a bug or want to request an improvement, open an issue in <a href="https://github.com/vespaai-playground/RAG-app-in-15min-ragblueprint/issues">GitHub Issues</a>. And when you want deeper background on schema, ranking, and deployment, the <a href="https://docs.vespa.ai/">Vespa Docs</a> are your go-to reference.</p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p><strong>Congratulations!</strong> You now have a working RAG app: a Vespa Cloud deployment that can retrieve high-quality context, and a small UI that lets you ingest data and chat with it.</p>

<p>Building a high-quality RAG system is never trivial. There are multiple moving parts to get right: the quality of the LLM, the size and management of its context window, and how effectively your retrieval system surfaces the most relevant information.</p>

<p>Thanks to the out-of-the-box Vespa RAG blueprint on Vespa Cloud, much of this complexity is handled for you. It comes with multiple ranking profiles, and its default hybrid retrieval setup combines <strong>vector similarity with BM25 text matching</strong>, ensuring your LLM sees the best possible context for every query.</p>

<p>Vespa Cloud doesn’t just make building RAG easier; it makes it <strong>scalable, fast, and reliable</strong>, giving you production-ready infrastructure, auto-scaling, and observability without the headaches of self-hosting. Whether you’re experimenting with small datasets or scaling to millions of documents, Vespa Cloud provides the tools and flexibility to make your RAG project shine.</p>

<p>Want to dive deeper? Start with the <a href="https://docs.vespa.ai/en/learn/tutorials/rag-blueprint.html">RAG Blueprint Tutorial</a> for a thorough conceptual walkthrough. And remember the <a href="https://vespatalk.slack.com/">Vespa Slack community</a> is always there to help. Ask questions, share what you’ve built, or get advice on retrieval, ranking, and deployment strategies.</p>

<p>Ready to experience the power of Vespa Cloud for yourself? <a href="https://cloud.vespa.ai/">Sign up</a> today and <strong>start building high-quality RAG applications with ease</strong>!</p>

]]></content:encoded>
        <pubDate>Mon, 02 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        
      </item>
    
      <item>
        <title>Vespa Newsletter, February 2026</title>
        <description>Advances in Vespa&apos;s retrieval performance, flexibility, and developer productivity.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/logo/logo-pi.jpg" />
        
        <content:encoded><![CDATA[<p>Welcome to the latest edition of the Vespa newsletter. In the <a href="https://blog.vespa.ai/vespa-newsletter-december-2025/">previous update</a>, we introduced several new features and improvements, including Automated ANN Tuning, Accelerated Exact Vector Distance with Google Highway, Precise Chunk-Level Matching for Higher Retrieval Quality, Quantile Computation in Grouping for Instant Distribution Insights, and <a href="https://blog.vespa.ai/vespa-newsletter-december-2025/">more</a>.</p>

<p>This month, we’re announcing several updates focused on retrieval quality, ranking flexibility, and developer productivity. Each feature is designed to help engineering teams build faster, more accurate, and more maintainable retrieval and ranking systems, while giving businesses better relevance, lower operational overhead, and more predictable performance at scale.</p>

<p>Let’s dive into what’s new.</p>

<h2 id="product-updates">Product updates</h2>

<ul>
  <li>Announcing the Vespa.ai Playground</li>
  <li>The Vespa Kubernetes Operator</li>
  <li>Faster result rendering with CBOR</li>
  <li>Pyvespa 1.0 with improved HTTP performance</li>
  <li>Hybrid search relevance evaluation tool</li>
  <li>Configurable linguistics per field</li>
  <li><strong>“switch”</strong> operator in ranking expressions</li>
  <li>Vespa is now available on GCP Marketplace</li>
  <li>Feed data and run queries in the Vespa Console</li>
</ul>

<h3 id="announcing-the-vespaai-playground">Announcing the Vespa.ai Playground</h3>

<p>The Vespa Playground is a new GitHub space where we share projects, tools, and demos built on the Vespa platform. It’s a practical place to explore real examples for embeddings, model training, and feed connectors that you can clone, run, and build on your own.</p>

<p>These repos are ideal for experimentation, learning, and inspiration, though they aren’t officially supported product releases.</p>

<p><a href="https://github.com/vespaai-playground">Explore the Playground</a></p>

<h3 id="the-vespa-kubernetes-operator">The Vespa Kubernetes Operator</h3>

<p>The safest, most robust, and most cost-effective way to run Vespa is to deploy on Vespa Cloud, but for various reasons that’s not an option for everybody. For those who want to run Vespa securely at scale but can’t use Vespa Cloud, we have now released the Vespa Kubernetes Operator. This brings many Vespa Cloud features, such as security out of the box, dynamic provisioning, autoscaling, and automated upgrades, to your own Kubernetes environments.</p>

<p>Read more in the <a href="https://docs.vespa.ai/en/operations/kubernetes/vespa-on-kubernetes.html">Kubernetes Operator documentation</a>.</p>

<h3 id="faster-result-rendering-with-cbor">Faster result rendering with CBOR</h3>

<p>Query result sets can be large, and increasingly so when the client is an LLM retrieving many chunks for model context. <a href="https://blog.vespa.ai/introducing-layered-ranking-for-rag-applications/">Layered ranking</a> is designed to address this by extracting the most relevant content. Still, in some cases, the total latency is dominated by the time it takes to send the query response. Compressing with gzip can help, but it is also CPU-intensive and slow. From Vespa 8.623.5, JSON response generation is over twice as fast as before.</p>

<p>Another new option in this release is to use the <a href="https://cbor.io/">CBOR</a> format for query results. CBOR is a binary format so it can be serialized faster and produces smaller payloads, especially when the result contains lots of numeric data. Read more in the <a href="https://docs.vespa.ai/en/reference/api/query.html#presentation.format">Query API reference</a> and query <a href="https://docs.vespa.ai/en/performance/practical-search-performance-guide.html#hits-and-summaries">performance guide</a>.</p>
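
<p>To get a feel for why a binary format helps with numeric data, the sketch below encodes one 768-value array by hand, once as JSON text and once following CBOR’s rules for an array of single-precision floats. It is an illustration only: the values are made up, Vespa’s actual response encoding may differ, and the 4-byte floats carry less precision than the JSON text.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json
import struct

values = [i / 768 for i in range(768)]  # stand-in for one 768-dimensional vector

# JSON: every number is written out as decimal text
json_payload = json.dumps(values).encode("utf-8")

# CBOR: array header (0x99 plus a 2-byte length), then 0xFA plus a 4-byte float per value
cbor_payload = bytes([0x99]) + struct.pack("!H", len(values))
for value in values:
    cbor_payload += bytes([0xFA]) + struct.pack("!f", value)

print(len(json_payload), len(cbor_payload))
</code></pre></div></div>

<p>The CBOR side is always 3 + 768 × 5 = 3843 bytes here, while the JSON size depends on how many digits each number needs.</p>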

<h3 id="pyvespa-10-with-improved-http-performance">Pyvespa 1.0 with improved HTTP performance</h3>

<p>We have released the first major version of Pyvespa! This release switches Pyvespa’s HTTP client from httpx to httpr, which gives big performance gains, especially when serializing and deserializing tensors, largely by taking advantage of the new CBOR serialization support in Vespa.</p>

<p>On preliminary benchmarks, we compared end-to-end latency for:</p>

<ol>
  <li>
    <p>Vespa 8.591.16 + Pyvespa v0.63.0 (using JSON)</p>
  </li>
  <li>
    <p>Vespa 8.634.24 + Pyvespa v1.0.0 (using CBOR)</p>
  </li>
</ol>

<p>The latter was ~4.9x faster when returning 400 hits with a 768-dim vector each. Performance gains will be smaller when not returning large result sets with tensors, but still significant. You may encounter different exceptions than before, but we strove not to change any user-facing APIs, even though we bumped the major version.</p>

<p><a href="https://github.com/vespa-engine/pyvespa">Go to Pyvespa</a></p>

<h3 id="hybrid-search-relevance-evaluation-tool">Hybrid search relevance evaluation tool</h3>

<p>Hybrid search combines lexical and embedding based search to get the best from both. One of the tasks you need to solve is to pick an embedding model that provides a good quality vs. cost tradeoff for your use case. We have done a systematic evaluation of modern alternatives in <a href="https://blog.vespa.ai/embedding-tradeoffs-quantified/">this blog</a>.</p>

<p>The code used to run these experiments is now merged into Pyvespa. You can use the VespaMTEBApp to evaluate embedding model performance on any task/benchmark compatible with the <a href="https://embeddings-benchmark.github.io/mteb/overview/available_benchmarks/">mteb-library</a>. See example usage from the <a href="https://github.com/vespa-engine/pyvespa/blob/master/tests/integration/test_integration_mtebevaluation.py">tests</a>.</p>

<h3 id="configurable-linguistics-per-field">Configurable linguistics per field</h3>

<p>Vespa now lets you specify linguistics profiles on fields to select some specific linguistics processing in your Linguistics module. In Lucene Linguistics, linguistics profiles map to analyzer configuration, optionally in combination with a specific language.</p>

<p>For example, you can define a Lucene analyzer like this in services.xml:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  &lt;item key="profile=whitespaceLowercase;language=en"&gt;
    &lt;tokenizer&gt;
      &lt;name&gt;whitespace&lt;/name&gt;
    &lt;/tokenizer&gt;
    &lt;tokenFilters&gt;
      &lt;item&gt;
        &lt;name&gt;lowercase&lt;/name&gt;
      &lt;/item&gt;
    &lt;/tokenFilters&gt;
  &lt;/item&gt;
</code></pre></div></div>
<p>And use it in the schema, under any field’s definition, like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>field title type string {
  indexing: summary | index
  linguistics {
      profile: whitespaceLowercase
  }
}
</code></pre></div></div>
<p>By default, the linguistics profile is applied both when indexing the field’s text and when processing the query text that searches it, but you can also specify a different linguistics profile on the query side, which is useful for tasks such as synonym query expansion.</p>

<p>We’ve added a sample application demonstrating how to use multiple Lucene linguistics <a href="https://github.com/vespa-engine/sample-apps/tree/master/examples/lucene-linguistics/multiple-profiles">profiles</a> across multiple fields and updated the Vespa <a href="https://docs.vespa.ai/en/linguistics/linguistics.html">linguistics documentation</a> with usage examples.</p>
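<p>As a rough illustration of what the whitespaceLowercase profile above does to text, the following sketch approximates a whitespace tokenizer followed by a lowercase filter in plain Python. It is not Lucene code, and Python and Lucene may disagree on a few exotic whitespace characters; the function name is made up for this example.</p>

```python
def whitespace_lowercase(text):
    """Split on runs of whitespace, then lowercase each token."""
    return [token.lower() for token in text.split()]


print(whitespace_lowercase("Vespa  Linguistics: Per-Field PROFILES"))
# prints ['vespa', 'linguistics:', 'per-field', 'profiles']
```

<p>Punctuation stays attached to the tokens, which is the main practical difference from analyzers that also split on punctuation, and one reason you might want a different profile for different fields.</p>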

<h3 id="new-switch-operator-in-ranking-expressions">New “switch” operator in ranking expressions</h3>

<p>We have added a “switch” function in ranking expressions as a clearer, more maintainable alternative to deeply nested if() clauses, making complex ranking easier to read, debug, and evolve.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>switch (attribute(category)) {
    case "restaurant": myRestaurantFunction(),
    case "hotel": myHotelFunction(),
    default: myDefaultFunction()
}
</code></pre></div></div>

<p><a href="https://docs.vespa.ai/en/ranking/ranking-expressions-features.html#the-switch-function">Learn more</a></p>
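<p>For readers who think in general-purpose code, here is a small Python sketch of the readability difference between nested conditionals and a switch-style lookup. It only mimics the semantics described above; the scoring functions and their values are placeholders invented for this example, and it says nothing about how Vespa evaluates the expression internally.</p>

```python
def restaurant_score():
    return 3.0  # placeholder value


def hotel_score():
    return 2.0  # placeholder value


def default_score():
    return 1.0  # placeholder value


def rank_with_nested_if(category):
    """Nested conditionals: each new case adds another level of nesting."""
    return (
        restaurant_score() if category == "restaurant"
        else (hotel_score() if category == "hotel" else default_score())
    )


def rank_with_switch(category):
    """Switch-style lookup: each new case is one more flat entry."""
    cases = {"restaurant": restaurant_score, "hotel": hotel_score}
    return cases.get(category, default_score)()


for category in ("restaurant", "hotel", "museum"):
    print(category, rank_with_nested_if(category), rank_with_switch(category))
```

<p>Both functions return the same value for every category; the switch-style version simply keeps the cases flat as the list grows.</p>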

<h3 id="vespa-is-now-available-on-gcp-marketplace">Vespa is now available on GCP Marketplace</h3>

<p>Vespa Cloud is now listed on the GCP Marketplace, making it easier to deploy and manage Vespa using native Google Cloud billing and procurement. Vespa Cloud is already available on <a href="https://aws.amazon.com/marketplace/pp/prodview-5pkxkencasnoo?sr=0-1&amp;ref_=beagle&amp;applicationId=AWSMPContessa">AWS Marketplace</a>.</p>

<p><a href="https://console.cloud.google.com/marketplace/product/gcp-billing-marketplace/vespa-cloud">See details</a></p>

<h3 id="feed-data-and-run-queries-in-the-vespa-console">Feed data and run queries in the Vespa Console</h3>

<p>The onboarding experience is now even smoother for new Vespa Cloud users. When you follow the getting started guide and deploy a sample app from the browser, you can immediately feed data and run queries directly in the browser. This makes it easy to try your own data and see how it behaves in Vespa.</p>

<p>We also provide examples showing how to do the same using pyvespa, the Vespa CLI, or curl.</p>

<p><img src="/assets/2026-02-18-vespa-newsletter-february-2026/new-onboarding-console.png" alt="New onboarding experience" /></p>

<p><a href="https://login.console.vespa-cloud.com/u/signup/identifier?state=hKFo2SBsN1NBOERhNnRCbDhpajdqTnhYSTlzUlltUjNoUG5mZaFur3VuaXZlcnNhbC1sb2dpbqN0aWTZIERwRkg4NkVwRHg2aFk1Rjg0ZHZrYmdBZ0pFc1lTb29Io2NpZNkgVk92OGViclhwcEdBTnVpWWZHOWhKWk94MVM5T0dhTTQ">Try it Free</a></p>

<h2 id="new-content-and-learning-resources">New content and learning resources</h2>

<p>We published several new articles and resources since our last newsletter to help teams get more out of Vespa and stay ahead of new developments in search, RAG, and large-scale AI.</p>

<p><strong>Examples and notebooks:</strong></p>

<ul>
  <li><a href="http://playground.vespa.ai">playground.vespa.ai</a></li>
</ul>

<p><strong>Videos, webinars, and podcasts</strong></p>

<ul>
  <li><a href="https://em360tech.com/podcasts/how-scale-ai-digital-commerce-effectively?utm_content=520974566&amp;utm_medium=social&amp;utm_source=linkedin&amp;hss_channel=lcp-100705136">How To Scale AI in Digital Commerce Effectively</a></li>
  <li><a href="https://vespa.ai/resource/vespa-now-year-in-review/">2025 Year in Review</a></li>
</ul>

<p><strong>Blogs and ebooks</strong></p>

<ul>
  <li><a href="https://blog.vespa.ai/agentic-ai-powered-sales-for-developers-with-vespa/">Clarm: Agentic AI-powered Sales for Developers with Vespa Cloud</a></li>
  <li><a href="https://blog.vespa.ai/embedding-tradeoffs-quantified/">Embedding Tradeoffs, Quantified</a></li>
  <li><a href="https://blog.vespa.ai/enterpise-ai-search-vs-the-real-needs-of-customer-facing-apps/">Enterprise AI Search vs. the Real Needs of Customer-Facing Apps</a></li>
  <li><a href="https://blog.vespa.ai/eliminating-the-precision-latency-trade-off-in-large-scale-rag/">Eliminating the Precision–Latency Trade-Off in Large-Scale RAG</a></li>
  <li><a href="https://blog.vespa.ai/how-tensors-are-changing-search-in-life-sciences/">How Tensors Are Changing Search in Life Sciences</a></li>
  <li><a href="https://blog.vespa.ai/the-search-api-reset-incumbents-retreat-innovators-step-up/">The Search API Reset: Incumbents Retreat, Innovators Step Up</a></li>
  <li><a href="https://blog.vespa.ai/why-ai-search-platforms-are-gaining-attention/">Why AI Search Platforms Are Gaining Attention</a></li>
  <li><a href="https://blog.vespa.ai/why-life-sciences-ai-is-a-search-problem-5-of-5/">Why Life Sciences AI Is a Search Problem (Part 5 of 5)</a></li>
  <li><a href="https://blog.vespa.ai/why-life-sciences-ai-is-a-search-problem-4-of-5/">Why Life Sciences AI Is a Search Problem (Part 4 of 5)</a></li>
</ul>

<h3 id="upcoming-events">Upcoming Events</h3>

<p><img src="/assets/2026-02-18-vespa-newsletter-february-2026/maven.jpeg" alt="Personalized Relevance with VLMs and Sparse Vectors: February 17, 11:30am ET" />
<strong>Lightning Lesson: Personalized Relevance with VLMs and Sparse Vectors: February 17, 11:30am ET</strong></p>
<ul>
  <li>Intro to sparse vectors and tensors for efficient data handling</li>
  <li>Using Vision-Language Models (VLMs) to extract high quality and nuanced features from images</li>
  <li>Leveraging these features in sparse representations for hyper-personalized search &amp; recommendations</li>
</ul>

<p><a href="https://maven.com/p/b5ee84/personalized-relevance-with-vl-ms-and-sparse-vectors">Register Now</a></p>

<hr />

<p><img src="/assets/2026-02-18-vespa-newsletter-february-2026/eCommerce-Webinar-Series.png" alt="e-commerce-webinar-series" />
<strong>February 18: The Zero Results Problem in eCommerce</strong></p>
<ul>
  <li>🔗 10am CET (EMEA): <a href="https://www.airmeet.com/e/f4f6c070-c094-11f0-9be4-375c53bcf15c?utm_source=Newsletter&amp;utm_campaign=Zero%20results%20EMEA">Save your spot</a></li>
  <li>🔗 1pm ET (Americas): <a href="https://www.airmeet.com/e/305ace80-c3c0-11f0-9be4-375c53bcf15c?utm_source=Newsletter&amp;utm_campaign=Zero%20results%20(AMER)">Save your spot</a></li>
</ul>

<p><strong>March 11: The Relevance Problem in eCommerce</strong></p>
<ul>
  <li>🔗 10am CET (EMEA): <a href="https://www.airmeet.com/e/70338df0-c5fd-11f0-831c-01bcfd385865?utm_source=Newsletter&amp;utm_campaign=Relevance%20Problem%20EMEA">Save your spot</a></li>
  <li>🔗 1pm ET (Americas): <a href="https://www.airmeet.com/e/5bf695d0-c5fd-11f0-bb1f-e79dc2111266?utm_source=Newsletter&amp;utm_campaign=Relevance%20Problem%20AMER">Save your spot</a></li>
</ul>

<hr />

<p><img src="/assets/2026-02-18-vespa-newsletter-february-2026/Vespa-Now-Q1-Product-Update.png" alt="product-update" />
<strong>March 10: Vespa Q1 Product Update</strong></p>
<ul>
  <li>🔗 10am CET (EMEA): <a href="https://www.airmeet.com/e/79245020-f186-11f0-ace7-c7ef52349391?utm_source=Newsletter&amp;utm_campaign=Q1%20Product%20Update">Save your spot</a></li>
  <li>🔗 1pm ET (Americas): <a href="https://www.airmeet.com/e/3d23e680-f186-11f0-b12c-b1c5402490b0?utm_source=Newsletter&amp;utm_campaign=Q1%20Product%20update">Save your spot</a></li>
</ul>

<hr />
<p>👉 <a href="https://www.linkedin.com/company/vespa-ai/">Follow us on LinkedIn</a> to stay in the loop on upcoming events, blog posts, and announcements.</p>

<hr />

<p>Thanks for joining us in exploring the frontiers of AI with Vespa. Ready to take your projects to the next level? <a href="https://vespa.ai/free-trial/">Deploy your application for free</a> on Vespa Cloud today.</p>

]]></content:encoded>
        <pubDate>Mon, 16 Feb 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/vespa-newsletter-february-2026/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/vespa-newsletter-february-2026/</guid>
        
        
        <category>newsletter</category>
        
      </item>
    
      <item>
        <title>Nexla + Vespa, The Power Duo for AI-Ready Data Pipelines</title>
        <description>Nexla solves data readiness. Vespa solves intelligence and precision at scale. Together, they give teams a clean, practical path from raw enterprise data to real-time AI applications.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/images/New Partnership Nexla.png" />
        
        <content:encoded><![CDATA[<h3 id="partner-spotlight-nexla">Partner Spotlight: Nexla</h3>

<p>AI is transforming quickly. What started with Q&amp;A chatbots has already evolved into deep research applications and, now, autonomous AI agents. Vespa is proud to be at the center of this shift, enabling some of the most proficient adopters of AI, such as Perplexity. To help organizations maximize the benefits of Vespa, we’re building a robust partner ecosystem. These partners help bring Vespa’s AI-native capabilities into real-world deployments across industries.</p>

<p><strong>Meet the innovators shaping the future of AI. Today’s spotlight: Nexla</strong></p>

<h2 id="nexla--vespaai-the-power-duo-for-ai-ready-data-pipelines">Nexla + Vespa.ai: The Power Duo for AI-Ready Data Pipelines</h2>

<p>When AI systems fall short, it’s rarely the model’s fault. It’s the messy reality of data spread across systems and never quite staying in sync. That’s why Nexla and Vespa have partnered.</p>

<p><a href="https://nexla.com/">Nexla</a> makes data usable.</p>

<p><a href="http://vespa.ai">Vespa</a> makes data intelligent at scale.</p>

<p>Together, they turn messy, distributed enterprise data into real-time AI search, recommendation, and RAG systems, without months of custom code gluing things together.</p>

<h2 id="nexla-making-enterprise-data-usable">Nexla: Making Enterprise Data Usable</h2>

<p>Nexla is an enterprise-grade, AI-powered data integration <a href="https://nexla.com/nexla-platform-overview">platform</a> that turns raw data from any source into production-ready data products. It provides a declarative, no-code way to move, transform, and validate data across ETL/ELT, reverse ETL, streaming, APIs, and RAG pipelines.</p>

<p>Think of Nexla as the layer that answers: “How do we reliably get the right data, in the right shape, to the systems that need it?”</p>

<p>Core capabilities:</p>

<ul>
  <li>
    <p><strong>500+ Bidirectional <a href="https://nexla.com/connectors/">Connectors</a>:</strong> Pull data from databases, APIs, cloud storage, SaaS apps, and data warehouses, including systems like Salesforce, Snowflake, and Amazon S3.</p>
  </li>
  <li>
    <p><strong>Metadata Intelligence:</strong> Nexla automatically scans sources and generates <a href="https://nexla.com/nexsets">Nexsets</a>, virtual, ready-to-use data products with schemas, samples, and validation rules.
Example: If a price field suddenly switches from numeric to string, Nexla detects it before bad data reaches production search.</p>
  </li>
  <li>
    <p><strong><a href="https://nexla.com/blog/introducing-express-conversational-data-platform/">Express</a> (conversational pipelines):</strong> A conversational AI interface where you can simply describe what you need.
Example: You can say, “Pull customer data from Salesforce and merge with Google Analytics,” and it builds the pipeline for you.</p>
  </li>
  <li>
    <p><strong>Universal <a href="https://nexla.com/data-integration/">integration</a> styles:</strong> Supports ELT, ETL, CDC, R-ETL, streaming, API integration, and FTP in a single platform.</p>
  </li>
</ul>
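<p>The price-field example in the list above is a case of schema drift detection. The sketch below shows the general idea in a few lines of Python: remember the type of each field from a known-good record and flag fields whose type changes. It is a simplified illustration with made-up function names, not Nexla’s implementation.</p>

```python
def infer_types(record):
    """Map each field name to the name of its Python type."""
    return {field: type(value).__name__ for field, value in record.items()}


def detect_type_drift(expected, record):
    """Return fields whose observed type differs from the expected one."""
    observed = infer_types(record)
    return {
        field: (expected[field], observed[field])
        for field in expected
        if field in observed and observed[field] != expected[field]
    }


# Types learned from a known-good record.
expected = infer_types({"sku": "A-1", "price": 19.99})

# A later record where price arrives as a string.
drift = detect_type_drift(expected, {"sku": "A-2", "price": "19.99"})
print(drift)
# prints {'price': ('float', 'str')}
```

<p>A pipeline can use a non-empty result to quarantine the record before it is fed to the search engine.</p>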

<p>Nexla processes over <strong>1 trillion records monthly</strong> for companies like DoorDash, LinkedIn, Carrier, and LiveRamp.</p>

<h2 id="vespa-where-retrieval-becomes-reasoning">Vespa: Where Retrieval Becomes Reasoning</h2>

<p>Vespa is a production-grade AI search platform that combines distributed text search, vector search, structured filtering, and machine-learned ranking in a single system.</p>

<p>Think of Vespa as the engine that answers: “Given all this data, how do we retrieve, rank, and reason over it in real time?”</p>

<p>It powers demanding applications like Perplexity and supports search, recommendations, personalization, and RAG at massive scale.</p>

<p>Core capabilities:</p>

<ul>
  <li>
    <p><strong>Unified AI Search and Retrieval:</strong> Vespa natively combines vector and <a href="https://vespa.ai/tensor-formalism/">tensor search</a> for semantic retrieval, full-text search for precise keyword matching, and structured filtering on attributes like categories, prices, and dates to enable richer, contextual search without stitching multiple systems together.</p>
  </li>
  <li>
    <p><strong>Real-time Retrieval and Inference at Scale:</strong> Rather than separating indexing, ranking, and inference across multiple systems, Vespa performs real-time machine-learned ranking and model inference where the data lives. This means you can serve fresh, personalized results with predictable sub-100 ms latency even for large datasets.</p>
  </li>
  <li>
    <p><strong>Multi-Phase Ranking and Custom Logic:</strong> Vespa lets you embed custom ranking logic, including ML models like XGBoost, directly into your search pipeline using ONNX. You can combine relevance signals, business rules, and semantic vectors in multi-stage ranking to fine-tune which results surface first.</p>
  </li>
  <li>
    <p><strong>Massive Scalability with High Throughput:</strong> Designed for real-world, high-traffic applications, Vespa can scale horizontally across clusters, handling billions of documents with sub-100ms query latency and up to 100k writes per second per node.</p>
  </li>
  <li>
    <p><strong>Multi-Vector and Multi-Modal Retrieval:</strong> Vespa natively handles multiple vectors per document, with support for token-level embeddings, ColPali-based visual document retrieval, and <a href="https://vespa.ai/tensor-formalism/">tensor-based computations</a> for precise, cross-modal relevance and ranking.</p>
  </li>
</ul>
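<p>The multi-phase ranking idea from the list above can be sketched in plain Python: score every candidate with a cheap function, then re-score only the top few with a more expensive one. This is a conceptual illustration under simplified assumptions (a single node, in-memory documents, invented field names and scores), not Vespa’s ranking code.</p>

```python
def two_phase_rank(docs, first_phase, second_phase, rerank_count):
    """Rank all docs cheaply, then rerank only the top rerank_count."""
    ranked = sorted(docs, key=first_phase, reverse=True)
    head = sorted(ranked[:rerank_count], key=second_phase, reverse=True)
    tail = ranked[rerank_count:]
    return head + tail


# Invented documents: "bm25" stands in for a cheap text score,
# "model" for an expensive machine-learned score.
docs = [
    {"id": "a", "bm25": 2.0, "model": 0.1},
    {"id": "b", "bm25": 1.5, "model": 0.9},
    {"id": "c", "bm25": 0.5, "model": 1.0},
]

result = two_phase_rank(
    docs,
    first_phase=lambda d: d["bm25"],
    second_phase=lambda d: d["model"],
    rerank_count=2,
)
print([d["id"] for d in result])
# prints ['b', 'a', 'c']
```

<p>Document c has the best model score but never reaches the second phase, which shows the trade-off: the rerank count bounds the cost of the expensive model, so the first phase has to be good enough to let the right candidates through.</p>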

<p>GigaOm recognized Vespa as a <strong><a href="https://content.vespa.ai/gigaom-report-v3-2025?_gl=1*1ep8wq0*_gcl_aw*R0NMLjE3NjQ4Nzg2NjIuQ2owS0NRaUFfOFRKQmhETkFSSXNBUFg1cXhRbHdEbHgtMndtQjdqRS1aYzhVWHRBSW4zTzZ2eEVrelNYTTdLUkNXSkZCTGpISml4MzNSZ2FBbkRxRUFMd193Y0I.*_gcl_au*MjkzNDEwODQ3LjE3NjUyODY2NTk.">leader</a> in vector databases</strong> for two consecutive years, noting its performance advantages over alternatives like Elasticsearch, up to <strong><a href="https://content.vespa.ai/vespa-vs-elasticsearch-performance-comparison">12.9X higher throughput</a> per CPU core for vector searches</strong>.</p>

<h2 id="how-nexla-and-vespa-work-together">How Nexla and Vespa Work Together</h2>

<p>The Nexla-Vespa partnership removes one of the hardest parts of AI systems: getting clean, well-modeled data into a high-performance retrieval engine, continuously.</p>

<p>Nexla recently launched a Vespa connector that makes data integration with Vespa seamless. The integration includes:</p>

<p><strong><a href="https://docs.nexla.com/user-guides/connectors/vespa_api/overview">Vespa Connector</a> in Nexla:</strong>
Handles all data piping from sources like Amazon S3, PostgreSQL, Pinecone, Snowflake, and others directly into Vespa:
<img src="/assets/images/nexla1.png" alt="" /></p>

<p><strong>Vespa Nexla Plugin CLI:</strong> Automatically generates draft Vespa application packages (including schema files) directly from a Nexset, eliminating manual configuration:
<img src="/assets/images/nexla2.png" alt="" /></p>

<p>This means you can move data from S3 to Vespa, migrate from Pinecone to Vespa, or sync <a href="https://nexla.com/demo-center/move-data-from-postgresql-to-vespa-ai-effortlessly/">PostgreSQL to Vespa</a>, all without writing a single line of code.</p>

<h2 id="when-nexla-clients-should-use-vespa">When Nexla Clients Should Use Vespa</h2>

<p>You’re a Nexla client. Use Vespa when you need:</p>

<p><strong>Advanced AI search and RAG applications:</strong>
If you’re building intelligent search, recommendation systems, or RAG applications that require hybrid search (combining semantic vector search with keyword matching and metadata filtering), Vespa is purpose-built for this. Nexla gets your data into Vespa, while Vespa delivers production-grade AI search with machine-learned ranking.</p>
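<p>As a minimal sketch of what such a hybrid request can look like, the Python below builds a Vespa query body that combines keyword matching, nearest-neighbor vector search, and a metadata filter in one YQL statement. The field names (embedding, category), the rank profile name (hybrid), and the query tensor name (q) are assumptions for illustration; your schema must define the matching fields and a rank profile that declares the query tensor as an input.</p>

```python
import json


def build_hybrid_query(user_text, query_vector, category, hits=10):
    """Build a query body mixing keyword, vector and filter clauses."""
    yql = (
        "select * from sources * where "
        "(userQuery() or ({targetHits:100}nearestNeighbor(embedding, q))) "
        # json.dumps adds the double quotes and escapes the literal.
        "and category contains " + json.dumps(category)
    )
    return {
        "yql": yql,
        "query": user_text,                # text consumed by userQuery()
        "input.query(q)": query_vector,    # vector for nearestNeighbor
        "ranking.profile": "hybrid",       # hypothetical rank profile name
        "hits": hits,
    }


body = build_hybrid_query("trail running shoes", [0.1, 0.2, 0.3], "shoes", hits=3)
print(json.dumps(body, indent=2))
```

<p>The resulting body can be sent as a JSON POST to the application’s search endpoint with any HTTP client, so a single request covers what would otherwise need separate keyword, vector, and filtering systems.</p>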

<p><strong>Real-time, high-scale query performance:</strong>
When you need to serve thousands of queries per second across billions of documents with sub-100ms latency, Vespa’s distributed architecture scales horizontally without compromising quality. Nexla ensures your data flows continuously into Vespa with incremental updates and CDC support.</p>

<p><strong>Complex ranking and inference:</strong>
If your use case requires multi-phase ranking, custom ML models, or LLM integration at query time, Vespa executes these operations locally where data lives, avoiding costly data movement. Nexla prepares and transforms your data into the exact schema Vespa needs.</p>

<p><strong>Cost efficiency at scale:</strong>
Vespa delivers 5X infrastructure cost savings compared to alternatives like Elasticsearch while handling vector, lexical, and hybrid queries. Nexla minimizes integration costs by automating pipeline creation and schema management.</p>

<h2 id="when-vespa-clients-should-use-nexla">When Vespa Clients Should Use Nexla</h2>

<p>You’re a Vespa client. Use Nexla when you need:</p>

<p><strong>Multi-source data consolidation:</strong>
Vespa is your search and inference engine, but data lives everywhere: in S3 buckets, PostgreSQL databases, Snowflake warehouses, Salesforce CRMs, APIs, and files. Nexla connects to 500+ sources with bidirectional connectors and consolidates data into Vespa without custom ETL scripts.</p>

<p><strong>Automated schema generation and management:</strong>
Instead of manually writing Vespa schema files and managing schema evolution, Nexla’s Plugin CLI auto-generates schemas from your Nexsets. As source schemas change, Nexla’s metadata intelligence detects changes and propagates them downstream automatically.</p>

<p><strong>Data transformation and enrichment:</strong>
Before data hits Vespa, it often needs cleaning, filtering, enrichment, or format conversion. Nexla provides a no-code transformation library and supports custom SQL, Python, or JavaScript, all without maintaining separate ETL infrastructure.</p>

<p><strong>Vector database migration:</strong>
Moving from Pinecone, Weaviate, or another vector database to Vespa? Nexla handles the migration with zero code, extracting records, transforming data to match Vespa’s schema, and syncing documents continuously.</p>

<p><strong>Data quality and monitoring:</strong>
Nexla continuously monitors data flows with built-in validation rules, error handling, and automated alerts. When data quality issues arise, Nexla quarantines bad records and provides audit trails, ensuring Vespa always receives clean, trustworthy data.</p>

<p><strong>Real-time and streaming pipelines:</strong>
Vespa supports real-time updates, but getting real-time data from streaming sources (Kafka, APIs, databases with CDC) requires integration logic. Nexla handles streaming, batch, and hybrid integration styles, optimizing throughput and latency for each source type.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Nexla solves <strong>data readiness</strong>.</p>

<p>Vespa solves <strong>intelligence and precision at scale</strong>.</p>

<p>Together, they give teams a clean, practical path from raw enterprise data to real-time AI applications. <a href="http://vespa.ai">Vespa</a> gives you production-grade vector search, hybrid retrieval, and RAG capabilities at any scale. <a href="http://nexla.com">Nexla</a> eliminates months of pipeline development and makes multi-source data flows conversational.</p>

<p><strong>Ready to explore?</strong></p>

<p>Start at <a href="http://express.dev">express.dev</a> for conversational pipeline building, or explore the <a href="https://docs.nexla.com/user-guides/connectors/vespa_api/overview">Vespa connector</a> in Nexla’s platform to see how quickly your data can power real AI applications.</p>
]]></content:encoded>
        <pubDate>Mon, 16 Feb 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/vespa-nexla-partnership/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/vespa-nexla-partnership/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        
      </item>
    
      <item>
        <title>Clarm: Agentic AI-powered Sales for Developers with Vespa Cloud</title>
        <description>Agentic AI-powered Sales for Developers, built on Vespa</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-01-16-agentic-ai-powered-sales-for-developers-with-vespa/clarmcase.jpg" />
        
        <content:encoded><![CDATA[<!--
|--------------------------|--------------|
| **Industry:**            | Technology   |
| **Founded:**             | 2024         |
| **Backing:**             | Y Combinator |

Vespa Cloud → Vespa Enclave (AWS) 
-->

<h2 id="overview">Overview</h2>
<p>Clarm helps open source software companies <a href="https://www.clarm.com/blog/articles/convert-github-stars-to-revenue?utm_source=vespa&amp;utm_campaign=clarm_case_study" data-proofer-ignore="">convert GitHub stars into revenue</a> through AI-powered lead generation, content production, and developer support automation. When building their platform, Clarm needed a search engine that could power accurate, zero-hallucination AI responses while handling complex enrichment across millions of GitHub data points. They chose <a href="http://vespa.ai">Vespa</a> for its unified text, vector, and structured search capabilities and were able to deploy to production in under a day.</p>

<h2 id="the-problem-software--oss-companies-struggle-to-monetize">The Problem: Software / OSS Companies Struggle to Monetize</h2>
<p>“Most OSS founders can’t get attention for their software initially. They’re <a href="https://www.clarm.com/blog/articles/developer-growth-engine-automating-sales-marketing?utm_source=vespa&amp;utm_campaign=clarm_case_study" data-proofer-ignore="">so focused on building the product that marketing, SEO, and content creation get dropped</a>. We built Clarm to automate all the growth work founders drop so they can focus on git commits,” explains Marcus Storm-Mollard, founder and CEO of Clarm.</p>

<p>The challenge is fundamental: 99% of successful open source is funded by businesses paying for solutions, but early-stage OSS companies lack the infrastructure to identify, engage, and convert those potential paying customers. They have thousands of GitHub stars but no clear path to revenue.</p>

<p>Clarm addresses this through three product pillars:</p>
<ol>
  <li>
    <p><strong>Lead Generation &amp; Prospecting:</strong> The killer feature. Clarm takes repo data from customers and competitors, enriches it with signals from website visits, commits, issues, and community engagement, then ranks and identifies good-fit prospects and potential enterprise buyers.</p>
  </li>
  <li>
    <p><strong>Marketing &amp; Content Production:</strong> Automated content creation from commits, PRs, and codebase analysis, helping OSS companies maintain consistent technical marketing.</p>
  </li>
  <li>
    <p><strong>Developer Support Automation:</strong> AI-powered support across Discord, Slack, GitHub Issues, and websites, with deep integrations and analytics for scaling customer success.</p>
  </li>
</ol>

<h2 id="the-search-challenge">The Search Challenge</h2>
<p>At the core of all three pillars sits a critical technical requirement: accurate, explainable search and retrieval.</p>

<blockquote>
  <p>“We realized early that search, not generation, was the real problem to solve. Generating LLM answers isn’t hard. Finding the right information to base them on is everything,” Marcus notes.</p>
</blockquote>

<p>Clarm needed a search engine that could:</p>
<ul>
  <li>Handle hybrid retrieval (combining text search, vector embeddings, and structured filters)</li>
  <li>Power zero-hallucination AI responses grounded in verifiable context</li>
  <li>Process and rank millions of GitHub data points in real-time</li>
  <li>Support complex multi-signal enrichment for lead scoring</li>
  <li>Scale cost-effectively on a startup budget</li>
</ul>

<p><a href="https://blog.vespa.ai/why-search-platform-is-better-than-vector-database/">Traditional vector databases</a> like Supabase or search engines like <a href="https://blog.vespa.ai/modernizing-elasticsearch-with-vespa/">Elasticsearch</a> couldn’t deliver the unified, production-grade retrieval required for Clarm’s zero-hallucination architecture.</p>

<h2 id="the-solution-vespas-production-grade-hybrid-search">The Solution: Vespa’s Production-Grade Hybrid Search</h2>

<p>Marcus discovered Vespa after researching how companies like <a href="https://blog.vespa.ai/perplexity-builds-ai-search-at-scale-on-vespa-ai/">Perplexity</a> and <a href="https://blog.vespa.ai/using-vespa-cloud-resource-suggestions-to-optimize-costs/">Onyx</a> built their advanced retrieval systems.</p>

<blockquote>
  <p>“We really liked that Vespa started as a search engine and evolved into a vector-based system.
It made so much sense for what we were building.
Vespa’s ranking and tensoring are built in, so we know our results are accurate and relevant right out of the box,” Marcus explains.</p>
</blockquote>

<h4 id="rapid-deployment-less-than-one-day-to-production">Rapid Deployment: Less Than One Day to Production</h4>
<p>Clarm began experimenting with Vespa’s Docker image for local development, then transitioned to Vespa Cloud for production deployment during their Y Combinator batch.</p>

<blockquote>
  <p>“It took about half a day to set up how we wanted it. That speed of onboarding made a huge impact during YC. We just deployed it, and it worked,” Marcus recalls.</p>
</blockquote>

<p>The quick deployment was critical. Clarm was racing toward Demo Day and couldn’t afford weeks of infrastructure setup. Vespa’s unified approach eliminated the complexity of stitching together multiple systems for text, vector, and structured search.</p>

<h4 id="key-vespa-capabilities-powering-clarm">Key Vespa Capabilities Powering Clarm</h4>

<ul>
  <li>Unified Retrieval Pipeline
    <ul>
      <li>Single query endpoint combining text search, vector similarity, and structured filters - no need to orchestrate multiple databases or services.</li>
    </ul>
  </li>
  <li>Built-in <a href="https://docs.vespa.ai/en/ranking/phased-ranking.html#">Ranking</a> &amp; <a href="https://docs.vespa.ai/en/ranking/tensor-user-guide.html#">Tensor Operations</a>
    <ul>
      <li>Native support for complex ranking models and tensor operations means Clarm can implement sophisticated lead scoring without custom ranking layers.</li>
    </ul>
  </li>
  <li><a href="https://143590857.fs1.hubspotusercontent-eu1.net/hubfs/143590857/PDF-reports/Scaling-Smarter_-Vespas-Approach-to-High-Performance-Data-Management-3.pdf?hsCtaAttrib=232558642374">Real-Time</a> Indexing
    <ul>
      <li>GitHub events, user interactions, and enrichment signals are instantly searchable, enabling live lead intelligence and up-to-date AI responses.</li>
    </ul>
  </li>
  <li>Scalable Cloud Deployment
    <ul>
      <li><a href="https://vespa.ai/vespa-content/uploads/2025/07/Autoscaling-with-Vespa.pdf">Automatic scaling</a> and high availability handled by Vespa Cloud, allowing Clarm’s two-person engineering team to focus on product features instead of infrastructure operations.</li>
    </ul>
  </li>
  <li>Developer-Friendly <a href="https://docs.vespa.ai/en/learn/overview.html">Architecture</a>
    <ul>
      <li>Docker-based local development, straightforward schema design, and comprehensive documentation enabled rapid prototyping and iteration.</li>
    </ul>
  </li>
</ul>

<h2 id="the-results">The Results</h2>
<p>Clarm’s decision to build on Vespa Cloud delivered immediate impact:</p>
<ul>
  <li><strong>&lt;1 Day to Production:</strong> From prototype to live search infrastructure deployed during YC</li>
  <li><strong>Zero-Hallucination Architecture:</strong> Accurate retrieval enabling trustworthy AI responses grounded in verifiable data</li>
  <li><strong>High-Quality Lead Intelligence:</strong> Sophisticated ranking of GitHub data points across 50K+ collective stars from customers like <a href="https://better-auth.com/?utm_source=vespa&amp;utm_campaign=clarm_case_study">Better Auth</a> (23.3K stars) and <a href="https://cua.ai/?utm_source=vespa&amp;utm_campaign=clarm_case_study">Cua</a> (11.3K stars)</li>
  <li><strong>Exceptional Support:</strong> Direct collaboration with Vespa’s engineering team throughout development</li>
</ul>

<blockquote>
  <p>“The setup was easy, the support from the Vespa team was incredible, and everything just worked. We didn’t need to look anywhere else,” Marcus emphasizes.</p>
</blockquote>

<h4 id="customer-success-github-stars-becoming-revenue">Customer Success: <a href="https://www.clarm.com/blog/articles/convert-github-stars-to-revenue?utm_source=vespa&amp;utm_campaign=clarm_case_study" data-proofer-ignore="">GitHub Stars Becoming Revenue</a></h4>
<p>Clarm’s customers are seeing measurable results from the AI-powered lead generation platform:</p>
<ul>
  <li><strong>Better Auth:</strong> Grew from 8K to 23.3K GitHub stars in 3 months with Clarm’s lead gen and engagement automation</li>
  <li><strong>c/ua:</strong> Scaled from 5K to 11.3K stars while identifying and converting enterprise prospects</li>
  <li><strong><a href="https://www.skyvern.com/?utm_source=vespa&amp;utm_campaign=clarm_case_study">Skyvern AI:</a></strong> After hitting 19K stars, reduced support workload by 94% with Clarm across GitHub, Discord, and Slack</li>
  <li><strong>Engagement Depth:</strong> Developers “pair programming” with Clarm’s AI agents for extended sessions, sending thousands of queries a day in sessions lasting up to 22 hours</li>
</ul>

<h4 id="whats-next-building-the-future-of-oss-monetization">What’s Next: Building the Future of OSS Monetization</h4>
<p>Clarm represents a <a href="https://www.clarm.com/blog/articles/best-developer-growth-automation-tools-for-software-products-in-2025?utm_source=vespa&amp;utm_campaign=clarm_case_study">new category of growth infrastructure</a> built specifically for software and open source companies. By combining Vespa’s production-grade retrieval with their own zero-hallucination agent framework, Clarm is proving that AI-powered sales and marketing can be trustworthy, explainable, and grounded in truth.</p>

<blockquote>
  <p>“We’re focused on proving product value and retaining customers right now. Everything depends on us growing our customers’ MRR and showing software and OSS companies they can build sustainable businesses,” Marcus shares.</p>
</blockquote>

<p>That focus is reflected in Clarm’s positioning: “You build awesome software. Now build a business.” It resonates with software founders who want to monetize without compromising their community values. By recognizing that the vast majority of successful open source is ultimately funded by businesses paying for solutions, Clarm offers a clear path forward: free software for the community, paid solutions for enterprises.</p>

<h2 id="conclusion">Conclusion</h2>
<p>Clarm’s architecture reinforces a lesson many teams learn the hard way: LLMs are only as reliable as the retrieval systems behind them. By treating retrieval as a first-class system, built on Vespa Cloud, Clarm unified text search, vector similarity, structured filtering, and ranking into a single production-grade platform, eliminating the fragility and guesswork common in vector-only stacks.</p>

<p>The result is an agentic AI platform that can reason over live data, explain its outputs, and scale predictably without stitching together multiple databases or post-hoc ranking layers. This foundation enabled a small team to move from prototype to production in days, operate across millions of GitHub signals, and help open source companies turn community adoption into sustainable revenue.</p>

<p>More importantly, Clarm’s success offers a blueprint for any organization building serious AI applications: when retrieval is reliable, ranking is expressive, and data is always fresh, AI systems become trustworthy enough to power real business outcomes. Clarm is building the future of OSS monetization, and Vespa is the retrieval engine making it possible.</p>

]]></content:encoded>
        <pubDate>Mon, 19 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/agentic-ai-powered-sales-for-developers-with-vespa/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/agentic-ai-powered-sales-for-developers-with-vespa/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        
      </item>
    
      <item>
        <title>Embedding Tradeoffs, Quantified</title>
        <description>The embedding strategy you choose has a major impact on cost, quality, and latency. We ran a series of experiments to help you make better-informed tradeoffs.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-01-14-embedding-tradeoffs-quantified/control-dashboard.png" />
        
        <content:encoded><![CDATA[<p>Most Vespa users run hybrid search - combining BM25 (and/or other lexical features) with semantic vectors. But which embedding model should you use? And how do you balance cost, quality, and latency as you scale?</p>

<p>The typical approach: open the <a href="https://huggingface.co/spaces/mteb/leaderboard">MTEB leaderboard</a>, find the “Retrieval” column, sort descending, pick something that fits your size budget. Done, right?</p>

<p>Not quite. MTEB doesn’t tell you:</p>

<ul>
  <li>How fast is inference on your actual hardware?</li>
  <li>What happens when you quantize the model weights?</li>
  <li>How much quality do you lose with binary vectors?</li>
  <li>Does this model even work well in a hybrid setup?</li>
</ul>

<p>So we ran the experiments ourselves. We picked models from the MTEB Retrieval leaderboard with these criteria:</p>

<ul>
  <li>Under 500M parameters (practical for most deployments)</li>
  <li>Open license</li>
  <li>ONNX weights available (required for Vespa)</li>
  <li>At least 10k downloads in the last month (actually used in production)</li>
</ul>

<p>For each model, we benchmarked across:</p>

<ul>
  <li><strong>Model quantizations</strong> (FP32, FP16, INT8)</li>
  <li><strong>Vector precisions</strong> (float, bfloat16, binary)</li>
  <li><strong>Matryoshka dimensions</strong> (for models that support it)</li>
  <li><strong>Real hardware</strong> (Graviton3, Graviton4, T4 GPU)</li>
  <li><strong>Hybrid retrieval</strong> (semantic, RRF, and score normalization methods)</li>
</ul>

<p><strong>Spoiler:</strong> We found some <em>really</em> attractive tradeoffs - 32x memory reduction, 4x faster inference, with nearly identical quality.</p>

<h2 id="what-mteb-doesnt-show-you">What MTEB doesn’t show you</h2>

<h3 id="model-quantization">Model quantization</h3>

<p>Vespa uses <a href="https://onnxruntime.ai/">ONNX Runtime</a> for <a href="https://docs.vespa.ai/en/embedding.html">embedding inference</a>. Most models on Hugging Face ship with multiple ONNX variants - here’s <a href="https://huggingface.co/Alibaba-NLP/gte-modernbert-base/tree/main/onnx">Alibaba-NLP/gte-modernbert-base</a> as an example:</p>

<p><img src="/assets/2026-01-14-embedding-tradeoffs-quantified/model-quantizations.png" alt="model quantizations" /></p>

<p>Lower precision weights = smaller model = faster inference. But how much faster, and what’s the quality hit?</p>

<ul>
  <li><strong>On CPU:</strong> INT8 models run 2.7-3.4x faster while keeping 94-98% of the quality</li>
  <li><strong>On GPU:</strong> INT8 is actually 4-5x <em>slower</em> than FP32. Don’t do this.</li>
</ul>

<p>The difference between 30ms and 100ms query latency is huge. If you’re on CPU, INT8 is often a no-brainer.</p>

<p>On GPU, use FP16 instead - you get <a href="https://sbert.net/docs/sentence_transformer/usage/efficiency.html">~2x speedup with no meaningful quality loss</a>.</p>

<p><strong>GPU vs CPU:</strong> The T4 GPU runs 4-7x faster than Graviton3 for embedding inference. If you’re processing high query volumes or doing bulk indexing, GPU may be worth it.</p>

<h3 id="vector-precision">Vector precision</h3>

<p>Model quantization affects <em>inference</em> speed. Vector precision affects <em>storage</em> and <em>search</em> speed. Different knobs, both important.</p>

<p>Here’s the math for 100 million 768-dimensional embeddings:</p>

<style>
  table, th, td {
    border: 1px solid black;
  }
  th, td {
    padding: 5px;
  }
</style>

<table>
  <thead>
    <tr>
      <th>Precision</th>
      <th style="text-align: center">Bytes/Dim</th>
      <th style="text-align: center">100M vectors</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>FP32</td>
      <td style="text-align: center">4</td>
      <td style="text-align: center">307 GB</td>
    </tr>
    <tr>
      <td>FP16</td>
      <td style="text-align: center">2</td>
      <td style="text-align: center">154 GB</td>
    </tr>
    <tr>
      <td>INT8 (scalar)</td>
      <td style="text-align: center">1</td>
      <td style="text-align: center">77 GB</td>
    </tr>
    <tr>
      <td>Binary (packed)</td>
      <td style="text-align: center">0.125</td>
      <td style="text-align: center">9.6 GB</td>
    </tr>
  </tbody>
</table>

<p><br />
That’s a 32x difference between FP32 and binary. When memory is what forces you to add more nodes, this matters a lot.</p>
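<p>The table values follow directly from vectors × dimensions × bytes per dimension. A minimal sketch of that arithmetic, assuming decimal gigabytes (1 GB = 10^9 bytes) and counting only the raw vector data, not HNSW links or other index overhead. The function name is illustrative:</p>

```python
# Raw storage for embedding vectors at different precisions.
# Index overhead (for example HNSW links) is deliberately not included.

BITS_PER_DIM = {
    "FP32": 32,
    "FP16": 16,
    "INT8 (scalar)": 8,
    "Binary (packed)": 1,
}


def vector_storage_gb(num_vectors: int, dims: int, bits_per_dim: int) -> float:
    """Return raw vector storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = num_vectors * dims * bits_per_dim / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for name, bits in BITS_PER_DIM.items():
        gb = vector_storage_gb(100_000_000, 768, bits)
        print(f"{name:16s} {gb:8.1f} GB")
```

<p>Swap in your own vector count and dimensionality to estimate when memory, rather than CPU, will force you to add nodes.</p>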

<p><strong>bfloat16 is free:</strong> In our benchmarks, bfloat16 vectors show zero quality loss compared to FP32 - it’s a 2x storage reduction you can take without any tradeoff.</p>

<h3 id="matryoshka-dimensions">Matryoshka dimensions</h3>

<p>Some models support <a href="https://huggingface.co/blog/matryoshka">Matryoshka Representation Learning (MRL)</a> - you can truncate the embedding to fewer dimensions and still get decent results. Fewer dimensions = less storage, faster search.</p>

<p>Here’s EmbeddingGemma at different dimension sizes:</p>

<p><img src="/assets/2026-01-14-embedding-tradeoffs-quantified/embeddinggemma-mrl.png" alt="EmbeddingGemma MRL" /></p>

<p><em>Source: <a href="https://arxiv.org/pdf/2509.20354">EmbeddingGemma paper</a></em></p>

<p>Interestingly, EmbeddingGemma actually scores <em>higher</em> at 512 dimensions than at 768. We didn’t dig into why - it may be an artifact of the smaller evaluation set - but it’s a reminder that more dimensions isn’t always better.</p>

<p>Not all models support this - check the model card before truncating. If it wasn’t trained for MRL, slicing dimensions will tank your quality.</p>
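<p>For a model that was trained with MRL, truncation itself is simple: keep the leading dimensions and rescale. The sketch below is illustrative rather than the code used in our benchmark. It assumes the embedding is a plain list of floats and that you want unit-length output, which is what angular or prenormalized distance metrics expect:</p>

```python
import math


def truncate_and_normalize(embedding: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values and rescale the result to unit length."""
    if dims not in range(1, len(embedding) + 1):
        raise ValueError("dims must be between 1 and the embedding length")
    head = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]


if __name__ == "__main__":
    full = [3.0, 4.0, 12.0]
    # Keeps [3.0, 4.0] and divides by their length (5.0).
    print(truncate_and_normalize(full, 2))
```

<p>Re-normalizing matters because a truncated vector is no longer unit length, even if the full vector was.</p>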

<h3 id="inference-speed">Inference speed</h3>

<p>If you have a 200ms latency budget and your embedding model takes 150ms, you’re in trouble. We benchmarked actual inference times so you can plan accordingly.</p>

<p>We measured two things for each model:</p>

<ol>
  <li><strong>Query latency</strong> - how long to embed an 8-word query</li>
  <li><strong>Document throughput</strong> - embeddings per second for 103-word docs</li>
</ol>

<p>Tested on three AWS instance types:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">c7g.2xlarge</code> - Graviton 3 (ARM CPU)</li>
  <li><code class="language-plaintext highlighter-rouge">g4dn.xlarge</code> - T4 GPU</li>
  <li><code class="language-plaintext highlighter-rouge">m8g.xlarge</code> - Graviton 4 (ARM CPU)</li>
</ul>

<p>These numbers are pure ONNX inference time. Your actual indexing throughput will also depend on HNSW config and existing index size, but embedding inference is usually the bottleneck.</p>
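<p>If you want to take the same two measurements on your own hardware, a small timing harness is enough to get started. This is a generic sketch, not our benchmark code: <code class="language-plaintext highlighter-rouge">embed</code> stands in for whatever callable runs your model, and the warmup count and use of the median are assumptions.</p>

```python
import statistics
import time
from typing import Callable, Sequence


def measure_query_latency_ms(embed: Callable[[str], object],
                             queries: Sequence[str],
                             warmup: int = 3) -> float:
    """Median wall-clock time per embed() call, in milliseconds."""
    for query in queries[:warmup]:
        embed(query)  # warm caches before timing
    timings = []
    for query in queries:
        start = time.perf_counter()
        embed(query)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def measure_throughput(embed: Callable[[str], object],
                       documents: Sequence[str]) -> float:
    """Documents embedded per second over the whole batch."""
    start = time.perf_counter()
    for doc in documents:
        embed(doc)
    elapsed = time.perf_counter() - start
    return len(documents) / elapsed if elapsed else float("inf")


if __name__ == "__main__":
    # Placeholder embedder so the harness runs without a model installed.
    def fake_embed(text: str) -> list[float]:
        return [float(len(word)) for word in text.split()]

    queries = ["how do binary vectors affect recall in hybrid search"] * 50
    print(f"median query latency: {measure_query_latency_ms(fake_embed, queries):.4f} ms")
    print(f"throughput: {measure_throughput(fake_embed, queries):.0f} docs/s")
```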

<h3 id="quality">Quality</h3>

<p>We evaluated all models on <a href="https://huggingface.co/collections/zeta-alpha-ai/nanobeir">NanoBEIR</a>, a smaller but representative subset of the BEIR benchmark. This let us run a lot of experiments without waiting forever.</p>

<p>For each model, we measured nDCG@10 across four retrieval strategies:</p>

<ul>
  <li><strong>Semantic only</strong> - pure vector similarity</li>
  <li><strong>RRF (Reciprocal Rank Fusion)</strong> - combines BM25 and vector rankings</li>
  <li><strong>Atan hybrid</strong> - normalizes scores using arctangent before combining</li>
  <li><strong>Linear hybrid</strong> - linear normalization before combining</li>
</ul>

<p>The hybrid methods consistently outperform pure semantic search. <strong>Every single model</strong> in our benchmark scored higher with hybrid retrieval than semantic-only. On average, the best hybrid method beats semantic-only by 3-5 percentage points. That’s a meaningful lift you get “for free” by just using BM25 alongside your vectors.</p>
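<p>To make the RRF idea concrete: each document gets 1 / (k + rank) from every ranking it appears in, and the contributions are summed. The sketch below is a standalone illustration, not Vespa’s implementation. The constant k = 60 is a commonly used default and an assumption here, and the document IDs are made up.</p>

```python
def reciprocal_rank_fusion(rankings: list[list[str]],
                           k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs into one list sorted by RRF score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    bm25_ranking = ["a", "b", "c"]      # hypothetical lexical ranking
    vector_ranking = ["b", "c", "a"]    # hypothetical semantic ranking
    for doc_id, score in reciprocal_rank_fusion([bm25_ranking, vector_ranking]):
        print(doc_id, round(score, 5))
```

<p>Because RRF only looks at ranks, it needs no score calibration between BM25 and vector similarity. The atan and linear methods instead normalize the raw scores before adding them.</p>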

<p>We also tested each model with binarized vectors (one bit per dimension, packed into int8 values). This is where things get interesting:</p>

<ul>
  <li><strong>ModernBERT models</strong> barely flinch - Alibaba GTE ModernBERT retains 98% of quality (0.670 binary vs 0.685 float)</li>
  <li><strong>E5 models</strong> take a bigger hit - E5-base-v2 drops to 92% (0.602 binary vs 0.651 float), and E5-small-v2 to just 87%</li>
</ul>

<p>The takeaway: not all models are created equal for binary quantization. The newer ModernBERT-based models handle it much better than the E5 family. Make sure to check before assuming you can just binarize everything.</p>
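<p>For intuition, here is what binarization and bit packing do to a vector, and how hamming distance is computed on the packed result. This is a pure-Python sketch. The threshold at zero and the most-significant-bit-first packing order are assumptions chosen to mirror common practice, so verify them against your own pipeline.</p>

```python
def binarize(embedding: list[float]) -> list[int]:
    """Map each dimension to 1 if it is positive, else 0."""
    return [1 if x > 0 else 0 for x in embedding]


def pack_bits(bits: list[int]) -> list[int]:
    """Pack groups of 8 bits (most significant bit first) into signed int8 values."""
    if len(bits) % 8 != 0:
        raise ValueError("number of bits must be a multiple of 8")
    packed = []
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = byte * 2 + bit
        # Reinterpret the unsigned byte as a signed int8 value.
        packed.append(byte - 256 if byte > 127 else byte)
    return packed


def hamming_distance(a: list[int], b: list[int]) -> int:
    """Count differing bits between two packed int8 vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # (x ^ y) % 256 keeps the low 8 bits, also for negative int8 values.
    return sum(bin((x ^ y) % 256).count("1") for x, y in zip(a, b))


if __name__ == "__main__":
    doc = pack_bits(binarize([0.5, -0.1, 0.0, 2.0, -3.0, 1.0, 1.0, -1.0]))
    query = pack_bits(binarize([0.4, 0.2, -0.3, 1.0, -1.0, 0.9, -0.5, -0.2]))
    print(doc, query, hamming_distance(doc, query))
```

<p>A 768-dimensional float embedding becomes 96 int8 values this way, which is where the 32x storage reduction comes from.</p>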

<h2 id="interactive-leaderboard">Interactive leaderboard</h2>

<p>We built an interactive leaderboard so you can explore the full results yourself. Filter by hardware, sort by different metrics, and expand each model to see the full breakdown across dimensions and precisions. <a href="https://huggingface.co/spaces/vespa-engine/nanobeir-hybrid-evaluation">Open in full screen</a>.</p>

<iframe src="https://vespa-engine-nanobeir-hybrid-evaluation.static.hf.space" frameborder="0" width="100%" height="1200">
</iframe>

<h2 id="getting-started-with-vespa">Getting started with Vespa</h2>

<p>Ready to put this into practice? Here’s how to configure an <a href="https://docs.vespa.ai/en/embedding.html">embedding model in Vespa</a>:</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;component</span> <span class="na">id=</span><span class="s">"alibaba_gte_modernbert_int8"</span> <span class="na">type=</span><span class="s">"hugging-face-embedder"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;transformer-model</span> <span class="na">model-id=</span><span class="s">"alibaba-gte-modernbert"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;max-tokens&gt;</span>8192<span class="nt">&lt;/max-tokens&gt;</span>
    <span class="nt">&lt;pooling-strategy&gt;</span>cls<span class="nt">&lt;/pooling-strategy&gt;</span>
<span class="nt">&lt;/component&gt;</span>
</code></pre></div></div>

<p>Here’s a schema with a binarized embedding field (768 bits packed into 96 int8 values):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>schema doc {
  document doc {
    field id type string {
      indexing: summary | attribute
    }
    field text type string {
      indexing: index | summary
      index: enable-bm25
    }
  }
  field embedding_alibaba_gte_modernbert_int8_96_int8 type tensor&lt;int8&gt;(x[96]) {
    indexing: input text | embed alibaba_gte_modernbert_int8 | pack_bits | index | attribute
    attribute {
      distance-metric: hamming
    }
    index {
      hnsw {
        max-links-per-node: 16
        neighbors-to-explore-at-insert: 200
      }
    }
  }
}
</code></pre></div></div>

<p>And a rank profile using linear normalization for hybrid scoring:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rank-profile hybrid_linear {
  inputs {
    query(q) tensor&lt;int8&gt;(x[96])
  }
  function similarity() {
    expression {
      1 - (distance(field, embedding_alibaba_gte_modernbert_int8_96_int8) / 768)
    }
  }
  first-phase {
    expression: similarity
  }
  global-phase {
    expression: normalize_linear(bm25(text)) + normalize_linear(similarity)
    rerank-count: 1000
  }
  match-features {
    similarity
    bm25(text)
  }
}
</code></pre></div></div>
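<p>To see what the rank profile computes, here is the same arithmetic in plain Python. It is a sketch of the idea, not Vespa’s implementation: it assumes linear normalization means min-max scaling over the reranked candidates, and returning 0.0 when all scores are equal is our own choice for the degenerate case.</p>

```python
def hamming_similarity(distance: int, num_bits: int = 768) -> float:
    """Mirror of the similarity() function: 1 - hamming distance / bits."""
    return 1.0 - distance / num_bits


def normalize_linear(scores: list[float]) -> list[float]:
    """Min-max scale scores to [0, 1] over the candidate set."""
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0 for _ in scores]  # assumption: no spread means no signal
    return [(s - low) / (high - low) for s in scores]


def hybrid_linear(bm25_scores: list[float],
                  hamming_distances: list[int]) -> list[float]:
    """Sum of normalized BM25 and normalized hamming similarity per candidate."""
    similarities = [hamming_similarity(d) for d in hamming_distances]
    return [b + s for b, s in zip(normalize_linear(bm25_scores),
                                  normalize_linear(similarities))]


if __name__ == "__main__":
    # Three hypothetical candidates with made-up scores.
    print(hybrid_linear([2.0, 6.0, 4.0], [384, 0, 192]))
```

<p>Each component lands in [0, 1], so the combined score ranges from 0 to 2 and neither BM25 nor the vector similarity dominates purely because of its scale.</p>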

<p>Check out the <a href="https://docs.vespa.ai/en/embedding.html">embedding documentation</a> for full details on configuration, including how to set up <a href="https://docs.vespa.ai/en/rag/binarizing-vectors.html">binary quantization</a> and hybrid search.</p>

<h3 id="going-further">Going further</h3>

<p>Binary vectors are fast - really fast. Vespa can do ~1 billion hamming distance calculations per second, roughly 7x more than prenormalized angular distance. That speed difference means you can crank up <a href="https://docs.vespa.ai/en/nearest-neighbor-search.html#using-nearest-neighbor-query-operator">targetHits</a> significantly and still stay within latency budget. More candidates evaluated = better recall. So binary vectors aren’t just about 32x storage savings - they give you headroom to tune for quality too.</p>
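<p>As a sketch of what raising targetHits looks like in practice, the helper below builds a query request body for the schema and rank profile shown earlier. The helper name and default values are illustrative. It also assumes that the <code class="language-plaintext highlighter-rouge">embed(...)</code> query input produces a packed int8 tensor matching the rank profile’s <code class="language-plaintext highlighter-rouge">query(q)</code> type, so verify that against the embedding documentation for your setup.</p>

```python
import json


def build_hybrid_query(query_text: str, target_hits: int = 1000,
                       hits: int = 10) -> dict:
    """Build a Vespa query body combining lexical matching and nearest neighbor."""
    field = "embedding_alibaba_gte_modernbert_int8_96_int8"
    yql = (
        "select * from doc where userQuery() or "
        f"({{targetHits:{target_hits}}}nearestNeighbor({field}, q))"
    )
    return {
        "yql": yql,
        "query": query_text,
        # Assumption: the embedder binarizes and packs into the int8 query tensor.
        "input.query(q)": "embed(alibaba_gte_modernbert_int8, @query)",
        "ranking": "hybrid_linear",
        "hits": hits,
    }


if __name__ == "__main__":
    body = build_hybrid_query("how does binary quantization affect recall",
                              target_hits=2000)
    print(json.dumps(body, indent=2))
```

<p>The body can be sent as JSON in a POST request to the application’s search endpoint. Increasing <code class="language-plaintext highlighter-rouge">target_hits</code> exposes more candidates to ranking at the cost of more distance calculations.</p>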

<p>And luckily, Vespa’s <a href="https://docs.vespa.ai/en/ranking/phased-ranking.html">phased ranking</a> architecture lets you make up for any remaining quality loss in later phases. You can retrieve candidates with hamming distance, then rescore in any of the following ways:</p>

<ul>
  <li><strong>float-binary</strong> - Use a float query vector, and unpack the bits of the document vector to float for the angular distance calculation. <a href="https://docs.vespa.ai/en/rag/binarizing-vectors.html#rank-profiles-and-queries">Example</a></li>
  <li><strong>float-float</strong> - Retrieve with hamming distance but rerank with full-precision vectors <a href="https://docs.vespa.ai/en/content/attributes.html#paged-attributes-disadvantages">paged in from disk</a>. Should be limited to a small candidate set.</li>
  <li><strong>int8-int8</strong> - Same as float-float, with int8 vectors (scalar quantization, not to be confused with binary quantization) for both query and document. Faster and more storage-efficient than float-float, with a small precision cost.</li>
</ul>
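<p>The float-binary option is the least obvious of the three, so here is the idea in plain Python. This is a sketch under assumptions: bits are unpacked most significant bit first, and each bit is mapped to -1 or +1 before taking a dot product with the float query. Every unpacked document vector then has the same length, so for a given query, ordering by this dot product matches ordering by angular similarity.</p>

```python
def unpack_bits(packed: list[int]) -> list[float]:
    """Expand signed int8 values into 0.0/1.0 floats, most significant bit first."""
    bits = []
    for value in packed:
        byte = value % 256  # reinterpret signed int8 as an unsigned byte
        for shift in range(7, -1, -1):
            bits.append(float((byte // (2 ** shift)) % 2))
    return bits


def float_binary_score(query: list[float], packed_doc: list[int]) -> float:
    """Dot product between a float query and document bits mapped to -1/+1."""
    doc = [2.0 * bit - 1.0 for bit in unpack_bits(packed_doc)]
    if len(doc) != len(query):
        raise ValueError("query length must equal 8 * number of packed values")
    return sum(q * d for q, d in zip(query, doc))


if __name__ == "__main__":
    packed_doc = [-106]  # bit pattern 10010110
    query = [0.5, -0.1, 0.0, 2.0, -3.0, 1.0, 1.0, -1.0]
    print(unpack_bits(packed_doc))
    print(float_binary_score(query, packed_doc))
```

<p>The query keeps its full precision, so this recovers some of the information lost by binarizing both sides, without storing float document vectors.</p>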

<p>See <a href="https://huggingface.co/blog/embedding-quantization#quantization-experiments">this</a> great Hugging Face blog post for more details on these techniques.</p>

<p>For even better results, add a <a href="https://docs.vespa.ai/en/cross-encoders.html">cross-encoder reranker</a> as a final stage. Or, especially if you have several user signals or features, train a <a href="https://docs.vespa.ai/en/xgboost.html">GBDT model</a> to learn optimal combinations.</p>

<p>The beauty of Vespa’s <a href="https://docs.vespa.ai/en/basics/ranking.html">ranking expressions</a> is that you can mix and match all of these - BM25, a bunch of other <a href="https://docs.vespa.ai/en/reference/ranking/rank-features.html">built-in features</a>, vectors, rerankers, learned models - however you want.</p>

<h2 id="a-few-caveats">A few caveats</h2>

<h3 id="multilingual-support">Multilingual support</h3>

<p>If you need to support multiple languages, your options narrow. The <code class="language-plaintext highlighter-rouge">multilingual-e5-base</code> model handles 100+ languages but comes with a quality tradeoff compared to English-only models. For English-only workloads, stick with the specialized models.</p>

<h3 id="context-length">Context length</h3>

<p>Document length matters too. Many newer models handle 8192 tokens, EmbeddingGemma can take 2048, while the E5 family tops out at 512. If your documents are long, look at benchmarks like <a href="https://arxiv.org/html/2402.07440v2">LoCo (long-context retrieval)</a> - NanoBEIR won’t tell you much here.</p>

<p>For long documents, check out Vespa’s <a href="https://blog.vespa.ai/introducing-layered-ranking-for-rag-applications/">layered ranking</a> - it lets you rank chunks within documents so you’re not forced to return irrelevant chunks from top-ranking docs.</p>

<h3 id="test-on-your-own-data">Test on your own data</h3>

<p>NanoBEIR is a good starting point, but your domain matters. A model that tops the leaderboard on scientific papers might struggle with product descriptions, legal documents, or your internal knowledge base.</p>

<p>Benchmark rankings can be misleading for specialized domains. The models we tested were trained on general web data - if your corpus looks very different (medical records, source code, niche industry jargon), the relative rankings might shuffle significantly.</p>

<p>We’ve open-sourced the <a href="https://github.com/vespa-engine/pyvespa/blob/master/vespa/evaluation/_mteb.py">benchmarking code in pyvespa</a> so you can run the same experiments on any model with any dataset compatible with the MTEB library. Swap in your own data and see how different models actually perform for your use case.</p>
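<p>If you only want a quick sanity check on a handful of your own queries before running the full benchmark, nDCG@10 is easy to compute by hand. The sketch below is a standalone illustration, separate from the pyvespa benchmarking code. It assumes linear gain (the relevance label itself) and a log2 rank discount, and the document IDs and labels are hypothetical.</p>

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_ids: list[str], qrels: dict[str, float], k: int = 10) -> float:
    """nDCG@k for one query, given relevance labels keyed by document ID."""
    gains = [qrels.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0.0:
        return 0.0  # no relevant documents are known for this query
    return dcg_at_k(gains, k) / ideal_dcg


if __name__ == "__main__":
    qrels = {"doc1": 1.0, "doc7": 1.0}           # hypothetical judgments
    ranked = ["doc3", "doc1", "doc7", "doc9"]    # hypothetical system output
    print(round(ndcg_at_k(ranked, qrels), 4))
```

<p>Average the per-query values over your query set to compare models or retrieval strategies on your own data.</p>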

<h3 id="consider-finetuning">Consider finetuning</h3>

<p>If off-the-shelf models underperform on your domain, finetuning can help significantly. Even a small set of query-document pairs from your actual data can boost relevance.</p>

<p>Tools like <a href="https://www.sbert.net/docs/sentence_transformer/training_overview.html">sentence-transformers</a> make this straightforward. The ROI is often worth it for production systems where a few percentage points of nDCG translate to real user impact.</p>

<h2 id="wrapping-up">Wrapping up</h2>

<p>The “best” embedding model depends entirely on your constraints. But now you have real data to make that call:</p>

<ul>
  <li><strong>Cost sensitive?</strong> Binary quantization with a compatible model (like GTE ModernBERT) gives you 32x savings with minimal quality loss.</li>
  <li><strong>Running on CPU?</strong> INT8 model quantization speeds up inference 2.7-3.4x.</li>
  <li><strong>Need great quality?</strong> Alibaba GTE ModernBERT + hybrid search is hard to beat.</li>
  <li><strong>Latency-critical?</strong> E5-small-v2 with INT8 can embed a query in only 2.5ms on Graviton3.</li>
</ul>

<p>The interactive leaderboard above has all the details. Explore, filter, and find the sweet spot for your use case.</p>

<p>For those interested in learning more about Vespa, join the <a href="https://vespatalk.slack.com/">Vespa community on Slack</a> to exchange ideas,
seek assistance from the community, or stay in the loop on the latest Vespa developments.</p>
]]></content:encoded>
        <pubDate>Wed, 14 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/embedding-tradeoffs-quantified/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/embedding-tradeoffs-quantified/</guid>
        
        <category>embedding</category>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        
      </item>
    
  </channel>
</rss>
