<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Vespa Blog</title>
    <description>We Make AI Work</description>
    <link>https://blog.vespa.ai/</link>
    <atom:link href="https://blog.vespa.ai/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Wed, 08 Apr 2026 22:31:44 +0000</pubDate>
    <lastBuildDate>Wed, 08 Apr 2026 22:31:44 +0000</lastBuildDate>
    <generator>Jekyll v4.4.1</generator>
    
      <item>
        <title>Using Large ONNX Models with External Data in Vespa Embedders</title>
        <description>Many ONNX models exceed the 2 GB protobuf limit and store weights in external data files. Vespa now supports these models for embedders.
</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-03-27-onnx-external-data-in-vespa-embedders/onnx-external-data-splash.png" />
        
        <content:encoded><![CDATA[<p>Many popular ONNX models exceed the 2 GB <a href="https://protobuf.dev/">protobuf</a> format limit and store their weights in separate external data files.
Until recently, these models could not be used directly in Vespa’s built-in embedders.</p>

<p>This was a long-requested feature on our tracker (see <a href="https://github.com/vespa-engine/vespa/issues/28761">GitHub issue #28761</a>).</p>

<h2 id="the-2-gb-limitation">The 2 GB limitation</h2>

<p><a href="https://onnx.ai/">ONNX</a> uses Google’s Protocol Buffers as its serialization format.
Protobuf has a hard limit of 2 GB on message size.
For smaller models, this is not a problem — all tensor data (the model weights) is embedded directly in the <code class="language-plaintext highlighter-rouge">.onnx</code> file,
making it self-contained.</p>

<p>As models grow larger, they inevitably hit this limitation.
For a model exceeding 2 GB, ONNX tooling splits it into two parts:</p>

<ul>
  <li>A small <strong><code class="language-plaintext highlighter-rouge">.onnx</code> file</strong> containing the model graph structure (typically a few hundred KB to a few MB).</li>
  <li>One or more <strong>external data files</strong> (commonly named <code class="language-plaintext highlighter-rouge">.onnx_data</code>) containing the actual tensor weights.</li>
</ul>

<p>Note that reduced-precision variants of these models (INT8, FP16, etc.) are often small enough to fit in a single self-contained <code class="language-plaintext highlighter-rouge">.onnx</code> file.
The external data split primarily affects the full-precision versions.</p>

<p>Previously, if you pointed a Vespa embedder at a model with external data files, ONNX Runtime would fail to load it
because the data files were not available alongside the model file.</p>

<h2 id="what-changed">What changed</h2>

<p>Vespa embedders now automatically handle ONNX models with external data files.
When you configure an embedder with a URL pointing to an <code class="language-plaintext highlighter-rouge">.onnx</code> file,
Vespa inspects the model to check whether it references any external data files.
If it does, Vespa downloads those files automatically before loading the model.</p>

<p>This feature is available starting from Vespa 8.544.</p>

<h2 id="how-to-use-it">How to use it</h2>

<p>Here is an example using EmbeddingGemma 300M, which uses external data:</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;container</span> <span class="na">id=</span><span class="s">"default"</span> <span class="na">version=</span><span class="s">"1.0"</span><span class="nt">&gt;</span>
  <span class="nt">&lt;component</span> <span class="na">id=</span><span class="s">"gemma"</span> <span class="na">type=</span><span class="s">"hugging-face-embedder"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;transformer-model</span>
      <span class="na">url=</span><span class="s">"https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX/resolve/main/onnx/model.onnx"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;tokenizer-model</span>
      <span class="na">url=</span><span class="s">"https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX/resolve/main/tokenizer.json"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;max-tokens&gt;</span>2048<span class="nt">&lt;/max-tokens&gt;</span>
    <span class="nt">&lt;prepend&gt;</span>
      <span class="nt">&lt;query&gt;</span>task: search result | query: <span class="nt">&lt;/query&gt;</span>
      <span class="nt">&lt;document&gt;</span>title: none | text: <span class="nt">&lt;/document&gt;</span>
    <span class="nt">&lt;/prepend&gt;</span>
  <span class="nt">&lt;/component&gt;</span>
<span class="nt">&lt;/container&gt;</span>
</code></pre></div></div>

<p>If you are deploying to <a href="https://cloud.vespa.ai/">Vespa Cloud</a>, you can also use models from the
<a href="https://docs.vespa.ai/en/rag/model-hub.html">Vespa Model Hub</a> that use external data.
For example, the Multilingual-E5-large model, which will be available on Vespa Cloud 8.668+:</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;container</span> <span class="na">id=</span><span class="s">"default"</span> <span class="na">version=</span><span class="s">"1.0"</span><span class="nt">&gt;</span>
  <span class="nt">&lt;component</span> <span class="na">id=</span><span class="s">"e5"</span> <span class="na">type=</span><span class="s">"hugging-face-embedder"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;transformer-model</span> <span class="na">model-id=</span><span class="s">"multilingual-e5-large"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;max-tokens&gt;</span>512<span class="nt">&lt;/max-tokens&gt;</span>
    <span class="nt">&lt;prepend&gt;</span>
      <span class="nt">&lt;query&gt;</span>query: <span class="nt">&lt;/query&gt;</span>
      <span class="nt">&lt;document&gt;</span>passage: <span class="nt">&lt;/document&gt;</span>
    <span class="nt">&lt;/prepend&gt;</span>
  <span class="nt">&lt;/component&gt;</span>
<span class="nt">&lt;/container&gt;</span>
</code></pre></div></div>

<p>This works with our ONNX-based embedders:</p>

<ul>
  <li><a href="https://docs.vespa.ai/en/embedding.html#huggingface-embedder"><code class="language-plaintext highlighter-rouge">hugging-face-embedder</code></a></li>
  <li><a href="https://docs.vespa.ai/en/embedding.html#colbert-embedder"><code class="language-plaintext highlighter-rouge">colbert-embedder</code></a></li>
  <li><a href="https://docs.vespa.ai/en/embedding.html#splade-embedder"><code class="language-plaintext highlighter-rouge">splade-embedder</code></a></li>
</ul>

<p>It’s also possible to use <a href="https://docs.vespa.ai/en/reference/rag/embedding.html#private-model-hub">private models</a> — authentication tokens are propagated when downloading external data files.</p>

<h2 id="current-limitations">Current limitations</h2>

<p>There are a few constraints to be aware of:</p>

<ul>
  <li>
    <p><strong>Embedders only.</strong> Models used directly in <a href="https://docs.vespa.ai/en/ranking/onnx.html">ranking expressions</a>
must still be self-contained and under 2 GB.</p>
  </li>
  <li>
    <p><strong>URL-referenced or Model Hub models only.</strong> Models bundled in the
<a href="https://docs.vespa.ai/en/application-packages.html">application package</a>
using the <code class="language-plaintext highlighter-rouge">path</code> attribute do not support external data.
Models referenced via <code class="language-plaintext highlighter-rouge">url</code> or <code class="language-plaintext highlighter-rouge">model-id</code> (Vespa Cloud) are supported.</p>
  </li>
  <li>
    <p><strong>External data files must be co-located with the model.</strong>
The external data files are resolved relative to the model URL.
They must be in the same directory (or a subdirectory) as the <code class="language-plaintext highlighter-rouge">.onnx</code> file.</p>
  </li>
</ul>
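<p>The co-location rule follows from plain relative-URL resolution: whatever location the model graph records for its external data is joined against the model’s own URL. A minimal sketch of that resolution (the URLs and file names here are hypothetical):</p>

```python
from urllib.parse import urljoin

# Hypothetical model URL and external-data locations recorded in the graph
model_url = "https://example.com/models/model.onnx"
location = "model.onnx_data"              # same directory as the .onnx file
sub_location = "weights/part0.onnx_data"  # a subdirectory also works

data_url = urljoin(model_url, location)
sub_data_url = urljoin(model_url, sub_location)
print(data_url)      # https://example.com/models/model.onnx_data
print(sub_data_url)  # https://example.com/models/weights/part0.onnx_data
```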

<p>See the <a href="https://docs.vespa.ai/en/ranking/onnx.html#limitations-on-model-size-and-complexity">ONNX model documentation</a>
for the full list of requirements.</p>

<p>If you need more extensive support for ONNX models with external data — for example in ranking expressions —
feel free to <a href="https://github.com/vespa-engine/vespa/issues">file an issue</a>.</p>
]]></content:encoded>
        <pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/onnx-external-data-in-vespa-embedders/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/onnx-external-data-in-vespa-embedders/</guid>
        
        <category>embedding</category>
        
        <category>onnx</category>
        
        
      </item>
    
      <item>
        <title>Asymmetric Retrieval: Spend on Docs, Embed your Queries for Free</title>
        <description>Documents are embedded once — worth the spend for maximum quality. Queries hit you on every request. This is what drives your cost at scale. Asymmetric retrieval with Voyage AI and Vespa. Real numbers, real config.
</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-03-10-asymmetric-retrieval-spend-on-docs-queries-for-free/hero.png" />
        
        <content:encoded><![CDATA[<p>At 10,000 queries per second with ~30-token queries, you’re pushing ~18 million tokens per minute through your embedding API. At $0.02 per million tokens, that’s <strong>over $15,000/month</strong> — just for query embeddings. Documents are embedded once. Queries are embedded forever.</p>

<p>What if you could drop that to $0?</p>

<p>That’s the promise of <strong>asymmetric retrieval</strong>: embed your documents with the best model money can buy, then embed queries with a tiny model running locally — for free. Voyage AI’s new <a href="https://blog.voyageai.com/2026/01/15/voyage-4/">voyage-4 family</a> is the first to make this practical, and Vespa now has native support for it.</p>

<h2 id="the-asymmetric-insight">The asymmetric insight</h2>

<p>The conventional approach is to use the same embedding model for documents and queries. Same model, same vector space. But it ignores a fundamental asymmetry.</p>

<p>Document embedding is a <strong>one-time cost</strong>. You embed each document once at indexing time, and it’s not latency-sensitive — whether it takes 10ms or 500ms doesn’t matter because no user is waiting. You can throw the biggest, most accurate model at it and take your time.</p>

<p>Query embedding is the opposite. It’s on the <strong>critical path of every single request</strong>, continuously, at scale. It needs to be fast, and at 10K QPS the cost dwarfs everything else.</p>

<p>Why use the same model for both?</p>

<p>Asymmetric retrieval splits these two concerns:</p>

<ol>
  <li><strong>Documents</strong> — Embed once with <code class="language-plaintext highlighter-rouge">voyage-4-large</code>. Best accuracy, API-based, no rush.</li>
  <li><strong>Queries</strong> — Embed continuously with <code class="language-plaintext highlighter-rouge">voyage-4-nano</code>. Tiny, local, free.</li>
</ol>

<p>This works because all four models in the Voyage 4 family — <code class="language-plaintext highlighter-rouge">voyage-4-large</code>, <code class="language-plaintext highlighter-rouge">voyage-4</code>, <code class="language-plaintext highlighter-rouge">voyage-4-lite</code>, and <code class="language-plaintext highlighter-rouge">voyage-4-nano</code> — produce <strong>compatible embeddings in a shared vector space</strong>.</p>

<p><img src="/assets/2026-03-10-asymmetric-retrieval-spend-on-docs-queries-for-free/asymmetric-embeddings.png" alt="Asymmetric retrieval: documents embedded with voyage-4-large via API, queries embedded with voyage-4-nano locally" /></p>

<p>It also means you can upgrade your query model independently. Start with <code class="language-plaintext highlighter-rouge">voyage-4-nano</code> for cost, move to <code class="language-plaintext highlighter-rouge">voyage-4-lite</code> for quality — without re-embedding a single document.</p>

<p>The shared embedding space opens up document-side flexibility too. In a multi-tenant system, you could use different models for different tiers — <code class="language-plaintext highlighter-rouge">voyage-4-large</code> for premium customers who need the best retrieval quality, <code class="language-plaintext highlighter-rouge">voyage-4-lite</code> for cost-sensitive tenants — all searchable with the same query model. Same index, same query path, different quality/cost tradeoffs per tenant.</p>

<h2 id="the-numbers">The numbers</h2>

<h3 id="cost">Cost</h3>

<p>Let’s be concrete about the 10K QPS scenario:</p>

<ul>
  <li>10,000 QPS × 30 tokens = 300,000 tokens/sec</li>
  <li>300,000 × 60 × 60 × 24 × 30 = ~777 billion tokens/month</li>
  <li>At $0.02/1M tokens ≈ <strong>$15,500/month</strong> for query embeddings via API</li>
</ul>

<p>With <code class="language-plaintext highlighter-rouge">voyage-4-nano</code> running locally on the Vespa container: <strong>$0/month</strong>. The model runs as part of the serving infrastructure you’re already paying for.</p>
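<p>The arithmetic is easy to verify:</p>

```python
qps = 10_000
tokens_per_query = 30
usd_per_million_tokens = 0.02

tokens_per_second = qps * tokens_per_query                # 300,000
tokens_per_month = tokens_per_second * 60 * 60 * 24 * 30  # ~777.6 billion
monthly_cost = tokens_per_month / 1_000_000 * usd_per_million_tokens

print(f"{tokens_per_month / 1e9:,.1f}B tokens/month -> ${monthly_cost:,.0f}/month")
# 777.6B tokens/month -> $15,552/month
```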

<h3 id="latency">Latency</h3>

<p>API calls add network round-trips. Local inference on <code class="language-plaintext highlighter-rouge">voyage-4-nano</code> runs in single-digit milliseconds on CPU.</p>

<h3 id="quality">Quality</h3>

<p>Voyage 4 is state-of-the-art. On the <a href="https://blog.voyageai.com/2026/01/15/voyage-4/">RTEB benchmark</a> (29 retrieval datasets, NDCG@10), <code class="language-plaintext highlighter-rouge">voyage-4-large</code> beats the competition:</p>

<style>
  table, th, td {
    border: 1px solid black;
  }
  th, td {
    padding: 5px;
  }
</style>

<table>
  <thead>
    <tr>
      <th>Comparison</th>
      <th>Improvement</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>vs. Gemini Embedding 001</td>
      <td>+3.87%</td>
    </tr>
    <tr>
      <td>vs. Cohere Embed v4</td>
      <td>+8.20%</td>
    </tr>
    <tr>
      <td>vs. OpenAI v3 Large</td>
      <td>+14.05%</td>
    </tr>
  </tbody>
</table>

<p><br />
And asymmetric retrieval — querying with a smaller model against <code class="language-plaintext highlighter-rouge">voyage-4-large</code> document embeddings — preserves retrieval quality across medical, code, web, finance, and legal domains.</p>

<h3 id="storage">Storage</h3>

<p>Binary quantization gives you a <strong>16x memory reduction</strong> over bfloat16 — 2048-dim vectors go from 4,096 bytes to 256 bytes. The full-precision floats are still used for second-phase reranking, <a href="https://docs.vespa.ai/en/content/attributes.html#paged-attributes-disadvantages">paged from disk</a> only when needed. For a deeper dive on quantization tradeoffs, see <a href="https://blog.vespa.ai/embedding-tradeoffs-quantified/">Embedding Tradeoffs, Quantified</a>.</p>
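<p>The 16x figure is straightforward bit-packing: a 2048-dim bfloat16 vector occupies 2,048 × 2 = 4,096 bytes, while one sign bit per dimension packs into 2,048 / 8 = 256 int8 cells. A pure-Python sketch of the packing (in practice you would binarize with numpy, or let the embedder produce the int8 tensor directly):</p>

```python
def binarize(vec):
    """Pack sign bits of a float vector into signed int8 cells, 8 dims per byte."""
    assert len(vec) % 8 == 0
    cells = []
    for i in range(0, len(vec), 8):
        byte = 0
        for bit, v in enumerate(vec[i:i + 8]):
            if v > 0:
                byte |= 1 << (7 - bit)
        cells.append(byte - 256 if byte > 127 else byte)  # map 0..255 to int8 range
    return cells

import random
random.seed(42)
vec = [random.uniform(-1, 1) for _ in range(2048)]  # bfloat16 storage: 4,096 bytes
packed = binarize(vec)                              # int8 storage: 256 bytes
print(len(packed), 4096 // len(packed))             # 256 16
```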

<h2 id="why-this-matters-at-scale">Why this matters at scale</h2>

<p>Cost and quality are table stakes. The real question for large-scale systems is: does this work in production?</p>

<h3 id="independent-scaling">Independent scaling</h3>

<p>Vespa separates stateless containers (where embedding runs) from content clusters (where data lives). This means you can scale query embedding capacity independently from storage. Need more QPS? Add container nodes. More documents? Add content nodes. They don’t interfere.</p>

<h3 id="no-external-api-on-the-query-path">No external API on the query path</h3>

<p>This is the underrated benefit. With asymmetric retrieval, the query embedding model runs locally inside Vespa — your critical search path has zero dependency on an external API.</p>

<p>That matters when:</p>

<ul>
  <li><strong>The API goes down.</strong> Every embedding API has outages. If your query path depends on one, your search goes down with it.</li>
  <li><strong>You get rate-limited.</strong> Traffic spikes don’t care about your API quota. A sudden 3x in query volume means dropped requests — or queued requests that blow your latency budget.</li>
  <li><strong>You need to scale fast.</strong> Adding Vespa container nodes takes minutes. Negotiating a higher API rate limit may take days. On <a href="https://docs.vespa.ai/en/cloud/autoscaling.html">Vespa Cloud</a>, autoscaling handles traffic spikes automatically — container clusters are stateless and scale up quickly.</li>
</ul>

<p>Keeping the query path self-contained turns your search system from “works when everything is up” into “works, period.”</p>

<h3 id="two-phase-ranking">Two-phase ranking</h3>

<p>Binary vectors are fast — Vespa can do ~1 billion Hamming distance calculations per second. But binary quantization loses precision. Vespa’s <a href="https://docs.vespa.ai/en/ranking/phased-ranking.html">phased ranking</a> recovers it:</p>

<ol>
  <li><strong>First phase</strong>: Hamming distance on binary embeddings. Fast, cheap, scans the full index.</li>
  <li><strong>Second phase</strong>: Float dot-product on the top 2,000 candidates. Accurate, but only touches a bounded set of vectors paged from disk.</li>
</ol>

<p>This gives you the speed of binary search with the accuracy of full-precision reranking.</p>

<h3 id="enterprise-proven">Enterprise-proven</h3>

<p>This isn’t theoretical. Vespa runs search and recommendation at Spotify, Yahoo, and Perplexity — billions of documents, thousands of QPS, sub-100ms latency. The architecture handles it.</p>

<h2 id="how-to-set-this-up">How to set this up</h2>

<p>Here’s the complete Vespa configuration for asymmetric retrieval with Voyage AI.</p>

<h3 id="schema">Schema</h3>

<p>Two embedding fields — binary for fast retrieval, float for accurate reranking:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>schema doc {
  document doc {
    field id type string {
      indexing: summary | attribute
    }
    field text type string {
      indexing: index | summary
    }
  }

  field embedding_float type tensor&lt;bfloat16&gt;(x[2048]) {
    indexing: input text | embed voyage-4-large | attribute
    attribute {
      distance-metric: prenormalized-angular
      paged
    }
  }

  field embedding_binary type tensor&lt;int8&gt;(x[256]) {
    indexing: input text | embed voyage-4-large | attribute
    attribute {
      distance-metric: hamming
    }
  }
}
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">paged</code> attribute on <code class="language-plaintext highlighter-rouge">embedding_float</code> tells Vespa to keep these vectors on disk, paging them into memory only during second-phase reranking. The binary embeddings stay in memory for fast first-phase retrieval.</p>

<h3 id="embedders-servicesxml">Embedders (services.xml)</h3>

<p>Two embedders — one API-based for documents, one local for queries:</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;container</span> <span class="na">id=</span><span class="s">"default"</span> <span class="na">version=</span><span class="s">"1.0"</span><span class="nt">&gt;</span>
  <span class="nt">&lt;component</span> <span class="na">id=</span><span class="s">"voyage-4-large"</span> <span class="na">type=</span><span class="s">"voyage-ai-embedder"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;model&gt;</span>voyage-4-large<span class="nt">&lt;/model&gt;</span>
    <span class="nt">&lt;api-key-secret-ref&gt;</span>apiKey<span class="nt">&lt;/api-key-secret-ref&gt;</span>
    <span class="nt">&lt;dimensions&gt;</span>2048<span class="nt">&lt;/dimensions&gt;</span>
    <span class="nt">&lt;batching</span> <span class="na">max-size=</span><span class="s">"20"</span> <span class="na">max-delay=</span><span class="s">"20ms"</span><span class="nt">/&gt;</span>
  <span class="nt">&lt;/component&gt;</span>

  <span class="nt">&lt;component</span> <span class="na">id=</span><span class="s">"voyage-4-nano"</span> <span class="na">type=</span><span class="s">"hugging-face-embedder"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;transformer-model</span> <span class="na">model-id=</span><span class="s">"voyage-4-nano-int8"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;tokenizer-model</span> <span class="na">model-id=</span><span class="s">"voyage-4-nano-vocab"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;max-tokens&gt;</span>32768<span class="nt">&lt;/max-tokens&gt;</span>
    <span class="nt">&lt;pooling-strategy&gt;</span>mean<span class="nt">&lt;/pooling-strategy&gt;</span>
    <span class="nt">&lt;normalize&gt;</span>true<span class="nt">&lt;/normalize&gt;</span>
    <span class="nt">&lt;prepend&gt;</span>
      <span class="nt">&lt;query&gt;</span>Represent the query for retrieving supporting documents: <span class="nt">&lt;/query&gt;</span>
    <span class="nt">&lt;/prepend&gt;</span>
  <span class="nt">&lt;/component&gt;</span>
<span class="nt">&lt;/container&gt;</span>
</code></pre></div></div>

<p>The <a href="https://docs.vespa.ai/en/rag/embedding.html#voyageai-embedder"><code class="language-plaintext highlighter-rouge">voyage-ai-embedder</code></a> handles vector quantization automatically — it infers the target precision from the destination tensor type. bfloat16 fields get full-precision embeddings; int8 fields get binary representations.</p>

<p>The <code class="language-plaintext highlighter-rouge">hugging-face-embedder</code> runs <code class="language-plaintext highlighter-rouge">voyage-4-nano</code> locally. No API calls, no rate limits, no cost. Both model references (<code class="language-plaintext highlighter-rouge">voyage-4-nano-int8</code>, <code class="language-plaintext highlighter-rouge">voyage-4-nano-vocab</code>) resolve via the <a href="https://docs.vespa.ai/en/rag/model-hub.html">Vespa Model Hub</a>.</p>

<p><strong>A note on “quantization” — two different things.</strong> The <code class="language-plaintext highlighter-rouge">voyage-4-nano-int8</code> in the <code class="language-plaintext highlighter-rouge">model-id</code> refers to <strong>model weight quantization</strong>: the ONNX model file uses INT8 weights instead of FP32, which makes inference 2-3x faster on CPU with negligible quality loss. This is about how the <em>model itself</em> is stored and executed. The embedder still produces full-precision float vectors as output. <strong>Vector quantization</strong> is a separate concern — it’s about the precision of the <em>output embeddings</em> you store and search over (bfloat16, int8/binary, etc.). That’s controlled by the tensor type in your schema field, not the model format. These are independent knobs: you can run an INT8-quantized model that outputs float vectors, then store them as binary. For a deeper dive with benchmarks on both, see <a href="https://blog.vespa.ai/embedding-tradeoffs-quantified/">Embedding Tradeoffs, Quantified</a>.</p>

<h3 id="rank-profile">Rank profile</h3>

<p>Two-phase ranking: hamming distance first, float reranking second:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rank-profile binary-with-rerank {
  inputs {
    query(q_float) tensor&lt;float&gt;(x[2048])
    query(q_bin) tensor&lt;int8&gt;(x[256])
  }

  function binary_closeness() {
    expression: 1 - (distance(field, embedding_binary) / 2048)
  }

  function float_closeness() {
    expression: reduce(query(q_float) * attribute(embedding_float), sum, x)
  }

  first-phase {
    expression: binary_closeness
  }

  second-phase {
    expression: float_closeness
    rerank-count: 2000
  }
}
</code></pre></div></div>
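<p>The two expressions are easy to model outside Vespa. A hedged pure-Python sketch of what the profile computes (illustrative only — Vespa evaluates these expressions natively over packed int8 and bfloat16 tensors):</p>

```python
def hamming(a, b):
    """Bitwise Hamming distance between two packed int8 vectors."""
    return sum(bin((x ^ y) & 0xFF).count("1") for x, y in zip(a, b))

def binary_closeness(q_bin, d_bin, bits=2048):
    # Mirrors the first phase: 1 - (distance(field, embedding_binary) / 2048)
    return 1 - hamming(q_bin, d_bin) / bits

def float_closeness(q_float, d_float):
    # Mirrors the second phase: reduce(query(q_float) * attribute(embedding_float), sum, x)
    return sum(q * d for q, d in zip(q_float, d_float))

def two_phase(q_bin, q_float, docs, rerank_count=2000, hits=10):
    # First phase scans all candidates cheaply; second phase rescores only the top set
    top = sorted(docs, key=lambda d: binary_closeness(q_bin, d["bin"]),
                 reverse=True)[:rerank_count]
    return sorted(top, key=lambda d: float_closeness(q_float, d["float"]),
                  reverse=True)[:hits]
```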

<h3 id="querying">Querying</h3>

<p>Both query tensors are produced by the local <code class="language-plaintext highlighter-rouge">voyage-4-nano</code> embedder:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>yql=select * from doc where {targetHits: 100}nearestNeighbor(embedding_binary, q_bin)
&amp;ranking=binary-with-rerank
&amp;input.query(q_bin)=embed(voyage-4-nano, "your query here")
&amp;input.query(q_float)=embed(voyage-4-nano, "your query here")
&amp;hits=10
</code></pre></div></div>

<p>The <a href="https://docs.vespa.ai/en/nearest-neighbor-search.html">nearest neighbor search</a> runs on the binary field for speed, while the rank profile handles two-phase scoring.</p>
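<p>The same query can be issued as JSON against Vespa’s query API. A standard-library sketch of the request (the endpoint URL is hypothetical — substitute your container’s address):</p>

```python
import json
import urllib.request

# Both query tensors are produced server-side by the local voyage-4-nano embedder
body = {
    "yql": "select * from doc where {targetHits: 100}nearestNeighbor(embedding_binary, q_bin)",
    "ranking": "binary-with-rerank",
    "input.query(q_bin)": 'embed(voyage-4-nano, "your query here")',
    "input.query(q_float)": 'embed(voyage-4-nano, "your query here")',
    "hits": 10,
}

request = urllib.request.Request(
    "http://localhost:8080/search/",  # hypothetical endpoint
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# result = json.load(urllib.request.urlopen(request))  # run against a live cluster
```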

<p>For a complete runnable example with pyvespa, see the <a href="https://vespa-engine.github.io/pyvespa/examples/voyage-ai-embeddings-cloud.html">Voyage AI embeddings notebook</a>.</p>

<h2 id="wrapping-up">Wrapping up</h2>

<p>Asymmetric retrieval makes the most sense when:</p>

<ul>
  <li><strong>High QPS</strong> — The cost savings scale linearly. At 10K QPS, you’re saving $15.5K/month. At 100K QPS, it’s $155K.</li>
  <li><strong>Large corpus</strong> — Documents are embedded once, so the large model cost is amortized. The bigger the corpus, the more you benefit from cheap queries.</li>
  <li><strong>Latency-sensitive</strong> — Local inference eliminates network round-trips.</li>
</ul>

<p>When a single model is the better choice:</p>

<ul>
  <li><strong>Low volume and latency-tolerant</strong> — At 10 QPS, the API cost is ~$15/month and the network round-trip doesn’t matter. One model is simpler to operate.</li>
  <li><strong>Quality above all else</strong> — Using <code class="language-plaintext highlighter-rouge">voyage-4-large</code> for both documents and queries gives you the best possible retrieval quality. If you can afford the API cost and latency, symmetric with the top model is hard to beat.</li>
</ul>

<p>The Voyage 4 family and Vespa’s native integration make asymmetric retrieval practical for the first time. Embed documents with the best model available, query with a tiny local model, and let phased ranking close the quality gap.</p>

<p><strong>Resources:</strong></p>

<ul>
  <li><a href="https://vespa-engine.github.io/pyvespa/examples/voyage-ai-embeddings-cloud.html">Voyage AI embeddings notebook</a> — Full runnable example</li>
  <li><a href="https://docs.vespa.ai/en/embedding.html">Embedding documentation</a> — Configuring embedders in Vespa</li>
  <li><a href="https://docs.vespa.ai/en/rag/binarizing-vectors.html">Binary quantization guide</a> — Deep dive on binarization</li>
  <li><a href="https://docs.vespa.ai/en/ranking/phased-ranking.html">Phased ranking</a> — Multi-phase ranking architecture</li>
  <li><a href="https://blog.voyageai.com/2026/01/15/voyage-4/">Voyage 4 announcement</a> — Model family details and benchmarks</li>
</ul>

<p>For those interested in learning more about Vespa, join the <a href="https://vespatalk.slack.com/">Vespa community on Slack</a> to exchange ideas,
seek assistance from the community, or stay in the loop on the latest Vespa developments.</p>
]]></content:encoded>
        <pubDate>Tue, 10 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/asymmetric-retrieval-spend-on-docs-queries-for-free/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/asymmetric-retrieval-spend-on-docs-queries-for-free/</guid>
        
        <category>embedding</category>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>voyage-ai</category>
        
        
      </item>
    
      <item>
        <title>How Metal AI Built an Agent-Driven Intelligence Platform on Vespa Cloud</title>
        <description>How Metal built an AI-Native Intelligence Platform on Vespa.ai, where 95% of retrieval is handled by AI agents.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-03-10-metal-case-study-agent-driven-intelligence-on-vespa-cloud/MetalxVespa.png" />
        
        <content:encoded><![CDATA[<blockquote>
  <p>“95% of our retrieval is done by AI agents.” - Sergio Prada, Co-Founder &amp; CTO, Metal</p>
</blockquote>

<p>Metal needed a retrieval foundation that could evolve as fast as their product, without hitting a wall.</p>

<h2 id="introduction">Introduction</h2>

<p>Private equity firms manage vast amounts of unstructured data, including deal documents, expert call transcripts, financial statements, CRM records, and more. The challenge isn’t simply accessing this information. It’s connecting and understanding it, in context, across the investment lifecycle.</p>

<p><a href="https://www.metal.ai/">Metal AI</a> was built to address this challenge. Its purpose-built institutional intelligence platform, used by established private equity firms, transforms fragmented historical and live deal data into a living system of record that drives conviction at every stage of the investment lifecycle.</p>

<p>To deliver this vision at scale, Metal leverages <a href="http://vespa.ai">Vespa.ai</a> as its core retrieval layer, powering entity relationships, advanced ranking, and real-time context-aware retrieval across complex investment data.</p>

<h2 id="the-need-for-relationship-driven-retrieval">The Need for Relationship-Driven Retrieval</h2>

<p>As Metal’s product evolved, the limitations of traditional retrieval systems became clear.</p>

<p>Metal’s early architecture supported basic document search, but private equity workflows aren’t document-centric. They are entity- and relationship-driven. The enduring edge in private equity lies in drawing on decades of deal history, portfolio outcomes, and institutional knowledge. When that depth of experience surfaces reasoning and connections across time, every investment decision carries greater conviction.</p>

<p>Most traditional vector stores and search engines are fundamentally document-first. They index text, return similar passages, and rely primarily on semantic similarity or keyword matching. But for Metal’s use case, relevance requires more:</p>

<ul>
  <li>
    <p>Understanding which answer is the most recent and legally approved</p>
  </li>
  <li>
    <p>Identifying which company a metric belongs to</p>
  </li>
  <li>
    <p>Connecting meetings to prior diligence activity</p>
  </li>
  <li>
    <p>Applying business logic alongside semantic similarity</p>
  </li>
</ul>

<p>As Metal introduced more advanced workflows, like DDQ automation and agent-driven retrieval, the gap widened. Traditional systems struggle to:</p>

<ul>
  <li>
    <p>Combine semantic similarity with recency and compliance rules within ranking</p>
  </li>
  <li>
    <p>Support evolving data models without significant rework</p>
  </li>
  <li>
    <p>Query across multiple object types in a unified way</p>
  </li>
  <li>
    <p>Serve as a foundation for structured, iterative queries issued by AI agents</p>
  </li>
</ul>

<p>Layering custom logic on top of limited retrieval infrastructure would have created increasing technical debt, and each new entity type or ranking rule risked architectural compromise.</p>

<p>Metal needed a retrieval foundation that could evolve with the product, not constrain it.</p>

<h2 id="choosing-a-retrieval-layer-without-limits">Choosing a Retrieval Layer without Limits</h2>

<p>Metal wasn’t simply selecting a search engine. They were selecting a long-term retrieval architecture.</p>

<p>Several capabilities distinguished Vespa:</p>

<ul>
  <li>
    <p><strong>Multi-entity modeling:</strong> Vespa supports multiple object types, like documents, people, activities, and financial data, as well as the relationships between them. This aligned with how Metal structures institutional knowledge.</p>
  </li>
  <li>
    <p><strong>Advanced ranking and filtering:</strong> Vespa can combine semantic similarity with structured filters like recency and business rules, enabling Metal to tailor retrieval to specific workflows.</p>
  </li>
  <li>
    <p><strong>Flexibility without re-architecture:</strong> New object types can be introduced without migrating existing data or rebuilding the system.</p>
  </li>
  <li>
    <p><strong>Operational simplicity:</strong> Moving to Vespa Cloud enabled the team to focus engineering capacity on product innovation instead of infrastructure.</p>
  </li>
</ul>

<p>These capabilities give Metal the ability to shape retrieval around business logic, rather than forcing business logic to adapt to infrastructure limitations.</p>

<blockquote>
  <p>“Our competitors focus on documents. With Vespa, we can focus on the
full picture: companies, people, activities, and how they relate.” -
Sergio Prada, Co-Founder &amp; CTO, Metal</p>
</blockquote>

<h2 id="architecture-in-action">Architecture in Action</h2>

<p>Metal treats retrieval as part of an AI agent orchestration layer, not just a standard search box.</p>

<p>When a user or agent asks a question like, “What’s this company’s EBITDA?”, the query is first interpreted by an AI agent. Rather than issuing a single plain-text search, the agent:</p>

<ul>
  <li>
    <p>Determines which entity types to query (documents, companies, metrics, activities)</p>
  </li>
  <li>
    <p>Applies structured parameters such as recency or workflow-specific filters</p>
  </li>
  <li>
    <p>Executes retrieval against Vespa</p>
  </li>
  <li>
    <p>Iterates as needed (paginating, refining, or querying related entities)</p>
  </li>
  <li>
    <p>Assembles sufficient context before generating a response</p>
  </li>
</ul>

<p>Vespa powers this retrieval layer, enabling fast, structured queries across different object types and supporting the iterative retrieval process required by Metal’s agent-driven architecture.</p>
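<p>As a sketch (schema and field names here are hypothetical, not Metal’s actual configuration), such an agent-issued query might combine vector search with structured filters in Vespa’s YQL:</p>

```
select * from answer where
    ({targetHits:100}nearestNeighbor(chunk_embeddings, q))
    and company_id contains "acme"
    and approved = true
```

<p>The agent can tighten or relax the structured clauses on each iteration, for example adding a recency cutoff, without changing the semantic part of the query.</p>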

<h2 id="turning-ddq-chaos-into-structured-approved-intelligence">Turning DDQ Chaos into Structured, Approved Intelligence</h2>

<p>One clear example is Metal’s Due Diligence Questionnaire (DDQ) workflow. Private equity firms must respond to thousands of LP questionnaires using pre-approved answers. These responses cannot be freely generated by an LLM. They must come from content that has already been reviewed and approved by legal teams.</p>

<p>Answer banks change over time and are stored in unstructured formats like documents and spreadsheets. Metal indexes this data into Vespa, making the system aware of which documents are most recent. When answering a questionnaire, retrieval is prioritized not only by semantic similarity to the question but also by freshness.</p>

<p>This allows Metal to surface the most relevant and up-to-date approved answers, efficiently and reliably within its platform.</p>
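<p>In Vespa, freshness-aware ranking of this kind can be expressed directly in a rank profile. The sketch below is illustrative only; the field names and weight are assumptions, not Metal’s actual configuration:</p>

```
rank-profile approved_answers inherits default {
    first-phase {
        # Blend semantic similarity with document recency:
        # freshness() decays from 1 toward 0 as the timestamp attribute ages
        expression: closeness(field, chunk_embeddings) + 0.3 * freshness(modified_timestamp)
    }
}
```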

<h2 id="scaling-without-infrastructure-headaches">Scaling without Infrastructure Headaches</h2>

<p>By building on <a href="https://vespa.ai/solutions/vespa-cloud/">Vespa Cloud</a>, Metal achieved:</p>

<ul>
  <li>
    <p>Improved feature velocity: The team can introduce new entity types and workflows quickly without architectural rework</p>
  </li>
  <li>
    <p>Greater engineering focus: The team spends less time managing infrastructure and more time building differentiating product features</p>
  </li>
  <li>
    <p>Scalable retrieval architecture: Metal can onboard new clients and data volumes without redesigning retrieval.</p>
  </li>
  <li>
    <p>Confidence in long-term flexibility: Vespa is not a limiting factor as Metal expands into more advanced agent-driven workflows.</p>
  </li>
</ul>

<blockquote>
  <p>“Managing infrastructure can be a distraction. Vespa Cloud lets us focus on product.” - Sergio Prada, Co-Founder &amp; CTO, Metal</p>
</blockquote>

<h2 id="looking-forward-build-for-an-agentic-future">Looking Forward: Build for an Agentic Future</h2>

<p>Metal’s roadmap is deeply agentic. AI agents drive most interactions, deciding how best to query the platform and construct the context needed to answer sophisticated questions.</p>

<p>Because Vespa supports flexible, multi-entity retrieval with advanced ranking and real-time performance, Metal can:</p>

<ul>
  <li>
    <p>Expand into more advanced analysis workflows</p>
  </li>
  <li>
    <p>Build deeper relational structures between entities</p>
  </li>
  <li>
    <p>Adapt retrieval strategies dynamically as business logic evolves</p>
  </li>
</ul>

<p>The result is an institutional intelligence platform that scales in both data volume and intelligence, evolving alongside the firm it serves.</p>

<blockquote>
  <p>“When you’re building something ambitious, you don’t want to hit a capability wall. Vespa gives us confidence that we won’t.” - Sergio Prada, Co-Founder &amp; CTO, Metal</p>
</blockquote>

]]></content:encoded>
        <pubDate>Tue, 10 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/agent-driven-intelligence-on-vespa-cloud/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/agent-driven-intelligence-on-vespa-cloud/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        
      </item>
    
      <item>
        <title>Build a High-Quality RAG App on Vespa Cloud in 15 Minutes</title>
        <description>Retrieval-Augmented Generation (RAG) allows an LLM to answer questions using your data at query time. On their own, LLMs are powerful but limited: they can hallucinate, they have a fixed knowledge cutoff, and they know nothing about your private documents, internal wikis, or proprietary systems.
</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/illustration_2.png" />
        
        <content:encoded><![CDATA[<p><strong>Retrieval-Augmented Generation (RAG)</strong> allows an LLM to answer questions using your data at query time. On their own, LLMs are powerful but limited: they can hallucinate, they have a fixed knowledge cutoff, and they know nothing about your private documents, internal wikis, or proprietary systems.</p>

<p>RAG bridges that gap by retrieving relevant information from your data and supplying it to the model as context, so responses are grounded in real, trusted sources rather than guesswork.</p>

<h2 id="the-challenge-the-quality-of-the-context-window">The Challenge: The Quality of the Context Window</h2>

<p>In RAG, the real bottleneck is the LLM’s context window: you can’t simply pass your entire dataset into a prompt, because there’s a strict token budget.</p>

<p>Because of this, the problem isn’t just retrieving information, but retrieving the right information. When the context window is filled with loosely matched or low-quality results, the LLM has little to work with and the quality of its answers drops accordingly.</p>

<p>High-quality RAG depends on semantic understanding, precise retrieval, and strong ranking across diverse data types so that every token in the context window earns its place.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/illustration_2.png" alt="illustration_2" /></p>

<h2 id="the-solution-out-of-the-box-rag-on-vespa-cloud">The Solution: Out-of-the-Box RAG on Vespa Cloud</h2>

<p>Vespa Cloud provides an out-of-the-box Vespa <a href="https://docs.vespa.ai/en/examples/rag-blueprint.html">RAG Blueprint</a> designed to maximize the quality of the context sent to the LLM. Instead of relying solely on nearest-neighbor vector search, Vespa combines semantic vector retrieval with lexical BM25 scoring and applies advanced ranking, using models such as BERT, LightGBM, or custom logic, to ensure that only the strongest candidates are selected.</p>

<p>This hybrid retrieval and ranking approach consistently surfaces the most relevant document chunks, which significantly improves the quality of the final generated answer.</p>

<p>In this blog post, we’ll build a complete RAG application from end to end by leveraging the out-of-the-box RAG Blueprint on Vespa Cloud. The following diagram shows the architecture we’ll be working with:</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/architecture_diagram.png" alt="Vespa RAG Architecture" /></p>

<p>The architecture consists of two main flows: data ingestion and query processing.</p>

<p><strong>Data Ingestion (one-time setup)</strong></p>

<p>First, we ingest our data sources, such as documents, PDFs, or web pages, using a Python-based pipeline. The pipeline processes the data, splits it into manageable chunks, generates embeddings, and feeds everything into a Vespa Cloud RAG application that is preconfigured with a schema and ranking profiles. This step populates the search index.</p>
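<p>A minimal sketch of the splitting and feeding step in Python (names are illustrative; in the blueprint, Vespa itself can also chunk and embed the <code class="language-plaintext highlighter-rouge">text</code> field at indexing time):</p>

```python
def chunk_text(text: str, size: int = 1024) -> list:
    """Split text into fixed-length character chunks (the schema uses 1024)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def to_vespa_doc(doc_id: str, title: str, text: str) -> dict:
    """Shape of one feed document for the blueprint's `doc` schema."""
    return {"fields": {"id": doc_id, "title": title, "text": text}}

doc = to_vespa_doc("doc-1", "Quarterly report", "A" * 3000)
chunks = chunk_text(doc["fields"]["text"])  # 1024 + 1024 + 952 characters
```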

<p><strong>Query Flow (live interaction)</strong></p>

<ol>
  <li>
    <p>A user enters a question in the <strong>Vespa RAG UI</strong>.</p>
  </li>
  <li>
    <p>The UI sends the query to a <strong>Python backend</strong>, which issues a hybrid search request (combining keyword and vector retrieval) to <strong>Vespa Cloud</strong>.</p>
  </li>
  <li>
    <p><strong>Vespa Cloud</strong> returns the most relevant document chunks.</p>
  </li>
  <li>
    <p>The backend sends those chunks, along with the original query, to an <strong>LLM</strong> as context.</p>
  </li>
  <li>
    <p>The model generates an answer grounded in that context and returns it to the backend.</p>
  </li>
  <li>
    <p>The backend streams the answer back to the UI.</p>
  </li>
</ol>

<p>This architecture ensures that generated responses are grounded in your own data, combining Vespa’s retrieval and ranking strengths with the generative capabilities of large language models.</p>
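<p>Steps 3 through 5 of the query flow can be sketched in a few lines of Python. The response parsing follows Vespa’s standard search-result JSON layout; the prompt wording is our own:</p>

```python
def extract_chunks(vespa_response: dict) -> list:
    """Collect chunk texts from a Vespa search response (root.children[].fields)."""
    chunks = []
    for hit in vespa_response.get("root", {}).get("children", []):
        chunks.extend(hit.get("fields", {}).get("chunks", []))
    return chunks

def assemble_prompt(question: str, chunks: list) -> str:
    """Build a grounded prompt: only retrieved chunks enter the context window."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```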

<p>The end-to-end setup takes about 15 minutes, plus additional time to process your documents.</p>

<hr />

<h2 id="deploy-vespa-rag-blueprint-to-vespa-cloud">Deploy Vespa RAG Blueprint to Vespa Cloud</h2>

<p>We’ll start by deploying a preconfigured RAG Blueprint to Vespa Cloud. This gives you a high-quality retrieval stack in minutes, and it’s free to get started. All of this is done directly from the Vespa Cloud console.</p>

<p><strong>Sign up for Vespa Cloud</strong></p>

<p>Go to the <a href="https://console.vespa-cloud.com/">Vespa Cloud Console</a> and create an account. If this is your first time using Vespa Cloud, the free trial is the fastest way to get going.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_1.png" alt="image_1" /></p>

<p><strong>Deploy RAG Blueprint</strong></p>

<p>In the console, select <strong>“Deploy your first application”</strong>.
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_2.png" alt="image_2" /></p>

<p>Choose <strong>“Select a sample application to deploy directly from the browser”</strong>.
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_3.png" alt="image_3" /></p>

<p>Select <strong>“RAG Blueprint”</strong>.
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_4.png" alt="image_4" /></p>

<p>Click <strong>“Deploy”</strong> and wait for the deployment to complete.
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_5.png" alt="image_5" />
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_8.png" alt="image_8" /></p>

<p><strong>Save your credentials</strong></p>

<p>Once deployment finishes, the console will generate an access token. <strong>Save this immediately.</strong>
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_9.png" alt="image_9" /></p>

<p>That token is how the Python backend authenticates with Vespa Cloud. Treat it like a password.</p>

<p>Continue through the remaining setup screens, then open the application view.
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_10.png" alt="image_10" />
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_11.png" alt="image_11" />
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_12.png" alt="image_12" />
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_13.png" alt="image_13" /></p>

<p><strong>Note your endpoint URL</strong></p>

<p>In the application view you will also find the endpoint URL. Save both the <strong>endpoint URL</strong> and the token; you will need them to configure the Python backend in the next section.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_15.png" alt="image_15" /></p>

<p>You can download the Vespa application package by clicking the download icon if you’d like. From there, you can start building your own data feeding pipeline, frontend UI, and more. However, this blog provides a sample end-to-end RAG application that already includes the same Vespa application package, so there’s no need to download it separately.</p>

<h2 id="behind-the-scenes-what-you-just-deployed">Behind the Scenes: What You Just Deployed</h2>

<p>When you clicked <strong>Deploy</strong>, Vespa Cloud automatically provisioned infrastructure and deployed a complete <strong>Vespa application package</strong>. This package includes everything needed for a high-quality RAG system: schemas, indexing logic, ranking profiles, and service configuration.</p>

<p>In other words, you didn’t just spin up a demo, you launched a ready-to-use, high-quality retrieval engine.</p>

<p>Let’s take a closer look at what’s inside.</p>

<h3 id="the-schema">The Schema</h3>

<p>The RAG Blueprint uses a carefully designed schema that controls how documents are stored, chunked, embedded, and retrieved:</p>

<p><code class="language-plaintext highlighter-rouge">vespa_cloud/schemas/doc.sd</code>:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">schema</span> <span class="n">doc</span> <span class="o">{</span>
    <span class="n">document</span> <span class="n">doc</span> <span class="o">{</span>
        <span class="n">field</span> <span class="n">id</span> <span class="n">type</span> <span class="n">string</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">summary</span> <span class="o">|</span> <span class="n">attribute</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">title</span> <span class="n">type</span> <span class="n">string</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">index</span> <span class="o">|</span> <span class="n">summary</span>
            <span class="nl">index:</span> <span class="n">enable</span><span class="o">-</span><span class="n">bm25</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">text</span> <span class="n">type</span> <span class="n">string</span> <span class="o">{</span>
        <span class="o">}</span>

        <span class="err">#</span> <span class="nc">Optional</span> <span class="n">metadata</span> <span class="n">fields</span> <span class="k">for</span> <span class="n">tracking</span> <span class="n">document</span> <span class="n">usage</span>
        <span class="n">field</span> <span class="n">created_timestamp</span> <span class="n">type</span> <span class="kt">long</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">summary</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">modified_timestamp</span> <span class="n">type</span> <span class="kt">long</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">summary</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">last_opened_timestamp</span> <span class="n">type</span> <span class="kt">long</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">summary</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">open_count</span> <span class="n">type</span> <span class="kt">int</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">summary</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">favorite</span> <span class="n">type</span> <span class="n">bool</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">summary</span>
        <span class="o">}</span>
    <span class="o">}</span>

    <span class="err">#</span> <span class="nc">Binary</span> <span class="n">quantized</span> <span class="n">embeddings</span> <span class="k">for</span> <span class="n">the</span> <span class="nf">title</span> <span class="o">(</span><span class="mi">768</span> <span class="n">floats</span> <span class="err">→</span> <span class="mi">96</span> <span class="n">int8</span><span class="o">)</span>
    <span class="n">field</span> <span class="n">title_embedding</span> <span class="n">type</span> <span class="n">tensor</span><span class="o">&lt;</span><span class="n">int8</span><span class="o">&gt;(</span><span class="n">x</span><span class="o">[</span><span class="mi">96</span><span class="o">])</span> <span class="o">{</span>
        <span class="nl">indexing:</span> <span class="n">input</span> <span class="n">title</span> <span class="o">|</span> <span class="n">embed</span> <span class="o">|</span> <span class="n">pack_bits</span> <span class="o">|</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">index</span>
        <span class="n">attribute</span> <span class="o">{</span>
            <span class="n">distance</span><span class="o">-</span><span class="nl">metric:</span> <span class="n">hamming</span>
        <span class="o">}</span>
    <span class="o">}</span>

    <span class="err">#</span> <span class="nc">Automatically</span> <span class="n">chunks</span> <span class="n">text</span> <span class="n">into</span> <span class="mi">1024</span><span class="o">-</span><span class="n">character</span> <span class="n">segments</span>
    <span class="n">field</span> <span class="n">chunks</span> <span class="n">type</span> <span class="n">array</span><span class="o">&lt;</span><span class="n">string</span><span class="o">&gt;</span> <span class="o">{</span>
        <span class="nl">indexing:</span> <span class="n">input</span> <span class="n">text</span> <span class="o">|</span> <span class="n">chunk</span> <span class="n">fixed</span><span class="o">-</span><span class="n">length</span> <span class="mi">1024</span> <span class="o">|</span> <span class="n">summary</span> <span class="o">|</span> <span class="n">index</span>
        <span class="nl">index:</span> <span class="n">enable</span><span class="o">-</span><span class="n">bm25</span>
    <span class="o">}</span>

    <span class="err">#</span> <span class="nc">Binary</span> <span class="n">quantized</span> <span class="n">embeddings</span> <span class="k">for</span> <span class="n">each</span> <span class="n">chunk</span>
    <span class="n">field</span> <span class="n">chunk_embeddings</span> <span class="n">type</span> <span class="n">tensor</span><span class="o">&lt;</span><span class="n">int8</span><span class="o">&gt;(</span><span class="n">chunk</span><span class="o">{},</span> <span class="n">x</span><span class="o">[</span><span class="mi">96</span><span class="o">])</span> <span class="o">{</span>
        <span class="nl">indexing:</span> <span class="n">input</span> <span class="n">text</span> <span class="o">|</span> <span class="n">chunk</span> <span class="n">fixed</span><span class="o">-</span><span class="n">length</span> <span class="mi">1024</span> <span class="o">|</span> <span class="n">embed</span> <span class="o">|</span> <span class="n">pack_bits</span> <span class="o">|</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">index</span>
        <span class="n">attribute</span> <span class="o">{</span>
            <span class="n">distance</span><span class="o">-</span><span class="nl">metric:</span> <span class="n">hamming</span>
        <span class="o">}</span>
    <span class="o">}</span>

    <span class="n">fieldset</span> <span class="k">default</span> <span class="o">{</span>
        <span class="nl">fields:</span> <span class="n">title</span><span class="o">,</span> <span class="n">chunks</span>
    <span class="o">}</span>

    <span class="n">document</span><span class="o">-</span><span class="n">summary</span> <span class="n">top_3_chunks</span> <span class="o">{</span>
        <span class="n">from</span><span class="o">-</span><span class="n">disk</span>
        <span class="n">summary</span> <span class="n">chunks_top3</span> <span class="o">{</span>
            <span class="nl">source:</span> <span class="n">chunks</span>
            <span class="n">select</span><span class="o">-</span><span class="n">elements</span><span class="o">-</span><span class="nl">by:</span> <span class="n">top_3_chunk_sim_scores</span>
        <span class="o">}</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p><strong>What’s happening here:</strong> Your documents store their raw content in <code class="language-plaintext highlighter-rouge">title</code> and <code class="language-plaintext highlighter-rouge">text</code>. During indexing, the <code class="language-plaintext highlighter-rouge">text</code> field is automatically split into 1024-character chunks. Embeddings are generated for both titles and chunks, then binary-quantized using <code class="language-plaintext highlighter-rouge">pack_bits</code>, shrinking 768 floating-point values down to just 96 <code class="language-plaintext highlighter-rouge">int8</code>s. This dramatically reduces storage and improves performance while still supporting efficient vector similarity search.</p>
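<p>To make the quantization concrete, here is a pure-Python sketch of what <code class="language-plaintext highlighter-rouge">pack_bits</code> with a hamming distance metric amounts to. This mirrors the idea, not Vespa’s internal implementation:</p>

```python
def pack_bits(embedding: list) -> list:
    """Binarize a float vector: positive values become 1-bits, then
    pack every 8 bits into one signed int8 (768 floats -> 96 bytes)."""
    packed = []
    for i in range(0, len(embedding), 8):
        byte = 0
        for value in embedding[i:i + 8]:
            byte = (byte << 1) | (1 if value > 0 else 0)
        packed.append(byte - 256 if byte > 127 else byte)  # two's-complement int8
    return packed

def hamming(a: list, b: list) -> int:
    """Hamming distance: the number of differing bits between packed vectors."""
    return sum(bin((x ^ y) & 0xFF).count("1") for x, y in zip(a, b))
```

<p>Comparing two 96-byte vectors with XOR and popcount is far cheaper than a 768-dimension float dot product, which is why this representation works well for the retrieval phase.</p>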

<p>At the same time, BM25 is enabled for lexical matching. This combination is what enables Vespa’s hybrid retrieval: semantic matching plus exact term relevance.</p>

<p><strong>Out-of-the-Box Query Profiles:</strong></p>

<p>The RAG Blueprint ships with four query profiles optimized for the client-side RAG architecture of NyRAG, the companion frontend and ingestion tool we’ll set up later in this post:</p>

<p><strong>NyRAG Architecture:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>User Query → NyRAG (generates search queries)
          → Vespa (retrieval + ranking)
          → NyRAG (generates final answer)
</code></pre></div></div>
<p>Query profiles control <strong>only the Vespa retrieval/ranking step</strong>. NyRAG handles all LLM interactions.</p>

<p><strong>The 4 Profiles:</strong></p>

<ol>
  <li><strong>hybrid</strong> (default, fast)
    <ul>
      <li><strong>Retrieval:</strong> BM25 + Vector search with <code class="language-plaintext highlighter-rouge">targetHits:100</code></li>
      <li><strong>Ranking:</strong> Learned linear model (logistic regression)</li>
      <li><strong>Best for:</strong> Everyday queries where you want fast, solid results</li>
    </ul>
  </li>
  <li><strong>hybrid-with-gbdt</strong> (highest quality)
    <ul>
      <li><strong>Retrieval:</strong> Same as hybrid (BM25 + Vector, 100 targets)</li>
      <li><strong>Ranking:</strong> Two-phase with LightGBM (GBDT) second-phase</li>
      <li><strong>Best for:</strong> Complex queries where relevance matters most (~2-3x slower)</li>
    </ul>
  </li>
  <li><strong>deepresearch</strong> (exhaustive search)
    <ul>
      <li><strong>Retrieval:</strong> BM25 + Vector with <code class="language-plaintext highlighter-rouge">targetHits:10000</code> (100x more!)</li>
      <li><strong>Ranking:</strong> Learned linear model</li>
      <li><strong>Best for:</strong> Research scenarios needing maximum recall</li>
    </ul>
  </li>
  <li><strong>deepresearch-with-gbdt</strong> (exhaustive + best quality)
    <ul>
      <li><strong>Retrieval:</strong> Deep search (10k targets)</li>
      <li><strong>Ranking:</strong> Two-phase with GBDT</li>
      <li><strong>Best for:</strong> When you need both maximum recall and best ranking</li>
    </ul>
  </li>
</ol>

<blockquote>
  <p><strong>For Advanced Users:</strong> Query profiles bundle complete search configurations including YQL structure (with <code class="language-plaintext highlighter-rouge">nearestNeighbor</code> operators), ranking profiles, and all required parameters (like learned coefficients). The Vespa application also includes <code class="language-plaintext highlighter-rouge">rag</code> and <code class="language-plaintext highlighter-rouge">rag-with-gbdt</code> profiles with <code class="language-plaintext highlighter-rouge">searchChain=openai</code> for <strong>server-side RAG</strong> (direct API usage), but these conflict with NyRAG’s client-side architecture and aren’t used in this walkthrough. Learn more in the <a href="https://github.com/vespaai-playground/RAG-app-in-15min-ragblueprint#ranking-profiles">technical guide</a>.</p>
</blockquote>

<p><strong>Which profile should you use?</strong></p>
<ul>
  <li>Start with <strong><code class="language-plaintext highlighter-rouge">hybrid</code></strong> for everyday use - fast and accurate</li>
  <li>Switch to <strong><code class="language-plaintext highlighter-rouge">hybrid-with-gbdt</code></strong> when quality matters most (harder queries)</li>
  <li>Use <strong><code class="language-plaintext highlighter-rouge">deepresearch</code></strong> when you need to find everything relevant (research mode)</li>
  <li>Try <strong><code class="language-plaintext highlighter-rouge">deepresearch-with-gbdt</code></strong> for maximum recall + quality (slowest but most thorough)</li>
</ul>
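<p>Selecting a profile is just a request parameter. Below is a stdlib-only Python sketch; the endpoint and token are the values you saved from the Vespa Cloud console, and the <code class="language-plaintext highlighter-rouge">queryProfile</code> parameter selects one of the four profiles above:</p>

```python
import json
import urllib.request

PROFILES = {"hybrid", "hybrid-with-gbdt", "deepresearch", "deepresearch-with-gbdt"}

def profile_request(question: str, profile: str = "hybrid") -> dict:
    """Build a Vespa search request body for one of the four blueprint profiles."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile}")
    return {"query": question, "queryProfile": profile, "hits": 10}

def search(endpoint: str, token: str, question: str, profile: str = "hybrid") -> dict:
    """POST the search request to the Vespa Cloud token endpoint."""
    req = urllib.request.Request(
        f"{endpoint}/search/",
        data=json.dumps(profile_request(question, profile)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as response:
        return json.load(response)
```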

<hr />

<p>Now that your RAG Blueprint Vespa Cloud application is up and running, it’s time to add the missing pieces: a simple frontend UI and a data ingestion pipeline. For this, we’ll use <strong>NyRAG</strong>, a tool included in the <a href="https://github.com/vespaai-playground/RAG-app-in-15min-ragblueprint"><code class="language-plaintext highlighter-rouge">RAG-app-in-15min-ragblueprint</code></a> repository.</p>

<p>NyRAG acts as the glue for the entire RAG workflow. It reads documents from local files or websites, splits text into manageable chunks, generates embeddings, feeds everything into Vespa, and finally exposes a lightweight chat UI where you can ask questions over your data. Instead of wiring all of this together yourself, NyRAG gives you a working end-to-end system out of the box.</p>

<h3 id="install-nyrag">Install NyRAG</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Clone the repository</span>
git clone https://github.com/vespaai-playground/RAG-app-in-15min-ragblueprint.git
<span class="nb">cd </span>RAG-app-in-15min-ragblueprint

<span class="c"># Install uv (Fast, modern Python package manager)</span>
<span class="c"># macOS</span>
brew <span class="nb">install </span>uv

<span class="c"># Linux &amp; macOS</span>
<span class="c"># curl -LsSf https://astral.sh/uv/install.sh | sh</span>
<span class="c"># Windows (PowerShell)</span>
<span class="c"># powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"</span>

<span class="c"># Verify uv installation</span>
uv <span class="nt">--version</span>

<span class="c"># Install dependencies using uv</span>
uv <span class="nb">sync
source</span> .venv/bin/activate

<span class="c"># Windows (PowerShell)</span>
<span class="c"># powershell -ExecutionPolicy Bypass</span>
<span class="c"># . .\.venv\Scripts\activate</span>

<span class="c"># Install nyrag locally</span>
uv pip <span class="nb">install</span> <span class="nt">-e</span> <span class="nb">.</span>

<span class="c"># Verify nyrag installation</span>
nyrag <span class="nt">--help</span>
</code></pre></div></div>

<p><strong>Get an LLM API key</strong></p>

<p>To generate final answers, NyRAG needs an OpenAI-compatible API key. The simplest way to get started is <strong>OpenRouter</strong>, which provides access to multiple LLMs through a single API.</p>

<p>In this walkthrough, we’ll use OpenRouter for convenience. In a real application, you’re free to swap in any compatible LLM provider. To continue, sign up for OpenRouter and generate an API key. You’ll use it in the next step when configuring NyRAG.</p>

<hr />

<h3 id="start-the-nyrag-ui">Start the NyRAG UI</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># This script handles all configuration automatically</span>
./run_nyrag.sh

<span class="c"># Windows (PowerShell)</span>
<span class="c"># powershell -ExecutionPolicy Bypass</span>
<span class="c"># .\run_nyrag.ps1</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">run_nyrag.sh</code> script starts the UI and wires up the configuration so NyRAG can talk to Vespa Cloud. In practice, it loads your project config, uses the token you provide for authentication, and starts the web UI on port 8000.</p>

<p>Open http://localhost:8000 in your browser.</p>

<p><strong>Configure your project:</strong>
Now you’ll configure your project using the web UI to connect to your Vespa Cloud deployment and set up document processing.</p>

<p><strong>Step 1: Select and edit the example project</strong></p>

<p>In the top header, the project dropdown shows <strong>“doc_example”</strong>. If you are starting from the example config, it is usually pre-selected, and the configuration editor typically opens automatically.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/nyrag_7.png" alt="Project selector dropdown with &quot;doc_example&quot; highlighted" /></p>

<blockquote>
  <p><strong>Note:</strong> If the configuration editor doesn’t appear (shows chat interface instead), click the <strong>three-dot menu</strong> (⋮) in the top right corner and select <strong>“Edit Config”</strong> to open it manually.</p>
</blockquote>

<p><strong>Step 2: Update your credentials</strong></p>

<p>In the configuration editor, paste in the information you saved from Vespa Cloud and your LLM provider. You only need three things to get going: your Vespa Cloud tenant name, your Vespa endpoint and token, and your LLM API key.</p>

<p><strong>Required fields to update:</strong></p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Your Vespa Cloud credentials (from Vespa Cloud Console)</span>
<span class="na">cloud_tenant</span><span class="pi">:</span> <span class="s">your-tenant</span>          <span class="c1"># Your Vespa Cloud tenant name</span>
<span class="na">vespa_cloud</span><span class="pi">:</span>
  <span class="na">endpoint</span><span class="pi">:</span> <span class="s">https://your-app.vespa-cloud.com</span>  <span class="c1"># Your Vespa token endpoint (not mtls)</span>
  <span class="na">token</span><span class="pi">:</span> <span class="s">vespa_cloud_YOUR_TOKEN_HERE</span>          <span class="c1"># Your Vespa data plane token</span>

<span class="c1"># Your LLM configuration (default: OpenRouter)</span>
<span class="na">llm_config</span><span class="pi">:</span>
  <span class="na">api_key</span><span class="pi">:</span> <span class="s">sk-or-v1-YOUR_KEY_HERE</span>   <span class="c1"># Your OpenRouter API key (or other provider)</span>
</code></pre></div></div>

<p><strong>Notes:</strong></p>

<p>The default LLM provider is OpenRouter. If you switch providers, also update <code class="language-plaintext highlighter-rouge">base_url</code> and <code class="language-plaintext highlighter-rouge">model</code> to match. For the included example documents, <code class="language-plaintext highlighter-rouge">start_loc</code> defaults to <code class="language-plaintext highlighter-rouge">./dataset</code>, so you can run the pipeline without changing anything else.</p>

<p><strong>Step 3: Save and start processing</strong></p>

<p>After updating the configuration, you can close the editor (changes are saved automatically) and start indexing. If you are using the example dataset, keep <code class="language-plaintext highlighter-rouge">./dataset</code> as-is; otherwise, point <code class="language-plaintext highlighter-rouge">start_loc</code> at the folder (or site) you want to ingest. When you click <strong>“Start Indexing”</strong>, NyRAG reads your input, chunks it into 1024-character segments, generates embeddings, feeds everything to Vespa Cloud, and shows progress in the terminal panel so you can see exactly what is happening.</p>
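<p>The 1024-character chunking step can be sketched in a few lines of Python. This is an illustrative sketch only, not NyRAG’s actual implementation, which may add chunk overlap or align chunks to sentence boundaries:</p>

```python
def chunk_text(text: str, size: int = 1024) -> list[str]:
    """Split text into fixed-size character chunks (illustrative sketch;
    a production chunker may overlap chunks or respect sentence boundaries)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = chunk_text("x" * 2500)
print([len(c) for c in chunks])  # → [1024, 1024, 452]
```

Each chunk is then embedded and fed to Vespa as its own retrievable unit, which is what lets retrieval surface a focused passage instead of a whole document.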

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/nyrag_10.png" alt="Processing progress with terminal logs" />
<strong>Description</strong>: Shows documents being processed with terminal logs displaying progress</p>

<hr />

<h2 id="chat-with-your-data">Chat with Your Data</h2>

<p>You can now start asking questions in the chat UI.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/nyrag_ui.png" alt="nyrag_ui" /></p>

<p>When you submit a query, NyRAG expands it into focused retrieval queries and sends them to Vespa. Vespa runs hybrid retrieval, combining BM25 keyword matching with vector similarity, and returns the most relevant chunks. Those chunks are packed into a compact context window and sent to the LLM, which generates an answer grounded entirely in your data.</p>
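<p>Conceptually, each hybrid retrieval query Vespa runs can be expressed in YQL along these lines. This is an illustrative sketch: the schema name <code>doc</code>, the field name <code>embedding</code>, and the query tensor name are assumptions, and the actual queries the blueprint issues may differ:</p>

```
select * from doc where
  userQuery() or
  ({targetHits: 100}nearestNeighbor(embedding, query_embedding))
```

<p>The <code>userQuery()</code> clause drives BM25 keyword matching, while <code>nearestNeighbor</code> retrieves the chunks closest to the query embedding; the rank profile then combines both signals into one score.</p>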

<p>A good way to sanity-check the setup is to start with a broad question like “What are the main topics in these documents?” and then follow up with something more specific to confirm the retrieved context makes sense.</p>

<p>At this point, you have a fully functional RAG application running on Vespa Cloud.</p>

<h3 id="improving-search-quality-with-query-profiles">Improving Search Quality with Query Profiles</h3>

<p>Want better search results? You can fine-tune how Vespa retrieves and ranks your documents using the Settings modal (⚙️ icon in the top right).</p>

<p><strong>Change query profiles:</strong> Open the ⚙️ <strong>Settings</strong> panel, choose a <strong>Query Profile</strong> from the dropdown, and click <strong>“Save”</strong>. The very next query you run will use the new profile.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/nyrag_settings_query_profiles.png" alt="Settings modal with query profile dropdown" /><br />
<strong>Description</strong>: Settings modal showing query profile selection dropdown with 4 available options</p>

<p><strong>What each profile does:</strong></p>
<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">hybrid</code></strong>: Fast hybrid search (BM25 + vector) with linear ranking</li>
  <li><strong><code class="language-plaintext highlighter-rouge">hybrid-with-gbdt</code></strong>: Same retrieval + advanced GBDT ranking (slower but best quality)</li>
  <li><strong><code class="language-plaintext highlighter-rouge">deepresearch</code></strong>: Exhaustive search with 10,000 retrieval targets (maximum recall)</li>
  <li><strong><code class="language-plaintext highlighter-rouge">deepresearch-with-gbdt</code></strong>: Exhaustive search + GBDT ranking (slowest, most thorough)</li>
</ul>

<p><strong>Pro tip</strong>: The quality difference between <code class="language-plaintext highlighter-rouge">hybrid</code> and <code class="language-plaintext highlighter-rouge">hybrid-with-gbdt</code> can be dramatic for complex queries. The GBDT model offers significantly better relevance at the cost of 2-3x higher latency. For research tasks where you need to find everything relevant, try the <code class="language-plaintext highlighter-rouge">deepresearch</code> variants, which cast a much wider net.</p>
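<p>Outside the UI, Vespa selects a query profile via the <code>queryProfile</code> request parameter. A hedged sketch using the Vespa CLI (the YQL, schema name, and query text are illustrative):</p>

```
vespa query 'yql=select * from doc where userQuery()' \
  'query=what are the main topics' \
  'queryProfile=hybrid-with-gbdt'
```

<p>This makes it easy to script side-by-side comparisons of the profiles on the same query set.</p>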

<hr />

<h3 id="manage-your-data">Manage Your Data</h3>

<p>NyRAG also gives you simple tools for cleanup. Open the advanced menu (three-dot icon ⋮ in the top right) and you will find two cleanup actions. <strong>Clear Local Cache</strong> removes cached files for all projects on your machine, which is useful when you want to re-process from scratch locally. <strong>Clear Vespa Data</strong> deletes the indexed documents in Vespa for the project, which is useful when you want a clean index before re-feeding. Both actions ask for confirmation so you do not delete data by accident.</p>

<hr />

<h2 id="bonus-try-web-crawling-mode">Bonus: Try Web Crawling Mode</h2>

<p>In addition to local documents, NyRAG supports web crawling. By switching to the web_example project, you can point NyRAG at a website and have it crawl, extract, and index content automatically.</p>

<p><strong>Switch to web crawling mode:</strong>  Select <code class="language-plaintext highlighter-rouge">web_example (web)</code> from the dropdown at the top and open the configuration editor. If you are currently on the chat screen, open the three-dot menu (⋮) and choose <strong>“Edit Config”</strong> to bring the editor back. From there, update the same credential fields as you did for <code class="language-plaintext highlighter-rouge">doc_example</code>, then click <strong>“Start Indexing”</strong> to crawl and feed the site.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/nyrag_indexing_web_2.png" alt="Web crawling in progress" /> 
<strong>Description</strong>: Shows web crawling in progress with terminal logs displaying discovered URLs and processed pages</p>

<p><strong>Web Mode Features:</strong> Web mode discovers and follows links automatically, while still respecting <code class="language-plaintext highlighter-rouge">robots.txt</code> and crawl delays so you do not hammer a site. It also does smart content extraction to drop navigation and boilerplate, deduplicates very similar pages, and supports resume so you can continue a crawl after interruption.</p>

<p><strong>Example Use Cases:</strong> Web mode is a good fit for product documentation, knowledge bases, blog archives, help-center content, and technical wikis. In general, it works best on sites with consistent HTML structure and clean, text-heavy pages.</p>

<p><strong>Tips:</strong> Start small. Crawl a limited part of a site first so you can sanity-check what gets extracted and indexed, then expand. Use <code class="language-plaintext highlighter-rouge">exclude</code> patterns to skip sections you do not want (for example <code class="language-plaintext highlighter-rouge">/pricing</code> or <code class="language-plaintext highlighter-rouge">/sales/*</code>), and keep an eye on the terminal output panel so you can spot loops, unexpected URLs, or pages that fail to parse.</p>

<hr />

<h2 id="troubleshooting">Troubleshooting</h2>

<p>Running into issues? We’ve got you covered! For detailed troubleshooting guides covering Vespa connection errors, LLM configuration, document processing, and more, see the <strong><a href="https://github.com/vespaai-playground/RAG-app-in-15min-ragblueprint#troubleshooting">Troubleshooting section</a></strong> in the main README.</p>

<p><strong>Quick help:</strong> If you get stuck, the fastest path is usually to ask in the <a href="http://slack.vespa.ai/">Vespa Slack</a> community, where people can help you interpret logs and query behavior. If you think you found a bug or want to request an improvement, open an issue in <a href="https://github.com/vespaai-playground/RAG-app-in-15min-ragblueprint/issues">GitHub Issues</a>. And when you want deeper background on schema, ranking, and deployment, the <a href="https://docs.vespa.ai/">Vespa Docs</a> are your go-to reference.</p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p><strong>Congratulations!</strong> You now have a working RAG app: a Vespa Cloud deployment that can retrieve high-quality context, and a small UI that lets you ingest data and chat with it.</p>

<p>Building a high-quality RAG system is never trivial. There are multiple moving parts to get right: the quality of the LLM, the size and management of its context window, and how effectively your retrieval system surfaces the most relevant information.</p>

<p>Thanks to the out-of-the-box Vespa RAG blueprint on Vespa Cloud, much of this complexity is handled for you. It comes with multiple ranking profiles, and its default hybrid retrieval setup combines <strong>vector similarity with BM25 text matching</strong>, ensuring your LLM sees the best possible context for every query.</p>

<p>Vespa Cloud doesn’t just make building RAG easier, it makes it <strong>scalable, fast, and reliable</strong>, giving you production-ready infrastructure, auto-scaling and observability without the headaches of self-hosting. Whether you’re experimenting with small datasets or scaling to millions of documents, Vespa Cloud provides the tools and flexibility to make your RAG project shine.</p>

<p>Want to dive deeper? Start with the <a href="https://docs.vespa.ai/en/learn/tutorials/rag-blueprint.html">RAG Blueprint Tutorial</a> for a thorough conceptual walkthrough. And remember the <a href="https://vespatalk.slack.com/">Vespa Slack community</a> is always there to help. Ask questions, share what you’ve built, or get advice on retrieval, ranking, and deployment strategies.</p>

<p>Ready to experience the power of Vespa Cloud for yourself? <a href="https://cloud.vespa.ai/">Sign up</a> today and <strong>start building high-quality RAG applications with ease</strong>!</p>

]]></content:encoded>
        <pubDate>Mon, 02 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        
      </item>
    
      <item>
        <title>Vespa Newsletter, February 2026</title>
        <description>Advances in Vespa&apos;s retrieval performance, flexibility, and developer productivity.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/logo/logo-pi.jpg" />
        
        <content:encoded><![CDATA[<p>Welcome to the latest edition of the Vespa newsletter. In the <a href="https://blog.vespa.ai/vespa-newsletter-december-2025/">previous update</a>, we introduced several new features and improvements, including Automated ANN Tuning, Accelerated Exact Vector Distance with Google Highway, Precise Chunk-Level Matching for Higher Retrieval Quality, Quantile Computation in Grouping for Instant Distribution Insights, and <a href="https://blog.vespa.ai/vespa-newsletter-december-2025/">more</a>.</p>

<p>This month, we’re announcing several updates focused on retrieval quality, ranking flexibility, and developer productivity. Each feature is designed to help engineering teams build faster, more accurate, and more maintainable retrieval and ranking systems, while giving businesses better relevance, lower operational overhead, and more predictable performance at scale.</p>

<p>Let’s dive into what’s new.</p>

<h2 id="product-updates">Product updates</h2>

<ul>
  <li>Announcing the Vespa.ai Playground</li>
  <li>The Vespa Kubernetes Operator</li>
  <li>Faster result rendering with CBOR</li>
  <li>Pyvespa 1.0 with improved HTTP performance</li>
  <li>Hybrid search relevance evaluation tool</li>
  <li>Configurable linguistics per field</li>
  <li><strong>“switch”</strong> operator in ranking expressions</li>
  <li>Vespa is now available on GCP Marketplace</li>
  <li>Feed data and run queries in the Vespa Console</li>
</ul>

<h3 id="announcing-the-vespaai-playground">Announcing the Vespa.ai Playground</h3>

<p>The Vespa Playground is a new GitHub space where we share projects, tools, and demos built on the Vespa platform. It’s a practical place to explore real examples for embeddings, model training, and feed connectors that you can clone, run, and build on your own.</p>

<p>These repos are ideal for experimentation, learning, and inspiration, though they aren’t officially supported product releases.</p>

<p><a href="https://github.com/vespaai-playground">Explore the Playground</a></p>

<h3 id="the-vespa-kubernetes-operator">The Vespa Kubernetes Operator</h3>

<p>The safest, most robust, and most cost-effective way to run Vespa is to deploy on Vespa Cloud, but for various reasons that’s not an option for everybody. For those who want to run Vespa securely at scale but can’t use Vespa Cloud, we have now released the Vespa Kubernetes Operator. It brings many Vespa Cloud features, such as security out of the box, dynamic provisioning, autoscaling, and automated upgrades, to your own Kubernetes environments.</p>

<p>Read more in the <a href="https://docs.vespa.ai/en/operations/kubernetes/vespa-on-kubernetes.html">Kubernetes Operator documentation</a>.</p>

<h3 id="faster-result-rendering-with-cbor">Faster result rendering with CBOR</h3>

<p>Query result sets can be large, and increasingly so when the client is an LLM retrieving many chunks for model context. <a href="https://blog.vespa.ai/introducing-layered-ranking-for-rag-applications/">Layered ranking</a> is designed to address this by extracting the most relevant content. Still, in some cases the total latency is dominated by the time it takes to send the query response. Compressing with gzip can help, but it is also CPU-intensive and slow. From Vespa 8.623.5, JSON response generation is over twice as fast as before.</p>

<p>Another new option in this release is to use the <a href="https://cbor.io/">CBOR</a> format for query results. CBOR is a binary format so it can be serialized faster and produces smaller payloads, especially when the result contains lots of numeric data. Read more in the <a href="https://docs.vespa.ai/en/reference/api/query.html#presentation.format">Query API reference</a> and query <a href="https://docs.vespa.ai/en/performance/practical-search-performance-guide.html#hits-and-summaries">performance guide</a>.</p>
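<p>To see why a binary encoding pays off for numeric-heavy results, consider a single 768-dimensional float vector. The following is a generic illustration using only Python’s standard library, not Vespa’s actual CBOR serializer:</p>

```python
import json
import struct

# One 768-dim embedding, as a query result might return per hit.
vec = [0.123456789] * 768

# JSON spells each float out as decimal text.
json_size = len(json.dumps(vec).encode("utf-8"))

# A binary encoding can pack each float into 4 bytes (float32).
binary_size = len(struct.pack(f"{len(vec)}f", *vec))

print(json_size, binary_size)  # the text encoding is several times larger
```

The gap widens with result-set size, which is why CBOR helps most when returning many hits with tensor fields.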

<h3 id="pyvespa-10-with-improved-http-performance">Pyvespa 1.0 with improved HTTP performance</h3>

<p>We have released the first major version of Pyvespa! This release switches the HTTP client used by Pyvespa from httpx to httpr, which gives big performance gains, especially for serializing and deserializing tensors, largely by taking advantage of the new CBOR serialization support in Vespa.</p>

<p>On preliminary benchmarks, we compared end-to-end latency for:</p>

<ol>
  <li>
    <p>Vespa 8.591.16 + Pyvespa v0.63.0 (using JSON)</p>
  </li>
  <li>
    <p>Vespa 8.634.24 + Pyvespa v1.0.0 (using CBOR)</p>
  </li>
</ol>

<p>The latter was ~4.9x faster when returning 400 hits with a 768-dim vector each. Performance gains will be smaller when not returning large result sets with tensors, but still significant. You may encounter different exceptions than before, but we strove not to change any user-facing APIs even though we bumped the major version.</p>

<p><a href="https://github.com/vespa-engine/pyvespa">Go to Pyvespa</a></p>

<h3 id="hybrid-search-relevance-evaluation-tool">Hybrid search relevance evaluation tool</h3>

<p>Hybrid search combines lexical and embedding based search to get the best from both. One of the tasks you need to solve is to pick an embedding model that provides a good quality vs. cost tradeoff for your use case. We have done a systematic evaluation of modern alternatives in <a href="https://blog.vespa.ai/embedding-tradeoffs-quantified/">this blog</a>.</p>

<p>The code used to run these experiments is now merged into Pyvespa. You can use the VespaMTEBApp to evaluate embedding model performance on any task/benchmark compatible with the <a href="https://embeddings-benchmark.github.io/mteb/overview/available_benchmarks/">mteb-library</a>. See example usage from the <a href="https://github.com/vespa-engine/pyvespa/blob/master/tests/integration/test_integration_mtebevaluation.py">tests</a>.</p>

<h3 id="configurable-linguistics-per-field">Configurable linguistics per field</h3>

<p>Vespa now lets you specify linguistics profiles on fields to select specific linguistics processing in your linguistics module. In Lucene Linguistics, linguistics profiles map to analyzer configurations, optionally in combination with a specific language.</p>

<p>For example, you can define a Lucene analyzer like this in services.xml:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;item key="profile=whitespaceLowercase;language=en"&gt;
  &lt;tokenizer&gt;
    &lt;name&gt;whitespace&lt;/name&gt;
  &lt;/tokenizer&gt;
  &lt;tokenFilters&gt;
    &lt;item&gt;
      &lt;name&gt;lowercase&lt;/name&gt;
    &lt;/item&gt;
  &lt;/tokenFilters&gt;
&lt;/item&gt;
</code></pre></div></div>
<p>And use it in the schema, under any field’s definition, like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>field title type string {
  indexing: summary | index
  linguistics {
    profile: whitespaceLowercase
  }
}
</code></pre></div></div>
<p>By default the linguistics profile is applied both when processing the field’s text and when processing the query text searching it, but you can also specify a different linguistics profile on the query side, which is useful for e.g. synonym query expansion.</p>

<p>We’ve added a sample application demonstrating how to use multiple Lucene linguistics <a href="https://github.com/vespa-engine/sample-apps/tree/master/examples/lucene-linguistics/multiple-profiles">profiles</a> across multiple fields and updated the Vespa <a href="https://docs.vespa.ai/en/linguistics/linguistics.html">linguistics documentation</a> with usage examples.</p>

<h3 id="new-switch-operator-in-ranking-expressions">New “switch” operator in ranking expressions</h3>

<p>We have added a “switch” function in ranking expressions as a clearer, more maintainable alternative to deeply nested if() clauses, making complex ranking easier to read, debug, and evolve.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>switch (attribute(category)) {
  case "restaurant": myRestaurantFunction(),
  case "hotel": myHotelFunction(),
  default: myDefaultFunction()
}
</code></pre></div></div>
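<p>For comparison, a sketch of the same logic written without <code>switch</code>, mirroring the example above, shows the nested <code>if()</code> clauses it replaces, which get harder to read as cases accumulate:</p>

```
if (attribute(category) == "restaurant", myRestaurantFunction(),
  if (attribute(category) == "hotel", myHotelFunction(),
    myDefaultFunction()))
```

<p>Each added case nests one level deeper, while <code>switch</code> keeps all cases flat at the same level.</p>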

<p><a href="https://docs.vespa.ai/en/ranking/ranking-expressions-features.html#the-switch-function">Learn more</a></p>

<h3 id="vespa-is-now-available-on-gcp-marketplace">Vespa is now available on GCP Marketplace</h3>

<p>Vespa Cloud is now listed on the GCP Marketplace, making it easier to deploy and manage Vespa using native Google Cloud billing and procurement. Vespa Cloud is already available on <a href="https://aws.amazon.com/marketplace/pp/prodview-5pkxkencasnoo?sr=0-1&amp;ref_=beagle&amp;applicationId=AWSMPContessa">AWS Marketplace</a>.</p>

<p><a href="https://console.cloud.google.com/marketplace/product/gcp-billing-marketplace/vespa-cloud">See details</a></p>

<h3 id="feed-data-and-run-queries-in-the-vespa-console">Feed data and run queries in the Vespa Console</h3>

<p>The onboarding experience is now even smoother for new Vespa Cloud users. When you follow the getting started guide and deploy a sample app from the browser, you can immediately feed data and run queries directly in the browser. This makes it easy to try your own data and see how it behaves in Vespa.</p>

<p>We also provide examples showing how to do the same using pyvespa, the Vespa CLI, or curl.</p>

<p><img src="/assets/2026-02-18-vespa-newsletter-february-2026/new-onboarding-console.png" alt="New onboarding experience" /></p>

<p><a href="https://login.console.vespa-cloud.com/u/signup/identifier?state=hKFo2SBsN1NBOERhNnRCbDhpajdqTnhYSTlzUlltUjNoUG5mZaFur3VuaXZlcnNhbC1sb2dpbqN0aWTZIERwRkg4NkVwRHg2aFk1Rjg0ZHZrYmdBZ0pFc1lTb29Io2NpZNkgVk92OGViclhwcEdBTnVpWWZHOWhKWk94MVM5T0dhTTQ">Try it Free</a></p>

<h2 id="new-content-and-learning-resources">New content and learning resources</h2>

<p>We published several new articles and resources since our last newsletter to help teams get more out of Vespa and stay ahead of new developments in search, RAG, and large-scale AI.</p>

<p><strong>Examples and notebooks:</strong></p>

<ul>
  <li><a href="http://playground.vespa.ai">playground.vespa.ai</a></li>
</ul>

<p><strong>Videos, webinars, and podcasts</strong></p>

<ul>
  <li><a href="https://em360tech.com/podcasts/how-scale-ai-digital-commerce-effectively?utm_content=520974566&amp;utm_medium=social&amp;utm_source=linkedin&amp;hss_channel=lcp-100705136">How To Scale AI in Digital Commerce Effectively</a></li>
  <li><a href="https://vespa.ai/resource/vespa-now-year-in-review/">2025 Year in Review</a></li>
</ul>

<p><strong>Blogs and ebooks</strong></p>

<ul>
  <li><a href="https://blog.vespa.ai/agentic-ai-powered-sales-for-developers-with-vespa/">Clarm: Agentic AI-powered Sales for Developers with Vespa Cloud</a></li>
  <li><a href="https://blog.vespa.ai/embedding-tradeoffs-quantified/">Embedding Tradeoffs, Quantified</a></li>
  <li><a href="https://blog.vespa.ai/enterpise-ai-search-vs-the-real-needs-of-customer-facing-apps/">Enterprise AI Search vs. the Real Needs of Customer-Facing Apps</a></li>
  <li><a href="https://blog.vespa.ai/eliminating-the-precision-latency-trade-off-in-large-scale-rag/">Eliminating the Precision–Latency Trade-Off in Large-Scale RAG</a></li>
  <li><a href="https://blog.vespa.ai/how-tensors-are-changing-search-in-life-sciences/">How Tensors Are Changing Search in Life Sciences</a></li>
  <li><a href="https://blog.vespa.ai/the-search-api-reset-incumbents-retreat-innovators-step-up/">The Search API Reset: Incumbents Retreat, Innovators Step Up</a></li>
  <li><a href="https://blog.vespa.ai/why-ai-search-platforms-are-gaining-attention/">Why AI Search Platforms Are Gaining Attention</a></li>
  <li><a href="https://blog.vespa.ai/why-life-sciences-ai-is-a-search-problem-5-of-5/">Why Life Sciences AI Is a Search Problem (Part 5 of 5)</a></li>
  <li><a href="https://blog.vespa.ai/why-life-sciences-ai-is-a-search-problem-4-of-5/">Why Life Sciences AI Is a Search Problem (Part 4 of 5)</a></li>
</ul>

<h3 id="upcoming-events">Upcoming Events</h3>

<p><img src="/assets/2026-02-18-vespa-newsletter-february-2026/maven.jpeg" alt="Personalized Relevance with VLMs and Sparse Vectors: February 17, 11:30am ET" />
<strong>Lightning Lesson: Personalized Relevance with VLMs and Sparse Vectors: February 17, 11:30am ET</strong></p>
<ul>
  <li>Intro to sparse vectors and tensors for efficient data handling</li>
  <li>Using Vision-Language Models (VLMs) to extract high quality and nuanced features from images</li>
  <li>Leveraging these features in sparse representations for hyper-personalized search &amp; recommendations</li>
</ul>

<p><a href="https://maven.com/p/b5ee84/personalized-relevance-with-vl-ms-and-sparse-vectors">Register Now</a></p>

<hr />

<p><img src="/assets/2026-02-18-vespa-newsletter-february-2026/eCommerce-Webinar-Series.png" alt="e-commerce-webinar-series" />
<strong>February 18: The Zero Results Problem in eCommerce</strong></p>
<ul>
  <li>🔗 10am CET (EMEA): <a href="https://www.airmeet.com/e/f4f6c070-c094-11f0-9be4-375c53bcf15c?utm_source=Newsletter&amp;utm_campaign=Zero%20results%20EMEA">Save your spot</a></li>
  <li>🔗 1pm ET (Americas): <a href="https://www.airmeet.com/e/305ace80-c3c0-11f0-9be4-375c53bcf15c?utm_source=Newsletter&amp;utm_campaign=Zero%20results%20(AMER)">Save your spot</a></li>
</ul>

<p><strong>March 11: The Relevance Problem in eCommerce</strong></p>
<ul>
  <li>🔗 10am CET (EMEA): <a href="https://www.airmeet.com/e/70338df0-c5fd-11f0-831c-01bcfd385865?utm_source=Newsletter&amp;utm_campaign=Relevance%20Problem%20EMEA">Save your spot</a></li>
  <li>🔗 1pm ET (Americas): <a href="https://www.airmeet.com/e/5bf695d0-c5fd-11f0-bb1f-e79dc2111266?utm_source=Newsletter&amp;utm_campaign=Relevance%20Problem%20AMER">Save your spot</a></li>
</ul>

<hr />

<p><img src="/assets/2026-02-18-vespa-newsletter-february-2026/Vespa-Now-Q1-Product-Update.png" alt="product-update" />
<strong>March 10: Vespa Q1 Product Update</strong></p>
<ul>
  <li>🔗 10am CET (EMEA): <a href="https://www.airmeet.com/e/79245020-f186-11f0-ace7-c7ef52349391?utm_source=Newsletter&amp;utm_campaign=Q1%20Product%20Update">Save your spot</a></li>
  <li>🔗 1pm ET (Americas): <a href="https://www.airmeet.com/e/3d23e680-f186-11f0-b12c-b1c5402490b0?utm_source=Newsletter&amp;utm_campaign=Q1%20Product%20update">Save your spot</a></li>
</ul>

<hr />
<p>👉 <a href="https://www.linkedin.com/company/vespa-ai/">Follow us on LinkedIn</a> to stay in the loop on upcoming events, blog posts, and announcements.</p>

<hr />

<p>Thanks for joining us in exploring the frontiers of AI with Vespa. Ready to take your projects to the next level? <a href="https://vespa.ai/free-trial/">Deploy your application for free</a> on Vespa Cloud today.</p>

]]></content:encoded>
        <pubDate>Mon, 16 Feb 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/vespa-newsletter-february-2026/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/vespa-newsletter-february-2026/</guid>
        
        
        <category>newsletter</category>
        
      </item>
    
      <item>
        <title>Nexla + Vespa, The Power Duo for AI-Ready Data Pipelines</title>
        <description>Nexla solves data readiness. Vespa solves intelligence and precision at scale. Together, they give teams a clean, practical path from raw enterprise data to real-time AI applications.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/images/New Partnership Nexla.png" />
        
        <content:encoded><![CDATA[<h3 id="partner-spotlight-nexla">Partner Spotlight: Nexla</h3>

<p>AI is transforming quickly. What started with Q&amp;A chatbots has already evolved into deep research applications and, now, autonomous AI agents. Vespa is proud to be at the center of this shift, enabling some of the most proficient adopters of AI, such as Perplexity. To help organizations maximize the benefits of Vespa, we’re building a robust partner ecosystem. These partners help bring Vespa’s AI-native capabilities into real-world deployments across industries.</p>

<p><strong>Meet the innovators shaping the future of AI. Today’s spotlight: Nexla</strong></p>

<h2 id="nexla--vespaai-the-power-duo-for-ai-ready-data-pipelines">Nexla + Vespa.ai: The Power Duo for AI-Ready Data Pipelines</h2>

<p>When AI systems fall short, it’s rarely the model’s fault. It’s the messy reality of data spread across systems and never quite staying in sync. That’s why Nexla and Vespa partnered.</p>

<p><a href="https://nexla.com/">Nexla</a> makes data usable.</p>

<p><a href="http://vespa.ai">Vespa</a> makes data intelligent at scale.</p>

<p>Together, they turn messy, distributed enterprise data into real-time AI search, recommendation, and RAG systems, without months of custom code gluing things together.</p>

<h2 id="nexla-making-enterprise-data-usable">Nexla: Making Enterprise Data Usable</h2>

<p>Nexla is an enterprise-grade, AI-powered data integration <a href="https://nexla.com/nexla-platform-overview">platform</a> that turns raw data from any source into production-ready data products. It provides a declarative, no-code way to move, transform, and validate data across ETL/ELT, reverse ETL, streaming, APIs, and RAG pipelines.</p>

<p>Think of Nexla as the layer that answers: “How do we reliably get the right data, in the right shape, to the systems that need it?”</p>

<p>Core capabilities:</p>

<ul>
  <li>
    <p><strong>500+ Bidirectional <a href="https://nexla.com/connectors/">Connectors</a>:</strong> Pull data from databases, APIs, cloud storage, SaaS apps, and data warehouses, including systems like Salesforce, Snowflake, and Amazon S3.</p>
  </li>
  <li>
<p><strong>Metadata Intelligence:</strong> Nexla automatically scans sources and generates <a href="https://nexla.com/nexsets">Nexsets</a>: virtual, ready-to-use data products with schemas, samples, and validation rules.
Example: If a price field suddenly switches from numeric to string, Nexla detects it before bad data reaches production search.</p>
  </li>
  <li>
    <p><strong><a href="https://nexla.com/blog/introducing-express-conversational-data-platform/">Express</a> (conversational pipelines):</strong> A conversational AI interface where you can simply describe what you need.
Example: You can say, “Pull customer data from Salesforce and merge with Google Analytics,” and it builds the pipeline for you.</p>
  </li>
  <li>
    <p><strong>Universal <a href="https://nexla.com/data-integration/">integration</a> styles:</strong> Supports ELT, ETL, CDC, R-ETL, streaming, API integration, and FTP in a single platform.</p>
  </li>
</ul>

<p>Nexla processes over <strong>1 trillion records monthly</strong> for companies like DoorDash, LinkedIn, Carrier, and LiveRamp.</p>

<h2 id="vespa-where-retrieval-becomes-reasoning">Vespa: Where Retrieval Becomes Reasoning</h2>

<p>Vespa is a production-grade AI search platform that combines distributed text search, vector search, structured filtering, and machine-learned ranking in a single system.</p>

<p>Think of Vespa as the engine that answers: “Given all this data, how do we retrieve, rank, and reason over it in real time?”</p>

<p>It powers demanding applications like Perplexity and supports search, recommendations, personalization, and RAG at massive scale.</p>

<p>Core capabilities:</p>

<ul>
  <li>
    <p><strong>Unified AI Search and Retrieval:</strong> Vespa natively combines vector and <a href="https://vespa.ai/tensor-formalism/">tensor search</a> for semantic retrieval, full-text search for precise keyword matching, and structured filtering on attributes like categories, prices, and dates to enable richer, contextual search without stitching multiple systems together.</p>
  </li>
  <li>
    <p><strong>Real-time Retrieval and Inference at Scale:</strong> Rather than separating indexing, ranking, and inference across multiple systems, Vespa performs real-time machine-learned ranking and model inference where the data lives. This means you can serve fresh, personalized results with predictable sub-100 ms latency even for large datasets.</p>
  </li>
  <li>
    <p><strong>Multi-Phase Ranking and Custom Logic:</strong> Vespa lets you embed custom ranking logic, including ML models like XGBoost, directly into your search pipeline using ONNX. You can combine relevance signals, business rules, and semantic vectors in multi-stage ranking to fine-tune which results surface first.</p>
  </li>
  <li>
    <p><strong>Massive Scalability with High Throughput:</strong> Designed for real-world, high-traffic applications, Vespa can scale horizontally across clusters, handling billions of documents with sub-100ms query latency and up to 100k writes per second per node.</p>
  </li>
  <li>
    <p><strong>Multi-Vector and Multi-Modal Retrieval:</strong> Vespa natively handles multiple vectors per document, with support for token-level embeddings, ColPali-based visual document retrieval, and <a href="https://vespa.ai/tensor-formalism/">tensor-based computations</a> for precise, cross-modal relevance and ranking.</p>
  </li>
</ul>

<p>GigaOm recognized Vespa as a <strong><a href="https://content.vespa.ai/gigaom-report-v3-2025?_gl=1*1ep8wq0*_gcl_aw*R0NMLjE3NjQ4Nzg2NjIuQ2owS0NRaUFfOFRKQmhETkFSSXNBUFg1cXhRbHdEbHgtMndtQjdqRS1aYzhVWHRBSW4zTzZ2eEVrelNYTTdLUkNXSkZCTGpISml4MzNSZ2FBbkRxRUFMd193Y0I.*_gcl_au*MjkzNDEwODQ3LjE3NjUyODY2NTk.">leader</a> in vector databases</strong> for two consecutive years, noting its performance advantages over alternatives like Elasticsearch, up to <strong><a href="https://content.vespa.ai/vespa-vs-elasticsearch-performance-comparison">12.9X higher throughput</a> per CPU core for vector searches</strong>.</p>

<h2 id="how-nexla-and-vespa-work-together">How Nexla and Vespa Work Together</h2>

<p>The Nexla-Vespa partnership removes one of the hardest parts of AI systems: getting clean, well-modeled data into a high-performance retrieval engine, continuously.</p>

<p>Nexla recently launched a Vespa connector that makes data integration with Vespa seamless. The integration includes:</p>

<p><strong><a href="https://docs.nexla.com/user-guides/connectors/vespa_api/overview">Vespa Connector</a> in Nexla:</strong>
Handles all data piping from sources like Amazon S3, PostgreSQL, Pinecone, Snowflake, and others directly into Vespa:
<img src="/assets/images/nexla1.png" alt="" /></p>

<p><strong>Vespa Nexla Plugin CLI:</strong> Automatically generates draft Vespa application packages (including schema files) directly from a Nexset, eliminating manual configuration:
<img src="/assets/images/nexla2.png" alt="" /></p>

<p>This means you can move data from S3 to Vespa, migrate from Pinecone to Vespa, or sync <a href="https://nexla.com/demo-center/move-data-from-postgresql-to-vespa-ai-effortlessly/">PostgreSQL to Vespa</a>, all without writing a single line of code.</p>

<h2 id="when-nexla-clients-should-use-vespa">When Nexla Clients Should Use Vespa</h2>

<p>You’re a Nexla client. Use Vespa when you need:</p>

<p><strong>Advanced AI search and RAG applications:</strong>
If you’re building intelligent search, recommendation systems, or RAG applications that require hybrid search (combining semantic vector search with keyword matching and metadata filtering), Vespa is purpose-built for this. Nexla gets your data into Vespa, while Vespa delivers production-grade AI search with machine-learned ranking.</p>

<p><strong>Real-time, high-scale query performance:</strong>
When you need to serve thousands of queries per second across billions of documents with sub-100ms latency, Vespa’s distributed architecture scales horizontally without compromising quality. Nexla ensures your data flows continuously into Vespa with incremental updates and CDC support.</p>

<p><strong>Complex ranking and inference:</strong>
If your use case requires multi-phase ranking, custom ML models, or LLM integration at query time, Vespa executes these operations locally where data lives, avoiding costly data movement. Nexla prepares and transforms your data into the exact schema Vespa needs.</p>

<p><strong>Cost efficiency at scale:</strong>
Vespa delivers 5X infrastructure cost savings compared to alternatives like Elasticsearch while handling vector, lexical, and hybrid queries. Nexla minimizes integration costs by automating pipeline creation and schema management.</p>

<h2 id="when-vespa-clients-should-use-nexla">When Vespa Clients Should Use Nexla</h2>

<p>You’re a Vespa client. Use Nexla when you need:</p>

<p><strong>Multi-source data consolidation:</strong>
Vespa is your search and inference engine, but data lives everywhere: S3 buckets, PostgreSQL databases, Snowflake warehouses, Salesforce CRMs, APIs, and files. Nexla connects to 500+ sources with bidirectional connectors and consolidates data into Vespa without custom ETL scripts.</p>

<p><strong>Automated schema generation and management:</strong>
Instead of manually writing Vespa schema files and managing schema evolution, Nexla’s Plugin CLI auto-generates schemas from your Nexsets. As source schemas change, Nexla’s metadata intelligence detects changes and propagates them downstream automatically.</p>

<p><strong>Data transformation and enrichment:</strong>
Before data hits Vespa, it often needs cleaning, filtering, enrichment, or format conversion. Nexla provides a no-code transformation library and supports custom SQL, Python, or JavaScript, all without maintaining separate ETL infrastructure.</p>

<p><strong>Vector database migration:</strong>
Moving from Pinecone, Weaviate, or another vector database to Vespa? Nexla handles the migration with zero code, extracting records, transforming data to match Vespa’s schema, and syncing documents continuously.</p>

<p><strong>Data quality and monitoring:</strong>
Nexla continuously monitors data flows with built-in validation rules, error handling, and automated alerts. When data quality issues arise, Nexla quarantines bad records and provides audit trails, ensuring Vespa always receives clean, trustworthy data.</p>

<p><strong>Real-time and streaming pipelines:</strong>
Vespa supports real-time updates, but getting real-time data from streaming sources (Kafka, APIs, databases with CDC) requires integration logic. Nexla handles streaming, batch, and hybrid integration styles, optimizing throughput and latency for each source type.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Nexla solves <strong>data readiness</strong>.</p>

<p>Vespa solves <strong>intelligence and precision at scale</strong>.</p>

<p>Together, they give teams a clean, practical path from raw enterprise data to real-time AI applications. <a href="http://vespa.ai">Vespa</a> gives you production-grade vector search, hybrid retrieval, and RAG capabilities at any scale. <a href="http://nexla.com">Nexla</a> eliminates months of pipeline development and makes multi-source data flows conversational.</p>

<p><strong>Ready to explore?</strong></p>

<p>Start at <a href="http://express.dev">express.dev</a> for conversational pipeline building, or explore the <a href="https://docs.nexla.com/user-guides/connectors/vespa_api/overview">Vespa connector</a> in Nexla’s platform to see how quickly your data can power real AI applications.</p>
]]></content:encoded>
        <pubDate>Mon, 16 Feb 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/vespa-nexla-partnership/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/vespa-nexla-partnership/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        
      </item>
    
      <item>
        <title>Clarm: Agentic AI-powered Sales for Developers with Vespa Cloud</title>
        <description>Agentic AI-powered Sales for Developers, built on Vespa</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-01-16-agentic-ai-powered-sales-for-developers-with-vespa/clarmcase.jpg" />
        
        <content:encoded><![CDATA[<!--
|--------------------------|--------------|
| **Industry:**            | Technology   |
| **Founded:**             | 2024         |
| **Backing:**             | Y Combinator |

Vespa Cloud → Vespa Enclave (AWS) 
-->

<h2 id="overview">Overview</h2>
<p>Clarm helps open source software companies <a href="https://www.clarm.com/blog/articles/convert-github-stars-to-revenue?utm_source=vespa&amp;utm_campaign=clarm_case_study">convert GitHub stars into revenue</a> through AI-powered lead generation, content production, and developer support automation. When building their platform, Clarm needed a search engine that could power accurate, zero-hallucination AI responses while handling complex enrichment across millions of GitHub data points. They chose <a href="http://vespa.ai">Vespa</a> for its unified text, vector, and structured search capabilities and were able to deploy to production in under a day.</p>

<h2 id="the-problem-software--oss-companies-struggle-to-monetize">The Problem: Software / OSS Companies Struggle to Monetize</h2>
<p>“Most OSS founders can’t get attention for their software initially. They’re <a href="https://www.clarm.com/blog/articles/developer-growth-engine-automating-sales-marketing?utm_source=vespa&amp;utm_campaign=clarm_case_study">so focused on building the product that marketing, SEO, and content creation get dropped</a>. We built Clarm to automate all the growth work founders drop so they can focus on git commits,” explains Marcus Storm-Mollard, founder and CEO of Clarm.</p>

<p>The challenge is fundamental: 99% of successful open source is funded by businesses paying for solutions, but early-stage OSS companies lack the infrastructure to identify, engage, and convert those potential paying customers. They have thousands of GitHub stars but no clear path to revenue.</p>

<p>Clarm addresses this through three product pillars:</p>
<ol>
  <li>
    <p><strong>Lead Generation &amp; Prospecting:</strong> The killer feature. Clarm takes repo data from customers and competitors, enriches it with signals from website visits, commits, issues, and community engagement, then ranks and identifies good-fit prospects and potential enterprise buyers.</p>
  </li>
  <li>
    <p><strong>Marketing &amp; Content Production:</strong> Automated content creation from commits, PRs, and codebase analysis, helping OSS companies maintain consistent technical marketing.</p>
  </li>
  <li>
    <p><strong>Developer Support Automation:</strong> AI-powered support across Discord, Slack, GitHub Issues, and websites, with deep integrations and analytics for scaling customer success.</p>
  </li>
</ol>

<h2 id="the-search-challenge">The Search Challenge</h2>
<p>At the core of all three pillars sits a critical technical requirement: accurate, explainable search and retrieval.</p>

<blockquote>
  <p>“We realized early that search, not generation, was the real problem to solve. Generating LLM answers isn’t hard. Finding the right information to base them on is everything,” Marcus notes.</p>
</blockquote>

<p>Clarm needed a search engine that could:</p>
<ul>
  <li>Handle hybrid retrieval (combining text search, vector embeddings, and structured filters)</li>
  <li>Power zero-hallucination AI responses grounded in verifiable context</li>
  <li>Process and rank millions of GitHub data points in real-time</li>
  <li>Support complex multi-signal enrichment for lead scoring</li>
  <li>Scale cost-effectively on a startup budget</li>
</ul>

<p><a href="https://blog.vespa.ai/why-search-platform-is-better-than-vector-database/">Traditional vector databases</a> like Supabase or search engines like <a href="https://blog.vespa.ai/modernizing-elasticsearch-with-vespa/">Elasticsearch</a> couldn’t deliver the unified, production-grade retrieval required for Clarm’s zero-hallucination architecture.</p>

<h2 id="the-solution-vespas-production-grade-hybrid-search">The Solution: Vespa’s Production-Grade Hybrid Search</h2>

<p>Marcus discovered Vespa after researching how companies like <a href="https://blog.vespa.ai/perplexity-builds-ai-search-at-scale-on-vespa-ai/">Perplexity</a> and <a href="https://blog.vespa.ai/using-vespa-cloud-resource-suggestions-to-optimize-costs/">Onyx</a> built their advanced retrieval systems.</p>

<blockquote>
  <p>“We really liked that Vespa started as a search engine and evolved into a vector-based system.
It made so much sense for what we were building.
Vespa’s ranking and tensoring are built in, so we know our results are accurate and relevant right out of the box,” Marcus explains.</p>
</blockquote>

<h4 id="rapid-deployment-less-than-one-day-to-production">Rapid Deployment: Less Than One Day to Production</h4>
<p>Clarm began experimenting with Vespa’s Docker image for local development, then transitioned to Vespa Cloud for production deployment during their Y Combinator batch.</p>

<blockquote>
  <p>“It took about half a day to set up how we wanted it. That speed of onboarding made a huge impact during YC. We just deployed it, and it worked,” Marcus recalls.</p>
</blockquote>

<p>The quick deployment was critical. Clarm was racing toward Demo Day and couldn’t afford weeks of infrastructure setup. Vespa’s unified approach eliminated the complexity of stitching together multiple systems for text, vector, and structured search.</p>

<h4 id="key-vespa-capabilities-powering-clarm">Key Vespa Capabilities Powering Clarm</h4>

<ul>
  <li>Unified Retrieval Pipeline
    <ul>
      <li>Single query endpoint combining text search, vector similarity, and structured filters - no need to orchestrate multiple databases or services.</li>
    </ul>
  </li>
  <li>Built-in <a href="https://docs.vespa.ai/en/ranking/phased-ranking.html#">Ranking</a> &amp; <a href="https://docs.vespa.ai/en/ranking/tensor-user-guide.html#">Tensor Operations</a>
    <ul>
      <li>Native support for complex ranking models and tensor operations means Clarm can implement sophisticated lead scoring without custom ranking layers.</li>
    </ul>
  </li>
  <li><a href="https://143590857.fs1.hubspotusercontent-eu1.net/hubfs/143590857/PDF-reports/Scaling-Smarter_-Vespas-Approach-to-High-Performance-Data-Management-3.pdf?hsCtaAttrib=232558642374">Real-Time</a> Indexing
    <ul>
      <li>GitHub events, user interactions, and enrichment signals are instantly searchable, enabling live lead intelligence and up-to-date AI responses.</li>
    </ul>
  </li>
  <li>Scalable Cloud Deployment
    <ul>
      <li><a href="https://vespa.ai/vespa-content/uploads/2025/07/Autoscaling-with-Vespa.pdf">Automatic scaling</a> and high availability handled by Vespa Cloud, allowing Clarm’s two-person engineering team to focus on product features instead of infrastructure operations.</li>
    </ul>
  </li>
  <li>Developer-Friendly <a href="https://docs.vespa.ai/en/learn/overview.html">Architecture</a>
    <ul>
      <li>Docker-based local development, straightforward schema design, and comprehensive documentation enabled rapid prototyping and iteration.</li>
    </ul>
  </li>
</ul>

<h2 id="the-results">The Results</h2>
<p>Clarm’s decision to build on Vespa Cloud delivered immediate impact:</p>
<ul>
  <li><strong>&lt;1 Day to Production:</strong> From prototype to live search infrastructure deployed during YC</li>
  <li><strong>Zero-Hallucination Architecture:</strong> Accurate retrieval enabling trustworthy AI responses grounded in verifiable data</li>
  <li><strong>High-Quality Lead Intelligence:</strong> Sophisticated ranking of GitHub data points across 50K+ collective stars from customers like <a href="https://better-auth.com/?utm_source=vespa&amp;utm_campaign=clarm_case_study">Better Auth</a> (23.3K stars) and <a href="https://cua.ai/?utm_source=vespa&amp;utm_campaign=clarm_case_study">Cua</a> (11.3K stars)</li>
  <li><strong>Exceptional Support:</strong> Direct collaboration with Vespa’s engineering team throughout development</li>
</ul>

<blockquote>
  <p>“The setup was easy, the support from the Vespa team was incredible, and everything just worked. We didn’t need to look anywhere else,” Marcus emphasizes.</p>
</blockquote>

<h4 id="customer-success-github-stars-becoming-revenue">Customer Success: <a href="https://www.clarm.com/blog/articles/convert-github-stars-to-revenue?utm_source=vespa&amp;utm_campaign=clarm_case_study">GitHub Stars Becoming Revenue</a></h4>
<p>Clarm’s customers are seeing measurable results from the AI-powered lead generation platform:</p>
<ul>
  <li><strong>Better Auth:</strong> Grew from 8K to 23.3K GitHub stars in 3 months with Clarm’s lead gen and engagement automation</li>
  <li><strong>c/ua:</strong> Scaled from 5K to 11.3K stars while identifying and converting enterprise prospects</li>
  <li><strong><a href="https://www.skyvern.com/?utm_source=vespa&amp;utm_campaign=clarm_case_study">Skyvern AI:</a></strong> after hitting 19k stars, reduced support workload by 94% with Clarm across GitHub, Discord, and Slack</li>
  <li><strong>Engagement Depth:</strong> Developers “pair programming” with Clarm’s AI agents in extended sessions, sending thousands of queries a day, with sessions lasting up to 22 hours</li>
</ul>

<h4 id="whats-next-building-the-future-of-oss-monetization">What’s Next: Building the Future of OSS Monetization</h4>
<p>Clarm represents a <a href="https://www.clarm.com/blog/articles/best-developer-growth-automation-tools-for-software-products-in-2025?utm_source=vespa&amp;utm_campaign=clarm_case_study">new category of growth infrastructure</a> built specifically for software and open source companies. By combining Vespa’s production-grade retrieval with their own zero-hallucination agent framework, Clarm is proving that AI-powered sales and marketing can be trustworthy, explainable, and grounded in truth.</p>

<blockquote>
  <p>“We’re focused on proving product value and retaining customers right now. Everything depends on us growing our customers’ MRR and showing software and OSS companies they can build sustainable businesses,” Marcus shares.</p>
</blockquote>

<p>That focus is reflected in Clarm’s positioning: “You build awesome software. Now build a business.” It resonates with software founders who want to monetize without compromising their community values. By recognizing that a vast majority of successful open source is ultimately funded by businesses paying for solutions, Clarm offers a clear path forward: free software for the community, paid solutions for enterprises.</p>

<h2 id="conclusion">Conclusion</h2>
<p>Clarm’s architecture reinforces a lesson many teams learn the hard way: LLMs are only as reliable as the retrieval systems behind them. By treating retrieval as a first-class system, built on Vespa Cloud, Clarm unified text search, vector similarity, structured filtering, and ranking into a single production-grade platform, eliminating the fragility and guesswork common in vector-only stacks.</p>

<p>The result is an agentic AI platform that can reason over live data, explain its outputs, and scale predictably without stitching together multiple databases or post-hoc ranking layers. This foundation enabled a small team to move from prototype to production in days, operate across millions of GitHub signals, and help open source companies turn community adoption into sustainable revenue.</p>

<p>More importantly, Clarm’s success offers a blueprint for any organization building serious AI applications: when retrieval is reliable, ranking is expressive, and data is always fresh, AI systems become trustworthy enough to power real business outcomes. Clarm is building the future of OSS monetization, and Vespa is the retrieval engine making it possible.</p>

]]></content:encoded>
        <pubDate>Mon, 19 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/agentic-ai-powered-sales-for-developers-with-vespa/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/agentic-ai-powered-sales-for-developers-with-vespa/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        
      </item>
    
      <item>
        <title>Embedding Tradeoffs, Quantified</title>
        <description>The embedding strategy you choose has a major impact on cost, quality, and latency. We ran a bunch of experiments to help you make better and more informed tradeoffs.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-01-14-embedding-tradeoffs-quantified/control-dashboard.png" />
        
        <content:encoded><![CDATA[<p>Most Vespa users run hybrid search - combining BM25 (and/or other lexical features) with semantic vectors. But which embedding model should you use? And how do you balance cost, quality, and latency as you scale?</p>

<p>The typical approach: open the <a href="https://huggingface.co/spaces/mteb/leaderboard">MTEB leaderboard</a>, find the “Retrieval” column, sort descending, pick something that fits your size budget. Done, right?</p>

<p>Not quite. MTEB doesn’t tell you:</p>

<ul>
  <li>How fast is inference on your actual hardware?</li>
  <li>What happens when you quantize the model weights?</li>
  <li>How much quality do you lose with binary vectors?</li>
  <li>Does this model even work well in a hybrid setup?</li>
</ul>

<p>So we ran the experiments ourselves. We picked models from the MTEB Retrieval leaderboard with these criteria:</p>

<ul>
  <li>Under 500M parameters (practical for most deployments)</li>
  <li>Open license</li>
  <li>ONNX weights available (required for Vespa)</li>
  <li>At least 10k downloads in the last month (actually used in production)</li>
</ul>

<p>For each model, we benchmarked across:</p>

<ul>
  <li><strong>Model quantizations</strong> (FP32, FP16, INT8)</li>
  <li><strong>Vector precisions</strong> (float, bfloat16, binary)</li>
  <li><strong>Matryoshka dimensions</strong> (for models that support it)</li>
  <li><strong>Real hardware</strong> (Graviton3, Graviton4, T4 GPU)</li>
  <li><strong>Hybrid retrieval</strong> (semantic, RRF, and score normalization methods)</li>
</ul>

<p><strong>Spoiler:</strong> We found some <em>really</em> attractive tradeoffs - 32x memory reduction, 4x faster inference, with nearly identical quality.</p>

<h2 id="what-mteb-doesnt-show-you">What MTEB doesn’t show you</h2>

<h3 id="model-quantization">Model quantization</h3>

<p>Vespa uses <a href="https://onnxruntime.ai/">ONNX runtime</a> for <a href="https://docs.vespa.ai/en/embedding.html">embedding inference</a>. Most models on HuggingFace ship with multiple ONNX variants - here’s <a href="https://huggingface.co/Alibaba-NLP/gte-modernbert-base/tree/main/onnx">Alibaba-NLP/gte-modernbert-base</a> as an example:</p>

<p><img src="/assets/2026-01-14-embedding-tradeoffs-quantified/model-quantizations.png" alt="model quantizations" /></p>

<p>Lower precision weights = smaller model = faster inference. But how much faster, and what’s the quality hit?</p>

<ul>
  <li><strong>On CPU:</strong> INT8 models run 2.7-3.4x faster while keeping 94-98% of the quality</li>
  <li><strong>On GPU:</strong> INT8 is actually 4-5x <em>slower</em> than FP32. Don’t do this.</li>
</ul>

<p>The difference between 30ms and 100ms query latency is huge. If you’re on CPU, INT8 is often a no-brainer.</p>

<p>On GPU, use FP16 instead - you get <a href="https://sbert.net/docs/sentence_transformer/usage/efficiency.html">~2x speedup with no meaningful quality loss</a>.</p>

<p><strong>GPU vs CPU:</strong> The T4 GPU runs 4-7x faster than Graviton3 for embedding inference. If you’re processing high query volumes or doing bulk indexing, GPU may be worth it.</p>

<h3 id="vector-precision">Vector precision</h3>

<p>Model quantization affects <em>inference</em> speed. Vector precision affects <em>storage</em> and <em>search</em> speed. Different knobs, both important.</p>

<p>Here’s the math for 100 million 768-dimensional embeddings:</p>

<style>
  table, th, td {
    border: 1px solid black;
  }
  th, td {
    padding: 5px;
  }
</style>

<table>
  <thead>
    <tr>
      <th>Precision</th>
      <th style="text-align: center">Bytes/Dim</th>
      <th style="text-align: center">100M vectors</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>FP32</td>
      <td style="text-align: center">4</td>
      <td style="text-align: center">307 GB</td>
    </tr>
    <tr>
      <td>FP16</td>
      <td style="text-align: center">2</td>
      <td style="text-align: center">154 GB</td>
    </tr>
    <tr>
      <td>INT8 (scalar)</td>
      <td style="text-align: center">1</td>
      <td style="text-align: center">77 GB</td>
    </tr>
    <tr>
      <td>Binary (packed)</td>
      <td style="text-align: center">0.125</td>
      <td style="text-align: center">9.6 GB</td>
    </tr>
  </tbody>
</table>

<p><br />
That’s a 32x difference between FP32 and binary. When memory is what forces you to add more nodes, this matters a lot.</p>
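<p>The numbers in the table come from simple arithmetic. A quick sketch to reproduce them (the constants are taken from the table above; <code class="language-plaintext highlighter-rouge">total_gb</code> is just an illustrative helper, not a Vespa API):</p>

```python
# Storage for 100 million 768-dimensional embeddings at different
# vector precisions. Numbers match the table above.
N_VECTORS = 100_000_000
DIMS = 768

bytes_per_dim = {
    "fp32": 4.0,
    "fp16": 2.0,
    "int8": 1.0,
    "binary": 1.0 / 8,  # one bit per dimension, packed into int8
}

def total_gb(precision: str) -> float:
    """Total storage in GB (decimal) for all vectors at a given precision."""
    return N_VECTORS * DIMS * bytes_per_dim[precision] / 1e9

for p in bytes_per_dim:
    print(f"{p}: {total_gb(p):.1f} GB")  # fp32: 307.2 GB ... binary: 9.6 GB
```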

<p><strong>bfloat16 is free:</strong> In our benchmarks, bfloat16 vectors show zero quality loss compared to FP32 - it’s a 2x storage reduction you can take without any tradeoff.</p>

<h3 id="matryoshka-dimensions">Matryoshka dimensions</h3>

<p>Some models support <a href="https://huggingface.co/blog/matryoshka">Matryoshka Representation Learning (MRL)</a> - you can truncate the embedding to fewer dimensions and still get decent results. Fewer dimensions = less storage, faster search.</p>

<p>Here’s EmbeddingGemma at different dimension sizes:</p>

<p><img src="/assets/2026-01-14-embedding-tradeoffs-quantified/embeddinggemma-mrl.png" alt="EmbeddingGemma MRL" /></p>

<p><em>Source: <a href="https://arxiv.org/pdf/2509.20354">EmbeddingGemma paper</a></em></p>

<p>Interestingly, EmbeddingGemma actually scores <em>higher</em> at 512 dimensions than at 768. We didn’t dig into why - it may be an artifact of the smaller evaluation set - but it’s a reminder that more dimensions isn’t always better.</p>

<p>Not all models support this - check the model card before truncating. If it wasn’t trained for MRL, slicing dimensions will tank your quality.</p>
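<p>Mechanically, MRL truncation is just slicing off the trailing dimensions and re-normalizing. A minimal NumPy sketch (only meaningful for models trained with MRL, per the caveat above; <code class="language-plaintext highlighter-rouge">truncate_mrl</code> is an illustrative helper, not a Vespa API):</p>

```python
import numpy as np

def truncate_mrl(embedding: np.ndarray, dims: int) -> np.ndarray:
    """Keep the leading `dims` dimensions and re-normalize to unit length."""
    truncated = embedding[:dims]
    return truncated / np.linalg.norm(truncated)

# Example: shrink a 768-dim embedding to 512 dims.
full = np.random.default_rng(0).normal(size=768).astype(np.float32)
full /= np.linalg.norm(full)
short = truncate_mrl(full, 512)
```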

<h3 id="inference-speed">Inference speed</h3>

<p>If you have a 200ms latency budget and your embedding model takes 150ms, you’re in trouble. We benchmarked actual inference times so you can plan accordingly.</p>

<p>We measured two things for each model:</p>

<ol>
  <li><strong>Query latency</strong> - how long to embed an 8-word query</li>
  <li><strong>Document throughput</strong> - embeddings per second for 103-word docs</li>
</ol>

<p>Tested on three AWS instance types:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">c7g.2xlarge</code> - Graviton 3 (ARM CPU)</li>
  <li><code class="language-plaintext highlighter-rouge">g4dn.xlarge</code> - T4 GPU</li>
  <li><code class="language-plaintext highlighter-rouge">m8g.xlarge</code> - Graviton 4 (ARM CPU)</li>
</ul>

<p>These numbers are pure ONNX inference time. Your actual indexing throughput will also depend on HNSW config and existing index size, but embedding inference is usually the bottleneck.</p>

<h3 id="quality">Quality</h3>

<p>We evaluated all models on <a href="https://huggingface.co/collections/zeta-alpha-ai/nanobeir">NanoBEIR</a>, a smaller but representative subset of the BEIR benchmark. This let us run a lot of experiments without waiting forever.</p>

<p>For each model, we measured nDCG@10 across four retrieval strategies:</p>

<ul>
  <li><strong>Semantic only</strong> - pure vector similarity</li>
  <li><strong>RRF (Reciprocal Rank Fusion)</strong> - combines BM25 and vector rankings</li>
  <li><strong>Atan hybrid</strong> - normalizes scores using arctangent before combining</li>
  <li><strong>Linear hybrid</strong> - linear normalization before combining</li>
</ul>

<p>The hybrid methods consistently outperform pure semantic search. <strong>Every single model</strong> in our benchmark scored higher with hybrid retrieval than semantic-only. On average, the best hybrid method beats semantic-only by 3-5 percentage points. That’s a meaningful lift you get “for free” by just using BM25 alongside your vectors.</p>
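<p>To make the fusion step concrete, here is a minimal plain-Python sketch of RRF (illustrative only: Vespa computes fusion natively in ranking expressions, and the function name and doc ids here are made up):</p>

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: combine ranked lists of doc ids.

    Each ranking is a list of doc ids, best first. k=60 is the
    constant commonly used in the RRF literature.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Documents that rank well in both lists float to the top.
bm25_hits = ["d3", "d1", "d7"]
vector_hits = ["d1", "d5", "d3"]
fused = rrf([bm25_hits, vector_hits])  # d1 first: high in both lists
```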

<p>We also tested each model with binarized vectors (one bit per dimension, packed into int8). This is where things get interesting:</p>

<ul>
  <li><strong>ModernBERT models</strong> barely flinch - Alibaba GTE ModernBERT retains 98% of quality (0.670 binary vs 0.685 float)</li>
  <li><strong>E5 models</strong> take a bigger hit - E5-base-v2 drops to 92% (0.602 binary vs 0.651 float), and E5-small-v2 to just 87%</li>
</ul>

<p>The takeaway: not all models are created equal for binary quantization. The newer ModernBERT-based models handle it much better than the E5 family. Make sure to check before assuming you can just binarize everything.</p>
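<p>Conceptually, binarization thresholds each dimension at zero and packs the sign bits eight to a byte, which is what Vespa’s <code class="language-plaintext highlighter-rouge">pack_bits</code> does on the document side. A NumPy sketch of the idea (<code class="language-plaintext highlighter-rouge">binarize</code> and <code class="language-plaintext highlighter-rouge">hamming</code> are illustrative helpers, not Vespa code):</p>

```python
import numpy as np

def binarize(vec: np.ndarray) -> np.ndarray:
    """Threshold at zero and pack sign bits: 768 floats -> 96 int8 bytes."""
    bits = (vec > 0).astype(np.uint8)
    return np.packbits(bits).astype(np.int8)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance between two packed binary vectors."""
    xor = np.bitwise_xor(a.view(np.uint8), b.view(np.uint8))
    return int(np.unpackbits(xor).sum())

rng = np.random.default_rng(0)
q = binarize(rng.normal(size=768))  # 96 bytes instead of 3072
d = binarize(rng.normal(size=768))
```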

<h2 id="interactive-leaderboard">Interactive leaderboard</h2>

<p>We built an interactive leaderboard so you can explore the full results yourself. Filter by hardware, sort by different metrics, and expand each model to see the full breakdown across dimensions and precisions. <a href="https://huggingface.co/spaces/vespa-engine/nanobeir-hybrid-evaluation">Open in full screen</a>.</p>

<iframe src="https://vespa-engine-nanobeir-hybrid-evaluation.static.hf.space" frameborder="0" width="100%" height="1200">
</iframe>

<h2 id="getting-started-with-vespa">Getting started with Vespa</h2>

<p>Ready to put this into practice? Here’s how to configure an <a href="https://docs.vespa.ai/en/embedding.html">embedding model in Vespa</a>:</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;component</span> <span class="na">id=</span><span class="s">"alibaba_gte_modernbert_int8"</span> <span class="na">type=</span><span class="s">"hugging-face-embedder"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;transformer-model</span> <span class="na">model-id=</span><span class="s">"alibaba-gte-modernbert"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;max-tokens&gt;</span>8192<span class="nt">&lt;/max-tokens&gt;</span>
    <span class="nt">&lt;pooling-strategy&gt;</span>cls<span class="nt">&lt;/pooling-strategy&gt;</span>
<span class="nt">&lt;/component&gt;</span>
</code></pre></div></div>

<p>Here’s a schema with a binarized embedding field (96 dimensions = 768 bits packed):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>schema doc {
  document doc {
    field id type string {
      indexing: summary | attribute
    }
    field text type string {
      indexing: index | summary
      index: enable-bm25
    }
  }
  field embedding_alibaba_gte_modernbert_int8_96_int8 type tensor&lt;int8&gt;(x[96]) {
    indexing: input text | embed alibaba_gte_modernbert_int8 | pack_bits | index | attribute
    attribute {
      distance-metric: hamming
    }
    index {
      hnsw {
        max-links-per-node: 16
        neighbors-to-explore-at-insert: 200
      }
    }
  }
}
</code></pre></div></div>

<p>And a rank profile using linear normalization for hybrid scoring:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rank-profile hybrid_linear {
  inputs {
    query(q) tensor&lt;int8&gt;(x[96])
  }
  function similarity() {
    expression {
      1 - (distance(field, embedding_alibaba_gte_modernbert_int8_96_int8) / 768)
    }
  }
  first-phase {
    expression: similarity
  }
  global-phase {
    expression: normalize_linear(bm25(text)) + normalize_linear(similarity)
    rerank-count: 1000
  }
  match-features {
    similarity
    bm25(text)
  }
}
</code></pre></div></div>
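<p>A query against this setup could look like the following request body (a sketch: the field, rank profile, and embedder ids match the snippets above, and we assume the embedder is configured to binarize into the <code class="language-plaintext highlighter-rouge">int8</code> query tensor - check the binarization docs for your setup):</p>

```python
# Hypothetical Vespa query request body for the schema and rank
# profile defined above. Adjust ids and fieldsets to your application.
query = {
    "yql": (
        "select * from doc where "
        "({targetHits: 1000}"
        "nearestNeighbor(embedding_alibaba_gte_modernbert_int8_96_int8, q)) "
        "or userQuery()"
    ),
    # userQuery() matches against your default fieldset; adjust as needed.
    "query": "how do I binarize embeddings?",
    "ranking": "hybrid_linear",
    "input.query(q)": "embed(alibaba_gte_modernbert_int8, @query)",
}
```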

<p>Check out the <a href="https://docs.vespa.ai/en/embedding.html">embedding documentation</a> for full details on configuration, including how to set up <a href="https://docs.vespa.ai/en/rag/binarizing-vectors.html">binary quantization</a> and hybrid search.</p>

<h3 id="going-further">Going further</h3>

<p>Binary vectors are fast - really fast. Vespa can do ~1 billion hamming distance calculations per second, roughly 7x more than prenormalized angular distance. That speed difference means you can crank up <a href="https://docs.vespa.ai/en/nearest-neighbor-search.html#using-nearest-neighbor-query-operator">targetHits</a> significantly and still stay within latency budget. More candidates evaluated = better recall. So binary vectors aren’t just about 32x storage savings - they give you headroom to tune for quality too.</p>

<p>And luckily, Vespa’s <a href="https://docs.vespa.ai/en/ranking/phased-ranking.html">phased ranking</a> architecture lets you make up for any remaining quality loss in later phases. You can retrieve candidates with hamming distance, then rescore in any of the following ways:</p>

<ul>
  <li><strong>float-binary</strong> - Use a float query vector, and unpack the bits of the document vector to float for the angular distance calculation. <a href="https://docs.vespa.ai/en/rag/binarizing-vectors.html#rank-profiles-and-queries">Example</a></li>
  <li><strong>float-float</strong> - Retrieve with hamming distance but rerank with full-precision vectors <a href="https://docs.vespa.ai/en/content/attributes.html#paged-attributes-disadvantages">paged in from disk</a>. Should be limited to a small candidate set.</li>
  <li><strong>int8-int8</strong> - Same as float-float, with int8 vectors (scalar quantization, not to be confused with binary quantization) for both query and document. Faster and more storage-efficient than float-float, with a small precision cost.</li>
</ul>
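<p>As an illustration of the first option, float-binary rescoring expands the 96 stored bytes back into a 768-dim &plusmn;1 float vector and scores it against the full-precision query. A hedged numpy sketch (the function names and cosine formulation are ours for illustration, not Vespa internals):</p>

```python
import numpy as np

def unpack_to_float(packed: np.ndarray) -> np.ndarray:
    """Expand 96 int8 bytes into a 768-dim float vector of {-1.0, +1.0}."""
    bits = np.unpackbits(packed.view(np.uint8)).astype(np.float32)
    return bits * 2.0 - 1.0

def float_binary_score(query: np.ndarray, doc_packed: np.ndarray) -> float:
    """Angular (cosine) similarity between a full-precision query vector
    and an unpacked binary document vector."""
    doc = unpack_to_float(doc_packed)
    return float(query @ doc / (np.linalg.norm(query) * np.linalg.norm(doc)))

rng = np.random.default_rng(0)
query = rng.standard_normal(768).astype(np.float32)
# Document vector binarized from the query itself, so the score is high.
doc_packed = np.packbits((query > 0).astype(np.uint8)).view(np.int8)
score = float_binary_score(query, doc_packed)
```

<p>The point of the unpacking step is that the query keeps its full precision, so the rescoring phase recovers much of the ranking quality that pure hamming retrieval gives up.</p>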

<p>See <a href="https://huggingface.co/blog/embedding-quantization#quantization-experiments">this great Hugging Face blog post</a> for more details on these techniques.</p>

<p>For even better results, add a <a href="https://docs.vespa.ai/en/cross-encoders.html">cross-encoder reranker</a> as a final stage. Or (especially if you have several user signals or features), train a <a href="https://docs.vespa.ai/en/xgboost.html">GBDT model</a> to learn optimal combinations.</p>

<p>The beauty of Vespa’s <a href="https://docs.vespa.ai/en/basics/ranking.html">ranking expressions</a> is that you can mix and match all of these - BM25, a bunch of other <a href="https://docs.vespa.ai/en/reference/ranking/rank-features.html">built-in features</a>, vectors, rerankers, learned models - however you want.</p>

<h2 id="a-few-caveats">A few caveats</h2>

<h3 id="multilingual-support">Multilingual support</h3>

<p>If you need to support multiple languages, your options narrow. The <code class="language-plaintext highlighter-rouge">multilingual-e5-base</code> model handles 100+ languages but comes with a quality tradeoff compared to English-only models. For English-only workloads, stick with the specialized models.</p>

<h3 id="context-length">Context length</h3>

<p>Document length matters too. Many newer models handle 8192 tokens, EmbeddingGemma can take 2048, while the E5 family tops out at 512. If your documents are long, look at benchmarks like <a href="https://arxiv.org/html/2402.07440v2">LoCo (Long Document Retrieval)</a> - NanoBEIR won’t tell you much here.</p>

<p>For long documents, check out Vespa’s <a href="https://blog.vespa.ai/introducing-layered-ranking-for-rag-applications/">layered ranking</a> - it lets you rank chunks within documents so you’re not forced to return irrelevant chunks from top-ranking docs.</p>

<h3 id="test-on-your-own-data">Test on your own data</h3>

<p>NanoBEIR is a good starting point, but your domain matters. A model that tops the leaderboard on scientific papers might struggle with product descriptions, legal documents, or your internal knowledge base.</p>

<p>Benchmark rankings can be misleading for specialized domains. The models we tested were trained on general web data - if your corpus looks very different (medical records, source code, niche industry jargon), the relative rankings might shuffle significantly.</p>

<p>We’ve open-sourced the <a href="https://github.com/vespa-engine/pyvespa/blob/master/vespa/evaluation/_mteb.py">benchmarking code in pyvespa</a> so you can run the same experiments on any model with any dataset compatible with the MTEB library. Swap in your own data and see how different models actually perform for your use case.</p>

<h3 id="consider-finetuning">Consider finetuning</h3>

<p>If off-the-shelf models underperform on your domain, finetuning can help significantly. Even a small set of query-document pairs from your actual data can boost relevance.</p>

<p>Tools like <a href="https://www.sbert.net/docs/sentence_transformer/training_overview.html">sentence-transformers</a> make this straightforward. The ROI is often worth it for production systems where a few percentage points of nDCG translate to real user impact.</p>

<h2 id="wrapping-up">Wrapping up</h2>

<p>The “best” embedding model depends entirely on your constraints. But now you have real data to make that call:</p>

<ul>
  <li><strong>Cost sensitive?</strong> Binary quantization with a compatible model (like GTE ModernBERT) gives you 32x savings with minimal quality loss.</li>
  <li><strong>Running on CPU?</strong> INT8 model quantization speeds up inference 2.7-3.4x.</li>
  <li><strong>Need great quality?</strong> Alibaba GTE ModernBERT + hybrid search is hard to beat.</li>
  <li><strong>Latency-critical?</strong> E5-small-v2 with INT8 can do a query inference in only 2.5ms on Graviton3.</li>
</ul>

<p>The interactive leaderboard above has all the details. Explore, filter, and find the sweet spot for your use case.</p>

<p>For those interested in learning more about Vespa, join the <a href="https://vespatalk.slack.com/">Vespa community on Slack</a> to exchange ideas,
seek assistance from the community, or stay in the loop on the latest Vespa developments.</p>
]]></content:encoded>
        <pubDate>Wed, 14 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/embedding-tradeoffs-quantified/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/embedding-tradeoffs-quantified/</guid>
        
        <category>embedding</category>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        
      </item>
    
      <item>
        <title>How Tensors Are Changing Search in Life Sciences</title>
        <description>Tensor-based retrieval preserves context across queries, maintains &quot;chain of thought&quot; and ranking relevance of multiple scientific factors simultaneously.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2025-12-15-how-tensors-are-changing-search-in-life-sciences/tns-tensor-lifescience.jpg" />
        
        <content:encoded><![CDATA[<p><em>Tensor-based retrieval preserves context across queries, maintains “chain of thought” and ranking relevance of multiple scientific factors simultaneously.</em></p>

<p><em>Originally posted <a href="https://thenewstack.io/how-tensors-are-changing-search-in-life-sciences/">25th Aug. 2025 on The New Stack</a></em></p>

<hr />

<p>In my years working across life sciences, one question comes up again and again: What’s next for AI in our field? The truth is that the life sciences industry faces challenges unlike any other.</p>

<p>Where a bank or retailer might deploy AI chatbots to improve customer service, our world is defined by enormous, messy datasets, including clinical trials, lab results, publications and patient records. All of this data must be interpreted with care. The stakes are not just efficiency or convenience; they are breakthroughs in treatment, safety and patient outcomes.</p>

<p>That’s why I believe the real opportunity for <a href="https://thenewstack.io/genai-is-quickly-reinventing-it-operations-leaving-many-behind/">generative AI (GenAI)</a> in the life sciences is not in chatbots, but in enabling <a href="https://thenewstack.io/wrangling-data-is-becoming-critical-in-an-ai-driven-world/">deep and precise retrieval</a>. Success here means connecting across multiple sources, reconciling heterogeneous data and surfacing insights that a human researcher would struggle to piece together.</p>

<p>Imagine asking: “Find me colorectal cancer trials using ZALTRAP [a drug] with the most recent supporting publications.” GenAI, when applied effectively, can handle that complexity, and this is where the next frontier begins.</p>

<h2 id="from-traditional-search-to-ai-driven-discovery">From Traditional Search to AI-Driven Discovery</h2>

<p>For decades, search in life sciences has mostly meant keyword lookups or rule-based retrieval. Researchers, clinicians and pharma teams relied on these tools to sift through scientific literature, clinical trial data, patents and regulatory filings. They worked well enough for simple, well-defined questions. But as soon as you needed to account for domain-specific language, synonyms or the complex relationships between diseases, molecules and pathways, <a href="https://thenewstack.io/vector-search-is-reaching-its-limit-heres-what-comes-next/">traditional search hit its limits</a>.</p>

<p>The result? Endless manual refinements, stitching insights together from different sources and lots of time spent just finding the right information.</p>

<p>Now, with GenAI and <a href="https://www.nature.com/articles/s44387-025-00047-1">large language models (LLMs)</a>, that’s changing. LLM-powered search understands meaning, not just exact words. You can ask complex, natural-language questions and get results that connect the dots across literature, trials and patents — even when they use different terminology. This opens up entirely <a href="https://thenewstack.io/microsoft-opens-ai-store-for-healthcare-developers/">new ways of working</a>: identifying drug repurposing opportunities hidden in disconnected studies, accelerating biomarker discovery or finding previously unseen links between biological entities. It’s faster, more comprehensive and far less manual.</p>

<h2 id="why-tensors-matter-in-this-shift">Why Tensors Matter in This Shift</h2>

<p>Life sciences data comes in all shapes and sizes — omics data, 3D protein structures, medical images, regulatory documents, clinical trial reports and more. Most of it is unstructured or semi-structured, which makes it tricky for AI systems to find and assemble relevant information quickly. Given the nature of life sciences, accuracy is critical. “Good enough” seldom exists.</p>

<p>This is where tensors come in.</p>

<p>So, <a href="https://thenewstack.io/beyond-vector-search-the-move-to-tensor-based-retrieval/">what is a tensor</a>? Think of it as a multidimensional data container. A vector is a one-dimensional list of numbers. A matrix is two-dimensional. A tensor goes beyond that, capturing multiple dimensions at once. This allows AI models to represent complex relationships — like spatial configurations of proteins or contextual relationships between words in a scientific article — even if those pieces of information are far apart.</p>

<p>In other words, tensors let AI “see” and learn patterns that are deeply embedded across different dimensions of data.</p>

<h2 id="tensors-in-action-protein-structures">Tensors in Action: Protein Structures</h2>

<p><img src="/assets/2025-12-15-how-tensors-are-changing-search-in-life-sciences/tnsharini1.png" alt="" /></p>

<p>Take structural biology as an example. Models like AlphaFold use 3D tensors to represent the spatial relationships between amino acids. These tensors allow the model to learn how proteins fold, twist and interact — crucial knowledge for understanding disease mechanisms and designing new therapies.</p>

<p>When you embed a protein as a tensor, you preserve:</p>

<ul>
  <li>Sequential data (the order of amino acids)</li>
  <li>Spatial relationships (how parts of the protein fold and connect)</li>
  <li>Biochemical properties (like charge or hydrophobicity)</li>
</ul>

<p>This rich representation lets machine learning (ML) models predict protein folding, identify binding sites, map protein-protein interactions and even discover new drug targets.</p>
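<p>The three properties above can be sketched as a single numpy tensor, with one axis for sequence position and one for feature type. Everything below is made-up toy data (residue count, coordinates and property values are illustrative, not a real encoding used by any model):</p>

```python
import numpy as np

# Toy protein of 5 residues. Each row stacks, per residue:
#   [sequence position, x/y/z coordinates, charge, hydrophobicity]
# i.e. sequential, spatial and biochemical information in one container.
n_residues = 5
rng = np.random.default_rng(7)

position = np.arange(n_residues, dtype=np.float32).reshape(-1, 1)  # sequential
coords = rng.standard_normal((n_residues, 3)).astype(np.float32)   # spatial
charge = rng.uniform(-1, 1, (n_residues, 1)).astype(np.float32)    # biochemical
hydrophobicity = rng.uniform(0, 1, (n_residues, 1)).astype(np.float32)

protein = np.concatenate([position, coords, charge, hydrophobicity], axis=1)
# One protein is a 2-D slice; stacking many proteins adds a third axis:
batch = np.stack([protein, protein])  # shape (2, 5, 6): a 3-D tensor
```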

<p>The same idea applies beyond proteins.</p>

<p>Medical imaging, for example, can use tensors to encode not just pixels, but also their contextual relevance, helping AI detect subtle cancer markers even in noisy scans. In clinical settings, tensors help AI analyze data streams from wearables or Internet of Things (IoT) devices in real time, enabling faster interventions.</p>

<p><img src="/assets/2025-12-15-how-tensors-are-changing-search-in-life-sciences/tnsharini2.png" alt="" /></p>

<h2 id="beyond-retrieval-ai-agents-in-life-sciences">Beyond Retrieval: AI Agents in Life Sciences</h2>

<p>AI agents are another emerging application. Think of them as intelligent assistants that continuously gather, analyze and synthesize information across fragmented data sources. An AI agent could monitor new literature, clinical trials and regulatory updates in real time, summarize findings and even suggest next research steps.</p>

<hr />

<p><em>Good agents don’t just fetch information — they connect it, building context and reasoning through problems step by step.</em></p>

<hr />

<p>The key here is multistep reasoning. Good agents don’t just fetch information — they connect it, building context and reasoning through problems step by step.</p>

<p>This means faster reasoning, better accuracy and more meaningful insights, letting you stitch together multimodal data and ask questions across modalities and time. For example, as illustrated below, you can now find patients for trial recruitment for a disease subtype based on how their imaging progresses (or regresses) over time, by combining the patient’s medical record, biomarker assays, histopathology slides and any other prognosis notes into a single tensor.</p>

<p><img src="/assets/2025-12-15-how-tensors-are-changing-search-in-life-sciences/tnsharini3.png" alt="" /></p>

<h2 id="why-this-matters">Why This Matters</h2>

<p>Life sciences are moving into an era where data is simply too complex and too vast for traditional tools. Tensors provide the foundation for AI models to handle this complexity, enabling everything from better search to advanced reasoning. Whether it’s predicting protein structures, extracting insights from clinical data or powering AI agents that help researchers focus on discovery rather than data wrangling, tensors are quietly becoming the backbone of the next wave of AI in life sciences.</p>

]]></content:encoded>
        <pubDate>Mon, 05 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/how-tensors-are-changing-search-in-life-sciences/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/how-tensors-are-changing-search-in-life-sciences/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        <category>LifeSciencesAI</category>
        
        
      </item>
    
      <item>
        <title>The Search API Reset: Incumbents Retreat, Innovators Step Up</title>
        <description>Google and Bing are restricting their search APIs, creating opportunities for new players to build the next generation of search infrastructure.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2025-12-14-the-search-api-reset-incumbents-retreat-innovators-step-up/tns-search-api-image.jpg" />
        
        <content:encoded><![CDATA[<p><em>Google and Bing are restricting their search APIs, creating opportunities for new players to build the next generation of search infrastructure.</em></p>

<p><em>Originally posted <a href="https://thenewstack.io/the-search-api-reset-incumbents-retreat-innovators-step-up/">7th Nov. 2025 on The New Stack</a></em></p>

<hr />

<p>The <a href="https://thenewstack.io/why-ai-search-platforms-are-gaining-attention/">search landscape</a> is shifting. In recent months, Microsoft announced the retirement of the Bing Search API, while Google limited its own API to a maximum of 10 results per query. These moves mark a notable change in the way the web’s dominant search providers view access to their data and who gets to build on it.</p>

<p>For more than a decade, search <a href="https://thenewstack.io/why-api-first-matters-in-an-ai-driven-world/">APIs</a> like Bing and Google Custom Search have been part of the web’s plumbing. Developers have used them to retrieve web results, images and news without maintaining their own indexes. Enterprises have embedded them in applications such as customer support, knowledge bases and market intelligence to provide external context. Startups and research teams have used them to collect training data, ground language models or perform competitive analysis without running their own crawlers.</p>

<p>In short, search APIs have offered a simple way to access the open web programmatically, bridging the gap between consumer search and enterprise information retrieval.</p>

<h2 id="the-ai-shift">The AI Shift</h2>

<p>The emergence of generative AI has changed <a href="https://thenewstack.io/enterprise-ai-search-vs-the-real-needs-of-customer-facing-apps/">what the search infrastructure needs to deliver</a>. With <a href="https://thenewstack.io/why-rag-is-essential-for-next-gen-ai-development/">retrieval-augmented generation (RAG)</a> becoming central to AI systems, developers now require flexible retrieval layers within the AI pipeline, not just returning links.</p>

<p>Against this backdrop, the timing of Microsoft and Google’s decisions stands out. Microsoft has folded search access into Azure’s AI stack through its Grounding with Bing Search feature for AI agents, while Google continues to reduce external visibility into its own results. Limiting queries to 10 results per call fits with its long-standing goal of minimizing bulk data extraction and automated scraping.</p>

<p>The business thinking is clear: Both companies are steering developers away from large-scale, open retrieval and toward AI-mediated access inside their own ecosystems. Full result sets are expensive to serve and often used by automated systems such as SEO platforms, data-mining tools or research crawlers rather than by interactive users. Restricting APIs helps contain those costs while repositioning web data as a controlled resource for higher-level AI services.</p>

<h2 id="a-reset-not-a-retreat">A Reset, Not a Retreat</h2>

<p>This isn’t a collapse of search, but a realignment of control. The open, list-based APIs of the past belong to an era where raw results were the product. In the generative AI era, incumbents are redefining search around answers, grounding and context, tightly coupled with their cloud ecosystems.</p>

<p>But as the large providers step back, new players are moving in. Perplexity and Parallel represent a new generation of search APIs designed for AI workloads. They publish benchmarks, expose APIs openly and emphasize retrieval quality and low latency, the performance characteristics that matter most in RAG and agentic systems. You can read more about the <a href="https://www.perplexity.ai/hub/blog/introducing-the-perplexity-search-api">Perplexity search API here</a>.</p>

<p>Perplexity has also shown that it <a href="https://medium.com/@evolutionaihub/whats-new-in-perplexity-s-search-api-that-just-killed-google-s-edge-b95047ada22e">outperforms Google on relevance</a> for RAG-style tasks. Not to be outdone, Parallel, founded by Twitter’s former CEO, Parag Agrawal, recently <a href="https://x.com/paraga/status/1971650814705127438">reported better results</a> than Perplexity, using Perplexity’s own evaluation tool.</p>

<h2 id="a-hot-market-new-foundations">A Hot Market, New Foundations</h2>

<p>The search API market is heating up again, this time around AI native infrastructure. Beneath Perplexity and Parallel is a common component: Vespa, the open source engine built for large-scale retrieval, ranking and machine learning inference.</p>

<p>Vespa’s role in these systems reflects a broader shift in architecture: Search infrastructure is now part of the AI stack itself. As models depend more on retrieval, factors such as performance, scalability and the ability to combine <a href="https://thenewstack.io/automating-context-in-structured-data-for-llms/">structured and unstructured data</a> have become key differentiators.</p>

<p>The incumbents are narrowing access; the innovators are expanding it. Either way, search is once again at the center of how the web is organized, only this time, it’s being rebuilt for AI.</p>

]]></content:encoded>
        <pubDate>Fri, 02 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/the-search-api-reset-incumbents-retreat-innovators-step-up/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/the-search-api-reset-incumbents-retreat-innovators-step-up/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        <category>AI search</category>
        
        
      </item>
    
  </channel>
</rss>
