<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Vespa Blog</title>
    <description>We Make AI Work</description>
    <link>https://blog.vespa.ai/</link>
    <atom:link href="https://blog.vespa.ai/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Mon, 02 Mar 2026 13:27:41 +0000</pubDate>
    <lastBuildDate>Mon, 02 Mar 2026 13:27:41 +0000</lastBuildDate>
    <generator>Jekyll v4.4.1</generator>
    
      <item>
        <title>Build a High-Quality RAG App on Vespa Cloud in 15 Minutes</title>
        <description>Retrieval-Augmented Generation (RAG) allows an LLM to answer questions using your data at query time. On their own, LLMs are powerful but limited: they can hallucinate, they have a fixed knowledge cutoff, and they know nothing about your private documents, internal wikis, or proprietary systems.
</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/illustration_2.png" />
        
        <content:encoded><![CDATA[<p><strong>Retrieval-Augmented Generation (RAG)</strong> allows an LLM to answer questions using your data at query time. On their own, LLMs are powerful but limited: they can hallucinate, they have a fixed knowledge cutoff, and they know nothing about your private documents, internal wikis, or proprietary systems.</p>

<p>RAG bridges that gap by retrieving relevant information from your data and supplying it to the model as context, so responses are grounded in real, trusted sources rather than guesswork.</p>

<h2 id="the-challenge-the-quality-of-the-context-window">The Challenge: The Quality of the Context Window</h2>

<p>In Retrieval-Augmented Generation (RAG), the real bottleneck is the LLM’s context window. You can’t simply pass your entire dataset into a prompt—there’s a strict token budget.</p>

<p>Because of this, the problem isn’t just retrieving information, but retrieving the right information. When the context window is filled with loosely matched or low-quality results, the LLM has little to work with and the quality of its answers drops accordingly.</p>

<p>High-quality RAG depends on semantic understanding, precise retrieval, and strong ranking across diverse data types so that every token in the context window earns its place.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/illustration_2.png" alt="illustration_2" /></p>

<h2 id="the-solution-out-of-the-box-rag-on-vespa-cloud">The Solution: Out-of-the-Box RAG on Vespa Cloud</h2>

<p>Vespa Cloud provides an out-of-the-box Vespa <a href="https://docs.vespa.ai/en/examples/rag-blueprint.html">RAG Blueprint</a> designed to maximize the quality of the context sent to the LLM. Instead of relying solely on nearest-neighbor vector search, Vespa combines semantic vector retrieval with lexical BM25 scoring and applies advanced ranking using models such as BERT, LightGBM, or custom logic, ensuring that only the strongest candidates are selected.</p>

<p>This hybrid retrieval and ranking approach consistently surfaces the most relevant document chunks, which significantly improves the quality of the final generated answer.</p>

<p>In this blog post, we’ll build a complete RAG application from end to end by leveraging the out-of-the-box Vespa RAG Blueprint on Vespa Cloud. The following diagram shows the architecture we’ll be working with:</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/architecture_diagram.png" alt="Vespa RAG Architecture" /></p>

<p>The architecture consists of two main flows: data ingestion and query processing.</p>

<p><strong>Data Ingestion (one-time setup)</strong></p>

<p>First, we ingest our data sources, such as documents, PDFs, or web pages, using a Python-based pipeline. The pipeline processes the data, splits it into manageable chunks, generates embeddings, and feeds everything into a Vespa Cloud RAG application that is preconfigured with a schema and ranking profiles. This step populates the search index.</p>
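<p>In the blueprint schema, chunking and embedding are declared in the indexing statements, so a minimal feed can send just the raw fields and let Vespa do the rest. A sketch of that feed step, assuming pyvespa is installed; the endpoint and token values are placeholders for your own:</p>

```python
# Minimal feed sketch (illustrative): only id, title, and text are sent;
# the deployed schema chunks and embeds the text field during indexing.
from typing import Any


def make_doc(doc_id: str, title: str, text: str) -> dict[str, Any]:
    return {"id": doc_id, "title": title, "text": text}


payload = make_doc("doc-1", "Getting started with Vespa", "Vespa is a platform ...")

# Feeding with pyvespa (endpoint and token are placeholders):
# from vespa.application import Vespa
# app = Vespa(url="https://your-app.vespa-cloud.com",
#             vespa_cloud_secret_token="vespa_cloud_YOUR_TOKEN")
# app.feed_data_point(schema="doc", data_id=payload["id"], fields=payload)
```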

<p><strong>Query Flow (live interaction)</strong></p>

<ol>
  <li>
    <p>A user enters a question in the <strong>Vespa RAG UI</strong>.</p>
  </li>
  <li>
    <p>The UI sends the query to a <strong>Python backend</strong>, which issues a hybrid search request (combining keyword and vector retrieval) to <strong>Vespa Cloud</strong>.</p>
  </li>
  <li>
    <p><strong>Vespa Cloud</strong> returns the most relevant document chunks.</p>
  </li>
  <li>
    <p>The backend sends those chunks, along with the original query, to an <strong>LLM</strong> as context.</p>
  </li>
  <li>
    <p>The model generates an answer grounded in that context and returns it to the backend.</p>
  </li>
  <li>
    <p>The backend streams the answer back to the UI.</p>
  </li>
</ol>
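<p>The retrieval step (2) boils down to a single HTTP request against Vespa’s query API. A hedged sketch of roughly what the backend might send, using the blueprint’s query-profile names; the exact parameters NyRAG uses may differ:</p>

```python
# Illustrative search request body for Vespa's query API; the query profile
# bundles the YQL, nearestNeighbor operator, and ranking configuration.
def build_search_request(user_query: str, profile: str = "hybrid", hits: int = 5) -> dict:
    return {
        "query": user_query,      # free text, used for both BM25 and embedding
        "queryProfile": profile,  # hybrid | hybrid-with-gbdt | deepresearch | ...
        "hits": hits,
    }


body = build_search_request("what is binary quantization?")
# requests.post(f"{endpoint}/search/", json=body,
#               headers={"Authorization": f"Bearer {token}"})
```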

<p>This architecture ensures that generated responses are grounded in your own data, combining Vespa’s retrieval and ranking strengths with the generative capabilities of large language models.</p>

<p>The end-to-end setup takes about 15 minutes, plus additional time to process your documents.</p>

<hr />

<h2 id="deploy-vespa-rag-blueprint-to-vespa-cloud">Deploy Vespa RAG Blueprint to Vespa Cloud</h2>

<p>We’ll start by deploying a preconfigured RAG Blueprint to Vespa Cloud. This gives you a high-quality retrieval stack in minutes, and it’s free to get started. All of this is done directly from the Vespa Cloud console.</p>

<p><strong>Sign up for Vespa Cloud</strong></p>

<p>Go to the <a href="https://console.vespa-cloud.com/">Vespa Cloud Console</a> and create an account. If this is your first time using Vespa Cloud, the free trial is the fastest way to get going.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_1.png" alt="image_1" /></p>

<p><strong>Deploy RAG Blueprint</strong></p>

<p>In the console, select <strong>“Deploy your first application”</strong>.
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_2.png" alt="image_2" /></p>

<p>Choose <strong>“Select a sample application to deploy directly from the browser”</strong>.
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_3.png" alt="image_3" /></p>

<p>Select <strong>“RAG Blueprint”</strong>.
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_4.png" alt="image_4" /></p>

<p>Click <strong>“Deploy”</strong> and wait for the deployment to complete.
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_5.png" alt="image_5" />
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_8.png" alt="image_8" /></p>

<p><strong>Save your credentials</strong></p>

<p>Once deployment finishes, the console will generate an access token. <strong>Save this immediately.</strong>
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_9.png" alt="image_9" /></p>

<p>That token is how the Python backend authenticates with Vespa Cloud. Treat it like a password.</p>

<p>Continue through the remaining setup screens, then open the application view.
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_10.png" alt="image_10" />
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_11.png" alt="image_11" />
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_12.png" alt="image_12" />
<img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_13.png" alt="image_13" /></p>

<p><strong>Note your endpoint URL</strong></p>

<p>In the application view you will also find the endpoint URL. Save both the <strong>endpoint URL</strong> and the token; you will need them to configure the Python backend in the next section.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/image_15.png" alt="image_15" /></p>

<p>You can download the Vespa application package by clicking the download icon if you’d like. From there, you can start building your data feeding pipeline, frontend service UI, and more. However, this blog provides a sample end-to-end RAG application, and the same Vespa application package is included, so there’s no need to download it separately.</p>

<h2 id="behind-the-scenes-what-you-just-deployed">Behind the Scenes: What You Just Deployed</h2>

<p>When you clicked <strong>Deploy</strong>, Vespa Cloud automatically provisioned infrastructure and deployed a complete <strong>Vespa application package</strong>. This package includes everything needed for a high-quality RAG system: schemas, indexing logic, ranking profiles, and service configuration.</p>

<p>In other words, you didn’t just spin up a demo, you launched a ready-to-use, high-quality retrieval engine.</p>

<p>Let’s take a closer look at what’s inside.</p>

<h3 id="the-schema">The Schema</h3>

<p>The RAG Blueprint uses a carefully designed schema that controls how documents are stored, chunked, embedded, and retrieved:</p>

<p><code class="language-plaintext highlighter-rouge">vespa_cloud/schemas/doc.sd</code>:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">schema</span> <span class="n">doc</span> <span class="o">{</span>
    <span class="n">document</span> <span class="n">doc</span> <span class="o">{</span>
        <span class="n">field</span> <span class="n">id</span> <span class="n">type</span> <span class="n">string</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">summary</span> <span class="o">|</span> <span class="n">attribute</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">title</span> <span class="n">type</span> <span class="n">string</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">index</span> <span class="o">|</span> <span class="n">summary</span>
            <span class="nl">index:</span> <span class="n">enable</span><span class="o">-</span><span class="n">bm25</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">text</span> <span class="n">type</span> <span class="n">string</span> <span class="o">{</span>
        <span class="o">}</span>

        <span class="err">#</span> <span class="nc">Optional</span> <span class="n">metadata</span> <span class="n">fields</span> <span class="k">for</span> <span class="n">tracking</span> <span class="n">document</span> <span class="n">usage</span>
        <span class="n">field</span> <span class="n">created_timestamp</span> <span class="n">type</span> <span class="kt">long</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">summary</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">modified_timestamp</span> <span class="n">type</span> <span class="kt">long</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">summary</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">last_opened_timestamp</span> <span class="n">type</span> <span class="kt">long</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">summary</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">open_count</span> <span class="n">type</span> <span class="kt">int</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">summary</span>
        <span class="o">}</span>
        <span class="n">field</span> <span class="n">favorite</span> <span class="n">type</span> <span class="n">bool</span> <span class="o">{</span>
            <span class="nl">indexing:</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">summary</span>
        <span class="o">}</span>
    <span class="o">}</span>

    <span class="err">#</span> <span class="nc">Binary</span> <span class="n">quantized</span> <span class="n">embeddings</span> <span class="k">for</span> <span class="n">the</span> <span class="nf">title</span> <span class="o">(</span><span class="mi">768</span> <span class="n">floats</span> <span class="err">→</span> <span class="mi">96</span> <span class="n">int8</span><span class="o">)</span>
    <span class="n">field</span> <span class="n">title_embedding</span> <span class="n">type</span> <span class="n">tensor</span><span class="o">&lt;</span><span class="n">int8</span><span class="o">&gt;(</span><span class="n">x</span><span class="o">[</span><span class="mi">96</span><span class="o">])</span> <span class="o">{</span>
        <span class="nl">indexing:</span> <span class="n">input</span> <span class="n">title</span> <span class="o">|</span> <span class="n">embed</span> <span class="o">|</span> <span class="n">pack_bits</span> <span class="o">|</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">index</span>
        <span class="n">attribute</span> <span class="o">{</span>
            <span class="n">distance</span><span class="o">-</span><span class="nl">metric:</span> <span class="n">hamming</span>
        <span class="o">}</span>
    <span class="o">}</span>

    <span class="err">#</span> <span class="nc">Automatically</span> <span class="n">chunks</span> <span class="n">text</span> <span class="n">into</span> <span class="mi">1024</span><span class="o">-</span><span class="n">character</span> <span class="n">segments</span>
    <span class="n">field</span> <span class="n">chunks</span> <span class="n">type</span> <span class="n">array</span><span class="o">&lt;</span><span class="n">string</span><span class="o">&gt;</span> <span class="o">{</span>
        <span class="nl">indexing:</span> <span class="n">input</span> <span class="n">text</span> <span class="o">|</span> <span class="n">chunk</span> <span class="n">fixed</span><span class="o">-</span><span class="n">length</span> <span class="mi">1024</span> <span class="o">|</span> <span class="n">summary</span> <span class="o">|</span> <span class="n">index</span>
        <span class="nl">index:</span> <span class="n">enable</span><span class="o">-</span><span class="n">bm25</span>
    <span class="o">}</span>

    <span class="err">#</span> <span class="nc">Binary</span> <span class="n">quantized</span> <span class="n">embeddings</span> <span class="k">for</span> <span class="n">each</span> <span class="n">chunk</span>
    <span class="n">field</span> <span class="n">chunk_embeddings</span> <span class="n">type</span> <span class="n">tensor</span><span class="o">&lt;</span><span class="n">int8</span><span class="o">&gt;(</span><span class="n">chunk</span><span class="o">{},</span> <span class="n">x</span><span class="o">[</span><span class="mi">96</span><span class="o">])</span> <span class="o">{</span>
        <span class="nl">indexing:</span> <span class="n">input</span> <span class="n">text</span> <span class="o">|</span> <span class="n">chunk</span> <span class="n">fixed</span><span class="o">-</span><span class="n">length</span> <span class="mi">1024</span> <span class="o">|</span> <span class="n">embed</span> <span class="o">|</span> <span class="n">pack_bits</span> <span class="o">|</span> <span class="n">attribute</span> <span class="o">|</span> <span class="n">index</span>
        <span class="n">attribute</span> <span class="o">{</span>
            <span class="n">distance</span><span class="o">-</span><span class="nl">metric:</span> <span class="n">hamming</span>
        <span class="o">}</span>
    <span class="o">}</span>

    <span class="n">fieldset</span> <span class="k">default</span> <span class="o">{</span>
        <span class="nl">fields:</span> <span class="n">title</span><span class="o">,</span> <span class="n">chunks</span>
    <span class="o">}</span>

    <span class="n">document</span><span class="o">-</span><span class="n">summary</span> <span class="n">top_3_chunks</span> <span class="o">{</span>
        <span class="n">from</span><span class="o">-</span><span class="n">disk</span>
        <span class="n">summary</span> <span class="n">chunks_top3</span> <span class="o">{</span>
            <span class="nl">source:</span> <span class="n">chunks</span>
            <span class="n">select</span><span class="o">-</span><span class="n">elements</span><span class="o">-</span><span class="nl">by:</span> <span class="n">top_3_chunk_sim_scores</span>
        <span class="o">}</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p><strong>What’s happening here:</strong> Your documents store their raw content in <code class="language-plaintext highlighter-rouge">title</code> and <code class="language-plaintext highlighter-rouge">text</code>. During indexing, the <code class="language-plaintext highlighter-rouge">text</code> field is automatically split into 1024-character chunks. Embeddings are generated for both titles and chunks, then binary-quantized using <code class="language-plaintext highlighter-rouge">pack_bits</code>, shrinking 768 floating-point values down to just 96 <code class="language-plaintext highlighter-rouge">int8</code>s. This dramatically reduces storage and improves performance while still supporting efficient vector similarity search.</p>
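<p>The <code class="language-plaintext highlighter-rouge">pack_bits</code> arithmetic is easy to verify: 768 sign bits packed 8 per byte is exactly 96 bytes. A small sketch of the idea (the thresholding convention here is an assumption; Vespa’s actual implementation runs server-side), together with the hamming distance the schema declares:</p>

```python
# Sketch of binary quantization: one sign bit per dimension, 8 bits per int8.
def pack_bits(vec: list[float]) -> list[int]:
    bits = [1 if v > 0 else 0 for v in vec]     # assumed thresholding convention
    packed = []
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        packed.append(byte - 256 if byte > 127 else byte)  # to signed int8
    return packed


def hamming(a: list[int], b: list[int]) -> int:
    # distance-metric: hamming counts differing bits between packed vectors
    return sum(bin((x ^ y) & 0xFF).count("1") for x, y in zip(a, b))


assert len(pack_bits([0.1] * 768)) == 96   # 768 floats -> 96 int8s
```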

<p>At the same time, BM25 is enabled for lexical matching. This combination is what enables Vespa’s hybrid retrieval: semantic matching plus exact term relevance.</p>
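<p>In Vespa’s query language, this hybrid combination is typically expressed by OR-ing a lexical operator with a <code class="language-plaintext highlighter-rouge">nearestNeighbor</code> operator. Roughly (an illustrative approximation of what the blueprint’s query profiles assemble, not the exact YQL they ship):</p>

```python
# Rough shape of a hybrid YQL query (illustrative; the blueprint's query
# profiles build the real thing, including ranking-profile selection).
target_hits = 100
yql = (
    "select * from doc where "
    "userQuery() or "
    f"({{targetHits:{target_hits}}}nearestNeighbor(chunk_embeddings, q))"
)
```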

<p><strong>Out-of-the-Box Query Profiles:</strong></p>

<p>The RAG Blueprint ships with four query profiles optimized for the client-side RAG architecture of <strong>NyRAG</strong>, the companion tool we’ll set up later in this post:</p>

<p><strong>NyRAG Architecture:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>User Query → NyRAG (generates search queries)
          → Vespa (retrieval + ranking)
          → NyRAG (generates final answer)
</code></pre></div></div>
<p>Query profiles control <strong>only the Vespa retrieval/ranking step</strong>. NyRAG handles all LLM interactions.</p>

<p><strong>The 4 Profiles:</strong></p>

<ol>
  <li><strong>hybrid</strong> (default, fast)
    <ul>
      <li><strong>Retrieval:</strong> BM25 + Vector search with <code class="language-plaintext highlighter-rouge">targetHits:100</code></li>
      <li><strong>Ranking:</strong> Learned linear model (logistic regression)</li>
      <li><strong>Best for:</strong> Everyday queries where you want fast, solid results</li>
    </ul>
  </li>
  <li><strong>hybrid-with-gbdt</strong> (highest quality)
    <ul>
      <li><strong>Retrieval:</strong> Same as hybrid (BM25 + Vector, 100 targets)</li>
      <li><strong>Ranking:</strong> Two-phase with LightGBM (GBDT) second-phase</li>
      <li><strong>Best for:</strong> Complex queries where relevance matters most (~2-3x slower)</li>
    </ul>
  </li>
  <li><strong>deepresearch</strong> (exhaustive search)
    <ul>
      <li><strong>Retrieval:</strong> BM25 + Vector with <code class="language-plaintext highlighter-rouge">targetHits:10000</code> (100x more!)</li>
      <li><strong>Ranking:</strong> Learned linear model</li>
      <li><strong>Best for:</strong> Research scenarios needing maximum recall</li>
    </ul>
  </li>
  <li><strong>deepresearch-with-gbdt</strong> (exhaustive + best quality)
    <ul>
      <li><strong>Retrieval:</strong> Deep search (10k targets)</li>
      <li><strong>Ranking:</strong> Two-phase with GBDT</li>
      <li><strong>Best for:</strong> When you need both maximum recall and best ranking</li>
    </ul>
  </li>
</ol>

<blockquote>
  <p><strong>For Advanced Users:</strong> Query profiles bundle complete search configurations including YQL structure (with <code class="language-plaintext highlighter-rouge">nearestNeighbor</code> operators), ranking profiles, and all required parameters (like learned coefficients). The Vespa application also includes <code class="language-plaintext highlighter-rouge">rag</code> and <code class="language-plaintext highlighter-rouge">rag-with-gbdt</code> profiles with <code class="language-plaintext highlighter-rouge">searchChain=openai</code> for <strong>server-side RAG</strong> (direct API usage), but these conflict with NyRAG’s client-side architecture and aren’t included. Learn more in the <a href="https://github.com/vespaai-playground/RAG-app-in-15min-ragblueprint#ranking-profiles">technical guide</a>.</p>
</blockquote>

<p><strong>Which profile should you use?</strong></p>
<ul>
  <li>Start with <strong><code class="language-plaintext highlighter-rouge">hybrid</code></strong> for everyday use: fast and accurate</li>
  <li>Switch to <strong><code class="language-plaintext highlighter-rouge">hybrid-with-gbdt</code></strong> when quality matters most (harder queries)</li>
  <li>Use <strong><code class="language-plaintext highlighter-rouge">deepresearch</code></strong> when you need to find everything relevant (research mode)</li>
  <li>Try <strong><code class="language-plaintext highlighter-rouge">deepresearch-with-gbdt</code></strong> for maximum recall + quality (slowest but most thorough)</li>
</ul>

<hr />

<p>Now that your RAG Blueprint Vespa Cloud application is up and running, it’s time to add the missing pieces: a simple frontend UI and a data ingestion pipeline. For this, we’ll use <strong>NyRAG</strong>, a tool included in the <a href="https://github.com/vespaai-playground/RAG-app-in-15min-ragblueprint"><code class="language-plaintext highlighter-rouge">RAG-app-in-15min-ragblueprint</code></a> repository.</p>

<p>NyRAG acts as the glue for the entire RAG workflow. It reads documents from local files or websites, splits text into manageable chunks, generates embeddings, feeds everything into Vespa, and finally exposes a lightweight chat UI where you can ask questions over your data. Instead of wiring all of this together yourself, NyRAG gives you a working end-to-end system out of the box.</p>

<h3 id="install-nyrag">Install NyRAG</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Clone the repository</span>
git clone https://github.com/vespaai-playground/RAG-app-in-15min-ragblueprint.git
<span class="nb">cd </span>RAG-app-in-15min-ragblueprint

<span class="c"># Install uv (Fast, modern Python package manager)</span>
<span class="c"># macOS</span>
brew <span class="nb">install </span>uv

<span class="c"># Linux &amp; macOS</span>
<span class="c"># curl -LsSf https://astral.sh/uv/install.sh | sh</span>
<span class="c"># Windows (PowerShell)</span>
<span class="c"># powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"</span>

<span class="c"># Verify uv installation</span>
uv <span class="nt">--version</span>

<span class="c"># Install dependencies using uv</span>
uv <span class="nb">sync
source</span> .venv/bin/activate

<span class="c"># Windows (PowerShell)</span>
<span class="c"># powershell -ExecutionPolicy Bypass</span>
<span class="c"># . .\.venv\Scripts\activate</span>

<span class="c"># Install nyrag locally</span>
uv pip <span class="nb">install</span> <span class="nt">-e</span> <span class="nb">.</span>

<span class="c"># Verify nyrag installation</span>
nyrag <span class="nt">--help</span>
</code></pre></div></div>

<p><strong>Get an LLM API key</strong></p>

<p>To generate final answers, NyRAG needs an OpenAI-compatible API key. The simplest way to get started is <strong>OpenRouter</strong>, which provides access to multiple LLMs through a single API.</p>

<p>In this walkthrough, we’ll use OpenRouter for convenience. In a real application, you’re free to swap in any compatible LLM provider. To continue, sign up for OpenRouter and generate an API key. You’ll use it in the next step when configuring NyRAG.</p>

<hr />

<h3 id="start-the-nyrag-ui">Start the NyRAG UI</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># This script handles all configuration automatically</span>
./run_nyrag.sh

<span class="c"># Windows (PowerShell)</span>
<span class="c"># powershell -ExecutionPolicy Bypass</span>
<span class="c"># .\run_nyrag.ps1</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">run_nyrag.sh</code> script starts the UI and wires up the configuration so NyRAG can talk to Vespa Cloud. In practice, it loads your project config, uses the token you provide for authentication, and starts the web UI on port 8000.</p>

<p>Open http://localhost:8000 in your browser.</p>

<p><strong>Configure your project:</strong>
Now you’ll use the web UI to connect to your Vespa Cloud deployment and set up document processing.</p>

<p><strong>Step 1: Select and edit the example project</strong></p>

<p>In the top header, the project dropdown shows <strong>“doc_example”</strong>. If you are starting from the example config, it is usually pre-selected and the configuration editor typically opens automatically.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/nyrag_7.png" alt="Project selector dropdown with &quot;doc_example&quot; highlighted" />
<strong>Description</strong>: Shows the project dropdown menu in the header with “doc_example” option</p>

<blockquote>
  <p><strong>Note:</strong> If the configuration editor doesn’t appear (shows chat interface instead), click the <strong>three-dot menu</strong> (⋮) in the top right corner and select <strong>“Edit Config”</strong> to open it manually.</p>
</blockquote>

<p><strong>Step 2: Update your credentials</strong></p>

<p>In the configuration editor, paste in the information you saved from Vespa Cloud and your LLM provider. You only need three things to get going: your Vespa tenant name, your Vespa endpoint + token, and your LLM API key.</p>

<p><strong>Required fields to update:</strong></p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Your Vespa Cloud credentials (from Vespa Cloud Console)</span>
<span class="na">cloud_tenant</span><span class="pi">:</span> <span class="s">your-tenant</span>          <span class="c1"># Your Vespa Cloud tenant name</span>
<span class="na">vespa_cloud</span><span class="pi">:</span>
  <span class="na">endpoint</span><span class="pi">:</span> <span class="s">https://your-app.vespa-cloud.com</span>  <span class="c1"># Your Vespa token endpoint (not mtls)</span>
  <span class="na">token</span><span class="pi">:</span> <span class="s">vespa_cloud_YOUR_TOKEN_HERE</span>          <span class="c1"># Your Vespa data plane token</span>

<span class="c1"># Your LLM configuration (default: OpenRouter)</span>
<span class="na">llm_config</span><span class="pi">:</span>
  <span class="na">api_key</span><span class="pi">:</span> <span class="s">sk-or-v1-YOUR_KEY_HERE</span>   <span class="c1"># Your OpenRouter API key (or other provider)</span>
</code></pre></div></div>

<p><strong>Notes:</strong></p>

<p>The default LLM provider is OpenRouter. If you switch providers, also update <code class="language-plaintext highlighter-rouge">base_url</code> and <code class="language-plaintext highlighter-rouge">model</code> to match. For the included example documents, <code class="language-plaintext highlighter-rouge">start_loc</code> defaults to <code class="language-plaintext highlighter-rouge">./dataset</code>, so you can run the pipeline without changing anything else.</p>
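<p>For example, pointing <code class="language-plaintext highlighter-rouge">llm_config</code> at OpenAI directly might look like this (illustrative values; check your provider’s documentation for the correct base URL and model name):</p>

```yaml
llm_config:
  api_key: sk-YOUR_KEY_HERE            # provider API key
  base_url: https://api.openai.com/v1  # OpenAI-compatible endpoint
  model: gpt-4o-mini                   # model identifier at that provider
```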

<p><strong>Step 3: Save and start processing</strong></p>

<p>After updating the configuration, you can close the editor (changes are saved automatically) and start indexing. If you are using the example dataset, keep <code class="language-plaintext highlighter-rouge">./dataset</code> as-is; otherwise, point <code class="language-plaintext highlighter-rouge">start_loc</code> at the folder (or site) you want to ingest. When you click <strong>“Start Indexing”</strong>, NyRAG reads your input, chunks it into 1024-character segments, generates embeddings, feeds everything to Vespa Cloud, and shows progress in the terminal panel so you can see exactly what is happening.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/nyrag_10.png" alt="Processing progress with terminal logs" />
<strong>Description</strong>: Shows documents being processed with terminal logs displaying progress</p>

<hr />

<h2 id="chat-with-your-data">Chat with Your Data</h2>

<p>You can now start asking questions in the chat UI.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/nyrag_ui.png" alt="nyrag_ui" /></p>

<p>When you submit a query, NyRAG expands it into focused retrieval queries and sends them to Vespa. Vespa runs hybrid retrieval, combining BM25 keyword matching with vector similarity, and returns the most relevant chunks. Those chunks are packed into a compact context window and sent to the LLM, which generates an answer grounded entirely in your data.</p>
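<p>The “packed into a compact context window” step is essentially budget-aware concatenation. A hedged sketch (function name and prompt wording are illustrative, not NyRAG’s actual code):</p>

```python
# Hedged sketch of context packing: keep the best-ranked chunks that fit a
# simple character budget, then prepend an instruction for the LLM.
def build_prompt(question: str, chunks: list[str], budget_chars: int = 8000) -> str:
    context: list[str] = []
    used = 0
    for chunk in chunks:              # chunks arrive ranked best-first from Vespa
        if used + len(chunk) > budget_chars:
            break
        context.append(chunk)
        used += len(chunk)
    return (
        "Answer using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(context) +
        f"\n\nQuestion: {question}"
    )
```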

<p>A good way to sanity-check the setup is to start with a broad question like “What are the main topics in these documents?” and then follow up with something more specific to confirm the retrieved context makes sense.</p>

<p>At this point, you have a fully functional RAG application running on Vespa Cloud.</p>

<h3 id="improving-search-quality-with-query-profiles">Improving Search Quality with Query Profiles</h3>

<p>Want better search results? You can fine-tune how Vespa retrieves and ranks your documents using the Settings modal (⚙️ icon in the top right).</p>

<p><strong>Change query profiles:</strong> Open the ⚙️ <strong>Settings</strong> panel, choose a <strong>Query Profile</strong> from the dropdown, and click <strong>“Save”</strong>. The very next query you run will use the new profile.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/nyrag_settings_query_profiles.png" alt="Settings modal with query profile dropdown" /><br />
<strong>Description</strong>: Settings modal showing query profile selection dropdown with 4 available options</p>

<p><strong>What each profile does:</strong></p>
<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">hybrid</code></strong>: Fast hybrid search (BM25 + vector) with linear ranking</li>
  <li><strong><code class="language-plaintext highlighter-rouge">hybrid-with-gbdt</code></strong>: Same retrieval + advanced GBDT ranking (slower but best quality)</li>
  <li><strong><code class="language-plaintext highlighter-rouge">deepresearch</code></strong>: Exhaustive search with 10,000 retrieval targets (maximum recall)</li>
  <li><strong><code class="language-plaintext highlighter-rouge">deepresearch-with-gbdt</code></strong>: Exhaustive search + GBDT ranking (slowest, most thorough)</li>
</ul>

<p><strong>Pro tip</strong>: The quality difference between <code class="language-plaintext highlighter-rouge">hybrid</code> and <code class="language-plaintext highlighter-rouge">hybrid-with-gbdt</code> can be dramatic for complex queries. The GBDT model offers significantly better relevance at the cost of 2-3x higher latency. For research tasks where you need to find everything relevant, try <code class="language-plaintext highlighter-rouge">deepresearch</code> variants which cast a much wider net!</p>

<hr />

<h3 id="manage-your-data">Manage Your Data</h3>

<p>NyRAG also gives you simple tools for cleanup. Open the advanced menu (three-dot icon ⋮ in the top right) and you will find two cleanup actions. <strong>Clear Local Cache</strong> removes cached files for all projects on your machine, which is useful when you want to re-process from scratch locally. <strong>Clear Vespa Data</strong> deletes the indexed documents in Vespa for the project, which is useful when you want a clean index before re-feeding. Both actions ask for confirmation so you do not delete data by accident.</p>

<hr />

<h2 id="bonus-try-web-crawling-mode">Bonus: Try Web Crawling Mode</h2>

<p>In addition to local documents, NyRAG supports web crawling. By switching to the web_example project, you can point NyRAG at a website and have it crawl, extract, and index content automatically.</p>

<p><strong>Switch to web crawling mode:</strong>  Select <code class="language-plaintext highlighter-rouge">web_example (web)</code> from the dropdown at the top and open the configuration editor. If you are currently on the chat screen, open the three-dot menu (⋮) and choose <strong>“Edit Config”</strong> to bring the editor back. From there, update the same credential fields as you did for <code class="language-plaintext highlighter-rouge">doc_example</code>, then click <strong>“Start Indexing”</strong> to crawl and feed the site.</p>

<p><img src="/assets/2026-02-23-build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/nyrag_indexing_web_2.png" alt="Web crawling in progress" /> 
<strong>Description</strong>: Shows web crawling in progress with terminal logs displaying discovered URLs and processed pages</p>

<p><strong>Web Mode Features:</strong> Web mode discovers and follows links automatically, while still respecting <code class="language-plaintext highlighter-rouge">robots.txt</code> and crawl delays so you do not hammer a site. It also does smart content extraction to drop navigation and boilerplate, deduplicates very similar pages, and supports resume so you can continue a crawl after interruption.</p>
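<p>The crawl-politeness behavior described above can be illustrated with Python's standard-library robots.txt parser. This is a stand-alone sketch, not NyRAG's actual crawler code; the example <code class="language-plaintext highlighter-rouge">robots.txt</code> rules are made up.</p>

```python
# Illustrative sketch (not NyRAG's implementation): honoring robots.txt
# rules and crawl delays with Python's standard-library parser.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Instead of fetching over the network, parse an example robots.txt inline.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
])

print(rp.can_fetch("*", "https://example.com/docs/intro"))    # True
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.crawl_delay("*"))  # 2 (seconds to wait between requests)
```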

<p><strong>Example Use Cases:</strong> Web mode is a good fit for product documentation, knowledge bases, blog archives, help-center content, and technical wikis. In general, it works best on sites with consistent HTML structure and clean, text-heavy pages.</p>

<p><strong>Tips:</strong> Start small. Crawl a limited part of a site first so you can sanity-check what gets extracted and indexed, then expand. Use <code class="language-plaintext highlighter-rouge">exclude</code> patterns to skip sections you do not want (for example <code class="language-plaintext highlighter-rouge">/pricing</code> or <code class="language-plaintext highlighter-rouge">/sales/*</code>), and keep an eye on the terminal output panel so you can spot loops, unexpected URLs, or pages that fail to parse.</p>
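<p>To make the exclude-pattern idea concrete, here is a minimal sketch of filtering crawl URLs against glob-style patterns such as <code class="language-plaintext highlighter-rouge">/pricing</code> or <code class="language-plaintext highlighter-rouge">/sales/*</code>. The matching logic is a hypothetical stand-in, not NyRAG's actual implementation.</p>

```python
# Illustrative sketch (not NyRAG's implementation): dropping URLs whose
# path matches any glob-style exclude pattern.
from fnmatch import fnmatch
from urllib.parse import urlparse

def is_excluded(url, patterns):
    """Return True if the URL path matches any exclude pattern."""
    path = urlparse(url).path
    return any(fnmatch(path, pat) for pat in patterns)

excludes = ["/pricing", "/sales/*"]
urls = [
    "https://example.com/docs/intro",
    "https://example.com/pricing",
    "https://example.com/sales/contact",
]
kept = [u for u in urls if not is_excluded(u, excludes)]
print(kept)  # only the /docs page survives
```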

<hr />

<h2 id="troubleshooting">Troubleshooting</h2>

<p>Running into issues? We’ve got you covered! For detailed troubleshooting guides covering Vespa connection errors, LLM configuration, document processing, and more, see the <strong><a href="https://github.com/vespaai-playground/RAG-app-in-15min-ragblueprint#troubleshooting">Troubleshooting section</a></strong> in the main README.</p>

<p><strong>Quick help:</strong> If you get stuck, the fastest path is usually to ask in the <a href="http://slack.vespa.ai/">Vespa Slack</a> community, where people can help you interpret logs and query behavior. If you think you found a bug or want to request an improvement, open an issue in <a href="https://github.com/vespaai-playground/RAG-app-in-15min-ragblueprint/issues">GitHub Issues</a>. And when you want deeper background on schema, ranking, and deployment, the <a href="https://docs.vespa.ai/">Vespa Docs</a> are your go-to reference.</p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p><strong>Congratulations!</strong> You now have a working RAG app: a Vespa Cloud deployment that can retrieve high-quality context, and a small UI that lets you ingest data and chat with it.</p>

<p>Building a high-quality RAG system is never trivial. There are multiple moving parts to get right: the quality of the LLM, the size and management of its context window, and how effectively your retrieval system surfaces the most relevant information.</p>

<p>Thanks to the out-of-the-box Vespa RAG blueprint on Vespa Cloud, much of this complexity is handled for you. It comes with multiple ranking profiles, and its default hybrid retrieval setup combines <strong>vector similarity with BM25 text matching</strong>, ensuring your LLM sees the best possible context for every query.</p>

<p>Vespa Cloud doesn’t just make building RAG easier, it makes it <strong>scalable, fast, and reliable</strong>, giving you production-ready infrastructure, auto-scaling and observability without the headaches of self-hosting. Whether you’re experimenting with small datasets or scaling to millions of documents, Vespa Cloud provides the tools and flexibility to make your RAG project shine.</p>

<p>Want to dive deeper? Start with the <a href="https://docs.vespa.ai/en/learn/tutorials/rag-blueprint.html">RAG Blueprint Tutorial</a> for a thorough conceptual walkthrough. And remember the <a href="https://vespatalk.slack.com/">Vespa Slack community</a> is always there to help. Ask questions, share what you’ve built, or get advice on retrieval, ranking, and deployment strategies.</p>

<p>Ready to experience the power of Vespa Cloud for yourself? <a href="https://cloud.vespa.ai/">Sign up</a> today and <strong>start building high-quality RAG applications with ease</strong>!</p>

]]></content:encoded>
        <pubDate>Mon, 02 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/build-a-high-quality-rag-app-on-vespa-cloud-in-15-minutes/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        
      </item>
    
      <item>
        <title>Vespa Newsletter, February 2026</title>
        <description>Advances in Vespa&apos;s retrieval performance, flexibility, and developer productivity.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/logo/logo-pi.jpg" />
        
        <content:encoded><![CDATA[<p>Welcome to the latest edition of the Vespa newsletter. In the <a href="https://blog.vespa.ai/vespa-newsletter-december-2025/">previous update</a>, we introduced several new features and improvements, including Automated ANN Tuning, Accelerated Exact Vector Distance with Google Highway, Precise Chunk-Level Matching for Higher Retrieval Quality, Quantile Computation in Grouping for Instant Distribution Insights, and <a href="https://blog.vespa.ai/vespa-newsletter-december-2025/">more</a>.</p>

<p>This month, we’re announcing several updates focused on retrieval quality, ranking flexibility, and developer productivity. Each feature is designed to help engineering teams build faster, more accurate, and more maintainable retrieval and ranking systems, while giving businesses better relevance, lower operational overhead, and more predictable performance at scale.</p>

<p>Let’s dive into what’s new.</p>

<h2 id="product-updates">Product updates</h2>

<ul>
  <li>Announcing the Vespa.ai Playground</li>
  <li>The Vespa Kubernetes Operator</li>
  <li>Faster result rendering with CBOR</li>
  <li>Pyvespa 1.0 with improved HTTP performance</li>
  <li>Hybrid search relevance evaluation tool</li>
  <li>Configurable linguistics per field</li>
  <li><strong>“switch”</strong> operator in ranking expressions</li>
  <li>Vespa is now available on GCP Marketplace</li>
  <li>Feed data and run queries in the Vespa Console</li>
</ul>

<h3 id="announcing-the-vespaai-playground">Announcing the Vespa.ai Playground</h3>

<p>The Vespa Playground is a new GitHub space where we share projects, tools, and demos built on the Vespa platform. It’s a practical place to explore real examples for embeddings, model training, and feed connectors that you can clone, run, and build on your own.</p>

<p>These repos are ideal for experimentation, learning, and inspiration, though they aren’t officially supported product releases.</p>

<p><a href="https://github.com/vespaai-playground">Explore the Playground</a></p>

<h3 id="the-vespa-kubernetes-operator">The Vespa Kubernetes Operator</h3>

<p>The safest, most robust, and most cost-effective way to run Vespa is to deploy on Vespa Cloud, but for various reasons that’s not an option for everyone. For those who want to run Vespa securely at scale but can’t use Vespa Cloud, we have now released the Vespa Kubernetes Operator. This brings many Vespa Cloud features, such as out-of-the-box security, dynamic provisioning, autoscaling, and automated upgrades, to your own Kubernetes environments.</p>

<p>Read more in the <a href="https://docs.vespa.ai/en/operations/kubernetes/vespa-on-kubernetes.html">Kubernetes Operator documentation</a>.</p>

<h3 id="faster-result-rendering-with-cbor">Faster result rendering with CBOR</h3>

<p>Query result sets can be large, and increasingly so when the client is an LLM retrieving many chunks for model context. <a href="https://blog.vespa.ai/introducing-layered-ranking-for-rag-applications/">Layered ranking</a> is designed to address this by extracting the most relevant content. Still, in some cases, the total latency is dominated by the time it takes to send the query response. Compressing with gzip can help, but it is also CPU-intensive and slow. From Vespa 8.623.5, JSON response generation is over twice as fast as before.</p>

<p>Another new option in this release is to use the <a href="https://cbor.io/">CBOR</a> format for query results. CBOR is a binary format so it can be serialized faster and produces smaller payloads, especially when the result contains lots of numeric data. Read more in the <a href="https://docs.vespa.ai/en/reference/api/query.html#presentation.format">Query API reference</a> and query <a href="https://docs.vespa.ai/en/performance/practical-search-performance-guide.html#hits-and-summaries">performance guide</a>.</p>
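<p>As a concrete illustration, the response format can be chosen per query with the <code class="language-plaintext highlighter-rouge">presentation.format</code> parameter described in the Query API reference. This sketch only builds the query string; sending it to your endpoint and decoding the binary response (e.g. with a CBOR library such as <code class="language-plaintext highlighter-rouge">cbor2</code>) is left out.</p>

```python
# Sketch: requesting CBOR-encoded results through the "presentation.format"
# query parameter. Only the query string is constructed here.
from urllib.parse import urlencode

params = {
    "yql": "select * from sources * where userQuery()",
    "query": "vector search",
    "presentation.format": "cbor",  # default is JSON
}
query_string = urlencode(params)
print(query_string)
```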

<h3 id="pyvespa-10-with-improved-http-performance">Pyvespa 1.0 with improved HTTP performance</h3>

<p>We have released the first major version of Pyvespa! This release switches Pyvespa’s HTTP client from httpx to httpr, which gives large performance gains, especially for serializing and deserializing tensors, largely by taking advantage of the new CBOR serialization support in Vespa.</p>

<p>On preliminary benchmarks, we compared end-to-end latency for:</p>

<ol>
  <li>
    <p>Vespa 8.591.16 + Pyvespa v0.63.0 (using JSON)</p>
  </li>
  <li>
    <p>Vespa 8.634.24 + Pyvespa v1.0.0 (using CBOR)</p>
  </li>
</ol>

<p>The latter was ~4.9x faster when returning 400 hits with a 768-dim vector each. Performance gains will be smaller when not returning large result sets with tensors, but still significant. You may encounter different exceptions than before, but we strove not to change any user-facing APIs even though we bumped the major version.</p>

<p><a href="https://github.com/vespa-engine/pyvespa">Go to Pyvespa</a></p>

<h3 id="hybrid-search-relevance-evaluation-tool">Hybrid search relevance evaluation tool</h3>

<p>Hybrid search combines lexical and embedding-based search to get the best of both. One of the tasks you need to solve is picking an embedding model that provides a good quality vs. cost tradeoff for your use case. We have done a systematic evaluation of modern alternatives in <a href="https://blog.vespa.ai/embedding-tradeoffs-quantified/">this blog post</a>.</p>

<p>The code used to run these experiments is now merged into Pyvespa. You can use the VespaMTEBApp to evaluate embedding model performance on any task/benchmark compatible with the <a href="https://embeddings-benchmark.github.io/mteb/overview/available_benchmarks/">mteb-library</a>. See example usage from the <a href="https://github.com/vespa-engine/pyvespa/blob/master/tests/integration/test_integration_mtebevaluation.py">tests</a>.</p>

<h3 id="configurable-linguistics-per-field">Configurable linguistics per field</h3>

<p>Vespa now lets you specify linguistics profiles on fields to select specific linguistics processing in your Linguistics module. In Lucene Linguistics, linguistics profiles map to analyzer configuration, optionally in combination with a specific language.</p>

<p>For example, you can define a Lucene analyzer like this in services.xml:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;item key="profile=whitespaceLowercase;language=en"&gt;
  &lt;tokenizer&gt;
    &lt;name&gt;whitespace&lt;/name&gt;
  &lt;/tokenizer&gt;
  &lt;tokenFilters&gt;
    &lt;item&gt;
      &lt;name&gt;lowercase&lt;/name&gt;
    &lt;/item&gt;
  &lt;/tokenFilters&gt;
&lt;/item&gt;
</code></pre></div></div>
<p>And use it in the schema, under any field’s definition, like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>field title type string {
  indexing: summary | index
  linguistics {
    profile: whitespaceLowercase
  }
}
</code></pre></div></div>
<p>By default, the linguistics profile is applied both when processing the field’s text and when processing the query text searching it, but you can also specify a different linguistics profile on the query side, which is useful for e.g. synonym query expansion.</p>

<p>We’ve added a sample application demonstrating how to use multiple Lucene linguistics <a href="https://github.com/vespa-engine/sample-apps/tree/master/examples/lucene-linguistics/multiple-profiles">profiles</a> across multiple fields and updated the Vespa <a href="https://docs.vespa.ai/en/linguistics/linguistics.html">linguistics documentation</a> with usage examples.</p>

<h3 id="new-switch-operator-in-ranking-expressions">New “switch” operator in ranking expressions</h3>

<p>We have added a “switch” function in ranking expressions as a clearer, more maintainable alternative to deeply nested if() clauses, making complex ranking easier to read, debug, and evolve.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>switch (attribute(category)) {
  case "restaurant": myRestaurantFunction(),
  case "hotel": myHotelFunction(),
  default: myDefaultFunction()
}
</code></pre></div></div>

<p><a href="https://docs.vespa.ai/en/ranking/ranking-expressions-features.html#the-switch-function">Learn more</a></p>

<h3 id="vespa-is-now-available-on-gcp-marketplace">Vespa is now available on GCP Marketplace</h3>

<p>Vespa Cloud is now listed on the GCP Marketplace, making it easier to deploy and manage Vespa using native Google Cloud billing and procurement. Vespa Cloud is already available on <a href="https://aws.amazon.com/marketplace/pp/prodview-5pkxkencasnoo?sr=0-1&amp;ref_=beagle&amp;applicationId=AWSMPContessa">AWS Marketplace</a>.</p>

<p><a href="https://console.cloud.google.com/marketplace/product/gcp-billing-marketplace/vespa-cloud">See details</a></p>

<h3 id="feed-data-and-run-queries-in-the-vespa-console">Feed data and run queries in the Vespa Console</h3>

<p>The onboarding experience is now even smoother for new Vespa Cloud users. When you follow the getting started guide and deploy a sample app from the browser, you can immediately feed data and run queries directly in the browser. This makes it easy to try your own data and see how it behaves in Vespa.</p>

<p>We also provide examples showing how to do the same using pyvespa, the Vespa CLI, or curl.</p>

<p><img src="/assets/2026-02-18-vespa-newsletter-february-2026/new-onboarding-console.png" alt="New onboarding experience" /></p>

<p><a href="https://login.console.vespa-cloud.com/u/signup/identifier?state=hKFo2SBsN1NBOERhNnRCbDhpajdqTnhYSTlzUlltUjNoUG5mZaFur3VuaXZlcnNhbC1sb2dpbqN0aWTZIERwRkg4NkVwRHg2aFk1Rjg0ZHZrYmdBZ0pFc1lTb29Io2NpZNkgVk92OGViclhwcEdBTnVpWWZHOWhKWk94MVM5T0dhTTQ">Try it Free</a></p>

<h2 id="new-content-and-learning-resources">New content and learning resources</h2>

<p>We published several new articles and resources since our last newsletter to help teams get more out of Vespa and stay ahead of new developments in search, RAG, and large-scale AI.</p>

<p><strong>Examples and notebooks:</strong></p>

<ul>
  <li><a href="http://playground.vespa.ai">playground.vespa.ai</a></li>
</ul>

<p><strong>Videos, webinars, and podcasts:</strong></p>

<ul>
  <li><a href="https://em360tech.com/podcasts/how-scale-ai-digital-commerce-effectively?utm_content=520974566&amp;utm_medium=social&amp;utm_source=linkedin&amp;hss_channel=lcp-100705136">How To Scale AI in Digital Commerce Effectively</a></li>
  <li><a href="https://vespa.ai/resource/vespa-now-year-in-review/">2025 Year in Review</a></li>
</ul>

<p><strong>Blogs and ebooks:</strong></p>

<ul>
  <li><a href="https://blog.vespa.ai/agentic-ai-powered-sales-for-developers-with-vespa/">Clarm: Agentic AI-powered Sales for Developers with Vespa Cloud</a></li>
  <li><a href="https://blog.vespa.ai/embedding-tradeoffs-quantified/">Embedding Tradeoffs, Quantified</a></li>
  <li><a href="https://blog.vespa.ai/enterpise-ai-search-vs-the-real-needs-of-customer-facing-apps/">Enterprise AI Search vs. the Real Needs of Customer-Facing Apps</a></li>
  <li><a href="https://blog.vespa.ai/eliminating-the-precision-latency-trade-off-in-large-scale-rag/">Eliminating the Precision–Latency Trade-Off in Large-Scale RAG</a></li>
  <li><a href="https://blog.vespa.ai/how-tensors-are-changing-search-in-life-sciences/">How Tensors Are Changing Search in Life Sciences</a></li>
  <li><a href="https://blog.vespa.ai/the-search-api-reset-incumbents-retreat-innovators-step-up/">The Search API Reset: Incumbents Retreat, Innovators Step Up</a></li>
  <li><a href="https://blog.vespa.ai/why-ai-search-platforms-are-gaining-attention/">Why AI Search Platforms Are Gaining Attention</a></li>
  <li><a href="https://blog.vespa.ai/why-life-sciences-ai-is-a-search-problem-5-of-5/">Why Life Sciences AI Is a Search Problem (Part 5 of 5)</a></li>
  <li><a href="https://blog.vespa.ai/why-life-sciences-ai-is-a-search-problem-4-of-5/">Why Life Sciences AI Is a Search Problem (Part 4 of 5)</a></li>
</ul>

<h3 id="upcoming-events">Upcoming Events</h3>

<p><img src="/assets/2026-02-18-vespa-newsletter-february-2026/maven.jpeg" alt="Personalized Relevance with VLMs and Sparse Vectors: February 17, 11:30am ET" />
<strong>Lightning Lesson: Personalized Relevance with VLMs and Sparse Vectors: February 17, 11:30am ET</strong></p>
<ul>
  <li>Intro to sparse vectors and tensors for efficient data handling</li>
  <li>Using Vision-Language Models (VLMs) to extract high quality and nuanced features from images</li>
  <li>Leveraging these features in sparse representations for hyper-personalized search &amp; recommendations</li>
</ul>

<p><a href="https://maven.com/p/b5ee84/personalized-relevance-with-vl-ms-and-sparse-vectors">Register Now</a></p>

<hr />

<p><img src="/assets/2026-02-18-vespa-newsletter-february-2026/eCommerce-Webinar-Series.png" alt="e-commerce-webinar-series" />
<strong>February 18: The Zero Results Problem in eCommerce</strong></p>
<ul>
  <li>🔗 10am CET (EMEA): <a href="https://www.airmeet.com/e/f4f6c070-c094-11f0-9be4-375c53bcf15c?utm_source=Newsletter&amp;utm_campaign=Zero%20results%20EMEA">Save your spot</a></li>
  <li>🔗 1pm ET (Americas): <a href="https://www.airmeet.com/e/305ace80-c3c0-11f0-9be4-375c53bcf15c?utm_source=Newsletter&amp;utm_campaign=Zero%20results%20(AMER)">Save your spot</a></li>
</ul>

<p><strong>March 11: The Relevance Problem in eCommerce</strong></p>
<ul>
  <li>🔗 10am CET (EMEA): <a href="https://www.airmeet.com/e/70338df0-c5fd-11f0-831c-01bcfd385865?utm_source=Newsletter&amp;utm_campaign=Relevance%20Problem%20EMEA">Save your spot</a></li>
  <li>🔗 1pm ET (Americas): <a href="https://www.airmeet.com/e/5bf695d0-c5fd-11f0-bb1f-e79dc2111266?utm_source=Newsletter&amp;utm_campaign=Relevance%20Problem%20AMER">Save your spot</a></li>
</ul>

<hr />

<p><img src="/assets/2026-02-18-vespa-newsletter-february-2026/Vespa-Now-Q1-Product-Update.png" alt="product-update" />
<strong>March 10: Vespa Q1 Product Update</strong></p>
<ul>
  <li>🔗 10am CET (EMEA): <a href="https://www.airmeet.com/e/79245020-f186-11f0-ace7-c7ef52349391?utm_source=Newsletter&amp;utm_campaign=Q1%20Product%20Update">Save your spot</a></li>
  <li>🔗 1pm ET (Americas): <a href="https://www.airmeet.com/e/3d23e680-f186-11f0-b12c-b1c5402490b0?utm_source=Newsletter&amp;utm_campaign=Q1%20Product%20update">Save your spot</a></li>
</ul>

<hr />
<p>👉 <a href="https://www.linkedin.com/company/vespa-ai/">Follow us on LinkedIn</a> to stay in the loop on upcoming events, blog posts, and announcements.</p>

<hr />

<p>Thanks for joining us in exploring the frontiers of AI with Vespa. Ready to take your projects to the next level? <a href="https://vespa.ai/free-trial/">Deploy your application for free</a> on Vespa Cloud today.</p>

]]></content:encoded>
        <pubDate>Mon, 16 Feb 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/vespa-newsletter-february-2026/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/vespa-newsletter-february-2026/</guid>
        
        
        <category>newsletter</category>
        
      </item>
    
      <item>
        <title>Nexla + Vespa, The Power Duo for AI-Ready Data Pipelines</title>
        <description>Nexla solves data readiness. Vespa solves intelligence and precision at scale. Together, they give teams a clean, practical path from raw enterprise data to real-time AI applications.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/images/New Partnership Nexla.png" />
        
        <content:encoded><![CDATA[<h3 id="partner-spotlight-nexla">Partner Spotlight: Nexla</h3>

<p>AI is transforming quickly. What started with Q&amp;A chatbots has already evolved into deep research applications and, now, autonomous AI agents. Vespa is proud to be at the center of this shift, enabling some of the most proficient adopters of AI, such as Perplexity. To help organizations maximize the benefits of Vespa, we’re building a robust partner ecosystem. These partners help bring Vespa’s AI-native capabilities into real-world deployments across industries.</p>

<p><strong>Meet the innovators shaping the future of AI. Today’s spotlight: Nexla</strong></p>

<h2 id="nexla--vespaai-the-power-duo-for-ai-ready-data-pipelines">Nexla + Vespa.ai: The Power Duo for AI-Ready Data Pipelines</h2>

<p>When AI systems fall short, it’s rarely the model’s fault. It’s the messy reality of data spread across systems and never quite staying in sync. That’s why Nexla and Vespa partnered.</p>

<p><a href="https://nexla.com/">Nexla</a> makes data usable.</p>

<p><a href="http://vespa.ai">Vespa</a> makes data intelligent at scale.</p>

<p>Together, they turn messy, distributed enterprise data into real-time AI search, recommendation, and RAG systems, without months of custom code gluing things together.</p>

<h2 id="nexla-making-enterprise-data-usable">Nexla: Making Enterprise Data Usable</h2>

<p>Nexla is an enterprise-grade, AI-powered data integration <a href="https://nexla.com/nexla-platform-overview">platform</a> that turns raw data from any source into production-ready data products. It provides a declarative, no-code way to move, transform, and validate data across ETL/ELT, reverse ETL, streaming, APIs, and RAG pipelines.</p>

<p>Think of Nexla as the layer that answers: “How do we reliably get the right data, in the right shape, to the systems that need it?”</p>

<p>Core capabilities:</p>

<ul>
  <li>
    <p><strong>500+ Bidirectional <a href="https://nexla.com/connectors/">Connectors</a>:</strong> Pull data from databases, APIs, cloud storage, SaaS apps, and data warehouses, including systems like Salesforce, Snowflake, and Amazon S3.</p>
  </li>
  <li>
    <p><strong>Metadata Intelligence:</strong> Nexla automatically scans sources and generates <a href="https://nexla.com/nexsets">Nexsets</a>, virtual, ready-to-use data products with schemas, samples, and validation rules.
Example: If a price field suddenly switches from numeric to string, Nexla detects it before bad data reaches production search.</p>
  </li>
  <li>
    <p><strong><a href="https://nexla.com/blog/introducing-express-conversational-data-platform/">Express</a> (conversational pipelines):</strong> A conversational AI interface where you can simply describe what you need.
Example: You can say, “Pull customer data from Salesforce and merge with Google Analytics,” and it builds the pipeline for you.</p>
  </li>
  <li>
    <p><strong>Universal <a href="https://nexla.com/data-integration/">integration</a> styles:</strong> Supports ELT, ETL, CDC, R-ETL, streaming, API integration, and FTP in a single platform.</p>
  </li>
</ul>

<p>Nexla processes over <strong>1 trillion records monthly</strong> for companies like DoorDash, LinkedIn, Carrier, and LiveRamp.</p>

<h2 id="vespa-where-retrieval-becomes-reasoning">Vespa: Where Retrieval Becomes Reasoning</h2>

<p>Vespa is a production-grade AI search platform that combines distributed text search, vector search, structured filtering, and machine-learned ranking in a single system.</p>

<p>Think of Vespa as the engine that answers: “Given all this data, how do we retrieve, rank, and reason over it in real time?”</p>

<p>It powers demanding applications like Perplexity and supports search, recommendations, personalization, and RAG at massive scale.</p>

<p>Core capabilities:</p>

<ul>
  <li>
    <p><strong>Unified AI Search and Retrieval:</strong> Vespa natively combines vector and <a href="https://vespa.ai/tensor-formalism/">tensor search</a> for semantic retrieval, full-text search for precise keyword matching, and structured filtering on attributes like categories, prices, and dates to enable richer, contextual search without stitching multiple systems together.</p>
  </li>
  <li>
    <p><strong>Real-time Retrieval and Inference at Scale:</strong> Rather than separating indexing, ranking, and inference across multiple systems, Vespa performs real-time machine-learned ranking and model inference where the data lives. This means you can serve fresh, personalized results with predictable sub-100 ms latency even for large datasets.</p>
  </li>
  <li>
    <p><strong>Multi-Phase Ranking and Custom Logic:</strong> Vespa lets you embed custom ranking logic, including ML models like XGBoost, directly into your search pipeline using ONNX. You can combine relevance signals, business rules, and semantic vectors in multi-stage ranking to fine-tune which results surface first.</p>
  </li>
  <li>
    <p><strong>Massive Scalability with High Throughput:</strong> Designed for real-world, high-traffic applications, Vespa can scale horizontally across clusters, handling billions of documents with sub-100ms query latency and up to 100k writes per second per node.</p>
  </li>
  <li>
    <p><strong>Multi-Vector and Multi-Modal Retrieval:</strong> Vespa natively handles multiple vectors per document, with support for token-level embeddings, ColPali-based visual document retrieval, and <a href="https://vespa.ai/tensor-formalism/">tensor-based computations</a> for precise, cross-modal relevance and ranking.</p>
  </li>
</ul>

<p>GigaOm recognized Vespa as a <strong><a href="https://content.vespa.ai/gigaom-report-v3-2025?_gl=1*1ep8wq0*_gcl_aw*R0NMLjE3NjQ4Nzg2NjIuQ2owS0NRaUFfOFRKQmhETkFSSXNBUFg1cXhRbHdEbHgtMndtQjdqRS1aYzhVWHRBSW4zTzZ2eEVrelNYTTdLUkNXSkZCTGpISml4MzNSZ2FBbkRxRUFMd193Y0I.*_gcl_au*MjkzNDEwODQ3LjE3NjUyODY2NTk.">leader</a> in vector databases</strong> for two consecutive years, noting its performance advantages over alternatives like Elasticsearch, up to <strong><a href="https://content.vespa.ai/vespa-vs-elasticsearch-performance-comparison">12.9X higher throughput</a> per CPU core for vector searches</strong>.</p>

<h2 id="how-nexla-and-vespa-work-together">How Nexla and Vespa Work Together</h2>

<p>The Nexla-Vespa partnership removes one of the hardest parts of AI systems: getting clean, well-modeled data into a high-performance retrieval engine, continuously.</p>

<p>Nexla recently launched a Vespa connector that makes data integration with Vespa seamless. The integration includes:</p>

<p><strong><a href="https://docs.nexla.com/user-guides/connectors/vespa_api/overview">Vespa Connector</a> in Nexla:</strong>
Handles all data piping from sources like Amazon S3, PostgreSQL, Pinecone, Snowflake, and others directly into Vespa:
<img src="/assets/images/nexla1.png" alt="" /></p>

<p><strong>Vespa Nexla Plugin CLI:</strong> Automatically generates draft Vespa application packages (including schema files) directly from a Nexset, eliminating manual configuration:
<img src="/assets/images/nexla2.png" alt="" /></p>

<p>This means you can move data from S3 to Vespa, migrate from Pinecone to Vespa, or sync <a href="https://nexla.com/demo-center/move-data-from-postgresql-to-vespa-ai-effortlessly/">PostgreSQL to Vespa</a>, all without writing a single line of code.</p>

<h2 id="when-nexla-clients-should-use-vespa">When Nexla Clients Should Use Vespa</h2>

<p>You’re a Nexla client. Use Vespa when you need:</p>

<p><strong>Advanced AI search and RAG applications:</strong>
If you’re building intelligent search, recommendation systems, or RAG applications that require hybrid search (combining semantic vector search with keyword matching and metadata filtering), Vespa is purpose-built for this. Nexla gets your data into Vespa, while Vespa delivers production-grade AI search with machine-learned ranking.</p>

<p><strong>Real-time, high-scale query performance:</strong>
When you need to serve thousands of queries per second across billions of documents with sub-100ms latency, Vespa’s distributed architecture scales horizontally without compromising quality. Nexla ensures your data flows continuously into Vespa with incremental updates and CDC support.</p>

<p><strong>Complex ranking and inference:</strong>
If your use case requires multi-phase ranking, custom ML models, or LLM integration at query time, Vespa executes these operations locally where data lives, avoiding costly data movement. Nexla prepares and transforms your data into the exact schema Vespa needs.</p>

<p><strong>Cost efficiency at scale:</strong>
Vespa delivers 5X infrastructure cost savings compared to alternatives like Elasticsearch while handling vector, lexical, and hybrid queries. Nexla minimizes integration costs by automating pipeline creation and schema management.</p>

<h2 id="when-vespa-clients-should-use-nexla">When Vespa Clients Should Use Nexla</h2>

<p>You’re a Vespa client. Use Nexla when you need:</p>

<p><strong>Multi-source data consolidation:</strong>
Vespa is your search and inference engine, but data lives everywhere, S3 buckets, PostgreSQL databases, Snowflake warehouses, Salesforce CRMs, APIs, and files. Nexla connects to 500+ sources with bidirectional connectors and consolidates data into Vespa without custom ETL scripts.</p>

<p><strong>Automated schema generation and management:</strong>
Instead of manually writing Vespa schema files and managing schema evolution, Nexla’s Plugin CLI auto-generates schemas from your Nexsets. As source schemas change, Nexla’s metadata intelligence detects changes and propagates them downstream automatically.</p>

<p><strong>Data transformation and enrichment:</strong>
Before data hits Vespa, it often needs cleaning, filtering, enrichment, or format conversion. Nexla provides a no-code transformation library and supports custom SQL, Python, or JavaScript, all without maintaining separate ETL infrastructure.</p>

<p><strong>Vector database migration:</strong>
Moving from Pinecone, Weaviate, or another vector database to Vespa? Nexla handles the migration with zero code, extracting records, transforming data to match Vespa’s schema, and syncing documents continuously.</p>

<p><strong>Data quality and monitoring:</strong>
Nexla continuously monitors data flows with built-in validation rules, error handling, and automated alerts. When data quality issues arise, Nexla quarantines bad records and provides audit trails, ensuring Vespa always receives clean, trustworthy data.</p>

<p><strong>Real-time and streaming pipelines:</strong>
Vespa supports real-time updates, but getting real-time data from streaming sources (Kafka, APIs, databases with CDC) requires integration logic. Nexla handles streaming, batch, and hybrid integration styles, optimizing throughput and latency for each source type.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Nexla solves <strong>data readiness</strong>.</p>

<p>Vespa solves <strong>intelligence and precision at scale</strong>.</p>

<p>Together, they give teams a clean, practical path from raw enterprise data to real-time AI applications. <a href="http://vespa.ai">Vespa</a> gives you production-grade vector search, hybrid retrieval, and RAG capabilities at any scale. <a href="http://nexla.com">Nexla</a> eliminates months of pipeline development and makes multi-source data flows conversational.</p>

<p><strong>Ready to explore?</strong></p>

<p>Start at <a href="http://express.dev">express.dev</a> for conversational pipeline building, or explore the <a href="https://docs.nexla.com/user-guides/connectors/vespa_api/overview">Vespa connector</a> in Nexla’s platform to see how quickly your data can power real AI applications.</p>
]]></content:encoded>
        <pubDate>Mon, 16 Feb 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/vespa-nexla-partnership/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/vespa-nexla-partnership/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        
      </item>
    
      <item>
        <title>Clarm: Agentic AI-powered Sales for Developers with Vespa Cloud</title>
        <description>Agentic AI-powered Sales for Developers, built on Vespa</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-01-16-agentic-ai-powered-sales-for-developers-with-vespa/clarmcase.jpg" />
        
        <content:encoded><![CDATA[<!--
|--------------------------|--------------|
| **Industry:**            | Technology   |
| **Founded:**             | 2024         |
| **Backing:**             | Y Combinator |

Vespa Cloud → Vespa Enclave (AWS) 
-->

<h2 id="overview">Overview</h2>
<p>Clarm helps open source software companies <a href="https://www.clarm.com/blog/articles/convert-github-stars-to-revenue?utm_source=vespa&amp;utm_campaign=clarm_case_study">convert GitHub stars into revenue</a> through AI-powered lead generation, content production, and developer support automation. When building their platform, Clarm needed a search engine that could power accurate, zero-hallucination AI responses while handling complex enrichment across millions of GitHub data points. They chose <a href="http://vespa.ai">Vespa</a> for its unified text, vector, and structured search capabilities and were able to deploy to production in under a day.</p>

<h2 id="the-problem-software--oss-companies-struggle-to-monetize">The Problem: Software / OSS Companies Struggle to Monetize</h2>
<p>“Most OSS founders can’t get attention for their software initially. They’re <a href="https://www.clarm.com/blog/articles/developer-growth-engine-automating-sales-marketing?utm_source=vespa&amp;utm_campaign=clarm_case_study">so focused on building the product that marketing, SEO, and content creation get dropped</a>. We built Clarm to automate all the growth work founders drop so they can focus on git commits,” explains Marcus Storm-Mollard, founder and CEO of Clarm.</p>

<p>The challenge is fundamental: 99% of successful open source is funded by businesses paying for solutions, but early-stage OSS companies lack the infrastructure to identify, engage, and convert those potential paying customers. They have thousands of GitHub stars but no clear path to revenue.</p>

<p>Clarm addresses this through three product pillars:</p>
<ol>
  <li>
    <p><strong>Lead Generation &amp; Prospecting:</strong> The killer feature. Clarm takes repo data from customers and competitors, enriches it with signals from website visits, commits, issues, and community engagement, then ranks and identifies good-fit prospects and potential enterprise buyers.</p>
  </li>
  <li>
    <p><strong>Marketing &amp; Content Production:</strong> Automated content creation from commits, PRs, and codebase analysis, helping OSS companies maintain consistent technical marketing.</p>
  </li>
  <li>
    <p><strong>Developer Support Automation:</strong> AI-powered support across Discord, Slack, GitHub Issues, and websites, with deep integrations and analytics for scaling customer success.</p>
  </li>
</ol>

<h2 id="the-search-challenge">The Search Challenge</h2>
<p>At the core of all three pillars sits a critical technical requirement: accurate, explainable search and retrieval.</p>

<blockquote>
  <p>“We realized early that search, not generation, was the real problem to solve. Generating LLM answers isn’t hard. Finding the right information to base them on is everything,” Marcus notes.</p>
</blockquote>

<p>Clarm needed a search engine that could:</p>
<ul>
  <li>Handle hybrid retrieval (combining text search, vector embeddings, and structured filters)</li>
  <li>Power zero-hallucination AI responses grounded in verifiable context</li>
  <li>Process and rank millions of GitHub data points in real-time</li>
  <li>Support complex multi-signal enrichment for lead scoring</li>
  <li>Scale cost-effectively on a startup budget</li>
</ul>

<p><a href="https://blog.vespa.ai/why-search-platform-is-better-than-vector-database/">Traditional vector databases</a> like Supabase or search engines like <a href="https://blog.vespa.ai/modernizing-elasticsearch-with-vespa/">Elasticsearch</a> couldn’t deliver the unified, production-grade retrieval required for Clarm’s zero-hallucination architecture.</p>

<h2 id="the-solution-vespas-production-grade-hybrid-search">The Solution: Vespa’s Production-Grade Hybrid Search</h2>

<p>Marcus discovered Vespa after researching how companies like <a href="https://blog.vespa.ai/perplexity-builds-ai-search-at-scale-on-vespa-ai/">Perplexity</a> and <a href="https://blog.vespa.ai/using-vespa-cloud-resource-suggestions-to-optimize-costs/">Onyx</a> built their advanced retrieval systems.</p>

<blockquote>
  <p>“We really liked that Vespa started as a search engine and evolved into a vector-based system.
It made so much sense for what we were building.
Vespa’s ranking and tensoring are built in, so we know our results are accurate and relevant right out of the box,” Marcus explains.</p>
</blockquote>

<h4 id="rapid-deployment-less-than-one-day-to-production">Rapid Deployment: Less Than One Day to Production</h4>
<p>Clarm began experimenting with Vespa’s Docker image for local development, then transitioned to Vespa Cloud for production deployment during their Y Combinator batch.</p>

<blockquote>
  <p>“It took about half a day to set up how we wanted it. That speed of onboarding made a huge impact during YC. We just deployed it, and it worked,” Marcus recalls.</p>
</blockquote>

<p>The quick deployment was critical. Clarm was racing toward Demo Day and couldn’t afford weeks of infrastructure setup. Vespa’s unified approach eliminated the complexity of stitching together multiple systems for text, vector, and structured search.</p>

<h4 id="key-vespa-capabilities-powering-clarm">Key Vespa Capabilities Powering Clarm</h4>

<ul>
  <li>Unified Retrieval Pipeline
    <ul>
      <li>Single query endpoint combining text search, vector similarity, and structured filters - no need to orchestrate multiple databases or services.</li>
    </ul>
  </li>
  <li>Built-in <a href="https://docs.vespa.ai/en/ranking/phased-ranking.html#">Ranking</a> &amp; <a href="https://docs.vespa.ai/en/ranking/tensor-user-guide.html#">Tensor Operations</a>
    <ul>
      <li>Native support for complex ranking models and tensor operations means Clarm can implement sophisticated lead scoring without custom ranking layers.</li>
    </ul>
  </li>
  <li><a href="https://143590857.fs1.hubspotusercontent-eu1.net/hubfs/143590857/PDF-reports/Scaling-Smarter_-Vespas-Approach-to-High-Performance-Data-Management-3.pdf?hsCtaAttrib=232558642374">Real-Time</a> Indexing
    <ul>
      <li>GitHub events, user interactions, and enrichment signals are instantly searchable, enabling live lead intelligence and up-to-date AI responses.</li>
    </ul>
  </li>
  <li>Scalable Cloud Deployment
    <ul>
      <li><a href="https://vespa.ai/vespa-content/uploads/2025/07/Autoscaling-with-Vespa.pdf">Automatic scaling</a> and high availability handled by Vespa Cloud, allowing Clarm’s two-person engineering team to focus on product features instead of infrastructure operations.</li>
    </ul>
  </li>
  <li>Developer-Friendly <a href="https://docs.vespa.ai/en/learn/overview.html">Architecture</a>
    <ul>
      <li>Docker-based local development, straightforward schema design, and comprehensive documentation enabled rapid prototyping and iteration.</li>
    </ul>
  </li>
</ul>

<h2 id="the-results">The Results</h2>
<p>Clarm’s decision to build on Vespa Cloud delivered immediate impact:</p>
<ul>
  <li><strong>&lt;1 Day to Production:</strong> From prototype to live search infrastructure deployed during YC</li>
  <li><strong>Zero-Hallucination Architecture:</strong> Accurate retrieval enabling trustworthy AI responses grounded in verifiable data</li>
  <li><strong>High-Quality Lead Intelligence:</strong> Sophisticated ranking of GitHub data points across 50K+ collective stars from customers like <a href="https://better-auth.com/?utm_source=vespa&amp;utm_campaign=clarm_case_study">Better Auth</a> (23.3K stars) and <a href="https://cua.ai/?utm_source=vespa&amp;utm_campaign=clarm_case_study">Cua</a> (11.3K stars)</li>
  <li><strong>Exceptional Support:</strong> Direct collaboration with Vespa’s engineering team throughout development</li>
</ul>

<blockquote>
  <p>“The setup was easy, the support from the Vespa team was incredible, and everything just worked. We didn’t need to look anywhere else,” Marcus emphasizes.</p>
</blockquote>

<h4 id="customer-success-github-stars-becoming-revenue">Customer Success: <a href="https://www.clarm.com/blog/articles/convert-github-stars-to-revenue?utm_source=vespa&amp;utm_campaign=clarm_case_study">GitHub Stars Becoming Revenue</a></h4>
<p>Clarm’s customers are seeing measurable results from the AI-powered lead generation platform:</p>
<ul>
  <li><strong>Better Auth:</strong> Grew from 8K to 23.3K GitHub stars in 3 months with Clarm’s lead gen and engagement automation</li>
  <li><strong>c/ua:</strong> Scaled from 5K to 11.3K stars while identifying and converting enterprise prospects</li>
  <li><strong><a href="https://www.skyvern.com/?utm_source=vespa&amp;utm_campaign=clarm_case_study">Skyvern AI:</a></strong> after struggling with support volume after hitting 19K stars, reduced support workload by 94% with Clarm across GitHub, Discord, and Slack</li>
  <li><strong>Engagement Depth:</strong> Developers “pair programming” with Clarm’s AI agents send thousands of queries a day, with sessions lasting up to 22 hours</li>
</ul>

<h4 id="whats-next-building-the-future-of-oss-monetization">What’s Next: Building the Future of OSS Monetization</h4>
<p>Clarm represents a <a href="https://www.clarm.com/blog/articles/best-developer-growth-automation-tools-for-software-products-in-2025?utm_source=vespa&amp;utm_campaign=clarm_case_study">new category of growth infrastructure</a> built specifically for software and open source companies. By combining Vespa’s production-grade retrieval with their own zero-hallucination agent framework, Clarm is proving that AI-powered sales and marketing can be trustworthy, explainable, and grounded in truth.</p>

<blockquote>
  <p>“We’re focused on proving product value and retaining customers right now. Everything depends on us growing our customers’ MRR and showing software and OSS companies they can build sustainable businesses,” Marcus shares.</p>
</blockquote>

<p>That focus is reflected in Clarm’s positioning: “You build awesome software. Now build a business.” It resonates with software founders who want to monetize without compromising their community values. By recognizing that a vast majority of successful open source is ultimately funded by businesses paying for solutions, Clarm offers a clear path forward: free software for the community, paid solutions for enterprises.</p>

<h2 id="conclusion">Conclusion</h2>
<p>Clarm’s architecture reinforces a lesson many teams learn the hard way: LLMs are only as reliable as the retrieval systems behind them. By treating retrieval as a first-class system, built on Vespa Cloud, Clarm unified text search, vector similarity, structured filtering, and ranking into a single production-grade platform, eliminating the fragility and guesswork common in vector-only stacks.</p>

<p>The result is an agentic AI platform that can reason over live data, explain its outputs, and scale predictably without stitching together multiple databases or post-hoc ranking layers. This foundation enabled a small team to move from prototype to production in days, operate across millions of GitHub signals, and help open source companies turn community adoption into sustainable revenue.</p>

<p>More importantly, Clarm’s success offers a blueprint for any organization building serious AI applications: when retrieval is reliable, ranking is expressive, and data is always fresh, AI systems become trustworthy enough to power real business outcomes. Clarm is building the future of OSS monetization, and Vespa is the retrieval engine making it possible.</p>

]]></content:encoded>
        <pubDate>Mon, 19 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/agentic-ai-powered-sales-for-developers-with-vespa/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/agentic-ai-powered-sales-for-developers-with-vespa/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        
      </item>
    
      <item>
        <title>Embedding Tradeoffs, Quantified</title>
        <description>The embedding strategy you choose has a major impact on cost, quality, and latency. We ran a bunch of experiments to help you make better and more informed tradeoffs.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2026-01-14-embedding-tradeoffs-quantified/control-dashboard.png" />
        
        <content:encoded><![CDATA[<p>Most Vespa users run hybrid search - combining BM25 (and/or other lexical features) with semantic vectors. But which embedding model should you use? And how do you balance cost, quality, and latency as you scale?</p>

<p>The typical approach: open the <a href="https://huggingface.co/spaces/mteb/leaderboard">MTEB leaderboard</a>, find the “Retrieval” column, sort descending, pick something that fits your size budget. Done, right?</p>

<p>Not quite. MTEB doesn’t tell you:</p>

<ul>
  <li>How fast is inference on your actual hardware?</li>
  <li>What happens when you quantize the model weights?</li>
  <li>How much quality do you lose with binary vectors?</li>
  <li>Does this model even work well in a hybrid setup?</li>
</ul>

<p>So we ran the experiments ourselves. We picked models from the MTEB Retrieval leaderboard with these criteria:</p>

<ul>
  <li>Under 500M parameters (practical for most deployments)</li>
  <li>Open license</li>
  <li>ONNX weights available (required for Vespa)</li>
  <li>At least 10k downloads in the last month (actually used in production)</li>
</ul>

<p>For each model, we benchmarked across:</p>

<ul>
  <li><strong>Model quantizations</strong> (FP32, FP16, INT8)</li>
  <li><strong>Vector precisions</strong> (float, bfloat16, binary)</li>
  <li><strong>Matryoshka dimensions</strong> (for models that support it)</li>
  <li><strong>Real hardware</strong> (Graviton3, Graviton4, T4 GPU)</li>
  <li><strong>Hybrid retrieval</strong> (semantic, RRF, and score normalization methods)</li>
</ul>

<p><strong>Spoiler:</strong> We found some <em>really</em> attractive tradeoffs - 32x memory reduction, 4x faster inference, with nearly identical quality.</p>

<h2 id="what-mteb-doesnt-show-you">What MTEB doesn’t show you</h2>

<h3 id="model-quantization">Model quantization</h3>

<p>Vespa uses <a href="https://onnxruntime.ai/">ONNX runtime</a> for <a href="https://docs.vespa.ai/en/embedding.html">embedding inference</a>. Most models on HuggingFace ship with multiple ONNX variants - here’s <a href="https://huggingface.co/Alibaba-NLP/gte-modernbert-base/tree/main/onnx">Alibaba-NLP/gte-modernbert-base</a> as an example:</p>

<p><img src="/assets/2026-01-14-embedding-tradeoffs-quantified/model-quantizations.png" alt="model quantizations" /></p>

<p>Lower precision weights = smaller model = faster inference. But how much faster, and what’s the quality hit?</p>

<ul>
  <li><strong>On CPU:</strong> INT8 models run 2.7-3.4x faster while keeping 94-98% of the quality</li>
  <li><strong>On GPU:</strong> INT8 is actually 4-5x <em>slower</em> than FP32. Don’t do this.</li>
</ul>

<p>The difference between 30ms and 100ms query latency is huge. If you’re on CPU, INT8 is often a no-brainer.</p>

<p>On GPU, use FP16 instead - you get <a href="https://sbert.net/docs/sentence_transformer/usage/efficiency.html">~2x speedup with no meaningful quality loss</a>.</p>

<p><strong>GPU vs CPU:</strong> The T4 GPU runs 4-7x faster than Graviton3 for embedding inference. If you’re processing high query volumes or doing bulk indexing, GPU may be worth it.</p>

<h3 id="vector-precision">Vector precision</h3>

<p>Model quantization affects <em>inference</em> speed. Vector precision affects <em>storage</em> and <em>search</em> speed. Different knobs, both important.</p>

<p>Here’s the math for 100 million 768-dimensional embeddings:</p>

<style>
  table, th, td {
    border: 1px solid black;
  }
  th, td {
    padding: 5px;
  }
</style>

<table>
  <thead>
    <tr>
      <th>Precision</th>
      <th style="text-align: center">Bytes/Dim</th>
      <th style="text-align: center">100M vectors</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>FP32</td>
      <td style="text-align: center">4</td>
      <td style="text-align: center">307 GB</td>
    </tr>
    <tr>
      <td>FP16</td>
      <td style="text-align: center">2</td>
      <td style="text-align: center">154 GB</td>
    </tr>
    <tr>
      <td>INT8 (scalar)</td>
      <td style="text-align: center">1</td>
      <td style="text-align: center">77 GB</td>
    </tr>
    <tr>
      <td>Binary (packed)</td>
      <td style="text-align: center">0.125</td>
      <td style="text-align: center">9.6 GB</td>
    </tr>
  </tbody>
</table>

<p><br />
That’s a 32x difference between FP32 and binary. When memory is what forces you to add more nodes, this matters a lot.</p>
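<p>The table’s numbers are simple arithmetic. Here’s a quick Python sketch that reproduces them (back-of-envelope only; Vespa’s actual footprint adds index and attribute overhead):</p>

```python
# Memory cost of 100M 768-dimensional embeddings at each vector precision.
N_VECTORS = 100_000_000
DIMS = 768

bytes_per_dim = {
    "FP32": 4,
    "FP16/bfloat16": 2,
    "INT8 (scalar)": 1,
    "Binary (packed)": 1 / 8,  # 1 bit per dimension, packed 8 per byte
}

for precision, b in bytes_per_dim.items():
    gb = N_VECTORS * DIMS * b / 1e9
    print(f"{precision:16s} {gb:7.1f} GB")
```

<p>FP32 comes out to ~307 GB and packed binary to ~9.6 GB, which is where the 32x figure comes from.</p>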

<p><strong>bfloat16 is free:</strong> In our benchmarks, bfloat16 vectors show zero quality loss compared to FP32 - it’s a 2x storage reduction you can take without any tradeoff.</p>

<h3 id="matryoshka-dimensions">Matryoshka dimensions</h3>

<p>Some models support <a href="https://huggingface.co/blog/matryoshka">Matryoshka Representation Learning (MRL)</a> - you can truncate the embedding to fewer dimensions and still get decent results. Fewer dimensions = less storage, faster search.</p>

<p>Here’s EmbeddingGemma at different dimension sizes:</p>

<p><img src="/assets/2026-01-14-embedding-tradeoffs-quantified/embeddinggemma-mrl.png" alt="EmbeddingGemma MRL" /></p>

<p><em>Source: <a href="https://arxiv.org/pdf/2509.20354">EmbeddingGemma paper</a></em></p>

<p>Interestingly, EmbeddingGemma actually scores <em>higher</em> at 512 dimensions than at 768. We didn’t dig into why - it may be an artifact of the smaller evaluation set - but it’s a reminder that more dimensions isn’t always better.</p>

<p>Not all models support this - check the model card before truncating. If it wasn’t trained for MRL, slicing dimensions will tank your quality.</p>
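<p>The truncation itself is trivial. A minimal numpy sketch, assuming the model was trained with MRL and that you use cosine similarity (so re-normalizing after the cut is appropriate):</p>

```python
import numpy as np

def truncate_mrl(vec, dims=512):
    """Matryoshka truncation: keep the first `dims` components and
    re-normalize. Only valid for models trained with MRL."""
    v = np.asarray(vec, dtype=np.float32)[:dims]
    return v / np.linalg.norm(v)
```
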

<h3 id="inference-speed">Inference speed</h3>

<p>If you have a 200ms latency budget and your embedding model takes 150ms, you’re in trouble. We benchmarked actual inference times so you can plan accordingly.</p>

<p>We measured two things for each model:</p>

<ol>
  <li><strong>Query latency</strong> - how long to embed an 8-word query</li>
  <li><strong>Document throughput</strong> - embeddings per second for 103-word docs</li>
</ol>

<p>Tested on three AWS instance types:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">c7g.2xlarge</code> - Graviton 3 (ARM CPU)</li>
  <li><code class="language-plaintext highlighter-rouge">g4dn.xlarge</code> - T4 GPU</li>
  <li><code class="language-plaintext highlighter-rouge">m8g.xlarge</code> - Graviton 4 (ARM CPU)</li>
</ul>

<p>These numbers are pure ONNX inference time. Your actual indexing throughput will also depend on HNSW config and existing index size, but embedding inference is usually the bottleneck.</p>

<h3 id="quality">Quality</h3>

<p>We evaluated all models on <a href="https://huggingface.co/collections/zeta-alpha-ai/nanobeir">NanoBEIR</a>, a smaller but representative subset of the BEIR benchmark. This let us run a lot of experiments without waiting forever.</p>

<p>For each model, we measured nDCG@10 across four retrieval strategies:</p>

<ul>
  <li><strong>Semantic only</strong> - pure vector similarity</li>
  <li><strong>RRF (Reciprocal Rank Fusion)</strong> - combines BM25 and vector rankings</li>
  <li><strong>Atan hybrid</strong> - normalizes scores using arctangent before combining</li>
  <li><strong>Linear hybrid</strong> - linear normalization before combining</li>
</ul>

<p>The hybrid methods consistently outperform pure semantic search. <strong>Every single model</strong> in our benchmark scored higher with hybrid retrieval than semantic-only. On average, the best hybrid method beats semantic-only by 3-5 percentage points. That’s a meaningful lift you get “for free” by just using BM25 alongside your vectors.</p>
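<p>The fusion methods themselves are simple. Here’s a minimal Python sketch of RRF and linear-normalization hybrid scoring — illustrative only; in Vespa you express these as rank features and <code class="language-plaintext highlighter-rouge">normalize_linear</code> in a rank profile rather than application code:</p>

```python
def rrf(bm25_ranked, vector_ranked, k=60):
    """Reciprocal Rank Fusion over two ranked lists of doc ids."""
    scores = {}
    for ranked in (bm25_ranked, vector_ranked):
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def normalize_linear(scores):
    """Rescale a dict of raw scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def linear_hybrid(bm25_scores, vector_scores):
    """Sum the normalized BM25 and vector scores per document."""
    b, v = normalize_linear(bm25_scores), normalize_linear(vector_scores)
    fused = {doc: b.get(doc, 0.0) + v.get(doc, 0.0) for doc in set(b) | set(v)}
    return sorted(fused, key=fused.get, reverse=True)
```
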

<p>We also tested each model with binarized vectors (1 bit per dimension, packed into int8). This is where things get interesting:</p>

<ul>
  <li><strong>ModernBERT models</strong> barely flinch - Alibaba GTE ModernBERT retains 98% of quality (0.670 binary vs 0.685 float)</li>
  <li><strong>E5 models</strong> take a bigger hit - E5-base-v2 drops to 92% (0.602 binary vs 0.651 float), and E5-small-v2 to just 87%</li>
</ul>

<p>The takeaway: not all models are created equal for binary quantization. The newer ModernBERT-based models handle it much better than the E5 family. Make sure to check before assuming you can just binarize everything.</p>
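<p>To make the binarization concrete, here’s a minimal numpy sketch of sign-bit packing and hamming distance. The packing mirrors what Vespa’s <code class="language-plaintext highlighter-rouge">pack_bits</code> indexing expression does, but this code is an illustration, not Vespa’s implementation:</p>

```python
import numpy as np

def binarize(vec):
    """Sign-bit binarization: 1 bit per dimension, packed into int8."""
    bits = (np.asarray(vec) > 0).astype(np.uint8)  # threshold at 0
    return np.packbits(bits).astype(np.int8)       # 768 floats -> 96 bytes

def hamming(a, b):
    """Count differing bits between two packed vectors."""
    xor = np.bitwise_xor(a.view(np.uint8), b.view(np.uint8))
    return int(np.unpackbits(xor).sum())

rng = np.random.default_rng(0)
q, d = rng.standard_normal(768), rng.standard_normal(768)
pq, pd = binarize(q), binarize(d)        # each 96 int8 values
similarity = 1 - hamming(pq, pd) / 768   # in [0, 1], higher is closer
```
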

<h2 id="interactive-leaderboard">Interactive leaderboard</h2>

<p>We built an interactive leaderboard so you can explore the full results yourself. Filter by hardware, sort by different metrics, and expand each model to see the full breakdown across dimensions and precisions. <a href="https://huggingface.co/spaces/vespa-engine/nanobeir-hybrid-evaluation">Open in full screen</a>.</p>

<iframe src="https://vespa-engine-nanobeir-hybrid-evaluation.static.hf.space" frameborder="0" width="100%" height="1200">
</iframe>

<h2 id="getting-started-with-vespa">Getting started with Vespa</h2>

<p>Ready to put this into practice? Here’s how to configure an <a href="https://docs.vespa.ai/en/embedding.html">embedding model in Vespa</a>:</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;component</span> <span class="na">id=</span><span class="s">"alibaba_gte_modernbert_int8"</span> <span class="na">type=</span><span class="s">"hugging-face-embedder"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;transformer-model</span> <span class="na">model-id=</span><span class="s">"alibaba-gte-modernbert"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;max-tokens&gt;</span>8192<span class="nt">&lt;/max-tokens&gt;</span>
    <span class="nt">&lt;pooling-strategy&gt;</span>cls<span class="nt">&lt;/pooling-strategy&gt;</span>
<span class="nt">&lt;/component&gt;</span>
</code></pre></div></div>

<p>Here’s a schema with a binarized embedding field (96 dimensions = 768 bits packed):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>schema doc {
  document doc {
    field id type string {
      indexing: summary | attribute
    }
    field text type string {
      indexing: index | summary
      index: enable-bm25
    }
  }
  field embedding_alibaba_gte_modernbert_int8_96_int8 type tensor&lt;int8&gt;(x[96]) {
    indexing: input text | embed alibaba_gte_modernbert_int8 | pack_bits | index | attribute
    attribute {
      distance-metric: hamming
    }
    index {
      hnsw {
        max-links-per-node: 16
        neighbors-to-explore-at-insert: 200
      }
    }
  }
}
</code></pre></div></div>

<p>And a rank profile using linear normalization for hybrid scoring:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rank-profile hybrid_linear {
  inputs {
    query(q) tensor&lt;int8&gt;(x[96])
  }
  function similarity() {
    expression {
      1 - (distance(field, embedding_alibaba_gte_modernbert_int8_96_int8) / 768)
    }
  }
  first-phase {
    expression: similarity
  }
  global-phase {
    expression: normalize_linear(bm25(text)) + normalize_linear(similarity)
    rerank-count: 1000
  }
  match-features {
    similarity
    bm25(text)
  }
}
</code></pre></div></div>

<p>Check out the <a href="https://docs.vespa.ai/en/embedding.html">embedding documentation</a> for full details on configuration, including how to set up <a href="https://docs.vespa.ai/en/rag/binarizing-vectors.html">binary quantization</a> and hybrid search.</p>

<h3 id="going-further">Going further</h3>

<p>Binary vectors are fast - really fast. Vespa can do ~1 billion hamming distance calculations per second, roughly 7x more than prenormalized angular distance. That speed difference means you can crank up <a href="https://docs.vespa.ai/en/nearest-neighbor-search.html#using-nearest-neighbor-query-operator">targetHits</a> significantly and still stay within latency budget. More candidates evaluated = better recall. So binary vectors aren’t just about 32x storage savings - they give you headroom to tune for quality too.</p>

<p>And luckily, Vespa’s <a href="https://docs.vespa.ai/en/ranking/phased-ranking.html">phased ranking</a> architecture lets you make up for any remaining quality loss in later phases. You can retrieve candidates with hamming distance, then rescore in any of the following ways:</p>

<ul>
  <li><strong>float-binary</strong> - Use a float query vector, and unpack the document vector’s bits to float for the angular distance calculation. <a href="https://docs.vespa.ai/en/rag/binarizing-vectors.html#rank-profiles-and-queries">Example</a></li>
  <li><strong>float-float</strong> - Retrieve with hamming distance but rerank with full-precision vectors <a href="https://docs.vespa.ai/en/content/attributes.html#paged-attributes-disadvantages">paged in from disk</a>. Should be limited to a small candidate set.</li>
  <li><strong>int8-int8</strong> - Same as float-float, with int8 vectors (scalar quantization, not to be confused with binary quantization) for both query and document. Faster and more storage-efficient than float-float, with a small precision cost.</li>
</ul>
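<p>As a rough illustration of the float-binary idea (a sketch of the technique, not Vespa’s ranking code): unpack the stored bits back to ±1 floats and score them against the full-precision query vector.</p>

```python
import numpy as np

def unpack_to_float(packed):
    """Unpack a bit-packed document vector back to {-1, +1} floats."""
    bits = np.unpackbits(packed.view(np.uint8)).astype(np.float32)
    return bits * 2 - 1  # bit 0 -> -1.0, bit 1 -> +1.0

def rescore(query_float, packed_doc):
    """Angular-style rescore: cosine of full-precision query vs unpacked doc."""
    doc = unpack_to_float(packed_doc)
    denom = np.linalg.norm(query_float) * np.linalg.norm(doc)
    return float(query_float @ doc / denom)
```
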

<p>See <a href="https://huggingface.co/blog/embedding-quantization#quantization-experiments">this great Hugging Face blog post</a> for more details on these techniques.</p>

<p>For even better results, add a <a href="https://docs.vespa.ai/en/cross-encoders.html">cross-encoder reranker</a> as a final stage. Or (especially if you have several user signals or features), train a <a href="https://docs.vespa.ai/en/xgboost.html">GBDT model</a> to learn optimal combinations.</p>

<p>The beauty of Vespa’s <a href="https://docs.vespa.ai/en/basics/ranking.html">ranking expressions</a> is that you can mix and match all of these - BM25, a bunch of other <a href="https://docs.vespa.ai/en/reference/ranking/rank-features.html">built-in features</a>, vectors, rerankers, learned models - however you want.</p>

<h2 id="a-few-caveats">A few caveats</h2>

<h3 id="multilingual-support">Multilingual support</h3>

<p>If you need to support multiple languages, your options narrow. The <code class="language-plaintext highlighter-rouge">multilingual-e5-base</code> model handles 100+ languages but comes with a quality tradeoff compared to English-only models. For English-only workloads, stick with the specialized models.</p>

<h3 id="context-length">Context length</h3>

<p>Document length matters too. Many newer models handle 8192 tokens, EmbeddingGemma can take 2048, while the E5 family tops out at 512. If your documents are long, look at benchmarks like <a href="https://arxiv.org/html/2402.07440v2">LoCo (Long Document Retrieval)</a> - NanoBEIR won’t tell you much here.</p>

<p>For long documents, check out Vespa’s <a href="https://blog.vespa.ai/introducing-layered-ranking-for-rag-applications/">layered ranking</a> - it lets you rank chunks within documents so you’re not forced to return irrelevant chunks from top-ranking docs.</p>

<h3 id="test-on-your-own-data">Test on your own data</h3>

<p>NanoBEIR is a good starting point, but your domain matters. A model that tops the leaderboard on scientific papers might struggle with product descriptions, legal documents, or your internal knowledge base.</p>

<p>Benchmark rankings can be misleading for specialized domains. The models we tested were trained on general web data - if your corpus looks very different (medical records, source code, niche industry jargon), the relative rankings might shuffle significantly.</p>

<p>We’ve open-sourced the <a href="https://github.com/vespa-engine/pyvespa/blob/master/vespa/evaluation/_mteb.py">benchmarking code in pyvespa</a> so you can run the same experiments on any model with any dataset compatible with the MTEB library. Swap in your own data and see how different models actually perform for your use case.</p>

<h3 id="consider-finetuning">Consider finetuning</h3>

<p>If off-the-shelf models underperform on your domain, finetuning can help significantly. Even a small set of query-document pairs from your actual data can boost relevance.</p>

<p>Tools like <a href="https://www.sbert.net/docs/sentence_transformer/training_overview.html">sentence-transformers</a> make this straightforward. The ROI is often worth it for production systems where a few percentage points of nDCG translate to real user impact.</p>

<h2 id="wrapping-up">Wrapping up</h2>

<p>The “best” embedding model depends entirely on your constraints. But now you have real data to make that call:</p>

<ul>
  <li><strong>Cost sensitive?</strong> Binary quantization with a compatible model (like GTE ModernBERT) gives you 32x savings with minimal quality loss.</li>
  <li><strong>Running on CPU?</strong> INT8 model quantization speeds up inference 2.7-3.4x.</li>
  <li><strong>Need great quality?</strong> Alibaba GTE ModernBERT + hybrid search is hard to beat.</li>
  <li><strong>Latency-critical?</strong> E5-small-v2 with INT8 can do a query inference in only 2.5ms on Graviton3.</li>
</ul>
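<p>The 32x figure for binary quantization follows directly from replacing each 32-bit float with a single sign bit. A minimal, illustrative sketch of sign-based binarization (not Vespa’s internal implementation):</p>

```python
# Illustrative sketch: binary quantization keeps only the sign of each
# embedding dimension, packing 8 dimensions into one byte. A float32
# vector of d dimensions costs 4*d bytes; the binary version costs d/8.

def binarize(embedding):
    """Pack the sign bits of a float vector into bytes (MSB first)."""
    bits = [1 if x > 0 else 0 for x in embedding]
    packed = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        packed.append(byte)
    return bytes(packed)

vec = [0.12, -0.8, 0.3, 0.05, -0.4, 0.9, -0.1, 0.2] * 48  # a 384-dim vector

float32_bytes = 4 * len(vec)       # 1536 bytes as float32
binary_bytes = len(binarize(vec))  # 48 bytes binarized
print(float32_bytes // binary_bytes)  # → 32
```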

<p>The interactive leaderboard above has all the details. Explore, filter, and find the sweet spot for your use case.</p>

<p>For those interested in learning more about Vespa, join the <a href="https://vespatalk.slack.com/">Vespa community on Slack</a> to exchange ideas,
seek assistance from the community, or stay in the loop on the latest Vespa developments.</p>
]]></content:encoded>
        <pubDate>Wed, 14 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/embedding-tradeoffs-quantified/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/embedding-tradeoffs-quantified/</guid>
        
        <category>embedding</category>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        
      </item>
    
      <item>
        <title>How Tensors Are Changing Search in Life Sciences</title>
        <description>Tensor-based retrieval preserves context across queries, maintains &quot;chain of thought&quot; and ranking relevance of multiple scientific factors simultaneously.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2025-12-15-how-tensors-are-changing-search-in-life-sciences/tns-tensor-lifescience.jpg" />
        
        <content:encoded><![CDATA[<p><em>Tensor-based retrieval preserves context across queries, maintains “chain of thought” and ranking relevance of multiple scientific factors simultaneously.</em></p>

<p><em>Originally posted <a href="https://thenewstack.io/how-tensors-are-changing-search-in-life-sciences/">25th Aug. 2025 on The New Stack</a></em></p>

<hr />

<p>In my years working across life sciences, one question comes up again and again: What’s next for AI in our field? The truth is that the life sciences industry faces challenges unlike any other.</p>

<p>Where a bank or retailer might deploy AI chatbots to improve customer service, our world is defined by enormous, messy datasets, including clinical trials, lab results, publications and patient records. These must be interpreted with care. The stakes are not just efficiency or convenience; they are breakthroughs in treatment, safety and patient outcomes.</p>

<p>That’s why I believe the real opportunity for <a href="https://thenewstack.io/genai-is-quickly-reinventing-it-operations-leaving-many-behind/">generative AI (GenAI)</a> in the life sciences is not in chatbots, but in enabling <a href="https://thenewstack.io/wrangling-data-is-becoming-critical-in-an-ai-driven-world/">deep and precise retrieval</a>. Success here means connecting across multiple sources, reconciling heterogeneous data and surfacing insights that a human researcher would struggle to piece together.</p>

<p>Imagine asking: “Find me colorectal cancer trials using ZALTRAP [a drug] with the most recent supporting publications.” GenAI, when applied effectively, can handle that complexity, and this is where the next frontier begins.</p>

<h2 id="from-traditional-search-to-ai-driven-discovery">From Traditional Search to AI-Driven Discovery</h2>

<p>For decades, search in life sciences has mostly meant keyword lookups or rule-based retrieval. Researchers, clinicians and pharma teams relied on these tools to sift through scientific literature, clinical trial data, patents and regulatory filings. They worked well enough for simple, well-defined questions. But as soon as you needed to account for domain-specific language, synonyms or the complex relationships between diseases, molecules and pathways, <a href="https://thenewstack.io/vector-search-is-reaching-its-limit-heres-what-comes-next/">traditional search hit its limits</a>.</p>

<p>The result? Endless manual refinements, stitching insights together from different sources and lots of time spent just finding the right information.</p>

<p>Now, with GenAI and <a href="https://www.nature.com/articles/s44387-025-00047-1">large language models (LLMs)</a>, that’s changing. LLM-powered search understands meaning, not just exact words. You can ask complex, natural-language questions and get results that connect the dots across literature, trials and patents — even when they use different terminology. This opens up entirely <a href="https://thenewstack.io/microsoft-opens-ai-store-for-healthcare-developers/">new ways of working</a>: identifying drug repurposing opportunities hidden in disconnected studies, accelerating biomarker discovery or finding previously unseen links between biological entities. It’s faster, more comprehensive and far less manual.</p>

<h2 id="why-tensors-matter-in-this-shift">Why Tensors Matter in This Shift</h2>

<p>Life sciences data comes in all shapes and sizes — omics data, 3D protein structures, medical images, regulatory documents, clinical trial reports and more. Most of it is unstructured or semi-structured, which makes it tricky for AI systems to find and assemble relevant information quickly. Given the nature of life sciences, accuracy is critical; “good enough” rarely suffices.</p>

<p>This is where tensors come in.</p>

<p>So, <a href="https://thenewstack.io/beyond-vector-search-the-move-to-tensor-based-retrieval/">what is a tensor</a>? Think of it as a multidimensional data container. A vector is a one-dimensional list of numbers. A matrix is two-dimensional. A tensor goes beyond that, capturing multiple dimensions at once. This allows AI models to represent complex relationships — like spatial configurations of proteins or contextual relationships between words in a scientific article — even if those pieces of information are far apart.</p>

<p>In other words, tensors let AI “see” and learn patterns that are deeply embedded across different dimensions of data.</p>
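<p>A toy sketch makes the progression concrete: each step from vector to matrix to tensor simply adds an axis. Plain nested Python lists are used here purely for illustration:</p>

```python
# A vector has one axis, a matrix two, a tensor three or more.
vector = [0.1, 0.2, 0.3]                      # shape (3,)
matrix = [[1, 2], [3, 4], [5, 6]]             # shape (3, 2)
# A 3-D tensor, e.g. a toy (residue, residue, feature) structure
# of the kind used to encode spatial relationships in a protein:
tensor = [[[0.0, 1.0], [2.0, 3.0]],
          [[4.0, 5.0], [6.0, 7.0]]]           # shape (2, 2, 2)

def shape(x):
    """Recursively compute the shape of nested lists."""
    if not isinstance(x, list):
        return ()
    return (len(x),) + shape(x[0])

print(shape(vector), shape(matrix), shape(tensor))  # → (3,) (3, 2) (2, 2, 2)
```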

<h2 id="tensors-in-action-protein-structures">Tensors in Action: Protein Structures</h2>

<p><img src="/assets/2025-12-15-how-tensors-are-changing-search-in-life-sciences/tnsharini1.png" alt="" /></p>

<p>Take structural biology as an example. Models like AlphaFold use 3D tensors to represent the spatial relationships between amino acids. These tensors allow the model to learn how proteins fold, twist and interact — crucial knowledge for understanding disease mechanisms and designing new therapies.</p>

<p>When you embed a protein as a tensor, you preserve:</p>

<ul>
  <li>Sequential data (the order of amino acids)</li>
  <li>Spatial relationships (how parts of the protein fold and connect)</li>
  <li>Biochemical properties (like charge or hydrophobicity)</li>
</ul>

<p>This rich representation lets machine learning (ML) models predict protein folding, identify binding sites, map protein-protein interactions and even discover new drug targets.</p>

<p>The same idea applies beyond proteins.</p>

<p>Medical imaging, for example, can use tensors to encode not just pixels, but also their contextual relevance, helping AI detect subtle cancer markers even in noisy scans. In clinical settings, tensors help AI analyze data streams from wearables or Internet of Things (IoT) devices in real time, enabling faster interventions.</p>

<p><img src="/assets/2025-12-15-how-tensors-are-changing-search-in-life-sciences/tnsharini2.png" alt="" /></p>

<h2 id="beyond-retrieval-ai-agents-in-life-sciences">Beyond Retrieval: AI Agents in Life Sciences</h2>

<p>AI agents are another emerging application. Think of them as intelligent assistants that continuously gather, analyze and synthesize information across fragmented data sources. An AI agent could monitor new literature, clinical trials and regulatory updates in real time, summarize findings and even suggest next research steps.</p>

<hr />

<p><em>Good agents don’t just fetch information — they connect it, building context and reasoning through problems step by step.</em></p>

<hr />

<p>The key here is multistep reasoning. Good agents don’t just fetch information — they connect it, building context and reasoning through problems step by step.</p>

<p>This means faster reasoning, better accuracy and more meaningful insights, and it lets you stitch together multimodal data and ask questions across modalities and time. For example, as illustrated below, you can find patients for trial recruitment for a disease subtype based on how certain images show progression (or regression) over time, by combining the patient’s medical record, biomarker assays, histopathology slides and any other prognosis notes into a single tensor.</p>

<p><img src="/assets/2025-12-15-how-tensors-are-changing-search-in-life-sciences/tnsharini3.png" alt="" /></p>

<h2 id="why-this-matters">Why This Matters</h2>

<p>Life sciences are moving into an era where data is simply too complex and too vast for traditional tools. Tensors provide the foundation for AI models to handle this complexity, enabling everything from better search to advanced reasoning. Whether it’s predicting protein structures, extracting insights from clinical data or powering AI agents that help researchers focus on discovery rather than data wrangling, tensors are quietly becoming the backbone of the next wave of AI in life sciences.</p>

]]></content:encoded>
        <pubDate>Mon, 05 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/how-tensors-are-changing-search-in-life-sciences/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/how-tensors-are-changing-search-in-life-sciences/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        <category>LifeSciencesAI</category>
        
        
      </item>
    
      <item>
        <title>The Search API Reset: Incumbents Retreat, Innovators Step Up</title>
        <description>Google and Bing are restricting their search APIs, creating opportunities for new players to build the next generation of search infrastructure.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2025-12-14-the-search-api-reset-incumbents-retreat-innovators-step-up/tns-search-api-image.jpg" />
        
        <content:encoded><![CDATA[<p><em>Google and Bing are restricting their search APIs, creating opportunities for new players to build the next generation of search infrastructure.</em></p>

<p><em>Originally posted <a href="https://thenewstack.io/the-search-api-reset-incumbents-retreat-innovators-step-up/">7th Nov. 2025 on The New Stack</a></em></p>

<hr />

<p>The <a href="https://thenewstack.io/why-ai-search-platforms-are-gaining-attention/">search landscape</a> is shifting. In recent months, Microsoft announced the retirement of the Bing Search API, while Google limited its own API to a maximum of 10 results per query. These moves mark a notable change in the way the web’s dominant search providers view access to their data and who gets to build on it.</p>

<p>For more than a decade, search <a href="https://thenewstack.io/why-api-first-matters-in-an-ai-driven-world/">APIs</a> like Bing and Google Custom Search have been part of the web’s plumbing. Developers have used them to retrieve web results, images and news without maintaining their own indexes. Enterprises have embedded them in applications such as customer support, knowledge bases and market intelligence to provide external context. Startups and research teams have used them to collect training data, ground language models or perform competitive analysis without running their own crawlers.</p>

<p>In short, search APIs have offered a simple way to access the open web programmatically, bridging the gap between consumer search and enterprise information retrieval.</p>

<h2 id="the-ai-shift">The AI Shift</h2>

<p>The emergence of generative AI has changed <a href="https://thenewstack.io/enterprise-ai-search-vs-the-real-needs-of-customer-facing-apps/">what the search infrastructure needs to deliver</a>. With <a href="https://thenewstack.io/why-rag-is-essential-for-next-gen-ai-development/">retrieval-augmented generation (RAG)</a> becoming central to AI systems, developers now require flexible retrieval layers within the AI pipeline, not just APIs that return links.</p>

<p>Against this backdrop, the timing of Microsoft and Google’s decisions stands out. Microsoft has folded search access into Azure’s AI stack through its Grounding with Bing Search feature for AI agents, while Google continues to reduce external visibility into its own results. Limiting queries to 10 results per call fits with its long-standing goal of minimizing bulk data extraction and automated scraping.</p>

<p>The business thinking is clear: Both companies are steering developers away from large-scale, open retrieval and toward AI-mediated access inside their own ecosystems. Full result sets are expensive to serve and often used by automated systems such as SEO platforms, data-mining tools or research crawlers rather than by interactive users. Restricting APIs helps contain those costs while repositioning web data as a controlled resource for higher-level AI services.</p>

<h2 id="a-reset-not-a-retreat">A Reset, Not a Retreat</h2>

<p>This isn’t a collapse of search, but a realignment of control. The open, list-based APIs of the past belong to an era where raw results were the product. In the generative AI era, incumbents are redefining search around answers, grounding and context, tightly coupled with their cloud ecosystems.</p>

<p>But as the large providers step back, new players are moving in. Perplexity and Parallel represent a new generation of search APIs designed for AI workloads. They publish benchmarks, expose APIs openly and emphasize retrieval quality and low latency, the performance characteristics that matter most in RAG and agentic systems. You can read more about the <a href="https://www.perplexity.ai/hub/blog/introducing-the-perplexity-search-api">Perplexity search API here</a>.</p>

<p>Perplexity has also shown that it <a href="https://medium.com/@evolutionaihub/whats-new-in-perplexity-s-search-api-that-just-killed-google-s-edge-b95047ada22e">outperforms Google on relevance</a> for RAG-style tasks. Not to be outdone, Parallel, founded by Twitter’s former CEO, Parag Agrawal, recently <a href="https://x.com/paraga/status/1971650814705127438">reported better results</a> than Perplexity, using Perplexity’s own evaluation tool.</p>

<h2 id="a-hot-market-new-foundations">A Hot Market, New Foundations</h2>

<p>The search API market is heating up again, this time around AI-native infrastructure. Beneath Perplexity and Parallel is a common component: Vespa, the open source engine built for large-scale retrieval, ranking and machine learning inference.</p>

<p>Vespa’s role in these systems reflects a broader shift in architecture: Search infrastructure is now part of the AI stack itself. As models depend more on retrieval, factors such as performance, scalability and the ability to combine <a href="https://thenewstack.io/automating-context-in-structured-data-for-llms/">structured and unstructured data</a> have become key differentiators.</p>

<p>The incumbents are narrowing access; the innovators are expanding it. Either way, search is once again at the center of how the web is organized, only this time, it’s being rebuilt for AI.</p>

]]></content:encoded>
        <pubDate>Fri, 02 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/the-search-api-reset-incumbents-retreat-innovators-step-up/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/the-search-api-reset-incumbents-retreat-innovators-step-up/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        <category>AI search</category>
        
        
      </item>
    
      <item>
        <title>Enterprise AI Search vs. the Real Needs of Customer-Facing Apps</title>
        <description>Customer-facing AI search must optimize for low-latency relevance, responsiveness and personalization rather than compliance with internal policies.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2025-12-13-enterpise-ai-search-vs-the-real-needs-of-customer-facing-apps/tns-enterpise-aiearch-image.png" />
        
<content:encoded><![CDATA[<p><em>Customer-facing AI search must optimize for low-latency relevance, responsiveness and personalization rather than compliance with internal policies.</em></p>

<p><em>Originally posted <a href="https://thenewstack.io/enterprise-ai-search-vs-the-real-needs-of-customer-facing-apps/">13th Oct. 2025 on The New Stack</a></em></p>

<hr />

<p>Gartner’s new “Market Guide for Enterprise AI Search” offers a valuable perspective on how generative AI is transforming <a href="https://thenewstack.io/why-ai-search-platforms-are-gaining-attention/">enterprise knowledge access</a>. Yet, as the focus remains on internal productivity and governance, a crucial question arises: Do the same approaches apply when search is customer-facing and directly tied to engagement, conversion and revenue?</p>

<h2 id="gartners-take-on-enterprise-ai-search">Gartner’s Take on Enterprise AI Search</h2>

<p>Gartner’s research focuses on internal, employee-focused use cases, including digital workplace assistants, IT support, HR knowledge management and compliance automation. The goal is to enable <a href="https://thenewstack.io/ai-coding-assistants-are-reshaping-engineering-not-replacing-engineers/">AI assistants and copilots</a> to increase employee productivity by synthesizing information across corporate data silos. Customer experience (CX) scenarios are mentioned only briefly and treated as adjacent markets rather than core priorities.</p>

<p>As a result, the report emphasizes governance, connectors, metadata enrichment, security trimming and knowledge graphs. These capabilities are critical for enterprise environments but less relevant to real-time, customer-facing systems.</p>

<p>Performance is primarily assessed in terms of reliability and governance for AI assistants, which aligns with the internal focus of enterprise AI search. In these settings, latency and throughput matter less than access control and compliance. Hybrid search and <a href="https://thenewstack.io/freshen-up-llms-with-retrieval-augmented-generation/">retrieval-augmented generation (RAG)</a> are recognized as core capabilities, but the emphasis remains on managing complexity across diverse data silos rather than serving high-volume, low-latency workloads.</p>

<p>Gartner also expects large enterprises to operate multiple embedded search platforms across SaaS suites such as Microsoft 365, Salesforce and SAP. This architecture fits internal knowledge management well, but is not designed for customer-facing systems, where performance, scale and accuracy directly shape user experience and business outcomes.</p>

<h2 id="customer-facing-search">Customer-Facing Search</h2>

<p>This creates a clear gap for organizations building customer-facing AI applications, where search performance is not just a productivity factor but a business-critical capability. Whether in e-commerce, finance, media, social networks, market intelligence or other applications, the requirements are fundamentally different. Developers must deliver high-volume, low-latency retrieval and ranking, often processing thousands of queries per second under strict service-level objectives.</p>

<p>These systems depend on multiphase ranking pipelines, sophisticated tensor computation and multimodal retrieval across text, image and structured data. They must also serve generative or retrieval-augmented results at machine speed. Indexing and feature updates occur in near real time to reflect changes in inventory, behavior or content streams, all while maintaining uptime, cost efficiency and performance at scale.</p>

<p>Unlike internal enterprise AI search systems, where governance and policy compliance are the top priorities, customer-facing AI search must optimize for low-latency relevance, responsiveness and personalization.</p>

<p>These systems are embedded directly into the core product experience, influencing not just user satisfaction but also revenue, engagement and retention. They require a unified architecture that can handle both lexical and vector retrieval, execute learned ranking models close to the data to reduce network bandwidth and support large-scale inference workloads without heavy orchestration or additional middleware.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Gartner’s report provides an essential framework for understanding enterprise AI search within the context of internal productivity and governance. However, the same assumptions and architectures do not translate to customer-facing applications.</p>

<p>As the market evolves, a more precise distinction will emerge between enterprise AI search platforms designed for internal knowledge synthesis and AI search platforms built for production-grade, real-time environments where search is the product itself.</p>

<p>The latter must meet far higher expectations for performance, scale, accuracy and multimodality, which are defining characteristics of modern generative and retrieval-augmented systems.</p>

<p>For organizations building customer-facing AI applications, Vespa provides an alternative to the traditional “insight engine” lineage. It is built for AI-native systems, not retrofitted enterprise search. Vespa powers high-volume, real-time retrieval and ranking at scale for companies such as Perplexity.ai, Spotify and Yahoo. Its architecture supports multiphase ranking, tensor computation, multimodal retrieval and serving AI at machine speed. Vespa is designed for engineering-led companies where search is the product and performance directly drives revenue.</p>

]]></content:encoded>
        <pubDate>Mon, 29 Dec 2025 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/enterpise-ai-search-vs-the-real-needs-of-customer-facing-apps/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/enterpise-ai-search-vs-the-real-needs-of-customer-facing-apps/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        <category>AI search</category>
        
        <category>retrieval</category>
        
        
      </item>
    
      <item>
        <title>Eliminating the Precision–Latency Trade-Off in Large-Scale RAG</title>
        <description>A look at three techniques that together eliminate this trade-off: multiphase ranking, layered retrieval and semantic chunking.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2025-12-12-eliminating-the-precision-latency-trade-off-in-large-scale-rag/tns_eliminating-header.jpg" />
        
        <content:encoded><![CDATA[<p><em>A look at three techniques that together eliminate this trade-off: multiphase ranking, layered retrieval and semantic chunking.</em></p>

<p><em>Originally posted <a href="https://thenewstack.io/eliminating-the-precision-latency-trade-off-in-large-scale-rag/">3rd Oct. 2025 on The New Stack</a></em></p>

<hr />

<p><a href="https://thenewstack.io/why-rag-is-essential-for-next-gen-ai-development/">Retrieval-Augmented Generation (RAG)</a> systems constantly face a trade-off: Precise results often mean higher latency and cost, while faster responses risk losing context and accuracy. The solution isn’t choosing one or the other. It’s redesigning retrieval. Let’s explore three techniques that together eliminate this trade-off: multiphase ranking, layered retrieval and semantic chunking.</p>

<p>When combined, they create a retrieval stack that balances speed, scalability and precision.</p>

<h2 id="multiphase-ranking-incremental-refinement-of-results">Multiphase Ranking: Incremental Refinement of Results</h2>

<p>At the heart of retrieval lies ranking. Running deep machine learning (ML) ranking across the entire candidate set drives up both latency and infrastructure cost, while lightweight scoring methods alone can’t capture enough context, so precision suffers.</p>

<p>Instead of choosing between expensive deep models or fast but shallow heuristics, multiphase ranking stages scoring from cheap to costly. Lightweight filters (lexical, approximate nearest neighbor or ANN) quickly trim the candidate pool, while progressively heavier ML functions are applied only to the top results. This preserves precision while keeping latency and compute under control.</p>

<p>Multiphase ranking provides a balanced alternative:</p>
<ul>
  <li><strong>Phase 1</strong>: Fast filtering using keyword matching or ANN search.</li>
  <li><strong>Phase 2</strong>: Reranking with dense embeddings, hybrid similarity measures or custom ranking expressions.</li>
  <li><strong>Phase 3</strong>: Advanced machine-learned models, personalization signals or domain-specific scoring rules.</li>
</ul>

<p>This staged refinement ensures that expensive models are applied only where they add the most value.</p>

<p>Benefits include:</p>
<ul>
  <li><strong>Cost-aware precision</strong>: Spend compute strategically across phases.</li>
  <li><strong>Hybrid logic</strong>: Blend symbolic rules, semantic similarity and behavioral data.</li>
  <li><strong>Personalization</strong>: Adapt results to individual users or sessions.</li>
</ul>

<p>By mirroring these best practices in large-scale search and recommendation, multiphase ranking enables RAG systems to deliver accurate results without breaking latency budgets.</p>
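<p>The staged pipeline can be sketched in a few lines. The three scoring functions below are trivial stand-ins for real lexical, embedding-based and machine-learned scorers; all names and numbers are illustrative:</p>

```python
# Illustrative three-phase ranking: each phase scores a shrinking candidate
# set with a progressively more expensive function.

def phase1_keyword(query, doc):
    """Cheap lexical overlap, applied to every candidate."""
    q, d = set(query.split()), set(doc["text"].split())
    return len(q & d)

def phase2_embedding(query, doc):
    """Stand-in for embedding similarity: weight overlap by doc quality."""
    return phase1_keyword(query, doc) * doc["quality"]

def phase3_ml(query, doc):
    """Stand-in for a learned model: add a personalization boost."""
    return phase2_embedding(query, doc) + doc["user_boost"]

def rank(query, docs, k1=100, k2=10, k3=3):
    # Phase 1: score everything cheaply, keep the top k1.
    survivors = sorted(docs, key=lambda d: phase1_keyword(query, d), reverse=True)[:k1]
    # Phase 2: rerank survivors with a costlier score, keep the top k2.
    survivors = sorted(survivors, key=lambda d: phase2_embedding(query, d), reverse=True)[:k2]
    # Phase 3: apply the most expensive model only to a handful of docs.
    return sorted(survivors, key=lambda d: phase3_ml(query, d), reverse=True)[:k3]

docs = [
    {"id": 1, "text": "colorectal cancer trial results", "quality": 0.9, "user_boost": 0.0},
    {"id": 2, "text": "cancer trial recruitment notes", "quality": 0.5, "user_boost": 2.0},
    {"id": 3, "text": "unrelated press release", "quality": 0.9, "user_boost": 0.0},
]
print([d["id"] for d in rank("cancer trial", docs)])  # → [2, 1, 3]
```

<p>The expensive third phase touches only three documents here, which is exactly the point: compute is spent where it changes the final order.</p>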

<h2 id="layered-retrieval-the-foundation-of-ranking-quality">Layered Retrieval: The Foundation of Ranking Quality</h2>

<p>Even the most sophisticated multiphase ranking stack can’t compensate for poor retrieval units or noisy inputs. The quality of ranking depends heavily on the retrieval unit you choose:</p>

<ul>
  <li><strong>Fine-grained chunks</strong> (paragraphs or sliding windows) maximize recall, since even short queries are likely to match. But they introduce trade-offs:
    <ul>
      <li><strong>Context fragmentation</strong>: Key signals get split across chunks.</li>
      <li><strong>Redundancy</strong>: Overlapping chunks inflate index size and cause duplicates.</li>
      <li><strong>Downstream burden</strong>: Ranking and <a href="https://thenewstack.io/what-is-a-large-language-model/">large language models (LLMs)</a> must stitch fragmented evidence together, increasing token usage and latency.</li>
    </ul>
  </li>
  <li><strong>Whole-document retrieval</strong> preserves global context and reduces redundancy, but often sacrifices precision. Large spans of irrelevant text are pulled into prompts, diluting relevance signals, inflating token costs and making reranking less effective.</li>
</ul>

<p>A well-designed retrieval strategy typically lands in between: defining a semantic retrieval unit that captures enough local context to be self-contained, while still preserving structural metadata (headings, sections, timestamps) that downstream ranking can exploit. This balance ensures that ranking operates over high-quality candidates, minimizing wasted compute and maximizing the signal-to-noise ratio that feeds the LLM.</p>

<p>Layered retrieval achieves this balance by combining both levels of relevance:</p>

<ol>
  <li>Rank and select the most relevant documents.</li>
  <li>Within those documents, retrieve only the top-K chunks.</li>
</ol>

<p>This hierarchical process preserves the broader context of document-level signals while narrowing down to the specific spans that matter.</p>

<p>Benefits include:</p>
<ul>
  <li>Reduced token usage and lower prompt costs.</li>
  <li>Cleaner, more coherent context for the LLM.</li>
  <li>Improved precision without sacrificing recall.</li>
</ul>
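<p>A minimal sketch of the two-step process, with a trivial word-overlap score standing in for real document- and chunk-level ranking (all names are illustrative):</p>

```python
# Illustrative layered retrieval: first pick the best documents, then only
# the top-K chunks inside each.

def score(query, text):
    """Stand-in relevance score: word overlap between query and text."""
    q, t = set(query.split()), set(text.split())
    return len(q & t)

def layered_retrieve(query, docs, top_docs=2, top_chunks=2):
    # Step 1: rank documents by their best-matching chunk.
    ranked_docs = sorted(
        docs,
        key=lambda d: max(score(query, c) for c in d["chunks"]),
        reverse=True,
    )[:top_docs]
    # Step 2: within each selected document, keep only the top-K chunks.
    return [
        {"id": d["id"],
         "chunks": sorted(d["chunks"], key=lambda c: score(query, c), reverse=True)[:top_chunks]}
        for d in ranked_docs
    ]

docs = [
    {"id": "a", "chunks": ["tensor basics", "protein folding with tensors", "footer text"]},
    {"id": "b", "chunks": ["cooking recipes", "travel notes"]},
]
print(layered_retrieve("protein folding", docs))
```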

<h2 id="semantic-chunking-precision-starts-with-preprocessing">Semantic Chunking: Precision Starts With Preprocessing</h2>

<p>Finally, retrieval quality depends on how you index your data. Long-form documents stored as monoliths often produce noisy retrieval, because only part of the content is relevant to a given query.</p>

<p>Semantic chunking addresses this by splitting documents into meaningful, self-contained units like paragraphs or logical sections while retaining contextual metadata like headings, authorship or timestamps.</p>

<p>Benefits include:</p>

<ul>
  <li><strong>Higher recall</strong>: More granular entry points into documents.</li>
  <li><strong>Better precision</strong>: Irrelevant sections can be excluded at query time.</li>
  <li><strong>Metadata enrichment</strong>: Supports symbolic filtering and downstream ranking.</li>
</ul>

<p>Chunking can increase index size and requires careful prompt assembly, but when combined with layered retrieval and multiphase ranking, it becomes a powerful foundation for precision.</p>
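<p>A minimal illustration of the idea: split on paragraph boundaries and carry the nearest heading along as metadata. Real pipelines would use richer structure detection; the heading convention here is an assumption for the sketch:</p>

```python
# Illustrative semantic chunking: split a document on blank lines into
# paragraph-level chunks and attach the nearest heading as metadata.

def chunk_document(text, doc_id):
    chunks, heading = [], None
    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("# "):   # treat markdown-style lines as headings
            heading = block[2:]
            continue
        chunks.append({"doc_id": doc_id, "heading": heading, "text": block})
    return chunks

doc = """# Methods

We enrolled 120 patients in the phase II trial.

# Results

Median progression-free survival improved by 3.1 months."""

for c in chunk_document(doc, "trial-42"):
    print(c["heading"], "->", c["text"][:40])
```

<p>Each chunk is now a self-contained unit that can be ranked on its own, while the heading metadata lets downstream ranking and filtering exploit document structure.</p>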

<h2 id="building-a-production-ready-retrieval-stack-for-rag">Building a Production-Ready Retrieval Stack for RAG</h2>

<p>Together, these three techniques address the biggest pain points in scaling RAG:</p>

<ul>
  <li>Overlong prompts from including too much content.</li>
  <li>Context fragmentation from isolated chunks.</li>
  <li>Rigid ranking pipelines that ignore domain logic.</li>
</ul>

<p>A robust retrieval stack should therefore:</p>

<ul>
  <li>Index documents with semantic chunking while preserving metadata.</li>
  <li>Retrieve hierarchically through layered retrieval.</li>
  <li>Refine results efficiently with multiphase ranking.</li>
</ul>

<p>This combination enables more accurate, cost-efficient and trustworthy LLM outputs, especially when paired with retrieval-aware prompt engineering.</p>

<h2 id="final-thoughts">Final Thoughts</h2>

<p>As <a href="https://thenewstack.io/a-blueprint-for-implementing-rag-at-scale/">RAG systems scale</a>, retrieval design becomes a key differentiator. <a href="https://thenewstack.io/beyond-vector-search-the-move-to-tensor-based-retrieval/">Moving beyond simple vector or ANN search</a> to incorporate multiphase ranking, layered retrieval and semantic chunking dramatically improves both efficiency and output quality.</p>

<p>Vespa was built to handle these retrieval challenges at enterprise scale. Its tensor-native architecture supports multiphase ranking, layered retrieval and semantic chunking directly in-cluster, eliminating external bottlenecks and costly workarounds. By running retrieval and ranking where the data lives, Vespa delivers low-latency, high-precision results across billions of documents and thousands of queries per second.</p>

<p>Whether you’re building knowledge assistants, research agents or large-scale production RAG systems, Vespa provides the retrieval foundation that keeps generative AI accurate, efficient and ready to scale.</p>

]]></content:encoded>
        <pubDate>Mon, 22 Dec 2025 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/eliminating-the-precision-latency-trade-off-in-large-scale-rag/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/eliminating-the-precision-latency-trade-off-in-large-scale-rag/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        
      </item>
    
      <item>
        <title>Why AI Search Platforms Are Gaining Attention</title>
        <description>Users expect search not just to return accurate results, but to do the heavy lifting: Answer a question, summarize research, or even solve a problem.</description>
        
        <media:thumbnail url="https://blog.vespa.ai/assets/2025-12-11-why-ai-search-platforms-are-gaining-attention/TNSheader.png" />
        
        <content:encoded><![CDATA[<p><em>Users expect search not just to return accurate results, but to do the heavy lifting: Answer a question, summarize research, or even solve a problem.</em></p>

<p><em>Originally posted <a href="https://thenewstack.io/why-ai-search-platforms-are-gaining-attention/">29th Aug. 2025 on The New Stack</a></em></p>

<hr />

<p>A few years ago, my daughter told me that her school research project was so deep she had to venture all the way to Page 3 of Google. That moment stuck with me because it shows just how ingrained search has become in our lives. “To google” quickly turned into a verb. <a href="https://thenewstack.io/vector-search-is-reaching-its-limit-heres-what-comes-next/">Search was built</a> for human speed: deliver a shortlist of results, let the user scan, interpret, decide and even go to Page 3 if required. This same foundation has powered e-commerce, content discovery, compliance and countless other applications.</p>

<p>Generative AI has changed expectations almost overnight. Instead of typing keywords, people ask questions in plain language, and those questions are increasingly complex. They expect search not just to return accurate results, but to do the heavy lifting: answer a question, summarize research or even solve a problem.</p>

<h2 id="generative-ai-is-maturing-fast">Generative AI Is Maturing Fast</h2>

<p>Despite being a relatively recent phenomenon, there are already at least three levels of GenAI maturity:</p>

<ul>
  <li>Level 1: Chatbots – “Answer my question.”</li>
  <li>Level 2: Deep research – “Research this and report back.”</li>
  <li>Level 3: Agentic systems – “Solve my problem.”</li>
</ul>

<p>At Levels 2 and 3, retrieval becomes challenging. Systems may run dozens of searches for a single task. A sluggish retrieval layer doesn’t just slow things down; it can cripple the whole experience.</p>

<h2 id="the-challenge-delivering-retrieval-accuracy-at-scale">The Challenge: Delivering Retrieval Accuracy at Scale</h2>

<p>Vector databases made similarity search possible, enabling large language models (LLMs) to ground answers in large unstructured data sets. But <a href="https://thenewstack.io/ai-needs-more-than-a-vector-database/">vector</a> search alone isn’t enough. Production-grade AI search needs more: combining semantic, keyword and metadata retrieval, applying machine-learned ranking and handling constantly <a href="https://thenewstack.io/how-tensors-are-changing-search-in-life-sciences/">changing structured and unstructured data</a>, all at scale.</p>

<p>Trying to bolt these components together across multiple systems quickly hits its limits. Bandwidth, integration overhead and shallow connections create bottlenecks and erode accuracy, which is key since people rarely question the answers the AI provides.</p>

<h2 id="enter-the-ai-search-platform">Enter the AI Search Platform</h2>

<p>The AI search platform is a new class of infrastructure that makes retrieval smarter, faster and more scalable by uniting classical search techniques with modern AI: vector and <a href="https://thenewstack.io/beyond-vector-search-the-move-to-tensor-based-retrieval/">tensor</a> search in embedding spaces, full-text search for precision, multistep ranking and real-time inference, using machine-learned models and tensor math. It enables accurate search at machine speed with filtering and ranking to ensure only the most relevant answers surface instantly. The AI search platform is critical in simplifying the development and deployment of generative AI at every maturity level.</p>
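<p>To make the combination concrete, a hybrid retrieval request in Vespa's YQL might look like this sketch (the <code>doc</code> schema, <code>embedding</code> field and query tensor <code>q</code> are illustrative assumptions):</p>

```
# Combine full-text matching with approximate nearest-neighbor search
# in a single query; ranking then blends both signals.
select * from doc
where userQuery() or ({targetHits: 100}nearestNeighbor(embedding, q))
```

<p>Because the keyword match and the vector match run in the same query against the same cluster, filtering and ranking can use both signals at once instead of merging results from separate systems.</p>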

<h2 id="why-this-matters-for-enterprises">Why This Matters for Enterprises</h2>

<p>Mainstream data platforms, such as Snowflake or Postgres, now include basic vector search capabilities. That’s fine for entry-level GenAI chatbots, but not for customer-facing deep research or agentic AI use cases where speed, scale and accuracy deliver competitiveness.</p>

<p>For CIOs, this has created a split:</p>

<ul>
  <li>Basic enterprise GenAI: supported by incumbent platforms, “good enough” for simple internal tasks.</li>
  <li>Advanced enterprise GenAI: for demanding customer-facing use cases, where only AI search platforms can keep up.</li>
</ul>

<p>In this landscape, pure-play vector DBs risk being marginalized, sandwiched between incumbent data platforms for simple use cases, and AI search platforms that deliver scale, performance and accuracy.</p>

<p>Companies that adopt AI search platforms early will set the pace in this new era. Search is no longer just a utility; it’s becoming the backbone of AI-driven business. And no doubt, the backbone of my daughter’s research.</p>

]]></content:encoded>
        <pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate>
        <link>https://blog.vespa.ai/why-ai-search-platforms-are-gaining-attention/</link>
        <guid isPermaLink="true">https://blog.vespa.ai/why-ai-search-platforms-are-gaining-attention/</guid>
        
        <category>rag</category>
        
        <category>AI</category>
        
        <category>GenAI</category>
        
        <category>ranking</category>
        
        <category>tensors</category>
        
        <category>AI search</category>
        
        <category>search</category>
        
        
      </item>
    
  </channel>
</rss>
