Kristian Aune
Head of Customer Success, Vespa.ai

Vespa Newsletter, June 2025

In the previous update, we mentioned 3x lexical search query performance, Pyvespa Relevance Evaluator, Global-phase rank-score-drop-limit, Compact tensor representation, and Agentic and Video example applications.

Today, we’re excited to share the following updates:

  • Features for high-quality Generative AI retrieval, including integrated chunking
  • Elementwise rank features
  • Facet filtering in Vespa Grouping
  • Pyvespa Match Evaluator

Don’t forget to check out Vespa Voice, our new podcast!

Layered ranking for RAG applications

Until now, RAG systems have relied purely on document ranking to populate the LLM context: the entire text of the top N ranked documents is put into the context. In version 8.530 we are, for the first time in retrieval systems, changing this paradigm by introducing layered ranking, where ranking functions are used both to score and select the top N documents, and to score the top M chunks of content within each of those documents. Read more in the announcement.

New features released as part of this:

From Vespa 8.478:

  • elementwise BM25 rank feature
    The elementwise(bm25(field), dimension, cell_type) rank feature is now available for multi-valued index fields. It calculates the Okapi BM25 ranking function over each element in the given multi-valued indexed string field, and creates a tensor with a single mapped dimension containing the BM25 score for each matching element. The element indexes (starting at 0) are used as dimension labels. In short, this generates BM25 scores for an array<string> field, and is used for chunked documents.
    Read more
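
As a minimal sketch of how this can be used (the field name chunks and the profile name are ours, not from the release notes), a rank profile can compute and expose the per-chunk BM25 scores like this:

schema doc {
    document doc {
        field chunks type array<string> {
            indexing: index
            index: enable-bm25
        }
    }
    rank-profile chunk-scores {
        first-phase {
            expression: bm25(chunks)
        }
        # Tensor with one mapped "chunk" dimension: labels are element
        # indexes, values are per-element BM25 scores
        function chunk_text_scores() {
            expression: elementwise(bm25(chunks), chunk, float)
        }
        summary-features {
            chunk_text_scores
        }
    }
}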

From Vespa 8.528 - new tensor functions:

  • top( N, tensor )
    Picks the top N cells of a simple mapped tensor. It is a convenience expression built from filter_subspaces and cell_order (below), extracting the highest-scoring elements, like the best-scoring chunks in a document.
    Read more
  • filter_subspaces( tensor, f(x)(expr) )
    Returns a new tensor containing only the subspaces for which the lambda function defined in f(x)(expr) returns true. Typically used to eliminate values in sparse tensors. Example:
    filter_subspaces(tensor(x{}):{a:1, b:2, c:3, d:4}, f(value)(value>2)) -> tensor(x{}):{c:3, d:4}
    Read more
  • cell_order( tensor, order )
    Returns a new tensor with the rank of the original cells: with max, the largest value gets rank 0; with min, the smallest value gets rank 0. Examples:
    cell_order(tensor(x[3]):[2,3,1], max) -> tensor(x[3]):[1,0,2]
    cell_order(tensor<float>(chunk{}):{0:13, 1:7, 2:5, 3:15, 5:2}, min) -> {0:3.0, 1:2.0, 2:1.0, 3:4.0, 5:0.0}
    Note how both indexed and mapped tensors are supported. Mapped tensors do not have a fixed dimension size and are hence useful in multivector cases like an array of chunk scores, as documents normally have a variable number of chunks.
    Read more
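
To make the composition concrete, here is top() applied to the same mapped tensor as in the filter_subspaces example:

top(2, tensor(x{}):{a:1, b:2, c:3, d:4}) -> tensor(x{}):{c:3, d:4}

In a rank profile, a hypothetical expression like top(3, elementwise(bm25(chunks), chunk, float)) would keep the three best-scoring chunks of a document.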

From Vespa 8.528:

  • select-elements-by
    Used to control which elements of an array field are returned as part of the document summary. This is particularly useful in GenAI applications for returning only the best-scoring chunks of a document, improving result quality.

    The summary feature used must be a tensor with a single mapped dimension: an element is returned if its index appears as a label along the mapped dimension of this tensor. Read more
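
Putting these pieces together, a hedged sketch (field, function, and profile names are ours) of a schema that ranks chunks and returns only the three best in the summary:

schema doc {
    document doc {
        field chunks type array<string> {
            indexing: index | summary
            index: enable-bm25
            summary {
                select-elements-by: best_chunks
            }
        }
    }
    rank-profile chunk-scores {
        first-phase {
            expression: bm25(chunks)
        }
        # Single mapped dimension; its labels select the elements to return
        function best_chunks() {
            expression: top(3, elementwise(bm25(chunks), chunk, float))
        }
        summary-features {
            best_chunks
        }
    }
}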

Chunking during indexing

The embedding models used in semantic search require input data of limited size to work well, so most applications split larger text fields into chunks (an array of shorter strings) and create an embedding for each array element.

From version 8.520, Vespa supports chunking in the indexing language. The id of the chunker to use is required: sentence or fixed-length for the chunkers provided by Vespa, or the id of any chunker component provided by the application.

Example:

schema doc {

    document doc {
        field text type string {
        }
    }

    field chunk_embeddings type tensor(chunk{}, x[768]) {
        indexing: input text | chunk fixed-length 1024 | embed | attribute | index
        attribute {
            distance-metric: angular
        }
    }
}
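
If the sentence chunker is used instead, no size argument is given (per the chunker list above), so the indexing statement becomes:

    indexing: input text | chunk sentence | embed | attribute | index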

Proximity across chunks with element-gap

In Generative AI use cases, applications split large text fields into chunks and store them as elements in an array of strings. The general assumption when indexing arrays is that each array element is independent and that the proximity between terms in separate elements should not matter. However, this is not true with chunking since the source text was contiguous.

Vespa now supports configuring an element-gap value in rank profiles, to specify to what extent proximity between elements should matter in rank features and in near/onear conditions. The default value is infinite, while any value down to 0 is supported.

We recommend that RAG applications set this to a small value, around 0 to 2 for their text chunks field, depending on the chunking strategy used. This setting affects the nativeProximity and nativeRank rank features, and the near and onear query operators. Read more.
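
A minimal sketch, assuming the per-field rank block syntax from the schema reference and the hypothetical chunks field from the examples above:

rank-profile chunk-aware {
    rank chunks {
        element-gap: 1
    }
}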

Filtering in grouping results

Grouping, or faceting, is used to aggregate field values, often in e-commerce to organize results by brand, price, and so on. Grouping is also supported for multivalue fields such as maps and arrays, and in these cases you might need to filter out entries from the aggregations. Since Vespa 8.512, use the filter keyword to restrict which field values are aggregated. See the reference for details. Example document and grouping expression:

{
    "fields": {
        "attributes": {
            "delivery_method": "Curbside Pickup",
            "sales_rep": "Bonnie",
            "coupon": "SAVE10"
        },
        "customer": "Smith",
        "date": 1157526000,
        "item": "Intake valve",
        "price": "1000",
        "tax": "0.24"
    }
}

select (…) | all(group(attributes.value) filter(regex("delivery_method",attributes.key)) each(output(sum(price)) each(output(summary()))))

WeakAnd allowDropAll

In the previous newsletter, we announced 3x lexical search performance by filtering and reducing the required precision for common words, plus reducing ranking costs for these. If a query has only common terms, Vespa defaults to keeping the rarest one.

With the new weakand.allowDropAll boolean query property, you can allow the weakAnd operator to drop all terms from the query if they are all considered stopwords. This is useful in hybrid search applications where the lexical signal is too weak to contribute to ranking.
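
A hedged request sketch, assuming the property is passed directly as a query-API parameter with the name above (document type and query text are placeholders):

{
    "yql": "select * from doc where userQuery()",
    "query": "to be or not to be",
    "weakand.allowDropAll": true
}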

VespaMatchEvaluator in Pyvespa

Pyvespa v0.56.0 is out, with the most notable addition being the VespaMatchEvaluator. This class provides a simple interface to evaluate the recall over a set of queries and relevant documents, and outputs totalCount and time spent. This makes it easy to tune, for example, targetHits for nearestNeighbor and weakAnd, along with other weakAnd parameters.

See also usage in the Pyvespa notebook, Evaluating a Vespa Application.

Using private models on HuggingFace

Many applications use models hosted on HuggingFace. Starting from Vespa 8.540, you can also use private HuggingFace models by adding your key to the Vespa Cloud secret store and referring to it in the HuggingFace component configuration; see the documentation.

Azure zones are now generally available on Vespa Cloud

Azure is now generally available on Vespa Cloud, in addition to AWS and GCP. We have provided a dev zone to get you started: azure-eastus-az1. We will add prod zones in the regions people need; reach out to us if you need a particular region.

Other new features and performance improvements:

  • The query API now supports independent model type settings through model.type.composite, model.type.tokenization, and model.type.syntax. This lets you set these aspects of turning strings into query subtrees separately, which is more flexible than just setting a type (or grammar).
  • The new linguistics tokenization setting lets you pass the entire string as-is to the linguistics tokenizer, just like what’s done on the indexing side. This is suitable for applications that want full control over the linguistic processing, typically through a configured linguistics component.
  • Returning multiple tokens (e.g., stems) at the same term position is supported in LuceneLinguistics from Vespa 8.522. As with other linguistics implementations, Vespa will:
    • Index all the token alternatives at the same position.
    • Search using the first alternative returned by the linguistics implementation.
  • Vespa has built-in language detection using Apache OpenNLP. From Vespa 8.520, you can configure the detection confidence, see default-languages. This is particularly useful for query language detection. Queries with few terms make detection harder, and this feature allows you to tune the detection confidence.
  • Vespa’s feed clients adapt to the write capacity of the Vespa cluster, but it can take some time to converge on the optimal rate. Use initial-inflight-factor with the vespa-feed-client for faster feed ramp-up. Find details in #34247. Thanks to yjtakamabe for submitting this!
  • Since Vespa 8.533, tensor dot products on INT8 use vectorized kernel instructions. This speeds up expressions with INT8 dot products in ranking profiles.
  • To reduce the size of result sets containing tensors, use presentation.format.tensors with hex or hex-value for a more compact representation. Since Vespa 8.518. See the example after this list.
  • When using match: exact for text matching, you can now also configure match: cased for case-sensitive matching - see cased. Since Vespa 8.504.
  • Vespa supports various topologies to optimize query performance and operational trade-offs. Since Vespa 8.502, using prioritize-availability, all groups whose document count is within min-active-docs-coverage of the median document count of the other groups will be used to serve queries. This allows application owners to prioritize availability while accepting some possible loss in result coverage, if needed.
  • Since Vespa 8.541 it is legal to nest Equiv items inside Near and ONear.
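
As noted in the tensor-format item above, the compact representation is requested per query. A minimal request sketch (the YQL is just a placeholder):

{
    "yql": "select * from doc where true",
    "presentation.format.tensors": "hex"
}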


👉 Follow us on LinkedIn to stay in the loop on upcoming events, blog posts, and announcements.


Thanks for joining us in exploring the frontiers of AI with Vespa. Ready to take your projects to the next level? Deploy your application for free on Vespa Cloud today.
