Zohar Nissare-Houssen
Lead Strategic Presales Engineer

Vespa for Dummies

Introduction

After dedicating much of my career to Data Analytics, working with RDBMSs, data warehouses, Hadoop, and Snowflake, I now find myself venturing into the domain of Information Retrieval. Although the field is new to me, the challenges it addresses are familiar: how to store vast quantities of data and process it at scale to extract the specific insights that drive informed decisions. Over the years, I’ve helped countless customers and partners navigate these very challenges. Information retrieval, in particular, is about sifting through massive amounts of data efficiently, in milliseconds, to surface the most relevant answers: that crucial ‘needle in a haystack’ piece of data.

While working on agentic applications using Retrieval-Augmented Generation (RAG) patterns in my previous role, I came to realize how critical Information Retrieval is for building effective Enterprise AI applications, particularly when tackling semantic search challenges. Ensuring the right data reaches your enterprise agentic Gen AI applications isn’t just important; it’s critical for success.

After an immersive week with the Vespa.ai core engineering team at their headquarters in the charming town of Trondheim, Norway, I’m excited to share some of my initial insights into this powerful platform.


Vespa: Not the Newcomer You Might Think

With the rise of Generative AI and Retrieval-Augmented Generation (RAG) applications, a wave of new commercial vector databases has emerged to tackle the challenges of semantic search. But Vespa stands apart—it’s not just another vector database. Vespa’s origins trace back to 1997, at the dawn of enterprise search, with Fast Search & Transfer. Yahoo acquired Fast Search & Transfer in 2003, bringing Vespa into its ecosystem to power search, recommendation engines, and advertising platforms at web scale.

For over 20 years, Vespa has been tackling complex information retrieval challenges at web scale, evolving its architecture, features, and functionality to meet the demands of an ever-changing data landscape. As data types expanded beyond text to audio, images, video, and files, as data volumes surged, and as customer use cases grew in complexity and expectations, Vespa relentlessly adapted, all while improving latency, throughput, availability, relevance, and accuracy. It’s this deep-rooted history, relentless innovation, and strong engineering culture that make Vespa uniquely suited to today’s advanced retrieval needs.

Some of the apps you use today might be powered by Vespa. Spotify, for example, uses Vespa for its search infrastructure, helping deliver the content recommendations you love. And if you’ve found your perfect match on OkCupid, Vespa might have played a part in that, too. Several other hot AI-powered apps you may be hearing more about also run on Vespa.


Key Advantages of Vespa

Let’s review some of the key advantages of the platform. This is by no means an exhaustive list; these are the ones that spoke to me as a newcomer to the platform, with some pressing Gen AI application challenges in mind.

1. Hybrid Search

While Vespa started in the search engine space, it introduced tensors and vectors back in 2014 to address semantic search and personalization. Its unique architecture allows Vespa to effortlessly combine vector search with traditional search engine query operators for better accuracy.

A single document can include structured and semi-structured fields, unstructured text, and even images. Vespa lets you model a schema that covers all of this content. For example, structured fields such as a publication date or a category can be represented as attributes for fast filtering, faceting, and grouping. The title and a short abstract of the document can each be embedded into a vector. The image can be embedded using an advanced vision-language model such as ColPali. You then have a semantic representation of your entire document, ready for efficient and accurate retrieval.
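
As an illustration, here is a minimal schema sketch for such a document (the image embedding is omitted). The field names, the 384-dimension embedding, and the embedder configuration are assumptions for the sake of the example:

    schema doc {
        document doc {
            # structured fields stored as attributes for fast filtering,
            # faceting and grouping
            field publication_date type long {
                indexing: attribute | summary
            }
            field category type string {
                indexing: attribute | summary
            }
            # unstructured text, indexed for lexical (BM25) matching
            field title type string {
                indexing: index | summary
                index: enable-bm25
            }
            field abstract type string {
                indexing: index | summary
                index: enable-bm25
            }
        }
        # derived field: embed the title into a dense vector at feed time
        # (requires an embedder component configured in the application)
        field title_embedding type tensor<float>(x[384]) {
            indexing: input title | embed | attribute | index
            attribute {
                distance-metric: angular
            }
        }
        # fields searched by default for free-text queries
        fieldset default {
            fields: title, abstract
        }
    }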

Vespa then lets you perform a hybrid search over such a document, combining full-text lexical search, exact matching on attributes of your document, and nearest-neighbor semantic search on the image and text embeddings.
Meanwhile, your data can be updated on the fly (no batch index rebuilds).
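
As a sketch, a hybrid query against the hypothetical schema above could look like this in Vespa’s YQL query language (the targetHits value is illustrative):

    select * from doc where
        category contains "research" and
        (userQuery() or ({targetHits: 100}nearestNeighbor(title_embedding, q_embedding)))

Here userQuery() carries the free-text part of the request and is matched lexically against the default fieldset, the category clause is an exact attribute filter, and the query-side embedding is passed along with the request as the input.query(q_embedding) parameter.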

Vespa can also optimize vector search over large embedding collections, for example through binary quantization of the embeddings, for efficiency and low latency.
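
Binarized vectors are typically stored as packed int8 cells and compared with hamming distance. A sketch of such a field, assuming the 384-bit (48-byte) size of the earlier example; how the bits are produced at feed time is omitted here:

    # one bit per original dimension, packed 8 bits per int8 cell
    field title_embedding_binary type tensor<int8>(x[48]) {
        indexing: attribute | index
        attribute {
            distance-metric: hamming
        }
    }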

This is a tall order. For these reasons and many others, Vespa is recognized as a leader in the GigaOm Vector Database report. You can read more about it here.

2. Advanced Multi-phase Ranking

A key aspect of Information Retrieval is ranking the search results to surface the most relevant ones. For lexical search, you may be familiar with TF-IDF scoring. BM25 (Best Match 25) builds on TF-IDF and addresses its limitations, adding more nuanced term-frequency saturation and document-length normalization. For vector search, you may be familiar with distance metrics between vectors, such as cosine similarity or Euclidean distance.
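
For reference, the standard BM25 scoring function for a query Q = q_1, ..., q_n against a document D, with the usual tuning parameters k_1 and b, is:

    score(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}

where f(q_i, D) is the frequency of term q_i in D, |D| is the document length, and avgdl is the average document length in the collection.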

Vespa offers you advanced capabilities when it comes to ranking:

  • Hybrid Ranking: combining traditional text-based ranking signals such as BM25 with vector-based similarity ranking.
  • Customizable Ranking Functions: Vespa allows you to entirely customize your ranking using ranking expressions. Those expressions can be dynamic, incorporating real-time signals (e.g., user behavior, contextual information) into ranking, which enhances personalization and dynamic relevance.
  • Support for ML Models: you can bring your own ML ranking models (TensorFlow, XGBoost, or any ONNX-format model) for real-time inference and host them on your Vespa cluster.
  • Multi-phase Ranking: Vespa allows you to implement phased ranking, breaking ranking into multiple stages, each with specific computational goals, to efficiently rank large volumes of data. This enables Vespa to process potentially millions of documents while applying the most complex and computationally intensive ranking functions only to a much smaller, refined subset of documents (see the rank-profile sketch below).
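
A minimal sketch of a two-phase rank profile for the hypothetical schema above; the XGBoost model file name is an assumption:

    rank-profile hybrid inherits default {
        inputs {
            query(q_embedding) tensor<float>(x[384])
        }
        first-phase {
            # cheap signals, evaluated for every matching document
            expression: bm25(title) + bm25(abstract) + closeness(field, title_embedding)
        }
        second-phase {
            # expensive model, evaluated only for the top candidates
            rerank-count: 100
            expression: xgboost("my_ranker.json")
        }
    }

A query would select this profile with the ranking=hybrid request parameter.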

3. LLM Integration

Vespa’s architecture supports the direct use of Large Language Models (LLMs) in both query processing and document handling. This means you can call external models like ChatGPT or integrate smaller, specialized models that run natively within Vespa itself. By embedding LLM functionality directly in Vespa, you can build Retrieval-Augmented Generation (RAG) applications without the need for external tools. With Vespa Cloud, you can spin up GPU nodes for inference directly on top of your data.
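
As a sketch of the pattern described in Vespa’s LLM documentation, an OpenAI-compatible client component can be wired into a search chain in services.xml; treat the component and searcher class names below as assumptions to verify against the current docs:

    <container id="default" version="1.0">
        <!-- client component for an external OpenAI-compatible LLM -->
        <component id="openai" class="ai.vespa.llm.clients.OpenAI"/>
        <search>
            <chain id="llm" inherits="vespa">
                <!-- searcher that sends the query (and retrieved context) to the LLM -->
                <searcher id="ai.vespa.search.llm.LLMSearcher"/>
            </chain>
        </search>
    </container>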

Discover More: Vespa and LLMs

4. Scalability and Performance

Vespa was built from the ground up for web-scale applications, supporting real-time personalization and content serving for top-10 global websites like yahoo.com. These are billion-user systems with low latency and high throughput. As a result, its entire architecture is designed accordingly: a distributed architecture that supports ingesting, storing, processing, querying, and retrieving data at scale.

Vespa supports auto-scaling to dynamically adjust compute and storage resources in response to workload changes, ensuring optimal performance and cost-efficiency. This capability is essential for applications with fluctuating demand, such as e-commerce sites, recommendation engines, or search systems that need to maintain low latency and high throughput even as traffic spikes or dips.

Vespa can auto-scale dynamically both horizontally by adding more content (data) nodes, and vertically by adding more resources per node. As nodes are added or removed, Vespa automatically redistributes data and rebalances load to ensure even resource utilization and minimal latency.
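
On Vespa Cloud, for example, autoscaling ranges can be declared directly on the cluster in services.xml; the node count range and resource figures below are illustrative:

    <content id="mycontent" version="1.0">
        <!-- let Vespa scale between 2 and 8 content nodes as load changes -->
        <nodes count="[2, 8]">
            <resources vcpu="8" memory="32Gb" disk="300Gb"/>
        </nodes>
    </content>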

5. Online, Real-time

Vespa is a high-performance platform designed for low latency and high throughput. It delivers fast, high-quality data retrieval with response times measured in milliseconds, all while supporting hundreds of thousands of queries per second. On top of that, Vespa handles data updates with ease, processing tens of thousands of updates per second per node. As your cluster grows, these throughput capabilities scale seamlessly, enabling you to handle even larger volumes of data and queries.
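
For instance, a single field can be updated in place through the /document/v1 REST API, and the change becomes visible to queries in real time; the namespace, document id, and field below are hypothetical:

    curl -X PUT -H "Content-Type: application/json" \
        --data '{ "fields": { "category": { "assign": "news" } } }' \
        'http://localhost:8080/document/v1/mynamespace/doc/docid/doc-123'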

This is a game-changer for global websites and use cases like e-commerce platforms, where serving continuously updated, relevant, personalized content is crucial. In the world of AI applications, Vespa unlocks the power to provide real-time enterprise information through Retrieval-Augmented Generation (RAG) systems. For instance, an e-commerce chatbot powered by Vespa can recommend products that are not only in stock but also newly added to inventory, while incorporating real-time trends such as frequently purchased items or similar products based on user behavior. Many other AI applications across all verticals can use real-time data, such as IoT/manufacturing, smart cities, and supply-chain control towers.

6. Fault Tolerance

Vespa automatically maintains multiple copies of data across different nodes. This redundancy ensures that if a node goes down, data can still be accessed from other replicas without downtime. In the event of a node failure, Vespa reroutes traffic to healthy nodes and uses replica data to continue serving queries. This failover happens seamlessly, maintaining service availability. When a node fails or becomes unreachable, Vespa detects the issue and redistributes data to maintain the specified redundancy. New replicas are created to replace any lost copies, ensuring data integrity and maintaining consistent query performance.
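
The number of copies is declared per content cluster in services.xml; a sketch, with an illustrative cluster id and node count:

    <content id="mycontent" version="1.0">
        <!-- keep two copies of every document across the cluster -->
        <redundancy>2</redundancy>
        <documents>
            <document type="doc" mode="index"/>
        </documents>
        <nodes count="3"/>
    </content>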

Why Vespa?

This is a legitimate question to ask. After all, if you have read this far, you may already realize that if your use case is a web-scale RAG chatbot, Vespa has to be on your shopping list.

But you may be questioning why you need such an evolved solution when you can get away with an easy-to-use library that provides basic vector functionality: vector operations, indexing, and similarity search for your RAG application. That may well be sufficient for a prototype or a functional test of your application.

However, after experimenting with RAG applications for a while, many customers realize how critical the quality and relevance of information retrieval is for their Gen AI apps. It is most often the Achilles’ heel of their application, the part they need to get right: how do you ensure the right data is served to the LLM for inference?

After exploring Gen AI applications through proofs of concept and prototypes, client conversations have evolved toward Enterprise AI reference architectures that require an enterprise AI data repository: one that indexes and vectorizes a diverse set of enterprise data and serves it to enterprise-wide Gen AI applications. This data repository is foundational to AI readiness, providing best-of-breed information retrieval capabilities.

By choosing Vespa, organizations gain a best-in-class foundation for their Gen AI data repository, offering a future-proof platform that unlocks the art of the possible. Vespa combines advanced, online, real-time information retrieval capabilities with robust scalability and high performance, ensuring your AI applications are ready to meet evolving demands.

One may think a simple cosine-similarity search with a black-box hybrid ranking algorithm will suffice for prototyping, but after some experimentation on a non-trivial, real enterprise dataset, most will quickly realize the limitations of similarity search. The options Vespa offers for hybrid search, starting with simple customized ranking expressions, will do wonders to improve the quality of information retrieval. As the organization matures, Vespa supports more advanced techniques built on more elaborate ML ranking models. Additional retrieval techniques native to Vespa, such as search personalization based on user profile or behavior, can also be leveraged. This opens up the realm of AI chatbots that appear to “know” their users.

Gen AI applications will also grow within the enterprise in terms of data stored and the number of users and queries. Vespa can accommodate web-scale workloads with advanced horizontal and vertical scalability and fault tolerance.

As many organizations gear up to take Gen AI applications into early production, overall cost is seen as a worrying risk factor: many of the cost models for GPU compute workloads like LLM inference are usage-based, billed by the number of tokens.
Being a web-scale platform, Vespa offers many optimizations, such as binary quantization for vector search and the ability to run GPU inference directly on top of your data, to help customers optimize costs as they scale their applications and achieve better cost governance for their Gen AI applications through better cost visibility.

With Vespa, you can start small, both in scale and in complexity of use cases.
You can start onboarding your RAG applications with simple semantic retrieval requirements. As your use cases become more complex and your data requirements increase, Vespa will grow with you, providing a robust foundation that supports both scale and sophistication. By adopting Vespa early on, you lay down a solid data infrastructure for AI, preventing the buildup of technical debt and ensuring your platform can evolve as your AI capabilities advance.


Next Steps

If you’re intrigued by the possibilities Vespa brings to your enterprise AI journey, there’s so much more to explore. Check out how Vespa can elevate your applications by visiting the Vespa Blog, where you’ll find in-depth articles, tutorials, and updates on all things Vespa.

And stay tuned for the Vespa for Dummies series that I’m planning to start: a perfect guide to unlocking Vespa’s full potential if you are new to the Information Retrieval domain. Also, please reach out to learn more and discover how Vespa can be the foundation for your next-generation AI applications.
