Vespa Blog

Deploying RAG at Scale: Key Questions for Vendors

Mon, 28 Oct 2024 00:00:00 +0000

Retrieval-augmented generation (RAG) has emerged as a vital technology for organizations embracing generative AI. By connecting large language models (LLMs) to corporate data in a controlled and secure manner, RAG enables AI to be deployed in specific business use cases, such as enhancing customer service through conversational AI. For those new to RAG, I recommend this BARC research note: Why and How Retrieval-Augmented Generation Improves GenAI Outcomes, available for free here.

In its recent Hype Cycle for Generative AI, Gartner identifies RAG as an early-stage technology driving innovation. However, it is approaching a peak of inflated expectations as ambitions for RAG outpace the practicalities of deploying it at scale. Vendor exuberance likely raises the bar on expectations—hype around RAG and generative AI is at a fever pitch!

Our discussions with large enterprises reveal that while generative AI pilots are proving value, scaling these solutions across the enterprise is a concern. Managers ask:

How can we scale from concept to enterprise-wide deployment?
Given the intensive processing demands of generative AI, how can I control costs?
How can I ensure compliance with data privacy and security regulations?
How can I integrate all relevant data sources beyond just vector databases?
How can I stay current with emerging technologies and best practices?

Before answering these questions, let’s introduce Vespa:

Vespa is a robust platform for developing real-time, search-based AI applications. Its large-scale distributed architecture enables efficient data processing, inference, and logic management, making it ideal for applications handling vast datasets and high volumes of concurrent queries.

From Concept to Enterprise Deployment

Proving the value of RAG in the lab is one thing, but scaling it across an entire enterprise introduces numerous challenges. These include integrating with existing data sources, ensuring strict data privacy and security, delivering required performance, and managing this complex large-scale run-time environment. Scalability is also a significant concern, as AI models must handle vast amounts of growing data and increasingly diverse use cases while maintaining high performance and reliability.

Vespa has been wrestling with these challenges since 2011—long before AI hit the mainstream. Originally developed to address Yahoo’s large-scale requirements, Vespa runs 150 applications integral to the company’s operations. These applications deliver personalized content across Yahoo in real-time and manage targeted advertisements within one of the world’s largest ad exchanges. Collectively, these applications serve an impressive user base of nearly one billion individuals, processing 800,000 queries per second.

Vespa offers two essential components for enterprise RAG deployment:

a comprehensive platform for developing generative AI applications
a scalable deployment architecture to address the demands of large enterprises.

The Platform Approach

Vespa is a fully integrated platform that offers all the essential components needed to build robust AI applications. It includes a versatile vector database, hybrid search capabilities, RAG, natural language processing (NLP), machine learning, and LLM support. The platform connects easily with existing operational systems and databases through APIs and SDKs, enabling AI applications to support your specific requirements. This allows organizations to integrate existing data infrastructure easily.

Vespa’s hybrid search capabilities enhance the accuracy of generative AI by combining various data types, including vectors, text, and both structured and unstructured information. Machine learning algorithms score and rank results to align with user intent, delivering precise and relevant answers. A key feature of the platform is its advanced natural language processing, which enables efficient semantic search. By understanding the meaning behind user queries rather than just matching keywords, Vespa supports vector search with embeddings and integrates custom or pre-trained machine learning models for more precise content retrieval.

Visual search is a hot topic, and Vespa offers intelligent document retrieval that combines images and text to enable detailed contextual searches. This creates a visually intuitive search experience that feels more natural and human-like.

An Execution Environment for Large Scale Enterprise Deployment

Vespa Cloud streamlines large-scale deployment, delivering high performance while simplifying performance management to ensure a seamless user experience. Applications running on Vespa dynamically adjust to fluctuating workloads, optimizing performance and cost and eliminating over-provisioning to keep costs in check and users happy.

Designed for high performance at scale, Vespa’s distributed architecture ensures instant query processing and advanced data management. It offers low-latency query execution, real-time data updates, and sophisticated ranking algorithms, enabling enterprises to efficiently process and utilize data across their operations without sacrificing speed or accuracy.

The platform’s robust, always-on architecture guarantees uninterrupted service. By distributing data, queries, and machine learning models across multiple nodes, Vespa achieves high availability and fault tolerance, ensuring continuous operation even under demanding conditions.

Security and compliance are core elements of Vespa’s design. With computation performed close to the data and distributed across nodes, the platform reduces network bandwidth costs and minimizes latency. It adheres to data residency and security policies, with encryption at rest and secure, authenticated internal communications between nodes. This comprehensive approach provides a secure and governed environment for deploying AI applications at scale.

Future Proofing

RAG deployments naturally evolve as experience and confidence grow and use case requirements become more sophisticated. What begins as a basic Q&A system for tasks like customer service chatbots can scale into dynamic, live knowledge bases that are rapidly consulted hundreds or even thousands of times, drawing on conclusions reached by the AI. Vespa supports this stepwise development, allowing for controlled, incremental rollouts that adapt to evolving needs.

Future-proofing also involves adopting the latest technologies and best practices. For example, Vespa enables visual search capabilities in eCommerce, where searches are driven by images, and supports models like ColPali for large-scale PDF search. With Vespa Cloud, our engineers continually integrate emerging RAG best practices, ensuring your enterprise stays ahead. We incorporate these best practices so that you can focus on your AI applications.

Summary

RAG is crucial for businesses adopting generative AI. However, scaling it across an enterprise presents challenges, including integration, data privacy, infrastructure management, and performance at scale. Vespa addresses these challenges with a comprehensive platform and scalable deployment architecture. Proven by Yahoo’s large-scale needs, Vespa Cloud supports AI applications with real-time, low-latency query processing, hybrid search, and advanced data processing.

This blog turned into a sales pitch! Sorry.

Announcing support for global significance models

Fri, 18 Oct 2024 00:00:00 +0000

We are excited to announce support for global significance models in Vespa. This feature improves ranking for streaming mode and ensures deterministic results in multi-node deployments using indexed mode.

Significance measures how rare a term is in a collection of documents. Rare terms, like “neurotransmitter”, are weighted higher in ranking than common terms, such as “body”. Significance is used in bm25 and nativeRank ranking functions.

By default, Vespa calculates significance values locally on each content node, based on the documents stored on that node. This makes ranking non-deterministic as the same query may be processed by different node groups or when documents are redistributed due to scaling or failure recovery. In addition, local significance is not supported in streaming search.

To address these scenarios, we introduce support for global significance models that share significance values across all content nodes, which also works in streaming search.
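To make the difference between local and global significance concrete, here is a minimal Python sketch. It uses the classic BM25 inverse document frequency formula as a stand-in, which is an assumption and not necessarily the exact formula Vespa applies. The helper names, the toy documents and the dictionary layout of `global_model` are hypothetical.

```python
import math

def significance(doc_freq, doc_count):
    # Classic BM25 IDF, used here as a stand-in; not necessarily Vespa's exact formula.
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def local_significance(term, node_docs):
    # Document frequency is counted only over the documents stored on one node.
    doc_freq = sum(1 for doc in node_docs if term in doc.split())
    return significance(doc_freq, len(node_docs))

# The same six-document collection, split across two content nodes.
node_a = ["body scan", "body mass", "body neurotransmitter release"]
node_b = ["body armor", "neurotransmitter uptake", "body neurotransmitter levels"]

sig_a = local_significance("neurotransmitter", node_a)
sig_b = local_significance("neurotransmitter", node_b)

# Hypothetical global model: one shared table of document frequencies.
global_model = {"doc_count": 6, "doc_freq": {"body": 5, "neurotransmitter": 3}}

def global_significance(term):
    return significance(global_model["doc_freq"][term], global_model["doc_count"])
```

In this toy setup the two nodes assign different local weights to the same term, while the global lookup returns a single value regardless of which node evaluates the query. The rarer term also receives the higher weight.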

Example

Global significance models are specified in the significance element in services.xml:

<container version="1.0">
    <search>
        <significance>
            <model model-id="significance-en-wikipedia-v1"/>
            <model url="https://some/uri/mymodel.multilingual.json" />
            <model path="models/mymodel.en.json.zst" />
        </significance>
    </search>
</container>

Vespa Cloud users have access to pre-built models, identified by model-id. In addition, all users can specify their own models by providing a url to an external resource or a path to a model file within the application package. Vespa provides a command line tool to generate model files from documents. The order in which the models are specified determines the model precedence, see model resolution for details.

The significance feature must also be enabled in the rank-profile section of the schema:

schema example {
    document example {
        field content type string {
            indexing: index | summary
            index: enable-bm25
        }
    }

    rank-profile default {
        significance {
            use-model: true
        }
    }
}

Experiments

We have conducted experiments to evaluate how a global significance model influences ranking quality, both in indexing and streaming search scenarios.

Three publicly available information retrieval datasets were used in these experiments:

NFCorpus - document retrieval dataset, medical domain, 3.6K docs and 323 test queries.
TREC-COVID - document retrieval dataset, about COVID-19, 171K docs and 50 test queries.
MS MARCO - passage retrieval dataset, not domain specific, with ca. 8.8M docs and 7K test queries.

The global significance model was generated from English Wikipedia, which includes approximately 4.4 million articles. We compared this model to the default local model in Vespa. Our ranking expression is the sum of BM25 scores applied to the title and text fields.

The experiments were executed locally in a docker container with one content node. When running on one content node the local model is generated from all documents fed to Vespa without non-determinism due to document distribution. This corresponds to a global model generated from all documents within the collection, which is ideal for large collections. However, for streaming search, there is no local model and significance is set to a constant value.

The results are presented in the table below showing Normalized Discounted Cumulative Gain at 10 (NDCG@10) for different datasets and scenarios (indexing and streaming):

From the table we can see that the global model substantially improves streaming search. With indexing, a slight decrease in ranking scores is observed, especially in a larger, general domain dataset like MS MARCO. This can be mitigated by generating a model from the documents themselves, rather than relying on one built from external data.

It is worth noting that the global model incurs no performance cost for indexing and querying.
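For readers unfamiliar with the metric, NDCG@10 can be sketched in a few lines of Python. This version uses the linear-gain variant (relevance divided by a log2 rank discount). Which variant the experiments used is not stated here, so treat this as an illustration of the metric rather than the evaluation code. The function names and relevance judgments are made up.

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: each relevance grade is discounted by log2 of its rank.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    # Normalize by the DCG of the ideal ordering of all judged documents.
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

judged = [2, 1, 0, 0]                        # graded relevance of every judged document
perfect = ndcg_at_k([2, 1, 0, 0], judged)    # best documents ranked first
swapped = ndcg_at_k([0, 2, 1, 0], judged)    # an irrelevant document ranked first
```

A ranking that places the most relevant documents first scores 1.0, and any ranking that pushes them down scores lower.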

Summary

Global significance improves streaming search and provides deterministic search results in multi-node deployments. For small document collections (fewer than 10k documents), models generated from large external data (e.g. Wikipedia) provide good results. For larger collections, a model generated from the documents within the collection is recommended. See the significance model documentation page for details. The feature is available in Vespa as of version 8.426.8.

Got questions? Join the Vespa community in Vespa Slack.

Vinted moves from Elasticsearch to Vespa

Tue, 01 Oct 2024 00:00:00 +0000

Vinted is Europe’s largest online C2C marketplace for second-hand fashion. In a new blog post they are reporting impressive results from migrating their search experience from Elasticsearch to Vespa:

The migration was a roaring success. We managed to cut the number of servers we use in half (down to 60). The consistency of search results has improved since we’re now using just one deployment (or cluster, in Vespa terms) to handle all traffic. Search latency has improved by 2.5x and indexing latency by 3x. The time it takes for a change to be visible in search has dropped from 300 seconds (Elasticsearch’s refresh interval) to just 5 seconds. Our search traffic is stable, the query load is deterministic, and we’re ready to scale even further.

Load distribution is evenly distributed across all nodes, meaning no more “hot nodes”. We’ve also increased our ranking depth by more than 3 times, up to 200,000 candidate items. This had a significant business impact, making our search results more relevant due to the increase in ranking depth. Plus, we’re saved some hassle, as we no longer need to toil about continually fine-mingling Elasticsearch shards and replica ratio.

(emphasis added by us). Read the full post over at Vinted’s blog.

Vespa Newsletter, September 2024

Mon, 30 Sep 2024 00:00:00 +0000

In the previous update, we mentioned Pyvespa and Vespa CLI improvements, improved multi-threading performance with text matching, Chinese segmentation, improved English stemming, and new ranking features. Today, we’re excited to share the following updates:

Optimized MaxSim with Hamming distance for multivector documents
Pyvespa
IDE support

Optimized MaxSim with Hamming distance for multivector documents

The Vespa Tensor Ranking framework lets users write arbitrarily complex ranking functions. Multi-vector MaxSim is increasingly important for many use cases, see scaling-colpali-to-billions for an example:

rank-profile max_sim {
  inputs {
    query(qt) tensor(q{}, v[128])
  }
  function max_sim(query, page) {
    expression {
      sum(
        reduce(
          sum(
            query * page , v
          ),
          max,
          p
        ),
        q
      )
    }
  }
  first-phase {
    expression: max_sim(query(qt), attribute(embedding))
  }
}

From Vespa 8.404, this operation is optimized with Hamming distance using int8 to support bit-resolution embeddings, a 32x reduction in memory usage compared to floats. Tests show a 30x reduction in latency.

In Vespa, the Multivector MaxSim similarity is a dot product between all the query token embeddings and all the chunk/patch embeddings, followed by a max reduce operation over the chunk/patch dimension, followed by a sum reduce operation over the query tokens. Read more in #32232 Optimize MaxSim with hamming (sum of max inverted hamming distances).
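The Hamming variant can be sketched in plain Python as follows. Two things here are assumptions: binarizing by sign (positive values become 1 bits), and scoring with an inverted Hamming distance of the form 1 / (1 + distance), which is inferred from the pull request title. The function names are hypothetical, and the sketch assumes vector lengths that are a multiple of 8.

```python
def pack_bits(vector):
    """Binarize a float vector by sign and pack each group of 8 bits into one byte."""
    packed = []
    for start in range(0, len(vector), 8):
        byte = 0
        for value in vector[start:start + 8]:
            byte = (byte << 1) | (1 if value > 0 else 0)
        packed.append(byte)
    return packed

def hamming(a, b):
    # Number of differing bits between two packed vectors.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def hamming_max_sim(query_vectors, page_vectors):
    # For each query vector keep its best (max) inverted distance, then sum over the query.
    return sum(
        max(1.0 / (1.0 + hamming(q, p)) for p in page_vectors)
        for q in query_vectors
    )

query = [pack_bits([0.3, -0.1, 0.8, -0.5, 0.2, 0.9, -0.7, 0.4])]
page = [
    pack_bits([0.1, -0.4, 0.6, -0.2, 0.5, 0.7, -0.3, 0.2]),    # same sign pattern as the query
    pack_bits([-0.1, 0.4, -0.6, 0.2, -0.5, -0.7, 0.3, -0.2]),  # opposite sign pattern
]
score = hamming_max_sim(query, page)
```

Packed this way, a 128-dimensional vector takes 16 bytes instead of the 512 bytes needed for 32-bit floats, which is where the 32x memory figure comes from.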

Pyvespa

Pyvespa 0.49 has been released, highlights:

Pyvespa query performance to Vespa Cloud is improved by eliminating redundant authentication calls - see #894 infer auth method without request.
With the new compress argument (defaults to “auto”), Pyvespa will compress feed and query operations if the body is larger than 1024 bytes - see #914 expose “compress” argument.
StructField now supports rank - see #913 Add rank to StructField.
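The "auto" behavior of the compress argument can be pictured with the following standard-library sketch. It is not Pyvespa's implementation: `prepare_body` is a hypothetical helper, and the use of gzip with a Content-Encoding header is an assumption. Only the 1024-byte threshold comes from the release note.

```python
import gzip
import json

COMPRESS_THRESHOLD_BYTES = 1024  # threshold stated in the release note

def prepare_body(payload, compress="auto"):
    """Hypothetical helper: returns (body_bytes, headers) for an HTTP request."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    # "auto" compresses only bodies above the threshold; True always compresses.
    should_compress = compress is True or (
        compress == "auto" and len(body) > COMPRESS_THRESHOLD_BYTES
    )
    if should_compress:
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    return body, headers

small_body, small_headers = prepare_body({"yql": "select * from sources * where true"})
large_body, large_headers = prepare_body({"fields": {"text": "vespa " * 500}})
```

Small query bodies are sent as-is, while large feed operations are compressed before they go over the wire.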

We have also revamped and added to the notebooks, particularly new ones about ColPali! We also recommend the original paper at ColPali: Efficient Document Retrieval with Vision Language Models.

For those interested in full release details, check out github.com/vespa-engine/pyvespa/releases. Thanks to these external contributors for adding to Pyvespa since the last newsletter!

IDE support

Vespa provides plugins for working with schemas and rank profiles in IDEs:

VSCode: VS Code extension
IntelliJ, PyCharm or WebStorm: Jetbrains plugin
Vim: neovim

See the documentation for details, and read the blog post for how our 2024 summer interns created them!

Other

From Vespa 8.411, Vespa uses ONNX Runtime 1.19.2

New posts from our blog

Other companies blogging about how and why they build on Vespa

From Vinted Engineering by Ernestas Poškus: Vinted Search Scaling Chapter 8: Goodbye Elasticsearch, Hello Vespa Search Engine. “The migration was a roaring success. We managed to cut the number of servers we use in half (down to 60). The consistency of search results has improved since we’re now using just one deployment (or cluster, in Vespa terms) to handle all traffic. Search latency has improved by 2.5x and indexing latency by 3x. The time it takes for a change to be visible in search has dropped from 300 seconds (Elasticsearch’s refresh interval) to just 5 seconds. Our search traffic is stable, the query load is deterministic, and we’re ready to scale even further.”
Guest blog post by Yuhong Sun, CoFounder/CoCEO Danswer: Why Danswer - the biggest open source project in Enterprise Search - uses Vespa. “We chose Vespa because of its richness of features, the amazing team behind it, and their commitment to staying up to date on every innovation in the search and NLP space. We look forward to the exciting features that the Vespa team is building and are excited to finalize our own migration to Vespa Cloud.”

Events

Haystack EU 2024 Berlin, September 30, Jo Kristian Bergum: What You See Is What You Search: Vision Language Models for PDF Retrieval
MLCon, New York, October 8, Kristian Aune: Retrieval and Adaptive In-Context Learning - the fastest flywheel
Data Science Connect, COLLIDE 2024, Atlanta, October 10, Kristian Aune: Using In-Context Learning to Get Started With AI, Safely
GenAI Summit Silicon Valley 2024, Santa Clara, November 1, Jon Bratseth
DataScienceSalon, San Francisco, November 7, Jon Bratseth: Roundtable

Thanks for joining us in exploring the frontiers of AI with Vespa. Ready to take your projects to the next level? Deploy your application for free on Vespa Cloud today.

Why Danswer - the biggest open source project in Enterprise Search - uses Vespa

Mon, 23 Sep 2024 00:00:00 +0000

Hello, I am Yuhong, the CoFounder of Danswer. We connect all of the disparate knowledge sources of a team (like Google Drive, Slack, Salesforce, etc.) and make all of this available via a single search/chat interface and help users digest the content with GenAI. As one might expect, having a quality and performant search is absolutely core to our value proposition. Today I’ll share why we as a team decided to migrate to Vespa and why it was worth it even when it meant ripping out the core of our previous stack.

Background on the Migration

At Danswer, we’re making Large Language Models more intelligent by bringing in the context of the user and the knowledge of the organization. The way we do this is by retrieving relevant context before passing it to the LLM (Retrieval Augmented Generation - RAG for short). As we scaled to large enterprise scales of data, we discovered that our needs for fine tuning our search pipeline started to exceed the capabilities of our previous vector DB.

Challenges of Enterprise Search and Vespa to the Rescue

Custom Boost and Decay Functions

We previously used a vector only search but we discovered that a lot of team specific terms were critical for providing a quality experience to our users. Internal names like “Meechum” or “Foundry” were common but had no general English representation that the deep learning models could capture. So in response, we added a keyword search that was completely separate from the vector component. But this led to issues of weighting when the two could not be considered together. Vespa allows for an easy normalization across multiple search types and finally allowed us to achieve the accuracy we wanted.
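The weighting problem can be illustrated with a short Python sketch that normalizes keyword and vector scores to the same range before combining them. Min-max normalization and the equal weighting are assumptions chosen for illustration. This is not a description of Danswer's actual ranking setup, and all names and scores are made up.

```python
def min_max_normalize(scores):
    # Rescale scores to [0, 1] so that different score types become comparable.
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_scores(keyword, vector, alpha=0.5):
    # alpha weights the keyword signal, (1 - alpha) the vector signal.
    kw, vec = min_max_normalize(keyword), min_max_normalize(vector)
    docs = set(kw) | set(vec)
    return {d: alpha * kw.get(d, 0.0) + (1 - alpha) * vec.get(d, 0.0) for d in docs}

keyword = {"doc1": 12.0, "doc2": 3.0, "doc3": 0.5}    # e.g. BM25 scores, unbounded
vector = {"doc1": 0.62, "doc2": 0.91, "doc3": 0.60}   # e.g. cosine similarities
combined = hybrid_scores(keyword, vector)
best = max(combined, key=combined.get)
```

Without the normalization step, the unbounded keyword scores would dominate the vector similarities simply because of their scale.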

As our pipeline improved, we also introduced other features like time based decay. Since internal documents aren’t always cleaned up correctly, it’s fairly common to run into multiple versions of the same document with conflicting information. Our users asked us to support decaying the relevance of documents if nobody touches or reads them for a long period. At the search engine level, this translated to a requirement to have flexible document ranking functions during the search step. We needed to take a “time last touched” attribute and apply a decay based on the difference of the attribute value and the current time. Luckily, Vespa has one of the most flexible syntaxes in terms of ranking functions and there were even plenty of examples of this exact use case, which made the implementation really easy.
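As a rough picture of what such a decay looks like, here is a small Python sketch. The exponential half-life form and the 90-day default are assumptions for illustration. The actual decay function and its parameters are not given here, and `decayed_score` is a hypothetical name.

```python
SECONDS_PER_DAY = 86400.0

def decayed_score(base_score, last_touched_epoch, now_epoch, half_life_days=90.0):
    # The score halves every `half_life_days` since the document was last touched.
    age_days = max(0.0, (now_epoch - last_touched_epoch) / SECONDS_PER_DAY)
    return base_score * 0.5 ** (age_days / half_life_days)

now = 1_700_000_000
fresh = decayed_score(1.0, now, now)               # touched just now
old = decayed_score(1.0, now - 90 * 86400, now)    # untouched for one half-life
```

Applying the decay inside the ranking function, rather than as a post-processing step, means stale documents are demoted before the top results are cut off.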

To capture both overarching contexts and specific details in the documents we index, we also implemented a multipass approach to indexing. This means every document is split into different sections for processing and each pass has a different size context. Vespa is (as far as I know) the only hybrid search engine that is capable of doing multiple vector embeddings for a single document. This optimization prevents duplication of documents for every section and every chunk size which greatly reduces the resources necessary to serve the document index. Since Danswer is largely a self-hosted software (for data security purposes), the savings on resources allows for more teams to use Danswer even if they don’t have access to powerful servers or large budgets for expensive cloud instances.
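The multipass idea can be sketched as follows. Chunk sizes measured in words and the specific sizes are assumptions. The real pipeline may split on tokens or sections instead, and the function names are hypothetical.

```python
def chunk(words, size):
    # Split a list of words into consecutive chunks of at most `size` words.
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def multipass_chunks(text, sizes=(8, 32)):
    # One pass per context size: small chunks for details, large chunks for overall context.
    words = text.split()
    return {size: chunk(words, size) for size in sizes}

text = " ".join(f"w{i}" for i in range(40))
passes = multipass_chunks(text)
```

Each chunk from each pass would be embedded separately, and all of the resulting vectors can be stored on the same document rather than on one document copy per chunk.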

What brought us to Vespa

As an open source project we immediately limited our choices down to a smaller set of self-hosted options. At the time we were using two separate search engines (one for vector and one for keyword), both of which were relatively new players in the space. We were actually pleasantly surprised at how stable these new search engines were; however, both were built around being “easy to get going” and their feature sets were pretty limited. We tried to find hacks around the problems presented above (for example applying the time decay after the initial search as a post processing step), but these workarounds often suffered in accuracy once the scale of documents increased past several million. Of the most established projects, we were looking at OpenSearch, ElasticSearch, Weaviate and Vespa. Vespa was the clear leader in several ways:

The most advanced NLP options including multiple vectors per doc, late interaction models like ColBERT, different nearest neighbor implementations, etc. We knew we would be using the latest techniques so picking a project that was the most on the cutting edge felt like an easy choice.
Vespa is permissively licensed. The entire repo is Apache-2.0 licensed, which means it can be used for anything, including building commercial software.
Scale would never become an issue with Vespa. We were serving scales of up to tens of millions of documents per customer, but Vespa was built for internet scales of data (previously serving Yahoo’s search).

Challenges

Vespa is definitely a developer-facing software. If you’re more of a weekend hacker whose goal is to prototype something quickly and make a Medium or LinkedIn post about it, Vespa may not be the right choice for you. The downside of flexibility is that there is inherently more complexity with all of the configuration, deployment and even query/indexing options. We’re still in the process of understanding their multi-node Kubernetes deployments but luckily Vespa Cloud provides a managed service where all of this complexity is instead managed by the experienced Vespa team. For Danswer Cloud, we’re currently in the process of migrating from our self-managed Vespa running on AWS to Vespa Cloud instead.

Danswer Chat powered by Vespa

Summary

At Danswer we’re trying to make all the teams in the world more efficient with context aware GenAI assistants. For us, the core value that we provide on top of the LLMs is the ability to bring in the context of the user and the unique knowledge of the team. This of course necessitates a high quality search that remains accurate and performant at scale. We chose Vespa because of its richness of features, the amazing team behind it, and their commitment to staying up to date on every innovation in the search and NLP space. We look forward to the exciting features that the Vespa team is building and are excited to finalize our own migration to Vespa Cloud.

AI Needs More Than a Vector Database

Mon, 23 Sep 2024 00:00:00 +0000

Interest in vector databases is skyrocketing, as evidenced by Google Trends data. In its latest report, Vector Databases Landscape, Q2 2024 Forrester highlights over 20 vector databases, classifying them into two main categories: specialized native vector databases and multimodal databases that integrate vector storage within a broader data ecosystem.

Native vector databases are designed for optimal scale and performance, while multimodal databases offer the versatility to handle multiple data types, reducing the complexity of managing separate systems. For a deeper dive into leading native vector databases, refer to the GigaOm Sonar Report for Vector Databases.

A vector database is a specialized database designed to store, manage and query high-dimensional vectors, which are crucial for applications that retrieve content by semantic similarity. Emerging in the late 2010s, interest in vector databases has been driven by generative AI, as they enable fast and accurate similarity searches essential for tasks like recommendation systems, natural language processing and image recognition, thereby significantly enhancing AI application quality and versatility.

While vector databases are considered the key to generative AI, vectors alone are just one piece of the larger puzzle. Achieving relevant answers in generative AI relies on a robust and comprehensive search capability powered by machine learning algorithms that detect patterns in historical data, predict outcomes, identify anomalies and recommend actions.

This must be done across billions of rapidly changing data points, with results delivered instantly (<100 milliseconds) while supporting large user populations, potentially executing thousands of queries per second. Although some data may be vectors, most business applications require integrating and analyzing unstructured data, such as PDFs, alongside traditional structured data to generate vectors.

Given this complexity, focusing solely on a vector database can miss the broader picture. According to Forrester, you can choose a best-of-breed vector database but must then integrate the necessary components, such as machine learning, support for non-vector data types, and workload management for performance and high concurrency. Or you can choose a multimodal database that at least provides broader data types but requires fitting in with an application set it was never designed to support.

Enter the AI Database

A new type of database is emerging: the AI database. An AI database is a multipurpose platform that, in addition to vectors, also manages structured and unstructured data. It applies AI models to various data formats, combining signals for more accurate outputs. The AI database enhances computing efficiency and supports scalability by consolidating models and data types. It organizes data by clustering similar vectors in query results and supporting compliance while also searching tables, text and vectors for specific values, document matches and similarity searches to generate inferences using AI models.

AI databases support three primary AI model types: functions approximating machine learning (ML), natural language processing (NLP) and generative AI.

ML models find patterns in historical data to predict trends, identify anomalies, rank/score results and recommend actions. They primarily select data like tables, text or images for further use.
NLP models interpret and generate text or speech for tasks like translation or sentiment analysis, mainly processing text files.
Generative AI models generate content such as text, images, audio or video based on existing data, predicting the next elements in a sequence.

These models, often hosted and run within the AI database, learn patterns, make inferences and create outputs based on the data they receive. If you want to know more about AI databases, I recommend this report from BARC for a deeper dive into the AI database.

The AI database represents a significant advancement, yet it remains only a partial solution due to its lack of application logic and runtime management. To meet generative AI’s demanding scale and latency requirements, substantial effort is needed to integrate tools and optimize runtime performance. The most effective approach is a platform that seamlessly combines data, application logic and large-scale execution, offering a comprehensive solution that addresses all these critical needs.

Vespa: An Open Source AI Engineer’s Platform

Vespa.ai is an open source platform for developing and running real-time AI-driven applications for search, recommendation, personalization and retrieval-augmented generation (RAG). Vespa efficiently manages data, inference, and logic, supporting applications with large data volumes and high concurrent query rates. It’s available as a managed service and open source. Learn more about Vespa here.

This blog was originally published in The New Stack

Scaling ColPali to billions of PDFs with Vespa

Fri, 20 Sep 2024 00:00:00 +0000

This blog post deep dives into scaling “ColPali: Efficient Document Retrieval with Vision Language Models” ¹ to large collections of documents. We demonstrate how we can use a phased retrieval and ranking pipeline in Vespa to scale ColPali to billions of documents. To do this, we introduce a new similarity function, a Hamming-based MaxSim that works with binary vectors produced by binary quantization (BQ). This technique allows us to scale ColPali to large collections of documents while maintaining high accuracy, with a significant reduction in computational cost and vector storage requirements. The suggested deployment also supports real-time indexing and CRUD operations.

Introduction

ColPali surpasses traditional text-based retrieval methods by leveraging a vision-capable language model (PaliGemma) to “see” not only the text but also the visual elements of a page, including figures, tables and infographics.

ColPali is short for Contextualized Late Interaction over PaliGemma and builds on two key concepts:

Contextualized Vision Embeddings from a Vision Language Model (VLM): ColPali generates contextualized embeddings directly from images of pages, using PaliGemma, a powerful VLM with strong visual text understanding capabilities.
Late Interaction: ColPali uses a late interaction similarity function to compare query and document embeddings at query time, allowing for interaction between all the image grid cell vector representations and all the query text token vector representations.

For a longer introduction to ColPali, see Beyond Text: The Rise of Vision-Driven Document Retrieval for RAG

Do not sleep on VLMs

VLMs have gained popularity for their ability to understand and generate text based on combined text and visual inputs. VLMs display enhanced capabilities in Visual Question Answering (VQA) ², captioning, and document understanding tasks.

The ColPali model builds on this foundation to tackle the challenge of document retrieval, where the goal is to find relevant documents (pages) based on a user query. The top-k retrieved pages from ColPali could be used for further processing, like summarization or question answering.

VLM document understanding example. The input to the VLM is a text question and an image. The VLM generates an answer (output) based on the question and the image. This example is from a Hugging Face demo Space.

The above illustrates how a VLM can read complex infographics or text rendered in images. The above uses the Qwen/Qwen2-VL-7B-Instruct VLM.

In this example, we provide the VLM with an image of a document page along with a question, and it generates an answer. However, in the context of document retrieval, the goal is to efficiently retrieve the most relevant documents based on a user query, even when working with large collections. Repeating the above process for every page in the collection is not practical: at large scale, it could take days or even years for a single query.

Instead, during offline processing of the PDFs, we obtain embeddings for all the pages in the collection, and at query time, run a similarity search over the page embeddings to find the k (e.g. 10) most relevant pages.

This is the approach taken by ColPali and the late interaction similarity mechanism. After having retrieved relevant pages for a query, we can feed the retrieved pages to a VLM for further processing, like summarization or question answering, as demonstrated in the example above.

Scaling VLM retrieval with ColPali

ColPali produces tensors for both the query and the page (technically, the image of a page). The query tensor is made up of text tokens, while the page tensor is made up of image grid cell vectors.

ColPali Page Embeddings: ColPali generates contextualized embeddings solely from images of PDF pages, bypassing the need for text extraction, OCR, and layout analysis. Each image is divided into a 32x32 grid of 1024 patches, where each patch is projected into a 128-dimensional latent vector space. In addition to these patch tokens, there are six instruction text tokens that are prepended to the image input (“Describe the image.”). In total, a single screenshot of a PDF page is represented by 1030 128-d vectors.

ColPali Query Text Embeddings: The query text is tokenized and each token is represented in the same 128-dimensional latent vector space as the patch embeddings. The number of query tokens is dynamic (not fixed-length like the patch embeddings). The query tokens also include a prefix instruction, “Question: ”, and mask padding tokens. This is similar to the query expansion mask tokens used in the ColBERT architecture for the text-only domain.

Using these tensor representations, we can score documents for a query using the late interaction similarity mechanism.

Scoring with MaxSim (The Late Interaction part of ColPali)

ColPali employs a so-called late interaction similarity mechanism, where query and document embeddings are compared at query time using a similarity function, allowing for interaction between all the page patch vector representations and the query text token vector representations.

The late interaction similarity mechanism allows us to generate embeddings offline for all the pages in a collection and, at query time, score and rank the pages based on the similarity between the query and the document embeddings.

Another name for the similarity mechanism used in ColPali is MaxSim, but a more accurate description would be SumMaxSim, as it involves an outer sum operation over the query tokens.

MaxSim ranking of the indexed pages for a query. We want to scale this scoring process to billions of pages.

The MaxSim similarity mechanism is a dot product between all the query token embeddings and all the patch embeddings, followed by a max reduce operation over the patch dimension, followed by a sum reduce operation over the query tokens.

We can represent the tensors and express the MaxSim similarity function using the Vespa schema language.

schema pdf_page {
  document pdf_page {
    field embedding type tensor<float>(p{}, v[128]) {
      indexing: attribute
    }
  }

  model max_sim {
    inputs {
      query(qt) tensor<float>(q{}, v[128])
    }
    function max_sim(query, page) {
      expression {
        sum(
          reduce(
            sum(
              query * page, v
            ),
            max,
            p
          ),
          q
        )
      }
    }
    first-phase {
      expression: max_sim(query(qt), attribute(embedding))
    }
  }
}

Both the embedding field and the input query tensor (query(qt)) are examples of mixed tensors, combining a named mapped dimension (p or q) with a named indexed bound dimension (v). The v dimension has a fixed size of 128, representing the vector dimension.

The q and p dimensions are unbound and allow representing any number of vectors. Read more in the tensor guide.

Let us zoom in on the MaxSim similarity function:

function max_sim(query, page) {
  expression {
    sum(
      reduce(
        sum(
          query * page , v
        ),
        max, 
        p
      ),
      q
    )
  }
}

This function returns a single scalar value representing the MaxSim similarity between the query and the page. The inner sum of query*page computes the dot product between all the query token embeddings and all the patch embeddings, which results in a similarity matrix of size [q, p]. We then apply a reduce aggregation over the p dimension using max, and finally sum over the q dimension.

This function is configured in the schema as a Vespa ranking expression to rank documents based on the similarity between the query and the document. See a MaxSim example in the Vespa tensor playground.

In PyTorch, a popular deep learning framework, we could express the MaxSim as a function as follows using the torch.einsum function:

import torch

def max_sim(query, page):
  """
  Computes the MaxSim similarity between query and page tensors using einsum.

  Args:
    query: Tensor of shape [querytoken, 128] representing query token vectors.
    page: Tensor of shape [patch, 128] representing page patch vectors.

  Returns:
    A scalar tensor with the MaxSim similarity score.
  """
  # [q, p] similarity matrix -> max over patches -> sum over query tokens.
  return torch.sum(torch.einsum('qv,pv->qp', query, page).max(dim=1).values, dim=0)

Explanation of the PyTorch max_sim function above:

torch.einsum('qv,pv->qp', query, page): Calculates the dot product between every query token vector and every patch vector, resulting in a similarity matrix of size [q, p].
- qv represents the query tensor with dimensions querytoken (q) and vector (v).
- pv represents the page tensor with dimensions patch (p) and vector (v).
- ->qp specifies the output tensor with dimensions querytoken (q) and patch (p).
.max(dim=1).values: Finds the maximum value along each row (each query token) of the similarity matrix. Returns a tensor of size [q] containing the maximum similarity scores for each query token.
torch.sum(..., dim=0): Sums the maximum similarity scores across all query tokens, producing a single scalar value representing the overall MaxSim similarity between the query and the page.

Note that PyTorch does not support unbound dimensions (mapped) or dimension names like Vespa tensors, so we have to specify the dimension sizes.
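
To make the three steps concrete, here is a minimal pure-Python sketch of the same computation on tiny made-up vectors. The 2-dimensional vectors and their values are illustrative assumptions only; real ColPali vectors are 128-dimensional.

```python
def max_sim(query, page):
    """MaxSim over plain Python lists of equal-length vectors."""
    total = 0.0
    for q_vec in query:
        # Dot product of this query token against every patch vector.
        sims = [sum(a * b for a, b in zip(q_vec, p_vec)) for p_vec in page]
        # Keep only the best-matching patch for this query token.
        total += max(sims)
    return total

# Toy data: 2 query tokens and 3 patches, 2 dimensions each (illustrative values).
query = [[1.0, 0.0], [0.0, 1.0]]
page = [[0.5, 0.1], [0.2, 0.9], [0.3, 0.3]]

score = max_sim(query, page)
```

Here the first query token matches the first patch best (0.5) and the second query token matches the second patch best (0.9), so the score is 0.5 + 0.9 = 1.4.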

Scaling MaxSim

MaxSim FLOPs scale roughly with q * p * v. We can reduce the FLOPs by reducing the number of query tokens (q), the number of patch vectors (p), or the vector dimensionality (v).
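
As a rough illustration of this scaling, the sketch below counts multiply-add operations per page. The helper name is hypothetical, and the 20 query vectors and 1030 patch vectors are the figures used in the performance test later in this post.

```python
def max_sim_cost(num_query_tokens, num_patches, dim, num_pages=1):
    """Rough count of multiply-add operations for MaxSim over num_pages pages."""
    return num_query_tokens * num_patches * dim * num_pages

full = max_sim_cost(20, 1030, 128)
# Halving the number of patch vectors (e.g. by pooling) halves the cost.
half_patches = max_sim_cost(20, 515, 128)
```

Because the cost is a plain product, halving any one of q, p, or v halves the work.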

We are interested in scaling a pre-trained model checkpoint and architecture to larger collections of documents, so reducing v is not an option here: changing the dimensionality would require retraining the model.

We can reduce the number of patch vector embeddings by pooling or clustering. See the excellent work from Answer.ai on reducing the number of document vectors in their blog post: A little pooling goes a long way for multi-vector representations.
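
As a simplified illustration of the pooling idea, the sketch below averages runs of consecutive patch vectors. This is one simple variant chosen for brevity and not necessarily the method evaluated in the Answer.ai post; the function name and pool factor are assumptions.

```python
def mean_pool(patch_vectors, pool_factor):
    """Average each run of `pool_factor` consecutive vectors into one vector."""
    pooled = []
    for start in range(0, len(patch_vectors), pool_factor):
        group = patch_vectors[start:start + pool_factor]
        dim = len(group[0])
        pooled.append([sum(vec[i] for vec in group) / len(group) for i in range(dim)])
    return pooled

# Toy data: 5 patch vectors with 2 dimensions each (illustrative values).
patches = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0], [5.0, 5.0]]
pooled = mean_pool(patches, pool_factor=2)
```

With a pool factor of 2, the five input vectors become three, and the last vector is kept as-is because it has no partner.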

Another direction is to use a cheaper similarity function than dot product, in other words, a function that requires fewer CPU instructions than float dot products. This can also be used in combination with pooling or clustering to reduce the number of vectors to score.

An alternative to a float dot product is the inverted hamming distance, which is a simple and fast distance function. The hamming distance is the number of positions at which the corresponding bits are different, and can be computed with fewer CPU instructions than a float dot product.

To use hamming distance in Vespa, we need to represent the ColPali text and patch vectors as binary vectors. With the int8 tensor cell type, we can “pack” the 128-dim float vectors into 16-dim int8 tensors, where each int8 cell represents 8 bits.
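
The packing step can be sketched in plain Python as follows. Thresholding at zero and packing the most significant bit first are assumptions made for this illustration; the function name is hypothetical.

```python
def binarize_and_pack(vector):
    """Threshold each float at 0 and pack 8 bits into one signed int8 value."""
    assert len(vector) % 8 == 0
    packed = []
    for start in range(0, len(vector), 8):
        byte = 0
        for value in vector[start:start + 8]:
            # Shift in one bit per float: 1 if positive, else 0.
            byte = (byte << 1) | (1 if value > 0 else 0)
        # Reinterpret the unsigned byte (0..255) as signed int8 (-128..127).
        packed.append(byte - 256 if byte > 127 else byte)
    return packed

# 128 illustrative float values (the same 8 values repeated 16 times).
floats = [0.3, -0.1, 0.7, 0.2, -0.5, -0.9, 0.4, 0.1] * 16
packed = binarize_and_pack(floats)
```

Each group of eight floats here maps to the bit pattern 10110011, which is 179 as an unsigned byte and -77 as an int8, so the 128 floats become 16 int8 values.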

The following demonstrates how we can use the hamming distance instead of float dot product as the core similarity in MaxSim. Since hamming is a distance metric (closer to 0 is more similar), we invert the distance to obtain a similarity score: 1/(1 + hamming(query, page)).

(Note that hamming is a built in Vespa ranking expression function):

schema pdf_page {
  document pdf_page {
    field embedding type tensor<int8>(p{}, v[16]) {
      indexing: attribute
    }
  }

  model max_sim {
    inputs {
      query(qt) tensor<int8>(q{}, v[16])
    }
    function max_sim(query, page) {
      expression {
        sum(
          reduce(
            1/(1 + sum(
              hamming(query, page), v)
            ),
            max,
            p
          ),
          q
        )
      }
    }

    first-phase {
      expression: max_sim(query(qt), attribute(embedding))
    }
  }
}

We use binary quantization (BQ) to convert the floating point 128-dimensional vectors to 128-bit vectors, represented in the schema as a 16-d vector with cell type int8. See our blog post on Matryoshka 🤝 Binary vectors: Slash vector search costs with Vespa for more information about hamming distance and binary quantization.

With this approach, we can reduce the number of CPU instructions required to compute the similarity between the query and the document, making the MaxSim ranking process more efficient.
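
For readers who prefer Python to ranking expressions, the hamming-based MaxSim can be sketched as below. This is an illustration of the arithmetic only, not how the Vespa tensor engine evaluates it; the 2-cell vectors are made-up toy values instead of 16-cell ones.

```python
def hamming(a, b):
    """Number of differing bits between two packed int8 vectors."""
    # `& 0xFF` maps the XOR of signed int8 values to its unsigned byte pattern.
    return sum(bin((x ^ y) & 0xFF).count("1") for x, y in zip(a, b))

def max_sim_hamming(query, page):
    """Sum over query tokens of the best inverted hamming similarity."""
    return sum(
        max(1.0 / (1.0 + hamming(q_vec, p_vec)) for p_vec in page)
        for q_vec in query
    )

# Toy data: 2 query token vectors and 3 patch vectors, 2 int8 cells each.
query = [[0, -1], [15, 0]]
page = [[0, -1], [0, 0], [7, 0]]
score = max_sim_hamming(query, page)
```

The first query vector has an exact match (distance 0, similarity 1.0), while the best match for the second differs in one bit (similarity 0.5), giving a score of 1.5.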

Optimizing the MaxSim with hamming

In this Vespa performance test, we measure and track the difference in latency between the two versions of the MaxSim function.

This test simulates ranking 1,000 pages for a query with 20 vectors and 1030 vectors per page, using a single CPU core. Note that latency can be reduced almost linearly with the number of CPU cores by using Vespa’s intra-query multi-threading.

In the performance test we measure the end-to-end latency of the two MaxSim versions. With a single ranking thread, latency is a good proxy for computational cost.

The blue line represents the hamming distance version of the MaxSim function, while the orange line represents the float dot product version. As we can see, the hamming version is about 3.5 times faster than the float dot product version. This means that we can rank more pages for the same latency budget or lower the latency for the user query. For services that don’t necessarily need high query throughput, but where we want to lower latency for a better user experience, we can throw CPU cores at the problem using intra-query multithreading to reduce latency further, e.g. by using 2 or 4 CPU cores we can reduce latency by a factor of about 2 or 4, respectively.

With 1000 pages, 20 query token vectors, and 1030 patch vectors, the MaxSim involves about 20M 128-dimensional dot products or 128-bit hamming distances. For the hamming version, with 100ms latency, this translates to about 200M 128-bit hamming distances per second per CPU core/thread.
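
The arithmetic behind these figures can be checked with a few lines of Python, using only the numbers quoted above.

```python
pages = 1000
query_vectors = 20
patch_vectors = 1030
latency_seconds = 0.100  # the 100ms latency quoted above

# One distance computation per (page, query vector, patch vector) triple.
distances_per_query = pages * query_vectors * patch_vectors
distances_per_second = distances_per_query / latency_seconds
```

This gives 20.6 million distance computations per query and about 206 million per second on a single thread, matching the rounded 20M and 200M figures.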

As part of this work with scaling ColPali, we optimized the evaluation of the 1/(1 + sum(hamming(query, page), v)) expression. This is done by recognizing the specific use case, allowing for HW optimized evaluation by the Vespa tensor engine. This type of optimization was already in place for the float dot product version after our work with ColBERT.

Scaling to Billions of Documents

So far we have discussed how to compute the MaxSim similarity function without considering the scale of the document collection. Performing MaxSim over 1000 pages in an index is not a huge problem. We can brute-force score and rank all of them for every query.

But at larger scale (either #documents or #queries), we need to distribute the computation across multiple nodes and also find a way to perform candidate selection (retrieval) so that we avoid scoring all the pages (brute-force) with the MaxSim expression. This is where Vespa’s phased retrieval and ranking pipeline comes into play.

The standard retrieval strategy for ColBERT and ColPali is to use approximate nearestNeighbor search to retrieve candidate documents based on the query token embeddings, and then rank the retrieved documents using MaxSim.

For example, if we have 20 query token vectors, we do a candidate search for each of the 20 query token vectors, and the union of the candidate sets is ranked using MaxSim. Each nearest neighbor operation finds the k pages whose closest patch vector is nearest to the query token vector.
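
The candidate selection step can be sketched as follows. An exact brute-force search stands in for the approximate nearest neighbor search, and the dictionary of page ids to packed patch vectors is a hypothetical in-memory structure used only for illustration.

```python
def hamming(a, b):
    """Number of differing bits between two packed int8 vectors."""
    return sum(bin((x ^ y) & 0xFF).count("1") for x, y in zip(a, b))

def nearest_pages(query_vec, pages, k):
    """Exact stand-in for ANN search: k pages whose closest patch is nearest."""
    best = {pid: min(hamming(query_vec, p) for p in patches)
            for pid, patches in pages.items()}
    # Sort by distance, breaking ties by page id for a deterministic result.
    return sorted(best, key=lambda pid: (best[pid], pid))[:k]

def candidate_union(query_vecs, pages, k):
    """Union of the per-query-token candidate sets, to be ranked with MaxSim."""
    candidates = set()
    for q_vec in query_vecs:
        candidates.update(nearest_pages(q_vec, pages, k))
    return candidates

# Toy collection: page id -> list of packed patch vectors (2 int8 cells each).
pages = {
    "a": [[0, 0], [1, 0]],
    "b": [[-1, -1]],
    "c": [[15, 15], [3, 0]],
}
candidates = candidate_union([[0, 0], [-1, -1]], pages, k=1)
```

Each query token vector contributes its own nearest page, so the union here contains pages "a" and "b"; only those candidates would then be scored with MaxSim.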

Vespa supports multi-vector HNSW indexing, so we can index multiple patch vectors per page, and perform approximate nearest neighbor search over all the patch vectors in a single query.

field embedding type tensor<int8>(p{}, v[16]) {
  indexing: attribute | index
  attribute {
    distance-metric: hamming
  }
  index {
    hnsw {
      max-links-per-node: 32
      neighbors-to-explore-at-insert: 200
    }
  }
}

We use the hamming distance metric for this index. We enable HNSW by adding index to the embedding field in the schema. Notice also how we configure the distance-metric and the HNSW index hyperparameters, which are tradeoffs between speed and accuracy (compared to exact nearest neighbor search).

For a single query token vector, we could do a nearest neighbor search over all the patch vectors using the nearestNeighbor query operator in the Vespa query language:

{
  "yql": "select documentid, embedding from pdf_page where {targetHits:100}nearestNeighbor(embedding,q1)",
  "input.query(q1)": [23,-34,12,45,67,23,45,12,45,67,23,45,12,45,67,23],
  "ranking": "vector_similarity_only",
  "hits": 100
}

This query would return the 100 closest pages to the query token vector q1 based on the hamming distance to the closest patch vector in the page.

However, we do not want to run one nearest neighbor query for each query token vector, but rather run a single Vespa query that exposes the top-k pages for all the query token vectors to a Vespa ranking expression. This allows scoring and ranking using MaxSim without transferring the page vectors to some external ranking service.

Why? Consider the case where we have 20 query token vectors and want to retrieve the top-100 pages for each query token vector for MaxSim re-ranking. We would have up to 2K (20x100) pages to rank using a MaxSim implementation. Each page is made up of 1030 vectors of 16 bytes each (int8 values), so each user query request would need to transfer 2K x 1030 x 16 bytes ≈ 32MB of data to the “MaxSim ranking service”. This is sequential latency added before we start scoring the pages with MaxSim.

At any meaningful query throughput scale, fetching this amount of vector data per user query would quickly become a scaling bottleneck even with a fast high-throughput network.
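
A quick back-of-the-envelope calculation shows how fast this adds up. The per-query numbers come from the text above; the 100 queries per second is a hypothetical load chosen only for illustration.

```python
query_tokens = 20
target_hits = 100
vectors_per_page = 1030
bytes_per_vector = 16  # 128 bits packed into 16 int8 cells

max_pages = query_tokens * target_hits
bytes_per_query = max_pages * vectors_per_page * bytes_per_vector
megabytes_per_query = bytes_per_query / 1_000_000

queries_per_second = 100  # hypothetical load, not a figure from this post
bytes_per_second = bytes_per_query * queries_per_second
```

The exact product is about 33MB per query, in line with the rough 32MB figure, and at the assumed 100 queries per second it would mean moving roughly 3.3GB of vector data over the network every second.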

Instead, we want to move the tensor computations (here MaxSim) to the data and perform the candidate selection and ranking in a single Vespa query request using multiple nearest neighbor operators, combined using boolean operators (which also allows combining ColPali/ColBERT with query filters).

{
  "yql": "select documentid from pdf_page where ({targetHits:100}nearestNeighbor(embedding,q1)) or ({targetHits:100}nearestNeighbor(embedding,q2))",
  "input.query(q1)": [23,-34,12,45,67,23,45,12,45,67,23,45,12,45,67,23],
  "input.query(q2)": [12,4,87,23,45,12,45,67,23,45,12,45,67,23,45,12],
  ....
  "ranking": "max_sim",
  "hits": 10
}
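
Writing such a request by hand gets tedious with 20 query tokens, so it is natural to generate it. The helper below is a minimal sketch: the function name is hypothetical, and the mixed query(qt) tensor needed by the max_sim expression is left out because its request format is elided in the example above.

```python
def build_query(packed_query_vectors, target_hits=100, hits=10):
    """Build a request body with one nearestNeighbor operator per query token."""
    body = {"ranking": "max_sim", "hits": hits}
    clauses = []
    for i, vector in enumerate(packed_query_vectors, start=1):
        name = f"q{i}"
        # Doubled braces produce literal { and } in the YQL annotation.
        clauses.append(f"({{targetHits:{target_hits}}}nearestNeighbor(embedding,{name}))")
        body[f"input.query({name})"] = vector
    body["yql"] = "select documentid from pdf_page where " + " or ".join(clauses)
    return body

# Two illustrative 16-cell int8 query token vectors.
request = build_query([[23, -34] + [0] * 14, [12, 4] + [0] * 14])
```

The generated yql string has the same shape as the hand-written query above, with one or-combined nearestNeighbor clause and one input.query(qN) entry per query token vector.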

A Vespa schema for scalable retrieval and ranking of ColPali could look like this. Note that all the input query tensors are defined in the model section of the schema (model is also an alias for rank-profile). We can define as many input query tensors as we want or need. Defining query tensors has no overhead, as the input tensors are only used when a query is executed.

schema pdf_page {
  document pdf_page {
    field embedding type tensor<int8>(p{}, v[16]) {
      indexing: attribute | index
      attribute {
        distance-metric: hamming
      }
      index {
        hnsw {
          max-links-per-node: 32
          neighbors-to-explore-at-insert: 200
        }
      }
    }
  }

  model max_sim {
    inputs {
      query(qt) tensor<int8>(q{}, v[16])
      query(q1) tensor<int8>(v[16])
      query(q2) tensor<int8>(v[16])
      query(q3) tensor<int8>(v[16])
      query(q4) tensor<int8>(v[16])
      ....
    }
    function max_sim(query, page) {
      expression {
        sum(
          reduce(
            1/(1 + sum(
              hamming(query, page), v)
            ),
            max,
            p
          ),
          q
        )
      }
    }
    first-phase {
      expression: max_sim(query(qt), attribute(embedding))
    }
  }
}

With this type of schema, we can retrieve candidate documents and rank them in a single query request, since nearestNeighbor query operators in Vespa can be combined with each other using boolean operators and with filters.

This type of retrieval and ranking pipeline for late-interaction models avoids the need to transfer large amounts of vector data between services as everything is computed inside the Vespa content nodes. Inside the content node, we can transfer vector data at the speed of memory and not the network. This is a core design principle in Vespa, allowing developers to express complex ranking pipelines that might involve lots of vector data in a single query request.

Note that with the query tensor separation, where N single-vector tensors each represent one query token vector and a single mixed tensor is used for the MaxSim ranking, we can also perform query token pruning ³ ⁴. By using a subset of the query token vectors we can speed up the retrieval phase, and rank fewer pages with the MaxSim variants. Pruning attempts to remove less important query token vectors from the retrieval phase while they are retained for the ranking phase. This query tensor separation also allows different levels of targetHits per query token vector for the retrieval phase.

Is hamming a good similarity function for ColPali?

In the previous sections, we discussed how to scale ColPali to billions of documents using Vespa’s phased retrieval and ranking pipeline, using hamming distance for the nearest neighbor search and MaxSim for ranking. How does this approach perform in practice?

We evaluated the performance of the ColPali model on the DocVQA dataset using the ColPali embeddings and the proposed hamming-based MaxSim function in Vespa.

In the following we compare three different ranking strategies on the DocVQA test dataset using nDCG@5 as the evaluation metric. The three strategies are:

float-float: The “normal” float dot product in MaxSim.
binary-binary: The suggested hamming-based MaxSim.
binary-binary-reranking: Adding a re-ranking phase on top of the binary-binary version. In the re-ranking phase we use float resolution for the query tensor, while the page vectors use an unpacked float representation obtained from the binary vector. This step requires that we pass both representations of the query tensor to Vespa.

Ranking strategy	nDCG@5
float-float	52.4
binary-binary (hamming)	49.5
binary-binary (hamming) + float-float re-ranking	51.6

A notebook that demonstrates the impact of re-ranking depth can be found here. As seen from the above table, we can save 32x on storage (memory) by using the binary representation instead of float, make MaxSim 4x more efficient by using hamming distance, and retain most of the accuracy. The drop from 52.4 to 51.6 is a small price to pay for the efficiency gain.
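
The re-ranking idea and the 32x storage figure can both be sketched in a few lines of Python. The helpers below are illustrative only: unpacking to 0/1 floats with the most significant bit first, and 4-byte floats for the original vectors, are assumptions made for this sketch.

```python
def unpack_bits(packed):
    """Expand packed int8 values into a list of 0.0/1.0 floats (MSB first)."""
    bits = []
    for value in packed:
        byte = value & 0xFF  # signed int8 -> unsigned byte pattern
        bits.extend(float((byte >> shift) & 1) for shift in range(7, -1, -1))
    return bits

def max_sim_rerank(query_floats, packed_page):
    """Float MaxSim between float query vectors and unpacked binary patch vectors."""
    page = [unpack_bits(p) for p in packed_page]
    return sum(
        max(sum(a * b for a, b in zip(q, p)) for p in page)
        for q in query_floats
    )

# Toy data: one 8-dimensional float query vector and two 1-cell packed patch vectors.
rerank_score = max_sim_rerank([[0.5, -0.5, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]], [[-77], [0]])

float_bytes = 128 * 4    # 128 values at 4 bytes each (assumed float32)
binary_bytes = 128 // 8  # 128 bits packed into 16 bytes
storage_ratio = float_bytes // binary_bytes
```

The stored page vectors stay binary, so the 32x storage saving is kept, while the query side contributes its full float resolution during re-ranking.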

Summary

ColPali is a groundbreaking document retrieval model, probably one of the most significant advancements in the field of document retrieval in recent years. By leveraging the power of vision language models, ColPali can effectively retrieve documents based on both textual and visual information in them. With the increased interest in scaling ColPali to large collections of PDF documents, Vespa provides a powerful platform for implementing and deploying ColPali at scale.

ColPali, or VLMs in general, simplify the ingestion pipeline by eliminating the need for text extraction, OCR, and layout analysis. This makes it easier to implement and deploy in real-world applications with fewer preprocessing and extraction steps.

As part of this blog post, we have also created a comprehensive notebook that demonstrates all the concepts discussed here. Check it out: Scaling ColPali with Vespa.

The demo notebook features:

Obtaining ColPali embeddings for queries and PDF pages
How to use the MaxSim similarity function in Vespa and use both versions in a phased ranking pipeline
How to use the Vespa nearestNeighbor query operator to retrieve candidate documents for ranking

In addition, we also demonstrate how to reproduce the accuracy results on the DocVQA dataset using the ColPali embeddings and the MaxSim similarity functions described in this post: ColPali Benchmark DocVQA.

If you want to learn more about ColPali, and how to represent ColPali in Vespa, also check our previous posts on ColPali and Vespa.

For those interested in learning more about Vespa or ColPali, feel free to join the Vespa community on Slack or Discord to exchange ideas, seek assistance from the community, or stay in the loop on the latest Vespa developments. Also check out the FAQ section below for more information on how to use ColPali in Vespa.

FAQ

Why can’t you just use a frontier VLM with a large context window for this?

Screenshot from Google AI studio using Gemini Flash with a large context window for RAG.

Frontier VLM models like GPT-4/Gemini Flash handle visual inputs well (the PDF above is converted to images of the PDF pages, just like ColPali), but they have a limited context window.

In this case, a 62-page PDF occupies 34,567 tokens, so we can fit about 30 such PDFs into the LLM context window.
We argue that there are use cases where you want to retrieve over collections larger than 30 PDFs. In addition, as can be seen above, inference takes about 16 seconds for every query over this single PDF.

Certainly, foundational models with large context windows can be used for RAG over small collections, but with high latency. They are not a universal solution for large collections of complex document formats.

In this case, you can turn to ColPali to retrieve relevant pages and then feed those to the foundational model, getting the best of both paradigms.

I have many structured text fields and metadata, can I use ColPali in Vespa?

Yes, you can combine the ColPali embeddings with other features in Vespa.

Can I combine ColPali with query filters in Vespa?

Yes, ColPali is primarily a way to score documents based on a query, and can be combined with any Vespa query formulation, including filters and result grouping.

Our custom ranking uses many business-oriented ranking signals, can I use ColPali?

Yes, Vespa’s ranking framework allows many different signals, or rank features as we call them in Vespa. Features can be combined in ranking expressions in ranking phases.

The MaxSim expression can be used as any other feature, and in combination with other custom features, even as a feature in a GBDT-based ranker using Vespa’s xgboost or lightgbm support.

How does ColPali relate to Vespa’s support for nearestNeighbor search?

If we want to use ColPali representations in retrieval and not just for ranking, we can use the Vespa nearestNeighbor search operator to retrieve candidate documents based on the query token embeddings. The candidate documents are then ranked using the MaxSim similarity function (and/or custom ranking features) in a Vespa ranking expression.

How does ColPali relate to Vespa HNSW indexing for mixed tensors or multi-vectors?

It’s handy if we need to use ColPali representations for retrieval, allowing for efficient candidate selection based on the query token embeddings. See more in multi-vector indexing.

Any plans to integrate ColPali as a native Vespa embedder?

Yes, see this issue.

Can I combine ColPali with ColBERT in Vespa?

Yes, you can combine ColPali with ColBERT or SPLADE or other fancier text ranking methods in Vespa. Those would be two different embedding (tensor) fields in the Vespa schema, and you can use two (or multiple) MaxSim expressions and combine the scores in ranking expressions.

How does ColPali compare to hybrid search?

ColPali can be used in a hybrid search pipeline as just another neural scoring feature, used in any of the Vespa ranking phases (preferably in second-phase for optimal performance, to avoid moving vector data up to the stateless container for global-phase evaluation).

Vespa allows combining the MaxSim score with other scores using, for example, reciprocal rank fusion or other normalization rank features. The sample application features examples of using ColBERT MaxSim in a hybrid ranking pipeline.
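
For readers unfamiliar with reciprocal rank fusion, here is a minimal standalone sketch of the idea in Python. It is not Vespa's implementation; the function name, the page ids, and the constant k=60 (a commonly used default) are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids: score(d) = sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first, ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical rankings from a MaxSim scorer and a keyword scorer.
max_sim_ranking = ["p3", "p1", "p2"]
keyword_ranking = ["p1", "p2", "p4"]
fused = reciprocal_rank_fusion([max_sim_ranking, keyword_ranking])
```

Pages that appear near the top of both lists (p1, then p2) rise above pages that appear in only one list, without requiring the two score scales to be comparable.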

Can I run ColPali if I’m GPU-poor?

Yes, you can. All the notebooks were developed using PyTorch with MPS support on M1 Macs, and the Vespa content backend is CPU-based.

Encoding the PDF pages takes about 2.5 seconds per page with batches of 4 on MPS (if you ensure to close down Chrome or other GPU-hungry applications). PaliGemma works well with batch size 4 on a 16GB T4 GPU (or similar).

What are the tradeoffs? If it stores a vector per patch, it must be expensive!

We can look at performance and deployment cost along three axes: Effectiveness (ranking quality), storage, and computations. In this context, we can recommend Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking, which provides a framework to compare different methods.

A natural comparison is recent large decoder-based text embedding models (e.g. NV-Embed-v2) that produce 4096-dimensional float embeddings for short texts. Their storage footprint is 16KB per chunk (and you will need more than one chunk to accurately represent a page). ColPali with binarized patch vectors has a similar storage footprint per page (16KB).

In practice, this means that ColPali is more storage efficient than large decoder-based text embedding models, but requires more computations to score and rank the documents (as we have to compute the MaxSim similarity function).
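
The two footprints can be verified with simple arithmetic, assuming 4-byte floats for the text embedding and the 1030 binarized 16-byte vectors per page described earlier.

```python
text_embedding_bytes = 4096 * 4  # 4096 float values at 4 bytes each, per chunk
colpali_page_bytes = 1030 * 16   # 1030 binarized vectors at 16 bytes each, per page
```

That is 16,384 bytes per text chunk versus 16,480 bytes per ColPali page, so a page that needs two or more text chunks already costs more to store with the single-vector float model.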

Why didn’t you expose the ColBERT 2 PLAID retrieval optimization in Vespa? Could this be an alternative for scaling ColPali?

Primarily because Vespa is designed for low-latency real-time indexing with CRUD support. The PLAID indexing optimization requires batch processing the document token vectors to find centroids. This centroid selection would not scale in a real-time setting where users expect outstanding performance from document number one to billions of documents. As demonstrated in this post, we can instead replace the float dot product with a hamming distance, which is a much faster operation, and use the Vespa nearestNeighbor query operator to retrieve candidate documents for ranking. This makes the approach more scalable and efficient for real-time document retrieval.

That is a lot of hamming distances - do you use any acceleration to speed it up?

Yes, Vespa’s core backend is written in C++, and the dot products and hamming are accelerated using SIMD instructions.

Is ColPali good or not?

ColPali is more of a direction than a model checkpoint to rule them all. We believe in the power of vision language models for complex document retrieval, and we are excited to see how the field evolves. We envision a future where we can have models trained on different VLM backbones like Qwen/Qwen2-VL-7B-Instruct.

Can I use ColPali for other tasks than document retrieval?

The ColPali model is actually a Low-Rank Adapter (LoRA) model on top of PaliGemma, so you can use it for any task that PaliGemma is good at if you remove the adapter layer.

References

Parsing Through the Summer: A Tale of Schemas, Syntax, and Shenanigans

Mon, 26 Aug 2024 00:00:00 +0000

One of the first steps when creating a Vespa application is defining one or more schemas. Using the Vespa Schema Language, you can define document types and the kinds of computations you want to do over them. The language is very powerful, but most IDEs and editors do not currently provide support when using it. It is 2024 and the hottest topic in language tooling is LSP. It may be something you have heard of before, but do not quite know what it is. Or maybe you know all the ins and outs of configuring LSP servers in your favorite editor?

Regardless, our mission this summer was to investigate the mystical “LSP”, and try to harness its powers to make writing schema files a pleasant experience. Join two interns climbing the learning ladder on the quest for providing the ultimate language support!

Starting the project

We first spent a couple of days learning about Vespa and how to use it by following the getting started guide. Here we got some hands-on experience with the language we were supposed to create tooling for, as well as a general understanding of Vespa and its capabilities. It was pretty cool!

After this we had to figure out a bunch of stuff:

What exactly is LSP and how does it work?
How are language servers usually implemented?
What programming language should we use to implement the language server?
How do we parse schema files?

In this blog post we will discuss some of the answers we found to these questions during the development.

Language server protocol

So what is LSP? The Language Server Protocol is a protocol to standardize language support features for IDEs and editors. To create support for a specific language, all you need to do is to create a program called a language server, which is able to respond to requests defined in the protocol. A client, which in this context represents an editor like VSCode, can then launch the server as a separate process and use the server to provide developer tooling for the user. This way, the language support logic is disjoint from any specific editor or environment, and supporting new editors should be easy.

It turns out that the full protocol specification is quite extensive. We had to make a choice regarding which parts would be most useful to implement. These were our goals:

Diagnostics: Highlighting of errors and warnings found when parsing schema files.
Code navigation: “Go-to-definition” and “find references”
Semantic token syntax highlighting
Code actions: Quick fixes for common errors
Completion
Documentation on hover

Usually a language server is written in the same language that it provides support for. We figured that writing a language server in the Vespa schema language would impose some difficulties. Therefore we decided to implement the server in Java instead, because most of the relevant parts of Vespa’s existing codebase are written in Java.

Parser

For a language server to actually do something useful, it needs to parse the language in question. The core functionality is therefore what happens when a text document is changed. LSP captures this event through the “textDocument/didChange” notification. This happens on every keystroke. The new document content must be parsed, and symbols are registered for handling other requests later.
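
To give a feel for what the server receives, here is a minimal Python sketch of how a client could frame such a notification on the wire. It assumes full-document synchronization (the whole text is resent on each change); the file URI and schema text are made-up examples, and the helper name is hypothetical.

```python
import json

def did_change_message(uri, version, new_text):
    """Frame a full-document textDocument/didChange notification for the wire."""
    payload = {
        "jsonrpc": "2.0",
        "method": "textDocument/didChange",
        "params": {
            "textDocument": {"uri": uri, "version": version},
            # A change event without a range replaces the whole document.
            "contentChanges": [{"text": new_text}],
        },
    }
    body = json.dumps(payload).encode("utf-8")
    # LSP messages are prefixed with a Content-Length header and a blank line.
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body

message = did_change_message("file:///app/schemas/pdf_page.sd", 2, "schema pdf_page {}")
```

Since this is a notification and not a request, it carries no id and the server sends no response; it simply re-parses the new content.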

The existing parser for schema files was generated using a parser-generator tool called JavaCC. In JavaCC you write production rules like:

field() : 
{ 
     <FIELD> identifier() <TYPE> dataType() <LBRACE> fieldBody() <RBRACE> 
}

Here identifier, dataType and fieldBody are all production rules themselves. JavaCC takes the list of rules as input, and generates a Java program which will lex and parse any string written in the given language. JavaCC also makes it possible to inject Java code to be executed during the actual parsing. For example:

field() :
{
    String name;
}
{
    <FIELD> name = identifier() <TYPE> ... 
}
{
    if (isReservedName(name)) 
        throw new IllegalArgumentException(name + " is reserved!");
    // ... 
}

This is done extensively in the actual schema parser implementation. That way, a model of the Vespa schema is built simultaneously with the parsing, instead of having to construct it from some AST representation. It is actually quite elegant.

This approach does, however, not work as well when trying to make a language server. There are some reasons for this:

The intermediate representation does not contain any references to the original document. When creating an LSP feature like “go-to-definition”, the exact location of the symbol in question needs to be known by the language server. What we need is a Concrete Syntax Tree (CST).
When writing a schema file, most of the time the file will not be syntactically correct. A default JavaCC parser is not fault tolerant. If it encounters an error during parsing it will simply throw an exception and quit. This would lead to poor language support.
The language server might need some very specific information about what the syntax tree looks like at a particular position, for instance to generate relevant completion items. This information is lost during the JavaCC parsing.

For these reasons, we found it necessary to find another way to parse schema files. We had some requirements:

Ideally, the parser is closely related to the existing JavaCC parser.
The parser should be fault tolerant, i.e. able to continue parsing after a syntax error.
The parser should generate a Concrete Syntax Tree, where every node knows its position in the original document.
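The last two requirements can be illustrated with a toy parser. The sketch below is in Python rather than CongoCC, and it parses a made-up one-statement-per-line language ("field <name> type <type>"), not the real schema grammar. What it shows is the behavior we were after: a syntax error becomes an error node instead of an exception, and every node records where it came from.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str    # "field", "keyword", "identifier", "dataType" or "error"
    text: str
    line: int    # 0-based line number, as in LSP positions
    start: int   # start column (inclusive)
    end: int     # end column (exclusive)
    children: list = field(default_factory=list)


# Toy grammar: field <name> type <type>
FIELD_RE = re.compile(r"\s*(field)\s+(\w+)\s+(type)\s+(\w+)\s*$")
TOKEN_KINDS = ["keyword", "identifier", "keyword", "dataType"]


def parse(source: str):
    """Parse line by line; bad lines become error nodes and parsing continues."""
    nodes, errors = [], []
    for line_no, line in enumerate(source.splitlines()):
        if not line.strip():
            continue
        match = FIELD_RE.match(line)
        if match is None:
            # Recover: record the error with its position, then keep going.
            error = Node("error", line, line_no, 0, len(line))
            nodes.append(error)
            errors.append(error)
            continue
        children = [
            Node(kind, match.group(i), line_no, match.start(i), match.end(i))
            for i, kind in enumerate(TOKEN_KINDS, start=1)
        ]
        nodes.append(
            Node("field", line, line_no, match.start(1), match.end(4), children)
        )
    return nodes, errors


nodes, errors = parse("field title type string\nfield type\nfield year type int")
```

The second line is incomplete, yet the third line is still parsed and its tokens keep their exact columns, which is what later requests such as go-to-definition depend on.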

After some research, we found a project called CongoCC. CongoCC is the continuation of a project called JavaCC 21, which itself aimed to be a successor of JavaCC. It meets all our requirements! The syntax is similar to JavaCC, it is fault tolerant and it generates a CST out of the box. Our next mission was then to port the parser to CongoCC. There were in total about 5000 lines of JavaCC code we had to convert to CongoCC. It took a couple of days.

When the core parser was ready, it was time to lay out the full pipeline to execute when a document changes:

Embracing the CST

When you have a CST, the life of a language server developer gets significantly easier. Every LSP request turns into a tree problem:

Go-to-definition: Find the node at the cursor position. Find a symbol there. Find the node in the tree corresponding to its definition. Return the location of said node.
Completion: Do some kind of pattern matching with the CST around the cursor position. Give valid completion items based on the matched pattern.
Semantic token highlighting: Leaf nodes in the CST are tokens. They provide the basis for syntax highlighting. But some tokens have different meanings based on the context. So by inspecting the CST we can give better highlighting than a pure token-based highlighting.
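The first step in each of these requests, finding the node at the cursor, can be sketched in a few lines of Python. The Node class and the node_at helper are illustrative names of ours, and positions are simplified to plain character offsets; the real server works on the CongoCC-generated tree.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    kind: str
    start: int   # offset of the first character (inclusive)
    end: int     # offset just past the last character (exclusive)
    children: list = field(default_factory=list)


def node_at(root: Node, offset: int) -> Optional[Node]:
    """Return the deepest node whose range contains the offset, or None."""
    if not (root.start <= offset < root.end):
        return None
    for child in root.children:
        hit = node_at(child, offset)
        if hit is not None:
            return hit
    # No child covers the offset (e.g. whitespace), so the node itself is the hit.
    return root


# Tree for the text "field title type string"
tree = Node("field", 0, 23, [
    Node("keyword", 0, 5),
    Node("identifier", 6, 11),
    Node("keyword", 12, 16),
    Node("dataType", 17, 23),
])
```

With the cursor at offset 8 this returns the identifier node for "title"; from there a request handler can look up the symbol and answer with the location of its definition.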

To simplify the different types of requests we can get through LSP, it is useful to do some processing of the CST after the initial parsing. In particular, we want to be able to keep track of the different symbols that can exist in a schema document. A symbol is any user defined construct with an identifier, for example a field, a function or a rank-profile. To keep track of the symbols, we created an index called SchemaIndex. When all definitions are added to the index, we can go through the symbol references, and search for the definition. If no definition was found, we can send an error message back to the user. To resolve all the references, the index also keeps track of inheritance to search for all the valid places a symbol could be defined.
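As a rough illustration of the idea, here is a toy Python version of such an index. It borrows the SchemaIndex name from the text, but the methods (define, inherit, resolve), the scope names and the locations are assumptions made for the sketch and do not mirror the actual Java implementation.

```python
class SchemaIndex:
    """Toy symbol index: definitions per scope, plus scope inheritance."""

    def __init__(self):
        self.definitions = {}   # (scope, name) -> location of the definition
        self.parents = {}       # scope -> list of scopes it inherits from

    def define(self, scope, name, location):
        self.definitions[(scope, name)] = location

    def inherit(self, scope, parent):
        self.parents.setdefault(scope, []).append(parent)

    def resolve(self, scope, name, _seen=None):
        """Look in the scope itself first, then in everything it inherits."""
        seen = _seen if _seen is not None else set()
        if scope in seen:       # guard against inheritance cycles
            return None
        seen.add(scope)
        if (scope, name) in self.definitions:
            return self.definitions[(scope, name)]
        for parent in self.parents.get(scope, []):
            found = self.resolve(parent, name, seen)
            if found is not None:
                return found
        return None


def unresolved(index, references):
    """References without a definition; these would be reported as errors."""
    return [(scope, name) for scope, name in references
            if index.resolve(scope, name) is None]


# Hypothetical rank profiles: "child" inherits "base", which defines a function.
index = SchemaIndex()
index.define("base", "freshness", (10, 4))   # (line, column) of the definition
index.inherit("child", "base")
missing = unresolved(index, [("child", "freshness"), ("child", "popularity")])
```

Here the reference to freshness resolves through inheritance, while popularity has no definition anywhere and ends up in the list of errors to send back to the user.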

Once we had the CST and index structure in place, the rest of the summer was spent actually implementing features, reading the Vespa schema and the LSP documentation, making sure the tool worked as smoothly as possible. Oh, and when creating a language server there are of course edge cases. A lot of edge cases.

Unexpected side mission

It turns out that IntelliJ does not fully support LSP yet. They only support a subset of the requests. An important part they don’t support is semantic tokens. Usually, syntax highlighting of a language is split into two components. A basic and fast highlighter highlights most keywords, separate from LSP. The language server can then provide additional highlighting information through the semantic token request, which gives a more “correct” highlighting, but it is slower. Our highlighting scheme, however, relies solely on the semantic tokens. It takes some time when opening a document for the first time, but after this we deemed it fast enough to do all the highlighting work. But this meant that highlighting didn’t work in IntelliJ! Oh no. Highlighting is a quite important part of providing language support. So how do we get highlighting in IntelliJ? There are really only two options:

Implement the semantic token functionality ourselves in the IntelliJ plugin.
Implement a basic syntax highlighting running in the IntelliJ plugin.

The first option seemed a bit difficult. Therefore, we decided to make a custom highlighter only for the IntelliJ plugin. To do this, the plugin API requires you to implement an abstract class called “Lexer”. The lexer will break the document into a series of tokens, which then can be highlighted based on their type. The lexer interface is for an incremental lexer, meaning that it can start and stop at arbitrary places. Luckily for us, CongoCC already has generated a lexer for us! If only we could plug it into IntelliJ…

The solution was to wire the right connections between the IntelliJ interface and the CongoCC interface. Although the solution was not ideal, it worked. For instance, the generated lexer is not incremental in the way that IntelliJ requires, so the middleware has to create a new instance of the generated lexer for each call to “start” and pretend that we got an entirely new document. A bit suboptimal, but better than writing (yet another) definition of the schema language in something like JFlex. The bonus is that, as soon as JetBrains implements the semantic tokens request for IntelliJ, syntax highlighting will automatically improve. This also holds for other LSP features.

Results

The result of our work was a language server capable of most features we set out to implement in the beginning. The main client we worked on is in the form of a VSCode extension. In the GIF below, you can see a demonstration of some features.

Neovim plugin

Did we mention that the language server works in Neovim? Neovim has an LSP client built in, which means that Neovim can communicate with our language server. All you need to do is to download the language server and add an attach script for the appropriate file types in your init.lua. Instructions can be found at the release link🚀.

Limitations and future work

Even though the language server has simplified writing schema files significantly, there are still some missing features. For example, certain errors in rank expressions are not detected by the language server, meaning users may only discover these errors during deployment or preparation. An example of this is attempting to fetch an attribute from a field, without the attribute indexing type set, a mistake that currently goes unnoticed by the server.

Moreover, the language server does not yet support multiple workspaces, which can lead to issues if multiple editors across different workspaces rely on the same language server. This limitation can be particularly problematic when one workspace contains .profile files in a separate folder. This can cause the server to display errors in valid schemas and struggle with identifying correct symbol relationships.

Additionally, there are several features that would greatly enhance the language server. For instance, better integration with services.xml would allow for automatic file updates when editing schema files. Support for formatting requests would ensure uniformity in schema files, making them easier to read and manage.

Lastly, adding support for the Vespa Query Language is another milestone to reach. This could be implemented as another language server, ideally one that is aware of the current deployment so that it can provide completion. Running queries from within the IDE could also be done with the code lens feature in LSP. This would simplify the development of Vespa applications.

Our experience at Vespa

Our experience at Vespa.ai provided us with a deep understanding of the architecture of language servers and the intricacies of parsing Vespa schemas. Additionally, we gained valuable insights into the dynamics and working conditions within a tech start-up environment.

Contributing to a large open-source project like Vespa was both exciting and challenging. Initially, we found the scale of the project overwhelming, but after working through some getting-started tutorials and engaging in a bit of trial and error, we were able to identify the most relevant parts of the project. Whenever we were stuck we always had someone to guide us, which made our time at Vespa not only productive but also enjoyable. We extend our sincere thanks to all our colleagues, with special recognition to Kristian Aune, Øyvind Grønnesby, and Arne Henrik Juul for their daily stand-ups and continuous support.

Vespa Newsletter, August 2024

Fri, 23 Aug 2024 00:00:00 +0000

In the previous update, we mentioned RAG in Vespa, cheaper vector search, fuzzy search with prefix match, distance calculation performance improvements, and new Pyvespa features. Today, we’re excited to share the following updates:

Pyvespa improvements
Vespa CLI improvements
Performance: Improved multi-threading performance with text matching
New Vespa features: Chinese segmentation, improved English stemming, and new ranking features.

Pyvespa improvements

ColPali is a method that will transform search and RAG for visual documents, such as PDFs (typically containing figures and tables). Our very own @jobergum demonstrated how to use ColPali with Vespa in the Vespa 🤝 ColPali: Efficient Document Retrieval with Vision Language Models notebook - also see the blog post on PDF Retrieval with Vision Language Models.

Features and fixes:

Support for deploying applications to Vespa Cloud production with Pyvespa. Use deploy_to_prod to start deployment of a new application package revision (typically automatically triggered by a build job) to Vespa Cloud. You can also use check_production_build_status for deployment tracking.
Key/cert generation for mTLS auth is now done using Vespa CLI. This was previously done separately in Pyvespa, which could cause certificate mismatches in some cases. The change reduces the discrepancy in the authentication method between Pyvespa and Vespa CLI.
Interactive control-plane auth. By adopting interactive auth (opening auth link in browser) from Vespa CLI, it is now a lot easier to interact with Vespa Cloud from Python. Check out the updated Quickstart on Vespa Cloud-notebook, also see authenticating-to-vespa-cloud.
Switch the VespaAsync HTTP-client to use httpx[http2]. As Vespa supports HTTP/2, this enables the Async Pyvespa client to multiplex HTTP requests over a single connection.
New app.feed_async_iterable(), with a signature similar to the sync feed_iterable(), using the async client. feed_async_iterable typically performs better (while using fewer resources) than its synchronous counterpart, especially in cases where network latency is larger. For details, check out this notebook.
Bugfix: Pyvespa version 0.45 had a bug that resulted in ImportError: No module named termios for Windows users. This is now fixed. Note that interactive login is not yet supported on Windows. We have also implemented a cross-platform matrix for unit tests, to catch platform-dependent errors earlier in the future.
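For the new feed_async_iterable() method in the list above, usage could look roughly like the sketch below. It assumes app is an already connected Pyvespa application object and that the call accepts the same iter, schema and callback arguments as feed_iterable(); the schema name "doc", the title field and the helper names are made up for the example, so check the Pyvespa reference before relying on the details.

```python
def make_docs(titles):
    """Yield feed operations as dicts with an "id" and a "fields" mapping."""
    for i, title in enumerate(titles):
        yield {"id": str(i), "fields": {"title": title}}


def feed_all(app, titles, schema="doc"):
    """Feed documents with the async client and return the ids that failed.

    `app` is assumed to be a connected Pyvespa application object.
    """
    failed = []

    def callback(response, doc_id):
        # Called once per operation with the response and the document id.
        if not response.is_successful():
            failed.append(doc_id)

    app.feed_async_iterable(make_docs(titles), schema=schema, callback=callback)
    return failed
```

Passing a generator keeps memory use flat for large feeds, and collecting failures in the callback makes it easy to retry only the documents that did not go through.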

For those interested in full release details, check out github.com/vespa-engine/pyvespa/releases. We also encourage the community to keep creating issues, whether it is enhancements or bug fixes / other ideas. We would also like to thank the following external contributors for contributing since the last newsletter - you rock!

Vespa CLI improvements

vespa log now supports self-hosted Vespa instances (from Vespa 8.359).
vespa deploy now detects and warns if the certificate added to the application package does not match the configured application key pair.
vespa deploy now supports a .vespaignore file which allows excluding unwanted files from the deployed application package. See the documentation for more details.
vespa query now handles large tokens (> 64K) when streaming responses from an LLM.
vespa feed now supports sending custom headers using the new --header option.
vespa feed performance increased by 27% when feeding large documents (> 10K).
vespa document get now supports a --field-set option (like vespa visit) that specifies which fields to include when retrieving a document. See the documentation for more details.

Performance

The default query operator in Vespa is weakand, and Vespa lets you control how many cores to use to execute each query. The weakand operator now uses a shared heap across threads used in the matching phase. This has reduced CPU usage and latency on text/hybrid queries. In a sample performance test measuring theoretical perfect resource utilization, we saw an increase of 37.5%. The specific improvements you’ll see depend on your data and queries - we recommend you use the latest Vespa release and try it yourself! Changing the number of search threads only requires a content node restart, done automatically when running on Vespa Cloud.

New Vespa features

Ranking: To eliminate low-scoring hits from later ranking phases you can use rank-score-drop-limit in the first ranking phase. Since Vespa 8.354, rank-score-drop-limit is also available in the second ranking phase. It can be set in the rank profile or with the ranking.secondphase.rankscoredroplimit Query API parameter.
Ranking: When writing ranking functions, you can pass the names of features as function arguments. From Vespa 8.371, you can also do the same with dimension names.
Linguistics: Since 8.379, Vespa supports Chinese segmentation in the default linguistics implementation. To enable this, add this config to your <container> element(s) in services.xml:
```
<container id="default" version="1.0">
    …
    <config name="ai.vespa.opennlp.open-nlp">
        <cjk>true</cjk>
        <createCjkGrams>true</createCjkGrams>
    </config>
</container>
```
Note that if you change this on a live field, you will reduce recall until reindexing is completed.
Linguistics: In the default Vespa linguistics implementation, the stemmer for English is an implementation called kStem, while other languages use Snowball. Since 8.388, Vespa lets you switch to Snowball for English as well, by setting the configuration below. This will cause more words to be stemmed in English, and therefore higher recall.
```
<container id="default" version="1.0">
    …
    <config name="ai.vespa.opennlp.open-nlp">
        <snowballStemmingForEnglish>true</snowballStemmingForEnglish>
    </config>
</container>
```
Note that if you change this on a live field you will reduce recall until reindexing is completed - also see the documentation.
Operations: As clusters grow in size and features, application owners continuously resize and reconfigure for performance and cost optimizations. Vespa’s elasticity functions make this easy with automatic data migration to new nodes under regular query load. This also means that clusters are often redistributing data. The new cluster-controller_cluster-buckets-out-of-sync-ratio metric makes it easy to track the status and estimate when the redistribution will be complete.

Events

Meet us at the MLCon in New York City, October 8-9 and COLLIDE DATA CONFERENCE in Atlanta, October 10-11!

Thanks for joining us in exploring the frontiers of AI with Vespa. Ready to take your projects to the next level? Deploy your application for free on Vespa Cloud today.

Beyond Text: The Rise of Vision-Driven Document Retrieval for RAG

Mon, 19 Aug 2024 00:00:00 +0000

This blog post deep dives into “ColPali: Efficient Document Retrieval with Vision Language Models” which introduces a novel approach to document retrieval that leverages the power of vision language models (VLMs).

Introduction

Imagine a world where search engines could understand documents the way humans do – not just by reading the words, but by truly seeing and interpreting the visual information. This is the vision behind ColPali, a refreshing document retrieval model that leverages the power of vision language models (VLMs) to unlock the potential of visual information in Retrieval-Augmented Generation (RAG) pipelines.

Just as our brains integrate visual and textual cues when we read a document, ColPali goes beyond traditional text-based retrieval methods. It uses a vision-capable language model (PaliGemma) to “see” the text, but also the visual elements of a document, including figures, tables and infographics. This enables a richer understanding of the content, leading to more accurate and relevant retrieval results.

ColPali is short for Contextualized Late Interaction over PaliGemma and builds on two key concepts:

Contextualized Vision Embeddings from a VLM: ColPali generates contextualized embeddings directly from images of document pages, using PaliGemma, a powerful VLM with built-in OCR capabilities (visual text understanding). This allows ColPali to capture the full richness of both text and visual information in documents. This is the same approach as for text-only retrieval with ColBERT. Instead of pooling the visual embeddings for a page into a single embedding representation, ColPali uses a grid to represent the image of a page where each grid cell is represented as an embedding.
Late Interaction: ColPali uses a late interaction similarity mechanism to compare query and document embeddings at query time, allowing for interaction between all the image grid cell vector representations and all the query text token vector representations. This enables ColPali to effectively match user queries to relevant documents based on both textual and visual content. Traditionally, with a VLM like PaliGemma, one would input both the image and the text into the model at the same time. This is computationally expensive and inefficient for most use cases. Instead, ColPali compares the image and text embeddings at query time, which is more efficient for document retrieval tasks and allows for offline precomputation of the contextualized grid embeddings.

The Problem with Traditional Document Retrieval

Document retrieval is a fundamental task in many applications, including search engines, information extraction, and retrieval-augmented generation (RAG). Traditional document retrieval systems primarily rely on text-based representations, often struggling to effectively utilize the rich visual information present in documents.

Extraction: The process of indexing a standard PDF document involves multiple steps, including PDF parsing, optical character recognition (OCR), layout detection, chunking, and captioning. These steps are time-consuming and can introduce errors that impact the overall retrieval performance. Beyond PDF, other document formats like images, web pages, and handwritten notes pose additional challenges, requiring specialized processing pipelines to extract and index the content effectively. Even after the text is extracted, one still needs to chunk the text into meaningful chunks for use with traditional single-vector text embedding models that are typically based on context-length limited encoder-styled transformer architectures like BERT or RoBERTa.
Text-Centric Approach: Most systems focus on text-based representations, neglecting the valuable visual information present in documents, such as figures, tables, and layouts. This can limit the system’s ability to retrieve relevant documents, especially in scenarios where visual elements are crucial for understanding the content.

The ColPali model addresses these problems by leveraging VLMs (in this case: PaliGemma – Google’s Cutting-Edge Open Vision Language Model) to generate high-quality contextualized embeddings directly from images of PDF pages. No text extraction, OCR, or layout analysis is required. Furthermore, there is no need for chunking or text embedding inference as the model directly uses the image representation of the page.

Essentially, the ColPali approach is the WYSIWYG (What You See Is What You Get) for document retrieval. In other words: WYSIWYS (What You See Is What You Search).

ColPali: A Vision Language Model for Document Retrieval

Vision Language Models (VLMs) have gained popularity for their ability to understand and generate text based on combined text and visual inputs. These models combine the power of computer vision and natural language processing to process multimodal data effectively. The Pali in ColPali is short for PaliGemma as the contextual image embeddings come from the PaliGemma model from Google.

PaliGemma is a family of vision-language models with an architecture consisting of SigLIP-So400m as the image encoder and Gemma-2B as text decoder. SigLIP is a state-of-the-art model that can understand both images and text. Like CLIP, it consists of an image and text encoder trained jointly.

Quote from PaliGemma model from Google.

For a deeper introduction to Vision Language Models (VLMs), check out Vision Language Models Explained.

VLMs display enhanced capabilities in Visual Question Answering ¹, captioning, and document understanding tasks. The ColPali model builds on this foundation to tackle the challenge of document retrieval, where the goal is to find the most relevant documents based on a user query.

To implement retrieval with a VLM, one could add the pages as images into the context window of a frontier VLM (like Gemini Flash or GPT-4o), but this would be computationally expensive and inefficient for most (not all) use cases. In addition to the compute cost, latency would be high when searching large collections.

ColPali is designed to be more efficient and effective for document retrieval tasks by generating contextualized embeddings directly from images of document pages. Then the retrieved pages could be used as input to a downstream VLM model. This type of serving architecture could be described as a retrieve and read pipeline or in the case of ColPali, a retrieve and see pipeline.

Note that the ColPali model is not a full VLM like GPT-4o or Gemini Flash, but rather a specialized model for document retrieval. The model is trained to generate embeddings from images of document pages and compare them with query embeddings to retrieve the most relevant documents.

ColPali architecture

Comparing standard retrieval methods which use parsing, text extraction, OCR and layout detection with ColPali. Illustration from ². Notice that the simpler pipeline achieves better accuracy on the ViDoRe benchmark than the more complex traditional pipeline. 0.81 NDCG@5 for ColPali vs 0.66 NDCG@5 for the traditional pipeline.

Image-Based Page Embeddings: ColPali generates contextualized embeddings solely from images of document pages, bypassing the need for text extraction and layout analysis. Each page is represented as a 32x32 image grid (patch), where each patch is represented as a 128-dimensional vector.
Query Text Embeddings: The query text is tokenized and each token is represented as a 128-dimensional vector.
Late Interaction Matching: ColPali employs a so-called late interaction similarity mechanism, where query and document embeddings are compared at query time (dot product), allowing for interaction between all the image patch vector representations and all the query text token vector representations. See our previous work on ColBERT, ColBERT long-context for more details on the late interaction (MaxSim) similarity mechanisms. The Col in ColPali is short for the interaction mechanism introduced in ColBERT where Col stands for Contextualized Late Interaction.
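The late interaction scoring described above can be written down in a few lines. The sketch below is plain Python with tiny 2-dimensional toy vectors instead of 128-dimensional embeddings, and the names max_sim, page_a and page_b are ours; it illustrates the MaxSim formula and is not how Vespa or the ColPali code base implements it.

```python
def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def max_sim(query_vectors, patch_vectors):
    """Late interaction (MaxSim) score for one page.

    For each query token vector, take the best-matching patch (the maximum
    dot product), then sum those maxima over all query tokens.
    """
    return sum(
        max(dot(q, p) for p in patch_vectors)
        for q in query_vectors
    )


# Toy data: two query token vectors and two pages with a few patch vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[0.5, 0.5], [1.0, 0.0], [0.0, 0.2]],
    "page_b": [[0.1, 0.1]],
}
ranking = sorted(pages, key=lambda name: max_sim(query, pages[name]), reverse=True)
```

Because the patch vectors of a page do not depend on the query, they can be computed once at indexing time; only the query token vectors and the comparisons above are computed at query time.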

Illustration of a 32x32 image grid (patch) representation of a document page. Each patch is a 128-dimensional vector. This image is from the vidore/docvqa_test_subsampled test set.

The above figure illustrates the patch grid representation of a document page. Each patch is a 128-dimensional vector, and the entire page is represented as a 32x32 grid of patches. This grid is then used to generate contextualized embeddings for the document page.
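To get a feel for what this representation costs, here is a small back-of-the-envelope calculation. The grid size and vector dimension come from the description above; the bytes per value (4 for float32, 1 for int8) are our assumptions about how the vectors might be stored.

```python
PATCHES_PER_PAGE = 32 * 32   # one vector per grid cell
DIMENSIONS = 128             # values per patch vector


def page_bytes(bytes_per_value: int) -> int:
    """Raw size of all patch vectors for one page, ignoring any overhead."""
    return PATCHES_PER_PAGE * DIMENSIONS * bytes_per_value


float32_bytes = page_bytes(4)   # assumed 4 bytes per value (float32)
int8_bytes = page_bytes(1)      # assumed 1 byte per value (int8)
```

That is 1,024 vectors and 131,072 values per page, or 512 KiB per page at float32 under these assumptions, which is why the precision of the stored vectors matters for large collections.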

The Problem with Traditional Document Retrieval Benchmarks

Most Information Retrieval (IR) and Natural Language Processing (NLP) benchmarks, for example MS MARCO and BEIR, are only concerned with clean and preprocessed texts.

The problem is that real-world retrieval use cases don’t have the luxury of preprocessed clean text data. These real-world systems need to handle the raw data as it is, with all its complexity and noise. This is where the need for a new benchmark arises. The ColPali paper ² introduces a new and challenging benchmark: ViDoRe: A Comprehensive Benchmark for Visual Document Retrieval (ViDoRe). This benchmark is designed to evaluate the ability of retrieval systems to match user queries to relevant documents at the page level, considering both textual and visual information.

ViDoRe features:

Multiple Domains: It covers various domains, including industrial, scientific, medical, and administrative.
Academic: It uses existing visual question-answering benchmarks like DocVQA (Document Visual Question Answering), InfoVQA (Infographic Visual Question Answering) and TAT-DQA. These visual question-answering datasets focus on specific modalities like figures, tables, and infographics.
Practical: It introduces new benchmarks based on real-world documents, covering topics like energy, government, healthcare, and AI. These benchmarks are designed to be more realistic and challenging than repurposed academic datasets. Most of the queries for real-world documents are generated by frontier VLMs.

ViDoRe: A Comprehensive Benchmark for Visual Document Retrieval (ViDoRe)

ViDoRe provides a robust and realistic benchmark for evaluating the performance of document retrieval systems that can effectively leverage both textual and visual information. The full details of the ViDoRe benchmark can be found in the ColPali paper ² and the dataset is hosted as a collection on Hugging Face Datasets. One can view some of the examples using the dataset viewer: infovqa test.

The paper ² evaluates ColPali against several baseline methods, including traditional text-based retrieval methods, captioning-based approaches, and contrastive VLMs. The results demonstrate that ColPali outperforms all other systems on ViDoRe, achieving the highest NDCG@5 scores across all datasets.

Looking at the result table we can observe that:

Visually Rich Datasets: ColPali significantly outperforms all other methods on datasets with complex visual elements, like ArXivQ, and TabF. These datasets require OCR for text-based retrieval methods. This demonstrates its ability to effectively leverage visual information for retrieval.
Overall Performance: ColPali achieves the highest NDCG@5 scores across all datasets, including those with primarily textual content where simple text extraction and BM25 also work well (AI, Energy, Gov, Health). Even on these datasets, ColPali’s performance is better, indicating that PaliGemma has strong OCR capabilities.
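For readers unfamiliar with the metric, NDCG@5 rewards placing relevant pages near the top of the first five results. Below is a minimal Python sketch of the standard formula. It assumes the list of relevance labels covers every judged page for the query, so that the ideal ordering can be derived from the same list, and it is not the evaluation code used in the paper.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (ranks start at 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k=5):
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels in ranked order (1 = relevant page, 0 = not relevant).
top_hit = ndcg_at_k([1, 0, 0, 0, 0])      # relevant page ranked first
second_hit = ndcg_at_k([0, 1, 0, 0, 0])   # relevant page ranked second
```

With a single relevant page per query, ranking it first gives a score of 1.0, while ranking it second gives about 0.63, so the metric is sensitive to small changes at the top of the list.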

Examples from the ViDoRe benchmark

Examples of documents in the ViDoRe benchmark. These documents contain a mix of text and visual elements, including figures, tables, and infographics. The ColPali model is designed to effectively retrieve documents based on both textual and visual information.

Limitations

While ColPali is a powerful tool for document retrieval, it does have limitations:

Focus on PDF-like Documents: ColPali was primarily trained and evaluated on documents similar to PDFs, which often have structured layouts and contain images, tables, and figures. Its performance on other types of documents, like handwritten notes or web page screenshots, might be less impressive.
Limited Multilingual Support: Although ColPali shows promise in handling non-English languages (the TabFQuAD dataset, which is French), it was mainly trained on English data, so its performance with other languages may be less consistent.
Limited Domain-Specific Knowledge: While ColPali achieves strong results on various benchmarks, it might not generalize as well to highly specialized domains requiring specialized knowledge. Fine-tuning the model for specific domains might be necessary for optimal performance.

We consider the architecture of ColPali to be more important than the underlying VLM model or the exact training data used to train the model checkpoint. The ColPali approach generalizes to other VLMs as they become available, and we expect to see more models and model checkpoints trained on more data. Similarly, for text-only, we have seen that the ColBERT architecture is used to train new model checkpoints, 4 years after its introduction. We expect to see the same with the ColPali architecture in the future.

ColPali’s role in RAG Pipelines

The primary function of the RAG pipeline is to generate answers based on the retrieved documents. The retriever’s role is to find the most relevant documents, and the generator then extracts and synthesizes information from those documents to create a coherent answer.

Therefore, if ColPali is effectively retrieving the most relevant documents based on both text and visual cues, the generative phase in the RAG pipeline can focus on processing and summarizing the retrieved results from ColPali.

ColPali already incorporates visual information directly into its retrieval process and for optimal results, one should consider feeding the retrieved image data into the VLM for the generative phase. This would allow the model to leverage both textual and visual information when generating answers.

Summary

ColPali is a groundbreaking document retrieval model, probably one of the most significant advancements in the field of document retrieval in recent years. By leveraging the power of vision language models, ColPali can effectively retrieve documents based on both textual and visual information. This approach has the potential to revolutionize the way we search and retrieve information, making it more intuitive and efficient. Not only does ColPali outperform traditional text-based retrieval methods, but it also opens up new possibilities for integrating visual information into retrieval-augmented generation pipelines.

As if this wasn’t enough, it also greatly simplifies the document retrieval pipeline by eliminating the need for text extraction, OCR, and layout analysis. This makes it easier to implement and deploy in real-world applications with minimal preprocessing and extraction steps during indexing.

If you want to learn more about ColPali, and how to represent ColPali in Vespa, check out our previous post on PDF Retrieval with Vision Language Models.

Other useful resources on ColPali:
