Lester Solbakken
Principal Vespa Engineer

Accelerating stateless model evaluation on Vespa

A central architectural feature of Vespa.ai is the division of work between the stateless container cluster and the content cluster.

Most computation, such as evaluating machine-learned models, happens in the content cluster. However, it has become increasingly important to efficiently evaluate models in the container cluster as well, to process or transform documents or queries before storage or execution. One prominent example is to generate a vector representation of natural language text for queries and documents for nearest neighbor retrieval.

We have recently implemented accelerated model evaluation using ONNX Runtime in the stateless cluster, which opens up new usage areas for Vespa.

Introduction

At Vespa.ai we differentiate between stateful and stateless machine-learned model evaluation. Stateless model evaluation is what one usually thinks about when serving machine-learned models in production. For instance, one might have a stand-alone model server that is called from somewhere in a serving stack. The result of evaluating a model there only depends upon its input.

In contrast, stateful model serving combines input with stored or persisted data. This poses some additional challenges. One is that models typically need to be evaluated many times per query, once per data point. This has been a focus area of Vespa.ai for quite some time, and we have previously written about how we accelerate stateful model evaluation in Vespa.ai using ONNX Runtime.

However, stateless model evaluation does have its place in Vespa.ai as well. For instance, transforming query input or document content using Transformer models. Or finding a vector representation for an image for image similarity search. Or translating text to another language. The list goes on.

Vespa.ai has actually had stateless model evaluation for some time, but we’ve recently added acceleration of ONNX models using ONNX Runtime. This makes the feature much more powerful and opens up new use cases for Vespa.ai. In this post, we’ll take a look at some capabilities this enables:

  • The automatically generated REST API for model serving.
  • Creating lightweight request handlers for serving models with some custom code without the need for content nodes.
  • Adding model evaluation to searchers for query processing and enrichment.
  • Adding model evaluation to document processors for transforming content before ingestion.
  • Batch-processing results from the ranking back-end for additional ranking models.

We’ll start with a quick overview of where machine-learned models are evaluated in Vespa.ai.

Vespa.ai applications: container and content nodes

Vespa.ai is a distributed application consisting of various types of services on multiple nodes. A Vespa.ai application is fully defined in an application package. This single unit contains everything needed to set up an application, including all configuration, custom components, schemas, and machine-learned models. When the application package is deployed, the admin cluster takes care of configuring all the services across all the system’s nodes, including distributing all models to the nodes that need them.

Vespa architecture

The container nodes process queries or documents before passing them on to the content nodes. So, when a document is fed to Vespa, content can be transformed or added before being stored. Likewise, queries can be transformed or enriched in various ways before being sent for further processing.

The content nodes are responsible for persisting data. They also do most of the required computation when responding to queries. As that is where the data is, this avoids the cost of transferring data across the network. Query data is combined with document data to perform this computation in various ways.

We thus differentiate between stateless and stateful machine-learned model evaluation. Stateless model evaluation happens on the container nodes and is characterized by a single model evaluation per query or document. Stateful model evaluation happens on the content nodes, and the model is typically evaluated a number of times using data from both the query and the document.

The exact configuration of the services on the nodes is specified in services.xml. Here, the number of container and content nodes, and their capabilities, are fully configured. Indeed, a Vespa.ai application can be set up without any content nodes at all, running purely stateless container code, including serving machine-learned models.

This makes applications easy to deploy. It also offers a lot of flexibility: many types of models and computations can be combined out of the box, without any plugins or extensions. In the next section, we’ll see how to set up stateless model evaluation.

Stateless model evaluation

So, by stateless model evaluation we mean machine-learned models that are evaluated on Vespa container nodes. This is enabled by simply adding the model-evaluation tag in services.xml:

...
<container>
    ...
    <model-evaluation/>
    ...
</container>
...

When this is specified, Vespa scans through the models directory in the application package to find any importable machine-learned models. Currently supported model types are TensorFlow, ONNX, XGBoost, LightGBM, and Vespa’s own stateless models.
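For reference, the models directory is simply part of the application package. A sketch of a typical layout, with illustrative file names:

application-package/
├── models/
│   ├── transformer.onnx
│   └── pairwise_ranker.onnx
├── schemas/
└── services.xml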

Enabling model evaluation has two effects. The first is that a REST API for model discovery and evaluation is automatically enabled. The other is that custom components can have a special ModelsEvaluator object dependency-injected into their constructors.

Stateless model evaluation

In the following, we’ll take a look at some uses of these, using the model-evaluation sample app for demonstration.

REST API

The automatically added REST API provides an API for model discovery and evaluation. This is great for using Vespa as a standalone model server, or making models available for other parts of the application stack.

To get a list of imported models, call http://host:port/model-evaluation/v1. For instance:

$ curl -s 'http://localhost:8080/model-evaluation/v1/'
{
    "pairwise_ranker": "http://localhost:8080/model-evaluation/v1/pairwise_ranker",
    "transformer": "http://localhost:8080/model-evaluation/v1/transformer"
}

This application has two models, the transformer model and the pairwise_ranker model. We can inspect a model to see expected inputs and outputs:

$ curl -s 'http://localhost:8080/model-evaluation/v1/transformer/output'
{
    "arguments": [
        {
            "name": "input",
            "type": "tensor(d0[],d1[])"
        },
        {
            "name": "onnxModel(transformer).output",
            "type": "tensor<float>(d0[],d1[],d2[16])"
        }
    ],
    "eval": "http://localhost:8080/model-evaluation/v1/transformer/output/eval",
    "function": "output",
    "info": "http://localhost:8080/model-evaluation/v1/transformer/output",
    "model": "transformer"
}

All model inputs and outputs are Vespa tensors. See the tensor user guide for more information.

This model has one input, with tensor type tensor(d0[],d1[]). This tensor has two dimensions: d0 is typically a batch dimension, and d1 represents, for this model, a sequence of tokens. The output, of type tensor<float>(d0[],d1[],d2[16]), adds a dimension d2 representing the embedding. So the output contains an embedding representation for each token in the input.
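As a concrete illustration, an input holding a single sequence of two tokens can be written in Vespa’s tensor literal form like this (the token values are just examples):

{{d0:0,d1:0}:1.0,{d0:0,d1:1}:2.0}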

By calling /model-evaluation/v1/transformer/eval and passing a URL-encoded input parameter, Vespa evaluates the model and returns the result as a JSON-encoded tensor.
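For instance, a hypothetical call evaluating the model for the two-token input above, where the input parameter is simply the URL-encoded tensor literal:

$ curl -s 'http://localhost:8080/model-evaluation/v1/transformer/eval?input=%7B%7Bd0%3A0%2Cd1%3A0%7D%3A1.0%2C%7Bd0%3A0%2Cd1%3A1%7D%3A2.0%7D'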

Please refer to the sample application for a runnable example.

Request handlers

The REST API takes exactly the same input as the models it serves. In some cases one might want to pre-process the input before providing it to the model. A common example is to tokenize natural language text before passing the token sequence to a language model such as BERT.

Vespa provides request handlers, which let applications implement arbitrary HTTP APIs. With custom request handlers, arbitrary code can be run both before and after model evaluation.

When the model-evaluation tag has been supplied, Vespa makes a special ModelsEvaluator object available which can be injected into a component (such as a request handler):

public class MyHandler extends ThreadedHttpRequestHandler {

    private final ModelsEvaluator modelsEvaluator;

    public MyHandler(ModelsEvaluator modelsEvaluator, Context context) {
        super(context);
        this.modelsEvaluator = modelsEvaluator;
    }

    @Override
    public HttpResponse handle(HttpRequest request) {

        // Get the input
        String inputString = request.getProperty("input");

        // Convert to a Vespa tensor
        TensorType expectedType = TensorType.fromSpec("tensor<int8>(x[])");
        Tensor input = Tensor.from(expectedType, inputString);

        // Perform any pre-processing to the tensor
        // ...

        // Evaluate the model
        FunctionEvaluator evaluator = modelsEvaluator.evaluatorOf("transformer");
        Tensor result = evaluator.bind("input", input).evaluate();

        // Perform any post-processing to the tensor
        // ...

        // Return the result as a response; a minimal sketch which
        // writes the tensor in its string form
        return new HttpResponse(200) {
            @Override
            public void render(OutputStream stream) throws IOException {
                stream.write(result.toString().getBytes(StandardCharsets.UTF_8));
            }
        };
    }
}

A full example can be seen in the MyHandler class in the sample application and its unit test.
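For the handler to receive requests, it must also be declared in services.xml. A minimal sketch, assuming a hypothetical ai.vespa.example package and a /my-handler binding:

<container>
    <model-evaluation/>
    <handler id="ai.vespa.example.MyHandler">
        <binding>http://*/my-handler</binding>
    </handler>
</container>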

As mentioned, arbitrary code can be run here. Pragmatically, it is often more convenient to put the processing pipeline in the model itself. While not always possible, this helps protect against divergence between the data processing pipeline in training and in production.

Document processors

The REST API and request handler can work with a purely stateless application, such as a model server. However, it is much more common for Vespa.ai applications to have content. As such, it is fairly common to process incoming documents before storing them. Vespa provides a chain of document processors for this.

Applications can implement custom document processors and add them to the processing chain. In the context of model evaluation, a typical task is to use a machine-learned model to create a vector representation of a natural language text. The text is first tokenized, then run through a language model such as BERT to generate a vector representation, which is then stored. Such a vector representation can, for instance, be used in nearest neighbor search. Other examples are sentiment analysis, creating representations of images, object detection, translating text, and so on.

The ModelsEvaluator can be injected into your component as already seen:

public class MyDocumentProcessor extends DocumentProcessor {

    private final ModelsEvaluator modelsEvaluator;

    public MyDocumentProcessor(ModelsEvaluator modelsEvaluator) {
        this.modelsEvaluator = modelsEvaluator;
    }

    @Override
    public Progress process(Processing processing) {
        for (DocumentOperation op : processing.getDocumentOperations()) {
            if (op instanceof DocumentPut) {
                DocumentPut put = (DocumentPut) op;
                Document document = put.getDocument();

                // Get tokens
                Tensor tokens = (Tensor) document.getFieldValue("tokens").getWrappedValue();

                // Perform any pre-processing to the tensor
                // ...

                // Evaluate the model
                FunctionEvaluator evaluator = modelsEvaluator.evaluatorOf("transformer");
                Tensor result = evaluator.bind("input", tokens).evaluate();

                // Reshape and extract the embedding vector (not shown)
                Tensor embedding = ...

                // Set embedding in document
                document.setFieldValue("embedding", new TensorFieldValue(embedding));
            }
        }
        return Progress.DONE;
    }
}

Notice the code looks a lot like the previous example for the request handler. The document processor receives a pre-constructed ModelsEvaluator from Vespa which contains the transformer model. This code receives a tensor contained in the tokens field, runs that through the transformer model, and puts the resulting embedding into a new field. This is then stored along with the document.

Again, a full example can be seen in the MyDocumentProcessor class in the sample application and its unit test.
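To put the document processor to work, it is added to a document processing chain in services.xml. Again a minimal sketch, with the package name and chain id as assumptions:

<container>
    <model-evaluation/>
    <document-processing>
        <chain id="my-chain">
            <documentprocessor id="ai.vespa.example.MyDocumentProcessor"/>
        </chain>
    </document-processing>
</container>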

Searchers: query processing

Similar to document processing, queries are processed along a chain of searchers. Vespa provides a default chain of searchers for various tasks, and applications can provide additional custom searchers as well. In the context of model evaluation, the use cases are similar to document processing: a typical task for text search is to generate vector representations for nearest neighbor search.

Again, the ModelsEvaluator can be injected into your component:

public class MySearcher extends Searcher {

    private final ModelsEvaluator modelsEvaluator;

    public MySearcher(ModelsEvaluator modelsEvaluator) {
        this.modelsEvaluator = modelsEvaluator;
    }

    @Override
    public Result search(Query query, Execution execution) {

        // Get the query input
        String inputString = query.properties().getString("input");

        // Convert to a Vespa tensor
        TensorType expectedType = TensorType.fromSpec("tensor<int8>(x[])");
        Tensor input = Tensor.from(expectedType, inputString);

        // Perform any pre-processing to the tensor
        // ...

        // Evaluate model
        FunctionEvaluator evaluator = modelsEvaluator.evaluatorOf("transformer");
        Tensor output = evaluator.bind("input", input).evaluate();

        // Reshape and extract the embedding vector (not shown)
        Tensor embedding = ...

        // Add this tensor to query
        query.getRanking().getFeatures().put("query(embedding)", embedding);

        // Continue processing
        return execution.search(query);
    }
}

As before, a full example can be seen in the MySearcher class in the sample application and its unit test.
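Likewise, the searcher is activated by adding it to a search chain in services.xml. A minimal sketch along the same lines:

<container>
    <model-evaluation/>
    <search>
        <chain id="my-chain" inherits="vespa">
            <searcher id="ai.vespa.example.MySearcher"/>
        </chain>
    </search>
</container>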

Searchers: result post-processing

Searchers don’t just process queries before they are sent to the back-end: they are just as useful for post-processing the results that come back. A typical example is to de-duplicate similar results in a search application. Another is to apply business rules to reorder the results, especially if they come from various back-ends. In the context of machine learning, one example is to de-tokenize tokens back into natural language text.

Post-processing is similar to the example above, but the search is executed first, and tensor fields from the documents are extracted and used as input to the models. In the sample application we have a model that compares all results with each other to perform another phase of ranking. See the MyPostProcessing searcher for details.
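In outline, such a post-processing searcher could look something like the following minimal sketch, where the embedding field name and the re-ranking step are assumptions for illustration:

public class MyPostProcessing extends Searcher {

    private final ModelsEvaluator modelsEvaluator;

    public MyPostProcessing(ModelsEvaluator modelsEvaluator) {
        this.modelsEvaluator = modelsEvaluator;
    }

    @Override
    public Result search(Query query, Execution execution) {

        // Execute the query down the chain first
        Result result = execution.search(query);

        // Make sure document fields are available on the returned hits
        execution.fill(result);

        for (Hit hit : result.hits()) {
            // Extract a tensor field from each hit (field name is illustrative)
            Tensor embedding = (Tensor) hit.getField("embedding");

            // Evaluate an additional ranking model and update the hit's
            // relevance score accordingly
            // ...
        }

        // Re-order the hits by their updated relevance scores
        result.hits().sort();

        return result;
    }
}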

Conclusion

In Vespa.ai, most of the computation required for executing queries has traditionally been run in the content cluster. This makes sense, as it avoids transmitting data across the network to external model servers, which quickly becomes a scalability bottleneck.

With the introduction of accelerated machine-learned model evaluation in the container cluster, we further increase the capabilities of Vespa as a fully-featured platform for low-latency computations over large, evolving data.

In summary, Vespa.ai offers ease of deployment, flexibility in combining many types of models and computations out of the box without any plugins or extensions, efficient evaluation and a less complex system to maintain. This makes Vespa.ai an attractive platform.

In a later post, we will follow up with performance measurements and some guidelines on when to move model evaluation out of the content node and to the container.