Kai Borgen
Technical Product Engineer

Vespa Quickstart - How to build an Application with Vespa

Get started with Vespa and set up your first application. Build your first Vespa instance using Python.

In this guide we will show you how to set up, run, and query a Vespa instance. We will build a simple search application from scratch to search through some of the highest-rated films on IMDB. The code snippets in this guide can be copied and pasted into a Jupyter notebook if you want to follow along.

We will use Vespa Cloud to host our instance, to show how easy it is to get started with the world's most capable AI search platform!


Photo by Jakob Owens on Unsplash

Setting up Vespa Cloud

First, create a free Vespa Cloud account and choose a tenant name when prompted.

Installing Vespa

To start using Vespa in Python we need to install pyvespa >= 0.59 and the Vespa CLI.

Pyvespa is the Python wrapper for your Vespa instance, and the Vespa CLI is used for creating and deploying your Vespa instance (see the Vespa Cloud Security Guide):

# MacOS and Linux
pip3 install pyvespa vespacli --upgrade

# Windows
pip install pyvespa vespacli --upgrade
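
If you want to confirm that the installed pyvespa meets the 0.59 minimum before going further, a small check like the one below can help. This is only a sketch: `parse_version` and `meets_minimum` are helper names we made up for this guide, and the comparison only looks at the leading numeric parts of the version string.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple:
    """Turn '0.59.1' into (0, 59, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "0.59") -> bool:
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


try:
    installed = version("pyvespa")
    status = "OK" if meets_minimum(installed) else "too old, please upgrade"
    print(f"pyvespa {installed}: {status}")
except PackageNotFoundError:
    print("pyvespa is not installed")
```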

Setting up the Vespa Application

Creating a Vespa application always starts with understanding and preparing our data. For this quickstart guide we will be using a dataset of the top 100 rated films on IMDB.

Download the dataset zip file found here, and add it to your workspace. Take a look at the data and familiarise yourself with it a little.

import json

with open('IMDB_top_100.json', 'r') as file:
    imdb_data = json.load(file)
print(type(imdb_data))   
print(type(imdb_data[0])) # data is a list of dictionaries containing information on each film
print(imdb_data[0].keys()) 

To be able to use the imdb_data, we must transform it into a format (documents) that we can feed to Vespa.

Pyvespa expects the data we feed it with to be presented as a list of dictionaries with a document id, and a set of fields (which is also a dictionary).

{ 
    "id":     "unique_id_of_document",
    "fields": {
                "field_1": "value",
                ...
                ...
                ...
                "field_N": "value"
              }
}

The “fields” are the attributes of the data that we want to be able to search or use during a query.

For this tutorial, we will use only part of the dataset to keep things simple. Let’s transform our IMDB data into the correct format for Vespa!

vespa_feed = []

for film in imdb_data:
    film_dict = {
        'id': film['id'],
        'fields': {
            'film_id':          film['id'],
            'film_title':       film['Series_Title'],
            'synopsis':         film['Synopsis'], 
            'year':             film['Released_Year'],
        }
    }
    vespa_feed.append(film_dict)

Note: If we want to be able to see and return the id of the film when querying, we need to add it as a field as well!
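
The same transformation can also be written as a small function, which makes it easier to try on a single record first. The sketch below assumes that ids and years may arrive as either numbers or strings, so it converts them explicitly; the function name `to_vespa_document` and the sample record are made up for illustration and are not part of the dataset.

```python
def to_vespa_document(film: dict) -> dict:
    """Build one feed entry; raises KeyError if a source key is missing."""
    return {
        "id": str(film["id"]),
        "fields": {
            "film_id": str(film["id"]),
            "film_title": film["Series_Title"],
            "synopsis": film["Synopsis"],
            # Assumption: the year may be stored as text, so convert it to int.
            "year": int(film["Released_Year"]),
        },
    }


# Hypothetical record, shaped like the keys used in this guide.
sample_film = {
    "id": 1,
    "Series_Title": "Example Film",
    "Synopsis": "A made-up synopsis.",
    "Released_Year": "1994",
}
sample_doc = to_vespa_document(sample_film)
print(sample_doc)
```

With the real data, the whole feed could then be built with `[to_vespa_document(film) for film in imdb_data]`.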

Configuring Vespa

Now that we understand and have prepared the format of our data, we must configure our Vespa application.

To configure Vespa we create an ApplicationPackage. The application package needs a name, one (or more) Schema(s), and usually some components.

from vespa.package import ApplicationPackage

app_package = ApplicationPackage(name="imdbquickstart")
# Name must be lowercase, alphanumeric only, and less than 20 characters long
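
If you plan to pick your own application name, you can check it against the rule in the comment above before deploying. This is a minimal sketch of that rule exactly as stated here (lowercase, alphanumeric, fewer than 20 characters); `is_valid_app_name` is a helper name we made up, and Vespa Cloud may apply further rules that this check does not cover.

```python
import re

# Lowercase letters and digits only, 1 to 19 characters.
APP_NAME_PATTERN = re.compile(r"[a-z0-9]{1,19}")


def is_valid_app_name(name: str) -> bool:
    """Return True if the name follows the rule described in this guide."""
    return APP_NAME_PATTERN.fullmatch(name) is not None


for candidate in ["imdbquickstart", "IMDB", "imdb-quickstart"]:
    print(candidate, is_valid_app_name(candidate))
```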

Schema

A schema describes what the document we feed to Vespa looks like, and what we want to do with that document. Let's start with a blank schema!

from vespa.package import Schema, Document

# The schema needs a name, and must be initialized with a Document class
film_schema = Schema(name="imdb_film_schema", document=Document())

Fields

In our “vespa_feed” we transformed the documents in our dataset into a set of Fields: “film_title”, “synopsis”, “year” and “film_id”.

Our Schema needs to be configured to capture these fields.

from vespa.package import Field


film_schema.add_fields(
    Field(
        name="film_id",            
        type="string",                
        indexing=["summary"],  
    ),
    Field(
        name="film_title",                   
        type="string",                  
        indexing=["index", "summary"],           
    ),
    Field(
        name="synopsis",                    
        type="string",
        indexing=["index", "summary"],            
        bolding=True,                   
    ),
    Field(
        name="year",
        type="int",
        indexing=["attribute"],
    ),
)

name - must match the name of the field in the dictionary.

type - specifies the data type of the field

indexing - specifies how Vespa processes the field during document feeding:

  • “summary” means that the field is included in full in the document summary, i.e. that this value can be returned during a search.
  • “index” means that Vespa will create a searchable index for the field.
  • “attribute” means that the field is stored in memory for quick sorting and grouping.
  • (Read more about the indexing language)
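
Since each field name in the schema must match a key in the “fields” dictionary, it can be worth comparing the two before feeding. The sketch below does that in plain Python; `field_mismatches` is a helper name we made up, and the two sample documents are invented for illustration (the second one deliberately uses a wrong key).

```python
def field_mismatches(schema_fields: set, feed: list) -> dict:
    """Map document id to its missing and unexpected field names."""
    problems = {}
    for doc in feed:
        present = set(doc["fields"])
        missing = schema_fields - present
        unexpected = present - schema_fields
        if missing or unexpected:
            problems[doc["id"]] = {
                "missing": sorted(missing),
                "unexpected": sorted(unexpected),
            }
    return problems


# The four field names defined in the schema above.
schema_field_names = {"film_id", "film_title", "synopsis", "year"}

# Hypothetical feed entries; the second one uses "title" instead of "film_title".
sample_feed = [
    {"id": "1", "fields": {"film_id": "1", "film_title": "Example Film",
                           "synopsis": "A made-up synopsis.", "year": 1994}},
    {"id": "2", "fields": {"film_id": "2", "title": "Misnamed key",
                           "synopsis": "Another made-up synopsis.", "year": 2001}},
]

problems = field_mismatches(schema_field_names, sample_feed)
print(problems)
```

An empty result for `field_mismatches(schema_field_names, vespa_feed)` would suggest that the feed and the schema agree on field names.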

Fieldsets

A Fieldset groups fields together for searching. Below we define a fieldset named “default”, which groups the title and synopsis together for searching.

Naming this fieldset “default” means that Vespa will default to using this fieldset if a field or fieldset is not specified when querying with userQuery() (more on that in a bit).

from vespa.package import FieldSet

film_schema.add_field_set(
    FieldSet(name="default", fields=["film_title", "synopsis"]) 
)

Our Schema is now finished. For this tutorial we are keeping it simple, but there is way more you can do with your Schema!

(E.g. adding vector-search capabilities, different Fieldsets and Rank profiles, all of which we will be looking at in another tutorial!)

Finishing the Application Package

We can now add our Schema to our application package:

app_package.add_schema(film_schema)

Our application package is now finished and ready to be deployed!

Deploying to Vespa Cloud

The following code shows you how to authenticate for cloud deployment. You may be prompted to open your browser in order to authenticate.

from vespa.deployment import VespaCloud

# Authentication to cloud
vespa_cloud = VespaCloud(
    tenant="YourTenantNameHere", 
    # the tenant name you created when signing up to the cloud
    application=app_package.name,
    application_package=app_package,
)

And now we can deploy!

(It might take a few minutes the first time)

app = vespa_cloud.deploy()

Our application is now deployed in the cloud!

Let's get our certificate and key, as well as the endpoint, so that we can connect to our deployed instance from anywhere.

cert_path = app.cert
key_path = app.key
print('Certificate:', cert_path)
print('Key:', key_path)

endpoint = vespa_cloud.get_mtls_endpoint()

Connecting to a Deployed Vespa Instance

Now that we have a deployed instance we want to be able to connect to it. In order to connect we need:

  • The data plane certificate
  • The data plane private key
  • The endpoint where our instance is residing

These are what we fetched in the previous step.

from vespa.application import Vespa

vespa_instance = Vespa(endpoint, cert=cert_path, key=key_path)
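
If you connect from another machine or a fresh notebook session, it can help to confirm that the certificate and key files are actually there before creating the connection. This is a small sketch in plain Python; `missing_credentials` is a helper name we made up, and the paths below are placeholders rather than real file locations.

```python
from pathlib import Path


def missing_credentials(cert_path: str, key_path: str) -> list:
    """Return the credential paths that do not point to an existing file."""
    return [str(path) for path in (cert_path, key_path) if not Path(path).is_file()]


# Placeholder paths for illustration; use cert_path and key_path from the deploy step.
missing = missing_credentials("no-such-dir/cert.pem", "no-such-dir/key.pem")
if missing:
    print("Missing credential files:", missing)
```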

Feeding the data

Let's feed our deployed application the data that we prepared earlier.

from vespa.io import VespaResponse

def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(f"Error when feeding document {id}: {response.get_json()}")


vespa_instance.feed_iterable(vespa_feed, schema="imdb_film_schema", callback=callback)
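
Printing errors works well for 100 documents, but you may prefer to collect failures in a list so you can inspect or retry them afterwards. Below is a sketch of such a callback. It relies only on the `is_successful()` and `get_json()` methods used above; `collect_failures`, `failed_documents` and the `FakeResponse` stand-in are names we made up so the example runs without a deployment.

```python
failed_documents = []


def collect_failures(response, id: str) -> None:
    """Feed callback that records failed documents instead of printing them."""
    if not response.is_successful():
        failed_documents.append({"id": id, "error": response.get_json()})


class FakeResponse:
    """Hypothetical stand-in for VespaResponse, used only for this demo."""

    def __init__(self, ok: bool, payload: dict):
        self.ok = ok
        self.payload = payload

    def is_successful(self) -> bool:
        return self.ok

    def get_json(self) -> dict:
        return self.payload


collect_failures(FakeResponse(True, {}), "1")
collect_failures(FakeResponse(False, {"message": "example error"}), "2")
print(failed_documents)
```

To use it for real, pass `callback=collect_failures` to `feed_iterable` and check `failed_documents` once feeding has finished.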

Querying

Our Vespa application is deployed and we have fed it our data.

It is now ready to use!

# We'll make a function so that it is easier to see the results

from vespa.io import VespaQueryResponse

def print_hits(response: VespaQueryResponse):
    for i, hit in enumerate(response.hits):
        print(f"{i+1:>3}  {hit['fields']['film_id']:>3}  {hit['fields']['film_title']}")
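
If you later query fields that are not included in the document summary, a hit may lack some of the keys used above. The variant below is a sketch that tolerates missing keys and returns the formatted lines instead of printing them; `format_hits` is a name we made up, and the sample hits are invented to mimic the shape of `response.hits`.

```python
def format_hits(hits: list) -> list:
    """Format hits as 'rank  id  title' lines, tolerating missing fields."""
    lines = []
    for rank, hit in enumerate(hits, start=1):
        fields = hit.get("fields", {})
        film_id = fields.get("film_id", "?")
        title = fields.get("film_title", "<no title>")
        lines.append(f"{rank:>3}  {film_id:>3}  {title}")
    return lines


# Hypothetical hits; the second one has no summary fields at all.
sample_hits = [
    {"fields": {"film_id": "99", "film_title": "Good Will Hunting"}},
    {"fields": {}},
]
for line in format_hits(sample_hits):
    print(line)
```

With a real response this would be called as `format_hits(response.hits)`.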

Let's make our first query.

# Searching for films with "good" in the title

response = vespa_instance.query(
    yql = "select * from sources * where film_title contains 'good'"
)
print_hits(response)

Result:

  1   99  Good Will Hunting
  2   12  The good, the Bad and the Ugly

And let's do some more!

# Searching for films with "good" in the title that are made after 1970

response = vespa_instance.query(
    yql = "select * from sources * where film_title contains 'good' AND year > 1970"
)
print_hits(response)

Result:

  1   99  Good Will Hunting

# Searching for films with 'batman' in the default fieldset, i.e. in the title or the synopsis

response = vespa_instance.query(
    yql="select * from sources * where default contains 'batman'"
)

print_hits(response)

Result:

  1   63  The Dark Knight Rises
  2    2  The Dark Knight

userQuery() Function

We recommend you use the userQuery() function for handling user input. userQuery() parses and cleans the input automatically, protecting against unintended operations and potential attacks.
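
To make the difference concrete: with userQuery() the YQL string stays fixed, and whatever the user typed travels separately in the `query` parameter instead of being pasted into the YQL. The sketch below builds such a request body in plain Python; `build_user_query` is a helper name we made up for illustration.

```python
USER_QUERY_YQL = "select * from sources * where userQuery()"


def build_user_query(user_input: str) -> dict:
    """Build a request body where user input never becomes part of the YQL."""
    return {
        "yql": USER_QUERY_YQL,
        "query": user_input,
    }


# Even odd-looking input leaves the YQL untouched.
body = build_user_query("world war' or true")
print(body)
```

A body like this could then be sent with `vespa_instance.query(body=build_user_query("world war"))`.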

# Searching for 'world war' using userQuery(). This function uses the default fieldset

response = vespa_instance.query(
    yql="select * from sources * where userQuery()", # The default limit in Vespa is 10 results
    query = "world war", # when using userQuery() you must provide the query as a parameter
)

print_hits(response)

Result:

  1   80  Paths of Glory
  2   50  Casablanca
  3   84  1917
  4   46  Grave of the Fireflies
  5   24  Saving Private Ryan
  6   93  Inglourious Basterds
  7   16  Star Wars: Episode V - The Empire Strikes Back
  8   29  Star Wars
  9   26  Life is Beautiful
 10   54  Ayla: The Daughter of War

# Searching for 'world war' using userQuery() and limiting the number of results

response = vespa_instance.query(
    yql="select * from sources * where userQuery() limit 5",
    query = "world war",
)

print_hits(response)

Result:

  1   80  Paths of Glory
  2   50  Casablanca
  3   84  1917
  4   46  Grave of the Fireflies
  5   24  Saving Private Ryan
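
If you want the next five results rather than the first five, one option is to page through the results. The sketch below takes a different route than the YQL `limit` used above: it computes the `hits` and `offset` query parameters for a given page. `page_parameters` is a helper name we made up, and pages are counted from zero.

```python
def page_parameters(page: int, page_size: int = 10) -> dict:
    """Return the hits/offset query parameters for a zero-based page number."""
    if page < 0 or page_size < 1:
        raise ValueError("page must be >= 0 and page_size must be >= 1")
    return {"hits": page_size, "offset": page * page_size}


for page in range(3):
    print(page, page_parameters(page, page_size=5))
```

Passing these along, for example `vespa_instance.query(yql="select * from sources * where userQuery()", query="world war", **page_parameters(1, 5))`, should then return results 6 to 10.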

And voilà! You have created, deployed and queried your first Vespa instance! We encourage you to play with the code provided here, or maybe even try out your own dataset. When you are done, just remember to delete your deployment in Vespa Cloud so you don’t use up your credits!

This has been a tutorial to get you up and running quickly with Vespa, but you have yet to witness the true beauty of Vespa!

In the next tutorials we will be looking at how Vespa can do vector search of semantic content, custom ranking of results and so much more!

Or if you want to see immediately what a full-blown Vespa application can do, check out these links:
