Vespa Quickstart - How to build an Application with Vespa
Get started with Vespa and set up your first application. Build your first Vespa instance using Python.
In this guide we will show you how to set up, run and query an instance of Vespa. We will be building a simple search application from scratch to search through some of the highest-rated films on IMDB. The code snippets in this guide can be copied and pasted into a Jupyter notebook if you want to follow along.
We will be using Vespa Cloud to host our instance to show how easy it is to get started with the world's most capable AI search platform!
Photo by Jakob Owens on Unsplash
Setting up Vespa Cloud
First, create a free Vespa Cloud account, and a tenant name when prompted.
Installing Vespa
To start using Vespa in Python we need to install pyvespa >= 0.59 and the Vespa CLI.
Pyvespa is the Python wrapper for your Vespa instance, and the Vespa CLI is used for creating and deploying your Vespa instance (Vespa Cloud Security Guide):
# MacOS and Linux
pip3 install pyvespa vespacli --upgrade
# Windows
pip install pyvespa vespacli --upgrade
Setting up the Vespa Application
Creating a Vespa application always starts with understanding and preparing our data. For this quickstart guide we will be using a dataset of the top 100 rated films on IMDB.
Download the dataset zip file found here, and add it to your workspace. Take a look at the data and familiarise yourself with it a little.
import json
with open('IMDB_top_100.json', 'r') as file:
    imdb_data = json.load(file)
print(type(imdb_data))
print(type(imdb_data[0])) # data is a list of dictionaries containing information on each film
print(imdb_data[0].keys())
To be able to use the imdb_data, we must transform it into a format (documents) that we can feed to Vespa.
Pyvespa expects the data we feed it to be a list of dictionaries, each with a document id and a set of fields (which is itself a dictionary).
{
    "id": "unique_id_of_document",
    "fields": {
        "field_1": "value",
        ...
        ...
        ...
        "field_N": "value"
    }
}
The “fields” are the attributes of the data that we want to be able to search or use during a query.
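Before feeding, it can be worth checking that every item really has this shape. The following is a minimal sketch, not part of pyvespa: `validate_feed` is a hypothetical helper and the sample records are made up for illustration.

```python
def validate_feed(docs):
    """Return a list of problems found in a feed (empty if none were found)."""
    problems = []
    seen_ids = set()
    for position, doc in enumerate(docs):
        if not isinstance(doc, dict):
            problems.append(f"item {position}: not a dictionary")
            continue
        if "id" not in doc:
            problems.append(f"item {position}: missing 'id'")
        elif doc["id"] in seen_ids:
            problems.append(f"item {position}: duplicate id {doc['id']!r}")
        else:
            seen_ids.add(doc["id"])
        if not isinstance(doc.get("fields"), dict):
            problems.append(f"item {position}: 'fields' must be a dictionary")
    return problems

# Made-up sample: one valid item, one duplicate id, one item missing both parts
sample_feed = [
    {"id": "1", "fields": {"film_title": "Example Film"}},
    {"id": "1", "fields": {"film_title": "Another Film"}},
    {"fields": "not a dict"},
]
problems = validate_feed(sample_feed)
for problem in problems:
    print(problem)
```

An empty list means the feed at least has the structure described above; it says nothing about whether the field values match the schema.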
For this tutorial, we will just be using parts of the data set for simplicity. Let’s transform our IMDB data into the correct format for Vespa!
vespa_feed = []
for film in imdb_data:
    film_dict = {
        'id': film['id'],
        'fields': {
            'film_id': film['id'],
            'film_title': film['Series_Title'],
            'synopsis': film['Synopsis'],
            'year': film['Released_Year'],
        }
    }
    vespa_feed.append(film_dict)
Note: If we want to be able to see and return the id of the film when querying, we need to add it as a field as well!
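Raw datasets often store numbers as text. Whether that is the case for `Released_Year` in this dataset is an assumption, but if it is, a small conversion step before building the feed may help. A sketch with a hypothetical helper `parse_year` and made-up records:

```python
def parse_year(raw):
    """Convert a raw year value to int, or return None if it is not a whole number."""
    try:
        return int(str(raw).strip())
    except ValueError:
        return None

# Made-up records for illustration only
raw_films = [
    {"id": "1", "Released_Year": "1994"},
    {"id": "2", "Released_Year": "unknown"},
]
years = {film["id"]: parse_year(film["Released_Year"]) for film in raw_films}
print(years)
```

Records where the year comes back as `None` could then be skipped or fixed by hand before feeding.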
Configuring Vespa
Now that we understand and have prepared the format of our data, we must configure our Vespa application.
To configure Vespa we create an ApplicationPackage. The application package needs a name, one (or more) Schema(s), and usually some components.
from vespa.package import ApplicationPackage
app_package = ApplicationPackage(name="imdbquickstart")
# Name must be lowercase, alphanumeric only, and less than 20 characters long
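The naming rule from the comment above can be checked up front. This sketch encodes only what that comment states (lowercase, alphanumeric, fewer than 20 characters); `is_valid_app_name` is a hypothetical helper, and the platform may enforce further rules not covered here.

```python
import re

# Lowercase letters and digits only, 1 to 19 characters (i.e. fewer than 20)
_APP_NAME_PATTERN = re.compile(r"[a-z0-9]{1,19}")

def is_valid_app_name(name):
    """Check a candidate name against the rule stated in this guide."""
    return _APP_NAME_PATTERN.fullmatch(name) is not None

for candidate in ["imdbquickstart", "IMDBQuickstart", "imdb_quickstart", "a" * 20]:
    print(candidate, is_valid_app_name(candidate))
```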
Schema
A schema describes what the document we feed to Vespa looks like, and what we want to do with that document. Let's start with a blank schema!
from vespa.package import Schema, Document
# The schema needs a name, and must be initialised with a Document instance
film_schema = Schema(name="imdb_film_schema", document=Document())
Fields
In our “vespa_feed” we transformed the documents in our dataset into a set of Fields: “film_title”, “synopsis”, “year” and “film_id”.
Our Schema needs to be configured to capture these fields.
from vespa.package import Field
film_schema.add_fields(
    Field(
        name="film_id",
        type="string",
        indexing=["summary"],
    ),
    Field(
        name="film_title",
        type="string",
        indexing=["index", "summary"],
    ),
    Field(
        name="synopsis",
        type="string",
        indexing=["index", "summary"],
        bolding=True,
    ),
    Field(
        name="year",
        type="int",
        indexing=["attribute"],
    ),
)
name - must match the name of the field in the dictionary.
type - specifies the data type of the field.
indexing - specifies how Vespa processes the field during document feeding:
- “summary” means that the field is included in full in the document summary, i.e. that this value can be returned during a search.
- “index” means that Vespa will create a searchable index for the field.
- “attribute” means that the field is stored in memory for quick sorting and grouping.
- (Read more about the indexing language)
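Because each field name in the schema must match a key in the “fields” dictionary, a quick comparison can catch typos before feeding. A minimal sketch, assuming the four field names used in this guide; `unexpected_fields` is a hypothetical helper and the sample documents are made up.

```python
# Field names declared in the schema in this guide
schema_fields = {"film_id", "film_title", "synopsis", "year"}

def unexpected_fields(docs, declared):
    """Map document id -> sorted field names that the schema does not declare."""
    mismatches = {}
    for doc in docs:
        extra = sorted(set(doc["fields"]) - declared)
        if extra:
            mismatches[doc["id"]] = extra
    return mismatches

# Made-up sample: the second document misspells "film_title"
sample_docs = [
    {"id": "1", "fields": {"film_id": "1", "film_title": "Example", "year": 1999}},
    {"id": "2", "fields": {"film_id": "2", "film_titel": "Typo", "year": 2001}},
]
mismatches = unexpected_fields(sample_docs, schema_fields)
print(mismatches)
```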
Fieldsets
A Fieldset groups fields together for searching. Below we define a fieldset named “default”, which groups the title and synopsis together for searching.
Naming this fieldset “default” means that Vespa will default to using this fieldset if a field or fieldset is not specified when querying with userQuery() (more on that in a bit).
from vespa.package import FieldSet
film_schema.add_field_set(
    FieldSet(name="default", fields=["film_title", "synopsis"])
)
Our Schema is now finished. For this tutorial we are keeping it simple, but there is way more you can do with your Schema!
(E.g. adding vector-search capabilities, different Fieldsets and Rank profiles, all of which we will be looking at in another tutorial!)
Finishing the Application Package
We can now add our Schema to our application package:
app_package.add_schema(film_schema)
Our application package is now finished and ready to be deployed!
Deploying to Vespa Cloud
The following code shows you how to authenticate for cloud deployment. You may be prompted to open your browser in order to authenticate.
from vespa.deployment import VespaCloud
# Authentication to cloud
vespa_cloud = VespaCloud(
    tenant="YourTenantNameHere",
    # the tenant name you created when signing up to the cloud
    application=app_package.name,
    application_package=app_package,
)
And now we can deploy!
(It might take a few minutes the first time)
app = vespa_cloud.deploy()
Our application is now deployed in the cloud!
Let's get our certificate and key, as well as the endpoint, so that we can connect to our deployed instance from anywhere.
cert_path = app.cert
key_path = app.key
print('Certificate:', cert_path)
print('Key:', key_path)
endpoint = vespa_cloud.get_mtls_endpoint()
Connecting to a Deployed Vespa Instance
Now that we have a deployed instance we want to be able to connect to it. In order to connect we need:
- The data plane certificate
- The data plane private key
- The endpoint where our instance is residing
We fetched all three in the previous step.
from vespa.application import Vespa
vespa_instance = Vespa(endpoint, cert=cert_path, key=key_path)
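To reconnect from another notebook or script later, the three connection details can be written to a small file. This is a sketch using only the standard library; the file name `vespa_connection.json` and the placeholder values are assumptions, and only the paths to the certificate and key are stored, not their contents.

```python
import json
import os
import tempfile

def save_connection(path, endpoint, cert_path, key_path):
    """Write the endpoint and the certificate/key file paths to a JSON file."""
    details = {"endpoint": endpoint, "cert_path": cert_path, "key_path": key_path}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(details, handle, indent=2)

def load_connection(path):
    """Read the details back and fail early if an entry is missing."""
    with open(path, "r", encoding="utf-8") as handle:
        details = json.load(handle)
    missing = {"endpoint", "cert_path", "key_path"} - set(details)
    if missing:
        raise KeyError(f"missing connection details: {sorted(missing)}")
    return details

# Round trip with placeholder values in a temporary directory
with tempfile.TemporaryDirectory() as workdir:
    config_file = os.path.join(workdir, "vespa_connection.json")
    save_connection(config_file, "https://example.invalid", "cert.pem", "key.pem")
    restored = load_connection(config_file)
print(restored["endpoint"])
```

The restored values could then be passed to `Vespa(...)` in the same way as `endpoint`, `cert_path` and `key_path` above.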
Feeding the data
Let's feed our deployed application the data that we prepared earlier.
from vespa.io import VespaResponse
def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(f"Error when feeding document {id}: {response.get_json()}")
vespa_instance.feed_iterable(vespa_feed, schema="imdb_film_schema", callback=callback)
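Printing errors is fine for 100 documents, but for larger feeds it may be handier to collect the failures so they can be inspected or retried. A sketch of an alternative callback with the same `(response, id)` signature as above; `FakeResponse` is a hypothetical stand-in used only so the example runs without a deployed instance.

```python
failed_documents = []

def collect_failures(response, id):
    """Callback that records failed documents instead of printing them."""
    if not response.is_successful():
        failed_documents.append((id, response.get_json()))

class FakeResponse:
    """Hypothetical stand-in exposing the two methods the callback relies on."""
    def __init__(self, ok, body):
        self._ok = ok
        self._body = body

    def is_successful(self):
        return self._ok

    def get_json(self):
        return self._body

# Simulate one successful and one failed feed operation
collect_failures(FakeResponse(True, {}), "1")
collect_failures(FakeResponse(False, {"message": "example error"}), "2")
print(failed_documents)
```

In the tutorial's code, this function would be passed as `callback=collect_failures` in place of `callback`.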
Querying
Our Vespa application is deployed and we have fed it our data.
It is now ready to use!
# We'll make a function so that it is easier to see the results
from vespa.io import VespaQueryResponse
def print_hits(response: VespaQueryResponse):
    for i, hit in enumerate(response.hits):
        # Double quotes outside, single quotes inside: nesting the same quote type fails before Python 3.12
        print(f"{i+1:>3} {hit['fields']['film_id']:>3} {hit['fields']['film_title']}")
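If a hit lacks one of the fields (for example when a field is not part of the summary), direct indexing raises a `KeyError`. A more forgiving variant can be sketched as below; `format_hits` is a hypothetical helper that works on a plain list of hit dictionaries, and the sample hits are made up.

```python
def format_hits(hits):
    """Turn a list of hit dictionaries into aligned text lines, tolerating missing fields."""
    lines = []
    for rank, hit in enumerate(hits, start=1):
        fields = hit.get("fields", {})
        film_id = fields.get("film_id", "?")
        title = fields.get("film_title", "<no title>")
        lines.append(f"{rank:>3} {film_id:>3} {title}")
    return lines

# Made-up hits: the second one has no fields at all
sample_hits = [
    {"fields": {"film_id": "99", "film_title": "Good Will Hunting"}},
    {"fields": {}},
]
formatted = format_hits(sample_hits)
for line in formatted:
    print(line)
```

With a real response, the list to pass in would be `response.hits`.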
Let's make our first query.
# Searching for films with "good" in the title
response = vespa_instance.query(
    yql="select * from sources * where film_title contains 'good'"
)
print_hits(response)
Result:
1 99 Good Will Hunting
2 12 The good, the Bad and the Ugly
And let's do some more!
# Searching for films with "good" in the title that are made after 1970
response = vespa_instance.query(
    yql="select * from sources * where film_title contains 'good' AND year > 1970"
)
print_hits(response)
Result:
1 99 Good Will Hunting
# Searching for films with 'batman' in the default fieldset, i.e. in the title or the synopsis
response = vespa_instance.query(
    yql="select * from sources * where default contains 'batman'"
)
print_hits(response)
Result:
1 63 The Dark Knight Rises
2 2 The Dark Knight
userQuery() Function
We recommend you use the userQuery() function for handling user input. userQuery() parses and cleans the input automatically, protecting against unintended operations and potential attacks.
# Searching for 'world war' using userQuery(). This function uses the default fieldset
response = vespa_instance.query(
    yql="select * from sources * where userQuery()", # The default limit in Vespa is 10 results
    query="world war", # when using userQuery() you must provide the query as a parameter
)
print_hits(response)
Result:
1 80 Paths of Glory
2 50 Casablanca
3 84 1917
4 46 Grave of the Fireflies
5 24 Saving Private Ryan
6 93 Inglourious Basterds
7 16 Star Wars: Episode V - The Empire Strikes Back
8 29 Star Wars
9 26 Life is Beautiful
10 54 Ayla: The Daughter of War
# Searching for 'world war' using userQuery() and limiting the number of results
response = vespa_instance.query(
    yql="select * from sources * where userQuery() limit 5",
    query="world war",
)
print_hits(response)
Result:
1 80 Paths of Glory
2 50 Casablanca
3 84 1917
4 46 Grave of the Fireflies
5 24 Saving Private Ryan
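When the same query is issued for many different user inputs, it can help to build the request parameters in one place. The sketch below only constructs a dictionary; `build_query_body` is a hypothetical helper, and using the `hits` request parameter as an alternative to `limit` in the YQL string is an assumption to verify against the Vespa Query API documentation.

```python
def build_query_body(user_text, hits=10):
    """Build request parameters for a userQuery() search over the default fieldset."""
    if hits < 1:
        raise ValueError("hits must be at least 1")
    return {
        "yql": "select * from sources * where userQuery()",
        "query": user_text,  # raw user input, parsed by userQuery()
        "hits": hits,        # assumed equivalent to 'limit' in the YQL string
    }

body = build_query_body("world war", hits=5)
print(body)
```

The resulting dictionary could then be sent with something like `vespa_instance.query(body=body)`.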
And voilà! You have created, deployed and queried your first Vespa instance! We encourage you to play with the code provided here, or maybe even try out your own dataset. When you are done, just remember to delete your deployment in Vespa Cloud so you don’t use up your credits!
This has been a tutorial to get you up and running quickly with Vespa, but you are yet to witness the true beauty of Vespa!
In the next tutorials we will be looking at how Vespa can do vector search of semantic content, custom ranking of results and so much more!
Or if you want to see immediately what a full-blown Vespa application can do, check out these links: