Thiago Martins
Vespa Data Scientist

Build a basic text search application from python with Vespa

Introducing the simplified pyvespa API: build a Vespa application from Python with a few lines of code.

UPDATE 2023-12-05: Code examples and links are updated to work with the latest releases of pyvespa. The learntorank library is now deprecated.

This post introduces the simplified pyvespa API, which lets us build a basic text search application from scratch with just a few lines of Python. Follow-up posts will add layers of complexity by incrementally building on top of the basic app described here.

Photo by Sarah Dorweiler on Unsplash

pyvespa exposes a subset of the Vespa API in Python. The library’s primary goal is to allow for faster prototyping and to facilitate Machine Learning experiments for Vespa applications. I have written about how we can use it to connect to and interact with running Vespa applications and to evaluate Vespa ranking functions from Python. This time, we focus on building and deploying applications from scratch.

Install

The simplified pyvespa API introduced here was released in version 0.2.0. The version specifier is quoted so that the shell does not treat > as an output redirection.

pip3 install "pyvespa>=0.2.0" learntorank
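
To confirm that the installed release meets the 0.2.0 minimum, a small check along these lines may help. It is only a sketch: the helper names version_tuple and meets_minimum are hypothetical, and the comparison looks only at the first three numeric components of the version string.

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '0.2.0' (or '0.2.0rc1') into a 3-tuple of ints, padding with zeros."""
    numbers = [int(n) for n in re.findall(r"\d+", version)[:3]]
    return tuple(numbers + [0] * (3 - len(numbers)))


def meets_minimum(installed: str, minimum: str = "0.2.0") -> bool:
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    # Reads the version from the installed package metadata.
    installed = metadata.version("pyvespa")
    print(installed, "ok" if meets_minimum(installed) else "too old")
except metadata.PackageNotFoundError:
    print("pyvespa is not installed")
```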

Define the application

As an example, we will build an application to search through CORD19 sample data.

Create an application package

The first step is to create a Vespa ApplicationPackage:

from vespa.package import ApplicationPackage

app_package = ApplicationPackage(name="cord19")

Add fields to the Schema

We can then add fields to the application’s Schema created by default in app_package.

from vespa.package import Field

app_package.schema.add_fields(
    Field(
        name = "cord_uid", 
        type = "string", 
        indexing = ["attribute", "summary"]
    ),
    Field(
        name = "title", 
        type = "string", 
        indexing = ["index", "summary"], 
        index = "enable-bm25"
    ),
    Field(
        name = "abstract", 
        type = "string", 
        indexing = ["index", "summary"], 
        index = "enable-bm25"
    )
)

  • cord_uid will store the cord19 document ids, while title and abstract are self-explanatory.

  • All the fields, in this case, are of type string.

  • Including "index" in the indexing list means that Vespa will create a searchable index for title and abstract. You can read more about which options are available for indexing in the Vespa documentation.

  • Setting index = "enable-bm25" makes Vespa precompute the quantities needed to compute the BM25 score quickly. We will use BM25 to rank the documents retrieved.

Search multiple fields when querying

A FieldSet groups fields together for searching. For example, the default fieldset defined below groups title and abstract together.

from vespa.package import FieldSet

app_package.schema.add_field_set(
    FieldSet(name = "default", fields = ["title", "abstract"])
)
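
To illustrate what grouping fields means for matching, here is a toy, pure-Python sketch. It is not how Vespa matches documents: the helper name matches_fieldset is hypothetical, tokenization is a plain whitespace split, and a document counts as a match when any query term occurs in any field of the fieldset.

```python
def matches_fieldset(query: str, document: dict, fieldset: list) -> bool:
    """True if any query term occurs in any of the fields listed in fieldset."""
    query_terms = set(query.lower().split())
    for field in fieldset:
        field_terms = set(document.get(field, "").lower().split())
        if query_terms & field_terms:
            return True
    return False


document = {
    "cord_uid": "2b73a28n",
    "title": "Role of endothelin-1 in lung disease",
    "abstract": "Endothelin-1 (ET-1) is a 21 amino acid peptide",
}

# The query term is found through the title, which belongs to the fieldset.
in_default = matches_fieldset("endothelin-1 biology", document, ["title", "abstract"])
# The same query finds nothing when only cord_uid is searched.
in_uid_only = matches_fieldset("endothelin-1 biology", document, ["cord_uid"])
print(in_default, in_uid_only)
```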

Define how to rank the documents matched

We can specify how to rank the matched documents by defining a RankProfile. In this case, we define the bm25 rank profile, which combines the BM25 scores computed over the title and abstract fields.

from vespa.package import RankProfile

app_package.schema.add_rank_profile(
    RankProfile(
        name = "bm25", 
        first_phase = "bm25(title) + bm25(abstract)"
    )
)
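
To make the first-phase expression more concrete, here is a minimal pure-Python sketch of a BM25 score computed per field and then summed. It uses the textbook BM25 formula with the commonly used parameters k1 = 1.2 and b = 0.75 and a whitespace tokenizer. These are assumptions for illustration, not Vespa's actual implementation, and the function names and the two-document toy corpus are hypothetical.

```python
import math


def tokenize(text: str) -> list:
    return text.lower().split()


def bm25_field(query: str, doc_text: str, corpus_texts: list, k1: float = 1.2, b: float = 0.75) -> float:
    """Textbook BM25 of one field of one document, given that field for the whole corpus."""
    docs = [tokenize(text) for text in corpus_texts]
    n_docs = len(docs)
    avg_len = (sum(len(d) for d in docs) / n_docs) or 1.0
    doc = tokenize(doc_text)
    score = 0.0
    for term in set(tokenize(query)):
        tf = doc.count(term)  # term frequency in this field
        if tf == 0:
            continue
        df = sum(1 for d in docs if term in d)  # documents containing the term
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score += idf * tf * (k1 + 1) / (tf + length_norm)
    return score


corpus = [
    {"title": "Role of endothelin-1 in lung disease",
     "abstract": "Endothelin-1 is a peptide implicated in lung disease"},
    {"title": "Surfactant protein-D and pulmonary host defense",
     "abstract": "Surfactant protein-D participates in host defense"},
]


def first_phase(query: str, doc: dict) -> float:
    # Mirrors "bm25(title) + bm25(abstract)": one score per field, then a sum.
    return sum(
        bm25_field(query, doc[field], [d[field] for d in corpus])
        for field in ("title", "abstract")
    )


scores = [first_phase("role of endothelin-1", doc) for doc in corpus]
print(scores)
```

The first document shares terms with the query in both fields and so scores higher, while the second shares none and scores zero.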

Deploy your application

We have now defined a basic text search app containing relevant fields, a fieldset to group fields together, and a rank profile to rank matched documents. It is time to deploy our application. We can locally deploy our app_package using Docker without leaving the notebook, by creating an instance of VespaDocker, as shown below:

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()

app = vespa_docker.deploy(application_package = app_package)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Waiting for application status, 15/300 seconds...
Waiting for application status, 20/300 seconds...
Waiting for application status, 25/300 seconds...
Finished deployment.

app now holds a Vespa instance, which we are going to use to interact with our application. Congratulations, you now have a Vespa application up and running.

It is important to know that pyvespa simply provides a convenient API to define Vespa application packages from Python. vespa_docker.deploy exports Vespa configuration files to disk (the code above does not set a disk_folder explicitly). Going through those files is an excellent way to start learning about Vespa syntax.

Feed some data

Our first action after deploying a Vespa application is usually to feed some data to it. To make it easier to follow, we have prepared a DataFrame containing 100 rows and the cord_uid, title, and abstract columns required by our schema definition.

from pandas import read_csv

parsed_feed = read_csv(
    "https://thigm85.github.io/data/cord19/parsed_feed_100.csv"
)
parsed_feed
     cord_uid  title                                               abstract
0    ug7v899j  Clinical features of culture-proven Mycoplasma...  OBJECTIVE: This retrospective chart review des...
1    02tnwd4m  Nitric oxide: a pro-inflammatory mediator in l...  Inflammatory diseases of the respiratory tract...
2    ejv2xln0  Surfactant protein-D and pulmonary host defense     Surfactant protein-D (SP-D) participates in th...
3    2b73a28n  Role of endothelin-1 in lung disease                Endothelin-1 (ET-1) is a 21 amino acid peptide...
4    9785vg6d  Gene expression in epithelial cells in respons...  Respiratory syncytial virus (RSV) and pneumoni...
...  ...       ...                                                 ...
95   63bos83o  Global Surveillance of Emerging Influenza Viru...  BACKGROUND: Effective influenza surveillance r...
96   hqc7u9w3  Transmission Parameters of the 2001 Foot and M...  Despite intensive ongoing research, key aspect...
97   87zt7lew  Efficient replication of pneumonia virus of mi...  Pneumonia virus of mice (PVM; family Paramyxov...
98   wgxt36jv  Designing and conducting tabletop exercises to...   BACKGROUND: Since 2001, state and local health...
99   qbldmef1  Transcript-level annotation of Affymetrix prob...   BACKGROUND: The wide use of Affymetrix microar...

100 rows × 3 columns

We can then iterate through the DataFrame above and feed each row by using the app.feed_data_point method:

  • The schema name is by default set to be equal to the application name, which is cord19 in this case.

  • When feeding data to Vespa, we must have a unique id for each data point. We will use cord_uid here.

for idx, row in parsed_feed.iterrows():
    fields = {
        "cord_uid": str(row["cord_uid"]),
        "title": str(row["title"]),
        "abstract": str(row["abstract"])
    }
    response = app.feed_data_point(
        schema = "cord19",
        data_id = str(row["cord_uid"]),
        fields = fields,
    )
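
One caveat with the loop above: if the CSV had empty cells, pandas would read them as NaN, and str() would turn them into the literal text "nan". The sample data may not contain any such cells, so treat this as an optional safeguard. Below is a minimal sketch with the hypothetical helpers clean_text and build_fields, which map missing values to an empty string.

```python
import math


def clean_text(value) -> str:
    """Convert a cell to text, mapping None and NaN to an empty string."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)


def build_fields(row, columns=("cord_uid", "title", "abstract")) -> dict:
    """Build the fields dictionary for one row (a dict or a pandas Series)."""
    return {column: clean_text(row[column]) for column in columns}


# Example row with a missing abstract, standing in for a DataFrame row.
example_row = {
    "cord_uid": "2b73a28n",
    "title": "Role of endothelin-1 in lung disease",
    "abstract": float("nan"),
}
fields = build_fields(example_row)
print(fields)
```

Inside the feeding loop, fields = build_fields(row) could replace the hand-written dictionary.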

You can also inspect the response to each request if desired.

response.json
{'pathId': '/document/v1/cord19/cord19/docid/qbldmef1',
 'id': 'id:cord19:cord19::qbldmef1'}
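
The response shows how Vespa addresses the document. Vespa document ids follow the pattern id:<namespace>:<document-type>::<user-specified id>, and in this response both the namespace and the document type are cord19. The sketch below, with hypothetical helper names, reproduces the two strings from the response above. It does no URL encoding, so it assumes ids without special characters.

```python
def document_id(namespace: str, schema: str, data_id: str) -> str:
    """Compose a Vespa document id string."""
    return f"id:{namespace}:{schema}::{data_id}"


def document_path(namespace: str, schema: str, data_id: str) -> str:
    """Compose the /document/v1 path for the same document (no URL encoding)."""
    return f"/document/v1/{namespace}/{schema}/docid/{data_id}"


doc_id = document_id("cord19", "cord19", "qbldmef1")
path = document_path("cord19", "cord19", "qbldmef1")
print(doc_id)
print(path)
```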

Query your application

With data fed, we can start to query our text search app. We can use the Vespa Query Language directly by passing the required parameters to the body argument of the app.query method.

query = {
    'yql': 'select * from sources * where userQuery()',
    'query': 'What is the role of endothelin-1',
    'ranking': 'bm25',
    'type': 'any',
    'presentation.timing': True,
    'hits': 3
}
res = app.query(body=query)
res.hits[0]
{'id': 'id:cord19:cord19::2b73a28n',
 'relevance': 20.79338929607865,
 'source': 'cord19_content',
 'fields': {'sddocname': 'cord19',
  'documentid': 'id:cord19:cord19::2b73a28n',
  'cord_uid': '2b73a28n',
  'title': 'Role of endothelin-1 in lung disease',
  'abstract': 'Endothelin-1 (ET-1) is a 21 amino acid peptide with diverse biological activity that has been implicated in numerous diseases. ET-1 is a potent mitogen regulator of smooth muscle tone, and inflammatory mediator that may play a key role in diseases of the airways, pulmonary circulation, and inflammatory lung diseases, both acute and chronic. This review will focus on the biology of ET-1 and its role in lung disease.'}}
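
Each hit is a plain dictionary, so pulling out the pieces of interest is straightforward. The sketch below uses the hit shown above as sample data, with the abstract omitted for brevity. The helper name summarize_hits is hypothetical.

```python
def summarize_hits(hits: list) -> list:
    """Return (cord_uid, relevance, title) tuples, highest relevance first."""
    rows = [
        (hit["fields"]["cord_uid"], hit["relevance"], hit["fields"]["title"])
        for hit in hits
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Sample data copied from the first hit of the query above.
sample_hits = [
    {
        "id": "id:cord19:cord19::2b73a28n",
        "relevance": 20.79338929607865,
        "source": "cord19_content",
        "fields": {
            "sddocname": "cord19",
            "documentid": "id:cord19:cord19::2b73a28n",
            "cord_uid": "2b73a28n",
            "title": "Role of endothelin-1 in lung disease",
        },
    }
]

summary = summarize_hits(sample_hits)
for cord_uid, relevance, title in summary:
    print(f"{relevance:.2f}  {cord_uid}  {title}")
```

With a live deployment, summarize_hits(res.hits) would apply the same logic to the query response.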
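
We can also express the same query with the QueryModel abstraction from the learntorank library, which lets us specify how to match and rank our documents. In this case, we want to:

  • match our documents using the OR operator, which matches all the documents that share at least one term with the query.
  • rank the matched documents using the bm25 rank profile defined in our application package.
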
from learntorank.query import QueryModel, OR, Ranking, send_query

res = send_query(
    app=app,
    query="What is the role of endothelin-1", 
    query_model = QueryModel(
        match_phase=OR(), 
        ranking=Ranking(name="bm25")
    )
)
res.hits[0]
{
    'id': 'id:cord19:cord19::2b73a28n',
    'relevance': 20.79338929607865,
    'source': 'cord19_content',
    'fields': {
        'sddocname': 'cord19',
        'documentid': 'id:cord19:cord19::2b73a28n',
        'cord_uid': '2b73a28n',
        'title': 'Role of endothelin-1 in lung disease',
        'abstract': 'Endothelin-1 (ET-1) is a 21 amino acid peptide with diverse biological activity that has been implicated in numerous diseases. ET-1 is a potent mitogen regulator of smooth muscle tone, and inflammatory mediator that may play a key role in diseases of the airways, pulmonary circulation, and inflammatory lung diseases, both acute and chronic. This review will focus on the biology of ET-1 and its role in lung disease.'
    }
}

Using the Vespa Query Language as in our first example gives you the full power and flexibility that Vespa can offer. In contrast, the QueryModel abstraction focuses on specific use cases and can be more useful for ML experiments, but that is a topic for a future post.

Jump to Build a basic text search application from python with Vespa: Part 2 or clean up:

vespa_docker.container.stop()
vespa_docker.container.remove()
