Build a basic text search application from python with Vespa
Introducing the simplified pyvespa API: build a Vespa application from Python with a few lines of code.
UPDATE 2023-12-05: Code examples and links are updated to work with the latest releases of pyvespa. The learntorank library is now deprecated.
This post introduces the simplified pyvespa API, which allows us to build a basic text search application from scratch with just a few lines of Python code. Follow-up posts will add layers of complexity by incrementally building on top of the basic app described here.
Photo by Sarah Dorweiler on Unsplash
pyvespa exposes a subset of the Vespa API in Python. The library's primary goal is to allow for faster prototyping and to facilitate Machine Learning experiments for Vespa applications. I have written about how we can use it to connect to and interact with running Vespa applications and to evaluate Vespa ranking functions from Python. This time, we focus on building and deploying applications from scratch.
Install
The pyvespa simplified API introduced here was released in version 0.2.0:
pip3 install "pyvespa>=0.2.0" learntorank
Define the application
As an example, we will build an application to search through CORD19 sample data.
Create an application package
The first step is to create a Vespa ApplicationPackage:
from vespa.package import ApplicationPackage
app_package = ApplicationPackage(name="cord19")
Add fields to the Schema
We can then add fields to the application's Schema created by default in app_package.
from vespa.package import Field
app_package.schema.add_fields(
    Field(
        name = "cord_uid",
        type = "string",
        indexing = ["attribute", "summary"]
    ),
    Field(
        name = "title",
        type = "string",
        indexing = ["index", "summary"],
        index = "enable-bm25"
    ),
    Field(
        name = "abstract",
        type = "string",
        indexing = ["index", "summary"],
        index = "enable-bm25"
    )
)
- cord_uid will store the cord19 document ids, while title and abstract are self-explanatory.
- All the fields, in this case, are of type string.
- Including "index" in the indexing list means that Vespa will create a searchable index for title and abstract. You can read more about which options are available for indexing in the Vespa documentation.
- Setting index = "enable-bm25" makes Vespa pre-compute the quantities needed to compute the BM25 score quickly. We will use BM25 to rank the documents retrieved.
Search multiple fields when querying
A FieldSet groups fields together for searching. For example, the default fieldset defined below groups title and abstract together.
from vespa.package import FieldSet
app_package.schema.add_field_set(
    FieldSet(name = "default", fields = ["title", "abstract"])
)
Define how to rank the documents matched
We can specify how to rank the matched documents by defining a RankProfile. In this case, we define the bm25 rank profile, which combines the BM25 scores computed over the title and abstract fields.
from vespa.package import RankProfile
app_package.schema.add_rank_profile(
    RankProfile(
        name = "bm25",
        first_phase = "bm25(title) + bm25(abstract)"
    )
)
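To make the rank expression more concrete, here is a minimal pure-Python sketch of what bm25(title) + bm25(abstract) computes conceptually. It uses the textbook Okapi BM25 formula with assumed parameters k1 = 1.2 and b = 0.75 and naive whitespace tokenization; the helper names and the toy abstracts are made up for illustration, and this is not Vespa's implementation. The document frequencies and average field length are the kind of quantities that can be pre-computed at indexing time.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def bm25_field(query, docs, field, k1=1.2, b=0.75):
    """Return one BM25 score per document for a single field."""
    tokenized = [tokenize(doc[field]) for doc in docs]
    n_docs = len(tokenized)
    # Corpus statistics that an engine can pre-compute when indexing.
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


def bm25_rank_profile(query, docs):
    # Mirrors the rank expression "bm25(title) + bm25(abstract)".
    title_scores = bm25_field(query, docs, "title")
    abstract_scores = bm25_field(query, docs, "abstract")
    return [t + a for t, a in zip(title_scores, abstract_scores)]


# Toy documents: titles taken from the sample data, abstracts made up.
docs = [
    {"title": "Role of endothelin-1 in lung disease",
     "abstract": "Illustrative placeholder about endothelin-1 biology"},
    {"title": "Surfactant protein-D and pulmonary host defense",
     "abstract": "Illustrative placeholder about host defense"},
]
scores = bm25_rank_profile("role of endothelin-1", docs)
```

The first document shares several terms with the query and gets a positive score, while the second shares none and scores zero.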
Deploy your application
We have now defined a basic text search app containing relevant fields, a fieldset to group fields together, and a rank profile to rank matched documents. It is time to deploy our application. We can deploy our app_package locally using Docker, without leaving the notebook, by creating an instance of VespaDocker, as shown below:
from vespa.deployment import VespaDocker
vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package = app_package)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Waiting for application status, 15/300 seconds...
Waiting for application status, 20/300 seconds...
Waiting for application status, 25/300 seconds...
Finished deployment.
app now holds a Vespa instance, which we are going to use to interact with our application. Congratulations, you now have a Vespa application up and running.
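The log above shows the deployment polling until the configuration server and the application respond. A minimal sketch of that kind of wait loop follows. Both wait_until_ready and http_ok are hypothetical helpers, the status URL in the usage comment is an assumption about a local deployment on port 8080, and pyvespa's own implementation may well differ.

```python
import time
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection errors, timeouts and HTTP error statuses.
        return False


def wait_until_ready(is_ready, max_wait=300, interval=5, sleep=time.sleep):
    """Poll is_ready() every `interval` seconds, up to `max_wait` seconds.

    Returns the number of seconds waited, or raises TimeoutError.
    """
    waited = 0
    while waited <= max_wait:
        if is_ready():
            return waited
        print(f"Waiting, {waited}/{max_wait} seconds...")
        sleep(interval)
        waited += interval
    raise TimeoutError(f"Not ready after {max_wait} seconds")


# Possible usage against a local deployment (assumed URL):
# wait_until_ready(lambda: http_ok("http://localhost:8080/ApplicationStatus"))
```

Passing the readiness check and the sleep function as arguments keeps the loop easy to exercise without a running container.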
It is important to know that pyvespa simply provides a convenient API to define Vespa application packages from Python. vespa_docker.deploy exports the Vespa configuration files to the deployment's disk_folder, which is not set explicitly in the code above. Going through those files is an excellent way to start learning about Vespa syntax.
Feed some data
Our first action after deploying a Vespa application is usually to feed some data to it. To make it easier to follow, we have prepared a DataFrame containing 100 rows and the cord_uid, title, and abstract columns required by our schema definition.
from pandas import read_csv
parsed_feed = read_csv(
    "https://thigm85.github.io/data/cord19/parsed_feed_100.csv"
)
parsed_feed
 | cord_uid | title | abstract |
---|---|---|---|
0 | ug7v899j | Clinical features of culture-proven Mycoplasma... | OBJECTIVE: This retrospective chart review des... |
1 | 02tnwd4m | Nitric oxide: a pro-inflammatory mediator in l... | Inflammatory diseases of the respiratory tract... |
2 | ejv2xln0 | Surfactant protein-D and pulmonary host defense | Surfactant protein-D (SP-D) participates in th... |
3 | 2b73a28n | Role of endothelin-1 in lung disease | Endothelin-1 (ET-1) is a 21 amino acid peptide... |
4 | 9785vg6d | Gene expression in epithelial cells in respons... | Respiratory syncytial virus (RSV) and pneumoni... |
... | ... | ... | ... |
95 | 63bos83o | Global Surveillance of Emerging Influenza Viru... | BACKGROUND: Effective influenza surveillance r... |
96 | hqc7u9w3 | Transmission Parameters of the 2001 Foot and M... | Despite intensive ongoing research, key aspect... |
97 | 87zt7lew | Efficient replication of pneumonia virus of mi... | Pneumonia virus of mice (PVM; family Paramyxov... |
98 | wgxt36jv | Designing and conducting tabletop exercises to... | BACKGROUND: Since 2001, state and local health... |
99 | qbldmef1 | Transcript-level annotation of Affymetrix prob... | BACKGROUND: The wide use of Affymetrix microar... |
100 rows × 3 columns
We can then iterate through the DataFrame above and feed each row by using the app.feed_data_point method:
- The schema name is by default set to be equal to the application name, which is cord19 in this case.
- When feeding data to Vespa, we must have a unique id for each data point. We will use cord_uid here.
for idx, row in parsed_feed.iterrows():
    # Build the document fields from the DataFrame row.
    fields = {
        "cord_uid": str(row["cord_uid"]),
        "title": str(row["title"]),
        "abstract": str(row["abstract"])
    }
    # Feed one document, using cord_uid as its unique id.
    response = app.feed_data_point(
        schema = "cord19",
        data_id = str(row["cord_uid"]),
        fields = fields,
    )
You can also inspect the response to each request if desired.
response.json
{'pathId': '/document/v1/cord19/cord19/docid/qbldmef1',
'id': 'id:cord19:cord19::qbldmef1'}
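The response above shows how the schema name and data_id end up in the Vespa document id and in the document API path. The sketch below rebuilds both strings from their parts. The helper names are hypothetical, and the namespace is assumed to equal the schema name, as the response suggests for this application.

```python
from urllib.parse import quote


def document_id(namespace, schema, data_id):
    # Vespa document id layout: id:<namespace>:<document type>::<user id>
    return f"id:{namespace}:{schema}::{data_id}"


def document_path(namespace, schema, data_id):
    # Document API path; the user-specified id is URL-encoded to be safe.
    return f"/document/v1/{namespace}/{schema}/docid/{quote(data_id, safe='')}"


doc_id = document_id("cord19", "cord19", "qbldmef1")
doc_path = document_path("cord19", "cord19", "qbldmef1")
print(doc_id)
print(doc_path)
```

Knowing this layout helps when reading responses or when looking up a single document by its cord_uid.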
Query your application
With data fed, we can start to query our text search app. We can use the Vespa Query Language directly by passing the required parameters to the body argument of the app.query method.
query = {
    'yql': 'select * from sources * where userQuery()',
    'query': 'What is the role of endothelin-1',
    'ranking': 'bm25',
    'type': 'any',
    'presentation.timing': True,
    'hits': 3
}
res = app.query(body=query)
res.hits[0]
{'id': 'id:cord19:cord19::2b73a28n',
'relevance': 20.79338929607865,
'source': 'cord19_content',
'fields': {'sddocname': 'cord19',
'documentid': 'id:cord19:cord19::2b73a28n',
'cord_uid': '2b73a28n',
'title': 'Role of endothelin-1 in lung disease',
'abstract': 'Endothelin-1 (ET-1) is a 21 amino acid peptide with diverse biological activity that has been implicated in numerous diseases. ET-1 is a potent mitogen regulator of smooth muscle tone, and inflammatory mediator that may play a key role in diseases of the airways, pulmonary circulation, and inflammatory lung diseases, both acute and chronic. This review will focus on the biology of ET-1 and its role in lung disease.'}}
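The body passed to app.query is a set of Vespa query API parameters. As a rough illustration of what such a request could look like over plain HTTP, the sketch below encodes the same parameters into a GET URL without sending it. The endpoint http://localhost:8080/search/ is an assumption (a local Docker deployment on the default port), build_search_url is a hypothetical helper, and this is not necessarily how pyvespa sends the request internally.

```python
from urllib.parse import urlencode


def build_search_url(params, endpoint="http://localhost:8080/search/"):
    """Encode query parameters into a search URL (nothing is sent)."""
    normalized = {
        # Render Python booleans as lowercase true/false.
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    }
    return f"{endpoint}?{urlencode(normalized)}"


query = {
    "yql": "select * from sources * where userQuery()",
    "query": "What is the role of endothelin-1",
    "ranking": "bm25",
    "type": "any",
    "presentation.timing": True,
    "hits": 3,
}
search_url = build_search_url(query)
print(search_url)
```

Pasting such a URL into a browser or curl can be a quick way to debug a query outside the notebook.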
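We can also express the same query with the QueryModel abstraction from learntorank, which lets us specify how to match and rank documents. In this case, we want to:
- match our documents using the OR operator, which matches all the documents that share at least one term with the query.
- rank the matched documents using the bm25 rank profile defined in our application package.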
from learntorank.query import QueryModel, OR, Ranking, send_query
res = send_query(
    app=app,
    query="What is the role of endothelin-1",
    query_model = QueryModel(
        match_phase=OR(),
        ranking=Ranking(name="bm25")
    )
)
res.hits[0]
{
'id': 'id:cord19:cord19::2b73a28n',
'relevance': 20.79338929607865,
'source': 'cord19_content',
'fields': {
'sddocname': 'cord19',
'documentid': 'id:cord19:cord19::2b73a28n',
'cord_uid': '2b73a28n',
'title': 'Role of endothelin-1 in lung disease',
'abstract': 'Endothelin-1 (ET-1) is a 21 amino acid peptide with diverse biological activity that has been implicated in numerous diseases. ET-1 is a potent mitogen regulator of smooth muscle tone, and inflammatory mediator that may play a key role in diseases of the airways, pulmonary circulation, and inflammatory lung diseases, both acute and chronic. This review will focus on the biology of ET-1 and its role in lung disease.'
}
}
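The OR match phase used above retrieves every document that shares at least one term with the query, across the fields in the default fieldset. A minimal sketch of that idea in plain Python is shown here, assuming naive whitespace tokenization with no linguistic processing. The or_match helper, the toy-3 document, and the abstracts are made up for illustration, and Vespa's actual matching is more sophisticated.

```python
def or_match(query, docs, fields=("title", "abstract")):
    """Return the ids of documents sharing at least one term with the query."""
    query_terms = set(query.lower().split())
    matched = []
    for doc in docs:
        doc_terms = set()
        for field in fields:
            doc_terms.update(doc[field].lower().split())
        # OR semantics: a single overlapping term is enough to match.
        if query_terms & doc_terms:
            matched.append(doc["cord_uid"])
    return matched


# Toy documents: two titles from the sample data, abstracts made up.
docs = [
    {"cord_uid": "2b73a28n",
     "title": "Role of endothelin-1 in lung disease",
     "abstract": "Illustrative placeholder about lung disease"},
    {"cord_uid": "ejv2xln0",
     "title": "Surfactant protein-D and pulmonary host defense",
     "abstract": "Illustrative placeholder about host defense"},
    {"cord_uid": "toy-3",
     "title": "Overview of lung physiology",
     "abstract": "Illustrative placeholder"},
]
matched_ids = or_match("What is the role of endothelin-1", docs)
print(matched_ids)
```

Note that toy-3 matches only because it contains the common word "of", which is why OR retrieval typically relies on a ranking function such as BM25 to order the matches.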
Using the Vespa Query Language as in our first example gives you the full power and flexibility that Vespa can offer. In contrast, the QueryModel abstraction focuses on specific use cases and can be more useful for ML experiments, but this is a future post topic.
Jump to Build a basic text search application from python with Vespa: Part 2 or clean up:
vespa_docker.container.stop()
vespa_docker.container.remove()