Thiago Martins
Vespa Data Scientist

Build a news recommendation app from Python with Vespa: Part 1

Part 1 - News search functionality.

We will build a news recommendation app in Vespa without leaving a Python environment. In this first part of the series, we develop an application with basic search functionality. Future posts will add recommendation capabilities based on embeddings and other ML models.

Photo by Filip Mishevski on Unsplash

This series is a simplified version of Vespa’s News search and recommendation tutorial. We will also use the demo version of the Microsoft News Dataset (MIND) so that anyone can follow along on their laptops.

Dataset

The original Vespa news search tutorial provides a script to download, parse and convert the MIND dataset to Vespa format. To make things easier for you, we made the final parsed data required for this tutorial available for download:

import requests, json

data = json.loads(
    requests.get("https://thigm85.github.io/data/mind/mind_demo_fields_parsed.json").text
)
data[0]
{'abstract': "Shop the notebooks, jackets, and more that the royals can't live without.",
 'title': 'The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By',
 'subcategory': 'lifestyleroyals',
 'news_id': 'N3112',
 'category': 'lifestyle',
 'url': 'https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata',
 'date': 20191103,
 'clicks': 0,
 'impressions': 0}

The parsed data used here is a list in which each element is a dictionary with the relevant fields of a news article, such as title and category. It also records the number of impressions and clicks each article has received. The demo version of the MIND dataset contains 28,603 news articles.

len(data)
28603
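
Before feeding, it can help to sanity-check the records. The snippet below is a minimal sketch, not part of the original tutorial: `sample`, `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and the second record is made up to show what an incomplete entry would look like.

```python
from collections import Counter

# Field names taken from the record printed above.
REQUIRED_FIELDS = {
    "news_id", "category", "subcategory", "title",
    "abstract", "url", "date", "clicks", "impressions",
}

# Hypothetical sample shaped like the parsed MIND records.
# The URLs are placeholders and the second record is invented.
sample = [
    {
        "news_id": "N3112",
        "category": "lifestyle",
        "subcategory": "lifestyleroyals",
        "title": "The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By",
        "abstract": "Shop the notebooks, jackets, and more that the royals can't live without.",
        "url": "https://example.com/placeholder",
        "date": 20191103,
        "clicks": 0,
        "impressions": 0,
    },
    {
        "news_id": "N0000",  # invented id, deliberately missing "abstract"
        "category": "sports",
        "subcategory": "football_nfl",
        "title": "A made-up headline",
        "url": "https://example.com/placeholder",
        "date": 20191104,
        "clicks": 1,
        "impressions": 3,
    },
]


def missing_fields(record):
    """Return the set of required fields that a record lacks."""
    return REQUIRED_FIELDS - set(record)


# Ids of records that would fail to populate every schema field.
incomplete = [r["news_id"] for r in sample if missing_fields(r)]

# Number of articles per category.
category_counts = Counter(r["category"] for r in sample)
```

On the real dataset, the same two expressions can be run over `data` instead of `sample`.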

Install pyvespa

!pip install pyvespa

Create the search app

Create the application package. app_package will hold all the relevant data related to your application’s specification.

from vespa.package import ApplicationPackage

app_package = ApplicationPackage(name="news")

Add fields to the schema. Here is a short description of the non-obvious arguments used below:

  • indexing argument: configures the indexing pipeline for a field, which defines how Vespa treats input during indexing.

  • index argument: configures how Vespa creates the search index.

    • “enable-bm25”: sets up an index compatible with BM25 ranking for text search.

  • attribute argument: configures how Vespa treats an attribute field.

    • “fast-search”: builds an index for an attribute field. By default, no index is generated for attributes, and searching them falls back to a linear scan.
from vespa.package import Field

app_package.schema.add_fields(
    Field(name="news_id", type="string", indexing=["summary", "attribute"], attribute=["fast-search"]),
    Field(name="category", type="string", indexing=["summary", "attribute"]),
    Field(name="subcategory", type="string", indexing=["summary", "attribute"]),
    Field(name="title", type="string", indexing=["index", "summary"], index="enable-bm25"),
    Field(name="abstract", type="string", indexing=["index", "summary"], index="enable-bm25"),
    Field(name="url", type="string", indexing=["index", "summary"]),        
    Field(name="date", type="int", indexing=["summary", "attribute"]),            
    Field(name="clicks", type="int", indexing=["summary", "attribute"]),            
    Field(name="impressions", type="int", indexing=["summary", "attribute"]),                
)
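
To make the fast-search trade-off concrete, here is a conceptual sketch in plain Python. It is not how Vespa implements attributes internally; `linear_scan` and `build_index` are hypothetical helpers, and the documents are invented. It only contrasts checking every value with looking a value up in a prebuilt index.

```python
from collections import defaultdict

# Invented documents, used only for illustration.
docs = [
    {"news_id": "N1", "category": "sports"},
    {"news_id": "N2", "category": "news"},
    {"news_id": "N3", "category": "sports"},
]


def linear_scan(docs, field, value):
    """Without an index: inspect every document's value for the field."""
    return [d["news_id"] for d in docs if d[field] == value]


def build_index(docs, field):
    """With an index: map each distinct value to the ids that hold it."""
    index = defaultdict(list)
    for d in docs:
        index[d[field]].append(d["news_id"])
    return index


category_index = build_index(docs, "category")

# Both approaches return the same ids; the index answers with a single
# dictionary lookup, at the cost of extra memory and build time.
scan_result = linear_scan(docs, "category", "sports")
index_result = category_index["sports"]
```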

Add a fieldset to the schema. A fieldset lets us search over multiple fields at once. In this case, searching over the default fieldset is equivalent to searching over title and abstract.

from vespa.package import FieldSet

app_package.schema.add_field_set(
    FieldSet(name="default", fields=["title", "abstract"])
)
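
The equivalence can be pictured with a small plain-Python sketch. It is a simplification under stated assumptions: `tokens` and `matches_fieldset` are hypothetical helpers, the documents are invented, and the tokenization is a naive lowercase split that ignores Vespa's linguistic processing.

```python
import re

FIELDSET = ["title", "abstract"]


def tokens(text):
    """Naive tokenizer: lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches_fieldset(doc, term, fields=FIELDSET):
    """True if the term occurs in any field that belongs to the fieldset."""
    return any(term.lower() in tokens(doc.get(f, "")) for f in fields)


# Invented documents for illustration.
docs = [
    {"title": "Music is hot this week", "abstract": "Concert listings."},
    {"title": "Weekend guide", "abstract": "Live music across town."},
    {"title": "Weather forecast", "abstract": "Rain expected."},
]

matched_titles = [d["title"] for d in docs if matches_fieldset(d, "music")]
```

The first document matches through its title and the second through its abstract, which is the behaviour a query against the default fieldset is meant to give.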

We have enough to deploy the first version of our application. Later in this tutorial, we will include an article’s popularity in the relevance score used to rank the news that matches our queries.

Deploy the app on Docker

If you have Docker installed on your machine, you can deploy the app_package in a local Docker container:

from vespa.package import VespaDocker

vespa_docker = VespaDocker(
    port=8080,
    container_memory="8G", 
    disk_folder="/Users/tmartins/news" # change for your desired absolute folder    
)
app = vespa_docker.deploy(
    application_package=app_package, 
)
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for application status.
Waiting for application status.
Finished deployment.

vespa_docker will parse the app_package and write all the necessary Vespa config files to the disk_folder. It will then create the Docker container and use the Vespa config files to deploy the Vespa application. We can then use the app instance to interact with the deployed application, for example to feed and query it. If you want to know more about what happens behind the scenes, we suggest you go through Vespa's getting started with Docker tutorial.

Feed data to the app

We can feed one document at a time with the feed_data_point method. We need to specify:

  • data_id: unique id to identify the data point

  • fields: dictionary with keys matching the field names defined in our application package schema.

  • schema: name of the schema we want to feed data to. When we created an application package, we created a schema by default with the same name as the application name, news in our case.

for article in data:
    res = app.feed_data_point(
        data_id=article["news_id"], 
        fields=article, 
        schema="news"
    )
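
The loop above discards each response. A minimal sketch of a variant that tallies outcomes is shown below. `feed_all` is a hypothetical helper, and it assumes that the object returned by feed_data_point exposes a `status_code` attribute; check this against the pyvespa version you have installed.

```python
from collections import Counter


def feed_all(app, articles, schema="news"):
    """Feed every article and count the status codes of the responses.

    Assumes each response has a `status_code` attribute (an assumption
    about the pyvespa return type, not something the tutorial states).
    """
    statuses = Counter()
    for article in articles:
        response = app.feed_data_point(
            schema=schema,
            data_id=article["news_id"],
            fields=article,
        )
        statuses[response.status_code] += 1
    return statuses
```

Calling `feed_all(app, data)` would then return a counter in which anything other than a success code points to documents worth inspecting.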

Query the app

We can use the Vespa Query API through app.query to unlock the full query flexibility Vespa can offer.

Search over indexed fields using keywords

Select all the fields from documents where default (title or abstract) contains the keyword ‘music’.

res = app.query(body={"yql" : "select * from sources * where default contains 'music';"})
res.hits[0]
{'id': 'index:news_content/0/5f1b30d14d4a15050dae9f7f',
 'relevance': 0.25641557752127125,
 'source': 'news_content'}

Select title and abstract where title contains ‘music’ and default contains ‘festival’.

res = app.query(body = {"yql" : "select title, abstract from sources * where title contains 'music' AND default contains 'festival';"})
res.hits[0]
{'id': 'index:news_content/0/988f76793a855e48b16dc5d3',
 'relevance': 0.19587240022210403,
 'source': 'news_content',
 'fields': {'title': "At Least 3 Injured In Stampede At Travis Scott's Astroworld Music Festival",
  'abstract': "A stampede Saturday outside rapper Travis Scott's Astroworld musical festival in Houston, left three people injured. Minutes before the gates were scheduled to open at noon, fans began climbing over metal barricades and surged toward the entrance, according to local news reports."}}
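
As the two results above show, a hit may or may not carry a 'fields' entry. The sketch below pulls one field out of a list of hits while tolerating hits without it. `field_values` is a hypothetical helper, and the sample hits are abbreviated copies of the outputs above.

```python
def field_values(hits, field):
    """Collect one field from each hit, skipping hits that lack it."""
    return [
        hit["fields"][field]
        for hit in hits
        if field in hit.get("fields", {})
    ]


# Abbreviated copies of the hits printed above.
sample_hits = [
    {
        "id": "index:news_content/0/5f1b30d14d4a15050dae9f7f",
        "relevance": 0.25641557752127125,
        "source": "news_content",
    },
    {
        "id": "index:news_content/0/988f76793a855e48b16dc5d3",
        "relevance": 0.19587240022210403,
        "source": "news_content",
        "fields": {
            "title": "At Least 3 Injured In Stampede At Travis Scott's Astroworld Music Festival",
        },
    },
]

titles = field_values(sample_hits, "title")
```

With a live application, `field_values(res.hits, "title")` would give the titles of the returned hits.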

Search by document type

Select the title of all the documents with document type equal to news. Our application has only one document type, so the query below retrieves all our documents.

res = app.query(body = {"yql" : "select title from sources * where sddocname contains 'news';"})
res.hits[0]
{'id': 'index:news_content/0/698f73a87a936f1c773f2161',
 'relevance': 0.0,
 'source': 'news_content',
 'fields': {'title': 'The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By'}}

Search over attribute fields such as date

Since date was not defined with attribute=["fast-search"], no index is built for it. Searching over it is therefore equivalent to a linear scan over the values of the field.

res = app.query(body={"yql" : "select title, date from sources * where date contains '20191110';"})
res.hits[0]
{'id': 'index:news_content/0/debbdfe653c6d11f71cc2353',
 'relevance': 0.0017429193899782135,
 'source': 'news_content',
 'fields': {'title': 'These Cranberry Sauce Recipes Are Perfect for Thanksgiving Dinner',
  'date': 20191110}}

Since the default fieldset is formed by indexed fields, Vespa will first narrow the candidates to the documents that contain the keyword ‘weather’ within title or abstract, and only then scan the date field of those documents for ‘20191110’.

res = app.query(body={"yql" : "select title, abstract, date from sources * where default contains 'weather' AND date contains '20191110';"})
res.hits[0]
{'id': 'index:news_content/0/bb88325ae94d888c46538d0b',
 'relevance': 0.27025156546141466,
 'source': 'news_content',
 'fields': {'title': 'Weather forecast in St. Louis',
  'abstract': "What's the weather today? What's the weather for the week? Here's your forecast.",
  'date': 20191110}}

We can also perform range searches:

res = app.query(body={"yql" : "select date from sources * where date <= 20191110 AND date >= 20191108;"})
res.hits[0]
{'id': 'index:news_content/0/c41a873213fdcffbb74987c0',
 'relevance': 0.0017429193899782135,
 'source': 'news_content',
 'fields': {'date': 20191109}}
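
Range search works here because dates are stored as YYYYMMDD integers, which sort in chronological order. The sketch below builds such a query from Python dates. `to_yyyymmdd` and `date_range_yql` are hypothetical helpers, not part of pyvespa.

```python
from datetime import date


def to_yyyymmdd(d):
    """Encode a date as the YYYYMMDD integer used by the date field."""
    return d.year * 10000 + d.month * 100 + d.day


def date_range_yql(start, end):
    """Build the range query shown above for an inclusive date interval."""
    return (
        "select date from sources * "
        f"where date <= {to_yyyymmdd(end)} AND date >= {to_yyyymmdd(start)};"
    )


yql = date_range_yql(date(2019, 11, 8), date(2019, 11, 10))
```

The resulting string can be passed as the "yql" entry of the query body.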

Sorting

By default, Vespa sorts the hits by descending relevance score. The relevance score is given by the nativeRank unless something else is specified, as we will do later in this post.

res = app.query(body={"yql" : "select title, date from sources * where default contains 'music';"})
res.hits[:2]
[{'id': 'index:news_content/0/5f1b30d14d4a15050dae9f7f',
  'relevance': 0.25641557752127125,
  'source': 'news_content',
  'fields': {'title': 'Music is hot in Nashville this week',
   'date': 20191101}},
 {'id': 'index:news_content/0/6a031d5eff95264c54daf56d',
  'relevance': 0.23351089409559303,
  'source': 'news_content',
  'fields': {'title': 'Apple Music Replay highlights your favorite tunes of the year',
   'date': 20191105}}]

However, we can explicitly order by a given field with the order by clause.

res = app.query(body={"yql" : "select title, date from sources * where default contains 'music' order by date;"})
res.hits[:2]
[{'id': 'index:news_content/0/d0d7e1c080f0faf5989046d8',
  'relevance': 0.0,
  'source': 'news_content',
  'fields': {'title': "Elton John's second farewell tour stop in Cleveland shows why he's still standing after all these years",
   'date': 20191031}},
 {'id': 'index:news_content/0/abf7f6f46ff2a96862075155',
  'relevance': 0.0,
  'source': 'news_content',
  'fields': {'title': 'The best hair metal bands', 'date': 20191101}}]

order by sorts in ascending order by default. We can override that with the desc keyword:

res = app.query(body={"yql" : "select title, date from sources * where default contains 'music' order by date desc;"})
res.hits[:2]
[{'id': 'index:news_content/0/934a8d976ff8694772009362',
  'relevance': 0.0,
  'source': 'news_content',
  'fields': {'title': 'Korg Minilogue XD update adds key triggers for synth sequences',
   'date': 20191113}},
 {'id': 'index:news_content/0/4feca287fdfa1d027f61e7bf',
  'relevance': 0.0,
  'source': 'news_content',
  'fields': {'title': 'Tom Draper, Black Music Industry Pioneer, Dies at 79',
   'date': 20191113}}]
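
For comparison, the sketch below re-sorts a list of hits by date on the client side. It only reorders the hits already returned, whereas order by in the query sorts across all matching documents, so it is an illustration of the ordering rather than a replacement. The sample hits are abbreviated copies of the relevance-sorted output shown earlier.

```python
# Abbreviated copies of two hits from the relevance-sorted query above.
hits = [
    {
        "relevance": 0.25641557752127125,
        "fields": {"title": "Music is hot in Nashville this week", "date": 20191101},
    },
    {
        "relevance": 0.23351089409559303,
        "fields": {
            "title": "Apple Music Replay highlights your favorite tunes of the year",
            "date": 20191105,
        },
    },
]

# Ascending by date, mirroring "order by date".
by_date = sorted(hits, key=lambda h: h["fields"]["date"])

# Descending by date, mirroring "order by date desc".
by_date_desc = sorted(hits, key=lambda h: h["fields"]["date"], reverse=True)
```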

Grouping

We can use Vespa’s grouping feature to find the three news categories with the highest document counts:

  • news with 9115 articles

  • sports with 6765 articles

  • finance with 1886 articles

res = app.query(body={"yql" : "select * from sources * where sddocname contains 'news' limit 0 | all(group(category) max(3) order(-count()) each(output(count())));"})
res.hits[0]
{'id': 'group:root:0',
 'relevance': 1.0,
 'continuation': {'this': ''},
 'children': [{'id': 'grouplist:category',
   'relevance': 1.0,
   'label': 'category',
   'continuation': {'next': 'BGAAABEBGBC'},
   'children': [{'id': 'group:string:news',
     'relevance': 1.0,
     'value': 'news',
     'fields': {'count()': 9115}},
    {'id': 'group:string:sports',
     'relevance': 0.6666666666666666,
     'value': 'sports',
     'fields': {'count()': 6765}},
    {'id': 'group:string:finance',
     'relevance': 0.3333333333333333,
     'value': 'finance',
     'fields': {'count()': 1886}}]}]}
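
The nested grouping result can be flattened into a simple mapping. The sketch below assumes the structure printed above; `group_counts` is a hypothetical helper, and `root_hit` is an abbreviated copy of that output.

```python
def group_counts(root_hit):
    """Flatten a grouping result into {group value: count}."""
    counts = {}
    for group_list in root_hit.get("children", []):
        for group in group_list.get("children", []):
            counts[group["value"]] = group["fields"]["count()"]
    return counts


# Abbreviated copy of the grouping output above.
root_hit = {
    "id": "group:root:0",
    "children": [
        {
            "id": "grouplist:category",
            "label": "category",
            "children": [
                {"id": "group:string:news", "value": "news", "fields": {"count()": 9115}},
                {"id": "group:string:sports", "value": "sports", "fields": {"count()": 6765}},
                {"id": "group:string:finance", "value": "finance", "fields": {"count()": 1886}},
            ],
        }
    ],
}

top_categories = group_counts(root_hit)
```

With a live application, `group_counts(res.hits[0])` would return the same mapping for the query above.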

Use news popularity signal for ranking

Vespa uses nativeRank to compute relevance scores by default. We will create a new rank-profile that includes a popularity signal in our relevance score computation.

from vespa.package import RankProfile, Function

app_package.schema.add_rank_profile(
    RankProfile(
        name="popularity",
        inherits="default",
        functions=[
            Function(
                name="popularity", 
                expression="if (attribute(impressions) > 0, attribute(clicks) / attribute(impressions), 0)"
            )
        ], 
        first_phase="nativeRank(title, abstract) + 10 * popularity"
    )
)

Our new rank-profile will be called popularity. Here is a breakdown of what is included above:

  • inherits="default"

This configures Vespa to create a new rank profile named popularity, which inherits all the default rank-profile properties; only properties that are explicitly defined, or overridden, will differ from those of the default rank-profile.

  • function popularity

This sets up a function that can be called from other expressions. It computes the ratio of clicks to impressions as a measure of popularity, returning 0 when an article has no impressions. This isn’t really the best way of calculating popularity, as an article with few impressions can score high even though the uncertainty of the estimate is high. But it is a start :)

  • first-phase

Relevance calculations in Vespa are two-phased. The calculations done in the first phase are performed on every single document matching your query. In contrast, the second phase calculations are only done on the top n documents as determined by the calculations done in the first phase. We are just going to use the first-phase for now.

  • expression: nativeRank(title, abstract) + 10 * popularity

This expression is used to rank documents. The first term, the nativeRank of the title and abstract fields (the fields of the default fieldset), keeps the results relevant to the query text, while the second term calls the popularity function. The weighted sum of these two terms is the final relevance for each document. Note that the weight here, 10, was set by observation. A better approach would be to learn such values using machine learning, which we’ll get back to in future posts.
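
The rank profile can be mirrored in plain Python to see how the two terms interact. This is only a sketch of the arithmetic: Vespa evaluates the expression itself, and the nativeRank values and click counts passed in below are invented.

```python
def popularity(clicks, impressions):
    """Click-through ratio, or 0 when there are no impressions."""
    return clicks / impressions if impressions > 0 else 0.0


def first_phase(native_rank, clicks, impressions, weight=10.0):
    """Mirror of: nativeRank(title, abstract) + 10 * popularity."""
    return native_rank + weight * popularity(clicks, impressions)


# Invented inputs: a barely seen article versus a widely seen one.
few_impressions = first_phase(native_rank=0.2, clicks=1, impressions=2)
many_impressions = first_phase(native_rank=0.2, clicks=400, impressions=1000)
```

With the same text score, the article seen twice outranks the one seen a thousand times, which illustrates the weakness of the raw ratio noted above.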

Redeploy the application

Since we have changed the application package, we need to redeploy our application:

app = vespa_docker.deploy(
    application_package=app_package, 
)
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for application status.
Waiting for application status.
Finished deployment.
app.deployment_message
["Uploading application '/app/application' using http://localhost:19071/application/v2/tenant/default/session",
 "Session 3 for tenant 'default' created.",
 'Preparing session 3 using http://localhost:19071/application/v2/tenant/default/session/3/prepared',
 "WARNING: Host named 'news' may not receive any config since it is not a canonical hostname. Disregard this warning when testing in a Docker container.",
 "Session 3 for tenant 'default' prepared.",
 'Activating session 3 using http://localhost:19071/application/v2/tenant/default/session/3/active',
 "Session 3 for tenant 'default' activated.",
 'Checksum:   fa83365f9aacba5133026e09c3e42cea',
 'Timestamp:  1615287349323',
 'Generation: 3',
 '']

Query using the new popularity signal

When the redeployment is complete, we can rank the matched documents with the new rank profile by passing its name in the ranking argument.

res = app.query(body={
    "yql" : "select * from sources * where default contains 'music';", 
    "ranking" : "popularity"
})
res.hits[0]
{'id': 'id:news:news::N5870',
 'relevance': 5.156596018746151,
 'source': 'news_content',
 'fields': {'sddocname': 'news',
  'documentid': 'id:news:news::N5870',
  'news_id': 'N5870',
  'category': 'music',
  'subcategory': 'musicnews',
  'title': 'Country music group Alabama reschedules their Indy show until next October 2020',
  'abstract': 'INDIANAPOLIS, Ind.   Fans of the highly acclaimed country music group Alabama, scheduled to play Bankers Life Fieldhouse Saturday night, will have to wait until next year to see the group. The group famous for such notable songs like "If You\'re Gonna Play in Texas", "Love In The First Degree", and "She and I", made the announcement that their 50th Anniversary Tour is being rescheduled till ...',
  'url': 'https://www.msn.com/en-us/music/musicnews/country-music-group-alabama-reschedules-their-indy-show-until-next-october-2020/ar-BBWB0d7?ocid=chopendata',
  'date': 20191108,
  'clicks': 1,
  'impressions': 2}}
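
The relevance of this hit can be split into its two terms. The sketch below assumes the score was produced by the first-phase expression defined earlier with a weight of 10; the `hit` dictionary is an abbreviated copy of the output above.

```python
# Abbreviated copy of the top hit above.
hit = {
    "relevance": 5.156596018746151,
    "fields": {"news_id": "N5870", "clicks": 1, "impressions": 2},
}

WEIGHT = 10  # weight of the popularity term in the rank profile

fields = hit["fields"]

# Contribution of the popularity term: 10 * clicks / impressions.
popularity_term = (
    WEIGHT * fields["clicks"] / fields["impressions"]
    if fields["impressions"] > 0
    else 0.0
)

# What remains is the text-match (nativeRank) contribution.
text_term = hit["relevance"] - popularity_term
```

Here the popularity term contributes 5.0 and the text match roughly 0.16, so an article with a single click out of two impressions is carried to the top almost entirely by the popularity signal.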