Announcing IN query operator

Decorative
image

Today, we are pleased to announce a new query operator in Vespa. The IN operator is useful in various scenarios where we need to filter documents based on whether a given field matches at least one of a set of values. This is a shorthand for multiple OR conditions that enables writing more concise queries with much better performance.

Example

Consider a scenario where we want to retrieve product documents with specific IDs:

schema product {
    document product {
        field id type int {
            indexing: attribute | summary
            attribute: fast-search
            rank: filter
        }
        field name type string {
            indexing: attribute | summary
        }
        ...
    }
}

Instead of using multiple OR conditions,

select id,name from product where id = 10 or id = 20 or id = 30

we simplify this by using the IN operator instead:

select id,name from product where id in (10, 20, 30)

In this case product documents with IDs 10, 20 or 30 are retrieved. We can also retrieve documents that do not match any of these IDs by using the NOT operator:

select id,name from product where !(id in (10, 20, 30))

Performance

The weightedSet query operator has previously been used to solve similar filter use cases the IN operator is tailored for. However, weightedSet was mainly optimized for searching weightedset attribute fields. The IN operator however has been optimized for searching all supported types: singlevalue and multivalue fields with basic type byte, int, long, or string. The weightedSet operator also benefit from these optimizations, as they share most of the low-level matching code.

The illustration below shows historic results from one test case in the IN operator performance test. It shows the average end-to-end latency when querying a corpus with 10M documents using an IN operator over a singlevalue integer attribute field with fast-search. The queries return 1M documents, and the different graphs use different amount of values in the operator: 1, 10, 100, 1000. 1 thread per search is used on the content node.

The baseline performance is equal to the previous performance of the weightedSet operator. After optimizations the performance when searching for many values has been greatly improved. With 100 values the latency went from 150ms to 60ms, a 2.5x speedup. With 1000 values the latency went from 1200ms to 95ms, a 12.5x speedup. Note that similar improvements does not necessary apply when using the IN operator in combination with other query filters.

Historic latency graphs from IN operator performance test

Summary

The new IN operator replaces the weightedSet operator for filtering use cases, making it easy to write concise queries with much better performance than using multiple OR conditions. The operator is supported for singlevalue and multivalue fields with basic type byte, int, long, or string. See the IN operator reference documentation for more details, and Multi lookup set filtering for more examples. The new operator is available in Vespa 8.293.15.

Got questions? Join the Vespa community in Vespa Slack.