Vespa Terminology for Elasticsearch, OpenSearch or Solr People
If you’re like me, coming to Vespa from a Lucene search engine background, you’ll find that a lot of concepts overlap. Some more than others. For example, to get started, you have to create a new application package and deploy it. But what is an application, what is an application package, and how (and why) do you deploy it?
Turns out, an application is a Vespa deployment, and an application package is its complete specification: config, schemas, even the number of nodes and their CPUs. To create or update an application, you run vespa deploy on the application package. Much like you’d use kubectl apply in Kubernetes: you always provide the full spec, whether you’re creating a new instance of the application or updating it.
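As a quick sketch of that analogy (assuming the application package lives in the current directory; the Kubernetes file name is made up):

```sh
# Create or update the application from its full package
# (a directory with services.xml, schemas/, etc.)
vespa deploy .

# Conceptually similar to applying a complete spec in Kubernetes:
kubectl apply -f my-app.yaml
```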
Applications were a bit hard to understand - at least for me, in the beginning - but other concepts came more easily. For example, content nodes: they sound similar to Elasticsearch data nodes. Sure, in Vespa they have to be independent processes. And they’re native (C++) processes, not on the JVM like container nodes (wait, aren’t these coordinating nodes?). Still, content nodes store data and compute the initial phase of the query - it all sounds very familiar.
Here, I’m trying to explain Vespa concepts (most of the glossary is covered) for the Lucene search engine person. In the process, I’ll point out similar functionality from Elasticsearch/OpenSearch/Solr: where they overlap and where they differ. At least at a relatively high level.
To make the most of the table below, I’d suggest reading the whole row of the concept(s) you’re interested in, even if you’re only familiar with one search engine. I’ve tried to reduce duplication wherever possible, so you’ll find interesting info in the Elasticsearch/OpenSearch column even if Solr is all you know.
Enjoy! And feel free to give us feedback about it in the Vespa Slack.
Concept and definition | Elasticsearch and OpenSearch equivalent | Solr equivalent |
---|---|---|
Application: Unit of deployment, as defined by an application package (see below), which can span multiple clusters (i.e. groups of nodes, see below) and/or schemas (i.e. definitions for document types, which are separate chunks of data; more on both concepts below). |
An Elasticsearch or OpenSearch cluster is the closest concept. A Vespa application can contain multiple clusters and schemas for document types (i.e. similar to indices). If you really want an Elasticsearch equivalent, the closest might be a Kubernetes CRD that declares the desired state of one or more clusters. |
Same comments as with Elasticsearch and OpenSearch. |
Application Package: Directory structure that contains all things configuration. It needs to contain the schema and the topology of the cluster (services.xml), but it’s also the place for custom components, models, etc. For a configuration to be applied, it needs to be deployed (see below). |
The desired cluster state: index settings and mappings, cluster settings. Files referenced from the cluster state (e.g. synonym files, models) are, in Vespa, also stored in an application package. | A Configset has some overlap, but Vespa also has solr.xml-like settings and potentially other referenced files (e.g. models) in an application package. |
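For illustration, a minimal application package might look like this (everything beyond services.xml and the schemas/ directory is optional, and the file names are made up):

```
my-app/
├── services.xml          # topology: which clusters and nodes the application has
├── schemas/
│   └── music.sd          # schema for the "music" document type: fields, rank profiles
├── models/
│   └── ranker.onnx       # a model referenced from a rank profile
└── components/
    └── my-searcher.jar   # a custom component (see Component below)
```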
Attribute: Schema keyword telling Vespa how to store and handle a field. This mode supports most of the functionality, from exact matching and retrieving original values to exact nearest neighbor search (see below). The biggest missing pieces are full-text search and approximate nearest neighbor: use index for those. |
In the context of strings, attribute vs index in Vespa is like keyword vs text in Elasticsearch. Attributes are backed by a columnar structure providing similar functionality to doc_values. For example, attributes support grouping (i.e. aggregations, see below). But “attribute” can also be applied to numeric fields, tensors (more on those below), etc. Similarly, most non-string fields in Elasticsearch are also backed by doc values. |
Attributes roughly map to docValues-backed fields (e.g. strings, numbers) as opposed to text fields (which, in Vespa, would use the index keyword). The organization of properties is also a little different in Vespa: you put other options under attribute and index respectively. For example, if you want an inverted index for a string attribute field, you’d put fast-search under attribute. In Solr, the equivalent indexed=true would be at the root of the string field definition. |
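Here’s a minimal schema sketch of that distinction (the schema and field names are made up):

```
schema product {
    document product {
        field description type string {
            indexing: index | summary      # full-text search, like a text field
        }
        field brand type string {
            indexing: attribute | summary  # exact match + grouping, like a keyword field
            attribute: fast-search         # adds a lookup structure, like indexed=true in Solr
        }
    }
}
```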
Cluster: Within an application, a set of nodes that perform the same task. There are three types of nodes:
- content nodes, hosting data (more below)
- container nodes, running a stateless JVM process that runs queries, processes documents, etc. (more below)
- admin nodes, running “administrative” processes like config servers (more on config servers below)
Multiple node types can live on the same host, as they do for example in the quick start Docker container. |
Nodes with the same role: the content cluster, for example, would be equivalent to the data nodes. An OpenSearch cluster is a false friend, because it’s closer to a Vespa application than to, say, a content cluster. Vespa nodes host multiple services that are separate processes. For details on some of them, have a look at the config server and container concepts below. |
Similar to OpenSearch: nodes with the same node roles would be a similar concept, though the actual “roles” are a little different. While the container cluster (see container below) is equivalent to Solr coordinators and the content cluster (see content node below) represents data nodes, there’s no clear equivalent to the admin cluster, which deals with everything from serving configuration to serving logs and metrics. |
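A minimal services.xml sketch with one container and one content cluster (ids, node counts and the document type are examples; the admin cluster can be declared explicitly as well):

```xml
<services version="1.0">
    <container id="default" version="1.0">
        <search/>          <!-- query processing -->
        <document-api/>    <!-- write endpoint -->
        <nodes count="2"/>
    </container>
    <content id="music" version="1.0">
        <redundancy>2</redundancy>
        <documents>
            <document type="music" mode="index"/>
        </documents>
        <nodes count="3"/>
    </content>
</services>
```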
Component: Pluggable piece of code running in Vespa containers (more on them below). Can be used for document/query processing or response rendering. Components can usually be chained: e.g. a query can go through multiple searchers sequentially. |
An Elasticsearch plugin would be the equivalent of a component. The difference is that Elasticsearch plugins can be more diverse, while Vespa encourages you to write plugins for specific tasks - making those specific tasks easier. | Vespa components are quite similar to some of the Solr components. Document processors are similar to update request processors, searchers are similar to search components, processors are like request handlers and result renderers are like response writers. As with Solr, you can define chains of e.g. document processors and inject your custom ones using the configuration. |
Configuration Server: Service responsible for accepting and serving configuration (from application packages, more on them above) to other services. This is where application packages are deployed. |
The active master or cluster manager is the closest concept. The main difference is that cluster managers do a lot more than writing new configuration and serving it. In OpenSearch, the cluster manager pushes the updated cluster state to all nodes in the cluster, irrespective of their role. This way, external clients can read configuration from any node. In Vespa, it’s much more specialized: a configuration proxy can act as a cache in front of a cluster of config servers in order to serve the configuration from memory.
In Elasticsearch, the cluster state (maintained by the cluster manager) contains more than the configuration - for example, which shard is on which node, so that nodes receiving a new document know where to distribute it. In Vespa, document distribution is again handled separately, in a pull fashion: a cluster controller polls for the cluster state, and clients get the distribution of data from distributors. Distributors serve the distribution from memory, much like the configuration proxy does with the config.
Note: “cluster state” is a false friend. In Elasticsearch, it refers to all the cluster metadata: nodes, shards, mappings, settings, templates, aliases, etc. In Vespa, “cluster state” only refers to the state of content (i.e. data) nodes. The distribution of data is derived from it via the distribution algorithm. |
Overseer is the closest concept. Like Overseers, configuration servers write changes to ZooKeeper: by default, each config server has an embedded ZooKeeper. As with Solr, it’s usually the client’s responsibility to pull the latest config from ZooKeeper. The main difference is that Vespa separates the config part from the data distribution (i.e. sharding) part; the latter is handled by the cluster controller. Another difference is that both the config server and the cluster controller have “proxies” for serving their data: the configuration proxy and the distributor, respectively. Finally, Vespa config servers don’t hold a leader election. There’s no active/passive here. Instead, the config proxy reads from any [available] config server. |
Container: Stateless Java process, also called JDisc (Java Data Intensive Serving Container). It parses queries and documents, aggregates per-content-node results, runs global (third-phase) ranking and renders results. |
Coordinating node. Except that in Elasticsearch, every node can be a coordinating node, while in Vespa the container process is separate from the content node process. | Coordinator node. As with Elasticsearch, this job is “bundled” into all nodes by default, while in Vespa the container runs in a JVM and the content node is a native binary compiled from C++. |
Content node: The data layer, made up of protons (which manage the actual data), cluster controllers (which poll for the cluster state; only one is active at a time, via master election) and distributors (which serve the cluster state to clients). |
Data nodes. Though the Vespa equivalent is really just the proton process of the content node. The content layer also has its own “master” processes for maintaining the cluster state - which for Vespa refers to the data distribution, not the configuration (see Configuration Server above). This distribution-related cluster state is maintained by cluster controllers (only one can be elected as the master) and served to clients by distributors. | Data nodes. Though it’s a similar distinction as with Elasticsearch: data nodes are functionally more like Vespa’s protons. Unlike Solr, Vespa doesn’t save data-distribution-related metadata in Zookeeper. Instead, distribution data is polled by the cluster controller. |
Deploy: The vespa deploy CLI command, which verifies the new application package, uploads it and applies it. You can also do this via the Deploy API. |
An Update Cluster/Index Settings API or Update Mapping API call. The difference is that you always work with the whole application; you don’t typically send a diff (i.e. just the new setting). | Roughly maps to uploading a config set via the Configsets API. Though application packages (see above) can also contain cluster-level settings (e.g. replicationFactor, or “redundancy” in Vespa), so there’s some overlap with the Collections API as well. |
Diversity: Ensures results aren’t too similar (e.g. belonging to the same group). This can be done by using diversity in the rank profile, though grouping is the preferred way: you can return summaries of the top N hits per group. |
The Top Hits aggregation or the collapse parameter are functionally equivalent to the grouping approach in Vespa. Diversity in Vespa’s rank profile works differently: it restricts the number of documents per group earlier, during the match phase and/or the first phase of ranking. |
Result Grouping or the Collapse Query Parser are functionally equivalent to the grouping approach in Vespa. Same comment about Vespa’s rank profile diversity: there’s no Solr equivalent at the moment. |
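As a sketch of the grouping approach to diversity (the field names are made up), this returns summaries of the top 3 hits for each of the top 10 categories:

```
select * from product where userQuery() |
    all(group(category) max(10) each(max(3) each(output(summary()))))
```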
Document: A (typically JSON) representation of the data, made up of key-value pairs. |
Document. In Vespa, namespace + document type + user-provided ID constitutes the full unique identifier in the cluster. Read more about document types (similar to indices) and namespaces (similar to old Elasticsearch mapping types) below. | Document. Also represents key-value pairs, though in Vespa you can have structs: a value can represent another bunch of key-value pairs and so on (like objects in Elasticsearch). |
Document summary: A definition of fields (or snippets of fields) to return from each hit. |
A combination of stored_fields, docvalue_fields and highlighted fragments, stored in the configuration. Queries can reference one of these summaries to indicate what kind of data should be returned. There is no real equivalent of _source in Vespa, but you can create a summary containing all fields, which will effectively return the whole document. Provided that non-attribute fields have the summary option enabled (which is close to stored=true), you can reference them in the summary you define. All attribute fields can be retrieved, because they are backed by a structure similar to Lucene’s DocValues. The summary of a field can also be dynamic, which returns highlighted snippets - a more restricted version of the highlighting exposed in Elasticsearch or OpenSearch. |
It’s like defining fl as a request handler default, except that Vespa summaries are independent of the other parameters. Similar to Solr, a field needs to have the required data structure in order to be returned: stored=true would be indexing: summary in Vespa. Attribute fields have the docValues equivalent always on, so you can always retrieve them; you only choose whether attributes should always be served from memory or can be paged to disk (e.g. if there’s more data than RAM). |
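A document-summary sketch (assuming title and description fields that have summary enabled); a query would then select it via the presentation.summary parameter:

```
document-summary short {
    summary title {}
    summary description {
        dynamic    # return highlighted snippets instead of the full value
    }
}
```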
Document processor: Pluggable and chainable component for processing documents before writing them. |
Somewhere between an ingest pipeline and a full-blown plugin, in terms of both ease of use and flexibility. | Maps quite well to Solr’s Update Request Processors. |
Document type: A type of document (i.e. a collection of fields) defined in a schema. Data corresponding to different document types is handled separately. |
Index. Although Vespa has a different approach to splitting document-type-specific data across nodes, see below. If you’re thinking of the old Elasticsearch mapping types, don’t 🙂 It’s a false friend. Document types in Vespa are stored in different files, like indices are in Elasticsearch. |
Collection. Data is distributed differently between nodes in Vespa, where there’s no exact equivalent of a shard. Read more about Vespa’s elasticity below. |
Elasticity: How you can add and remove content nodes without downtime. Vespa stores documents in buckets - think of buckets as micro-shards. It also separates the persistence layer from the serving layer (which is in memory) a little more. This has the following advantages compared to the other engines here:
- you don’t have to decide on sharding upfront, nor do you have to change sharding or reindex to a new index with a different number of shards
- data is often served from memory (read: faster)
- the number of replicas that serve searches (searchable-copies) can be lower than the total number of stored replicas (redundancy)
- partial updates are real-time, and most of them are really partial; regular writes are real-time as well, no need for a refresh |
Elasticsearch is also… elastic 😎… but in a different way. You have shards made up of segments. This has its own advantages compared to Vespa, which help with logs and other time-series data:
- if you have lots of data per node and can tolerate higher latencies (e.g. log analytics), the per-segment approach that relies a lot on OS caches works better
- each index can have a different number of shards, which allows you to have many small indices that don’t have to be distributed on all nodes
- you have more control over which index is served by which nodes (e.g. hot-warm architectures) |
Similar comment to Elasticsearch. The only other difference is that in Vespa, elasticity is automatic: the cluster expands/recovers on its own when nodes are added/removed. With Solr, you’ll want to use something like the Solr Operator [for Kubernetes] for a more hands-off approach. |
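For context, the two replica-related settings mentioned above live in the content cluster definition - a sketch, with example values:

```xml
<content id="music" version="1.0">
    <!-- total copies of each bucket that are stored -->
    <redundancy>2</redundancy>
    <tuning>
        <!-- copies that are indexed and ready to serve searches -->
        <searchable-copies>1</searchable-copies>
    </tuning>
    <documents>
        <document type="music" mode="index"/>
    </documents>
    <nodes count="3"/>
</content>
```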
Federation: Vespa’s ability to combine (and modify) results from multiple sources, which can be different Vespa applications and/or other data stores. |
There’s some overlap with cross-cluster search. In Vespa, the implementation can involve custom code, and non-Vespa data sources can be included in the search configuration. Coupling is looser in Vespa, in the sense that sources (and the federation layer itself) don’t care about e.g. the schemas of the underlying Vespa clusters. |
There’s some overlap with Solr’s ability to combine multiple data sources (Solr or otherwise) in a single result set via Streaming Expressions. Or data from multiple collections via Cross-Collection Join: a specific implementation on top of Streaming Expressions. |
Field: Key-value pairs in documents. Though you can have structs, synthetic fields… there’s some functionality here that’s worth outlining/comparing. |
Field 😊 The syntax is a bit different in Vespa, but the high-level functionality is often similar: structs are like objects in Elasticsearch and synthetic fields are like copy_to. Numeric fields are similar, and for strings Vespa has index (i.e. text fields), attribute (like keyword, more info above) and so on. |
Field. Same comment as for Elasticsearch and OpenSearch, with one more difference here: Vespa also supports structs (i.e. nested structures that are flattened under the hood). Handling dates is more verbose in Vespa: you’d use synthetic fields to parse date strings into milliseconds since epoch. Which also means that date range queries and range facets are more limited. |
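Here’s a synthetic-field sketch, similar in spirit to copy_to (schema and field names are made up; the synthetic field sits outside the document block):

```
schema article {
    document article {
        field title type string { indexing: index | summary }
        field body  type string { indexing: index }
    }
    # synthetic field: populated from the other fields at indexing time
    field catch_all type string {
        indexing: input title . " " . input body | index
    }
}
```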
Fieldset: There are two separate meanings here:
1. Searchable fieldset: an alias for multiple fields, so you can more easily search across multiple fields. Settings need to be compatible, because tokens are only generated once at query time.
2. Document fieldset: fields to be returned during GET by ID or Visit. More info on Visit below. |
For searchable fieldsets there’s no equivalent. The cross_fields type of the multi_match query and the combined_fields query have some similarities: they also need compatible analysis to work as expected. But scoring is different in Vespa (it’s different even between cross_fields and combined_fields), and in Elasticsearch you still have to specify the fields at query time. For document fieldsets, source filtering provides similar functionality. |
Searchable fieldsets are similar to defining different search handlers with e.g. the edismax query parser and a default qf. Document fieldsets are like defining fl during a realtime GET request. |
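A searchable fieldset is just a few lines in the schema - a sketch, assuming title and body fields with compatible settings:

```
fieldset default {
    fields: title, body
}
```

A query like select * from article where default contains "vespa" then searches both fields with a single set of query tokens.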
Garbage collection: Automatic document expiry. Not to be confused with JVM GC. |
Similar to Elasticsearch’s old _ttl field. For time-series data, most people have time/size-based indices and use ILM/ISM to delete old indices. For deleting data within an index, you’d normally run a delete-by-query command on a schedule (e.g. Linux or Kubernetes cron). | Similar to Solr’s DocExpirationUpdateProcessor. |
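In Vespa, garbage collection is configured in services.xml via a document selection describing which documents to keep - a sketch, assuming a log document type with a timestamp field holding seconds since epoch:

```xml
<documents garbage-collection="true">
    <!-- keep only the last 30 days; everything else eventually gets removed -->
    <document type="log" mode="index"
              selection="log.timestamp &gt; now() - 2592000"/>
</documents>
```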
Grouping: Vespa’s faceting implementation. Can be hierarchical (e.g. top N tags for each of the top M countries) and can compute stats, show hits, etc. per bucket. |
Aggregations. Though the Vespa syntax (from YQL) is more like ES|QL than the JSON aggregations. | It’s close (functionality-wise) to the JSON facet implementation with more of an SQL-like syntax. |
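A hierarchical grouping sketch (assuming country and tag attributes): counts for the top 3 tags within each of the top 5 countries:

```
select * from doc where userQuery() |
    all(group(country) max(5) each(output(count())
        all(group(tag) max(3) each(output(count())))))
```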
Indexing: Search engine parlance for writing data, because there are [more] indexing structures that need to be created as well. Unlike Lucene search engines, Vespa doesn’t need a soft commit to make new data visible: once the client gets the ACK for a write, the new document will come up in searches. |
Indexing. It’s worth noting that, while Elasticsearch, OpenSearch and Solr support writing in bulk, Vespa achieves high write throughput by using HTTP/2 multiplexing. | Indexing. Same comment about batching (or the lack of it in Vespa), which makes it more difficult to write high-throughput Vespa clients at first. But it’s easier to detect (and recover from) errors, since each reply is individual. |
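For example, the Vespa CLI feeds one JSON operation at a time, but keeps many in flight over HTTP/2 (the file name is made up):

```sh
# docs.jsonl holds one feed operation per line, e.g.
# {"put": "id:mynamespace:music::doc1", "fields": {"title": "..."}}
vespa feed docs.jsonl
```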
Namespace: A meta-attribute of documents that can be used in the document selection language while visiting (i.e. exporting - more on visiting below). |
Similar to how mapping types used to be, in the sense that you could “see” only documents of a specific type, though they were all co-located in the same index. | No equivalent. You can use different collections to separate documents, but then they’re really separated, with pros and cons: queries on one collection are faster, but queries across collections are slower. |
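The namespace shows up in document IDs and in selection expressions - a sketch with made-up names:

```
# full document ID: scheme, namespace, document type, user-provided ID
id:tenant_a:music::some-doc-id

# document selection matching only that namespace (e.g. while visiting)
id.namespace == "tenant_a"
```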
Nearest neighbor search: Vector similarity search, either exact or approximate. Many distance metrics are available, and so are pre- and post-filtering. Tensor fields (which can store vectors, more on them below) can be multi-valued, and you can have N vector fields in a single document. |
The knn query. The implementation is quite different, though. Most importantly, when it comes to approximate nearest neighbor (ANN), Vespa has one HNSW graph per node, while the Lucene implementation has one HNSW graph per segment. This makes ANN typically much faster in Vespa, because it’s like searching an index force-merged to one segment. Note: one graph also implies one search thread. While Elasticsearch and OpenSearch can run multiple search threads per query per index (because they can work with different segments), splitting HNSW graphs and parallelizing work doesn’t help with latency that much: you’re still better off with a merge policy that results in fewer segments. Fewer, bigger graphs help with efficiency and usually with latency, too. That said, both Vespa and Elasticsearch support similar vector types (e.g. float, byte and bit) and distance metrics (e.g. euclidean, dot product, hamming distance). |
Same comments as with Elasticsearch or OpenSearch’s Lucene engine. It’s worth noting that OpenSearch also supports NMSLIB and Faiss, and that when it comes to exposing Lucene features, the implementations aren’t exactly the same (or released at the same time) across Elasticsearch, OpenSearch and Solr. But this is constantly changing, and I’m trying to focus on the big picture here, not the tiny details - especially since this area has been changing a lot lately. |
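A sketch of a vector field with HNSW enabled (the dimension, metric and names are examples):

```
field embedding type tensor<float>(x[384]) {
    indexing: attribute | index    # "index" builds the HNSW graph for ANN
    attribute {
        distance-metric: angular
    }
}
```

A matching ANN query would look like select * from doc where {targetHits: 10}nearestNeighbor(embedding, q_embedding), with the query vector passed as input.query(q_embedding).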
Node: Collection of services that focus on the same task. Can be a content node (see above), a container node (see above) or an admin node. The same host can be a content, container and admin node at the same time. |
Node. It’s just that in OpenSearch, nodes can have one or more roles, all in the same process. In Vespa, there are different processes even within the same role. Think of them as sub-roles, which are more fine-grained. | Node. Same comment as with OpenSearch. |
Parent/child: Join implementation to replace denormalizing, for when the use-case has relational data. |
There are some similarities to join fields, and they solve similar use-cases. The implementation in Vespa prioritizes query latency over functionality and disk usage. It works by referencing attribute fields (i.e. DocValues-like, more details above) from other documents. The topmost parent is called a “global” document. Global documents are replicated on all content (i.e. data) nodes. So there are pros and cons:
+ there’s a cheap indirection step to access the referenced (i.e. parent) value at query time; this scales much better with the number of children than the two-step join queries in Elasticsearch
- you can’t do full-text search in referenced fields: they wouldn’t be attributes
- disk and memory usage is high if you have a lot of global documents |
Functionally, it’s close to the Join Query Parser working across single-shard collections: one side needs to be replicated on all nodes. But Vespa makes a similar performance-over-functionality trade-off as explained in the Elasticsearch section: it only works with attributes (so no full-text search on referenced strings), but because Vespa pulls attributes by reference you can expect queries to be faster than even nested documents (Block Join). Unlike Block Join, documents can be added/deleted/updated independent of each other. More on [partial] updates below. |
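A reference/import sketch (the schema names are made up; the parent document type would be declared global in services.xml):

```
schema track {
    document track {
        # reference to a parent document of type "artist"
        field artist_ref type reference<artist> {
            indexing: attribute
        }
    }
    # exposes the parent's attribute as if it were a local field
    import field artist_ref.name as artist_name {}
}
```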
Partial update: Partially updating documents 🙂 |
Similar to the Update API functionally, but Vespa’s updates are more efficient:
- because writes in Vespa are real-time, there’s no need for a refresh or a transaction log lookup before an update, which makes retrieving the document (usually the most expensive part of an update) much faster
- on attribute fields (i.e. DocValues - see above), updates are really partial: there’s nothing to read here, which is useful for keeping counters like items in stock |
Vespa’s partial updates for attribute fields are even more localized than Solr’s in-place updates: the write is only for the document, not for a whole segment like DocValues updates. Vespa reads existing documents (i.e. like atomic updates, but without the transaction log lookup) for:
- summary (i.e. stored) fields
- index fields (i.e. text fields, not backed by the docValues-like structure)
- referenced fields (see parent/child above)
- multiple structs (i.e. flattened objects), whether they are in an array or a map
- predicate fields (used for matching documents that store query logic, like in lucene-monitor or Percolator from OpenSearch) |
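A partial-update sketch against Vespa’s /document/v1 API, incrementing a counter-style attribute (the names are made up):

```json
{
    "fields": {
        "in_stock": { "increment": 1 }
    }
}
```

This would be sent as a PUT to something like /document/v1/mynamespace/product/docid/sku-123; since in_stock would be a numeric attribute, nothing needs to be read first.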
Query: Running a search, typically via YQL. Returns results and/or grouping information. |
A request to the Search API. In Elasticsearch, we usually refer to “query” as just the query part (e.g. excluding aggregations), while in Vespa it refers to the overall read request (e.g. including grouping). Vespa’s YQL query syntax is more like ES|QL than the Query DSL from OpenSearch. Vespa also has Simple Query Language, which is close to the Simple query string query. Conceptually, there are lots of similarities in other query options. For example, Vespa’s Tracing returns similar information to the Profile API in OpenSearch or to Solr’s debug parameter. |
A request to a search handler. Just as search handler options can provide defaults, Vespa has query profiles. In Solr, you can put relevance-related logic either in the search handler definition or at query time. In Vespa, most relevance logic would go in a rank profile (more on Ranking below), which is part of the schema definition. Only query logic (e.g. which terms to look for) goes in the Vespa query. |
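A minimal query sketch, POSTed to Vespa’s /search/ endpoint (names and values are examples):

```json
{
    "yql": "select * from music where album contains \"head\"",
    "hits": 10,
    "ranking": "my_rank_profile"
}
```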
Ranking: Computing the relevance score of documents matching the query (i.e. from retrieval). Ranking can have multiple phases. Ranking-related logic is defined in a rank profile in the schema, using ranking expressions. Typically, you have the application package (more info above) somewhere in source control, and when you update the ranking logic you commit and deploy the new application package, which can also trigger some tests before it takes effect. Vespa’s approach differs from other engines here (especially Elasticsearch and OpenSearch), where ranking logic is usually in the query payload. Vespa’s approach has pros (e.g. easier to test and deploy) and cons (e.g. harder to tweak during development). |
Ranking. Terminology might differ slightly here: for example, the relevance score is called “score” in Lucene-based engines, while in Vespa it’s “relevance” 🙂 The important bit is how it works: Vespa does ranking in phases. The first phase happens on the content nodes (i.e. data nodes, see above) and computes an initial score for all matching documents, much like Elasticsearch does. As with Elasticsearch, WAND allows Vespa to skip unlikely candidate documents. The second phase is optional and also runs on the content nodes: it rescores the top N results with a potentially more expensive ranking expression, like Elasticsearch’s rescore. The third phase is also optional: the global phase. It runs on the container node (i.e. the coordinating node for that query, more info above) over the top N aggregated results from all the content nodes. |
Ranking 🙂 Unlike Lucene-based engines, where the default scoring function is BM25, in Vespa the default is called nativeRank, a combination of the following factors:
- how well the query terms match the document terms
- how early those terms show up in the document (similar to span_first)
- how close the terms are to each other (like span_near)
This works across index (i.e. text) and attribute (i.e. string) fields and produces a reasonably good and efficient score. It’s also highly configurable: fields can have different importance, the proximity score can decay on a different curve, and so on. Note that there’s no IDF (term rarity) calculation here. A term significance is calculated, though, and you can override it at query time. But if you want to be precise with IDF, you can use BM25 instead of, or in combination with, nativeRank. |
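A rank-profile sketch with two of the phases described above (the expressions are examples):

```
rank-profile my_profile inherits default {
    first-phase {
        # cheap score, computed for every matching document
        expression: nativeRank(title, body) + attribute(popularity)
    }
    second-phase {
        # more expensive rescoring of the top hits, still per content node
        rerank-count: 200
        expression: bm25(title) + bm25(body)
    }
}
```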
Schema: Contains the document type definition (which in turn contains field definitions - how to index data) and rank profiles (i.e. how to query data). |
Mostly overlaps with mappings. But a Vespa schema also contains other settings related to indexing documents and running queries. For example, equivalents of analysis settings (linguistics). | Mostly overlaps with schema, but includes query-related settings like rank profiles, which somewhat overlap with search handler definitions (see Ranking above). |
Searcher: Pluggable component that implements custom query handling. |
Like a query type, which could be implemented as a plugin. | Similar to search components. As with search components in a search handler, in Vespa you can chain multiple searchers in a search chain. |
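A minimal searcher sketch in Java (the class name is made up); it adds a trace message and passes the query down the chain:

```java
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.searchchain.Execution;

public class TracingSearcher extends Searcher {
    @Override
    public Result search(Query query, Execution execution) {
        // annotate the query, then continue down the search chain
        query.trace("TracingSearcher saw the query", true, 3);
        return execution.search(query);
    }
}
```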
Semantic search: Refers to vector search (see nearest neighbor search above), in contrast with lexical search, which is about tokenizing and matching terms. One can combine semantic and lexical search scores during the global ranking phase, using something like reciprocal rank fusion (RRF). |
Same concepts in Elasticsearch and OpenSearch as well. Just note that semantic search and RRF are both paid features in Elasticsearch at the time of writing this. | Same concepts, same comments as in Nearest neighbor search above. |
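A hybrid retrieval sketch in YQL - lexical OR semantic candidates, to be combined by ranking afterwards (the names are made up):

```
select * from doc where userQuery() or
    ({targetHits: 100}nearestNeighbor(embedding, q_embedding))
```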
Service: A process that does a specific task, as defined in services.xml. |
A (usually more specific) part of a node role. Vespa has more of a microservices architecture - see Cluster and Node above for more details. | Same comment as with Elasticsearch and OpenSearch. |
Streaming search: A separate mode of storing data, which is different from the default (document mode="index"). The trade-off is that Vespa doesn’t compute index data structures when writing data, it only stores it: it uses little disk space, but won’t perform on large datasets. It works great for many small subsets of documents, which can be searched separately by group name. Some features aren’t available in streaming search, for example parent/child and ANN search (though exact nearest neighbor works - more on NN and ANN above). Also, you don’t have access to all the linguistics (i.e. analysis) functionality. |
No similar concept. Streaming search works well for multi-tenant use-cases where the number of tenants is large and the data per tenant isn’t very large. Also, you can search across tenants if you need to. For such use-cases you’d use routing, but Vespa’s approach is very different - like a distributed grep: if the searched dataset is usually small, it works on the raw content instead of creating “accelerator” data structures. Only raw data and attributes (i.e. DocValues, see above) are stored on disk, no indexing data structures. Memory usage of attributes is only 45 bytes per document - much less than in the typical Vespa deployment. |
Similar comment to Elasticsearch and OpenSearch: the use-case is having many tenants, but the underlying implementation differs. “Tenant” here refers to datasets that are searched separately (e.g. mailboxes); elsewhere, such discrete datasets are also referred to as personal search or partitioned data. It’s a different concept than Vespa Cloud tenants. |
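Streaming mode is set per document type in services.xml - a sketch with made-up names:

```xml
<documents>
    <!-- store-only on write; documents are scanned at query time -->
    <document type="mail" mode="streaming"/>
</documents>
```

A query would then pass something like streaming.groupname=user123, so only that user’s documents get scanned.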
Tensor: Data structure whose values are addressed by one or more dimensions, which can be mapped (labels) or indexed. Tensors support a wide range of functions (tensor math) that allow computing everything from keyword weights to vector distances for semantic search (see above) to a fully-fledged neural network. |
There’s no native way to express tensors and do tensor math in Elasticsearch or OpenSearch. That said, depending on the use-case, one can use existing data structures and query types to perform similar functions. For example, the script score query can access dense vector data to perform exact nearest neighbor search using various distance functions. | Similar to Elasticsearch and OpenSearch, there’s no way to natively store tensors. But there are features offering a similar kind of functionality:
- Function queries allow you to perform computations on top of results from other queries. You can also implement your own function as a plugin.
- Streaming expressions can perform a wide range of computations on top of tuples (which in turn can be generated in various ways): vector similarity, matrix multiplication, clustering and much more. |
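A tensor math sketch: a sparse dot product between query-time preferences and a mapped tensor attribute (the names are made up):

```
field category_weights type tensor<float>(category{}) {
    indexing: attribute
}

rank-profile personalized inherits default {
    inputs {
        query(preferences) tensor<float>(category{})
    }
    first-phase {
        # multiply matching labels, then sum: a sparse dot product
        expression: sum(query(preferences) * attribute(category_weights))
    }
}
```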
Visit: An efficient way to go over all documents in Vespa, or a subset of them defined by a document selection expression. |
Similar to search_after. The Vespa implementation makes yet another performance-over-functionality trade-off: the document selection expression isn’t a full-blown query, but a boolean expression that can match text using regexes, express numeric ranges and do light transformations (i.e. analysis). Like OpenSearch’s scroll, a visit can be sliced and parallelized, but it doesn’t offer a point-in-time view. | It’s like a mix between the efficiency of the Export handler and the page-at-a-time approach of cursors. Vespa returns a continuation ID for getting the next chunk of documents, similar to Solr’s cursorMark and the scroll ID from Elasticsearch and OpenSearch. |
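A visiting sketch with the Vespa CLI (the selection expression and document type are made up):

```sh
# export all music documents from 2020 onwards, one JSON document per line
vespa visit --selection 'music.year >= 2020'
```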