Magnus Eide-Fredriksen
Intern
Theodor Kvalsvik Lauritzen
Intern

Parsing Through the Summer: A Tale of Schemas, Syntax, and Shenanigans

One of the first steps when creating a Vespa application is defining one or more schemas. Using the Vespa Schema Language, you can define document types and the kinds of computations you want to do over them. The language is very powerful, but most IDEs and editors do not currently provide support for it. It is 2024, and the hottest topic in language tooling is LSP. It may be something you have heard of before without quite knowing what it is. Or maybe you know all the ins and outs of configuring LSP servers in your favorite editor?

Regardless, our mission this summer was to investigate the mystical “LSP” and try to harness its powers to make writing schema files a pleasant experience. Join two interns climbing the learning ladder on a quest to provide the ultimate language support!

Starting the project

We first spent a couple of days learning about Vespa and how to use it by following the getting started guide. Here we got some hands-on experience with the language we were supposed to create tooling for, as well as a general understanding of Vespa and its capabilities. It was pretty cool!

After this we had to figure out a bunch of stuff:

  • What exactly is LSP and how does it work?
  • How are language servers usually implemented?
  • What programming language should we use to implement the language server?
  • How do we parse schema files?

In this blog post we will discuss some of the answers we found to these questions during development.

Language server protocol

So what is LSP? The Language Server Protocol standardizes how language support features are provided to IDEs and editors. To add support for a specific language, all you need to do is create a program called a language server, which responds to the requests defined in the protocol. A client, which in this context is an editor like VSCode, can then launch the server as a separate process and use it to provide developer tooling for the user. This way, the language support logic is decoupled from any specific editor or environment, and supporting new editors should be easy.
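
To make this concrete, here is a minimal sketch of a language server entry point in Java, using the lsp4j library. The wiring is illustrative and not our exact implementation:

import java.util.concurrent.CompletableFuture;
import org.eclipse.lsp4j.*;
import org.eclipse.lsp4j.launch.LSPLauncher;
import org.eclipse.lsp4j.services.LanguageServer;
import org.eclipse.lsp4j.services.TextDocumentService;
import org.eclipse.lsp4j.services.WorkspaceService;

public class MinimalSchemaLanguageServer implements LanguageServer {

    @Override
    public CompletableFuture<InitializeResult> initialize(InitializeParams params) {
        // Tell the client which parts of the protocol this server supports.
        ServerCapabilities capabilities = new ServerCapabilities();
        capabilities.setTextDocumentSync(TextDocumentSyncKind.Full);
        return CompletableFuture.completedFuture(new InitializeResult(capabilities));
    }

    @Override
    public CompletableFuture<Object> shutdown() {
        return CompletableFuture.completedFuture(null);
    }

    @Override
    public void exit() { }

    @Override
    public TextDocumentService getTextDocumentService() {
        // This is where document-related requests end up: didChange, hover, completion, ...
        return new TextDocumentService() {
            @Override public void didOpen(DidOpenTextDocumentParams params) { }
            @Override public void didChange(DidChangeTextDocumentParams params) { }
            @Override public void didClose(DidCloseTextDocumentParams params) { }
            @Override public void didSave(DidSaveTextDocumentParams params) { }
        };
    }

    @Override
    public WorkspaceService getWorkspaceService() {
        return new WorkspaceService() {
            @Override public void didChangeConfiguration(DidChangeConfigurationParams params) { }
            @Override public void didChangeWatchedFiles(DidChangeWatchedFilesParams params) { }
        };
    }

    public static void main(String[] args) {
        // The editor launches the server as a child process and exchanges
        // JSON-RPC messages with it over stdin/stdout.
        MinimalSchemaLanguageServer server = new MinimalSchemaLanguageServer();
        LSPLauncher.createServerLauncher(server, System.in, System.out).startListening();
    }
}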

It turns out that the full protocol specification is quite extensive, so we had to choose which parts would be most useful to implement. These were our goals:

  • Diagnostics: highlighting of errors and warnings found when parsing schema files.
  • Code navigation: “go-to-definition” and “find references”.
  • Semantic token syntax highlighting.
  • Code actions: quick fixes for common errors.
  • Completion.
  • Documentation on hover.

Usually, a language server is written in the same language that it provides support for. We figured that writing a language server in the Vespa Schema Language would impose some difficulties. Therefore, we decided to implement the server in Java, as most of the relevant parts of Vespa’s existing codebase are written in Java.

Parser

For a language server to actually do something useful, it needs to parse the language in question. The core functionality is therefore what happens when a text document changes. LSP captures this event through the “textDocument/didChange” notification, which fires on every keystroke. The new document content must then be parsed, and symbols registered so that other requests can be handled later.
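
In code, a sketch of such a handler could look like this; the parseAndIndex helper is a placeholder for the pipeline described in the rest of this post:

import java.util.List;
import org.eclipse.lsp4j.Diagnostic;
import org.eclipse.lsp4j.DidChangeTextDocumentParams;
import org.eclipse.lsp4j.PublishDiagnosticsParams;
import org.eclipse.lsp4j.services.LanguageClient;

class SchemaDocumentEvents {
    private final LanguageClient client; // proxy object for talking back to the editor

    SchemaDocumentEvents(LanguageClient client) {
        this.client = client;
    }

    // Called for every "textDocument/didChange" notification, i.e. every keystroke.
    void didChange(DidChangeTextDocumentParams params) {
        String uri = params.getTextDocument().getUri();
        // With full-document sync, the single change event carries the whole new content.
        String content = params.getContentChanges().get(0).getText();

        // Re-parse, re-register symbols, and push the errors back to the editor.
        List<Diagnostic> diagnostics = parseAndIndex(uri, content);
        client.publishDiagnostics(new PublishDiagnosticsParams(uri, diagnostics));
    }

    // Placeholder for the real pipeline: fault-tolerant parse -> CST -> symbol index.
    private List<Diagnostic> parseAndIndex(String uri, String content) {
        return List.of();
    }
}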

The existing parser for schema files was generated using a parser-generator tool called JavaCC. In JavaCC you write production rules like:

void field() :
{}
{
    <FIELD> identifier() <TYPE> dataType() <LBRACE> fieldBody() <RBRACE>
}

Here identifier, dataType and fieldBody are all production rules themselves. JavaCC takes the list of rules as input and generates a Java program that will lex and parse any string written in the given language. JavaCC also makes it possible to inject Java code to be executed during the actual parsing. For example:

void field() :
{
    String name;
}
{
    <FIELD> name = identifier() <TYPE> ...
    {
        if (isReservedName(name))
            throw new IllegalArgumentException(name + " is reserved!");
        // ...
    }
}

This technique is used extensively in the actual schema parser implementation. That way, a model of the Vespa schema is built during parsing, instead of having to be constructed from some AST representation afterwards. It is actually quite elegant.

This approach does not, however, work as well when trying to make a language server, for a few reasons:

  • The intermediate representation does not contain any references to the original document. When creating an LSP feature like “go-to-definition”, the exact location of the symbol in question needs to be known by the language server. What we need is a Concrete Syntax Tree (CST).
  • When writing a schema file, most of the time the file will not be syntactically correct. A default JavaCC parser is not fault tolerant. If it encounters an error during parsing it will simply throw an exception and quit. This would lead to poor language support.
  • The language server might need some very specific information about what the syntax tree looks like at a particular position, for instance to generate relevant completion items. This information is lost during the JavaCC parsing.

For these reasons, we found it necessary to find another way to parse schema files. We had some requirements:

  • Ideally, the parser is closely related to the existing JavaCC parser.
  • The parser should be fault tolerant, i.e. able to continue parsing after a syntax error.
  • The parser should generate a Concrete Syntax Tree, where every node knows its position in the original document.

After some research, we found a project called CongoCC. CongoCC is the continuation of a project called JavaCC 21, which itself aimed to be a successor of JavaCC. It meets all our requirements! The syntax is similar to JavaCC, it is fault tolerant, and it generates a CST out of the box. Our next mission was to port the parser to CongoCC. In total, there were about 5000 lines of JavaCC code to convert. It took a couple of days.

When the core parser was ready, it was time to lay out the full pipeline to execute when a document changes.
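
In broad strokes, it looks something like the sketch below, where the helper methods are stubs standing in for the real parser, symbol registration, and reference resolution:

import java.util.ArrayList;
import java.util.List;
import org.eclipse.lsp4j.Diagnostic;

// Illustrative outline of the per-change pipeline; names are not our exact code.
class ParsePipeline {

    List<Diagnostic> onDocumentChange(String uri, String content) {
        List<Diagnostic> diagnostics = new ArrayList<>();
        Object cst = parseFaultTolerant(content, diagnostics); // 1. parse -> CST + syntax errors
        clearSymbols(uri);                                     // 2. drop stale symbols for this file
        registerSymbols(cst, uri);                             // 3. walk the CST, record definitions and references
        diagnostics.addAll(resolveReferences(uri));            // 4. unresolved references become errors
        return diagnostics;                                    // 5. sent back via "publishDiagnostics"
    }

    private Object parseFaultTolerant(String content, List<Diagnostic> out) { return new Object(); }
    private void clearSymbols(String uri) { }
    private void registerSymbols(Object cst, String uri) { }
    private List<Diagnostic> resolveReferences(String uri) { return List.of(); }
}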

Embracing the CST

When you have a CST, the life of a language server developer gets significantly easier. Every LSP request turns into a tree problem:

  • Go-to-definition: Find the node at the cursor position (see the sketch after this list). Find the symbol there. Find the node in the tree corresponding to its definition. Return the location of that node.
  • Completion: Do some kind of pattern matching with the CST around the cursor position. Give valid completion items based on the matched pattern.
  • Semantic token highlighting: Leaf nodes in the CST are tokens, which provide the basis for syntax highlighting. But some tokens have different meanings depending on context, so by inspecting the CST we can give better highlighting than a purely token-based approach.
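
As an example, locating the node under the cursor, the first step of go-to-definition, is a simple recursive descent over the tree. This sketch uses a simplified node type of our own; the CongoCC-generated node API differs in its details:

import java.util.List;

// Simplified CST node: every node knows its range in the source document.
record Range(int beginLine, int beginColumn, int endLine, int endColumn) {
    boolean contains(int line, int column) {
        if (line < beginLine || line > endLine) return false;
        if (line == beginLine && column < beginColumn) return false;
        if (line == endLine && column > endColumn) return false;
        return true;
    }
}

record SchemaNode(Range range, List<SchemaNode> children) {
    // Descend into whichever child covers the cursor until we reach the innermost node.
    SchemaNode findNodeAt(int line, int column) {
        for (SchemaNode child : children) {
            if (child.range().contains(line, column)) {
                return child.findNodeAt(line, column);
            }
        }
        return this; // no child covers the position, so this node is the match
    }
}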

To simplify handling the different types of requests we can get through LSP, it is useful to do some processing of the CST after the initial parsing. In particular, we want to keep track of the different symbols that can exist in a schema document. A symbol is any user-defined construct with an identifier, for example a field, a function or a rank-profile. To keep track of the symbols, we created an index called SchemaIndex. Once all definitions have been added to the index, we can go through the symbol references and search for their definitions. If no definition is found, we can send an error message back to the user. To resolve all the references, the index also keeps track of inheritance, so that every valid place a symbol could be defined is searched.
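
The sketch below shows a heavily stripped-down version of the idea; the real SchemaIndex also tracks symbol types, scopes and inheritance:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Simplified model of the symbol index, not the real implementation.
class SchemaIndexSketch {
    record Location(String fileURI, int line, int column) {}
    record Reference(String name, Location location) {}

    private final Map<String, Location> definitions = new HashMap<>();
    private final List<Reference> references = new ArrayList<>();

    void registerDefinition(String name, Location location) {
        definitions.put(name, location);
    }

    void registerReference(String name, Location location) {
        references.add(new Reference(name, location));
    }

    // Go-to-definition boils down to a lookup here.
    Optional<Location> findDefinition(String name) {
        return Optional.ofNullable(definitions.get(name));
    }

    // Once all definitions are registered, any reference without a matching
    // definition is reported back to the user as an error.
    List<Reference> unresolvedReferences() {
        return references.stream()
                .filter(reference -> !definitions.containsKey(reference.name()))
                .toList();
    }
}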

Once we had the CST and index structure in place, the rest of the summer was spent actually implementing features, reading the Vespa schema and LSP documentation, and making sure the tool worked as smoothly as possible. Oh, and when creating a language server there are of course edge cases. A lot of edge cases.

Unexpected side mission

It turns out that IntelliJ does not fully support LSP yet; it only supports a subset of the requests, and an important part it lacks is semantic tokens. Usually, syntax highlighting of a language is split into two components: a basic, fast highlighter that handles most keywords and runs separately from LSP, and the semantic token request, through which the language server provides additional highlighting information that is more “correct”, but slower. Our highlighting scheme, however, relies solely on semantic tokens. It takes some time when opening a document for the first time, but after that we deemed it fast enough to do all the highlighting work. This meant that highlighting didn’t work in IntelliJ at all! Oh no. Highlighting is quite an important part of providing language support. So how do we get highlighting in IntelliJ? There are really only two options:

  • Implement the semantic token functionality ourselves in the IntelliJ plugin.
  • Implement basic syntax highlighting directly in the IntelliJ plugin.

The first option seemed a bit difficult, so we decided to make a custom highlighter just for the IntelliJ plugin. To do this, the plugin API requires you to implement an abstract class called “Lexer”. The lexer breaks the document into a series of tokens, which can then be highlighted based on their type. The interface is for an incremental lexer, meaning that it can start and stop at arbitrary places. Luckily for us, CongoCC had already generated a lexer! If only we could plug it into IntelliJ…

The solution was to wire up an adapter between the IntelliJ interface and the CongoCC interface. It was not ideal, but it worked. For instance, the generated lexer is not incremental in the way IntelliJ requires, so the adapter has to create a new instance of the generated lexer for each call to “start” and pretend that we got an entirely new document. A bit suboptimal, but better than writing (yet another) definition of the schema language in something like JFlex. The bonus is that as soon as JetBrains implements the semantic token request, syntax highlighting in IntelliJ will automatically improve. The same holds for other LSP features.
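
The adapter looks roughly like the sketch below. SchemaLexer, Token and SchemaElementTypes are simplified stand-ins for the generated classes and our mapping code, not their exact APIs:

import com.intellij.lexer.LexerBase;
import com.intellij.psi.tree.IElementType;

// Feeds tokens from the CongoCC-generated lexer into IntelliJ's
// incremental-lexer interface.
public class SchemaLexerAdapter extends LexerBase {
    private CharSequence buffer;
    private int startOffset;
    private int endOffset;
    private SchemaLexer lexer;  // CongoCC-generated lexer (name simplified)
    private Token current;      // current token, or null at end of input

    @Override
    public void start(CharSequence buffer, int startOffset, int endOffset, int initialState) {
        // The generated lexer is not incremental, so every call to start()
        // creates a fresh lexer and pretends we got an entirely new document.
        this.buffer = buffer;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.lexer = new SchemaLexer(buffer.subSequence(startOffset, endOffset).toString());
        this.current = lexer.getNextToken();
    }

    @Override
    public int getState() {
        return 0; // we keep no lexer state between calls
    }

    @Override
    public IElementType getTokenType() {
        // Returning null tells IntelliJ we have reached the end of the buffer.
        return current == null ? null : mapTokenType(current);
    }

    // Token offsets must be relative to the full buffer, so we add back
    // the start offset we stripped when creating the lexer.
    @Override
    public int getTokenStart() {
        return startOffset + current.getBeginOffset();
    }

    @Override
    public int getTokenEnd() {
        return startOffset + current.getEndOffset();
    }

    @Override
    public void advance() {
        current = lexer.getNextToken();
    }

    @Override
    public CharSequence getBufferSequence() {
        return buffer;
    }

    @Override
    public int getBufferEnd() {
        return endOffset;
    }

    private IElementType mapTokenType(Token token) {
        // Translate the CongoCC token type into an IntelliJ element type,
        // which the highlighter then maps to colors (hypothetical mapping).
        return SchemaElementTypes.forToken(token);
    }
}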

Results

The result of our work is a language server supporting most of the features we set out to implement at the beginning. The main client we worked on comes in the form of a VSCode extension. In the GIF below, you can see a demonstration of some of the features.

Demo GIF

Neovim plugin

Did we mention that the language server works in Neovim? Neovim has an LSP client built in, which means it can communicate with our language server. All you need to do is download the language server and add an attach script for the appropriate file types in your init.lua. Instructions can be found at the release link 🚀.

Limitations and future work

Even though the language server has simplified writing schema files significantly, some features are still missing. For example, certain errors in rank expressions are not detected by the language server, meaning users may only discover them when preparing or deploying the application. One example is attempting to fetch an attribute from a field that does not have the attribute indexing type set, a mistake that currently goes unnoticed by the server.

Moreover, the language server does not yet support multiple workspaces, which can lead to issues if editors in different workspaces rely on the same language server. This limitation is particularly problematic when a workspace keeps .profile files in a separate folder, which can cause the server to display errors in valid schemas and struggle to identify correct symbol relationships.

Additionally, there are several features that would greatly enhance the language server. For instance, better integration with services.xml would allow for automatic file updates when editing schema files. Support for formatting requests would ensure uniformity in schema files, making them easier to read and manage.

Lastly, adding support for the Vespa Query Language is another milestone to reach. This could be implemented as another language server, ideally one that is aware of the current deployment so it can provide completion. Running queries from within the IDE could also be enabled through the code lens feature in LSP. This would simplify the development of Vespa applications.

Our experience at Vespa

Our experience at Vespa.ai provided us with a deep understanding of the architecture of language servers and the intricacies of parsing Vespa schemas. Additionally, we gained valuable insights into the dynamics and working conditions within a tech start-up environment.

Contributing to a large open-source project like Vespa was both exciting and challenging. Initially, we found the scale of the project overwhelming, but after working through some getting-started tutorials and engaging in a bit of trial and error, we were able to identify the parts of the project most relevant to us. Whenever we were stuck, we always had someone to guide us, which made our time at Vespa not only productive but also enjoyable. We extend our sincere thanks to all our colleagues, with special recognition to Kristian Aune, Øyvind Grønnesby, and Arne Henrik Juul for their daily stand-ups and continuous support.