Syntax Highlighting

from [HN]

Sourcegraph dev here :) At Sourcegraph we’ve tried a lot of different syntax highlighters in the past. Syntax highlighting is a really hard problem. In basically all editors today it is ultimately provided by TextMate language grammars[1] (.tmLanguage files), which are essentially giant regex definition files. These are the ubiquitous files for syntax highlighting, and you can almost always find a TextMate grammar for any programming language.
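To make "giant regex definition files" concrete, here is a minimal Python sketch of rule-driven highlighting. The rule table and the `tokenize` helper are hypothetical and far simpler than a real .tmLanguage file; the scope names only imitate TextMate naming conventions. At each position every rule is tried, and the earliest match wins, with ties going to the rule listed first.

```python
import re

# Hypothetical, heavily simplified rules: (scope name, compiled regex).
# A real TextMate grammar carries far more rules, plus nesting.
RULES = [
    ("comment.line", re.compile(r"#.*")),
    ("string.quoted.double", re.compile(r'"(?:\\.|[^"\\])*"')),
    ("keyword.control", re.compile(r"\b(?:if|else|return)\b")),
    ("constant.numeric", re.compile(r"\b\d+\b")),
]


def tokenize(line):
    """Return (scope, text) pairs for one line; unmatched text gets scope None."""
    tokens = []
    pos = 0
    while pos < len(line):
        best = None
        for scope, pattern in RULES:
            m = pattern.search(line, pos)
            # Keep the earliest match; on a tie the earlier rule stays.
            if m and (best is None or m.start() < best[1].start()):
                best = (scope, m)
        if best is None:
            # Nothing else matches: the rest of the line is plain text.
            tokens.append((None, line[pos:]))
            break
        scope, m = best
        if m.start() > pos:
            tokens.append((None, line[pos:m.start()]))
        tokens.append((scope, m.group()))
        pos = m.end()
    return tokens
```

Calling `tokenize('return 42 # done')` splits the line into a keyword, a number and a comment, with the whitespace in between left unscoped.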

Some ‘modern’ syntax highlighters, mostly ones designed for use on the web, try not to use these TextMate files – but in doing so they lose support for TONS of languages, and the languages that they do support are often very buggy / incorrect because the grammars are simply not that well tested (nobody is testing them in their editor daily).

The annoyance with TextMate grammars is that they are based on Oniguruma regular expressions, which means that while they are written in a very powerful regex syntax, you cannot really use them with the regex engines built into most languages today. Almost all editors directly use the original C implementation of Oniguruma, even editors like Visual Studio Code which are based primarily on web tech.
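The incompatibility can be illustrated with Python's built-in `re` module standing in for "a regex engine built into a language". The patterns below use constructs that, to my knowledge, Oniguruma accepts; the selection and the helper name `unsupported_by_builtin_engine` are my own assumptions for illustration, not taken from any particular grammar.

```python
import re

# Constructs that Oniguruma accepts (as far as I know) but Python's `re` rejects.
ONIGURUMA_STYLE_PATTERNS = {
    "named group without P": r"(?<word>\w+)",
    "hex-digit shorthand": r"\h+",
    "match-start anchor": r"\G\s*",
    "subexpression call": r"(\((?:[^()]|\g<1>)*\))",
}


def unsupported_by_builtin_engine(patterns):
    """Return {label: error message} for patterns that `re` refuses to compile."""
    failures = {}
    for label, pattern in patterns.items():
        try:
            re.compile(pattern)
        except re.error as exc:
            failures[label] = str(exc)
    return failures


if __name__ == "__main__":
    for label, message in unsupported_by_builtin_engine(ONIGURUMA_STYLE_PATTERNS).items():
        print(f"{label}: {message}")
```

Rewriting each such pattern by hand for another engine is possible in small cases, but it does not scale to the number of grammars in circulation, which is presumably why editors embed Oniguruma itself.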

And even once you have a working Oniguruma regex engine, it is still very non-trivial to use it for highlighting actual code because the TextMate files themselves are pretty complex on their own.
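Part of that complexity is that grammar rules are not just flat patterns: TextMate grammars also have begin/end rules that open a region, apply nested patterns inside it, and can stay open across lines. The sketch below is a rough Python approximation with a single hypothetical rule (`STRING_RULE`) and hypothetical scope names; a real implementation also has to handle captures, includes, and a whole stack of open rules.

```python
import re

# Hypothetical mini-grammar: one begin/end rule for double-quoted strings,
# with a nested escape rule that only applies while the string is open.
STRING_RULE = {
    "scope": "string.quoted.double",
    "begin": re.compile(r'"'),
    "end": re.compile(r'"'),
    "patterns": [("constant.character.escape", re.compile(r"\\."))],
}


def highlight(lines):
    """Yield (line_no, scope, text) for string delimiters and escapes."""
    inside = False  # state carried from one line to the next
    for line_no, line in enumerate(lines, start=1):
        pos = 0
        while pos < len(line):
            if not inside:
                m = STRING_RULE["begin"].search(line, pos)
                if m is None:
                    break
                inside = True
                yield (line_no, STRING_RULE["scope"] + ".begin", m.group())
                pos = m.end()
                continue
            # Inside the region, nested rules and the end pattern compete;
            # the earliest match wins.
            candidates = [(s, p.search(line, pos)) for s, p in STRING_RULE["patterns"]]
            candidates.append((STRING_RULE["scope"] + ".end", STRING_RULE["end"].search(line, pos)))
            candidates = [(s, m) for s, m in candidates if m]
            if not candidates:
                break
            scope, m = min(candidates, key=lambda c: c[1].start())
            yield (line_no, scope, m.group())
            pos = m.end()
            if scope.endswith(".end"):
                inside = False
```

Because the escape rule matches `\"` before the end pattern can see the quote, an escaped quote does not close the string, and the `inside` flag lets the region continue onto the next line.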

To make matters more interesting, Sublime created their own language grammar format (.sublime-syntax files), which is a superset of the TextMate language grammars and supports more features. Sublime can convert any TextMate language grammar into the new .sublime-syntax format.

So when I started seriously looking into syntax highlighting about a year ago, the Pygments syntax highlighter was my first choice. It is pretty good, but at the time I found it lacked support for languages we cared about, such as TypeScript. Additionally, Pygments was pretty slow when we tried it out. GitHub switched away from it for a similar reason[2].

This was about the time that I ran into the Syntect[3] project. For us, this had huge benefits:

  • It was by far the fastest syntax highlighter we found, which meant page load times improved a lot for us.

  • Its results were really, really good. I found multiple cases where the output was better than other highlighters such as Pygments.

  • The built-in language support, covering every language Sublime Text 3 supports by default, was great out of the box.

  • Because Sublime can convert TextMate grammars to .sublime-syntax grammars, we ultimately get support for basically all languages.

  • It took me one day to write syntect_server[4] which highlights arbitrary code for our Go application, which was a spectacular turnaround time given how much time I’d already invested elsewhere in researching all of this.

Overall, we really couldn’t be more pleased with Syntect and Rust for modern syntax highlighting. It’s a beautiful, fast, high-quality combination.

[1] https://macromates.com/manual/en/language_grammars

[2] https://github.com/github/linguist/issues/1717#issuecomment-…

[3] https://github.com/trishume/syntect

[4] https://github.com/sourcegraph/syntect_server

Written on September 7, 2018, Last update on September 7, 2018
parser