Treetop (parser)

Treetop is a Ruby library that allows you to create parsers easily by describing them using a Parsing Expression Grammar (PEG). - A quick intro / github

Home (still alive)

performance are good compared to other
Performance of parsers: PEG vs LALR(1) or LL(k) - PEG parsers can use unlimited lookahead (while maintaining linear parse time on average, via packrat) unlike (default) LL(k), or LR(k) parsers which use limited lookahead, while maintining linear parse time.

Tutorials

They are 3 kind of objects to implements:

the Parser itself, by levaring treetop, that where we will have a parse method.
the Grammar, which will layout how we will parse and build the tree
the Node extension classes, which is optional, but will allows to customize the AST tree to host pertient information regarding what has been parsed.

Output

the Tree - if parsing succeed, this will be your result.

The Parser

It would look like

# In file parser.rb
require 'treetop'

# Load our custom syntax node classes so the parser can use them
require_relative 'node_extensions.rb'

class Parser
  
  # Load the Treetop grammar from the 'python3_parser.treetop' file -> produce <grammar name>Parser
  # and then create a new instance of that parser as a class variable so we don't have to re-create
  # it every time we need to parse a string
  Treetop.load( 'python3_parser.treetop')
  @@parser = Python3Parser.new
  tree = @@parser.parse("some compliant code") # return -> tree if parsed or nil 
  
end

It require a least one rule (the root node), to make the above code work: The first rule becomes the root of the grammar, causing its expression to be matched when a parser for the grammar is fed a string.

# python3_parser.treetop
grammar Python3
  rule hello
    'hello chomsky'
  end
end

Contrary to Lex/yacc approach, here the grammar directly integrate the lexer. It works directly on the stream of characters.

Notes

only base expression are considered terminals (like “hello”, [Az]*) each one them will lead to a Node being created.
fortunately Treetop now support regexp / 2 as terminals, so it can helps to reduce number of nodes created. eg: using "([\t ]*\n)+"r to replace ([\t ]* "\n") in a rule (otherwise you got 3 nodes created with the latter: 1 parent <-> 2 children.
how to handle spaces? - because there is no lexer, you need to includes space in your grammar
- It’s not enough to define the space rule, you have to use it anywhere there might be space
- Because this occurs often, I usually use a shorter rule name S for mandatory space, and the lowercase version s for optional space. - cliffordheath

The Nodes

The parser run by Treetop follow the grammar rules and build a Tree of Treetop::Runtime::SyntaxNode. The grammar allows to instanciate differente kind of Node if you like to. To do it, you just need to provide the class in bracket inside the rule, like so:

# python3_parser.treetop
grammar Python3
   rule a
     something <A>
   end
end

Then in node_extensions.rb

# node_extensions.rb

class A < Treetop::Runtime::SyntaxNode
    def initialize( input, interval, elements = nil)
        super( input, interval, elements)
    end
end

Notes

the SyntaxNode has some usefull methods like terminal?
Ruby being ruby, you can extend Treetop::Runtime::SyntaxNode, and even their instance (using the block mechanism supported by the grammar)
see below to advice about using module to provide these Node class.

Inline Method Definition

So rather than instanciating new Nodes, you can overide default method or add custom members directly to the SyntaxNode like so:

# python3_parser.treetop
rule abc
  a b c {
    # overide the to_s method of the instanciated nodes of abc rule
    def to_s   
      a.to_s + b.to_s + c.to_s
    end
  }
end

Labels

Subexpressions can be given an explicit label to have an element accessor method defined for them. The labeled expressions could have been extracted to their own rules, but if they aren’t used elsewhere, labels still enable them to be referenced by a name within the expression’s methods.

# python3_parser.treetop
rule labels
  first_letter:[a-z] rest_letters:(', ' letter:[a-z])* {
    def letters
      [first_letter] + rest_letters.elements.map do |comma_and_letter|
        comma_and_letter.letter
      end
    end
  }
end

You can then override the element accessor produced by label with access to the super keyword.

# python3_parser.treetop
rule labels
  first_letter:[a-z] rest_letters:(', ' letter:[a-z])* {
    def letters
      [first_letter] + rest_letters
    end

    def rest_letters
      super.elements.map { |comma_and_letter| comma_and_letter.letter }
    end
  }
end

The Tree

TBD

Debuging

Enable error messages: Check parser.failure_reason and parser.failure_line if parsing fails. You can even implement something like that (from Treetop: Typical Errors† )

  parser.failure_reason =~ /^(Expected .+) after/m
  puts "ERROR: #{$1.gsub("\n", '$NEWLINE')}:"
  puts input.lines.to_a[ parser.failure_line - 1]
  puts "#{'~' * ( parser.failure_column - 1)}^"

Which will output, the exact point where grammar diverge from input:

# python3_parser.treetop
ERROR: Expected "(" at line 1, column 8 (byte 8):
def foo[]):
~~~~~~~^

When tree is build, you can also inspect it

# python3_parser.treetop
puts tree    # will iterate on to_s
p    tree    # will iterate on inspect

Parsed successfully!
SyntaxNode+Program0 offset=0, "def foo():\n":
  SyntaxNode offset=0, ""
  SyntaxNode offset=0, "def foo():\n":
    FuncDef+Funcdef0 offset=0, "def foo():\n" (s1,s2,name,S,s3):
      SyntaxNode offset=0, "def"
      SyntaxNode offset=3, " "
      Name+Name0 offset=4, "foo" (s):
        SyntaxNode offset=4, "foo"
        SyntaxNode offset=7, ""
      SyntaxNode offset=7, "("
      SyntaxNode offset=8, ""
      SyntaxNode offset=8, ")"
      SyntaxNode offset=9, ""
      SyntaxNode offset=9, ":"
      SyntaxNode offset=10, ""
      SyntaxNode offset=10, "\n"

Method	Output Style	Uses Which Method?	Adds Newline?	Purpose
`puts node`	Human-readable string	Calls `to_s`	Yes	For clean user-facing output
`p node`	Debug-style (raw Ruby syntax)	Calls `inspect`	Yes	For debugging and inspecting objects

Doc

Turning a parse tree into an abstract syntax tree (AST)
- filtering parse tree to get an ast
  - Treetop’s Confusing Behavior
  - class vs module - use rule ( subrule1 / subrule2 ) <class object> see 1
    - treetop will inject several definition so using module is more general and adviced, see 2

rule foo; 	a <A_In_Foo> / b <B_In_Foo>; end

rule a; 	something <A>;	end

rule b;	something_else <B>; end

In the case where “foo” matches “something_else”, the SyntaxNode for “something_else” will have two mixed-in modules (both “B” and “B_In_Foo”). Use parentheses around a list of alternates if you want to mix in the same module regardless of which possibility is matched.

Treetop: Typical Errors
- error reporting à la clang
- Matching whitespaces
- prevent AST squashing: use 1..1
Doc
- Semantic Interpretation
  - Methods Available on Treetop::Runtime::SyntaxNode - elements elements of the matched sequence
  - labels: for subexpression, like citrus
- Syntactic Recognition
  - c like comment grammar
- keywords - Striping spaces in a SQL parser
  - keywords must not be immediately followed by a symbol character. I use negative lookahead for this.
  - The top rule consumes leading whitespace, then almost every rule consumes following whitespace (this is my policy). You should test your grammar with minimum and maximum whitespace.
- Advanced Techniques
  - Matching a Keyword But Not Words Prefixed Therewith
Debugging
- using in ruby
  - parser options: start from a different rule / continue
  - debugging option
  - AST guideline
  - Advanced Grammar debugging
    - Generating grammar, and instrumenting code
google groups (2018)
- custom data for syntax node => his is a significant weakness in the design of Treetop. I made a custom proxy for the “input” and attached my Context to it. You might get away with doing something similar
Ruby Conference 2007 Treetop: Syntactic Analysis with Ruby by Nathan Sobo

Some treetop example

Clifford Heath CQL
my sql grammar POC sql
A “grue”some look at Statemachine and Treetop

Advanced Techniques

Semantic Predicates

Sometimes, you need to run external Ruby code to decide whether this syntax rule should continue or should fail. You can do this using either positive or negative semantic predicates. These are Ruby code blocks (lambdas) which are called when the parser reaches that location.

rule reserved
  word &{ |s| symbol_reserved?(s[0].text_value) } # positive predicate
  word !{ |s| symbol_reserved?(s[0].text_value) } # negative predicate
end

Notes

the semantic predicate &{ ... } syntax while close to the Node extension { ... } is fundamentally different in behavior.

Alternatives

Why you should not use (f)lex, yacc and bison (from ANTLR)

Lex and Yacc were the first popular and efficient lexers and parsers generators, flex and Bison were the first widespread open-source versions compatible with the original software. Each of these software has more than 30 years of history, which is an achievement in itself.

Stability and Development of New Features - they are stable and maintained software but there is no active development.
Separation between Grammar and Code - flex and Bison maintain an old-school design with little support for readability or productivity.
Unicode Support - flex does not directly support Unicode
Grammar Format - Bison only supports BNF, which makes grammars more complicated
Features of Lexing Algorithms - flex supports regular expressions to define rules, which works for most elements, but adds complexity
Features of Parsing Algorithms - Bison supports two parsing algorithms that cover all ranges of performance and languages. It gives cryptic error messages
parslet
- compared to other - uses the PEG formalism for its grammars
- Get started
- Stack Traces in Parslet
whittle ? - A little LALR(1) Parser in Pure Ruby
- Whittle: the new, powerful, easy-to-use Parser on the ruby block (all ruby, all the time, based on the LALR(1) algorithm)
rexical / racc
- example sql-parser - sqlite-parser (js)

Written on July 16, 2020, Last update on June 27, 2025

ruby parser AST

#4577

Treetop (parser)

see also

Home (still alive)

Tutorials

The Parser

The Grammar

The Nodes

The Tree

Debuging

Doc

Some treetop example

Advanced Techniques

Semantic Predicates

Alternatives