Treetop (parser)

Treetop is a Ruby library that allows you to create parsers easily by describing them using a Parsing Expression Grammar (PEG). - A quick intro / github

treetop logo

see also

Home (still alive)

Tutorials

They are 3 kind of objects to implements:

  • the Parser itself, by levaring treetop, that where we will have a parse method.
  • the Grammar, which will layout how we will parse and build the tree
  • the Node extension classes, which is optional, but will allows to customize the AST tree to host pertient information regarding what has been parsed.

Output

  • the Tree - if parsing succeed, this will be your result.

The Parser

It would look like

# In file parser.rb
require 'treetop'

# Load our custom syntax node classes so the parser can use them
require_relative 'node_extensions.rb'

class Parser
  
  # Load the Treetop grammar from the 'python3_parser.treetop' file -> produce <grammar name>Parser
  # and then create a new instance of that parser as a class variable so we don't have to re-create
  # it every time we need to parse a string
  Treetop.load( 'python3_parser.treetop')
  @@parser = Python3Parser.new
  tree = @@parser.parse("some compliant code") # return -> tree if parsed or nil 
  
end

The Grammar

It require a least one rule (the root node), to make the above code work: The first rule becomes the root of the grammar, causing its expression to be matched when a parser for the grammar is fed a string.

# python3_parser.treetop
grammar Python3
  rule hello
    'hello chomsky'
  end
end

Contrary to Lex/yacc approach, here the grammar directly integrate the lexer. It works directly on the stream of characters.

Notes

  • only base expression are considered terminals (like “hello”, [Az]*) each one them will lead to a Node being created.
  • fortunately Treetop now support regexp / 2 as terminals, so it can helps to reduce number of nodes created. eg: using "([\t ]*\n)+"r to replace ([\t ]* "\n") in a rule (otherwise you got 3 nodes created with the latter: 1 parent <-> 2 children.
  • how to handle spaces? - because there is no lexer, you need to includes space in your grammar
    • It’s not enough to define the space rule, you have to use it anywhere there might be space
    • Because this occurs often, I usually use a shorter rule name S for mandatory space, and the lowercase version s for optional space. - cliffordheath

The Nodes

The parser run by Treetop follow the grammar rules and build a Tree of Treetop::Runtime::SyntaxNode. The grammar allows to instanciate differente kind of Node if you like to. To do it, you just need to provide the class in bracket inside the rule, like so:

# python3_parser.treetop
grammar Python3
   rule a
     something <A>
   end
end

Then in node_extensions.rb

# node_extensions.rb

class A < Treetop::Runtime::SyntaxNode
    def initialize( input, interval, elements = nil)
        super( input, interval, elements)
    end
end

Notes

  • the SyntaxNode has some usefull methods like terminal?
  • Ruby being ruby, you can extend Treetop::Runtime::SyntaxNode, and even their instance (using the block mechanism supported by the grammar)
  • see below to advice about using module to provide these Node class.

Inline Method Definition

So rather than instanciating new Nodes, you can overide default method or add custom members directly to the SyntaxNode like so:

# python3_parser.treetop
rule abc
  a b c {
    # overide the to_s method of the instanciated nodes of abc rule
    def to_s   
      a.to_s + b.to_s + c.to_s
    end
  }
end

Labels

Subexpressions can be given an explicit label to have an element accessor method defined for them. The labeled expressions could have been extracted to their own rules, but if they aren’t used elsewhere, labels still enable them to be referenced by a name within the expression’s methods.

# python3_parser.treetop
rule labels
  first_letter:[a-z] rest_letters:(', ' letter:[a-z])* {
    def letters
      [first_letter] + rest_letters.elements.map do |comma_and_letter|
        comma_and_letter.letter
      end
    end
  }
end

You can then override the element accessor produced by label with access to the super keyword.

# python3_parser.treetop
rule labels
  first_letter:[a-z] rest_letters:(', ' letter:[a-z])* {
    def letters
      [first_letter] + rest_letters
    end

    def rest_letters
      super.elements.map { |comma_and_letter| comma_and_letter.letter }
    end
  }
end

see also

The Tree

TBD

Debuging

Enable error messages: Check parser.failure_reason and parser.failure_line if parsing fails. You can even implement something like that (from Treetop: Typical Errors† )

  parser.failure_reason =~ /^(Expected .+) after/m
  puts "ERROR: #{$1.gsub("\n", '$NEWLINE')}:"
  puts input.lines.to_a[ parser.failure_line - 1]
  puts "#{'~' * ( parser.failure_column - 1)}^"     

Which will output, the exact point where grammar diverge from input:

# python3_parser.treetop
ERROR: Expected "(" at line 1, column 8 (byte 8):
def foo[]):
~~~~~~~^

When tree is build, you can also inspect it

# python3_parser.treetop
puts tree    # will iterate on to_s
p    tree    # will iterate on inspect
Parsed successfully!
SyntaxNode+Program0 offset=0, "def foo():\n":
  SyntaxNode offset=0, ""
  SyntaxNode offset=0, "def foo():\n":
    FuncDef+Funcdef0 offset=0, "def foo():\n" (s1,s2,name,S,s3):
      SyntaxNode offset=0, "def"
      SyntaxNode offset=3, " "
      Name+Name0 offset=4, "foo" (s):
        SyntaxNode offset=4, "foo"
        SyntaxNode offset=7, ""
      SyntaxNode offset=7, "("
      SyntaxNode offset=8, ""
      SyntaxNode offset=8, ")"
      SyntaxNode offset=9, ""
      SyntaxNode offset=9, ":"
      SyntaxNode offset=10, ""
      SyntaxNode offset=10, "\n"
Method Output Style Uses Which Method? Adds Newline? Purpose
puts node Human-readable string Calls to_s Yes For clean user-facing output
p node Debug-style (raw Ruby syntax) Calls inspect Yes For debugging and inspecting objects

Doc

rule foo; 	a <A_In_Foo> / b <B_In_Foo>; end

rule a; 	something <A>;	end

rule b;	something_else <B>; end

In the case where “foo” matches “something_else”, the SyntaxNode for “something_else” will have two mixed-in modules (both “B” and “B_In_Foo”). Use parentheses around a list of alternates if you want to mix in the same module regardless of which possibility is matched.

caption

Some treetop example

Advanced Techniques

Semantic Predicates

Sometimes, you need to run external Ruby code to decide whether this syntax rule should continue or should fail. You can do this using either positive or negative semantic predicates. These are Ruby code blocks (lambdas) which are called when the parser reaches that location.

rule reserved
  word &{ |s| symbol_reserved?(s[0].text_value) } # positive predicate
  word !{ |s| symbol_reserved?(s[0].text_value) } # negative predicate
end

Notes

  • the semantic predicate &{ ... } syntax while close to the Node extension { ... } is fundamentally different in behavior.

Alternatives

Why you should not use (f)lex, yacc and bison (from ANTLR)

Lex and Yacc were the first popular and efficient lexers and parsers generators, flex and Bison were the first widespread open-source versions compatible with the original software. Each of these software has more than 30 years of history, which is an achievement in itself.

Written on July 16, 2020, Last update on June 20, 2025
ruby parser AST