Step 1: Tokenization
The first step in processing an expression is to convert it into a list of individual symbols. This step is simple and not the focus of this article, so I've omitted much of it here.
First, I define a map of operator tokens (numbers are not listed here; 'NUM' is the default token name) and a Token type:
import re
from collections import namedtuple

token_map = {'+':'ADD', '-':'ADD', '*':'MUL', '/':'MUL', '(':'LPAR', ')':'RPAR'}
Token = namedtuple('Token', ['name', 'value'])
Here's the code I used to tokenize the expression `expr`:
split_expr = re.findall('[\d.]+|[%s]' % ''.join(map(re.escape, token_map)), expr)
tokens = [Token(token_map.get(x, 'NUM'), x) for x in split_expr]
The first line is a regular-expression trick that splits the expression into its basic tokens, so that
'1.2 / ( 11+3)' --> ['1.2', '/', '(', '11', '+', '3', ')']
The next line attaches a name to each token, so that the parser can recognize them by category:
['1.2', '/', '(', '11', '+', '3', ')'] --> [Token(name='NUM', value='1.2'), Token(name='MUL', value='/'), Token(name='LPAR', value='('), Token(name='NUM', value='11'), Token(name='ADD', value='+'), Token(name='NUM', value='3'), Token(name='RPAR', value=')')]
Any token not in token_map is assumed to be a number. Our tokenizer does no validation, so non-numbers can slip through, but fortunately the evaluator will catch them later: something like '1..2' is happily labeled NUM here, and only fails in step 5, when float('1..2') raises a ValueError.
That's it.
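To make this step self-contained, here are the pieces above bundled into a single function (the tokenize name is my own; the final listing in the appendix inlines this code into calc):

import re
from collections import namedtuple

Token = namedtuple('Token', ['name', 'value'])
token_map = {'+':'ADD', '-':'ADD', '*':'MUL', '/':'MUL', '(':'LPAR', ')':'RPAR'}

def tokenize(expr):
    # match either a number ([\d.]+) or any single operator/paren; whitespace falls through
    split_expr = re.findall('[\d.]+|[%s]' % ''.join(map(re.escape, token_map)), expr)
    return [Token(token_map.get(x, 'NUM'), x) for x in split_expr]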
Step 2: Grammar Definition
The parser I chose to implement is a naive recursive-descent parser, which is a simpler version of the LL parser. It's one of the simplest parsers to implement; in fact, mine is only 14 lines of code. It's a top-down parser, which means it starts from the topmost rule (like: expression) and recursively tries to expand it into sub-rules, until it reaches the bottommost rules (like: number). To put it the other way around, a bottom-up (LR) parser progressively folds tokens and rules into other rules until only one rule is left, while a top-down (LL) parser progressively unfolds the rules, moving from abstract rules toward the concrete, until it can match the input tokens exactly.
Before we dive into the actual parser implementation, let's talk about the grammar. In my previous post I used an LR parser, and I defined the calculator grammar like this (tokens are uppercase):
add: add ADD mul | mul;
mul: mul MUL atom | atom;
atom: NUM | '(' add ')' | neg;
neg: '-' atom;
(If you don't understand this syntax, read my previously published article.)
This time, I define the grammar for the LL parser as follows:
rule_map = {
    'add' : ['mul ADD add', 'mul'],
    'mul' : ['atom MUL mul', 'atom'],
    'atom': ['NUM', 'LPAR add RPAR', 'neg'],
    'neg' : ['ADD atom'],
}
As you can see, there is a subtle change here: the recursive definitions of add and mul have been reversed, from left-recursive to right-recursive. This is a very important detail, so let me explain. Each rule maps to a list of expansions to try in order; 'mul ADD add', for instance, means: match a mul, then an ADD token, then another add.
The LR version of the grammar uses left recursion. When an LL parser encounters a left-recursive rule, it re-enters that same rule without consuming a single token, so it goes into infinite recursion. Even clever LL parsers such as ANTLR can't escape this problem; they just replace the endless recursion with a friendly error message, unlike our toy parser.
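To see the problem concretely, here's a sketch of what the match() function from step 3 would do with a hypothetical left-recursive rule 'add': ['add ADD mul', 'mul']:

match('add', tokens)
 -> try expansion 'add ADD mul'
    -> match('add', tokens)        # same rule, same token list: nothing was consumed
       -> try expansion 'add ADD mul'
          -> match('add', tokens)
             -> ...                # never terminates; Python eventually raises
                                   # "maximum recursion depth exceeded"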
Left recursion can easily be turned into right recursion, and that's what I did. But the parser isn't that simple, and right recursion creates a new problem: while left recursion correctly parses 3-2-1 as (3-2)-1 = 0, right recursion incorrectly parses it as 3-(2-1) = 2. I couldn't think of a simple fix within the grammar, so to keep things simple I decided to let it parse in the wrong form and deal with the problem later (see step 4).
Step 3: Parse to an AST
The algorithm is actually quite simple. We'll define a recursive function that takes two arguments: the first is the name of the rule we're trying to match, and the second is the list of tokens we have left. We'll start with add (the topmost rule) and the full list of tokens, and the recursive calls follow naturally from there. The function returns a pair: the current match, and the list of tokens left to match. To keep the code short, we'll make the same function handle token matching too (rule names and token names are both strings; token names are uppercase, rule names lowercase).
Here is the code for the parser implementation:
RuleMatch = namedtuple('RuleMatch', ['name', 'matched'])

def match(rule_name, tokens):
    if tokens and rule_name == tokens[0].name:      # Match a token?
        return tokens[0], tokens[1:]
    for expansion in rule_map.get(rule_name, ()):   # Match a rule?
        remaining_tokens = tokens
        matched_subrules = []
        for subrule in expansion.split():
            matched, remaining_tokens = match(subrule, remaining_tokens)
            if not matched:
                break   # no such luck. next expansion!
            matched_subrules.append(matched)
        else:
            return RuleMatch(rule_name, matched_subrules), remaining_tokens
    return None, None   # match not found
The first two lines of the function body handle token matching: if rule_name is actually a token name and it matches the first token in the list, the function returns that token and the rest of the token list.
The outer for loop iterates over the expansions of the rule and tries to match each one by recursion. If rule_name is a token name rather than a rule, rule_map.get() returns an empty tuple, the loop body never runs, and the function falls through to the final return None, None (no match found).
The inner loop walks through the sub-rules of the current expansion and tries to match them in sequence, consuming as many tokens as it can. If any sub-rule fails to match, we discard the entire expansion and move on to the next one. But if all the sub-rules match, we reach the else clause (which runs when the loop finishes without a break) and return a RuleMatch for rule_name, along with the remaining tokens.
Now let's run it on 1.2/(11+3) and look at the result:
>>> tokens = [Token(name='NUM', value='1.2'), Token(name='MUL', value='/'), Token(name='LPAR', value='('), Token(name='NUM', value='11'), Token(name='ADD', value='+'), Token(name='NUM', value='3'), Token(name='RPAR', value=')')]
>>> match('add', tokens)
(RuleMatch(name='add', matched=[RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='NUM', value='1.2')]), Token(name='MUL', value='/'), RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='LPAR', value='('), RuleMatch(name='add', matched=[RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='NUM', value='11')])]), Token(name='ADD', value='+'), RuleMatch(name='add', matched=[RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='NUM', value='3')])])])]), Token(name='RPAR', value=')')])])])]), [])
The result is a pair: the match, and an empty list showing that there are no remaining tokens. The match itself isn't easy to read, so let me draw it as a tree:
add
    mul
        atom
            NUM '1.2'
        MUL '/'
        mul
            atom
                LPAR '('
                add
                    mul
                        atom
                            NUM '11'
                    ADD '+'
                    add
                        mul
                            atom
                                NUM '3'
                RPAR ')'
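I drew the indented view above by hand, but a tiny helper of my own (not part of the final 70-line listing) can render any parse tree this way:

def print_tree(node, indent=0):
    if node.name in rule_map:       # a rule node: print its name, then recurse
        print(' ' * indent + node.name)
        for child in node[1]:
            print_tree(child, indent + 4)
    else:                           # a token: print its name and value
        print(' ' * indent + '%s %r' % (node.name, node[1]))

Calling print_tree(match('add', tokens)[0]) reproduces the picture above.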
That's the conceptual AST. It's a good exercise to trace how the parser built it, either in your head or on paper. I wouldn't say that's strictly necessary, unless you want to master parsers, but walking the AST yourself will help you get the evaluation algorithm right.
So far, we have a parser that correctly handles binary operations, unary operations, parentheses, and operator precedence.
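For instance, precedence falls out of the grammar for free: parsing 1+2*3 puts 2*3 inside its own mul rule, one level below the addition (sketched by hand, same notation as above):

add
    mul
        atom
            NUM '1'
    ADD '+'
    add
        mul
            atom
                NUM '2'
            MUL '*'
            mul
                atom
                    NUM '3'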
There is just one bug left to fix, and we'll do that in the next step.
Step 4: Post-processing
My parser doesn't work in every situation. Most importantly, it can't handle left recursion, which forced me to write the grammar as right-recursive. As a result, parsing the expression 8/4/2 produces the following AST:
add
    mul
        atom
            NUM '8'
        MUL '/'
        mul
            atom
                NUM '4'
            MUL '/'
            mul
                atom
                    NUM '2'
If we tried to compute the result from this AST, we would evaluate 4/2 first, which is of course wrong. Some LL parsers choose to fix the associativity inside the tree. That takes too many lines of code ;). Instead, we're going to flatten it. The algorithm is simple: for each rule in the AST that 1) needs fixing, 2) is a binary operation (has three sub-rules), and 3) has a right-hand operand that is the same rule: flatten the latter into the former. By "flattening", I mean replacing a node with its children, in the context of its parent. Since our traversal is a post-order DFS, meaning it starts at the edges of the tree and works toward the root, the effect accumulates. Below is the code:
fix_assoc_rules = 'add', 'mul'

def _recurse_tree(tree, func):
    # apply func to every child of a rule node; for a token, just return its value
    return map(func, tree.matched) if tree.name in rule_map else tree[1]

def flatten_right_associativity(tree):
    new = _recurse_tree(tree, flatten_right_associativity)
    if tree.name in fix_assoc_rules and len(new)==3 and new[2].name==tree.name:
        new[-1:] = new[-1].matched   # splice the right child's children into this node
    return RuleMatch(tree.name, new)
This code turns any nested sequence of additions or multiplications into a flat list (in the correct order). Parentheses break the sequence, so of course they are not affected.
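For example, running flatten_right_associativity over the 8/4/2 tree from above collapses the nested mul nodes into one flat rule (sketched by hand):

add
    mul
        atom
            NUM '8'
        MUL '/'
        atom
            NUM '4'
        MUL '/'
        atom
            NUM '2'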
Building on the same machinery, I could instead rebuild the tree with left associativity:
def build_left_associativity(tree):
    new_nodes = _recurse_tree(tree, build_left_associativity)
    if tree.name in fix_assoc_rules:
        while len(new_nodes) > 3:
            new_nodes[:3] = [RuleMatch(tree.name, new_nodes[:3])]
    return RuleMatch(tree.name, new_nodes)
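Note that this pass only kicks in when a node has more than three children, so as I read it, it's meant to run on an already-flattened tree. On the flattened 8/4/2 example it folds the operations back into a left-leaning tree, conceptually:

add
    mul
        mul
            atom
                NUM '8'
            MUL '/'
            atom
                NUM '4'
        MUL '/'
        atom
            NUM '2'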
However, I am not going to do that. I want less code, and changing the evaluation code to handle flat lists takes less of it than rebuilding the whole tree.
Step 5: Evaluation
Evaluating the tree is very simple. All we have to do is traverse it the same way as the post-processing code (i.e., DFS post-order) and evaluate each rule as we go. Thanks to the recursion, by the time we evaluate a rule, each of its sub-rules has already been reduced to a number or an operator. The code is as follows:
from operator import add, sub, mul, div   # Python 2 (operator.div was removed in Python 3)

bin_calc_map = {'*':mul, '/':div, '+':add, '-':sub}

def calc_binary(x):
    # reduce a flat [operand, op, operand, op, ...] list from left to right
    while len(x) > 1:
        x[:3] = [ bin_calc_map[x[1]](x[0], x[2]) ]
    return x[0]

calc_map = {
    'NUM' : float,
    'atom': lambda x: x[len(x)!=1],                 # a lone number, or the middle of 'LPAR add RPAR'
    'neg' : lambda (op,num): (num,-num)[op=='-'],   # negate only if the sign is '-'
    'mul' : calc_binary,
    'add' : calc_binary,
}

def evaluate(tree):
    solutions = _recurse_tree(tree, evaluate)
    return calc_map.get(tree.name, lambda x:x)(solutions)
I use the calc_binary function for additions and subtractions (and their cousins, multiplications and divisions). It evaluates the flattened list of operations left-associatively, which our LL grammar made hard to achieve otherwise.
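Here's the left-to-right reduction in action, on the flattened 8/4/2 operands (the floats are what the NUM rule produces):

>>> calc_binary([8.0, '/', 4.0, '/', 2.0])   # [8.0,'/',4.0,'/',2.0] -> [2.0,'/',2.0] -> [1.0]
1.0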
Step 6: REPL
The plainest REPL:
if __name__ == '__main__':
    while True:
        print( calc(raw_input('> ')) )
Don't make me explain it :)
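For reference, a session looks like this (every result is a float, because NUM maps to float):

> 1 + 2 * 3
7.0
> (1+2) * 3
9.0
> 8/4/2
1.0
> -5 + 10
5.0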
Appendix: Putting it all together: a 70-line calculator
'''A Calculator Implemented With A Top-Down, Recursive-Descent Parser'''
# Author: Erez Shinan, Dec 2012
# NOTE: this is Python 2 code (tuple-parameter lambdas, operator.div, raw_input).

import re, collections
from operator import add, sub, mul, div

Token = collections.namedtuple('Token', ['name', 'value'])
RuleMatch = collections.namedtuple('RuleMatch', ['name', 'matched'])

token_map = {'+':'ADD', '-':'ADD', '*':'MUL', '/':'MUL', '(':'LPAR', ')':'RPAR'}
rule_map = {
    'add' : ['mul ADD add', 'mul'],
    'mul' : ['atom MUL mul', 'atom'],
    'atom': ['NUM', 'LPAR add RPAR', 'neg'],
    'neg' : ['ADD atom'],
}
fix_assoc_rules = 'add', 'mul'

bin_calc_map = {'*':mul, '/':div, '+':add, '-':sub}

def calc_binary(x):
    while len(x) > 1:
        x[:3] = [ bin_calc_map[x[1]](x[0], x[2]) ]
    return x[0]

calc_map = {
    'NUM' : float,
    'atom': lambda x: x[len(x)!=1],
    'neg' : lambda (op,num): (num,-num)[op=='-'],
    'mul' : calc_binary,
    'add' : calc_binary,
}

def match(rule_name, tokens):
    if tokens and rule_name == tokens[0].name:      # Match a token?
        return tokens[0], tokens[1:]
    for expansion in rule_map.get(rule_name, ()):   # Match a rule?
        remaining_tokens = tokens
        matched_subrules = []
        for subrule in expansion.split():
            matched, remaining_tokens = match(subrule, remaining_tokens)
            if not matched:
                break   # no such luck. next expansion!
            matched_subrules.append(matched)
        else:
            return RuleMatch(rule_name, matched_subrules), remaining_tokens
    return None, None   # match not found

def _recurse_tree(tree, func):
    return map(func, tree.matched) if tree.name in rule_map else tree[1]

def flatten_right_associativity(tree):
    new = _recurse_tree(tree, flatten_right_associativity)
    if tree.name in fix_assoc_rules and len(new)==3 and new[2].name==tree.name:
        new[-1:] = new[-1].matched
    return RuleMatch(tree.name, new)

def evaluate(tree):
    solutions = _recurse_tree(tree, evaluate)
    return calc_map.get(tree.name, lambda x:x)(solutions)

def calc(expr):
    split_expr = re.findall('[\d.]+|[%s]' % ''.join(map(re.escape, token_map)), expr)
    tokens = [Token(token_map.get(x, 'NUM'), x) for x in split_expr]
    tree = match('add', tokens)[0]
    tree = flatten_right_associativity(tree)
    return evaluate(tree)

if __name__ == '__main__':
    while True:
        print( calc(raw_input('> ')) )