
Tutorial on parsing in Python using the SimpleParse module

Like most programmers, I often need to identify components and structures that exist within text documents: log files, configuration files, delimited data, and more freely formatted (but still semi-structured) report formats. All of these documents have their own "little language" that dictates what can appear within them.

My approach to writing programs that handle these informal parsing tasks has always been a bit of a hodgepodge of custom state machines, regular expressions, and context-driven string testing. The pattern in these programs was probably always something like, "Read some text, figure out if you can do something with it, then maybe read some more text, and keep trying."

Parsers of various sorts distill the description of a document's parts and structures into concise, clear, declarative rules for how to identify what makes up the document. The declarative aspect is the most striking thing here. All my old ad hoc parsers were imperative in style: read some characters, make a decision, accumulate some variables, rinse, repeat. As the installments of this column on functional programming have observed, this imperative style of program flow is comparatively error-prone and difficult to maintain.

Formal parsers almost always use some variant of Extended Backus-Naur Form (EBNF) to describe the "grammar" of the language they handle. The tools we look at here do so, as does the popular compiler-development tool YACC (and its variants). Basically, an EBNF grammar assigns names to the parts you expect to find in a document; in addition, larger parts are frequently composed of smaller parts. Operators, usually the same symbols you see in regular expressions, specify how often and in what order the smaller parts may occur inside the larger ones. In parser-talk, each named part of the grammar is called a "production".

Even readers who do not know EBNF have probably seen a working EBNF description somewhere. For example, the familiar Python Language Reference defines what a floating-point number looks like in Python:
EBNF-style floating-point number descriptions

floatnumber:    pointfloat | exponentfloat
pointfloat:     [intpart] fraction | intpart "."
exponentfloat:  (nonzerodigit digit* | pointfloat) exponent
intpart:        nonzerodigit digit* | "0"
fraction:       "." digit+
exponent:       ("e"|"E") ["+"|"-"] digit+

Or you may have seen XML DTDs, which are defined in an EBNF style. For example, the <body> of a developerWorks tutorial is defined along these lines:
Description of EBNF Styles in the developerWorks DTD

<!ELEMENT body  ((example-column | image-column)?, text-column) >

The spelling is slightly different, but the general notions of quantification (repetition), alternation, and ordering are present in all EBNF-style language grammars.
Using SimpleParse to build a tag list

SimpleParse is an interesting tool. To use it, you need the underlying module mxTextTools, which implements a "tagging engine" in C. mxTextTools is powerful, but also quite difficult to use directly. With SimpleParse layered on top of mxTextTools, the job becomes much simpler.

Using SimpleParse is genuinely simple, since nearly all of the complexity of mxTextTools stays hidden. The first step is to write an EBNF-style grammar that describes the language to be processed. The second step is to have mxTextTools generate a tag list describing every successful production when the grammar is applied to a document. The final step is to do something useful with the tag list that mxTextTools returns.
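
To make these three steps concrete, here is a minimal sketch of the workflow, using the same calls the full program later in this article relies on (generator.buildParser from simpleparse and TextTools.tag from mx.TextTools); the toy grammar and sample string are ours, purely for illustration:

import sys
from simpleparse import generator
from mx.TextTools import TextTools

# Step 1: an EBNF-style grammar (a toy one, not the "smart ASCII"
# grammar developed later in this article)
decl = r'''
para       := (word / whitespace)+
word       := [a-zA-Z0-9]+
whitespace := [ \t\r\n]+
'''

# Step 2: build a parser (tag table) for the top-level production
# and apply it to some source text
parser = generator.buildParser(decl).parserbyname('para')
taglist = TextTools.tag('hello simple parse', parser)

# Step 3: do something with the resulting tag list
for tag, beg, end, parts in taglist[1]:
    sys.stdout.write('%s %d %d\n' % (tag, beg, end))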

For this article, the "language" we are parsing is the set of markup codes that "smart ASCII" uses to indicate things like boldface, module names, and book titles. This is the same language handled earlier using regular expressions and a state machine. It is far simpler than a full programming language, but complex enough to be representative.

It is worth pausing here to ask: what is the "tag list" that mxTextTools hands back? It is basically a nested structure that simply records the character offsets in the source text where each production matched. mxTextTools traverses the source text quickly, but it does nothing with the text itself (at least not when it is driven by a SimpleParse grammar). Let's examine a simplified tag list:
Tag list generated from the SimpleParse grammar

(1,
 [('plain',
   0,
   15,
   [('word', 0, 4, [('alphanums', 0, 4, [])]),
    ('whitespace', 4, 5, []),
    ('word', 5, 10, [('alphanums', 5, 10, [])]),
    ('whitespace', 10, 11, []),
    ('word', 11, 14, [('alphanums', 11, 14, [])]),
    ('whitespace', 14, 15, [])]),
  ('markup',
   15,
   27,
  ...
 289)

The ellipsis in the middle stands for a batch of further matches, but the part we can see says the following. The root production ("para") succeeds and ends at offset 289 (the length of the source text). The subproduction "plain" spans offsets 0 through 15, and is itself made up of smaller productions. After "plain", the "markup" production spans offsets 15 through 27. Its details are omitted here, but the first "markup" has components of its own, and further productions succeed later in the source text.
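
Since the tag list records only offsets, recovering the matched text is simply a matter of slicing the source string. Here is a minimal sketch (the helper name is ours; the name result stands for the (success, children, chars_parsed) tuple shown above):

import sys

def show_top_level(source, result):
    # Print each top-level production and the text it matched.
    success, children, chars_parsed = result
    for tag, beg, end, parts in children:
        # the tag list holds offsets only; slicing recovers the text
        sys.stdout.write('%-10s %r\n' % (tag, source[beg:end]))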

Syntax for "smart ASCII" EBNF styles

We have already glanced at the tag list that SimpleParse + mxTextTools produce, but what we really need to examine is the grammar used to generate that tag list. The real work happens in the grammar, and an EBNF grammar largely reads itself with little explanation (although designing one does take a bit of thought and testing):

para        := (plain / markup)+
plain       := (word / whitespace / punctuation)+
whitespace  := [ \t\r\n]+
alphanums   := [a-zA-Z0-9]+
word        := alphanums, (wordpunct, alphanums)*, contraction?
wordpunct   := [-_]
contraction := "'", ('am'/'clock'/'d'/'ll'/'m'/'re'/'s'/'t'/'ve')
markup      := emph / strong / module / code / title
emph        := '-', plain, '-'
strong      := '*', plain, '*'
module      := '[', plain, ']'
code        := "'", plain, "'"
title       := '_', plain, '_'
punctuation := (safepunct / mdash)
mdash       := '--'
safepunct   := [!@#$%^&()+=|\{}:;<>,.?/"]

The grammar reads almost exactly the way you would describe "smart ASCII" out loud, and it is very clear. A paragraph consists of some plain text and some marked-up text. Plain text consists of some collection of words, whitespace, and punctuation. Marked-up text might be emphasized text, strongly emphasized text, module names, and so on. Strongly emphasized text is surrounded by asterisks; emphasized text is surrounded by dashes; every kind of markup is built out of such parts. There are a few details to settle, such as what exactly counts as a "word" or which characters may appear in a contraction, but the EBNF grammar itself poses no obstacle.
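
For instance, a made-up line of smart ASCII such as the following (the sample text is ours, purely for illustration):

A *bold* claim about the [simpleparse] module from the _Python Language Reference_ era.

would parse as "plain" productions interleaved with "strong", "module", and "title" markup productions.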

By contrast, regular expressions can describe roughly the same rules more tersely, which is what the first version of the "smart ASCII" markup program did. But that terse description is considerably trickier to write, and harder still to adjust later. The following code expresses a largely (but not precisely) equivalent set of rules:
Python regexes for smart ASCII

# [module] names
re_mods =   r"""([\(\s'/">]|^)\[(.*?)\]([<\s\.\),:;'"?!/-])"""
# *strongly emphasize* words
re_strong = r"""([\(\s'/"]|^)\*(.*?)\*([\s\.\),:;'"?!/-])"""
# -emphasize- words
re_emph =   r"""([\(\s'/"]|^)-(.*?)-([\s\.\),:;'"?!/])"""
# _Book Title_ citations
re_title =  r"""([\(\s'/"]|^)_(.*?)_([\s\.\),:;'"?!/-])"""
# 'Function()' names
re_funcs =  r"""([\(\s/"]|^)'(.*?)'([\s\.\),:;"?!/-])"""

If you discover or invent some slightly different variant of the language, it is far easier to accommodate with the EBNF grammar than with those regular expressions. Moreover, mxTextTools is generally even faster at carrying out its pattern operations.

Generating and using tag lists

For the sample program, we put the actual grammar in a separate file. For most purposes this is a better way to organize things: changing the grammar and changing the application logic are usually different kinds of task, and the separate files reflect that. But all we ever do with the grammar is pass it to a SimpleParse function as a string, so in principle we could have included it in the main application (or even generated it dynamically in some way).

Let's examine the full (simplified) markup application:

import os
from sys import stdin, stdout, stderr
from simpleparse import generator
from mx.TextTools import TextTools

input = stdin.read()
decl = open('typo.def').read()    # the EBNF grammar file (name assumed)
from typo_html import codes
parser = generator.buildParser(decl).parserbyname('para')
taglist = TextTools.tag(input, parser)
for tag, beg, end, parts in taglist[1]:
    if tag == 'plain':
        stdout.write(input[beg:end])
    elif tag == 'markup':
        markup = parts[0]
        mtag, mbeg, mend = markup[:3]
        start, stop = codes.get(mtag, ('<!-- unknown -->', '<!-- / -->'))
        stdout.write(start + input[mbeg+1:mend-1] + stop)
stderr.write('parsed %s chars of %s\n' % (taglist[-1], len(input)))

Here is what it does. First we read in the grammar and create an mxTextTools parser from it. Next, we apply that parser (tag table) to the input text to create a tag list. Finally, we loop through the tag list and emit some new marked-up text. Of course, the loop could do whatever else we wanted with each production it encounters.

Because of the particular grammar used for smart ASCII, everything in the source text falls into either a "plain" production or a "markup" production. Therefore it suffices to loop through a single level of the tag list (except that we peek exactly one level down into each "markup" production, for example to find a "title"). A grammar with more freely nested structure, such as that of most programming languages, would make it easy to recurse down the tag list and look for production names at every level. For example, if the grammar allowed markup codes to nest, this recursive style could be used. You might enjoy the exercise of working out how to adjust the grammar for that (hint: remember that productions are allowed to refer to each other recursively).
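
For a grammar whose productions do nest, a recursive walk over the tag list might look something like the following sketch (the function name is ours; source and children follow the tuple layout shown earlier):

import sys

def walk(source, children, depth=0):
    # children is the list part of the tag list (e.g. taglist[1]);
    # each element has the form (tag, beg, end, subparts)
    for tag, beg, end, parts in children:
        sys.stdout.write('%s%s: %r\n' % ('  ' * depth, tag, source[beg:end]))
        if parts:                        # descend into nested productions
            walk(source, parts, depth + 1)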

The specific markup codes that go into the output live in yet another file, for organizational rather than essential reasons. A trick we use here is a dictionary that works as a switch statement (although the "otherwise" case in the example remains too narrow). The idea is that we might in the future want to produce documents in multiple "output formats", say HTML, DocBook, LaTeX, or others. The markup file used for the example looks like this:
typo_html.py

codes = \
{ 'emph'    : ('<em>', '</em>'),
  'strong'  : ('<strong>', '</strong>'),
  'module'  : ('<em><code>', '</code></em>'),
  'code'    : ('<code>', '</code>'),
  'title'   : ('<cite>', '</cite>'),
}

Extending this format to other output formats is simple.
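
For instance, a companion module for DocBook output could follow exactly the same shape as typo_html.py. The module name and element choices below are our own illustration, not part of the original code:

# typo_docbook.py -- hypothetical alternative output format
codes = \
{ 'emph'    : ('<emphasis>', '</emphasis>'),
  'strong'  : ('<emphasis role="strong">', '</emphasis>'),
  'module'  : ('<classname>', '</classname>'),
  'code'    : ('<literal>', '</literal>'),
  'title'   : ('<citetitle>', '</citetitle>'),
}

The driver program would then simply import codes from typo_docbook instead of typo_html.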

Concluding remarks

SimpleParse provides a concise and very readable EBNF-style wrapper around the underlying power and speed of the cryptic mxTextTools C module. In addition, many programmers are already reasonably familiar with EBNF notation, even if only in passing. I cannot offer a proof of which approach is easier to understand, since that is a matter of intuition, but I can give a quantitative comparison based on source-code length. The size of the previously hand-developed mxTypographify module is as follows:

wc mxTypographify.py

    199     776    7041 mxTypographify.py

A significant number of those 199 lines are comments, and eighteen of them are a regular-expression version of the markup function, included for timing comparisons. Still, the program's functionality is essentially the same as that of the program listed above. By comparison, our SimpleParse program, including its support files, comes to the following size:

wc typo*.def typo*.py

19      79     645
20      79     721
 6      25     205 typo_html.py
45     183    1571 total

In other words, it has about one quarter as many lines. This version has fewer comments, but that is largely because the EBNF grammar is so self-documenting. I don't want to put too much emphasis on line counts, since you can obviously play games that minimize or maximize code length. But one of the few practical empirical findings from studies of programmers' work is that "thousands of lines of code per person-month" stays roughly constant, regardless of language or libraries. Then again, the regular-expression version is about one third the length of the SimpleParse version, but I find that the density of its expressions makes it extremely difficult to maintain and even harder to write in the first place. All in all, I think SimpleParse is the best of the approaches considered.