What regular expression recognizes a string that uses \
for escaping ?
"([^\\"]|\\.)*"
  └─A──┘ └B┘
I will call this expression « deterministic » because it can be easily translated to imperative code that runs in linear time.
i = 0
if i >= str.length or '"' != str[i++]:
    return REJECT
while i < str.length:
    switch str[i]:
        case '"':
            return ACCEPT
        case '\':          // branch B: \\.
            i += 2
            break
        default:           // branch A: [^\\"]
            i += 1
            break
return REJECT
Not all regular expressions are deterministic. What is an expression that extracts the TLD from a lowercase ASCII domain name ?
[a-z0-9-]+(\.[a-z0-9-]+)*\.([a-z]+)
          └──────A──────┘└────B───┘
It is hard to translate this expression to a simple loop, because the character . can start either section A (a domain part that is not the TLD) or section B (the TLD) depending on whether there are other dots ahead in the string. When the imperative program reaches a ., it must choose whether it attempts A or B.
Of course, this expression can still be translated to imperative code which either uses backtracking (choose branch A, if it fails choose branch B) or keeps track of both branches at the same time. Both solutions are used by regular expression engines to cope with non-deterministic expressions.
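To make the backtracking strategy concrete, here is a minimal sketch in Python (not how production engines are implemented) that extracts the TLD for the expression above by trying branch A before branch B at every dot. The function name, and the assumption that the expression must cover the whole string, are mine.
import string

LABEL_CHARS = set(string.ascii_lowercase + string.digits + "-")
TLD_CHARS   = set(string.ascii_lowercase)

def extract_tld(s, i=0):
    j = i
    while j < len(s) and s[j] in LABEL_CHARS:   # [a-z0-9-]+
        j += 1
    if j == i or j == len(s) or s[j] != '.':
        return None
    tld = extract_tld(s, j + 1)                 # branch A: another domain part
    if tld is not None:
        return tld
    k = j + 1                                   # branch B: \.([a-z]+), the TLD
    while k < len(s) and s[k] in TLD_CHARS:
        k += 1
    return s[j + 1:] if k == len(s) > j + 1 else None

print(extract_tld("www.example.co.uk"))         # uk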
There are two ways of understanding how regular expressions describe what they should recognize.
The academic view is that regular expressions describe derivations : a regular expression can be used to generate or « derive » all the strings that it recognizes, so a string is recognized by a regular expression if and only if it is a member of that set—if and only if there exists a derivation that generates the string.
The derivation rules themselves are very simple:
L(ε) = { ε } // The empty word
L(a) = { 'a' }
L(A|B) = L(A) ∪ L(B)
L(AB) = { ab : a ∈ L(A), b ∈ L(B) }
L(A*) = { a₁a₂…aₙ : aᵢ ∈ L(A), n ≥ 0 }
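As a small worked example, applying these rules to a(b|c)* gives :
L(b|c)     = { 'b', 'c' }
L((b|c)*)  = { w₁…wₙ : wᵢ ∈ { 'b', 'c' }, n ≥ 0 } = { ε, 'b', 'c', 'bb', 'bc', … }
L(a(b|c)*) = { 'a', 'ab', 'ac', 'abb', 'abc', … }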
The engineering view is that regular expressions are a shorthand notation for a backtracking program. A string is recognized if the program can reach the end of the string and stop there.
This has led to some interesting developments that do not have clear equivalents in the academic view. In fact, the two views have become distinct enough that it is worth calling them different names : « regex » for the engineering concept implemented in libraries, and « regular languages » for the academic concept.
Among the many differences :
Look-ahead and look-behind let a regex test characters without consuming them.
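For example, in Python's re module (the pattern is mine, for illustration) :
import re
# (?<=\$) is a look-behind : the digits match only when preceded by '$',
# but the '$' itself is not part of the match.
re.findall(r"(?<=\$)\d+", "$15, then 20 more")   # ['15']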
Ambiguities (situations where more than one derivation generates the string) are important when a regex is expected to extract the values recognized by its sub-expressions. Does (.*)X(.*) applied to AXBXC return groups AXB and C, or A and BXC ? To resolve these ambiguities, regex syntax provides both greedy * (if there are two derivations, pick the one that has the longest match) and lazy *? (if there are two derivations, pick the one that has the shortest match).
These have an obvious meaning in a back-tracking program : when (.*)X(.*) sees an X, it has a choice between leaving or staying in the (.*) group. Which alternative should be tried first ? Greedy means to stay in the group (if that fails, backtrack and leave the group), lazy means to leave the group (if that fails, backtrack and stay in the group).
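In Python's re module, for instance, the two flavours pick opposite derivations :
import re
re.match(r"(.*)X(.*)",  "AXBXC").groups()   # greedy : ('AXB', 'C')
re.match(r"(.*?)X(.*)", "AXBXC").groups()   # lazy   : ('A', 'BXC')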
Backreferences let the regex reference parts of the string that it has already matched in its current attempt. ([abc])\1 recognizes "aa", "bb" and "cc".
Using backreferences, it is possible to test if a number P is prime: see if the letter A, repeated P times, is matched by (AA+)\1+. This will only be the case if there is a sequence of length N > 1 which is repeated exactly M > 1 times, so that P = N×M is not prime.
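A quick check with Python's re module (the helper name is mine) :
import re

def is_prime(p):
    # (AA+)\1+ fully matches 'A' * p exactly when p can be written as N×M
    # with N > 1 and M > 1, i.e. when p is composite.
    return p > 1 and re.fullmatch(r"(AA+)\1+", "A" * p) is None

print([n for n in range(2, 30) if is_prime(n)])   # 2, 3, 5, 7, 11, 13, 17, 19, 23, 29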
Subroutines let a regex define and call another regex. These calls can even be recursive. The well-known language { aⁿbⁿ : n ≥ 1 } is not regular, but is matched by a(?R)?b.
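The standard Python re module does not support (?R), but the third-party regex package does; assuming that package is installed, a quick check looks like this :
import regex   # third-party 'regex' package, not the standard 're'
print(regex.fullmatch(r"a(?R)?b", "aaabbb"))   # a match object
print(regex.fullmatch(r"a(?R)?b", "aaabb"))    # None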
A regex is more powerful than a theoretical regular expression, but it has also lost a few useful qualities. For example, a true regular expression can be compiled to a deterministic automaton which recognizes or rejects any string in O(N) time. There is no such guarantee for a regex. In fact, if you wanted to write a regular expression that trims white-space at the end of a string, you might use the regex (.*?) + (match as few characters as possible, followed by a sequence of white-space).
Faced with a string that consists of N-1 white-space characters, followed by the letter A, this regex takes a staggering O(N²) time to reject it. Every time the regex reaches the letter A, it backtracks and attempts another way of splitting all those white-space characters between (.*?) and +. This has been the cause of a Stack Overflow outage.
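The blow-up is easy to observe with a backtracking engine. A small sketch using Python's re (the sizes are arbitrary; if the O(N²) behaviour holds, doubling N should roughly quadruple the time) :
import re, time

for n in (4_000, 8_000, 16_000):
    s = " " * (n - 1) + "A"
    start = time.perf_counter()
    re.fullmatch(r"(.*?) +", s)    # never matches : the string ends in 'A'
    print(n, round(time.perf_counter() - start, 3))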
A similar divide has appeared in the world of grammars. The academic concept of context-free grammar (or CFG) is also defined based on derivation rules. In fact, we can use the exact same rules as for regular expressions:
L(ε) = { ε }
L(a) = { 'a' } // a is called a 'terminal'
L(A|B) = L(A) ∪ L(B) // A and B are called 'non-terminals'
L(AB) = { ab : a ∈ L(A), b ∈ L(B) }
The difference is that, unlike regular expressions, grammar definitions can be recursive : each non-terminal can reference any other non-terminal in the language, including itself. This is why the rule for L(A*) is not necessary : it can be defined in terms of the other rules.
A* → ε | A+
A+ → AA*
CFGs have been used on the engineering side for decades, with very few additions that didn't fit snugly into the existing theory. In fact, most parser generators support less powerful grammars, such as SLR, because there are algorithms that parse SLR CFGs in O(N), as opposed to O(N³) for arbitrary CFGs.
The engineering equivalent is the recursive descent parser, which happens when engineers throw their hands in the air, say « I give up ! » and write the parsing code by hand. Interestingly, this happens either because the language is too simple, so it is not worth learning and supporting a parser generation toolchain, or because the language is too complex—maybe parsing depends on recognizing identifiers as variables or type names, as in C++, or maybe the language can alter its syntax on the fly, as in Perl.
Parser combinators are a functional programmer's approach to recursive descent parsers : instead of writing by hand a function for each construct we can recognize, we should define those functions by combining other functions. Where a recursive descent parser would look like this:
// Recognize declaration 'type var;'
def parseDeclaration():
    Type type
    VarName var
    if (type = parseType()) == BACKTRACK:
        return BACKTRACK
    if (var = parseVar()) == BACKTRACK:
        return BACKTRACK
    if next() != ';':
        return BACKTRACK
    return new Declaration(type, var)
The parser combinator equivalent would be:
parseDeclaration =
    do type <- parseType
       var  <- parseVar
       char ';'
       return (Declaration type var)
Shorter, but does the exact same thing. For comparison, a typical parser generator would have required writing :
declaration: type var ';' { new Declaration($1, $2) }
Even shorter, but that brevity can be attributed to its being a domain-specific language (as opposed to being embedded in a general-purpose language and having to fight its syntax), rather than to a theoretical advantage of CFGs over recursive descent parsers.
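For a concrete illustration of the combinator idea in a general-purpose language, here is a minimal sketch in Python; the combinator names (token, seq) and the token-list input are assumptions made for this example :
# Each parser takes (tokens, pos) and returns (value, new_pos), or None to
# signal that the caller should backtrack.
def token(pred):
    def parse(toks, pos):
        if pos < len(toks) and pred(toks[pos]):
            return toks[pos], pos + 1
        return None
    return parse

def seq(*parsers):
    # Run the parsers one after another, failing as soon as one of them fails.
    def parse(toks, pos):
        values = []
        for p in parsers:
            r = p(toks, pos)
            if r is None:
                return None
            value, pos = r
            values.append(value)
        return values, pos
    return parse

parse_type = token(lambda t: t in ("int", "float"))
parse_var  = token(str.isidentifier)
semicolon  = token(lambda t: t == ";")

# The combinator version of parseDeclaration : a sequence of three parsers.
parse_declaration = seq(parse_type, parse_var, semicolon)
parse_declaration(["int", "x", ";"], 0)   # (['int', 'x', ';'], 3)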
Modern practices in recursive descent parsing have recently been given a backing theory, in the form of Parsing Expression Grammars (or PEGs), in 2004. PEGs recognize that recursive descent is a backtracking approach, and that the order in which alternatives are attempted therefore matters. PEGs do not use the CFG choice operator |, and instead introduce an ordered choice operator /. To parse a string with rule A/B, the PEG will first attempt to parse it with A and, if it fails, it will attempt B instead.
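A minimal sketch of ordered choice in Python (the helper names are mine), which also shows why the order matters :
def literal(word):
    def parse(s, i):
        return i + len(word) if s.startswith(word, i) else None
    return parse

def ordered_choice(a, b):
    # PEG '/' : try a, and only if it fails (returns None), try b.
    def parse(s, i):
        r = a(s, i)
        return r if r is not None else b(s, i)
    return parse

# 'interface' must come first : if 'int' were tried first, it would succeed
# and the parser would be left staring at 'erface'.
keyword = ordered_choice(literal("interface"), literal("int"))
keyword("interface", 0)   # 9
keyword("integer", 0)     # 3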
Just as for regular expressions, it is not enough to know that a derivation exists, the author of the parser is almost always interested in knowing which derivation was used. If several derivations are possible, the parser should specify which one should be used. A common limitation of CFGs is that there is no simple way of doing so.
A common issue is operator associativity and precedence. A grammar for mathematical expressions could be written as:
E → NUM
  | VAR
  | E + E
  | E - E
  | E * E
It would recognize an expression like 1 - 2 + 3, but would it be the correct derivation E[E[1 - 2] + 3], or the incorrect E[1 - E[2 + 3]] ? This is an associativity problem, and must be solved by altering the rules :
F → NUM
  | VAR
E → F
  | E + F
  | E - F
  | E * F
Then the correct derivation E[E[1 - 2] + 3] remains possible, but E[1 - E[2 + 3]] is not, because the rule E → ? - E does not exist.
However, this derives 1 - 2 * 3 as E[E[1-2] * 3], which is a precedence problem, and is solved by further altering the rules:
G → NUM
  | VAR
F → G
  | F * G
E → F
  | E + F
  | E - F
In practice, it is sometimes acceptable to just treat a mathematical expression as a binary operator soup E → (NUM|VAR) (OP (NUM|VAR))*, which has exactly one possible derivation (a list of operator-separated numbers and variables), and then implement associativity and precedence rules outside the parser.
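A minimal sketch of that approach in Python : the « parser » produces a flat list of operands and operators, and a precedence-climbing pass (the names and the precedence table are mine) rebuilds the tree with the desired associativity and precedence.
PREC = {'+': 1, '-': 1, '*': 2}

def climb(tokens, pos=0, min_prec=1):
    node = tokens[pos]                 # an operand (NUM or VAR)
    pos += 1
    while pos < len(tokens) and PREC[tokens[pos]] >= min_prec:
        op = tokens[pos]
        # Parse the right-hand side at a strictly higher minimum precedence,
        # which makes operators of equal precedence associate to the left.
        rhs, pos = climb(tokens, pos + 1, PREC[op] + 1)
        node = (op, node, rhs)
    return node, pos

climb(['1', '-', '2', '*', '3'])[0]   # ('-', '1', ('*', '2', '3'))
climb(['1', '-', '2', '+', '3'])[0]   # ('+', ('-', '1', '2'), '3')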
Another well-known ambiguity is the dangling else problem. Given the code if(A) if(B) C; else D;, should we recognize it as if(A){ if(B) C; else D; } or if(A){ if(B) C; } else D; ? Most languages choose the former (that is, the else should be associated with the closest if without one). Again, a grammar change :
// Neutral statements not involved in dangling else mess
N → VAR '=' E
  | 'return' E
  | '{' S '}'
// Protected statements would not capture an 'else' to their immediate right
P → 'if' '(' E ')' (P|N) 'else' (P|N)
// Capturing statements will capture an 'else' to their immediate right
C → 'if' '(' E ')' (C|N)
  | 'if' '(' E ')' (P|N) 'else' (C|N)
// Statements
S → P | N | C
All these rule changes quickly become messy. By contrast, PEGs use / to specify the choice of derivation :
S → VAR '=' E
  / 'return' E
  / '{' S '}'
  / 'if' '(' E ')' S 'else' S
  / 'if' '(' E ')' S
Not only is it easier to pick a derivation in a PEG, the absence of unordered choice | implies that PEGs are never ambiguous in the « which derivation to choose ? » sense, even if they are ambiguous in the « several derivations are possible » sense.
Most CFG parser generators, however, are able to pinpoint ambiguities. For SLR, this takes the form of reduce-reduce conflicts (a sequence is recognized by two derivations) or shift-reduce conflicts (a given sequence is recognized by derivation A but could be completed into a sequence recognized by unrelated derivation B). Modern generators, such as Menhir, can be very explicit about the ambiguous sequence. This means it is very hard to introduce an accidental ambiguity in a well-maintained CFG.
There are no equivalent tools for PEGs, because it is normal for several derivations to exist, and a tool cannot determine whether the choice of a derivation through the use of / was intentional on the part of the author, or whether the author did not even know multiple derivations were possible.
This makes PEGs very brittle in terms of changes, in the same way that C lets an expert programmer elegantly, concisely and efficiently shoot themselves in the foot. Consider a simple grammar that, as should be expected of any code base, is a bit of a mess :
STMT → DECL
     | ...
     | AUTO
DECL → TYPE name
     | TYPE name '=' EXPR
AUTO → 'auto' name '=' EXPR
In a CFG, there are many transformations that we can perform without altering the meaning of the grammar, thanks to the fact that | is both associative and commutative. This lets us safely merge the AUTO rule back into DECL :
STMT → DECL
     | ...
DECL → TYPE name
     | TYPE name '=' EXPR
     | 'auto' name '=' EXPR
Even better, we could merge auto and TYPE to simplify the rule even further and to issue a warning more helpful than a syntax error when writing auto foo;.
STMT → DECL
     | ...
DECL → (TYPE | 'auto') name ('=' EXPR)?
This transformation might be possible in the equivalent PEG. It really depends on whether there is a rule in the ominous ... above that would break if AUTO was given a higher priority. There's no way to tell, except by squinting very hard at the details.
The consequences of such a mistake would likely be very subtle—some edge case which used to be parsed one way will now be parsed in another way. It is no longer enough to have a handful of simple unit tests—every ambiguity in the « more than one derivation exists » sense should have an associated test to ensure that grammar changes have not reversed the priority order in those cases.