What regular expression recognizes a string that uses \
for escaping ?
"([^\\"]|\\.)*"
  └─A──┘ └B┘
I will call this expression « deterministic » because it can be easily translated to imperative code that runs in linear time.
i = 0
if i >= str.length or '"' != str[i++]:
    return REJECT
while i < str.length:
    switch str[i]:
        case '"':
            return ACCEPT
        case '\':          // branch B: \\.
            i += 2
            break
        default:           // branch A: [^\\"]
            i += 1
            break
return REJECT
Not all regular expressions are deterministic. What is an expression that extracts the TLD from a lowercase ASCII domain name ?
[a-z0-9-]+(\.[a-z0-9-]+)*\.([a-z]+)
          └──────A──────┘└────B───┘
It is hard to translate this expression to a simple loop, because the character . can start either section A (a domain part that is not the TLD) or section B (the TLD) depending on whether there are other dots ahead in the string. When the imperative program reaches a ., it must choose whether it attempts A or B.
Of course, this expression can still be translated to imperative code which either uses backtracking (choose branch A, if it fails choose branch B) or keeps track of both branches at the same time. Both solutions are used by regular expression engines to cope with non-deterministic expressions.
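To make the backtracking strategy concrete, here is a minimal sketch in Python (not how production engines are implemented) that extracts the TLD for the expression above by trying branch A before branch B at every dot. The function name, and the assumption that the expression must cover the whole string, are mine.
import string

LABEL_CHARS = set(string.ascii_lowercase + string.digits + "-")
TLD_CHARS   = set(string.ascii_lowercase)

def extract_tld(s, i=0):
    j = i
    while j < len(s) and s[j] in LABEL_CHARS:   # [a-z0-9-]+
        j += 1
    if j == i or j == len(s) or s[j] != '.':
        return None
    tld = extract_tld(s, j + 1)                 # branch A: another domain part
    if tld is not None:
        return tld
    k = j + 1                                   # branch B: \.([a-z]+), the TLD
    while k < len(s) and s[k] in TLD_CHARS:
        k += 1
    return s[j + 1:] if k == len(s) > j + 1 else None

print(extract_tld("www.example.co.uk"))         # uk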
There are two ways of understanding how regular expressions describe what they should recognize.
The academic view is that regular expressions describe derivations : a regular expression can be used to generate or « derive » all the strings that it recognizes, so a string is recognized by a regular expression if and only if it is a member of that set—if and only if there exists a derivation that generates the string.
The derivation rules themselves are very simple:
L(ε) = { ε } // The empty word
L(a) = { 'a' }
L(A|B) = L(A) ∪ L(B)
L(AB) = { ab : a ∈ L(A), b ∈ L(B) }
L(A*) = { a₁a₂…aₙ : aᵢ ∈ L(A), n ≥ 0 }
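As a small worked example, applying these rules to a(b|c)* gives :
L(b|c)     = { 'b', 'c' }
L((b|c)*)  = { w₁…wₙ : wᵢ ∈ { 'b', 'c' }, n ≥ 0 } = { ε, 'b', 'c', 'bb', 'bc', … }
L(a(b|c)*) = { 'a', 'ab', 'ac', 'abb', 'abc', … }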
The engineering view is that regular expressions are a shorthand notation for a backtracking program. A string is recognized if the program can reach the end of the string and stop there.
This has led to some interesting developments that do not have clear equivalents in the academic view. In fact, the two views have become distinct enough that it is worth calling them different names : « regex » for the engineering concept implemented in libraries, and « regular languages » for the academic concept.
Among the many differences :
Look-ahead and look-behind let a regex test characters without consuming them.
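For example, in Python's re module (the pattern is mine, for illustration) :
import re
# (?<=\$) is a look-behind : the digits match only when preceded by '$',
# but the '$' itself is not part of the match.
re.findall(r"(?<=\$)\d+", "$15, then 20 more")   # ['15']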
Ambiguities (situations where more than one derivation generates the string) are important when a regex is expected to extract the values recognized by its sub-expressions. Does (.*)X(.*) applied to AXBXC return groups AXB and C, or A and BXC ? To resolve these ambiguities, regex syntax provides both greedy * (if there are two derivations, pick the one that has the longest match) and lazy *? (if there are two derivations, pick the one that has the shortest match).
These have an obvious meaning in a back-tracking program : when (.*)X(.*) sees an X, it has a choice between leaving or staying in the (.*) group. Which alternative should be tried first ? Greedy means to stay in the group (if that fails, backtrack and leave the group), lazy means to leave the group (if that fails, backtrack and stay in the group).
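In Python's re module, for instance, the two flavours pick opposite derivations :
import re
re.match(r"(.*)X(.*)",  "AXBXC").groups()   # greedy : ('AXB', 'C')
re.match(r"(.*?)X(.*)", "AXBXC").groups()   # lazy   : ('A', 'BXC')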
Backreferences let the regex reference parts of the string that it has already matched in its current attempt. ([abc])\1 recognizes "aa", "bb" and "cc".
Using backreferences, it is possible to test if a number P is prime: see if the letter A, repeated P times, is matched by (AA+)\1+. This will only be the case if there is a sequence of length N > 1 which is repeated exactly M > 1 times, so that P = N×M is not prime.
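A quick check with Python's re module (the helper name is mine) :
import re

def is_prime(p):
    # (AA+)\1+ fully matches 'A' * p exactly when p can be written as N×M
    # with N > 1 and M > 1, i.e. when p is composite.
    return p > 1 and re.fullmatch(r"(AA+)\1+", "A" * p) is None

print([n for n in range(2, 30) if is_prime(n)])   # 2, 3, 5, 7, 11, 13, 17, 19, 23, 29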
Subroutines let a regex define and call another regex. These calls can even be recursive. The well-known language { aⁿbⁿ : n ≥ 1 } is not regular, but is matched by a(?R)?b.
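The standard Python re module does not support (?R), but the third-party regex package does; assuming that package is installed, a quick check looks like this :
import regex   # third-party 'regex' package, not the standard 're'
print(regex.fullmatch(r"a(?R)?b", "aaabbb"))   # a match object
print(regex.fullmatch(r"a(?R)?b", "aaabb"))    # None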
A regex is more powerful than a theoretical regular expression, but it has also lost a few useful qualities. For example, a true regular expression can be compiled to a deterministic automaton which recognizes or rejects any string in O(N) time. There is no such guarantee for a regex. In fact, if you wanted to write a regular expression that trims white-space at the end of a string, you might use the regex (.*?) + (match as few characters as possible, followed by a sequence of white-space).
Faced with a string that consists of N-1 white-space characters, followed by the letter A, this regex takes a staggering O(N²) time to reject it. Every time the regex reaches the letter A, it backtracks and attempts another way of splitting all those white-space characters between (.*?) and +. This has been the cause of a Stack Overflow outage.
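The blow-up is easy to observe with a backtracking engine. A small sketch using Python's re (the sizes are arbitrary; if the O(N²) behaviour holds, doubling N should roughly quadruple the time) :
import re, time

for n in (4_000, 8_000, 16_000):
    s = " " * (n - 1) + "A"
    start = time.perf_counter()
    re.fullmatch(r"(.*?) +", s)    # never matches : the string ends in 'A'
    print(n, round(time.perf_counter() - start, 3))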
A similar divide has appeared in the world of grammars. The academic concept of context-free grammar (or CFG) is also defined based on derivation rules. In fact, we can use the exact same rules as for regular expressions:
L(ε) = { ε }
L(a) = { 'a' } // a is called a 'terminal'
L(A|B) = L(A) ∪ L(B) // A and B are called 'non-terminals'
L(AB) = { ab : a ∈ L(A), b ∈ L(B) }
The difference is that, unlike regular expressions, grammar definitions can be recursive : each non-terminal can reference any other non-terminal in the language, including itself. This is why the rule for L(A*) is not necessary : it can be defined in terms of the other rules.
A* → ε | A+
A+ → AA*
CFGs have been used on the engineering side for decades, with very few additions that didn't fit snugly into the existing theory. In fact, most parser generators support less powerful grammars, such as SLR, because there are algorithms that parse SLR CFGs in O(N), as opposed to O(N³) for arbitrary CFGs.
The engineering equivalent is the recursive descent parser, which happens when engineers throw their hands in the air, say « I give up ! » and write the parsing code by hand. Interestingly, this happens either because the language is too simple, so it is not worth learning and supporting a parser generation toolchain, or because the language is too complex—maybe parsing depends on recognizing identifiers as variables or type names, as in C++, or maybe the language can alter its syntax on the fly, as in Perl.
Parser combinators are a functional programmer's approach to recursive descent parsers : instead of writing by hand a function for each construct we can recognize, we should define those functions by combining other functions. Where a recursive descent parser would look like this:
// Recognize declaration 'type var;'
def parseDeclaration():
    Type type
    VarName var
    if (type = parseType()) == BACKTRACK:
        return BACKTRACK
    if (var = parseVar()) == BACKTRACK:
        return BACKTRACK
    if next() != ';':
        return BACKTRACK
    return new Declaration(type, var)
The parser combinator equivalent would be:
parseDeclaration =
    do type <- parseType
       var  <- parseVar
       char ';'
       return (Declaration type var)
Shorter, but does the exact same thing. For comparison, a typical parser generator would have required writing :
declaration: type var ';' { new Declaration($1, $2) }
Even shorter, but that brevity can be attributed to its being a domain-specific language (as opposed to being embedded in a general-purpose language and having to fight its syntax), rather than to a theoretical advantage of CFGs over recursive descent parsers.
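For a concrete illustration of the combinator idea in a general-purpose language, here is a minimal sketch in Python; the combinator names (token, seq) and the token-list input are assumptions made for this example :
# Each parser takes (tokens, pos) and returns (value, new_pos), or None to
# signal that the caller should backtrack.
def token(pred):
    def parse(toks, pos):
        if pos < len(toks) and pred(toks[pos]):
            return toks[pos], pos + 1
        return None
    return parse

def seq(*parsers):
    # Run the parsers one after another, failing as soon as one of them fails.
    def parse(toks, pos):
        values = []
        for p in parsers:
            r = p(toks, pos)
            if r is None:
                return None
            value, pos = r
            values.append(value)
        return values, pos
    return parse

parse_type = token(lambda t: t in ("int", "float"))
parse_var  = token(str.isidentifier)
semicolon  = token(lambda t: t == ";")

# The combinator version of parseDeclaration : a sequence of three parsers.
parse_declaration = seq(parse_type, parse_var, semicolon)
parse_declaration(["int", "x", ";"], 0)   # (['int', 'x', ';'], 3)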
Modern practices in recursive descent parsing have recently been given a backing theory, in the form of Parsing Expression Grammars (or PEGs), in 2004. PEGs recognize that recursive descent is a backtracking approach, and that the order in which alternatives are attempted therefore matters. PEGs do not use the CFG choice operator |, and instead introduce an ordered choice operator /. To parse a string with rule A/B, the PEG will first attempt to parse it with A and, if it fails, it will attempt B instead.
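A minimal sketch of ordered choice in Python (the helper names are mine), which also shows why the order matters :
def literal(word):
    def parse(s, i):
        return i + len(word) if s.startswith(word, i) else None
    return parse

def ordered_choice(a, b):
    # PEG '/' : try a, and only if it fails (returns None), try b.
    def parse(s, i):
        r = a(s, i)
        return r if r is not None else b(s, i)
    return parse

# 'interface' must come first : if 'int' were tried first, it would succeed
# and the parser would be left staring at 'erface'.
keyword = ordered_choice(literal("interface"), literal("int"))
keyword("interface", 0)   # 9
keyword("integer", 0)     # 3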
Just as for regular expressions, it is not enough to know that a derivation exists, the author of the parser is almost always interested in knowing which derivation was used. If several derivations are possible, the parser should specify which one should be used. A common limitation of CFGs is that there is no simple way of doing so.
A common issue is operator associativity and precedence. A grammar for mathematical expressions could be written as:
E → NUM
  | VAR
  | E + E
  | E - E
  | E * E
It would recognize an expression like 1 - 2 + 3, but would it be the correct derivation E[E[1 - 2] + 3], or the incorrect E[1 - E[2 + 3]] ? This is an associativity problem, and must be solved by altering the rules :
F → NUM
  | VAR
E → F
  | E + F
  | E - F
  | E * F
Then the correct derivation E[E[1 - 2] + 3] remains possible, but E[1 - E[2 + 3]] is not, because the rule E → ? - E does not exist.
However, this derives 1 - 2 * 3 as E[E[1-2] * 3], which is a precedence problem, and is solved by further altering the rules:
G → NUM
  | VAR
F → G
  | F * G
E → F
  | E + F
  | E - F
In practice, it is sometimes acceptable to just treat a mathematical expression as a binary operator soup E → (NUM|VAR) (OP (NUM|VAR))*, which has exactly one possible derivation (a list of operator-separated numbers and variables), and then implement associativity and precedence rules outside the parser.
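A minimal sketch of that approach in Python : the « parser » produces a flat list of operands and operators, and a precedence-climbing pass (the names and the precedence table are mine) rebuilds the tree with the desired associativity and precedence.
PREC = {'+': 1, '-': 1, '*': 2}

def climb(tokens, pos=0, min_prec=1):
    node = tokens[pos]                 # an operand (NUM or VAR)
    pos += 1
    while pos < len(tokens) and PREC[tokens[pos]] >= min_prec:
        op = tokens[pos]
        # Parse the right-hand side at a strictly higher minimum precedence,
        # which makes operators of equal precedence associate to the left.
        rhs, pos = climb(tokens, pos + 1, PREC[op] + 1)
        node = (op, node, rhs)
    return node, pos

climb(['1', '-', '2', '*', '3'])[0]   # ('-', '1', ('*', '2', '3'))
climb(['1', '-', '2', '+', '3'])[0]   # ('+', ('-', '1', '2'), '3')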
Another well-known ambiguity is the dangling else problem. Given the code if(A) if(B) C; else D;, should we recognize it as if(A){ if(B) C; else D; } or if(A){ if(B) C; } else D; ? Most languages choose the former (that is, the else should be associated with the closest if without one). Again, a grammar change :
// Neutral statements not involved in dangling else mess
N → VAR '=' E
  | 'return' E
  | '{' S '}'
// Protected statements would not capture an 'else' to their immediate right
P → 'if' '(' E ')' (P|N) 'else' (P|N)
// Capturing statements will capture an 'else' to their immediate right
C → 'if' '(' E ')' (C|N)
  | 'if' '(' E ')' (P|N) 'else' (C|N)
// Statements
S → P | N | C
All these rule changes quickly become messy. By contrast, PEGs use / to specify the choice of derivation :
S → VAR '=' E
  / 'return' E
  / '{' S '}'
  / 'if' '(' E ')' S 'else' S
  / 'if' '(' E ')' S
Not only is it easier to pick a derivation in a PEG, the absence of unordered choice | implies that PEGs are never ambiguous in the « which derivation to choose ? » sense, even if they are ambiguous in the « several derivations are possible » sense.
Most CFG parser generators, however, are able to pinpoint ambiguities. For SLR, this takes the form of reduce-reduce conflicts (a sequence is recognized by two derivations) or shift-reduce conflicts (a given sequence is recognized by derivation A but could be completed into a sequence recognized by unrelated derivation B). Modern generators, such as Menhir, can be very explicit about the ambiguous sequence. This means it is very hard to introduce an accidental ambiguity in a well-maintained CFG.
There are no equivalent tools for PEGs, because it is normal for several derivations to exist, and a tool cannot determine whether the choice of a derivation through the use of / was intentional on the part of the author, or whether the author did not even know multiple derivations were possible.
This makes PEGs very brittle in terms of changes, in the same way that C lets an expert programmer elegantly, concisely and efficiently shoot themselves in the foot. Consider a simple grammar that, as should be expected of any code base, is a bit of a mess :
STMT → DECL
     | ...
     | AUTO
DECL → TYPE name
     | TYPE name '=' EXPR
AUTO → 'auto' name '=' EXPR
In a CFG, there are many transformations that we can perform without altering the meaning of the grammar, thanks to the fact that | is both associative and commutative. This lets us safely merge the AUTO rule back into DECL :
STMT → DECL
     | ...
DECL → TYPE name
     | TYPE name '=' EXPR
     | 'auto' name '=' EXPR
Even better, we could merge auto and TYPE to simplify the rule even further and to issue a warning more helpful than a syntax error when writing auto foo;.
STMT → DECL
     | ...
DECL → (TYPE | 'auto') name ('=' EXPR)?
This transformation might be possible in the equivalent PEG. It really depends on whether there is a rule in the ominous ... above that would break if AUTO was given a higher priority. There's no way to tell, except by squinting very hard at the details.
The consequences of such a mistake would likely be very subtle—some edge case which used to be parsed one way will now be parsed in another way. It is no longer enough to have a handful of simple unit tests—every ambiguity in the « more than one derivation exists » sense should have an associated test to ensure that grammar changes have not reversed the priority order in those cases.