Version 1 (modified by Austin_Hastings, 5 years ago)

--

Operator Precedence Parsing

The optable sub-parser built in to NQPrx is secretly a bottom-up, shift-reduce parser like you'd get from bison or yacc. It's great for parsing expressions because all the parts of an expression are basically the same (operators,terms), and the only real question is how to structure the resulting tree.

NOTE: By structuring the tree, I mean that the tree for x + y * 4 will look different than the tree for x * y + 4. If you don't totally understand this, you need to go talk to the folks on #parrot, or dig up some notes from an old CS 200 course, or something.

Parsing with Optable

A set of generic terminals is defined by Regex::Cursor. Those terminals are understood to interact in a particular way in expressions, and your operators have to conform to that understanding. Here's how it works:

EXPR

The rule that drives all this is <EXPR>. If you like, you can specify a precedence cutoff, like <EXPR('foo')>, so that only a certain set of operators (with precedence greater than foo) will be parsed.

EXPR knows that expressions are made of a sequence of terms and operators. And generally, an expression looks like:

   y = m * x + b;

Of course, the semicolon(;) isn't a part of the expression - that's part of the expression_statement rule that includes this expression:

rule expression_statement { <EXPR> ';' }

But the rest of it is fair game for expression to parse. So what's in an expression? Well, operators and terms. An operator is something concrete, like '+' or '*' or '='. And a term is the stuff between the operators, right? And maybe there's only the one term, so that should be "zero-or-more":

rule EXPR { 
    <term> [ <op> <term> ]* 
} 

But wait! What about the dreaded "unary" operators, like '-' (negative sign). And what about the "prefix" and "postfix" operators like '++' and '--'?

Well, prefix and postfix ops are generally attached to the term. So there's still a sequence of term-ish blobs separated by binary operators:

rule EXPR {
    <termish> [ <op> <termish> ]*
}

rule termish {
    <prefix>*
    <term>
    <postfix>*
}

So allowing multiple unary operators means we can do things like *++x = -y---- * 2, but maybe we shouldn't.

This still leaves a couple of special edge-cases, having to do with operators that don't fit into this pattern of an "atomic" operator and one or two operands. First, because I know you're bouncing in your seat saying "what about ternary? what about ternary?", there's the ternary.

ternary

(If you don't know about the ternary, don't fret. Just nod and smile when your buddies get all giddy...)

There's a couple of choices for how to recognize ternary, but the most direct is the easiest, and since there's only one ternary op, it's the route the optable parser takes: you do it''

The trick is that parsing the individual <op> strings is still the responsibility of the coder. So is exponentiation going to be ^ or **? It's up to you. And so ternary is treated as a binary operator, because that code is already written. It's just that you need to recognize both parts of the op, and recognize the intermediary sub-expression:

rule infix:sym<ternary> { '?' <EXPR('f=')> ':' }

Disappointing, eh?

circumfix and postcircumfix

It would be tempting to treat circumfix and postcircumfix as variations of term, but they clearly aren't. So yes, there's special handling for them. It follows the well-established pattern if adding -ish and adding options:

rule termish {
    <prefixish>*
    <term>
    <postfixish>*
}

rule postfixish {
    | <postfix>
    | <postcircumfix>
}

W-wh-wha-what? Prefixish?

Heh, gotcha! The only thing in prefixish is some whitespace - there's no precircumfix. At least, not yet.

token prefixish {
   <prefix> <.ws>
}

Hey! If you're going to let the coder write ++ ++x, you may as well encourage some whitespace...

And what about circumfix, you ask? Well, that turns out to be just a variation on term:

token term:sym<circumfix> { <circumfix> }

Presto!

Summary

So here's the list of proto-tokens you can extend:

circumfix usually just parens around a (subexpression) term
infix plain old binary operator between terms, like x + y
postfix unary operator after a term, like x--
postcircumfix circumfix operator after a term, like f(x) or a[i]
prefix unary operator before a term, like '!x'
term basic atom that is used by operators: identifier, constant, or (subexpression)

Defining term

Let's get this out of the way, since it's pretty easy. First, circumfix is built-in as a term:

token term:sym<circumfix> { <circumfix> }

This means that any circumfix you define automatically becomes a term. If you don't like that behavior, you need to *not* define your operation as a circumfix.

Identifiers

You get the <ident> target for free, which recognizes C-like names (which are the same names used by pretty much every other recent programming language on the planet: letter or underscore to start, then 0 or more letter, underscore, or digit). But that rule is not automatically a part of <term> since maybe you have a different rule for variable names (like, Hello! Perl?).

So an easy win is:

token term:sym<identifier> { <ident> }

Literals

If you can use literals in expressions, you'll need to add 'em:

token term:sym<integer> { <integer> }
token term:sym<float>   { <dec_number> }
token term:sym<string>  { <quoted_string> }

Please note that while <integer> and <dec_number> are provided for free, quoted-string handling is a little more complex. I'm skipping it.

Defining and Categorizing Operators

The simplest part of defining operators is specifying what they look like:

token infix:sym<+> { <sym> }

Note that the <sym> target is magic, and just re-uses the :sym<> part of the name.

Once you specify the appearance, you'll need to specify any other parsing details. As mentioned above, this is important for ternary and postcircumfix operators, where you have to parse part of the internals.

token infix:sym<?? !!> {
    '??' 
    <.ws>
    <EXPR('i=')>
    '!!'
}

But also, you need to tell the Optable parser the precedence and associativity of your operator, along with any other details you might want to use internally. First, put an INIT block into your grammar that sets up these definitions:

grammar My::Grammar is HLL::Grammar;

INIT {
    My::Grammar.O( ':prec<09>, :assoc<unary>', '%negative' );
    My::Grammar.O( ':prec<07>, :assoc<left>',  '%multiplicative' );
    My::Grammar.O( ':prec<05>, :assoc<left>', '%additive' );
}

Doing this isn't a strict requirement, but it's recommended since the information stored here will have to be specified in a bunch of places. Then, in your operator targets, you re-use the '%'-tags you defined here:

token infix:sym<*>   { <sym> <O('%multiplicative')> }
token infix:sym<+>   { <sym> <O('%additive')> }
token infix:sym<->   { <sym> <O('%additive')> }
token infix:sym<mod> { <sym> <O('%multiplicative')> }
token infix:sym<div> { <sym> <O('%multiplicative')> }

postcircumfix

token postcircumfix:sym<[ ]> { '[' <EXPR> ']' }

Note in particular that many programming languages have slightly different parsing rules inside postcircumfix expressions. For example, C-like languages have to set a precedence threshold when parsing function calls, because the comma expression operator is not allowed - comma is an argument separator in postcircumfix-(). Similar logic may apply for named parameters and for separating multiple indices of a complex array - a[i, j].

infix binary

This is the classic operator format. Many of the all-time greats were infix binary ops: plus(+), minus(-), even bitwise exclusive or(^) was an infix binary operator.

An infix binary operator goes in between the two (hence, binary) operands, like: x * y or a + 1.