Version 2 (modified by Austin_Hastings, 4 years ago)

--

Operator Precedence Parsing

The optable sub-parser built in to NQPrx is secretly a bottom-up, shift-reduce parser like you'd get from bison or yacc. It's great for parsing expressions because all the parts of an expression are basically the same (operators,terms), and the only real question is how to structure the resulting tree.

NOTE: By structuring the tree, I mean that the tree for x + y * 4 will look different than the tree for x * y + 4. If you don't totally understand this, you need to go talk to the folks on #parrot, or dig up some notes from an old CS 200 course, or something.

Parsing with Optable

A set of generic terminals is defined by Regex::Cursor. Those terminals are understood to interact in a particular way in expressions, and your operators have to conform to that understanding. Here's how it works:

EXPR

The rule that drives all this is <EXPR>. If you like, you can specify a precedence cutoff, like <EXPR('foo')>, so that only a certain set of operators (with precedence greater than foo) will be parsed.

EXPR knows that expressions are made of a sequence of terms and operators. And generally, an expression looks like:

   y = m * x + b;

Of course, the semicolon(;) isn't a part of the expression - that's part of the expression_statement rule that includes this expression:

rule expression_statement { <EXPR> ';' }

But the rest of it is fair game for expression to parse. So what's in an expression? Well, operators and terms. An operator is something concrete, like '+' or '*' or '='. And a term is the stuff between the operators, right? And maybe there's only the one term, so that should be "zero-or-more":

rule EXPR { 
    <term> [ <op> <term> ]* 
} 

But wait! What about the dreaded "unary" operators, like '-' (negative sign). And what about the "prefix" and "postfix" operators like '++' and '--'?

Well, prefix and postfix ops are generally attached to the term. So there's still a sequence of term-ish blobs separated by binary operators:

rule EXPR {
    <termish> [ <op> <termish> ]*
}

rule termish {
    <prefix>*
    <term>
    <postfix>*
}

So allowing multiple unary operators means we can do things like *++x = -y---- * 2, but maybe we shouldn't.

This still leaves a couple of special edge-cases, having to do with operators that don't fit into this pattern of an "atomic" operator and one or two operands. First, because I know you're bouncing in your seat saying "what about ternary? what about ternary?", there's the ternary.

ternary

(If you don't know about the ternary, don't fret. Just nod and smile when your buddies get all giddy...)

There's a couple of choices for how to recognize ternary, but the most direct is the easiest, and since there's only one ternary op, it's the route the optable parser takes: you do it''

The trick is that parsing the individual <op> strings is still the responsibility of the coder. So is exponentiation going to be ^ or **? It's up to you. And so ternary is treated as a binary operator, because that code is already written. It's just that you need to recognize both parts of the op, and recognize the intermediary sub-expression:

rule infix:sym<ternary> { '?' <EXPR('f=')> ':' }

Disappointing, eh?

circumfix and postcircumfix

It would be tempting to treat circumfix and postcircumfix as variations of term, and circumfix is. But postcircumfix is not, there's special handling. It follows the well-established pattern if adding -ish and adding options:

rule termish {
    <prefixish>*
    <term>
    <postfixish>*
}

rule postfixish {
    | <postfix>
    | <postcircumfix>
}

W-wh-wha-what? Prefix-ish? Heh, gotcha! The only thing in prefixish is some whitespace - there's no precircumfix. At least, not yet.

token prefixish {
   <prefix> <.ws>
}

(Hey! If you're going to let the coder write ++ ++x, you may as well encourage some whitespace...)

And what about circumfix, you ask? Well, that turns out to be just a variation on term:

token term:sym<circumfix> { <circumfix> }

Summary

So here's the list of proto-tokens you can extend:

circumfix usually just parens around a (subexpression) term
infix plain old binary operator between terms, like x + y
postfix unary operator after a term, like x--
postcircumfix circumfix operator after a term, like f(x) or a[i]
prefix unary operator before a term, like '!x'
term basic atom that is used by operators: identifier, constant, or (subexpression)

Defining term

Let's get this out of the way, since it's pretty easy. First, circumfix is built-in as a term:

token term:sym<circumfix> { <circumfix> }

This means that any circumfix you define automatically becomes a term. If you don't like that behavior, you need to not define your operation as a circumfix.

Identifiers

You get the <ident> target for free - inherited by grammars from Regex::Cursor, which recognizes C-like names (which are the same names used by pretty much every other recent programming language on the planet: letter or underscore to start, then 0 or more letter, underscore, or digit). But that rule is not automatically a part of <term> since maybe you have a different rule for variable names (like Perl).

So an easy win is:

token term:sym<identifier> { <ident> }

Unless you've got non-C-like identifier rules, in which case you'll want to do:

token term:sym<identifier> { <identifier> }

token identifier { # your rule goes here
}

Literals

If you can use literals in expressions, you'll need to add them to term:

token term:sym<integer> { <integer> }
token term:sym<float>   { <dec_number> }
token term:sym<string>  { <quoted_string> }

Please note that while <integer> and <dec_number> are provided for free, quoted-string handling is a little more complex. I'm skipping it here - maybe another page?

Predefined symbols

You may want to make some predefined symbols explicitly terms, especially if they can only be used in an expression context. Perl does this with the self predefined symbol. (Alternatively, you could just add it to the symbol table and let your identifier rule catch it. In the case of Perl, self doesn't look like a variable, so...)

Defining and Categorizing Operators

The simplest part of defining operators is specifying what they look like:

token infix:sym<+> { <sym> }

Note that the <sym> target is magic, and just re-uses the :sym<> part of the name.

Once you specify the appearance, you'll need to specify any other parsing details. As mentioned above, this parsing is doubly important for ternary and postcircumfix operators, where you have to parse part of the internals:

token infix:sym<?? !!> { # Perl-like ?? !! ternary
    '??' 
    <.ws>
    <EXPR('i=')>
    '!!'
}

But also, you need to tell the Optable parser the precedence and associativity of your operator, along with any other details you might want to use internally. First, put an INIT block into your grammar that sets up these definitions:

grammar My::Grammar is HLL::Grammar;

INIT {
    My::Grammar.O( ':prec<09>, :assoc<unary>', '%negative' );
    My::Grammar.O( ':prec<07>, :assoc<left>',  '%multiplicative' );
    My::Grammar.O( ':prec<05>, :assoc<left>', '%additive' );
}

Doing this isn't a strict requirement, but it's recommended since the information stored here will have to be specified in a bunch of places. Then, in your operator targets, you re-use the '%'-tags you defined here:

token infix:sym<*>   { <sym> <O('%multiplicative')> }
token infix:sym<+>   { <sym> <O('%additive')> }
token infix:sym<->   { <sym> <O('%additive')> }
token infix:sym<mod> { <sym> <O('%multiplicative')> }
token infix:sym<div> { <sym> <O('%multiplicative')> }

As mentioned, you can explicitly set the precedence and associativity on each op, but that'd be silly.

Precedence

Note that the precedence (:prec<...>) is a string. And it gets compared as a string, not a number or any other magic foo. So if you have a precedence of 9 and a precedence of 10, you had better either start zero-padding your precedence codes, or expect to be surprised. The precedence strings are compared asciibetically, so 0 < 9 < A < a < z.

Note that for unary operators, the precedence is used to determine whether a prefix or postfix operator is applied first. So in code like *x++ the precedence determines whether the result is (*x)++ or *(x++). Obviously, this also applies to postcircumfix, so *x[1] is subject to the same rules.

Associativity

The :assoc<...> field specifies associativity. Possible values are unary, left, right, and list. These control how the expression parser reduces the list of operators and terms.

unary

A unary association means that the operator associates only with the single term it applies to (post- or pre- fix-ishly).

IMPORTANT: Never mark a unary operator as anything but unary associative. Even if the language grammar specifies "oh, this operator is left-associative", you must mark it as unary. Marking it as unary tells the optable "don't look for anything else but this", whereas marking it left or right would imply "there should be another operand somewhere, you should look for it" - resulting in painful horrible death, or worse.

left, right

A left- or right- associative operator will require two or more arguments, but will group them differently when they appear in bunches. A left association, like addition, will create a group starting on the left: a + b + c becomes (a+b) +c. A right association, like exponentiation, groups the other way: x ** y ** z becomes (x** (y**z)).

It makes a difference: consider 2 ** 2 ** 3. In a left-grouping scenario, you get (2**2=4)**3 = 64. But a right-grouping scenario gives you 2**(2**3=8) = 256.

IMPORTANT: See note in unary, above.

list

List associativity is a special treatment, mainly for the comma(,) operator. Instead of building a tree, the expression parser will build an array with the operands in order:

   'a', 'b', 'c'

produces an array [ 'a', 'b', 'c' ]. The parser is smart enough to recognized different operators, so if you define, say, semicolon and comma as different list operators, you will get different lists:

   1, 1; 2, 3; 5, 8

produces [ [1,1], [2,3], [5,8] ], I think. (Haven't bothered to test it, but the code is there.)

Operator Types

circumfix

token circumfix:sym<( )>  { '(' ~ ')' <EXPR> }

Because circumfix is implemented as a part of term instead of as an operator, there is no need to worry about precedence or associativity. On the other hand, you do need to worry about the difference between circumfix () and postcircumfix ().

infix

token infix:sym<*>   { <sym> <O('%multiplicative')> }

token infix:sym<**>  { <sym> <O('%exponent')> }

token infix:sym<? :> { '?' <EXPR('i=')> ':' <O('%ternary')> }

This is the classic operator format. Many of the all-time greats were infix binary ops: plus(+), minus(-), even bitwise exclusive or(^) was an infix binary operator.

And yes, ternary is implemented as an infix binary with a lot of special processing inside.

postcircumfix

token postcircumfix:sym<[ ]> { '[' ~ ']' $index=<EXPR><O('%array_index')> }

token postcircumfix:sym<( )> { '(' ~ ')' <arglist> <O('%function_call')> }

Note in particular that many programming languages have slightly different parsing rules inside postcircumfix expressions. For example, C-like languages have to set a precedence threshold when parsing function calls, because the comma expression operator is not allowed - comma is an argument separator in postcircumfix-(). Similar logic may apply for named parameters and for separating multiple indices of a complex array - a[i, j].

postfix & prefix

token postfix:sym<?>  { <sym> <O('%smart_nullification')> }

token prefix:sym<+>   { <sym> <O('%unary_plus')> }
token prefix:sym<->   { <sym> <O('%unary_plus')> }

When defining pre- and postfix operators, be aware of the possibility that you may conflict with a binary operator of the same name. In that case, language specifications generally give precedence to the unary form.

IMPORTANT: While you can re-use the %-tags for different operators, do not try to re-use the %-tags at different levels. They will take on only one value, and leave you trying to debug the internals of the expression parser. One you assign a %-tag in your INIT-block, you should never try to re-defined it.

On the other hand, multiple operators can reference the same %-tag. It's not used to specify behavior, it's used to specify attributes.

term

token term:sym<identifier> { <identifier> }

Terms are the 'atomic' parts of an expression. They may be more like the 'subatomic' parts, if you weren't expecting to see things like postcircumfix-(). But whether they are quarks or atoms, they're definitely the place where the expression parser either stops, or recurses on itself.

One thing to watch out for is that your terms should pretty much always reference another rule, instead of doing any matching themselves. This is because for rules like literals and identifiers, you are very likely to need to recognize the same thing in a different, non-expression context. (Many languages have a special syntax for parameter or variable declarations that is definitely not an expression.) So put that knowledge in a separate rule, and call it from within term.