= Operator Precedence Parsing = The optable sub-parser built in to NQPrx is secretly a bottom-up, shift-reduce parser like you'd get from bison or yacc. It's great for parsing expressions because all the parts of an expression are basically the same (operators,terms), and the only real question is how to structure the resulting tree. '''NOTE:''' By structuring the tree, I mean that the tree for `x + y * 4` will look different than the tree for `x * y + 4`. If you don't ''totally'' understand this, you need to go talk to the folks on #parrot, or dig up some notes from an old CS 200 course, or something. == Parsing with Optable == A set of generic non-terminals is defined by `Regex::Cursor`. Those productions are understood to interact in a particular way in expressions, and your operators have to conform to that understanding. Here's how it works: === EXPR === The rule that drives all this is ``. If you like, you can specify a precedence cutoff, like ``, so that only a certain set of operators (with precedence greater than `foo`) will be parsed. `EXPR` knows that expressions are made of a sequence of '''terms''' and operators. And generally, an expression looks like: {{{ y = m * x + b; }}} Of course, the semicolon(`;`) isn't a part of the expression - that's part of the `expression_statement` rule that ''includes'' this expression: {{{ rule expression_statement { ';' } }}} But the rest of it is fair game for expression to parse. So what's in an expression? Well, operators and terms. An operator is something concrete, like '+' or '*' or '='. And a term is the stuff between the operators, right? And maybe there's only the one term, so that should be "zero-or-more": {{{ rule EXPR { [ ]* } }}} But wait! What about the dreaded "unary" operators, like '-' (negative sign). And what about the "prefix" and "postfix" operators like '++' and '--'? Well, prefix and postfix ops are generally attached to the term. So there's still a sequence of ''term-ish'' blobs separated by binary operators: {{{ rule EXPR { [ ]* } rule termish { * * } }}} So allowing multiple unary operators means we can do things like `*++x = -y---- * 2`, but maybe we shouldn't. This still leaves a couple of special edge-cases, having to do with operators that don't fit into this pattern of an "atomic" operator and one or two operands. First, because I know you're bouncing in your seat saying "what about ternary? what about ternary?", there's the ternary. ==== ternary ==== (If you don't know about the ternary, don't fret. Just nod and smile when your buddies get all giddy...) There's a couple of choices for how to recognize ternary, but the most direct is the easiest, and since there's only one ternary op, it's the route the optable parser takes: ''you do it! '' The trick is that parsing the individual `` strings is still the responsibility of the coder. So is exponentiation going to be `^` or `**`? It's up to you. And so ternary is treated as a ''binary'' operator, because that code is already written. It's just that you need to recognize both parts of the op, ''and'' recognize the intermediary sub-expression: {{{ rule infix:sym { '?' ':' } }}} Disappointing, eh? ==== circumfix and postcircumfix ==== It would be tempting to treat circumfix and postcircumfix as variations of `term,` and circumfix is. But postcircumfix is not, there's special handling. It follows the well-established pattern if adding -ish and adding options: {{{ rule termish { * * } rule postfixish { | | } }}} W-wh-wha-what? ''Prefix-ish?'' Heh, gotcha! The only thing in prefixish is some whitespace - there's no precircumfix. At least, not yet. {{{ token prefixish { <.ws> } }}} (Hey! If you're going to let the coder write `++ ++x`, you may as well encourage some whitespace...) And what about `circumfix,` you ask? Well, that turns out to be just a variation on `term`: {{{ token term:sym { } }}} ==== Summary ==== So here's the list of proto-tokens you can extend: || `circumfix` || usually just parens around a (subexpression) `term` || || `infix` || plain old binary operator between terms, like `x + y` || || `postfix` || unary operator after a term, like `x--` || || `postcircumfix` || circumfix operator after a term, like `f(x)` or `a[i]` || || `prefix` || unary operator before a term, like '!x' || || `term` || basic atom that is used by operators: identifier, constant, or (subexpression) || == Defining `term` == Let's get this out of the way, since it's pretty easy. First, `circumfix` is built-in as a `term`: {{{ token term:sym { } }}} This means that any circumfix you define automatically becomes a term. If you don't like that behavior, you need to '''not''' define your operation as a `circumfix`. === Identifiers === You get the `` target for free - inherited by grammars from Regex::Cursor - which recognizes C-like names (which are the same names used by pretty much every other recent programming language on the planet: letter or underscore to start, then 0 or more letter, underscore, or digit). But that rule is '''not''' automatically a part of `` since maybe you have a different rule for variable names (like Perl). So an easy win is: {{{ token term:sym { } }}} Unless you've got non-C-like identifier rules, in which case you'll want to do: {{{ token term:sym { } token identifier { # your rule goes here } }}} === Literals === If you can use literals in expressions, you'll need to add them to `term:` {{{ token term:sym { } token term:sym { } token term:sym { } }}} Please note that while `` and `` are provided for free, quoted-string handling is a little more complex. I'm skipping it here - maybe another page? === Predefined symbols === You may want to make some predefined symbols explicitly terms, especially if they can only be used in an expression context. Perl does this with the `self` predefined symbol. (Alternatively, you could just add it to the symbol table and let your identifier rule catch it. In the case of Perl, `self` doesn't look like a variable, so...) == Defining and Categorizing Operators == The simplest part of defining operators is specifying what they look like: {{{ token infix:sym<+> { } }}} Note that the `` target is magic, and just re-uses the `:sym<>` part of the name. Once you specify the appearance, you'll need to specify any other parsing details. As mentioned above, this parsing is doubly important for `ternary` and `postcircumfix` operators, where you have to parse part of the internals: {{{ token infix:sym { # Perl-like ?? !! ternary '??' <.ws> '!!' } }}} But also, you need to tell the Optable parser the precedence and associativity of your operator, along with any other details you might want to use internally. First, put an INIT block into your grammar that sets up these definitions: {{{ grammar My::Grammar is HLL::Grammar; INIT { My::Grammar.O( ':prec<09>, :assoc', '%negative' ); My::Grammar.O( ':prec<07>, :assoc', '%multiplicative' ); My::Grammar.O( ':prec<05>, :assoc', '%additive' ); } }}} Doing this isn't a strict requirement, but it's recommended since the information stored here will have to be specified in a bunch of places. Then, in your operator targets, you re-use the '%'-tags you defined here: {{{ token infix:sym<*> { } token infix:sym<+> { } token infix:sym<-> { } token infix:sym { } token infix:sym
{ } }}} As mentioned, you can explicitly set the precedence and associativity on each op, but that'd be silly. === Precedence === Note that the precedence (`:prec<...>`) is a ''string.'' And it gets compared as a string, not a number or any other magic foo. So if you have a precedence of 9 and a precedence of 10, you had better either start zero-padding your precedence codes, or expect to be surprised. The precedence strings are compared asciibetically, so 0 < 9 < A < a < z. Note that for `unary` operators, the precedence is used to determine whether a prefix or postfix operator is applied first. So in code like `*x++` the precedence determines whether the result is `(*x)++` or `*(x++)`. Obviously, this also applies to postcircumfix, so `*x[1]` is subject to the same rules. === Associativity === The `:assoc<...>` field specifies associativity. Possible values are `unary,` `left,` `right,` and `list`. These control how the expression parser ''reduces'' the list of operators and terms. ==== unary ==== A unary association means that the operator associates only with the single term it applies to (post- or pre- fix-ishly). '''IMPORTANT:''' Never mark a unary operator as anything but unary associative. Even if the language grammar specifies "oh, this operator is left-associative", you must mark it as `unary`. Marking it as unary tells the optable "don't look for anything else but this", whereas marking it `left` or `right` would imply "there should be another operand somewhere, you should look for it" - resulting in painful horrible death, or worse. ==== left, right ==== A left- or right- associative operator will require two or more arguments, but will group them differently when they appear in bunches. A left association, like addition, will create a group starting on the left: `a + b + c` becomes `(a+b) +c`. A right association, like exponentiation, groups the other way: `x ** y ** z` becomes `(x** (y**z))`. It makes a difference: consider 2 ** 2 ** 3. In a left-grouping scenario, you get (2**2=4)**3 = 64. But a right-grouping scenario gives you 2**(2**3=8) = 256. '''IMPORTANT:''' See note in unary, above. ==== list ==== List associativity is a special treatment, mainly for the comma(`,`) operator. Instead of building a tree, the expression parser will build an array with the operands in order: {{{ 'a', 'b', 'c' }}} produces an array `[ 'a', 'b', 'c' ]`. The parser is smart enough to recognized different operators, so if you define, say, semicolon and comma as different list operators, you will get different lists: {{{ 1, 1; 2, 3; 5, 8 }}} produces `[ [1,1], [2,3], [5,8] ]`, I think. (Haven't bothered to test it, but the code is there.) == Operator Types == === circumfix === {{{ token circumfix:sym<( )> { '(' ~ ')' } }}} Because circumfix is implemented as a part of `term` instead of as an operator, there is no need to worry about precedence or associativity. On the other hand, you ''do'' need to worry about the difference between circumfix () and ''post''circumfix (). === infix === {{{ token infix:sym<*> { } token infix:sym<**> { } token infix:sym { '?' ':' } }}} This is the classic operator format. Many of the all-time greats were infix binary ops: `plus(+),` `minus(-),` even `bitwise exclusive or(^)` was an infix binary operator. And yes, ternary is implemented as an infix binary with a ''lot'' of special processing inside. === postcircumfix === {{{ token postcircumfix:sym<[ ]> { '[' ~ ']' $index= } token postcircumfix:sym<( )> { '(' ~ ')' } }}} Note in particular that many programming languages have slightly different parsing rules inside `postcircumfix` expressions. For example, C-like languages have to set a precedence threshold when parsing function calls, because the comma expression operator is not allowed - comma is an argument separator in postcircumfix-(). Similar logic may apply for named parameters and for separating multiple indices of a complex array - `a[i, j].` === postfix & prefix === {{{ token postfix:sym { } token prefix:sym<+> { } token prefix:sym<-> { } }}} When defining pre- and postfix operators, be aware of the possibility that you may conflict with a binary operator of the same name. In that case, language specifications generally give precedence to the unary form. '''IMPORTANT:''' While you can re-use the %-tags for different operators, ''do not'' try to re-use the %-tags at different levels. They will take on only one value, and leave you trying to debug the internals of the expression parser. One you ''assign'' a %-tag in your INIT-block, you should never try to re-defined it. On the other hand, multiple operators can reference the same %-tag. It's not used to specify ''behavior,'' it's used to specify attributes. === term === {{{ token term:sym { } }}} Terms are the 'atomic' parts of an expression. They may be more like the 'subatomic' parts, if you weren't expecting to see things like ''postcircumfix-()''. But whether they are quarks or atoms, they're definitely the place where the expression parser either stops, or recurses on itself. One thing to watch out for is that your terms should pretty much always reference another rule, instead of doing any matching themselves. This is because for rules like literals and identifiers, you are ''very'' likely to need to recognize the same thing in a different, non-expression context. (Many languages have a special syntax for parameter or variable declarations that is definitely ''not'' an expression.) So put that knowledge in a separate rule, and call it from within `term`.