Changes between Version 1 and Version 2 of NQP-rx Operator Precedence Parsing

Show
Ignore:
Timestamp:
03/22/10 06:22:20 (4 years ago)
Author:
Austin_Hastings
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • NQP-rx Operator Precedence Parsing

    v1 v2  
    6161==== circumfix and postcircumfix ==== 
    6262 
    63 It would be tempting to treat circumfix and postcircumfix as variations of `term,` but they clearly aren't. So yes, there's special handling for them. It follows the well-established pattern if adding -ish and adding options: 
     63It would be tempting to treat circumfix and postcircumfix as variations of `term,` and circumfix is. But postcircumfix is not, there's special handling. It follows the well-established pattern if adding -ish and adding options: 
    6464{{{ 
    6565rule termish { 
     
    7474} 
    7575}}} 
    76 W-wh-wha-what? ''Prefixish?''  
    77  
    78 Heh, gotcha! The only thing in prefixish is some whitespace - there's no precircumfix. At least, not yet. 
     76W-wh-wha-what? ''Prefix-ish?'' Heh, gotcha! The only thing in prefixish is some whitespace - there's no precircumfix. At least, not yet. 
    7977{{{ 
    8078token prefixish { 
     
    8280} 
    8381}}} 
    84 Hey! If you're going to let the coder write `++ ++x`, you may as well encourage some whitespace... 
     82(Hey! If you're going to let the coder write `++ ++x`, you may as well encourage some whitespace...) 
    8583 
    8684And what about `circumfix,` you ask? Well, that turns out to be just a variation on `term`: 
     
    8886token term:sym<circumfix> { <circumfix> } 
    8987}}} 
    90 Presto! 
    9188 
    9289==== Summary ==== 
     
    106103token term:sym<circumfix> { <circumfix> } 
    107104}}} 
    108 This means that any circumfix you define automatically becomes a term. If you don't like that behavior, you need to *not* define your operation as a `circumfix`. 
     105This means that any circumfix you define automatically becomes a term. If you don't like that behavior, you need to '''not''' define your operation as a `circumfix`. 
    109106 
    110107=== Identifiers === 
    111108 
    112 You get the `<ident>` target for free, which recognizes C-like names (which are the same names used by pretty much every other recent programming language on the planet: letter or underscore to start, then 0 or more letter, underscore, or digit). But that rule is '''not''' automatically a part of `<term>` since maybe you have a different rule for variable names (like, Hello! Perl?). 
     109You get the `<ident>` target for free - inherited by grammars from Regex::Cursor, which recognizes C-like names (which are the same names used by pretty much every other recent programming language on the planet: letter or underscore to start, then 0 or more letter, underscore, or digit). But that rule is '''not''' automatically a part of `<term>` since maybe you have a different rule for variable names (like Perl). 
    113110 
    114111So an easy win is: 
     
    116113token term:sym<identifier> { <ident> } 
    117114}}} 
    118  
     115Unless you've got non-C-like identifier rules, in which case you'll want to do: 
     116{{{ 
     117token term:sym<identifier> { <identifier> } 
     118 
     119token identifier { # your rule goes here 
     120} 
     121}}} 
    119122=== Literals === 
    120123 
    121 If you can use literals in expressions, you'll need to add 'em: 
     124If you can use literals in expressions, you'll need to add them to `term:` 
    122125{{{ 
    123126token term:sym<integer> { <integer> } 
     
    126129}}} 
    127130 
    128 Please note that while `<integer>` and `<dec_number>` are provided for free, quoted-string handling is a little more complex. I'm skipping it. 
     131Please note that while `<integer>` and `<dec_number>` are provided for free, quoted-string handling is a little more complex. I'm skipping it here - maybe another page? 
     132 
     133=== Predefined symbols === 
     134 
     135You may want to make some predefined symbols explicitly terms, especially if they can only be used in an expression context. Perl does this with the `self` predefined symbol. (Alternatively, you could just add it to the symbol table and let your identifier rule catch it. In the case of Perl, `self` doesn't look like a variable, so...) 
    129136 
    130137== Defining and Categorizing Operators == 
     
    137144Note that the `<sym>` target is magic, and just re-uses the `:sym<>` part of the name. 
    138145 
    139 Once you specify the appearance, you'll need to specify any other parsing details. As mentioned above, this is important for `ternary` and `postcircumfix` operators, where you have to parse part of the internals. 
    140 {{{ 
    141 token infix:sym<?? !!> { 
     146Once you specify the appearance, you'll need to specify any other parsing details. As mentioned above, this parsing is doubly important for `ternary` and `postcircumfix` operators, where you have to parse part of the internals: 
     147{{{ 
     148token infix:sym<?? !!> { # Perl-like ?? !! ternary 
    142149    '??'  
    143150    <.ws> 
     
    167174token infix:sym<div> { <sym> <O('%multiplicative')> } 
    168175}}} 
     176As mentioned, you can explicitly set the precedence and associativity on each op, but that'd be silly. 
     177 
     178=== Precedence === 
     179 
     180Note that the precedence (`:prec<...>`) is a ''string.'' And it gets compared as a string, not a number or any other magic foo. So if you have a precedence of 9 and a precedence of 10, you had better either start zero-padding your precedence codes, or expect to be surprised. The precedence strings are compared asciibetically, so 0 < 9 < A < a < z. 
     181 
     182Note that for `unary` operators, the precedence is used to determine whether a prefix or postfix operator is applied first. So in code like `*x++` the precedence determines whether the result is `(*x)++` or `*(x++)`. Obviously, this also applies to postcircumfix, so `*x[1]` is subject to the same rules. 
     183 
     184=== Associativity === 
     185 
     186The `:assoc<...>` field specifies associativity. Possible values are `unary,` `left,` `right,` and `list`. These control how the expression parser ''reduces'' the list of operators and terms. 
     187 
     188==== unary ==== 
     189 
     190A unary association means that the operator associates only with the single term it applies to (post- or pre- fix-ishly). 
     191 
     192'''IMPORTANT:''' Never mark a unary operator as anything but unary associative. Even if the language grammar specifies "oh, this operator is left-associative", you must mark it as `unary`. Marking it as unary tells the optable "don't look for anything else but this", whereas marking it `left` or `right` would imply "there should be another operand somewhere, you should look for it" - resulting in painful horrible death, or worse. 
     193 
     194=== left, right === 
     195 
     196A left- or right- associative operator will require two or more arguments, but will group them differently when they appear in bunches. A left association, like addition, will create a group starting on the left:  `a + b + c` becomes `(a+b) +c`. A right association, like exponentiation, groups the other way:  `x ** y ** z` becomes `(x** (y**z))`.  
     197 
     198It makes a difference: consider 2 ** 2 ** 3. In a left-grouping scenario, you get (2**2=4)**3 = 64. But a right-grouping scenario gives you 2**(2**3=8) = 256. 
     199 
     200'''IMPORTANT:''' See note in unary, above. 
     201 
     202=== list === 
     203 
     204List associativity is a special treatment, mainly for the comma(`,`) operator. Instead of building a tree, the expression parser will build an array with the operands in order: 
     205{{{ 
     206   'a', 'b', 'c' 
     207}}} 
     208produces an array `[ 'a', 'b', 'c' ]`. The parser is smart enough to recognized different operators, so if you define, say, semicolon and comma as different list operators, you will get different lists: 
     209{{{ 
     210   1, 1; 2, 3; 5, 8 
     211}}} 
     212produces `[ [1,1], [2,3], [5,8] ]`, I think. (Haven't bothered to test it, but the code is there.) 
     213 
     214== Operator Types == 
     215 
     216=== circumfix === 
     217{{{ 
     218token circumfix:sym<( )>  { '(' ~ ')' <EXPR> } 
     219}}} 
     220 
     221Because circumfix is implemented as a part of `term` instead of as an operator, there is no need to worry about precedence or associativity. On the other hand, you ''do'' need to worry about the difference between circumfix () and ''post''circumfix (). 
     222 
     223=== infix === 
     224{{{ 
     225token infix:sym<*>   { <sym> <O('%multiplicative')> } 
     226 
     227token infix:sym<**>  { <sym> <O('%exponent')> } 
     228 
     229token infix:sym<? :> { '?' <EXPR('i=')> ':' <O('%ternary')> } 
     230}}} 
     231 
     232This is the classic operator format. Many of the all-time greats were infix binary ops: `plus(+),` `minus(-),` even `bitwise exclusive or(^)` was an infix binary operator. 
     233 
     234And yes, ternary is implemented as an infix binary with a ''lot'' of special processing inside. 
    169235 
    170236=== postcircumfix === 
    171237{{{ 
    172 token postcircumfix:sym<[ ]> { '[' <EXPR> ']' } 
     238token postcircumfix:sym<[ ]> { '[' ~ ']' $index=<EXPR><O('%array_index')> } 
     239 
     240token postcircumfix:sym<( )> { '(' ~ ')' <arglist> <O('%function_call')> } 
    173241}}} 
    174242 
    175243Note in particular that many programming languages have slightly different parsing rules inside `postcircumfix` expressions. For example, C-like languages have to set a precedence threshold when parsing function calls, because the comma expression operator is not allowed - comma is an argument separator in postcircumfix-(). Similar logic may apply for named parameters and for separating multiple indices of a complex array - `a[i, j].` 
    176244 
    177 === infix binary === 
    178  
    179 This is the classic operator format. Many of the all-time greats were infix binary ops: `plus(+),` `minus(-),` even `bitwise exclusive or(^)` was an infix binary operator. 
    180  
    181 An infix binary operator goes `in` between the ''two'' (hence, binary) operands, like:  `x * y` or `a + 1`. 
    182  
    183  
    184  
     245=== postfix & prefix === 
     246{{{ 
     247token postfix:sym<?>  { <sym> <O('%smart_nullification')> } 
     248 
     249token prefix:sym<+>   { <sym> <O('%unary_plus')> } 
     250token prefix:sym<->   { <sym> <O('%unary_plus')> } 
     251}}} 
     252 
     253When defining pre- and postfix operators, be aware of the possibility that you may conflict with a binary operator of the same name. In that case, language specifications generally give precedence to the unary form.  
     254 
     255'''IMPORTANT:''' While you can re-use the %-tags for different operators, ''do not'' try to re-use the %-tags at different levels. They will take on only one value, and leave you trying to debug the internals of the expression parser. One you ''assign'' a %-tag in your INIT-block, you should never try to re-defined it. 
     256 
     257On the other hand, multiple operators can reference the same %-tag. It's not used to specify ''behavior,'' it's used to specify attributes.  
     258 
     259=== term === 
     260 
     261{{{ 
     262token term:sym<identifier> { <identifier> } 
     263}}} 
     264 
     265Terms are the 'atomic' parts of an expression. They may be more like the 'subatomic' parts, if you weren't expecting to see things like ''postcircumfix-()''. But whether they are quarks or atoms, they're definitely the place where the expression parser either stops, or recurses on itself. 
     266 
     267One thing to watch out for is that your terms should pretty much always reference another rule, instead of doing any matching themselves. This is because for rules like literals and identifiers, you are ''very'' likely to need to recognize the same thing in a different, non-expression context. (Many languages have a special syntax for parameter or variable declarations that is definitely ''not'' an expression.) So put that knowledge in a separate rule, and call it from within `term`.