Version 24 (modified by kjs, 12 years ago)


PIRC Introduction

PIRC is a fresh implementation of the PIR language. It is being developed as a replacement for the current PIR compiler, IMCC. We all hope to finish it at some point in the future, but some help is needed. Most of the tricky parts have already been done for you, such as implementing all sorts of weird features of the PIR language.

The basic workflow of PIRC is as follows. The lexer and parser are implemented with Flex and Bison specifications. During the parsing phase, a data structure is built that represents the input. To stick with compiler jargon, let's call this the Abstract Syntax Tree (AST). After the parse, this AST is traversed and for each instruction the appropriate bytecode is emitted. Registers are allocated by the built-in vanilla register allocator. This means that for the following code:

.sub main
  $S12 = "Hi there"
  print $S12
  $I44 = 42
  print $I44
.end

$S12 and $I44 will be mapped to the registers S0 and I0 respectively (yes, you guessed it, it starts allocating from 0). As you would expect, the vanilla register allocator is pretty stupid, but the generated bytecode is not too bad, really. If you want to optimize the register usage (which saves runtime memory), you can activate the register optimizer. The register optimizer is based on a linear scan register allocator. The original algorithm, as described in this paper, assumes a fixed number of available registers. Since Parrot has a variable number of registers available per subroutine, the algorithm has been adapted here and there. See the file [source:/trunk/compilers/pirc/src/pirregalloc.c] for the implementation.
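The two allocation strategies can be illustrated with a small Python sketch. It is not taken from pirregalloc.c: the function names, the regular expression and the "first appearance" ordering are assumptions made for illustration, and the second function only shows the general idea of linear scan when the number of registers is not fixed (a register is reused once its previous live interval has ended, and a new one is taken otherwise, so nothing is ever spilled).

```python
import re

# Hypothetical sketch, not PIRC's actual code.
# Matches symbolic registers such as $S12 or $I44 (naively, also inside strings).
SYMREG = re.compile(r"\$([INSP])(\d+)")

def vanilla_allocate(pir_lines):
    """Number symbolic registers per type, from 0, in order of first appearance."""
    next_free = {"I": 0, "N": 0, "S": 0, "P": 0}
    mapping = {}
    for line in pir_lines:
        for match in SYMREG.finditer(line):
            symreg = match.group(0)
            if symreg not in mapping:
                kind = match.group(1)
                mapping[symreg] = f"{kind}{next_free[kind]}"
                next_free[kind] += 1
    return mapping

def linear_scan_unbounded(intervals):
    """Assign register numbers to live intervals given as name -> (first, last).

    Assumption: registers are unlimited, so an interval either reuses a
    register whose interval has expired or takes a brand-new one.
    """
    active = []      # (last_use, register) pairs that are still live
    free = []        # registers released by expired intervals
    next_new = 0
    assignment = {}
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        still_active = []
        for last_use, reg in active:
            if last_use < start:
                free.append(reg)          # interval ended, register is reusable
            else:
                still_active.append((last_use, reg))
        active = still_active
        if free:
            reg = min(free)
            free.remove(reg)
        else:
            reg = next_new
            next_new += 1
        assignment[name] = reg
        active.append((end, reg))
    return assignment

code = ['$S12 = "Hi there"', 'print $S12', '$I44 = 42', 'print $I44']
print(vanilla_allocate(code))   # {'$S12': 'S0', '$I44': 'I0'}
print(linear_scan_unbounded({"$I1": (0, 2), "$I2": (1, 3), "$I3": (3, 5)}))
# {'$I1': 0, '$I2': 1, '$I3': 0}
```

In the second call, $I3 reuses register 0 because $I1 is no longer live when $I3 is first used; the vanilla scheme would have handed out a third register.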

PIRC vs IMCC

While PIRC is an implementation of the PIR language which is specified in PDD19, there are some subtle differences with the current implementation, IMCC. In case you were wondering, IMCC stands for IMC Compiler, with IMC being the old name of the PIR language, standing for Intermediate Machine Code. The name was changed a long time ago.

  • "nested" heredocs can be handled by PIRC, but not by IMCC. Yes, it was very painful to implement, which is why IMCC doesn't.
  • comments or whitespace in the parameter lists are accepted by PIRC, but not by IMCC. It sounds like an easy fix, but it isn't. Hence, PIRC!
  • reentrant: PIRC is, IMCC is not.
  • checks for improper use of syntactic sugar with OUT arguments, such as "$S0 = print". PIRC checks for this, IMCC doesn't. Again, it sounds (and looks) like an easy fix, but it isn't.

Building and running PIRC

PIRC is located in compilers/pirc. In order to compile, do the following:

cd compilers/pirc
make
make test

At this point (August 5, 2009) some tests are failing, so don't be alarmed if you see failures.

In order to run PIRC:

./pirc -h
./pirc -b test.pir # will generate a file a.pbc

Enjoy!

PIRC Status

PIRC is not complete yet. All stages are implemented (lexer, parser, bytecode generator), but each of them needs some additional work. See the section below for the specific items that need to be fixed. Once these are fixed, PIRC will be about 98% done.

PIRC Development Tasks

Shouldn't-be-too-hard tasks

  • ticket #43: autoheaderize all PIRC sources.
  • ticket #55: decorate all function arguments with ARGIN macros etc.
  • write tests for the generated output.

Hardcore hacking tasks

  • Fix parser to "calculate" the right signature for ops such as:
    $P0 = new ['Integer']
    

Currently, the argument is encoded as "_ksc", for key, string-constant.

  • Convert all C strings in PIRC into STRINGs. All identifiers and strings that are scanned should be stored as STRING objects, not C strings.
  • Fix ticket #198. It seems that when there is a sequence of more than one instruction dealing with STRINGs or NUMs, the resulting bytecode segfaults. Apparently, PIRC is emitting the wrong bytecode. Bug #186 is related to this issue.
  • Fix ticket #173. Lexicals are not stored correctly in the generated bytecode. The code for storing the lexicals is taken from IMCC, and therefore it doesn't come as a complete surprise it's not working. However, I don't see what's wrong.
  • Fix ticket #14. Braced arguments to macros are not handled correctly. Nested macro expansion isn't correctly handled yet.
  • Fix ticket #163. Keyed multi types must be implemented.

PIRC Internals

In this section, PIRC's guts are dissected in order to explain what exactly is going on under the hood. If you are interested in the nitty-gritty details, keep on reading. (Note that this is a work-in-progress and will take some time to be completed.)

PIRC Lexer

Heredoc processor

The Heredoc processor has only one task: flattening heredoc strings. By "flattening", I mean the following. This string:

 $S0 = <<'EOS'
This is
 a multi-line
  heredoc
   string
    with
     increasing
      indention
       on each line.
EOS

is "flattened" into:

 $S0 = "This is\n a multi-line\n  heredoc\n   string\n    with\n     increasing\n      indention\n       on each line."

Note that "newline" characters are inserted as well, so that the string is equivalent to the original heredoc string. Besides assigning heredoc strings to String registers, the PIR specification also allows you to use heredoc strings as arguments in subroutine invocations:

.sub main
  foo(<<'A')
This is a heredoc
string argument
A
.end

.sub foo
 # ...
.end

Again, the heredoc string (delimited by the string "A") will be flattened. According to the PIR specification, you can even pass multiple heredoc string arguments, like so:

.sub main
  foo(<<'A', 42, <<'B', 3.14, <<'C')
 I have a Parrot
A
 It is not a bird
B
 It is a virtual machine
C
.end

Note that the heredoc arguments may be mixed with other, simple arguments such as integers and numbers. In the rest of this section, the implementation will be discussed.
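Before looking at the real implementation, the intended effect of flattening can be sketched in a few lines of Python. This is an illustration only, not the Flex code in hdocprep.l: the function name, the regular expression for the heredoc marker, the escaping of backslashes and double quotes, and the choice to join lines with an escaped newline without adding a trailing one are all assumptions.

```python
import re

# Hypothetical sketch of heredoc flattening; not the actual PIRC preprocessor.
# A marker looks like <<'NAME' or <<"NAME".
MARKER = re.compile(r"""<<(['"])(\w+)\1""")

def flatten_heredocs(source):
    """Replace every heredoc marker by a one-line, double-quoted string."""
    lines = source.split("\n")
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        i += 1
        pieces = []
        pos = 0
        # Heredoc bodies follow the line in the order of their markers.
        for marker in MARKER.finditer(line):
            delimiter = marker.group(2)
            body = []
            while i < len(lines) and lines[i] != delimiter:
                body.append(lines[i])
                i += 1
            if i == len(lines):
                raise ValueError(f"heredoc '{delimiter}' is not terminated")
            i += 1  # skip the delimiter line itself
            # Assumption: escape backslashes and double quotes in the body.
            text = "\\n".join(
                s.replace("\\", "\\\\").replace('"', '\\"') for s in body
            )
            pieces.append(line[pos:marker.start()] + '"' + text + '"')
            pos = marker.end()
        pieces.append(line[pos:])  # the "rest of the line" after the last marker
        out.append("".join(pieces))
    return "\n".join(out)

example = "\n".join([
    ".sub main",
    "  foo(<<'A', 42, <<'B')",
    " I have a Parrot",
    "A",
    " It is not a bird",
    "B",
    ".end",
])
print(flatten_heredocs(example))
```

The call line keeps its simple arguments (42) in place, and only the markers are replaced by the flattened strings.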

Heredoc parsing implementation

The implementation of the Heredoc preprocessor can be found in [source:/trunk/compilers/pirc/src/hdocprep.l]. It is a Lex/Flex lexer specification, which means you need the Flex program to generate the C code for this preprocessor. The preprocessor takes a PIR file that contains heredoc strings, and flattens out all heredoc strings. It writes a temporary file to disk that is exactly the same as the original PIR file, except that all heredoc strings are flattened.

For this discussion, it is assumed you have a basic understanding of the Flex program. For instance, you need to know what "state" means in the context of Flex. If you don't know, please refer to the Flex documentation page.

In order to make the heredoc preprocessor reentrant, no global variables are used. Instead, lines [source:/trunk/compilers/pirc/src/hdocprep.l#L83 83 to 98] define a struct global_state. The comments in the code briefly describe what each field is for, and the fields will be discussed in more detail later, when we walk through the actual processing of the heredocs. A new instance of this struct can be created by invoking [source:/trunk/compilers/pirc/src/hdocprep.l#L157 init_global_state]. For now, it is useful to know that this struct has a pointer to a Parrot interpreter object, the name of the file being processed, and a pointer to the output file.

The function [source:/trunk/compilers/pirc/src/hdocprep.l#L208 process_heredocs] is the main function of the heredoc preprocessor that the main compiler program (PIRC) invokes. This function opens the file to be processed, initializes the lexer, creates a new global_state struct instance, as described above, invokes the lexer to do the processing and cleans up afterwards.
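The idea of keeping all lexer state in one object instead of in global variables can be sketched as follows. This is a Python illustration, not a transcription of the C struct: the field names interp, filename and outfile are hypothetical, while delimiter, the heredoc buffer and linebuffer are named after the fields mentioned in the text; the real struct may contain more fields.

```python
import io
from dataclasses import dataclass

# Hypothetical Python counterpart of struct global_state; field names other
# than delimiter and linebuffer are assumptions made for illustration.
@dataclass
class GlobalState:
    interp: object            # stands in for the Parrot interpreter pointer
    filename: str             # name of the file being processed
    outfile: io.TextIOBase    # where the flattened output is written
    delimiter: str = ""       # delimiter of the heredoc being scanned
    heredoc: str = ""         # heredoc contents collected so far
    linebuffer: str = ""      # saved "rest of the line"

def init_global_state(interp, filename, outfile):
    """Create a fresh state object; nothing is shared between instances."""
    return GlobalState(interp=interp, filename=filename, outfile=outfile)

# Two preprocessor runs can be in progress at the same time without
# interfering, because each one carries its own state.
first = init_global_state(None, "a.pir", io.StringIO())
second = init_global_state(None, "b.pir", io.StringIO())
first.delimiter = "EOS"
second.delimiter = "A"
print(first.delimiter, second.delimiter)  # EOS A
```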

We will now walk through two different scenarios, in order to simplify the discussion. Scenario 1 discusses the case of single heredoc parsing, and Scenario 2 discusses multiple heredoc parsing. Multiple heredoc parsing starts out like Scenario 1, but is a bit more advanced.

Scenario 1a: single heredoc string parsing

Consider the following input:

.sub main
  $S0 = <<'EOS'
This
is
a
heredoc
string.

EOS
.end
 

The lexer starts out in the INITIAL state by default (as per Flex specification). When reading input such as <<'EOS', the rule on [source:/trunk/compilers/pirc/src/hdocprep.l#L306 line 306] is activated. The actual string ("EOS") is stored in the field state->delimiter, and an escaped newline character is stored in the heredoc buffer.

Since the preprocessor does not build a data structure representing the input, but instead writes the output directly (to a file), the "rest of the line" needs to be stored somewhere. This is because the <<'EOS' heredoc token is basically a placeholder for the actual (heredoc) string contents. Hence, the [source:/trunk/compilers/pirc/src/hdocprep.l#L318 activation of SAVE_REST_OF_LINE state].

The state SAVE_REST_OF_LINE has only one function, and that is to SAVE the REST OF the LINE :-). It will match all the text after the <<'EOS' heredoc marker up to and including the end-of-line character. This text, together with an additional "\n" character, is stored in the linebuffer field, which always contains the "rest of the line". As you can see, in this scenario there is no "rest of the line", except for the end-of-line character ("\n", or "\r\n" on Windows). See Scenario 1b below for a variant on this, in which the "rest of the line" contains a closing parenthesis of a subroutine invocation.

After the heredoc marker the actual heredoc string must be scanned, hence the activation of the HEREDOC_STRING state on [source:/trunk/compilers/pirc/src/hdocprep.l#L331 line 331]. In the state HEREDOC_STRING, there are three different types of input:

  1. "end-of-line" characters, basically an empty line (see [source:/trunk/compilers/pirc/src/hdocprep.l#L357 line 357]). An escaped newline character ("\\n") will be stored as part of the heredoc string.
  2. "normal" heredoc string lines (see [source:/trunk/compilers/pirc/src/hdocprep.l#L376 line 376]). The line just read may be the heredoc string delimiter that was stored earlier, so the newline character is first chopped off (see [source:/trunk/compilers/pirc/src/hdocprep.l#L381 lines 381-384]) and the result is compared with the stored delimiter. If they match, we need to continue scanning the "rest of the line" that was saved earlier. Because we need to switch back to the current buffer later, this buffer is stored first ([source:/trunk/compilers/pirc/src/hdocprep.l#L395 line 395]). The lexer's state is changed to SCAN_STRING, since we're going to scan a saved string, and the lexer is told to read its next input from the string buffer ([source:/trunk/compilers/pirc/src/hdocprep.l#L406 line 406]). If the line is not the heredoc delimiter, it is just a line that's part of the heredoc string, which needs to be stored. In that case, a new buffer is allocated to hold the heredoc string so far plus the line that's just been scanned, and the old buffer is released.
  3. End of file ([source:/trunk/compilers/pirc/src/hdocprep.l#L423 line 423]). When the lexer encounters end-of-file while still inside a heredoc string, an error is printed to the screen, and the lexer terminates.

Once the heredoc string has been completely scanned, the SCAN_STRING state is activated. Again, there's a number of different input patterns that may be scanned:

  1. Another heredoc marker (<<{Q_STRING}, [source:/trunk/compilers/pirc/src/hdocprep.l#L428 line 428]). See Scenario 2 for a discussion of this.
  2. End of line ([source:/trunk/compilers/pirc/src/hdocprep.l#L447 line 447]). Nothing is done.
  3. Any character ([source:/trunk/compilers/pirc/src/hdocprep.l#L449 line 449]). The character (for instance, a parenthesis) is written to the output.
  4. End of file ([source:/trunk/compilers/pirc/src/hdocprep.l#L451 line 451]). End of file, in this context, means end of string: we've finished scanning the "rest of line" string buffer, so the lexer needs to switch back to reading its next input from the file. The lexer's state is also switched back to the default state (INITIAL).

This completes the processing of a single heredoc string.
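The walk through the states for a single heredoc can be summarized in a simplified Python sketch. It is not the Flex specification: it works on whole lines rather than on lexer patterns, it handles only one heredoc marker per line (Scenario 1), and the way newlines are written to the output is an assumption. The state names follow the text; the function name and the marker regular expression are hypothetical.

```python
import re

# Hypothetical, line-based imitation of the lexer states described above.
HEREDOC_MARKER = re.compile(r"""<<(['"])(\w+)\1""")

def preprocess_single_heredocs(text):
    out = []
    state = "INITIAL"
    delimiter = ""       # plays the role of state->delimiter
    heredoc_lines = []   # heredoc contents collected so far
    linebuffer = ""      # the saved "rest of the line"
    for line in text.splitlines(keepends=True):
        if state == "INITIAL":
            marker = HEREDOC_MARKER.search(line)
            if marker is None:
                out.append(line)                 # ordinary line, copied as-is
                continue
            out.append(line[:marker.start()])    # text before the marker
            delimiter = marker.group(2)
            heredoc_lines = []
            # SAVE_REST_OF_LINE: keep everything after the marker for later.
            linebuffer = line[marker.end():]
            state = "HEREDOC_STRING"
        else:  # HEREDOC_STRING
            stripped = line.rstrip("\r\n")       # chop off the newline
            if stripped == delimiter:
                out.append('"' + "\\n".join(heredoc_lines) + '"')
                # SCAN_STRING: replay the saved rest of the line, then
                # return to reading from the file in the INITIAL state.
                out.append(linebuffer)
                state = "INITIAL"
            else:
                heredoc_lines.append(stripped)   # part of the heredoc string
    if state == "HEREDOC_STRING":
        # End of file inside a heredoc is an error.
        raise SyntaxError(f"end of file in heredoc '{delimiter}'")
    return "".join(out)

source = ".sub main\n  foo(<<'EOS')\nThis\nis\n\nEOS\n.end\n"
print(preprocess_single_heredocs(source))
```

For the input shown, the saved rest of the line is the closing parenthesis plus the newline, which is written out right after the flattened string.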

Scenario 1b: single heredoc argument parsing

Scenario 1b is almost the same as Scenario 1a, except that instead of a heredoc string being assigned to some target (register), the heredoc string is an argument to a function. Consider the following input:

.sub main
  foo(<<'EOS')
This
is
a
heredoc
string.

EOS
.end
 

The process of parsing this heredoc string is pretty much the same as in Scenario 1a, except that the "rest of the line" contains the closing parenthesis ")" to close the argument list of the invocation of foo.

Scenario 2: multiple heredoc parsing

POD parsing

POD comments are filtered out from the input. This is implemented in [source:/trunk/compilers/pirc/src/hdocprep.l#L287 lines 287 to 301]. Note that [source:/trunk/compilers/pirc/src/hdocprep.l#L287 line 287] is very important: it matches a "=cut" directive (which ends a POD comment) in the INITIAL state (so, when no previous POD comment has been seen yet). If this pattern were not matched in the INITIAL state, the "=cut" directive would actually activate the POD state. This is because "=cut" starts with a "=", which is the first character of a POD directive ([source:/trunk/compilers/pirc/src/hdocprep.l#L289 see line 289]).
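The role of the extra "=cut" rule can be shown with a small Python sketch. It is a line-based approximation, not the Flex rules themselves: the function name is hypothetical, and it is assumed that a POD directive is a line starting with "=" followed by a letter and that a stray "=cut" line is simply dropped.

```python
import re

# Hypothetical sketch of POD filtering; the directive pattern is an assumption.
POD_DIRECTIVE = re.compile(r"^=[a-zA-Z]")

def strip_pod(text):
    out = []
    in_pod = False                      # False corresponds to INITIAL
    for line in text.splitlines(keepends=True):
        if in_pod:
            if line.startswith("=cut"):
                in_pod = False          # end of the POD comment
            continue                    # POD text is never written out
        if line.startswith("=cut"):
            # A "=cut" with no open POD block must be handled BEFORE the
            # generic directive test below, otherwise it would start a
            # POD comment and swallow the code that follows.
            continue
        if POD_DIRECTIVE.match(line):
            in_pod = True
            continue
        out.append(line)
    return "".join(out)

source = "=head1 NAME\nsome docs\n=cut\n.sub main\n.end\n=cut\nprint 1\n"
print(strip_pod(source))
```

Without the early "=cut" test, the second "=cut" in the example would open a POD block and the final line would be lost.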

include directives

The .include directive is logically a macro expansion directive. It takes one argument, which is the name of a file. If the .include directive is encountered, the lexer switches to the specified file, and starts reading from that file. Once the end of the file has been reached, the lexer switches back to the original file.

The .include directive is implemented in the heredoc preprocessor. This is necessary in order to be able to use heredoc strings in the included file. If the directive had been implemented in the normal PIR lexer (which implements macro expansion), then the heredoc preprocessor would have to be invoked first on the included file.

Once the .include directive is read, the lexer switches state from INITIAL to INCLUDE ([source:/trunk/compilers/pirc/src/hdocprep.l#L479 line 479]). This is done using the built-in state stack in the Flex-generated lexer. The INCLUDE state is pushed onto the state stack, and immediately activated. (Once the state is popped off, the lexer switches to the state that's then the new top-of-stack.) Since an included file can include other files, a stack is used to keep track of this. Four different input patterns are distinguished:

  1. whitespace ([source:/trunk/compilers/pirc/src/hdocprep.l#L483 line 483]). Whitespace is skipped.
  2. a quoted string, which is the name of the file to be included ([source:/trunk/compilers/pirc/src/hdocprep.l#L485 line 485]). Once the quoted string is stripped of its quotes, the file is located and the lexer will start processing that file.
  3. end of line ([source:/trunk/compilers/pirc/src/hdocprep.l#L528 line 528]). This would be the end-of-line after the quoted string that was included. Once this is encountered, the included file has already been completely processed. Therefore, the lexer's state is popped off the lexer state stack.
  4. any other character ([source:/trunk/compilers/pirc/src/hdocprep.l#L532 line 532]), resulting in an error message.
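The switching between files can be sketched with an explicit stack in Python. This is an illustration of the idea, not the Flex buffer handling: files are modelled as an in-memory dictionary, the function name and the regular expression are hypothetical, and the check against a file including itself is an addition that the text does not describe.

```python
import re

# Hypothetical sketch of .include handling with an explicit stack.
INCLUDE = re.compile(r'^\s*\.include\s+"([^"]+)"\s*$')

def expand_includes(name, files):
    """Return the contents of files[name] with all .include lines expanded."""
    out = []
    stack = [iter(files[name].splitlines())]   # one line iterator per open file
    open_names = [name]
    while stack:
        try:
            line = next(stack[-1])
        except StopIteration:
            # End of the included file: switch back to the including file.
            stack.pop()
            open_names.pop()
            continue
        match = INCLUDE.match(line)
        if match is None:
            out.append(line)
            continue
        target = match.group(1)                # file name without its quotes
        if target not in files:
            raise FileNotFoundError(target)
        if target in open_names:
            # Assumption: refuse circular includes instead of looping forever.
            raise ValueError(f"circular include of {target}")
        stack.append(iter(files[target].splitlines()))
        open_names.append(target)
    return "\n".join(out)

files = {
    "main.pir": '.include "a.pir"\n.sub main\n.end',
    "a.pir": '.include "b.pir"\n# from a',
    "b.pir": "# from b",
}
print(expand_includes("main.pir", files))
```

Because the stack holds one entry per file that is currently being read, nested includes resume exactly where the including file left off.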

Macro layer

PIRC Parser

Symbol Management

Constant Folding

Strength Reduction

Abstract Syntax Tree

Vanilla Register Allocator

Register Usage Optimizer

Bytecode Generation

Running code at compile time: the :immediate flag