Ticket #1863 (closed feature: fixed)

Opened 11 years ago

Last modified 11 years ago

Parrot IO and encodings

Reported by: nwellnhof
Owned by: nwellnhof
Priority: normal
Milestone: 3.0
Component: core
Version: 2.10.0
Severity: medium
Keywords:
Cc:
Language:
Patch status:
Platform:

Description

The FileHandle PMC is supposed to support different encodings via the 'encoding' method. Currently this only works for single-byte encodings and UTF-8; UTF-16, UCS-2 and UCS-4 are not supported. UCS-2 and UCS-4 are fixed-width and should be pretty easy to support, but UTF-16 would need something like the code in src/io/utf8.c. It would be cleaner to move that logic to the string code. We would only need an additional function in the encoding vtable that can partially decode incomplete variable-width strings.
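To illustrate what "partially decode" could mean for UTF-16, here is a minimal Python sketch. It is not Parrot code: the function name utf16le_complete_prefix is hypothetical, it assumes little-endian input, and it only inspects the tail of the buffer rather than validating all of it.

```python
def utf16le_complete_prefix(buf: bytes) -> int:
    """Return how many leading bytes of buf end on a UTF-16LE character boundary.

    Hypothetical helper for illustration; only the tail is inspected.
    """
    n = len(buf) - (len(buf) % 2)          # ignore an odd trailing byte
    if n >= 2:
        last_unit = buf[n - 2] | (buf[n - 1] << 8)
        if 0xD800 <= last_unit <= 0xDBFF:  # high surrogate still waiting for its pair
            n -= 2
    return n


data = "a\U0001F600".encode("utf-16-le")   # 2 bytes for 'a', 4 for the surrogate pair
chunk = data[:5]                           # cut in the middle of the pair
cut = utf16le_complete_prefix(chunk)
text = chunk[:cut].decode("utf-16-le")     # the part that can be decoded now
leftover = chunk[cut:]                     # bytes to keep until more input arrives
```

The caller decodes the complete prefix and carries the leftover bytes over to the next read, which is roughly the job an extra encoding vtable entry would have to do.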

One thing that I don't like is that the 'read' method works on bytes, not on characters. I don't think this makes sense for multi-byte encodings.

We should also consider supporting encodings for the Socket and StringHandle PMCs.

Change History

Changed 11 years ago by nwellnhof

  • owner set to nwellnhof
  • status changed from new to assigned
  • milestone set to 3.0

I implemented most of the ideas above in branch nwellnhof/unicode_io.

Changed 11 years ago by nwellnhof

nwellnhof/unicode_io is merged now.

My next plan is to make FileHandle.read accept character counts instead of byte counts. Then we should look for a way to use the encoding code for Sockets.

We have to buffer socket recvs for variable-length encodings, but we shouldn't buffer sends in most cases. So we should support different settings for input and output buffering. Socket recvs also shouldn't treat partial characters at the end of the buffer as an error if no more bytes can be read. All these things apply to pipes as well.
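A rough Python sketch of that asymmetry, using the standard codecs incremental decoder. The class DecodingReceiver and its method names are made up for illustration and say nothing about how Parrot's Socket PMC is or should be structured.

```python
import codecs


class DecodingReceiver:
    """Hypothetical receive side: buffers trailing partial characters."""

    def __init__(self, encoding="utf-8"):
        self._encoding = encoding
        self._decoder = codecs.getincrementaldecoder(encoding)()

    def feed(self, chunk: bytes) -> str:
        # final=False: an incomplete character at the end is kept, not an error.
        return self._decoder.decode(chunk, final=False)

    def pending(self) -> bytes:
        # Bytes received so far that do not yet form a complete character.
        return self._decoder.getstate()[0]

    def encode_for_send(self, text: str) -> bytes:
        # Output side: encode and hand over immediately, no buffering.
        return text.encode(self._encoding)


rx = DecodingReceiver()
first = rx.feed(b"h\xc3")      # 0xC3 starts a two-byte UTF-8 character
held = rx.pending()            # the lone 0xC3 is still waiting
second = rx.feed(b"\xa9!")     # the continuation byte completes it
```

If the peer closes the connection while pending() is non-empty, the caller can decide what to do with those bytes instead of getting an exception from the decoder.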

I wouldn't share any of the read or readline code between FileHandles and StringHandles. Full encoding support for StringHandles should be quite easy to implement on its own.

Changed 11 years ago by nwellnhof

Maybe it's better that FileHandle.read stays like it is and accepts byte counts, but throws if it encounters a multi-byte encoded buffer that doesn't fit in the requested size. Reading characters can be done with a new method FileHandle.read_chars.
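A small Python model of the two methods, assuming UTF-8 and an in-memory buffer. Utf8Handle and the helper _utf8_char_len are hypothetical names, and the choice of ValueError is only a stand-in for whatever exception Parrot would throw.

```python
import io


def _utf8_char_len(lead: int) -> int:
    """Length in bytes of the UTF-8 character that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("invalid UTF-8 lead byte")


class Utf8Handle:
    """Hypothetical handle with a byte-based read and a character-based read_chars."""

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)

    def read(self, nbytes: int) -> str:
        raw = self._buf.read(nbytes)
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # The requested size splits a character: undo the read and throw.
            self._buf.seek(-len(raw), io.SEEK_CUR)
            raise ValueError("byte count does not end on a character boundary")

    def read_chars(self, nchars: int) -> str:
        out = []
        for _ in range(nchars):
            lead = self._buf.read(1)
            if not lead:
                break                       # end of data
            rest = self._buf.read(_utf8_char_len(lead[0]) - 1)
            out.append((lead + rest).decode("utf-8"))
        return "".join(out)


h = Utf8Handle("a\u00e9\u20ac".encode("utf-8"))  # 1-byte, 2-byte and 3-byte characters
one = h.read(1)                                  # fits: a single ASCII byte
try:
    h.read(1)                                    # would split the 2-byte character
    split_rejected = False
except ValueError:
    split_rejected = True
two = h.read_chars(2)                            # reads both remaining characters
```

In this sketch a failed read leaves the position unchanged, so the caller can retry with a larger size or switch to read_chars.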

readline() for sockets and pipes with multi-byte characters is even more complicated than I thought. It can happen that we have a partial multi-byte char at the end of an input buffer and that the following read/recv doesn't return enough bytes to complete the char.
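The situation can be modelled in Python with an incremental decoder that keeps incomplete bytes across any number of short reads. The generator iter_lines below is a hypothetical sketch, with a list of byte chunks standing in for successive recv results. It silently drops an incomplete character left at end of input, which is only one of several possible policies.

```python
import codecs


def iter_lines(chunks, encoding="utf-8"):
    """Yield decoded lines from byte chunks that may split characters anywhere."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pending = ""                                # decoded text without a newline yet
    for chunk in chunks:
        # May return "" several times in a row while a character is incomplete.
        pending += decoder.decode(chunk)
        while True:
            line, sep, rest = pending.partition("\n")
            if not sep:
                break
            yield line + sep
            pending = rest
    if pending:
        yield pending                           # last line without a newline


data = "\u00e9\nb\nc".encode("utf-8")
one_byte_reads = [bytes([b]) for b in data]     # worst case: one byte per read
lines = list(iter_lines(one_byte_reads))
```

A line is only handed out once its newline has been decoded, so a read that ends in the middle of a character simply delays the line rather than corrupting it.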

Also see IOTasklist in the Wiki.

Changed 11 years ago by jkeenan

nwellnhof,

Is this ticket closable? Or, if there are loose ends, can they be moved to a new, more focused ticket?

Thank you very much.

kid51

Changed 11 years ago by nwellnhof

This ticket can be closed. Encoding support for Socket and StringHandle is still not implemented. We could open another ticket for that.

Changed 11 years ago by nwellnhof

  • status changed from assigned to closed
  • resolution set to fixed