Ticket #752 (closed bug: fixed)

Opened 6 years ago

Last modified 6 years ago

Parrot concatenates iso-8859-1 and utf8 incorrectly

Reported by: pmichaud
Owned by:
Priority: normal
Milestone:
Component: core
Version: 1.2.0
Severity: high
Keywords:
Cc:
Language: perl6
Patch status:
Platform:

Description

Parrot has difficulty concatenating iso-8859-1 and utf8 strings. Here's the test case:

$ cat x.pir
.sub 'main'
    $S0 = unicode:"\u00e5\u263b"

    $S1 = chr 0xe5
    $S2 = chr 0x263b
    $S3 = concat $S1, $S2

    if $S0 == $S3 goto equal
    print "not "
  equal:
    say "equal"
.end
$ ./parrot x.pir
Malformed UTF-8 string

current instr.: 'main' pc 13 (x.pir:7)
$ 

Note that the exception occurs at the point of the == comparison, not when the concatenation occurs. If one outputs the value of $S3, it comes out as four bytes (e5 e2 98 bb). The correct result should be five bytes (c3 a5 e2 98 bb) -- i.e., the iso-8859-1 string that comes back from chr(229) needs to be converted to utf8 before concatenation.
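
For readers who want to check the byte counts above outside Parrot, here is a minimal sketch in Python. Python only stands in for the byte arithmetic; it says nothing about Parrot's internals, and the helper name is_valid_utf8 is made up for this illustration.

```python
# Illustration only: Python stands in for Parrot's string handling.

# chr 0xe5 comes back as a one-byte iso-8859-1 string.
latin1_part = "\u00e5".encode("iso-8859-1")   # bytes: e5
# chr 0x263b does not fit in iso-8859-1, so it is held as utf8.
utf8_part = "\u263b".encode("utf-8")          # bytes: e2 98 bb

# Gluing the raw bytes together reproduces the four-byte result.
naive = latin1_part + utf8_part

# Converting the iso-8859-1 part to utf8 first gives five bytes.
correct = latin1_part.decode("iso-8859-1").encode("utf-8") + utf8_part


def is_valid_utf8(data):
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(naive.hex(), is_valid_utf8(naive))
print(correct.hex(), is_valid_utf8(correct))
```

The four-byte form is rejected because e5 announces a three-byte sequence, but the byte after it (e2) is not a continuation byte.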

This looks very similar to the bug reported in RT #39930 (which has since been marked as fixed, but apparently doesn't fix this case).

A fix for this is needed for various Rakudo modules, especially those dealing with URL encoding and decoding.

Thanks!

Pm

Attachments

TT_572.patch (2.9 KB) - added by NotFound 6 years ago.
Fix/modify charset and encoding conversion rules in concatenating

Change History

Changed 6 years ago by pmichaud

The first form of this ticket incorrectly overspecifies the expected result. The expected result is that the two strings $S0 and $S3 compare as equal -- there's no requirement that $S0 be a utf8 encoded string. It just has to be a string that accurately contains the \u00e5 and \u263b codepoints in sequence.
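
As a rough illustration of this relaxed requirement, the Python sketch below compares two strings by codepoint sequence rather than by bytes. The function name codepoints and the choice of utf-16-be as the second encoding are assumptions for the example, not anything Parrot does.

```python
# Illustration only: equality is judged on codepoints, not on stored bytes.

def codepoints(data, encoding):
    """Hypothetical helper: decode bytes and return the codepoint list."""
    return [ord(ch) for ch in data.decode(encoding)]


expected = [0xE5, 0x263B]

# The same two codepoints held in two different encodings.
as_utf8 = "\u00e5\u263b".encode("utf-8")
as_utf16 = "\u00e5\u263b".encode("utf-16-be")

# The byte strings differ, yet the codepoint sequences agree.
print(as_utf8 != as_utf16)
print(codepoints(as_utf8, "utf-8") == codepoints(as_utf16, "utf-16-be"))
```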

Pm

Changed 6 years ago by NotFound

Fix/modify charset and encoding conversion rules in concatenating

Changed 6 years ago by NotFound

After discussion in #parrot, I made an attempt to fix and modify the conversion rules for concatenation.

I don't expect to have much time this weekend to work on it, so I attach the patch here.

Changed 6 years ago by pmichaud

  • status changed from new to closed
  • resolution set to fixed

Problem now fixed in r39572 -- turns out we needed a more sophisticated 'append' routine.

Added tests to t/op/stringu.t; closing ticket.
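
For anyone wondering what an encoding-aware append might look like in outline, here is a minimal Python sketch. It is not the code from r39572; EncodedString, append, and the rule "fall back to utf-8 when encodings differ" are all assumptions made for this illustration.

```python
# Sketch under stated assumptions; NOT Parrot's actual append routine.
from typing import NamedTuple


class EncodedString(NamedTuple):
    """Hypothetical stand-in for a string that carries its own encoding."""
    data: bytes
    encoding: str


def append(left, right):
    """Concatenate two encoded strings without mixing byte formats."""
    if left.encoding == right.encoding:
        # Same encoding on both sides: the bytes can be joined directly.
        return EncodedString(left.data + right.data, left.encoding)
    # Different encodings: go through codepoints, then re-encode the
    # whole result in one encoding that can hold both operands.
    text = left.data.decode(left.encoding) + right.data.decode(right.encoding)
    return EncodedString(text.encode("utf-8"), "utf-8")


s1 = EncodedString(b"\xe5", "iso-8859-1")      # like chr 0xe5
s2 = EncodedString(b"\xe2\x98\xbb", "utf-8")   # like chr 0x263b
s3 = append(s1, s2)
print(s3.encoding, s3.data.hex())
```

The point of the sketch is only that the operands must be brought to a common encoding before their bytes are joined.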

Pm
