Ticket #752 (closed bug: fixed)

Opened 13 years ago

Last modified 13 years ago

Parrot concatenates iso-8859-1 and utf8 incorrectly

Reported by: pmichaud
Owned by:
Priority: normal
Milestone:
Component: core
Version: 1.2.0
Severity: high
Keywords:
Cc:
Language: perl6
Patch status:
Platform:

Description

Parrot has difficulty concatenating iso-8859-1 and utf8 strings. Here's the test case:

$ cat x.pir
.sub 'main'
    $S0 = unicode:"\u00e5\u263b"

    $S1 = chr 0xe5
    $S2 = chr 0x263b
    $S3 = concat $S1, $S2

    if $S0 == $S3 goto equal
    print "not "
  equal:
    say "equal"
.end
$ ./parrot x.pir
Malformed UTF-8 string

current instr.: 'main' pc 13 (x.pir:7)
$ 

Note that the exception occurs at the point of the == comparison, not when the concatenation occurs. If one outputs the value of $S3, it comes out as four bytes (e5 e2 98 bb). The correct result should be five bytes (c3 a5 e2 98 bb) -- i.e., the iso-8859-1 string that comes back from chr(229) needs to be converted to utf8 before concatenation.
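
For readers who want to check the byte sequences quoted above, here is a minimal sketch in Python (not PIR, and not a model of Parrot's internals). It assumes the faulty concat simply joins the raw bytes of the two strings without transcoding, which is consistent with the four-byte output reported; the variable names are illustrative only.

```python
# Illustration only: reproduces the byte sequences from the report in Python.
# Assumption: the buggy concat joins raw bytes without transcoding.

latin1_part = "\u00e5".encode("iso-8859-1")  # single byte e5, as from chr 0xe5
utf8_part = "\u263b".encode("utf-8")         # three bytes e2 98 bb, as from chr 0x263b

# Naive byte-level concatenation: the iso-8859-1 byte is left untouched.
broken = latin1_part + utf8_part
print(broken.hex(" "))  # e5 e2 98 bb

# Interpreting those four bytes as UTF-8 fails, because e5 starts a
# three-byte sequence and e2 is not a valid continuation byte.
try:
    broken.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print("valid UTF-8:", decoded_ok)

# Transcoding the iso-8859-1 part to UTF-8 first gives the five-byte result.
fixed = latin1_part.decode("iso-8859-1").encode("utf-8") + utf8_part
print(fixed.hex(" "))  # c3 a5 e2 98 bb
```

The failed decode in the middle step corresponds to the "Malformed UTF-8 string" error, which only surfaces once something (here, the == comparison) actually tries to read the bytes as UTF-8.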

This looks very similar to the bug reported in RT #39930 (which has since been marked as fixed, but apparently doesn't fix this case).

A fix for this is needed for various modules in Rakudo -- especially those dealing with URL encoding and decoding.

Thanks!

Pm

Attachments

TT_572.patch Download (2.9 KB) - added by NotFound 13 years ago.
Fix/modify charset and encoding conversion rules in concatenating

Change History

Changed 13 years ago by pmichaud

The first form of this ticket incorrectly overspecifies the expected result. The expected result is that the two strings $S0 and $S3 compare as equal -- there's no requirement that $S0 be a utf8-encoded string. It just has to be a string that accurately contains the \u00e5 and \u263b codepoints in sequence.

Pm
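
The relaxed requirement above (equal codepoints, whatever the storage encoding) can be sketched as follows. This is a Python illustration under stated assumptions, not Parrot's implementation: `EncodedString`, `concat`, and `same_text` are hypothetical names, and picking UTF-8 as the common encoding for mixed inputs is just one possible rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodedString:
    """Hypothetical stand-in for a string stored as raw bytes plus an encoding name."""
    data: bytes
    encoding: str

    def codepoints(self):
        # Decode with the string's own encoding and list the codepoints.
        return [ord(ch) for ch in self.data.decode(self.encoding)]


def concat(left, right):
    """Concatenate, transcoding to a common encoding when the inputs differ."""
    if left.encoding == right.encoding:
        return EncodedString(left.data + right.data, left.encoding)
    # Assumption: mixed encodings are upgraded to UTF-8, which can hold both.
    text = left.data.decode(left.encoding) + right.data.decode(right.encoding)
    return EncodedString(text.encode("utf-8"), "utf-8")


def same_text(a, b):
    """Compare by codepoint sequence, ignoring how each string is stored."""
    return a.codepoints() == b.codepoints()


# The reference string is stored in a different encoding on purpose.
s0 = EncodedString("\u00e5\u263b".encode("utf-16-be"), "utf-16-be")
s1 = EncodedString(b"\xe5", "iso-8859-1")
s2 = EncodedString("\u263b".encode("utf-8"), "utf-8")
s3 = concat(s1, s2)

print(same_text(s0, s3))  # True
```

The point of the sketch is that the comparison succeeds even though s0 and s3 hold different bytes, which is all the ticket asks for.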

Changed 13 years ago by NotFound

Fix/modify charset and encoding conversion rules in concatenating

Changed 13 years ago by NotFound

After discussion in #parrot, I made an attempt to fix and modify the conversion rules for concatenation.

I don't expect to have much time this weekend to work on it, so I attach the patch here.

Changed 13 years ago by pmichaud

  • status changed from new to closed
  • resolution set to fixed

Problem now fixed in r39572 -- turns out we needed a more sophisticated 'append' routine.

Added tests to t/op/stringu.t; closing ticket.

Pm
