Ticket #724 (closed bug: fixed)

Opened 13 years ago

Last modified 13 years ago

[bug] Parrot fails numeric conversion of ucs2 strings

Reported by: pmichaud Owned by:
Priority: normal Milestone:
Component: core Version: 1.2.0
Severity: medium Keywords:
Cc: Language:
Patch status: Platform:

Description

Parrot's ucs2 strings don't numify properly.

$ cat x.pir
.sub 'main'
    $S0 = "140"
    $N0 = $S0
    say $N0
    $I0 = find_encoding 'ucs2'
    $S0 = trans_encoding $S0, $I0
    $N0 = $S0
    say $N0
.end
$ ./parrot x.pir
140
1
$ 

Of course, I wish I had detected this much sooner -- like before the two or so hours I spent trying to convert PGE and Rakudo to use fixed-width string encodings for Unicode...

Pm

Change History

  Changed 13 years ago by bacek

  • status changed from new to closed
  • resolution set to fixed

Fix commited at r39287.

We have to rethink how strings handled in parrot. Or at least how we interact with C.

-- Bacek

  Changed 13 years ago by pmichaud

Reopening ticket -- bacek has already reverted the r39287 patch. r39287 suffered from two problems -- (1) it caused strings containing non-ascii characters to throw an exception on numification (an API change), and (2) every string-to-number conversion for non-ASCII strings would result in creation of another gc-able object.

Here's a regression test that catches the first issue (and that fails in r39287):

.sub 'main'
    $S0 = unicode:"140 r\x{e9}sum\x{e9}s"
    $N0 = $S0
    say $N0
    $I0 = find_encoding 'ucs2'
    $S0 = trans_encoding $S0, $I0
    $N0 = $S0
    say $N0
.end

Pm

follow-up: ↓ 4   Changed 13 years ago by pmichaud

  • status changed from closed to reopened
  • resolution fixed deleted

in reply to: ↑ 3 ; follow-up: ↓ 5   Changed 13 years ago by bacek

Ok.

Branch (misnamed) tt24_unicode_numification) ready for merge. Only one concern: we never declared format for "NaN", "Inf" and "-Inf". Old version based on atof supported some bizzare variants like " nAn", " inF" and " InFiNity". Version in branch support only canonical "Inf", "-Inf" and "NaN". Question is - should I mimic atof behaviour or we can declare standard format for it?

-- Bacek

in reply to: ↑ 4 ; follow-ups: ↓ 9 ↓ 11   Changed 13 years ago by particle

Replying to bacek:

Branch (misnamed) tt24_unicode_numification) ready for merge. Only one concern: we never declared format for "NaN", "Inf" and "-Inf". Old version based on atof supported some bizzare variants like " nAn", " inF" and " InFiNity". Version in branch support only canonical "Inf", "-Inf" and "NaN". Question is - should I mimic atof behaviour or we can declare standard format for it?

the canonical representation for NaN and Inf are 'NaN' and 'Inf'. other variants of capitalization or spelling are not valid; they must be converted to the format above. don't mimic the old atof behavior.

~particle

  Changed 13 years ago by coke

FYI, if this changes the behavior of $N0 = 'nAn', then some languages may have to change their NaN handling. e.g.:

 http://code.google.com/p/partcl/source/browse/trunk/src/grammar/expr/past2pir.tg#80

Simple fix for partcl and I'll make it asap, but this smells like something that needs a deprecation warning and can probably be forked off from this ticket.

  Changed 13 years ago by coke

The branch name is actually tt24_unicode_numifications (note the s); I just ran it against partcl, and as expected, the change in NaN handling broke several tests: These are now fixed to use the more restrictive case.

However, it also re-broke t/tcl_misc.t; same failures reported earlier under another ticket regarding PMC reuse.

follow-up: ↓ 10   Changed 13 years ago by bacek

Ok,

if there is no complains I'll merge this into branch in new few days.

-- Bacek

in reply to: ↑ 5   Changed 13 years ago by doughera

Replying to particle:

Replying to bacek:

Branch (misnamed) tt24_unicode_numification) ready for merge. Only one concern: we never declared format for "NaN", "Inf" and "-Inf". Old version based on atof supported some bizzare variants like " nAn", " inF" and " InFiNity". Version in branch support only canonical "Inf", "-Inf" and "NaN". Question is - should I mimic atof behaviour or we can declare standard format for it?

the canonical representation for NaN and Inf are 'NaN' and 'Inf'. other variants of capitalization or spelling are not valid; they must be converted to the format above. don't mimic the old atof behavior.

I haven't followed this branch at all, so this may be irrelevant.

Where are the strings coming from? Are they coming from the outside world? If so, it would be better to ignore case, as strtod() does (if it handles NaN and Inf at all.) Note that the printf in Linux glibc prints 'nan' and 'inf'. It seems to me this is a sensible place to apply the "Be liberal in what you accept, and conservative in what you send" principle.

One other thing: Don't forget -NaN.

in reply to: ↑ 8   Changed 13 years ago by coke

Replying to bacek:

Ok, if there is no complains I'll merge this into branch in new few days. -- Bacek

Yes, I just complained, about the deprecation and about the breakage in tcl.

in reply to: ↑ 5 ; follow-up: ↓ 12   Changed 13 years ago by pmichaud

Replying to particle:

the canonical representation for NaN and Inf are 'NaN' and 'Inf'. other variants of capitalization or spelling are not valid; they must be converted to the format above. don't mimic the old atof behavior.

FWIW, that's not what I remember us discussing in the Perl 6 design meeting (2009-01-28). In Perl 6 at least, "inf" is treated as infinity, not zero.

In a larger sense, most groups seem to have accepted that "inf" and "nan" conversions should be case-insensitive for parsing. Displaying the mixed-case "Inf" and "NaN" on output is (afaik) a P6-only phenomenon, most other languages seem to consistently do either "inf" or "INF" (often by simply deferring to whatever atof does).

Pm

in reply to: ↑ 11 ; follow-ups: ↓ 13 ↓ 14   Changed 13 years ago by doughera

Replying to pmichaud:

In a larger sense, most groups seem to have accepted that "inf" and "nan" conversions should be case-insensitive for parsing. Displaying the mixed-case "Inf" and "NaN" on output is (afaik) a P6-only phenomenon, most other languages seem to consistently do either "inf" or "INF" (often by simply deferring to whatever atof does).

For output from C, the story is quite mixed. The output of printf() may differ depending on the conversion flags (%f vs. %F vs. %g) and on the precision (printing "Infinity" rather than "Inf" when the requested precision is high enough), as well as on any compiler flags dictating C99 compliance. In short, even on the same system, it's quite possible to run into programs that output any of inf, Inf, INF, Infinity, INFINITY, nan, NaN, or NAN, either with or without a leading minus sign. I would think parrot ought to accept them all.

in reply to: ↑ 12   Changed 13 years ago by bacek

Replying to doughera:

In short, even on the same system, it's quite possible to run into programs that output any of inf, Inf, INF, Infinity, INFINITY, nan, NaN, or NAN, either with or without a leading minus sign. I would think parrot ought to accept them all.

And it now supported in branch starting from r39372.

Voting for merge it to trunk.

-- Bacek

in reply to: ↑ 12 ; follow-up: ↓ 15   Changed 13 years ago by particle

Replying to doughera:

Replying to pmichaud:

In a larger sense, most groups seem to have accepted that "inf" and "nan" conversions should be case-insensitive for parsing. Displaying the mixed-case "Inf" and "NaN" on output is (afaik) a P6-only phenomenon, most other languages seem to consistently do either "inf" or "INF" (often by simply deferring to whatever atof does).

For output from C, the story is quite mixed. The output of printf() may differ depending on the conversion flags (%f vs. %F vs. %g) and on the precision (printing "Infinity" rather than "Inf" when the requested precision is high enough), as well as on any compiler flags dictating C99 compliance. In short, even on the same system, it's quite possible to run into programs that output any of inf, Inf, INF, Infinity, INFINITY, nan, NaN, or NAN, either with or without a leading minus sign. I would think parrot ought to accept them all.

this also means parrot ought accept the C99 variants SNAN and QNAN, as well as NANS, NANQ, -1.#INF, 1.#IND, 0 (yes, zero!), etc. there's a comprehensive list at:

 http://wiki.apache.org/stdcxx/FloatingPoint

it's not a small list, even only for our core platforms. punting to the HLLs is certainly easier for now.

in reply to: ↑ 14 ; follow-up: ↓ 16   Changed 13 years ago by bacek

Replying to particle:

this also means parrot ought accept the C99 variants SNAN and QNAN, as well as NANS, NANQ, -1.#INF, 1.#IND, 0 (yes, zero!), etc. there's a comprehensive list at:  http://wiki.apache.org/stdcxx/FloatingPoint it's not a small list, even only for our core platforms. punting to the HLLs is certainly easier for now.

I'm voting for support only canonical NaN and Inf representation after 1.4. For now we can provide case-insensitive "nan", "inf" and "infinity"[1].

-- Bacek

[1] Yes, this current behavior of numification in branch.

in reply to: ↑ 15   Changed 13 years ago by bacek

OH HAI.

Any desicion about this branch and floating point handling? From my point of view it's very unlikely that someone used anything different from "NaN" and "Inf" (ignoring case).

-- Bacek.

  Changed 13 years ago by bacek

Hello.

tt24_unicode_numifacations branch was merged into trunk at r39529.

I'll wait till Thuesday and close ticket or revert merged commit if it will cause problems.

-- Bacek

  Changed 13 years ago by bacek

  • status changed from reopened to closed
  • resolution set to fixed

No objections raised. Closing ticket.

Note: See TracTickets for help on using tickets.