Ticket #1212 (new RFC)

Opened 12 years ago

Last modified 11 years ago

.eof returns false if last read call read the last byte of the file, but not beyond

Reported by: jonathan
Owned by: cotto
Priority: normal
Milestone:
Component: core
Version:
Severity: medium
Keywords:
Cc:
Language:
Patch status:
Platform:

Description

Hi,

It seems that the .eof() method on file handles can sometimes return true even if there is nothing more to read. This occurs when you have read up to the last byte of a file (e.g. when a readline reads up to the end of a newline, and that newline is the last thing in the file), but not beyond (which seems to be what causes the EOF flag to be set). I'm thinking this is the wrong behaviour?

Thoughts and fixes welcome!

Jonathan

Change History

  Changed 12 years ago by coke

Originally reported in  http://rt.perl.org/rt3/Ticket/Display.html?id=61224

See also TT #760

  Changed 12 years ago by coke

Comment from allison:

On Tue Dec 09 06:36:51 2008, jonathan@… wrote:

> Hi, It seems that the .eof() method on file handles can sometimes return true even if there is nothing more to read.

Yes, a read that returns 0 bytes is what sets the EOF flag. (True of any kind of read other than one that requests 0 bytes.)

This behavior is the same as the old implementation.

> This occurs when you have read up to the last byte of a file (e.g. when a readline reads up to the end of a newline, and that newline is the last thing in the file), but not beyond (which seems to be what causes the EOF flag to be set). I'm thinking this is the wrong behaviour?

The way to check if the byte after the last requested byte is the end of the file is to read ahead. Perl (at least 5.10) does this by actually reading the next character and then putting it back with 'ungetc'. Not the best solution. Any read ahead can be a bit expensive. I experimented with a quick patch to use 'peek' in the test for EOF in Parrot, just to see what would happen... it broke a large quantity of code (probably because all the code is expecting the old behavior of the EOF test, or possibly a bug in 'peek').
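For illustration, the read-and-push-back check described above can be sketched in C on top of stdio. This is only a minimal sketch of the technique, not Perl's or Parrot's actual code; the helper name `at_eof_lookahead` is hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: reports whether the next read on fp would find
 * no more data. It reads one character and, if there was one, pushes
 * it back with ungetc() so the caller still sees it. */
bool at_eof_lookahead(FILE *fp)
{
    int c = getc(fp);
    if (c == EOF) {
        /* Nothing left to read, or a read error; ferror(fp) tells which. */
        return true;
    }
    /* The C standard guarantees at least one character of pushback. */
    ungetc(c, fp);
    return false;
}
```

After an `fgets` that consumes the final newline of a file, `feof(fp)` still reports 0, while this helper reports true. The cost is one extra read attempt per check, and the pushed-back character lives only in this `FILE` object's buffer.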

At the end of the day, it's a cost/benefit question. Individual languages can implement the 1 character lookahead with the current I/O system if they need it. Will all languages (or even most languages) want the 1 character lookahead?

Allison

  Changed 12 years ago by coke

Comment from Joshua Juran:

On Dec 11, 2008, at 5:07 PM, Allison Randal via RT wrote:

> The way to check if the byte after the last requested byte is the end of the file is to read ahead. Perl (at least 5.10) does this by actually reading the next character and then putting it back with 'ungetc'. Not the best solution. Any read ahead can be a bit expensive. I experimented with a quick patch to use 'peek' in the test for EOF in Parrot, just to see what would happen... it broke a large quantity of code (probably because all the code is expecting the old behavior of the EOF test, or possibly a bug in 'peek').

If I'm understanding correctly, unrequested read-ahead is an error. The problem is that you can't put the toothpaste back into the tube, so to speak. Continuing with this analogy, calling ungetc() is like putting the extra toothpaste in a paper cup for later. If I'm the next person to brush my teeth, then sure, I'll scrape the toothpaste out of the cup first before I get more from the tube, but any hypothetical roommates would regard the cup as personal to me and ignore it, going straight for the tube.

Toothpaste is fungible, though, and it doesn't matter in what order it's used, whereas the same is not true of streamed bytes. If multiple processes are sharing a file descriptor and coordinating reads from it, any non-undoable read-ahead* will break the protocol.

* Read-ahead could be undone via lseek() for files, or be done non-destructively with recv( ..., MSG_PEEK ) for sockets.
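As a rough sketch of the lseek() variant mentioned in the footnote, the following C function reads one byte from a POSIX file descriptor and then seeks back so the file offset is unchanged. The name `fd_at_eof` is hypothetical, and the function refuses to read at all when the descriptor is not seekable, since the byte could not be given back.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fd is at end of file, 0 if more
 * data is available, and -1 on error or if fd cannot seek (for
 * example a pipe), in which case nothing is consumed. */
int fd_at_eof(int fd)
{
    char c;
    ssize_t n;

    /* Check seekability first so we never consume a byte we cannot undo. */
    if (lseek(fd, 0, SEEK_CUR) == (off_t)-1) {
        return -1;
    }
    n = read(fd, &c, 1);
    if (n < 0) {
        return -1;
    }
    if (n == 0) {
        return 1;
    }
    /* Undo the one-byte read-ahead by stepping the offset back. */
    if (lseek(fd, -1, SEEK_CUR) == (off_t)-1) {
        return -1;
    }
    return 0;
}
```

The read and the seek are two separate system calls, so another process sharing the same descriptor could still observe the moved offset in between; this sketch does not address that.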

Josh

  Changed 12 years ago by coke

This ticket needs a design decision in order to proceed.

  Changed 12 years ago by jkeenan

  • component changed from none to core

  Changed 11 years ago by jkeenan

  • cc cotto added

Replying to coke:

> This ticket needs a design decision in order to proceed.

Accordingly, cc-ing the current architect.

kid51

  Changed 11 years ago by cotto

The original report from jnthn is confusing. I can see it being a potential issue that fh.eof() returns *false* when there are 0 bytes left in a file but a read hasn't returned fewer bytes than requested. A build from the day the original rt was filed displays the same behavior as a current build (modulo some api changes), so presumably that was the intent of the ticket.

Given that, I'd like to avoid trying to secretly read and unread an extra character if we can make fh.eof do so reliably and portably. If that's not an option, we should document the surprising behavior and give HLLs the tools they need to do whatever makes sense for their users.

  Changed 11 years ago by doughera

Replying to cotto:

> The original report from jnthn is confusing. I can see it being a potential issue that fh.eof() returns *false* when there are 0 bytes left in a file but a read hasn't returned fewer bytes than requested.

This is the documented and expected behavior of the C-level stdio function feof(3). It's intended as an after-the-fact error status check after you've hit the end of the file. (It is occasionally useful in distinguishing among various error conditions that may cause an fread(3) to return less than the asked-for number of bytes).
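A small C sketch of that after-the-fact usage: the loop is driven by the return value of fread(3), and the stream's status is consulted only once a read comes up empty. The function name `count_bytes` is made up for this illustration.

```c
#include <stdio.h>

/* Hypothetical helper: counts the bytes remaining in fp.
 * Returns the count, or -1 if reading stopped because of an error. */
long count_bytes(FILE *fp)
{
    char buf[4096];
    long total = 0;
    size_t n;

    /* Drive the loop from fread's return value, not from feof(). */
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        total += (long)n;
    }
    /* A read returned nothing: now ask why. */
    if (ferror(fp)) {
        return -1;
    }
    return total; /* feof(fp) is nonzero at this point */
}
```

If a caller first reads exactly the number of bytes the file contains, `feof(fp)` is still 0; it becomes nonzero only after the next read attempt returns nothing, which is the behavior this ticket describes.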

Perl 5 does implement the ungetc look-ahead trick in some circumstances, but perldoc -f eof also contains this hint:

> Practical hint: you almost never need to use "eof" in Perl, because the input operators typically return "undef" when they run out of data, or if there was an error.

> Given that, I'd like to avoid trying to secretly read and unread an extra character if we can make fh.eof do so reliably and portably. If that's not an option, we should document the surprising behavior and give HLLs the tools they need to do whatever makes sense for their users.

This ticket appears to request a predictive fh.eof() function that asks what would happen if you tried to read another character, without actually reading another character. I don't know how you could implement that reliably and portably for all different types of inputs. I suppose one could do it only on file handles that also advertise the .ungetc method, but then the .eof function becomes even more magically erratic in its actual behavior.

I think parrot should simply provide the fh.eof function that answers the question of whether the previous read ran off the end of the file. The HLL writer (or the end user) who has more context about the particular input source in question may be able to provide something more predictive.
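One example of the extra context a caller might have: if the input is known to be a regular file, the current offset can be compared with the file size without reading anything. The sketch below assumes a POSIX system and uses a hypothetical name, `regular_file_at_end`; it deliberately gives up on anything that is not a regular file.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper for regular files only. Returns 1 if the file
 * offset is at or past the current size, 0 if data remains, and -1 if
 * the question cannot be answered (not a regular file, or an error). */
int regular_file_at_end(int fd)
{
    struct stat st;
    off_t pos;

    if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode)) {
        return -1;
    }
    pos = lseek(fd, 0, SEEK_CUR);
    if (pos == (off_t)-1) {
        return -1;
    }
    return pos >= st.st_size ? 1 : 0;
}
```

The answer is only a snapshot: another process may append to the file afterwards. Pipes, sockets, and terminals get -1, which is the general case discussed in this ticket where end of file is only discovered by attempting a read.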

  Changed 11 years ago by cotto

The approach of providing a dumb eof for FileHandle and letting users build on top of it makes sense to me and seems preferable to trying to dtrt while avoiding unexpected costs and side-effects. I updated the description of the relevant functions in 8c0d19cc9 so that anyone reading them will be made aware of their potentially surprising behavior. I'll leave this ticket open in case anyone wants to argue for a smarter fh.eof, but if nobody responds within 2 weeks we can consider the matter closed.

  Changed 11 years ago by pmichaud

On Fri, Feb 25, 2011 at 05:34:19PM -0000, Parrot wrote:
>  The approach of providing a dumb eof for FileHandle and letting users
>  build on top of it makes sense to me and seems preferable to trying to
>  dtrt while avoiding unexpected costs and side-effects.
> [...]

Just wanted to add a "me too" to this -- in the general case I don't
think it's possible to detect that you've reached the end of file
until you actually try to read beyond it.  And many handles are
really streams where the length _can't_ be known until an attempt
is made to read from the handle and there's nothing more to read.

I'd much rather have a .eof that works consistently across handles
than one that tries to be smart about it in certain (albeit
common) cases.  And the traditional interpretation of "eof" in
most languages I've dealt with has been that it occurs _after_ a
read has failed.

I strongly recommend that we treat the current behavior as the
"correct" one for Parrot and that we reject this ticket.

Pm

  Changed 11 years ago by jkeenan

  • cc cotto removed
  • owner set to cotto

cotto:

Am I correct in thinking that the consensus is to reject this ticket?

I'll assign it to you so that you can make the final call.

Thank you very much.

kid51
