In Python, if you either open a file without calling close(), or close the file but not using try-finally or the "with" statement, is this a problem? Or does it suffice as a coding practice to rely on the Python garbage-collection to close all files? For example, if one does this:
for line in open("filename"):
    # ... do stuff ...
... is this a problem because the file can never be closed and an exception could occur that prevents it from being closed? Or will it definitely be closed at the conclusion of the for statement because the file goes out of scope?
In your example the file isn't guaranteed to be closed before the interpreter exits. In current versions of CPython the file will be closed at the end of the for loop because CPython uses reference counting as its primary garbage collection mechanism but that's an implementation detail, not a feature of the language. Other implementations of Python aren't guaranteed to work this way. For example IronPython, PyPy, and Jython don't use reference counting and therefore won't close the file at the end of the loop.
It's bad practice to rely on CPython's garbage collection implementation because it makes your code less portable. You might not have resource leaks if you use CPython, but if you ever switch to a Python implementation which doesn't use reference counting you'll need to go through all your code and make sure all your files are closed properly.
For your example use:
with open("filename") as f:
    for line in f:
        # ... do stuff ...
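The try-finally form mentioned in the question is what the with statement does for you behind the scenes. A minimal sketch (the file and its contents here are just stand-ins for the question's "filename"):

```python
# create a small file to read, standing in for the question's "filename"
with open("filename", "w") as out:
    out.write("line 1\nline 2\n")

f = open("filename")
try:
    lines = []
    for line in f:
        lines.append(line)   # ... do stuff ...
finally:
    f.close()                # runs even if the loop body raises
```

The with form is shorter and harder to get wrong, which is why it is the recommended spelling.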
Some Pythons will close files automatically when they are no longer referenced, while others will not and it's up to the O/S to close files when the Python interpreter exits.
Even for the Pythons that will close files for you, the timing is not guaranteed: it could be immediately, or it could be seconds/minutes/hours/days later.
So, while you may not experience problems with the Python you are using, it is definitely not good practice to leave your files open. In fact, CPython 3 now emits a ResourceWarning when the interpreter has to close a file for you because you didn't do it.
Moral: Clean up after yourself. :)
Although it is quite safe to use such a construct in this particular case, there are some caveats to generalising the practice:
you can potentially run out of file descriptors; it's unlikely, but imagine hunting down a bug like that
you may not be able to delete the file on some systems, e.g. win32
if you run anything other than CPython, you don't know when the file is closed for you
if you open the file in write or read-write mode, you don't know when the data is flushed
The file does get garbage collected, and hence closed. The GC determines when it gets closed, not you. Obviously, this is not a recommended practice, because you might hit the open file handle limit if you do not close files as soon as you finish using them. What if, within that for loop of yours, you open more files and leave them lingering?
It is very important to close your file descriptor when you are going to use the file's content later in the same Python script. I realized this today myself, after a long and hectic debugging session. The reason is that the content is only edited/removed/saved once you close the file descriptor; that is when the changes actually reach the file!
So suppose you write content to a new file and then, without closing the file descriptor, use that file (not the descriptor) in another shell command that reads its content. In that situation the shell command will not see the contents you expect, and if you try to debug it, the bug is not easy to find. You can also read more in my blog entry http://magnificentzps.blogspot.in/2014/04/importance-of-closing-file-descriptor.html
During the I/O process, data is buffered: this means that it is held in a temporary location before being written to the file.
Python doesn't flush the buffer—that is, write data to the file—until it's sure you're done writing. One way to do this is to close the file.
If you write to a file without closing, the data won't make it to the target file.
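A sketch of that buffering behaviour. Whether the intermediate read sees anything before close() depends on the platform and buffer size, so only the post-close state is certain:

```python
f = open("buffered_demo.txt", "w")
f.write("hello")          # "hello" may still sit in Python's user-space buffer
with open("buffered_demo.txt") as reader:
    peek = reader.read()  # a second reader may well see "" at this point
f.close()                 # closing flushes the buffer, so the data reaches the file
```

After close(), any reader is guaranteed to see the full contents.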
Python uses the close() method to close an open file. Once the file is closed, you cannot read or write data in that file again.
If you try to access the same file again, it will raise a ValueError, since the file is already closed.
Python automatically closes a file if its reference object is reassigned to another file. Closing the file is standard practice, as it reduces the risk of the file being modified unexpectedly.
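A quick illustration of the ValueError described above (the filename is just for the demo):

```python
f = open("closed_demo.txt", "w")
f.write("data")
f.close()

try:
    f.write("more")           # any I/O on a closed file raises ValueError
except ValueError as exc:
    error_message = str(exc)  # e.g. "I/O operation on closed file."
```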
Another way to solve this issue is the with statement.
If you open a file using a with statement, a temporary variable is reserved for accessing the file, and it can only be accessed within the indented block. The with statement itself calls the close() method after the indented code has executed.
Syntax:
with open('file_name.text') as file:
    # some code here
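After the indented block ends, the variable still exists but the file behind it is closed, which you can verify with the closed attribute (the file is opened for writing here so the sketch is self-contained):

```python
with open('file_name.text', 'w') as file:
    file.write('some text')   # some code here

# outside the block the variable remains, but the handle is closed
assert file.closed
```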
Related
I have a python script which will run various other scripts when it sees various files have been updated. It rapidly polls the files to check for updates by looking at the file modified dates.
For the most part this has worked as expected. When one of my scripts updates a file, another script is triggered and the appropriate action(s) are taken. For reference I am using pickles as the file type.
However, adding a new file and corresponding script into the mix just now, I've noticed an issue where the file has its modified date updated twice. Once when I perform the pickle.dump() and again when I exit the "with" statement (when the file closes). This means that the corresponding actions trigger twice rather than once. I guess this makes sense but what's confusing is this behaviour doesn't happen with any of my other files.
I know a simple workaround would be to poll the files slightly less frequently, since the gap between the file updates is extremely small. But I want to understand why this issue occurs sometimes but not other times.
I think what you observe is 2 actions: file created and file updated.
To resolve this, create and populate the file outside of the monitored folders, and once the "with" block is over (the file is closed), move it from the temporary location to its proper place.
To do this, look at the tempfile module in the standard library.
If the pickle is big enough (typically somewhere around 4+ KB, though it will vary by OS/file system), this would be expected behavior. The majority of the pickle would be written during the dump call as buffers filled and got written, but whatever fraction doesn't consume the full file buffer would be left in the buffer until the file is closed (which implicitly flushes any outstanding buffered data before closing the handle).
I agree with the other answer that the usual solution is to write the file in a different folder (but on the same file system), then, immediately after closing it, use os.replace to perform an atomic rename that moves it from the temporary location to the final location. That way there is no observable gap between file open, file population, and file close; the file is either there in its entirety, or not at all.
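A sketch of that write-then-rename pattern; the payload and the file names here are illustrative, not taken from the original polling script:

```python
import os
import pickle
import tempfile

data = {"job": "done"}            # illustrative payload
final_path = "result.pkl"         # the path the polling script would watch

# write into a temp file in the same directory so os.replace stays atomic
fd, tmp_path = tempfile.mkstemp(dir=".")
with os.fdopen(fd, "wb") as f:
    pickle.dump(data, f)          # mtime changes here affect only the temp file
os.replace(tmp_path, final_path)  # atomic rename: watchers see exactly one update
```

The poller never sees a half-written pickle, because the final name only ever points at a complete file.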
I saw some code that was using file.flush(), so I searched around and found this SO post. I more or less understood why there's a flush method. In the accepted answer, the following was written:
Typically you don't need to bother with either method, but if you're in a scenario where paranoia about what actually ends up on disk is a good thing, you should make both calls as instructed.
So I was wondering: when we open a file using a context manager and write some text, and then the code exits the context manager, is there a chance that the text has not been written to the file? If yes, why doesn't Python do this internally when file.close() is called? Is it being done already?
The file objects in the io module (the ones you get from open) and everywhere else you'd expect in the stdlib always flush when they close, or rely on platform APIs that are guaranteed to do so.
Even third-party libraries are required to "close and flush the stream" on their close methods if they want their objects to be file objects.1
The main reason to call flush is when you're not closing the file yet, but some other program might want to see the contents.
For example, a lot of people write code like this:
with open('dump.txt', 'w') as f:
    while True:
        buf = read_off_some_thingy()
        f.write(buf.decode())
        time.sleep(5)
… and then they wonder why when they cat dump.txt or open it in Notepad or whatever, it's empty, or missing the last 3 lines, or cuts off in the middle of a line. That's the problem flush solves:
with open('dump.txt', 'w') as f:
    while True:
        buf = read_off_some_thingy()
        f.write(buf.decode())
        f.flush()
        time.sleep(5)
Or, alternatively, they're running the same code, but the problem is that someone might pull the plug on the computer (or, more likely nowadays, kill your container), and then after restart they'll have a corrupt file that cuts off in mid-line and now the perl script that scans the output won't run and nobody wants to debug perl code. Different problem, same solution.
But if you know for a fact that the file is going to be closed by some point (say, because there's a with statement that ends before there), and you don't need the file to be done before that point, you don't need to call flush.
You didn't mention fsync, which is a whole other issue—and a whole lot more complicated than most people think—so I won't get into it. But the question you linked already covers the basics.
1. There's always the chance that you're using some third-party library with a file-like object that duck-types close enough to a file object for your needs, but isn't one. And such a type might have a close that doesn't flush. But I honestly don't think I've ever seen an object that had a flush method, that didn't call it on close.
Python does flush the file when it's .close()d, which happens when the context manager is exited.
The linked post refers more to a scenario where you have, say, a log file that's open for a long time, and you want to ensure that everything gets written to disk after each write. That's where you'd want to .write(); .flush();.
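That long-lived log scenario can be sketched like this; after flush(), a second handle on the same machine sees the data even though the writer hasn't closed yet (the log name is just for the demo):

```python
log = open("app.log", "w")
log.write("event 1\n")
log.flush()                      # push the buffer to the OS now

with open("app.log") as reader:  # a second reader, as `cat` would be
    visible = reader.read()

log.close()
```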
I have tried this a few different ways, but the result always seems to be the same. I can't get Python to read until the end of the file here. It stops only about halfway through. I've tried Binary and ASCII modes, but both of these have the same result. I've also checked for any special characters in the file where it cuts off and there are none. Additionally, I've tried specifying how much to read and it still cuts off at the same place.
It goes something like this:
f=open("archives/archivelog", "r")
logtext=f.read()
print logtext
It happens whether I call it from bash, or from python, whether I'm a normal user or the root.
HOWEVER, it works fine if the file is in the same directory as I am.
f=open("archivelog", "r")
logtext=f.read()
print logtext
This works like a dream. Any idea why?
The Python reference manual about read() says:
Also note that when in non-blocking mode, less data than was requested
may be returned, even if no size parameter was given.
There is also a draft PEP about that matter, which apparently was not accepted. A PEP is a Python Enhancement Proposal.
So the sad state of affairs is that you cannot rely on read() to give you the full file in one call.
If the file is a text file I suggest you use readlines() instead. It will give you a list containing every line of the file. As far as I can tell readlines() is reliable.
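A sketch of the readlines() suggestion; a small stand-in file replaces the question's "archives/archivelog" so the example is self-contained:

```python
# stand-in for "archives/archivelog": write a few lines, then read them back
with open("archivelog_demo", "w") as out:
    out.write("first\nsecond\nthird\n")

with open("archivelog_demo") as f:
    lines = f.readlines()    # one string per line, trailing newlines included

logtext = "".join(lines)
```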
Jumping off from Kelketek's answer:
I can't remember where I read about this, but basically the Python garbage collector runs "occasionally", with no guarantees about when a given object will be collected. The flush() function does the same: http://docs.python.org/library/stdtypes.html#file.flush. What I've gathered is that flush() puts the data in some buffer for writing and it's up to your OS to decide when to do it. Probably one or both of these was your problem.
Were you reading in the file soon after writing it? That could cause a race condition (http://en.wikipedia.org/wiki/Race_condition), which is a class of generally weird, possibly random/hard-to-reproduce bugs that you don't normally expect from a high-level language like Python.
The read method returns the file contents in chunks. You have to call it again until it returns an empty string ('').
http://docs.python.org/tutorial/inputoutput.html#methods-of-file-objects
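If you do want to read defensively, the loop-until-empty pattern from that tutorial looks like this (the demo file and chunk size are arbitrary choices):

```python
# build a demo file, then read it back in fixed-size chunks
with open("chunk_demo", "w") as out:
    out.write("x" * 10000)

chunks = []
with open("chunk_demo") as f:
    while True:
        chunk = f.read(4096)   # at most 4096 characters per call
        if chunk == "":        # empty string signals end of file
            break
        chunks.append(chunk)

content = "".join(chunks)
```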
Ok, gonna write this in notepad first so I don't press 'enter' too early...
I have solved the problem, but I'm not really sure WHY the solution solves the problem.
As it turns out, the reason why one was able to read through and not the other was because the one that was cut off early was created with the Python script, whereas the other had been created earlier.
Even though I closed the file, the file did not appear to be fully written to disk, OR, when I was grabbing it, it was only what was in buffer. Something like that.
By doing:
del f
And then trying to grab the file, I got the whole file. And yes, I did use f.close after writing the file.
So, the problem is solved, but can anyone give me the reason why I had to garbage collect manually in this instance? I didn't think I'd have to do this in Python.
I am writing a script that will be polling a directory looking for new files.
In this scenario, is it necessary to do some sort of error checking to make sure the files are completely written prior to accessing them?
I don't want to work with a file before it has been written completely to disk, but because the info I want from the file is near the beginning, it seems like it could be possible to pull the data I need without realizing the file isn't done being written.
Is that something I should worry about, or will the file be locked because the OS is writing to the hard drive?
This is on a Linux system.
Typically on Linux, unless you're using locking of some kind, two processes can quite happily have the same file open at once, even for writing. There are three ways of avoiding problems with this:
Locking
By having the writer apply a lock to the file, it is possible to prevent the reader from reading the file partially. However, most locks are advisory, so it is still entirely possible to see partial results anyway. (Mandatory locks exist, but are strongly discouraged on the grounds that they're far too fragile.) It's relatively difficult to write correct locking code, and it is normal to delegate such tasks to a specialist library (i.e., to a database engine!). In particular, you don't want to use locking on networked filesystems; it's a source of colossal trouble even when it works and can often go thoroughly wrong.
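On Linux, advisory locking can be done with fcntl.flock. A minimal sketch (the filename is illustrative; remember this only excludes other processes that also take the lock):

```python
import fcntl

with open("shared.txt", "w") as f:
    fcntl.flock(f, fcntl.LOCK_EX)  # block until we hold the exclusive lock
    f.write("exclusive write\n")
    f.flush()                      # make the data visible while still locked
    fcntl.flock(f, fcntl.LOCK_UN)  # release; the with-block then closes the file
```

A cooperating reader would take fcntl.LOCK_SH before reading; a reader that never calls flock is not stopped at all, which is exactly the advisory-lock caveat above.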
Convention
A file can instead be created in the same directory with another name that you don't automatically look for on the reading side (e.g., .foobar.txt.tmp) and then renamed atomically to the right name (e.g., foobar.txt) once the writing is done. This can work quite well, so long as you take care to deal with the possibility of previous runs failing to correctly write the file. If there should only ever be one writer at a time, this is fairly simple to implement.
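That convention can be sketched directly with the names used above:

```python
import os

tmp_name = ".foobar.txt.tmp"   # hidden working name the reader ignores
final_name = "foobar.txt"      # name the reader actually looks for

with open(tmp_name, "w") as f:
    f.write("complete contents\n")

# atomic within one filesystem: the reader sees either no file or a full one
os.replace(tmp_name, final_name)
```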
Not Worrying About It
The most common type of file that is frequently written is a log file. These can be easily written in such a way that information is strictly only ever appended to the file, so any reader can safely look at the beginning of the file without having to worry about anything changing under its feet. This works very well in practice.
There's nothing special about Python in any of this. All programs running on Linux have the same issues.
On Unix, unless the writing application goes out of its way, the file won't be locked and you'll be able to read from it.
The reader will, of course, have to be prepared to deal with an incomplete file (bearing in mind that there may be I/O buffering happening on the writer's side).
If that's a non-starter, you'll have to think of some scheme to synchronize the writer and the reader, for example:
explicitly lock the file;
write the data to a temporary location and only move it into its final place when the file is complete (the move operation can be done atomically, provided both the source and the destination reside on the same file system).
If you have some control over the writing program, have it write the file somewhere else (like the /tmp directory) and then when it's done move it to the directory being watched.
If you don't have control of the program doing the writing (and by 'control' I mean 'edit the source code'), you probably won't be able to make it do file locking either, so that's probably out. In which case you'll likely need to know something about the file format to know when the writer is done. For instance, if the writer always writes "DONE" as the last four characters in the file, you could open the file, seek to the end, and read the last four characters.
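Checking a trailing marker like that could look as follows; the "DONE" sentinel is the hypothetical convention from the paragraph above, not a real protocol:

```python
import os

def writer_done(path, sentinel=b"DONE"):
    """Return True if the file ends with the writer's completion marker."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() < len(sentinel):
            return False              # too short to contain the marker yet
        f.seek(-len(sentinel), os.SEEK_END)
        return f.read() == sentinel

# demo: a partial file, then the finished version
with open("job.out", "wb") as f:
    f.write(b"partial data")
partial = writer_done("job.out")      # marker absent

with open("job.out", "ab") as f:
    f.write(b"DONE")
finished = writer_done("job.out")     # marker present
```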
Yes it will.
I prefer the "file naming convention" and renaming solution described by Donal.