Do files get closed during an exception exit? - python

Do open files (and other resources) get automatically closed when the script exits due to an exception?
I'm wondering if I need to be closing my resources during my exception handling.
EDIT: to be more specific, I am creating a simple log file in my script. I want to know if I need to be concerned about closing the log file explicitly in the case of exceptions.
Since my script has complex, nested try/except blocks, doing so is somewhat complicated, so if Python, the C library, or the OS is going to close my text file when the script crashes/errors out, I don't want to waste too much time on making sure the file gets closed.
If there is a part of the Python manual that talks about this, please refer me to it; I could not find it.

A fairly straightforward question.
Two answers.
One saying, “Yes.”
The other saying, “No!”
Both with significant upvotes.
Who to believe? Let me attempt to clarify.
Both answers have some truth to them, and it depends on what you mean by a
file being closed.
First, consider what is meant by closing a file from the operating system’s
perspective.
When a process exits, the operating system clears up all the resources
that only that process had open. Otherwise, badly behaved programs that
crashed without freeing their resources could eat up all of the system's
resources.
If Python was the only process that had that file open, then the file will
be closed. Similarly the operating system will clear up memory allocated by
the process, any networking ports that were still open, and most other
things. There are a few exceptional functions like shmat that create
objects that persist beyond the process, but for the most part the
operating system takes care of everything.
Now, what about closing files from Python’s perspective? If any program
written in any programming language exits, most resources will get cleaned
up—but how does Python handle cleanup inside standard Python programs?
The standard CPython implementation of Python—as opposed to other Python
implementations like Jython—uses reference counting to do most of its
garbage collection. An object has a reference count field. Every time
something in Python gets a reference to some other object, the reference
count field in the referred-to object is incremented. When a reference is
lost, e.g., because a variable is no longer in scope, the reference count is
decremented. When the reference count hits zero, no Python code can reach
the object anymore, so the object gets deallocated. And when it gets
deallocated, Python calls the __del__() destructor.
Python’s __del__() method for files flushes the buffers and closes the
file from the operating system’s point of view. Because of reference
counting, in CPython, if you open a file in a function and don’t return the
file object, then the reference count on the file goes down to zero when
the function exits, and the file is automatically flushed and closed. When
the program ends, CPython dereferences all objects, and all objects have
their destructors called, even if the program ends due to an unhandled
exception. (This does technically fail for the pathological case where you have a cycle
of objects with destructors,
at least in Python versions before 3.4.)
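For example, here is a small CPython-specific illustration of that behavior (nothing the language guarantees; the file name is just a placeholder):
def write_greeting(path):
    f = open(path, 'w')
    f.write('hello')  # data sits in Python's buffer for now, not on disk
    # No close() and no return: the only reference to f dies here, so on
    # CPython the file object is deallocated, flushed, and closed.

write_greeting('greeting.txt')  # 'greeting.txt' is just an example path
print(open('greeting.txt').read())  # prints 'hello' on CPython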
But that’s just the CPython implementation. Python the language is defined
in the Python language reference, which is what all Python
implementations are required to follow in order to call themselves
Python-compatible.
The language reference explains resource management in its data model
section:
Some objects contain references to “external” resources such as open
files or windows. It is understood that these resources are freed when
the object is garbage-collected, but since garbage collection is not
guaranteed to happen, such objects also provide an explicit way to
release the external resource, usually a close() method. Programs are
strongly recommended to explicitly close such objects. The
‘try...finally‘ statement and the ‘with‘ statement provide convenient
ways to do this.
That is, CPython will usually immediately close the object, but that may
change in a future release, and other Python implementations aren’t even
required to close the object at all.
So, for portability and because explicit is better than implicit,
it’s highly recommended to call close() on everything that can be
close()d, and to do that in a finally block if there is code between
the object creation and close() that might raise an exception. Or to use
the with syntactic sugar that accomplishes the same thing. If you do
that, then buffers on files will be flushed, even if an exception is
raised.
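For example, a minimal sketch of both patterns (log.txt is just a placeholder name, and the write stands in for whatever code might raise):
# Explicit cleanup with try/finally:
f = open('log.txt', 'a')
try:
    f.write('something that might be interrupted\n')
finally:
    f.close()  # runs whether or not the write raised

# The with statement is syntactic sugar for the same arrangement:
with open('log.txt', 'a') as f:
    f.write('something that might be interrupted\n')
# f is flushed and closed here, even if an exception was raised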
However, even with the with statement, the same underlying mechanisms are
at work. If the program crashes in a way that doesn’t give Python’s
__del__() method a chance to run, you can still end up with a corrupt
file on disk:
#!/usr/bin/env python3.3
import ctypes

# Cast the memory address 0x0001 to the C function int f()
prototype = ctypes.CFUNCTYPE(int)
f = prototype(1)

with open('foo.txt', 'w') as x:
    x.write('hi')
    # Segfault
    print(f())
This program produces a zero-length file. It’s an abnormal case, but it
shows that even with the with statement resources won’t always
necessarily be cleaned up the way you expect. Python tells the operating
system to open a file for writing, which creates it on disk; Python writes hi
into its userspace I/O buffers; and then it crashes before the with
statement ends. The operating system closes the file descriptor, but it has
no way to flush buffers that live only in the crashed process's memory, so
the buffered hi never reaches the disk. The program fails to clean up
properly even though there's a with statement. Whoops.
Despite this, close() and with almost always work, and your program is
always better off having them than not having them.
So the answer is neither yes nor no. The with statement and close() are technically not
necessary for most ordinary CPython programs. But not using them results in
non-portable code that will look wrong. And while they are extremely
helpful, it is still possible for them to fail in pathological cases.

No, they don't.
Use the with statement if you want your files to be closed even if an exception occurs.
From the docs:
The with statement is used to wrap the execution of a block with
methods defined by a context manager. This allows common
try...except...finally usage patterns to be encapsulated for convenient reuse.
Also from the docs:
 The with statement allows objects like files to be used in a way that ensures they are always cleaned up promptly and correctly.
with open("myfile.txt") as f:
    for line in f:
        print(line, end='')
After the statement is executed, the file f is always closed, even if a problem was encountered while processing the lines. Other objects which provide predefined clean-up actions will indicate this in their documentation.

Yes, they do.
This is a C library (at least in CPython) and operating-system thing. When the script exits, the C library will flush and close all file objects. Even if it doesn't (e.g., Python itself crashes), the operating system closes its resources just like for any other process. It doesn't matter whether it was an exception or a normal exit, or even whether it's Python or any other program.
Here's a script that writes a file and raises an exception before the file contents have been flushed to disk. Works fine:
~/tmp/so$ cat xyz.txt
cat: xyz.txt: No such file or directory
~/tmp/so$ cat exits.py
f = open("xyz.txt", "w")
f.write("hello")
print("file is", open("xyz.txt").read())
assert False
~/tmp/so$ python exits.py
('file is', '')
Traceback (most recent call last):
  File "exits.py", line 4, in <module>
    assert False
AssertionError
~/tmp/so$ cat xyz.txt
hello

I, as well as other people in this thread, am left with the question, "Well, what is finally true?"
Now, supposing that files can be left open when a program terminates prematurely -- and there are many such cases besides exceptions raised during file handling -- the only safe way to avoid this is to read the whole (or part of the) file into a buffer and close it immediately, then work on the contents of the buffer as needed. This is especially the case for global searches, changes, etc. that have to be done on the file. Once the changes are done, one can write the whole buffer back to the same or another file in one go, instead of leaving the newly created file open while doing lots of reads and writes -- which is the worst case of all!
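A minimal sketch of that read-all/process/write-back pattern, assuming a plain text file and a simple replacement as the edit (data.txt and the replace() call are placeholders):
# Read everything and close the handle right away, so nothing stays open
# while the (possibly long) processing runs.
with open('data.txt', 'r') as f:
    text = f.read()

# Work purely on the in-memory buffer: searches, replacements, and so on.
text = text.replace('old', 'new')

# Write the result back in a single, short-lived open.
with open('data.txt', 'w') as f:
    f.write(text)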

Related

Proper finalization in Python

I have a bunch of instances, each having a unique tempfile for its use (save data from memory to disk and retrieve them later).
I want to be sure that at the end of the day, all these files are removed. However, I want to leave a room for a fine-grained control of their deletion. That is, some files may be removed earlier, if needed (e.g. they are too big and not important any more).
What is the best / recommended way to achieve this?
My thoughts on that:
try/finally blocks or with statements are not an option, as we have many files whose lifetimes may overlap each other. Also, they hardly admit finer-grained control.
From what I have read, __del__ is also not a feasible option, as it is not even guaranteed that it will eventually run (although it is not entirely clear to me what the "risky" cases are). Also (if it is still the case), libraries may no longer be available when __del__ runs.
The tempfile library seems promising. However, the file is gone as soon as it is closed, which is definitely a bummer, as I want the files to be closed (when they perform no operation) to limit the number of open files.
The library promises that the file "will be destroyed as soon as it is closed (including an implicit close when the object is garbage collected)."
How do they achieve the implicit close? E.g. in C# I would use a (reliable) finalizer, which __del__ is not.
The atexit library seems to be the best candidate; it can work as a reliable finalizer instead of __del__ to implement a safe disposable pattern. The only problem, compared to object finalizers, is that it runs truly at exit, which is rather inconvenient (what if the object is eligible to be garbage-collected earlier?).
Here, the question still stands: how does the library ensure that the registered methods always run? (Except in really unexpected cases, with which it is hard to do anything.)
In the ideal case, it seems that a combination of __del__ and the atexit library may work best: the clean-up happens both in __del__ and in the method registered with atexit, while repeated clean-up is forbidden. If __del__ is called first, the registered handler is removed.
The only (yet crucial) problem is that __del__ won't run if a method is registered with atexit, because atexit then keeps a reference to the object alive forever.
Thus, any suggestion, advice, useful link and so on is welcomed.
I suggest considering the built-in weakref module for this task, more specifically weakref.finalize. A simple example:
import weakref

class MyClass:
    pass

def clean_up(*args):
    print('clean_up', args)

my_obj = MyClass()
weakref.finalize(my_obj, clean_up, 'arg1', 'arg2', 'arg3')
del my_obj  # optional
When run, it will output
clean_up ('arg1', 'arg2', 'arg3')
Note that clean_up will be executed even without del-ing my_obj (you can delete the last line of code and the behavior will not change). clean_up is called after all strong references to my_obj are gone, or at interpreter exit (much like the atexit module).
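As a rough sketch of how this could serve the per-instance temp files from the question (the TempStore class and its method names are made up for the example, not a known recipe):
import os
import tempfile
import weakref

class TempStore:
    """Owns one temporary file; the file is removed when the instance is
    garbage collected, at interpreter exit, or earlier on request."""

    def __init__(self):
        fd, self.path = tempfile.mkstemp()
        os.close(fd)  # keep only the path; reopen the file when it is needed
        # The finalizer must not reference self, or self would never be collected.
        self._finalizer = weakref.finalize(self, os.remove, self.path)

    def remove_now(self):
        # Fine-grained control: delete early. A finalizer runs at most once,
        # so dropping the object afterwards is harmless.
        self._finalizer()

store = TempStore()
print(os.path.exists(store.path))  # True
store.remove_now()                 # or simply drop the last reference
print(os.path.exists(store.path))  # False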

How to do proper file locking on NFS?

I am trying to implement a "record manager" class in Python 3.x on Linux/macOS. The class is relatively easy and straightforward; the only "hard" thing I want is to be able to access the same file (where results are saved) from multiple processes.
This seemed pretty easy, conceptually: when saving, acquire an exclusive lock on the file. Update your information, save the new information, release exclusive lock on the file. Easy enough.
I am using fcntl.lockf(file, fcntl.LOCK_EX) to acquire the exclusive lock. The problem is that, looking around on the internet, I am finding a lot of different websites saying that this is not reliable, that it won't work on Windows, that support on NFS is shaky, and that things could change between macOS and Linux.
I have accepted that the code won't work on Windows, but I was hoping to be able to make it work on macOS (single machine) and on Linux (on multiple servers with NFS).
The problem is that I can't seem to make this work; and after a while of debugging, and after the tests passed on macOS, they failed once I tried them over NFS on Linux (Ubuntu 16.04). The issue is an inconsistency in the information saved by multiple processes - some processes have their modifications missing, which means something went wrong in the locking and saving procedure.
So, what is the proper way to deal with multiple accesses to the same file that works on macOS and on Linux over NFS?
Edit
This is what a typical method that writes new information to disk looks like:
sf = open(self._save_file_path, 'rb+')
try:
    fcntl.lockf(sf, fcntl.LOCK_EX)  # acquire an exclusive lock - only one writer
    self._raw_update(sf)  # updates the records from file (other processes may have modified it)
    self._saved_records[name] = new_info
    self._raw_save()  # does not check for locks (but does *not* release the lock on self._save_file_path)
finally:
    sf.flush()
    os.fsync(sf.fileno())  # forcing the OS to write to disk
    sf.close()  # release the lock and close
While this is what a typical method that only reads info from disk looks like:
sf = open(self._save_file_path, 'rb')
try:
    fcntl.lockf(sf, fcntl.LOCK_SH)  # acquire shared lock - multiple readers
    self._raw_update(sf)  # updates the records from file (other processes may have modified it)
    return self._saved_records
finally:
    sf.close()  # release the lock and close
Also, this is what _raw_save looks like:
def _raw_save(self):
    # write to temp file first to avoid accidental corruption of information.
    # os.replace is guaranteed to be an atomic operation in POSIX
    with open('temp_file', 'wb') as p:
        p.write(self._saved_records)
    os.replace('temp_file', self._save_file_path)  # pretty sure this does not release the lock
Error message
I have written a unit test where I create 100 different processes, 50 that read and 50 that write to the same file. Each process does some random waiting to avoid accessing the files sequentially.
The problem is that some of the records are not kept; at the end there are some 3-4 random records missing, so I only end up with 46-47 records rather than 50.
Edit 2
I have modified the code above so that I acquire the lock not on the file itself but on a separate lock file. This prevents the issue that closing the file would release the lock (as suggested by @janneb), and makes the code work correctly on macOS. The same code fails on Linux over NFS, though.
I don't see how the combination of file locks and os.replace() can make sense. When the file is replaced (that is, the directory entry is replaced), all the existing file locks (probably including file locks waiting for the locking to succeed, I'm not sure of the semantics here) and file descriptors will be against the old file, not the new one. I suspect this is the reason behind the race conditions causing you to lose some of the records in your tests.
os.replace() is a good technique to ensure that a reader doesn't read a partial update. But it doesn't work robustly in the face of multiple updaters (unless losing some of the updates is ok).
Another issue is that fcntl is a really, really stupid API. In particular, the locks are bound to the process, not the file descriptor. Which means that e.g. a close() on ANY file descriptor pointing to the file will release the lock.
One way would be to use a "lock file", e.g. taking advantage of the atomicity of link(). From http://man7.org/linux/man-pages/man2/open.2.html:
Portable programs that want to perform atomic file locking using a
lockfile, and need to avoid reliance on NFS support for O_EXCL, can
create a unique file on the same filesystem (e.g., incorporating
hostname and PID), and use link(2) to make a link to the lockfile. If
link(2) returns 0, the lock is successful. Otherwise, use stat(2) on
the unique file to check if its link count has increased to 2, in
which case the lock is also successful.
If it's Ok to read slightly stale data then you can use this link() dance only for a temp file that you use when updating the file and then os.replace() the "main" file you use for reading (reading can then be lockless). If not, then you need to do the link() trick for the "main" file and forget about shared/exclusive locking, all locks are then exclusive.
Addendum: One tricky thing to deal with when using lock files is what to do when a process dies for whatever reason, and leaves the lock file around. If this is to run unattended, you might want to incorporate some kind of timeout and removal of lock files (e.g. check the stat() timestamps).
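A rough Python sketch of that link() dance, following the man-page recipe (the naming scheme, polling interval, and timeout handling are all assumptions, and this is untested over NFS, not a drop-in library):
import os
import socket
import time

def acquire_lock(lockfile, timeout=30.0):
    # Take an exclusive lock by link()ing a unique name to lockfile.
    # Sketch only: no handling of stale lock files left by dead processes.
    unique = '%s.%s.%d' % (lockfile, socket.gethostname(), os.getpid())
    open(unique, 'w').close()  # create the unique file on the same filesystem
    deadline = time.time() + timeout
    try:
        while True:
            try:
                os.link(unique, lockfile)  # atomic, even over NFS
                return unique
            except FileExistsError:
                pass
            # Fallback from the man page: the link may have succeeded even
            # though the reply was lost, so check the link count instead.
            if os.stat(unique).st_nlink == 2:
                return unique
            if time.time() > deadline:
                raise TimeoutError('could not lock %s' % lockfile)
            time.sleep(0.1)
    except BaseException:
        os.unlink(unique)
        raise

def release_lock(lockfile, unique):
    os.unlink(lockfile)
    os.unlink(unique)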
Using randomly named hard links and the link counts on those files as lock files is a common strategy (e.g. this), and arguably better than using lockd. For far more information about the limits of all sorts of locks over NFS, read this: http://0pointer.de/blog/projects/locking.html
You'll also find that this is a long-standing standard problem for MTA software using mbox files over NFS. Probably the best answer there was to use Maildir instead of mbox, but if you look for examples in the source code of something like Postfix, it'll be close to best practice. And if they simply don't solve that problem, that might also be your answer.
NFS is great for file sharing. It sucks as a "transmission" medium.
I've been down the NFS-for-data-transmission road multiple times. In every instance, the solution involved moving away from NFS.
Getting reliable locking is one part of the problem. The other part is the update of the file on the server and expecting the clients to receive that data at some specific point-in-time (such as before they can grab the lock).
NFS isn't designed to be a data transmission solution. There are caches and timing involved. Not to mention paging of the file content, and file metadata (e.g. the atime attribute). And client O/S'es keeping track of state locally (such as "where" to append the client's data when writing to the end of the file).
For a distributed, synchronized store, I recommend looking at a tool that does just that. Such as Cassandra, or even a general-purpose database.
If I'm reading the use-case correctly, you could also go with a simple server-based solution. Have a server listen for TCP connections, read messages from the connections, and then write each to file, serializing the writes within the server itself. There's some added complexity in having your own protocol (to know where a message starts and stops), but otherwise, it's fairly straight-forward.
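If that route appeals, a minimal sketch of such a serializing writer could look like this (the port, the output path, and the one-message-per-line framing are all assumptions made for the example):
import socketserver

RESULTS_PATH = 'results.log'  # placeholder output file

class AppendHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # One newline-terminated message per line. The default TCPServer
        # handles one connection at a time, so the server itself serializes
        # the writes and no file locking is needed.
        for line in self.rfile:
            with open(RESULTS_PATH, 'ab') as out:
                out.write(line)

if __name__ == '__main__':
    # Port 5000 is arbitrary; requires Python 3.6+ for the context manager.
    with socketserver.TCPServer(('0.0.0.0', 5000), AppendHandler) as server:
        server.serve_forever()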

When/How does an anonymous file object close?

In the comments of this question about a python one-liner, it occurred to me I have no idea how python handles anonymous file objects. From the question:
open(to_file, 'w').write(open(from_file).read())
There are two calls to open without using the with keyword (which is usually how I handle files). I have, in the past, used this kind of unnamed file. IIRC, it seemed there was a leftover OS-level lock on the file that would expire after a minute or two.
So what happens to these file handles? Are they cleaned up by garbage collection? By the OS? What happens to the Python machine and file when close() is called, and will it all happen anyway when the script finishes and some time passes?
Monitoring the file descriptor on Linux (by checking /proc/<pid>/fd) and the file handle on Windows (using Sysinternals tools), it appears that the file is closed immediately after the statement.
This cannot be guaranteed, however, since the garbage collector has to execute. In the testing I have done, it does get closed at once every time.
The with statement is recommended to be used with open, however the occasions when it is actually needed are rare. It is difficult to demonstrate a scenario where you must use with, but it is probably a good idea to be safe.
So your one-liner becomes:
with open(to_file, 'w') as tof, open(from_file) as fof:
    tof.write(fof.read())
The advantage of with is that the special method (in the io class) called __exit__() is guaranteed* to be called.
* Unless you do something like os._exit().
The files will get closed after the garbage collector collects them; CPython will collect them immediately because it uses reference counting, but this is not guaranteed behavior.
If you use files without closing them in a loop, you might run out of file descriptors; that's why it's recommended to use the with statement (if you're using Python 2.5 you can use from __future__ import with_statement).
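For example (data.txt is a placeholder; on CPython reference counting usually rescues the first loop, but on PyPy or Jython the unclosed handles can pile up until the descriptor limit is hit):
# May leak descriptors on implementations without prompt refcounting:
for _ in range(10000):
    line = open('data.txt').readline()

# Always safe: the file is closed before the next iteration starts.
for _ in range(10000):
    with open('data.txt') as f:
        line = f.readline()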

python: what happens to opened file if i quit before it is closed?

I am opening a CSV file:
import csv
import collections

def get_file(start_file):  # opens original file, reads it to array
    with open(start_file, 'rb') as f:
        data = list(csv.reader(f))
        header = data[0]
        counter = collections.defaultdict(int)
        for row in data:
            counter[row[10]] += 1
    return (data, counter, header)
Does the file stay in memory if I quit the program inside the with block?
What happens to the variables inside the program in general when I quit the program without setting all variables to NULL?
The operating system will automatically close any open file descriptors when your process terminates.
File data stored in memory (e.g. variables, Python buffers) will be lost. Data buffered in the operating system may be flushed to disk when the file is implicitly closed (checking the exact semantics of in-kernel dirty-buffers here would be educational, though you should not rely on it).
Your variables cease to exist when your process terminates.
My understanding of the with statement is that, no matter what, it will take care of closing your file handles for you when you exit its scope. That should still happen if your program exits inside the with block.
As far as other variables are concerned, they're deleted from memory when your program exits automatically. If you are interested in finding ways to make something persistent between runs you can look at the pickle (http://docs.python.org/library/pickle.html) or shelve (http://docs.python.org/library/shelve.html) modules. Personally, I prefer shelve to pickle, but they both work well for that.
#gotgenes - Thanks for the suggestion. It's important to note that shelve uses pickle in its underlying implementation. When I say I prefer shelve to pickle, I mean that for the ways that persistence is important in what I'm currently designing using shelve is easier because it's not doing anything more than serving as a dictionary that persists between runs.
You never have to set variables to NULL; as soon as your program terminates, the memory is freed. The same holds true for the file - it stays in memory no more or less whether you quit inside the with block or anywhere else. However, it is good practice to manually close the file so you can be sure that any pending operations are performed before the program exits. In general this should happen anyway, but especially when writing, I generally prefer the explicit close.

for line in open(filename)

I frequently see python code similar to
for line in open(filename):
    do_something(line)
When does filename get closed with this code?
Would it be better to write
with open(filename) as f:
    for line in f.readlines():
        do_something(line)
The file would be closed when the file object falls out of scope. That would normally be at the end of the method.
Yes, it's better to use with.
Once you have a file object, you perform all file I/O by calling methods of this object. [...] When you are done with the file, you should finish by calling the close method on the object, to close the connection to the file:
input.close()
In short scripts, people often omit this step, as Python automatically closes the file when a file object is reclaimed during garbage collection (which in mainstream Python means the file is closed just about at once, although other important Python implementations, such as Jython and IronPython, have other, more relaxed garbage collection strategies). Nevertheless, it is good programming practice to close your files as soon as possible, and it is especially a good idea in larger programs, which otherwise may be at more risk of having excessive numbers of uselessly open files lying about. Note that try/finally is particularly well suited to ensuring that a file gets closed, even when a function terminates due to an uncaught exception.
Python Cookbook, Page 59.
Drop .readlines(). It is redundant and undesirable for large files (due to memory consumption). The variant with the with block always closes the file.
with open(filename) as file_:
    for line in file_:
        do_something(line)
When the file will be closed in the bare for-loop variant depends on the Python implementation.
The with variant is better because it closes the file afterwards.
You don't even have to use readlines(). for line in file is enough.
I don't think the first one closes it.
Python is garbage-collected - CPython has reference counting plus a backup cycle-detecting garbage collector.
File objects close their file handle when they are deleted/finalized.
Thus the file will eventually be closed, and in CPython it will be closed as soon as the for loop finishes.
