The Python built-ins open and file work with context managers in a way that I don't quite understand.
My understanding is that open will create a file object. file implements the context-manager methods __enter__ and __exit__. I would initially expect __enter__ to implement the actual opening of the file descriptor.
However, using open outside a with block returns a file object that is already open. So it appears that either file.__init__ or open is actually opening the file descriptor, and as far as I can tell file.__enter__ isn't doing anything. Or does file.__init__/open call file.__enter__ directly?
First question:
What is the execution-flow of the open built-in? What does open handle, what does file.__init__ handle, and what does file.__enter__ handle? How does this work when re-using one file object for multiple cycles of opening/closing the file? How is this different from re-using other contextmanager objects for multiple context-cycles?
Second question:
Objects such as file objects have setup steps and teardown steps. The setup occurs in __init__, and the teardown occurs in either close or __exit__.
Is this a good design pattern? Should this design pattern be implemented for custom functions/context managers?
If you look in _pyio.py (a pure-Python implementation of the io module) you find the following code in class IOBase:
    ### Context manager ###

    def __enter__(self):  # That's a forward reference
        """Context management protocol.  Returns self (an instance of IOBase)."""
        self._checkClosed()
        return self

    def __exit__(self, *args):
        """Context management protocol.  Calls close()"""
        self.close()
This contains the answers to most of your questions. The important thing to understand is that the context manager's job is to ensure that you close the file when you are done with it. It does this simply by calling the close function, which saves you the trouble of doing so.
What does file.__enter__ handle? Nothing. It simply returns to you the file object that was the result of the call to the built-in function open().
How does this work when using one file object for multiple cycles of opening and closing the file? The context manager is not very useful for that purpose, since you must explicitly open the file each time.
Is this a good design pattern? Yes: it reduces the amount of code you have to write, and it's easy to read and understand.
Should this pattern be implemented for custom functions/context managers? Any time you have an object that needs to be cleaned up, or has usage that involves some type of open/close concept, you should consider this pattern. The standard library has many other examples.
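As a sketch of the pattern the answer describes (the class and names here are hypothetical, not from any library): setup happens in __init__, teardown lives in close(), and __enter__/__exit__ delegate to them just as IOBase does.

```python
class Resource:
    """Hypothetical resource following the file-object pattern."""

    def __init__(self, name):
        self.name = name
        self.closed = False      # "setup" happens here, like file.__init__

    def close(self):
        self.closed = True       # teardown lives in close()

    def __enter__(self):
        if self.closed:
            raise ValueError("I/O operation on closed resource")
        return self              # like IOBase.__enter__: just return self

    def __exit__(self, *args):
        self.close()             # like IOBase.__exit__: delegate to close()

with Resource("demo") as r:
    assert not r.closed
assert r.closed                  # closed automatically on block exit
```

Because __exit__ only delegates to close(), callers who don't use with can still clean up explicitly, exactly as with file objects.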
For Question 1
In CPython 2, open() does nothing but create a file object, whose underlying C type is PyFileObject; see the source code in bltinmodule.c and fileobject.c:
    static PyObject *
    builtin_open(PyObject *self, PyObject *args, PyObject *kwds)
    {
        return PyObject_Call((PyObject*)&PyFile_Type, args, kwds);
    }
file.__init__ opens the file.
file.__enter__ does nothing except an emptiness check on the field file.fp.
file.__exit__ invokes the close() method to close the file.
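This division of labor is easy to verify from the interpreter on Python 3 as well (io replaces the old file type, but the behavior is the same). A small sketch using a throwaway temp file:

```python
import tempfile

# Create a file on disk to open.
with tempfile.NamedTemporaryFile(mode="w", delete=False) as tmp:
    path = tmp.name

f = open(path)             # the descriptor is opened here, not in __enter__
assert f.__enter__() is f  # __enter__ just returns the same, already-open object
assert not f.closed
f.__exit__(None, None, None)
assert f.closed            # __exit__ called close()
```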
For Question 2
file is designed like this for historical reasons.
open and with are two keywords introduced in different versions of CPython. with was not introduced until Python 2.5 (see PEP 343); by that time, open had already been in use for a long time.
For our own types, we can follow file's design or not, depending on the concrete application context.
For example, threading.Lock uses a different design: its initialization and its __enter__ (which acquires the lock) are separate steps.
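A quick illustration of that difference: creating a threading.Lock does not acquire it; the real work happens in __enter__, which calls acquire().

```python
import threading

lock = threading.Lock()   # creation: the lock exists but is not held
assert not lock.locked()

with lock:                # __enter__ calls acquire() -- the real work
    assert lock.locked()

assert not lock.locked()  # __exit__ called release()
```

Contrast this with file, where __init__ does the real work (opening the descriptor) and __enter__ is a no-op.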
Related
Trying to reason about the CPython source and was curious about the built-in open() method.
This method is defined in _pyio.py and returns a FileIO object, so I dug through source and found that (on Windows) there is a call to _wopen (source line). Interestingly enough, I stumbled into fileutils.c where _Py_open is defined and subsequently _Py_open_impl. The latter makes a call to open (source line) which has a different signature than _wopen which I presume is referencing _wfopen; however, below that there are _Py_wfopen, _Py_fopen and _Py_fopen_obj. Their comment lines seem to indicate that they are wrappers around the C functions provided from #include's, so I know they're calling the originals and extending their functionality.
I'm not a C person by any means, mostly I can dig around code for debugging. This, however, has me lost. How are all these methods tied together (on Windows)? So far I have:
open() -> io.py -> _pyio.py (_io) -> _iomodule.c -> ?
Not seeing where _Py_fopen or _Py_wfopen are explicitly called (or used to wrap library functions) other than in main.c for startup file operations.
The latter makes a call to open (source line) which has a different signature than _wopen which I presume is referencing _wfopen
It's not clear what you mean, but the call to open refers to the Unix open(2) syscall; it has nothing to do with Windows.
open() -> io.py -> _pyio.py (_io) -> _iomodule.c -> ?
_iomodule.c defines _io_open_impl which instantiates PyFileIO_Type from fileio.c.
That actually opens a file in _io_FileIO___init___impl which, after some faffing around, simply calls _wopen on windows and open(2) elsewhere: https://github.com/python/cpython/blob/master/Modules/_io/fileio.c#L383
While it is simple to search using help for most methods that have a clear help(module.method) arrangement, for example help(list.extend), I cannot work out how to look up the method .readline() in Python's built-in help function.
Which module does .readline belong to? How would I search in help for .readline and related methods?
Furthermore, is there any way I can use the interpreter to find out which module a method belongs to in the future?
Don't try to find the module. Make an instance of the class you want, then call help on the method of that instance, and it will find the correct help info for you. Example:
>>> f = open('pathtosomefile')
>>> help(f.readline)
Help on built-in function readline:
readline(size=-1, /) method of _io.TextIOWrapper instance
Read until newline or EOF.
Returns an empty string if EOF is hit immediately.
In my case (Python 3.7.1), it's defined on the type _io.TextIOWrapper (exposed publicly as io.TextIOWrapper, but help doesn't know that), but memorizing that sort of thing isn't very helpful. Knowing how to figure it out by introspecting the specific thing you care about is much more broadly applicable. In this particular case, it's extra important not to try guessing, because the open function can return a few different classes, each with different methods, depending on the arguments provided, including io.BufferedReader, io.BufferedWriter, io.BufferedRandom, and io.FileIO, each with their own version of the readline method (though they all share a similar interface for consistency's sake).
From the text of help(open):
open() returns a file object whose type depends on the mode, and
through which the standard file operations such as reading and writing
are performed. When open() is used to open a file in a text mode ('w',
'r', 'wt', 'rt', etc.), it returns a TextIOWrapper. When used to open
a file in a binary mode, the returned class varies: in read binary
mode, it returns a BufferedReader; in write binary and append binary
modes, it returns a BufferedWriter, and in read/write mode, it returns
a BufferedRandom.
See also the section of python's io module documentation on the class hierarchy.
So you're looking at TextIOWrapper, BufferedReader, BufferedWriter, or BufferedRandom. These all have their own sets of class hierarchies, but suffice it to say that they share the IOBase superclass at some point - that's where the functions readline() and readlines() are declared. Of course, each subclass implements these functions differently for its particular mode - if you do
help(_io.TextIOWrapper.readline)
you should get the documentation you're looking for.
In particular, you're having trouble accessing the documentation for whichever version of readline you need because it's not obvious which class it belongs to. You can actually call help on an object as well. If you're working with a particular file object, then you can spin up a terminal, instantiate it, and then just pass it to help() and it'll show you whatever interface is closest to the surface. Example:
x = open('some_file.txt', 'r')
help(x.readline)
I have read, that file opened like this is closed automatically when leaving the with block:
with open("x.txt") as f:
    data = f.read()
    # do something with data
yet when opening from web, I need this:
from contextlib import closing
from urllib.request import urlopen
with closing(urlopen('http://www.python.org')) as page:
    for line in page:
        print(line)
why and what is the difference? (I am using Python3)
The details get a little technical, so let's start with the simple version:
Some types know how to be used in a with statement. File objects, like what you get back from open, are an example of such a type. As it turns out, the objects you get back from urllib.request.urlopen are also an example of such a type, so your second example could be written the same way as the first.
But some types don't know how to be used in a with statement. The closing function is designed to wrap such types—as long as they have a close method, it will call their close method when you exit the with statement.
Of course some types don't know how to be used in a with statement, and also can't be used with closing because their cleanup method isn't named close (or because cleaning them up is more complicated than just closing them). In that case, you need to write a custom context manager. But even that isn't usually that hard.
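For that third case, a minimal sketch of a custom context manager, built with contextlib.contextmanager around a hypothetical object whose cleanup method is named shutdown() rather than close() (so closing would not work):

```python
import contextlib

class Connection:
    """Hypothetical object whose cleanup method is not named close()."""
    def __init__(self):
        self.open = True

    def shutdown(self):
        self.open = False

@contextlib.contextmanager
def shutting_down(conn):
    try:
        yield conn          # value bound by the "as" clause
    finally:
        conn.shutdown()     # cleanup method with a non-standard name

with shutting_down(Connection()) as conn:
    assert conn.open
assert not conn.open        # cleaned up even though it has no close()
```

This mirrors the structure of contextlib.closing itself, just with a different cleanup call in the finally block.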
In technical terms:
A with statement requires a context manager, an object with __enter__ and __exit__ methods. It will call the __enter__ method, and give you the value returned by that method in the as clause, and it will then call the __exit__ method at the end of the with statement.
File objects inherit from io.IOBase, which is a context manager whose __enter__ method returns itself, and whose __exit__ calls self.close().
The object returned by urlopen is (assuming an http or https URL) an HTTPResponse, which, as the docs say, "can be used with a with statement".
The closing function:
Return a context manager that closes thing upon completion of the block. This is basically equivalent to:
@contextmanager
def closing(thing):
    try:
        yield thing
    finally:
        thing.close()
It's not always 100% clear in the docs which types are context managers and which types aren't. Especially since there's been a major drive since 3.1 to make everything that could be a context manager into one (and, for that matter, to make everything that's mostly-file-like into an actual IOBase if it makes sense), but it's still not 100% complete as of 3.4.
You can always just try it and see. If you get an AttributeError: __exit__, then the object isn't usable as a context manager. If you think it should be, file a bug suggesting the change. If you don't get that error, but the docs don't mention that it's legal, file a bug suggesting the docs be updated.
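If you'd rather check up front than catch the AttributeError inside a with statement, a small sketch (the helper name is made up):

```python
import io

def is_context_manager(obj):
    # The with statement looks up __enter__/__exit__ on the type,
    # so check the type rather than the instance.
    return hasattr(type(obj), "__enter__") and hasattr(type(obj), "__exit__")

assert is_context_manager(io.BytesIO())   # IOBase subclasses qualify
assert not is_context_manager(42)         # plain ints do not
```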
You don't. urlopen('http://www.python.org') returns a context manager too:
with urlopen('http://www.python.org') as page:
This is documented on the urllib.request.urlopen() page:
For ftp, file, and data urls and requests explicitly handled by legacy URLopener and FancyURLopener classes, this function returns a urllib.response.addinfourl object which can work as context manager [...].
Emphasis mine. For HTTP responses, http.client.HTTPResponse() object is returned, which also is a context manager:
The response is an iterable object and can be used in a with statement.
The Examples section also uses the object as a context manager:
As the python.org website uses utf-8 encoding as specified in its meta tag, we will use the same for decoding the bytes object.
>>> with urllib.request.urlopen('http://www.python.org/') as f:
... print(f.read(100).decode('utf-8'))
...
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtm
Objects returned by open() are context managers too; they implement the special methods object.__enter__() and object.__exit__().
The contextlib.closing() documentation uses an example with urlopen() that is out of date; in Python 2 the predecessor for urllib.request.urlopen() did not produce a context manager and you needed to use that tool to auto-close the connection with a context manager. This was fixed with issues 5418 and 12365, but that example was not updated. I created issue 22755 asking for a different example.
I am really confused about when to use os.open and when to use os.fdopen.
I was doing all my work with os.open and it worked without any problem, but I am not able to understand under what conditions we need file descriptors and all the other functions like dup and fsync.
Is the file object different from the file descriptor?
I mean f = os.open("file.txt",w)
Now is f the file object or is it the file descriptor?
You are confusing the built-in open() function with os.open() provided by the os module. They are quite different; os.open(filename, "w") is not valid Python (os.open accepts integer flags as its second argument), open(filename, "w") is.
In short, open() creates new file objects, os.open() creates OS-level file descriptors, and os.fdopen() creates a file object out of a file descriptor.
File descriptors are a low-level facility for working with files directly provided by the operating system kernel. A file descriptor is a small integer that identifies the open file in a table of open files kept by the kernel for each process. A number of system calls accept file descriptors, but they are not convenient to work with, typically requiring fixed-width buffers, multiple retries in certain conditions, and manual error handling.
File objects are Python classes that wrap file descriptors to make working with files more convenient and less error-prone. They provide, for example, error-handling, buffering, line-by-line reading, charset conversions, and are closed when garbage collected.
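The relationship between the two layers can be seen directly: a file object exposes its underlying descriptor via fileno(), and the low-level os functions accept that integer. A sketch:

```python
import os
import tempfile

# Write some content to a temp file.
with tempfile.NamedTemporaryFile(mode="w", delete=False) as tmp:
    tmp.write("hello\n")
    path = tmp.name

f = open(path)                       # high-level file object
fd = f.fileno()                      # the OS-level descriptor: a small int
assert isinstance(fd, int)
assert os.read(fd, 5) == b"hello"    # low-level os functions accept it
f.close()                            # closing the object closes the descriptor too
```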
To recapitulate:
Built-in open() takes a file name and returns a new Python file object. This is what you need in the majority of cases.
os.open() takes a file name and returns a new file descriptor. This file descriptor can be passed to other low-level functions, such as os.read() and os.write(), or to os.fdopen(), as described below. You only need this when writing code that depends on operating-system-dependent APIs, such as using the O_EXCL flag to open(2).
os.fdopen() takes an existing file descriptor — typically produced by Unix system calls such as pipe() or dup(), and builds a Python file object around it. Effectively it converts a file descriptor to a full file object, which is useful when interfacing with C code or with APIs that only create low-level file descriptors.
Built-in open can be emulated by combining os.open() (to create a file descriptor) and os.fdopen() (to wrap it in a file object):
# functionally equivalent to open(filename, "r")
f = os.fdopen(os.open(filename, os.O_RDONLY))
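As an example of the OS-dependent APIs mentioned above, a sketch using O_EXCL for an atomic create-only open (built-in open() spells this mode "x"); the file and directory names are just scaffolding:

```python
import os
import tempfile

# O_CREAT | O_EXCL makes the open fail if the file already exists:
# an atomic "create only" operation.
path = os.path.join(tempfile.mkdtemp(), "once.txt")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
with os.fdopen(fd, "w") as f:        # wrap the raw descriptor in a file object
    f.write("created exactly once\n")

try:
    os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    raced = False
except FileExistsError:
    raced = True                     # the second exclusive open is refused
```

This is the sort of guarantee that is awkward to express through the high-level open() alone, which is why os.open exists.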
I would like to create a class that describes a file resource and then pickle it. This part is straightforward. To be concrete, let's say that I have a class "A" that has methods to operate on a file. I can pickle this object if it does not contain a file handle.
I want to be able to create a file handle in order to access the resource described by "A". If I have an open() method in class "A" that opens and stores the file handle for later use, then "A" is no longer pickleable. (I add here that opening the file includes some non-trivial indexing which cannot be cached--third-party code--so closing and reopening when needed is not without expense.)
I could code class "A" as a factory that can generate file handles to the described file, but that could result in multiple file handles accessing the file contents simultaneously. I could use another class "B" to handle the opening of the file in class "A", including locking, etc. I am probably overthinking this, but any hints would be appreciated.
The question isn't too clear; what it looks like is that:
you have a third-party module which has picklable classes
those classes may contain references to files, which makes the classes themselves not picklable because open files aren't picklable.
Essentially, you want to make open files picklable. You can do this fairly easily, with certain caveats. Here's an incomplete but functional sample:
import pickle

class PicklableFile(object):
    def __init__(self, fileobj):
        self.fileobj = fileobj

    def __getattr__(self, key):
        return getattr(self.fileobj, key)

    def __getstate__(self):
        ret = self.__dict__.copy()
        ret['_file_name'] = self.fileobj.name
        ret['_file_mode'] = self.fileobj.mode
        ret['_file_pos'] = self.fileobj.tell()
        del ret['fileobj']
        return ret

    def __setstate__(self, dict):
        self.fileobj = open(dict['_file_name'], dict['_file_mode'])
        self.fileobj.seek(dict['_file_pos'])
        del dict['_file_name']
        del dict['_file_mode']
        del dict['_file_pos']
        self.__dict__.update(dict)

f = PicklableFile(open("/tmp/blah"))
print(f.readline())
data = pickle.dumps(f)
f2 = pickle.loads(data)
print(f2.read())
Caveats and notes, some obvious, some less so:
This class should operate directly on the file object you got from open. If you're using wrapper classes on files, like gzip.GzipFile, those should go above this, not below it. Logically, treat this as a decorator class on top of file.
If the file doesn't exist when you unpickle, it can't be unpickled and will throw an exception.
If it's a different file, the behavior may or may not make sense.
If the file mode includes file creation ('w+'), and the file doesn't exist, it'll be created; we don't know what file permissions to use, since that's not stored with the file. If this is important--it probably shouldn't be--then store the correct permissions in the class when you first create it.
If the file isn't seekable, trying to seek to the old position may raise IOError; if you're using a file like that you'll need to decide how to handle that.
The file classes in Python 2 and Python 3 are different; there's no file class in Python 3. Even if you're only using Python 2 right now, don't subclass file.
I'd steer away from doing this; having pickled data dependent on external files not changing and staying in the same place is brittle. This makes it difficult to even relocate files, since your pickled data won't make sense.
If you open a pointer to a file, pickle it, and then attempt to reconstitute it later, there is no guarantee that the file will still be available for opening.
To elaborate, the file pointer really represents a connection to the file. Just like a database connection, you can't "pickle" the other end of the connection, so this won't work.
Is it possible to keep the file pointer around in memory in its own process instead?
It sounds like you know you can't pickle the handle, and you're ok with that, you just want to pickle the part that can be pickled. As your object stands now, it can't be pickled because it has the handle. Do I have that right? If so, read on.
The pickle module will let your class describe its own state to pickle, for exactly these cases. You want to define your own __getstate__ method. The pickler will invoke it to get the state to be pickled, only if the method is missing does it go ahead and do the default thing of trying to pickle all the attributes.
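A minimal sketch of that approach, with hypothetical names: __getstate__ returns a copy of the state with the unpicklable handle dropped, so the rest of the object pickles normally.

```python
import pickle
import tempfile

class A:
    """Hypothetical resource description that may hold an open handle."""
    def __init__(self, path):
        self.path = path
        self.handle = None

    def open(self):
        self.handle = open(self.path)

    def __getstate__(self):
        # Called by pickle: return the state to serialize, minus the handle.
        state = self.__dict__.copy()
        state["handle"] = None
        return state

# Scaffolding: a real file to point the resource at.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name

a = A(path)
a.open()
data = pickle.dumps(a)      # succeeds even though a.handle is an open file
b = pickle.loads(data)
assert b.handle is None     # the handle was dropped; reopen on demand
assert b.path == a.path
a.handle.close()
```

The unpickled object can then lazily reopen (and re-index) the file the first time it is needed.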