I'm trying to reason about the CPython source and was curious about the built-in open() function.
This function is defined in _pyio.py and returns a FileIO object, so I dug through the source and found that (on Windows) there is a call to _wopen (source line). Interestingly enough, I stumbled into fileutils.c, where _Py_open is defined and, subsequently, _Py_open_impl. The latter makes a call to open (source line) which has a different signature than _wopen, which I presume is referencing _wfopen; however, below that there are _Py_wfopen, _Py_fopen and _Py_fopen_obj. Their comment lines seem to indicate that they are wrappers around the C functions provided by the #include's, so I know they're calling the originals and extending their functionality.
I'm not a C person by any means, mostly I can dig around code for debugging. This, however, has me lost. How are all these methods tied together (on Windows)? So far I have:
open() -> io.py -> _pyio.py (_io) -> _iomodule.c -> ?
I'm not seeing where _Py_fopen or _Py_wfopen are explicitly called (or used to wrap library functions) other than in main.c for startup file operations.
The latter makes a call to open (source line) which has a different signature than _wopen which I presume is referencing _wfopen
It's not clear what you mean, but that call to open refers to the Unix open(2) syscall; it has nothing to do with Windows.
open() -> io.py -> _pyio.py (_io) -> _iomodule.c -> ?
_iomodule.c defines _io_open_impl which instantiates PyFileIO_Type from fileio.c.
That actually opens a file in _io_FileIO___init___impl which, after some faffing around, simply calls _wopen on Windows and open(2) elsewhere: https://github.com/python/cpython/blob/master/Modules/_io/fileio.c#L383
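As a rough sanity check of that chain, here's a minimal sketch (CPython 3.x; example.txt is just a placeholder name):

import builtins, io

print(builtins.open is io.open)   # True: the builtin open is _io.open
f = open("example.txt", "w")
print(type(f))                    # <class '_io.TextIOWrapper'>
print(type(f.buffer))             # <class '_io.BufferedWriter'>
print(type(f.buffer.raw))         # <class '_io.FileIO'>, which made the _wopen/open(2) call
f.close()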
TLDR
I'm noticing a significant difference in the information presented by the official python docs compared to what I'm seeing in the PyCharm hover-over / quickdocs. I'm hoping someone can point me to where I can find the source of this quickdoc information such that I can use it outside of PyCharm as a general reference.
For example, in the python docs for os.makedirs I see:
os.makedirs(name, mode=0o777, exist_ok=False)
Recursive directory creation function. Like mkdir(), but makes all intermediate-level directories needed to contain the leaf directory.
The mode parameter is passed to mkdir() for creating the leaf directory; see the mkdir() description for how it is interpreted. To set the file permission bits of any newly created parent directories you can set the umask before invoking makedirs(). The file permission bits of existing parent directories are not changed.
If exist_ok is False (the default), a FileExistsError is raised if the target directory already exists.
Note
makedirs() will become confused if the path elements to create include pardir (e.g. “..” on UNIX systems).
This function handles UNC paths correctly.
Raises an auditing event os.mkdir with arguments path, mode, dir_fd.
New in version 3.2: The exist_ok parameter.
Changed in version 3.4.1: Before Python 3.4.1, if exist_ok was True and the directory existed, makedirs() would still raise an error if mode did not match the mode of the existing directory. Since this behavior was impossible to implement safely, it was removed in Python 3.4.1. See bpo-21082.
Changed in version 3.6: Accepts a path-like object.
Changed in version 3.7: The mode argument no longer affects the file permission bits of newly created intermediate-level directories.
But in the quickdocs I see:
os

def makedirs(name: str | bytes | PathLike[str] | PathLike[bytes],
             mode: int = ...,
             exist_ok: bool = ...) -> None
makedirs(name [, mode=0o777][, exist_ok=False]) Super-mkdir; create a leaf directory and all intermediate ones. Works like mkdir, except that any intermediate path segment (not just the rightmost) will be created if it does not exist. If the target directory already exists, raise an OSError if exist_ok is False. Otherwise no exception is raised. This is recursive.
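Interestingly, the prose part of that quickdoc matches what the interpreter itself reports as the function's docstring, which you can print without any IDE:

import os

# Prints the same "Super-mkdir; ..." text shown in the quickdoc prose.
print(os.makedirs.__doc__)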
Where is this quickdoc type hinting information coming from and where can I find a complete reference with all these type hints such that I can reference it outside of PyCharm?
Background
Coming mainly from a strongly typed language like Java, I struggle to make constructive use of the Python documentation with regard to function input parameter types. I am hoping someone can elucidate a standard process for resolving ambiguity, compared to my current trial+[lots of]error approach.
For example, the os.makedirs function's first parameter is name.
os.makedirs(name, mode=0o777, exist_ok=False)
It is not apparent to me what sorts of things I can pass as name here. Could it be:
A str? If so, how should I create it: via a string literal, a double-quoted string? Does it accept / separators or \ separators, or is it system dependent?
A pathlib.Path?
Anything pathlike?
[Note the above are rhetorical questions and not the focus of this post]. These are all informed guesses, but if I were completely new to python and was trying to use this documentation to make some directories, I see two options:
Read the source code via some IDE or other indexing
Guess until I get it right
The first is fine for easier-to-understand functions like makedirs, but for more complicated functions it would require gaining expertise in a library that I don't necessarily want to reuse and just want to try out. I simply don't have enough time to become an expert in everything I encounter. This seems quite inefficient.
The second also seems to be quite inefficient, with the added demerit of not knowing how to write robust code to check for inappropriate inputs.
Now I don't want to bash the python docs, as they are LEAPS and BOUNDS better than those of a fair few other languages I've used, but is this dilemma just a case of unfinished/poor documentation, or is there a standard way of knowing/understanding what input parameters like name should be that I haven't outlined above?
To be fair, this may not be the optimal example, as if you look towards the end of the doc for makedirs you can see it does state:
Changed in version 3.6: Accepts a path-like object.
but this is not specifically referring to name. Yes, in this example it may seem rather obvious it is referring to name, but with the advent of type-hinting, why are the docs not type hinted like the quickdocs from PyCharm? Is this something planned for the future, or is it too large a can of worms to try to hint all possibilities in a flexible language like python?
Just as a comparison, take a look at Java's java.io.File.mkdirs, where the various constructors definitely tell you all the options for specifying the path of the file:
File(File parent, String child)
// Creates a new File instance from a parent abstract pathname and a child pathname string.
File(String pathname)
// Creates a new File instance by converting the given pathname string into an abstract pathname.
File(String parent, String child)
// Creates a new File instance from a parent pathname string and a child pathname string.
File(URI uri)
// Creates a new File instance by converting the given file: URI into an abstract pathname.
Just reading this I already know exactly how to make a File object and create directories without running/testing anything. With the quickdoc in PyCharm I can do the same, so where is this type hint information in the official docs?
While it is simple to search using help for most methods that have a clear help(module.method) arrangement, for example help(list.extend), I cannot work out how to look up the method .readline() in Python's built-in help function.
Which module does .readline belong to? How would I search in help for .readline and related methods?
Furthermore, is there any way I can use the interpreter to find out which module a method belongs to in future?
Don't try to find the module. Make an instance of the class you want, then call help on the method of that instance, and it will find the correct help info for you. Example:
>>> f = open('pathtosomefile')
>>> help(f.readline)
Help on built-in function readline:

readline(size=-1, /) method of _io.TextIOWrapper instance
    Read until newline or EOF.

    Returns an empty string if EOF is hit immediately.
In my case (Python 3.7.1), it's defined on the type _io.TextIOWrapper (exposed publicly as io.TextIOWrapper, but help doesn't know that), but memorizing that sort of thing isn't very helpful. Knowing how to figure it out by introspecting the specific thing you care about is much more broadly applicable. In this particular case, it's extra important not to try guessing, because the open function can return a few different classes, each with different methods, depending on the arguments provided, including io.BufferedReader, io.BufferedWriter, io.BufferedRandom, and io.FileIO, each with their own version of the readline method (though they all share a similar interface for consistency's sake).
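For example (a CPython 3.x session, assuming some_file.txt exists):

>>> type(open('some_file.txt'))                     # text mode
<class '_io.TextIOWrapper'>
>>> type(open('some_file.txt', 'rb'))               # binary read mode
<class '_io.BufferedReader'>
>>> type(open('some_file.txt', 'rb', buffering=0))  # unbuffered binary mode
<class '_io.FileIO'>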
From the text of help(open):
open() returns a file object whose type depends on the mode, and
through which the standard file operations such as reading and writing
are performed. When open() is used to open a file in a text mode ('w',
'r', 'wt', 'rt', etc.), it returns a TextIOWrapper. When used to open
a file in a binary mode, the returned class varies: in read binary
mode, it returns a BufferedReader; in write binary and append binary
modes, it returns a BufferedWriter, and in read/write mode, it returns
a BufferedRandom.
See also the section of python's io module documentation on the class hierarchy.
So you're looking at TextIOWrapper, BufferedReader, BufferedWriter, or BufferedRandom. These all have their own sets of class hierarchies, but suffice it to say that they share the IOBase superclass at some point - that's where the functions readline() and readlines() are declared. Of course, each subclass implements these functions differently for its particular mode - if you do
import io
help(io.TextIOWrapper.readline)
you should get the documentation you're looking for.
In particular, you're having trouble accessing the documentation for whichever version of readline you need because it's not obvious which class it belongs to. You can actually call help on an object as well. If you're working with a particular file object, you can spin up a terminal, instantiate it, and then just pass it to help(), and it'll show you whatever interface is closest to the surface. Example:
x = open('some_file.txt', 'r')
help(x.readline)
The Python built-ins open and file work with context managers in a way that I don't quite understand.
It is my understanding that open will create a file object. file implements the context-manager methods __enter__ and __exit__. I would initially expect __enter__ to implement the actual opening of the file descriptor.
However, using open outside of a with block will return a file which is already open. So, it appears either file.__init__ or open is actually opening the file descriptor, and as far as I can tell file.__enter__ isn't doing anything. Or maybe file.__init__/open calls file.__enter__ directly?
First question:
What is the execution-flow of the open built-in? What does open handle, what does file.__init__ handle, and what does file.__enter__ handle? How does this work when re-using one file object for multiple cycles of opening/closing the file? How is this different from re-using other contextmanager objects for multiple context-cycles?
Second question:
Objects such as file objects have setup steps and teardown steps. The setup occurs in __init__, and the teardown occurs in either close or __exit__.
Is this a good design pattern? Should this design pattern be implemented for custom functions/context managers?
If you look in _pyio.py (a pure-Python implementation of the io module) you find the following code in class IOBase:
### Context manager ###

def __enter__(self):  # That's a forward reference
    """Context management protocol. Returns self (an instance of IOBase)."""
    self._checkClosed()
    return self

def __exit__(self, *args):
    """Context management protocol. Calls close()"""
    self.close()
This contains the answers to most of your questions. The important thing to understand is that the context manager's function is to ensure that you close the file when you are done with it. It does this simply by calling the close function, which saves you the trouble of doing so.
What does file.__enter__ handle? Nothing. It simply returns to you the file object that was the result of the call to the built-in function open().
How does this work when using one file object for multiple cycles of opening and closing the file? The context manager is not very useful for that purpose, since you must explicitly open the file each time.
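Both points are easy to verify yourself (a minimal sketch; some_file.txt is a placeholder):

f = open('some_file.txt')  # the file is already open here, before any with block
with f as g:
    print(f is g)          # True: __enter__ just returns self
print(f.closed)            # True: __exit__ called close()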
Is this a good design pattern? Yes: it reduces the amount of code you have to write, and it's easy to read and understand.
Should this pattern be implemented for custom functions/context managers? Any time you have an object that needs to be cleaned up, or has usage that involves some type of open/close concept, you should consider this pattern. The standard library has many other examples.
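As a sketch of the pattern for a custom class (a hypothetical Resource type, mirroring IOBase's approach):

class Resource:
    def __init__(self, name):
        self.name = name      # acquire/open the resource here, like file objects do
        self.closed = False

    def close(self):
        self.closed = True    # release the resource; safe to call more than once

    def __enter__(self):
        if self.closed:
            raise ValueError("operation on closed resource")
        return self

    def __exit__(self, *args):
        self.close()

with Resource("demo") as r:
    print(r.name)             # the resource is guaranteed to be closed afterwards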
For Question 1
In CPython 2, open() does nothing but create a file object, whose underlying C type is PyFileObject; see the source code in bltinmodule.c and fileobject.c:
static PyObject *
builtin_open(PyObject *self, PyObject *args, PyObject *kwds)
{
    return PyObject_Call((PyObject*)&PyFile_Type, args, kwds);
}
file.__init__ opens the file.
file.__enter__ does essentially nothing, apart from an emptiness check on the field file.fp.
file.__exit__ invokes the close() method to close the file.
For Question 2
file is designed like this for historical reasons.
open and with are two keywords introduced in different versions of CPython. with was not introduced until Python 2.5 (see PEP 343); by that time, open had already been in use for a long time.
For our own custom types, we can follow file's design or not, depending on the concrete application context.
For example, threading.Lock is a different design: its __init__ and __enter__ do separate things, since the lock is acquired in __enter__, not in __init__.
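A minimal illustration of that difference:

import threading

lock = threading.Lock()  # __init__ only creates the lock; nothing is acquired yet
with lock:               # __enter__ acquires the lock...
    pass                 # ...and __exit__ releases it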
I have a library function (written in C) that generates text by writing the output to FILE *. I want to wrap this in Python (2.7.x) with code that creates a temp file or pipe, passes it into the function, reads the result from the file, and returns it as a Python string.
Here's a simplified example to illustrate what I'm after:
/* Library function */
void write_numbers(FILE * f, int arg1, int arg2)
{
fprintf(f, "%d %d\n", arg1, arg2);
}
Python wrapper:
import os
from ctypes import *
mylib = CDLL('mylib.so')
def write_numbers( a, b ):
rd, wr = os.pipe()
write_fp = MAGIC_HERE(wr)
mylib.write_numbers(write_fp, a, b)
os.close(wr)
read_file = os.fdopen(rd)
res = read_file.read()
read_file.close()
return res
#Should result in '1 2\n' being printed.
print write_numbers(1,2)
I'm wondering what my best bet is for MAGIC_HERE().
I'm tempted to just use ctypes and create a libc.fdopen() wrapper that returns a Python c_void_p, then pass that into the library function. It seems like that should be safe in theory; I'm just wondering if there are issues with that approach, or whether there's an existing Python-ism to solve this problem.
Also, this will go in a long-running process (let's just assume "forever"), so any leaked file descriptors are going to be problematic.
First, do note that FILE* is an stdio-specific entity. It doesn't exist at system level. The things that exist at system level are descriptors (retrieved with file.fileno()) in UNIX (os.pipe() returns plain descriptors already) and handles (retrieved with msvcrt.get_osfhandle()) in Windows. Thus it's a poor choice as an inter-library exchange format if there can be more than one C runtime in action. You'll be in trouble if your library is compiled against a different C runtime than your copy of Python: 1) binary layouts of the structure may differ (e.g. due to alignment, additional members for debugging purposes, or even different type sizes); 2) in Windows, the file descriptors that the structure links to are C-runtime-specific entities as well, and their table is maintained by the C runtime internally.
Moreover, in Python 3, I/O was overhauled in order to untangle it from stdio. So, FILE* is alien to that Python flavor (and likely, most non-C flavors, too).
Now, what you need is to
somehow guess which C runtime you need, and
call its fdopen() (or equivalent).
(One of Python's mottoes is "make the right thing easy and the wrong thing hard", after all)
The cleanest method is to use the precise C runtime instance that the library is linked to (do pray that it's linked with it dynamically, or there'll be no exported symbol to call).
For the 1st item, I couldn't find any Python modules that can analyze a loaded dynamic module's metadata to find out which DLLs/.so's it has been linked with (just a name, or even name+version, isn't enough, you know, due to possible multiple instances of the library on the system). Though it's definitely possible, since information about the format is widely available.
For the 2nd item, it's a trivial ctypes.CDLL('path').fdopen (_fdopen for MSVCRT).
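For instance, here's a rough sketch of that on a glibc system (the "libc.so.6" name is an assumption and varies by platform):

import ctypes, os

libc = ctypes.CDLL("libc.so.6")
libc.fdopen.restype = ctypes.c_void_p             # treat FILE* as an opaque pointer
libc.fdopen.argtypes = (ctypes.c_int, ctypes.c_char_p)
libc.fclose.argtypes = (ctypes.c_void_p,)

rd, wr = os.pipe()
write_fp = libc.fdopen(wr, b"w")  # FILE* wrapping the pipe's write end
# ... hand write_fp to the library function here ...
libc.fclose(write_fp)             # flushes and closes the underlying wr descriptor too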
Second, you can write a small helper module that is compiled against the same (or a guaranteed-compatible) runtime as the library and does the conversion from the aforementioned descriptor/handle for you. This is effectively a workaround that avoids editing the library proper.
Finally, there's the simplest (and the dirtiest) method using Python's C runtime instance (so all the above warnings apply in full) through Python C API available via ctypes.pythonapi. It takes advantage of
the fact that Python 2's file-like objects are wrappers over stdio's FILE* (Python 3's are not)
the PyFile_AsFile API, which returns the wrapped FILE* (note that it's missing from Python 3)
for a standalone fd, you need to construct a file-like object first (so that there would be a FILE* to return ;) )
the fact that id() of an object is its memory address (CPython-specific)
>>> open("test.txt")
<open file 'test.txt', mode 'r' at 0x017F8F40>
>>> f=_
>>> f.fileno()
3
>>> ctypes.pythonapi
<PyDLL 'python dll', handle 1e000000 at 12808b0>
>>> api=_
>>> api.PyFile_AsFile
<_FuncPtr object at 0x018557B0>
>>> # as per ctypes docs, pythonapi assumes all functions return int by default
>>> api.PyFile_AsFile.restype = ctypes.c_void_p
>>> # as of 2.7.10, long integer arguments are silently truncated to ints,
>>> # see http://bugs.python.org/issue24747
>>> api.PyFile_AsFile.argtypes = (ctypes.c_void_p,)
>>> api.PyFile_AsFile(id(f))
2019259400
Do keep in mind that with fds and C pointers, you need to ensure proper object lifetimes by hand!
file-like objects returned by os.fdopen() do close the descriptor on .close()
so duplicate descriptors with os.dup() if you need them after a file object is closed/garbage collected
while working with the C structure, adjust the corresponding object's reference count with PyFile_IncUseCount()/PyFile_DecUseCount().
ensure no other I/O is done on the descriptors/file objects in the meantime, since it would corrupt the data (e.g. once you call iter(f) or use for l in f, internal caching is done that is independent of stdio's caching)
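For instance, the os.dup() point above, as a minimal sketch with a pipe:

import os

rd, wr = os.pipe()
rd_dup = os.dup(rd)          # duplicate the descriptor before wrapping it

f = os.fdopen(rd)
f.close()                    # this closes rd ...

os.write(wr, b"hi")
os.close(wr)
print(os.read(rd_dup, 100))  # ... but rd_dup is still usable
os.close(rd_dup)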
I just started with Python a couple of days ago, coming from a C++ background. When I write a class, call it from a script, and afterwards update the interface of the class, I get some behaviour I find very unintuitive.
Once successfully compiled, the class seems not to be changeable anymore. Here is an example:
testModule.py:
class testClass:
    def __init__(self, _A):
        self.First = _A

    def Method(self, X, Y):
        print X
testScript.py:
import testModule
tm = testModule.testClass(10)
tm.Method(3, 4)
Execution gives me
3
Now I change the argument list of Method:
def Method(self, X):
Then I delete testModule.pyc, and in my script I call
tm.Method(3)
As result, I get
TypeError: Method() takes exactly 3 arguments (2 given)
What am I doing wrong? Why does the script not use the updated version of the class? I use the Canopy editor, but I saw this behaviour also with the python.exe interpreter.
And apologies if something similar was asked before; I did not find a question related to this one.
Python loads the code objects into memory; the class statement is executed when a file is first imported, and a class object is created and stored in the module namespace. Subsequent imports re-use the already created objects.
The .pyc file is only used the next time the module is imported for the first time in a given Python session. Replacing the file will not result in a module reload.
You can use the reload() function to force Python to replace an already-loaded module with fresh code from disk. Note that any and all other direct references to a class are not replaced; an instance of the testClass class (tm in your case) would still reference the old class object.
When developing code, it is often just easier to restart the Python interpreter and start afresh. That way you don't have to worry about hunting down all direct references and replacing those, for example.
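For example (Python 2 spelling, matching the question; in Python 3 use importlib.reload):

import testModule
tm = testModule.testClass(10)

reload(testModule)               # re-executes testModule.py from disk
tm2 = testModule.testClass(10)   # tm2 uses the updated class ...
tm.Method(3, 4)                  # ... while tm still references the old one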
testModule is already loaded in your interpreter. Deleting the .pyc file won't change anything. You will need to do reload(testModule), or, even better, restart the interpreter.
Deleting the .pyc file cannot effect the change in your case. When you import a module for the first time in the interpreter, it gets completely loaded into the interpreter, and deleting or modifying its files afterwards won't change anything.
Better to restart the interpreter, or use the built-in reload function.