Does Python support zero-copy I/O?

I have two open file objects, dest and src. File object dest is opened for writing, with the seek position placed at some offset within the file, and file object src is opened for reading. What I need to do is simply read from the current position in src to EOF and transfer the contents to dest as quickly as possible.
If I were programming in Java, I could utilize the FileChannel#transferTo() method to perform zero-copy file I/O.
Does Python also support zero-copy?

Since version 3.3, Python has os.sendfile, which interfaces to various Unix variants' sendfile(2) zero-copy I/O interfaces. It operates on file descriptors, not general file-like objects. For older Pythons, there's py-sendfile.
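As a rough sketch of how that maps onto the question's setup (src read from its current position to EOF, dest positioned at some offset), assuming Linux, where out_fd may be a regular file since kernel 2.6.33; on older kernels and some other Unixes, out_fd must be a socket:

import os

def transfer_to_eof(src, dest):
    # Flush Python-level buffers so the descriptors' offsets match
    # what the file objects report.
    dest.flush()
    offset = src.tell()
    remaining = os.fstat(src.fileno()).st_size - offset
    while remaining > 0:
        # sendfile may copy fewer bytes than asked for; loop until EOF.
        sent = os.sendfile(dest.fileno(), src.fileno(), offset, remaining)
        if sent == 0:
            break
        offset += sent
        remaining -= sent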

Since Python 3.8, you can use shutil.copyfile (and the other shutil copy functions), which internally uses platform-specific fast-copy syscalls such as os.sendfile where possible and falls back to a simple read/write loop otherwise.
See the shutil docs for details.
Or issue 33671 (Efficient zero-copy for shutil.copy* functions (Linux, OSX and Win)).
And the corresponding (merged) pull request.
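Note that for the question's exact setup (already-open file objects, dest seeked to an offset), shutil.copyfileobj respects each object's current position, though it is a plain buffered loop; the zero-copy fast path applies to copyfile()/copy()/copy2():

import shutil

# src and dest are the question's already-open binary file objects;
# copying starts at each object's current position and runs to EOF.
shutil.copyfileobj(src, dest, length=16 * 1024 * 1024)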
You might also be interested in copy-on-write support or server-side copy support. See here, here.
os.copy_file_range (since Python 3.8) can do that. See issue 37159 (Use copy_file_range() in shutil.copyfile()), which may land in Python 3.9 or 3.10.
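A minimal sketch of calling it directly, assuming Linux (kernel 4.5+) and hypothetical filenames; depending on the filesystem, it may do a server-side copy (NFS) or a reflink-style copy-on-write (e.g. XFS):

import os

with open("src.bin", "rb") as fsrc, open("dst.bin", "wb") as fdst:
    remaining = os.fstat(fsrc.fileno()).st_size
    while remaining > 0:
        # With no explicit offsets, copy_file_range uses and advances both
        # descriptors' file positions; it may copy fewer bytes than
        # requested, so loop until done.
        n = os.copy_file_range(fsrc.fileno(), fdst.fileno(), remaining)
        if n == 0:
            break
        remaining -= n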

Related

Read/Write from/to an address in memory with Python3 without using external libraries

I wanted to create a game mod with Python, and I needed to read and write memory addresses.
I found a library called pymeow and it worked, but I had no idea how it actually works, so I checked the source code, but that didn't help. I saw that the library uses Process.read(ByteAddress, value) and Process.write(ByteAddress, value), but I don't know if that's a Python thing or a function from the library.
So my question is: if read() and write() can read from and write to addresses, what exactly would the Process variable be? And if this is just from that library, how can I actually achieve something like this in Python without using external libraries like this one?
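For illustration only: on Windows, a library like that is almost certainly wrapping the kernel32 calls OpenProcess / ReadProcessMemory / WriteProcessMemory, which you can reach from the standard library via ctypes. A minimal read sketch, where pid and address are hypothetical values you would supply yourself:

import ctypes

PROCESS_VM_READ = 0x0010
PROCESS_QUERY_INFORMATION = 0x0400

pid = 1234            # hypothetical target process id
address = 0x0040F0F0  # hypothetical address to read from

kernel32 = ctypes.windll.kernel32
handle = kernel32.OpenProcess(
    PROCESS_VM_READ | PROCESS_QUERY_INFORMATION, False, pid)

buf = (ctypes.c_ubyte * 4)()  # room for 4 bytes
nread = ctypes.c_size_t()
if kernel32.ReadProcessMemory(handle, ctypes.c_void_p(address),
                              buf, len(buf), ctypes.byref(nread)):
    print(bytes(buf))
kernel32.CloseHandle(handle)

Writing works the same way with WriteProcessMemory (and PROCESS_VM_WRITE | PROCESS_VM_OPERATION access rights).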

Named memory-mapped files in Python?

I'm using OpenCV to process some video data in a web service. Before calling OpenCV, the video is already loaded into a bytearray buffer, which I would like to pass to the VideoCapture object:
# The following raises cv2.error because it can't convert '_io.BytesIO' to 'str' for 'filename'
cap = cv2.VideoCapture(buffer)
Unfortunately, VideoCapture() expects a string filename, not a buffer. For now, I'm saving the bytearray to a temporary file and passing its name to VideoCapture().
Questions:
Is there a way to create named in-memory files in Python, so I can pacify OpenCV?
Alternatively, is there another OpenCV API which does support buffers?
Note: POSIX-specific! As you haven't provided an OS tag, I assume that's okay.
According to this answer (and the shm_overview manpage), /dev/shm is always present on the system. It's a tmpfs mapped into a shared memory pool (not the Python process's memory), as suggested here, and the plus is that you don't need to create it, so no funny inventing of:
os.system("mount ...") or
Popen(["mount", ...]) wrappers.
Simply use tempfile.NamedTemporaryFile() like this:
from tempfile import NamedTemporaryFile

with NamedTemporaryFile(dir="/dev/shm") as file:
    print(file.name)
    # /dev/shm/tmp2m86e0e0
which you could then feed into OpenCV's API wrapper. Alternatively, utilize pyfilesystem as a more extensive wrapper around that device/FS.
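For the question's use case, a sketch might look like this; the .mp4 suffix is an assumption so OpenCV's backend can guess the container format, and buffer stands in for the bytes the web service already loaded:

import cv2
from tempfile import NamedTemporaryFile

buffer = open("input.mp4", "rb").read()  # stand-in for the in-memory video

with NamedTemporaryFile(dir="/dev/shm", suffix=".mp4") as f:
    f.write(buffer)
    f.flush()  # make the bytes visible to anyone opening f.name
    cap = cv2.VideoCapture(f.name)
    ok, frame = cap.read()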
Also, multiprocessing.heap.Arena uses it, so if it didn't work, much more would be broken. For Windows, check this implementation, which uses winapi.
For the size of /dev/shm:
this is one of the size "specifications" I found,
shm.h, shm_add_rss_swap(), newseg() from Linux source code may hold more details
Judging by sudo ipcs, it's most likely the mechanism you want for sharing data between processes if you don't use sockets, pipes, or the disk.
As it's POSIX, it should work on POSIX-compliant systems such as Solaris (though not macOS, which lacks /dev/shm), but I have no means to try it.
Partially to answer the question: there is no way I know of in Python to create named file-like objects which point at memory; that's something for an operating system to do. But there is a very easy way to do something very like creating a named memory-mapped file on most modern *nixes: save the file to /tmp. These days /tmp is almost always a ramdisk, though it might be zram (basically a compressed ramdisk), and you'll likely want to check that first. At any rate, it's better than thrashing your disk or depending on OS caching.
Incidentally, making a dedicated ramdisk is as easy as mount -t tmpfs -o size=1G tmpfs /path/to/tmpfs, or similarly with ramfs.
Looking into it, I don't think you're going to have much luck with alternative APIs either: the use of filenames goes right down to cap.cpp, where we have things like:
VideoCapture::VideoCapture(const String& filename, int apiPreference) : throwOnFail(false)
{
    CV_TRACE_FUNCTION();
    open(filename, apiPreference);
}
It seems the python bindings are just a thin layer on top of this. But I'm willing to be proven wrong!
References
https://github.com/opencv/opencv/blob/master/modules/videoio/src/cap.cpp#L72
If VideoCapture was a regular Python object, and it accepted "file-like objects" in addition to paths, you could feed it a "file-like object", and it could read from that.
Python's StringIO and BytesIO are file-like objects in memory. Something useful to remember ;)
OpenCV specifically expects a file system path there, so that's out of the question.
OpenCV is a library for computer vision. It's not a library for handling video files.
You should look into PyAV. It's a (proper!) wrapper for ffmpeg's libraries. You can feed data directly in there and it will decode. Here are some examples and here are its tests that demonstrate further functionality. Its documentation is thin because most usage is (or should have been...) documented by ffmpeg itself.
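A short sketch of that, assuming buffer holds the video bytes from the question; av.open accepts file-like objects directly:

import io
import av  # pip install av

buffer = open("input.mp4", "rb").read()  # stand-in for the in-memory video

container = av.open(io.BytesIO(buffer))
for frame in container.decode(video=0):
    img = frame.to_ndarray(format="bgr24")  # BGR layout, as OpenCV uses
    # ... process img with cv2 as usual ...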
You might be able to get away with a named pipe. You can use os.mkfifo to create one, then use the multiprocessing module to spawn a background process that feeds the video file into it. Note that mkfifo is not supported on Windows.
The most important limitation is that a pipe does not support seeking, so your video won't be seekable or rewindable either. And whether it actually works might depend on the video format and on the backend (gstreamer, v4l2, ...) that OpenCV is using.
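A sketch of that idea, with a hypothetical FIFO path; as said above, whether VideoCapture can actually consume a pipe depends on the backend and the container format:

import os
import multiprocessing

import cv2

FIFO_PATH = "/tmp/video_fifo"  # hypothetical path; mkfifo is POSIX-only

def feed(path):
    # Opening a FIFO for writing blocks until a reader (OpenCV) connects.
    with open(path, "wb") as fifo, open("input.mp4", "rb") as src:
        fifo.write(src.read())

os.mkfifo(FIFO_PATH)
writer = multiprocessing.Process(target=feed, args=(FIFO_PATH,))
writer.start()

cap = cv2.VideoCapture(FIFO_PATH)  # stream is read once; no seeking
ok, frame = cap.read()

writer.join()
os.unlink(FIFO_PATH)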

Among the many Python file copy functions, which ones are safe if the copy is interrupted?

As seen in How do I copy a file in Python?, there are many file copy functions:
shutil.copy
shutil.copy2
shutil.copyfile (and also shutil.copyfileobj)
or even a naive method:
with open('sourcefile', 'rb') as f, open('destfile', 'wb') as g:
    while True:
        block = f.read(16*1024*1024)  # work in blocks of 16 MB
        if not block:  # EOF
            break
        g.write(block)
Among all these methods, which ones are safe in the case of a copy interruption (example: kill the Python process)? The last one in the list looks ok.
By safe I mean: if a 1 GB file copy is not 100% finished (say it's interrupted in the middle, after 400 MB), the file size should not be reported as 1 GB in the filesystem; it should:
either report the size the file had when the last bytes were written (e.g. 400MB)
or be deleted
The worst would be if the final file size were written first (internally with fallocate or ftruncate?). That would be a problem if the copy is interrupted: by looking at the file size, we would think the file was correctly written.
Many incremental backup programs (I'm coding one) use "filename+mtime+fsize" to check whether a file has to be copied or is already there (of course, a better solution is to SHA256 the source and destination files, but this isn't done for every sync: too time-consuming; off-topic here).
So I want to make sure that the "copy file" function does not set the final file size immediately, before copying the actual file content, as that could fool the fsize comparison.
Note: I'm asking the question because, while shutil.copyfile was rather straightforward on Python 3.7 and below, see source (which is more or less the naive method above), it seems much more complicated on Python 3.9, see source, with many different cases for Windows, Linux, macOS, "fastcopy" tricks, etc.
Assuming that destfile does not exist prior to the copy, the naive method is safe, per your definition of safe.
shutil.copyfileobj() and shutil.copyfile() are a close second in the ranking.
shutil.copy() is next, and shutil.copy2() would be last.
Explanation:
It is a filesystem's job to guarantee consistency based on application requests. If you are only writing X bytes to a file, the file size will only account for these X bytes.
Therefore, doing direct FS operations like the naive method will work.
It is now a matter of what these higher-level functions do with the filesystem.
The API doesn't state what happens if python crashes mid-copy, but it is a de-facto expectation from everyone that these functions behave like Unix cp, i.e. don't mess with the file size.
Assuming that the maintainers of CPython don't want to break people's expectations, all these functions should be safe per your definition.
That said, it isn't guaranteed anywhere, AFAICT.
However, the APIs of shutil.copyfileobj() and shutil.copyfile() expressly promise not to copy metadata, so they're not likely to try and set the size.
shutil.copy() wouldn't try to set the file size, only the mode, and in most filesystems setting the size and the mode require two different filesystem operations, so it should still be safe.
shutil.copy2() says it will copy metadata, but if you look at its source code, you'll see that it only copies the metadata after copying the data, so even that should be safe. What's more, copying the metadata doesn't include the size.
So this would only be a problem if some of the internal functions Python uses tried to optimize with ftruncate(), fallocate(), or some such, which is unlikely, given that people who write system APIs (like the Python maintainers) are very aware of users' expectations.
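If you need a hard guarantee rather than a reasonable expectation, a common pattern (a sketch, not one of the stdlib functions above) is to copy into a temporary name and rename only once the data is on disk; an interrupted copy then leaves a stray .part file and never a full-size destfile:

import os
import shutil

def careful_copy(src, dst):
    tmp = dst + ".part"  # hypothetical suffix convention
    with open(src, "rb") as f, open(tmp, "wb") as g:
        shutil.copyfileobj(f, g, length=16 * 1024 * 1024)
        g.flush()
        os.fsync(g.fileno())  # data is really on disk before the rename
    os.replace(tmp, dst)      # atomic: dst is either the old file or complete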

On Windows, how to open for writing a file already opened for writing by another process?

I'm trying to open a logfile which is kept open by another process and remove the first few lines.
On Unix I'd simply do an os.open('/tmp/file.log', os.O_NONBLOCK), and that would get me closer to my goal.
Now I'm stuck with Windows, and I need to rotate this log somehow without stopping the application that holds the file. Is this even possible?
At first I considered opening a file handle at the location where the application expected the log to be and just acting as a pipe into a file handle in Python, but I couldn't find any way of doing that on Windows either.
I also thought of just moving the file on a regular basis and letting the application recreate it, but since the file is held open by another process, that doesn't do much good.
I thought of O_SHLOCK as well, but then again, that's Unix, not Windows.
So I resorted to mmap'ing the file, hoping that would make things a bit more flexible, but that led me nowhere:
import mmap
import contextlib
import time

with open(r'test.log', 'r+') as f:
    with contextlib.closing(mmap.mmap(f.fileno(), 0)) as m:
        while True:
            line = m.readline()
            if len(line) > 0:
                print(line)
            time.sleep(0.5)
This results in the application being unable to access the file because Python is holding it (and vice versa).
I came to think of signal.SIGHUP, but that doesn't exist on Windows either, so back to square one.
I'm stuck and I've tried it all. Can Python help me here, or do I need to switch languages?
Even if the application opens the file as a shared object, Python can't, so by the looks of it they can't get along.
It's not so bad :). You can (have to) open the file using CreateFile, as pointed out by Augusto, and you can do that with the standard ctypes module; the question Using a struct as a function argument with the python ctypes module shows how. Next, associate a C run-time file descriptor with the operating-system file handle you just obtained, using _open_osfhandle from the MS C run-time library (CRT); you can call it through ctypes again, as ctypes.cdll.msvcrt._open_osfhandle. Finally, associate a Python file object with that C run-time file descriptor. In Python 3, simply pass the file descriptor as the first argument to the built-in open function. According to the docs:
file is either a string or bytes object giving the pathname (absolute or relative to the current working directory) of the file to
be opened or an integer file descriptor of the file to be wrapped.
In Python 2 you have to use os.fdopen; its task, according to docs, is to
Return an open file object connected to the file descriptor fd
All of the above should not be required for such a simple thing. There's hope it will become much simpler when CPython's implementation on Windows starts using the native Windows API for files instead of going through the C run-time library, which does not give access to many features of the Windows platform. See the Add new io.FileIO using the native Windows API issue for details.
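Putting those steps together, a minimal ctypes sketch might look like the following; note that the stdlib msvcrt module already exposes open_osfhandle, so you don't even need ctypes.cdll.msvcrt for that step:

import ctypes
from ctypes import wintypes
import msvcrt

GENERIC_READ = 0x80000000
GENERIC_WRITE = 0x40000000
FILE_SHARE_READ = 0x00000001
FILE_SHARE_WRITE = 0x00000002
FILE_SHARE_DELETE = 0x00000004
OPEN_EXISTING = 3

kernel32 = ctypes.windll.kernel32
kernel32.CreateFileW.argtypes = (
    wintypes.LPCWSTR, wintypes.DWORD, wintypes.DWORD, wintypes.LPVOID,
    wintypes.DWORD, wintypes.DWORD, wintypes.HANDLE)
kernel32.CreateFileW.restype = wintypes.HANDLE

# Ask for a share mode that tolerates the other writer.
handle = kernel32.CreateFileW(
    "test.log",
    GENERIC_READ | GENERIC_WRITE,
    FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
    None, OPEN_EXISTING, 0, None)
if handle in (None, ctypes.c_void_p(-1).value):  # NULL or INVALID_HANDLE_VALUE
    raise ctypes.WinError()

fd = msvcrt.open_osfhandle(handle, 0)  # 0 = binary; access comes from the handle
f = open(fd, "r+b")  # Python 3: wrap the descriptor directly

Whether the open succeeds still depends on the share mode the other process requested when it opened the file, as discussed below.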
Do you have any control over the application generating the logfile? Depending on the way that application opened the file, you really can't modify it.
This link may seem off-topic here, but deep down in Windows, what determines whether other applications can access a file is the dwShareMode parameter of the CreateFile function: http://msdn.microsoft.com/en-us/library/windows/desktop/aa363858%28v=vs.85%29.aspx
The application should enable FILE_SHARE_WRITE and possibly FILE_SHARE_DELETE, and it should flush and update the file position every time it writes to the file. Looking at the Python documentation for open(), there is no such detailed parameter.

Uncompressing a tar.Z file with Python?

I need to write a Python script that retrieves tar.Z files from an FTP server and uncompresses them on a Windows machine. tar.Z, if I understood correctly, is the output of the Unix compress command.
Python doesn't seem to know how to handle these; it's not gz, nor bz2, nor zip. Does anyone know a library that can handle them?
Thanks in advance
If gzip (the application) can handle it, you have two choices.
Try the Python gzip library. It may work.
Use subprocess Popen to run gzip for you.
It may be an InstallShield .Z file. You may want to use InstallShield to unpack it and extract the .TAR file. Again, you may be able to use subprocess Popen to process the file.
It may also be an "LZW compressed file". Look at this library; it may help.
http://www.chilkatsoft.com/compression-python.asp
Since you target a specific platform (Windows), the simplest solution may be to run gzip in a system call: http://www.gzip.org/#exe
Or are there other requirements in your project that mean the decompression needs to be done in Python?
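To illustrate the system-call route (assuming a gzip executable on PATH and a hypothetical archive name): gzip -d understands the old compress(1) .Z format, after which the stdlib tarfile module can unpack the remaining .tar:

import subprocess
import tarfile

subprocess.run(["gzip", "-d", "archive.tar.Z"], check=True)  # leaves archive.tar
with tarfile.open("archive.tar") as tar:
    tar.extractall("output_dir")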
A pure-Python module that uncompresses .Z files doesn't exist, AFAIK, but it's feasible to build one, given some knowledge of:
the .Z format header specification
the .Z compression format
Almost all the necessary information can be found in The Unarchiver's CompressAlgorithm page. Additional info is in the Wikipedia article on adaptive LZW, and perhaps the compress man page.
Basically, you read the first three bytes (the first two are magic bytes) to configure your algorithm, then start reading and decompressing.
There's a lot of bit fiddling (.Z files start with 9-bit codes, grow up to 16-bit codes, and then reset the symbol table to the initial 256+2 values), which you'll probably handle with bitwise operations (&, <<=, etc.).
