Python read/write vs shutil copy

I need to save files uploaded to my server (Max file size is 10MB) and found this answer, which works perfectly. However, I'm wondering what is the point of using the shutil module, and what is the difference between this:
file_location = f"files/{uploaded_file.filename}"
with open(file_location, "wb+") as file_object:
    file_object.write(uploaded_file.file.read())
and this:
import shutil
file_location = f"files/{uploaded_file.filename}"
with open(file_location, "wb+") as file_object:
    shutil.copyfileobj(uploaded_file.file, file_object)
During my programming experience, I have come across the shutil module multiple times, but I still can't figure out what its benefits are over the read() and write() methods.

Your method requires that the whole file be held in memory. shutil copies in chunks, so you can copy files larger than memory. Also, shutil has routines to copy files by name, so you don't have to open them at all, and it can preserve the permissions, ownership, and creation/modification/access timestamps.
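For illustration, a minimal sketch of the copy-by-name routines mentioned above (the paths are placeholders, not from the question); copy() copies the data and the permission bits, while copy2() additionally tries to preserve timestamps and other metadata:
import shutil

src = "files/report.pdf"   # placeholder source path
dst = "backup/report.pdf"  # placeholder destination path

shutil.copy(src, dst)   # data + permission bits
shutil.copy2(src, dst)  # data + permission bits + timestamps and other metadata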

I would like to highlight a few points with regard to OP's question and the (currently accepted) answer by @Tim Roberts:
"shutil copies in chunks so you can copy files larger than memory". You can also copy a file in chunks using read()—please
have a look at the short example below, as well as this and this answer for more
details—just like you can load the whole file into memory
using shutil.copyfileobj(), by giving a negative length value.
with open(uploaded_file.filename, 'wb') as f:
    while contents := uploaded_file.file.read(1024 * 1024):  # adjust the chunk size as desired
        f.write(contents)
Under the hood, copyfileobj() uses a very similar approach to the above, utilising the read() and write() methods of file objects; hence, it would make little difference if you used one over the other. The source code of copyfileobj() can be seen below. The default buffer size, i.e., COPY_BUFSIZE below, is set to 1MB (1024 * 1024 bytes) if it is running on Windows, or 64KB (64 * 1024 bytes) on other platforms (see here).
def copyfileobj(fsrc, fdst, length=0):
    """copy data from file-like object fsrc to file-like object fdst"""
    if not length:
        length = COPY_BUFSIZE
    # Localize variable access to minimize overhead.
    fsrc_read = fsrc.read
    fdst_write = fdst.write
    while True:
        buf = fsrc_read(length)
        if not buf:
            break
        fdst_write(buf)
"shutil has routines to copy files by name so you don't have to open them at all..." Since OP seems to be using FastAPI
framework (which is actually Starlette underneath), UploadFile exposes an actual Python SpooledTemporaryFile (a file-like object) that you can get using the .file attribute (source code can be found here). When FastAPI/Starlette creates a new instance of UploadFile, it already creates the SpooledTemporaryFile behind the scenes, and that file remains open. Hence, since you are dealing with a temporary file that has no visible name in the file system—a name that would otherwise allow you to copy the contents without opening the file, using shutil—and which is already open, it makes no difference whether you use read() or copyfileobj().
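If it helps to see the spooling behaviour in isolation, below is a small sketch that is independent of FastAPI (the 1 KB threshold is arbitrary, and _rolled is a CPython implementation detail used here only for demonstration): a SpooledTemporaryFile stays in memory until it grows past max_size, at which point it is rolled over to a real on-disk temporary file.
from tempfile import SpooledTemporaryFile

with SpooledTemporaryFile(max_size=1024) as tmp:
    tmp.write(b"x" * 512)
    print(tmp._rolled)  # False: still an in-memory buffer
    tmp.write(b"x" * 1024)
    print(tmp._rolled)  # True: data exceeded max_size and was spilled to disk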
"it can preserve the permissions, ownership, and creation/modification/access timestamps." Even though this is about saving a file uploaded through a web framework—and hence, most of these metadata wouldn't be transfered along with the file—as per the documentation, the above statement is not entirely true:
Warning: Even the higher-level file copying functions (shutil.copy(), shutil.copy2()) cannot copy all file metadata.
On POSIX platforms, this means that file owner and group are lost as well as ACLs. On Mac OS, the resource fork and other metadata are not used. This means that resources will be lost and file type and creator codes will not be correct. On Windows, file owners, ACLs and alternate data streams are not copied.
That being said, there is nothing wrong with using copyfileobj(). On the contrary, if you are dealing with large files and would like to avoid loading the entire file into memory—as you may not have enough RAM to accommodate all the data—and you would rather use copyfileobj() instead of a similar solution using the read() method (as described in point 1 above), it is perfectly fine to use shutil.copyfileobj(fsrc, fdst). Besides, since Python 3.8, shutil's higher-level copy functions use more efficient platform-specific "fast-copy" operations where available, falling back to copyfileobj() otherwise. You can change the default buffer size by adjusting the length argument of copyfileobj().
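For instance, a minimal sketch (the file names are placeholders) that copies in 4 MB chunks instead of the default buffer size:
import shutil

with open("source.bin", "rb") as fsrc, open("dest.bin", "wb") as fdst:
    shutil.copyfileobj(fsrc, fdst, length=4 * 1024 * 1024)  # copy in 4 MB chunks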
Note
If copyfileobj() is used inside a FastAPI def (sync) endpoint, it is perfectly fine, as a normal def endpoint in FastAPI is run in an external threadpool that is then awaited, instead of being called directly (as it would block the server). On the other hand, async def endpoints run on the main (single) thread, and thus, calling such a method, i.e., copyfileobj(), that performs blocking I/O operations (as shown in the source code) would result in blocking the entire server (for more information on def vs async def, please have a look at this answer). Hence, if you are about to call copyfileobj() from within an async def endpoint, you should make sure to run this operation—as well as all other file operations, such as open() and close()—in a separate thread to ensure that the main thread (where coroutines are run) does not get blocked. You can do that using Starlette's run_in_threadpool(), which is also used by FastAPI internally when you call the async methods of the UploadFile object, as shown here. For instance:
await run_in_threadpool(shutil.copyfileobj, fsrc, fdst)
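Putting the above together, a minimal sketch of such an async def endpoint might look like the following (the endpoint path, the files/ destination directory and the _save() helper are assumptions made for illustration, not something prescribed by FastAPI):
import shutil
from fastapi import FastAPI, File, UploadFile
from starlette.concurrency import run_in_threadpool

app = FastAPI()

@app.post("/upload")
async def upload(uploaded_file: UploadFile = File(...)):
    file_location = f"files/{uploaded_file.filename}"

    def _save() -> None:
        # All blocking file I/O happens inside this helper, which is run
        # in the external threadpool so that the event loop is not blocked.
        with open(file_location, "wb") as file_object:
            shutil.copyfileobj(uploaded_file.file, file_object)

    await run_in_threadpool(_save)
    return {"filename": uploaded_file.filename}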
For more details and code examples, please have a look at this answer.

Related

Wrap URL as filesystem path

I am trying to call a python function that takes an absolute path as an argument, but the file I want to reference is on the web.
Without cloning the file locally, is there a way I can refer to the file that will make python think the file is local?
In other words, I want to wrap the URL in a variable my_file_path, and have this return True:
os.path.isfile(my_file_path)
Note that I need to fake a file system path, as other calls in the program I am using are expecting a path, and not a file-like object (this includes other functions that call the function I linked)
A really great way to do this is with the requests library. You can get a file-like object using the stream=True option to the get function:
import requests

r = requests.get('https://api.github.com/events', stream=True)
loadmat(r.raw, ...)
In the case of needing an actual path, you can use the tempfile module as well:
import tempfile

with tempfile.NamedTemporaryFile() as fd:
    r = requests.get('https://api.github.com/events', stream=True)
    for chunk in r.iter_content(chunk_size=1024 * 1024):  # pick a chunk size that suits you
        fd.write(chunk)
    fd.flush()
    loadmat(fd.name)
# other code here, where the temp file no longer exists but the data has been read
There is no way to make Python take a URL where it wants a path.
In many cases—like the very function you linked in your question—it actually wants a file-like object, and the object returned by, e.g., urlopen is file-like. But in other cases, that doesn't work.
So, what can you do?
Below the Python level, your operating system may have a way to mount different kinds of remote paths as if they were part of your local filesystem.
At a higher level, write your own wrapper that just downloads the file to a temporary file. That temporary file will, of course, pass the os.path.isfile(my_file_path) test that you wanted, and will work with everything else that needs a file. But it means that you need to keep the two "layers" of your code—the part that wants to deal with URLs, and the part that needs to deal with functions that can only take local files—separate, and write the interface between those layers. On at least some platforms, you can create a temporary file that never gets flushed to disk unless necessary. (You can even create a temporary file that doesn't appear anywhere in the directory tree, but that wouldn't help here, because then you obviously can't pass a pathname around…) So you're not "cloning the file" in any sense that actually matters.
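For example, a minimal sketch of such a wrapper (the name url_as_local_path is hypothetical, not an existing library function), which downloads the URL to a temporary file and hands you a real path for the duration of a with block:
import os
import shutil
import tempfile
import urllib.request
from contextlib import contextmanager

@contextmanager
def url_as_local_path(url):
    # Download the URL into a named temporary file and yield its path;
    # the file is removed again when the with block exits.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        with urllib.request.urlopen(url) as response:
            shutil.copyfileobj(response, tmp)
        tmp_path = tmp.name
    try:
        yield tmp_path
    finally:
        os.remove(tmp_path)

with url_as_local_path("https://api.github.com/events") as my_file_path:
    assert os.path.isfile(my_file_path)  # True, as the question requires
    # pass my_file_path to anything that insists on a real filesystem path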

How to create a temporary file that can be read by a subprocess?

I'm writing a Python script that needs to write some data to a temporary file, then create a subprocess running a C++ program that will read the temporary file. I'm trying to use NamedTemporaryFile for this, but according to the docs,
Whether the name can be used to open the file a second time, while the named temporary file is still open, varies across platforms (it can be so used on Unix; it cannot on Windows NT or later).
And indeed, on Windows if I flush the temporary file after writing, but don't close it until I want it to go away, the subprocess isn't able to open it for reading.
I'm working around this by creating the file with delete=False, closing it before spawning the subprocess, and then manually deleting it once I'm done:
fileTemp = tempfile.NamedTemporaryFile(delete=False)
try:
    fileTemp.write(someStuff)
    fileTemp.close()
    # ...run the subprocess and wait for it to complete...
finally:
    os.remove(fileTemp.name)
This seems inelegant. Is there a better way to do this? Perhaps a way to open up the permissions on the temporary file so the subprocess can get at it?
Since nobody else appears to be interested in leaving this information out in the open...
tempfile does expose a function, mkdtemp(), which can trivialize this problem:
try:
    temp_dir = mkdtemp()
    temp_file = make_a_file_in_a_dir(temp_dir)
    do_your_subprocess_stuff(temp_file)
    remove_your_temp_file(temp_file)
finally:
    os.rmdir(temp_dir)
I leave the implementation of the intermediate functions up to the reader, as one might wish to do things like use mkstemp() to tighten up the security of the temporary file itself, or overwrite the file in-place before removing it. I don't particularly know what security restrictions one might have that are not easily planned for by perusing the source of tempfile.
Anyway, yes, using NamedTemporaryFile on Windows might be inelegant, and my solution here might also be inelegant, but you've already decided that Windows support is more important than elegant code, so you might as well go ahead and do something readable.
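For completeness, here is one possible (purely illustrative) way to fill in those intermediate functions; the child command is a placeholder for the C++ program from the question:
import os
import subprocess
import tempfile

def make_a_file_in_a_dir(temp_dir, data=b"some stuff"):
    # mkstemp() creates a unique file inside the (already private) temp dir
    fd, path = tempfile.mkstemp(dir=temp_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path  # the file is closed, so a child process can open it on Windows too

def do_your_subprocess_stuff(temp_file):
    # Placeholder command: replace with the actual C++ program that reads the file
    subprocess.run(["my_cpp_program", temp_file], check=True)

def remove_your_temp_file(temp_file):
    os.remove(temp_file)

temp_dir = tempfile.mkdtemp()
try:
    temp_file = make_a_file_in_a_dir(temp_dir)
    do_your_subprocess_stuff(temp_file)
    remove_your_temp_file(temp_file)
finally:
    os.rmdir(temp_dir)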
According to Richard Oudkerk
(...) the only reason that trying to reopen a NamedTemporaryFile fails on
Windows is because when we reopen we need to use O_TEMPORARY.
and he gives an example of how to do this in Python 3.3+:
import os, tempfile

DATA = b"hello bob"

def temp_opener(name, flag, mode=0o777):
    return os.open(name, flag | os.O_TEMPORARY, mode)

with tempfile.NamedTemporaryFile() as f:
    f.write(DATA)
    f.flush()
    with open(f.name, "rb", opener=temp_opener) as f:
        assert f.read() == DATA

assert not os.path.exists(f.name)
Because there's no opener parameter in the built-in open() in Python 2.x, we have to combine lower level os.open() and os.fdopen() functions to achieve the same effect:
import subprocess
import tempfile

DATA = b"hello bob"

with tempfile.NamedTemporaryFile() as f:
    f.write(DATA)
    f.flush()
    subprocess_code = \
"""import os
f = os.fdopen(os.open(r'{FILENAME}', os.O_RDWR | os.O_BINARY | os.O_TEMPORARY), 'rb')
assert f.read() == b'{DATA}'
""".replace('\n', ';').format(FILENAME=f.name, DATA=DATA)
    subprocess.check_output(['python', '-c', subprocess_code]) == DATA
You can always go low-level, though I'm not sure if it's clean enough for you:
fd, filename = tempfile.mkstemp()
try:
    os.write(fd, someStuff)
    os.close(fd)
    # ...run the subprocess and wait for it to complete...
finally:
    os.remove(filename)
At least if you open a temporary file using the existing Python libraries, accessing it from multiple processes is not possible on Windows. According to MSDN, you can pass the shared-mode flag FILE_SHARE_READ as the third parameter (dwShareMode) of the CreateFile() function, which:
Enables subsequent open operations on a file or device to request read access. Otherwise, other processes cannot open the file or device if they request read access. If this flag is not specified, but the file or device has been opened for read access, the function fails.
So, you can write a Windows specific C routine to create a custom temporary file opener function, call it from Python and then you can make your sub-process access the file without any error. But I think you should stick with your existing approach as it is the most portable version and will work on any system and thus is the most elegant implementation.
Discussion on Linux and Windows file locking can be found here.
EDIT: Turns out it is possible to open & read the temporary file from multiple processes in Windows too. See Piotr Dobrogost's answer.
Using mkstemp() instead, together with os.fdopen() in a with statement, avoids having to call close():
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as fileTemp:
        fileTemp.write(someStuff)
    # ...run the subprocess and wait for it to complete...
finally:
    os.remove(path)
I know this is a really old post, but I think it's relevant today given that the API is changing and functions like mktemp and mkstemp are being replaced by functions like TemporaryFile() and TemporaryDirectory(). I just wanted to demonstrate in the following sample how to make sure that a temp directory is still available downstream:
Instead of coding:
tmpdirname = tempfile.TemporaryDirectory()
and using tmpdirname throughout your code, you should try to put your code in a with statement block to ensure that it is available for your code calls... like this:
with tempfile.TemporaryDirectory() as tmpdirname:
    ...  # do dependent code nested here, so it's part of the with statement
If you reference it outside of the with block, the temporary directory will already have been cleaned up, so it won't be visible anymore.
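For instance, a minimal sketch (the file name and the cat command are placeholders for whatever your code and subprocess actually need):
import subprocess
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdirname:
    # Everything that depends on the directory stays inside the with block
    data_file = Path(tmpdirname) / "input.txt"
    data_file.write_text("some stuff\n")
    subprocess.run(["cat", str(data_file)], check=True)
# Here the directory and everything in it have already been removed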

Python: Opening a file without creating a lock

I'm trying to create a script in Python to back up some files. But, these files could be renamed or deleted at any time. I don't want my script to prevent that by locking the file; the file should be able to still be deleted at any time during the backup.
How can I do this in Python? And, what happens? Do my objects just become null if the stream cannot be read?
Thank you! I'm somewhat new to Python.
As mentioned by @kindall, this is a Windows-specific issue. Unix OSes allow a file to be deleted while it is still open.
To do this in Windows, I needed to use win32file.CreateFile() so that I could pass the Windows-specific dwShareMode flags (in Python's pywin32, the parameter is just called shareMode).
Rough Example:
import msvcrt
import os
import win32file

py_handle = win32file.CreateFile(
    'filename.txt',
    win32file.GENERIC_READ,
    win32file.FILE_SHARE_DELETE
    | win32file.FILE_SHARE_READ
    | win32file.FILE_SHARE_WRITE,
    None,
    win32file.OPEN_EXISTING,
    win32file.FILE_ATTRIBUTE_NORMAL,
    None
)

try:
    with os.fdopen(
        msvcrt.open_osfhandle(py_handle.handle, os.O_RDONLY)
    ) as file_descriptor:
        ...  # read from `file_descriptor`
finally:
    py_handle.Close()
Note: if you need to keep the win32-file open beyond the lifetime of the file-handle object returned, you should invoke PyHandle.detach() on that handle.
On UNIX-like OSs, including Linux, this isn't an issue. Well, some other program could write to the file at the same time you're reading it, which could cause problems (the file you are copying could end up corrupted) but this is solvable with a verification pass.
On Windows, use Volume Snapshot Service (aka Volume Shadow Copy). VSS creates a snapshot of the volume at a moment in time, and you can open files on the snapshot without locking the files on the original volume. A quick Google found a Python module for doing copies using VSS here: http://sourceforge.net/projects/pyvss/
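As for the verification pass mentioned above, a minimal sketch might look like this (the paths and the sha256_of() helper are placeholders); if the checksums disagree, the source changed while it was being backed up, and the copy should be retried:
import hashlib
import shutil

def sha256_of(path, chunk_size=1024 * 1024):
    # Hash the file in chunks so that large files don't need to fit in memory
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

shutil.copy2("source.dat", "backup.dat")
if sha256_of("source.dat") != sha256_of("backup.dat"):
    raise RuntimeError("source changed during the backup; retry the copy")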

Redirecting audio output from one function to another function in python

Suppose I have two functions drawn from two different APIs, function A and B.
By default, function A outputs audio data to a wav file.
By default, function B takes audio input from a wav file and process it.
Is it possible to stream the data from function A to B? If so, how do I do this? I work on lubuntu if that is relevant.
This is function A I'm thinking about from the PJSUA python API:
create_recorder(self, filename)
    Create WAV file recorder.
    Keyword arguments:
        filename -- WAV file name
    Return:
        WAV recorder ID
And this is function B from the Pocketsphinx Python API
decode_raw(...)
    Decode raw audio from a file.
    Parameters:
        fh (file) - Filehandle to read audio from.
        uttid (str) - Identifier to give to this utterance.
        maxsamps (int) - Maximum number of samples to read. If not specified or -1, the rest of the file will be read.
Update:
When I try to pass the filename of a socket or named pipe, it outputs the error message below; it seems that the C function that the Python bindings use doesn't like anything but .wav files... Why would that be?
pjsua_aud.c .pjsua_recorder_create() error: unable to determine file format for /tmp/t_fifo. Exception: Object: LIb, operation=create(recorder), error=Option/operation is not supported (PJ_ENOTSUP)
I need to use the value returned by create_recorder(); it is an int that is used to get the WAV recorder id (which is not passed directly to decode_raw(), but rather to some other function).
The answer is highly platform dependent and more details are required. Different operating systems have different ways of handling interprocess communication, or IPC.
If you're using a UNIX-like environment, there is a rich set of IPC primitives to work with: pipes, System V message queues, shared memory, sockets, etc. In your case I think it would make sense to use a pipe or a socket, depending on whether A and B are running in the same process or not.
Update:
In your case, I would use Python's subprocess and/or os modules and a pipe. The idea here is to create calling contexts to the two APIs in processes that share a parent process, which has also created a unidirectional named pipe and passed it to its children. Then, data written to the named pipe in create_recorder will immediately be available for read()ing from the pipe.
You could use a named pipe (os.mkfifo()) and move the functions to different threads/processes, e.g.:
import os
from multiprocessing import Process

os.mkfifo(filename)
try:
    Process(target=obj.create_recorder, args=[filename]).start()
    decode_raw(filename, ...)
finally:
    os.remove(filename)

does close() imply flush() in Python?

In Python, and in general - does a close() operation on a file object imply a flush() operation?
Yes. It uses the underlying close() function which does that for you (source).
NB: close() and flush() won't ensure that the data is actually safe on the disk. They just ensure that the OS has the data, i.e., that it isn't buffered inside the process.
You can try sync or fsync to get the data written to the disk.
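For example, a minimal sketch (the file name is a placeholder) showing the difference between flushing to the OS and forcing the data onto the disk:
import os

with open("important.log", "wb") as f:
    f.write(b"some data\n")
    f.flush()             # process buffers -> OS
    os.fsync(f.fileno())  # OS buffers -> disk, for this file only
# close() happens when the with block exits; os.sync() would flush every file system-wide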
Yes, in Python 3 this is finally in the official documentation, but it was already the case in Python 2 (see Martin's answer).
As a complement to this question: yes, Python flushes before close; however, if you want to ensure data is written properly to disk, this is not enough.
This is how I would write a file in a way that it's atomically updated on a UNIX/Linux server, whether the target file exists or not. Note that some filesystems will implicitly commit data to disk on close+rename (ext3 with data=ordered (default); ext4 initially uncovered many application flaws before adding detection of write-close-rename patterns and syncing data before metadata on those [1]).
import json
import os
import tempfile

# destfile is assumed to hold the final path of the file being (re)written
# Write destfile, using a temporary name .<name>_XXXXXXXX
base, name = os.path.split(destfile)
tmpname = os.path.join(base, '.{}_'.format(name))  # This is the tmpfile prefix
with tempfile.NamedTemporaryFile('w', prefix=tmpname, delete=False) as fd:
    # Replace prefix with actual file path/name
    tmpname = str(fd.name)
    try:
        # Write fd here... ex:
        json.dump({}, fd)
        # We want to fdatasync before closing, so we need to flush before close anyway
        fd.flush()
        os.fdatasync(fd)
        # Since we're using tmpfile, we need to also set the proper permissions
        if os.path.exists(destfile):
            # Copy destination file's mode
            os.fchmod(fd.fileno(), os.stat(destfile).st_mode)
        else:
            # Set mode based on current umask value
            umask = os.umask(0o22)
            os.umask(umask)
            os.fchmod(fd.fileno(), 0o666 & ~umask)  # 0o777 for dirs and executable files
        # Now we can close and rename the file (overwriting any existing one)
        fd.close()
        os.rename(tmpname, destfile)
    except:
        # On error, try to clean up the temporary file
        try:
            os.unlink(tmpname)
        except OSError:
            pass
        raise
IMHO it would have been nice if Python provided simple methods around this... At the same time I guess if you care about data consistency it's probably best to really understand what is going on at a low level, especially since there are many differences across various Operating Systems and Filesystems.
Also note that this does not guarantee the written data can be recovered, only that you will get a consistent copy of the data (old or new). To ensure the new data is safely written and accessible when returning, you need to use os.fsync(...) after the rename, and even then, if you have unsafe caches in the write path, you could still lose data. This is common on consumer-grade hardware, although any system can be configured for unsafe writes, which boosts performance too. At least, even with unsafe caches, the method above should still guarantee that whichever copy of the data you get is valid.
filehandle.close does not necessarily flush. Surprisingly, filehandle.flush doesn't help either; the data can still get stuck in the OS buffers while Python is running. Observe this session, where I wrote to a file, closed it, and pressed Ctrl-Z to get back to the shell command prompt, then examined the file:
$ cat xyz
ghi
$ fg
python
>>> x=open("xyz","a")
>>> x.write("morestuff\n")
>>> x.write("morestuff\n")
>>> x.write("morestuff\n")
>>> x.flush
<built-in method flush of file object at 0x7f58e0044660>
>>> x.close
<built-in method close of file object at 0x7f58e0044660>
>>>
[1]+ Stopped python
$ cat xyz
ghi
Subsequently I can reopen the file, and that necessarily syncs the file (because, in this case, I open it in append mode). As the others have said, the sync syscall (available from the os module) should flush all buffers to disk, but it has possible system-wide performance implications (it syncs all files on the system).
