Wrap URL as filesystem path - python

I am trying to call a python function that takes an absolute path as an argument, but the file I want to reference is on the web.
Without cloning the file locally, is there a way I can refer to the file that will make python think the file is local?
In other words, I want to wrap the URL in a variable my_file_path, and have this return True:
os.path.isfile(my_file_path)
Note that I need to fake a file system path, as other calls in the program I am using are expecting a path, and not a file-like object (this includes other functions that call the function I linked)

A really great way to do this is with the requests library. You can get a file-like object using the stream=True option to the get function:
r = requests.get('https://api.github.com/events', stream=True)
loadmat(r.raw, ...)
In the case of needing an actual path, you can use the tempfile module as well:
import tempfile
import requests

with tempfile.NamedTemporaryFile() as fd:
    r = requests.get('https://api.github.com/events', stream=True)
    for chunk in r.iter_content(chunk_size=8192):  # adjust the chunk size as desired
        fd.write(chunk)
    fd.flush()
    loadmat(fd.name)
# other code here, where the temp file no longer exists but the data has been read

There is no way to make Python take a URL where it wants a path.
In many cases—like the very function you linked in your question—it actually wants a file-like object, and the object returned by, e.g., urlopen is file-like. But in other cases, that doesn't work.
So, what can you do?
Below the Python level, your operating system may have a way to mount different kinds of remote paths as if they were part of your local filesystem.
At a higher level, write your own wrapper that just downloads the file to a temporary file. That temporary file will, of course, pass the os.path.isfile(my_file_path) test that you wanted, and will work with everything else that needs a file. But it means that you need to keep the two "layers" of your code—the part that wants to deal with URLs, and the part that needs to deal with functions that can only take local files—separate, and write the interface between those layers.
On at least some platforms, you can create a temporary file that never gets flushed to disk unless necessary. (You can even create a temporary file that doesn't appear anywhere in the directory tree, but that wouldn't help here, because then you obviously can't pass a pathname around…) So you're not "cloning the file" in any sense that actually matters.
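For instance, here is a minimal sketch of such a wrapper, assuming urllib and a temporary file are acceptable; the name url_as_local_path is illustrative, not a library API:
import contextlib
import os
import tempfile
import urllib.request

@contextlib.contextmanager
def url_as_local_path(url):
    """Download url to a temporary file and yield a real filesystem path."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as out, urllib.request.urlopen(url) as resp:
            while True:
                chunk = resp.read(64 * 1024)
                if not chunk:
                    break
                out.write(chunk)
        yield path  # os.path.isfile(path) is True inside the with-block
    finally:
        os.unlink(path)

# Usage (the URL and loadmat call are illustrative):
# with url_as_local_path('https://example.com/data.mat') as my_file_path:
#     loadmat(my_file_path)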

Python read/write vs shutil copy

I need to save files uploaded to my server (Max file size is 10MB) and found this answer, which works perfectly. However, I'm wondering what is the point of using the shutil module, and what is the difference between this:
file_location = f"files/{uploaded_file.filename}"
with open(file_location, "wb+") as file_object:
    file_object.write(uploaded_file.file.read())
and this:
import shutil
file_location = f"files/{uploaded_file.filename}"
with open(file_location, "wb+") as file_object:
    shutil.copyfileobj(uploaded_file.file, file_object)
During my programming experience, I have come across the shutil module multiple times, but I still can't figure out what its benefits are over the read() and write() methods.
Your method requires that the whole file be in memory. shutil copies in chunks, so you can copy files larger than memory. Also, shutil has routines to copy files by name so you don't have to open them at all, and it can preserve the permissions, ownership, and creation/modification/access timestamps.
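For instance, a small illustration of copying by name (the paths are made up):
import shutil

# No need to open either file yourself; copy2 also tries to preserve
# permissions and timestamps along with the contents.
shutil.copy2('reports/input.csv', 'backup/input.csv')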
I would like to highlight a few points with regard to the OP's question and the (currently accepted) answer by @Tim Roberts:
"shutil copies in chunks so you can copy files larger than memory". You can also copy a file in chunks using read()—please
have a look at the short example below, as well as this and this answer for more
details—just like you can load the whole file into memory
using shutil.copyfileobj(), by giving a negative length value.
with open(uploaded_file.filename, 'wb') as f:
    while contents := uploaded_file.file.read(1024 * 1024):  # adjust the chunk size as desired
        f.write(contents)
Under the hood, copyfileobj() uses a very similar approach to the above, utilising the read() and write() methods of file objects; hence, it would make little difference if you used one over the other. The source code of copyfileobj() can be seen below. The default buffer size, i.e., COPY_BUFSIZE below, is set to 1MB (1024 * 1024 bytes) if it is running on Windows, or 64KB (64 * 1024 bytes) on other platforms (see here).
def copyfileobj(fsrc, fdst, length=0):
    """copy data from file-like object fsrc to file-like object fdst"""
    if not length:
        length = COPY_BUFSIZE
    # Localize variable access to minimize overhead.
    fsrc_read = fsrc.read
    fdst_write = fdst.write
    while True:
        buf = fsrc_read(length)
        if not buf:
            break
        fdst_write(buf)
"shutil has routines to copy files by name so you don't have to open them at all..." Since OP seems to be using FastAPI
framework (which is actually
Starlette underneath), UploadFile exposes an actual Python
SpooledTemporaryFile (a file-like object) that you can get using the .file
attribute (source code can be found here). When FastAPI/Starlette creates a new instance of UploadFile, it already creates the SpooledTemporaryFile behind the scenes, which remains open. Hence, since you are dealing with a temporary
file that has no visible name in the file system—that would otherwise allow you to copy the contents without opening the file using shutil—and which is already open, it would make no
difference using either read() or copyfileobj().
"it can preserve the permissions, ownership, and creation/modification/access timestamps." Even though this is about saving a file uploaded through a web framework—and hence, most of these metadata wouldn't be transfered along with the file—as per the documentation, the above statement is not entirely true:
Warning: Even the higher-level file copying functions (shutil.copy(), shutil.copy2()) cannot copy all file metadata.
On POSIX platforms, this means that file owner and group are lost as well as ACLs. On Mac OS, the resource fork and other metadata are not used. This means that resources will be lost and file type and creator codes will not be correct. On Windows, file owners, ACLs and alternate data streams are not copied.
That being said, there is nothing wrong with using copyfileobj(). On the contrary, if you are dealing with large files and would like to avoid loading the entire file into memory—as you may not have enough RAM to accommodate all the data—copyfileobj() is a perfectly fine alternative to a similar solution using the read() method (as described in point 1 above). Besides, copyfileobj() has been offered (since Python 3.8) as an alternative, platform-dependent, efficient copy operation. You can change the default buffer size by adjusting the length argument of copyfileobj().
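For example, to copy with a larger buffer (the file names and the 4 MB value are just an illustration):
import shutil

with open('source.bin', 'rb') as fsrc, open('dest.bin', 'wb') as fdst:
    shutil.copyfileobj(fsrc, fdst, length=4 * 1024 * 1024)  # 4 MB chunks instead of COPY_BUFSIZE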
Note
If copyfileobj() is used inside a FastAPI def (sync) endpoint, it is perfectly fine, as a normal def endpoint in FastAPI is run in an external threadpool that is then awaited, instead of being called directly (as that would block the server). On the other hand, async def endpoints run on the main (single) thread, and thus calling a method such as copyfileobj(), which performs blocking I/O operations (as shown in the source code), would result in blocking the entire server (for more information on def vs async def, please have a look at this answer).
Hence, if you are about to call copyfileobj() from within an async def endpoint, you should make sure to run this operation—as well as all other file operations, such as open() and close()—in a separate thread to ensure that the main thread (where coroutines are run) does not get blocked. You can do that using Starlette's run_in_threadpool(), which is also used by FastAPI internally when you call the async methods of the UploadFile object, as shown here. For instance:
await run_in_threadpool(shutil.copyfileobj, fsrc, fdst)
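A minimal sketch of such an endpoint, assuming a files/ directory exists and using an illustrative /upload route (error handling omitted):
import shutil
from fastapi import FastAPI, File, UploadFile
from starlette.concurrency import run_in_threadpool

app = FastAPI()

@app.post('/upload')
async def upload(file: UploadFile = File(...)):
    # Keep every blocking file operation (open, copy, close) off the event loop thread
    fdst = await run_in_threadpool(open, f'files/{file.filename}', 'wb')
    try:
        await run_in_threadpool(shutil.copyfileobj, file.file, fdst)
    finally:
        await run_in_threadpool(fdst.close)
    return {'filename': file.filename}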
For more details and code examples, please have a look at this answer.

How can I check if a loaded Python function changed?

As a data scientist / machine learning developer, I have most of the time (always?) a load_data function. Executing this function often takes more than 5 minutes, because the executed operations are expensive. When I store the end result of load_data in a pickle file and read that file again, then the time often goes down to a few seconds.
So a solution I use quite often is:
def load_data(serialize_pickle_path, original_filepath):
    invalid_hash = True
    if os.path.exists(serialize_pickle_path):
        content = mpu.io.read(serialize_pickle_path)
        data = content['data']
        invalid_hash = mpu.io.hash(original_filepath) != content['hash']
    if invalid_hash:
        data = load_data_initial()
        filehash = mpu.io.hash(original_filepath)
        mpu.io.write(serialize_pickle_path, {'data': data, 'hash': filehash})
    return data
This solution has a major drawback: if load_data_initial changed, the file will not be reloaded.
Is there a way to check for changes in Python functions?
Assuming you're asking whether there's a way to tell whether someone changed the source code of the function between the last time you quit the program and the next time you start it…
There's no way to do this directly, but it's not that hard to do manually, if you don't mind getting a little hacky.
Since you've imported the module and have access to the function, you can use the getsource function to get its source code. So, all you need to do is save that source. For example:
import inspect

def source_match(source_path, object):
    try:
        with open(source_path) as f:
            source = f.read()
        if source == inspect.getsource(object):
            return True
    except Exception as e:
        # Maybe log e or something here, but any of the obvious problems,
        # like the file not existing or the function not being inspectable,
        # mean we have to re-generate the data
        pass
    return False
def load_data(serialize_pickle_path, original_filepath):
    invalid_hash = True
    if os.path.exists(serialize_pickle_path):
        if source_match(serialize_pickle_path + '.sourcepy', load_data_initial):
            content = mpu.io.read(serialize_pickle_path)
            data = content['data']
            invalid_hash = mpu.io.hash(original_filepath) != content['hash']
    # etc., but make sure to save the source when you save the pickle too
Of course even if the body of the function hasn't changed, its effect might change because of, e.g., a change in some module constant, or the implementation of some other function it uses. Depending on how much this matters, you could pull in the entire module it's defined in, or that module plus every other module that it recursively depends on, etc.
And of course you can also save hashes of text instead of the full text, to make things a little smaller. Or embed them in the pickle file instead of saving them alongside.
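For instance, a minimal sketch of the hash variant (the source_hash name and the extra dictionary key are illustrative):
import hashlib
import inspect

def source_hash(func):
    """Hash the function's current source so it can be stored next to the cached data."""
    return hashlib.sha256(inspect.getsource(func).encode('utf-8')).hexdigest()

# When writing the cache, store it alongside the data:
#   mpu.io.write(path, {'data': data, 'hash': filehash, 'source_hash': source_hash(load_data_initial)})
# When reading it back, regenerate if content.get('source_hash') != source_hash(load_data_initial).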
Also, if the source isn't available because it comes from a module you only distribute in .pyc format, you obviously can't check the source. You could pickle the function, or just access its __code__ attribute. But if the function comes from a C extension module, even that won't work. At that point, the best you can do is check the timestamp or hash of the whole binary file.
And plenty of other variations. But this should be enough to get you started.
A completely different alternative is to do the checking as part of your development workflow, instead of as part of the program.
Assuming you're using some kind of version control (if not, you really should be), most of them come with some kind of commit hook system. Git, for example, comes with a whole slew of options; if you have an executable named .git/hooks/pre-commit, it will get run every time you try to git commit.
Anyway, the simplest pre-commit hook would be something like (untested):
#!/bin/sh
if git diff --cached --name-only | grep -q module_with_load_function.py; then
    python /path/to/pickle/cleanup/script.py
fi
Now, every time you do a git commit, if the diffs include any change to a file named module_with_load_function.py (obviously use the name of the file with load_data_initial in it), it will first run the script /path/to/pickle/cleanup/script.py (which is a script you write that just deletes all the cached pickle files).
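The cleanup script itself can be trivial; here is a sketch that assumes the cached pickles live under a cache/ directory (adjust the pattern to wherever you store them):
#!/usr/bin/env python
"""Delete all cached pickle files so they get regenerated on the next run."""
import glob
import os

for path in glob.glob('cache/**/*.pickle', recursive=True):
    os.remove(path)
    print('removed', path)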
If you've edited the file but know you don't need to clean out the pickles, you can just git commit --no-verify. Or you can expand on the script to have an environment variable that you can use to skip the cleaning, or to only clean certain directories, or whatever you want. (It's probably better to default to cleaning overzealously—worst-case scenario, when you forget every few weeks, you waste 5 minutes waiting, which isn't as bad as waiting 3 hours for it to run a bunch of processing on incorrect data, right?)
You can expand on this to, e.g., check the complete diffs and see if they include the function, instead of just checking the filenames. The hooks are just anything executable, so you can write them in Python instead of bash if they get too complicated.
If you don't know git all that well (or even if you do), you'll probably be happier installing a third-party library like pre-commit that makes it easier to manage hooks, write them in Python (without having to deal with complicated git commands), etc. If you are comfortable with git, just looking at hooks/pre-commit.sample and some of the other samples in the templates directory should be enough to give you ideas.

test for the file related function in python

I am rather new to software testing. I wonder how to write a unit test for file related functions in python. e.g., if I have a file copy function as follows.
def copy_file(self):
    if not os.path.isdir(dest_path):
        os.makedirs(dest_path)
    try:
        shutil.copy2(src_path, dest_path)
    except IOError as e:
        print e
What should I do to test the above function? What should I assert (directory, file content, exceptions)?
You can rely on shutil.copy2 to copy the contents correctly; you only need to test your code. In this case, that means checking that it creates the directories if they don't exist and that it swallows IOErrors.
And don't forget to clean up. ;)
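For instance, a minimal pytest sketch along those lines; it assumes copy_file has been refactored to take src_path and dest_path as arguments so the test can point it at a temporary directory, which pytest cleans up for you:
import os
import shutil

def copy_file(src_path, dest_path):
    if not os.path.isdir(dest_path):
        os.makedirs(dest_path)
    try:
        shutil.copy2(src_path, dest_path)
    except IOError as e:
        print(e)

def test_creates_destination_and_copies(tmp_path):
    src = tmp_path / 'src.txt'
    src.write_text('hello')
    dest = tmp_path / 'does' / 'not' / 'exist'

    copy_file(str(src), str(dest))

    assert dest.is_dir()                              # missing directories were created
    assert (dest / 'src.txt').read_text() == 'hello'  # contents arrived intact

def test_swallows_ioerror(tmp_path):
    # A missing source raises an IOError/OSError, which copy_file must swallow
    copy_file(str(tmp_path / 'missing.txt'), str(tmp_path / 'dest'))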
I think you can get some clues from test_shutil itself and see how it tests the copy functionality: it copies files and then checks, using other modules, that the results exist. The difference in behaviour between standard shutil.copy2 and your wrapper lies in how a non-existent destination is handled: shutil.copy2 would simply create a file at the destination path, whereas your wrapper creates a destination directory and copies the source into it. So write tests where the destination does not exist, and ensure that after your wrapper runs, the destination is a directory and it contains the file that shutil copied.
Think about requirements you put on your method, like:
The method shall report an error in case source_dir does not exist or either the source or destination directory is not accessible
The method shall create destination_dir if it does not exist, and report an error if the destination dir cannot be created
The method shall report an error if either the source or destination dir has an illegal value (null or illegal chars for directories)
The method shall copy all files from the source to the destination directory
  check if all files have been copied
The method shall replace existing files (maybe)
  check if existing files are replaced
Those are just some thoughts. Don't list every possible requirement and don't test too much; keep in mind what you want to do with this method. If it's private or only used in your own code, you can reduce the scope, but if you provide a public API, then you have to make sure that any input has a defined result (which may be an error message).

does close() imply flush() in Python?

In Python, and in general - does a close() operation on a file object imply a flush() operation?
Yes. It uses the underlying close() function which does that for you (source).
NB: close() and flush() won't ensure that the data is actually secure on the disk. They just ensure that the OS has the data, i.e., that it isn't buffered inside the process.
You can try sync or fsync to get the data written to the disk.
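For example, a minimal sketch of pushing the data all the way to the disk before closing (the filename is illustrative):
import os

with open('important.log', 'w') as f:
    f.write('payload\n')
    f.flush()              # push Python's buffer to the OS
    os.fsync(f.fileno())   # ask the OS to commit it to the disk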
Yes, in Python 3 this is finally in the official documentation, but it was already the case in Python 2 (see Martin's answer).
As a complement to this question: yes, Python flushes before close; however, if you want to ensure data is written properly to disk, this is not enough.
This is how I would write a file so that it's atomically updated on a UNIX/Linux server, whether or not the target file already exists. Note that some filesystems will implicitly commit data to disk on close+rename (ext3 with data=ordered (the default); ext4 initially uncovered many application flaws before adding detection of write-close-rename patterns and syncing data before metadata in those cases[1]).
import json
import os
import tempfile

# Write destfile (the target path), using a temporary name .<name>_XXXXXXXX
base, name = os.path.split(destfile)
tmpname = os.path.join(base, '.{}_'.format(name))  # This is the tmpfile prefix
with tempfile.NamedTemporaryFile('w', prefix=tmpname, delete=False) as fd:
    # Replace prefix with actual file path/name
    tmpname = str(fd.name)
    try:
        # Write fd here... ex:
        json.dump({}, fd)
        # We want to fdatasync before closing, so we need to flush before close anyway
        fd.flush()
        os.fdatasync(fd)
        # Since we're using tmpfile, we need to also set the proper permissions
        if os.path.exists(destfile):
            # Copy destination file's mode
            os.fchmod(fd.fileno(), os.stat(destfile).st_mode)
        else:
            # Set mode based on current umask value
            umask = os.umask(0o22)
            os.umask(umask)
            os.fchmod(fd.fileno(), 0o666 & ~umask)  # 0o777 for dirs and executable files
        # Now we can close and rename the file (overwriting any existing one)
        fd.close()
        os.rename(tmpname, destfile)
    except:
        # On error, try to clean up the temporary file
        try:
            os.unlink(tmpname)
        except OSError:
            pass
        raise
IMHO it would have been nice if Python provided simple methods around this... At the same time I guess if you care about data consistency it's probably best to really understand what is going on at a low level, especially since there are many differences across various Operating Systems and Filesystems.
Also note that this does not guarantee the written data can be recovered, only that you will get a consistent copy of the data (old or new). To ensure the new data is safely written and accessible when returning, you need to use os.fsync(...) after the rename, and even then, if you have unsafe caches in the write path, you could still lose data. This is common on consumer-grade hardware, although any system can be configured for unsafe writes, which boosts performance too. At least even with unsafe caches, the method above should still guarantee that whichever copy of the data you get is valid.
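On POSIX systems, the usual way to make the rename itself durable is to fsync the containing directory after os.rename(); a sketch, reusing base from the snippet above:
import os

dirfd = os.open(base or '.', os.O_DIRECTORY)  # O_DIRECTORY is POSIX-only
try:
    os.fsync(dirfd)  # commit the directory entry, i.e., the rename itself
finally:
    os.close(dirfd)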
A caveat: make sure you actually call the methods. In the session below, x.flush and x.close are written without parentheses, so they are only referenced, never called; that is why the data stays in the process's buffer and never shows up in the file. Observe this session where I wrote to a file, suspended Python with Ctrl-Z to get the shell command prompt, and examined the file:
$ cat xyz
ghi
$ fg
python
>>> x=open("xyz","a")
>>> x.write("morestuff\n")
>>> x.write("morestuff\n")
>>> x.write("morestuff\n")
>>> x.flush
<built-in method flush of file object at 0x7f58e0044660>
>>> x.close
<built-in method close of file object at 0x7f58e0044660>
>>>
[1]+ Stopped python
$ cat xyz
ghi
Subsequently, reopening the file with x=open("xyz","a") rebinds x, so the original file object is garbage-collected; that closes it and flushes the buffered data to the file. As the others have said, the sync syscall (available from the os package) should flush all buffers to disk, but it has possible system-wide performance implications (it syncs all files on the system).

Absolute path of a file object

This has been discussed on StackOverflow before - I am trying to find a good way to find the absolute path of a file object, but I need it to be robust to os.chdir(), so cannot use
f = file('test')
os.path.abspath(f.name)
Instead, I was wondering whether the following is a good solution - basically extending the file class so that on opening, the absolute path of the file is saved:
class File(file):
    def __init__(self, filename, *args, **kwargs):
        self.abspath = os.path.abspath(filename)
        file.__init__(self, filename, *args, **kwargs)
Then one can do
f = File('test','rb')
os.chdir('some_directory')
f.abspath # absolute path can be accessed like this
Are there any risks with doing this?
One significant risk is that, once the file is open, the process is dealing with that file by its file descriptor, not its path. On many operating systems, the file's path can be changed by some other process (by a mv operation in an unrelated process, say) and the file descriptor is still valid and refers to the same file.
I often take advantage of this by, for example, beginning a download of a large file, then realising the destination file isn't where I want it to be, and hopping to a separate shell and moving it to the right location – while the download continues uninterrupted.
So it is a bad idea to depend on the path remaining the same for the life of the process, when there's no such guarantee given by the operating system.
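A small sketch of that behaviour on a POSIX system (the temporary paths are made up; on Windows, renaming a file that is open may fail instead):
import os
import tempfile

tmpdir = tempfile.mkdtemp()
old_path = os.path.join(tmpdir, 'data.txt')
new_path = os.path.join(tmpdir, 'renamed.txt')

f = open(old_path, 'w')
f.write('first line\n')
os.rename(old_path, new_path)   # the path recorded at open() time is now stale
f.write('second line\n')        # ...but the descriptor still points at the same file
f.close()

print(os.path.isfile(old_path))   # False
print(open(new_path).read())      # both lines are in the renamed file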
It depends on what you need it for.
As long as you understand the limitations—someone might move, rename, or hard-link the file in the interim—there are plenty of appropriate uses for this. You may want to delete the file when you're done with it, or if something goes wrong while writing it (e.g., gcc does this when writing files):
f = File(path, "w+")
try:
    ...
except:
    try:
        os.unlink(f.abspath)
    except OSError:  # nothing we can do if this fails
        pass
    raise
If you just want to be able to identify the file in user messages, there's already file.name. It's impossible to use this (reliably) for anything else, unfortunately; there's no way to distinguish between a filename "<stdin>" and sys.stdin, for example.
(You really shouldn't have to derive from a builtin class just to add attributes to it; that's just an ugly inconsistent quirk of Python...)
