Absolute path of a file object - python

This has been discussed on StackOverflow before - I am trying to find a good way to find the absolute path of a file object, but I need it to be robust to os.chdir(), so cannot use
f = file('test')
os.path.abspath(f.name)
Instead, I was wondering whether the following is a good solution - basically extending the file class so that on opening, the absolute path of the file is saved:
class File(file):
    def __init__(self, filename, *args, **kwargs):
        self.abspath = os.path.abspath(filename)
        file.__init__(self, filename, *args, **kwargs)
Then one can do
f = File('test','rb')
os.chdir('some_directory')
f.abspath # absolute path can be accessed like this
Are there any risks with doing this?

One significant risk is that, once the file is open, the process is dealing with that file by its file descriptor, not its path. On many operating systems, the file's path can be changed by some other process (by a mv operation in an unrelated process, say) and the file descriptor is still valid and refers to the same file.
I often take advantage of this by, for example, beginning a download of a large file, then realising the destination file isn't where I want it to be, and hopping to a separate shell and moving it to the right location – while the download continues uninterrupted.
So it is a bad idea to depend on the path remaining the same for the life of the process, when there's no such guarantee given by the operating system.
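A minimal sketch of the rename scenario described above, assuming a POSIX system (on Windows, renaming a file that is still open usually fails). The function name and the file names are made up for illustration; the os.rename call stands in for a mv run by another process.

```python
import os
import tempfile


def stale_path_demo():
    """Open a file, rename it behind the handle's back, then keep writing."""
    workdir = tempfile.mkdtemp()
    old_path = os.path.join(workdir, "download.part")
    new_path = os.path.join(workdir, "download.bin")

    f = open(old_path, "w")
    recorded = os.path.abspath(f.name)  # what a saved abspath attribute would hold
    os.rename(old_path, new_path)       # stands in for `mv` in another process
    f.write("still writing\n")          # the descriptor is unaffected by the rename
    f.close()

    with open(new_path) as moved:
        content = moved.read()
    path_still_valid = os.path.exists(recorded)

    os.remove(new_path)
    os.rmdir(workdir)
    return path_still_valid, content
```

The write lands in the renamed file, while the recorded path no longer points at anything.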

It depends on what you need it for.
As long as you understand the limitations--someone might move, rename, or hard-link the file in the interim--there are plenty of appropriate uses for this. You may want to delete the file when you're done with it or if something goes wrong while writing it (eg. gcc does this when writing files):
f = File(path, "w+")
try:
    ...
except:
    try:
        os.unlink(f.abspath)
    except OSError:  # nothing we can do if this fails
        pass
    raise
If you just want to be able to identify the file in user messages, there's already file.name. It's impossible to use this (reliably) for anything else, unfortunately; there's no way to distinguish between a filename "<stdin>" and sys.stdin, for example.
(You really shouldn't have to derive from a builtin class just to add attributes to it; that's just an ugly inconsistent quirk of Python...)
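On Python 3 the built-in file type no longer exists, so the subclass from the question cannot be written as shown. A minimal sketch of the same idea without subclassing follows; open_with_abspath is a hypothetical helper name, not a standard library function.

```python
import os


def open_with_abspath(filename, mode="r", **kwargs):
    """Open filename and return (file_object, absolute_path).

    The path is resolved before the file is opened, so a later
    os.chdir() does not change what it refers to.
    """
    abspath = os.path.abspath(filename)
    return open(abspath, mode, **kwargs), abspath
```

The caveats above still apply: the returned path describes where the file was at open time, not where it is now.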

Related

Thing to check if you have permissions to directory

I'm writing a Python program and am now working on exception handling.
while True:
    try:
        os.makedirs("{}\\test".format(dest))
    except PermissionError:
        print("Make sure that you have access to specified path")
        print("Try again specify your path: ", end='')
        dest = input()
        continue
    break
It is working but later I need to delete that folder.
What is the better way to do it?
Don't.
It is almost never worth verifying that you have permissions to perform an operation that your program requires. For one thing, permissions are not the only possible reason for failure. A delete may also fail because of a file lock by another program, for instance. Unless you have a very good reason to do otherwise, it is both more efficient and more reliable to just write your code to try the operation and then abort on failure:
import shutil
import sys
try:
    shutil.rmtree(path_to_remove)  # Recursively deletes directory and files inside it
except Exception as ex:
    print('Failed to delete directory, manual clean up may be required: {}'.format(path_to_remove))
    sys.exit(1)
Other concerns about your code
Use os.path.join to concatenate file paths: os.makedirs(os.path.join(dest, 'test')). This will use the appropriate directory separator for the operating system.
Why are you looping on failure? In real world programs, simply aborting the entire operation is simpler and usually makes for a better user experience.
Are you sure you aren't looking for the tempfile library? It allows you to spit out a unique directory to the operating system's standard temporary location:
import os
import tempfile
with tempfile.TemporaryDirectory() as tmpdir:
    some_function_that_creates_several_files(tmpdir)
    for dirpath, dirnames, filenames in os.walk(tmpdir):
        for filename in filenames:
            pass  # do something with each file
# tmpdir automatically deleted when context manager exits
# Or if you really only need the file
with tempfile.TemporaryFile() as tmpfile:
    tmpfile.write(b'my data')  # TemporaryFile opens in binary mode by default
    some_function_that_needs_a_file(tmpfile)
# tmpfile automatically deleted when context manager exits
I think what you want is os.access.
os.access(path, mode, *, dir_fd=None, effective_ids=False, follow_symlinks=True)
Use the real uid/gid to test for access to path. Note that most operations will use the effective uid/gid, therefore this routine can be used in a suid/sgid environment to test if the invoking user has the specified access to path. mode should be F_OK to test the existence of path, or it can be the inclusive OR of one or more of R_OK, W_OK, and X_OK to test permissions. Return True if access is allowed, False if not.
For example:
os.access("/path", os.R_OK)
And the mode contains:
os.F_OK # existence
os.R_OK # readability
os.W_OK # writability
os.X_OK # executability
Refer: https://docs.python.org/3.7/library/os.html#os.access
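A minimal sketch of how os.access might be combined with the "just try it" advice from the other answer. The result of os.access only describes the moment of the call, so the real operation is still wrapped in try/except. The helper names can_write_dir and make_subdir are made up for illustration.

```python
import os


def can_write_dir(path):
    """Best-effort hint: is path an existing directory we appear able to write into?"""
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)


def make_subdir(parent, name="test"):
    """Create parent/name and return the new path, or None on failure."""
    if not can_write_dir(parent):
        return None
    target = os.path.join(parent, name)
    try:
        os.makedirs(target)
    except OSError:
        # The check above may already be stale (or the directory may exist),
        # so the operation itself has to be guarded as well.
        return None
    return target
```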

How can I check if a loaded Python function changed?

As a data scientist / machine learning developer, I have most of the time (always?) a load_data function. Executing this function often takes more than 5 minutes, because the executed operations are expensive. When I store the end result of load_data in a pickle file and read that file again, then the time often goes down to a few seconds.
So a solution I use quite often is:
def load_data(serialize_pickle_path, original_filepath):
    invalid_hash = True
    if os.path.exists(serialize_pickle_path):
        content = mpu.io.read(serialize_pickle_path)
        data = content['data']
        invalid_hash = mpu.io.hash(original_filepath) != content['hash']
    if invalid_hash:
        data = load_data_initial()
        filehash = mpu.io.hash(original_filepath)
        mpu.io.write(serialize_pickle_path, {'data': data, 'hash': filehash})
    return data
This solution has a major drawback: If the load_data_initial changed, the file will not be reloaded.
Is there a way to check for changes in Python functions?
Assuming you're asking whether there's a way to tell whether someone changed the source code of the function between the last time you quit the program and the next time you start it…
There's no way to do this directly, but it's not that hard to do manually, if you don't mind getting a little hacky.
Since you've imported the module and have access to the function, you can use the getsource function to get its source code. So, all you need to do is save that source. For example:
import inspect
import os

def source_match(source_path, obj):
    try:
        with open(source_path) as f:
            source = f.read()
        if source == inspect.getsource(obj):
            return True
    except Exception:
        # Maybe log the exception or something here, but any of the obvious problems,
        # like the file not existing or the function not being inspectable,
        # mean we have to re-generate the data
        pass
    return False

def load_data(serialize_pickle_path, original_filepath):
    invalid_hash = True
    if os.path.exists(serialize_pickle_path):
        if source_match(serialize_pickle_path + '.sourcepy', load_data_initial):
            content = mpu.io.read(serialize_pickle_path)
            data = content['data']
            invalid_hash = mpu.io.hash(original_filepath) != content['hash']
    # etc., but make sure to save the source when you save the pickle too
Of course even if the body of the function hasn't changed, its effect might change because of, e.g., a change in some module constant, or the implementation of some other function it uses. Depending on how much this matters, you could pull in the entire module it's defined in, or that module plus every other module that it recursively depends on, etc.
And of course you can also save hashes of text instead of the full text, to make things a little smaller. Or embed them in the pickle file instead of saving them alongside.
Also, if the source isn't available because it comes from a module you only distribute in .pyc format, you obviously can't check the source. You could pickle the function, or just access its __code__ attribute. But if the function comes from a C extension module, even that won't work. At that point, the best you can do is check the timestamp or hash of the whole binary file.
And plenty of other variations. But this should be enough to get you started.
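The "save a hash instead of the full text" variation could look roughly like this. It is a sketch under assumptions: function_fingerprint and cache_is_fresh are hypothetical names, the bytecode fallback is deliberately coarse, and functions from C extension modules are not handled (they have no __code__ attribute).

```python
import hashlib
import inspect
import types


def function_fingerprint(func):
    """Return a hex digest that changes when the function's source text changes."""
    try:
        payload = inspect.getsource(func).encode("utf-8")
    except (OSError, TypeError):
        # No source available (for example a .pyc-only module): fall back to
        # the bytecode plus its simple constants. Nested code objects are
        # skipped because their repr is not stable between runs.
        code = func.__code__
        consts = [c for c in code.co_consts if not isinstance(c, types.CodeType)]
        payload = code.co_code + repr(consts).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cache_is_fresh(stored_fingerprint, func):
    """Compare a fingerprint saved next to the pickle with the current function."""
    return stored_fingerprint == function_fingerprint(func)
```

The fingerprint would be stored alongside (or inside) the pickle when it is written, and compared before the pickle is trusted.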
A completely different alternative is to do the checking as part of your development workflow, instead of as part of the program.
Assuming you're using some kind of version control (if not, you really should be), most of them come with some kind of commit hook system. Git, for example, comes with a whole slew of options: if you have an executable program named .git/hooks/pre-commit, it will get run every time you try to git commit.
Anyway, the simplest pre-commit hook would be something like (untested):
#!/bin/sh
if git diff --cached --name-only | grep -q module_with_load_function.py; then  # staged files only
    python /path/to/pickle/cleanup/script.py
fi
Now, every time you do a git commit, if the diffs include any change to a file named module_with_load_function.py (obviously use the name of the file with load_data_initial in it), it will first run the script /path/to/pickle/cleanup/script.py (which is a script you write that just deletes all the cached pickle files).
If you've edited the file but know you don't need to clean out the pickles, you can just git commit --no-verify. Or you can expand on the script to have an environment variable that you can use to skip the cleaning, or to only clean certain directories, or whatever you want. (It's probably better to default to cleaning overzealously—worst-case scenario, when you forget every few weeks, you waste 5 minutes waiting, which isn't as bad as waiting 3 hours for it to run a bunch of processing on incorrect data, right?)
You can expand on this to, e.g., check the complete diffs and see if they include the function, instead of just checking the filenames. The hooks are just anything executable, so you can write them in Python instead of bash if they get too complicated.
If you don't know git all that well (or even if you do), you'll probably be happier installing a third-party library like pre-commit that makes it easier to manage hooks, write them in Python (without having to deal with complicated git commands), etc. If you are comfortable with git, just looking at hooks--pre-commit.sample and some of the other samples in the templates directory should be enough to give you ideas.
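Since hooks can be any executable, the same check could be sketched in Python. This is an illustration, not a tested hook: the cache location cache/*.pickle and the function names are assumptions, and only the git diff --cached --name-only command is taken from git itself.

```python
import glob
import os
import subprocess

WATCHED = "module_with_load_function.py"  # file that defines load_data_initial


def needs_cleanup(changed_files, watched=WATCHED):
    """True if any changed path refers to the watched file."""
    return any(os.path.basename(name) == watched for name in changed_files)


def staged_files():
    """Names of the files staged for the commit being created."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def main(cache_glob="cache/*.pickle"):  # hypothetical cache location
    if needs_cleanup(staged_files()):
        for path in glob.glob(cache_glob):
            os.remove(path)
    return 0  # exit status 0, so the commit is never blocked


# A hook file saved as .git/hooks/pre-commit would end with:
#     raise SystemExit(main())
```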

Wrap URL as filesystem path

I am trying to call a python function that takes an absolute path as an argument, but the file I want to reference is on the web.
Without cloning the file locally, is there a way I can refer to the file that will make python think the file is local?
In other words, I want to wrap the URL in a variable my_file_path, and have this return True:
os.path.isfile(my_file_path)
Note that I need to fake a file system path, as other calls in the program I am using are expecting a path, and not a file-like object (this includes other functions that call the function I linked)
A really great way to do this is with the requests library. You can get a file-like object using the stream=True option to the get function:
r = requests.get('https://api.github.com/events', stream=True)
loadmat(r.raw, ...)
In the case of needing an actual path, you can use the tempfile module as well:
with tempfile.NamedTemporaryFile() as fd:
    r = requests.get('https://api.github.com/events', stream=True)
    for chunk in r.iter_content(chunk_size=8192):  # any reasonable chunk size works
        fd.write(chunk)
    fd.flush()
    loadmat(fd.name)
# other code here, where the temp file no longer exists but the data has been read
There is no way to make Python take a URL where it wants a path.
In many cases—like the very function you linked in your question—it actually wants a file-like object, and the object returned by, e.g., urlopen is file-like. But in other cases, that doesn't work.
So, what can you do?
Below the Python level, your operating system may have a way to mount different kinds of remote paths as if they were part of your local filesystem.
At a higher level, write your own wrapper that just downloads the file to a temporary file. That temporary file will, of course, pass the os.path.isfile(my_file_path) test that you wanted, and will work with everything else that needs a file. But it means that you need to keep the two "layers" of your code—the part that wants to deal with URLs, and the part that needs to deal with functions that can only take local files—separate, and write the interface between those layers. On at least some platforms, you can create a temporary file that never gets flushed to disk unless necessary. (You can even create a temporary file that doesn't appear anywhere in the directory tree, but that wouldn't help here, because then you obviously can't pass a pathname around…) So you're not "cloning the file" in any sense that actually matters.
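The "download to a temporary file" wrapper described above might be sketched as follows, using only the standard library. The name url_as_local_path is hypothetical, and the sketch assumes the whole file fits comfortably on local disk.

```python
import contextlib
import os
import shutil
import tempfile
import urllib.request


@contextlib.contextmanager
def url_as_local_path(url, suffix=""):
    """Download url to a temporary file and yield that file's path.

    The temporary file is removed when the with-block exits.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as out, urllib.request.urlopen(url) as resp:
            shutil.copyfileobj(resp, out)
        yield path
    finally:
        os.unlink(path)
```

Inside the with-block, os.path.isfile(path) is true and the path can be handed to any function that insists on a local file; code that deals with URLs stays outside the block.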

Python create empty file - Unix

I want to create an empty file using a Python script in a Unix environment. I have seen different ways of achieving this. What are the benefits and pitfalls of one over the other?
os.system('touch abc')
open('abc','a').close()
open('abc','a')
subprocess.call(['touch','abc'])
Well, for a start, the ones that rely on touch are not portable. They won't work under standard Windows, for example, without the installation of Cygwin, GnuWin32, or some other package providing a touch utility.
They also involve the creation of a separate process for doing the work, something that's totally unnecessary in this case.
Of the four, I would probably use open('abc','a').close() if the intent is to try and just create the file if it doesn't exist. In my opinion, that makes the intent clear.
But, if you're trying to create an empty file, I'd probably be using the w write mode rather than the a append mode.
In addition, you probably also want to catch the exception if, for example, you cannot actually create the file.
TLDR: use
open('abc','a').close()
(or 'w' instead of 'a' if the intent is to truncate the file if it already exists).
Invoking a separate process to do something Python can do itself is wasteful, and non-portable to platforms where the external command is not available. (Additionally, os.system uses two processes -- one more for a shell to parse the command line -- and the Python documentation recommends subprocess in its place.)
Not closing an open filehandle when you're done with it is bad practice, and could cause resource depletion in a larger program (you run out of filehandles if you open more and more files and never close them).
To create an empty file on Unix in Python:
import os
try:
    os.close(os.open('abc', os.O_WRONLY | os.O_CREAT | os.O_EXCL |
                     getattr(os, "O_CLOEXEC", 0) |
                     os.O_NONBLOCK | os.O_NOCTTY))
except OSError:
    pass  # decide what to consider an error in your case and reraise
    # 1. is it an error if 'abc' entry already exists?
    # 2. is it an error if 'abc' is a directory or a symlink to a directory?
    # 3. is it an error if 'abc' is a named pipe?
    # 4. it is probably an error if the parent directory is not writable
    #    or the filesystem is read-only (can't create a file)
Or more portable variant:
try:
    open('abc', 'ab', 0).close()
except OSError:
    pass  # see the comment above
Without the explicit .close() call, non-reference-counting Python implementations such as PyPy and Jython may delay closing the file until garbage collection runs, which can exhaust the file descriptors available to your process.
The latter example may get stuck on a FIFO (named pipe), and it follows symlinks. On my system, it is equivalent to:
import os
os.open("abc", os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_CLOEXEC, 0o666)
In addition, the touch command updates the access and modification times of existing files to the current time.
In Python 3.4 and later, we have Path.touch() from pathlib. This will create an empty file if it doesn't exist, and update the mtime if it does, in the same way as your example os.system('touch abc'), but it's much more portable:
from pathlib import Path
abc = Path('abc')
abc.touch()
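If an already existing 'abc' should count as an error (the first question raised in the comments of the os.open example above), Python 3.3 and later offer the 'x' (exclusive creation) mode of the built-in open. A minimal sketch, with create_empty as a hypothetical helper name:

```python
import os


def create_empty(path, exist_ok=False):
    """Create an empty file.

    Returns True if the file was created, False if it already existed and
    exist_ok is set; otherwise FileExistsError propagates to the caller.
    """
    try:
        open(path, "x").close()  # 'x' fails if path already exists
        return True
    except FileExistsError:
        if exist_ok:
            return False
        raise
```

Unlike Path.touch(), this never modifies the timestamps or contents of an existing file.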

does close() imply flush() in Python?

In Python, and in general - does a close() operation on a file object imply a flush() operation?
Yes. It uses the underlying close() function which does that for you (source).
NB: close() and flush() won't ensure that the data is actually safe on the disk. They only ensure that the OS has the data, i.e. that it is no longer buffered inside the process.
You can try sync or fsync to get the data written to the disk.
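A minimal sketch of the flush-then-fsync sequence mentioned above; write_and_sync is a made-up helper name.

```python
import os


def write_and_sync(path, data):
    """Write bytes to path and ask the OS to push them to the storage device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # Python's buffer -> operating system
        os.fsync(f.fileno())  # OS cache -> storage device, as far as the OS can tell
```

The flush() call has to come first, because os.fsync only knows about data the OS has already received.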
Yes, in Python 3 this is finally in the official documentation, but it was already the case in Python 2 (see Martin's answer).
As a complement to this question, yes python flushes before close, however if you want to ensure data is written properly to disk this is not enough.
This is how I would write a file so that it's atomically updated on a UNIX/Linux server, whether or not the target file already exists. Note that some filesystems will implicitly commit data to disk on close+rename (ext3 with data=ordered, the default; ext4 initially uncovered many application flaws before it added detection of write-close-rename patterns and began syncing data before metadata for those[1]).
import json
import os
import tempfile
# Write destfile, using a temporary name .<name>_XXXXXXXX in the same directory
base, name = os.path.split(destfile)
prefix = '.{}_'.format(name)  # This is the tmpfile prefix
with tempfile.NamedTemporaryFile('w', dir=base or '.', prefix=prefix, delete=False) as fd:
    # Actual path/name of the temporary file
    tmpname = str(fd.name)
    try:
        # Write fd here... ex:
        json.dump({}, fd)
        # We want to fdatasync before closing, so we need to flush before close anyway
        fd.flush()
        os.fdatasync(fd.fileno())
        # Since we're using tmpfile, we need to also set the proper permissions
        if os.path.exists(destfile):
            # Copy destination file's mask
            os.fchmod(fd.fileno(), os.stat(destfile).st_mode)
        else:
            # Set mask based on current umask value
            umask = os.umask(0o22)
            os.umask(umask)
            os.fchmod(fd.fileno(), 0o666 & ~umask)  # 0o777 for dirs and executable files
        # Now we can close and rename the file (overwriting any existing one)
        fd.close()
        os.rename(tmpname, destfile)
    except:
        # On error, try to cleanup the temporary file
        try:
            os.unlink(tmpname)
        except OSError:
            pass
        raise
IMHO it would have been nice if Python provided simple methods around this... At the same time I guess if you care about data consistency it's probably best to really understand what is going on at a low level, especially since there are many differences across various Operating Systems and Filesystems.
Also note that this does not guarantee the written data can be recovered, only that you will get a consistent copy of the data (old or new). To ensure the new data is safely written and accessible when returning, you need to use os.fsync(...) after the rename, and even then if you have unsafe caches in the write path you could still lose data. This is common on consumer-grade hardware, although any system can be configured for unsafe writes, which also boosts performance. Even with unsafe caches, the method above should still guarantee that whichever copy of the data you get is valid.
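A compact Python 3 sketch of the same write-then-rename pattern, including one common reading of "fsync after the rename", namely syncing the containing directory. This is an assumption about what the sentence above means, it is POSIX-only, and it leaves out the permission handling shown earlier (mkstemp creates the file with mode 0600). The name atomic_write_bytes is hypothetical.

```python
import os
import tempfile


def atomic_write_bytes(destfile, data):
    """Replace destfile with data so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(destfile))
    fd, tmpname = tempfile.mkstemp(dir=directory, prefix=".tmp_")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())       # file contents reach the device first
        os.replace(tmpname, destfile)  # atomic rename within one filesystem
    except BaseException:
        try:
            os.unlink(tmpname)
        except OSError:
            pass
        raise
    # Sync the directory so the rename itself is persisted (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```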
Note that filehandle.close and filehandle.flush only do anything when they are actually called. In the session below I wrote to a file, then typed x.flush and x.close without parentheses (which merely evaluates to the bound method instead of calling it), pressed Ctrl-Z to return to the shell prompt and examined the file. The new data is not there, because it is still sitting in Python's own buffer:
$ cat xyz
ghi
$ fg
python
>>> x=open("xyz","a")
>>> x.write("morestuff\n")
>>> x.write("morestuff\n")
>>> x.write("morestuff\n")
>>> x.flush
<built-in method flush of file object at 0x7f58e0044660>
>>> x.close
<built-in method close of file object at 0x7f58e0044660>
>>>
[1]+ Stopped python
$ cat xyz
ghi
Subsequently I can reopen the file, and that necessarily syncs the file (because, in this case, I open it in the append mode). As the others have said, the sync syscall (available from the os package) should flush all buffers to disk but it has possible system-wide performance implications (it syncs all files on the system).
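A minimal sketch of what another reader of the file sees when flush() and close() are actually called (with parentheses). The function name visible_sizes is made up; binary append mode is used so the byte counts do not depend on newline translation.

```python
import os
import tempfile


def visible_sizes():
    """Return the file size as seen from outside before flush, after flush, after close."""
    fd, path = tempfile.mkstemp()
    os.close(fd)

    f = open(path, "ab")
    f.write(b"morestuff\n")
    before = os.path.getsize(path)       # data still sits in Python's buffer
    f.flush()
    after_flush = os.path.getsize(path)  # handed over to the OS
    f.write(b"morestuff\n")
    f.close()                            # close() flushes what is left
    after_close = os.path.getsize(path)

    os.remove(path)
    return before, after_flush, after_close
```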
