Check if an open file has been deleted after open in python

Is it possible to check if a file has been deleted or recreated in python?
For example, if you call open("file") in a script and then, while that file is still open, run rm file; touch file;, the script still holds a reference to the old file even though it has already been deleted.

You should fstat the file descriptor for the opened file.
>>> import os
>>> f = open("testdv.py")
>>> os.fstat(f.fileno())
posix.stat_result(st_mode=33188, st_ino=1508053, st_dev=65027L, st_nlink=1, st_uid=1000, st_gid=1000, st_size=1107, st_atime=1349180541, st_mtime=1349180540, st_ctime=1349180540)
>>> os.fstat(f.fileno()).st_nlink
1
Ok, this file has one link, so one name in the filesystem. Now remove it:
>>> os.unlink("testdv.py")
>>> os.fstat(f.fileno()).st_nlink
0
No more links, so we have an "anonymous file" that's only kept alive as long as we have it open. Creating a new file with the same name has no effect on the old file:
>>> g = open("testdv.py", "w")
>>> os.fstat(g.fileno()).st_nlink
1
>>> os.fstat(f.fileno()).st_nlink
0
Of course, st_nlink can sometimes be >1 initially, so checking that for zero is not entirely reliable (though in a controlled setting, it might be good enough). Instead, you can verify whether the file at the path you initially opened is the same one that you have a file descriptor for by comparing stat results:
>>> os.stat("testdv.py") == os.fstat(f.fileno())
False
>>> os.stat("testdv.py") == os.fstat(g.fileno())
True
(And if you want this to be 100% correct, then you should compare only the st_dev and st_ino fields on stat results, since the other fields and st_atime in particular might change in between the calls.)
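The st_dev/st_ino comparison described above could be wrapped in a small helper. This is only a sketch, assuming a POSIX system; the function name path_still_points_to is made up for illustration and is not part of any library.

```python
import os


def path_still_points_to(f, path):
    """Return True if `path` currently names the same file that `f` has open."""
    try:
        on_disk = os.stat(path)
    except FileNotFoundError:
        # The name is gone: the file was deleted (or renamed) and not recreated.
        return False
    opened = os.fstat(f.fileno())
    # Device number plus inode number identify a file on POSIX systems.
    return (on_disk.st_dev, on_disk.st_ino) == (opened.st_dev, opened.st_ino)
```

The standard library's os.path.samestat() performs the same two-field comparison on a pair of stat results, so it can replace the last line if you prefer.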

Yes. Use the os.stat() function to check the file length. If the length is zero (or the function returns the error "File not found"), then someone deleted the file.
Alternatively, you can open+write+close the file each time you need to write something into it. The drawback is that opening a file is a pretty slow operation, so this is out of the question if you need to write a lot of data.
Why? Because the new file isn't the file that you're holding open. In a nutshell, Unix filesystems have two levels. One is the directory entry (i.e. the file name, file size, modification time, pointer to the data) and the second level is the file data.
When you open a file, Unix uses the name to find the file data. After that, it operates only on the second level - changes to the directory entry have no effect on any open "file handles". Which is exactly why you can delete the directory entry: Your program isn't using it.
When you use os.stat(), you don't look at the file data but at the directory entry again.
On the positive side, this allows you to create files which no one can see but your program: Open the file, delete it and then use it. Since there is no directory entry for the file, no other program can access the data.
On the negative side, you can't easily solve problems like the one you have.

Yes -- you can use the inotify facility to check for file changes and more. There is also a Python binding for it. Using inotify you can watch files or directories for filesystem activity. From the manual, the following events can be detected:
IN_ACCESS File was accessed (read) (*).
IN_ATTRIB Metadata changed, e.g., permissions, timestamps, extended attributes, link count (since Linux 2.6.25), UID, GID, etc. (*).
IN_CLOSE_WRITE File opened for writing was closed (*).
IN_CLOSE_NOWRITE File not opened for writing was closed (*).
IN_CREATE File/directory created in watched directory (*).
IN_DELETE File/directory deleted from watched directory (*).
IN_DELETE_SELF Watched file/directory was itself deleted.
IN_MODIFY File was modified (*).
IN_MOVE_SELF Watched file/directory was itself moved.
IN_MOVED_FROM File moved out of watched directory (*).
IN_MOVED_TO File moved into watched directory (*).
IN_OPEN File was opened (*).
From here you can google yourself a solution, but I think you get the overall idea. Of course this may only work on Linux, but from your question I assume you are using it (references to rm and touch).

Related

How to check that file is saved to hard during TCP sending?

I send many files over TCP from a PC (Windows) to a server (Linux).
When I process the files on the server, I sometimes get an error because a file is corrupted or has zero size: it is still being saved to the hard disk.
I process the files in Python, grabbing them like this:
file_list = sorted(glob('*.bin'))
for file in file_list:
    file_size = os.path.getsize(file)
    if file_size > min_file_size:
        do_process(file)
How can I do this properly, i.e. make sure that the file is OK?
I can't choose the right min_file_size, since the files have different sizes.
Maybe I should copy them to another folder and then process them?
** I'm using SCP to copy the files. So on the server side, how can I be sure (any Linux hints?) that a file is OK to move to the directory where it will be processed? Sometimes when I type ls I see files which are not fully sent yet, so how can I rename them?
You can use the fuser command to check whether the file is currently being accessed by any process, as follows:
import subprocess
...
file_list = sorted(glob('*.bin'))
for file in file_list:
    result = subprocess.run(['fuser','--silent',file])
    if result.returncode != 0:
        do_process(file)
The fuser command will terminate with a non-0 return code if the file is not being accessed.
This has nothing to do with TCP. You are basically asking how to synchronize two processes in a way that if one writes the file the other will only use it once it is completely written and closed by the other.
One common way is to let the first process (writer) use a temporary file name which is not expected by the second process (reader) and to rename the file to the expected one after the file was closed. Other ways involve using file locking. One can also have a communication between the two processes (like a pipe or socketpair) which is used to explicitly inform the reader once the writer has finished and which file was written.
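The temporary-name-then-rename approach might look like the sketch below on the writer side. It assumes you control the writing process (with plain scp you would need a wrapper that uploads under a temporary name and renames afterwards); the function name write_then_publish and the ".part" suffix are illustrative choices, not an established convention.

```python
import os
import tempfile


def write_then_publish(data, final_path):
    """Write `data` to a temporary file, then rename it to `final_path`."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Create the temp file in the same directory so the rename stays on one
    # filesystem; the ".part" suffix is an arbitrary name the reader ignores.
    fd, tmp_path = tempfile.mkstemp(suffix=".part", dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # The file only appears under its expected name once it is complete.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial file if writing or renaming failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A reader that globs for '*.bin' never matches the ".part" name, so it only ever sees files that were fully written and closed.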

How to have multiple programs access the same file without manually giving them all the file path?

I'm writing several related python programs that need to access the same file; however, this file will be updated/replaced intermittently and I need them all to access the new file. My current idea is to have a specific folder where the latest file is placed whenever it needs to be replaced, and I was curious how I could have python select whatever text file is in the folder.
Or, would I be better off creating a program that has a Class entirely dedicated to holding the information of the file and have each program reference the file in that class. I could have the Class use tkinter.filedialog to select a new file whenever necessary and perhaps have a text file that has the path or name to the file that I need to access and have the other programs reference that.
Edit: I don't need to write to the file at all just read from it. However, I would like to have it so that I do not need to manually update the path to the file every time I run the program or update the file path.
Edit2: Changed title to suit the question more
If the requirement is to get the most recently modified file in a specific directory:
import os
mypath = r'C:\path\to\wherever'
myfiles = [(f,os.stat(os.path.join(mypath,f)).st_mtime) for f in os.listdir(mypath)]
mysortedfiles = sorted(myfiles,key=lambda x: x[1],reverse=True)
print('Most recently updated: %s'%mysortedfiles[0][0])
Basically, get a list of files in the directory, together with their modified time as a list of tuples, sort on modified date, then get the one you want.
It sounds like you're looking for a singleton pattern, which is a neat way of hiding a lot of logic into an 'only one instance' object.
This means the logic for identifying, retrieving, and delivering the file is all in one place, and your programs interact with it by saying 'give me the one instance of that thing'. If you need to alter how it identifies, retrieves, or delivers what that one thing is, you can keep that hidden.
It's worth noting that the singleton pattern can be considered an antipattern, as it's a form of global state; whether this is a deal breaker depends on the context of the program.
To "have python select whatever text file is in the folder", you could use the glob library to get a list of file(s) in the directory, see: https://docs.python.org/2/library/glob.html
You can also use os.listdir() to list all of the files in a directory, without matching pattern names.
Then, open() and read() whatever file or files you find in that directory.

Is it safe to download a NamedTemporaryFile from a pyramid FileResponse?

I'm currently working on an export feature for a web application using Pyramid on Python and running on Ubuntu 14.04. It zips the files into a NamedTemporaryFile and sends it back through a FileResponse:
# Create the temporary file to store the zip
with NamedTemporaryFile(delete=True) as output:
    map_zip = zipfile.ZipFile(output, 'w', zipfile.ZIP_DEFLATED)
    length_mapdir = len(map_directory)
    for root, dirs, files in os.walk(map_directory, followlinks=True):
        for file in files:
            file_path = os.path.join(root, file)
            map_zip.write(file_path, file_path[length_mapdir:])
    map_zip.close()
    # Send the response as an attachment to let the user download the file
    response = FileResponse(os.path.abspath(output.name))
    response.headers['Content-Type'] = 'application/download'
    response.headers['Content-Disposition'] = 'attachment; filename="'+filename+'"'
    return response
On the client's side, the export takes some time then the file download popup appears, nothing goes wrong and everything is in the zip as planned.
While the file is zipping, I can see a file taking up more and more size in /tmp/, and before the download popup appears, the file disappears. I assume this is the NamedTemporaryFile.
While the file is being zipped or downloaded, there isn't any significant change in the amount of RAM being used, it stays around 40mb while the actual zip is over 800mb.
Where is pyramid downloading the file from? From what I understand of tempfile, it is unlinked when it is closed. If that's true, is it possible another process could write on the memory where the file was stored, corrupting whatever pyramid is downloading?
In Unix environments something called reference counting is used when a file is created, and opened. For each open() call on a file, the reference number is increased, for each close() it is decreased. unlink() is special in that when that is called the file is unlinked from the directory tree, but will remain on disk so long as the reference count stays above 0.
In your case NamedTemporaryFile() creates a file on disk named /tmp/somefile
/tmp/somefile now has a link count of 1
/tmp/somefile then has open() called on it, so that it can return the file to you, this increases the reference count to 1
/tmp/somefile is then written to by your code, in this case a zip file
/tmp/somefile is then passed to FileResponse() which then has open() called on it, increasing the reference count to 2
You exit the scope of the with statement, and NamedTemporaryFile() calls close() followed by unlink(). Your file now has 1 reference to it, and a link count of 0. Due to the reference still existing, the file still exists on disk, but can no longer be seen when searching for it.
FileResponse() is iterated over by your WSGI server, and eventually once the file has been fully read, your WSGI server calls close() on it, dropping the reference count to 0, at which point the file system will clean the file up entirely
It is at that last point that the file is no longer accessible. In the mean time your file is completely safe and there is no way for it to be overwritten in memory or otherwise.
That being said, if FileResponse() was lazy loaded for example (i.e. it wouldn't open() the file until the WSGI server started sending the response), it would be entirely possible that it would attempt to open() the temporary file too late, and NamedTemporaryFile() would have already deleted the file. Just something to keep in mind.
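The behaviour described above (an unlinked file stays readable through a descriptor that is already open) can be observed with a few lines of standard-library code. This is a sketch assuming a POSIX system; the function name read_after_unlink is made up for the demonstration and has nothing to do with Pyramid itself.

```python
import os
import tempfile


def read_after_unlink(payload):
    """Write `payload` to a temp file, unlink it while open, then read it back."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w+b") as f:
        f.write(payload)
        f.flush()
        os.unlink(path)  # remove the directory entry while the file is open
        name_is_gone = not os.path.exists(path)
        links = os.fstat(f.fileno()).st_nlink  # link count after the unlink
        f.seek(0)
        data = f.read()  # the open descriptor still reaches the data
    return name_is_gone, links, data
```

Calling read_after_unlink(b"some bytes") returns the payload unchanged even though the path no longer exists, which is the same situation FileResponse is in once the with block has exited.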

Just generate path for a temporary file

As tempfile.mktemp is deprecated in Python 2.7, I generate a unique path to a temporary file as follows:
temp = tempfile.NamedTemporaryFile(suffix=".py")
path_to_generated_py = temp.name
temp.close()
# now I use path_to_generated_py to create a python file
Is this the recommended way in Python 2.7? As I close the temp file immediately it looks like misusing NamedTemporaryFile....
The direct replacement for tempfile.mktemp() is tempfile.mkstemp(). The latter creates the file, like NamedTemporaryFile, so you must close it (as in your code snippet). The difference with NamedTemporaryFile is that the file is not deleted when closed. This is actually required: your version has a theoretical race condition where two processes might end up with the same temporary file name. If you use mkstemp() instead, the file is never deleted, and will likely be overwritten by the 3rd-party library you use --- but at any point in time, the file exists, and so there is no risk that another process would create a temporary file of the same name.
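A minimal sketch of the mkstemp() variant, assuming all you need is a reserved path to hand to other code; the function name reserve_temp_py_path is hypothetical.

```python
import os
import tempfile


def reserve_temp_py_path():
    """Create an empty, uniquely named .py file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".py")
    # mkstemp returns an open OS-level descriptor; close it so other code
    # can write to the path. The file itself stays on disk.
    os.close(fd)
    return path
```

Because nothing deletes the file automatically, the caller is responsible for calling os.remove() on the returned path when it is no longer needed.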

Delete file from zipfile with the ZipFile Module

The only way I came up with for deleting a file from a zipfile was to create a temporary zipfile without the file to be deleted and then rename it to the original filename.
In python 2.4 the ZipInfo class had an attribute file_offset, so it was possible to create a second zip file and copy the data to the other file without decompressing/recompressing.
This file_offset is missing in python 2.6, so is there another option than creating another zipfile by uncompressing every file and then recompressing it again?
Is there maybe a direct way of deleting a file in the zipfile? I searched and didn't find anything.
The following snippet worked for me (deletes all *.exe files from a Zip archive):
zin = zipfile.ZipFile('archive.zip', 'r')
zout = zipfile.ZipFile('archive_new.zip', 'w')
for item in zin.infolist():
    buffer = zin.read(item.filename)
    if not item.filename.endswith('.exe'):
        zout.writestr(item, buffer)
zout.close()
zin.close()
If you read everything into memory, you can eliminate the need for a second file. However, this snippet recompresses everything.
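The in-memory variant mentioned above could be sketched as follows. It still decompresses and recompresses every kept member, and it assumes the whole archive fits in memory; the function name remove_from_zip and its should_remove callback are illustrative, not part of the zipfile module.

```python
import io
import zipfile


def remove_from_zip(zip_path, should_remove):
    """Rewrite `zip_path`, dropping members for which should_remove(name) is true."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(zip_path, "r") as zin, \
            zipfile.ZipFile(buffer, "w") as zout:
        for item in zin.infolist():
            if not should_remove(item.filename):
                # Passing the ZipInfo keeps the member's name, metadata
                # and compression type.
                zout.writestr(item, zin.read(item.filename))
    # Only overwrite the original once the new archive is complete.
    with open(zip_path, "wb") as f:
        f.write(buffer.getvalue())
```

For example, remove_from_zip('archive.zip', lambda name: name.endswith('.exe')) would drop the same members as the snippet above without creating a second file on disk.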
After closer inspection the ZipInfo.header_offset is the offset from the file start. The name is misleading, but the main Zip header is actually stored at the end of the file. My hex editor confirms this.
So the problem you'll run into is the following: You need to delete the directory entry in the main header as well or it will point to a file that doesn't exist anymore. Leaving the main header intact might work if you keep the local header of the file you're deleting as well, but I'm not sure about that. How did you do it with the old module?
Without modifying the main header I get an error "missing X bytes in zipfile" when I open it. This might help you to find out how to modify the main header.
Not very elegant but this is how I did it:
import subprocess
import zipfile
z = zipfile.ZipFile(zip_filename)
files_to_del = [f for f in z.namelist() if f.endswith('exe')]
cmd=['zip', '-d', zip_filename] + files_to_del
subprocess.check_call(cmd)
# reload the modified archive
z = zipfile.ZipFile(zip_filename)
The routine delete_from_zip_file from ruamel.std.zipfile¹ allows you to delete a file based on its full path within the ZIP, or based on (re) patterns. E.g. you can delete all of the .exe files from test.zip using
from ruamel.std.zipfile import delete_from_zip_file
delete_from_zip_file('test.zip', pattern='.*.exe')
(please note the dot before the *).
This works similar to mdm's solution (including the need for recompression), but recreates the ZIP file in memory (using the class InMemZipFile()), overwriting the old file after it is fully read.
¹ Disclaimer: I am the author of that package.
Based on Elias Zamaria comment to the question.
Having read through Python issue #51067, I want to give an update on it.
As of today, a solution already exists, though it has not been accepted into Python because the author has not signed the Contributor Agreement.
Nevertheless, you can take the code from https://github.com/python/cpython/blob/659eb048cc9cac73c46349eb29845bc5cd630f09/Lib/zipfile.py and create a separate file from it. After that just reference it from your project instead of built-in python library: import myproject.zipfile as zipfile.
Usage:
with zipfile.ZipFile("archive.zip", "a") as z:
    z.remove("firstfile.txt")
I believe it will be included in future python versions. For me it works like a charm for given use case.
