I'm currently working on an export feature for a web application using Pyramid on Python and running on Ubuntu 14.04. It zips the files into a NamedTemporaryFile and sends it back through a FileResponse:
# Create the temporary file to store the zip
with NamedTemporaryFile(delete=True) as output:
    map_zip = zipfile.ZipFile(output, 'w', zipfile.ZIP_DEFLATED)
    length_mapdir = len(map_directory)
    for root, dirs, files in os.walk(map_directory, followlinks=True):
        for file in files:
            file_path = os.path.join(root, file)
            map_zip.write(file_path, file_path[length_mapdir:])
    map_zip.close()
    # Send the response as an attachment to let the user download the file
    response = FileResponse(os.path.abspath(output.name))
    response.headers['Content-Type'] = 'application/download'
    response.headers['Content-Disposition'] = 'attachment; filename="' + filename + '"'
    return response
On the client's side, the export takes some time, then the file download popup appears. Nothing goes wrong and everything is in the zip as planned.
While the file is zipping, I can see a file growing in /tmp/, and before the download popup appears, the file disappears. I assume this is the NamedTemporaryFile.
While the file is being zipped or downloaded, there isn't any significant change in the amount of RAM being used: it stays around 40 MB while the actual zip is over 800 MB.
Where is Pyramid downloading the file from? From what I understand of tempfile, the file is unlinked when it is closed. If that's true, is it possible that another process could write to the memory where the file was stored, corrupting whatever Pyramid is downloading?
In Unix environments, something called reference counting is used when a file is created and opened. For each open() call on a file the reference count is increased, and for each close() it is decreased. unlink() is special in that when it is called the file is unlinked from the directory tree, but it will remain on disk so long as the reference count stays above 0.
In your case:
1. NamedTemporaryFile() creates a file on disk named /tmp/somefile.
2. /tmp/somefile now has a link count of 1.
3. /tmp/somefile then has open() called on it, so that it can return the file to you; this increases the reference count to 1.
4. /tmp/somefile is then written to by your code, in this case a zip file.
5. /tmp/somefile is then passed to FileResponse(), which calls open() on it, increasing the reference count to 2.
6. You exit the scope of the with statement, and NamedTemporaryFile() calls close() followed by unlink(). Your file now has 1 reference to it and a link count of 0. Because the reference still exists, the file still exists on disk, but it can no longer be seen when searching for it.
7. FileResponse() is iterated over by your WSGI server, and once the file has been fully read, your WSGI server calls close() on it, dropping the reference count to 0, at which point the file system cleans the file up entirely.
It is at that last point that the file is no longer accessible. In the meantime your file is completely safe and there is no way for it to be overwritten in memory or otherwise.
That being said, if FileResponse() were lazily loaded (i.e. it didn't open() the file until the WSGI server started sending the response), it would be entirely possible for it to attempt to open() the temporary file too late, after NamedTemporaryFile() had already deleted it. Just something to keep in mind.
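The sequence described above can be reproduced in a few lines. This is only a sketch for a POSIX system (Windows behaves differently): the second open() stands in for the one FileResponse performs, and the function name and payload are made up for illustration.

```python
import os
import tempfile


def read_after_unlink(payload=b"zip bytes would go here"):
    """Show that an open handle outlives the file's directory entry (POSIX only)."""
    with tempfile.NamedTemporaryFile(delete=True) as output:
        output.write(payload)
        output.flush()
        path = output.name
        # Second open(), standing in for the one FileResponse performs.
        reader = open(path, "rb")
    # Leaving the with block closed and unlinked the temporary file.
    name_still_visible = os.path.exists(path)
    links = os.fstat(reader.fileno()).st_nlink
    try:
        data = reader.read()
    finally:
        # Last reference dropped here; only now can the storage be reclaimed.
        reader.close()
    return name_still_visible, links, data
```

On a typical Linux filesystem the name should already be gone after the with block, while the data is still fully readable through the second handle.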
I send many files over TCP from a PC (Windows) to a server (Linux).
When I process the files on the server I sometimes get an error because a file is corrupted or has zero size, since it is still being saved to the hard disk.
I process the files in Python, grabbing them like this:
file_list = sorted(glob('*.bin'))
for file in file_list:
    file_size = os.path.getsize(file)
    if file_size > min_file_size:
        do_process(file)
How do I do this properly, i.e. make sure that a file is complete?
I can't choose the right min_file_size, since the files have different sizes.
Maybe I should copy the files to another folder and then process them?
I'm using SCP to copy the files. On the server side, how can I be sure (any Linux hints?) that a file is complete, so I can move it to the directory where it will be processed? Sometimes when I type ls I see files which are not fully sent yet, so how can I rename them?
You can use the fuser command to check whether the file is currently being accessed by any process, as follows:
import subprocess
...
file_list = sorted(glob('*.bin'))
for file in file_list:
    result = subprocess.run(['fuser', '--silent', file])
    if result.returncode != 0:
        do_process(file)
The fuser command will terminate with a non-0 return code if the file is not being accessed.
This has nothing to do with TCP. You are basically asking how to synchronize two processes in a way that if one writes the file the other will only use it once it is completely written and closed by the other.
One common way is to let the first process (writer) use a temporary file name which is not expected by the second process (reader) and to rename the file to the expected one after the file was closed. Other ways involve using file locking. One can also have a communication between the two processes (like a pipe or socketpair) which is used to explicitly inform the reader once the writer has finished and which file was written.
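The temporary-name approach can be sketched as follows. This assumes you control the writing side (with SCP, that would mean uploading under a different name and renaming afterwards). The ".part" suffix and both function names are hypothetical, chosen only so that the reader's '*.bin' pattern never matches an unfinished file.

```python
import os
from glob import glob


def write_atomically(final_path, data):
    """Write under a temporary name, then rename once the data is complete."""
    tmp_path = final_path + ".part"  # hypothetical suffix the reader ignores
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before publishing the name
    # On POSIX, a rename within one filesystem is atomic.
    os.replace(tmp_path, final_path)


def ready_files(directory):
    """Reader side: only fully written files carry the .bin name."""
    return sorted(glob(os.path.join(directory, "*.bin")))
```

Both names must live on the same filesystem for the rename to be atomic, which is why the temporary file is created next to the final one rather than in /tmp.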
I've had a script for a while that has been running without issues; however, I recently hit a "hitch" with a temporary file that was within a directory.
The file in question had a name starting with '~$' on a Windows PC, so the script was erroring out on it as it is not a proper DOCX file. The file in question was not open, and the problem occurred after it was transferred off a network drive onto an external hard drive. Checking the destination drive (with hidden files shown) did not show this file either.
I have attempted a quick fix of:
for (dirpath, dirnames, filenames) in os.walk('.'):
    for filename in filenames:
        if filename.endswith('.docx'):
            filesList.append(os.path.join(dirpath, filename))
for file in filesList:
    if file.startswith('~$'):
        pass
    else:
        <rest of script>
However, the script appears to ignore this check, proceeds, and then errors out again because the file is not "valid".
Does anyone know either why this isn't working or a quick way to get it to ignore any files like this? I would attempt an "if exists" check, but the file technically does exist, so that wouldn't work either.
Sorry if it's a bit stupid, but I am a bit stumped as to A. why it's there and B. how to code around it.
In the second code block, your variable file contains the whole file path, not just the file name, so file.startswith('~$') never matches.
Skip the "bad" files in your first block instead of appending them to the list:
for (dirpath, dirnames, filenames) in os.walk('.'):
    for filename in filenames:
        if filename.endswith('.docx'):
            if not filename.startswith('~$'):
                filesList.append(os.path.join(dirpath, filename))
The other option would be to check os.path.basename(file) in your second code block.
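That second option might look like the sketch below. The helper name and the sample paths are made up for illustration; the point is only that the prefix test runs on the last path component rather than on the joined path.

```python
import os


def is_temp_docx(path):
    """True if the file name itself (not the full path) starts with '~$'."""
    return os.path.basename(path).startswith("~$")


# Hypothetical paths shaped like the ones os.walk('.') would produce.
files_list = [
    os.path.join(".", "docs", "report.docx"),
    os.path.join(".", "docs", "~$report.docx"),
]
valid = [p for p in files_list if not is_temp_docx(p)]
```

Note that neither full path starts with '~$', which is why the original check in the second loop never skipped anything.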
As tempfile.mktemp is deprecated in Python 2.7, I generate a unique path to a temporary file as follows:
temp = tempfile.NamedTemporaryFile(suffix=".py")
path_to_generated_py = temp.name
temp.close()
# now I use path_to_generated_py to create a python file
Is this the recommended way in Python 2.7? As I close the temp file immediately it looks like misusing NamedTemporaryFile....
The direct replacement for tempfile.mktemp() is tempfile.mkstemp(). The latter creates the file, like NamedTemporaryFile, so you must close it (as in your code snippet). The difference with NamedTemporaryFile is that the file is not deleted when closed. This is actually required: your version has a theoretical race condition where two processes might end up with the same temporary file name. If you use mkstemp() instead, the file is never deleted, and will likely be overwritten by the 3rd-party library you use --- but at any point in time, the file exists, and so there is no risk that another process would create a temporary file of the same name.
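A minimal sketch of the mkstemp() variant follows. The helper name and the generated file content are assumptions for illustration; in the question, a third-party library would write the file instead.

```python
import os
import tempfile


def make_temp_py_path():
    """Create an empty .py file with a unique name and return its path."""
    fd, path = tempfile.mkstemp(suffix=".py")
    os.close(fd)  # mkstemp returns an open OS-level descriptor; close it ourselves
    return path


path_to_generated_py = make_temp_py_path()
try:
    # Stand-in for whatever actually generates the Python file.
    with open(path_to_generated_py, "w") as f:
        f.write("print('generated')\n")
finally:
    os.remove(path_to_generated_py)  # mkstemp never deletes the file for us
```

Because the file exists from the moment mkstemp() returns until the explicit os.remove(), no other process calling mkstemp() can be handed the same name in between.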
Is it possible to check if a file has been deleted or recreated in Python?
For example, if you did an open("file") in the script, and then while that file is still open you do rm file; touch file;, the script will still hold a reference to the old file even though it's already been deleted.
You should fstat the file descriptor for the opened file.
>>> import os
>>> f = open("testdv.py")
>>> os.fstat(f.fileno())
posix.stat_result(st_mode=33188, st_ino=1508053, st_dev=65027L, st_nlink=1, st_uid=1000, st_gid=1000, st_size=1107, st_atime=1349180541, st_mtime=1349180540, st_ctime=1349180540)
>>> os.fstat(f.fileno()).st_nlink
1
Ok, this file has one link, so one name in the filesystem. Now remove it:
>>> os.unlink("testdv.py")
>>> os.fstat(f.fileno()).st_nlink
0
No more links, so we have an "anonymous file" that's only kept alive as long as we have it open. Creating a new file with the same name has no effect on the old file:
>>> g = open("testdv.py", "w")
>>> os.fstat(g.fileno()).st_nlink
1
>>> os.fstat(f.fileno()).st_nlink
0
Of course, st_nlink can sometimes be >1 initially, so checking that for zero is not entirely reliable (though in a controlled setting, it might be good enough). Instead, you can verify whether the file at the path you initially opened is the same one that you have a file descriptor for by comparing stat results:
>>> os.stat("testdv.py") == os.fstat(f.fileno())
False
>>> os.stat("testdv.py") == os.fstat(g.fileno())
True
(And if you want this to be 100% correct, then you should compare only the st_dev and st_ino fields on stat results, since the other fields and st_atime in particular might change in between the calls.)
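The device-and-inode comparison can be wrapped in a small helper. This is a sketch in Python 3 syntax with a made-up function name; os.path.samestat() is a standard-library function that compares only st_dev and st_ino.

```python
import os


def path_still_points_to(path, fileobj):
    """True if path currently names the same file that fileobj has open."""
    try:
        on_disk = os.stat(path)
    except FileNotFoundError:
        return False  # the name is gone entirely
    opened = os.fstat(fileobj.fileno())
    # samestat ignores timestamps and sizes; it checks device and inode only.
    return os.path.samestat(on_disk, opened)
```

A script that keeps a log file open could call this periodically and reopen the path when it returns False.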
Yes. Use the os.stat() function to check the file length. If the length is zero (or the function returns the error "File not found"), then someone deleted the file.
Alternatively, you can open+write+close the file each time you need to write something into it. The drawback is that opening a file is a pretty slow operation, so this is out of the question if you need to write a lot of data.
Why? Because the new file isn't the file that you're holding open. In a nutshell, Unix filesystems have two levels. One is the directory entry (i.e. the file name, file size, modification time, pointer to the data) and the second level is the file data.
When you open a file, Unix uses the name to find the file data. After that, it operates only on the second level - changes to the directory entry have no effect on any open "file handles". Which is exactly why you can delete the directory entry: Your program isn't using it.
When you use os.stat(), you don't look at the file data but at the directory entry again.
On the positive side, this allows you to create files which no one can see but your program: Open the file, delete it and then use it. Since there is no directory entry for the file, no other program can access the data.
On the negative side, you can't easily solve problems like the one you have.
Yes -- you can use the inotify facility to check for file changes and more. There is also a Python binding for it. Using inotify you can watch files or directories for filesystem activity. From the manual, the following events can be detected:
IN_ACCESS File was accessed (read) (*).
IN_ATTRIB Metadata changed, e.g., permissions, timestamps, extended attributes, link count (since Linux 2.6.25), UID, GID, etc. (*).
IN_CLOSE_WRITE File opened for writing was closed (*).
IN_CLOSE_NOWRITE File not opened for writing was closed (*).
IN_CREATE File/directory created in watched directory (*).
IN_DELETE File/directory deleted from watched directory (*).
IN_DELETE_SELF Watched file/directory was itself deleted.
IN_MODIFY File was modified (*).
IN_MOVE_SELF Watched file/directory was itself moved.
IN_MOVED_FROM File moved out of watched directory (*).
IN_MOVED_TO File moved into watched directory (*).
IN_OPEN File was opened (*).
From here you can google yourself a solution, but I think you get the overall idea. Of course this may only work on Linux, but from your question I assume you are using it (references to rm and touch).
Is this usage of Python's tempfile.NamedTemporaryFile secure (i.e. free of the security issues of the deprecated tempfile.mktemp)?
def mktemp2():
    """Create and close an empty temporary file.
    Return the temporary filename"""
    tf = tempfile.NamedTemporaryFile(delete=False)
    tfilename = tf.name
    tf.close()
    return tfilename
outfilename = mktemp2()
subprocess.call(['program_name','-o',outfilename])
I need to run an external command that requires an output file name as one of its arguments. It overwrites outfilename without warning if that file exists. I want to use a temporary file because I just need to read its content; I don't need it later.
Totally unsafe. There is an opportunity for an attacker to create the file with whatever permissions they like (or a symlink) with that name between when it is deleted and opened by the subprocess.
If you can instead create the file in a directory other than /tmp that is owned by your process and only readable/writable by it, you don't need to concern yourself with the security of the file, as anything in the directory is protected.
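One way to get such a directory is tempfile.mkdtemp(), which creates a directory that is readable, writable, and searchable only by the creating user. The sketch below is an illustration, not a drop-in replacement: the function name, the "output" file name, and the make_command callback (which builds the argument list for the external program) are all assumptions.

```python
import os
import shutil
import subprocess
import tempfile


def run_with_private_output(make_command):
    """Run a command that writes its output inside a directory only we can access."""
    workdir = tempfile.mkdtemp()  # private to the creating user
    try:
        outfilename = os.path.join(workdir, "output")  # hypothetical file name
        # make_command turns the output path into the full argument list.
        subprocess.check_call(make_command(outfilename))
        with open(outfilename, "rb") as f:
            return f.read()
    finally:
        shutil.rmtree(workdir)  # remove the directory and everything in it
```

For the command in the question, the call would be run_with_private_output(lambda out: ['program_name', '-o', out]).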