How to check that a file is fully saved to disk during TCP sending? - python

I send many files over TCP from a PC (Windows) to a server (Linux).
When I process the files on the server I sometimes get an error, because a file is corrupted or has zero size: it is still being saved to disk.
I process the files in Python, grabbing them like this:
from glob import glob
import os

file_list = sorted(glob('*.bin'))
for file in file_list:
    file_size = os.path.getsize(file)
    if file_size > min_file_size:
        do_process(file)
How do I do this properly, i.e. make sure that the file is complete?
I can't choose the right min_file_size, since the files have different sizes.
Maybe I should copy them to another folder and then process them?
** I'm using SCP to copy the files. So, on the server side, how can I be sure (some Linux hints) that a file is complete before moving it to the directory that will be processed? Sometimes when typing ls I see files which are not fully sent yet... so how can I rename them?

You can use the fuser command to check whether the file is currently being accessed by any process, as follows:
import subprocess
...
file_list = sorted(glob('*.bin'))
for file in file_list:
    result = subprocess.run(['fuser', '--silent', file])
    if result.returncode != 0:
        do_process(file)
The fuser command terminates with a non-zero return code if no process is accessing the file.

This has nothing to do with TCP. You are basically asking how to synchronize two processes so that if one writes a file, the other only uses it once it has been completely written and closed.
One common way is to let the first process (the writer) use a temporary file name which the second process (the reader) does not expect, and to rename the file to the expected name after it has been closed. Other ways involve file locking. You can also set up communication between the two processes (like a pipe or socketpair) to explicitly inform the reader once the writer has finished and which file was written.

Related

Writing and Reading virtual files on Windows

As part of a Python project on Windows, I need to use small files as a means to communicate different processes. Since the external process must be called with subprocess.run(program, args...), I can't simply obtain a file descriptor for my file and pass it as a parameter to the external process. Instead, I need a file with a name which can be accessed from the normal filesystem. Thus, I would like a simple way to create a temporary file which is stored in memory (RAM) instead of disk, and which has a name other external processes can use to access it. In Linux, this can be achieved with the function os.mkfifo(). However, this function is not available in Windows.
At the moment, I am simply using the tempfile module to create a temporary file which is stored in disk and deleted once it is no longer needed. Here is a small reproducible example for my code:
import tempfile
import subprocess
import os
fd = tempfile.NamedTemporaryFile(mode="w+t", delete=False) # Create a named temporary file (a file object)
file_path = fd.name # Obtain file path
# Write data (encoded as str) to the file and close it
fd.write(data)
fd.close()
# Access this same file from an external process
output = subprocess.run( [program, file_path], stdout=subprocess.PIPE).stdout.decode('utf-8')
# Delete the file
os.remove(file_path)
print("External process output:", output)
So, my question is the following: How can I change this code so that in line fd.write(data) the data is written to RAM instead of disk?

Windows - file opened by another process, still can rename it in Python

On Windows OS, just before doing some actions on my file, I need to know if it's in use by another process. After some serious research over all the other questions with a similar problem, I wasn't able to find a working solution for it.
os.rename("my_file.csv", "my_file.csv") still works even if I have the file open with... Notepad, let's say.
psutil ... it takes too much time and it doesn't work (it can't find my file path in nt.path):
for proc in psutil.process_iter():
    try:
        flist = proc.open_files()
        if flist:
            for nt in flist:
                if my_file_path == nt.path:
                    print("it's here")
    except psutil.NoSuchProcess as err:
        print(err)
Is there any other solution for this?
UPDATE 1
I have to do 2 actions on this file: 1. check if the filename corresponds to a pattern; 2. copy it over SFTP.
UPDATE 2 + solution
Thanks to @Eryk Sun, I found out that Notepad "reads the contents into memory and then closes the handle". After opening my file with Word instead, os.rename and psutil work like a (py)charm.
If the program you use opens the file by importing it (like Excel would, for example), it transforms your data into a readable form for itself without keeping a handle on the actual file afterwards. If you save the file from there, it either saves it in the program's own format or exports (and transforms) the file back.
What do you want to do with the file? Maybe you can simply copy it?

Is it safe to download a NamedTemporaryFile from a pyramid FileResponse?

I'm currently working on an export feature for a web application using Pyramid on Python and running on Ubuntu 14.04. It zips the files into a NamedTemporaryFile and sends it back through a FileResponse:
# Create the temporary file to store the zip
with NamedTemporaryFile(delete=True) as output:
    map_zip = zipfile.ZipFile(output, 'w', zipfile.ZIP_DEFLATED)
    length_mapdir = len(map_directory)
    for root, dirs, files in os.walk(map_directory, followlinks=True):
        for file in files:
            file_path = os.path.join(root, file)
            map_zip.write(file_path, file_path[length_mapdir:])
    map_zip.close()

    # Send the response as an attachment to let the user download the file
    response = FileResponse(os.path.abspath(output.name))
    response.headers['Content-Type'] = 'application/download'
    response.headers['Content-Disposition'] = 'attachment; filename="' + filename + '"'
    return response
On the client's side, the export takes some time then the file download popup appears, nothing goes wrong and everything is in the zip as planned.
While the file is zipping, I can see a file taking up more and more size in /tmp/, and before the download popup appears, the file disappears. I assume this is the NamedTemporaryFile.
While the file is being zipped or downloaded, there isn't any significant change in the amount of RAM being used; it stays around 40 MB while the actual zip is over 800 MB.
Where is pyramid downloading the file from? From what I understand of tempfile, it is unlinked when it is closed. If that's true, is it possible another process could write on the memory where the file was stored, corrupting whatever pyramid is downloading?
In Unix environments, something called reference counting is used when a file is created and opened. For each open() call on a file, the reference count is increased; for each close(), it is decreased. unlink() is special in that, when it is called, the file is unlinked from the directory tree but remains on disk as long as the reference count stays above 0.
In your case NamedTemporaryFile() creates a file on disk named /tmp/somefile
/tmp/somefile now has a link count of 1
/tmp/somefile then has open() called on it, so that it can return the file to you, this increases the reference count to 1
/tmp/somefile is then written to by your code, in this case a zip file
/tmp/somefile is then passed to FileResponse() which then has open() called on it, increasing the reference count to 2
You exit the scope of the with statement, and NamedTemporaryFile() calls close() followed by unlink(). Your file now has 1 reference to it, and a link count of 0. Due to the reference still existing, the file still exists on disk, but can no longer be seen when searching for it.
FileResponse() is iterated over by your WSGI server, and eventually once the file has been fully read, your WSGI server calls close() on it, dropping the reference count to 0, at which point the file system will clean the file up entirely
It is at that last point that the file is no longer accessible. In the mean time your file is completely safe and there is no way for it to be overwritten in memory or otherwise.
That being said, if FileResponse() was lazy loaded for example (i.e. it wouldn't open() the file until the WSGI server started sending the response), it would be entirely possible that it would attempt to open() the temporary file too late, and NamedTemporaryFile() would have already deleted the file. Just something to keep in mind.
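The walkthrough above can be demonstrated directly (a POSIX-only sketch; the second handle stands in for the one FileResponse() holds):

```python
import os
import tempfile

# After unlink() the name is gone from the directory tree, but an
# already-open handle still reads the data until it is closed.
f = tempfile.NamedTemporaryFile(mode="w+", delete=False)
f.write("payload")
f.flush()

reader = open(f.name)            # a second handle, like FileResponse's open()
os.unlink(f.name)                # link count drops to 0; name disappears
print(os.path.exists(f.name))    # False
print(reader.read())             # 'payload' -- data survives until close()
```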

Check if an open file has been deleted after open in python

Is it possible to check if a file has been deleted or recreated in python?
For example, if you did a open("file") in the script, and then while that file is still open, you do rm file; touch file;, then the script will still hold a reference to the old file even though it's already been deleted.
You should fstat the file descriptor for the opened file.
>>> import os
>>> f = open("testdv.py")
>>> os.fstat(f.fileno())
posix.stat_result(st_mode=33188, st_ino=1508053, st_dev=65027L, st_nlink=1, st_uid=1000, st_gid=1000, st_size=1107, st_atime=1349180541, st_mtime=1349180540, st_ctime=1349180540)
>>> os.fstat(f.fileno()).st_nlink
1
Ok, this file has one link, so one name in the filesystem. Now remove it:
>>> os.unlink("testdv.py")
>>> os.fstat(f.fileno()).st_nlink
0
No more links, so we have an "anonymous file" that's only kept alive as long as we have it open. Creating a new file with the same name has no effect on the old file:
>>> g = open("testdv.py", "w")
>>> os.fstat(g.fileno()).st_nlink
1
>>> os.fstat(f.fileno()).st_nlink
0
Of course, st_nlink can sometimes be >1 initially, so checking that for zero is not entirely reliable (though in a controlled setting, it might be good enough). Instead, you can verify whether the file at the path you initially opened is the same one that you have a file descriptor for by comparing stat results:
>>> os.stat("testdv.py") == os.fstat(f.fileno())
False
>>> os.stat("testdv.py") == os.fstat(g.fileno())
True
(And if you want this to be 100% correct, then you should compare only the st_dev and st_ino fields on stat results, since the other fields and st_atime in particular might change in between the calls.)
Yes. Use the os.stat() function: if it raises FileNotFoundError, someone deleted the file. (A zero length alone is not reliable, since a recreated file also starts out empty.)
Alternatively, you can open+write+close the file each time you need to write something into it. The drawback is that opening a file is a pretty slow operation, so this is out of the question if you need to write a lot of data.
Why? Because the new file isn't the file that you're holding open. In a nutshell, Unix filesystems have two levels. One is the directory entry (i.e. the file name, file size, modification time, pointer to the data) and the second level is the file data.
When you open a file, Unix uses the name to find the file data. After that, it operates only on the second level - changes to the directory entry have no effect on any open "file handles". Which is exactly why you can delete the directory entry: Your program isn't using it.
When you use os.stat(), you don't look at the file data but at the directory entry again.
On the positive side, this allows you to create files which no one can see but your program: Open the file, delete it and then use it. Since there is no directory entry for the file, no other program can access the data.
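The open-then-delete trick looks like this (a sketch; "scratch.tmp" is just a throwaway name in the current directory):

```python
import os

# A "private" scratch file: open it, unlink it, keep using the handle.
f = open("scratch.tmp", "w+")
os.unlink("scratch.tmp")   # directory entry removed; data lives until close()
f.write("secret")
f.seek(0)
print(f.read())            # 'secret' -- usable here, invisible to other programs
f.close()
```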
On the negative side, you can't easily solve problems like the one you have.
Yes -- you can use the inotify facility to check for file changes and more. There is also a Python binding for it. Using inotify you can watch files or directories for filesystem activity. From the manual, the following events can be detected:
IN_ACCESS File was accessed (read) (*).
IN_ATTRIB Metadata changed, e.g., permissions, timestamps, extended attributes, link count (since Linux 2.6.25), UID, GID, etc. (*).
IN_CLOSE_WRITE File opened for writing was closed (*).
IN_CLOSE_NOWRITE File not opened for writing was closed (*).
IN_CREATE File/directory created in watched directory (*).
IN_DELETE File/directory deleted from watched directory (*).
IN_DELETE_SELF Watched file/directory was itself deleted.
IN_MODIFY File was modified (*).
IN_MOVE_SELF Watched file/directory was itself moved.
IN_MOVED_FROM File moved out of watched directory (*).
IN_MOVED_TO File moved into watched directory (*).
IN_OPEN File was opened (*).
From here you can google a solution yourself, but I think you get the overall idea. Of course, this may only work on Linux, but from your question I assume you are using it (given the references to rm and touch).

Python programming - Windows focus and program process

I'm working on a python program that will automatically combine sets of files based on their names.
Being a newbie, I wasn't quite sure how to go about it, so I decided to just brute-force it with the win32api.
So I'm attempting to do everything with virtual keys. I run the script, it selects the top file (after arranging them by name), then sends a right-click command, selects 'Combine as Adobe PDF', and presses Enter. This launches the Acrobat combine window, where I send another 'Enter' command. Here's where I hit the problem.
The folder where I'm converting these things loses focus and I'm unsure how to get it back. Sending Alt+Tab commands seems somewhat unreliable; it sometimes switches to the wrong thing.
A much bigger issue for me: different combinations of files take different times to combine. Though I haven't gotten this far in my code, my plan was to set some arbitrarily long time.sleep() before finally sending the last 'Enter' to confirm the file name and complete the combination. Is there a way to monitor another program's progress? Is there a way to have Python not execute any more code until something else has finished?
I would suggest using a command-line tool like pdftk http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/ - it does exactly what you want, it's cross-platform, it's free, and it's a small download.
You can easily call it from python with (for example) subprocess.Popen
Edit: sample code as below:
import subprocess
import os

def combine_pdfs(infiles, outfile, basedir=''):
    """
    Accept a list of pdf filenames,
    merge the files,
    save the result as outfile

    @param infiles: list of string, names of PDF files to combine
    @param outfile: string, name of merged PDF file to create
    @param basedir: string, base directory for PDFs (if filenames are not absolute)
    """
    # From the pdftk documentation:
    # Merge Two or More PDFs into a New Document:
    #   pdftk 1.pdf 2.pdf 3.pdf cat output 123.pdf
    if basedir:
        infiles = [os.path.join(basedir, i) for i in infiles]
    outfile = [os.path.join(basedir, outfile)]
    pdftk = [r'C:\Program Files (x86)\Pdftk\pdftk.exe']  # or wherever you installed it
    op = ['cat']
    outcmd = ['output']
    args = pdftk + infiles + op + outcmd + outfile
    res = subprocess.call(args)

combine_pdfs(
    ['p1.pdf', 'p2.pdf'],
    'p_total.pdf',
    'C:\\Users\\Me\\Downloads'
)
