File copy completion? - python

In Linux, how can we know if a file has completed copying before reading it? In Windows, an OSError is raised.

You can use the inotify mechanism (via pyinotify) to catch events like CREATE, WRITE and CLOSE, and based on them you can infer whether the copy has finished or not.
However, since you provided no details on what you are trying to do, I can't tell whether inotify would be suitable for you (btw, inotify is Linux-specific, so you can't use it on Windows or other platforms).

In Linux, you can open a file while another process is writing to it without Python throwing an OSError, so in general, you cannot know for sure whether the other side has finished writing into that file. You can try some hacks, though:
You can check the file size regularly to see whether it increased since the last check. If it hasn't increased in, say, five seconds, you might be safe to assume that the copy has finished. I'm saying might since this is not true in all circumstances. If the other process that is writing the file is blocked for whatever reason, it might temporarily stop writing to the file and resume it later. So this is not 100% fool-proof, but might work for local file copies if the system is never under a heavy load that would stall the writing process.
You can check the output of fuser (this is a shell command), which lists the IDs of all processes that are holding a file handle to a given file name. If this list includes any process other than yours, you can assume that the copying process hasn't finished yet. However, you will have to make sure that fuser is installed on the target system for this to work.
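The size-polling hack can be sketched as follows. This is only a rough illustration: the function name `wait_until_stable` and its parameters are made up for this example, and the five-second quiet period is the same heuristic as above, so it is subject to the same caveat about a stalled writer.

```python
import os
import time


def wait_until_stable(path, quiet_seconds=5.0, poll_interval=1.0, timeout=600.0):
    """Return True once `path` has kept the same size for `quiet_seconds`.

    Returns False if that does not happen within `timeout` seconds.
    This is a heuristic: a writer that stalls longer than `quiet_seconds`
    will be mistaken for a finished copy.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    last_change = time.monotonic()
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            size = -1  # file not there (yet); keep waiting
        now = time.monotonic()
        if size != last_size:
            # Size changed since the last check: restart the quiet period.
            last_size, last_change = size, now
        elif size >= 0 and now - last_change >= quiet_seconds:
            return True
        time.sleep(poll_interval)
    return False
```

A caller would typically do something like `if wait_until_stable("/incoming/big.iso"): process(...)`, where the path and `process` are placeholders.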

Related

closing files of a killed process

python: 3.4
OS: win7 / win10
I want to kill a running process with python and close all the files it opened:
import io
import psutil

for proc in psutil.process_iter():
    if proc.name() == 'myprocess.exe':
        opened = proc.open_files()
        proc.kill()
        for i in opened:
            print(i.path)
            io.FileIO(i.path).close()
            print(io.FileIO(i.path).closed)
Somehow io.IOBase(i.path).close() does not work.
Explanation:
It's like I would like to kill Microsoft Word with python, but it leaves some files open. And I would like to close those files as well.
Microsoft Word is just an example. It is a self-written Python program. The opened files are:
fonts (.ttf)
clr.pyd
and .dll-s
How should I close these files?
You don't need to close any files that were opened by the process. That is done automatically:
Terminating a process has the following results:
Any remaining threads in the process are marked for termination.
Any resources allocated by the process are freed.
All kernel objects are closed.
The process code is removed from memory.
The process exit code is set.
The process object is signaled.
The important bit is "All kernel objects are closed." For every open file handle, there is an associated kernel object--that's actually what a handle is, a mapping from a number to a kernel object. When the process exits, the kernel will walk behind and close all associated file handles, sockets, etc.
Additionally, your original approach has a few problems. First, the list of open files is only a snapshot of which ones were open at that time. In between asking for the list of open files and killing the process, the process could have opened many more, or closed and removed many as well. Second, the Python 3 docs say that the constructor for IOBase isn't public, so using it in this way is wrong:
class io.IOBase
The abstract base class for all I/O classes, acting on streams of bytes. There is no public constructor.
Generally, you'd use something like io.open() which takes the path. This leads to the third issue. All you have to work with is the path. In order to close a file, you really need the handle. Those handles are process-specific. This means in one process, 0x5555AAAA may correspond to "file1.txt", but in another process, it might correspond to "file2.txt" or maybe not even a file at all (it could be a socket or something else). So even if you have the kernel handle, we don't really have a way of saying "close this handle in the context of this other process." That violates some security goals of processes. Also, it means that what you're actually doing here is creating your own handle to only turn around and close it (or in this case, it possibly does nothing at all since the object wasn't created correctly).
So, if you're having a problem with files still being held, perhaps the problem is that the process hadn't actually died yet when you tried to do whatever work you needed to get done. You may need to wait for the process to exit before moving on if there are files the process was using that you want to use again. It looks like you can use psutil.wait_procs() to do that.
Also, on Windows I find that anti-virus tools often get in the way. They hold open files accessed by applications making it look like a process is still holding onto them when it's actually the virus scanner doing its thing. I remember one instance of having to deal with this in Subversion. The code still exists today. So you might need to simply wait a bit and try again.
Update
Microsoft Word is just an example. It is a self-written Python program. The opened files are:
fonts (.ttf)
clr.pyd
and .dll-s
How should I close these files?
The answer is that you shouldn't need to. Just make sure the process has actually exited. It's not an instantaneous operation, so there's some time between killing it and it actually exiting that it still retains the file handles.
Given that you've actually written the process being killed, I think a far better approach would be to introduce a way to launch that process, have it do its work, then exit gracefully. Then use subprocess.run() to run the script and wait for it to exit.
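A minimal sketch of that launch-and-wait idea, assuming the worker is an ordinary command line. The helper name `run_worker` is hypothetical; `subprocess.run()` itself only returns after the child has exited, at which point the OS has already released the child's file handles.

```python
import subprocess
import sys


def run_worker(args, timeout=None):
    """Run a worker command, block until it exits, and return its exit code.

    If `timeout` (seconds) is given and expires, subprocess.run() kills the
    child and raises subprocess.TimeoutExpired.
    """
    completed = subprocess.run(args, timeout=timeout)
    return completed.returncode
```

For example, `run_worker([sys.executable, "myworker.py"])` would run a (hypothetical) `myworker.py` with the current interpreter; any files it used can safely be reopened once the call returns.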
It's like I would like to kill Microsoft Word with python, but it leaves some files open. And I would like to close those files as well.
There is some misunderstanding here. When you terminate Word with kill, all files are closed from the system's point of view, but they are closed dirty. When Word terminates normally, it flushes its internal buffers, removes any temporary files and marks the files as clean. When it crashes or is abruptly terminated, none of that cleanup occurs. Some modifications may not be written to disk, and temp files are still there, so on the next execution Word will warn you that the files were not closed properly and have to be repaired.
So you do not want to kill Microsoft Word, but to close it, meaning posting a WM_CLOSE message to its main window. Unfortunately, there is no clean and neat support in Python for that. There is an example of closing Excel by the win32com module here. The conversion for Word should be (beware: untested):
import win32com.client

wd = win32com.client.Dispatch("Word.Application")
wd.Quit()  # quit Word, as if the user hit the close button or clicked File -> Exit
Take a look at the with statement syntax. There's a brief overview here

When does Python write a file to disk?

I have a library that interacts with a configuration file. When the library is imported, the initialization code reads the configuration file, possibly updates it, and then writes the updated contents back to the file (even if nothing was changed).
Very occasionally, I encounter a problem where the contents of the configuration file simply disappear. Specifically, this happens when I run many invocations of a short script (using the library), back-to-back, thousands of times. It never occurs during the same directories, which leads me to believe it's a somewhat random problem--specifically a race condition with IO.
This is a pain to debug, since I can never reliably reproduce the problem and it only happens on some systems. I have a suspicion about what might happen, but I wanted to see if my picture of file I/O in Python is correct.
So the question is, when does a Python program actually write file contents to a disk? I thought that the contents would make it to disk by the time that the file closed, but then I can't explain this error. When python closes a file, does it flush the contents to the disk itself, or simply queue it up to the filesystem? Is it possible that file contents can be written to disk after Python terminates? And can I avoid this issue by using fp.flush(); os.fsync(fp.fileno()) (where fp is the file handle)?
If it matters, I'm programming on a Unix system (Mac OS X, specifically). Edit: Also, keep in mind that the processes are not running concurrently.
Appendix: Here is the specific race condition that I suspect:
Process #1 is invoked.
Process #1 opens the configuration file in read mode and closes it when finished.
Process #1 opens the configuration file in write mode, erasing all of its contents. The erasing of the contents is synced to the disk.
Process #1 writes the new contents to the file handle and closes it.
Process #1: Upon closing the file, Python tells the OS to queue writing these contents to disk.
Process #1 closes and exits
Process #2 is invoked
Process #2 opens the configuration file in read mode, but new contents aren't synced yet. Process #2 sees an empty file.
The OS finally finishes writing the contents to disk, after process 2 reads the file
Process #2, thinking the file is empty, sets defaults for the configuration file.
Process #2 writes its version of the configuration file to disk, overwriting the last version.
It is almost certainly not Python's fault. If Python closes the file, OR exits cleanly (rather than being killed by a signal), then the OS will have the new contents for the file. Any subsequent open should return the new contents. There must be something more complicated going on. Here are some thoughts.
What you describe sounds more likely to be a filesystem bug than a Python bug, and a filesystem bug is pretty unlikely.
Filesystem bugs are far more likely if your files actually reside in a remote filesystem. Do they?
Do all the processes use the same file? Do "ls -li" on the file to see its inode number, and see if it ever changes. In your scenario, it should not. Is it possible that something is moving files, or moving directories, or deleting directories and recreating them? Are there symlinks involved?
Are you sure that there is no overlap in the running of your programs? Are any of them run from a shell with "&" at the end (i.e. in the background)? That could easily mean that a second one is started before the first one is finished.
Are there any other programs writing to the same file?
This isn't your question, but if you need atomic changes (so that any program running in parallel only sees either the old version or the new one, never the empty file), the way to achieve it is to write the new content to another file (e.g. "foo.tmp"), then do os.rename("foo.tmp", "foo"). Rename is atomic.
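The write-then-rename pattern could look roughly like this. It is a sketch, not a drop-in fix: `atomic_write` is a name invented here, the temporary file is created in the same directory as the target so the rename stays on one filesystem, and `os.replace()` is used in place of `os.rename()` because it also overwrites an existing target on Windows. The `flush()` plus `os.fsync()` calls are the ones mentioned in the question. Note that `tempfile.mkstemp()` creates the file with owner-only permissions, which the final file inherits.

```python
import os
import tempfile


def atomic_write(path, text):
    """Replace the contents of `path` with `text` without exposing a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final rename does not cross filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
            tmp.flush()             # push Python's buffer to the OS
            os.fsync(tmp.fileno())  # ask the OS to push its cache to disk
        os.replace(tmp_path, path)  # readers see the old or the new file, never an empty one
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```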

which inotify event signals the completion of a large file operation?

For large files or slow connections, copying files may take some time.
Using pyinotify, I have been watching for the IN_CREATE event code, but this seems to occur at the start of a file transfer. I need to know when a file is completely copied; it isn't much use if it's only half there.
When a file transfer has finished, what inotify event is fired?
IN_CLOSE probably means the write is complete. This isn't certain, since some applications are bad actors and open and close files constantly while working with them, but if you know the app you're dealing with (file transfer, etc.) and understand its behaviour, you're probably fine. (Note that this doesn't mean the transfer completed successfully; it just means that the process that opened the file handle closed it.)
IN_CLOSE catches both IN_CLOSE_WRITE and IN_CLOSE_NOWRITE, so make your own decision about whether you want to catch just one of those. (You probably want them both: WRITE/NOWRITE reflect whether the file was opened for writing, not whether any writes were actually made.)
There is more documentation (although annoyingly, not this piece of information) in Documentation/filesystems/inotify.txt.
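To see these events without installing anything, here is a rough, Linux-only sketch that talks to the kernel's inotify interface through ctypes rather than through pyinotify (pyinotify reports the same event codes). The helper names `open_watch` and `read_events` are invented for this example; the two mask values are the ones defined in `<sys/inotify.h>`.

```python
import ctypes
import ctypes.util
import os
import struct

# Event masks from <sys/inotify.h>.
IN_CLOSE_WRITE = 0x00000008  # a file that was opened for writing was closed
IN_MOVED_TO = 0x00000080     # a file was renamed into the watched directory

# struct inotify_event header: wd, mask, cookie, length of the name that follows.
_EVENT_HEADER = struct.Struct("iIII")


def open_watch(directory, mask=IN_CLOSE_WRITE | IN_MOVED_TO):
    """Return a non-blocking inotify file descriptor watching `directory`."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    fd = libc.inotify_init()
    if fd < 0:
        raise OSError(ctypes.get_errno(), "inotify_init failed")
    if libc.inotify_add_watch(fd, os.fsencode(directory), mask) < 0:
        err = ctypes.get_errno()
        os.close(fd)
        raise OSError(err, "inotify_add_watch failed")
    os.set_blocking(fd, False)
    return fd


def read_events(fd):
    """Return the (mask, filename) pairs currently queued on `fd`."""
    try:
        data = os.read(fd, 65536)
    except BlockingIOError:
        return []  # nothing queued right now
    events, offset = [], 0
    while offset < len(data):
        _wd, mask, _cookie, length = _EVENT_HEADER.unpack_from(data, offset)
        offset += _EVENT_HEADER.size
        name = data[offset:offset + length].rstrip(b"\0")  # name is NUL-padded
        offset += length
        events.append((mask, os.fsdecode(name)))
    return events
```

A consumer would call `read_events(fd)` in a loop (or wait on the descriptor with `select`) and treat a file as complete when it sees `IN_CLOSE_WRITE` or `IN_MOVED_TO` for it, with the same caveats about badly behaved writers as above.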
For my case I wanted to execute a script after a file was fully uploaded. I was using WinSCP which writes large files with a .filepart extension till done.
I first modified my script to ignore a file if its name ends with .filepart, or if another file with the same name plus a .filepart extension exists in the same directory, since that means the upload is not complete yet.
But then I noticed that at the end of the upload, when all the parts have been finished, an IN_MOVED_TO notification is triggered, which let me run my script exactly when I wanted.
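The first approach (skipping anything that still has a .filepart companion) might be sketched like this. The function name is made up, and it assumes the uploader appends `.filepart` to the full file name (e.g. `video.mp4.filepart`); adjust the suffix if your client behaves differently.

```python
import os

PARTIAL_SUFFIX = ".filepart"  # assumed naming convention of the uploader


def upload_complete(path):
    """Return True if `path` exists and no partial upload for it is in progress."""
    if path.endswith(PARTIAL_SUFFIX):
        return False  # this is the partial file itself
    return os.path.exists(path) and not os.path.exists(path + PARTIAL_SUFFIX)
```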
If you want to know how your file uploader behaves, add this to the incrontab:
/your/directory/ IN_ALL_EVENTS echo "$$ $@ $# $% $&"
and then
tail -F /var/log/cron
and monitor all the events getting triggered to find out which one suits you best.
Good luck!
Why don't you add a dummy file at the end of the transfer? You can use the IN_CLOSE or IN_CREATE event code on the dummy. The important thing is that the dummy has to be transferred as the last file in the sequence.
I hope it'll help.

How do I watch a folder for changes and when changes are done using Python?

I need to watch a folder for incoming files. I did that with the following help:
How do I watch a file for changes?
The problem is that the files being moved are pretty big (10 GB),
and I want to be notified when all files are done moving.
I tried comparing the size of the folder every 20 seconds, but the file shows its correct size even though Windows shows that it is still moving.
I am using Windows with Python.
I found a solution using open and waiting for an IO exception:
if the file is still being moved, I get errno 13.
You should take a look at this link:
http://timgolden.me.uk/python/win32_how_do_i/watch_directory_for_changes.html
There you can see a comparison of the method you are describing (simple polling) with two other Windows-specific techniques which, in my opinion, offer a much better solution to your problem!
Otherwise, if you are using Linux, there's inotify and its Python wrapper:
Pyinotify is a pure Python module used for monitoring filesystems events on Linux platforms through inotify
Here: http://trac.dbzteam.org/pyinotify
If you have control over the process of importing the files, I would create a lock file when starting to copy files in, and remove it when you are done. By lock file I mean an empty temporary file, which is just there to indicate that you are copying a file. Then your Python script can check for the existence of the lock file.
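A small sketch of that lock-file convention, assuming both sides agree on the lock name. The names `copy_in_progress`, `directory_is_ready` and `.copying` are invented for illustration.

```python
import contextlib
import os


@contextlib.contextmanager
def copy_in_progress(directory, lock_name=".copying"):
    """Hold an empty lock file in `directory` for the duration of a copy."""
    lock_path = os.path.join(directory, lock_name)
    # Mode "x" fails with FileExistsError if another copy already holds the lock.
    with open(lock_path, "x"):
        pass
    try:
        yield lock_path
    finally:
        os.remove(lock_path)


def directory_is_ready(directory, lock_name=".copying"):
    """Return True if no copy is currently flagged as running in `directory`."""
    return not os.path.exists(os.path.join(directory, lock_name))
```

The copying side wraps its work in `with copy_in_progress(target_dir): ...`, and the watcher only processes the folder when `directory_is_ready(target_dir)` is true.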
You may be able to use os.stat() to monitor the mtime of the file. However be aware that under various network conditions, the copy may stall momentarily and so the mtime is not updated for a few seconds, so you need to make allowance for this.
Another option is to try opening the file with exclusive read/write access, which should fail under Windows if the file is still opened by the other process.
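As a rough sketch of the open-and-see-if-it-fails idea (the helper name is hypothetical): plain `open()` does not request an exclusive lock, so this only detects the case where the other process opened the file in a way that denies sharing, which is what produces errno 13 on Windows. On Linux the open normally succeeds even while a writer is active, so the check is not meaningful there.

```python
def can_open_for_update(path):
    """Return True if `path` can currently be opened for reading and writing."""
    try:
        with open(path, "r+b"):
            return True
    except OSError:
        # Covers PermissionError (errno 13) as well as a missing file.
        return False
```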
The most reliable method would be to write your own program to move the files.
Try checking for a change in the last-modified time instead of the file size during your poll.
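The mtime-based check suggested in a couple of the answers could be sketched as below. The function names and the 30-second default are assumptions; the quiet period is the "allowance" for stalled copies and has to be tuned for your network.

```python
import os
import time


def seconds_since_modified(path):
    """Return how long ago (in seconds) `path` was last modified."""
    return time.time() - os.stat(path).st_mtime


def looks_finished(path, quiet_seconds=30.0):
    """Heuristic: treat the file as complete if nobody touched it recently."""
    return seconds_since_modified(path) >= quiet_seconds
```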

Does python have hooks into EXT3

We have several cron jobs that ftp proxy logs to a centralized server. These files can be rather large and take some time to transfer. Part of the requirement of this project is to provide a logging mechanism in which we log the success or failure of these transfers. This is simple enough.
My question is, is there a way to check if a file is currently being written to? My first solution was to just check the file size twice within a given timeframe and compare the results. But a co-worker said that it may be possible to hook into the EXT3 file system via Python and check the attributes to see if the file is currently being appended to. My Google-Fu came up empty.
Is there a module for EXT3 or something else that would allow me to check the state of a file? The server is running Fedora Core 9 with EXT3 file system.
No need for ext3-specific hooks; just check lsof, or more exactly, /proc/<pid>/fd/* and /proc/<pid>/fdinfo/* (that's where lsof gets its info, AFAICT). There you can check whether the file is open, whether it's writable, and the 'cursor' position.
That's not the whole picture, but anything more happens in process space, in the stdlib of the writing process: most writes are buffered and the kernel only sees bigger chunks of data, so an 'ext3-aware' monitor wouldn't see that either.
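Scanning /proc along those lines might look like this. It is a Linux-only sketch with an invented function name; it can only see processes whose /proc entries you are allowed to read (your own, or everything when run as root).

```python
import os


def writers_of(path):
    """Return PIDs that currently hold `path` open for writing (Linux only)."""
    target = os.path.realpath(path)
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join("/proc", entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or it belongs to another user
        for fd in fds:
            try:
                # Each entry in /proc/<pid>/fd is a symlink to the open file.
                if os.readlink(os.path.join(fd_dir, fd)) != target:
                    continue
                with open(os.path.join("/proc", entry, "fdinfo", fd)) as info:
                    fields = dict(line.split(":", 1) for line in info if ":" in line)
            except OSError:
                continue  # descriptor was closed while we were looking
            # "flags" holds the open(2) flags in octal; the low bits are the access mode.
            flags = int(fields.get("flags", "0").strip(), 8)
            if flags & os.O_ACCMODE != os.O_RDONLY:
                pids.append(int(entry))
                break
    return pids
```

An empty result suggests that no visible process is still writing the file, though a buffered writer may simply not have opened it yet or may reopen it later.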
There are no ext3 hooks to check what you want directly.
I suppose you could dig through the source code of the Linux fuser command, replicate the part that finds which process has a file open, and watch that. When no one has the file open any longer, it's done transferring.
Another approach:
Your cron jobs should tell that they're finished.
We have our cron jobs that transport files write an empty filename.finished once filename has been transferred. Another approach is to transfer to a temporary filename, e.g. filename.part, and then rename it to filename. Renaming is atomic. In both cases, you poll until filename or filename.finished appears.
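Both sides of the `.finished` convention can be sketched in a few lines. The helper names, the polling interval and the timeout are assumptions made for this example, not part of the setup described above.

```python
import os
import time


def mark_finished(path, suffix=".finished"):
    """Sender side: create the empty marker after `path` is fully transferred."""
    open(path + suffix, "w").close()


def wait_for_marker(path, suffix=".finished", poll_interval=1.0, timeout=3600.0):
    """Receiver side: poll until the marker for `path` appears.

    Returns True when the marker exists, False if `timeout` seconds pass first.
    """
    marker = path + suffix
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(marker):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

The receiver can log success when `wait_for_marker()` returns True and failure when it times out, which covers the logging requirement from the question.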
