I have a Python 2.7.x process running in an infinite loop that monitors a folder on an Ubuntu server.
Whenever it finds a file, it checks the file against a set of known files that have been processed already, and acts accordingly. In pseudocode:
found = set()
while True:
    for file in all_files("<DIR>"):
        if file not in found:
            process_file(file, found)
How can I make sure the file hasn't just begun being copied there? I wouldn't want to, say, take the MD5 sum of the file or open it with another process until I'm sure it's all there and ready.
The safest solution is to use the Linux kernel's inotify API via the pyinotify library. Experiment with the IN_CREATE and IN_MOVED_TO events depending on your needs. Also note this blog post warning of some implementation problems with the pyinotify library.
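A minimal pyinotify sketch of that approach; process_file() and "<DIR>" are the names from the question's pseudocode, so adjust them to your code:

    import pyinotify

    found = set()  # set of already-processed paths, as in the pseudocode above

    class Handler(pyinotify.ProcessEvent):
        def process_IN_MOVED_TO(self, event):
            # fired when a file is moved (renamed) into the watched directory,
            # so at this point the file is already complete
            if event.pathname not in found:
                process_file(event.pathname, found)

        def process_IN_CREATE(self, event):
            # fired as soon as the file is created, i.e. possibly before the
            # copy has finished -- useful for bookkeeping, not for processing
            pass

    wm = pyinotify.WatchManager()
    wm.add_watch("<DIR>", pyinotify.IN_CREATE | pyinotify.IN_MOVED_TO)
    notifier = pyinotify.Notifier(wm, Handler())
    notifier.loop()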
Due to locks and other system-level operations, you will not be able to do anything to the file until it has completed copying.
A file cannot be in two operations at once.
Related
Files are being pushed to my server via FTP. I process them with PHP code in a Drupal module. O/S is Ubuntu and the FTP server is vsftp.
At regular intervals I will check for new files, process them with SimpleXML and move them to a "Done" folder. How do I avoid processing a partially uploaded file?
vsftp has lock_upload_files defaulted to yes. I thought of attempting to move the files first, expecting the move to fail on a currently uploading file. That doesn't seem to happen, at least on the command line. If I start uploading a large file and move it, it just keeps growing in the new location. I guess the directory entry is not locked.
Should I try fopen with mode 'a' or 'r+' just to see if it succeeds before attempting to load into SimpleXML, or is there a better way to do this? I guess I could just detect SimpleXML load failing but... that seems messy.
I don't have control of the sender. They won't do an upload and rename.
Thanks
Using the lock_upload_files configuration option of vsftpd locks files with the fcntl() function. This places advisory locks on uploaded files that are still in progress. Other programs are not obliged to honour advisory locks, and mv, for example, does not: advisory locks are just a hint for programs that choose to check them.
You need another command-line tool such as lockrun, which respects advisory locks.
Note: lockrun must be compiled with the WAIT_AND_LOCK(fd) macro using lockf() rather than flock(), because only lockf() interoperates with locks set by fcntl() under Linux. So when lockrun is compiled to use lockf(), it will cooperate with the locks set by vsftpd.
With these pieces (lockrun, mv, lock_upload_files) you can build a shell script or similar that moves files one by one, checking whether each file is locked beforehand and holding an advisory lock on it while it is moved. If a file is still locked by vsftpd, lockrun skips the call to mv, so in-progress uploads are left alone.
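If it helps, here is a rough Python sketch of the same check (illustrative only, since the thread itself is PHP and shell): fcntl.lockf() is a wrapper around fcntl() locking on Linux, so a non-blocking lock attempt will fail while vsftpd still holds its lock. The paths and destination directory are placeholders.

    import errno
    import fcntl
    import os
    import shutil

    def move_if_unlocked(path, dest_dir):
        """Move path into dest_dir unless the uploader still holds its fcntl() lock."""
        f = open(path, "rb+")
        try:
            try:
                # non-blocking attempt; raises IOError/OSError while the upload
                # still holds the advisory lock set via fcntl()
                fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except (IOError, OSError) as exc:
                if exc.errno in (errno.EACCES, errno.EAGAIN):
                    return False  # still being uploaded, skip for now
                raise
            shutil.move(path, os.path.join(dest_dir, os.path.basename(path)))
            return True
        finally:
            f.close()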
If locking doesn't work, I don't know of a solution as clean/simple as you'd like. You could make an educated guess by not processing files whose last modified time (which you can get with filemtime()) is within the past x minutes.
If you want a higher degree of confidence than that, you could check and store each file's size (using filesize()) in a simple database, and every x minutes check new size against its old size. If the size hasn't changed in x minutes, you can assume nothing more is being sent.
The lsof linux command lists opened files on your system. I suggest executing it with shell_exec() from PHP and parsing the output to see what files are still being used by your FTP server.
Picking up on the previous answer, you could copy the file over and then compare the sizes of the copied file and the original file at a fixed interval.
If the sizes match, the upload is done: delete the copy and work with the file.
If the sizes do not match, copy the file again and repeat.
Here's another idea: create a super (but hopefully not root) FTP user that can access some or all of the upload directories. Instead of your PHP code reading uploaded files right off the disk, make it connect to the local FTP server and download files. This way vsftpd handles the locking for you (assuming you leave lock_upload_files enabled). You'll only be able to download a file once vsftp releases the exclusive/write lock (once writing is complete).
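A rough Python equivalent of that idea (purely illustrative, since the module in question is PHP): log in to the local FTP server with ftplib and download the file instead of reading it off the disk. The host name, credentials and file names are placeholders.

    from ftplib import FTP, error_perm

    def fetch_via_ftp(name, local_path):
        """Fetch an uploaded file through the local FTP server instead of off the disk."""
        ftp = FTP("localhost")
        ftp.login("processing-user", "secret")   # placeholder credentials
        try:
            with open(local_path, "wb") as out:
                ftp.retrbinary("RETR " + name, out.write)
            return True
        except error_perm:
            # the server refused the download, e.g. because the upload
            # still holds its write lock
            return False
        finally:
            ftp.close()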
You mentioned trying flock in your comment (and how it fails). It does indeed seem painful to try to match whatever locking vsftpd is doing, but dio_fcntl might be worth a shot.
I guess you solved your problem years ago, but still:
If you use some pattern to find the files you need, you can ask the party uploading the file to use a different name and rename the file once the upload has completed.
You should check out the HiddenStores directive in ProFTPD; more info here:
http://www.proftpd.org/docs/directives/linked/config_ref_HiddenStores.html
I have a Python script that checks on a pickup directory and processes any files that it finds, and then deletes them.
How can I make sure not to pickup a file that is still being written by the process that drops files in that directory?
My test case is pretty simple. I copy-paste 300 MB of files into the pickup directory, and frequently the script will grab a file that's still being written. It operates on only the partial file, then deletes it. This fires off a file operation error in the OS, as the file it was writing to has disappeared.
I've tried acquiring a lock on the file (using the FileLock module) before I open/process/delete it. But that hasn't helped.
I've considered checking the modification time on the file to avoid anything within X seconds of now. But that seems clunky.
My test is on OSX, but I'm trying to find a solution that will work across the major platforms.
I see a similar question here (How to check if a file is still being written?), but there was no clear solution.
Thank you
As a workaround, you could listen for file modified events (watchdog is cross-platform). The modified event (on OS X at least) isn't fired for each write; it's only fired on close. So when you detect a modified event you can assume all writes are complete.
Of course, if the file is being written in chunks and saved after each chunk, this won't work.
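A small watchdog sketch of that approach; process_file() stands in for your own processing/deleting logic, and the pickup path is a placeholder:

    import time
    from watchdog.observers import Observer
    from watchdog.events import FileSystemEventHandler

    class DoneHandler(FileSystemEventHandler):
        def on_modified(self, event):
            # on OS X this tends to fire on close rather than on every write,
            # as described above
            if not event.is_directory:
                process_file(event.src_path)

    observer = Observer()
    observer.schedule(DoneHandler(), "/path/to/pickup", recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()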
One solution to this problem would be to change the program writing the files so that it writes to a temporary file first, and then moves that temporary file to the destination when it is done. On most operating systems, when the source and destination are on the same file system, the move is atomic.
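For the writer side, the pattern looks roughly like this (file names are placeholders); os.rename() is atomic on POSIX when source and destination are on the same file system:

    import os

    def write_atomically(dest_path, data):
        """Write to a temporary name in the same directory, then rename into place."""
        tmp_path = dest_path + ".part"      # same directory => same file system
        with open(tmp_path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())            # make sure the bytes are on disk
        os.rename(tmp_path, dest_path)      # atomic; readers never see a partial file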
If you have no control over the writing portion, about all you can do is watch the file yourself, and when it stops growing for a certain amount of time, call it good. I have to use that method myself, and found 40 seconds is safe for my conditions.
Each OS will have a different solution, because file locking mechanisms are not portable.
On Windows, you can use OS locking.
On Linux you can have a peek at open files (similarly to how lsof does) and, if the file is open, leave it alone.
Have you tried opening the file before copying it? If the file is still in use, then open() should throw an exception.
try:
    with open(filename, "rb") as fp:
        pass
    # Copy the file
except IOError:
    # Don't copy; the file is still in use
    pass
In Linux, how can we know whether a file has finished copying before reading it? (On Windows, an OSError is raised in this situation.)
You can use the inotify mechanisms (via pyinotify) to catch events like CREATE, WRITE and CLOSE, and based on them you can decide whether the copy has finished or not.
However, since you provided no details on what you are trying to do, I can't tell whether inotify would be suitable for you (by the way, inotify is Linux-specific, so you can't use it on Windows or other platforms).
In Linux, you can open a file while another process is writing to it without Python throwing an OSError, so in general, you cannot know for sure whether the other side has finished writing into that file. You can try some hacks, though:
You can check the file size regularly to see whether it increased since the last check. If it hasn't increased in, say, five seconds, you might be safe to assume that the copy has finished. I'm saying might since this is not true in all circumstances. If the other process that is writing the file is blocked for whatever reason, it might temporarily stop writing to the file and resume it later. So this is not 100% fool-proof, but might work for local file copies if the system is never under a heavy load that would stall the writing process.
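A simple sketch of that size-polling idea (the five-second quiet window is just an example, as above):

    import os
    import time

    def wait_until_stable(path, quiet_seconds=5, poll_interval=1):
        """Block until the file size stops changing for quiet_seconds."""
        last_size = -1
        stable_since = time.time()
        while True:
            size = os.path.getsize(path)
            if size != last_size:
                last_size = size
                stable_since = time.time()
            elif time.time() - stable_since >= quiet_seconds:
                return
            time.sleep(poll_interval)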
You can check the output of fuser (a shell command), which lists the process IDs of all processes holding a file handle to a given file name. If this list includes any process other than yours, you can assume that the copying process hasn't finished yet. However, you will have to make sure that fuser is installed on the target system for this to work.
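For example (a sketch; it assumes the fuser binary is installed and on the PATH):

    import os
    import subprocess

    def file_in_use(path):
        """Return True if fuser reports at least one process holding the file open."""
        with open(os.devnull, "wb") as devnull:
            # fuser exits with status 0 when some process has the file open,
            # and non-zero otherwise; subprocess raises OSError if fuser is missing
            return subprocess.call(["fuser", path],
                                   stdout=devnull, stderr=devnull) == 0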
I need to watch a folder for incoming files. I did that with the following help:
How do I watch a file for changes?
The problem is that the files being moved are pretty big (10 GB), and I want to be notified when all files are done moving.
I tried comparing the size of the folder every 20 seconds, but the file shows its correct size even though Windows shows that it is still moving.
I am using Windows with Python.
I found a solution using open() and waiting for an IO exception.
If the file is still being moved, I get errno 13.
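In code, the check looks roughly like this (a sketch; the open mode and the errno 13 / EACCES value reflect the behaviour described above on Windows):

    import errno

    def ready_for_processing(path):
        """Try to open the file for writing; while another process still has it
        open for the move, this fails on Windows with errno 13 (permission denied)."""
        try:
            f = open(path, "ab")
        except IOError as exc:
            if exc.errno == errno.EACCES:   # errno 13: still being moved/written
                return False
            raise
        f.close()
        return True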
You should take a look at this link:
http://timgolden.me.uk/python/win32_how_do_i/watch_directory_for_changes.html
There you can see a comparison of the method you are speaking about (simple polling) with two other Windows-specific techniques which, in my opinion, offer a much better solution to your problem!
Otherwise, if you are using Linux, there's inotify and its Python wrapper:
Pyinotify is a pure Python module used for monitoring filesystem events on Linux platforms through inotify
Here: http://trac.dbzteam.org/pyinotify
If you have control over the process that imports the files, I would put a lock file in place when starting to copy files in, and remove it when done. By lock file I mean a temporary empty file which is just there to indicate that you are copying a file. Then your Python script can check for the existence of the lock files.
You may be able to use os.stat() to monitor the mtime of the file. However, be aware that under various network conditions the copy may stall momentarily, so the mtime may not be updated for a few seconds; you need to make allowance for this.
Another option is to try opening the file with exclusive read/write access, which should fail under Windows if the file is still open in the other process.
The most reliable method would be to write your own program to move the files.
Try checking for a change in the last-modified time instead of the file size during your poll.
We have several cron jobs that ftp proxy logs to a centralized server. These files can be rather large and take some time to transfer. Part of the requirement of this project is to provide a logging mechanism in which we log the success or failure of these transfers. This is simple enough.
My question is, is there a way to check if a file is currently being written to? My first solution was to just check the file size twice within a given timeframe and compare. But a co-worker said that it might be possible to hook into the ext3 file system via Python and check the attributes to see if the file is currently being appended to. My Google-Fu came up empty.
Is there a module for EXT3 or something else that would allow me to check the state of a file? The server is running Fedora Core 9 with EXT3 file system.
No need for ext3-specific hooks; just check lsof, or more exactly /proc/<pid>/fd/* and /proc/<pid>/fdinfo/* (that's where lsof gets its info, AFAICT). There you can check whether the file is open, whether it's writable, and the 'cursor' position.
That's not the whole picture, though; anything beyond that happens in process space in the writing process's standard library, since most writes are buffered and the kernel only sees bigger chunks of data, so any 'ext3-aware' monitor wouldn't see that either.
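A sketch of that /proc-based check (Linux only; it needs permission to read the relevant /proc/<pid>/fd entries):

    import os

    def pids_holding(path):
        """Return the PIDs of processes that have path open, via /proc/<pid>/fd/*."""
        target = os.path.realpath(path)
        holders = []
        for pid in os.listdir("/proc"):
            if not pid.isdigit():
                continue
            fd_dir = os.path.join("/proc", pid, "fd")
            try:
                fds = os.listdir(fd_dir)
            except OSError:
                continue                     # process exited or access denied
            for fd in fds:
                try:
                    link = os.readlink(os.path.join(fd_dir, fd))
                except OSError:
                    continue                 # fd closed in the meantime
                if os.path.realpath(link) == target:
                    holders.append(int(pid))
                    break
        return holders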
There are no ext3 hooks to check what you want directly.
I suppose you could dig through the source code of the fuser Linux command, replicate the part that finds which process owns a file, and watch that resource. When no one has the file open any longer, it's done transferring.
Another approach:
Your cron jobs should signal that they're finished.
We have our cron jobs that transport files write an empty filename.finished once filename has been transferred. Another approach is to transfer to a temporary name, e.g. filename.part, and then rename it to filename; renaming is atomic. In both cases you check repeatedly until filename or filename.finished appears.
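On the receiving side, a tiny sketch of the marker-file check (names follow the convention above):

    import glob
    import os

    def finished_logs(directory):
        """Return logs that have a companion .finished marker, per the convention above."""
        done = []
        for marker in glob.glob(os.path.join(directory, "*.finished")):
            log = marker[:-len(".finished")]
            if os.path.exists(log):
                done.append(log)
        return done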