Does Python have hooks into EXT3?

We have several cron jobs that ftp proxy logs to a centralized server. These files can be rather large and take some time to transfer. Part of the requirement of this project is to provide a logging mechanism in which we log the success or failure of these transfers. This is simple enough.
My question is, is there a way to check whether a file is currently being written to? My first solution was simply to check the file size twice within a given time frame and compare. But a co-worker suggested it might be possible to hook into the EXT3 file system via Python and check the file's attributes to see whether it is currently being appended to. My Google-fu came up empty.
Is there a module for EXT3 or something else that would allow me to check the state of a file? The server is running Fedora Core 9 with EXT3 file system.

No need for ext3-specific hooks; just check lsof, or more exactly /proc/<pid>/fd/* and /proc/<pid>/fdinfo/* (which is where lsof gets its info, AFAICT). There you can check whether the file is open, whether it's writable, and the 'cursor' position.
That's not the whole picture, but anything beyond that happens in process space: the writing process buffers most writes in its standard library, so the kernel only sees bigger chunks of data, and any 'ext3-aware' monitor wouldn't see them either.
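A minimal sketch of that /proc scan in Python (the function name is my own; it needs permission to read other users' fd directories, and checking the write flag via fdinfo is left out):

import os

def pids_with_open(path):
    # Scan /proc/<pid>/fd/* for symlinks that resolve to `path`.
    # Reading the fd directories of processes you don't own
    # typically requires root.
    target = os.path.realpath(path)
    pids = []
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue
        fd_dir = os.path.join('/proc', pid, 'fd')
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or permission denied
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    pids.append(int(pid))
                    break
            except OSError:
                continue  # fd closed while we were looking
    return pids

An empty result suggests no process currently has the file open, so the transfer is most likely finished.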

There are no ext3 hooks to check what you want directly.
I suppose you could dig through the source code of the fuser Linux command, replicate the part that finds which processes have a file open, and watch that resource. When no one has the file open any longer, the transfer is done.
Another approach: have your cron jobs signal that they're finished.
Our cron jobs that transport files write an empty filename.finished marker after filename itself has been transferred. Another approach is to transfer to a temporary name, e.g. filename.part, and then rename it to filename; renaming is atomic. In both cases you check repeatedly for the presence of filename or filename.finished, as in the sketch below.
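A minimal sketch of that polling loop, assuming the sender writes an empty filename.finished marker (the function name and poll interval are my own):

import os
import time

def wait_for_transfer(path, poll_interval=5):
    # Block until the sender's empty "<path>.finished" marker appears,
    # which signals that <path> itself has been fully transferred.
    marker = path + '.finished'
    while not os.path.exists(marker):
        time.sleep(poll_interval)
    return path

With the filename.part/rename variant you instead just wait for filename itself to exist, since the atomic rename guarantees it is complete once visible.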


How to detect files in a directory if the files have finished copying/adding? [duplicate]

Files are being pushed to my server via FTP. I process them with PHP code in a Drupal module. O/S is Ubuntu and the FTP server is vsftp.
At regular intervals I will check for new files, process them with SimpleXML and move them to a "Done" folder. How do I avoid processing a partially uploaded file?
vsftpd has lock_upload_files defaulted to yes. I thought of attempting to move the files first, expecting the move to fail on a file that is still uploading. That doesn't seem to happen, at least on the command line: if I start uploading a large file and move it, it just keeps growing in the new location. I guess the directory entry is not locked.
Should I try fopen with mode 'a' or 'r+' just to see if it succeeds before attempting to load into SimpleXML or is there a better way to do this? I guess I could just detect SimpleXML load failing but... that seems messy.
I don't have control of the sender. They won't do an upload and rename.
Thanks
Using the lock_upload_files configuration option of vsftpd leads to files being locked with the fcntl() function. This places advisory locks on uploads that are in progress. Other programs are not required to honour advisory locks, and mv, for example, does not; advisory locks are in general just advice for programs that care about such locks.
You need another command line tool like lockrun which respects advisory locks.
Note: lockrun must be compiled with the WAIT_AND_LOCK(fd) macro using the lockf() function rather than flock() in order to work with locks set by fcntl() under Linux. When lockrun is compiled to use lockf(), it will cooperate with the locks set by vsftpd.
With these pieces (lockrun, mv, lock_upload_files) you can build a shell script or similar that moves files one by one, checking whether each file is locked beforehand and holding an advisory lock on it while it is moved. If a file is locked by vsftpd, lockrun can skip the call to mv so that in-progress uploads are skipped.
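For completeness, here is a Python sketch of the same lock probe (the function name is mine). fcntl.lockf() takes fcntl()-style POSIX locks on Linux, the same kind vsftpd sets with lock_upload_files:

import errno
import fcntl

def upload_in_progress(path):
    # Try to take a non-blocking fcntl() lock; failing with
    # EAGAIN/EACCES means vsftpd still holds its upload lock.
    # Requires write access, since exclusive locks need a
    # writable descriptor.
    with open(path, 'rb+') as f:
        try:
            fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except IOError as e:
            if e.errno in (errno.EAGAIN, errno.EACCES):
                return True
            raise
        fcntl.lockf(f, fcntl.LOCK_UN)
        return False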
If locking doesn't work, I don't know of a solution as clean/simple as you'd like. You could make an educated guess by not processing files whose last modified time (which you can get with filemtime()) is within the past x minutes.
If you want a higher degree of confidence than that, you could check and store each file's size (using filesize()) in a simple database, and every x minutes compare the new size against the stored one. If the size hasn't changed in x minutes, you can assume nothing more is being sent.
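A sketch of that size heuristic (names and the interval are my own; PHP's filesize() plays the same role as os.path.getsize() here):

import os
import time

def size_is_stable(path, wait_seconds=60):
    # Record the size, wait, and compare: an unchanged size after
    # wait_seconds suggests nothing more is being sent.
    before = os.path.getsize(path)
    time.sleep(wait_seconds)
    return os.path.getsize(path) == before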
The lsof Linux command lists open files on your system. I suggest executing it with shell_exec() from PHP and parsing the output to see which files are still being used by your FTP server.
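A hedged sketch of that approach, shelling out from Python (the PHP shell_exec() version is analogous). lsof exits with status 0 when at least one process has the file open and non-zero otherwise:

import os
import subprocess

def file_in_use(path):
    # lsof prints the processes holding `path` open and exits 0;
    # a non-zero exit status means no process has it open.
    with open(os.devnull, 'wb') as devnull:
        return subprocess.call(['lsof', path],
                               stdout=devnull, stderr=devnull) == 0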
Picking up on the size-comparison idea above, you could copy the file and then compare the sizes of the copy and the original at a fixed interval. If the sizes match, the upload is done: delete the copy and work with the file. If the sizes do not match, copy the file again and repeat.
Here's another idea: create a super (but hopefully not root) FTP user that can access some or all of the upload directories. Instead of having your PHP code read uploaded files right off the disk, make it connect to the local FTP server and download the files. This way vsftpd handles the locking for you (assuming you leave lock_upload_files enabled). You'll only be able to download a file once vsftpd releases the exclusive/write lock, i.e. once writing is complete.
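A sketch of that reader-account idea using Python's ftplib (the host, credentials, and function name are hypothetical; it assumes, per this answer, that vsftpd won't serve a file while its upload lock is held):

import ftplib

def fetch_completed_upload(name, local_path):
    # Download through the local FTP server instead of reading the
    # spool directory directly, so vsftpd's locking applies.
    ftp = ftplib.FTP('localhost')
    ftp.login('reader', 'secret')  # hypothetical dedicated account
    with open(local_path, 'wb') as out:
        ftp.retrbinary('RETR ' + name, out.write)
    ftp.quit()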
You mentioned trying flock in your comment (and how it fails). It does indeed seem painful to try to match whatever locking vsftpd is doing, but dio_fcntl might be worth a shot.
I guess you solved your problem years ago, but still: if you use some pattern to find the files you need, you can ask the party uploading the files to use a different name and rename each file once its upload has completed.
You should also check out the HiddenStores directive in ProFTPD; more info here:
http://www.proftpd.org/docs/directives/linked/config_ref_HiddenStores.html

Python: ensure file consistency

I have a Python 2.7.x process running in an infinite loop that monitors a folder on an Ubuntu server.
Whenever it finds a file, it checks the file against a set of known files that have been processed already, and acts accordingly. In pseudocode:
found = set()
while True:
    for file in all_files("<DIR>"):
        if file not in found:
            process_file(file, found)
How can I make sure that the file hasn't just begun being copied there? I wouldn't want to, say, take the MD5 sum of the file or open it with another process until I'm sure it's all there and ready.
The safest solution is to use the Linux kernel's inotify API via the pyinotify library. Experiment with the IN_CREATE and IN_MOVED_TO events depending on your needs. Also note this blog post warning of some implementation problems with the pyinotify library.
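A minimal pyinotify sketch (the watched path is a placeholder; I use IN_CLOSE_WRITE, which fires when a writable handle is closed, alongside the IN_MOVED_TO event the answer mentions):

import pyinotify

class Handler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # A writable file handle was closed: the copy is (probably) done.
        print('finished writing: %s' % event.pathname)

    def process_IN_MOVED_TO(self, event):
        # A file was renamed/moved into the watched directory.
        print('moved into place: %s' % event.pathname)

wm = pyinotify.WatchManager()
wm.add_watch('/path/to/watched/dir',
             pyinotify.IN_CLOSE_WRITE | pyinotify.IN_MOVED_TO)
pyinotify.Notifier(wm, Handler()).loop()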
Due to locks and other system-level operations, you may not be able to do anything with the file until it has finished copying; in that case the file cannot be in two operations at once.

Using Python, how do I close a file in use by another user over a network?

I have a program that creates a bunch of movie files. It runs as a cron job, and every time it runs, the movies from the previous iteration are moved to a 'previous' folder so that there is always a previous version to look at.
These movie files are accessed across a network by various users and that's where I'm running into a problem.
When the script runs and tries to move the files, it throws a resource-busy error because the files are open by various users. Is there a way in Python to force these files to close before I attempt to move them?
Further clarification:
JMax is correct when he mentions that this is a server-level problem. I can access our Windows server through Administrative Tools > Computer Management > Shared Folders > Open Files and manually close the files there, but I am wondering whether there is a Python equivalent which will achieve the same result.
Something like this:
import shutil

try:
    shutil.move(src, dst)
except OSError:
    # Close the src file on all machines that are currently
    # accessing it, then try again.
    pass
This question has nothing to do with Python, and everything to do with the particular operating system and file system you're using. Could you please provide these details?
At least on Windows, you can use Sysinternals Handle to force a particular handle to a file to be closed. Especially as this file is opened by another user over a network, this operation is extremely destabilising and will probably render the network connection subsequently useless. You're looking for the "-c" command-line argument, for which the documentation reads:
Closes the specified handle (interpreted as a hexadecimal number). You
must specify the process by its PID.
WARNING: Closing handles can cause application or system instability.
And if you're force-closing a file mounted over Samba in Linux, speaking from experience this is an excruciating experience in futility. However, others have tried with mixed success; see Force a Samba process to close a file.
As far as I know, you have to end the processes that are accessing the file, at least on Windows.
Doesn't the .close() method work on your file object?
See Dive Into Python for more information on file objects.
[EDIT] I've re-read your question. Your problem is that users open the same file over the network and you want them to close it? But do you have access to their OS?
[EDIT2] The problem is more at the server level: disconnecting the user that is accessing the file. See this example for Windows servers.

Which inotify event signals the completion of a large file operation?

For large files or slow connections, copying files may take some time.
Using pyinotify, I have been watching for the IN_CREATE event code, but this seems to occur at the start of a file transfer. I need to know when a file is completely copied; it ain't much use if it's only half there.
When a file transfer is finished and complete, what inotify event is fired?
IN_CLOSE probably means the write is complete. This isn't guaranteed, since some applications are bad actors and open and close files constantly while working with them, but if you know the app you're dealing with (file transfer, etc.) and understand its behaviour, you're probably fine. (Note, this doesn't mean the transfer completed successfully, obviously; it just means that the process that opened the file handle closed it.)
IN_CLOSE catches both IN_CLOSE_WRITE and IN_CLOSE_NOWRITE, so make your own decision about whether you want to catch just one of those. (You probably want both; WRITE/NOWRITE reflect whether the file was opened writable, not whether any writes were actually made.)
There is more documentation (although annoyingly, not this piece of information) in Documentation/filesystems/inotify.txt.
In my case I wanted to execute a script after a file was fully uploaded. I was using WinSCP, which writes large files with a .filepart extension until done.
I first modified my script to ignore files ending in .filepart, or files that have a sibling in the same directory with the same name plus a .filepart extension, since that means the upload is not fully complete yet.
But then I noticed that at the end of the upload, when all the parts are finished, an IN_MOVED_TO notification is triggered, which let me run my script exactly when I wanted.
If you want to know how your file uploader behaves, add this to the incrontab:
/your/directory/ IN_ALL_EVENTS echo "$$ $@ $# $% $&"
and then
tail -F /var/log/cron
and monitor all the events getting triggered to find out which one suits you best.
Good luck!
Why don't you add a dummy file at the end of the transfer? You can watch for the IN_CLOSE or IN_CREATE event on the dummy. The important thing is that the dummy has to be transferred as the last file in the sequence.
I hope it'll help.

How do I watch a folder for changes and when changes are done using Python?

I need to watch a folder for incoming files. I did that with the help of the following:
How do I watch a file for changes?
The problem is that the files being moved are pretty big (10 GB)
and I want to be notified when all files are done moving.
I tried comparing the size of the folder every 20 seconds, but the file shows its correct size even though Windows shows that it is still moving.
I am using Windows with Python.
I found a solution using open() and waiting for an IO exception:
if the file is still being moved, I get errno 13.
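A sketch of that probe (the function name is mine; on Windows, opening a file that another process still holds typically fails with errno 13, EACCES):

import errno

def done_moving(path):
    # Try to open the file for read/write; failure with errno 13
    # means the move/copy still has it open.
    try:
        f = open(path, 'rb+')
    except IOError as e:
        if e.errno == errno.EACCES:
            return False
        raise
    f.close()
    return True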
You should take a look at this link:
http://timgolden.me.uk/python/win32_how_do_i/watch_directory_for_changes.html
There you can see a comparison of the method you are speaking about (simple polling) with two other Windows-specific techniques which, in my opinion, offer a much better solution to your problem!
Otherwise, if you are using Linux, there's inotify and its Python wrapper:
Pyinotify is a pure Python module used for monitoring filesystem events on Linux platforms through inotify
Here: http://trac.dbzteam.org/pyinotify
If you have control over the process importing the files, I would write a lock file when starting to copy files in and remove it when done. By 'lock file' I mean an empty temp file which is just there to indicate that you are copying a file. Then your Python script can check for the existence of the lock file.
You may be able to use os.stat() to monitor the mtime of the file. However, be aware that under various network conditions the copy may stall momentarily, leaving the mtime unchanged for a few seconds, so you need to make allowance for this, as in the sketch below.
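A sketch of that mtime watch (names and timings are my own; the quiet period absorbs momentary stalls):

import os
import time

def wait_until_quiet(path, quiet_seconds=30, poll_seconds=5):
    # Consider the file finished once its mtime has stopped changing
    # for quiet_seconds; a stalled copy resets the timer when it resumes.
    last = os.stat(path).st_mtime
    since = time.time()
    while time.time() - since < quiet_seconds:
        time.sleep(poll_seconds)
        mtime = os.stat(path).st_mtime
        if mtime != last:
            last = mtime
            since = time.time()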
Another option is to try opening the file with exclusive read/write access, which should fail under Windows if the file is still open in another process.
The most reliable method would be to write your own program to move the files.
Try checking for a change in the last-modified time instead of the file size during your poll.
